Bilingual Data Mining for the
  English-Amharic Statistical
 Machine Translation (EASMT)

                 Mulu Gebreegziabher
Addis Ababa, Ethiopia: IT Doctoral Program, Addis Ababa University
                 Prof. Laurent Besacier
      Grenoble, France: University Joseph Fourier
         Dr. Girma Taye & Dr. Dereje Teferi
         Addis Ababa, Ethiopia: Addis Ababa University
                    December 2, 2011
Presentation Outline
•   Introduction
•   Objectives
•   Experiment on the English-Amharic bilingual corpus
•   ENA English-Amharic parallel news corpus
•   Parliamentary English-Amharic parallel proclamation corpus
•   Sentence level aligned English-Amharic parallel corpora
•   Way Forward
Introduction
• MT is the application of computers to translate text from
  one natural language to another.

Machine Translation Systems
  – Machine Assisted Translation
      – Human Aided Translation
      – Machine Aided Translation
  – Fully Automated Translation
      – Empirical Systems
          – Statistical Machine Translation
          – Example-based Translation
      – Rule-based Systems
Introduction Contd…
•   SMT systems are data-driven: they rely on a bilingual,
    sentence-aligned parallel corpus.
•   The performance of an SMT system depends on the
    size of the available training corpus.
•   In general, the larger the corpus, the better the
    performance of the SMT system.
•   To develop EASMT, parallel data has to be collected
    as English-Amharic bilingual sentence pairs.
•   The experiment is to be conducted on a corpus of at
    least 2M word pairs (40K sentence pairs).
Introduction Contd…
English-Amharic Statistical Machine Translation (EASMT)
• Translation between two disparate languages
                       Amharic         English
 Language Family       Afro-Asiatic    Indo-European
 Morphology            Complex         Less inflected
 Syntactic Structure   SOV             SVO
 Writing System        Geez Alphabet   Latin Letters
Introduction Contd…
Parallel Corpus
• A parallel corpus is a collection of texts, each paired with
   its translation into another language.
• The experiment is conducted on a training corpus in
   both languages, built from expressions found in
   parallel Amharic-English news, parliamentary and
   constitutional documents.
• The parallel ENA news contains sentences of day-to-day
   usage, of two kinds:
  –   Direct translations of each other
  –   Indirect translations: texts written on the same topic in
      different languages, called comparable corpora.
Objectives


The objective of the research is to study and
develop an English-Amharic Statistical
Machine Translation (EASMT) system and to
improve the translation quality by integrating
linguistic knowledge into the system.
Experiment on the English-Amharic
          bilingual corpus
Mining the parallel corpus
• Processing a bilingual text corpus for use in an SMT system
  involves five steps (Besacier et al., 2009):
  – Raw data collection: proclamation and parallel
      news corpora have been collected
  – Document alignment: manual and automatic
  – Tokenization: splitting and trimming
  – Sentence splitting: done using the punctuation marks [ ? ! . ፡፡ ]
  – Sentence alignment: almost completed
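The tokenization and sentence-splitting steps above can be sketched in Python. This is a minimal illustration, not the presenters' tool: the function names are hypothetical, and it assumes the Amharic full stop may appear either as two Ethiopic wordspace characters (፡፡) or as the single character ።. It treats every period as a sentence end, so abbreviations and decimal numbers would need extra handling.

```python
import re

# Sentence-final marks listed on the slide (? ! . ፡፡), plus the single-character
# Amharic full stop ።, which is an assumption added for this sketch.
SENTENCE_END = re.compile(r"(፡፡|።|[?!.])")

def split_sentences(text):
    """Split text into sentences, keeping each end mark with its sentence."""
    # With a capturing group, re.split returns [body, mark, body, mark, ..., tail].
    parts = SENTENCE_END.split(text)
    sentences = []
    for i in range(0, len(parts) - 1, 2):
        sentence = (parts[i] + parts[i + 1]).strip()
        if sentence:
            sentences.append(sentence)
    tail = parts[-1].strip()  # text after the last end mark, if any
    if tail:
        sentences.append(tail)
    return sentences

def tokenize(sentence):
    """Separate end marks from the preceding word and trim extra whitespace."""
    spaced = SENTENCE_END.sub(r" \1 ", sentence)
    return spaced.split()

if __name__ == "__main__":
    for s in split_sentences("ሰላም ነው፡፡ How are you? Fine!"):
        print(tokenize(s))
```

Placing the two-character mark first in the pattern ensures ፡፡ is matched as one unit; a single ፡ (the Ethiopic word separator) is left inside the sentence.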
ENA English-Amharic parallel news corpus
 • News coverage: Aug 21, 2006 - January 06, 2008


     News Corpus        Category              Counts      Total
     Amharic            Domestic Language     10,116
                        Regional              13,655     23,771
     English            Foreign Language      11,276     11,276
     Amharic-English    Monitoring               494
                        Information            3,116      3,610

                       Table 1: ENA news corpus
ENA English-Amharic parallel news corpus
 • Count Summary: ENA news corpus

     Collected            Item            Amharic     English       Total
     Counts of Raw        Documents        23,771      11,276      35,047
                          Sentences       322,673     212,050     534,723
                          Words         5,277,711   3,704,644   8,982,355
                          Vocabularies    270,786     130,803     401,589
     Counts of Aligned    Documents         1,036       1,036       2,072
                          Sentences        26,112      25,834      51,946
                          Words           207,200     198,461     405,661
                          Vocabularies     36,519      17,987      54,506

 Table 2: The status of English-Amharic parallel news corpus on May 25, 2011
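Counts of the kind reported in Table 2 can be computed with a few lines of Python. The sketch below is only an illustration under two assumptions: each document is already split into tokenized sentences, and "Vocabularies" means the number of distinct word forms. The function name `corpus_counts` is hypothetical.

```python
from collections import Counter

def corpus_counts(documents):
    """Count documents, sentences, running words and distinct word forms.

    documents: a list of documents, each a list of sentences,
    each sentence a list of token strings.
    """
    vocabulary = Counter()
    n_sentences = 0
    for document in documents:
        for sentence in document:
            n_sentences += 1
            vocabulary.update(sentence)
    return {
        "documents": len(documents),
        "sentences": n_sentences,
        "words": sum(vocabulary.values()),  # running words (tokens)
        "vocabulary": len(vocabulary),      # distinct word forms (types)
    }

if __name__ == "__main__":
    toy = [[["a", "b", "a"], ["c"]], [["a"]]]
    print(corpus_counts(toy))
```

Running the same function separately on the Amharic and English sides gives one column of the table per language.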
ENA English-Amharic parallel news corpus
 • Manual alignment at document level: Challenges
    – Easy: preprocessing, including exporting from the SQL
       database to Word and converting to Unicode using
       Zilla word to text converter
    – Time-consuming: aligning at document level is difficult
       because the files are stored in different folders with
       no structure, and the dates, punctuation and heading
       information differ between the two languages
       (parallel/comparable corpus)
    – Document-level alignment is done by looking at
       the heading and picking the news id from the folders
ENA English-Amharic parallel news corpus
 • A sample of the English-Amharic ENA news corpus was
    aligned automatically at document level
 • The aligner takes the following into consideration to
    align the news items:
     – Start from the English corpus (which constitutes 32% of
       the collected documents).
     – Match news items whose story language differs.
     – Limit the search to neighboring Amharic files, looking at 80
       files around the current file.
     – Use a scoring method that gives equal weight to all
       matching columns.
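The aligner's rules can be sketched as follows. This is a hedged illustration, not the presenters' program: it assumes each news item is a record with an integer `id` and some metadata columns, that "80 files around the current file" can be modeled as an id window, and that only the single best candidate is kept. The column names in the usage example (`date`, `place`) are hypothetical.

```python
from bisect import bisect_left, bisect_right

def score(english, amharic, columns):
    """Equal-weight score: one point for every column whose values match."""
    return sum(1 for c in columns if english.get(c) == amharic.get(c))

def align_documents(english_items, amharic_items, columns, window=80):
    """Pair each English item with the best-scoring Amharic item nearby.

    Items are dicts holding an integer "id" plus metadata columns.
    Only Amharic items whose id lies within `window` of the English id
    are considered. Returns {english_id: amharic_id}.
    """
    amharic_sorted = sorted(amharic_items, key=lambda item: item["id"])
    ids = [item["id"] for item in amharic_sorted]
    pairs = {}
    for en in english_items:  # start from the English side
        lo = bisect_left(ids, en["id"] - window)
        hi = bisect_right(ids, en["id"] + window)
        candidates = amharic_sorted[lo:hi]
        if not candidates:
            continue
        # On ties, max() keeps the first candidate, i.e. the lowest id.
        best = max(candidates, key=lambda am: score(en, am, columns))
        if score(en, best, columns) > 0:
            pairs[en["id"]] = best["id"]
    return pairs

if __name__ == "__main__":
    english = [{"id": 100, "date": "2007-01-05", "place": "Addis Ababa"}]
    amharic = [
        {"id": 95, "date": "2007-01-05", "place": "Adama"},
        {"id": 101, "date": "2007-01-05", "place": "Addis Ababa"},
        {"id": 300, "date": "2007-01-05", "place": "Addis Ababa"},
    ]
    print(align_documents(english, amharic, ["date", "place"]))
```

Restricting the candidates to a window keeps the search cheap and avoids pairing stories that are far apart in the archive, at the cost of missing translations filed outside the window.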
ENA English-Amharic parallel news corpus
 • The output result of the automatic aligner.

   Aligned Corpus              Counts    Cumulative    Proportion
   1-1                            383           383          0.37
   1-2                            155           538          0.52
   1-3                            498         1,036          1.00
   Total Exact Matches                          880          0.85
   Unique Amharic Corpus                        968          0.93
   Unique English Corpus                      1,036          1.00

        Table 4: Automatically Aligned English-Amharic Sample
                 ENA news items
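The cumulative counts and the final column of Table 4 follow from simple arithmetic on the per-category counts: each cumulative value is divided by the total of 1,036 sample documents. The short sketch below reproduces that calculation; the helper name `cumulative_shares` is hypothetical.

```python
from itertools import accumulate

def cumulative_shares(counts):
    """Return (label, count, cumulative, share) rows for an ordered mapping."""
    total = sum(counts.values())
    running = accumulate(counts.values())
    return [
        (label, count, cum, round(cum / total, 2))
        for (label, count), cum in zip(counts.items(), running)
    ]

if __name__ == "__main__":
    # Per-category counts taken from Table 4.
    for row in cumulative_shares({"1-1": 383, "1-2": 155, "1-3": 498}):
        print(row)
    # Shares of the other Table 4 rows against the same total.
    print(round(880 / 1036, 2), round(968 / 1036, 2))
```

The same helper applied to the Table 5 counts (2,928, 1,535 and 6,813) gives the shares reported there.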
ENA English-Amharic parallel news corpus

• Some of the sample English documents were aligned
  better with a document not seen in the manual alignment, e.g.
  – 41827 → 41791 (manual: 41827 → 41826)
• 85% of the automatic matches agree exactly with
  the manual alignment.
• The remaining 15% are new matches, which do not
  necessarily indicate errors.


      Table ENA: Aligned Sample English/Amharic News corpus
ENA English-Amharic parallel news corpus
 • The automatic alignment was then extended to all
    English-Amharic ENA news items

        Aligned Corpus            Counts    Cumulative    Proportion
        1-1                        2,928         2,928          0.26
        1-2                        1,535         4,463          0.40
        1-3                        6,813        11,276          1.00
        Unique Amharic Corpus                   10,487          0.93
        Unique English Corpus                   11,276          1.00


  Table 5: Automatically Aligned English-Amharic ENA news items
Parliamentary English-Amharic parallel
        proclamation corpus
• Proclamation coverage: Aug 21, 1995 - July 16, 2010
   Collected            Item            Amharic    English      Total
   Counts of Raw        Documents           632        632      1,264
   Counts of Aligned    Documents           115        115        230
                        Sentences        19,115     25,730     44,845
                        Words           219,430    283,578    503,008
                        Vocabularies     32,299     17,908     50,207


       Table 6: Aligned Parliamentary English-Amharic
                parallel proclamation corpus
Sentence level aligned English-Amharic
           parallel corpora
• The alignment process is the same for both the ENA
  news items and the proclamations.
• The alignment is done using a sentence aligner called
  Hunalign (similar to Gale and Church, 1993).
• Hunalign aligns bilingual text using sentence-length
  information, supplemented by a bilingual dictionary
  when one is available.
• An English-Amharic bilingual word list of 8,212 entries
  has been adopted and used as the dictionary
  (Armbruster, 2007).
• The aligner aligns each English sentence to Amharic as a
  0-1, 1-1 or 1-2 match.
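To make the idea of length-based alignment concrete, here is a simplified dynamic-programming sketch in the spirit of Gale and Church (1993). It is not Hunalign's algorithm: it uses character length only (no dictionary), a plain relative-length cost instead of a probabilistic one, and hypothetical penalty values. It allows the 0-1, 1-1 and 1-2 match types mentioned above.

```python
def length_align(src, tgt, skip_penalty=1.0, merge_penalty=0.2):
    """Align source sentences to target sentences by character length.

    Allowed matches: 1-1, 0-1 (a target sentence left unpaired) and
    1-2 (one source sentence paired with two target sentences).
    Returns a list of (source_indices, target_indices) tuples, or an
    empty list when no alignment with these match types exists.
    """
    def mismatch(src_chars, tgt_chars):
        # Relative length difference in [0, 1]; 0 means equal lengths.
        return abs(src_chars - tgt_chars) / max(src_chars + tgt_chars, 1)

    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            steps = []
            if j < m:                      # 0-1: skip one target sentence
                steps.append((0, 1, skip_penalty))
            if i < n and j < m:            # 1-1
                steps.append((1, 1, mismatch(len(src[i]), len(tgt[j]))))
            if i < n and j + 1 < m:        # 1-2: merge two target sentences
                merged = len(tgt[j]) + len(tgt[j + 1])
                steps.append((1, 2, merge_penalty + mismatch(len(src[i]), merged)))
            for di, dj, step_cost in steps:
                if cost[i][j] + step_cost < cost[i + di][j + dj]:
                    cost[i + di][j + dj] = cost[i][j] + step_cost
                    back[i + di][j + dj] = (di, dj)
    if cost[n][m] == inf:
        return []
    matches = []
    i, j = n, m
    while (i, j) != (0, 0):                # trace the cheapest path backwards
        di, dj = back[i][j]
        matches.append((list(range(i - di, i)), list(range(j - dj, j))))
        i, j = i - di, j - dj
    matches.reverse()
    return matches

if __name__ == "__main__":
    english = ["a" * 10, "b" * 20]
    amharic = ["x" * 10, "y" * 10, "z" * 10]
    print(length_align(english, amharic))
```

Because no 1-0 match is allowed, every source sentence must be paired; inputs with more source than target sentences therefore return an empty list in this sketch.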
Sentence level aligned English-Amharic
           parallel corpora
• The result of the alignment at the sentence level for
  both the ENA news and the proclamation

      Aligned Sentence Pairs          Counts
      ENA Corpus                     155,200
      Proclamation Corpus             18,632
      Total                          173,832

    Table 7: Sentence aligned English-Amharic bilingual corpus
Way Forward


• To increase the size of the English-Amharic
  proclamation corpus as much as possible.
• To further analyze the experiments conducted so far.
• To improve translation quality using
  morpho-syntactic linguistic knowledge.
Thank You!!!

Más contenido relacionado

Destacado

Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Lifeng (Aaron) Han
 
Google translate (new russian)
Google translate (new russian)Google translate (new russian)
Google translate (new russian)Nurbek Matzhani
 
8 Google Translate
8 Google Translate8 Google Translate
8 Google Translateaptwano
 
Google Translate in the Classroom
Google Translate in the ClassroomGoogle Translate in the Classroom
Google Translate in the Classroommarafaye
 
Amharic document clustering
Amharic document clusteringAmharic document clustering
Amharic document clusteringGuy De Pauw
 
Google Translate Update
Google Translate UpdateGoogle Translate Update
Google Translate Updatemrsvogel
 
Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...Daniel Adenew
 
Machine Translation=Google Translator
Machine Translation=Google TranslatorMachine Translation=Google Translator
Machine Translation=Google TranslatorNerea
 
Machine Translation Introduction
Machine Translation IntroductionMachine Translation Introduction
Machine Translation Introductionnlab_utokyo
 
Types of machine translation
Types of machine translationTypes of machine translation
Types of machine translationRushdi Shams
 
MyGengo.com State Of Global Translation Industry (2009)
MyGengo.com State Of Global Translation Industry (2009)MyGengo.com State Of Global Translation Industry (2009)
MyGengo.com State Of Global Translation Industry (2009)Dave McClure
 
5 Best Powerpoint Templates Amazing Creative Presentation Themes
5 Best Powerpoint Templates   Amazing Creative Presentation Themes5 Best Powerpoint Templates   Amazing Creative Presentation Themes
5 Best Powerpoint Templates Amazing Creative Presentation ThemesYeasir Arafat
 

Destacado (16)

Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...
 
Google translate (new russian)
Google translate (new russian)Google translate (new russian)
Google translate (new russian)
 
8 Google Translate
8 Google Translate8 Google Translate
8 Google Translate
 
Google Translate in the Classroom
Google Translate in the ClassroomGoogle Translate in the Classroom
Google Translate in the Classroom
 
Linguistic Evaluation of Support Verb Construction Translations by OpenLogos ...
Linguistic Evaluation of Support Verb Construction Translations by OpenLogos ...Linguistic Evaluation of Support Verb Construction Translations by OpenLogos ...
Linguistic Evaluation of Support Verb Construction Translations by OpenLogos ...
 
Amharic document clustering
Amharic document clusteringAmharic document clustering
Amharic document clustering
 
Google Translate Update
Google Translate UpdateGoogle Translate Update
Google Translate Update
 
Google translate
Google translateGoogle translate
Google translate
 
Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...Natural language processing with python and amharic syntax parse tree by dani...
Natural language processing with python and amharic syntax parse tree by dani...
 
Machine Translation=Google Translator
Machine Translation=Google TranslatorMachine Translation=Google Translator
Machine Translation=Google Translator
 
Machine Translation Introduction
Machine Translation IntroductionMachine Translation Introduction
Machine Translation Introduction
 
Techniques in Translation
Techniques in TranslationTechniques in Translation
Techniques in Translation
 
Types of machine translation
Types of machine translationTypes of machine translation
Types of machine translation
 
MyGengo.com State Of Global Translation Industry (2009)
MyGengo.com State Of Global Translation Industry (2009)MyGengo.com State Of Global Translation Industry (2009)
MyGengo.com State Of Global Translation Industry (2009)
 
Slideshare
SlideshareSlideshare
Slideshare
 
5 Best Powerpoint Templates Amazing Creative Presentation Themes
5 Best Powerpoint Templates   Amazing Creative Presentation Themes5 Best Powerpoint Templates   Amazing Creative Presentation Themes
5 Best Powerpoint Templates Amazing Creative Presentation Themes
 

Similar a Bilingual Data Mining for Improving English-Amharic Machine Translation

SiddhantSancheti_MediumShortStory.pptx
SiddhantSancheti_MediumShortStory.pptxSiddhantSancheti_MediumShortStory.pptx
SiddhantSancheti_MediumShortStory.pptxSiddhantSancheti1
 
The Effect of Translationese on Statistical Machine Translation
The Effect of Translationese on Statistical Machine TranslationThe Effect of Translationese on Statistical Machine Translation
The Effect of Translationese on Statistical Machine TranslationGennadi Lembersky
 
Enriching Transliteration Lexicon Using Automatic Transliteration Extraction
Enriching Transliteration Lexicon Using Automatic Transliteration ExtractionEnriching Transliteration Lexicon Using Automatic Transliteration Extraction
Enriching Transliteration Lexicon Using Automatic Transliteration ExtractionSarvnaz Karimi
 
Building of Database for English-Azerbaijani Machine Translation Expert System
Building of Database for English-Azerbaijani Machine Translation Expert SystemBuilding of Database for English-Azerbaijani Machine Translation Expert System
Building of Database for English-Azerbaijani Machine Translation Expert SystemWaqas Tariq
 
English kazakh parallel corpus for statistical machine translation
English kazakh parallel corpus for statistical machine translationEnglish kazakh parallel corpus for statistical machine translation
English kazakh parallel corpus for statistical machine translationijnlc
 
A new hybrid metric for verifying
A new hybrid metric for verifyingA new hybrid metric for verifying
A new hybrid metric for verifyingcsandit
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT Lifeng (Aaron) Han
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLifeng (Aaron) Han
 
Computational linguistics
Computational linguisticsComputational linguistics
Computational linguisticsshrey bhate
 
Hybrid Machine Translation by Combining Multiple Machine Translation Systems
Hybrid Machine Translation by Combining Multiple Machine Translation SystemsHybrid Machine Translation by Combining Multiple Machine Translation Systems
Hybrid Machine Translation by Combining Multiple Machine Translation SystemsMatīss ‎‎‎‎‎‎‎  
 
What are the basics of Analysing a corpus? chpt.10 Routledge
What are the basics of Analysing a corpus? chpt.10 RoutledgeWhat are the basics of Analysing a corpus? chpt.10 Routledge
What are the basics of Analysing a corpus? chpt.10 RoutledgeRajpootBhatti5
 
Machine Transalation.pdf
Machine Transalation.pdfMachine Transalation.pdf
Machine Transalation.pdfAmir Abdalla
 
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...Lifeng (Aaron) Han
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...Lifeng (Aaron) Han
 
AN ADVANCED APPROACH FOR RULE BASED ENGLISH TO BENGALI MACHINE TRANSLATION
AN ADVANCED APPROACH FOR RULE BASED ENGLISH TO BENGALI MACHINE TRANSLATIONAN ADVANCED APPROACH FOR RULE BASED ENGLISH TO BENGALI MACHINE TRANSLATION
AN ADVANCED APPROACH FOR RULE BASED ENGLISH TO BENGALI MACHINE TRANSLATIONcscpconf
 
AMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENT
AMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENTAMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENT
AMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENTNathan Mathis
 

Similar a Bilingual Data Mining for Improving English-Amharic Machine Translation (20)

SiddhantSancheti_MediumShortStory.pptx
SiddhantSancheti_MediumShortStory.pptxSiddhantSancheti_MediumShortStory.pptx
SiddhantSancheti_MediumShortStory.pptx
 
The Effect of Translationese on Statistical Machine Translation
The Effect of Translationese on Statistical Machine TranslationThe Effect of Translationese on Statistical Machine Translation
The Effect of Translationese on Statistical Machine Translation
 
Enriching Transliteration Lexicon Using Automatic Transliteration Extraction
Enriching Transliteration Lexicon Using Automatic Transliteration ExtractionEnriching Transliteration Lexicon Using Automatic Transliteration Extraction
Enriching Transliteration Lexicon Using Automatic Transliteration Extraction
 
Building of Database for English-Azerbaijani Machine Translation Expert System
Building of Database for English-Azerbaijani Machine Translation Expert SystemBuilding of Database for English-Azerbaijani Machine Translation Expert System
Building of Database for English-Azerbaijani Machine Translation Expert System
 
English kazakh parallel corpus for statistical machine translation
English kazakh parallel corpus for statistical machine translationEnglish kazakh parallel corpus for statistical machine translation
English kazakh parallel corpus for statistical machine translation
 
A new hybrid metric for verifying
A new hybrid metric for verifyingA new hybrid metric for verifying
A new hybrid metric for verifying
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metric
 
Searching for the Best Machine Translation Combination
Searching for the Best Machine Translation CombinationSearching for the Best Machine Translation Combination
Searching for the Best Machine Translation Combination
 
Computational linguistics
Computational linguisticsComputational linguistics
Computational linguistics
 
Hybrid Machine Translation by Combining Multiple Machine Translation Systems
Hybrid Machine Translation by Combining Multiple Machine Translation SystemsHybrid Machine Translation by Combining Multiple Machine Translation Systems
Hybrid Machine Translation by Combining Multiple Machine Translation Systems
 
“Neural Machine Translation for low resource languages: Use case anglais - wo...
“Neural Machine Translation for low resource languages: Use case anglais - wo...“Neural Machine Translation for low resource languages: Use case anglais - wo...
“Neural Machine Translation for low resource languages: Use case anglais - wo...
 
What are the basics of Analysing a corpus? chpt.10 Routledge
What are the basics of Analysing a corpus? chpt.10 RoutledgeWhat are the basics of Analysing a corpus? chpt.10 Routledge
What are the basics of Analysing a corpus? chpt.10 Routledge
 
Machine Transalation.pdf
Machine Transalation.pdfMachine Transalation.pdf
Machine Transalation.pdf
 
Translationusing moses1
Translationusing moses1Translationusing moses1
Translationusing moses1
 
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
 
AN ADVANCED APPROACH FOR RULE BASED ENGLISH TO BENGALI MACHINE TRANSLATION
AN ADVANCED APPROACH FOR RULE BASED ENGLISH TO BENGALI MACHINE TRANSLATIONAN ADVANCED APPROACH FOR RULE BASED ENGLISH TO BENGALI MACHINE TRANSLATION
AN ADVANCED APPROACH FOR RULE BASED ENGLISH TO BENGALI MACHINE TRANSLATION
 
AMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENT
AMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENTAMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENT
AMHARIC TEXT TO SPEECH SYNTHESIS FOR SYSTEM DEVELOPMENT
 
Jq3616701679
Jq3616701679Jq3616701679
Jq3616701679
 

Más de Guy De Pauw

Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...Guy De Pauw
 
Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...Guy De Pauw
 
Resource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech TaggingResource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech TaggingGuy De Pauw
 
Natural Language Processing for Amazigh Language
Natural Language Processing for Amazigh LanguageNatural Language Processing for Amazigh Language
Natural Language Processing for Amazigh LanguageGuy De Pauw
 
POS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik LanguagePOS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik LanguageGuy De Pauw
 
The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)Guy De Pauw
 
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...Guy De Pauw
 
Tagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News CorpusTagging and Verifying an Amharic News Corpus
Tagging and Verifying an Amharic News CorpusGuy De Pauw
 
A Corpus of Santome
A Corpus of SantomeA Corpus of Santome
A Corpus of SantomeGuy De Pauw
 
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Automatic Structuring and Correction Suggestion System for Hungarian Clinical...
Automatic Structuring and Correction Suggestion System for Hungarian Clinical...Guy De Pauw
 
Compiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFSTCompiling Apertium Dictionaries with HFST
Compiling Apertium Dictionaries with HFSTGuy De Pauw
 
The Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionThe Database of Modern Icelandic Inflection
The Database of Modern Icelandic InflectionGuy De Pauw
 
Learning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingLearning Morphological Rules for Amharic Verbs Using Inductive Logic Programming
Learning Morphological Rules for Amharic Verbs Using Inductive Logic ProgrammingGuy De Pauw
 
Issues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishIssues in Designing a Corpus of Spoken Irish
Issues in Designing a Corpus of Spoken IrishGuy De Pauw
 
How to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsHow to build language technology resources for the next 100 years
How to build language technology resources for the next 100 yearsGuy De Pauw
 
Towards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersTowards Standardizing Evaluation Test Sets for Compound Analysers
Towards Standardizing Evaluation Test Sets for Compound AnalysersGuy De Pauw
 
The PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentThe PALDO Concept - New Paradigms for African Language Resource Development
The PALDO Concept - New Paradigms for African Language Resource DevelopmentGuy De Pauw
 
A System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersA System for the Recognition of Handwritten Yorùbá Characters
A System for the Recognition of Handwritten Yorùbá CharactersGuy De Pauw
 
IFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemIFE-MT: An English-to-Yorùbá Machine Translation System
IFE-MT: An English-to-Yorùbá Machine Translation SystemGuy De Pauw
 
A Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemA Number to Yorùbá Text Transcription System
A Number to Yorùbá Text Transcription SystemGuy De Pauw
 

Más de Guy De Pauw (20)

Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...Technological Tools for Dictionary and Corpora Building for Minority Language...
Technological Tools for Dictionary and Corpora Building for Minority Language...
 
Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...Semi-automated extraction of morphological grammars for Nguni with special re...
Semi-automated extraction of morphological grammars for Nguni with special re...
 
Resource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech TaggingResource-Light Bantu Part-of-Speech Tagging
Resource-Light Bantu Part-of-Speech Tagging
 
Natural Language Processing for Amazigh Language
Natural Language Processing for Amazigh LanguageNatural Language Processing for Amazigh Language
Natural Language Processing for Amazigh Language
 
POS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik LanguagePOS Annotated 50m Corpus of Tajik Language
POS Annotated 50m Corpus of Tajik Language
 
The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)The Tagged Icelandic Corpus (MÍM)
The Tagged Icelandic Corpus (MÍM)
 
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...Describing Morphologically Rich Languages Using Metagrammars a Look at Verbs ...
Bilingual Data Mining for Improving English-Amharic Machine Translation

  • 1. Bilingual Data Mining for the English-Amharic Statistical Machine Translation (EASMT)
      Mulu Gebreegziabher
      Addis Ababa, Ethiopia: IT Doctoral Program, Addis Ababa University
      Prof. Laurent Besacier
      Grenoble, France: University Joseph Fourier
      Dr. Girma Taye & Dr. Dereje Teferi
      Addis Ababa, Ethiopia: Addis Ababa University
      December 2, 2011
  • 2. Presentation Outline
      • Introduction
      • Objectives
      • Experiment on the English-Amharic bilingual corpus
      • ENA English-Amharic parallel news corpus
      • Parliamentary English-Amharic parallel proclamation corpus
      • Sentence level aligned English-Amharic parallel corpora
      • Way Forward
  • 3. Introduction
      MT is the application of computers to translate text from one natural language to another.
      Machine Translation Systems
      – Machine Assisted Translation
          – Human Aided Translation
          – Machine Aided Translation
      – Fully Automated Translation
          – Empirical Systems
              – Statistical Machine Translation
              – Example-based Translation
          – Rule-based Systems
  • 4. Introduction Contd…
      • SMT systems are data-driven and rely on aligned bilingual parallel corpora.
      • The performance of an SMT system depends on the size of the available training corpus.
      • The larger the corpus, the better the performance of the SMT system.
      • To develop EASMT, parallel data has to be collected as English-Amharic bilingual sentence pairs.
      • The experiment is to be conducted on a corpus of at least 2M word pairs (40K sentence pairs).
  • 5. Introduction Contd…
      English-Amharic Statistical Machine Translation (EASMT)
      • Translation between two disparate languages

                             Amharic          English
      Language Family        Afro-Asiatic     Indo-European
      Morphology             Complex          Less inflected
      Syntactic Structure    SOV              SVO
      Writing System         Geez Alphabet    Latin Letters
  • 6. Introduction Contd…
      Parallel Corpus
      • A parallel corpus is a collection of texts paired with their translations into another language.
      • The experiment is conducted on a training corpus in both languages, built from expressions found in parallel Amharic-English news, parliamentary and constitutional documents.
      • The parallel ENA news contains sentences of day-to-day usage:
          – Direct translations of each other
          – Indirect translations written on the same topic in different languages (called comparable corpora).
  • 7. Objectives The objective of the research is to study and develop an English-Amharic Statistical Machine Translation (EASMT) system and to improve the translation quality by integrating linguistic knowledge into the system.
  • 8. Experiment on the English-Amharic bilingual corpus
      Mining the parallel corpus
      • There are five steps for processing a bilingual text corpus used in an SMT system (Besacier et al., 2009):
          – Raw data collection: proclamation and parallel news corpora have been collected
          – Document alignment: manual & automatic
          – Tokenization: splitting and trimming
          – Sentence splitting: done using the punctuation marks [?!. ፡፡ ]
          – Sentence alignment: almost completed
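The tokenization and sentence-splitting steps above can be illustrated with a minimal sketch. It assumes plain Unicode text and treats every occurrence of the marks listed on the slide as a sentence boundary; the function names `split_sentences` and `tokenize` are hypothetical, and the single-character Ethiopic full stop (።) is accepted in addition to the doubled wordspace (፡፡) as an assumption. This is not the authors' tool.

```python
import re

# Sentence-final marks from the slide: ?, !, . and the Ethiopic full stop,
# typed either as two wordspaces (፡፡) or as the single character ። (assumption).
SENTENCE_END = re.compile(r"(፡፡|።|[?!.])")


def split_sentences(text):
    """Split text after each sentence-final mark and trim whitespace."""
    # With a capturing group, re.split alternates: body, mark, body, mark, ...
    parts = SENTENCE_END.split(text)
    sentences = []
    for i in range(0, len(parts) - 1, 2):
        sentence = (parts[i] + parts[i + 1]).strip()
        if sentence:
            sentences.append(sentence)
    tail = parts[-1].strip()  # text after the last mark, if any
    if tail:
        sentences.append(tail)
    return sentences


def tokenize(sentence):
    """Separate punctuation from words, then split on whitespace."""
    spaced = re.sub(r"(፡፡|።|[?!.,;:])", r" \1 ", sentence)
    return spaced.split()
```

A rule this simple also splits at abbreviations and inside numbers such as "3.5" or "23,771", so a production pipeline would need exception handling on top of it.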
  • 9. ENA English-Amharic parallel news corpus
      • News coverage: Aug 21, 2006 - January 06, 2008

      News Corpus                             Counts     Total
      Amharic            Domestic Language    10,116    23,771
                         Regional             13,655
      English            Foreign Language     11,276    11,276
      Amharic-English    Monitoring              494     3,610
                         Information           3,116

      Table 1: ENA news corpus
  • 10. ENA English-Amharic parallel news corpus
      • Count Summary: ENA news corpus

      Collected                              Amharic      English        Total
      Counts of Raw        Documents          23,771       11,276       35,047
                           Sentences         322,673      212,050      534,723
                           Words           5,277,711    3,704,644    8,982,355
                           Vocabularies      270,786      130,803      401,589
      Counts of Aligned    Documents           1,036        1,036        2,072
                           Sentences          26,112       25,834       51,946
                           Words             207,200      198,461      405,661
                           Vocabularies       36,519       17,987       54,506

      Table 2: The status of English-Amharic parallel news corpus on May 25, 2011
  • 11. ENA English-Amharic parallel news corpus
      • Manual alignment at document level: Challenges
          – Easy: preprocessing, including exporting from the SQL database to Word and converting to Unicode using the Zilla word-to-text converter
          – Time consuming: alignment at document level is difficult, since the files are stored in different folders with no structure, and the dates, punctuation and heading information differ (parallel/comparable corpus)
          – Document level alignment is done by looking at the heading and picking the news id from the folders
  • 12. ENA English-Amharic parallel news corpus
      • Automatically aligned English-Amharic sample ENA news corpora at document level
      • The aligner takes the following into consideration to align the news items:
          – Start from the English corpus (which constitutes 32% of the collection).
          – Match news items that have a different story language.
          – Limit the match to the neighboring Amharic corpus, looking at 80 files around the current file.
          – Use a scoring method that gives equal weight to all matching columns.
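The matching rules listed on this slide could be sketched roughly as follows. The slides do not say which database columns are compared or whether "80 files around" means 80 in total or 80 on each side, so the `NewsItem` record, its `columns` dictionary and the per-side window are all assumptions made for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class NewsItem:
    news_id: int
    language: str  # "en" or "am"
    columns: dict  # hypothetical metadata columns, e.g. {"date": ..., "place": ...}


WINDOW = 80  # assumption: 80 neighboring files on each side of the current file


def score(english, amharic):
    """Count matching columns; every column carries the same weight (1)."""
    return sum(1 for key, value in english.columns.items()
               if amharic.columns.get(key) == value)


def align_documents(items, window=WINDOW):
    """Map each English news id to the best-scoring nearby Amharic news id.

    `items` is assumed to be ordered by news id, with both languages mixed.
    """
    alignment = {}
    for pos, item in enumerate(items):
        if item.language != "en":  # start from the English corpus
            continue
        lo, hi = max(0, pos - window), min(len(items), pos + window + 1)
        # Only items in the other story language are candidates.
        candidates = [c for c in items[lo:hi] if c.language == "am"]
        if not candidates:
            continue
        best = max(candidates, key=lambda c: score(item, c))
        if score(item, best) > 0:
            alignment[item.news_id] = best.news_id
    return alignment
```

Because each English item chooses independently, several English items may select the same Amharic item, which would be consistent with the unique Amharic count in Table 4 being lower than the English count.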
  • 13. ENA English-Amharic parallel news corpus
      • Output of the automatic aligner:

      Aligned Corpus            Counts    Cumulative       %
      1-1                          383           383    0.37
      1-2                          155           538    0.52
      1-3                          498         1,036    1.00
      Total Exact Matches          880                  0.85
      Unique Amharic Corpus        968                  0.93
      Unique English Corpus      1,036                  1.00

      Table 4: Automatically Aligned English-Amharic Sample ENA news items
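As a quick check of the arithmetic in Table 4, the short sketch below recomputes the cumulative counts and the "%" column (which is expressed as a fraction of the 1,036 sample English documents). The variable names are illustrative only.

```python
# Counts per alignment type, copied from Table 4.
counts = {"1-1": 383, "1-2": 155, "1-3": 498}
total = sum(counts.values())  # 1,036 sample English documents

rows = []
running = 0
for label, n in counts.items():
    running += n
    # (type, count, cumulative count, cumulative fraction of the total)
    rows.append((label, n, running, round(running / total, 2)))

exact_match_share = round(880 / total, 2)     # Total Exact Matches
unique_amharic_share = round(968 / total, 2)  # Unique Amharic Corpus
```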
  • 14. ENA English-Amharic parallel news corpus
      • Some of the sample English documents were better aligned with a document not seen before, e.g.
          – 41827 → 41791 (manual: 41827 → 41826)
      • 85% of the matches were aligned automatically exactly as in the manual alignment.
      • Thus, the remaining 15% are new matches and do not necessarily indicate errors.
      Table ENA: Aligned Sample English/Amharic News corpus
  • 15. ENA English-Amharic parallel news corpus
      • Extended to automatically align the whole set of English-Amharic ENA news items

      Aligned Corpus            Counts    Cumulative       %
      1-1                        2,928         2,928    0.26
      1-2                        1,535         4,463    0.40
      1-3                        6,813        11,276    1.00
      Unique Amharic Corpus     10,487                  0.93
      Unique English Corpus     11,276                  1.00

      Table 5: Automatically Aligned English-Amharic ENA news items
  • 16. Parliamentary English-Amharic parallel proclamation corpus
      • Proclamation coverage: Aug 21, 1995 - July 16, 2010

      Collected                             Amharic    English      Total
      Counts of Raw        Documents            632        632      1,264
      Counts of Aligned    Documents            115        115        230
                           Sentences         19,115     25,730     44,845
                           Words            219,430    283,578    503,008
                           Vocabularies      32,299     17,908     50,207

      Table 6: Aligned Parliamentary English-Amharic parallel proclamation corpus
  • 17. Sentence level aligned English-Amharic parallel corpora
      • The alignment process is similar for both the ENA news items and the proclamations.
      • The alignment is done using a sentence aligner called Hunalign (similar to Gale and Church, 1993).
      • Hunalign aligns bilingual text using sentence length.
      • An English-Amharic bilingual dictionary (a word list of 8,212 entries) has been adopted and used (Armbruster, 2007).
      • The aligner aligns English sentences to Amharic sentences in 0-1, 1-1 or 1-2 patterns.
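To make the idea of length-based alignment concrete, here is a simplified dynamic-programming sketch in the spirit of Gale and Church (1993). It is not Hunalign's actual algorithm: it ignores the bilingual dictionary, uses raw character counts without a language-specific length ratio, and the penalty values are arbitrary assumptions. A 1-0 pattern is allowed in addition to the 0-1, 1-1 and 1-2 patterns named on the slide so that a complete alignment always exists.

```python
def align_sentences(src, tgt, skip_penalty=1.0, merge_penalty=0.3):
    """Align source to target sentences by character length.

    Returns a list of (source_sentences, target_sentences) pairs.
    """
    # (source sentences consumed, target sentences consumed, fixed penalty)
    beads = [(1, 1, 0.0), (0, 1, skip_penalty),
             (1, 0, skip_penalty), (1, 2, merge_penalty)]
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for di, dj, penalty in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                ls = sum(len(s) for s in src[i:ni])
                lt = sum(len(t) for t in tgt[j:nj])
                # Relative length mismatch in [0, 1]; skipped sentences pay
                # only the fixed penalty.
                mismatch = abs(ls - lt) / max(ls + lt, 1) if di and dj else 0.0
                candidate = cost[i][j] + penalty + mismatch
                if candidate < cost[ni][nj]:
                    cost[ni][nj] = candidate
                    back[ni][nj] = (i, j)
    # Follow the back-pointers from the end to recover the alignment.
    pairs = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((src[pi:i], tgt[pj:j]))
        i, j = pi, pj
    pairs.reverse()
    return pairs
```

For an English-Amharic pair, raw character counts are a crude signal, because the Ge'ez script typically writes a consonant-vowel syllable as one character; a real aligner would scale lengths by an expected ratio and combine them with dictionary evidence.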
  • 18. Sentence level aligned English-Amharic parallel corpora
      • The result of the alignment at the sentence level for both the ENA news and the proclamations

      Aligned Sentence Pairs     Counts
      ENA Corpus                155,200
      Proclamation Corpus        18,632
      Total                     173,832

      Table 7: Sentence aligned English-Amharic bilingual corpus
  • 19. Way Forward
      • To increase the size of the English-Amharic proclamation corpus as much as possible.
      • To further analyze the experiment conducted so far.
      • To increase the translation quality using linguistic (morpho-syntactic) knowledge.