Bilingual Data Mining for Improving English-Amharic Machine Translation
1. Bilingual Data Mining for the English-Amharic Statistical Machine Translation (EASMT)
Mulu Gebreegziabher
Addis Ababa, Ethiopia: IT Doctoral Program, Addis Ababa University
Prof. Laurent Besacier
Grenoble, France: University Joseph Fourier
Dr. Girma Taye & Dr. Dereje Teferi
Addis Ababa, Ethiopia: Addis Ababa University
December 2, 2011
2. Presentation Outline
• Introduction
• Objectives
• Experiment on the English-Amharic bilingual corpus
• ENA English-Amharic parallel news corpus
• Parliamentary English-Amharic parallel proclamation corpus
• Sentence level aligned English-Amharic parallel corpora
• Way Forward
3. Introduction
MT is the application of computers to translate text from one natural language to another.
Machine Translation Systems
• Machine Assisted Translation
– Human Aided Translation
– Machine Aided Translation
• Fully Automated Translation
– Rule-based Systems
– Empirical Systems: Statistical Machine Translation, Example-based Translation
4. Introduction Contd…
• SMT systems are data-driven: they rely on an aligned bilingual parallel corpus.
• The performance of an SMT system depends on the size of the available training corpus.
• The larger the corpus, the better the performance of the SMT system.
• To develop EASMT, parallel data has to be collected as English-Amharic bilingual sentence pairs.
• The experiment is to be conducted on a corpus of at least 2M word pairs (40K sentence pairs).
5. Introduction Contd…
English-Amharic Statistical Machine Translation (EASMT)
• Translation between two disparate languages
                       Amharic          English
Language Family        Afro-Asiatic     Indo-European
Morphology             Complex          Less inflected
Syntactic Structure    SOV              SVO
Writing System         Geez Alphabet    Latin Letters
6. Introduction Contd…
Parallel Corpus
• A parallel corpus is a collection of texts paired with their translations into another language.
• The experiment is conducted on a training corpus in both languages, drawn from parallel Amharic-English news, parliamentary and constitutional documents.
• The parallel ENA news contains sentences of day-to-day usage:
– Direct translations of each other
– Indirect translations: texts written on the same topic in different languages, called comparable corpora.
7. Objectives
The objective of the research is to study and
develop an English-Amharic Statistical
Machine Translation (EASMT) system and to
improve the translation quality by integrating
linguistic knowledge into the system.
8. Experiment on the English-Amharic bilingual corpus
Mining the parallel corpus
• Processing a bilingual text corpus for an SMT system takes five steps (Besacier et al., 2009):
– Raw data collection: the proclamation and parallel news corpora have been collected
– Document alignment: manual and automatic
– Tokenization: splitting and trimming
– Sentence splitting: done using the punctuation marks [?!. ፡፡ ]
– Sentence alignment: almost completed
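The sentence-splitting step can be sketched in a few lines of Python. This is a minimal illustration, not the tool used in the experiment: the function name `split_sentences` is hypothetical, and the single-character Ethiopic full stop "።" is accepted alongside the two-character "፡፡" as an assumption.

```python
import re

# Sentence-final marks from the slide: ? ! . and the Amharic full stop,
# typed either as two wordspace characters "፡፡" or (assumed here) as "።".
_BOUNDARY = re.compile(r"(፡፡|።|[?!.])\s*")

def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation mark."""
    sentences = []
    start = 0
    for match in _BOUNDARY.finditer(text):
        sentence = text[start:match.end(1)].strip()
        if sentence:
            sentences.append(sentence)
        start = match.end()
    tail = text[start:].strip()  # trailing text without a final mark
    if tail:
        sentences.append(tail)
    return sentences
```

A splitter this simple also breaks on abbreviations and decimal numbers (for example "e.g." or "3.5"), so a real pipeline would need an exception list.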
9. ENA English-Amharic parallel news corpus
• News coverage: Aug 21, 2006 - January 06, 2008
News Corpus                              Counts    Total
Amharic            Domestic Language     10,116
                   Regional              13,655    23,771
English            Foreign Language      11,276    11,276
Amharic-English    Monitoring               494
                   Information            3,116     3,610
Table 1: ENA news corpus
10. ENA English-Amharic parallel news corpus
• Count Summary: ENA news corpus
Collected                           Amharic      English      Total
Counts of Raw       Documents        23,771       11,276     35,047
                    Sentences       322,673      212,050    534,723
                    Words         5,277,711    3,704,644  8,982,355
                    Vocabularies    270,786      130,803    401,589
Counts of Aligned   Documents         1,036        1,036      2,072
                    Sentences        26,112       25,834     51,946
                    Words           207,200      198,461    405,661
                    Vocabularies     36,519       17,987     54,506
Table 2: The status of English-Amharic parallel news corpus on May 25, 2011
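Counts of this kind can be produced with a short script. The sketch below is illustrative only: it assumes whitespace tokenization and takes "vocabulary" to mean the number of distinct word forms, neither of which is stated on the slide, and `corpus_counts` is a hypothetical name.

```python
def corpus_counts(sentences):
    """Count sentences, running words and distinct word forms.

    Assumes the sentences are already tokenized so that words are
    separated by whitespace.
    """
    tokens = [token for sentence in sentences for token in sentence.split()]
    return {
        "sentences": len(sentences),
        "words": len(tokens),
        "vocabulary": len(set(tokens)),
    }
```

Running it once per language on the raw and on the aligned files gives one column of the table at a time.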
11. ENA English-Amharic parallel news corpus
• Manual alignment at document level: challenges
– Easy: preprocessing, including exporting from the SQL database to Word and converting to Unicode using Zilla word to text converter
– Time-consuming: aligning at document level is difficult because the files are stored in different folders with no structure, and the dates, punctuation and heading information differ between the two languages (parallel/comparable corpus)
– Document-level alignment is done by looking at the heading and picking the news id from the folders
12. ENA English-Amharic parallel news corpus
• Automatic document-level alignment of the sample English-Amharic ENA news corpus
• The aligner takes the following into consideration when aligning news items:
– Start from the English corpus (which constitutes 32% of the collected documents).
– Match only news items whose story language differs.
– Limit the search to the neighboring Amharic corpus, looking at 80 files around the current file.
– Score candidates with a method that gives equal weight to all matching columns.
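A windowed, equal-weight matcher of this kind could look like the sketch below. It is an illustration under stated assumptions, not the aligner used in the experiment: the record layout (dictionaries with an "id" key), the column names passed in, and the function names are all hypothetical, and the window is read as 80 files on each side, since the slide does not say whether 80 is per side or in total.

```python
from bisect import bisect_left

def column_score(en_item, am_item, columns):
    """Fraction of the compared columns whose values agree (equal weights)."""
    hits = sum(1 for col in columns if en_item.get(col) == am_item.get(col))
    return hits / len(columns)

def align_documents(english, amharic, columns, window=80):
    """Map each English news id to the best-scoring nearby Amharic news id.

    `amharic` must be sorted by "id". Only the `window` items on either
    side of the English item's position in that list are scored.
    """
    amharic_ids = [item["id"] for item in amharic]
    pairs = {}
    for en_item in english:
        centre = bisect_left(amharic_ids, en_item["id"])
        low = max(0, centre - window)
        high = min(len(amharic), centre + window)
        candidates = amharic[low:high]
        if not candidates:
            continue
        best = max(candidates, key=lambda am: column_score(en_item, am, columns))
        if column_score(en_item, best, columns) > 0:  # skip items with no matching column
            pairs[en_item["id"]] = best["id"]
    return pairs
```

Because every column contributes the same amount to the score, adding or removing a column changes the ranking only through the number of agreeing values, not through any tuned weight.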
13. ENA English-Amharic parallel news corpus
• The output result of the automatic aligner.
Aligned Corpus            Counts    Cumulative    Fraction
1-1                          383           383        0.37
1-2                          155           538        0.52
1-3                          498         1,036        1.00
Total Exact Matches                        880        0.85
Unique Amharic Corpus                      968        0.93
Unique English Corpus                    1,036        1.00
Table 4: Automatically Aligned English-Amharic Sample ENA news items
14. ENA English-Amharic parallel news corpus
• Some of the sample English documents were better aligned with a document not seen in the manual alignment, e.g.
– 41827 paired with 41791 (manual alignment: 41827 paired with 41826)
• 85% of the automatic matches are exactly the same as the manual alignment.
• Thus, the remaining 15% are new matches, which do not necessarily indicate an error.
Table ENA: Aligned Sample English/Amharic News corpus
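The 85% figure is the share of sample items whose automatic pairing equals the manual one (880 of 1,036). A minimal sketch of that comparison, assuming both alignments are stored as dictionaries from English news id to Amharic news id (a hypothetical layout):

```python
def agreement_rate(automatic, manual):
    """Share of manually aligned items that the aligner paired identically."""
    if not manual:
        raise ValueError("manual alignment is empty")
    same = sum(1 for key, value in manual.items() if automatic.get(key) == value)
    return same / len(manual)

# Figures reported on the slides: 880 exact matches out of 1,036 sample items.
reported = round(880 / 1036, 2)
```

Items counted as disagreements still need a manual look, since some of them are better pairings than the reference.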
15. ENA English-Amharic parallel news corpus
• Extended to automatically align the whole English-
Amharic ENA news items
Aligned Corpus            Counts    Cumulative    Fraction
1-1                        2,928         2,928        0.26
1-2                        1,535         4,463        0.40
1-3                        6,813        11,276        1.00
Unique Amharic Corpus                   10,487        0.93
Unique English Corpus                   11,276        1.00
Table 5: Automatically Aligned English-Amharic ENA news items
16. Parliamentary English-Amharic parallel proclamation corpus
• Proclamation coverage: Aug 21, 1995 - July 16, 2010
Collected                           Amharic    English      Total
Counts of Raw       Documents           632        632      1,264
Counts of Aligned   Documents           115        115        230
                    Sentences        19,115     25,730     44,845
                    Words           219,430    283,578    503,008
                    Vocabularies     32,299     17,908     50,207
Table 6: Aligned Parliamentary English-Amharic parallel proclamation corpus
17. Sentence level aligned English-Amharic parallel corpora
• The alignment process is the same for both the ENA news items and the proclamations.
• The alignment is done using a sentence aligner called Hunalign (similar to Gale and Church, 1993).
• Hunalign aligns bilingual text using sentence-length information, combined with a bilingual dictionary when one is available.
• An English-Amharic bilingual dictionary of 8,212 entries was adopted and used (Armbruster, 2007).
• The aligner aligns an English sentence to Amharic as 0-1, 1-1 or 1-2.
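To make the length-based idea concrete, here is a much simplified dynamic-programming aligner. It is not Hunalign and not the Gale and Church cost model: the cost function, the penalty values and the name `align_by_length` are assumptions made for illustration, and a 1-0 bead is allowed in addition to the 0-1, 1-1 and 1-2 beads above so that an alignment always exists.

```python
def align_by_length(src, tgt, skip_penalty=0.8, merge_penalty=0.2):
    """Align two sentence lists by character length.

    Returns a list of (source_indices, target_indices) beads.
    """
    beads = [(1, 1), (0, 1), (1, 0), (1, 2)]  # sentences consumed per side

    def cost(src_chunk, tgt_chunk):
        if not src_chunk or not tgt_chunk:
            return skip_penalty  # one side left unaligned
        len_src = sum(len(s) for s in src_chunk)
        len_tgt = sum(len(t) for t in tgt_chunk)
        mismatch = abs(len_src - len_tgt) / max(len_src, len_tgt, 1)
        return mismatch + (merge_penalty if len(tgt_chunk) > 1 else 0.0)

    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for di, dj in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                total = best[i][j] + cost(src[i:ni], tgt[j:nj])
                if total < best[ni][nj]:
                    best[ni][nj] = total
                    back[ni][nj] = (i, j)

    # Follow the back-pointers from the end to recover the beads in order.
    path = []
    i, j = n, m
    while (i, j) != (0, 0):
        prev_i, prev_j = back[i][j]
        path.append((list(range(prev_i, i)), list(range(prev_j, j))))
        i, j = prev_i, prev_j
    path.reverse()
    return path
```

Raw character counts are a crude signal for this language pair, because the Geez script writes a syllable per character; a practical aligner would scale the lengths by an expected ratio and add dictionary evidence, as Hunalign does.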
18. Sentence level aligned English-Amharic parallel corpora
• The result of the alignment at the sentence level for
both the ENA news and the proclamation
Aligned Sentence pairs     Counts
ENA Corpus                155,200
Proclamation Corpus        18,632
Total                     173,832
Table 7: Sentence aligned English-Amharic bilingual corpus
19. Way Forward
• To increase the size of the English-Amharic proclamation corpus as much as possible.
• To further analyze the experiments conducted so far.
• To increase the translation quality by integrating morpho-syntactic linguistic knowledge.