Romancing the Rosetta Stone

This is about machine translation. An article with this title from the University of Southern California describes a relatively successful machine translation system devised by USC’s Dr. Franz Josef Och (who did a lot of the preliminary work at the Rheinisch-Westfälische Technische Hochschule, RWTH, in Aachen, building on late-1980s IBM research).

The idea is to stuff a machine with masses of parallel texts and let the machine work out the translation using statistical comparisons.
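To make "statistical comparisons" a bit more concrete, here is a minimal sketch of word-level alignment learning in the style of the late-1980s IBM work (usually called IBM Model 1). It is only an illustration under my own assumptions: the toy German-English corpus, the function names and the iteration count are all made up for the example, and Och’s actual system is far more elaborate than this.

```python
from collections import defaultdict

def train_word_model(pairs, iterations=20):
    """Estimate t[(e, f)] = P(English word e | foreign word f) with EM.

    pairs: list of (foreign_tokens, english_tokens) sentence pairs.
    This is a toy IBM-Model-1-style trainer (no NULL word, no smoothing).
    """
    english_vocab = {e for _, eng in pairs for e in eng}
    # Start from a uniform distribution over co-occurring word pairs.
    t = {}
    for foreign, english in pairs:
        for e in english:
            for f in foreign:
                t[(e, f)] = 1.0 / len(english_vocab)

    for _ in range(iterations):
        count = defaultdict(float)   # expected count of e aligned to f
        total = defaultdict(float)   # expected count of f being aligned at all
        for foreign, english in pairs:
            for e in english:
                norm = sum(t[(e, f)] for f in foreign)
                for f in foreign:
                    frac = t[(e, f)] / norm
                    count[(e, f)] += frac
                    total[f] += frac
        # Re-estimate the probabilities from the expected counts.
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t

def best_translation(t, foreign_word):
    """Return the English word with the highest P(e | foreign_word)."""
    candidates = {e: p for (e, f), p in t.items() if f == foreign_word}
    return max(candidates, key=candidates.get)

# Hypothetical three-sentence parallel corpus, purely for illustration.
corpus = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]
model = train_word_model(corpus)
for word in ["das", "Haus", "Buch", "ein"]:
    print(word, "->", best_translation(model, word))
```

Nobody tells the program that "Buch" means "book"; it falls out of the fact that the two words keep turning up in the same sentence pairs. That is also why the quantity and quality of the parallel texts matter so much.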

Och’s system scored highest of 23 Arabic- and Chinese-to-English systems. It was also entered in a recent competition to devise from scratch an MT system for Hindi as fast as possible. The results are not out yet.

Creation of the parallel texts needed for Och’s system to work was complicated by the fact that Hindi is written in a non-Latin script, which has numerous different digital encodings instead of one or two standard ones.

You’re not kidding! And I wonder what the quality of the input translations was.

It also reminds me of the problems with translation memories, where the segments in one language don’t always line up with those in the other from one text to the next.

This method ignores, or rather rolls over, explicit grammatical rules and even traditional dictionary lists of vocabulary in favor of letting the computer itself find matchup patterns between given Chinese or Arabic (or any other language) texts and their English translations.

As computers have improved, these abilities have grown, allowing the systems to move from using individual words as the basic unit to using groups of words, that is, phrases.

Different human translators’ versions of the same text will often vary considerably. Another key improvement has been the use of multiple English human translations to allow the computer to more freely and widely check its rendering by a scoring system.

And I wonder how many versions they had for Hindi, Chinese and Arabic.
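For the curious, here is a rough sketch of how scoring against several human translations can work. It uses clipped n-gram precision, the core of the BLEU metric; the article does not say exactly which scoring system was used, so treat the choice of metric, the function names and the example sentences as my assumptions rather than a description of Och’s setup.

```python
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against several references.

    Each candidate n-gram is credited at most as many times as it appears
    in the single reference where it is most frequent.
    """
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref_counts = Counter()
    for ref in references:
        for gram, c in ngrams(ref, n).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], c)
    clipped = sum(min(c, max_ref_counts[gram]) for gram, c in cand_counts.items())
    return clipped / sum(cand_counts.values())

# Two hypothetical human translations of the same source sentence.
references = [
    "the cat is on the mat".split(),
    "there is a cat on the mat".split(),
]
machine_output = "the cat sat on the mat".split()

print(modified_precision(machine_output, references, 1))  # unigrams
print(modified_precision(machine_output, references, 2))  # bigrams
```

The more reference translations there are, the more legitimate wordings the machine output can match, which is why the number of versions available per language makes a difference.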

Via Slashdot and Roland Piquepaille’s Technology Trends.
