Machine translation is a complex subject with an enormous amount of active research. This is an area in which I have no formal training, but I need a lightweight stand-alone translator that does a reasonable job on normal chat conversation. A list of open source tools can be found here. From my point of view, I simply want a neat package that translates as a black box, but that seems to be impossible with the open source libraries. Thus I need to gain familiarity with the basic ideas in the area and with how these tools work.

The foundational text for statistical natural language processing is Christopher Manning and Hinrich Schütze’s *Foundations of Statistical Natural Language Processing*. A modern open source system that implements statistical machine translation algorithms is Moses.

The basic idea I have is to try the simplest possible thing that uses tools from a system such as Moses: put together a corpus of English text that covers a minimal set of topics of chat conversation, translate the corpus into different languages using a good heavyweight translator such as Google Translate, use the resulting sentence pairs to train the machine learning algorithms from Moses, and use the trained models as standalone translators.
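The corpus-building step could be sketched roughly as follows. This is a minimal sketch under assumptions, not a tested pipeline: `translate_fn` is a hypothetical stand-in for whatever heavyweight translator is used, and the `chat.en` / `chat.fr` naming assumes the common convention of line-aligned parallel files sharing a stem with language-code extensions, which is how I understand Moses training expects its input.

```python
import tempfile
from pathlib import Path

def write_parallel_corpus(sentences, translate_fn, out_dir,
                          stem="chat", src="en", tgt="fr"):
    """Write two line-aligned files: line k of the target file is the
    translation of line k of the source file.

    translate_fn is a hypothetical callable (str -> str) wrapping an
    external translator; it is an assumption of this sketch.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    src_path = out_dir / f"{stem}.{src}"
    tgt_path = out_dir / f"{stem}.{tgt}"
    with src_path.open("w", encoding="utf-8") as fs, \
         tgt_path.open("w", encoding="utf-8") as ft:
        for sentence in sentences:
            # Collapse whitespace so each sentence stays on exactly one line.
            source = " ".join(sentence.split())
            target = " ".join(translate_fn(source).split())
            # Skip pairs where either side is empty, to keep the files aligned.
            if source and target:
                fs.write(source + "\n")
                ft.write(target + "\n")
    return src_path, tgt_path

# Demo with a fake "translator" (upper-casing) in place of a real service.
demo_dir = tempfile.mkdtemp()
src_file, tgt_file = write_parallel_corpus(
    ["hello there", "   ", "how are you"], str.upper, demo_dir)
print(src_file.read_text(encoding="utf-8"))
print(tgt_file.read_text(encoding="utf-8"))
```

The only property that really matters here is that the two files stay aligned line by line, which is why empty pairs are dropped from both sides at once.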

Clearly this approach is not going to produce very good translations, but it is guaranteed to produce lightweight translators that could be put into server code and distributed across a network. Since the central problem that we are interested in resolving is the viability of a system capable of providing one-to-one simultaneous translate-chat for 6.7 billion people, the quality of translations is not as important as removing the bottleneck of network communication with Google, Yahoo, or other servers for the translations.

Once we show proof of concept that a lightweight translator can, in principle, still allow simultaneous chat among 6.7 billion without insurmountable network overhead or unreasonable delays, then we can return to the problem of translation quality.

## Basic ideas of the Statistical Approach to Machine Translation

The idea of a statistical approach to machine translation seems to have originated at the IBM Thomas J. Watson Research Center around 1990 with Peter Brown et al., and their paper can be found here. An *n*-gram is an ordered *n*-tuple of words, and an *n*-gram probability model for a language is essentially the conditional probability distribution of a word given the *n* − 1 words seen before it. The power of language modeling with the fairly simple *n*-gram model was seen in a bag reconstruction test, in which the words of a sentence are scrambled into an unordered bag and the model is asked to pick the most probable ordering. With just a 3-gram model of English, the test correctly recovered the sentence meaning for 84% of a sample of 38 short sentences with fewer than 11 words each, according to the 1990 paper of Brown et al.
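To make the bag reconstruction idea concrete, here is a small sketch of a 3-gram model and a brute-force reordering of a bag of words. It is only an illustration under my own assumptions: the four-sentence corpus is made up, the add-one smoothing is a simple choice of mine rather than what Brown et al. used, and trying every permutation is only feasible for very short sentences.

```python
import math
from collections import Counter
from itertools import permutations

def train_trigram_counts(sentences):
    """Count trigrams and their two-word contexts, padding each sentence."""
    trigrams, contexts, vocab = Counter(), Counter(), set()
    for sentence in sentences:
        words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
        vocab.update(words)
        for i in range(2, len(words)):
            trigrams[tuple(words[i - 2:i + 1])] += 1
            contexts[tuple(words[i - 2:i])] += 1
    return trigrams, contexts, vocab

def log_prob(words, trigrams, contexts, vocab):
    """Log-probability of a word sequence, with add-one smoothing
    (a simplifying assumption of this sketch)."""
    padded = ["<s>", "<s>"] + list(words) + ["</s>"]
    total = 0.0
    for i in range(2, len(padded)):
        tri = tuple(padded[i - 2:i + 1])
        ctx = tuple(padded[i - 2:i])
        # P(word | previous two words), smoothed so unseen trigrams are not zero.
        total += math.log((trigrams[tri] + 1) / (contexts[ctx] + len(vocab)))
    return total

def reconstruct(bag, trigrams, contexts, vocab):
    """Return the ordering of the bag that the model finds most probable."""
    return max(permutations(bag),
               key=lambda order: log_prob(order, trigrams, contexts, vocab))

# Toy corpus, invented for illustration only.
corpus = ["how are you today", "are you there", "i am fine today", "how are you"]
model = train_trigram_counts(corpus)
best = reconstruct(["you", "how", "today", "are"], *model)
print(" ".join(best))  # prints "how are you today"
```

Even with such a tiny corpus, the counts are enough to prefer the natural word order over the 23 other permutations of the bag.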

## Choice of a corpus

The basic division of a machine translation system like Moses is between programs that produce a so-called language model and programs that use the language model to decode and translate from one language to another. Concretely, a language model for a given pair of languages is simply a large probability table of phrases and their translations. (Moses itself calls this table the phrase table, or translation model, and reserves the term language model for an *n*-gram model of the target language.) The actual translation program decodes sentences in the source language, uses the probability table to choose translations, and produces output. The bulk of the hard work of translation is done in producing the table, using tools from statistical analysis such as hidden Markov models.
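The table-plus-decoder split can be illustrated with a toy example. This is a deliberately naive sketch and not how the Moses decoder works: the table entries and probabilities are invented, and the decoder here simply takes the longest matching phrase from left to right and picks its most probable translation, with no reordering and no scoring of the output sentence as a whole.

```python
# Hypothetical toy phrase table: source phrase -> {target phrase: probability}.
# All entries and numbers are made up for illustration.
PHRASE_TABLE = {
    ("good", "morning"): {"bonjour": 0.9, "bon matin": 0.1},
    ("good",): {"bon": 0.7, "bien": 0.3},
    ("how", "are", "you"): {"comment allez-vous": 0.6, "comment vas-tu": 0.4},
    ("thank", "you"): {"merci": 0.95, "je vous remercie": 0.05},
}

def translate(sentence, table=PHRASE_TABLE, max_len=3):
    """Greedy left-to-right decoding: longest phrase match first,
    then the highest-probability translation of that phrase."""
    words = sentence.lower().split()
    output, i = [], 0
    while i < len(words):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(words) - i), 0, -1):
            phrase = tuple(words[i:i + n])
            if phrase in table:
                candidates = table[phrase]
                output.append(max(candidates, key=candidates.get))
                i += n
                break
        else:
            # No phrase matched: pass the unknown word through unchanged.
            output.append(words[i])
            i += 1
    return " ".join(output)

print(translate("Good morning how are you"))  # prints "bonjour comment allez-vous"
```

The point of the sketch is only that all the language-specific knowledge lives in the table, while the decoder itself is a small, generic program.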

In a concrete translation system, we can simply assume that a language model table is available for each pair of languages of interest, so that a stand-alone system consists of: (a) a number of language model files, and (b) a decoder. The sophistication and size of the language model tables then determine the size of the stand-alone translator. If the accuracy of translation is not the major factor, then these tables can be relatively small and can be distributed with the servers of the system for local translations that do not require the use of network services such as Google Translate and Yahoo Babelfish.
