
Archive for January, 2011

Although this program is conceptually fairly trivial, there is a delicate issue to keep in mind: Perl's regular expressions both affect and are affected by the string encoding.  One must therefore decode the Arabic text, which Google returns in ISO-8859-6, apply the Perl regular expressions to the decoded string, and then encode the matches back to ISO-8859-6 in the output.  The rest of the code is straightforward.  It takes English text from standard input and produces ISO-8859-6 encoded Arabic translation on standard output.


#!/usr/bin/perl

use strict;
use warnings;
use WWW::Mechanize;
use Encode qw(encode decode);

# Fetch the Google Translate page.
my $s = WWW::Mechanize->new( agent => 'Linux Mozilla' );
$s->get( 'http://translate.google.com/' );

# Read the English text to translate from standard input.
my $text;
get_text(\$text);

# Submit the translation form: source language English, target language Arabic.
$s->submit_form(
    form_number => 1,
    fields      => { sl => 'en', tl => 'ar', text => $text },
);
die "$!" unless $s->success;

my $res = $s->response;
my $t   = $res->content;

# Decode the response so the regular expression operates on characters,
# not on raw ISO-8859-6 bytes.
my $m = decode( 'iso-8859-6', $t );

# Pull the translated fragments out of the result markup.
my @arabic = ( $m =~ /fff'">([^<]+)?</g );

# Encode each fragment back to ISO-8859-6 before printing.
foreach my $a (@arabic) {
    $a = encode( 'iso-8859-6', $a );
    print "$a\n";
}

# Slurp standard input (or file arguments) into the referenced scalar.
sub get_text {
    my $text = shift;
    while (<>) {
        chomp;
        $$text .= $_;
    }
}


Machine translation is a complex subject with an enormous amount of active research.  This is an area in which I have no formal training, but I need a lightweight stand-alone translator that does a reasonable job for ordinary chat conversation.  A list of open source tools can be found here.  From my point of view, I simply want a neat package that translates as a black box, but that seems to be impossible with the open source libraries, so I need to gain some familiarity with the basic ideas in the area and with how these tools work.

The foundational text for statistical natural language processing is Christopher Manning and Hinrich Schütze's Foundations of Statistical Natural Language Processing.  A modern open source system that implements statistical machine translation algorithms is Moses.

The basic idea is to try the simplest possible thing that uses tools from a system such as Moses: put together a corpus of English text that represents the minimal topics of chat conversation, translate the corpus into different languages using a good heavyweight translator such as Google Translate, then use the output to train the machine learning algorithms in Moses, and finally use the trained models as standalone translators.  A rough sketch of this pipeline follows.
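The fragment below is only a sketch of that pipeline: it builds a line-aligned file pair for Moses training by running each English sentence through a translate_line() helper.  The helper, and the file names corpus.en, train.en and train.ar, are illustrative placeholders; translate_line() stands in for whatever heavyweight service is used, for example the WWW::Mechanize scraper from the earlier post or WWW::Babelfish.

#!/usr/bin/perl
# Sketch: build a line-aligned corpus pair for Moses training.
# translate_line() is a placeholder for a call to a heavyweight translator.
use strict;
use warnings;

sub translate_line {
    my ($english) = @_;
    # Placeholder: call Google Translate, Babelfish, etc. here.
    return "TRANSLATION OF: $english";
}

open my $in,  '<', 'corpus.en' or die "corpus.en: $!";
open my $src, '>', 'train.en'  or die "train.en: $!";
open my $tgt, '>', 'train.ar'  or die "train.ar: $!";

while (my $line = <$in>) {
    chomp $line;
    next unless $line =~ /\S/;                   # skip blank lines
    print {$src} "$line\n";                      # source side, one sentence per line
    print {$tgt} translate_line($line), "\n";    # target side, kept line-aligned
}
close $_ for $in, $src, $tgt;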

Clearly this approach is not going to produce very good translations, but it is guaranteed to produce lightweight translators that can be embedded in server code and distributed across a network.  Since the central problem we want to resolve is the viability of a system capable of providing one-to-one simultaneous translate-chat for 6.7 billion people, the quality of the translations is less important than removing the bottleneck of network communication with Google, Yahoo, or other translation servers.

Once we have a proof of concept that a lightweight translator can, in principle, still allow simultaneous chat among 6.7 billion people without insurmountable network overhead or unreasonable delays, we can return to the problem of translation quality.

Basic ideas of the statistical approach to machine translation

The idea of a statistical approach to machine translation seems to have originated at the IBM Thomas J. Watson Research Center around 1990 with Peter Brown et al., whose paper can be found here.  An n-gram is an ordered n-tuple of words, and an n-gram model of a language is essentially the conditional probability distribution of a word given the n-1 words that precede it.  The power of this fairly simple model was demonstrated by a bag-reconstruction test: according to the 1990 paper of Brown et al., a 3-gram model of English recovered the meaning of 84% of a sample of 38 short sentences, each with fewer than 11 words.
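To make the n-gram idea concrete, here is a minimal trigram sketch in Perl: it estimates P(w3 | w1 w2) by maximum likelihood from whatever text arrives on standard input.  A real system would add smoothing for unseen n-grams; this only illustrates what the probability table contains.

#!/usr/bin/perl
# Minimal trigram model sketch: estimate P(w3 | w1 w2) by maximum likelihood.
use strict;
use warnings;

my (%tri, %bi);
while (my $line = <STDIN>) {
    my @w = split ' ', lc $line;
    for my $i (0 .. $#w - 2) {
        $tri{"$w[$i] $w[$i+1] $w[$i+2]"}++;   # count of (w1, w2, w3)
        $bi{"$w[$i] $w[$i+1]"}++;             # count of the (w1, w2) history
    }
}

# P(w3 | w1 w2) = count(w1 w2 w3) / count(w1 w2); no smoothing here.
for my $gram (sort keys %tri) {
    my ($w1, $w2, $w3) = split ' ', $gram;
    printf "P(%s | %s %s) = %.4f\n",
        $w3, $w1, $w2, $tri{$gram} / $bi{"$w1 $w2"};
}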

Choice of a corpus

The basic division of a machine translation system like Moses is between programs that produce a so-called language model and programs that use that model to decode, that is, to translate from one language into another.  Concretely, such a model for a given pair of languages is simply a large probability table of phrases and their translations.  The decoder takes sentences in the source language, consults the probability table to decide among candidate translations, and produces output.  The bulk of the hard work is done in producing the model tables, using tools from statistical analysis such as hidden Markov models.

In a concrete translation system we can simply assume that, for each pair of languages of interest, a model table is available, so that a stand-alone system consists of (a) a number of model files and (b) a decoder.  The sophistication and size of the model tables then determine the size of the stand-alone translator.  If accuracy of translation is not the major factor, these tables can be relatively small and can be distributed with the servers of the system, for local translations that do not require network services such as Google Translate or Yahoo Babelfish.  A toy illustration of this split follows.
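As a deliberately naive sketch of that split, the fragment below reduces the "model files" to a hand-written hash mapping source phrases to (translation, probability) pairs, and the "decoder" greedily picks the most probable entry for each phrase it recognizes.  The table entries (including the transliterated Arabic) and the greedy strategy are purely illustrative; a real Moses model is vastly larger and its search far more sophisticated.

#!/usr/bin/perl
# Toy "phrase table + decoder" sketch. The table here is hand-written and tiny;
# a real system would load it from the model files produced by training.
use strict;
use warnings;

# phrase => [ [translation, probability], ... ]   (illustrative entries only)
my %phrase_table = (
    'good morning' => [ [ 'sabah al-khayr', 0.7 ], [ 'marhaban', 0.3 ] ],
    'how are you'  => [ [ 'kayfa haluka',   0.9 ] ],
    'thank you'    => [ [ 'shukran',        0.95 ] ],
);

sub decode_phrase {
    my ($phrase) = @_;
    my $options = $phrase_table{ lc $phrase } or return "[?$phrase?]";
    # Greedy choice: take the highest-probability translation.
    my ($best) = sort { $b->[1] <=> $a->[1] } @$options;
    return $best->[0];
}

print decode_phrase('good morning'), "\n";   # sabah al-khayr
print decode_phrase('see you later'), "\n";  # [?see you later?]  (not in table)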


I am interested in developing a network application that allows simultaneous one-to-one translate-chat among 6.7 billion people.  It is extremely easy to write language translation programs using heavyweight services like Yahoo's Babelfish or Google Translate; one could use the CPAN module WWW::Babelfish to write a little script that does translations for a wide range of language pairs.  But these heavyweight services are not viable for simultaneous one-to-one chat among 6.7 billion people, because every translation contacts specific Google or Yahoo servers, which is not a good way to build a distributed service of the type we require.

An independent and lightweight translation service is required, one that could perhaps even be local to the servers of our application.  Two observations are important here:

  • We are not attempting to translate sophisticated tracts but chat between people, for which the vocabulary is much smaller than a Shakespearean tome.  Thus a small corpus translator is sufficient for our purposes.
  • The viability of efficient traffic of communication is much more important than sophistication of translation for our purposes.  For a Chinese speaker who knows no English, any translation is infinitely better than none.

I have chosen to proceed with the open source lightweight solution provided by Apertium, precisely because it is simple and usable in our own server code:  http://en.wikipedia.org/wiki/Apertium.  There are other candidates, but services like Google Translate and Babelfish are not real options for us.



First, 8 billion is on the order of 2 raised to the 33rd power, so a pair of people can be represented by a 66-bit vector (two 33-bit identifiers).  A normal DSL connection with a bandwidth of 1.544 megabits per second can therefore carry a maximum of around 23,000 such 66-bit elements per second.
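A quick back-of-the-envelope check of these figures (the constants below are simply the numbers quoted above):

#!/usr/bin/perl
# Back-of-the-envelope check of the figures in the paragraph above.
use strict;
use warnings;

my $people  = 8e9;                      # roughly 8 billion people
my $id_bits = log($people) / log(2);    # bits needed to index one person
printf "bits per person id: %.1f (so 33 bits each, 66 bits per pair)\n", $id_bits;

my $dsl_bps   = 1_544_000;              # 1.544 megabits per second
my $pair_bits = 66;                     # two 33-bit identifiers
printf "66-bit pairs per second over DSL: %d\n", int($dsl_bps / $pair_bits);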

The prototype of such a communication network is the IRC network, whose model is a tree of servers and clients.  A network application designed to allow one-to-one translate-chat among 6.7 billion people could be designed similarly, and it is reasonable to assume that all the servers are connected by, at worst, physical DSL-speed lines.

If the above holds, then so long as we have a viable method of transferring around 10,000 key-value pairs per second between connected servers, and the network is sufficiently sparse that no single server serves more than 10,000 clients, the following scheme could support simultaneous one-to-one chats among 6.7 billion people:

  • Each server maintains a red-black tree representing a semi-synchronized table of who is chatting with whom.
  • Periodically, each server sends messages to adjacent servers about bulk changes in conversation partners, and accepts the same from them.
  • For a fixed state of the who-is-chatting-with-whom table, the server blocks message paths not in its table and lets those in its table through.

The third step allows the message traffic to remain viable even for simultaneous chat among 6.7 billion people.  The first two form the foundation layer of such a communication system, because they provide a way of synchronizing the network's knowledge of who is talking to whom.

Such a synchronized state variable is implementable if each server holds a red-black tree for the table of who is talking to whom and can efficiently transfer tables of changes to other servers.  I have tested the speed of transfer and reconstruction of a red-black tree in Perl using three pieces: the POE components PoCo::Server::TCP and PoCo::Client::TCP, and a mild extension of Tree::RedBlack that adds methods for reading and writing the tree from strings, using the compression provided by the Compress::Zlib functions 'compress' and 'uncompress'.  Even without fine-tuned timing, it seems fairly clear that this scheme has no difficulty handling trees of around a thousand key-value pairs per second, and the performance could be improved to the order of 10,000 without too much effort.  A sketch of the string round trip is below.
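The heart of that test, the compressed string round trip, can be sketched roughly as follows.  The POE transport is omitted, and serialize_tree()/deserialize_tree() are my own illustrative helpers, not methods Tree::RedBlack provides; the sketch assumes the module's usual accessors (root, left, right, key, val, insert, find) and the Compress::Zlib compress/uncompress functions mentioned above.

#!/usr/bin/perl
# Sketch of the compressed round trip for the who-is-chatting-with-whom table.
# Only the serialize/compress/rebuild steps are shown; the POE plumbing is omitted.
use strict;
use warnings;
use Tree::RedBlack;
use Compress::Zlib;

# Flatten the tree into "key\tvalue" lines via an in-order walk of the nodes.
sub serialize_tree {
    my ($tree) = @_;
    my @pairs;
    my $walk;
    $walk = sub {
        my ($node) = @_;
        return unless $node;
        $walk->($node->left);
        push @pairs, join("\t", $node->key, $node->val);
        $walk->($node->right);
    };
    $walk->($tree->root);
    return join("\n", @pairs);
}

# Rebuild a tree from the flattened string.
sub deserialize_tree {
    my ($string) = @_;
    my $tree = Tree::RedBlack->new;
    for my $line (split /\n/, $string) {
        my ($key, $val) = split /\t/, $line;
        $tree->insert($key, $val);
    }
    return $tree;
}

# Example: a small table of chat partners.
my $table = Tree::RedBlack->new;
$table->insert("user$_", "partner$_") for 1 .. 1000;

my $compressed = compress(serialize_tree($table));    # what a server would send
my $copy       = deserialize_tree(uncompress($compressed));

printf "compressed size: %d bytes, lookup check: %s\n",
    length($compressed), $copy->find('user42');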



I have some nostalgia for my first job. I worked on the tenth or eleventh floor of WFC 3, on the Lehman trading floor, in the Fixed Income Research group under Andy Morton, who had just come out of Cornell as the youngest of the three behind the Heath-Jarrow-Morton derivative pricing model. I had a Sun SPARCstation and got more interested in coding than in derivative pricing models. For a decade I thought that was a huge error, but after 2008 I decided I had done the right thing by not investing much intellectual capital in the actual quantitative finance. I was probably the least successful member of the group at the time. Andy Morton went on to the top of the firm, retiring a bit before Lehman disappeared in 2008.
While the focus of the group was the maintenance of the HJM model for pricing and hedging fixed income derivatives, I ended up spending much more time learning C++ than studying the quantitative finance. I remember reading Stroustrup's history of C++ late into the evenings after I first learned the language in 1995. At the time the STL had just come out, and it was even sold commercially by some companies.
I was still not on top of option pricing models years later, when I was working at Gresham Investment Management.  Black-Scholes pricing had been the bread and butter of option pricing for decades, but I had serious doubts about the entire enterprise for reasons I could not put my finger on precisely, even though I knew the standard accounts of the limitations of option pricing theories. I tried to digest Taleb's intriguing ideas on option pricing without enough practice to absorb them properly, and I spent quite a bit of effort trying to understand what people do about actual market return distributions, which have both much heavier tails than Gaussians and a much greater likelihood of "black swan" crashes.
Currently my intuitive answer, which I trust much more than the option pricing theories, is this: if one looks at the relative inequality in the size of the players, there is no question that a small group of large players can move the markets at will, and all the option pricing models are precisely incorrect, in the sense that they are guaranteed to generate prices that reinforce the advantages of the large players and put the retail players at risk.


Names are there, Nature’s sacred watchwords, they
Were borne aloft in bright emblazonry;
The nations thronged around, and cried aloud,
As with one voice, Truth, liberty, and love!
Suddenly fierce confusion fell from heaven
Among them: there was strife, deceit, and fear:
Tyrants rushed in, and did divide the spoil.
This was the shadow of the truth I saw.

'My feet are at Moorgate, and my heart
Under my feet. After the event
He wept. He promised “a new start”.
I made no comment. What should I resent?’
‘On Margate Sands.
I can connect
Nothing with nothing.
The broken fingernails of dirty hands.
My people humble people who expect
Nothing.’
I took your urgent whisper,
Stole the arc of a white wing,
Rode like foam on the river of pity,
Turned its tide to strength,
Healed the hole that ripped in living.
… because we separate like ripples on a blank shore.
My story in the compositions of Shelley, Eliot, Suzanne Vega, and Thom Yorke.


In 1929 Edwin Hubble found a remarkable linear relation between the distances of galaxies from Earth and their redshifts.  Later work with larger datasets confirmed that this linear relation is not an artifact of the data but describes something real about Nature.  The standard interpretation of the linear relationship between redshift and distance has been based on the Doppler effect.  The high-school example of the Doppler effect is the train whistle: compared with a train standing still, the pitch of the whistle is higher when the train is moving toward you and lower when it is moving away.  Similarly, when the light emitted by a distant galaxy is compared with the signal we receive, the redshift is a lowering of frequency, and the Doppler interpretation is that the galaxy is moving away from us.

But an entirely different explanation of the redshift is plausible.  The space between distant galaxies and ourselves is not empty but filled with the cosmic background radiation, which is in thermal equilibrium and is observed to be accurately described by a Planck spectrum with a temperature of about 2.72 kelvin.  The signal from a distant galaxy travels through this soup of radiation, and the linear relation between distance and redshift might indicate that the cosmic background radiation acts as a convolution filter that reddens the signal as a linear function of the time spent passing through it.

This plausible reinterpretation of the redshift would support a perfectly static universe, with the relative positions of galaxies constant with respect to each other.  It would also remove the fundamental basis for big bang and expansionary cosmologies, and lend plausibility to a compact stationary model of the universe such as the S4 model.

So why do I call this the greatest scientific error of the twentieth century?  The finite-time models of the universe, which are the standard, are essentially scientific creation myths, according to Hannes Alfvén, and I tend to agree.  Phenomena supporting quantum mechanics and general relativity are resolved quite readily in a stationary S4 model of the universe, which simultaneously simplifies the description of space-time, whereas expansionary models require complex and non-intuitive concepts such as an "intrinsic expansion of the fabric of spacetime" that exist only because the redshift of galaxies is interpreted as a Doppler effect.  In that sense, this misinterpretation has been "holding up" unification.

