Could Google Translate maintain a central codex “language”, thereby bypassing the artifacts that come from the English-as-central-language issue?

Google Translate, like many machine translation projects, does not maintain n² language pairs as it adds languages to its bank; it appears to maintain just n mappings to and from English, so that a translation from, say, Greek to Persian pretty clearly goes via English as an interlanguage. Maintaining every pair directly would be a clear scalability problem, given the number of languages that Google supports.
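To make the arithmetic concrete: with n languages, direct translation needs a model for every ordered pair, while pivoting through English needs only two models per language. Here is a minimal Python sketch of the idea; the `translate` parameter stands in for a hypothetical direct-pair engine, and none of this is Google's actual API.

```python
def direct_models(n: int) -> int:
    # One model per ordered language pair: n * (n - 1), i.e. O(n^2).
    return n * (n - 1)

def pivot_models(n: int) -> int:
    # One model into English and one out of it per other language: O(n).
    return 2 * (n - 1)

def pivot_translate(text: str, src: str, tgt: str, translate) -> str:
    """Translate src -> tgt by pivoting through English.
    `translate(text, src, tgt)` is a hypothetical direct-pair engine."""
    if src == "en" or tgt == "en":
        return translate(text, src, tgt)
    english = translate(text, src, "en")   # e.g. Greek -> English
    return translate(english, "en", tgt)   # then English -> Persian

for n in (10, 50, 100):
    print(f"{n} languages: {direct_models(n)} direct models, "
          f"{pivot_models(n)} pivot models")
```

At 100 languages that is 9,900 direct models against 198 pivot models, which is the whole case for a central language in one line of arithmetic.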

Is there a better interlanguage than English? Maybe, if you’ve got the resources to handcraft one. Esperantists will be familiar with the Distributed Language Translation project of the 1980s and 1990s, which used Esperanto as an interlanguage for European Union translation. (An Esperanto with a fair few tweaks, and with rule-based translation.) Predictably, it ran out of funding in 1997.

And if you’re using statistical methods rather than handcrafted rules (which has been the mainstream in machine translation for a long time now), then the interlanguage itself has to be a human language, one for which you can get a corpus big enough to do statistics on in the first place. That means, unfortunately, that English as an interlanguage for machine translation between a large number of language pairs really is as good as you’re going to get.
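Why the corpus size matters: statistical translation models are, at bottom, relative-frequency estimates over aligned data, and those estimates fall apart when counts are sparse. A toy sketch of the core move of phrase-based statistical MT, not any production system:

```python
from collections import Counter

def phrase_table(aligned_pairs):
    """Estimate P(target phrase | source phrase) by relative frequency.
    With a small corpus the counts are sparse and the probabilities
    unreliable; hence the need for a pivot with a huge corpus."""
    joint = Counter(aligned_pairs)
    source_totals = Counter(src for src, _ in aligned_pairs)
    return {(src, tgt): count / source_totals[src]
            for (src, tgt), count in joint.items()}

# Tiny illustrative "corpus" of aligned Greek-English phrase pairs.
pairs = [("καλημέρα", "good morning"), ("καλημέρα", "good morning"),
         ("καλημέρα", "good day"), ("ευχαριστώ", "thank you")]
print(phrase_table(pairs))
```

A handcrafted interlanguage like DLT’s Esperanto has no such corpus, which is exactly why the statistical turn stranded it.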

What you’d hope is that other language pairs, not involving English, get their own statistical training; for all I know, that is happening. But that will still have to be prioritised by demand: Japanese–Chinese or French–German is more likely to be realised than Greek–Persian.
