Wednesday, February 29, 2012

Microsoft Translator Hub, where automatic translation helps endangered languages

Beyond finely tuned marketing slogans (“Where language meets the world”, “Bridging languages, cultures and technology”) and laudable intentions (“Microsoft Translator Hub is helping smaller languages thrive by putting the power to build machine translation systems in the hands of local communities”), what could the goals of Microsoft and its new Translator Hub be?

Launched yesterday, the Microsoft Translator Hub is a service that enables anybody (individuals, local communities, companies) to build, train and deploy customized automatic translation systems. Its headline feature is the number of supported target languages: 1,462, an impressive figure for this kind of tool, and one that lets it be presented as a tool dedicated to less widely used languages. On the face of it, a laudable intention.

Things look less rosy once the model training phase begins, as you have to feed the system with aligned segment files, which map a text in the source language to the same text in the target language. Rights are attached to these files, and you lose them right away (“By uploading my documents, I confirm that the content I submit does not infringe the copyrights, publicity rights, privacy rights or other intellectual property rights of others. I have the sufficient rights in the content to grant Microsoft the license provided in the Terms of Use.”; the Terms of Use are quite explicit on this point).
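To make the idea of aligned segments concrete, here is a minimal sketch in Python. It is not the Translator Hub's actual file format or API, which this post does not describe; the function name `align_segments` and the sample sentences are purely illustrative assumptions. It simply pairs segment number i of a source text with segment number i of its translation, which is the kind of correspondence a training corpus has to provide.

```python
from typing import List, Tuple


def align_segments(source_lines: List[str], target_lines: List[str]) -> List[Tuple[str, str]]:
    """Pair segment i of the source text with segment i of the target text.

    Hypothetical helper for illustration only: it assumes both texts were
    already split into the same number of segments, in the same order.
    """
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target must have the same number of segments")

    pairs = []
    for src, tgt in zip(source_lines, target_lines):
        src, tgt = src.strip(), tgt.strip()
        # Skip pairs where either side is empty: they carry no training signal.
        if src and tgt:
            pairs.append((src, tgt))
    return pairs


# Illustrative data, not taken from any real corpus.
source = ["Hello.", "How are you?"]
target = ["Bonjour.", "Comment allez-vous ?"]
pairs = align_segments(source, target)
```

Each resulting pair is one aligned segment. A useful corpus needs a great many of them, which is precisely what is scarce for under-represented languages.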

We can see that, under the guise of a service dedicated to more or less rare languages (as relayed here and there), Microsoft will collect enough data to train its own automatic translation algorithms at low cost, while detecting which languages are most in demand. For therein lies a major problem for rare, minority or endangered languages: their under-representation on the web (and therefore the very limited amount of data available to build automatically aligned corpora).

We can also see that advertising a sizable list of potentially supported languages (by the way, why limit them to 1,462?) creates a publicity effect in itself. Without it, the Translator Hub would have been just another new tool.

Does this mean you should dismiss this tool?

If you wish to develop, inexpensively, an automatic translation tool for an under-represented language for which you have enough original and translated texts, Microsoft's offer can be of interest, since you may in this way help the future development of translation tools for that language. But if you want to keep the rights to your corpus, you will have to develop your own tools or use commercial ones.

“Imagine a web of hundreds of thousands of automatic translators trained not only for a few languages and industry sectors but tuned to a myriad of language pairs, many sub-domains and customized for every company and offering. [...] [T]his web will need to be fed with endless streams of translated words.” (source: Who gets paid for translation in 2020)

That future is already here: the machine has to be fed, and cheaply. Web data mining, book digitization, aligned corpora offered by the community… Every avenue has to be exploited.


Microsoft Translator Hub, ou la traduction automatique au service des langues en danger (in French)
Microsoft Translator Hub, ou a tradução automática ao serviço das línguas em perigo (in Portuguese)
Microsoft Translator Hub, o la traducción automática al servicio de las lenguas en peligro (in Spanish)
