Using existing human translations to aid machine translation to a new language - localization

In the past, my company has used human, professional translators to translate our software from English into some 13 languages. It's expensive but the quality is high.
The application we're translating contains industry jargon. It also contains a lot of sentence fragments and single words which, out of context, are unlikely to be correctly translated.
I am wondering if there is a machine translation system or service that could use our existing professionally-generated translations to more accurately create a machine translation into any new language.
If an industry term, phrase or sentence fragment has been translated from en-US to es-AR, pt-BR, cs-CZ, etc., then couldn't those prior translations be used as hints about the correct word choice in a new language? They could be used, in a sense, to triangulate. At worst, they could feed a majority-voting system (e.g. if machine-translating from 9 of the 13 languages produces the same phrase in the new language, we go with it).
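The majority-vote idea could be sketched like this. Note this is a toy illustration, not an existing service: the locale keys and the word "pasador" are placeholders, and a real system would vote on candidate translations produced from each source language.

```python
from collections import Counter

def vote_on_translation(candidates, threshold=0.5):
    """Pick a translation by majority vote across candidates.

    `candidates` maps a source locale (e.g. "es-AR") to the candidate
    translation it produced for the new target language. Returns the
    winning string if it gets more than `threshold` of the votes,
    otherwise None (meaning: send it to a human).
    """
    if not candidates:
        return None
    counts = Counter(candidates.values())
    best, votes = counts.most_common(1)[0]
    if votes / len(candidates) > threshold:
        return best
    return None

# 9 of 13 source locales agree on the same rendering in the new language:
candidates = {f"lang{i}": "pasador" for i in range(9)}
candidates.update({f"other{i}": f"variant{i}" for i in range(4)})
print(vote_on_translation(candidates))  # -> pasador
```

Disagreements that fall below the threshold are exactly the strings worth routing to a professional translator.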
Is anyone aware of a machine translation service that works like this?

I don't know much about translation systems, but such functionality -- custom translations for specific words -- is offered by most commercial systems, I would guess. It is even possible with Google Translate, by clicking on the book with the star on its cover.
As a trivial, non-invasive method, you could build, for each target language, a dictionary of the required terminology in the form [word-as-translated, word-as-it-should-be-translated], with an N:1 relationship (in any one language, multiple word-as-translated entries may map to a single word-as-it-should-be-translated; the word-as-translated entries therefore depend on the actual translation system).
After preparing those dictionaries, you can simply search the translation output for those words and replace them with the desired terms.
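A sketch of that post-editing pass, with an invented glossary (the Spanish terms are placeholders, not real terminology):

```python
def postedit(text, glossary):
    """Replace raw MT output terms with the required terminology.

    `glossary` is an N:1 mapping: several word-as-translated keys may
    point to the same word-as-it-should-be-translated value. Longer
    keys are replaced first so multi-word terms win over substrings.
    """
    for wrong in sorted(glossary, key=len, reverse=True):
        text = text.replace(wrong, glossary[wrong])
    return text

glossary = {"sujetador": "pasador", "fijador": "pasador"}  # hypothetical terms
print(postedit("Inserte el sujetador", glossary))  # -> Inserte el pasador
```

A production version would match on word boundaries (e.g. with regular expressions) so that a glossary key can't fire inside a longer unrelated word.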

Is there a common file that contains the localized text of each country?
The content would be the words frequently used by applications, such as [submit], [cancel], and so on.
Just using this file, we could create applications in various languages without having to translate those strings ourselves.
You are looking for Translation Memory (TM). TM is a database of the source strings and their corresponding translations into different languages that can speed up the translation of the same or similar strings in your projects.
The thing is that the file you're looking for probably doesn't exist, because such a file would be huge and not very usable.
Translation Memory is one of the core concepts in modern CAT (Computer-Assisted Translation) tools.
There are a bunch of offers on the CAT Tools market that provide the Global Translation Memory feature. Global TM is a huge database of billions of previously made translations in different projects for different language pairs.
For example, Crowdin is a popular Localization Management platform that has a Global TM with billions of previously translated texts and allows users to use this TM for translating their strings. Furthermore, the process could be totally automated.
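The core lookup a TM performs can be sketched in a few lines. This toy version uses Python's standard difflib for fuzzy matching and made-up strings; real CAT tools do the same exact-then-fuzzy lookup at a much larger scale:

```python
import difflib

class TranslationMemory:
    """Minimal TM: exact and fuzzy lookup of previously made translations."""

    def __init__(self):
        self.entries = {}  # source string -> translation

    def add(self, source, translation):
        self.entries[source] = translation

    def lookup(self, source, cutoff=0.8):
        """Return (translation, similarity); (None, 0.0) if no match."""
        if source in self.entries:
            return self.entries[source], 1.0
        matches = difflib.get_close_matches(source, self.entries, n=1, cutoff=cutoff)
        if matches:
            ratio = difflib.SequenceMatcher(None, source, matches[0]).ratio()
            return self.entries[matches[0]], ratio
        return None, 0.0

tm = TranslationMemory()
tm.add("Submit", "Enviar")
tm.add("Cancel the order", "Cancelar el pedido")
print(tm.lookup("Submit"))             # exact hit: ('Enviar', 1.0)
print(tm.lookup("Cancel this order"))  # fuzzy hit, similarity < 1.0
```

In real tools, fuzzy hits below 100% are surfaced to a translator for review rather than applied blindly.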

How to extract entities from html using natural language processing or other technique

I am trying to parse entities from web pages that contain a time, a place, and a name. I read a little about natural language processing, and entity extraction, but I am not sure if I am heading down the wrong path, so I am asking here.
I haven't started implementing anything yet, so if certain open source libraries are only suitable for a specific language, that is ok.
A lot of times the data would not be found in sentences, but instead in html structures like lists (e.g. 2013-02-01 - Name of Event - Arena Name).
The structure of the webpages will be vastly different (some might use lists, some might put them in a table, etc.).
What topics can I research to learn more about how to achieve this?
Are there any open source libraries that take into account the structure of html when doing entity extraction?
Would extracting these (name, time, place) entities from html be better (or even possible) with machine vision where the CSS styling might make it easier to differentiate important parts (name, time, location) of the unstructured text?
Any guidance on topics/open source projects that I can research would help I think.
Many programming languages have libraries that generate canonical date-stamps from various formats (e.g. in Java, using SimpleDateFormat). As you say, the structure of the web pages will be vastly different, but dates are expressed in only a small number of variations, so writing down the regular expressions for a few (say, half a dozen) formats will enable extraction of dates from most, if not all, HTML pages.
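The regex-per-format idea can be sketched like this; the handful of patterns below is an illustrative sample, not an exhaustive list:

```python
import re
from datetime import datetime

# Each pair is (regex, strptime format). Add more pairs to cover more formats.
DATE_PATTERNS = [
    (r"\b(\d{4}-\d{2}-\d{2})\b", "%Y-%m-%d"),            # 2013-02-01
    (r"\b(\d{2}/\d{2}/\d{4})\b", "%m/%d/%Y"),            # 02/01/2013
    (r"\b([A-Z][a-z]+ \d{1,2}, \d{4})\b", "%B %d, %Y"),  # February 1, 2013
]

def extract_dates(text):
    """Return canonical datetime objects for every recognized date in text."""
    found = []
    for pattern, fmt in DATE_PATTERNS:
        for match in re.findall(pattern, text):
            try:
                found.append(datetime.strptime(match, fmt))
            except ValueError:
                pass  # matched the shape but is not a valid date, e.g. 13/45/2013
    return found

html = "<li>2013-02-01 - Name of Event - Arena Name</li>"
print(extract_dates(html))  # -> [datetime.datetime(2013, 2, 1, 0, 0)]
```

Note the try/except: a string can match the regex shape while still being an impossible date, so the parse acts as validation.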
Extraction of places and names is harder, however. This is where natural language processing will have to come in. What you are looking for is a Named Entity Recognition (NER) system. One of the best open source NER systems is the Stanford NER. Before using it, you should check out their online demo. The demo has three classifiers (for English) that you can choose from. For most of my tasks, I find their english.all.3class.distsim classifier to be quite accurate.
Note that an NER performs well when the places and names you extract are occurring in sentences. If they are going to occur in HTML labels, this approach is probably not going to be very helpful.

How to gauge or compare relative frequency of arbitrary words without a search engine API?

More than a few times I've wanted to programmatically pick the better of two words or phrases, using frequency of use on the Internet as a heuristic.
The obvious way, and the way to do it manually, is to enter each term into a search engine and note how many "hits" each gets.
But the big search engines have deprecated their search APIs, or limit you to 100 queries per day free of charge even with an API key. That's not great if you're working on a free project. The big search engines also have a "no scraping" clause in their terms of service.
I need it to work for arbitrary, perhaps even unidentified languages, and from a device with limited storage. This rules out having a local corpus or database.
One area of application is tools for Wiktionary editors, helping them choose the main spelling among several variants even if they don't know the language. The one I have in mind right now is using frequency as a heuristic to help choose the best conversion between a spelling in a foreign script and a lossy transliteration in the Latin alphabet.

Converting free form english text to spanish, what are the options?

I have an application that will be used by Spanish-speaking people as well as English-speaking people. I am using .resx files and localization to translate all the hard-coded text. I am also retrieving language-specific data from the database for some things that don't change often, like "Category Descriptions".
Here is my question (I think I already know the answer): is there a way to translate free-form text entered by a user? For example, can a string entered and saved to the database in English be displayed in Spanish? A further issue is that these strings often contain engineering terms and technical abbreviations that I don't think could be translated with something like Google Translate. Is there anything else out there? I am thinking that this text can only be translated by a human with knowledge of the terminology and abbreviations used in this particular industry.
There are some online services, such as Google Translate, as pointed out by Binary Worrier. However, one should bear in mind that none of these services gives accurate translations, because, as you wrote, translation is a very difficult matter. Current obstacles to good automated translation include, as you wrote, lack of context.
This is a problem even for human translators. Ask a translator to render a given sentence in another language. She'll answer: "OK, what do you mean by this word: X or Y? In which context? Who are you talking to? Is this a formal or informal tone?" etc.
This is especially true regarding localization where texts are usually very short. This increases the lack of context. Think of a simple menu item: "Load". Is it a name? Is it a verb? Damn, even a human translator needs more information. So don't expect a computer to solve the problem.
Of course, it all depends on the accuracy that you need and your users' tolerance for bad translations. Google Translate et al. are very successful because people prefer a bad translation to nothing at all.
If I were you, I'd make a few manual tests with typical texts in your DBs and see if the translation accuracy fits your needs.
BTW, I believe Google Translate is free for a reasonable amount of use. Basically, unless you want to translate the whole of Wikipedia every week, you should be on the safe side ;-)
You can hook into the Google Translate API and translate this stuff on the fly; I think there's a charge, though.
I have an answer from my users: have them enter the strings in both English and Spanish and store both in the database, then display the correct string based on the language of the browser. I still have a lot of grunt work to do filling out the .resx files and handling all the words I need translated.
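That store-both-languages approach amounts to a per-record lookup with an English fallback. A minimal sketch, with hypothetical records and field names (the real application would of course read these from its database):

```python
# Each user-entered string is stored per record id in both languages;
# fall back to English when a translation hasn't been entered yet.
STRINGS = {
    42: {"en": "High-pressure valve", "es": "Válvula de alta presión"},
    43: {"en": "Torque wrench"},  # Spanish not entered yet
}

def display(record_id, browser_lang):
    """Return the string in the browser's language, defaulting to English."""
    record = STRINGS[record_id]
    return record.get(browser_lang, record["en"])

print(display(42, "es"))  # -> Válvula de alta presión
print(display(43, "es"))  # falls back -> Torque wrench
```

The fallback matters in practice: users rarely fill in both languages for every record at once.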

Designing a Non-Specific Language Application, e.g. planning for localization

Made this community wiki :3
I'm developing a basic RPG, and one of my goals from the beginning is to make sure that my program is language non-specific. Basically, before I design or start programming any menus, I want to make sure that I can load and display them from files for any supported language, so I am not hard-coding in values.
(It would save me from many migraines down the road.)
For this example, let's use Western Left-to-Right languages. English, Spanish, German, French, Italian.
This is a basic example of what I have.
One XML file contains a mapping and design of a conversation.
<conversation>
<dialog>line1</dialog>
<dialog>line2</dialog>
</conversation>
Other XML files contains the definitions.
<mappings language="English">
<line1>This is line 1 in English!</line1>
<line2>Other lines are contained in language-separated xml files</line2>
</mappings>
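Loading a mappings file like the one above is straightforward; here is a minimal sketch using Python's standard XML parser (the element names match the XML above, and the approach carries over to whatever language the game is written in):

```python
import xml.etree.ElementTree as ET

MAPPINGS_XML = """
<mappings language="English">
<line1>This is line 1 in English!</line1>
<line2>Other lines are contained in language-separated xml files</line2>
</mappings>
"""

def load_mappings(xml_text):
    """Build a {line-id: translated-text} table from one language file."""
    root = ET.fromstring(xml_text)
    return root.get("language"), {child.tag: child.text for child in root}

language, table = load_mappings(MAPPINGS_XML)
print(language)        # -> English
print(table["line1"])  # -> This is line 1 in English!
```

At startup you'd pick the file for the configured language and build this table once; conversations then only ever reference line ids.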
Heh. This would work great, except that I forgot that English doesn't assign genders to its words, whereas other languages do. So, where one sentence might be enough in English, I might need two sentences in other languages: one to cover the masculine form and the other to cover the feminine form.
What would be the most conducive way of solving this problem? Right now, I've considered coming up with different mapping tables, one exclusively for masculine sentences and the other covering just feminine ones. Or just reading from different definition tables.
And another kicker is based within my game data design. I never thought about it, but I might need to store the sexes of my game items and characters so I can use the correct sentence. However, other languages might have their own specific quirks that I would need to consider as well (though thankfully, from what I know, Italian and Spanish are relatively similar, and possibly French as well).
So, obviously this is a huge task ahead of me. What other design considerations should I think of? Right now, I'm thinking a static class would be easiest: configure the selected language at startup, throw in inputs, and hopefully get a string back.
Any ideas (looking to throw ideas around :P)
There are two general ways to approach this: brute force and trying to be clever. Brute force means writing out each possible line and including it in your XML files. It's a lot of work, but it will work.
Trying to be clever gets into deep water, fairly fast, particularly if you're trying to cover a whole lot of languages.
You need to keep more information about characters than gender. In Russian, for example, there are different words meaning "you" depending on whether you're being informal or formal (or talking to multiple people), and the verb endings are also different. There are different translations of "please pass the bread" depending on the formality. In other languages, getting the translation right depends on social status.
There are issues, as pawel_dyda pointed out, with singular, plural, and possibly dual forms. Other languages also use different word orders: "The arrows are X coppers each, so to buy Y arrows you'll need Z silver" may require you to keep track of the order of the numbers.
Visual C++ and MFC come with internationalization facilities that are actually pretty good. You'd keep the strings in a resource file, and it's possible to substitute numbers and the like in while keeping the order correct for different languages.
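The same idea, substituting values while letting each translation control the order, works with named placeholders. A sketch in Python (the "xx" entry is an invented language, and the copper/silver prices are made-up numbers, just to show the reordering):

```python
# One message template per language; named placeholders let each
# translation put the numbers wherever its grammar needs them.
TEMPLATES = {
    "en": "The arrows are {price} coppers each, so {count} arrows cost {total} silver.",
    "xx": "{count} arrows: {price} coppers each, {total} silver in total.",
}

def render(lang, **values):
    """Format a message, letting the translation control argument order."""
    return TEMPLATES[lang].format(**values)

print(render("en", price=3, count=10, total=3))
print(render("xx", price=3, count=10, total=3))
```

The key point is that the code passes the same named values regardless of language; only the template knows the order, so translators can rearrange freely without code changes.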
Look up "internationalization" (often abbreviated to "i18n") on the web. There's plenty of stuff out there.
As for genders, you may try to encourage translators to use gender-neutral translations (which is usually possible in business applications but might be impossible here).
You may also encounter the problem elsewhere. Other (non-English) languages have multiple plural forms. For example: "Your team has acquired 2 swords". No matter how many swords you actually receive, be it 5 or 1000, in English you always end up with the same plural sentence. But this is not the case in many languages.
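Russian, for instance, needs three plural forms chosen by rules on the number itself. A sketch of the selection logic (the index rules follow the standard CLDR categories for Russian; the English strings are invented examples):

```python
def russian_plural_index(n):
    """CLDR-style plural category for Russian:
    0 = "one" (1, 21, 31, ...), 1 = "few" (2-4, 22-24, ...), 2 = "many" (the rest)."""
    if n % 10 == 1 and n % 100 != 11:
        return 0
    if 2 <= n % 10 <= 4 and not 12 <= n % 100 <= 14:
        return 1
    return 2

# English only ever needs two forms; a Slavic translation file would carry three.
EN_FORMS = ["Your team has acquired {n} sword.", "Your team has acquired {n} swords."]

def acquired_en(n):
    return EN_FORMS[0 if n == 1 else 1].format(n=n)

print(acquired_en(1))     # -> Your team has acquired 1 sword.
print(acquired_en(1000))  # -> Your team has acquired 1000 swords.
print(russian_plural_index(5), russian_plural_index(22), russian_plural_index(21))
```

So the message catalog must store a variable number of forms per language, and the per-language rule picks among them; libraries like gettext handle exactly this.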

Resources