Is there a common file that contains the localized text of each country? - localization

Is there a common file that contains the localized text of each country?
The content would be the words frequently used by applications, such as [Submit], [Cancel], and so on.
Just by using this file we could create applications in various languages without having to translate those strings ourselves.

You are looking for Translation Memory (TM). TM is a database of the source strings and their corresponding translations into different languages that can speed up the translation of the same or similar strings in your projects.
The thing here is that the file you're looking for probably doesn't exist, because such a file would be huge and not very usable.
Translation Memory is one of the core concepts in modern CAT (Computer-Assisted Translation) tools.
There are a number of offerings on the CAT tools market that provide a Global Translation Memory feature. A Global TM is a huge database of billions of previously made translations from different projects and language pairs.
For example, Crowdin is a popular localization management platform that has a Global TM with billions of previously translated texts and lets users apply this TM to their own strings. Furthermore, the process can be fully automated.
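To make the idea concrete: at its core, a translation memory is a lookup from source strings to per-language translations. A minimal sketch (hypothetical data; Python used purely for illustration, and real TMs also do fuzzy matching on similar strings) could look like this:

# Minimal sketch of a translation-memory lookup (hypothetical data).
# Real TMs also match similar strings; this shows exact matches only.
translation_memory = {
    "Submit": {"es": "Enviar", "fr": "Envoyer", "de": "Absenden"},
    "Cancel": {"es": "Cancelar", "fr": "Annuler", "de": "Abbrechen"},
}

def translate(source, lang):
    """Return a previously stored translation, or None if the TM has no match."""
    return translation_memory.get(source, {}).get(lang)

print(translate("Cancel", "de"))  # -> Abbrechen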

Related

Single YAML file VS multiple YAML files in different folders and subfolders

I am working on automating the translation workflow and improving the Localization process as a whole of a Rails website. I am using SimpleBackend so only YAML files are used for storing translations.
The current locales directory consists of folders, then sub-folders (in some cases), with those sub-folders containing the .yml files. I am considering integrating the project with a third-party tool like Transifex for translation management, so maybe using a single YAML file for each language would be better for managing the workflow.
If someone could highlight the pros and cons of both structures, it would really help me decide whether to switch from the nested file structure to the single-file pattern. Also, the project is an open-source project with active contributors, so I am thinking of a long-term solution.
Thanks!
I think whatever tools you are using to make the process flow smoothly factors a lot in this decision. You should explore how exactly Transifex wants things to be structured in output, and try to keep your current input structure, and give that a shot before making a decision.
However, in my opinion, for a large app with a lot of translatable text, my preference would be to allow for multiple yaml files in your default locale, and one or two consolidated yaml files for each foreign translation. If there isn't a lot of translatable text in your app, maybe a single file is fine for you, but given it's already split up, there's a good chance that's the better choice. On a team with many contributors you can end up with a very high churn file (maybe with a lot of merge conflicts) that everyone changes all the time.
Splitting into separate files lets you logically separate out text to match a domain in your app, like a separate yaml file for mailers (or even each mailer), and one for each domain (or controller). Either way, it puts you in control of your organization strategy.
However, there isn't a lot of value, IMO in separating your foreign translations to mirror that structure. The systems I have experience with (not Transifex) generate your foreign translation files for you, so you just need to sync with the web interface and commit the results.
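If you do end up consolidating, the merge itself is mechanical. Here is a rough sketch (using PyYAML and hypothetical paths; Python purely for illustration) of folding several per-domain locale files into one consolidated file per language before handing it to a tool like Transifex:

# Rough sketch: merge several per-domain locale files into one consolidated
# file per language (hypothetical paths; assumes PyYAML is installed).
import glob
import yaml

def deep_merge(dst, src):
    """Recursively merge nested dictionaries (src wins on leaf conflicts)."""
    for key, value in src.items():
        if isinstance(value, dict) and isinstance(dst.get(key), dict):
            deep_merge(dst[key], value)
        else:
            dst[key] = value
    return dst

consolidated = {}
for path in sorted(glob.glob("config/locales/**/en*.yml", recursive=True)):
    with open(path) as f:
        deep_merge(consolidated, yaml.safe_load(f) or {})

with open("en.consolidated.yml", "w") as f:
    yaml.safe_dump(consolidated, f, allow_unicode=True)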

Using existing human translations to aid machine translation to a new language

In the past, my company has used human, professional translators to translate our software from English into some 13 languages. It's expensive but the quality is high.
The application we're translating contains industry jargon. It also contains a lot of sentence fragments and single words which, out of context, are unlikely to be correctly translated.
I am wondering if there is a machine translation system or service that could use our existing professionally-generated translations to more accurately create a machine translation into any new language.
If an industry term, phrase or sentence fragment has been translated from en-US to es-AR, pt-BR, cs-CZ, etc., then couldn't those prior translations be used as a hint regarding what the correct word choice should be for some new language? They could be used, in a sense, to triangulate. At worst, they could be used to create a majority voting system (e.g. if 9 of 13 languages translated a phrase to the same thing in the new language, we go with it).
Is anyone aware of a machine translation service that works like this?
I have no idea about translation systems, but such functionality -- custom translations for specific words -- will be offered by any commercial system, I guess. It is even possible with Google Translate by clicking on the book with the star on the cover.
As a trivial, non-invasive method, you could make up, for each target language, a dictionary of the required terminology in the form [word-as-is-translated, word-as-should-be-translated], where you have an N:1 relationship (in one language, multiple word-as-is-translated entries may map to one word-as-should-be-translated; the word-as-is-translated entries thereby depend on the actual translation system).
After preparing those dictionaries, you can simply search the translation result for those words and replace them with the desired words.
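A minimal sketch of that search-and-replace step (hypothetical terminology; Python purely for illustration):

import re

# Hypothetical per-language terminology dictionary:
# term as the MT system outputs it -> term as it should read (an N:1 mapping is fine).
glossary_es = {
    "fichero": "archivo",
    "ordenador": "equipo",
}

def apply_glossary(translated_text, glossary):
    """Replace machine-translated terms with the preferred in-house terminology."""
    for wrong, right in glossary.items():
        # \b avoids replacing substrings inside longer words; inflected forms
        # would still need extra patterns per language.
        translated_text = re.sub(r"\b%s\b" % re.escape(wrong), right, translated_text)
    return translated_text

print(apply_glossary("Seleccione el fichero que desea subir.", glossary_es))
# -> Seleccione el archivo que desea subir.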

How to extract entities from html using natural language processing or other technique

I am trying to parse entities from web pages that contain a time, a place, and a name. I read a little about natural language processing, and entity extraction, but I am not sure if I am heading down the wrong path, so I am asking here.
I haven't started implementing anything yet, so if certain open source libraries are only suitable for a specific language, that is ok.
A lot of times the data would not be found in sentences, but instead in html structures like lists (e.g. 2013-02-01 - Name of Event - Arena Name).
The structure of the webpages will be vastly different (some might use lists, some might put them in a table, etc.).
What topics can I research to learn more about how to achieve this?
Are there any open source libraries that take into account the structure of html when doing entity extraction?
Would extracting these (name, time, place) entities from html be better (or even possible) with machine vision where the CSS styling might make it easier to differentiate important parts (name, time, location) of the unstructured text?
Any guidance on topics/open source projects that I can research would help I think.
Many programming languages have libraries that parse dates in various formats into a canonical form (e.g., in Java, SimpleDateFormat). As you say, the structure of the web pages will be vastly different, but dates can only be expressed in a small number of variations, so writing down the regular expressions for a few (say, half a dozen) formats will enable extraction of dates from most, if not all, HTML pages.
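For instance, a rough sketch of that approach (Python purely for illustration; the patterns are illustrative, not exhaustive):

import re

# A few illustrative date patterns (far from exhaustive).
DATE_PATTERNS = [
    r"\b\d{4}-\d{2}-\d{2}\b",                        # 2013-02-01
    r"\b\d{1,2}/\d{1,2}/\d{2,4}\b",                  # 02/01/2013 or 2/1/13
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.? \d{1,2},? \d{4}\b",  # Feb 1, 2013
]

def extract_dates(text):
    """Return all substrings that look like dates in one of the known formats."""
    dates = []
    for pattern in DATE_PATTERNS:
        dates.extend(re.findall(pattern, text))
    return dates

print(extract_dates("2013-02-01 - Name of Event - Arena Name"))
# -> ['2013-02-01']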
Extraction of places and names is harder, however. This is where natural language processing has to come in. What you are looking for is a Named Entity Recognition (NER) system. One of the best open-source NER systems is the Stanford NER. Before using it, you should check out their online demo. The demo offers three classifiers (for English) that you can choose from. For most of my tasks, I find their english.all.3class.distsim classifier to be quite accurate.
Note that an NER system performs well when the places and names you extract occur in sentences. If they are going to occur in HTML labels, this approach is probably not going to be very helpful.
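Stanford NER itself is a Java library; as a rough illustration of the same idea in Python (using spaCy as a stand-in rather than the tool named above, and assuming the en_core_web_sm model is installed), named-entity extraction over sentence-like text looks roughly like this:

# Rough NER sketch using spaCy as a stand-in for Stanford NER
# (assumes: pip install spacy && python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The concert with John Doe takes place at Madison Square Garden on February 1, 2013.")

for ent in doc.ents:
    # Typical labels: PERSON, GPE/LOC/FAC (places), DATE, TIME, ORG, ...
    print(ent.text, ent.label_)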

Manage delta changes in .resx files before go live

In our web site we use .resx files to provide labels and GUI in 4 different languages (English, Spanish, French and German). A specific department is in charge to provide translations, given the english values (default).
We, as programmers, define english translations and send them to the language department in a specific day. Usually after a week we get the translated list back and we integrate it in the solution.
However between the date we send the list out and we get it translated, it might happen that new labels are created (usually between 10 and 20 entries) and managed internally by another department to save time.
What would be the best practice to manage and process the "delta" entries that need to be translated and then integrated in the labels list?
Our current approach is to sort the .resx files and then compare them to find the new fields that are missing a translation. But I guess there is a better approach for this.
I'm working for the translation agency Supertext and we often have this problem too. But most translation departments use tools like Trados, MemoQ, etc. They all have built-in Translation Memories, i.e. everything that has already been translated once can automatically be translated again.
In your case, you should be able to just send them the files again and they can just add the missing translations. If your department is not using any tools, I would have a serious word with them...
Alternatively, there are online tools like Transifex where you could actually manage this process yourself. There are other similar tools, but I would have to dig them up.
In any case, the core of the message is that you use a TM (Translation Memory).
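As a side note on the delta comparison mentioned in the question, a minimal sketch (assuming the standard .resx layout with <data name="..."><value>...</value></data> entries, hypothetical file names, Python purely for illustration) could look like this:

# Minimal sketch: find .resx keys present in the default (English) file but
# missing from a translated file.
import xml.etree.ElementTree as ET

def resx_entries(path):
    """Return {resource key: value text} for a .resx file."""
    root = ET.parse(path).getroot()
    return {
        data.get("name"): (data.findtext("value") or "")
        for data in root.findall("data")
    }

english = resx_entries("Labels.resx")          # hypothetical file names
spanish = resx_entries("Labels.es.resx")

delta = {key: text for key, text in english.items() if key not in spanish}
for key, text in sorted(delta.items()):
    print(f"{key}\t{text}")                    # hand this list to the translators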

What is the best way to store multiple language versions of a website?

My web site (on Linux servers) needs to support multiple languages.
What is the best practice to have/store multiple languages versions of the same site?
Some I can think of:
store in DB
different view file for each language
gettext
hard coded words in PHP files (like in phpBB)
With web sites, you really have several categories of content to consider for localization:
The article-type content elements that you would in many cases create, edit and publish in a CMS.
The smaller content blocks that are common to every page (or a sub-group of pages), such as tagline, blurb, text around a contact form, but also imported content such as a news ticker or ads and affiliate links. Some of these may only appear for one language (for example, if you don't offer some services in some regions, or don't have, say, language-appropriate imported content for a particular language: it can be better to remove an element rather than offering English to people who may not speak it).
The purely functional elements, like "Click here to comment", "More...", high-level navigation, etc., which are sometimes part of your template. Some of these may be inside images.
For 1. the main decision is using a CMS or not. If yes, you absolutely need to choose one that supports multiple languages. I'm not up to date with recent developments in PHP CMSs, but several of the Django CMS apps (Django-CMS-2, FeinCMS) support multi-language content. Don't forget that date stamps, for example, need to be localized, too (or you can get around this by choosing ISO dates, though that may not always be possible). If you don't use a CMS, and everything is in your HTML files, then gettext is the way to go (there's a small sketch of the layout at the end of this answer), and keep the .mo files (and your offline .po files) in folders by language.
For 2. if you have a CMS with good multi-lingual support, get as much as possible inside the CMS. The reason is that these bits do change, and you want to edit your template as little as possible. If you write code yourself, think of ways of exporting all in-CMS strings per language, to hand them to translators. Otherwise, again, gettext. The main issue is that these elements may require hard-coding language-selection code (if $language = X display content1 ...)
For 3., if it's in your template, use gettext. For images, the per-language folders will come in handy, and for heaven's sake choose images whose generation can be automated, or you (or your graphic artist) will go mad editing hundreds of custom images with strings in languages you don't understand.
For both 2. and 3., abstracting from the language selection may help selecting the appropriate blocks or content directory (where localized images or .mo files are kept).
What you definitely want to avoid is keeping a pile of HTML files with extensive text content in them that would be a nightmare to maintain.
EDIT: Everything about gettext, .po and .mo files is in the GNU gettext manual (more than you ever wanted to know) or a slightly dated but friendlier tutorial. For PHP, there are the PHP gettext functions, and also the Zend_Locale documentation.
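For what it's worth, here is a rough sketch of the per-language .po/.mo layout described above, shown with Python's built-in gettext module purely for illustration (the PHP gettext functions follow the same pattern; the directory and domain names are assumptions):

# Assumed directory layout (hypothetical):
#   locale/de/LC_MESSAGES/messages.mo
#   locale/fr/LC_MESSAGES/messages.mo
import gettext

de = gettext.translation("messages", localedir="locale", languages=["de"], fallback=True)
_ = de.gettext

print(_("Click here to comment"))  # -> the German string from messages.mo, or the original if missing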
I recommend using Zend_Translate's Gettext adapter, which parses .mo files. Very efficient + caching. Your calls would look like
echo $translation->_("Hello World");
This would find the locale-specific translation for the specified string.
Check out i18n support for php: http://php-flp.sourceforge.net/getting_started_english.htm
