Converting free form english text to spanish, what are the options?

Converting free form english text to spanish, what are the options? - localization

I have an application that will be used by spanish speaking people as well as english speaking people. I am using .resx files and localization to translate all the hard coded text. I am also retrieving language specific data from the database for some things that don't change often like "Category Descriptions". Here is my question. I think I already know the answer. Is there a way to translate free form text entered by a user? For example can a string entered as saved to a database in english be displayed in spanish? One more issue is these strings often contain engineering terms and technical abbreviations that I don't think could be translated with something like google translate. Is there anything else out there? I am thinking that this text can only be translated by a human with knowledge of the terminolgy and abbreviations used in this particular industry.

There are some online services such as Google Translate as pointed to by Binary Worrier. However, one should bear in mind that none of these services give accurate translations. Because, as you wrote, translation is a very difficult matter. Current obstacles to good automated translation include, as you wrote, lack of context.
This is a problem even for human translators. Ask a translator for a given sentence in another language. She'll answer: "Ok, what do you mean by this word: X or Y ? In which context ? Who are you talking to? Is this a formal or informal tone? etc...
This is especially true regarding localization where texts are usually very short. This increases the lack of context. Think of a simple menu item: "Load". Is it a name? Is it a verb? Damn, even a human translator needs more information. So don't expect a computer to solve the problem.
Of course, it all depends on the accuracy that you need and the acceptance factor of your users for bad translations. Google Translate et al are very successful because people prefer a bad translation than nothing.
If I were you, I'd make a few manual tests with typical texts in your DBs and see if the translation accuracy fits your needs.
BTW, I believe Google Translate is free for reasonable of amount of use. Basically, unless you want to translate the whole Wikipedia every week, you should be on the safe side ;-)

You can hook into Google Translate APIs and translate this stuff on the fly, I think there's a charge though

I have an answer from my users. Have the users enter the strings in both English and Spanish and store them to the database. Display the correct strings based on the language of the browser. I still have alot of grunt work to do with filling out the .resx files and modifying all the words I need translated.

Related

spell checker uses language model

I look for spell checker that could use language model.
I know there is a lot of good spell checkers such as Hunspell, however as I see it doesn't relate to context, so it only token-based spell checker.
for example,
I lick eating banana
so here at token-based level no misspellings at all, all words are correct, but there is no meaning in the sentence. However "smart" spell checker would recognize that "lick" is actually correctly written word, but may be the author meant "like" and then there is a meaning in the sentence.
I have a bunch of correctly written sentences in the specific domain, I want to train "smart" spell checker to recognize misspelling and to learn language model, such that it would recognize that even thought "lick" is written correctly, however the author meant "like".
I don't see that Hunspell has such feature, can you suggest any other spell checker, that could do so.

See "The Design of a Proofreading Software Service" by Raphael Mudge. He describes both the data sources (Wikipedia, blogs etc) and the algorithm (basically comparing probabilities) of his approach. The source of this system, After the Deadline, is available, but it's not actively maintained anymore.

One way to do this is via a character-based language model (rather than a word-based n-gram model). See my answer to Figuring out where to add punctuation in bad user generated content?. The problem you're describing is different, but you can apply a similar solution. And, as I noted there, the LingPipe tutorial is a pretty straightforward way of developing a proof-of-concept implementation.
One important difference - to capture more context, you may want to train a larger n-gram model than the one I recommended for punctuation restoration. Maybe 15-30 characters? You'll have to experiment a little there.

Most efficient way to translate a Sitecore website to 4 other languages (Not having the translators in you Sitecore CMS)

I am looking for a good way to translate an excisting Sitecore installation (English language is available) to 4 other languages (Russian , Chinese, Portuguese etc.) A dedicated translation company will translate all texts we deliver to the specified languages, but I'm curious on how other companies set this up. I thought about just exporting all Sitecore items which have to be translated using the Database language Export function in Sitecore and having the translation company edit those files. By just replacing the language tags in the XML we should be able to import this file as the newly created other language, however I'm affraid that this XML structure will be totally useless for a translation company and that they will drown in the codes inside this XML. How can we efficiently do this? Is there any other way then just giving those translation people access to the Sitecore environment and having them edit the languages here? Any Shared Source Module to achieve this? I still have alot of questions, is there anyone with some experience in achieving this?

Your primary options are either the language export/import functionality (as you mention), or a workflow-based solution that integrates with your translation agency's Translation Management System (if they have one -- hopefully they do).
The former is better for the initial translation. Typically, your agency should be able to handle translation of content within XML files. A good one can. If you create all needed language versions beforehand and copy english content into them, it will make the files easier to work with as they'll have tags for the new languages in them already. I've seen the creation of these layers done with Revolver (http://www.codeflood.net/revolver/) but could also be done with custom code or workflow.
For ongoing maintenance of your translated content, you'll probably want to integrate through workflow. Clay Tablet Technologies (http://www.clay-tablet.com/) have a middleware component w/ Sitecore integration that can make this easier, depending on your translation agency. You can also do your own workflow-based integration, with workflow commands that allow your users to send content for translation. Then you'd need some sort of listener that pulls the translated content back in, and continues the workflow.
Hope this helps!

You could also check out Lionbrdige (http://en-us.lionbridge.com/sitecore-and-lionbridge-announce-partnership-to-help-companies-thrive-across-borders.htm) as a solution.
From my own experience our customers normally use the Sitecore import/export function as a first step and then use Lionbridge or Clay Tablet as a service.
One important thing to think about with translations is the ongoing work. The initial translation is rather simple, but the second and so on might be more troublesome. What if different changes has been made in different languages. If local changes were made in the content for sat the french version you couldn´t just send the English version (second translate then) since you would also have to accomodate for the regional changes in the content.

Having worked with literally dozens of Sitecore clients worldwide — and helped get content to and from all the largest, and many smaller translation firms —, I can attest to the ineffeciency of trying to do translation in situ, that is in Sitecore. I liken it to asking an electrician to come over and rewire your house, but as they reach for their toolbox from the truck you tell them, "Nope — you need to do it by hand".
The very best way to manage anything more than a page or two of content for translation is to export it seamlessly. Deliver it to the LSP in a proper format (XML or XLIFF) and, when possible, auto import it to their TMS. Once translated, the content should then flow seamlessly back into Sitecore.
You can code this yourself — but the pitfalls are non-trivial just on the Sitecore side. (If you want intuitive UI's, scalability, and all the features that meet the needs of translation). Let alone the challenges of connecting to the systems LSP's use. (For example, who here knows the relative merits/risks of using SLD's Nexus connector versus their CTA for connecting to TMS?)
As kindly mentioned above, there are commecially available solutions that meet all these needs and more. So if you've got even a modest amount of content — and want to send that to any translation provider of your choice — I'd be happy to discuss how we can help.

The main issue with translation isn't technical at all, the XML export is a simple enough format and all agencies should be able to deal with it with no porblems. as others have suggested, maintenance after the initial translation is slightly more problematic but they also point to tools to achieve this.
The main issue we've found with translation is actually linguistic: how to achieve consistency of phrasing and that matches the original but is sufficiently adjusted to local requirements. Translation companies usually have software to aid this - libraries of of the phrases they translate etc. - working with an exported XML file doesn't provide the context of seeing content in situ. A particular item may be translated correctly and the site consistently, but as each page may be built from multiple items there can easily be conflicts between content as presented.
That makes working with the Sitecore backend (maybe with field security settings to limit ) or in the page editor (possibly pre filling fields with English values) a viable idea.

Designing a Non-Specific Language Application, e.g. planning for localization

Made this community wiki :3
I'm developing a basic RPG, and one of my goals from the beginning is to make sure that my program is language non-specific. Basically, before I design or start programming any menus, I want to make sure that I can load and display them out of supported languages so I am not hard-coding in values.
(It would save me from many migranes down the road)
For this example, let's use Western Left-to-Right languages. English, Spanish, German, French, Italian.
This is a basic example of what I have.
One XML file contains a mapping and design of a conversation.
<conversation>
<dialog>line1</dialog>
<dialog>line2</dialog>
</conversation>
Other XML files contains the definitions.
<mappings language="English">
<line1>This is line 1 in English!</line1>
<line2>Other lines are contained in language-separated xml files</line2>
</mappings>
Heh. This would work great, besides the fact that I forgot that English doesn't assign genders to their words, whereas other languages do. So, where one sentence might be enough in English, I might need to have two sentences in other languages, one to cover the masuline tense and the other to cover the feminine tense.
What would be the most condusive way of solving this problem? Right now, I've considered coming up with different mapping tables, one excuslively for masculine-tense sentences whereas the other table would cover just feminine-tenses. Or just reading from different defintion tables.
And another kicker would be based within my game data design. I never thought about it, but I might need to store within my game items and characters their sexes so I can use the correct sentence. However, other languages might have their own specific quirks that I would need to consider as well (though thankfully, from what I know Italian and Spanish are relatively similar, and French possibly as well.)
So, obviously this is a huge task ahead of me. What other design considerations should I think of? Rightnow, I'm thinking a static class would be easiest. Configure selected language at startup, throw in inputs and hopefully get a string back.
Any ideas (looking to throw ideas around :P)

There's two general ways to approach this: brute force and trying to be clever. Brute force means writing each possible line and including it with your XML files. It's a lot of work, but it will work.
Trying to be clever gets into deep water, fairly fast, particularly if you're trying to cover a whole lot of languages.
You need to keep more information about characters than gender. In Russian, for example, there are different words meaning "you" depending on whether you're being informal or formal (or talking to multiple people), and the verb endings are also different. There are different translations of "please pass the bread" depending on the formality. In other languages, getting the translation right depends on social status.
There are issues, as pawel_dyda pointed out, with singular, plural, and possibly dual case. Other languages also use different word orders: "The arrows are X coppers each, so to buy Y arrows you'll need Z silver" may require you to keep track of the order of the numbers.
Visual C++ and MFC come with internationalization facilities that are actually pretty good. You'd keep the strings in a resource file, and it's possible to substitute numbers and the like in while keeping the order correct for different languages.
Look up "internationalization" (often abbreviated to "i18n") on the web. There's plenty of stuff out there.

As for genders you may try encourage translators to use non-gender specific translations (which is usually possible in business applications but might be impossible here).
You may have also encounter the problem somewhere else. Other (non-English) languages have multiple plural forms. For example: "Your team has acquired 2 swords". No matter how many swords you will actually receive, be it 5 or 1000, in English you will always end up with one plural sentence. But this is not the case in many languages.

Intelligent text parsing and translation

What would be an intelligent way to store text, so that it can be intelligently parsed and translated later on.
For example, The employee is outstanding as he can identify his own strengths and weaknesses and is comfortable with himself.
The above could be the generic text which is shown to the user prior to evaluation. If the user is a Male (say Shaun) or female (say Mary), the above text should be translated as follows.
Mary is outstanding as she can identify her own strengths and weaknesses and is comfortable with herself.
Shaun is outstanding as he can identify his own strengths and weaknesses and is comfortable with himself.
How do we store the evaluation criteria in the first place with appropriate place or token holders. (In the above case employee should be translated to employee name and based on his gender the words he or she, himself or herself needs to be translated)
Is there a mechanism to automatically translate the text with the above information.

The basic idea of doing something like this is called Mail Merge.
This page seems to discus how to implement something like this in Ruby.
[Edit]
A google search gave me this - http://freemarker.org/
I don't know much about this library, but it looks like what you need.

This is a very broad question in the field of Natural Language Processing. There are numerous ways to go around it, the questions you asked seem too broad.
If I understand correctly part of your question this could be done this way :
#variable{name} is outstanding as #gender{he/she} can identify #gender{his/hers} own strengths and weaknesses and is comfortable with #gender{himself/herself}.
Or:
#name is outstanding as #he can identify #his own strengths and weaknesses and is comfortable with #himself.
... if gender is the major problem.

I have had some experience working with a tool called Grammatica, when building a custom user input excel like formula parsing and evaluation engine. It may not be to the level of sophistication you're looking for but it's a start. This basically uses many of the same concepts that popular code compiler parsers employ. It's definitely worth checking out.

I agree with Kornel, this question is too broad. What you seem to be talking about is semantics for which RDF's and OWL can be a good starting point. Read about modeling semantics using markup and you can work your way up from there.

Tool to parse text for possible Wikipedia links

Does a tool exist that can parse text and output that text, hyper-linked to Wikipedia entries for words of interest?
For example, I'd like a tool that could turn something like:
The most popular search algorithm on a
sorted list is the binary search.
Into:
The most popular search algorithm on a
sorted list is the binary search.
It would be wonderful if Wikipedia had an API which would do this since they would be best equipped to determine what "words of interests" are.
In my example I simply linked all combinations which linked directly to an entry except for The and most.

There is a tool that does exactly what you're asking for.
http: //wikify.appointment.at/
It's not perfect, but it works.

You have two separate problems to solve here:
Deciding which words should be linked
Determining if there's a suitable entry to link these words to
Now, (2) is simpler, though it's also somewhat problematic. Wikipedia seems to have an API that allows you to gather data efficiently, and they also allow "screen scraping". But there's a problem with disambiguation - sometimes you might hit not the entry you wanted. For example, python links to a disambiguation page, as it can be a programming language, a snake and a couple of other things.
(1) Is much harder, though. You can take the "simple approach" and attempt to find links for all non-trivial nouns (or even noun/adjective pairs). Non-trivial here means omitting words like "fiend, word, computer" etc.
But This would result in a plethora of links, which isn't convenient to read. It's really up to you to decide what's interesting in the text, and this depends a lot on the text itself. In an article for professional programmers, do you really want to link to "search algorithm" every time? But for beginners, perhaps you do.
To conclude, I strongly doubt there's a single general-purpose tool that will do the trick for you. But you surely have all the options at your hand, and something need-specific can be coded without too much effort.

Silviu Cucerzan of Microsoft Research tackled this problem. Well, not the problem of inserting the links, but the general issue of determining what entities are being mentioned in a some piece of text. Fortunately for you, he used Wikipedia articles as his set of entities. His paper, "Large-Scale Named Entity Disambiguation Based on Wikipedia Data", is available on his website. Direct link: pdf.

Categories

HOME

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart