Made this community wiki :3
I'm developing a basic RPG, and one of my goals from the beginning is to make sure that my program is language non-specific. Basically, before I design or start programming any menus, I want to make sure that I can load and display them out of supported languages so I am not hard-coding in values.
(It would save me from many migranes down the road)
For this example, let's use Western Left-to-Right languages. English, Spanish, German, French, Italian.
This is a basic example of what I have.
One XML file contains a mapping and design of a conversation.
<conversation>
<dialog>line1</dialog>
<dialog>line2</dialog>
</conversation>
Other XML files contains the definitions.
<mappings language="English">
<line1>This is line 1 in English!</line1>
<line2>Other lines are contained in language-separated xml files</line2>
</mappings>
Heh. This would work great, besides the fact that I forgot that English doesn't assign genders to their words, whereas other languages do. So, where one sentence might be enough in English, I might need to have two sentences in other languages, one to cover the masuline tense and the other to cover the feminine tense.
What would be the most condusive way of solving this problem? Right now, I've considered coming up with different mapping tables, one excuslively for masculine-tense sentences whereas the other table would cover just feminine-tenses. Or just reading from different defintion tables.
And another kicker would be based within my game data design. I never thought about it, but I might need to store within my game items and characters their sexes so I can use the correct sentence. However, other languages might have their own specific quirks that I would need to consider as well (though thankfully, from what I know Italian and Spanish are relatively similar, and French possibly as well.)
So, obviously this is a huge task ahead of me. What other design considerations should I think of? Rightnow, I'm thinking a static class would be easiest. Configure selected language at startup, throw in inputs and hopefully get a string back.
Any ideas (looking to throw ideas around :P)
There's two general ways to approach this: brute force and trying to be clever. Brute force means writing each possible line and including it with your XML files. It's a lot of work, but it will work.
Trying to be clever gets into deep water, fairly fast, particularly if you're trying to cover a whole lot of languages.
You need to keep more information about characters than gender. In Russian, for example, there are different words meaning "you" depending on whether you're being informal or formal (or talking to multiple people), and the verb endings are also different. There are different translations of "please pass the bread" depending on the formality. In other languages, getting the translation right depends on social status.
There are issues, as pawel_dyda pointed out, with singular, plural, and possibly dual case. Other languages also use different word orders: "The arrows are X coppers each, so to buy Y arrows you'll need Z silver" may require you to keep track of the order of the numbers.
Visual C++ and MFC come with internationalization facilities that are actually pretty good. You'd keep the strings in a resource file, and it's possible to substitute numbers and the like in while keeping the order correct for different languages.
Look up "internationalization" (often abbreviated to "i18n") on the web. There's plenty of stuff out there.
As for genders you may try encourage translators to use non-gender specific translations (which is usually possible in business applications but might be impossible here).
You may have also encounter the problem somewhere else. Other (non-English) languages have multiple plural forms. For example: "Your team has acquired 2 swords". No matter how many swords you will actually receive, be it 5 or 1000, in English you will always end up with one plural sentence. But this is not the case in many languages.
Related
In the past, my company has used human, professional translators to translate our software from English into some 13 languages. It's expensive but the quality is high.
The application we're translating contains industry jargon. It also contains a lot of sentence fragments and single words which, out of context, are unlikely to be correctly translated.
I am wondering if there is a machine translation system or service that could use our existing professionally-generated translations to more accurately create a machine translation into any new language.
If an industry term, phrase or sentence fragment has been translated from en-US to es-AR, pt-BR, cs-CZ, etc., then couldn't those prior translations be used as a hint regarding what the correct word choice should be for some new language? They could be used, in a sense, to triangulate. At worst, they could be used to create a majority voting system (e.g. if 9 of 13 languages translated a phrase to the same thing in the new language, we go with it).
Is anyone aware of a machine translation service that works like this?
I have no idea about tranlation systems, but such functionality -- custom translations for specific words -- will be offered by any commercial system, I guess. It is even possible with google translate by clicking on the book with the star on cover.
As a trivial non-invasive method, you could for each goal-language make up a dictionary [] the required terminology in the form [word-as-is-translated, word-as-should-be-translated], where you have an N:1 relationship (in one language multiple word-as-is-translated should be mapped to one word-as-should-be-translated. The words word-as-is-translated thereby depend on the actual translation system).
After preparing those dictionaries, you can simply search the translation result for those words and replace them with the desired words.
I have a list of song tracks that I uploaded from the iTunes API. Some of them are duplicates, but not perfect duplicates. For example, one might say "All 4 u" vs "All for you", or "Some song" vs "some song feat. some other artist"
I want to be able to identify the duplicates. Is the best way to compute the Levenshtein distance for all pairs? That seems excessive.
I'm working in the Cocoa Touch framework for iOS programming so if anyone knows of any libraries that would help a lot.
Why do you consider computing the Levenshtein distance excessive? What algorithm would you use if you were sitting down to a list with pencil and paper?
That said, Levenshtein is likely necessary, but not sufficient. I would start by normalizing the strings. In some cases, a string might normalize a couple of ways and you'll need to do both. Normalization would look like:
convert to lowercase
Strip any leading numbers followed by punctuation ( "1.", "1 - ", etc.)
Tentatively strip anything after "feat." or "with"
This is an example of special knowledge about your problem set. You're going to have to use a lot of special knowledge like this.
"Tentatively" means you should probably keep both the stripped and non-stripped versions of the string
Keep in mind that things including "feat." might be remixes, so you have to be careful about assuming duplicates. This is of course true of almost any attempt at de-dupping. There are often multiple versions.
Tentatively expand common abbreviations (u=>you, 4=>for, 2=>two, w/=>with, etc. etc.)
Tentatively strip anything in parentheses
Strip English articles (a, an, the). Maybe even strip all very short words (3 or less characters) as a first pass.
Doing this well is complicated and will require a lot of trial and error. I've done a lot of contact de-dupping in the past, and one piece of advice: start conservative. It is very easy to accidentally de-dupe way too much. Build a big list of test data that you've de-duped by hand and test, test, test after every algorithm change. Make sure your UI can present the user with anything you're uncertain about, because there are going to be many, many records that you can't be certain about. (This is true even when you do it by hand. Look at a big list of human-entered titles and tell me which ones are duplicates 100% without listening to the tracks. A computer isn't going to do better than you at this.)
I'm not aware of any publicly available library for this. It's been solved by many people many times (search for "dedupe song titles" or anything similar). But it's generally commercial software.
One more piece of advice for this, since it's a huge O(n^2) or worse problem. Look for bucketing opportunities. If you can match artists first, then albums, then tracks, you can divide and conquer in much less time.
My background is as a linguist, and so for new years decide to learn computer language, c#, out of interest and so that I can make small word games for my children and students.
I have started to look at word games and have been reading about using char types and char arrays which I have been playing with so I have been able to generate the alphabet.
What I really want to do is have a word appear with random letters missing and then letters of the alphabet appears and player needs to select correct letter to complete word.
I am not after code, as an educator I am not fond of cheating, just advice on where I could start, what should I be reading about so that I can achieve what I described.
Many thanks in advance for help and advice
If I understand your question correctly, you're looking for an algorithm (or even pseudocode) rather than code or anything else. If I was to implement a game as you described, I would go about it the following fashion:
Select a word from a list. This "dictionary" could be as simple as a text file containing different words, or a more complex database of all words in the English language.
Pick a letter from the word and remove it.
Ask the user for the missing letter. Keep asking until they guess correctly or they run out of guesses.
Rise and repeat.
This is a pretty simple game, which uses pretty basic concepts. I believe the XNA would be complete overkill in this situation. Like Mustafa mentioned in the comment on the original post, XNA provides a framework that makes game progamming easier because it provides templates, but it also adds a lot of overhead and needless complexity (especially for a novice programmer.) Since you're coming from a non-programming background, I would suggest Python or Ruby as a good starting language, and suggest looking into the following topics:
Reading from file (the "dictionary" mentioned above)
Loops, specifically for-loops and while-loops or the language equivalent (to allow the user to keep guessing until they run out of guesses or guess correctly)
Command-line input/output (IO) -- print to screen and read input from the console.
Arrays and Strings
Once you've built out a working command-line application, then I would suggest looking into things like Graphical User Interfaces (GUIs) and making it look "pretty."
I have an application that will be used by spanish speaking people as well as english speaking people. I am using .resx files and localization to translate all the hard coded text. I am also retrieving language specific data from the database for some things that don't change often like "Category Descriptions". Here is my question. I think I already know the answer. Is there a way to translate free form text entered by a user? For example can a string entered as saved to a database in english be displayed in spanish? One more issue is these strings often contain engineering terms and technical abbreviations that I don't think could be translated with something like google translate. Is there anything else out there? I am thinking that this text can only be translated by a human with knowledge of the terminolgy and abbreviations used in this particular industry.
There are some online services such as Google Translate as pointed to by Binary Worrier. However, one should bear in mind that none of these services give accurate translations. Because, as you wrote, translation is a very difficult matter. Current obstacles to good automated translation include, as you wrote, lack of context.
This is a problem even for human translators. Ask a translator for a given sentence in another language. She'll answer: "Ok, what do you mean by this word: X or Y ? In which context ? Who are you talking to? Is this a formal or informal tone? etc...
This is especially true regarding localization where texts are usually very short. This increases the lack of context. Think of a simple menu item: "Load". Is it a name? Is it a verb? Damn, even a human translator needs more information. So don't expect a computer to solve the problem.
Of course, it all depends on the accuracy that you need and the acceptance factor of your users for bad translations. Google Translate et al are very successful because people prefer a bad translation than nothing.
If I were you, I'd make a few manual tests with typical texts in your DBs and see if the translation accuracy fits your needs.
BTW, I believe Google Translate is free for reasonable of amount of use. Basically, unless you want to translate the whole Wikipedia every week, you should be on the safe side ;-)
You can hook into Google Translate APIs and translate this stuff on the fly, I think there's a charge though
I have an answer from my users. Have the users enter the strings in both English and Spanish and store them to the database. Display the correct strings based on the language of the browser. I still have alot of grunt work to do with filling out the .resx files and modifying all the words I need translated.
I believe several of us have already worked on a project where not only the UI, but also data has to be supported in different languages. Such as - being able to provide and store a translation for what I'm writing here, for instance.
What's more, I also believe several of us have some time-triggered events (such as when expiring membership access) where user location should be taken into account to calculate, like, midnight according to the right time-zone.
Finally there's also the need to support Right to Left user interfaces accoring to certain languages and the use of diferent encodings when reading submitted data files (parsing text and excel data, for instance)
Currently I'm storing all my translations for all my entities on a single table (not so pratical as it is very hard to find yourself when doing sql queries to look into a problem), setting UI translations mainly on satellite assemblies and not supporting neither time zones nor right to left design.
What are your experiences when dealing with these challenges?
[Edit]
I assume most people think that this level of multiculture requirement is just like building a huge project. As a matter of fact if you tihnk about an online survey where:
Answers will collected only until
midnight
Questionnaire definition and part of
the answers come from a text file
(in any language) as well as
translations
Questions and response options must
be displayed in several languages,
according to who is accessing it
Reports also have to be shown and
generated in several different
languages
As one can see, we do not have to go too far in an application to have this kind of requirements.
[Edit2]
Just found out my question is a duplicate
i18n in your projects
The first answer (when ordering by vote) is so compreheensive I have to get at least a part of it implemented someday.
Be very very cautious. From what you say about the i18n features you're trying to implement, I wonder if you're over-reaching.
Notice that the big boy (e.g. eBay, amazon.com, yahoo, bbc) web applications actually deliver separate apps in each language they want to support. Each of these web applications do consume a common core set of services. Don't be surprised if the business needs of two different countries that even speak the same language (e.g. UK & US) are different enough that you do need a separate app for each.
On the other hand, you might need to become like the next amazon.com. It's difficult to deliver a successful web application in one language, much less many. You should not be afraid to favor one user population (say, your Asian-language speakers) over others if this makes sense for your web app's business needs.
Go slow.
Think everything through, then really think about what you're doing again. Bear in mind that the more you add (like Right to Left) the longer your QA cycle will be.
The primary piece to your puzzle will be extensive use of interfaces on the code side, and either one data source that gets passed through a translator to whichever languages need to be supported, or separate data sources for each language.
The time issues can be handled by the interfaces, because presumably you will want things to function in the same fashion, but differ in the implementation details. To a large extent, a similar thought process can be applied to the creation of the interface when adjusting it to support differing languages. When you get down to it, skinning is exactly this, where the content being skinned is the interface, and the look/feel is the implementation.
Do what your users need. For instance, most programmer understand English, there is no sense to translate posts on this site. If many of your users need a translation, add a new table column with the language id, and another column to link a translated row to its original. If your target auditory contains the users from the Middle East, implement Right to Left. If time precision is critical up to an hour, add a time zone column to the user table, and so on.
If you're on *NIX, use gettext. Most languages I've used have some level of support; PHP's is pretty good, for instance.
I'll describe what has been done in my project (it wasn't my original architecture but I liked it anyways)
Providing Translation Support
Text which needs to be translated have been divided into three different categories:
Error text: Like errors which happen deep in the application business layer
UI Text: Text which is shown in the User interface (labels, buttons, grid titles, menus)
User-defined Text: text which needs to be translatable according to the final user's preferences (that is - the user creates a question in a survey and he can also create a translated version of that survey)
For each different cathegory the schema used to provide translation service is different - so that we have:
Error Text: A library with static functions which access resource files
UI Text: A "Helper" class which, linked to the view engine, provides translations from remote assemblies
User-defined Text: A table in the database which provides translations (according to typeID of the translated entity and object id) and is linked to the entity via a 1 x N relationship
I haven't, however, attacked the other obvious problems such as dealing with time zones, different layouts and picture translation (if this is really necessary). Does anyone have tackled this problem in a different way?
Has anyone ever tackled the other i18n problems?