I am setting up a Moses Machine Translation System from Arabic to English.
In which format should the arabic text file be, should I input the text file as it is or should I reverse the word order of each sentence?
In other words, does the Moses tokenizer need arabic reversed or as is?
No, it should not be reversed.
Related
I'm trying to reproduce a character conversion...
Essentially out of the Chinese word for Login. In this example the Chinese word for Login, "登录"should be converted into this text instead "µÇ¼".
It would be nice if there was a piece of software that did this for me already...
I am working on an iOS app that is multi language (English & Arabic). Given that the Middle East app markets cannot be localized with additional languages other than English, I have to include Arabic description and keywords.
The meta keywords (100 char.) are tricky as I need to include both English (which is written left to right) and Arabic (which is written right to left).
My questions are follows:
What do I used to separate the English and Arabic keywords? Do I used the English comma separator (,) or the Arabic comma separator (،).
Is the following example the correct way of doing it (food,guide,طعام,دليل)?
Thanks you.
I'm currently translating a website from english into other languages but have a problem when it comes to technical terms (non words) like "crontab".
Should I keep the english translation or is there another way to find the equivalent?
These aren't actually english words and when it comes to languages like Japanese, I'm at a loss as to what to do.
Here's an example sentence as an example:
"Use crontab to schedule scripts."
which translated into Japanese via Google Translate becomes:
"スクリプトをスケジュールするcrontabを使用してください。"
You can see how bizarre this looks, and I'm wondering if the sentence could even be understood by a Japanese speaker.
What do I do in these situations?
Using English words in Japanese
Talking about the word crontab, I think it's not bizarre to write it in English in a Japanese sentence like this:
crotabを使用してください
(please use crontab)
On Japanese wikipedia, you can see how crontab is used without translating into Japanese.
http://ja.wikipedia.org/wiki/Crontab
In Japanese technical writing, especially when you mention name of tools, it is common to use English as it is without translating into Japanese.
Using Katakana
You could also write the sentence like below using Katakana.
クーロンタブを使用してください
(please use crontab).
Japanese usually writes words from English in Katakana. Japanese Katakana is phonetic, in other words each character represents a sound (not meaning). But In this case, it doesn't look natural.
Mistranslation
There is a mistranslation in your Japanese sentence.
スクリプトをスケジュールするcrontabを使用してください。
(Please use crontab which scedule a script.)
To correct this, you could go like this:
スクリプトをスケジュールするには、crontabを使用してください。
(In order to schedule a script, please use crontab.)
Hope this helps.
Given a character (one letter of a string), how could I identify to which language it belongs ? The options are: English, Russian, Hebrew.
Background: this character was entered by user in a form and then stored in a database.
It can be for example the first letter in one of these words:
Hello
Привет
שלום
The UNICODE standard is divided into "blocks". Go here:
http://www.unicode.org/charts/
http://en.wikipedia.org/wiki/Unicode_block
http://www.unicode.org/versions/Unicode6.0.0/
and find unicode blocks (intervals) for each language.
My guess:
English
Hebrew
Russian
So for you its the matter of simple number comparsion for each character (unicode ordinal value). Very simple.
I know utf8,but what's the difference between *.utf8?
From the answer to my post
Locale = ja_JP
Encoding = UTF-8
Before Unicode, handling non-english characters was done using tricks like Code Pages (like this) and special character sets (like this: Shift_JIS). UTF-8 contains a much larger range of characters with a completely different mapping system (i.e. the way each character is addressed by number).
When setting ja_JP.UTF8 as the locale, the "UTF8" part signifies the encoding for the special characters needed. For example, when you output a currency amount in the Japanese locale, you will need the ¥ character. The encoding information defines which character set to use to display the ¥.
I'm assuming there could exist a ja_JP.Shift_JIS locale. One difference to the UTF8 one - among others - would be that the ¥ sign is displayed in a way that works in this specific encoding.
Why ja_JP?
The two codes ja_JP signify language (I think based on this ISO norm) and country (based on this one). This is important if a language is spoken in more than one country. In the german speaking area, for example, the Swiss format numbers differently than the germans: 1'000'000 vs. 1.000.000. The country code serves to define these distinctions within the same language.
In which context? ja_JP tells us that the string is in the Japanese language. That does not have anything to do with the character encoding, but is probably used - depending on context - for sorting, keyboard input and language on displayed text in the program.
At a guess, I'd say each utf8 file with that naming convention contains a language definition for translating your site.
It's a locale name. The basic format is language_COUNTRY. ja = Japanese language, JP = Japan.
In addition to a date format, currency symbol, etc., each locale is associated with a character encoding. This is a historical legacy from the days when every language had its own encoding. Now, UTF-8 provides a common encoding for every locale.
The reason .UTF8 is part of the locale name is to distinguish it from older locales with a different encoding. For example, there's an ja_JP.EUC-JP locale available on my system. And for Germany, there's the choice of de_DE (obsolete pre-Euro locale with ISO-8859-1 encoding), de_DE#euro (ISO-8859-15 encoding, to provide the € sign), and de_DE.UTF-8.