charset tables for sphinx in utf8 - character-encoding

I am searching for charset_tables for some languages for Sphinx. I don't know how to build them and I would be very thankful if anyone can help me.
I need the charset_tables for: Danish, Finnish, Norwegian, Hungarian, Slovak, Polish, Spanish, French.
I have found this post (http://sphinxsearch.com/forum/view.html?id=19#2857) with character values for common European languages, but for which ones?
Thank you very much!
Best Regards
Nik

Use this link to write the charset_table you need: it is a list of Unicode characters in a suitable format.
But you will need to spend some time to build your own charset_table.
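As a starting point, a charset_table fragment covering the Latin-1 accented range (enough for Danish, Norwegian, Spanish, and French; Polish, Hungarian, and Slovak also need mappings from the Latin Extended-A block, U+0100..U+017F) could look like the sketch below. This is illustrative only, not a verified table; check each range against the Unicode charts before using it:

```
# sphinx.conf index fragment (illustrative sketch; verify the ranges)
charset_table = 0..9, A..Z->a..z, _, a..z, \
    U+00C0..U+00D6->U+00E0..U+00F6, U+00E0..U+00F6, \
    U+00D8..U+00DE->U+00F8..U+00FE, U+00F8..U+00FE
```

The `X..Y->x..y` entries fold accented uppercase letters onto their lowercase forms, so searches are case-insensitive for those characters too.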

Related

What should I name my variables in a translation project?

I'm abstracting strings from views and I don't want to name my variables after the strings... what should I name them? I also don't want to number them in case I insert a new string into a view at some point.
I want short names that are easy to reference, not hard to put into my brain's short term memory, and not confusing to my translators.
The current version is in English, the future versions will be in Chinese, Spanish, Vietnamese, and Tagalog, in addition to English.
Use long descriptive names that are similar to the original string, based on this Mozilla article:
https://developer.mozilla.org/en-US/docs/Mozilla/Localization/Localization_content_best_practices
Thanks to @davejagoda for the advice.
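To make that concrete, here is a minimal sketch (all keys and strings are illustrative, not from the question): descriptive, purpose-based keys survive inserting or reordering strings in a view, which numbered keys would not.

```ruby
# Hypothetical string table; keys describe each string's purpose and
# location, not its position in the view or its English wording.
STRINGS = {
  "login-form-title"             => "Sign in to your account",
  "login-error-invalid-password" => "The password you entered is incorrect",
  "checkout-button-confirm"      => "Confirm order",
}

STRINGS["checkout-button-confirm"]  # => "Confirm order"
```

Inserting a new string anywhere only means adding a new key; nothing needs renumbering, and translators can see from the key what the string is for.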

Translating and localizing technical words into other languages

I'm currently translating a website from English into other languages but have a problem when it comes to technical terms (non-words) like "crontab".
Should I keep the English term, or is there another way to find an equivalent?
These aren't actually English words, and when it comes to languages like Japanese, I'm at a loss as to what to do.
Here's an example sentence:
"Use crontab to schedule scripts."
which translated into Japanese via Google Translate becomes:
"スクリプトをスケジュールするcrontabを使用してください。"
You can see how bizarre this looks, and I'm wondering if the sentence could even be understood by a Japanese speaker.
What do I do in these situations?
Using English words in Japanese
Talking about the word crontab, I think it's not bizarre to write it in English in a Japanese sentence like this:
crontabを使用してください
(please use crontab)
On the Japanese Wikipedia, you can see how crontab is used without being translated into Japanese.
http://ja.wikipedia.org/wiki/Crontab
In Japanese technical writing, especially when you mention the names of tools, it is common to use the English term as is, without translating it into Japanese.
Using Katakana
You could also write the sentence like below using Katakana.
クーロンタブを使用してください
(please use crontab).
Japanese usually writes loanwords from English in Katakana. Katakana is phonetic; in other words, each character represents a sound (not a meaning). But in this case, it doesn't look natural.
Mistranslation
There is a mistranslation in your Japanese sentence.
スクリプトをスケジュールするcrontabを使用してください。
(Please use crontab, which schedules a script.)
To correct this, you could go like this:
スクリプトをスケジュールするには、crontabを使用してください。
(In order to schedule a script, please use crontab.)
Hope this helps.

YouTube API (Search) in other languages (Arabic, Korean, etc.)

This time I have one question: how do I search through the YouTube API?
English searches work perfectly, but other languages (Arabic, Korean, etc.) don't.
http://gdata.youtube.com/feeds/api/videos?q=(SEARCH_WORD)&start-index=1&max-results=3&v=2
=> my access code...
I'd like to search in Arabic or Korean; any help is appreciated.
Have a nice day!
Please try
http://gdata.youtube.com/feeds/api/videos?q=%EA%B0%95%EB%82%A8%EC%8A%A4%ED%83%80%EC%9D%BC&start-index=1&max-results=3&v=2
or
http://gdata.youtube.com/feeds/api/videos?q=\uac15\ub0a8\uc2a4\ud0c0\uc77c&start-index=1&max-results=3&v=2
on your browser.
Cheers
The more general concept underlying the other answer is that the 'q' parameter (all parameters, really) only accepts valid Unicode characters; so if the string you're trying to search on is in some other encoding, its bytes will be interpreted as Unicode, and the search will effectively run on random characters (returning no results).
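A sketch in Ruby of what the first working URL above is doing: the percent-encoded query is just the UTF-8 bytes of the Korean search term 강남스타일, percent-encoded (`URI.encode_www_form_component` is Ruby standard library; the URL itself is the one from the question):

```ruby
require 'uri'

# The Korean search term from the example URLs above.
query = "강남스타일"

# Percent-encode the UTF-8 bytes so every code point survives the trip
# through the 'q' parameter.
encoded = URI.encode_www_form_component(query)
# => "%EA%B0%95%EB%82%A8%EC%8A%A4%ED%83%80%EC%9D%BC"

url = "http://gdata.youtube.com/feeds/api/videos?" \
      "q=#{encoded}&start-index=1&max-results=3&v=2"
```

Any language works the same way, as long as the query string is UTF-8 before it is percent-encoded.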

Best practice for SEO URLs (ASCII vs. urlencoded UTF-8)?

I'm building a website where I need to build a URL from an article title. The first option is to convert all UTF-8 to ASCII. This can be done, because every language has some kind of romanization available. But I don't know if, for example, romanized versions of titles make any sense to Chinese readers.
The second option is to urlencode the UTF-8 title like Wikipedia does: http://ar.wikipedia.org/wiki/سيارة
What are the pluses and minuses of both options? Which version is better to use?
Google, for one, has no problems indexing and listing sites with Unicode characters outside of 7-bit ASCII.
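Both options from the question can be sketched in Ruby (the Arabic title is the one from the Wikipedia example above; the Latin title and the accent table are illustrative, and real romanization of Chinese or Arabic would need a transliteration library):

```ruby
require 'erb'

arabic_title = "سيارة"          # "car", the Wikipedia example above
latin_title  = "Crème brûlée"   # an illustrative Latin-script title

# Option 1: romanize to an ASCII slug. The tr table here only covers a
# handful of French accents; it works for Latin-based scripts only.
slug = latin_title.downcase.tr('èéêëàâîïôûùç', 'eeeeaaiiouuc').tr(' ', '-')
# => "creme-brulee"

# Option 2: keep the UTF-8 title and percent-encode it, as Wikipedia does.
encoded = ERB::Util.url_encode(arabic_title)
# => "%D8%B3%D9%8A%D8%A7%D8%B1%D8%A9"
```

Note that the percent-encoded form is only how the URL looks on the wire; modern browsers display the decoded Arabic title in the address bar.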

Better alternative in letter substitution

Is there any better alternative to this?
name.gsub('è','e').gsub('à','a').gsub('ò','o').gsub('ì','i').gsub('ù','u')
thanks
Use tr.
Maybe like string.tr('èàòìù', 'eaoiu').
substitutes = { 'è' => 'e', 'à' => 'a', 'ò' => 'o', 'ì' => 'i', 'ù' => 'u' }
substitutes.each do |old, new|
  name.gsub!(old, new)
end
Or you could use an extension of String such as this one to do it for you.
If you really want a full solution, try pulling the tables from Perl's Unidecode module. After translating those tables to Ruby, you'll want to loop over each character of the input, substituting the table's value for that character.
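A sketch of that table-driven approach (the table here is a tiny hypothetical excerpt; a real one would be generated from the data in Perl's Unidecode module):

```ruby
# Hypothetical excerpt of a Unidecode-style transliteration table.
UNIDECODE_TABLE = { 'è' => 'e', 'ø' => 'o', 'ß' => 'ss' }

# Loop over each character, substituting the table's value when present
# and keeping the character as-is otherwise.
def transliterate(str)
  str.each_char.map { |c| UNIDECODE_TABLE.fetch(c, c) }.join
end

transliterate("èøß")  # => "eoss"
```

Unlike tr, a table like this can map one character to several (ß to "ss"), which is why Unidecode uses string-valued tables.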
Taking a wild stab in the dark, but if you're trying to remove the accented characters because you're using a legacy text encoding, you should look at Iconv.
Here is a great introduction to the subject: http://blog.grayproductions.net/articles/encoding_conversion_with_iconv
In case you are wondering, the technical terms for what you want to do are case folding and possibly Unicode normalization (and sometimes collation).
Here is a case folding configuration for ThinkingSphinx to give you an idea of how many characters you need to worry about.
If JRuby is an option, see the answer to my question:
How do I detect unicode characters in a Java string?
It deals with removing accents from letters using a Normalizer. You could access that class from JRuby.
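If JRuby is not an option, plain Ruby (2.2+) can do the same accent-stripping as the Java Normalizer with String#unicode_normalize. A sketch (the method is standard library; whether this fits your use case is an assumption):

```ruby
# NFD decomposes each accented letter into its base letter plus combining
# marks; the gsub then strips the marks (Unicode category Mn).
def strip_accents(str)
  str.unicode_normalize(:nfd).gsub(/\p{Mn}/, '')
end

strip_accents("èàòìù")  # => "eaoiu"
```

This only handles letters that decompose into base + mark; characters like ø or ß have no such decomposition and would need a table-based approach.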
