URL for russian language - cyrillic or latin - url

I am doing website translation to russian language and I dont know what is better, more used for URL namespace - cyrillic or latin, for example xyz.com/команда or xyz.com/komanda. I found when I was googling more sites in latin. Is it more used in Russia?

depends on server, usually browser converts not Latin url symbols to UTF-8 and sends, server may accept them or reject.
For example, all these 3 URL below are identical:
https://www.rabota.ru/vacancy/программист/
https://www.rabota.ru/vacancy/%D0%BF%D1%80%D0%BE%D0%B3%D1%80%D0%B0%D0%BC%D0%BC%D0%B8%D1%81%D1%82/
Cyrillic symbols UTF-8 encoding:
%D0%BF - п
%D1%80 - р
%D0%BE - о
and so on

Related

What encoding is this and how do I turn it into something I can see properly?

I'm writing a script that will operate on the subtitle files of a popular streaming service (Netfl*x).
The subtitle files have strange characters in them and I can't get them to render in a way that my text editors or web browser will display in a readable way. The xml encoding says UTF-8, but some characters are not readable.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<tt xmlns:tt="http://www.w3.org/ns/ttml" xmlns:ttm="http://www.w3.org/ns/ttml#metadata" xmlns:ttp="http://www.w3.org/ns/ttml#parameter" xmlns:tts="http://www.w3.org/ns/ttml#styling" ttp:tickRate="10000000" ttp:timeBase="media" xmlns="http://www.w3.org/ns/ttml">
<p>de 15 % la nuit dernière.</span></p>
<p>if youâve got things to doâ¦</span></p>
And in Vim:
This is what it looks like in the browser:
How can I convert this into something I can use?
I'll go out on a limb and say that file is UTF-8 encoded just fine, and you're merely looking at it using the wrong encoding. The character À encoded in UTF-8 is C3 80. C3 in ISO-8859-1 is Ã, which in your screenshot is followed by an 80. So looks like you're looking at a UTF-8 file using the (wrong) ISO-8859 encoding.
Use the correct encoding when opening the file.
My terminal is set to en_US.UTF-8, but was also rendering this supposedly UTF-8 encoded file incorrectly (sonné -> sonné). I was able to solve this by using iconv to encode the file in ISO8859-1.
iconv original.xml -t ISO8859-1 -o converted.xml
In the new file, the characters were properly rendered, although I don't quite understand why.

Ruby Parsing: Error when I trying to put Cyrillic symbols into URL request

I wrote a Bot for Telegram, where users can receive images for their requests. But there was one problem, which I could not solve.
Some example with parsing on Ruby:
json_object = JSON.parse(open("https://api.site.com/search/photos?query=" + message.text + "&per_page=10&client_id=42324d2lkedi234fs342dfse2c038fdfsdfs").read)
message.text - It's a field with request from users.
Everything works fine with latin literals, but when I send Cyrillic(API also supports Cyrillic alphabet) symbols I get the below error:
/Users/me/.rvm/rubies/ruby-2.4.0/lib/ruby/2.4.0/uri/rfc3986_parser.rb:21:in
`split': URI must be ascii only
"https://api.site.com/search/photos?query=\u0432\u0430\u0432\u0430&per_page=10&client_id=42324d2lkedi234fs342dfse2c038fdfsdfs"
(URI::InvalidURIError)
I used Encoding with utf-8 and win-1252, but nothing helped. How should this be fixed?
You should encode your cyrillic string:
URI.encode('http://google.com?1=АБВ') # => "%D0%90%D0%91%D0%92"
So, use it like this (or encode whole url):
URI.encode(message.text)
Try with
"anything".parameterize.underscore.humanize.downcase

BlackBerry - language support for Chinese

I have localised my app by adding the correct resource files for various European languages / dialects.
I have the required folder in my project: ./res/com/demo/localization
It contains the required files e.g. Demo.rrh, Demo.rrc, Demo_de.rrc etc.
I want to add support for 2 Chinese dialects, and I have the translations in an Excel file. On iPhone, they are referred to by the codes zh_TW & zh_CM. Following the pattern with German, I created 2 extra files called Demo_zh_TW.rrc & Demo_zh_CN.rrc.
I opened file Demo_zh_CN.rrc using Eclipse's text editor, and pasted in line of the Chinese translation using the normal resource file format:
START_LOCATION#0="开始位置";
When I tried to save the file, I got Eclipse's error about the Cp1252 character encoding:
Save could not be completed.
Reason:
Some characters cannot be mapped using "Cp1252" character encoding.
Either change the encoding or remove the characters which are not
supported by the "Cp1252" character encoding.
It seems the Eclipse editor will accept the Chinese characters, but the resource tool expects that these characters must be saved in the resource file as Java Unicode /u encoding.
How do I add language support for these 2 regions without manually copy n pasting in each string?
Is there maybe a tool that I can use to Java Unicode /u encode the strings from Excel so they can be saved in Code page 1252 Latin chars only?
I'm not aware of any readily available tools for working with BlackBerry's peculiar localization style.
Here's a snippet of Java-SE code I use to convert the UTF-8 strings I get for use with BlackBerry:
private static String unicodeEscape(String value, CharsetEncoder encoder) {
StringBuilder sb = new StringBuilder();
for(char c : value.toCharArray()) {
if(encoder.canEncode(c)) {
sb.append(c);
} else {
sb.append("\\u");
sb.append(hex4(c));
}
}
return sb.toString();
}
private static String hex4(char c) {
String ret = Integer.toHexString(c);
while(ret.length() < 4) {
ret = "0" + ret;
}
return ret;
}
Call unicodeEscape with the 8859-1 encoder with Charset.forName("ISO-8859-1").newEncoder()
I suggest you look at Blackberry Hindi and Gujarati text display
You need to use the resource editor to make these files with the right encoding. Eclipse will escape the characters automatically.
This is a problem with the encoding of your resource file. 1252 Code Page contains Latin characters only.
I have never worked with Eclipse, but there should be somewhere you specify the encoding of the file, you should set your default encoding for files to UTF-8 if possible. This will handle your chinese characters.
You could also use a good editor like Notepad++ or EMEditor to set the encoding of your file.
See here for how you can configure Eclipse to use UTF-8 by default.

XML Parsing with parseFromString with portuguese characters

I have a Blackberry app developed using PhoneGap. I am using suds client to call web service. There are some Portuguese character in the webservice XML. I am not able to parse to XMLDoc using the DOMParser.
I am using
xmlDoc = parser.parseFromString(_xml, "text/xml");
The encoding type is UTF-8. Without the Portuguese character, parsing is working perfectly.
"I am using is UTF-8 encoding type." - this can mean several things, so it is unclear what exactly you do in order to support UTF-8 end-to-end.
E.g. you should check:
your web service really sends data in UTF-8 (when it converts string chars into bytes to be sent into output stream it should use UTF-8)
the device code that reads data from web really uses UTF-8 to convert bytes to string _xml
P.S. I'm not familiar with phonegap API so this is just a general plan.

HTML decoding in C/C++

I'm using libcurl for getting HTML pages.
I have some problems with Hebrew characters.
for example this: סלקום
gets gibberish.
How do I get Hebrew characters and not gibberish?
Do I need some HTML decoder?
Does libcurl support such operation?
Does libiconv support such operation?
I appreciate any help.
Thanks
Edit: Ok, so what you’re seeing is UTF-8 data being decoded as Windows-1252 (so the numeric character references were a red herring). Here’s a demonstration in Python:
>>> u = ''.join(map(unichr, [1505, 1500, 1511, 1493, 1501]))
>>> s = u.encode('utf-8')
>>> print s.decode('cp1255', 'replace')
׳¡׳�׳§׳•׳�
The solution to this problem depends on the environment in which the output is displayed. Merely outputting the bytes received and expecting them to be interpreted as characters leads to problems like this.
An HTML document typically contains a header tag like <meta charset=utf-8> to indicate to the browser what its encoding should be. A document served by a web server contains an HTTP header like Content-Type: text/html; charset=utf-8.
You should ask libcurl for the Content-Type HTTP header to know the encoding of the document, and then convert it to the system encoding using iconv. While in your case that would be codepage 1255, it depends on the user’s system and so you should look up the appropriate functions to detect that.
(Read Unicode and Character Sets and the character-encoding tag on this site for a wealth of further information.)

Resources