Interesting Encoding - character-encoding

I have an interesting promblem with social network http://www.odnoklassniki.ru/.
When I use advanced searching my cyrillic symbols are encoded in no understantable symbols for me.
For Example:
Иван Иванов Encode %25D0%25B8%25D0%25B2%25D0%25B0%25D0%25BD%25D0%25BE%25D0%25B2+%25D0%25B8%25D0%25B2%25D0%25B0%25D0%25BD%25D0%25BE%25D0%25B2
Any ideas?

It's a double URL-encoded string. The %25 sequences represent the percent sign. Decoding once gives %D0%B8%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2+%D0%B8%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2.
Decoding again gives the UTF-8 string иванов иванов.

That's URL- or percent- encoding. The percent starts it. Then its the 4 hex-digits for the char. The + is the space.
See: http://en.wikipedia.org/wiki/Percent-encoding

Well, it appears to be twice URL encoded. If we unwrap it once, we get
%D0%B8%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2 %D0%B8%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2
and again, we get
иванов иванов
This appears to be UTF-8 with the bytes encoded separately.

Related

Postgres invalid byte sequence for encoding "UTF8": 0xc3 0x2f

I work with a payment API and it returns some XML. For logging I want to save the API response in my database.
One word in the API is "manhã" but the API returns "manh�". Other chars like á ou ç are being returned correctly, this is some bug in the API I guess.
But when trying to save this in my DB I get:
Postgres invalid byte sequence for encoding "UTF8": 0xc3 0x2f
How can I solve this?
I tried things like
response.encode("UTF-8") and also force_encode but all I get is:
Encoding::UndefinedConversionError ("\xC3" from ASCII-8BIT to UTF-8)
I need to either remove this wrong character or convert it somehow.
You’re on the right track - you should be able to solve the problem with the encode method - when the source encoding is known you should be able to simply use:
response.encode(‘UTF-8’, ‘ISO-8859-1’)
There may be times where there are invalid characters in the source encoding, and to get around exceptions, you can instruct ruby how to handle them:
# This will transcode the string to UTF-8 and replace any invalid/undefined characters with ‘’ (empty string)
response.encode(‘UTF-8’, 'ISO-8859-1', invalid: :replace, undef: :replace, replace: ‘’)
This is all laid out in the Ruby docs for String - check them out!
—--
Note, many people incorrectly assume that force_encode will somehow fix encoding problems. force_encode simply tags the string as the specified encoding - it does not transcode and replace/remove the invalid characters. When you're converting between encodings, you must transcode so that characters in one character set are correctly represented in the other character set.
As pointed out in the comment section, you can use force_encoding to transcode your string if you used: response.force_encoding('ISO-8859-1').encode('UTF-8') (which is equivalent to the first example using encode above).

How to convert arabic digits to numeric in ruby

Params include arabic digits also which I want to convert it into digits:-
"lexus/yr_٢٠٠١_٢٠٠٦"
I tried this one
params[:query].tr!('٠١٢٣٤٥٦٧٨٩','0123456789').delete!(" ")
but it gives an error
Encoding::CompatibilityError Exception: incompatible character encodings: UTF-8 and ASCII-8BIT
for that I do
params[:query].force_encoding('utf-8').encode.tr!('٠١٢٣٤٥٦٧٨٩','0123456789').delete!(" ")
how I can convert this?
If you have enforced UTF-8 encoding then this should work fine.
str = "lexus/yr_٢٠٠١_٢٠٠٦"
str.tr('٠١٢٣٤٥٦٧٨٩','0123456789')
returns "lexus/yr_2001_2006"
ASCII 8-bit is not really an encoding. It is binary data, not something text based. Transcoding ASCII 8-bit to UTF-8 is not a meaningful operation. I would recommend ensuring that the request that passes the query parameter through your textfield is using valid string encoding, if you can control this. You can use String#valid_encoding? method in ruby to check you are receiving a correctly encoded string.

Handing strings with binary data in it using java.nio

I am having issues parsing text files that have illegal characters(binary markers) in them. An answer would be something as follows:
test.csv
^000000^id1,text1,text2,text3
Here the ^000000^ is a textual representation of illegal characters in the source file.
I was thinking about using the java.nio to validate the line before I process it. So, I was thinking of introducing a Validator trait as follows:
import java.nio.charset._
trait Validator{
private def encoder = Charset.forName("UTF-8").newEncoder
def isValidEncoding(line:String):Boolean = {
encoder.canEncode(line)
}
}
Do you guys think this is the correct approach to handle the situation?
Thanks
It is too late when you already have a String, UTF-8 can always encode any string*. You need to go to the point where you are decoding the file initially.
ISO-8859-1 is an encoding with interesting properties:
Literally any byte sequence is valid ISO-8859-1
The code point of each decoded character is exactly the same as the value of the byte it was decoded from
So you could decode the file as ISO-8859-1 and just strip non-English characters:
//Pseudo code
str = file.decode("ISO-8859-1");
str = str.replace( "[\u0000-\u0019\u007F-\u00FF]", "");
You can also iterate line-by-line, and ignore each line that contains a character in [\u0000-\u0019\u007F-\u00FF], if that's what you mean by validating a line before processing it.
It also occurred to me that the binary marker could be a BOM. You can use a hex editor to view the values.
*Except those with illegal surrogates which is probably not the case here.
Binary data is not a string. Don't try to hack around input sequences that would be illegal upon conversion to a String.
If your input is an arbitrary sequence of bytes (even if many of them conform to ASCII), don't even try to convert it to a String.

Is % percentage a valid url character

I am trying to put a url, something like the following urn:test.project:123, as part of the url.
Does it make sense to encode urn:test.project:123 into urn%3atest.project%3a123 and decode it back to urn:test.project:123 at the receiver end?
http://{domain}/abc/urn%3atest.project%3a123/Manifest
Yes, it's a valid character. It's the escape character for URLs in a similar way to how the ampersand & is the escape character for xml/html, and the backslash \ is the escape character for string literals in c-like languages. It's the (very important) character that allows you to specify (through an escape sequence) all the other characters that wouldn't be valid in a URL.
(And yes, it makes sense to encode such a string so it's a legal URL, and as #PaulPRO mentions, most frameworks will automatically decode it for you on the server-side.)
Yes, the %3a means that 3a is the HEX encoded value for ':'
If you put it in the url as %3a your server will most likely automatically decode it.

What encoding type of these text?

When I search in Google by Thai language. Google will convert like these.
%E0%B8%A0%E0%B8%B2%E0%B8%A9%E0%B8%B2%E0%B9%84%E0%B8%97%E0%B8%A2
URL Encoding: See http://www.w3schools.com/tags/ref_urlencode.asp
It's a URL encoding in which all
non-alphanumeric characters except
-_. are replaced with a percent (%)
sign followed by two hex digits and
spaces encoded as plus (+) signs. It
is encoded the same way that the
posted data from a WWW form is
encoded, that is the same way as in
application/x-www-form-urlencoded
media type.
(Information copied from http://php.net/manual/en/function.urlencode.php)
UTF-8 + URL Encoding.

Resources