How to convert Arabic digits to numeric in Ruby - ruby-on-rails

The params include Arabic digits, which I want to convert to Western digits:
"lexus/yr_٢٠٠١_٢٠٠٦"
I tried this:
params[:query].tr!('٠١٢٣٤٥٦٧٨٩','0123456789').delete!(" ")
but it raises an error:
Encoding::CompatibilityError Exception: incompatible character encodings: UTF-8 and ASCII-8BIT
To work around that, I tried:
params[:query].force_encoding('utf-8').encode.tr!('٠١٢٣٤٥٦٧٨٩','0123456789').delete!(" ")
How can I convert this?

If you have enforced UTF-8 encoding, this should work fine:
str = "lexus/yr_٢٠٠١_٢٠٠٦"
str.tr('٠١٢٣٤٥٦٧٨٩','0123456789')
returns "lexus/yr_2001_2006"
ASCII-8BIT is not really an encoding: it marks the string as binary data, not text. Transcoding ASCII-8BIT to UTF-8 is not a meaningful operation. I would recommend ensuring that the request that passes the query parameter from your text field uses a valid string encoding, if you can control this. You can use Ruby's String#valid_encoding? method to check that you are receiving a correctly encoded string.
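Putting the pieces together, here is a minimal, self-contained sketch of that approach. The binary-tagged sample string stands in for whatever the request layer delivers, and the sketch assumes the underlying bytes really are UTF-8 that was merely mislabelled:
query = "lexus/yr_٢٠٠١_٢٠٠٦".b  # .b stands in for the ASCII-8BIT label coming from the request

query.force_encoding('UTF-8')
if query.valid_encoding?
  puts query.tr('٠١٢٣٤٥٦٧٨٩', '0123456789').delete(' ')
  # => lexus/yr_2001_2006
else
  # The bytes were not UTF-8 after all; reject or log the input instead of letting tr raise.
  raise ArgumentError, 'query is not valid UTF-8'
end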

Related

Postgres invalid byte sequence for encoding "UTF8": 0xc3 0x2f

I work with a payment API and it returns some XML. For logging I want to save the API response in my database.
One word in the API is "manhã" but the API returns "manh�". Other chars like á or ç are returned correctly, so I guess this is a bug in the API.
But when trying to save this in my DB I get:
Postgres invalid byte sequence for encoding "UTF8": 0xc3 0x2f
How can I solve this?
I tried things like
response.encode("UTF-8") and also force_encoding, but all I get is:
Encoding::UndefinedConversionError ("\xC3" from ASCII-8BIT to UTF-8)
I need to either remove this wrong character or convert it somehow.
You're on the right track: you should be able to solve the problem with the encode method. When the source encoding is known, you can simply use:
response.encode('UTF-8', 'ISO-8859-1')
There may be times when there are invalid characters in the source encoding; to get around the resulting exceptions, you can instruct Ruby how to handle them:
# This will transcode the string to UTF-8 and replace any invalid/undefined characters with '' (empty string)
response.encode('UTF-8', 'ISO-8859-1', invalid: :replace, undef: :replace, replace: '')
This is all laid out in the Ruby docs for String - check them out!
---
Note that many people incorrectly assume that force_encoding will somehow fix encoding problems. force_encoding simply tags the string as the specified encoding; it does not transcode and replace/remove the invalid characters. When you're converting between encodings, you must transcode so that characters in one character set are correctly represented in the other character set.
As pointed out in the comment section, you can combine force_encoding with encode to transcode your string: response.force_encoding('ISO-8859-1').encode('UTF-8') (which is equivalent to the first encode example above).
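To make the difference concrete, here is a small sketch; the single byte \xE3 is just ISO-8859-1 "ã" used for illustration, and the real response bytes from the API may of course differ:
# The API response arrives tagged as ASCII-8BIT (binary) but actually holds ISO-8859-1 bytes.
response = "manh\xE3".b

response.dup.force_encoding('UTF-8').valid_encoding?       # => false (relabelling alone does not help)
response.dup.force_encoding('ISO-8859-1').encode('UTF-8')  # => "manhã"

# If some bytes still cannot be converted, drop them instead of raising:
response.encode('UTF-8', 'ISO-8859-1', invalid: :replace, undef: :replace, replace: '')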

Lua hex string to ASCII?

I want to convert a hex string to an ASCII character (for the game ROBLOX).
Here's the page for the ASCII icon:
http://www.fileformat.info/info/unicode/char/25ba/index.htm
Although I'm not even sure that Lua supports that icon.
EDIT:
Turns out ROBLOX doesn't support UTF-8 symbols at all due to their 'chat filtering'.
Strings in Lua are encoding-agnostic and you can just use the character in the string:
print"►"
Alternatively:
Output the Unicode code point directly with print"\u{25BA}" (Lua 5.3 and later).
Output the UTF-8 encoding directly with print"\xE2\x96\xBA".
Output the UTF-8 encoding directly with print"\226\150\186".

How to properly handle invalid bytes in UTF-8 strings?

I have a string with encoding ASCII-8BIT:
str = 'quindi \xE8 al \r\ngoverno'
I want to transcode it to UTF-8 to avoid problems displaying the characters.
Naturally, \xE8 is not a valid byte sequence in UTF-8, so when I try:
str.encode 'utf-8'
I get:
UndefinedConversionError "\xE8" from ASCII-8BIT to UTF-8
Reading the docs for the encode method, I came up with this solution:
encode('UTF-8', invalid: :replace, undef: :replace)
This way all the invalid sequences are replaced with ?. But I want to display the proper characters instead of ?; the text contains different escape sequences such as \xE8, \xE0, ...
Is there a way to automatically replace them with the right escaped char?
Your string seems to be ISO-8859-1 encoded. This should work:
str = "quindi \xE8 al \r\ngoverno"
str.force_encoding('ISO-8859-1').encode('UTF-8')
#=> "quindi è al \r\ngoverno"
Note that you have to use double quotes.
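If strings like this arrive from a source whose encoding you cannot fully trust, one possible pattern is to repair only when the bytes are not already valid UTF-8. This is a sketch that assumes ISO-8859-1 is the right fallback; the helper name is made up:
def to_utf8(str)
  # Relabel the incoming bytes as UTF-8; if they already form valid UTF-8, keep them.
  utf8 = str.dup.force_encoding('UTF-8')
  return utf8 if utf8.valid_encoding?

  # Otherwise assume the bytes are ISO-8859-1 and transcode them properly.
  str.dup.force_encoding('ISO-8859-1').encode('UTF-8')
end

to_utf8("quindi \xE8 al \r\ngoverno")  # => "quindi è al \r\ngoverno"
to_utf8("già valid UTF-8")             # => "già valid UTF-8"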

Handling strings with binary data in them using java.nio

I am having issues parsing text files that have illegal characters (binary markers) in them. An example of the input is as follows:
test.csv
^000000^id1,text1,text2,text3
Here the ^000000^ is a textual representation of illegal characters in the source file.
I was thinking about using java.nio to validate each line before I process it. So, I was thinking of introducing a Validator trait as follows:
import java.nio.charset._

trait Validator {
  private def encoder = Charset.forName("UTF-8").newEncoder

  def isValidEncoding(line: String): Boolean = {
    encoder.canEncode(line)
  }
}
Do you guys think this is the correct approach to handle the situation?
Thanks
It is too late once you already have a String; UTF-8 can always encode any string*. You need to go back to the point where you are decoding the file initially.
ISO-8859-1 is an encoding with interesting properties:
Literally any byte sequence is valid ISO-8859-1
The code point of each decoded character is exactly the same as the value of the byte it was decoded from
So you could decode the file as ISO-8859-1 and just strip non-English characters:
// Decode every byte as ISO-8859-1 (this can never fail), then strip
// control characters and the Latin-1 supplement range:
val bytes   = java.nio.file.Files.readAllBytes(java.nio.file.Paths.get("test.csv"))
val str     = new String(bytes, java.nio.charset.StandardCharsets.ISO_8859_1)
val cleaned = str.replaceAll("[\\u0000-\\u0019\\u007F-\\u00FF]", "")
You can also iterate line-by-line, and ignore each line that contains a character in [\u0000-\u0019\u007F-\u00FF], if that's what you mean by validating a line before processing it.
It also occurred to me that the binary marker could be a BOM. You can use a hex editor to view the values.
*Except those with illegal surrogates which is probably not the case here.
Binary data is not a string. Don't try to hack around input sequences that would be illegal upon conversion to a String.
If your input is an arbitrary sequence of bytes (even if many of them conform to ASCII), don't even try to convert it to a String.

Interesting Encoding

I have an interesting problem with the social network http://www.odnoklassniki.ru/.
When I use the advanced search, my Cyrillic characters are encoded into symbols I cannot understand.
For Example:
Иван Иванов is encoded as %25D0%25B8%25D0%25B2%25D0%25B0%25D0%25BD%25D0%25BE%25D0%25B2+%25D0%25B8%25D0%25B2%25D0%25B0%25D0%25BD%25D0%25BE%25D0%25B2
Any ideas?
It's a double URL-encoded string. The %25 sequences represent the percent sign. Decoding once gives %D0%B8%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2+%D0%B8%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2.
Decoding again gives the UTF-8 string иванов иванов.
That's URL encoding, also known as percent-encoding. Each % introduces two hex digits for one byte, so a two-byte UTF-8 character such as these Cyrillic letters becomes two %XX sequences. The + is the space.
See: http://en.wikipedia.org/wiki/Percent-encoding
Well, it appears to be twice URL encoded. If we unwrap it once, we get
%D0%B8%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2 %D0%B8%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2
and again, we get
иванов иванов
This appears to be UTF-8 with each byte percent-encoded separately.
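A quick way to verify this yourself, sketched in Ruby using the standard CGI module (only the first word of the sample is used here):
require 'cgi'

encoded = "%25D0%25B8%25D0%25B2%25D0%25B0%25D0%25BD%25D0%25BE%25D0%25B2"

once  = CGI.unescape(encoded)  # => "%D0%B8%D0%B2%D0%B0%D0%BD%D0%BE%D0%B2"
twice = CGI.unescape(once)     # => "иванов"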
