So I forked this gem on GitHub, thinking that I may be able to fix and update some of the issues with it for use in a Rails project. I basically get this output:
irb(main):020:0> query = Query::simpleQuery('xx.xxx.xxx.xx', 25565)
=> [false, #<Encoding::CompatibilityError: incompatible character encodings: UTF-8 and ASCII-8BIT>]
irb(main):021:0> query = Query::simpleQuery('xx.xxx.xxx.xx', 25565)
=> {:motd=>"Craftnet", :gametype=>"SMP", :map=>"world", :numplayers=>"0", :maxplayers=>"48"}
The first response is the example of the Encoding error, and the second is the wanted output (IP's taken out). Basically this is querying a Minecraft server for information on it.
I tried using
Encoding.default_external = Encoding::UTF_8
Encoding.default_internal = Encoding::UTF_8
But that just gave the same response, randomly spitting encoding errors and not.
Here is the relevant GitHub repo with all the code: RubyMinecraft
Any help would be greatly appreciated.
In the Query class there is this line:
#key = Array(key).pack('N')
This creates a String with an associated encoding of ASCII-8BIT (i.e. it’s a binary string).
Later #key gets used in this line:
query = #sock.send("\xFE\xFD\x00\x01\x02\x03\x04" + #key, 0)
In Ruby 2.0 the default encoding of String literals is UTF-8, so this is combining a UTF-8 string with a binary one.
When Ruby tries to do this it first checks to see if the binary string only contains 7-bit values (i.e. all bytes are less than or equal to 127, with the top byte being 0), and if it does it considers it compatible with UTF-8 and so combines them without further issue. If it doesn’t, (i.e. if it contains bytes greater than 127) then the two strings are not compatible and an Encoding::CompatibilityError is raised.
Whether an error is raised depends on the contents of #key, which is initialized from a response from the server. Sometimes this value happens to contain only 7-bit values, so no error is raised, at other times there is a byte with the high bit set, so it generates an error. This is why the errors appear to be “random”.
To fix it you can specify that the string literal in the line where the two strings are combined should be treated as binary. The simplest way would be to use force_encoding like this:
query = #sock.send("\xFE\xFD\x00\x01\x02\x03\x04".force_encoding(Encoding::ASCII_8BIT) + #key, 0)
Related
I work with a payment API and it returns some XML. For logging I want to save the API response in my database.
One word in the API is "manhã" but the API returns "manh�". Other chars like á ou ç are being returned correctly, this is some bug in the API I guess.
But when trying to save this in my DB I get:
Postgres invalid byte sequence for encoding "UTF8": 0xc3 0x2f
How can I solve this?
I tried things like
response.encode("UTF-8") and also force_encode but all I get is:
Encoding::UndefinedConversionError ("\xC3" from ASCII-8BIT to UTF-8)
I need to either remove this wrong character or convert it somehow.
You’re on the right track - you should be able to solve the problem with the encode method - when the source encoding is known you should be able to simply use:
response.encode(‘UTF-8’, ‘ISO-8859-1’)
There may be times where there are invalid characters in the source encoding, and to get around exceptions, you can instruct ruby how to handle them:
# This will transcode the string to UTF-8 and replace any invalid/undefined characters with ‘’ (empty string)
response.encode(‘UTF-8’, 'ISO-8859-1', invalid: :replace, undef: :replace, replace: ‘’)
This is all laid out in the Ruby docs for String - check them out!
—--
Note, many people incorrectly assume that force_encode will somehow fix encoding problems. force_encode simply tags the string as the specified encoding - it does not transcode and replace/remove the invalid characters. When you're converting between encodings, you must transcode so that characters in one character set are correctly represented in the other character set.
As pointed out in the comment section, you can use force_encoding to transcode your string if you used: response.force_encoding('ISO-8859-1').encode('UTF-8') (which is equivalent to the first example using encode above).
My Ruby environment is: Ruby 2.3.1 and Rails 5.0.0.1.
I'm trying to convert a negative string number in an integer, for instance:
When I try to turn this "-2000" in the irb terminal, I got the result expected -2000
But I'm trying to convert this when I import this data from a CSV file.
I'm using the following information:
CSV file
345,-2000
345,120000
Code file
CSV.foreach("file.csv") do |row|
p [row[0], row[1]]
p row[1].to_i
p row[1].force_encoding('UTF-8').to_i
p Integer(row[1])
p Integer(row[1].force_encoding('UTF-8'))
end
I got that:
["345", "-2000"]
0
0
'Integer': invalid value for Integer(): "\xC2\xAD2000" (ArgumentError)
'Integer': invalid value for Integer(): "\xC2\xAD2000" (ArgumentError)
Using the Integer(), I discovered that the - sign is represented by "\xC2\xAD".
In summary, the to_i method is converting "\xC2\xAD2000" to 0 and the Integer() is trigging an error.
Could someone help with that?
Thanks for your attention.
It looks like you actually have two characters here..
\xC2: SublimeText keeps inserting \xc2 characters
\xAD: (soft-hyphen) http://www.fileformat.info/info/unicode/char/00ad/index.htm
Personally, I would replace this combination of characters with an actual hyphen and then convert to integer:
CSV.foreach("file.csv") do |row|
p row[1].sub("\xC2\xAD", '-').to_i
end
That, or clean up the source file. Unsure of how you are generating it, but worth looking into.
I am having issues parsing text files that have illegal characters(binary markers) in them. An answer would be something as follows:
test.csv
^000000^id1,text1,text2,text3
Here the ^000000^ is a textual representation of illegal characters in the source file.
I was thinking about using the java.nio to validate the line before I process it. So, I was thinking of introducing a Validator trait as follows:
import java.nio.charset._
trait Validator{
private def encoder = Charset.forName("UTF-8").newEncoder
def isValidEncoding(line:String):Boolean = {
encoder.canEncode(line)
}
}
Do you guys think this is the correct approach to handle the situation?
Thanks
It is too late when you already have a String, UTF-8 can always encode any string*. You need to go to the point where you are decoding the file initially.
ISO-8859-1 is an encoding with interesting properties:
Literally any byte sequence is valid ISO-8859-1
The code point of each decoded character is exactly the same as the value of the byte it was decoded from
So you could decode the file as ISO-8859-1 and just strip non-English characters:
//Pseudo code
str = file.decode("ISO-8859-1");
str = str.replace( "[\u0000-\u0019\u007F-\u00FF]", "");
You can also iterate line-by-line, and ignore each line that contains a character in [\u0000-\u0019\u007F-\u00FF], if that's what you mean by validating a line before processing it.
It also occurred to me that the binary marker could be a BOM. You can use a hex editor to view the values.
*Except those with illegal surrogates which is probably not the case here.
Binary data is not a string. Don't try to hack around input sequences that would be illegal upon conversion to a String.
If your input is an arbitrary sequence of bytes (even if many of them conform to ASCII), don't even try to convert it to a String.
I have a website thats running on ruby 1.8.7 . I have a validation on an incoming post that checks to make sure that we allow upto max of 12000 characters. The spaces are counted as characters and tab and carriage returns are stripped off before the post is subjected to the validation.
Here is the post that is subjected to validation http://pastie.org/5047582
In ruby 1.9 the string length shows up as 11909 which is correct. But when I check the length on ruby 1.8.7 is turns out to be 12044.
I used codepad.org to run this ruby code which gives me http://codepad.org/OxgSuKGZ ( which outputs the length as 12044 which is wrong) but when i run this same code in the console at codeacademy.org the string length is 11909.
Can anybody explain me why this is happening ???
Thanks
This is a Unicode issue. The string you are using contains characters outside the ASCII range, and the UTF-8 encoding that is frequently used encodes those as 2 (or more) bytes.
Ruby 1.8 did not handle Unicode properly, and length simply gives the number of bytes in the string, which results in fun stuff like:
"ą".length
=> 2
Ruby 1.9 has better Unicode handling. This includes length returning the actual number of characters in the string, as long as Ruby knows the encoding:
"ä".length
=> 1
One possible workaround in Ruby 1.8 is using regular expressions, which can be made Unicode aware:
"ą".scan(/./mu).size
=> 1
When working with different languages, what is the proper way to sub a string out in Rails?
Example (Czech Translation):
str = "pro více informací"
replace = "<em>více</em>"
str["více"] = replace
puts str
The problem I keep running into (and this is for multiple languages, not just Czech) is the following: IndexError (string not matched)
Is there a better way to do a string replacement? I know about gsub and sub, but both methods cause the following errors
.gsub! and gsub errors: RegexpError (invalid multibyte character)
.sub! and .sub errors: RegexpError (invalid multibyte character)
You will want to browse through this thread. Use the byte values for replacement.