CGI::unescape can't handle unescaping "wymiana+teflon%F3w"? - ruby-on-rails

I am working on data imported from legacy database into sqlite for development, legacy database has a lot of url encoded strings with Polish characters. I can get most of these strings readable by using
CGI::unescape_html( CGI::unescape "string" )
except for one case (that I noticed yet, there may be more as I didn't do any testing yet), the letter "ó". For instance, using unescapeHTML on string "wymiana+teflon%F3w" throws an invalid byte sequence exception.
Question now is either my string is properly escaped, as other Polish characters are using sequences of "&#nnn;" like "b%26%23322%3Bad+zapisu+%2D+powinno+by%26%23263%3B+brak", which seems to follow standard for numeric character referencing. BTW, this string is properly unescaped into
"bład zapisu - powinno być brak"
But, on the other hand, there are also strings with similar character encoding, e.g. "odpowietrzanie+weza%5C" which is properly handled by CGI::unescapeHTML. However, %5C represents a backslash not a letter with code point lower than U+0256. Can it be the reason? I tried to research on this but haven't found any explanation. I also updated my Ruby to 2.1.0 as CGI::Util has changed in new version, but still no luck.

ó is 0xF3 in ISO-8859-2 (and ISO-8859-1) but '\xF3' is not a valid UTF-8 string, that ó should be %C3%B3 in the URL if you're expecting UTF-8. Someone somewhere probably used the deprecated escape JavaScript function to encode the string instead of modern encodeURIComponent; you can see the difference with a simple test in your browser's JavaScript console:
> escape('ó')
"%F3"
> encodeURIComponent('ó')
"%C3%B3"
There's the %F3 you're seeing and the %C3%B3 that you want to see. One thing that should work is to fix the encoding by hand:
irb> CGI::unescape('wymiana+teflon%F3w').force_encoding('ISO-8859-2').encode('UTF-8')
=> "wymiana teflonów"
This assumes that you know what should be ISO-8859-1 and what should be UTF-8. You might have a mix of both ISO-8859-2 (or -1, -3, ..., Windows CP-1258, ...) in your data; unfortunately, there's no reliable way to tell the difference as the encodings overlap and there's no way to be sure what result makes sense without eye-balling it and knowing the various languages involved.
Probably the best you can do is:
Send everything through through your CGI::unescape_html(CGI::unescape(...)) converter.
Wrap that in an exception handler to trap the inevitable problems.
Stash the problem strings off to the side somewhere.
Try the ISO-8859-2 to UTF-8 conversion on the strings from (3) and eye-ball them to see if they makes sense.
Repeat with other common encodings until there's nothing left that you care about.
Note that I'm using ISO-8859-2 instead of the more common ISO-8859-1 as Latin-2 is for Eastern European languages (such as Polish) whereas Latin-1 is for Western European languages. They overlap on ó but there is no ł in Latin-1. With tasks like this you usually try the encodings that are probably there first, then fall back on other common encodings, then fall back to whatever other encodings you can think of, and then fall back on hard liquor.
Good luck, modernizing legacy data is not the funnest job in the world.

I've chosen another way to solve my problem, simply substituting all occurrences of '%F3' with '%26%23xF3%3B' before unescaping. BTW, capital letter Ó also needs similar substitution. The actual code I used:
def unescape_ó(s)
s = s.gsub(/%D3|%F3/, {'%D3' =>'%26%23xD3%3B', '%F3' => '%26%23xF3%3B'})
end
With this approach I don't have to handle invalid byte sequence exception as properly escaped string is used in CGI::unescapeHTML

Related

How to decode unexpected strings from users?

I've published an app, and I find some of the comments to be like this: РекамедÑ
I have googled a lot and I cannot decode it so that the comment will not be shown this way. This is the way it is stored in database; it can be in Cyrillic, but I could not decode it as well. Any clue on how to understand this kind of comments?
These appear to be doubly encoded HTML entities. So for example, & was turned to & and that was then again turned to &
When decoding the data twice using this online tool (there are many others) the result is
РекамедÑ
That could be Unicode data, e.g. UTF-8 in a non-western character set like Cyrillic or Arabic, that
was misinterpreted as single-byte input
was garbled by a misguided "sanitation" method, possibly a call or two to PHP's htmlentities() (which incidentally assumes the single-byte ISO-8859-1 encoding by default in older versions, so a call to this function could be the whole source of the problem).
The fix will likely need to be on server side.
If you are using PHP, see UTF-8 all the way through for a handy guide.

Handling UTF-8 Character with Latin1 db encoding

I keep getting an exception that ActiveRecord::StatementInvalid: PG::UntranslatableCharacter: ERROR: character with byte sequence 0xe2 0x80 0x99 in encoding "UTF8" has no equivalent in encoding "LATIN1". I did some checking and it looks like it is the backtick or apostrophe. What is the best way to handle this? Just strip out the character or convert the whole db to UTF-8? If it is converting to UTF-8 how can I do that permanently as it always seems to revert if you do it in the shell?
I don't understand what you mean by "revert, if done in the shell", but: You seem to have an application where some parts (at least the database) using encoding LATIN1, and one part (your Rails App) is using UTF-8. IMO, it is best if you have every in Unicode, but to what extend a conversion makes sense, can not be said in general. For example, if your database is also being processed by other tools, and those expect Latin1, a conversion is not sensible.
In any case, you need to define a clear borderline between where you use which encoding, and handle conversion at this border. This applies not only to the database, but also - for example - to the HTML pages you are generating (hopefully UTF-8), to files uploaded by the users and processes by your application, and so on.
If you convert to an encoding, where certain characters can not be represented - as this is in your case -, you have only three choices:
Reject the data (they must have been generated somewhere, perhaps as user input in a web form),
Simply remove the offending characters
Replace the offending characters by a placeholder (for instance, a question mark)
None of these options is very pleasant, but if converting your database to UTF-8 is no option, you should deal with this problem at the point where the problem string is generated, and not when it is written into the database.

ASCII Representation of Hexadecimal

I have a string that, by using string.format("%02X", char), I've received the following:
74657874000000EDD37001000300
In the end, I'd like that string to look like the following:
t e x t NUL NUL NUL í Ó p SOH NUL ETX NUL (spaces are there just for clarification of characters desired in example).
I've tried to use \x..(hex#), string.char(0x..(hex#)) (where (hex#) is alphanumeric representation of my desired character) and I am still having issues with getting the result I'm looking for. After reading another thread about this topic: what is the way to represent a unichar in lua and the links provided in the answers, I am not fully understanding what I need to do in my final code that is acceptable for this to work.
I'm looking for some help in better understanding an approach that would help me to achieve my desired result provided below.
ETA:
Well I thought that I had fixed it with the following code:
function hexToAscii(input)
local convString = ""
for char in input:gmatch("(..)") do
convString = convString..(string.char("0x"..char))
end
return convString
end
It appeared to work, but didnt think about characters above 127. Rookie mistake. Now I'm unsure how I can get the additional characters up to 256 display their ASCII values.
I did the following to check since I couldn't truly "see" them in the file.
function asciiSub(input)
input = input:gsub(string.char(0x00), "<NUL>") -- suggested by a coworker
print(input)
end
I did a few gsub strings to substitute in other characters and my file comes back with the replacement strings. But when I ran into characters in the extended ASCII table, it got all forgotten.
Can anyone assist me in understanding a fix or new approach to this problem? As I've stated before, I read other topics on this and am still confused as to the best approach towards this issue.
The simple way to transform a base16-encoded string is just to
function unhex( input )
return (input:gsub( "..", function(c)
return string.char( tonumber( c, 16 ) )
end))
end
This is basically what you have, just a bit cleaner. (There's no need to say "(..)", ".." is enough – if you specify no captures, you'll automatically get the whole match. And while it might work if you write string.char( "0x"..c ), it's just evil – you concatenate lots of strings and then trigger the automatic conversion to numbers. Much better to just specify the base when explicitly converting.)
The resulting string should be exactly what went into the hex-dumper, no matter the encoding.
If you cannot correctly display the result, your viewer will also be unable to display the original input. If you used different viewers for the original input and the resulting output (e.g. a text editor and a terminal), try writing the output to a file instead and looking at it with the same viewer you used for the original input, then the two should be exactly the same.
Getting viewers that assume different encodings (e.g. one of the "old" 8-bit code pages or one of the many versions of Unicode) to display the same thing will require conversion between different formats, which tends to be quite complicated or even impossible. As you did not mention what encodings are involved (nor any other information like OS or programs used that might hint at the likely encodings), this could be just about anything, so it's impossible to say anything more specific on that.
You actually have a couple of problems:
First, make sure you know the meaning of the term character encoding, and that you know the difference between characters and bytes. A popular post on the topic is The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Then, what encoding was used for the bytes you just received? You need to know this, otherwise you don't know what byte 234 means. For example it could be ISO-8859-1, in which case it is U+00EA, the character ê.
The characters 0 to 31 are control characters (eg. 0 is NUL). Use a lookup table for these.
Then, displaying the characters on the terminal is the hard part. There is no platform-independent way to display ê on the terminal. It may well be impossible with the standard print function. If you can't figure this step out you can search for a question dealing specifically with how to print Unicode text from Lua.

Squeak Monticello character-encoding

For a work project I am using headless Squeak on a (displayless, remote) Linuxserver and also using Squeak on a Windows developer-machine.
Code on the developer machine is managed using Monticello. I have to copy the mcz to the server using SFTP unfortunately (e.g. having a push-repository on the server is not possible for security reasons). The code is then merged by eg:
MczInstaller installFileNamed: 'name-b.18.mcz'.
Which generally works.
Unfortunately our code-base contains strings that contain Umlauts and other non-ascii characters. During the Monticello-reimport some of them get replaced with other characters and some get replaced with nothing.
I also tried e.g.
MczInstaller installStream: (FileStream readOnlyFileNamed: '...') binary
(note .mcz's are actually .zip's, so binary should be appropriate, i guess it is the default anyway)
Finding out how to make Monticello's transfer preserve the Squeak internal-encoding of non-ascii's is the main Goal of my question. Changing all the source code to only use ascii-strings is (at least in this codebase) much less desirable because manual labor is involved. If you are interested in why it is not a simple grep-replace in this case read this side note:
(Side note: (A simplified/special case) The codebase uses Seaside's #text: method to render strings that contain chars that have to be html-escaped. This works fine with our non-ascii's e.g. it converts ä into ä, if we were to grep-replace the literal ä's by ä explicitly, then we would have to use the #html: method instead (else double-escape), however that would then require that we replace all other characters that have to be html-escaped as well (e.g. &), but then again the source-code itself contains such characters. And there are other cases, like some #text:'s that take third-party strings, they may not be replaced by #html's...)
Squeak does use unicode (ISO 10646) internally for encoding characters in a String.
It might use extension like CP1252 for characters in range 16r80 to: 16r9F, but I'm not really sure anymore.
The characters codes are written as is on the stream source.st, and these codes are made of a single byte for a ByteString when all characters are <= 16rFF. In this case, the file should look like encoded in ISO-8859-L1 or CP1252.
If ever you have character codes > 16rFF, then a WideString is used in Squeak. Once again the codes are written as is on the stream source.st, but this time these are 32 bits codes (written in big-endian order). Technically, the encoding is thus UTF-32BE.
Now what does MczInstaller does? It uses the snapshot/source.st file, and uses setConverterForCode for reading this file, which is either UTF-8 or MacRoman... So non ASCII characters might get changed, and this is even worse in case of WideString which will be re-interpreted as ByteString.
MC itself doesn't use the snapshot/source.st member in the archive.
It rather uses the snapshot.bin (see code in MCMczReader, MCMczWriter).
This is a binary file whose format is governed by DataStream.
The snippet that you should use is rather:
MCMczReader loadVersionFile: 'YourPackage-b.18.mcz'
Monticello isn't really aware of character encoding. I don't know the present situation in squeak but the last time I've looked into it there was an assumed character encoding of latin1. But that would mean it should work flawlessly in your situation.
It should work somehow anyway if you are writing and reading from the same kind of image. If the proper character encoding fails usually the internal byte representation is written from memory to disk. While this prevents any cross dialect exchange of packages it should work if using the same image kind.
Anyway there are things that should or could work but they often go wrong. So most projects try to avoid using non 7bit characters in their code.
You don't need to convert non 7bit characters to HTML entities. You can use
Character value: 228
for producing an ä in your code without using non 7bit characters. On every character you like to add a conversion you can do
$ä asciiValue => 228
I know this is not the kind of answer some would want to get. But monticello is one of these things that still need to be adjusted for proper character encoding.

Grails UrlEncoding non latin characters like åäö

I have some link resources with none latin characters like åäö
These are usually user uploaded files
The problem is that i am not successfull in encoding them
using filename.encodeAsURL seems to not encode it the right way
For example the character ö is turned into o%CC%88
Testing to type the same thing in firefox and copy the contents gives %C3%B6
What are the difference between these encodings and what should i use to get the correct encoding??
Both encodings are correct. You are actually seeing the encoding of two different strings.
The key here is noticing the o at the beginning of the string:
o%CC%88 is the letter o followed by Unicode Character Combining Diaeresis, which combines with the previous character when rendered.
%C3%B6 is Unicode Character Latin Small O With Diaeresis.
What you are seeing is that in the first case, the string entered is something like these two characters: o ¨, which are actually rendered as ö.
In the second case, it's the actual character ö.
My guess is you are seeing the difference between two different inputs.
Update based on below discussion: If you are dynamically processing Unicode characters, and you do not have control over the input methods, you can try to normalize the Unicode, using java.text.Normalizer (Java 1.6 or newer).
Normalizing attempts to ensure that all characters are consistently represented, so that accented characters are always represented by a combined character or always by the character+combining mark.
Rough example:
String.metaClass.normalizeUnicode = {
return java.text.Normalizer.normalize(delegate, java.text.Normalizer.Form.NFC)
}
input = input.normalizeUnicode()
There are four forms of normalization. I picked the one that seems to be best for your case based on the description of how they work, but you may prefer to try the other ones and see what works most consistently.
All that being said, if you are try to representing Unicode characters in a URL, and they are not being loaded and processed by the code directly, it's probably best to avoid using non-latin characters altogether. Not only does this have the benefit of consistently, but also significantly shorter and more legible URLs. boo.pdf is a lot easier to read than bo%CC%88o.pdf.

Resources