I have data in CSV format whose character encoding has been seriously scrambled, likely from going back and forth between different software applications (LibreOffice Calc, Microsoft Excel, Google Refine, custom PHP/MySQL software; on Windows XP, Windows 7 and GNU/Linux machines from various regions of the world...). Somewhere in the process, non-ASCII characters became badly mangled, and I'm not sure how to descramble them or detect a pattern. Doing it manually would involve a few thousand records...
Here's an example. For "Trois-Rivières", when I open this portion of the CSV file in Python, it says:
Trois-Rivi\xc3\x83\xc2\x85\xc3\x82\xc2\xa0res
Question: through what process can I reverse
\xc3\x83\xc2\x85\xc3\x82\xc2\xa0
to get back
è
i.e. how can I unscramble this? How might this have become scrambled in the first place? How can I reverse engineer this bug?
You can check the solutions that were offered in: Double-decoding unicode in python
Another, simpler brute-force solution is to build a mapping table for the small set of scrambled characters, using a regular-expression search (e.g. ((\\x[a-f0-9]{2}){8}) to match a run of eight \xNN escapes) on your input file. For a file from a single source, you should have fewer than 32 mappings for French and fewer than 10 for German. Then you can run a "find and replace" using this small mapping table.
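A sketch of this approach in Python (with assumptions: the file is read as raw bytes, so we look for runs of non-ASCII bytes rather than literal \xNN escapes; "input.csv", "output.csv" and the sample mapping entry are placeholders):

import re
from collections import Counter

# List every distinct run of non-ASCII bytes with its frequency, so the
# small set of scrambled sequences can be identified and mapped by hand.
with open("input.csv", "rb") as f:
    data = f.read()

for seq, count in Counter(re.findall(rb"[\x80-\xff]+", data)).most_common():
    print(seq, count)

# Once each sequence is identified, apply the mapping table.
mapping = {b"\xc3\x83\xc2\x85\xc3\x82\xc2\xa0": "è".encode("utf-8")}
for bad, good in mapping.items():
    data = data.replace(bad, good)

with open("output.csv", "wb") as f:
    f.write(data)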
Based on dan04's comment above, we can guess that somehow the letter "è" was misinterpreted as an "Š", which then had three-fold UTF-8 encoding applied to it.
So how did "è" turn into "Š", then? Well, I had a hunch that the most likely explanation would be between two different 8-bit charsets, so I looked up some common character encodings on Wikipedia, and found a match: in CP850 (and in various other related 8-bit DOS code pages, such as CP851, CP853, CP857, etc.) the letter "è" is encoded as the byte 0x8A, which in Windows-1252 represents "Š" instead.
With this knowledge, we can recreate this tortuous chain of mis-encodings with a simple Unix shell command line:
$ echo "Trois-Rivières" \
| iconv -t cp850 \
| iconv -f windows-1252 -t utf-8 \
| iconv -f iso-8859-1 -t utf-8 \
| iconv -f iso-8859-1 -t utf-8 \
| iconv -f ascii --byte-subst='\x%02X'
Trois-Rivi\xC3\x83\xC2\x85\xC3\x82\xC2\xA0res
Here, the first iconv call just converts the string from my local character encoding — which happens to be UTF-8 — to CP850, and the last one just encodes the non-ASCII bytes with Python-style \xNN escape codes. The three iconv calls in the middle recreate the actual re-encoding steps applied to the data: first from (assumed) Windows-1252 to UTF-8, and then twice from ISO-8859-1 to UTF-8.
So how can we fix it? Well, we just need to apply the same steps in reverse:
$ echo -e 'Trois-Rivi\xC3\x83\xC2\x85\xC3\x82\xC2\xA0res' \
| iconv -f utf-8 -t iso-8859-1 \
| iconv -f utf-8 -t iso-8859-1 \
| iconv -f utf-8 -t windows-1252 \
| iconv -f cp850
Trois-Rivières
The good news is that this process should be mostly reversible. The bad news is that any "ü", "ì", "Å", "É" and "Ø" letters in the original text may have been irreversibly mangled, since the bytes used to encode those letters in CP850 are undefined in Windows-1252. (If you're lucky, they may have been interpreted as the same C1 control codes that those bytes represent in ISO-8859-1, in which case back-conversion should in principle be possible. I haven't managed to figure out how to convince iconv to do it, though.)
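The same reversal can be written directly in Python (a minimal sketch, assuming the scrambled text is read as raw bytes; "latin-1" is Python's name for ISO-8859-1):

data = b"Trois-Rivi\xc3\x83\xc2\x85\xc3\x82\xc2\xa0res"
fixed = (data.decode("utf-8")          # undo the last UTF-8 encoding
             .encode("latin-1")        # back to the bytes mis-read as ISO-8859-1
             .decode("utf-8")          # undo the middle UTF-8 encoding
             .encode("latin-1")
             .decode("utf-8")          # undo the first UTF-8 encoding
             .encode("windows-1252")   # recover the CP850 byte 0x8A for "Š"
             .decode("cp850"))         # finally read it as CP850
print(fixed)                           # Trois-Rivières

The same caveat applies here: characters that were irreversibly mangled on the way in will make the .encode("windows-1252") step fail rather than round-trip cleanly.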
Related
I am batch-converting documents that are primarily written in Chinese from docx to markdown using Pandoc within a PowerShell script. However, all Chinese characters are being converted to question marks ("?").
The command I am using is
pandoc.exe -f docx -t $converter-simple_tables-multiline_tables-grid_tables+pipe_tables -i $fullexportpath -o "$($fullfilepathwithoutextension).md" --wrap=none --atx-headers --extract-media="$($mediaPath)"
It works perfectly for English documents, so I presume I just need to add some sort of modifier to handle Chinese characters.
I have seen various posts about needing to use --latex-engine=xelatex and/or -V CJKmainfont="Font Name" (I've been using it with "Microsoft YaHei", the font in Word), but no combination of these seems to help (in fact, using --latex-engine at all breaks the conversion). It seems that those options are for converting to PDF.
Any suggestions?
Edit: the issue was not with Pandoc but with the script I was using, which did some find-and-replace conversion that works fine with English but not with Chinese characters.
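For what it's worth, the usual culprit in such scripts is reading or writing the file without an explicit encoding, which silently replaces anything outside the current code page with "?". A minimal sketch of an encoding-safe find-and-replace (in Python for illustration; the file name and substitution pair are placeholders):

# Read and write with an explicit UTF-8 encoding so Chinese characters
# survive the round trip.
with open("converted.md", encoding="utf-8") as f:
    text = f.read()

text = text.replace("old text", "new text")  # whatever the script substitutes

with open("converted.md", "w", encoding="utf-8") as f:
    f.write(text)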
I was wondering how to print the heart sign, "♥", in Lua. I have tried \003, because that is the ASCII code for it, but it does not print it out.
This has little to do with Lua.
You need to find out which character set and encoding is used in your environment and select a font that supports ♥ in that encoding.
Then you need to use an editor for your Lua script that saves in that encoding. If that part is not possible, you can determine the byte sequence required, code it as numeric escapes in a literal string, and save the file in an ASCII-compatible encoding such as CP437. For example, if you are outputting to a UTF-8 processor: "\xE2\x99\xA5".
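One quick way to determine that byte sequence is to encode the character in a scratch interpreter, for example a Python shell:

>>> "\u2665".encode("utf-8")   # U+2665 BLACK HEART SUIT
b'\xe2\x99\xa5'

Those are the same three bytes as the decimal escapes \226\153\165 used below.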
Keep in mind that a Lua string is a counted sequence of bytes. It's up to you and your editor to put the right bytes in the file, up to your environment (e.g., the console) to interpret those bytes in a particular character encoding, and up to the font to display the glyph.
In a Windows console, you can select the Lucida Console font, chcp 65001 to use UTF-8, and use Lua 5.1 like this: lua -e "print('\226\153\165')". For comparison, chcp 437 to use IBM437 and use Lua 5.1 like this: lua -e "print('\003')".
In ASCII, only the range 0x20 to 0x7E is printable. Other codes, including 0x03, aren't printable; what printing them produces is up to the implementation.
If the environment supports Unicode, you can simply call:
print("♥")
For instance, the Lua Demo outputs ♥, and the same happens in ideone.
I've noticed that my text file from Windows (Chinese version), when ported to Ubuntu, turns garbled.
After more research, I learned that the default encoding on the Chinese version of Windows is GBK, while on Ubuntu it is UTF-8, and that iconv can translate between encodings, for example from GBK to UTF-8:
iconv -f gbk -t utf-8 input.txt > output.txt
But I am still confused about the relationship between these encodings. What are they? What are the similarities and differences between them?
First, it is not about the OS, but about the program you are using to read the file.
For a bare .txt file, the program has to guess the encoding, which is not always possible but might work. For HTML, the encoding is given as metadata, so browsers don't need to guess.
Second, do you know ASCII? Do you see how it represents symbols via numbers? If not, this is the first thing you should learn.
Next, do you see the difference between Unicode and UTF-XXX? It must be clear to you that Unicode is just a map from numbers (code points) to glyphs (symbols, including Chinese characters, ASCII characters, Egyptian characters, etc.).
UTF-XXX on the other hand says, given a string of bytes, which Unicode numbers (code points) do they represent. Therefore, UTF-8 and UTF-16 are different efficient ways to represent Unicode.
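A small illustration of that distinction (in a Python shell; 中 is an arbitrary Chinese character):

>>> ord("中")               # the Unicode code point, just a number
20013
>>> hex(ord("中"))
'0x4e2d'
>>> "中".encode("utf-8")    # how UTF-8 represents that number as bytes
b'\xe4\xb8\xad'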
As you may imagine, unlike ASCII, both UTF and GBK must allow more than one byte per character, since there are many more than 256 characters.
In GBK all characters are encoded as 1 or 2 bytes.
Since GBK is specialized for Chinese, it uses fewer bytes on average than UTF-XXX to represent a given Chinese text, and more for other languages.
In UTF-8 and 16, the number of bytes per glyph is variable, so you have to look at how many bytes are used for the Chinese code points.
In Unicode, Chinese glyphs occupy several ranges, so you have to look at how efficiently UTF-8 and UTF-16 represent those ranges.
According to the Wikipedia articles on UTF-8 and UTF-16, the first and most common range for Chinese glyphs, 4E00-9FFF, is represented in UTF-8 with 3 bytes per character, while in UTF-16 it takes 2 bytes. Therefore, if you are going to use lots of Chinese, UTF-16 might be more efficient. You also have to look into the other ranges to see how many bytes per character they use.
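Those byte counts are easy to verify (a Python sketch; 中, U+4E2D, lies in the 4E00-9FFF range):

# Compare how many bytes one common Chinese character needs per encoding.
for enc in ("utf-8", "utf-16-le", "gbk"):
    print(enc, len("中".encode(enc)))
# utf-8 3
# utf-16-le 2
# gbk 2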
For portability, the best choice is UTF, since it can represent almost any character, so it is more likely that viewers will have been programmed to decode it correctly. The size gain of GBK is not that large.
I have a file. I don't know how it was processed. It's probably a double encoding. I've found this link about double encoding that almost solved my problem:
http://www.spamusers.com/encoding.htm
It has all the double-encoding substitutions to perform, like:
À àÁ
 Â
Unfortunately I still have other weird characters like:
ú
ç
ö
Do you have any idea how to clean up these weird characters? For the ones I know, I've just made a bash script and replaced them. But I don't know how to recognize the others. I'm running on Linux, so if you have some magic commands I would like that.
The “double encodings substitutions” page that you link to seems to contain mappings meant to fix character data that has been doubly UTF-8 encoded. Thus, the proper fixing routine would be to reverse such mappings and see if the result makes sense.
For example, if you take a with grave accent, à, U+00E0, and UTF-8 encode it, you get the bytes C3 A0. If these are then mistakenly understood as single-byte encodings according to windows-1252 for example, you get the characters U+00C3 U+00A0 (letter Ã and no-break space). If these are then UTF-8 encoded, you get C3 83 for the former and C2 A0 for the latter. If these bytes in turn are interpreted according to windows-1252, you get "Ãƒ" followed by "Â" and a no-break space, as on the page.
But you don't actually have those characters, do you? You have some digital data, bytes, which display that way if interpreted according to windows-1252. But that would be a wrong interpretation.
You should first read the data as UTF-8 encoded and decode it to characters, checking that all code points are less than hexadecimal 100 (if not, there's yet another error involved somewhere), then map the code points back to bytes and UTF-8 decode again.
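A minimal Python sketch of that routine (assuming exactly one extra round of UTF-8 encoding; "latin-1" maps code points below hexadecimal 100 back to single bytes one-to-one):

def fix_double_utf8(raw: bytes) -> str:
    once = raw.decode("utf-8")
    # After one decode, every code point should fit in a single byte;
    # anything at or above U+0100 means the damage is not plain double encoding.
    if any(ord(c) > 0xFF for c in once):
        raise ValueError("not plain double-encoded UTF-8")
    return once.encode("latin-1").decode("utf-8")

print(fix_double_utf8("Ã\xa0".encode("utf-8")))   # prints: à

If the data went through windows-1252 rather than ISO-8859-1 on the way in, replace "latin-1" with "windows-1252" and drop the range check.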
I have a 300MB file (link) with utf-8 characters in it. I want to write a haskell program equivalent to:
cat bigfile.txt | grep "^en " | wc -l
This runs in 2.6s on my system.
Right now, I'm reading the file as a normal String (readFile), and have this:
main = do
  contents <- readFile "bigfile.txt"
  putStrLn $ show $ length $ lines contents
After a couple seconds I get this error:
Dictionary.hs: bigfile.txt: hGetContents: invalid argument (Illegal byte sequence)
I assume I need to use something more utf-8 friendly? How can I make it both fast, and utf-8 compatible? I read about Data.ByteString.Lazy for speed, but Real World Haskell says it doesn't support utf-8.
Package utf8-string provides support for reading and writing UTF8 Strings. It reuses the ByteString infrastructure so the interface is likely to be very similar.
Another Unicode strings project, which is likely related to the above and is also inspired by ByteStrings, is discussed in this Master's thesis.