Fixing UTF-8 encoded as ISO-8859-1 - character-encoding

Say you have a file that contains both correct UTF-8 characters and UTF-8 characters that were once read by a program that thought they were ISO-8859-1, and were then written back as UTF-8. So you have things like "Ã©" instead of "é". How do you fix that?

I finally came up with a single sed command that did the job for me :
LANG='' sed -re 's/(\xc3)\x83\xc2([\x80-\xbf])/\1\2/g'
It does not handle Unicode code points U+00A0 to U+00BF (their double-encoded form starts with \xc3\x82 rather than \xc3\x83), but it should be pretty easy to adapt for those.
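Where sed's byte escapes are not available, the same substitution can be sketched in Python. This is only a sketch under assumptions, not the command above: the name fix_double_encoded is made up for illustration, and it assumes the misreading was strict ISO-8859-1 (not Windows-1252). It also covers the \xc3\x82 prefix, so U+0080 to U+00BF are repaired as well.

```python
import re

# A two-byte UTF-8 sequence (lead byte C2 or C3, continuation 80-BF) that was
# read as ISO-8859-1 and written back as UTF-8 becomes four bytes:
#   C3 82|83  C2 80-BF
DOUBLE_ENCODED = re.compile(rb"\xc3([\x82\x83])\xc2([\x80-\xbf])")


def fix_double_encoded(data: bytes) -> bytes:
    """Collapse double-encoded two-byte sequences; leave all other bytes alone."""
    def repair(match):
        # 0x82 -> 0xC2 and 0x83 -> 0xC3: restore the original lead byte.
        lead = 0xC0 | (match.group(1)[0] & 0x3F)
        return bytes([lead]) + match.group(2)

    return DOUBLE_ENCODED.sub(repair, data)


if __name__ == "__main__":
    good = "é".encode("utf-8")
    bad = good.decode("iso-8859-1").encode("utf-8")  # simulate the damage
    mixed = b"caf" + good + b" / caf" + bad
    print(fix_double_encoded(mixed).decode("utf-8"))
```

Note that a text that legitimately contains "Ã©" would be altered too; the byte pattern alone cannot tell the two cases apart.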

Related

How to determine if characters within a file have been corrupted or are just being viewed with incorrect encoding

The file I'm dealing with contains text where some characters aren't showing correctly upon opening.
I've been told the file has UTF-8 encoding, but when I open it in Sublime Text 3 (I even used the re-open in UTF-8 option) a number of characters show as ?.
For example, Jiří is incorrectly shown as Ji?í, so the ř isn't shown correctly but the long i (í) is. Other characters, for example č and ň, are also not showing correctly.
After some investigation it looks like the file is in ASCII encoding (a subset of UTF-8).
file ~/my_location/my_file.txt
~/my_location/my_file.txt: ASCII text, with very long lines, with CRLF line terminators
I've checked the character set for ASCII and for example ř is present, so I'm wondering if the issue is that these characters are already corrupted or the above file encoding check that I used isn't showing the correct file encoding.
I've tried a few conversions to utf8 but none of them fix the characters.
iconv -f ISO-8859-1 -t UTF-8 ~/my_location/my_file.txt > ~/my_location/my_file_f_ISO-8859-1.txt
iconv -f CP1252 -t UTF-8 ~/my_location/my_file.txt > ~/my_location/my_file_f_CP1252.txt
iconv -f Windows-1252 -t UTF-8 ~/my_location/my_file.txt > ~/my_location/my_file_f_Windows-1252.txt
Would appreciate if anyone has any thoughts on how I can proceed in the investigation...
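One way to proceed is to look at the raw bytes of the whole file instead of relying on the editor or on file (which, as far as I know, may only inspect a leading portion of the file). Below is a minimal sketch; the name inspect_bytes and the keys of the returned dictionary are made up for illustration.

```python
def inspect_bytes(data: bytes) -> dict:
    """Summarise what the raw bytes can tell us about the encoding."""
    non_ascii = sum(1 for b in data if b >= 0x80)
    try:
        data.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        valid_utf8 = False
    return {
        "non_ascii_bytes": non_ascii,                 # 0 means pure ASCII
        "valid_utf8": valid_utf8,                     # True for pure ASCII as well
        "literal_question_marks": data.count(b"?"),   # 0x3F bytes stored in the file
    }


if __name__ == "__main__":
    # The path is the one from the question; adjust as needed.
    import os
    path = os.path.expanduser("~/my_location/my_file.txt")
    if os.path.exists(path):
        with open(path, "rb") as handle:
            print(inspect_bytes(handle.read()))
```

If non_ascii_bytes is 0, every "?" is a literal question mark (byte 0x3F): the original characters were replaced before the file was written, and no iconv conversion can bring them back. If there are non-ASCII bytes but valid_utf8 is False, the file is in some 8-bit encoding and the task is to find which one.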

Lua hex string to ASCII?

I want to convert a hex string to an ASCII character (for the game ROBLOX).
Here's the page for the ASCII icon:
http://www.fileformat.info/info/unicode/char/25ba/index.htm
Although I'm not even sure that Lua supports that icon.
EDIT:
Turns out ROBLOX doesn't support UTF-8 symbols at all due to their 'chat filtering'.
Strings in Lua are encoding-agnostic and you can just use the character in the string:
print"►"
Alternatively:
Output the Unicode code point directly with print"\u{25BA}" (Lua 5.3 or later).
Output the UTF-8 encoding with hexadecimal escapes: print"\xE2\x96\xBA" (Lua 5.2 or later).
Output the UTF-8 encoding with decimal escapes: print"\226\150\186".
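To see where the byte values in those escapes come from, here is a small Python sketch (Python only because it is easy to run; the helper name utf8_bytes is made up). It encodes the code point U+25BA as UTF-8 and lists the bytes in both notations.

```python
def utf8_bytes(ch: str):
    """Return the UTF-8 bytes of ch as (hex strings, decimal values)."""
    encoded = ch.encode("utf-8")
    return [f"{b:02X}" for b in encoded], list(encoded)


if __name__ == "__main__":
    hex_values, dec_values = utf8_bytes("\u25ba")
    print(hex_values)  # ['E2', '96', 'BA']
    print(dec_values)  # [226, 150, 186]
```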

Sublime Text: Not representable characters

I'm using Sublime Text for LaTeX, so I need to use a specific encoding. However, in some cases, when I paste text copied from a different program (Word or a browser in most cases), I get the message:
"Not all characters are representable in XXX encoding, falling back to UTF-8"
My question is: Is there any way to see which parts of the text cannot be encoded, so I can delete them manually?
I had this problem. It is caused by corrupt characters in your document. Here is how I solved it.
1) Make a search in your document for all standard characters. Make sure you enable regular expressions in your search, then paste this :
[^a-zA-Z0-9 -\.;<>/ ={}\[\]\^\?_\\\|:\r\n#]
You can add the normal accented characters of your language, such as é and à. Here is the pattern with the characters for French and German added:
[^a-zA-Z0-9 -\.;<>/ ='{}\[\]\^\?_\\\|:\r\n~#éàèêîôâûçäöüÄÖÜß]
2) Search for that, and keep pressing F3 until you see mangled characters. Usually something like "Ã¨", which is a corrupt version of "è".
3) Delete those characters or replace them with what they should be.
You will be able to convert the document to another encoding when you have cleared all corrupt characters out.
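The same hunt can be done outside the editor. The following is a minimal Python sketch, not a Sublime Text feature: the function name unencodable_chars is made up, and ISO-8859-1 as the target is an assumption (substitute whatever encoding your document needs). It reports the line, column and character of everything the target encoding cannot represent.

```python
def unencodable_chars(text: str, encoding: str = "iso-8859-1"):
    """Return (line, column, character) for each character that cannot be
    represented in the given encoding. Lines and columns are 1-based."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            try:
                ch.encode(encoding)
            except UnicodeEncodeError:
                hits.append((line_no, col, ch))
    return hits


if __name__ == "__main__":
    sample = "caf\u00e9 \u201cquote\u201d\nplain line"
    for line_no, col, ch in unencodable_chars(sample):
        print(f"line {line_no}, column {col}: {ch!r} (U+{ord(ch):04X})")
```

With text pasted from Word or a browser, the usual suspects are curly quotes, long dashes and similar typographic characters.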
For Linux users, it's also possible to automatically remove broken characters with command iconv:
iconv -f UTF-8 -t Windows-1251 -c < ~/temp/data.csv > ~/temp/data01.csv
-c Silently discard characters that cannot be converted instead of terminating when encountering such characters.
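For those without iconv, a rough Python counterpart is sketched below. The function name drop_unconvertible is made up; "cp1251" is Python's codec name for Windows-1251. Unlike the command above, this sketch only drops characters the target encoding cannot represent; invalid UTF-8 in the input still raises an error.

```python
def drop_unconvertible(data: bytes, target: str = "cp1251") -> bytes:
    """Decode UTF-8 input and re-encode it, silently dropping every
    character that the target encoding cannot represent."""
    return data.decode("utf-8").encode(target, errors="ignore")


if __name__ == "__main__":
    source = "Привет \u25ba мир".encode("utf-8")
    print(drop_unconvertible(source).decode("cp1251"))
```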
Just adding to @Draken's response: here is the RegEx with Spanish characters added.
[^a-zA-Z0-9 -\.;<>/ =“”'{}\[\]\^\?_\\\|:\r\n~#àèêîôâûçäöüÄÖÜßáéíóúñÑ¿€]
In my case I hit Ctrl+H (for replacement) and left the replacement expression empty. So everything got cleared super fast and I was able to save it using ISO-8859-1.

Turkish character encoding in gedit

I have a Turkish written text but I have some strange characters for example:
ý instead of ı, Ý instead of İ, etc. I tried to convert the encoding to ISO-8859-9 but it didn't help.
If you're running a UNIX/Linux machine, try the following shell command:
you@somewhere:~$ file --mime yourfile.txt
It should output something like the snippet below, where iso-8859-1 is the actual character set your system assumes:
yourfile.txt: text/plain; charset=iso-8859-1
Now you can convert the file into some more flexible charset, like UTF-8:
you@somewhere:~$ iconv -f iso-8859-1 -t utf-8 yourfile.txt > converted.txt
The above snippet specifies both the charset to convert from (-f, which should match the output of the file command) and the charset to convert to (-t). The result of converting yourfile.txt is then stored in converted.txt, which you should be able to open with gedit.
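For the specific symptoms in the question (ý for ı, Ý for İ), note that ISO-8859-9 differs from ISO-8859-1 in exactly six positions, and these two are among them. So the bytes are probably ISO-8859-9 (or Windows-1254) being interpreted as ISO-8859-1, in which case the charset reported by file is only a guess and the -f argument should be iso-8859-9. The Python sketch below illustrates the mapping; the helper name reinterpret is made up, and it assumes the text has already been decoded with the wrong charset.

```python
def reinterpret(text: str, wrong: str = "iso-8859-1", right: str = "iso-8859-9") -> str:
    """Undo a wrong decoding: recover the original bytes, then decode them
    with the charset the file was really written in."""
    return text.encode(wrong).decode(right)


if __name__ == "__main__":
    print(reinterpret("Ýstanbul"))   # byte 0xDD is İ in ISO-8859-9
    print(reinterpret("ýþýk"))       # bytes 0xFD and 0xFE are ı and ş
```

If the file on disk still contains the original 8-bit bytes, no repair is needed at all: opening it as ISO-8859-9, or converting from ISO-8859-9 to UTF-8, should be enough.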
If that doesn't work, you may paste the output of the file command, as well as some real line of your file, in the comment section...

What character encoding are the following German words using?

I'm trying to process a German word list and can't figure out what encoding the file is in. The 'file' unix command says the file is "Non-ISO extended-ASCII text". Most of the words are in ascii, but here are the exceptions:
ANDR\x82
ATTACH\x82
C\x82ZANNE
CH\x83TEAU
CONF\x82RENCIER
FABERG\x82
L\x82VI-STRAUSS
RH\x93NETAL
P\xF2ANGE
Any hints would be great. Thanks!
EDIT: To be clear, the hex codes above are C hex string literals so replace \xXX with the literal hex value XX.
It looks like CP437 or CP852, assuming each \x82 sequence encodes a single character and is not literally four characters. At least everything else fits, but the last line is a bit of a puzzle.
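A quick way to test such a guess is to decode the problem words with each candidate code page and see which result reads as real words. Below is a minimal Python sketch; the function name decode_with_candidates is made up, and cp850 is added to the candidate list as my own assumption since it is a close relative of the two code pages named above.

```python
WORDS = [b"ANDR\x82", b"CH\x83TEAU", b"RH\x93NETAL", b"P\xf2ANGE"]
CANDIDATES = ["cp437", "cp850", "cp852"]  # cp850 is an extra guess


def decode_with_candidates(raw: bytes, encodings=CANDIDATES) -> dict:
    """Decode raw with every candidate encoding; map encoding -> text."""
    return {enc: raw.decode(enc, errors="replace") for enc in encodings}


if __name__ == "__main__":
    for word in WORDS:
        print(word, decode_with_candidates(word))
```

The first three words come out the same in all three code pages (é, â and ô are lowercase even though the rest of the word is uppercase), so they cannot tell the candidates apart; only the byte 0xF2 differs between them.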
