Determine the encoding of a text file - character-encoding

I have a txt file whose encoding I am trying to determine. I open it in Firefox and click on View -> Character Encoding and it says ISO-8859-1. If I open it in Notepad or Notepad++ it says ANSI.
So I am now confused.

ANSI and ISO-8859-1 are both common mislabels for Windows-1252. If the characters look correct, then the file is encoded in Windows-1252.
Note that ISO-8859-1 is a real encoding (ANSI is not an encoding at all). The difference is that Windows-1252 assigns useful printable characters to the range 0x80-0x9F, which ISO-8859-1 spends on rarely used control characters.
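To see the difference concretely, here is a minimal Python sketch (Python is only used for illustration; the sample bytes and the helper name uses_c1_range are made up for this example). It decodes the same bytes both ways and checks whether a file contains any byte in the range where the two encodings disagree:

```python
def uses_c1_range(data: bytes) -> bool:
    """Hypothetical helper: True if any byte falls in 0x80-0x9F,
    the only range where Windows-1252 and ISO-8859-1 differ."""
    return any(0x80 <= b <= 0x9F for b in data)

# Curly-quoted "Hi" as a Windows editor might save it (illustrative sample).
sample = bytes([0x93, 0x48, 0x69, 0x94])

as_cp1252 = sample.decode("cp1252")   # 0x93/0x94 become printable curly quotes
as_latin1 = sample.decode("latin-1")  # 0x93/0x94 become invisible C1 controls

print(repr(as_cp1252))
print(repr(as_latin1))
print(uses_c1_range(sample))

# From 0xA0 upward the two encodings agree byte for byte.
shared = bytes(range(0xA0, 0x100))
assert shared.decode("cp1252") == shared.decode("latin-1")
```

If uses_c1_range returns False for your file, the two labels describe exactly the same text and the distinction does not matter for that file.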

Related

UTF-32 encoding in Erlang

I want to create an application with wxErlang, in which I need to use UTF-32 strings. I can load source code from a file with UTF-8 encoding, but I get errors when the file is converted to UTF-32. I need to use Cyrillic characters in my application, which is why I want to solve this problem with UTF-32 encoding.
If you look at the "Unicode usage in Erlang" page, you'll see that the current release of Erlang, R16A, supports UTF-8 source files. UTF-32 is not supported. However, if you want Cyrillic, UTF-8 can represent everything you can write in UTF-32.
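The last point is about the encodings themselves rather than Erlang, so it can be checked in any language. A small Python sketch (the sample string is arbitrary) showing that the same Cyrillic text survives a round trip through either encoding:

```python
text = "Привет, мир"  # arbitrary Cyrillic sample

utf8_bytes = text.encode("utf-8")
utf32_bytes = text.encode("utf-32")

# Both encodings cover all of Unicode, so both round-trip losslessly.
assert utf8_bytes.decode("utf-8") == text
assert utf32_bytes.decode("utf-32") == text

# UTF-8 is also the more compact of the two for this text.
print(len(utf8_bytes), len(utf32_bytes))
```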

How are non-ASCII file names encoded in RAR files?

I have a RAR file with non-ASCII letters in filenames. I tried decoding it in Delphi. My code works fine for ASCII filenames but it fails on these. It is not WideChar, nor UTF-8. I found the RAR specs here:
http://ams.cern.ch/AMS/amsexch/arch/rar/technote.txt
but it says nothing about the character encoding.
I tried WOTSIT.org but all links to RAR documents are dead (almost every link is dead there; I even contacted the admin but he didn't respond and didn't fix the links).
It seems it is not an 8-bit encoding, but I have no idea what it could be.
This is the only paragraph that says something about the name:
0x200 - FILE_NAME contains both usual and encoded
Unicode name separated by zero. In this case
NAME_SIZE field is equal to the length
of usual name plus encoded Unicode name plus 1.
If this flag is present, but FILE_NAME does not
contain zero bytes, it means that file name
is encoded using UTF-8.
It seems that it is UTF-8, but you say it is not. Can you try again?

KRL RSS parser: Handle encoding issues?

I'm importing an RSS feed from Tumblr into a Kynetx app. It appears that the RSS feed has some encoding issues, as apostrophes appear like this:
The feed (which you can find here) claims to be encoded in UTF-8.
Is there a way to specify the encoding or else replace those characters with regular apostrophes?
While not optimal, you could try to catch these encodings and replace them with the UTF-8 standard:
newstring = oldstring.replace(re/’/\'/);
This appears to be a case of a service that specifies UTF-8, but doesn't explicitly enforce it. I uploaded an image of the RSS feed that you provided. For comparison, I cut and pasted the text into a notepad document and then typed in the same text from my keyboard.
I don't know if you can tell from the image, but the apostrophe that is mangled is different from the apostrophe that is generated by my UTF-8 browser.
I suspect that this post was submitted via a Windows client. If you look at your encoding options, you will see an option for Western (Windows-1252).
Windows-1252 is a legacy Windows encoding that resembles ISO 8859-1 but substitutes its own printable characters (curly quotes, for example) for the control characters that ISO 8859-1 places in the range 0x80-0x9F.
A couple of quotes from the Wikipedia page that I cite above:
It is very common to mislabel Windows-1252 text data with the charset label ISO-8859-1. Many web browsers and e-mail clients treat the MIME charset ISO-8859-1 as Windows-1252 characters in order to accommodate such mislabeling
Many Microsoft programs, such as Word will automatically substitute Windows-1252 characters when standard ASCII characters are entered, such as for "smart quotes" (e.g. substituting ’ for the apostrophe in a contraction) or substituting © for the three characters '(c)'.
KRL supports all of the language charsets supported by UTF-8, so it supports multi-byte international characters natively; however, that comes at the expense of the encoding fudging that is possible when you only have ISO-8859-1 or Windows-1252 to choose from.
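If the mangled apostrophes follow the usual pattern (UTF-8 bytes that some step decoded as Windows-1252), the damage can be reversed rather than pattern-matched. The sketch below is in Python, not KRL, and only illustrates the mechanism; the helper names repair_mojibake and plain_apostrophes are made up, and whether your feed is damaged in exactly this way is an assumption:

```python
def repair_mojibake(text: str) -> str:
    """Hypothetical helper: undo UTF-8 bytes that were mis-decoded
    as Windows-1252. Text that does not fit the pattern is returned as-is."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text

def plain_apostrophes(text: str) -> str:
    """Replace the curly apostrophe (U+2019) with a plain ASCII one."""
    return text.replace("\u2019", "'")

# Reproduce the damage: U+2019 encoded as UTF-8, then read as Windows-1252.
mangled = "It" + "\u2019".encode("utf-8").decode("cp1252") + "s"

repaired = repair_mojibake(mangled)
print(repr(mangled))
print(repr(repaired))
print(plain_apostrophes(repaired))
```

Correctly decoded text (plain ASCII, or accented characters that are not part of a mis-decoded sequence) falls through the except branch or round-trips unchanged, so the helper is reasonably safe to apply to a whole string.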

invalid token error while parsing an XML file with UTF-8 encoding

I get an invalid token error while parsing an XML file with UTF-8 encoding.
The error occurs when the parser encounters the extended ASCII character 'â' { "â", "â" }.
When I change the encoding from UTF-8 to ISO-8859-1, the parsing is successful. But my application should support UTF-8, ASCII and extended ASCII characters. What should I do for this?
Any ideas are welcome.
Thanks in Advance for your time and solution.
Telling a parser that a Latin-1 file is UTF-8 by setting the encoding attribute of the XML declaration will result in an error similar to the one you report.
If the 'â' character (U+00E2) appears in a UTF-8 encoded file, then that character will be encoded in that file as a two byte sequence. So if you are not changing the bytes in the file when you say you are changing the encoding, you are not changing the encoding of the file, only telling the parser that a non-UTF-8 file is UTF-8.
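A small Python sketch of the same situation, using the standard library's xml.etree.ElementTree (your parser may differ; the helpers make_doc and parses are made up for this demo). It builds three tiny documents and checks which combinations of declared encoding and actual bytes parse:

```python
import xml.etree.ElementTree as ET

def make_doc(declared: str, payload: bytes) -> bytes:
    """Hypothetical helper: wrap raw payload bytes in a one-element
    XML document that declares the given encoding."""
    header = f'<?xml version="1.0" encoding="{declared}"?>'.encode("ascii")
    return header + b"<a>" + payload + b"</a>"

def parses(doc: bytes) -> bool:
    """Return True if the document parses, False on a parse error."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

a_circumflex = "\u00e2"                          # the character 'â'
latin1_payload = a_circumflex.encode("latin-1")  # one byte: E2
utf8_payload = a_circumflex.encode("utf-8")      # two bytes: C3 A2

# Latin-1 bytes labelled as UTF-8: the mismatch from the question.
print(parses(make_doc("UTF-8", latin1_payload)))
# Same bytes, honest label.
print(parses(make_doc("ISO-8859-1", latin1_payload)))
# Real UTF-8 bytes with a UTF-8 label.
print(parses(make_doc("UTF-8", utf8_payload)))
```

Only the first combination fails, which is why changing the declaration appears to fix things: the bytes were Latin-1 all along. To support UTF-8 properly, the file itself has to be re-saved as UTF-8.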

Percent Encoded UTF-8 to Ascii(8-bit) conversion

I'm reading in URLs and they often have percent-encoded characters.
Example: %C3%A9 is actually é
According to http://www.microsystools.com/products/sitemap-generator/faq/character-percentage-url-encoding/ , characters in the upper half of 8-bit ASCII (128-255) are encoded as UTF-8, and then their bytes are written as hex. Now, when I get my URL, the %HEX sequences have been re-encoded as 8-bit ASCII, and I need to convert those back to their true 8-bit ASCII. Is there any function/library I can use, or else, how would I go about the conversion?
I'm using C/C++.
First you need to URL-decode. There is no function for this in cross-platform C++, but, luckily for you, it is not a hard problem. Copy bytes from source to target. Non-% bytes just get copied. When you hit %xx, convert xx from hex chars to binary, and you have your byte.
This gives you a buffer of text in UTF-8. You say you want 'ASCII' -- ISO-646. Then you can't have an accented e. I can think of several possibilities for what you really want:
ISO-8859-1. You can use ICU to convert UTF-8 to ISO-8859-1.
ISO-646. You can also use ICU, and I believe it will make accented chars into their ISO-646 equivalents.
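The two steps above (URL-decode to UTF-8 bytes, then convert to the target encoding) can be sketched as follows. This is Python rather than C/C++, it uses the built-in codecs instead of ICU, and the NFKD-based ASCII fallback is a substitute technique for illustration, not a claim about what ICU does. The function name percent_decode is made up; the decoding loop itself carries over to C almost line for line:

```python
import unicodedata
from urllib.parse import unquote_to_bytes

HEX_DIGITS = "0123456789abcdefABCDEF"

def percent_decode(text: str) -> bytes:
    """Copy characters through; turn each %xx into the byte it names."""
    out = bytearray()
    i = 0
    while i < len(text):
        pair = text[i + 1:i + 3]
        if text[i] == "%" and len(pair) == 2 and all(c in HEX_DIGITS for c in pair):
            out.append(int(pair, 16))  # two hex chars -> one byte
            i += 3
        else:
            out.extend(text[i].encode("utf-8"))  # non-% input is copied as-is
            i += 1
    return bytes(out)

raw = percent_decode("caf%C3%A9")   # a buffer of UTF-8 bytes
utf8_text = raw.decode("utf-8")     # the text those bytes represent

# Option 1: ISO-8859-1, which has a single byte for the accented e.
latin1_bytes = utf8_text.encode("latin-1")

# Option 2: plain ASCII, dropping accents via NFKD decomposition.
decomposed = unicodedata.normalize("NFKD", utf8_text)
ascii_bytes = decomposed.encode("ascii", "ignore")

# Sanity check against the standard library's own decoder.
assert raw == unquote_to_bytes("caf%C3%A9")
print(raw, latin1_bytes, ascii_bytes)
```

Note that encode("latin-1") raises an error for characters outside ISO-8859-1, so real code needs to decide what to do with those.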
