trying to figure out the charset - character-encoding

I'm downloading a CSV from Google Docs and in it characters like “ are saved as \xE2\x80\x9C and ” are saved as \xE2\x80\x9D.
My question is... what charset are those being saved in? How might I go about figuring that out?

It is in UTF-8.. You can tell by decoding it as UTF-8 and it shows the correct characters.
UTF-8 also has a unique and very distinctive pattern, just 3 bytes with highest bit set forming a valid UTF-8 sequence are enough to tell if something is UTF-8 with 99% confidence. Even with 2 bytes with highest bit set forming a valid UTF-8 sequence, you can already get to 90%.
In a case it wasn't UTF-8, and was some 8-bit code page instead, it would be impossible to tell just by looking at the bytes alone. Without any other information, you would basically have to brute force by decoding it in various 8-bit encodings and then seeing if it looks correct. The other possibility is using an algorithm that would go through the encodings automatically, and see if it the result makes sense in any language.
With more information like what operating system and locale the file was saved in, you could reduce the amount of possible encodings to try by a huge deal though.

Related

Can urls have UTF-8 characters?

I was curious if I should encode urls with ASCII or UTF-8. I was under the belief that urls cannot have non-ASCII characters, but someone told me they can have UTF-8, and I searched around and couldn't quite find which one is true. Does anyone know?
There are two parts to this, but they both amount to "yes".
With IDNA, it is possible to register domain names using the full Unicode repertoire (with a few minor twists to prevent ambiguities and abuse).
The path part is not strictly regulated, but it's possible to encode arbitrary strings in the path. The browser could opt to display a human-readable rendering rather than an encoded path. However, this requires heuristics, as there is no way to specify the character set and encoding of the path.
So, http://xn--msic-0ra.example/mot%C3%B6rhead is a (fictional example, not entirely correct) computer-readable encoded URL which could be displayed to the user as http://müsic.example/motörhead. The domain name is encoded as xn--msic-0ra.example in something called Punycode, and the path contains the label "motörhead" encoded as UTF-8 and URL encoded (the Unicode code point U+00F6 is reprecented with the two bytes 0xC3 0xB6 in UTF-8).
The path could also be mot%F6rhead which is the same label in Latin-1. In this case, deducing a reasonable human-readable representation would be much harder, but perhaps the context of the surrounding characters could offer enough hints for a good guess.
In isolation, %F6 could be pretty much anything, and %C3%B6 could be e.g. UTF-16.

What is the usefulness of mb_http_output() given that the output encoding is typically fixed by other means?

All over the Internet, including in stackoverflow, it is suggested to use mb_http_input('utf-8') to have PHP works in the UTF-8 encoding. For example, see PHP/MySQL encoding problems. � instead of certain characters. On the other hand, the PHP manual says that we cannot fix the input encoding within the PHP script and that mb_http_input is only a way to query what it is, not a way to set it. See http://www.php.net/manual/en/mbstring.http.php and http://php.net/manual/en/function.mb-httpetinput.php . Ok, this was just a clarification of the context before the question. It seems to me that there is a lot of redundant commands in Apache + PHP + HTML to control the conversion from the input encoding to the internal encoding and finally to the output encoding. I don't understand the usefulness of this. For example, if the original input encoding from some external HTTP client is EUC-JP and I set the internal encoding to UTF-8, then PHP would have to make the conversion. Am I right? If I am right, why would I set an input encoding in php.ini (instead of just passing the original one) given that it would be next immediately converted to the utf-8 internal encoding anyway? A similar question hold for the output. In all my htpp files, I use a meta tag with charset=utf-8. So, the output HTTP encoding is fixed. Moreover, in PHP.ini, I can set the default_charset that will appear in the HTTP header to utf-8. Why would I bother to use mb_http_output('uft-8') when the final output encoding is already fixed. To sum up, can someone give me a practical concrete example where mb_http_output('uft-8') is clearly necessary and cannot be replaced by more usual commands that are often inserted by default in editors such as Dreamweaver?
These two options are just about the worst idea the PHP designers ever had, and they had plenty of bad ideas when it comes to encodings.
To convert strings to a specific encoding, one has to know what encoding one is converting from. Incoming data is often in an undeclared encoding; the server just receives some binary data, it doesn't know what encoding it represents. You should declare what encoding you expect the browser to send by setting the accept-charset attribute on forms; doing that is no guarantee that the browser will do so and it doesn't make PHP know what encoding to expect though.
The same goes for output; PHP strings are just byte arrays, they do not have an associated encoding. I have no idea how PHP thinks it knows how to convert arbitrary strings to a specific encoding upon input or output.
You should handle this manually, and it's really easy to do anyway: declare to clients what encoding you expect, check whether input is in the correct encoding using mb_check_encoding (not _detect encoding or some such, just check), reject invalid input, take care to keep everything in the same encoding within the whole application flow. I.e., ideally you have no conversion whatsoever in your app.
If you do need to convert at any point, make it a Unicode sandwich: convert input from the expected encoding to UTF-8 or another Unicode encoding on input, convert it back to desired output encoding upon output. Whenever you need to convert, make sure you know what you're converting from. You cannot magically "make all strings UTF-8" with one declaration.

difference between characters online and on localhost

The question is ,I hope,simple for someone who knows about character encoding.
This is my site.
http://www.football-tennis-stats.com/index.php/stats/display/tennis
Online, the character set is wrong ,I get this weird Â,while on localhost everything is allright.
I know there is a lot of good reading to be done on this subject,but I don't even know where to start.
There does not seem to be any character encoding issue, just spurious data, namely bytes 0xC3 0x82, which represent the character  when interpreted in UTF-8, which is the declared encoding. Otherwise, the content seems to be all ASCII, because the names are in “internationalized”, i.e. anglicized form, e.g. Djokovic instead of Đoković, Soderling instead of Söderling etc. With this data, it does not matter much how you declare its encoding, since ASCII characters mostly have the same representation anyway.
I have no idea where the bytes come from, but they seem to appear systematically between a comma and a space, so it’s apparently something in the code that generates the table.

Character Encoding and the ’ Issue

Even today, one frequently sees character encoding problems with significant frequency. Take for example this recent job post:
(Note: This is an example, not a spam job post... :-)
I have recently seen that exact error on websites, in popular IM programs, and in the background graphics on CNN.
My two-part question:
What causes this particular, common encoding issue?
As a developer, what should I do with user input to avoid common encoding issues like
this one? If this question requires simplification to provide a
meaningful answer, assume content is entered through a web browser.
What causes this particular, common encoding issue?
This will occur when the conversion between characters and bytes has taken place using the wrong charset. Computers handles data as bytes, but to represent the data in a sensible manner to humans, it has to be converted to characters (strings). This conversion takes place based on a charset of which there are many different ones.
In the particular ’ example, this is a typical CP1252 representation of the Unicode Character 'RIGHT SINQLE QUOTATION MARK' (U+2019) ’ which was been read using UTF-8. In UTF-8, that character exist of the bytes 0xE2, 0x80 and 0x99. If you check the CP1252 codepage layout, then you'll see that those bytes represent exactly the characters â, € and ™.
This can be caused by the website not having read in the original source properly (it should have used CP1252 for this), or is displaying an UTF-8 page with the wrong charset=CP1252 attribute in Content-Type response header (or the attribute is missing; on Windows machines the default charset of CP1252 would be used then).
As a developer, what should I do with user input to avoid common encoding issues like this one? If this question requires simplification to provide a meaningful answer, assume content is entered through a web browser.
Ensure that you read the characters from arbitrary byte stream sources (e.g. a file, an URL, a network socket, etc) using a known and predefinied charset. Then, ensure that you're consistently storing, writing and sending it using an Unicode charset, preferably UTF-8.
If you're familiar with Java (your question history confirms this), you may find this article useful.

What strategies are there for escaping character entities?

We are doing Natural Language Processing on a range of English language documents (mainly scientific) and run into problems in carrying non-ANSI characters through the various components. The documents may be "ASCII", UNICODE, PDF, or HTML. We cannot predict at this stage what tools will be in our chain or whether they will allow character encodings other than ANSI. Even ISO-Latin characters expressed in UNICODE will give problems (e.g. displaying incorrectly in browsers). We are likely to encounter a range of symbols including mathematical and Greek. We would like to "flatten" these into a text string which will survive multistep processing (including XML and regex tools) and then possibly reconstitute it in the last step (although it is the semantics rather than the typography we are concerned with so this is a minor concern).
I appreciate that there is no absolute answer - any escaping can clash in some cases - but I am looking for something allong the lines of XML's <![CDATA[ ...]]> which will survive most non-recursive XML operations. Characters such as [ are bad as they are common in regexes. So I'm wondering if there is a generally adopted approach rather than inventing our own.
A typical example is the "degrees" symbol:
HTML Entity (decimal) °
HTML Entity (hex) °
HTML Entity (named) °
How to type in Microsoft Windows Alt +00B0
Alt 0176
Alt 248
UTF-8 (hex) 0xC2 0xB0 (c2b0)
UTF-8 (binary) 11000010:10110000
UTF-16 (hex) 0x00B0 (00b0)
UTF-16 (decimal) 176
UTF-32 (hex) 0x000000B0 (00b0)
UTF-32 (decimal) 176
C/C++/Java source code "\u00B0"
Python source code u"\u00B0"
We are also likely to encounter TeX
$10\,^{\circ}{\rm C}$
or
\degree
so backslashes, curlies and dollars are a poor idea.
We could for example use markup like:
__deg__
__#176__
and this will probably work but I'd appreciate advice from those who have similar problems.
update I accept #MichaelB's insistence that we use UTF-8 throughout. I am worried that some of our tools may not conform and if so I'll revisit this. Note that my original question is not well worded - read his answer and the link in it.
Get someone to do this who really understands character encodings. It looks like you don't, because you're not using the terminology correctly. Alternatively, read this.
Do not brew up your own escape scheme - it will cause you more problems than it will solve. Instead, normalize the various source encodings to UTF-8 (which is really just one such escape scheme, except efficient and standardized) and handle character encodings correctly. Perhaps use UTF-7 if you're really that scared of high bits.
In this day and age, not handling character encodings correctly is not acceptable. If a tool doesn't, abandon it - it is most likely very bad quality code in many other ways as well and not worth the hassle using.
Maybe I don't get the problem correctly, but I would create a very unique escape marker which is unlikely to be touched, and then use it to enclose the entity encoded as a base32 string.
Eventually, you can transmit the unique markers and their number along the chain through a separate channel, and check their presence and number at the end.
Example, something like
the value of the temperature was 18 cd48d8c50d7f40aeb6a164181b17feee EZSGKZY= cd48d8c50d7f40aeb6a164181b17feee
your marker is a uuid, and the entity is &deg encoded in base32. You then pass along the marker cd48d8c50d7f40aeb6a164181b17feee. It cannot be corrupted (if it gets corrupted, your filters will probably corrupt anything made of letters and numbers anyway, but at least you can exclude them because they are fixed length) and you can always recover the content by looking inside the two markers.
Of course, if you have uuids in your documents, this could represent a problem, but since you are not transmitting them as authorized markers along the lateral channel, they won't be recognized as such (and in any case, what's inbetween won't validate as a base32 string anyway).
If you need to search for them, then you can keep the uuid subdivision, and then use a proper regexp to spot these occurrences. Example:
>>> re.search("(\w{8}-\w{4}-\w{4}-\w{4}-\w{12})(.*?)(\\1)", s)
<_sre.SRE_Match object at 0x1003d31f8>
>>> _.groups()
('6d378205-1265-44e4-80b8-a47d1ceaad51', ' EZSGKZY= ', '6d378205-1265-44e4-80b8-a47d1ceaad51')
>>>
If you really need a specific "token" to test, you can use a uuid1, with a very defined specification of a node:
>>> uuid.uuid1(node=0x1234567890)
UUID('bdcce554-e95d-11de-bd0f-001234567890')
>>> uuid.uuid1(node=0x1234567890)
UUID('c4c57a91-e95d-11de-90ca-001234567890')
>>>
You can use anything you prefer as a node, the uuid will be unique, but you can still test for presence (although you can get false positives).

Resources