My application is parsing incoming emails. I try to parse them as best as possible but every now and then I get one with puzzling content. This time is an email that looks to be in ASCII but the specified charset is: ansi_x3.110-1983.
My application handles it correctly by defaulting to ASCII, but it throws a warning which I'd like to stop receiving, so my question is: what is ansi_x3.110-1983 and what should I do with it?
According to this page on the IANA's site, ANSI_X3.110-1983 is also known as:
iso-ir-99
CSA_T500-1983
NAPLPS
csISO99NAPLPS
Of those, only the name NAPLPS seems interesting or informative. If you can, consider getting in touch with the people sending those mails. If they're really using Prodigy in this day and age, I'd be amazed.
The IANA site also has a pointer to RFC 1345, which contains a description of the bytes and the characters that they map to. Compared to ISO-8859-1, the control characters are the same, as are most of the punctuation, all of the numbers and letters, and most of the remaining characters in the first 7 bits.
You could possibly use the guide in the RFC to write a tool to map the characters over, if someone hasn't written a tool for it already. To be honest, it may be easier to simply ignore the whines about the weird character set given that the character mapping is close enough to what is expected anyway...
Related
I am working on data imported from legacy database into sqlite for development, legacy database has a lot of url encoded strings with Polish characters. I can get most of these strings readable by using
CGI::unescape_html( CGI::unescape "string" )
except for one case (that I noticed yet, there may be more as I didn't do any testing yet), the letter "ó". For instance, using unescapeHTML on string "wymiana+teflon%F3w" throws an invalid byte sequence exception.
Question now is either my string is properly escaped, as other Polish characters are using sequences of "&#nnn;" like "b%26%23322%3Bad+zapisu+%2D+powinno+by%26%23263%3B+brak", which seems to follow standard for numeric character referencing. BTW, this string is properly unescaped into
"bład zapisu - powinno być brak"
But, on the other hand, there are also strings with similar character encoding, e.g. "odpowietrzanie+weza%5C" which is properly handled by CGI::unescapeHTML. However, %5C represents a backslash not a letter with code point lower than U+0256. Can it be the reason? I tried to research on this but haven't found any explanation. I also updated my Ruby to 2.1.0 as CGI::Util has changed in new version, but still no luck.
ó is 0xF3 in ISO-8859-2 (and ISO-8859-1) but '\xF3' is not a valid UTF-8 string, that ó should be %C3%B3 in the URL if you're expecting UTF-8. Someone somewhere probably used the deprecated escape JavaScript function to encode the string instead of modern encodeURIComponent; you can see the difference with a simple test in your browser's JavaScript console:
> escape('ó')
"%F3"
> encodeURIComponent('ó')
"%C3%B3"
There's the %F3 you're seeing and the %C3%B3 that you want to see. One thing that should work is to fix the encoding by hand:
irb> CGI::unescape('wymiana+teflon%F3w').force_encoding('ISO-8859-2').encode('UTF-8')
=> "wymiana teflonów"
This assumes that you know what should be ISO-8859-1 and what should be UTF-8. You might have a mix of both ISO-8859-2 (or -1, -3, ..., Windows CP-1258, ...) in your data; unfortunately, there's no reliable way to tell the difference as the encodings overlap and there's no way to be sure what result makes sense without eye-balling it and knowing the various languages involved.
Probably the best you can do is:
Send everything through through your CGI::unescape_html(CGI::unescape(...)) converter.
Wrap that in an exception handler to trap the inevitable problems.
Stash the problem strings off to the side somewhere.
Try the ISO-8859-2 to UTF-8 conversion on the strings from (3) and eye-ball them to see if they makes sense.
Repeat with other common encodings until there's nothing left that you care about.
Note that I'm using ISO-8859-2 instead of the more common ISO-8859-1 as Latin-2 is for Eastern European languages (such as Polish) whereas Latin-1 is for Western European languages. They overlap on ó but there is no ł in Latin-1. With tasks like this you usually try the encodings that are probably there first, then fall back on other common encodings, then fall back to whatever other encodings you can think of, and then fall back on hard liquor.
Good luck, modernizing legacy data is not the funnest job in the world.
I've chosen another way to solve my problem, simply substituting all occurrences of '%F3' with '%26%23xF3%3B' before unescaping. BTW, capital letter Ó also needs similar substitution. The actual code I used:
def unescape_ó(s)
s = s.gsub(/%D3|%F3/, {'%D3' =>'%26%23xD3%3B', '%F3' => '%26%23xF3%3B'})
end
With this approach I don't have to handle invalid byte sequence exception as properly escaped string is used in CGI::unescapeHTML
I need to reference to a Unicode character with a URI. Following IANA references list multiple schemes and namespaces but do not mention anything about identifiers for the Unicode characters. Does anyone know if something like this exists already?
http://www.iana.org/assignments/uri-schemes.html
http://www.iana.org/assignments/urn-namespaces/urn-namespaces.xml
I hoped to find something like
unicode://U+0394
urn:unicode://0394
http://unicode.org/unicode/0394
for the greek capital letter delta Δ.
If someone wonders, this is for a semantic web like application that uses URIs as identifiers for concepts, including concepts of the Unicode characters.
I’m afraid there is no URL or URN for referring authoritative information on a Unicode character in general. In the Unicode Standard, information about individual characters is partly in the so-called character database (mostly plain text files in specific formats), partly in the Code Charts (PDF files). Neither of them offers a way to point at an individual character. Moreover, the information there is not exhaustive: there are important remarks on individual characters information scattered around the standard.
The Decodeunicode site has individually addressable items, such as
http://www.decodeunicode.org/en/u+0394
but its information content varies a lot and is generally very limited. It is not official, and it currently contains Unicode 5.0 only.
The Fileformat.info site is much more systematic, but it, too, is unofficial. It is basically limited to formal properties and data derivable from them, plus comments extracted from the Code Charts, plus instructions on typing the character in Windows, plus information about support in fonts—but that’s quite a lot! Example:
http://www.fileformat.info/info/unicode/char/0394/
[ EDIT ] : found this URL matching your needs : http://unicode.org/cldr/utility/character.jsp?a=1F40F
.
Well, there is an URL referencing the authoritative information on the Unicode database, even though it does not describe (as said in the other answer) all the information on one specific character.
You have the following URL, pointing to the latest Unicode database. This is a simple list of existing valid Unicode characters. Some upcoming characters are missing (㋿), and you should expect it to be mutable.
https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt
The contents looks like the following, which isn't so practical to use as-is.
$ grep -ai kangaroo UnicodeData.txt -C 7
1F991;SQUID;So;0;ON;;;;;N;;;;;
1F992;GIRAFFE FACE;So;0;ON;;;;;N;;;;;
1F993;ZEBRA FACE;So;0;ON;;;;;N;;;;;
1F994;HEDGEHOG;So;0;ON;;;;;N;;;;;
1F995;SAUROPOD;So;0;ON;;;;;N;;;;;
1F996;T-REX;So;0;ON;;;;;N;;;;;
1F997;CRICKET;So;0;ON;;;;;N;;;;;
1F998;KANGAROO;So;0;ON;;;;;N;;;;;
1F999;LLAMA;So;0;ON;;;;;N;;;;;
1F99A;PEACOCK;So;0;ON;;;;;N;;;;;
1F99B;HIPPOPOTAMUS;So;0;ON;;;;;N;;;;;
1F99C;PARROT;So;0;ON;;;;;N;;;;;
1F99D;RACCOON;So;0;ON;;;;;N;;;;;
1F99E;LOBSTER;So;0;ON;;;;;N;;;;;
1F99F;MOSQUITO;So;0;ON;;;;;N;;;;;
You could build up a hacky « hash-based » namespace with a suffix like this, but that's definitely non-standard.
https://www.unicode.org/Public/UCD/latest/ucd/UnicodeData.txt#1F998
Since this is also tagged semantic-web, I will try to pick URIs that are easily (and permanently) dereferenceable and cannot be mistaken for a document describing that character: the data: scheme. Not only can that refer to a character in Unicode, but any encoding, and also any string thereof.
data:;charset=utf-8,%CE%94
Attempting to open this URI should result in a text/plain file with the single character as its content.
If the system accepts IRIs (as many semantic web applications do), the character can be included directly:
data:;charset=utf-8,Δ
This is mapped to the same URI as shown above, and your browser may convert it directly. Specifying UTF-8 is necessary in this case, since the mapping is not defined for other encodings.
Even today, one frequently sees character encoding problems with significant frequency. Take for example this recent job post:
(Note: This is an example, not a spam job post... :-)
I have recently seen that exact error on websites, in popular IM programs, and in the background graphics on CNN.
My two-part question:
What causes this particular, common encoding issue?
As a developer, what should I do with user input to avoid common encoding issues like
this one? If this question requires simplification to provide a
meaningful answer, assume content is entered through a web browser.
What causes this particular, common encoding issue?
This will occur when the conversion between characters and bytes has taken place using the wrong charset. Computers handles data as bytes, but to represent the data in a sensible manner to humans, it has to be converted to characters (strings). This conversion takes place based on a charset of which there are many different ones.
In the particular ’ example, this is a typical CP1252 representation of the Unicode Character 'RIGHT SINQLE QUOTATION MARK' (U+2019) ’ which was been read using UTF-8. In UTF-8, that character exist of the bytes 0xE2, 0x80 and 0x99. If you check the CP1252 codepage layout, then you'll see that those bytes represent exactly the characters â, € and ™.
This can be caused by the website not having read in the original source properly (it should have used CP1252 for this), or is displaying an UTF-8 page with the wrong charset=CP1252 attribute in Content-Type response header (or the attribute is missing; on Windows machines the default charset of CP1252 would be used then).
As a developer, what should I do with user input to avoid common encoding issues like this one? If this question requires simplification to provide a meaningful answer, assume content is entered through a web browser.
Ensure that you read the characters from arbitrary byte stream sources (e.g. a file, an URL, a network socket, etc) using a known and predefinied charset. Then, ensure that you're consistently storing, writing and sending it using an Unicode charset, preferably UTF-8.
If you're familiar with Java (your question history confirms this), you may find this article useful.
I came across the following URL today:
http://www.sfgate.com/cgi-bin/blogs/inmarin/detail??blogid=122&entry_id=64497
Notice the doubled question mark at the beginning of the query string:
??blogid=122&entry_id=64497
My browser didn't seem to have any trouble with it, and running a quick bookmarklet:
javascript:alert(document.location.search);
just gave me the query string shown above.
Is this a valid URL? The reason I'm being so pedantic (assuming that I am) is because I need to parse URLs like this for query parameters, and supporting doubled question marks would require some changes to my code. Obviously if they're in the wild, I'll need to support them; I'm mainly curious if it's my fault for not adhering to URL standards exactly, or if it's in fact a non-standard URL.
Yes, it is valid. Only the first ? in a URL has significance, any after it are treated as literal question marks:
The query component is indicated by
the first question mark ("?")
character and terminated by a number
sign ("#") character or by the end of
the URI.
...
The characters slash ("/") and
question mark ("?") may represent data
within the query component. Beware
that some older, erroneous
implementations may not handle such
data correctly when it is used as the
base URI for relative references
(Section 5.1), apparently because they
fail to distinguish query data from
path data when looking for
hierarchical separators. However, as
query components are often used to
carry identifying information in the
form of "key=value" pairs and one
frequently used value is a reference
to another URI, it is sometimes better
for usability to avoid
percent-encoding those characters.
https://www.rfc-editor.org/rfc/rfc3986#section-3.4
As a tangentially related answer, foo?spam=1?&eggs=3 gives the parameter spam the value 1?
I have lots of UTF-8 content that I want inserted into the URL for SEO purposes. For example, post tags that I want to include in th URI (site.com/tags/id/TAG-NAME). However, only ASCII characters are allowed by the standards.
Characters that are allowed in a URI
but do not have a reserved purpose are
called unreserved. These include
uppercase and lowercase letters,
decimal digits, hyphen, period,
underscore, and tilde.
The solution seems to be to:
Convert the character string into a
sequence of bytes using the UTF-8
encoding
Convert each byte that is
not an ASCII letter or digit to %HH,
where HH is the hexadecimal value of
the byte
However, that converts the legible (and SEO valuable) words into mumbo-jumbo. So I'm wondering if google is still smart enough to handle searches in URL's that contain encoded data - or if I should attempt to convert those non-english characters into there semi-ASCII counterparts (which might help with latin based languages)?
Firstly, search engines really don't care about the URLs. They help visitors: visitors link to sites, and search engines care about that. URLs are easy to spam, if they cared there would be incentive to spam. No major search engines wants that. The allinurl: is merely a feature of google to help advanced users, not something that gets factored into organic rankings. Any benefits you get from using a more natural URL will probably come as a fringe benefit of the PR from an inferior search engine indexing your site -- and there is some evidence this can be negative with the advent of negative PR too.
From Google Webmaster Central
Does that mean I should avoid
rewriting dynamic URLs at all?
That's
our recommendation, unless your
rewrites are limited to removing
unnecessary parameters, or you are
very diligent in removing all
parameters that could cause problems.
If you transform your dynamic URL to
make it look static you should be
aware that we might not be able to
interpret the information correctly in
all cases. If you want to serve a
static equivalent of your site, you
might want to consider transforming
the underlying content by serving a
replacement which is truly static. One
example would be to generate files for
all the paths and make them accessible
somewhere on your site. However, if
you're using URL rewriting (rather
than making a copy of the content) to
produce static-looking URLs from a
dynamic site, you could be doing harm
rather than good. Feel free to serve
us your standard dynamic URL and we
will automatically find the parameters
which are unnecessary.
I personally don't believe it matters all that much short of getting a little more click through and helping users out. So far as Unicode, you don't understand how this works: the request goes to the hex-encoded unicode destination, but the rendering engine must know how to handle this if it wishes to decode them back to something visually appealing. Google will render (aka decode) unicode (encoded) URL's properly.
Some browsers make this slightly more complex by always encoding the hostname portion, because of phishing attacks using ideographs that look the same.
I wanted to show you an example of this, here is request to http://hy.wikipedia.org/wiki/Գլխավոր_Էջ issued by wget:
Hypertext Transfer Protocol
GET /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB HTTP/1.0\r\n
[Expert Info (Chat/Sequence): GET /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB HTTP/1.0\r\n]
[Message: GET /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB HTTP/1.0\r\n]
[Severity level: Chat]
[Group: Sequence]
Request Method: GET
Request URI: /wiki/%D4%B3%D5%AC%D5%AD%D5%A1%D5%BE%D5%B8%D6%80_%D4%B7%D5%BB
Request Version: HTTP/1.0
User-Agent: Wget/1.11.4\r\n
Accept: */*\r\n
Host: hy.wikipedia.org\r\n
Connection: Keep-Alive\r\n
\r\n
As you can see, wget like every other browser will just url-encode the destination for you, and the continue the request to the url-encoded destination. The url-decoded domain only exists as a visual convenience.
Do you know what language everything will be in? Is it all latin based?
If so, then I would suggest building a sort of lookup table that will convert UTF-8 to ASCII when possible(and non-colliding) Something like that would convert Ź into Z and such, and when there is a collision or the character doesn't exist in your lookup table, then it just uses %HH.