How to identify the character encoding of a website?

What I'm trying to do:
I get a list of URIs from a database and download each page,
remove the stopwords and count how often each word appears in the page,
then try to save the result in MongoDB.
The Problem:
When I try to save the result in the database I get the error
bson.errors.invalidDocument: the document must be a valid utf-8
It appears to be related to byte sequences such as '\xc3someotherstrangewords' and '\xe2something'.
When processing the pages I strip the punctuation, but I can't strip the accents because that would produce the wrong word.
What I already tried
I've tried identifying the character encoding from the page's headers
I've tried using chardet
I've tried using re.compile(r"[^a-zA-Z]") and/or unicode(variable, 'ascii', 'ignore');
these are no good for non-English languages because they remove the accented characters.
What I want to know is:
Does anyone know how to identify the characters and convert them to the right word/encoding?
E.g. get '\xe2' from a web page and turn it into 'â'.
(English isn't my first language, so forgive me.)
EDIT: if anyone wants to see the source code

It is not easy to find out the correct character encoding of a website because the information in the header might be wrong. BeautifulSoup does a pretty good job at guessing the character encoding and automatically decodes it to Unicode.
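from bs4 import BeautifulSoup
import urllib  # Python 2; on Python 3 use urllib.request.urlopen instead
url = 'http://www.google.de'
fh = urllib.urlopen(url)
html = fh.read()
# pass the raw bytes so that BeautifulSoup can detect the encoding itself;
# naming the parser avoids the "no parser was explicitly specified" warning
soup = BeautifulSoup(html, 'html.parser')
# text is a Unicode string
text = soup.body.get_text()
# encoded_text is a utf-8 string that you can store in mongo
encoded_text = text.encode('utf-8')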
See also the answers to this question.
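If you would rather not depend on a detection library, a rough alternative is to try a short list of encodings in order. This is only a sketch under stated assumptions: `decode_page` and its `declared` parameter are hypothetical names, and the choice of UTF-8 followed by cp1252 as fallbacks is a guess that suits Western European pages, not a general rule.

```python
from typing import Optional


def decode_page(raw: bytes, declared: Optional[str] = None) -> str:
    """Decode raw page bytes, trying the most likely encodings in order."""
    # Assumption: the declared charset (if any) is tried first, then UTF-8,
    # then cp1252. Adjust the list for the languages you expect.
    candidates = [declared] if declared else []
    candidates += ["utf-8", "cp1252"]
    for name in candidates:
        try:
            return raw.decode(name)  # strict: raises on invalid byte sequences
        except (UnicodeDecodeError, LookupError):
            continue  # wrong guess or unknown encoding name; try the next one
    # latin-1 maps every byte to a code point, so this last resort never fails
    return raw.decode("latin-1")


print(decode_page(b"caf\xc3\xa9"))  # valid UTF-8
print(decode_page(b"ch\xe2teau"))   # not valid UTF-8, falls back to cp1252
```

Whatever this returns is a Unicode string, so encoding it with `.encode('utf-8')` always gives valid UTF-8. The words themselves can still be wrong if the guess was wrong.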

Related

\u0092 is not printed in UILabel

I have a local JSON file with some descriptions of an app, and I have found weird behaviour when parsing the \u0092 and \u0091 characters.
When the JSON file contains these characters, the corresponding parsed NSString is printed as "?" and in a UILabel the character disappears completely.
Example: "L\u0092H\u00e9r." is shown as "LHér." instead of "L'Hér."
If I replace these characters with \u2019, then I can see the ' character in the UILabel.
Does anybody have any clue about this?
EDIT: For the moment I will substitute both of them with character \u2019, it is also a ' and there is no problem confusing it with a control character. Thank you all!
This answer is a little speculative, but I hope it gets you on the right tracks.
Your best bet may be to give up and replace \u0091 and \u0092 with something else as a preprocessing step before displaying the string. These are control characters and are unprintable in most encodings. But:
If the rest of the file is proper UTF-8, your JSON file probably has a problem: the encoding is wrong (CP-1250?) while you read the file as UTF-8, some error was made when converting the file, or something similar. So another solution is, of course, to fix your file.
If you're not sure how your file is encoded, it may simply be encoded in CP-1250, so reading the file using NSWindowsCP1250StringEncoding might fix your problem.
BTW, if you hardcode a string @"\u0091", you'll get a compile-time error: Universal character name refers to a control character. Yes, not even a warning; it's that unprintable in Unicode ;)
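To see why these two code points misbehave, and what the preprocessing step could look like, here is a small Python sketch (Python rather than Objective-C, purely for illustration). It assumes, as suggested above, that the stray characters were originally CP-1250 bytes. `repair_c1` is a hypothetical helper name.

```python
import unicodedata

for ch in ("\u0091", "\u0092"):
    # Both sit in the C1 control range (U+0080..U+009F), general category "Cc"
    print(f"U+{ord(ch):04X} category={unicodedata.category(ch)}")


def repair_c1(text: str) -> str:
    """Reinterpret stray C1 code points as the CP-1250 bytes they probably were."""
    out = []
    for ch in text:
        if "\u0080" <= ch <= "\u009f":
            try:
                ch = bytes([ord(ch)]).decode("cp1250")
            except UnicodeDecodeError:
                pass  # byte is unassigned in CP-1250; leave it unchanged
        out.append(ch)
    return "".join(out)


print(repair_c1("L\u0092H\u00e9r."))
```

With that assumption, \u0092 comes out as \u2019, which matches the manual substitution described in the question's edit.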

How to transform encoded URL to readable texts?

This is about Bangla Unicode text, but it can be a problem for any language that doesn't use Latin glyphs.
I host a Bangla blog with all its text and categories in Bangla (I prefer not to say Bengali, because the name of the language is Bangla rather than Bengali).
So the Bangla category "বাংলা" gives a URL like:
http://www.example.com/category/বাংলা
But whenever I copy the URL from the address bar and paste it into a chat panel or somewhere else, it changes into some strange characters, for example:
http://www.example.com/category/%E0%A6%B8%E0%A7%8D%E0%A6%A8%E0*
* (it's just an example, not the exact gibberish for the word "বাংলা")
So in many cases I get encoded URLs like the one above, with no way to tell which Unicode text they represent. Recently I have been getting some 404 errors logged by one of my plugins. There I found a URI like:
/category/%E0%A6%B8%E0%A7%8D%E0%A6%A8%E0%A6%BE%E0%A7%9F%E0%A7%81%E0%A6%AC%E0%A6%BF%E0%A6%A6%E0%A7%8D%E0%A6%AF%E0
I used Jetpack's Omnisearch to look for a match, but the result was empty. I can't even trace which category is causing the 404.
So here comes the question:
How can I transform the encoded URL to readable glyphs?
http://www.example.com/category/বাংলা
isn't a URL; URLs can only contain ASCII characters. This is an IRI.
http://www.example.com/category/%E0%A6%AC%E0%A6%BE%E0%A6%82%E0%A6%B2%E0%A6%BE
is the URI representation of that IRI. They are otherwise equivalent. A browser may display the ‘pretty’ IRI version in the user interface, but put the URI version on the clipboard so that you can paste it into other tools that don't support IRI.
The 404 address you pasted translates to:
/category/স্নায়ুবিদ্য�
where the last character is a � because it is an invalid, truncated UTF-8 sequence. (This is probably why the request failed.) Someone may have mis-pasted a partial URI here.
If you're using JavaScript you can do:
decodeURIComponent(url);
This decodes the percent-escapes back into the original characters. Note that it throws a URIError if the input contains a malformed sequence.
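The same round trip can be checked offline, for example when going through a 404 log. A minimal sketch in Python using only the standard library; the paths are the ones quoted in the question, and the variable names are just for illustration.

```python
from urllib.parse import quote, unquote

iri_path = "/category/বাংলা"
uri_path = quote(iri_path)   # percent-encode the UTF-8 bytes, keep "/" as is
print(uri_path)
print(unquote(uri_path))     # back to the readable form

# The path from the 404 log ends in a lone %E0, an incomplete UTF-8 sequence.
logged = ("/category/%E0%A6%B8%E0%A7%8D%E0%A6%A8%E0%A6%BE%E0%A7%9F%E0%A7%81"
          "%E0%A6%AC%E0%A6%BF%E0%A6%A6%E0%A7%8D%E0%A6%AF%E0")
decoded = unquote(logged)    # errors="replace" is the default
print(decoded)
```

Unlike decodeURIComponent, `unquote` does not raise on the truncated sequence by default. It substitutes U+FFFD for the bytes it cannot decode, which is handy for reading logs.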

lua reading chinese character

I have the following XML feeds that I would like to read:
Chinese XML - https://news.google.com/news/popular?ned=cn&topic=po&output=rss
Korean XML - http://www.voanews.com/templates/Articles.rss?sectionPath=/korean/news
Currently I use LuaXML to parse XML that contains Chinese characters. However, when I print the result to the console, the Chinese characters are not printed correctly and show up as garbage.
Is there any way to parse Chinese or Korean characters into a Lua table?
I don't think Lua is the issue here. The raw data the remote site sends is encoded using UTF-8, and Lua does no special interpretation of that—which means it should be preserved perfectly if you just (1) read from the remote site, and (2) save the read data to a file. The data in the file will contain CJK characters encoded in UTF-8, just like the remote site sent back.
If you're getting funny results like you mention, the fault probably lies either with the library you're using to read from the remote site, or perhaps simply with the way your console displays the results when you output to it.
I managed to convert "&#20013;&#32654;" into Chinese characters.
It needs one additional step: converting every such numeric character reference in the string, using the method from this link, http://forum.luahub.com/index.php?topic=3617.msg8595#msg8595, before saving it in XML format.
-- utf8.char needs Lua 5.3+; string.char only accepts values 0-255, so it fails for CJK code points
string.gsub(l, "&#([0-9]+);", function(c) return utf8.char(tonumber(c)) end)
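For comparison, here is the same idea in Python (not Lua, just to show what the substitution is doing): a numeric character reference such as &#20013; holds the decimal code point of the character, and decoding it means turning that number back into the character. `decode_decimal_refs` is a hypothetical helper that mirrors the gsub pattern above.

```python
import html
import re

escaped = "&#20013;&#32654;"

# Standard-library route: handles decimal, hexadecimal and named references.
print(html.unescape(escaped))


# Hand-rolled equivalent of the gsub pattern, for decimal references only.
def decode_decimal_refs(text: str) -> str:
    return re.sub(r"&#([0-9]+);", lambda m: chr(int(m.group(1))), text)


print(decode_decimal_refs(escaped))
```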
I would also like to ask about LuaXML: I have come across the method xml.registerCode(decoded, encoded).
Its description says that it
registers a custom code for the conversion between non-standard characters and XML character entities
What do they mean by non-standard characters, and how do I use this method?

Problem with cyrillic characters in Ruby on Rails

In my Rails app I work a lot with Cyrillic characters. That's no problem: I store them in the DB and I can display them in HTML.
But I have a problem exporting them to a plain txt file. A string like "элиас" becomes "—ç–ª–∏–∞—Å" if I let Rails put it in a txt file and download it. What's wrong here? What has to be done?
Regards,
Elias
Obviously, there's a problem with your encoding. Make sure your text is in Unicode (UTF-8) before writing it to the text file. You may use something like this:
require 'iconv'  # standard library up to Ruby 1.9; a separate gem from Ruby 2.0
# convert UTF-8 to UTF-8, dropping any invalid byte sequences
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
your_unicode_text = ic.iconv(your_text + ' ')[0..-2]
Also, double-check that your database encoding is UTF-8. Cyrillic characters can display fine in the DB and in HTML with a non-Unicode encoding, e.g. KOI8-RU, but you're guaranteed to have problems with them elsewhere.
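Before changing anything, it may be worth checking whether the exported file is already correct UTF-8 and is only being displayed in the wrong encoding. A minimal Python sketch (Python only for illustration), assuming the viewer decoded the file as Mac OS Roman. That is a guess based on the look of the garbled text, not something stated in the question.

```python
original = "элиас"

# What a viewer shows if it reads the UTF-8 bytes as Mac OS Roman.
garbled = original.encode("utf-8").decode("mac_roman")
print(garbled)

# Reversing the wrong decode recovers the text, so no data was lost.
restored = garbled.encode("mac_roman").decode("utf-8")
print(restored)
```

If the first line printed matches what you see in the downloaded file, the bytes on disk are fine and the fix belongs on the reading side, for example by opening the file as UTF-8 in the editor.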

Translate binary characters to a human readable string?

So let's say we have a string that is like this:
‰û]M§Äq¸ºþe Ø·¦ŸßÛµÖ˜eÆÈym™ÎB+KºªXv©+Å+óS—¶ê'å‚4ŒBFJF󒉚Ү}Fó†ŽxöÒ&‹¢ T†^¤( OêIº ò|<)ð
How do I turn it into a human-readable string of characters? It was weird HTML output from a web server, and I think it is text because half the web page loaded correctly. Do I need to read it with C or Python or something? That's only a snippet of the string.
If that is in fact supposed to be a human-readable string, you'll need to figure out what character encoding it uses and translate. It's also possible that the string is compressed, encrypted, or represents binary data. It would be helpful to know where you got your string from.
I'm guessing your web server isn't sending the correct MIME type. I'd suggest taking a look at the HTTP headers using Firefox's Live Headers plugin. If a web server decides to send you a PDF but doesn't set the MIME type, you'll just see garbage on your screen. Alternatively, save the page to a file, and then run these commands from Cygwin or a Unix shell:
file mypage.htm
strings mypage.htm
The first will tell you whether the header bytes follow any recognizable pattern. The second will extract and display all the human-readable text.
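If you don't have those tools to hand, the same two checks can be roughed out in Python. This is a sketch only: `sniff` and `printable_runs` are hypothetical helpers, the signature table covers just four formats, and the gzip body is made up for the demonstration.

```python
import gzip
import re

# A few well-known file signatures ("magic numbers"); far from complete.
SIGNATURES = {
    b"\x1f\x8b": "gzip",
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
    b"PK\x03\x04": "zip",
}

# Runs of at least four printable ASCII bytes.
PRINTABLE = re.compile(rb"[\x20-\x7e]{4,}")


def sniff(data: bytes) -> str:
    """Guess the format from the leading bytes, roughly what `file` does."""
    for magic, name in SIGNATURES.items():
        if data.startswith(magic):
            return name
    return "unknown"


def printable_runs(data: bytes) -> list:
    """Return the printable ASCII runs, roughly what `strings` does."""
    return PRINTABLE.findall(data)


# Made-up example: a gzip-compressed page, as a server might send it.
body = gzip.compress(b"<html><body>Hello, world</body></html>")
print(sniff(body))
if sniff(body) == "gzip":
    body = gzip.decompress(body)
print(printable_runs(body))
```

A page that sniffs as gzip usually means the response was compressed and whatever fetched it did not decompress it, which would also produce the kind of garbage shown in the question.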

Resources