Reading Chinese characters in Lua

I have the following XML feeds that I would like to read:
chinese xml - https://news.google.com/news/popular?ned=cn&topic=po&output=rss
korean xml - http://www.voanews.com/templates/Articles.rss?sectionPath=/korean/news
Currently, I am trying to use LuaXML to parse the XML, which contains Chinese characters. However, when I print the result to the console, the Chinese characters are not displayed correctly; they show up as garbage characters.
I would like to ask: is there any way to parse Chinese or Korean characters into a Lua table?

I don't think Lua is the issue here. The raw data the remote site sends is encoded in UTF-8, and Lua does no special interpretation of it, which means the data should be preserved perfectly if you just (1) read it from the remote site, and (2) save the read data to a file. The file will then contain CJK characters encoded in UTF-8, just like the remote site sent them.
If you're getting garbage like you describe, the fault probably lies either with the library you're using to read from the remote site, or simply with the way your console displays the results when you output to it.
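For example, here is a minimal sketch of that read-and-save round trip, assuming LuaSocket (plus LuaSec, since the first feed is https) is installed; the output file name is arbitrary:
local http = require("socket.http")
-- Fetch the feed; the body arrives as a plain byte string, so the UTF-8
-- bytes pass through Lua untouched.
local body, code = http.request("https://news.google.com/news/popular?ned=cn&topic=po&output=rss")
assert(body, "request failed: " .. tostring(code))
-- Write in binary mode so the bytes are saved exactly as received.
local f = assert(io.open("feed.xml", "wb"))
f:write(body)
f:close()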

I managed to convert the escaped entities into Chinese characters such as "中美".
I needed to do one additional step, which converts all the numeric character references in the string using this method from this link, http://forum.luahub.com/index.php?topic=3617.msg8595#msg8595, before saving into XML format.
string.gsub(l,"&#([0-9]+);", function(c) return string.char(tonumber(c)) end)
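One caveat, in case it bites someone else: string.char only accepts byte values 0-255, so on stock Lua the snippet above can only handle entities in the Latin-1 range. On Lua 5.3+ the built-in utf8 library can encode any codepoint; a sketch under that assumption (here l is whatever string holds the raw XML, as in the snippet above):
-- Decode numeric character references such as "&#20013;" (中) into their
-- UTF-8 byte sequences; utf8.char handles codepoints above 255.
local decoded = l:gsub("&#([0-9]+);", function(c)
  return utf8.char(tonumber(c))
end)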
I would like to ask about LuaXML: I have come across the method xml.registerCode(decoded, encoded).
The documentation for that method says that it
registers a custom code for the conversion between non-standard characters and XML character entities
What do they mean by non-standard characters, and how do I use it?
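(For what it's worth, "non-standard characters" here presumably means anything outside XML's five predefined entities (&amp;, &lt;, &gt;, &apos;, &quot;), and registerCode appears to add a custom character/entity pair to LuaXML's conversion table. A purely hypothetical usage sketch, assuming the first argument is the decoded character and the second its encoded entity form:)
require("LuaXML")  -- assumption: this makes the xml module available
-- Hypothetical mapping: treat the euro sign and its numeric entity as
-- equivalent during LuaXML's encode/decode conversions.
xml.registerCode("€", "&#8364;")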

Related

Character encoding in PHP extension

I'm currently writing a PHP extension in C++ with the Zend API. Basically I write PHP_METHOD{..} wrappers around my native C++ interface methods and use "zend_parse_parameters(..)" to fetch the corresponding input arguments.
This extension contains methods which can take strings as arguments, such as a filename.
I know from http://php.net/manual/en/language.types.string.php#language.types.string.details that strings have no encoding in PHP, but can I still expect the PHP programmer to use a function like "utf8_decode(..)" so that the input strings can be read by the extension correctly?
Or does the PHP programmer expect the extension to detect the encoding from the PHP script and handle strings accordingly?
Any help is highly appreciated! Thanks!
You are correct: strings are just binary blobs in PHP. As the author of an extension, these are your options:
1. Have the user hand your extension UTF-8: by far the best option. The user has to make the decision. Assert that the string is valid UTF-8 and fail early (see the sketch after this list).
2. Encode yourself: you cannot know the meaning of the string. As PHP strings are just binary blobs carrying no encoding information, you do not know what the intended content is. It might as well come from a Windows file in one weird encoding, concatenated with text in a completely different encoding. Worse, it might be decodable as UTF-8 without actually being UTF-8, in which case you interpret it wrongly without the user knowing. Hence solution 1: have the user pass UTF-8.
3. Alternative: force the user to pass an input encoding.
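A minimal sketch of option 1 from the userland side, using the standard mbstring function mb_check_encoding (the class and method names are the hypothetical ones from the example below):
if (!mb_check_encoding($inputStr, 'UTF-8')) {
    // Fail early rather than let the extension misinterpret the bytes.
    throw new InvalidArgumentException('Input string must be valid UTF-8');
}
$obj->someMethod($inputStr);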
Here is an example of alternative 3:
$obj = new MyExtensionClass('UTF-8'); // force encoding
$obj->someMethod($inputStr); // try to convert now
The standard library uses approach 1; see json_encode as an example.
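A small demonstration of that behaviour (my sketch, not from the original answer): on any reasonably modern PHP, json_encode rejects strings that are not valid UTF-8:
// "\xC3\x28" is an invalid UTF-8 byte sequence.
var_dump(json_encode("\xC3\x28"));               // bool(false)
var_dump(json_last_error() === JSON_ERROR_UTF8); // bool(true)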

Ignore � (non-UTF-8 characters) in email attachment or strip them from the attachment?

Users of our application are able to upload plain text files. These files might then be added as attachments to outgoing ActionMailer emails. Recently an attempt to send such an email resulted in an "invalid byte sequence in UTF-8" error. The email was not sent. This symbol, �, appears throughout the offending attachment.
We're using ActionMailer, so although it ought to go without saying, here's representative code for the attachment step within the mailer class's method:
attachments['file-name.jpg'] = File.read('file-name.jpg')
From a business standpoint we don't care about the content of these text files. Ideally I'd like for our application to ignore the content and simply attach them to emails.
Is it possible to somehow tell Rails / ActionMailer to ignore the formatting? Or should I parse the incoming text file, stripping out non-UTF-8 characters?
I did search through like questions here on Stack Overflow but nothing addressed the problem I'm currently facing.
Edit: I did call #readlines on the file in a Rails console and found that the black diamond is a representation of \xA0. This is likely a non-breaking space in Latin1 (ISO 8859-1).
If Ruby is having problems reading the file and corrupting the characters during the read, then try using File.binread, which is inherited from IO:
...
attachments['attachment.txt'] = File.binread('/path/to/file')
...
If your file already has corrupted characters, then you can either find some process to 'uncorrupt' them, which is not fun, or strip them by re-encoding from ASCII-8BIT to UTF-8 and dropping the invalid characters.
...
attachments['attachment.txt'] = File.binread('/path/to/file')
                                    .encode('utf-8', 'binary', invalid: :replace, undef: :replace, replace: '')
...
(String#scrub does this, but since you can't read the file in as UTF-8, you can't use it here.)
With your edit, this seems pretty clear to me:
The file on your filesystem is encoded in Latin-1.
File.read uses the default external encoding. If LANG contains something like "en_GB.utf8", File.read will associate the string with UTF-8 encoding. You can verify this by logging the value of str.encoding (where str is the value of File.read).
File.read does not actually verify the encoding; it only slurps in the bytes and slaps the encoding on (like force_encoding).
Later, in ActionMailer, something wants to transcode the string, for whatever reason, and that fails as expected (with the result you are noticing).
If your text files are encoded in Latin-1, then use File.read(path, encoding: Encoding::ISO_8859_1). That way it should work. Let us know if it doesn't...
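A short sketch of that failure mode and the fix (the file name and content here are made up for illustration):
# Write a Latin-1 file containing "café" plus a non-breaking space (0xA0).
File.binwrite('legacy.txt', "caf\xE9\xA0".b)

str = File.read('legacy.txt')   # bytes slurped, default external encoding slapped on
str.encoding                    # => #<Encoding:UTF-8> (with a UTF-8 locale)
str.valid_encoding?             # => false, so any later transcoding step raises

fixed = File.read('legacy.txt', encoding: Encoding::ISO_8859_1)
fixed.encode('UTF-8')           # => "café\u00A0", transcodes cleanly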
When reading the file at time of attachment, I can use the following syntax.
mail.attachments[file.file_name.to_s] = File.read(path_to_file).force_encoding("BINARY").gsub(0xA0.chr,"")
The important addition is the following, which goes after the call to File.read(...):
.force_encoding("BINARY").gsub(0xA0.chr,"")
The stripping and encoding ought to be done at time of file upload to our system, so this answer isn't the resolution. It's a short-term band-aid.

How to identify character encoding from website?

What I'm trying to do:
I'm getting a list of URIs from a database and downloading them,
removing the stopwords and counting the frequency with which words appear in the webpage,
then trying to save the results in MongoDB.
The Problem:
When I try to save the result in the database, I get the error
bson.errors.InvalidDocument: the document must be valid UTF-8
It appears to be related to the codes '\xc3someotherstrangewords' and '\xe2something'.
When I process the webpages I try to remove the punctuation, but I can't remove the accents, because that would produce wrong words.
What I have already tried:
I've tried to identify the character encoding through the headers of the webpage.
I've tried to use chardet.
I've tried re.compile(r"[^a-zA-Z]") and/or unicode(variable, 'ascii', 'ignore'),
but those aren't good for non-English languages because they remove the accents.
What I want to know is:
does anyone know how to identify the characters and translate them to the right word/encoding?
e.g. take '\xe2' from a webpage and translate it to 'â'
(English isn't my first language, so forgive me.)
EDIT: if anyone wants to see the source code
It is not easy to find out the correct character encoding of a website because the information in the header might be wrong. BeautifulSoup does a pretty good job at guessing the character encoding and automatically decodes it to Unicode.
from bs4 import BeautifulSoup
import urllib
url = 'http://www.google.de'
fh = urllib.urlopen(url)  # Python 2; on Python 3 use urllib.request.urlopen
html = fh.read()
# an explicit parser avoids bs4's "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
# text is a Unicode string; BeautifulSoup has already guessed the encoding
text = soup.body.get_text()
# encoded_text is a UTF-8 byte string that you can store in mongo
encoded_text = text.encode('utf-8')
See also the answers to this question.

Weird characters on HTML page

I am using the Last.fm API to fetch some info on artists. I save the info in the DB and then display it on my webpage.
But characters like “ (a curly double quote) are shown as “.
Example artist info: http://www.last.fm/music/David+Penn
I got the first line as "Producer, arranger, dj and musician from Madrid-Spain. He has his own record company “Zen Recordsâ€, and".
My DB is UTF-8, but I don't know why this error is still occurring.
This seems to be a character encoding error. Confirm that you are reading the webpage as the correct encoding and are showing the results in the correct encoding.
You should be using UTF-8 all the way through. Check that:
your connection to the database is UTF-8 (using mysql_set_charset);
the pages you're outputting are marked as UTF-8 (<meta http-equiv="Content-Type" content="text/html;charset=utf-8">);
when you output strings from the database, you HTML-encode them using htmlspecialchars() and not htmlentities().
htmlentities HTML-encodes all non-ASCII characters, and by default assumes you are passing it bytes in ISO-8859-1. So if you pass it “ encoded as UTF-8 (bytes 0xE2, 0x80, 0x9C), you'd get â€œ instead of the expected “ or &ldquo;. This can be fixed by passing utf-8 as the optional $charset argument.
However, it's usually easier to just use htmlspecialchars() instead, as this leaves non-ASCII characters alone, as raw bytes instead of HTML entity references. This results in smaller page output, so it is preferable as long as you're sure the HTML you're producing will keep its charset information (which you can usually rely on, except in contexts like sending snippets of HTML in an e-mail).
htmlspecialchars() does have an optional $charset argument too, but setting it to utf-8 is not critical, since that results in no change of behaviour over the default ISO-8859-1 charset. If you are producing output in old-school multibyte encodings like Shift-JIS, you do have to worry about setting this argument correctly, but today that's quite rare, as most sane people use UTF-8.
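To make the difference concrete, a quick sketch (assuming the source string is UTF-8; the comments show roughly what each call produces):
$s = "“Zen Records”"; // UTF-8 curly quotes
echo htmlspecialchars($s, ENT_QUOTES, 'UTF-8');  // “Zen Records” (raw bytes pass through)
echo htmlentities($s, ENT_QUOTES, 'UTF-8');      // &ldquo;Zen Records&rdquo;
echo htmlentities($s, ENT_QUOTES, 'ISO-8859-1'); // encodes each byte separately: the â€œ mojibake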

Translate binary characters to a human readable string?

So let's say we have a string that is like this:
‰û]M§Äq¸ºþe Ø·¦ŸßÛµÖ˜eÆÈym™ÎB+KºªXv©+Å+óS—¶ê'å‚4ŒBFJF󒉚Ү}Fó†ŽxöÒ&‹¢ T†^¤( OêIº ò|<)ð
How do I turn it into a human-readable string of characters? It was weird output of HTML from a webserver, and I think it is text, because half the web page loaded correctly. Do I need to read it with something like C or Python? That's only a snippet of the string.
If that is in fact supposed to be a human-readable string, you'll need to figure out what character encoding it uses and translate. It's also possible that the string is compressed, encrypted, or represents binary data. It would be helpful to know where you got your string from.
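If you want a programmatic guess at the encoding, Python's chardet library can help; a sketch, assuming the data was saved to mypage.htm (hypothetical file name):
import chardet

with open('mypage.htm', 'rb') as f:
    raw = f.read()
# detect() returns e.g. {'encoding': 'windows-1252', 'confidence': 0.8, ...}
guess = chardet.detect(raw)
if guess['encoding']:
    print(raw.decode(guess['encoding'], 'replace'))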
I'm guessing your web server isn't sending the correct MIME type. I'd suggest taking a look at the HTTP headers using Firefox's Live HTTP Headers plugin. If a web server decides to send you a PDF but doesn't set the MIME type, you'll just see garbage on your screen. Alternatively, save the page to a file and then run these commands from Cygwin or a Unix shell:
file mypage.htm
strings mypage.htm
The first will tell you if the header bytes follow any recognizable pattern. The second will strip out and display all the human-readable text.
