SAX PARSER IS NOT PARSING AFTER "&" SYMBOL - blackberry

My requirement is to parse XML data from the server side and display it on BlackBerry. I am using a SAX parser to perform this operation. Here is an example to explain the scenario.
<Name>ABC</Name>
<Company>TCS</Company>
<Name>DEF</Name>
<Company>E&Y</Company>
In the above example, I can read all of the elements except "E&Y".

Your XML is malformed; a bare "&" is not allowed. Check your XML escaping.
Proper XML should look like:
<Company>E&amp;Y</Company>
Fix your XML and the parser will start working.
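To make the fix concrete, here is a minimal Java sketch; the Records root element, class name and handler are mine for illustration, and the javax.xml.parsers / org.xml.sax APIs shown also exist in BlackBerry's JSR 172 stack:

import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class EscapingDemo {
    public static void main(String[] args) throws Exception {
        // A bare "&" is illegal in XML; it must be written as "&amp;".
        // The parser decodes it back to "&" before calling characters().
        String xml = "<Records><Name>DEF</Name><Company>E&amp;Y</Company></Records>";
        SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
        parser.parse(new InputSource(new ByteArrayInputStream(xml.getBytes("UTF-8"))),
            new DefaultHandler() {
                public void characters(char[] ch, int start, int length) {
                    // The entity arrives already decoded: "E&Y", not "E&amp;Y".
                    System.out.println(new String(ch, start, length));
                }
            });
    }
}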

Check this thread
Blackberry UTF-8 Problem
One answer says:
Most likely your XML is in UTF-8 while you call response.getBytes(). String.getBytes() returns bytes in the default OS encoding, which is ISO-8859-1 on BlackBerry. So try to get the UTF-8 bytes by calling response.getBytes("UTF-8").
Hope that helps
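As a rough sketch of that advice (the class name and parseResponse helper are hypothetical; response stands for the question's server reply held as a String):

import java.io.ByteArrayInputStream;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class Utf8Parse {
    // Hypothetical helper: parse a server reply whose XML is UTF-8 encoded.
    static void parseResponse(String response, DefaultHandler handler) throws Exception {
        // response.getBytes() with no argument would use the platform default
        // encoding (ISO-8859-1 on BlackBerry) and corrupt multi-byte characters;
        // request the UTF-8 bytes explicitly instead.
        byte[] utf8 = response.getBytes("UTF-8");
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new ByteArrayInputStream(utf8)), handler);
    }
}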

I guess the problem is the encoding.
Search for "encoding='UTF-8' sax parser".

Related

UTF-8 Chars in FTP Greeting

I tried to use Unicode characters in my FTP server's greeting, but the client seems to read each of them as two different characters. Because of this, I need a way to encode them into UTF-8. For now, I have the greeting HTML-encoded because I am displaying it on a webpage, but any other client will display the raw entities. How can I get the greeting to be parsed as UTF-8? And if I can't, is there a way I can parse the greeting correctly?
EDIT: Answered my own question, see below.
I found the answer to the question. It was actually UTF-8 encoded already, and I had to decode it from UTF-8. Here is what I did:
decodeURIComponent(escape(greeting))
Don't forget to replace the line breaks with <br> if you are displaying it on a webpage like I am!
decodeURIComponent(escape(greeting)).replace(/\n/g,'<br>')

Character encoding in PHP Extension

I'm currently writing a PHP extension in C++ with the Zend API. Basically, I write PHP_METHOD{..} wrappers around my native C++ interface methods and use zend_parse_parameters(..) to fetch the corresponding input arguments.
This extension contains methods which can take strings as arguments, such as a filename.
I know from http://php.net/manual/en/language.types.string.php#language.types.string.details that strings have no encoding in PHP, but can I still expect the PHP programmer to use a function like utf8_decode(..) so that the extension can read the input strings correctly?
Or does the PHP Programmer expect that the extension detects the encoding from the php-script and handles strings accordingly?
Any help is highly appreciated! Thanks!
You are correct: strings are just binary blobs in PHP. As the author of an extension, your options are:
Have the user hand your extension UTF-8: By far the best option. The user has to make the decision. Assert that the string is UTF-8 encodable and fail early.
Encode yourself: You cannot know the meaning of the string. As PHP strings are just binary blobs with no encoding information, you do not know what the intended string content is. It might come from a Windows file in an odd encoding, or be the concatenation of strings in completely different encodings. Worse, it might be valid UTF-8 without actually being UTF-8, in which case you interpret it wrongly without the user knowing. Hence solution 1: have the user pass UTF-8.
Alternative: Force the user to pass an input encoding.
Here is an example of alternative 3:
$obj = new MyExtensionClass('UTF-8'); // force encoding
$obj->someMethod($inputStr); // try to convert now
The standard library uses approach 1; see json_encode as an example, which requires its string input to be valid UTF-8 and fails otherwise.

Lua reading Chinese characters

I have the following xml that I would like to read:
chinese xml - https://news.google.com/news/popular?ned=cn&topic=po&output=rss
korean xml - http://www.voanews.com/templates/Articles.rss?sectionPath=/korean/news
Currently, I am trying to use LuaXML to parse XML which contains Chinese characters. However, when I print the result to the console, the Chinese characters are not displayed correctly and show up as garbage.
I would like to ask if there is any way to parse Chinese or Korean characters into a Lua table?
I don't think Lua is the issue here. The raw data the remote site sends is encoded using UTF-8, and Lua does no special interpretation of that—which means it should be preserved perfectly if you just (1) read from the remote site, and (2) save the read data to a file. The data in the file will contain CJK characters encoded in UTF-8, just like the remote site sent back.
If you're getting funny results like you mention, the fault probably lies either with the library you're using to read from the remote site, or perhaps simply with the way your console displays the results when you output to it.
I managed to convert "中美" into Chinese characters.
I needed to do one additional step, converting the numeric character references in the string using this method from this link, http://forum.luahub.com/index.php?topic=3617.msg8595#msg8595, before saving into XML format.
string.gsub(l,"&#([0-9]+);", function(c) return string.char(tonumber(c)) end)
For LuaXML, I have come across the method xml.registerCode(decoded, encoded).
Under that method, it says that
registers a custom code for the conversion between non-standard characters and XML character entities
What do they mean by non-standard characters and how do I use it?

Rails 3 dealing with special characters

I want to provide the user with the ability to fill in an input field with special characters (e.g. ¥ and others).
User input may be saved in an XML file and later fetched and rendered back into the form input.
What is the best practice for saving special symbols to XML (maybe using HTML entities or hexadecimal form)?
Thanks in advance.
I'd say if you save the file in UTF-8 you will have no problems.
If some controller/view has problems with encoding, you have to place this in the first line:
# encoding: utf-8
There's nothing special about them and you don't need to encode them yourself. Let your XML library deal with that; XML has supported Unicode from the start, and what you call "special symbols" are just Unicode characters.

HTML DOM Parse and Character encoding on XMLHTTPRequest at Firefox extension

I am writing a Firefox 4 bootstrapped extension.
The following is my story:
When I'm using @mozilla.org/xmlextras/xmlhttprequest;1 (nsIXMLHttpRequest), the content of the target URL can be successfully loaded through req.responseText.
I parse the responseText into a DOM by using the createElement method and the innerHTML property on a BODY element.
Everything seems to be successful.
However, there is a problem with character encoding (charset).
As I need the extension to detect the charset of the target documents, overriding the MIME type of the request with text/html; charset=blahblah.. does not meet my need.
I've tried @mozilla.org/intl/utf8converterservice;1 (nsIUTF8ConverterService), but it seems that XMLHttpRequest has no ScriptableInputStream, or even any InputStream or readable stream.
I have no idea how to read the target document's content in a suitable, automatically detected charset, whether via the Auto-Detect Character Encoding function in the GUI or via the charset read from the head meta tag of the document.
EDIT: Would it be practical to parse the whole document, including the HTML, HEAD and BODY tags, into a DOM object, but without loading extra resources like js, css and ico files?
EDIT: The method in the MDC article titled "HTML to DOM", which uses @mozilla.org/feed-unescapehtml;1 (nsIScriptableUnescapeHTML), is inappropriate: it parses with lots of errors, the baseURI cannot be set for type text/html, and all HREF attributes of A elements are lost when they contain a relative path.
EDIT#2: It would still be nice if there were any method that could convert the incoming responseText into readable UTF-8 strings. :-)
Any ideas or work to solve the encoding problem are appreciated. :-)
PS. The target documents are universal, so there is no specific (or pre-known) charset, and of course not only UTF-8 as the default.
SUPP:
So far, I have two brief main ideas for solving this problem.
Could anybody help me work out the XPCOM modules' and methods' names?
1. Specify the charset while parsing the content into the DOM.
We first need to know the charset of the document (by extracting the head meta tag, or the header). Then,
find a method that can specify the charset when parsing the body content, and
find a method that can parse both head and body.
2. Convert the incoming responseText into UTF-8, so that parsing into a DOM element with the default charset UTF-8 still works.
One variant does not seem practical or sensible: overriding the MIME type with a charset implements this idea, but we cannot know the charset before initiating the request.
It seems that there are no other answers.
After a day of tests, I've found a way (although it is clumsy) to solve my problem.
xhr.overrideMimeType('text/plain; charset=x-user-defined');, where xhr stands for the XMLHttpRequest handler.
To force Firefox to treat it as plain text, using a user-defined character set. This tells Firefox not to parse it, and to let the bytes pass through unprocessed.
This refers to the MDC document: Using_XMLHttpRequest#Receiving_binary_data
Then use the Scriptable Unicode Converter: @mozilla.org/intl/scriptableunicodeconverter (nsIScriptableUnicodeConverter).
The charset variable can be extracted from the head meta tag, whether by a regexp over req.responseText (still in the unknown charset) or by some other method.
// Convert the raw bytes (read as x-user-defined) into a proper Unicode string.
var unicodeConverter = Components.classes["@mozilla.org/intl/scriptableunicodeconverter"].createInstance(Components.interfaces.nsIScriptableUnicodeConverter);
unicodeConverter.charset = charset; // the charset extracted from the document
str = unicodeConverter.ConvertToUnicode(str); // str is the raw responseText
A proper Unicode string, just as if the source were UTF-8, is finally produced. :-)
Then simply parse it into the body element, and my need is met.
Other brilliant ideas are still welcome. Feel free to object to my answer with sufficient reason. :-)
