I have Unicode text in an Indian language (Telugu) like this:
పురాణాలు
I'm getting the above text from a database in XML format. When I read the XML file and
print the text, it shows up as numeric character references (&#...;) instead of the Telugu text above.
Is there any way to print the text as it is, without the encoded &#... characters?
How are you parsing the XML? A proper XML parser should decode the numeric references.
I was guessing that you were hand-parsing the XML document instead of relying on NSXMLParser; if so, you really should use an XML parser. That was a bad guess on my part, though: it's more likely that the entities are being double-encoded.
To answer your question directly, "Objective C HTML escape/unescape" shows how to decode entities with a quick and dirty method.
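As a rough illustration of the two cases above (written in Python rather than Objective-C, purely as a sketch): a conforming parser decodes numeric references on its own, while double-encoded references survive parsing as literal text and need one more unescape pass. The `<word>` element and the variable names are made up for the example.

```python
import html
import xml.etree.ElementTree as ET

# Numeric character references for the Telugu word from the question.
refs = "".join(f"&#{ord(ch)};" for ch in "పురాణాలు")

# Case 1: references encoded once. The XML parser decodes them itself.
decoded_once = ET.fromstring(f"<word>{refs}</word>").text

# Case 2: double-encoded references ("&" was escaped again as "&amp;").
# The parser only undoes one layer, so the text still contains "&#...;".
double_encoded = refs.replace("&", "&amp;")
still_escaped = ET.fromstring(f"<word>{double_encoded}</word>").text

# A second unescape pass recovers the original characters.
repaired = html.unescape(still_escaped)

print(decoded_once)
print(still_escaped)
print(repaired)
```

If the second case is what you are seeing, the cleaner fix is usually to stop the double encoding where the XML is produced rather than to unescape twice when reading it.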
I have several files which include various strings in different written languages. The files I am working with are in the .inf format which is somewhat similar to .ini files.
I am inputting the text from these files into a parser which considers the [ symbol as the beginning of a 'category'. Therefore, it is important that this character does not accidentally appear in string sequences or parsing will fail because it interprets these as "control characters".
For example, this string contains some Japanese writings:
iANSProtocol_HELP="�C���e��(R) �A�h�o���X�g�E�l�b�g���[�N�E�T�[�r�X Protocol �̓`�[���������щ��z LAN �Ȃǂ̍��x�#�\�Ɏg�����܂��B"
DISKNAME ="�C���e��(R) �A�h�o���X�g�E�l�b�g���[�N�E�T�[�r�X CD-ROM �܂��̓t���b�s�[�f�B�X�N"
In my text editor's (Atom) default UTF-8 encoding this gives me garbage text. That would not be an issue, except that the byte 0x5B is interpreted as [, which causes the parser to fail because it assumes this signals the beginning of a new category.
If I change the encoding to Japanese (CP 932), these characters are interpreted correctly as:
iANSProtocol_HELP="インテル(R) アドバンスト・ネットワーク・サービス Protocol はチーム化および仮想 LAN などの高度機能に使われます。"
DISKNAME ="インテル(R) アドバンスト・ネットワーク・サービス CD-ROM またはフロッピーディスク"
Of course, I cannot decode every file as Japanese, because some may contain Chinese or other languages, which would then be displayed incorrectly.
What is the best course of action for this situation? Should I edit the code of the parser to escape characters inside string literals? Are there any special types of encoding that would allow me to see all special characters and languages?
Thanks
If the source file is in Shift-JIS, then you should use a parser that can support it, or convert the file to UTF-8 before you parse it.
I believe that this character set is ASCII-compatible for single bytes but uses two bytes for certain characters, so the 0x5B you are seeing is probably not a character of its own but the second byte of a two-byte character. (Note: this is conjecture based on how I think Shift-JIS works.)
So yes, you need to modify your parser to understand Shift-JIS, or you need to convert the file to UTF-8 before parsing. I imagine that converting is the easiest.
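A minimal sketch of the conversion route, in Python and assuming the file really is CP932 (you would have to know or detect the encoding per file). The sample text is a shortened, hypothetical version of the DISKNAME line, and `convert_to_utf8` is just an illustrative helper name. It also shows why the stray [ appears: the long-vowel mark "ー" is stored as the two bytes 0x81 0x5B.

```python
def convert_to_utf8(raw: bytes, source_encoding: str = "cp932") -> bytes:
    """Decode bytes in the source encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Hypothetical sample modelled on the DISKNAME line in the question.
text = '[Strings]\nDISKNAME ="ネットワーク・サービス CD-ROM"\n'
sjis_bytes = text.encode("cp932")

# In CP932 the character "ー" is the byte pair 0x81 0x5B, so a parser that
# scans raw bytes finds extra "[" bytes inside the string literal.
long_vowel_bytes = "ー".encode("cp932")
stray_brackets = sjis_bytes.count(b"[") - text.count("[")

# After conversion, every byte of a multi-byte character is >= 0x80,
# so 0x5B can only ever be a real "[".
utf8_bytes = convert_to_utf8(sjis_bytes)
real_brackets = utf8_bytes.count(b"[")

print(long_vowel_bytes, stray_brackets, real_brackets)
```

The same idea applies to the Chinese files, only with a different `source_encoding`; the parser itself can then stay byte-oriented and UTF-8 only.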
I'm having trouble parsing utf8 characters into Text when deriving a Read instance. For example, when I run the following in ghci...
> import Data.Text
> data Message = Message Text deriving (Read, Show)
> read ("Message \"→\"") :: Message
Message "\8594"
Can I do anything to keep the text inside Message UTF-8 encoded? That is, the result should be...
Message "→"
(P.S. I already receive my serialized messages as Text, but currently need to unpack to a String in order to call read. I'd love to avoid this...)
EDIT: Ah sorry, answers rightly point out that it's show not read which converts to "\8594" - is there a way to show and convert back to Text again without the backslash encoding?
To the best of my knowledge, the internal encoding used by Text (UTF-16 in older versions of the text package, UTF-8 since text-2.0) is consistent and not exposed directly. If you want UTF-8, you can encode a Text value to a ByteString, or decode one back, as appropriate. Similarly, it doesn't make sense to talk about an encoding for String, because that's just a list of Char, where each Char is a Unicode code point.
Most likely, it's only the Show instance for Text displaying things differently here.
Also, keep in mind that (by consistent convention in standard libraries) read and show are expected to behave as (de-)serialization functions, with a "serialized" format that, interpreted as a Haskell expression, describes a value equivalent to the one being (de-)serialized. As such, backslash escapes in ASCII text are often preferred for being widely supported and unambiguous. If you want to display a Text value with the actual code points, show isn't what you want.
I'm not entirely clear on what you want to do with the Text--using show directly is exactly what you're trying to avoid. If you want to display text in a terminal window, the terminal is going to dictate the encoding, and you want the functions defined in Data.Text.IO. If you need to convert to a specific encoding for whatever other reason, Data.Text.Encoding will give you an encoded ByteString (emphasis on "byte", not "string"--a ByteString is a sequence of raw bytes, not a string of characters).
If you just want to convert from Text to String and back to Text... what's wrong with the backslash escapes? show is not really intended for pretty-printing output for users to read, despite many people's initial expectations otherwise.
What is the difference between XML Serialization and XML Parsing? When should we use each one?
Parsing is, generally speaking, the processing of an input stream into meaningful data structures; in the XML context, parsing is the process of reading a sequence of characters conforming to the grammar and other constraints of the XML spec into whatever internal representation of XML your program uses.
Serialization is the opposite process: processing the internal data structures of a program (in this context, your internal representation of an XML document) and creating a character sequence (typically written to an output stream) that conforms to the angle-bracket syntax of the spec.
Use a parser to read XML from a character stream into data structures; use a serializer to write data structures out into a character stream.
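To make the two directions concrete, here is a small sketch using Python's standard xml.etree.ElementTree; the `<menu>` document is invented for the example, and the Element tree plays the role of the "internal representation".

```python
import xml.etree.ElementTree as ET

xml_text = '<menu><item price="15">main</item><item price="20">starter</item></menu>'

# Parsing: character sequence -> internal representation (an Element tree).
root = ET.fromstring(xml_text)
names = [item.text for item in root.findall("item")]

# Work on the internal representation, not on the text.
ET.SubElement(root, "item", price="5").text = "side"

# Serialization: internal representation -> character sequence.
serialized = ET.tostring(root, encoding="unicode")

print(names)
print(serialized)
```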
I don't know much about XML, but here's what I know about serialization and parsing.
parsing - reading data (parse-in) from storage, and writing data (parse-out) to storage… "such as a text file"
serializing - (serialize) translating data into a readable format, and (de-serialize) translating that format back into data… "i.e. you want to translate a struct into readable content, stream that content across a network, and translate it back into code."
here's a new one…
marshalling - (marshal and unmarshal) similar to serializing, except marshalling is used to translate data into a different format… "i.e. you want to translate a stream of bytes into a 32-bit structure (one byte to four bytes)"
in easy terms (for beginners)
TL;DR
XML parsing (or XML deserialization) ==> input: valid XML, output: data structures
XML serialization ==> input: data structures, output: valid XML
XML parsing (a.k.a XML de-serialization)
You take a .xml file (example.xml) as input and process it with your programming language of choice, so that your program can do something useful with the data in that file. Your program will transform the information from the file into data structures that your programming language can deal with (e.g. lists, arrays, objects, etc.).
XML serialization
Your program (in any programming language) transforms information represented as data structures (lists, arrays, objects, etc.) into valid XML output, which can be saved to a file or transmitted to another program.
NOTE: Technically the input (when we are talking about parsing) and the output (when we are talking about serialization) do not have to be files. As said in the more professional answer above, they can be any input/output stream, too. And files don't have to have the .xml extension; they can have any file extension that represents a valid XML format (e.g. .svg is also an XML-based format). The key to understanding is that when we do XML parsing we have valid XML on the input side and data structures on the output side, and when we do XML serialization we have data structures on the input side and valid XML on the output side.
To give an example from the Python world: you can use built-in packages (like xml.etree.ElementTree) or third-party libraries (like lxml (recommended) or xmltodict) to do both: parse (deserialize) or create (serialize) XML data.
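A small sketch of both directions with the built-in xml.etree.ElementTree, going between a plain list of dicts and XML. The `serialize`/`parse` helper names and the `<people>`/`<person>` layout are assumptions made up for this example, not a standard mapping.

```python
import xml.etree.ElementTree as ET


def serialize(people):
    """Data structures (list of dicts) -> XML text."""
    root = ET.Element("people")
    for person in people:
        node = ET.SubElement(root, "person")
        for key, value in person.items():
            ET.SubElement(node, key).text = value
    return ET.tostring(root, encoding="unicode")


def parse(xml_text):
    """XML text -> data structures (list of dicts)."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: child.text or "" for child in person}
        for person in root.findall("person")
    ]


people = [{"name": "Ada", "city": "London"}, {"name": "Linus", "city": "Helsinki"}]
xml_text = serialize(people)
round_tripped = parse(xml_text)

print(xml_text)
print(round_tripped)
```

Note that this only round-trips string values; numbers, nested lists and so on would need a convention of their own, which is the kind of thing libraries like xmltodict decide for you.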
I'm using the TBXML framework to parse some XML, but am having problems with the returned string values. The problem is that the returned values contain escaped character entities in place of characters such as £. Is there a convenient way to simply convert all of these into the correct characters so that they can be displayed in a UILabel?
Thanks
Maybe this can help you further:
HTML character decoding in Objective-C / Cocoa Touch
You may be able to use HTML entities to produce your currency character.
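The decoding step itself is language-independent. As a sketch of what it amounts to (shown in Python, not with TBXML or Cocoa), the standard html.unescape resolves both named and numeric entities; the sample string is invented.

```python
import html

# Hypothetical value as it might come back from the parser.
raw_value = "Price: &pound;15 or &#163;20"

# Named (&pound;) and numeric (&#163;) entities both become the real character.
label_text = html.unescape(raw_value)

print(label_text)
```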
I have a text file containing what I am told are unicode characters, for example:
\320\222\320\21015-25'ish per main or \320\222\320\21020-40'ish per starter
Which should read:
£15-25'ish per main or £20-40'ish per starter
However, when viewing this text in Firefox, the output is mangled with various unwanted characters.
So, are these really Unicode characters? And if so, how can I convert them to a form that displays correctly?
You need to:
know the encoding of the text file
read the data without losing information (either by reading it as binary or by reading it as text with the right encoding)
write the data with the right encoding (either by writing it out in binary and specifying the original encoding, or writing it out as text in an encoding which you also specify in the headers)
Try to separate out the problem into "reading" and/or "writing". Do you know the encoding of the file? What do you have to do with the file? When you've written it with backslashes, is that actually what's in the file (i.e. an escaped form) or is it actually just a "normal" text encoding such as UTF-8?
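To show what separating "reading" from "writing" can look like, here is a sketch in Python under two loudly stated assumptions: that the backslash sequences in the file are literal octal byte escapes, and that the text was at some point mis-decoded as Windows-1251 and then re-saved as UTF-8 (octal 320 222 320 210 are the UTF-8 bytes for "ВЈ", which is what the UTF-8 bytes of "£" look like when read as Windows-1251). Neither assumption is confirmed by the question; `unescape_octal` is an illustrative helper.

```python
import re


def unescape_octal(escaped: str) -> bytes:
    """Turn literal backslash-octal escapes (e.g. \\320) into raw bytes."""
    return re.sub(
        rb"\\([0-3][0-7]{2})",
        lambda m: bytes([int(m.group(1), 8)]),
        escaped.encode("ascii"),
    )


sample = r"\320\222\320\21015-25'ish per main"

# Step 1: read without losing information (recover the raw bytes).
raw = unescape_octal(sample)

# Step 2: decode with the encoding the bytes are actually in.
as_stored = raw.decode("utf-8")

# Step 3 (assumption): undo the earlier wrong decode as Windows-1251.
repaired = as_stored.encode("cp1251").decode("utf-8")

# Step 4: write with an explicit, declared encoding.
output_bytes = repaired.encode("utf-8")

print(repr(as_stored))
print(repaired)
```

If step 3 raises an error or produces nonsense on your real data, the second assumption is wrong and you are back to the first question: what encoding is the file really in?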