What kind of char is this and how do I convert it to a text? - character-encoding

What kind of char is this and how do I convert it to a text in c#/vb.net?
I opened a .dat file in notepad, took a screenshot and attached it here.

Your screenshot looks like the digits "0003" in a box. This is a common way to display characters for which a glyph isn't available.
U+0003 is the "END OF TEXT" control character. It's unlikely to occur within a text file, but a ".dat" file might be a mixture of text and binary data.

You'll need to use a hex editor to find the exact ASCII code (assuming the file is ASCII, which seems to be an entirely incorrect assumption) that the file contains. It's safe to say that whatever byte sequence is contained in the file is not a printable character in whatever encoding the editor used to open the file, and that is why it used that graphic in place of the actual character.
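If no hex editor is at hand, a few lines of script can do the same job. The following is a minimal sketch in Python (not the asker's C#/VB.NET); the helper names `describe_bytes` and `describe_file` are made up for illustration. It lists each byte with its offset and hex value, and shows a character only when the byte is printable ASCII.

```python
def describe_bytes(data: bytes) -> list:
    """Return (offset, hex value, printable ASCII char or None) for each byte."""
    rows = []
    for offset, value in enumerate(data):
        printable = 0x20 <= value <= 0x7E  # printable ASCII range
        rows.append((offset, f"0x{value:02X}", chr(value) if printable else None))
    return rows


def describe_file(path: str) -> list:
    """Read a file as raw bytes (no text decoding) and describe its contents."""
    with open(path, "rb") as handle:
        return describe_bytes(handle.read())


if __name__ == "__main__":
    # 0x03 is the control character discussed above; it has no printable form.
    for row in describe_bytes(b"A\x03B"):
        print(row)
```

Opening the file in binary mode is the important part: no encoding is guessed, so nothing is replaced or dropped before you get to look at it.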


Contents of executable files cannot be copied?

So, text files can be copied and pasted to another location by copying the contents of the original file into a blank text file. This can be done with a text editor: highlight the contents of the text file, copy, create a new blank text file, and paste into it.
But why can't image, audio, video, executable files, etc., be copied and pasted like this? For example, I open an executable file with a text editor, copy all of its contents, create a new blank text file, change the extension to .exe, and paste into it (through a text editor). But the file cannot be run. Why?
Also, I would like to be able to edit these types of files like I do with text files. Is there a way?
Because executable and media files are "binary" files. Text files are binary as well, but different. All files are created binary, but some are created more binary than others.
You're opening a binary file in a text editor. This immediately changes the semantics of the bytes. The main problem is bytes containing a value that happens to correspond to those of newline characters if it were a text file (0x0A and 0x0D), which will be rendered as a platform-dependent newline (\r\n on Windows, for example). When you copy that, you've changed either 0x0A or 0x0D to 0x0D 0x0A.
Then there are control characters and other non-printable characters. Not all bytes between 0x00 and 0xFF can be represented as a character. They'll either be omitted or replaced with a displayable character.
So when you copy a text containing those, they'll be omitted or otherwise mangled.
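The damage is easy to reproduce. Here is a rough Python sketch that imitates what a text editor might do, under two assumptions that will not match every editor: the bytes are decoded as UTF-8 with undecodable bytes replaced, and lone line feeds are written back as \r\n. The function name `simulate_text_editor_round_trip` is hypothetical.

```python
def simulate_text_editor_round_trip(data: bytes) -> bytes:
    """Imitate open-in-editor, copy, paste, save for arbitrary bytes.

    Assumption: the editor decodes as UTF-8, substitutes a replacement
    character for anything it cannot decode, and saves newlines as CR LF.
    """
    text = data.decode("utf-8", errors="replace")  # invalid bytes become U+FFFD
    text = text.replace("\n", "\r\n")              # platform newline on save
    return text.encode("utf-8")


if __name__ == "__main__":
    original = bytes([0x4D, 0x5A, 0x0A, 0x00, 0x90, 0xFF])  # made-up binary data
    mangled = simulate_text_editor_round_trip(original)
    print(original.hex(" "))
    print(mangled.hex(" "))
    print("identical:", original == mangled)
```

Both the inserted 0x0D and the replaced bytes change the file's length and content, which is enough to break an executable or a media file.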
In conclusion: you cannot reliably use text to display all possible byte values, unless you choose to encode the bytes' values, as is done using for example Base64 encoding.
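For comparison, a short Python sketch of the Base64 route, using the standard `base64` module. The sample bytes are invented; the point is only that the encoded form is plain ASCII text and decodes back to exactly the original bytes.

```python
import base64


def to_text(data: bytes) -> str:
    """Encode arbitrary bytes as Base64 text that survives copy and paste."""
    return base64.b64encode(data).decode("ascii")


def from_text(text: str) -> bytes:
    """Recover the original bytes from Base64 text."""
    return base64.b64decode(text)


if __name__ == "__main__":
    original = bytes([0x4D, 0x5A, 0x0A, 0x00, 0x90, 0xFF])  # made-up binary data
    as_text = to_text(original)
    print(as_text)
    print("round trip ok:", from_text(as_text) == original)
```

The price is size: every 3 bytes of input become 4 characters of text.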
If you want to edit a binary file, use an editor that is aware of those bytes: a "hex editor". Do note that changing random byte values in a binary file does not guarantee the sanity of that file: there may be checksums built into the format, and your edit will invalidate that checksum.

How to load TextAsset from txt file containing extended ASCII characters in Unity?

In my case specifically, I have bullet-points (• or #149) in my text file.
If I copy paste "•" into my Unity text field in the editor, it shows up, so I am pretty sure the bullet-point is lost in the reading process. (I checked in debug mode, and indeed the bullet-point is lost at reading).
This is how I read in my text file as a TextAsset:
TextAsset content = Resources.Load(SlideManager.slideLanguage+"\\"+fileName+" ("+SlideManager.slideNumber+")") as TextAsset;
It turns out that the way I read the file is completely fine. It reads the file correctly, but the file is saved in an ASCII-based legacy encoding, so the resource loader cannot interpret non-ASCII characters and drops them.
The bullet point is not standard ASCII but an "extended ASCII" character (code 149, or 0x95, is the bullet in Windows code page 1252), so you have to save your text files in an encoding the loader understands.
For example, set encoding to UTF-8, and then it will work. I used notepad++ to set encoding, but I am sure there are many other ways you can do it.
To set the encoding in Notepad++:
Open the Encoding menu (fifth from the left at the top by default) and select Convert to UTF-8.
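If there are many files, the conversion can also be scripted. This is a minimal sketch in Python rather than C#, and it assumes the files were saved in Windows code page 1252, which is a guess based on the bullet being code 149. The function names are hypothetical.

```python
def reencode_to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8.

    Assumption: source_encoding really is the encoding the file was saved in.
    Decoding raises UnicodeDecodeError for bytes undefined in that encoding.
    """
    return raw.decode(source_encoding).encode("utf-8")


def convert_file(src_path: str, dst_path: str, source_encoding: str = "cp1252") -> None:
    """Write a UTF-8 copy of src_path to dst_path."""
    with open(src_path, "rb") as src:
        raw = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(reencode_to_utf8(raw, source_encoding))


if __name__ == "__main__":
    # Byte 0x95 (decimal 149) is the bullet in code page 1252.
    print(reencode_to_utf8(b"\x95 first item"))
```

Writing to a separate destination file keeps the original intact in case the assumed source encoding turns out to be wrong.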

Read special character bytes from PDF to unichar or NSString

First off this solution doesn't work for ligatures:
Convert or Print CGPDFStringRef string
I'm reading text from a PDF and trying to convert it to an NSString. I can get a byte array of text using Apple's CGPDFScanner in the form of a CGPDFString. The "fi" ligature character is giving me trouble. When I look at my byte array in the debugger I see a '\f'.
So for simplicity's sake, let's say that I have this char:
unsigned char myLigatureFromPDF = '\f';
Ultimately I'd like to convert it to this (the unicode value for the "fi" ligature):
unichar whatIWant = 0xFB01;
This is my failed attempt (I copied this from PDFKitten btw):
const char str[] = {myLigatureFromPDF, '\0'};
NSString* stringEncodedLigature = [NSString stringWithCString:str encoding:NSUTF8StringEncoding];
unichar encodedLigature = [stringEncodedLigature characterAtIndex:0];
If anyone can tell me how to do this, that would be great.
Also, as a side note, how does the debugger interpret the unencoded byte array? In other words, when I hover over the array, how does it know to show a '\f'?
Every PDF parser is limited in its capabilities by one single important point of the PDF specifications: characters in literal strings are encoded as bytes or words, but the encoding does not need to be included in the file.
For example, if a subset of a font is included where the code "1" corresponds to the image (character glyph) of an "h" and the code "2" maps to a glyph "a", the string (\1\2\1\2) will show "haha", as expected. But if the PDF contains no further information on how the glyphs in that font correspond to Unicode, there is no way for a string decoder to find out the correct character codes for "glyph #1" and "glyph #2".
It seems your test PDF does contain that information -- else, how could it infer the correct characters for "regular" characters? -- but in this case, the "regular" characters were simply not remapped to other binary codes, for convenience. Also, again for convenience, the glyph for the single character "fi" was remapped to "0x0C" in the original font (or in the subset that got included into your file). But, again, if the file does not contain a translation table between character codes and Unicode values, there is no way to retrieve the correct code.
The above is true for all PDFs and strings. If the font definition in the PDF contains an encoding, your string extraction method should use it; if the PDF contains a /ToUnicode table for the font, again, your method should use it. If it contains neither, you get the literal string contents (and, presumably, you are not informed which method was used and how reliable it is).
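To make the idea of a translation table concrete, here is a small Python sketch (not Objective-C, and not how CGPDFScanner works internally). The table `TO_UNICODE` is hypothetical: it stands in for whatever mapping the font's encoding or /ToUnicode data would provide, and it contains only the one entry from the question. Codes without an entry are passed through as if they were Latin-1, which is an assumption that only holds when the regular characters were not remapped.

```python
# Hypothetical mapping from character codes in the PDF string to Unicode text.
# In a real extractor this would be built from the font's encoding or its
# /ToUnicode data, not hard-coded.
TO_UNICODE = {0x0C: "\uFB01"}  # code 0x0C ('\f') -> LATIN SMALL LIGATURE FI


def decode_pdf_string(codes: bytes, to_unicode: dict) -> str:
    """Translate raw character codes to Unicode using a per-font table."""
    out = []
    for code in codes:
        if code in to_unicode:
            out.append(to_unicode[code])
        else:
            out.append(chr(code))  # assumption: unmapped codes are Latin-1
    return "".join(out)


if __name__ == "__main__":
    print(decode_pdf_string(b"\x0cnd", TO_UNICODE))
```

A table entry may also map one code to several characters, for example the two letters "f" and "i" instead of the single ligature code point, if that suits the text search or display better.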
As a final footnote: in TeX and LaTeX fonts, ligatures are mapped to lower ASCII codes (as well as a smattering of other non-ASCII codes, such as the curly quotes). It seems you are reading a PDF that was created through TeX here -- but that can only be inferred from this particular encoding. Also, even if you know in advance that the PDF was generated through TeX, it's not guaranteed that it does use this particular encoding, as the decision to translate or not translate is at the discretion of the PDF generator, not TeX itself.

Parsing PDF files

I'm finding it difficult to parse a pdf file that's created in a non-English language. I used pdfbox and itext but couldn't find anything in there that could help parse this file. Here's the pdf file that I'm talking about: http://prapatti.com/slokas/telugu/vishnusahasranaamam.pdf The pdf says that it was created using LaTeX and the Tikkana font. I have the Tikkana font installed on my machine, but that didn't help. Please help me with this.
Thanks, K
When you say "parse PDF files", my first thought was that the PDF in question wasn't opening in various PDF viewers & libraries, and was therefore corrupt in some way.
But that's not the case at all. It opens just fine in Acrobat Reader X. And then I see the text on the page.
And when I copy/paste that text from the first page, I get:
Ûûp{¨¶ðQ{p{¨|={pÛû{¨>üb¶úN}l{¨d{p{¨> >Ûpû¶bp{¨}|=/}pT¶=}Nm{Z{Úpd{m}a¾Ú}mp{Ú¶¨>ztNð{øÔ_c}m{ТÁ}=N{Nzt¶ztbm}¥Ázv¬b¢Á
Á ÛûÁøÛûzÏrze¨=ztTzv}lÛzt{¨d¨c}p{Ðu{¨½ÐuÛ½{=Û Á{=Á Á ÁÛûb}ßb{q{d}p{¨ze=Vm{Ðu½Û{=Á
That's from Reader.
Much of the text in this PDF is written using various "Type 3" fonts. These fonts claim to use "WinAnsiEncoding" (Also Known As code page 1252), with a "differences" array. This differences array is wrong:
47 /BB 61 /BP /BQ 81 /C6...
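Each number is the character code being replaced, and the name after it is the character that replaces the original value at that code. Further names that follow without a new number apply to the next codes in sequence, so "61 /BP /BQ" assigns BP to code 61 and BQ to code 62.
There are no such character names as BB, BP, BQ, C9... and so on. So when you copy-paste that text, you get the above garbage.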
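For anyone who wants to inspect such an array, here is a minimal Python sketch that expands a differences array into a code-to-name table. It works on a simplified, whitespace-separated token string like the excerpt above rather than on a real PDF object, and `parse_differences` is a made-up helper name. A number starts a new run of codes; each name that follows takes the next consecutive code.

```python
def parse_differences(array_text: str) -> dict:
    """Expand a whitespace-separated differences array into {code: glyph name}."""
    mapping = {}
    code = None
    for token in array_text.split():
        if token.startswith("/"):
            if code is None:
                raise ValueError("glyph name appears before any starting code")
            mapping[code] = token[1:]  # strip the leading slash of the name
            code += 1                  # the next name applies to the next code
        else:
            code = int(token)          # a number starts a new run of codes
    return mapping


if __name__ == "__main__":
    print(parse_differences("47 /BB 61 /BP /BQ 81 /C6"))
```

With the table in hand, it is easy to see which byte values in the text map to names that no standard glyph list knows about.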
I'm sorry, but the only reliable way to extract text from such a PDF is OCR (optical character recognition).
Eh... Long shot idea:
If you can find the specific versions of the specific fonts used to generate this PDF, you just might be able to determine the actual stream contents of known characters converted to Type 3 fonts in this way.
Once you have these known streams, you can compare them to the streams in the PDF and use that to build your own translation table.
You could either fix the existing PDF[s] (by changing the names in the encoding dictionary and Type 3 charproc entries) such that these text extractors will work correctly, or just grab the bytes out of the stream and translate them yourself.
The workflow would go something like this:
For each character in a font used in the form:
  Render it to PDF by itself using the same LaTeX/GhostScript versions.
  Open the PDF and find the CharProc for that particular known character.
  Store that stream along with the known character used to build it.
For each text byte in the PDF to be interpreted:
  Get the glyph name for the given byte based on the existing encoding array.
  Get the "char proc" stream for that glyph name and compare it to your known char procs.
NOTE: This could be rewritten to be much more efficient with some caching, but it gets the idea across (I hope).
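The lookup half of that workflow might look roughly like the Python sketch below. Everything here is hypothetical: it assumes you have already extracted the char proc streams as raw bytes (with pdfbox, itext or anything else), that re-rendering a known character produces a byte-identical stream, and that `differences` and `charprocs` are plain dictionaries you built from the PDF. Hashing the streams gives the caching mentioned in the note.

```python
import hashlib


def stream_key(charproc_stream: bytes) -> str:
    """Fingerprint a char proc stream so streams can be compared by lookup."""
    return hashlib.sha256(charproc_stream).hexdigest()


def build_known_table(known: dict) -> dict:
    """known maps a character to the char proc stream rendered for it."""
    return {stream_key(stream): char for char, stream in known.items()}


def translate(text_bytes: bytes, differences: dict, charprocs: dict, known_table: dict) -> str:
    """Map each text byte to a character via glyph name and char proc stream."""
    out = []
    for value in text_bytes:
        name = differences.get(value)   # glyph name from the encoding array
        stream = charprocs.get(name)    # char proc stream for that glyph name
        char = known_table.get(stream_key(stream)) if stream is not None else None
        out.append(char if char is not None else "\ufffd")  # unknown glyph
    return "".join(out)


if __name__ == "__main__":
    known = {"a": b"stream-a", "b": b"stream-b"}           # placeholder streams
    table = build_known_table(known)
    differences = {47: "BB", 61: "BP"}
    charprocs = {"BB": b"stream-b", "BP": b"stream-a"}
    print(translate(bytes([47, 61]), differences, charprocs, table))
```

If the re-rendered streams differ even slightly from the ones in the PDF (different precision, different operators), exact matching will fail and some fuzzier comparison would be needed.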
All that requires a fairly deep understanding of PDF and the parsing methods involved. But it just might work. Might not too...

How to read unicode characters accurately

I have a text file containing what I am told are unicode characters, for example:
\320\222\320\21015-25'ish per main or \320\222\320\21020-40'ish per starter
Which should read:
£15-25'ish per main or £20-40'ish per starter
However, when viewing this text in Firefox, the output is mangled with various unwanted characters.
So, are these really unicode characters? And if so, how can I convert them to a form which is displayable correctly?
You need to:
know the encoding of the text file
read the data without losing information (either by reading it as binary or by reading it as text with the right encoding)
write the data with the right encoding (either by writing it out in binary and specifying the original encoding, or writing it out as text in an encoding which you also specify in the headers)
Try to separate out the problem into "reading" and/or "writing". Do you know the encoding of the file? What do you have to do with the file? When you've written it with backslashes, is that actually what's in the file (i.e. an escaped form) or is it actually just a "normal" text encoding such as UTF-8?
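As one way to investigate, here is a Python sketch that treats the backslash sequences as octal escapes for raw bytes and then tries a decoding. Two things in it are assumptions, not facts about your file: that the backslashes are literally in the file, and that the text was double-encoded (UTF-8 bytes misread as Windows-1251 and then saved as UTF-8 again). The second guess is only there because it happens to turn the sample bytes back into a pound sign.

```python
import re


def unescape_octal(raw: bytes) -> bytes:
    """Replace literal \\NNN octal escapes with the bytes they stand for."""
    return re.sub(rb"\\([0-7]{3})", lambda m: bytes([int(m.group(1), 8)]), raw)


def undo_double_encoding(data: bytes, wrong_encoding: str = "cp1251") -> str:
    """Assumption: UTF-8 text was misread as wrong_encoding, then saved as UTF-8."""
    return data.decode("utf-8").encode(wrong_encoding).decode("utf-8")


if __name__ == "__main__":
    sample = rb"\320\222\320\21015-25'ish per main"
    data = unescape_octal(sample)
    print(data.hex(" "))                 # the actual byte values
    print(data.decode("utf-8"))          # what a plain UTF-8 reading gives
    print(undo_double_encoding(data))    # the double-encoding guess
```

Looking at the byte values first, before picking any decoding, is the "reading" half of the problem; how you then serve the text to Firefox is the "writing" half.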
