In Xcode, how do I extract a .png image from an .icns file programmatically?

I'm a beginner in iOS 7 and Xcode. If I understand correctly, an .icns file is a container rather than an image. It contains the icon at different sizes and in different formats (e.g. PNG, JPEG), and perhaps other stuff. Right or wrong?
If so, is there a way of extracting the 512x512 PNG version of the icon from this file? I was told I have to use NSData to get hold of the contents of the .icns file and then take a subrange of it to get the image data. But what I don't know is: where does the block of data for the 512x512 PNG start, and where does it end?
I hope I am formulating my question correctly. I am new at this.
Thank you for your help

I don't have any code ready (though you might want to look at libicns).
Basically, you have to read the file contents sequentially. First you read the header (8 bytes). After the header, you read icon by icon. Each icon starts with an 8-byte header containing the type and the length. You read the type and the length, and if the type doesn't match, you skip the icon and proceed with the next one.
Once you have found the desired icon, you can extract it. It starts right after the length field and is length - 8 bytes long (the length includes the 8-byte header, which is only needed for the icns file format).
Note that the length is stored with the most significant byte first (big-endian), but iOS is little-endian. So you need to convert the numbers.
The icon type can be treated either as a string of four characters (without a trailing zero) or as an unsigned 4-byte number. Some compilers can treat 'ic09' as an unsigned int constant (note the single quotes, as opposed to the double quotes of a string).
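A minimal sketch of that loop, using NSData as the question suggests ('ic09' is the entry type for the 512x512 icon; the function name is mine):

#import <Foundation/Foundation.h>

// Returns the raw image payload of the 'ic09' (512x512) entry, or nil.
static NSData *ExtractIcon512(NSString *path)
{
    NSData *data = [NSData dataWithContentsOfFile:path];
    if (data.length < 8) return nil;

    NSUInteger offset = 8; // skip the 8-byte file header ("icns" + total length)
    while (offset + 8 <= data.length) {
        uint32_t type = 0, length = 0;
        [data getBytes:&type range:NSMakeRange(offset, 4)];
        [data getBytes:&length range:NSMakeRange(offset + 4, 4)];
        type = CFSwapInt32BigToHost(type);     // both fields are big-endian
        length = CFSwapInt32BigToHost(length); // includes the 8-byte icon header

        if (length < 8 || offset + length > data.length) return nil; // malformed
        if (type == 'ic09') {
            // The payload starts after the icon header and is length - 8 bytes.
            return [data subdataWithRange:NSMakeRange(offset + 8, length - 8)];
        }
        offset += length; // skip to the next icon entry
    }
    return nil; // no 512x512 icon in this file
}

One caveat: an 'ic09' payload can be PNG or JPEG 2000 data, so check the magic bytes (a PNG starts with 0x89 followed by "PNG") before treating it as a PNG.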

Related

Parse PDF file and output single character locations

I'm trying to extract text information from a (digital) PDF by identifying content and location of each character and each word. For words, pdftotext --bbox from xpdf / poppler works quite well, but I cannot find an easy way to extract character location.
What I've tried
The solution I currently have is to convert the pdf to svg (via pdf2svg), and then parse the resulting svg to extract single character (= glyph) locations. In a third step, the resulting boxes are compared, each character is assigned to a word and hopefully the numbers match.
Problems
While the above works for most "basic" fonts, there are two (main) situations where this approach fails:
In script fonts (or some extreme italic fonts), bounding boxes are way larger than their content; as a result, words overlap significantly, and it can well happen that a character is entirely contained in two words. In this case, the mapping fails, because once I translate to svg I have no information on what character is contained in which glyph.
In many fonts multiple characters can be ligated, giving rise to a single glyph. In this case, the count of character boxes does not match the number of characters in the word, and matching each letter to a box is again problematic.
The second point (which is the main one for me) has a partial workaround by identifying the common ligatures and (if the counts don't match) splitting the corresponding bounding boxes into multiple pieces; but that cannot always work, because for example "ffi" is sometimes ligated to a single glyph, sometimes in two glyphs "ff" + "i", and sometimes in two glyphs "f" + "fi", depending on the font.
What I would hope
It is my understanding that PDFs actually contain glyph information, not words. If so, all the programs that extract text from PDFs (like pdftotext) must first extract and locate the various characters and then maybe group them into words/lines; so I am a bit surprised that I could not find options to output the location of each single character. Converting to SVG essentially gives me that, but in that conversion all information about the content (i.e. the mapping of glyph to character, or glyph to characters if there was a ligature) is lost, because there is no font anymore. And redoing the effort of matching each glyph to a character by looking at the font again feels like rewriting a PDF parser...
I would therefore be very grateful for any idea of how to solve this. The top answer here suggests that this might be doable with TET, but it's a paid option, and replacing my whole infrastructure to handle just one edge case seems like overkill...
A PDF file doesn't necessarily specify the position of each character explicitly. Typically, it breaks a text into runs of characters (all using the same font, anything up to a line, I think) and then for each run, specifies the position of the bounding box that should contain the glyphs for those characters. So the exact position of each glyph will depend on metrics (mostly glyph-widths) of the font used to render it.
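For illustration, a (simplified) text run in a PDF content stream looks like this; the viewer places the whole string at one origin and derives each glyph's position from the font's width metrics:

BT                % begin a text object
/F1 12 Tf         % select font resource F1 at 12 pt
72 720 Td         % move the text origin to (72, 720)
(Hello) Tj        % show the run; glyph positions follow from the font metrics
ET                % end the text object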
The Python package pdfminer has a script pdf2txt.py. Try invoking it with -t xml. The docs just say "XML format. Provides the most information." But my notes indicate that it will apply the font metrics and give you a <text> element for every single glyph, with font and bounding-box info.
There are various versions in various places (e.g. PyPI and github). If you need Python 3 support, look for pdfminer.six.
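If you have pdfminer installed, the invocation would look something like this (-t selects the output format, -o the output file):

pdf2txt.py -t xml -o out.xml input.pdf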

How can these strings be different?

I am facing a weird problem.
I have extracted data from an Excel file. It should contain an IBAN account number.
Then I tried to analyze the set of account numbers (which the source guarantees to be good) with a Java library.
To keep the scope of the question narrow: I can't explain the following. The two strings below are different:
030​69
03069
The first is a copy & paste from the Excel file, the second is handwritten. Google returns different results for abi [above number] and in fact in the second case I can find that it is the bank code for Intesa Sanpaolo bank (exact page displaying the ABI code, localized, here).
So, to keep the scope narrow: how is that possible? Is it something to do with the encoding?
Try it yourself: press CTRL+F and type "030"; it will match both lines. Now type 6, and it will match only the second line.
The same happens in Notepad++.
There's a U+200B ZERO WIDTH SPACE between the 030 and the 69 in the first string.
Paste the text into https://www.branah.com/unicode-converter, for example, or inspect it in a hex-capable editor.
One way to clean such strings is to whitelist characters: scrub everything that isn't A-Z or 0-9 (see the sketch below).
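A minimal sketch of that whitelist approach, here in Objective-C (the Java equivalent is input.replaceAll("[^A-Z0-9]", ""); both assume the input is already upper-case):

#import <Foundation/Foundation.h>

// Keep only A-Z and 0-9; this also strips invisible characters
// such as U+200B ZERO WIDTH SPACE.
static NSString *ScrubbedString(NSString *input)
{
    NSRegularExpression *re =
        [NSRegularExpression regularExpressionWithPattern:@"[^A-Z0-9]"
                                                  options:0
                                                    error:NULL];
    return [re stringByReplacingMatchesInString:input
                                        options:0
                                          range:NSMakeRange(0, input.length)
                                   withTemplate:@""];
}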

PDFKitten is highlighting on wrong position

I am using PDFKitten to search for strings within PDF documents, with highlighting of the results. FastPDFKit or any other commercial library is not an option, so I stuck with the closest match for my requirements.
As you can see in the screenshot, I searched for the string "in", which is always correctly highlighted except for the last occurrence. I also have a more complex PDF document where the highlighted box for "in" is off by nearly 40%.
I read through the whole source and checked the issue tracker, but apart from line-height problems I found nothing regarding the width calculation. For the moment I don't see any pattern for where the calculation goes or could be wrong, and I hope that someone else has had a problem close to mine.
My current suspicion is that the coordinates and character widths are calculated incorrectly somewhere in the font classes or in RenderingState.m. The project is very complex, and maybe someone of you has had a similar problem with PDFKitten in the past.
I have used the original sample PDF document from PDFKitten for my screenshot.
This might be a bug in PDFKitten when calculating the width of characters whose character identifier (CID) does not coincide with their Unicode code point.
appendPDFString in StringDetector works with two strings when processing some string data:
// Use CID string for font-related computations.
NSString *cidString = [font stringWithPDFString:string];
// Use Unicode string to compare with user input.
NSString *unicodeString = [[font stringWithPDFString:string] lowercaseString];
stringWithPDFString in Font transforms the sequence of character identifiers of its argument into a unicode string.
Thus, in spite of the variable's name, cidString is not a sequence of character identifiers but a sequence of Unicode characters. Nonetheless, its entries are used as arguments to didScanCharacter, which Scanner implements to advance the position by the character width: it uses the value as the parameter of widthOfCharacter in Font to determine the character width, and that method (according to the comment "Width of the given character (CID) scaled to fontsize") expects its argument to be a character identifier.
So if CID and Unicode code point don't coincide, the wrong character width is determined, and the position of any following character cannot be trusted. In the case at hand, the "fi" ligature has a CID of 12, which is quite different from its Unicode code point 0xFB01.
I would propose enhancing PDFKitten to also define a didScanCID method in StringDetector, which appendPDFString should call alongside didScanCharacter for each processed character, forwarding its CID. Scanner should then use this new method instead to calculate the width by which to advance its cursor.
This should be triple-checked first, though. Maybe some widthOfCharacter implementations (there are different ones for different font types), in spite of the comment, expect the argument to be a Unicode code point after all...
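For illustration only, a rough sketch of the proposed change; the selector and property names here are hypothetical, not PDFKitten's actual API:

// In StringDetector (hypothetical): inside appendPDFString, forward the
// raw character identifier alongside the existing unicode callback.
[self.delegate detector:self didScanCID:cid];

// In Scanner (hypothetical): advance the cursor based on the CID, which
// is what widthOfCharacter expects according to its own comment.
- (void)detector:(StringDetector *)detector didScanCID:(unsigned int)cid
{
    CGFloat width = [self.font widthOfCharacter:cid withFontSize:self.fontSize];
    // ... move the text cursor right by width here, instead of doing it
    // in didScanCharacter: ...
}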
(Sorry if I used the wrong vocabulary here or there, I'm a Java guy... :))

Parsing PDF files

I'm finding it difficult to parse a PDF file that was created in a non-English language. I used PDFBox and iText but couldn't find anything in them that could help parse this file. Here's the PDF file that I'm talking about: http://prapatti.com/slokas/telugu/vishnusahasranaamam.pdf The PDF says it was created using LaTeX and the Tikkana font. I have the Tikkana font installed on my machine, but that didn't help. Please help me with this.
Thanks, K
When you say "parse PDF files", my first thought was that the PDF in question wasn't opening in various PDF viewers & libraries, and was therefore corrupt in some way.
But that's not the case at all. It opens just fine in Acrobat Reader X. And then I see the text on the page.
And when I copy/paste that text from the first page, I get:
Ûûp{¨¶ðQ{p{¨|={pÛû{¨>üb¶úN}l{¨d{p{¨> >Ûpû¶bp{¨}|=/}pT¶=}Nm{Z{Úpd{m}a¾Ú}mp{Ú¶¨>ztNð{øÔ_c}m{ТÁ}=N{Nzt¶ztbm}¥Ázv¬b¢Á
Á ÛûÁøÛûzÏrze¨=ztTzv}lÛzt{¨d¨c}p{Ðu{¨½ÐuÛ½{=Û Á{=Á Á ÁÛûb}ßb{q{d}p{¨ze=Vm{Ðu½Û{=Á
That's from Reader.
Much of the text in this PDF is written using various "Type 3" fonts. These fonts claim to use "WinAnsiEncoding" (also known as code page 1252) with a "differences" array. This differences array is wrong:
47 /BB 61 /BP /BQ 81 /C6...
The first number is the code point being replaced; the second is the name of a character that replaces the original value at that code point.
There are no such character names as BB, BP, BQ, C9, and so on. So when you copy and paste that text, you get the garbage above.
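For comparison, a well-formed differences array maps the same code points to glyph names from the standard set, for example:

47 /slash 61 /equal 81 /Q

With names like these, a text extractor can map each byte back to a real character.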
I'm sorry, but the only reliable way to extract text from such a PDF is OCR (optical character recognition).
Eh... Long shot idea:
If you can find the specific versions of the specific fonts used to generate this PDF, you just might be able to determine the actual stream contents of known characters converted to Type 3 fonts in this way.
Once you have these known streams, you can compare them to the streams in the PDF and use that to build your own translation table.
You could either fix the existing PDF[s] (by changing the names in the encoding dictionary and Type 3 charproc entries) such that these text extractors will work correctly, or just grab the bytes out of the stream and translate them yourself.
The workflow would go something like this:
For each character in a font used in the form:
render it to PDF by itself using the same LaTeX/Ghostscript versions.
Open the PDF and find the CharProc for that particular known character.
Store that stream along with the known character used to build it.
For each text byte in the PDF to be interpreted:
Get the glyph name for the given byte based on the existing encoding array
Get the "char proc" stream for that glyph name and compare it to your known char procs.
NOTE: This could be rewritten to be much more efficient with some caching, but it gets the idea across (I hope).
All that requires a fairly deep understanding of PDF and the parsing methods involved. But it just might work. Might not too...
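Still, a sketch of that translation-table idea; CharProcStream is a hypothetical helper standing in for whatever PDF library you use to pull the raw CharProc stream bytes for a glyph name:

// Hypothetical helper: returns the raw CharProc stream for a glyph name
// in the given PDF; implement it with your PDF library of choice.
NSData *CharProcStream(NSString *pdfPath, NSString *glyphName);

// Step 1: map known CharProc streams to the characters that produced
// them, using the single-character reference PDFs rendered earlier.
NSMutableDictionary<NSData *, NSString *> *table = [NSMutableDictionary dictionary];
for (NSString *ch in @[ @"a", @"b", @"c" ]) { // ...extend to the full set
    NSString *refPDF = [NSString stringWithFormat:@"ref-%@.pdf", ch]; // placeholder
    table[CharProcStream(refPDF, ch)] = ch;
}

// Step 2: translate a byte from the target PDF by looking up the stream
// behind its (bogus) glyph name; fall back to "?" when nothing matches.
NSString *decoded = table[CharProcStream(@"target.pdf", @"BB")] ?: @"?";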

What kind of char is this and how do I convert it to text?

What kind of char is this, and how do I convert it to text in C#/VB.NET?
I opened a .dat file in notepad, took a screenshot and attached it here.
Your screenshot looks like the digits "0003" in a box. This is a common way to display characters for which no glyph is available.
U+0003 is the "END OF TEXT" control character. It's unlikely to occur within a text file, but a ".dat" file might be a mixture of text and binary data.
You'll need to use a hex editor to find the exact code that the file contains (and assuming the file is ASCII seems to be an entirely incorrect assumption). It's safe to say that whatever byte sequence is contained in the file is not a printable character in whatever encoding the editor used to open the file, and that is why it used that graphic in place of the actual character.
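If you'd rather check programmatically than with a hex editor, dumping the raw bytes works in any language; for instance (the file name is a placeholder):

#import <Foundation/Foundation.h>

int main(void)
{
    @autoreleasepool {
        // Print the first bytes of the file as hex; a 0x03 byte would
        // confirm the END OF TEXT control character from the screenshot.
        NSData *data = [NSData dataWithContentsOfFile:@"file.dat"];
        const uint8_t *bytes = data.bytes;
        for (NSUInteger i = 0; i < MIN(data.length, (NSUInteger)64); i++) {
            printf("%02X ", bytes[i]);
        }
        printf("\n");
    }
    return 0;
}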
