CGPDFScanner and Adobe-Japan1 - ios

I'm using CGPDFScanner to extract text from a PDF.
At the time my TJ operator callback is called, the current font has CIDSystemInfo->Registry value "Adobe" and CIDSystemInfo->Ordering value "Japan1". i.e. character set "Adobe-Japan1".
How do I use this fact to convert all the text I've found with the Tj operator to Unicode?
I'm sure I'm not seeing the wood for the trees here.

You can use Adobe's CMap files to re-map Adobe-Japan1 CIDs to Unicode. Also look at the Supplement value in CIDSystemInfo to pick the correct file.
http://opensource.adobe.com/wiki/display/cmap/Downloads
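For what it's worth, here is a minimal sketch of that lookup in Swift, assuming the Adobe-Japan1-UCS2 CMap, whose begincidrange blocks map 2-byte CID ranges to decimal Unicode values; the function name and the parsing details are mine, and begincidchar blocks are ignored for brevity:

import Foundation

// Sketch: build a CID -> Unicode table from the Adobe-Japan1-UCS2 CMap.
// Assumes begincidrange lines of the form "<lowHex> <highHex> startDecimal".
func loadCIDToUnicode(cmapPath: String) throws -> [Int: Unicode.Scalar] {
    let text = try String(contentsOfFile: cmapPath, encoding: .utf8)
    var table: [Int: Unicode.Scalar] = [:]
    var inRange = false
    let brackets = CharacterSet(charactersIn: "<>")
    for line in text.components(separatedBy: .newlines) {
        if line.hasSuffix("begincidrange") { inRange = true; continue }
        if line.hasPrefix("endcidrange") { inRange = false; continue }
        guard inRange else { continue }
        let parts = line.split(separator: " ")
        guard parts.count == 3,
              let low = Int(parts[0].trimmingCharacters(in: brackets), radix: 16),
              let high = Int(parts[1].trimmingCharacters(in: brackets), radix: 16),
              let start = Int(parts[2]),
              low <= high else { continue }
        for cid in low...high {
            if let scalar = Unicode.Scalar(UInt32(start + cid - low)) {
                table[cid] = scalar
            }
        }
    }
    return table
}

// Usage, once the Tj/TJ callbacks have collected CIDs into an [Int]:
// let table = try loadCIDToUnicode(cmapPath: "Adobe-Japan1-UCS2")
// let text = String(cids.compactMap { table[$0] }.map(Character.init))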

Related

How can I convert a Maple formula to MathType format

As you know, when working with Maple, each equation you type in Maple format is shown after calculation in typeset form, in the default blue font. For example, you type QK:=Matrix(3,3,[Q[1,1],Q[1,2],0,Q[2,1],Q[2,2],0,0,0,Q[3,3]]); and you see it in matrix format with a subscript on each member of the matrix. If you want to send this output to the MS Word software, you get a picture of the equation that cannot be edited or formatted the way a MathType formula can.
The question is: how can I convert Maple's output to MathType format?
I have found a MathML formula editor, but when I copy Maple output into it, it shows the formula only in plain-text format, which cannot be calculated or converted to MathType format.
Thanks, I found it!
First, right-click on the Maple output and choose Special Copy > Copy as MathML, then paste it into the Formulator editor. Second, copy the formula from Formulator and paste it into MathType. The result was excellent for formulas that don't have any Greek characters, but if a Greek character is present in the output formula, Formulator substitutes it with its English spelling, with an & sign at the start and a ; sign at the end. I want to find a solution for this problem; for example, β is changed to &beta;.
It seems you have discovered the answer that we (Design Science) have on our website: http://www.dessci.com/en/support/mathtype/works_with.asp#!target=maple. One issue with Maple's MathML when pasted into MathType is the attributes it adds to the markup. For example, this is the MathML output from Maple for the variable x:
<mi mathcolor='#0000ff' color='#0000ff' fontstyle='2D Output' fontweight='normal'>x</mi>
Those 4 attributes help to replicate the look of the math as displayed in Maple, but they're unnecessary in MathType and most anywhere else you may paste the MathML. It doesn't take long before all these attributes contribute so much overhead that MathType simply can't handle it. (MathType thinks it's taking so long to interpret the MathML that surely it's in an infinite loop -- and gives an error.)
The problem with the Greek letters from Maple is that what should be a single ampersand character (&) comes out in the MathML markup as the entity for ampersand: &amp;. It would be great if MathType could just look at that and think "Hmm. Ok, I'm pretty sure Maple meant just &, so I'll assume that." But it doesn't.
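If you want to pre-process Maple's MathML before pasting it into MathType, something along these lines may help; a sketch in Swift, where the four attribute names and the escaped-ampersand behaviour come from the observations above, and the function itself is mine:

import Foundation

// Sketch: strip the presentational attributes Maple adds to its MathML and
// restore escaped ampersands so entities like &beta; survive the round trip.
func cleanMapleMathML(_ mathml: String) -> String {
    var result = mathml
    for attr in ["mathcolor", "color", "fontstyle", "fontweight"] {
        result = result.replacingOccurrences(
            of: " \(attr)='[^']*'",
            with: "",
            options: .regularExpression)
    }
    // Turn &amp;beta; back into &beta;.
    return result.replacingOccurrences(of: "&amp;", with: "&")
}

// cleanMapleMathML("<mi mathcolor='#0000ff' color='#0000ff' fontstyle='2D Output' fontweight='normal'>x</mi>")
// yields "<mi>x</mi>"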
Let M be the Maple expression. To get the LaTeX form you can use this command:
latex(M)

@font-face: Icon fonts & converting CSS character (hex) values

Background
I am working a lot at the moment with webfonts, specifically icon fonts. I need to ascertain which character a specific icon is mapped to for testing purposes, so I can simply type the character and/or copy-paste it.
Example
The CSS of most icon fonts is similar, using the :before pseudo approach e.g.
.icon-search:before{content:"\f002"}
Question
I believe this encoding is called a CSS character (hex) escape; is this correct?
Are there any tools that allow me to enter the escaped CSS character value and convert it to a value I can copy and paste?
Is there a tool that can convert this to an HTML decimal entity, e.g. &#38; = a simple ampersand?
Summary
I would love to be able to find out which character it is so I can simply type it on my keyboard. I have spent ages looking this up, but I am not quite sure what this type of encoding and conversion is called, so I can't find what I'm looking for. I'd appreciate some pointers.
SOLVED - the answer is below for completeness.
After some research I can confirm that the value used in CSS is indeed the character's Unicode code point in hexadecimal (hex) notation.
I did find a converter that lets me enter the hex value and converts it to decimal: http://www.binaryhexconverter.com/hex-to-decimal-converter
If you want to use an HTML entity, all you need to do is wrap the converted decimal value in the obligatory &# and ; entity start/finish characters and you are good to go.
Example
(Hex value = \f002) converts to (decimal = 61442)
The HTML entity is therefore &#61442;
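If you need this conversion often, it is only a few lines in most languages. A small sketch in Swift (the function name is mine):

import Foundation

// Sketch: turn a CSS hex escape such as "\f002" into the character itself
// and the matching HTML decimal entity.
func decodeCSSEscape(_ escape: String) -> (character: Character, entity: String)? {
    let hex = escape.hasPrefix("\\") ? String(escape.dropFirst()) : escape
    guard let value = UInt32(hex, radix: 16),
          let scalar = Unicode.Scalar(value) else { return nil }
    return (Character(scalar), "&#\(value);")
}

// decodeCSSEscape("\\f002") returns ("\u{F002}", "&#61442;").
// Note that \f002 is in a Private Use Area, so the character only renders
// correctly with the icon font applied.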

PDFKitten is highlighting at the wrong position

I am using PDFKitten for searching strings within PDF documents, with highlighting of the results. FastPDFKit or any other commercial library is not an option, so I stuck with the closest match for my requirements.
As you can see in the screenshot, I searched for the string "in", which is always correctly highlighted except for the last occurrence. In a more complex PDF document, the highlighted box for "in" is nearly 40% off.
I read through the whole source and checked the issue tracker, but apart from line-height problems I found nothing regarding the width calculation. For the moment I don't see any pattern in where the calculation goes (or could go) wrong, and I hope that someone else has had a problem close to mine.
My current expectation is that the coordinates and character widths are calculated incorrectly somewhere in the font classes or RenderingState.m. The project is very complex, and maybe some of you have had a similar problem with PDFKitten in the past.
I have used the original sample PDF document from PDFKitten for my screenshot.
This might be a bug in PDFKitten when calculating the width of characters whose character identifier (CID) does not coincide with their Unicode character code.
appendPDFString in StringDetector works with two strings when processing some string data:
// Use CID string for font-related computations.
NSString *cidString = [font stringWithPDFString:string];
// Use Unicode string to compare with user input.
NSString *unicodeString = [[font stringWithPDFString:string] lowercaseString];
stringWithPDFString in Font transforms the sequence of character identifiers of its argument into a Unicode string.
Thus, in spite of the variable's name, cidString is not a sequence of character identifiers but of Unicode characters. Nonetheless, its entries are used as arguments of didScanCharacter, which in Scanner is implemented to advance the position by the character width: it uses the value as the parameter of widthOfCharacter in Font to determine the character width, and that method (according to the comment "Width of the given character (CID) scaled to fontsize") expects its argument to be a character identifier.
So if CID and Unicode character code don't coincide, the wrong character width is determined, and the position of any following character cannot be trusted. In the case at hand, the /fi ligature has a CID of 12, which is entirely different from its Unicode code 0xfb01.
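A toy illustration of the effect (in Swift for brevity; the width value is made up, while the CID 12 / U+FB01 pair is the /fi ligature case just described):

// Toy illustration: the width table is keyed by CID, so a lookup with a
// Unicode code point misses and the cursor drifts from then on.
let widthsByCID: [Int: Double] = [12: 556.0]  // CID 12 = /fi ligature

let cidWidth = widthsByCID[12]       // Optional(556.0): correct advance
let uniWidth = widthsByCID[0xfb01]   // nil: U+FB01 lookup misses entirely
print(cidWidth as Any, uniWidth as Any)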
I would propose enhancing PDFKitten to also define a didScanCID method in StringDetector, which appendPDFString calls next to didScanCharacter for each processed character, forwarding its CID. Scanner should then use this new method instead to calculate the width by which to advance its cursor.
This should be triple-checked first, though. Maybe some widthOfCharacter implementations (there are different ones for different font types), in spite of the comment, expect the argument to be a Unicode code after all...
(Sorry if I used the wrong vocabulary here or there, I'm a Java guy... :))

Parsing PDF files

I'm finding it difficult to parse a PDF file that was created in a non-English language. I used PDFBox and iText but couldn't find anything in them that could help parse this file. Here's the PDF file that I'm talking about: http://prapatti.com/slokas/telugu/vishnusahasranaamam.pdf The PDF says that it was created using LaTeX and the Tikkana font. I have the Tikkana font installed on my machine, but that didn't help. Please help me with this.
Thanks, K
When you say "parse PDF files", my first thought was that the PDF in question wasn't opening in various PDF viewers & libraries, and was therefore corrupt in some way.
But that's not the case at all. It opens just fine in Acrobat Reader X. And then I see the text on the page.
And when I copy/paste that text from the first page, I get:
Ûûp{¨¶ðQ{p{¨|={pÛû{¨>üb¶úN}l{¨d{p{¨> >Ûpû¶bp{¨}|=/}pT¶=}Nm{Z{Úpd{m}a¾Ú}mp{Ú¶¨>ztNð{øÔ_c}m{ТÁ}=N{Nzt¶ztbm}¥Ázv¬b¢Á
Á ÛûÁøÛûzÏrze¨=ztTzv}lÛzt{¨d¨c}p{Ðu{¨½ÐuÛ½{=Û Á{=Á Á ÁÛûb}ßb{q{d}p{¨ze=Vm{Ðu½Û{=Á
That's from Reader.
Much of the text in this PDF is written using various "Type 3" fonts. These fonts claim to use "WinAnsiEncoding" (Also Known As code page 1252), with a "differences" array. This differences array is wrong:
47 /BB 61 /BP /BQ 81 /C6...
The first number is the code point being replaced; the second is the name of a character that replaces the original value at that code point.
There are no such character names as BB, BP, BQ, C6, and so on. So when you copy and paste that text, you get the garbage above.
I'm sorry, but the only reliable way to extract text from such a PDF is OCR (optical character recognition).
Eh... Long shot idea:
If you can find the specific versions of the specific fonts used to generate this PDF, you just might be able to determine the actual stream contents of known characters converted to Type 3 fonts in this way.
Once you have these known streams, you can compare them to the streams in the PDF and use that to build your own translation table.
You could either fix the existing PDF[s] (by changing the names in the encoding dictionary and Type 3 charproc entries) such that these text extractors will work correctly, or just grab the bytes out of the stream and translate them yourself.
The workflow would go something like this:
For each character in a font used in the file:
Render it to PDF by itself using the same LaTeX/Ghostscript versions.
Open the PDF and find the CharProc for that particular known character.
Store that stream along with the known character used to build it.
For each text byte in the PDF to be interpreted:
Get the glyph name for the given byte based on the existing encoding array.
Get the CharProc stream for that glyph name and compare it to your known CharProcs.
NOTE: This could be rewritten to be much more efficient with some caching, but it gets the idea across (I hope).
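To make the matching step concrete, here is a rough sketch in Swift; it assumes the CharProc streams have already been extracted as raw bytes, and every name in it is hypothetical:

import Foundation

// Sketch: match CharProc streams from the target PDF against streams
// rendered from known characters, building a glyph-name translation table.
func buildTranslationTable(knownCharProcs: [Data: Character],
                           glyphStreams: [String: Data]) -> [String: Character] {
    var table: [String: Character] = [:]
    for (glyphName, stream) in glyphStreams {
        if let match = knownCharProcs[stream] {
            table[glyphName] = match
        }
        // No exact match: the glyph needs manual inspection, or a fuzzier
        // comparison if coordinates differ slightly between renders.
    }
    return table
}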
All that requires a fairly deep understanding of PDF and the parsing methods involved. But it just might work. Might not too...

What kind of char is this and how do I convert it to text?

What kind of char is this, and how do I convert it to text in C#/VB.NET?
I opened a .dat file in Notepad, took a screenshot, and attached it here.
Your screenshot looks like the digits "0003" in a box. This is a common way to display characters for which a glyph isn't available.
U+0003 is the "END OF TEXT" control character. It's unlikely to occur within a text file, but a ".dat" file might be a mixture of text and binary data.
You'll need to use a hex editor to find the exact code that the file contains (assuming the file is ASCII, which seems to be an entirely incorrect assumption). It's safe to say that whatever byte sequence is contained in the file is not a printable character in whatever encoding the editor used to open the file, and that is why it displayed that graphic in place of an actual character.
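A minimal sketch of that inspection step in Swift (the file name is a placeholder); the same few lines are easy to write in C# or VB.NET:

import Foundation

// Sketch: dump the first bytes of the file so you can see exactly which
// values it contains (e.g. 0x03 = END OF TEXT).
guard let data = FileManager.default.contents(atPath: "file.dat") else {
    fatalError("cannot read file.dat")
}
for (offset, byte) in data.prefix(64).enumerated() {
    let printable = (0x20...0x7e).contains(byte)
        ? String(Character(Unicode.Scalar(byte)))
        : "."
    print(String(format: "%04x: %02x  %@", offset, byte, printable))
}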
