I am using PDFKitten for searching strings within PDF documents, with highlighting of the results. FastPDFKit or any other commercial library is not an option, so I stuck with the closest match to my requirements.
As you can see in the screenshot, I searched for the string "in", which is always correctly highlighted except for the last occurrence. I also have a more complex PDF document where the highlighted box for "in" is off by nearly 40%.
I read through the whole source and checked the issue tracker, but apart from line-height problems I found nothing regarding the width calculation. For the moment I don't see any pattern in where the calculation goes (or could go) wrong, and I hope someone else has run into a problem close to mine.
My current suspicion is that the coordinates and character widths are calculated incorrectly somewhere in the font classes or in RenderingState.m. The project is very complex, and maybe someone of you has had a similar problem with PDFKitten in the past.
I have used the original sample PDF document from PDFKitten for my screenshot.
This might be a bug in PDFKitten that occurs when calculating the width of characters whose character identifier (CID) does not coincide with their Unicode character code.
appendPDFString in StringDetector works with two strings when processing some string data:
// Use CID string for font-related computations.
NSString *cidString = [font stringWithPDFString:string];
// Use Unicode string to compare with user input.
NSString *unicodeString = [[font stringWithPDFString:string] lowercaseString];
stringWithPDFString in Font transforms the sequence of character identifiers of its argument into a unicode string.
Thus, in spite of the variable's name, cidString is not a sequence of character identifiers but a sequence of Unicode characters. Nonetheless, its entries are used as arguments to didScanCharacter, which Scanner implements to advance the position by the character width: it passes the value to widthOfCharacter in Font to determine the character width, and that method (according to the comment "Width of the given character (CID) scaled to fontsize") expects its argument to be a character identifier.
So, if CID and Unicode character code don't coincide, the wrong character width is determined and the position of any following character cannot be trusted. In the case at hand, the /fi ligature has a CID of 12, which is entirely different from its Unicode code 0xFB01.
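To make the mismatch concrete, here is a minimal sketch (in Python, with hypothetical data, not PDFKitten's actual tables) of a width lookup keyed by CID being probed with the Unicode code instead:

# Hypothetical font data: glyph widths keyed by CID, plus the
# ToUnicode mapping. The width 556 (in 1/1000 em) is made up.
widths_by_cid = {12: 556}
to_unicode = {12: 0xFB01}   # CID 12 -> U+FB01, the /fi ligature

def width_of_character(cid, font_size):
    # Analogous to Font's widthOfCharacter: expects a CID, scales to font size.
    return widths_by_cid.get(cid, 0) / 1000.0 * font_size

print(width_of_character(12, 12.0))       # 6.672 -- correct advance
scanned = to_unicode[12]                  # what didScanCharacter receives
print(width_of_character(scanned, 12.0))  # 0.0 -- wrong advance from here on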
I would propose enhancing PDFKitten to also define a didScanCID method in StringDetector, which appendPDFString should call alongside didScanCharacter for each processed character, forwarding its CID. Scanner should then use this new method instead to calculate the width by which to advance its cursor.
This should be triple-checked first, though. Maybe some widthOfCharacter implementations (there are different ones for different font types), in spite of the comment, expect the argument to be a Unicode code after all...
(Sorry if I used the wrong vocabulary here or there, I'm a Java guy... :))
Related
I'm trying to extract text information from a (digital) PDF by identifying the content and location of each character and each word. For words, pdftotext --bbox from xpdf/poppler works quite well, but I cannot find an easy way to extract character locations.
What I've tried
The solution I currently have is to convert the PDF to SVG (via pdf2svg) and then parse the resulting SVG to extract individual character (= glyph) locations. In a third step, the resulting boxes are compared: each character is assigned to a word, and hopefully the numbers match.
Problems
While the above works for most "basic" fonts, there are two (main) situations where this approach fails:
In script fonts (or some extreme italic fonts), bounding boxes are much larger than their content; as a result, words overlap significantly, and it can well happen that a character's box is entirely contained in two words' boxes. In this case, the mapping fails, because once I translate to SVG I have no information about which character is contained in which glyph.
In many fonts multiple characters can be ligated, giving rise to a single glyph. In this case, the count of character boxes does not match the number of characters in the word, and matching each letter to a box is again problematic.
The second point (which is the main one for me) has a partial workaround: identify the common ligatures and (if the counts don't match) split the corresponding bounding boxes into multiple pieces. But that cannot always work, because, for example, "ffi" is sometimes ligated into a single glyph, sometimes into two glyphs "ff" + "i", and sometimes into two glyphs "f" + "fi", depending on the font.
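For what it's worth, here is a minimal sketch of that splitting step, assuming an equal-width division (only an approximation, since the parts of a ligature rarely have equal advances):

def split_bbox(bbox, n):
    # Split an (x0, y0, x1, y1) glyph box into n equal-width pieces.
    x0, y0, x1, y1 = bbox
    step = (x1 - x0) / n
    return [(x0 + i * step, y0, x0 + (i + 1) * step, y1) for i in range(n)]

# An "ffi" ligature box split into three pseudo-character boxes:
print(split_bbox((10.0, 0.0, 28.0, 9.0), 3))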
What I would hope
It is my understanding that PDFs actually contain glyph information, not words. If so, all programs that extract text from PDFs (like pdftotext) must first extract and locate the individual characters and only then group them into words/lines; so I am a bit surprised that I could not find options to output the location of each single character. Converting to SVG essentially gives me that, but in the conversion all information about the content (i.e. the glyph-to-character mapping, or glyph-to-characters, if there was a ligature) is lost, because there is no font anymore. And redoing the effort of matching each glyph to a character by looking at the font again feels like rewriting a PDF parser...
I would therefore be very grateful for any idea of how to solve this. The top answer here suggests that this might be doable with TET, but that is a paid option, and replacing my whole infrastructure to handle just one edge case seems like overkill...
A PDF file doesn't necessarily specify the position of each character explicitly. Typically, it breaks the text into runs of characters (all using the same font; anything up to a line, I think) and then, for each run, specifies the position of the bounding box that should contain the glyphs for those characters. So the exact position of each glyph depends on the metrics (mostly glyph widths) of the font used to render it.
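As a purely illustrative, hand-written fragment (not taken from any real file), a content stream might position a run once and let the font metrics drive every glyph advance:

BT                   % begin a text object
/F1 12 Tf            % select font resource F1 at 12 points
72 712 Td            % set the start position for this run
(efficient) Tj       % show the run; each glyph advances by its font width
ET                   % end the text object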
The Python package pdfminer has a script pdf2txt.py. Try invoking it with -t xml. The docs just say "XML format. Provides the most information." But my notes indicate that it will apply the font metrics and give you a <text> element for every single glyph, with font and bounding-box information.
There are various versions in various places (e.g. PyPI and github). If you need Python 3 support, look for pdfminer.six.
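For instance, here is a minimal sketch using pdfminer.six's high-level layout API (the file name is a placeholder): every LTChar object carries its own bounding box and font name.

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTTextContainer, LTTextLine, LTChar

for page_layout in extract_pages("input.pdf"):   # placeholder file name
    for element in page_layout:
        if isinstance(element, LTTextContainer):
            for line in element:
                if isinstance(line, LTTextLine):
                    for obj in line:
                        if isinstance(obj, LTChar):
                            # bbox is (x0, y0, x1, y1) in PDF points
                            print(obj.get_text(), obj.bbox, obj.fontname)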
In the console, for some unprocessable characters it emits this:
Can someone tell me:
1) What exactly does that symbol mean?
2) What's it called?
3) How can I detect which chars will give me that? E.g., if I want to write code to find all the integers whose string representation is rendered as that symbol, how do I do it:
(60000..70000).select {|i| what_do_i_do_here?(i) }
The symbol means that the font cannot display a character.
The name: check out this question on the sister site: https://english.stackexchange.com/questions/62524/what-do-you-call-the-phenomenon-where-a-rectangle-is-shown-because-a-font-lack . Officially, in Unicode, it is the "replacement glyph", but it is often called "tofu".
It depends on your font and the font/language settings. Note: you are only trying single code points, but some glyphs ("letters" as displayed by a font) can be made up of several combining code points (each a "char" in many programming languages).
I do not know of a general way to determine which characters a specific font supports just from the font itself. Often the font creators document which glyphs are implemented.
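That said, if you have the font file itself (TTF/OTF), one practical approach is to inspect its cmap table, which maps code points to glyphs. Here is a minimal sketch in Python (not Ruby) using the fontTools library; the font path is a placeholder, and note that what you actually see on screen also depends on the renderer's font fallback:

from fontTools.ttLib import TTFont

font = TTFont("SomeFont.ttf")        # placeholder path
cmap = font["cmap"].getBestCmap()    # best available Unicode cmap subtable

def lacks_glyph(codepoint):
    # No cmap entry: this font has no glyph, so a renderer using only
    # this font would show the replacement glyph ("tofu").
    return codepoint not in cmap

print([i for i in range(60000, 70001) if lacks_glyph(i)][:20])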
I am trying to show \u1F318 in my application, but the iPhone app just uses the first four digits and creates the image from those. Can anyone tell me what I am doing wrong when trying to show the image for Unicode \u1F318 on iPhone?
[(OneLabelTableViewCell *)cell textView].text = #"\u1F318";
The output in the application is:
Note: this answer is based on my experience of Java and C#. If it turns out not to be useful, I'll delete it. I figured it was worth the OP's time to try the options presented here...
The \u escape sequence always expects four hex digits - as such, it can only represent characters in the Basic Multilingual Plane.
If this is Objective-C, I believe that supports \U followed by eight hex digits, e.g. \U0001F318. If so, that's the simplest approach:
[(OneLabelTableViewCell *)cell textView].text = #"\U0001F318";
If that doesn't work, it's possible that you need to specify the character as a surrogate pair of UTF-16 code points. In this case, U+1F318 is represented by U+D83C U+DF18, so you'd write:
[(OneLabelTableViewCell *)cell textView].text = #"\uD83C\uDF18";
Of course, this is assuming that it's UTF-16-based...
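For reference, the pair can be computed mechanically from the code point. A worked example of the UTF-16 encoding steps, sketched in Python:

cp = 0x1F318
v = cp - 0x10000            # 0xF318: 20 bits to split across the pair
high = 0xD800 + (v >> 10)   # 0xD83C: high surrogate from the top 10 bits
low = 0xDC00 + (v & 0x3FF)  # 0xDF18: low surrogate from the bottom 10 bits
print(hex(high), hex(low))  # 0xd83c 0xdf18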
Even if that's the correct way of representing the character you want, it's entirely feasible that the font you're using doesn't support it. In that case, I'd expect you to see a single character (a question mark, a box, or something similar to represent an error).
(Side-note: I don't know what # is used for in Objective-C. In C# that would stop the \u from being an escape sequence in the first place, but presumably Objective-C is slightly different, given the code in your question and the output.)
Consider this Arabic word (جبل), made of three letters.
- The first letter is جـ;
- its name is ǧīm;
- its Unicode value is FE9F when it is at the beginning of a word;
- its basic value is 062C; and
- its isolated value is FE9D, but the last two values render the same shape, ج.
Now, whenever I try to get it as a single character (trying many different ways), Delphi returns the basic Unicode value.
Well, that makes sense, but what happens to the transformed character? It is a single character too. It looks like it takes the transformed value only when it is within a string, but where? How do I extract it? When, and by which process, are these values decided?
Again, the MAIN QUESTION:
How can I get the Arabic letter, or its Unicode value, as it appears within a string?
Just for information: unlike English, which has two cases for its letters (capital and small), Arabic has four forms (isolated, beginning, middle, and end), with different rules as well.
I'm not sure I understand the question. If you want to know how to write U+FE9F in Delphi source code, in a modern Unicode version of Delphi, do it simply like so:
Char($FE9F)
If you want to read individual characters from جبل then do it like this:
const
MyWord = 'جبل';
var
c: Char;
....
c := MyWord[1];//this is U+062C
Note that the code above is fine for your particular word because each code point can be encoded with a single UTF-16 WideChar character element. If a code point required multiple elements, then it would be best to transform to UTF-32 for code-point-level processing.
Now, let's look at the string that you included in the question. I downloaded this question using wget, and the file that came down the wire was UTF-8 encoded. I used Notepad++ to convert it to UTF-16LE and then picked out the three UTF-16 characters of your string. They are:
U+062C
U+0628
U+0644
You stated:
The first letter is جـ, name is (ǧīm), its Unicode value is U+FE9F.
But that is simply incorrect. As can be seen from the above, the actual character you posted was U+062C. So the reason your attempts to read the first character yield U+062C is that U+062C really is the first character of your string.
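One can verify those code points independently of Delphi; for example, this Python snippet lists what is actually stored in the string:

import unicodedata

word = "جبل"
for ch in word:   # iterates in logical order, first letter first
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
# U+062C  ARABIC LETTER JEEM
# U+0628  ARABIC LETTER BEH
# U+0644  ARABIC LETTER LAM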
The bottom line is that nothing in your Delphi code is transforming your character. When you do:
S[1] := Char($FE9F);
the compiler performs a simple two-byte copy. No context-aware transformation occurs, and likewise when reading S[1].
Let's look at how these characters are displayed, using this simple code on a VCL forms application that contains a memo control:
Memo1.Clear;
Memo1.Lines.Add(StringOfChar(Char($FE9F), 2));
Memo1.Lines.Add(StringOfChar(Char($062C), 2));
The output looks like this:
As you can see, the rendering layer knows what to do with a U+062C character that appears at the beginning of the string.
Shaping of Arabic characters for presentation in Windows is served by the Uniscribe services (USP10.dll).
UniScribe
You may find the following blog post useful:
Roozbeh's Programming Blog
I don't think you can do it using string/char-related methods. But using PChar, maybe you can access the memory and read the PWord values directly.
EDIT: After discussing with David, I think you will always get the basic/isolated value of the letter. The fact that a beginning or end glyph is used is probably just handled by the display framework of the OS.
Background
I am working a lot at the moment with webfonts, specifically icon fonts. I need to ascertain which character a specific icon corresponds to for testing purposes, so I can simply type the character and/or copy-paste it.
Example
The CSS of most icon fonts is similar, using the :before pseudo-element approach, e.g.
.icon-search:before{content:"\f002"}
Question
I believe this encoding is called a CSS (hex) character escape - is this correct?
Are there any tools that allow me to enter the escaped CSS character value and convert it to a value I can copy and paste?
Is there a tool that can convert this to an HTML decimal entity, e.g. &#38; = a simple ampersand?
Summary
I would love to be able to find out which character it is so I can simply type it on my keyboard. I have spent ages looking it up, but I am not quite sure what this type of encoding and conversion is called, so I can't find what I'm looking for. I'd appreciate some pointers.
SOLVED - the answer is below for completeness.
After some research of my own, I just want to confirm that the encoding used in CSS is indeed hexadecimal: the escaped value is the character's Unicode code point written in hex.
I did find a converter that allows me to enter the hex value and converts it to decimal: http://www.binaryhexconverter.com/hex-to-decimal-converter
If you want to use an HTML entity, then all you need to do is wrap the converted decimal value in the obligatory &# and ; entity start/finish characters and you are good to go.
Example
(hex value \f002) converts to (decimal 61442)
The HTML entity is therefore &#61442;
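The whole conversion is also easy to script. A minimal sketch in Python, using the sample value from above:

css_hex = "f002"                 # the value from the CSS escape \f002
codepoint = int(css_hex, 16)     # 61442
entity = "&#{};".format(codepoint)   # "&#61442;"
character = chr(codepoint)       # the actual character, ready to copy/paste
print(codepoint, entity, character)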