Parsing a PDF: Tf font operator missing

I'm parsing a PDF file and it seems that a Tf operator is missing. In PDF readers like Acrobat Reader or Preview I can see that the font changes, but during the parse I don't get a Tf operator. I still have the ET operator that marks the end of the previous text block and the BT operator that marks the beginning of the new one, and I also have the text-showing operators (Tj & co.).
Just to be clear, I do have Tf operators elsewhere; it's just that in this one place where it should be, it isn't.
The PDF reference states:
There is no initial value for either font or size; they must be
specified explicitly by using Tf before any text is shown.
I don't understand how, if there is no Tf operator, those readers can render the text correctly.
Does anyone know where the problem could come from?

AFAIK the text state is part of the graphics state, so if you have a Q operator somewhere in there, that would explain the font changing (it would restore the state saved by the previous q operator).
The graphics state operator gs could also cause the font to change.
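For example, a content stream fragment along these lines would behave exactly as described in the q/Q case (a sketch only; /F1 and /F2 are just placeholder font resource names, not taken from your file):
BT /F1 12 Tf (first block) Tj ET
q
BT /F2 9 Tf (second block) Tj ET
Q
BT (third block) Tj ET
Here the Q restores the graphics state saved by q, including the /F1 12 text state, so the third text block is rendered with /F1 again even though no Tf operator appears anywhere near its BT ... ET pair. That matches what you are seeing: the font changes in the viewer, but the operator that set it sits farther back in the stream (or is brought back by Q) rather than inside the block itself.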

Related

Eggplant: How to read text with special characters like ' _ etc

I am trying to read text in a given rectangle using the readText() function.
The function works correctly except when it has to read text containing special characters like ', _, &, etc.
I tried using validCharacters with the readText() function, but it didn't help.
Code -
put ReadText((287,125,810,164),validCharacters:"_-'.ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567890") into Login
I also tried working with character collections, but that doesn't seem right either, because the text I'm trying to pick up is dynamic: a combination of numbers, letters, and a special character. So one cannot create a library of character collections for every letter (a-z, A-Z), number (0-9), and special character.
Examples of the text I'm trying to read:
Login_Userid1_1, Login'Userid1_1
So how do I read such text correctly?
Debugging OCR is a bit of an imprecise science. EggPlant has a lot of OCR parameters to tweak. When designing test cases it's best to try to use other mechanisms to gather information whenever possible; ReadText() should be considered a last resort when more reliable methods are unavailable. When I've used it, I've often needed a lot of trial and error to find the right set of settings and the right SearchRectangle to get consistent results. Without seeing exactly what images you are trying to read text from, it's difficult to impossible to troubleshoot where the issue might be.
One thing that does stand out to me is that you're trying to read strings that may contain underscores. ReadText() has an optional property IgnoreUnderscores which treats underscores as spaces. By default this property is set to ON. It defaults to ON because some OCR engines have problems identifying underscore characters consistently.
If you want to have ReadText() handle underscores you'll want to explicitly set this property to OFF.
ReadText(rect, validCharacters:chars, ignoreUnderscores:OFF)
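For the example in the question, that might look something like this (the rectangle and validCharacters list are simply copied from the original call; the OCR tuning caveats above still apply):
put ReadText((287,125,810,164), validCharacters:"_-'.ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz01234567890", ignoreUnderscores:OFF) into Login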

RTF #PCDATA vs Document Text

I'm trying to understand the RTF 1.9.1 specification document, but #PCDATA (text without control words) is confusing me. Below is some sample code to show what I don't understand. Note that the text below is not formatted as it is in the actual file; I reformatted it to make it easier to read.
{
\fonttbl
{
\f0
\fbidi
\froman
\fcharset0
\fprq2
{
\*
\panose
02020603050405020304
}
Times New Roman;
}
}
The specification says:
If the character is anything other than an opening brace ({), closing brace (}), backslash (\), or a CRLF (carriage return/line feed), the reader assumes that the character is plain text and writes the character to the current destination using the current formatting properties.
If I were to follow the specification above, I would end up writing Times New Roman to the document. How is a parser supposed to know whether it has encountered #PCDATA or document text?
The answer is on page 9 of the RTF 1.9.1 specification.
Certain control words, referred to as destinations, mark the beginning of a collection of related text that could appear at another position, or destination, within the document. Destinations may also include text that is used but does not appear within the document at all.
In the example I gave in the question, \fonttbl is a destination control word, meaning the text does not appear in the document. Page 11 of the specification gives a list of example control words that change the destination:
Examples of control words that change destination are \footnote, \header, \footer, \pict, \info, \fonttbl, \stylesheet, and \colortbl.
There are many more but those are the main ones.
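To make that concrete, in a minimal (made-up) document such as the following, only "Hello, world." is document text:
{\rtf1{\fonttbl{\f0\froman Times New Roman;}}\f0 Hello, world.}
The #PCDATA rule quoted in the question applies in both places; what differs is the current destination. While the reader is inside the \fonttbl group, "Times New Roman;" goes to the font table destination (it names font \f0) and never reaches the document. Once that group closes, the destination reverts to the document, so "Hello, world." is the text actually written to the output.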

How do I put a literal vertical bar "|" in a control's Hint property?

I have an application that allows users to enter a regular expression (that they craft) to parse a repository of documents. The results of the search are displayed in a TTreeView control. I want to set the TreeView's Hint property (not each TreeNode's) to the regular expression that was used. The problem is that the regular expression can contain a pipe (|) character (regex OR), which Delphi interprets as the separator between the short hint and the long hint. I tried replacing each occurrence of | with ||, hoping it would have the same effect as using && rather than & (as in menu items), to no avail.
Is there any way to embed a | within a hint without it being interpreted as the separator?
Not exact, but perhaps near enough:
// Replace each '|' with the control character #5 so the hint text is no longer split at the pipe.
Component.Hint := StringReplace(TheHintText, '|', #5, [rfReplaceAll]);

PDFKitten is highlighting on wrong position

I am using PDFKitten to search for strings within PDF documents and highlight the results. FastPDFKit or any other commercial library is not an option, so I stuck with the closest match for my requirements.
As you can see in the screenshot, I searched for the string "in", which is correctly highlighted everywhere except the last occurrence. I also have a more complex PDF document where the highlighted box for "in" is off by nearly 40%.
I read through the whole source and checked the issue tracker, but apart from line-height problems I found nothing regarding the width calculation. For the moment I don't see any pattern in where the calculation goes wrong, and I hope that someone else has run into a problem close to mine.
My current suspicion is that the coordinates and character widths are calculated incorrectly somewhere in the font classes or in RenderingState.m. The project is quite complex, and maybe some of you have had a similar problem with PDFKitten in the past.
I have used the original sample PDF document from PDFKitten for my screenshot.
This might be a bug in PDFKitten when calculating the width of characters whose character identifier does not coincide with its unicode character code.
appendPDFString in StringDetector works with two strings when processing some string data:
// Use CID string for font-related computations.
NSString *cidString = [font stringWithPDFString:string];
// Use Unicode string to compare with user input.
NSString *unicodeString = [[font stringWithPDFString:string] lowercaseString];
stringWithPDFString in Font transforms the sequence of character identifiers of its argument into a unicode string.
Thus, in spite of the name of the variable, cidString is not a sequence of character identifiers but of Unicode characters. Nonetheless, its entries are used as the argument of didScanCharacter, which Scanner implements to advance the position by the character width: it uses the value as the parameter of widthOfCharacter in Font to determine the character width, and that method (according to the comment "Width of the given character (CID) scaled to fontsize") expects its argument to be a character identifier.
So, if CID and Unicode character code don't coincide, the wrong character width is determined, and the position of any following character cannot be trusted. In the case at hand, the /fi ligature has a CID of 12, which is far from its Unicode code 0xfb01.
I would propose enhancing PDFKitten to also define a didScanCID method in StringDetector, which appendPDFString should call alongside didScanCharacter for each processed character, forwarding its CID. Scanner should then use this new method instead to calculate the width by which to forward its cursor.
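A rough sketch of what that could look like, purely for illustration (the delegate method, property, and helper names below are assumptions, not existing PDFKitten API, apart from the methods already quoted above):
// In StringDetector's appendPDFString, next to the existing didScanCharacter call,
// also report the raw character identifier (proposed, not yet existing, delegate method):
[self.delegate detector:self didScanCID:cid];

// In Scanner, advance the cursor from the CID rather than from the Unicode character,
// since widthOfCharacter in Font documents its argument as a CID:
- (void)detector:(StringDetector *)detector didScanCID:(NSUInteger)cid
{
    CGFloat width = [detector.font widthOfCharacter:cid withFontSize:self.fontSize]; // assumed signature
    [self advanceTextPositionBy:width]; // hypothetical helper doing what didScanCharacter does today
}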
This should be triple-checked first, though. Maybe some widthOfCharacter implementations (there are different ones for different font types) in spite of the comment expect the argument to be a unicode code after all...
(Sorry if I used the wrong vocabulary here or there, I'm a Java guy... :))

CGPDFScanner and Adobe-Japan1

I'm using CGPDFScanner to extract text from a PDF.
At the time my TJ operator callback is called, the current font has CIDSystemInfo->Registry value "Adobe" and CIDSystemInfo->Ordering value "Japan1". i.e. character set "Adobe-Japan1".
How do I use this fact to convert all the text I've found with the Tj operator to unicode?
I'm sure I'm not seeing the wood for the trees here.
You can use Adobe's CMap files to re-map Adobe-Japan1 CIDs to Unicode. Also look at the "Supplement" value to pick the correct file.
http://opensource.adobe.com/wiki/display/cmap/Downloads
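For illustration, a minimal sketch of a Tj callback doing that lookup (here cidToUnicode stands for whatever table you build from the downloaded Adobe-Japan1-UCS2 CMap, passed in through the scanner's info pointer; the font is assumed to use an Identity-style encoding so each 2-byte code in the string is the CID):
#import <Foundation/Foundation.h>
#import <CoreGraphics/CoreGraphics.h>

// Registered with CGPDFOperatorTableSetCallback(table, "Tj", &handleTj);
// the cidToUnicode dictionary is the info argument given to CGPDFScannerCreate.
static void handleTj(CGPDFScannerRef scanner, void *info)
{
    CGPDFStringRef pdfString = NULL;
    if (!CGPDFScannerPopString(scanner, &pdfString)) return;

    NSDictionary *cidToUnicode = (__bridge NSDictionary *)info; // NSNumber (CID) -> NSNumber (Unicode)
    const unsigned char *bytes = CGPDFStringGetBytePtr(pdfString);
    size_t length = CGPDFStringGetLength(pdfString);

    NSMutableString *text = [NSMutableString string];
    for (size_t i = 0; i + 1 < length; i += 2) {
        NSUInteger cid = ((NSUInteger)bytes[i] << 8) | bytes[i + 1]; // assumes 2-byte CIDs
        NSNumber *unicode = cidToUnicode[@(cid)];
        if (unicode != nil) {
            [text appendFormat:@"%C", (unichar)unicode.unsignedShortValue];
        }
    }
    // text now holds the Unicode equivalent of this Tj operand.
}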
