I know in Apple's PDFKit I can get 'string' which returns an NSString object representing the text on the page.
https://developer.apple.com/documentation/pdfkit/pdfpage?language=objc
Is there a way to change text that's in the PDF? If not, how do you recommend I go about figuring out how to edit text in a PDF? Thank you!
To understand your real problem, you need to know more about how a PDF works.
First, a PDF is more like a container of (drawing, rendering) instructions than a container of content.
There are two flavors of PDF: tagged and untagged. A tagged PDF is essentially a normal PDF document plus a tree-like data structure that tells you which parts of the document make up which logical elements.
Comparable to HTML, which contains a logical structure, the tags mark paragraphs, bullet points in lists, rows in tables, etc.
If you have an untagged document, you are essentially left with nothing but the bare rendering instructions:
go to position 50, 50
set font to Arial
set font color to 0, color space to grayscale
draw the glyph for 'H'
go to position 60, 50
draw the glyph for 'e'
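In actual PDF content-stream syntax, that fragment would look roughly like this (a sketch; the font resource name /F1 and the coordinates are placeholders):

BT                % begin a text object
/F1 12 Tf         % select font resource F1 at 12 points
0 g               % fill color 0 in the grayscale color space
50 50 Td          % move to position 50, 50
(H) Tj            % draw the glyph for 'H'
10 0 Td           % advance the position
(e) Tj            % draw the glyph for 'e'
ET                % end the text object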
Instructions like this are gathered into objects. Objects can be gathered into streams. Streams can be compressed. Instructions and objects do not need to appear in any logical order.
Having objects means that you can re-use certain things. Like drawing an image on every page of a company letterhead. Or instructions like 'use the font in object 456'.
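For instance, the page's resource dictionary might say (a sketch, reusing the object number from the example):

/Resources << /Font << /F1 456 0 R >> >>

and object 456, defined once elsewhere in the file, is the font that every page can share:

456 0 obj
<< /Type /Font /Subtype /TrueType /BaseFont /ArialMT >>
endobj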
In order to be able to work with these objects, every object is given a number. And a mapping of objects, their number, and their byte-offset in the file is stored at the back of the document. This is known as the XREF table.
xref
152 42
0000000016 00000 n
0000001240 00000 n
0000002133 00000 n
0000002296 00000 n
0000002344 00000 n
0000002380 00000 n
0000002551 00000 n
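To make the format concrete, here is a minimal Python sketch that reads one such uncompressed xref subsection (real files can contain several subsections, or cross-reference streams instead):

def parse_xref_subsection(lines):
    # "152 42" means: 42 consecutive entries, starting at object number 152
    first, count = map(int, lines[0].split())
    table = {}
    for i, entry in enumerate(lines[1:1 + count]):
        offset, generation, kind = entry.split()
        if kind == "n":  # "n" = object in use, "f" = free entry
            table[first + i] = (int(offset), int(generation))
    return table  # object number -> (byte offset in the file, generation)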
Now, back to your problem.
Suppose you change the word 'dog' to the word 'cats'.
You'd run into several problems:
every byte offset after the edit is suddenly wrong, since 'cats' contains 4 bytes and 'dog' contains only 3.
the xref table no longer points at the right places, so objects can't be located and the instructions that reference them break.
if at any point your substitution causes the text to go too far out of alignment, you would need to perform layout again.
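You can see the first two problems with a naive byte-level patch (a deliberately broken sketch; the file names are placeholders, and this only finds anything at all if the relevant stream is uncompressed):

# Replace the literal string (dog) with (cats) in the raw bytes.
# Everything after the edit point shifts by one byte, so every
# xref offset pointing past it now lands in the wrong place.
data = open("input.pdf", "rb").read()
open("broken.pdf", "wb").write(data.replace(b"(dog)", b"(cats)", 1))
# Most viewers will only open broken.pdf after rebuilding the xref table.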
Why is layout such a problem?
Remember what I said earlier about the PDF containing only rendering instructions. It's insanely hard to reconstruct things like paragraph boundaries, tables, lists, etc. from the raw instructions.
Especially so if you want to do this for other scripts than just Latin script (imagine Hebrew, or Arabic). Or if your page layout is non-standard (like a scientific article, which appears in columns rather than lines that take up an entire page.)
Structure recognition is in fact the topic of ongoing research.
Related
I have a PDF with tabular data that runs over 50+ pages, and I want to extract this table into an Excel file using Automation Anywhere (I am using the community version of AA 11.3). I watched videos on the PDF integration command but haven't had any success trying this for tabular data.
Requesting assistance.
Thanks.
I am afraid that your case will be quite challenging... and the main reason for that is the values that span multiple lines. You can still achieve what you need, and with good performance, but the code itself will not be pretty. You will also face challenges with Automation Anywhere, since it does not really provide the right tools for such a task, and you may need to resort to scripting (VBScript) or MetaBots.
Solution 1
This one will try to use purely text extraction and Regular expressions. Mainly standard functionality, nothing too "dirty".
First you need to look at what the exported data looks like. You can export it as Plain or Structured.
The Plain export is not useful at all, as the data is all over the place without any clear pattern.
The Structured export is much better, as its structure resembles the data in the original document. From looking at the data you can make these observations:
Each row contains 5 columns
All columns are always filled (at least in the visible sample set)
The last two columns can serve as a pattern "anchor" (identifier), because they contain a clear pattern (a number followed by minimum of two spaces followed by a dollar sign and another number)
Rows with data are separated by a blank row
The text columns may contain a multiline value, which splits a single data row across several text rows (this one thing makes it especially tricky)
First you need to ensure that the Structured data contains only the table, nothing else. You can probably use the Before-After String command for that.
Then you need to check whether you can reliably identify the character width of every column. You can try this yourself: copy the text into Excel, use Text to Columns with the Fixed Width option, and play around with the sliders.
Then you need to find a way to reliably identify each data row and prepare it for the Split command in AA. For that you need a delimiter, but since each data row can actually consist of multiple text rows, you need to create a delimiter of your own. I used the Replace function with the Regular Expression option to replace a specific pattern with a delimiter (a pipe), as sketched below.
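Outside of AA, the same idea looks like this in Python (a sketch; the pattern assumes the anchor columns described above, i.e. a number, two or more spaces, and a dollar amount at the end of each data row):

import re

structured_text = open("export.txt", encoding="utf-8").read()
# Append a pipe after every line that ends with the two anchor
# columns, so each data row gets its own delimiter even when the
# text columns span multiple lines.
delimited = re.sub(r"(\d+ {2,}\$[\d,.]+)\r?\n", r"\1|", structured_text)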
Now that you have added a custom delimiter, you can use the Split command to add each row into a list and loop through it.
Because each data row may consist of several text rows, you will need to use Split again, this time with [ENTER] as the delimiter. Now you need to loop through each text line of a single data row, use the Substring function to extract data based on the column widths, and concatenate the pieces into a single value that you store somewhere else.
All in all, a painful process.
Solution 2
This may not be applicable, but it's worth a try: open the PDF in Microsoft Word. It will give you a warning; ignore it. Word will attempt to open the document and, if you're lucky, it will recognise your table as a table. If it works, it will make the data extraction much easier, and you will be able to use Macros/VBA or even simple copy and paste. I tried it on a random PDF of my own and it works quite well.
I'm trying to extract text information from a (digital) PDF by identifying content and location of each character and each word. For words, pdftotext --bbox from xpdf / poppler works quite well, but I cannot find an easy way to extract character location.
What I've tried
The solution I currently have is to convert the pdf to svg (via pdf2svg), and then parse the resulting svg to extract single character (= glyph) locations. In a third step, the resulting boxes are compared, each character is assigned to a word and hopefully the numbers match.
Problems
While the above works for most "basic" fonts, there are two (main) situations where this approach fails:
In script fonts (or some extreme italic fonts), bounding boxes are way larger than their content; as a result, words overlap significantly, and it can well happen that a character is entirely contained in two words. In this case, the mapping fails, because once I translate to svg I have no information on what character is contained in which glyph.
In many fonts multiple characters can be ligated, giving rise to a single glyph. In this case, the count of character boxes does not match the number of characters in the word, and matching each letter to a box is again problematic.
The second point (which is the main one for me) has a partial workaround by identifying the common ligatures and (if the counts don't match) splitting the corresponding bounding boxes into multiple pieces; but that cannot always work, because for example "ffi" is sometimes ligated to a single glyph, sometimes in two glyphs "ff" + "i", and sometimes in two glyphs "f" + "fi", depending on the font.
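The splitting itself is simple enough; I currently do something like this proportional slicing (a sketch; equal widths are of course only an approximation of the real glyph parts):

def split_ligature_bbox(bbox, n):
    # Split one glyph box into n equal horizontal slices,
    # one per character of the ligature.
    x0, y0, x1, y1 = bbox
    step = (x1 - x0) / n
    return [(x0 + i * step, y0, x0 + (i + 1) * step, y1) for i in range(n)]

# e.g. an "ffi" ligature box split three ways:
print(split_ligature_bbox((100.0, 700.0, 130.0, 712.0), 3))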
What I would hope
It is my understanding that PDFs actually contain glyph information, and not words. If so, all the programs that extract text from PDFs (like pdftotext) must first extract and locate the various characters, and then maybe group them into words/lines; so I am a bit surprised that I could not find options to output the location of each single character. Converting to svg essentially gives me that, but in that conversion all information about the content (i.e. the mapping glyph-to-character, or glyph-to-characters, if there was a ligature) is lost, because there is no font anymore. And redoing the effort of matching each glyph to a character by looking at the font again feels like rewriting a pdf parser...
I would therefore be very grateful for any idea of how to solve this. The top answer here suggests that this might be doable with TET, but it's a paid option, and replacing my whole infrastructure to handle just one edge case seems like overkill...
A PDF file doesn't necessarily specify the position of each character explicitly. Typically, it breaks a text into runs of characters (all using the same font, anything up to a line, I think) and then for each run, specifies the position of the bounding box that should contain the glyphs for those characters. So the exact position of each glyph will depend on metrics (mostly glyph-widths) of the font used to render it.
The Python package pdfminer has a script pdf2txt.py. Try invoking it with -t xml. The docs just say "XML format. Provides the most information." But my notes indicate that it will apply the font metrics and give you a <text> element for every single glyph, with font and bounding-box info.
There are various versions in various places (e.g. PyPI and github). If you need Python 3 support, look for pdfminer.six.
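If you would rather stay in Python than parse the XML output, the same per-glyph information is reachable through the layout API (a sketch against pdfminer.six; "file.pdf" is a placeholder):

from pdfminer.high_level import extract_pages
from pdfminer.layout import LTChar, LTTextContainer, LTTextLine

for page in extract_pages("file.pdf"):
    for element in page:
        if not isinstance(element, LTTextContainer):
            continue
        for line in element:
            if not isinstance(line, LTTextLine):
                continue
            for obj in line:
                if isinstance(obj, LTChar):
                    # bbox is (x0, y0, x1, y1) in page coordinates;
                    # a ligature arrives as one LTChar carrying its full text
                    print(obj.get_text(), obj.fontname, obj.bbox)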
We are using JFreeChart along with iText for generating PDF reports. For Japanese, we realized that in the rendered content for the graph legend, the characters don't have any spaces between them. They basically overlap, which makes it hard to read.
Do we need to use any special encoding?
Attached are images for expected and actual (generated by JFreeChart), in that order.
Below is a snippet of the graph generated with the legend
According to the PDF specification, a CIDFont dictionary contains an optional dictionary called DW and an optional array called W. DW is the default width for glyphs. If not set, it defaults to 1000.
The W array describes individual widths for characters in the font (if not specified, they default to the value of DW). For many Japanese fonts I've seen the value set lower than 1000, but in this case it might be too low.
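For illustration, the relevant part of a CIDFont dictionary looks roughly like this (the numbers are made up; this /W means CIDs 1 and 2 are 500 units wide, and CIDs 120 through 125 are 1000):

<< /Type /Font
   /Subtype /CIDFontType0
   /BaseFont /SomeJapaneseFont
   /DW 1000
   /W [ 1 [ 500 500 ] 120 125 1000 ]
>>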
You can take a look at these values using Acrobat's "Preflight > Browse internal structure" tool. If these seem off, you may be using the wrong encoding. Setting the encoding to "UniJIS-UCS2-H" should help resolve this issue.
I'm quite at a loss on this subject. I've read pretty much every post about it here on SO; I would very much appreciate it if somebody would nudge me in the right direction.
I have a PDF and I would like to extract its text; I'm only interested in words and spaces. I have set up a CGPDFScanner and its callback methods. What I have read is that I only need to consider 4 operators as far as extracting text goes: TJ, Tj, quote (') and double-quote (").
I guess I also need to keep track of the text space to be able to determine whether the letters should be put together to form a word or should be separated by a space. But I have no idea how I would have to do this.
In the PDF, all text is in the format
[(X)-24.2524(X)-24.2524(X)-24.2524(Y)-24.2524(Y)-24.2524]TJ
but I have not been able to figure out (using the PDF specification) what these numbers mean. Somebody on SO said that you should not be scared of the PDF specs, but frankly I do not find them very easy to read/understand.
I have studied the PDFKitten code which was helpful.
Any help would be greatly appreciated.
I cannot give you advice how to extract words from PDF, but the format of
[(X)-24.2524(X)-24.2524(X)-24.2524(Y)-24.2524(Y)-24.2524]TJ
is explained for example in the PDF 1.7 Specification, section "9.4.3 Text-Showing Operators". The description of the TJ operator is:
Show one or more text strings, allowing individual glyph positioning.
Each element of array shall be either a string or a number. If the
element is a string, this operator shall show the string. If it is a
number, the operator shall adjust the text position by that amount;
that is, it shall translate the text matrix, Tm. The number shall be
expressed in thousandths of a unit of text space.
So the numbers are adjustments to the distance between the letters.
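As a worked example: with a 10 pt font, the -24.2524 above moves the next glyph right by 24.2524 / 1000 × 10 ≈ 0.2425 units (the adjustment is subtracted, so negative numbers widen the gap and positive numbers tighten it). Here is a small Python sketch of the bookkeeping (a toy; it ignores glyph widths, which a real extractor must add from the font metrics):

import re

FONT_SIZE = 10.0  # assumed Tf size in points
tj = "[(X)-24.2524(X)-24.2524(X)-24.2524(Y)-24.2524(Y)-24.2524]TJ"

x = 0.0
for text, num in re.findall(r"\(([^)]*)\)|(-?\d+\.?\d*)", tj):
    if text:
        print("draw %r at x offset %.4f" % (text, x))
        # a real extractor would now advance x by the glyph width
    else:
        x -= float(num) / 1000.0 * FONT_SIZE  # negative values move right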
I'm finding it difficult to parse a PDF file that was created in a non-English language. I used PDFBox and iText but couldn't find anything in there that could help parse this file. Here's the PDF file that I'm talking about: http://prapatti.com/slokas/telugu/vishnusahasranaamam.pdf The PDF says that it was created using LaTeX and the Tikkana font. I have the Tikkana font installed on my machine, but that didn't help. Please help me with this.
Thanks, K
When you say "parse PDF files", my first thought was that the PDF in question wasn't opening in various PDF viewers & libraries, and was therefore corrupt in some way.
But that's not the case at all. It opens just fine in Acrobat Reader X. And then I see the text on the page.
And when I copy/paste that text from the first page, I get:
Ûûp{¨¶ðQ{p{¨|={pÛû{¨>üb¶úN}l{¨d{p{¨> >Ûpû¶bp{¨}|=/}pT¶=}Nm{Z{Úpd{m}a¾Ú}mp{Ú¶¨>ztNð{øÔ_c}m{ТÁ}=N{Nzt¶ztbm}¥Ázv¬b¢Á
Á ÛûÁøÛûzÏrze¨=ztTzv}lÛzt{¨d¨c}p{Ðu{¨½ÐuÛ½{=Û Á{=Á Á ÁÛûb}ßb{q{d}p{¨ze=Vm{Ðu½Û{=Á
That's from Reader.
Much of the text in this PDF is written using various "Type 3" fonts. These fonts claim to use "WinAnsiEncoding" (also known as code page 1252), with a "Differences" array. This Differences array is wrong:
47 /BB 61 /BP /BQ 81 /C6...
The first number is the code point being replaced, and each Name that follows is assigned to successive code points until the next number resets the position.
There are no such character names as BB, BP, BQ, C6, and so on. So when you copy and paste that text, you get the garbage above.
I'm sorry, but the only reliable way to extract text from such a PDF is OCR (optical character recognition).
Eh... Long shot idea:
If you can find the specific versions of the specific fonts used to generate this PDF, you just might be able to determine the actual stream contents of known characters converted to Type 3 fonts in this way.
Once you have these known streams, you can compare them to the streams in the PDF and use that to build your own translation table.
You could either fix the existing PDF[s] (by changing the names in the encoding dictionary and the Type 3 CharProc entries) so that text extractors will work correctly, or just grab the bytes out of the stream and translate them yourself.
The workflow would go something like this:
For each character in a font used in the form:
render it to a PDF by itself using the same LaTeX/Ghostscript versions.
Open the PDF and find the CharProc for that particular known character.
Store that stream along with the known character used to build it.
For each text byte in the PDF to be interpreted:
Get the glyph name for the given byte based on the existing encoding array.
Get the CharProc stream for that glyph name and compare it to your known CharProcs.
NOTE: This could be rewritten to be much more efficient with some caching, but it gets the idea across (I hope).
All that requires a fairly deep understanding of PDF and the parsing methods involved. But it just might work. Might not, too...
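To sketch the comparison step in Python (pikepdf is my choice here, not a requirement; the known table is the thing you would have to build yourself from the reference renders, and error handling is omitted):

import hashlib
import pikepdf

# Hypothetical table built from steps 1-3 above:
# sha1 of a decoded CharProc stream -> the character it draws
known = {}

pdf = pikepdf.open("vishnusahasranaamam.pdf")
for page in pdf.pages:
    for name, font in page.Resources.Font.items():
        if font.get("/Subtype") != "/Type3":
            continue
        for glyph_name, proc in font.CharProcs.items():
            # read_bytes() returns the decompressed stream, so two
            # identical procedures compare equal even if filters differ
            digest = hashlib.sha1(proc.read_bytes()).hexdigest()
            print(glyph_name, "->", known.get(digest, "unknown"))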