Auto page break in libHaru PDF - ios

I'd like to add an automatic page break to a libHaru PDF in iOS.
I do have several text fields in the app which contain the user filled data. when i generate the pdf i first measure the expected size of the text-rect going to be created. if it exceeds the remaining space i trigger a hpdf_new_page event and put the text on an new page. i'd like to have this just in part automatically. so if the text exceeds the space on the current page it should split and continue on a new page without me checking or doing anything.
unfortunately i can't find anything like this in the documentation.

Line counting using fgets() may help. When your print program opens a file to print, each line can be copied to the pdf file and checked for a form feed character
or
if the line count has reached a limit.
Another possible solution is to use a character count limit with "while(getc(file) != EOF)".
This link uses libharu to print basic text files with PCL commands to change the font.
https://github.com/DaDaDadeo/GetCycle/blob/master/pcl_to_pdf.c
The form feed character '\f' (ascii 12) and 61 lines will trigger a new page. There are other conditions in the program to restrict a new page but the general idea is illustrated.
The results are the same as a printer using telnet raw 9100 protocol. The pcl commands are limited to just a couple of font changes so it is not too complicated.

Libharu is rather low-level library, and I could not even expect of appearing such automatic page splitting in newer versions due to number of reasons. Hereafter I state two of them:
There is no good, preferred strategy how to place remaining of non-fitting text on the next page. In some cases it could be even impossible at all.
There is no good, preferred strategy for text splitting.
Why?
Consider your font is extremely large, and just one letter (for instance, wide one as "W") does not fit into the page. Where we are supposed to place it? On the next page? Ok, we add new page... oops, it does not fit this page too - as soon as all our pages have the same size. Dead-end without any good, straightforward way out.
In other words, there should be a user-defined strategy for these cases. Almosy every naive implementation will have such a corner cases.
libharu does not know where it should split your text automatically. It does not know hyphenation rules of your language, it does not know whether it should respect spaces or not (wrap whole words only or not), and so on. It's up to you to specify these rules.
So, you should call HPDF_Font_MeasureText for some part of your text string, decide if it fits into your page (excluding margins, footers - which also out of libharu's internal knowledge) and render it. And note that there is no simple formula for text size depending on its length. String "wwww" is more than twice wider than "iiii", of course if your font is not mono-spaced.

Related

Extracting PDF Tables into Excel in Automation Anywhere

[![enter image description here][4]][4][![enter image description here][5]][5]I have a PDF that has tabular data that runs over 50+ pages, i want to extract this table into an excel file using Automation Anywhere. (i am using community version of AA 11.3). I watched videos of the PDF integration command but haven't had any success trying this for tabular data.
Requesting assistance.
Thanks.
I am afraid that your case will be quite challenging... and the main reason for that are the values that contains multiple lines. You can still achieve what you need, and with good performance, but the code itself will not be pretty. You will also be facing challanges with Automation Anywhere, since it does not really provide the right tools to do such a thing and you may need to resort to scripting (VBScripts) or Metabots.
Solution 1
This one will try to use purely text extraction and Regular expressions. Mainly standard functionality, nothing too "dirty".
First you need to realise how do the exported data look like. You can see that you can export to Plain or Structured.
The Plain one is not useful at all as the data is all over the place, without any clear pattern.
The Structured one is much better as the data structure resembles the data from the original document. From looking at the data you can make these observations:
Each row contains 5 columns
All columns are always filled (at least in the visible sample set)
The last two columns can serve as a pattern "anchor" (identifier), because they contain a clear pattern (a number followed by minimum of two spaces followed by a dollar sign and another number)
Rows with data are separated by a blank row
The text columns may contain a multiline value, which will duplicate the rows (this one thing makes it especially tricky)
First wou need to ensure that the Structured data contain only the table, nothing else. You can probably use the Before-After string command for that.
Then you need to check if you can reliably identify the character width of every column. You can try this for yourself if you copy the text into Excel, use the Text to Columns with the Fixed Width option and try to play around with the sliders
The you need to try to find a way how to reliably identify each row and prepare it for the Split command in AA. For that you need to have a delimiter. But since each data row can actually consists of multiple text rows, you need to create a delimiter of your own. I used the Replace function with Regular Expression option and replace a specific pattern for a delimiter (pipe). See here.
Now that you have added a custom delimiter, you can use the Split command to add each row into a list and loop through it.
Because each data row may consists of several rows, you will need to use Split again, this time use the [ENTER] as delimiter. Now you need to loop through each of the text line of a single data line and use the Substring function to extract data based on column width and concatenate them to a single value that you store somewhere else.
All in all, a painful process.
Solution 2
This may not be applicable, but it's worth a try - open the PDF in Microsoft Word. It will give you a warning, ignore it. Word will attempt to open the document and, if you're lucky, it will recognise your table as a table. If it works, it will make the data extraction much easier an you will be able to use Macros/VBA or even simple Copy&Paste. I tried it on a random PDF of my own and it works quite well.

Parse PDF file and output single character locations

I'm trying to extract text information from a (digital) PDF by identifying content and location of each character and each word. For words, pdftotext --bbox from xpdf / poppler works quite well, but I cannot find an easy way to extract character location.
What I've tried
The solution I currently have is to convert the pdf to svg (via pdf2svg), and then parse the resulting svg to extract single character (= glyph) locations. In a third step, the resulting boxes are compared, each character is assigned to a word and hopefully the numbers match.
Problems
While the above works for most "basic" fonts, there are two (main) situations where this approach fails:
In script fonts (or some extreme italic fonts), bounding boxes are way larger than their content; as a result, words overlap significantly, and it can well happen that a character is entirely contained in two words. In this case, the mapping fails, because once I translate to svg I have no information on what character is contained in which glyph.
In many fonts multiple characters can be ligated, giving rise to a single glyph. In this case, the count of character boxes does not match the number of characters in the word, and matching each letter to a box is again problematic.
The second point (which is the main one for me) has a partial workaround by identifying the common ligatures and (if the counts don't match) splitting the corresponding bounding boxes into multiple pieces; but that cannot always work, because for example "ffi" is sometimes ligated to a single glyph, sometimes in two glyphs "ff" + "i", and sometimes in two glyphs "f" + "fi", depending on the font.
What I would hope
It is my understanding that pdf actually contain glyph information, and not words. If so, all the programs that extract text from pdf (like pdftotext) must first extract and locate the various characters, and then maybe group them into words/lines; so I am a bit surprised that I could not find options to output location for each single character. Converting to svg essentially gives me that, but in that conversion all information about the content (i.e. the mapping glyph-to-character, or glyph-to-characters, if there was a ligature) is lost, because there is no font anymore. And redoing the effort of matching each glyph to a character by looking at the font again feels like rewriting a pdf parser...
I would therefore be very grateful for any idea of how to solve this. The top answer here suggests that this might be doable with TET, but it's a paying option, and replacing my whole infrastructure to handle just one limit case seems a big overkill...
A PDF file doesn't necessarily specify the position of each character explicitly. Typically, it breaks a text into runs of characters (all using the same font, anything up to a line, I think) and then for each run, specifies the position of the bounding box that should contain the glyphs for those characters. So the exact position of each glyph will depend on metrics (mostly glyph-widths) of the font used to render it.
The Python package pdfminer has a script pdf2txt.py. Try invoking it with -t xml. The docs just say XML format. Provides the most information. But my notes indicate that it will apply the font-metrics and give you a <text> element for every single glyph, with font and bounding-box info.
There are various versions in various places (e.g. PyPI and github). If you need Python 3 support, look for pdfminer.six.

Retrieve text from pdf in iOS, using zachron iphonepdf?

I'm using the same zachrone iphonepdf but I did not get the text. My text view shows nothing. What's the problem?
Here is my code:
NSString *text=convertPDF(#"Course.pdf");
texview.text=text;
But I did not get anything in text view?
The text extractor zachron / pdfiphone (I assume you meant that one) is extremely naive and makes very many assumptions.
It ignores the PDF file structure and, therefore completely ignores whether the data it inspects is still used in the current revision.
It ignores encryption and therefore will fail completely for many documents with usage restrictions.
It completely ignores font encodings and implicitely assumes an ASCII'ish one --- this is fairly often true in small PDFs with English text only and not embedded fonts; otherwise the result can be anything.
... many many more assumptions ...
Unless one only has to deal with very simple documents and the extracted text is not really necessary for the functionality of one's code, I would propose using different code for text extraction.

Sanitize pasted text from MS-Word

Here's my wild and whacky psuedo-code. Anyone know how to make this real?
Background:
This dynamic content comes from a ckeditor. And a lot of folks paste Microsoft Word content in it. No worries, if I just call the attribute untouched it loads pretty. But the catch is that I want it to be just 125 characters abbreviated. When I add truncation to it, then all of the Microsoft Word scripts start popping up. Then I added simple_format, and sanitize, and truncate, and even made my controller start spotting out specific variables that MS would make and gsub them out. But there's too many of them, and it seems like an awfully messy way to accomplish this. Thus so! Realizing that by itself, its clean. I thought, why not just slice it. However, the microsoft word text becomes blank but still holds its numbered position in the string. So I came up with this (probably awful) solution below.
It's in three steps.
When the text parses, it doesn't display any of the MSWord junk. But that text still holds a number position in a slice statement. So I want to use a regexp to find the first actual character.
Take that character and find out what its numbered position is in the total string.
Use a slice statement to cut it from.
def about_us_truncated
x = self.about_us.find.first(regExp representing first actual character)
x.charCount = y
self.about_us[y..125]
end
The only other idea i got, is a regex statement that allows it to explicitly slice only actual characters like so :
about_us([a-zA-Z][0..125]) , but that is definately not how it is written.
Here is some sample text of MS Word junk :
&Lt;! [If Gte Mso 9]>&Lt;Xml>&Lt;Br /> &Lt;O:Office Document Settings>&Lt;Br /> &Lt;O:Allow Png/>&Lt;Br /> &Lt;/O:Off...
You haven't provided much information to go off of, but don't be too leery of trying to build this regex on your own before you seek help...
Take your sample text and paste it in Rubular in the test string area and start building your regex. It has a great quick reference at the bottom.
Stumbled across this
http://gist.github.com/139987
it looks like it requires the sanitize gem.
This is technically not a straight answer, but it seems like the best possible one you can find.
In order to prevent MS Word, you should be using CK Editor's built-in MS word sanitizer. This is because writing regex for it can be very complicated and you can very easily break tags in half and destroy your site with it.
What I did as a workaround, is I did a force paste as plain text in the CK Editor.

Dealing with widows in a multicol environment

I have problems with dealing with widows within a multicols environment, that is, I have not managed to instruct LaTeX to remove the them.
This PDF document shows an example of the problem. At the top of the second page, I get a widow from the last paragraph of the first page. I have tried a couple of approaches, without luck:
setting both \widowpenalty and \clubpenalty to high values
switching between \raggedcolumns and \flushcolumns
adjusting the collectmore and unbalance counters
I've also read through the documentation for multicol but have not found anything useful.
Is there anything else I could try?
(The complete LaTeX document for the above example)
{\obeyspaces\gdef\nomorebreak{\beginnomorebreak\let \nobreakspace}}
\def\beginnomorebreak{\begingroup
\def\par{\endgraf\endgroup\par\penalty 9999 }\obeyspaces
\brokenpenalty 10000 \widowpenalty 10000 \clubpenalty 10000 }
\def\nobreakspace{\vadjust{\nobreak} \removespaces}
\def\removespaces{\futurelet\next\checkspace}
\def\checkspace{\ifx\next\nobreakspace\expandafter\removesinglespace\fi}
\def\removesinglespace#1{\removespaces}
Insert \nomorebreak at any place of your paragraph. Page breaks will be prohibited after this macro until the end of the paragraph.
It seems that the TeX FAQ item Controlling widows and orphans has some options you did not try yet.
Getting rid of a widow can be more tricky. Options are
If the previous page contains a long paragraph with a short last line, it may be possible to set it “tight”: write \looseness=-1
immediately after the last word of the paragraph.
If that doesn’t work, adjusting the page size, using \enlargethispage{\baselineskip} to “add a line” to the page, which
may have the effect of getting the whole paragraph on one page.
Reducing the size of the page by \enlargethispage{-\baselineskip} may produce a (more-or-less) acceptable “two-line widow”.

Resources