I would like to build invoice text base to have following format:
https://gist.github.com/samnang/9d35a8622af5779a9228
But when I try to render as text format from my controller, then I find out it's very to control how it look because those text are dynamic like company name could be short or long. I don't want to other texts move position depend on something else because those format will be sent to print company to match print out paper.
Is there any examples, gems, or suggestions how to build invoice text base?
You are going to have to truncate some fields to fit into the defined tab area. If you allocate 20 characters of description data, then you need to only allow 20 characters on the output your_text[0..19]
I think working in text for this type of work is going to be overly complex, and you should look at converting the output to pdf to have more control. Take a look at https://github.com/prawnpdf/prawn
Related
I am trying to extract data from this Japanese PDF using tabula-py (and tabula-java), but the output is gibberish. In both tabula-py and tabula-java, the output isn't human readable (definitely not Japanese characters), and there are no no error/warning messages. It does seem that the content of the PDF is processed though.
When using the standalone Tabula tool, the characters are encoded properly:
Searching online in the tabula-py and tabula-java documentation, and below are suggestions I could find, but these don't change the output.
Setting the -Dfile.encoding=utf8 (in java call to tabula-py or tabula-java)
Setting chcp 65001 (in Windows command prompt)
I understand Tabula and tabula-java (and tabula-py) use the same library, but is there something different between the two that would explain the difference in encoding output?
Background info
There is nothing unusual in this PDF compared to any other.
The text like any PDF is written in authors random order so for example the 1st PDF body Line (港区内認可保育園等一覧) is the 1262nd block of text added long after the table was started. To hear written order we can use Read Aloud, to verify character and language recognition but unless the PDF was correctly tagged it will also jump from text block to block
So internally the text is rarely tabular the first 8 lines are
1 認可保育園
0歳 1歳 2歳3歳4歳5歳 計
短時間 標準時間
001010 区立
3か月
3455-
4669
芝5-18-1-101
Thus you need text extractors that work in a grid like manner or convert the text layout into a row by row output.
This is where all extractors will be confounded as to how to output such a jumbled dense layout and generally ALL will struggle with this page.
Hence its best to use a good generic solution. It will still need data cleaning but at least you will have some thing to work on.
If you only need a zone from the page it is best to set the boundary of interest to avoid extraneous parsing.
Your "standalone Tabula tool" output is very good but could possibly be better by use pdftotext -layout and adjust some options to produce amore regular order.
Your Question
the difference in encoding output?
The Answer
The output from pdf is not the internal coding, so the desired text output is UTF-8, but PDF does not store the text as UTF-8 or unicode it simply uses numbers from a font character map. IF the map is poor everything would be gibberish, however in this case the map is good, so where does the gibberish arise? It is because that out part is not using UTF-8 and console output is rarely unicode.
You correctly show that console needs to be set to Unicode mode then the output should match (except for the density problem)
The density issue would be easier to handle if preprocessed in a flowing format such as HTML
or using a different language
[![enter image description here][4]][4][![enter image description here][5]][5]I have a PDF that has tabular data that runs over 50+ pages, i want to extract this table into an excel file using Automation Anywhere. (i am using community version of AA 11.3). I watched videos of the PDF integration command but haven't had any success trying this for tabular data.
Requesting assistance.
Thanks.
I am afraid that your case will be quite challenging... and the main reason for that are the values that contains multiple lines. You can still achieve what you need, and with good performance, but the code itself will not be pretty. You will also be facing challanges with Automation Anywhere, since it does not really provide the right tools to do such a thing and you may need to resort to scripting (VBScripts) or Metabots.
Solution 1
This one will try to use purely text extraction and Regular expressions. Mainly standard functionality, nothing too "dirty".
First you need to realise how do the exported data look like. You can see that you can export to Plain or Structured.
The Plain one is not useful at all as the data is all over the place, without any clear pattern.
The Structured one is much better as the data structure resembles the data from the original document. From looking at the data you can make these observations:
Each row contains 5 columns
All columns are always filled (at least in the visible sample set)
The last two columns can serve as a pattern "anchor" (identifier), because they contain a clear pattern (a number followed by minimum of two spaces followed by a dollar sign and another number)
Rows with data are separated by a blank row
The text columns may contain a multiline value, which will duplicate the rows (this one thing makes it especially tricky)
First wou need to ensure that the Structured data contain only the table, nothing else. You can probably use the Before-After string command for that.
Then you need to check if you can reliably identify the character width of every column. You can try this for yourself if you copy the text into Excel, use the Text to Columns with the Fixed Width option and try to play around with the sliders
The you need to try to find a way how to reliably identify each row and prepare it for the Split command in AA. For that you need to have a delimiter. But since each data row can actually consists of multiple text rows, you need to create a delimiter of your own. I used the Replace function with Regular Expression option and replace a specific pattern for a delimiter (pipe). See here.
Now that you have added a custom delimiter, you can use the Split command to add each row into a list and loop through it.
Because each data row may consists of several rows, you will need to use Split again, this time use the [ENTER] as delimiter. Now you need to loop through each of the text line of a single data line and use the Substring function to extract data based on column width and concatenate them to a single value that you store somewhere else.
All in all, a painful process.
Solution 2
This may not be applicable, but it's worth a try - open the PDF in Microsoft Word. It will give you a warning, ignore it. Word will attempt to open the document and, if you're lucky, it will recognise your table as a table. If it works, it will make the data extraction much easier an you will be able to use Macros/VBA or even simple Copy&Paste. I tried it on a random PDF of my own and it works quite well.
I'd like to add an automatic page break to a libHaru PDF in iOS.
I do have several text fields in the app which contain the user filled data. when i generate the pdf i first measure the expected size of the text-rect going to be created. if it exceeds the remaining space i trigger a hpdf_new_page event and put the text on an new page. i'd like to have this just in part automatically. so if the text exceeds the space on the current page it should split and continue on a new page without me checking or doing anything.
unfortunately i can't find anything like this in the documentation.
Line counting using fgets() may help. When your print program opens a file to print, each line can be copied to the pdf file and checked for a form feed character
or
if the line count has reached a limit.
Another possible solution is to use a character count limit with "while(getc(file) != EOF)".
This link uses libharu to print basic text files with PCL commands to change the font.
https://github.com/DaDaDadeo/GetCycle/blob/master/pcl_to_pdf.c
The form feed character '\f' (ascii 12) and 61 lines will trigger a new page. There are other conditions in the program to restrict a new page but the general idea is illustrated.
The results are the same as a printer using telnet raw 9100 protocol. The pcl commands are limited to just a couple of font changes so it is not too complicated.
Libharu is rather low-level library, and I could not even expect of appearing such automatic page splitting in newer versions due to number of reasons. Hereafter I state two of them:
There is no good, preferred strategy how to place remaining of non-fitting text on the next page. In some cases it could be even impossible at all.
There is no good, preferred strategy for text splitting.
Why?
Consider your font is extremely large, and just one letter (for instance, wide one as "W") does not fit into the page. Where we are supposed to place it? On the next page? Ok, we add new page... oops, it does not fit this page too - as soon as all our pages have the same size. Dead-end without any good, straightforward way out.
In other words, there should be a user-defined strategy for these cases. Almosy every naive implementation will have such a corner cases.
libharu does not know where it should split your text automatically. It does not know hyphenation rules of your language, it does not know whether it should respect spaces or not (wrap whole words only or not), and so on. It's up to you to specify these rules.
So, you should call HPDF_Font_MeasureText for some part of your text string, decide if it fits into your page (excluding margins, footers - which also out of libharu's internal knowledge) and render it. And note that there is no simple formula for text size depending on its length. String "wwww" is more than twice wider than "iiii", of course if your font is not mono-spaced.
I'm quite at a lost on this subject. I've read pretty much every post about it here on SO, I would very much appreciate it if somebody would nudge me in the right direction.
I have a PDF and I would like to extract it's text, I'm only interested in words and spaces. I have setup a CGPDFScanner and it's callback methods. What I have read is that I only need to consider 4 operators TJ, Tj, qout(') and doubleqout(") as far as extracting text goes.
I guess I also need to keep track of the text space to be able to determine whether the letters should be put together to form a word or should be separated by a space. But I have no idea how I would have to do this.
In the PDF, all text is in the format
[(X)-24.2524(X)-24.2524(X)-24.2524(Y)-24.2524(Y)-24.2524]TJ
but I have not been able to figure out (using the PDF specification) what these numbers mean. Somebody on SO said that you should not be scared of the PDF specs but frankly I do not find them very easy to read/understand.
I have studied the PDFKitten code which was helpful.
Any help would be greatly appreciated.
I cannot give you advice how to extract words from PDF, but the format of
[(X)-24.2524(X)-24.2524(X)-24.2524(Y)-24.2524(Y)-24.2524]TJ
is explained for example in the PDF 1.7 Specification, section "9.4.3 Text-Showing Operators". The description of the TJ operator is:
Show one or more text strings, allowing individual glyph positioning.
Each element of array shall be either a string or a number. If the
element is a string, this operator shall show the string. If it is a
number, the operator shall adjust the text position by that amount;
that is, it shall translate the text matrix, Tm. The number shall be
expressed in thousandths of a unit of text space.
So the numbers are adjustments to the distance between the letters.
Here's what I have:
Quarter Sheet Flyer (4 per page) as a PSD or JPG
Text file with one entry of text per line.
What I want to do:
Print out 100 flyers (on 25 pieces of paper)
Somehow automate the process of adding the text to the image, either via some scripting language or a Photoshop automated task. Then format the pages to print, either to generate a 25 page PDF file or generate four at a time and send them to the printer page by page.
Anyone have any experience with something like this or have any recommendations on how I should go about doing this?
Thanks for your help!
You can use Microsoft Word automation to generate a word file with the correct text and image, and then just print it.
This would be one of the simpler solutions, you can implement the entire thing as a word macro (VBA).
A more complex solution would be to use VB6 or .net to print the text and the image into the form and then print the form.
You can write a script that will generate an html page with the image and the text, and then print out the html using a browser.