Is there a way to parse the form fields of signed PDFs, e.g. using Python or Java, and write them to a CSV?

I would like to parse form fields from signed PDFs, for example checkboxes. I have already tried different approaches with Python (PyPDF2, pikepdf, even pdfminer), but I only get the individual letters out, not the form fields. If someone has an approach for parsing form fields from signed PDFs, it would be my salvation. I'm already thinking about trying OCR, but it seems very complicated to me and there is probably an easier way.
Does anyone have an idea how I can parse the form fields out of a signed PDF?
Thanks in advance!

disclaimer: I am the author of borb, the library used in this answer.
It's unclear what precisely you want:
- You want to extract information from the form fields in the PDF, or
- your PDF was signed and then scanned, and you want to extract an image of the signature.
Either option is possible using borb.
If you want to extract information from the form fields, I would recommend you look at section 4.4 of the examples repository. I'll post the example here for the sake of completeness.
import typing

from borb.pdf import Document
from borb.pdf import PDF

def main():
    # open the document
    doc: typing.Optional[Document] = None
    with open("output.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle)
    assert doc is not None

    # get the value of each form field
    print("Name: %s" % doc.get_page(0).get_form_field_value("name"))
    print("Firstname: %s" % doc.get_page(0).get_form_field_value("firstname"))
    print("Country: %s" % doc.get_page(0).get_form_field_value("country"))

if __name__ == "__main__":
    main()
This example reads an input PDF and then fetches the values of its form fields.
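Since your end goal is a CSV, here is a minimal sketch (assuming the same output.pdf and the field names from the example above; substitute your own) of writing those values out with Python's csv module:
import csv
import typing

from borb.pdf import Document
from borb.pdf import PDF

# load the document as in the example above
doc: typing.Optional[Document] = None
with open("output.pdf", "rb") as pdf_file_handle:
    doc = PDF.loads(pdf_file_handle)
assert doc is not None

# write one row per form field
with open("fields.csv", "w", newline="") as csv_file_handle:
    writer = csv.writer(csv_file_handle)
    writer.writerow(["field", "value"])
    for field_name in ["name", "firstname", "country"]:
        writer.writerow([field_name, doc.get_page(0).get_form_field_value(field_name)])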
You can also do more low-level manipulation: borb represents the PDF as a JSON-like data structure (nested arrays, dictionaries and primitives), so you can get at the information relatively easily.
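For instance, if you'd rather enumerate whatever fields the document happens to contain, a rough, untested sketch of that low-level route could look like the following. It assumes the trailer is reachable through borb's dictionary access and relies on the standard PDF keys (/Root, /AcroForm, /Fields, with /T holding a field's name and /V its value):
# rough sketch: walk the AcroForm field dictionaries at the PDF object level
# (assumes borb exposes the trailer as nested dictionaries; untested)
fields = doc["XRef"]["Trailer"]["Root"]["AcroForm"]["Fields"]
for field in fields:
    # /T is the (partial) field name, /V its value, per the PDF specification
    print(field.get("T"), field.get("V"))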
If you want to apply OCR to a PDF, I would recommend yet another example in the examples repository, this time in section 7.2.
import typing
from pathlib import Path

from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit.ocr.ocr_as_optional_content_group import OCRAsOptionalContentGroup

def main():
    # set up everything for OCR
    tesseract_data_dir: Path = Path("/home/joris/Downloads/tessdata-master/")
    assert tesseract_data_dir.exists()
    l: OCRAsOptionalContentGroup = OCRAsOptionalContentGroup(tesseract_data_dir)

    # read the Document, passing the OCR implementation as a listener
    doc: typing.Optional[Document] = None
    with open("output_001.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle, [l])
    assert doc is not None

    # store the Document (the recognized text is added as an optional content group)
    with open("output_002.pdf", "wb") as pdf_file_handle:
        PDF.dumps(pdf_file_handle, doc)

if __name__ == "__main__":
    main()
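After this step the scanned pages carry real (selectable) text, so the usual text-extraction tools work on them. As a sketch, borb's SimpleTextExtraction could then pull the recognized text back out of the OCR'ed file (the exact accessor name may differ slightly between borb versions):
import typing

from borb.pdf import Document
from borb.pdf import PDF
from borb.toolkit.text.simple_text_extraction import SimpleTextExtraction

def main():
    # read the OCR'ed Document, collecting its text as we go
    l: SimpleTextExtraction = SimpleTextExtraction()
    doc: typing.Optional[Document] = None
    with open("output_002.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle, [l])
    assert doc is not None

    # print the text of the first page
    print(l.get_text_for_page(0))

if __name__ == "__main__":
    main()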

You can extract (but also manipulate) form fields with PyMuPDF - whether the PDF is signed or not:
import fitz  # the PyMuPDF package

doc = fitz.open("your.pdf")
for page in doc:  # iterate over pages
    print()
    print(f"Form fields on page {page.number}")
    for field in page.widgets():  # iterate over form fields on the page
        print(f"field type '{field.field_type_string}', value '{field.field_value}'")

Related

Is it possible to import text from an online .txt file to Google Sheets

I am trying to import the text of ads.txt files from certain websites into Google Sheets. I tried IMPORTXML, however it states that the imported XML content cannot be parsed.
Example:
I'm trying to import text from this file --> financhill.com/ads.txt
I'm using this formula: =IMPORTXML("https://financhill.com/ads.txt","/html/body/pre/text()")
The result is N/A with the error Imported XML content can not be parsed.
When I looked at the data from the URL, it seems to be CSV data. I think that this is the reason for your N/A result and the error Imported XML content can not be parsed. In this case, how about using IMPORTDATA as follows?
=IMPORTDATA("https://financhill.com/ads.txt")
or, with an explicit delimiter,
=IMPORTDATA("https://financhill.com/ads.txt",",")
Reference:
IMPORTDATA

how to convert pdf file into xlsx file in ruby on rails

I have uploaded a PDF and want to convert it to an xlsx file. I have tried different ways but am not getting the expected output: pdf2xls only displays a single line, not the whole file's data. I want the whole PDF file's data to appear in the xlsx file.
I have one method to convert the PDF to xlsx, but it does not produce the proper format.
def do_excel_to_pdf
  @user = User.create!(pdf: params[:pdf])
  @path_in = @user.pdf.path
  temp1 = @user.pdf.path
  @path_out = @user.pdf.path.slice(0..@user.pdf.path.rindex(/\//))
  query = "libreoffice --headless --invisible --convert-to pdf " + @path_in + " --outdir " + @path_out
  system(query)
  file = @path_out + @user.pdf.original_filename.slice(0..@user.pdf.original_filename.rindex('.')-1) + ".pdf"
  send_file file, :type => "application/msexcel", :x_sendfile => true
end
If anyone knows a way, please help me - any gem or script.
I would start with reading from the PDF; inserting the data into the XLSX is easy. If you have problems with that part, ask another question and specify which gem you use and what you tried.
You use libreoffice to read the PDF, but according to the FAQ your PDF needs to be a hybrid PDF, so perhaps that is the problem.
As an alternative you could try some conversion tool for ebooks, like the one in Calibre, but I'm afraid you will lose too much formatting to recover the data you need.
It all depends on how the data in your PDF is structured: if it is regular text without much formatting and positioning, it can be as easy as using the gem pdf-reader.
I used it in the past and my data had a lot of formatting - you would be surprised to know how complicated the PDF structure is - so I had to specify, for each field, at exactly which location which data had to be read. Not for the faint of heart.
Here is a simple example.
require 'pdf/reader' # gem install pdf-reader

reader = PDF::Reader.new("my.pdf")
reader.pages.each do |page|
  # puts page.text
  page.page_object.each do |e|
    p e.first.contents
  end
end
I was not able to find options to convert from PDF to xlsx, but API options are available for converting PDF to image and PDF to PowerPoint (link given below).
Not sure if you can change the requirement to show the results in another format.
http://www.convertapi.com/

Parse a string like a CSV file with seek, rewind, position

My application accepts an uploaded file from the user and parses it, making heavy use of the seek and rewind methods to parse blocks from the file (lines can begin with 'start' or 'end' to enclose a section of data, etc.).
A new requirement allows the user to upload encrypted files. I've implemented decryption of the file's content and return the content string to the existing method. I can parse the string as CSV, but I lose the file controls.
Storing an unencrypted version of the file is not an option for business reasons.
I'm using FasterCSV, but I'm not averse to using something else if I can keep the seek/rewind behaviour.
Current code:
FasterCSV.open(path, 'rb') do |csv| # Can I open a string as if it were a file?
  unless csv.eof? # Catch empty files
    # Read, store position, seek, rewind all used during parsing
    position = csv.pos
    row = csv.readline
    csv.seek(position)
  end
end
After some digging and experimentation I found that it was possible to retain the IO methods by using the StringIO class, like so:
csv = StringIO.new(decrypted_content)
unless csv.nil?
  unless csv.eof? # Catch empty files
    position = csv.pos
    row = csv.readline.chomp.split(',')
    csv.seek(position)
  end
end
The only change is needing to manually split the line to be able to use it like a CSV row - not much extra work.
You don't need the CSV gem any more, but if you prefer the seek/rewind behaviour you can roll your own for strings. Something like this might work for your scenario:
array_of_lines = unencrypted_file_string.split("\n")
array_of_lines.each_with_index do |line, index|
  position = index
  row = line
  seek = line[10]
end

How to identify character encoding from website?

What I'm trying to do:
I'm getting a list of URIs from a database and downloading them,
removing the stopwords and counting the frequency with which the words appear on the webpage,
then trying to save the result in MongoDB.
The problem:
When I try to save the result to the database I get the error
bson.errors.InvalidDocument: the document must be a valid utf-8
It appears to be related to codes like '\xc3someotherstrangewords' and '\xe2something'.
When I'm processing the webpages I try to remove the punctuation, but I can't remove the accents, because that would produce wrong words.
What I already tried:
I've tried to identify the character encoding through the header of the webpage.
I've tried to use chardet.
I've used re.compile(r"[^a-zA-Z]") and/or unicode(variable, 'ascii', 'ignore');
those aren't good for non-English languages because they remove the accents.
What I want to know is:
Does anyone know how to identify the characters and translate them to the right word/encoding?
E.g. get '\xe2' from the webpage and translate it to 'â'.
(English isn't my first language so forgive me)
EDIT: if anyone wants to see the source code
It is not easy to find out the correct character encoding of a website because the information in the header might be wrong. BeautifulSoup does a pretty good job at guessing the character encoding and automatically decodes it to Unicode.
from bs4 import BeautifulSoup
import urllib
url = 'http://www.google.de'
fh = urllib.urlopen(url)
html = fh.read()
soup = BeautifulSoup(html)
# text is a Unicode string
text = soup.body.get_text()
# encoded_text is a utf-8 string that you can store in mongo
encoded_text = text.encode('utf-8')
See also the answers to this question.
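If you want to cross-check the guess, a small sketch with the chardet library mentioned in the question (written in Python 2 to match the answer above; chardet.detect guesses the encoding of raw bytes) might look like this:
import urllib

import chardet  # pip install chardet

url = 'http://www.google.de'
raw = urllib.urlopen(url).read()

# detect returns something like {'encoding': 'ISO-8859-1', 'confidence': 0.73}
guess = chardet.detect(raw)
print(guess)

# decode to Unicode using the guessed encoding, then re-encode as utf-8 for mongo
text = raw.decode(guess['encoding'] or 'utf-8', 'replace')
encoded_text = text.encode('utf-8')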

Import CSV from url address and export as XML -- Rails

Two questions:
How can I import a file from a web address, without a form?
Example: Organisation.import(:from => 'http://wufoo.com/report.csv')
How can I use xml builder without pulling from the db?
More Info
My company uses wufoo for web forms. The data from wufoo is exported as csv files. To get the data into my company's cms, it needs to be formatted as xml. I don't need to store any of the data, aside from the url to the csv file. I thought this might work well as a simple rails app.
Use open-uri (http://www.ruby-doc.org/stdlib/libdoc/open-uri/rdoc/) to fetch the file, and ruby's csv library to parse it. Or, use csv-mapper which is nice and simple (http://csv-mapper.rubyforge.org/).
Here is a way:
require 'rio'
require 'fastercsv'

url = 'http://remote-url.com/file.csv'
people = FasterCSV.parse(rio(url).read)
xml = ''
1.upto(people.size-1) do |row_idx|
  xml << "  <record>\n"
  people[0].each_with_index do |column, col_idx|
    xml << "    <#{column.parameterize}>#{people[row_idx][col_idx]}</#{column.parameterize}>\n"
  end
  xml << "  </record>\n"
end
