How to extract only highlight data from pdf file in rails - ruby-on-rails

I want to take out all the highlighted text from a pdf in rails does anyone have any idea I am not able to figure it out.Sample data

You can either use the pdf-reader gem (the example/text.rb example is simple and worked for me): https://github.com/yob/pdf-reader
Or the command-line utility pdftotext.
# Extract all text from a single PDF
require 'rubygems'
require 'pdf/reader'
filename = File.expand_path(File.dirname(__FILE__)) + "/../spec/data/cairo-unicode.pdf"
PDF::Reader.open(filename) do |reader|
reader.pages.each do |page|
puts page.text
end
end
This is a basic text extraction using the gem mentioned above. This should get you a nice head start. You can grab all the text from the document then figure out how to grab those specific sections based on the data you receive back.

Related

how to convert pdf file into xlsx file in ruby on rails

I have uploaded 1 PDF then convert it to xlsx file. I have tried different ways but not getting actual output.pdf2xls only displays single line format not whole file data. I want whole PDF file data to display on xlsx file.
i have one method convert PDF to xlsx but not display proper format.
def do_excel_to_pdf
#user=User.create!(pdf: params[:pdf])
#path_in = #user.pdf.path
temp1 = #user.pdf.path
#path_out = #user.pdf.path.slice(0..#user.pdf.path.rindex(/\//))
query = "libreoffice --headless --invisible --convert-to pdf " + #path_in + " --outdir " + #path_out
system(query)
file = #path_out+#user.pdf.original_filename.slice(0..#user.pdf.original_filename.rindex('.')-1)+".pdf"
send_file file, :type=>"application/msexcel", :x_sendfile=>true
end
if any one use please help me, any gem any script.
I would start with reading from the PDF, inserting the data in the XLSX is easy, if you have problems with that ask another question and specify which gem you use and what you tried for that part.
You use libreoffice to read the PDF but according to the FAQ your PDF needs to be hybrid, perhaps that is the problem.
As an alternative you could try to use some conversion tool for ebooks like the one in Calibre but I'm afraid you will lose too much formatting to recover the data you need.
All depends on how the data in your PDF is structured, if regular text without much formatting and positioning it can be as easy as using the gem pdf-reader
I used it in the past and my data had a lot of formatting - you would be surprised to know how complicated the PDF structure is - so I had to specify for each field at which location exactly which data had to be read, not for the faint of heart.
Here a simple example.
require 'pdf/reader' # gem install pdf-reader
reader = PDF::Reader.new("my.pdf")
reader.pages.each do |page|
# puts page.text
page.page_object.each do |e|
p e.first.contents
end
end
not able to find options to convert from PDF to xsls but API Options available for converting PDF to Image and PDF to powerpoint(Link Given Below)
Not sure u can change the requirement to show results in other formats!!
http://www.convertapi.com/

Inserting external PDF into Prawn generated document

How can I insert an existing PDF into a Prawn generated document? I am generating a pdf for a bill (as a view), and that bill can have many attachments (png, jpg, or pdf). How can I insert/embed/include those external pdf attachments in my generated document? I've read the manual, looked over the source code, and searched online, but no luck so far.
The closest hint I've found is to use ImageMagick or something similar to convert the pdf to another format, but since I don't need to resize/manipulate the document, that seems wasteful. The old way to do it seems to be through templates, but my understanding is that the code for templating is unstable.
Does anyone know how to include PDF pages in a Prawn generated PDF? If Prawn won't do this, do you know of any supplementary gems that will? If someone can point me towards something like prawn-templates but more reliable, that would be awesome.
Edit: I am using prawnto and prawn to render PDF views in Rails 4.2.0 with Ruby 2.2.0.
Strategies that I've found but that seem inapplicable/too messy:
Create a jpg preview of a PDF on upload, include that in the generated document (downsides: no text selection/searching, expensive). This is currently my favorite option, but I don't like it.
prawn-templates (downside: unstable, unmaintained codebase; this is a business-critical application)
Merge PDFs through a gem like 'combine-pdf'–I can't figure out how to make this work for rendering a view with the external PDFs inserted at specific places (the generated pdf is a collection of bills, and I need them to follow the bill they're attached to)
You're right about the lack of existing documentation for this - I found only this issue from 2010 which uses the outdated methods you describe. I also found this SO answer which does not work now since Prawn dropped support for templates.
However, the good news is that there is a way to do what you want with Ruby! What you will be doing is merging the PDFs together, not "inserting" PDFs into the original PDF.
I would recommend this library, combine_pdf, to do so. The documentation is good, so doing what you want would be as simple as:
my_prawn_pdf = CombinePDF.new
my_prawn_pdf << CombinePDF.new("my_bill_pdf.pdf")
my_prawn_pdf << CombinePDF.new("attachment.pdf")
my_prawn_pdf.save "combined.pdf"
Edit
In response to your questions:
I'm using Prawn to render a pdf view in Rails, which means that I don't think I get that kind of post-processing
You do! If you look at the documentation for combine_pdf, you'll see that loading from memory is the fastest way to use the gem - the documentation even explicitly says that Prawn can be used as input.
I'm not just tacking the PDFs to the end: a bill attachment must directly follow the generated page(s) for a bill
The combine_pdf gem isn't just for adding pages on the end. As the documentation shows, you can cycle through a PDF adding pages when you want to, for example:
my_pdf # previously defined
new_pdf = CombinePDF.new
my_pdf.pages.each.do |page|
i += 1
new_pdf << my_pdf if i == bill_number # or however you want to handle this logic
end
new_pdf.save "new_pdf.pdf"

using 'puts' to get information from external domain

ive just started with ruby on rails the other day and i was wandering is it possible to using the puts function to get the content of a div from a page on an external page.
something like puts "http://www.example.com #about"
would something like this work ? or would you have to get the entire page and then puts that section that you wanted ?
additionaly if the content on the "example.com" #about div is constantly changing would puts constantly update its output or would it only run the script each time the page is refreshed ?
The open-uri library (for fetching the page) and the Nokogiri gem (for parsing and retrieving specific content) can assist with this.
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open('http://www.example.com/'))
puts doc.at('#about').text
puts will not work that way. Ruby makes parsing HTML fairly easy though. Take a look at the Nokogirl library, and you can use xpath queries to get to the div you want to print out. I believe you would need to reopen the file if the div changes, but I'm not positive about that - you can easily test it (or someone here can confirm or reject that statement).

Construct URLs after scraping for image paths

I'm trying to scrape a web URL inputed by the user and then output an array of valid non-broken image elements with absolute paths in HTML. I'm using Nokogiri for scraping and I want to know if there is anything I can use to easily process the unpredicatble URLs provided by user and image paths scraped short of figuring out how to write something from scratch.
Examples:
http://domain.com/ and /system/images/image.png
=> http://domain.com/system/images/image.png
http://sub.domain.com and images/common/image.png
=> http://sub.domain.com/images/common/image.png
http://domain.com/dir/ and images/image.png
=> http://domain.com/dir/images/image.png
http://domain.com/dir and /images/small/image.png
=> http://domain.com/images/small/image.png
http://domain.com and http://s3.amazon-aws.com/bucket/image.png
=> http://s3.amazon-aws.com/bucket/image.png
Instead of downloading the pages and using Nokogiri, I would recommend using Mechanize. It is built on top of Nokogiri, so everything you can do with Nokogiri you can do with Mechanize, but it adds a lot of useful functionality for scraping/navigating. It will take care of the relative URL problem you describe above.
require 'rubygems'
require 'mechanize'
url='http://stackoverflow.com/questions/5903218/construct-urls-after-scraping-for-image-paths/5903417'
Mechanize.new.get(url) {|page| puts page.image_urls.join "\n"}
If you really want to do it yourself (instead of using Mechanize, say), use URI::join:
require 'uri'
URI::join("http://domain.com/dir", "/images/small/image.png")
# => http://domain.com/images/small/image.png
Note that you have to respect the HTML page's BASE tag if there is one...

In Ruby on Rails, how can I convert html to word?

how can I convert html to word
thanks.
I have created a Ruby html to word gem that should help you do just that. You can check it out at https://github.com/nickfrandsen/htmltoword - You simply pass it a html string and it will create a corresponding word docx file.
def show
respond_to do |format|
format.docx do
file = Htmltoword::Document.create params[:docx_html_source], "file_name.docx"
send_file file.path, :disposition => "attachment"
end
end
end
Hope you find it helpful.
I am not aware of any solution which does this, i.e. convert HTML to Word format. If you literally mean that, you will have to parse the HTML document first using something like Nokogiri. If you mean you want to output data persisted in your model objects, there is obviously no need to parse HTML! As far as outputting to Word, I'm afraid it looks as if you will have to directly interface with a running instance of Microsoft Word via OLE!
A quick google search for win32ole ruby word will get you started:
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/241606
Good luck!
I agree with CodeJoust that it is better to generate a PDF. However, if you really need to generate a Word document then you can do the following:
If your server is a Windows machine, you can install Office in it and use ruby's OLE binding to generate the Word document into the public folder and then deliver the file in the response.
To use ruby's OLE binding, see the "Programming Ruby" ebook that comes with the one-click ruby installer for Windows. You may have to use custom logic to convert from HTML to Word unless you can find a function in the OLE api of Word to do that.
http://prawn.majesticseacreature.com/
You could allow the user to download a PDF or a .html file, but there aren't any helpful ruby libraries to do that. You're better off generating a 'printable and downloadable' version, without much styling, and/or a pdf version using a library like prawn.
You could always generate a simple .rtf file, I think word'll be pretty happy reading that...

Resources