How to convert PDF files to images using RMagick and Ruby - ruby-on-rails

I'd like to take a PDF file and convert it to images, each PDF page becoming a separate image.
"Convert a .doc or .pdf to an image and display a thumbnail in Ruby?" is a similar post, but it doesn't cover how to make separate images for each page.

Using RMagick itself, you can create images for different pages:
require 'RMagick'
pdf_file_name = "test.pdf"
im = Magick::Image.read(pdf_file_name)
The code above gives you an array, im, with one entry per page. To generate a JPEG image of the fifth page:
im[4].write(pdf_file_name + ".jpg")
But this will load the entire PDF, so it can be slow.
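Alternatively, if you want to create an image of the fifth page without loading the complete PDF file, put the zero-based page index in brackets after the file name:
require 'RMagick'
pdf_file_name = "test.pdf"
# ImageMagick page indexes are zero-based, so [4] selects the fifth page.
im = Magick::Image.read("#{pdf_file_name}[4]")
im[0].write(pdf_file_name + ".jpg")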
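The bracketed page selector can be wrapped in a small helper so that callers think in 1-based page numbers. This is only a sketch: page_spec and write_single_page are hypothetical names, not part of RMagick, and it assumes a current RMagick release that is loaded with require "rmagick" (older releases used require 'RMagick', as above).

```ruby
# Hypothetical helper: build the "file.pdf[index]" string that ImageMagick
# reads as "only this page". Page indexes are zero-based.
def page_spec(pdf_path, page_number)
  raise ArgumentError, "page numbers start at 1" if page_number < 1

  "#{pdf_path}[#{page_number - 1}]"
end

# Hypothetical wrapper: read a single page and write it out as an image.
# RMagick is required lazily so page_spec can be used without the gem.
def write_single_page(pdf_path, page_number, target)
  require "rmagick" # assumption: current RMagick; older releases use 'RMagick'

  page = Magick::Image.read(page_spec(pdf_path, page_number)).first
  page.write(target)
  target
end

puts page_spec("test.pdf", 5) # prints test.pdf[4]
```

With RMagick installed, write_single_page("test.pdf", 5, "test-page5.jpg") would write only the fifth page.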

ImageMagick can do that with PDFs. Presumably RMagick can do it too, but I'm not familiar with it.
The code from the post you linked to:
require 'RMagick'
pdf = Magick::ImageList.new("doc.pdf")
pdf is an ImageList object, which according to the documentation delegates many of its methods to Array. You should be able to iterate over pdf and call write to write the individual images to files.
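Iterating over the list, as suggested above, might look like the following sketch. page_image_name and write_pages are hypothetical helpers, the PNG output format is an arbitrary choice, and require "rmagick" assumes a current RMagick release.

```ruby
# Hypothetical helper: derive "report-page005.png" from "docs/report.pdf" and 5.
def page_image_name(pdf_path, page_number, ext = "png")
  base = File.basename(pdf_path, ".*")
  format("%s-page%03d.%s", base, page_number, ext)
end

# Hypothetical wrapper: write every page of the PDF as its own image file
# and return the list of files written.
def write_pages(pdf_path, out_dir = ".")
  require "rmagick" # assumption: current RMagick; older releases use 'RMagick'

  pages = Magick::ImageList.new(pdf_path) # loads all pages into memory
  pages.each_with_index.map do |page, index|
    target = File.join(out_dir, page_image_name(pdf_path, index + 1))
    page.write(target)
    target
  end
end

puts page_image_name("doc.pdf", 1) # prints doc-page001.png
```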

Since I can't find a way to deal with PDFs on a per-page basis in RMagick, I'd recommend first splitting the PDF into pages with pdftk's burst command, then dealing with the individual pages in RMagick. This is probably less performant than an all-in-one solution, but unfortunately no all-in-one solution presents itself.
There's also PDF::Toolkit for Ruby that hooks into pdftk but I've never used it.
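Shelling out to pdftk for the burst step could be sketched as below. It assumes the pdftk binary is on the PATH; burst_command and burst_pdf are hypothetical helper names, and the page_%03d.pdf output pattern is just an example.

```ruby
# Hypothetical helper: argument list for "pdftk in.pdf burst output page_%03d.pdf".
# pdftk expands the printf-style pattern once per page.
def burst_command(pdf_path, pattern = "page_%03d.pdf")
  ["pdftk", pdf_path, "burst", "output", pattern]
end

# Hypothetical wrapper: run the command without going through a shell,
# so file names containing spaces need no quoting.
def burst_pdf(pdf_path, pattern = "page_%03d.pdf")
  system(*burst_command(pdf_path, pattern)) or
    raise "pdftk burst failed for #{pdf_path}"
end

puts burst_command("test.pdf").join(" ")
```

Each resulting single-page PDF can then be read by RMagick as usual.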

Related

What is the best way to generate PDF/HTML/DOCX in Ruby/Rails

I need to create an app that auto-generates CVs from fields. I need to output them as PDF, HTML and DOC, but there are many gems available.
Which gem do you think is the most appropriate for producing a CV in PDF, HTML and DOC formats?
I found Prawn for PDF, but is it the most appropriate for making a CV-like PDF?
Thank you in advance.
EDIT: I found a gem similar to Prawn, but for DOCX, that may interest you:
https://github.com/trade-informatics/caracal/
For authoring PDF files, I would go with Prawn over Wicked PDF.
I know Wicked PDF is more convenient, but Prawn is both native and more flexible.
Wicked PDF depends on wkhtmltopdf and uses system calls, which can cause concurrency issues and are expensive as far as performance goes.
Whichever one you're using, I would probably add CombinePDF into the mix.
(I'm biased - I'm the author of the gem)
This will allow you to use template PDF files - so you can write your CV using Prawn or Wicked and throw it over a beautifully designed template using the CombinePDF gem.
CombinePDF could also be used to create simple text objects, number pages or write tables...
require 'combine_pdf'
pdf = CombinePDF.new
pdf.new_page.textbox "Hello World!", font_size: 18, opacity: 0.75
pdf.save 'test.pdf' # use pdf.to_pdf to render the PDF into a String object.
...but I think Prawn is a better tool for authoring, while CombinePDF is a great complementing library that can add pre-made content and design:
require 'combine_pdf'
# get a "stamp" or page template.
# the secure copy will attempt to change the PDF data to avoid name conflicts.
template = CombinePDF.load(template_file_path).pages[0].copy(true)
# move the Prawn document into CombinePDF
pdf = CombinePDF.parse prawn_document.render
# add the template to each page, putting the template on the bottom
# (#>> for bottom vs. #<< for top)
pdf.pages.each {|page| page >> template}
# save the pdf file
pdf.save 'final.pdf' # or render to string, for sending over HTTP: pdf.to_pdf
As for docx, I am clueless... frankly, it's a proprietary format that's usually buggy when reverse-engineered. I would avoid it and let people copy and paste off the HTML if they really wanted.
I always use Prawn for PDF and Caracal for DOCX.
I believe the creator of Caracal was inspired by the ease of use of Prawn.
I use those gems at work, and they are pretty fine, if it can help.
PDF: https://github.com/pdfcrowd/pdfcrowd-ruby
DOCX: https://github.com/chrahunt/docx
For the html, a simple render could work.
You can have a look at the Ruby Toolbox for available PDF generation gems. I recommend Prawn: I have used it myself, and generating a PDF takes very little code.
For example:
require "prawn"
Prawn::Document.generate("hello.pdf") do
  text "Hello World!"
end
Use the Ruby Toolbox to find other gems for docx/html as well :)

Is there a gem or a quick way to render some prawn code as an image?

I'm looking for a way to re-use my prawn code in other views (HAML). It can also be an image, but I don't need the whole PDF (with the layout and other pages), just a small graphic (chart) that I render in Prawn using a method.
EDIT:
Because I'm using Heroku and I'm not saving the file (it's a response to a web request), I don't really want to open a PDF file with ImageMagick, for example, and process it into an image, unless it is my last option (besides coding the graphic again in HTML+CSS).
I don't know of a gem that might do the trick, but you can always use RMagick or MiniMagick to convert the generated PDF to an image and scale it down:
require 'RMagick'
pdf = Magick::ImageList.new("doc.pdf")
image = pdf.first.scale(300, 300) # scale the first page explicitly
image.write "doc.png"
Hopefully you'll find this useful.

How to edit or write on existing PDF with Ruby?

I have a couple of PDF template files with complex content and several blank regions/areas in them. I need to be able to write text in those blank regions and save the resulting PDFs in a folder.
I googled for answers on this question quite intensively, but I didn't find definite answers. One of the better solutions is PDF::Toolkit, but it would require the purchase of Adobe Acrobat to add replaceable attributes to existing PDF documents.
The PHP world is blessed with FPDI that can be used to simply open a PDF file and write/draw on it over the existing content. There is a Ruby port of this library, but the last commit for it happened at the beginning of 2009. Also that project doesn't look like it is widely used and supported.
The question is: What is the better Ruby way of editing, writing or drawing on existing PDFs?
This question also doesn't seem to be answered on here. These questions are related, but not really the same:
Prawn gem: How to create the .pdf from an *existing* file (.xls)
watermark existing pdf with ruby
Ruby library for manipulating existing PDF
How to replace a word in an existing PDF using Ruby Prawn?
You should definitely check out the Prawn gem, which lets you generate any custom PDF file. You can use Prawn to write text into existing PDFs by treating the existing PDF as a template for your new Prawn document.
For example:
filename = "#{Prawn::DATADIR}/pdfs/multipage_template.pdf"
Prawn::Document.generate("full_template.pdf", :template => filename) do
  text "This content is written on the first page of the template", :align => :center
end
This will write text onto the first page of the old PDF.
See more here:
http://prawn.majesticseacreature.com/manual.pdf
Since Prawn has removed the template feature (it was full of bugs) the easiest way I've found is the following:
Use Prawn to generate a PDF with ONLY the dynamic parts you want to add.
Use PDF::Toolkit (which wraps PDFtk) to combine the Prawn PDF with the original.
Rough Example:
require 'prawn'
require 'pdf/toolkit'
template_filename = 'some/dir/Awesome-Graphics.pdf'
prawn_filename = 'temp.pdf'
output_filename = 'output.pdf'
Prawn::Document.generate(prawn_filename) do
  # Generate whatever you want here.
  text_box "This is some new text!", :at => [100, 300]
end
PDF::Toolkit.pdftk(prawn_filename, "background", template_filename, "output", output_filename)
I recommend prawn for generating PDFs and then using combine_pdf to combine two generated PDFs together into one. I use it like this and it works just fine.
Short example (from the README) of how to combine two PDFs:
company_logo = CombinePDF.load("company_logo.pdf").pages[0]
pdf = CombinePDF.load "content_file.pdf"
pdf.pages.each { |page| page << company_logo } # notice the << operator is on a page and not a PDF object.
pdf.save "content_with_logo.pdf"
You don't need to use a combination of gems; you can use just one gem!
Working with PDFs is really challenging in Ruby/Rails (so I have found out!)
This is the way I was able to add text dynamically to a PDF in Rails.
Add this gem to your Gemfile: gem 'combine_pdf'
and then you can use code like this:
# get the record from the database to add dynamically to the pdf
user = User.last
# get the existing pdf
pdf = CombinePDF.load "#{Rails.root}/public/pdf/existing_pdf.pdf"
# create a textbox and add it to the existing pdf on page 2
pdf.pages[1].textbox "#{user.first_name} #{user.last_name}", height: 20, width: 70, y: 596, x: 72
# output the new pdf which now contains your dynamic data
# (the timestamp is formatted so the file name contains no spaces or colons)
pdf.save "#{Rails.root}/public/pdf/output#{Time.now.strftime('%Y%m%d%H%M%S')}.pdf"
You can find details of the textbox method here: https://www.rubydoc.info/gems/combine_pdf/0.2.5/CombinePDF/Page_Methods#textbox-instance_method
I spent days on this, working through a number of different gems: prawn, wicked_pdf, pdfkit, fillable_pdf.
But this was by far the smoothest solution for me as of 2019.
I hope this saves someone a lot of time so they don't have to go through all the trial and error I had to with PDFs!
The best I can think of is Rails-latex. It doesn't allow you to edit existing PDF files, but it lets you set up *.tex.erb templates that you can modify dynamically and compile into PDF (along with DVI and a number of other formats).
PDFLib seems to do the thing you want and has ruby bindings.
According to my research, Prawn is one of the best free gems available. However, the template functionality doesn't work in later versions: the latest version I could find that works with templates is 1.0.0.rc2 (March 1, 2013). Keep this in mind if you are using a later version. See the thread below for more info.
https://groups.google.com/forum/#!searchin/prawn-ruby/prawn$20templates/prawn-ruby/RYGPImNcR0I/7mxtnrEDHeQJ
PDFtk is another capable tool for PDF manipulation and working with templates, but note the following points:
This library is free for personal use, but requires a license if used in production
This is a non-Ruby command-line tool
For more information, please refer to the link below:
http://adamalbrecht.com/2014/01/31/pre-filling-pdf-form-templates-in-ruby-on-rails-with-pdftk/
You can use Origami gem to add a password to the existing pdf or edit it.
require 'origami'
require 'tempfile'
pdf = WickedPdf.new.pdf_from_url(pdf_params[:url])
temp_file = Tempfile.new('temp', encoding: 'ascii-8bit')
temp_file.write(pdf)
temp_file.flush # make sure the bytes are on disk before Origami reads the path
# Creates an encrypted document with AES256 and passwords.
pdf = Origami::PDF.read(temp_file.path).encrypt(cipher: 'aes', key_size: 256, user_passwd: pdf_params[:user_password], owner_passwd: pdf_params[:owner_password])
save_path = "#{File.basename(__FILE__, ".rb")}.pdf"
pdf.save(save_path)
temp_file.close

Watermark in existing PDF in Ruby

I would like to add a dynamically generated text. Is there a way to watermark an existing PDF in Ruby?
This will do it:
PDF::Reader to count the number of pages in the file.
Prawn to create a new PDF document using each page of the input pdf as a template.
require 'prawn'
require 'pdf-reader'
input_filename = 'input.pdf'
output_filename = 'output.pdf'
page_count = PDF::Reader.new(input_filename).page_count
Prawn::Document.generate(output_filename, :skip_page_creation => true) do |pdf|
  page_count.times do |num|
    pdf.start_new_page(:template => input_filename, :template_page => num+1)
    pdf.text('WATERMARK')
  end
end
However, in my testing the output file size was huge with the latest gem version of Prawn (0.12), but after pointing my Gemfile at the master branch on GitHub, all worked fine.
Another option is to use PDFtk. It can be used to add a watermark and create a new PDF. Maybe Prawn will do the same thing with its templating.
pdftk in.pdf background arecibo.pdf output wmark1.pdf
Some more info: http://rhodesmill.org/brandon/2009/pdf-watermark-margins/
There is a ruby wrapper gem called active_pdftk which supports backgrounding, so you don't have to do the shell commands yourself.
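If you would rather not add a wrapper gem, the same call can be made directly from Ruby. A minimal sketch, assuming pdftk is installed and on the PATH; background_command and watermark_pdf are hypothetical names.

```ruby
require "open3"

# Hypothetical helper: argument list for
# "pdftk in.pdf background stamp.pdf output out.pdf".
def background_command(input_pdf, stamp_pdf, output_pdf)
  ["pdftk", input_pdf, "background", stamp_pdf, "output", output_pdf]
end

# Hypothetical wrapper: run pdftk without a shell and surface its
# combined stdout/stderr if it exits with a non-zero status.
def watermark_pdf(input_pdf, stamp_pdf, output_pdf)
  output, status = Open3.capture2e(*background_command(input_pdf, stamp_pdf, output_pdf))
  raise "pdftk failed: #{output}" unless status.success?

  output_pdf
end

puts background_command("in.pdf", "arecibo.pdf", "wmark1.pdf").join(" ")
```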
Prawn doesn't support templates anymore...
Try the combine_pdf gem.
You can combine, watermark, page-number and add simple text to existing PDF files (including the creation of a simple table of contents without PDF links).
It's a very simple and basic PDF library, written in pure Ruby with no dependencies.
This example can fit your situation (it's from the readme file):
# load the logo as a pdf page
company_logo = CombinePDF.load("company_logo.pdf").pages[0]
# load the content file
pdf = CombinePDF.load "content_file.pdf"
# inject the logo to each page in the content
pdf.pages.each {|page| page << company_logo}
# save to a new file, with the logo.
pdf.save "content_with_logo.pdf"
Check out Prawn (http://github.com/sandal/prawn) for plain Ruby and Prawnto (http://github.com/thorny-sun/prawnto) for Ruby on Rails.
You are probably going to want to either use the image embedding functionality or the background image functionality.
Here's an example of using a background image http://cracklabs.com/prawnto/code/prawn_demos/source/general/background
Use Ruport.
1st result for Googling ruby pdf watermark.

Ruby: Reading PDF files

I'm looking for a fast and reliable way to read/parse large PDF files in Ruby (on Linux and OSX).
Until now I've found the rather old and simple PDF-toolkit (a pdftotext wrapper) and PDF-reader, which was unable to read most of my files, even though both libraries provide exactly the functionality I was looking for.
My question: Have I missed something? Is there a tool that is better suited (faster and more reliable) to solve my problem?
You might find Docsplit useful:
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
After trying different methods, I'm using PDF-Toolkit now. It's quite old, but it's fast, stable and reliable. Besides, it really doesn't need to be new, because it just wraps the xpdf commandline utilities.
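Calling pdftotext directly is also straightforward if you do not want the wrapper. The sketch below assumes the xpdf (or Poppler) pdftotext binary is on the PATH; pdftotext_command and extract_text are hypothetical names. The -f and -l page-range flags let you process a large file in slices.

```ruby
require "open3"

# Hypothetical helper: build the pdftotext argument list.
# -f / -l select the first and last page; "-" sends the text to stdout.
def pdftotext_command(pdf_path, first_page: nil, last_page: nil)
  cmd = ["pdftotext"]
  cmd += ["-f", first_page.to_s] if first_page
  cmd += ["-l", last_page.to_s] if last_page
  cmd + [pdf_path, "-"]
end

# Hypothetical wrapper: return the extracted text as a String.
def extract_text(pdf_path, **range)
  text, status = Open3.capture2(*pdftotext_command(pdf_path, **range))
  raise "pdftotext failed for #{pdf_path}" unless status.success?

  text
end

puts pdftotext_command("big.pdf", first_page: 1, last_page: 10).join(" ")
```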
You could use JRuby and a Java PDF parsing library such as Apache PDFBox (https://www.ohloh.net/p/pdfbox). See also http://java-source.net/open-source/pdf-libraries.
Did you have a look at the CombinePDF library?
It's a pure Ruby solution that allows some PDF manipulation, such as extracting pages, overlaying one PDF page over another, page numbering, writing basic text and tables, etc.
Here's an example of stamping an existing PDF file with a logo. The example reads a PDF file, extracts one page to use as a stamp and stamps another PDF file.
require 'combine_pdf'
company_logo = CombinePDF.load("company_logo.pdf").pages[0]
pdf = CombinePDF.load "content_file.pdf"
pdf.pages.each {|page| page << company_logo}
pdf.save "content_with_logo.pdf"
You can also stamp text, number pages or write simple tables:
require 'combine_pdf'
pdf = CombinePDF.load "content_file.pdf"
pdf.number_pages #adds page numbers. you can add formatting and placement options.
pdf.pages.each {|page| page.textbox "One Way To Stamp"}
# you can use a shortcut method to stamp pages
pdf.stamp_pages "Another way to stamp"
#you can use the shortcut method for both text and PDF stamps
company_logo = CombinePDF.load("company_logo.pdf").pages[0]
pdf.stamp_pages company_logo
# you can write simple tables
pdf.pages[0].write_table headers: ['first name', 'surname'], table_data: [['John', 'Doe'], ['Mr.', 'Smith']]
pdf.save "content_with_logo.pdf"
It's not meant for complex operations, but it complements most PDF authoring libraries and allows you to use PDF templates instead of writing the whole thing from scratch.
Here are some options:
http://en.wikipedia.org/wiki/List_of_PDF_software
From that link, and searching sourceforge, there's a couple of command line utilities that might do what you want, like this one: http://pdftohtml.sourceforge.net/
Depending on your requirements and what the PDFs look like, you could look at using the Google Docs API (uploading the PDF and then downloading it as text), or could also try something like gocr. I've had a lot of luck parsing image text with gocr in the past, and you'd just have to bounce out to the shell to do it, like gocr -i whatever.pdf (I think it works with PDFs).
The downside to all of these is that they're not pure-Ruby implementations, but lots of the good (and free) OCR projects seem to be done that way.
If you just need to get the text content out of a PDF file, pdftohtml on SourceForge is efficient.
It is not suited for dealing with images.

Resources