Combine multi-page PDFs into one PDF with ImageMagick - imagemagick

I am trying to use ImageMagick (6.8.0) to combine several multi-page PDFs into a single PDF. This command:
$ convert multi-page-1.pdf multi-page-2.pdf merged.pdf
Returns merged.pdf, which contains the first page of multi-page-1.pdf and the first page of multi-page-2.pdf.
This command:
$ convert multi-page-1.pdf[2] multi-page-2.pdf[2] merged.pdf
Returns merged.pdf, which contains the third page of multi-page-1.pdf and the third page of multi-page-2.pdf.
I would like merged.pdf to contain all of the pages of each multi-page PDF. So far I have not found a way of telling the convert command to use a range of pages, although I have tried adding [0-1] and [0,1] at the end of the filenames.
Interestingly, this ghostscript command (which I found via StackOverflow but cannot re-find) does work as I would like it to:
$ gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -sOutputFile=merged.pdf multi-page-1.pdf multi-page-2.pdf
The problem is that the ImageMagick 'convert' command takes URLs as inputs and ghostscript does not, and I need my program to take URL input rather than file paths.
Is it possible to get the result of the above ghostscript command using ImageMagick?

Why don't you use pdfunite?
Example:
$ pdfunite 1.pdf 2.pdf 3.pdf merged.pdf

I asked this question on an internal company forum, and the conclusion was that there is no way to do the type of document merging we would like to do with ImageMagick without first downloading the file to the local filesystem.
For those of you using Heroku, we are taking advantage of the Heroku 'tmp' directory in order to save the file "locally" on staging and production: https://devcenter.heroku.com/articles/read-only-filesystem
Once we save the file in 'tmp', we will iterate through each page of the pdf and save them all separately. We will find the number of PDF pages using the 'pdf-reader' gem.
EDIT:
Here is the custom paperclip processor I wrote to deal with this (all files are pulled down to the tmp directory beforehand):
https://gist.github.com/jessieay/5832466
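For readers who only need the merge itself, the download-then-merge idea can be sketched as follows. This is a minimal sketch, not the processor from the gist: it assumes Ghostscript is installed as gs, it uses the system temp directory rather than Heroku's 'tmp', and the helper names (gs_merge_command, download_all, merge_pdf_urls) are hypothetical.

```ruby
require "open-uri"
require "tmpdir"

# Build the Ghostscript argument list from the question for the given inputs.
def gs_merge_command(input_paths, output_path)
  ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
   "-sOutputFile=#{output_path}", *input_paths]
end

# Download each URL into `dir` and return the local paths (hypothetical helper).
def download_all(urls, dir)
  urls.each_with_index.map do |url, i|
    path = File.join(dir, "input-#{i}.pdf")
    URI.open(url) { |remote| File.binwrite(path, remote.read) }
    path
  end
end

# Merge the PDFs behind `urls` into `output_path`.
# Returns true when gs exits successfully.
def merge_pdf_urls(urls, output_path)
  Dir.mktmpdir do |dir|
    system(*gs_merge_command(download_all(urls, dir), output_path))
  end
end
```

Because the inputs are passed to gs in order, the merged file keeps every page of each source PDF, in the order the URLs were given.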

Related

Why is Jupyter not able to download as PDF a markdown cell using LaTeX \mathscr?

I just created a markdown cell in Jupyter containing some equations, some of which use \mathscr to get script-style "math" fonts. When I run the kernel containing the equations everything is OK; however, when I click the option to Download as PDF via LaTeX, I get the error below:
! Undefined control sequence.
l.300 [\mathscr
{L}({\bf{y}}|\beta, \sigma^2, {\bf{X}}) = (2\pi\sigma^2)^{-...
?
! Emergency stop.
l.300 [\mathscr
{L}({\bf{y}}|\beta, \sigma^2, {\bf{X}}) = (2\pi\sigma^2)^{-...
If I remove the \mathscr part, everything can be exported with no issues (except for some conversion problems with special characters); however, I would like to know how to solve it. I've been reading, and it looks like the nbconvert configuration file can be modified to solve this, but I couldn't find that file or the exact way to modify it.
Thanks for your help
I think the problem is a missing \usepackage{mathrsfs} directive in the intermediate .tex file.
So you have several ways to overcome it.
If you face this problem only occasionally, you could do the following:
Download the .tex file instead of the PDF.
Manually insert \usepackage{mathrsfs} into it, for example before the first \usepackage.
Run something like xelatex file.tex to convert it to PDF.
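The manual edit in the steps above could also be scripted. The following is only a sketch: the helper name add_mathrsfs is hypothetical, and it assumes the downloaded .tex file contains at least one \usepackage line at the start of a line.

```ruby
# Insert \usepackage{mathrsfs} before the first \usepackage line of a LaTeX source.
# Returns the source unchanged if the package is already loaded.
def add_mathrsfs(tex_source)
  return tex_source if tex_source.include?('\usepackage{mathrsfs}')

  # String#sub replaces only the first match; the block form avoids
  # backslash-escape handling in the replacement string.
  tex_source.sub(/^[ \t]*\\usepackage/) do |match|
    "\\usepackage{mathrsfs}\n#{match}"
  end
end
```

A possible use is File.write("file.tex", add_mathrsfs(File.read("file.tex"))), followed by running xelatex file.tex as described.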
If you do this often, you could edit the appropriate Jinja template.
First, find where nbconvert is installed, for example with pip: pip show nbconvert. Suppose the path is /home/i/.local/lib/python3.5/site-packages
Then the template would be at /home/i/.local/lib/python3.5/site-packages/nbconvert/templates/latex/base.tplx.
And again: just add \usepackage{mathrsfs} right after ((* block packages *)).
Voila, the problem should be gone.
Finally, there is a third option: you can create your own template from scratch and use it with nbconvert. I don't think it's a very convenient way to solve your problem. You can read more in the documentation: http://nbconvert.readthedocs.io/en/latest/customizing.html

ImageMagick: Convert selective images to Multipage PDF without using external text file?

Let's say I have a directory with 5 TIFF files in it and I want to convert some of them to a multipage PDF, but there are other TIFFs in the same directory that I do not want in that PDF.
In other words, I want to convert file1.TIF, file2.TIF, file3.TIF --> foo.pdf, but I want to ignore file4.TIF and file5.TIF located in the same folder.
It would seem from the documentation that the only way to do this is to provide ImageMagick with a text file listing out the files and then point to it when calling the program, as in:
convert @FilesToConvert.txt C:\foo3.pdf
Is there no way to make the call inline though, so that I don't have to create a separate text file for each conversion?
Thanks in advance!
You should be able to use:
convert file1.TIF file2.TIF file3.TIF foo.pdf
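If the conversion is driven from a program, the same inline call can be built from a list of file names. This is a minimal sketch assuming ImageMagick's convert is on the PATH; the helper names convert_command and convert_to_pdf are hypothetical.

```ruby
# Build a convert invocation for an explicit list of input images.
def convert_command(inputs, output)
  raise ArgumentError, "no input files given" if inputs.empty?

  ["convert", *inputs, output]
end

# Run the command without a shell, so file names containing spaces
# need no extra quoting. Returns true when convert exits successfully.
def convert_to_pdf(inputs, output)
  system(*convert_command(inputs, output))
end
```

For the example in the question, convert_to_pdf(["file1.TIF", "file2.TIF", "file3.TIF"], "foo.pdf") would leave file4.TIF and file5.TIF untouched, with no list file needed.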

Ruby PDF::Toolkit using pdftotext

I'm converting pdf files in my Ruby project. I'm using the pdf toolkit gem for this.
The documentation shows how you can use pdftotext
pdftotext(file,outfile = nil,&block)
In my project I am converting a PDF file without any options and can just do this:
PDF::Toolkit.pdftotext("file.pdf", "file.txt")
If I run it from the command line, I can preserve the layout by passing the -layout option:
pdftotext -layout file.pdf
What is the correct syntax to achieve this with PDF::Toolkit?
Thanks!
Figured out how to make it work so I'm answering my own question, but if there's a "proper way" to do this, I'd love to see how to do it.
Put the options in the second argument and the text file will be named file_name.txt
PDF::Toolkit.pdftotext("file_name.pdf","-layout" )
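An alternative that avoids relying on how PDF::Toolkit treats its second argument is to call the pdftotext binary directly. The sketch below assumes pdftotext is installed and on the PATH; the helper names pdftotext_command and pdf_to_text are hypothetical and are not part of PDF::Toolkit.

```ruby
require "open3"

# Build the pdftotext argument list; options go before the input file.
def pdftotext_command(pdf_path, txt_path = nil, layout: false)
  cmd = ["pdftotext"]
  cmd << "-layout" if layout
  cmd << pdf_path
  cmd << txt_path if txt_path
  cmd
end

# Run pdftotext and raise with its stderr output if it fails.
def pdf_to_text(pdf_path, txt_path = nil, layout: false)
  command = pdftotext_command(pdf_path, txt_path, layout: layout)
  _stdout, stderr, status = Open3.capture3(*command)
  raise "pdftotext failed: #{stderr}" unless status.success?

  true
end
```

Calling pdf_to_text("file.pdf", layout: true) corresponds to the command line pdftotext -layout file.pdf shown in the question.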

How to convert PDF files to images using RMagick and Ruby

I'd like to take a PDF file and convert it to images, each PDF page becoming a separate image.
"Convert a .doc or .pdf to an image and display a thumbnail in Ruby?" is a similar post, but it doesn't cover how to make separate images for each page.
Using RMagick itself, you can create images for different pages:
require 'RMagick'
pdf_file_name = "test.pdf"
im = Magick::Image.read(pdf_file_name)
The code above gives you an array, im, with one entry per page. Do this if you want to generate a JPEG image of the fifth page (indexes are zero-based):
im[4].write(pdf_file_name + ".jpg")
But this will load the entire PDF, so it can be slow.
Alternatively, if you want to create an image of the fifth page and don't want to load the complete PDF file:
require 'RMagick'
pdf_file_name = "test.pdf"
# Page indexes are zero-based, so [4] selects the fifth page.
im = Magick::Image.read("#{pdf_file_name}[4]")
im[0].write(pdf_file_name + ".jpg")
ImageMagick can do that with PDFs. Presumably RMagick can do it too, but I'm not familiar with it.
The code from the post you linked to:
require 'RMagick'
pdf = Magick::ImageList.new("doc.pdf")
pdf is an ImageList object, which according to the documentation delegates many of its methods to Array. You should be able to iterate over pdf and call write to write the individual images to files.
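The iteration described above could look like the following sketch. The helper write_pages is hypothetical; it accepts any collection whose elements respond to write(path), which is assumed to include a Magick::ImageList, so the helper itself does not need RMagick loaded.

```ruby
# Write every page to its own numbered file and return the paths written.
# `pages` is any enumerable of objects responding to `write(path)`.
def write_pages(pages, basename, ext = "jpg")
  pages.each_with_index.map do |page, index|
    # Number files from 1 and zero-pad so they sort in page order.
    path = format("%s-%03d.%s", basename, index + 1, ext)
    page.write(path)
    path
  end
end
```

With RMagick installed, write_pages(Magick::ImageList.new("doc.pdf"), "doc") would be expected to produce doc-001.jpg, doc-002.jpg, and so on, one per page.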
Since I can't find a way to deal with PDFs on a per-page basis in RMagick, I'd recommend first splitting the PDF into pages with pdftk's burst command, then dealing with the individual pages in RMagick. This is probably less performant than an all-in-one solution, but unfortunately no all-in-one solution presents itself.
There's also PDF::Toolkit for Ruby that hooks into pdftk but I've never used it.

Ruby: Reading PDF files

I'm looking for a fast and reliable way to read/parse large PDF files in Ruby (on Linux and OSX).
So far I've found the rather old and simple PDF-toolkit (a pdftotext wrapper) and PDF-reader, which was unable to read most of my files, although the two libraries provide exactly the functionality I was looking for.
My question: Have I missed something? Is there a tool that is better suited (faster and more reliable) to solve my problem?
You might find Docsplit useful:
Docsplit is a command-line utility and Ruby library for splitting apart documents into their component parts: searchable UTF-8 plain text, page images or thumbnails in any format, PDFs, single pages, and document metadata (title, author, number of pages...)
After trying different methods, I'm using PDF-Toolkit now. It's quite old, but it's fast, stable and reliable. Besides, it really doesn't need to be new, because it just wraps the xpdf commandline utilities.
You could use JRuby and a Java PDF parsing library such as Apache PDFBox (https://www.ohloh.net/p/pdfbox). See also http://java-source.net/open-source/pdf-libraries.
Did you have a look at the CombinePDF library?
It's a pure Ruby solution that allows some PDF manipulation, such as extracting pages, overlaying one PDF page over another, page numbering, writing basic text and tables, etc.
Here's an example of stamping an existing PDF file with a logo. The example reads a PDF file, extracts one page to use as a stamp, and stamps another PDF file.
require 'combine_pdf'
company_logo = CombinePDF.load("company_logo.pdf").pages[0]
pdf = CombinePDF.load "content_file.pdf"
pdf.pages.each {|page| page << company_logo}
pdf.save "content_with_logo.pdf"
You can also stamp text, number pages, or write simple tables:
require 'combine_pdf'
pdf = CombinePDF.load "content_file.pdf"
pdf.number_pages # adds page numbers; you can add formatting and placement options
pdf.pages.each {|page| page.textbox "One Way To Stamp"}
# you can use a shortcut method to stamp pages
pdf.stamp_pages "Another way to stamp"
# you can use the shortcut method for both text and PDF stamps
company_logo = CombinePDF.load("company_logo.pdf").pages[0]
pdf.stamp_pages company_logo
# you can write simple tables
pdf.pages[0].write_table headers: ['first name', 'surname'], table_data: [['John', 'Doe'], ['Mr.', 'Smith']]
pdf.save "content_with_logo.pdf"
It's not meant for complex operations, but it complements most PDF authoring libraries and allows you to use PDF templates instead of writing the whole thing from scratch.
Here are some options:
http://en.wikipedia.org/wiki/List_of_PDF_software
From that link, and searching sourceforge, there's a couple of command line utilities that might do what you want, like this one: http://pdftohtml.sourceforge.net/
Depending on your requirements and what the PDFs look like, you could look at using the Google Docs API (uploading the PDF and then downloading it as text), or could also try something like gocr. I've had a lot of luck parsing image text with gocr in the past, and you'd just have to bounce out to the shell to do it, like gocr -i whatever.pdf (I think it works with PDFs).
The downside to all of these is that they're not pure-Ruby implementations, but lots of the good (and free) OCR projects seem to be done that way.
If you just need to get the text content out of a PDF file, pdftohtml at SourceForge is efficient.
It is not suited for dealing with images.

Resources