iText or PDFBox for PDF snapshot?

If I want to export a specific region of a PDF page to a high-resolution image, which software would work?
I searched the iText code. Although it excels at creating and manipulating PDFs, there don't seem to be any export or snapshot options. Would ImageMagick or Apache PDFBox be better for the job?
Feng

Related

imagemagick splitting large pdf into png's

I have a PDF I'd like to split into individual pictures, one picture per page. I am using the following ImageMagick command to do so:
convert -density 400 mypdf.pdf out.png
It works fine, but when I tested it on the first 5 pages of my PDF it took 10 seconds. At this rate it should take about half an hour to split the whole PDF, which seems strange to me considering that I'm not doing anything fancy: I'm not rotating the images or modifying them in any way. Is there a faster way to do this? Thanks.
Also, I'd like to preserve the quality. I tried it before without the density flag, but the quality dropped dramatically.

PDF rendering is a bit of a mess.
The best system is probably Ghostscript, and MuPDF, its library form. It's extremely fast and scales well to large documents. Unfortunately the library licensing (AGPL) is difficult and you can't really link directly to the binary.
ImageMagick gets around this restriction by shelling out to the ghostscript command-line tool, but of course that means that rendering a page of a PDF is now a many-stage process: the PDF is copied to /tmp, ghostscript is executed with a set of command-line flags to render the document out to an image file in /tmp, this temporary image file is read back in again, a page is extracted and finally the image is written to the output PNG.
On my laptop I see:
$ time convert -density 400 nipguide.pdf[8] x.png
real 0m2.598s
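To make the stages above concrete, here is a minimal Python sketch that builds the kind of Ghostscript command line that could render one page directly, skipping the ImageMagick round trip. The helper names (ghostscript_argv, render_page) and the default executable name gs are assumptions for illustration. The flags are standard Ghostscript options, but this is not the exact set ImageMagick passes, which this answer does not show. Note that ImageMagick's [8] is a zero-based index, while Ghostscript's -dFirstPage counts from 1.

```python
import subprocess


def ghostscript_argv(pdf_path, out_path, dpi=400, page_index=0, gs="gs"):
    """Build a Ghostscript command that renders one page to a 24-bit PNG.

    page_index is zero-based (like ImageMagick's file.pdf[8]);
    Ghostscript's -dFirstPage/-dLastPage are one-based.
    """
    page = page_index + 1
    return [
        gs,                          # assumed executable name; differs on Windows
        "-dBATCH", "-dNOPAUSE", "-dSAFER", "-q",
        "-sDEVICE=png16m",           # 24-bit RGB PNG output device
        f"-r{dpi}",                  # resolution in dpi
        f"-dFirstPage={page}",
        f"-dLastPage={page}",
        f"-sOutputFile={out_path}",
        pdf_path,
    ]


def render_page(pdf_path, out_path, dpi=400, page_index=0):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(ghostscript_argv(pdf_path, out_path, dpi, page_index), check=True)


if __name__ == "__main__":
    # Print the command instead of running it, so nothing needs to be installed.
    print(" ".join(ghostscript_argv("nipguide.pdf", "x.png", dpi=400, page_index=8)))
```

Calling the tool once per page still pays process start-up cost each time, so for a whole document it may be worth rendering all pages in a single invocation.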
The other popular PDF renderer is poppler. This came out of the xpdf document previewer project, so it's fast, but is only really happy rendering to RGB. It can struggle on large documents too, and it's GPL, so you can't link to it without also becoming GPL.
libvips links directly to poppler-glib for PDF rendering, so you save some copies. I see:
$ time vips copy nipguide.pdf[page=8,dpi=400] x.png
real 0m0.904s
Finally, there's PDFium. This is the PDF render library from Chrome -- it's the old Foxit PDF previewer, rather crudely cut out and made into a library. It's a little slower than poppler, but it has a very generous license, which means you can use it in situations where poppler would just not work.
There's an experimental libvips branch which uses PDFium for PDF rendering. With that, I see:
$ time vips copy nipguide.pdf[page=8,dpi=400] x.png
real 0m1.152s

If you have Python installed, you should try PyMuPDF. It is a Python binding for MuPDF, extremely easy to use and extremely fast (3 times faster than xpdf).
Rendering PDF pages is bread-and-butter business for this package. Use a script like this:
#----------------------------------------------------------------------------------
import sys
import fitz  # PyMuPDF

fname = sys.argv[1]      # get filename from command line
doc = fitz.open(fname)   # open the file
mat = fitz.Matrix(2, 2)  # controls resolution: scale factor in x and y direction
for page in doc:
    # current PyMuPDF names; older releases spelled these getPixmap / writePNG
    pix = page.get_pixmap(matrix=mat, alpha=False)
    pix.save("p-%i.png" % page.number)  # write the page's image
#----------------------------------------------------------------------------------
More on "Matrix":
This form scales each direction by a factor of 2, so the resulting PNG has about 4 times as many pixels as the default rendering at the original, 100% size. Both dimensions can be scaled independently. Rotating the page or rendering only part of it is also possible.
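The arithmetic behind the scale factor can be sketched as follows. PDF page coordinates are in points (1/72 inch), and a scale factor of 1 corresponds to 72 dpi, so a target resolution maps to zoom = dpi / 72. The helper names below (zoom_for_dpi, pixel_size, render_region) are hypothetical, and render_region assumes a recent PyMuPDF in which get_pixmap accepts a clip rectangle. Treat it as a sketch, not the package's recommended recipe.

```python
def zoom_for_dpi(dpi):
    """Scale factor that maps 72 dpi (one pixel per PDF point) to the target dpi."""
    return dpi / 72.0


def pixel_size(width_pt, height_pt, zoom):
    """Approximate pixel dimensions of a region given in PDF points."""
    return round(width_pt * zoom), round(height_pt * zoom)


def render_region(pdf_path, page_index, rect_pt, dpi, out_path):
    """Render only rect_pt = (x0, y0, x1, y1), in points, of one page to a PNG."""
    import fitz  # PyMuPDF; imported here so the helpers above work without it

    zoom = zoom_for_dpi(dpi)
    doc = fitz.open(pdf_path)
    page = doc[page_index]  # zero-based page index
    pix = page.get_pixmap(
        matrix=fitz.Matrix(zoom, zoom),
        clip=fitz.Rect(*rect_pt),  # restrict rendering to this rectangle
        alpha=False,
    )
    pix.save(out_path)
    doc.close()


if __name__ == "__main__":
    # An A4 page is roughly 595 x 842 points.
    print(pixel_size(595, 842, 2))                  # same scale as Matrix(2, 2)
    print(pixel_size(595, 842, zoom_for_dpi(300)))  # 300 dpi rendering
```

Doubling the zoom doubles each dimension, which is where the "about 4 times" figure comes from.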
More on PyMuPDF:
Available as a binary wheel for Windows, OS X and Linux from PyPI, so installation is a matter of a few seconds. The license for the Python part is GNU GPL 3, for the MuPDF part GNU Affero GPL 3. So it's open source and free of charge. Closed-source commercial products are excluded, but you can freely distribute under the same licenses.

How to convert CorelDraw .WI wavelet-compressed image

I have a large sample of .WI images I need to convert to e.g. JPEGs, but the format now seems defunct.
The mimetype is image/wavelet.
The compression algorithm was developed by Summus, a US company that also now seems defunct.
The last CorelDraw support for the format was under 32-bit Windows. If I go down the hardware route I need to be able to make calls to a server via e.g. REST.
I think under *nix djvulibre might be able to open the files, but I haven't been able to test this yet.
Another option is to re-implement the codec myself.
It would be a nice-to-have to be able to script this.
Here's an example file: http://www.wolfgang-rolke.de/graphics/wavelet.wi

Optimize Images for Google Page Speed

I'm trying to optimize the images on my webpage to pass the Google PageSpeed test, but I can't figure out how to compress the files, with the tools provided by Google, down to the size that Google wants.
So I use jpegoptim and jpegtran for JPEGs with these commands:
jpegoptim.exe FILENAME
jpegtran.exe -copy none -debug -optimize -copy none -outfile FILENAME FILENAME
Where FILENAME is the full path to the image file. In most cases the files get a bit smaller, but not as small as the versions I can download from Google (through the PageSpeed Insights tool). Can anyone help me find the right parameters, or another tool (working on Windows) that gives me perfect results (or results that are accepted by Google)?
Thanks in advance,
J. Doe ;)

At the bottom of the Google PageSpeed Insights page there is a link where you can download optimized resources for your website.
The link is called:
Download optimized image, JavaScript, and CSS resources for this page.

Get text from doc/docx file in pages using Apache tika

I am using the Apache Tika command-line tool to extract text from doc and docx files. I can get the whole text, but I am unable to get it page by page so that I can store each page separately. Is there any way to achieve that?

Tika uses Apache POI to process Word files (both the old binary- and the newer XML-based flavors).
Since POI (fundamentally) cannot read out those page numbers and Tika is not meant to be a document renderer either, the answer is very simply: No, this is not possible.
For a little more insight on why your requirement (from a technical standpoint) does not make much sense, see my answer here.

ImageMagick create PDF version 1.4 from image?

I know that I can use ImageMagick's convert tool to turn different image files into PDF documents. However, is there some way to specify what version of PDF document I want to use for the output? Can I convert an image to a PDF v1.4 document?
I am trying to find a way to automate the conversion of image files (probably SVG) to PDF files that need to be sent to a printing service. The printer's service requires the PDF files to meet certain requirements, and one of them is that the PDF file is v1.4. My version of convert is "6.5.7-8 2010-12-02 Q16".
Thanks,
Carl

This question on superuser.com
https://superuser.com/questions/193791/batch-convert-pdf-versions
will give you some hints on how to change the version number in the PDF afterwards.
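One commonly suggested approach for this kind of conversion is to pass the PDF through Ghostscript's pdfwrite device with -dCompatibilityLevel set. Below is a minimal Python sketch, assuming Ghostscript is installed as gs; the helper names and file names are made up for illustration. It does not check whether the result meets the printing service's other requirements.

```python
import subprocess


def pdf_level_argv(src_pdf, dst_pdf, level="1.4", gs="gs"):
    """Build a Ghostscript command that rewrites src_pdf at the given PDF level."""
    return [
        gs,                                # assumed executable name
        "-dBATCH", "-dNOPAUSE", "-q",
        "-sDEVICE=pdfwrite",               # re-emit the document as PDF
        f"-dCompatibilityLevel={level}",   # target PDF version, e.g. 1.4
        f"-sOutputFile={dst_pdf}",
        src_pdf,
    ]


def convert_pdf_level(src_pdf, dst_pdf, level="1.4"):
    """Run Ghostscript; raises CalledProcessError if it exits with an error."""
    subprocess.run(pdf_level_argv(src_pdf, dst_pdf, level), check=True)


if __name__ == "__main__":
    # Print the command instead of running it.
    print(" ".join(pdf_level_argv("from_convert.pdf", "for_printer.pdf")))
```

Because pdfwrite re-processes the whole file, it is worth comparing the output against the original before sending it off.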
