I am converting PDF files to images with ImageMagick, but this particular file comes out as complete gibberish.
To keep things simple, I am just running:
convert file.pdf out.jpg
My guess is that the file mixes text pages and scanned image pages, and that this causes the trouble. Can you help?
The pages that contain text are converted to gibberish; the last page, which is actually a scan, comes out fine.
This is the link to the original file.
EDIT: I found out that files without a mix of text and scans cause the same issue; in fact, it affects any file that contains text data rather than scanned images. So the question is how to set up ImageMagick to convert a text-only PDF to an image without getting this output.
The problem was with Ghostscript 9.22; updating to 9.23 fixes it.
Related
I am having a problem when calling a script to do a transformation. I have a script on Ubuntu that splits a multipage PDF into single-page PDF files, transforms each one to TIFF with convert (from ImageMagick), generates HTML with tesseract OCR, converts that back to PDF with a text layer, and merges everything back into a single PDF with a text layer.
The script works fine in the console, but in Alfresco, because of different environment variables in the path, it uses a different convert (/opt/alfresco-3.4.d/common/bin/convert) instead of /usr/bin/convert. The result is a PDF 1.3 file instead of a TIFF, so tesseract does nothing. The servlet container is Tomcat. I tried copying /usr/bin/convert to the Catalina home and to the Alfresco common directory, renaming convert to conv and calling that, etc., but nothing happened.
How can I tell Alfresco to use the right convert instead of its own /opt/alfresco-3.4.d/common/bin/convert?
Thanks
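One general way to sidestep PATH differences between a console and a servlet environment is to call the intended binary by its absolute path. The sketch below is only an illustration in Python, not the original script: the helper names `find_executable` and `pdf_to_tiff` are made up for this example, and it assumes /usr/bin/convert is the binary that produces the TIFF output, as described in the question.

```python
import os
import subprocess


def find_executable(name, search_dirs):
    """Return the first executable called `name` found in `search_dirs`, else None.

    This mimics how a PATH lookup picks a binary: the earliest directory wins.
    """
    for directory in search_dirs:
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def pdf_to_tiff(pdf_path, tiff_path, convert_bin="/usr/bin/convert"):
    """Run one specific convert binary by absolute path, ignoring PATH order.

    The default path is an assumption taken from the question above.
    """
    subprocess.run([convert_bin, pdf_path, tiff_path], check=True)
```

Calling `find_executable("convert", os.environ["PATH"].split(os.pathsep))` from inside each environment shows which binary a bare `convert` would resolve to there.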
I know that I can use ImageMagick's convert tool to turn different image files into PDF documents. However, is there some way to specify what version of PDF document I want to use for the output? Can I convert an image to a PDF v1.4 document?
I am trying to find a way to automate the conversion of image files (probably SVG) to PDF files that need to be sent to a printing service. The printer's service requires the PDF files to meet certain requirements, and one of them is that the PDF file is v1.4. My version of convert is "6.5.7-8 2010-12-02 Q16".
Thanks,
Carl
This question on superuser.com
https://superuser.com/questions/193791/batch-convert-pdf-versions
will give you some hints how to change the version number in the PDF afterwards.
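Before and after changing anything, it can help to check which version a file actually declares. A PDF starts with a header line such as %PDF-1.4, so a small Python sketch can read it. The function name `pdf_header_version` is made up for this example, and the sketch only looks at the header; from PDF 1.4 onwards the document catalog may carry a /Version entry that overrides it, which this does not inspect.

```python
import re


def pdf_header_version(data):
    """Return the version declared in the %PDF-x.y header, or None if absent.

    `data` is the beginning of the file as bytes. Only the first 1024 bytes
    are searched, since the header is expected at or near the start.
    """
    match = re.search(rb"%PDF-(\d+\.\d+)", data[:1024])
    return match.group(1).decode("ascii") if match else None


# Usage with a hypothetical file name:
# with open("out.pdf", "rb") as handle:
#     print(pdf_header_version(handle.read(1024)))
```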
Hi, I am trying to parse a PDF file. I am able to extract the text, but if the PDF is compressed (using FlateDecode) I get junk characters. How do I decompress the text, and how do I find out which filter was used?
If you are working in C++, you can use the zlib library to decompress the bytes of a page's content stream.
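The same idea can be sketched in Python, whose standard `zlib` module wraps the same library. The filter is named by the /Filter entry in the dictionary that precedes each stream, so the sketch checks for /FlateDecode there before inflating. This is a rough illustration rather than a real parser: it locates streams with a regular expression, ignores the /Length entry, filter chains, predictors and encryption, and the function name `inflate_flate_streams` is made up for this example.

```python
import re
import zlib

# Matches the raw bytes between the "stream" and "endstream" keywords.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_flate_streams(pdf_bytes):
    """Return the decompressed bytes of every stream marked /FlateDecode."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        # The stream dictionary sits between the preceding "obj" keyword
        # and the "stream" keyword; the filter name is looked up there.
        header_start = pdf_bytes.rfind(b"obj", 0, match.start())
        header = pdf_bytes[max(header_start, 0):match.start()]
        if b"/FlateDecode" not in header:
            continue
        try:
            # decompressobj tolerates the end-of-line bytes before "endstream".
            results.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue  # not valid zlib data, e.g. an encrypted stream
    return results
```

The inflated bytes are page operators such as `BT (Hello) Tj ET`; depending on the fonts used, the strings inside may still need decoding before they read as plain text.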
Does anybody know how to create a thumbnail from an Adobe Illustrator file without using Illustrator? I have a php/linux based application and I'd like to do so.
-Dave
By default, Adobe Illustrator saves files as PDF compatible. Unless the file was saved in a strange way, you should be able to use ImageMagick directly to generate a thumbnail. For example:
convert file.ai -thumbnail 250x250 -unsharp 0x.5 thumbnail.png
Note: If the file has multiple artboards (which are interpreted as pages in a PDF), it will generate multiple files or, if saved as a GIF, an animated GIF.
If you can save it in PDF, PS, or EPS format you may be able to manipulate it in things like ImageMagick or Ghostscript.
EDIT:
I think you can actually use ImageMagick's convert with *.ai files as well.
Is it possible to search for words in PDF files with Delphi?
I have code that can search many other file types (exe, dll, txt), but it doesn't work with PDF files.
It depends on the structure of the specific PDF.
If the PDF is made of images (scanned pages), then you have to OCR each image and build a full-text index inside the PDF. (To see if it's image based, open it with Notepad and look for obj tags full of random characters.) There are a few utilities and apps that do this kind of work for you; CVision PDF Compressor is one that I have used before.
If the PDF is a standard PDF, then you should be able to open it like any other text file and search for the words.
Here is a page that details some of the structure of a PDF. This is an SO post on the same topic.
The components/libraries mentioned in the answer to this question should do what you need.
I'm just working on a project that does this. The method I use is to convert the PDF file to plain text (with pdftotext.exe) and create an index on the resulting text. We do the same with Word and other Office files, and it works pretty well!
Searching directly in PDF files from Delphi (without an external app) is more difficult, I think. If you find anything, please update here, as I would also be very interested in that!
One option I have used is Microsoft's IFilter technology, which is used by Windows Desktop Search and many other products such as SharePoint and SQL Server full-text search.
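The convert-then-search approach can be sketched as follows. This is an illustration in Python rather than Delphi, under a few assumptions: a `pdftotext` binary is on the PATH, it separates pages with form feed characters (its default behaviour), and the helper names `extract_text` and `pages_containing` are made up for this example.

```python
import subprocess


def extract_text(pdf_path, pdftotext_bin="pdftotext"):
    """Run pdftotext and return the extracted text.

    A "-" as the output file name sends the text to stdout, and
    "-enc UTF-8" fixes the output encoding so it can be decoded reliably.
    """
    result = subprocess.run(
        [pdftotext_bin, "-enc", "UTF-8", pdf_path, "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def pages_containing(text, word):
    """Return the 1-based page numbers whose text contains `word`.

    The comparison is case-insensitive; pages are split on form feeds.
    """
    needle = word.lower()
    pages = text.split("\f")
    return [
        number
        for number, page in enumerate(pages, start=1)
        if needle in page.lower()
    ]
```

With a hypothetical file name, `pages_containing(extract_text("manual.pdf"), "invoice")` would list the pages on which the word occurs. Scanned pages yield no text this way and would need OCR first.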
It supports almost any office/office-like file format, even dwg, msg, pdf, and files in zip/rar archives.
The easiest way to use it is to run FiltDump.exe on any files you have, and index the text output.
To know about the filters installed on your PC, you can use ifilter explorer.
Wikipedia has some links on its ifilters page.
Quick PDF Library's GetPageText function can give you the words from a PDF as well as the page number and the co-ordinates of those words - sometimes useful for highlighting.
PDF is not just a binary representation. Think of it as a tree of objects, where an object node has some metadata and some content information. Some of these objects have string data, some don't. Some of these are even encrypted, and some are compressed. So, there's very little chance your string finder will work on any arbitrary PDF.