How to keep Tesseract from inserting extra whitespace in words? - imagemagick

I have already asked about this on the Tesseract forum.
Using Tesseract (and ImageMagick), I'm trying to extract the text of a PDF file.
The section I'm working on is line #7 of the PDF.
In this section, Tesseract is running into problems when trying to identify
the string CONSTRUCTORA.
It sees CO NSTRUCTO RA
It should see CONSTRUCTORA
Can anyone suggest any possible fixes for this?
This is the commandline sequence:
convert -density 600 my_pdf.pdf tmp.tif
tesseract -l spa tmp.tif stdout > tmp.txt
These are the software versions:
~% tesseract --version
tesseract 3.05.01
leptonica-1.74.4
libgif 4.1.6(?) : libjpeg 8d (libjpeg-turbo 1.3.0) : libpng 1.2.50 :
libtiff 4.0.3 : zlib 1.2.8
~% convert --version
Version: ImageMagick 6.7.7-10 2014-08-28 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2012 ImageMagick Studio LLC
Features: OpenMP

To deal with the irregular kerning in the PDF file, Will suggested tweaking the parameters around tosp_min_sane_kn_sp, which are listed in the docs: https://github.com/naptha/tesseract.js/blob/master/docs/tesseract_parameters.md
Setting tosp_min_sane_kn_sp=2.8 solved the issue that was described in the question.
The new Tesseract invocation is the following:
tesseract -c tosp_min_sane_kn_sp=2.8 -l spa tmp.tif stdout > tmp.txt
The default value for tosp_min_sane_kn_sp seems to be 1.5. So far, I have only tested with values larger than 1.5.
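Since only values above the default were tested, it may help to compare a few candidates side by side. The following is a minimal sketch, not part of the original answer: the helper name kn_sp_commands and the candidate values 1.5 and 2.0 are assumptions for illustration (only 2.8 comes from the answer). It prints one tesseract command per value rather than running anything, so the commands can be reviewed first.

```shell
#!/bin/sh
# Print one tesseract invocation per candidate value of tosp_min_sane_kn_sp.
# kn_sp_commands is a hypothetical helper; it only prints, it does not run OCR.
kn_sp_commands() {
    image=$1
    shift
    for value in "$@"; do
        printf 'tesseract -c tosp_min_sane_kn_sp=%s -l spa %s stdout > tmp_%s.txt\n' \
            "$value" "$image" "$value"
    done
}

# Candidate values are illustrative; 2.8 is the one reported to work above.
kn_sp_commands tmp.tif 1.5 2.0 2.8
```

Piping the printed lines to sh would run them, after which the resulting tmp_*.txt files can be compared for the CONSTRUCTORA line.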

Related

Why does ImageMagick fail with "no decode delegate for this image format `PNG'" even when the png delegate is installed?

I'm working on openSUSE Tumbleweed.
I have ImageMagick installed.
$ convert --version
Version: ImageMagick 7.1.0-52 Q16-HDRI x86_64 20549 https://imagemagick.org
Copyright: (C) 1999 ImageMagick Studio LLC
License: https://imagemagick.org/script/license.php
Features: Cipher DPC HDRI Modules OpenMP(4.5)
Delegates (built-in): bzlib djvu fontconfig freetype gslib jng jpeg lcms ltdl lzma png ps raw tiff x xml zlib
Compiler: gcc (12.2)
Notice that the PNG delegate is installed, at least according to --version.
Despite this, converting fails.
$ convert covalent_platform.svg covalent_platform.png
convert: no decode delegate for this image format `PNG' @ error/constitute.c/ReadImage/776.
convert: no images defined `covalent_platform.png' @ error/convert.c/ConvertImageCommand/3342.
Even identifying a non-PNG file fails, with the same PNG delegate problem.
$ identify covalent_platform.svg
identify: no decode delegate for this image format `PNG' @ error/constitute.c/ReadImage/776.
Any ideas for what might be wrong here?
I reinstalled libpng-devel and reinstalled IM, but the problem remains.
I figured out the problem, thanks to @fmw42, who pointed out that if magick -list format produces no output, the installation is bad.
I had installed IM from https://imagemagick.org/script/download.php
These downloads are not compatible with openSUSE.
After uninstalling and reinstalling IM from https://software.opensuse.org/package/ImageMagick, everything works as expected.
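The check described above can be scripted. This is a rough sketch, not something from the original answer: has_format is a hypothetical helper, and it assumes that each row of the format list starts with the format name, optionally followed by an asterisk. It fails both when the format is absent and when the list is empty, which is the symptom of the broken installation.

```shell
#!/bin/sh
# has_format NAME: read a format list on stdin and succeed if NAME is listed.
# Assumption: the first column is the format name, optionally suffixed with '*'.
has_format() {
    awk -v fmt="$1" '
        { name = $1; sub(/\*$/, "", name); if (name == fmt) found = 1 }
        END { exit !found }
    '
}

# Usage sketch (requires ImageMagick 7 on PATH):
#   if magick -list format | has_format PNG; then
#       echo "PNG is available"
#   else
#       echo "PNG missing or format list empty: the installation looks broken"
#   fi
```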

imagemagick wmf support on Alpine Linux

I am installing ImageMagick on Alpine, which picks up version 7.0.10.
My primary use is to convert WMF to PNG.
But convert sample.wmf sample.png gives this error:
convert: no decode delegate for this image format `WMF' @ error/constitute.c/ReadImage/572.
convert: no images defined `sample.png' @ error/convert.c/ConvertImageCommand/3322.
As per https://www.imagemagick.org/script/formats.php, I also installed libwmf, which does not resolve the issue.
identify -list format | grep WMF does not return any result.
Updated Answer
You are in luck! GraphicsMagick can do it on alpine:latest:
apk add graphicsmagick
gm identify -version
GraphicsMagick 1.3.36 20201226 Q16 http://www.GraphicsMagick.org/
Copyright (C) 2002-2020 GraphicsMagick Group.
Additional copyrights and licenses apply to this software.
See http://www.GraphicsMagick.org/www/Copyright.html for details.
Feature Support:
Native Thread Safe yes
Large Files (> 32 bit) yes
Large Memory (> 32 bit) yes
BZIP no
DPS no
FlashPix no
FreeType yes
Ghostscript (Library) no
JBIG no
JPEG-2000 no
JPEG yes
Little CMS no
Loadable Modules yes
Solaris mtmalloc no
Google perftools tcmalloc no
OpenMP no
PNG yes
TIFF yes
TRIO no
Solaris umem no
WebP yes
WMF yes <--- HERE IT IS
X11 no
XML yes
ZLIB yes
Now do the conversion from WMF to PNG:
gm convert sample.wmf result.png
Original Answer
I don't think you'll be able to do that, with ImageMagick at least...
I tried installing libwmf on alpine:latest and then installing ImageMagick from source and it declined to use libwmf v0.2.12
So I checked what ImageMagick requires and it wants libwmf v0.2.8.2.
So I tried alpine:3.8 which can install libwmf v0.2.8.4 but ImageMagick still wouldn't accept that (xxx/ipa.h is missing).
So I looked back to alpine:3.3 and alpine:3.4 and they have no libwmf.
So I tried alpine:3.5 and that was the same libwmf version as alpine:3.8
TLDR; alpine:3.5's libwmf is too new for ImageMagick and alpine:3.4 doesn't have libwmf at all.
Note: I found the packages and versions of libwmf on this website.

convert wmf to jpg with graphicsmagick, unable to read font

I am trying to convert a WMF file to a JPG. gm reports: "Unable to read font (n021004l.pfb) [No such file or directory]." This is my command:
gm convert 456.wmf 456.jpg
What could be wrong? I am using the latest gm version 1.3.29 and have ghostscript installed. OS: Windows 7
Here's the small file that I am trying to convert: https://mycloud.m-box.at/index.php/s/lBMeCG0cjK45sI1
And here's the version log (wmf is enabled):
GraphicsMagick 1.3.29 2018-04-29 Q16 http://www.GraphicsMagick.org/
Copyright (C) 2002-2018 GraphicsMagick Group.
Additional copyrights and licenses apply to this software.
See http://www.GraphicsMagick.org/www/Copyright.html for details.
Feature Support:
Native Thread Safe yes
Large Files (> 32 bit) yes
Large Memory (> 32 bit) yes
BZIP yes
DPS no
FlashPix no
FreeType yes
Ghostscript (Library) no
JBIG yes
JPEG-2000 yes
JPEG yes
Little CMS yes
Loadable Modules yes
OpenMP yes (200203)
PNG yes
TIFF yes
TRIO no
UMEM no
WebP yes
WMF yes
X11 no
XML yes
ZLIB yes
Windows Build Parameters:
MSVC Version: 1500
This looks a bit odd. I would suggest you enable debugging messages and see if you can trace what is going wrong yourself.
So, use:
gm convert -debug all 456.wmf 456.jpg > debug.txt 2>&1
and then look at the last few lines in debug.txt.
Sorry I can't help further!

Check the currently installed ImageMagick version and, based on that, uninstall or install

We are about to undergo penetration testing, so I need to upgrade ImageMagick on our server to 7.0.2-2. The version we currently run (6.7.x.x) has some high-severity vulnerabilities, and we plan to update to get rid of them.
I want to manage this through Ansible. I have a script that does a fresh installation, but it doesn't check whether a previous version is already installed. It would be great if someone could help me write a script that first checks for a previous version, uninstalls it if there is one, and then installs the upgraded version.
Many thanks in advance!
You can obtain the version by parsing it out of the first line of output from
identify --version
e.g., on a nearly up-to-date host:
$ identify --version
Version: ImageMagick 7.0.1-1 Q16 x86_64 2016-06-24 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2016 ImageMagick Studio LLC
License: http://www.imagemagick.org/script/license.php
Features: Cipher DPC HDRI
Delegates (built-in): bzlib djvu fftw fontconfig freetype jbig jng jpeg lcms lqr lzma openexr png tiff wmf x xml zlib
or on a not-so-up-to-date host:
$ identify --version
Version: ImageMagick 6.6.9-6 2011-04-28 Q16 http://www.imagemagick.org
Copyright: Copyright (C) 1999-2011 ImageMagick Studio LLC
Features:
Use "sed" or your favorite editor to extract the version string:
identify --version | sed -e 's/Version: ImageMagick //' -e 's/ .*//' | head -n 1
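Building on that, the extracted version can be compared against the target before deciding whether to reinstall. The sketch below is one possible approach, not a tested deployment script: im_version and version_lt are hypothetical helper names, the comparison relies on sort -V (a GNU extension, so an assumption about the host), and the actual uninstall and install commands are left out because they depend on the package manager.

```shell
#!/bin/sh
TARGET=7.0.2-2   # target version named in the question

# Read `identify --version` output on stdin and print just the version,
# e.g. "7.0.1-1", taken from the first line.
im_version() {
    sed -n '1s/^Version: ImageMagick \([^ ]*\).*/\1/p'
}

# Succeed if version $1 is strictly older than version $2.
# Assumes GNU `sort -V` is available for version ordering.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# Usage sketch:
#   current=$(identify --version 2>/dev/null | im_version)
#   if [ -z "$current" ]; then
#       echo "ImageMagick not installed: install $TARGET"
#   elif version_lt "$current" "$TARGET"; then
#       echo "Found $current: uninstall it, then install $TARGET"
#   else
#       echo "Found $current: nothing to do"
#   fi
```

In an Ansible play, the same check could run as a command task whose result gates the uninstall and install tasks.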

Imagemagick Convert PDF to JPEG: FailedToExecuteCommand `"gswin32c.exe" / PDFDelegateFailed

I have PDFs that I need to convert to images. I have installed Imagemagick. I have a PDF named a.pdf that I can open (it is not corrupt) in the folder C:\Convert\
From the command line I am trying
C:\Convert>convert a.pdf a.jpg
And I am getting the error.
convert.exe: FailedToExecuteCommand `"gswin32c.exe" -q -dQUIET -dSAFER -dBATCH -
dNOPAUSE -dNOPROMPT -dMaxBitmap=500000000 -dAlignToPixels=0 -dGridFitTT=2 "-sDEV
ICE=pamcmyk32" -dTextAlphaBits=4 -dGraphicsAlphaBits=4 "-r72x72" -dUseCIEColor
"-sOutputFile=C:/Users/MNALDO~1.COR/AppData/Local/Temp/magick-3704HYGOqqIK5rhI%d
" "-fC:/Users//MNALDO~1.COR/AppData/Local/Temp/magick-3704vK6aHo7Ju9WO" "-fC:/Use
rs//MNALDO~1.COR/AppData/Local/Temp/magick-3704GQSF9kK8WAw6"' (The system cannot
find the file specified.
) @ error/delegate.c/ExternalDelegateCommand/480.
convert.exe: PDFDelegateFailed `The system cannot find the file specified.
' @ error/pdf.c/ReadPDFImage/797.
convert.exe: no images defined `a.jpg' @ error/convert.c/ConvertImageCommand/323
0.
UPDATE:
After the SO community helped me solve this issue I put together a little tool to batch convert images. Hope it helps somebody.
https://github.com/MattDolan/ImageConverter
You need to install Ghostscript in order to rasterize vector files (PDF, EPS, PS, etc.) with ImageMagick. IM will shell out to Ghostscript when doing these manipulations (you can see it if you use the -verbose tag in your IM invocation). You could also use Ghostscript by itself to rasterize vector files.
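Before retrying the conversion, it can be worth confirming that a Ghostscript executable is actually reachable on PATH. The following is a small sketch under assumptions: it is written for a POSIX shell (on Windows that means something like Git Bash or Cygwin rather than cmd.exe), and first_on_path is a hypothetical helper name. The candidate names gswin64c, gswin32c and gs are the ones mentioned in this thread.

```shell
#!/bin/sh
# Print the first command name from the arguments that is found on PATH.
first_on_path() {
    for name in "$@"; do
        if command -v "$name" >/dev/null 2>&1; then
            printf '%s\n' "$name"
            return 0
        fi
    done
    return 1
}

# gswin64c / gswin32c are the Windows console builds; gs is the usual
# name on Unix-like systems.
if gs_cmd=$(first_on_path gswin64c gswin32c gs); then
    echo "Ghostscript found: $gs_cmd"
else
    echo "Ghostscript not found on PATH" >&2
fi
```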
Since you actually have to install Ghostscript to do this, why not drop ImageMagick altogether? It just forwards the command to Ghostscript anyway, not adding any value, just taking way longer to process (and loading everything into RAM while it's at it).
Install GhostScript and run the command:
gswin64c.exe -dNOPAUSE -sDEVICE=jpeg -r200 -dJPEGQ=60 -sOutputFile=foo-%03d.jpg foo.pdf -dBATCH
This gives the same result as, and is faster than, running:
convert -quality 60 -density 200 foo.pdf foo-%03d.jpg
It's in the docs now. https://github.com/dlemstra/Magick.NET/blob/main/docs/ConvertPDF.md
You need to install the latest version of GhostScript before you can convert a pdf using Magick.NET.
Make sure you only install the version of GhostScript with the same
platform. If you use the 64-bit version of Magick.NET you should also
install the 64-bit version of Ghostscript. You can use the 32-bit
version together with the 64-bit version but you will get better
performance if you keep the platforms the same.
Here is a wrapper: https://archive.codeplex.com/?p=ghostscriptnet
I found that I had installed GhostScript, but GhostScript was not able to execute because it needed additional libraries. By typing "gs" on a command line, I was able to see what libraries were missing.
Install the GNU Affero General Public License release of Ghostscript from here.
