How to make Tesseract recognize o as o and not as zero? - image-processing

I have the following images:
img01.png
img02.png
When I run tesseract img01.png img01.txt -l eng --psm 7 I get the texts
7.819 0 for the first image and
10.024 for the second one.
The second result is correct. However, in the first image, it is an o and not a zero.
How can I make Tesseract recognize o as o?
Update 1: I tried using the --oem 1 option as suggested in this answer (tesseract --oem 1 img01.png img01-ocred -l eng --psm 7), but it did not help.
Update 2: Binarizing the image using magick img01.png +dither -colors 3 -colors 2 -colorspace gray -normalize img01-binarized.png also didn't help. The binarized image looks like this:

You just need to enlarge the image to twice its original size and then run Tesseract on the enlarged copy.
wget https://i.stack.imgur.com/bSO87.png
identify -format "%wx%h" bSO87.png
40x20
tesseract -l eng --oem 3 --psm 6 bSO87.png stdout
7.819 0
convert bSO87.png -resize 80x40 bSO87.png
identify -format "%wx%h" bSO87.png
80x40
tesseract -l eng --oem 3 --psm 6 bSO87.png stdout
7.819 o
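The steps above can be wrapped in a small script. This is a minimal sketch, not a tuned OCR pipeline: the function names double_geometry and ocr_upscaled are hypothetical, and it assumes identify, convert and tesseract are on the PATH. It only reuses the flags shown in the commands above, and it writes the enlarged image to a separate file instead of overwriting the original.

```shell
# Double a WIDTHxHEIGHT geometry string, e.g. 40x20 becomes 80x40.
double_geometry() {
    local w="${1%x*}"
    local h="${1#*x}"
    echo "$((w * 2))x$((h * 2))"
}

# Hypothetical helper: OCR an upscaled copy, leaving the original untouched.
# $1 = source image, $2 = path for the enlarged copy.
ocr_upscaled() {
    local geom
    geom=$(identify -format "%wx%h" "$1") || return 1
    convert "$1" -resize "$(double_geometry "$geom")" "$2" || return 1
    tesseract -l eng --oem 3 --psm 6 "$2" stdout
}
```

Usage would be something like ocr_upscaled bSO87.png bSO87-2x.png. Whether a factor of two is enough probably depends on the image, so treat it as a starting point.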

Related

ImageMagick script fails with MISLEADING filename or extension too long (-sparse-color)

Problem
In my quite short script I have the problem that it sometimes reports that the filename or extension is too long. Depending on the $image and $size values in my script this error may occur or not.
E.g. the script below produces this error with the image from here - saved and converted to "example3.png".
I use ImageMagick 7.0.10-62 Q16 x64 on Windows and I don't know what to do with the error message... Any ideas what the problem is here?
Powershell script
#####################
# Setup
#####################
$image = "./example3.png"
$out = "./result.png"
$outPalette = "./palette.png"
$size = 50
$fuzz = 50
$colors = 6
$resizedSize = "$($size)x$($size)`!"
$histogramSize = "$($size)x$($size)"
#####################
# Program
#####################
Write-Host ""
# 1) Scale + change depth + remove unwanted colors (b/w)
Write-Host "- Step 1..." -ForegroundColor Green
magick convert $image -scale $resizedSize -depth 8 `
-fuzz $fuzz -transparent black -transparent white `
$out
#2) create histogram with the help of the sparse colors
Write-Host "- Step 2..." -ForegroundColor Green
$dataHistogram = magick convert -size $histogramSize xc: -sparse-color voronoi ( magick convert $out sparse-color: ) +dither -colors $colors -depth 8 -format %c histogram:info:
# ... more ...
Edit: Adjustments
replaced magick convert with magick
replaced $fuzz = 50 with $fuzz = "50%"
replaced $size = 50 with $size = 100
More images work now, but some, e.g. the following, still fail with the same error:
Edit2:
The result of the inner magick command (magick convert $out sparse-color:) looks like the following:
# ImageMagick pixel enumeration: 100,100,255,srgba
0,0: (87,72,86,0) #57485600 srgba(87,72,86,0)
1,0: (105,81,91,0) #69515B00 srgba(105,81,91,0)
...
I am not sure what's going on with PowerShell, but if the issue is the length of the command line, you can supply the sparse colours from a file like this:
magick -size 800x600 xc: -sparse-color voronoi @colors.txt result.png
Or on stdin like this:
echo "10,10 red 200,200 yellow" | magick -size 800x600 xc: -sparse-color voronoi @- result.png
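As a rough sketch of the file-based variant, the control points can be written to a file first and then passed to ImageMagick with its @file syntax, which reads an argument from a file. The helper name write_points is hypothetical, and the point list is the same example data as in the stdin command above, not output from the asker's image.

```shell
# Write sparse-color control points ("x,y color") to a file on one line,
# separated by spaces. $1 = output file, remaining arguments = points.
write_points() {
    local out="$1"
    shift
    # "$*" joins the remaining arguments with single spaces.
    printf '%s\n' "$*" > "$out"
}
```

After write_points colors.txt "10,10 red" "200,200 yellow", the file can be used as magick -size 800x600 xc: -sparse-color voronoi @colors.txt result.png, which keeps the long point list off the command line.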

Batch append images in groups of two with Imagemagick

I have a directory of images and need to merge those images horizontally in groups of two, then save the output of each to a new image file:
image-1.jpeg
image-2.jpeg
image-3.jpeg
image-4.jpeg
image-5.jpeg
image-6.jpeg
Using Imagemagick via command line, is there a way to loop through every other image in a directory and run magick convert image-1.jpeg image-2.jpeg +append image-combined-*.jpg?
So the result would be combined pairs of images:
image-1.jpeg image-2.jpeg -> image-combined-1.jpg
image-3.jpeg image-4.jpeg -> image-combined-2.jpg
image-5.jpeg image-6.jpeg -> image-combined-3.jpg
Get them all appended succinctly and in parallel with GNU Parallel and actually use all those lovely CPU cores you paid Intel for!
parallel -N2 convert {1} {2} +append combined-{#}.jpeg ::: *jpeg
where:
-N2 says to take two files at a time
{1} and {2} are the first two parameters
{#} is the sequential job number, and
::: demarcates the start of the parameters
If your CPU has 8 cores, GNU Parallel will run 8 converts at once, unless you specify say 4 jobs at a time by adding -j4.
If you are learning and just finding your way with GNU Parallel add:
--dry-run so you can see what it would do without actually doing anything
-k to keep the outputs in order
For example:
parallel --dry-run -k -N2 convert {1} {2} +append combined-{#}.jpeg ::: *jpeg
Sample Output
convert image-1.jpeg image-2.jpeg +append combined-1.jpeg
convert image-3.jpeg image-4.jpeg +append combined-2.jpeg
convert image-5.jpeg image-6.jpeg +append combined-3.jpeg
On macOS, you can simply install GNU Parallel with:
brew install parallel
If you have thousands, or hundreds of thousands of files, you may run into an error Argument list too long - although this is pretty rare on macOS because the limit is 262,144 characters:
sysctl -a kern.argmax
kern.argmax: 262144
If that happens, you can use this syntax to pipe the filenames in GNU Parallel instead:
find /somewhere -iname "*.jpeg" -print0 | parallel -0 -N2 convert {1} {2} +append combined-{#}.jpeg
If the images are all the same size and orientation, and if your system has the memory to read in all the images in the directory, it can be done as simply as this...
magick *.jpeg -set option:doublewide %[fx:w*2] \
+append +repage -crop %[doublewide]x%[h] +repage image-combined-%02d.jpg
This can be scripted easily using ImageMagick. I could show you how in Unix. But if you have more than 9 images, then you may have to rename with leading zeros, since alphabetically image-10 will come before image-2. You do not mention your IM version or platform and scripting will differ depending upon OS.
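The leading-zero renaming mentioned above could be sketched like this. The function name padded_name is hypothetical, and the sketch assumes every file name ends in a number followed by an extension, as in image-2.jpeg. It pads to two digits, so it only covers up to 99 files.

```shell
# Print a file name with its trailing number padded to two digits,
# e.g. image-2.jpeg becomes image-02.jpeg. Names without a number pass through.
padded_name() {
    local stem="${1%.*}" ext="${1##*.}"
    local digits="${stem##*[!0-9]}"
    local prefix="${stem%"$digits"}"
    if [ -z "$digits" ]; then
        printf '%s\n' "$1"
    else
        # Strip leading zeros so printf does not read the number as octal.
        while [ "${digits#0}" != "$digits" ] && [ "${#digits}" -gt 1 ]; do
            digits="${digits#0}"
        done
        printf '%s%02d.%s\n' "$prefix" "$digits" "$ext"
    fi
}
```

A rename pass could then call mv "$f" "$(padded_name "$f")" for each file whose padded name differs from its current name, after which a plain glob lists the files in numeric order.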
Here is a Unix solution. I have images rose-01.jpg ... rose-06.jpg in folder test on my desktop (Mac OSX). Each image has a label under it with its filename so we can keep track of the files.
cd
cd desktop/test
arr=(`ls *.jpg`)
num=${#arr[*]}
for ((i=0; i<num; i=i+2)); do
j=$((i+1))
k=$((i+2))
magick ${arr[$i]} ${arr[$j]} +append newimage_${j}_${k}.jpg
done
Note that arrays start with index 0. So I use j=i+1 and k=i+2 for the images that correspond to 1,2 3,4 5,6 in the filenames from ls in the array.
The result is (newimage_1_2.jpg, newimage_3_4.jpg, newimage_5_6.jpg)
An alternate solution is to montage all the images together two-by-two as an array of 2x3 and then equally crop them into 3 sections vertically. So in ImageMagick, this also works since these images are all the same size.
cd
cd desktop/test
arr=(`ls *.jpg`)
num=${#arr[*]}
num2=`magick xc: -format "%[fx:ceil($num/2)]" info:`
magick montage ${arr[*]} -tile 2x -geometry +0+0 miff:- | magick - -crop 1x3@ +repage newimage.jpg
The results are: newimage-0.jpg, newimage-1.jpg, newimage-2.jpg
Ole Tang wrote:
Fails on filenames like My summer photo.jpg
So here is the solution using ImageMagick as modified from my original post.
Images:
rose 1.png
rose 2.png
rose 3.png
rose 4.png
rose 5.png
rose 6.png
OLDIFS=$IFS
IFS=$'\n'
arr=(`ls *.png`)
for ((i=0; i<${#arr[*]}; i++)); do
echo "${arr[$i]}"
done
IFS=$OLDIFS
num=${#arr[*]}
for ((i=0; i<num; i=i+2)); do
j=$((i+1))
k=$((i+2))
magick "${arr[$i]}" "${arr[$j]}" +append newimage_${j}_${k}.jpg
done
This produces:
newimage_1_2.jpg
newimage_3_4.jpg
newimage_5_6.jpg
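An alternative that avoids parsing ls and changing IFS is to let the shell glob expand into the function's arguments and consume them two at a time. This is only a sketch: the function name append_pairs and the MAGICK variable are hypothetical (MAGICK exists so the command can be swapped for echo to preview), and the output names follow the image-combined-N.jpg pattern from the question.

```shell
# Append images pairwise: arguments 1+2, 3+4, and so on.
# A leftover unpaired last argument is ignored.
append_pairs() {
    local n=1
    while [ "$#" -ge 2 ]; do
        # MAGICK is a hypothetical override; it defaults to the magick binary.
        "${MAGICK:-magick}" "$1" "$2" +append "image-combined-$n.jpg" || return 1
        shift 2
        n=$((n + 1))
    done
}
```

Running MAGICK=echo append_pairs *.png prints the commands without executing them, and append_pairs *.png runs them. Because each file name is passed as a quoted argument, names such as rose 1.png need no special handling.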

Resize indexed PNG image with ImageMagick while preserving color map

I am using a custom batch script to make resized copies (33% and 66%) of all PNG images in a folder. Here is my code:
for f in $(find /myFolder -name '*.png');
do
sudo cp -a $f "${f/%.png/-3x.png}";
sudo convert $f -resize 66.67% "${f/%.png/-2x.png}";
sudo convert $f -resize 33.33% $f;
done
It works fine, except when the original image is indexed. In this case the smaller version of the image is RGB (so its file size is even larger than that of the original image).
I have tried several versions, but none of them worked. The one that I thought was supposed to sort this out was the following:
for f in $(find /myFolder -name '*.png');
do
sudo cp -a $f "${f/%.png/-3x.png}";
sudo convert $f -define png:preserve-colormap -resize 66.67% "${f/%.png/-2x.png}";
sudo convert $f -define png:preserve-colormap -resize 33.33% $f;
done
But it doesn't work.
EDIT:
This is the updated code, but it still doesn't work as it is supposed to (see the attached image: left is the original, right is the resized version):
for f in $(find /myFolder -name '*.png');
do
sudo cp -a $f "${f/%.png/-3x.png}";
numberOfColors=`identify -format "%k" $f`
convert "$f" \
\( +clone -resize 66.67% -colors $numberOfColors -write "${f/%.png/-2x.png}" +delete \) \
-resize 33.33% -colors $numberOfColors "$f"
done
Original image:
Scaled version:
Use "-sample" instead of "-resize" to preserve the color set. This causes the resizing to be done by nearest-neighbor color selection rather than any kind of interpolation.
Otherwise, the colormap ends up with more than 256 colors and the png encoder can't preserve it, due to the 256-color limit on the size of a PNG PLTE chunk. I cannot guarantee that you'll like the appearance of the result, though.
Also, be sure you are using a recent version of ImageMagick.
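To check whether a resized file can still use a palette, the unique color count can be compared against the 256-entry limit. This is a sketch under assumptions: the function names fits_palette and sample_indexed are hypothetical, and it relies on identify -format "%k" reporting the number of unique colors, as used in the question's updated script.

```shell
# Succeed when a color count fits in a PNG palette (at most 256 entries).
fits_palette() {
    [ "$1" -le 256 ]
}

# Hypothetical helper: point-sample an image and report whether the result
# can still be stored with a palette. $1 = input, $2 = percentage, $3 = output.
sample_indexed() {
    local colors
    convert "$1" -sample "$2" "$3" || return 1
    colors=$(identify -format "%k" "$3") || return 1
    if fits_palette "$colors"; then
        echo "$3: $colors colors, fits a palette"
    else
        echo "$3: $colors colors, too many for a palette"
    fi
}
```

For example, sample_indexed logo.png 66.67% logo-2x.png, where logo.png is a placeholder name. Since -sample only picks existing pixels, the count should not grow beyond that of the input.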
I'm not observing this problem with the current release (6.9.3-7). Your script works fine and produces clean -2x and -3x images.
There are several things to address here...
find vs glob
You say you want to process all files in a folder, then you use find which will search down into sub-directories as well. If you just want to process files in the current directory, you can let bash do the globbing directly for you. So, instead of
for f in $(find . -name "*.png"); do
you can just do:
shopt -s nullglob
for f in *.png; do
Performance
You run convert twice and load the original image twice, and that is not very efficient. You can run a single process that loads a single image and resizes to two different sizes and writes both to disk. So, instead of
for ...; do
convert ...
convert ...
done
you can write the following to start one convert, read the image once, clone it in memory and write it out, delete the spare copy in memory and then resize the original image and re-save that.
for ...; do
convert "$f" \
\( +clone -resize 66.67% -write "${f/%.png/-2x.png}" +delete \) \
-resize 33.33% "$f"
done
Palette
It seems you actually only want to output palettised (indexed) images with "any" colormap rather than with a "specific" colormap. Glenn's answer is perfect if you want to retain a specific colormap. However, if any colormap is ok, you can use -colors to reduce the colours in the resulting image to a level where the PNG library can make the decision to create a palettised image. Glenn knows a lot more than me about that as he wrote it! However, I think if you reduce the colours to 250 (or so) you will probably get a 256 entry colormap and if you reduce the colours to around 60 or so, you will get a 64 entry colourmap. So, you would do:
shopt -s nullglob
for f in *.png; do
sudo cp ... ...
convert "$f" \
\( +clone -resize 66.67% -colors 250 -write "${f/%.png/-2x.png}" +delete \) \
-resize 33.33% -colors 250 "$f"
done
You can try experimenting with other numbers of colours and see how that affects filesize - the number you need will depend on your images.

Imagemagick parallel conversion

I want to get screenshot of each page of a pdf into jpg. To do this I am using ImageMagick's convert command in command line.
I have to achieve the following -
Get screenshots of each page of the pdf file.
resize the screenshot into 3 different sizes (small, med and preview).
store the different sizes in different folders (small, med and preview).
I am using the following command, which works but is slow. How can I improve its execution time or execute the commands in parallel?
convert -density 400 -quality 100 /input/test.pdf -resize 170x117> -scene 1 /small/test_%d_small.jpg & convert -density 400 -quality 100 /input/test.pdf -resize 230x160> -scene 1 /med/test_%d_med.jpg & convert -density 400 -quality 100 /input/test.pdf -resize 1310x650> -scene 1 /preview/test_%d_preview.jpg
Splitting the command for readability:
convert -density 400 -quality 100 /input/test.pdf -resize 170x117> -scene 1 /small/test_%d_small.jpg
convert -density 400 -quality 100 /input/test.pdf -resize 230x160> -scene 1 /med/test_%d_med.jpg
convert -density 400 -quality 100 /input/test.pdf -resize 1310x650> -scene 1 /preview/test_%d_preview.jpg
Updated Answer
I see you have long, multi-page documents and while my original answer is good for making multiple sizes of a single page quickly, it doesn't address doing pages in parallel. So, here is a way of doing it using GNU Parallel which is available for free for OS X (using homebrew), installed on most Linux distros and also available for Windows - if you really must.
The code looks like this:
#!/bin/bash
shopt -s nullglob
shopt -s nocaseglob
doPage(){
# Expecting filename as first parameter and page number as second
# echo DEBUG: File: $1 Page: $2
noexten=${1%%.*}
convert -density 400 -quality 100 "$1[$2]" \
-resize 1310x650 -write "${noexten}-p-$2-large.jpg" \
-resize 230x160 -write "${noexten}-p-$2-med.jpg" \
-resize 170x117 "${noexten}-p-$2-small.jpg"
}
export -f doPage
# First, get list of all PDF documents
for d in *.pdf; do
# Now get number of pages in this document - "pdfinfo" is probably quicker
p=$(identify "$d" | wc -l)
for ((i=0;i<$p;i++));do
echo $d:$i
done
done | parallel --eta --colsep ':' doPage {1} {2}
If you want to see how it works, remove the | parallel .... from the last line and you will see that the preceding loop just echoes a list of filenames and a counter for the page number into GNU Parallel. It will then run one process per CPU core, unless you specify -j 8 if you want say 8 processes to run in parallel. Remove the --eta if you don't want any updates on when the command is likely to finish.
In the comment I allude to pdfinfo being faster than identify, if you have that available (it's part of the poppler package under homebrew on OS X), then you can use this to get the number of pages in a PDF:
pdfinfo SomeDocument.pdf | awk '/^Pages:/ {print $2}'
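Putting the page count to use, the job list fed to GNU Parallel could be generated by a small function. This is a sketch: the name page_jobs is hypothetical, and it assumes file names contain no colon, because the pipeline above splits each line on ':' via --colsep.

```shell
# Print one "file:page" line per page, with zero-based page indexes.
# $1 = PDF file name, $2 = number of pages.
page_jobs() {
    local i=0
    while [ "$i" -lt "$2" ]; do
        printf '%s:%s\n' "$1" "$i"
        i=$((i + 1))
    done
}
```

Inside the loop over documents, the inner counter loop could then become page_jobs "$d" "$(pdfinfo "$d" | awk '/^Pages:/ {print $2}')", with the output still piped into parallel as before.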
Original Answer
Something along these lines so you only read it in once and then generate successively smaller images from the largest one:
convert -density 400 -quality 100 x.pdf \
-resize 1310x650 -write large.jpg \
-resize 230x160 -write medium.jpg \
-resize 170x117 small.jpg
Unless you mean you have, say, a 50 page PDF, and you want to do all 50 pages in parallel. If you do, say so, and I'll show you that using GNU Parallel when I get up in 10 hours...

Imagemagick GraphicsMagick image mean command

I use the following command, which works in ImageMagick, to get the mean of a picture:
identify -format "%[mean]" photo.jpg
The same command does not work under GraphicsMagick. Is there an equivalent I can use?
You can do this, for example:
gm identify -verbose photo.jpg | grep -E "Mean|Red|Green|Blue"
Or, if you want Red, Green and Blue as 3 separate integers
gm identify -verbose photo.jpg | awk '/Mean:/{s=s int($2) " "} END{print s}'
0 29 225
Or, if you want the average of all channels, like this:
gm identify -verbose photo.jpg | awk '/Mean:/{n++;t+=$2} END{print int(t/n)}'
85
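The averaging one-liner can be wrapped in a function that reads the verbose output on stdin. This is a sketch: the name average_mean is hypothetical, and the only change to the awk program is a guard so that input without any Mean: lines prints nothing instead of dividing by zero.

```shell
# Read "gm identify -verbose" output on stdin and print the integer
# average of all per-channel "Mean:" values.
average_mean() {
    awk '/Mean:/ { n++; t += $2 } END { if (n) print int(t / n) }'
}
```

It would be used as gm identify -verbose photo.jpg | average_mean.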
