Mathematica's TextRecognize not up to par - image-processing

Please take a look at the screenshot below and see if you can tell me why this won't work. The examples in on the reference page for TextRecognize look pretty impressive, I don't think recognizing single letters like this should be a problem. I've tried resizing the letters as well as having the image sharpened.
For convenience in case you want to try this yourself I have included the image that I use at the bottom of this post. You can also find plenty more like this by searching for "Wordfeud" in Google Image Search.

Very cool question!
TextRecognize uses heuristics to recognize whole words from the English language. This is
the gotcha that makes recognizing single letters very hard
Consider the following line of thought:
s = Import["http://i.stack.imgur.com/JHYuh.png"];
p = ImagePartition[s, 32]
Now pick letters to form the English word 'EXIT':
x = {p[[1, 13]], p[[6, 6]], p[[3, 13]], p[[1, 12]]}
Now clean up these images a bit, like so:
d = ImageAssemble[ Map[ImageTake[#, {3, 27}, {2, 20}] &, x ]];
Then this returns the string "EXIT":
TextRecognize[d]

This is an approach completely different from using TextRecognize, so I am posting this as a separate answer. It uses the same image recognition technique from the How do I find Waldo with Mathematica.
First get the puzzle:
wordfeud = Import["http://i.stack.imgur.com/JHYuh.png"]
And then get the pieces of the puzzle:
Grid[pieces = ImagePartition[s, 32]]
Let's be interested in the letter E:
LetterE = pieces[[4, 3]]
Get the correlation image:
correlation =
ImageCorrelate[wordfeud, Binarize[LetterE],
NormalizedSquaredEuclideanDistance]
And highlight the matches:
positions = Dilation[ColorNegate[Binarize[correlation, .1]], DiskMatrix[20]];
found = ImageMultiply[wordfeud, ImageAdd[ColorConvert[positions, "GrayLevel"], .5]]
As before, this requires a bit of tuning on binarizing the correlation image, but other than
that this should help to identify bits and pieces of this puzzle.

I thought the quality of your image might be interfering. Binarizing your image did not help : recognition was zilch. I also tried a very sharp black and white image of a crossword puzzle solution. (see below) Again, nothing was recognized whether in regular or binarized format.
So I removed the black background leaving only the letters and their thin black frames. Again, recognition was about 0%.
When I removed the frames from around some of the letters AND binarized the image the only parts that were recognizable were those regions in which there was nothing but letters. (see below)
Notice in the output below, ANTS, TIRES, and TEXAS are correctly identified (as well as VECTORS), but just about nothing else.
Notice also that, even though the strings were widely spaced, mma interpreted them as words, rather than separate letters. Note "TEXAS" instead of "T E X A S".
TextRecognize[Binarize#img]
(* output *)
ANTS FFWWW FEEWF
E R o If IU I?
E A FI5F WWWFF 5
5552? L E F F
T s E NTT BT|
H0RWW#0WVlWF;EE F
5 W E ; OCS
FOFT W W R AL%AE
A TT I T ? _
i iE#W'NF WG%S W
A A EW F I i
SWWTW W ALTFCWD N
H A V 5 A F F
PLATT EWWLIGHT
W N E T
HE TIRES C
TEXAS VECTORS
I didn't have the patience to completely clean up the image. It would have been much faster to retype the text by hand.
Conclusion: Don't use text recognition in mma unless you have absolutely clear text against an even-colored, bright, preferrably white, background.
The results also varied depending on the file format used. Avoid .pdf altogether.
Edit
acl captured and tried to recognize the last 5 lines (above Edit). His results (in a comment below): mostly gibberish.
I decided to do the same. But since Prashant warned that text size makes a difference, I zoomed in first so that the text appear (to my eyes) to be about 20 pica. Below is the picture of the text I scanned and TextRecognized.
Here's the result of an unbinarized TextRecognize (at that large size):
Gliii. Q lk-ii`t`*¥ if EY £\[CloseCurlyDoubleQuote]1\[Euro]'EE \
Di'¥C~E\"P ITF SKI' T»f}!E'!',IL:?E\[CloseCurlyDoubleQuote] I 2 VEEE5\
\[CloseCurlyQuote] LEP \"- \"VE
1. ur e=\\..r.1.»».»\\\\ rw r 1»»\\|a'*r | r .fm -»'-an \
\[OpenCurlyQuote] -.-rr -_.»~|-.'i~-.w~,.-- nv n.w~»-\
\[OpenCurlyDoubleQuote]~"
Now, here's the result for the TextRecognize of the binarized image. The original image was a .png from Jing.
I didn't have the patience to completely clean up the image. It would \
have been much faster to retype the
text by hand.
Conclusion: Don't use text recognition in mma unless you have \
absolutely clear text against an even-
colored, bright, preferrably white, background.
The results also varied depending on the file format used. Avoid .pdf \
altogether.

Related

"For all" using Apache Jenas rule engine

I'm currently working on some small examples about Apache Jena. What I want to show is universal quantification.
Let's say I have balls that each have a different color. These balls are stored within boxes. I now want to determine whether these boxes only contain balls that have the same color of if they are mixed.
So basically something along these lines:
SAME_COLOR = ∃x∀y:{y in Box a → color of y = x}
I know that this is probably not possible with Jena, and can be converted to the following:
SAME_COLOR = ∃x¬∃y:{y in Box a → color of y != x}
With "not exists" Jena's "NoValue" can be used, however, this does (at least for me) not work and I don't know how to translate above logical representations in Jena. Any thoughts on this?
See the code below, which is the only way I could think of:
(?box, ex:isA, ex:Box)
(?ball, ex:isIn, ?box)
(?ball, ex:hasColor, ?color)
(?ball2, ex:isIn, ?box)
(?ball2, ex:hasColor, ?color2)
NotEqual(?color, ?color2)
->
(?box, ex:hasSomeColors, "No").
(?box, ex:isA, ex:Box)
NoValue(?box, ex:hasSomeColors)
->
(?box, ex:hasSomeColors, "Yes").
A box with mixed content now has both values "Yes" and "No".
I've ran into the same sort of problem, which is more simplified.
The question is how to get a collection of objects or count no. of objects in rule engine.
Given that res:subj ont:has res:obj_xxx(several objects), how to get this value in rule engine?
But I just found a Primitive called Remove(), which may inspire me a bit.

RGB Conversion in Lua/TI-Nspire (TI-image)

I know this may be kinda of a dead and old unanswered topic, but
im trying to convert a RGB color code to the string format in a TI-image file, which doesnt make sense to me here:
https://wiki.inspired-lua.org/TI.Image
I understand all it mentions, until I reach the rgb conversion part. The article says that each rgb color has to have 5 bits, but it doesnt tell how to convert it, and i cant make sense of how to convert it following the given example. For instance:
R=255 → R = 31
G=012 → G = 1
B=123 → B = 15
What would I have to do to convert R255, G012 and B123 to the above output?
I understand the remaining instructions on the article, except this.
Anyone have an idea on how to do this?

Improve accuracy of underlined text

Some underlines in my image are very close to the text. For that particular text tesseract is unable to produce accurate results. I have attached the image and text file. Is there any way i can increase accuracy of the text?
I have tried to remove the underlines with some of the image processing techniques, but the problem is those lines which are close to the text are not getting removed.
And are there any parameters in tesseract which i can use to improve the accuracy? Thanks in advance.
image which i am trying to run
Its Result:
ARR!
D.
1.
\OCIJHJO‘LI'IJ?‘
3..
10.
E.
F.
SITE NUMBER
ARCHEOLOGICAL DESCRIPTION
General site description SITE IS COVERED WITH LARGE PINES AND IS IN RELATIVLY
GOOD CONDITION, snowING'EITTrE‘SIGNS‘OFTRosmN—EXCEPT—AEONG-Tmm
_"—""NHERE IT DROPS or INTO FLOODPLAIN OF CREEK THERE ARE A EEN ANIMAL TRAILS THAT
HAVE APPARENTLY ERODED OUT IN THE PAST. ONE OF THESE WAS QUIET DEEP ACCORDING
“TO AUGER TEST, BUT HAS FILLED UP WITH SAND AND GROWN OVER AGAIN. FIRST AUGER TEST
“WAS INTO THIS DEE P"GULLY" AND GAVE A FALSE IMPRESSION AS TO THE TRUE DEPTH OF
SITE. THIS TEST HOLE PRODUCED LIEHLQ FLAKES ALL THE WAY DOWN TO 42 INCHES AND
_m STERILE SAND DOWN TO 60 INCHES= REST OF SITE PRODUCED SAND AND CHIPS ONLY TO
l- an I ' A: : I L I i : ‘5!) THIS 3 1.0 5.- 3.. 'Y __
FINE SITE.
Site size .AT L S - E Y CONSIDERABLY MQBE
Nature of archeological deposition EAIBIEIHNDESTURBED EXCEPT ALONG THE EDGES OF SITE
T D0.
Site depth. 20-22 INCHES
Hidden
Faunal preservation
Floral preservation
Human remains
Cultural features (type and number)
Charcoal preservation
DATA RECOVERY METHODS
Ground surface visibility: 0% x 1-251 26—50% 51-75% 76—100%
Description of ground cover iMATURE PINE FOREST
Time spent collecting Number of peeple collecting
Description of surface collecting methods
Type and extent of testing and/or excavation FIVE TEST HOLES WERE SUNK IN SITE WITH 8"
AUGERa THESE WERE TAKEN DOWN IN 6" LEVELS UNTIL STERILE CLAY WAS REACHED. DIRTTA T-
FROM EACH 6" LEVEL WAS SCREENED THROUGH_l/4" WIRE MESH AND ARTIFACTS KEPT FOR
ANALYSIS. ALL TEST HOLES QERE PLQIIED EIIE TRANSIT IN RELATION TO DATUM MARKER
WHI IS A PIPE ‘ _ -: fl' : 3:0. . .: U' J I: : : . !" uFF 3L
GROUND. P__\l: IS I : um \I' :i “I ' I ' .M' I ' D' . I’ I 2! ti 0 .1. ' -. _ .L l' .
ARCHEOLOGICAL COMPONENTS
Paleo-Indian Late Whodland 17th century
Early Archaic Mississippian 18th century
Middle Archaic Late prehistoric 19th century
Late Archaic Unknown prehistoric ___ 20th century __
Early Woodland Ceramic prehistoric ____ Unknown historic
Middle Woodland 16th century

Tesseract not able to recognize characters even for a high quality Image

I am doing the process of cleaning up and image using leptonica and then passing it to tesseract for OCR.However it is not able to recognize the characters even though the image is of high quality.The image specifications are as follows.
1 bpp, uncompressed, 1280 * 960 , 300dpi horizontal and vertical resolution
Following are the image processing operations I carry out in sequence using leptonica
pixConvertTo8
pixBackgroundNormSimple
pixOtsuAdaptiveThreshold
pixContrastTRC {Regarding this - I am passing high values like 1.0 or even 5.0 but image doesnt really change}
pixFindSkew
pixRotate { rotate by angle found by pixFindSkew}
pixRotate90 {do this 4 times to read image in all 4 orientations}
pixClipRectangle {crop image}
Finally tesseract command
I get garbage characters in the output.A sample Input Image is as follows.
The output that i get is as follows
Final K-1
II]
s h d | K-1 ,.,
(F°o.~?n‘i&1) 5/>.©12 mm E2‘;
Deparlrnenl of tho Treasury , ,
I 1 I l I
‘mama, Ravenuo SGMW For cnlundm your 201), ‘ " °F°$ "'100fTIO
or lax yum boqmnnnq 7 _ 20\Q_
‘ 7660
and ondmg _ W vv I go
Beneï¬ciary's Share of Income, Deductions,
cl'editS, etc. F 800 buck 01 loam nnd lnstruoflons»
___lnformatI0n About mo Estate or Trust
‘ Ordmary d|v|dm
i 12113
_
‘; Quahfmd dlVIdG
\ 8132
3 1
Net shun-term
A Estate's at trust's omgiuym ldonnlmnluon numbol
56-0987654
B Estate's u trust‘: namo
ESTATE OF MARTHA SMITH
0 Fiduc§ary's name, address, clly, smlu‘ and /IP codo
N01 long~lerm c
\ 24043
u
‘ 28% vale gann
Ti
Unreptumd 5
Omar porfloho 4
nonbuslness lfll
/\..4........ L. ._.._ ,.
What Should i do to improve the accuracy.
Part 2:
I tried to follow this link.And created a eng.user-words.traineddata file and bazaar.train file and tried to run with "bazaar" as additional parameter.but i get "read_params_file: can't open bazaar error".
Any suggestions?
For part one,
I don't know if the image you posted up here is the actual one you've been trying to scan but when I tried it, I got this:-
Department oi the Treasury Internal Revenue Service
For cnlundm your V019, 1 ‘ '"l0T°5' |nC0m0
or tax yam boqlnnlnq , 2o12_ ‘ 7660 and ondlng I go 2: ‘ Ordinary
dlvndm " “T ' x 12113
1; Quali?ed dwnda ‘ 8132 Netshun-term:
M Not long ~terrn c
i 24043 Ab ‘ 2896 ralagann
Bene?ciary’s Share of Income, Deductions, Cfedits, etc. 5 800 back oi
form nnd Instruc?ons
| Partl Information About the state or Trust
A Estate's or IvLsl's omuoym Idonnlncnluon numhu
56-0987654
8 Estate‘: a trust‘: namo
ESTATE OF MARTHA SMITH
M: Unreptumd 5
017161 portioho : nonbuslness Inl
C Fiduc§ary's name, address, city, smlul an-(V1/If’ Eooo
It's not great but it seems a bit better than what you got. I'm using Tesseract v3 on Windows.
My basic command was:
- tesseract.exe nnm.tif nnm
For part two,
your bazaar file should be in the configs folder
.....\Tesseract-OCR\tessdata\configs\bazaar
and there's some requirements for it to be saved in a particular format, like UTF8 with only a LF at the end of the line not a CR + LF, it seems to be quite fussy about the file formats.
you can get a copy of it from http://code.metager.de/source/raw/google/tesseract-ocr/tessdata/configs/bazaar
I made a digits config file that I used for scanning some images where I was only interested in the numbers and that worked fine:
- tesseract.exe scanfile.jpg scanfile digits
The documentation for Tesseract is pretty poor and it doesn't work well on a PC.
For part one,
I think you should consider the preprocessing done by Capture2Text. It is using both Leptonica and Tesseract to OCR the images.
I am not sure about part 2.

Can a SHA-1 hash be all-zeroes?

Is there any input that SHA-1 will compute to a hex value of fourty-zeros, i.e. "0000000000000000000000000000000000000000"?
Yes, it's just incredibly unlikely. I.e. one in 2^160, or 0.00000000000000000000000000000000000000000000006842277657836021%.
Also, becuase SHA1 is cryptographically strong, it would also be computationally unfeasible (at least with current computer technology -- all bets are off for emergent technologies such as quantum computing) to find out what data would result in an all-zero hash until it occurred in practice. If you really must use the "0" hash as a sentinel be sure to include an appropriate assertion (that you did not just hash input data to your "zero" hash sentinel) that survives into production. It is a failure condition your code will permanently need to check for. WARNING: Your code will permanently be broken if it does.
Depending on your situation (if your logic can cope with handling the empty string as a special case in order to forbid it from input) you could use the SHA1 hash ('da39a3ee5e6b4b0d3255bfef95601890afd80709') of the empty string. Also possible is using the hash for any string not in your input domain such as sha1('a') if your input has numeric-only as an invariant. If the input is preprocessed to add any regular decoration then a hash of something without the decoration would work as well (eg: sha1('abc') if your inputs like 'foo' are decorated with quotes to something like '"foo"').
I don't think so.
There is no easy way to show why it's not possible. If there was, then this would itself be the basis of an algorithm to find collisions.
Longer analysis:
The preprocessing makes sure that there is always at least one 1 bit in the input.
The loop over w[i] will leave the original stream alone, so there is at least one 1 bit in the input (words 0 to 15). Even with clever design of the bit patterns, at least some of the values from 0 to 15 must be non-zero since the loop doesn't affect them.
Note: leftrotate is circular, so no 1 bits will get lost.
In the main loop, it's easy to see that the factor k is never zero, so temp can't be zero for the reason that all operands on the right hand side are zero (k never is).
This leaves us with the question whether you can create a bit pattern for which (a leftrotate 5) + f + e + k + w[i] returns 0 by overflowing the sum. For this, we need to find values for w[i] such that w[i] = 0 - ((a leftrotate 5) + f + e + k)
This is possible for the first 16 values of w[i] since you have full control over them. But the words 16 to 79 are again created by xoring the first 16 values.
So the next step could be to unroll the loops and create a system of linear equations. I'll leave that as an exercise to the reader ;-) The system is interesting since we have a loop that creates additional equations until we end up with a stable result.
Basically, the algorithm was chosen in such a way that you can create individual 0 words by selecting input patterns but these effects are countered by xoring the input patterns to create the 64 other inputs.
Just an example: To make temp 0, we have
a = h0 = 0x67452301
f = (b and c) or ((not b) and d)
= (h1 and h2) or ((not h1) and h3)
= (0xEFCDAB89 & 0x98BADCFE) | (~0x98BADCFE & 0x10325476)
= 0x98badcfe
e = 0xC3D2E1F0
k = 0x5A827999
which gives us w[0] = 0x9fb498b3, etc. This value is then used in the words 16, 19, 22, 24-25, 27-28, 30-79.
Word 1, similarly, is used in words 1, 17, 20, 23, 25-26, 28-29, 31-79.
As you can see, there is a lot of overlap. If you calculate the input value that would give you a 0 result, that value influences at last 32 other input values.
The post by Aaron is incorrect. It is getting hung up on the internals of the SHA1 computation while ignoring what happens at the end of the round function.
Specifically, see the pseudo-code from Wikipedia. At the end of the round, the following computation is done:
h0 = h0 + a
h1 = h1 + b
h2 = h2 + c
h3 = h3 + d
h4 = h4 + e
So an all 0 output can happen if h0 == -a, h1 == -b, h2 == -c, h3 == -d, and h4 == -e going into this last section, where the computations are mod 2^32.
To answer your question: nobody knows whether there exists an input that produces all zero outputs, but cryptographers expect that there are based upon the simple argument provided by daf.
Without any knowledge of SHA-1 internals, I don't see why any particular value should be impossible (unless explicitly stated in the description of the algorithm). An all-zero value is no more or less probable than any other specific value.
Contrary to all of the current answers here, nobody knows that. There's a big difference between a probability estimation and a proof.
But you can safely assume it won't happen. In fact, you can safely assume that just about ANY value won't be the result (assuming it wasn't obtained through some SHA-1-like procedures). You can assume this as long as SHA-1 is secure (it actually isn't anymore, at least theoretically).
People doesn't seem realize just how improbable it is (if all humanity focused all of it's current resources on finding a zero hash by bruteforcing, it would take about xxx... ages of the current universe to crack it).
If you know the function is safe, it's not wrong to assume it won't happen. That may change in the future, so assume some malicious inputs could give that value (e.g. don't erase user's HDD if you find a zero hash).
If anyone still thinks it's not "clean" or something, I can tell you that nothing is guaranteed in the real world, because of quantum mechanics. You assume you can't walk through a solid wall just because of an insanely low probability.
[I'm done with this site... My first answer here, I tried to write a nice answer, but all I see is a bunch of downvoting morons who are wrong and can't even tell the reason why are they doing it. Your community really disappointed me. I'll still use this site, but only passively]
Contrary to all answers here, the answer is simply No.
The hash value always contains bits set to 1.

Resources