Improve accuracy of underlined text - image-processing

Some underlines in my image are very close to the text, and for that text Tesseract is unable to produce accurate results. I have attached the image and text file. Is there any way I can increase the accuracy of the text?
I have tried to remove the underlines with some image processing techniques, but the lines that are close to the text are not getting removed.
And are there any parameters in Tesseract that I can use to improve the accuracy? Thanks in advance.
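One common approach is to detect the underlines with a wide horizontal morphological kernel and subtract them before OCR. A minimal sketch, assuming OpenCV and pytesseract ("form.png" and the kernel width are placeholders to tune, not taken from the question):
# Sketch: remove long horizontal underlines before OCR.
# "form.png" and the (40, 1) kernel width are assumptions to tune.
import cv2
import pytesseract

img = cv2.imread("form.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# A kernel much wider than any glyph responds to the underlines only.
kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
lines = cv2.morphologyEx(binary, cv2.MORPH_OPEN, kernel)

# Subtract the detected lines, then flip back to black-on-white for Tesseract.
cleaned = cv2.bitwise_and(binary, cv2.bitwise_not(lines))
print(pytesseract.image_to_string(cv2.bitwise_not(cleaned)))
On the Tesseract side, the main parameters worth experimenting with are the page segmentation mode (--psm) and the OCR engine mode (--oem).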
The image I am trying to run:
Its result:
ARR!
D.
1.
\OCIJHJO‘LI'IJ?‘
3..
10.
E.
F.
SITE NUMBER
ARCHEOLOGICAL DESCRIPTION
General site description SITE IS COVERED WITH LARGE PINES AND IS IN RELATIVLY
GOOD CONDITION, snowING'EITTrE‘SIGNS‘OFTRosmN—EXCEPT—AEONG-Tmm
_"—""NHERE IT DROPS or INTO FLOODPLAIN OF CREEK THERE ARE A EEN ANIMAL TRAILS THAT
HAVE APPARENTLY ERODED OUT IN THE PAST. ONE OF THESE WAS QUIET DEEP ACCORDING
“TO AUGER TEST, BUT HAS FILLED UP WITH SAND AND GROWN OVER AGAIN. FIRST AUGER TEST
“WAS INTO THIS DEE P"GULLY" AND GAVE A FALSE IMPRESSION AS TO THE TRUE DEPTH OF
SITE. THIS TEST HOLE PRODUCED LIEHLQ FLAKES ALL THE WAY DOWN TO 42 INCHES AND
_m STERILE SAND DOWN TO 60 INCHES= REST OF SITE PRODUCED SAND AND CHIPS ONLY TO
l- an I ' A: : I L I i : ‘5!) THIS 3 1.0 5.- 3.. 'Y __
FINE SITE.
Site size .AT L S - E Y CONSIDERABLY MQBE
Nature of archeological deposition EAIBIEIHNDESTURBED EXCEPT ALONG THE EDGES OF SITE
T D0.
Site depth. 20-22 INCHES
Hidden
Faunal preservation
Floral preservation
Human remains
Cultural features (type and number)
Charcoal preservation
DATA RECOVERY METHODS
Ground surface visibility: 0% x 1-251 26—50% 51-75% 76—100%
Description of ground cover iMATURE PINE FOREST
Time spent collecting Number of peeple collecting
Description of surface collecting methods
Type and extent of testing and/or excavation FIVE TEST HOLES WERE SUNK IN SITE WITH 8"
AUGERa THESE WERE TAKEN DOWN IN 6" LEVELS UNTIL STERILE CLAY WAS REACHED. DIRTTA T-
FROM EACH 6" LEVEL WAS SCREENED THROUGH_l/4" WIRE MESH AND ARTIFACTS KEPT FOR
ANALYSIS. ALL TEST HOLES QERE PLQIIED EIIE TRANSIT IN RELATION TO DATUM MARKER
WHI IS A PIPE ‘ _ -: fl' : 3:0. . .: U' J I: : : . !" uFF 3L
GROUND. P__\l: IS I : um \I' :i “I ' I ' .M' I ' D' . I’ I 2! ti 0 .1. ' -. _ .L l' .
ARCHEOLOGICAL COMPONENTS
Paleo-Indian Late Whodland 17th century
Early Archaic Mississippian 18th century
Middle Archaic Late prehistoric 19th century
Late Archaic Unknown prehistoric ___ 20th century __
Early Woodland Ceramic prehistoric ____ Unknown historic
Middle Woodland 16th century

Language detection using pycld2

I am trying to use the pycld2 package to detect multiple languages in text. This is the example I am testing out:
import pycld2 as cld2
text = '''The universal connection with an additional advantage: Push-in connection. Terminate solid and stranded (Class B 7 strands or less), as well as ferruled conductors, by simply pushing them in – no tools required. La connessione universale con un ulteriore vantaggio: Connessione push-in. Terminare solido e incagliato (trefoli di classe B 7 o meno), così come i conduttori a puntale, semplicemente spingendoli in – nessun attrezzo richiesto. Der universelle Anschluss mit zusätzlichem Vorteil: Push-in-Anschluss Vollständig und verseilt abschließen (Klasse B 7 Stränge oder weniger), sowie Aderendhülsen durch einfaches Aufschieben in – kein Werkzeug erforderlich.'''
reliable, index, top_3_choices, vecs = cld2.detect(text, returnVectors=True)
The top 3 detected languages are the following:
print(top_3_choices)
(('GERMAN', 'de', 34, 1089.0), ('ITALIAN', 'it', 33, 355.0), ('ENGLISH', 'en', 32, 953.0))
According to the documentation, the confidence score is the fourth element of each tuple, and the third element corresponds to the percentage of the original text detected as the respective language. I am struggling, though, with how to interpret the score so I can flag the confidence of the detected language. Can I somehow normalize the score to get some form of interpretable probability?
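There is no documented way to turn these scores into calibrated probabilities. One rough heuristic (an assumption about how to use the values, not pycld2's semantics) is to rescale the scores so they sum to 1, which at least makes the candidates comparable within a single call:
# Rough heuristic only: pycld2 scores are internal quality measures,
# not probabilities, so this just rescales them to sum to 1.
def pseudo_probabilities(choices):
    # Each choice is (language_name, code, percent_of_text, score).
    total = sum(score for _, _, _, score in choices) or 1.0
    return {code: score / total for _, code, _, score in choices}

print(pseudo_probabilities(top_3_choices))
# -> roughly {'de': 0.45, 'it': 0.15, 'en': 0.40} for the tuples above
Weighting each score by its text percentage (the third element) is another option if coverage matters as much as score strength.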

Getting PyTesseract to work on cropped images of digits - unable to get correct digits

I am trying to read digits from the face of a six-sided die, using a cropped image of just the face. However, despite trying many different configurations for the image_to_string function, I mostly get no result or a bunch of rubbish letters. These are some of the configurations I have tried:
custom_config = '--oem 3 --psm 6 outputbase digits'
custom_config1 = '--psm 13 --oem 3 -c tessedit_char_whitelist=0123456789'
custom_config2 = '--psm '
custom_config3 = '-l eng --oem 3 --psm 12'
custom_config4 = '--oem 1 --psm 9'
And here is an example of some of the images I am trying it on:
Die showing 1
Die showing 2
Die showing 3
Die showing 4
Die showing 5
Die showing 6
Not sure what other configuration I could use to successfully recognize the digits in the image. I thought this would be fairly simple, but apparently not, as none of the configurations I have tried so far have worked. Can someone please try it on their own machine and find a configuration that works, or give some guidelines on what to do next?
Unfortunately, Tesseract is not the right choice for your problem: it can't handle rotated text. Even if you use OSD, you need 50+ characters, and the rotation must be 90, 180, or 270 degrees. This is not stated in the documentation, but I have been using Tesseract intensively for over two years. I would suggest that you try PaddleOCR, and if the rotation is known, rotate your die image first.
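For reference, a minimal PaddleOCR sketch (assuming the PaddleOCR 2.x Python API; 'die_face.png' is a placeholder file name, not from the question):
# Minimal PaddleOCR sketch (assumes the 2.x API).
# pip install paddlepaddle paddleocr
from paddleocr import PaddleOCR

ocr = PaddleOCR(use_angle_cls=True, lang='en')  # angle classifier helps with rotated text
result = ocr.ocr('die_face.png', cls=True)      # 'die_face.png' is a placeholder
for box, (text, confidence) in result[0]:
    print(text, confidence)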

How to grep between 2 txt files

I have two txt files.
The first txt file is like this:
sequence_id description
Solyc01g005420.2.1 No description available
Solyc01g006950.3.1 "31.4 cell.vesicle transport Encodes a syntaxin localized at the plasma membrane (SYR1 Syntaxin Related Protein 1 also known as SYP121 PENETRATION1/PEN1). SYR1/PEN1 is a member of the SNARE superfamily proteins. SNARE proteins are involved in cell signaling vesicle traffic growth and development. SYR1/PEN1 functions in positioning anchoring of the KAT1 K+ channel protein at the plasma membrane. Transcription is upregulated by abscisic acid suggesting a role in ABA signaling. Also functions in non-host resistance against barley powdery mildew Blumeria graminis sp. hordei. SYR1/PEN1 is a nonessential component of the preinvasive resistance against Colletotrichum fungus. Required for mlo resistance. syntaxin of plants 121 (SYP121)"
Solyc01g007770.2.1 No description available
Solyc01g008560.3.1 No description available
Solyc01g068490.3.1 20.1 stress.biotic Encodes a protein containing a U-box and an ARM domain. senescence-associated E3 ubiquitin ligase 1 (SAUL1)
..
.
The second txt file has the gene IDs:
Solyc02g080050.2.1
Solyc09g083200.3.1
Solyc05g050380.3.1
Solyc09g011490.3.1
Solyc04g051490.3.1
Solyc08g006470.3.1
Solyc01g107810.3.1
Solyc03g095770.3.1
Solyc12g006370.2.1
Solyc03g033840.3.1
Solyc02g069250.3.1
Solyc02g077040.3.1
Solyc03g093890.3.1
..
.
.
Each txt file has many more lines than the ones I show. I just want to know what grep command I should use so that I get only the genes that are in the second txt file, pulled from the first file with the description next to each one.
Thanks.
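A hedged suggestion, assuming the IDs appear verbatim at the start of lines in the first file: grep -F -w -f ids.txt descriptions.txt should do it, where ids.txt and descriptions.txt are placeholder names for the second and first files. -f reads the patterns from a file, -F treats them as fixed strings rather than regexes, and -w avoids partial matches. The same filtering logic in Python, as a sketch:
# Sketch of the same filtering in Python; ids.txt and descriptions.txt
# are placeholder names for the second and first files above.
with open("ids.txt") as f:
    wanted = {line.strip() for line in f if line.strip()}

with open("descriptions.txt") as f:
    for line in f:
        fields = line.split(maxsplit=1)
        if fields and fields[0] in wanted:
            print(line, end="")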

Tesseract not able to recognize characters even for a high-quality image

I am cleaning up an image using Leptonica and then passing it to Tesseract for OCR. However, Tesseract is not able to recognize the characters even though the image is of high quality. The image specifications are as follows:
1 bpp, uncompressed, 1280 × 960, 300 dpi horizontal and vertical resolution
The following are the image processing operations I carry out in sequence using Leptonica (a rough Python equivalent is sketched after the list):
pixConvertTo8
pixBackgroundNormSimple
pixOtsuAdaptiveThreshold
pixContrastTRC {regarding this, I am passing high values like 1.0 or even 5.0, but the image doesn't really change}
pixFindSkew
pixRotate {rotate by the angle found by pixFindSkew}
pixRotate90 {do this 4 times to read the image in all 4 orientations}
pixClipRectangle {crop image}
Finally, the tesseract command.
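For comparison, a rough OpenCV/pytesseract sketch of the same normalize, binarize, deskew, OCR pipeline. This is not the Leptonica API itself; the file name, blur kernel, and deskew recipe are assumptions:
# Rough OpenCV/pytesseract equivalent of the Leptonica steps above --
# for comparison only. File name and kernel sizes are assumptions.
import cv2
import numpy as np
import pytesseract

img = cv2.imread("input.tif", cv2.IMREAD_GRAYSCALE)

# Background normalization (cf. pixBackgroundNormSimple), then Otsu
# binarization (cf. pixOtsuAdaptiveThreshold).
img = cv2.divide(img, cv2.GaussianBlur(img, (51, 51), 0), scale=255)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Deskew (cf. pixFindSkew/pixRotate). OpenCV's minAreaRect angle
# convention varies by version; this follows the common deskew recipe.
coords = np.column_stack(np.where(binary == 0))
angle = cv2.minAreaRect(coords)[-1]
angle = -(90 + angle) if angle < -45 else -angle

h, w = binary.shape
M = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
deskewed = cv2.warpAffine(binary, M, (w, h), flags=cv2.INTER_CUBIC,
                          borderMode=cv2.BORDER_REPLICATE)

print(pytesseract.image_to_string(deskewed))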
I get garbage characters in the output. A sample input image is as follows.
The output that I get is as follows:
Final K-1
II]
s h d | K-1 ,.,
(F°o.~?n‘i&1) 5/>.©12 mm E2‘;
Deparlrnenl of tho Treasury , ,
I 1 I l I
‘mama, Ravenuo SGMW For cnlundm your 201), ‘ " °F°$ "'100fTIO
or lax yum boqmnnnq 7 _ 20\Q_
‘ 7660
and ondmg _ W vv I go
Beneï¬ciary's Share of Income, Deductions,
cl'editS, etc. F 800 buck 01 loam nnd lnstruoflons»
___lnformatI0n About mo Estate or Trust
‘ Ordmary d|v|dm
i 12113
_
‘; Quahfmd dlVIdG
\ 8132
3 1
Net shun-term
A Estate's at trust's omgiuym ldonnlmnluon numbol
56-0987654
B Estate's u trust‘: namo
ESTATE OF MARTHA SMITH
0 Fiduc§ary's name, address, clly, smlu‘ and /IP codo
N01 long~lerm c
\ 24043
u
‘ 28% vale gann
Ti
Unreptumd 5
Omar porfloho 4
nonbuslness lfll
/\..4........ L. ._.._ ,.
What should I do to improve the accuracy?
Part 2:
I tried to follow this link, created an eng.user-words.traineddata file and a bazaar.train file, and tried to run with "bazaar" as an additional parameter, but I get a "read_params_file: can't open bazaar" error.
Any suggestions?
For part one,
I don't know if the image you posted here is the actual one you've been trying to scan, but when I tried it, I got this:
Department oi the Treasury Internal Revenue Service
For cnlundm your V019, 1 ‘ '"l0T°5' |nC0m0
or tax yam boqlnnlnq , 2o12_ ‘ 7660 and ondlng I go 2: ‘ Ordinary
dlvndm " “T ' x 12113
1; Quali?ed dwnda ‘ 8132 Netshun-term:
M Not long ~terrn c
i 24043 Ab ‘ 2896 ralagann
Bene?ciary’s Share of Income, Deductions, Cfedits, etc. 5 800 back oi
form nnd Instruc?ons
| Partl Information About the state or Trust
A Estate's or IvLsl's omuoym Idonnlncnluon numhu
56-0987654
8 Estate‘: a trust‘: namo
ESTATE OF MARTHA SMITH
M: Unreptumd 5
017161 portioho : nonbuslness Inl
C Fiduc§ary's name, address, city, smlul an-(V1/If’ Eooo
It's not great, but it seems a bit better than what you got. I'm using Tesseract v3 on Windows.
My basic command was:
- tesseract.exe nnm.tif nnm
For part two,
your bazaar file should be in the configs folder:
.....\Tesseract-OCR\tessdata\configs\bazaar
There are some requirements for it to be saved in a particular format: UTF-8 with only an LF at the end of each line, not CR + LF. Tesseract seems to be quite fussy about file formats.
you can get a copy of it from http://code.metager.de/source/raw/google/tesseract-ocr/tessdata/configs/bazaar
I made a digits config file that I used for scanning some images where I was only interested in the numbers, and that worked fine:
- tesseract.exe scanfile.jpg scanfile digits
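For reference, the stock digits config shipped in tessdata\configs is essentially a one-line parameter file; quoted from memory, so verify against your installation:
tessedit_char_whitelist 0123456789-.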
The documentation for Tesseract is pretty poor and it doesn't work well on a PC.
For part one,
I think you should consider the preprocessing done by Capture2Text. It uses both Leptonica and Tesseract to OCR images.
I am not sure about part 2.

Mathematica's TextRecognize not up to par

Please take a look at the screenshot below and see if you can tell me why this won't work. The examples on the reference page for TextRecognize look pretty impressive; I don't think recognizing single letters like this should be a problem. I've tried resizing the letters as well as sharpening the image.
For convenience in case you want to try this yourself I have included the image that I use at the bottom of this post. You can also find plenty more like this by searching for "Wordfeud" in Google Image Search.
Very cool question!
TextRecognize uses heuristics to recognize whole words from the English language. This is the gotcha that makes recognizing single letters very hard.
Consider the following line of thought:
s = Import["http://i.stack.imgur.com/JHYuh.png"];
p = ImagePartition[s, 32]
Now pick letters to form the English word 'EXIT':
x = {p[[1, 13]], p[[6, 6]], p[[3, 13]], p[[1, 12]]}
Now clean up these images a bit, like so:
d = ImageAssemble[ Map[ImageTake[#, {3, 27}, {2, 20}] &, x ]];
Then this returns the string "EXIT":
TextRecognize[d]
This is an approach completely different from using TextRecognize, so I am posting it as a separate answer. It uses the same image recognition technique as the "How do I find Waldo with Mathematica?" question.
First get the puzzle:
wordfeud = Import["http://i.stack.imgur.com/JHYuh.png"]
And then get the pieces of the puzzle:
Grid[pieces = ImagePartition[wordfeud, 32]]
Let's be interested in the letter E:
LetterE = pieces[[4, 3]]
Get the correlation image:
correlation = ImageCorrelate[wordfeud, Binarize[LetterE], NormalizedSquaredEuclideanDistance]
And highlight the matches:
positions = Dilation[ColorNegate[Binarize[correlation, .1]], DiskMatrix[20]];
found = ImageMultiply[wordfeud, ImageAdd[ColorConvert[positions, "GrayLevel"], .5]]
As before, this requires a bit of tuning on binarizing the correlation image, but other than that this should help to identify bits and pieces of this puzzle.
I thought the quality of your image might be interfering. Binarizing your image did not help: recognition was zilch. I also tried a very sharp black-and-white image of a crossword puzzle solution (see below). Again, nothing was recognized, whether in regular or binarized format.
So I removed the black background, leaving only the letters and their thin black frames. Again, recognition was about 0%.
When I removed the frames from around some of the letters AND binarized the image, the only parts that were recognizable were those regions in which there was nothing but letters (see below).
Notice in the output below that ANTS, TIRES, and TEXAS are correctly identified (as well as VECTORS), but just about nothing else.
Notice also that, even though the strings were widely spaced, mma interpreted them as words rather than separate letters. Note "TEXAS" instead of "T E X A S".
TextRecognize[Binarize@img]
(* output *)
ANTS FFWWW FEEWF
E R o If IU I?
E A FI5F WWWFF 5
5552? L E F F
T s E NTT BT|
H0RWW#0WVlWF;EE F
5 W E ; OCS
FOFT W W R AL%AE
A TT I T ? _
i iE#W'NF WG%S W
A A EW F I i
SWWTW W ALTFCWD N
H A V 5 A F F
PLATT EWWLIGHT
W N E T
HE TIRES C
TEXAS VECTORS
I didn't have the patience to completely clean up the image. It would have been much faster to retype the text by hand.
Conclusion: Don't use text recognition in mma unless you have absolutely clear text against an even-colored, bright, preferrably white, background.
The results also varied depending on the file format used. Avoid .pdf altogether.
Edit
acl captured and tried to recognize the last 5 lines (above Edit). His results (in a comment below): mostly gibberish.
I decided to do the same. But since Prashant warned that text size makes a difference, I zoomed in first so that the text appears (to my eyes) to be about 20 pica. Below is the picture of the text I scanned and TextRecognized.
Here's the result of an unbinarized TextRecognize (at that large size):
Gliii. Q lk-ii`t`*¥ if EY £\[CloseCurlyDoubleQuote]1\[Euro]'EE \
Di'¥C~E\"P ITF SKI' T»f}!E'!',IL:?E\[CloseCurlyDoubleQuote] I 2 VEEE5\
\[CloseCurlyQuote] LEP \"- \"VE
1. ur e=\\..r.1.»».»\\\\ rw r 1»»\\|a'*r | r .fm -»'-an \
\[OpenCurlyQuote] -.-rr -_.»~|-.'i~-.w~,.-- nv n.w~»-\
\[OpenCurlyDoubleQuote]~"
Now, here's the result for the TextRecognize of the binarized image. The original image was a .png from Jing.
I didn't have the patience to completely clean up the image. It would \
have been much faster to retype the
text by hand.
Conclusion: Don't use text recognition in mma unless you have \
absolutely clear text against an even-
colored, bright, preferrably white, background.
The results also varied depending on the file format used. Avoid .pdf \
altogether.
