Character recognition using Tesseract-OCR and OpenCV (Cannot capture '[' and ']')

I am trying to extract the text from an image using Tesseract-OCR and OpenCV in Python. I have attached an example image below:
It cannot capture '[' and ']' properly. The extraction output for this image (testScreenshot) is:
Elektronik Mühendisliği Bölümü
Ozturkfat)osmaniye.edu.tr
0328 8271000
The expected result is [at] instead of fat). If I change the language to English rather than Turkish, fat] is captured. Don't you think this is weird? How can I capture this properly as [at] with the Turkish setting?
Thanks in advance
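from PIL import Image
import pytesseract
# testScreenshot is the path to the image; tessdata_dir_config is a config string such as '--tessdata-dir "<path to tessdata>"'
plainText = pytesseract.image_to_string(Image.open(testScreenshot), lang='tur', config=tessdata_dir_config)
print(plainText)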
Edit: If I give it only '[' and ']', it does not capture what is inside the brackets either. An example input image is:
The output:
rolfat)
rolfat)
As you can see, the right half of the image ([at]) is not captured because I removed the beginning text (rol). Somehow, it is sensitive to the characters [ and ]. They might be sharper in the image than the other characters. Could this be the reason?
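One thing that may be worth trying, although it is not confirmed for this image: Tesseract accepts several languages joined with `+`, so the Turkish model can be loaded together with the English one that did read at least one bracket. The sketch below assumes both the `tur` and `eng` traineddata files are installed; `join_langs` and `ocr_with_langs` are hypothetical helper names, not part of pytesseract.

```python
def join_langs(*codes):
    """Build the value for Tesseract's language option, e.g. 'tur+eng'."""
    return "+".join(codes)


def ocr_with_langs(image_path, *codes, config=""):
    """Run OCR on image_path with all the given language models loaded."""
    # Imported lazily so join_langs works even without the OCR stack installed.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(
        Image.open(image_path), lang=join_langs(*codes), config=config
    )


print(join_langs("tur", "eng"))
```

A call such as `ocr_with_langs(testScreenshot, "tur", "eng", config=tessdata_dir_config)` would then replace the single-language call in the question. Whether the brackets come out correctly still depends on the image and the models.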

Related

Tesseract-OCR cannot capture '[' and ']'

I am trying to extract the text from an image using Tesseract-OCR and OpenCV in Python. I have attached a simple image below. I created this image in Paint, so there is no noise and no need for pre-processing.
Scenario 1:
from PIL import Image
import pytesseract
plainText = pytesseract.image_to_string(Image.open(testScreenshot), lang='tur', config=tessdata_dir_config)
print(plainText)
Output:
İtestöü)
Scenario 2:
from PIL import Image
import pytesseract
plainText = pytesseract.image_to_string(Image.open(testScreenshot), lang='eng', config=tessdata_dir_config)
print(plainText)
Output:
[testou]
Still, I cannot capture even very simple text properly. If I change the language setting to English, it captures the brackets but misses the Turkish characters, which is acceptable. However, the result with the Turkish setting (Scenario 1) is not acceptable because the brackets are missing. Any suggestions?
tesseract v5.0.0-alpha.20200328
leptonica-1.78.0
libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.3) : libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
Found AVX2
Found AVX
Found FMA
Found SSE

How to extract text with math symbols using pytesseract/tesseract version 4.0 (using equ.traineddata). 'equ' is no longer supported

How can I use Tesseract to extract a mathematical equation?
While reading the image given below:
after using:
import cv2
import pytesseract
img = cv2.imread(IN_PATH + 'sample1.png')  # IN_PATH: folder containing the image
pytesseract.image_to_string(img)
I get the result as:
'The value of 7/8144 is\n- (a) 20.2 (b) 20.16\n(c) 20.12 (d) 20.4'
With the older versions, I could have used
config='-l eng + equ'
pytesseract.image_to_string(img,config=config)
but equ is no longer supported in Tesseract 4.0+.
I have the equ.traineddata file too, but I do not know how to make it work. When I tried to paste it into /usr/share/tesseract-ocr/4.00/tessdata/, I got an error saying that it cannot be copied.
Please help: how can I extract text that contains simple mathematical symbols?
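The copy error is most likely a permissions issue, since /usr/share is normally writable only by root. A possible way around it is to keep the traineddata files in a folder you own and point Tesseract at it with `--tessdata-dir`. The sketch below only builds the config string. The folder path is hypothetical, and `--oem 0` (legacy engine) is an assumption: equ.traineddata is an old-style model, so it may not work with the LSTM engine, and the `eng` file in that folder would also need to contain legacy data.

```python
def tessdata_config(tessdata_dir, langs=("eng", "equ"), oem=0):
    """Build a pytesseract config string that uses a custom tessdata folder.

    oem=0 selects the legacy engine; this is an assumption about what
    equ.traineddata needs, not something confirmed in the question.
    """
    lang_arg = "+".join(langs)
    return f'--tessdata-dir "{tessdata_dir}" --oem {oem} -l {lang_arg}'


# Hypothetical folder holding eng.traineddata and equ.traineddata.
config = tessdata_config("/home/user/tessdata")
print(config)
```

The result would be passed as `pytesseract.image_to_string(img, config=config)`.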

How can I solve " character U+1f604 is above the range (U+0000-U+FFFF) allowed by Tcl" error problem in Spyder?

import nltk
import spectral as spy
namedEnt = nltk.ne_chunk(tagged_words)
namedEnt.draw()
spy.ImageView
dir(spy)
The variable explorer shows the part of speech of every word, but when I execute this code to display the NLTK tree, the following error occurs:
character U+1f604 is above the range (U+0000-U+FFFF) allowed by Tcl
Please help me solve it.
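U+1F604 is an emoji, and the error says this Tcl/Tk build only accepts characters up to U+FFFF. One possible workaround, sketched below, is to drop such characters from the tokens before calling `nltk.ne_chunk` and `draw()`. It assumes `tagged_words` is a list of `(word, tag)` pairs as produced by `nltk.pos_tag`; the helper names and the sample data are hypothetical.

```python
def strip_non_bmp(text):
    """Drop characters above U+FFFF, which this Tcl/Tk build cannot display."""
    return "".join(ch for ch in text if ord(ch) <= 0xFFFF)


def clean_tagged_words(tagged_words):
    """Clean each (word, tag) pair and drop words that end up empty."""
    cleaned = []
    for word, tag in tagged_words:
        word = strip_non_bmp(word)
        if word:
            cleaned.append((word, tag))
    return cleaned


# Hypothetical sample; real input would come from nltk.pos_tag(...).
sample = [("Hello", "NNP"), ("\U0001f604", "NN"), ("world\U0001f604", "NN")]
print(clean_tagged_words(sample))
```

With this in place, `nltk.ne_chunk(clean_tagged_words(tagged_words))` would receive only characters Tcl can render. The emoji are lost from the drawing, which may or may not be acceptable.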

'Missing $ inserted' error message when converting jupyter notebook to pdf with nbconvert

When attempting to convert a jupyter notebook to pdf with the following command:
jupyter nbconvert --to pdf "Search and Other Content Finding Features.ipynb"
I'm getting an error message:
! Missing $ inserted.
<inserted text>
$
l.380 ... Other Content Finding Features_10_0.png}
?
! Emergency stop.
<inserted text>
$
l.380 ... Other Content Finding Features_10_0.png}
I've found some discussion of what that is here.
However, I can't find these characters in my code. Could there be another cause?
For me it was a different, although related, issue: underscores. I assume the cause is that text in cells marked as Raw Text is passed directly to LaTeX, where it can be interpreted as LaTeX code. Maybe it is the underscores in your figure's name?
At some point, I had a raw cell with three underscores ___ which made the conversion break. The temporary solution was to convert the cell from raw to markdown (and not run it) so that it appears in the pdf.
To find the error, I used the following conversion (taken from this answer):
jupyter nbconvert thenotebook.ipynb --to latex
Another, related error was caused by a link containing underscores:
[text](https://en.wikipedia.org/wiki/Python_(programming_language))
This was also in a Raw Text cell, which I converted to markdown to generate the pdf. The formatting (colors, links) is different, though.
Last note: My file's name also contains spaces, but that wasn't an issue at all!
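Once the notebook has been converted with `--to latex`, the `l.380` in the error message can be matched against the generated .tex file. The following is a minimal sketch of that lookup; the function names are hypothetical, and the log text is a shortened copy of the message from the question.

```python
import re


def error_line_numbers(log_text):
    """Return the distinct line numbers TeX reports as 'l.<number>' in a log."""
    numbers = []
    for match in re.finditer(r"^l\.(\d+)", log_text, flags=re.MULTILINE):
        n = int(match.group(1))
        if n not in numbers:
            numbers.append(n)
    return numbers


def show_context(tex_source, line_number, radius=2):
    """Return the lines around line_number (1-based) of the .tex source."""
    lines = tex_source.splitlines()
    start = max(line_number - 1 - radius, 0)
    end = min(line_number + radius, len(lines))
    return [f"{i + 1}: {lines[i]}" for i in range(start, end)]


log = "! Missing $ inserted.\n<inserted text>\n$\nl.380 ... Features_10_0.png}\n"
print(error_line_numbers(log))
```

Reading the .tex file into a string and calling `show_context(tex_source, 380)` would then print the lines LaTeX stumbled over.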
A very common gotcha here might be the following:
Leading or trailing spaces are not allowed in the pandoc extension tex_math_dollars, which is used by nbconvert.
This means that the following won't work:
$ \epsilon \gt 0 $
And we see the error message:
! Missing $ inserted.
<inserted text>
$
l.364 \$ \epsilon
\gt 0 \$
?
! Emergency stop.
<inserted text>
$
l.364 \$ \epsilon
\gt 0 \$
No pages of output.
Transcript written on notebook.log.
The correct formula without spaces works fine:
$\epsilon \gt 0$
This seems to be a bug in Jupyter nbconvert.
The pandoc documentation suggests that this is by design, so that dollar signs can be used in ordinary text without escaping them:
Anything between two $ characters will be treated as TeX math. The
opening $ must have a non-space character immediately to its right,
while the closing $ must have a non-space character immediately to its
left, and must not be followed immediately by a digit. Thus, $20,000
and $30,000 won’t parse as math. If for some reason you need to
enclose text in literal $ characters, backslash-escape them and they
won’t be treated as math delimiters.
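The rule quoted above can be turned into a rough clean-up pass over markdown cell text. The sketch below is a heuristic under stated assumptions, not part of nbconvert or pandoc: it only looks at single-line `$...$` spans, ignores `$$...$$`, does not know about code blocks, and skips a closing `$` that is followed by a digit so that prices such as $20,000 are left alone. The function name is hypothetical.

```python
import re

# A single-dollar span on one line; "$$...$$" and "\$" are not treated as openers.
INLINE_MATH = re.compile(r"(?<![\\$])\$(?!\$)([^$\n]+?)\$(?![$\d])")


def tighten_inline_math(markdown):
    """Remove spaces just inside $...$ so tex_math_dollars accepts the span."""
    def fix(match):
        inner = match.group(1)
        stripped = inner.strip()
        if stripped and stripped != inner:
            return "$" + stripped + "$"
        return match.group(0)

    return INLINE_MATH.sub(fix, markdown)


print(tighten_inline_math(r"Let $ \epsilon \gt 0 $ hold."))
```

Spans that are already tight, and text that merely contains dollar amounts, pass through unchanged.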
The problem in this case seems to have been caused by my notebook's filename. I don't fully understand what caused the problem, but the error message above includes a reference to some text:
... Other Content Finding Features_10_0.png}.
That text includes _ which can cause this error. I think what happens is that somewhere in the conversion script, if there are spaces in the filename, a file is generated with underscores as shown, and that then triggers the error. (This seems a little bit like a bug to me, or at least a weakness).
The fix that worked for me was simply to change the jupyter notebook's filename not to include any spaces. Then the conversion ran without a hitch.
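Renaming can also be scripted. This is a small sketch with a hypothetical helper name; it replaces spaces with hyphens rather than underscores, since underscores are suspected of causing trouble elsewhere in this thread.

```python
from pathlib import Path


def space_free_name(filename, replacement="-"):
    """Return filename with runs of whitespace replaced (extension kept)."""
    path = Path(filename)
    stem = replacement.join(path.stem.split())
    return str(path.with_name(stem + path.suffix))


print(space_free_name("Search and Other Content Finding Features.ipynb"))
```

The actual rename would be `Path(old_name).rename(space_free_name(old_name))`, followed by the usual `jupyter nbconvert --to pdf` call on the new name.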
For me it was caused by a significant difference between LaTeX and MathJax. For example, the cases environment can be rendered outside math mode by MathJax, which is what Jupyter Notebook uses by default. In LaTeX, however, it causes the "Missing $ inserted" error. The error message disappeared after I corrected the syntax in the Markdown cells.

How do I get accurate text using Tesseract OCR in iOS?

I am working on an iPhone application. I need to get text from images, and after googling I found that Tesseract can do that. It works, but the results are not accurate. I used this and processed the image, but I am still not getting good results.
Tesseract* tesseract = [[Tesseract alloc] initWithDataPath:@"tessdata" language:@"eng"];
UIImage *selectedImage=[UIImage imageNamed:@"download.jpg"];
[tesseract setImage:selectedImage];
ImageWrapper *greyScale=Image::createImage(selectedImage, selectedImage.size.width+100, selectedImage.size.height+100);
ImageWrapper *edges = greyScale.image->autoLocalThreshold();
[tesseract setImage:edges.image->toUIImage()];
[tesseract recognize];
NSLog(@"%@", [tesseract recognizedText]);
I used the image below for testing, but I am getting results like this:
.-|llIAT&T JG H109 PM ED
' '» "rr ~ ‘
ma» mania-J ‘E,
‘M, 4 ., -_
\ ~ \ Download Image 53.0 KB \
_11.04 PM
| Hey | am in buenos aires right
‘now. Check out this mm phfllu 111:5 PM
|' lam in Budapest on WiF. n is \
maePMu 001d here. ;
l 1 .
, ‘
l, .
11.05 PM u, .——; _
| Nice picture. Let me send you
an audio nuke. _11 08PM
How can I solve this issue? If anyone has worked on this, please guide me. Thanks in advance.
I tried recognising my image with the ABBYY Cloud OCR SDK.
I solved it like this: I extracted the text and exported it in XML format. This format contains the recognized text together with its structure and parameters, described in XML. The par tag corresponds to one paragraph of recognized text. After getting the text from the XML, you can work with it as you want.
I processed chat screenshots with the following settings:
"…/processImage?language=English&profile=documentConversion&exportFormat=xml"
and got the attached XML files. These images are processed correctly, and each dialog block is detected as a separate paragraph.
Hope the information is helpful.
Thanks to the ABBYY OCR SDK team for providing the solution.
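Pulling the paragraphs out of such an export could look roughly like the sketch below. It relies only on what is stated above, namely that each par element is one paragraph. The sample XML is hypothetical and heavily simplified (the real export has more nesting and attributes), and the function names are made up for illustration.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip an XML namespace prefix such as '{uri}par' -> 'par'."""
    return tag.rsplit("}", 1)[-1]


def paragraphs(xml_text):
    """Return the text of every <par> element, one string per paragraph."""
    root = ET.fromstring(xml_text)
    result = []
    for element in root.iter():
        if local_name(element.tag) == "par":
            # Collapse all whitespace inside the paragraph to single spaces.
            text = " ".join("".join(element.itertext()).split())
            if text:
                result.append(text)
    return result


# Hypothetical sample, not the actual export format.
sample = """<document>
  <page>
    <par><line>First message.</line></par>
    <par><line>Second message, first line</line>
         <line>and second line.</line></par>
  </page>
</document>"""
print(paragraphs(sample))
```

Matching on the local tag name means the function keeps working if the real file declares an XML namespace.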
I tried to recognise your image with the ABBYY Cloud OCR SDK and decided to share the result with you.
I think it's rather accurate:
You can try demo recognition here: http://cloud.ocrsdk.com/demo (it's a marketing tool without the ability to extract data).
I work for ABBYY and am ready to help you. Just let me know in the comments.
