How do I get accurate text using Tesseract OCR in iOS? - ios

I am working on an iPhone application in which I need to extract text from images. After some googling I found that Tesseract can do that. It works, but the results are not accurate. I used this and preprocessed the image, but I am still not getting good results.
Tesseract* tesseract = [[Tesseract alloc] initWithDataPath:@"tessdata" language:@"eng"];
UIImage *selectedImage=[UIImage imageNamed:@"download.jpg"];
[tesseract setImage:selectedImage];
ImageWrapper *greyScale=Image::createImage(selectedImage, selectedImage.size.width+100, selectedImage.size.height+100);
ImageWrapper *edges = greyScale.image->autoLocalThreshold();
[tesseract setImage:edges.image->toUIImage()];
[tesseract recognize];
NSLog(@"%@", [tesseract recognizedText]);
I used the image below for testing, but I am getting results like this:
.-|llIAT&T JG H109 PM ED
' '» "rr ~ ‘
ma» mania-J ‘E,
‘M, 4 ., -_
\ ~ \ Download Image 53.0 KB \
_11.04 PM
| Hey | am in buenos aires right
‘now. Check out this mm phfllu 111:5 PM
|' lam in Budapest on WiF. n is \
maePMu 001d here. ;
l 1 .
, ‘
l, .
11.05 PM u, .——; _
| Nice picture. Let me send you
an audio nuke. _11 08PM
How can I solve this issue? If anyone has worked on this, please guide me. Thanks in advance.

I tried recognising my image with the ABBYY Cloud OCR SDK.
This is how I solved it: I extracted the text and exported it in XML format. The XML contains the recognized text together with its structure and parameters. The par tag corresponds to one paragraph of recognized text. Once you have the text from the XML, you can work with it however you want.
I processed chat screenshots with the following settings:
"…/processImage?language=English&profile=documentConversion&exportFormat=xml"
and got the attached XML files. The images were processed correctly, and each dialog block was detected as a separate paragraph.
Hope the information is helpful.
Thanks to the ABBYY OCR SDK team for providing the solution.

I tried to recognise your image with the ABBYY Cloud OCR SDK and decided to share the result with you.
I think it's rather accurate:
You can try demo recognition here: http://cloud.ocrsdk.com/demo (it's a marketing tool without the ability to extract data).
I work for ABBYY and am ready to help you. Just let me know in the comments.

Related

Character recognition using Tesseract-OCR and OpenCV (Cannot capture '[' and ']')

I am trying to extract the text from an image using Tesseract-OCR and OpenCV in Python. I have attached an example image below:
It cannot capture '[' and ']' properly. The extraction output of this image is (testScreenshot):
Elektronik Mühendisliği Bölümü
Ozturkfat)osmaniye.edu.tr
0328 8271000
The expected result is [at] instead of fat). If I change the language to English rather than Turkish, fat] is captured. Don't you think this is weird? How can I capture this properly as [at] with the Turkish setting?
Thanks in advance
from PIL import Image
import pytesseract
plainText = pytesseract.image_to_string(Image.open(testScreenshot), lang='tur', config=tessdata_dir_config)
print(plainText)
Edit: If I give it only the part with '[' and ']', it does not capture the inside of the brackets either. The example input image is:
The output:
rolfat)
rolfat)
As you can see, the right half of the image ([at]) is not captured because I removed the beginning text (rol). Somehow, it is sensitive to the characters [ and ]. They might be sharper in the image than the other characters. Could this be the reason?

How to extract text with math symbols using pytesseract/tesseract version 4.0 (using equ.traineddata). 'equ' is no longer supported

How can I use Tesseract to extract a mathematical equation?
While reading the image given below:
after using:
img = cv2.imread(IN_PATH+'sample1.png')
pytesseract.image_to_string(img)
I get the result as:
'The value of 7/8144 is\n- (a) 20.2 (b) 20.16\n(c) 20.12 (d) 20.4'
With the older versions, I could have used
config='-l eng + equ'
pytesseract.image_to_string(img,config=config)
but equ is no longer supported in Tesseract 4.0+.
I have the equ.traineddata file too, but I do not know how to make it work. When I tried to paste it into /usr/share/tesseract-ocr/4.00/tessdata/, I got an error saying that it cannot be copied.
Please help: how can I extract text that contains simple mathematical symbols?

How to recognise the accurate text from an image using Tesseract OCR Library in iOS?

I am creating an iPhone application in Objective-C. I am trying to recognise the text in an image (taken with the camera). For this, I use the Tesseract OCR library in my app. It works fine for some text, but the results for the captured image are not accurate. I also have the latest tessdata file from Google Code.
I added the Tesseract library from this link.
Below is the image that I tried to recognise:
My code is as follows:
G8Tesseract *tesseract = [[G8Tesseract alloc] initWithLanguage:@"eng+fra" engineMode:G8OCREngineModeTesseractCubeCombined];
[tesseract setVariableValue:@"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:;,.!-()#&÷" forKey:@"tessedit_char_whitelist"];
tesseract.pageSegmentationMode = G8PageSegmentationModeAuto;
tesseract.maximumRecognitionTime = 60.0;
tesseract.image = [selectedImage g8_blackAndWhite];
[tesseract recognize];
NSLog(@"%@", [tesseract recognizedText]);
But I am getting results like this:
BAZAAR
mm; l Savees l smmamm l mm; l Accessories
commemw Street ' _ . «mm. me o snwapnagay
www minabazaav.cum
I have already gone through these links:
How do I get accurate text using Tesseract OCR in iOS?
Why Tesseract OCR library (iOS) cannot recognize text at all?
http://www.scriptscoop2.com/t/42247286510f/c-3.5-why-i-am-not-able-to-recognize-text-in-image-using-tesseract.html.
Does anyone else experience the same problem?
In my case the Tesseract library was not accurate most of the time. ABBYY was reasonably good instead, but ABBYY does not work offline.
Abby Stackoverflow Channel

Corona extract each character from string

I'm a beginner in game development using Corona. Can you please help me: how do I get every character in a word, add a background image to each one, and make it clickable, like in the game 4 Pics 1 Word? Can you suggest some ideas or a tutorial link? Thanks. I don't have enough reputation to put an image here, but here is the Screenshot Link
Here are some pieces that may help you
1. Iterate over each character
local str="Something"
for i = 1, str:len() do
print(str:sub(i,i));
end
2. Load the image
-- letter is assumed to hold the current character, e.g. str:sub(i,i)
local img = display.newImage("images/" .. letter .. "1.jpg");
If you need anything specific, ask here.
To iterate over each character, try also this:
local str="Something"
for c in str:gmatch(".") do
print(c)
end
(Actually, this iterates over each byte in the string, which may not be what you want if the string contains Unicode characters.)
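The byte-versus-character difference mentioned above is easy to see in a short sketch. It is written in Python rather than Lua purely for illustration, and the sample word is an assumption, not something from the question:

```python
# Illustration only: the sample word is made up.
word = "caf\u00e9"  # "café", with a single precomposed e-acute

# Iterating over a Python str yields one item per Unicode code point.
chars = list(word)

# Iterating over the UTF-8 encoding yields one item per byte, which is
# roughly what a byte-wise loop over a Lua string would see.
raw_bytes = list(word.encode("utf-8"))

print(len(chars))      # number of characters
print(len(raw_bytes))  # number of bytes; the accented letter takes two in UTF-8
```

So for plain ASCII words both loops behave the same, and the distinction only matters once accented or non-Latin letters are involved.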

Correct word-count of a LaTeX document

I'm currently searching for an application or a script that does a correct word count for a LaTeX document.
Up till now, I have only encountered scripts that work on a single file, but what I want is a script that can safely ignore LaTeX keywords and also traverse linked files, i.e. follow \include and \input links to produce a correct word count for the whole document.
With vim, I currently use ggVGg CTRL+G but obviously that shows the count for the current file and does not ignore LaTeX keywords.
Does anyone know of any script (or application) that can do this job?
I use texcount. The webpage has a Perl script to download (and a manual).
It will include tex files that are included (\input or \include) in the document (see -inc), supports macros, and has many other nice features.
When following included files you will get detail about each separate file as well as a total. For example here is the total output for a 12 page document of mine:
TOTAL COUNT
Files: 20
Words in text: 4188
Words in headers: 26
Words in float captions: 404
Number of headers: 12
Number of floats: 7
Number of math inlines: 85
Number of math displayed: 19
If you're only interested in the total, use the -total argument.
I went with icio's comment and did a word-count on the pdf itself by piping the output of pdftotext to wc:
pdftotext file.pdf - | wc -w
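If you would rather call this from a script than from the shell, here is a minimal sketch of the same pipeline in Python. It assumes pdftotext is installed and on your PATH; count_words and pdf_word_count are just names I made up for the illustration:

```python
import subprocess


def count_words(text):
    """Count whitespace-separated words, much like `wc -w` does."""
    return len(text.split())


def pdf_word_count(pdf_path):
    """Word count of a PDF via `pdftotext file.pdf -` (text goes to stdout)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,  # raise if pdftotext fails, e.g. the file is missing
    )
    return count_words(result.stdout)
```

Calling pdf_word_count("file.pdf") should then give roughly the same number as the shell one-liner above.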
latex file.tex
dvips -o - file.dvi | ps2ascii | wc -w
should give you a fairly accurate word count.
To add to @aioobe,
If you use pdflatex, just do
pdftops file.pdf
ps2ascii file.ps|wc -w
I compared this count to the count in Microsoft Word for a 1599-word document (according to Word). pdftotext produced a text with 1700+ words. texcount did not include the references and produced 1088 words. ps2ascii returned 1603 words, 4 more than Word.
I say that's a pretty good count. I am not sure where the 4-word difference comes from, though. :)
In the Texmaker interface you can get the word count by right-clicking in the PDF preview:
Overleaf has a word count feature:
Overleaf v2:
Overleaf v1:
I use the following VIM script:
function! WC()
let filename = expand("%")
let cmd = "detex " . filename . " | wc -w | perl -pe 'chomp; s/ +//;'"
let result = system(cmd)
echo result . " words"
endfunction
… but it doesn’t follow links. This would basically entail parsing the TeX file to get all linked files, wouldn’t it?
The advantage over the other answers is that it doesn’t have to produce an output file (PDF or PS) to compute the word count so it’s potentially (depending on usage) much more efficient.
Although icio’s comment is theoretically correct, I found that the above method gives quite accurate estimates for the number of words. For most texts, it’s well within the 5% margin that is used in many assignments.
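On the question of following links: collecting the linked files does not need a full TeX parser if you accept some rough edges. Below is a minimal sketch in Python, not part of the Vim script above. It assumes that every \input and \include path is relative to the directory of the main file, and it ignores commented-out lines and other include mechanisms. linked_files is a hypothetical helper name.

```python
import re
from pathlib import Path

# Matches \input{name} and \include{name}; deliberately naive.
INCLUDE_RE = re.compile(r"\\(?:input|include)\{([^}]+)\}")


def linked_files(main_file):
    """Return the main file plus every file reachable via \\input/\\include."""
    root = Path(main_file).resolve().parent
    seen = []

    def visit(path):
        # LaTeX lets you omit the .tex extension, so add it when missing.
        if path.suffix != ".tex":
            path = path.with_name(path.name + ".tex")
        # Skip files already visited (avoids loops) and missing files.
        if path in seen or not path.is_file():
            return
        seen.append(path)
        text = path.read_text(encoding="utf-8")
        for name in INCLUDE_RE.findall(text):
            # Assumption: paths are relative to the main file's directory.
            visit(root / name.strip())

    visit(Path(main_file).resolve())
    return seen
```

Each path in the returned list could then be passed to detex and the counts summed. texcount with -inc already does this properly, so this is only worth it if you want to stay with the detex approach.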
If the use of a vim plugin suits you, the vimtex plugin has integrated the texcount tool quite nicely.
Here is an excerpt from their documentation:
:VimtexCountLetters Shows the number of letters/characters or words in
:VimtexCountWords the current project or in the selected region. The
count is created with `texcount` through a call on
the main project file similar to: >
texcount -nosub -sum [-letter] -merge -q -1 FILE
<
Note: Default arguments may be controlled with
|g:vimtex_texcount_custom_arg|.
Note: One may access the information through the
function `vimtex#misc#wordcount(opts)`, where
`opts` is a dictionary with the following
keys (defaults indicated): >
'range' : [1, line('$')]
'count_letters' : 0/1
'detailed' : 0
<
If `detailed` is 0, then it only returns the
total count. This makes it possible to use for
e.g. statusline functions. If the `opts` dict
is not passed, then the defaults are assumed.
*VimtexCountLetters!*
*VimtexCountWords!*
:VimtexCountLetters! Similar to |VimtexCountLetters|/|VimtexCountWords|, but
:VimtexCountWords! show separate reports for included files. I.e.
presents the result of: >
texcount -nosub -sum [-letter] -inc FILE
<
*VimtexImapsList*
*<plug>(vimtex-imaps-list)*
The nice part about this is how extensible it is. On top of counting the number of words in your current file, you can make a visual selection (say two or three paragraphs) and then only apply the command to your selection.
For a very basic article class document I just look at the number of matches for a regex to find words. I use Sublime Text, so this method may not work for you in a different editor, but I just hit Ctrl+F (Command+F on Mac) and then, with regex enabled, search for
(^|\s+|"|((h|f|te){)|\()\w+
which should ignore text declaring a floating environment or captions on figures as well as most kinds of basic equations and \usepackage declarations, while including quotations and parentheticals. It also counts footnotes and \emphasized text and will count \hyperref links as one word. It's not perfect, but it's typically accurate to within a few dozen words or so. You could refine it to work for you, but a script is probably a better solution, since LaTeX source code isn't a regular language. Just thought I'd throw this up here.
