Training Tesseract OCR using jTessBoxEditor - iOS

Hi, I want to generate a Tesseract OCR training data file (tessdata). I'm using the jTessBoxEditor tool (on Mac OS) to do this, but I have no idea how to use it. Afterwards I want to use the tessdata file in my iOS application.
I have also been searching for this myself; here are the links I found:
http://vietocr.sourceforge.net/training.html
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3
But I have had no luck :( . So please share links that give a detailed, step-by-step walkthrough of creating the training file (tessdata file).

Here is the download for the Tesseract files:
http://code.google.com/p/tesseract-ocr/downloads/detail?name=tesseract-ocr-setup-3.02.02.exe&can=2&q=
I'm on the same page as you with getting this to work. Here is the tutorial I'm using:
http://www.resolveradiologic.com/blog/2013/01/15/training-tesseract/
I have learned that you need a .tif file with a .box file in the same folder to load the boxes.
For example:
testdata.tif
testdata.box
anotherExample.eng.tif
anotherExample.eng.box
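The pairing rule above can be checked with a few lines of Python. This is only a sketch: the function name images_missing_box is made up for illustration, and it assumes the images use a .tif or .tiff extension and that each box file sits next to its image.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".tif", ".tiff"}


def images_missing_box(folder):
    """Return the names of TIFF images in `folder` that have no matching .box file."""
    missing = []
    for image in sorted(Path(folder).iterdir()):
        if image.is_file() and image.suffix.lower() in IMAGE_SUFFIXES:
            # testdata.tif -> testdata.box
            # anotherExample.eng.tif -> anotherExample.eng.box
            if not image.with_suffix(".box").exists():
                missing.append(image.name)
    return missing
```

Calling images_missing_box on the training folder should return an empty list before you try to load the boxes.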
If you don't know how to create box files, here is an easy way once you have downloaded and installed the Tesseract files:
-> Open a command prompt and cd to your Tesseract folder, which is usually Program Files\Tesseract-OCR.
-> Run the box creator. The second argument is the output base name, and Tesseract appends .box to it:
tesseract "C:\location of the tif file\thetiffile.tif" "C:\location of the tif file\thetiffile" batch.nochop makebox
That should spit out the box file you need.
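If you would rather script this than type it, the makebox call can be assembled in Python. This is a rough sketch, not an official wrapper: makebox_command is a made-up helper name, and it assumes the output base should be the image path without its extension so that the box file ends up next to the image.

```python
from pathlib import Path


def makebox_command(image_path, tesseract="tesseract"):
    """Build the argument list for: tesseract <image> <output base> batch.nochop makebox"""
    image = Path(image_path)
    # Tesseract appends ".box" to the output base, so drop the image extension here.
    output_base = image.with_suffix("")
    return [tesseract, str(image), str(output_base), "batch.nochop", "makebox"]
```

The returned list can be passed to subprocess.run(cmd, check=True). Passing a list rather than one string also avoids quoting trouble when the folder name contains spaces.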
I'm still working through this myself and will keep you updated. If you have any other issues, let me know and maybe I can help.

You'll have to build or install all the Tesseract training executables first. Then inside jTessBoxEditor, set the appropriate Tesseract Executable location.
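To find out which executables are already installed and where they live, a quick check like the following may help. It is a sketch under assumptions: the tool list covers the classic training programs that ship with Tesseract, your build may include more or fewer, and locate_tools is just an illustrative name.

```python
import shutil

# Classic Tesseract training programs; adjust the list to match your install.
TRAINING_TOOLS = [
    "tesseract",
    "unicharset_extractor",
    "mftraining",
    "cntraining",
    "combine_tessdata",
]


def locate_tools(names=TRAINING_TOOLS):
    """Map each tool name to its full path on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}
```

The folder that contains the tesseract entry is the location to point jTessBoxEditor at. Any None value means that tool still has to be built or installed.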

Related

Tesseract .tr file empty

I'm trying to integrate text recognition into my app with TesseractOCR. I need it to learn a custom font. I have Tesseract installed on my Mac via Homebrew. I have a tiff file: eng.scout-cond.exp0.tiff that I'm converting into a ".box" file. When I run the command
tesseract eng.scout-cond.exp0.tiff eng.scout-cond.box nobatch box.train.stderr
It says Tesseract Open Source OCR Engine v4.1.1 with Leptonica
Then generates a file called eng.scout-cond.box.tr
I don't understand why it's a .tr extension instead of the .box extension shown to me in tutorials.
When opening the .tr file in a text editor, it's empty.
What would be causing it to be empty?
eng.scout-cond.exp0.tiff
The tutorial I'm following
I missed a step: the command shown in the original post creates a .tr file from the .box file and the .tiff file, so the .box file has to exist first.
SOLUTION:
To make the .box file, I used the command
tesseract eng.scout-cond.exp0.png eng.scout-cond.exp0 batch.nochop makebox
Then I ran the command in the original post.
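The order of the two steps can be written down as a small Python sketch. The helper names train_steps and run_steps are invented for illustration, and the sketch assumes the same base name (the image name without its extension) is used for both steps, which differs slightly from the output name used in the question.

```python
import subprocess
from pathlib import Path


def train_steps(image, tesseract="tesseract"):
    """Return the two commands in the order they must run."""
    base = str(Path(image).with_suffix(""))
    return [
        # Step 1: write <base>.box from the image.
        [tesseract, image, base, "batch.nochop", "makebox"],
        # Step 2: use the image plus its box file to write <base>.tr.
        [tesseract, image, base, "nobatch", "box.train.stderr"],
    ]


def run_steps(image, run=subprocess.run):
    """Run both steps in order, stopping at the first failure."""
    for cmd in train_steps(image):
        run(cmd, check=True)
```

In practice you would review and correct the generated box file (for example in jTessBoxEditor) between the two steps, so running them back to back is mainly useful for checking the setup.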

Is there a way to open and run a complete ML project in Kaggle notebooks just like in Google Colab?

I have my complete ML project on Colab, but I want to move to Kaggle. Does Kaggle allow you to upload an entire project folder and traverse the code files, similar to how Colab lets you mount a drive and then browse your project folder?
Yeah, you can simply compress your project directory into a zip file ... and upload the zip file as a dataset, then launch a kernel from there. Just keep in mind that everything is read-only unless you are in kaggle/working/.
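For the compression step, something like this works on your own machine before uploading. It is a minimal sketch: zip_project is an illustrative name, and it assumes the output folder is outside the project folder so the archive does not try to include itself.

```python
import shutil
from pathlib import Path


def zip_project(project_dir, out_dir="."):
    """Zip `project_dir`, keeping its top-level folder name, and return the archive path."""
    project = Path(project_dir).resolve()
    archive_base = Path(out_dir).resolve() / project.name
    # root_dir/base_dir make the entries look like "<project name>/src/..." inside the zip.
    return shutil.make_archive(
        str(archive_base), "zip", root_dir=str(project.parent), base_dir=project.name
    )
```

Inside the notebook, the uploaded dataset shows up read-only under /kaggle/input/, so if your code needs to write next to its own files you can copy the project into /kaggle/working/ first, for example with shutil.copytree (the exact dataset folder name depends on what you called the upload).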

Inkscape screws up EPS files

I have been trying to use Inkscape to prepare artwork for my scientific papers. I use LaTeX, and I need my figures to be prepared as high-quality Encapsulated PostScript (EPS) images. My workflow is as follows. First, I plot the parts of my figure using matplotlib and save them in EPS format. Second, I launch Inkscape and import the EPS files. Using Inkscape I compose the figure, keeping the objects I need, deleting the rest, and adding some markup. This is how I used to work with CorelDraw on Windows, but now I work on Linux.
Unfortunately, Inkscape damages EPS files: it changes the colors and does not save all the objects. Over the last few years I have searched for a solution, but I cannot find anyone complaining about this. The complaints I found on the Web are about things like "incorrect font rendering" when exporting from SVG to EPS or back. (For me this is not a problem - the text can always be represented as curves.)
I currently work in Mandriva Linux 2010 and use Inkscape version 0.47 r22583 (Jan 14 2010). I read somewhere that such problems could be caused by outdated versions of cairo - mine is 1.9.14. I spent a lot of effort trying to build a newer cairo (1.12.14), but I am still far from that goal. I got confused by the 32-bit and 64-bit libraries coexisting on my system...
I would be very grateful to anyone who has had similar problems and, maybe, has got further towards a solution. Let me illustrate the problem.
Sorry, I do not have enough reputation points to either post images or insert more than 2 links, so please take a look at the copy of this post with the images on my LiveJournal page:
http://benkev.livejournal.com/1093.html
The figure captions are below.
(1) Here are the three eps images I would like to combine in one figure:
(2) Here is what I get after importing the images into Inkscape and saving in SVG format. Note the color and resolution distortion. I also drew three red circles around the feature of interest.
(3) Here is what I get when I export this figure to an EPS file. Note that one of the three red circles is gone: only two circles are left!
Thank you!
This appears to be a bug in Inkscape. The following steps might help:
Open the SVG file in Inkscape.
Select all (Ctrl+A).
Ungroup (Ctrl+Shift+G). You may need to repeat this step several times.
Save the result in EPS format.
For what it's worth after more than one year: I've been experiencing the same problems with Inkscape 0.48. The EPS was missing items when opened in other software (e.g. LaTeX).
I didn't completely solve the problem, but I found that it helped to remove groups. Simply select all components and keep ungrouping until there are no groups left. Save as EPS and the result should be better.
If there are still items missing, try to use 'raise selection up to top' on the missing items and save again.
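To see whether any groups are left after ungrouping, you can count them in the saved SVG. This is only a sketch for checking the file, and it does not prove that groups cause the missing items. It relies on the fact that Inkscape stores layers as g elements marked with inkscape:groupmode="layer", so those are counted separately. The name count_groups is illustrative.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
INKSCAPE_NS = "http://www.inkscape.org/namespaces/inkscape"


def count_groups(source):
    """Return (plain_groups, layers) for an SVG given as a file name or file object."""
    root = ET.parse(source).getroot()
    groups = layers = 0
    for g in root.iter(f"{{{SVG_NS}}}g"):
        # Layers are also <g> elements; Inkscape tags them with groupmode="layer".
        if g.get(f"{{{INKSCAPE_NS}}}groupmode") == "layer":
            layers += 1
        else:
            groups += 1
    return groups, layers
```

If the first number is still above zero, repeat Select all and Ungroup, save, and check again.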
I know this is old, but the bug is still present in Inkscape, so here's my two cents. My workaround is to save a copy of my project as "Plain SVG" and export that as EPS.
I hope it helps!
I created a new layer and moved the text which was not showing up in the EPS to this layer. Then it was showing up in the exported EPS file.
P.S. Make sure you make the new layer below the current layer and move text there.
It is a bug in Inkscape (0.91 on Windows), but there is an easy fix. Save directly to PDF from Inkscape, and then save the PDF file as EPS. It worked like a charm for me.
A permanent solution for this problem is to export your SVG to a PNG and then export the PNG (e.g. via the free software GIMP) as an EPS file. The missing items are always included when I use this approach.

Is there a way to teach tesseract for iOS a new font?

I'm currently using Tesseract for iOS, based on Nolan Brown's example. It works OK, but I need it to start picking up a new font (which I have in .ttf format), and the text will always be numbers.
I have found questions on Stack Overflow about Tesseract learning fonts, which all point to the Google guides on how to teach Tesseract a new font from the command line. But I'm already using a compiled copy of the lib from Nolan's example.
How can I teach tesseract a new font? Will I need to recompile the lib for iOS? How do I do this?
You might try training a new "traineddata" file using these instructions.

Getting text from image on ios (image processing)

I am thinking of making an application that requires extracting text from an image. I haven't done anything similar and I don't want to implement the whole thing on my own. Is there any known library or open source code (supporting iOS, Objective-C) which can help me extract the text from the image? Basic source code will also do (I will try to modify it to fit my needs).
Kindly let me know if anyone has any ideas on this.
Thanks,
Vikram
One of the main open source libraries used to do OCR on iOS is a Google-sponsored open source project called Tesseract.
Here is some info on compiling tesseract for use in iOS apps:
tesseract
The same guy has a nice sample project on github demonstrating how a simple client might use the compiled library:
Pocket-OCR

Resources