jpg file parsing to extract info/text - machine-learning

I have an idea for a project that I wanted some advice/pointers on.
I am planning to write an application to automatically parse expense receipts in JPG format and automatically extract the amount and also categorize using some learning algorithm. Is this at all doable? What libraries are available to parse jpg files to extract textual information and currency information from it?
Any pointers are appreciated. I have a vanilla HP all-in-one scanner that I will use to scan all the receipts.
Thanks
RS

You will need an OCR (optical character recognition) library, which recognizes and retrieves text from images. It has been a while since I last used OCR software, so I am not sure what the best SDKs or plugins are at the moment.
I did find an article on The Code Project that uses an OCR product from LEADTOOLS.
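Once an OCR engine has turned the receipt into text, pulling out the amount and a rough category is mostly string processing. Here is a minimal Python sketch of that second half only. It assumes the OCR text is already available as a string; the function names, the keyword table, and the "prefer the line that says total" rule are all illustrative assumptions, not part of any OCR product.

```python
import re
from decimal import Decimal

# Matches amounts such as 12.50, 1,234.56 or $7.00 (two decimal places required).
AMOUNT_RE = re.compile(r"[$€£]?\s*(\d{1,3}(?:,\d{3})*|\d+)\.(\d{2})\b")

# Hypothetical keyword table; a trained classifier could replace it later.
CATEGORY_KEYWORDS = {
    "travel": ("taxi", "airline", "hotel", "fuel"),
    "meals": ("restaurant", "cafe", "coffee", "pizza"),
    "office": ("paper", "toner", "stapler"),
}


def find_amounts(line):
    """Return every currency-like amount on one line as a Decimal."""
    return [
        Decimal(whole.replace(",", "") + "." + cents)
        for whole, cents in AMOUNT_RE.findall(line)
    ]


def extract_total(ocr_text):
    """Prefer an amount on a 'total' line; otherwise fall back to the largest amount."""
    all_amounts = []
    total_line_amounts = []
    for line in ocr_text.splitlines():
        amounts = find_amounts(line)
        all_amounts.extend(amounts)
        lowered = line.lower()
        if "total" in lowered and "subtotal" not in lowered:
            total_line_amounts.extend(amounts)
    candidates = total_line_amounts or all_amounts
    return max(candidates) if candidates else None


def guess_category(ocr_text):
    """Return the first category whose keywords appear in the text."""
    text = ocr_text.lower()
    for category, words in CATEGORY_KEYWORDS.items():
        if any(word in text for word in words):
            return category
    return "uncategorized"
```

The keyword lookup is only a placeholder for the learning algorithm mentioned in the question; a labelled set of past receipts would be needed before a real classifier could take its place.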

Related

What is the way to parse a string of a well known format from an image on iOS (some library created specifically for this purpose)?

Local travel cards in Saint-Petersburg, Russia have got huge id numbers that aren't easy to read and type into a web page when topping up the card online. So I want to build a small app that would take a photo of a travel card and parse the number out.
The task is a bit easier than free-form recognition:
the card has a well-known size
the id numbers have a known length, sit in a well-known location on the card, and contain digits only, no letters (okay, there are two variations I think, and maybe they will add 1-2 more in the future)
even the font is known in advance
even the first several digits are the same for most of the cards (so far there are only two prefixes in use)
How would you do it? Are there any libraries tuned not for the general OCR, but for a "hinted" OCR like I need?
Best regards,
Artem.
P.S.
Actually a free/cheap web service for this task would also be good enough
Yes. Google has an OCR library called Tesseract, and there is an iOS SDK for it on GitHub that you can import into your application. The SDK comes with documentation that explains how to set it up in your app. It has methods that return the recognized text as a string, BUT that string will contain ALL of the text on the card. So the best thing to do would be to:
1 "clip" the original image to extract a sub image that displays only the portion of the card you wish to get the numbers from.
2 Process this sub image through Tesseract to retrieve the string you are looking for.
3 Then parse through the string and pick out the data that you need.
But be warned, it can be a bit quirky. This SDK tends to recognize text best in images that were scanned rather than photographed. Although it is an advanced piece of technology, it isn't perfect. So to get it to work as well as possible, try to get scanned copies of the originals.
Best of luck.
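Steps 1 and 3 above can be sketched independently of the OCR call itself. The Python below is only an illustration (an iOS app would do the same thing in Swift or Objective-C). The relative box, the 19-digit length, and the prefix in the usage note are made-up values, and both function names are hypothetical.

```python
import re


def crop_box(image_width, image_height, rel_box):
    """Convert a (left, top, right, bottom) box given as fractions of the card
    into integer pixel coordinates for an image-cropping call."""
    left, top, right, bottom = rel_box
    return (
        round(left * image_width),
        round(top * image_height),
        round(right * image_width),
        round(bottom * image_height),
    )


def extract_card_number(ocr_text, length, prefixes):
    """Return the first run of exactly `length` digits that starts with a known prefix."""
    # Drop whitespace that OCR often inserts between digit groups.
    # This assumes the clipped region contains little besides the number.
    compact = re.sub(r"\s+", "", ocr_text)
    for candidate in re.findall(r"\d+", compact):
        if len(candidate) == length and candidate.startswith(tuple(prefixes)):
            return candidate
    return None
```

For example, `crop_box(1000, 600, (0.1, 0.8, 0.9, 0.95))` gives the pixel rectangle to hand to whatever cropping routine you use, and `extract_card_number(text, 19, ["9643"])` filters the OCR output down to a plausible id or returns `None` so the app can ask for another photo.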
The ideal solution for you would have three components:
1) Detection of the card. With detection in place, end users have a much easier time actually using the scanner, because they can hold the phone above the card in any orientation.
2) An accurate OCR component. Ideally, it is customizable for the exact font and the exact position of the number on the card.
3) A parsing mechanism. This lets you obtain the exact string written on the card without writing a huge amount of OCR parsing code.
The BlinkID SDK has all of this. It has a preset for detecting cards in the ID-1 format, it has an integrated OCR engine, and it provides RegexParser, where you can define the exact format of the text you are trying to extract from the document.
BlinkID was initially built for scanning ID documents, which have properties very similar to those of the problem you are trying to solve.
Note. I'm one of the developers working on BlinkID.
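To illustrate what a regex-based parsing step buys you when the field is known to be digits only, here is a small generic Python sketch. It is not BlinkID code and does not use its API; the confusion table, the function name, and the example pattern are assumptions made for illustration.

```python
import re

# Letters that OCR engines commonly confuse with digits (illustrative, not exhaustive).
CONFUSIONS = str.maketrans(
    {"O": "0", "o": "0", "I": "1", "l": "1", "S": "5", "B": "8"}
)


def parse_numeric_field(raw, pattern):
    """Map look-alike letters to digits, strip separators, then require a full regex match."""
    cleaned = re.sub(r"[\s\-]", "", raw.translate(CONFUSIONS))
    return cleaned if re.fullmatch(pattern, cleaned) else None
```

Because the expected format is strict, a raw reading such as `96O3 l078-5512` can be repaired and accepted against a pattern like `96\d{10}`, while anything that still does not fit is rejected instead of being passed on as a wrong number.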

Converting an image to Doc

I am trying to make an application that produces an editable document file (DOC or PDF) from an image. I am planning to use Tesseract to extract the text, but I am not yet sure how to get the basic formatting of the text (size, bold, italic, underline) and any images that might be present in the document image. I have to use J2EE, so I am planning to build a web-based app. I think I might be able to recognize the components and formatting of the document using OpenCV, but I am not really sure.
Given that you are planning to use Tesseract for the basic OCR capabilities, try looking into its hOCR-formatted output. That includes quite a lot of additional information about font size, font face, position, etc.
You can find a description of hOCR here:
https://docs.google.com/document/d/1QQnIQtvdAC_8n92-LhwPcjtAUFwBlzE8EWnKAxlgVf0/preview#heading=h.e903b9bca924
If that doesn't work out, it depends on how much effort you want to put into Tesseract. Its internal APIs (available in Java via Tess4J, among others) do provide much of the information that you would need to reconstruct the page layout.
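hOCR is ordinary HTML in which each recognized element carries its layout data in the `title` attribute, so it can be read with any HTML parser. Below is a minimal Python sketch using only the standard library. It assumes word elements are `span` tags with class `ocrx_word` and a title of the form `bbox x0 y0 x1 y1; x_wconf N`, and that words contain no nested `span`; check the actual output of your Tesseract version before relying on that. The class and function names are hypothetical.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Split an hOCR title such as 'bbox 10 10 80 40; x_wconf 96' into a dict."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if fields:
            props[fields[0]] = fields[1:]
    return props


class HocrWordParser(HTMLParser):
    """Collect text, bounding box and confidence for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # word record being filled, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            props = parse_title(attrs.get("title") or "")
            self._current = {
                "text": "",
                "bbox": tuple(int(v) for v in props.get("bbox", [])),
                "conf": float(props["x_wconf"][0]) if "x_wconf" in props else None,
            }

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        # Assumes no span is nested inside a word span.
        if tag == "span" and self._current is not None:
            self.words.append(self._current)
            self._current = None


def hocr_words(hocr):
    """Return a list of {'text', 'bbox', 'conf'} records from an hOCR string."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

The bounding boxes give you position and approximate text height, which is a starting point for rebuilding paragraphs and font sizes in the output document. The same parsing is straightforward to port to Java for a J2EE app.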

using LEADTOOLS to convert doc to pdf

I am playing around with LEADTOOLS to see how it might benefit me, but I am a little frustrated with their documentation regarding how the process works. I am creating a library with methods that take an input file, convert it to PDF, add a QR code to the file, save it, and then read the QR code again.
Does a PDF have to be converted to an image before LEADTOOLS is able to read the QR code?
Does LEADTOOLS allow converting from DOC to PDF and then adding the QR code, or do I have to convert it to an image as well?
Is there anywhere I could look at code samples of how I can go about doing what I talked about, other than the LEADTOOLS site itself?
I am sorry to hear that you are having difficulties, but I will do my best to get you pointed in the right direction.
To answer your questions:
A1.) Yes, the PDF will need to be rasterized before the LEADTOOLS barcode engine can be used. Our barcode engine will only work with raw image data. Once the file is decompressed into raw data, we will not access the file any further.
A2.) Yes, you can rasterize Microsoft Word documents using either our file I/O methods or with the LEADTOOLS Virtual Printer. Once you have the raw image data, you can pass it to the barcode engine to write the QR code into the data. Once the barcode is written, you can then compress the image into any supported format, including (raster) PDF. You can also create a searchable PDF by running the resultant image through an OCR engine & outputting to PDF.
A3.) The LEADTOOLS SDK has a main barcode demo that should illustrate the ability of the SDK to handle the features you describe here. There are also tutorials in the help file, and various projects on our support forums. We have also created a couple different CodeProject articles here:
Multi-Platform Barcode with LEADTOOLS 18
How to Read Barcodes from Images using LEADTOOLS
You haven't mentioned which programming language you are developing with or what specific problems you have encountered. Without knowing either of those, it's difficult to be more specific about methods or other resources to check out. For a simple raster conversion of a Microsoft Word document to PDF plus writing a barcode, I think this would probably take 10 to 15 lines of code.
If you have not already, I would highly recommend sending an email to Support@leadtools.com or opening a live chat with the LEADTOOLS Support team from LEADTOOLS.com. We can get into more specifics there and help you more directly with any issues you are encountering.
Walter Bates
LEADTOOLS Developer Support
I tried adding this as a comment, but it is apparently too long for that. So I have added it as another answer.
Even if you are building a DLL, I would suggest starting out building a simple demo with a view of the image so you can see what exactly is happening to the image. Once you are comfortable that the image is being modified the way you want, then implement that code in your own library.
Also, I would recommend testing out the toolkit with the provided main demos. The demos are there to illustrate the different options you have access to in the code. If you can accomplish what your application or library will need to do through the demos, then it would be worth your time to begin coding specifically what you need. You might even need to use multiple demos to verify the tools can accomplish the goals that you have. You have all the toolkit code for the demos, so you can take them apart and use the specific pieces that you need in your application.
If you are having trouble identifying which demos to try out or whether the toolkit has the specific functionality that you need, your best bet is to contact Tech Support directly to ask. We are here to help get you pointed in the right direction.
To get down to brass tacks, the source of the image data is not all that important from the perspective of the barcode engine. It needs a RasterImage handle (raw image data) to write the specified barcode. Whether the image data is created on the fly, read from file, or generated from a scanner, it does not make a whole lot of difference.
To find the main .NET barcode demo, I would start out by going to the LEADTOOLS shortcuts. To get there, go to the Start menu -> LEADTOOLS -> Help and Demos. The shortcuts are broken down by programming language, feature, and then the base toolkit. You should be able to find the WinForms .NET barcode demo here:..\Shortcuts.NET Class Libraries.NET Framework\01 Imaging\07 Barcode
Our toolkit example is a .NET WinForms project, but it will work in ASP.NET also.
Here are some links to tutorials if you want to dig right into the code:
Loading and Displaying an Image in WinForms
Reading Barcodes
HOW TO: Load and Display an Image with WebImageViewer
There was also this recent code tip posted illustrating how to read and write UTF-8 characters in a QR barcode.
We provide both .NET 2.0 and .NET 4.0 DLLs for our barcode engine. Both of these work within Visual Studio 2012.

How can I parse, manipulate, and save Adobe Photoshop files?

How can I write a script or program to manipulate Adobe Photoshop files? I'd like to be able to do something like read an Adobe PSD file, rename the layers, and save it back in PSD format.
The files look to be saved with a combination of XML and serialized data. I looked at the file's contents and saw that it has <x:xmpmeta near the start, then did some Google searching and found the Wikipedia article about XMP (Extensible Metadata Platform), but I'm unclear whether that is the format of the entire file or just of the metadata portion.
I saw that there is a PSD parser class for PHP available, and not a bad article about how to use it, although it seems like it is just for reading / converting and not for writing / saving.
But I'd like to know:
What format are these files stored in?
Where are the guidelines for interfacing with that format?
Are there some classes / tools available for manipulating that file format? Any language would be fine for a start.
I'm happy to do more research on my own but I'm hoping for some guidance to know what I should be looking for.
I'm not familiar with it myself, but there is an official SDK for Photoshop available that should let you do all that and more with .psd files.
There are not many options. The general advice would be to look into buying Adobe InDesign Server. In some cases it can be cost-prohibitive, and you might be interested in third-party SDKs. Unfortunately, there are only a few options on the market. One of them is the Graphics Mill image processing SDK (http://www.graphicsmill.com/photoshop-psd).
Disclaimer: I work for Aurigma which runs Graphics Mill project.
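On the "what format are these files stored in" part: PSD is a binary format that Adobe documents publicly, and the XMP packet you saw is only embedded metadata, not the whole file. The file starts with a fixed 26-byte big-endian header, which the Python sketch below reads using the standard library. The function name and the returned dictionary layout are my own, and this reads the header only; renaming layers means parsing and rewriting the later layer section, which is considerably more work.

```python
import struct

# Signature (4 bytes), version (2), reserved (6), channels (2),
# height (4), width (4), bit depth (2), color mode (2); all big-endian.
PSD_HEADER = struct.Struct(">4sH6xHIIHH")

COLOR_MODES = {
    0: "Bitmap", 1: "Grayscale", 2: "Indexed", 3: "RGB",
    4: "CMYK", 7: "Multichannel", 8: "Duotone", 9: "Lab",
}


def read_psd_header(data):
    """Parse the fixed-size header at the start of a PSD/PSB byte string."""
    if len(data) < PSD_HEADER.size:
        raise ValueError("too short to be a PSD file")
    signature, version, channels, height, width, depth, mode = (
        PSD_HEADER.unpack_from(data)
    )
    if signature != b"8BPS":
        raise ValueError("missing 8BPS signature")
    return {
        "version": version,  # 1 = PSD, 2 = PSB (large document)
        "channels": channels,
        "height": height,
        "width": width,
        "depth": depth,
        "color_mode": COLOR_MODES.get(mode, "unknown ({})".format(mode)),
    }
```

Typical use would be `read_psd_header(open(path, "rb").read(26))`, where `path` points at one of your own files.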

Using Ruby And Ubuntu With Optical Character Recognition

I am a university student and it's time to buy textbooks again. This quarter there are over 20 books I need for classes. Normally this wouldn't be such a big deal, as I would just copy and paste the ISBNs into Amazon. The ISBNs, however, are converted into an image on my school's book site. All I want to do is get the ISBNs into a string so I don't have to type each one by hand. I have used GOCR to convert the images into text, but I want to use it with a Ruby script so I can automate the process and do the same for my classmates.
I can navigate to the site. How can I save the image to a file on my computer (running UBUNTU), convert the image with GOCR, and finally save it to a file so I can then access them again with my Ruby script?
GOCR seems to be a good choice at first, but from what I can tell from my own "research", its quality isn't quite sufficient for daily use. Depending on the image input, this could become a problem. If it doesn't work out for you, try the "new" feature of Google Docs, which allows you to upload images for OCR. You can then retrieve the results using one of the Google API libraries (there are tons out there; I'm using gdata-ruby-util, which requires some hacking, though).
You could also use tesseract-ocr for the OCR part; it is also open source and in active development.
For the retrieval part, I would also stick with hpricot, which is super-powerful and flexible.
Sounds like a cool project, and shouldn't be too hard if the ISBN images are stored in individual files.
This all can be run in the background:
download web page (net/http)
save metadata + image file for each book (paperclip)
run GOCR on all the images
All you need is a list of URLs or a crawler (mechanize), and then you probably need to spend a few minutes writing a parser (see joe's post) for the university HTML pages.
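Since OCR on small images misreads digits fairly often, it is worth validating each result with the ISBN check digit before pasting it anywhere. Here is a minimal sketch of the "run GOCR, then clean up the result" steps, written in Python for brevity; the logic ports directly to Ruby. It assumes the `gocr` command-line tool is installed and on the PATH, and all function names are hypothetical.

```python
import re
import subprocess


def run_gocr(image_path):
    """Run the gocr command-line tool on one image and return its text output.
    Assumes gocr is installed and available on PATH."""
    result = subprocess.run(
        ["gocr", "-i", image_path], capture_output=True, text=True, check=True
    )
    return result.stdout


def is_valid_isbn13(digits):
    """ISBN-13: weights alternate 1, 3 and the weighted sum must be divisible by 10."""
    if len(digits) != 13 or not digits.isdigit():
        return False
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0


def is_valid_isbn10(chars):
    """ISBN-10: weights run 10 down to 1, 'X' means 10, sum must be divisible by 11."""
    if not re.fullmatch(r"\d{9}[\dX]", chars):
        return False
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(chars))
    return total % 11 == 0


def find_isbn(ocr_text):
    """Return the first checksum-valid ISBN found in the OCR text, or None."""
    compact = re.sub(r"[\s\-]", "", ocr_text.upper())
    for candidate in re.findall(r"\d{13}|\d{9}[\dX]", compact):
        if is_valid_isbn13(candidate) or is_valid_isbn10(candidate):
            return candidate
    return None
```

A result of `None` from `find_isbn(run_gocr(path))` is a signal to retry with a different OCR engine or to flag that book for manual entry.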
