I am trying to make an application that creates an editable document file (DOC or PDF) from an image. I am planning to use Tesseract for extraction of the text, but I am not yet sure how to get the basic formatting of the text (size, bold, italic, underline) and the images that might be present in the document image. I am planning to use J2EE to make a web-based app (I have to use J2EE). I think I might be able to recognize the components and formatting of the document using OpenCV, but I am not really sure.
Given that you are planning to use Tesseract for the basic OCR capabilities, try looking into the hOCR formatted output. That includes quite a lot of additional information about font size, font face, position, etc.
You can find a description of hOCR here:
https://docs.google.com/document/d/1QQnIQtvdAC_8n92-LhwPcjtAUFwBlzE8EWnKAxlgVf0/preview#heading=h.e903b9bca924
If that doesn't work out, it depends on how much effort you want to put into Tesseract. Its internal APIs (available in Java via Tess4J, among others) do provide much of the information that you would need to reconstruct the page layout.
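To give you an idea of what the hOCR route looks like: if you run Tesseract with the hocr config (e.g. `tesseract page.png page hocr`), the resulting page.hocr (page.html on older versions) is plain XHTML that you can pick apart with Jsoup from your J2EE code. Here is a minimal sketch; the bbox and x_wconf fields are standard hOCR, while font hints such as x_fsize only appear if your Tesseract build supports the hocr_font_info config variable, so verify the exact fields against your version.

```java
import java.io.File;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HocrDemo {
    public static void main(String[] args) throws Exception {
        // page.hocr produced by: tesseract page.png page hocr
        Document hocr = Jsoup.parse(new File("page.hocr"), "UTF-8");

        // Each recognized word is an ocrx_word span; its title attribute holds
        // the bounding box and the word confidence, e.g.
        //   title="bbox 36 92 618 116; x_wconf 93"
        for (Element word : hocr.select("span.ocrx_word")) {
            String title = word.attr("title");               // "bbox x0 y0 x1 y1; x_wconf NN"
            String[] bbox = title.split(";")[0].trim().split("\\s+");
            System.out.printf("%-20s box=(%s,%s)-(%s,%s)%n",
                    word.text(), bbox[1], bbox[2], bbox[3], bbox[4]);
        }

        // Line-level elements (span.ocr_line) carry baseline info, and -- with
        // hocr_font_info enabled -- font hints such as x_fsize. Those are the
        // values you would use to reconstruct size/bold/italic formatting.
    }
}
```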
After studying many articles and some questions on StackOverflow, I know that I will need to write a customized parser plugin for the purpose and I also know how to do this, but I am stuck at how to proceed.
In fact I am confused about the "flow chart" of the system, which perhaps needs a fairly in-depth study of the Nutch crawling and parsing mechanism. Where should I start? By customising the HTML parsing process, then parsing the img tags on the relevant pages, and finally completing the process with tools like Jsoup?
For example, suppose I have to crawl the web and collect all the images of some specific brand item. The image search will be based on the file name and the surrounding text (which makes it necessary to include text parsing as well).
What should the system flow chart look like so that I can start writing the customized plugin?
I am using Nutch 1.12 and Solr 6.3 integrated...
Let me start by saying that what you're trying to do is not an easy task, but let's go step by step:
Assuming that you don't have all the URLs of the images before the crawl begins, you need to crawl the entire web, but you only need to keep the images in your index (and all the associated metadata). For this particular issue you can use the mimetype-plugin; one of the sample configurations does a simplistic version of this (block everything and show only the images).
You need to extract metadata about the image (size, colors, etc.). The good news is that Tika already parses images and detects a lot of metadata. You'll still need to write a custom parse filter for any additional data that you want to extract.
You'll also need to extract the text around the image. This is not really hard in an HtmlParseFilter; the tricky part is relating that content to the image metadata. One way to accomplish this is to write a custom scoring plugin that shares the data from the original HTML page (where the text is) with the actual NutchDocument for the image itself (keep in mind that these are processed in different Nutch steps). Another option is to index these as two separate documents (image metadata + metadata extracted from the HTML) and do a group/join on the query side of your application (a web application, for instance). A sketch of the surrounding-text extraction is shown after this answer.
Some additional notes: this particular use case is not really straightforward to implement at the moment with Nutch's out-of-the-box features, but it is definitely doable. I built an image search engine based on Nutch and Solr following the previous approach.
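To make the surrounding-text step a bit more concrete: the logic that ends up inside your HtmlParseFilter boils down to walking the DOM, grabbing each img plus some nearby text, and attaching that to the document. Here is a standalone sketch with Jsoup; the URL, the alt-plus-parent-text heuristic, and the printed fields are just illustrative choices, not Nutch APIs.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ImageContextExtractor {
    public static void main(String[] args) throws Exception {
        // In a real HtmlParseFilter you'd already have the parsed DOM;
        // here we fetch a page directly just to show the extraction step.
        Document doc = Jsoup.connect("http://example.com/").get();

        for (Element img : doc.select("img[src]")) {
            String src = img.absUrl("src");
            String alt = img.attr("alt");

            // Simple "surrounding text" heuristic: the text of the parent
            // element. Tune this (closest block, siblings, captions) to your pages.
            Element parent = img.parent();
            String context = parent != null ? parent.text() : "";

            System.out.println(src);
            System.out.println("  alt:     " + alt);
            System.out.println("  context: " + context);
        }
    }
}
```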
Is it possible to perform OCR on an image (for example, from assets) instead of live video with Anyline, Microblink, or other SDKs?
Tesseract is not an option due to my limited time.
I've tested it, but the results are very poor. I know they can be improved with OpenCV or something, but I have to keep to a deadline.
EDIT:
This is an example of what the image looks like when it arrives at the OCR SDK.
I am not sure about the others, but you can use the Microblink SDK for reading from a single image. It is documented here.
Reading from a video stream will give much better results, but it all depends on what you are trying to do exactly. What are you trying to read?
For reading barcodes or MRZ from, for example, identity documents, it works pretty well. For raw text OCR it is not quite as good, but it is not really intended for that anyway.
https://github.com/garnele007/SwiftOCR
Machine-learning based, trainable on different fonts, characters, etc., and free.
I am playing around with Leadtools to see how it might benefit me, but I am a little frustrated with their documentation regarding how the process works. I am creating a library with methods that take an input file, convert it to PDF, add a QR code to the file, save it, and then read the QR code again.
1. Does a PDF have to be converted to an image before LEADTOOLS is able to read the QR code?
2. Does LEADTOOLS allow converting from DOC to PDF and then adding the QR code, or do I have to convert it to an image as well?
3. Is there anywhere I could look at code samples of how I can go about doing what I talked about, other than the LEADTOOLS site itself?
I am sorry to hear that you are having difficulties, but I will do my best to get you pointed in the right direction.
To answer your questions:
A1.) Yes, the PDF will need to be rasterized before the LEADTOOLS barcode engine can be used. Our barcode engine will only work with raw image data. Once the file is decompressed into raw data, we will not access the file any further.
A2.) Yes, you can rasterize Microsoft Word documents using either our file I/O methods or with the LEADTOOLS Virtual Printer. Once you have the raw image data, you can pass it to the barcode engine to write the QR code into the data. Once the barcode is written, you can then compress the image into any supported format, including (raster) PDF. You can also create a searchable PDF by running the resultant image through an OCR engine & outputting to PDF.
A3.) The LEADTOOLS SDK has a main barcode demo that should illustrate the ability of the SDK to handle the features you describe here. There are also tutorials in the help file, and various projects on our support forums. We have also created a couple different CodeProject articles here:
Multi-Platform Barcode with LEADTOOLS 18
How to Read Barcodes from Images using LEADTOOLS
You haven't mentioned here what programming language you are developing with or what the specific problems are that you have encountered. Without knowing either of those, it's difficult to get more specific about any methods or other resources to check out. For a simple raster conversion of a Microsoft Word DOC to PDF and writing a barcode, I think this would probably take between 10 and 15 lines of code.
If you have not already, I would highly recommend sending an email to support@leadtools.com or opening a live chat with the LEADTOOLS Support team from LEADTOOLS.com. We can get into more specifics there and help you more directly with any issues you are encountering.
Walter Bates
LEADTOOLS Developer Support
I tried adding this as a comment, but it is apparently too long for that. So I have added it as another answer.
Even if you are building a DLL, I would suggest starting out building a simple demo with a view of the image so you can see what exactly is happening to the image. Once you are comfortable that the image is being modified the way you want, then implement that code in your own library.
Also, I would recommend testing out the toolkit with the provided main demos. The demos are there to illustrate the different options you have access to in the code. If you can accomplish what your application or library will need to do through the demos, then it would be worth your time to begin coding specifically what you need. You might even need to use multiple demos to verify the tools can accomplish the goals that you have. You have all the toolkit code for the demos, so you can take them apart and use the specific pieces that you need in your application.
If you are having trouble identifying which demos to try out or whether the toolkit has the specific functionality that you need, your best bet is to contact Tech Support directly to ask. We are here to help get you pointed in the right direction.
To get down to brass tacks, the source of the image data is not all that important from the perspective of the barcode engine. It needs a RasterImage handle (raw image data) to write the specified barcode. Whether the image data is created on the fly, read from file, or generated from a scanner, it does not make a whole lot of difference.
To find the main .NET barcode demo, I would start out by going to the LEADTOOLS shortcuts. To get there, go to the Start menu -> LEADTOOLS -> Help and Demos. The shortcuts are broken down by programming language, feature, and then the base toolkit. You should be able to find the WinForms .NET barcode demo here: ..\Shortcuts\.NET Class Libraries\.NET Framework\01 Imaging\07 Barcode
Our toolkit example is a .NET WinForms project, but it will work in ASP.NET also.
Here are some links to tutorials if you want to dig right into the code:
Loading and Displaying an Image in WinForms
Reading Barcodes
HOW TO: Load and Display an Image with WebImageViewer
There was also this recent code tip posted illustrating how to read and write UTF-8 characters in a QR barcode.
We provide both .NET 2.0 and .NET 4.0 DLLs for our barcode engine. Both of these work within Visual Studio 2012.
I downloaded the EverNote API Xcode Project but I have a question regarding the OCR feature. With their OCR service, can I take a picture and show the extracted text in a UILabel or does it not work like that?
Or is the extracted text not shown to me, but only used for the photo search function?
Has anyone ever had any experience with this or any ideas?
Thanks!
Yes, but it looks like it's going to be a bit of work.
When you get an EDAMResource that corresponds to an image, it has a property called recognition that returns an EDAMData object containing the XML that defines the recognition info. For example, I attached an image to a note and then inspected the recognition info attached to the corresponding EDAMResource object. The XML itself is too big to fit in an answer, so I posted it on pastie.org.
As you can see, there's a LOT of information here. The XML is defined in the API documentation, so this would be where you parse the XML and extract the relevant information yourself. Fortunately, the structure of the XML is quite simple (you could write a parser in a few minutes). The hard part will be to figure out what parts you want to use.
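For what it's worth, once you have that XML in hand, the parsing really is a few minutes' work. Here is a minimal sketch in Java (on iOS you would do the same thing with NSXMLParser; the logic is identical). The element and attribute names are my recollection of Evernote's recoIndex format, with item elements carrying x/y/w/h rectangles and weighted t candidates inside, so double-check them against the XML you actually get back.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.parser.Parser;

public class RecoIndexDemo {
    public static void main(String[] args) throws Exception {
        // Recognition XML pulled from EDAMResource.recognition (structure assumed:
        // item = rectangle on the image, t = weighted text candidate).
        String recoXml =
              "<recoIndex objType=\"image\" objWidth=\"600\" objHeight=\"400\">"
            + "  <item x=\"10\" y=\"20\" w=\"120\" h=\"30\">"
            + "    <t w=\"92\">Hello</t>"
            + "    <t w=\"41\">Hollo</t>"
            + "  </item>"
            + "</recoIndex>";

        Document doc = Jsoup.parse(recoXml, "", Parser.xmlParser());
        for (Element item : doc.select("item")) {
            System.out.printf("rect x=%s y=%s w=%s h=%s%n",
                    item.attr("x"), item.attr("y"), item.attr("w"), item.attr("h"));
            for (Element candidate : item.select("t")) {
                // The w attribute on <t> is the recognition confidence (0-100).
                System.out.printf("  candidate '%s' (confidence %s)%n",
                        candidate.text(), candidate.attr("w"));
            }
        }
    }
}
```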
It doesn't really work like that. Evernote doesn't really do "OCR" in the pure sense of turning document images into coherent paragraphs of text.
Evernote's recognition XML (which you can retrieve afterwards via the technique that @DaveDeLong shows above) is most useful as an index to search against; the service will provide you with sets of rectangles and sets of possible words/text fragments with probability scores attached. This makes a great basis for matching search terms, but a terrible one for constructing a single string that represents the document.
(I know this answer is like 4 years late, but Dave's excellent description doesn't really address this philosophical distinction that you'll run up against if you try to actually do what you were suggesting in the question.)
I have a site where a user can upload a photo. I have no idea how to handle photos. To make a thumbnail out of a large res photo, do I just resize the width and the height? Or is there a better way to do this?
If you could point me to any resources or give me any tips, that would be great.
I'm using Ruby on Rails, if that matters. I don't really want gems for this because I want to learn how to do it myself.
For some of this, "learning how to do it yourself" is going to be a significant undertaking. Resizing the image, for example. Go ahead and find an open source library (such as a gem) that resizes images and look through its source code. It's not impossible to do on your own, but a lot of that sort of thing is built on the expertise that's come before, etc. There's nothing wrong with making use of a tool somebody else has created, provided that you understand what the tool is doing.
A few points to hopefully help you out:
Go with a whitelist approach to which image formats you support. Don't just let users upload anything that they call an image.
Each format you support is going to have its own standards (possibly multiple) for meta-data. Stripping out that data wholesale may or may not be a good idea. For example, a jpeg may contain its orientation in its EXIF data and if you strip that out you may be effectively rotating the image. Certain fields, such as geotagging, you may want to strip out in the effort to protect your users' privacy, etc. Again, look into existing libraries for this and see how they do it.
DO NOT implicitly trust the file name extension for determining the type of the image. It's possible for a user to construct a malicious file that isn't really an image, pass it off as an image to an unsuspecting host, and effectively open a security flaw on that host as it tries to process the file as an image. There was a question about determining file type in Ruby here, and I'm sure there's a lot more to be found on the subject (a short sketch of content sniffing follows below).
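To illustrate the content-sniffing idea from that last point (the example is in Java simply because that's what I had handy; in Ruby you would read the magic bytes yourself or shell out to `file`, but the principle is identical): ask a decoder what the bytes actually are and ignore the file name entirely.

```java
import java.io.File;
import java.util.Iterator;
import javax.imageio.ImageIO;
import javax.imageio.ImageReader;
import javax.imageio.stream.ImageInputStream;

public class ImageTypeSniffer {
    /** Returns the real image format (e.g. "JPEG", "png") or null if the bytes aren't a known image. */
    static String sniffFormat(File file) throws Exception {
        try (ImageInputStream in = ImageIO.createImageInputStream(file)) {
            Iterator<ImageReader> readers = ImageIO.getImageReaders(in);
            return readers.hasNext() ? readers.next().getFormatName() : null;
        }
    }

    public static void main(String[] args) throws Exception {
        // "photo.jpg" could be anything -- trust the bytes, not the name.
        String format = sniffFormat(new File("photo.jpg"));
        System.out.println(format == null ? "not an image" : "detected format: " + format);
    }
}
```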
David answered the question well, but I thought I might be able to provide some more specific information regarding your question.
Use the Paperclip gem, combined with RMagick, an ImageMagick wrapper for Ruby. You can set post-processing options and create multiple resizings.
If you really want to do it yourself, check out the actual gem at https://github.com/thoughtbot/paperclip and you'll see how that author does it. Part of the thinking behind Ruby on Rails is DRTW (Don't Reinvent The Wheel). Utilize what's out there and build on it. It will save you time and enable you to do more in the long run.
An alternative is to use a third party service such as http://resizer.co (I am not affiliated)
Replace a URL to an image from your site, say:
http://example.com/images/abc123.jpg
with the following (for a 200 x 200 image)
http://www.resizer.co?image=http://example.com/images/abc123.jpg&w=200&h=200
You may run into issues with distortion and the 'aspect ratio' not being maintained, but it could be a good start.
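If you do end up resizing the images yourself rather than proxying through a service, preserving the aspect ratio is just a matter of scaling both dimensions by the same factor. Here is the arithmetic as a small illustrative Java snippet; the same calculation carries straight over to Ruby.

```java
public class ThumbnailMath {
    /** Largest size that fits inside maxW x maxH without distorting the original. */
    static int[] fit(int width, int height, int maxW, int maxH) {
        double scale = Math.min((double) maxW / width, (double) maxH / height);
        scale = Math.min(scale, 1.0);                  // never upscale a small image
        return new int[] { (int) Math.round(width * scale),
                           (int) Math.round(height * scale) };
    }

    public static void main(String[] args) {
        int[] thumb = fit(1920, 1080, 200, 200);
        System.out.println(thumb[0] + " x " + thumb[1]);   // 200 x 113 -- ratio preserved
    }
}
```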