I often work with scanned papers. The papers contain tables (similar to Excel tables) which I need to type into the computer manually. To make the task harder, the tables can have different numbers of columns. Manually entering them into Excel is tedious, to say the least.
I thought I could save myself a week of work if I could write a program to OCR them. Would it be possible to detect the header and text areas with OpenCV and then OCR the text at the detected image coordinates?
Can I achieve this with the help of OpenCV, or do I need an entirely different approach?
Edit: The example table is really just a standard table, similar to what you can see in Excel and other spreadsheet applications; see below.
This question is a little old, but I was also working on a similar problem and arrived at my own solution, which I explain here.
When reading text with any OCR engine, there are many challenges in getting good accuracy, including the following main cases:
Presence of noise due to poor image quality or unwanted elements/blobs in the background region. This requires some pre-processing, such as noise removal, which can easily be done with a Gaussian filter or a plain median filter; both are available in OpenCV (see the sketch after this list).
Wrong image orientation: with a wrongly oriented image, the OCR engine fails to segment the lines and words correctly, which gives the worst accuracy.
Presence of lines: during word or line segmentation, the OCR engine sometimes merges words and lines across ruled lines, processing the wrong content and giving wrong results.
There are other issues too, but these are the basic ones.
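For the noise-removal step, here is a minimal sketch using OpenCV's Python bindings ("scan.png" is a placeholder filename):

```python
# Minimal denoising sketch; kernel size 3 is a typical starting point.
import cv2

img = cv2.imread("scan.png", cv2.IMREAD_GRAYSCALE)
denoised = cv2.medianBlur(img, 3)               # median filter
# alternatively: denoised = cv2.GaussianBlur(img, (3, 3), 0)
cv2.imwrite("denoised.png", denoised)
```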
In this case, I think the scanned image quality is quite good and simple, and the following steps can be used to solve the problem.
Simple image binarization will remove the background content, leaving only the necessary content, as shown here.
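A minimal binarization sketch, assuming Otsu's global threshold is good enough for a clean scan (the inverted output makes the content white, which suits the connected-component step below):

```python
# Otsu binarization; "denoised.png" continues the sketch above.
import cv2

gray = cv2.imread("denoised.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
cv2.imwrite("binary.png", binary)
```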
Now we have to remove the lines, which in this case form the table grid. The grid can be identified with connected-component analysis: label the components and remove the large ones. The final image to be fed to the OCR engine will then look like this.
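A sketch of that line-removal idea; the area threshold is a guess and would need tuning for real scans:

```python
# Label connected components and blank out the unusually large ones
# (the table grid). Label 0 is the background.
import cv2

binary = cv2.imread("binary.png", cv2.IMREAD_GRAYSCALE)
n, labels, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)

cleaned = binary.copy()
for i in range(1, n):
    if stats[i, cv2.CC_STAT_AREA] > 1000:   # "large" component -> grid line
        cleaned[labels == i] = 0
cv2.imwrite("no_lines.png", cleaned)
```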
For OCR we can use the Tesseract open-source OCR engine (a minimal pytesseract call is sketched after the output below). I got the following results from OCR:
Caption title
header! header2 header3
row1cell1 row1cell2 row1cell3
row2cell1 row2cell2 row2cell3
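For reference, output like the above can be produced with a call like this; pytesseract is one option among several Tesseract wrappers:

```python
# Minimal Tesseract call; assumes the tesseract binary and the
# pytesseract package are installed.
import cv2
import pytesseract

cleaned = cv2.imread("no_lines.png", cv2.IMREAD_GRAYSCALE)
cleaned = cv2.bitwise_not(cleaned)  # Tesseract prefers dark text on a light background
print(pytesseract.image_to_string(cleaned))
```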
As we can see, the result is quite accurate, but there are some issues, such as
header!, which should be header1; the OCR engine mistook the 1 for a !. This problem can be solved by further processing the result with regex-based operations.
After post-processing, the OCR result can be parsed to read the row and column values.
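A hypothetical post-processing sketch; the substitution rule and the embedded sample text are illustrative only:

```python
# Fix a common OCR confusion ('!' read instead of '1') and split into cells.
import re

raw = """Caption title
header! header2 header3
row1cell1 row1cell2 row1cell3
row2cell1 row2cell2 row2cell3"""

# Replace '!' with '1' whenever it directly follows a lowercase letter.
fixed = re.sub(r"(?<=[a-z])!", "1", raw)

rows = [line.split() for line in fixed.splitlines() if line.strip()]
print(rows[1])  # ['header1', 'header2', 'header3']
```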
Also, in this case, font information can be used to classify the sheet title, the headings, and the normal cell values.
Related
I've got a large number of 3- and 6-column journal and newspaper pages that I want to OCR. I want to automate recognition of columns.
I've used tesseract (see a previous question) and Google Cloud Document AI (using the R package daiR) without great success.
These programs read the text very well, but do not do a good job of recognizing the column format of pages.
Here are a couple of examples from daiR:
Obviously these are complex images, with some double columns and some tables inside columns. What I want is for the OCR to look for 6 columns.
I get good results if I preprocess images (for instance by cropping them into single columns or adding vertical lines), but I haven't found an efficient way to do this in large batches. Is there a way of preprocessing images or telling OCR programs to look for a given number of columns?
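One batch-friendly sketch, assuming Tesseract is acceptable: crop each page into a fixed number of equal-width strips and OCR each strip as a single column. Equal-width cropping is a crude stand-in for real gutter detection (e.g., a vertical projection profile), but it shows the shape of the loop:

```python
# Hypothetical batch loop; "pages/" is an assumed input directory.
import glob
import cv2
import pytesseract

N_COLS = 6
for path in glob.glob("pages/*.png"):
    img = cv2.imread(path)
    col_width = img.shape[1] // N_COLS
    columns = []
    for i in range(N_COLS):
        strip = img[:, i * col_width:(i + 1) * col_width]
        # --psm 4: treat the strip as a single column of text
        columns.append(pytesseract.image_to_string(strip, config="--psm 4"))
    print(path, "->", len(columns), "columns read")
```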
I have a system that extracts text from pictures (MS Word documents, tables, etc., with more than two lines of text, for example), but I do not want pictures that obviously have no text in them (such as textures, photos, ...). I've tried the algorithms from detect-text-in-images-with-opencv and detect-text-region-in-image-using-opencv, but none of them works well, for example with these images.
OpenCV has a module for text detection.
https://docs.opencv.org/master/da/d56/group__text__detect.html
The module comes with sample code and documentation.
There is also a deep-learning-based method called "EAST". Here's the sample:
https://github.com/opencv/opencv/blob/master/samples/dnn/text_detection.cpp
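A hedged sketch of using EAST purely as a yes/no text detector for the filtering problem above. The model file frozen_east_text_detection.pb must be downloaded separately (see the sample linked above), and the 0.5 score threshold is a guess:

```python
# Run the EAST score map and report whether anything looks like text.
import cv2

net = cv2.dnn.readNet("frozen_east_text_detection.pb")

def has_text(path, conf_threshold=0.5):
    img = cv2.imread(path)
    # EAST expects input dimensions that are multiples of 32.
    blob = cv2.dnn.blobFromImage(img, 1.0, (320, 320),
                                 (123.68, 116.78, 103.94),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    scores = net.forward("feature_fusion/Conv_7/Sigmoid")
    return bool((scores > conf_threshold).any())

print(has_text("texture.jpg"))  # ideally False for textless images
```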
I have a bit of an unorthodox question, and I cannot think of an approach to tackle it. I have some letters written like this:
/\ |---\ /---\
/ \ |___/ |
/----\ | \ |
/ \ |___/ \---/
Now, the idea is to read this content (possibly from a text file) and parse it into the real letters it represents. So the example above should be parsed to ABC.
I understand this is not OCR, but I have no idea if something like that is possible. I am not asking for a solution, but rather: how would you best attack this problem? What would be a good criterion for deciding where a 'letter' starts and where it ends?
Based on the comments it sounds like you could store a character font map (2-dimensional array for each character) and then read the input file and buffer a number of lines equal to the height of the characters.
Then, for each group of lines you would want to segment the input based on the width of the characters and slide across horizontally, looking for matches against your font map.
If you need to support multiple fonts then things get more complicated and you'd benefit more from a neural-net approach to character recognition of sorts.
One important aspect to keep in mind about how OCR typically works is that it takes an arbitrary image and "pixelates" it, generating a much lower-resolution image. In your case you've already got a "pixelated" representation, and all you'd have to do is read in the input and feed it into the rest of the pipeline.
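A rough sketch of that font-map idea; the glyph below is a made-up stand-in, not the asker's actual art:

```python
# Each character is a fixed-size tile of strings; the input is segmented
# into same-size tiles and compared against the map.
FONT = {
    "A": [" /\\ ",
          "/  \\",
          "/--\\",
          "/  \\"],
    # ... one entry per supported letter, all the same width and height
}
CHAR_WIDTH = 4

def read_line_group(lines):
    """Decode one buffered group of lines into a string of letters."""
    width = max(len(line) for line in lines)
    lines = [line.ljust(width) for line in lines]
    out = []
    for x in range(0, width, CHAR_WIDTH):
        tile = [line[x:x + CHAR_WIDTH] for line in lines]
        # exact match against the font map; '?' for anything unknown
        letter = next((ch for ch, glyph in FONT.items() if glyph == tile), "?")
        out.append(letter)
    return "".join(out)
```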
I would still approach this as an OCR-esque problem.
You could first draw the characters onto an image and run it through an available OCR library.
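For that first option, a minimal sketch; Pillow, pytesseract, and the font name are assumptions:

```python
# Rasterize the ASCII art with a monospace font, then OCR the image.
from PIL import Image, ImageDraw, ImageFont
import pytesseract

art = open("letters.txt").read()            # assumed input file

font = ImageFont.truetype("DejaVuSansMono.ttf", 16)  # any monospace font
img = Image.new("L", (800, 200), color=255)          # white canvas
ImageDraw.Draw(img).multiline_text((10, 10), art, fill=0, font=font)

print(pytesseract.image_to_string(img))     # ideally prints "ABC"
```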
Or you could do it yourself.
Pre-process it by converting vertical and horizontal characters into lines first.
Then, where there are forward slashes and backslashes, approximate the start and end points of the stroke by where they meet the preceding horizontal and vertical lines (a different approach would be needed for letters such as 'o' or 'e').
Once you have this image, a simple pattern-analysis approach such as naive Bayes should be able to produce reliable results.
Whether the pre-processing would actually improve accuracy, I'm not sure.
With the PDF below, I would like to do the following things.
Localize the four sudoku grids so as to treat each of them separately.
For each grid picture, I would like to obtain a matrix of the pictures corresponding to each cell.
Finally, I would like to "find" the values printed in each cell.
The problem is that I'm a real beginner with OpenCV (I've bought a book about OpenCV with Python, but I haven't received it yet).
I'm not a beginner in Python, nor in math, so every clue is welcome.
You're in luck:
sudoku solver part 1
part 2
part 3
part 4
Python 3.x isn't supported by OpenCV though.
Tesseract has nice Python bindings, too (and is more specialized for that 'OCR' job ;)
Welcome to OpenCV, though!
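As a taste of the first step (localizing the four grids), a hedged sketch that takes the four largest external contours of a binarized page; it assumes OpenCV 4.x and a page already rasterized from the PDF:

```python
# Locate the four sudoku grids as the four largest contours.
import cv2

img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)   # assumed rasterized page
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# OpenCV 4.x returns (contours, hierarchy).
contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
grids = sorted(contours, key=cv2.contourArea, reverse=True)[:4]

for i, c in enumerate(grids):
    x, y, w, h = cv2.boundingRect(c)
    cv2.imwrite(f"grid_{i}.png", img[y:y + h, x:x + w])
    # each saved grid can then be sliced into a 9x9 matrix of cell images
```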
I've tried TikZ/PGF a bit, but have not had much luck creating a nice diagram to visualize bitfields or byte fields of packed data structures (i.e., in memory). Essentially, I want a set of rectangles representing ranges of bits with labels inside, and offsets along the top. There should be multiple rows, one for each word of the data structure. This is similar to most of the diagrams in processor manuals labeling opcode encodings, etc.
Has anyone else tried to do this using LaTeX, or is there a package for this?
I have successfully used the bytefield package for something like this. If it doesn't do exactly what you want, please extend your question with an example...
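For reference, a minimal bytefield sketch; the field names are made up:

```latex
\documentclass{article}
\usepackage{bytefield}
\begin{document}
\begin{bytefield}[bitwidth=1.1em]{16}
  \bitheader{0-15} \\                 % bit offsets along the top
  \bitbox{4}{opcode} & \bitbox{4}{reg} & \bitbox{8}{immediate} \\
  \wordbox{1}{address}                % a full 16-bit word on its own row
\end{bytefield}
\end{document}
```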
You will find several examples, with both the TikZ source code and a visual rendering, at http://www.texample.net/tikz/examples/