I have PDF files that were created electronically (not scanned), but they come in various layouts. These PDFs include tables, which are sometimes rotated by +90 or -90 degrees. Sometimes the first line of the file has the normal orientation (0 degrees) while all the other content is rotated.
The metadata of these files does not include the rotation information; the rotation is always 0. I need to extract tables from these files with the help of https://pdftables.com.
PDFTables was originally open source, based on the pdfminer Python library; now it is a commercial product. When I send these strange pages to PDFTables, the content cannot be read properly, so I need to figure out the orientation of the page before I send it there.
I tried extracting the text with pdfminer myself and comparing it with the output of the Unix tool pdftotext (which extracts the text properly); whenever there was a difference, I would mark the file for rotation.
Unfortunately, this does not always work, because pdfminer does not always give me the same results as PDFTables.
I have tried Python's OpenCV bindings on images of these PDFs, but this could only recognise the skew of the text, not a 90-degree rotation.
I have also tried the Hough transform to find lines of text and estimate their direction, but since there are tables on the pages, it is hard to tell whether a detected line is a line of text or a ruled table line.
Do you have any suggestions on how to solve this problem? Thanks.
I'm currently an MS student in Medical Physics and I have a great need to be able to overlay an isodose distribution from an RTDOSE file onto a CT image from a .dcm file set.
I've managed to extract the image and the dose pixel arrays myself using pydicom and dicom_numpy, but the two arrays are not the same size! So, if I overlay the two together, the dose will not be in the correct position based on what the Elekta Gamma Plan software exported it as.
I've played around with dicompyler and 3D Slicer, and they are obviously able to do this even though the arrays are not the same size. However, I don't think I can export the numerical data from these programs; I can only scroll through and view it as an image. How can I overlay the RTDOSE onto a CT image?
Thank you
For what you want, it sounds like you should use SimpleITK (or an equivalent; my experience is with SITK) to do the DICOM handling, not pydicom.
DICOM has a complete built-in system for specifying the 3D position of all the pixel data in patient coordinates. It uses a set of attributes in the Image Plane Module of each DICOM file. See here for a good overview.
The SimpleITK library understands the Image Plane tags and by default uses them to locate any image in patient coordinates, irrespective of things like the specific pixel spacing or slice thickness.
So, in your case, if you use SITK to open your studies, you should be able to overlay them correctly "out of the box", because SITK does all the work of parsing the Image Plane Module tags and locating the data in patient coordinates, just as 3D Slicer does.
Pydicom, in contrast, doesn't itself try to use any of that information at all. It only gives you the raw pixel arrays (for images).
Note I use both pydicom and SITK. This isn't something bad about pydicom, but more a question of right tool for the job. In fact, for many (most?) things I use pydicom, but for any true 3D type work, SITK is the easier toolkit to use.
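To make concrete what the Image Plane Module tags encode, here is a minimal sketch in plain Python of the mapping between pixel indices and patient coordinates for a single slice. The attribute names (ImagePositionPatient, ImageOrientationPatient, PixelSpacing) are the standard DICOM ones; the function names and all numeric values are made up for illustration and do not come from any real study. SITK does this bookkeeping for you; the sketch only shows why two arrays of different sizes can still be aligned.

```python
def pixel_to_patient(row, col, image_position, image_orientation, pixel_spacing):
    """Map a (row, col) pixel index to patient coordinates in mm."""
    # ImageOrientationPatient: the first three values are the direction of a
    # row (increasing column index), the last three the direction of a column
    # (increasing row index).
    row_dir = image_orientation[:3]
    col_dir = image_orientation[3:]
    # PixelSpacing: [spacing between rows, spacing between columns].
    row_spacing, col_spacing = pixel_spacing
    return tuple(
        image_position[k]
        + col * col_spacing * row_dir[k]
        + row * row_spacing * col_dir[k]
        for k in range(3)
    )


def patient_to_pixel(point, image_position, image_orientation, pixel_spacing):
    """Inverse mapping; returns fractional (row, col) within the slice plane."""
    row_dir = image_orientation[:3]
    col_dir = image_orientation[3:]
    row_spacing, col_spacing = pixel_spacing
    offset = [point[k] - image_position[k] for k in range(3)]
    col = sum(o * d for o, d in zip(offset, row_dir)) / col_spacing
    row = sum(o * d for o, d in zip(offset, col_dir)) / row_spacing
    return row, col


# Hypothetical geometry: an axial CT slice and a coarser dose plane.
axial = (1.0, 0.0, 0.0, 0.0, 1.0, 0.0)
ct_origin, ct_spacing = (-250.0, -250.0, 100.0), (0.5, 0.5)
dose_origin, dose_spacing = (-200.0, -200.0, 100.0), (2.0, 2.0)

# Where is CT pixel (256, 256) in the patient, and which dose pixel is there?
ct_point = pixel_to_patient(256, 256, ct_origin, axial, ct_spacing)
dose_index = patient_to_pixel(ct_point, dose_origin, axial, dose_spacing)
```

The same physical point lands on different indices in the two grids, which is why overlaying the raw arrays directly puts the dose in the wrong place.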
I am researching the best way to detect text in a photo using open source libraries.
I think the standard way is as follows (note: steps 1 - 4 all use OpenCV):
1) detect outline of document
2) transform document so it's flat and cropped, using said outline
3) Make the background of document white, using a filter
4) Feed resulting image to Tesseract
Is this the optimum process, or is there a better way, or better tools?
Also, what happens if the photo doesn't have a document outline (in which case steps 1 and 2 may be redundant)?
Is there any way to automatically detect the document orientation (i.e. portrait / landscape)?
I think your process is fine. I've used a similar process for an Android project.
I think that the only way you can discover if a document is portrait/landscape is to reason with the length of the sides of the bounding box of your outline.
I don't think there's a ready-made automatic way to do this, but you can look for the outermost contour that can be approximated with a 4-segment polyline (all doable in OpenCV). To get this you'll have to work with the contour hierarchy and contour approximation (see cv2.approxPolyDP).
This is how I would go for automatic outline detection. As I said, the rest of your algorithm seems just fine to me.
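The side-length reasoning can be sketched in a few lines of plain Python. This assumes the four corners have already been found (for example from cv2.approxPolyDP) and sorted into the order top-left, top-right, bottom-right, bottom-left; the function name and the sample coordinates are made up for illustration. Note that it only tells you the shape of the page, not whether the text on it is upright.

```python
import math


def page_orientation(corners):
    """Classify a quadrilateral outline as 'portrait' or 'landscape'.

    corners: four (x, y) points ordered top-left, top-right,
             bottom-right, bottom-left (an assumption of this sketch).
    """
    tl, tr, br, bl = corners
    # Average opposite sides so mild perspective distortion matters less.
    width = (math.dist(tl, tr) + math.dist(bl, br)) / 2
    height = (math.dist(tl, bl) + math.dist(tr, br)) / 2
    return "landscape" if width > height else "portrait"


# Hypothetical corner points of a slightly skewed A4-like page.
outline = [(0, 0), (210, 5), (212, 300), (2, 297)]
orientation = page_orientation(outline)
```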
PS. I'll leave my Android project GitHub link. I don't know if it will be useful to you, but there I specify the outline by dragging some handles, then transform the image and feed it to Tesseract, using Java and OpenCV. Yes, it's a very bad idea to do that in the main thread of an Android app, and yes, the app is not finished. I just wanted to experiment with OCR, so I didn't care much about performance and usability, since it was never intended for real use, just for studying.
Look up the stroke width transform.
What this does is detect edges which have more or less the same width with respect to their opposite edge. So things like drainpipes (which can be eliminated at a later pass) but also the majority of text. Whilst conceptually it's similar to a distance transform, the published method uses rather ad hoc normal projection methods and Canny edge detection.
I'm looking for a way to locate known text within an image.
Specifically, I'm trying to create a tool to convert a set of scanned pages into PDFs that support searching and copy+paste. I understand how this is usually done: OCR the page, retaining the position of the text, and then add the text as an invisible layer to the PDF. Acrobat has this functionality built in, and tesseract can output hOCR files (containing the recognized text along with its location), which can be used by hocr2pdf to generate a text layer.
Unfortunately, my source images are rather low quality (at most 150 DPI, with plenty of JPEG artifacts, and non-solid backgrounds behind some of the text), leading to pretty poor OCR results. However, I do have a copy of the text (sans pictures and layout) that appears on each page.
Matching already known text to its location on a scanned page seems like it would be much easier to do accurately, but I failed to discover any software with this capability built in. How can I leverage existing software to do this?
Edit: The text varies in size and font, though passages of it are consistent.
The thought that springs to mind for me would be a cross-correlation. So, I would take the list of words that you know occur on the page and render them one at a time onto a canvas to create a picture of that word. You would need to use a similar font and size as the words in the document - which is what I asked in my comment. Then I would run a normalised cross-correlation of the picture of the word against the scanned image to see where it occurs. I would do all that with ImageMagick which is available for Windows and OSX (use homebrew on OS X) and included in most Linux distros.
So, let's take a screengrab of the second paragraph of your question and look for the word pretty - where you mention pretty poor OCR.
First, you need to render the word pretty onto a white background. The command will be something like this:
convert -background white -fill black -font Times -pointsize 14 label:pretty word.png
Then perform a normalised cross-correlation using Fred Weinhaus's script from here like this:
normcrosscorr -p word.png scan.png correlation-result.png
Match Coords: (504,30) And Score In Range 0 to 1: (0.999803)
and you can see the coordinates of the match are 504,30.
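For reference, the idea behind a normalised cross-correlation can be sketched in plain Python. This is not what the normcrosscorr script does internally (that script works through ImageMagick and is far faster); it is a brute-force illustration on tiny greyscale images stored as lists of rows, and the function name and sample data are made up.

```python
import math


def normalised_cross_correlation(image, template):
    """Return (best_score, (x, y)) of the best template position in image.

    image, template: greyscale pictures as lists of rows of numbers.
    The score lies between -1 and 1; 1 means a perfect match.
    """
    th, tw = len(template), len(template[0])
    ih, iw = len(image), len(image[0])
    t_vals = [v for row in template for v in row]
    t_mean = sum(t_vals) / len(t_vals)
    t_dev = [v - t_mean for v in t_vals]
    t_norm = math.sqrt(sum(d * d for d in t_dev))

    best_score, best_pos = -1.0, None
    for y in range(ih - th + 1):
        for x in range(iw - tw + 1):
            patch = [image[y + dy][x + dx] for dy in range(th) for dx in range(tw)]
            p_mean = sum(patch) / len(patch)
            p_dev = [v - p_mean for v in patch]
            p_norm = math.sqrt(sum(d * d for d in p_dev))
            if p_norm == 0 or t_norm == 0:
                continue  # flat patch or template: correlation is undefined
            score = sum(a * b for a, b in zip(p_dev, t_dev)) / (p_norm * t_norm)
            if score > best_score:
                best_score, best_pos = score, (x, y)
    return best_score, best_pos


# Hypothetical data: a white "scan" with a small dark pattern at x=3, y=1.
scan = [[255] * 6 for _ in range(4)]
scan[1][3] = 0
scan[2][4] = 0
word = [[0, 255],
        [255, 0]]
score, position = normalised_cross_correlation(scan, word)
```

Subtracting the mean and dividing by the norms is what makes the score insensitive to overall brightness and contrast, which helps with uneven backgrounds.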
Another Idea
Another idea might be to take Google's Tesseract-OCR and replace the standard dictionary with the text file containing the words on the page you are processing...
I have a lot of data to plot in a single plot window, and the result looks ugly and is hard to understand. Moreover, the legend overlaps the curves, which makes them unreadable. I cannot put the curves into my LaTeX report one by one either, because that makes it difficult to move between the plots.
My question is: is it possible to put all the curves, generated with gnuplot, into a single plot window in the LaTeX report, so that I can easily step back and forth between them?
I know a bit about TikZ pictures, where a number of frames can easily be accessed in a single plot.
Can it be used to show the curves one by one, each as a separate frame, and finally all the curves together in the plot window?
It would be very helpful if this is possible.
I have data with N rows and M columns in it. I need plots of N rows vs. each column separately to be shown in each frame in Latex generated report and in the last frame all the curves should be present. I need a proper procedure to follow to animate the curves.
Yes, this kind of thing can be done with the animate package in LaTeX. I have successfully used it in the past for presentations that I put together with beamer. You could switch between different gnuplot graphs that are loaded into the animateinline environment, but you can also use pgfplots within TikZ to modify the plot directly in your LaTeX document without the need for an external plot.
Using animate requires investing a bit of time at the beginning, but the results can be very nice. Also, Okular (and I'm guessing other PDF viewers as well) seems to have trouble displaying the animations, but Adobe Reader (acroread on Linux) loads them without problems.
As an example, you can check a 5-minute presentation I put together last year: in slides 4 and 5 you can use the buttons to run the animation. The one in slide 4 includes plotting a gaussian with pgfplots changing the curve parameters between frames. You need to open it with the Adobe reader for it to play correctly.
All I can find on the web is about OCR, but I'm not there yet; I still have to recognize where the letters are in the image.
Any help will be appreciated.
The interesting thing is that the answer is not as simple as it may seem. Some may think that locating characters in the picture is the first step of OCR, but that is not the case. Actually, you won't be sure where each character is located until you have finished recognizing.
The way it works depends completely on the type of image you are going to recognize. First you should segment your image into text areas (blocks) and everything else.
Just a few examples:
If you are recognizing a license plate in a picture of a car, you should first locate the license plate, and only then split it into separate characters.
If you are recognizing an application form, you can locate the text areas just by knowing its layout.
If you are recognizing a scan of a book page, you have to distinguish pictures from text areas and then work only on the text.
From this point on you don't need the original image any more; all you need is a binarized image of the text block. Classic OCR algorithms work on binary images. You may also need other kinds of image transformations, such as line straightening, perspective correction, skew correction and so on; again, that all depends on the type of images you are recognizing.
Once the text block is found and normalized, you should go further and find the lines of text in the block. In the trivial case of horizontal lines of text this is quite simple: build a histogram of ink pixels per horizontal row.
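The row histogram can be sketched like this, assuming the text block is already binarized into a list of rows with 1 for ink and 0 for background. The function name, the `min_ink` parameter, and the sample block are made up for illustration.

```python
def find_text_lines(binary, min_ink=1):
    """Locate horizontal text lines in a binarized text block.

    binary: list of rows, 1 = ink, 0 = background.
    Returns a list of (top, bottom) row indices, both inclusive.
    """
    # Horizontal projection profile: amount of ink in each pixel row.
    profile = [sum(row) for row in binary]
    lines, start = [], None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                      # a text line begins
        elif ink < min_ink and start is not None:
            lines.append((start, y - 1))   # the line ended on the row above
            start = None
    if start is not None:                  # line touching the bottom edge
        lines.append((start, len(profile) - 1))
    return lines


# Hypothetical 6-row block with two "lines" of ink.
block = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]
text_lines = find_text_lines(block)
```

On noisy scans a `min_ink` above 1 helps ignore specks between lines; skewed text has to be deskewed first or the gaps in the profile disappear.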
Now that you have lines, you may think the rest is simple: just split them into characters, hooray! Again, that is wrong. There are phenomena such as connected characters, broken characters and even ligatures (two letters forming one single shape), or letters whose parts extend to the right, above or below the next character. What you should do is create several hypotheses for splitting the line into words and individual characters, then run OCR on every variant and weight every hypothesis with a confidence level. The last step is to check the different paths in this graph against a dictionary and select the best one.
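A toy version of the hypothesis graph might look as follows. Everything here is an assumption made for illustration: the spans, the per-character confidences, the function names, and in particular the way the dictionary is applied (a simple multiplicative bonus); real OCR engines weight paths in their own ways.

```python
def readings(candidates, position, line_end):
    """Yield (text, confidence) for every chain of spans from position to line_end.

    candidates: dict mapping (start, end) pixel spans to a list of
                (character, confidence) guesses from a character classifier.
    """
    if position == line_end:
        yield "", 1.0
        return
    for (start, end), guesses in candidates.items():
        if start != position:
            continue
        for char, conf in guesses:
            for rest, rest_conf in readings(candidates, end, line_end):
                yield char + rest, conf * rest_conf


def best_reading(candidates, line_end, dictionary, bonus=2.0):
    """Pick the path with the highest confidence, favouring dictionary words."""
    scored = [
        (conf * (bonus if text in dictionary else 1.0), text)
        for text, conf in readings(candidates, 0, line_end)
    ]
    return max(scored)[1]


# Hypothetical word image 40 px wide: is the tail "rn" or a single "m"?
hypotheses = {
    (0, 10): [("c", 0.9)],
    (10, 20): [("o", 0.9)],
    (20, 30): [("r", 0.6)],
    (30, 40): [("n", 0.6)],
    (20, 40): [("m", 0.7)],
}
with_dictionary = best_reading(hypotheses, 40, {"corn"})
without_dictionary = best_reading(hypotheses, 40, set())
```

Without the dictionary the single wide shape wins on raw confidence; with it, the two-character split is preferred, and only then are the character positions known.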
And only now, when you have actually recognized everything, can you say where the individual characters are located.
So the simple answer is: recognize your image with an OCR program, and get the coordinates of the characters from its output.
Generally speaking, you'll be looking for small contiguous areas of nearly solid color. I would suggest sampling each pixel and building an array of nearby pixels that also fall within a threshold of the original pixel's color (repeat for the neighbours of each matching pixel). Put the entire array aside as a potential character (or check it now) and move on (potentially ignoring previously collected pixels for a speedup).
Optimisations are possible if you know in advance the font-size, quality and/or color of the text. If not you'll want to be fairly generous with your thresholds of what constitutes a "contiguous area".
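A minimal sketch of that region-growing idea, assuming a greyscale image stored as a list of rows; the function name, the `threshold` and `max_size` parameters, and the sample data are made up for illustration. Pixels are compared against the seed pixel, and regions larger than `max_size` (typically the background) are discarded.

```python
from collections import deque


def find_regions(image, threshold=30, max_size=500):
    """Collect small contiguous areas of nearly uniform grey value.

    image: list of rows of grey values.
    Returns a list of regions, each a list of (x, y) pixel coordinates.
    """
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    regions = []
    for y in range(height):
        for x in range(width):
            if seen[y][x]:
                continue  # already collected: skip for a speedup
            seed = image[y][x]
            seen[y][x] = True
            queue, region = deque([(x, y)]), []
            while queue:
                cx, cy = queue.popleft()
                region.append((cx, cy))
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx]
                            and abs(image[ny][nx] - seed) <= threshold):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if len(region) <= max_size:
                regions.append(region)  # small enough to be a character candidate
    return regions


# Hypothetical 5x3 image: white background, two dark blobs.
sample = [
    [255, 255, 255, 255, 255],
    [255,  10,  12, 255,  20],
    [255, 255, 255, 255, 255],
]
candidates = find_regions(sample, threshold=30, max_size=4)
```

Each returned region still has to be checked (size, aspect ratio, or an actual classifier) before it can be called a character.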