I am using Ghostscript and Tesseract to extract text from scanned PDFs, but the OCR result for some parts of the PDF is not accurate. For testing purposes, I am taking screenshots of the PDF and passing them to Tesseract. Below are the scenarios and the problem I'm facing.
Scenario 1:
Link to Screenshot: https://dl.dropbox.com/u/9409594/scenario_1.tif
When I pass this image (a screenshot of the PDF at 125% zoom) to Tesseract, this is the text I get:
ART\CLE STANDARD
NUMBER PFUCE
Scenario 2:
Link to screenshot: https://dl.dropbox.com/u/9409594/scenario_2.tif
If I pass the above screenshot (300% zoom) to Tesseract, the result is good:
ARTICLE NUMBER
Below are the arguments I'm using with ghostscript and tesseract:
Ghostscript:
gswin64.exe -dNOPAUSE -dBATCH -dSAFER -sDEVICE=tifflzw -r600 -sOutputFile="C:\test\output.tiff" "C:\test\input.pdf"
Tesseract:
tesseract.exe "c:\test\output.tif" "c:\test\output.html" -l eng -psm 6 hocr
From my testing, it seems that Tesseract gives good results when it is passed a zoomed version of the image. Can I zoom the page with Ghostscript before converting it into an image? Or is there a better way to do this?
Appreciate your time and help!
You can try this,
http://www.fmwconcepts.com/imagemagick/textcleaner/index.php
As for the screenshots (you may already be aware of this): instead of taking a screenshot, you can convert the PDF to TIFF with ImageMagick's convert command, or, for a multi-page PDF, run pdftoppm first and then convert its output to TIFF with convert.
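On the "zoom" part of the question: Ghostscript has no zoom setting as such, but its -r option sets the rendering resolution, which has the same effect on the rasterized page. Below is a minimal Python sketch of that arithmetic. It assumes the viewer shows 100% zoom at 96 DPI (a common default, not something stated in the question), and the helper names zoom_to_dpi and ghostscript_args are made up for illustration. The Ghostscript flags are the ones from the question. Note that the command in the question already renders at 600 DPI, so resolution alone may not explain the difference between the two scenarios.

```python
def zoom_to_dpi(zoom_percent, screen_dpi=96):
    """Approximate render resolution matching a viewer zoom level.

    screen_dpi=96 is an assumption about the viewer, not a measured value.
    """
    return round(screen_dpi * zoom_percent / 100)


def ghostscript_args(pdf_path, tiff_path, dpi):
    """Build the Ghostscript argument list from the question with a chosen -r."""
    return [
        "gswin64.exe",
        "-dNOPAUSE",
        "-dBATCH",
        "-dSAFER",
        "-sDEVICE=tifflzw",
        f"-r{dpi}",                     # rendering resolution in DPI
        f"-sOutputFile={tiff_path}",
        pdf_path,
    ]


if __name__ == "__main__":
    # 300% zoom on an assumed 96 DPI screen is roughly a 288 DPI render.
    dpi = zoom_to_dpi(300)
    print(" ".join(ghostscript_args(r"C:\test\input.pdf", r"C:\test\output.tif", dpi)))
```

The list can be handed to subprocess.run once the paths are adjusted, and trying a few values of dpi is a quick way to see where the OCR output stops improving.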
Related
I'm new to Jupyter and trying to use it with LaTeX. Everything works fine except for the images. I used this tutorial: https://www.youtube.com/watch?v=m3o1KXA1Rjk&t=149s
PNG images are fine in Markdown, i.e
But LaTeX cannot determine the size of PNG images. If I save the image as PDF I get the same issue. If I save the image as EPS, LaTeX complains that it cannot convert EPS to PDF.
Has anyone had this issue? Anyone know how to solve it?
PNG images always contain the size in pixels.
Optionally, they can include a chunk of data named pHYs, which contains the resolution of the image. If this chunk is present, you should be able to find the literal text pHYs in the file.
If this chunk is missing from the PNG file, the physical scale of the image cannot be determined.
If you are on a UNIX-like operating system you could use grep or hexdump to check for the text pHYs in the PNG file. The identify program from the ImageMagick suite also can display the resolution of PNG images.
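If grep or hexdump are not at hand, the same check can be sketched in a few lines of Python using only the standard library. This walks the PNG chunk list and reports the pHYs contents if the chunk exists. The function names read_phys and ppm_to_dpi are made up for this example, and the sketch does not verify chunk CRCs.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_phys(data):
    """Return (x_pixels_per_unit, y_pixels_per_unit, unit) or None if no pHYs chunk.

    unit is 1 when the values are pixels per metre, 0 when the unit is unknown.
    """
    if data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    pos = 8
    while pos + 8 <= len(data):
        # Each chunk: 4-byte big-endian length, 4-byte type, body, 4-byte CRC.
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        body = data[pos + 8:pos + 8 + length]
        if ctype == b"pHYs":
            return struct.unpack(">IIB", body)
        if ctype in (b"IDAT", b"IEND"):
            return None  # pHYs, when present, comes before the image data
        pos += 12 + length
    return None


def ppm_to_dpi(pixels_per_metre):
    """Convert pixels per metre to dots per inch (1 inch = 0.0254 m)."""
    return round(pixels_per_metre * 0.0254)


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as f:
        print(read_phys(f.read()))
```

Run it with the PNG path as the only argument. A result of None means the file carries no resolution information.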
Note that there is an error in the video. The author first uses latex and then pdflatex three times. That is not a good idea, since they have different capabilities w.r.t. graphics. Stick with pdflatex.
I am writing a large document in LaTeX and generate most of my schematics using Inkscape. I store all Inkscape schematics as PDF and include them using the default latex figure environment.
Because the final result is quite large I then compress the document using ghostscript, using the following bash command:
gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.5 -dPDFSETTINGS=/printer -sOutputFile=${FNAME}_compressed.pdf $FNAME.pdf
$FNAME is the PDF file name without the extension. I noticed that the compressed PDF distorts some of the graphics generated by Inkscape, mostly by replacing solid black lines with dashed ones.
I uploaded a distortion example. The uncompressed original is on the right. Ghostscript distorted version on the left.
The distortion happens regardless of the -dPDFSETTINGS value. As far as I understand, Ghostscript compresses by regenerating the PDF code so that the result looks the same, but obviously this is not working correctly here. The only graphics affected appear to be the Inkscape-generated PDFs. Matplotlib-generated ones appear to be fine.
Could you help me figure out, what is wrong here?
Thanks in advance
I wanted to know if anyone has used Tesseract together with ImageMagick to get precise text out of an image. My main concern is small text in an image (or text that is not clearly visible). The only way I am able to retrieve such unclear text is by modifying the image with ImageMagick first, for example by scaling it and sometimes cropping it.
Has anyone integrated ImageMagick and Tesseract to create an even more powerful tool?
So far, I have come up with a script that can search for text in an image. The script uses ImageMagick and Tesseract. It is still under development, but you can look at it here
I'm running Embedded Linux on an evaluation kit (Zoom OMAP35x Torpedo Development Kit). The board has an LCD and I would like to be able to take screenshots and convert them into GIF or PNG. I can get the raw data by doing the following: "cp /dev/fb0 screen.raw", but I am stumped on how to convert the image into GIF or PNG format.
I played around with convert from ImageMagick (example: "convert -depth 8 -size 240x320 rgb:./screen.raw -swap 0,2 -separate -combine screen.png"), but have been unable to get an image that looks right.
Does anyone know of any other tools that I could try out? Or does anyone have tips for using ImageMagick?
Take a look at fbgrab, an application that does just that (it saves the framebuffer content as a png).
You can simply capture the framebuffer to a file and open it in any raw image viewer, or try an online one, e.g. https://rawpixels.net/
cat /dev/fb0 > fbdump
It might not be possible / easy to do it directly with ImageMagick.
The Linux kernel 4.2 documentation https://github.com/torvalds/linux/blob/v4.2/Documentation/fb/api.txt#45 says:
Pixels are stored in memory in hardware-dependent formats. Applications need
to be aware of the pixel storage format in order to write image data to the
frame buffer memory in the format expected by the hardware.
Formats are described by frame buffer types and visuals. Some visuals require
additional information, which are stored in the variable screen information
bits_per_pixel, grayscale, red, green, blue and transp fields.
Visuals describe how color information is encoded and assembled to create
macropixels. Types describe how macropixels are stored in memory. The following types and visuals are supported.
A list of visuals and types follows, but the description is not enough for me to understand the exact formats immediately.
But it seems likely that the framebuffer format is not one that ImageMagick understands directly, or at least you would have to find out which format is in use before you can choose the ImageMagick options.
I have a small drawing in Inkscape and I want to embed it in a LaTeX document which I compile using pdftex. pdftex seems to have the oddity of not accepting .eps. In fact, if I understood correctly, the only vector graphics format it accepts is PDF. When I save my drawing in Inkscape as PDF, what I get is a full-page PDF with my drawing in the upper corner.
Is there a way to import an Inkscape drawing to pdftex and ignoring this page size? Or do I need to start fiddling with the page settings to make the page size exactly fit the size of my drawing?
So it turns out that Inkscape has a button on the Document properties titled "Fit page to selection" which makes this easier. Oh well.
Yes, pdftex does not accept eps.
I have used Inkscape to make figures that I incorporate into .tex documents, which I then process with pdflatex. And yes, I set the page size in Inkscape so that the figure fits.
You could also try to export to .eps from inkscape, then convert to pdf with the "epstopdf" tool.
Are you giving the optional scaling parameters to \includegraphics? PDF handles bounding boxes differently from encapsulated postscript, and auto-sizing does not seem to work as well.
As far as I know, you have to adjust the bounding box (that is, the paper size) in your PDF. There are tools like eps2pdf to convert EPS to PDF with the same bounding box.
You can even fully automate the process of converting your SVGs into PDFs, since Inkscape can be called from the command line. For instance, the following makefile code does the job (cropping and converting) for me:
# svg -> pdf
$(GRAPHIC_DIR)/%.pdf: $(GRAPHIC_DIR)/%.svg
	cp $(GRAPHIC_DIR)/$*.svg $(GRAPHIC_DIR)/$*-crop.svg
	inkscape --verb=FitCanvasToSelectionOrDrawing --verb=FileSave --verb=FileClose $(GRAPHIC_DIR)/$*-crop.svg
	inkscape -A $(GRAPHIC_DIR)/$*.pdf $(GRAPHIC_DIR)/$*-crop.svg
	rm $(GRAPHIC_DIR)/$*-crop.svg
Moreover, cropping pdf files can be also done using pdfcrop.
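The makefile rule above targets the older Inkscape command line. Newer Inkscape releases (1.x) changed the interface, so here is a minimal Python sketch of an equivalent export, assuming such a version is installed. It relies on the --export-area-drawing, --export-type and --export-filename options of Inkscape 1.x. The helper name inkscape_crop_cmd is made up for this example, and the command has not been checked against any specific release.

```python
from pathlib import Path


def inkscape_crop_cmd(svg_path):
    """Build an Inkscape 1.x command that exports only the drawing area as PDF.

    Assumes the Inkscape 1.x command line; older releases use different options.
    """
    svg = Path(svg_path)
    pdf = svg.with_suffix(".pdf")
    return [
        "inkscape",
        "--export-area-drawing",        # crop the page to the drawing
        "--export-type=pdf",
        f"--export-filename={pdf}",
        str(svg),
    ]


if __name__ == "__main__":
    # figures/schematic.svg is a placeholder path.
    print(" ".join(inkscape_crop_cmd("figures/schematic.svg")))
```

Because the export itself is restricted to the drawing area, no temporary cropped copy of the SVG is needed. The list can be passed to subprocess.run, or the printed command can replace the recipe lines in the makefile.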