Ghostscript and page margin - printing

I have an HP Deskjet 5150 PCL-compatible printer and I need to print a PostScript file. If I view the file with gv, its margins are fine. When I try to print it with:
gs -dSAFER -dNOPAUSE -dBATCH -q -sDEVICE=hpdjportable -sOutputFile=/dev/usb/lp0 file.ps
the left margin is shifted to the right by approximately 6 mm. As a consequence, the rightmost 6 mm of the page are cropped off. I know this flaw is barely noticeable, but I dislike it. The print is otherwise more than fair.
Any help is greatly appreciated.

Sounds like your printer has a hardware margin, an area where it simply cannot print, often due to paper handling hardware.
This can mean that the printable area of the paper is smaller than the media itself, so if you try to print right to the edges then bits 'drop off'. Screen displays obviously don't suffer from this problem.
Generally, PostScript-consuming printers will either use a PPD which includes the printable area, or they will rescale the input slightly to fit.
Now, I suspect that the PCL output from Ghostscript is nothing more than a bitmap wrapped up with just enough PCL to make it print, which means that it will be assuming it can print right to the edges. So your solution will be to rescale the output slightly and probably shift it on the media a bit as well.
You can use any of several command line options to select a different media size, such as -dDEVICEWIDTHPOINTS and -dDEVICEHEIGHTPOINTS, or -g. You will also need -dFIXEDMEDIA (so the PostScript can't change the media size) and -dFitPage to make GS scale the contents to fit the new size. Finally, you will need to write a little PostScript to move the output around a little:
-c "<</PageOffset [-18 0]>> setpagedevice" -f
You should put that as the last option, just before the input filename. You will almost certainly need to meddle with the numbers in there to make it come out right.
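As a rough sketch of how those options fit together, the following Python helper converts the 6 mm shift from the question into points and assembles the argument list. The A4 paper size (595 x 842 points), the choice to shrink only the width, and the helper names are assumptions made for illustration; the actual numbers will need tuning on the real printer.

```python
import shlex

MM_PER_INCH = 25.4
POINTS_PER_INCH = 72.0


def mm_to_points(mm):
    """Convert millimetres to PostScript points (1 pt = 1/72 inch)."""
    return mm / MM_PER_INCH * POINTS_PER_INCH


def build_gs_args(input_file, paper_w_pt, paper_h_pt, margin_mm, device, output):
    """Build a gs argument list that shrinks the page and shifts it left."""
    margin_pt = round(mm_to_points(margin_mm))
    # Assumption: shrinking the width by the margin is a sensible first guess.
    width = paper_w_pt - margin_pt
    page_offset = "<</PageOffset [-%d 0]>> setpagedevice" % margin_pt
    return [
        "gs", "-dSAFER", "-dNOPAUSE", "-dBATCH", "-q",
        "-sDEVICE=%s" % device,
        "-sOutputFile=%s" % output,
        "-dDEVICEWIDTHPOINTS=%d" % width,
        "-dDEVICEHEIGHTPOINTS=%d" % paper_h_pt,
        "-dFIXEDMEDIA",
        "-dFitPage",
        # The PostScript snippet goes last, just before the input file.
        "-c", page_offset, "-f", input_file,
    ]


# Assumed A4 media; device and output path are taken from the question.
args = build_gs_args("file.ps", 595, 842, 6, "hpdjportable", "/dev/usb/lp0")
print(" ".join(shlex.quote(a) for a in args))
```

Printing the command instead of running it makes it easy to adjust the offset by hand before sending anything to the printer.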

Related

Finding known text in an image (guided OCR)

I'm looking for a way to locate known text within an image.
Specifically, I'm trying to create a tool to convert a set of scanned pages into PDFs that support searching and copy+paste. I understand how this is usually done: OCR the page, retaining the position of the text, and then add the text as an invisible layer to the PDF. Acrobat has this functionality built in, and tesseract can output hOCR files (containing the recognized text along with its location), which can be used by hocr2pdf to generate a text layer.
Unfortunately, my source images are rather low quality (at most 150 DPI, with plenty of JPEG artifacts, and non-solid backgrounds behind some of the text), leading to pretty poor OCR results. However, I do have a copy of the text (sans pictures and layout) that appears on each page.
Matching already known text to its location on a scanned page seems like it would be much easier to do accurately, but I failed to discover any software with this capability built in. How can I leverage existing software to do this?
Edit: The text varies in size and font, though passages of it are consistent.
The thought that springs to mind for me would be a cross-correlation. So, I would take the list of words that you know occur on the page and render them one at a time onto a canvas to create a picture of that word. You would need to use a similar font and size as the words in the document - which is what I asked in my comment. Then I would run a normalised cross-correlation of the picture of the word against the scanned image to see where it occurs. I would do all that with ImageMagick which is available for Windows and OSX (use homebrew on OS X) and included in most Linux distros.
So, let's take a screengrab of the second paragraph of your question and look for the word pretty - where you mention pretty poor OCR.
First, you need to render the word pretty onto a white background. The command will be something like this:
convert -background white -fill black -font Times -pointsize 14 label:pretty word.png
Result:
Then perform a normalised cross-correlation using Fred Weinhaus's script from here like this:
normcrosscorr -p word.png scan.png correlation-result.png
Match Coords: (504,30) And Score In Range 0 to 1: (0.999803)
and you can see the coordinates of the match are 504,30.
Result:
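To make the idea concrete, here is a minimal pure-Python sketch of a normalised cross-correlation over a tiny grayscale grid. It is not the normcrosscorr script itself; the function names and the toy data are made up for illustration, and a real page would need an imaging library and a much faster implementation. Note that this score runs from -1 to 1, with 1 meaning a perfect match.

```python
import math


def ncc_at(image, template, top, left):
    """Normalised cross-correlation of the template against one image window."""
    th, tw = len(template), len(template[0])
    window = [image[top + r][left + c] for r in range(th) for c in range(tw)]
    templ = [template[r][c] for r in range(th) for c in range(tw)]
    mean_w = sum(window) / len(window)
    mean_t = sum(templ) / len(templ)
    num = sum((w - mean_w) * (t - mean_t) for w, t in zip(window, templ))
    den = math.sqrt(sum((w - mean_w) ** 2 for w in window)
                    * sum((t - mean_t) ** 2 for t in templ))
    # A completely flat window has no variance; treat it as "no match".
    return num / den if den else 0.0


def best_match(image, template):
    """Slide the template over the image; return ((x, y), score) of the peak."""
    th, tw = len(template), len(template[0])
    best = ((0, 0), -1.0)
    for top in range(len(image) - th + 1):
        for left in range(len(image[0]) - tw + 1):
            score = ncc_at(image, template, top, left)
            if score > best[1]:
                best = ((left, top), score)
    return best


# Toy data: 255 = white paper, 0 = black ink.
template = [[0, 255, 0],
            [255, 0, 255]]
image = [[255] * 8 for _ in range(6)]
for r in range(2):
    for c in range(3):
        image[3 + r][4 + c] = template[r][c]  # paste the "word" at x=4, y=3

(x, y), score = best_match(image, template)
print("Match coords: (%d,%d), score %.3f" % (x, y, score))
```

On a real scan the peak will be well below 1, so some threshold on the score is needed to decide whether the word was actually found.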
Another Idea
Another idea might be to take Google's Tesseract-OCR and replace the standard dictionary with the text file containing the words on the page you are processing...

How can I tell Tesseract that my font has a particular size?

I have a collection of type-written image captions which look like this:
I know that the typewriter is consistent and monospace, with characters measuring 14x22px (as measured from the top of a capital letter to the bottom of a descender).
Tesseract is producing output like this:
The results are mostly good when Tesseract has detected the correct bounding boxes for the letters. But there are many strings of letters which are clumped together (e.g. "Ea", "tree", "fr" and "om" on the first line). These are always transcribed incorrectly and account for the majority of errors.
This is frustrating because I know a priori that all the characters are of a particular size. Is it possible to pass this knowledge on to the tesseract command line tool?
My command to generate the box file is:
tesseract foo.jpg foo batch.nochop makebox
If possible, I'd prefer to avoid training Tesseract on the font—I don't have any manually transcribed samples, so building a corpus of training data would require some effort.
I'm not sure that connected characters throw Tesseract off completely, as Noremac said.
Actually, I think that it chops joined characters whenever the result of a word detection is unsatisfactory, as explained in section 4.1 of "An Overview of the Tesseract OCR Engine".
I also think that once it finds fixed-pitch text, it should automatically chop the text, even if the characters are connected (look at figure 2 of the same paper).
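The fixed-pitch idea can be illustrated with a small sketch that splits a merged blob into equal cells using the 14 px character width from the question. This is only an illustration of chopping by known pitch, not Tesseract's actual algorithm, and the function name is hypothetical.

```python
CHAR_WIDTH = 14  # Character width in pixels, taken from the question.


def chop_fixed_pitch(left, right, pitch=CHAR_WIDTH):
    """Split a horizontal span into equal cells, one per expected character."""
    width = right - left
    count = max(1, round(width / pitch))
    cell = width / count
    return [(round(left + i * cell), round(left + (i + 1) * cell))
            for i in range(count)]


# A blob 28 px wide is most likely two touching characters (e.g. "Ea").
print(chop_fixed_pitch(100, 128))
# A blob 43 px wide is most likely three touching characters.
print(chop_fixed_pitch(100, 143))
```

Such cells could be used to correct the box file by hand or by script before any further processing.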
I know that it's a little bit late to add this answer, but maybe it will help some future visitors!
The issue isn't the font size as much as it is the letters connecting. If you zoom in on the above images with a program that shows the actual pixels (rather than blurring them together), you can see that the characters in each of those groups are actually connected. Tesseract OCR is completely based on connected components, so if characters are connected at all, it is thrown off completely. I see a couple of options:
If possible, give it a higher resolution image where there is more separation between the characters.
Adjust the preprocessing to apply a stricter threshold.
I noticed that the pixel connecting the E and the a in the first occurrence is lighter, so adjusting the threshold will remove that connection. However, this could affect more than you want, such as splitting characters where you don't expect it.
For updating the thresholding consider this: https://groups.google.com/forum/#!topic/tesseract-ocr/JRwIz3xL45U
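A toy example shows why a stricter threshold can separate touching characters. The sketch below binarises a tiny grayscale grid at two thresholds and counts the connected ink blobs. The pixel values, the function names, and the use of 4-connectivity are assumptions chosen for the demonstration.

```python
def binarize(gray, threshold):
    """Pixels darker than the threshold count as ink (True)."""
    return [[value < threshold for value in row] for row in gray]


def count_components(ink):
    """Count 4-connected ink blobs with an iterative flood fill."""
    rows, cols = len(ink), len(ink[0])
    seen = [[False] * cols for _ in range(rows)]
    blobs = 0
    for r in range(rows):
        for c in range(cols):
            if not ink[r][c] or seen[r][c]:
                continue
            blobs += 1
            stack = [(r, c)]
            seen[r][c] = True
            while stack:
                y, x = stack.pop()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and ink[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
    return blobs


# Two dark strokes (value 20) joined by one lighter bridge pixel (value 150).
W = 255
gray = [
    [W, W, W, W, W, W, W],
    [W, 20, 20, W, 20, 20, W],
    [W, 20, 20, 150, 20, 20, W],
    [W, 20, 20, W, 20, 20, W],
    [W, W, W, W, W, W, W],
]
lenient = count_components(binarize(gray, 200))  # bridge counts as ink
strict = count_components(binarize(gray, 100))   # bridge becomes background
print(lenient, strict)
```

With the lenient threshold the two strokes form one blob; with the stricter one they separate. The same stricter threshold could also break thin strokes inside a single character, which is the risk mentioned above.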

MySQL WorkBench EER diagram dimensions are terrible

I am using MySQL Workbench to generate an EER diagram and, to the best of my knowledge, one cannot control the dimensions of the canvas, only the size in number of pages. As a result, you get a huge amount of white space around your diagram, making it nearly unusable. Why anyone would design it this way is beyond me. There are a lot of questions which ask how to crop a PDF, but they are either more complicated (i.e. crop to a certain dimension, or crop and output to a different format and ratio), or they do not preserve the image quality, or they just plain do not work. My question therefore is this:
How does one create or convert an EER diagram using MySQL Workbench such that there are no white borders AND the image quality is preserved?
Note I asked the question here as it pertains to databasing, but apologies if it is in the wrong place.
Looks like what you are after is a way to limit the output of an image export to a relatively small area, so that it fits nicely in another document. Several options are possible:
1) Export as png and simply cut off the unwanted parts. Depending on the further usage this might be good enough.
2) Export as SVG and use any of the SVG editors to limit the image size to the wanted area only. Then convert it to the format you need in your target document.
3) Set a paper size in the model that encompasses the content as closely as possible. E.g. the statement paper type is quite small. Then rearrange your objects. Resize them if you need larger ones. By setting a larger font (via Preferences) you should be able to make the entire appearance larger. Then export as PDF.
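For option 1, cutting off the unwanted parts can be automated by finding the bounding box of everything that is not background. The sketch below works on a plain grid of grayscale values to stay self-contained. Loading the exported PNG into such a grid is left to an imaging library of your choice, and the function names are made up for this example.

```python
def content_box(pixels, background=255):
    """Return (left, top, right, bottom) of the non-background area, or None."""
    rows = [y for y, row in enumerate(pixels)
            if any(v != background for v in row)]
    cols = [x for row in pixels for x, v in enumerate(row) if v != background]
    if not rows:
        return None  # the image is entirely background
    return (min(cols), min(rows), max(cols) + 1, max(rows) + 1)


def crop(pixels, box):
    """Cut the grid down to the given box (right and bottom are exclusive)."""
    left, top, right, bottom = box
    return [row[left:right] for row in pixels[top:bottom]]


# Toy "diagram": a 6x8 white canvas with a small dark rectangle inside.
canvas = [[255] * 8 for _ in range(6)]
for y in (2, 3):
    for x in (3, 4, 5):
        canvas[y][x] = 0

box = content_box(canvas)
cropped = crop(canvas, box)
print(box, len(cropped), len(cropped[0]))
```

Because the pixels inside the box are copied unchanged, the image quality of the remaining area is preserved.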

Printing with delphi

I am facing some difficulties with printing. When I print my reports to a physical printer, the text is perfectly centred, but when I print the same report to a PDF printer (e.g. CutePDF) or the XPS Document Writer, the left margin becomes 0. When I adjust the margin, it works fine in PDF and XPS, but the physical printer then prints the pages with some extra left margin. I cannot find the cause of this difference. I also tried to set the margin only for non-physical printing, but was not able to do so.
It would be great if it were possible to set the margin according to the selected printer, e.g. if I select the PDF printer or the XPS writer, the margin changes. I am using the Printer.Canvas.TextOut() procedure to print the text.
Can anybody please help me with this?
Some points which are worth highlighting:
From the Windows (and Delphi's TPrinter.Canvas) POV, there is no such concept as margins during drawing: the whole paper size is available to the canvas - for instance, X=0 will point to the absolute leftmost part of the paper;
There are so called "hardware margins" or "physical margins", depending on the printer capability: this is the non printable area around the paper; that is, if you draw something in this area, it won't be painted - these margins depend on the technology and model of printer used, and in some cases, it is possible to retrieve those "margins" values from the printer driver via GetDeviceCaps API calls;
But, in my experience, do not trust those "physical margins" as reported by the printer driver - it is better (and more aesthetically pleasing) to use some software-defined margins, and let your user change them if necessary (like the "Page layout" options of MS Word);
PDF printers usually are virtual printers, so they do not have any "physical margin";
When you print a PDF document, Acrobat Reader is able to "fit" the page content to the "physical margins" of the physical printer.
So here are some possible solutions:
From Acrobat Reader, if your PDF has no margin, click on Print, then select "Fit to Printable Area" in the "Page Handling / Page Scaling" option - I guess you have "None" as the setting here, so the result is truncated by the printer;
From your Delphi application, set some "logical" margins (e.g. 1 cm around your paper) when drawing your report - that is, do not start at X=0 and Y=0, but with some offsets, and let the width and height of your drawing area be smaller (see for instance how our Open Source report engine is implemented);
From your Delphi application, if you use a Report component, there should be some properties to set the margins.
See this article about general printing using Delphi (some info is old, but most is still accurate), or set up your report engine properly.
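The arithmetic behind a software-defined margin can be sketched as follows (in Python rather than Delphi, purely to show the numbers). It assumes the resolution and hardware offset have been read from the driver, for example with GetDeviceCaps and the LOGPIXELSX and PHYSICALOFFSETX indices, and it assumes the canvas origin sits at the corner of the printable area, so the offset is subtracted. If your canvas origin is the paper edge, pass an offset of 0. The function names and sample values are hypothetical.

```python
def cm_to_device_units(cm, dpi):
    """Convert centimetres to device pixels at the given resolution."""
    return round(cm / 2.54 * dpi)


def canvas_origin(margin_cm, dpi_x, dpi_y, phys_offset_x, phys_offset_y):
    """Canvas coordinate where drawing should start for a paper-relative margin."""
    x = cm_to_device_units(margin_cm, dpi_x) - phys_offset_x
    y = cm_to_device_units(margin_cm, dpi_y) - phys_offset_y
    # A margin smaller than the hardware offset cannot be honoured; clamp to 0.
    return max(0, x), max(0, y)


# Hypothetical physical printer: 600 dpi, 150 px unprintable strip on each axis.
print(canvas_origin(1.0, 600, 600, 150, 150))
# Hypothetical virtual PDF printer: 600 dpi, no unprintable area.
print(canvas_origin(1.0, 600, 600, 0, 0))
```

Starting the drawing at these coordinates instead of X=0 and Y=0 should put the text at the same distance from the paper edge on both kinds of printer, as long as the driver reports its values correctly.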
If you use TextOut (and not DrawText), you have the x and y coordinates where you are going to put the string(s) you need to print. You can follow the calculations in the debugger (or log them if the application runs without a debugger present). Perhaps something goes wrong in determining the coordinates, e.g. TextExtent fails to measure the text before centering, the resolution is different from what you expect, or you get the printer canvas with a transformation so the coordinates are not 1:1 with the pixels.
If you are unsure about the coordinates / font issues: try drawing some boxes at the expected coordinates so you can leave all font-related errors out of the equation. If they exhibit the same problem, it's a coordinate problem; if not, it's a font problem somehow.
As Ken said, we cannot know anything more if you do not show the code... so many possibilities..

Generate font from an image of text

Is it possible to generate a specific font from the image given below?
My idea is to generate a specific font for the image of text given below by manually selecting portions of the image and mapping them to a set of letters, and then to use this font to make the text readable by an OCR engine. Is font generation possible using any open-source implementation? Also, please suggest any good OCR engines.
Abbyy FineReader 10 gets better than expected results but predictably gets confused when the characters touch.
Your problem is that the line spacing is too small. The descenders of each line overlap the character bounding boxes of the characters in the line directly below. This makes character segmentation almost impossible because the characters are touching and overlapping. The number of combinations of overlapping characters is virtually impossible to train for. The 'g' and 'y' characters are the worst offenders.
A double line spaced version of this would probably OCR reasonably well.
A custom solution that segmented and separated each line, along with a good dictionary, would definitely improve the results. There would still be some errors to correct manually though. The custom routine would have to deal with the ascenders and descenders and try to segment the image into lines which can then be fed to a decent OCR engine. One way would be to analyse every character blob on the page and allocate it to a line. Leptonica (www.leptonica.com - C Imaging Library) would probably make this job a little easier.
I would not try this without increasing the resolution to 200 or 300 dpi first.
With this custom solution, training a font becomes an option if the OCR engine does a poor job initially.
Abbyy (www.abbyy.com) or Google Tesseract OCR 3.00 would be a good place to start.
No guarantees as to whether all of this will work though. This is quite a difficult page to OCR and you need to work out whether it is better to have it typed up manually overseas. It depends on the number of pages you need to process.
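The step of allocating every character blob to a line might look roughly like the sketch below. It assumes the blobs are already available as bounding boxes (left, top, right, bottom), for instance from a connected-component pass in an imaging library, and groups them by vertical centre. The tolerance value and the function name are assumptions, and blobs that physically touch across two lines would still need to be split first.

```python
def assign_to_lines(blobs, tolerance):
    """Group blobs (left, top, right, bottom) into lines by vertical centre."""
    lines = []  # each entry: [running mean of centre y, list of member blobs]
    for blob in sorted(blobs, key=lambda b: (b[1] + b[3]) / 2):
        centre = (blob[1] + blob[3]) / 2
        if lines and abs(centre - lines[-1][0]) <= tolerance:
            members = lines[-1][1]
            members.append(blob)
            # Update the line's centre so later blobs compare against the mean.
            lines[-1][0] = sum((b[1] + b[3]) / 2 for b in members) / len(members)
        else:
            lines.append([centre, [blob]])
    # Within each line, order the blobs left to right.
    return [sorted(members) for _, members in lines]


blobs = [
    (0, 10, 8, 20),    # line 1, plain letter
    (10, 10, 18, 26),  # line 1, descender reaching into line 2's band
    (20, 6, 28, 20),   # line 1, ascender
    (0, 30, 8, 40),    # line 2, plain letter
    (10, 24, 18, 40),  # line 2, ascender overlapping line 1's descender
]
text_lines = assign_to_lines(blobs, tolerance=6)
for number, members in enumerate(text_lines, start=1):
    print(number, members)
```

Using the vertical centre rather than the top or bottom edge keeps a descender such as 'g' or 'y' on its own line even when its box overlaps the line below.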

Resources