Extracting text as CSV from a scanned PDF file using Tesseract and OpenCV

I need help extracting text from a scanned PDF. I have tried PyMuPDF, Pillow and pytesseract, but the results are not correct: some of the text is returned incorrectly.
I tried increasing the sharpness and brightness, but still did not get a good result.
I have already checked many answers that use OpenCV, but I am fairly new to OpenCV. Please help.
def pdf_to_text(pdf_file,text_file_name,rotate_pdf=False,adj_sharpness=False,adj_contract=False,adj_brightness=False):
    try:
        doc = fitz.open(pdf_file)
        zoom_x=2.5
        zoom_y=2.5
        mat = fitz.Matrix(zoom_x,zoom_y)
        files = []
        for n in range(doc.page_count):
            #print(f'Extracting {n} image')
            page = doc.load_page(n)
            if rotate_pdf:
                page.set_rotation(-90)
            #pix = page.get_pixmap(dpi=600)
            pix = page.get_pixmap(alpha=False,matrix=mat,dpi=300)
            folder=os.path.join(os.getcwd(),"images")
            if not os.path.exists(folder):
                os.makedirs(folder)
            fname = os.path.join(folder,"page-%i.png"%n)
            pix.save(fname)
            im = Image.open(fname)
            im = adjust_sharpness(im,2.5)
            im = adjust_brightness(im,1.1)
            im = adjust_contrast(im,2.8)
            #im = im.filter(ImageFilter.SMOOTH)
            im.save(fname)
            #remove_lines(fname)
            files.append(fname)
            #if n>1:
            #    break
        print("Extracting Images Completed")
        print("Now Extracting data from image file")
        for file in files:
            #file = "./images/page-0.png"
            text = image_to_string(file, lang_code="eng")
            #text = image_to_string(file, lang_code="fra+eng")
            make_textfile(text, text_file_name)
        print("Extracting and saving text files completed")
    except FileNotFoundError:
        print(f"File not available {pdf_file}")
        return None
pytesseract.image_to_string(image=Image.open(image_name))

To process tables with Tesseract you will probably need to remove the table lines, which helps the OCR engine segment the image. However, you can try this first to see how Tesseract performs:
text = pytesseract.image_to_data(file, lang="eng", config="--psm 6")
Page segmentation mode 6 (--psm 6) makes Tesseract treat the image as a single uniform block of text, so it is less likely to skip text. Removing the lines and binarizing the image should still lead to better results. This link would help you with the removal of lines.
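Since the goal is CSV, here is a minimal sketch of turning the TSV text that pytesseract.image_to_data returns by default into CSV rows. It assumes the standard Tesseract TSV columns (page_num, block_num, par_num, line_num, left, conf, text). The function names tsv_to_lines and lines_to_csv and the min_conf parameter are my own, not part of pytesseract. It puts one recognized word in each cell, which is crude for table cells that contain several words.

```python
import csv
import io
from collections import defaultdict


def tsv_to_lines(tsv_text, min_conf=0.0):
    """Group the words of a Tesseract TSV dump into text lines."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t",
                            quoting=csv.QUOTE_NONE)
    lines = defaultdict(list)
    for row in reader:
        word = (row.get("text") or "").strip()
        # Page, block and line rows have no text and a confidence of -1.
        if not word or float(row["conf"]) < min_conf:
            continue
        key = (int(row["page_num"]), int(row["block_num"]),
               int(row["par_num"]), int(row["line_num"]))
        lines[key].append((int(row["left"]), word))
    # Lines in reading order, words within a line from left to right.
    return [[w for _, w in sorted(words)] for _, words in sorted(lines.items())]


def lines_to_csv(lines):
    """Serialize the grouped lines as CSV, one word per cell."""
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(lines)
    return out.getvalue()
```

You would call it as lines_to_csv(tsv_to_lines(text)), where text is the string returned by image_to_data. Raising min_conf drops low-confidence words, which can help with the noise that leftover table lines produce.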

Related

Find PDF Form Field position

I realize this question has been asked a lot, but I could not find any information about how to do this in RoR. I am filling a PDF with form text fields using pdf-forms but this does not support adding images, and I need to be able to add an image of a customer's signature into the PDF. I have used prawn to render the image on the existing PDF, but I need to know the exact location to add the image on the signature line. So my question is how can I look at an arbitrary PDF and find the exact position of the "Signature" form field?
I ended up using pdf2json to find the x,y position of the form field. I generate a JSON file of the original pdf using this command:
%x{ pdf2json -f "#{form_path}" }
The JSON file is generated in the same directory as form_path. I find the field I want using these commands:
jsonObj = JSON.parse(File.read(json_path))
signature_fields = jsonObj["formImage"]["Pages"].first["Fields"].find_all do |f|
  f["id"]["Id"] == 'signature'
end
I can use prawn to first create a new PDF with the image. Then using pdf-forms, I multistamp the image pdf onto the original PDF that I want to add the image to. But multistamp applies each page of the stamp PDF to the corresponding page of the input PDF so make sure your image PDF has the correct number of pages or else your image will get stamped on every page. I only want the image stamped onto the first page, so I do the following:
num_pages = %x{ #{Rails.configuration.pdftk_path} #{form_path} dump_data | grep "NumberOfPages" | cut -d":" -f2 }.to_i
signaturePDF = "/tmp/signature.pdf"
Prawn::Document.generate(signaturePDF) do
  signature_fields.each do |field|
    image Rails.root.join("signature.png"), at: [field["x"], field["y"]],
                                            width: 50
  end
  # Pad with blank pages so the stamp PDF has as many pages as the original.
  (num_pages - 1).times { start_new_page }
end
outputPDF = "/tmp/output.pdf"
pdftk.multistamp originalPDF, signaturePDF, outputPDF
You can use the 'wicked_pdf' gem. You just write HTML, and the gem automatically converts it to PDF.
Read more: https://github.com/mileszs/wicked_pdf
Here's a pure ruby implementation that will return the field's name, page, x, y, height, and width using Origami https://github.com/gdelugre/origami
require "origami"
def pdf_field_metadata(file_path)
  pdf = Origami::PDF.read file_path
  # Map each annotation's object number to the index of the page it is on.
  field_to_page = {}
  pdf.pages.each_with_index do |page, page_index|
    (page.Annots || []).each do |annot|
      field_to_page[annot.refno] = page_index
    end
  end
  field_metas = []
  pdf.fields.each do |field|
    field_metas << {
      name: field.T,
      page_index: field_to_page[field.no],
      x: field.Rect[0].to_f,
      y: field.Rect[1].to_f,
      h: field.Rect[3].to_f - field.Rect[1].to_f,
      w: field.Rect[2].to_f - field.Rect[0].to_f
    }
  end
  field_metas
end
pdf_field_metadata "<path to pdf>"
I haven't tested it particularly thoroughly, but the snippet can hopefully get you most of the way there.
Also, keep in mind that the coordinates calculated above are in points measured from the bottom left of the PDF page rather than the top left (and are not in pixels). I believe there are always 72 points per inch, and you can get the page size in points by calling page.MediaBox in the pdf.pages loop above. If you're looking for pixel coordinates, you need to know the DPI of the resulting rendered document.
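To make the conversion concrete, here is a small Python sketch of going from a PDF rectangle (points, origin at the bottom left) to a pixel rectangle (origin at the top left). The function name rect_to_pixels and its parameters are made up for illustration. It assumes 72 points per inch and a MediaBox that starts at (0, 0), which is common but not guaranteed.

```python
POINTS_PER_INCH = 72.0


def rect_to_pixels(x, y, w, h, page_height_pt, dpi):
    """Convert a bottom-left-origin rectangle in points to top-left-origin pixels.

    x, y are the lower-left corner of the field, w and h its size, all in points.
    page_height_pt is the page height in points (from the MediaBox).
    Returns (left, top, width, height) in pixels at the given DPI.
    """
    scale = dpi / POINTS_PER_INCH
    left = x * scale
    # The top edge of the field is at y + h; measure it down from the page top.
    top = (page_height_pt - (y + h)) * scale
    return (round(left), round(top), round(w * scale), round(h * scale))
```

For example, on a US Letter page (792 points high) rendered at 144 DPI, a field at x=72, y=72 that is 144 points wide and 36 points high ends up at (144, 1368) with a size of 288 by 72 pixels.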

Error in batch merging images (x.tif is not a valid choice for "C2 (green):")

I want to merge two sets of fluorescence microscope images into a green & blue image, but I'm having trouble with the macro (haven't used ImageJ before). I have a folder of FITC-images to be coloured green and a folder of DAPI-images to be coloured blue. I have been using this modified version of a macro I found online:
macro "batch_merge_channel"{
count = 1;
setBatchMode(true);
file1= getDirectory("Choose a Directory");
list1= getFileList(file1);
n1=lengthOf(list1);
file2= getDirectory("Choose a Directory");
list2= getFileList(file2);
n2=lengthOf(list2);
open(file1+list1[1]);
open(file2+list2[1]);
small = n1;
if(small<n2)
small = n2;
for(i=0;i<small;i++)
{
run("Merge Channels...", "c2="+list1[1]+ " c3="+list2[1]+ " keep");
name = substring(list1, 0, 13)+")_merge";
saveAs("tiff", "C:\\Merge\\"+name);
first += 2;
close();
setBatchMode(false);
}
This, however, returns an error:
x.tif is not a valid choice for "C2 (green):"
with x being the name of the first file in the first folder.
If I merge the images manually, two by two, there is no error. So I'm presuming the problem is in the macro code.
I found several cases of this error online, but none of the solutions that seemed to work for those people work for me.
Any help would be appreciated!
In case you didn't solve this already, a great place to get help on ImageJ questions is the forum.
I can suggest a couple of ideas:
Is your image successfully opened by the macro? You could set the batch mode to false to check this.
It looks to me like the for loop never uses the variable i. It works on the first pair of images (list1[1], list2[1]), closes the merged image, and then tries to process the same pair again. To actually loop through all the images in the folder, put something like this inside the loop (you don't need 'keep'; it is better to leave it out so the source images are closed automatically):
open(file1+list1[i]);
open(file2+list2[i]);
run("Merge Channels...", "c2="+list1[i]+ " c3="+list2[i]);
Turning off batch mode should be done after the loop, not within the loop.
Here's a version that works for me.
// @File(label = "Green images", style = "directory") file1
// @File(label = "Blue images", style = "directory") file2
// @File(label = "Output directory", style = "directory") output
// Do not delete or move the top 3 lines! They contain essential parameters
setBatchMode(true);
list1= getFileList(file1);
n1=lengthOf(list1);
print("n1 = ",n1);
list2= getFileList(file2);
n2=lengthOf(list2);
small = n1;
// Use the shorter list so that both folders have an image at index i.
if(small>n2)
    small = n2;
for(i=0;i<small;i++)
{
    image1=list1[i];
    image2=list2[i];
    open(file1+File.separator+list1[i]);
    open(file2+File.separator+list2[i]);
    print("processing image",i);
    run("Merge Channels...", "c2=&image1 c3=&image2");
    name = substring(image1, 0, 13)+"_merge";
    saveAs("tiff", output+File.separator+name);
    close();
}
setBatchMode(false);
Hope this helps.
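The pairing logic is the part that went wrong in the original macro, so here it is on its own as a short Python sketch (not ImageJ macro code). It assumes the two folders list their files in the same order once sorted, so that position i in one list belongs with position i in the other. The function name pair_channel_files and the "_merge.tif" naming are only for illustration.

```python
def pair_channel_files(green_files, blue_files):
    """Pair green and blue image names by position.

    Stops at the end of the shorter list, so an unmatched image in one
    folder is skipped instead of causing an index error.
    Returns a list of (green_name, blue_name, output_name) tuples.
    """
    pairs = []
    for green, blue in zip(sorted(green_files), sorted(blue_files)):
        stem = green.rsplit(".", 1)[0]  # file name without its extension
        pairs.append((green, blue, stem + "_merge.tif"))
    return pairs
```

Each tuple gives the two names that belong in the c2= and c3= arguments for one iteration, plus a name for the merged result.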

MiniMagick (+Rails): How to display number of scenes in an image

I have a Rails app that uploads images for image processing, and I want to be able to 1) see how many pages/frames/scenes there are in an image, and 2) split multi-page images into single-page JPEGs.
I'm having no trouble converting image types for single-scene images, but I can't quite make sense of the ImageMagick documentation to understand exactly what I need to do. The doc page I'm using is here:
http://www.imagemagick.org/www/escape.html
For the most part, I would like the code to be as simple as
def multiPage?( image )
img = MiniMagick::Image.open(image.path)
numPages = img.format("%n") #This returns Nil
count > 1 ? true : false
end
Does anyone have a better idea of what to do than I do? Thanks in advance!
Ok, well this is a bit of a hack, but when I did:
numPages = img[:n]
I would get numPages resulting in a string of the letter 'n' as many times as there are pages in an image, so:
#img -> 4-page image
numPages = img[:n] # => 'nnnn'
Probably not the best answer, but at least it works.
UPDATE:
Found a better way
numPages = Integer(img["%n"])

ImageJ: get the current image from an image stack

I need to get the name of the currently displayed image after importing an image sequence into ImageJ,
because I need to save the overlay information to a text file bearing the name of that image.
int number = imp.getImageStackSize();
if(number > 1)
{
    ImagePlus check = imp.duplicate();
    gd.removeAll();
    gd.addMessage(check.getTitle());
    gd.showDialog();
}
imp.getTitle() returns the name of the folder the images were loaded from.
I couldn't find any solution so far.
Any way to read the text in the status bar would be appreciated.
It was a simple issue.
The name of the current image can be retrieved with the following piece of code:
fileName = imp.getImageStack().getShortSliceLabel(imp.getCurrentSlice());
Thanks, everyone.

Automate Photoshop to insert text from file

I have a multilanguage website and need to automate the process of updating text layers in PSD files from a CSV source.
I know that there might be glitches in the PSD because of changed widths, but it would still help a lot to have the text inside the documents.
What are my options?
EDIT:
Murmelschlurmel has a working solution. Here is the link to the Adobe documentation.
http://livedocs.adobe.com/en_US/Photoshop/10.0/help.html?content=WSfd1234e1c4b69f30ea53e41001031ab64-740d.html
The format of the CSV file is not so nice: you need a column for each variable, whereas I would expect a row for each variable.
It works with umlauts (ä, ö, etc.).
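If your source keeps one row per variable, a small script can flip it into the one-column-per-variable layout. This is only a sketch: the function name rows_to_columns is my own, and it assumes each input row starts with the variable name followed by one value per language.

```python
import csv
import io


def rows_to_columns(csv_text):
    """Transpose CSV text so that each input row becomes an output column."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ""
    # Pad short rows so that every column has a value in every output row.
    width = max(len(row) for row in rows)
    padded = [row + [""] * (width - len(row)) for row in rows]
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(zip(*padded))
    return out.getvalue()
```

The first output row then holds the variable names, and every following row holds the values for one language.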
EDIT 1:
Another solution is to use COM to automate Photoshop. That is nice if you have a couple of templates (buttons) whose text needs changing. Here is my Python script, which might get you started.
You need an Excel file with the columns:
TemplateFileName, TargetFileName, TargetFormat, Text
(e.g. template.psd, button1, gif, NiceButton).
The first row of the sheet is not used.
The PSD template should have only one text layer and cannot have layer groups.
import win32com.client
import xlrd
spreadsheet = xlrd.open_workbook("text_buttons.xls")
sheet = spreadsheet.sheet_by_index(0)
psApp = win32com.client.Dispatch("Photoshop.Application")
jpgSaveOptions = win32com.client.Dispatch("Photoshop.JPEGSaveOptions")
jpgSaveOptions.EmbedColorProfile = True
jpgSaveOptions.FormatOptions = 1
jpgSaveOptions.Matte = 1
jpgSaveOptions.Quality = 1
gifSaveOptions = win32com.client.Dispatch("Photoshop.GIFSaveOptions")
for rowIndex in range(sheet.nrows):
    if(rowIndex > 0):
        template = sheet.row(rowIndex)[0].value
        targetFile = sheet.row(rowIndex)[1].value
        targetFileFormat = sheet.row(rowIndex)[2].value
        textTranslated = sheet.row(rowIndex)[3].value
        psApp.Open(r"D:\Design\Produktion\%s" % template )
        doc = psApp.Application.ActiveDocument
        for layer in doc.Layers:
            if (layer.Kind == 2):
                layer.TextItem.Contents = textTranslated
        if(targetFileFormat == "gif"):
            doc.SaveAs(r"D:\Design\Produktion\de\%s" % targetFile, gifSaveOptions, True, 2)
        if(targetFileFormat == "jpg"):
            doc.SaveAs(r"D:\Design\Produktion\de\%s" % targetFile, jpgSaveOptions, True, 2)
You can use "Data Driven Design" to do this. There is also a concept of data-driven design in computer science, but as far as I can see it is not related to how Photoshop uses the term.
Here is how to proceed:
Load your image in Photoshop and define your variables with Image > Variables > Define.
Then convert your CSV to a format Photoshop can read. I had the best experience with tab-delimited text.
Finally, load the text file in Photoshop with Image > Variables > Data Sets and let Photoshop save all iterations.
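The CSV conversion step can be scripted. Below is a minimal Python sketch that rewrites comma-separated text as tab-delimited text. The function name csv_to_tab_delimited is my own, and I have not checked which text encoding Photoshop expects for the data set file, so treat that part as an open question.

```python
import csv
import io


def csv_to_tab_delimited(csv_text):
    """Re-emit comma-separated text as tab-delimited text."""
    rows = csv.reader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerows(rows)
    return out.getvalue()
```

Going through the csv module rather than a plain string replace means that commas inside quoted values are kept as part of the value instead of being turned into tabs.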
When I tried this first, I found that the Photoshop help file didn't provide enough details. I searched the Internet for photoshop "data set" and found some good tutorials, e.g. this one from digitaltutors.
It might be a bit of overkill, but I have used Adobe AlterCast/Graphics Server to handle exactly the same issue.
Also, if it is just a text GIF/JPG image, you can use Python + PIL (Python Imaging Library).
Here is some sample code (works on Windows with the Arial and Osaka fonts installed):
#!/usr/bin/python
# -*- coding: utf-8 -*-
# With Pillow (the maintained PIL fork), the modules are imported from the PIL package.
from PIL import ImageFont, ImageDraw, Image
#font = ImageFont.truetype("/usr/share/fonts/bitstream-vera/Vera.ttf", 24)
#font = ImageFont.truetype("futuratm.ttf", 18)
font = ImageFont.truetype("arial.ttf", 18)
im = Image.new("RGB", (365,20), "#fff")
draw = ImageDraw.Draw(im)
draw.text((0, 0), "Test Images", font=font, fill="#000")
im.save("TestImg_EN.gif", "GIF")
font = ImageFont.truetype("osaka.ttf", 18)
im = Image.new("RGB", (365,20), "#fff")
draw = ImageDraw.Draw(im)
draw.text((0, 0), u"テストイメージ", font=font, fill="#000")
im.save("TestImg_JP.gif", "GIF")

Resources