What is the best way to parse a scanned PDF file using PHP or JS?

I have a translation website, and I would like to parse PDF files so that I can count the words and set the price for a translation.
I have tried Poppler JS before, but it can't handle scanned files. How should I handle them?
For example, this PDF is a scanned article. It is a PDF file, but each page is a picture, and I need to extract the text from it.

What you are looking for is an OCR (optical character recognition) library. There are many options; here are some Software Recommendations Stack Exchange links:
Scan Text Document To PDF With OCR
JavaScript library for OCR
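Once an OCR library has turned the scanned pages into plain text, the word count and price can be computed with ordinary string handling. The following is a minimal sketch in Python, not tied to any particular OCR library; the function names, the per-word rate, and the minimum fee are hypothetical and only illustrate the idea.

```python
import re

# A "word" here is a run of letters/digits, optionally joined by an
# apostrophe or hyphen (e.g. "doesn't", "well-known"). This is an
# assumption; adjust it to match how your pricing defines a word.
WORD_RE = re.compile(r"[^\W_]+(?:['’-][^\W_]+)*")


def count_words(text: str) -> int:
    """Count word-like tokens in text returned by an OCR step."""
    return len(WORD_RE.findall(text))


def quote_price(text: str, rate_per_word: float, minimum_fee: float = 0.0) -> float:
    """Return a price from the word count (hypothetical pricing rule)."""
    price = count_words(text) * rate_per_word
    return round(max(price, minimum_fee), 2)


if __name__ == "__main__":
    ocr_text = "The quick brown fox doesn't jump."  # stand-in for OCR output
    print(count_words(ocr_text), quote_price(ocr_text, rate_per_word=0.10))
```

OCR output is rarely perfect, so a count obtained this way is best treated as an estimate, for example by showing it to the customer as a preliminary quote.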

Related

Does PDF.js extract any page styling information?

I'm in a situation where I need to distinguish whether or not a PDF has a certain layout after scanning the document for text. Is this possible with PDF.js and if so where would I find this information?
Unfortunately, PDFs consist of very low-level drawing commands, and as such it is very difficult to extract any formatting information from them, no matter what tool/library. (See for example, here)

Google Docs ignores Microsoft Word shapes when converted

I am trying to convert a .docx file to Google Docs format. The .docx file contains images, text, tables, etc., but the conversion simply ignores all the shapes in it.
Are Microsoft Word shapes not supported by Google Docs, or what else could be the reason?
Not every Word feature is supported by the conversion. Rasterize the shapes to a common image format and try converting again.

using LEADTOOLS to convert doc to pdf

I am playing around with LEADTOOLS to see how it might benefit me, but I am a little frustrated with their documentation regarding how the process works. I am creating a library with methods that take an input file, convert it to PDF, add a QR code to the file, save it, and then read the QR code again.
Does a PDF have to be converted to an image before LEADTOOLS is able to read the QR code?
Does LEADTOOLS allow converting from DOC to PDF and then adding the QR code, or do I have to convert it to an image as well?
Is there anywhere I could look at code samples of how I can go about doing what I described, other than the LEADTOOLS site itself?
I am sorry to hear that you are having difficulties, but I will do my best to get you pointed in the right direction.
To answer your questions:
A1.) Yes, the PDF will need to be rasterized before the LEADTOOLS barcode engine can be used. Our barcode engine will only work with raw image data. Once the file is decompressed into raw data, we will not access the file any further.
A2.) Yes, you can rasterize Microsoft Word documents using either our file I/O methods or with the LEADTOOLS Virtual Printer. Once you have the raw image data, you can pass it to the barcode engine to write the QR code into the data. Once the barcode is written, you can then compress the image into any supported format, including (raster) PDF. You can also create a searchable PDF by running the resultant image through an OCR engine & outputting to PDF.
A3.) The LEADTOOLS SDK has a main barcode demo that should illustrate the ability of the SDK to handle the features you describe here. There are also tutorials in the help file, and various projects on our support forums. We have also created a couple different CodeProject articles here:
Multi-Platform Barcode with LEADTOOLS 18
How to Read Barcodes from Images using LEADTOOLS
You haven't mentioned which programming language you are developing with or what specific problems you have encountered. Without knowing either of those, it's difficult to be more specific about methods or other resources to check out. For a simple raster conversion of a Microsoft Word document to PDF and writing a barcode, I think this would probably take between 10 and 15 lines of code.
If you have not already, I would highly recommend sending an email to support@leadtools.com or opening a live chat with the LEADTOOLS Support team from LEADTOOLS.com. We can get into more specifics there and help you more directly with any issues you are encountering.
Walter Bates
LEADTOOLS Developer Support
I tried adding this as a comment, but it is apparently too long for that. So I have added it as another answer.
Even if you are building a DLL, I would suggest starting out building a simple demo with a view of the image so you can see what exactly is happening to the image. Once you are comfortable that the image is being modified the way you want, then implement that code in your own library.
Also, I would recommend testing out the toolkit with the provided main demos. The demos are there to illustrate the different options you have access to in the code. If you can accomplish what your application or library will need to do through the demos, then it would be worth your time to begin coding specifically what you need. You might even need to use multiple demos to verify the tools can accomplish the goals that you have. You have all the toolkit code for the demos, so you can take them apart and use the specific pieces that you need in your application.
If you are having trouble identifying which demos to try out or whether the toolkit has the specific functionality that you need, your best bet is to contact Tech Support directly to ask. We are here to help get you pointed in the right direction.
To get down to brass tacks, the source of the image data is not all that important from the perspective of the barcode engine. It needs a RasterImage handle (raw image data) to write the specified barcode. Whether the image data is created on the fly, read from file, or generated from a scanner, it does not make a whole lot of difference.
To find the main .NET barcode demo, I would start out by going to the LEADTOOLS shortcuts. To get there, go to the Start menu -> LEADTOOLS -> Help and Demos. The shortcuts are broken down by programming language, feature, and then the base toolkit. You should be able to find the WinForms .NET barcode demo here: ..\Shortcuts.NET Class Libraries.NET Framework\01 Imaging\07 Barcode
Our toolkit example is a .NET WinForms project, but it will work in ASP.NET also.
Here are some links to tutorials if you want to dig right into the code:
Loading and Displaying an Image in WinForms
Reading Barcodes
HOW TO: Load and Display an Image with WebImageViewer
There was also this recent code tip posted illustrating how to read and write UTF-8 characters in a QR barcode.
We provide both .NET 2.0 and .NET 4.0 DLLs for our barcode engine. Both of these work within Visual Studio 2012.

How can I parse, manipulate, and save Adobe Photoshop files?

How can I write a script or program to manipulate Adobe Photoshop files? I'd like to be able to do something like read an Adobe PSD file, rename the layers, and save it back in PSD format.
The files look to be saved with a combination of XML and serialized data. I looked at the file's contents and saw that it has <x:xmpmeta near the start. Some Google searching led me to the Wikipedia article about XMP (Extensible Metadata Platform), but I'm unclear whether that is the format for the entire file or just for the metadata portion.
I saw that there is a PSD parser class for PHP available, and not a bad article about how to use it, although it seems like it is just for reading / converting and not for writing / saving.
But I'd like to know:
What format are these files stored in?
Where are the guidelines for interfacing with that format?
Are there some classes / tools available for manipulating that file format? Any language would be fine for a start.
I'm happy to do more research on my own but I'm hoping for some guidance to know what I should be looking for.
I'm not familiar with it myself, but there is an official SDK for Photoshop available that should let you do all that and more with .psd files.
There are not many options. The general advice would be to look into buying Adobe InDesign Server. In some cases it can be cost prohibitive, and you might be interested in third-party SDKs. Unfortunately, there are only a few options on the market. One of them is the Graphics Mill image processing SDK (http://www.graphicsmill.com/photoshop-psd).
Disclaimer: I work for Aurigma which runs Graphics Mill project.
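On the question of what format these files are stored in: a PSD is a binary format, and the XMP block seen near the start is only the embedded metadata, not the format of the whole file. As a minimal sketch of what reading it looks like, the Python below parses the fixed 26-byte file header as laid out in Adobe's published Photoshop file format specification (signature "8BPS", version, channel count, height, width, bit depth, color mode, all big-endian). The class and function names are hypothetical, and the sketch stops at the header; reading or renaming layers requires walking the later sections of the file.

```python
import struct
from typing import NamedTuple


class PsdHeader(NamedTuple):
    version: int     # 1 = PSD, 2 = PSB (large document format)
    channels: int    # number of channels, including alpha channels
    height: int      # image height in pixels
    width: int       # image width in pixels
    depth: int       # bits per channel
    color_mode: int  # e.g. 3 = RGB, 4 = CMYK


def parse_psd_header(data: bytes) -> PsdHeader:
    """Parse the fixed-size header at the start of a PSD/PSB file."""
    if len(data) < 26:
        raise ValueError("need at least 26 bytes for a PSD header")
    # Big-endian: 4-byte signature, 2-byte version, 6 reserved bytes,
    # 2-byte channels, 4-byte height, 4-byte width, 2-byte depth, 2-byte mode.
    signature, version, _reserved, channels, height, width, depth, mode = (
        struct.unpack(">4sH6sHIIHH", data[:26])
    )
    if signature != b"8BPS":
        raise ValueError("not a Photoshop file (bad signature)")
    if version not in (1, 2):
        raise ValueError(f"unsupported version: {version}")
    return PsdHeader(version, channels, height, width, depth, mode)
```

To try it on a real file, read the first 26 bytes in binary mode, for example `parse_psd_header(open("example.psd", "rb").read(26))`, where `example.psd` is a placeholder file name.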

jpg file parsing to extract info/text

I have an idea for a project that I wanted some advice/pointers on.
I am planning to write an application to automatically parse expense receipts in JPG format and automatically extract the amount and also categorize using some learning algorithm. Is this at all doable? What libraries are available to parse jpg files to extract textual information and currency information from it?
Any pointers are appreciated. I have a vanilla HP all-in-one scanner that I will use to scan all the receipts.
Thanks
RS
You will need an OCR (optical character recognition) plugin, which will recognize and retrieve text from images. It has been a while since I last used OCR software, so I am not sure what the best SDKs or plugins are at the moment.
I did find an article on The Code Project that uses an OCR product from LEADTOOLS.
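After OCR has produced text from the receipt image, pulling out the amount is a text-processing step. The Python below is a rough sketch of one possible heuristic, not a method from the article above: it finds amounts with two decimal places, prefers the last amount on a line containing the word "total", and otherwise falls back to the largest amount. The function names and the heuristic are assumptions, and real receipts (other decimal separators, OCR misreads) will need more care. Categorization with a learning algorithm is left out.

```python
import re
from decimal import Decimal
from typing import List, Optional

# Matches amounts such as 4.50, $8.37 or 1,234.56 (two decimal places).
AMOUNT_RE = re.compile(r"(?:[$€£]\s*)?(\d{1,3}(?:,\d{3})+|\d+)\.(\d{2})\b")
# Matches "total" as a whole word, so "Subtotal" is skipped.
TOTAL_RE = re.compile(r"\btotal\b", re.IGNORECASE)


def extract_amounts(text: str) -> List[Decimal]:
    """Return every currency-like amount found in the OCR text, in order."""
    return [
        Decimal(m.group(1).replace(",", "") + "." + m.group(2))
        for m in AMOUNT_RE.finditer(text)
    ]


def guess_total(text: str) -> Optional[Decimal]:
    """Guess the receipt total (heuristic; see the assumptions above)."""
    for line in text.splitlines():
        if TOTAL_RE.search(line):
            amounts = extract_amounts(line)
            if amounts:
                return amounts[-1]
    all_amounts = extract_amounts(text)
    return max(all_amounts) if all_amounts else None


if __name__ == "__main__":
    receipt = "COFFEE SHOP\nLatte 4.50\nBagel 3.25\nSubtotal 7.75\nTax 0.62\nTOTAL $8.37\n"
    print(guess_total(receipt))
```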
