NLP Problem:
I have a PDF file which contains some important information that needs to be extracted. Some of it is in key-value pairs. For example, the PDF file contains the following information:
Name : Mr. John Wick
Toy purchased : Gun
Price : £2,000
Date: XYZ
However, not all the documents will have the same keys; for example, in some docs it could be:
Price of Item : £4,000
Current Date or Purchase Date: ABC
Purchased Toy Etc.
What is the best way to extract this data?
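One simple starting point, before reaching for ML, is pattern matching: split each line at the colon and map the varying key spellings to canonical field names. The alias lists below are illustrative assumptions based on the examples above, not an exhaustive schema:

```python
import re

# Map several possible key spellings to one canonical field name.
# These alias lists are assumptions drawn from the examples above.
KEY_ALIASES = {
    "name": ["name"],
    "toy": ["toy purchased", "purchased toy"],
    "price": ["price", "price of item"],
    "date": ["date", "current date", "purchase date"],
}

def extract_fields(text):
    """Extract 'Key : Value' pairs and map keys to canonical names."""
    fields = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?)\s*:\s*(.+)", line)
        if not match:
            continue
        key, value = match.group(1).lower(), match.group(2).strip()
        for canonical, aliases in KEY_ALIASES.items():
            if any(alias in key for alias in aliases):
                fields[canonical] = value
    return fields

sample = "Name : Mr. John Wick\nToy purchased : Gun\nPrice : £2,000"
print(extract_fields(sample))
# {'name': 'Mr. John Wick', 'toy': 'Gun', 'price': '£2,000'}
```

This only works while the key vocabulary stays small and enumerable; once keys vary freely across documents, a learned approach becomes more attractive.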
I have a question about machine learning and Named Entity Recognition.
My goal is to extract named entities from an invoice document. Invoices are typically structured text, and this kind of data is usually not a good fit for natural language processing (NLP). I have already tried to train a model with the NLP library spaCy to detect invoice metadata like date, total, and customer name. This works more or less well. As far as I understand, an invoice does not provide the unstructured plain text which NLP usually expects.
A typical example of an invoice text analyzed with NLP/ML, which I often found on the Internet, looks like this:
“Partial invoice (€100,000, so roughly 40%) for the consignment C27655 we shipped on 15th August to London from the Make Believe Town depot. INV2345 is for the balance. Customer contact (Sigourney) says they will pay this on the usual credit terms (30 days).”
NLP loves this kind of text. But text extracted from an invoice PDF (using Apache Tika) usually looks more like this:
Client no: Invoice no: Invoice date: Due date:
1000011128 DEAXXXD220012269 26-Jul-2022 02-Aug-2022
Invoice to: Booking Reference
LOGISTCS GMBH Client Reference :
DEMOSTRASSE 2-6 Comments:
28195 BREMEN
Germany
Vessel : Voy : Place of Receipt : POL: B/LNo:
XXX JUBILEE NUBBBW SAV33NAH, GA ME000243
ETA: Final Destination : POD:
15-Jul-2022 ANTWERP, BELGIUM
Charge Quantity(days) x Rate Currency Total ROE Total EUR VAT
STORAGE_IMP_FOREIGN 1 day(s) x 30,00 EUR EUR 30,00 1,000000 30,00 0,00
So I guess NLP is in general the wrong approach for training the recognition of metadata from an invoice document. I think the problem is more like recognizing cats in a picture.
What could be a more promising approach to Named Entity Recognition on structured text with a machine learning framework?
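One lightweight alternative that treats the problem as layout rather than language: in Tika's output, a row of labels is often followed by a row of values in the same order, so the two rows can simply be zipped together. This is a heuristic sketch, not NER, and it assumes exactly one value per label in matching order (true for the "Client no / Invoice no / Invoice date / Due date" block above, but not for every invoice):

```python
def pair_header_values(header_line, value_line):
    """Pair 'Label:' headers with the values on the following line.
    Assumes one whitespace-delimited value per label, in the same
    order as the labels appear."""
    labels = [part.strip() for part in header_line.split(":") if part.strip()]
    values = value_line.split()
    return dict(zip(labels, values))

header = "Client no: Invoice no: Invoice date: Due date:"
values = "1000011128 DEAXXXD220012269 26-Jul-2022 02-Aug-2022"
print(pair_header_values(header, values))
# {'Client no': '1000011128', 'Invoice no': 'DEAXXXD220012269',
#  'Invoice date': '26-Jul-2022', 'Due date': '02-Aug-2022'}
```

When labels and values break this one-to-one alignment (multi-word values, empty cells), layout-aware models that consume token coordinates instead of plain text are the usual next step.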
I have a list of strings (noun phrases) and I want to filter out all valid geographical locations from them. Most of these unwanted location names are country, city, or state names. What would be a way to do this? Is there any open-source lookup table available which contains all the countries, states, and cities of the world?
Example desired output:
TREC4: false, Vienna: true, Ministry: false, IBM: false, Montreal: true, Singapore: true
Unlike this post: Verify user input location string is a valid geographic location?
I have a large number of strings like these (~0.7 million), so the Google Geolocation API is probably not an option for me.
You can use GeoPlanet data from Yahoo, or GeoNames data from geonames.org.
Here is a link to the GeoPlanet TSV file containing 5 million geographical places of the world:
https://developer.yahoo.com/geo/geoplanet/data/
Moreover, the GeoPlanet data will give you the type (city, country, suburb, etc.) of each geographical place, along with a unique ID.
https://developer.yahoo.com/geo/geoplanet/guide/concepts.html
You can do a lowercase, sanitized (e.g. with special characters and other anomalies removed) match of your needle string against the names present in this data.
If you do not want full file scans, it will be beneficial to first process this data and store it in a fast lookup database like MongoDB or Redis.
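The sanitize-and-match idea can be sketched as an in-memory set lookup; the three place names below stand in for the real GeoPlanet/GeoNames dump, which you would load once during preprocessing:

```python
import re

def sanitize(name):
    """Lowercase and drop everything except letters, digits and spaces."""
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()

def build_gazetteer(place_names):
    """One-off preprocessing: a sanitized set gives O(1) membership tests."""
    return {sanitize(n) for n in place_names}

# Illustrative rows; a real run would read the full GeoPlanet/GeoNames file.
gazetteer = build_gazetteer(["Vienna", "Montreal", "Singapore"])

candidates = ["TREC4", "Vienna", "Ministry", "IBM", "Montreal", "Singapore"]
result = {c: sanitize(c) in gazetteer for c in candidates}
print(result)
# {'TREC4': False, 'Vienna': True, 'Ministry': False,
#  'IBM': False, 'Montreal': True, 'Singapore': True}
```

At ~5 million names the set still fits comfortably in memory; Redis or MongoDB becomes useful when the lookup has to be shared across processes or machines.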
I can suggest the following three options:
a) Using the Alchemy API: http://www.alchemyapi.com/
If you try their demo, places like France and Honolulu are given the entity type Country or City.
b) Using TAGME: http://tagme.di.unipi.it/
TAGME connects every entity in a given text to the corresponding Wikipedia page. Crawl the Wikipedia page, check the infobox, and filter accordingly.
c) Using Wikipedia Miner: I was unable to find relevant links for this. However, it also works like TAGME.
I suggest you try all three and do majority voting for each instance.
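The majority-voting step is a one-liner with `collections.Counter`; the per-service verdicts below are hypothetical placeholders for whatever labels the three services return:

```python
from collections import Counter

def majority_vote(labels):
    """Pick the label most services agreed on; ties resolve to the
    label seen first among the most common (Counter's ordering)."""
    return Counter(labels).most_common(1)[0][0]

# Hypothetical verdicts from three services for the string "Vienna":
votes = ["City", "City", "Organization"]
print(majority_vote(votes))  # City
```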
I'm writing a program that stores book data in a struct. The data includes the book's title, author's name, price, and ISBN (which includes the dashes that group the 13-digit code). Title and name are clearly strings, and price is a float, but I'm stuck on the ISBN because it contains dashes. What type should I use to store the numbers and dashes?
I'm still a beginner in programming. :(
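The usual answer: store the ISBN as a string too. An ISBN is an identifier, not a quantity you do arithmetic on, and a string preserves the dashes (and a leading zero, which a numeric type would drop). A minimal sketch in Python terms (the question reads like C or C++, where the equivalent would be a `char` array or `std::string` field in the struct):

```python
from dataclasses import dataclass

@dataclass
class Book:
    title: str
    author: str
    price: float
    isbn: str  # keep the dashes; an ISBN is an identifier, not a number

# Illustrative values, not from the original question.
book = Book("Example Title", "Jane Doe", 9.99, "978-3-16-148410-0")
print(book.isbn)                    # 978-3-16-148410-0
print(book.isbn.replace("-", ""))   # digits only, when needed: 9783161484100
```

Stripping the dashes on demand, as in the last line, is easier than trying to reconstruct them from a numeric value.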
I'm using Magmi to import products with different names (depending on the language selected in the store).
Looking at the CSV file exported by Magento, I found that all the product data is stored in the first row, and the following row contains only the fields you wish to add to the store view for the desired language (French, in this case ---> fr).
If I empty the database and import Magento's own CSV, the products are created successfully, along with the different names and descriptions for each store view.
The problem is that Magmi tells me it needs the SKU on the 3rd line, because naturally it can't find one there. If I use the same SKU on the second and third lines, the last line introduced always prevails, overwriting the previous one.
Any idea how I could load the names and descriptions in other languages with Magmi without overwriting the previous data? .... I'm beginning to get a bit desperate!
CSV example rows:
sku;_store;_attribute_set;_type;_category......... name .......short_description ..etc
09110-296555;fr;Default;simple;books;products...name_english ..description_english
;fr;;;;;;;;;;;;;;;;;;;;;;;;;;;;;name_french......... description_french
I am using Pentaho Kettle and thinking about a way to normalize my flat-file (CSV) data and eventually store it in a database.
csv structure: item name, store1 sold quantity, store2 sold quantity, store...
expected result: item name, store name, sold quantity
Any guidance is appreciated.
You can do this with the Row Normalizer step as long as the number of stores is fixed, or at least has a maximum. If it's variable, you'll have to use a JavaScript step or a UDJC (User Defined Java Class). See the docs for how to use these steps:
PDI Transform Steps
If it's variable, I'd consider preprocessing the file before loading. I've done this with Python and it works pretty well.
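The preprocessing amounts to unpivoting the store columns into rows. A minimal sketch with the standard library, assuming the header names the stores and the cells are integer quantities:

```python
import csv
import io

def normalize(csv_text):
    """Unpivot 'item name, store1, store2, ...' rows into
    (item, store, quantity) triples, one per store column."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    rows = []
    for record in reader:
        item = record[0]
        for store, qty in zip(header[1:], record[1:]):
            rows.append((item, store, int(qty)))
    return rows

# Illustrative input matching the structure described in the question.
sample = "item name,store1,store2\nwidget,5,3\ngadget,2,7\n"
for row in normalize(sample):
    print(row)
# ('widget', 'store1', 5)
# ('widget', 'store2', 3)
# ('gadget', 'store1', 2)
# ('gadget', 'store2', 7)
```

Because the store names come from the header at runtime, this handles a variable number of store columns, which is exactly the case where Row Normalizer's fixed mapping falls short.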