Natural Language Processing for data extraction from PDF - machine-learning

I have many different formats of scanned PDFs with many different fields. Think of it as an invoice that has been scanned. I need to extract the information from the scanned PDF and output the fields and the text contained in each of them.
I have an OCR tool that does a good job of extracting all the text in raw form. I somehow have to use NLP to extract the fields and their values from that raw text. As there are many invoice formats, relying on fixed field positions (a template-based approach) is not an option in this case. How could NLP help me solve this problem?

Most NLP tools are designed to extract data from statements. If you don't have punctuation, it might not work out well. If you are using an NLU service, like https://mynlu.com, you will also need to provide examples of common phrases and the locations of the relevant data contained therein (entities). If you can split the text into statements, something like myNLU or another NLU service (LUIS, Watson, etc.) can get you out the door in < 10 minutes.
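Before wiring up an NLU service, a rough baseline is to split the raw OCR output into statement-like pieces and pull out the obvious fields with patterns. The field names and regexes below are hypothetical examples for an invoice-like layout, not tied to any particular service:

```python
import re

# Hypothetical field patterns for an invoice-like document; adjust to your layouts.
FIELD_PATTERNS = {
    "invoice_number": re.compile(r"invoice\s*(?:no\.?|number|#)\s*[:\-]?\s*(\S+)", re.I),
    "date": re.compile(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b"),
    "total": re.compile(r"total\s*[:\-]?\s*\$?\s*([\d,]+\.\d{2})", re.I),
}

def split_into_statements(raw_text):
    """Split raw OCR text into rough 'statements' on newlines and semicolons."""
    parts = re.split(r"[\n;]+", raw_text)
    return [p.strip() for p in parts if p.strip()]

def extract_fields(raw_text):
    """Return the first match for each field pattern across all statements."""
    fields = {}
    for statement in split_into_statements(raw_text):
        for name, pattern in FIELD_PATTERNS.items():
            if name not in fields:
                match = pattern.search(statement)
                if match:
                    fields[name] = match.group(1)
    return fields

if __name__ == "__main__":
    sample = "Invoice No: 12345\nDate: 03/15/2021\nTotal: $1,234.56"
    print(extract_fields(sample))
    # {'invoice_number': '12345', 'date': '03/15/2021', 'total': '1,234.56'}
```

Anything the patterns miss can then be handed to the NLU service as statements plus labelled entities.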

Related

What's a good ML model/technique for breaking down large documents/text/HTML into segments?

I'd like to break down HTML documents into small chunks of information. With sources like Wikipedia articles (as an example) this is reasonably easy to do, without machine learning, because the content is structured in a highly predictable way.
When working with something like a converted Word doc or a blog post, the HTML is a bit more unpredictable. For example, sometimes there are no DIVs, more than one H1 in a document, or no headers at all, etc.
I'm trying to figure out a decent/reliable way of automatically putting content breaks into my content, in order to break it down into chunks of an acceptable size.
I've had a little dig around for existing trained models for this application but I couldn't find anything off-the-shelf. I've considered training my own model, but I'm not confident of the best way to structure the training data. One option I've considered in relation to training data is providing a sample of where section breaks are numerically likely to exist within a document but I don't think that's the best possible approach...
How would you approach this problem?
P.s. I'm currently using Tensorflow but happy to go down a different path.
I've found the GROBID library quite robust for different input documents (since it's based on ML models trained on a large variety of documents). The standard model parses input PDF documents into structured XML/TEI encoded files, which are much easier to deal with. https://grobid.readthedocs.io/en/latest/Introduction/
If your inputs are HTML documents the library also offers the possibility to train your own models. Have a look at: https://grobid.readthedocs.io/en/latest/Training-the-models-of-Grobid/
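If you go the GROBID route, it runs as a REST service (on port 8070 by default) that you can call from any language. A minimal Python sketch using requests, where the local server URL and the file paths are assumptions:

```python
import requests

# Assumes a GROBID server is running locally (e.g. via its Docker image) on the default port.
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

def pdf_to_tei(pdf_path, tei_path):
    """Send a PDF to GROBID and save the structured TEI/XML it returns."""
    with open(pdf_path, "rb") as pdf_file:
        response = requests.post(GROBID_URL, files={"input": pdf_file})
    response.raise_for_status()
    with open(tei_path, "w", encoding="utf-8") as tei_file:
        tei_file.write(response.text)

pdf_to_tei("document.pdf", "document.tei.xml")
```

The resulting TEI already contains section and paragraph boundaries, which gives you natural places to put content breaks.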

Extracting information from web-pages using NER

My task is to extract information from various web pages of a particular site. The information to be extracted can be things like product name, product ID, price, etc., and it is given as natural-language text. I have also been asked to extract that information using some machine learning algorithm. I thought of using NER (Named Entity Recognition) and training it on custom training data (which I can prepare from the scraped data by manually labeling the entities/data as required). I wanted to know if the model can even work this way?
Also, let me know if I can improve this question further.
You say a particular site. I am assuming that means you have a fair idea of the structure of the web pages: whether the data is in table form or free text, and how the website generally looks. In this case, a simple regex (for prices, IDs, etc.) supported by a POS tagger to extract product names is enough for you. A supervised approach is definitely overkill and might underperform the simpler regex.
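As an illustration of that regex-plus-POS idea (the price and ID patterns, and the use of spaCy noun chunks as a cheap way to get product-name candidates, are assumptions rather than anything tied to your site):

```python
import re
import spacy  # requires: python -m spacy download en_core_web_sm

nlp = spacy.load("en_core_web_sm")

PRICE_RE = re.compile(r"[$€£]\s?\d[\d,]*(?:\.\d{2})?")
PRODUCT_ID_RE = re.compile(r"\b(?:SKU|ID)[:\s#]*([A-Z0-9\-]{4,})\b", re.I)

def extract_product_info(text):
    """Pull prices and IDs with regex, and candidate product names from noun chunks."""
    doc = nlp(text)
    return {
        "prices": PRICE_RE.findall(text),
        "product_ids": PRODUCT_ID_RE.findall(text),
        # Noun chunks as rough product-name candidates; filter further for your pages.
        "name_candidates": [chunk.text for chunk in doc.noun_chunks],
    }

print(extract_product_info("The Acme Turbo Blender (SKU: AB-12345) is on sale for $49.99."))
```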

How to process XML files using Rapidminer for classification

I am new to RapidMiner. I have many XML files and I want to classify these files manually based on keywords. Then I would like to train a classifier like Naive Bayes or SVM on these data and evaluate their performance using cross-validation.
Could you please let me know different steps for this?
Do I need to use text processing steps like tokenising, TF-IDF, etc.?
The steps would go something like this:
1. Loop over files, i.e. iterate over all files in a folder and read each one in turn.
2. For each file:
   - read it in as a document;
   - tokenize it using operators like Extract Information or Cut Document containing suitable XPath queries, to output a row corresponding to the extracted information in the document.
3. Create a document vector with all the rows. This is where TF-IDF or other approaches would be used. The choice depends on the problem at hand, with TF-IDF being a usual choice where it is important to give more weight to tokens that appear often in a relatively small number of the documents.
4. Build the model and use cross-validation to get an estimate of the performance on unseen data.
I have included a link to a process that you could use as the basis for this. It reads the RapidMiner repository, which contains XML files, so it is a good example of processing XML documents using text processing techniques. Obviously, you would have to make some large modifications for your case.
Hope it helps.
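If you ever want to prototype the same pipeline outside RapidMiner, the steps above map fairly directly onto scikit-learn in Python. A rough sketch, where the folder path, the way text is pulled out of the XML, and the keyword used for labelling are all assumptions about your data:

```python
import glob
import xml.etree.ElementTree as ET

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def xml_to_text(path):
    """Read one XML file and concatenate all of its text nodes into a single string."""
    root = ET.parse(path).getroot()
    return " ".join(t.strip() for t in root.itertext() if t.strip())

# Loop over all XML files in a folder (hypothetical path) and read each one in turn.
paths = sorted(glob.glob("xml_files/*.xml"))
documents = [xml_to_text(p) for p in paths]

# Stand-in for the manual keyword-based classes; replace "invoice" with your keyword
# or with labels you have assigned by hand.
labels = ["match" if "invoice" in doc.lower() else "other" for doc in documents]

# Tokenisation + TF-IDF document vectors, then Naive Bayes evaluated with cross-validation.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
scores = cross_val_score(model, documents, labels, cv=5)
print("Cross-validated accuracy:", scores.mean())
```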
Probably it is too late to reply, but it could help other people. There is an extension called the 'Text Mining' extension; I am using version 6.1.0. You can go to RapidMiner > Help > Updates and install this extension. It can read all the files from one directory and has various text mining algorithms that you may use.
I also found this tutorial video, which could be of some help to you as well:
https://www.youtube.com/watch?v=oXrUz5CWM4E

Information Extraction - business documents

I'm currently trying to extract information, e.g. sender or recipient, from business documents like bills. The documents were processed with OCR software into XML files, so they are annotated with formatting characteristics. I want to extract specific information from a new document after manually annotating one similar document with features like sender and recipient.
So my question is whether there is a learning or matching algorithm which is able to extract specific data by comparing with only one or two examples of similar documents. If yes: is there a Java framework capable of that?
Yours thankfully
maggu
If the XML structure is always the same (using the same template):
Just save the XML parent nodes of the selected nodes where the information is located so you know the path to the information. Shouldn't be a problem - trivial task.
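In other words, for the fixed-template case you record one XPath per field from the annotated example and reapply it to every new document. A minimal sketch with lxml in Python, where the element names are made-up placeholders for whatever your OCR software actually writes:

```python
from lxml import etree

# Hypothetical XPaths recorded once from a manually annotated example document;
# the node names depend entirely on the XML your OCR software produces.
FIELD_XPATHS = {
    "sender": "//header/sender/text()",
    "recipient": "//header/recipient/text()",
}

def extract_fields(xml_path):
    """Apply the saved XPaths to a new document that uses the same template."""
    tree = etree.parse(xml_path)
    return {name: tree.xpath(xpath) for name, xpath in FIELD_XPATHS.items()}

print(extract_fields("new_invoice.xml"))
```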
If you have to search for the information:
It could work by creating certain feature extraction rules and then using those features to train a Support Vector Machine to detect the areas where the information is located.
I once asked a similar question Algorithm to match natural text in mail.
But that is far from trivial, and definitely needs more than one or two training documents.

Disease named entity recognition

I have a bunch of text documents that describe diseases. Those documents are in most cases quite short and often only contain a single sentence. An example is given here:
Primary pulmonary hypertension is a progressive disease in which widespread occlusion of the smallest pulmonary arteries leads to increased pulmonary vascular resistance, and subsequently right ventricular failure.
What I need is a tool that finds all disease terms (e.g. "pulmonary hypertension" in this case) in the sentences and maps them to a controlled vocabulary like MeSH.
Thanks in advance for your answers!
Here are two pipelines that are specifically designed for medical document parsing:
Apache cTAKES
NLM's MetaMap
Both use UMLS, the Unified Medical Language System, and thus require that you have a (free) license. Both are Java-based and more or less easy to set up.
See http://www.ebi.ac.uk/webservices/whatizit/info.jsf
Whatizit is a text processing system that allows you to do text-mining tasks on text. The tasks come defined by the pipelines in the drop-down list of the above window, and the text can be pasted in the text area.
You could also ask biostars: http://www.biostars.org/show/questions/
There are many tools to do that. Some popular ones:
NLTK (python)
LingPipe (java)
Stanford NER (java)
OpenCalais (web service)
Illinois NER (java)
Most of them come with some predefined models, i.e. they've already been trained on some general datasets (news articles, etc.). However, your texts are pretty specific, so you might want to first constitute a corpus and re-train one of those tools in order to adjust it to your data.
More simply, as a first test, you can try a dictionary-based approach: design a list of entity names and perform some exact or approximate matching. For instance, this operation is described in LingPipe's tutorial.
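A minimal version of that dictionary-based first test in Python; the disease list and the identifiers are placeholder examples you would replace with a real lexicon and real MeSH IDs:

```python
import re

# Placeholder lexicon: surface forms mapped to made-up controlled-vocabulary IDs.
DISEASE_LEXICON = {
    "pulmonary hypertension": "MESH:EXAMPLE_0001",
    "right ventricular failure": "MESH:EXAMPLE_0002",
}

def find_disease_terms(text):
    """Exact (case-insensitive) dictionary matching; longer terms are tried first."""
    hits = []
    for term in sorted(DISEASE_LEXICON, key=len, reverse=True):
        for match in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            hits.append((match.group(0), match.start(), DISEASE_LEXICON[term]))
    return hits

sentence = ("Primary pulmonary hypertension is a progressive disease in which widespread "
            "occlusion of the smallest pulmonary arteries leads to increased pulmonary "
            "vascular resistance, and subsequently right ventricular failure.")
print(find_disease_terms(sentence))
```

Approximate matching (e.g. allowing small spelling variations) can be layered on later once the exact-match baseline is in place.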
Open Targets has a module for this as part of LINK. It's not meant to be used directly, so it might require some hacking and tinkering, but it's the most complete medical NER (named entity recognition) tool I've found for Python. For more info, read their blog post.
A bash script that comes with an example lexicon generated from the Disease Ontology:
https://github.com/lasigeBioTM/MER
