I'm currently trying to extract information, e.g. sender or recipient, from business documents like bills. The documents were processed with OCR software into XML files, so they are annotated with formatting characteristics. I want to extract specific information from a new document after manually annotating one similar document with features like sender and recipient.
So my question is: is there a learning or matching algorithm that can extract specific data by comparing with only one or two examples of similar documents? If yes, is there a Java framework capable of that?
Yours thankfully
maggu
If the XML structure is always the same (using the same template):
Just save the chain of XML parent nodes leading to each selected node, so you know the path to the information and can follow the same path in a new document. That shouldn't be a problem; it is a trivial task.
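A minimal sketch of the saved-path idea, assuming the OCR output looks roughly like the made-up XML below (the element names `page`, `block` and `line` and all helper names are hypothetical, not taken from any real OCR tool). It records the path to each manually labeled node as a list of child indices and follows the same indices in a new document:

```python
import xml.etree.ElementTree as ET

# Hypothetical OCR output; real files will have other tags and attributes.
ANNOTATED = """<page>
  <block><line>ACME Corp.</line></block>
  <block><line>John Doe</line><line>Invoice 17</line></block>
</page>"""

NEW_DOC = """<page>
  <block><line>Globex Ltd.</line></block>
  <block><line>Jane Roe</line><line>Invoice 42</line></block>
</page>"""


def path_to(root, target):
    """Return the child indices leading from root to target, or None."""
    if root is target:
        return []
    for i, child in enumerate(root):
        sub = path_to(child, target)
        if sub is not None:
            return [i] + sub
    return None


def follow(root, path):
    """Walk down from root along a list of child indices."""
    node = root
    for i in path:
        node = node[i]
    return node


def learn_paths(xml_text, labels):
    """labels maps a field name to the exact text the user marked."""
    root = ET.fromstring(xml_text)
    paths = {}
    for field, text in labels.items():
        for node in root.iter():
            if (node.text or "").strip() == text:
                paths[field] = path_to(root, node)
                break
    return paths


def extract(xml_text, paths):
    """Read the text found at each saved path in another document."""
    root = ET.fromstring(xml_text)
    return {f: (follow(root, p).text or "").strip() for f, p in paths.items()}


paths = learn_paths(ANNOTATED, {"sender": "ACME Corp.", "recipient": "John Doe"})
result = extract(NEW_DOC, paths)
print(result)
```

This only works while the template really is identical; an extra block in the new document shifts the indices, and `follow` then returns the wrong node or raises an `IndexError`.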
If you have to search for the information:
This could work by defining feature extraction rules and then using those features to train a Support Vector Machine that detects the areas where the information is located.
I once asked a similar question: "Algorithm to match natural text in mail".
But that is far from trivial, and definitely needs more than one or two training documents.
Related
My task is to extract information from various web pages of a particular site. The information to be extracted can be a product name, product id, price, etc. It is given in the text in natural language. I have also been asked to extract that information using some Machine Learning algorithm. I thought of using NER (Named Entity Recognition) and training it on custom training data (which I can prepare from the scraped data by manually labeling the integers/data as required). I wanted to know whether the model can even work this way.
Also, let me know if I can improve this question further.
You say a particular site. I am assuming that means you have a fair idea of how the web pages are structured: whether the data is in table form or free text, and how the website generally looks. In this case, simple regexes (for prices, ids, etc.) supported by a POS tagger to extract product names are enough for you. A supervised approach is definitely overkill and might even perform worse than the simpler regex.
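A rough sketch of the regex part, assuming prices carry a currency marker and ids follow a label such as "SKU" or "ID". Both patterns, the function name and the sample sentence are made up for illustration and would need adjusting to the actual site; the POS tagging step for product names is not shown:

```python
import re

# Assumed formats: "$49.99", "EUR 12,50", "SKU: UW-3000", "ID#A1B2C3".
PRICE_RE = re.compile(r"(?:USD|EUR|\$|€)\s?(\d+(?:[.,]\d{2})?)")
ID_RE = re.compile(r"\b(?:ID|SKU)[:#]?\s*([A-Z0-9-]{4,})\b")


def extract_fields(text):
    """Return the first price and product id found in text, or None."""
    price = PRICE_RE.search(text)
    pid = ID_RE.search(text)
    return {
        "price": price.group(1) if price else None,
        "product_id": pid.group(1) if pid else None,
    }


sample = "The UltraWidget 3000 (SKU: UW-3000) is now available for $49.99."
fields = extract_fields(sample)
print(fields)
```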
I have many different formats of scanned pdfs with many different fields. Think of it as an invoice that has been scanned. I need to extract the information from the scanned pdf and output the fields and the texts that are in each of the fields.
I have an OCR tool that does a good job of extracting all the text in raw format. Somehow, using NLP, I have to extract the fields and their values from the raw text. As there are many formats of the invoice, using OCR is not an option in this case. How could NLP help me in solving this problem?
Most NLP tools are designed to extract data from statements. If you don't have punctuation, it might not work out well. If you are using an NLU service like https://mynlu.com, you will also need to provide examples of common phrases and the locations of the relevant data contained therein (entities). If you can split the text into statements, something like myNLU or another NLU service (LUIS, Watson, etc.) can get you out the door in under 10 minutes.
I have to create a system that generates all possible question-answer pairs from unstructured text in a specific domain. Many questions may have the same answer, but the system should generate all possible types of questions that an answer can have. The questions formed should be meaningful and grammatically correct.
For this purpose, I used nltk and trained an NER model, creating entities according to my domain, and then I created some rules to identify the question word using the combination of NER-identified entities and POS-tagged words. But this approach isn't working well, as I am not able to create meaningful questions from the text. Moreover, some question words are wrongly identified and some are missed. I also read research papers on using RNNs for this purpose, but I don't have much training data since the domain is pretty small. Can anyone suggest a better approach?
I am new to RapidMiner. I have many XML files and I want to classify these files manually based on keywords. Then I would like to train classifiers like Naive Bayes and SVM on these data and calculate their performance using cross-validation.
Could you please let me know different steps for this?
Do I need to use text processing activities like tokenising, TF-IDF, etc.?
The steps would go something like this:
Loop over files - i.e. iterate over all files in a folder and read each one in turn.
For each file
read it in as a document.
tokenize it using operators like Extract Information or Cut Document containing suitable XPath queries to output a row corresponding to the extracted information in the document.
Create a document vector with all the rows. This is where TF-IDF or other approaches would be used. The choice depends on the problem at hand with TF-IDF being a usual choice where it is important to give more weight to tokens that appear often in a relatively small number of the documents.
Build the model and use cross validation to get an estimate of the performance on unseen data.
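To make the TF-IDF weighting in the steps above concrete, here is a small plain-Python sketch. It is not what RapidMiner runs internally; it uses one common textbook variant (term frequency times log of inverse document frequency), and the function names and sample documents are made up:

```python
import math
from collections import Counter


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def tf_idf(docs):
    """Return one {token: weight} dict per document.

    tf  = occurrences of the token in the document / tokens in the document
    idf = log(number of documents / documents containing the token)
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each token once per document
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            t: (c / total) * math.log(n_docs / doc_freq[t])
            for t, c in counts.items()
        })
    return vectors


docs = [
    "invoice total amount due",
    "invoice payment reminder",
    "meeting agenda for monday",
]
vectors = tf_idf(docs)
print(vectors[0])
```

In the first document, "total" gets a higher weight than "invoice" because "invoice" also appears in the second document; a token that appears in every document gets weight zero.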
I have included a link to a process that you could use as the basis for this. It reads the RapidMiner repository which contains XML files so is a good example of processing XML documents using text processing techniques. Obviously, you would have to make some large modifications for your case.
Hope it helps.
It is probably too late to reply, but this could help other people. There is an extension called 'text mining extension'; I am using version 6.1.0. You can go to RapidMiner > Help > Update and install this extension. It can read all the files from one directory, and it has various text mining algorithms that you may use.
Also, I found this tutorial video which could be of some help to you as well
https://www.youtube.com/watch?v=oXrUz5CWM4E
I have a bunch of text documents that describe diseases. Those documents are in most cases quite short and often only contain a single sentence. An example is given here:
Primary pulmonary hypertension is a progressive disease in which widespread occlusion of the smallest pulmonary arteries leads to increased pulmonary vascular resistance, and subsequently right ventricular failure.
What I need is a tool that finds all disease terms (e.g. "pulmonary hypertension" in this case) in the sentences and maps them to a controlled vocabulary like MeSH.
Thanks in advance for your answers!
Here are two pipelines that are specifically designed for medical document parsing:
Apache cTAKES
NLM's MetaMap
Both use UMLS, the Unified Medical Language System, and thus require that you have a (free) license. Both are written in Java and are more or less easy to set up.
See http://www.ebi.ac.uk/webservices/whatizit/info.jsf
Whatizit is a text processing system that allows you to do textmining
tasks on text. The tasks come defined by the pipelines in the drop
down list of the above window and the text can be pasted in the text
area.
You could also ask biostars: http://www.biostars.org/show/questions/
There are many tools to do that. Some popular ones:
NLTK (Python)
LingPipe (Java)
Stanford NER (Java)
OpenCalais (web service)
Illinois NER (Java)
Most of them come with some predefined models, i.e. they've already been trained on some general datasets (news articles, etc.). However, your texts are pretty specific, so you might want to first build a corpus and re-train one of those tools in order to adjust it to your data.
More simply, as a first test, you can try a dictionary-based approach: build a list of entity names and perform some exact or approximate matching. For instance, this operation is described in LingPipe's tutorial.
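As a rough illustration of the dictionary-based approach (not LingPipe's implementation), the sketch below compares word n-grams of a sentence against a small lexicon using Python's standard `difflib`. The lexicon, the `TERM-…` identifiers and the function names are placeholders; a real setup would load the terms and their ids from the controlled vocabulary itself:

```python
import difflib

# Hypothetical lexicon: the ids are placeholders, NOT real MeSH identifiers.
LEXICON = {
    "pulmonary hypertension": "TERM-0001",
    "right ventricular failure": "TERM-0002",
}


def ngrams(tokens, n):
    """All runs of n consecutive tokens, joined by single spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def find_terms(sentence, lexicon, cutoff=0.9):
    """Map matched text spans to lexicon ids (exact or near-exact match)."""
    tokens = [t.strip(".,;:()").lower() for t in sentence.split()]
    tokens = [t for t in tokens if t]
    found = {}
    for term, term_id in lexicon.items():
        candidates = ngrams(tokens, len(term.split()))
        # cutoff=1.0 would mean exact matching only; lower values allow typos.
        match = difflib.get_close_matches(term, candidates, n=1, cutoff=cutoff)
        if match:
            found[match[0]] = term_id
    return found


sentence = (
    "Primary pulmonary hypertension is a progressive disease in which "
    "widespread occlusion of the smallest pulmonary arteries leads to "
    "increased pulmonary vascular resistance, and subsequently right "
    "ventricular failure."
)
found = find_terms(sentence, LEXICON)
print(found)
```

This scans every n-gram against every term, which is fine for a quick test but too slow for a vocabulary-sized lexicon.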
Open Targets has a module for this as part of LINK. It's not meant to be used directly, so it might require some hacking and tinkering, but it's the most complete medical NER (named entity recognition) tool I've found for Python. For more info, read their blog post.
There is also a bash script whose example uses a lexicon generated from the Disease Ontology:
https://github.com/lasigeBioTM/MER