I am new to Rapidminer. I have many XML files and I want to classify these files manually based on keywords. Then I would like to train a classifier like Naive Bayer and SVM on these data and calculate their performances using cross- validator.
Could you please let me know different steps for this?
Should I need to use text processing activities like tokenising, TFIDF etc.?
The steps would go something like this
Loop over files - i.e. iterate over all files in a folder and read each one in turn.
For each file
read it in as a document.
tokenize it using operators like Extract Information or Cut Document containing suitable XPath queries to output a row corresponding to the extracted information in the document.
Create a document vector with all the rows. This is where TF-IDF or other approaches would be used. The choice depends on the problem at hand with TF-IDF being a usual choice where it is important to give more weight to tokens that appear often in a relatively small number of the documents.
Build the model and use cross validation to get an estimate of the performance on unseen data.
I have included a link to a process that you could use as the basis for this. It reads the RapidMiner repository which contains XML files so is a good example of processing XML documents using text processing techniques. Obviously, you would have to make some large modifications for your case.
Hope it helps.
Probably, it is too late to reply. But it could help to other people. There is an extension called 'text mining extension', I am using version 6.1.0 . So you may go to RapidMiner > help>update and install this extension. It will get all the files from one directory. It has various text mining algorithms that you may use
Also, I found this tutorial video which could be of some help to you as well
https://www.youtube.com/watch?v=oXrUz5CWM4E
Related
I'd like to break down HTML documents into small chunks of information. With sources like Wikipedia articles (as an example) this is reasonably easy to do, without machine learning, because the content is structured in a highly predictable way.
When working with something like a converted Word Doc or a Blog post, the HTML is a bit more unpredictable. For example, sometimes there are no DIVs, more than one H1 in a document, or no Headers at all etc.
I'm trying to figure out a decent/reliable way of automatically putting content breaks into my content, in order to break it down into chunks of an acceptable size.
I've had a little dig around for existing trained models for this application but I couldn't find anything off-the-shelf. I've considered training my own model, but I'm not confident of the best way to structure the training data. One option I've considered in relation to training data is providing a sample of where section breaks are numerically likely to exist within a document but I don't think that's the best possible approach...
How would you approach this problem?
P.s. I'm currently using Tensorflow but happy to go down a different path.
I've found the GROBID library quite robust for different input documents (since it's based on ML models trained on a large variety of documents). The standard model parses input PDF documents into structured XML/TEI encoded files, which are much easier to deal with. https://grobid.readthedocs.io/en/latest/Introduction/
If your inputs are HTML documents the library also offers the possibility to train your own models. Have a look at: https://grobid.readthedocs.io/en/latest/Training-the-models-of-Grobid/
I'm trying to find the best way to compare two text documents using AI and machine learning methods. I've used the TF-IDF-Cosine Similarity and other similarity measures, but this compares the documents at a word (or n-gram) level.
I'm looking for a method that allows me to compare the meaning of the documents. What is the best way to do that?
You should start reading about word2vec model.
use gensim, get the pretrained model of google.
For vectoring a document, use Doc2vec() function.
After getting vectors for all your document, use some distance metric like cosine distance or euclidean distance for comparison.
This is very difficult. There is actually no computational definition of "meaning". You should dive into text mining, summarization and libraries like gensim, spacy or pattern.
In my opinion, the more readily useable libraries available out there ie. higher return on investesment (ROI), that is if you are a newbie you might want to look at tools around chatbots they want to extract from natural language structured data. That is what is the most similar to "meaning". One example free software tool to achieve that is rasa natural language understanding.
The drawback of such tools is that they somewhat work but only in the domain where they were trained and prepared to work. And in particular they do not aim at comparing documents like you want.
I'm trying to find the best way to compare two text documents using AI
You must come up with a more precise task and from there find out which technic apply best to your use case. Do you want to classify documents in predefined categories. Do you to compute some similarity between two documents? Given an input document, do you want to find most similar documents in a database. Do you want to extract important topics or keywords in the document? Do you want to summarize the document? Is it an abstractig summary or key phrase extraction?
In particular, there is no software that allows to extract somekind of semantic fingerprint from any document. Depending on the end goal, the way to achieve it might be completly different.
You must narrow the precise goal you are trying to achieve; From there, you will be able to ask another question (or improve this one) to describe precisly your goal.
Text understanding is AI-Complete. So, just saying to the computer "tell me something about this two documents" doesn't work.
Like other have said, word2vec and other word embeddings are tools to achieve many goals in NLP but it only a mean for an end. You must define the input and output of the system you are trying to design to be able to start working on the implementation.
There is two other Stack Overflow communities that you might want to dig:
Linguistics
Data Science
Given the tfidf value for each token in your corpus (or the most meaningful ones) you can compute a sparse representation for a document.
This is implemented in the sklearn TFIDFVectorizer.
As other users have pointed out, this is not the best solution to your task.
You should take into account embeddings.
The easiest solution consists in using an embedding at the words level, such as the one provided by the FastText framework.
Then you can create an embedding for the whole document by summing together the embedding of the single words which compose it.
An alterative consists in training an embedding directly at the document level, using some Doc2Vec framework such as the gensim or DL4J one.
Also you can use LDA Or LSI Models for text corpus. these methods(and other methods like wor2vec and doc2vec) can summarize documents to fixed length vectors with respect to it's meaning and topics that this document belongs to.
read more:
https://radimrehurek.com/gensim/models/ldamodel.html
I heard there are three approaches from Dr. Golden:
- Cosine Angular Separation
- Hamming Distance
- Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI)
These methods are based on semantic similarity.
I also heard some company used tool called Spacy to summarize document to compare each other.
I decided to take a dip into ML and with a lot of trial and error was able to create a model using TS' inception.
To take this a step further, I want to use their Object Detection API. But their input preparation instructions, references the use of Pascal VOC 2012 dataset but I want to do the training on my own dataset.
Does this mean I need to setup my datasets to either Pascal VOC or Oxford IIT format? If yes, how do I go about doing this?
If no (my instinct says this is the case), what are the alternatives of using TS object detection with my own datasets?
Side Note: I know that my trained inception model can't be used for localization because its a classifier
Edit:
For those still looking to achieve this, here is how I went about doing it.
The training jobs in the Tensorflow Object Detection API expect to get TF Record files with certain fields populated with groundtruth data.
You can either set up your data in the same format as the Pascal VOC or Oxford-IIIT examples, or you can just directly create the TFRecord files ignoring the XML formats.
In the latter case, the create_pet_tf_record.py or create_pascal_tf_record.py scripts are likely to still be useful as a reference for which fields the API expects to see and what format they should take. Currently we do not provide a tool that creates these TFRecord files generally, so you will have to write your own.
Except TF Object Detection API you may look at OpenCV Haar Cascades. I was starting my object detection way from that point and if provide well prepared data set it works pretty fine.
There are also many articles and tutorials about creating your own cascades, so it`s easy to start.
I was using this blog, it helps me a lot.
i'm currently trying to extract information, e.g. sender or recipient from business documents like bills. The documents were processed with ocr software into xml files, so they are annotated with formatting characteristics. I want to extract specific information from a new document after annotated one similar document manually with features like sender and recipient.
So my question is, if there is a learning or matching algorithm which is able to extract specific data by comparing with only one or two examples of similar documents. If yes: is there somehow a java framework capable of that?
Yours thankfully
maggu
If the XML structure is always the same (using the same template):
Just save the XML parent nodes of the selected nodes where the information is located so you know the path to the information. Shouldn't be a problem - trivial task.
If you have to search for the information:
It could work by creating certain feature extraction rules and then use that features to train a Support Vector Machine for detecting the areas where the information is located.
I once asked a similar question Algorithm to match natural text in mail.
But that is far from trivial, and definitely needs more than one or two training documents.
I have a bunch of text documents that describe diseases. Those documents are in most cases quite short and often only contain a single sentence. An example is given here:
Primary pulmonary hypertension is a progressive disease in which widespread occlusion of the smallest pulmonary arteries leads to increased pulmonary vascular resistance, and subsequently right ventricular failure.
What I need is a tool that finds all disease terms (e.g. "pulmonary hypertension" in this case) in the sentences and maps them to a controlled vocabulary like MeSH.
Thanks in advance for your answers!
Here are two pipelines that are specifically designed for medical document parsing:
Apache cTAKES
NLM's MetaMap
Both use UMLS, the unified medical language system, and thus require that you have a (free) license. Both are Java and more or less easy to set up.
See http://www.ebi.ac.uk/webservices/whatizit/info.jsf
Whatizit is a text processing system that allows you to do textmining
tasks on text. The tasks come defined by the pipelines in the drop
down list of the above window and the text can be pasted in the text
area.
You could also ask biostars: http://www.biostars.org/show/questions/
there are many tools to do that. some popular ones:
NLTK (python)
LingPipe (java)
Stanford NER (java)
OpenCalais (web service)
Illinois NER (java)
most of them come with some predefined models, i.e. they've already been trained on some general datasets (news articles, etc.). however, your texts are pretty specific, so you might want to first constitute a corpus and re-train one of those tools, in order to adjust it to your data.
more simply, as a first test, you can try a dictionary-based approach: design a list of entity names, and perform some exact or approximate matching. for instance, this operation is decribed in LingPipe's tutorial.
Open Targets has a module for this as part of LINK. It's not meant to be used directly so it might require some hacking and tinkering, but it's the most complete medical NER (named entity recognition) tool I've found for python. For more info, read their blog post.
a bash script that has as example a lexicon generated from the disease ontology:
https://github.com/lasigeBioTM/MER