I have a bunch of text documents that describe diseases. Those documents are in most cases quite short and often only contain a single sentence. An example is given here:
Primary pulmonary hypertension is a progressive disease in which widespread occlusion of the smallest pulmonary arteries leads to increased pulmonary vascular resistance, and subsequently right ventricular failure.
What I need is a tool that finds all disease terms (e.g. "pulmonary hypertension" in this case) in the sentences and maps them to a controlled vocabulary like MeSH.
Thanks in advance for your answers!
Here are two pipelines that are specifically designed for medical document parsing:
Apache cTAKES
NLM's MetaMap
Both use UMLS, the unified medical language system, and thus require that you have a (free) license. Both are Java and more or less easy to set up.
See http://www.ebi.ac.uk/webservices/whatizit/info.jsf
Whatizit is a text processing system that allows you to do textmining
tasks on text. The tasks come defined by the pipelines in the drop
down list of the above window and the text can be pasted in the text
area.
You could also ask biostars: http://www.biostars.org/show/questions/
there are many tools to do that. some popular ones:
NLTK (python)
LingPipe (java)
Stanford NER (java)
OpenCalais (web service)
Illinois NER (java)
most of them come with some predefined models, i.e. they've already been trained on some general datasets (news articles, etc.). however, your texts are pretty specific, so you might want to first constitute a corpus and re-train one of those tools, in order to adjust it to your data.
more simply, as a first test, you can try a dictionary-based approach: design a list of entity names, and perform some exact or approximate matching. for instance, this operation is decribed in LingPipe's tutorial.
Open Targets has a module for this as part of LINK. It's not meant to be used directly so it might require some hacking and tinkering, but it's the most complete medical NER (named entity recognition) tool I've found for python. For more info, read their blog post.
a bash script that has as example a lexicon generated from the disease ontology:
https://github.com/lasigeBioTM/MER
Related
I'm trying to find the best way to compare two text documents using AI and machine learning methods. I've used the TF-IDF-Cosine Similarity and other similarity measures, but this compares the documents at a word (or n-gram) level.
I'm looking for a method that allows me to compare the meaning of the documents. What is the best way to do that?
You should start reading about word2vec model.
use gensim, get the pretrained model of google.
For vectoring a document, use Doc2vec() function.
After getting vectors for all your document, use some distance metric like cosine distance or euclidean distance for comparison.
This is very difficult. There is actually no computational definition of "meaning". You should dive into text mining, summarization and libraries like gensim, spacy or pattern.
In my opinion, the more readily useable libraries available out there ie. higher return on investesment (ROI), that is if you are a newbie you might want to look at tools around chatbots they want to extract from natural language structured data. That is what is the most similar to "meaning". One example free software tool to achieve that is rasa natural language understanding.
The drawback of such tools is that they somewhat work but only in the domain where they were trained and prepared to work. And in particular they do not aim at comparing documents like you want.
I'm trying to find the best way to compare two text documents using AI
You must come up with a more precise task and from there find out which technic apply best to your use case. Do you want to classify documents in predefined categories. Do you to compute some similarity between two documents? Given an input document, do you want to find most similar documents in a database. Do you want to extract important topics or keywords in the document? Do you want to summarize the document? Is it an abstractig summary or key phrase extraction?
In particular, there is no software that allows to extract somekind of semantic fingerprint from any document. Depending on the end goal, the way to achieve it might be completly different.
You must narrow the precise goal you are trying to achieve; From there, you will be able to ask another question (or improve this one) to describe precisly your goal.
Text understanding is AI-Complete. So, just saying to the computer "tell me something about this two documents" doesn't work.
Like other have said, word2vec and other word embeddings are tools to achieve many goals in NLP but it only a mean for an end. You must define the input and output of the system you are trying to design to be able to start working on the implementation.
There is two other Stack Overflow communities that you might want to dig:
Linguistics
Data Science
Given the tfidf value for each token in your corpus (or the most meaningful ones) you can compute a sparse representation for a document.
This is implemented in the sklearn TFIDFVectorizer.
As other users have pointed out, this is not the best solution to your task.
You should take into account embeddings.
The easiest solution consists in using an embedding at the words level, such as the one provided by the FastText framework.
Then you can create an embedding for the whole document by summing together the embedding of the single words which compose it.
An alterative consists in training an embedding directly at the document level, using some Doc2Vec framework such as the gensim or DL4J one.
Also you can use LDA Or LSI Models for text corpus. these methods(and other methods like wor2vec and doc2vec) can summarize documents to fixed length vectors with respect to it's meaning and topics that this document belongs to.
read more:
https://radimrehurek.com/gensim/models/ldamodel.html
I heard there are three approaches from Dr. Golden:
- Cosine Angular Separation
- Hamming Distance
- Latent Semantic Analysis (LSA) or Latent Semantic Indexing (LSI)
These methods are based on semantic similarity.
I also heard some company used tool called Spacy to summarize document to compare each other.
I'm a newbie in the field of Machine Learning and Supervised learning.
My task is the following: from the name of a movie file on a disk, I'd like to retrieve some metadata about the file. I have no control on how the file is named, but it has a title and one or more additional info, like a release year, a resolution, actor names and so on.
Currently I have developed a rule heuristic-based system, where I split the name into tokens and try to understand what each word could represent, either alone or with adjacent ones. For detecting people names for example, I'm using a dataset of english names, and score the word as being a potential person's name if I find it in the dataset. If adjacent to it is a word that I scored as a potential surname, I score the two words as being an actor. And so on. It works with a decent accuracy, but changing heuristic scores manually to "teach" the system is tedious and unpredictable.
Such a rule-based system is hard to maintain or develop further, so, out of curiosity, I was exploring the field of machine learning. What I would like to know is:
Is there some kind of public literature about these kinds of problems?
Is ML a good way to approach the problem, given the limited data set available?
How would I proceed to debug or try to understand the results of such a machine? I already have problems with the "simplistic" heuristic engine I have developed..
Thanks, any advice would be appreciated.
You need to look into NLP (natural language processing). NLP deals with text processing and other things; for example entity recognition and tagging.
Here is an example of using Spacy library: https://spacy.io/usage/linguistic-features.
Some time ago I did a similar thing, you can see it here: https://github.com/Erlemar/Erlemar.github.io/blob/master/Notebooks/Fate_Zero_explore.ipynb
I'm new to the ML/NLP field so my question is what technology would be most appropriate to achieve the following goal:
We have a short sentence - "Where to go for dinner?" or "What's your favorite bar?" or "What's your favorite cheap bar?"
Is there a technology that would enable me to train it providing the following data sets:
"Where to go for dinner?" -> Dinner
"What's your favorite bar?" -> Bar
"What's your favorite cheap restaurant?" -> Cheap, Restaurant
so that next time we have a similar question about an unknown activity, say, "What is your favorite expensive [whatever]" it would be able to extract "expensive" and [whatever]?
The goal is if we can train it with hundreds of variations(or thousands) of the question asked and relevant output data expected, so that it can work with everyday language.
I know how to make it even without NLP/ML if we have a dictionary of expected terms like Bar, Restaurant, Pool, etc., but we also want it to work with unknown terms.
I've seen examples with Rake and Scikit-learn for classification of "things", but I'm not sure how would I feed text into those and all those examples had predefined outputs for training.
I've also tried Google's NLP API, Amazon Lex and Wit to see how good they are at extracting entities, but the results are disappointing to say the least.
Reading about summarization techniques, I'm left with the impression it won't work with small, single-sentence texts, so I haven't delved into it.
As #polm23 mentioned for simple stuff you can use the POS tagging to do the extraction. The services you mentioned like LUIS, Dialog flow etc. , uses what is called Natural Language Understanding. They make uses of intents & entities(detailed explanation with examples you can find here). If you are concerned that your data is going online or sometimes you have to go offline, you always go for RASA.
Things you can do with RASA:
Entity extraction and sentence classification. Mention which particular term to be extracted from the sentence by tagging the word position with a variety of sentence. So if any different word comes other than what you had given in the training set it will be detected.
Uses rule-based learning and also keras LSTM for detection.
One downside when comparing with the online services is that you have to manually tag the position numbers in the JSON file for training as opposed to the click and tag features in the online services.
You can find the tutorial here.
I am having pain in my leg.
Eg I have trained RASA with a variety of sentences for identifying body part and symptom (I have limited to 2 entities only, you can add more), then when an unknown sentence (like the one above) appears it will correctly identify "pain" as "symptom" and "leg" as "body part".
Hope this answers your question!
Since "hundreds to thousands" sound like you have very little data for training a model from scratch. You might want to consider training (technically fine-tuning) a DialogFlow Agent to match sentences ("Where to go for dinner?") to intents ("Dinner"), then integrating via API calls.
Alternatively, you can invest time in fine-tuning a small pre-trained model like "Distilled BERT classifier" from "HuggingFace" as you won't need the 100s of thousands to billions of data samples required to train a production-worthy model. This can also be assessed offline and will equip you to solve other NLP problems in the future without much low-level understanding of the underlying statistics.
I am using AngelList DB to categorize startups based on their industries since these startups are categorized based on community input which is misleading most of the time.
My business objective is to extract keywords that indicate to which industry this specific startup belongs to then map it to one of the industries specified in LinkedIn sheet https://developer.linkedin.com/docs/reference/industry-codes
I experimented with Azure Machine learning, where I pushed 300 startups descriptions and analyzed the keyword extraction was pretty bad and was not even close to what I am trying to achieve.
I would like to know how data scientists will approach this problem? where should I look? and where I should not? is keyword analysis tools (like Google Adwords keyword planner is a viable option)
Using Text Classification...
To be able to treat this as a classification problem, you need a training set, which is a set of AngelList entries that are labeled with correct LinkedIn categories. This can be done manually, or you can hire some Mechanical Turks to do the job for you.
Since you have ~150 categories, I'd imagine you need at least 20-30* AngelList entries for each of them. So your training set will be {input: angellist_description, result: linkedin_id}
After that, you need to dig through text classification techniques to try and optimize the accuracy/precision of your results. The book "Taming Text" has a full chapter on text classification. And a good tool to implement a text-based classifier would be Apache Solr or Apache Lucene.
* 20-30 is a quick personal estimate and not based on a scientific method. You can look up some methods online for a good estimation method.
Using Text Clustering.
Step #1
Use text clustering to extract main 'topics' from all the descriptions. (Carrot2 can be helpful here)
Input corpus of all descriptions
Process: Text Clustering using Carrot2
Output each document will be labeled with a topic
Step #2
Manually map the extracted topics into LinkedIn's categories.
Step #3
Use the output of the first two steps to traverse from company -> extracted topic -> linkedin category
I am working on a text mining project which focus on the computer technology documents. So there're many jargons. Tasks like part-of-speech tagging require some training data to built a pos-tagger. And I think this training data should be from the same domain with words like ".NET, COM, JAVA" correctly tagged.
So where can I find such corpus? Or is there any work around? Or can we tune an existing tagger to handle domain specific task?
Gathering training data (and defining features) is going to be the hardest step of this problem. I'm sure there are datasets out there. But an alternative option for you would be to identify a few journals or news sites that focus on your area of interest and crawl them and pull down the text, perhaps validating each article you pull down by searching for keywords. I've done that before to develop a corpus focused on elections.
Unfortunately, it is domain-specific where you can find such a corpus.
Catch-22. There is no general source for specialized data.
Just like there is no universal software to solve domain-specific problems.