NLP Paragraph Level Predictions vs Doc Level Predictions? What Strategy to Deploy?

I'm trying to work out which modeling approach to use. I currently have a TF-IDF NLP model that reads in the paragraphs of a document and makes a document-level prediction based on how many of those paragraphs scored a 1 label.
I am not sure that is the correct logic. Should I just go with a document-level model? What are the benefits and trade-offs of predicting at the paragraph level and rolling the results up into a total prediction for the document, versus classifying the document itself?
Any Thoughts?
Thanks!

It depends on what problem you are trying to solve and the nature of your data.
If different parts of one document can be classified differently, it's better to make predictions per paragraph or even per sentence. For example, quite often a customer is happy with one part of a product/item (the first sentence is positive) and dissatisfied with another part (the second sentence has a negative sentiment).
Or, if the document is entirely related to a specific topic, you can make a prediction using the entire text.
In the end, these are just assumptions. Hold out a test subset and validate your model for both cases.
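To make the trade-off concrete, here is a minimal sketch of the roll-up strategy, assuming scikit-learn, a made-up set of labeled paragraphs, and a simple majority-vote aggregation (none of which come from the original question):

```python
# Minimal sketch: classify paragraphs, then roll up to a document label.
# The data, threshold, and model choice are illustrative assumptions.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: paragraphs with 0/1 labels.
train_paragraphs = ["great build quality", "arrived broken",
                    "fast shipping", "battery died quickly"]
train_labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_paragraphs)
clf = LogisticRegression().fit(X_train, train_labels)

def predict_document(paragraphs, threshold=0.5):
    """Label a document 1 if at least `threshold` of its paragraphs score 1."""
    para_preds = clf.predict(vectorizer.transform(paragraphs))
    return int(para_preds.mean() >= threshold)

doc = ["great build quality", "battery died quickly", "fast shipping"]
print(predict_document(doc))
```

A document-level model would instead be trained on whole documents with one label each; validating both on a held-out set, as suggested above, is the way to decide between them.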

Related

Is it a bad idea to use the cluster ID from clustering text data using K-means as feature to your supervised learning model?

I am building a model that will predict the lead time of products flowing through a pipeline.
I have a lot of different features; one is a string containing a few words about the purpose of the product (often abbreviations, the name of the application it will be part of, and so forth). I have not previously used this field at all when doing feature engineering.
I was thinking that it would be nice to do some type of clustering on this data, and then use the cluster ID as a feature for my model, perhaps the lead time is correlated with the type of info present in that field.
Here was my line of thinking:
1) Cleaning & tokenizing text.
2) TF-IDF
3) Clustering
But after thinking about it more, is it a bad idea? Because the clustering was based on the old data, new words introduced in new data will not be captured by the clustering algorithm, and the data should perhaps be clustered differently now. Does this mean I would have to retrain the entire model (the k-means model and then the supervised model) whenever I want to predict new data points? Are there any best practices for this?
Are there better ways of finding clusters for text data to use as features in a supervised model?
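For reference, a minimal sketch of the pipeline described in this question (assuming scikit-learn; the descriptions and cluster count are invented). It also shows that a fitted vectorizer and k-means model can be reused on new rows without retraining, although genuinely new vocabulary is simply ignored by the fitted vocabulary:

```python
# Sketch of the three-step pipeline: clean/tokenize, TF-IDF, cluster.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

descriptions = ["billing app backend", "frontend UI for billing",
                "data pipeline ETL", "ETL job scheduler"]

vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(descriptions)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
cluster_ids = kmeans.labels_   # use as a categorical feature downstream

# For new rows, reuse the *fitted* vectorizer and k-means; unseen words are
# dropped by the fitted vocabulary rather than forcing a full retrain.
new_desc = ["billing service API"]
new_cluster = kmeans.predict(vectorizer.transform(new_desc))
print(cluster_ids, new_cluster)
```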
I understand the urge to use an unsupervised clustering algorithm first to see for yourself which clusters are found, and of course you can try whether such an approach helps your task.
But since you have labeled data, you can pass the product description in directly, without an intermediate clustering step. Your supervised algorithm will then learn for itself if and how this feature helps with your task (of course, preprocessing such as stopword removal, cleaning, tokenizing and feature extraction still needs to be done).
Depending on your text descriptions, I could also imagine that some simple sequence embeddings could work as feature extraction. An embedding is a vector of, say, 300 dimensions that describes words in such a way that "hp office printer" and "canon ink jet" end up close to each other, while "nice leather bag" ends up farther away from the other two phrases. Pretrained fastText word embeddings, for example, are already available for English. To get a single embedding for a sequence such as "hp office printer", one can take the average of the three word vectors (there are more ways to get an embedding for a whole sequence, for example doc2vec).
But in the end you need to run tests to choose your features and methods!
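As a rough sketch of the averaging idea (the pretrained model name below is just one option from the gensim downloader, not something prescribed in the answer):

```python
# Average pretrained word vectors into one feature vector per description.
import numpy as np
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # any pretrained KeyedVectors would do

def sequence_embedding(text):
    """Average the vectors of in-vocabulary tokens; zeros if none match."""
    tokens = [t for t in text.lower().split() if t in wv]
    if not tokens:
        return np.zeros(wv.vector_size)
    return np.mean([wv[t] for t in tokens], axis=0)

feature = sequence_embedding("hp office printer")  # shape: (100,)
print(feature.shape)
```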

AutoML NL - Training model based on ICD10-CM - Amount of text required

We are currently working on integrating ICD10-CM for our medical company, to be used for patient diagnosis. ICD10-CM is a coding system for diagnoses.
I tried to import the ICD10-CM data as description-code pairs but, obviously, it didn't work since AutoML needed more text for each code (label). I found a dataset on Kaggle, but it only contained hrefs to an ICD10 website. I did find that the website contains multiple texts and descriptions associated with each code that could be used to train our desired model.
Kaggle Dataset:
https://www.kaggle.com/shamssam/icd10datacom
Sample of a page from ICD10data.com:
https://www.icd10data.com/ICD10CM/Codes/A00-B99/A15-A19/A17-/A17.0
Most notable fields are:
- Approximate Synonyms
- Clinical Information
- Diagnosis Index
If I make a dataset from the sentences found on these pages and assign them to their codes (labels), will that be enough for AutoML training? Each label would then have two or more texts instead of just one, but still far fewer than the hundred or so per code seen in the demos/tutorials.
From what I can see here, the disease codes have a tree-like structure where, for instance, all L00-L99 codes refer to "Diseases of the skin and subcutaneous tissue". Within that, L00-L08 codes refer to "Infections of the skin and subcutaneous tissue", and so on.
What I mean is that the problem is not 90,000 examples for 90,000 independent labels, but rather a decision tree: you take several decisions, each conditioned on the previous one. The first step would be choosing which of the roughly 15 most general categories fits best, then choosing which of its subcategories, and so on.
In this sense, AutoML is probably not the best product, given that you cannot implement a specially designed hierarchical model that takes all of this into account.
Another way of using AutoML would be to train a separate model for each of the decisions and then combine the different models. This would easily work for the first layer of decisions but would become exponentially time consuming: the number of models you need to train in order to predict more precisely grows exponentially with the level of precision (by more precise I mean affirming that a code is in L00-L08 instead of just L00-L99).
I hope this helps you better understand the problem and the different approaches you can take!
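For illustration, a toy sketch of the "one classifier per decision level" idea outside AutoML, assuming scikit-learn and a handful of invented diagnosis texts (real ICD-10-CM data would of course be needed):

```python
# Two-level hierarchical classification: chapter first, then block within it.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Level 1: predict the broad chapter (e.g. A00-B99 vs L00-L99).
texts = ["tuberculosis of the lung", "cholera due to vibrio cholerae",
         "impetigo on the forearm", "atopic dermatitis of the hands"]
chapters = ["A00-B99", "A00-B99", "L00-L99", "L00-L99"]
chapter_clf = make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(texts, chapters)

# Level 2: one classifier per chapter, predicting the narrower block.
block_clf = {
    "A00-B99": make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(
        ["tuberculosis of the lung", "cholera due to vibrio cholerae"],
        ["A15-A19", "A00-A09"]),
    "L00-L99": make_pipeline(TfidfVectorizer(), LogisticRegression()).fit(
        ["impetigo on the forearm", "atopic dermatitis of the hands"],
        ["L00-L08", "L20-L30"]),
}

def predict(text):
    """Route the text down the hierarchy: chapter, then block."""
    chapter = chapter_clf.predict([text])[0]
    return chapter, block_clf[chapter].predict([text])[0]

print(predict("pulmonary tuberculosis, bacteriologically confirmed"))
```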

Adding vocabulary and improving word embeddings with another model built on a bigger corpus

I'm new to NLP. I'm currently building an NLP system in a specific domain. After training word2vec and fastText models on my documents, I found that the embeddings are not very good because I didn't feed in enough documents (e.g. the embedding can't see that "bar" and "pub" are strongly correlated because "pub" appears only a few times in the documents). Later, I found a word2vec model online built on that domain-specific corpus which definitely has much better embeddings (so "pub" is more related to "bar"). Is there any way to improve my word embeddings using the model I found? Thanks!
Word2Vec (and similar) models really require a large volume of varied data to create strong vectors.
But also, a model's vectors are typically only meaningful alongside other vectors that were trained together in the same session. This is both because the process includes some randomness, and the vectors only acquire their useful positions via a tug-of-war with all other vectors and aspects of the model-in-training.
So, there's no standard location for a word like 'bar' - just a good position, within a certain model, given the training data and model parameters and other words co-populating the model.
This means mixing vectors from different models is non-trivial. There are ways to learn a 'translation' that moves vectors from the space of one model to another – but that is itself a lot like a re-training. You can pre-initialize a model with vectors from elsewhere... but as soon as training starts, all the words in your training corpus will start drifting into the best alignment for that data, and gradually away from their original positions, and away from pure comparability with other words that aren't being updated.
In my opinion, the best approach is usually to expand your corpus with more appropriate data, so that it has "enough" examples of every word important to you, in sufficiently varied contexts.
Many people use large free text dumps like Wikipedia articles for word-vector training, but be aware that its style of writing – dry, authoritative reference texts – may not be optimal for all domains. If your problem area is "business reviews", you'd probably do best finding other review texts; if it's fiction stories, more fictional writing; and so forth. You can shuffle these other text sources in with your data to expand the vocabulary coverage.
You can also potentially shuffle in extra repeated examples of your own local data, if you want it to effectively have relatively more influence. (Generally, merely repeating a small number of non-varied examples can't help improve word-vectors: it's the subtle contrasts of different examples that helps. But as a way to incrementally boost the influence of some examples, when there are plenty of examples overall, it can make more sense.)
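A minimal sketch of the "expand the corpus and oversample your own data" suggestion, assuming gensim 4.x; the corpora, parameters and repetition factor are purely illustrative:

```python
# Train Word2Vec on a combined corpus, repeating local data for extra weight.
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

domain_docs = ["the pub on the corner serves great ale",
               "we met at the bar after work"]
extra_docs = ["a bar or pub is an establishment that serves alcoholic drinks"]  # e.g. review dumps

sentences = [simple_preprocess(d) for d in extra_docs]
sentences += [simple_preprocess(d) for d in domain_docs] * 3  # oversample local data

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, epochs=20, seed=42)
print(model.wv.most_similar("pub", topn=3))
```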

metric learning for information retrieval in semi-structured text?

I am interested in parsing semi-structured text. Assuming I have a text with labels of the kind: year_field, year_value, identity_field, identity_value, ..., address_field, address_value and so on.
These fields and their associated values can be anywhere in the text, but usually they are near each other; more generally, the text is organized in a (very) rough matrix, and quite often the value comes just after the associated field, possibly with some uninteresting information in between.
The number of different formats can run to several dozen, and they are not that rigid (do not count on spacing; moreover, some information can be added or removed).
I am looking toward machine learning techniques to extract all those (field, value) pairs of interest.
I think metric learning and/or conditional random fields (CRF) could be of a great help, but I have not practical experience with them.
Has anyone already encountered a similar problem?
Any suggestion or literature on this topic?
Your task, if I understand correctly, is to extract all pre-defined entities from a text. What you describe here is exactly named entity recognition.
Stanford has a Stanford Named Entity Recognizer that you can download and use (python/java and more)
Regarding the models you are considering (CRF, for example): the hard part here is getting the training data, i.e. sentences with the entities already labeled. This is why you should consider using a pre-trained model, or using someone else's data to train your model (again, the model will only recognize entity types it saw during training).
A great choice for an already-trained model in Python is NLTK's information extraction module.
Hope this sums it up
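For example, a small sketch of off-the-shelf NER with NLTK's chunker (the sentence is invented, and the resource names can differ slightly between NLTK versions):

```python
# Off-the-shelf named entity recognition with NLTK's ne_chunk.
import nltk

# Resource names vary a bit across NLTK versions; these are the classic ones.
for pkg in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(pkg, quiet=True)

sentence = "John Smith moved to 221B Baker Street, London in 2019."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))

for subtree in tree:
    if hasattr(subtree, "label"):      # named-entity chunks are subtrees
        entity = " ".join(tok for tok, _ in subtree.leaves())
        print(subtree.label(), entity)  # e.g. PERSON John Smith, GPE London
```

For custom field types like year_field/year_value, you would still need your own labeled examples to train a CRF or similar sequence model.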

what methods are there to classify documents?

I am trying to do document classification, but I am really confused about the difference between feature selection and tf-idf. Are they the same thing, or two different ways of doing classification?
I hope somebody can tell me; I am not really sure my question will make sense to you guys.
Yes, you are confusing a lot of things.
Feature selection is the abstract term for choosing which features to keep (1) or drop (0). Stopword removal can be seen as feature selection.
TF is one method of extracting features from text: counting words.
IDF is one method of assigning weights to features.
Neither of them is classification... they are popular for text classification, but they are even more popular for information retrieval, which is not classification...
However, many classifiers work on numeric data, so the common process is:
1) Extract features (e.g. TF)
2) Select features (e.g. remove stopwords)
3) Weight features (e.g. IDF)
4) Train a classifier on the resulting numerical vectors
5) Predict the classes of new/unlabeled documents
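A compact sketch of those five steps, assuming scikit-learn and a tiny invented corpus:

```python
# Extract, weight, train, predict: a minimal text-classification pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

docs = ["the senate passed the budget bill",
        "stocks rallied after the earnings report",
        "new GPU architecture announced",
        "the gallery opened a sculpture exhibit"]
labels = ["politics", "economics", "computer science", "arts"]

# Steps 1-3: term counts, stopword removal, IDF weighting (all inside TfidfVectorizer).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

# Step 4: train a classifier on the numeric vectors.
clf = MultinomialNB().fit(X, labels)

# Step 5: predict the class of a new, unlabeled document.
print(clf.predict(vectorizer.transform(["parliament debates the new tax bill"])))
```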
Taking a look at this explanation may help a lot when it comes to understanding text classifiers.
TF-IDF is a good way to find a document that answers a given query, but it does not by itself assign classes to documents.
Examples that may be helpful:
1) You have a bunch of documents with subjects ranging from politics, economics, computer science and the arts. The documents belonging to each subject are separated into the appropriate directories for each subject (you have a labeled dataset). Now, you received a new document whose subject you do not know. In which directory should it be stored? A classifier can answer this question from the documents that are already labeled.
2) Now, you received a query regarding computer science. For instance, you received the query "Good methods for finding textual similarity". Which document in the directory of computer science can provide the best response to that query? TF-IDF would be a good approach to figure that out.
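A rough sketch of example 2, i.e. using TF-IDF for retrieval rather than classification (assuming scikit-learn; the documents and query are invented):

```python
# Rank documents against a query by TF-IDF cosine similarity.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

cs_docs = ["methods for measuring textual similarity with embeddings",
           "an introduction to sorting algorithms",
           "tf-idf and cosine similarity for document matching"]

vectorizer = TfidfVectorizer(stop_words="english")
doc_vectors = vectorizer.fit_transform(cs_docs)

query = "Good methods for finding textual similarity"
scores = cosine_similarity(vectorizer.transform([query]), doc_vectors)[0]
best = scores.argmax()
print(cs_docs[best], scores[best])   # highest-scoring document answers the query best
```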
So, when you are classifying documents, you are trying to make a decision about whether a document is a member of a particular class (like, say, 'about birds' or 'not about birds').
Classifiers predict the value of the class given a set of features. A good set of features will be highly discriminative - they will tell you a lot about whether the document is of one class or another.
Tf-idf (term frequency inverse document frequency) is a particular feature that seems to be discriminative for document classification tasks. There are others, like word counts (tf or term frequency) or whether a regexp matches the text or what have you.
Feature selection is the task of selecting good (discriminative) features. Tfidf is probably a good feature to select.
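To make the distinction concrete, a short sketch contrasting tf-idf features with explicit feature selection (assuming scikit-learn; the data and k are arbitrary):

```python
# tf-idf builds weighted features; feature selection keeps only the best of them.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2

docs = ["hawks and sparrows are birds", "sparrows nest in hedges",
        "interest rates rose again", "the central bank cut rates"]
labels = [1, 1, 0, 0]   # 1 = about birds, 0 = not about birds

# tf-idf turns text into weighted numeric features...
X = TfidfVectorizer().fit_transform(docs)

# ...while feature selection keeps only the most discriminative of those features.
X_selected = SelectKBest(chi2, k=5).fit_transform(X, labels)
print(X.shape, "->", X_selected.shape)
```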
