Binary classification of dated documents with seasonal class variation - machine-learning

I have a collection of training documents with publication dates, where each document is labeled as belonging (or not) to some topic T. I want to train a model that will predict for a new document (with publication date) whether or not it belongs to T, where the publication date might be in the past or in the future. Assume that I have decomposed each training document's text into a set of features (e.g., TF-IDF of words or n-grams) suitable for analysis by an appropriate binary classification algorithm provided by a library like Weka (for instance, multinomial naive Bayes, random forests, or SVM). The concept to be learned exhibits multiple seasonality; i.e., the prior probability that an arbitrary document published on a given date belongs to T depends heavily on when the date falls in a 4-year cycle (due to elections), where it falls in an annual cycle (due to holidays), and on the day of the week.
My research indicates that classification algorithms generally assume (as part of their statistical models) that training data is randomly sampled from the same pool of data that the model will ultimately be applied to. When the distribution of classes in the training data differs substantially from the known distribution in the wild, this leads to the so-called "class imbalance" problem. There are ways of compensating for this, including over-sampling underrepresented classes, under-sampling overrepresented classes, and using cost-sensitive classification. This allows a model creator to implicitly specify the prior probability that a new document will be positively classified, but importantly (and unfortunately for my purposes), this prior probability is assumed to be equal for all new documents.
I require more flexibility in my model. Because of the concept's seasonality, when classifying a new document, the model must explicitly take the publication date into account when determining the prior probability that the document belongs to T, and when the model calculates the posterior probability of belonging to T in light of the document's features, this prior probability should be properly accounted for. I am looking for a classifier implementation that either (1) bakes sophisticated regression of prior probabilities based on dates into the classifier, or (2) can be extended with a user-specified regression function that takes a date as input and gives the prior probability as output.
I am most familiar with the Weka library, but am open to using other tools if they are appropriate to the job. What is the most straightforward way of accomplishing this task?
Edit (in response to Doxav's point #2):
My concern is that date-based attributes should not be used for learning rules about when the topic applies; rather, they should be used only for determining the prior probability of whether the topic applies. Here's a concrete example: suppose that the topic T is "Christmas". A story published in July is indeed much less likely to be about Christmas than a story published in December. But what makes a story about Christmas is the textual content of the story, not when it was published. The relationship between publication date and "being about Christmas" is mere correlation, and therefore only useful for calculating the prior probability of an arbitrary story on an arbitrary date being about Christmas. By comparison, the relationship between TF-IDF (for some term in the story text) and "being about Christmas" is inherent and causative, and therefore worthy of incorporation into our model of what it means for a story to be about Christmas.

It seems like it can be simplified into typical ML problems: text classification + imbalanced data + seasonality identification + architecture + typical batch/offline vs stream/online learning:
Text classification: https://www.youtube.com/watch?v=IY29uC4uem8 is a good tutorial on text classification with Weka and covers the imbalanced-data issue.
Seasonality identification: the goal is to enable the model to learn rules/inferences on several time attributes, so we should ease its job by extracting the best-known useful attributes. That means extracting typical date cycles (i.e., weekday, day of month, month, year...) and, if possible, merging them with other, more specific cycles or events (i.e., elections, holidays, any custom cycle or frequent event). If you expect the model to learn from time series/sequences, you should create some lag data (attributes describing what happened before, or statistics on a recent time interval). It can be good to remove the date itself, or any data that would bias the model construction.
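As a rough illustration of the date-cycle extraction described above, here is a minimal Python sketch. The function name, the feature names, and the use of `year % 4` as a stand-in for an election cycle are assumptions made for this example; a real project would choose cycles that match its own data.

```python
from datetime import date


def seasonal_features(d: date) -> dict:
    """Turn a publication date into cyclic attributes (illustrative names)."""
    return {
        "day_of_week": d.weekday(),             # 0 = Monday ... 6 = Sunday
        "day_of_month": d.day,
        "month": d.month,
        "day_of_year": d.timetuple().tm_yday,   # position in the annual cycle
        # Assumption: a 4-year cycle approximated by the year modulo 4.
        "election_cycle_year": d.year % 4,
    }


features = seasonal_features(date(2016, 12, 25))
print(features)
```

The raw date is deliberately left out of the returned attributes, in line with the advice to remove anything that would bias model construction.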
I don't know if you plan to deliver this as a service, but this can be of good inspiration: http://fr.slideshare.net/TraianRebedea/autonomous-news-clustering-and-classification-for-an-intelligent-web-portal .
Typical batch/offline vs stream/online learning: Apparently you already know Weka, which focuses on batch/offline learning. I don't know the size of your data, but if you plan to continuously process new data and rebuild models, you could consider moving to stream processing and online learning. In that case, you could move to MOA, which is very close to Weka but dedicated to stream classification, or use the new streaming features of the latest version of Weka (stream processing and new online learners).
UPDATE 1: I read your comment and I see several possible solutions:
Point #2 above is still one possible solution for your need, even if it is not optimal. An attribute indicating that it is the Christmas period will raise the probability of tagging a document as a Christmas topic, and so will the TF-IDF of the word "Christmas", BUT only both attributes together will push the classification probability for Christmas to its maximum.
You can use an attribute providing a seasonal weight for each word: TF-IDF with a time weight, or current Google Trends data for each word.
If you want a state-of-the-art prior that adapts to context, you could look into hierarchical Bayesian models and smoothing from NLP solutions. It won't be Weka then, and it won't be as fast to test.
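One way to read option (2) from the question is as a standard prior-shift correction applied after classification: train a text-only classifier, then re-weight its posterior with a date-dependent prior using Bayes' rule. The sketch below is not taken from Weka or from the answer above. The function names and the numbers in `christmas_prior` are hypothetical, and it assumes that the classifier's output is a reasonably calibrated probability and that all priors lie strictly between 0 and 1.

```python
def adjust_posterior(p_model: float, train_prior: float, date_prior: float) -> float:
    """Re-weight a posterior learned under train_prior for a new prior.

    p_model     -- P(T | text) as output by the text-only classifier
    train_prior -- fraction of positive documents in the training data
    date_prior  -- prior P(T | publication date) from a separate model
    """
    pos = p_model * date_prior / train_prior
    neg = (1.0 - p_model) * (1.0 - date_prior) / (1.0 - train_prior)
    return pos / (pos + neg)


def christmas_prior(month: int) -> float:
    """Hypothetical date-to-prior function; the numbers are made up."""
    return 0.30 if month == 12 else 0.01


p_model = 0.60  # same text evidence in both cases
july = adjust_posterior(p_model, train_prior=0.10, date_prior=christmas_prior(7))
december = adjust_posterior(p_model, train_prior=0.10, date_prior=christmas_prior(12))
print(round(july, 3), round(december, 3))
```

With this split, the date never enters the text model itself; it only moves the prior, which is the separation the question asks for.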

Related

Is it a bad idea to use the cluster ID from clustering text data using K-means as feature to your supervised learning model?

I am building a model that will predict the lead time of products flowing through a pipeline.
I have a lot of different features, one is a string containing a few words about the purpose of the product (often abbreviations, name of the application it will be a part of and so forth). I have previously not used this field at all when doing feature engineering.
I was thinking that it would be nice to do some type of clustering on this data, and then use the cluster ID as a feature for my model, perhaps the lead time is correlated with the type of info present in that field.
Here was my line of thinking:
1) Cleaning & tokenizing text.
2) TF-IDF
3) Clustering
But after thinking more about it, is it a bad idea? Because the clustering was based on the old data, if new words are introduced in the new data this will not be captured by the clustering algorithm, and the data should perhaps be clustered differently now. Does this mean that I would have to retrain the entire model (k-means model and then the supervised model) whenever I want to predict new data points? Are there any best practices for this?
Are there better ways of finding clusters for text data to use as features in a supervised model?
I understand the urge to use an unsupervised clustering algorithm first, to see for yourself which clusters are found. And of course you can try whether such an approach helps your task.
But as you have labeled data, you can pass the product description in directly, without an intermediate clustering step. Your supervised algorithm will then learn for itself if and how this feature helps in your task (of course, preprocessing such as stopword removal, cleaning, tokenizing and feature extraction still needs to be done).
Depending on your text descriptions, I could also imagine that some simple sequence embeddings could work as feature extraction. An embedding is a vector of, for example, 300 dimensions that describes the words in such a manner that "hp office printer" and "canon ink jet" are close to each other, while "nice leatherbag" is farther away from the other two phrases. For example, fastText word embeddings are available pre-trained for English. To get a single embedding for a sequence such as "hp office printer", one can take the average of the three word vectors (there are more ways to get an embedding for a whole sequence, for example doc2vec).
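To make the averaging idea concrete, here is a small Python sketch. The 3-dimensional vectors are invented toy values, not real fastText embeddings, and the helper names are assumptions for this example; with a real pre-trained model you would look the word vectors up instead.

```python
import math

# Toy 3-dimensional "word vectors" (made up for illustration only).
toy_vectors = {
    "hp": [0.9, 0.1, 0.0],
    "office": [0.7, 0.2, 0.1],
    "printer": [0.8, 0.0, 0.2],
    "canon": [0.9, 0.0, 0.1],
    "ink": [0.7, 0.1, 0.2],
    "jet": [0.8, 0.1, 0.1],
    "nice": [0.0, 0.9, 0.3],
    "leatherbag": [0.1, 0.8, 0.6],
}


def phrase_embedding(phrase, vectors):
    """Average the vectors of all known words in the phrase."""
    known = [vectors[w] for w in phrase.lower().split() if w in vectors]
    if not known:
        raise ValueError("no known words in phrase: %r" % phrase)
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


printer = phrase_embedding("hp office printer", toy_vectors)
inkjet = phrase_embedding("canon ink jet", toy_vectors)
bag = phrase_embedding("nice leatherbag", toy_vectors)
print(cosine(printer, inkjet), cosine(printer, bag))
```

The averaged vector can then be passed to the supervised model as a fixed-length block of numeric features.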
But in the end you need to run tests to choose your features and methods!

AutoML NL - Training model based on ICD10-CM - Amount of text required

We are currently working on integrating ICD10-CM for our medical company, to be used for patient diagnosis. ICD10-CM is a coding system for diagnoses.
I tried to import ICD10-CM data in description-code pairs but obviously, it didn't work since AutoML needed more text for that code(label). I found a dataset on Kaggle but it only contained hrefs to an ICD10 website. I did find out that the website contains multiple texts and descriptions associated with codes that can be used to train our desired model.
Kaggle Dataset:
https://www.kaggle.com/shamssam/icd10datacom
Sample of a page from ICD10data.com:
https://www.icd10data.com/ICD10CM/Codes/A00-B99/A15-A19/A17-/A17.0
Most notable fields are:
- Approximate Synonyms
- Clinical Information
- Diagnosis Index
If I made a dataset from the sentences found on these pages and assigned them to their codes (labels), would it be enough for AutoML dataset training? Each label would then have two or more texts instead of just one, but definitely still far fewer than 100 per code, unlike those in demos/tutorials.
From what I can see here, the disease code has a tree-like structure where, for instance, all L00-L99 codes refer to "Diseases of the skin and subcutaneous tissue". At the same time L00-L08 codes refer to "Infections of the skin and subcutaneous tissue", and so on.
What I mean is that the problem is not 90000 examples for 90000 different independent labels, but a decision tree (you take several decisions, each depending on the previous one: the first step would be choosing which of the roughly 15 most general categories fits best, then choosing which of the subcategories, etc.)
In this sense, AutoML is probably not the best product, given that you cannot implement a specially designed decision tree model that takes all of this into account.
Another way of using AutoML would be to train a separate model for each of the decisions and then combine the different models. This would easily work for the first layer of decisions but would be exponentially time consuming (the number of models to train in order to predict more accurately grows exponentially with the level of accuracy; by accurate I mean affirming it is L00-L08 instead of L00-L99).
I hope this helps you understand better the problem and the different approaches you can give to it!

How to classify text with Knime

I'm trying to classify some data using KNIME with the KNIME Labs deep learning plugin.
I have about 16,000 products in my DB, but I know the category for only about 700 of them.
I'm trying to classify as many as possible using some DM (data mining) technique. I've downloaded some plugins for KNIME, so now I have some deep learning tools as well as some text tools.
Here is my workflow, I'll use it to explain what I'm doing:
I'm transforming the product name into a vector, then applying it to the learner.
After that, I train a DL4J learner with DeepMLP. (I don't really understand it all; it was the one that I thought gave me the best results.) Then I try to apply the model to the same data set.
I thought I would get the result with the predicted classes. But I'm getting a column named output_activations that appears to hold a pair of doubles. When sorting by this column, I get some related data close to each other. But I was expecting to get the classes.
Here is a print of the result table, here you can see the output with the input.
In the column selection it's getting just the converted_document, and I selected des_categoria as the Label Column (learning node config). And in the Predictor node I checked "Append SoftMax Predicted Label?".
The nom_produto is the text column that I'm trying to use to predict the des_categoria column, which is the product category.
I'm a real newbie at DM and DL. If you could give me some help with what I'm trying to do, that would be awesome. Also feel free to suggest some learning material about what I'm attempting to achieve.
PS: I also tried to apply it to the unclassified data (17,000 products), but I got the same result.
I won't answer with a workflow on this one because it is not going to be a simple one. However, be sure to find the text mining example on the KNIME server, i.e. the one that makes use of the bag of words approach.
The task
Product mapping to categories should be a straight-forward data mining task because the information that explains the target variable is available in a quasi-exhaustive manner. Depending on the number of categories to train though, there is a risk that you might need more than 700 instances to learn from.
Some resources
Here are some resources, only the first one being truly specialised in text mining:
(1) Introduction to Information Retrieval, in particular chapter 13;
(2) Data Science for Business is an excellent introduction to data mining, including text mining (chapter 10); also do not forget the chapter about similarity (chapter 6);
(3) Machine Learning with R has the advantage of being accessible enough (chapter 4 provides an example of text classification with R code).
Preprocessing
First, you will have to preprocess your product labels a bit. Use KNIME's text analytics preprocessing nodes for that purpose, that is after you've transformed the product labels with Strings to Document:
Case Convert, Punctuation Erasure and Snowball Stemmer;
you probably won't need Stop Word Filter, however, there may be quasi-stop words such as "product", which you may need to remove manually with Dictionary Filter;
Be careful not to use any of the following without testing their impact first: N Chars Filter (g may be a useful word), Number Filter (numbers may indicate quantities, which may be useful for classification).
Should you encounter any trouble with the relevant nodes (e.g. Punctuation Erasure can be surprisingly tricky because of the tokenizer), you can always apply String Manipulation with regex before converting the Strings to Document.
Keep it short and simple: the lookup table
You could build a lookup table based on the 700 training instances. The book Data mining techniques as well as resource (2) present this approach in some detail. If any model performs any worse than the lookup table, you should abandon the model.
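As a point of reference, here is one simple reading of a lookup-table baseline in Python: map each normalized product name to the category seen most often for it in the training instances. The function names and the toy data are assumptions made for this sketch; the cited books may describe a richer variant.

```python
from collections import Counter, defaultdict


def build_lookup(training_pairs):
    """Map each normalized product name to its most frequent category."""
    counts = defaultdict(Counter)
    for name, category in training_pairs:
        counts[name.strip().lower()][category] += 1
    return {name: c.most_common(1)[0][0] for name, c in counts.items()}


def lookup_predict(lookup, name, default=None):
    """Return the stored category, or `default` for unseen names."""
    return lookup.get(name.strip().lower(), default)


# Toy training data (invented for illustration).
training = [
    ("HP Office Printer", "printers"),
    ("hp office printer ", "printers"),
    ("Leather Bag", "bags"),
]
lookup = build_lookup(training)
print(lookup_predict(lookup, "hp office printer"))  # seen name
print(lookup_predict(lookup, "canon ink jet"))      # unseen name, gives None
```

The share of unseen names that fall back to `default` is itself useful: it shows how much of the data any real model would have to generalize to.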
Nearest neighbors
Neural networks are probably overkill for this task.
Start with a K Nearest Neighbor node (applying a string distance such as Cosine, Levenshtein or Jaro-Winkler). This approach requires the least amount of data wrangling. At the very least, it will provide an excellent baseline model, so it is most definitely worth a shot.
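Outside of KNIME, the same idea can be sketched in a few lines of plain Python: compute an edit distance between the new product name and every labeled name, then take a majority vote among the k closest. This is an illustrative stand-in, not the KNIME node's implementation; the function names, the toy data and k=3 are assumptions.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def knn_predict(name, training_pairs, k=3):
    """Majority vote among the k labeled names closest to `name`."""
    nearest = sorted(
        training_pairs,
        key=lambda pair: levenshtein(name.lower(), pair[0].lower()),
    )[:k]
    votes = Counter(category for _, category in nearest)
    return votes.most_common(1)[0][0]


# Toy labeled data (invented for illustration).
training = [
    ("hp office printer", "printers"),
    ("hp laser printer", "printers"),
    ("canon ink jet printer", "printers"),
    ("leather bag brown", "bags"),
    ("leather bag black", "bags"),
    ("nylon bag", "bags"),
]
print(knn_predict("hp office printr", training))
print(knn_predict("leather bag blue", training))
```

This brute-force version compares against every training instance, which is fine for 700 labeled names but would need indexing or blocking for much larger sets.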
You'll need to tune the parameter k and to experiment with the distance types. The Parameter Optimization Loop pair will help you with optimizing k; you can include a Cross-Validation meta node inside that loop to obtain an estimate of the expected performance given k instead of only one point estimate per value of k. Use Cohen's Kappa as the optimization criterion, as proposed by resource (3) and available via the Scorer node.
After the parameter tuning, you'll have to evaluate the relevance of your model using yet another Cross-Validation meta node, then follow up with a Loop pair including Scorer to calculate the descriptives on performance metric(s) per iteration, finally use Statistics. Kappa is a convenient metric for this task because the target variable consists of many product categories.
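For readers who want to see what the Kappa statistic measures, here is a minimal Python sketch: observed agreement between actual and predicted labels, corrected for the agreement expected by chance from the label frequencies. The function name is an assumption for this example, and returning NaN when chance agreement is total is a convention chosen here, not something prescribed by KNIME.

```python
from collections import Counter


def cohens_kappa(actual, predicted):
    """Cohen's kappa: (observed - expected) / (1 - expected)."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(actual)
    observed = sum(a == p for a, p in zip(actual, predicted)) / n
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    # Chance agreement: product of the marginal label frequencies.
    expected = sum(
        actual_counts[label] * predicted_counts[label] for label in actual_counts
    ) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when only one label ever occurs
    return (observed - expected) / (1.0 - expected)


actual = ["a", "a", "b", "b"]
predicted = ["a", "b", "b", "b"]
print(cohens_kappa(actual, predicted))
```

Because the chance term grows when one category dominates, kappa penalizes a model that simply predicts the most frequent category, which plain accuracy does not.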
Don't forget to test its performance against the lookup table.
What next?
Should lookup table or k-nn work well for you, then there's nothing else to add.
Should any of those approaches fail, you might want to analyse the precise cases on which it fails. In addition, training set size may be too low, so you could manually classify another few hundred or thousand instances.
If after increasing the training set size, you are still dealing with a bad model, you can try the bag of words approach together with a Naive Bayes classifier (see chapter 13 of the Information Retrieval reference). There is no room here to elaborate on the bag of words approach and Naive Bayes but you'll find the resources here above useful for that purpose.
One last note. Personally, I find KNIME's Naive Bayes node to perform poorly, probably because it does not implement Laplace smoothing. However, KNIME's R Learner and R Predictor nodes will allow you to use R's e1071 package, as demonstrated by resource (3).
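To show why smoothing matters for Naive Bayes on short product names, here is a small Python sketch of add-alpha (Laplace) smoothing of per-class word likelihoods. It makes no claim about what KNIME's node actually does; the function name, the vocabulary and the toy documents are assumptions for illustration.

```python
from collections import Counter


def word_likelihoods(documents, vocabulary, alpha=1.0):
    """Estimate P(word | class) with add-alpha (Laplace) smoothing.

    documents  -- token lists that all belong to one class
    vocabulary -- every word known across all classes
    alpha      -- pseudo-count added to every word (0 disables smoothing)
    """
    counts = Counter(w for doc in documents for w in doc if w in vocabulary)
    total = sum(counts.values())
    denominator = total + alpha * len(vocabulary)
    return {w: (counts[w] + alpha) / denominator for w in vocabulary}


vocabulary = {"ink", "printer", "bag"}
printer_docs = [["ink", "printer"], ["printer"]]

unsmoothed = word_likelihoods(printer_docs, vocabulary, alpha=0.0)
smoothed = word_likelihoods(printer_docs, vocabulary, alpha=1.0)
print(unsmoothed["bag"], smoothed["bag"])
```

Without smoothing, a single word never seen in a class drives that class's whole product of likelihoods to zero, no matter how well the other words match.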

What is the difference between inferential analysis and predictive analysis?

Objective
To clarify which traits or attributes make an analysis inferential or predictive.
Background
I am taking a data science course that touches on inferential and predictive analyses. The explanations (as I understood them) are:
Inferential
Induce a hypothesis from a small sample of a population, and see whether it holds true in the larger/entire population.
It seems to me that this is generalisation. I think inferring that smoking causes lung cancer or that CO2 causes global warming are inferential analyses.
Predictive
Induce a statement of what can happen by measuring variables of an object.
I think identifying which traits, behaviours and remarks people react favourably to, and which make a presidential candidate popular enough to become president, is a predictive analysis (this is touched on in the course as well).
Question
I am a bit confused by the two, as it looks to me like there is a grey area or overlap.
Bayesian Inference is "inference" but I think it is used for prediction such as in a spam filter or fraudulent financial transaction identification. For instance, a bank may use previous observations on variables (such as IP address, originator country, beneficiary account type, etc) and predict if a transaction is fraudulent.
I suppose the theory of relativity is an inferential analysis that induced a theory/hypothesis from observations and thought experiments, but it also predicted that the path of light would be bent.
Kindly help me understand which attributes are must-haves for categorising an analysis as inferential or predictive.
"What is the question?" by Jeffery T. Leek, Roger D. Peng has a nice description of the various types of analysis that go into a typical data science workflow. To address your question specifically:
An inferential data analysis quantifies whether an observed pattern
will likely hold beyond the data set in hand. This is the most common
statistical analysis in the formal scientific literature. An example
is a study of whether air pollution correlates with life expectancy at
the state level in the United States (9). In nonrandomized
experiments, it is usually only possible to determine the existence of
a relationship between two measurements, but not the underlying
mechanism or the reason for it.
Going beyond an inferential data analysis, which quantifies the
relationships at population scale, a predictive data analysis uses a
subset of measurements (the features) to predict another measurement
(the outcome) on a single person or unit. Web sites like
FiveThirtyEight.com use polling data to predict how people will vote
in an election. Predictive data analyses only show that you can
predict one measurement from another; they do not necessarily explain
why that choice of prediction works.
There is some gray area between the two but we can still make distinctions.
Inferential statistics is when you are trying to understand what causes a certain outcome. In such analyses there is a specific focus on the independent variables and you want to make sure you have an interpretable model. For instance, your example on a study to examine whether smoking causes lung cancer is inferential. Here you are trying to closely examine the factors that lead to lung cancer, and smoking happens to be one of them.
In predictive analytics you are more interested in using a certain dataset to help you predict future variation in the values of the outcome variable. Here you can make your model as complex as you like, to the point that it is not interpretable, as long as it gets the job done. A simpler example is a real estate investment company interested in determining which combination of variables predicts a prime price for a certain property, so that it can acquire such properties for profit. The potential predictors could be neighborhood income, crime, educational status, distance to a beach, and racial makeup. The primary aim here is to obtain an optimal combination of these variables that provides a better prediction of future house prices.
Here is where it gets murky. Let's say you conduct a study on middle-aged men to determine the risks of heart disease. To do this you measure weight, height, race, income, marital status, cholesterol, education, and a potential serum chemical called "mx34" (just making this up), among others. Let's say you find that the chemical is indeed a good risk factor for heart disease. You have now achieved your inferential objective. However, you are satisfied with your new findings and you start to wonder whether you can use these variables to predict who is likely to get heart disease. You want to do this so that you can recommend preventive steps against future heart disease.
The same academic paper I was reading that spurred this question for me also gave an answer (from Leo Breiman, a UC Berkeley statistician):
• Prediction. To be able to predict what the responses are going to be
to future input variables;
• [Inference]. To [infer] how nature is associating the response
variables to the input variables.
Source: http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf

Predictive features with high presence in one class

I am doing a logistic regression to predict the outcome of a binary variable, say whether a journal paper gets accepted or not. The independent variables, or predictors, are all the phrases used in these papers (unigrams, bigrams, trigrams). One of these phrases has a skewed presence in the 'accepted' class. Including this phrase gives me a classifier with a very high accuracy (more than 90%), while removing this phrase results in accuracy dropping to about 70%.
My more general (naive) machine learning question is:
Is it advisable to remove such skewed features when doing classification?
Is there a method to check skewed presence for every feature and then decide whether to keep it in the model or not?
If I understand correctly, you are asking whether some feature should be removed because it is a good predictor (it makes your classifier work better). So the answer is short and simple: do not remove it. In fact, the whole point is to find exactly such features.
The only reason to remove such a feature would be that this phenomenon only occurs in the training set, and not in real data. But in that case you have the wrong data, which does not represent the underlying data density, and you should gather better data or "clean" the current data so that it has characteristics analogous to the "real" data.
Based on your comments, it sounds like the feature in your documents that's highly predictive of the class is a near-tautology: "paper accepted on" correlates with accepted papers because at least some of the papers in your database were scraped from already-accepted papers and have been annotated by the authors as such.
To me, this sounds like a useless feature for trying to predict whether a paper will be accepted, because (I'd imagine) you're trying to predict paper acceptance before the actual acceptance has been issued! In such a case, none of the papers you'd like to test your algorithm with will be annotated with "paper accepted on." So, I'd remove it.
You also asked about how to determine whether a feature correlates strongly with one class. There are three things that come to mind for this problem.
First, you could just compute a basic frequency count for each feature in your dataset and compare those values across classes. This is probably not super informative, but it's easy.
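A minimal Python sketch of that frequency comparison follows. For each feature it reports the fraction of documents in each class that contain it; the function name, the toy documents and the class labels are assumptions made for this example.

```python
from collections import Counter


def class_document_frequency(documents, labels):
    """For each feature, the fraction of documents per class containing it."""
    totals = Counter(labels)
    seen = {label: Counter() for label in totals}
    for tokens, label in zip(documents, labels):
        seen[label].update(set(tokens))  # count each feature once per document
    features = set().union(*(c.keys() for c in seen.values()))
    return {
        f: {label: seen[label][f] / totals[label] for label in totals}
        for f in features
    }


# Toy data (invented): each document is a list of phrase features.
documents = [
    ["paper accepted on", "novel method"],
    ["novel method"],
    ["paper accepted on", "we propose"],
    ["we propose", "novel method"],
]
labels = ["accepted", "rejected", "accepted", "rejected"]

freq = class_document_frequency(documents, labels)
for feature, by_class in sorted(freq.items()):
    print(feature, by_class)
```

A feature whose fraction is near 1.0 in one class and near 0.0 in the other is a candidate for the kind of skew described in the question.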
Second, since you're using a log-linear model, you can train your model on your training dataset, and then rank each feature in your model by its weight in the logistic regression parameter vector. Features with high positive weight are indicative of one class, while features with large negative weight are strongly indicative of the other.
Finally, just for the sake of completeness, I'll point out that you might also want to look into feature selection. There are many ways of selecting relevant features for a machine learning algorithm, but I think one of the most intuitive from your perspective might be greedy feature elimination. In such an approach, you train a classifier using all N features in your model, and measure the accuracy on some held-out validation set. Then, train N new models, each with N-1 features, such that each model eliminates one of the N features, and measure the resulting drop in accuracy. The feature with the biggest drop was probably strongly predictive of the class, while features that have no measurable difference can probably be omitted from your final model. As larsmans points out correctly in the comments below, this doesn't scale well at all, but it can be a useful method sometimes.
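The single elimination round described above can be sketched as follows in Python. The scoring function is a hypothetical stand-in: in practice it would train a classifier on the given feature subset and return accuracy on a held-out validation set. The toy scores are invented to mirror the 90% versus 70% figures from the question.

```python
def ablation_drops(features, score_fn):
    """Rank features by the accuracy lost when each one is removed."""
    all_features = frozenset(features)
    baseline = score_fn(all_features)
    drops = {f: baseline - score_fn(all_features - {f}) for f in features}
    return sorted(drops.items(), key=lambda item: item[1], reverse=True)


def toy_score(subset):
    """Hypothetical stand-in for 'train on subset, measure validation accuracy'."""
    score = 0.70
    if "paper accepted on" in subset:
        score += 0.20
    if "novel method" in subset:
        score += 0.02
    return score


features = ["we propose", "novel method", "paper accepted on"]
ranking = ablation_drops(features, toy_score)
for feature, drop in ranking:
    print(feature, round(drop, 2))
```

Each round trains N + 1 models, which is the scaling problem noted above, so this is mainly practical for small feature sets or as a spot check on a handful of suspicious features.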
