What is the difference between inferential analysis and predictive analysis? - machine-learning

Objective
To clarify which traits or attributes allow me to say that an analysis is inferential or predictive.
Background
I am taking a data science course which touches on inferential and predictive analyses. The explanations, as I understood them, are:
Inferential
Induce a hypothesis from a small sample of a population, and see whether it holds in the larger/entire population.
It seems to me this is generalisation. I think inferring that smoking causes lung cancer, or that CO2 causes global warming, are inferential analyses.
Predictive
Induce a statement of what can happen by measuring variables of an object.
I think identifying which traits, behaviours, and remarks people react to favourably, and which make a presidential candidate popular enough to become president, is a predictive analysis (this is touched on in the course as well).
Question
I am a bit confused by the two, as it looks to me like there is a grey area or overlap.
Bayesian Inference is "inference" but I think it is used for prediction such as in a spam filter or fraudulent financial transaction identification. For instance, a bank may use previous observations on variables (such as IP address, originator country, beneficiary account type, etc) and predict if a transaction is fraudulent.
I suppose the theory of relativity is an inferential analysis that induced a theory/hypothesis from observations and thought experiments, but it also predicted that the path of light would be bent.
Kindly help me understand the must-have attributes for categorising an analysis as inferential or predictive.

"What is the question?" by Jeffery T. Leek, Roger D. Peng has a nice description of the various types of analysis that go into a typical data science workflow. To address your question specifically:
An inferential data analysis quantifies whether an observed pattern
will likely hold beyond the data set in hand. This is the most common
statistical analysis in the formal scientific literature. An example
is a study of whether air pollution correlates with life expectancy at
the state level in the United States (9). In nonrandomized
experiments, it is usually only possible to determine the existence of
a relationship between two measurements, but not the underlying
mechanism or the reason for it.
Going beyond an inferential data analysis, which quantifies the
relationships at population scale, a predictive data analysis uses a
subset of measurements (the features) to predict another measurement
(the outcome) on a single person or unit. Web sites like
FiveThirtyEight.com use polling data to predict how people will vote
in an election. Predictive data analyses only show that you can
predict one measurement from another; they do not necessarily explain
why that choice of prediction works.

There is some gray area between the two but we can still make distinctions.
Inferential statistics is when you are trying to understand what causes a certain outcome. In such analyses there is a specific focus on the independent variables and you want to make sure you have an interpretable model. For instance, your example on a study to examine whether smoking causes lung cancer is inferential. Here you are trying to closely examine the factors that lead to lung cancer, and smoking happens to be one of them.
In predictive analytics you are more interested in using a certain dataset to help you predict future variation in the values of the outcome variable. Here you can make your model as complex as possible, to the point that it is not interpretable, as long as it gets the job done. A simplified example is a real estate investment company interested in determining which combination of variables predicts the prime price for a property, so it can acquire such properties for profit. The potential predictors could be neighborhood income, crime, educational status, distance to a beach, and racial makeup. The primary aim here is to obtain the combination of these variables that provides the best prediction of future house prices.
Here is where it gets murky. Let's say you conduct a study on middle-aged men to determine the risks of heart disease. To do this you measure weight, height, race, income, marital status, cholesterol, education, and a potential serum chemical called "mx34" (just making this up), among others. Let's say you find that the chemical is indeed a good risk factor for heart disease. You have now achieved your inferential objective. However, you are not content to stop at these new findings, and you start to wonder whether you can use these variables to predict who is likely to get heart disease. You want to do this so that you can recommend preventive steps against future heart disease.
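To make the contrast concrete, here is a minimal sketch (a toy example with scikit-learn and synthetic data, not the study above): the inferential analysis cares about the fitted, interpretable coefficients, while the predictive analysis only cares about held-out performance.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a study like the heart-disease example above.
X, y = make_classification(n_samples=1000, n_features=8, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Inferential flavour: a simple, interpretable model whose coefficients
# (their signs and sizes) are the object of interest.
inferential = LogisticRegression().fit(X_train, y_train)
print("coefficients:", inferential.coef_[0].round(2))

# Predictive flavour: any model, however opaque, judged only on how well
# it predicts unseen cases.
predictive = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("held-out accuracy:", predictive.score(X_test, y_test))
```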

The same academic paper I was reading that spurred this question for me also gave an answer (from Leo Breiman, a UC Berkeley statistician):
• Prediction. To be able to predict what the responses are going to be
to future input variables;
• [Inference]. To [infer] how nature is associating the response
variables to the input variables.
Source: http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf

Related

Computing a similarity score for a set of sentences

My team does a lot of chatbot training, and I'm trying to come up with some tools to improve the quality of our work. In chatbot training, it is really important to train intents with diverse utterances that phrase the same intent in very different ways. Ideally, there would be very little similarity in the syntax of the utterances in the set.
Here's an example for an intent inquiring about medical insurance coverage
Bad set of utterances
Is my daughter covered by insurance?
Is my son covered by medical insurance?
Will my son be covered by insurance?
Decent set of utterances
How can I look up whether we have insurance coverage for the whole family?
Seeking details on eligibility for medical coverage
Is there a document that details who is protected under our medical insurance policy?
I want to be able to take all of the utterances associated with an intent and analyze them for similarity. I would expect my set of bad utterances to have a high similarity score and my set of decent utterances to have a low similarity score.
I've tried playing around with a few doc2vec tutorials, but I feel like I'm missing something. I keep seeing stuff like this:
Train a set of data and then measure the similarity of a new sentence to your set of data
Measure the similarity between two sentences
I need to have an array of sentences and understand how similar they are to each other.
Any advice on achieving this?
Answering some questions:
What makes the bad utterances bad? The utterances themselves are not bad; it is the lack of variety between them. If most of the training had been like the "bad" set, then real user utterances of greater variety would not be recognized correctly.
Are you trying to discover new intents? No, this is for prerelease training, trying to improve the effectiveness of it.
Why do bad utterances have high similarity scores and decent utterances have low similarity scores? This is a hypothesis. I know how varied real user utterances are, and I have found my trainers fall into ruts when training, asking things the same way, and not seeing good accuracy results. Improving the variety in the utterances tends to result in better accuracy.
What will I do with this info? I’ll use it to assess the training quality of an intent, to determine if more training is likely necessary. In the future we might build real time tools as utterances are being added to let trainers know if they’re being too repetitive.
Most applications of text vectors benefit from the vectors capturing the "essential meaning" of a text, *without* regard to variances in word choice.
That is, it's considered a feature, not a flaw, if two completely different wordings with similar meaning have nearly the same vector. (Or, if some similarity-measure indicates they are totally similar.)
For example, to contrive an example similar to yours, consider the two phrasings:
"health coverage for brother"
"male sibling medical insurance"
There's no reuse of words, but the likely intended meaning is the same – so a good text-vectorization for typical purposes would create very similar vectors. And a similarity-measure using those vectors, or otherwise using the words/word-vectors as input, would indicate very high similarity.
But from your clarifying answers, it seems you actually want a more superficial "similarity" measure. You'd like a measure that reveals when certain phrasings show variety/contrast in their wording. (And specifically, you already know from other factors, like how they were hand-crafted, that groups of these phrasings are semantically related.)
What you want this similarity measure to show is actually a behavior that many projects using text-vectors would consider a failure of the vectors. So semantic methods like those in Word2Vec, Paragraph Vectors (aka "Doc2Vec"), etc are likely the wrong tool for your goal.
You could probably do well with a simpler measure based just on the words, or perhaps character-n-grams, of the texts.
For example, for two texts A and B, you could just tally the number of shared words (that appear in both A and B), and divide by the total number of unique words in both A and B, to get a 0.0 to 1.0 "word choice similarity" number.
And, when considering a new text against a set of prior texts, if its average similarity to the prior texts is low, it'd be "good" for your purposes.
Rather than just words, you could also use all n-character substrings ("n-grams") of your texts – which might help better highlight differences in word-forms, or common typos, which may also be useful variances for your purposes.
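For instance, a minimal sketch of that word-overlap measure (Jaccard similarity over word sets) in plain Python, using the example utterances from the question:

```python
import re

def word_overlap(a: str, b: str) -> float:
    """Shared words divided by total unique words across both texts (0.0-1.0)."""
    wa = set(re.findall(r"[a-z']+", a.lower()))
    wb = set(re.findall(r"[a-z']+", b.lower()))
    return len(wa & wb) / len(wa | wb) if (wa | wb) else 0.0

utterances = [
    "Is my daughter covered by insurance?",
    "Is my son covered by medical insurance?",
    "Will my son be covered by insurance?",
]
# Average pairwise similarity of the set: a high value suggests repetitive wording.
pairs = [(i, j) for i in range(len(utterances)) for j in range(i + 1, len(utterances))]
score = sum(word_overlap(utterances[i], utterances[j]) for i, j in pairs) / len(pairs)
print(round(score, 2))
```

A set with the variety of the "decent" utterances should come out with a much lower average score.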
In general, I'd look at the scikit-learn text-vectorization functionality for ideas:
https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction
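As a hedged variant built on that scikit-learn functionality, character n-grams plus cosine similarity give a similar surface-form measure:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

texts = [
    "Is my daughter covered by insurance?",
    "Is my son covered by medical insurance?",
]
# Word-boundary-aware character 3-5-grams instead of whole words.
vectorizer = CountVectorizer(analyzer="char_wb", ngram_range=(3, 5), binary=True)
X = vectorizer.fit_transform(texts)
print(cosine_similarity(X)[0, 1])  # surface-form similarity of the two texts
```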

Find the best set of features to separate 2 known group of data

I need some perspective on whether what I am doing is right or wrong, or whether there is a better way to do it.
I have 10,000 elements. For each of them I have about 500 features.
I am looking to measure the separability between 2 sets of those elements. (I already know the 2 groups; I am not trying to find them.)
For now I am using an SVM. I train the SVM on 2,000 of those elements, then I look at how good the score is when I test on the other 8,000 elements.
Now I would like to know which features maximize this separation.
My first approach was to test each combination of features with the SVM and track the score it gives. If the score is good, those features are relevant for separating the 2 sets of data.
But this takes far too much time: with 500 features there are on the order of 2^500 possible subsets.
The second approach was to remove one feature and see how much the score is affected. If the score changes a lot, that feature is relevant. This is faster, but I am not sure it is right: when there are 500 features, removing just one does not change the final score much.
Is this a correct way to do it?
Have you tried any other method? Maybe you can try a decision tree or random forest; it would give you your best features based on entropy gain. Can I assume all the features are independent of each other? If not, please remove the correlated ones as well.
Also, for support vector machines, you can check out this paper:
http://axon.cs.byu.edu/Dan/778/papers/Feature%20Selection/guyon2.pdf
But it's based more on linear SVM.
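If you try the tree/forest route, a minimal sketch with scikit-learn (synthetic data standing in for your 10,000 x 500 matrix) could look like this:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the real two-group, 500-feature data.
X, y = make_classification(n_samples=2000, n_features=500,
                           n_informative=10, random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top = np.argsort(forest.feature_importances_)[::-1][:20]  # most important first
print("Features that separate the groups best:", top)
```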
You can do statistical analysis on the features to get indications of which terms best separate the data. I like Information Gain, but there are others.
I found this paper (Fabrizio Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys, Vol. 34, No.1, pp.1-47, 2002) to be a good theoretical treatment of text classification, including feature reduction by a variety of methods from the simple (Term Frequency) to the complex (Information-Theoretic).
These functions try to capture the intuition that the best terms for c_i are the ones distributed most differently in the sets of positive and negative examples of c_i. However, interpretations of this principle vary across different functions. For instance, in the experimental sciences χ² is used to measure how the results of an observation differ (i.e., are independent) from the results expected according to an initial hypothesis (lower values indicate lower dependence). In DR we measure how independent t_k and c_i are. The terms t_k with the lowest value for χ²(t_k, c_i) are thus the most independent from c_i; since we are interested in the terms which are not, we select the terms for which χ²(t_k, c_i) is highest.
These techniques help you choose terms that are most useful in separating the training documents into the given classes; the terms with the highest predictive value for your problem. The features with the highest Information Gain are likely to best separate your data.
I've been successful using Information Gain for feature reduction and found this paper (Largeron, Christine; Moulin, Christophe; Géry, Mathias, "Entropy based feature selection for text categorization", SAC 2011, pp. 924-928) to be a very good practical guide.
Here the authors present a simple formulation of entropy-based feature selection that's useful for implementation in code:
Given a term t_j and a category c_k, ECCD(t_j, c_k) can be computed from a contingency table. Let A be the number of documents in the category containing t_j; B, the number of documents in the other categories containing t_j; C, the number of documents of c_k which do not contain t_j; and D, the number of documents in the other categories which do not contain t_j (with N = A + B + C + D).
Using this contingency table, Information Gain can be estimated as sketched below.
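Assuming the usual mutual-information form of Information Gain over this contingency table (the paper's exact ECCD formulation may differ in detail), a minimal Python sketch is:

```python
from math import log2

def information_gain(A, B, C, D):
    """Estimate IG(t_j, c_k) from the contingency counts defined above,
    i.e. the mutual information between term presence and category
    membership, with N = A + B + C + D."""
    N = A + B + C + D
    def term(n_tc, n_t, n_c):
        # contribution of one (term present/absent, in/out of category) cell
        return 0.0 if n_tc == 0 else (n_tc / N) * log2(N * n_tc / (n_t * n_c))
    return (term(A, A + B, A + C) +   # term present, in category
            term(B, A + B, B + D) +   # term present, other categories
            term(C, C + D, A + C) +   # term absent, in category
            term(D, C + D, B + D))    # term absent, other categories

print(information_gain(A=80, B=20, C=10, D=90))
```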
This approach is easy to implement and provides very good Information-Theoretic feature reduction.
You needn't use a single technique either; you can combine them. Term-Frequency is simple, but can also be effective. I've combined the Information Gain approach with Term Frequency to do feature selection successfully. You should experiment with your data to see which technique or techniques work most effectively.
If you want a single feature to discriminate your data, use a decision tree, and look at the root node.
SVM by design looks at combinations of all features.
Have you thought about Linear Discriminant Analysis (LDA)?
LDA aims at discovering a linear combination of features that maximizes separability. The algorithm works by projecting your data into a space where the within-class variance is minimized and the between-class variance is maximized.
You can use it to reduce the number of dimensions required to classify, and also use it as a linear classifier.
However, with this technique you would lose the original features and their meaning, which you may want to avoid.
If you want more details I found this article to be a good introduction.
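As a hedged sketch of that idea with scikit-learn (synthetic data standing in for the 10,000 x 500 matrix, training on 2,000 elements as in the question):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=500,
                           n_informative=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=2000,
                                                    random_state=0)

lda = LinearDiscriminantAnalysis()
z = lda.fit_transform(X_train, y_train)   # 1-D projection maximising separability
print("held-out accuracy:", lda.score(X_test, y_test))
# lda.coef_ holds the weight of each original feature in the discriminant axis.
```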

Binary classification of dated documents with seasonal class variation

I have a collection of training documents with publication dates, where each document is labeled as belonging (or not) to some topic T. I want to train a model that will predict for a new document (with publication date) whether or not it belongs to T, where the publication date might be in the past or in the future. Assume that I have decomposed each training document's text into a set of features (e.g., TF-IDF of words or n-grams) suitable for analysis by an appropriate binary classification algorithm provided by a library like Weka (for instance, multinomial naive Bayes, random forests, or SVM). The concept to be learned exhibits multiple seasonality; i.e., the prior probability that an arbitrary document published on a given date belongs to T depends heavily on when the date falls in a 4-year cycle (due to elections), where it falls in an annual cycle (due to holidays), and on the day of the week.
My research indicates that classification algorithms generally assume (as part of their statistical models) that training data is randomly sampled from the same pool of data that the model will ultimately be applied to. When the distribution of classes in the training data differs substantially from the known distribution in the wild, this leads to the so-called "class imbalance" problem. There are ways of compensating for this, including over-sampling underrepresented classes, under-sampling overrepresented classes, and using cost-sensitive classification. This allows a model creator to implicitly specify the prior probability that a new document will be positively classified, but importantly (and unfortunately for my purposes), this prior probability is assumed to be equal for all new documents.
I require more flexibility in my model. Because of the concept's seasonality, when classifying a new document, the model must explicitly take the publication date into account when determining the prior probability that the document belongs to T, and when the model calculates the posterior probability of belonging to T in light of the document's features, this prior probability should be properly accounted for. I am looking for a classifier implementation that either (1) bakes sophisticated regression of prior probabilities based on dates into the classifier, or (2) can be extended with a user-specified regression function that takes a date as input and gives the prior probability as output.
I am most familiar with the Weka library, but am open to using other tools if they are appropriate to the job. What is the most straightforward way of accomplishing this task?
Edit (in response to Doxav's point #2):
My concern is that date-based attributes should not be used for learning rules about when the topic applies, rather, they should be used only for determining the prior probability of whether the topic applies. Here's a concrete example: suppose that the topic T is "Christmas". A story published in July is indeed much less likely to be about Christmas than a story published in December. But what makes a story about Christmas is the textual content of the story, not when it was published. The relationship between publication date and "being about Christmas" is mere correlation, and therefore only useful for calculating the prior probability of an arbitrary story on an arbitrary date being about Christmas. By comparison, the relationship between TF-IDF (for some term in the story text) and "being about Christmas" is inherent and causative, and therefore worthy of incorporation into our model of what it means for a story to be about Christmas.
It seems like it can be simplified into typical ML problems: text classification + imbalanced data + seasonality identification + architecture + typical batch/offline vs stream/online learning:
Text classification: https://www.youtube.com/watch?v=IY29uC4uem8 is a good tutorial on text classification with Weka and covers the imbalanced-data issue.
Seasonality identification: the goal is to enable the model to learn rules/inferences over different time attributes, so we should ease its job by extracting the most useful known attributes. That means extracting typical date cycles (i.e. weekday, day of month, month, year...) and, if possible, also merging them with other more specific cycles or events (i.e. elections, holidays, any custom cycle or frequent event); see the sketch after this list. If you expect the model to learn on time series/sequences, you should create some lag data (attributes from earlier points in time, or statistics over a recent time interval). It can be good to remove the date itself, or any data which would bias the model construction.
I don't know if you plan to deliver this as a service, but this can be of good inspiration: http://fr.slideshare.net/TraianRebedea/autonomous-news-clustering-and-classification-for-an-intelligent-web-portal .
Typical batch/offline vs stream/online learning: apparently you already know Weka, which focuses on batch/offline learning. I don't know the size of your data, but if you plan to continuously process new data and rebuild models, you could consider moving to stream processing and online learning. For that, you could move to MOA, which is very close to Weka but dedicated to stream classification, or use the streaming features of the latest version of Weka (stream processing and new online learners).
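As a sketch of the cycle-extraction idea from the seasonality point above (pandas, with made-up publication dates):

```python
import pandas as pd

# Made-up publication dates standing in for the real corpus metadata.
df = pd.DataFrame({"published": pd.to_datetime(
    ["2014-12-24", "2015-07-04", "2016-11-08"])})

df["day_of_week"] = df["published"].dt.dayofweek         # weekly cycle
df["month"] = df["published"].dt.month                   # annual cycle
df["day_of_year"] = df["published"].dt.dayofyear         # finer annual cycle
df["election_cycle_year"] = df["published"].dt.year % 4  # 4-year (election) cycle
print(df)
```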
UPDATE 1: I read your comment and I see different solutions:
Answer #2 is still one possible solution for your need, even if it is not optimal. Having an attribute indicating that it is the Christmas period will raise the probability of tagging a story as the Christmas topic, and the same goes for the TF-IDF of the word "Christmas", BUT only both attributes together will push the classification probability for Christmas very high.
You can use an attribute providing a seasonal weight for each word: TF-IDF with a time weight, or current Google Trends data for each word.
If you want a state-of-the-art adaptive prior conditioned on context, you could look into hierarchical Bayesian models and smoothing from NLP; a lightweight sketch of the prior-reweighting idea is shown below. It won't be Weka then, and it won't be as fast to test.
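As a lightweight illustration of that prior-reweighting idea (a sketch, not something Weka provides out of the box): under the Naive Bayes generative assumptions you can divide the training prior out of the posterior and multiply in a date-based prior. Here prior_for_month is a hypothetical function you would estimate elsewhere, e.g. from historical label frequencies per month.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(0)
X_train = rng.integers(0, 5, size=(200, 50))   # toy term counts
y_train = rng.integers(0, 2, size=200)         # toy labels (1 = topic T)

def prior_for_month(month):
    # Hypothetical seasonal prior P(T | month), e.g. "Christmas" peaks in December.
    return 0.30 if month == 12 else 0.02

clf = MultinomialNB().fit(X_train, y_train)
train_prior = np.exp(clf.class_log_prior_)     # priors implied by the training data

def predict_with_date_prior(x, month):
    posterior = clf.predict_proba(x.reshape(1, -1))[0]
    p = prior_for_month(month)
    date_prior = np.array([1.0 - p, p])
    adjusted = posterior * date_prior / train_prior   # swap priors, keep likelihood
    return adjusted / adjusted.sum()

print(predict_with_date_prior(X_train[0], month=12))
```

This keeps the text likelihood learned from the whole corpus but applies the seasonal prior at prediction time.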

Predictive features with high presence in one class

I am doing a logistic regression to predict the outcome of a binary variable, say whether a journal paper gets accepted or not. The predictors (independent variables) are all the phrases used in these papers (unigrams, bigrams, trigrams). One of these phrases has a skewed presence in the 'accepted' class. Including this phrase gives me a classifier with very high accuracy (more than 90%), while removing it results in accuracy dropping to about 70%.
My more general (naive) machine learning question is:
Is it advisable to remove such skewed features when doing classification?
Is there a method to check skewed presence for every feature and then decide whether to keep it in the model or not?
If I understand correctly, you ask whether some feature should be removed because it is a good predictor (it makes your classifier work better). The answer is short and simple: do not remove it. In fact, the whole point is to find exactly such features.
The only reason to remove such a feature would be that this phenomenon only occurs in the training set and not in real data. But in that case you have the wrong data - data which does not represent the underlying data density - and you should gather better data or "clean" the current set so it has characteristics analogous to the "real" ones.
Based on your comments, it sounds like the feature in your documents that's highly predictive of the class is a near-tautology: "paper accepted on" correlates with accepted papers because at least some of the papers in your database were scraped from already-accepted papers and have been annotated by the authors as such.
To me, this sounds like a useless feature for trying to predict whether a paper will be accepted, because (I'd imagine) you're trying to predict paper acceptance before the actual acceptance has been issued! In such a case, none of the papers you'd like to test your algorithm with will be annotated with "paper accepted on." So, I'd remove it.
You also asked about how to determine whether a feature correlates strongly with one class. There are three things that come to mind for this problem.
First, you could just compute a basic frequency count for each feature in your dataset and compare those values across classes. This is probably not super informative, but it's easy.
Second, since you're using a log-linear model, you can train your model on your training dataset, and then rank each feature in your model by its weight in the logistic regression parameter vector. Features with high positive weight are indicative of one class, while features with large negative weight are strongly indicative of the other.
Finally, just for the sake of completeness, I'll point out that you might also want to look into feature selection. There are many ways of selecting relevant features for a machine learning algorithm, but I think one of the most intuitive from your perspective might be greedy feature elimination. In such an approach, you train a classifier using all N features in your model, and measure the accuracy on some held-out validation set. Then, train N new models, each with N-1 features, such that each model eliminates one of the N features, and measure the resulting drop in accuracy. The feature with the biggest drop was probably strongly predictive of the class, while features that have no measurable difference can probably be omitted from your final model. As larsmans points out correctly in the comments below, this doesn't scale well at all, but it can be a useful method sometimes.
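A minimal sketch of the second approach (ranking features by their logistic-regression weights), with a toy corpus standing in for the real papers:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy documents and labels (1 = accepted), standing in for the real data.
docs = ["novel method strong results", "paper accepted on revision",
        "weak baseline limited evaluation", "paper accepted on first try"]
labels = [1, 1, 0, 1]

vectorizer = CountVectorizer(ngram_range=(1, 3))
X = vectorizer.fit_transform(docs)
clf = LogisticRegression().fit(X, labels)

terms = np.array(vectorizer.get_feature_names_out())
order = np.argsort(clf.coef_[0])                 # ascending by weight
print("strongest 'rejected' indicators:", terms[order[:5]])
print("strongest 'accepted' indicators:", terms[order[-5:]])
```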

Reinforcement learning of a policy for multiple actors in large state spaces

I have a real-time domain where I need to assign an action to N actors involving moving one of O objects to one of L locations. At each time step, I'm given a reward R, indicating the overall success of all actors.
I have 10 actors, 50 unique objects, and 1000 locations, so for each actor I have to select from 500000 possible actions. Additionally, there are 50 environmental factors I may take into account, such as how close each object is to a wall, or how close it is to an actor. This results in 25000000 potential actions per actor.
Nearly all reinforcement learning algorithms seem unsuitable for this domain.
First, they nearly all involve evaluating the expected utility of each action in a given state. My state space is huge, so it would take forever to converge a policy using something as primitive as Q-learning, even if I used function approximation. Even if I could, it would take too long to find the best action out of a million actions in each time step.
Secondly, most algorithms assume a single reward per actor, whereas the reward I'm given might be polluted by the mistakes of one or more actors.
How should I approach this problem? I've found no code for domains like this, and the few academic papers I've found on multi-actor reinforcement learning algorithms don't provide nearly enough detail to reproduce the proposed algorithm.
Clarifying the problem
N=10 actors
O=50 objects
L=1K locations
S=50 features
As I understand it, you have a warehouse with N actors, O objects, L locations, and some walls. The goal is to make sure that each of the O objects ends up in any one of the L locations in the least amount of time. The action space consists of decisions on which actor should be moving which object to which location at any point in time. The state space consists of some 50 X-dimensional environmental factors that include features such as proximity of actors and objects to walls and to each other. So, at first glance, you have X^S·(O·L)^N action values, with most action dimensions discrete.
The problem as stated is not a good candidate for reinforcement learning. However, it is unclear what the environmental factors really are and how many of the restrictions are self-imposed. So, let's look at a related, but different problem.
Solving a different problem
We look at a single actor. Say it knows its own position in the warehouse, the positions of the other 9 actors, the positions of the 50 objects, and the 1000 locations. It wants to achieve maximum reward, which happens when each of the 50 objects is at one of the 1000 locations.
Suppose we have a P-dimensional representation of position in the warehouse. Each position could be occupied by the actor in focus, one of the other actors, an object, or a location. The action is to choose an object and a location. Therefore, we have a 4P-dimensional state space and a P^2-dimensional action space. In other words, we have a (4P + P^2)-dimensional value function. By further experimenting with the representation, using different-precision encodings for different parameters, and using options [2], it might be possible to bring the problem into the practical realm.
For examples of learning in complicated spatial settings, I would recommend reading the Konidaris papers 1 and 2.
1 Konidaris, G., Osentoski, S. & Thomas, P., 2008. Value function approximation in reinforcement learning using the Fourier basis. Computer Science Department Faculty Publication Series, p.101.
2 Konidaris, G. & Barto, A., 2009. Skill Discovery in Continuous Reinforcement Learning Domains using Skill Chaining. In: Y. Bengio et al., eds. Advances in Neural Information Processing Systems, 18, pp. 1015-1023.
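As a hedged sketch of the Fourier-basis idea from [1] (a minimal version for a state scaled to [0, 1]^d, on top of which a linear value function can be learned):

```python
import numpy as np

def fourier_features(state, order):
    """Fourier basis features phi_c(s) = cos(pi * c . s) for all coefficient
    vectors c in {0, ..., order}^d, following the construction in [1]."""
    d = len(state)
    coeffs = np.array(np.meshgrid(*[np.arange(order + 1)] * d)).T.reshape(-1, d)
    return np.cos(np.pi * coeffs @ state)

# Linear value-function approximation: V(s) ~= w . phi(s).
state = np.array([0.2, 0.7])           # e.g. a 2-D position scaled to [0, 1]
phi = fourier_features(state, order=3)
w = np.zeros_like(phi)                 # weights to be learned, e.g. by TD updates
print(len(phi), float(w @ phi))
```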
