Regression trees with standard deviation reduction - machine-learning

I have a data set of 1k records and my job is to do a decision algorithm based on those records.
Here is what I can share:
The target is a continuous value.
Some of the predictors (or attributes) are continuous values,
some of them are discrete and some are arrays of discrete values
(there can be more than one option)
My initial thoughts were to separate the arrays of discrete values and make them individual features (predictors). For the continuous values in the predictors I was thinking about just randomly picking a few decision boundaries and see which one reduces the entropy the most. Then make a decision tree (or a random forest) which use standard deviation reduction when creating the tree.
My question is: Am I on the right path? Is there a better way to do that?

I know this comes probably a bit late but what you are searching for are Model Trees. Model trees are decision trees with continuous rater than categorical values in the leafs. In general these values are predicted by linear regression models. One of the more prominent model trees and one that more or less suits your needs is the M5 model tree introduced by Quinlan. Wang and Witten re-implemented M5 and extended its functionality so that it can handle both, continuous and categorical attributes. Their version is called M5', you can find an implementation e.g. in Weka. The only thing left would be to handle the arrays. However, your description is a bit generic in that respect. From what I gather your choices are either flattening or, as you suggested, seperating them.
Note that, since Wang and Witten's work, more sophisticated model trees have been introduced. However, M5' is robust and does not need any parameterization in its original formulation, which makes it easy to use.

Related

Which predictive models in sklearn are affected by the order of the columns in the training dataframe?

I'm wondering if any of the estimators that Sci-kit Learn provides is affected by the order of the columns in the dataframe by which it is being trained. I tried establishing a baseline by using ExtraTreesRegressor and it came out to 3 different scores:
.531687 for the regular order
.535309 for the reverse order
.554458 for the regular order
Obviously ExtraTreesRegressor is not a good example here, so I tried LinearRegression but it gave .295898 no matter what the order of the columns were.
What I want to know is if there are ANY estimators that are affected by the order of the columns and if there are not then can you point me in the direction of some way, or provide some code, that I can use to make sure that the order of the columns does matter?
Any algorithm that involves some randomness in selecting features while building the model is expected to be affected from their order; AFAIK, the only cases present in scikit-learn are the Extra Trees and the Random Forest (in both their incarnations as classifiers or regressors), which indeed share some similarities.
The smoking gun for such a behavior is the argument max_features; from the RF docs (the description is identical in the Extra Trees as well):
max_features : {“auto”, “sqrt”, “log2”} int or float, default=”auto”
The number of features to consider when looking for the best split
I am not aware of other algorithms that involve such kind of random feature selection (linear models, decision trees, SVMs, naive Bayes, neural nets, and gradient boosted trees do not), but if you glimpse something similar enough in the documentation, you can bet that the respective algorithm is also affected by the order of the features.
Keep in mind that such slight discrepancies that should not happen in theory are rather to be expected in models where randomness enters from way too many angles. For a similar case with RF in R (slightly different results when asking for importance=TRUE), check my answer in Why does the importance parameter influence performance of Random Forest in R?

How to derive the top contributing factors in a binary classification problem

I have a binary classification problem with about 30 features and an ultimate pass/fail label. I first trained a classifier to be able to predict if new instances will pass or fail but now I want to get a deeper understanding.
How can I derive some analysis about why these items pass or fail based on their features? I would ideally like to be able to show the top contributing factors with a weight associated with each one. Complicating this is that my features are not necessarily statistically independent of each other. What sorts of methods should I look into, what keywords will point me in the right direction?
Some initial thoughts: Use a decision tree classifier (ID3 or CART) and look at the top of the tree for top factors. I am not sure how robust this approach would be and it isn't immediately clear to me how one can assign the importance of each factor (one would just get an ordered list).
If I understand your objectives correctly, you might want to consider a Random Forest model. Random forests have the advantage of naturally providing an importance to the features by virtue of how the algorithm works.
In Python's scikit-learn, check out sklearn.ensemble.RandomForestClassifier(). feature_importances_ would return the "weights" I believe you're looking for. Check out the example in the documentation.
Alternatively, you can use R's randomForest package. After constructing the model, you can use importance() to extract the feature importance values.

How to get the final equation that the Random Forest algorithm uses on your independent variables to predict your dependent variable?

I am working on optimizing a manufacturing based dataset which consists of a huge number of controllable parameters. The goal is to attain the best run settings of these parameters.
I familiarized myself with several predictive algorithms while doing my research and if I say, use Random Forest to predict my dependent variable to understand how important each independent variable is, is there a way to extract the final equation/relationship the algorithm uses?
I'm not sure if my question was clear enough, please let me know if there's anything else I can add here.
There is no general way to get an interpretable equation from a random forest, explaining how your covariates affect the dependent variable. For that you can use a different model more suitable, e.g., linear regression (perhaps with kernel functions), or a decision tree. Note that you can use one model for prediction, and one model for descriptive analysis - there's no inherent reason to stick with a single model.
use Random Forest to predict my dependent variable to understand how important each independent variable is
Understanding how important each dependent variable, does not necessarily mean you need the question in the title of your question, namely getting the actual relationship. Most random forest packages have a method quantifying how much each covariate affected the model over the train set.
There is a number of methods to estimate feature importance based on trained model. For Random Forest, most famous methods are MDI (Mean Decrease of Impurity) and MDA (Mean Decrease of Accuracy). Many popular ML libraries support feature importance estimation out of the box for Random Forest.

Is doc vector learned through PV-DBOW equivalent to the average/sum of the word vectors contained in the doc?

I've seen some posts say that the average of the word vectors perform better in some tasks than the doc vectors learned through PV_DBOW. What is the relationship between the document's vector and the average/sum of its words' vectors? Can we say that vector d
is approximately equal to the average or sum of its word vectors?
Thanks!
No. The PV-DBOW vector is calculated by a different process, based on how well the PV-DBOW-vector can be incrementally nudged to predict each word in the text in turn, via a concurrently-trained shallow neural network.
But, a simple average-of-word-vectors often works fairly well as a summary vector for a text.
So, let's assume both the PV-DBOW vector and the simple-average-vector are the same dimensionality. Since they're bootstrapped from exactly the same inputs (the same list of words), and the neural-network isn't significantly more sophisticated (in its internal state) than a good set of word-vectors, the performance of the vectors on downstream evaluations may not be very different.
For example, if the training data for the PV-DBOW model is meager, or meta-parameters not well optimized, but the word-vectors used for the average-vector are very well-fit to your domain, maybe the simple-average-vector would work better for some downstream task. On the other hand, a PV-DBOW model trained on sufficient domain text could provide vectors that outperform a simple-average based on word-vectors from another domain.
Note that FastText's classification mode (and similar modes in Facebook's StarSpace) actually optimizes word-vectors to work as parts of a simple-average-vector used to predict known text-classes. So if your end-goal is to have a text-vector for classification, and you have a good training dataset with known-labels, those techniques are worth considering as well.

Predictive features with high presence in one class

I am doing a logistic regression to predict the outcome of a binary variable, say whether a journal paper gets accepted or not. The dependent variable or predictors are all the phrases used in these papers - (unigrams, bigrams, trigrams). One of these phrases has a skewed presence in the 'accepted' class. Including this phrase gives me a classifier with a very high accuracy (more than 90%), while removing this phrase results in accuracy dropping to about 70%.
My more general (naive) machine learning question is:
Is it advisable to remove such skewed features when doing classification?
Is there a method to check skewed presence for every feature and then decide whether to keep it in the model or not?
If I understand correctly you ask whether some feature should be removed because it is a good predictor (it makes your classifier works better). So the answer is short and simple - do not remove it in fact, the whole concept is to find exactly such features.
The only reason to remove such feature would be that this phenomena only occurs in the training set, and not in real data. But in such case you have wrong data - which does not represnt the underlying data density and you should gather better data or "clean" the current one so it has analogous characteristics as the "real ones".
Based on your comments, it sounds like the feature in your documents that's highly predictive of the class is a near-tautology: "paper accepted on" correlates with accepted papers because at least some of the papers in your database were scraped from already-accepted papers and have been annotated by the authors as such.
To me, this sounds like a useless feature for trying to predict whether a paper will be accepted, because (I'd imagine) you're trying to predict paper acceptance before the actual acceptance has been issued ! In such a case, none of the papers you'd like to test your algorithm with will be annotated with "paper accepted on." So, I'd remove it.
You also asked about how to determine whether a feature correlates strongly with one class. There are three things that come to mind for this problem.
First, you could just compute a basic frequency count for each feature in your dataset and compare those values across classes. This is probably not super informative, but it's easy.
Second, since you're using a log-linear model, you can train your model on your training dataset, and then rank each feature in your model by its weight in the logistic regression parameter vector. Features with high positive weight are indicative of one class, while features with large negative weight are strongly indicative of the other.
Finally, just for the sake of completeness, I'll point out that you might also want to look into feature selection. There are many ways of selecting relevant features for a machine learning algorithm, but I think one of the most intuitive from your perspective might be greedy feature elimination. In such an approach, you train a classifier using all N features in your model, and measure the accuracy on some held-out validation set. Then, train N new models, each with N-1 features, such that each model eliminates one of the N features, and measure the resulting drop in accuracy. The feature with the biggest drop was probably strongly predictive of the class, while features that have no measurable difference can probably be omitted from your final model. As larsmans points out correctly in the comments below, this doesn't scale well at all, but it can be a useful method sometimes.

Resources