Evaluating accuracy of decision tree/forest model - machine-learning

I'm relatively new to ML. I've created a decision tree model to predict prices of an item based on some criteria.
For example, let's say the model predicts the price of a car based on a few features such as engine size, number of doors, fuel type, mileage and age.
Analysis of the data showed me that my data was not linear, so a decision tree was a better fit. The model also does an OK job at predicting, but before I can give it to any users, I need to quantify its accuracy.
As it's non-linear, R squared doesn't seem like a good method of assessing accuracy, but I'm unsure what I should use.
I'd appreciate any advice on this.

In these cases, what you can usually do is assess the performance of the model against a test or hold-out set (not used during the construction of the model), using an evaluation metric.
For regression problems (like the one you are describing) there are several evaluation metrics available. The most common ones are MAE (Mean Absolute Error) and RMSE (Root Mean Squared Error).
To fully understand how good the performance of your model is, you can then compare it against other models, or against simple baselines (like always predicting the average price, or returning the price of the most similar car in the training set).
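As a rough illustration of the comparison described above, here is a minimal sketch that computes MAE and RMSE for a model and for the "always predict the average price" baseline. The prices and the `model_preds` list are made-up placeholders; in practice `model_preds` would come from your tree's predictions on the hold-out set.

```python
import math

def mae(y_true, y_pred):
    """Mean Absolute Error: average size of the errors, in price units."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root Mean Squared Error: like MAE, but penalises large misses more."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical prices, purely for illustration.
train_prices = [10000, 12000, 14000, 16000]
test_prices = [11000, 15000, 13000]
model_preds = [11500, 14000, 13500]  # stand-in for the model's hold-out predictions

# Baseline: always predict the mean price seen in the training set.
baseline = sum(train_prices) / len(train_prices)
baseline_preds = [baseline] * len(test_prices)

model_mae = mae(test_prices, model_preds)
model_rmse = rmse(test_prices, model_preds)
baseline_mae = mae(test_prices, baseline_preds)
baseline_rmse = rmse(test_prices, baseline_preds)

print(f"model    MAE={model_mae:.1f}  RMSE={model_rmse:.1f}")
print(f"baseline MAE={baseline_mae:.1f}  RMSE={baseline_rmse:.1f}")
```

If the model's errors are not clearly smaller than the baseline's, the model is probably not adding much.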

Related

Feature selection - how to go about it when you have way too many features?

Let's assume you have 1,400 columns/data points for 200k entries and your goal is to determine which of these columns show the most signal towards a simple classification task.
I've already removed columns above a threshold of null values, columns with low variance, and categorical columns with bad or too many levels, and I still have 900+ columns.
I can use lasso if I only include the 500+ numerical columns, but if I try to include the categorical as well I keep crashing, it's too much data to process.
How would you go about further reducing features in that case? My goal, more than the classification itself, is to identify the features that bring in the most information towards the classification task.
You could use a data-driven approach. The simplest one would be to use L1 regularisation on a logistic regression (with your simple classification task) and then look at the weights, keeping the features whose weights are not zero or close to zero.
Basically, the L1 norm on the model weights enforces sparsity of the weight vector, and in doing so, the only surviving weights are the ones corresponding to the "important" features.
In any case, be careful to normalise the data before using this technique, and also be careful about how you handle categorical and scalar features...
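To make the sparsity effect concrete, here is a small self-contained sketch of L1-penalised logistic regression fitted by proximal gradient descent. It is only an illustration on a four-row toy dataset: the function names, the penalty strength `lam` and the data are all assumptions, and for 900+ columns you would use a library implementation rather than this loop.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(w, t):
    """Proximal step for the L1 penalty: shrink w towards 0 by t."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def fit_l1_logistic(X, y, lam=0.1, lr=0.5, steps=500):
    """Logistic regression (no intercept) with an L1 penalty.
    Features are assumed to be standardised already."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        # Residuals p_i - y_i under the current weights.
        errs = [sigmoid(sum(wj * xj for wj, xj in zip(w, row))) - yi
                for row, yi in zip(X, y)]
        for j in range(d):
            grad = sum(e * row[j] for e, row in zip(errs, X)) / n
            # Gradient step on the log-loss, then shrink towards zero.
            w[j] = soft_threshold(w[j] - lr * grad, lr * lam)
    return w

# Toy data: feature 0 decides the label, feature 1 is unrelated to it.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [1, 1, 0, 0]

weights = fit_l1_logistic(X, y)
selected = [j for j, wj in enumerate(weights) if abs(wj) > 1e-6]
print(weights, selected)
```

The uninformative feature's weight is driven to exactly zero, so only the informative feature survives the selection.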
You can also use a Neural network, and then compute the gradient w.r.t. the input to see what influences the decision more.
Or some other technique like: https://link.springer.com/chapter/10.1007/978-3-030-33778-0_24
Alternatively you can also use a Random Forest model and do feature importance like: https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html

Can a decision tree based model predict the future?

I am trying to build a model that predicts the shipping volume of each month, week, and day.
I found that the decision tree-based model works better than linear regression.
But I read some articles about machine learning and they say a decision tree based model can't predict a future which the model didn't learn (extrapolation issues).
So I think it means that if the data falls within the range of dates in the training data, the model can predict well, but if the date is out of that range, it cannot.
I'd like to confirm whether my understanding is correct.
Some posts show predictions for datetime-based data using a random forest model, and that confuses me.
Also please let me know if there is any way to overcome extrapolation issues on decision tree based model.
It depends on the data.
A decision tree predicts the target value of any sample in the range [minimum target value in the training data, maximum target value in the training data]. For example, suppose there are five samples [(X1, Y1), (X2, Y2), ..., (X5, Y5)], and a well-trained tree has two leaf nodes. The first node N1 contains (X1, Y1), (X2, Y2) and the other node N2 contains (X3, Y3), (X4, Y4), and (X5, Y5). Then the tree will predict the mean of Y1 and Y2 for a new sample that reaches N1, and the mean of Y3, Y4, and Y5 for a new sample that reaches N2.
For this reason, if the target value of a new sample could be bigger than the maximum or smaller than the minimum target value in the training data, it is not recommended to use a decision tree. Otherwise, tree-based models such as random forests show good performance.
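The five-sample example above can be reproduced with a tiny one-split regression tree. This is a hand-rolled sketch for illustration only (the helper names `fit_stump` and `predict_stump` and the numbers are made up), but it shows why the prediction for a far-away input stays inside the range of the training targets.

```python
def fit_stump(xs, ys):
    """Fit a one-split regression tree: pick the threshold that minimises
    the total squared error and store the mean target on each side."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

# Five training samples with a perfectly linear trend.
xs = [1, 2, 3, 4, 5]
ys = [10, 20, 30, 40, 50]
stump = fit_stump(xs, ys)

# x = 100 is far outside the training range, yet the prediction is just
# the mean of the right-hand leaf, which cannot exceed max(ys).
far_prediction = predict_stump(stump, 100)
print(stump, far_prediction)
```

A linear model would continue the trend for x = 100; the tree simply returns the mean of the leaf the sample falls into.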
There can be different forms of extrapolation issues here.
As already mentioned a classical decision tree for classification can only predict values it has encountered in its training/creation process. In that sense you won't predict any previously unseen values.
This issue can be remedied if you have the classifier predict relative updates instead of absolute values. But you need to have some understanding of your data, to determine what works best for different cases.
Things are similar for a decision tree used for regression.
The next issue with "extrapolation" is that decision trees might perform badly if your training data has changing statistics over time. Again, I would propose to predict update relationships.
Otherwise, predictions based on training data from a more recent past might yield better predictions. Since individual decision trees can't be trained in an online manner, you would have to create a new decision tree every x time steps.
Going further than this, I'd say you'll want to start thinking in state machines and try to use your classifier for state predictions. But this was a fairly uncharted domain of theory for decision trees when I last checked. This will work better if you already have some form of model for your data relationships in mind.
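The idea of predicting relative updates rather than absolute values can be sketched as follows. The volumes are invented, and simple averages stand in for what a fitted tree would return from a leaf, so treat this as an illustration of the target transformation and not as a forecasting method.

```python
# Hypothetical monthly shipping volumes with a steady upward trend.
volumes = [100, 110, 120, 130, 140]

# Absolute target: a tree leaf can only return a mean of training targets,
# so the forecast can never exceed max(volumes).
absolute_forecast = sum(volumes[-2:]) / 2  # mean of a leaf holding the last two months

# Relative target: learn the month-over-month change instead.
deltas = [b - a for a, b in zip(volumes, volumes[1:])]
mean_delta = sum(deltas) / len(deltas)  # stand-in for the tree's predicted change
relative_forecast = volumes[-1] + mean_delta

print(absolute_forecast, relative_forecast)
```

Because the predicted change is added to the last observed value, the forecast can move beyond the range seen in training, which the absolute-value version cannot do.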

How do neural networks learn functions instead of memorize them?

For a class project, I designed a neural network to approximate sin(x), but ended up with a NN that just memorized my function over the data points I gave it. My NN took in x-values with a batch size of 200. Each x-value was multiplied by 200 different weights, mapping to 200 different neurons in my first layer. My first hidden layer contained 200 neurons, each one a linear combination of the x-values in the batch. My second hidden layer also contained 200 neurons, and my loss function was computed between the 200 neurons in my second layer and the 200 values of sin(x) that the input mapped to.
The problem is, my NN perfectly "approximated" sin(x) with 0 loss, but I know it wouldn't generalize to other data points.
What did I do wrong in designing this neural network, and how can I avoid memorization and instead design my NN's to "learn" about the patterns in my data?
It is the same with any machine learning algorithm. You have a dataset from which you try to learn "the" function f(x) that actually generated the data. In real-life datasets, it is impossible to recover the original function from the data, and therefore we approximate it using some function g(x).
The main goal of any machine learning algorithm is to predict unseen data as well as possible using the function g(x).
Given a dataset D you can always train a model that will perfectly classify all the datapoints (you can use a hashmap to get 0 error on the train set), but that is overfitting, or memorization.
To avoid such things, you yourself have to make sure that the model does not memorise but learns the function. There are a few things which can be done. I am trying to write them down in an informal way (with links).
Train, Validation, Test
If you have a large enough dataset, use Train, Validation, Test splits. Split the dataset into three parts, typically 60%, 20% and 20% for Training, Validation and Test, respectively. (These numbers can vary based on need; also, in the case of imbalanced data, check how to get stratified partitions which preserve the class ratios in every split.) Next, forget about the Test partition: keep it somewhere safe and don't touch it. Your model will be trained using the Training partition. Once you have trained the model, evaluate its performance using the Validation set. Then select another hyper-parameter configuration for your model (e.g. number of hidden layers, learning algorithm, other parameters etc.), train the model again, and evaluate it on the Validation set. Keep on doing this for several such models. Then select the model which got you the best validation score.
The role of the validation set here is to check what the model has learned. If the model has overfit, then the validation scores will be very bad, and therefore in the above process you will discard those overfit models. But keep in mind that although you did not use the Validation set to train the model directly, the Validation set was used indirectly to select the model.
Once you have selected a final model based on the Validation set, take out your Test set, as if you had just got a new dataset from real life which no one has ever seen. The prediction of the model on this Test set will be an indication of how well your model has "learned", as it is now trying to predict datapoints which it has never seen (directly or indirectly).
It is key not to go back and tune your model based on the Test score. This is because once you do this, the Test set will start contributing to your model.
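A minimal sketch of the 60/20/20 split described above, using only the standard library. The function name, the fixed seed and the use of `range(100)` as a stand-in dataset are assumptions; stratification for imbalanced classes is not handled here.

```python
import random

def train_val_test_split(data, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle a copy of `data` and cut it into train/validation/test parts."""
    items = list(data)
    random.Random(seed).shuffle(items)
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]
    val = items[n_test:n_test + n_val]
    train = items[n_test + n_val:]
    return train, val, test

# Stand-in dataset of 100 samples, identified by their index.
train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

Tune on `val` as often as you like, but evaluate on `test` only once, at the very end.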
Crossvalidation and bootstrap sampling
On the other hand, if your dataset is small, you can use bootstrap sampling or k-fold cross-validation. These ideas are similar. For example, for k-fold cross-validation with k=5, you split the dataset into 5 parts (also be careful about stratified sampling). Let's name the parts a, b, c, d, e. Use the partitions [a,b,c,d] to train and get the prediction scores on [e] only. Next, use the partitions [a,b,c,e] to train and get the prediction scores on [d] only, and continue 5 times, where each time you hold one partition out and train the model with the other 4. After this, take an average of these scores. This is indicative of how your model might perform if it sees new data. It is also good practice to do this multiple times and average the results. For example, for smaller datasets, perform 10 repetitions of 10-fold cross-validation, which will give a pretty stable score (depending on the dataset) that is indicative of the prediction performance.
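The k-fold procedure can be sketched in a few lines. The helper names are hypothetical, and the scorer passed in at the end is a dummy; a real one would train a model on `train_idx` and return its score on `test_idx`. The sketch also assumes the data has already been shuffled.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.
    Assumes the n samples have already been shuffled."""
    indices = list(range(n))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

def cross_val_score(n, k, fit_and_score):
    """Average the held-out score over the k folds."""
    scores = [fit_and_score(tr, te) for tr, te in k_fold_indices(n, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, 5))

# Dummy scorer for illustration: returns the held-out fraction of the data.
mean_score = cross_val_score(10, 5, lambda tr, te: len(te) / (len(tr) + len(te)))
print(folds[0], mean_score)
```

Each sample lands in the held-out part exactly once, so every datapoint contributes to the averaged score.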
Bootstrap sampling is similar, but you sample the same number of datapoints (depends) with replacement from the dataset and use this sample to train. This set will have some datapoints repeated (as it was sampled with replacement). Then use the datapoints that were left out of the training sample to evaluate the model. Perform this multiple times and average the performance.
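One bootstrap round might look like the sketch below. The function name, seed and dataset size are illustrative assumptions; in practice you would repeat this many times, train on `train_idx`, score on `oob_idx`, and average the scores.

```python
import random

def bootstrap_split(n, rng):
    """Draw n indices with replacement for training; the indices that were
    never drawn (the "out-of-bag" points) form the evaluation set."""
    train_idx = [rng.randrange(n) for _ in range(n)]
    drawn = set(train_idx)
    oob_idx = [i for i in range(n) if i not in drawn]
    return train_idx, oob_idx

rng = random.Random(0)
train_idx, oob_idx = bootstrap_split(50, rng)
print(len(set(train_idx)), "distinct training points,", len(oob_idx), "held out")
```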
Others
Other ways are to incorporate regularisation techniques in the classifier cost function itself. For example, in Support Vector Machines, the cost function enforces conditions such that the decision boundary maintains a "margin" or a gap between two class regions. In neural networks one can also do similar things (although it is not the same as in SVM).
In neural networks you can use early stopping to stop the training. What this does is train on the Train dataset, but at each epoch it evaluates the performance on the Validation dataset. If the model starts to overfit from a specific epoch, then the error on the Training dataset will keep decreasing, but the error on the Validation dataset will start increasing, indicating that your model is overfitting. Based on this, one can stop training.
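The early-stopping rule can be sketched independently of any framework. Here the validation losses are a made-up list; in a real training loop each value would be computed after the corresponding epoch, and `patience` (an assumed parameter name) controls how many non-improving epochs are tolerated before stopping.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch with the best validation loss, scanning
    until the loss has failed to improve for `patience` epochs in a row."""
    best_epoch, best_loss, waited = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0
        else:
            waited += 1
            if waited >= patience:
                break  # validation loss keeps rising: stop training
    return best_epoch

# Hypothetical validation losses: they improve, then start to climb.
val_losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.47, 0.55]
stop_at = early_stopping_epoch(val_losses)
print("keep the weights from epoch", stop_at)
```

In practice you would also save the model weights at each new best epoch so that you can restore them after stopping.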
A large dataset from the real world tends not to overfit too much (citation needed). Also, if you have too many parameters in your model (too many hidden units and layers), and if the model is unnecessarily complex, it will tend to overfit. A model with fewer parameters is much less likely to overfit (though it can underfit if it has too few parameters).
In the case of your sin function task, the neural net has to overfit, as it is ... the sin function. These tests can really help debug and experiment with your code.
Another important note, if you try to do a Train, Validation, Test, or k-fold crossvalidation on the data generated by the sin function dataset, then splitting it in the "usual" way will not work as in this case we are dealing with a time-series, and for those cases, one can use techniques mentioned here
First of all, I think it's a great project to approximate sin(x). It would be great if you could share the snippet or some additional details so that we could pin point the exact problem.
However, I think that the problem is that you are overfitting the data hence you are not able to generalize well to other data points.
A few tricks that might work:
Get more training points
Go for regularization
Add a test set so that you know whether you are overfitting or not.
Keep in mind that 0 loss or 100% accuracy on the training set is usually not a good sign.

What does this learning curve show? And how to handle non-representativity of a sample?

==> to see learning curves
I am trying a random forest regressor for a machine learning problem (price estimation of spatial points). I have a sample of spatial points in a city. The sample is not randomly drawn since there are very few observations downtown. And I want to estimate prices for all addresses in the city.
I have a good cross-validation score (absolute mean squared error) and also a good test score after splitting the training set. But predictions are very bad.
What could explain these results?
I plotted the learning curve (link above): the cross-validation score increases with the number of instances (that sounds logical), the training score remains high (should it decrease?) ... What do these learning curves show? And in general, how do we "read" learning curves?
Moreover, I suppose that the sample is not representative. I tried to make the dataset for which I want predictions spatially similar to the training set by drawing without replacement according to the proportions of observations in each district in the training set. But this didn't change the result. How can I handle this non-representativity?
Thanks in advance for any help
There are a few common cases that pop up when looking at training and cross-validation scores:
Overfitting: When your model has a very high training score but a poor cross-validation score. Generally this occurs when your model is too complex, allowing it to fit the training data exceedingly well but giving it poor generalization to the validation dataset.
Underfitting: When neither the training nor the cross-validation scores are high. This occurs when your model is not complex enough.
Ideal fit: When both the training and cross-validation scores are fairly high. Your model not only learns to represent the training data, but it generalizes well to new data.
Here's a nice graphic from this Quora post showing how model complexity and error relate to the type of fit a model exhibits.
In the plot above, the errors for a given complexity are the errors found at equilibrium. In contrast, learning curves show how the score progresses throughout the entire training process. Generally you never want to see the score decreasing during training, as this usually means your model is diverging. But the difference between the training and validation scores as they move forward in time (towards equilibrium) indicates how well your model is fitting.
Notice that even when you have an ideal fit (middle of complexity axis) it is common to see a training score that's higher than the cross-validation score, since the model's parameters are updated using the training data. But since you're getting poor predictions, and since validation score is ~10% lower than training score (assuming the score is out of 1), I would guess that your model is overfitting and could benefit from less complexity.
To answer your second point, models will generalize better if the training data is a better representation of the validation data. So when splitting the data into training and validation sets, I recommend finding a way to randomly segregate the data. For example, you could generate a list of all the points in the city, iterate over the list, and for each point draw from a uniform distribution to decide which dataset that point belongs to.
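The per-point uniform draw described above could be sketched like this. The grid of coordinates, the 20% validation fraction and the function name are assumptions made for the example.

```python
import random

def random_partition(points, val_frac=0.2, seed=0):
    """Assign each point to train or validation with an independent
    uniform draw, so neither set is biased towards one area."""
    rng = random.Random(seed)
    train, val = [], []
    for point in points:
        (val if rng.random() < val_frac else train).append(point)
    return train, val

# Hypothetical city: a 10 x 10 grid of (x, y) locations.
points = [(x, y) for x in range(10) for y in range(10)]
train_pts, val_pts = random_partition(points)
print(len(train_pts), len(val_pts))
```

Because each point is assigned independently, the set sizes are only approximately 80/20, which is usually acceptable.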

Predictive features with high presence in one class

I am doing a logistic regression to predict the outcome of a binary variable, say whether a journal paper gets accepted or not. The independent variables (predictors) are all the phrases used in these papers (unigrams, bigrams, trigrams). One of these phrases has a skewed presence in the 'accepted' class. Including this phrase gives me a classifier with a very high accuracy (more than 90%), while removing this phrase results in accuracy dropping to about 70%.
My more general (naive) machine learning question is:
Is it advisable to remove such skewed features when doing classification?
Is there a method to check skewed presence for every feature and then decide whether to keep it in the model or not?
If I understand correctly, you ask whether some feature should be removed because it is a good predictor (it makes your classifier work better). So the answer is short and simple: do not remove it. In fact, the whole point is to find exactly such features.
The only reason to remove such a feature would be that this phenomenon occurs only in the training set, and not in real data. But in that case you have the wrong data, which does not represent the underlying data density, and you should gather better data or "clean" the current data so it has characteristics analogous to the "real" data.
Based on your comments, it sounds like the feature in your documents that's highly predictive of the class is a near-tautology: "paper accepted on" correlates with accepted papers because at least some of the papers in your database were scraped from already-accepted papers and have been annotated by the authors as such.
To me, this sounds like a useless feature for trying to predict whether a paper will be accepted, because (I'd imagine) you're trying to predict paper acceptance before the actual acceptance has been issued! In such a case, none of the papers you'd like to test your algorithm with will be annotated with "paper accepted on." So, I'd remove it.
You also asked about how to determine whether a feature correlates strongly with one class. There are three things that come to mind for this problem.
First, you could just compute a basic frequency count for each feature in your dataset and compare those values across classes. This is probably not super informative, but it's easy.
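A per-class frequency count of this kind can be done with the standard library alone. The documents, labels and function name below are invented for illustration; each document is assumed to be already tokenised into a list of phrases.

```python
from collections import Counter

def class_frequencies(documents, labels):
    """Count, per class, how many documents contain each phrase."""
    counts = {}
    for phrases, label in zip(documents, labels):
        # set() so a phrase repeated within one document is counted once.
        counts.setdefault(label, Counter()).update(set(phrases))
    return counts

# Toy corpus, purely hypothetical.
docs = [
    ["novel method", "paper accepted on"],
    ["paper accepted on"],
    ["novel method"],
    ["future work"],
]
labels = ["accepted", "accepted", "rejected", "rejected"]

freq = class_frequencies(docs, labels)
for phrase in ["paper accepted on", "novel method"]:
    print(phrase, {label: freq[label][phrase] for label in freq})
```

A phrase whose count is high in one class and near zero in the other is a candidate for a closer look.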
Second, since you're using a log-linear model, you can train your model on your training dataset, and then rank each feature in your model by its weight in the logistic regression parameter vector. Features with high positive weight are indicative of one class, while features with large negative weight are strongly indicative of the other.
Finally, just for the sake of completeness, I'll point out that you might also want to look into feature selection. There are many ways of selecting relevant features for a machine learning algorithm, but I think one of the most intuitive from your perspective might be greedy feature elimination. In such an approach, you train a classifier using all N features in your model, and measure the accuracy on some held-out validation set. Then, train N new models, each with N-1 features, such that each model eliminates one of the N features, and measure the resulting drop in accuracy. The feature with the biggest drop was probably strongly predictive of the class, while features that have no measurable difference can probably be omitted from your final model. As larsmans points out correctly in the comments below, this doesn't scale well at all, but it can be a useful method sometimes.
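One round of the leave-one-feature-out procedure described above can be sketched as follows. Here `score_fn` is a hypothetical callable that would train a classifier on the given features and return validation accuracy; the toy scorer and its numbers are invented so that the example runs without a real model.

```python
def feature_drops(features, score_fn):
    """Score the full feature set, then re-score with each feature left out.
    Returns {feature: drop in score}; the largest drop marks the most
    predictive feature."""
    full_score = score_fn(features)
    return {
        f: full_score - score_fn([g for g in features if g != f])
        for f in features
    }

# Toy scorer (accuracy in percent): each feature adds a fixed, made-up
# amount. A real score_fn would train and evaluate a model here.
contribution = {"paper accepted on": 20, "novel method": 15, "future work": 0}

def toy_score(features):
    return 55 + sum(contribution[f] for f in features)

drops = feature_drops(list(contribution), toy_score)
most_predictive = max(drops, key=drops.get)
print(drops, most_predictive)
```

This trains N + 1 models per round, which is why the approach scales poorly as the number of features grows.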

Resources