I have three questions about XGBoost.
What is the final model of XGBoost? That is, when I want to make a prediction, is the final prediction the average of all trees?
In R, how can I check the prediction of each individual tree?
In R, how should I interpret booster being gbtree and objective being reg:linear? Does that mean a tree-based model is used and each leaf holds a linear regression model (rather than an average)? If yes, which features does each leaf use?
Thanks!
I am trying to build a model that predicts the shipping volume of each month, week, and day.
I found that a decision-tree-based model works better than linear regression.
But I read some articles about machine learning saying that decision-tree-based models cannot predict future values that the model did not see during training (the extrapolation issue).
So I think this means that if the dates to predict fall within the range of dates in the training data, the model can predict well, but if a date is outside that range, it cannot.
I'd like to confirm whether my understanding is correct.
Some posts show predictions for datetime-based data using a random forest model, which confuses me.
Also, please let me know if there is any way to overcome the extrapolation issue with decision-tree-based models.
It depends on the data.
A decision tree predicts a target value in the range [minimum target value in the training data, maximum target value in the training data]. For example, suppose there are five samples [(X1, Y1), (X2, Y2), ..., (X5, Y5)] and a well-trained tree has two leaf nodes. The first leaf N1 contains (X1, Y1) and (X2, Y2), and the other leaf N2 contains (X3, Y3), (X4, Y4), and (X5, Y5). The tree then predicts the mean of Y1 and Y2 for a new sample that reaches N1, and the mean of Y3, Y4, and Y5 for a new sample that reaches N2.
For this reason, if the target value of a new sample could be larger than the maximum or smaller than the minimum target value in the training data, a decision tree is not recommended. Otherwise, tree-based models such as random forests perform well.
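The leaf-mean behaviour described above can be sketched with a hand-rolled one-split regression tree (a stump). This is a minimal illustration in plain Python, not the code of any particular library; the helper names `fit_stump` and `predict_stump` and the monthly volumes are hypothetical.

```python
from statistics import mean

def fit_stump(xs, ys):
    """Fit a one-split regression tree (a stump) by minimising squared error."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between equal x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean

def predict_stump(stump, x):
    """Return the mean of the leaf that x falls into."""
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

# Hypothetical monthly volumes with an upward trend: volume = 10 * month.
months = [1, 2, 3, 4, 5, 6]
volumes = [10, 20, 30, 40, 50, 60]
stump = fit_stump(months, volumes)

in_range = predict_stump(stump, 5)      # a month seen during training
far_future = predict_stump(stump, 100)  # a month far beyond the training range
```

Both calls land in the same leaf, so month 100 gets the same value as month 5, and that value cannot exceed the largest volume seen in training.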
There can be different forms of extrapolation issues here.
As already mentioned, a classical decision tree for classification can only predict values it has encountered during training. In that sense, it will never predict a previously unseen value.
This issue can be remedied if you have the classifier predict relative updates instead of absolute values. But you need to have some understanding of your data, to determine what works best for different cases.
Things are similar for a decision tree used for regression.
The next issue with "extrapolation" is that decision trees might perform badly if your training data has changing statistics over time. Again, I would propose to predict update relationships.
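One way to read "predicting updates" is to train on the step-to-step change instead of the absolute value, then add the predicted change back to the last known value. The sketch below assumes that reading; the series is hypothetical, and a plain mean stands in for the tree, since a tree leaf also returns a mean of training targets.

```python
from statistics import mean

# Hypothetical series with a steady upward trend.
series = [100, 110, 120, 130, 140, 150]

# Relative updates: the change from one step to the next.
deltas = [curr - prev for prev, curr in zip(series, series[1:])]

# Stand-in for a tree leaf: predict the mean of the training targets.
predicted_delta = mean(deltas)

# Turn the predicted update back into an absolute value.
forecast = series[-1] + predicted_delta
```

The predicted update stays within the range of updates seen in training, yet the reconstructed forecast can exceed the largest absolute value in the training series.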
Otherwise, predictions based on training data from a more recent past might yield better predictions. Since individual decision trees can't be trained in an online manner, you would have to create a new decision tree every x time steps.
Going further than this, I'd say you'll want to start thinking in terms of state machines and try to use your classifier for state predictions. But this was a fairly uncharted area of theory for decision trees when I last checked. It will work better if you already have some form of model for your data relationships in mind.
Consider a parametric binary classifier (such as logistic regression, SVM, etc.) trained on a dataset (say, one containing two features, e.g. blood pressure and cholesterol level). The dataset is thrown away and the trained model can only be used as a black box (no tweaks can be made to the trained model and no inside information can be extracted from it). Only a set of data points can be provided and their labels predicted.
Is it possible to get information about the mean, standard deviation, and/or range of the features of the dataset on which this model was trained? If yes, how? If no, why not?
Thank you for your response! :)
SVM does not provide any information about the data statistics, it is a maximum margin classifier and it finds the best separating hyperplane between two datasets in the feature space, as a linear combination of "support vectors". If you use kernel functions, then this combination is in the kernel space, it is not even in the original feature space. SVM does not have a straightforward probabilistic interpretation whatsoever.
Logistic regression is a discriminative classifier and models the conditional probability p(y|x,w), where y is your label, x is your data, and w are the weights. After maximum likelihood training you are left with w, which again defines a discriminator (a hyperplane) in the feature space, so you do not recover the feature statistics here either.
The following can be considered. Use a Gaussian classifier. Assume that your class is produced by the prior class probability p(y). Then a class-conditional density p(x|y,w) produces your data. By Bayes' rule, you will have: p(y|x,w) = (p(y)p(x|y,w))/p(x). If you define the class-conditional density p(x|y,w) as Gaussian, its parameter set w will consist of the mean vector m and covariance matrix C of x, assuming x was produced by class y. But remember that this only works under the assumption that the current data vector belongs to a specific class. Conditioned on w, a better option for the mean vector would be E[x|w]. This is the expectation of x with respect to p(x|w). It comes down to a weighted average of the mean vectors for the classes y=0 and y=1, weighted by their prior class probabilities. The same should work for the covariance as well, but it needs to be derived properly; I am not 100% sure right now.
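For a single feature, the overall mean is the prior-weighted average of the class means, and the overall variance follows from the law of total variance: Var[x] = sum_y p(y) * (var_y + (m_y - E[x])^2). The sketch below checks both identities numerically; the values are hypothetical, and it assumes the per-class priors, means, and variances are available, as they would be in a Gaussian classifier.

```python
from statistics import mean, pvariance

# Hypothetical one-feature dataset, split by class label.
class_values = {0: [1.0, 2.0, 3.0], 1: [10.0, 12.0]}

n_total = sum(len(v) for v in class_values.values())
priors = {y: len(v) / n_total for y, v in class_values.items()}
class_means = {y: mean(v) for y, v in class_values.items()}
class_vars = {y: pvariance(v) for y, v in class_values.items()}

# E[x] = sum_y p(y) * m_y
overall_mean = sum(priors[y] * class_means[y] for y in class_values)

# Var[x] = sum_y p(y) * (var_y + (m_y - E[x])^2)
overall_var = sum(
    priors[y] * (class_vars[y] + (class_means[y] - overall_mean) ** 2)
    for y in class_values
)

# Pooled data, used only to check the two formulas above.
pooled = [x for v in class_values.values() for x in v]
```

The extra term (m_y - E[x])^2 accounts for the spread between the class means; averaging the class variances alone would underestimate the overall variance.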
Imagine a machine learning problem where you have 20 classes and about 7000 sparse boolean features.
I want to figure out what the 20 most unique features per class are. In other words, features that are used a lot in a specific class but aren't used in other classes, or hardly used.
What would be a good feature selection algorithm or heuristic that can do this?
When you train a multi-class logistic regression classifier, the trained model is a num_class x num_feature matrix whose [i,j] value is the weight of feature j in class i. The feature indices are the same as in your input feature matrix.
In scikit-learn you can access the parameters of the model.
If you use scikit-learn classification algorithms, you can find the most important features per class with:
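import numpy as np
from sklearn.linear_model import SGDClassifier

# Recent scikit-learn releases use loss='log_loss' and max_iter (formerly 'log' and n_iter)
clf = SGDClassifier(loss='log_loss', alpha=regul, penalty='l1', l1_ratio=0.9, learning_rate='optimal', max_iter=10, shuffle=False, n_jobs=3, fit_intercept=True)
clf.fit(X_train, Y_train)
for i in range(clf.coef_.shape[0]):
    # indices of the 20 largest weights for class i
    top20_indices = np.argsort(clf.coef_[i])[-20:]
    print(top20_indices)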
clf.coef_ is the matrix containing the weight of each feature in each class, so clf.coef_[0][2] is the weight of the third feature in the first class.
If you keep track of the index of each feature in a dictionary when you build your feature matrix, with dic[id] = feature_name, you can use that dictionary to retrieve the names of the top features.
For more information, refer to the scikit-learn text classification example.
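The lookup from top-weight indices back to feature names could look roughly like this. The weight matrix, the `feature_names` dictionary, and the helper `top_features` are hypothetical stand-ins for a real `clf.coef_` and the dictionary kept while building the feature matrix.

```python
# Hypothetical weight matrix: one row per class, one column per feature.
coef = [
    [0.1, 2.0, -0.5, 0.0],
    [1.5, -0.2, 0.0, 0.7],
]
# Hypothetical index -> name dictionary built alongside the feature matrix.
feature_names = {0: "word_a", 1: "word_b", 2: "word_c", 3: "word_d"}

def top_features(weights, names, k):
    """Return the names of the k largest-weight features, largest first."""
    order = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)
    return [names[j] for j in order[:k]]

top_per_class = [top_features(row, feature_names, k=2) for row in coef]
```

With 20 classes and about 7000 features, the same call with k=20 would give the 20 highest-weighted feature names per class.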
Random Forest and Naive Bayes should be able to handle this for you. Given the sparsity, I'd go for the Naive Bayes first. Random Forest would be better if you're looking for combinations.
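In the spirit of Naive Bayes, one simple heuristic is to compare how often a boolean feature is switched on inside a class with how often it is switched on in all other classes. The sketch below is an assumption about how that could be done, not an established algorithm from any library; `distinctive_features`, the smoothing constant, and the toy data are all hypothetical.

```python
import math

def distinctive_features(X, y, target_class, k, smoothing=1.0):
    """Rank boolean features by log P(feature | class) - log P(feature | other classes)."""
    n_features = len(X[0])
    in_rows = [row for row, label in zip(X, y) if label == target_class]
    out_rows = [row for row, label in zip(X, y) if label != target_class]
    scores = []
    for j in range(n_features):
        # Laplace-smoothed rate at which feature j is switched on.
        p_in = (sum(row[j] for row in in_rows) + smoothing) / (len(in_rows) + 2 * smoothing)
        p_out = (sum(row[j] for row in out_rows) + smoothing) / (len(out_rows) + 2 * smoothing)
        scores.append((math.log(p_in) - math.log(p_out), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

# Hypothetical toy data: 3 boolean features, 2 classes.
X = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 1],
    [0, 1, 1],
]
y = ["a", "a", "b", "b"]

top_a = distinctive_features(X, y, "a", k=1)
top_b = distinctive_features(X, y, "b", k=1)
```

Feature 2 is on in every row, so it scores zero for both classes and is never picked as distinctive, which matches the goal of finding features used a lot in one class but hardly in the others.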
What is the difference between classification and prediction in machine learning?
Classification is the prediction of a categorical variable within a predefined vocabulary based on training examples.
The prediction of numerical (continuous) variables is called regression.
In summary, classification is one kind of prediction, but there are others. Hence, prediction is a more general problem.
Functionality
Classification is about determining a (categorical) class (or label) for an element in a dataset.
Prediction is about predicting a missing/unknown element (a continuous value) of a dataset.
Working Strategy
In classification, data is grouped into categories based on a training dataset.
In prediction, a classification/regression model is built to predict the outcome (a continuous value).
Example
In a hospital, the grouping of patients based on their medical record or treatment outcome is considered classification, whereas, if you use a classification model to predict the treatment outcome for a new patient, it is considered a prediction.
Classification is the process of identifying the category or class label to which a new observation belongs.
Prediction is the process of identifying the missing or unavailable numerical data for a new observation.
That is the key difference between classification and prediction. Prediction is not concerned with the class label, as classification is.
Predictions can be made using both regression and classification models. Once a model is trained on the training data, the next phase is to make predictions for data whose real/ground-truth values are either unknown or kept aside to evaluate the performance of the model. If the problem is about determining classes/labels/categories, then it is classification, and if the problem is about determining real (numeric) values, then it is regression. In a nutshell, predictions are made with both classification and regression models on the test data set.
1. Prediction means stating something that may happen in the future. Prediction may be a kind of classification.
2. Prediction is mostly based on our assumptions about the future,
whereas
1. Classification is the categorization of things or data that we already have. This categorization can be based on any kind of technique or algorithm.
2. Classification is mostly based on our current or past assumptions.
I have a classification task, and I use svm_perf application.
Having trained the model, I wonder whether it is possible to get the weights of the features.
There is an -a parameter which outputs the alphas. Honestly, I don't recall alphas in SVM; I thought the weights were always w.
If you are using a linear SVM, there is a Python script that works from the model file output by svm_learn and svm_perf_learn. To be more specific, the weight vector is just w=SUM_i (y_i*alpha_i*sv_i), where sv_i is the i-th support vector and y_i is the label of the corresponding training sample.
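The sum above can be written out directly. This is a minimal sketch of the formula only: the function name `linear_svm_weights` is hypothetical, and the support vectors, labels, and alphas are made-up values rather than anything read from a real svm_perf model file.

```python
def linear_svm_weights(support_vectors, labels, alphas):
    """Compute w = sum_i y_i * alpha_i * sv_i for a linear SVM."""
    n_features = len(support_vectors[0])
    w = [0.0] * n_features
    for sv, y_i, alpha_i in zip(support_vectors, labels, alphas):
        for j in range(n_features):
            w[j] += y_i * alpha_i * sv[j]
    return w

# Hypothetical values, not read from a real model file.
support_vectors = [[1.0, 2.0], [3.0, 0.0]]
labels = [1, -1]
alphas = [0.5, 0.25]

w = linear_svm_weights(support_vectors, labels, alphas)
```

Each entry of w is then the weight of the corresponding input feature, which is what the question asks for in the linear case.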
If you are using a non-linear SVM, I don't think the weight coefficients are directly related to the input space. Yet you can still get the decision function:
f(x) = sgn( SUM_i (alpha_i*y_i*K(sv_i,x)) + b );
where K is your kernel function.
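As a rough illustration of this decision function, the sketch below plugs in a Gaussian (RBF) kernel for K. The kernel choice, the gamma value, and all numbers are assumptions made for the example; a score of exactly zero is mapped to +1 here by convention.

```python
import math

def rbf_kernel(a, b, gamma=0.5):
    """Gaussian (RBF) kernel, chosen here only as an example of K."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-gamma * sq_dist)

def decision(x, support_vectors, labels, alphas, b, kernel=rbf_kernel):
    """Return sgn(sum_i alpha_i * y_i * K(sv_i, x) + b) as +1 or -1."""
    score = sum(
        alpha_i * y_i * kernel(sv, x)
        for sv, y_i, alpha_i in zip(support_vectors, labels, alphas)
    )
    return 1 if score + b >= 0 else -1

# Hypothetical support vectors, labels, alphas and bias.
support_vectors = [[0.0, 0.0], [4.0, 4.0]]
labels = [1, -1]
alphas = [1.0, 1.0]
b = 0.0

near_positive = decision([0.5, 0.0], support_vectors, labels, alphas, b)
near_negative = decision([4.0, 3.5], support_vectors, labels, alphas, b)
```

Note that no per-feature weight appears anywhere in this computation, which is why the non-linear case does not yield feature weights in the input space.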