How to start making MAE maps from my random forest model using my covariates and a predicted variable (clay) - random-forest

I am working in R and I have my random forest model built and cross-validated ("cv").
What I am doing is predicting soil texture of sand, silt, and clay using 10 covariates.
Is there a way to create a MAE map of e.g., clay?
I think I have to raster brick my covariates, but I do not know what to do from there.
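A minimal sketch of the general workflow, using NumPy/scikit-learn as a stand-in for the R raster-brick setup described above (the arrays `X_samples`, `y_clay` and `covariate_stack` are hypothetical placeholders for the poster's soil samples and 10 covariate layers):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_predict

# Observed clay at sample locations, with covariate values extracted there
# (placeholder data standing in for the real soil samples).
X_samples = np.random.rand(200, 10)
y_clay = np.random.rand(200) * 60

rf = RandomForestRegressor(n_estimators=500, random_state=0)

# Cross-validated predictions at the sample locations (the "cv" in the
# question) and the per-sample absolute errors an MAE map would summarise.
cv_pred = cross_val_predict(rf, X_samples, y_clay, cv=10)
abs_err = np.abs(y_clay - cv_pred)

# "Raster brick" step: a stack of 10 covariate layers, here a plain array of
# shape (rows, cols, 10). Flatten to (n_pixels, 10), predict, reshape back.
rows, cols = 100, 120
covariate_stack = np.random.rand(rows, cols, 10)
rf.fit(X_samples, y_clay)
clay_map = rf.predict(covariate_stack.reshape(-1, 10)).reshape(rows, cols)

# abs_err only exists at the sample points; turning it into a wall-to-wall
# MAE map needs a further step (e.g. summarising errors per region, or
# modelling the absolute errors themselves against the covariates).
```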

Related

Weighted least squares loss function in tensor_forest

Does the TensorFlow random forest module (tensor_forest) allow you to specify a true weighted least squares objective function to be minimized during training of the random forest model?
From what I gather, random forest training in tensor_forest happens inside a specific estimator (TensorForestEstimator), which does not seem to allow a custom loss function to be specified (weighted least squares is the one I am interested in).
How can I achieve this?

Create negative examples in dataset with only positive ones

Imagine we have a classification problem on a dataset where the examples are only positive (equivalently, only negative). For instance, a problem where the winning class is specified by position (e.g., think of a tennis dataset where the first player is always the winner). How can we create negative examples in order to train a supervised learning algorithm on this dataset? One idea could be to generate negative examples by exchanging the positions of the features that are tied to each of the classes. Do you think this will give an unbiased dataset? Could we create negative duplicates of our original dataset and train a supervised learning algorithm on this doubled dataset?
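A rough sketch of the swap idea described in the question, assuming a simple column layout in which each row holds player 1's features followed by player 2's features and the label is 1 because the first position always wins (the feature counts and data here are placeholders):

```python
import numpy as np

n, k = 1000, 5                     # k features per player (hypothetical)
p1 = np.random.rand(n, k)          # winner's features (first position)
p2 = np.random.rand(n, k)          # loser's features (second position)

X_pos = np.hstack([p1, p2])        # original, all-positive examples
y_pos = np.ones(n)

X_neg = np.hstack([p2, p1])        # positions exchanged
y_neg = np.zeros(n)

# The doubled dataset the question asks about: every match appears twice,
# once per ordering, so the classes are balanced by construction.
X = np.vstack([X_pos, X_neg])
y = np.concatenate([y_pos, y_neg])
```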

How to use learning curves with random forests

I've been going through Andrew Ng's machine learning course and just finished the learning curve lecture. I created a learning curve for a logistic regression model, and it looks like the training and CV scores converge, which means my model could benefit from more features. How could I do a similar analysis for something like a random forest? When I create a learning curve for a random forest classifier with the same data in sklearn, my training score just stays very close to 1. Do I need to use a different method of getting the training error?
Learning curves are a tool for studying the bias-variance trade-off. Since your random forest's training score stays very close to 1, your random forest model is able to learn the underlying function. If your underlying function were more non-linear and more complex, you would have had to add more features. See the following example (figure "Learning Curves").
Start with only 2 features and train your random forest model. Then use all of your features and train your random forest model again.
You should see a similar graph for your example.
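A minimal sketch of that comparison with scikit-learn's learning_curve, plotting training and CV scores for a random forest trained first on 2 features and then on all features (the synthetic dataset and the feature split are placeholders):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

for n_feat, label in [(2, "2 features"), (X.shape[1], "all features")]:
    sizes, train_scores, cv_scores = learning_curve(
        RandomForestClassifier(n_estimators=200, random_state=0),
        X[:, :n_feat], y, cv=5,
        train_sizes=np.linspace(0.1, 1.0, 5),
    )
    plt.plot(sizes, train_scores.mean(axis=1), label=f"train, {label}")
    plt.plot(sizes, cv_scores.mean(axis=1), label=f"cv, {label}")

plt.xlabel("training set size")
plt.ylabel("score")
plt.legend()
plt.show()
```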

Speeding up the classification process - PCA combined with SVM?

I have a cyclic method running which collects a data set of 15,000 feature vectors with 30 dimensions (every 200 ms). My current setup simply feeds all raw feature vectors to an SVM with an RBF (radial basis function) kernel. The classification result is rather unconvincing, as it is costly in terms of time. I know that the dataset isn't that big, so real-time classification should be possible with the right subsampling of the feature vectors. The goal is to speed up the entire classification process (training/prediction) down to a few milliseconds. To obtain an unsupervised classification approach, I currently run k-means to label the feature vectors. I pick a few cluster results and assign them class 1, and all others class 0.
The idea is now the following:
- collect all 15,000 (N) feature vectors with 30 (D) dimensions
- run PCA on all N feature vectors
- use the eigenvalues to determine a reduced feature vector with d dimensions (d < D)
- feed the new set of (n < N) feature vectors (or: the eigenvectors?) to train the SVM
Maybe a KNN approach instead of the SVM would give a similar result?
Does this approach make sense?
Any ideas to improve the process, or to change it in order to speed it up?
How do I determine the best number of dimensions d?
The classification accuracy shouldn't suffer too much from the time reduction.
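A minimal sketch of the PCA-then-SVM idea from the list above, using scikit-learn as an assumed implementation (the number of components d, the placeholder data, and the k-means-derived labels are stand-ins to be replaced and tuned):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

N, D, d = 15000, 30, 10                     # d < D, e.g. chosen from explained variance
X = np.random.rand(N, D)                    # placeholder feature vectors
y = np.random.randint(0, 2, N)              # placeholder k-means-derived labels

# Project the vectors onto the first d principal components, then train the
# SVM on the reduced vectors (it is the projected samples, not the
# eigenvectors, that are fed to the classifier).
clf = make_pipeline(StandardScaler(), PCA(n_components=d), SVC(kernel="rbf"))
clf.fit(X, y)

# One common way to pick d: keep enough components to cover, say, 95% of the
# variance.
pca = PCA().fit(StandardScaler().fit_transform(X))
d_95 = np.searchsorted(np.cumsum(pca.explained_variance_ratio_), 0.95) + 1
```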
EDIT: Data stream mining
I was just reading about Data Stream Mining. I think this topic fits my setup quite well since I have to extract knowledge structures from continuous, rapid data records. Maybe I should replace the SVM with a Gradient Boosted Tree?
Thanks!

SVM vector of weights

I have a classification task, and I use the svm_perf application.
The question is: having trained the model, I wonder whether it's possible to get the weights of the features.
There is an -a parameter which outputs the alphas; honestly, I don't recall alphas in SVM, as I think the weights are always w.
If you are using a linear SVM, there is a Python script based on the model file output by svm_learn and svm_perf_learn. To be more specific, the weight vector is just w = SUM_i (alpha_i * y_i * sv_i), where sv_i is the i-th support vector and y_i is its category from the training sample.
If you are using a non-linear SVM, I don't think the weight coefficients are directly related to the input space. Yet you can still get the decision function:
f(x) = sgn( SUM_i (alpha_i * y_i * K(sv_i, x)) + b )
where K is your kernel function.
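A small numeric sketch of the formulas quoted above; the alphas, labels and support vectors are made-up placeholders standing in for what the -a output and the model file would actually contain:

```python
import numpy as np

alphas = np.array([0.5, 1.2, 0.8])                 # alpha_i
y = np.array([1, -1, 1])                           # y_i, class labels in {+1, -1}
sv = np.array([[1.0, 0.0, 2.0],                    # sv_i, support vectors
               [0.5, 1.5, 0.0],
               [2.0, 1.0, 1.0]])

# Linear case: w = SUM_i (alpha_i * y_i * sv_i), one weight per input feature.
w = np.sum((alphas * y)[:, None] * sv, axis=0)

# Non-linear case: the coefficients only enter through the decision function
# f(x) = sgn( SUM_i (alpha_i * y_i * K(sv_i, x)) + b ), here with an RBF kernel
# as an example.
def decision(x, b=0.0, K=lambda u, v: np.exp(-np.sum((u - v) ** 2))):
    return np.sign(np.sum(alphas * y * np.array([K(s, x) for s in sv])) + b)
```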
