Proximal Policy Optimization Algorithms paper - definition of "KL" operation? - machine-learning

In the original paper on Proximal Policy Optimization Algorithms
https://arxiv.org/pdf/1707.06347.pdf
in equation (4) the authors use an operation denoted by KL[]. Unfortunately, they never give a definition for it.
My question:
What does the KL[] operation stand for?

Yes, KL[] stands for the Kullback-Leibler (KL) divergence.
KL divergence measures the difference between two probability distributions; in the paper it compares the action distributions of the old policy and the updated policy at a given state.
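For two discrete distributions p and q, KL[p, q] = sum_x p(x) * log(p(x) / q(x)). A minimal NumPy sketch (the two action distributions below are made up purely for illustration):

import numpy as np

def kl_divergence(p, q):
    """KL[p, q] = sum over x of p(x) * log(p(x) / q(x)) for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q)))

# Hypothetical action distributions of the old and the updated policy for one state.
pi_old = np.array([0.5, 0.3, 0.2])
pi_new = np.array([0.4, 0.4, 0.2])

print(kl_divergence(pi_old, pi_new))  # small positive number; zero only if the two policies match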

Related

Relation between coefficients in linear regression and feature importance in decision trees

Recently I have been working on a machine learning (ML) project which needs to identify the features (inputs a1, a2, a3, ..., an) that have a large impact on the target/output.
I used linear regression to get the coefficients of the features, and a decision-tree algorithm (for example, a random forest regressor) to get the important features (i.e. the feature importances).
Is my understanding right that a feature with a large coefficient in linear regression should also be near the top of the feature-importance ranking of the decision-tree algorithm?
Not really. If your input features are not normalized, a feature with a relatively large mean/std can end up with a relatively large coefficient. If your features are normalized, then yes, the coefficients can be an indicator of feature importance, but there are still other things to consider.
You could also try some of sklearn's feature-selection classes, which should do this automatically for you.
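As a rough illustration of the normalization point, here is a scikit-learn sketch on synthetic data (the dataset and model settings below are made up for illustration), standardizing the features before comparing the linear-regression coefficients with the random-forest importances:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic regression data: 5 features, only 3 of which carry signal.
X, y = make_regression(n_samples=500, n_features=5, n_informative=3, random_state=0)
X_std = StandardScaler().fit_transform(X)   # put all features on a common scale

lin = LinearRegression().fit(X_std, y)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Even with standardized inputs the two rankings need not agree, since the forest
# also captures non-linearities and interactions that a linear model cannot.
print("abs. standardized coefficients:", np.abs(lin.coef_))
print("random-forest importances:     ", rf.feature_importances_)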
The short answer to your question is no, not necessarily, considering the fact that we do not know what your different inputs are, whether they are in the same unit system, their ranges of variation, etc.
I am not sure why you have combined linear regression with a decision tree, but I will just assume you have a working model, say a linear regression that provides good accuracy on the test set. From what you have asked, you probably need to look at sensitivity analysis based on the obtained model. I would suggest doing some reading on the "SALib" library and, more generally, on the subject of sensitivity analysis.
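As a rough sketch of what such a sensitivity analysis can look like with SALib (a Sobol analysis; the variable names, bounds, and the stand-in fitted model are placeholders, and the exact module layout may differ between SALib versions):

import numpy as np
from SALib.sample import saltelli
from SALib.analyze import sobol
from sklearn.linear_model import LinearRegression

# Stand-in for "the obtained model": a regressor fitted on made-up data.
rng = np.random.default_rng(0)
X_train = rng.random((200, 3))
y_train = X_train @ np.array([3.0, 1.0, 0.2]) + rng.normal(0, 0.1, size=200)
model = LinearRegression().fit(X_train, y_train)

# Hypothetical problem definition: three inputs with assumed plausible ranges.
problem = {"num_vars": 3, "names": ["a1", "a2", "a3"], "bounds": [[0.0, 1.0]] * 3}

param_values = saltelli.sample(problem, 1024)   # sample the input space
Y = model.predict(param_values)                 # run the model on the sampled inputs
Si = sobol.analyze(problem, Y)                  # Sobol sensitivity indices
print(Si["S1"], Si["ST"])                       # first-order and total-order effect per input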

Is it possible to calculate AUC using OOB sample in Bagged trees?

I have a few questions about OOB samples in bagged trees.
1. Do we always calculate only the error on OOB samples? If yes, which error metric is used for evaluation (like RMSE, misclassification error)?
2. Also, do we have this OOB concept in boosting?
Is it possible to calculate AUC using OOB sample in Bagged trees?
An ROC curve is the most commonly used way to visualize the performance of a binary classifier, and AUC is (arguably) the best way to summarize its performance in a single number. It does not matter whether you are using bagged trees or not. You can find a nice explanation here.
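For bagged trees specifically, scikit-learn exposes the OOB class probabilities, which can be fed straight into an AUC computation. A rough sketch on made-up data (enough trees are used so that every sample is out-of-bag at least once):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)   # toy binary problem

bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                        oob_score=True, random_state=0).fit(X, y)

oob_proba = bag.oob_decision_function_[:, 1]   # OOB probability of class 1 for every training sample
print("OOB AUC:", roc_auc_score(y, oob_proba))
print("OOB accuracy:", bag.oob_score_)         # the default OOB score is accuracy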
1. Do we always calculate only the error on OOB samples?
Not necessarily; before bootstrapping, you can set aside a validation set and do cross-validation.
If yes, which error metric is used for evaluation (like RMSE, misclassification error)?
If it is a regression problem, the residual sum of squares (RSS) for the tree can be used.
For a classification problem, the misclassification error rate can be used.
2. Also, do we have this OOB concept in boosting?
Let's first recall what OOB is. The key to bagging is that trees are repeatedly fit to bootstrapped subsets of the observations. On average, each bagged tree makes use of around two-thirds of the observations. The remaining one-third of the observations, not used to fit a given bagged tree, are referred to as the out-of-bag (OOB) observations. Reference: An Introduction to Statistical Learning, Section 8.2.1, Out-of-Bag Error Estimation.
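The "around two-thirds" figure comes from the bootstrap itself: the chance that a given observation is never drawn in n draws with replacement is (1 - 1/n)^n, which tends to e^(-1) ≈ 0.368, so roughly 63% of the observations appear in each bootstrap sample. A quick numerical check (the sample size is arbitrary):

import numpy as np

n = 1000
rng = np.random.default_rng(0)
bootstrap = rng.integers(0, n, size=n)      # one bootstrap sample: n draws with replacement
print(len(np.unique(bootstrap)) / n)        # ~0.63: fraction of observations used by this tree
print((1 - 1 / n) ** n)                     # ~0.37: probability a given observation is OOB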
Boosting does not involve bootstrap sampling; instead, each tree is fit on a modified version of the original data set. Reference: An Introduction to Statistical Learning, Section 8.2.3.
Therefore, going by the definition, the OOB concept is not applicable to boosting.
Note, however, that most implementations of boosted-tree algorithms have an option to enable some form of OOB estimate, usually via row subsampling. Please refer to the documentation of the respective implementation to understand its version.
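For example, scikit-learn's gradient boosting records an OOB-style estimate per boosting stage once row subsampling is turned on (subsample < 1.0). A rough sketch on made-up data:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=1000, random_state=0)   # toy binary problem

# subsample < 1.0: each tree sees a random fraction of the rows, the rest act as its OOB set.
gb = GradientBoostingClassifier(n_estimators=200, subsample=0.5,
                                random_state=0).fit(X, y)

cumulative = np.cumsum(gb.oob_improvement_)   # cumulative OOB improvement of the loss per stage
print("OOB-suggested number of stages:", int(np.argmax(cumulative)) + 1)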

K-Medoids Cluster Analysis

What are some analysis functions that can be used with the k-medoids algorithm?
My main aim is to compare the results of two different clustering runs in order to see which is better.
Can SSE (the sum of squared errors) be applied to the k-medoids algorithm?
The original k-medoids publication discusses this measure (ESS), along with several other measures, such as average dissimilarity, maximum dissimilarity, and diameter, that may be more appropriate to use.
SSE is closely tied to Euclidean distance, so it usually is not appropriate (unless, of course, you use Euclidean distance; but then why would you use k-medoids instead of k-means?).
ARI (adjusted Rand index), NMI (normalized mutual information), and the silhouette coefficient can be used to compare the results.
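A rough scikit-learn sketch of those three measures (the data is synthetic, the two runs are just arbitrary configurations to compare, and the KMedoids class is assumed to come from the separate scikit-learn-extra package):

from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, normalized_mutual_info_score,
                             silhouette_score)
from sklearn_extra.cluster import KMedoids   # k-medoids implementation from scikit-learn-extra

X, y_true = make_blobs(n_samples=500, centers=3, random_state=0)   # toy data with known labels

labels_a = KMedoids(n_clusters=3, metric="manhattan", random_state=0).fit_predict(X)
labels_b = KMedoids(n_clusters=5, metric="manhattan", random_state=0).fit_predict(X)

# Internal measure: needs only the data and the labels; higher is better.
print(silhouette_score(X, labels_a, metric="manhattan"),
      silhouette_score(X, labels_b, metric="manhattan"))

# External measures: compare each result against a reference labeling, if one exists.
print(adjusted_rand_score(y_true, labels_a), normalized_mutual_info_score(y_true, labels_a))
print(adjusted_rand_score(y_true, labels_b), normalized_mutual_info_score(y_true, labels_b))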

What is an example of using the AdaBoost (Adaptive Boosting) approach with decision trees?

Is there any good tutorial that explains how the samples are weighted during successive iterations of constructing the decision trees for a sample training set? Specifically, I want to know how the weights are assigned after the first decision tree is constructed.
The decision tree is built using information gain as its splitting criterion, and I am wondering how this is affected when the misclassifications from previous iterations are weighted.
Any good tutorial / example is highly appreciated.
A Short Introduction to Boosting from Freund and Schapire supplies an example of the AdaBoost algorithm using Quinlan's C4.5 Decision Tree model.
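To make the reweighting step concrete, here is a simplified sketch of one AdaBoost round with a decision stump on made-up data (labels mapped to {-1, +1}); the updated weights at the end are what the next tree's weighted information-gain splits would use:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, random_state=0)
y = np.where(y == 1, 1, -1)                  # AdaBoost convention: labels in {-1, +1}

w = np.full(len(y), 1.0 / len(y))            # round 1: uniform sample weights

stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
pred = stump.predict(X)

err = np.sum(w[pred != y]) / np.sum(w)       # weighted training error of this tree
alpha = 0.5 * np.log((1 - err) / err)        # this tree's vote in the final ensemble

# Misclassified samples are up-weighted, correctly classified ones down-weighted, then renormalized;
# the next tree is fit with these new sample weights.
w = w * np.exp(-alpha * y * pred)
w /= w.sum()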

Implementing Vocabulary Tree in OpenCV

I am trying to implement image search based on the paper "Scalable Recognition with a Vocabulary Tree". I am using SURF to extract features and keypoints. For example, for one image I get, say, 300 keypoints, and each keypoint has 128 descriptor values. My question is how to apply the k-means clustering algorithm to this data. Do I need to apply the clustering algorithm to all the points, i.e. the 300 x 128 values, or do I need to compute the distances between consecutive descriptors, store those values, and apply the clustering algorithm to them? I am confused, and any help will be appreciated.
Thanks,
Rocky.
From your question I would say you are quite confused. The vocabulary tree technique is grounded in the use of hierarchical k-means clustering and a TF-IDF weighting scheme for the leaf nodes.
In a nutshell, the clustering algorithm employed for the vocabulary tree construction runs k-means once over all the d-dimensional data (d = 128 in the case of SIFT) and then runs k-means again within each of the obtained clusters, down to some depth level. Hence the two main parameters of the vocabulary tree construction are the branching factor k and the tree depth L. Some improvements consider only the branching factor, while the depth is determined automatically by cutting the tree to satisfy a minimum-variance criterion.
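A rough Python sketch of that recursive scheme (scikit-learn's k-means as a stand-in; the descriptor matrix, branching factor k, and depth L below are placeholders), just to make the "run k-means again inside each cluster" step explicit:

import numpy as np
from sklearn.cluster import KMeans

def build_vocab_tree(descriptors, k=10, depth=3):
    """Recursively cluster the descriptors of each node until the depth limit (or too few points)."""
    node = {"center": descriptors.mean(axis=0), "children": []}
    if depth == 0 or len(descriptors) < k:
        return node                                   # leaf of the vocabulary tree
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(descriptors)
    for c in range(k):                                # one child sub-tree per cluster
        node["children"].append(build_vocab_tree(descriptors[labels == c], k, depth - 1))
    return node

# All descriptors from all training images stacked row-wise (placeholder random data,
# standing in for e.g. 300 keypoints x 128 SURF values per image).
all_descriptors = np.random.rand(5000, 128).astype(np.float32)
tree = build_vocab_tree(all_descriptors, k=10, depth=3)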
As for the implementation, cv::BOWTrainer from OpenCV is a good starting point, though it is not well generalized to the hierarchical BoW case: it requires the centers to be stored in a single cv::Mat, whereas a vocabulary tree is typically unbalanced, and mapping it to a matrix in a level-wise fashion can be memory-inefficient when the number of nodes is much lower than the theoretical number of nodes of a balanced tree with depth L and branching factor k, that is:
n << (1-k^L)/(1-k)
From what I know, you have to store all the descriptors in a cv::Mat and then add this to a k-means trainer, so that you can finally apply the clustering algorithm. Here is a snippet that can give you an idea of what I am talking about:
BOWKMeansTrainer bowtrainer(1000); //num clusters
bowtrainer.add(training_descriptors); // we add the descriptors
Mat vocabulary = bowtrainer.cluster(); // apply the clustering algorithm
And this may also be interesting to you: http://www.morethantechnical.com/2011/08/25/a-simple-object-classifier-with-bag-of-words-using-opencv-2-3-w-code/
Good luck!!
Check out the code in libvot, in src/vocab_tree/clustering.*; there you can find a detailed implementation of the clustering algorithm.
