Setting a seed for alglib decision forest implementation - random-forest

I am using alglib to train a random forest. I would like to train a number of forests using the same input data and the same set of input variables. To do this I need to control the seed of the random number generator, but I cannot find a way to access it. Does anyone know whether such functionality is provided?

Related

How to get the final equation that the Random Forest algorithm uses on your independent variables to predict your dependent variable?

I am working on optimizing a manufacturing based dataset which consists of a huge number of controllable parameters. The goal is to attain the best run settings of these parameters.
I familiarized myself with several predictive algorithms while doing my research. If I, say, use Random Forest to predict my dependent variable and to understand how important each independent variable is, is there a way to extract the final equation/relationship the algorithm uses?
I'm not sure if my question was clear enough, please let me know if there's anything else I can add here.
There is no general way to get an interpretable equation from a random forest that explains how your covariates affect the dependent variable. For that you can use a different, more suitable model, e.g., linear regression (perhaps with kernel functions) or a decision tree. Note that you can use one model for prediction and another for descriptive analysis; there's no inherent reason to stick with a single model.
use Random Forest to predict my dependent variable to understand how important each independent variable is
Understanding how important each independent variable is does not necessarily require what the title of your question asks for, namely the actual relationship. Most random forest packages have a method that quantifies how much each covariate affected the model over the training set.
There are a number of methods for estimating feature importance from a trained model. For Random Forest, the best-known methods are MDI (Mean Decrease in Impurity) and MDA (Mean Decrease in Accuracy). Many popular ML libraries support feature importance estimation for Random Forest out of the box.
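As a rough illustration of the MDA idea, here is a minimal permutation-importance sketch in plain Python. The names `permutation_importance`, `accuracy` and `toy_model` are hypothetical and exist only for this example; `toy_model` is a stand-in for a trained forest and deliberately ignores the second feature. Real libraries implement this more carefully, so treat it as a sketch of the technique rather than any package's actual method.

```python
import random

def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return hits / len(rows)

def permutation_importance(predict, rows, labels, n_repeats=5, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle only column j, leaving every other column untouched.
            column = [row[j] for row in rows]
            rng.shuffle(column)
            shuffled = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical stand-in for a trained model: it only looks at feature 0.
def toy_model(row):
    return 1 if row[0] > 0.5 else 0

data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [toy_model(row) for row in X]
importances = permutation_importance(toy_model, X, y)
```

Shuffling the ignored feature leaves accuracy unchanged, so its importance comes out as zero, while shuffling the feature the model depends on produces a clear drop.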

How to get probability of data generated by generator using gan model?

We know that a generative adversarial network (GAN) can generate data that is similar to real data. In general, the generator takes a random variable z as input and generates a vector representing data x. When I have new data, how can I calculate P_G(x), the probability of the GAN generating it?
You cannot directly compute the likelihood of data using a GAN.
The original GAN paper uses a Parzen-window-based estimate of the likelihood. You can generate data from the GAN and use that data to estimate the likelihood with whatever method you like.
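A minimal sketch of the Parzen window idea, assuming a Gaussian kernel with a single bandwidth `sigma`. The function name `parzen_log_likelihood` is made up for this example, and the "generated" samples are drawn from a 2-D standard normal as a stand-in for real generator output. The bandwidth is an assumption you would normally tune on held-out data.

```python
import math
import random

def parzen_log_likelihood(x, samples, sigma):
    """Log of a Gaussian Parzen-window density estimate at point x.

    x and every sample are equal-length sequences of floats; sigma is the
    kernel bandwidth.
    """
    d = len(x)
    log_norm = -0.5 * d * math.log(2.0 * math.pi * sigma ** 2)
    log_kernels = []
    for s in samples:
        sq_dist = sum((xi - si) ** 2 for xi, si in zip(x, s))
        log_kernels.append(log_norm - sq_dist / (2.0 * sigma ** 2))
    # Log-mean-exp keeps the average of the kernels numerically stable.
    m = max(log_kernels)
    return m + math.log(sum(math.exp(v - m) for v in log_kernels) / len(log_kernels))

# Hypothetical stand-in for generator output: 2-D standard normal samples.
rng = random.Random(0)
generated = [[rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)] for _ in range(500)]

near = parzen_log_likelihood([0.0, 0.0], generated, sigma=0.5)
far = parzen_log_likelihood([5.0, 5.0], generated, sigma=0.5)
```

A point in the middle of the generated samples gets a higher estimated log-likelihood than a point far away from all of them. The estimate is known to be sensitive to the bandwidth and to degrade in high dimensions, so it is a rough proxy at best.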

Random forest in sklearn

I was trying to fit a random forest model using the random forest classifier from sklearn. However, my data set contains columns with string values ('country'). The random forest classifier here does not take string values; it needs numerical values for all the features. I thought of using dummy variables in place of such columns, but I am confused about what the feature importance plot will now look like. There will be variables like country_India, country_usa etc. How can I get the consolidated importance of the country variable, as I would if I had done my analysis using R?
You will have to do it by hand. There is no support in sklearn for mapping classifier-specific methods through the inverse transform of feature mappings. R calculates importances based on multi-valued splits (as #Soren explained); when using scikit-learn you are limited to binary splits and you have to approximate the actual importance. One of the simplest solutions (although biased) is to record which features are actually binary encodings of your categorical variable and sum the corresponding elements of the feature importance vector. This is not fully justified from a mathematical perspective, but it is the simplest way to get a rough estimate. To do it correctly you would have to reimplement feature importance from scratch, and when counting "for how many samples the feature is active during classification", use your mapping to correctly assign each sample only once to the actual feature (adding dummy importances counts each dummy variable on the classification path, whereas you want min(1, #dummy on path) instead).
A random enumeration (assigning some integer to each category) of the countries will sometimes work quite well, especially if the categories are few and the training set is large. Sometimes it works better than one-hot encoding.
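The simple (biased) summing approach could look roughly like this. The importance values below are made up for illustration, and `consolidate_importances` and `dummy_to_source` are hypothetical names; in practice the values would come from the fitted forest's importance vector, paired with your encoded column names.

```python
from collections import defaultdict

def consolidate_importances(importances, dummy_to_source):
    """Sum per-column importances back onto the original variables.

    importances: dict mapping encoded column name -> importance
    dummy_to_source: dict mapping dummy column name -> original column name;
    columns missing from this mapping are treated as their own source.
    """
    totals = defaultdict(float)
    for column, value in importances.items():
        totals[dummy_to_source.get(column, column)] += value
    return dict(totals)

# Made-up importances, as a fitted forest might report for encoded columns.
importances = {"age": 0.40, "country_India": 0.25, "country_usa": 0.20, "country_other": 0.15}
dummy_to_source = {
    "country_India": "country",
    "country_usa": "country",
    "country_other": "country",
}
consolidated = consolidate_importances(importances, dummy_to_source)
```

Keeping an explicit dummy-to-source mapping, rather than splitting column names on an underscore, avoids mistakes when an original column name itself contains underscores.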
Some threads discussing the two options with sklearn:
https://github.com/scikit-learn/scikit-learn/issues/5442
How to use dummy variable to represent categorical data in python scikit-learn random forest
You can also choose an RF algorithm that truly supports categorical data, such as Arborist (Python and R front ends), extraTrees (R, Java, RF-ish) or randomForest (R). Why sklearn chose not to support categorical splits, I don't know. Perhaps convenience of implementation.
The number of possible categorical splits to try blows up after 10 categories and the search becomes slow and the splits may become greedy. Arborist and extraTrees will only try a limited selection of splits in each node.
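To get a feel for how fast this grows, assume an exhaustive search over every way of dividing k categories into two non-empty groups (this is a counting illustration, not a claim about what any particular library does). There are 2^(k-1) - 1 such groupings:

```python
def binary_partitions(k):
    """Number of ways to split k categories into two non-empty groups."""
    return 2 ** (k - 1) - 1

counts = {k: binary_partitions(k) for k in (3, 5, 10, 20)}
# counts == {3: 3, 5: 15, 10: 511, 20: 524287}
```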

When classifying, does one need to normalize new incoming features when predicting on real data?

There are two data sets - the training one and a data set of features, labels for which are yet to be predicted (the new one).
I built a Random Forest classifier. Along the way I had to do two things:
Normalize continuous numeric features.
Perform a one-hot-encoding on the categorical ones.
Now I have two questions. When I am predicting labels for the new data:
Do I need to normalize the incoming features? (common sense tells me that yes :) ) If so, should I take the mean, max, min values for a specific feature from the training data set or should I somehow take into account the new values of the features?
How do I one-hot-encode the new values of the features? Do I expand the dictionary of the possible categories for a specific category taking into account the possibly new values of the features?
In my case I possess both data sets, so I could calculate all this stuff in advance, but what if I only had a classifier and a new data set?
I only have a basic knowledge of the type of classifiers and normalization techniques you're using, but the general rule, that I think applies to what you're doing as well, is to do the following.
Your classifier is not a Random Forest Classifier. That is only one step of the pipeline that acts as your actual classifier. This pipeline / actual classifier is what you describe:
Normalize continuous numeric features.
Perform a one-hot-encoding on the categorical ones.
Use a Random Forest Classifier on what you get from the first 2 steps.
This pipeline, that encompasses 3 things, is what you're actually using as your classifier.
Now, how does a classifier work?
You build some state based on the training data.
You use that state to make predictions on the test data.
So:
Do I need to normalize the incoming features? (common sense tells me that yes :) ) If so, should I take the mean, max, min values for a specific feature from the training data set or should I somehow take into account the new values of the features?
Your classifier normalizes the incoming features for the training data, so it will normalize those for unseen instances too. To do this, it must use the state it has built during training.
For example, if you were doing min-max scaling on your features, your state would store a min(f) and max(f) for each feature f. Then, during testing / prediction, you would do min-max scaling for each feature f using the stored min(f) and max(f) values.
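A minimal sketch of that stored state in plain Python. The class name `MinMaxScalerState` and the toy numbers are invented for this example; a library scaler would play the same role.

```python
class MinMaxScalerState:
    """Stores per-feature min and max from training data and reuses them."""

    def fit(self, rows):
        # Learn min(f) and max(f) for each feature f from the training rows.
        columns = list(zip(*rows))
        self.mins = [min(c) for c in columns]
        self.maxs = [max(c) for c in columns]
        return self

    def transform(self, rows):
        # Scale with the stored training statistics, never with new ones.
        scaled = []
        for row in rows:
            scaled.append([
                (v - lo) / (hi - lo) if hi > lo else 0.0
                for v, lo, hi in zip(row, self.mins, self.maxs)
            ])
        return scaled

train = [[1.0, 10.0], [3.0, 30.0], [5.0, 20.0]]
new = [[2.0, 40.0]]

scaler = MinMaxScalerState().fit(train)
train_scaled = scaler.transform(train)
new_scaled = scaler.transform(new)  # [[0.25, 1.5]]
```

Note that the second new value scales to 1.5: unseen data can fall outside the training range, and the stored min and max are still the ones to use.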
I'm not sure what you mean by "normalize continuous numeric features". Do you mean discretization? If you build some state for this discretization during training, then you need to find a way to factor that in.
How do I one-hot-encode the new values of the features? Do I expand the dictionary of the possible categories for a specific category taking into account the possibly new values of the features?
Don't you know beforehand how many values each category can have? Usually you do (since categoricals are things like nationality, continent etc. - things you know in advance). If you can get a value for a categorical feature that you haven't seen during training, it raises the question of whether you should even care about it. What good is a categorical value you've never trained on?
Maybe add an "unknown" category. I think expanding for a single one should be fine, what good are more going to do if you've never trained on them?
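One way the "unknown" category could be sketched, with the category list fixed at training time. The class name `OneHotWithUnknown` and the `<unknown>` label are hypothetical choices for this example.

```python
class OneHotWithUnknown:
    """One-hot encoder whose categories are fixed at training time."""

    UNKNOWN = "<unknown>"  # hypothetical label reserved for unseen values

    def fit(self, values):
        # The category list is learned once and then frozen.
        self.categories = sorted(set(values)) + [self.UNKNOWN]
        self.index = {c: i for i, c in enumerate(self.categories)}
        return self

    def transform(self, values):
        encoded = []
        for v in values:
            row = [0] * len(self.categories)
            # Anything not seen during fit falls into the unknown slot.
            row[self.index.get(v, self.index[self.UNKNOWN])] = 1
            encoded.append(row)
        return encoded

encoder = OneHotWithUnknown().fit(["India", "usa", "India", "France"])
encoded_new = encoder.transform(["usa", "Brazil"])
# categories: ["France", "India", "usa", "<unknown>"]
# encoded_new: [[0, 0, 1, 0], [0, 0, 0, 1]]
```

The encoded width stays the same for training and new data, which is what the downstream classifier needs. The unknown column is all zeros during training unless you deliberately map some rare training values to it.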
What kind of categoricals do you have?
I could be wrong, but do you really need one-hot encoding? AFAIK, tree-based classifiers don't seem to benefit that much from it.

How to output resultant documents from Weka text-classification

So we are running a multinomial naive Bayes classification algorithm on a set of 15k tweets. We first break up each tweet into a vector of word features using Weka's StringToWordVector function. We then save the results to a new arff file to use as our training set. We repeat this process with another set of 5k tweets and evaluate that test set using the same model derived from our training set.
What we would like to do is output each sentence that Weka classified in the test set along with its classification... We can see the general information (precision, recall, F-score) on the performance and accuracy of the algorithm, but we cannot see the individual sentences that were classified by Weka, based on our classifier... Is there any way to do this?
Another problem is that ultimately our professor will give us 20k more tweets and expect us to classify this new document. We are not sure how to do this however as:
All of the data we have been working with has been classified manually, both the training and test sets...
However, the data we will be getting from the professor will be UNclassified... How can we reevaluate our model on the unclassified data if Weka requires that the attribute information must be the same as the set used to form the model and the test set we are evaluating against?
Thanks for any help!
The easiest way to accomplish these tasks is to use a FilteredClassifier. This kind of classifier combines a Filter and a Classifier, so you can connect a StringToWordVector filter with the classifier you prefer (J48, NaiveBayes, whatever). You always keep the original training set (unprocessed text) and apply the classifier to new (unprocessed) tweets using the vocabulary derived by the StringToWordVector filter.
You can see how to do this in the command line in "Command Line Functions for Text Mining in WEKA" and via a program in "A Simple Text Classifier in Java with WEKA".

Resources