Randomforest classification weka - machine-learning

The attributes have been saved in 11 columns in csv file. If the order of columns change, Do Randomforest & RandomTree could give different accuracy in each time?

Ordering of the features does not affect any of classifiers I know (except those which are specially designed to do so - like specialistic classifiers for time series and other temporal features), no matter if it is Neural Network, SVM, RandomForest, RandomTree or NaiveBayes - it is just a numerical simplification, as arrays are faster then sets, while "under the hood" they are treated as unordered sets (only with indicies showing from which dimension it comes from).
What can change is the output of the particular classifier each time you run your code due to its probabilistic/stochastic methods of learning. For example - neural networks have random initializations, RandomForests has random subsampling etc.
So answer is suprisingly "yes, it can change after order of columns change", but the reason for this is not the change in order, but fact, that after you do so, the internal random number generator already passed some cycles and will generate different numbers.

Related

Does a random forest randomly sample the data for each tree?

I appreciate bagging randomly resamples the training set for each tree, and random forests randomly select a subset of features for each tree.
My question though is does a random forest also resample the training set as well as taking a random subset of features. Is it in effect double random?
The answer is yes, most of the times, if you want to.
Random forests bootstrap the data and randomly select features.
bootstrapping means that it samples a data-set with the same size as the original dataset, but with replacement. So if you have N data points, each tree will use N data points, but some my be duplicated (as it samples them one by one with replacement).
However, it really is up to you what you do. In the sklearn implementation, the default is to bootstrap but you can flag bootstarp=False, and then you only have the random features selection.
See the documentation here:
http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

Categorical variables with large amounts of categories in XGBoost/CatBoost

I have a question regarding random forests. Imagine that I have data on users interacting with items. The number of items is large, around 10 000. My output of the random forest should be the items that the user is likely to interact with (like a recommender system). For any user, I want to use a feature that describes the items that the user has interacted with in the past. However, mapping the categorical product feature as a one-hot encoding seems very memory inefficient as a user interacts with no more than a couple of hundred of the items at most, and sometimes as little as 5.
How would you go about constructing a random forest when one of the input features is a categorical variable with ~10 000 possible values and the output is a categorical variable with ~10 000 possible values? Should I use CatBoost with the features as categorical? Or should I use one-hot encoding, and if so, do you think XGBoost or CatBoost does better?
You could also try entity embeddings to reduce hundreds of boolean features into vectors of small dimension.
It is similar to word embedings for categorical features. In practical terms you define an embedding of your discrete space of features into a vector space of low dimension. It can enhance your results and save on memory. The downside is that you do need to train a neural network model to define the embedding before hand.
Check this article for more information.
XGBoost doesn't support categorical features directly, you need to do the preprocessing to use it with catfeatures. For example, you could do one-hot encoding. One-hot encoding usually works well if there are some frequent values of your cat feature.
CatBoost does have categorical features support - both, one-hot encoding and calculation of different statistics on categorical features. To use one-hot encoding you need to enable it with one_hot_max_size parameter, by default statistics are calculated. Statistics usually work better for categorical features with many values.
Assuming you have enough domain expertise, you could create a new categorical column from existing column.
ex:-
if you column has below values
A,B,C,D,E,F,G,H
if you are aware that A,B,C are similar D,E,F are similar and G,H are similar
your new column would be
Z,Z,Z,Y,Y,Y,X,X.
In your random forest model you should removing previous column and only include this new column. By transforming your features like this you would loose explainability of your mode.

Using LSTM for binary classification

I have time series data of size 100000*5. 100000 samples and five variables.I have labeled each 100000 samples as either 0 or 1. i.e. binary classification.
I want to train it using LSTM , because of the time series nature of data.I have seen examples of LSTM for time series prediction, Is it suitable to use it in my case.
Not sure about your needs.
LSTM is best suited for sequence models, like time series you said, and your description don't look a time series.
Any way, you may use LSTM for time series, not for prediction, but for classification like this article.
In my experience, for binary classification having only 5 features you could find better methods, will consume more memory thant other methods, and could get worst results.
First of all, you can see it from a different perspective, i.e. instead of having 10,000 labeled samples of 5 variables, you should treat it as 10,000 unlabeled samples of 6 variables, where the 6th variable is the label.
Therefore, you can train your LSTM as a multivariate predictor for your 6th variable, that is the sample label and compare with the ground truth during testing to evaluate its performance.

Can a recurrent neural network learn slightly different sequences at once?

Can a recurrent neural network be used to learn a sequence with slightly different variations? For example, could I get an RNN trained so that it could produce a sequence of consecutive integers or alternate integers if I have enough training data?
For example, if I train using
1,2,3,4
2,3,4,5
3,4,5,6
and so on
and also train the same network using
1,3,5,7
2,4,6,8
3,5,7,9
and so on,
would I be able to predict both sequences successfully for the test set?
What if I have even more variations in the training data like sequences of every three integers or every four integers, et cetera?
Yes, provided there is enough information in the sequence so that it is not ambiguous, a neural network should be able to learn to complete these sequences correctly.
You should note a few details though:
Neural networks, and ML models in general, are bad at extrapolation. A simple network is very unlikely to learn about sequences in general. It will never learn the concept of sequence logic in the way a child quickly would. So if you feed in test data outside of its experience (e.g. steps of 3 between items, when they were not in the training data), it will perform badly.
Neural networks prefer scaled inputs - a common pre-processing step is to normalise to mean 0 standard deviation 1 for each input column. Whilst it is possible for a network to accept larger range of numbers at inputs, that will reduce effectiveness of training. With a generated training set such as artificial numeric sequences, you may be able to force your way through that by training for longer with more examples.
You will need more neurons, and more layers, to support a larger variation of sequences.
For a RNN, it will predict badly if the sequence it has processed so far is ambiguous. E.g. if you train 1,2,3,4 and 1,2,3,5 with equal numbers of samples, it will predict either 4.5 (for regression) or 50% chance 4 or 5 (for classifier) when it shown sequence 1,2,3 and asked to predict.

Using weka to classify sensor data

I am working on a classification problem, which has different sensors. Each sensor collect a sets of numeric values.
I think its a classification problem and want to use weka as a ML tool for this problem. But I am not sure how to use weka to deal with the input values? And which classifier will best fit for this problem( one instance of a feature is a sets of numeric value)?
For example, I have three sensors A ,B, C. Can I define 5 collected data from all sensors,as one instance? Such as, One instance of A is {1,2,3,4,5,6,7}, and one instance of B is{3,434,534,213,55,4,7). C{424,24,24,13,24,5,6}.
Thanks a lot for your time on reviewing my question.
Commonly the first classifier to try is Naive Bayes (you can find it under "Bayes" directory in Weka) because it's fast, parameter less and the classification accuracy is hard to beat whenever the training sample is small.
Random Forest (you can find it under "Tree" directory in Weka) is another pleasant classifier since it process almost any data. Just run it and see whether it gives better results. It can be just necessary to increase the number of trees from the default 10 to some higher value. Since you have 7 attributes 100 trees should be enough.
Then I would try k-NN (you can find it under "Lazy" directory in Weka and it's called "IBk") because it commonly ranks amount the best single classifiers for a wide range of datasets. The only issues with k-nn are that it scales badly for large datasets (> 1GB) and it needs to fine tune k, the number of neighbors. This value is by default set to 1 but with increasing number of training samples it's commonly better to set it up to some higher integer value in range from 2 to 60.
And finally for some datasets where both, Naive Bayes and k-nn performs poorly, it's best to use SVM (under "Functions", it's called "Lib SVM"). However, it can be hassle to set up all the parameters of the SVM to get competitive results. Hence I leave it to the end when I already know what classification accuracies to expect. This classifier may not be the most convenient if you have more than two classes to classify.

Resources