Is there a way to specify the folds (using my data) with H2O AutoML? - machine-learning

I'm trying to specify the folds for cross-validation in H2O. I want a specific subset of the data to be a specific fold. For example, fold number 1 corresponds to my subject 1 data, fold number 2 to my subject 2, and so on. Is there a way to do this?
I checked the H2O docs but didn't find a way to specify the folds based on my data.

Yes, there's a way to do this in H2O AutoML and all the supervised H2O algorithms.
There's an argument called fold_column in which you specify the name of the column that defines the fold ID. So if you have a "subject_id" column, just set fold_column = "subject_id".
More info in the docs.
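A minimal sketch in Python (the same fold_column argument exists in the R API); the file name, column names and AutoML settings below are placeholders for illustration:

import h2o
from h2o.automl import H2OAutoML

h2o.init()

# hypothetical file with a "subject_id" column and a target column "y"
df = h2o.import_file("my_data.csv")

# make sure the fold column is treated as categorical, not numeric
df["subject_id"] = df["subject_id"].asfactor()

aml = H2OAutoML(max_models=10, seed=1)
aml.train(y="y",
          training_frame=df,
          fold_column="subject_id")   # one CV fold per subject

print(aml.leaderboard)

With this setup, all rows belonging to a given subject end up in the same cross-validation fold.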

Related

How to use a material number as a feature for Machine Learning?

I have a problem. I would like to use a classification algorithm. For this I have a column materialNumber which, as the name suggests, contains the material number.
How could I use that as a feature for my Machine Learning algorithm?
I cannot use it e.g. as a one-hot encoding matrix, because there are too many different material numbers (~4500 unique material numbers).
How can I use this column in a classification algorithm? Do I need to standardize/normalize it? I would like to use a RandomForest classifier.
   customerId  materialNumber
0           1          1234.0
1           1          4562.0
2           2          1234.0
3           2          4562.0
4           3          1547.0
5           3          1547.0
Here you can group the material numbers by categorizing them. If you want to use a categorical variable in a machine learning algorithm, as you mentioned, you have to use the "one-hot encoding" method. But as the number of unique material number values increases, the number of columns in your data will also increase.
For example, you have a material number like this:
material_num_list=[1,2,3,4,5,6,7,8,9,10]
Suppose some of the numbers are similar to each other, for example:
[1,5,6,7], [2,3,8], [4,9,10]
We can assign group labels to these numbers ourselves:
[1,5,6,7] --> A
[2,3,8] --> B
[4,9,10] --> C
As you can see, the number of labels has decreased, and we can do "one-hot encoding" with fewer labels.
But here, the data set needs to be examined well and this grouping process needs to be done in a reasonable way. It might work if you can categorize the material numbers as I mentioned.
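As an illustration, here is a small sketch in pandas; the group assignments below are made up and would in practice come from domain knowledge or a clustering of the materials:

import pandas as pd

df = pd.DataFrame({
    "customerId":     [1, 1, 2, 2, 3, 3],
    "materialNumber": [1234.0, 4562.0, 1234.0, 4562.0, 1547.0, 1547.0],
})

# hypothetical grouping of similar material numbers into a few categories
material_to_group = {1234.0: "A", 4562.0: "B", 1547.0: "C"}
df["materialGroup"] = df["materialNumber"].map(material_to_group)

# one-hot encode the (much smaller) set of groups instead of ~4500 raw numbers
df = pd.get_dummies(df, columns=["materialGroup"])
print(df)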

How can I get the index of features for each score of rfecv?

I am using recursive feature elimination with cross-validation (RFECV) in order to find the best accuracy score over feature subsets.
As I understand it, grid_scores_ holds the score the estimator produced when trained with the i-th subset of features. Is there any way to get the indices of the feature subset behind each score in grid_scores_?
I can get the indices of the selected features for the highest score using get_support (a subset of 5 features).
subset_features    score
5                  0.976251
4                  0.9762072
3                  0.97322212
How can I get the indices of the 4- or 3-feature subsets?
I checked the output of rfecv.ranking_: the 5 selected features have rank 1, but rank 2 contains only one feature, and so on.
A (single) subset of 3 (or 4) features was (probably) never chosen!
This seems to be a common misconception about how RFECV works; see How does cross-validated recursive feature elimination drop features in each iteration (sklearn RFECV)?. There's an RFE for each cross-validation fold (say 5), and each will produce its own set of 3 features (probably a different one). Unfortunately (in this case at least), those RFE objects are not saved, so you cannot identify which set of features each fold selected; only the score is saved (source pt1, pt2) for choosing the optimal number of features, and then another RFE is trained on the entire dataset to reduce to the final set of features.
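If you do want to see which features each fold would have selected at a given subset size, you can re-run an RFE per fold yourself. A rough sketch, assuming a linear-kernel SVC and 5-fold stratified CV (substitute your own X, y, estimator and CV scheme):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
estimator = SVC(kernel="linear")

# mimic what RFECV does internally: one RFE per CV fold
for n_features in (3, 4, 5):
    print(f"--- subsets of {n_features} features ---")
    for fold, (train_idx, _) in enumerate(StratifiedKFold(n_splits=5).split(X, y)):
        rfe = RFE(estimator, n_features_to_select=n_features)
        rfe.fit(X[train_idx], y[train_idx])
        # indices of the features this fold's RFE kept
        print(f"fold {fold}: {np.where(rfe.support_)[0]}")

Note that these subsets will typically differ between folds, which is exactly why RFECV only reports a score per subset size rather than a single set of feature indices.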

Are data dependencies relevant when preparing data for neural network?

Data: I have N rows of data of the form (x, y, z), where logically f(x, y) = z, i.e. z depends on x and y; in my case the triple is (setting1, setting2, signal). Different x's and y's can lead to the same z, but those z's wouldn't mean the same thing.
There are 30 unique setting1 values, 30 unique setting2 values and 1 signal for each (setting1, setting2) pair, hence 900 signal values.
Data set: These [900,3] data points are considered 1 data set. I have many samples of these data sets.
I want to make a classification based on these data sets, but I need to flatten the data (make each data set into one row). If I flatten it, I will duplicate all the setting values (setting1 and setting2) 30 times, i.e. I will have a row with 3x900 columns.
Question:
Is it correct to keep all the duplicated setting1, setting2 values in the data set? Or should I remove them and include each unique value only once, i.e. have a row with 30 + 30 + 900 columns? I'm worried that the logical dependency of the signal on the settings will be lost this way. Is this relevant? Or shouldn't I bother including the settings at all (e.g. due to correlations)?
If I understand correctly, you are training a NN on a sample where each observation is [900,3].
You are flattening it and getting an input layer of 3*900.
Some of those values are the result of a function of others.
It matters which function, because if it is a linear function, the NN might not work:
From here:
"If inputs are linearly dependent then you are in effect introducing
the same variable as multiple inputs. By doing so you've introduced a
new problem for the network, finding the dependency so that the
duplicated inputs are treated as a single input and a single new
dimension in the data. For some dependencies, finding appropriate
weights for the duplicate inputs is not possible."
Also, if you add dependent variables you risk the NN being biased towards said variables.
E.g. If you are running LMS on [x1,x2,x3,average(x1,x2)] to predict y, you basically assign a higher weight to the x1 and x2 variables.
Unless you have a reason to believe that those weights should be higher, don't include their function.
I was not able to find a link to support this, but my intuition is that you might want to shrink your input layer in addition to omitting the dependent values:
From Professor A. Ng's ML course I remember that the input should be the minimum set of values that is 'reasonable' for making the prediction.
'Reasonable' is vague, but I understand it like this: if you try to predict the price of a house, include footage, area quality and distance from a major hub; do not include average sunspot activity during the open-house day, even though you have that data.
I would remove the duplicates; I would also look for any other data that can be omitted, and maybe run PCA over the full set of N x [900,3].
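As an illustration of that last suggestion, a minimal sketch with scikit-learn; the number of data sets and the variance threshold are placeholders, and the random matrix stands in for the flattened signal values:

import numpy as np
from sklearn.decomposition import PCA

n_datasets = 50                       # hypothetical number of data sets
X = np.random.rand(n_datasets, 900)   # one flattened row of signal values per data set (settings dropped)

# reduce the 900 signal columns to a smaller number of components
pca = PCA(n_components=0.95)          # keep enough components to explain 95% of the variance
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)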

How to do classification based on the correlation of multiple features in a supervised scenario

I have 2 features, 'Contact_Last_Name' and 'Account_Last_Name', based on which I want to classify my data:
The logic is that if the 2 features are the same, i.e. Contact_Last_Name is the same as Account_Last_Name, then the result is 'Success'; otherwise it is 'Denied'.
So, for example: if Contact_Last_Name is 'Johnson' and Account_Last_Name is 'Eigen', the result is classified as 'Denied'. If both are equal, say 'Edison', then the result is 'Success'.
How can I build a classification algorithm for this set of data?
[Please note that we usually discard highly correlated columns, but here the correlation between the columns seems to carry the logic for the classification.]
I have tried a decision tree (C5.0) and naive Bayes (naiveBayes) in R, but both fail to classify the dataset correctly.
First of all, this is not a good use case for machine learning, because it can be done with a simple string match. Still, if you want to give it to a classification algorithm, create a table with the columns 'Contact_Last_Name', 'Account_Last_Name' and 'Result', feed it to a decision tree and predict the third column.
Note that you should partition your data into training and test sets.
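One way to make this workable in practice is to encode the string comparison explicitly as a feature, since a tree has a hard time learning "these two text columns are equal" from raw label-encoded names. A small sketch of that idea in Python (the question uses R, but the same approach carries over; the data below is made up):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# made-up data with the columns from the question
df = pd.DataFrame({
    "Contact_Last_Name": ["Johnson", "Edison",  "Smith",   "Lee"],
    "Account_Last_Name": ["Eigen",   "Edison",  "Smith",   "Kim"],
    "Result":            ["Denied",  "Success", "Success", "Denied"],
})

# derived feature: do the two last names match?
df["names_match"] = (df["Contact_Last_Name"] == df["Account_Last_Name"]).astype(int)

clf = DecisionTreeClassifier()
clf.fit(df[["names_match"]], df["Result"])
print(clf.predict(pd.DataFrame({"names_match": [1, 0]})))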

Parameter selection and k-fold cross-validation

I have one dataset and need to do cross-validation, for example a 10-fold cross-validation, on the entire dataset. I would like to use a radial basis function (RBF) kernel with parameter selection (there are two parameters for an RBF kernel: C and gamma). Usually, people select the hyperparameters of an SVM using a dev set, and then apply the best hyperparameters found on the dev set to the test set for evaluation. However, in my case, the original dataset is partitioned into 10 subsets. Sequentially, one subset is tested using the classifier trained on the remaining 9 subsets. Obviously, we do not have fixed training and test data. How should I do hyperparameter selection in this case?
Is your data partitioned into exactly those 10 partitions for a specific reason? If not, you could concatenate/shuffle them together again and then do regular (repeated) cross-validation to perform a parameter grid search. For example, using 10 partitions and 10 repeats gives a total of 100 training and evaluation sets. These are used to train and evaluate all parameter sets, so you will get 100 results per parameter set you tried. The average performance per parameter set can then be computed from those 100 results.
This process is already built into most ML tools, as in this short example in R using the caret library:
library(caret)
library(lattice)
library(doMC)
registerDoMC(3)

model <- train(x = iris[, 1:4],
               y = iris[, 5],
               method = 'svmRadial',
               preProcess = c('center', 'scale'),
               tuneGrid = expand.grid(C = 3**(-3:3), sigma = 3**(-3:3)), # all permutations of these parameters get evaluated
               trControl = trainControl(method = 'repeatedcv',
                                        number = 10,
                                        repeats = 10,
                                        returnResamp = 'all', # store results of all parameter sets on all partitions and repeats
                                        allowParallel = T))

# performance of different parameter sets (e.g. average and standard deviation of performance)
print(model$results)

# visualization of the above
levelplot(x = Accuracy ~ C * sigma, data = model$results, col.regions = gray(100:0/100), scales = list(log = 3))

# results of all parameter sets over all partitions and repeats; the metrics above are calculated from these
str(model$resample)
Once you have evaluated a grid of hyperparameters, you can choose a reasonable parameter set ("model selection"), e.g. a model that performs well while still being reasonably simple.
BTW: I would recommend repeated cross-validation over plain cross-validation if possible (possibly with more than 10 repeats, but the details depend on your problem); and, as #christian-cerri already recommended, having an additional, unseen test set that is used to estimate the performance of your final model on new data is a good idea.
