Making H2O grid search deterministic - machine-learning

To run the H2O RandomDiscreteValueWalker[DRFParameters] with deterministic results, is it sufficient to set the seed on the DRFParameters and the RandomDiscreteValueSearchCriteria? I get non-deterministic results even when the seed is fixed on both.

Related

Feature selection needed before train-test split due to the small size of test set and small size of instance. What should be done?

I am working on an NLP project where I need to predict the correct classes of short sentences, which are the instances in my case. I am using root words as features. My dataset is not very large (about 6000 instances/sentences). Since there are too many features, I used a mutual information (MI) based feature-selection method to reduce the number of features to about 1000.
My problem is: if I split the dataset and then do feature selection on the training set only, the model/classifier is built on features found in the training set only, and most of those features are absent from the testing set. As a result, the model may perform very badly.
What should I do to resolve this issue?
I am currently selecting features first and then doing CV. I know that this approach may leak information from the test set into the training set, but I am still doing it because of the issue described above.
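One common way to avoid the leakage described above is to repeat the feature selection inside every cross-validation fold, using the training part only, and then encode the held-out sentences with that same feature list. The following is a minimal sketch of that idea in plain Python, not a definitive implementation: the toy sentences, the labels, and the helper names (`mutual_information`, `select_features`, `vectorize`, `k_fold_indices`) are all assumptions made for illustration, and no classifier is trained here.

```python
import math
from collections import Counter

def mutual_information(docs, labels, term):
    """MI (natural log) between presence of `term` and the class label."""
    n = len(docs)
    joint = Counter((term in doc, label) for doc, label in zip(docs, labels))
    present = Counter(term in doc for doc in docs)
    classes = Counter(labels)
    mi = 0.0
    for (has_term, label), count in joint.items():
        p_joint = count / n
        p_independent = (present[has_term] / n) * (classes[label] / n)
        mi += p_joint * math.log(p_joint / p_independent)
    return mi

def select_features(train_docs, train_labels, k):
    """Rank the training vocabulary by MI and keep the top k terms."""
    vocabulary = sorted({term for doc in train_docs for term in doc})
    ranked = sorted(
        vocabulary,
        key=lambda term: mutual_information(train_docs, train_labels, term),
        reverse=True,
    )
    return ranked[:k]

def vectorize(doc, features):
    """Binary vector: 1 if the selected feature occurs in the sentence."""
    return [1 if feature in doc else 0 for feature in features]

def k_fold_indices(n, k):
    """Yield (train, test) index lists for a simple k-fold split."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for i in range(k):
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, folds[i]

# Hypothetical toy data: each sentence is a set of root words.
docs = [{"run", "fast"}, {"run", "slow"}, {"eat", "fast"},
        {"eat", "slow"}, {"run", "far"}, {"eat", "much"}]
labels = ["move", "move", "food", "food", "move", "food"]

fold_results = []
for train_idx, test_idx in k_fold_indices(len(docs), k=3):
    train_docs = [docs[i] for i in train_idx]
    train_labels = [labels[i] for i in train_idx]
    # Selection sees the training part of this fold only.
    features = select_features(train_docs, train_labels, k=2)
    # Held-out sentences are encoded with the same feature list.
    test_vectors = [vectorize(docs[i], features) for i in test_idx]
    fold_results.append((features, test_vectors))
```

In this sketch a word that occurs only in a held-out sentence is simply ignored, and a selected feature that is missing from a held-out sentence is encoded as 0, so every test vector has the same length as the training vectors.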

Result verification with Weka Experiment tab with individual classifier models

I ran different classifiers on the same dataset and collected some statistics from each run.
This is the summary of all classifiers
I am using Weka to train the models. Weka itself has a way to compare different algorithms, the Experiment tab, and I have used that option as well on the same dataset.
In the Experiment tab, Weka gave me results for the Kappa statistic, the root mean squared error, the relative absolute error, and so on.
Now I cannot understand how the values I got from the Experiment tab relate to the values I shared in table format in the first picture.
I presume that the initial table was populated with statistics obtained from cross-validation runs in the Weka Explorer.
The Explorer aggregates the predictions across a single cross-validation run so that it appears that you had a single test set of that size. It is only to be used as an explorative tool, hence the name.
The Experimenter records the metrics (like accuracy, RMSE, etc.) generated from each fold pair across the number of runs that you perform during your experiment. The metrics collected across multiple classifiers and/or datasets can then be analyzed using significance tests. By default, 10 runs of 10-fold CV are used, which is recommended for such comparisons. This results in 100 individual values for each metric, from which the mean and standard deviation are computed. The markers * and v indicate whether there is a statistically significant loss or win.
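The arithmetic behind those 100 values can be sketched in a few lines of Python. This is not Weka's implementation: the function name `repeated_cv_scores`, the toy labels, and the majority-class baseline used as the "classifier" are assumptions chosen only to show how one metric value per fold is collected and then summarized.

```python
import random
import statistics
from collections import Counter

def repeated_cv_scores(labels, runs=10, folds=10):
    """Collect one accuracy value per fold for every run."""
    scores = []
    n = len(labels)
    for run in range(runs):
        indices = list(range(n))
        random.Random(run).shuffle(indices)  # a different shuffle per run
        for fold in range(folds):
            test = indices[fold::folds]
            held_out = set(test)
            train = [i for i in indices if i not in held_out]
            # Stand-in model: always predict the training majority class.
            majority = Counter(labels[i] for i in train).most_common(1)[0][0]
            accuracy = sum(labels[i] == majority for i in test) / len(test)
            scores.append(accuracy)
    return scores

# Hypothetical dataset: 100 instances, two classes.
labels = ["a"] * 60 + ["b"] * 40
scores = repeated_cv_scores(labels)

# 10 runs x 10 folds give 100 values, summarized by mean and std. dev.
mean_accuracy = statistics.mean(scores)
std_accuracy = statistics.stdev(scores)
```

The Explorer, by contrast, pools the predictions of one cross-validation run into a single figure, which is why its numbers need not match the per-fold averages.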

Use k-means test results for training set SPSS

I am a student working with SPSS (statistics) for the first time. I used 1,000 rows of test data to run the k-means cluster tool and obtained the results. I now want to take those results and run them against a test set (another 1,000 rows) to see how my model did.
I am not sure how to do this; any help is greatly appreciated!
Thanks
For a clustering model (or any unsupervised model), there really is no right or wrong result. There is no target variable against which you can compare the cluster model result (the cluster allocation), so the idea of splitting the data set into a training and a testing partition does not apply to these types of models.
The best you can do is to review the output of the model and explore the cluster allocations and determine whether these appear to be useful for the intended purpose.

Tensorflow queue runner - is it possible to queue a specific subset?

In tensorflow, I plan to build some model and compare it to other baseline models with respect to different subsets of the training data. I.e. I would like to train my model and the baseline models with the same subsets of training data.
With the naive way queue runners and TF readers are implemented (e.g. im2txt), this requires duplicating the data for each selection of subsets, which in my case would require very large amounts of disk space.
It would be best if there were a way to tell the queue to fetch only samples from a specified subset of ids, or to ignore samples that are not part of a given subset of ids.
If I understand correctly, ignoring samples is not trivial, because it requires stitching samples from different reads into a single batch.
Does anybody know a way to do that? Or can anyone suggest an alternative approach that does not require pre-loading all the training data into RAM?
Thanks!
You could encode your condition as part of the keep_input parameter of tf.train.maybe_batch.
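The effect of such a keep/skip condition can be illustrated without TensorFlow. The sketch below is a framework-independent assumption of the behaviour, not the TensorFlow implementation: `batches_from_subset` is a hypothetical helper that streams records lazily, drops the ones whose id is not in the requested subset, and stitches the survivors into fixed-size batches.

```python
def batches_from_subset(records, allowed_ids, batch_size):
    """Yield batches built only from records whose id is in allowed_ids.

    `records` can be any iterable of (sample_id, payload) pairs, including
    a lazy generator, so the full dataset never has to sit in RAM.
    """
    batch = []
    for sample_id, payload in records:
        if sample_id not in allowed_ids:
            continue  # skipped samples never enter a batch
        batch.append((sample_id, payload))
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly smaller, batch

# Hypothetical stream of ten samples read one at a time.
records = ((i, f"sample-{i}") for i in range(10))
subset = {1, 3, 4, 8, 9}
batches = list(batches_from_subset(records, subset, batch_size=2))
batch_ids = [[sample_id for sample_id, _ in batch] for batch in batches]
```

Because the same stored data can be read with a different `subset` each time, the model and the baselines can be trained on identical subsets without duplicating files on disk.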

Classification in weka fails, caused by case sensitiveness of nominal values?

I made a classifier to classify search queries into one of the following classes: {Artist, Actor, Politician, Athlete, Facility, Geo, Definition, QA}. I have two csv files: one for training the classifier (contains 300 queries) and one for testing the classifier (currently contains about 200 queries). When I use the trainingset and testset for training/evaluating the classifier with weka knowledgeflow, most classes reach a pretty good accuracy. Setup of Weka knowledge flow training/testing situation:
After training I saved the MultiLayer Perceptron classifier from the knowledgeflow into classifier.model, which I used in java code to classify queries.
When I deserialize this model in Java code and use it to classify all the queries of the testing-set CSV file used in the knowledgeflow (calling the distributionForInstance() method on the deserialized classifier), it classifies all 'Geo' queries as 'Facility' queries and all 'QA' queries as 'Definition' queries. This surprised me a bit, as the ClassifierPerformanceEvaluator showed me a confusion matrix in which 'Geo' and 'QA' queries scored really well, and the testing queries are the same (the same CSV file was used). All other query classifications using the distributionForInstance() method seem to work normally and show the behavior that could be expected from the confusion matrix in the knowledgeflow. Does anyone know what could cause the difference between the classifications from distributionForInstance() in the Java code and the knowledgeflow evaluation results?
One thing that I can think of is the following:
The testing CSV file contains, among other attributes, a lot of nominal attributes whose values are in all-capital casing. When I print out the values of all attributes of the instances before classification in the Java code, these values seem to have been converted to lowercase letters (it seems that the DataSource.getDataSet() method behaves like this). Could the casing of these attributes be the reason that some instances of my testing CSV file get classified differently? I read in the Weka specification that nominal attribute values are case sensitive. When I change these values to uppercase in the Java code, though, Weka throws an exception saying that these values are not predefined for the nominal attribute.
Weka is likely using the same class in the knowledge flow as in your weka code to interpret the csv. This is why it works (produces data sets -- Instances objects -- that match) without tweaking and fails when you change things: the items don't match any more. This is to say that weka is handling the case of the input strings consistently, and does not require you to change it.
Check that you are looking at the Error on Test Data value and not the Error on Training Data value in the knowledge flow output, because the second one will look artificially good given that you built the model using those exact examples. It is possible that your classifier is performing the same in both places, but you are looking at different statistics.
