How to increase minimum iterations in BigQuery ML - machine-learning

I've tried out the ML functions and only 2 iterations are made. I've started reading up on how to set more iterations, but only the maximum number of iterations is configurable.
Is there a way to set a minimum number of iterations?
Btw, is there an augmentation feature that lets you generate training data?
Also, what numbers should we try for l1_reg and l2_reg to improve an accuracy of 56%?

To increase the number of iterations:
1- You need to set the number of iterations using max_iterations (the default is 10, so you don't need to change this for now).
2- Set min_rel_progress to a number that is less than the loss improvement between two consecutive iterations; for example, you can set it to 0.0001.
Without seeing your data and use case it is hard for me to say what l1_reg and l2_reg should be, or, in general, why you are getting low accuracy. My general guess is that you do not have good training data or good features.

Another option is to set early_stop to false, so that BQML runs the full max_iterations iterations (the default is 20); see the sketch below.
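As a concrete illustration, here is a minimal sketch of a CREATE MODEL statement that sets these options, submitted through the BigQuery Python client. The model and table names (mydataset.mymodel, mydataset.training_table) are placeholders, the option values are examples rather than recommendations, and the table is assumed to contain a column named label.

from google.cloud import bigquery

client = bigquery.Client()  # assumes credentials and a default project are configured

# Hypothetical model/table names; option values are illustrative only.
sql = """
CREATE OR REPLACE MODEL `mydataset.mymodel`
OPTIONS(
  model_type='logistic_reg',
  max_iterations=20,        -- upper bound on the number of training iterations
  min_rel_progress=0.0001,  -- keep iterating while the loss improves by more than this
  early_stop=false,         -- run all max_iterations even if the loss stops improving
  l1_reg=0.1,               -- regularization strengths; tune these on a validation set
  l2_reg=0.1
) AS
SELECT *
FROM `mydataset.training_table`
"""

client.query(sql).result()  # blocks until training finishes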

The reason the training stopped is probably that the model is not converging and the training/evaluation loss is increasing from one iteration to the next.
JiaXun Wu's answer will allow the training to continue even if the model is not converging.
You can also check whether you have filled in null values yourself. I haven't found documentation on how null values are handled by BQML, but for my models, training failed to converge using the default null-value fill-in method.
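If you want to handle nulls yourself before training, one option is to impute them in a preprocessing step and load the cleaned data into BigQuery. The snippet below is a minimal sketch using pandas on a local copy of the training data; the file names and the fill strategy are made up for illustration.

import pandas as pd

# Hypothetical local copy of the training data containing missing values.
df = pd.read_csv("training_data.csv")

# Fill numeric columns with the column median and everything else with an
# explicit "missing" token, so nothing is imputed for you silently.
for col in df.columns:
    if pd.api.types.is_numeric_dtype(df[col]):
        df[col] = df[col].fillna(df[col].median())
    else:
        df[col] = df[col].fillna("missing")

df.to_csv("training_data_filled.csv", index=False)  # then load this table into BigQuery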

Related

Mean error on training set equals mean error on testing set?

I am training a deep network (GoogleNet) on an image classification problem. I have a dataset of around 7300 images labeled with only 2 classes.
I divided my set into a training set and a validation set with proportions 0.66 / 0.33.
During training I compute the mean error on the training set and on the testing set to see how it evolves.
The thing is that these two values are always equal (or really close).
So maybe it is not an issue, but I did not expect that to happen. Since I am training on my training set, I expected the mean error on my training set to always be less than the mean error on my testing set (even if I hoped for these two values to converge towards around the same value).
Maybe someone here could tell me whether this is normal or not? If it is expected, why? And if it's not, any idea what is going on?
Further info that might be useful: I use mini-batches of 50 and the Adam optimizer, my loss is computed with tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_predict), and I use a dropout of 0.4 (but when I compute the mean error I make sure it is set to 1).
Thank you.
This is quite reasonable. You partitioned your data into two random samples from the same population, so yes, given the sizes of the samples, they should have nearly identical averages. This is a simple effect of the law of large numbers: the means of large samples taken from the same population will both be close to the population mean, and hence close to each other.
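To see the sampling argument numerically, here is a small sketch (the numbers are arbitrary, and it only illustrates the statistics, not the behaviour of a trained network): split a synthetic population of per-example errors into two random parts with the same 0.66 / 0.33 proportions and compare the means.

import numpy as np

rng = np.random.default_rng(0)

# Pretend these are per-example errors over the whole dataset (arbitrary distribution).
errors = rng.gamma(shape=2.0, scale=0.5, size=7300)

# Random 0.66 / 0.33 split, like the train/validation partition in the question.
idx = rng.permutation(errors.size)
cut = int(0.66 * errors.size)
train_err, valid_err = errors[idx[:cut]], errors[idx[cut:]]

print(train_err.mean(), valid_err.mean())  # the two means come out very close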

Caffe | solver.prototxt values setting strategy

On Caffe, I am trying to implement a Fully Convolutional Network for semantic segmentation. I was wondering whether there is a specific strategy for setting the 'solver.prototxt' values for the following hyper-parameters:
test_iter
test_interval
iter_size
max_iter
Does it depend on the number of images you have for your training set? If so, how?
In order to set these values in a meaningful manner, you need to have a few more bits of information regarding your data:
1. Training set size: the total number of training examples you have; let's call this quantity T.
2. Training batch size: the number of training examples processed together in a single batch; this is usually set by the input data layer in the 'train_val.prototxt'. For example, in this file the train batch size is set to 256. Let's denote this quantity by tb.
3. Validation set size: the total number of examples you set aside for validating your model; let's denote this by V.
4. Validation batch size: the value set in batch_size for the TEST phase. In this example it is set to 50. Let's call this vb.
Now, during training, you would like to get an un-biased estimate of the performance of your net every once in a while. To do so you run your net on the validation set for test_iter iterations. To cover the entire validation set you need to have test_iter = V/vb.
How often would you like to get this estimation? It's really up to you. If you have a very large validation set and a slow net, validating too often will make the training process too long. On the other hand, not validating often enough may prevent you from noting if and when your training process failed to converge. test_interval determines how often you validate: usually for large nets you set test_interval in the order of 5K, for smaller and faster nets you may choose lower values. Again, all up to you.
In order to cover the entire training set (completing an "epoch") you need to run T/tb iterations. Usually one trains for several epochs, thus max_iter=#epochs*T/tb.
Regarding iter_size: this allows you to average gradients over several training mini-batches; see this thread for more information.
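Putting the formulas together, here is a minimal sketch that derives the solver values from the four quantities above; the concrete numbers are made up for illustration.

import math

# Made-up example sizes; substitute your own dataset and batch sizes.
T = 50000    # training set size
tb = 256     # training batch size (TRAIN phase in train_val.prototxt)
V = 10000    # validation set size
vb = 50      # validation batch size (TEST phase in train_val.prototxt)
epochs = 30  # how many passes over the training set you want

test_iter = math.ceil(V / vb)          # cover the whole validation set on each test run
max_iter = math.ceil(epochs * T / tb)  # iterations needed for the requested number of epochs
test_interval = 5000                   # how often to validate; a judgment call (see above)

print(f"test_iter={test_iter}, test_interval={test_interval}, max_iter={max_iter}")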

Training Random Forest with different datasets gives totally different results! Why?

I am working with a dataset which contains 12 attributes, including a timestamp, and one attribute as the output. It also has about 4000 rows, and there is no duplication in the records. I am trying to train a random forest to predict the output. For this purpose I created two different datasets:
ONE: Randomly choose 80% of the data for training and the other 20% for testing.
TWO: Sort the dataset by timestamp, then take the first 80% for training and the last 20% for testing.
Then I removed the timestamp attribute from both datasets and used the other 11 attributes for training and testing (I am sure the timestamp should not be part of the training).
RESULT: I am getting totally different results for these two datasets. For the first one the AUC (area under the curve) is 85%-90% (I ran the experiment several times), and for the second one it is 45%-50%.
I would appreciate it if someone could help me understand:
why I have this huge difference.
Also, I need the test dataset to contain the latest timestamps (as in the second experiment). Is there any way to select data from the rest of the dataset for training so as to improve the training?
PS: I already tried random selection from the first 80% of the timestamps and it didn't improve the performance.
First of all, it is not clear how exactly you're testing. Second, either way, you are doing the testing wrong.
RESULT: I am getting totally different results for these two datasets. For the first one the AUC (area under the curve) is 85%-90% (I ran the experiment several times), and for the second one it is 45%-50%.
Is this for the training set or the test set? If the test set, that means you have poor generalization.
You are doing it wrong because you are not allowed to tweak your model so that it performs well on the same test set: doing so can lead you to a model that does exactly that but generalizes badly.
You should do one of two things:
1. A training-validation-test split
Keep 60% of the data for training, 20% for validation and 20% for testing, split at random. Train your model on the training set and tune it so that it performs well on the validation set. Make sure you don't overfit: the performance on the training set should be close to that on the validation set; if the gap is large, you have overfit your training set. Do not use the test set at all at this stage.
Once you're happy, train your selected model on the training set + validation set and test it on the test set you've held out. You should get acceptable performance. You are not allowed to tweak your model further based on the results you get on this test set, if you're not happy, you have to start from scratch.
2. Use cross validation
A popular form is 10-fold cross validation: shuffle your data and split it into 10 groups of equal or almost equal size. For each of the 10 groups, train on the other 9 and test on the remaining one. Average your results on the test groups.
You are allowed to make changes to your model to improve that average score; just run cross validation again after each change (and make sure to reshuffle).
Personally I prefer cross validation.
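As a sketch of option 2, here is what shuffled 10-fold cross validation might look like with scikit-learn's RandomForestClassifier. The data here is synthetic, generated only to match the rough shape described in the question (about 4000 rows, 11 attributes), and roc_auc is used as the score because the question reports AUC.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Stand-in data with roughly the question's shape: ~4000 rows, 11 features.
X, y = make_classification(n_samples=4000, n_features=11, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0)

# Shuffled 10-fold CV; change random_state to reshuffle after each model change.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")

print(scores.mean(), scores.std())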
I am guessing that what happens is that by sorting based on the timestamp, you make your algorithm generalize poorly. Maybe the 20% you keep for testing differs significantly in some way, and your algorithm is not given a chance to capture this difference? In general, your data should be shuffled randomly in order to avoid such issues.
Of course, you might also have a buggy implementation.
I would suggest you try cross validation and see what results you get then.

Minimum number of observation when performing Random Forest

Is it possible to apply RandomForests to very small datasets?
I have a dataset with many variables but only 25 observations. Random forests produce reasonable results with low OOB errors (10-25%).
Is there any rule of thumb regarding the minimum number of observations to use?
In fact, one of the response variables is unbalanced, and if I subsample it I will end up with an even smaller number of observations.
Thanks in advance
Absolutely, RF can be used on these types of datasets (i.e. p > n). In fact, RF is used in fields like genomics where the number of variables is >= 20000 and there are only a very small number of rows, say 10-12. The entire problem is figuring out which of the 20k variables would make up a parsimonious marker (i.e. feature selection is the entire problem).
I don't have any rules of thumb about minimum size, other than: if your model doesn't work well on a held-back sample (or with hold-one-back, i.e. leave-one-out, cross validation, which might work well in your case), then you should try something else.
Hope this helps
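As a rough illustration of RF in the p > n regime, here is a sketch on a synthetic dataset with 25 observations and many more variables, scored with leave-one-out cross validation; the sizes and hyper-parameters are arbitrary.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

# Synthetic p >> n data: 25 observations, 500 variables, only a handful informative.
X, y = make_classification(n_samples=25, n_features=500, n_informative=5,
                           n_redundant=0, random_state=0)

model = RandomForestClassifier(n_estimators=500, random_state=0)

# Leave-one-out CV: with so few rows, every observation serves as the test set once.
scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("Leave-one-out accuracy:", scores.mean())

# Variable importances give a starting point for picking a parsimonious marker set.
model.fit(X, y)
top = np.argsort(model.feature_importances_)[::-1][:10]
print("Top variables:", top)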

Limiting prediction range for Weka

For my research I am using Weka to predict alpha values for different uses. The legal range of alpha is any real number between 0 and 1 inclusive. It is currently performing well, but some of the predictions are greater than 1. I want to keep the classifier as numerical since it is a real number, but I want to limit the range of the prediction to between 0 and 1. Any ideas on how to do this?
I think that @Lars-Kotthoff raises interesting points. I would offer my suggestions from a different perspective, ignoring the classification aspects completely:
Once you have a set of values within the range [0, inf), you can just try to normalise them using some function such as logit or min-max scaling, among others.
You can't do this in Weka. Whether it will be possible at all will depend on the implementation of the regression algorithm -- I'm not aware of something like this being implemented in any of the algorithms in Weka (although I might be wrong).
Even if it were implemented, the most likely thing that would happen is that everything greater than 1 would simply be replaced by 1. You can do the same thing yourself by checking each prediction and replacing all values greater than 1.
Taking the possible output range into account when training the regression model is unlikely to improve performance.
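To make the post-processing concrete, here is a small sketch of both suggestions, clipping to [0, 1] and min-max rescaling, applied to predictions exported from Weka; the array values are made up.

import numpy as np

# Hypothetical predictions exported from Weka; a few fall outside [0, 1].
preds = np.array([0.12, 0.87, 1.04, 0.55, 1.20, -0.03])

# Option 1: clip, i.e. replace anything above 1 (or below 0) with the boundary value.
clipped = np.clip(preds, 0.0, 1.0)

# Option 2: min-max rescale the whole set of predictions into [0, 1].
rescaled = (preds - preds.min()) / (preds.max() - preds.min())

print(clipped)
print(rescaled)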
