I'm training a linear regressor in BigQuery on ~20,000 rows of data with a target output column whose values are in {0, 1}. I've set the EARLY_STOP option to False and the MAX_ITERATIONS option to 10, yet training always stops after 1 iteration with an MAE of ~0.2. Here is the CREATE MODEL query:
CREATE OR REPLACE MODEL `sketching.linear_regressor_08-13--11-59-03.811249`
OPTIONS (model_type=#model_type,
         input_label_cols=#input_label_cols,
         max_iterations=#max_iterations,
         data_split_method=#data_split_method,
         early_stop=#early_stop) AS
SELECT * FROM `ds.training_table`
These are the parameters in the query (python object print):
[ScalarQueryParameter('model_type', 'STRING', 'linear_reg'),
ArrayQueryParameter('input_label_cols', 'STRING', ['target_output']),
ScalarQueryParameter('max_iterations', 'INT64', 10),
ScalarQueryParameter('data_split_method', 'STRING', 'RANDOM'),
ScalarQueryParameter('early_stop', 'BOOL', False)]
PS: I've inspected bytes_processed on the QueryJob and it checks out (meaning it is actually processing the whole table).
UPDATE
It looks like BQ is ignoring most of the model options that I supplied. This is a screenshot of my model status in the BigQuery web UI:
As you can see in the training options section, none of the options I provided are shown, and the options that are displayed are not ones I provided. When I changed the data split method option, that change was reflected here.
UPDATE 2
I provided the L1_REG option (0.1) and it magically fixed the problem: training ran for the full 10 iterations (the MAX_ITERATIONS I provided).
If I run the model without any of the optional settings (or with just the EARLY_STOP option), it stops at 1 iteration.
Your model is probably stopping because the least squares solution is being computed directly. This is because the default setting is OPTIMIZE_STRATEGY='AUTO_STRATEGY', and your query doesn't fulfill the requirements for batch gradient descent to be selected over the direct least squares solution.[1]
If you don't want this to occur, set OPTIMIZE_STRATEGY='BATCH_GRADIENT_DESCENT'.
[1] https://cloud.google.com/bigquery-ml/docs/reference/standard-sql/bigqueryml-syntax-create-glm#optimize_strategy
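For example, a minimal sketch using the Python BigQuery client (the model name below is hypothetical; the table, label column and other options mirror the question):

from google.cloud import bigquery

client = bigquery.Client()

create_model_sql = """
CREATE OR REPLACE MODEL `sketching.linear_regressor_test`
OPTIONS (
  model_type = 'linear_reg',
  input_label_cols = ['target_output'],
  max_iterations = 10,
  early_stop = FALSE,
  data_split_method = 'RANDOM',
  optimize_strategy = 'BATCH_GRADIENT_DESCENT'  -- skip the direct least-squares solve
) AS
SELECT * FROM `ds.training_table`
"""

client.query(create_model_sql).result()  # wait for training to finish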
I'm trying to find optimal hyperparameters with Ray Tune:
from ray import tune
from ray.air import RunConfig, CheckpointConfig  # in newer Ray versions these live in ray.train
from ray.tune.schedulers import ASHAScheduler

tuner = tune.Tuner(
    train,
    param_space=hyperparams1,
    tune_config=tune.TuneConfig(
        num_samples=200,
        metric="score",
        mode="max",
        scheduler=ASHAScheduler(grace_period=6),
    ),
    run_config=RunConfig(
        stop={"score": 290},
        checkpoint_config=CheckpointConfig(checkpoint_score_attribute="score"),
    ),
)
Sometimes my model overfits and its results get worse over time, i.e., I get something like 100, 200, 220, 140, 90, 80. Ray shows me the current "best result" in its status, but it takes that value only from the last reported iterations (i.e., the best value for the sequence above comes out as 80).
I'm sure that results with higher values are better, so it would be nice to select the best result based on the whole history, not just the last value.
Is there a way to force it to use the whole training history when selecting the best result? Or should I interrupt training manually when I see that the model is no longer improving? Or is it already saving all the results, so that all I need to do is filter them after it finishes?
I've seen Checkpoint best model for a trial in ray tune and have added CheckpointConfig to my code, but it doesn't seem to help: I still see the last result.
The entire history of reported metrics is saved within each Result, and you can access it via result.metrics_dataframe. See this section of the user guide for an example of what you can access within each result.
It's also possible to filter the entire history down to the maximum accuracy (or another metric you define) using the ResultGrid output of tuner.fit(). The ResultGrid.get_dataframe(filter_metric=<your-metric>, filter_mode=<min/max>) API returns a DataFrame in which each trial's history of reported results has been filtered down to its best value. See the bottom of this section for an example of doing this.
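For example, a sketch assuming the Tuner from the question and a trainable that reports "score" (and the automatic training_iteration) on every iteration:

results = tuner.fit()  # ResultGrid with one Result per trial

# One row per trial, keeping each trial's maximum "score" over its whole
# history instead of its last reported value.
best_per_trial = results.get_dataframe(filter_metric="score", filter_mode="max")
print(best_per_trial[["score", "training_iteration"]])

# The full per-iteration history of a single trial is also kept.
first_result = results[0]
print(first_result.metrics_dataframe[["training_iteration", "score"]])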
I am reading the Hands-On Machine Learning book, and the author talks about the random seed used for the train/test split. At one point, the author says that over time the machine (learning algorithm) will get to see your whole dataset.
The author uses the following function for the train/test split:
import numpy as np

def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]
The function is used like this:
>>> train_set, test_set = split_train_test(housing, 0.2)
>>> len(train_set)
16512
>>> len(test_set)
4128
Well, this works, but it is not perfect: if you run the program again, it will generate a different test set! Over time, you (or your Machine Learning algorithms) will get to see the whole dataset, which is what you want to avoid.
Sachin Rastogi: Why and how will this impact my model's performance? I understand that my model's accuracy will vary on each run, since the train set will always be different. How will my model get to see the whole dataset over time?
The author also provides a few solutions:
One solution is to save the test set on the first run and then load it in subsequent runs. Another option is to set the random number generator’s seed (e.g., np.random.seed(42)) before calling np.random.permutation(), so that it always generates the same shuffled indices.
But both these solutions will break next time you fetch an updated dataset. A common solution is to use each instance’s identifier to decide whether or not it should go in the test set (assuming instances have a unique and immutable identifier).
Sachin Rastogi: Would this be a good train/test division? I think not; train and test should contain elements from across the dataset to avoid any bias from the train set.
The author gives an example:
You could compute a hash of each instance’s identifier and put that instance in the test set if the hash is lower or equal to 20% of the maximum hash value. This ensures that the test set will remain consistent across multiple runs, even if you refresh the dataset.
The new test set will contain 20% of the new instances, but it will not contain any instance that was previously in the training set.
Sachin Rastogi: I am not able to understand this solution. Could you please help?
For me, these are the answers:
The point here is that you should set aside part of your data (which will constitute your test set) before training the model. What you want to achieve is to generalize well to unseen examples. By running the code you have shown, you'll get different test sets over time; in other words, you'll keep training your model on different subsets of your data (and possibly on data that you previously marked as test data). This in turn will affect training and, taken to the limit, there will be nothing left to generalize to.
This is indeed a solution that satisfies the previous requirement (a stable test set), provided that no new data are added.
As said in the comments to your question, by hashing each instance's identifier you can be sure that old instances always get assigned to the same subsets:
Instances that were in the training set before the dataset update will remain there (their hash value won't change, so it will stay above 0.2 * max_hash_value);
Instances that were in the test set before the dataset update will remain there (their hash value won't change, so it will stay at or below 0.2 * max_hash_value).
The updated test set will thus contain 20% of the new instances plus all of the instances from the old test set, which keeps it stable.
I would also suggest looking here for an explanation from the author: https://github.com/ageron/handson-ml/issues/71.
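A minimal sketch of that hash-based split, assuming a pandas DataFrame with a stable, immutable identifier column (the frame and the "index" id column below are only illustrative):

from zlib import crc32

import numpy as np
import pandas as pd

def is_in_test_set(identifier, test_ratio):
    # The CRC32 hash of an identifier never changes, so an instance's
    # train/test assignment is stable across runs and dataset updates.
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test = ids.apply(lambda id_: is_in_test_set(id_, test_ratio))
    return data.loc[~in_test], data.loc[in_test]

# Illustrative usage: the row index doubles as the identifier.
housing = pd.DataFrame({"median_income": np.random.rand(1000)})
housing_with_id = housing.reset_index()   # adds an "index" column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")
print(len(train_set), len(test_set))      # roughly 80% / 20%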
I'm using SageMaker's built-in XGBoost algorithm with the following training and validation sets:
https://files.fm/u/pm7n8zcm
Running the prediction model that comes out of training with the above datasets always produces the exact same result.
Is there something obvious in the training or validation datasets that could explain this behavior?
Here is an example code snippet where I'm setting the Hyperparameters:
{
    {"max_depth", "1000"},
    {"eta", "0.001"},
    {"min_child_weight", "10"},
    {"subsample", "0.7"},
    {"silent", "0"},
    {"objective", "reg:linear"},
    {"num_round", "50"}
}
And here is the source code: https://github.com/paulfryer/continuous-training/blob/master/ContinuousTraining/StateMachine/Retrain.cs#L326
It's not clear to me which hyperparameters might need to be adjusted.
This screenshot shows that I'm getting a result with 8 indexes:
But when I add the 11th one, it fails. This leads me to believe that I have to train the model with zero-valued indexes included instead of removing them, so I'll try that next.
Update: retraining with zero values included doesn't seem to help; I'm still getting the same value every time. I also noticed I can't send more than 10 values to the prediction endpoint or it returns an error: "Unable to evaluate payload provided". So at this point, using the libsvm format has only added more problems.
You've got a few things wrong there.
Using {"num_round", "50"} with such a small eta ({"eta", "0.001"}) will get you nowhere: the model barely moves away from its initial prediction, which is consistent with every input returning the same value.
{"max_depth", "1000"} is insane! 1000 is far too deep (the default value is 6).
Suggesting:
{"max_depth", "6"},
{"eta", "0.05"},
{"min_child_weight", "3"},
{"subsample", "0.8"},
{"silent", "0"},
{"objective", "reg:linear"},
{"num_round", "200"}
Try these settings and report your output.
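The question's code is C#, but as a rough sketch of plugging the suggested values into the SageMaker Python SDK (the image URI, role and S3 paths below are placeholders to fill in):

from sagemaker.estimator import Estimator

xgboost_image_uri = "<your-region's-xgboost-image-uri>"   # placeholder
sagemaker_role = "<your-execution-role-arn>"              # placeholder
s3_output_path = "s3://<your-bucket>/output"              # placeholder

xgb = Estimator(
    image_uri=xgboost_image_uri,
    role=sagemaker_role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path=s3_output_path,
)
xgb.set_hyperparameters(
    max_depth=6,
    eta=0.05,
    min_child_weight=3,
    subsample=0.8,
    objective="reg:linear",
    num_round=200,
)
# xgb.fit({"train": "<s3-train-uri>", "validation": "<s3-validation-uri>"})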
As I was grouping time series, certain frequencies created gaps in the data.
I solved this issue by filling all NaNs.
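A minimal sketch of that fix with pandas (the series and frequency below are only illustrative):

import pandas as pd

# Illustrative series where the 02:00 bucket is missing after grouping.
s = pd.Series(
    [1.0, 2.0, 4.0],
    index=pd.to_datetime(["2024-01-01 00:00", "2024-01-01 01:00", "2024-01-01 03:00"]),
)

hourly = s.resample("1H").mean()   # the gap shows up as a NaN row
filled = hourly.fillna(0)          # fill NaNs before building training data
print(filled)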
I read a post about forecasting time series with LSTM using CNTK. It was very helpful for getting a better understanding of how to apply this method to other problems. It is very simple to implement an LSTM network using CNTK, with only a couple of lines of code:
model = Sequential (
    RecurrentLSTMLayer {$stateDim$, usePeepholes = true} :  # first LSTM
    DenseLayer {$labelDim$, bias=false}  # followed by an adaptor layer (from LSTM output size to the output or label size)
)
I used the data preprocessing program included in the post to generate the training data and validation data. I tried to prepare my own domain data by studying the example's training data file; I can see that each row consists of 15 features (the input window) and 12 labels (the output window). The first two rows of data are shown below.
1|i -0.117767881389987 -0.136789685972378 -0.157142990666484 -0.110810379516591 -0.0514608885500003 -0.0519359184751851 -0.093395333203464 -0.0466859579796335 -0.027053633818924 -0.0228974319964887 -0.0226106294034727 -0.0771583282775792 0.0326521764808296 0.0382623225371779 0.0179878482650109 |o -0.0419931078602005 -0.00707823233794613 -0.0326514790959216 0.107345877141872 0.0500879860433807 -0.0227185182952923 0.0354644105738453 0.0276734047763592 0.0830922226488839 0.0670409830200276 0.0983389666100694 0.101450282120106 |
1|i -0.142277570256967 -0.162630874951073 -0.11629826380118 -0.0569487728345894 -0.0574238027597742 -0.0988832174880532 -0.0521738422642226 -0.0325415181035131 -0.0283853162810779 -0.0280985136880618 -0.0826462125621683 0.0271642921962405 0.0327744382525887 0.0124999639804217 -0.0474809921447896 |o -0.0125661166225353 -0.0381393633805107 0.101857992857283 0.0446001017587916 -0.0282064025798814 0.0299765262892562 0.02218552049177 0.0776043383642948 0.0615530987354385 0.0928510823254802 0.0959623978355166 0.0630698500493061 |
As mentioned in the post, the input (15 values) and output (12 values) windows move forward one step at a time (see picture below); therefore, the data in each row should shift by just one value at a time, but that doesn't appear to be the case. There don't seem to be any overlapping values between two rows.
Input and output windows for the time series
My question is how should I prepare the training and validation data for time series prediction using CNTK LSTM?
I looked into the data preprocessing script (in R) and found out that the values in each input and output window are the sum of the smoothed trend and the random component (noise), as part of the data normalization. That is why consecutive rows don't show an obvious one-value shift.
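To make the windowing itself concrete, here is a small Python sketch (not the R script from the post) of one-step-shifted input/output windows over a raw series; the post's actual preprocessing additionally decomposes each value into a smoothed trend plus noise, which is why the rows in the CTF file above don't visibly overlap:

import numpy as np

def make_windows(series, input_len=15, output_len=12):
    # Slide an (input, output) window pair over the series, advancing one step at a time.
    X, y = [], []
    for start in range(len(series) - input_len - output_len + 1):
        X.append(series[start:start + input_len])
        y.append(series[start + input_len:start + input_len + output_len])
    return np.array(X), np.array(y)

# On raw values, consecutive rows overlap in all but one position.
series = np.sin(np.linspace(0, 20, 300))
X, y = make_windows(series)
print(X.shape, y.shape)                 # (274, 15) (274, 12)
assert np.allclose(X[1, :-1], X[0, 1:])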
I'm trying to create a model with a training dataset and want to label the records in a test data set.
All the tutorials and help I find online only cover cross-validation on a single data set, i.e. the training set; I couldn't find how to use a separate test set. I tried to apply the resulting model to the test set, but the test set seems to end up with a different number of attributes than the training set after preprocessing. This is a text classification problem.
At the end I get some output like this:
18.03.2013 01:47:00 Results of ResultWriter 'Write as Text (2)' [1]:
18.03.2013 01:47:00 SimpleExampleSet:
    5275 examples,
    366 regular attributes,
    special attributes = {
        confidence_1 = #367: confidence(1) (real/single_value)
        confidence_5 = #368: confidence(5) (real/single_value)
        confidence_2 = #369: confidence(2) (real/single_value)
        confidence_4 = #370: confidence(4) (real/single_value)
        prediction = #366: prediction(label) (nominal/single_value)/values=[1, 5, 2, 4]
    }
But what I want is for all my examples to be labelled.
It seems that my test data and training data have a different number of attributes; I see many warnings like the following in the logs:
Mar 18, 2013 1:46:41 AM WARNING: Kernel Model: The given example set does not contain a regular attribute with name 'wireless'. This might cause problems for some models depending on this particular attribute.
But how do we solve this problem in text classification, where we cannot know the number and names of the attributes beforehand?
Can someone please give me some pointers?
You are probably using a Process Documents operator to preprocess both the training and the test set. It is important that both of these operators are set up identically. To "synchronize" the wordlist, i.e. to consider the same set of words in both, you have to connect the wordlist (wor) output of the Process Documents operator used for training to the corresponding wordlist input port of the Process Documents operator used for preprocessing the test set.
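The same idea expressed in Python with scikit-learn rather than RapidMiner, just to illustrate the principle: the vocabulary (the "wordlist") is learned once from the training documents and then reused unchanged for the test documents, so both sets end up with exactly the same attributes (the documents below are made up):

from sklearn.feature_extraction.text import TfidfVectorizer

train_docs = ["cheap wireless headphones", "great battery life"]
test_docs = ["wireless battery pack"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)   # learn the wordlist here
X_test = vectorizer.transform(test_docs)         # reuse it here; no new attributes are created
print(X_train.shape[1] == X_test.shape[1])       # True: identical attribute set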