BigQuery ML model evaluate keeps returning null

Basically, title. I am trying to query an evaluation for an ARIMA time-series ML model I made. BigQuery has a designated function to do this. After training the model and using it to make a 30-day forecast, I run the ML.EVALUATE query, but every time I do, it returns only NULL instead of the desired accuracy measures.
After training the time-series model, here is the specific query I run:
SELECT
  *
FROM
  ML.EVALUATE(MODEL `mydataset.my_arima_model`,
    (
      SELECT
        timeseries_date,
        timeseries_metric
      FROM
        `mydataset.mytable`),
    STRUCT(TRUE AS perform_aggregation, 30 AS horizon))
Here mydataset.my_arima_model is my model, mydataset.mytable is the training data, and the metric and date fields are set correctly. The query succeeds and a temporary table is returned, but all of the fields that should contain the error measures are NULL. Am I doing something wrong? Should I pass something other than the training data as mytable? I have tried running the evaluation on different models, including some I trained on test data I grabbed from Google.

Returning NULL is expected in your case. The evaluation metrics are computed from the difference between the forecasted values and the ground-truth values. In your query, the ground-truth values come from mydataset.mytable, which has no overlap with the forecasted values with respect to timestamps; that is why NULL is returned. To give an example, suppose mydataset.mytable has a time series from 2022-01-01 to 2022-06-01. Use only part of it, for example 2022-01-01 to 2022-05-01, as training data in the CREATE MODEL query, and then run the same ML.EVALUATE query against the full table. The 30-day forecast then overlaps with the held-out ground truth for May, and the error measures can be computed.
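As a rough illustration, here is a minimal sketch using the google-cloud-bigquery Python client with the placeholder model, table, and column names from the question. The ARIMA_PLUS model type, the choice to hold out the last 30 days, and the assumption that timeseries_date is a DATE column are mine, not the asker's actual setup.

from google.cloud import bigquery

client = bigquery.Client()

# Train on everything except the last 30 days so ML.EVALUATE has held-out ground truth.
train_sql = """
CREATE OR REPLACE MODEL `mydataset.my_arima_model`
OPTIONS(
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'timeseries_date',
  time_series_data_col = 'timeseries_metric'
) AS
SELECT timeseries_date, timeseries_metric
FROM `mydataset.mytable`
WHERE timeseries_date < DATE_SUB(
  (SELECT MAX(timeseries_date) FROM `mydataset.mytable`), INTERVAL 30 DAY)
"""
client.query(train_sql).result()  # wait for training to finish

# Evaluate against the full table; the final 30 days now act as ground truth.
eval_sql = """
SELECT *
FROM ML.EVALUATE(
  MODEL `mydataset.my_arima_model`,
  (SELECT timeseries_date, timeseries_metric FROM `mydataset.mytable`),
  STRUCT(TRUE AS perform_aggregation, 30 AS horizon))
"""
for row in client.query(eval_sql).result():
    print(dict(row))

With this split, the 30 forecasted days overlap the last 30 days of the table, so the error-measure columns come back populated instead of NULL.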

Related

forecasting with TFT - forecast results are very flat

I'm using TFT (Temporal Fusion Transformer) from PyTorch Forecasting for the first time for my forecasting project. I'm quite confused by a few things:
The forecasted time series is very flat. What are the possible reasons for this? Is it because my training data set is too short? Its length is between 1 and 2 times max_encoder_length, and the ratio of max_encoder_length to max_prediction_length is 4:1.
How does optimize_hyperparameters work? Will it update the model with the best parameters after being run, or should I manually update the model with the output values of the process? I'm asking because after I run it, nothing seems to have happened; the model remains unchanged.

How to apply same processing pipeline for train and test data when they result in different final features

I'm trying to create a regression model to predict some housing sales, and I am facing an issue with processing the train data and test data (this is not the validation data taken from the training set itself) the same way. The steps I'm performing for the processing are as follows:
Drop the columns with >50% null values
Impute the rest of the columns containing null values
One-hot encode the categorical columns
Say my train data has the following columns (after label extraction) (the ones in ** ** contain null values):
['col1', 'col2', '**col3**', 'col4', '**col5**', 'col6', '**col7**','**col8**', '**col9**', '**col10**', 'col11']
test data has the following columns:
['col1', '**col2**', 'col3', 'col4', 'col5', 'col6', '**col7**', '**col8**', '**col9**', '**col10**', 'col11']
I drop only those columns with >50% null values; the rest of the columns in bold, I impute. So in the train data, I will have:
cols_to_drop= ['**col3**','**col5**','**col7**' ]
cols_to_impute= ['**col8**', '**col9**','**col10**' ]
And if I retain the same columns to be dropped from test data too, my test data will have the following:
cols_to_drop= ['**col3**','**col5**','**col7**' ]
cols_to_impute= ['**col2**', '**col8**', '**col9**','**col10**' ]
The problem comes with imputation: I have to .fit_transform my imputer on the cols_to_impute of the train data and then .transform the same imputer on the cols_to_impute of the test data, but there is a clear difference in the number of features between the two cols_to_impute lists. (I tried this as well and had issues with imputation.)
If I instead keep the same cols_to_impute for both the train and test datasets, ignoring the null column **col2** of the test data, I face an issue at the one-hot encoding step saying NaNs need to be handled before encoding. So how should the processing be done for the train and test sets in such cases? Should I concatenate both of them, perform the processing, and split them again later? I have read about leakage issues with doing this.
Well, you should do the following:
Combine both the train and test dataframes, then do the first two steps, i.e. drop the columns with too many nulls and impute the rest.
Then split it back into train and test and do the one-hot encoding.
This ensures that both dataframes have the same columns and that there is no leakage from the one-hot encoding.
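For illustration, here is a minimal pandas/scikit-learn sketch of that approach; train_df and test_df stand in for the asker's already-loaded dataframes, and the 50% threshold and imputation strategies are assumptions.

import pandas as pd
from sklearn.impute import SimpleImputer

# train_df and test_df are assumed to be loaded already and to share column names.
combined = pd.concat([train_df, test_df], keys=["train", "test"])

# Step 1: drop columns where more than 50% of the combined values are null.
null_frac = combined.isna().mean()
combined = combined.drop(columns=null_frac[null_frac > 0.5].index)

# Step 2: impute the remaining nulls (median for numeric, mode for categorical).
num_cols = combined.select_dtypes(include="number").columns
cat_cols = combined.select_dtypes(exclude="number").columns
combined[num_cols] = SimpleImputer(strategy="median").fit_transform(combined[num_cols])
combined[cat_cols] = SimpleImputer(strategy="most_frequent").fit_transform(combined[cat_cols])

# Step 3: one-hot encode, then split back so train and test get identical columns.
encoded = pd.get_dummies(combined, columns=list(cat_cols))
train_processed = encoded.loc["train"]
test_processed = encoded.loc["test"]

Note that fitting the imputer on the combined frame is exactly the step people flag as leakage; an alternative is to align the columns first and fit the imputer on the train rows only.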

pmml4s model.predict() returns array instead of single value

I used sklearn2pmml to serialize my decision tree classifier to a pmml file.
I used pmml4s in java to deserialize the model and use it to predict.
I use the code below to make a prediction for a single incoming value. This should return either 0/1/2/3/4/5/6.
Object[] result = model.predict(new String[]{"220"});
The result array looks like this after the prediction:
Does anyone know why this is happening? Is my way of inputting the prediction value wrong or is something wrong in the serialization/deserialization?
It is the model's certainty for each class. In your case it means the prediction is 4 with probability 94.5% or 5 with probability 5.5%.
In the simple case, if you just want a single value, pick the index of the maximal value.
However, you might also use these probabilities for additional control logic, like thresholding when the decision is ambiguous (two classes each with probability ~0.4, etc.).
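To make that concrete, here is a small Python sketch of the same argmax-plus-threshold idea; the result values are hypothetical, standing in for the per-class probabilities the model returns.

import numpy as np

# Hypothetical per-class probabilities for classes 0..6, as described in the answer.
result = [0.0, 0.0, 0.0, 0.0, 0.945, 0.055, 0.0]

probs = np.asarray(result, dtype=float)
predicted_class = int(np.argmax(probs))    # index of the most likely class
confidence = float(probs[predicted_class])

# Optional control logic: only accept the prediction when it is unambiguous.
if confidence >= 0.6:
    print(f"predicted class {predicted_class} with confidence {confidence:.1%}")
else:
    print("prediction is ambiguous; fall back to a manual rule")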

Classification using H2O.ai H2O-3 Automl Algorithm on AWS SageMaker: Categorical Columns

I'm trying to train a model using H2O.ai's H2O-3 Automl Algorithm on AWS SageMaker using the console.
My model's goal is to predict if an arrest will be made based upon the year, type of crime, and location.
My data has 8 columns:
primary_type: enum
description: enum
location_description: enum
arrest: enum (true/false), this is the target column
domestic: enum (true/false)
year: number
latitude: number
longitude: number
When I use the SageMaker console on AWS and create a new training job using the H2O-3 Automl Algorithm, I specify the primary_type, description, location_description, and domestic columns as categorical.
However in the logs of the training job I always see the following two lines:
Converting specified columns to categorical values:
[]
This leads me to believe the categorical_columns training hyperparameter is not being taken into account.
I have tried the following hyperparameters with the same output in the logs each time:
{'classification': 'true', 'categorical_columns':'primary_type,description,location_description,domestic', 'target': 'arrest'}
{'classification': 'true', 'categorical_columns':['primary_type','description','location_description','domestic'], 'target': 'arrest'}
I thought the list of categorical columns was supposed to be delimited by comma, which would then be split into a list.
I expected the list of categorical column names to be output in the logs instead of an empty list, like so:
Converting specified columns to categorical values:
['primary_type','description','location_description','domestic']
Can anyone help me figure out how to get these categorical columns to apply to the training of my model?
Also: I think this is the code that runs when I train my model, but I have yet to confirm that: https://github.com/h2oai/h2o3-sagemaker/blob/master/automl/automl_scripts/train#L93-L151
This seems to be a bug in the h2o3-sagemaker package. The code at https://github.com/h2oai/h2o3-sagemaker/blob/master/automl/automl_scripts/train#L106 shows that it reads categorical_columns directly from the hyperparameters, not nested under the training field. However, when the categorical_columns field is moved up a level, the algorithm still doesn't recognize it. So there is no solution for this.
It seems, based on the code here: https://github.com/h2oai/h2o3-sagemaker/blob/master/automl/automl_scripts/train#L106
that the parameter is looking for a comma-separated string, e.g. "cat,dog,bird".
I would try "primary_type,description,location_description,domestic" as the input parameter, rather than ['primary_type', 'description'... etc.]
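If you are launching the job from the SageMaker Python SDK rather than the console, passing the hyperparameters that way might look roughly like the sketch below; the algorithm ARN, IAM role, instance type, S3 path, and the "training" channel name are placeholders I made up, not values from the question.

from sagemaker import Session
from sagemaker.algorithm import AlgorithmEstimator

# Placeholder ARN, role, and instance settings; replace with your own.
estimator = AlgorithmEstimator(
    algorithm_arn="arn:aws:sagemaker:us-east-1:123456789012:algorithm/h2o-3-automl",
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    instance_count=1,
    instance_type="ml.m5.xlarge",
    sagemaker_session=Session(),
)

# categorical_columns is passed as a single comma-separated string,
# matching the format the training script appears to expect.
estimator.set_hyperparameters(
    classification="true",
    target="arrest",
    categorical_columns="primary_type,description,location_description,domestic",
)

estimator.fit({"training": "s3://my-bucket/crime-data/train/"})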

Are data dependencies relevant when preparing data for neural network?

Data: I have N rows of data of the form (x, y, z), where logically f(x, y) = z, i.e. z is dependent on x and y; in my case the rows are (setting1, setting2, signal). Different x's and y's can lead to the same z, but those z's wouldn't mean the same thing.
There are 30 unique setting1 values, 30 unique setting2 values, and 1 signal for each (setting1, setting2) pairing, hence 900 signal values.
Data set: these [900, 3] data points are considered one data set. I have many samples of these data sets.
I want to do a classification based on these data sets, but I need to flatten the data (put each data set into one row). If I flatten it, I will duplicate all the setting values (setting1 and setting2) 30 times, i.e. I will have a row with 3x900 columns.
Question:
Is it correct to keep all the duplicate setting1, setting2 values in the data set, or should I remove them and include each unique value only once, i.e. have a row with 30 + 30 + 900 columns? I'm worried that the logical dependency of the signal on the settings will be lost this way. Is this relevant? Or shouldn't I bother including the settings at all (e.g. due to correlations)?
If I understand correctly, you are training a NN on a sample where each observation is [900, 3].
You are flattening it and getting an input layer of 3*900.
Some of those values are the result of a function of others.
It matters which function that is; if it is a linear function, the NN might not work well:
From here:
"If inputs are linearly dependent then you are in effect introducing
the same variable as multiple inputs. By doing so you've introduced a
new problem for the network, finding the dependency so that the
duplicated inputs are treated as a single input and a single new
dimension in the data. For some dependencies, finding appropriate
weights for the duplicate inputs is not possible."
Also, if you add dependent variables, you risk the NN being biased towards those variables.
E.g. if you are running LMS on [x1, x2, x3, average(x1, x2)] to predict y, you are effectively assigning a higher weight to the x1 and x2 variables.
Unless you have a reason to believe those weights should be higher, don't include their function.
I was not able to find a link to support this, but my intuition is that you might want to shrink your input layer in addition to omitting the dependent values:
From Professor A. Ng's ML course I remember that the input should be the minimum set of values that is 'reasonable' for making the prediction.
'Reasonable' is vague, but I understand it like this: if you try to predict the price of a house, include the footage, area quality, and distance from a major hub; do not include the average sunspot activity during the open-house day, even though you have that data.
I would remove the duplicates. I would also look for any other data that can be omitted, and maybe run PCA over the full set of N x [900, 3].
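As a rough illustration of the de-duplicated flattening (30 + 30 + 900 columns) plus an optional PCA step, here is a small numpy/scikit-learn sketch; the random signals and the choice of 20 PCA components are made-up stand-ins for the real data.

import numpy as np
from sklearn.decomposition import PCA

def flatten_sample(sample):
    """Turn one [900, 3] array of (setting1, setting2, signal) rows into a
    30 + 30 + 900 feature vector, keeping each setting value only once."""
    setting1 = np.unique(sample[:, 0])   # 30 unique setting1 values
    setting2 = np.unique(sample[:, 1])   # 30 unique setting2 values
    signal = sample[:, 2]                # all 900 signal values
    return np.concatenate([setting1, setting2, signal])

# Made-up data: N data sets, each a full 30 x 30 grid of settings with a random signal.
rng = np.random.default_rng(0)
N = 50
samples = []
for _ in range(N):
    s1, s2 = np.meshgrid(np.arange(30), np.arange(30), indexing="ij")
    signal = rng.normal(size=900)
    samples.append(np.column_stack([s1.ravel(), s2.ravel(), signal]))

X = np.stack([flatten_sample(s) for s in samples])   # shape (N, 960) instead of (N, 2700)

# Optional: reduce the flattened features further before feeding a classifier.
X_reduced = PCA(n_components=20).fit_transform(X)
print(X.shape, X_reduced.shape)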
