Excluding columns from training data in BigQuery ML

I am training a model using BigQuery ML. My input has several fields, one of which is a customer number. This number is not useful as a prediction feature, but I do need it in the final output so that I can reference which users scored high vs. low. How can I exclude this column from model training without removing it completely?
Reading the docs, the only ways I can see to exclude a column are to add it to input_label_cols, which it clearly is not, or to data_split_col, which is not desirable.

You do not need to include fields in the model that should not be part of it - not at all.
Rather, you include them only during prediction.
For example, the model below has only six fields as input (carrier, origin, dest, dep_delay, taxi_out, distance):
#standardSQL
CREATE OR REPLACE MODEL flights.ontime
OPTIONS
  (model_type='logistic_reg', input_label_cols=['on_time']) AS
SELECT
  IF(arr_delay < 15, 1, 0) AS on_time,
  carrier,
  origin,
  dest,
  dep_delay,
  taxi_out,
  distance
FROM `cloud-training-demos.flights.tzcorr`
WHERE arr_delay IS NOT NULL
During prediction, on the other hand, you can have any extra fields available, as below (you can put them in any position of the SELECT, but note that the predicted columns will come first):
#standardSQL
SELECT * FROM ML.PREDICT(MODEL `cloud-training-demos.flights.ontime`, (
  SELECT
    UNIQUE_CARRIER,     -- extra column
    ORIGIN_AIRPORT_ID,  -- extra column
    IF(arr_delay < 15, 1, 0) AS on_time,
    carrier,
    origin,
    dest,
    dep_delay,
    taxi_out,
    distance
  FROM `cloud-training-demos.flights.tzcorr`
  WHERE arr_delay IS NOT NULL
  LIMIT 5
))
Obviously, input_label_cols and data_split_col serve different purposes:
input_label_cols (STRING) - The label column name(s) in the training data.
data_split_col (STRING) - Identifies the column used to split the data into training and evaluation sets. This column cannot be used as a feature or label, and is excluded from the features automatically.

Related

Table transformation from variables

My dataset looks like the one below, but there are many more brands and categories.
I would like to transform it so that each brand is a row and each attribute (good quality, affordable) is a column.
I've tried VARSTOCASES and I can calculate means from it, but that's not my desired output.
I need to obtain the brand names somehow - I should extract them from all of my variable names with
compute brand=char.substr(brand, 16)
like
compute brand=char.substr(P1_Good_Quality_BMW, 16)
I am fine with the VARSTOCASES part, and I can then put output like GQ into a column, but I don't know how to capture all of the brand names and match them to the mean values of the attributes.
Thank you in advance for your help.
This will get the data in the structure you intended - with a row for each brand and a column for each attribute:
varstocases /make GQ from P1_GoodQuality_BMW P1_GoodQuality_Audi P1_GoodQuality_Mercedes
/make Afford from P2_Affordable_BMW P2_Affordable_Audi P2_Affordable_Mercedes
/index=brand(GQ).
* at this point you should have the table you were trying to create,
* we'll just extract the brand names properly.
compute brand=char.substr(brand, 16).
execute.
* now we have the data structured nicely, we can aggregate by brand.
dataset declare agg.
aggregate /out=agg /break=brand /GQ Afford = mean (GQ Afford).
dataset activate agg.
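For comparison, here is the same reshape-extract-aggregate flow as a pandas sketch - the scores are made up, and it uses the P1_GoodQuality_* naming from the syntax above:
import pandas as pd

# Toy data with the same variable-naming scheme as the SPSS syntax above.
df = pd.DataFrame({
    "P1_GoodQuality_BMW": [4, 5], "P1_GoodQuality_Audi": [3, 4],
    "P1_GoodQuality_Mercedes": [5, 5],
    "P2_Affordable_BMW": [2, 3], "P2_Affordable_Audi": [3, 3],
    "P2_Affordable_Mercedes": [1, 2],
})

# Wide -> long (the VARSTOCASES step).
long_df = df.melt(var_name="variable", value_name="score")
# Split the variable name into its parts (the char.substr step).
long_df[["prefix", "attribute", "brand"]] = long_df["variable"].str.split("_", expand=True)
# One row per brand, one column per attribute, cell = mean (the AGGREGATE step).
result = long_df.pivot_table(index="brand", columns="attribute",
                             values="score", aggfunc="mean")
print(result)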

Handling a missing value in machine learning

I was analyzing a dataset with the following columns: [id, location, tweet, target_value]. I want to handle the missing values in the location column for some rows, so I thought to extract the location from the tweet column of that row (if the tweet contains a location) and put that value in the location column for that row.
Now I have some questions regarding the above approach.
Is this a good way to do it? Can we fill missing values by using the training data itself? Won't this be considered a redundant feature (because we are deriving the values of this feature from another feature)?
Can you please clarify your dataset a little bit more?
First, if we assume that the location is where the tweet was posted from, then your method (filling in the location column in the rows where that information is missing) is wrong.
Secondly, if we assume that the tweet correctly contains the location information, then you can fill in the missing rows using the tweets' location information.
If our second assumption is correct, then it would be a good approach, because you are feeding your dataset with correct information. In other words, you are giving the model more detailed information so that it can predict more accurately at test time.
Regarding your question "Won't this be considered a redundant feature (because we are deriving the values of this feature from another feature)?":
You can try removing the location column and training your model with the remaining 3 columns. Then you can check the performance of the new model using different metrics (accuracy etc.) and compare it with the results of the model trained on all 4 columns. If there is no significant difference, you can say the column is redundant. You can also use Principal Component Analysis (PCA) to detect correlated columns.
Finally, please NEVER use training data in your test dataset. It will lead to overfitting, and when you use your model in a real-world environment it will most probably fail.
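To make the fill-from-tweet idea concrete, here is a minimal pandas sketch. The column names match the question, but the keyword lookup is a made-up stand-in for whatever extraction you actually use (regex, a gazetteer, or NER):
import pandas as pd

# Toy rows in the shape described: [id, location, tweet, target_value].
df = pd.DataFrame({
    "id": [1, 2, 3],
    "location": ["London", None, None],
    "tweet": ["Flooding near the Thames",
              "Earthquake felt in Tokyo this morning",
              "Nothing to report"],
    "target_value": [1, 1, 0],
})

# Hypothetical place list; a real pipeline would use NER or a proper gazetteer.
KNOWN_PLACES = ["London", "Tokyo", "Paris"]

def location_from_tweet(tweet):
    # Return the first known place mentioned in the tweet, else None.
    for place in KNOWN_PLACES:
        if place.lower() in tweet.lower():
            return place
    return None

# Fill only the missing locations; rows with no recoverable place stay NaN.
mask = df["location"].isna()
df.loc[mask, "location"] = df.loc[mask, "tweet"].apply(location_from_tweet)
print(df)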

Evaluation error while trying to build model in DSX when dataset has a feature column with unique values

I'm getting an evaluation error while building a binary classification model in IBM Data Science Experience (DSX) using IBM Watson Machine Learning when one of the feature columns has unique categorical values.
The dataset I'm using looks like this:
Customer,Cust_No,Alerts,Churn
Ford,1000,8,0
GM,2000,50,1
Chrysler,3000,10,0
Tesla,4000,48,1
Toyota,5000,15,0
Honda,6000,55,1
Subaru,7000,12,0
BMW,8000,52,1
MBZ,9000,13,0
Porsche,10000,54,1
Ferrari,11000,9,0
Nissan,12000,49,1
Lexus,13000,10,0
Kia,14000,50,1
Saab,15000,12,0
Faraday,16000,47,1
Acura,17000,13,0
Infinity,18000,53,1
Eco,19000,16,0
Mazda,20000,52,1
In DSX, upload the above CSV data, then create a model using the automatic model builder. Select Churn as the label column and Customer and Alerts as the feature columns. Select the binary classification model type and use the default settings for the training/test split. Train the model. The model building fails with an evaluation error. If instead we select Cust_No and Alerts as the feature columns, the model is created successfully. Why is that?
When a model is built in DSX, the data is split into training, test, and holdout sets. These datasets are disjoint.
If the Customer field is chosen, then being a string field it must be converted into numeric values to be meaningful to the ML algorithms (linear regression, logistic regression, decision trees, etc.).
How this is done:
The algorithm iterates over each value of the Customer field and builds a dictionary mapping each string value to a numeric value (see Spark's StringIndexer - https://spark.apache.org/docs/2.2.0/ml-features.html#stringindexer).
When the model is evaluated or scored, the string fields in the test subset are converted to numeric values based on the dictionary built at training time. If a value is not found, there are two options: skip the entire record or throw an error (DSX chooses the first).
Since all values of the Customer field are unique, none of the records in the test dataset survives to the evaluation phase, hence the error that the model cannot be evaluated.
In the case of Cust_No, the field is already numeric and does not require a categorical encoding step. Even if the values seen at evaluation were not seen in training, they are used as-is.
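A minimal PySpark sketch (toy data, not the exact DSX internals) showing why an all-unique string column empties the evaluation set:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.appName("indexer-demo").getOrCreate()

# Toy split: every Customer value in the test set is unseen at training time,
# mimicking a column where all values are unique.
train = spark.createDataFrame([("Ford",), ("GM",), ("Tesla",)], ["Customer"])
test = spark.createDataFrame([("Toyota",), ("Honda",)], ["Customer"])

# The string -> index dictionary is built from the training data only.
indexer = StringIndexer(inputCol="Customer", outputCol="Customer_ix",
                        handleInvalid="skip")  # skip unseen labels, as DSX does
model = indexer.fit(train)

# Every test row carries an unseen label, so all rows are skipped and the
# evaluation set ends up empty - the failure described above.
print(model.transform(test).count())  # prints 0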
Taking a step back, it seems to me that your data doesn't really contain predictive information other than in Alerts.
The Customer and Cust_No fields are basically ID columns and do not seem to contain predictive information.
Can you post a screenshot of your Evaluation error? I can try to help, I work on DSX.

Regression when the explanatory variables differ in length/size

What is generally considered the correct approach when you are performing a regression and your training data contains 'incidents' of some sort, but there may be a varying number of these items per training example?
To give you an example - suppose I wanted to predict the likelihood of accidents on a number of different roads. For each road, I may have a history of multiple accidents, and each accident has its own attributes (date (how recent it was), number of casualties, etc.). How does one encapsulate all of this information in one row?
You could, for example, assume a maximum of (say) ten incidents and include the details of each as a separate input (date1, NoC1, date2, NoC2, etc.), but the problem is that we want each item to be treated similarly, and the model will treat items in column 4 as fundamentally separate from those in column 2, which it should not.
Alternatively, we could include one row per incident, but then any columns not related to these 'incidents' (such as the age of the road, its width, etc.) would be repeated multiple times and hence bias the results. (See the sketch of both encodings at the end of this question.)
What is the standard method used to accomplish this?
Many thanks
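For concreteness, a minimal pandas sketch (made-up data and column names) of the two encodings described above:
import pandas as pd

# Hypothetical inputs: per-road attributes plus a variable number of incidents.
roads = pd.DataFrame({"road": ["A1", "B2"], "age": [12, 40], "width_m": [7.5, 6.0]})
incidents = pd.DataFrame({
    "road": ["A1", "A1", "B2"],
    "days_ago": [30, 400, 90],
    "casualties": [2, 0, 1],
})

# Option 1: pad incidents into fixed slots (days_ago1, casualties1, days_ago2, ...).
# The model then treats slot 2 as a different feature from slot 1 - the objection above.
slot = incidents.groupby("road").cumcount() + 1
padded = incidents.assign(slot=slot).pivot(index="road", columns="slot",
                                           values=["days_ago", "casualties"])
padded.columns = [f"{name}{s}" for name, s in padded.columns]
wide = roads.merge(padded, left_on="road", right_index=True, how="left")

# Option 2: one row per incident; the road-level columns repeat on every row,
# over-weighting roads with many incidents - the bias described above.
long = roads.merge(incidents, on="road", how="left")

print(wide)
print(long)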

How to select attributes with respect to information gain in Weka?

I am working in Weka on text classification. I have a total of 113,232 attributes in the vocabulary, out of which I want to select the top 10,000. The following is the setup of my information gain filter:
AttributeSelection featureSelectionFilter = new AttributeSelection();
InfoGainAttributeEval informationGain = new InfoGainAttributeEval();
Ranker ranker = new Ranker();
ranker.setNumToSelect(10000);
ranker.setThreshold(0);
featureSelectionFilter.setEvaluator(informationGain);
featureSelectionFilter.setSearch(ranker);
// Apply to your training Instances (`data`, with the class attribute set):
featureSelectionFilter.setInputFormat(data);
Instances reduced = Filter.useFilter(data, featureSelectionFilter);
I assumed that it would rank the attributes in descending order of their information gain, but I am not sure whether my assumption is right. Here is an image of three attributes.
The maximum value, standard deviation, and mean of the first attribute are all higher than those of the others, which may indicate its importance, but these values for the second attribute are lower than for the third. Is that right? How does information gain select attributes from the vocabulary when we set setNumToSelect(10000)?
