I'm trying to build a logistic regression model following the syntax of the built-in functions in Vertica. The model builds properly, but predictLogisticReg does not work.
I am able to build a model using
SELECT
v_ml.logisticReg('logRegModel', 'public.regression_training_table',
'longTermPlayer', ' IOS_or_not, firstDayTransactions',
'--epsilon=0.000001 --max_iterations=100');
and can verify that it has worked by checking the summary:
SELECT
v_ml.summaryLogisticReg(using parameters
model_name='logRegModel', owner='dbadmin');
When I try to predict features using
SELECT
user_id,
v_ml.predictLogisticReg('IOS_or_not', 'firstDayTransactions'
using parameters model_name='logRegModel', owner='dbadmin')
FROM public.regression_test_table
on a test set (with identical columns), I am getting the error:
The input column corresponding to "ios_or_not" is not available
If you have any idea why it doesn't seem to be recognising the data in the test set I'd very much appreciate it!
Thanks.
Solved. For those interested:
I passed the column names as quoted string literals when they should have been bare column references. Replace with
SELECT
user_id,
v_ml.predictLogisticReg(IOS_or_not, firstDayTransactions
using parameters model_name='logRegModel', owner='dbadmin')
FROM public.regression_test_table
Related
I'm new to Machine Learning and ML.Net (both from a coding and Model Builder perspective). I've written code to train and predict (relatively simple examples against our data) but thought it would be best to use the Model Builder since it picks the appropriate models to train.
I'm using the Data Classification scenario in the model builder. I have a dataset (from SQL Server) that successfully trains but I wanted to use a different version of the dataset (same schema, different data). When creating this other dataset, I now get the error "Trial 0 encounter error with message: Must be at least 2" and I've not been able to find any information about the error. I've compared the two datasets (column types, null values, checked the Advanced data options to make sure they are the same) - original one that trains and the new one that throws this exception and they appear to be identical other than the data itself.
I went as far as using Telerik JustDecompile to see where in the ML code (Microsoft.ML.Trainers - LinearMulticlassModelParametersBase) this error was being thrown from. I understand there are 2 different types of data classification scenarios - Binary and Multi class. I have a column defined as the label that should be either 1 or 0.
I appreciate any help. Hopefully someone can point me in the right direction. I've been analyzing the dataset that works and the one that doesn't for a number of days and cannot find the difference. Does the model use different algorithms based on the actual data being trained, even when the schema is the same?
I'm going to try using these same 2 datasets through code (not using the model builder).
Thanks.
Tom
Did your label column have more than two categories in the original dataset?
It's possible your multiclass trainer requires at least 3 categories.
As for the selection of algorithms, the model builder picks one based on accuracy metrics by using the AutoML class. But you can just try out different ones in code. Once you have selected one in code it will use that specific algorithm. If you use the model builder you will get different algorithms depending on the dataset you give it.
For example you can just change your pipeline from this:
var pipeline = ctx.Transforms.Text
.FeaturizeText("Features", nameof(SentimentIssue.Text))
.Append(ctx.BinaryClassification.Trainers
.LbfgsLogisticRegression("Label", "Features"));
To this:
var pipeline = ctx.Transforms.Text
.FeaturizeText("Features", nameof(SentimentIssue.Text))
.Append(ctx.BinaryClassification.Trainers
.SdcaLogisticRegression("Label", "Features"));
Or even just run the new data through the model builder again and see which trainer it picks.
I got the exact same error message.
I fixed it by doing these things:
In the Model Builder > Data > Advanced Data Options. Make sure to set the Label as Binary as shown in the screenshot.
Restart Visual Studio a lot.
In the SQL query that pulls the CSV from SQL Server, I added an ORDER BY NEWID() to produce a random ordering of the data set. I don't know if that matters.
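One reason the random ordering can matter: if the exported rows happen to be sorted by label, a sequential train/validation split can leave one side with only a single class, which is one way to hit an error like "Must be at least 2". Here is a minimal plain-Python sketch of that failure mode (hypothetical data, not the ML.NET internals):

```python
import random

# Hypothetical labels exported in sorted order: all 0s first, then all 1s.
labels = [0] * 80 + [1] * 20

def classes_per_split(labels, train_fraction=0.8):
    """Count the distinct classes on each side of a sequential split."""
    cut = int(len(labels) * train_fraction)
    return len(set(labels[:cut])), len(set(labels[cut:]))

# Sequential split on sorted data: each side sees only one class.
print(classes_per_split(labels))  # (1, 1)

# After shuffling (the effect of ORDER BY NEWID()), both sides see both classes.
random.seed(0)
shuffled = labels[:]
random.shuffle(shuffled)
print(classes_per_split(shuffled))
```

If the sorted split really is the cause, randomizing the export order (or letting the tool shuffle) makes the error go away, which matches the ORDER BY NEWID() observation above.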
I'm preparing for the Azure Machine Learning exam and have a question, shown below:
You are working on an Azure Machine Learning Experiment.
You have the dataset configured as shown in the following table:
You need to ensure that you can compare the performance of the models and add annotations to the results.
A. You consolidate the output of the Score Model modules by using the Add Rows module and then use the Execute R Script module.
B. You connect the Score Model modules from each trained model as inputs for the Evaluate Model module and then use the Execute R Script Module.
C. You save the output of the Score Model modules as a combined set, and then use the Project Columns modules to select the MAE.
D. You connect the Score Model modules from each trained model as inputs for the Evaluate Model module and then save the results as a dataset.
I think all of the above are correct, but what confuses me is that there are different answers on the internet. Some agree with mine, but others do not. I need someone to confirm my answer or explain the correct one to me.
I want to get the details (unique id) of the incorrectly classified instances using the Weka GUI. I am following the answers to this question. There, they suggest using the StringToNominal filter in the Preprocess tab to convert the unique id, which is a string. However, following that, I suspect the classifier may also be treating the unique id column as a feature during classification.
Please suggest the correct way to approach this.
I am happy to provide examples if needed.
Let's suppose you want to (1) add an instance ID, (2) not use that instance ID in the model, and (3) see the individual predictions, with the instance ID and maybe some other attributes.
We’re going to show this with a smaller data set. Open iris.arff, for example.
Use the AddID filter in the Preprocess tab, in the Unsupervised Attribute filters. ID will be the first attribute.
Now we need to ignore it during modeling. Use the FilteredClassifier with the Remove filter.
And we need to output the predictions together with the ID attribute so we can see what happened. Here we are outputting all the attributes, although we don't need all of them.
We get out this detail in the output window:
=== Predictions on test split ===
inst#,actual,predicted,error,prediction,ID,sepallength,sepalwidth,petallength,petalwidth
1,2:Iris-versicolor,2:Iris-versicolor,,0.968,53,6.9,3.1,4.9,1.5
2,3:Iris-virginica,3:Iris-virginica,,0.968,131,7.4,2.8,6.1,1.9
3,2:Iris-versicolor,2:Iris-versicolor,,0.968,59,6.6,2.9,4.6,1.3
4,1:Iris-setosa,1:Iris-setosa,,1,36,5,3.2,1.2,0.2
5,3:Iris-virginica,3:Iris-virginica,,0.968,101,6.3,3.3,6,2.5
6,2:Iris-versicolor,2:Iris-versicolor,,0.968,88,6.3,2.3,4.4,1.3
7,1:Iris-setosa,1:Iris-setosa,,1,42,4.5,2.3,1.3,0.3
8,1:Iris-setosa,1:Iris-setosa,,1,8,5,3.4,1.5,0.2
and so on.
I'm getting an evaluation error while building a binary classification model in IBM Data Science Experience (DSX) using IBM Watson Machine Learning when one of the feature columns has unique categorical values.
The dataset I'm using looks like this:
Customer,Cust_No,Alerts,Churn
Ford,1000,8,0
GM,2000,50,1
Chrysler,3000,10,0
Tesla,4000,48,1
Toyota,5000,15,0
Honda,6000,55,1
Subaru,7000,12,0
BMW,8000,52,1
MBZ,9000,13,0
Porsche,10000,54,1
Ferrari,11000,9,0
Nissan,12000,49,1
Lexus,13000,10,0
Kia,14000,50,1
Saab,15000,12,0
Faraday,16000,47,1
Acura,17000,13,0
Infinity,18000,53,1
Eco,19000,16,0
Mazda,20000,52,1
In DSX, upload the above CSV data, then create a model using the automatic model builder. Select Churn as the label column and Customer and Alerts as the feature columns. Select the Binary Classification model and use the default settings for the training/test split. Train the model. The model building fails with an evaluation error. If we instead select Cust_No and Alerts as the feature columns, the model is created successfully. Why is that?
When a model is built in DSX, the data is split into training, test, and holdout sets. These datasets are disjoint.
If the Customer field is chosen, then, being a string field, it must be converted to numeric values to be meaningful to ML algorithms (linear regression, logistic regression, decision trees, etc.).
How this is done:
The algorithm iterates over each value of the Customer field and builds a dictionary mapping each string value to a numeric value (see Spark's StringIndexer - https://spark.apache.org/docs/2.2.0/ml-features.html#stringindexer).
When the model is evaluated or scored, the string fields in the test subset are converted to numeric values using the dictionary built at training time. If a value is not found, there are two options: skip the entire record or throw an error (DSX chooses the first).
Since every value in the Customer field is unique, none of the test records survives this conversion, so no records reach the evaluation phase; hence the error that the model cannot be evaluated.
In the case of Cust_No, the field is already numeric and does not require a category-encoding operation. Even if values seen at evaluation time were not present in training, they are used as is.
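To make the mechanism concrete, here is a small plain-Python sketch of the StringIndexer-style behaviour described above (a simplified illustration, not the actual Spark or DSX code): the dictionary is built from the training values, and test values that were never seen are skipped.

```python
# Training time: build a string -> index dictionary, as StringIndexer does.
train_customers = ["Ford", "GM", "Chrysler", "Tesla"]
index = {name: i for i, name in enumerate(train_customers)}

def encode_for_scoring(test_rows, index):
    """Keep only rows whose category was seen at training time (skip policy)."""
    encoded = []
    for name in test_rows:
        if name in index:            # category seen during training
            encoded.append(index[name])
        # else: skip the entire record, which is what DSX does
    return encoded

# Every Customer value in the dataset is unique, so the test-split values
# were never seen during training and every record gets skipped.
test_customers = ["Toyota", "Honda", "Subaru"]
print(encode_for_scoring(test_customers, index))  # [] - nothing left to evaluate
```

With an empty scored set, the evaluation step has no records to compute metrics on, which is exactly the evaluation error described above.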
Taking a step back, it seems to me that your data doesn't really contain predictive information other than Alerts.
The Customer and Cust_No fields are basically ID columns and don't appear to carry predictive information.
Can you post a screenshot of your Evaluation error? I can try to help, I work on DSX.
Good Evening,
I am working on a supervised classification task. I have a big ARFF file full of data in the format "text", class. There are only two classes, E and I.
I can load this data into the Weka Explorer, apply StringToWordVector with TF-IDF, then classify it with LibSVM and get results. But I need to use 5x2 cross-validation and get the area under the ROC curve. So I save the processed data, open the Weka Experimenter, load it in, set it to 2 folds and 5 iterations, and set the algorithm to LibSVM.
When I go to the RUN tab and press start I get the following error:
18:31:18: Started
18:31:18: Class attribute is not nominal!
18:31:18: Interrupted
18:31:18: There was 1 error
I don't know why this is happening, what exactly the error means, or how to fix it. Googling the error has not led me to any solutions, and I am not sure where to go from here.
I can go back to the Explorer, reload that processed file, and classify it without any issues, but I need to do it in the Experimenter.
In my case, there were nominal attributes in the file. However, Weka expects the class attribute to be last, since it indicates the class each record is assigned to. Here's how I rearranged the data so that the nominal class attribute was last:
In Explorer, open the arff file.
Click 'Edit...' then find the column which should be the class of each record.
Right click on the column header and select 'Attribute as class'.
Click 'Save...' and use this new dataset in Experimenter.
Works like a charm.
If your class attribute is numeric (like 0, 1), change it to a nominal form like true/false (for example, with Weka's NumericToNominal filter).
The StringToWordVector filter puts the class attribute first in the data it outputs, while the Experimenter expects the last attribute to be the class. You can reorder the attributes of the filtered data, but the best (and, in general, correct) approach when combining filters with classifiers is to use the FilteredClassifier to encapsulate your base classifier (LibSVM) with the StringToWordVector filter. This works out fine here because the class attribute is the last attribute in your original "text", class data.