Losing trained model output in Kaggle after commit - machine-learning

I am new to Kaggle. I trained a model for 8 hours and saved the trained model. But then I clicked the Commit button. Now I want to load my trained model in order to make the submission, but it's gone. Is there a way to get it back?

Since you've already committed the model, you can get the output files in your current notebook.
Go to :
Files --> Add or upload dataset
Select the Kernel output files tab and filter by Your work.
Your previously committed output files (e.g., saved models) will be imported into the current notebook.
Hope this helps!

Related

Getting "Trial 0 encounter error with message: Must be at least 2. Parameter name: numClasses" Using VS 2019 Model Builder

I'm new to Machine Learning and ML.NET (both from a coding and model builder perspective). I've written code to train and predict (relatively simple examples against our data) but thought it would be best to use the Model Builder since it picks the appropriate models to train.
I'm using the Data Classification scenario in the model builder. I have a dataset (from SQL Server) that successfully trains, but I wanted to use a different version of the dataset (same schema, different data). When creating this other dataset, I now get the error "Trial 0 encounter error with message: Must be at least 2" and I've not been able to find any information about it. I've compared the two datasets (column types, null values, and the Advanced data options to make sure they are the same) - the original one that trains and the new one that throws this exception - and they appear to be identical other than the data itself.
I went as far as using Telerik JustDecompile to see where in the ML code (Microsoft.ML.Trainers - LinearMulticlassModelParametersBase) this error was being thrown from. I understand there are 2 different types of data classification scenarios - Binary and Multi class. I have a column defined as the label that should be either 1 or 0.
I appreciate any help. Hopefully someone can point me in the right direction. I've been analyzing the dataset that works and the one that doesn't for a number of days and cannot find the difference. Does the model use different algorithms based on the actual data being trained, even when the schema is the same?
I'm going to try using these same 2 datasets through code (not using the model builder).
Thanks.
Tom
Did your label column have more than two categories in the original dataset?
It's possible your multiclass trainer requires at least 3 categories.
As for the selection of algorithms, the model builder picks one based on accuracy metrics by using the AutoML class. But you can just try out different ones in code. Once you have selected one in code it will use that specific algorithm. If you use the model builder you will get different algorithms depending on the dataset you give it.
For example you can just change your pipeline from this:
var pipeline = ctx.Transforms.Text
    .FeaturizeText("Features", nameof(SentimentIssue.Text))
    .Append(ctx.BinaryClassification.Trainers
        .LbfgsLogisticRegression("Label", "Features"));
To this:
var pipeline = ctx.Transforms.Text
    .FeaturizeText("Features", nameof(SentimentIssue.Text))
    .Append(ctx.BinaryClassification.Trainers
        .SdcaLogisticRegression("Label", "Features"));
Or even just run the new data through the model builder again and see which trainer it picks.
I got the exact same error message.
I fixed it by doing these things:
In the Model Builder, go to Data > Advanced Data Options and make sure to set the Label as Binary (see the code sketch after these steps).
Restart Visual Studio a lot.
In the SQL query used to pull the CSV from SQL Server, I did an ORDER BY NEWID() to provide a random distribution of the data set. I don't know if that matters.
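If you hit the same numClasses error when training from code rather than Model Builder, a minimal sketch of the equivalent fix is to load the 0/1 label column as a Boolean, which steers ML.NET toward binary rather than multiclass classification. The column layout, file name, and the loader's acceptance of 0/1 as Boolean are assumptions here:

using Microsoft.ML;
using Microsoft.ML.Data;

// Load the 0/1 label as Boolean so the task is treated as binary
// classification, not multiclass. Column layout is hypothetical.
var ctx = new MLContext();
var loader = ctx.Data.CreateTextLoader(new[]
{
    new TextLoader.Column("Label", DataKind.Boolean, 0),
    new TextLoader.Column("Text", DataKind.String, 1)
}, separatorChar: ',', hasHeader: true);
var data = loader.Load("data.csv");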

How much data / context needed to train custom NER Spacy model?

I am trying to extract previous Job titles from a CV using spacy and named entity recognition.
I would like to train spacy to detect a custom named entity type : 'JOB'. For that I have around 800 job title names from https://www.careerbuilder.com/browse/titles/ that I can use as training data.
In my training data for spaCy, do I need to embed these job titles in sentences to provide context, or not?
In general, a job title in a CV stands on its own and is not really part of a full sentence.
Also, if I need to provide coherent context for each of the 800 titles, it will be too time-consuming for what I'm trying to do, so maybe there are solutions other than NER?
Generally, Named Entity Recognition relies on the context of words; otherwise the model would not be able to detect entities in previously unseen words. Consequently, the list of titles on its own would not help you train any model. You could instead run string matching (sketched below) to find any of those 800 titles in CV documents, and you would even be guaranteed to find all of them - no unknown titles, though.
If you could find 800 (or fewer) real CVs and replace the job titles with those in your list (or others!), then you are all set to train a model capable of NER. This would be the way to go, I suppose. Just download as many freely available CVs from the web as you can and see where this gets you. If it is not enough data, you can augment it, for example by swapping the job titles in the data for some of the titles in your list.
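The string-matching route is quick to prototype. A minimal sketch (in C#, to match the other code in this thread; titles.txt and cv.txt are hypothetical input files):

using System;
using System.IO;
using System.Linq;

class TitleMatcher
{
    static void Main()
    {
        // One known job title per line, e.g. scraped from careerbuilder.com.
        var titles = File.ReadAllLines("titles.txt");
        var cvText = File.ReadAllText("cv.txt");

        // Case-insensitive containment check; a real matcher would also
        // respect word boundaries and prefer the longest overlapping title.
        var found = titles.Where(t =>
            cvText.IndexOf(t, StringComparison.OrdinalIgnoreCase) >= 0);

        foreach (var title in found)
            Console.WriteLine(title);
    }
}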

Compare the performance of the models and add annotations to the results

I'm preparing for the Azure Machine Learning exam and have a question, shown below:
You are working on an Azure Machine Learning Experiment.
You have the dataset configured as shown in the following table:
You need to ensure that you can compare the performance of the models and add annotations to the results.
A. You consolidate the output of the Score Model modules by using the Add Rows module and then use the Execute R Script module.
B. You connect the Score Model modules from each trained model as inputs for the Evaluate Model module and then use the Execute R Script Module.
C. You save the output of the Score Model modules as a combined set, and then use the Project Columns modules to select the MAE.
D. You connect the Score Model modules from each trained model as inputs for the Evaluate Model module and then save the results as a dataset.
I think all of the above are correct, but what confuses me is that there are different answers on the internet. Some are the same as mine, but others are not. I need someone to confirm my answer or explain the correct one to me.

Update a trained model in ML.NET

This example shows how to use matrix factorization to build a recommendation system. This example is particularly suitable for a dataset with only two related IDs, such as a user ID and the ID of a product that the corresponding user has purchased.
Based on this example, I prepared an input data like below.
[UserId] [ProductId]
3        1
3        15
3        23
5        9
5        1
8        2
8        1
...
Then I changed the column names and created a TextLoader:
var reader = ctx.Data.TextReader(new TextLoader.Arguments()
{
    Separator = "tab",
    HasHeader = true,
    Column = new[]
    {
        new TextLoader.Column("Label", DataKind.R4, 0),
        new TextLoader.Column("UserId", DataKind.U4, new[] { new TextLoader.Range(0) }, new KeyRange(0, 100000)),
        new TextLoader.Column("ProductId", DataKind.U4, new[] { new TextLoader.Range(1) }, new KeyRange(0, 300))
    }
});
It works great. It recommends a list of products that the target user may purchase, with individual scores. However, it doesn't work with new customer data that didn't exist in the initial input data: for, say, UserId 1, it gives a score of NaN as the result of the prediction.
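For reference, scoring a single (user, product) pair looks roughly like this - a sketch using the current ML.NET prediction API (the pre-1.0 API used in the question differs slightly), where ctx and model come from the code above and the class names are assumptions:

// Hypothetical input/output types mirroring the loader's columns.
var engine = ctx.Model.CreatePredictionEngine<RatingInput, RatingOutput>(model);
var result = engine.Predict(new RatingInput { UserId = 1, ProductId = 2 });
// For a user unseen at training time, result.Score comes back as NaN.

public class RatingInput
{
    public uint UserId { get; set; }
    public uint ProductId { get; set; }
}

public class RatingOutput
{
    public float Score { get; set; }
}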
Retraining the model could be an obvious answer, but it seems futile to retrain the model every time new data comes in. I think there's definitely a way to update the existing model, but I cannot find the relevant documentation, APIs, or a sample anywhere. I ended up leaving a question on the official GitHub of ML.NET, but I've got no answers so far.
Question would be very simple, in a nutshell, how can I update a trained model in ML.NET? Linking a relevant source of information would be greatly appreciated too.
In this particular example, because of the task being performed, you are limited to the scope of observations the model was trained on and can only make predictions on that set. As you mentioned, a good way to go about it would be to re-train. I haven't tried this myself, but you might want to try one of the following:
Run the Fit function again, using the new data you want to train with as your input. Not only should the model persist its previous training, it should also re-train using the additional data you have provided.
Save the model to a file, load the persisted model, then run the Fit function as above.
As of 2021:
The re-training process is described in detail here: https://learn.microsoft.com/en-us/dotnet/machine-learning/how-to-guides/retrain-model-ml-net
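A minimal sketch of the save/load/re-Fit approach suggested above, assuming ctx, pipeline, model, and the data views come from the earlier code:

// Persist the trained model together with its input schema.
ctx.Model.Save(model, trainingData.Schema, "model.zip");

// Later: load the persisted model back.
var loadedModel = ctx.Model.Load("model.zip", out var inputSchema);

// Matrix factorization (to my knowledge) has no incremental update in
// ML.NET, so "updating" means calling Fit again on a dataset that
// includes the new users/products alongside the old rows.
var retrained = pipeline.Fit(combinedOldAndNewData);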

Weka Experimenter 'Class attribute is not nominal' but data is processed from Explorer

Good Evening,
I am working on a supervised classification task. I have a big arff file full of data in the format "text", class. There are only two classes, E and I.
I can load this data into Weka Explorer, apply the StringToWordVector filter with TF-IDF, then classify it using LibSVM and get results. But I need to use 5x2 cross-validation and get the area under the ROC curve. So I saved the processed data, opened Weka Experimenter, loaded it in, set it to 2 folds and 5 iterations, and set the algorithm to LibSVM.
When I go to the RUN tab and press start I get the following error:
18:31:18: Started
18:31:18: Class attribute is not nominal!
18:31:18: Interrupted
18:31:18: There was 1 error
I don't know why this is happening, what exactly the error is, or how to fix it. I googled this error but it did not lead me to any solutions, and I am not sure where to go from here.
I can go back to Explorer, reload in that processed file, and classify it without any issues but I need to do it in Experimenter.
In my case, there were nominal attributes in the file. However, Weka expects the class attribute - the one indicating which class each record is assigned to - to be last. Here's how I rearranged the data so that the nominal class attribute was last:
In Explorer, open the arff file.
Click 'Edit...' then find the column which should be the class of each record.
Right click on the column header and select 'Attribute as class'.
Click 'Save...' and use this new dataset in Experimenter.
Works like a charm.
If your class attribute is numeric (like 0, 1), change it to a nominal form (like true, false).
The StringToWordVector filter puts the class attribute first in the data that it outputs, while the Experimenter expects the last attribute to be the class. You can reorder the attributes of the filtered data, but the better (and, in general, correct) approach when combining filters with classifiers is to use the FilteredClassifier to encapsulate your base classifier (LibSVM) together with the StringToWordVector filter. This should work out just fine because the class attribute is the last attribute in your original "text", class data.
