How to get deterministic results with ML.NET multiclass classification? - machine-learning

According to this tutorial https://learn.microsoft.com/en-us/dotnet/machine-learning/tutorials/github-issue-classification
Initialize the _mlContext global variable with a new instance of MLContext with a random seed (seed: 0) for repeatable/deterministic results across multiple trainings. Replace the Console.WriteLine("Hello World!") line with the following code in the Main method:
_mlContext = new MLContext(seed: 0);
I tried setting the seed parameter to 0, 1, and null as well, but after I processed my data, built and trained my model, and called the following method, I always got different test metric results.
MulticlassClassificationMetrics testMetrics = _mlContext.MulticlassClassification.Evaluate(_trainedModel.Transform(testDataView));
My data comes from my database, always with the same schema and data. My question is: how can I get deterministic results in ML.NET when I evaluate a multiclass classification problem?
Thanks in advance for your help!
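For reference, here is a minimal, self-contained sketch of the kind of setup described above. The class, column names, sample rows, and trainer are assumptions based on the linked tutorial, not my actual code.
using System;
using Microsoft.ML;
using Microsoft.ML.Data;

public class IssueRow
{
    public string Area { get; set; }        // the class to predict
    public string Description { get; set; } // free text used as the feature
}

public static class DeterministicDemo
{
    public static void Main()
    {
        // Fixed seed so the framework's own random operations (e.g. the split below) repeat across runs.
        var mlContext = new MLContext(seed: 0);

        var rows = new[]
        {
            new IssueRow { Area = "bug",  Description = "crash when loading file" },
            new IssueRow { Area = "docs", Description = "typo in readme" },
            new IssueRow { Area = "bug",  Description = "null reference on startup" },
            new IssueRow { Area = "docs", Description = "missing api reference" },
            new IssueRow { Area = "bug",  Description = "unexpected exception on save" },
            new IssueRow { Area = "docs", Description = "outdated tutorial section" },
        };
        IDataView data = mlContext.Data.LoadFromEnumerable(rows);
        var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.25);

        var pipeline = mlContext.Transforms.Conversion.MapValueToKey("Label", "Area")
            .Append(mlContext.Transforms.Text.FeaturizeText("Features", "Description"))
            .Append(mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy("Label", "Features"))
            .Append(mlContext.Transforms.Conversion.MapKeyToValue("PredictedLabel"));

        ITransformer trainedModel = pipeline.Fit(split.TrainSet);
        MulticlassClassificationMetrics testMetrics =
            mlContext.MulticlassClassification.Evaluate(trainedModel.Transform(split.TestSet));
        Console.WriteLine($"MicroAccuracy: {testMetrics.MicroAccuracy}");
    }
}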

Related

Find the importance of each column to the model

I have an ML.NET project and as of right now everything has gone great. I have a motor that collects a power reading 256 times around each rotation, and I push that into a model. Right now it determines the state of the motor nearly perfectly. The motor itself only has room for 38 values on it at a time, so I have been spending several rotations to collect the full 256 samples for my training data.
I would like to cut the sample size down to 38 so that I can determine its state every rotation. If I just evenly space the samples down to 38, my model degrades by a lot. I know I am not feeding the model the features it thinks are most important; I am just making a guess and randomly selecting data for the model.
Is there a way I can see the importance of each value in the array during the training process? I was thinking I could use IDataView for this and I found the below statement about it (link).
Standard ML schema: The IDataView system does not define, nor prescribe, standard ML schema representation. For example, it does not dictate representation of nor distinction between different semantic interpretations of columns, such as label, feature, score, weight, etc. However, the column metadata support, together with conventions, may be used to represent such interpretations.
Does this mean I can print out such things as the weight for each column, and how would I do that?
I have actually only been working with ML.NET for a couple of weeks now, so I apologize if the question is naive; I assure you I have Googled this as many ways as I can think of. Any advice would be appreciated. Thanks in advance.
EDIT:
Thank you for the answer; I was going down a completely useless path. I have been trying to get it to work following the example you linked to. I have 260 columns with numbers and one column with the condition as one of five text strings. This is the condition I am trying to predict.
The first time I tried it, it threw the error "expecting single but got string". No problem: I used .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label")) to convert to key values, and then it threw the error "expected Single, got Key UInt32". Any ideas on how to push that into this function?
At any rate, thank you for the reply, but I guess my upvotes don't count yet, sorry. Hopefully I can upvote it later or someone else here can. Below is the code example.
//Create MLContext
MLContext mlContext = new MLContext();

//Load Data
IDataView data = mlContext.Data.LoadFromTextFile<ModelInput>(TRAIN_DATA_FILEPATH, separatorChar: ',', hasHeader: true);

// 1. Get the column name of input features.
string[] featureColumnNames =
    data.Schema
        .Select(column => column.Name)
        .Where(columnName => columnName != "Label")
        .ToArray();

// 2. Define estimator with data pre-processing steps
IEstimator<ITransformer> dataPrepEstimator =
    mlContext.Transforms.Concatenate("Features", featureColumnNames)
        .Append(mlContext.Transforms.NormalizeMinMax("Features"))
        .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label"));

// 3. Create transformer using the data pre-processing estimator
ITransformer dataPrepTransformer = dataPrepEstimator.Fit(data); // error here

// 4. Pre-process the training data
IDataView preprocessedTrainData = dataPrepTransformer.Transform(data);

// 5. Define Stochastic Dual Coordinate Ascent machine learning estimator
var sdcaEstimator = mlContext.Regression.Trainers.Sdca();

// 6. Train machine learning model
var sdcaModel = sdcaEstimator.Fit(preprocessedTrainData);

ImmutableArray<RegressionMetricsStatistics> permutationFeatureImportance =
    mlContext
        .Regression
        .PermutationFeatureImportance(sdcaModel, preprocessedTrainData, permutationCount: 3);

// Order features by importance
var featureImportanceMetrics =
    permutationFeatureImportance
        .Select((metric, index) => new { index, metric.RSquared })
        .OrderByDescending(myFeatures => Math.Abs(myFeatures.RSquared.Mean));

Console.WriteLine("Feature\tPFI");
foreach (var feature in featureImportanceMetrics)
{
    Console.WriteLine($"{featureColumnNames[feature.index],-20}|\t{feature.RSquared.Mean:F6}");
}
I believe what you are looking for is called Permutation Feature Importance (PFI). It tells you which features are most important by permuting each feature in isolation and measuring how much that change affects the model's performance metrics, so you can see which features the model relies on most.
Interpret model predictions using Permutation Feature Importance is the doc that describes how to use this API in ML.NET.
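Since the label in your edit is one of five text strings, the regression trainer in that sample expects a numeric (Single) label, which matches the errors you reported. Below is a rough, untested sketch of what the same PFI pattern might look like in its multiclass form, reusing mlContext, data, and featureColumnNames from the code above (the trainer choice and column names are assumptions on my part):
// Rough multiclass variant of the PFI sample above: keep the MapValueToKey step,
// swap the regression trainer for a multiclass one, and use the multiclass PFI overload,
// which expects a key-typed label.
var multiclassPrep = mlContext.Transforms.Concatenate("Features", featureColumnNames)
    .Append(mlContext.Transforms.NormalizeMinMax("Features"))
    .Append(mlContext.Transforms.Conversion.MapValueToKey("Label"));

IDataView preprocessedData = multiclassPrep.Fit(data).Transform(data);

var multiTrainer = mlContext.MulticlassClassification.Trainers.SdcaMaximumEntropy();
var multiModel = multiTrainer.Fit(preprocessedData);

// PFI for multiclass reports per-feature statistics for the multiclass metrics.
var pfi = mlContext.MulticlassClassification
    .PermutationFeatureImportance(multiModel, preprocessedData, permutationCount: 3);

// Rank features by how much permuting them changes micro-accuracy.
var rankedFeatures = pfi
    .Select((metrics, index) => new { index, metrics.MicroAccuracy })
    .OrderByDescending(f => Math.Abs(f.MicroAccuracy.Mean));

foreach (var feature in rankedFeatures)
{
    Console.WriteLine($"{featureColumnNames[feature.index],-20}|\t{feature.MicroAccuracy.Mean:F6}");
}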
You can also use an open-source set of packages that are much more sophisticated than what is found in ML.NET. I have an example on my GitHub of how to use R with advanced explainer packages to explain ML.NET models. You can get local instance-level explanations as well as global model breakdowns/details/diagnostics/feature interactions, etc.
https://github.com/bartczernicki/BaseballHOFPredictionWithMlrAndDALEX

MultiDataSetIterator with INDArrays (not csv files) and multiple outputs DL4J

I want to train a ComputationGraph which has two outputs (this model), and in my script I have INDArrays (1 input and 2 outputs) ready to be fed into the neural network. It seems that I should use a MultiDataSetIterator to be able to set up the batch size before using the model.fit() function. I have been looking for a way to implement that for a long time, but the answers I find always use CSV files, which is not what I want: while performing the simulations of the game I am creating a dataset of INDArrays that are stored in memory, and I am not loading any kind of CSV file.
Any ideas on how to create my MultiDataSetIterator to feed my fit() function?
You don't have to use the MultiDataSetIterator. You can fit with a MultiDataSet (here) or you can fit with arrays of INDArrays (here) using your INDArrays in memory.

How to create training and test DataSetIterators in deeplearning4j?

I am building a recurrent neural network with deeplearning4j and I need to create the training and test data sets.
All the examples provided in the documentation and the example code use a CSVSequenceRecordReader to read CSV files.
Then a DataSetIterator is created with the SequenceRecordReaderDataSetIterator constructor and fed into the MultiLayerNetwork.fit() or the MultiLayerNetwork.evaluate() method (depending if it's a training or test data set iterator).
However, in my case, the data set I have is not stored in a CSV file. I access it online through a third-party library and pre-process it to obtain a List<Data> and a List<Labels> object.
How can I:
1) create the DataSetIterator from my two lists?
2) split the DataSetIterator into a training set and a test set?
Edit:
I think my question is too broad. Let me try to narrow it down.
I have started to read this article which uses a very simple approach to create a data set:
It creates two INDArrays and builds a DataSet from them using the DataSet(INDArray first, INDArray second) constructor.
Training works using network.fit(dataSet);, but I can't evaluate the network while training, as the evaluate method requires a data set iterator, not a data set.
Moreover, from what I understand, using this approach also means that there is only one huge data set, no mini batches.
I also guess that I could create mini-batches from this big data set by using the batchBy(int num) method. But this method returns a list of data sets, not a data set iterator... iterateWithMiniBatches() does return a data set iterator, but when I looked at the source file, it returns null and is deprecated. Then I tried to see if there is an implementation of DataSetIterator I could use, but there are a lot of them. I tried BaseDataSetIterator, but it does not take a DataSet as a constructor parameter but a DataSetFetcher... Yet another layer.
Is there somewhere an example that shows how to create a data set without using the default record readers? Or should I just create my own implementation of a record reader?
1)
MultiLayerNetwork.evaluate() accepts a ListDataSetIterator as a parameter.
If you have a List<Data> object, you can first map it into a double[] featureVector and a double[] labelVector, and then create a ListDataSetIterator like this:
INDArray x = Nd4j.create(featureVector, new int[]{featureVector.length/numberOfFeatures, numberOfFeatures}, 'c');
INDArray y = Nd4j.create(labelVector, new int[]{labelVector.length/numberOfLabels, numberOfLabels}, 'c');
final DataSet allData = new DataSet(x,y);
final List<DataSet> list = allData.asList();
ListDataSetIterator iterator = new ListDataSetIterator(list);
For 2) you should just create two separate iterators, one for training and one for testing.
You can then evaluate your net with net.evaluate(testIterator);

XGBoost prediction always returning the same value - why?

I'm using SageMaker's built in XGBoost algorithm with the following training and validation sets:
https://files.fm/u/pm7n8zcm
Running the prediction model that comes out of training with the above datasets always produces the exact same result.
Is there something obvious in the training or validation datasets that could explain this behavior?
Here is an example code snippet where I'm setting the Hyperparameters:
{
{"max_depth", "1000"},
{"eta", "0.001"},
{"min_child_weight", "10"},
{"subsample", "0.7"},
{"silent", "0"},
{"objective", "reg:linear"},
{"num_round", "50"}
}
And here is the source code: https://github.com/paulfryer/continuous-training/blob/master/ContinuousTraining/StateMachine/Retrain.cs#L326
It's not clear to me which hyperparameters might need to be adjusted.
This screenshot shows that I'm getting a result with 8 indexes:
But when I add the 11th one, it fails. This leads me to believe that I have to train the model with the zero-valued indexes included instead of removing them, so I'll try that next.
Update: retraining with the zero values included doesn't seem to help; I'm still getting the same value every time. I noticed I can't send more than 10 values to the prediction endpoint or it returns an error: "Unable to evaluate payload provided". So at this point, using the libsvm format has only added more problems.
You've got a few things wrong there.
Using {"num_round", "50"} with such a small eta ({"eta", "0.001"}) will give you nothing: 50 boosting rounds at a learning rate of 0.001 barely move the predictions away from the base score, which is why every prediction looks nearly identical.
{"max_depth", "1000"}: 1000 is insane! (the default value is 6)
Suggesting:
{"max_depth", "6"},
{"eta", "0.05"},
{"min_child_weight", "3"},
{"subsample", "0.8"},
{"silent", "0"},
{"objective", "reg:linear"},
{"num_round", "200"}
Try this and report your output
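For reference, those suggested values as a plain C# dictionary (a sketch only; how they are attached to the SageMaker training job depends on the linked Retrain.cs code):
// Suggested hyperparameters from above, in the Dictionary<string, string> form
// the SageMaker .NET SDK generally expects for training-job hyperparameters.
var hyperParameters = new System.Collections.Generic.Dictionary<string, string>
{
    { "max_depth", "6" },
    { "eta", "0.05" },
    { "min_child_weight", "3" },
    { "subsample", "0.8" },
    { "silent", "0" },
    { "objective", "reg:linear" },
    { "num_round", "200" }
};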
As I was grouping time series, certain frequencies created gaps in the data.
I solved this issue by filling all NaNs.
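Just to illustrate the idea (the comment above doesn't say which tool was used, so this is only a hypothetical C# sketch): fill the NaN gaps before handing the series to training.
// Hypothetical example: fill NaN gaps left by resampling a time series with 0.0.
double[] series = { 1.2, double.NaN, 3.4, double.NaN, 5.6 };
for (int i = 0; i < series.Length; i++)
{
    if (double.IsNaN(series[i]))
    {
        series[i] = 0.0; // simple constant fill; interpolation or forward-fill are alternatives
    }
}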

Using test data set in RapidMiner

I'm trying to create a model with a training dataset and want to label the records in a test data set.
All the tutorials and help I find online only cover using cross-validation with one data set, i.e., the training dataset. I couldn't find how to use test data. I tried to apply the resulting model to the test set, but the test set seems to end up with a different number of attributes than the training set after pre-processing. This is a text classification problem.
At the end I get some output like this
18.03.2013 01:47:00 Results of ResultWriter 'Write as Text (2)' [1]:
18.03.2013 01:47:00 SimpleExampleSet:
5275 examples,
366 regular attributes,
special attributes = {
confidence_1 = #367: confidence(1) (real/single_value)
confidence_5 = #368: confidence(5) (real/single_value)
confidence_2 = #369: confidence(2) (real/single_value)
confidence_4 = #370: confidence(4) (real/single_value)
prediction = #366: prediction(label) (nominal/single_value)/values=[1, 5, 2, 4]
}
But what I wanted was for all my examples to be labelled.
It seems that my test data and training data have a different number of attributes; I see many of the following in the logs.
Mar 18, 2013 1:46:41 AM WARNING: Kernel Model: The given example set does not contain a regular attribute with name 'wireless'. This might cause problems for some models depending on this particular attribute.
But how do we solve such a problem in text classification, since we cannot know the number and names of the attributes beforehand?
Can someone please give me some pointers?
You probably use a Process Documents operator to preprocess both the training and the test set. Here it is important that both of these operators are set up identically. To "synchronize" the wordlist, i.e., to consider the same set of words in both of them, you have to connect the wordlist (wor) output of the Process Documents operator used for training to the corresponding input port of the Process Documents operator used for preprocessing the test set.
