Training a model with multiple features whose values are conceptually the same

For example, say I am trying to train a binary classifier that takes sample inputs of the form
x = {d=(type of desk), p1=(type of pen on desk), p2=(type of *another* pen on desk)}
Say I then train a model on the samples:
x1 = {wood, ballpoint, gel}, y1 = {0}
x2 = {wood, ballpoint, ink-well}, y2 = {1}.
and try to predict on the new sample: x3 = {wood, gel, ballpoint}. The response I am hoping for in this case is y3 = {0}, since conceptually it should not matter (i.e., I don't want it to matter) which pen is designated as p1 or p2.
When I run this model (in my case, an h2o.ai-generated model), I get an error that the categorical level for p2 is not valid, since the model never saw 'ballpoint' in p2's category during training (in h2o: hex.genmodel.easy.exception.PredictUnknownCategoricalLevelException).
My first idea was to generate permutations of the 'pens' features for each sample to train the model on. Is there a better way to handle this situation, specifically one available in the h2o.ai Flow UI, since that is what I am using to build the model? Thanks.
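For reference, generating those permutations as a preprocessing step (outside Flow) could look like the following; a minimal Python sketch, with the column names d, p1, p2, y taken from the example above:

import itertools
import pandas as pd

def augment_with_pen_permutations(df):
    # emit one training row per ordering of the order-free 'pen' columns
    rows = []
    for _, r in df.iterrows():
        for p1, p2 in itertools.permutations([r['p1'], r['p2']]):
            rows.append({'d': r['d'], 'p1': p1, 'p2': p2, 'y': r['y']})
    return pd.DataFrame(rows)

train = pd.DataFrame([
    {'d': 'wood', 'p1': 'ballpoint', 'p2': 'gel', 'y': 0},
    {'d': 'wood', 'p1': 'ballpoint', 'p2': 'ink-well', 'y': 1},
])
augmented = augment_with_pen_permutations(train)  # 4 rows: both pen orderings per sample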

H2O binary models (models running in the H2O cluster) handle unseen categorical levels automatically; however, when you generate predictions using the pure Java POJO model method (as in your case), this is a configurable option. In the EasyPredictModelWrapper, the default behavior is that unknown categorical levels throw a PredictUnknownCategoricalLevelException, which is why you are seeing that error.
There is more info about this in the EasyPredictModelWrapper Javadocs, which describe the API as follows:
The easy prediction API for generated POJO and MOJO models. Use as follows:
1. Instantiate an EasyPredictModelWrapper
2. Create a new row of data
3. Call one of the predict methods
Here is an example:
// Step 1: load the generated POJO class and wrap it.
String modelClassName = "your_pojo_model_downloaded_from_h2o";
GenModel rawModel = (GenModel) Class.forName(modelClassName).newInstance();
EasyPredictModelWrapper model = new EasyPredictModelWrapper(
    new EasyPredictModelWrapper.Config()
        .setModel(rawModel)
        .setConvertUnknownCategoricalLevelsToNa(true));  // map unseen levels to NA instead of throwing
// Step 2: create a new row of data.
RowData row = new RowData();
row.put("CategoricalColumnName", "LevelName");
row.put("NumericColumnName1", "42.0");          // numeric values may be passed as strings
row.put("NumericColumnName2", Double.valueOf(42.0));
// Step 3: call one of the predict methods.
BinomialModelPrediction p = model.predictBinomial(row);

Related

How to create a language model with 2 different heads in huggingface?

I know I can create a language model with 1 head:
from transformers import AutoModelForMultipleChoice
model = AutoModelForMultipleChoice.from_pretrained("distilbert-base-cased").to(device)
But how can I create the same base model structure (e.g., distilbert-base-cased) with 2 heads? Say one is AutoModelForMultipleChoice and the second is AutoModelForSequenceClassification. I need the only difference between the 2 models (1 head vs. 2 heads) to be the additional head (from a parameters perspective).
So now my input for the 2-heads model is something like [sequence_label, multiple_choice_labels].
In the general case, you will need to create a custom class derived from DistilBertPreTrainedModel. Inside __init__() you define your desired head architectures; then you write your own forward() function that computes a custom loss involving both heads and returns the result.
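A minimal sketch of such a class, assuming PyTorch and a recent version of transformers (the head names, the pooling choice, and the simple summed loss are all illustrative, not a prescribed design):

import torch.nn as nn
from transformers import DistilBertModel, DistilBertPreTrainedModel

class DistilBertWithTwoHeads(DistilBertPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.distilbert = DistilBertModel(config)
        # two task heads on top of the shared encoder (names are illustrative)
        self.seq_head = nn.Linear(config.dim, config.num_labels)
        self.choice_head = nn.Linear(config.dim, 1)
        self.post_init()  # initialize weights (recent transformers versions)

    def forward(self, input_ids, attention_mask,
                sequence_label=None, multiple_choice_label=None):
        # input_ids and attention_mask have shape (batch, num_choices, seq_len)
        num_choices = input_ids.size(1)
        flat_ids = input_ids.view(-1, input_ids.size(-1))
        flat_mask = attention_mask.view(-1, attention_mask.size(-1))
        cls = self.distilbert(flat_ids, attention_mask=flat_mask)[0][:, 0]  # [CLS] states
        choice_logits = self.choice_head(cls).view(-1, num_choices)
        # pool one vector per example (here: the first choice's encoding) for the sequence head
        seq_logits = self.seq_head(cls.view(-1, num_choices, cls.size(-1))[:, 0])
        loss = None
        if sequence_label is not None and multiple_choice_label is not None:
            ce = nn.CrossEntropyLoss()
            loss = ce(choice_logits, multiple_choice_label) + ce(seq_logits, sequence_label)
        return {"loss": loss, "seq_logits": seq_logits, "choice_logits": choice_logits}

Because the class subclasses DistilBertPreTrainedModel, DistilBertWithTwoHeads.from_pretrained("distilbert-base-cased") loads the pretrained encoder weights, leaving the two freshly initialized heads as the only parameter difference.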
But if you are talking specifically about DistilBertForMultipleChoice and DistilBertForSequenceClassification, there is a shortcut, as the head architectures happen to be identical (see the source) and the difference is only in the loss function. So you can try to train your model as a multi-label sequence classification problem, where the label per sequence is [sequence_label, multiple_choice_label_0, multiple_choice_label_1, ...]. For example, if you have an entry like {sequence, choice0, choice1, seq_label: True, correct_choice: 0},
your dataset will be
[ {'text':(sequence, choice0), 'label':(1 1 0)},
{'text':(sequence, choice1), 'label':(1 0 0)} ]
This way the result of the sequence classification will be in the first position, and to get the correct-choice probability you will need to apply a softmax over the rest of the logits.

Find the importance of each column to the model

I have an ML.NET project, and as of right now everything has gone great. I have a motor that collects a power reading 256 times around each rotation, and I push that into a model. Right now it determines the state of the motor nearly perfectly. The motor itself only has room for 38 values on it at a time, so I have been spending several rotations collecting the full 256 samples for my training data.
I would like to cut the sample size down to 38 so that I can determine the motor's state every rotation. If I just evenly space the samples down to 38, my model degrades a lot. I know I am not feeding the model the features it considers most important; I am just making a guess and randomly selecting data for the model.
Is there a way I can see the importance of each value in the array during the training process? I was thinking I could use IDataView for this, and I found the below statement about it (link):
Standard ML schema: The IDataView system does not define, nor prescribe, standard ML schema representation. For example, it does not dictate representation of nor distinction between different semantic interpretations of columns, such as label, feature, score, weight, etc. However, the column metadata support, together with conventions, may be used to represent such interpretations.
Does this mean I can print out such things as the weight of each column, and how would I do that?
I have actually only been working with ML.NET for a couple of weeks now, so I apologize if the question is naive; I assure you I have googled this as many ways as I can think of. Any advice would be appreciated. Thanks in advance.
EDIT:
Thank you for the answer; I was going down a completely useless path. I have been trying to get it to work following the example you linked to. I have 260 columns with numbers and one column with the condition as one of five text strings. This is the condition I am trying to predict.
The first time I tried it, it threw the error "expecting Single but got String". No problem: I used .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label")) to convert to key values, and then it threw the error "expected Single, got Key UInt32". Any ideas on how to push that into this function?
At any rate, thank you for the reply, but I guess my upvotes don't count yet, sorry. Hopefully I can upvote it later, or someone else here can. Below is the code example.
using System;
using System.Collections.Immutable;
using System.Linq;
using Microsoft.ML;

// Create MLContext
MLContext mlContext = new MLContext();

// Load data
IDataView data = mlContext.Data.LoadFromTextFile<ModelInput>(TRAIN_DATA_FILEPATH, separatorChar: ',', hasHeader: true);

// 1. Get the column names of the input features.
string[] featureColumnNames =
    data.Schema
        .Select(column => column.Name)
        .Where(columnName => columnName != "Label").ToArray();

// 2. Define estimator with data pre-processing steps
IEstimator<ITransformer> dataPrepEstimator =
    mlContext.Transforms.Concatenate("Features", featureColumnNames)
        .Append(mlContext.Transforms.NormalizeMinMax("Features"))
        .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label"));

// 3. Create transformer using the data pre-processing estimator
ITransformer dataPrepTransformer = dataPrepEstimator.Fit(data); // error here

// 4. Pre-process the training data
IDataView preprocessedTrainData = dataPrepTransformer.Transform(data);

// 5. Define Stochastic Dual Coordinate Ascent machine learning estimator
var sdcaEstimator = mlContext.Regression.Trainers.Sdca();

// 6. Train machine learning model
var sdcaModel = sdcaEstimator.Fit(preprocessedTrainData);

// 7. Compute Permutation Feature Importance (PFI)
ImmutableArray<RegressionMetricsStatistics> permutationFeatureImportance =
    mlContext
        .Regression
        .PermutationFeatureImportance(sdcaModel, preprocessedTrainData, permutationCount: 3);

// Order features by importance
var featureImportanceMetrics =
    permutationFeatureImportance
        .Select((metric, index) => new { index, metric.RSquared })
        .OrderByDescending(myFeatures => Math.Abs(myFeatures.RSquared.Mean));

Console.WriteLine("Feature\tPFI");
foreach (var feature in featureImportanceMetrics)
{
    Console.WriteLine($"{featureColumnNames[feature.index],-20}|\t{feature.RSquared.Mean:F6}");
}
I believe what you are looking for is called Permutation Feature Importance (PFI). It tells you which features are most important by changing each feature in isolation and then measuring how much that change affects the model's performance metrics. You can use this to see which features are the most important to the model.
Interpret model predictions using Permutation Feature Importance is the doc that describes how to use this API in ML.NET.
You can also use open-source packages that are much more sophisticated than what is found in ML.NET. I have an example on my GitHub of how to use R with advanced explainer packages to explain ML.NET models. You can get local-instance explanations as well as global model breakdowns, details, diagnostics, feature interactions, etc.
https://github.com/bartczernicki/BaseballHOFPredictionWithMlrAndDALEX

How to train one model for several devices

I have some tabular device data comprising a time column, some tabular features, and target classes. There are around 500 rows (not the same ones) in each device's data, and the target classes are the same across devices. I have the same kind of data for around 1000 devices, and I want to train one general model over all the devices for detecting the class.
Can someone help me with an approach to training for the target variable? What kinds of models work in this situation?
If your device type is part of the data, you can train a decision tree. If the device type feature is important for classification, it will be added to the tree. First, create the device type features yourself: a binary column for each device type, as in one-hot encoding (is_device_samsung, is_device_lg, is_device_iphone, and so forth). The number of columns created equals the number of device types; all but one of these columns will be 0, and the one indicating the current type will be 1. This does not guarantee that the device type will be part of the model, but it lets the learner decide that for you.
By the way, don't use get_dummies unless you know how to reuse it exactly as needed on the test data.
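A sketch of a reusable alternative, assuming scikit-learn (the device_type column name is illustrative): fit the encoder once on the training data, then apply the same fitted encoder to the test data.

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')  # unseen device types encode as all zeros
train_ohe = enc.fit_transform(train_df[['device_type']])  # fit on training data only
test_ohe = enc.transform(test_df[['device_type']])        # reuse the fitted encoder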
Another option is to use the python-weka-wrapper, which accepts nominal attributes. Example:
import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.classifiers import Classifier

def get_weka_prob(inst):
    # probability the classifier assigns to the 'DONE' class for one instance
    dist = c.distribution_for_instance(inst)
    p = dist[next((i for i, x in enumerate(inst.class_attribute.values) if x == 'DONE'), -1)]
    return p

jvm.start()
loader = Loader(classname="weka.core.converters.CSVLoader")
data = loader.load_file(r'.\recs_csv\df.csv')
data.class_is_last()      # the last CSV column is the class attribute
datatst = loader.load_file(r'.\recs_csv\dftst.csv')
datatst.class_is_last()
c = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.1"])
c.build_classifier(data)
print(c)
probstst = [get_weka_prob(inst) for inst in datatst]
jvm.stop()
Weka models are a different kind of model: they use a Java bridge to Python, and the methods you call are Java methods invoked through this bridge. To use the dataframe in sklearn instead, you would have to one-hot encode it yourself. Note that nominal attribute values in weka cannot contain any special characters, so use
df = df.replace([',', '"', "'", "%", ";"], '', regex=True)
on any nominal attribute before saving it to CSV.
If you want to ensure that the model_type feature is included in your model, you can trick it: add a dummy model type, and ensure that the class column for this dummy model is always "1" or "True", depending on your class variable. If you have enough rows with this dummy model, J48 will use it as the first branch. Once the attribute is selected by J48, it will be branched for all of the model types, not just the dummy one.
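A small pandas sketch of that trick (the model_type and target column names are hypothetical):

import pandas as pd

# duplicate some rows under a dummy device type whose class is always True
dummy = df.sample(n=100, replace=True).copy()
dummy['model_type'] = 'dummy_device'
dummy['target'] = True
df = pd.concat([df, dummy], ignore_index=True)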

How to create training and test DataSetIterators in deeplearning4j?

I am building a recurrent neural network with deeplearning4j and I need to create the training and test data sets.
All the examples provided in the documentation and the example code, use a CSVSequenceRecordReader to read CSV files.
Then a DataSetIterator is created with the SequenceRecordReaderDataSetIterator constructor and fed into the MultiLayerNetwork.fit() or MultiLayerNetwork.evaluate() method (depending on whether it is a training or a test data set iterator).
However, in my case, the data set I have is not stored in a CSV file. I access it online through a third-party library and pre-process it to obtain a List<Data> and a List<Labels>.
How can I:
1) create the DataSetIterator from my two lists?
2) split the DataSetIterator in a training set and a test set?
Edit:
I think my question is too broad. Let me try to narrow it down.
I have started to read this article which uses a very simple approach to create a data set:
It creates two INDArrays and builds a DataSet from them using the DataSet(INDArray first, INDArray second) constructor.
Training works using network.fit(dataSet);, but I can't evaluate the model while training, as the evaluate method requires a data set iterator, not a data set.
Moreover, from what I understand, using this approach also means that there is only one huge data set, no mini batches.
I also guess that I could create mini-batches from this big data set by using the batchBy(int num) method. But this method returns a list of data sets, not a data set iterator... iterateWithMiniBatches() does return a data set iterator, but when I looked at the source file, it returns null and is deprecated. Then I tried to see if there is an implementation of DataSetIterator I could use, but there are a lot of them. I tried BaseDataSetIterator, but it does not take a DataSet as a constructor parameter, only a DataSetFetcher... yet another layer.
Is there an example somewhere that shows how to create a data set without using the default record readers? Or should I just create my own implementation of a record reader?
1)
MultiLayerNetwork.evaluate() accepts a ListDataSetIterator as a parameter.
If you have a List<Data> object, you can first map it into a double[] featureVector and a double[] labelVector, and then create a ListDataSetIterator like this:
// build feature and label matrices in row-major ('c') order
INDArray x = Nd4j.create(featureVector, new int[]{featureVector.length / numberOfFeatures, numberOfFeatures}, 'c');
INDArray y = Nd4j.create(labelVector, new int[]{labelVector.length / numberOfLabels, numberOfLabels}, 'c');
final DataSet allData = new DataSet(x, y);
final List<DataSet> list = allData.asList();  // one single-example DataSet per row
ListDataSetIterator iterator = new ListDataSetIterator(list);
For 2), you should just create two separate iterators, one for training and one for testing, for example by splitting the List<DataSet> above into two parts and wrapping each in its own ListDataSetIterator.
You can then evaluate your net with net.evaluate(testIterator);

Adding static data( not changing over time) to sequence data in LSTM

I am trying to build a model like the one in the figure below. I want to pass sequence data through an LSTM layer and static data (blood group, gender) through a separate feed-forward neural network layer, and later merge them. However, I am confused about the dimensions here.
If my understanding is right (as depicted in the image), how can the 5-dimensional sequence data be merged with the 4-dimensional static data?
Also, how does this structure differ from an attention mechanism? (I found in the Keras documentation that an attention mechanism is a way to combine static data with sequence data.)
Basically, I want to combine the static data with the sequence data. Any other suggestion is appreciated.
I am not sure I fully understood what you are asking, but I will try. Example in Keras:
from tensorflow.keras.layers import Input, LSTM, Flatten, Dense, concatenate
from tensorflow.keras.models import Model

static_input = Input(shape=(n_static_features,))      # e.g. blood group, gender
dynamic_input = Input(shape=(timesteps, n_features))  # the sequence data
static_out = static_input
x = LSTM(n_cell_lstm, return_sequences=True)(dynamic_input)
x = Flatten()(x)
dynamic_out = x
z = concatenate([dynamic_out, static_out])
z = Dense(64, activation='relu')(z)
main_output = Dense(classes, activation='softmax', name='main_output')(z)
model = Model(inputs=[dynamic_input, static_input], outputs=main_output)
Practically, you are using an LSTM architecture as you would if you were using only the dynamic data, but at the end you add the information coming from the static data. Hope this helps.
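To train the sketch above, compile the model and pass both inputs in the same order as in Model(inputs=...); the array names here are placeholders:

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit([X_dynamic, X_static], y_onehot, epochs=10, batch_size=32)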
