Updating the data set when classifying new nominal instances - machine-learning

I'm using J48 to classify instances composed of both numeric and nominal values.
My problem is that I don't know which nominal values I'll come across while my program runs.
Therefore I need to update the model's nominal-attribute data "on the fly".
For instance, say I have only two attributes, occupation and age, and the run goes as follows:
OccupationAttribute = {}.
input: [Piano teacher, 22].
OccupationAttribute = {Piano teacher}.
input: [school teacher, 30]
OccupationAttribute = {Piano teacher, school teacher}.
input: [Piano teacher, 40]
OccupationAttribute = {Piano teacher, school teacher}.
etc.
Now I've tried to do this manually by copying the previous attributes, adding the new attribute and then updating the model's data.
That works fine when training the model.
But!
when I want to classify a new instance, say [SW engineer, 52], OccupationAttribute is updated:
OccupationAttribute = {Piano teacher, school teacher, SW engineer}, but the tree itself has never "met" "SW engineer" before, so the classification cannot be completed and an exception is thrown.
Can you suggest how to handle the above situation?
Does Weka have any mechanism supporting the above scenario?
Thanks!

When training, add a placeholder value such as __other__ to your nominal attributes.
Before trying to classify an instance, first check whether the value of the nominal attribute has been seen before; if it has not, use the placeholder value:
Attribute attribute = instances.attribute("OccupationAttribute");
String s = "SW engineer";
int index = attribute.indexOfValue(s);
if (index == -1) {
    // Unseen value: fall back to the placeholder added at training time
    index = attribute.indexOfValue("__other__");
}
When you have enough data, train again with the new values.
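The same fall-back logic, sketched in plain Python (the attribute values are taken from the example above; `known_values` is a hypothetical stand-in for the trained attribute's value list):

```python
# Nominal values present at training time, plus the placeholder
known_values = ["Piano teacher", "school teacher", "__other__"]

def value_index(value):
    """Map a nominal value to its attribute index, falling back to
    the __other__ placeholder for values never seen in training."""
    if value in known_values:
        return known_values.index(value)
    return known_values.index("__other__")

print(value_index("Piano teacher"))  # 0
print(value_index("SW engineer"))    # 2
```

This way the tree only ever sees indices it was trained on, so classification no longer throws for unseen occupations.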

Related

Why is my CNN returning tokens instead of readable labels?

I am currently studying machine learning and have created a CNN using fastai that labels the category of clothing items. I built this model using the Fashion-MNIST data set.
Everything functions fine and it looks like it's predicting correctly, but I don't know how to make it return the labels and categories rather than the weird tokenized text it is returning. Where am I going wrong?
Here is some code
This is where I create the dataframe that has the category mapped to the image path.
from fastcore.all import *
ds = dataFrame.filter(['masterCategory', 'imagePath'], axis=1)
ds
masterCategory imagePath
0 Apparel ../input/fashion-product-images-small/images/1...
1 Apparel ../input/fashion-product-images-small/images/3...
2 Accessories ../input/fashion-product-images-small/images/5...
3 Apparel ../input/fashion-product-images-small/images/2...
4 Apparel ../input/fashion-product-images-small/images/5...
... ... ...
44419 Footwear ../input/fashion-product-images-small/images/1...
44420 Footwear ../input/fashion-product-images-small/images/6...
44421 Apparel ../input/fashion-product-images-small/images/1...
44422 Personal Care ../input/fashion-product-images-small/images/4...
44423 Accessories ../input/fashion-product-images-small/images/5...
44424 rows × 2 columns
Then I create a datablock
def getImages(d): return d['imagePath']
def getLabel(d): return d['masterCategory']
from fastai.vision.all import *
dblock = DataBlock(
    blocks=(ImageBlock, MultiCategoryBlock),
    get_x=getImages,
    splitter=RandomSplitter(valid_pct=0.2, seed=42),
    get_y=getLabel,
    item_tfms=[Resize(192, method='squish')]
)
Then I create the dataloaders, and when I show a batch I get these weird labels instead of the master categories.
dsets = dblock.dataloaders(ds, bs=32)
dsets.show_batch(max_n=20)
thank you.
I found the issue. The block I needed is not MultiCategoryBlock; it is CategoryBlock. I thought that since there were multiple categories to pick from, MultiCategoryBlock was what was needed, but no: MultiCategoryBlock is used to label one image with multiple categories at once, not to pick one label from multiple categories.
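The difference is easy to see in how the two blocks encode their targets. A minimal sketch in plain Python (no fastai needed; `vocab` is a made-up stand-in for the DataLoaders vocabulary):

```python
# Hypothetical vocabulary, standing in for dls.vocab
vocab = ["Accessories", "Apparel", "Footwear", "Personal Care"]

def encode_single(label):
    """CategoryBlock-style target: one integer class index per image."""
    return vocab.index(label)

def encode_multi(labels):
    """MultiCategoryBlock-style target: a multi-hot vector, one slot per class."""
    return [1.0 if v in labels else 0.0 for v in vocab]

print(encode_single("Apparel"))   # 1
print(encode_multi(["Apparel"]))  # [0.0, 1.0, 0.0, 0.0]
```

With MultiCategoryBlock, show_batch renders the multi-hot targets, which is the "weird tokenized text" in the question; with CategoryBlock each target is a single readable label.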

Bart Large MNLI - Get predicted label in a single column

I'm trying to classify the sentences of a specific column into three labels with Bart Large MNLI. The problem is that the output of the model is "sentence + the three labels + the scores for each label". Output example:
{'sequence': 'Growing special event set production/fabrication company
is seeking a full-time accountant with experience in entertainment
accounting. This position is located in a fast-paced production office
located near downtown Los Angeles.Responsibilities:• Payroll
management for 12+ employees, including processing new employee
paperwork.', 'labels': ['senior', 'middle', 'junior'], 'scores':
[0.5461998581886292, 0.327671617269516, 0.12612852454185486]}
What I need is to get a single column with only the label with the highest score, in this case "senior".
Any feedback which can help me to do it? Right now my code looks like:
df_test = df.sample(frac = 0.0025)
classifier = pipeline("zero-shot-classification",
model="facebook/bart-large-mnli")
sequence_to_classify = df_test["full_description"]
candidate_labels = ['senior', 'middle', 'junior']
df_test["seniority_label"] = df_test.apply(lambda x: classifier(x.full_description, candidate_labels, multi_label=True,), axis=1)
df_test.to_csv("Seniority_Classified_SampleTest.csv")
(Using a Sample of the df for testing code).
And the code I've followed comes from this page, where they do receive a column of labels as output (I don't know how): https://practicaldatascience.co.uk/machine-learning/how-to-classify-customer-service-emails-with-bart-mnli
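For reference, the zero-shot pipeline returns its labels sorted by descending score, so the winning label can be pulled straight out of the result dict. A minimal sketch using the example output from the question (plain dict manipulation, no model call):

```python
# Example output of the zero-shot pipeline (taken from the question)
result = {
    'sequence': 'Growing special event set production/fabrication company ...',
    'labels': ['senior', 'middle', 'junior'],
    'scores': [0.5461998581886292, 0.327671617269516, 0.12612852454185486],
}

def top_label(result):
    """Return the single label with the highest score.

    The pipeline sorts labels by descending score, so result['labels'][0]
    would also work; the max() form is robust even if it didn't.
    """
    return max(zip(result['scores'], result['labels']))[1]

print(top_label(result))  # senior
```

In the question's code this would become something like `df_test["seniority_label"] = df_test.apply(lambda x: top_label(classifier(x.full_description, candidate_labels)), axis=1)` (column names taken from the question). Note also that for mutually exclusive labels like these, `multi_label=False` is usually the more appropriate setting.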

Forecasting.ForecastBySsa with Multiple variables as input

I've got this code to predict a time series. I want to have a prediction based upon a time series of prices and a correlated indicator.
So together with the value to forecast I want to pass a side value, but I cannot tell whether it is taken into account, because the prediction doesn't change with or without it. How do I tell the algorithm to consider these extra inputs?
public static TimeSeriesForecast PerformTimeSeriesProductForecasting(List<TimeSeriesData> listToForecast)
{
    var mlContext = new MLContext(seed: 1); // Fixed seed so the environment is deterministic
    var productModelPath = $"product_month_timeSeriesSSA.zip";
    if (File.Exists(productModelPath))
    {
        File.Delete(productModelPath);
    }

    IDataView productDataView = mlContext.Data.LoadFromEnumerable<TimeSeriesData>(listToForecast);
    var singleProductDataSeries = mlContext.Data.CreateEnumerable<TimeSeriesData>(productDataView, false).OrderBy(p => p.Date);
    TimeSeriesData lastMonthProductData = singleProductDataSeries.Last();
    const int numSeriesDataPoints = 2500; // Total number of data points in the underlying series

    // Create and add the forecast estimator to the pipeline.
    IEstimator<ITransformer> forecastEstimator = mlContext.Forecasting.ForecastBySsa(
        outputColumnName: nameof(TimeSeriesForecast.NextClose),
        inputColumnName: nameof(TimeSeriesData.Close), // This is the column being forecasted.
        windowSize: 22, // Length of the window used to decompose the series.
        seriesLength: numSeriesDataPoints, // Number of data points used when performing a forecast.
        trainSize: numSeriesDataPoints, // Total number of data points in the input series, starting from the beginning.
        horizon: 5, // Number of future values to forecast.
        confidenceLevel: 0.98f, // Likelihood that the real observed value falls within the interval bounds.
        confidenceLowerBoundColumn: nameof(TimeSeriesForecast.ConfidenceLowerBound), // Column storing the lower interval bound for each forecasted value.
        confidenceUpperBoundColumn: nameof(TimeSeriesForecast.ConfidenceUpperBound)); // Column storing the upper interval bound for each forecasted value.

    // Fit the forecasting model to the specified product's data series.
    ITransformer forecastTransformer = forecastEstimator.Fit(productDataView);

    // Create the forecast engine used for creating predictions.
    TimeSeriesPredictionEngine<TimeSeriesData, TimeSeriesForecast> forecastEngine = forecastTransformer.CreateTimeSeriesEngine<TimeSeriesData, TimeSeriesForecast>(mlContext);

    // Save the forecasting model so that it can be loaded within an end-user app.
    forecastEngine.CheckPoint(mlContext, productModelPath);

    ITransformer forecaster;
    using (var file = File.OpenRead(productModelPath))
    {
        forecaster = mlContext.Model.Load(file, out DataViewSchema schema);
    }

    // We must create a new prediction engine from the persisted model.
    TimeSeriesPredictionEngine<TimeSeriesData, TimeSeriesForecast> forecastEngine2 = forecaster.CreateTimeSeriesEngine<TimeSeriesData, TimeSeriesForecast>(mlContext);

    // Get the prediction; it contains the next `horizon` (here 5) forecasted values.
    var prediction = forecastEngine2.Predict();
    return prediction;
}
TimeSeriesData has multiple attributes, not only the value of the series that I want to forecast. I just wonder whether they are taken into account when forecasting or not.
Is there a better method for forecasting this type of series, such as an LSTM? Is such a method available in ML.NET?
There is an open enhancement request for multivariate time-series forecasting in ML.NET.
See ticket: github.com/dotnet/machinelearning/issues/5638

ValueError: The name "Sequential" is used 4 times in the model. All the layer names should be unique?

Suppose I have four models: M1 (client 1), M2 (client 2), M3 (client 3), and M4 (client 4). Each model has the same structure.
Model Structure
After training each client model, I aggregated the models together and created a new model, let's call it "EnsModel". After that, I used this ensemble model to retrain on new data for each client again. However, when I tried to ensemble the updated models again, I got this error:
"ValueError: The name "Sequential" is used 4 times in the model. All the layer names should be unique."
Can anybody help me out? I also have one more question: is there any way I can modify the ensemble model's structure per client?
Thank you.
Try naming each model and then merging them as shown below.
M1.name = 'Client1'
M2.name = 'Client2'
M3.name = 'Client3'
M4.name = 'Client4'
commonInput = Input((x, x, y))  # x, y: your input dimensions
outM1 = M1(commonInput)
outM2 = M2(commonInput)
outM3 = M3(commonInput)
outM4 = M4(commonInput)
mergedM1M2 = keras.layers.Add()([outM1, outM2])
mergedM3M4 = keras.layers.Add()([outM3, outM4])
FinalMerged = keras.layers.Add()([mergedM1M2, mergedM3M4])
FinalModel = Model(commonInput, FinalMerged)

How to predict an item's category given its name?

Currently I have a database consisting of about 600,000 records representing merchandise with category information, which looks like below:
{'title': 'Canon camera', 'category': 'Camera'},
{'title': 'Panasonic refrigerator', 'category': 'Refrigerator'},
{'title': 'Logo', 'category': 'Toys'},
....
But there are items without category information:
{'title': 'Iphone6', 'category': ''},
So I'm wondering whether it is possible to train a text classifier on my items' names using scikit-learn to help me predict which category an item should be in. I'm framing this as a multi-class text classification problem, but there are also one to many pictures for each item, so maybe deep learning / Keras could also be used?
I don't know what the best way to solve this problem is, so any suggestion or advice is welcome. Thank you for reading.
P.S. The actual text is in Japanese.
You could build a 2-char / 3-char n-gram model and calculate counts, e.g. how often the 3-gram "pho" appears in the category "Camera".
trigrams = {}
for record in records:  # only the ones with categories
    title = record['title']
    cat = record['category']
    for trigram in zip(title, title[1:], title[2:]):
        if trigram not in trigrams:
            trigrams[trigram] = {}
            for category in categories:
                trigrams[trigram][category] = 0
        trigrams[trigram][cat] += 1
Now you can use a title's trigrams to calculate a score:
scores = []
for trigram in zip(title, title[1:], title[2:]):
    if trigram not in trigrams:
        continue  # skip trigrams never seen during counting
    score = []
    for cat in categories:
        score.append(trigrams[trigram][cat])
    # Normalize so the scores for this trigram sum to 1
    sum_ = float(sum(score))
    score = [s / sum_ for s in score]
    scores.append(score)
Now scores contains a probability distribution P(class | trigram) for every trigram. It does not take into account that some classes are just more common (the prior; see Bayes' theorem). I'm also not quite sure whether you should do something about the fact that some titles might just be really long and thus have a lot of trigrams; I guess taking the prior into account handles that already.
If it turns out that you have many trigrams missing, you could switch to bigrams, or simply apply Laplace smoothing.
Edit: I've just seen that the text is in Japanese. I think the n-gram approach might be useless there. You could translate the names. However, it is probably easier to just take other sources for this information (e.g. Wikipedia / Amazon / eBay?).
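A quick end-to-end run of the counting scheme above on a toy dataset (the records and the simple sum-of-counts prediction rule are made up for illustration; real titles would be Japanese):

```python
records = [
    {'title': 'Canon camera', 'category': 'Camera'},
    {'title': 'Nikon camera', 'category': 'Camera'},
    {'title': 'Panasonic refrigerator', 'category': 'Refrigerator'},
]
categories = ['Camera', 'Refrigerator']

# Count, per trigram, how often it occurs under each category
trigrams = {}
for record in records:
    title = record['title']
    cat = record['category']
    for trigram in zip(title, title[1:], title[2:]):
        if trigram not in trigrams:
            trigrams[trigram] = {c: 0 for c in categories}
        trigrams[trigram][cat] += 1

def predict(title):
    """Sum the per-category counts of the title's known trigrams
    and return the category with the highest total."""
    totals = {c: 0 for c in categories}
    for trigram in zip(title, title[1:], title[2:]):
        for c, n in trigrams.get(trigram, {}).items():
            totals[c] += n
    return max(totals, key=totals.get)

print(predict('Sony camera'))  # Camera
```

An unseen title like 'Sony camera' still shares trigrams such as ('c', 'a', 'm') with the training titles, which is what lets the model generalize across brand names.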
