How to train one model for several devices - machine-learning

I have tabular device data consisting of a time column, several feature columns, and a target class.
Each device has around 500 rows (the exact count varies), and the target classes are the same across devices.
I have this kind of data for around 1000 devices.
I want to train one general model for all devices that predicts the class.
Can someone suggest an approach for training on this target variable? What kinds of models work in this situation?

If the device type is part of the data, you can train a decision tree. If the device type is useful for classification, the tree will split on it. First, create the device type features yourself: one binary column per device type, as in one-hot encoding (is_device_samsung, is_device_lg, is_device_iphone and so forth). The number of columns equals the number of device types; in each row, the column for that row's device type is 1 and all the others are 0. This does not guarantee that the device type ends up in the model, but it lets the learning algorithm decide.
BTW - don't use get_dummies unless you know how to reproduce exactly the same columns on the test data.
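One way to keep the columns identical between training and test data is to fix the list of device types once, from the training data, and build the binary columns from that list. The following is a minimal sketch in plain Python; the helper names fit_device_types and one_hot_device are made up for illustration and are not part of pandas or sklearn:

```python
def fit_device_types(train_devices):
    """Collect the device types seen in training, in a fixed (sorted) order."""
    return sorted(set(train_devices))


def one_hot_device(device, device_types):
    """Build the is_device_* columns for one row, using the training-time list.

    A device type that was not seen in training gets all zeros instead of
    a new column, so train and test rows always have the same columns.
    """
    return {f"is_device_{t}": int(device == t) for t in device_types}


# Hypothetical example values
device_types = fit_device_types(["samsung", "lg", "samsung", "iphone"])
train_row = one_hot_device("lg", device_types)
test_row = one_hot_device("nokia", device_types)  # unseen at training time
```

The list returned by fit_device_types has to be stored together with the model and reused, unchanged, whenever new data is encoded.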
Another option is to use the python-weka-wrapper package, which accepts nominal attributes directly.
Example:
import weka.core.jvm as jvm
from weka.core.converters import Loader
from weka.classifiers import Classifier
def get_weka_prob(inst):
    # Probability assigned to the class value 'DONE' for one instance.
    # Uses the module-level classifier c that is built further down.
    dist = c.distribution_for_instance(inst)
    # If 'DONE' is not among the class values, index -1 (the last class) is used.
    p = dist[next((i for i, x in enumerate(inst.class_attribute.values) if x == 'DONE'), -1)]
    return p
jvm.start()
loader = Loader(classname="weka.core.converters.CSVLoader")
data = loader.load_file(r'.\recs_csv\df.csv')
data.class_is_last()
datatst = loader.load_file(r'.\recs_csv\dftst.csv')
datatst.class_is_last()
# classname must be passed by keyword; the first positional parameter is a Java object.
c = Classifier(classname="weka.classifiers.trees.J48", options=["-C", "0.1"])
c.build_classifier(data)
print(c)
probstst = [get_weka_prob(inst) for inst in datatst]
jvm.stop()
Weka models run in Java and are reached from Python through a Java bridge, so the methods you call are Java methods exposed by that bridge. To use the same dataframe in sklearn, you would have to one-hot encode it yourself. Note that nominal attribute values in Weka cannot contain special characters, so use
df = df.replace([',', '"', "'", "%", ";"], '', regex=True)
on any nominal attribute before saving it to CSV.
If you want to make sure that the model_type feature is included in your model, you can trick the learner: add a dummy model type whose rows always have the class value "1" or "True" (depending on your class variable). With enough rows for this dummy type, J48 will split on the attribute first. Once J48 selects the attribute, it branches on all of the model types, not just the dummy one.

Related

Categorical features encoding in H2O

I train GBM models with H2O and want to use them in my backend (not Java). To do so, I download the MOJOs, convert them to ONNX, and run them in my apps.
To run inference, I need to know how categorical columns are transformed into their one-hot encoded versions. I was able to find this in the POJO:
static final void fill(String[] sa) {
sa[0] = "Age";
sa[1] = "Fare";
sa[2] = "Pclass.1";
sa[3] = "Pclass.2";
sa[4] = "Pclass.3";
sa[5] = "Pclass.missing(NA)";
sa[6] = "Sex.female";
sa[7] = "Sex.male";
sa[8] = "Sex.missing(NA)";
}
So here is the workflow for a non-Java backend as I see it:
1. Encode categorical features with OneHotExplicit.
2. Train the GBM model.
3. Download the MOJO and convert it to ONNX.
4. Download the POJO and find the feature alignment in the source code.
5. Implement the inference in your backend.
Is this the most straightforward and correct way?
Thank you for your question.
Can you access the stored categorical values here?
https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/algos/tree/SharedTreeMojoModel.java#L72
https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/algos/tree/SharedTreeMojoReader.java#L34
https://github.com/h2oai/h2o-3/blob/master/h2o-algos/src/main/java/hex/tree/SharedTreeMojoWriter.java#L61
The index of a level in the array is its encoded categorical value.
The EasyPredictModelWrapper does it this way:
https://github.com/h2oai/h2o-3/blob/master/h2o-genmodel/src/main/java/hex/genmodel/easy/RowToRawDataConverter.java#L44
Can you access the model.ini inside the zip? It has a [domains] section that lists files in the domains/ directory, one per categorical feature, and each file holds the categorical encoding for that feature.
e.g.:
[columns]
AGE
RACE
DPROS
DCAPS
PSA
VOL
GLEASON
CAPSULE
[domains]
7: 2 d000.txt
means that the column at index 7 (CAPSULE, counting from 0) has 2 categorical levels, stored in d000.txt.
Alternatively, there is an experimental/modelDetails.json file that lists the categorical values under output.domains. The index in that list corresponds to the feature at the same index in the output.names list,
e.g. output.domains[7] holds the domain for the feature output.names[7].
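For a non-Java backend, the model.ini layout described above could be read with a few lines of Python. This is only a sketch: it assumes the [columns] and [domains] sections look like the excerpt shown here, that the domain files sit under domains/ in the archive, and that each domain file contains one level per line. The function names parse_model_ini and read_mojo_domains are made up for this example.

```python
import zipfile


def parse_model_ini(text):
    """Return (columns, domain_specs) parsed from the text of model.ini."""
    columns, specs, section = [], {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
        elif section == "columns":
            columns.append(line)
        elif section == "domains":
            # e.g. "7: 2 d000.txt" -> column index 7, 2 levels, file d000.txt
            idx, rest = line.split(":", 1)
            n_levels, filename = rest.split()
            specs[int(idx)] = (int(n_levels), filename)
    return columns, specs


def read_mojo_domains(mojo):
    """Map each categorical column name to its ordered list of levels.

    mojo is a path to the MOJO zip file or a file-like object.
    Assumption: one level per line in each domains/<file>.
    """
    with zipfile.ZipFile(mojo) as zf:
        columns, specs = parse_model_ini(zf.read("model.ini").decode("utf-8"))
        domains = {}
        for idx, (n_levels, filename) in specs.items():
            levels = zf.read("domains/" + filename).decode("utf-8").splitlines()
            domains[columns[idx]] = levels[:n_levels]
    return domains
```

With the result in hand, domains["CAPSULE"].index(value) would give the position of a level in its domain, which is the encoded value mentioned above.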

Can you search for related database tables/fields using text similarity?

I am doing a college project where I need to compare a string with a list of other strings. I want to know whether there is a library that can do this.
Suppose I have a table called: DOCTORS_DETAILS
Other table names are: HOSPITAL_DEPARTMENTS, DOCTOR_APPOINTMENTS, PATIENT_DETAILS, PAYMENTS etc.
Now I want to calculate which of those are more relevant to DOCTOR_DETAILS.
The expected output could be:
DOCTOR_APPOINTMENTS - more relevant, because the term DOCTOR appears in both strings
PATIENT_DETAILS - the term DETAILS appears in both strings
HOSPITAL_DEPARTMENTS - least relevant
PAYMENTS - least relevant
So I want to compute relevance based on the number of terms that the two strings have in common.
Ex : DOCTOR_DETAILS -> DOCTOR_APPOITMENT(1/2) > DOCTOR_ADDRESS_INFORMATION(1/3) > DOCTOR_SPECILIZATION_DEGREE_INFORMATION (1/4) > PATIENT_INFO (0/2)
Semantic similarity is a common NLP problem. There are multiple approaches to look into, but at their core they all boil down to:
1. Turn each piece of text into a vector
2. Measure the distance between vectors, and call closer vectors more similar
Three possible ways to do step 1 are:
- tf-idf
- fasttext
- bert-as-service
To do step 2, you almost certainly want to use cosine similarity. It is pretty straightforward in Python; here is an implementation from a blog post:
import numpy as np
def cos_sim(a, b):
    """Takes 2 vectors a, b and returns the cosine similarity according
    to the definition of the dot product
    """
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)
For your particular use case, my instincts say to use fasttext. The official site shows how to download some pretrained word vectors, but you will want to download a full pretrained model instead (see this GH issue; use https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.bin.zip).
Then you'd want to do something like:
import fasttext
model = fasttext.load_model("model_filename.bin")
def order_tables_by_name_similarity(main_table, candidate_tables):
    '''Note: we use a fasttext model, not just pretrained vectors, so we get subword information
    you can modify this to also output the distances if you need them
    '''
    main_v = model[main_table]
    similarity_to_main = lambda w: cos_sim(main_v, model[w])
    return sorted(candidate_tables, key=similarity_to_main, reverse=True)
order_tables_by_name_similarity("DOCTORS_DETAILS", ["HOSPITAL_DEPARTMENTS", "DOCTOR_APPOINTMENTS", "PATIENT_DETAILS", "PAYMENTS"])
# outputs: ['PATIENT_DETAILS', 'DOCTOR_APPOINTMENTS', 'HOSPITAL_DEPARTMENTS', 'PAYMENTS']
If you need to put this in production, the giant model size (6.7GB) might be an issue. At that point, you'd want to build your own model, and constrain the model size. You can probably get roughly the same accuracy out of a 6MB model!
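If the plain term-overlap score from the question is enough, no model is needed at all. The sketch below splits table names on underscores and scores each candidate as (shared terms) / (terms in the candidate), which reproduces the 1/2, 1/3, 1/4 and 0/2 values from the question's example. The function names are made up for illustration, and the matching is exact, so DOCTORS and DOCTOR would not count as the same term.

```python
def name_tokens(table_name):
    """Split a table name like DOCTOR_DETAILS into a set of upper-case terms."""
    return set(table_name.upper().split("_"))


def overlap_score(main_table, candidate):
    """Fraction of the candidate's terms that also occur in the main table name."""
    candidate_terms = name_tokens(candidate)
    shared = name_tokens(main_table) & candidate_terms
    return len(shared) / len(candidate_terms)


def rank_by_overlap(main_table, candidates):
    """Order candidates from most to least overlapping with main_table."""
    return sorted(candidates, key=lambda c: overlap_score(main_table, c), reverse=True)


ranked = rank_by_overlap(
    "DOCTOR_DETAILS",
    ["PATIENT_INFO", "DOCTOR_ADDRESS_INFORMATION", "DOCTOR_APPOITMENT"],
)
```

Unlike the fasttext approach, this ignores meaning entirely, so it only helps when related tables actually share name parts.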

Training model with multiple features whose values are conceptually the same

For example, say I am trying to train a binary classifier that takes sample inputs of the form
x = {d=(type of desk), p1=(type of pen on desk), p2=(type of *another* pen on desk)}
Say I then train a model on the samples:
x1 = {wood, ballpoint, gel}, y1 = {0}
x2 = {wood, ballpoint, ink-well}, y2 = {1}.
and try to predict on the new sample: x3 = {wood, gel, ballpoint}. The response that I am hoping for in this case is y3 = {0}, since conceptually it should not matter (i.e. I don't want it to matter) which pen is designated as p1 or p2.
When I run this model (in my case, an h2o.ai generated model), I get an error saying that the categorical level for p2 is not valid, since the model never saw 'ballpoint' in p2 during training (in h2o: hex.genmodel.easy.exception.PredictUnknownCategoricalLevelException).
My first idea was to generate permutations of the 'pens' features for each training sample. Is there a better way to handle this situation, specifically in the h2o.ai Flow UI, since that is what I am using to build the model? Thanks.
H2O binary models (models running in the H2O cluster) handle unseen categorical levels automatically. However, when you generate predictions with the pure Java POJO model (as in your case), this is a configurable option. In the EasyPredictModelWrapper, the default behavior is that unknown categorical levels throw PredictUnknownCategoricalLevelException, which is why you are seeing that error.
There is more info about this in the EasyPredictModelWrapper Javadocs.
Here is the relevant excerpt:
The easy prediction API for generated POJO and MOJO models. Use as follows:
1. Instantiate an EasyPredictModelWrapper
2. Create a new row of data
3. Call one of the predict methods
Here is an example:
// Step 1.
String modelClassName = "your_pojo_model_downloaded_from_h2o";
GenModel rawModel;
rawModel = (GenModel) Class.forName(modelClassName).newInstance();
EasyPredictModelWrapper model = new EasyPredictModelWrapper(
    new EasyPredictModelWrapper.Config()
        .setModel(rawModel)
        // Treat levels not seen in training as NA instead of throwing.
        .setConvertUnknownCategoricalLevelsToNa(true));
// Step 2.
RowData row = new RowData();
row.put("CategoricalColumnName", "LevelName");
row.put("NumericColumnName1", "42.0");
row.put("NumericColumnName2", Double.valueOf(42.0));
// Step 3.
BinomialModelPrediction p = model.predictBinomial(row);
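The option above only deals with unknown levels. For the order-independence part of the question, one alternative to generating permutations is to put the interchangeable pen columns into a canonical (sorted) order before both training and prediction. This is a sketch of that idea, not an H2O feature; canonicalize_pens and the key names d, p1, p2 are hypothetical and simply follow the question's notation.

```python
def canonicalize_pens(row, pen_keys=("p1", "p2")):
    """Return a copy of row with the pen values sorted across pen_keys.

    After this step {p1: gel, p2: ballpoint} and {p1: ballpoint, p2: gel}
    become the same sample, so the model cannot depend on the pen order.
    """
    values = sorted(row[k] for k in pen_keys)
    out = dict(row)
    out.update(zip(pen_keys, values))
    return out


x1 = canonicalize_pens({"d": "wood", "p1": "ballpoint", "p2": "gel"})
x3 = canonicalize_pens({"d": "wood", "p1": "gel", "p2": "ballpoint"})
```

The same function has to be applied to the training data and to every row sent for prediction. A level can still be new for a given column after sorting, so the unknown-level setting shown above remains relevant.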

Using test data set in RapidMiner

I'm trying to build a model from a training dataset and want to label the records in a test dataset.
All the tutorials and help I find online only cover cross-validation with one dataset, i.e. the training dataset. I couldn't find how to use test data. I tried to apply the resulting model to the test set, but after pre-processing the test set seems to have a different number of attributes than the training set. This is a text classification problem.
At the end I get some output like this
18.03.2013 01:47:00 Results of ResultWriter 'Write as Text (2)' [1]:
18.03.2013 01:47:00 SimpleExampleSet:
5275 examples,
366 regular attributes,
special attributes = {
confidence_1 = #367: confidence(1) (real/single_value)
confidence_5 = #368: confidence(5) (real/single_value)
confidence_2 = #369: confidence(2) (real/single_value)
confidence_4 = #370: confidence(4) (real/single_value)
prediction = #366: prediction(label) (nominal/single_value)/values=[1, 5, 2, 4]
}
But what I wanted is for all my examples to be labelled.
It seems that my test data and training data have a different number of attributes; I see many messages like the following in the logs.
Mar 18, 2013 1:46:41 AM WARNING: Kernel Model: The given example set does not contain a regular attribute with name 'wireless'. This might cause problems for some models depending on this particular attribute.
But how do we solve this problem in text classification, where we cannot know the number and names of the attributes beforehand?
Can someone please give me some pointers?
You probably use a Process Documents operator to preprocess both the training and the test set. It is important that both operators are set up identically. To "synchronize" the wordlist, i.e. to consider the same set of words in both, connect the wordlist (wor) output of the Process Documents operator used for training to the corresponding input port of the Process Documents operator that preprocesses the test set.
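The same principle can be shown outside RapidMiner. In this plain-Python sketch the wordlist is built from the training documents only and then reused to vectorize the test documents, so both sets end up with the same attributes in the same order. The function names and the whitespace tokenization are simplifications chosen for this example, not what Process Documents does internally.

```python
from collections import Counter


def build_wordlist(train_docs):
    """Collect the vocabulary from the training documents only."""
    return sorted({word for doc in train_docs for word in doc.lower().split()})


def vectorize(doc, wordlist):
    """Count occurrences of each wordlist entry; words outside the list are dropped."""
    counts = Counter(doc.lower().split())
    return [counts[word] for word in wordlist]


wordlist = build_wordlist(["wireless router", "fast router"])
train_vectors = [vectorize(d, wordlist) for d in ["wireless router", "fast router"]]
test_vector = vectorize("wireless wireless phone", wordlist)  # 'phone' is ignored
```

Because the test vector is indexed by the training wordlist, a word such as 'wireless' keeps its column even if it is missing from a particular test document.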

Does a test file in Weka require the same number of features as the training file, or fewer?

I have prepared two different .arff files from two different datasets, one for testing and the other for training. They have the same number of instances but different features, so the dimensionality of the feature vector differs between the files. When I ran cross-validation on each file separately, it worked perfectly, which suggests that the .arff files are properly prepared and contain no errors.
But if I train on the file with the lower dimensionality and evaluate on the test file, I get the following error.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 5986
at weka.classifiers.bayes.NaiveBayesMultinomial.probOfDocGivenClass(NaiveBayesMultinomial.java:295)
at weka.classifiers.bayes.NaiveBayesMultinomial.distributionForInstance(NaiveBayesMultinomial.java:254)
at weka.classifiers.Evaluation.evaluationForSingleInstance(Evaluation.java:1657)
at weka.classifiers.Evaluation.evaluateModelOnceAndRecordPrediction(Evaluation.java:1694)
at weka.classifiers.Evaluation.evaluateModel(Evaluation.java:1574)
at TrainCrossValidateARFF.main(TrainCrossValidateARFF.java:44)
Does the test file in Weka need the same number of features as the training file, or fewer?
Code for evaluation:
import java.text.DecimalFormat;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayesMultinomial;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
public class TrainCrossValidateARFF {
    private static DecimalFormat df = new DecimalFormat("#.##");
    public static void main(String args[]) throws Exception {
        // Both the training and the test ARFF file are required, since args[1] is read below.
        if (args.length != 2) {
            System.out.println("USAGE: TrainCrossValidateARFF <train_arff_file> <test_arff_file>");
            System.exit(-1);
        }
        String TrainarffFilePath = args[0];
        DataSource ds = new DataSource(TrainarffFilePath);
        Instances Train = ds.getDataSet();
        // The class attribute is the last attribute in both files.
        Train.setClassIndex(Train.numAttributes() - 1);
        String TestarffFilePath = args[1];
        DataSource ds1 = new DataSource(TestarffFilePath);
        Instances Test = ds1.getDataSet();
        Test.setClassIndex(Test.numAttributes() - 1);
        System.out.println("-----------" + TrainarffFilePath + "--------------");
        System.out.println("-----------" + TestarffFilePath + "--------------");
        // Train on the first file, evaluate on the second.
        NaiveBayesMultinomial naiveBayes = new NaiveBayesMultinomial();
        naiveBayes.buildClassifier(Train);
        Evaluation eval = new Evaluation(Train);
        eval.evaluateModel(naiveBayes, Test);
        System.out.println(eval.toSummaryString("\nResults\n======\n", false));
    }
}
The same number of features is required. You may also need to insert ? for the class attribute.
According to Weka architect Mark Hall:
To be compatible, the header information of the two sets of instances needs to be the same - same
number of attributes, with the same names in the same order. Furthermore, any nominal attributes must
have the same values declared in the same order in both sets of instances.
For unknown class values in your test set just set the value of each to missing - i.e "?".
According to Weka's wiki, the number of features needs to be the same for both the training and test sets. The types of these features (e.g., nominal, numeric, etc.) also need to be the same.
Also, I assume that you didn't apply any Weka filters to either of your datasets. The datasets often become incompatible if you apply filters separately on each dataset (even if it is the same filter).
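Before running an evaluation, the two headers can be compared with a small script. The following Python sketch reads the @attribute declarations from two ARFF texts and reports where they differ. It is a rough check written for this example, not part of Weka: it compares the declarations as text after collapsing whitespace, so nominal values that differ only in spacing, such as {a,b} and {a, b}, would be reported as different.

```python
def arff_attributes(arff_text):
    """Return the @attribute declarations of an ARFF file, in order."""
    attrs = []
    for line in arff_text.splitlines():
        stripped = line.strip()
        lowered = stripped.lower()
        if lowered.startswith("@data"):
            break  # the header ends where the data section starts
        if lowered.startswith("@attribute"):
            # Drop the keyword and collapse whitespace between name and type.
            attrs.append(" ".join(stripped.split()[1:]))
    return attrs


def header_mismatches(train_text, test_text):
    """List the differences between the headers of two ARFF files."""
    train, test = arff_attributes(train_text), arff_attributes(test_text)
    problems = []
    if len(train) != len(test):
        problems.append(f"attribute count differs: {len(train)} vs {len(test)}")
    for i, (a, b) in enumerate(zip(train, test)):
        if a != b:
            problems.append(f"attribute {i} differs: {a!r} vs {b!r}")
    return problems
```

An empty list means the names, order, types and declared nominal values match as text, which corresponds to the compatibility conditions quoted above.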
How do I divide a dataset into training and test set?
You can use the RemovePercentage filter (package weka.filters.unsupervised.instance).
In the Explorer just do the following:
training set:
-Load the full dataset
-select the RemovePercentage filter in the preprocess panel
-set the correct percentage for the split
-apply the filter
-save the generated data as a new file
test set:
-Load the full dataset (or just use undo to revert the changes to the dataset)
-select the RemovePercentage filter if not yet selected
-set the invertSelection property to true
-apply the filter
-save the generated data as new file

Resources