Find the importance of each column to the model - machine-learning

I have a ML.net project and as of right now everything has gone great. I have a motor that collects a power reading 256 times around each rotation and I push that into a model. Right now it determines the state of the motor nearly perfectly. The motor itself only has room for 38 values on it at a time so I have been spending several rotations to collect the full 256 samples for my training data.
I would like to cut the sample size down to 38 so every rotation I can determine its state. If I just evenly space the samples down to 38 my model degrades by a lot. I know I am not feeding the model the features it thinks are most important but just making a guess and randomly selecting data for the model.
Is there a way I can see the importance of each value in the array during the training process? I was thinking I could use IDataView for this and I found the below statement about it (link).
Standard ML schema: The IDataView system does not define, nor prescribe, standard ML schema representation. For example, it does not dictate representation of nor distinction between different semantic interpretations of columns, such as label, feature, score, weight, etc. However, the column metadata support, together with conventions, may be used to represent such interpretations.
Does this mean I can print out such things as weight for each column and how would I do that?
I have actually only been working with ML.net for a couple weeks now so I apologize if the question is naive, I assure you I have googled this as many ways as I can think to. Any advice would be appreciated. Thanks in advance.
EDIT:
Thank you for the answer I was going down a completely useless path. I have been trying to get it to work following the example you linked to. I have 260 columns with numbers and one column with the conditions as one of five text strings. This is the condition I am trying to predict.
The first time I tried it threw an error "expecting single but got string". No problem I used .Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label")) to convert to key values and it threw the error expected Single, got Key UInt32. any ideas on how to push that into this function?
At any rate thank you for the reply but I guess my upvotes don't count yet sorry. hopefully I can upvote it later or someone else here can upvote it. Below is the code example.
//Create MLContext
MLContext mlContext = new MLContext();
//Load Data
IDataView data = mlContext.Data.LoadFromTextFile<ModelInput>(TRAIN_DATA_FILEPATH, separatorChar: ',', hasHeader: true);
// 1. Get the column name of input features.
string[] featureColumnNames =
data.Schema
.Select(column => column.Name)
.Where(columnName => columnName != "Label").ToArray();
// 2. Define estimator with data pre-processing steps
IEstimator<ITransformer> dataPrepEstimator =
mlContext.Transforms.Concatenate("Features", featureColumnNames)
.Append(mlContext.Transforms.NormalizeMinMax("Features"))
.Append(mlContext.Transforms.Conversion.MapValueToKey("Label", "Label"));
// 3. Create transformer using the data pre-processing estimator
ITransformer dataPrepTransformer = dataPrepEstimator.Fit(data);//error here
// 4. Pre-process the training data
IDataView preprocessedTrainData = dataPrepTransformer.Transform(data);
// 5. Define Stochastic Dual Coordinate Ascent machine learning estimator
var sdcaEstimator = mlContext.Regression.Trainers.Sdca();
// 6. Train machine learning model
var sdcaModel = sdcaEstimator.Fit(preprocessedTrainData);
ImmutableArray<RegressionMetricsStatistics> permutationFeatureImportance =
mlContext
.Regression
.PermutationFeatureImportance(sdcaModel, preprocessedTrainData, permutationCount: 3);
// Order features by importance
var featureImportanceMetrics =
permutationFeatureImportance
.Select((metric, index) => new { index, metric.RSquared })
.OrderByDescending(myFeatures => Math.Abs(myFeatures.RSquared.Mean));
Console.WriteLine("Feature\tPFI");
foreach (var feature in featureImportanceMetrics)
{
Console.WriteLine($"{featureColumnNames[feature.index],-20}|\t{feature.RSquared.Mean:F6}");
}

I believe what you are looking for is called Permutation Feature Importance. This will tell you which features are most important by changing each feature in isolation, and then measuring how much that change affected the model's performance metrics. You can use this to see which features are the most important to the model.
Interpret model predictions using Permutation Feature Importance is the doc that describes how to use this API in ML.NET.

You can also use an open-source set of packages, they are much more sophisticated than what is found in ML.NET. I have an example on my GitHub how-to use R with advanced explainer packages to explain ML.NET models. You can get local instance as well as global model breakdown/details/diagnostics/feature interactions etc.
https://github.com/bartczernicki/BaseballHOFPredictionWithMlrAndDALEX

Related

Can you search for related database tables/fields using text similarity?

I am doing a college project where I need to compare a string with list of other strings. I want to know if we have any kind of library which can do this or not.
Suppose I have a table called : DOCTORS_DETAILS
Other Table names are : HOSPITAL_DEPARTMENTS , DOCTOR_APPOINTMENTS, PATIENT_DETAILS,PAYMENTS etc.
Now I want to calculate which one among those are more relevant to DOCTOR_DETAILS ?
Expected output can be,
DOCTOR_APPOINTMENTS - More relevant because of the term doctor matches in both string
PATIENT_DETAILS - The term DETAILS present in both string
HOSPITAL_DEPARTMENTS - least relevant
PAYMENTS - least relevant
Therefore I want to find RELEVENCE based on number of similar terms present on both the strings in question.
Ex : DOCTOR_DETAILS -> DOCTOR_APPOITMENT(1/2) > DOCTOR_ADDRESS_INFORMATION(1/3) > DOCTOR_SPECILIZATION_DEGREE_INFORMATION (1/4) > PATIENT_INFO (0/2)
Semantic similarity is a common NLP problem. There are multiple approaches to look into, but at their core they all are going to boil down to:
Turn each piece of text into a vector
Measure distance between vectors, and call closer vectors more similar
Three possible ways to do step 1 are:
tf-idf
fasttext
bert-as-service
To do step 2, you almost certainly want to use cosine distance. It is pretty straightforward with Python, here is a implementation from a blog post:
import numpy as np
def cos_sim(a, b):
"""Takes 2 vectors a, b and returns the cosine similarity according
to the definition of the dot product
"""
dot_product = np.dot(a, b)
norm_a = np.linalg.norm(a)
norm_b = np.linalg.norm(b)
return dot_product / (norm_a * norm_b)
For your particular use case, my instincts say to use fasttext. So, the official site shows how to download some pretrained word vectors, but you will want to download a pretrained model (see this GH issue, use https://dl.fbaipublicfiles.com/fasttext/vectors-english/wiki-news-300d-1M-subword.bin.zip),
Then you'd then want to do something like:
import fasttext
model = fasttext.load_model("model_filename.bin")
def order_tables_by_name_similarity(main_table, candidate_tables):
'''Note: we use a fasttext model, not just pretrained vectors, so we get subword information
you can modify this to also output the distances if you need them
'''
main_v = model[main_table]
similarity_to_main = lambda w: cos_sim(main_v, model[w])
return sorted(candidate_tables, key=similarity_to_main, reverse=True)
order_tables_by_name_similarity("DOCTORS_DETAILS", ["HOSPITAL_DEPARTMENTS", "DOCTOR_APPOINTMENTS", "PATIENT_DETAILS", "PAYMENTS"])
# outputs: ['PATIENT_DETAILS', 'DOCTOR_APPOINTMENTS', 'HOSPITAL_DEPARTMENTS', 'PAYMENTS']
If you need to put this in production, the giant model size (6.7GB) might be an issue. At that point, you'd want to build your own model, and constrain the model size. You can probably get roughly the same accuracy out of a 6MB model!

How can I get predictions from these pretrained models?

I've been trying to generate human pose estimations, I came across many pretrained models (ex. Pose2Seg, deep-high-resolution-net ), however these models only include scripts for training and testing, this seems to be the norm in code written to implement models from research papers ,in deep-high-resolution-net I have tried to write a script to load the pretrained model and feed it my images, but the output I got was a bunch of tensors and I have no idea how to convert them to the .json annotations that I need.
total newbie here, sorry for my poor English in advance, ANY tips are appreciated.
I would include my script but its over 100 lines.
PS: is it polite to contact the authors and ask them if they can help?
because it seems a little distasteful.
Im not doing skeleton detection research, but your problem seems to be general.
(1) I dont think other people should teaching you from begining on how to load data and run their code from begining.
(2) For running other peoples code, just modify their test script which is provided e.g
https://github.com/leoxiaobin/deep-high-resolution-net.pytorch/blob/master/tools/test.py
They already helps you loaded the model
model = eval('models.'+cfg.MODEL.NAME+'.get_pose_net')(
cfg, is_train=False
)
if cfg.TEST.MODEL_FILE:
logger.info('=> loading model from {}'.format(cfg.TEST.MODEL_FILE))
model.load_state_dict(torch.load(cfg.TEST.MODEL_FILE), strict=False)
else:
model_state_file = os.path.join(
final_output_dir, 'final_state.pth'
)
logger.info('=> loading model from {}'.format(model_state_file))
model.load_state_dict(torch.load(model_state_file))
model = torch.nn.DataParallel(model, device_ids=cfg.GPUS).cuda()
Just call
# evaluate on Variable x with testing data
y = model(x)
# access Variable's tensor, copy back to CPU, convert to numpy
arr = y.data.cpu().numpy()
# write CSV
np.savetxt('output.csv', arr)
You should be able to open it in excel
(3) "convert them to the .json annotations that I need".
That's the problem nobody can help. We don't know what format you want. For their format, it can be obtained either by their paper. Or looking at their training data by
X, y = torch.load('some_training_set_with_labels.pt')
By correlating the x and y. Then you should have a pretty good idea.

How do I speedup adding two big vectors of tuples?

Recently, I am implementing an algorithm from a paper that I will be using in my master's work, but I've come across some problems regarding the time it is taking to perform some operations.
Before I get into details, I just want to add that my data set comprehends roughly 4kk entries of data points.
I have two lists of tuples that I've get from a framework (annoy) that calculates cosine similarity between a vector and every other vector in the dataset. The final format is like this:
[(name1, cosine), (name2, cosine), ...]
Because of the algorithm, I have two of that lists with the same names (first value of the tuple) in it, but two different cosine similarities. What I have to do is to sum the cosines from both lists, and then order the array and get the top-N highest cosine values.
My issue is: is taking too long. My actual code for this implementation is as following:
def topN(self, user, session):
upref = self.m2vTN.get_user_preference(user)
spref = self.sm2vTN.get_user_preference(session)
# list of tuples 1
most_su = self.indexer.most_similar(upref, len(self.m2v.wv.vocab))
# list of tuples 2
most_ss = self.indexer.most_similar(spref, len(self.m2v.wv.vocab))
# concat both lists and add into a dict
d = defaultdict(int)
for l, v in (most_ss + most_su):
d[l] += v
# convert the dict into a list, and then sort it
_list = list(d.items())
_list.sort(key=lambda x: x[1], reverse=True)
return [x[0] for x in _list[:self.N]]
How do I make this code faster? I've tried using threads but I'm not sure if it will make it faster. Getting the lists is not the problem here, but the concatenation and sorting is.
Thanks! English is not my native language, so sorry for any misspelling.
What do you mean by "too long"? How large are the two lists? Is there a chance your model, and interim results, are larger than RAM and thus forcing virtual-memory paging (which would create frustrating slowness)?
If you are in fact getting the cosine-similarity with all vectors in the model, the annoy-indexer isn't helping any. (Its purpose is to get a small subset of nearest-neighbors much faster, at the expense of perfect accuracy. But if you're calculating the similarity to every candidate, there's no speedup or advantage to using ANNOY.
Further, if you're going to combine all of the distances from two such calculation, there's no need for the sorting that most_similar() usually does - it just makes combining the values more complex later. For the gensim vector-models, you can supply a False-ish topn value to just get the unsorted distances to all model vectors, in order. Then you'd have two large arrays of the distances, in the model's same native order, which are easy to add together elementwise. For example:
udists = self.m2v.most_similar(positive=[upref], topn=False)
sdists = self.m2v.most_similar(positive=[spref], topn=False)
combined_dists = udists + sdists
The combined_dists aren't labeled, but will be in the same order as self.m2v.index2entity. You could then sort them, in a manner similar to what the most_similar() method itself does, to find the ranked closest. See for example the gensim source code for that part of most_similar():
https://github.com/RaRe-Technologies/gensim/blob/9819ce828b9ed7952f5d96cbb12fd06bbf5de3a3/gensim/models/keyedvectors.py#L557
Finally, you might not need to be doing this calculation yourself at all. You can provide more-than-one vector to most_similar() as the positive target, and then it will return the vectors closest to the average of both vectors. For example:
sims = self.m2v.most_similar(positive=[upref, spref], topn=len(self.m2v))
This won't be the same value/ranking as your other sum, but may behave very similarly. (If you wanted less-than-all of the similarities, then it might make sense to use the ANNOY indexer this way, as well.)

Obfuscation of sensitive data for machine learning

I am preparing a dataset for my academic interests. The original dataset contains sensitive information from transactions, like Credit card no, Customer email, client ip, origin country, etc. I have to obfuscate this sensitive information, before they leave my origin data-source and store them for my analysis algorithms. Some of the fields in data can be categorical and would not be difficult to obfuscate. Problem lies with the non-categorical data fields, how best should I obfuscate them to leave underlying statistical characteristics of my data intact but make it impossible (at least mathematically hard) to revert back to original data.
EDIT: I am using Java as front-end to prepare the data. The prepared data would then be handled by Python for machine learning.
EDIT 2: To explain my scenario, as a followup from the comments. I have data fields like:
'CustomerEmail', 'OriginCountry', 'PaymentCurrency', 'CustomerContactEmail',
'CustomerIp', 'AccountHolderName', 'PaymentAmount', 'Network',
'AccountHolderName', 'CustomerAccountNumber', 'AccountExpiryMonth',
'AccountExpiryYear'
I have to obfuscate the data present in each of these fields (data samples). I plan to treat these fields as features (with the obfuscated data) and train my models against a binary class label (which I have for my training and test samples).
There is no general way to obfuscate non categorical data as any processing leads to the loss of information. The only thing you can do is try to list what type of information is the most important one and design transformation which leaves it. For example if your data is Lat/Lng geo position tags you could perform any kind of distance-preserving transformations, such as translation, rotations etc. if it is not good enough you can embeed your data in lower dimensional space while preserving the pairwise distances (there are many such methods). In general - each type of non-categorical data requires different processing, and each destroys information - it is up to you to come up with the list of important properties and finding transformations preserving it.
I agree with #lejlot that there is no silver bullet method to solve your problem. However, I believe this answer can get you started thinking about to handle at least the numerical fields in your data set.
For the numerical fields, you can make use of the Java Random class and map a given number to another obfuscated value. The trick here is to make sure that you map the same numbers to the same new obfuscated value. As an example, consider your credit card data, and let's assume that each card number is 16 digits. You can load your credit card data into a Map and iterate over it, creating a new proxy for each number:
Map<Integer, Integer> ccData = new HashMap<Integer, Integer>();
// load your credit data into the Map
// iterate over Map and generate random numbers for each CC number
for (Map.Entry<Integer, Integer> entry : ccData.entrySet()) {
Integer key = entry.getKey();
Random rand = new Random();
rand.setSeed(key);
int newNumber = rand.nextInt(10000000000000000); // generate up to max 16 digit number
ccData.put(key, newNumber);
}
After this, any time you need to use a credit card num you would access it via ccData.get(num) to use the obfuscated value.
You can follow a similar plan for the IP addresses.

Using test data set in RapidMiner

I'm trying to create a model with a training dataset and want to label the records in a test data set.
All tutorials or help I find online has information on only using cross validation with one data set, i.e., training dataset. I couldn't find how to use test data. I tried to apply the result model on to the test set. But the test set seems to give different no. of attributes than training set after pre-processing. This is a text classification problem.
At the end I get some output like this
18.03.2013 01:47:00 Results of ResultWriter 'Write as Text (2)' [1]:
18.03.2013 01:47:00 SimpleExampleSet:
5275 examples,
366 regular attributes,
special attributes = {
confidence_1 = #367: confidence(1) (real/single_value)
confidence_5 = #368: confidence(5) (real/single_value)
confidence_2 = #369: confidence(2) (real/single_value)
confidence_4 = #370: confidence(4) (real/single_value)
prediction = #366: prediction(label) (nominal/single_value)/values=[1, 5, 2, 4]
}
But what I wanted is all my examples to be labelled.
It seems that my test data and training data have different no. of attributes, I see many of following in the logs.
Mar 18, 2013 1:46:41 AM WARNING: Kernel Model: The given example set does not contain a regular attribute with name 'wireless'. This might cause problems for some models depending on this particular attribute.
But how do we solve such problem in text classification as we cannot know no. of and name of attributes before hand.
Can some one please throw some pointers.
You probably use a Process Documents operator to preprocess both training and test set. Here it is important that both these operators are setup identically. To "synchronize" the wordlist, i.e. consider the same set of words in both of them, you have to connect the wordlist (wor) output of the Process Documents operator used for training to the corresponding input port of the Process Documents operator used for preprocessing the test set.

Resources