Is there a way to run inference on a TF model with a "filter"?
Imagine my training data points can be clustered and labeled (just an example, it's not vision ML):
datapoint1-> cat (parent label animals)
datapoint2-> dog (parent label animals)
datapoint3-> mazda (parent label cars)
datapoint4-> toyota (parent label cars)
Let's say I train this in TensorFlow.
Assuming that the consumer "knows" the parent label at inference time, is it possible to narrow the "search" and reduce inference time?
For example, assuming the client knows the data point is a car and only wishes to identify which car it is, can I "filter" the TF classification process?
Related
I want to supply a custom metric to be optimised when deciding which split to use in a decision tree, replacing the standard 'gini index'. How can I do this in any decision tree package? Boosted decision trees would also be fine.
Edit:
I want to implement a criterion like:
crit = (c1 + c2 - c3)/(2* sqrt(c2 + c4))
where c1,c2,c3,c4 are different classes. The classes are imbalanced so I want that to be taken into account in the calculation (not use balanced class weights).
I am trying to solve a text classification problem. I have a limited number of labels that capture the category of my text data. If the incoming text doesn't fit any label, it should be tagged as 'Other'. In the example below, I built a text classifier to classify text as 'breakfast' or 'italian'. In the test scenario, I included a couple of texts that do not fit the labels I used for training. This is the challenge I'm facing: ideally, I want the model to say 'Other' for 'i like hiking' and 'everyone should understand maths'. How can I do this?
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

X_train = np.array(["coffee is my favorite drink",
                    "i like to have tea in the morning",
                    "i like to eat italian food for dinner",
                    "i had pasta at this restaurant and it was amazing",
                    "pizza at this restaurant is the best in nyc",
                    "people like italian food these days",
                    "i like to have bagels for breakfast",
                    "olive oil is commonly used in italian cooking",
                    "sometimes simple bread and butter works for breakfast",
                    "i liked spaghetti pasta at this italian restaurant"])
y_train_text = ["breakfast", "breakfast", "italian", "italian", "italian",
                "italian", "breakfast", "italian", "breakfast", "italian"]
X_test = np.array(['this is an amazing italian place. i can go there every day',
                   'i like this place. i get great coffee and tea in the morning',
                   'bagels are great here',
                   'i like hiking',
                   'everyone should understand maths'])

# Bag-of-words counts -> tf-idf weighting -> multinomial naive Bayes
classifier = Pipeline([
    ('vectorizer', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB())])
classifier.fit(X_train, y_train_text)

predicted = classifier.predict(X_test)
proba = classifier.predict_proba(X_test)
print(predicted)
print(proba)
['italian' 'breakfast' 'breakfast' 'italian' 'italian']
[[0.25099411 0.74900589]
[0.52943091 0.47056909]
[0.52669142 0.47330858]
[0.42787443 0.57212557]
[0.4 0.6 ]]
I consider the 'Other' category as noise and I cannot model this category.
I think Kalsi might have suggested this, but it was not clear to me. You could define a confidence threshold for your classes. If the predicted probability does not reach the threshold for any of your classes ('italian' and 'breakfast' in your example), the sample counts as not classified, which yields the 'other' "class".
I say "class" because 'other' is not exactly a class. You probably don't want your classifier to be good at predicting 'other', so this confidence threshold might be a good approach.
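As a minimal sketch of the threshold idea, the snippet below post-processes the probabilities printed in the question. The function name predict_with_threshold and the threshold value 0.7 are hypothetical choices for illustration, and the column order (breakfast, italian) is an assumption inferred from the printed predictions.

```python
def predict_with_threshold(probabilities, classes, threshold):
    """Return the most probable class per row, or "other" when no class
    reaches the confidence threshold."""
    labels = []
    for row in probabilities:
        best = max(range(len(row)), key=lambda i: row[i])
        labels.append(classes[best] if row[best] >= threshold else "other")
    return labels

# Assumed column order: breakfast first, italian second.
classes = ["breakfast", "italian"]
# Probabilities copied from the output shown in the question.
proba = [[0.25099411, 0.74900589],
         [0.52943091, 0.47056909],
         [0.52669142, 0.47330858],
         [0.42787443, 0.57212557],
         [0.4, 0.6]]

labels = predict_with_threshold(proba, classes, threshold=0.7)
print(labels)
```

With a threshold this strict, the two breakfast sentences are also rejected, so the value would need tuning on held-out data.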
You cannot do that.
You have trained the model to predict only two labels, i.e. breakfast or italian. So the model doesn't have any idea about a third or fourth label.
You and I know that "i like hiking" is neither breakfast nor italian. But how would a model know that? It only knows breakfast and italian. So there has to be a way to tell the model: if you get confused between breakfast and italian, then predict the label as other.
You can achieve this by training the model with other as a label, using some texts like "i like hiking" etc.
But in your case, a little hack can be done as follows.
What does it mean when a model predicts a label with a probability of 0.5 (or approximately 0.5)? It means that the model is confused between the labels breakfast and italian. You can take advantage of this.
You can take all the predicted probability values and assign the label other if the probability value is between 0.45 and 0.55. In this way you can predict the other label (obviously with some errors) without letting the model know that there is a label called other.
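The band described above can be sketched as follows, applied to the probabilities printed in the question. The function name label_with_confusion_band is hypothetical, and the column order (breakfast, italian) is an assumption inferred from the printed predictions.

```python
def label_with_confusion_band(probabilities, classes, low=0.45, high=0.55):
    """Assign "other" when the top probability falls inside [low, high],
    otherwise keep the most probable class."""
    labels = []
    for row in probabilities:
        top = max(row)
        if low <= top <= high:
            labels.append("other")
        else:
            labels.append(classes[row.index(top)])
    return labels

# Assumed column order: breakfast first, italian second.
classes = ["breakfast", "italian"]
# Probabilities copied from the output shown in the question.
proba = [[0.25099411, 0.74900589],
         [0.52943091, 0.47056909],
         [0.52669142, 0.47330858],
         [0.42787443, 0.57212557],
         [0.4, 0.6]]

banded = label_with_confusion_band(proba, classes)
print(banded)
```

On these particular values the band catches the two breakfast sentences (about 0.53) but not 'i like hiking' (about 0.57), which illustrates the "obviously with some errors" caveat.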
You can try setting class priors when creating the MultinomialNB. You could create a dummy "Other" training example, and then set the prior for Other high enough that instances default to Other when there isn't enough evidence to select one of the other classes.
No, you cannot do that.
You have to define a third category, "other" or whatever name suits you, and give your model some data related to that category. Make sure that the numbers of training examples for all three categories are roughly equal; otherwise "other", being a very broad category, could skew your model towards it.
Another way to approach this is to extract noun phrases from all your sentences for the different categories, including other, and feed those into the model; consider this a feature selection step for your machine learning model. In this way the noise added by irrelevant words is removed, which should give better performance than tf-idf.
If you have huge amounts of data, go for deep learning models, which do feature selection automatically.
Don't go with the approach of manipulating probabilities yourself: a 50-50 probability means that the model is confused between the two classes you have defined; it has no idea about a third "other" class.
Let's say the sentence is "I want italian breakfast". The model will be confused about whether this sentence belongs to the "italian" or the "breakfast" category, but that doesn't mean it belongs to the "other" category.
I wrote a program that trains a decision tree based on the ID3 algorithm, using an information gain function (Shannon entropy) for feature selection (split).
Once I trained a decision tree I tested it to classify unseen data and I realized that some data instances cannot be classified: there is no path on the tree that classifies the instance.
An example (this is an illustration example but I encounter the same problem with a larger and more complex data set):
With f1 and f2 as the predictor variables (features) and y as the categorical target variable, the value ranges are:
f1: [a1; a2; a3]
f2: [b1; b2; b3]
y : [y1; y2; y3]
Training data:
("a1", "b1", "y1");
("a1", "b2", "y2");
("a2", "b3", "y3");
("a3", "b3", "y1");
Trained tree:
        [f2]
      /  |  \
    b1   b2   b3
    /    |      \
   y1    y2     [f1]
                /  \
              a2    a3
              /      \
             y3      y1
The instance ("a1", "b3") cannot be classified with the given tree.
Several questions came to mind:
Does this situation have a name? Tree incompleteness or something like that?
Is there a way to know whether a decision tree will cover all combinations of unknown instances (all combinations of feature values)?
Does the reason for this "incompleteness" lie in the topology of the data set or in the algorithm used to train the decision tree (ID3 in this case, or another)?
Is there a method to classify these unclassifiable instances with the given decision tree? Or must one use another tool (random forest, neural networks...)?
This situation cannot occur with the ID3 decision-tree learner, regardless of whether it uses information gain or some other heuristic for split selection. (See, for example, ID3 algorithm on Wikipedia.)
The "trained tree" in your example above could not have been returned by the ID3 decision-tree learning algorithm.
This is because when the algorithm selects a d-valued attribute (i.e. an attribute with d possible values) on which to split the given leaf, it will create d new children (one per attribute value). In particular, in your example above, the node [f1] would have three children, corresponding to attribute values a1, a2, and a3.
It follows from the previous paragraph (and, in general, from the way the ID3 algorithm works) that any well-formed vector of the form (v1, v2, ..., vn, y), where vi is a value of the i-th attribute and y is the class value, should be classifiable by the decision tree that the algorithm learns on a given training set.
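To illustrate the point about one child per attribute value, here is a minimal ID3-style sketch on the training data from the question. It is not the asker's program: the function names are hypothetical, and labelling an empty branch with the majority class of the parent node is one common convention that is assumed here.

```python
import math
from collections import Counter

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def majority(rows):
    return Counter(r[-1] for r in rows).most_common(1)[0][0]

def id3(rows, attrs, domains):
    """Return a class label (leaf) or a tuple (attribute index, children)."""
    labels = {r[-1] for r in rows}
    if len(labels) == 1:
        return next(iter(labels))
    if not attrs:
        return majority(rows)

    def gain(a):
        remainder = 0.0
        for v in domains[a]:
            subset = [r for r in rows if r[a] == v]
            if subset:
                remainder += len(subset) / len(rows) * entropy(subset)
        return entropy(rows) - remainder

    best = max(attrs, key=gain)
    children = {}
    # One child per value in the attribute's declared domain, even when no
    # training row has that value (assumed fallback: parent majority class).
    for v in domains[best]:
        subset = [r for r in rows if r[best] == v]
        if subset:
            children[v] = id3(subset, [a for a in attrs if a != best], domains)
        else:
            children[v] = majority(rows)
    return (best, children)

def classify(tree, x):
    while isinstance(tree, tuple):
        attr, children = tree
        tree = children[x[attr]]
    return tree

domains = {0: ["a1", "a2", "a3"], 1: ["b1", "b2", "b3"]}
train = [("a1", "b1", "y1"), ("a1", "b2", "y2"),
         ("a2", "b3", "y3"), ("a3", "b3", "y1")]
tree = id3(train, [0, 1], domains)

all_combos = [(a, b) for a in domains[0] for b in domains[1]]
predictions = {x: classify(tree, x) for x in all_combos}
print(predictions[("a1", "b3")])
```

Because every split creates a branch for each declared value, classify never hits a missing path; which label an unseen combination receives depends on the tie-breaking and fallback conventions chosen.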
Would you mind providing a link to the software you used to learn the "incomplete" trees?
To answer your questions:
Not that I know of. It doesn't make sense to learn such "incomplete trees." If we knew that some attribute values will never occur then we would not include them in the specification (the file where you list attributes and their values) in the first place.
With the ID3 algorithm, you can prove (as I sketched in the answer) that every tree returned by the algorithm will cover all possible combinations.
You're using the wrong algorithm. Data has nothing to do with it.
There is no such thing as an unclassifiable instance in decision-tree learning. One usually defines a decision-tree learning problem as follows. Given a train set S of examples x1,x2,...,xn of the form xi=(v1i,v2i,...,vni,yi) where vji is the value of the j-th attribute and yi is the class value in example xi, learn a function (represented by a decision tree) f: X -> Y, where X is the space of all possible well-formed vectors (i.e. all possible combinations of attribute values) and Y is the space of all possible class values, which minimizes an error function (e.g. the number of misclassified examples). From this definition, you can see that one requires that the function f is able to map any combination to a class value; thus, by definition, each possible instance is classifiable.
I am working on text categorization in RapidMiner and need to apply a problem transformation method to convert a multi-label data set into a single-label one (e.g. Label Powerset), but I couldn't find one in RapidMiner. I am sure I am missing something, or maybe RapidMiner provides them under another name?
1) I searched and found the "Polynomial By Binomial" operator for RapidMiner, which I think uses Binary Relevance internally for problem transformation, but how can I apply others, e.g. Label Powerset or Classifier Chains?
2) Secondly, the SVM (learner) inside the "Polynomial By Binomial" operator is applied K times (K = number of classes) and the K models are combined into a single model, but it would still classify a multi-label example as a single-label example. How can I get the multiple labels associated with an example?
3) Do I have to store each model generated inside "Polynomial By Binomial" and then apply each one to the test data to find the multiple labels associated with an example?
I am new to RapidMiner, so please excuse any mistakes.
Thanks in advance.
Polynomial by Binomial is not the way you want to go.
This operator performs something like one-vs-all. It lets you solve multiclass problems with a learner that is only capable of binomial classification.
For your problem:
Would it help to transform your table like this?
before:
ID Label
1 A|B|C
2 B|C
to
ID Label
1 A
2 B
3 C
4 B
5 C
The tricky part is how to calculate the performance. But I think once this is clear, a combination of recall/remember/remove duplicates and join will do it.
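Outside RapidMiner, the table transformation shown above can be sketched in a few lines of plain Python. The function name split_multilabel_rows is hypothetical, and the rows are renumbered from 1 as in the "to" table.

```python
def split_multilabel_rows(rows, separator="|"):
    """Turn each (id, "A|B|C") row into one row per label, renumbering
    the rows from 1 as in the table above."""
    result = []
    next_id = 1
    for _original_id, label_field in rows:
        for label in label_field.split(separator):
            result.append((next_id, label))
            next_id += 1
    return result

before = [(1, "A|B|C"), (2, "B|C")]
after = split_multilabel_rows(before)
for row in after:
    print(row)
```

Note that renumbering drops the link back to the original example, so keeping the original ID in an extra column would be needed to join the predicted labels back together later.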
I have a problem at hand where
I need to classify the input data into one or more of the labels S1, S2, S3, S4.
There is a relationship between the labels S1, S2, S3 and S4:
If the input is labelled Sn, it must also be labelled S1..Sn.
S1, S2, S3 and S4 are like different stages for an entity X to pass through. Based on the input data, X might get through one or many of the stages; X must pass through S1 to go to S2, S2 to go to S3, and so on.
We want to ensure that only those X that reach S3 are allowed to pass, so based on the input data we decide whether to allow X to go through S1 or not.
Which machine learning models can we choose to predict whether X reaches S3, given information such as the input data and the stages X has passed for that input data?
I am thinking in the direction of multi-label classification. There might be some relationship between input data stage S1 and S2.
Update: I have to train with examples like
1. Input data is s1
2. Input data is s2
3. ..
4 ..
Some doubts
Your question is far from being clear, for example:
We want to optimize that most X reaches S3, so based on input data we decide whether to allow X to go through S1 or not
This actually suggests that the best model would be "always answer yes", as it maximizes the number of objects reaching S3 (it simply lets any object reach this point).
General ideas
I assume two possible interpretations:
You have a label "pipeline", which simply means that an object cannot be labelled S_n unless it has already been labelled with all S_i for i < n.
This does not seem to be a problem for one single model; you can pipeline models in a natural way, i.e. train a model 1 which recognizes whether object x should have label S_1. Next, you train a model 2 on all data that has label S_1 in the training set to predict label S_2, and so on. During execution you simply ask each model i whether it accepts (labels) the incoming object x, and stop when the first one says "no".
You have some more complex constraints on the labels, which may be strict or not. For such cases, you should try one of the many methods of multi-label classification with constraints; in particular, there is a tech report regarding this aspect of ML.
Solution 1 - approximating test functions
If your problem can be described as:
You have data points X, such that for each of them you know the maximum number of some pipelineable tests T_i which x passes
You want to train a classifier able to predict the maximum number of consecutive tests that your point x passes
You do not have access to actual tests T_i or they are very inefficient
Then the simplest way would be to apply the following training procedure instead of one classifier:
Take all your data points, label those with y=0 as 0 and those with y>=1 as 1, and train some binary classifier (for example an SVM). So you simply temporarily relabel your data so that it separates points that pass the first test from those that don't. Let's call this classifier cl_1.
Now take your data points, label those with y=1 as 0 and those with y>=2 as 1, and again train a binary classifier; call it cl_2.
Repeat until all tests have their classifier; in general, we call the classifier cl_i when it can distinguish between points labeled with y=i-1 and those with y>=i.
Now, to classify a new point, you simply check all your cl_i iteratively for i=1,..,tests and answer with the largest i such that cl_i(x)=1. So you "simulate" your tests with classifiers, and simply report how many of these test approximations the point passed.
To sum up: each test can be approximated with one binary classifier, and then the question "What is the highest consecutive test number that our point passes?" is approximated with "What is the highest consecutive classifier number for which our point is classified as true?".
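The procedure can be sketched as below. To keep the example self-contained, a trivial one-dimensional threshold classifier stands in for the SVM, and the toy data set is invented for illustration; both are assumptions, as are the function names. Prediction stops at the first classifier that answers 0, which is one reading of "largest consecutive" in the description above.

```python
def train_stump(points):
    """Toy stand-in for a binary classifier: points is a list of (x, label)
    with label in {0, 1}; assumes the two groups are separable on x."""
    negatives = [x for x, label in points if label == 0]
    positives = [x for x, label in points if label == 1]
    threshold = (max(negatives) + min(positives)) / 2
    return lambda x: 1 if x > threshold else 0

def train_cascade(data, n_tests):
    """cl_i separates points with y == i-1 (label 0) from y >= i (label 1)."""
    classifiers = []
    for i in range(1, n_tests + 1):
        relabeled = [(x, 0 if y == i - 1 else 1) for x, y in data if y >= i - 1]
        classifiers.append(train_stump(relabeled))
    return classifiers

def predict_stage(classifiers, x):
    """Count how many consecutive classifiers accept x, starting from cl_1."""
    passed = 0
    for cl in classifiers:
        if cl(x) == 0:
            break
        passed += 1
    return passed

# Invented toy data: (feature value, number of tests passed).
data = [(0.5, 0), (1.0, 0), (1.5, 1), (2.2, 1),
        (2.8, 2), (3.1, 2), (3.9, 3), (4.5, 3)]
classifiers = train_cascade(data, n_tests=3)
print([predict_stage(classifiers, x) for x in (0.7, 3.0, 5.0)])
```

In practice train_stump would be replaced by any real binary learner; only the relabelling in train_cascade and the early stop in predict_stage carry the idea.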
Solution 2 - simple regression
You can also simply apply regression from your input space to the number of tests a point reaches. Regression has a built-in assumption that the output values are ordered and related. So if you train on pairs (x,y), where y is the number of the last test passed by x, then you are actually using the fact that the output y=3 is closely related to first getting y=2 in the computations. Such a regression (non-linear!) could simply be done using neural networks (possibly regularized).
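As a rough sketch of the regression idea, the snippet below fits an ordinary least-squares line, which is a deliberately simpler stand-in for the non-linear neural network suggested above, and rounds the output to a valid stage count. The toy data and function names are invented for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single feature: returns (slope, intercept)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def predict_tests_passed(model, x, max_tests):
    """Round the regression output and clamp it to the range 0..max_tests."""
    slope, intercept = model
    return min(max(round(slope * x + intercept), 0), max_tests)

# Invented toy data: feature value and number of the last test passed.
xs = [0.5, 1.0, 1.5, 2.2, 2.8, 3.1, 3.9, 4.5]
ys = [0, 0, 1, 1, 2, 2, 3, 3]
model = fit_line(xs, ys)
print([predict_tests_passed(model, x, max_tests=3) for x in (0.7, 3.0, 5.0)])
```

The rounding and clamping step is an added assumption; it is only needed because the regression output is continuous while the number of tests passed is an integer.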