How to judge whether a mean reciprocal rank (MRR) value indicates a good model - machine-learning

Metrics such as AUC have rules of thumb:
a good model scores above 0.7,
and a great one scores above 0.85.
I want to know the equivalent for the mean reciprocal rank (MRR) metric.
How do I decide whether an MRR value indicates a good model?
Many thanks!

The MRR metric takes values from 0 (worst) to 1 (best), as described here. However, the definition of a good (or acceptable) MRR depends on your use case. For example, if you build a model for a recommender system that, from thousands of possible items, recommends a set of five items to users, then an MRR of 0.2 could be defined as acceptable. An MRR of 0.2 corresponds to the correct item sitting at around rank 5 (since 1/5 = 0.2), so on average the item the user bought was within the top 5 items predicted by your model.
All in all, it mostly depends on how many classes there are to predict, as well as on your use case.
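To make the metric concrete, here is a minimal sketch of how MRR can be computed. The function name, the item labels and the three example users are made up for illustration; the sketch assumes exactly one correct item per user and scores a user as 0 when that item is not in the list.

```python
def mean_reciprocal_rank(ranked_lists, relevant_items):
    """Average of 1/rank of the correct item per query (0 if it is absent)."""
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_items):
        for rank, item in enumerate(ranking, start=1):
            if item == relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


# Three hypothetical users, each with a top-5 recommendation list
recommendations = [
    ["a", "b", "c", "d", "e"],
    ["b", "a", "c", "d", "e"],
    ["c", "d", "e", "a", "b"],
]
purchased = ["a", "a", "z"]  # the third user's item was not recommended at all

mrr = mean_reciprocal_rank(recommendations, purchased)
print(round(mrr, 3))  # (1 + 1/2 + 0) / 3 = 0.5
```

Because each user contributes 1/rank, a few first-place hits can dominate the average, which is one reason a fixed "good" threshold is hard to state.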

Related

Machine Learning model generalisation

I'm new to Machine Learning, and I have a question about model generalization. In my case, I'm going to produce some mechanical parts, and I'm interested in controlling the input parameters to obtain certain properties in the final part.
More particularly, I'm interested in 8 parameters (say, P1, P2, ..., P8). To limit the number of pieces I need to produce while maximizing the combinations of parameters explored, I've divided the problem into 2 sets. For the first set of pieces, I'll vary the first 4 parameters (P1 ... P4), while the others will be held constant. In the second case, I'll do the opposite (variables P5 ... P8 and constants P1 ... P4).
So I'd like to know if it's possible to build a single model that takes the eight parameters as inputs to predict the properties of the final part. I ask because, as I'm not varying all 8 variables at once, I thought that maybe I would have to build one model for each set of parameters, and the predictions of the 2 different models couldn't be related to each other.
Thanks in advance.
In most cases, having two different models will give better accuracy than one big model. The reason is that each local model only looks at 4 features and can identify patterns among them to make its predictions.
But this particular approach will most certainly fail to scale. Right now you only have two sets of data, but what if that grows to 20 sets? It will not be practical to create and maintain 20 ML models in production.
What works best for your case will need some experimentation. Take a random sample of the data and train ML models on it: one big model and two local models. Then evaluate their performance, not just on accuracy but also on F1 score, AUC-PR and the ROC curve, to find out what works best for you. If you do not see a major performance drop, then one big model for the entire dataset will be the better option. If you know that your data will always be divided into these two sets and you don't care about scalability, then go with two local models.

Machine Learning: How to detect the independent variables that are generating a dependent boolean value

I'm trying to use machine learning in my job, but I can't find a way to adapt it to what I need. And I don't know if this is already a known problem or if I'm working on something that doesn't have a known solution yet.
Let's say that I have a lot of independent variables, one-hot encoded, and a dependent variable with only two states: True (the result had an error) and False (the result was successful).
My independent variables are the parameters I use for a query to an API, and the result is what the API returned.
My objective is to detect, in a dataset covering a timeframe of a few hours, the pattern of failing parameters, so I can avoid querying the API if I'm certain that it would fail.
(I'm working with millions of queries per day, and this mechanism is critical for a good user experience)
I'll try to make an example so you can understand what I need.
Suppose that I have a delivery company with 3 trucks and 3 different routes I could take.
So, my dummy variables would be T1, T2, T3, R1, R2 and R3 (I could drop T3 and R3, since they are implied when the other 2 are both 0).
Then, I have a big dataset of the times that the delivery was delayed. So: Delayed=1 or Delayed=0
With this, I would have a set like this:
T1 | T2 | T3 | R1 | R2 | R3 || Delayed
---------------------------------------
 1 |  0 |  0 |  1 |  0 |  0 ||    0
 1 |  0 |  0 |  0 |  1 |  0 ||    1
 0 |  1 |  0 |  1 |  0 |  0 ||    0
 1 |  0 |  0 |  0 |  1 |  0 ||    1
 1 |  0 |  0 |  1 |  0 |  0 ||    0
I don't only want to say "in most cases, truck 1 arrives late, it could have a problem, I shouldn't send it out any more" (that is a valid result too), but I also want to detect things like: "in most cases, truck 1 arrives late when it takes route 1, so this type of truck probably has a problem on this specific route".
This dataset is an example. The real one is huge, with thousands of independent variables, so it could well contain more than one problem at the same time.
example: truck 1 has problems in route 1, and truck 3 has problems in route 1.
example2: truck 1 has problems in route 1, and truck 3 has problems in any route.
So, I would make a blacklist like:
example: Block if (truck=1 AND route=1) OR (truck=3 AND route=1)
example2: Block if (truck=1 AND route=1) OR truck=3
I'm currently doing this without machine learning, with ugly code that builds a massive Cartesian product of the independent columns and counts the number of "delayed" results. Then I pick the worst delayed/total proportion, blacklist it, and iterate again with the new values.
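A minimal sketch of that counting step, for illustration only (this is not the code referred to above). It assumes rows of the hypothetical form (truck, route, delayed) matching the toy table, and only looks at single values and pairs; the function name and the max_size parameter are made up.

```python
from collections import Counter
from itertools import combinations

# Hypothetical rows: (truck, route, delayed), mirroring the toy table
rows = [
    ("T1", "R1", 0),
    ("T1", "R2", 1),
    ("T2", "R1", 0),
    ("T1", "R2", 1),
    ("T1", "R1", 0),
]


def delay_rates(rows, max_size=2):
    """Delayed/total proportion for every value and every combination of values."""
    totals, delayed = Counter(), Counter()
    for *features, flag in rows:
        for size in range(1, max_size + 1):
            # Each combination of column values in this row becomes one key
            for key in combinations(features, size):
                totals[key] += 1
                delayed[key] += flag
    return {key: delayed[key] / totals[key] for key in totals}


rates = delay_rates(rows)
# List candidates for the blacklist, worst proportion first
for key, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(key, rate)
```

With real data, a minimum row count per combination would probably be needed before trusting a proportion, since a combination seen only once or twice can hit 100% by chance.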
These errors are usually temporary, so I would send a new dataset every few hours. I don't need an analysis over the whole lifetime of the data, as long as the algorithm takes into account that these issues are temporary.
Does anyone have a clue what I can use, or where I can read up on this?
Don't hesitate to ask for more info if you need it.
Thanks in advance!
Regards
You should check out the scikit-learn package for machine learning classifiers (Random Forest is a widely used choice). For this problem, you could feed a portion of the data (the training set, say 80% of the data) to the model and it would learn how to predict the outcome variable (delayed/not delayed).
You can then test the accuracy of your model by 'testing' on the remaining 20% of your data (the test set), to see if your model is any good at predicting the correct outcome. This will give you a % accuracy. Higher is generally better, unless you have severely imbalanced classes, in which case a classifier can reach high accuracy simply by always predicting the more common class.
Finally, if the accuracy is satisfactory, you can find out which predictor variables your model considered most important to achieve that level of prediction, i.e. Variable Importance. I think this is what you're after. So running this every few hours would tell you exactly which features (columns) in your set are best at predicting if a truck is late.
Obviously, this is all easier said than done and often you will have to perform significant cleaning of your data, sometimes normalisation (not in the case of random forests though), sometimes weighting your classifications, sometimes engineering new features... there is a reason this is a dedicated profession.
Essentially what you're asking is "how do I do Data Science?". Hopefully this will get you started, the rest (i.e. learning) is on you.

Why Information gain feature selection gives zero scores

I have a dataset in which I used the Information gain feature selection method in WEKA to get the important features. Below is the output I got.
Ranked attributes:
0.97095 1 Opponent
0.41997 11 Field_Goals_Made
0.38534 24 Opp_Free_Throws_Made
0.00485 4 Home
0 8 Field_Goals_Att
0 12 Opp_Total_Rebounds
0 10 Def_Rebounds
0 9 Total_Rebounds
0 6 Opp_Field_Goals_Made
0 7 Off_Rebounds
0 14 Opp_3Pt_Field_Goals_Made
0 2 Fouls
0 3 Opp_Blocks
0 5 Opp_Fouls
0 13 Opp_3Pt_Field_Goals_Att
0 29 3Pt_Field_Goal_Pct
0 28 3Pt_Field_Goals_Made
0 22 3Pt_Field_Goals_Att
0 25 Free_Throws_Made
This suggests that all features with a score of 0 can be ignored. Is that correct?
Now, when I tried the Wrapper subset evaluation in WEKA, the selected attributes included some that were ignored by the info gain method (i.e. whose score was 0). Below is the output.
Selected attributes: 3,8,9,11,24,25 : 6
Opp_Blocks
Field_Goals_Att
Total_Rebounds
Field_Goals_Made
Opp_Free_Throws_Made
Free_Throws_Made
I want to understand why attributes that are ignored by info gain are considered strongly by the wrapper subset evaluation method.
To understand what's happening, it helps to understand first what the two feature selection methods are doing.
The information gain of an attribute tells you how much information with respect to the classification target the attribute gives you. That is, it measures the difference in information between the cases where you know the value of the attribute and where you don't know the value of the attribute. A common measure for the information is Shannon entropy, although any measure that allows you to quantify the information content of a message will do.
So the information gain depends on two things: how much information was available before knowing the attribute value, and how much was available after. For example, if your data contains only one class, you already know what the class is without having seen any attribute values and the information gain will always be 0. If, on the other hand, you have no information to start with (because the classes you want to predict are represented in equal quantities in your data), and an attribute splits the data perfectly into the classes, its information gain will be 1 (that is, 1 bit in the two-class case).
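The two extreme cases can be checked with a small sketch using Shannon entropy in bits. This is an illustrative implementation with made-up data, not WEKA's code, and it assumes nominal (discrete) attribute values.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the expected entropy after splitting on values."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


balanced = ["win", "win", "loss", "loss"]   # equal class counts: 1 bit of entropy
one_class = ["win", "win", "win", "win"]    # nothing left to learn
perfect = ["x", "x", "y", "y"]              # splits the balanced classes perfectly
useless = ["x", "y", "x", "y"]              # says nothing about the class

print(information_gain(perfect, balanced))   # 1.0
print(information_gain(useless, balanced))   # 0.0
print(information_gain(perfect, one_class))  # 0.0
```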
The important thing to note in this context is that the information gain is a purely information-theoretic measure; it does not consider any actual classification algorithms.
This is what the wrapper method does differently. Instead of analyzing the attributes and targets from an information-theoretic point of view, it uses an actual classification algorithm to build a model with a subset of the attributes and then evaluates the performance of this model. It then tries a different subset of attributes and does the same thing again. The subset for which the trained model exhibits the best empirical performance wins.
There are a number of reasons why the two methods would give you different results (this list is not exhaustive):
- A classification algorithm may not be able to leverage all the information that the attributes can provide.
- A classification algorithm may implement its own attribute selection internally (for example decision tree/forest learners do this) that considers a smaller subset than attribute selection will yield.
- Individual attributes may not be informative, but combinations of them may be (for example, a and b may carry no information separately, while a*b might). Attribute selection will not discover this because it evaluates attributes in isolation, while a classification algorithm may be able to leverage this.
- Attribute selection does not consider the attributes sequentially. Decision trees, for example, use a sequence of attributes, and while b may provide information on its own, it may not provide any information in addition to a, which is used higher up in the tree. Therefore b would appear useful when evaluated according to information gain, but is not used by a tree that "knows" a first.
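The point about combinations of attributes can be illustrated with a small made-up dataset in which the class is exactly the product a*b. This is a sketch under that assumption, not a claim about the basketball data above: each attribute alone gets an information gain of 0, while the product gets 1 bit.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy of the labels minus the expected entropy after splitting on values."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups.values())


# Hypothetical attributes taking values -1 / +1; the class is their product
a = [-1, -1, 1, 1] * 2
b = [-1, 1, -1, 1] * 2
target = [x * y for x, y in zip(a, b)]
product = [x * y for x, y in zip(a, b)]  # the engineered feature a*b

gain_a = information_gain(a, target)
gain_b = information_gain(b, target)
gain_product = information_gain(product, target)
print(gain_a, gain_b, gain_product)
```

A filter that ranks a and b one at a time would discard both here, while a wrapper around a classifier that can model the interaction would keep them.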
In practice it's usually a better idea to use a wrapper for attribute selection, as it takes the performance of the actual classifier you want to use into account, and different classifiers vary widely in how they use information. The advantage of classifier-agnostic measures like information gain is that they are much cheaper to compute.
In a filter technique (info gain here), features are considered in isolation from one another, hence the IG of a feature can be 0 when it is considered individually.
But in certain cases one feature needs another feature to
boost accuracy, and hence it produces predictive value when considered together with other features.
Hope this helps and is in time :)

Interpreting the parameters of the evaluate() function of a item-based recommender in Mahout

I am working with boolean values, trying to evaluate a recommendation engine in Mahout. My questions are about the selection of the "correct" parameters of the evaluate function. Apologies in advance for the lengthy post.
IRStatistics evaluate(RecommenderBuilder recommenderBuilder,
DataModelBuilder dataModelBuilder,
DataModel dataModel,
IDRescorer rescorer,
int at,
double relevanceThreshold,
double evaluationPercentage) throws TasteException;
1) Can you think of an example in which the following two parameters must be used:
- DataModelBuilder dataModelBuilder
- IDRescorer rescorer
2) For the double relevanceThreshold variable, I set the value GenericRecommenderIRStatsEvaluator.CHOOSE_THRESHOLD, however, I was wondering if a "better" model could be built by setting a different value.
3) In my project, I need to recommend at most 10 items per user. Does this mean that it doesn't make sense to set a value bigger than 10 for the variable int at?
4) Given that I don't mind waiting a long time for the model to build, is it good practice to set the variable double evaluationPercentage equal to 1? Can you think of any case where 1 will not give the optimum model?
5) Why do precision and recall (note that I am working on boolean data) increase as the number of recommendations (i.e. the variable int at) increases (I observed that experimentally)?
6) Where does the splitting into training and testing sets take place within Mahout, and how could I change that percentage (unless this does not apply to item-based recommendations)?
Accurate recommendations alone do not guarantee users of recommender systems an effective and satisfying experience, so measurements should be taken only as a reference point. That said, ideally real users would use your system against a baseline you set (like random recommendations), and you would run an A/B test to see which performs better. But that can be troublesome and not quite practical.
Precision and recall at N recommendations are not great metrics for recommenders. You are better off using a metric like AUC (area under the curve).
1) Have a look at the Mahout in Action book example (link).
2) Letting Mahout choose a threshold is fine, but it will be more computationally expensive.
3) Yes, if you are making 10 recommendations, evaluating at 10 makes a lot of sense.
4) It really depends on the size of your data. If using 100% (that is, 1.0) is fast enough, I would use that. But if you do use something different (less), I would strongly suggest you use RandomUtils.useTestSeed(); when testing, so you know the sampling will be done in the same manner every time you evaluate (don't use it in production though).
5) Not sure. It depends on what your data looks like. But normally if precision increases, recall decreases and vice versa. See F1 Score (also available from Mahout IRStatistics).
6) For IRStatistics I'm not entirely sure where it happens (or if it happens at all). Notice it doesn't even take a % for division into training and test, although there might be a default somewhere. If I were you I would go through the Mahout code and find out.
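To see how precision and recall move as the cutoff grows, here is a minimal sketch of precision, recall and F1 at k. It is a simplified illustration with made-up item IDs and a fixed set of relevant items per user, not Mahout's implementation; the function names are hypothetical.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision and recall over the first k recommended items."""
    hits = sum(1 for item in recommended[:k] if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


recommended = ["a", "b", "c", "d", "e", "f"]  # hypothetical ranked list
relevant = {"a", "d", "x"}                     # hypothetical held-out items

for k in (2, 4, 6):
    p, r = precision_recall_at_k(recommended, relevant, k)
    print(k, round(p, 3), round(r, 3), round(f1_score(p, r), 3))
```

Under this simple definition recall can never drop as k grows, because the set of relevant items is fixed, while precision may rise or fall depending on where the hits sit in the list.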

Does prior distribution matter in classification?

I currently have a classification problem with two classes. What I want to do is: given a bunch of candidates, find out who will more likely to be the class 1. The problem is that class 1 is very rare (around 1%), which I guess makes my predictions quite inaccurate.
For the training dataset, can I sample half class 1 and half class 0? This will change the prior distribution, but I don't know whether the prior distribution affects the classification results.
Indeed, a very imbalanced dataset can cause problems in classification, because by defaulting to the majority class 0 you can already get a very low error rate.
There are some workarounds that may or may not work for your particular problem, such as giving equal weight to the two classes (thus weighting instances from the rare class more strongly), oversampling the rare class (i.e. learning each instance multiple times), or producing slight variations of the rare objects to restore balance (as SMOTE does).
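As one illustration of these workarounds, here is a minimal sketch of random oversampling with the standard library. The function name, the row layout (features, label) and the 1% toy data are assumptions for the example; SMOTE, which synthesizes new points instead of repeating existing ones, is not shown.

```python
import random


def oversample_minority(rows, label_of, seed=0):
    """Repeat randomly chosen rows of the smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw the missing rows with replacement from the same class
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced


# Hypothetical data: 200 rows of (feature, label), 1% of them in class 1
rows = [(i, 1 if i < 2 else 0) for i in range(200)]
balanced = oversample_minority(rows, label_of=lambda row: row[1])

counts = {label: sum(1 for _, y in balanced if y == label) for label in (0, 1)}
print(counts)
```

Resampling of this kind is normally applied to the training split only, so that the test set keeps the real class proportions.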
You really should grab a classification or machine learning book, and check the index for "imbalanced classification" or "unbalanced classification". If the book is any good, it will discuss this problem. (I just assume you did not know the term that they use.)
If you're forced to pick exactly one from a group, then the prior distribution over classes won't matter because it will be constant for all members of that group. If you must look at each in turn and make an independent decision as to whether they're class 1 or class 0, the prior will potentially change the decision, depending on which method you choose to do the classification. I would suggest you get hold of as many examples of the rare class as possible, but beware that blindly feeding a 50-50 split to a classifier as training data may make it implicitly fit a model that assumes this is the distribution at test time.
Sampling your two classes evenly doesn't change assumed priors unless your classification algorithm computes (and uses) priors based on the training data. You stated that your problem is "given a bunch of candidates, find out who will more likely to be the class 1". I read this to mean that you want to determine which observation is most likely to belong to class 1. To do this, you want to pick the observation $x_i$ that maximizes $p(c_1|x_i)$. Using Bayes' theorem, this becomes:
$$
p(c_1|x_i)=\frac{p(x_i|c_1)p(c_1)}{p(x_i)}
$$
You can ignore $p(c_1)$ in the equation above since it is a constant. However, computing the denominator will still involve using prior probabilities. Since your problem is really more of a target detection problem than a classification problem, an alternate approach for detecting low probability targets is to take the likelihood ratio of the two classes:
$$
\Lambda=\frac{p(x_i|c_1)}{p(x_i|c_0)}
$$
To pick which of your candidates is most likely to belong to class 1, pick the one with the highest value of $\Lambda$. If your two classes are described by multivariate Gaussian distributions, you can replace $\Lambda$ with its natural logarithm, resulting in a simpler quadratic detector. If you further assume that the target and background have the same covariance matrices, this results in a linear discriminant (http://en.wikipedia.org/wiki/Linear_discriminant_analysis).
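A minimal one-dimensional sketch of this likelihood-ratio ranking, assuming each class is modelled by a Gaussian. The means, standard deviations and candidate values below are made up for illustration; in practice they would be estimated from data.

```python
from math import exp, log, pi, sqrt


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian with the given mean and standard deviation."""
    return exp(-0.5 * ((x - mean) / std) ** 2) / (std * sqrt(2 * pi))


def log_likelihood_ratio(x, target, background):
    """log Lambda = log p(x|c_1) - log p(x|c_0), with (mean, std) per class."""
    return log(gaussian_pdf(x, *target)) - log(gaussian_pdf(x, *background))


target = (3.0, 1.0)      # hypothetical (mean, std) for class 1
background = (0.0, 1.0)  # hypothetical (mean, std) for class 0
candidates = [-1.0, 0.5, 2.0, 4.0]

# The candidate with the largest ratio is the most likely class 1 member
best = max(candidates, key=lambda x: log_likelihood_ratio(x, target, background))
print(best)
```

Because both classes share the same standard deviation here, the log ratio simplifies to 3x - 4.5, which is linear in x, matching the equal-covariance case mentioned above. The class priors do not appear anywhere in the ranking.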
You may want to consider Bayesian utility theory to re-weight the costs of different kinds of error to get away from the problem of the priors dominating the decision.
Let A be the 99% prior probability class, B be the 1% class.
If we just say that all errors incur the same cost (negative utility), then
it's possible that the optimal decision approach is to always declare "A". Many
classification algorithms (implicitly) assume this.
If instead, we declare that the cost of declaring "A" when, in fact, the instance
was "B" is much bigger than the cost of the opposite error, then the decision logic
becomes, in a sense, more sensitive to slighter differences in the features.
This kind of situation frequently comes up in fault detection: faults in the monitored
system will be rare, but you want to be sure that if you see any data that points to
an error condition, action is taken (even if it is just reviewing the data).
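A minimal sketch of such a cost-weighted decision, assuming that missing a rare "B" (for example a fault) is the more expensive error and that a posterior probability P(B | x) is already available from some classifier. The function name and the cost values are hypothetical.

```python
def decide(p_b, cost_miss, cost_false_alarm):
    """Pick the label with the lower expected cost, given p_b = P(B | x)."""
    expected_cost_declare_a = p_b * cost_miss               # wrong only if truly B
    expected_cost_declare_b = (1 - p_b) * cost_false_alarm  # wrong only if truly A
    return "B" if expected_cost_declare_b < expected_cost_declare_a else "A"


# With equal costs, a 5% chance of B is not enough to declare B
equal_costs = decide(0.05, cost_miss=1, cost_false_alarm=1)

# If a missed B costs 50 times more than a false alarm, the same 5% triggers B
weighted_costs = decide(0.05, cost_miss=50, cost_false_alarm=1)

print(equal_costs, weighted_costs)
```

Equivalently, "B" is declared whenever P(B | x) exceeds cost_false_alarm / (cost_false_alarm + cost_miss), so raising the cost of a miss lowers the threshold.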
