Mahout: What is the value returned by AverageAbsoluteDifferenceEvaluator on TanimotoCoefficientSimilarity?

I'm trying to improve the Mahout recommendation implementation in a project, and I found out that my predecessor used TanimotoCoefficientSimilarity for a dataset with preference values of 1-5. I changed it to UncenteredCosineSimilarity, and now I'm trying to measure the improvement in performance.
I tried using AverageAbsoluteDifferenceEvaluator on both, but realised that it should not be used with Tanimoto, since Tanimoto does not return the expected value of the preference.
However, the value seems odd, and I don't quite understand what the returned value represents. The average preference value of the dataset is 3.2, so if Tanimoto were to return values in the range [0, 1], the output of AverageAbsoluteDifferenceEvaluator should be in the range [2.2, 3.2]; instead it consistently returns a value in the range [0.8, 1.1].
Does anyone have an explanation for this?
Thank you.

TanimotoCoefficientSimilarity works without preference values (it only considers which items a user has expressed a preference for at all), so AverageAbsoluteDifferenceEvaluator does not make sense for TanimotoCoefficientSimilarity.
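For context, the Tanimoto (Jaccard) coefficient between two users compares only the sets of items A and B that each user has expressed any preference for; the actual 1-5 ratings never enter the similarity itself:

T(A, B) = |A ∩ B| / |A ∪ B|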

Related

How to scale % change based features so that they are viewed "similarly" by the model

I have some features that are zero-centered and are supposed to represent the change between a current value and a previous value. Generally speaking, I believe there should be some symmetry between these values, i.e. there should be roughly as many positive values as negative values, and they should operate on roughly the same scale.
When I try to scale my samples using MaxAbsScaler, I notice that the negative values for this feature get almost completely drowned out by the positive values, and I don't really have any reason to believe my positive values should be that much larger than my negative values.
What I've noticed is that, fundamentally, the magnitudes of percentage-change values are not symmetrical in scale. For example, a value that goes from 50 to 200 is a +300.0% change, while a value that goes from 200 to 50 is a -75.0% change. I get that there is a reason for this, but in terms of my feature I don't see why a change from 50 to 200 should be 3x+ more "important" than the same change in value in the opposite direction.
Given this, I do not believe there is any reason to want my model to treat a change of 200 to 50 as a "lesser" change than a change of 50 to 200. Since I am trying to represent the change of a value over time, I want to encode this pattern so that my model can "visualize" the change of a value over time the same way a person would.
Right now I am solving this with the following formula:
if curr > prev:
    return curr / prev - 1
else:
    return (prev / curr - 1) * -1
And this does seem to treat changes in value similarly regardless of direction; from the example above, 50 -> 200 gives +300% and 200 -> 50 gives -300%. Is there a reason why I shouldn't be doing this? Does this accomplish my goal? Has anyone run into similar dilemmas?
This is a discussion question and it's difficult to know the right answer to it without knowing the physical relevance of your feature. You are calculating a percentage change, and a percent change is dependent on the original value. I am not a big fan of a custom formula only to make percent change symmetric since it adds a layer of complexity when it is unnecessary in my opinion.
If you want change to be symmetric, you can try direct difference or factor change. There's nothing to suggest that difference or factor change are less correct than percent change. So, depending on the physical relevance of your feature, each of the following symmetric measures would be a correct way to measure change:
Difference change -> 50 to 200 yields 150, 200 to 50 yields -150
Factor change with logarithm -> 50 to 200 yields log(4), 200 to 50 yields log(1/4) = -log(4)
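To make the comparison concrete, here is a small Python sketch (the function names are just illustrative) of the two symmetric measures above next to the formula from the question:

import math

def difference_change(prev, curr):
    # 50 -> 200 yields +150, 200 -> 50 yields -150
    return curr - prev

def log_factor_change(prev, curr):
    # 50 -> 200 yields log(4), 200 -> 50 yields -log(4)
    return math.log(curr / prev)

def symmetric_ratio_change(prev, curr):
    # the question's formula: 50 -> 200 yields +3.0 (+300%), 200 -> 50 yields -3.0 (-300%)
    if curr > prev:
        return curr / prev - 1
    return (prev / curr - 1) * -1

for prev, curr in [(50, 200), (200, 50)]:
    print(difference_change(prev, curr),
          round(log_factor_change(prev, curr), 3),
          symmetric_ratio_change(prev, curr))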
You're having trouble because you haven't brought the abstract questions into your paradigm.
"... my model can "visualize" ... same way a person would."
In this paradigm, you need a metric for "same way". There is no such empirical standard. You've dropped both of the simple standards -- relative error and absolute error -- and you posit some inherently "normal" standard that doesn't exist.
Yes, we run into these dilemmas: choosing a success metric. You've chosen a classic example from "How To Lie With Statistics"; depending on the choice of starting and finishing proportions and the error metric, you can "prove" all sorts of things.
This brings us to your central question:
Does this accomplish my goal?
We don't know. First of all, you haven't given us your actual goal. Rather, you've given us an indefinite description and a single example of two data points. Second, you're asking the wrong entity. Make your changes, run the model on your data set, and examine the properties of the resulting predictions. Do those properties satisfy your desired end result?
For instance, given your posted data points, (200, 50) and (50, 200), how would other examples fit in, such as (1, 4), (1000, 10), etc.? If you're simply training on the proportion of change over the full range of values involved in that transaction, your proposal is just what you need: use the higher value as the basis. Since you didn't post any representative data, we have no idea what sort of distribution you have.
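For instance, plugging those extra examples into the question's formula: (1, 4) maps to +3.0, exactly the same as (50, 200), because only the ratio matters, while (1000, 10) maps to -99.0, which would dwarf every other value; whether that is the behaviour you want depends entirely on the distribution of your data.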

K-Mode clustering

I have a dataset of 6 million rows with mixed datatypes. k-prototypes is not scalable, so I converted all columns to categorical and ran k-modes with 4 clusters on a random sample of 4 million rows. However, k-modes has an initialization problem: it gives different clusters every time you run the model. Let's say I run it once and take the output for my analysis. Is that approach completely wrong for a one-time analysis? If so, is there a way to fix the initialization problem, maybe by setting a parameter or something? Any suggestion is deeply appreciated.
I am sure you did this, but definitely set the seed. Once you set the modes parameter, the algorithm selects a random set of rows from your data as the initial modes and proceeds from there, so setting the seed is important for reproducible results. I am assuming your code is something like this; set the seed just before the call:
set.seed(1)  # any fixed seed, so the random initialization is reproducible
kmodes(data, modes = 4, iter.max = 10, weighted = FALSE, fast = TRUE)
I hope that by "different clusters" you don't mean the number of clusters is also changing.

XGBoost prediction always returning the same value - why?

I'm using SageMaker's built in XGBoost algorithm with the following training and validation sets:
https://files.fm/u/pm7n8zcm
Running the prediction model that comes out of training with the above datasets always produces the exact same result.
Is there something obvious in the training or validation datasets that could explain this behavior?
Here is an example code snippet where I'm setting the Hyperparameters:
{
{"max_depth", "1000"},
{"eta", "0.001"},
{"min_child_weight", "10"},
{"subsample", "0.7"},
{"silent", "0"},
{"objective", "reg:linear"},
{"num_round", "50"}
}
And here is the source code: https://github.com/paulfryer/continuous-training/blob/master/ContinuousTraining/StateMachine/Retrain.cs#L326
It's not clear to me which hyperparameters might need to be adjusted.
This screenshot shows that I'm getting a result with 8 indexes:
But when I add the 11th one, it fails. This leads me to believe that I have to train the model with the zero-valued indexes included instead of removing them, so I'll try that next.
Update: retraining with zero values included doesn't seem to help. I'm still getting the same value every time. I noticed I can't send more than 10 values to the prediction endpoint or it will return an error: "Unable to evaluate payload provided". So at this point using the libsvm format has only added more problems.
You've got a few things wrong there.
using {"num_round", "50"} with such a small ETA {"eta", "0.001"} will give you nothing.
{"max_depth", "1000"} 1000 is insane! (default value is 6)
Suggesting:
{"max_depth", "6"},
{"eta", "0.05"},
{"min_child_weight", "3"},
{"subsample", "0.8"},
{"silent", "0"},
{"objective", "reg:linear"},
{"num_round", "200"}
Try this and report your output.
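Not the SageMaker setup from the question, but as a rough sketch of what those settings mean, here is the equivalent call with the plain Python xgboost API, using toy random data as a stand-in for the real training and validation files:

import numpy as np
import xgboost as xgb

# Toy data standing in for the real training/validation sets linked above.
rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 10)), rng.normal(size=1000)
X_val, y_val = rng.normal(size=(200, 10)), rng.normal(size=200)
dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

params = {
    "max_depth": 6,            # library default; 1000 is far too deep
    "eta": 0.05,               # steps large enough to move off the base prediction
    "min_child_weight": 3,
    "subsample": 0.8,
    "objective": "reg:linear", # newer xgboost versions call this reg:squarederror
}

# "num_round" in the SageMaker hyperparameters corresponds to num_boost_round here.
model = xgb.train(params, dtrain, num_boost_round=200, evals=[(dval, "validation")])
print(model.predict(dval)[:10])  # should now vary from row to row rather than being constant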
In my case, grouping the time series at certain frequencies created gaps in the data.
I solved the issue by filling all the NaNs.

Non-Evaluation of Numerical Expression in Maxima

I start with a simple Maxima question, the answer to which may provide the answer to the actual problem I'm grappling with.
Related Simple Question:
How can I get maxima to calculate:
bfloat((1+%i)^0.3);
Might there be an option variable that can be set so that this evaluates to a complex number?
Actual Question:
I'm evaluating approximations for numerical time integration for finite element methods. For this purpose I'm using spectral analysis, which requires the calculation of the eigenvalues of a 4 x 4 matrix. This matrix "cav" is also calculated within Maxima, using some of Maxima's algebra capabilities but substituting numerical values, so the matrix is entirely numerical, i.e. it contains no variables. I've calculated the eigenvalues with Mathematica and it returns 4 real eigenvalues. However, Maxima calculates horrendously complicated expressions for this case, which it apparently does not "know" how to simplify, even numerically as "bigfloat". Perhaps this problem arises because Maxima first approximates the matrix "cav" by rational numbers (i.e. fractions) and then tries to solve the problem fully exactly, instead of simply using numerical "bigfloat" computations throughout. Is there a way I can change this?
Note that if you only change the input value of gzv to, say, 0.5, it works fine and returns numerical values for the complex eigenvalues.
I include the code below. Note that all of the code up until "cav:subst(vs,ca)$" is just the definition of the matrix cav and seems to work fine. It is in the few statements thereafter that Maxima fails to calculate numerical values for the eigenvalues.
v1:v0+ (1-gg)*a0+gg*a1$
d1:d0+v0+(1/2-gb)*a0+gb*a1$
obf:a1+(1+ga)*(w^2*d1 + 2*gz*w*(d1-d0)) -
ga *(w^2*d0 + 2*gz*w*(d0-g0))$
obf:expand(obf)$
cd:subst([a1=1,d0=0,v0=0,a0=0,g0=0],obf)$
fd:subst([a1=0,d0=1,v0=0,a0=0,g0=0],obf)$
fv:subst([a1=0,d0=0,v0=1,a0=0,g0=0],obf)$
fa:subst([a1=0,d0=0,v0=0,a0=1,g0=0],obf)$
fg:subst([a1=0,d0=0,v0=0,a0=0,g0=1],obf)$
f:[fd,fv,fa,fg]$
cad1:expand(cd*[1,1,1/2-gb,0] - gb*f)$
cad2:expand(cd*[0,1,1-gg,0] - gg*f)$
cad3:expand(-f)$
cad4:[cd,0,0,0]$
cad:matrix(cad1,cad2,cad3,cad4)$
gav:-0.05$
ggv:1/2-gav$
gbv:(ggv+1/2)^2/4$
gzv:1.1$
dt:0.01$
wv:bfloat(dt*2*%pi)$
vs:[ga=gav,gg=ggv,gb=gbv,gz=gzv,w=wv]$
cav:subst(vs,ca)$
cav:bfloat(cav)$
evam:eigenvalues(cav)$
evam:bfloat(evam)$
eva:evam[1]$
The main problem here is that Maxima tries pretty hard to make computations exact, and it's hard to tell it to ease up and allow inexact results.
Is there a mistake in the code you posted above? You have cav:subst(vs,ca) but ca is not defined. Is that supposed to be cav:subst(vs,cad) ?
For the short problem, usually rectform can simplify complex expressions to something more usable:
(%i58) rectform (bfloat((1+%i)^0.3));
`rat' replaced 1.0B0 by 1/1 = 1.0B0
(%o58) 2.59023849130283b-1 %i + 1.078911979230303b0
About the long problem, if fixed-precision (i.e. ordinary floats, not bigfloats) is acceptable to you, then you can use the LAPACK function dgeev to compute eigenvalues and/or eigenvectors.
(%i51) load (lapack);
<bunch of messages here>
(%o51) /usr/share/maxima/5.39.0/share/lapack/lapack.mac
(%i52) dgeev (cav);
(%o52) [[- 0.02759949957202372, 0.06804641655485913, 0.997993508502892, 0.928429191717788], false, false]
If you really need variable precision, I don't know what to try. In principle it's possible to rework the LAPACK code to work with variable-precision floats, but that's a substantial task and I'm not sure about the details.

What does 0 as output mean in AverageAbsoluteDifferenceRecommenderEvaluator in Mahout?

I am currently playing with Apache Mahout and reading the book Mahout in Action, and I am confused about the evaluator used in the evaluation of a recommender system. Specifically, I wanted to ask about AverageAbsoluteDifferenceRecommenderEvaluator, i.e. what it means when it returns 0.
Does it mean there is no error, or does it mean the recommendation system is very bad?
Thanks.
0 means a perfect match.
The RecommenderEvaluator returns the mean absolute error (the average absolute difference), representing how well the Recommender's estimated preferences match the real values; lower scores mean a better match, and 0 means no error at all.
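Concretely, if the evaluator holds out N preferences r_i and the recommender estimates them as e_i, the score is

(1/N) * sum over i of |e_i - r_i|

so it is 0 exactly when every estimated preference equals the real value, and it grows as the estimates drift away from the actual ratings.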