I'm a newbie to machine learning and would like to understand which kind of algorithm (a classification algorithm or a correlation algorithm?) to use to find the relationship between one or more attributes.
For example, suppose I have the following set of attributes:
Bill No, Bill Amount, Tip amount, Waiter Name
and would like to figure out which attribute(s) contribute to the Tip amount.
The following is a sample of the data:
Bill No, Bill Amount, Tip amount, Waiter detail
1, 100, 10, Sathish
2, 200, 20, Sathish
3, 150, 10, Rahul
4, 200, 10, Simon
5, 100, 10, Sathish
In this case we know the Tip amount would be 99% influenced by the Bill Amount. But I want to know which Spark MLlib algorithm I should use to figure that out. Then I could apply similar techniques to a longer set of attributes.
One thing you can do is calculate the correlation between columns (attributes). Take a look at the tutorial about summary statistics on the MLlib website.
A more advanced approach would be to use dimensionality reduction, which should discover more complex dependencies.
You can calculate the correlation between different columns. Please refer to Correlations (https://spark.apache.org/docs/latest/mllib-statistics.html#correlations). For example, if you calculate the correlation between Bill Amount and Tip amount, you will most probably get a correlation value close to 1.
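As a rough illustration of the correlation approach, here is a plain-Python sketch of the Pearson coefficient on the sample rows from the question. The helper name `pearson` is my own and not a Spark API; in Spark the corresponding call would be MLlib's `Statistics.corr`, which is left out here so the snippet runs without a cluster.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric columns."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and the two sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# The five sample rows from the question, one list per column.
bill_no = [1, 2, 3, 4, 5]
bill_amount = [100, 200, 150, 200, 100]
tip_amount = [10, 20, 10, 10, 10]

r_bill = pearson(bill_amount, tip_amount)  # Bill Amount vs Tip amount
r_no = pearson(bill_no, tip_amount)        # Bill No vs Tip amount

print(f"Bill Amount vs Tip amount: {r_bill:.3f}")
print(f"Bill No vs Tip amount:     {r_no:.3f}")
```

On the five sample rows alone the Bill Amount coefficient is about 0.56, because three different bill amounts share the same tip, while Bill No comes out weakly negative. A categorical column such as the waiter name would need to be encoded as numbers before it can be used this way.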
Running catboost on a large-ish dataset (~1M rows, 500 columns), I get:
Training has stopped (degenerate solution on iteration 0, probably too small l2-regularization, try to increase it).
How do I guess what the l2 regularization value should be? Is it related to the mean values of y, number of variables, tree depth?
Thanks!
I don't think you will find an exact answer to your question, because every dataset is different.
However, in my experience, values in the range 2 to 30 are a good starting point.
I know that using neural networks for anything text-related is difficult as they have problems with non-numerical input data.
But I'm not sure about mathematical sets. And sets of sets.
Like [0, 1, 2] and [3, 4, 5] or [[0, 1], [2, 3]] and [[4, 5], [6, 7]]
It should be possible to compute distances between these by computing the distances between all corresponding elements, right? I can't really find any information on that and don't want to start using neural networks without being sure.
(Googling anything with 'set' just isn't promising because all you get as result is the term 'data set'..)
EDIT:
First: The assignment specifically asks for a neural network, so I can't use k-means or any other clustering methods.
So the original question wasn't really addressing the actual problem. I don't have to think of a distance metric but of a way to add the sets to the activation function and for that of how to map them to a single value. But, regarding the distance metric, I'm actually not really sure at what point of the neural network I need it.. I guess that's a basic comprehension problem.
I will just write down some thoughts now.
The thing that confuses me is standardization of categories. Having three categories 'red', 'green' and 'blue' you can map them to numbers 1 to 3, but that would mean that 'red' would have a larger distance to 'blue' than 'green' does and that's not the case. So the categories are encoded as (1, 0, 0) and (0, 1, 0) and (0, 0, 1) which gives them all the same distance.
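The equal-distance property described above can be checked directly. This is a minimal sketch, assuming Euclidean distance; the helper names `euclidean` and `pairwise` are made up for the illustration.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pairwise(encoding):
    """Distance for every unordered pair of categories in an encoding."""
    names = sorted(encoding)
    return {
        (p, q): euclidean(encoding[p], encoding[q])
        for i, p in enumerate(names)
        for q in names[i + 1:]
    }


# Categories mapped to the numbers 1 to 3.
integer_code = {"red": [1], "green": [2], "blue": [3]}
# Categories mapped to (1, 0, 0), (0, 1, 0), (0, 0, 1).
one_hot = {"red": [1, 0, 0], "green": [0, 1, 0], "blue": [0, 0, 1]}

integer_distances = pairwise(integer_code)
one_hot_distances = pairwise(one_hot)

print(integer_distances)  # red/blue is twice as far apart as the other pairs
print(one_hot_distances)  # every pair is sqrt(2) apart
```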
So it must be possible to add these to the activation function somehow. I could imagine that they are interpreted as binary numbers, so that (1,0,0)=100=4, (0,1,0)=010=2 and (0,0,1)=001=1. That would be a distinct mapping. But the numbers 1 to 3 are distinct too, so, as mentioned above, the distance metric must be necessary at some point.
So the problem still is how to map a set to a single value. I can do that right before I add it to the function, so I don't have to choose a mapping that also maintains a logical distance between the sets because when getting to the point of applying the distance metric I can still apply it to the original sets and don't have to use the mapped value. Is that correct? Or am I still missing something?
Neural nets, in general, have no such problem. Image recognition and language translation are well within their domains. What you do need is the metrics and manipulations to relate your inputs to the ground truth in a well-ordered fashion -- which your distance metric will do quite nicely.
Go right ahead and build your neural network. Supply it with the appropriate distance function, and let it train away. Do make sure to put in some tracking instrumentation (e.g. print statements) to trace the operation for a few iterations before you turn it entirely loose.
I am looking for an appropriate function to compute an accurate similarity between two persons based on their favourites.
For instance, persons are connected to tags, and each person's interest in a tag is stored on the edge to the tag node as a numeric value. I want to recommend similar persons to each person.
I have found two solutions:
Cosine Similarity
There is a cosine function in Neo4j, but it accepts just one input, whereas for this formula I need to pass vectors, such as:
for "a": a=[10, 20, 45], where each number indicates the person's interest in a tag.
for "b": b=[20, 50, 70]
Pearson Correlation
While searching the net and your documentation I found:
http://neo4j.com/docs/stable/cypher-cookbook-similarity-calc.html#cookbook-calculate-similarities-by-complex-calculations
My question is: what is the logic behind this formula?
What is the difference between r and H?
At first glance I think H1 and H2 always equal one, unless I should consider the rest of the graph.
Thank you in advance for any help.
I think the purpose of H1 and H2 is to normalize the results of the times property (the number of times the user ate the food) across food types. You can experiment with this example in this Neo4j console.
Since you mention other similarity measures you might be interested in this GraphGist, Similarity Measures For Collaborative Filtering With Cypher. It has some simple examples of calculating Pearson correlation and Jaccard similarity using Cypher.
This example makes it a little hard to understand what is going on, because in it H1 and H2 are both 1. A better example would show each person eating different types of food, so you'd be able to see the value of H changing. If "me" also ate "vegetables", "pizza", and "hotdogs", their H would be 4.
Can't help you with Neo4j, just want to point out that cosine similarity and Pearson's correlation coefficient are essentially the same thing. If you decode the different notations, you'll find that the only difference is that Pearson's zero-centers the vectors first. So you can define Pearson's as follows:
Pearsons(a, b) = Cosine(a - mean(a), b - mean(b))
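To make that identity concrete, here is a small sketch in plain Python using the two vectors from the question. The function names `cosine`, `center` and `pearson` are my own; nothing here is a Neo4j or Cypher API.

```python
import math


def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def center(v):
    """Subtract the mean of the vector from every component."""
    mean = sum(v) / len(v)
    return [x - mean for x in v]


def pearson(u, v):
    """Pearson correlation, written as the cosine of the centered vectors."""
    return cosine(center(u), center(v))


a = [10, 20, 45]  # interest of person "a" in each tag
b = [20, 50, 70]  # interest of person "b" in each tag

cos_ab = cosine(a, b)
pearson_ab = pearson(a, b)

print(f"cosine:  {cos_ab:.4f}")
print(f"pearson: {pearson_ab:.4f}")
```

For these two vectors the cosine is about 0.98 and Pearson's about 0.94. The cosine is high partly because all values are positive, and centering removes that shared offset.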
I am using Non-negative Matrix Factorization and Non-negative Least Squares for predictions, and I want to evaluate how good the predictions are depending on the amount of data given. For example, the original data was:
original = [1, 1, 0, 1, 1, 0]
And now I want to see how well I can reconstruct the original data when the given data is incomplete:
incomplete1 = [1, 1, 0, 1, 0, 0],
incomplete2 = [1, 1, 0, 0, 0, 0],
incomplete3 = [1, 0, 0, 0, 0, 0]
And I want to do this for every example in a big dataset. The problem is that the original data varies in the number of positives: in the original above there are 4, but for other examples in the dataset it could be more or fewer. Let's say I make an evaluation round with 4 positives given, but half of my dataset has only 4 positives, while the other half has 5, 6 or 7. Should I exclude the half with 4 positives, because they have no data missing, which makes the "prediction" much better? On the other hand, I would change the training set if I excluded data. What can I do? Or shouldn't I evaluate with 4 at all in this case?
EDIT:
Basically I want to see how well I can reconstruct the input matrix. For simplicity, say the "original" stands for a user who watched 4 movies. I then want to know how well I can predict each user based on just 1 movie that the user actually watched. I get a prediction for lots of movies. Then I plot a ROC and a Precision-Recall curve (using the top-k of the prediction). I repeat all of this with n movies that the users actually watched, so I get a ROC curve in my plot for every n. When I come to the point where I use e.g. 4 movies that the user actually watched to predict all the movies he watched, but he only watched those 4, the results get too good.
The reason I am doing this is to see how many "watched movies" my system needs to make reasonable predictions. If it returned good results only once 3 movies had already been watched, it would not be very useful in my application.
I think it's first important to be clear what you are trying to measure, and what your input is.
Are you really measuring ability to reconstruct the input matrix? In collaborative filtering, the input matrix itself is, by nature, very incomplete. The whole job of the recommender is to fill in some blanks. If it perfectly reconstructed the input, it would give no answers. Usually, your evaluation metric is something quite different from this when using NNMF for collaborative filtering.
FWIW I am commercializing exactly this -- CF based on matrix factorization -- as Myrrix. It is based on my work in Mahout. You can read the docs about some rudimentary support for tests like Area under curve (AUC) in the product already.
Is "original" here an example of one row, perhaps for one user, in your input matrix? When you talk about half, and excluding, what training/test split are you referring to? Splitting each user, or taking a subset across users? Because you seem to be talking about measuring reconstruction error, but that doesn't require excluding anything. You just multiply your matrix factors back together and see how close they are to the input. "Close" means low L2 / Frobenius norm.
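For what it's worth, the reconstruction error described here can be sketched in a few lines of plain Python. The matrices `V`, `W`, `H` and `H_off` below are tiny made-up examples and the helper names are my own. A real NMF library would give you the factors; this only shows how the error is measured.

```python
import math


def matmul(W, H):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(w * h for w, h in zip(row, col)) for col in zip(*H)]
        for row in W
    ]


def frobenius_error(V, W, H):
    """Frobenius norm of V - W*H: square root of the summed squared differences."""
    approx = matmul(W, H)
    return math.sqrt(
        sum(
            (v - a) ** 2
            for v_row, a_row in zip(V, approx)
            for v, a in zip(v_row, a_row)
        )
    )


# Hypothetical 3x3 input matrix (users x items) and rank-2 factors.
V = [[1, 1, 0], [1, 1, 0], [0, 0, 1]]
W = [[1, 0], [1, 0], [0, 1]]
H = [[1, 1, 0], [0, 0, 1]]          # W*H reproduces V exactly
H_off = [[1, 0.5, 0], [0, 0, 1]]    # slightly wrong factor for comparison

exact_error = frobenius_error(V, W, H)
off_error = frobenius_error(V, W, H_off)

print(exact_error, off_error)
```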
But for conventional recommender tests (like AUC or precision/recall), which are something else entirely, you would either split your data into test/training by time (recent data is the test data) or by value (most-preferred or associated items are the test data). If I understand the 0s to be missing elements of the input matrix, then they are not really "data". You wouldn't ever have a situation where the test data were all the 0s, because they're not input to begin with. The question is which 1s are for training and which 1s are for testing.
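One way to read "which 1s are for training and which 1s are for testing" is a per-user hold-out, sketched below in plain Python. The function `split_positives` is hypothetical, and returning `None` for users with no positives left to hold out is just one possible convention I am assuming here, not something prescribed above.

```python
import random


def split_positives(row, n_given, rng):
    """Keep n_given of the 1s for training and hold the remaining 1s out.

    Returns (train_row, held_out_indices), or None when the row has no
    positives left over to test on.
    """
    positives = [i for i, value in enumerate(row) if value == 1]
    if len(positives) <= n_given:
        return None  # nothing would be left for the test side
    given = set(rng.sample(positives, n_given))
    train_row = [1 if i in given else 0 for i in range(len(row))]
    held_out = [i for i in positives if i not in given]
    return train_row, held_out


rng = random.Random(0)  # fixed seed so the split is repeatable
original = [1, 1, 0, 1, 1, 0]

train_row, held_out = split_positives(original, 2, rng)
print(train_row, held_out)

# With 4 positives given there is nothing left to predict for this user.
print(split_positives(original, 4, rng))
```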
I used a Dimensionality Reduction method (discussion here: Random projection algorithm pseudo code) on a large dataset.
After reducing the dimension from 1000 to 50, I get my new dataset where each sample looks like:
[ 1751. -360. -2069. ..., 2694. -3295. -1764.]
Now I am a bit confused, because I don't know what negative feature values are supposed to mean. Is it okay to have negative features like this? Because before the reduction, each sample looked like this:
3, 18, 18, 18, 126 ...
Is it normal or am I doing something wrong?
I guess you implemented the algorithm from this paper.
As the projection matrix has some negative entries, it is fine for the projection to map positive values to negative ones. So the change in sign does not indicate an error.
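A tiny sketch of why this happens, in plain Python. The matrix below is a hand-picked example with +1/-1 entries, standing in for a random projection matrix; it is not the matrix from the paper. Any scaling factor is left out, and the sample is just the first five values shown in the question.

```python
def project(sample, matrix):
    """Project a sample: one dot product per row of the projection matrix."""
    return [sum(r * x for r, x in zip(row, sample)) for row in matrix]


# First five (all positive) feature values shown in the question.
sample = [3, 18, 18, 18, 126]

# Hypothetical 3x5 projection matrix with mixed signs.
projection = [
    [1, -1, 1, -1, -1],
    [1, 1, -1, 1, 1],
    [-1, 1, 1, -1, 1],
]

reduced = project(sample, projection)
print(reduced)  # [-141, 147, 141]
```

Each output feature is a signed sum of the inputs, so negative values appear as soon as the negatively weighted inputs outweigh the positively weighted ones.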