How to predict users' preferences using item similarity? - machine-learning

I am wondering whether I can predict if a user will like an item or not, given the similarities between items and the user's ratings on other items.
I know the equation used in item-based collaborative filtering, where the predicted rating is determined by the item's overall rating and the similarities between items.
The equation is:
$$ r_{u,i} = \bar{r}_i + \frac{\sum_{j} S_{i,j}\,(r_{u,j} - \bar{r}_j)}{\sum_{j} S_{i,j}} $$
My questions are:
If I obtained the similarities using another approach (e.g. a content-based approach), can I still use this equation?
Besides, for each user I only have a list of the user's favourite items, not actual rating values. In that case the rating of user u for item j and the average rating of item j are both missing. Are there better ways or equations to handle this?
Another problem: I wrote some Python code to test the above equation:
import numpy
from scipy import spatial

# Toy user-item rating matrix: rows are users, columns are items.
mat = numpy.array([[0, 5, 5, 5, 0],
                   [5, 0, 5, 0, 5],
                   [5, 0, 5, 5, 0],
                   [5, 5, 0, 5, 0]])
print(mat)

def prediction(u, i):
    # Baseline: the mean rating of item i.
    r = numpy.mean(mat[:, i])
    a = 0.0
    b = 0.0
    for j in range(5):
        if j != i:
            # Cosine similarity between the rating columns of items i and j.
            simi = 1 - spatial.distance.cosine(mat[:, i], mat[:, j])
            # Deviation of the user's rating of j from item j's mean rating.
            dert = mat[u, j] - numpy.mean(mat[:, j])
            a += simi * dert
            b += simi
    return r + a / b

for u in range(4):
    lst = []
    for i in range(5):
        lst.append(str(round(prediction(u, i), 2)))
    print(" ".join(lst))
The result is:
[[0 5 5 5 0]
[5 0 5 0 5]
[5 0 5 5 0]
[5 5 0 5 0]]
4.6 2.5 3.16 3.92 0.0
3.52 1.25 3.52 3.58 2.5
3.72 3.75 3.72 3.58 2.5
3.16 2.5 4.6 3.92 0.0
The first matrix is the input and the numbers below it are the predicted values. They don't look close; is anything wrong here?

Yes, you can use different similarity functions. For instance, cosine similarity over ratings is common but not the only option. In particular, similarity using content-based filtering can help with a sparse rating dataset (if you have relatively dense content metadata for items) because you're mapping users' preferences to the smaller content space rather than the larger individual item space.
If you only have a list of items that users have consumed (but not the magnitude of their preferences for each item), another algorithm is probably better. Try market basket analysis, such as association rule mining.
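For a sense of what the association-rule route looks like on pure like-lists, here is a minimal pairwise sketch; the baskets and the single-item rules are purely illustrative, and a real miner (e.g. Apriori) would handle larger itemsets and explicit support/confidence thresholds.

from itertools import combinations
from collections import Counter

# Hypothetical like-lists: one set of liked item IDs per user (values are made up).
baskets = [{"A", "B", "C"}, {"A", "C"}, {"B", "C", "D"}, {"A", "B"}]

item_counts = Counter(item for basket in baskets for item in basket)
pair_counts = Counter(frozenset(pair)
                      for basket in baskets
                      for pair in combinations(sorted(basket), 2))

# Confidence of the rule x -> y: P(y in basket | x in basket).
def confidence(x, y):
    return pair_counts[frozenset((x, y))] / item_counts[x]

# Recommend items that are often co-liked with what the user already likes.
user_likes = {"A"}
scores = {y: max(confidence(x, y) for x in user_likes)
          for y in item_counts if y not in user_likes}
print(sorted(scores.items(), key=lambda kv: -kv[1]))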

What you are referring to is a typical implicit-feedback situation (i.e. users do not give explicit ratings to items; say you just have likes and dislikes).
As for approaches, you can use neighbourhood models or latent factor models.
I suggest you read this paper, which proposes a well-known machine-learning based solution to the problem.
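To give a feel for the neighbourhood route on unary data, here is a minimal NumPy sketch; the binary like-matrix and the sum-of-similarities scoring rule are illustrative assumptions, not the method from the paper.

import numpy as np

# Binary like-matrix: rows are users, columns are items (1 = liked). Values are made up.
likes = np.array([[0, 1, 1, 1, 0],
                  [1, 0, 1, 0, 1],
                  [1, 0, 1, 1, 0],
                  [1, 1, 0, 1, 0]], dtype=float)

# Item-item cosine similarity over the like-columns.
norms = np.linalg.norm(likes, axis=0)
sim = (likes.T @ likes) / np.outer(norms, norms)
np.fill_diagonal(sim, 0.0)

# Score item i for user u as the sum of similarities between i and the items u liked.
scores = likes @ sim
scores[likes > 0] = -np.inf          # do not re-recommend already-liked items
print(np.argsort(-scores, axis=1))   # per-user ranking of item indices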

Related

How to deal with array of string features in traditional machine learning?

Problem
Let's say we have a dataframe that looks like this:
age job friends label
23 'engineer' ['World of Warcraft', 'Netflix', '9gag'] 1
35 'manager' NULL 0
...
If we are interested in training a classifier that predicts label using age, job, and friends as features, how would we go about transforming the features into a numerical array which can be fed into a model?
Age is pretty straightforward since it is already numerical.
Job can be hashed / indexed since it is a categorical variable.
Friends is a list of categorical variables. How would I go about representing this feature?
Approaches:
Hash each element of the list. Using the example dataframe, let's assume our hashing function has the following mapping:
NULL -> 0
engineer -> 42069
World of Warcraft -> 9001
Netflix -> 14
9gag -> 9
manager -> 250
Let's further assume that the maximum length of friends is 5. Anything shorter gets zero-padded on the right hand side. If friends size is larger than 5, then the first 5 elements are selected.
Approach 1: Hash and Stack
dataframe after feature transformation would look like this:
feature label
[23, 42069, 9001, 14, 9, 0, 0] 1
[35, 250, 0, 0, 0, 0, 0] 0
Limitations
Consider the following:
age job friends label
23 'engineer' ['World of Warcraft', 'Netflix', '9gag'] 1
35 'manager' NULL 0
26 'engineer' ['Netflix', '9gag', 'World of Warcraft'] 1
...
Compare the features of the first and third record:
feature label
[23, 42069, 9001, 14, 9, 0, 0] 1
[35, 250, 0, 0, 0, 0, 0] 0
[26, 42069, 14, 9, 9001, 0, 0] 1
Both records have the same set of friends, but the different ordering produces different feature vectors even though they should be the same.
Approach 2: Hash, Order, and Stack
To solve the limitation of Approach 1, simply order the hashes from the friends feature. This would result in the following feature transform (assuming descending order):
feature label
[23, 42069, 9001, 14, 9, 0, 0] 1
[35, 250, 0, 0, 0, 0, 0] 0
[26, 42069, 9001, 14, 9, 0, 0] 1
This approach has a limitation too. Consider the following:
age job friends label
23 'engineer' ['World of Warcraft', 'Netflix', '9gag'] 1
35 'manager' NULL 0
26 'engineer' ['Netflix', '9gag', 'World of Warcraft'] 1
42 'manager' ['Netflix', '9gag'] 1
...
Applying feature transform with ordering we get:
row feature label
1 [23, 42069, 9001, 14, 9, 0, 0] 1
2 [35, 250, 0, 0, 0, 0, 0] 0
3 [26, 42069, 9001, 14, 9, 0, 0] 1
4 [42, 250, 14, 9, 0, 0, 0] 1
What is the problem with the above features? Well, the hashes for Netflix and 9gag occupy the same indices in rows 1 and 3, but not in row 4. This would interfere with training.
Approach 3: Convert Array to Columns
What if we convert friends into a set of 5 columns and deal with each of the resulting columns just like we deal with any categorical variable?
Well, let's assume the friends vocabulary size is large (>100k). It would then be madness to go and create >100k columns where each column is responsible for the hash of the respective vocab element.
Approach 4: One-Hot-Encoding and then Sum
How about this? Convert each hash to one-hot-vector, and add up all these vectors.
In this case, the feature in row one for example would look like this:
[23, 42069, 01x8, 1, 01x4, 1, 01x8986, 1, 01x(max_hash_size-8987)]
Where 01x8 denotes a row of 8 zeros.
The problem with this approach is that these vectors will be very huge and sparse.
Approach 5: Use Embedding Layer and 1D-Conv
With this approach, we feed each word in the friends array to the embedding layer, then convolve. Similar to the Keras IMDB example: https://keras.io/examples/imdb_cnn/
Limitation: this requires a deep learning framework. I want something that works with traditional machine learning; I want to do logistic regression or use a decision tree.
What are your thoughts on this?
As another answer mentioned, you've already listed a number of alternatives that could work, depending on the dataset and the model and such.
For what it's worth, a typical logistic regression model that I've encountered would use Approach 3, and convert each of your friends strings into a binary feature. If you're opposed to having 100k features, you could treat these features like a bag-of-words model and discard the stopwords (very common features).
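As a concrete illustration of that binary-feature idea (not a production recipe), here is a minimal sketch using pandas and scikit-learn's MultiLabelBinarizer; the toy dataframe simply mirrors the example above.

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Toy data following the example above; NULL friends become an empty list.
df = pd.DataFrame({
    "age": [23, 35, 26],
    "job": ["engineer", "manager", "engineer"],
    "friends": [["World of Warcraft", "Netflix", "9gag"],
                [],
                ["Netflix", "9gag", "World of Warcraft"]],
    "label": [1, 0, 1],
})

# One binary column per distinct friend; ordering inside the list no longer matters.
mlb = MultiLabelBinarizer()
friend_features = pd.DataFrame(mlb.fit_transform(df["friends"]),
                               columns=mlb.classes_, index=df.index)
job_features = pd.get_dummies(df["job"], prefix="job")

X = pd.concat([df[["age"]], job_features, friend_features], axis=1)
y = df["label"]
print(X)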
I'll also throw a hashing variant into the mix:
Bloom Filter
You could store the strings in question in a bloom filter for each training example, and use the bits of the bloom filter as a feature in your logistic regression model. This is basically a hashing solution like you've already mentioned, but it takes care of some of the indexing/sorting issues, and provides a more principled tradeoff between sparsity and feature uniqueness.
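A rough sketch of that bloom-filter idea, with an arbitrary filter size and two salted MD5 hashes standing in for proper hash functions:

import hashlib
import numpy as np

N_BITS = 256      # filter size: a trade-off between collisions and dimensionality
N_HASHES = 2      # number of salted hashes per string

def bloom_features(strings, n_bits=N_BITS, n_hashes=N_HASHES):
    """Set one bit per (string, salt) pair; the bit vector is the feature row."""
    bits = np.zeros(n_bits, dtype=np.int8)
    for s in strings:
        for salt in range(n_hashes):
            digest = hashlib.md5(f"{salt}:{s}".encode()).hexdigest()
            bits[int(digest, 16) % n_bits] = 1
    return bits

row = bloom_features(["World of Warcraft", "Netflix", "9gag"])
print(row.sum(), "bits set out of", N_BITS)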
First, there is no definitive answer to this problem. You presented five alternatives, and all five are valid; it all depends on the dataset you are using.
With that in mind, I will list the options I find most advantageous. To me, option 5 is the best, but since you want to stick to traditional machine learning techniques, I will discard it. So I would go for option 4, provided you have the hardware to deal with the resulting dimensionality. If not, I would try approach 2. As you pointed out, the hashes for Netflix and 9gag in rows 1 and 3 have the same index in the array but not in row 4; that won't necessarily be a problem if you have enough training data (again, it all depends on the data available). Even if I ran into problems with this approach, I would try a data augmentation technique before discarding it.
Option 1 seems the worst to me: it carries a high risk of overfitting and will certainly consume a lot of computational resources.
Hope this helps!
The limitations of Approach 1 (Hash and Stack) and Approach 2 (Hash, Order, and Stack) are resolved if the result of the hashing function is treated as the index of a sparse vector that gets a value of 1, rather than as a value stored at some position of the vector.
Then, whenever "World of Warcraft" is in the friends array, the feature vector will have a value of 1 at position 9001, regardless of the position of "World of Warcraft" in the friends array (the limitation of Approach 1) and regardless of which other elements are in the friends array (the limitation of Approach 2). If "World of Warcraft" is not in the friends array, then the value of the feature vector at position 9001 will most likely be 0 (look up hashing-trick collisions to learn more).
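scikit-learn's FeatureHasher implements roughly this hashed sparse encoding; here is a minimal sketch (the n_features value is an arbitrary choice):

from sklearn.feature_extraction import FeatureHasher

# Each sample is its list of friend strings; order and padding no longer matter.
friends = [
    ["World of Warcraft", "Netflix", "9gag"],
    [],
    ["Netflix", "9gag", "World of Warcraft"],
]

hasher = FeatureHasher(n_features=2**18, input_type="string")
X_friends = hasher.transform(friends)   # sparse matrix, one row per sample
print(X_friends.shape)
# Rows 0 and 2 contain the same set of friends, so they hash to identical sparse rows.
print((X_friends[0] != X_friends[2]).nnz == 0)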
Using a word2vec representation as the feature values and then doing supervised classification can also be a good idea.

How does correlation work for an even-sized filter in this example?

(I know a question like this exists, but I wanted help with a specific example)
If the linear filter has even dimensions, how is the "center" defined? i.e. in the following scenario:
filter = np.array([[a, b],
                   [c, d]])
and the image was:
image = np.array([[0, 1, 0],
                  [1, 1, 1],
                  [0, 1, 0]])
what would be the result of correlation of the image with the linear filter?
Which element of an even-sized filter is considered the origin is an arbitrary choice, and each implementation may make a different one, though a and d are the two most likely choices because they treat the two image dimensions symmetrically.
For example, MATLAB's imfilter (which implements correlation, not convolution) does the following:
f = [1,2;4,8];
img = [0,1,0;1,1,1;0,1,0];
imfilter(img,f,'same')
ans =
14 13 4
11 7 1
2 1 0
meaning that a is the origin of the kernel in this case. Other implementations might make a different choice.
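For comparison, here is a minimal NumPy sketch that reproduces the same correlation with the origin fixed at the top-left element a, using zero padding on the bottom and right; it should match the MATLAB output above.

import numpy as np

# Filter and image from the example above.
f = np.array([[1, 2],
              [4, 8]])
img = np.array([[0, 1, 0],
                [1, 1, 1],
                [0, 1, 0]])

# With the origin at the top-left tap, the other taps reach down/right,
# so pad one row of zeros at the bottom and one column on the right.
padded = np.pad(img, ((0, 1), (0, 1)), mode='constant')

out = np.zeros_like(img)
for i in range(img.shape[0]):
    for j in range(img.shape[1]):
        out[i, j] = np.sum(f * padded[i:i + 2, j:j + 2])

print(out)   # expected: [[14 13 4], [11 7 1], [2 1 0]]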

Converting Continuous Model Probability Score to a Categorical Rating

I have a standard xgboost classification model that has been trained and now predicts a probability score. However, to keep the user interface simple, I would like to convert this score to a 5-star rating scheme, i.e. discretize the score.
What are intelligent ways of deriving the thresholds for this quantization such that high ratings represent a high probability score with high confidence?
For example, I was considering generating confidence intervals along with the prediction and grouping high-confidence high scores as 5 stars, high-confidence low scores as 1 star, high-confidence medium-high scores as 4 stars, and so on.
I investigated multiple solutions for this and prototyped a V0 solution. The main requirements for the solution are as follows:
As the rating level increases (5 stars is better than 1 star), the number of false positives must decrease.
The user doesn't have to manually define thresholds on the score probabilities; the thresholds are derived automatically.
The thresholds are derived from some higher-level business requirement.
The thresholds are derived from the labelled data and can be re-derived as new information is found.
Other solutions considered:
Confidence interval based rating. For example, you could have a high predicted score of 0.9 but low confidence (i.e. large confidence interval) and a high predicted score of 0.9 but high confidence (i.e. small interval). I suspect we might want the latter to be a 5 star candidate while the former a 4* perhaps?
Identifying Convexity and concavity of ROC curve to identify points of max value
Use Youden index to identify optimal point
Final solution - sample the ROC curve at a given set of business requirements (a false-positive rate associated with each star rating) and then translate these points into thresholds.
Note: this worked but assumes a somewhat monotonic precision curve, which may not always be the case. I improved the solution by formulating it as an optimization problem in which the rating thresholds were the degrees of freedom and the objective function was the linearity of the conversion rates between rating buckets. I'm sure you could try other objective functions, but for my purpose this worked really well.
References:
Converting Continuous Model Probability Score to a Categorical Rating
http://www.medicalbiostatistics.com/roccurve.pdf
http://www.bigdatarepublic.nl/regression-prediction-intervals-with-xgboost/
Prototype Solution:
import numpy as np
import pandas as pd
from sklearn.metrics import roc_curve

# The probabilities and the fpr/tpr/thresholds come from the ROC curve.
probas_ = xgb_model_copy.fit(features.values[train], label.values[train]).predict_proba(features.values[test])

# Compute the ROC curve.
fpr, tpr, thresholds = roc_curve(label.values[test], probas_[:, 1])

# Business requirement: the false-positive rate allowed at each star boundary.
fpr_req = [0.01, 0.3, 0.5, 0.9]

def find_nearest(array, value):
    idx = (np.abs(array - value)).argmin()
    return idx

fpr_indexes = [find_nearest(fpr, fpr_req_val) for fpr_req_val in fpr_req]
star_rating_thresholds = thresholds[fpr_indexes]
star_rating_thresholds = np.append(np.append([1], star_rating_thresholds), [0])
candidate_ratings = pd.cut(probas_[:, 1],
                           star_rating_thresholds[::-1], labels=[5, 4, 3, 2, 1],
                           right=False, include_lowest=True)

star_rating_thresholds
array([1.        , 0.5073538 , 0.50184137, 0.5011086 , 0.4984425 ,
       0.        ])
candidate_ratings
[5, 5, 5, 5, 5, ..., 2, 2, 2, 2, 1]
Length: 564
Categories (5, int64): [5 < 4 < 3 < 2 < 1]
You can use the pandas.cut() method:
In [62]: np.random.seed(0)
In [63]: a = np.random.rand(10)
In [64]: a
Out[64]: array([0.5488135 , 0.71518937, 0.60276338, 0.54488318, 0.4236548 , 0.64589411, 0.43758721, 0.891773 , 0.96366276, 0.38344152])
In [65]: pd.cut(a, bins=np.linspace(0, 1, 6), labels=[1,2,3,4,5])
Out[65]:
[3, 4, 4, 3, 3, 4, 3, 5, 5, 2]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
UPDATE: @EranMoshe has added an important point - "you might want to normalize your output before cutting it into categorical values".
Demo:
In [117]: a
Out[117]: array([0.6 , 0.8 , 0.85, 0.9 , 0.95, 0.97])
In [118]: pd.cut(a, bins=np.linspace(a.min(), a.max(), 6),
labels=[1,2,3,4,5], include_lowest=True)
Out[118]:
[1, 3, 4, 5, 5, 5]
Categories (5, int64): [1 < 2 < 3 < 4 < 5]
Assuming a classification problem: either 1's or 0's
When calculating the AUC of the ROC curve, you sort the "events" by your model's prediction. So at the top you'll most likely have a lot of 1's, and the farther you go down that sorted list, the more 0's you'll find.
Now let's say you try to determine the threshold for a score of "5".
You can decide on the relative percentage of 0's in your data that you are willing to tolerate.
Given the following table (where A is the actual class):
item  A  score
1     1  0.99
2     1  0.92
3     0  0.89
4     1  0.88
5     1  0.74
6     0  0.66
7     0  0.64
8     0  0.59
9     1  0.55
If I want "user score" "5" to have 0% false positives I would determine the threshold for "5" to be above 0.89
If I can tolerate 10% false positives I would determine the threshold to be above 0.66.
You can do the same for each threshold.
In my opinion this is a 100% business decision and the smartest way to pick those thresholds is by your knowledge of the users.
If users expects "5" to be a perfect prediction of the class (life and death situation) go with the 0% false positives.
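A minimal sketch of that counting logic (the labels and scores mirror the toy table above; the number of tolerated negatives per rating is something you would set from the business requirement):

import numpy as np

# Toy data mirroring the table above: actual class and model score.
actual = np.array([1, 1, 0, 1, 1, 0, 0, 0, 1])
score = np.array([0.99, 0.92, 0.89, 0.88, 0.74, 0.66, 0.64, 0.59, 0.55])

def threshold_for_max_negatives(actual, score, max_negatives):
    """Lowest score threshold such that at most `max_negatives` 0's score above it."""
    order = np.argsort(-score)                 # sort items by descending score
    negatives_so_far = np.cumsum(actual[order] == 0)
    keep = negatives_so_far <= max_negatives   # longest prefix we can tolerate
    return score[order][keep][-1]              # lowest score still inside that prefix

print(threshold_for_max_negatives(actual, score, 0))  # strictest bucket, no false positives
print(threshold_for_max_negatives(actual, score, 1))  # one false positive tolerated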

Compute similarity between n entities

I am trying to compute the similarity between n entities that are being described by entity_id, type_of_order, total_value.
An example of the data might look like:
NR entity_id type_of_order total_value
1 1 A 10
2 1 B 90
3 1 C 70
4 2 B 20
5 2 C 40
6 3 A 10
7 3 B 50
8 3 C 20
9 4 B 50
10 4 C 80
My question is: what is a good way of measuring the similarity between, for example, entity_id 1 and 2 with regard to the type_of_order and the total_value for each type of order?
Would a simple KNN give satisfactory results or should I consider other algorithms?
Any suggestion would be much appreciated.
The similarity metric is a heuristic to capture a relationship between two data rows, with respect to the data semantics and the purpose of the training. We don't know your data; we don't know your usage. It would be irresponsible to suggest metrics to solve a problem when we have no idea what problem we're solving.
You have to address this question to the person you find in the mirror. You've given us three features with no idea of what they mean or how they relate. You need to quantify ...
relative distances within features: under type_of_order, what is the relationship (distance) between any two measurements? If we arbitrarily assign d(A, B) = 1, then what is d(B, C)? We have no information to help you construct this. Further, if we give that some value c, then what is d(A, C)? In various popular metrics, it could be 1+c, |1-c|, all distances could be 1, or perhaps it's something else -- even more than 1+c in some applications.
Even in the last column, we cannot assume that d(10, 20) = d(40, 50); the actual difference could be a ratio, difference of squares, etc. Again, this depends on the semantics behind these labels.
relative weights between features: How do the differences in the various columns combine to provide a similarity? For instance, how does d([A, 10], [B, 20]) compare to d([A, 10], [C, 30])? That's two letters in the left column, two steps of 10 in the right column. How about d([A, 10], [A, 20]) vs d([A, 10], [B, 10])? Are the distances linear, or do the relationships change as we slide up the alphabet or to higher numbers?
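To make the consequences of such choices concrete, here is one possible encoding, shown purely as an illustration of how a specific set of answers plays out and not as a recommendation: pivot each entity into a vector of total_value per type_of_order (implicitly deciding that a missing order type means 0) and compare those vectors with cosine similarity.

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Data from the example in the question.
df = pd.DataFrame({
    "entity_id":     [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    "type_of_order": ["A", "B", "C", "B", "C", "A", "B", "C", "B", "C"],
    "total_value":   [10, 90, 70, 20, 40, 10, 50, 20, 50, 80],
})

# One row per entity, one column per order type; absent combinations become 0,
# which is itself a modelling decision that needs justifying.
pivot = df.pivot_table(index="entity_id", columns="type_of_order",
                       values="total_value", fill_value=0)

sim = cosine_similarity(pivot)   # pairwise entity similarity under this encoding
print(pd.DataFrame(sim, index=pivot.index, columns=pivot.index).round(2))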

Dividing data sets into testing and training data

I have a dataset with k examples and I want to partition it into m sets.
How can I do this programmatically?
For example, if k = 5 and m = 2, then 5 / 2 = 2.5.
How do I partition it into 2 and 3, and not 2, 2 and 1?
Similarly, if k = 10 and m = 3, I want it partitioned into 3, 3 and 4, but not 3, 3, 3 and 1.
Usually, this sort of functionality is built into tools. But, assuming that your observations are independent, just set up a random number generator and do something like:
for i = 1 to k do;
    set r = rand();
    if r < 0.5 then data[i].which = 'set1'
    else data[i].which = 'set2'
You can extend this for any number of sets and probabilities.
For an example where k = 5, you could actually get all the rows in a single set (roughly 3% of the time). However, the point of splitting data is to deal with larger amounts of data; if you only have 5 or 10 rows, splitting your observations into different partitions is probably not the way to go.
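If you need exact partition sizes such as 3/3/4 for k = 10 and m = 3 (rather than sizes that are only right in expectation), a minimal NumPy sketch is to shuffle the indices and split them into nearly equal chunks:

import numpy as np

k, m = 10, 3
rng = np.random.default_rng(0)

indices = rng.permutation(k)          # random ordering of the k example indices
parts = np.array_split(indices, m)    # m chunks whose sizes differ by at most 1

for p, idx in enumerate(parts):
    print(f"partition {p}: size {len(idx)}, indices {sorted(idx)}")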
