Item-based collaborative filtering in Mahout - without isolating users

In Mahout there is an implemented method for item-based collaborative filtering called itemsimilarity.
In theory, the similarity between two items should be calculated only over the users who ranked both items. During testing I realized that Mahout works differently.
In the example below, the similarity between items 11 and 12 should equal 1, but Mahout's output is 0.366.
Example 1. Items are 11-12.
Similarity between items:
11 12 0.36602540378443865
Matrix with preferences:
     11   12
1     1
2     1
3     1    1
4          1
It looks like Mahout treats missing (null) preferences as 0.
Example 2. Items are 101-103.
Similarity between items:
101 102 0.2612038749637414
101 103 0.4340578302732228
102 103 0.2600070276638468
Matrix with preferences:
      101  102  103
 1         1    0.1
 2         1    0.1
 3         1    0.1
 4    1    1    0.1
 5    1    1    0.1
 6         1    0.1
 7         1    0.1
 8         1    0.1
 9         1    0.1
10         1    0.1
According to the theory, the similarity between items 101 and 102 should be calculated using only the ratings of users 4 and 5, and likewise for items 101 and 103. Here (101,103) comes out as more similar than (101,102), and it shouldn't.
Both examples were run without any additional parameters.
Has this problem been solved somewhere, somehow? Any ideas?
Source: http://files.grouplens.org/papers/www10_sarwar.pdf

Those users are not identical. Collaborative filtering needs a measure of cooccurrence, and the same items do not cooccur between those users. Likewise, the items are not identical: they were each preferred by different users.
The data is turned into a "sparse matrix" where only non-zero values are recorded; the rest are treated as 0. This is expected and correct: the algorithms treat 0 as no preference, not as a negative preference.
It's doing the right thing.
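Incidentally, the exact similarity values in the question can be reproduced by a Euclidean-distance-based similarity, 1/(1 + d), computed over the full user vectors with missing preferences treated as 0. A minimal numpy sketch to check this (whether Mahout's itemsimilarity default is exactly this formula is an assumption on my part):

import numpy as np

def euclidean_similarity(a, b):
    # 1 / (1 + Euclidean distance), over the full vectors, missing = 0
    return 1.0 / (1.0 + np.linalg.norm(a - b))

# Example 1: item 11 rated by users 1-3, item 12 by users 3-4.
item11 = np.array([1.0, 1.0, 1.0, 0.0])
item12 = np.array([0.0, 0.0, 1.0, 1.0])
print(euclidean_similarity(item11, item12))    # 0.36602540378443865

# Example 2: 101 rated by users 4-5, 102 by all ten users,
# 103 rated 0.1 by everyone.
item101 = np.array([0, 0, 0, 1, 1, 0, 0, 0, 0, 0], dtype=float)
item102 = np.ones(10)
item103 = np.full(10, 0.1)
print(euclidean_similarity(item101, item102))  # 0.2612038749637414
print(euclidean_similarity(item101, item103))  # 0.4340578302732228
print(euclidean_similarity(item102, item103))  # 0.2600070276638468

# Restricting to co-rating users (the "theory" version) would give 1.0
# for items 11 and 12: only user 3 rated both, with identical values.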

Related

Which feature selection technique for NLP does this represent?

I have a dataset that came from NLP on technical documents.
My dataset has 60,000 records and 30,000 features, where each value is the number of times that word/feature appeared.
Here is a sample of the dataset:
RowID  Microsoft  Internet  PCI  Laptop  Google  AWS  iPhone  Chrome
1      8          2         0    0       5       1    0       0
2      0          1         0    1       1       4    1       0
3      0          0         0    7       1       0    5       0
4      1          0         0    1       6       7    5       0
5      5          1         0    0       5       0    3       1
6      1          5         0    8       0       1    0       0
--------------------------------------------------------------------
Total  9,470      821       5    107     4,605   719  25      8
Some words appeared fewer than 10 times in the whole dataset.
The technique is to select only the words/features whose total count in the dataset exceeds a certain number (say 100).
What is this technique called? That is, the one that keeps only features that in total appeared more than a certain number of times.
This technique for feature selection is rather trivial, so I don't believe it has a particular name beyond something intuitive like "low-frequency feature filtering", "k-occurrence feature filtering", or "top-k-occurrence feature selection" in the machine learning sense, and "term-frequency filtering" or "rare word removal" in the Natural Language Processing (NLP) sense.
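Whatever the name, the filter itself is a one-liner on a count matrix. A minimal sketch (the matrix here is randomly generated purely as a stand-in for your 60,000 x 30,000 data):

import numpy as np

rng = np.random.default_rng(0)
lam = rng.gamma(1.0, 0.2, size=5_000)      # per-word frequency profile
X = rng.poisson(lam, size=(1_000, 5_000))  # stand-in term-count matrix

min_total = 100                   # keep features appearing > 100 times overall
keep = X.sum(axis=0) > min_total  # boolean mask over the feature columns
X_filtered = X[:, keep]
print(X.shape, "->", X_filtered.shape)

If you build the matrix from raw text with sklearn's CountVectorizer, its min_df parameter implements a close cousin of this idea (it thresholds on document frequency, i.e. the number of documents containing a term, rather than on total count).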
If you'd like to use more sophisticated means of feature selection, I'd recommend looking into the various supervised and unsupervised methods available. Cai et al. [1] provide a comprehensive survey; if you can't access the article, this page by JavaTPoint covers some of the supervised methods. A quick web search for supervised/unsupervised feature selection also yields many good blog posts, most of which make use of the SciPy and sklearn Python libraries.
References
[1] Cai, J., Luo, J., Wang, S. and Yang, S., 2018. Feature selection in machine learning: A new perspective. Neurocomputing, 300, pp.70-79.

How to work out the difference between scores at date 1 and future dates for the same ID in SPSS

I have a dataset where each ID has visited a website and recorded a risk level, coded 0-3. They have then returned to the website at future dates and recorded their risk level again. I want to calculate the difference between each ID's later risk levels and their first recorded risk level.
For example, my dataset looks like this:
ID Timestamp RiskLevel
1 20-Jan-21 2
1 04-Apr-21 2
2 05-Feb-21 1
2 12-Mar-21 2
2 07-May-21 3
3 09-Feb-21 2
3 14-Mar-21 1
3 18-Jun-21 0
And I would like it to look like this:
ID Timestamp RiskLevel DifFromFirstRiskLevel
1 20-Jan-21 2 .
1 04-Apr-21 2 0
2 05-Feb-21 1 .
2 12-Mar-21 2 1
2 07-May-21 3 2
3 09-Feb-21 2 .
3 14-Mar-21 1 -1
3 18-Jun-21 0 -2
What should I do?
One way to approach this is with the strategy in my answer here, but this time I will use a different approach:
sort cases by ID timestamp.
compute firstRisk=risklevel.
* Carry each ID's first risk level down to its later cases.
if $casenum>1 and ID=lag(ID) firstRisk=lag(firstRisk).
execute.
compute DifFromFirstRiskLevel=risklevel-firstRisk.
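For anyone who needs the same calculation outside SPSS, a rough pandas equivalent (a sketch, with column names assumed to match the example above):

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 2, 2, 2, 3, 3, 3],
    "Timestamp": pd.to_datetime(["2021-01-20", "2021-04-04", "2021-02-05",
                                 "2021-03-12", "2021-05-07", "2021-02-09",
                                 "2021-03-14", "2021-06-18"]),
    "RiskLevel": [2, 2, 1, 2, 3, 2, 1, 0],
})

df = df.sort_values(["ID", "Timestamp"])
first = df.groupby("ID")["RiskLevel"].transform("first")
df["DifFromFirstRiskLevel"] = df["RiskLevel"] - first
# Blank out each ID's first visit to mirror the expected output above.
df.loc[df.groupby("ID").cumcount() == 0, "DifFromFirstRiskLevel"] = np.nan
print(df)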

Multi-class classification on a sparse dataset

I have a dataset of factory workstations.
There are two kinds of errors occurring in the same particular time window:
The user selects an error type and a time interval (dependent variable, y).
The machines produce errors during production (independent variables, X).
There are 8 unique user-selected error types in total, so I tried to predict those errors using the machine-produced errors (188 types in total) and some other numerical features such as average machine speed, machine volume, etc.
Each row represents a user-selected error in a particular time window.
For example, in the first line the user selected the time interval
2018-01-03 12:02:00 - 2018-01-03 12:05:37
and m_er_1 (machine error 1) also occurred 12 times in that same interval.
m_er_1_dur (machine error 1 duration) is the total duration of machine error 1 in seconds.
So I matched those two tables, and the result looks like this:
user_error  m_er_1  m_er_2  m_er_3  ...  m_er_188  avg_m_speed  ...  m_er_1_dur
A           12      0       0            0         150               217
B           0       0       2            0         10                0
A           3       0       0            6         34                37
A           0       0       0            0         5                 0
D           0       0       0            0         3                 0
E           0       0       0            0         1000              0
In the end, I have 1900 rows and 390 columns (376 = 188 machine-error counts + 188 machine-error durations, plus 14 numerical features), and because of the machine errors it is a sparse dataset with lots of 0s.
There are no outliers and no NaN values. I normalized the data and tried several classification algorithms (SVM, logistic regression, MLPC, XGBoost, etc.).
I also tried PCA, but it didn't work well: 165 components are needed for an explained_variance_ratio around 0.95.
But the accuracy metrics are very low: for logistic regression the accuracy score is 0.55 and the MCC is around 0.1; recall, F1, and precision are also very low.
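For reference, a minimal sketch of the pipeline described above (the data is randomly generated here only to make the snippet self-contained, so the score it prints is meaningless):

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.poisson(0.1, size=(1900, 390)).astype(float)  # sparse-ish stand-in
y = rng.integers(0, 8, size=1900)                     # 8 user-error classes

pipe = make_pipeline(StandardScaler(),
                     PCA(n_components=165),
                     LogisticRegression(max_iter=1000))
print(cross_val_score(pipe, X, y, cv=5).mean())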
Are there any steps that I'm missing? What would you suggest for multi-class classification on a sparse dataset?
Thanks in advance

Which clustering model can I use to predict the following outcome?

I have three columns in my dataset. It is the list of restaurants that come under the category 'pizza', derived from the Yelp dataset. There are three columns for each restaurant: latitude, longitude, and checkins. I am supposed to build a model that predicts the coordinates (latitude, longitude) at which a new restaurant should be opened so that the number of checkins will be high. There are 4951 rows in total.
checkins latitude longitude
0 2 33.394877 -111.600194
1 2 43.841217 -79.303936
2 1 40.442828 -80.186293
3 1 41.141631 -81.356603
4 1 40.434399 -79.922983
5 1 33.552870 -112.133712
6 1 43.686836 -79.293838
7 2 41.131282 -81.490180
8 1 40.500796 -79.943429
9 12 36.010086 -115.118656
10 2 41.484475 -81.921150
11 1 43.842450 -79.027990
12 1 43.724840 -79.289919
13 2 45.448630 -73.608719
14 1 45.577027 -73.330855
15 1 36.238059 -115.210341
16 1 33.623055 -112.339758
17 1 43.762768 -79.491417
18 1 43.708415 -79.475884
19 1 45.588257 -73.428926
20 4 41.152875 -81.358754
21 1 41.608833 -81.525020
22 1 41.425152 -81.896178
23 1 43.694716 -79.304879
24 1 40.442147 -79.956513
25 1 41.336466 -81.784790
26 1 33.231942 -111.721218
27 2 36.291436 -115.287016
28 2 33.641847 -111.995571
29 1 43.570217 -79.566431
... ... ... ...
I tried to approach the problem with clustering using DBSCAN and ended up with the following graph, but I am not able to make any sense of it. How do I proceed further, or how do I approach the problem differently to get my results?
import pandas as pd
from sklearn.cluster import DBSCAN
import numpy as np
import matplotlib.pyplot as plt

review = pd.read_csv('pizza_category.csv')
checkin = pd.read_csv('yelp_academic_dataset/yelp_checkin.csv')
final = pd.merge(review, checkin, on='business_id', how='inner')
final = final.dropna().reset_index(drop=True)  # dropna() returns a copy; assign it

X = final[['checkins', 'latitude', 'longitude']].astype(np.float64)
print(X)

arr = X.values
db = DBSCAN(eps=2, min_samples=5)
y_pred = db.fit_predict(arr)

plt.figure(figsize=(20, 10))
plt.scatter(arr[:, 0], arr[:, 1], c=y_pred, cmap="plasma")  # checkins vs latitude
plt.xlabel("Feature 0")
plt.ylabel("Feature 1")
Here's the plot I got
This is not a clustering problem.
What you want to do is density estimation: estimate the check-in density over locations based on the previous check-in frequencies, and look for high-density regions.
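A minimal sketch of that idea, using a check-in-weighted kernel density estimate (the tiny inline dataset, the bandwidth, and the grid resolution are all assumptions; with real latitude/longitude you would also want a haversine metric rather than plain Euclidean distance):

import numpy as np
from sklearn.neighbors import KernelDensity

# A few rows standing in for the 4951-restaurant dataset.
coords = np.array([[33.394877, -111.600194],
                   [43.841217, -79.303936],
                   [36.010086, -115.118656]])
checkins = np.array([2, 2, 12])

# Weight each restaurant by its check-in count so that popular areas
# contribute more to the estimated density.
kde = KernelDensity(bandwidth=0.5).fit(coords, sample_weight=checkins)

# Score a latitude/longitude grid and take the argmax as a candidate spot.
lat = np.linspace(coords[:, 0].min(), coords[:, 0].max(), 200)
lon = np.linspace(coords[:, 1].min(), coords[:, 1].max(), 200)
grid = np.array([(a, b) for a in lat for b in lon])
best = grid[np.argmax(kde.score_samples(grid))]
print(best)  # candidate (latitude, longitude)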

Clustering unique datasets based on similarities (equality)

I just entered the space of data mining, machine learning, and clustering. I have a specific problem and do not know which technique to use to solve it.
I want to perform clustering of observations (objects, or whatever) in a specific data format. All variables in each observation are numeric. My data input looks like this:
1 2 3 4 5 6
1 3 5 7
2 9 10 11 12 13 14
45 1 22 23 24
Let's say that n represents a row (an observation, a 1D vector, ...) and m represents a column (the variable index within each vector). n could be a very large number, and 0 < m < 100. The main point is that the same observation (row) cannot contain duplicate values (within the first row, a given value can appear only once).
So I want to perform clustering where observations are put in the same cluster based on the number of identical values their rows share.
If there are two rows like:
1
1 2 3 4 5
they should be clustered into the same cluster; if there is no match, then certainly not. Also, the number of rows in one cluster should not go above 100.
Sick problem..? If not, just for info: I haven't mentioned the time dimension. Let's skip that for now.
So, any directions from you guys?
Thanks and best regards,
JDK
It's hard to recommend anything since your problem is totally vague, and we have no information about the data. Data mining (and in particular explorative techniques like clustering) is all about understanding the data, so we cannot provide the ultimate answer.
Two things for you to consider:
1. If the data indicates the presence of species or traits, Jaccard similarity (and other set-based metrics) is worth a try; see the sketch after this answer.
2. If absence is less informative, maybe you should be mining association rules, not clusters.
Either way, without understanding your data, these numbers are as good as random numbers. You can easily cluster random numbers, and spend weeks getting the best useless result!
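For concreteness, Jaccard similarity treats each row as a set and compares overlap to total size. A tiny sketch on two of your rows:

def jaccard(a, b):
    # |intersection| / |union| of the two rows viewed as sets
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard([1, 2, 3, 4, 5, 6], [1, 3, 5, 7]))  # 3 shared / 7 total ~ 0.43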
Can your problem be treated as a bag-of-words model, where each article (observation row) has no more than 100 terms?
Anyway, I think you have to give more information and examples about "why" and "how" you want to cluster these data. For example, say we have:
1 2 3
2 3 4
2 3 4 5
1 2 3 4
3 4 6
6 7 8
9 10
9 11
10 12 13 14
What is your expected clustering? How many clusters are there in this clustering? Only two clusters?
Before you give more information, going by your current description, I think you do not need a clustering algorithm but rather a connected-components structure. In the first round you process the dataset to build the connected-components information, and in the second round you check which connected component each row belongs to. Taking the example above, the first round:
1 2 3       : 1 <- 1, 1 <- 2, 1 <- 3  (every point is linked to the smallest
              point, to record that it belongs to that point's cluster)
2 3 4       : 2 <- 4  (2 and 3 are already linked to 1, which is <= 2, so
              they do not need to change)
2 3 4 5     : 2 <- 5
1 2 3 4     : 1 <- 4  (in fact this change is not essential, because we
              already have 1 <- 2 <- 4, but making it speeds up the second
              round)
3 4 6       : 3 <- 6
6 7 8       : 6 <- 7, 6 <- 8
9 10        : 9 <- 9, 9 <- 10
9 11        : 9 <- 11
10 12 13 14 : 10 <- 12, 10 <- 13, 10 <- 14
Now we have a forest structure that represents the connected components of the points. In the second round you can simply pick one point from each row (the smallest is best) and trace its root in the forest. Rows that share the same root are in the same, in your words, cluster. For example:
1 2 3 : 1 <- 1, cluster root 1
2 3 4 5 : 1 <- 1 <- 2, cluster root 1
6 7 8 : 1 <- 1 <- 3 <- 6, cluster root 1
9 10 : 9 <- 9, cluster root 9
10 12 13 14 : 9 <- 9 <- 10, cluster root 9
This process takes O(k) space, where k is the number of distinct points, and O(nm + nh) time, where h is the height of the forest structure; with the shortcut links added in the first round, h << m.
I am not sure if this is the result you want.
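For what it's worth, a compact sketch of this two-round idea in Python, using union-find with path compression (the names and structure are mine, not part of any library):

def cluster_rows(rows):
    # Group rows into connected components by linking every value
    # in a row to that row's smallest value (union-find).
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[max(ra, rb)] = min(ra, rb)

    # First round: link all points within each row.
    for row in rows:
        smallest = min(row)
        for v in row:
            union(smallest, v)

    # Second round: each row belongs to the component of its smallest point.
    clusters = {}
    for row in rows:
        clusters.setdefault(find(min(row)), []).append(row)
    return list(clusters.values())

rows = [[1, 2, 3], [2, 3, 4], [2, 3, 4, 5], [1, 2, 3, 4],
        [3, 4, 6], [6, 7, 8], [9, 10], [9, 11], [10, 12, 13, 14]]
print(cluster_rows(rows))  # two clusters, rooted at 1 and at 9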
