How to find similar users based on items purchased - machine-learning

I have users and the products each user has purchased; there are no ratings given by users for the products. Below is the sample data:
DATA :
user products
A 111, 333, 444
B 333, 444, 555
C 555, 111, 333
D 222,333, 333,333
E 111,333,444,555
F 222,555,111
Can we find similar customers based on the above data? I am trying to use 1 for a product that was purchased and 0 otherwise, like below.
111 222 333 444 555
A 1 0 1 1 0
B 0 0 1 1 1
C 1 0 1 0 1
D 0 1 1 0 0
E 1 0 1 1 1
F 1 1 0 0 1
Using the above matrix, how do I find similar customers? I am expecting output in the format below.
user Id similar customers
A E, B, C
B E, A, F
C A, E
E A, B, C
F B, D

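import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = df.set_index('user')
cvect = CountVectorizer()
# Count each product ID per user, then compare users pairwise
sim = cosine_similarity(cvect.fit_transform(df['products']))
np.fill_diagonal(sim, 0)  # a user should not be listed as similar to themselves
cs = pd.DataFrame(sim, columns=df.index, index=df.index)
threshold = 0.66
# Keep, for each user, the users whose similarity exceeds the threshold
df['similar'] = cs[cs > threshold].apply(lambda row: row.dropna().index.tolist(), axis=1)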
Result:
In [300]: df
Out[300]:
products similar
user
A 111, 333, 444 [B, C, E]
B 333, 444, 555 [A, C, E]
C 555, 111, 333 [A, B, E, F]
D 222,333, 333,333 []
E 111,333,444,555 [A, B, C]
F 222,555,111 [C]
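For the 1/0 matrix described in the question, cosine similarity can also be computed directly from the purchase sets, without scikit-learn. The following is a minimal sketch rather than the answer's method: the `purchases` dict and the `cosine` helper are assumptions for illustration, the 0.66 threshold is reused from the answer, and repeated purchases (such as D's 333) collapse because each user is treated as a set.

```python
from math import sqrt

# Hypothetical input: each user's purchases as a set (duplicates collapse).
purchases = {
    "A": {111, 333, 444},
    "B": {333, 444, 555},
    "C": {555, 111, 333},
    "D": {222, 333},
    "E": {111, 333, 444, 555},
    "F": {222, 555, 111},
}

def cosine(x, y):
    """Cosine similarity of two binary purchase vectors given as sets."""
    if not x or not y:
        return 0.0
    # For 0/1 vectors the dot product is the size of the intersection.
    return len(x & y) / sqrt(len(x) * len(y))

threshold = 0.66  # same cut-off as in the answer

similar = {
    user: [
        other
        for other, other_items in purchases.items()
        if other != user and cosine(items, other_items) > threshold
    ]
    for user, items in purchases.items()
}

for user, others in similar.items():
    print(user, others)
```

On this sample data the binary version produces the same lists as the count-based result shown above.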

Find column number of last match in a row in sheets

In this table it's easy to see that column E is the first match for the value 3.
How do I find the column of the last match of 3, which will be column I?
A B C D E F G H I J K L
6 6 9 9 3 3 2 2 3 1 1 1
Use this formula
=ArrayFormula(Substitute(Address(1,MAX(IF(REGEXMATCH(A1:L1,3&"")<>TRUE,,COLUMN(A1:L1))),4),"1",""))
try:
=SUBSTITUTE(ADDRESS(2, XMATCH(3, A2:P2,, -1), 4), 2, )
=ADDRESS(2, XMATCH(3, A2:P2,, -1), 4)
=XLOOKUP(3, A2:P2, A1:P1,,, -1)
XMATCH has a reverse search feature. Set search_mode to -1 to activate it:
=INDEX(1:1,XMATCH(3,2:2,,-1))
(A1)A B C D E F G H I J K L
6 6 9 9 3 3 2 2 3 1 1 1
Result:
I
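The reverse search can be mirrored outside Sheets as well. A minimal Python sketch, assuming the row from the question and at most 26 columns (A to Z); the helper name `last_match_column` is hypothetical:

```python
def last_match_column(row, target):
    """Return the column letter of the last cell equal to target, or None."""
    # Walk the row from the right so the first hit is the last match.
    for index in range(len(row) - 1, -1, -1):
        if row[index] == target:
            return chr(ord("A") + index)  # valid for columns A..Z only
    return None

row = [6, 6, 9, 9, 3, 3, 2, 2, 3, 1, 1, 1]
print(last_match_column(row, 3))  # I
```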

How to use average function in neo4j with collection

I want to calculate the covariance of two vectors stored as collections:
A=[1, 2, 3, 4]
B=[5, 6, 7, 8]
Cov(A,B)= Sigma[(ai-AVGa)*(bi-AVGb)] / (n-1)
My problems with the covariance computation are:
1) I cannot nest aggregate functions, as when I write
SUM((ai-avg(a)) * (bi-avg(b)))
2) Alternatively, how can I iterate over two collections in one REDUCE, such as:
REDUCE(x= 0.0, ai IN COLLECT(a) | bi IN COLLECT(b) | x + (ai-avg(a))*(bi-avg(b)))
3) If it is not possible to iterate over two collections in one REDUCE, how can I relate their values to calculate the covariance when they are separate?
REDUCE(x= 0.0, ai IN COLLECT(a) | x + (ai-avg(a)))
REDUCE(y= 0.0, bi IN COLLECT(b) | y + (bi-avg(b)))
In other words, can I write a nested REDUCE?
4) Is there any way to do this with "unwind" or "extract"?
Thank you in advance for any help.
cybersam's answer is totally fine but if you want to avoid the n^2 Cartesian product that results from the double UNWIND you can do this instead:
WITH [1,2,3,4] AS a, [5,6,7,8] AS b
WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
SIZE(a) AS n, a, b
RETURN REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) / (n - 1) AS cov;
Edit:
Not calling anyone out, but let me elaborate more on why you would want to avoid the double UNWIND in https://stackoverflow.com/a/34423783/2848578. Like I said below, UNWINDing k length-n collections in Cypher results in n^k rows. So let's take two length-3 collections over which you want to calculate the covariance.
> WITH [1,2,3] AS a, [4,5,6] AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN aa, bb;
| aa | bb
---+----+----
1 | 1 | 4
2 | 1 | 5
3 | 1 | 6
4 | 2 | 4
5 | 2 | 5
6 | 2 | 6
7 | 3 | 4
8 | 3 | 5
9 | 3 | 6
Now we have n^k = 3^2 = 9 rows. At this point, taking the average of these identifiers means we're taking the average of 9 values.
> WITH [1,2,3] AS a, [4,5,6] AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN AVG(aa), AVG(bb);
| AVG(aa) | AVG(bb)
---+---------+---------
1 | 2.0 | 5.0
Also as I said below, this doesn't affect the answer because the average of a repeating vector of numbers will always be the same. For example, the average of {1,2,3} is equal to the average of {1,2,3,1,2,3}. It is likely inconsequential for small values of n, but when you start getting larger values of n you'll start seeing a performance decrease.
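The row blow-up and the unchanged averages can be reproduced outside Cypher. A rough Python sketch, in which `itertools.product` stands in for the double UNWIND (an analogy only, not a description of how Neo4j executes the query):

```python
from itertools import product
from statistics import mean

a = [1, 2, 3]
b = [4, 5, 6]

# Every element of a paired with every element of b, like UNWIND a then UNWIND b.
rows = list(product(a, b))
print(len(rows))  # 9

# Averaging over the expanded rows gives the same means as the originals,
# but touches len(a) * len(b) values instead of len(a) + len(b).
avg_aa = mean(aa for aa, _ in rows)
avg_bb = mean(bb for _, bb in rows)
print(avg_aa == mean(a), avg_bb == mean(b))  # True True
```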
Let's say you have two length-1000 vectors (strictly 1001 elements each, since RANGE is inclusive). Calculating the average of each with a double UNWIND:
> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
UNWIND a AS aa
UNWIND b AS bb
RETURN AVG(aa), AVG(bb);
| AVG(aa) | AVG(bb)
---+---------+---------
1 | 500.0 | 1500.0
714 ms
Is significantly slower than using REDUCE:
> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
RETURN REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b;
| e_a | e_b
---+-------+--------
1 | 500.0 | 1500.0
4 ms
To bring it all together, I'll compare the two queries in full on length-1000 vectors:
> WITH RANGE(0, 1000) AS aa, RANGE(1000, 2000) AS bb
UNWIND aa AS a
UNWIND bb AS b
WITH aa, bb, SIZE(aa) AS n, AVG(a) AS avgA, AVG(b) AS avgB
RETURN REDUCE(s = 0, i IN RANGE(0,n-1)| s +((aa[i]-avgA)*(bb[i]-avgB)))/(n-1) AS covariance;
| covariance
---+------------
1 | 83583.5
9105 ms
> WITH RANGE(0, 1000) AS a, RANGE(1000, 2000) AS b
WITH REDUCE(s = 0.0, x IN a | s + x) / SIZE(a) AS e_a,
REDUCE(s = 0.0, x IN b | s + x) / SIZE(b) AS e_b,
SIZE(a) AS n, a, b
RETURN REDUCE(s = 0.0, i IN RANGE(0, n - 1) | s + ((a[i] - e_a) * (b[i] - e_b))) / (n - 1) AS cov;
| cov
---+---------
1 | 83583.5
33 ms
[EDITED]
This should calculate the covariance (according to your formula), given your sample inputs:
WITH [1,2,3,4] AS aa, [5,6,7,8] AS bb
UNWIND aa AS a
UNWIND bb AS b
WITH aa, bb, SIZE(aa) AS n, AVG(a) AS avgA, AVG(b) AS avgB
RETURN REDUCE(s = 0, i IN RANGE(0,n-1)| s +((aa[i]-avgA)*(bb[i]-avgB)))/(n-1) AS covariance;
This approach is OK when n is small, as is the case with the original sample data.
However, as #NicoleWhite and #jjaderberg point out, when n is not small, this approach will be inefficient. The answer by #NicoleWhite is an elegant general solution.
How do you arrive at collections A and B? The avg function is an aggregating function and cannot be used in the REDUCE context, nor can it be applied to collections. You should calculate your average before you get to that point, but exactly how to do that best depends on how you arrive at the two collections of values. If you are at a point where you have individual result items that you then collect to get A and B, that's the point when you could use avg. For example:
WITH [1, 2, 3, 4] AS aa UNWIND aa AS a
WITH collect(a) AS aa, avg(a) AS aAvg
RETURN aa, aAvg
and for both collections
WITH [1, 2, 3, 4] AS aColl UNWIND aColl AS a
WITH collect(a) AS aColl, avg(a) AS aAvg
WITH aColl, aAvg,[5, 6, 7, 8] AS bColl UNWIND bColl AS b
WITH aColl, aAvg, collect(b) AS bColl, avg(b) AS bAvg
RETURN aColl, aAvg, bColl, bAvg
Once you have the two averages, let's call them aAvg and bAvg, and the two collections, aColl and bColl, you can do
RETURN REDUCE(x = 0.0, i IN range(0, size(aColl) - 1) | x + ((aColl[i] - aAvg) * (bColl[i] - bAvg))) / (size(aColl) - 1) AS covariance
Thank you all so much. However, I wonder which one is the most efficient:
1) Nested UNWIND and RANGE inside REDUCE -> #cybersam
2) Nested REDUCE -> #Nicole White
3) Nested WITH (resetting the query with WITH) -> #jjaderberg
But the important issue is:
Why is there a difference between your results and the actual value?
Your covariance equals 1.6666666666666667,
but the covariance from the calculator below equals 1.25.
Please check: https://www.easycalculation.com/statistics/covariance.php
Vector X: [1, 2, 3, 4]
Vector Y: [5, 6, 7, 8]
I think this difference arises because some computations use n as the divisor instead of (n-1). When the divisor grows from n-1 to n, the result shrinks from about 1.67 to 1.25.
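The divisor does account for the whole gap between the two numbers. A minimal Python sketch, assuming plain lists and a hypothetical `covariance` helper whose `ddof` argument selects the divisor n - ddof:

```python
def covariance(xs, ys, ddof=1):
    """Covariance with divisor n - ddof: ddof=1 is sample, ddof=0 is population."""
    if len(xs) != len(ys):
        raise ValueError("vectors must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations from the two means.
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - ddof)

x = [1, 2, 3, 4]
y = [5, 6, 7, 8]
print(covariance(x, y, ddof=1))  # sample covariance, 5 / 3
print(covariance(x, y, ddof=0))  # population covariance, 1.25
```

The sum of products is 5 in both cases; dividing by 3 gives the Cypher results above and dividing by 4 gives the calculator's value.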

How to get a single value from a cell-range by matching multiple columns and rows

I'm struggling with this one.
Here is data from 'sheet1':
|| A B C D E
=========================================
1 || C1 C2 X1 X2 X3
.........................................
2 || a b 1 2 3
3 || a d 10 11 12
4 || c d 4 5 6
5 || c f 13 14 15
6 || e f 7 8 9
7 || e b 16 17 18
Here's data in "sheet2":
|| A B C D
=================================
1 || C1 C2 C3 | val
.................................
2 || a d X2 | ?
3 || c f X1 | ?
4 || e b X3 | ?
Note that column C in sheet2 holds values equal to the column names in sheet1.
I simply want to match A, B and C in sheet2 with columns A and B and row 1 in sheet1 to find the values in the last column:
|| A B C D
=================================
1 || C1 C2 C3 | val
.................................
2 || a d X2 | 11
3 || c f X1 | 13
4 || e b X3 | 18
I've been playing with OFFSET() and MATCH() but can't seem to lock down on one cell using multiple search criteria. Can someone help please?
I would use this formula in cell D2 of sheet2:
=index(filter(sheet1!C:E,sheet1!A:A=A2,sheet1!B:B=B2),1,match(C2,sheet1!$C$1:$E$1,0))
Explanation:
The FILTER function returns the X1, X2, X3 values (columns C, D, E of sheet1) of the row that matches these two conditions:
C1 is "a"
C2 is "d"
So it returns an array, [10,11,12], which holds the values of the X1, X2, X3 (C, D, E) columns of sheet1 in the matching row.
Then the INDEX function takes this array. Now we only need to determine which value to pick. The MATCH function does this by looking for the third condition, C3 (in this case "X2"), in the header row of sheet1. In this example it returns 2, as X2 is in the 2nd position of sheet1!C1:E1.
So the INDEX function returns the 2nd element of the array [10,11,12], which is 11, the desired value.
Hope this helps.
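The same filter-then-index logic, sketched in Python for readers who want to check it outside Sheets. The list-of-dicts layout and the `lookup` helper are assumptions made for illustration, not part of the spreadsheet:

```python
# sheet1 as a list of rows keyed by its header names.
sheet1 = [
    {"C1": "a", "C2": "b", "X1": 1, "X2": 2, "X3": 3},
    {"C1": "a", "C2": "d", "X1": 10, "X2": 11, "X3": 12},
    {"C1": "c", "C2": "d", "X1": 4, "X2": 5, "X3": 6},
    {"C1": "c", "C2": "f", "X1": 13, "X2": 14, "X3": 15},
    {"C1": "e", "C2": "f", "X1": 7, "X2": 8, "X3": 9},
    {"C1": "e", "C2": "b", "X1": 16, "X2": 17, "X3": 18},
]

def lookup(rows, c1, c2, column):
    """Return the value in `column` for the first row matching both keys."""
    for row in rows:  # plays the role of FILTER on columns A and B
        if row["C1"] == c1 and row["C2"] == c2:
            return row[column]  # plays the role of INDEX + MATCH on the header
    return None

# The three rows of sheet2 from the question.
for c1, c2, c3 in [("a", "d", "X2"), ("c", "f", "X1"), ("e", "b", "X3")]:
    print(c1, c2, c3, lookup(sheet1, c1, c2, c3))
```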

how to append every row of pandas dataframe to every row of another dataframe

For example, df1 is a 3*2 dataframe and df2 is a 10*3 dataframe. What I want is to generate a new 30*5 dataframe, where each row in df1 is extended with the 3 columns of df2, for all 10 rows in df2.
I know I can iterate to append the columns of df2 to each row of df1, but I am wondering whether there is a more efficient way to do this in pandas, such as its concat functions.
Could anyone help?
regards,
nan
If I understand you correctly, you need a Cartesian product. You can emulate this with merge in pandas:
>>> import pandas as pd
>>> df1 = pd.DataFrame({'A':list('abc'), 'B':range(3)})
>>> df2 = pd.DataFrame({'C':list('defg'), 'D':range(3,7)})
>>> df1['key'] = 1
>>> df2['key'] = 1
>>> df = pd.merge(df1, df2, on='key')
>>> del df['key']
>>> df
A B C D
0 a 0 d 3
1 a 0 e 4
2 a 0 f 5
3 a 0 g 6
4 b 1 d 3
5 b 1 e 4
6 b 1 f 5
7 b 1 g 6
8 c 2 d 3
9 c 2 e 4
10 c 2 f 5
11 c 2 g 6
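In pandas 1.2 and later, `merge` also accepts `how='cross'`, which should give the same Cartesian product without a temporary key column. A short sketch using the same example frames, assuming such a pandas version is installed:

```python
import pandas as pd

df1 = pd.DataFrame({'A': list('abc'), 'B': range(3)})
df2 = pd.DataFrame({'C': list('defg'), 'D': range(3, 7)})

# Cross join: every row of df1 paired with every row of df2.
df = pd.merge(df1, df2, how='cross')
print(df.shape)  # (12, 4)
```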

Assigning a list of values to a list of variables

Is there a way in Maxima to assign values to a list of variables? Say I have two lists:
var : [a, b, c];
val : [1, 2, 3];
... and I want to assign 1 to a, 2 to b etc. Of course by iterating over the lists somehow, not "manually", i.e. a : 1; b : 2 ...
Thanks!
Use the :: operator.
(%i4) x : '[a, b, c];
(%o4) [a, b, c]
(%i5) x :: [11, 22, 33];
(%o5) [11, 22, 33]
(%i6) a;
(%o6) 11
(%i7) b;
(%o7) 22
(%i8) c;
(%o8) 33
