My code takes an eternity to compute Jaccard similarity. The data comes from a .csv file with 100000 entries in it. I have already created indexes on the 2 basic nodes (id + value).
I have already used the Jaccard algorithm in Playground, but it also takes an eternity to run.
MATCH (i:Item)-[:HAS]->(p2:Properties)<-[:HAS]-(i1:Item)
WITH {item:id(i), categories: collect(id(i1))} as userData
WITH collect(userData) as data
CALL algo.similarity.jaccard.stream(data, {similarityCutoff: 0.5})
YIELD item1, item2, count1, count2, intersection, similarity
RETURN algo.asNode(item1).id AS from, algo.asNode(item2).id AS to, intersection, similarity
Can anyone help?
The first two lines of your query are not correct. You should run it like this:
OLD:
MATCH (i:Item)-[:HAS]->(p2:Properties)<-[:HAS]-(i1:Item)
WITH {item:id(i), categories: collect(id(i1))} as userData
NEW:
MATCH (i:Item)-[:HAS]->(p2:Properties)
WITH {item:id(i), categories: collect(id(p2))} as userData
This is what the Jaccard algorithm does. An item (say Item1) is similar to another item (say Item2), with a score from 0 to 1 inclusive, to the degree that the two share the same properties. For example, Item1 has 3 properties (1, 2, 3) and Item2 has 3 properties (2, 3, 4). The Jaccard similarity index is 2/4, or 0.5, because properties 2 and 3 are common and there are 4 unique properties across both items.
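The arithmetic in this example can be checked with a minimal Python sketch. The function name `jaccard` and the use of plain sets are assumptions made for illustration; this is not how the Neo4j library is implemented.

```python
def jaccard(a, b):
    """Jaccard index of two collections: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty sets are given similarity 0
    return len(a & b) / len(union)

item1 = {1, 2, 3}  # properties of Item1
item2 = {2, 3, 4}  # properties of Item2
print(jaccard(item1, item2))  # 0.5
```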
So in your query, you only need to specify that an item has some properties; you don't need to match a second item that has the same properties. The function will iterate over all items and compute the Jaccard index for each pair, that is, item1 vs item2, item1 vs item3, ..., item2 vs item3, and so on. This is the input that algo.similarity.jaccard.stream expects.
See reference here: https://neo4j.com/docs/graph-algorithms/current/labs-algorithms/jaccard/
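To make the pairwise iteration concrete, here is a rough Python sketch of what the answer describes: each item is listed once with its property ids, and every pair of items is compared. The names `pairwise_jaccard` and `data`, the sample ids, and the choice to keep pairs at or above the cutoff are all assumptions for illustration, not the library's actual behaviour.

```python
from itertools import combinations

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def pairwise_jaccard(data, cutoff=0.5):
    """Compare every item with every other item once.

    Returns (item1, item2, intersection size, similarity) for each pair
    whose similarity is at or above the cutoff (assumed inclusive here).
    """
    results = []
    for (id1, props1), (id2, props2) in combinations(data.items(), 2):
        sim = jaccard(props1, props2)
        if sim >= cutoff:
            results.append((id1, id2, len(props1 & props2), sim))
    return results

# Hypothetical data: item id -> set of property ids
data = {"item1": {1, 2, 3}, "item2": {2, 3, 4}, "item3": {7, 8}}
print(pairwise_jaccard(data))  # [('item1', 'item2', 2, 0.5)]
```

A brute-force comparison like this one looks at n(n-1)/2 pairs, so the work grows quadratically with the number of items.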
I have to write a query that returns the club or clubs with the largest number of players who do not represent the country the club is from.
My query works fine, but I want to filter the result so that it contains ONLY the clubs with the largest size.
Currently the largest size is 4, and there are 4 clubs that each have 4 such players, all of which should be returned.
The only thing that comes to mind is to use LIMIT 1 at the end, but that cuts out three clubs that also satisfy the predicate.
MATCH (c: Club)<-[r: PLAYS_FOR]-(p: Player)-[r2: REPRESENTS]->(n: NationalTeam)
WHERE c.country<>n.country
WITH c,collect(p.name) as list_players,n.country as country,size(collect(p.name)) as size
RETURN c,list_players,country,size ORDER BY size DESC LIMIT 1
edit:
I managed to do something like this, don't know if it's optimal, but it is working:
MATCH (c: Club)<-[r: PLAYS_FOR]-(p: Player)-[r2: REPRESENTS]->(n: NationalTeam)
WHERE c.country<>n.country
WITH c,collect(p.name) as list_players,n.country as country,size(collect(p.name)) as size
WITH c,list_players,country,size ORDER BY size DESC LIMIT 1
WITH size
MATCH (c: Club)<-[r: PLAYS_FOR]-(p: Player)-[r2: REPRESENTS]->(n: NationalTeam)
WHERE c.country<>n.country
WITH size,c,collect(p.name) as list_players,n.country as country,size(collect(p.name)) as size2 WHERE size2 = size
RETURN c,list_players,country,size
If you install APOC Procedures, there is an aggregation function you can use to get the items associated with a maximum value, and this works even when multiple items are tied for that value: apoc.agg.maxItems()
The trouble now is that all the club-specific data needs to be encapsulated into the item itself, so you'll need to add them to a map and use the map as the item, and the size of the person collection as the value.
Also, your aggregation isn't quite correct. You're collecting player names, but you have the country of the player as part of the grouping key (when you aggregate, all non-aggregation terms form the grouping key), and that likely isn't what you want. Maybe you wanted the country of the club instead?
Try working from this:
MATCH (c: Club)<-[r: PLAYS_FOR]-(p: Player)-[r2: REPRESENTS]->(n: NationalTeam)
WHERE c.country<>n.country
WITH c,collect(p) as list_players
WITH apoc.agg.maxItems({club:c, players:list_players}, size(list_players)) as maxResults
UNWIND maxResults.items as result
WITH result.club as c, [player IN result.players | player.name] as list_players, maxResults.value as size
RETURN c,list_players,size
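For intuition, the tie-preserving behaviour described above (a result with `items` and `value`, as used in the query) can be sketched in plain Python. The function `max_items` and the sample clubs are hypothetical; this only mimics the idea and is not APOC's implementation.

```python
def max_items(pairs):
    """Return the maximum value together with every item tied for it."""
    best_value = None
    best_items = []
    for item, value in pairs:
        if best_value is None or value > best_value:
            # a new maximum replaces everything collected so far
            best_value, best_items = value, [item]
        elif value == best_value:
            # a tie is kept alongside the earlier items
            best_items.append(item)
    return {"items": best_items, "value": best_value}

# Hypothetical clubs and their qualifying players
clubs = {"ClubA": ["p1", "p2"], "ClubB": ["p3"], "ClubC": ["p4", "p5"]}
result = max_items(
    ({"club": club, "players": players}, len(players))
    for club, players in clubs.items()
)
print(result["value"])                          # 2
print([r["club"] for r in result["items"]])     # ['ClubA', 'ClubC']
```

Unlike ORDER BY ... LIMIT 1, both clubs that share the maximum are returned.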
I have data in the format below, where the 1st column represents the product nodes and all the following columns represent properties of the products. I want to apply a content-based filtering algorithm using cosine similarity in Neo4j. For that, I believe, I need to define the fx columns as properties of each product node, read these properties as a vector, and then apply cosine similarity between the products. I am having trouble doing two things:
1. How to define these columns as properties in one go (as there could be more than 100 columns).
2. How to read all the property values as a vector to be able to apply cosine similarity.
Product f1 f2 f3 f4 f5
P1 0 1 0 1 1
P2 1 0 1 1 0
P3 1 1 1 1 1
P4 0 0 0 1 0
You can use LOAD CSV to input your data.
For example, this query will read in your data file and output for each input line (ignoring the header line) a name string and a props collection:
LOAD CSV FROM 'file:///data.csv' AS line FIELDTERMINATOR ' '
WITH line SKIP 1
RETURN HEAD(line) AS name, [p IN TAIL(line) | TOFLOAT(p)] AS props
Even though your data has a header line, the above query skips over it, as it is not needed. In fact, we don't want to use the WITH HEADERS option of LOAD CSV, since that would convert each data line into a map, whereas it is more convenient for our current purposes to get each data line as a collection of values.
The above query assumes that all the columns are space-separated, that the first column will always contain a name string, and that all other columns contain the numeric values that should be put into the same collection (named props).
If you replace RETURN with WITH, you can append additional clauses to the query that make use of the name and props values.
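As a way to sanity-check the values outside the database, the same name/props split and the cosine similarity the question asks about can be sketched in Python. The helper names `parse_line` and `cosine` are assumptions for illustration, and the rows are the sample data from the question.

```python
import math

def parse_line(line):
    """Split a space-separated line into a name and a list of floats."""
    fields = line.split(" ")
    return fields[0], [float(p) for p in fields[1:]]

def cosine(u, v):
    """Cosine similarity: dot(u, v) / (|u| * |v|)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0  # assumption: zero vectors score 0

rows = [
    "Product f1 f2 f3 f4 f5",
    "P1 0 1 0 1 1",
    "P2 1 0 1 1 0",
    "P3 1 1 1 1 1",
    "P4 0 0 0 1 0",
]
# Skip the header line, as the Cypher query does
products = dict(parse_line(row) for row in rows[1:])
print(products["P1"])  # [0.0, 1.0, 0.0, 1.0, 1.0]
print(round(cosine(products["P1"], products["P2"]), 4))
```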
Suppose that I have a table:
0.8
0.7
0.9
0.5
And I want to get the index of 2 largest values, so in this case, it should return:
3 1
I am quite new to Lua, so any help is more than welcome.
Thanks a lot,
You can loop over a table using a for loop (for i = 1, #tbl or for i, val in ipairs(tbl)) and keep track of the largest and second-largest elements (you'll need to store the first index with its value and the second index with its value, so that you can compare values and save indexes). After the loop is done, you have the indexes of the first and second largest elements. Keep in mind that when the first value is updated, its old value may need to be checked against the second value.
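The single-pass idea can be sketched as follows. This is written in Python rather than Lua, with 1-based indexes to match Lua tables; the function name `top_two_indices` is hypothetical.

```python
def top_two_indices(values):
    """One pass; returns 1-based indexes of the largest and second-largest values."""
    first_i = second_i = None
    for i, val in enumerate(values, start=1):
        if first_i is None or val > values[first_i - 1]:
            # the old largest element becomes the second largest
            second_i = first_i
            first_i = i
        elif second_i is None or val > values[second_i - 1]:
            second_i = i
    return first_i, second_i

print(top_two_indices([0.8, 0.7, 0.9, 0.5]))  # (3, 1)
```

The demotion step in the first branch is the point made above: when a new maximum is found, the previous maximum is the new runner-up.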
Another option is to build an array of indexes and sort it based on the values (as the sort can take an optional comparator function):
local function indexsort(tbl)
local idx = {}
for i = 1, #tbl do idx[i] = i end -- build a table of indexes
-- sort the indexes, but use the values as the sorting criteria
table.sort(idx, function(a, b) return tbl[a] > tbl[b] end)
-- return the sorted indexes
return (table.unpack or unpack)(idx)
end
local tbl = {0.8, 0.7, 0.9, 0.5}
print(indexsort(tbl))
This prints 3 1 2 4. If you only need the first two indexes, you can do local first, second = indexsort(tbl). Note that indexsort returns all indexes, so if you only need the first two (and your table is large), you may want to update the function to only return the first two items instead of the entire table.
No matter how I swing it, I need some kind of function to find the index of an item in an array supplied as a parameter.
I am trying to simply update items in a collection based on the index of one of their properties in an array, and have been poring through Cypher docs for nearly 2 hours...
It would also be acceptable to order the items by that array, and then run a foreach on the ordered list...
Following @stefan-armbruster's answer and great blog post, a slow but simple index_of can be done with:
reduce(x=[-1,0], i IN [1,2,7,5,21,5,1,435] |
CASE WHEN i = 21 THEN [x[1], x[1]+1] ELSE [x[0], x[1]+1] END
)[0]
Here the reduce function works with a two-element array as its state: the position found so far and the current index. If an element in your array matches the given condition, the first element of the state array is replaced with the current index.
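The same state-passing trick can be mirrored with Python's functools.reduce, which may make the mechanics easier to follow. The function name `index_of` is an assumption for illustration.

```python
from functools import reduce

def index_of(items, target):
    """State is [position found so far, current index]; -1 means not found."""
    def step(state, item):
        position, current = state
        if item == target:
            return [current, current + 1]   # record the current index
        return [position, current + 1]      # keep the earlier position
    return reduce(step, items, [-1, 0])[0]

print(index_of([1, 2, 7, 5, 21, 5, 1, 435], 21))  # 4
```

Note that every match overwrites the stored position, so with duplicates this returns the index of the last match.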
I put an example on neo4j console http://console.neo4j.org/?id=34byv
I've blogged about that recently. You can use the reduce function with a three-element array as the state variable, containing:
1. the index of the highest occupation so far
2. the current index (aka iteration number)
3. the value of the highest occupation so far
As an example to find the index of max element in an array:
RETURN reduce(x=[0,0,0], i IN [1,2,2,5,2,1] |
CASE WHEN i>x[2] THEN [x[1],x[1]+1,i] ELSE [x[0], x[1]+1,x[2]] END
)[0]
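For comparison, the three-element state can be written out with Python's functools.reduce. The function name `index_of_max` is hypothetical, and like the Cypher version it starts from a best value of 0, so it assumes the list contains positive numbers.

```python
from functools import reduce

def index_of_max(items):
    """State is [index of max so far, current index, max value so far]."""
    def step(state, item):
        best_index, current, best_value = state
        if item > best_value:
            return [current, current + 1, item]
        return [best_index, current + 1, best_value]
    # starting best value is 0, so only positive inputs are handled correctly
    return reduce(step, items, [0, 0, 0])[0]

print(index_of_max([1, 2, 2, 5, 2, 1]))  # 3
```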
So, let's say I have relationship r, with property r.myarray:
[1,2,3,4,5,6,7]
and I need to write a query which will replace the items in the array, up to and including an arbitrary member guaranteed to be in the array (let's say 3 in this case), with another array, let's say:
[6,12,13]
to get result:
[6,12,13,4,5,6,7]
I got as far as seeing that you can use RANGE or subset notation for the array (e.g. r.myarray[0..x]) to specify part of the array, and could theoretically do SET to replace the array with the first array plus the second subset (r.myarray[x..r.myarray.length], or something like that). I am about half a mile from a complete answer here, though.
edit: Final, interpolat-able query:
START r=relationship(726)
SET r.myarray = [1,2,3,4] + filter(y in r.ancestors where NOT (y IN [718]));
Range probably isn't what you want. Range produces a collection of numbers. It's good for looping, like if you want to go through all the numbers from 1-10, but it's not that useful with other array indexes. You probably want a combination of the + operator on collections, index operations, with possibly a dash of extract and filter. Combining those will let you do basically whatever you like. Here are some examples of the things you can do. I'm using a WITH clause just to show a data sample, you could of course do this on any node property:
/* Return only the first three items */
with [1,2,3,4,5,6,7] as arr return arr[0..3];
/* Cut out the 4th item, otherwise return everything */
with [1,2,3,4,5,6,7] as arr return arr[0..3] + arr[4..];
/* Return only the even numbers */
with [1,2,3,4,5,6,7] as arr
return filter(y in
extract(x in arr | case when (x % 2 = 0) then x end) where y > 0);
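Putting slicing and concatenation together, the replacement asked about in the question (swap everything up to and including a given member for another array) can be sketched in Python to check the expected result. The function name `replace_prefix` is an assumption for illustration, and the member is assumed to be present, as the question states.

```python
def replace_prefix(arr, member, replacement):
    """Replace arr[0..index of member] (inclusive) with the replacement list."""
    cut = arr.index(member) + 1      # first position after the member
    return replacement + arr[cut:]   # new prefix + untouched tail

print(replace_prefix([1, 2, 3, 4, 5, 6, 7], 3, [6, 12, 13]))
# [6, 12, 13, 4, 5, 6, 7]
```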