Find most frequently used distinct set of terms

Find most frequently used distinct set of terms - neo4j

Imagine a graph database composed of URLs and tags used to describe them. From this we want to find which sets of tags are most frequently used together and determine which URLs belong in each identified set.
I've tried to create a dataset which simplifies this problem as such in cypher:
CREATE (tech:Tag { name: "tech" }), (comp:Tag { name: "computers" }), (programming:Tag { name: "programming" }), (cat:Tag { name: "cats" }), (mice:Tag { name: "mice" }), (u1:Url { name: "http://u1.com" })-[:IS_ABOUT]->(tech), (u1)-[:IS_ABOUT]->(comp), (u1)-[:IS_ABOUT]->(mice), (u2:Url { name: "http://u2.com" })-[:IS_ABOUT]->(mice), (u2)-[:IS_ABOUT]->(cat), (u3:Url { name: "http://u3.com" })-[:IS_ABOUT]->(tech), (u3)-[:IS_ABOUT]->(programming), (u4:Url { name: "http://u4.com" })-[:IS_ABOUT]->(tech), (u4)-[:IS_ABOUT]->(mice), (u4)-[:IS_ABOUT]->(acc:Tag { name: "accessories" })
Using this as a reference (neo4j console example here), we can look at it and visually identify that the most commonly used tags are tech and mice (the query for this is trivial) both referencing 3 URLs. The most commonly used tag pair is [tech, mice] as it (in this example) is the only pair shared by 2 urls (u4, and u1). It's important to note that this tag pair is a subset of the matched URLs, it's not the entire set for either. There is no combination of 3 tags shared by any urls.
How can I write a cypher query to identify which tag combinations are the most frequently used together (either in pairs, or in N sized groups)? Perhaps there's a better way to structure this data which would make analysis easier? Or is this problem is not well suited for a Graph DB? Been struggling a bit trying to figure this one out, any help or thoughts would be appreciated!

It looks like the problem on combinatorics.
// The tags for each URL, sorted by ID
MATCH (U:Url)-[:IS_ABOUT]->(T:Tag)
WITH U, T ORDER BY id(T)
WITH U,
collect(distinct T) as TAGS
// Calc the number of combinations of tags for a node,
// independent of the order of tags
// Since the construction of the power in the cyper is not available,
// use the logarithm and exponent
//
WITH U, TAGS,
toInt(floor(exp(log(2) * size(TAGS)))) as numberOfCombinations
// Iterate through all combinations
UNWIND RANGE(0, numberOfCombinations) as combinationIndex
WITH U, TAGS, combinationIndex
// And check for each tag its presence in combination
// Bitwise operations are missing in the cypher,
// therefore, we use APOC
// https://neo4j-contrib.github.io/neo4j-apoc-procedures/#_bitwise_operations
//
UNWIND RANGE(0, size(TAGS)-1) as tagIndex
WITH U, TAGS, combinationIndex, tagIndex,
toInt(ceil(exp(log(2) * tagIndex))) as pw2
call apoc.bitwise.op(combinationIndex, "&", pw2) YIELD value
WITH U, TAGS, combinationIndex, tagIndex,
value WHERE value > 0
// Get all combinations of tags for URL
WITH U, TAGS, combinationIndex,
collect(TAGS[tagIndex]) as combination
// Return all the possible combinations of tags, sorted by frequency of use
RETURN combination, count(combination) as freq, collect(U) as urls
ORDER BY freq DESC
I think that it is best to calculate and store the tag combination with the use of this algorithm at the time of tagging. And the query will be something like this:
MATCH (Comb:TagsCombination)<-[:IS_ABOUT]-(U:Url)
WITH Comb, collect(U) as urls, count(U) as freq
MATCH (Comb)-[:CONTAIN]->(T:Tag)
RETURN Comb, collect(T) as Tags, urls, freq ORDER BY freq DESC

Start at the URL nodes, build a tuple of tag.name objects (order it first so they all group up the same). That'll give you all of the possible combinations of tags that exist. Then, use filters to find out how many url's match for each possible set of tags.
MATCH (u:url)
WITH u
MATCH (u) - [:IS_ABOUT] -> (t:tag)
WITH u, t
ORDER BY t.name
WITH u, [x IN COLLECT(t)|x.name] AS tags
WITH DISTINCT tags
MATCH (u)
WHERE ALL(tag IN tags WHERE (u) - [:IS_ABOUT] -> (tag))
RETURN tags, count(u)

Related

match a branching path of variable length

I have a graph which looks like this:
Here is the link to the graph in the neo4j console:
http://console.neo4j.org/?id=av3001
Basically, you have two branching paths, of variable length. I want to match the two paths between orange node and yellow nodes. I want to return one row of data for each path, including all traversed nodes. I also want to be able to include different WHERE clauses on different intermediate nodes.
At the end, i need to have a table of data, like this:
a - b - c - d
neo - morpheus - null - leo
neo - morpheus - trinity - cypher
How could i do that?
I have tried using OPTIONAL MATCH, but i can't get the two rows separately.
I have tried using variable length path, which returns the two paths but doesn't allow me to access and filter intermediate nodes. Plus it returns a list, and not a table of data.
I've seen this question:
Cypher - matching two different possible paths and return both
It's on the same subject but the example is very complex, a more generic solution to this simpler problem is what i'm looking for.

You can define what your end node by using WHERE statement. So in your case end node has no outgoing relationship. Not sure why you expect a null on return as you said neo - morpheus - null - leo
MATCH p=(n:Person{name:"Neo"})-[*]->(end) where not (end)-->()
RETURN extract(x IN nodes(p) | x.name)
Edit:
may not the the best option as I am not sure how to do this programmatically. If I use UNWIND I get back only one row. So this is a dummy solution
MATCH p=(n{name:"Neo"})-[*]->(end) where not (end)-->()
with nodes(p) as list
return list[0].name,list[1].name,list[2].name,list[3].name

You can use Cypher to match a path like this MATCH p=(:a)-[*]->(:d) RETURN p, and p will be a list of nodes/relationships in the path in the order it was traversed. You can apply WHERE to filter the path just like with node matching, and apply any list functions you need to it.
I will add these examples too
// Where on path
MATCH p=(:a)-[*]-(:d) WHERE NONE(n in NODES(p) WHERE n.name="Trinity") WITH NODES(p) as p RETURN p[0], p[1], p[2], p[3]
// Spit path into columns
MATCH p=(:a)-[*]-(:d) WITH NODES(p) as p RETURN p[0], p[1], p[2], p[3]
// Match path, filter on label
MATCH p=(:a)-[*]-(:d) WITH NODES(p) as p RETURN FILTER(n in p WHERE "a" in LABELS(n)) as a, FILTER(n in p WHERE "b" in LABELS(n)) as b, FILTER(n in p WHERE "c" in LABELS(n)) as c, FILTER(n in p WHERE "d" in LABELS(n)) as d
Unfortunately, you HAVE to explicitly set some logic for each column. You can't make dynamic columns (that I know of). In your table example, what is the rule for which column gets 'null'? In the last example, I set each column to be the set of nodes of a label.

I.m.o. you're asking for extensive post-processing of the results of a simply query (give me all the paths starting from Neo). I say this because :
You state you need to be able to specify specific WHERE clauses for each path (but you don't specify which clauses for which path ... indicating this might be a dynamic thing ?)
You don't know the size of the longest path beforehand ... but you still want the result to be a same-size-for-all-results table. And would any null columns then always be just before the end node ? Why (for that makes no real sense other then convenience) ?
...
Therefore (and again i.m.o.) you need to process the results in a (Java or whatever you prefer) program. There you'll have full control over the resultset and be able to slice and dice as you wish. Cypher (exactly like SQL in fact) can only do so much and it seems that you're going beyond that.
Hope this helps,
Regards,
Tom
P.S. This may seem like an easy opt-out, but look at how simple your query is as compared to the constructs that have to be wrought trying to answer your logic. So ... separate the concerns.

Neo4J node traversal cypher where clause for each node

I've been playing with neo4j for a geneology site and it's worked great!
I've run into a snag where finding the starting node isn't as easy. Looking through the docs and the posts online I haven't seen anything that hints at this so maybe it isn't possible.
What I would like to do is pass in a list of genders and from that list follow a specific path through the nodes to get a single node.
in context of the family:
I want to get my mother's father's mother's mother. so I have my id so I would start there and traverse four nodes from mine.
so pseudo query would be
select person (follow childof relationship)
where starting node is me
where firstNode.gender == female
AND secondNode.gender == male
AND thirdNode.gender == female
AND fourthNode.gender == female

Focusing on the general solution:
MATCH p = (me:Person)-[:IS_CHILD_OF*]->(ancestor:Person)
WHERE me.uuid = {uuid}
AND length(p) = size({genders})
AND extract(x in tail(nodes(p)) | x.gender) = {genders}
RETURN ancestor
here's how it works:
match the starting node by id
match all the variable-length paths going to any ancestor
constrain the length of the path (i.e. the number of relationships, which is the same as the number of ancestors), as you can't parameterize the length in the query
extract the genders in the path
nodes(p) returns all the nodes in the path, including the starting node
tail(nodes(p)) skips the first element of the list, i.e. the starting node, so now we only have the ancestors
extract() extracts the genders of all the ancestor nodes, i.e. it transforms the list of ancestor nodes into their genders
the extracted list of genders can be compared to the parameter
if the path matched, we can return the bound ancestor, which is the end of the path
However, I don't think it will be faster than the explicit solution, though the performance could remain comparable. On my small test data (just 5 nodes), the general solution does 26 DB accesses whereas the specific solution only does 22, as reported by PROFILE. Further profiling would be needed on a larger database to compare the performances:
PROFILE MATCH p = (me:Person)-[:IS_CHILD_OF*]->(ancestor:Person)
WHERE me.uuid = {uuid}
AND length(p) = size({genders})
AND extract(x in tail(nodes(p)) | x.gender) = {genders}
RETURN ancestor
The general solution has the advantage of being a single query which won't need to be parsed again by the Cypher engine, whereas each generated query will need to be parsed.

It was more simple than I thought. Maybe there is still a better way so I'll leave this open for a bit.
the query would be
MATCH (n1:Person { Id: 'f59c40de-506d-4829-a765-7a3ae94af8d1' })
<-[:CHILDOF]-(n2 { Gender:'0'})
<-[:CHILDOF]-(n3 { Gender:'1'})
<-[:CHILDOF]-(n4 { Gender:'1'})
RETURN n4
and for each generation back would add a new row.

The equivalent query would look something like this:
MATCH (me:Person)
WHERE me.ID = ?
WITH me
MATCH (me)-[r:childof*4]->(ancestor:Person)
WITH ancestor, EXTRACT(rel IN r | endNode(rel).gender) AS genders
WHERE genders = ?
RETURN ancestor
Disclaimer, I haven't double-checked the syntax.
In Neo4j you typically find your start node first, typically by an ID of some sort (modify as required to match on a unique property). We then traverse a number of relationships to an ancestor, extract the gender property of all end nodes in the traversed relationships, and compare the genders to the expected list of genders (you'll need to make sure the argument is a bracketed list in the desired order).
Note that this approach filters down all possible results with that degree of childof relationship as opposed to walking your graph, so higher degrees of relationship (the higher the degree of ancestry you're querying), the slower the call will get.
I'm also unsure if you can parameterize the degree of the variable relationship, so that might prevent this from being a generalized solution for any degree of ancestry.

I'm not sure if you want a generic query which can work whatever the collection of genders you pass, or a specific solution.
Here's the specific solution: you match the path with the wanted length, and match each gender, as you've already noted in your own answer.
MATCH (me:Person)-[:IS_CHILD_OF]->(p1:Person)
-[:IS_CHILD_OF]->(p2:Person)
-[:IS_CHILD_OF]->(p3:Person)
-[:IS_CHILD_OF]->(p4:Person)
WHERE me.uuid = {uuid}
AND p1.gender = {genders}[0]
AND p2.gender = {genders}[1]
AND p3.gender = {genders}[2]
AND p4.gender = {genders}[3]
RETURN p4
Now, if you want to pass in a list of genders of an arbitrary length, it's actually possible. You match a variable-length path, make sure it has the right length (matching the number of genders), then match each gender in sequence.
MATCH p = (me:Person)-[:IS_CHILD_OF*]->(ancestor:Person)
WHERE me.uuid = {uuid}
AND length(p) = size({genders})
AND all(i IN range(0, size({genders}) - 1)
WHERE {genders}[i] = extract(x in tail(nodes(p)) | x.gender)[i])
RETURN ancestor
Building on #InverseFalcon's answer, you can actually compare collections, which simplifies the query:
MATCH p = (me:Person)-[:IS_CHILD_OF*]->(ancestor:Person)
WHERE me.uuid = {uuid}
AND length(p) = size({genders})
AND extract(x in tail(nodes(p)) | x.gender) = {genders}
RETURN ancestor

Keep track of Changes in Neo4j - Achieve functionality like a "flag" variable in standard programming

Initial setup of the sample database is provided link to console
There are various cases and within each case, there are performers(with properties id and name). This is the continuation of problems defined problem statement and solution to unique node creation
The solution in the second link is (credits to Christophe Willemsen
)
MATCH (n:Performer)
WITH collect(DISTINCT (n.name)) AS names
UNWIND names as name
MERGE (nn:NewUniqueNode {name:name})
WITH names
MATCH (c:Case)
MATCH (p1)-[r:RELATES_TO]->(p2)<-[:RELATES]-(c)-[:RELATES]->(p1)
WITH r
ORDER BY r.length
MATCH (nn1:NewUniqueNode {name:startNode(r).name})
MATCH (nn2:NewUniqueNode {name:endNode(r).name})
MERGE (nn1)-[rf:FINAL_RESULT]->(nn2)
SET rf.strength = CASE WHEN rf.strength IS NULL THEN r.value ELSE rf.strength + r.value END
This solution achieved what was asked for.
But I need to achieve something like this.
foreach (Case.id in the database)
{
foreach(distinct value of r.Length)
{
//update value property of node normal
normal.value=normal.value+0.5^(r.Length-2)
//create new nodes and add the result as their relationship or merge it to existing one
MATCH (nn1:NewUniqueNode {name:startNode(r).name})
MATCH (nn2:NewUniqueNode {name:endNode(r).name})
MERGE (nn1)-[rf:FINAL_RESULT]->(nn2)
//
rf.strength=rf.strength + r.value*0.5^(r.Length-2);
}
}
The problem is to track the change in the case and then the r.Length property. How can it be achieved in Cypher?

I will not redo the last part, where setting strengths.
One thing though, in your console link, there is only one Normal node, so why do you need to iterate over each case, you can just match distinct relationships lengths.
By the way for the first part :
MATCH (n:Case)
MATCH (n)-[:RELATES]->()-[r:RELATES_TO]->()<-[:RELATES]-(n)
WITH collect(DISTINCT (r.length)) AS lengths
MATCH (normal:Normal)
UNWIND lengths AS l
SET normal.value = normal.value +(0.5^l)
RETURN normal.value
Explanations :
Match the cases
Foreach case, match the RELATES_TO relationships of Performers for that Case
collect distinct lengths
Match the Normal node, iterate the the distinct lengths collection and setting the proper value on the normal node

Neo4J Arrays in MATCH query

The intention of my Query is to mark similar words.
CREATE CONSTRAINT ON (n:Word) ASSERT n.title IS UNIQUE
MATCH (n) WHERE ID(n)={id}
MERGE (o:Word{title:{title}})
WITH n,o MERGE n-[r:SIMILAR{location:'{location}'}]->o
RETURN ID(o)
n is a existing Word. I want to create the relationsship & the other Word (o) if they don't exist yet.
The Problem with this query is, that it works for one title, but if I use a Array with titles the title of the Word o is the whole Array.
Can you suggest me another Query that does the same and/or a way to pass multiple values to title.
I'm using the Neography Gem on Rails e.g. the REST API

To use individual values in a parameter array you can use FOREACH, something like
MATCH (n)
WHERE ID (n) = {id}
FOREACH (t IN {title} |
MERGE (o:Word {title:t})
MERGE n-[:SIMILAR]->o
)
If you want to pass location also as a parameter (it is actually a string literal in your current query), such that merge operations for n should happen for each title, location pair in a parameter array, you can try
FOREACH (map IN {maps} |
MERGE (o:Word {title:map.title})
MERGE n-[:SIMILAR {location:map.location}]->o
)
with a parameter that looks something like
{
"maps": [
{
"title":"neography",
"location":"1.."
},{
"title":"coreography",
"location":"3.."
}
]
}
Other suggestions:
It's usually not great to look up nodes by internal id from parameter. In some cases when chaining queries it may be fine, but in most cases label index lookup would be better: MATCH (n:Word {title:"geography"})
If you are not using the transactional cypher endpoint, give it a shot. You can then make one or more calls with one or more queries in each call within one transaction. Performance improves and you may find you don't need to send the more complex parameter object, but can send many simple queries.

How to get nodes from relationships with any depth including relation filter

I'm using a query similar to this one:
(n)-[*]->(m)
Any depth.
But I cannot filter the relation name in such a query like this:
(n)-[*:DOES]->(m)
Any depth.
I need to filter the relation name since there are different relations on the related path. If it helps, here is my graph:
CREATE (Computer { name:'Computer' }),(Programming { name:'Programming' }),(Java { name:'Java' }),(GUI { name:'GUI' }),(Button { name:'Button' }), Computer<-[:IS]-Programming, Programming<-[:IS]-Java, Java<-[:IS]-GUI, GUI<-[:IS]-Button, (Ekin { name:'Ekin' }), (Gunes { name:'Gunes' }), (Ilker {name:'Ilker'}), Ekin-[:DOES]->Programming, Ilker-[:DOES]->Java, Ilker-[:DOES]->Button, Gunes-[:DOES]->Java
I'd like to get the names (Ekin, Ilker and Gunes) which have "DOES" relationship connected to "Programming" with any depth.
Edit:
I'm able to get the values I want by merging two different queries' results (think 13 is the top node that I want to reach):
START n=node(13)
MATCH p-[:DOES]->()-[*]->(n)
RETURN DISTINCT p
START n=node(13)
MATCH p-[:DOES]->(n)
RETURN DISTINCT p
I want to do it in a single query.

Change the matching pattern to "p-[:DOES]->()-[*0..]->n",
Match p-[:DOES]->()-[*0..]->n
Return distinct p.name
The variable length relationship "[*]" means 1..*. You need 0..* length relationships on the path.

Just to update the answer with Neo4j 3.0.
MATCH p-[:DOES*0..]->(n)
RETURN DISTINCT(p.name)
It returns the same result as the accepted answer.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart

Find most frequently used distinct set of terms - neo4j

Related

match a branching path of variable length

Neo4J node traversal cypher where clause for each node

Keep track of Changes in Neo4j - Achieve functionality like a "flag" variable in standard programming

Neo4J Arrays in MATCH query

How to get nodes from relationships with any depth including relation filter

Categories

Resources