I'm trying to model a large knowledge graph (using Neo4j 3.1.1).
My graph contains only two node labels (:Topic and :Properties) and a single relationship type (HAS_PROPERTIES).
There are about 85M nodes in total (47M :Topic; the rest are :Properties).
I'm trying to find the most connected :Topic node, using the following query:
MATCH (n:Topic)-[r]-()
RETURN n, count(DISTINCT r) AS num
ORDER BY num
This query, like almost any query I run that counts relationships and orders by that count (without filtering the results), is extremely slow: after more than 10 minutes there is still no response.
Am I missing indexes, or is there better syntax?
Is there any chance I can execute this query in a reasonable time?
Use this:
MATCH (n:Topic)
RETURN n, size( (n)--() ) AS num
ORDER BY num DESC
LIMIT 100
This reads the degree directly from each node instead of expanding and counting all of its relationships.
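If you only care about a single relationship type (the question mentions HAS_PROPERTIES), the same degree lookup can be narrowed to that type; a small variation on the query above:
MATCH (n:Topic)
RETURN n, size( (n)-[:HAS_PROPERTIES]-() ) AS num
ORDER BY num DESC
LIMIT 100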
I am trying to create a simple Cypher query that finds all instances in the graph matching roughly this structure: (BlogPost A) -> (Term) <- (BlogPost B). In other words, I want all pairs of blog posts that are flagged with the same term, and I also want to count the number of shared terms. A term is a categorization mechanism in this context.
Here is my query proposal:
MATCH (blogA:content {entitySubType:'blog'})
WITH blogA MATCH (blogA) -[]-> (t:term) <-[]- (blogB:content)
WHERE blogB.entitySubType='blog' AND NOT (ID(blogA) = ID(blogB))
RETURN ID(blogA), ID(blogB), count(t) ;
This query ends with null after roughly one day.
Is the usage of blogA in the subquery not possible in the way I am using it? When using the same query with limits, I do get results:
MATCH (blogA:content {entitySubType:'blog'})
WITH blogA
LIMIT 10
MATCH (blogA) -[]-> (t:term) <-[]- (blogB:content)
WHERE blogB.entitySubType='blog' AND NOT (ID(blogA) = ID(blogB))
RETURN ID(blogA), ID(blogB), count(t)
LIMIT 20;
My Neo4j instance has ~500 GB RAM, and the whole graph including all properties is ~30 GB, with ~15 million vertices in total, of which 101k are blog vertices and 108k are terms.
I would be grateful for any hint about possible problems or suggestions for improvement.
Also make sure to consume that query with a client driver (e.g. Java) that can stream the potentially billions of result rows. Here is a query that can use the compiled runtime, which should be the fastest and most memory-efficient option (note that it assumes the blog posts carry a :Blog label and an explicit :TAGGED relationship type, which avoids filtering on the entitySubType property):
MATCH (blogA:Blog)-[:TAGGED]->(t:Term)<-[:TAGGED]-(blogB:Blog)
WHERE blogA <> blogB
RETURN ID(blogA), ID(blogB), count(t);
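As a rough sketch of what streaming consumption can look like with the official Java driver (1.x API; the URI, credentials, and output handling here are placeholders, not from the original answer):
import org.neo4j.driver.v1.*;

public class StreamBlogPairs {
    public static void main(String[] args) {
        // Placeholder connection details -- adjust for your instance.
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                                                  AuthTokens.basic("neo4j", "secret"));
             Session session = driver.session()) {
            StatementResult result = session.run(
                "MATCH (blogA:Blog)-[:TAGGED]->(t:Term)<-[:TAGGED]-(blogB:Blog) " +
                "WHERE blogA <> blogB " +
                "RETURN id(blogA) AS a, id(blogB) AS b, count(t) AS terms");
            // Records arrive lazily over Bolt, one at a time, instead of
            // being buffered in full, so huge result sets stay manageable.
            while (result.hasNext()) {
                Record row = result.next();
                System.out.printf("%d %d %d%n",
                        row.get("a").asLong(),
                        row.get("b").asLong(),
                        row.get("terms").asLong());
            }
        }
    }
}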
I have only about 16,000 nodes in my database, and when I run match n return n I never get any graph back. Any idea why, or how to fix it?
One possibility is that there are no relationships between the nodes; when using the browser GUI, check the result as a table instead.
You can also limit the result for testing, e.g. match (n) return n limit 25.
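As a quick diagnostic (a sketch, not from the original answer), you can check whether the database contains any relationships at all:
MATCH ()-[r]->()
RETURN count(r)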
I just imported the English Wikipedia into Neo4j and am playing around. I started by looking up the pages that link to the page "Berlin":
MATCH p=(p1:Page {title:"Berlin"})<-[*1..1]-(otherPage)
WITH nodes(p) as neighbors
LIMIT 500
RETURN DISTINCT neighbors
That works quite well. What I would like to achieve next is to show the second degree of relationships. In order to display them correctly, I would like to limit the number of first-degree relationship nodes to 20 and then query the next level of relationships.
How does one achieve that?
I don't know the Wikipedia model, but I assume there are many different relationship types, which is why you wrote -[*1..1]-. I think that is analogous to -[]- or even --, and I doubt it has any serious impact either way.
You can collect up the first-level matches and limit them to 20 using a WITH with a LIMIT. You can then perform a second match using those (at most 20) other pages as the start point.
MATCH (p1:Page {title:"Berlin"})<-[*1..1]-(otherPage:Page)
WITH p1, otherPage
LIMIT 20
MATCH (otherPage)<-[*1..1]-(secondDegree:Page)
WHERE secondDegree <> p1
WITH otherPage, secondDegree
LIMIT 500
RETURN otherPage, COLLECT(secondDegree)
There are many ways to return the data; this just returns the first-degree match with an array of the subsequent matches.
If the only relationship type is :Link and you want to keep the start node, then you can change the query to this:
MATCH (p1:Page {title:"Berlin"})<-[:Link]-(otherPage:Page)
WITH p1, otherPage
LIMIT 20
MATCH (otherPage)<-[:Link]-(secondDegree:Page)
WHERE secondDegree <> p1
WITH p1, otherPage, secondDegree
LIMIT 500
RETURN p1, otherPage, COLLECT(secondDegree)
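If you mainly care about how many second-degree pages each neighbour reaches, rather than the pages themselves, here is a small variation along the same lines (a sketch, not from the original answer):
MATCH (p1:Page {title:"Berlin"})<-[:Link]-(otherPage:Page)
WITH p1, otherPage
LIMIT 20
MATCH (otherPage)<-[:Link]-(secondDegree:Page)
WHERE secondDegree <> p1
RETURN otherPage, count(DISTINCT secondDegree) AS secondDegreeCount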
I am trying to search for a keyword in all the indexes I have in my graph database.
Below is the query:
start n=node:Users(Name="Hello"),
m=node:Location(LocationName="Hello")
return n,m
I get the nodes if the keyword "Hello" is present in both indexes (Users and Location), but I do not get any results if "Hello" is missing from either one of them.
Could you please let me know how to modify this Cypher query so that I get results if "Hello" is present in either of the index keys (Name or LocationName)?
In 2.0 you can use UNION and have two separate queries like so:
start n=node:Users(Name="Hello")
return n
UNION
start n=node:Location(LocationName="Hello")
return n;
The problem with the way you have the query written is that it calculates a cartesian product of pairs between n and m, so if either n or m isn't found, no results are returned. If one n is found and two ms are found, then you get 2 results (with a repeating n). This is similar to how the FROM clause works in SQL: if you have an empty table called empty and you do select * from x, empty; then you'll get 0 results, unless you do an outer join of some sort.
Unfortunately, it's somewhat difficult to do this in 1.9. I've tried many iterations of things like WITH collect(n) as n, etc., but it boils down to the cartesian product thing at some point, no matter what.
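If you also need to know which index each hit came from, one variation of the 2.0 UNION approach above (a sketch, not from the original answer) is to tag each branch with a literal column:
start n=node:Users(Name="Hello")
return n, 'Users' as source
UNION
start n=node:Location(LocationName="Hello")
return n, 'Location' as source;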
Is there a built-in way to match only the first n relationships, other than filtering with LIMIT n afterwards?
I have this query:
START n=node({id})
MATCH n--u--n2
RETURN u, count(*) as cnt order by cnt desc limit 10;
But since the number of n--u relationships can be very high, I want to relax this query and take, for example, the first 100 random relationships, and only then continue with u--n2.
This is for a collaborative filtering task; assuming the users are more or less similar, I don't want to match all users u, just a random subset. This approach should perform better: right now the query takes ~500 ms, and I would like to get it under 50 ms.
I know I could break the above query into two separate ones, but even then the first query would go through all users and only limit the output afterwards. I want to cap the maximum number of relationships during the match phase.
You can pipe the current results of your query using WITH, then LIMIT those initial results, and then continue on in the same query:
START n=node({id})
MATCH n--u
WITH u
LIMIT 10
MATCH u--n2
RETURN u, count(*) as cnt
ORDER BY cnt desc
LIMIT 10;
The query above will give you the first 10 u nodes found, and then continue to find the first ten matching n2 nodes.
Optionally, you can leave off the second LIMIT and you will get all matching n2 nodes for the first ten u nodes (meaning you could have more than ten rows returned if they matched the first 10 u nodes).
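If you specifically want a random subset rather than the first ten, one possible trick (a sketch, assuming your Cypher version supports rand(); ordering by a random value has its own cost, but it is cheap on small candidate sets):
START n=node({id})
MATCH n--u
WITH u, rand() AS r
ORDER BY r
LIMIT 100
MATCH u--n2
RETURN u, count(*) AS cnt
ORDER BY cnt DESC
LIMIT 10;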
This is not a direct solution to your question, but since I was running into a similar problem, my workaround might be interesting for you.
What I need to do is get relationships by index (which might yield many thousands) and fetch their start node. Since the start node is always the same for that index query, I only need the very first relationship's start node.
Since I wasn't able to achieve that with Cypher (the query proposed by ean5533 does not perform any better), I am using a simple unmanaged extension (nice template).
@GET
@Path("/address/{address}")
public Response getUniqueIDofSenderAddress(@PathParam("address") String addr,
                                           @Context GraphDatabaseService graphDB) throws IOException
{
    try {
        RelationshipIndex index = graphDB.index().forRelationships("transactions");
        IndexHits<Relationship> rels = index.get("sender_address", addr);
        int unique_id = -1;
        // All hits share the same start node, so the first one is enough.
        for (Relationship rel : rels) {
            Node sender = rel.getStartNode();
            unique_id = (Integer) sender.getProperty("unique_id");
            rels.close(); // release the remaining index hits early
            break;
        }
        return Response.ok().entity("Unique ID: " + unique_id).build();
    } catch (Exception e) {
        return Response.serverError().entity("Could not get unique ID.").build();
    }
}
For this case, the speed-up is quite nice.
I don't know your exact use case, but since Neo4j supports HTTP streaming (as far as I know), you should be able to convert your query to an unmanaged extension and still get full performance.
E.g., query all your qualifying nodes from Java and emit the partial results to the HTTP stream, as in the sketch below.