Depth wise retrieval of nodes from neo4j

I have a science graph in neo4j which has names of some scientists as nodes and connected to nodes holding laws by relation has_discovered. The laws are then related to their application by relation has_application. I am new to cypher. I want to know what cql query will give me level 1 and level 2 nodes of the scientists nodes. Here level 1 will be the nodes holding laws and level 2 will be nodes holding their applications.

This query should probably take care of it, assuming your labels are :Scientist, :Law, and :Application.
MATCH (sci:Scientist)-[:has_discovered]->(law:Law)-[:has_application]->(app:Application)
RETURN sci, law, app
As long as your :has_discovered and :has_application relationships only connect those types of nodes, you can leave off the :Law and :Application labels (but you'll want to keep the :Scientist label so you begin your pattern match only at :Scientist nodes).
You can use COLLECT() as necessary to group results if you want.


Retrieve All Nodes That Can Be Reached By A Specific Node In A Directed Graph

Given a graph in Neo4j that is directed (but possible to have cycles), how can I retrieve all nodes that are reachable from a specific node with Cypher?
(Also: how long can I expect a query like this to take if my graph has 2 million nodes, and by extension 48 million nodes? A rough gauge will do eg. less than a minute, few minutes, an hour)
Cypher's uniqueness behavior is that relationships must be unique per path (each relationship can only be traversed once per path), but this isn't efficient for these kinds of use cases, where the goal is instead to find distinct nodes, so a node should only be visited once total (across all paths, not per path).
There are some path expander procedures in the APOC Procedures library that are directed at these use cases.
If you're trying to find all reachable nodes from a starting node, traversing relationships in either direction, you can use apoc.path.subgraphNodes() like so, using the movies graph as an example:
MATCH (n:Movie {title:"The Matrix"})
CALL apoc.path.subgraphNodes(n, {}) YIELD node
RETURN node
If you only wanted reachable nodes going a specific direction (let's say outgoing) then you can use a relationshipFilter to specify this. You can also add in the type too if that's important, but if we only wanted reachable via any outgoing relationship the query would look like:
MATCH (n:Movie {title:"The Matrix"})
CALL apoc.path.subgraphNodes(n, {relationshipFilter:'>'}) YIELD node
RETURN node
In either case these approaches should work better than with Cypher alone, especially in any moderately connected graph, as there will only ever be a single path considered for every reachable node (alternate paths to an already visited node will be pruned, cutting down on the possible paths to explore during traversal, which is efficient as we don't care about these alternate paths for this use case).
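To illustrate the "visit each node only once" idea, here is a minimal Python sketch of a breadth-first traversal with one global visited set. It is a conceptual model only, not APOC's implementation, and the adjacency dict `graph` and the function name `reachable_nodes` are made up for the example. The sketch includes the start node in its result.

```python
from collections import deque

def reachable_nodes(adjacency, start):
    """Return every node reachable from start via outgoing edges (start included)."""
    visited = {start}          # shared across the whole traversal, not per path
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency.get(current, ()):
            # An already visited node is skipped, so alternate paths to it are pruned.
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return visited

# Hypothetical directed graph with a cycle (a -> b -> c -> a);
# "e" points into the cycle but cannot be reached from "a".
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a", "d"], "d": [], "e": ["a"]}
print(sorted(reachable_nodes(graph, "a")))
```

Because every node enters the queue at most once, the work grows with the number of nodes and relationships rather than with the number of distinct paths, which is why cycles do not cause trouble here.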
Have a look here, where an algorithm is used for community detection.
You can use something like
match (n:Movie {title:"The Matrix"})-[r*1..50]-(m) return distinct id(m)
but that is slow: tested on the Neo4j movie dataset with 60k nodes, the query above already runs for more than 10 minutes. Memory usage will probably become an issue when your dataset consists of millions of nodes. It also depends on how your dataset is connected, e.g. the number of relationships.

What does indexing mean in Neo4j and how does it affect performance

I have an idea of how indexing works in an RDBMS, but I can't work out how indexing works in neo4j. Also, what is schema indexing?
To quote from neo4j's free book, Graph Databases:
Indexes help optimize the process of finding specific nodes. Most of the time, when querying a graph, we’re happy to let the traversal process discover the nodes and relationships that meet our information goals. By following relationships that match a specific graph pattern, we encounter elements that contribute to a query’s result. However, there are certain situations that require us to pick out specific nodes directly, rather than discover them over the course of a traversal. Identifying the starting nodes for a traversal, for example, requires us to find one or more specific nodes based on some combination of labels and property values.
That same book does an extensive comparison between neo4j and relational databases as well.
As for what the above-mentioned indexes (also known as "schema indexes") index: they index the nodes that have a specific node label and node property combination.
There is also a different indexing mechanism called "manual" (or "legacy", or "explicit") indexing, which is now only recommended for special use cases.
As an example, suppose we have already created an index on :Person(firstname), like so:
CREATE INDEX ON :Person(firstname);
In that case, the following query can quickly start off by using the index to find the desired Person nodes. Once those nodes are found, neo4j can easily traverse their outgoing WORKS_AT relationships to find the related Company nodes:
MATCH (p:Person)-[:WORKS_AT]->(c:Company)
WHERE p.firstname = 'Karan'
RETURN p, c;
Without that index, the query would have to either:
Scan through all Person nodes to find the right ones, before traversing their outgoing WORKS_AT relationships, or
Find all Company nodes, traverse their incoming WORKS_AT relationships, and compare the firstname values of every Person at the other end of the relationship.
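As a rough analogy for the difference between an index lookup and a label scan, here is a small Python sketch. It does not reflect neo4j's internal index structures; the `people` list and the helper names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical stand-in for all nodes carrying the :Person label.
people = [
    {"id": 1, "firstname": "Karan"},
    {"id": 2, "firstname": "Asha"},
    {"id": 3, "firstname": "Karan"},
]

def find_by_scan(nodes, firstname):
    """No index: inspect every Person node and compare its property value."""
    return [node["id"] for node in nodes if node["firstname"] == firstname]

def build_index(nodes):
    """Built once up front: maps a firstname value to the ids of matching nodes."""
    index = defaultdict(list)
    for node in nodes:
        index[node["firstname"]].append(node["id"])
    return index

def find_by_index(index, firstname):
    """With an index: a direct lookup, independent of how many Person nodes exist."""
    return index.get(firstname, [])

firstname_index = build_index(people)
print(find_by_scan(people, "Karan"), find_by_index(firstname_index, "Karan"))
```

Both lookups return the same node ids. The scan touches every Person each time it runs, while the index pays its cost once when built and has to be kept up to date on writes.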

Find all nodes with two-way relationships starting from one specific node using cypher in neo4j

[Image: neo4j nodes and relationships]
This is quite a tough job. I'm trying to find all nodes with two-way relationships starting from a specific node. Based on the image above, I would like to find all two-way relationships starting from node 1. Only nodes with two-way relationships match. For example, nodes 1, 3, 4 match and nodes 1, 2, 3 match as two separate groups. However, if nodes 2 and 4 had a two-way relationship, then nodes 1, 2, 3, 4 would match as one group. The main idea is that all nodes in such a group are linked both ways. My idea is to find all nodes with two-way relationships starting from 1 and then continue processing, but I'm not able to get any further. Can anyone help me with this problem? Thanks a lot. By the way, only the largest 'two-way-circle' is needed.
Your problem looks a lot like finding strongly connected components in the graph, as defined in the docs:
A directed graph is strongly connected if there is a path between all pairs of vertices (nodes). This algorithm treats the graph as directed, so the direction of the relationship is important, and a strongly connected component exists only if there are relationships between nodes in both directions.
Check out more in the documentation. You will need neo4j-graph-algorithms.
Example query that writes each node's component back to the node as a property:
CALL algo.scc('Label','C', {write:true,partitionProperty:'partition'})
YIELD loadMillis, computeMillis, writeMillis, setCount, maxSetSize, minSetSize
And then you can find your biggest component with the following query.
MATCH (u:Label)
RETURN distinct(u.partition) as partition,count(*) as size_of_partition
ORDER by size_of_partition DESC LIMIT 1
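If it helps to see what a strongly connected component is outside of the library, here is a minimal Python sketch using Kosaraju's algorithm on a made-up adjacency dict. It is not how neo4j-graph-algorithms implements algo.scc, and the graph below is hypothetical, not the one from the question. Note that membership depends on directed paths in both directions, not on every pair of nodes being directly linked both ways.

```python
def strongly_connected_components(adjacency):
    """Kosaraju's algorithm: return a list of sets, one per component."""
    nodes = set(adjacency)
    for targets in adjacency.values():
        nodes.update(targets)

    # Pass 1: record nodes in order of DFS completion on the original graph.
    visited, finish_order = set(), []
    for root in sorted(nodes):
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(adjacency.get(root, ())))]
        while stack:
            node, neighbors = stack[-1]
            for nxt in neighbors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(adjacency.get(nxt, ()))))
                    break
            else:
                # All neighbors explored: this node is finished.
                finish_order.append(node)
                stack.pop()

    # Pass 2: walk the reversed graph in decreasing finish order.
    reverse = {node: [] for node in nodes}
    for source, targets in adjacency.items():
        for target in targets:
            reverse[target].append(source)

    assigned, components = set(), []
    for root in reversed(finish_order):
        if root in assigned:
            continue
        component, stack = set(), [root]
        assigned.add(root)
        while stack:
            node = stack.pop()
            component.add(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components

# Hypothetical graph: cycle 1 -> 2 -> 3 -> 1, pair 4 <-> 5, and a dangling node 6.
graph = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4, 6], 6: []}
largest = max(strongly_connected_components(graph), key=len)
print(sorted(largest))
```

Picking the largest set at the end plays the same role as the ORDER BY ... LIMIT 1 query above.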

Fast search for unconnected nodes in big neo4j graph

So, I've created a Neo4j graph database out of a relational database. The graph database has about 7 million nodes, and about 9 million relationships between the nodes.
I now want to find all nodes that are not connected to nodes with a certain label (let's call them unconnected nodes). For example, I have nodes with the labels "Customer" and "Order" (let's call them top-level-nodes). I want to find all nodes that have no relationship from or to these top-level-nodes. The relationship doesn't have to be direct; the nodes can be connected to the top-level-nodes via other nodes.
I have a cypher query which would solve this problem:
MATCH (a) WHERE not ((a)-[*]-(:Customer)) AND not ((a)-[*]-(:Order)) RETURN a;
As you can imagine, the query needs a long time to execute and the performance is bad, most likely because the relationships are undirected and the path may pass through any number of nodes. However, the relationship directions don't matter, and I need to make sure that there is no path from any node to one of the top-level-nodes.
Is there any way to find the unconnected nodes faster? Note that the database is really big, and there are more than 2 labels which mark top-level-nodes.
You could try this approach, which does involve more operations, but can be run in batches for better performance (see apoc.periodic.commit() in the APOC procedures library).
The idea is to first apply a label (say, :Unconnected) to all nodes in your graph (batch execute with apoc.periodic.commit), and then, taking batches of top level nodes with that label, matching to all nodes in the subgraphs extending from them and removing that label.
When you finally have run out of top level nodes with the :Unconnected label (meaning all top level nodes and their subgraphs no longer have this label) then the only nodes remaining in your graph with the :Unconnected label are not connected to your top level nodes.
Any approach to this kind of operation will likely be slow, but the advantage again is that you can process this in batches, and if you get interrupted, you can resume. Once your queries are done, all the relevant unconnected nodes are now labeled for further processing at your convenience.
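The logic of that labeling approach can be sketched in plain Python on an in-memory graph. This is only a model of the idea, with made-up data and names; in Neo4j the labeling and label removal would be done with Cypher and batched with APOC rather than with this code.

```python
from collections import deque

def unconnected_nodes(all_nodes, edges, top_level):
    """Return the nodes with no path (in either direction) to any top-level node."""
    # Direction is ignored, so every edge is stored both ways.
    neighbors = {node: set() for node in all_nodes}
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)

    # Step 1: "label" every node as unconnected.
    unconnected = set(all_nodes)

    # Step 2: starting from the top-level nodes, remove the "label" from
    # everything in their subgraphs. A node is processed at most once.
    queue = deque()
    for node in top_level:
        if node in unconnected:
            unconnected.discard(node)
            queue.append(node)
    while queue:
        current = queue.popleft()
        for neighbor in neighbors[current]:
            if neighbor in unconnected:
                unconnected.discard(neighbor)
                queue.append(neighbor)

    # Step 3: whatever still carries the "label" has no path to a top-level node.
    return unconnected

# Hypothetical data: nodes 1 and 4 stand in for :Customer / :Order nodes.
nodes = [1, 2, 3, 4, 5, 6, 7]
edges = [(1, 2), (2, 3), (4, 5), (6, 7)]
print(sorted(unconnected_nodes(nodes, edges, top_level=[1, 4])))
```

Since the remaining label set is the only state, the process can stop and resume at any point, which matches the batching advantage described above.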
Also, one last note, in Neo4j undirected relationships have no arrows in the syntax ()-[*]-().
not (a:Customer OR a:Order)
AND shortestPath((a)-[*]-(:Customer)) IS NULL
AND shortestPath((a)-[*]-(:Order)) IS NULL
If you could add rel-types it would be faster.
One further optimization could be to check the nodes of an :Customer path for an :Order node and vice versa. i.e.
NONE(n in nodes(path) WHERE n:Order)
In general, this might be rather a set operation, i.e.
expand around all order and customer nodes in parallel into two sets
and compute the overlap between the two sets.
Then remove the overlap from the total number of nodes.
I added an issue for apoc here to add such a function or procedure

Category design in Neo4j, root node relationships vs relationships to indexed nodes

I would like to represent millions of products that belong to one or more of dozens of categories.
I'm contemplating a few approaches:
Indexed Category Nodes - Create nodes for each category and create an auto_index on category_name. Then create "isCategoryOf" relationships between each of my product nodes and their respective category nodes.
Individual Category Relationship Types- Create respective "isCategoryGames", "isCategoryFood", "isCategoryLifestyle", etc... relationships between products and the root node.
Storing Categories as a Property of One Relationship Type - Create "isCategory" relationships between product nodes and the root node and store their respective category types in a property of the relationship, e.g. relationship "isCategory" { categoryName:"food"}
Which of these approaches is most efficient and/or scalable? Are there limits or performance implications of having almost every node in the database connect to the root node?
If you attach millions of nodes to the root node, you make the root node a supernode. This can be problematic.
The general concept of Option 1 shows promise. If you were modeling food, you might have nodes with a name property like "Nuts", "Dairy Products", "Desserts", "Produce" and a type property of "Category". You would then have other nodes with a name property like "Cherry Cheesecake" with outgoing "category" edges to the "Dairy Products" and "Desserts" nodes.
Whether this structure is going to be performant enough depends on your queries. If you have broad categories like 'food', you could end up with a supernode, and you'll take a linear scan through the connected nodes to find a node with a given property. A linear scan over thousands of things might be fast enough for your purposes, but a scan over 1M things might not.
To find out, I would recommend creating a quick prototype where you generate some random product and category nodes, then connect each product node to a random number of category nodes. Indexing the product and category nodes by name will help you find individual products or categories, but it's the traversals that will cause performance problems if you hit supernodes. Experiment with a few of the Gremlin traversals or Cypher queries that you think might be most problematic. Try scaling up the number of nodes from 1K, 10K, 100K, and 1M with a proportionate number of edges. How do your traversal / query times change?
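As a starting point for such a prototype, here is a small Python sketch that generates random product-to-category assignments and reports how many products land on the busiest category at each scale. All names, the category count of 12, and the limit of 3 categories per product are assumptions for illustration. Loading the data into the database and timing real traversals is left out.

```python
import random
from collections import Counter

def generate_assignments(num_products, num_categories, max_per_product=3, seed=0):
    """Assign each product id to between 1 and max_per_product distinct categories."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    limit = min(max_per_product, num_categories)
    assignments = {}
    for product_id in range(num_products):
        count = rng.randint(1, limit)
        assignments[product_id] = rng.sample(range(num_categories), count)
    return assignments

def category_degrees(assignments):
    """Count how many products point at each category (the category node's degree)."""
    return Counter(cat for cats in assignments.values() for cat in cats)

for size in (1_000, 10_000, 100_000):
    degrees = category_degrees(generate_assignments(size, num_categories=12))
    busiest, degree = degrees.most_common(1)[0]
    print(f"{size} products: busiest category {busiest} has {degree} incoming edges")
```

With only dozens of categories, the busiest category's degree grows roughly in proportion to the number of products, so this is the number to watch when judging whether a category node is turning into a supernode.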
