I understand that the Neo4j Community edition does not provide a dump. What is the best way to measure the size of the underlying Neo4j database? Will running du provide a good estimate?
Thank you.
What do you mean:
the Neo4j Community edition does not provide a dump
The Enterprise edition doesn't provide anything like this either. Are you looking for statistics on the size of the DB in terms of raw disk or a count of Nodes/Relationships?
If disk: use du -sh on Linux, or check the folder properties on Windows.
If node/relationship counts: you'll have to write Java code to evaluate the true size, as the count on the Web Console is not always accurate. You could also do a basic count by taking the pure size on disk and dividing by 9 for the node store and 33 for the relationship store (the record sizes in bytes).
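For example, a rough estimate from the store file sizes could look like this (a sketch only; store paths vary by version, and the 9/33-byte record sizes assume the classic store format):
# divide the node store by the 9-byte node record size
echo $(( $(stat -c%s $NEO4J_HOME/data/databases/graph.db/neostore.nodestore.db) / 9 ))
# divide the relationship store by the 33-byte relationship record size
echo $(( $(stat -c%s $NEO4J_HOME/data/databases/graph.db/neostore.relationshipstore.db) / 33 ))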
Java code would look like this:
import org.neo4j.graphdb.*;
import org.neo4j.tooling.GlobalGraphOperations;

// db is a GraphDatabaseService; logger is any SLF4J-style logger
long relationshipCounter = 0;
long nodeCounter = 0;
GlobalGraphOperations ggo = GlobalGraphOperations.at(db);
try (Transaction tx = db.beginTx()) { // entity access must happen inside a transaction
    for (Node n : ggo.getAllNodes()) {
        nodeCounter++;
        try {
            // count only outgoing relationships so each relationship is counted once
            for (Relationship relationship : n.getRelationships(Direction.OUTGOING)) {
                relationshipCounter++;
            }
        } catch (Exception e) {
            logger.error("Error with node: {}", n, e);
        }
    }
    tx.success();
}
System.out.println("Number of Relationships: " + relationshipCounter);
System.out.println("Number of Nodes: " + nodeCounter);
The reason the Web Console isn't always accurate is that it checks a file for the highest ID in use, and Neo4j uses a delete marker for nodes, so there can be a range of "deleted" nodes that inflate the reported total of available nodes. Eventually Neo4j will compact the store and remove these records, but it doesn't do so in real time.
The reason the file size may lie is the same as above. The only true way is to go through all nodes and relationships and check for the "isActive" marker.
You can use Cypher to get the number of nodes:
MATCH ()
RETURN COUNT(*) AS node_count;
and the number of relationships:
MATCH ()-->()
RETURN COUNT(*) AS rel_count;
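If you want both counts from a single statement, something like this also works:
MATCH (n)
WITH count(n) AS node_count
MATCH ()-->()
RETURN node_count, count(*) AS rel_count;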
Some ways to view the database size in terms of bytes:
du -hc $NEO4J_HOME/data/databases/graph.db/*store.db*
From the dashboard:
http://localhost:7474/webadmin/
And view the 'database disk usage' indicator.
'Server Info' from the dashboard:
http://localhost:7474/webadmin/#/info/org.neo4j/Store%20file%20sizes/
And view the 'TotalStoreSize' row.
Finally, the database drawer; scroll all the way down and view the 'Database' section.
I am currently trying to build a search application on top of the SNOMED CT international FULL RF 2 Release. The database is quite large, and I decided to go with a full-text search index for optimal results. There are primarily 3 types of nodes in a SNOMED CT database:
ObjectConcept
Descriptions
Role Group
There are multiple relationships between the nodes, but I'm focusing on those later.
Currently I'm focusing on searching for ObjectConcept nodes by a string property called FSN, which stands for fully specified name. For this I tried two things:
Create text indexes on FSN: after using this with MATCH queries, the results were rather slow, even when I was using the CONTAINS predicate and limiting the return value to 15.
Create full-text indexes: according to the docs, the FT indexes are powered by Apache Lucene. I created one for FSN. After using the FT indexes and using AND clauses in the search term, for example:
Search Term : head pain
Query Term: head AND pain
I observe quite impressive gains in query time using the profiler in Neo4j Browser (from around 43ms down to 10ms for some queries); however, once I start querying the DB through the Apollo server, query times go as high as 2-3s, sometimes up to 8-10s.
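For reference, the full-text index itself was created along these lines (a sketch; the exact index name, labels and properties are as described above):
CALL db.index.fulltext.createNodeIndex('searchIndex', ['ObjectConcept'], ['FSN'])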
The query is as follows, implemented by a custom resolver in neo4j/graphql and apollo-server:
const session = context.driver.session()
let words = args.name.split(" ");
let compoundQuery = "";
if (words.length === 1) compoundQuery = words[0];
else compoundQuery = words.join(" AND ");
console.log(compoundQuery)
compoundQuery += ` AND (${args.type})`;
return session
  .run(
    `CALL db.index.fulltext.queryNodes('searchIndex', $name) YIELD node, score
     RETURN node
     LIMIT 10`,
    { name: compoundQuery }
  )
  .then((res) => {
    session.close()
    return res.records.map((record) => {
      return record.get('node').properties
    })
  })
I have the following questions:
Am I utilising the FT indexes as much as I can, or am I missing important optimisations?
I was trying to use Elasticsearch with Neo4j, but I read that Elasticsearch and the FT indexes are both powered by Lucene. Am I likely to gain improvements from using Elasticsearch? If so, how should I go about it, considering that I am using Neo4j Aura DB and my GraphQL server is on EC2? I am confused as to how to use Elasticsearch overall with the GRANDstack. Any help would be appreciated.
Any other suggestions for optimising search would be greatly appreciated.
Thanks in advance!
I am trying to use apoc.periodic.iterate to reduce heap usage when doing very large transactions in a Neo4j database.
I've been following the advice given in this presentation.
But my results differ from those shown in the slides.
First, some notes on my setup:
Using Neo4j Desktop, graph version 4.0.3 Enterprise, with APOC 4.0.0.10
I'm calling queries using the .NET Neo4j Driver, version 4.0.1.
neo4j.conf values:
dbms.memory.heap.initial_size=2g
dbms.memory.heap.max_size=4g
dbms.memory.pagecache.size=2g
Here is the cypher query I'm running:
CALL apoc.periodic.iterate(
"UNWIND $nodes AS newNodeObj RETURN newNodeObj",
"CREATE(n:MyNode)
SET n = newNodeObj",
{batchSize:2000, iterateList:true, parallel:false, params: { nodes: $nodes_in } }
)
And the line of C#:
var createNodesResCursor = await session.RunAsync(createNodesQueryString, new { nodes_in = nodeData });
where createNodesQueryString is the query above, and nodeData is a List<Dictionary<string, object>> where each Dictionary has just three entries: 2 strings, 1 long.
When attempting to run this to create 1.3 million nodes, I observe the heap usage (via JConsole) going all the way up to the 4GB available and bouncing back and forth between ~2.5GB and 4GB. Reducing the batch size makes no discernible difference, and raising heap.max_size just causes the heap usage to climb to almost that value. It's also really slow, taking 30+ minutes to create those 1.3 million nodes.
Does anyone have any idea what I may be doing wrong, or differently from the linked presentation? I understand my query is doing a CREATE whereas the presentation only updates an already loaded dataset, but I can't imagine that's the reason my heap usage is so high.
Thanks
My issue was that, although I was using apoc.periodic.iterate, I was still uploading the entire 1.3 million node data set to the database as a single query parameter!
Modifying my code to do the batching myself as follows fixed my heap usage problem, and the slowness problem:
const int batchSize = 2000;
for (int count = 0; count < nodeData.Count; count += batchSize)
{
    string createNodesQueryString = @"
        UNWIND $nodes_in AS newNodeObj
        CREATE (n:MyNode)
        SET n = newNodeObj";
    int length = Math.Min(batchSize, nodeData.Count - count);
    var createNodesResCursor = await session.RunAsync(createNodesQueryString,
        new { nodes_in = nodeData.GetRange(count, length) }); // only ship this batch's slice
    var createNodesResSummary = await createNodesResCursor.ConsumeAsync();
}
Hi everyone. I am new to the Neo4j database.
I have a graph containing nodes and relationships, and I want to get all paths from A to other nodes whose total cost is less than a maximum.
The maximum needs to be configurable.
I use Java to query Neo4j. I know an Evaluator class can decide when we stop traversing a path, but I can't pass my maximum to the interface's evaluate() method.
My code is here:
import org.neo4j.graphdb.Path;
import org.neo4j.graphdb.Relationship;
import org.neo4j.graphdb.traversal.Evaluation;
import org.neo4j.graphdb.traversal.Evaluator;

public class MyEvaluators implements Evaluator {
    // hard-coded for now; I want to supply this per query
    private static final double MAXIMUM = 100.0;

    @Override
    public Evaluation evaluate(Path path) {
        // sum the "cost" property over the relationships in the path so far
        double totalCost = 0.0;
        for (Relationship rel : path.relationships()) {
            totalCost += (double) rel.getProperty("cost");
        }
        return totalCost > MAXIMUM
                ? Evaluation.EXCLUDE_AND_PRUNE
                : Evaluation.INCLUDE_AND_CONTINUE;
    }
}
And I don't want to limit the path depth.
So how can I do this query quickly?
Which version are you looking at?
https://neo4j.com/docs/java-reference/current/tutorial-traversal/
In the current API you can pass a context object (branch state) to the traversal that keeps your current state per branch, so you can accumulate the total cost in the PathEvaluator:
https://neo4j.com/docs/java-reference/3.4/javadocs/org/neo4j/graphdb/traversal/PathEvaluator.html
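A rough, untested sketch of that pattern against the 3.x embedded API (the "cost" property and outgoing direction come from your question; db is your GraphDatabaseService and maximum your threshold, both assumed to be in scope):

import org.neo4j.graphdb.*;
import org.neo4j.graphdb.traversal.*;

// The expander folds the cost of the incoming relationship into the state
// inherited by child branches; the evaluator then only adds the last hop.
TraversalDescription td = db.traversalDescription()
    .expand(new PathExpander<Double>() {
        @Override
        public Iterable<Relationship> expand(Path path, BranchState<Double> state) {
            if (path.length() > 0) {
                state.setState(state.getState()
                        + (double) path.lastRelationship().getProperty("cost"));
            }
            return path.endNode().getRelationships(Direction.OUTGOING);
        }
        @Override
        public PathExpander<Double> reverse() { return this; }
    }, new InitialBranchState.State<>(0.0, 0.0))
    .evaluator(new PathEvaluator.Adapter<Double>() {
        @Override
        public Evaluation evaluate(Path path, BranchState<Double> state) {
            double total = state.getState();
            if (path.length() > 0) {
                total += (double) path.lastRelationship().getProperty("cost");
            }
            return total > maximum
                    ? Evaluation.EXCLUDE_AND_PRUNE
                    : Evaluation.INCLUDE_AND_CONTINUE;
        }
    });

This computes the cost once per hop instead of re-walking the whole path on every evaluate() call, and maximum can simply be an effectively final local variable or field.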
Perhaps you also want to derive from the Dijkstra evaluator:
https://github.com/neo4j/neo4j/blob/3.5/community/graph-algo/src/main/java/org/neo4j/graphalgo/impl/path/Dijkstra.java
I have a big graph model and I need to write the result of the following query into a CSV file.
Match (u:USER)-[r:PURCHASED]->(o:ORDER)-[h:HAS]->(i:ITEM) return u.id as userId,i.product_number as itemId
When I EXPLAIN the query, it shows that the estimated result is something around 9M rows. My problems are:
1) It takes a lot of time to get a response. From neo4j-shell it takes 38 minutes! Is this normal? BTW, I have all the schema indexes there and they are all ONLINE.
2) When I use Spring Data Neo4j to fetch the result, it throws a "java.lang.OutOfMemoryError: GC overhead limit exceeded" error, and that happens when SDN tries to convert the loaded data to our @QueryResult object.
I tried to optimize the query in different ways, but nothing changed! My impression is that I am doing something wrong. Does anyone have any idea how I can solve this problem? Should I go for batch read/write?
P.S. I am using Neo4j Community edition version 3.0.1, and these are my server configs:
dbms.jvm.additional=-Dunsupported.dbms.udc.source=tarball
use_memory_mapped_buffers=true
neostore.nodestore.db.mapped_memory=3G
neostore.relationshipstore.db.mapped_memory=4G
neostore.propertystore.db.mapped_memory=3G
neostore.propertystore.db.strings.mapped_memory=1000M
neostore.propertystore.db.index.keys.mapped_memory=500M
neostore.propertystore.db.index.mapped_memory=500M
Although Neo4j will stream results to you as it matches them, when you use SDN it has to collect the output into a single @QueryResult object. To avoid OOM problems you'll need to either ensure your application has sufficient heap memory available to load all 9M responses, use neo4j-shell, or use a purpose-built streaming interface such as https://www.npmjs.com/package/cypher-stream (caveat emptor: I haven't tried this, but it looks like it should do the trick).
Your config settings are not correct for Neo4j 3.0.1. You have to set the heap in conf/neo4j-wrapper.conf (e.g. 8G), and the page cache in conf/neo4j.conf (looking at your store, you only need 2G for the page cache).
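For example (indicative values only; adjust to your machine):
# conf/neo4j-wrapper.conf
dbms.memory.heap.initial_size=8g
dbms.memory.heap.max_size=8g
# conf/neo4j.conf
dbms.memory.pagecache.size=2g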
Also as you can see it will create 8+ million rows.
You might have more luck with this query:
Match (u:USER)-[:PURCHASED]->(:ORDER)-[:HAS]->(i:ITEM)
with distinct u,i
return u.id as userId,i.product_number as itemId
It also doesn't make sense to return 8M rows to neo4j-shell, to be honest.
If you want to measure it, replace the RETURN with WITH and add a RETURN count(*):
Match (u:USER)-[r:PURCHASED]->(o:ORDER)-[h:HAS]->(i:ITEM)
with distinct u,i
WITH u.id as userId,i.product_number as itemId
RETURN count(*)
Another optimization could be to go via item and user and do a hash-join in the middle for a global query like this:
Match (u:USER)-[:PURCHASED]->(o:ORDER)-[:HAS]->(i:ITEM)
USING JOIN ON o
with distinct u,i
WITH u.id as userId,i.product_number as itemId
RETURN count(*)
The other thing that I'd probably do to reduce the number of returned results is to try aggregation.
Match (u:USER)-[:PURCHASED]->(o:ORDER)-[:HAS]->(i:ITEM)
with distinct u,i
WITH u, collect(distinct i) as products
WITH u.id as userId,[i in products | i.product_number] as items
RETURN count(*)
Thanks to Vince's and Michael's comments I found a solution!
After doing some experiments it became clear that the server response time is actually good: 1.5 minutes for 9 million rows! The problem is with SDN, as Vince mentioned: the OOM happens when SDN tries to convert the data to @QueryResult objects. Increasing heap memory for our application is not a permanent solution as we will have more rows in the future, so we decided to use the neo4j-jdbc driver for big data queries... and it works like a jet! Here is the code example we used:
import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.*;

Class.forName("org.neo4j.jdbc.Driver");
try (Connection con = DriverManager.getConnection("jdbc:neo4j:bolt://HOST:PORT", "USER", "PASSWORD");
     // the writer was not shown in the original; the output path is just an example
     BufferedWriter writer = new BufferedWriter(new FileWriter("result.csv"))) {
    String query = "MATCH (u:USER)-[r:PURCHASED]->(o:ORDER)-[h:HAS]->(i:ITEM) RETURN u.id AS userId, i.product_number AS itemId";
    con.setAutoCommit(false); // important for large datasets
    Statement st = con.createStatement();
    st.setFetchSize(50); // important for large datasets: stream rows instead of buffering them all
    try (ResultSet rs = st.executeQuery(query)) {
        while (rs.next()) {
            writer.write(rs.getInt("userId") + "," + rs.getInt("itemId"));
            writer.newLine();
        }
    }
    st.close();
}
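In case it helps, the driver can be pulled in via Maven (the version below is an assumption; use the latest release):
<dependency>
    <groupId>org.neo4j</groupId>
    <artifactId>neo4j-jdbc-driver</artifactId>
    <version>3.0.1</version> <!-- version assumed -->
</dependency>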
Just make sure you use con.setAutoCommit(false) and st.setFetchSize(50) if you know that you are going to load a large dataset. Thanks everyone!
I want to run some tests on Neo4j and compare its performance with other databases, in this case PostgreSQL.
This Postgres database has about 2,000,000 'contents' distributed across 3,000 'categories' (this means there is a 'content' table, a 'category' table, and a 'content-to-category' relation table, since one content can be in more than one category).
So, mapping this to a Neo4j DB, I'm creating 'content' and 'category' nodes and their relations (content to category, and content to content, since contents can have related contents).
category -> category ( categories can have sub-categories )
content -> category
content -> content (related)
Do you think this 'schema' is ok for this type of domain ?
Migrating all the data from PostgreSQL to Neo4j is taking forever (about 4-5 days). It is just searching for nodes and creating/updating them accordingly (the search uses indexes, and each insert/update takes about 500ms per node).
Am I doing something wrong?
Migration is done, so I went on to try some querying...
I ended up with about 2,000,000 content nodes, 3,000 category nodes, and more than 4,000,000 relationships.
(Please note that I'm new to all this Neo4j world, so I have no idea how to optimize Cypher queries...)
One of the queries I wanted to test is: get the 10 latest published contents of a given 'definition' in a given category (this includes contents that are in sub-categories of the given category).
Experimenting a little, I ended up with something like this:
START
c = node : node_auto_index( 'type: category AND code: category_code' ),
n = node : node_auto_index( 'type: content AND state: published AND definitionCode: definition_name' )
MATCH (c) <- [ r:BELONGS_TO * ] - (n)
RETURN n.published_stamp, n.title
ORDER BY n.published_stamp DESC
LIMIT 6
This takes around 3 seconds, excluding the first run, which takes a lot more... Is this normal?
What am I doing wrong?
Please note that I'm using Neo4j 1.9.2 and auto-indexing some node properties (type, code, state, definitionCode and published_stamp included; title is not auto-indexed).
Also, returning c alone from the previous query (START c = node:node_auto_index('type: category AND code: category_code') RETURN c;) is fast, taking around 20-30ms (again excluding the first run).
Also, I'm not sure if this is the right way to use indexes...
Thank you in advance (sorry if something doesn't make sense; ask me and I'll try to explain better).
Have you looked at the batch import facilities: http://www.neo4j.org/develop/import? You really should look at that for the initial import - it will take minutes instead of days.
I will ask some of our technical folks to get back to you on some of the other stuff. You really should not be seeing this.
Rik
How many nodes are returned by this?
START
n = node : node_auto_index( 'type: content AND state: published AND definitionCode: definition_name' )
RETURN count(*)
I would try to let the graph do the work.
How deep are your hierarchies usually?
Usually you limit arbitrary-length relationships to a maximum depth (e.g. [:BELONGS_TO*..5] instead of [:BELONGS_TO*]) so you don't get a combinatorial explosion.
I would also use a different relationship type between content and category than within the category tree.
Can you point out your current relationship types?
START
c = node : node_auto_index( 'type: category AND code: category_code' )
MATCH (c) <- [:BELONGS_TO*..5] - (n)
WHERE n.type = 'content' AND n.state = 'published' AND n.definitionCode = 'definition_name'
RETURN n.published_stamp, n.title
ORDER BY n.published_stamp DESC
LIMIT 6
Can you try that?
For import it is easiest to generate CSV from your SQL and import that using http://github.com/jexp/batch-import
Are you running Linux, maybe on an ext4 filesystem?
You might want to set the barrier=0 mount option, as described here: http://structr.org/blog/neo4j-performance-on-ext4
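For example (assuming the store lives on a dedicated ext4 mount at /data; note that disabling barriers trades some crash safety for speed, as the article discusses):
# remount the filesystem holding the Neo4j store with write barriers off
sudo mount -o remount,barrier=0 /data
# or persist it in /etc/fstab:
# /dev/sdb1  /data  ext4  defaults,barrier=0  0  2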
Further discussion of this topic: https://groups.google.com/forum/#!topic/neo4j/nflUyBsRKyY