Create graph from large, separate list of nodes and edges - neo4j

New to neo4j: I want to load a JSON file with the following structure into my neo4j DB:
{
    "nodes": [
        {
            "last_update": 1629022369,
            "pub_key": "pub1",
            "alias": "alias1"
        },
        {
            "last_update": 1618162974,
            "pub_key": "pub2",
            "alias": "alias2"
        },
        {
            "last_update": 1634745976,
            "pub_key": "pub3",
            "alias": "alias3"
        }
    ],
    "edges": [
        {
            "node1_pub": "pub1",
            "node2_pub": "pub2",
            "capacity": "37200"
        },
        {
            "node1_pub": "pub2",
            "node2_pub": "pub3",
            "capacity": "37200"
        },
        {
            "node1_pub": "pub3",
            "node2_pub": "pub1",
            "capacity": "37200"
        }
    ]
}
I load nodes and edges in separate queries:
WITH "file:///graph.json" AS graph
CALL apoc.load.json(graph) YIELD value
FOREACH (nodeObject in value.nodes | CREATE (node:Node {pubKey:nodeObject.pub_key}))
WITH "file:///graph.json" AS graph
CALL apoc.load.json(graph) YIELD value
UNWIND value.edges as edgeObject
MATCH (node1:Node {pubKey: edgeObject.node1_pub})
MATCH (node2:Node {pubKey: edgeObject.node2_pub})
CREATE (node1)-[:IS_CONNECTED {capacity: edgeObject.capacity}]->(node2)
This works fine with a small number of edges, but I have a ~100 MB file with plenty of edges in it. In that case, the query never returns. I'm running it from the neo4j web interface; neo4j is running in Docker, and the max heap size is set to 3g, which should be more than enough.
I have not grasped all of the concepts of Cypher yet, so there is probably a better way to do this anyway, ideally in one query so that the file does not need to be loaded twice.
Thanks a lot!

You can load the JSON file batch by batch using the txBatchSize parameter. See the documentation below:
https://neo4j.com/labs/apoc/4.1/import/load-json/#load-json-available-procedures-apoc.import.json
WITH "file:///graph.json" AS graph
CALL apoc.load.json(graph, '[0:10000]') YIELD value
RETURN value
This will return 10000 rows.
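The second argument here is a JSON path, so the slice can also point directly at the edges array. A sketch (worth verifying the exact path behavior against the APOC docs for your version):
WITH "file:///graph.json" AS graph
CALL apoc.load.json(graph, '$.edges[0:10000]') YIELD value
RETURN value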

OK, after trying out the batching suggested by @jose_bacoy, I saw that even 1000 rows took around 20s.
Obviously, the MATCH operation is quite CPU intensive. After I created an index, the import of 80k edges worked like a charm:
CREATE INDEX FOR (n:Node) ON (n.pubKey)
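For completeness: the import can also read the file once and create the relationships in batches with apoc.periodic.iterate. A minimal sketch, assuming APOC is installed, the index above exists, and the nodes were already created (the batch size is illustrative):
CALL apoc.periodic.iterate(
    "CALL apoc.load.json('file:///graph.json') YIELD value
     UNWIND value.edges AS edge
     RETURN edge",
    "MATCH (n1:Node {pubKey: edge.node1_pub})
     MATCH (n2:Node {pubKey: edge.node2_pub})
     CREATE (n1)-[:IS_CONNECTED {capacity: edge.capacity}]->(n2)",
    {batchSize: 10000}
)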

Related

Is there a way to dynamically generate nodes from JSON with apoc.load.json procedure?

I would like to create a set of nodes and relationships from a JSON document. Here is sample JSON:
{"records": [{
"type": "bundle",
"id": "bundle--1",
"objects": [
{
"type": "evaluation",
"id": "evaluation--12345",
"name": "Eval 1",
"goals": [
"test"
]
},
{
"type": "subject",
"id": "subject--67890",
"name": "Eval 2",
"goals": [
"execute"
]
},
{
"type": "relationship",
"id": "relationship--45678",
"relationship_type": "participated-in",
"source_ref": "subject--67890",
"target_ref": "evaluation--12345"
}
}]
}
And I would like that JSON to be represented in Neo similar to the following:
(:evaluation {properties})<-[:RELATIONSHIP]-(:subject {properties})
Ultimately I would like to have a model that represents the evaluation, subject, and relationship generated via a few Cypher calls, with as little outside manipulation as possible. Is it possible to use the apoc.create.* set of calls to create the necessary nodes and relationships from this JSON? I have tried something similar to the following to get this JSON to load, and I can get it to create nodes of an arbitrary type (in this case "object").
WITH "file:///C:/path/to/my/sample.json" AS json
CALL apoc.load.json(json, "$.records") YIELD value
UNWIND value.objects as object
MERGE (o:object {id: object.id, type: object.type, name: object.name})
RETURN count(*)
I have tried changing the JSONPath expression to filter different record types, but it is difficult to run a Goessner path like $.records..objects[?(@.type == 'subject')] thanks to the embedded quotes. This would also lead to multiple runs against the real JSON (I have 15 or so different types), which could be very time consuming. The load-json docs show a simple filter expression, and there is a blog post that shows how to parse Stack Overflow data, but those JSON objects are keyed in a way that is easy to map in Cypher. Is there a Cypher trick or an APOC procedure I should be aware of that can help me solve this problem?
I would approach this as a two-pass method:
First pass: create the nodes for evaluation and subject. You could use apoc.do.case/when if helpful
Second pass: only scan for relationship and then do a MATCH to find the evaluation and subject nodes based on the source_ref and target_ref, and then MERGE or CREATE the relationship to connect them.
This way you're not impacted by situations such as the relationship coming before the nodes it connects, etc., or by how many items you have within objects.
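A minimal Cypher sketch of the two passes, using apoc.create.node and apoc.create.relationship to derive labels and relationship types from the type fields (the labels and property names here are assumptions based on the sample JSON):
// Pass 1: create a node for everything that is not a relationship record
WITH "file:///C:/path/to/my/sample.json" AS json
CALL apoc.load.json(json, "$.records") YIELD value
UNWIND value.objects AS object
WITH object WHERE object.type <> "relationship"
CALL apoc.create.node([object.type], {id: object.id, name: object.name}) YIELD node
RETURN count(node)

// Pass 2: scan only the relationship records and connect nodes by id
// (a label plus an index on id would speed up these MATCHes)
WITH "file:///C:/path/to/my/sample.json" AS json
CALL apoc.load.json(json, "$.records") YIELD value
UNWIND value.objects AS object
WITH object WHERE object.type = "relationship"
MATCH (src {id: object.source_ref})
MATCH (tgt {id: object.target_ref})
CALL apoc.create.relationship(src, object.relationship_type, {}, tgt) YIELD rel
RETURN count(rel)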
As Lju pointed out, the apoc.do.case function can be used to create a set of conditions to check, followed by a cypher statement. Combining that with another apoc call requires the returns from each apoc call to be handled properly. My answer ended up looking like the following:
WITH "file:///C:/path/to/my/sample.json" AS json
CALL apoc.load.json(json, "$.records") YIELD value as records
UNWIND records.objects as object
CALL apoc.do.case(
[object.type="evaluation", "MERGE (:Evaluation {id: object.id}) ON CREATE SET Evaluation.id = object.id, Evaluation.prop1 = object.prop1",
object.type="subject", "MERGE (:Subject {id: object.id}) ON CREATE SET Subject.id = object.id, Subject.prop1 = object.prop1",
....]
"<default case cypher query goes here>", {object:object}
)
YIELD value RETURN count(*)
Notice there are two apoc calls that YIELD value; use aliases to help the parser differentiate between them. The documentation for apoc.do.case is a little sparse but describes the syntax for the statement. It looks like there are other ways to accomplish this task, but with smaller JSON files and a handful of cases this works well enough.

Is it advisable to use MapReduce to 'flatten' irregular entities in CouchDB?

In a question on CouchDB I asked previously (Can you implement document joins using CouchDB 2.0 'Mango'?), the answer mentioned creating domain objects instead of storing relational data in Couch.
My use case, however, is not necessarily to store relational data in Couch but to flatten relational data. For example, I have the entity of Invoice that I collect from several suppliers. So I have two different schemas for that entity.
So I might end up with 2 docs in Couch that look like this:
{
    "type": "Invoice",
    "subType": "supplier B",
    "total": 22.5,
    "date": "10 Jan 2017",
    "customerName": "me"
}
{
    "type": "Invoice",
    "subType": "supplier A",
    "InvoiceTotal": 10.2,
    "OrderDate": <some other date format>,
    "customerName": "me"
}
I also have a doc like this:
{
    "type": "Customer",
    "name": "me",
    "details": "etc..."
}
My intention then is to 'flatten' the Invoice entities, and then join on the reduce function. So, the map function looks like this:
function(doc) {
    switch (doc.type) {
        case 'Customer':
            emit(doc.customerName, { /* doc information ..., */ type: "Customer" });
            break;
        case 'Invoice':
            switch (doc.subType) {
                case 'supplier B':
                    emit(doc.customerName, { total: doc.total, date: doc.date, type: "Invoice" });
                    break;
                case 'supplier A':
                    emit(doc.customerName, { total: doc.InvoiceTotal, date: doc.OrderDate, type: "Invoice" });
                    break;
            }
            break;
    }
}
Then I would use the reduce function to compare docs with the same customerName (i.e. a join).
Is this advisable using CouchDB? If not, why?
First of all, apologies for getting back to you late; I thought I'd look at it directly, but I haven't been on SO since we exchanged the other day.
Reduce functions should only be used to reduce scalar values, not to aggregate data. So you wouldn't use them to do joins or remove duplicates, but you would, for example, use them to compute the number of invoices per customer. The reason is that you can only make weak assumptions about the calls made to your reduce functions (the order in which records are passed, the rereduce parameter, etc.), so you can easily end up with serious performance problems.
But this is by design, since the intended usage of reduce functions is to reduce scalar values. An easy way to think about it: no filtering should ever happen in a reduce function; filtering and things such as checking keys should be done in map.
If you just want to compare docs with the same customer name, you do not need a reduce function at all; you can query your view with the following parameters:
startkey=["customerName"]
endkey=["customerName", {}]
Otherwise, you may want to create a separate view to filter on customers first, returning their names, and then use those names to query your view in bulk via the keys view parameter. startkey/endkey is good if you only want to filter one customer at a time and/or need to match complex keys in a partial way.
If what you are after are the numbers, you may want to do:
if(doc.type == "Invoice") {
emit([doc.customerName, doc.supplierName, doc.date], doc.amount)
}
Then use the _stats built-in reduce function to get statistics on the amount (sum, min, max, etc.).
To get the amount spent with a supplier, you'd just make a reduce query against your view with the parameter group_level=2 to aggregate by the first two elements of the key. You can combine this with startkey and endkey to filter specific values of that key:
startkey=["name1", "supplierA"]
endkey=["name1", "supplierA", {}]
You can then build from this example to do things such as:
if(doc.type == "Invoice") {
emit(["BY_DATE", doc.customerName, doc.date], doc.amount);
emit(["BY_SUPPLIER", doc.customerName, doc.supplierName], doc.amount);
emit(["BY_SUPPLIER_AND_DATE", doc.customerName, doc.supplierName, doc.date], doc.amount);
}
Hope this helps
It is totally OK to "normalize" your different schemas (or subTypes) via a view. You cannot create further views on top of those normalized results, though, and in the long run it might be hard to manage the different schemas.
The better solution might be to normalize the documents before writing them to CouchDB. If you still need the documents in their original schema, you can add a sub-property original where you store the documents in their original form. This would make working with the data much easier:
{
    "type": "Invoice",
    "total": 22.5,
    "date": "2017-01-10T00:00:00.000Z",
    "customerName": "me",
    "original": {
        "supplier": "supplier B",
        "total": 22.5,
        "date": "10 Jan 2017",
        "customerName": "me"
    }
},
{
    "type": "Invoice",
    "total": 10.2,
    "date": "2017-01-12T00:00:00.000Z",
    "customerName": "me",
    "original": {
        "subType": "supplier A",
        "InvoiceTotal": 10.2,
        "OrderDate": <some other date format>,
        "customerName": "me"
    }
}
I'd also convert the date to ISO format, because it parses well with new Date(), sorts correctly, and is human-readable. You can then easily emit invoices grouped by year, month, day, and so on.
Use reduce preferably only with built-in functions, because reduces have to be re-executed on queries, and executing JavaScript on many documents is a complex and time-intensive operation, even if the database has not changed at all. You can find more information about the reduce process in the CouchDB docs. It makes more sense to preprocess the documents as much as you can before storing them in CouchDB.
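To make the "preprocess before storing" advice concrete, here is a minimal sketch of a normalizer for the two supplier schemas above (the function name and the field-detection logic are assumptions based on the sample docs):
// Hypothetical normalizer, applied before saving an invoice to CouchDB
function normalizeInvoice(doc) {
    var isSupplierA = "InvoiceTotal" in doc; // supplier A uses its own field names
    return {
        type: "Invoice",
        total: isSupplierA ? doc.InvoiceTotal : doc.total,
        date: new Date(isSupplierA ? doc.OrderDate : doc.date).toISOString(),
        customerName: doc.customerName,
        original: doc // keep the source document in its original form
    };
}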

Report Filtering With Adobe Analytics API

I am queuing and getting a report through the API and javascript, but now I want to start filtering the report. I want the results that come back to apply only to the user (other filters are needed too) who is requesting the report. What is the best way to put a filter on the initial report queue?
The way I am doing it now is adding a selected element to the report description:
...
"elements": [
{ "id": "page" },{ "id": "evar23" , "selected": ["295424","306313"]}
...
But this only seems to apply to the breakdown section of the results, not the top-level count that is returned. I would expect the top-level count in the example below to be 66, not 68:
...
"counts": [
    "68"
],
"breakdown": [
    {
        "name": "306313",
        "url": "",
        "counts": [
            "43"
        ]
    },
    {
        "name": "295424",
        "url": "",
        "counts": [
            "23"
        ]
    }
]
}
,...
I know I can just crawl through the breakdown array and total up what I need, but the more filters I apply, the messier it becomes. All of a sudden I'm three levels deep in a nested array, making sure that all three breakdown names match my conditions. There must be a better way to do this; any ideas? Many thanks.
Although there are some possible limitations to them that I am still working through, it seems that segments, not elements, are what I need.
"segments": [
{
"element": "evar23","selected": ["295424","306313"]
}]
https://marketing.adobe.com/developer/forum/reporting/report-filtering-with-api
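For context, an inline segment sits alongside elements and metrics in the report description, roughly like this sketch of a 1.4-style reportDescription (the report suite ID, dates, and metric are placeholders):
{
    "reportDescription": {
        "reportSuiteID": "your-rsid",
        "dateFrom": "2016-01-01",
        "dateTo": "2016-01-31",
        "metrics": [{ "id": "pageviews" }],
        "elements": [{ "id": "page" }],
        "segments": [
            { "element": "evar23", "selected": ["295424","306313"] }
        ]
    }
}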

The graph section of the Cypher response remains blank

I noticed that for some queries the response populates the "graph" section as follows:
}
],
"graph": {
    "nodes": [
        {
            "id": "68",
            "labels": [
                "ROOM"
            ],
            "properties": {
                "id": 15,
                "name": "Sun and Snow",
but for other queries this "graph" section does not return nodes/relationships and the associated labels/properties, even though the "data" section returns valid output.
Does this convey anything about the quality of the Cypher query?
It depends on what you return from your query. If you return nodes and relationships, you'll get a graph. If you return scalars such as n.name or r.weight, you don't get a graph.
Are you talking about the HTTP requests from the web UI or requests that you are making yourself?
The graph key is controlled via the resultDataContents option when making a request. You can see the documentation for that here:
http://neo4j.com/docs/stable/rest-api-transactional.html#rest-api-return-results-in-graph-format
You can request multiple formats for the result ("row" and "REST" are other examples).
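For example, a request to the transactional endpoint asking for both formats would look roughly like this (the host/port and query are illustrative):
POST http://localhost:7474/db/data/transaction/commit
{
    "statements": [{
        "statement": "MATCH (n)-[r]->(m) RETURN n, r, m LIMIT 10",
        "resultDataContents": ["row", "graph"]
    }]
}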

Using Cypher to return nested, hierarchical JSON from a tree

I'm currently using the example data on console.neo4j.org to write a query that outputs hierarchical JSON.
The example data is created with
create (Neo:Crew {name:'Neo'}), (Morpheus:Crew {name: 'Morpheus'}), (Trinity:Crew {name: 'Trinity'}), (Cypher:Crew:Matrix {name: 'Cypher'}), (Smith:Matrix {name: 'Agent Smith'}), (Architect:Matrix {name:'The Architect'}),
(Neo)-[:KNOWS]->(Morpheus), (Neo)-[:LOVES]->(Trinity), (Morpheus)-[:KNOWS]->(Trinity),
(Morpheus)-[:KNOWS]->(Cypher), (Cypher)-[:KNOWS]->(Smith), (Smith)-[:CODED_BY]->(Architect)
The ideal output is as follows
name:"Neo"
children: [
{
name: "Morpheus",
children: [
{name: "Trinity", children: []}
{name: "Cypher", children: [
{name: "Agent Smith", children: []}
]}
]
}
]
}
Right now, I'm using the following query
MATCH p =(:Crew { name: "Neo" })-[q:KNOWS*0..]-m
RETURN extract(n IN nodes(p)| n)
and getting this
[(0:Crew {name:"Neo"})]
[(0:Crew {name:"Neo"}), (1:Crew {name:"Morpheus"})]
[(0:Crew {name:"Neo"}), (1:Crew {name:"Morpheus"}), (2:Crew {name:"Trinity"})]
[(0:Crew {name:"Neo"}), (1:Crew {name:"Morpheus"}), (3:Crew:Matrix {name:"Cypher"})]
[(0:Crew {name:"Neo"}), (1:Crew {name:"Morpheus"}), (3:Crew:Matrix {name:"Cypher"}), (4:Matrix {name:"Agent Smith"})]
Any tips to figure this out? Thanks
In neo4j 3.x, after you install the APOC plugin on the neo4j server, you can call the apoc.convert.toTree procedure to generate similar results.
For example:
MATCH p=(n:Crew {name:'Neo'})-[:KNOWS*]->(m)
WITH COLLECT(p) AS ps
CALL apoc.convert.toTree(ps) yield value
RETURN value;
... would return a result row that looks like this:
{
    "_id": 127,
    "_type": "Crew",
    "name": "Neo",
    "knows": [
        {
            "_id": 128,
            "_type": "Crew",
            "name": "Morpheus",
            "knows": [
                {
                    "_id": 129,
                    "_type": "Crew",
                    "name": "Trinity"
                },
                {
                    "_id": 130,
                    "_type": "Crew:Matrix",
                    "name": "Cypher",
                    "knows": [
                        {
                            "_id": 131,
                            "_type": "Matrix",
                            "name": "Agent Smith"
                        }
                    ]
                }
            ]
        }
    ]
}
This was such a useful thread on this important topic that I thought I'd add a few thoughts after digging into it a bit further.
First off, using the APOC "toTree" proc has some limits, or better said, dependencies. It really matters how "tree-like" your architecture is. E.g., the LOVES relationship is missing in the APOC call above, and I understand why: that relationship is hard to include when using "toTree". That simple addition is a bit like adding an attribute to a hierarchy, but as a relationship. Not bad to do, but it confounds the simple KNOWS tree. The point being, a good question to ask is "how do I handle such challenges?" This reply is about that.
I do recommend upping one's JSON skills, as this gives you much more granular control. Personally, I found my initial exploration somewhat painful. Might be because I'm an XML person :) but once you figure out all the [, {, and ('s, it is a really powerful way to efficiently pull what's best described as a report on your data. And given that the JSON can easily become a class, it allows for a nice way to push the result back to your app.
I have also found performance to be a challenge with "toTree" vs. just asking for the JSON. I've added below a very simplistic look at what your RETURN could look like; it follows the BNF-style format shown first. I'd love to see this more maturely developed, as the possibilities are quite varied, but this was something I'd have found useful, so I'll post this immature version for now. As they say, "a deeper dive is left up to the reader" 😊
I've obfuscated the values, but this is an actual query on what I'll call a very poor example of a graph architecture, whose many design "mistakes" cause significant performance headaches when trying to produce a holistic report on the graph. The initial report query I inherited took many minutes on a server and could not run on my laptop at all; using this strategy, the updated query now runs in about 5 seconds on my rather wimpy laptop against a DB of about 200K nodes and 0.5M relationships. I added the "persons" grouping alias as a reminder that "persons" will differ in each array element, while the parent construct is repeated over and over again. Where you put that in your hand-grown tree matters, but having the ability to do so is powerful.
Bottom line: mature use of JSON in the RETURN statement gives you powerful control over the results of a Cypher query.
RETURN STATEMENT CONTENT:
<cypher_alias>
{.<cypher_alias_attribute>,
...,
<grouping_alias>:
(<cypher_alias>
{.<cypher_alias_attribute,
...
}
)
...
}
MATCH (j:J{uuid:'abcdef'})-[:J_S]->(s:S)<-[:N_S]-(n:N)-[:N_I]->(i:I), (j)-[:J_A]->(a:P)
WHERE i.title IN ['title1', 'title2']
WITH a,j, s, i, collect(n.description) as desc
RETURN j{.title,persons:(a{.email,.name}), s_i_note:
(s{.title, i_notes:(i{.title,desc})})}
If you know how deep your tree is, you can write something like this:
MATCH p =(:Crew { name: "Neo" })-[q:KNOWS*0..]-(m)
WITH nodes(p)[0] AS a, nodes(p)[1] AS b, nodes(p)[2] AS c, nodes(p)[3] AS d, nodes(p)[4] AS e
WITH (a{.name}) AS ab, (b{.name}) AS bb, (c{.name}) AS cb, (d{.name}) AS db, (e{.name}) AS eb
WITH ab, bb, cb, db{.*,children:COLLECT(eb)} AS ra
WITH ab, bb, cb{.*,children:COLLECT(ra)} AS rb
WITH ab, bb{.*,children:COLLECT(rb)} AS rc
WITH ab{.*,children:COLLECT(rc)} AS rd
RETURN rd
Line 1 is your query. You save all paths from Neo to m in p.
In line 2, p is split into a, b, c, d and e.
Line 3 takes just the names of the nodes. If you want all properties, you can write (a{.*}) AS ab. This step is optional; you can also work with the nodes directly if you want to.
In line 4 you replace db and eb with a map containing all properties of db plus the new property children, which contains all entries of eb for the same db.
Lines 5, 6 and 7 are basically the same. You reduce the result list by grouping.
Finally you return the tree. It looks like this:
{
    "name": "Neo",
    "children": [
        {
            "name": "Morpheus",
            "children": [
                {"name": "Trinity", "children": []},
                {"name": "Cypher", "children": [
                    {"name": "Agent Smith", "children": []}
                ]}
            ]
        }
    ]
}
Unfortunately, this solution only works when you know how deep your tree is, and you have to add a row if your tree is one step deeper.
If someone has an idea how to solve this with dynamic tree depth, please comment.
