Neo4j variable-length pattern matching tunning - neo4j

Query:
PROFILE
MATCH(node:Symptom) WHERE node.symptom =~ '.*adult male.*|.*151.*'
WITH node
MATCH (node)-[*1..2]-(result:Disease)
RETURN result
Profile:
enter image description here
Problems:
There are over 40 thousand "Symptom" nodes in the database, and the query is very slow because of the part - "[*1..2]".
It only took 4 seconds when length is 1, i.e "[*1]", but it will take about 30 seconds when length is 2, i.e "[*1..2]".
Is there any way to tune this query???

Firstly your query is using the regex operator, and it can't use indexes. You should use the CONTAINS operator instead :
MATCH (node:Symptom)
WHERE node.symptom CONTAINS 'adult male' OR node.symptom CONTAINS '151'
RETURN node
And you can create an index :CREATE INDEX ON :Symptom(symptom)
For the second part of your query, as it, there is nothing to do ... it's due to the complexity you are asking to do.
So to have better performances, you should think to :
put the relationship type on the pattern to reduce the number returned path : (node)-[*1..2:MY_REL_TYPE]-(result:Disease)
put the direction on the relationship on the pattern to reduce the number returned path : (node)-[*1..2:MY_REL_TYPE]->(result:Disease)
find an other way to reduce this complexity (filter on a property of the relationship , review your model, etc)
For your information, you can directly write your query in one step (ie. without the WITH, but in your case performances should be the same) :
MATCH (node:Symptom)-[*1..2]-(result:Disease)
WHERE node.symptom CONTAINS 'adult male' OR node.symptom CONTAINS '151'
RETURN result

Related

(Neo4j 3.5) Executing query with two parts and the second one consists on a group of matches where at least one needs to be fulfilled

we are trying to run a cypher query but we are not able to get the results we want.
It is important to note that we cannot make this work with subqueries because we are using Neo4j 3.5 and in this version, they are still not available.
The problem is that we have two parts for a query, the first one will fix the variables for the second part, and the second part consists of multiple matches, and it has to get the results of the previous query and if at least one of the matches return a result for a row, this row is not filtered out, else if none return result it is discarded.
More specifically, the query we are trying to run is similar to the following one:
//First part of the query where we want to fix variables with the match and where
MATCH (u:User)-[:ASSIGNED_TO]->(t:Task)-[:PENDING]->(ob:Object)<-[:HAS_OPEN_OBJECT]-(do:DataObject)<-[:ASOCIATED]-(:Module)-[:CAN_LIST]->(view:WidgetObject)
WHERE u.uid = 'user_uid'
AND view.uid = 'view_uid'
AND view.object_name = do.object_type
with do, t, ob
// In this second part of the query we want to maintain the variables of the previous part and if at least one matches the value should be returned
// we have tried with UNION but we will need pagination, but even with union it's not working
MATCH (ac:Action)<-[:ASOCIATED]-(t)-[rel:COMPLETED|PENDING]->(ob)<-[:HAS_OPEN_OBJECT|HAS_CLOSED_OBJECT]-(do)
WHERE ac.name CONTAINS 'body'
WITH COLLECT({data_object_uid: do.uid}) as act_filter
MATCH (c:Comment)<-[:COMMENTED]-(t)-[rel:COMPLETED|PENDING]->(ob)<-[:HAS_OPEN_OBJECT|HAS_CLOSED_OBJECT]-(do)
WHERE c.body CONTAINS 'body'
WITH act_filter + COLLECT({data_object_uid: do.uid}) as comment_filter
MATCH (at:Attachment)<-[:HAS_ATTACHMENT]-(t)-[rel:COMPLETED|PENDING]->(ob)<-[:HAS_OPEN_OBJECT|HAS_CLOSED_OBJECT]-(do)
WHERE at.name CONTAINS 'body'
WITH comment_filter + COLLECT({data_object_uid: do.uid}) as attachment_filter
UNWIND attachment_filter as row
return row.data_object_uid
We are not sure if in the second part, the second and third matches are maintaining the same subset of results coming from the first part of the query.
A funny behavior we have found is that if we remove the last match we are getting results but if we add it, we are not getting any results. We do not understand this behavior because if the second match is returning results and they are stored in a variable after a collect, appending this to the next collected results should return something.
For example, if the second match returns as comment_filter [{data_object_uid: "343dienmd3-dasd"}] and the third match is not returning anything, after the concatenation in the WITH clause it should return the same thing, but the result is empty.
We need some light here, we don't know if we are close and we are making a stupid mistake or we are getting all wrong and we need to change the approach completely.
Since you do not know which of the three matches in the second part will yield results, I would try something along the lines below:
NOTE I used ASSOCIATED instead of ASOCIATED
MATCH (n)<-[:ASSOCIATED|COMMENTED|HAS_ATTACHMENT]-(t)-[rel:COMPLETED|PENDING]->(ob)<-[:HAS_OPEN_OBJECT|HAS_CLOSED_OBJECT]-(do)
WHERE
(n:Action AND n.name CONTAINS 'body')
OR
(n:Comment AND n.body CONTAINS 'body')
OR
(n:Attachment AND n.name CONTAINS 'body')
RETURN COLLECT(DISTINCT {data_object_uid: do.uid})

Neo4j removing direction to relationship significantly takes more time

I have 2 queries which take significantly different amount of time. What dont I understand in relationships in Cypher to have this issue
less than 1 second with 5 results:
allshortestpaths((et)-[*]->(st))
less than 1 second with 3 results:
allshortestpaths((et)<-[*]-(st))
Takes forever (timeout):
allshortestpaths((et)-[*]-(st))
Why does this take forever. I would assume that this just has to return 8 results!!
Complete sample query:
profile match (s:Stop)--(st:Stoptime),
(e:Stop)--(et:Stoptime)
where s.name IN [ 'Schlump', 'U Schlump']
and e.name IN [ 'Hamburg Hbf', 'Hauptbahnhof Nord']
match p = allshortestpaths((et)-[*]->(st))
return p
With allshortestpaths((et)-[*]->(st)) you will have a paths that look like that : (a)-->(b)-->(c)-->(d)
With allshortestpaths((et)<-[*]-(st)) you will have a path that look like that : (a)<--(b)<--(c)<--(d)
With allshortestpaths((et)<-[*]-(st)) you will have a paths that look like that :
(a)-->(b)-->(c)-->(d)
(a)<--(b)<--(c)<--(d)
(a)-->(b)<--(c)-->(d)
(a)<--(b)<--(c)-->(d)
(a)-->(b)-->(c)<--(d)
(a)-->(b)<--(c)--<(d)
Do you the differences ? The last one is a more complexe query than the previous ones, that's why it can take a long time, specially when you not specify the relationship type and the max depth of the path.
So here it'spossible that you are asking to Neo4j to traverse all your database ...

How to concatenate three columns into one and obtain count of unique entries among them using Cypher neo4j?

I can query using Cypher in Neo4j from the Panama database the countries of three types of identity holders (I define that term) namely Entities (companies), officers (shareholders) and Intermediaries (middle companies) as three attributes/columns. Each column has single or double entries separated by colon (eg: British Virgin Islands;Russia). We want to concatenate the countries in these columns into a unique set of countries and hence obtain the count of the number of countries as new attribute.
For this, I tried the following code from my understanding of Cypher:
MATCH (BEZ2:Officer)-[:SHAREHOLDER_OF]->(BEZ1:Entity),(BEZ3:Intermediary)-[:INTERMEDIARY_OF]->(BEZ1:Entity)
WHERE BEZ1.address CONTAINS "Belize" AND
NOT ((BEZ1.countries="Belize" AND BEZ2.countries="Belize" AND BEZ3.countries="Belize") OR
(BEZ1.status IN ["Inactivated", "Dissolved shelf company", "Dissolved", "Discontinued", "Struck / Defunct / Deregistered", "Dead"]))
SET BEZ4.countries= (BEZ1.countries+","+BEZ2.countries+","+BEZ3.countries)
RETURN BEZ3.countries AS IntermediaryCountries, BEZ3.name AS
Intermediaryname, BEZ2.countries AS OfficerCountries , BEZ2.name AS
Officername, BEZ1.countries as EntityCountries, BEZ1.name AS Companyname,
BEZ1.address AS CompanyAddress,DISTINCT count(BEZ4.countries) AS NoofConnections
The relevant part is the SET statement in the 7th line and the DISTINCT count in the last line. The code shows error which makes no sense to me: Invalid input 'u': expected 'n/N'. I guess it means to use COLLECT probably but we tried that as well and it shows the error vice-versa'd between 'u' and 'n'. Please help us obtain the output that we want, it makes our job hell lot easy. Thanks in advance!
EDIT: Considering I didn't define variable as suggested by #Cybersam, I tried the command CREATE as following but it shows the error "Invalid input 'R':" for the command RETURN. This is unfathomable for me. Help really needed, thank you.
CODE 2:
MATCH (BEZ2:Officer)-[:SHAREHOLDER_OF]->(BEZ1:Entity),(BEZ3:Intermediary)-
[:INTERMEDIARY_OF]->(BEZ1:Entity)
WHERE BEZ1.address CONTAINS "Belize" AND
NOT ((BEZ1.countries="Belize" AND BEZ2.countries="Belize" AND
BEZ3.countries="Belize") OR
(BEZ1.status IN ["Inactivated", "Dissolved shelf company", "Dissolved",
"Discontinued", "Struck / Defunct / Deregistered", "Dead"]))
CREATE (p:Connections{countries:
split((BEZ1.countries+";"+BEZ2.countries+";"+BEZ3.countries),";")
RETURN BEZ3.countries AS IntermediaryCountries, BEZ3.name AS
Intermediaryname, BEZ2.countries AS OfficerCountries , BEZ2.name AS
Officername, BEZ1.countries as EntityCountries, BEZ1.name AS Companyname,
BEZ1.address AS CompanyAddress, AS TOTAL, collect (DISTINCT
COUNT(p.countries)) AS NumberofConnections
Lines 8 and 9 are the ones new and to be in examination.
First Query
You never defined the identifier BEZ4, so you cannot set a property on it.
Second Query (which should have been posted in a separate question):
You have several typos and a syntax error.
This query should not get an error (but you will have to determine if it does what you want):
MATCH (BEZ2:Officer)-[:SHAREHOLDER_OF]->(BEZ1:Entity),(BEZ3:Intermediary)- [:INTERMEDIARY_OF]->(BEZ1:Entity)
WHERE BEZ1.address CONTAINS "Belize" AND NOT ((BEZ1.countries="Belize" AND BEZ2.countries="Belize" AND BEZ3.countries="Belize") OR (BEZ1.status IN ["Inactivated", "Dissolved shelf company", "Dissolved", "Discontinued", "Struck / Defunct / Deregistered", "Dead"]))
CREATE (p:Connections {countries: split((BEZ1.countries+";"+BEZ2.countries+";"+BEZ3.countries), ";")})
RETURN BEZ3.countries AS IntermediaryCountries,
BEZ3.name AS Intermediaryname,
BEZ2.countries AS OfficerCountries ,
BEZ2.name AS Officername,
BEZ1.countries as EntityCountries,
BEZ1.name AS Companyname,
BEZ1.address AS CompanyAddress,
SIZE(p.countries) AS NumberofConnections;
Problems with the original:
The CREATE clause was missing a closing } and also a closing ).
The RETURN clause had a dangling AS TOTAL term.
collect (DISTINCT COUNT(p.countries)) was attempting to perform nested aggregation, which is not supported. In any case, even if it had worked, it probably would not have returned what you wanted. I suspect that you actually wanted the size of the p.countries collection, so that is what I used in my query.

Solr and Rails: [* TO *] value instead of nil (asterisk TO asterisk)

Inside my model at searchable block I have index time added_at.
At search block for searching I added with(:added_at, nil), made reindex and now inside search object I have:
<Sunspot::Search:{:fq=>["-added_at_d:[* TO *]"]...}>
What is the meaning of this [* TO *] ? Something went wrong?
By adding with(:added_at, nil) you narrow down the search results to documents having no values in the field added_at, so we can expect the corresponding query filter to be defined as :
fq=>["added_at_d:null"] # not valid
The problem is that Solr Standard Query Parser does not support searching a field for empty/null value. In this situation the filter needs to be negated (exluding documents having any value in the field) so that the query remains valid.
The operator - can be used to exclude the field, and the wildcard character * can be used to match any value, now we can expect the query filter to look like :
fq=>["-added_at_d:*"]
However, although the above is valid for the query parser, using a range query should be preferred to prevent inconsitent behaviors when using wildcard within negative subqueries.
Range Queries allow one to match documents whose field(s) values are
between the lower and upper bound specified by the Range Query. Range
Queries can be inclusive or exclusive of the upper and lower bounds.
A * may be used for either or both endpoints to specify an open-ended range query.
Eventually there is nothing wrong with this filter that ends up looking like :
fq=>["-added_at_d:[* TO *]"]
cf. Lucene Range Queries, Solr Standard Query Parser

Neo4j Performance for large dataset

I am trying to load large dataset into neo4j-3 and looking for the options. I found one neo4j-import but the problem with that is it is for initial load only. I have to load 2M records around every week.
I tried loading through shell but having some performance issue, I tried following.
1) Creating constraint upfront.
2) Creating Node and relationships in separate query.
3) Heap space 8G
4) dbms.memory.pagecache 4G
Many times the import just hangs and does nothing for hours.
Edit - CSV load being executed:
USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS
FROM "file:///my_sds_39_joe.csv"
AS row
OPTIONAL MATCH (per:Person {UID : "Person."+row.player_cardnum})
WHERE per IS NULL
MERGE (p:Person {CardNumber : row.player_cardnum})
ON CREATE SET p.Creation Date = timestamp(), p.Modification Date = timestamp() ;
EDIT
On a second look, seems like you're trying to implement some kind of conditional logic to your insert.
It looks like what you're trying to do is figure out if a :Person exists with a UID (derived from some concatenation with row.player_cardnum), and in the case where that :Person doesn't exist and the match fails, MERGE a :Person with the CardNumber given by row.player_cardnum.
If this is your goal, you're ALMOST there with your query. The problem is with your WHERE clause.
Understand that WHERE clauses are linked with a preceding MATCH, OPTIONAL MATCH, or WITH, and only affects the linked clause.
With that WHERE on that OPTIONAL MATCH, per will always be null, but more importantly, your row will still exist, and the following MERGE will ALWAYS take place for all rows in the CSV. This is probably the source of your slowdown, as it's creating new :Person nodes for all rows.
If you're trying to null out the row completely when the OPTIONAL MATCH hits on an existing :Person (so the MERGE won't happen in that case), you'll need to add a WITH clause, and make sure your WHERE clause is applied to it instead of the OPTIONAL MATCH.
Additionally, make sure that you have either unique constraints or indexes on Person.UID and Person.CardNumber. As for the UID match, I've heard that indexes are not used when there's some kind of string concatenation of the thing you're matching upon, so you may need to assemble it first and pass it in with a WITH.
Your final query would look like this:
USING PERIODIC COMMIT 5000
LOAD CSV WITH HEADERS
FROM "file:///my_sds_39_joe.csv"
AS row
// first build the UID so we can take advantage of the index
WITH row, "Person." + row.player_cardnum AS UID
OPTIONAL MATCH (per:Person {UID : UID})
// the WHERE now applies to the WITH, which will filter out and null out the row when an OPTIONAL MATCH is found
WITH row, per
WHERE per IS NULL
MERGE (p:Person {CardNumber : row.player_cardnum})
ON CREATE SET p.Creation Date = timestamp(), p.Modification Date = timestamp() ;

Resources