PROFILE and EXPLAIN not showing anything on cypher-shell - neo4j

After seeing this question, I've been reading this blog post about the need to avoid Eager when loading a very large CSV into Neo4j.
In my case, I have a ~27 million line CSV, totaling ~8.5 GB in size. It seems pretty important that I break my query up into several queries to avoid the Eager operator.
EXPLAIN and PROFILE both offer ways to "test" a query. In Mark Needham's blog post linked above, he mentions:
You'll notice that when we profile each query we're stripping off the
periodic commit section and adding a 'WITH row LIMIT 0'. This allows
us to generate enough of the query plan to identify the 'Eager'
operator without actually importing any data.
However, when I try to test my query in cypher-shell with PROFILE prepended... nothing happens. I get no output or report back.
$ ./bin/cypher-shell
Connected to Neo4j 3.3.5 at bolt://localhost:7687 as user neo4j.
Type :help for a list of available commands or :exit to exit the shell.
Note that Cypher queries must end with a semicolon.
neo4j> :begin
neo4j# PROFILE LOAD CSV WITH HEADERS FROM "file:///myfile.tsv" AS line FIELDTERMINATOR '\t'
WITH line LIMIT 0
MERGE ...
I also tried EXPLAIN and saw the same behavior -- no report or output.
If I paste the same PROFILE ... command into the Neo4j web interface, I do see the graphical plan show up, and even a warning tab telling me about Eager. That is better than nothing, I suppose, but the graphical display is hard to read through. I'd really like to use cypher-shell for this, but bizarrely it isn't showing anything.
I have also tried piping the EXPLAIN or PROFILE query to cypher-shell, but that just gives me some metadata, not the actual plan.
$ cat query.cypher | ./bin/cypher-shell --format plain
Plan: "EXPLAIN"
Statement: "READ_WRITE"
Version: "CYPHER 3.3"
Planner: "COST"
Runtime: "INTERPRETED"
Time: 155
PROFILE:
$ cat query.cypher | ./bin/cypher-shell --format plain
Plan: "PROFILE"
Statement: "READ_WRITE"
Version: "CYPHER 3.3"
Planner: "COST"
Runtime: "INTERPRETED"
Time: 285
DbHits: 0
Rows: 1
count(*)
0
Any ideas what is going on?

That :begin opens a transaction; the query itself won't execute until you end it with :commit.
In this case, you can leave off :begin completely and just end the query with a semicolon. Also, since you're only after the query plan here, use EXPLAIN so the query isn't actually executed.
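For example, using the query from the question (a sketch; the MERGE body is elided as in the original):
neo4j> EXPLAIN LOAD CSV WITH HEADERS FROM "file:///myfile.tsv" AS line FIELDTERMINATOR '\t'
       WITH line LIMIT 0
       MERGE ...;
With the trailing semicolon and no open transaction, the shell runs the statement immediately and prints the plan, and EXPLAIN produces that plan without touching any data.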

Related

Blazer - Escape Characters

I feel like I am missing something, but I could not find it in the documentation on GH.
What are the escape characters for Blazer when searching in a string that contains a ' or "?
Example:
SELECT * FROM "search_filters"
where "params" like '%with_vehicles_id"=>[%'
LIMIT 100
Update:
The underlying database is Postgres 11. This is a Blazer question, as the query above works just fine in a tool like DBeaver or the console. I believe this is related to how Blazer parses the query before it is sent.
I'm not very familiar with Blazer, but it looks like a BI tool that lets you run SQL queries against your database, and there's a playground here.
For PostgreSQL you don't need to do anything special for a double quote inside single quotes. The query as you wrote it would execute in a Postgres terminal, and the same approach works in the Blazer playground.
SELECT * FROM "search_filters"
where "params" like '%text"text%'
LIMIT 100
To query on a string that includes a single quote, PostgreSQL has you use two sequential single quotes, like this:
SELECT * FROM "search_filters"
where "params" like '%text''text%'
LIMIT 100
Here's a link with more information:
https://www.prisma.io/dataguide/postgresql/short-guides/quoting-rules
-- UPDATE --
Based on your error message ("syntax error at or near "LIMIT" LINE 3: LIMIT 100 LIMIT 1000") it looks like two "LIMIT" clauses are being added to the SQL query. Do you have gems/plugins that modify the query, and is there a way to disable them to see if that's causing the problem?

Low performance of neo4j

I am a server engineer at a company that provides a dating service.
Currently I am building a PoC for our new recommendation engine.
I am trying to use Neo4j, but the performance of this database does not meet our needs.
I have a strong feeling that I am doing something wrong and that Neo4j can do much better.
So can someone give me advice on how to improve the performance of my Cypher query, or how to tune Neo4j the right way?
I am using neo4j-enterprise-2.3.1 running on a c4.4xlarge instance with Amazon Linux.
In our dataset each user can have 4 types of relationships with other users - LIKE, DISLIKE, BLOCK and MATCH.
Each user also has properties like countryCode, birthday and gender.
I imported all our users and relationships from our RDBMS into Neo4j using the neo4j-import tool.
So each user is a node with properties and each reference is a relationship.
The report from the neo4j-import tool said that:
2 558 667 nodes,
1 674 714 539 properties and
1 664 532 288 relationships
were imported.
So it's a huge DB :-) In our case some nodes can have up to 30,000 outgoing relationships.
I created 3 indexes in Neo4j:
Indexes
ON :User(userId) ONLINE
ON :User(countryCode) ONLINE
ON :User(birthday) ONLINE
Then I tried to build an online recommendation engine using this query:
MATCH (me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()<-[:LIKE | :MATCH]-(similar:User)
USING INDEX me:User(userId)
USING INDEX similar:User(birthday)
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
Here is the execution plan for one of the users:
[execution plan image]
When I executed this query for a list of users I got the following results:
count=2391, min=4565.128849, max=36257.170065, mean=13556.750555555178, stddev=2250.149335254768, median=13405.409811, p75=15361.353029999998, p95=17385.136478, p98=18040.900481, p99=18426.811424, p999=19506.149138, mean_rate=0.9957385490980866, m1=1.2148195797996817, m5=1.1418078036067119, m15=0.9928564378521962, rate_unit=events/second, duration_unit=milliseconds
So even the fastest is too slow for real-time recommendations.
Can you tell me what I am doing wrong?
Thanks.
EDIT 1: plan with the expanded boxes: [execution plan image]
I built an unmanaged extension to see if I could do better than Cypher. You can grab it here => https://github.com/maxdemarzi/social_dna
This is a first shot; there are a couple of things we can do to speed things up. We can pre-calculate/save similar users, cache things here and there, and do random other tricks. Give it a shot and let us know how it goes.
Regards,
Max
If I'm reading this right, it's finding all matches for users by userId and separately finding all matches for users by your various criteria. It's then finding all of the places that they come together.
Since you have a case where you're starting on the left with a single node, my guess is that we'd be better served by following the paths and then filtering what we've gotten via relationship traversal.
Let's see how starting like this works for you:
MATCH
(me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()
<-[:LIKE | :MATCH]-(similar:User)
WITH similar
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
[UPDATED]
One possible (and nonintuitive) cause of inefficiency in your query is that when you specify the similar:User(birthday) filter, Cypher uses an index seek with the :User(birthday) index (and additional tests for countryCode and gender) to find all possible DB matches for similar. Let's call that large set of similar nodes A.
Only after finding A does the query filter to see which of those nodes are actually connected to me, as specified by your MATCH pattern.
Now, if there are relatively few me to similar paths (as specified by the MATCH pattern, but without considering its WHERE clause) as compared to the size of A -- say, 2 or more orders of magnitude smaller -- then it might be faster to remove the :User label from similar (since I presume they are probably all going to be users anyway, in your data model), and also remove the USING INDEX similar:User(birthday) clause. In this case, not using the index for similar may actually be faster for you, since you will only be using the WHERE clause on a relatively small set of nodes.
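For example, the first stage of the query might become something like this (a sketch; the USING INDEX similar:User(birthday) hint is dropped and similar is left unlabeled, with the rest of the query unchanged):
MATCH (me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()<-[:LIKE | :MATCH]-(similar)
USING INDEX me:User(userId)
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
...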
The same considerations also apply to the recommendation node.
Of course, this all has to be verified by testing on your actual data.

Out of memory when creating large number of relationships

I'm new to Neo4j, and I want to try it on some data I've exported from MySQL. I've got the community edition running with neo4j console, and I'm entering commands using the neo4j-shell command line client.
I have 2 CSV files, that I use to create 2 types of node, as follows:
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/tmp/updates.csv" AS row
CREATE (:Update {update_id: row.id, update_type: row.update_type, customer_name: row.customer_name, .... });
CREATE INDEX ON :Update(update_id);
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM "file:/tmp/facts.csv" AS row
CREATE (:Fact {update_id: row.update_id, status: row.status, ..... });
CREATE INDEX ON :Fact(update_id);
This gives me approx 650,000 Update nodes, and 21,000,000 Fact nodes.
Once the indexes are online, I try to create relationships between the nodes, as follows:
MATCH (a:Update)
WITH a
MATCH (b:Fact{update_id:a.update_id})
CREATE (b)-[:FROM]->(a)
This fails with an OutOfMemoryError. I believe this is because Neo4j does not commit the transaction until it completes, keeping it in memory.
What can I do to prevent this? I have read about USING PERIODIC COMMIT but it appears this is only useful when reading the CSV, as it doesn't work in my case:
neo4j-sh (?)$ USING PERIODIC COMMIT
> MATCH (a:Update)
> WITH a
> MATCH (b:Fact{update_id:a.update_id})
> CREATE (b)-[:FROM]->(a);
QueryExecutionKernelException: Invalid input 'M': expected whitespace, comment, an integer or LoadCSVQuery (line 2, column 1 (offset: 22))
"MATCH (a:Update)"
^
Is it possible to create relationships in this way, between large numbers of existing nodes, or do I need to take a different approach?
The OutOfMemoryError is expected, as Neo4j will try to commit everything at once. Since you didn't mention otherwise, I assume the Java heap settings are at their default (512m).
You can, however, batch the process with a kind of pagination; I would also prefer MERGE rather than CREATE in this case:
MATCH (a:Update)
WITH a
SKIP 0
LIMIT 50000
MATCH (b:Fact{update_id:a.update_id})
MERGE (b)-[:FROM]->(a)
Modify SKIP and LIMIT after each batch until you reach the 650k Update nodes.
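For example, the second batch would be (a sketch; keep increasing SKIP by 50000 for each run):
MATCH (a:Update)
WITH a
SKIP 50000
LIMIT 50000
MATCH (b:Fact{update_id:a.update_id})
MERGE (b)-[:FROM]->(a)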

store temp variables in neo4j

I have some cypher queries that I execute against my neo4j database. The query is in this form
MATCH p=(j:JOB)-[r:HAS|STARTS]->(s:URL)-[r1:VISITED]->(t:URL)
WHERE j.job_id =5000 and r1.origin='iframe' and r1.job_id=5000 AND NOT (t.netloc =~ 'VERY_LONG_LIST')
RETURN count(r1) AS number_iframes;
If you can't understand what I am doing, here is a much simpler query:
MATCH (s:WORD)
WHERE NOT (s.text=~"badword1|badword2|badword3")
RETURN s
I am basically trying to match some words against a specific list.
The problem is that this list is very large. As you can see, my job_id=5000, and I have more than 20000 jobs, so if my whitelist is 1MB long then I will end up with very large queries. I tried 500 jobs and ended up with a 200 MB query file.
I was trying to execute these queries using transactions from py2neo, but this won't be feasible because my POST request will be very large and it will time out. As a result, I thought of using
neo4j-shell -file <queries_file>
However, as you can see, the file size is very large because of the large whitelist. So my question is: is there any way to store this "whitelist" in a variable in Neo4j using Cypher?
I wish there were something similar to this:
SAVE $whitelist="word1,word2,word3,word4,word5...."
MATCH p=(j:JOB)-[r:HAS|STARTS]->(s:URL)-[r1:VISITED]->(t:URL)
WHERE j.job_id =5000 and r1.origin='iframe' and r1.job_id=5000 AND NOT (t.netloc =~ $whitelist)
RETURN count(r1) AS number_iframes;
What datatype is your netloc?
If you have an index on netloc you can also use t.netloc IN {list} where {list} is a parameter provided from the outside.
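For example (a sketch, assuming the whitelist is passed from the outside as a {list} parameter rather than inlined into the query text):
MATCH p=(j:JOB)-[r:HAS|STARTS]->(s:URL)-[r1:VISITED]->(t:URL)
WHERE j.job_id = 5000 AND r1.origin = 'iframe' AND r1.job_id = 5000
AND NOT t.netloc IN {list}
RETURN count(r1) AS number_iframes;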
Such large regular expressions will not be fast.
What exactly is your regexp and netloc format like? Perhaps you can change that into a split + index-list lookup?
In general also for regexps you can provide an outside parameter.
You can also use "IN" + index for job_ids.
You can also run a separate job that tags the jobs within your whitelist with a label and use that label for additional filtering e.g. in the match already.
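A sketch of that labeling approach (the :Whitelisted label name and the tagging query are assumptions for illustration):
// one-off tagging job, run with the whitelist as a {list} parameter
MATCH (t:URL) WHERE t.netloc IN {list}
SET t:Whitelisted;
// the main query can then filter on the label instead of a regexp
MATCH p=(j:JOB)-[r:HAS|STARTS]->(s:URL)-[r1:VISITED]->(t:URL)
WHERE j.job_id = 5000 AND r1.origin = 'iframe' AND NOT t:Whitelisted
RETURN count(r1) AS number_iframes;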
Why do you have to check this twice? Isn't it enough that the job has id=5000?
j.job_id =5000 and r1.job_id=5000

Trying to execute a list of Cypher statements in Neo4j via the admin interface

I have a file that contains a long list of Cypher statements, something like:
create (n:oeuvre {ide12:"41",numpers:[87603],titre:"JE PARS"});
create (n:oeuvre {ide12:"151",numpers:[395225,364617,396308,306762],titre:"I DID IT FOR LOVE"});
create (n:oeuvre {ide12:"67",numpers:[54001],titre:"GRAND PERE N AIME PAS LE"});
create (n:oeuvre {ide12:"80",numpers:[58356],titre:"MON HEURE DE SWING"});
create (n:oeuvre {ide12:"91",numpers:[58356],titre:"AU QUATRIEME TOP"});
When I drag my file onto the Cypher admin console area "Drop a file to import Cypher or Grass" and then click on the little play icon, I get the message "Expected exactly one statement per query but got: 1405".
Is there a way to batch execute Cypher requests via the admin console? The wording "Drop a file to import Cypher" seems to suggest so.
Thanks
Yann
Yeah, the console just lets you run one statement at a time. Fortunately a statement can have multiple CREATE clauses, so if you just remove the semicolon characters it should work.
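For example, the first two statements become one (a sketch; the n variable is dropped because the same variable can't be redeclared within a single statement):
create (:oeuvre {ide12:"41",numpers:[87603],titre:"JE PARS"})
create (:oeuvre {ide12:"151",numpers:[395225,364617,396308,306762],titre:"I DID IT FOR LOVE"})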
Alternatively you can use the neo4j-shell command with the -file argument to run a Cypher script file. This method allows for scripts with multiple commands separated by semicolons.
