I understand that Neo4j supports different options for running Cypher queries: the web browser, the Neo4j shell, and the REST API.
Is there a difference in performance when using the shell and the API?
I'm working on a dataset that has around 10 million objects (nodes + edges).
Thanks!
The web browser uses the REST API in the backend. The shell is connected directly to Neo4j.
So yes, you will see performance differences; the shell will generally be faster. However, running queries in the shell will be slower than calling the REST API from your application, because in the shell you can't pass parameters.
In your application, passing parameters allows the query's execution plan to be cached and reused (after warmup).
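A minimal sketch of what "passing parameters" can look like from an application, assuming Neo4j's transactional HTTP endpoint (POST /db/data/transaction/commit in the 2.x/3.x servers). The helper name, the Content label and the id property are hypothetical, and nothing is sent over the network here; the point is only that the statement text stays identical across calls while the parameter map changes.

```python
import json

# Hypothetical helper: builds the JSON body expected by the transactional
# HTTP endpoint. It only constructs the payload; it does not send it.
def build_statement_payload(statement, parameters):
    return {"statements": [{"statement": statement, "parameters": parameters}]}

# The query text is constant. Neo4j 3.x+ accepts $name placeholders;
# older servers used {name} instead. Label and property are illustrative.
QUERY = "MATCH (c:Content) WHERE c.id = $content_id RETURN c"

# One payload per lookup: same statement string, different parameter values.
payloads = [build_statement_payload(QUERY, {"content_id": i}) for i in (1, 2, 3)]

for payload in payloads:
    print(json.dumps(payload))
```

Because every request carries the same statement string, the server can recognise it as one query rather than three different ones, which is the precondition for reusing a cached plan.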
Also, if you have bad indexes and bad queries, running them on a 10-million-object dataset will perform poorly in the shell, in the browser, and in your application alike.
Related
I have some really long-running queries. Just as background information: I am crawling my graph for all instances of a specific meta path. For example, count all instances of a specific meta path found in the graph.
MATCH (a:Content)-[:isTaggedWith]->(t:Term)<-[:isTaggedWith]-(b:Content) RETURN count(*)
First of all, I want to measure the runtimes. Is there any way to do so, especially in the Community Edition?
Furthermore, I cannot tell whether a query is still running in Neo4j or has already been terminated. I issue the query from a REST client, but I am open to other options if necessary. For example, I queried Neo4j with a REST client and set the client-side read timeout to 2 days. The problem is that I can't verify whether the query is still running or whether the client is simply waiting for an answer that will never arrive because the query has already been killed in the backend. Is there really no way to check, from the browser or another client, which queries are currently running, ideally with an option to terminate them as well?
Thanks in advance!
Measuring Query Performance
To answer your first question, there are two main options for measuring the performance of a query. The first is to use PROFILE; put it in front of a query (like PROFILE MATCH (a:Content)-[:isTaggedWith]->(t:Term)...), and it will execute the query and display the execution plan used, including the native API calls, number of results from each operation, number of total database hits, and total time of execution.
The downside is that PROFILE will execute the query, so if it is an operation that writes to the database, the changes are persisted. To inspect a query without actually executing it, EXPLAIN can be used instead of PROFILE. This will show the query plan and native operations that will be used to execute the query, as well as the estimated number of rows for each operation, but it will not actually run the query, so the figures are only estimates.
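The choice between the two prefixes can be sketched as a small client-side helper. This is only an illustration: the function name and the keyword check are hypothetical, and the regular expression is a crude heuristic for spotting write clauses, not a Cypher parser.

```python
import re

# Crude, illustrative check for clauses that modify the graph.
WRITE_CLAUSES = re.compile(r"\b(CREATE|MERGE|DELETE|SET|REMOVE)\b", re.IGNORECASE)

def plan_query(query, execute=False):
    """Return the query prefixed with PROFILE (runs it) or EXPLAIN (plan only).

    Refuses to build a PROFILE statement for something that looks like a
    write, because PROFILE executes the query and persists its changes.
    """
    if execute and WRITE_CLAUSES.search(query):
        raise ValueError("PROFILE would execute this write query; use EXPLAIN")
    prefix = "PROFILE " if execute else "EXPLAIN "
    return prefix + query.strip()

read_query = "MATCH (a:Content)-[:isTaggedWith]->(t:Term) RETURN count(*)"
print(plan_query(read_query))                # plan only, nothing is executed
print(plan_query(read_query, execute=True))  # executes and reports actual hits
```

The resulting string is what you would send to the shell, browser, or REST API in place of the original query.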
Checking Long Running Queries (Enterprise only)
Checking for running queries can be accomplished using Cypher in Enterprise Edition: CALL dbms.listQueries;. You must be logged in as an admin user to perform the query. If you want to stop a long-running query, use CALL dbms.killQuery() and pass in the ID of the query you wish to terminate.
Note that a query is normally killed on the backend only if someone kills it manually or it exceeds the configured query timeout; unless you have something else set up to kill long-running queries, they should in general keep running. With the method above, you can verify that a query is indeed still executing after you send it.
These are available only in Enterprise Edition; there is no way that I am aware of to use these functions or replicate their behavior in Community.
For measuring long running queries I figured out the following approach:
Use a tmux (tmux crash course) terminal session, which is very easy. This lets you execute your query and close the terminal; later on you can re-attach to the session.
New session: tmux new -s *sessionName*
Detach from current session (within session): tmux detach
List sessions: tmux ls
Re-attach to session: tmux a -t *sessionName*
Within the tmux session, execute the query via the cypher shell, either directly in the shell or by piping the command into the shell. The latter approach is preferable because you can use the Unix command time to actually measure the runtime as follows:
time cat query.cypher | cypher-shell -u neo4j -p n > result.txt
The file query.cypher simply contains the regular query, including the terminating semicolon at the end. The result of the query will be written to result.txt and the runtime of the execution will be displayed in the terminal.
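If you prefer to capture the runtime programmatically instead of reading it off the terminal, the same idea can be sketched in Python with the standard library. The function name is hypothetical; it measures wall-clock time around the whole process, much like time does for the pipeline above.

```python
import subprocess
import time

def time_piped_command(argv, stdin_text, output_path):
    """Feed stdin_text to argv, write its stdout to output_path,
    and return (elapsed_seconds, exit_code)."""
    start = time.perf_counter()
    with open(output_path, "w") as out:
        proc = subprocess.run(argv, input=stdin_text, stdout=out, text=True)
    elapsed = time.perf_counter() - start
    return elapsed, proc.returncode

# Intended usage, assuming cypher-shell is on the PATH and the credentials
# from the command above (left commented out so the sketch has no side effects):
#
# with open("query.cypher") as f:
#     seconds, code = time_piped_command(
#         ["cypher-shell", "-u", "neo4j", "-p", "n"], f.read(), "result.txt")
# print(f"exit code {code}, took {seconds:.1f}s")
```

Checking the exit code alongside the elapsed time also tells you whether the shell finished normally or was cut off.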
Moreover, listing the running queries is possible only in the Enterprise Edition, as correctly stated by @rebecca.
So I am working on a little project that sets up a streaming pipeline using Google Dataflow and Apache Beam. I went through some tutorials and was able to get a pipeline up and running that streams into BigQuery, but I want to stream it into a full relational DB (i.e., Cloud SQL). I have searched through this site and throughout Google, and it seems that the best route to achieve that would be to use JdbcIO. I am a bit confused here because everything I find on how to do this refers to writing to Cloud SQL in batches and not to full-on streaming.
My simple question is: can I stream data directly into Cloud SQL, or would I have to send it in batches instead?
Cheers!
You should use JdbcIO - it does what you want, and it makes no assumption about whether its input PCollection is bounded or unbounded, so you can use it in any pipeline and with any Beam runner; the Dataflow Streaming Runner is no exception to that.
In case your question is prompted by reading its source code and seeing the word "batching": it simply means that for efficiency, it writes multiple records per database call - the overloaded use of the word "batch" can be confusing, but here it simply means that it tries to avoid the overhead of doing an expensive database call for every single record.
In practice, the number of records written per call is at most 1000 by default, but in general depends on how the particular runner chooses to execute this particular pipeline on this particular data at this particular moment, and can be less than that.
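To make the meaning of "batching" here concrete, the following is a plain-Python sketch of the idea, not Beam or JdbcIO code. The function name is hypothetical, and the limit of 1000 is taken from the default mentioned above. Records arrive one at a time, as they would from an unbounded source, and are grouped so that one database call covers many records.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")

def batches(records: Iterable[T], max_batch_size: int = 1000) -> Iterator[List[T]]:
    """Group a (possibly endless) stream of records into lists of at most
    max_batch_size, each of which would be written with one database call."""
    batch: List[T] = []
    for record in records:
        batch.append(record)
        if len(batch) >= max_batch_size:
            yield batch
            batch = []
    # Flush whatever is left; a real runner may also flush earlier,
    # which is why batches can be smaller than the limit.
    if batch:
        yield batch

sizes = [len(b) for b in batches(range(2500))]
print(sizes)  # [1000, 1000, 500]
```

The input never needs to be finite for this to work, which is why this kind of batching is compatible with a streaming pipeline.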
I am using Neo4JClient to connect to my Neo4J database and execute Cypher queries. My goal is to check the performance of the queries I send to the database. The problem is that I have to measure it on the DB side, so I can't use Stopwatch in .NET. Queries have to be executed using Neo4JClient. I don't need to know execution times for specific queries; an average over the last 1000 queries, for example, would be enough.
I can use only Neo4J Community Edition.
Thanks in advance!
Neo4j Enterprise Edition has the capability to log slow queries taking longer than a given threshold, see the config settings containing querylog on http://neo4j.com/docs/stable/configuration-settings.html.
I am using Neo4j 2.0.0M4 Community Edition with Node.js and https://github.com/thingdom/node-neo4j to access the Neo4j DB server over the REST API by passing Cypher queries.
I have observed that data is returned quite slowly by Neo4j, both in the Neo4j webadmin and over the REST API. For example,
a query returning 900 records takes 1.2 s, and subsequent runs take around 200 ms.
Similarly, if the number of records goes up to 27000, the query in the webadmin browser takes 21 s.
I am wondering what's causing the REST API to be so slow, and how to go about improving the performance.
a) Is it Cypher itself or the JSON parsing, or
b) the HTTP overhead itself? A similar query returning 27000 records in MySQL takes 11 ms.
Any help is highly appreciated
Neo4j 2.0 is currently a milestone build that is not yet performance optimized.
Consider enabling streaming and make sure you use parameterized Cypher.
For large result sets the browser consumes a lot of time for rendering. You might try the same query using cURL to see a difference.
I am using Torquebox to build a Rails application with an embedded Neo4j instance as the datastore. I've read multiple blogs that have said that Torquebox is great for this because the Backgroundable method calls run in the same process (replacing delayed_job, which doesn't work under JRuby anyway).
Unfortunately after playing around with it, this clearly isn't the case since the new thread keeps trying to start Neo4j and it fails.
After looking at the documentation, I did find this which confirms it:
The message processors run in a separate ruby runtime from the application, which may be on a different machine if you have a cluster.
I'm new to Torquebox, so I'm not sure if people are just incorrect on this, or is there another way with Torquebox to do an asynchronous call that runs in the same process so it can interact with an embedded Neo4j data store?
I'm unfamiliar with Rails/Torquebox, but are you creating a new Neo4j graph in each thread? If so, in Neo4j, only one connection can be made to the graph database in an embedded environment. If you host a Neo4j and use a RESTful client to call the DB you can have multiple clients.