Uniqueness in BatchInserter of Neo4j

I am using a "BatchInserter" to build a graph (in a single thread). I want to make sure nodes (and possibly relationships) are unique. My current solution is to check whether the node exists in the following manner:
String name = (String) nodeProperties.get(IndexKeys.CATEGORY_KEY);
if (index.get(IndexKeys.CATEGORY_KEY, name).size() > 0)
    return index.get(IndexKeys.CATEGORY_KEY, name).getSingle();
Long nodeID = inserter.createNode(nodeProperties, categoryLabel);
index.add(nodeID, nodeProperties);
index.flush();
It seems to be working fine, but as you can see it is I/O-expensive: it flushes on every new addition, which I believe is a Lucene "commit". This is slowing down my code considerably.
I am aware of put-if-absent and UniqueFactory. As documented:
By using put-if-absent functionality, entity uniqueness can be guaranteed using an index.
Here the index acts as the lock and will only lock the smallest part
needed to guarantee uniqueness across threads and transactions. To
get the more high-level get-or-create functionality make use of
UniqueFactory
However, these are for transaction-based interactions with the graph. What I would like is to ensure uniqueness of nodes, and possibly relationships, under batch-insertion semantics, in a way that is faster than my current setup.
Any pointers would be much appreciated.
Thank you

You should investigate the MERGE keyword in Cypher. I believe this will let you exploit your auto-indexes without having to query them yourself. More broadly, you might want to see whether you can formulate your bulk load in a way that is conducive to piping large volumes of Cypher queries through the neo4j-shell.
Finally, as general pointers and background, you should check out this information on bulk loading.

When I encountered this problem, I just decided to go tyrant and enforce unique index values on my own. Can't you do the same? I mean, ensure uniqueness before you do the insertions?
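One way to read "ensure uniqueness before you do the insertions" is to keep the key-to-node-id mapping in memory, so the Lucene index is never queried (and so never needs flushing) during the load. The sketch below is only an illustration under that assumption: UniqueNodeCache and its createNode callback are hypothetical names, and the callback stands in for whatever call actually creates the node (for example inserter.createNode in the question). It relies on the load being single-threaded, as stated in the question, and on the set of keys fitting in memory.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.ToLongFunction;

// Hypothetical helper, not part of the Neo4j API.
class UniqueNodeCache {
    // Unique key -> id of the node already created for that key.
    private final Map<String, Long> idsByKey = new HashMap<>();
    // Stand-in for the real node-creating call; returns the new node id.
    private final ToLongFunction<String> createNode;

    UniqueNodeCache(ToLongFunction<String> createNode) {
        this.createNode = createNode;
    }

    // Returns the existing node id for the key, or creates the node once.
    long getOrCreate(String key) {
        Long existing = idsByKey.get(key);
        if (existing != null) {
            return existing;
        }
        long id = createNode.applyAsLong(key);
        idsByKey.put(key, id);
        return id;
    }

    public static void main(String[] args) {
        long[] nextId = {0}; // fake id generator, for the demo only
        UniqueNodeCache cache = new UniqueNodeCache(key -> nextId[0]++);
        long first = cache.getOrCreate("books");
        cache.getOrCreate("music");
        long again = cache.getOrCreate("books");
        System.out.println(first == again); // true: "books" was created once
        System.out.println(nextId[0]);      // 2: only two nodes were created
    }
}
```

With this in place the index would still be populated with index.add for later lookups, but since nothing reads from it during the load, a single flush at the end may be enough.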

Related

How to run complex queries in Tarantool

I've always worked with relational DBs and recently decided to migrate a performance-critical service from SQL Server to Tarantool, hoping to take advantage of its fast in-memory search and processing. I have a few questions while planning the migration.
I've got a table with about one million records containing pricing information which means I'm dealing mostly with numbers and uuids. First, I need to run a select containing multiple conditions to get a subset of the data, like
SELECT * FROM rates WHERE SupplierId = @SupplierId AND ProductId = @ProductId AND (LocalDistributionZoneId = @LocalDistributionZoneId OR LocalDistributionZoneId IS NULL)
Q1: What is the strategy for running such a query in Lua? Do I create an index for each field in the predicate, or can I get by with one composite secondary index?
Q2: Will it be more convenient to run such a query in SQL (box.sql.execute) rather than in pure Lua? Will it be considerably slower than running the same query in pure Lua?
Q3: If I use SQL, is it possible to review the execution plan to make sure that the query really uses the indexes I've defined in the space?
OK, after I get the results from the first query, I need to analyse the data and then, based on the results of that analysis, run one more query on the dataset returned by the first query.
Q4: Can Tarantool help me deal with the intermediate dataset? More specifically, can I somehow run more queries against the intermediate subset of tuples while leveraging the indexes created in the space? Or would I need to implement an alternative strategy, such as re-adding the interim results to a temporary space with predefined indexes and then doing another select, or implementing the further search myself?
Thank you!
A1: Don't. Use SQL; it's faster because it doesn't create garbage-collected objects for intermediate execution results.
A2: Yes, please use our SQL features for that.
A3: Use the EXPLAIN statement.
A4: I don't know exactly what you mean by "help". You could try whatever strategy works best: create a more complex query, save the original query in a view and use it in the follow-up query, or create a temporary table and work with it. To say more, we would need to check whether the execution plan Tarantool chooses is good enough or whether you have to optimize it manually.

Is it possible to execute read only cypher queries from java?

I'd like to know just what the title says.
The reason I'd want this is to permit constrained read-only cypher queries to be executed; the data results would later be interpreted and serialized by a separate API layer.
I've seen code that makes basic assumptions in an attempt to mimic this behavior, e.g. the code might filter out any Cypher query that contains certain special words associated with write query structures (merge, create, delete, set, and so on).
This approach tends to be limited and naive though; if it very simply looks for those tokens, it would prevent a query like MATCH n WHERE n.label =~ '.*create.*' RETURN n even though it's a read-only query.
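The false positive described above is easy to reproduce. The sketch below is a hypothetical keyword filter of the kind being criticised (NaiveWriteFilter and looksLikeWrite are made-up names, and the keyword list is the one given in the question); it is shown only to illustrate the limitation, not as a recommended approach.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical naive filter: flags any query containing a write keyword.
class NaiveWriteFilter {
    static final List<String> WRITE_WORDS =
            Arrays.asList("merge", "create", "delete", "set");

    // Plain substring match, blind to string literals and identifiers.
    static boolean looksLikeWrite(String query) {
        String lower = query.toLowerCase(Locale.ROOT);
        for (String word : WRITE_WORDS) {
            if (lower.contains(word)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String readOnly = "MATCH n WHERE n.label =~ '.*create.*' RETURN n";
        // Prints true: the read-only query is rejected because "create"
        // appears inside the regular-expression literal.
        System.out.println(looksLikeWrite(readOnly));
    }
}
```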
I'd really prefer not to do a full parse on a candidate query and then descend through the AST trying to figure out whether something is read-only or not (although I would gladly accept an answer that shows how to do this easily in Java).
EDIT - I'm aware it's possible to start the entire database in read-only mode via the configuration property read_only=true, but this would be undesirable; no other aspect of the Java API would be able to change the database.
EDIT 2 - I found another possible strategy, but I'm not sure of its advisability. Comments welcome on this, and potential downsides:
try (Transaction ignore = graphDb.beginTx()) {
    ExecutionResult result = executionEngine.execute(query);
    // Do nifty stuff with result, then...
    // Force transaction to fail.
    ignore.failure();
}
The idea here is that if queries happen within transactions and the transaction is always force-failed, then nothing can ever be written to the DB no matter what the result.
Read-only Cypher is not (yet) directly supported. However, I can think of two workarounds:
1) assuming you're running a Neo4j enterprise cluster: you can set read_only=true on one instance. That instance is then used for the read only queries where the other cluster instances are used for r/w. A load balancer in front of the cluster can be set up to send the requests to the right instance.
2) Use a TransactionEventHandler that vetoes a transaction if its TransactionData contains write operations. Just for fun I spent a few minutes implementing that, see https://github.com/sarmbruster/read-only-cypher - feedback is appreciated.

CREATE UNIQUE relationship is taking a large amount of time

START names = node(*),
target=node:node_auto_index(target_name="TARGET_1")
MATCH names
WHERE NOT names-[:contains]->()
AND HAS (names.age)
AND (names.qualification =~ ".*(?i)B.TECH.*$"
OR names.qualification =~ ".*(?i)B.E.*$")
CREATE UNIQUE (names)-[r:contains{type:"declared"}]->(target)
RETURN names.name,names,names.qualification
I have nearly 1,80,000 (180,000) name nodes, and I have run the above query more than 100 times, changing the target each time, to create the unique relationships. It is taking too much time. How can I resolve this?
I build the query in Java and run it in a loop. I am using Neo4j 2.0.0.5 and Java 1.7.
I edited your cypher query because I think I understand it, but I can barely read the rest of your question. If you edit it with white spaces and punctuation it might be easier to understand what you are trying to do. Until then, here are some thoughts about your query being slow.
You bind all the nodes in the graph, that's typically pretty slow.
You bind all the nodes in the graph twice. First you bind universally in your start clause: names=node(*), and then you bind universally in your match clause: MATCH names, and only then you limit your pattern. I don't quite know what the Cypher engine makes of this (possibly it gets a migraine and goes off to make a pot of coffee). It's unnecessary, you can at least drop the names=node(*) from your start clause. Or drop the match clause, I suppose that could work too, since you don't really do anything there, and you will still need a start clause for as long as you use legacy indexing.
You are using Neo4j 2.x, but you use legacy indexing instead of labels, at least in this query. Without knowing your data and model it's hard to know what the difference would be for performance, but it would certainly make it much easier to write (and read) your queries. So, that's a different kind of slow. It's likely that if you had labels and label indices, the query performance would improve.
So, first try removing one of the universal bindings of nodes, then use the 2.x schema tools to structure your data. You should be able to write queries like
MATCH (target:Target)
WHERE target.target_name="TARGET_1"
WITH target
MATCH (names:Name)
WHERE NOT names-[:contains]->()
AND HAS (names.age)
AND (names.qualification =~ ".*(?i)B.TECH.*$"
OR names.qualification =~ ".*(?i)B.E.*$")
CREATE UNIQUE (names)-[r:contains{type:"declared"}]->(target)
RETURN names.name,names,names.qualification
I have no idea whether such a query would be fast on your data, however. If you put the "Name" label on all your nodes, then MATCH (names:Name) will still bind all nodes in the database, so it will probably still be slow.
P.S. The relationships you create have a TYPE called contains, and you give them a property called type with value declared. Maybe you have a good reason, but that's potentially very confusing.
Edit:
Reading through your question and my answer again I no longer think that I understand even your cypher query. (Why are you returning both the bound nodes and properties of those nodes?) Please consider posting sample data on console.neo4j.org and explain in more detail what your model looks like and what you are trying to do. Let me know if my answer meets your question at all or I'll consider removing it.

Representing (and incrementing) relationship strength in Neo4j

I would like to represent the changing strength of relationships between nodes in a Neo4j graph.
For a static graph, this is easily done by setting a "strength" property on the relationship:
A --knows--> B
      |
   strength
      |
      3
However, for a graph that needs updating over time there is a problem: the property cannot be incremented atomically via the REST interface, because a read-before-write is required. Incrementing (rather than merely overwriting) is necessary if the graph is being updated in response to incoming streamed data.
I would need to either ensure that only one REST client reads and writes at once (external synchronization), or stick to only the embedded API so I can use the built-in transactions. This may be workable but seems awkward.
One other solution might be to record multiple relationships, without any properties, so that the "strength" is actually the count of relationships, i.e.
A knows B
A knows B
A knows B
means a relationship of strength 3.
Disadvantage: only integer strengths can be recorded
Advantage: no read-before-write is required
Disadvantage: (probably) more storage required
Disadvantage: (probably) much slower to extract the value since multiple relationships must be extracted and counted
Has anyone tried this approach, and is it likely to run into performance issues, particularly when reading?
Is there a better way to model this?
Nice idea.
To reduce storage and multi-reads those relationships could be aggregated to one in a batch job which runs transactionally.
Each rel could also carry an individual weight value, whose aggregated value is used as weight. It doesn't have to be integer based and could also be negative to represent decrements.
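The combination of per-relationship weights and a periodic aggregation job can be sketched with a small in-memory model. This is only an illustration under assumptions: WeightedEdges is a hypothetical class, the list stands in for the parallel "knows" relationships between one pair of nodes, and in a real graph compact() would run inside a single transaction.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory model of parallel weighted relationships.
class WeightedEdges {
    // One entry per recorded relationship; the value is its weight.
    private final List<Double> weights = new ArrayList<>();

    // Append-only write: the current strength is never read first.
    void record(double delta) {
        weights.add(delta);
    }

    // Strength is the sum of all recorded weights (may be fractional or negative).
    double strength() {
        double total = 0.0;
        for (double w : weights) {
            total += w;
        }
        return total;
    }

    // Batch job: collapse all relationships into one that carries the total.
    void compact() {
        double total = strength();
        weights.clear();
        weights.add(total);
    }

    int relationshipCount() {
        return weights.size();
    }

    public static void main(String[] args) {
        WeightedEdges knows = new WeightedEdges();
        knows.record(1.5);
        knows.record(2.0);
        knows.record(-0.5); // a decrement
        System.out.println(knows.strength());          // 3.0
        knows.compact();
        System.out.println(knows.relationshipCount()); // 1
        System.out.println(knows.strength());          // 3.0
    }
}
```

Writers only ever append, so no read-before-write is needed, while compaction keeps the number of relationships to count at read time small.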
You could also write a small server extension for updating a weight value on a single relationship transactionally. It would probably even make sense for the REST API (in addition to the "set single value" operation, have a "modify single value" operation):
PUT http://localhost:7474/db/data/node/15/properties/mod/foo
The body contains the delta value (1.5, -10). Another idea would be to replace the mod keyword with the actual operation:
PUT http://localhost:7474/db/data/node/15/properties/add/foo
PUT http://localhost:7474/db/data/node/15/properties/or/foo
PUT http://localhost:7474/db/data/node/15/properties/concat/foo
What would "increment" mean in a non-integer case?
Hmm a bit of a different approach, but you could consider using a queuing system. I'm using the Neo4j REST interface as well and am looking into storing a constantly changing relationship strength. The project is in Rails and using Resque. Whenever an update to the Neo4j database is required it's thrown in a Resque queue to be completed by a worker. I only have one worker working on the Neo4j Resque queue so it never tries to perform more than one Neo4j update at once.
This has the added benefit of not making the user wait for the neo4j updates when they perform an action that triggers an update. However, it is only a viable solution if you don't need to use/display the Neo4j updates instantly (though depending on the speed of your worker and the size of your queue, it should only take a few seconds).
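The answer above uses Resque in a Rails project; the same single-worker idea can be sketched in Java with a single-threaded executor. Everything here is an assumption made for illustration: SerializedStrengthUpdates is a hypothetical class, and the HashMap stands in for the read-modify-write that would really go to Neo4j over REST.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: all updates are funnelled through one worker thread.
class SerializedStrengthUpdates {
    // Stand-in for the graph; only ever touched by the worker thread.
    private final Map<String, Integer> strengths = new HashMap<>();
    // A single worker, so queued updates run strictly one at a time.
    private final ExecutorService worker = Executors.newSingleThreadExecutor();

    // Callers enqueue and return at once; the increment runs on the worker.
    void enqueueIncrement(String relationship, int delta) {
        worker.execute(() -> strengths.merge(relationship, delta, Integer::sum));
    }

    // Reads through the same queue, so it sees every update enqueued before it.
    int read(String relationship) throws Exception {
        return worker.submit(() -> strengths.getOrDefault(relationship, 0)).get();
    }

    void shutdown() {
        worker.shutdown();
    }

    // Several producers enqueue concurrently; no increment is lost.
    static int demo(int producers, int perProducer) {
        SerializedStrengthUpdates queue = new SerializedStrengthUpdates();
        Thread[] threads = new Thread[producers];
        for (int i = 0; i < producers; i++) {
            threads[i] = new Thread(() -> {
                for (int j = 0; j < perProducer; j++) {
                    queue.enqueueIncrement("A-knows-B", 1);
                }
            });
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
            return queue.read("A-knows-B");
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            queue.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(demo(4, 250)); // 1000
    }
}
```

As in the Resque setup, callers do not wait for the update itself, so a value read outside the queue may lag slightly behind.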
Depends a bit on what read and write load you are targeting. How big is the total graph going to be?

Db4o select performance

I have 7000 objects in my Db4o database.
When I retrieve all of the objects, it's almost instant.
When I add a where constraint, e.g. Name = "Chris", it takes 6-8 seconds.
What's going on?
Also, I've seen a couple of comments about using Lucene for search-type queries. Does anyone have any good links for this?
There are two things to check.
Have you added the 'Db4objects.Db4o.NativeQueries'-assembly? Without this assembly, a native query cannot be optimized.
Have you set an index on the field that represents the Name? An index should make the query a lot faster.
Index:
cfg.Common.ObjectClass(typeof(YourObject)).ObjectField("fieldName").Indexed(true);
This question is kinda old, but perhaps this is of any use:
When using native queries, try to set a breakpoint on the lambda expression. If the breakpoint is actually invoked, you're in trouble because the optimization failed. To invoke the lambda, each of the objects will have to be instantiated which is very costly.
If optimization worked, the lambda expression tree will be analyzed and the actual code won't be needed, thus breakpoints won't be triggered.
Also note that setting indexes on fields must be done before opening the connection.
Last, I have a test case with simple objects. When I started without query optimization and indexing (and worse, using a server that was forced to use the GenericReflector because I failed to provide the model .dlls), it took 600 s for a three-criteria query on about 100,000 objects. Now it takes 6 s for the same query on 2.5M objects, so there really is a HUGE gain.
