DSE Graph - Do vertices with meta-properties require more than a single query to read?

The 5.1.3 doc says:
Vertices without multi-properties fetch all properties in a single query, rather than requesting properties one at a time. Using multi-properties on vertices is not recommended.
Is that true for meta-properties as well? In other words, are multiple queries needed to fetch vertices that have meta-properties, or can a single query suffice?

Meta-properties are fine.
The issue with multi-properties is that we can't reliably tell whether an element is small or large, so fetching everything in one query could be expensive.
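To make the size concern concrete, here is a schema sketch expressed through the DSE Java driver. The builder calls and property names are illustrative assumptions, not taken from the question: a meta-property annotates a single property value, so its size stays bounded, while a multi-property may hold arbitrarily many values per vertex.
import com.datastax.driver.dse.DseCluster;
import com.datastax.driver.dse.DseSession;

public class SchemaSketch {
    public static void main(String[] args) {
        DseCluster cluster = DseCluster.builder().addContactPoint("127.0.0.1").build();
        try (DseSession session = cluster.connect()) {
            // Single-cardinality property carrying a meta-property: one value, one annotation.
            session.executeGraph("schema.propertyKey('since').Int().single().create()");
            session.executeGraph("schema.propertyKey('badge').Text().single().properties('since').create()");
            // Multi-property: arbitrarily many values per vertex, so element size is unbounded.
            session.executeGraph("schema.propertyKey('nickname').Text().multiple().create()");
        } finally {
            cluster.close();
        }
    }
}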

Related

Neo4J Batch Inserter is slow with big ids

I'm working on an RDF file importer, but I have a problem: my data files contain duplicate nodes. For this reason I use big ids to insert the nodes with the batch inserter, but the process is slow. I have seen this post where Michael recommends using an index, but the process remains slow.
Another option would be to merge the duplicate nodes, but I think there is no automatic option for that in Neo4j. Am I wrong?
Could anyone help me? :)
Thanks!
There is no duplicate handling in the CSV batch importer yet (it's planned for the next version), as it is non-trivial and memory-expensive. Best to de-duplicate on your side.
Don't use externally supplied ids as node ids; they can get large from the beginning, and that just doesn't work. Use an efficient map (like Trove) to keep the mapping between your key and the node id; see the sketch below.
I usually use two passes and an array: sort the array so the array index becomes the node id, then do another pass that nulls out duplicate entries.
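A minimal sketch of that mapping idea, with a plain HashMap standing in for Trove and illustrative class, label, and property names (Neo4j 2.x BatchInserter API):
import java.util.HashMap;
import java.util.Map;
import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

public class DedupImport {
    public static void main(String[] args) {
        BatchInserter inserter = BatchInserters.inserter("target/graph.db");
        // Local map from external RDF key to internal node id; Trove's
        // TObjectLongHashMap would be more memory-efficient at scale.
        Map<String, Long> nodeIds = new HashMap<>();
        try {
            long from = getOrCreate(inserter, nodeIds, "chembl_activity:CHEMBL_ACT_102540");
            long to = getOrCreate(inserter, nodeIds, "chembl_document:CHEMBL1129248");
            inserter.createRelationship(from, to,
                    DynamicRelationshipType.withName("hasDocument"), null);
        } finally {
            inserter.shutdown();
        }
    }

    // Create each node once; later lines that reference the same key reuse its id.
    static long getOrCreate(BatchInserter inserter, Map<String, Long> ids, String key) {
        Long id = ids.get(key);
        if (id == null) {
            Map<String, Object> props = new HashMap<>();
            props.put("name", key);
            id = inserter.createNode(props, DynamicLabel.label("Resource"));
            ids.put(key, id);
        }
        return id;
    }
}
This way duplicate keys cost only a map lookup, and the large external identifiers never become Neo4j node ids.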
Perfect :) The data would have the following structure:
chembl_activity:CHEMBL_ACT_102540 bao:BAO_0000208 bao:BAO_0002146 .
chembl_document:CHEMBL1129248 cco:hasActivity chembl_activity:CHEMBL_ACT_102551 .
chembl_activity:CHEMBL_ACT_102540 cco:hasDocument chembl_document:CHEMBL1129248 .
Each line corresponds to a relationship between two nodes, and you can see that the node chembl_activity:CHEMBL_ACT_102540 is duplicated.
I wanted to use the hashcode of the node name as the id, but that hashcode is a very large number, and it slows the process down. So instead I can check the map of ids and create only the relationship, not the nodes.
Thanks for all! :)

What's the longest request in Neo4J Cypher?

I have a very long Cypher request in my app (running on Node.js and Neo4j 2.0.1) which creates about 16 nodes and 307 relationships between them at once. The request text is about 50K characters long.
The high number of relationships is determined by the data model, which I will probably change later. Nevertheless, if I decide to keep everything as it is, two questions:
1) What is the maximum size of a single Cypher request I can send to Neo4j?
2) What is the best strategy for dealing with a request that is too long? Split it into smaller ones and batch them in a transaction? I would rather not, because then I lose the consistency I got from combining MERGE and CREATE: the single request automatically recognized nodes that did not exist yet, created them, and then let me create relationships between them using the identifiers I had already obtained through MERGE.
Thank you!
I usually recommend using smaller statements, so that the query-plan cache can kick in and execute your query immediately without compiling. For this you also need parameters, e.g. {context} or {user}.
I think a statement size of up to 10-15 elements is easy to handle.
You can still execute all of them in a single transaction with the transactional Cypher endpoint, which allows batching of statements along with their parameters.
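As a rough sketch of that batching (assuming a default local server; the statements and parameters are illustrative): several small statements with identical query text share one compiled plan, so the plan cache is hit on every execution after the first.
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

public class TxBatch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:7474/db/data/transaction/commit");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("POST");
        conn.setRequestProperty("Content-Type", "application/json");
        conn.setDoOutput(true);
        // Several parameterized statements, committed in a single transaction.
        String payload = "{ \"statements\": ["
                + "{ \"statement\": \"MERGE (u:User {name: {user}}) RETURN id(u)\","
                + "  \"parameters\": { \"user\": \"alice\" } },"
                + "{ \"statement\": \"MERGE (u:User {name: {user}}) RETURN id(u)\","
                + "  \"parameters\": { \"user\": \"bob\" } }"
                + "] }";
        try (OutputStream out = conn.getOutputStream()) {
            out.write(payload.getBytes("UTF-8"));
        }
        System.out.println("HTTP " + conn.getResponseCode());
    }
}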

Fastest way to load neo4j DB w/Cypher - how to integrate new subgraphs?

I'm loading a Neo4j database using Cypher commands piped directly into the neo4j-shell. Some experiments suggest that subgraph batches of about 1000 lines give the optimal throughput (about 3.2ms/line, 300 lines/sec (slow!), Neo4j 2.0.1). I use MATCH statements to bind existing nodes to the loading subgraph. Here's a chopped example:
begin
...
MATCH (domain75ea8a4da9d65189999d895f536acfa5:SubDomain { shorturl: "threeboysandanoldlady.blogspot.com" })
MATCH (domainf47c8afacb0346a5d7c4b8b0e968bb74:SubDomain { shorturl: "myweeview.com" })
MATCH (domainf431704fab917205a54b2477d00a3511:SubDomain { shorturl: "www.computershopper.com" })
CREATE
(article1641203:Article { id: "1641203", url: "http://www.coolsocial.net/sites/www/blackhawknetwork.com.html", type: 4, timestamp: 1342549270, datetime: "2012-07-17 18:21:10"}),
(article1641203)-[:PUBLISHED_IN]->(domaina9b3ed6f4bc801731351b913dfc3f35a),(author104675)-[:WROTE]->(article1641203),
....
commit
Using this (ridiculously slow) method, it takes several hours to load 200K nodes (~370K relationships), and at that point the loading slows down even more. I presume the asymptotic slowdown is due to the overhead of the MATCH statements, which make up half of the subgraph load statements by the time the graph hits 200K nodes. There's got to be a better way of doing this; it just doesn't scale.
I'm going to try rewriting the statements with parameters (refs: "What is the most efficient way to insert nodes into a neo4j database using cypher" and http://jexp.de/blog/2013/05/on-importing-data-in-neo4j-blog-series/). I expect that to help, but it seems that I will still have problems making the subgraph connections. Would using MERGE or CREATE UNIQUE instead of the MATCH statements be the way to go? There must be best-practice ways to do this that I'm missing. Any other speed-up ideas?
many thanks
Use MERGE, and do smaller transactions: I've found the best results with batches of 50-100 when doing index lookups. Bigger batches are better when doing CREATE only, without MATCH. I also recommend using a driver to send your commands over the transactional API (with parameters) instead of via neo4j-shell; it tends to be a fair bit faster. A sketch of such a MERGE rewrite follows below.
Alternatively (this might not apply to all use cases), keep a local "index" of the node ids you've created. For only 200k items this fits easily in a normal map/dict of string->long. It saves you from taxing the index on the db side: you do only node-id-based lookups and CREATE statements, and create the indexes later.
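As a sketch of the MERGE rewrite (property names taken from the example above, the rest assumed), each line of the bulk load becomes one parameterized statement that can be sent in batches of 50-100 through the transactional endpoint:
import java.util.HashMap;
import java.util.Map;

public class MergeRewrite {
    // MERGE resolves or creates the SubDomain in one step, so no separate
    // MATCH pass is needed; an index on :SubDomain(shorturl) keeps it fast.
    static final String STATEMENT =
              "MERGE (d:SubDomain {shorturl: {shorturl}}) "
            + "CREATE (a:Article {id: {id}, url: {url}}) "
            + "CREATE (a)-[:PUBLISHED_IN]->(d)";

    static Map<String, Object> params(String shorturl, String id, String url) {
        Map<String, Object> p = new HashMap<>();
        p.put("shorturl", shorturl);
        p.put("id", id);
        p.put("url", url);
        return p;
    }
}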
The load2neo plugin worked well for me. Installation was fast and painless, and it has a very Cypher-like command structure that easily supports uniqueness requirements. It works with Neo4j 2.0 labels.
load2neo install + curl usage example:
http://nigelsmall.com/load2neo
load2neo Geoff syntax:
http://nigelsmall.com/geoff
It is much faster (>>10x) than using Cypher via neo4j-shell.
I wasn't able to get Cypher parameters working through neo4j-shell, despite trying everything I could find via internet search.

Uniqueness in BatchInserter of Neo4J

I am using a "BatchInserter" to build a graph (in a single thread). I want to make sure nodes (and possibly relationships) are unique. My current solution is to check whether the node exists in the following manner:
String name = (String) nodeProperties.get(IndexKeys.CATEGORY_KEY);
// Query the index once and reuse the hits instead of hitting Lucene twice.
IndexHits<Long> hits = index.get(IndexKeys.CATEGORY_KEY, name);
if (hits.size() > 0)
    return hits.getSingle();
long nodeId = inserter.createNode(nodeProperties, categoryLabel);
index.add(nodeId, nodeProperties);
index.flush();
return nodeId;
It seems to be working fine, but as you can see it is IO-expensive: it flushes on every new addition, which I believe amounts to a Lucene "commit". This is slowing down my code considerably.
I am aware of put-if-absent and UniqueFactory. As documented:
By using put-if-absent functionality, entity uniqueness can be guaranteed using an index. Here the index acts as the lock and will only lock the smallest part needed to guarantee uniqueness across threads and transactions. To get the more high-level get-or-create functionality, make use of UniqueFactory.
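For reference, that transactional get-or-create pattern looks roughly like this (a sketch against the Neo4j 2.x embedded API; the index and property names are illustrative):
import java.util.Map;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.UniqueFactory;

public class GetOrCreate {
    static Node getOrCreateCategory(GraphDatabaseService graphDb, final String name) {
        try (Transaction tx = graphDb.beginTx()) {
            // The index acts as the lock; initialize() runs only on creation.
            UniqueFactory<Node> factory = new UniqueFactory.UniqueNodeFactory(graphDb, "categories") {
                @Override
                protected void initialize(Node created, Map<String, Object> properties) {
                    created.setProperty("name", properties.get("name"));
                }
            };
            Node node = factory.getOrCreate("name", name);
            tx.success();
            return node;
        }
    }
}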
However, these are for transaction-based interactions with the graph. What I would like is to ensure uniqueness of nodes (and possibly relationships) with batch-insertion semantics, in a way that is faster than my current setup.
Any pointers would be much appreciated.
Thank you
You should investigate the MERGE keyword in Cypher. I believe it will let you exploit your auto-indexes without requiring you to use them yourself. More broadly, you might want to see whether you can formulate your bulk load in a way that is conducive to piping large volumes of Cypher queries through the neo4j-shell.
Finally, as general pointers and background, you should check out this information on bulk loading.
When I encountered this problem, I just decided to go rogue and manage the index values on my own. Can't you do the same? I mean, ensure uniqueness before you do the insertions?

Querying in Gremlin using multiple indices

I am trying to optimize requests in Gremlin on a Neo4J graph.
Here is the short version of the basic request I am using:
g.idx("myIndex")[[myId:5]].outE("HAS_PRODUCT").filter{it.shop_id:5}.inV
So I looked into indexing and created an index on "HAS_PRODUCT"-typed edges with the key 'shop_id'.
Using the same request, I don't see a big difference.
My questions are:
Is my new index used when I query with filter{it.shop_id == 5}?
If not, how can I use this new index in my request?
More generally, if idx() is the graph method for using an index, is there a pipe method for that?
Thanks!
The short answer is that Gremlin won't make use of the secondary index when using Neo4j, but please consider reading the longer answer below in relation to TinkerPop, Gremlin and its philosophy.
The longer answer is... indices are not being used for your shop_id. When you call outE, you effectively iterate all the edges to find those with shop_id == 5. To make use of indices in Gremlin you should use a vertex query. So, rewriting your code a bit (to also use key indices) gives:
g.V('myIndex',5).outE('HAS_PRODUCT').has('shop_id',5).inV
With Blueprints implementations that support vertex-centric indices, the use of has will utilize that index automatically. Unfortunately, Neo4j is not one of those databases yet. Blueprints implementations that do support it include Titan (see its vertex-centric indices) and OrientDB (as part of the yet-unreleased Blueprints 2.4.0; I believe they will ship a partial implementation in that release).
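Expressed through the Blueprints/Gremlin 2.x Java API, the rewritten traversal looks roughly like this (a sketch; the store path is an assumption, and it presumes a key index on myIndex as in the answer above):
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.neo4j.Neo4jGraph;
import com.tinkerpop.gremlin.java.GremlinPipeline;

public class ShopQuery {
    public static void main(String[] args) {
        Neo4jGraph g = new Neo4jGraph("/tmp/graph.db");
        try {
            // Key-index lookup, equivalent to g.V('myIndex', 5) in the Groovy answer.
            Iterable<Vertex> start = g.getVertices("myIndex", 5);
            new GremlinPipeline<Vertex, Vertex>(start)
                    .outE("HAS_PRODUCT")
                    .has("shop_id", 5) // iterated filter on Neo4j; index-backed on Titan/OrientDB
                    .inV()
                    .toList();
        } finally {
            g.shutdown();
        }
    }
}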
