Neo4j BatchInserter - create relationship using Node properties

I'm using BatchInserter to initialise my Neo4j database - the data is coming from XML files on my local filesystem.
Suppose one set of files contains node information / properties, and another set has relationship information. I wanted to do two passes: create all the nodes, then set about creating the relationships.
However, the createRelationship method accepts a long id for the nodes, which I don't have in my relationship XML - all of my nodes have a GUID as a property called ID which I use to reference them.
Does using the BatchInserter mean the data hasn't been indexed yet, so I won't be able to create relationships between nodes based on some other property?

I usually just keep the node-attribute to id mapping in a cache in memory in an efficient collection implementation like Trove or so.
Then for the relationships you can look up the node-id by attribute.
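A minimal sketch of that approach, assuming two passes over the XML files and a plain HashMap standing in for a Trove map (readNodes, readRelationships, and RelRecord are placeholders for your own XML parsing, not Neo4j API):
import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserters;

BatchInserter inserter = BatchInserters.inserter("data/folder");
// GUID -> batch node id; a Trove TObjectLongHashMap is more
// memory-efficient, but a HashMap shows the idea.
Map<String, Long> guidToNodeId = new HashMap<>();

// Pass 1: create nodes and remember the id the inserter assigned.
for (Map<String, Object> properties : readNodes()) {
    long nodeId = inserter.createNode(properties);
    guidToNodeId.put((String) properties.get("ID"), nodeId);
}

// Pass 2: resolve both GUIDs in memory and create the relationship.
for (RelRecord rel : readRelationships()) {
    inserter.createRelationship(
            guidToNodeId.get(rel.fromGuid),
            guidToNodeId.get(rel.toGuid),
            DynamicRelationshipType.withName(rel.type),
            null); // no relationship properties
}
inserter.shutdown();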

I found I was able to add nodes to the index as I go.
Creating index:
BatchInserter inserter = BatchInserters.inserter( "data/folder" );
BatchInserterIndexProvider indexProvider = new LuceneBatchInserterIndexProvider( inserter );
BatchInserterIndex index = indexProvider.nodeIndex("myindex", MapUtil.stringMap( "type", "exact" ) );
Then each time I insert a node, add it to the index as well:
Label label = DynamicLabel.label("person");
Map<String, Object> properties = new HashMap<>();
properties.put("ID", <some-value-here>);
long newNode = inserter.createNode(properties, label);
index.add(newNode, properties);
index.flush();
Which I can query as I like:
IndexHits<Long> hits = index.get("ID", <some-value-here>);
if (hits.size() > 0) {
    long existing = hits.getSingle();
}
I have no idea whether this is any good. I guess calling flush on the index often is a bad idea, but it seems to work for me.
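One thing the snippets above leave out is the teardown; a minimal sketch, assuming the same inserter, indexProvider, and index as above:
// Flush once before you start querying, so pending additions become
// visible to index.get(), rather than flushing after every single add.
index.flush();
// Shut down in this order when the import is complete; the batch
// inserter only leaves a consistent store behind on a clean shutdown.
indexProvider.shutdown();
inserter.shutdown();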

Related

Cypher 1.9.9, START by both relationship and node index

My Neo4j 1.9.9 entities are stored using Spring Data Neo4j. However, because many of the derived queries from repository methods are wrong, I've been forced to use Cypher directly.
Basically, I have two classes:
@NodeEntity
public class RecommenderMashup {
    @Indexed(indexType = IndexType.SIMPLE, indexName = "recommenderMashupIds")
    private String mashupId;
}

@RelationshipEntity(type = "MASHUP_TO_MASHUP_SIMILARITY")
public class MashupToMashupSimilarity {
    @StartNode
    private RecommenderMashup mashupFrom;
    @EndNode
    private RecommenderMashup mashupTo;
}
In addition to the indexes directly provided, as you know, Spring Data Neo4j adds two other indexes: __types__ for nodes and __rel_types__ for relationships; both of them have className as their key.
So, I've tried the query below to get all the MashupToMashupSimilarity objects related to a specific node
START `mashupFrom`=node:`recommenderMashupIds`(`mashupId`='5367575248633856'),
`mashupTo`=node:__types__(className="package.RecommenderMashup"),
`mashupToMashupSimilarity`=rel:__rel_types__(className="package.MashupToMashupSimilarity")
MATCH `mashupFrom`-[:`mashupToMashupSimilarity`]->`mashupTo`
RETURN `mashupToMashupSimilarity`;
However, I always got empty results. I suspect that this is due to the fact that the START clause contains both nodes and relationships. Is this possible? Otherwise, what could be the problem here?
Additional info
The suspicion comes from the fact that
START `mashupToMashupSimilarity`=rel:__rel_types__(className='package.MashupToMashupSimilarity')
RETURN `mashupToMashupSimilarity`;
and
START `mashup`=node:__types__(className="package.RecommenderMashup")
RETURN `mashup`;
and other similar queries always return the right results.
The only working alternative at this point is
START `mashupFrom`=node:`recommenderMashupIds`(`mashupId`='6006582764634112'),
`mashupTo`=node:__types__(className="package.RecommenderMashup")
MATCH `mashupFrom`-[`similarity`:MASHUP_TO_MASHUP_SIMILARITY]->`mashupTo`
RETURN `similarity`;
but I don't know how it works in terms of performance (the indexes should be faster). Also, I'm curious what I've been doing wrong.
Did you try to run your queries in the neo4j-browser or shell? Did they work there?
This query is also wrong,
START `mashupFrom`=node:`recommenderMashupIds`(`mashupId`='5367575248633856'),
`mashupTo`=node:__types__(className="package.RecommenderMashup"),
`mashupToMashupSimilarity`=rel:__rel_types__(className="package.MashupToMashupSimilarity")
MATCH `mashupFrom`-[:`mashupToMashupSimilarity`]->`mashupTo`
RETURN `mashupToMashupSimilarity`;
you use mashupToMashupSimilarity as the identifier for the relationship,
but then you wrongly use it as the relationship-type:
-[:mashupToMashupSimilarity]->
it should be: -[mashupToMashupSimilarity]->
but of course it's better to skip the rel-index check and use -[similarity:MASHUP_TO_MASHUP_SIMILARITY]->
And you can just leave off the relationship-index lookup, which doesn't make sense at all, as you already filter with the relationship-type.
Update: Don't use index lookups for type check
START mashupFrom=node:recommenderMashupIds(mashupId='5367575248633856')
MATCH (mashupFrom)-[mashupToMashupSimilarity:MASHUP_TO_MASHUP_SIMILARITY]->(mashupTo)
WHERE mashupTo.__type__ = 'package.RecommenderMashup'
RETURN mashupToMashupSimilarity;
As the relationship-type is already restricting, I think you don't even need the type-check on the target node.

Using Merge in BatchInserter?

I am using the BatchInserter in order to create some nodes and relationships; however, I have unique nodes, and I want to make multiple relationships between them.
I can easily do that with Cypher and, at the same time, with the Java Core API:
ResourceIterator<Node> existedNodes = graphDBService
        .findNodesByLabelAndProperty(DynamicLabel.label("BaseProduct"),
                "code", source.getBaseProduct().getCode()).iterator();
if (!existedNodes.hasNext()) {
    // TO DO
} else {
    // create relationship with the retrieved node
}
and in Cypher I can simply use MERGE.
Is there any possible way to do the same with the BatchInserter?
No, it is not possible with the batch inserter, as those APIs are not available there.
That's why I usually keep in-memory maps with the information I need to look up.
See this blog post for a Groovy script:
http://jexp.de/blog/2014/10/flexible-neo4j-batch-import-with-groovy/
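A minimal sketch of that idea applied to this question, emulating MERGE semantics with an in-memory map (getOrCreateProduct and codeToNodeId are illustrative names, not BatchInserter API):
import java.util.HashMap;
import java.util.Map;

import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.unsafe.batchinsert.BatchInserter;

// code -> node id map, maintained for the whole import run.
Map<String, Long> codeToNodeId = new HashMap<>();

// Hypothetical helper: return the existing node for a code, or create it.
long getOrCreateProduct(BatchInserter inserter, String code) {
    Long existing = codeToNodeId.get(code);
    if (existing != null) {
        return existing; // the "match" half of MERGE
    }
    long nodeId = inserter.createNode(
            MapUtil.map("code", code), DynamicLabel.label("BaseProduct"));
    codeToNodeId.put(code, nodeId);
    return nodeId; // the "create" half of MERGE
}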

Update relationship / payload with the Neo4jClient

I am new to Neo4j and Neo4jClient. I am trying to update an existing relationship. Here is how I created the relationship.
var item2RefAddedBefore = _graphClient.CreateRelationship(
    (NodeReference<Item>)item2Ref,
    new AddedBefore(item1Ref, new Payload() { Frequency = 1 }));
For this particular use case, I would like to update the Payload whenever the Nodes and relationship already exist. I am using Cypher mostly with the Neo4jClient.
Appreciate any help!
Use this IGraphClient signature:
void Update<TRelationshipData>(RelationshipReference<TRelationshipData> relationshipReference, Action<TRelationshipData> updateCallback)
where TRelationshipData : class, new();
Like this:
graphClient.Update(
    (RelationshipReference<Payload>)item2RefAddedBefore,
    p => { p.Foo = "Bar"; });
Update: The syntax is a little awkward right now, where CreateRelationship only returns a RelationshipReference instead of a RelationshipReference<TData> but Update requires the latter, so you need to explicitly cast it. To be honest, we probably won't fix this any time soon as all of the investment for both Neo4j and Neo4jClient is going towards doing mutations via Cypher instead.

Neo4j indexes and legacy data

I have a legacy dataset (ENRON data represented as GraphML) that I would like to query. In a comment on a related question, @StefanArmbruster suggests that I use Cypher to query the database. My query use case is simple: given a message id (a property of the Message node), retrieve the node that has that id, and also retrieve the sender and recipient nodes of that message.
It seems that to do this in Cypher, I first have to create an index of the nodes. Is there a way to do this automatically when the data is loaded from the graphML file? (I had used Gremlin to load the data and create the database.)
I also have an external Lucene index of the data (I need it for other purposes). Does it make sense to have two indexes? I could, for example, index the Neo4J node ids into my external index, and then query the graph based on those ids. My concern is about the persistence of these ids. (By analogy, Lucene document ids should not be treated as persistent.)
So, should I:
Index the Neo4j graph internally to query on message ids using Cypher? (If so, what is the best way to do that: regenerate the database with some suitable incantation to get the index built? Build the index on the already-existing db?)
Store Neo4j node ids in my external Lucene index and retrieve nodes via these stored ids?
UPDATE
I have been trying to get auto-indexing to work with Gremlin and an embedded server, but with no luck. In the documentation it says
The underlying database is auto-indexed, see Section 14.12, “Automatic Indexing” so the script can return the imported node by index lookup.
But when I examine the graph after loading a new database, no indexes seem to exist.
The Neo4j documentation on auto indexing says that a bunch of configuration is required. In addition to setting node_auto_indexing = true, you have to configure it
To actually auto index something, you have to set which properties should get indexed. You do this by listing the property keys to index on. In the configuration file, use the node_keys_indexable and relationship_keys_indexable configuration keys. When using embedded mode, use the GraphDatabaseSettings.node_keys_indexable and GraphDatabaseSettings.relationship_keys_indexable configuration keys. In all cases, the value should be a comma separated list of property keys to index on.
So is Gremlin supposed to set the GraphDatabaseSettings parameters? I tried passing a map into the Neo4jGraph constructor like this:
Map<String,String> config = [
    'node_auto_indexing': 'true',
    'node_keys_indexable': 'emailID'
]
Neo4jGraph g = new Neo4jGraph(graphDB, config);
g.loadGraphML("../databases/data.graphml");
but that had no apparent effect on index creation.
UPDATE 2
Rather than configuring the database through Gremlin, I used the examples given in the Neo4j documentation so that my database creation was like this (in Groovy):
protected Neo4jGraph getGraph(String graphDBName, String databaseName) {
    boolean populateDB = !new File(graphDBName).exists();
    if (populateDB)
        println "creating database";
    else
        println "opening database";
    GraphDatabaseService graphDB = new GraphDatabaseFactory().
            newEmbeddedDatabaseBuilder( graphDBName ).
            setConfig( GraphDatabaseSettings.node_keys_indexable, "emailID" ).
            setConfig( GraphDatabaseSettings.node_auto_indexing, "true" ).
            setConfig( GraphDatabaseSettings.dump_configuration, "true" ).
            newGraphDatabase();
    Neo4jGraph g = new Neo4jGraph(graphDB);
    if (populateDB) {
        println "Populating graph"
        g.loadGraphML(databaseName);
    }
    return g;
}
and my retrieval was done like this:
ReadableIndex<Node> autoNodeIndex = graph.rawGraph.index()
        .getNodeAutoIndexer()
        .getAutoIndex();
def node = autoNodeIndex.get( "emailID",
        "<2614099.1075839927264.JavaMail.evans@thyme>" ).getSingle();
And this seemed to work. Note, however, that the getIndices() call on the Neo4jGraph object still returned an empty list. So the upshot is that I can exercise the Neo4j API correctly, but the Gremlin wrapper seems to be unable to reflect the indexing state. The expression g.idx('node_auto_index') (documented in Gremlin Methods) returns null.
The auto indexes are created lazily. That is, when you have enabled auto-indexing, the actual index is first created when you index your first property. Make sure you are inserting data before checking the existence of the index, otherwise it might not show up.
For some auto-indexing code (using programmatic configuration), see e.g. https://github.com/neo4j-contrib/rabbithole/blob/master/src/test/java/org/neo4j/community/console/IndexTest.java (this works with Neo4j 1.8).
/peter
Have you tried the automatic index feature? It's basically the use case you're looking for--unfortunately it needs to be enabled before you import the data. (Otherwise you have to remove/add the properties to reindex them.)
http://docs.neo4j.org/chunked/milestone/auto-indexing.html

How to build a custom Lucene index for Neo4j graph?

I am using Gremlin and Neo4j to manipulate the ENRON dataset from infochimps. This dataset has two types of vertexes, Message and Email Address, and two types of edges, SENT and RECEIVED_BY. I would like to create a custom index on this dataset that creates a Lucene document for each vertex of type 'Message' and incorporates information from associated vertexes (e.g., v.in(), v.out()) as additional fields in the Lucene document.
I am thinking of code along the lines of
g = new Neo4jGraph('enron');
PerFieldAnalyzerWrapper analyzer =
    new PerFieldAnalyzerWrapper(new StandardAnalyzer());
analyzer.addAnalyzer("sender", new KeywordAnalyzer());
analyzer.addAnalyzer("recipient", new KeywordAnalyzer());
IndexWriter idx = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);
g.V.filter{it.type == 'Message'}.each { v ->
    Document doc = new Document();
    doc.add(new Field("subject", v.subject));
    doc.add(new Field("body", v.body));
    doc.add(new Field("sender", v.in().address));
    v.out().each { recipient ->
        doc.add(new Field("recipient", recipient.address));
    }
    idx.addDocument(doc);
}
idx.close();
My questions are:
1) Is there a better way to enumerate vertexes for indexing?
2) Can I use auto-indexing for this, and if so, how do I specify what should be indexed?
3) Can I specify my own Analyzer, or am I stuck with the default? What is the default?
4) If I must create my own index, should I be using Gremlin for this, or am I better off with a Java program?
I will be talking about direct Neo4j access here since I'm not well travelled in Gremlin.
So you'd like to build a Lucene index "outside of" the graph itself? Otherwise you can use the built in graphDb.index().forNodes( "myIndex", configForMyIndex ) to get (created on demand) a Lucene index associated with neo4j. You can then add multiple fields to each document by calling index.add( node, key, value ), where each node will be represented by one document in that Lucene index.
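As a quick illustration of that built-in route, a minimal sketch, assuming an already-started embedded GraphDatabaseService named graphDb and the Neo4j 1.x transaction API:
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;
import org.neo4j.graphdb.index.IndexManager;
import org.neo4j.helpers.collection.MapUtil;

Transaction tx = graphDb.beginTx();
try {
    // Created on demand; "type"="fulltext" selects Lucene's fulltext mode.
    Index<Node> messages = graphDb.index().forNodes("messages",
            MapUtil.stringMap(IndexManager.PROVIDER, "lucene", "type", "fulltext"));
    Node message = graphDb.createNode();
    message.setProperty("subject", "quarterly numbers");
    // One document per node; add as many key/value fields as you like.
    messages.add(message, "subject", message.getProperty("subject"));
    tx.success();
} finally {
    tx.finish();
}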
1) In Gremlin... I don't know
2) See http://docs.neo4j.org/chunked/milestone/auto-indexing.html
3) See http://docs.neo4j.org/chunked/milestone/indexing-create-advanced.html
4) Do you need to create it outside of the db entirely? If so, why?
I just finished an import with a Java process and it's really easy; in my opinion it's even better than going through Gremlin.
Anyway, if the process is failing, it's because you CAN'T create a StandardAnalyzer with no arguments. All the constructors of that class require parameters, so you should either create a wrapper class or pass the right Lucene version as a parameter to the constructor.
Neo4j, as of today, only supports Lucene up to version 3.6.
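For illustration, constructing the analyzer with an explicit version might look like this (a sketch, assuming Lucene 3.6 on the classpath):
import org.apache.lucene.analysis.KeywordAnalyzer;
import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.util.Version;

// StandardAnalyzer has no zero-argument constructor in Lucene 3.x;
// it must be told which version's behaviour to emulate.
PerFieldAnalyzerWrapper analyzer =
        new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_36));
analyzer.addAnalyzer("sender", new KeywordAnalyzer());
analyzer.addAnalyzer("recipient", new KeywordAnalyzer());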
