Indices in Neo4j - questions and doubts

The only indices I know about are indices on properties (these are created for a particular label, i.e. node type). I have some doubts, however.
Do indices on edges/relationships exist?
I often read that Neo4j leverages a Lucene index. Is it still used? What is its purpose?
Are there any other kinds of indices besides indices on properties?
Thanks in advance,

Neo4j has two indexing systems.
The more modern one is referred to as "schema indexes". These are automatic and apply to the properties of a given label, giving quick lookup by those properties when the properties and label are both provided within a query. This does not currently support indexing of relationship properties. These started out based on Lucene, but we've gradually replaced the implementation with our own native indexing solution. Discussion of these, as well as any noteworthy information and limitations, can be found in our index configuration documentation.
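For example, creating and using a schema index looks like this (a minimal sketch using the 3.x Cypher syntax; the Person label and name property are just placeholders):

// create a schema index on the name property of the Person label
CREATE INDEX ON :Person(name)

// the index is then used automatically when both the label and the
// indexed property appear in a query
MATCH (p:Person) WHERE p.name = 'Alice' RETURN p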
The other indexing system is an older manual one called "explicit indexes", though these have previously been called "manual indexes". It is also based on Lucene, but these indexes are not automatic -- it is up to the user to manually add or remove entries and keep them up to date when data in the database changes. This makes usage and maintenance cumbersome, and we recommend avoiding this system if possible.
Built-in procedures are the means of creating and performing lookups with explicit indexes, as these are never used automatically under the hood (as opposed to schema indexes). APOC Procedures also offers various means of interfacing with explicit indexes.
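For example, recent 3.x releases ship built-in procedures along these lines (a sketch; the index name and query string are placeholders, and the exact procedure names vary by version, so check CALL dbms.procedures() for what your release offers):

// look up nodes in an explicit index via a Lucene query string
CALL db.index.explicit.searchNodes('my_index', 'name:iPhone*') YIELD node
RETURN node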
The main reason to use explicit indexes is that they let you create an index on relationship properties and get fast lookups when querying the index. They also allow a full-text lookup across multiple labels and properties, provided the index has been configured in such a way.
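As an illustration, indexing and querying relationships through APOC might look like this (a hedged sketch; the KNOWS type and since property are made up for the example, and the exact YIELD columns depend on your APOC version):

// add a relationship to a manual index, indexing its 'since' property
MATCH (:Person {name: 'Alice'})-[r:KNOWS]->(:Person {name: 'Bob'})
CALL apoc.index.addRelationship(r, ['since'])
RETURN count(*)

// later, query the 'KNOWS' relationship index with a Lucene range query
CALL apoc.index.relationships('KNOWS', 'since:[2010 TO 2020]') YIELD rel
RETURN rel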
Separate from all of these, it should be noted that usage of labels is itself a kind of index, as it provides quick access to all nodes with the given label.
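You can see this in a query plan; for instance (a sketch, assuming a User label exists in the graph):

// PROFILE should show a NodeByLabelScan rather than an AllNodesScan
PROFILE MATCH (u:User) RETURN count(u)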

Related

Is this syntax not right for executing an APOC query?

call apoc.index.nodes('Product', 'name:iPhone*') yield node return node
In my graph I have 'iPhone X' and 'iPhone Plus', but this query doesn't return anything. I also have an index on the 'name' property of Product.
Indexes
ON :Product(name) ONLINE
apoc.index.nodes is one of the APOC procedures for "manual indexes", which are also confusingly referred to in various docs as "legacy indexes" and "explicit indexes". Such indexes use the Apache Lucene library and are NOT the same as the standard Neo4j indexes that most people use, and the way you create/update/use such indexes is also not standard.
For example, you cannot create a "manual index" via a Cypher CREATE INDEX clause. And Neo4j Browser's :schema command will not show any manual indexes.
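If you want to see which manual indexes exist, APOC offers a listing procedure (assuming a reasonably recent APOC version):

CALL apoc.index.list()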
If you will only be searching :Product(name) via manual indexes, then you should drop your standard index for :Product(name), since it will not be needed but will add overhead (time and space) to your DB.
One way to create/update/use manual indexes is through the special APOC procedures. The APOC documentation for manual indexes (linked above) provides a good amount of information about how to add nodes and relationships to such indexes, and how to search using them.
As an example, before you can use the query in your question, you first have to add all the :Product(name) values to the Product manual index. If you want to add them all at once, you can use the following query (and since it has to return something, it just returns a count of the number of Products):
MATCH (p:Product)
CALL apoc.index.addNode(p, ['name'])
RETURN count(*)
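With the index populated, the apoc.index.nodes query from the question should now return both products.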
[UPDATED]
Manual indexing is typically only used for partial and fuzzy text search use cases. When you just need exact value matching, standard indexes are recommended, especially since they require much less effort on your part. The reason manual indexes are called "manual" is because the responsibility for maintaining them falls entirely on your shoulders. That is, your node/relationship/property addition/removal/update queries would normally have to add/remove/update any relevant manual index entries as well. Note that when you update a property that is manually indexed, you have to remove the old index entry and then add the new entry.
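A hedged sketch of such an update, using apoc.index.removeNodeByName and apoc.index.addNodeByName (double-check these procedure names against your APOC version; the product names are placeholders):

// rename a product while keeping its 'Product' manual index entry in sync
MATCH (p:Product {name: 'iPhone X'})
CALL apoc.index.removeNodeByName('Product', p)
SET p.name = 'iPhone XS'
CALL apoc.index.addNodeByName('Product', p, ['name'])
RETURN p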

Embedded automatic full text indexing completely removed from Neo4j as of 3.0.0?

I'm moving from Neo4j 2.2.* to (still prerelease) 3.0.0 and all of a sudden it seems that configuration parameters
node_auto_indexing=true
relationship_auto_indexing=true
node_keys_indexable=some_node_property
relationship_keys_indexable=some_rel_property
are gone and no longer available. This is sad because I need full-text indexing (namely, fuzzy search queries and range searches); I had been using it happily since 2.0.0 and naively hoped that the new Lucene 5.5 would make my life better in 3.0.0.
Is this functionality completely removed? The START clause is still present in Cypher, and neo4j-shell still has a command for manipulating "legacy" FT indices, so my question is:
how do I populate my FT index without using Java or another external programming language?
case 1: I import a bunch of "static" data into the graph which will rarely be updated (think of a dictionary), need to arrange FTS on it once, and will manually perform a complete reindex on occasional updates of the dataset;
case 2: nodes and relationships with specific properties automagically get indexed upon creation, or upon assignment of a new value to a property with a specific name, in near-realtime, as it used to be before.
The new schema indexes are cool in 3.0.0 and range searches are implemented, but a) they work only on node properties, not relationships, and b) they don't allow full-text or fuzzy queries, and AFAIK regular-expression matching does not use an index.
Thanks for your suggestions!
WBR, Andrii
Andrii,
only the default config parameters have been removed, not the functionality.
What is the actual use-case you are using the FTS indexes (on rels) for?
In 3.0 you can still use the START clause, but with stored procedures you can add nodes and relationships to indexes explicitly. And you can use similar procedures to query your indexes even more efficiently, e.g. by passing in start and end nodes.
See (WIP): https://github.com/jexp/neo4j-apoc-procedures#manual-indexes
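For example, querying a manual index from Cypher still works via START, or through the procedures (a sketch; the index name, label, and fuzzy query string are placeholders):

// legacy START syntax against a manual index, with a Lucene fuzzy term
START n=node:my_index('name:appl~') RETURN n

// roughly equivalent via APOC
CALL apoc.index.nodes('Person', 'name:appl~') YIELD node RETURN node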

Neo4j auto increase schema index

It is recommended not to use Neo4j's internal id because it may change, but rather to create our own identifier. So, to identify my users, I plan to create a user_id property on the nodes labelled User and put an index on it. However, I cannot figure out a way to make it auto-increment.
After some searching, I noticed there are two kinds of indexes in Neo4j: the schema index and the legacy index. Could anyone explain the difference between them? And is there a way to make my user_id index auto-increment?
Schema indices are tied to labels, e.g. :User. You can also create indices on the properties of those labels if you wish. There's also no need to specify which index you're using, as this is done automatically in this case.
Legacy indices are the node indices that were around prior to Neo4j 2.0. They're a traditional index where you specify what you're indexing and which properties it applies to, but they're only used in START statements, which are optional (and on their way to deprecation).
For more detail, have a look here (http://docs.neo4j.org/chunked/stable/graphdb-neo4j-schema.html) and here (http://docs.neo4j.org/chunked/stable/indexing.html).
As for auto-incrementing, I'm unaware of any such functionality for user-defined index keys.
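That said, a common workaround (a sketch of my own, not from the docs; beware of contention on the counter node under concurrent writes) is to keep a single counter node and bump it whenever you create a user:

// a single counter node holds the last issued id
MERGE (c:Counter {name: 'user_id'})
SET c.value = coalesce(c.value, 0) + 1
CREATE (u:User {user_id: c.value})
RETURN u.user_id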
HTH

How to programmatically add constraints to Neo4J Cypher queries

I am writing a server plugin for Neo4j. The plugin receives a Cypher query and executes it. Currently, my implementation uses a CypherExecutor.
I now need to further constrain the results. (For example, imagine that the results need to be filtered by ACLs.)
One approach is to filter the results after executing the query. I'd rather not do this, for performance reasons as well as other limitations (for example, any aggregate results would be wrong).
I considered adding the constraints to the query itself. I've looked at the command.AbstractQuery subclasses produced using the CypherParser. That object model is immutable.
I am wondering whether I will need to resort to cloning Neo4j's ExecutionEngine and CypherCompiler, just to extend the ExecutionPlanBuilder... I would like to avoid this option if at all possible.
Any recommendations about how this can be done?
In my case, I am simply trying to simulate multiple isolated graphs. I am OK with how this might be modeled -- whether I add a 'tenantId' to each node, or maintain a tenant node and add (:Tenant)<-[:scopedTo]-(n) relationships to every node.
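To make that concrete, here is roughly what the added constraint could look like once rewritten into a query (a sketch; the tenantId parameter and the Product label are placeholders):

// property-per-node variant: a tenant filter is appended to every MATCH
MATCH (n:Product) WHERE n.tenantId = {tenantId} RETURN n

// tenant-node variant: every node is anchored to its tenant
MATCH (t:Tenant {id: {tenantId}})<-[:scopedTo]-(n:Product) RETURN n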

How to apportion between BatchInserterIndex cache and MMIO?

In a batch insertion using Lucene indexes, given a large set of nodes and relationships such that the node and relationship stores cannot fit completely in mapped memory (hence the need for Lucene index caching), how should one divide memory between MMIO and the Lucene index caches to achieve optimal performance?
Having read the documentation, I am already somewhat familiar with how to divide memory within the mapped-memory scheme. What I am interested in is the overall allotment of memory between MMIO and the Lucene caches. Since I am working on a prototype with whatever hardware happens to be available, and the future resources and data volume are undetermined, I would prefer the answer to be in general terms (I think this would also make the answer more useful to the rest of the Neo4j community too). So it would be good if I could pose the question like this:
Given
rwN nodes and rwR relationships that are written and must be read later in the batch insertion,
woN nodes and woR relationships that are only written,
G gigabytes of RAM (not including what is required for the operating system)
What is the optimal division of G between Lucene index caches and MMIO?
If more details are needed I can supply them for my particular case.
All these considerations are only relevant when importing (multiple) billions of nodes and relationships.
Usually it depends on the "hot dataset size" of your index lookups.
By default that's all nodes, but if you know your domain better, you can probably devise some paging that results in smaller caches being needed (e.g. by pre-sorting your input data for relationship creation by start- and end-node lookup property); then you have a kind of moving window over your node data during which each node is accessed frequently.
I usually even sort by min(start,end).
Usually you try to use most of the RAM for MMIO mapping of the relationship store and node store. The property stores are only written to, but the others have to be updated as well.
The index cache lookup is only a HashMap behind the scenes, so it's quite wasteful. What I found to work better is a different approach, e.g. a multi-pass one:
use a string array: put all your lookup properties in there, sort it, and use the array index (found via Arrays.binarySearch) as the node id; a lookup in that array alone is then quite efficient
another way is to do a multi-pass over the source data so that you already create the node ids needed for the relationships as part of the source; Friso and Kris from Xebia did something like that in their Hadoop-based solution, especially the monotonically increasing parallel ids
