Does the label mechanism provide auto-indexing features when using the neo4j-java API?

First of all, I bet that there is an answer to this question somewhere in the docs, but since the 'Manual: Labels and Indexes' link here gives me a 404 error, I'm going to ask anyway.
Is it possible to create an index on some label and specify it as an automatic one (just like the legacy indexes I'm currently using, but for labels)?
If someone from the Neo4j team is reading this post, please let me know if I'm looking for the documentation in the right place, because I can't find anything more or less informative on labels and indexes (except a couple of posts on Michael Hunger's blog and maybe some presentations, which is obviously not enough).
This is a more technical question: is it possible to find an item in the index by a regex? Suppose I have a node with property 'n' -> '/a/b/c', and another node with 'n' -> '/a/*/c'. Can I somehow match them?

I don't work for Neo4j but I'll answer anyway.
All label indexing is automatic. Once you've created the index it maintains itself, possibly with minimal delay.
The manual for the latest stable release can always be found here. The chapter on indexing for the embedded Java API is here.
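For reference, here is a minimal sketch of creating such an index through the embedded Java API (Neo4j 2.x schema API; the Person label and name property are made-up examples). Once created this way, the index maintains itself:
import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Transaction;
public class CreateLabelIndex {
    public static void createIndex(GraphDatabaseService graphDb) {
        // Create a schema index on :Person(name); once it exists,
        // Neo4j keeps it up to date on its own.
        try (Transaction tx = graphDb.beginTx()) {
            graphDb.schema()
                    .indexFor(DynamicLabel.label("Person"))
                    .on("name")
                    .create();
            tx.success();
        }
        // Population happens in the background; if needed you can wait for it
        // in a separate transaction with schema().awaitIndexesOnline(...).
    }
}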
You cannot use regexp with label indices yet. It's said to be on the agenda, along with index support for array lookups, i.e. what in Cypher would be
MATCH (a:MyLabel) WHERE a.value IN ['val1', 'val2']


How to implement fuzzy search

I'm using the Neo4j 3 REST API and I have a node named customer; it has properties like name, etc. I need to get search results on the customer name, e.g. I should get results for the name "john" for my input "joan". How do I implement fuzzy search to get my desired results?
Thanks in advance
First off, I want to make sure you know that Neo4j 3.x is currently in beta and isn't considered stable yet.
You have two options to implement a fuzzy search in Neo4j. You can use the legacy indexes to implement Lucene-based indexing. That should provide anything that Lucene can do, though you'd probably need to do a bit more work. You can also implement your own unmanaged extension, which will allow you to use Lucene a bit more directly.
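For the legacy-index route, a rough sketch of a Lucene fuzzy lookup through the embedded Java API might look like this (the "customers" index name and the "name" key are assumptions for illustration; over REST you would pass the same Lucene query string to the legacy index endpoint instead):
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.index.Index;
import org.neo4j.graphdb.index.IndexHits;
public class FuzzyNameSearch {
    public static void search(GraphDatabaseService graphDb, String input) {
        try (Transaction tx = graphDb.beginTx()) {
            Index<Node> customers = graphDb.index().forNodes("customers");
            // Lucene fuzzy syntax: "joan~" matches terms within a small edit
            // distance, so "john" can come back for the input "joan".
            IndexHits<Node> hits = customers.query("name", input + "~");
            for (Node customer : hits) {
                System.out.println(customer.getProperty("name"));
            }
            hits.close();
            tx.success();
        }
    }
}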
Perhaps the easier alternative is to use Elasticsearch with Neo4j and have Elasticsearch do your full-text indexing. You might take a look at the Neo4j and ElasticSearch page on neo4j.com. There they provide a link to a GitHub repository containing a plugin for Neo4j which automagically updates Elasticsearch with data from Neo4j and which provides an endpoint for querying your graph fuzzily. There is also a video tutorial on how to do this.
You will have to try the soundex search described at https://neo4j.com/developer/kb/how-to-perform-a-soundex-search/, which will work in this case. If your input is "Joan" you will not get "John" as the response (unless you just give "jo" as the input, in which case you will get both). To get what you are expecting, you will have to use the soundex search.
Stepping back a little, what is the problem you are trying to solve with fuzzy matching?
My experience has been that misspellings and typos are far less common than you might think, and humans prefer exact matches whenever possible. If there is no exact match (often just missing a space between words), that's a good time to use a spellchecker, and that's where the fuzzy matching should kick in.
In addition, your example would match "joan" to "john", but some synonyms like "joanie" would be more useful. If you have a big corpus of content to work with, you may be able to extract some relationships, using fuzzy matching & machine learning to identify "joanne" and "joni" as possible synonyms and then submit that to a human curator. "Jon" looks like a related name but it's not, while "jo" and even "nonie" may or may not be nicknames in these groupings.

What is a path in neo4j cypher v2.0 and higher?

I read in the neo4j 2.0 cypher-refcard that
Paths are no longer collections, use nodes(path) or rels(path).
What is a path now? Why the change? What consequences does the change have for path MATCHing, for example?
A path is a path. @DaveBennett's answer covers what they are from the JSON perspective. Inside of Cypher they're a special kind of object, which you can access in various ways (e.g. through nodes() and rels()). I find this more clear and intuitive; if it were a collection, what would it be a collection of? Inevitably mixed types (e.g. node, rel, node, rel). Better that it should be its own object type, to discourage people from doing things like indexing into even-numbered items based on certain assumptions.
Expanding on the previous answer, this (I think) further makes sense because of the syntax cypher uses for path binding, i.e.
MATCH p=(a)-[r]-(b) RETURN p.
Clearly in this example p is something special. The syntax pretty clearly indicates that a has to be a node, and r is definitely a relationship. Paths just aren't either of those things.
From a programming language perspective, it's good for "collections" to be uniformly typed. E.g. a programmer knows how to deal with a Collection<String>; this means each item in the collection plays by the semantic rules of a String. Making a path a collection would then be problematic, because it can't be a collection of any one type. When iterating through a path/collection, what would you do with each item? The answer is that it would depend on what the item is, which tends to make for messy code.
Again, better to have paths be their own thing. Want to iterate over all of the nodes in the path? That's what nodes(p) is for, which will give you a uniformly typed collection. Extra bonus that it makes your cypher code more readable.
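The embedded Java API (Neo4j 2.x) reflects the same design: org.neo4j.graphdb.Path is its own type that hands out uniformly typed views rather than posing as a single mixed collection. A small illustrative sketch (the printing is just for demonstration):
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Path;
import org.neo4j.graphdb.PropertyContainer;
import org.neo4j.graphdb.Relationship;
public class PathViews {
    public static void describe(Path path) {
        for (Node node : path.nodes()) {                 // like nodes(p) in Cypher
            System.out.println("node " + node.getId());
        }
        for (Relationship rel : path.relationships()) {  // like rels(p) in Cypher
            System.out.println("rel " + rel.getType().name());
        }
        // The interleaved node/rel/node sequence is still available,
        // but only as an explicitly mixed iteration:
        for (PropertyContainer entity : path) {
            System.out.println(entity);
        }
    }
}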
In some ways I'm "back-explaining" what the neo4j devs did. I didn't make this design decision, and I wasn't involved in it, so I'm not giving you the neo4j official answer why. This is just my explanation for why the design decision was (IMHO) a very good idea. It follows design patterns you see everywhere else, with certain advantages.

Neo4j performance - counting nodes - linked list performance - alternatives?

UPDATED: Wes hit a home run here! Thanks.. I've added a Rails version I was developing using the neography Gem.. Accomplishes the same thing but his version is much faster.. see comparison below
I am using a linked list in Neo4j (1.9, REST, Cypher) to help keep the comments in proper order (Yes I know I can sort on the time etc).
(object node)---[:comment]--->(comment)--->(comment)--->(comment).... etc
Currently I have 900 comments and it's taking 7 seconds to get through the whole list - completely unacceptable.. I'm just returning the ID of the node (I know, don't do this, but it's not the point of my post).
What I'm trying to do is find the IDs of users who commented so I can return a count.. (like "Joe and 405 others commented on your post").. Now, I'm not even counting the unique nodes at this point - I'm just returning the author_id for each record.. (I'll worry about counting later - first take care of the basic performance issue).
start object=node(15837) match object-[:COMMENTS*]->comments return comments.author_id
7 seconds is waaaay too long..
Instead of using a linked list, I could just simply have an object and link all the comments directly to the node - but that could lead to a supernode that is just bogged down, and then finding the most recent comments, even with skip and limit, will be dog slow..
Will relationship indexes help here? I've never used them other than to ensure a unique relationship, or to see if a relationship exists, but can I use them in a cypher query to help speed things up?
If not, what else can I do to decrease the time it takes to return the IDs?
COMPARISON: Here is the Rails version using "phase II" methods of the Neography gem:
next_node_id = 18233
@neo = Neography::Rest.new
start_node = Neography::Node.load(next_node_id, @neo)
# follow outgoing COMMENTS relationships to any depth and collect the nodes
all_nodes = start_node.outgoing(:COMMENTS).depth(10000)
raise all_nodes.size.to_s   # quick way to surface the count while debugging in Rails
Result: 526 nodes found in 290ms..
Wes' solution took 5 ms.. :-)
Relationship indexes will not help. I'd suggest using an unmanaged extension and the traversal API--it will be a lot faster than Cypher for this particular query on long lists. This example should get you close:
https://github.com/wfreeman/linkedlistlength
I based it on Mark Needham's example here:
http://www.markhneedham.com/blog/2014/07/20/neo4j-2-1-2-finding-where-i-am-in-a-linked-list/
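If you go the traversal-API route, a rough sketch of counting the list in Java might look like this (Neo4j 2.x embedded API; only the COMMENTS relationship type comes from the question, the rest is illustrative, and Wes's actual extension may differ):
import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Path;
import org.neo4j.graphdb.Transaction;
import org.neo4j.graphdb.traversal.Evaluators;
import org.neo4j.graphdb.traversal.TraversalDescription;
public class CommentListLength {
    public static int countComments(GraphDatabaseService graphDb, long postNodeId) {
        int count = 0;
        try (Transaction tx = graphDb.beginTx()) {
            Node post = graphDb.getNodeById(postNodeId);
            // Walk the COMMENTS chain from the post node; each returned path
            // ends at one comment node, so counting paths counts comments.
            TraversalDescription td = graphDb.traversalDescription()
                    .depthFirst()
                    .relationships(DynamicRelationshipType.withName("COMMENTS"), Direction.OUTGOING)
                    .evaluator(Evaluators.excludeStartPosition());
            for (Path ignored : td.traverse(post)) {
                count++;
            }
            tx.success();
        }
        return count;
    }
}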
If you're only doing this to return a count, the best solution is not to figure it out on every query, since it isn't changing that often. Cache the result in a total_comments property on the node. Every time a comment relationship is added or removed, update that count. If you want to know whether any of the current user's friends commented on it, so you can say, "Joe and 700 others commented on this," you could do a second query:
start joe=node(15830), object=node(15838) match joe-[:FRIENDS]->friend-[:POSTED_COMMENT]->comment<-[:COMMENTS]-object RETURN friend LIMIT 1
You limit it to 1 since you only need the name of one friend who commented. If it returns someone, adjust the number of comments displayed by 1 and include the user's name. You could do that with JS so it doesn't delay your page load. Sorry if my Cypher is a little off; I'm not used to <2.0 syntax.
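Coming back to the cached total_comments idea above, a minimal sketch of keeping the counter in step at write time (embedded Java API; it assumes for simplicity that the comment is attached directly to the post, whereas in the linked-list layout the new relationship would hang off the last comment, but the counter would still live on the post node):
import org.neo4j.graphdb.DynamicRelationshipType;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
public class CommentCounter {
    public static void addComment(GraphDatabaseService graphDb, Node post, Node comment) {
        // Attach the comment and bump the cached counter in the same
        // transaction, so the count cannot drift from the real structure.
        try (Transaction tx = graphDb.beginTx()) {
            post.createRelationshipTo(comment, DynamicRelationshipType.withName("COMMENTS"));
            long current = (Long) post.getProperty("total_comments", 0L);
            post.setProperty("total_comments", current + 1);
            tx.success();
        }
    }
}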

CREATE UNIQUE relationship is taking too much time

START names = node(*),
target=node:node_auto_index(target_name="TARGET_1")
MATCH names
WHERE NOT names-[:contains]->()
AND HAS (names.age)
AND (names.qualification =~ ".*(?i)B.TECH.*$"
OR names.qualification =~ ".*(?i)B.E.*$")
CREATE UNIQUE (names)-[r:contains{type:"declared"}]->(target)
RETURN names.name,names,names.qualification
I have nearly 180,000 name nodes, and I have iterated the above process to create unique relationships more than 100 times by changing the target. It's taking too much time. How can I resolve it?
I build the query with Java and iterate. I am using Neo4j 2.0.0.5 and Java 1.7.
I edited your cypher query because I think I understand it, but I can barely read the rest of your question. If you edit it with white spaces and punctuation it might be easier to understand what you are trying to do. Until then, here are some thoughts about your query being slow.
You bind all the nodes in the graph, that's typically pretty slow.
You bind all the nodes in the graph twice. First you bind universally in your start clause: names=node(*), and then you bind universally in your match clause: MATCH names, and only then you limit your pattern. I don't quite know what the Cypher engine makes of this (possibly it gets a migraine and goes off to make a pot of coffee). It's unnecessary, you can at least drop the names=node(*) from your start clause. Or drop the match clause, I suppose that could work too, since you don't really do anything there, and you will still need a start clause for as long as you use legacy indexing.
You are using Neo4j 2.x, but you use legacy indexing instead of labels, at least in this query. Without knowing your data and model it's hard to know what the difference would be for performance, but it would certainly make it much easier to write (and read) your queries. So, that's a different kind of slow. It's likely that if you had labels and label indices, the query performance would improve.
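If you are already driving this from Java, a hypothetical one-off migration that puts a label on existing nodes could look roughly like this (the Name label matches the suggested query below; labelling every node is only for illustration, you would normally restrict it to the right nodes):
import org.neo4j.graphdb.DynamicLabel;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;
import org.neo4j.tooling.GlobalGraphOperations;
public class AddNameLabel {
    public static void labelAllNodes(GraphDatabaseService graphDb) {
        Label name = DynamicLabel.label("Name");
        // One-off migration so that label-based MATCH clauses (and label
        // indexes) can be used afterwards.
        try (Transaction tx = graphDb.beginTx()) {
            for (Node node : GlobalGraphOperations.at(graphDb).getAllNodes()) {
                node.addLabel(name);
            }
            tx.success();
        }
    }
}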
So, first try removing one of the universal bindings of nodes, then use the 2.x schema tools to structure your data. You should be able to write queries like
MATCH (target:Target)
WHERE target.target_name="TARGET_1"
WITH target
MATCH (names:Name)
WHERE NOT (names)-[:contains]->()
AND HAS(names.age)
AND (names.qualification =~ ".*(?i)B.TECH.*$"
OR names.qualification =~ ".*(?i)B.E.*$")
CREATE UNIQUE (names)-[r:contains{type:"declared"}]->(target)
RETURN names.name, names, names.qualification
I have no idea if such a query would be fast on your data, however. If you put the "Name" label on all your nodes, then MATCH (names:Name) will still bind all nodes in the database, so it'll probably still be slow.
P.S. The relationships you create have a TYPE called contains, and you give them a property called type with value declared. Maybe you have a good reason, but that's potentially very confusing.
Edit:
Reading through your question and my answer again, I no longer think that I understand even your cypher query. (Why are you returning both the bound nodes and properties of those nodes?) Please consider posting sample data on console.neo4j.org and explaining in more detail what your model looks like and what you are trying to do. Let me know if my answer addresses your question at all, or I'll consider removing it.

Lucene partial word matching

Lucene does not support partial word matching out of the box, so I need some help building my query.
Let's say I have a document with the field value "Develop".
I would like this document to be returned for the searches "Dev" and "lop".
Maybe creating several queries?
"*keyword"
and
"keyword*"
and
"keyword"
?
How would you go about doing this with multiple words? Would you split the sentence/search into a list of words and do the previous example for each word?
What you're asking is, if I understand you correctly, not feasible on any large-scale search engine.
Lucene creates an index over keywords using term-document matrix and inverted-file techniques (see the links at the bottom). Fully fledged string matching might be very nice to have, but it does not scale: you would never be able to query a decently sized index (say, more than a couple of dozen or hundred documents) in an acceptable time.
Still, here are two ideas that might help...
Syllable tokenization
To come back to your example with 'Develop': as long as you are happy with letting users search for syllables, I guess you can do something.
You would have to use a tokenizer that splits up the words in your index according to their syllables and create an index over those syllables. (I am not sure there are built-in tokenizers for the English language that can do that, and writing one on your own might be tricky...)
An important thing to note:
If you index the full words AND the separate syllables, the size of your index will be much larger than if you only index one of the two.
However, I would not suggest indexing only syllables. If you want to also allow your users to search for the full word 'Develop' (which I guess you do), this would result in two queries with a logical AND between them, namely <'dev' AND 'lop'>. Although Lucene supports such logical constructs in queries, they are very expensive. I have personally had some trouble in the past using logical queries in Lucene.
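For what it's worth, the <'dev' AND 'lop'> query mentioned above would look roughly like this with the classic Lucene API (Lucene 4.x style; the "content" field name is an assumption, and the index would already have to contain syllable tokens such as "dev" and "lop"):
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
public class SyllableQuery {
    public static Query devAndLop() {
        // Both syllables must match the same document.
        BooleanQuery query = new BooleanQuery();
        query.add(new TermQuery(new Term("content", "dev")), BooleanClause.Occur.MUST);
        query.add(new TermQuery(new Term("content", "lop")), BooleanClause.Occur.MUST);
        return query;
    }
}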
Stemming
Another way to somehow arrive at what you're trying to do could be to use a brutal form of word stemming (http://en.wikipedia.org/wiki/Stemming) that stems words to their first syllable. (This would allow searching for 'dev' but not for 'lop'...)
Again, I don't think such a word-stemming feature is already in Lucene. Writing one yourself will be a pain and involve working with/importing huge dictionaries.
Links
These might be worth looking into if you don't know about search engine internals:
http://en.wikipedia.org/wiki/Index_%28search_engine%29
http://en.wikipedia.org/wiki/Vector_space_model
http://en.wikipedia.org/wiki/Inverted_file
http://en.wikipedia.org/wiki/Term-document_matrix
http://en.wikipedia.org/wiki/Tf-idf
