Using Neo4j fulltext indexing for Contains queries - neo4j

I'm using neo4j 3.5, and have about 9 million user nodes. I was trying to implement the following query, but it was taking way too long:
MATCH (users:User) WHERE (users.username CONTAINS "joe" OR users.first_name CONTAINS "joe" OR users.last_name CONTAINS "joe")
RETURN users
LIMIT 30
I was hoping to take advantage of neo4j 3.5's newe fulltext indexing feature by creating the following index:
CALL db.index.fulltext.createNodeIndex('users', ['User'], ['username', 'first_name', 'last_name'])
and then querying the db like so
CALL db.index.fulltext.queryNodes('users', joe)
YIELD node
RETURN node.user_id
I thought this would work the same as contains and return users whose username, first_name or last_name contains joe (eg: myjoe12, joe12, 12joe, 44joeseph, etc.) but it seems to be returning users whose fields are joe exactly or contain joe separated by a whitespace (eg: Joe B, Joe y1), I tried using joe* in the query but that only returns everything starting with joe, I want to return everything containing joe or whatever search term. What would be the best way to go about this?

Speed issue / Index:
So far I know, Neo4j has a optimised index for STARTS WITH & ENDS WITH only for NOT composite indexes.
If I read this docs paragraph, my conclusion will be this: Your 9 million users will be searched one by one, neo4j doesn't use any index for your query. What makes this query really slow.
A answer to your question:
I want to return everything containing Joe or whatever search term.
You probably looking for a regex search (this is also slow and not a index search and not recommended):
Example query based on your query:
MATCH (users:User)
WHERE (users.username =~ "(?i).*joe.*" OR users.first_name =~ "(?i).*joe.*" OR users.last_name =~ "(?i).*joe.*")
RETURN users
LIMIT 30
Explanation for (?i) this means case-insensitive so Joe or joe will be matched. See regex operator docs and regex where docs

For the fulltext schema index, it looks like you'll need to use the fuzzy search operator ~ in your query, though you may need to do some filtering on the score to make sure you're looking at relevant results:
CALL db.index.fulltext.queryNodes('users', 'joe~')
YIELD node, score
WHERE score > .8
RETURN node.user_id

Related

How to store regex or search terms in Postgres database and evaluate in Rails Query?

I am having trouble with a DB query in a Rails app. I want to store various search terms (say 100 of them) and then evaluate against a value dynamically. All the examples of SIMILAR TO or ~ (regex) in Postgres I can find use a fixed string within the query, while I want to look the query up from a row.
Example:
Table: Post
column term varchar(256)
(plus regular id, Rails stuff etc)
input = "Foo bar"
Post.where("term ~* ?", input)
So term is VARCHAR column name containing the data of at least one row with the value:
^foo*$
Unless I put an exact match (e.g. "Foo bar" in term) this never returns a result.
I would also like to ideally use expressions like
(^foo.*$|^second.*$)
i.e. multiple search terms as well, so it would match with 'Foo Bar' or 'Search Example'.
I think this is to do with Ruby or ActiveRecord stripping down something? Or I'm on the wrong track and can't use regex or SIMILAR TO with row data values like this?
Alternative suggestions on how to do this also appreciated.
The Postgres regular expression match operators have the regex on the right and the string on the left. See the examples: https://www.postgresql.org/docs/9.3/static/functions-matching.html#FUNCTIONS-POSIX-TABLE
But in your query you're treating term as the string and the 'Foo bar' as the regex (you've swapped them). That's why the only term that matches is the exact match. Try:
Post.where("? ~* term", input)

Postgresql text searching, matching multiple words

I don't know the name for this kind of search, but I see that it's getting pretty common.
Let's say I have records with the following file names:
'order_spec.rb', 'order.sass', 'orders_controller_spec.rb'
If I search with the following string 'oc' I would like the result to return 'orders_controller_spec.rb' due to match the o in orders and the c in controller.
If the string is 'os' then I'd like all 3 to match, 'order_spec.rb', 'order.sass', 'orders_controller_spec.rb'.
If the string is 'oco' then I'd like 'orders_controller_spec.rb'
What is the name for this kind of search and how would I go about getting this done in Postgresql?
This is a called a subsequence search. One simple way to do it in Postgres is to use the LIKE operator (or several of the other options in those docs) and fill the spaces between your letters with a wildcard, which for LIKE is %. To match anything with an o followed by an s in the words column, that would look like this:
SELECT * FROM table WHERE words LIKE '%o%s%';
This is a relatively expensive search, but you can improve performance with a varchar_pattern_ops or text_pattern_ops index to support faster pattern matching.
CREATE INDEX pattern_index ON table (words varchar_pattern_ops);

Configure Sphinx to index dash and search it with and without it

I have a record
Item id: 1, name: "wd-40"
How do I configure Sphinx to match this record on the following queries:
Item.search("wd40")
Item.search("wd-40")
To answer your title question, charset_table is what you want.
http://sphinxsearch.com/docs/current.html#charsets
But that doesnt actully solve the query of matching those two queries, indexing - wouldn't work, just be the inverse of indexing it.
Instead, you probably want ignore_chars
http://sphinxsearch.com/docs/current.html#conf-ignore-chars
First indexing:
By default, only ascii characters are indexed by Sphinx; the others are considered word separators. To fix that, you need to use the charset_table parameter to map the dash to the dash character.
Second searching:
AFAIK, it is not possible to make Sphinx to consider both searches like you are asking for. However, you can just use something like:
# in Python, but I believe is understandable
query = word
if '-' in word:
query += " | " + word.replace('-','')
Item.search(query) # if word = 'wd-40', query = 'wd-40 | wd40'

Index on multiple properties in Neo4j / Cypher

Can I create an index with multiple properties in cypher?
I mean something like
CREATE INDEX ON :Person(first_name, last_name)
If I understand correctly this is not possible, but if I want to write queries like:
MATCH (n:Person)
WHERE n.first_name = 'Andres' AND n.last_name = 'Doe'
RETURN n
Does these indexes make sense?
CREATE INDEX ON :Person(first_name)
CREATE INDEX ON :Person(last_name)
Or should I try to merge "first_name" and "last_name" in one property?
Thanks!
Indexes are good for defining some key that maps to some value or set of values. The key is always a single dimension.
Consider your example:
CREATE INDEX ON :Person(first_name)
CREATE INDEX ON :Person(last_name)
These two indexes now map to those people with the same first name, and separately it maps those people with the same last name. So for each person in your database, two indexes are created, one on the first name and one on the last name.
Statistically, this example stinks. Why? Because the distribution is stochastic. You'll be creating a lot of indexes that map to small clusters/groups of people in your database. You'll have a lot of nodes indexed on JOHN for the first name. Likewise you'll have a lot of nodes indexed on SMITH for the last name.
Now if you want to index the user's full name, then concatenate, forming JOHN SMITH. You can then set a property of person as person.full_name. While it is redundant, it allows you to do the following:
Create
CREATE INDEX ON :Person(full_name)
Match
MATCH (n:Person)
USING INDEX n:Person(full_name)
WHERE n.full_name = 'JOHN SMITH'
You can always refer to http://docs.neo4j.org/refcard/2.0/ for more tips and guidelines.
Cheers,
Kenny
As of 3.2, Neo4j supports composite indexes. For your example:
CREATE INDEX ON :Person(first_name, last_name)
You can read more on composite indexes here.

Can Neo4j be effectively used to show a collection of nodes in a sortable and filterable table?

I realise this may not be ideal usage, but apart from all the graphy goodness of Neo4j, I'd like to show a collection of nodes, say, People, in a tabular format that has indexed properties for sorting and filtering
I'm guessing the Type of a node can be stored as a Link, say Bob -> type -> Person, which would allow us to retrieve all People
Are the following possible to do efficiently (indexed?) and in a scalable manner?
Retrieve all People nodes and display all of their names, ages, cities of birth, etc (NOTE: some of this data will be properties, some Links to other nodes (which could be denormalised as properties for table display's and simplicity's sake)
Show me all People sorted by Age
Show me all People with Age < 30
Also a quick how to do the above (or a link to some place in the docs describing how) would be lovely
Thanks very much!
Oh and if the above isn't a good idea, please suggest a storage solution which allows both graph-like retrieval and relational-like retrieval
if you want to operate on these person nodes, you can put them into an index (default is Lucene) and then retrieve and sort the nodes using Lucene (see for instance How do I sort Lucene results by field value using a HitCollector? on how to do a custom sort in java). This will get you for instance People sorted by Age etc. The code in Neo4j could look like
Transaction tx = neo4j.beginTx();
idxManager = neo4j.index()
personIndex = idxManager.forNodes('persons')
personIndex.add(meNode,'name',meNode.getProperty('name'))
personIndex.add(youNode,'name',youNode.getProperty('name'))
tx.success()
tx.finish()
'*** Prepare a custom Lucene query context with Neo4j API ***'
query = new QueryContext( 'name:*' ).sort( new Sort(new SortField( 'name',SortField.STRING, true ) ) )
results = personIndex.query( query )
For combining index lookups and graph traversals, Cypher is a good choice, e.g.
START people = node:people_index(name="E*") MATCH people-[r]->() return people.name, r.age order by r.age asc
in order to return data on both the node and the relationships.
Sure, that's easily possible with the Neo4j query language Cypher.
For example:
start cat=node:Types(name='Person')
match cat<-[:IS_A]-person-[born:BORN]->city
where person.age > 30
return person.name, person.age, born.date, city.name
order by person.age asc
limit 10
You can experiment with it in our cypher console.

Resources