Can I create an index with multiple properties in cypher?
I mean something like
CREATE INDEX ON :Person(first_name, last_name)
If I understand correctly this is not possible, but if I want to write queries like:
MATCH (n:Person)
WHERE n.first_name = 'Andres' AND n.last_name = 'Doe'
RETURN n
Do these indexes make sense?
CREATE INDEX ON :Person(first_name)
CREATE INDEX ON :Person(last_name)
Or should I try to merge "first_name" and "last_name" in one property?
Thanks!
Indexes are good for defining some key that maps to some value or set of values. The key is always a single dimension.
Consider your example:
CREATE INDEX ON :Person(first_name)
CREATE INDEX ON :Person(last_name)
These two indexes map each first name to the people who share it and, separately, each last name to the people who share it. So every person in your database gets two index entries, one under the first name and one under the last name.
Statistically, this example stinks. Why? Because neither property is selective on its own: names are shared by many people. You'll have a lot of nodes indexed on JOHN for the first name. Likewise you'll have a lot of nodes indexed on SMITH for the last name. A lookup on either index still leaves a cluster of candidates that has to be filtered on the other property.
Now if you want to index the user's full name, then concatenate, forming JOHN SMITH. You can then set a property of person as person.full_name. While it is redundant, it allows you to do the following:
Create
CREATE INDEX ON :Person(full_name)
Match
MATCH (n:Person)
USING INDEX n:Person(full_name)
WHERE n.full_name = 'JOHN SMITH'
RETURN n
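The redundant property only helps if it is built the same way every time it is written and every time it is queried. As a minimal sketch of that idea (the helper name make_full_name and the "trim, upper-case, join with one space" rule are assumptions for illustration, not something Neo4j prescribes), with a plain dict standing in for the index:

```python
def make_full_name(first_name, last_name):
    """Build the value stored in the redundant full_name property.

    Assumption: names are trimmed, upper-cased and joined with a single
    space, matching the 'JOHN SMITH' form used in the query above.
    """
    parts = [first_name.strip(), last_name.strip()]
    return " ".join(p.upper() for p in parts if p)


# Made-up sample people; the dict plays the role of the index:
# one key maps to the set of people stored under it.
people = [
    {"id": 1, "first_name": "John", "last_name": "Smith"},
    {"id": 2, "first_name": "John", "last_name": "Doe"},
    {"id": 3, "first_name": "Jane", "last_name": "Smith"},
]
full_name_index = {}
for person in people:
    key = make_full_name(person["first_name"], person["last_name"])
    full_name_index.setdefault(key, set()).add(person["id"])
```

Here the key JOHN SMITH leads to a single person, whereas JOHN or SMITH alone would each lead to two.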
You can always refer to http://docs.neo4j.org/refcard/2.0/ for more tips and guidelines.
Cheers,
Kenny
As of 3.2, Neo4j supports composite indexes. For your example:
CREATE INDEX ON :Person(first_name, last_name)
You can read more on composite indexes here.
I am loading simple CSV data into Neo4j. The data looks like this:
uniqueId compound value category
ACT12_M_609 mesulfen 21 carbon
ACT12_M_609 MNAF 23 carbon
ACT12_M_609 nifluridide 20 suphate
ACT12_M_609 sulfur 23 carbon
I am loading the data from the URL using the following query -
LOAD CSV WITH HEADERS
FROM "url"
AS row
MERGE( t: Transaction { transactionId: row.uniqueId })
MERGE(c:Compound {name: row.compound})
MERGE (t)-[r:CONTAINS]->(c)
ON CREATE SET c.category= row.category
ON CREATE SET r.price =row.value
Next I do the aggregation to count total orders for a compound and create property for a node in the following way -
MATCH (c:Compound) <-[:CONTAINS]- (t:Transaction)
WITH c, count(DISTINCT t.transactionId) AS ord
SET c.orders = ord
So far so good. I can accomplish what I want but I have the following 2 questions -
How can I create the orders property for the compound node in the first step itself? I.e. when I am loading the data, I would like to perform the aggregation straight away.
For a compound node I am also setting the category property. Theoretically, it could also be modelled as category -contains-> compound by creating a Category node. But what advantage would that give me? I can already execute the queries and get the expected output without creating this additional node.
Thank you for your answer.
I don't think that's possible. LOAD CSV goes over one row at a time, so at row 1 it doesn't know how many more rows will follow.
I guess you could create virtual nodes and relationships, aggregate those and then use them to create the real nodes, but that would be far more complicated (see Virtual Nodes/Rels).
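If the file can be processed before it reaches Neo4j, another option (not part of the approach above, just a sketch) is to compute the per-compound order count client-side and load it as an extra column or in a second pass. The example assumes a comma-separated file with the column names from the question; the function name count_orders and the second transaction ID ACT12_M_610 are made up for illustration.

```python
import csv
import io
from collections import defaultdict


def count_orders(csv_text):
    """Return {compound: number of distinct uniqueId values} for a CSV string."""
    transactions = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        # A set per compound removes duplicate transaction IDs,
        # like count(DISTINCT t.transactionId) in the Cypher query.
        transactions[row["compound"]].add(row["uniqueId"])
    return {compound: len(ids) for compound, ids in transactions.items()}


# Sample data in the shape of the question; ACT12_M_610 is invented.
sample = """uniqueId,compound,value,category
ACT12_M_609,mesulfen,21,carbon
ACT12_M_609,MNAF,23,carbon
ACT12_M_610,mesulfen,19,carbon
"""
orders = count_orders(sample)
```

This needs the whole file in one pass up front, which is exactly the knowledge LOAD CSV does not have while it is still on row 1.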
That depends on the questions/queries you want to ask.
A graph database is optimised for following relationships, so if you often run queries where the category is a criterion (e.g. MATCH (c: Category {category_id: 12})-[r]-(:Compound) ), it might be more performant to model the category as its own node with its own label.
If you just want to get the category in the results (e.g. RETURN compound.category), then it's fine as a property.
I'm using neo4j 3.5, and have about 9 million user nodes. I was trying to implement the following query, but it was taking way too long:
MATCH (users:User) WHERE (users.username CONTAINS "joe" OR users.first_name CONTAINS "joe" OR users.last_name CONTAINS "joe")
RETURN users
LIMIT 30
I was hoping to take advantage of neo4j 3.5's new fulltext indexing feature by creating the following index:
CALL db.index.fulltext.createNodeIndex('users', ['User'], ['username', 'first_name', 'last_name'])
and then querying the db like so
CALL db.index.fulltext.queryNodes('users', 'joe')
YIELD node
RETURN node.user_id
I thought this would work the same as CONTAINS and return users whose username, first_name or last_name contains joe (e.g. myjoe12, joe12, 12joe, 44joeseph). Instead it seems to return only users whose fields are exactly joe or contain joe as a whitespace-separated word (e.g. Joe B, Joe y1). I tried using joe* in the query, but that only returns values starting with joe. I want to return everything containing joe, or whatever the search term is. What would be the best way to go about this?
Speed issue / Index:
As far as I know, Neo4j has an optimised index for STARTS WITH and ENDS WITH only on non-composite indexes.
If I read this docs paragraph correctly, my conclusion is this: your 9 million users will be searched one by one because Neo4j doesn't use any index for your query, which is what makes the query really slow.
An answer to your question:
I want to return everything containing Joe or whatever search term.
You are probably looking for a regex search (this is also slow, does not use an index, and is not recommended):
Example query based on your query:
MATCH (users:User)
WHERE (users.username =~ "(?i).*joe.*" OR users.first_name =~ "(?i).*joe.*" OR users.last_name =~ "(?i).*joe.*")
RETURN users
LIMIT 30
Explanation: (?i) means case-insensitive, so Joe or joe will be matched. See the regex operator docs and the regex WHERE docs.
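To see which values the pattern accepts without touching the database, here is a small Python sketch. It relies on one assumption worth stating: Cypher's =~ has to match the whole string, which is why the pattern is wrapped in .* on both sides, and Python's re.fullmatch behaves the same way. The sample usernames come from the question, plus two made-up non-matches.

```python
import re

# Same pattern as in the Cypher query; fullmatch requires the whole
# string to match, mirroring the behaviour of Cypher's =~ operator.
PATTERN = re.compile(r"(?i).*joe.*")


def contains_joe(value):
    """Return True if value contains 'joe' in any letter case."""
    return PATTERN.fullmatch(value) is not None


usernames = ["myjoe12", "joe12", "12joe", "44joeseph", "Joe B", "jo e", "bob"]
matches = [name for name in usernames if contains_joe(name)]
```

Every value with joe anywhere inside it passes, regardless of case, while "jo e" and "bob" do not.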
For the fulltext schema index, it looks like you'll need to use the fuzzy search operator ~ in your query, though you may need to do some filtering on the score to make sure you're looking at relevant results:
CALL db.index.fulltext.queryNodes('users', 'joe~')
YIELD node, score
WHERE score > .8
RETURN node.user_id
Basically my question is: how do I sum relationship properties where the related nodes have properties equal to Value A and Value B?
For example:
I have a simple DB with the following relationships:
(site)-[:HAS_MEMBER]->(user)-[:POSTED]->(status)-[:TAGGED_WITH]->(tag)
On [:TAGGED_WITH] I have a property called "TimeSpent". I can easily SUM up all the time spent for a particular day and user by using the following query:
MATCH (user)-[:POSTED]->(updates)-[r:TAGGED_WITH]->(tags)
WHERE user.name = "Josh Barker" AND updates.date = 20141120
RETURN tags.name, SUM(r.TimeSpent) as totalTimeSpent;
This returns to me a nice table with tags and associated time spent on each. (i.e. #Meeting 4.5). However, the question arises if I want to do some advanced searches and say "Show me all the meetings for ProjectA" (i.e. #Meeting #ProjectA). Basically, I am looking for a query that I can get all of the relationships where a single status has BOTH tags (and only if it has both). Then I can SUM that number up to get a count for how many meetings I spent in #ProjectA.
How do I do this?
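MATCH (updates)-[r1:TAGGED_WITH]->(tag1 {name: 'Meeting'}),
(updates)-[r2:TAGGED_WITH]->(tag2 {name: 'ProjectA'})
RETURN SUM(r1.TimeSpent + r2.TimeSpent) as totalTimeSpent, count(updates);
This should find all updates tagged with both of those things, and sum the time spent on both tags across all of those updates. The two relationships need distinct variables (r1 and r2), because a single r cannot be bound to two different relationships in one MATCH.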
To create a generic solution where you may want one or more tags, you could use something like this, passing in the array of tags as a parameter (and using the length of the array instead of the hard-coded 2).
MATCH (user)-[:POSTED]->(update)-[r:TAGGED_WITH]->(tag)
WHERE user.name = "Josh Barker" AND update.date = 20141120 AND tag.name IN ['Meeting', 'ProjectA']
WITH update, SUM(r.TimeSpent) AS totalTimeSpent, COLLECT(tag) AS tags
WHERE LENGTH(tags) = 2
RETURN update, totalTimeSpent
As long as tag.name is indexed, this should be fast.
Edit - Remove User constraint
MATCH (update)-[r:TAGGED_WITH]->(tag)
WHERE tag.name IN ['Meeting', 'ProjectA']
WITH update, SUM(r.TimeSpent) AS totalTimeSpent, COLLECT(tag) AS tags
WHERE LENGTH(tags) = 2
RETURN update, totalTimeSpent
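The group-then-filter logic of this query can be checked outside the database. Below is a rough Python sketch of the same steps, assuming each TAGGED_WITH relationship is represented as an (update_id, tag_name, time_spent) tuple; the function name updates_with_all_tags and the sample rows are made up for illustration. It uses a set where the query uses COLLECT, so a tag attached twice to one update is only counted once in the filter.

```python
from collections import defaultdict


def updates_with_all_tags(tagged, wanted):
    """Return {update_id: total_time} for updates carrying every wanted tag.

    tagged: iterable of (update_id, tag_name, time_spent), one per TAGGED_WITH.
    """
    wanted = set(wanted)
    totals = defaultdict(float)
    seen = defaultdict(set)
    for update_id, tag_name, time_spent in tagged:
        if tag_name in wanted:                 # WHERE tag.name IN [...]
            totals[update_id] += time_spent    # SUM(r.TimeSpent)
            seen[update_id].add(tag_name)      # COLLECT(tag)
    # Keep only updates that matched as many tags as were asked for.
    return {u: totals[u] for u in totals if len(seen[u]) == len(wanted)}


rows = [
    ("u1", "Meeting", 1.5), ("u1", "ProjectA", 1.5),
    ("u2", "Meeting", 2.0),
    ("u3", "ProjectA", 0.5), ("u3", "Lunch", 1.0),
]
result = updates_with_all_tags(rows, ["Meeting", "ProjectA"])
```

Only u1 carries both tags, so it is the only update returned. Note that its total adds the TimeSpent of both relationships, just as SUM(r.TimeSpent) does in the query.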
I have just started working with py2neo and neo4j.
I am confused about how to go about using indices in my database.
I have created a create_user function:
from py2neo import neo4j, node

g = neo4j.GraphDatabaseService()
# Legacy (manually maintained) index holding user nodes
users_index = g.get_or_create_index(neo4j.Node, "Users")

def create_user(name, username, **kwargs):
    batch = neo4j.WriteBatch(g)
    user = batch.create(node({"name": name, "username": username}))
    # Copy any extra keyword arguments onto the node as properties
    for key, value in kwargs.items():
        batch.set_property(user, key, value)
    batch.add_labels(user, "User")
    batch.get_or_add_to_index(neo4j.Node, users_index, "username", username, user)
    results = batch.submit()
    print("Created: " + username)
Now to obtain users by their username:
def lookup_user(username):
    print(node(users_index.get("username", username)[0]))
I saw the Schema class and noticed that I can create an index on the "User" label, but I couldn't figure out how to obtain the index and add entities to it.
I want it to be as efficient as possible, so would adding the index on the "User" label add to performance, in case I were to add more nodes with different labels later on? Is it already the most efficient it can be?
Also, if I would want my username system to be unique per user, how would I be able to do that? How do I know whether the batch.get_or_add_to_index is getting or adding the entity?
Your confusion is understandable. There are actually two types of indexes in Neo4j - the Legacy Indexes (which you access with the get_or_create_index method) and the new Indexes (which deal with indexing based on labels).
The new Indexes do not need to be manually kept up to date, they keep themselves in sync as you make changes to the graph, and are automatically used when you issue cypher queries against that label/property pair.
The reason the legacy indexes are kept around is that they support some complex functionality that is not yet available for the new indexes - such as geospatial indexing, full text indexing and composite indexing.
I realise this may not be ideal usage, but apart from all the graphy goodness of Neo4j, I'd like to show a collection of nodes, say, People, in a tabular format that has indexed properties for sorting and filtering
I'm guessing the Type of a node can be stored as a Link, say Bob -> type -> Person, which would allow us to retrieve all People
Are the following possible to do efficiently (indexed?) and in a scalable manner?
Retrieve all People nodes and display all of their names, ages, cities of birth, etc. (NOTE: some of this data will be properties, some Links to other nodes, which could be denormalised as properties for simplicity when displaying the table)
Show me all People sorted by Age
Show me all People with Age < 30
Also, a quick how-to for the above (or a link to some place in the docs describing how) would be lovely
Thanks very much!
Oh and if the above isn't a good idea, please suggest a storage solution which allows both graph-like retrieval and relational-like retrieval
If you want to operate on these person nodes, you can put them into an index (the default is Lucene) and then retrieve and sort the nodes using Lucene (see, for instance, "How do I sort Lucene results by field value using a HitCollector?" for how to do a custom sort in Java). This will get you, for instance, People sorted by Age. The code in Neo4j could look like
Transaction tx = neo4j.beginTx();
idxManager = neo4j.index()
personIndex = idxManager.forNodes('persons')
personIndex.add(meNode,'name',meNode.getProperty('name'))
personIndex.add(youNode,'name',youNode.getProperty('name'))
tx.success()
tx.finish()
// Prepare a custom Lucene query context with the Neo4j API
query = new QueryContext( 'name:*' ).sort( new Sort(new SortField( 'name',SortField.STRING, true ) ) )
results = personIndex.query( query )
For combining index lookups and graph traversals, Cypher is a good choice, e.g.
START people = node:people_index("name:E*") MATCH people-[r]->() return people.name, r.age order by r.age asc
in order to return data on both the node and the relationships.
Sure, that's easily possible with the Neo4j query language Cypher.
For example:
start cat=node:Types(name='Person')
match cat<-[:IS_A]-person-[born:BORN]->city
where person.age > 30
return person.name, person.age, born.date, city.name
order by person.age asc
limit 10
You can experiment with it in our cypher console.