I am used to having tables for ongoing activities from my former life as a relational DB guy, and I am wondering how I would store ongoing information like transactions, logs or whatever in a Neo4j DB. Let's assume I have an account which is assigned to a user A:
(u:User {name:"A"})
I want to keep track of the transactions he does, e.g. deducting or adding a value:
(t:Transaction {value: -20, date: timestamp()})
Would I create a new node for every transaction and assign it to the user:
(u) -[r:changeBalance]-> (t)
In the end I might have lots of nodes assigned to the user, each holding a single piece of information. I was pondering whether a query that limits to the last 50 transactions (LIMIT 50, sorted by t.date) would still have to read all available transaction nodes to build the full sort order before the limit applies - that seems rather inefficient.
How would you model a list of actions in a Neo4j DB? Any hint is much appreciated.
If you used a simple query like the following, you would NOT be reading all Transaction nodes per User.
MATCH (u:User)-[r:ChangeBalance]->(t:Transaction)
RETURN u, t
ORDER BY t.date;
You'd only be reading the Transaction nodes that are directly related to each User (via a ChangeBalance relationship). So, the performance would not be as bad as you are afraid it might be.
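For the "last 50" case from the question, a minimal sketch of a per-user variant (using the name property and ChangeBalance relationship from above) would be:
MATCH (u:User {name: "A"})-[:ChangeBalance]->(t:Transaction)
RETURN t
ORDER BY t.date DESC
LIMIT 50;
Note that the ORDER BY still has to sort all of this user's transactions before the LIMIT applies, which is the concern the next answer addresses.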
Although your query is fine as it is - you are only reading transactions that are related to this specific user - the approach can be improved.
Let's imagine that, for some reason, your application runs for 5 years and has a user with 10 transactions per day. That results in ~18,250 transactions connected to a single node.
This is not a great idea from a data model perspective. If you then want to filter the result (say, get the 50 latest transactions) on some non-indexed field, it will require a full traversal of all 18,250 nodes.
This can be solved by adding additional relationships to the database.
Currently you have this graph: (user)-[:HAS]->(transaction)
( user )
/ | \
(transaction1) (transaction2) (transaction3)
You can add an additional relationship between transactions to specify the sequence of events.
Like this: (transaction)-[:NEXT]->(transaction)
( user )
/ | \
(transaction1)-(transaction2)-(transaction3)
Note: there is no need for an additional PREVIOUS relationship, because Neo4j stores relationship pointers in both directions, so traversing backwards is just as fast as traversing forward.
And maintain relationships to the user's first and last transactions:
(user)-[:LAST_TRANSACTION]->(transaction)
(user)-[:FIRST_TRANSACTION]->(transaction)
This allows you to get the last transaction in 1 hop, and the latest 50 with 50 additional hops.
So, by adding some complexity, you can traverse and manipulate your data in more efficient ways.
This idea comes from the EventStore database (and systems similar to it).
Moreover, with such a data model the user balance can be aggregated by folding over the sequence of transactions. This gives you a nice and fast way to get the user balance at any point in time.
Getting the latest 50 transactions in this model can look like this:
MATCH (user:User {id: 1}) WITH user
MATCH (user)-[:LAST_TRANSACTION]->(last_transaction:Transaction) WITH last_transaction
MATCH (last_transaction)<-[:NEXT*0..49]-(transactions:Transaction)
RETURN transactions
Getting the total user balance can look like this:
MATCH (user:User {id: 1}) WITH user
MATCH (user)-[:FIRST_TRANSACTION]->(first_transaction:Transaction) WITH first_transaction
MATCH (first_transaction)-[:NEXT*0..]->(transactions:Transaction)
RETURN sum(transactions.value)
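For completeness, appending a new transaction in this model could be sketched as follows (illustrative values; the very first transaction for a user would also need FIRST_TRANSACTION and an initial LAST_TRANSACTION created):
MATCH (user:User {id: 1})-[old:LAST_TRANSACTION]->(last:Transaction)
CREATE (tx:Transaction {value: -20, date: timestamp()})
CREATE (last)-[:NEXT]->(tx)
CREATE (user)-[:LAST_TRANSACTION]->(tx)
DELETE old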
I want to create a relationship between 2 nodes where there are only a few unique pairs and all the others are repeated. Probably because there are only a few unique nodes, the import tool is not able to create the relationships; when I instead run the relationship-creation query in a shell, it takes too long. How can I optimize this query with some kind of filtering for uniqueness?
MATCH (a:Applications), (sms:Sms {id: a.application_id})
MERGE (a)-[r:APP_SMS]->(sms)
RETURN distinct a.application_id, sms.id
So far I have only found a way to use DISTINCT in the RETURN part of the query.
I have executed the same query with PROFILE and LIMIT 25 to see the query plan and results:
10382692 total db hits in 3219 ms
As stdob-- suggested, you need to create an index on :Sms(id) so this becomes a cheap NodeByIndexSeek instead of NodeByLabelScan for the sms node.
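For reference, that index can be created like this (the first form is the Neo4j 3.x syntax, the second the 4.x+ form):
CREATE INDEX ON :Sms(id);
CREATE INDEX sms_id FOR (s:Sms) ON (s.id);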
That you only have 25 rows after the DISTINCT operation is a little concerning, as id fields tend to suggest uniqueness for nodes with a given label, but that doesn't seem to be the case here. For the nodes which have the same id property, are these duplicate nodes with identical properties, or do the properties aside from id differ? If you have duplicate nodes in the db, that suggests a modeling problem.
EDIT
Per the comments, you added a LIMIT 25 to the query, so the 25 DISTINCT results make sense.
There shouldn't be a duplication issue here, unless id isn't unique across :Sms nodes.
Not sure if there's much that can be optimized. You could try batching the MERGE of the relationship using apoc.periodic.iterate(), but you should do that without parallelization to avoid locking issues.
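A sketch of that batched MERGE, assuming the APOC plugin is installed (the batch size is illustrative; parallel: false avoids lock contention on the sms nodes):
CALL apoc.periodic.iterate(
  "MATCH (a:Applications) RETURN a",
  "MATCH (sms:Sms {id: a.application_id}) MERGE (a)-[:APP_SMS]->(sms)",
  {batchSize: 10000, parallel: false});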
Let's start with a simple query that finds all coworkers recursively.
match (user:User {username: 'John'})
match (user)-[:WORKED_WITH*1..]-(coworkers:User)
return user, coworkers
Now, I have to modify it in order to receive only those users that are connected by the first N relationships.
Every User node has a value N among its properties, and every relationship has its creation date among its properties.
I suppose that it could be reasonable to create and maintain a separate set of relationships that satisfy this condition.
UPD: The limitation has to be applied only to those who know each other directly.
The limitation has to be applied to each node in the path. E.g., if the first user has 3 :WORKED_WITH relationships (on the first level) and a limitation of 5, everything is OK and we can continue to check the connected users; if a user has 6 relationships and a limitation of 5, only 5 of the relationships may be used to move on.
I understand that this can be a slow query, but how can it be done without hand-written tools? One improvement would be to move all those limitations out of query execution into some preprocessing step and create an additional type of relationship that holds all of those limitations; this would require validation, because the relationships are not part of the state but a projection of the state.
The following query should work (as long as you do not have a lot of data). It uses DISTINCT to remove duplicates.
MATCH (user:User {username: 'John'})-[:WORKED_WITH*]-(coworker:User)
WITH DISTINCT user, coworker
ORDER BY coworker.createDate
RETURN COLLECT(coworker)[0..user.N] AS coworkers;
Note: since variable-length paths have exponential complexity, you would usually want to specify a reasonable upper bound (e.g., [:WORKED_WITH*..5]) to avoid the query running too long or causing an out-of-memory error. Also, since the LIMIT operator does not accept a variable as its argument, this query uses COLLECT(coworker)[0..user.N] to get the N coworkers with the earliest createDate -- which is also a bit expensive.
Now, if (as you suggested) you had created a specific type of relationship (e.g., FIRST_WORKED_WITH) between each User and its N earliest "coworkers", that would allow you to use the following trivial and fast query:
MATCH (user:User {username: 'John'})-[:FIRST_WORKED_WITH]->(coworker:User)
RETURN coworker;
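A hypothetical preprocessing query to materialize those FIRST_WORKED_WITH relationships (assuming N and createDate are stored as described in the question) might look like:
MATCH (u:User)-[:WORKED_WITH]-(c:User)
WITH DISTINCT u, c
ORDER BY c.createDate
WITH u, COLLECT(c)[0..u.N] AS earliest
UNWIND earliest AS c
MERGE (u)-[:FIRST_WORKED_WITH]->(c);
It would have to be re-run (or maintained incrementally) whenever WORKED_WITH relationships change, since it is a projection of the state rather than part of it.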
I am trying to write a query which looks for potential friends in a Neo4j db based on common friends and interests.
I don't want to post the whole query (it's part of a school assignment), but this is the important part:
MATCH (me:User {firstname: "Name"}), (me)-[:FRIEND]->(friend:User)<-[:FRIEND]-(potential:User), (me)-[:MEMBER]->(i:Interest)
WHERE NOT (potential)-[:FRIEND]->(me)
WITH COLLECT(DISTINCT potential) AS potentialFriends,
COLLECT(DISTINCT friend) AS friends,
COLLECT(i) as interests
UNWIND potentialFriends AS potential
/*
#HANDLING_FINDINGS
Here I count common friends, interests and try to find relationships between
potential friends too -- hence the collect/unwind
*/
RETURN potential,
commonFriends,
commonInterests,
(commonFriends+commonInterests) as totalPotential
ORDER BY totalPotential DESC
LIMIT 10
In the #HANDLING_FINDINGS section I use the found potential friends to find relationships between each other and calculate their potential (i.e. the sum of shared friends and common interests), and then order them by potential.
The problem is that there might be users with no friends, and I would like to recommend friends to them as well.
My question - can I somehow insert a few random users into the "potential" findings if their count is below 10, so that everyone gets a recommendation?
I have tried something like this
...
UNWIND potentialFriends AS potential
CASE
WHEN (count(potential) < 10 )
...
But that produced an error as soon as it hit the start of the CASE. I think CASE can only be used as part of a clause like RETURN? (maybe just RETURN)
Edit with 2nd related question:
I was already thinking of matching all users and then ranking them based on common friends/interests, but wouldn't searching through the whole DB be intensive?
A CASE expression can be used wherever a value is needed, but it cannot be used as a complete clause.
With respect to your main question, you can put a WITH clause like the following between your existing WITH and UNWIND clauses:
WITH friends, interests,
CASE WHEN SIZE(potentialFriends) < 10 THEN {randomFriends} ELSE potentialFriends END AS potentialFriends
If the size of the potentialFriends collection is less than 10, the CASE expression assigns the value of the {randomFriends} parameter to potentialFriends.
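In context, the middle of the query would then look roughly like this ({randomFriends} being a parameter you pass in with a pre-selected list of fallback users):
WITH COLLECT(DISTINCT potential) AS potentialFriends,
     COLLECT(DISTINCT friend) AS friends,
     COLLECT(i) AS interests
WITH friends, interests,
     CASE WHEN SIZE(potentialFriends) < 10 THEN {randomFriends} ELSE potentialFriends END AS potentialFriends
UNWIND potentialFriends AS potential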
As for your second question, yes it would be expensive.
I have been reading about graph databases and want to know if this type of structure is a good fit for them:
Company > Has user Accounts > Accounts send out facebook posts (which are available to all users)
Up to here - I think this makes sense - yes, it would be a good use of a graph. A post has a relationship to the accounts involved, and you can query in both directions - the posts for a company, and which posts were sent by which users or companies.
However
Users get added and deleted on a daily basis and I need a record of how many there were at a given time
Accounts are getting results for each post (likes/friends) which I need to store on a daily basis
I need to find out how many likes a company received (on any given day)
I also need to find out how many likes a user received
I need to find out how many likes a user received per post
You would need to store Likes as a group and then date-value - can you even have "sub" properties?
I struggle at this point, unless you are storing lots of date-value property lists per node. Is that the way you would do it? If I wanted to find out the latter 2 points, for example, would it be as efficient as an RDBMS?
Here is a very simple example of a Graph data model that seems to cover your stated use cases. (Since nodes can have multiple labels, all Company and User nodes are also Entity nodes -- to simplify the model.)
(:Company:Entity {id:100})-[:HAS_USER]->(:User:Entity {id: 200})
(:Entity)-[:SENT]->(:Post {date: 123, msg: "I like cats!"})
(:Entity)-[:LIKES {date: 234}]->(:Post)
Your use cases:
Users get added and deleted on a daily basis and I need a record of how many there were at a given time.
How to count all users:
MATCH (u:User)
RETURN COUNT(*);
How to count a company's users:
MATCH (c:Company {id:100})-[:HAS_USER]->(u:User)
RETURN COUNT(*);
I need to find out how many likes a company received (on any given day)
MATCH (c:Company {id: 100})-[:SENT]->(p:Post)<-[:LIKES {date:234}]-()
RETURN COUNT(*)
I also need to find out how many likes a user received
MATCH (u:User {id:200})-[:SENT]->(p:Post)<-[:LIKES]-()
RETURN COUNT(*);
I need to find out how many likes a user received per post
MATCH (u:User {id:200})-[:SENT]->(p:Post)<-[:LIKES]-()
RETURN p, COUNT(*)
You would need to store Likes as a group and then date-value - can you even have "sub" properties?
You do not need to explicitly group likes by date (if that is what you mean). Such "groupings" can be easily obtained by the appropriate query (e.g., in #2 above).
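For example, a sketch of use case #2 grouped per day (a small variation of the query above) could be:
MATCH (c:Company {id: 100})-[:SENT]->(:Post)<-[l:LIKES]-()
RETURN l.date AS date, COUNT(*) AS likes
ORDER BY date;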
neo4j noob here, on Neo4j 2.0.0 Community
I've got a graph database of 24,000 movies and 2700 users, and somewhere around 60,000 LIKE relationships between a user and a movie.
Let's say that I've got a specific movie (movie1) in mind.
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)
RETURN usersLikingMovie1;
I can quickly and easily find the users who liked the movie with the above query. I can follow this path further to get the users who liked the same movies as the people who liked movie1. I call these generation 2 users:
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)
RETURN usersGen2;
This query takes about 3 seconds and returns 1896 users.
Now I take this query one step further to get the movies liked by the users above (generation 2 movies)
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)-[:LIKES]->(moviesGen2)
RETURN moviesGen2;
This query causes Neo4j to spin for several minutes at 100% CPU utilization, using 4 GB of RAM. Then it throws the exception "OutOfMemoryError: GC overhead limit exceeded".
I was hoping someone could help me out and explain to me the issue.
Is Neo4j not meant to handle a query of this depth in a performant manner?
Is there something wrong with my Cypher query?
Thanks for taking the time to read.
That's a pretty intense query, and the deeper you go, the closer you're probably getting to the set of all users that ever rated any movie, since you're essentially just expanding out through the graph in tree form, starting with your given movie. @Huston's WHERE and DISTINCT clauses will help to prune branches you've already seen, but you're still just expanding out through the tree.
The branching factor of your tree can be estimated with two values:
u, the average number of users that liked a movie (incoming to :Movie)
m, the average number of movies that each user liked (outgoing from :User)
For an estimate, your first step will return u users. On the next step, for each user you get all the movies each of them liked, followed by all the users that liked all of those movies:
gen(1) => u
gen(2) => u * (m * u)
For each generation you'll tack on another m*u, so your third generation is:
gen(3) => u * (m * u) * (m * u)
Or more generically:
gen(n) => u^n * m^(n-1)
You could estimate your branching factors by computing the average of your likes/users and likes/movie, but that's probably very inaccurate since it gives you 22.2 likes/user and 2.5 likes/movie. Those numbers aren't reasonable for any movie that's worthy of rating. A better approach would be to take the median number of ratings or look at a histogram and use the peaks as your branching factors.
To put this in perspective, the average Netflix user rated 200 movies. The Netflix Prize training set had 17,770 movies, 480,189 users, and 100,480,507 ratings. That's 209 ratings/user and 5654 ratings/movie.
To keep things simple (and assuming your data set is much smaller), let's use:
m = 20 movie ratings/user
u = 100 users have rated/movie
Your query in gen-3 (without distincts) will return:
gen(3) = 100^3 * 20^2
= 400,000,000
400 million nodes (users)
Since you only have 2700 users, I think it's safe to say your query probably returns every user in your data set (rather, 148 thousand-ish copies of each user).
Your movie nodes in ASCII -- (n:Movie {movieid:"88cacfca-3def-4b2c-acb2-8e7f4f28be04"}) -- are 58 bytes minimum. If your users are about the same, let's say each node is 60 bytes; your storage requirement for this resultant set is:
400,000,000 nodes * 60 bytes
= 24,000,000,000 bytes
= 23,437,500 KB
= 22,888 MB
= 22.35 GB
So by my conservative estimates, your query requires 22 gigabytes of storage. It seems quite reasonable that Neo4j would run out of memory.
My guess is that you're trying to find similarities in the patterns of users, but the query you're using returns all the users in your dataset, duplicated a bunch of times. Maybe you want to be asking questions of your data more like:
what users rate movies most like me?
what users rated most of the same movies as I rated?
what movies, which I haven't watched yet, have been watched by users who rated similar movies to me?
Cheers,
cm
To minimize the explosion that @cod3monk3y talks about, I'd limit the number of intermediate results.
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)
WITH distinct moviesGen1
MATCH (moviesGen1)<-[:LIKES]-(usersGen2)-[:LIKES]->(moviesGen2)
RETURN moviesGen2;
or even like this
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)
WITH distinct moviesGen1
MATCH (moviesGen1)<-[:LIKES]-(usersGen2)
WITH distinct usersGen2
MATCH (usersGen2)-[:LIKES]->(moviesGen2)
RETURN distinct moviesGen2;
If you want to, you can prefix the queries with PROFILE ("profile start ...") in the neo4j-shell to see how many hits / db rows each one creates in between; try it with your original query and then with these two.
Cypher is a pattern matching language, and it is important to remember that the MATCH clause will always find a pattern everywhere it exists in the Graph.
The problem with the MATCH clause you are using is that Cypher will sometimes find patterns where 'usersGen2' is the same as 'usersLikingMovie1', and where 'movie1' is the same as 'moviesGen1', across different patterns. So, in essence, Cypher finds the pattern every single time it exists in the graph, holds all of those paths in memory for the duration of the query, and then returns all the moviesGen2 nodes -- which could actually be the same node n times.
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)
If you explicitly tell Cypher that the movies and users should be different in each matched pattern, it should solve the issue. Try this? Additionally, the DISTINCT keyword will make sure you only grab each 'moviesGen2' node once.
START movie1=node:Movie("MovieId:88cacfca-3def-4b2c-acb2-8e7f4f28be04")
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)-[:LIKES]->(moviesGen2)
WHERE movie1 <> moviesGen2 AND usersLikingMovie1 <> usersGen2
RETURN DISTINCT moviesGen2;
Additionally, in 2.0 the START clause is not required, so you can actually leave it out altogether (however, only if you are NOT using a legacy index and are using labels instead)...
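For example, assuming the movie id is stored as a MovieId property on labeled :Movie nodes (with a schema index on it), the START clause could be replaced by a MATCH:
MATCH (movie1:Movie {MovieId: "88cacfca-3def-4b2c-acb2-8e7f4f28be04"})
MATCH (movie1)<-[:LIKES]-(usersLikingMovie1)-[:LIKES]->(moviesGen1)<-[:LIKES]-(usersGen2)-[:LIKES]->(moviesGen2)
WHERE movie1 <> moviesGen2 AND usersLikingMovie1 <> usersGen2
RETURN DISTINCT moviesGen2;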
Hope this works... Please correct my answer if there are syntax errors...