How to store quantitative and time series data in Neo4j - neo4j

Referring to this use case, I have a set of similar data that contain people names and their ownership of companies. However, the ownership is identified by percentages and the numbers change frequently. So I cannot simply record "Tom own X company" but I need "Tom own 50% of X company from 1 Jan 2015 to 6 Jun 2015, 55% from 7 Jun 2015 to 10 Oct 2016"
What is the best way to model the data? Ultimately, is Neo4J a good tool for this type of data?

I think you might want to consider :Ownership nodes, which represent an owner of a percentage of a company valid during a range in time. Your timestamps will likely need to be indexed.
This can let you perform queries such as:
// find the percentage of ownership in a company at a point of time
WITH {params.instant} as instant
MATCH (p:Person{name:'Bob Barker'})-[:Owns]->(o:Ownership)-[:Of]->(:Company{name:'KrispyKreme'})
WHERE o.start <= instant <= o.end
RETURN o.percentage
// find the max percentage a user has ever owned of any company
MATCH (p:Person{name:'Bob Barker'})-[:Owns]->(o:Ownership)
ORDER BY o.percentage DESC
LIMIT 1
WITH o
MATCH (o)-[:OF]->(c:Company)
RETURN c, o.percentage
// find who owned what percentages of a company at a point in time
WITH {params.instant} as instant
MATCH (o:Ownership)-[:Of]->(c:Company{name:'KripsyKreme'})
WHERE o.start <= instant <= o.end
WITH o
MATCH (p:Person)-[:Owns]->(o)
RETURN p, o.percentage
The weakness of this model is that there are bound to be a great many ownership nodes, so queries on ownership from a company or companies may slow depending on how much data you have, but queries of individuals and their ownerships should be comparatively quick.

Related

NEO4j 3.0 retrieve data between certain period

I'm using NEO4J 3.0 and it seems that HAS function was removed.
Type of myrelationship is a date and I'm looking to retrieve all relation between two dates such as my property "a" is greater than certain value.
How can I test this using NEO4j
Thank you
[EDITED to add info from comments]
I have tried this:
MATCH p=(n:origin)-[r]->()
WHERE r>'2015-01'
RETURN AVG(r.amount) as totalamout;
I created relationship per date and each one has a property, amount, and I am looking to compute the average amount for certain period. As example, average amount since 2015-04.
To answer the issue raised by your first sentence: in neo4j 3.x, the HAS() function was replaced by EXISTS().
[UPDATE 1]
This version of your query should work:
MATCH p=(n:origin)-[r]->()
WHERE TYPE(r) > '2015-01'
RETURN AVG(r.amount) as totalamout;
However, it is a bad idea to give your relationships different types based on a date. It is better to just use a date property.
[UPDATE 2]
If you changed your data model to add a date property to your relationships (to which I will give the type FOO), then the following query will find the average amount, per p, of all the relationships whose date is after 2015-01 (assuming that all your dates follow the same strict YYYY-MM pattern):
MATCH p=(n:origin)-[r:FOO]->()
WHERE r.date > '2015-01'
RETURN p, AVG(r.amount) as avg_amout;

Write Cypher query to display temperature values till it reaches set temperature

I have about 200,000 rows of 24 hour data as follows:
I can use the query to create a room node with time, roomtemp, and set temp as properties. Moreover, I can also, define the relationship of each room with its corresponding temperatures.
Now, I need to find:
all rows that show an update/increase/decrease from initial temperature till set temperature for all rooms. e.g. based on above data, I need:
Here I have discarded 5th row data as 16 was repetitive and showed no update(increase or decrease) in temp value. The temperature values continued till it reached set temperature '18'.
I can manually create the temperature states by giving its values one by one, but I am unsure how to MERGE the above requirement into the graph using Cypher.
Can I utilize any other programming language to obtain same results using Neo4j in conjunction?
Do I have to utilize in-graph time-tree for this scenario? Can I retrieve my results without creating a time tree?
Filter temparature by room and date (which can also be a date-node)
Sort by time
Collect into a list
Filter by differences in two subsequent temperatues
Turn list into rows
Here is a query that does this:
MATCH (r:Room)<-[:TEMP]-(t:Temparature)
WHERE t.time STARTS WITH "2016-01-01"
AND t.temp < room.temp ADN t.temp > {initial}
WITH t ORDER by t.time ASC
WITH collect(t) temps
WITH [idx in range(0,size(temps)-2) WHERE temps[idx].temp <> temps[idx+1].temp | temps[idx] ] as filtered
UNWIND filtered as t
RETURN t;

Low performance of neo4j

I am server engineer in company that provide dating service.
Currently I am building a PoC for our new recommendation engine.
I try to use neo4j. But performance of this database does not meet our needs.
I have strong feeling that I am doing something wrong and neo4j can do much better.
So can someone give me an advice how to improve performance of my Cypher’s query or how to tune neo4j in right way?
I am using neo4j-enterprise-2.3.1 which is running on c4.4xlarge instance with Amazon Linux.
In our dataset each user can have 4 types of relationships with others users - LIKE, DISLIKE, BLOCK and MATCH.
Also he has a properties like countryCode, birthday and gender.
I made import of all our users and relationships from RDBMS to neo4j using neo4j-import tool.
So each user is a node with properties and each reference is a relationship.
The report from neo4j-import tool said that :
2 558 667 nodes,
1 674 714 539 properties and
1 664 532 288 relationships
were imported.
So it’s huge DB :-) In our case some nodes can have up to 30 000 outgoing relationships..
I made 3 indexes in neo4j :
Indexes
ON :User(userId) ONLINE
ON :User(countryCode) ONLINE
ON :User(birthday) ONLINE
Then I try to build online recommendation engine using this query :
MATCH (me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()<-[:LIKE | :MATCH]-(similar:User)
USING INDEX me:User(userId)
USING INDEX similar:User(birthday)
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
here is the execution plan for one of the user :
plan
When I executed this query for list of users I had the result :
count=2391, min=4565.128849, max=36257.170065, mean=13556.750555555178, stddev=2250.149335254768, median=13405.409811, p75=15361.353029999998, p95=17385.136478, p98=18040.900481, p99=18426.811424, p999=19506.149138, mean_rate=0.9957385490980866, m1=1.2148195797996817, m5=1.1418078036067119, m15=0.9928564378521962, rate_unit=events/second, duration_unit=milliseconds
So even the fastest is too slow for Real-time recommendations..
Can you tell me what I am doing wrong?
Thanks.
EDIT 1 : plan with the expanded boxes :
I built an unmanaged extension to see if I could do better than Cypher. You can grab it here => https://github.com/maxdemarzi/social_dna
This is a first shot, there are a couple of things we can do to speed things up. We can pre-calculate/save similar users, cache things here and there, and random other tricks. Give it a shot, let us know how it goes.
Regards,
Max
If I'm reading this right, it's finding all matches for users by userId and separately finding all matches for users by your various criteria. It's then finding all of the places that they come together.
Since you have a case where you're starting on the left with a single node, my guess is that we'd be better served by following the paths and then filtering what it gotten via relationship traversal.
Let's see how starting like this works for you:
MATCH
(me:User {userId: {source_user_id} })-[:LIKE | :MATCH]->()
<-[:LIKE | :MATCH]-(similar:User)
WITH similar
WHERE similar.birthday >= {target_age_gte} AND
similar.birthday <= {target_age_lte} AND
similar.countryCode = {target_country_code} AND
similar.gender = {source_gender}
WITH similar, count(*) as weight ORDER BY weight DESC
SKIP {skip_similar_person} LIMIT {limit_similar_person}
MATCH (similar)-[:LIKE | :MATCH]-(recommendation:User)
WITH recommendation, count(*) as sheWeight
WHERE recommendation.birthday >= {recommendation_age_gte} AND
recommendation.birthday <= {recommendation_age_lte} AND
recommendation.gender= {target_gender}
WITH recommendation, sheWeight ORDER BY sheWeight DESC
SKIP {skip_person} LIMIT {limit_person}
MATCH (me:User {userId: {source_user_id} })
WHERE NOT ((me)--(recommendation))
RETURN recommendation
[UPDATED]
One possible (and nonintuitive) cause of inefficiency in your query is that when you specify the similar:User(birthday) filter, Cypher uses an index seek with the :User(birthday) index (and additional tests for countryCode and gender) to find all possible DB matches for similar. Let's call that large set of similar nodes A.
Only after finding A does the query filter to see which of those nodes are actually connected to me, as specified by your MATCH pattern.
Now, if there are relatively few me to similar paths (as specified by the MATCH pattern, but without considering its WHERE clause) as compared to the size of A -- say, 2 or more orders of magnitude smaller -- then it might be faster to remove the :User label from similar (since I presume they are probably all going to be users anyway, in your data model), and also remove the USING INDEX similar:User(birthday) clause. In this case, not using the index for similar may actually be faster for you, since you will only be using the WHERE clause on a relatively small set of nodes.
The same considerations also apply to the recommendation node.
Of course, this all has to be verified by testing on your actual data.

Neo4j cyper performance on simple match

I have a very simple cypher which give me a poor performance.
I have approx. 2 million user and 60 book category with relation from user to category around 28 million.
When I do this cypher:
MATCH (u:User)-[read:READ]->(bc:BookCategory)
WHERE read.timestamp >= timestamp() - (1000*60*60*24*30)
RETURN distinct(bc.id);
It returns me 8.5k rows within 2 - 2.5 (First time) minutes
And when I do this cypher:
MATCH (u:User)-[read:READ]->(bc:BookCategory)
WHERE read.timestamp >= timestamp() - (1000*60*60*24*30)
RETURN u.id, u.email, read.timestamp;
It return 55k rows within 3 to 6 (First time) minutes.
I already have index on User id and email, but still I don't think this performance is acceptable. Any idea how can I improve this?
First of all, you can profile your query, to find what happens under the hood.
Currently looks like that query scans all nodes in database to complete query.
Reasons:
Neo4j support indexes only for '=' operation (or 'IN')
To complete query, it traverses all nodes, one by one, checking each node if it has valid timestamp
There is no straightforward way to deal with this problem.
You should look into creating proper graph structure, to deal with Time-specific queries more efficiently. There are several ways how to represent time in graph databases.
You can take look on graphaware/neo4j-timetree library.
Can you explain your model a bit?
Where are the books and the "reading"-Event in it?
Afaik all you want to know, which book categories have been recently read (in the last month)?
You could create a second type of relationship thats RECENTLY_READ which expires (is deleted) by a batch job it is older than 30 days. (That can be two simple cypher statements which create and delete those relationships).
WITH (1000*60*60*24*30) as month
MATCH (a:User)-[read:READ]->(b:BookCategory)
WHERE read.timestamp >= timestamp() - month
MERGE (a)-[rr:RECENTLY_READ]->(b)
WHERE coalesce(rr.timestamp,0) < read.timestamp
SET rr.timestamp = read.timestamp;
WITH (1000*60*60*24*30) as month
MATCH (a:User)-[rr:RECENTLY_READ]->(b:BookCategory)
WHERE rr.timestamp < timestamp() - month
DELETE rr;
There is another way to achieve what you exactly want to do here, but it's unfortunately not possible in Cypher.
With a relationship-index on timestamp on your read relationship you can run a Lucene-NumericRangeQuery in Neo4j's Java API.
But I wouldn't really recommend to go down this route.

specific query with cypher

I need help with specific query. I am using neo4j. My database consists of companies (nodes) and transactions between them(relationship). Each relationship(PAID) has properties:
amount- for amount of transaction
year - year of transaction
month - month of transaction
What I need, is to find all cycles in a graph, starting at node A. It must also be true that transaction occurred one after another.
So valid example would be A PIAD B in march, B PAID C in april, C PAID A in june.
So is there any way to get all cycles from node A, so that transactions occur in continuous order?
You may want to set up a sample graph at Neo4j console to share or at least tell more about what version of Neo4j you are using, but if you're using 2.0 and if you store year and month as long or integer, then maybe you could try something like
MATCH a-[ab:PAID]->b-[bc:PAID]->c-[ca:PAID]->a
WHERE (ab.year + ab.month) > (bc.year + bc.month) > (ca.year + ca.month)
RETURN a,b,c
EDIT:
Actually that was hasty, the additions won't work that way of course, but the structure should be ok. Maybe
WHERE ((ab.year > bc.year) or (ab.year = bc.year AND ab.month > bc.month))
AND ((bc.year > ca.year) OR (bc.year = ca.year AND bc.month > ca.month))
or
WHERE (ab.year * 12 + ab.month) > (bc.year * 12 + bc.month) > (ca.year * 12 + ca.month)
If you only use dates for this type of comparison, consider storing them as one property, perhaps as milliseconds since 'epoch' 1/1 -70 GMT. That makes comparisons very easy. But if you need to return and display dates frequently, then keeping them separate might make sense.
EDIT2:
I can't think of a way to build your condition of "r1.date < r2.date" into the pattern, which means matching all variable depth cycles and then discarding some (most) of them. That's wont to become expensive in a large graph, and you may be better off building a traversal or server plugin, which can make complex iterative decisions during the traversal. In 2.0, thanks to Wes' elegant collection slicing, you could try something like this
MATCH path=a-[ab:PAID*..10]->a
WHERE ALL (ix IN range(0,length(ab)-2)
WHERE ((ab[ix]).year * 12 +(ab[ix]).month)<((ab[ix+1]).year * 12 +(ab[ix+1]).month))
RETURN path
The same could probably be achieved in 1.9 with HEAD() and TAIL(). Again, share sample data in a console and maybe someone else can pitch in.

Resources