Is there a way on Neo4j to fetch a list of all the new nodes created after a certain time? like a built in change-feed?
I know this could be done by traversing the entire graph and comparing if a node's date is > than the treshold set before.
However, this is not optimal at the very least and would not perform well on a 10 million node graph.
Is there a way to know if new nodes were added? (or relationships) some sort of change feed like a built in bloom filter?
If not, any ideas on getting a change feed every x minutes?
Have you tried an INDEX? With an index your query performance will be improved. Try creating an index in the property related to the creation time of the nodes.
CREATE INDEX ON :Person(created_at)
After, when creating a node, you can use the timestamp() function and save the current timestamp in the property created_at of :Person nodes.
CREATE (:Person {name:'Jon', created_at: timestamp()})
CREATE (:Person {name:'Doe', created_at: timestamp()})
Then you can query normally by the created_at property of :Person nodes and the index will be used.
MATCH (p:Person)
WHERE p.created_at > 1502882338889 // given a timestamp...
RETURN p
Also, if you don't need all the nodes modified after a given timestamp at same time, you can make a pagination in the query and work with pieces of the entire data using SKIP and LIMIT.
MATCH (p:Person)
WHERE p.created_at > 1502882338889 // given a timestamp...
RETURN p
ORDER BY p.created_at
SKIP 1
LIMIT 2
Related
So let's say we have User nodes, Company nodes, Project nodes, School nodes and Event nodes. And there are the following relationships between these nodes
(User)-[:WORKED_AT {start: timestamp, end:timestamp}]->(Company)
(User)-[:COLLABORATED_ON]->(Project)
(Company)-[:COLLABORATED_ON]->(Project)
(User)-[:IS_ATTENDING]->(Event)
(User)-[:STUDIED_AT]->(School)
I am trying to recommend users to any given user. My starting query looks like this
MATCH p=(u:User {id: {leftId}})-[r:COLLABORATED_ON|:AUTHORED|:WORKED_AT|:IS_ATTENDING|:STUDIED_AT*1..3]-(pymk:User)
RETURN p
LIMIT 24
Now this returns me all the pymk users within 1 to 3 relationships away, which is fine. But I want to filter the path according to the relationship attributes. Like remove the following path if the user and pymk work start date and end date is not overlapping.
(User)-[:WORKED_AT]->(Company)<-[:WORKED_AT]-(User)
I can do this with single query
MATCH (u:User)-[r1:WORKED_AT]->(Company)<-[r2:WORKED_AT]-(pymk:User)
WHERE
(r1.startedAt < r2.endedAt) AND (r2.startedAt < r1.endedAt)
RETURN pymk
But couldn't get my head around doing it within a collection of paths. I don't even know if this is possible.
Any help is appreciated.
This should do the trick:
MATCH p=(:User {id: {leftId}})-[:COLLABORATED_ON|:AUTHORED|:WORKED_AT|:IS_ATTENDING|:STUDIED_AT*1..3]-(:User)
WITH p, [rel in relationships(p) WHERE type(rel) = 'WORKED_AT'] as worked
WHERE size(worked) <> 2 OR
apoc.coll.max([work in worked | work.startedAt]) < apoc.coll.min([work in worked | work.endedAt])
RETURN p
LIMIT 24
We're using APOC here to get the max and min of a collection (the max() and min() aggregation functions in just Cypher are aggregation functions across rows, and can't be used on lists).
This relies on distilling the logic of overlapping down to max([start times]) < min([end times]), which you can check out in this highly popular answer here
I have a collection of :Product nodes and I want to return latest 100. Consider global query like:
MATCH (p:Product) RETURN p LIMIT 100
From what I can see it returns oldest nodes first. Is there a way of getting newest on top?
Order by won't be an option as number of products can be millions.
UPDATE
I ended up creating a dense node (:ProductIndex). Each time I create product I add it to the index (:Product)-[:INDEXED]->(:ProductIndex). With dense nodes rel chain will be ordered by latest first so that query below will return newest records on top
MATCH (p:Product)-[:INDEXED]->(:ProductIndex)
RETURN p
LIMIT 1000
I can always keep index fixed size as I don't need to preserve full history.
Is the data model such that the products are connected in a list (e.g. (:Product)-[:PREVIOUS]->(:Product)?
Can you keep track of the most recent node? Either with a time stamp that you can easily locate or another node connected to your most recent product node.
If so, you could always query out the most recent ones with a query similar to the following.
match (max:Date {name: 'Last Product Date'})-->(latest:Product)
with latest
match p=(latest)-[:PREVIOUS*..100]->(:Product)
return nodes(p)
order by length(p) desc
limit 1
OR something like this where you select
match (max:Date {name: 'Product Date'})
with max
match p=(latest:Product)-[:PREVIOUS*..100]->(:Product)
where latest.date = max.date
return nodes(p)
order by length(p) desc
limit 1
Another approach, still using a list could be to keep an indexed create date property on each product. But when looking for the most recent pick a control date that doesn't go back to the beginning of time so you have a smaller pool of nodes (i.e. not millions). Then use an max function on that smaller pool to find the most recent node and follow it back by however many you want.
match (latest:Product)
where latest.date > {control_date}
with max(latest.date) as latest_date
match p=(product:Product)-[:PREVIOUS*..100]->(:Product)
where product.date = latest_date
return nodes(p)
order by length(p) desc
limit 1
Deleting a node in a linked list is pretty simple. If you need to perform this search a lot and you don't want to order the products, I think keeping the products in a list is a pretty good graph application. Here is an example of a delete that maintains the list.
match (previous:Product)<-[:PREVIOUS]-(product_to_delete:Product)<-[:PREVIOUS]-(next:Product)
where product_to_delete.name = 'name of node to delete'
create (previous)<-[:PREVIOUS]-(next)
detach delete product_to_delete
return previous, next
I have a Neo4J DB up and running with currently 2 Labels: Company and Person.
Each Company Node has a Property called old_id.
Each Person Node has a Property called company.
Now I want to establish a relation between each Company and each Person where old_id and company share the same value.
Already tried suggestions from: Find Nodes with the same properties in Neo4J and
Find Nodes with the same properties in Neo4J
following the first link I tried:
MATCH (p:Person)
MATCH (c:Company) WHERE p.company = c.old_id
CREATE (p)-[:BELONGS_TO]->(c)
resulting in no change at all and as suggested by the second link I tried:
START
p=node(*), c=node(*)
WHERE
HAS(p.company) AND HAS(c.old_id) AND p.company = c.old_id
CREATE (p)-[:BELONGS_TO]->(c)
RETURN p, c;
resulting in a runtime >36 hours. Now I had to abort the command without knowing if it would eventually have worked. Therefor I'd like to ask if its theoretically correct and I'm just impatient (the dataset is quite big tbh). Or if theres a more efficient way in doing it.
This simple console shows that your original query works as expected, assuming:
Your stated data model is correct
Your data actually has Person and Company nodes with matching company and old_id values, respectively.
Note that, in order to match, the values must be of the same type (e.g., both are strings, or both are integers, etc.).
So, check that #1 and #2 are true.
Depending on the size of your dataset you want to page it
create constraint on (c:Company) assert c.old_id is unique;
MATCH (p:Person)
WITH p SKIP 100000 LIMIT 100000
MATCH (c:Company) WHERE p.company = c.old_id
CREATE (p)-[:BELONGS_TO]->(c)
RETURN count(*);
Just increase the skip value from zero to your total number of people in 100k steps.
I am learning the basics of neo4j and I am looking at the following example with credit card fraud https://linkurio.us/stolen-credit-cards-and-fraud-detection-with-neo4j. Cypher query that finds stores where all compromised user shopped is
MATCH (victim:person)-[r:HAS_BOUGHT_AT]->(merchant)
WHERE r.status = “Disputed”
MATCH victim-[t:HAS_BOUGHT_AT]->(othermerchants)
WHERE t.status = “Undisputed” AND t.time < r.time
WITH victim, othermerchants, t ORDER BY t.time DESC
RETURN DISTINCT othermerchants.name as suspicious_store, count(DISTINCT t) as count, collect(DISTINCT victim.name) as victims
ORDER BY count DESC
However, when the number of users increase (let's say to millions of users), this query may become slow since the initial query will have to traverse through all nodes labeled person. Is it possible to speed up the query by asigning properties to nodes instead of transactions? I tried to remove "status" property from relationships and add it to nodes (users, not merchants). However, when I run query with constraint WHERE victim.status="Disputed" query doesn't return anything. So, in my case person has one additional property 'status'. I assume I did a lot of things wrong, but would appreciate comments. For example
MATCH (victim:person)-[r:HAS_BOUGHT_AT]->(merchant)
WHERE victim.status = “Disputed”
returns the correct number of disputed transactions. The same holds for separately quering number of undisputed transactions. However, when merged, they yield an empty set.
If I made a mistake in my approach, how can I speed up queries for large number of nodes (avoid traversing all nodes in the first step). I will be working with a data set with similar properties, but will have around 100 million users, so I would like to index users on additional properties.
[Edited]
Moving the status property from the relationship to the person node does not seem to be the right approach, since I presume the same person can be a customer of multiple merchants.
Instead, you can reify the relationship as a node (let's label it purchase), as in:
(:person)-[:HAS_PURCHASE]->(:purchase)-[:BOUGHT_AT]->(merchant)
The purchase nodes can have the status property. You just have to create the index:
CREATE INDEX ON :purchase(status)
Also, you can put the time property in the new purchase nodes.
With the above, your query would become:
MATCH (victim:person)-[:HAS_PURCHASE]->(pd:purchase)-[:BOUGHT_AT]->(merchant)
WHERE pd.status = “Disputed”
MATCH victim-[:HAS_PURCHASE]->(pu:purchase)-[:BOUGHT_AT]->(othermerchants)
WHERE pu.status = “Undisputed” AND pu.time < pd.time
WITH victim, othermerchants, pu ORDER BY pu.time DESC
RETURN DISTINCT othermerchants.name as suspicious_store, count(DISTINCT pu) as count, collect(DISTINCT victim.name) as victims
ORDER BY count DES
I am attempting to query the db for the 10 most recently created nodes. I have attempt
MATCH (a:Post) RETURN a ORDER BY TIMESTAMP() LIMIT 10
I have also tried this
MATCH (a:Post) RETURN a ORDER BY TIMESTAMP() DESC LIMIT 10
If I create nodes with contents {one, two, three} in that order, both queries produce the nodes in the order one, two, three. Any thoughts or ideas as to why this happens??
TIMESTAMP() is a scalar function to mean the exact time at query execution. It does not have anything to do with the time the nodes or relationships were created.
That is why you get the exact same results for both queries. You're simply ordering by the current time, which doesn't make a lot of sense because the current time is exactly the same for all records.
Neo4j does not store any creation timestamps by default. You need to store them as an additional property, if it's important to you. This is where you should use the scalar function.
CREATE (:Post {created_at: TIMESTAMP()})
Once that is done, match and order like this.
MATCH (a:Post) RETURN a ORDER BY a.created_at LIMIT 10
Note that you're ordering by the created_at property, and not the TIMESTAMP() scalar function.