Querying temporal data in Neo4j - neo4j

There are several possible ways I can think of to store and then query temporal data in Neo4j. Looking at an example of being able to search for recurring events and any exceptions, I can see two possibilities:
One easy option would be to create a node for each occurrence of the event. Whilst it is easy to construct a Cypher query to find all events on a day, in a range, etc., this could create a lot of unnecessary nodes. It would also make it very easy to change an individual event's time, location, etc., because there is already a node with the basic information.
The second option is to store the recurrence temporal pattern as a property of the event node. This would greatly reduce the number of nodes within the graph. When searching for events on a specific date or within a range, all nodes that meet the start/end date (plus any other) criteria could be returned to the client. It then boils down to iterating through the results to pluck out the subset whose temporal pattern gives a date within the search range, then comparing that to any exceptions and merging (or ignoring) the results as necessary (this could probably be partially achieved when pulling the initial result set as part of the query).
Whilst the second option is the one I would choose currently, it seems quite inefficient in that it processes the data twice, albeit a smaller subset the second time. Even a plugin to Neo4j would probably result in two passes through the data, but the processing would be done on the database server rather than the requesting client.
What I would like to know is whether it is possible to use Cypher or Neo4j to do this processing as part of the initial query?
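To make the second option concrete, here is a minimal sketch of the client-side pass: expanding a stored recurrence pattern into the occurrences that fall inside the search range and dropping any exceptions. It assumes the simplest possible pattern, "every N days", and the names (`occurrences`, `interval_days`, `exceptions`) are hypothetical, not anything Neo4j provides.

```python
from datetime import date, timedelta

def occurrences(start, end, interval_days, exceptions, range_start, range_end):
    """Return occurrence dates of a recurring event that fall in the search range.

    start/end bound the recurrence, interval_days is the (assumed) pattern, and
    exceptions is a set of dates on which the event does not take place.
    """
    first = max(start, range_start)
    # Skip ahead to the first occurrence on or after the start of the range.
    offset = (first - start).days % interval_days
    if offset:
        first += timedelta(days=interval_days - offset)
    last = min(end, range_end)

    result = []
    current = first
    while current <= last:
        if current not in exceptions:
            result.append(current)
        current += timedelta(days=interval_days)
    return result

# A weekly event starting on 1 January 2024, cancelled once on 15 January.
weekly = occurrences(
    start=date(2024, 1, 1), end=date(2024, 3, 31), interval_days=7,
    exceptions={date(2024, 1, 15)},
    range_start=date(2024, 1, 10), range_end=date(2024, 1, 31),
)
```

The same logic could in principle live in a server-side extension instead of the client; the sketch only shows what the second pass has to compute.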

Whilst I'm not 100% sure I understand your requirement, I'd have a look at this blog post; perhaps you'll find a bit of inspiration there: http://graphaware.com/neo4j/2014/08/20/graphaware-neo4j-timetree.html

Related

Neo4j Cypher optimization of complex paginated query

I have a rather long and complex paginated query that I'm trying to optimize. In the worst case, I first have to execute the data query in one call to Neo4j, and then execute pretty much the same query again for the count. Of course, I do everything in one transaction. I don't like the overall execution time, so I extracted the part common to both the data and count queries and execute it in a first call. This common query returns the IDs of nodes, which I then pass as parameters to the rest of the data and count queries. Now everything works much faster. One thing I don't like is that the common query can sometimes return quite a large set of IDs: it can be 20k to 50k Long IDs.
So my question is: because I'm doing this in one transaction, is there a way to keep such a set of IDs somewhere in Neo4j between the common query and the data/count query calls, and just refer to them in the subsequent data/count queries without moving them between the app JVM and Neo4j?
Also, am I crazy for doing this, or is this a good approach to optimize a complex paginated query?
Only with a custom procedure.
Otherwise you'd need to return them.
But it's uncommon to provide both counts and data (even Google doesn't provide "real" counts).
One way is to just stream the results with the reactive driver as long as the user scrolls.
Otherwise I would just query for pageSize+1 and return "more than pageSize results".
If you just stream the IDs back (and don't collect them as an aggregation), you can start using the IDs already received to issue your new queries (even in parallel).
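The pageSize+1 suggestion can be sketched as follows. `run_query` is a hypothetical callable standing in for whatever executes the Cypher query with SKIP/LIMIT parameters; it is an assumption for illustration and not part of any Neo4j driver.

```python
def fetch_page(run_query, skip, page_size):
    """Fetch one page and report whether more rows exist, without a count query.

    run_query is a hypothetical callable (skip, limit) -> list of rows.
    """
    rows = run_query(skip, page_size + 1)   # ask for one extra row
    has_more = len(rows) > page_size        # the extra row only signals "more"
    return rows[:page_size], has_more

# Stand-in data source, for illustration only.
data = list(range(25))

def fake_query(skip, limit):
    return data[skip:skip + limit]

first_page, first_has_more = fetch_page(fake_query, skip=0, page_size=10)
last_page, last_has_more = fetch_page(fake_query, skip=20, page_size=10)
```

The UI can then show "more than pageSize results" or a "next" link based on the flag, which avoids running the count query at all.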

Neo4J using properties on relationships for quicker lookup?

I am currently trying to use Neo4j to perform a complex query (similar to a shortest path search, except that I apply unusual conditions such as a minimum path length in terms of the number of nodes traversed).
My dataset contains around 2.5M nodes of a single type and around 1.5 billion edges (also of a single type). Each node has on average 1000 directed relationships to a "next" node.
I currently have a query that retrieves this shortest path under all of my conditions, but the only way I found to get a decent response time (under one second) is to limit the number of results after each new node is added to the path: filter, order, limit, and then proceed to the next node (a kind of greedy algorithm, I suppose).
I'd like to limit them a lot less than I do in order to yield more paths, but the problem is the exponential complexity of this search: going from LIMIT 40 to LIMIT 60 usually means x10 to x100 processing time.
That said, I am currently evaluating several ways to speed up the request, but I'm unsure what they will yield, as I'm not sure how Neo4j really stores my data internally.
The solution I'm currently considering is to add a property to my relationships, an integer between 1 and 15, because I will usually only query relationships that have one or at most two different values for this property (for example, only relationships where this property is 8 or 9).
My guess is that, at present, for each relationship Neo4j has to fetch the node's properties and use them to apply my further filters, which takes a very long time when crossing a path 4 nodes long with 1000 relationships each (I guess O(1000^4)). Am I right?
With relationship properties, will Neo4j have direct access to them without further data fetching? Is there any chance this will make my queries faster? How are Neo4j relationship properties stored?
UPDATE
Following #logisima's advice, I've written a procedure directly with the Java traversal API of Neo4j. I then switched to the raw Java procedure API of Neo4j to get even more power and flexibility, as my use case required it.
The results are really good: the lower bound on running time is overall a little lower than before, but the upper bound is about ten times faster, and when at least some of the nodes used for the traversal are in Neo4j's cache, the performance becomes astonishing (depth 20 in less than a second in one of my tests, when I usually only need depth 4).
But that's not all. The procedure is very easy to customise while keeping performance at its best and optimizing every single operation. The result is that I can use far more powerful filters in far less computing time and can easily update my procedure to add new features. Last but not least, procedures are very easy to plug into spring-data for Neo4j (which I use to connect Neo4j to my HTTP API). Whereas with Cypher, I would have had to auto-generate the queries (they were so complex that it took around 30 Java classes to do it properly), and I would have had to use JDBC for Neo4j while handling a separate connection pool just for this request. I cannot recommend the Neo4j Java API enough.
Thanks again #logisima
If you're trying to write a custom shortest-path algorithm, then you should write a Cypher procedure with the traversal API.
The principle of Cypher is pattern matching, whereas you want to traverse the graph in a specific way to find your solution.
The response time should be much better for your use case!
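To illustrate the kind of logic such a procedure encodes, here is a minimal in-memory sketch, not the Neo4j traversal API: an explicit depth-first traversal that prunes on a relationship property as it goes and only accepts paths with a minimum number of nodes. The graph layout, the integer edge property and all names are assumptions made for illustration.

```python
def find_paths(graph, start, goal, min_nodes, max_nodes, allowed):
    """Enumerate simple paths start -> goal with a node count in [min_nodes, max_nodes].

    graph maps node -> list of (neighbour, edge_value) pairs. Only edges whose
    edge_value is in `allowed` are followed, so pruning happens per edge.
    """
    results = []
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == goal:
            # Paths are simple, so there is no point expanding past the goal.
            if len(path) >= min_nodes:
                results.append(path)
            continue
        if len(path) == max_nodes:
            continue
        for neighbour, edge_value in graph.get(node, []):
            if edge_value in allowed and neighbour not in path:
                stack.append((neighbour, path + [neighbour]))
    return results

graph = {
    "a": [("b", 8), ("d", 3), ("c", 9)],
    "b": [("d", 9)],
    "c": [("b", 8)],
    "d": [],
}
paths = find_paths(graph, "a", "d", min_nodes=3, max_nodes=4, allowed={8, 9})
long_paths = find_paths(graph, "a", "d", min_nodes=4, max_nodes=4, allowed={8, 9})
```

The point of the sketch is that the traversal order, the pruning rule and the acceptance rule are all under the author's control, which is what a declarative pattern match does not offer.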

Activity feeds with rollups

We have items in our app that form a tree-like structure. You might have a pattern like the following:
(c:card)-[:child]->(subcard:card)-[:child]->(subsubcard:card) ... etc
Every time an operation is performed on a card (at any level), we'd like to record it. Here are some possible events:
The title of a card was updated by Bob
A comment was added by Kate mentioning Joe
The status of a card changed from pending to approved
The linked list approach seems popular but given the sorts of queries we'd like to perform, I'm not sure if it works the best for us.
Here are the main queries we will be running:
All of the activity associated with a particular card AND child cards, sorted by time of the event (basically we'd like to merge all of these activity feeds together)
All of the activity associated with a particular person sorted by time
On top of that we'd like to add filters like the following:
Filter by person involved
Filter by time period
It is also important to note that cards may be re-arranged very frequently. In other words, the parents may change.
Any ideas on how to best model something like this? Thanks!
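As a rough illustration of the first query (merging the feeds of a card and its child cards by time, with the two filters), here is a sketch that assumes each card's events are already available as a list sorted newest first. The tuple layout `(timestamp, person, text)` and the function name are hypothetical.

```python
import heapq
from itertools import takewhile

def merged_feed(feeds, since=None, person=None):
    """Merge per-card feeds (each sorted newest first) into one newest-first feed.

    Each event is a (timestamp, person, text) tuple; this layout is an assumption.
    """
    merged = heapq.merge(*feeds, key=lambda e: e[0], reverse=True)
    if since is not None:
        # Input is newest first, so everything after the first old event is older.
        merged = takewhile(lambda e: e[0] >= since, merged)
    if person is not None:
        merged = (e for e in merged if e[1] == person)
    return list(merged)

card = [(300, "Bob", "title updated"), (100, "Kate", "comment added")]
subcard = [(250, "Kate", "status changed"), (50, "Bob", "card created")]

recent = merged_feed([card, subcard], since=100)
by_kate = merged_feed([card, subcard], person="Kate")
```

Because the merge is lazy, the time filter stops consuming events as soon as it sees one older than `since`.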
I have a couple of suggestions, but I would suggest benchmarking them.
The linked list approach might be good if you could use the Java APIs (perhaps via an unmanaged extension for Neo4j). If the newest event in the list were the one attached to the card (and essentially the list was ordered by the date the events happened down the line), then if you're filtering by time you could terminate early when you've found an event which is earlier than the specified time.
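The early-termination idea can be sketched in plain Python, independent of Neo4j: each event points to the next-older one, the card holds the newest, and the walk stops at the first event older than the cutoff. The `Event` class and its field names are assumptions for illustration.

```python
class Event:
    def __init__(self, at, text, previous=None):
        self.at = at              # timestamp, e.g. epoch seconds
        self.text = text
        self.previous = previous  # next-older event in the chain

def events_since(head, cutoff):
    """Walk from the newest event and stop at the first one older than cutoff.

    Returns the matching texts and how many events were visited.
    """
    found = []
    visited = 0
    node = head
    while node is not None:
        visited += 1
        if node.at < cutoff:
            break                 # everything further down the chain is older
        found.append(node.text)
        node = node.previous
    return found, visited

e0 = Event(50, "card created")
e1 = Event(100, "title updated", e0)
e2 = Event(200, "comment added", e1)
e3 = Event(300, "status changed", e2)

found, visited = events_since(e3, cutoff=150)
```

In this example only three of the four events are visited; the oldest one is never touched.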
Attaching the events directly to the card has the potential to lead you down into problems with supernodes/dense nodes. It would be the simplest to query for in Cypher, though. The problem is that Cypher will look at all of them before filtering. You could perhaps improve the performance of queries by, in addition to placing the date/time of the event on the event node, placing it on the relationships to the node ((:Card)-[:HAS_EVENT]->(:Event) or (:Event)-[:PERFORMED_BY]->(:Person)). Then when you query you can filter by the relationships so that it doesn't need to traverse to the nodes.
Regardless, it would probably be helpful to break up the query like so:
MATCH (c:Card {uuid: 'id_here'})-[:child*0..]->(child:Card)
WITH child
MATCH (child)-[:HAS_EVENT]->(event:Event)
I think that would mean that the MATCH is going to have fewer permutations of paths that it will need to evaluate.
Others are welcome to supplement my dubious advice as I've never really dealt with supernodes personally, just read about them ;)

Data model and traversal approach for bus routing in neo4j

I'm trying to create an application to find the best paths to use when traveling by bus within my local city. I have found some useful answers on here so far, but I'm currently struggling with my approach in general and I'd like to get some feedback.
Current data model
There are stations and stages modeled as nodes and two relationships between stations and stops. Each stage node has a start and an end time as a string in "HH:mm" format and belongs to some higher-level structure which I call routes, that are connected to these stage nodes to describe a trajectory along stations with time details. Each :FROM relationship has a property duration to model the travel time for reduce statements.
So the following query would return something like this. The stage nodes show the start property in the picture.
match (from:Station {name: "Glosberg"})
match (to:Station {name: "Knellendorf"})
match paths=((from)-[:FROM|:TO*..10]->(to))
return paths;
Problems so far:
ShortestPath/AllShortestPaths is not a valid option, as the smallest number of hops does not mean the best path. What I want is to minimize travel duration, which I can achieve with a reduce statement, and which I have already done. Since I have to check all paths, I'm using the general pattern matcher with a limit (as seen above). The limit I use in my queries is actually the length of the shortest path between from and to plus 10% or so, to also include paths that consist of more hops but take less time. This is not necessarily accurate but seems like a fair trade-off.
Using Dijkstra gives me all paths from A to B. Since stage nodes have a form of time data on them, most of the paths do not make sense, because they are either combined in reversed order (2pm -> 1pm) or produce long, unnecessary waiting times (2pm -> 4pm). Therefore I have to filter out bad paths, either in Cypher or at some API level. However, with the current data model there are simply too many paths to check for validity. With some sample data, which would also run in production, I have a route that visits 24 stations 2 times a day, resulting in 2^23 paths to take. I'm pretty sure that my data model is the problem, but I can't see any way to solve this; any ideas?
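The validity check described above (no going back in time, no long waits between consecutive stages) can be sketched like this. It assumes a path is a list of `(start, end)` pairs in "HH:mm" format in travel order, that `max_wait` is a tunable threshold in minutes, and that the path does not cross midnight; all names are hypothetical.

```python
def to_minutes(hhmm):
    """Convert an "HH:mm" string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)

def is_sensible(stages, max_wait=30):
    """Reject paths that go back in time or wait too long between stages.

    stages is a list of (start, end) "HH:mm" pairs in travel order.
    """
    for (_, arrive), (depart, _) in zip(stages, stages[1:]):
        wait = to_minutes(depart) - to_minutes(arrive)
        if wait < 0 or wait > max_wait:
            return False
    return True

good = is_sensible([("13:00", "13:20"), ("13:25", "13:50")])
reversed_order = is_sensible([("14:00", "14:20"), ("13:00", "13:20")])
long_wait = is_sensible([("14:00", "14:20"), ("16:00", "16:20")])
```

Applying such a check while the path is being expanded, rather than after all paths have been collected, is what keeps the number of candidates manageable.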
Questions:
More of a side problem: How would you solve ordering paths with stages that go past 0am? As "23:59" is bigger than "00:01" but not chronologically.
What would you change about the data model?
Would you suggest any trade-offs in how the path finding works to reduce the complexity (eg. simply using shortestpath)?
Would you suggest separating the actual route data (timetable, who stops where and when) from the infrastructure data (stations and which stations are close to which)? That way I'd have to use Neo4j to find a path/set of stations to travel along and then try to find a suitable set of elements from a timetable, similar to wanderu's approach.
I've read that the traversal API is a better way to describe how the graph should be accessed than Cypher, which only describes what to look for, but I'd like to receive feedback on my thoughts so far before I dive into that.
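For the side problem of ordering times that go past midnight, one possible approach is to stop comparing "HH:mm" strings and instead convert a chronological sequence into ever-increasing minute offsets, adding 24 hours each time the clock wraps. This is a sketch under the assumption that the sequence is already in travel order and that no gap is longer than a day; the function names are hypothetical.

```python
def minutes_of_day(hhmm):
    """Convert an "HH:mm" string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)

def unroll(times):
    """Convert chronologically ordered "HH:mm" strings to increasing minute offsets.

    Whenever a time is smaller than its predecessor, the sequence is assumed to
    have crossed midnight, so a further 24 hours is added from there on.
    """
    day_offset = 0
    previous = None
    result = []
    for t in times:
        m = minutes_of_day(t)
        if previous is not None and m < previous:
            day_offset += 24 * 60
        result.append(m + day_offset)
        previous = m
    return result

offsets = unroll(["23:30", "23:59", "00:01", "00:40"])
```

Storing times as integers (minutes since the start of the service day) on the stage nodes would make the same comparison possible directly in a query.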

Neo4j 2.0: Indexing array-valued properties with schema indexing

I have nodes with multiple "sourceIds" in one array-valued property called "sourceIds", just because there could be multiple resources a node could be derived from (I'm assembling multiple databases into one Neo4j model).
I want to be able to look up nodes by any of their source IDs. With legacy indexing this was no problem, I would just add a node to the index associated with each element of the sourceIds property array.
Now I wanted to switch to indexing with labels and I'm wondering how that kind of index works here. I can do
CREATE INDEX ON :<label>(sourceIds)
but what does that actually do? I hoped it would just create index entries for each array element, but that doesn't seem to be the case. With
MATCH n:<label> WHERE "testid" in n.sourceIds RETURN n
the query takes between 300ms and 500ms which is too long for an index lookup (other schema indexes work three to five times faster). With
MATCH n:<label> WHERE n.sourceIds="testid" RETURN n
I don't get a result. That's clear because it's an array property but I just gave it a try since it would make sense if array properties would be broken down to their elements for indexing purposes.
So, is there a way to handle array properties with schema indexing or are there plans or will I just have to stick to legacy indexing here? My problem with the legacy Lucene index was that I hit the max number of boolean clauses (1024). Another question thus would be: Can I raise this number? Lucene allows that, but can I do this with the Lucene index used by Neo4j?
Thanks and best regards!
Edit: A bit more elaboration on why I hit the boolean clauses max limit: I need to export specific parts of the database into custom file formats for text processing pipelines. These pipelines use components I cannot (be it for the sake of accessibility or time) change to query Neo4j directly, so I'd rather stay with the defined required file format(s). I do the export via the pattern "give me all IDs in the DB; now, for batches of IDs, query the desired information (e.g. specific paths) from Neo4j and store the results to file". Why I use batches at all? Well, if I don't, things are slowed down significantly via the connection overhead. Thus, large batches are a kind of optimization here.
Schema indexes can only do exact matches right now. Your "testid" in n.sourceIds does not use the index (as shown by your query times). I think there are plans to make this behave better, but I'm waiting for them just as eagerly as you are.
I've actually hit a lower max in the Lucene query: 512. If there is a way to increase it, I'd love to hear of it. The way I got around it is to run more than one query in the rare cases that actually go over 512 IDs. What query are you doing where you need more?
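The workaround of splitting a large ID list into several queries can be sketched as follows. The batch size of 512 comes from the limit mentioned above; the helper name and the sample IDs are made up for illustration.

```python
def batches(ids, size=512):
    """Yield consecutive slices of ids, each no longer than size."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]

ids = [f"id{i}" for i in range(1200)]
sizes = [len(batch) for batch in batches(ids)]
```

Each batch would then be sent as its own index query, and the results concatenated on the client.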
