I have some time-series data (roughly 1-5 points per day) that I need to access quickly in a webapp using ArangoDB. The data is associated with a particular profile, but one collection holds all the data for all profiles. Between the profile node and the data node there are a report node and an event node. A report is simply a group of data points from a given event. The existing graph structure looks like this:
profile
  ==> event1
  |     ==> reportA
  |     |     ==> data1
  |     |     ==> data2
  |     ==> reportB
  |           ==> data3
  |           ==> data4
  ==> event2
        ==> reportA
        |     ==> data1
        |     ==> data2
        ==> reportB
              ==> data3
              ==> data4
The chart I want would effectively present data1 sequentially, by associated event, sorted by an attribute of the event. An analogous tabular structure for the desired result set looks like this:
event     dataAttr   value
--------------------------
event1    data1      42
event2    data1      6
event3    data1      7
event4    data1      343
I am likely to run this query for every dataAttr in a given report, effectively creating a time-series result set for each dataAttr on a particular profile over the last 10-20 events.
When I investigated this problem in Neo4j, the recommendation was to connect sequential events directly to each other. I'm wondering whether that is also the better approach in ArangoDB.
This would mean creating an additional graph that looks something like this:
data1 (of event1) => data1 (of event2) => data1 (of event3) => data1 (of event4)
data2 (of event1) => data2 (of event2) => data2 (of event3) => data2 (of event4)
Etc.
Each dataAttr is connected to its counterpart in the previous event; after traversing to the most recent event in the first graph, the second graph would be used to traverse n layers back to past events (practically 10-20).
Is this the best way to structure the data for a query like this? Performance is critical, as I will potentially be loading 20 charts on a page, each fed by this query.
Would this query be faster if it simply ran against a document collection with indexes rather than via graph traversal? The document collection could have a hash index on dataAttr and a skiplist index on the event (events will be sequentially ordered under string sorting).
I'm assuming that traversing down to data1 of event1, back up to the profile, and back down to data1 of event2, and so on, would be very inefficient.
If performance is critical, then handling as much as possible with indexes is paramount. Traversals are superior when you have an unknown path length, which is not your use case.
I would recommend denormalizing the data stored in the data node. You want to return all data nodes belonging to a profile and a given dataAttr, sorted by a timestamp timeStamp, right? In that case I would at least add the profile identifier to the data node and create a skiplist index on profileId, dataAttr, and timeStamp.
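With that index in place, each chart could then be served by a single, fully index-backed AQL query. A sketch, assuming the denormalized documents live in a collection called dataPoints and carry eventId and value attributes (those two names are made up for illustration):
FOR d IN dataPoints
  FILTER d.profileId == @profileId
    AND d.dataAttr == @dataAttr
  SORT d.timeStamp DESC
  LIMIT 20
  RETURN { event: d.eventId, dataAttr: d.dataAttr, value: d.value }
Both the FILTER and the SORT are covered by the skiplist on profileId, dataAttr, timeStamp, so no traversal is needed at query time.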
I am loading simple CSV data into Neo4j. The data looks like this:
uniqueId      compound      value   category
ACT12_M_609   mesulfen      21      carbon
ACT12_M_609   MNAF          23      carbon
ACT12_M_609   nifluridide   20      suphate
ACT12_M_609   sulfur        23      carbon
I am loading the data from the URL using the following query:
LOAD CSV WITH HEADERS
FROM "url"
AS row
MERGE (t:Transaction { transactionId: row.uniqueId })
MERGE (c:Compound { name: row.compound })
MERGE (t)-[r:CONTAINS]->(c)
ON CREATE SET c.category = row.category
ON CREATE SET r.price = row.value
Next I do the aggregation to count the total orders per compound and set it as a property on the node, like this:
MATCH (c:Compound)<-[:CONTAINS]-(t:Transaction)
WITH c, count(DISTINCT t.transactionId) AS ord
SET c.orders = ord
So far so good. I can accomplish what I want, but I have the following two questions:
How can I create the orders property for the compound node in the first step itself? I.e., when I am loading the data I would like to perform the aggregation straight away.
For a compound node I am also setting the category property. Theoretically, it could also be modelled as category -CONTAINS-> compound by creating a Category node. But what advantage would I gain by doing that? I can already execute the queries and get the expected output without creating this additional node.
Thank you for your answer.
I don't think that's possible: LOAD CSV goes over one row at a time, so at row 1 it doesn't know how many more rows will follow.
I guess you could create virtual nodes and relationships, aggregate those, and then use them to create the real nodes, but that would be much more complicated. See Virtual Nodes/Rels.
That depends on the questions/queries you want to ask.
A graph database is optimised for following relationships, so if you often run a query where the category is a criterion (e.g. MATCH (c:Category {category_id: 12})-[r]-(:Compound)), it might be more performant to model the category as its own node.
If you just want to get the category in the results (e.g. RETURN compound.category), then it's fine as a property.
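For illustration, a sketch of the node-based variant (the Category label and IN_CATEGORY relationship type are made-up names). During the load you would merge the category once per row:
MERGE (cat:Category { name: row.category })
MERGE (c)-[:IN_CATEGORY]->(cat)
A category-driven query would then start from that node:
MATCH (cat:Category { name: 'carbon' })<-[:IN_CATEGORY]-(c:Compound)
RETURN c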
I am currently investigating how to model a bitemporal graph in Neo4j. Unfortunately, no one seems to have publicly undertaken this before.
One particular thing I am looking at is whether I can store, in a new node, only those values that have changed, and then express a query that merges all those values ordered by a given timestamp:
This creates the data I am playing with:
// Initial data
CREATE (:P1 {id: '1'})<-[:EXPANDS {date:5200, recorded:5100}]-(:P1Data {name:'Joe', wage: 3000})
// New data, recorded 2014-10-1 for 2015-1-1
MATCH (p:P1 {id: '1'}) CREATE (:P1Data { wage:3100 })-[:EXPANDS { date:5479, recorded: 5387}]->(p)
Now I can get a history for a given point in time, e.g.:
MATCH (:P1 { id: '1' })<-[x:EXPANDS]-(d:P1Data)
WHERE x.recorded < 6000
WITH {date: x.date, data: d} AS data
RETURN data
ORDER BY data.date DESC
What I would like to achieve is to merge the name and wage values such that I get a whole view of the data at a given point in time. The answer may also be that this is not really possible.
(PS: I say "only in a query" because I found a refactor procedure in APOC which does merge nodes, but that procedure actually merges and persists the nodes, while I just want to compute the merged view in a query.)
As with most things, you can do it using REDUCE like so:
MATCH (:P1 { id: '1' })<-[x:EXPANDS]-(d:P1Data)
WITH x.date AS date, d AS data
ORDER BY date
WITH COLLECT(data) AS datas
WITH REDUCE(s = {}, y IN datas|
{name: COALESCE(y.name, s.name),
wage: COALESCE(y.wage, s.wage)})
AS most_recent_fields
RETURN most_recent_fields.name AS name, most_recent_fields.wage AS wage
You can do it in descending order instead (swap s and y inside the COALESCE statements if so), but there isn't really a way to shortcut processing the entire result set from your queried time back to the start.
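For reference, a sketch of the descending variant; the only changes are the ORDER BY direction and the swapped COALESCE arguments, so the first non-null (i.e. most recent) value wins:
MATCH (:P1 { id: '1' })<-[x:EXPANDS]-(d:P1Data)
WITH x.date AS date, d AS data
ORDER BY date DESC
WITH COLLECT(data) AS datas
WITH REDUCE(s = {}, y IN datas|
    {name: COALESCE(s.name, y.name),
     wage: COALESCE(s.wage, y.wage)})
AS most_recent_fields
RETURN most_recent_fields.name AS name, most_recent_fields.wage AS wage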
UPDATE: This will, of course, generate a Map and not a Node, but if you only want the properties and don't want to create a permanent record, a Map is actually better suited to your needs.
EXTENDED: If you don't want to specify which keys to use, you can do it without REDUCE like this instead:
MATCH (:P1 { id: '1' })<-[x:EXPANDS]-(d:P1Data)
WITH x.date AS date, d AS data
ORDER BY date
WITH COLLECT(data) AS datas
CREATE (t:Temp)
FOREACH(data IN datas|
SET t += data)
DELETE t
RETURN t
This approach does create a node, but since you DELETE it right before you RETURN it, it never persists. += ensures that pre-existing properties aren't removed, only overwritten where the data node has a value for them.
I have about 200,000 rows of 24 hour data as follows:
I can use the query to create a room node with time, roomtemp, and settemp as properties. Moreover, I can also define the relationship of each room with its corresponding temperatures.
Now, I need to find:
all rows that show an update (increase or decrease) from the initial temperature until the set temperature, for all rooms. E.g., based on the above data, I need:
Here I have discarded the 5th row's data, as 16 was repetitive and showed no update (increase or decrease) in the temp value. The temperature values continue until they reach the set temperature of 18.
I can manually create the temperature states by giving its values one by one, but I am unsure how to MERGE the above requirement into the graph using Cypher.
Can I use another programming language in conjunction with Neo4j to obtain the same results?
Do I have to utilize in-graph time-tree for this scenario? Can I retrieve my results without creating a time tree?
Filter temperatures by room and date (which can also be a date node)
Sort by time
Collect into a list
Filter by differences between two subsequent temperatures
Turn the list into rows
Here is a query that does this:
MATCH (r:Room)<-[:TEMP]-(t:Temperature)
WHERE t.time STARTS WITH "2016-01-01"
AND t.temp < r.temp AND t.temp > {initial}
WITH t ORDER BY t.time ASC
WITH collect(t) AS temps
WITH [idx IN range(0,size(temps)-2) WHERE temps[idx].temp <> temps[idx+1].temp | temps[idx]] AS filtered
UNWIND filtered AS t
RETURN t;
I have to create some data in CoreData entities in a batch (import process) and I would like to "commit" at the end or "roll back" on an error (so saving in between won't work).
The problem is that, for example, I need to create a "Person" entity and later in that process I need to re-use that entity. BUT it can already exist BEFORE this process, or it may be created DURING the import process.
So I'm trying to fetch it with the predicate "(personId == 4711)". But although I have set
[fetchRequest setIncludesPendingChanges:YES];
it doesn't find the newly created Person object.
I read this question and this answer, which state that it is not possible. Am I right?
If so, how can I workaround / handle this?
From what I know of CoreData, this is indeed impossible (please correct me if I'm wrong).
However, even if it were possible, you never want to query your store on a per-object basis. The performance impact on large imports (more than a few dozen objects) is huge.
My suggestion is to build a dictionary keyed by your unique identifier during the import stage (prefetch the existing entities from the store and create new ones for those that don't exist yet).
Note: be careful not to perform multiple inserts from different contexts in a multithreaded environment; in such cases you will need a coordinator to prevent duplicates.
example:
store content: 1 --> P1, 3 --> P3
service response: 1 --> Data1, 2 --> Data2
Algorithm:
on response completion, get all unique ids from the response --> receivedIds = @[1, 2]
during the creation of the receivedIds set, create a mapping/dictionary of personId --> data:
@{1: Data1, 2: Data2}
fetch from the store with a single predicate:
[NSPredicate predicateWithFormat:@"personId IN %@", receivedIds]
Create a dictionary from the resulting array,
in this case: existingItems = @{1: P1}
Pass over all ids in receivedIds:
1) if the id exists in existingItems, update the existing object
2) else create a new Person and insert the data into the new record.
This will only fetch from the store once, and you only save once:
==> only 2 trips to the store instead of a trip per object
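A minimal sketch of that flow in code (assuming a Person entity with a personId attribute; responseItems, context, and updatePerson:withData: are placeholders for your own service payload, managed object context, and mapping logic):
NSArray *receivedIds = [responseItems valueForKey:@"personId"];
NSMutableDictionary *dataById = [NSMutableDictionary dictionary];
for (NSDictionary *item in responseItems) { dataById[item[@"personId"]] = item; }

// Trip 1: one fetch covering all incoming ids
NSFetchRequest *request = [NSFetchRequest fetchRequestWithEntityName:@"Person"];
request.predicate = [NSPredicate predicateWithFormat:@"personId IN %@", receivedIds];
NSArray *existing = [context executeFetchRequest:request error:NULL];

NSMutableDictionary *existingById = [NSMutableDictionary dictionary];
for (Person *p in existing) { existingById[p.personId] = p; }

for (id personId in receivedIds) {
    Person *p = existingById[personId];
    if (!p) { // not in the store yet: insert a new record
        p = [NSEntityDescription insertNewObjectForEntityForName:@"Person"
                                          inManagedObjectContext:context];
        p.personId = personId;
    }
    [self updatePerson:p withData:dataById[personId]]; // your mapping logic
}

// Trip 2: one save at the end
[context save:NULL];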
I'm trying to figure out the best approach to display combined tables based on matching logic and input search criteria.
Here is the situation:
We have a table of customers stored locally. The fields of interest are ssn, first name, last name and date of birth.
We also have a web service which provides the same information. Some of the customers from the web service are the same as the local file, some different.
SSN is not required in either.
I need to combine this data to be viewed on a Grails display.
The criteria for combination are 1) match on SSN. 2) For any remaining records, exact match on first name, last name and date of birth.
There's no need at this point for soundex or approximate logic.
It looks like what I should do is extract all the records from both inputs into a single collection, somehow making it a set keyed on SSN, then remove the entries with a blank SSN.
This will handle the SSN matching (once I figure out how to build that set).
Then I need to go back to the two original input sources (cached in a collection to prevent a re-read) and remove any records that exist in the SSN set derived previously.
Then create another set based on first name, last name, and date of birth, again assuming I can figure out how to build such a set.
Then combine the two derived collections into a single collection. The collection should be sorted for display purposes.
Does this make sense? I think the search criteria will limit the number of record pulled in so I can do this in memory.
Essentially, I'm looking for some ideas on how the Grails code would look for achieving the above logic (assuming this is a good approach). The local customer table is a domain object, while what I'm getting from the WS is an array list of objects.
Also, I'm not entirely clear on how the maxResults, firstResult, and order used for the display would be affected. I think I need to read in all the records matching the search criteria first, do the combining, and display from the derived collection.
The traditional Java way of doing this would be to copy both the local and remote objects into TreeSet containers with a custom comparator, first for SSN, second for name/birthdate.
This might look something like:
import org.codehaus.groovy.runtime.ClosureComparator

def localCustomers = Customer.list()
def remoteCustomers = RemoteService.get()
TreeSet ssnFilter = new TreeSet(new ClosureComparator({ c1, c2 -> c1.ssn <=> c2.ssn }))
ssnFilter.addAll(localCustomers)
ssnFilter.addAll(remoteCustomers)
TreeSet nameDobFilter = new TreeSet(new ClosureComparator({c1, c2 -> c1.firstName + c1.lastName + c1.dob <=> c2.firstName + c2.lastName + c2.dob}))
nameDobFilter.addAll(ssnFilter)
def filteredCustomers = nameDobFilter as List
At this point, filteredCustomers has all the records, except those that are duplicates by your two criteria.
Another approach is to filter the lists by sorting them and doing a fold (Groovy's inject), combining adjacent elements when they match. This way, you have an opportunity to combine the data from both sources.
For example:
def combineByNameAndDob(customers) {
customers.sort() {
c1, c2 -> (c1.firstName + c1.lastName + c1.dob) <=>
(c2.firstName + c2.lastName + c2.dob)
}.inject([]) { cs, c ->
if (cs && c.equalsByNameAndDob(cs[-1])) {
cs[-1].combine(c) //combine the attributes of both records
cs
} else {
cs << c
}
}
}
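Usage would then be something like this (a sketch, reusing the lists fetched earlier):
def combined = combineByNameAndDob(localCustomers + remoteCustomers)
Since inject folds left to right over the sorted list, each run of matching records is merged into the element that arrived first.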