I have a neo4j database where two users attempt to trade a number of cards.
Each trade node has two outgoing relationships towards the two users the trade involves,
and the cards that are being traded.
If no agreement is made, a subsequent trade is created pointing to the previous one with a PREVIOUS relationship.
If an agreement is made the last node of the trade chain is marked with a success:true property.
The image below represents an example trade between two users.
I am trying to get all last trade nodes between two users with ids 10 and 20.
The last trade node is the one that has no incoming relationship.
My attempt is this:
MATCH (u:User)<--(t:Trade)-->(n:User)
WHERE (ID(u)=10 AND ID(n)=20) OR (ID(u)=20 AND ID(n)=10)
AND NOT (t)<-[:PREVIOUS]-()
RETURN t
The above however returns all 3 trade nodes. In fact the third line seems to make no
difference in the result of the query.
Why is that? How else can I achieve my objective?
I think the problem is with operator precedence in the boolean expression.
That is, AND is evaluated before OR (but after parentheses), so what you have (simplified down) is:
WHERE (<id check 1>) OR (<id check 2>) AND <not pattern>
The AND grouping gets evaluated first, so it behaves like:
WHERE (<id check 1>) OR ((<id check 2>) AND <not pattern>)
so as long as the first id check evaluates to true, then the entire WHERE clause comes out as true.
To fix this, add parentheses around the ID predicates like so:
WHERE ((ID(u)=10 AND ID(n)=20) OR (ID(u)=20 AND ID(n)=10))
AND NOT (t)<-[:PREVIOUS]-()
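For completeness, the whole query then reads (unchanged apart from the extra parentheses):
MATCH (u:User)<--(t:Trade)-->(n:User)
WHERE ((ID(u)=10 AND ID(n)=20) OR (ID(u)=20 AND ID(n)=10))
AND NOT (t)<-[:PREVIOUS]-()
RETURN t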
So I'm working with your typical "products with tags" data model, where Product nodes have an id property and Tag nodes have a name property. To my surprise, my query has far more db hits when I include the Product label.
I was profiling my queries on matches between products and lists of tags of various lengths.
According to Mark Needham and Petra Selmer (great talk at GraphConnect Europe 2016), adding the Product label to the query should drastically improve performance, since it restricts the search space of the query. Makes total sense. Curiously enough, at the beginning I had accidentally omitted the Product label. When I added it to the query, the db hit count almost doubled, going from 5803 to 10316!
Here is the query I was using:
PROFILE MATCH (product:Product)-[:TAGGED]->(tag:Tag)
WHERE tag.name IN ["tag_1","tag_2",..."tag_N"]
WITH product, COLLECT (tag.name) AS tags_list
RETURN product.id, tags_list;
Since I can't believe my eyes right now, here I share the plans that come out of the PROFILE statement:
With node label
https://drive.google.com/file/d/1dGmF_2zfKdGBtThm45MUUOkLSLCEHTYU/view?usp=sharing
Without node label
https://drive.google.com/file/d/1efZWK6gXzNB0tjcKyhGIRFo22bDV8WjP/view?usp=sharing
I tried removing the COLLECT operation at the end, but the query without the Product label still has fewer db hits, 9325 against 13837. I'm afraid I am really new to Neo4j and I might be missing something super obvious here. What could possibly cause the db hit count to rise when a node label is added?
The short answer is yes, when you have a label in your query, there will be db hits to filter on the label, so your eyes aren't deceiving you.
That said, not all db hits are equal. They are abstract units of db work, and label filtering is fairly lightweight.
There are times when you can leave a label off, and there are times where you really need to leave the label on.
If your model is such that only :Product nodes can be tagged like this, then you can leave out the label, as that's redundant. However, if there are other types of nodes that can be tagged, with :Product being only one of those, then you definitely need the :Product label in there for correctness.
The same applies to queries with longer paths: you may need to filter on node labels along the way, which should minimize the work needed for the rest of the expansions in the query, since you're only considering relevant paths with the correct labels in place.
Also, for some nodes in a pattern you may have properties on them, and having the label present allows the planner to consider using an index lookup, if an index is present on the label/property combination.
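For example, the query above filters on tag.name, so an index on that label/property combination would let the planner start from an index seek on :Tag instead of a scan. A minimal sketch, assuming Neo4j 4.x-or-later index syntax and that no such index exists yet:
// index on the :Tag(name) label/property combination
CREATE INDEX tag_name_index IF NOT EXISTS FOR (t:Tag) ON (t.name);
// with the index in place, the planner can consider an index seek on tag.name
PROFILE MATCH (product:Product)-[:TAGGED]->(tag:Tag)
WHERE tag.name IN ["tag_1","tag_2"]
RETURN product.id, tag.name;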
I am trying to perform the following query in one pass but I conclude that it is impossible and would furthermore lead to some form of "nested" structure which is never good news in terms of performance.
I may however be missing something here, so I thought I might ask.
The underlying data structure is a many-to-many relationship between two entities A<---0:*--->B
The end goal is to obtain how many times objects of entity B are assigned to objects of entity A within a specific time interval, as a percentage of total assignments.
It is exactly this latter part of the question that causes the headache.
Entity A contains an item_date field
Entity B contains an item_category field.
The presentation of the results can be expanded to a table whose columns are the distinct item_date and rows are the different item_category normalised counts. I am just mentioning this for clarity, the query does not have to return the results in that exact form.
My Attempt:
with 12*30*24*3600 as window_length, "1980-1-1" as start_date,
"1985-12-31" as end_date
unwind range(apoc.date.parse(start_date,"s","yyyy-MM-dd"),apoc.date.parse(end_date,"s","yyyy-MM-dd"),window_length) as date_step
match (a:A)<-[r:RELATOB]-(b:B)
where apoc.date.parse(a.item_date,"s","yyyy-MM-dd")>=date_step and apoc.date.parse(a.item_date,"s","yyyy-MM-dd")<(date_step+window_length)
with window_length, date_step, count(r) as total_count
unwind ["code_A", "code_B", "code_C"] as the_code
[MATCH THE PATTERN AGAIN TO COUNT SPECIFIC `item_code` this time]
I am finding it difficult to express this in one pass because it requires the equivalent of two independent GROUP BY-like clauses right after the definition of the graph pattern. You can't express these two in parallel, so you have to unwind them. My worry is that this leads to two evaluations: One for the total count and one for the partial count. The bit I am trying to optimise is some way of re-writing the query so that it does not have to count nodes it has "captured" before but this is very difficult with the implied way the aggregate functions are being applied to a set.
Basically, any attribute that is not an aggregate function becomes the stratification variable. I have to say here that a plain simple double stratification ("Grab everything, produce one level of count by item_date, produce another level of count by item_code") does not work for me because there is NO WAY to control the width of the window_length. This means that I cannot compare between two time periods with different rates of assignments of item_codes because the time periods are not equal :(
Please note that retrieving the counts of item_code and then normalising for the sum of those particular codes within a period of time (externally to cypher) would not lead to accurate percentages because the normalisation there would be with respect to that particular subset of item_code rather than the total.
Is there a way to perform a simultaneous count of r within a time period but then (somehow) re-use the already matched a,b subsets of nodes to now evaluate a partial count of those specific b's that (b:{item_code:the_code})-[r2:RELATOB]-(a) where a.item_date...?
If not, then I am going to move to the next fastest thing which is to perform two independent queries (one for the total count, one for the partials) and then do the division externally :/ .
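For reference, the two independent queries I have in mind are roughly of the following form (a sketch derived from my attempt above; the division into percentages then happens outside Cypher). First, the total count per window:
with 12*30*24*3600 as window_length, "1980-01-01" as start_date, "1985-12-31" as end_date
unwind range(apoc.date.parse(start_date,"s","yyyy-MM-dd"),apoc.date.parse(end_date,"s","yyyy-MM-dd"),window_length) as date_step
match (a:A)<-[r:RELATOB]-(b:B)
where apoc.date.parse(a.item_date,"s","yyyy-MM-dd")>=date_step and apoc.date.parse(a.item_date,"s","yyyy-MM-dd")<(date_step+window_length)
return date_step, count(r) as total_count
and then the partial counts, grouped additionally by item_category:
with 12*30*24*3600 as window_length, "1980-01-01" as start_date, "1985-12-31" as end_date
unwind range(apoc.date.parse(start_date,"s","yyyy-MM-dd"),apoc.date.parse(end_date,"s","yyyy-MM-dd"),window_length) as date_step
match (a:A)<-[r:RELATOB]-(b:B)
where apoc.date.parse(a.item_date,"s","yyyy-MM-dd")>=date_step and apoc.date.parse(a.item_date,"s","yyyy-MM-dd")<(date_step+window_length)
return date_step, b.item_category as the_code, count(r) as partial_count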
The solution proposed by Tomaz Bratanic in the comment is (I think) along these lines:
with 1*30*24*3600 as window_length,
"1980-01-01" as start_date,
"1985-12-31" as end_date
unwind range(apoc.date.parse(start_date,"s","yyyy-MM-dd"),apoc.date.parse(end_date,"s","yyyy-MM-dd"),window_length) as date_step
unwind ["code_A","code_B","code_C"] as the_code
match (a:A)<-[r:RELATOB]-(b:B)
where apoc.date.parse(a.item_date,"s","yyyy-MM-dd")>=date_step and apoc.date.parse(a.item_date,"s","yyyy-MM-dd")<(date_step+window_length)
return the_code, date_step, tofloat(sum(case when b.item_category=the_code then 1 else 0 end))/count(r) as perc_count order by date_step asc
This:
Is working
It does exactly what I was after (after some minor modifications)
It even fills in the missing values with zero, because that ELSE 0 effectively forces a zero even when no count data exists.
But in realistic conditions it is at least 30 seconds slower (no it is not, please see the edit) than what I am currently using, which re-matches. (And no, it is not because of the extra data that is now returned as the missing data are filled in; this is raw query time.)
I thought that it might be worth attaching the query plans here:
This is the plan for applying the same pattern twice (the fast way of doing it):
This is the plan for performing the count in one pass (the slow way of doing it):
I might look at how the time scales with the input data later on; maybe the two scale at different rates, but at this point the "one-pass" already seems slower than the "two-pass" and, frankly, I cannot see how it could get any faster with more data. This is already a simple count of 12 months over 3 categories distributed amongst 18k items (approximately).
Hope this might help others too.
EDIT:
While I had done this originally, there was another modification that I did not include: moving the second UNWIND AFTER the MATCH. This slashes the time to more than 20 seconds below the "double match", because the UNWIND now only affects the RETURN rather than triggering multiple executions of the same pattern. The query now becomes:
with 1*30*24*3600 as window_length,
"1980-01-01" as start_date,
"1985-12-31" as end_date
unwind range(apoc.date.parse(start_date,"s","yyyy-MM-dd"),apoc.date.parse(end_date,"s","yyyy-MM-dd"),window_length) as date_step
match (a:A)<-[r:RELATOB]-(b:B)
where apoc.date.parse(a.item_date,"s","yyyy-MM-dd")>=date_step and apoc.date.parse(a.item_date,"s","yyyy-MM-dd")<(date_step+window_length)
unwind ["code_A","code_B","code_C"] as the_code
return the_code, date_step, tofloat(sum(case when b.item_category=the_code then 1 else 0 end))/count(r) as perc_count order by date_step asc
And here is the execution plan for it too:
Original double match: approximately 55790 ms.
One pass (both unwinds BEFORE the match): 82306 ms.
One pass (second unwind after the match): 23461 ms.
I would like to know how I can prevent a duplicate entry (based on my own client/project definition of what that means-below), in an AppSheet mobile app connected to Google Sheets.
AppSheet talks a lot about UNIQUEID(), which they also encourage using and designating as the KEY field. row_number is another possibility.
This is fine for the KEY in the sense of its purpose is to be unique, meaningless, and uniquely identify a record, and relate to other tables.
However, it doesn't prevent a duplicate ("duplicate" again, as defined by my own client's business rules & process) from occurring. I mean, I assume UNIQUEID() theoretically would, but that's abstract theory, because it would only ever produce unique IDs anyway.
MY TABLE HAS THESE COLUMNS: [FACILITY NUMBER] and [TIMESTAMP] (date and time of event). We consider it a duplicate event, and want to DISALLOW the adding of such a record to this table, if the 2nd record has the same DATE (time irrelevant), with the same FACILITY. (We just do one facility per day, ever.)
In AppSheet how can I create some logic that disallows the add based on that criteria? I even basically know some ways I would do it; it just seems like I can't find a place to "put" it. I created an expression that evaluates perfectly to TRUE or FALSE and nothing else (by referencing whether or not the FACILITY NUMBER on the new record being added is in a SLICE which I've defined as today's entries). I wanted to place this expression in another (random) field's Valid_If. To me it seemed like that would meet the platform documentation: the other random field would be considered valid only if the expression evaluated to true. But instead AppSheet thought I wanted to convert the entire [other random column] to a dependent dropdown.
Please help! I will cry tears of joy when appsheet introduces FORM events and RECORD events that can be hooked into at the time of keying, saving, etc.
Surprised to see this question here on Stack Overflow --- most AppSheet questions are at http://community.appsheet.com.
The brief answer is that you are doing the right thing by providing a Valid_If constraint. Your constraint is of the form IN([_THIS], <list>), so AppSheet is doing the "smart" thing by automatically converting that list into a dropdown of allowed values. From your post, it appears that you may instead want to say NOT(IN([_THIS], <list>)) -- thereby saying that the value [_THIS] is valid as long as it is not in the list specified (making sure it is not a duplicate).
Old question, but in case someone stumbles upon the same:
The (not so simple) answer is given in https://help.appsheet.com/en/articles/961274-list-expressions-and-aggregates.
From the reference:
NOT(IN([_THIS], SELECT(Customers[State], NOT(IN([CustomerId], LIST([_THISROW].[CustomerId]))))))
When used as the Valid_If condition for the State column, it ensures that every customer has a unique value for State. In this example, we assume that CustomerId is the key for the Customers table.
This could be written more schematically like this:
NOT(IN([_THIS], SELECT(<TableName>[<UniqueColumnName>], NOT(IN([<KeyColumnName>], LIST([_THISROW].[<KeyColumnName>]))))))
Technically it says:
Get me a list of the current values of the <UniqueColumnName> column of the <TableName> table
Ignore the value of the current row (identified by [_THISROW] and looking into the <KeyColumnName> column)
Check if the given value exists in the resulting list
This statement has to be defined - with the correct values for <TableName>, <UniqueColumnName> & <KeyColumnName> - as the Valid_If statement.
We have items in our app that form a tree-like structure. You might have a pattern like the following:
(c:card)-[:child]->(subcard:card)-[:child]->(subsubcard:card) ... etc
Every time an operation is performed on a card (at any level), we'd like to record it. Here are some possible events:
The title of a card was updated by Bob
A comment was added by Kate mentioning Joe
The status of a card changed from pending to approved
The linked list approach seems popular but given the sorts of queries we'd like to perform, I'm not sure if it works the best for us.
Here are the main queries we will be running:
All of the activity associated with a particular card AND child cards, sorted by time of the event (basically we'd like to merge all of these activity feeds together)
All of the activity associated with a particular person sorted by time
On top of that we'd like to add filters like the following:
Filter by person involved
Filter by time period
It is also important to note that cards may be re-arranged very frequently. In other words, the parents may change.
Any ideas on how to best model something like this? Thanks!
I have a couple of suggestions, but I would suggest benchmarking them.
The linked list approach might be good if you could use the Java APIs (perhaps via an unmanaged extension for Neo4j). If the newest event in the list were the one attached to the card (and essentially the list was ordered by the date the events happened down the line), then if you're filtering by time you could terminate early when you've found an event which is earlier than the specified time.
Attaching the events directly to the card has the potential to lead you down into problems with supernodes/dense nodes. It would be the simplest to query for in Cypher, though. The problem is that Cypher will look at all of them before filtering. You could perhaps improve the performance of queries by, in addition to placing the date/time of the event on the event node, placing it on the relationships to the node ((:Card)-[:HAS_EVENT]->(:Event) or (:Event)-[:PERFORMED_BY]->(:Person)). Then when you query you can filter by the relationships so that it doesn't need to traverse to the nodes.
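As a rough sketch of that last idea (the relationship type, property names and parameters below are the assumed ones from this answer, not an established schema), duplicating the event date onto the :HAS_EVENT relationship lets the time filter be applied on the relationship itself:
// filter on the relationship property so events outside the window are skipped
// without needing to look at the :Event nodes themselves
MATCH (c:Card {uuid: $cardId})-[r:HAS_EVENT]->(event:Event)
WHERE r.date >= $from AND r.date <= $to
RETURN event
ORDER BY r.date DESC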
Regardless, it would probably be helpful to break up the query like so:
MATCH (c:Card {uuid: 'id_here'})-[:child*0..]->(child:Card)
WITH child
MATCH (child)-[:HAS_EVENT]->(event:Event)
I think that would mean that the MATCH is going to have fewer permutations of paths that it will need to evaluate.
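Putting the pieces together, a hedged sketch of the first query from the question (all activity for a card and its child cards, filtered by person and time period; the date property, person label and parameter names are assumptions on my part):
MATCH (c:Card {uuid: $cardId})-[:child*0..]->(child:Card)
WITH child
MATCH (child)-[:HAS_EVENT]->(event:Event)-[:PERFORMED_BY]->(person:Person)
WHERE event.date >= $from AND event.date <= $to
  AND person.name = $personName
RETURN child, event, person
ORDER BY event.date DESC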
Others are welcome to supplement my dubious advice as I've never really dealt with supernodes personally, just read about them ;)
I am used to having tables for ongoing activities from my former life as a relational DB guy. I am wondering how I would store ongoing information like transactions, logs or whatever in a Neo4j DB. Let's assume I have an account, which has been assigned to a user A:
(u:User {name:"A"})
I want to keep track on transactions he does, e.g. deducting or adding a value:
(t:Transaction {value:"-20", date:timestamp()})
Would I create a new node for every transaction and assign it to the user:
(u) -[r:changeBalance]-> (t)
In the end I might have lots of nodes assigned to the user, each holding one transaction, resulting in lots of nodes carrying only a single piece of information. I was pondering whether a query that limits to the last 50 transactions (LIMIT 50, sort by t.date) might still have to read all available transaction nodes to build the full sort order before the limit applies - this seems a bit unperformant.
How would you model a list of actions in a Neo4j DB? Any hint is very much appreciated.
If you used a simple query like the following, you would NOT be reading all Transaction nodes per User.
MATCH (u:User)-[r:ChangeBalance]->(t:Transaction)
RETURN u, t
ORDER BY t.date;
You'd only be reading the Transaction nodes that are directly related to each User (via a ChangeBalance relationship). So, the performance would not be as bad as you are afraid it might be.
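To get just the latest 50 for a specific user (the scenario in the question), a sketch along the same lines would be:
MATCH (u:User {name: "A"})-[r:ChangeBalance]->(t:Transaction)
RETURN t
ORDER BY t.date DESC
LIMIT 50;
Note that this still sorts all of that user's transactions before the LIMIT is applied; the answer below shows a model that avoids even that.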
Although everything is fine with your query - you are reading only the transactions that are related to this specific user - this approach can be improved.
Let's imagine that, for some reason, your application runs for 5 years and you have a user with 10 transactions per day. That results in ~18250 transactions connected to a single node.
This is not a great idea from a data model perspective. In this case, if you want to filter the result (get the 50 latest transactions) on some non-indexed field, it will result in a full traversal of all 18250 nodes.
This can be solved by adding additional relationships to the database.
Currently you have a graph like this: (user)-[:HAS]->(transaction)
              ( user )
             /    |    \
(transaction1) (transaction2) (transaction3)
You can add an additional relationship between transactions to specify the sequence of events.
Like that: (transaction)-[:NEXT]->(transaction)
              ( user )
             /    |    \
(transaction1)-(transaction2)-(transaction3)
Note: there is no need for an additional PREVIOUS relationship, because Neo4j stores relationship pointers in both directions, so traversing backwards can be done at the same speed as forwards.
And maintain relationships to the first and last user transactions:
(user)-[:LAST_TRANSACTION]->(transaction)
(user)-[:FIRST_TRANSACTION]->(transaction)
This allows you to get the last transaction in 1 hop, and then the latest 50 with an additional 50 hops.
So, by adding some complexity, you can traverse and manipulate your data in more efficient ways.
This idea comes from the EventStore database (and similar ones).
Moreover, with such a data model the user balance can be aggregated by walking the sequence of transactions. This gives you a nice and fast way to get the user balance at any point in time.
Getting the latest 50 transactions in this model can look like this:
MATCH (user:User {id: 1}) WITH user
MATCH (user)-[:LAST_TRANSACTION]->(last_transaction:Transaction) WITH last_transaction
MATCH (last_transaction)<-[:NEXT*0..50]-(transactions:Transaction)
RETURN transactions
Getting the total user balance can be:
MATCH (user:User {id: 1}) WITH user
MATCH (user)-[:FIRST_TRANSACTION]->(first_transaction:Transaction) WITH first_transaction
MATCH (first_transaction)-[:NEXT*]->(transactions:Transaction)
RETURN first_transaction.value + sum(transactions.value)
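To maintain this model when a new transaction arrives, the append could look roughly like this (a sketch; it assumes the user already has a LAST_TRANSACTION pointer to move, and reuses the example values from the question):
// create the new transaction, chain it after the current last one,
// and move the LAST_TRANSACTION pointer onto it
MATCH (u:User {id: 1})-[old:LAST_TRANSACTION]->(prev:Transaction)
CREATE (t:Transaction {value: -20, date: timestamp()})
CREATE (prev)-[:NEXT]->(t)
CREATE (u)-[:LAST_TRANSACTION]->(t)
DELETE old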