What is the meaning of `explicitlyDeleted=false` in LDBC? - neo4j

I am looking at the LDBC benchmark which has contributions from Neo4j and TigerGraph. I want to understand how entries are ingested to measure performance.
Here are two example entries from "Person_likes_Post".
{"creationDate":1296583977045,"deletionDate":1577664000000,"explicitlyDeleted":false,"PersonId":13194139533355,"PostId":412316861128}
{"creationDate":1296750065049,"deletionDate":1296750075058,"explicitlyDeleted":true,"PersonId":13194139533355,"PostId":412316861129}
Does it mean only the edge is deleted when "explicitlyDeleted":true?
When "explicitlyDeleted":false, does it mean the source node is deleted, the destination node is deleted, or both?
Link to the benchmark doc:
https://ldbcouncil.org/ldbc_snb_docs/ldbc-snb-specification.pdf
Download link to the example LDBC dataset containing these entries:
https://ldbcouncil.org/ldbc_snb_datagen_spark/social-network-sf0.003-bi-composite-merged-fk.zip
(I wanted to tag LDBC but there is no such option.)

The explicitlyDeleted attribute indicates whether there is a delete operation that targets specifically the given entity (i.e. a node or edge in the graph). This distinction is needed because the LDBC SNB workloads have cascading deletes where the deletion of an entity may trigger the deletion of other entities.
For example, a Person_likes_Post edge can be deleted due to various explicit delete operations:
1. an explicit delete operation targeting the single Person_likes_Post edge itself
2. an explicit delete operation targeting its source Person
3. an explicit delete operation targeting its target Post
4. an explicit delete operation targeting a Forum that contains its target Post
5. an explicit delete operation targeting a Person whose Album/Wall (which are Forum subtypes) contains its target Post
For the Person_likes_Post edge, the explicitlyDeleted attribute is true in case 1, and false for the other cases.
Note that this attribute is only part of the raw data set. The data sets used for the actual workload executions (Interactive, BI) only contain explicit delete operations, hence they omit this attribute.
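As a concrete illustration, here is a minimal sketch (Python, with a made-up input file name) that reads such Person_likes_Post entries from a JSON-lines file and separates explicitly deleted edges from those that only disappear through a cascade; the field name matches the example entries above.

import json

explicit, cascaded = [], []
# "person_likes_post.json" is a hypothetical JSON-lines file with one
# entry per line, like the two examples shown in the question.
with open("person_likes_post.json") as f:
    for line in f:
        edge = json.loads(line)
        # true  -> a delete operation targets this very edge
        # false -> the edge is removed only because deleting a Person, Post
        #          or containing Forum cascades down to it
        if edge["explicitlyDeleted"]:
            explicit.append(edge)
        else:
            cascaded.append(edge)

print(len(explicit), "explicitly deleted,", len(cascaded), "cascade-deleted")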

Related

Neo4j - Which is better: to store an element as a property of the user, or as a node & relationship?

I have a problem designing a graph model with a million users. I need to store whether each user is registered or not.
As I see it, we have 2 options:
Store a property "register = true/false" on each user node. So with 1 million users, we have 1 million "register" properties.
Store a single Registered node and create a relationship to it only for registered users. Then the number of relationships equals exactly the number of registered users.
Which option is better for search performance, and which needs the least storage?
Thanks in advance,
Modeling your data as a graph is a difficult thing to pin down exactly. Typically, when it comes to NoSQL databases, the most important thing to consider is how you will be using your data, and to model it based on that.
Using the external node might run into performance problems, as Neo4j typically starts to run into issues when traversing a node that approaches around 10,000 relationships. You will be well above that limit with an external "Registered" node; on the other hand, as long as you are not anchoring your search to that node, it should be okay.
No matter which route you go, the query you described in the comments will likely anchor on (start with) the user, then traverse to who their friends are, and then for each friend, it will check whether it
A. has the "registered" property set to 'true'
B. has a relationship to the "Registered" node.
Each of these methods appears to have a similar execution time, and indexing on the "registered" property will have negligible impact because it is not being used as an anchor (presumably; you would have to PROFILE your query with both methods to find out for sure). So, like you mentioned, one might consider the space restraints.
Besides that, there is not much difference from a performance analysis perspective between the two methods that I can see.
A third option, mentioned by @InverseFalcon, is to use an additional label, ':Registered', on those nodes that are registered. This might well result in a faster comparison time than keeping it in a property, as labels will be inlined in the node store and can be checked there, whereas properties might have an additional level of indirection to the property store.
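For comparison, here is a rough sketch of the three variants as Cypher queries you can PROFILE against your own data (the label and relationship names User, FRIEND, IS and Registered are assumptions for illustration, not taken from the question):

# PROFILE each of these in the Neo4j browser/shell and compare the plans;
# the schema names used here are illustrative assumptions.
queries = {
    # Option 1: a boolean property on every user node
    "property": "PROFILE MATCH (u:User {name: 'Alice'})-[:FRIEND]->(f) "
                "WHERE f.register = true RETURN f",
    # Option 2: one shared Registered node that registered users link to
    "shared_node": "PROFILE MATCH (u:User {name: 'Alice'})-[:FRIEND]->(f) "
                   "WHERE (f)-[:IS]->(:Registered) RETURN f",
    # Option 3: the extra :Registered label suggested in the last paragraph
    "label": "PROFILE MATCH (u:User {name: 'Alice'})-[:FRIEND]->(f:Registered) "
             "RETURN f",
}
for name, q in queries.items():
    print(name, "->", q)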

LRU cache with a singly linked list

Most LRU cache tutorials emphasize using both a doubly linked list and a dictionary in combination. The dictionary holds both the value and a reference to the corresponding node on the linked list.
When we perform a remove operation, we look up the linked-list node in the dictionary and then have to remove it from the list.
Now here's where it gets weird. Most tutorials argue that we need the preceding node in order to remove the current node from the linked list. This is done in order to get O(1) time.
However, there is a way to remove a node from a singly linked list in O(1) time, described here: copy the next node's value into the current node and then delete the next node.
My question is: why do all these tutorials implement an LRU cache with a doubly linked list when we could save constant space by using a singly linked list?
You are correct: a singly linked list can be used instead of the doubly linked list, as can be seen here:
The standard way is a hashmap pointing into a doubly linked list to make delete easy. To do it with a singly linked list without using an O(n) search, have the hashmap point to the preceding node in the linked list (the predecessor of the one you care about, or null if the element is at the front).
Retrieve list node:
node = hashmap(key) ? hashmap(key)->next : list.head
Delete (when hashmap(key) is null the node is at the head, so use list.head in the same way):
node = hashmap(key)->next
successornode = node->next
hashmap(successornode->key) = hashmap(key)   // the successor inherits our predecessor
hashmap(key)->next = successornode
hashmap.delete(key)
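To make that sketch concrete, here is a minimal runnable version in Python (illustrative only; class and method names are my own, and there is no thread safety): the dict maps each key to the predecessor of its list node, so lookup, "move to most recently used" and eviction all stay O(1) on a singly linked list.

class Node:
    __slots__ = ("key", "value", "next")
    def __init__(self, key, value):
        self.key, self.value, self.next = key, value, None

class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.pred = {}      # key -> predecessor node (None if the node is at the head)
        self.head = None    # least recently used end
        self.tail = None    # most recently used end

    def _unlink(self, key):
        # Remove the node for key from the list in O(1) via its predecessor.
        p = self.pred[key]
        node = self.head if p is None else p.next
        succ = node.next
        if p is None:
            self.head = succ
        else:
            p.next = succ
        if succ is not None:
            self.pred[succ.key] = p   # successor inherits our predecessor
        else:
            self.tail = p             # we removed the tail
        node.next = None
        return node

    def _append(self, node):
        # Attach node at the most recently used end.
        if self.tail is None:
            self.head = self.tail = node
            self.pred[node.key] = None
        else:
            self.tail.next = node
            self.pred[node.key] = self.tail
            self.tail = node

    def get(self, key):
        if key not in self.pred:
            return None
        node = self._unlink(key)
        self._append(node)            # mark as most recently used
        return node.value

    def put(self, key, value):
        if key in self.pred:
            node = self._unlink(key)
            node.value = value
            self._append(node)
            return
        if len(self.pred) >= self.capacity:
            lru = self._unlink(self.head.key)   # evict the least recently used entry
            del self.pred[lru.key]
        self._append(Node(key, value))

cache = LRUCache(2)
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")       # "a" becomes most recently used
cache.put("c", 3)    # evicts "b"
assert cache.get("b") is None and cache.get("a") == 1 and cache.get("c") == 3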
Why is the doubly linked list so common in LRU solutions, then? It is easier to understand and use.
If optimization is an issue, then the trade-off of the slightly less simple singly-linked-list solution is definitely worth it.
There are a few complications with swapping the payload:
The payload could be large (such as buffers).
Part of the application code may still refer to the payload (have it pinned).
There may be locks or mutexes involved (which can be owned by the DLL/hash nodes and/or the payload).
In any case, modifying the doubly linked list touches at most 2*2 pointers, whereas swapping the payload needs (a memcpy for the swap plus) walking the hash chain twice, which could need access to any node in the structure.

Is there a race condition when creating unique paths?

I recently discovered that a race condition exists when executing concurrent MERGE statements. Specifically, duplicate nodes can be created in the scenario where a node is created after the MATCH step but before the CREATE step of a given MERGE.
This can be worked around in some instances using unique constraints on the merged nodes; however, this falls short in scenarios where:
There is no single unique property to enforce (e.g. pairs of properties need to be unique but individual ones don't).
You are trying to merge relationships or paths.
Does using CREATE UNIQUE solve this problem (or do the same pitfalls exist)? If so, is it the only option? It feels like the usefulness of MERGE is fairly heavily diminished when it effectively can't guarantee the uniqueness of the path or node being merged...
When MERGE statements are executed concurrently, these situations may occur. Basically, each transaction gets a view of the graph at the first point of reading, and won't see updates made after that point (with some variations). The main exception to this is uniquely constrained nodes, where Neo4j will initialise a fresh reader from the index when reading, regardless of what was previously read in the transaction.
A workaround could be to create a 'dummy' property and add a unique constraint on it together with one of the node labels. In Neo4j 2.2.5, this should work around your problem.
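A rough sketch of that workaround, assuming the Neo4j 2.x transactional HTTP endpoint and made-up label/property names (Pair, pairKey): the two properties that only need to be unique as a pair are combined into one synthetic key, and the unique constraint on that key forces concurrent MERGEs of the same pair to serialize instead of creating duplicates.

import requests  # talks to the transactional Cypher endpoint of a local Neo4j 2.x

ENDPOINT = "http://localhost:7474/db/data/transaction/commit"

def run(statement, parameters=None):
    payload = {"statements": [{"statement": statement,
                               "parameters": parameters or {}}]}
    response = requests.post(ENDPOINT, json=payload)
    response.raise_for_status()
    return response.json()

# One-time setup: the unique constraint that makes concurrent MERGEs safe.
run("CREATE CONSTRAINT ON (p:Pair) ASSERT p.pairKey IS UNIQUE")

# Every writer MERGEs on the constrained synthetic key built from both properties.
run("MERGE (p:Pair {pairKey: {key}}) "
    "ON CREATE SET p.first = {first}, p.second = {second} "
    "RETURN id(p)",
    {"key": "foo|bar", "first": "foo", "second": "bar"})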

Visual Studio REST API Iteration and Area ID's

I am working with the VSO REST API and have a question on how Iteration and Area IDs are assigned. Specifically, why is it that when I assign a work item to the root Iteration or Area, the ID returned for the WIT is not returned when I query the classification nodes?
For example, imagine I have this hierarchy when I query /DefaultCollection/my project/_apis/wit/classificationnodes?$depth=2
My Project: id=1234
Area 1: id=5678
Area 2: id= 9012
And I then query for a work item using /DefaultCollection/_apis/wit/workItems/1?$expand=all
If the work item is in Area 1 or Area 2, the System.AreaId field is as expected (5678 and 9012, respectively). However, if I assign the work item to My Project, the System.AreaId is some value that is not included when I query for all classification nodes. There appears to be some kind of relationship between the IDs, as they are serial (e.g. the ID returned by the classification node query is 1232 for the area and 1233 for the iteration), but I can't seem to find a way to query for the actual ID returned by the work item query.
In fact, not only is the ID returned for a work item not present when I query for all classification nodes; if I assign the work item to both the root iteration and the root area, both fields return the same value, which is likewise not included in the classification node query.
What I need is a way to look at a work item and figure out the area and iteration it belongs to. I could probably do something with the path field strings that are returned, but that seems error prone since users can change them.
Edit:
The above appears to be a bug in the REST API, but for anyone who comes across this post there is a way to get a usable iteration ID by the path string. Structure your REST call like so:
/DefaultCollection/[Project Name]/_apis/wit/classificationnodes/iterations/Release 1/Sprint 1 (etc.)
I have never done this with ID. I only use the path. In the classification service you can get the node by path easily enough.
For example, using the REST API - you can access this url to get the data about a specific iteration:
/DefaultCollection/[Project Name]/_apis/wit/classificationnodes/iterations/[Release X]/[Sprint Y]
Note that trying to access the default iteration path (the project name instead of a specific iteration) will return an error:
/DefaultCollection/[Project Name]/_apis/wit/classificationnodes/iterations/[Project Name]
will give :
{"$id":"1","innerException":null,"message":"VS402485: The node name is not recognized: [Project Name]","typeName":"Microsoft.TeamFoundation.WorkItemTracking.Server.Metadata.WorkItemTrackingTreeNodeNotFoundException, Microsoft.TeamFoundation.WorkItemTracking.Server","typeKey":"WorkItemTrackingTreeNodeNotFoundException","errorCode":0,"eventId":3200}
So if you do batch work, you have to filter those before querying the api.
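For anyone scripting this, here is a hedged sketch in Python of the path-based lookup described above (account URL, project name and credentials are placeholders): read System.IterationPath from the work item, strip the leading project name, and ask the classificationnodes endpoint for the remaining path.

import requests

BASE = "https://myaccount.visualstudio.com/DefaultCollection"  # placeholder account
PROJECT = "My Project"                                         # placeholder project
AUTH = ("user", "personal-access-token")                       # placeholder credentials

# 1. Fetch the work item and read its iteration path, e.g. "My Project\Release 1\Sprint 1".
wi = requests.get(BASE + "/_apis/wit/workItems/1?api-version=1.0", auth=AUTH).json()
iteration_path = wi["fields"]["System.IterationPath"]

# 2. Drop the project name and turn the remaining backslash-separated segments
#    into URL segments for the classification nodes endpoint.
parts = iteration_path.split("\\")
relative = "/".join(parts[1:])  # empty if the work item sits on the root iteration

if relative:
    node = requests.get(
        "{0}/{1}/_apis/wit/classificationnodes/iterations/{2}?api-version=1.0".format(
            BASE, PROJECT, relative),
        auth=AUTH).json()
    print(node["id"], node["name"])
else:
    # Root iteration: as noted above, querying the project name itself returns VS402485,
    # so this case has to be handled separately (e.g. via the ?$depth=2 root node).
    print("work item is on the root iteration")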
There are three ways of identifying an Area (everything I post applies equally to Iterations): the Path (string), the ID (int) and a GUID. Each of them is used in different ways and has different ramifications.
For example, renaming an Area does NOT change its identity and therefore does not update a work item (the path returned in a work item is dynamic).
It is also possible to delete and recreate an identical path, but it will have a different ID.
The GUID is used primarily for the Excel reports (such as those that are part of the SharePoint portal).
How you want things to react determines the appropriate element to use.
I have not seen any issues with the ID that you mention, and if you could create a simple repro, I would be glad to look at it.
david(dot)corbin(at)dynconcepts(dot)com

Representing (and incrementing) relationship strength in Neo4j

I would like to represent the changing strength of relationships between nodes in a Neo4j graph.
For a static graph, this is easily done by setting a "strength" property on the relationship:
A --knows {strength: 3}--> B
However, for a graph that needs updating over time, there is a problem: incrementing the value of the property can't be done atomically (via the REST interface), because a read-before-write is required. Incrementing (rather than merely updating) is necessary if the graph is being updated in response to incoming streamed data.
I would need to either ensure that only one REST client reads and writes at once (external synchronization), or stick to only the embedded API so I can use the built-in transactions. This may be workable but seems awkward.
One other solution might be to record multiple relationships, without any properties, so that the "strength" is actually the count of relationships, i.e.
A knows B
A knows B
A knows B
means a relationship of strength 3.
Disadvantage: only integer strengths can be recorded
Advantage: no read-before-write is required
Disadvantage: (probably) more storage required
Disadvantage: (probably) much slower to extract the value since multiple relationships must be extracted and counted
Has anyone tried this approach, and is it likely to run into performance issues, particularly when reading?
Is there a better way to model this?
Nice idea.
To reduce storage and multi-reads, those relationships could be aggregated into one by a batch job that runs transactionally.
Each rel could also carry an individual weight value, whose aggregated value is used as weight. It doesn't have to be integer based and could also be negative to represent decrements.
You could also write a small server extension for updating a weight value on a single relationship transactionally. It would probably even make sense as an addition to the REST API (besides the "set single value" operation, a "modify single value" operation):
PUT http://localhost:7474/db/data/node/15/properties/mod/foo
The body would contain the delta value (e.g. 1.5 or -10). Another idea would be to replace the mod keyword with the actual operation:
PUT http://localhost:7474/db/data/node/15/properties/add/foo
PUT http://localhost:7474/db/data/node/15/properties/or/foo
PUT http://localhost:7474/db/data/node/15/properties/concat/foo
What would "increment" mean in a non integer case?
Hmm, a bit of a different approach, but you could consider using a queuing system. I'm using the Neo4j REST interface as well and am looking into storing a constantly changing relationship strength. The project is in Rails and using Resque. Whenever an update to the Neo4j database is required, it's thrown into a Resque queue to be completed by a worker. I only have one worker working on the Neo4j Resque queue, so it never tries to perform more than one Neo4j update at once.
This has the added benefit of not making the user wait for the neo4j updates when they perform an action that triggers an update. However, it is only a viable solution if you don't need to use/display the Neo4j updates instantly (though depending on the speed of your worker and the size of your queue, it should only take a few seconds).
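In the same spirit but without Rails/Resque, here is a minimal sketch of the idea in Python: all updates go into a queue and exactly one worker thread drains it, so increments against Neo4j are never issued concurrently. apply_update is a stand-in for whatever REST call performs the actual increment.

import queue
import threading

updates = queue.Queue()

def apply_update(update):
    # Stand-in for the real Neo4j REST call that performs one increment.
    print("applying", update)

def worker():
    while True:
        update = updates.get()
        if update is None:   # sentinel value to shut the worker down
            break
        apply_update(update)
        updates.task_done()

# A single worker guarantees updates are applied one at a time, in order.
threading.Thread(target=worker, daemon=True).start()

# Request handlers just enqueue and return immediately; the user never waits.
updates.put({"a": "A", "b": "B", "delta": 1})
updates.put({"a": "A", "b": "C", "delta": 2})
updates.join()               # optionally wait until the queue is drained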
Depends a bit on what read and write load you are targeting. How big is the total graph going to be?
