Let's say you have a mnesia table replicated on nodes A and B. If on node C, which does not contain a copy of the table, I do mnesia:change_config(extra_db_nodes, [NodeA, NodeB]), and then on node C I do mnesia:dirty_read(user, bob) how does node C choose which node's copy of the table to execute a query on?
According to my own research, the answer to the question is: it will choose the most recently connected node. I would be grateful if you pointed out any errors you find; mnesia is a really complex system!
As Dan Gudmundsson pointed out on the mailing list, the algorithm for selecting the remote node to query is defined in mnesia_lib:set_remote_where_to_read/2. It is the following:
set_remote_where_to_read(Tab, Ignore) ->
    Active = val({Tab, active_replicas}),
    Valid =
        case mnesia_recover:get_master_nodes(Tab) of
            [] -> Active;
            Masters -> mnesia_lib:intersect(Masters, Active)
        end,
    Available = mnesia_lib:intersect(val({current, db_nodes}), Valid -- Ignore),
    DiscOnlyC = val({Tab, disc_only_copies}),
    Prefered = Available -- DiscOnlyC,
    if
        Prefered /= [] ->
            set({Tab, where_to_read}, hd(Prefered));
        Available /= [] ->
            set({Tab, where_to_read}, hd(Available));
        true ->
            set({Tab, where_to_read}, nowhere)
    end.
So it takes the list of active_replicas (i.e. the list of candidates), optionally shrinks it to the master nodes for the table, removes the nodes to be ignored (for whatever reason), shrinks the list to the currently connected nodes, and then selects in the following order:
First non-disc_only_copies
Any available node
The most important part is in fact the list of active_replicas, since it determines the order of nodes in the list of candidates.
The list of active_replicas is formed by remote calls of mnesia_controller:add_active_replica/* from newly connected nodes to old nodes (i.e. those that were in the cluster before), which boils down to the function add/1, which adds the item at the head of the list.
Hence the answer to the question is: it will choose the most recently connected node.
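To make the reasoning above easier to follow, here is a small Python model of the selection logic. It is only a sketch, not mnesia code: the function names add_active_replica and where_to_read are made up for illustration, and it assumes that the intersections keep the order of active_replicas and that add/1 moves an already-present node to the head.

```python
def add_active_replica(active, node):
    """Model of add/1: the newly connected node becomes the head of the list.

    Assumption: a node that is already present is moved, not duplicated.
    """
    return [node] + [n for n in active if n != node]


def where_to_read(active, masters, connected, ignore, disc_only):
    """Apply the same filtering steps as set_remote_where_to_read/2."""
    # Optionally shrink the candidates to the master nodes of the table.
    valid = [n for n in active if n in masters] if masters else list(active)
    # Drop ignored nodes and nodes that are not currently connected.
    available = [n for n in valid if n in connected and n not in ignore]
    # Prefer replicas that are not disc_only_copies.
    preferred = [n for n in available if n not in disc_only]
    if preferred:
        return preferred[0]
    if available:
        return available[0]
    return "nowhere"


# Node A joins first, node B joins later, so B ends up at the head.
replicas = add_active_replica([], "a@host")
replicas = add_active_replica(replicas, "b@host")
chosen = where_to_read(replicas, [], {"a@host", "b@host"}, [], [])
```

In this model chosen is "b@host", the most recently connected node, which matches the conclusion above.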
Notes:
To check out the list of active replicas on the given node you can use this (dirty hack) code:
[ {T,X} || {{T,active_replicas}, X} <- ets:tab2list(mnesia_gvar) ].
Well, node C would need to contact either node A or node B in order to do a query. Thus node C will have to decide itself which table copy to execute the query on.
If you need something more than this you would either need to have some algorithm which will decide which node to query on, or even replicate the table on node C (this would typically depend on what kind of characteristics you want / need).
If node A and node B form or are part of a database cluster, a good start is probably the round robin algorithm (or random, as you suggest).
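As a rough illustration of the round robin and random strategies mentioned above, here is a minimal Python sketch. The class ReplicaPicker and the node names are hypothetical; the sketch only decides which node to ask and does not talk to mnesia at all.

```python
import itertools
import random


class ReplicaPicker:
    """Chooses which replica node to query next (illustrative only)."""

    def __init__(self, nodes):
        if not nodes:
            raise ValueError("at least one node is required")
        self._nodes = list(nodes)
        self._cycle = itertools.cycle(self._nodes)

    def round_robin(self):
        # Hand out the nodes in a fixed rotation: A, B, A, B, ...
        return next(self._cycle)

    def random_choice(self):
        # Pick any node with equal probability.
        return random.choice(self._nodes)


picker = ReplicaPicker(["a@host", "b@host"])
first_four = [picker.round_robin() for _ in range(4)]
```

Round robin spreads the reads evenly, while the random variant needs no shared state between callers. Neither reacts to a node going down, so a real setup would also have to drop unreachable nodes from the list.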
I'm lost and tried everything I can think of. Maybe you can help me.
I'm trying to find all dependencies for a given software package. In this special case I'm working with the Node.js / JavaScript ecosystem and scraped the whole npm registry. My data model is simple, I've got packages and a package can have multiple versions.
In my database I have 113.339.030 dependency relationships and 19.753.269 versions.
My whole code works fine until I found a package that has so many dependencies (direct and transitive) that all my queries break down. It's called react-scripts. Here you can see the package information.
https://registry.npmjs.org/react-scripts
One visualizer never finishes
https://npm.anvaka.com/#/view/2d/react-scripts
and another one creates a dependency graph so big it's hard to analyze.
https://npmgraph.js.org/?q=react-scripts
At first I tried PostgreSQL with recursive common table expression.
with recursive cte as (
    select
        child_id
    from
        dependencies
    where
        dependencies.parent_id = 16674850
    union
    select
        dependencies.child_id
    from
        cte
    left join dependencies on
        cte.child_id = dependencies.parent_id
    where
        cte.child_id is not null
)
select * from cte;
That returns 1.726 elements which seems to be OK. https://deps.dev/npm/react-scripts/4.0.3/dependencies returns 1.445 dependencies.
However I'd like to get the path to the nodes and that doesn't work well with PostgreSQL and UNION. You'd have to use UNION ALL but the query will be much more complicated and slower. That's why I thought I'd give Neo4j a chance.
My nodes have the properties
version_id: integer
name: string
version: string
I'm starting with what I thought would be a simple query but it's already failing.
Start with version that has version_id 16674850 and give me all its dependencies.
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..11]->(b:Version)
return DISTINCT b;
I have an index on version_id.
CREATE INDEX FOR (version:Version) ON (version.version_id)
That works until I set the maximum depth of the variable-length pattern to 12 or greater.
Then the query runs forever. Here is the query plan.
Neo4j runs inside Docker. I've increased some memory settings.
- NEO4J_dbms_memory_heap_initial__size=2G
- NEO4J_dbms_memory_heap_max__size=2G
- NEO4J_dbms_memory_pagecache_size=1G
Any ideas? I'm really lost right now and don't want to give up on my "software dependency analysis graph".
I spent the last 6 weeks on this problem.
Thank you very much!
Edit 28/09/2021
I uploaded a sample data set. Here are the links
https://s3.amazonaws.com/blog.spolytics.com/versions.csv (737.1 MB)
https://s3.amazonaws.com/blog.spolytics.com/dependencies.csv (1.7 GB)
Here is the script to import the data.
neo4j-admin import \
--database=deps \
--skip-bad-relationships \
--id-type=INTEGER \
--nodes=Version=import/versions.csv \
--relationships=DEPENDS_ON=import/dependencies.csv
That might help to do some experiments on your side and to reproduce my problem.
The trouble here is that Cypher is interested in finding all possible paths that match a pattern. That can make it problematic for cases when you just want distinct reachable nodes, where you really don't care about expanding to every distinct path, but just finding nodes and ignoring any alternate paths leading to nodes already visited.
Also, the planner is making a bad choice with that cartesian product plan, that can make the problem worse.
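The difference between enumerating paths and collecting distinct reachable nodes can be seen on a toy graph. The Python sketch below is not how Neo4j or APOC work internally; reachable_nodes, count_paths and layered_graph are hypothetical helpers, and the graph is a made-up example in which every package depends on the same two packages in the next layer.

```python
from collections import deque


def reachable_nodes(graph, start, max_level):
    """Breadth-first search that visits each node once, ignoring alternate paths."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth >= max_level:
            continue
        for child in graph.get(node, ()):
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    seen.discard(start)
    return seen


def count_paths(graph, start, max_level):
    """Count every distinct path of length 1..max_level starting at start."""
    if max_level == 0:
        return 0
    return sum(1 + count_paths(graph, child, max_level - 1)
               for child in graph.get(start, ()))


def layered_graph(layers):
    """Two nodes per layer, each depending on both nodes of the next layer."""
    graph = {"root": ["a0", "b0"]}
    for i in range(layers - 1):
        for name in (f"a{i}", f"b{i}"):
            graph[name] = [f"a{i + 1}", f"b{i + 1}"]
    return graph


graph = layered_graph(10)
nodes = reachable_nodes(graph, "root", 10)
paths = count_paths(graph, "root", 10)
```

With only 20 reachable nodes there are already 2046 paths, and the path count doubles with every extra layer while the node count grows by two. That kind of growth is a plausible reason why the query stops finishing once the depth limit is raised.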
I'd recommend using APOC Procedures for this, as there are procs that are optimized for expanding to distinct nodes and ignoring paths to those already visited. apoc.path.subgraphNodes() is the procedure.
Here's an example of use:
MATCH (a:Version {version_id: 16674850})
CALL apoc.path.subgraphNodes(a, {relationshipFilter:'DEPENDS_ON>', labelFilter:'>Version', maxLevel:11}) YIELD node as b
RETURN b
The arrow in the relationship filter indicates direction, and since it's pointing right it refers to traversing outgoing relationships. If we were interested in traversing incoming relationships instead, we would have the arrow at the start of the relationship name, pointing to the left.
For the label filter, the prefixed > means the label is an end label, meaning that we are only interested in returning the nodes of that given label.
You can remove the maxLevel config property if you want it to be an unbounded expansion.
More options and details here:
https://neo4j.com/labs/apoc/4.1/graph-querying/expand-subgraph-nodes/
I don't have a large dataset like yours, but I think you could bring the number of paths down by filtering the b nodes. Does this work as a start?
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..11]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
UNWIND nodes(p) AS node
return COUNT(DISTINCT node)
To check if you can return longer paths, you could do
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN count(p)
Now if that works, I would do :
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN p LIMIT 10
to see whether the paths are correct.
Sometimes UNWIND causes an issue. To get the set of unique nodes, you could also try APOC:
MATCH p = (a:Version {version_id: 16674850})-[:DEPENDS_ON*..12]->(b:Version)
WHERE NOT EXISTS ((b)-[:DEPENDS_ON]->())
RETURN apoc.coll.toSet(
apoc.coll.flatten(
COLLECT(nodes(p))
)
) AS uniqueNodes
We create multiple graphs based on versions of a program (A,B in my example)
(:ProgNode {compileUnit:RL105A, nodeKey:100, captureDate:1/1/1} )
(:ProgNode {compileUnit:RL105B, nodeKey:200, captureDate:2/2/2} )
these fan out into full blown graphs with thousands of nodes. We also have a single node (:ProgUnit{compileUnit:RL105})
that is a "master" node for that program. We want to link the first node of each individual subgraph (the lowest nodeKey ) to the master. My current query looks like this
MATCH (p:ProgNode) where p.compileUnit = 'RL105A' WITH min(p.nodeKey) as low_node
Match (j:ProgUnit) where j.compileUnit = 'RL105'
Create (j)-[r:RELEASE]->(p)
A and B will eventually be dates but for now, letters
This works (sort of), but instead of linking the master to the subgraph, it seems to create a new node which isn't anything.
I know I will have to run this 2 times to build both links (A, B) and that's not an issue.
Thoughts? What am I doing wrong here?
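Your WITH clause did not include p as a term, so p went out of scope and became an unbound variable again. The CREATE clause therefore creates a brand-new, empty node for p instead of reusing the ProgNode you matched.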
The following query should create a RELEASE relationship to the ProgNode whose compileUnit starts with "RL105" and whose nodeKey has the lowest value:
MATCH (p:ProgNode) WHERE p.compileUnit STARTS WITH 'RL105'
WITH p ORDER BY p.nodeKey LIMIT 1
MATCH (j:ProgUnit) WHERE j.compileUnit = 'RL105'
CREATE (j)-[:RELEASE]->(p)
Use MERGE instead of CREATE if you need to avoid creating duplicate relationships.
We have a set of nodes that are connected. Each node has a link to the next node in the chain. When the chain runs out, that end node just hangs out there. See the graphic below.
Node path
Each of these nodes has the same level, so as long as they are in the chain, they have the same number. What I am hoping to do is come up with a Cypher query that builds a link between the node with the max ID and the node with the min ID that share the same line number, basically connecting the end with the beginning. Is there a clever way to do this?
Your question lacks some clarity, but what about thinking along the lines below?
// find all levels in your dataset of nodes in the chains
MATCH (n)
WHERE (n)-[:NEXT]-()
WITH COLLECT(DISTINCT n.level) AS levels
UNWIND levels AS level
// for each level, find the chain
MATCH (start {level:level})-[:NEXT*]->(end {level:level})
WHERE NOT (
({level:level})-[:NEXT]->(start)
OR
(end)-[:NEXT]->({level:level})
)
// connect end to start
MERGE (end)-[:MYRELTYPE]->(start)
I'm new to Neo4j. I've read a couple of tutorials, but I am stuck on finding all paths from a node until it reaches another node where the status changes, following a different path each time.
I've made a picture:
Starting from the node at the top, I would like to find all nodes T that have status=1 and we move from node of type O to T with a 'o' relationship and from T to O with 'i' relationships. If we reach a node T with status = 0 then we go the 'i' relationship and check if T status = 1 etc
I don't know the depth of the graph. I've found in the manual that we can use [r*1..], but I am not sure how to use it here.
I have tried
match (o1:O)-[:o]-(t:T), (t)-[:i]-(o2:O)-[:o]-(t2:T)
return o1, t, o2, t2
for the first level, but I don't know how to do it with an unknown depth and keep going deeper as long as the status is not 1.
Your schema looks like so (the question mark means I'm not sure what relationship you wanted there).
(:O)<-[:o]-(:T)<-[:i]-(:O)<-[:o]-(:T)<-[:?]-(:T)
You need to somehow identify the first node from which you start, and I'm not sure exactly what nodes you are trying to get from the schema, but something like this would return all nodes with status 1 that are somehow connected to first node, which here is just identified by having status 0 (so might actually be more than one node).
MATCH (firstnode:O {Status: 0})<-[:o|:i*..]-(othernodes) WHERE othernodes.Status=1 RETURN othernodes
But be warned: an unbounded variable-length pattern such as *.. can take a very long time to run on a large graph.
Why is the time complexity of node deletion in doubly linked lists (O(1)) faster than node deletion in singly linked lists (O(n))?
The problem assumes that the node to be deleted is known and a pointer to that node is available.
In order to delete a node and connect the previous and the next node together, you need to know their pointers. In a doubly-linked list, both pointers are available in the node that is to be deleted. The time complexity is constant in this case, i.e., O(1).
Whereas in a singly-linked list, the pointer to the previous node is unknown and can be found only by traversing the list from head until it reaches the node that has a next node pointer to the node that is to be deleted. The time complexity in this case is O(n).
In cases where the node to be deleted is known only by value, the list has to be searched and the time complexity becomes O(n) in both singly- and doubly-linked lists.
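The two cases described above can be put side by side in a short Python sketch. The Node class and the function names unlink_doubly and unlink_singly are made up for this illustration; both functions take a reference to the node to delete and return the (possibly new) head.

```python
class Node:
    def __init__(self, value):
        self.value = value
        self.prev = None  # left unused by the singly linked variant
        self.next = None


def unlink_doubly(head, node):
    """O(1): both neighbours are reachable from the node itself."""
    if node.prev is not None:
        node.prev.next = node.next
    else:
        head = node.next  # the node was the head
    if node.next is not None:
        node.next.prev = node.prev
    node.prev = node.next = None
    return head


def unlink_singly(head, node):
    """O(n): walk from the head until the predecessor is found."""
    if head is node:
        new_head = node.next
        node.next = None
        return new_head
    current = head
    while current is not None and current.next is not node:
        current = current.next
    if current is None:
        raise ValueError("node is not in the list")
    current.next = node.next
    node.next = None
    return head
```

The only loop is in unlink_singly, and it exists solely to find the predecessor. That search is the whole difference between O(1) and O(n).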
Actually deletion in singly linked lists can also be implemented in O(1).
Given a singly linked list with the following state:
SinglyLinkedList:
Node 1 -> Node 2
Node 2 -> Node 3
Node 3 -> Node 4
Node 4 -> None
Head = Node 1
We can implement delete Node 2 in such a way:
Node 2 Value <- Node 3 Value
Node 2 -> Node 4
Here we replace the value of Node 2 with the value of its next node (Node 3) and set its next pointer to the next pointer of Node 3 (Node 4), skipping over the now effectively "duplicate" Node 3. Thus no traversal is needed.
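A minimal Python sketch of this copy-the-successor trick follows. The names Node and delete_by_copy are hypothetical. Note one limitation that the sketch makes explicit: the last node has no successor to copy from, so it cannot be removed this way, and any outside reference to the successor node becomes stale.

```python
class Node:
    def __init__(self, value, next=None):
        self.value = value
        self.next = next


def delete_by_copy(node):
    """Remove node's value from the list in O(1) without knowing its predecessor."""
    successor = node.next
    if successor is None:
        # The tail has nothing to copy from; this case still needs a traversal.
        raise ValueError("cannot delete the tail node this way")
    node.value = successor.value   # Node 2 Value <- Node 3 Value
    node.next = successor.next     # Node 2 -> Node 4
    successor.next = None          # detach the now redundant node


# Build 1 -> 2 -> 3 -> 4 and delete the node holding 2.
n4 = Node(4)
n3 = Node(3, n4)
n2 = Node(2, n3)
n1 = Node(1, n2)
delete_by_copy(n2)
```

After the call the list reads 1 -> 3 -> 4, although the object that used to hold 2 is still in the list and the object that used to hold 3 is the one that was dropped.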
Because you can't look backwards...
Insertion and deletion at a known position is O(1). However, finding that position is O(n), unless it is the head or tail of the list.
When we talk about insertion and deletion complexity, we generally assume we already know where that's going to occur.
It has to do with the complexity of fixing up the next pointer in the node previous to the one you're deleting.
Unless the element to be deleted is the head (or first) node, we need to traverse to the node before the one to be deleted. Hence, in the worst case, i.e., when we need to delete the last node, the pointer has to go all the way to the second-to-last node, traversing (n-1) positions, which gives us a time complexity of O(n).
I don't think it's O(1) unless you know the address of the node that has to be deleted. Otherwise, don't you have to loop from the head to reach it?
It is O(1) provided you have the address of the node to be deleted, because that gives you its prev and next links.
As all the necessary links are available, just take the "node of interest" out of the list by rearranging the links and then free() it.
But in a singly linked list you have to traverse from the head to get the previous node's address, whether you are given the address of the node to be deleted or its position (1st, 2nd, 10th, etc.).
Suppose there is a linked list from 1 to 10 and you have to delete node 5 whose location is given to you.
1 -> 2 -> 3 -> 4 -> 5-> 6-> 7-> 8 -> 9 -> 10
You will have to connect the next pointer of 4 to 6 in order to delete 5.
Doubly Linked list
You can use the previous pointer on 5 to go to 4. Then you can do
4->next = 5->next;
or
Node* temp = givenNode->prev;
temp->next = givenNode->next;
if (givenNode->next != NULL)
    givenNode->next->prev = temp; /* keep the back link consistent */
Time Complexity = O(1)
Singly Linked list
Since you don't have a previous pointer in a singly linked list, you can't go backwards, so you will have to traverse the list from the head
Node* temp = head;
while (temp->next != givenNode)
{
    temp = temp->next;
}
temp->next = givenNode->next;
Time Complexity = O(N)
In an LRU cache design, deletion from a doubly linked list takes O(1) time. An LRU cache is implemented with a hash map and a doubly linked list. The doubly linked list stores the values, and the hash map stores pointers to the linked-list nodes.
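The combination of hash map and doubly linked list can be sketched in Python as follows. This is one possible layout, not a reference implementation: the class names, the use of two sentinel nodes, and the choice to return None on a miss are all assumptions made to keep the example short.

```python
class _Node:
    def __init__(self, key=None, value=None):
        self.key = key
        self.value = value
        self.prev = None
        self.next = None


class LRUCache:
    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.index = {}        # hash map: key -> list node
        self.head = _Node()    # sentinel on the most recently used side
        self.tail = _Node()    # sentinel on the least recently used side
        self.head.next = self.tail
        self.tail.prev = self.head

    def _unlink(self, node):
        # O(1): the node carries both neighbour pointers.
        node.prev.next = node.next
        node.next.prev = node.prev

    def _push_front(self, node):
        node.prev = self.head
        node.next = self.head.next
        self.head.next.prev = node
        self.head.next = node

    def get(self, key):
        node = self.index.get(key)
        if node is None:
            return None
        self._unlink(node)       # cache hit: move to the front
        self._push_front(node)
        return node.value

    def put(self, key, value):
        node = self.index.get(key)
        if node is not None:
            node.value = value
            self._unlink(node)
        else:
            if len(self.index) >= self.capacity:
                oldest = self.tail.prev   # least recently used entry
                self._unlink(oldest)
                del self.index[oldest.key]
            node = _Node(key, value)
            self.index[key] = node
        self._push_front(node)
```

The sentinels mean that every real node always has both a prev and a next, so the unlink step needs no None checks.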
On a cache hit, we have to move the element to the front of the list. Even if the node is somewhere in the middle of the doubly linked list, the hash map gives us its pointer in O(1) time, so we can delete it as follows. First save its neighbours:
next_temp=retrieved_node.next
prev_temp=retrieved_node.prev
then set the pointers to None
retrieved_node.next=None
retrieved_node.prev=None
and then you can reconnect the missing parts of the linked list
prev_temp.next=next_temp