Distributing in-memory linked list

Distributing in-memory linked list - linked-list

I have a program which is build based on a singly linked list. There are different programs which creates some form of data and this data sent to this linked list module to be added. As long as I've RAM available, program working as intended. Periodically -about every year-, I archive the entire linked list to the disk -due to requirement, I'm archiving all-. So far so good.
What happens if I wanted to add new node to the list whilst RAM is full and I haven't archived and freed the memory on RAM? This might occur when producer count goes up or regardless of producer count, there may be more data created depending or where it's used etc. I couldn't find a clear solution the scale the on-memory linked list. There is a workaround in my head but don't know even if it works so I thought better to ask here.
When the RAM start to get almost full, I would create a new instance
of the linked list program -just another machine on the cloud or new
physical computer on premise, whatever -.
I do have an service discovery module -something like ZooKeeper-, this discovery module will detect the newly created machine and adds to the list.
When first instance is almost in it's limits, it will check if there is an available instance, if there is; it will relay the node to the next instance and it will update its last node's next pointer to something special. If you wanted to traverse the list from start to finish across all the machines every time you come to this special node, it will have the information of the which machine has the next node. Traversal will continue from the next machine that the last node points to.
Since this this not a hash map or something in that nature, I can't just add replicate the service and for example relay the incoming request based on a given key to a particular machine.
Rather than archiving part of the old data and loading that to the RAM and continuing on like that, I thought it would be better to have a last pointer to point to a different machine and continue reading from that machine. My choice for a network call seemed better because this program will be used in a intranet, but still I couldn't find a solid solution on paper.
Is there a such example that I can study on and try to find a better solution? Is this solution feasible?
An example:
Machine 1:
1st node : [data:x, *next: 2nd Node address],
2nd node : [data:123, *next: 3rd Node address],
...
// at this point RAM is almost full
// receive next instance's ip
(n-1)th node : [data:987, *next: nth Node address],
nth node : [data:x2t, type: LastNodeInMachine, *next: nullptr]
Machine 2:
1st node == (n+1) node : [data:x, *next: 2nd Node address],
... and so on

Related

Stream de-duplication on Dataflow | Running services on Dataflow services

I want to de-dupe a stream of data based on an ID in a windowed fashion. The stream we receive has and we want to remove data with matching within N-hour time windows. A straight-forward approach is to use an external key-store (BigTable or something similar) where we look-up for keys and write if required but our qps is extremely large making maintaining such a service pretty hard. The alternative approach I came up with was to groupBy within a timewindow so that all data for a user within a time-window falls within the same group and then, in each group, we use a separate key-store service where we look up for duplicates by the key. So, I have a few questions about this approach
[1] If I run a groupBy transform, is there any guarantee that each group will be processed in the same slave? If guaranteed, we can group by the userid and then within each group compare the sessionid for each user
[2] If it is feasible, my next question is to whether we can run such other services in each of the slave machines that run the job - in the example above, I would like to have a local Redis running which can then be used by each group to look up or write an ID too.
The idea seems off what Dataflow is supposed to do but I believe such use cases should be common - so if there is a better model to approach this problem, I am looking forward to that too. We essentially want to avoid external lookups as much as possible given the amount of data we have.

1) In the Dataflow model, there is no guarantee that the same machine will see all the groups across windows for the key. Imagine that a VM dies or new VMs are added and work is split across them for scaling.
2) Your welcome to run other services on the Dataflow VMs since they are general purpose but note that you will have to contend with resource requirements of the other applications on the host potentially causing out of memory issues.
Note that you may want to take a look at RemoveDuplicates and use that if it fits your usecase.
It also seems like you might want to be using session windows to dedupe elements. You would call:
PCollection<T> pc = ...;
PCollection<T> windowed_pc = pc.apply(
Window<T>into(Sessions.withGapDuration(Duration.standardMinutes(N hours))));
Each new element will keep extending the length of the window so it won't close until the gap closes. If you also apply an AfterCount speculative trigger of 1 with an AfterWatermark trigger on a downstream GroupByKey. The trigger would fire as soon as it could which would be once it has seen at least one element and then once more when the session closes. After the GroupByKey you would have a DoFn that filters out an element which isn't an early firing based upon the pane information ([3], [4]).
DoFn(T -> KV<session key, T>)
|
\|/
Window.into(Session window)
|
\|/
Group by key
|
\|/
DoFn(Filter based upon pane information)
It is sort of unclear from your description, can you provide more details?

Sorry for not being clear. I gave the setup you mentioned a try, except for the early and late firings part, and it is working on smaller samples. I have a couple of follow up questions, related to scaling this up. Also, I was hoping I could give you more information on what the exact scenario is.
So, we have incoming data stream, each item of which can be uniquely identified by their fields. We also know that duplicates occur pretty far apart and for now, we care about those within a 6 hour window. And regarding the volume of data, we have atleast 100K events every second, which span across a million different users - so within this 6 hour window, we could get a few billion events into the pipeline.
Given this background, my questions are
[1] For the sessioning to happen by key, I should run it on something like
PCollection<KV<key, T>> windowed_pc = pc.apply(
Window<KV<key,T>>into(Sessions.withGapDuration(Duration.standardMinutes(6 hours))));
where key is a combination of the 3 ids I had mentioned earlier. Based on the definition of Sessions, only if I run it on this KV would I be able to manage sessions per-key. This would mean that Dataflow would have too many open sessions at any given time waiting for them to close and I was worried if it would scale or I would run into any bottle-necks.
[2] Once I perform Sessioning as above, I have already removed the duplicates based on the firings since I will only care about the first firing in each session which already destroys duplicates. I no longer need the RemoveDuplicates transform which I found was a combination of (WithKeys, Combine.PerKey, Values) transforms in order, essentially performing the same operation. Is this the right assumption to make?
[3] If the solution in [1] going to be a problem, the alternative is to reduce the key for sessioning to be just user-id, session-id ignoring the sequence-id and then, running a RemoveDuplicates on top of each resulting window by sequence-id. This might reduce the number of open sessions but still would leave a lot of open sessions (#users * #sessions per user) which can easily run into millions. FWIW, I dont think we can session only by user-id since then the session might never close as different sessions for same user could keep coming in and also determining the session gap in this scenario becomes infeasible.
Hope my problem is a little more clear this time. Please let me know any of my approaches make the best use of Dataflow or if I am missing something.
Thanks

I tried out this solution at a larger scale and as long as I provide sufficient workers and disks, the pipeline scales well although I am seeing a different problem now.
After this sessionization, I run a Combine.perKey on the key and then perform a ParDo which looks into c.pane().getTiming() and only rejects anything other than an EARLY firing. I tried counting both EARLY and ONTIME firings in this ParDo and it looks like the ontime-panes are actually deduped more precisely than the early ones. I mean, the #early-firings still has some duplicates whereas the #ontime-firings is less than that and has more duplicates removed. Is there any reason this could happen? Also, is my approach towards deduping using a Combine+ParDo the right one or could I do something better?
events.apply(
WithKeys.<String, EventInfo>of(new SerializableFunction<EventInfo, String>() {
#Override
public java.lang.String apply(EventInfo input) {
return input.getUniqueKey();
}
})
)
.apply(
Window.named("sessioner").<KV<String, EventInfo>>into(
Sessions.withGapDuration(mSessionGap)
)
.triggering(
AfterWatermark.pastEndOfWindow()
.withEarlyFirings(AfterPane.elementCountAtLeast(1))
)
.withAllowedLateness(Duration.ZERO)
.accumulatingFiredPanes()
);

How can I know when a memory request is forwarded to another numa node, which node is it?

When a memory access happens in node A, but it is a remote access, which is forwarded to node B, through QuickPath Interconnect controller.
Different node has different range of memory address, so of course I can use the memory address to identify this.
If I don't know the memory address, can I use some hardware register or performance counter to do this?

If you don't have address, use perf framework to collect overall statistics (events you seek to are called node-*). Tool will show number of misses/overall events for loads, stores and prefetches.
perf may be called from userspace, in per-thread collection mode, thus you may use rdpmc assembly instruction and read individual performance counter. I.e. read counter before and after memory access and calculate difference.
I created a small sample using my old code, but I can't test it right now :(
Here it is: https://gist.github.com/myaut/cd67ea5143615264b2e6
If you have address, you may use page_zone() and virt_to_page() kernel functions to get nodeid for address (where ptr is virtual address):
struct zone* z = page_zone(virt_to_page((void*) ptr));
return z->node;
I used this to track memory accesses in kernel using SystemTap

Missing master heartbeat does not cause node to react in a CANopen system

I have a strange finding about the heartbeat-protocol in CANopen. Maybe somebody else has seen something like this and maybe it is supposed to work like this... Anyway, here's what it's about:
In CANopen there are two timeout-based life-guarding mechanisms: the first is node guarding, which I will not mention further, since it's considered old news.
The other one is called heartbeat. It is pretty simple: Any participant on the network sends a regular message stating its node ID and its state. The frequency is defined by object 0x1017sub0 and is called heartbeat-producer-time. If it is set to zero, no heartbeat is being sent.
Any other participant can then define a number of nodes it wants to find on the network plus the maximum time there may be between two consecutive heartbeat-messages. This information is stored in object 0x1016sub1..n as 32-bit entries for as many nodes as this particular node wants to listen to.
The entries consist of the node ID (bits 22 to 16) and the mentioned maximum time that may elaps between heartbeats, called the heartbeat-consumer-time (in bits 15..0). Again if the entry is zero, it is being ignored.
As you may have gathered, there is no distinction between network-master (node ID 1) and slaves (node IDs 2 to 127).
So far the theory, now for my problem:
I configure one of the slave-nodes in my network as a heartbeat-consumer for the master, so there's an entry in object 0x1016sub1 that looks like this: 0x000107D0. Meaning that a heartbeat-message from the master is expected after at least two seconds.
I have observed that this works in two examples. If I send a master-heartbeat for a time and then stop, the node either returns to pre-operational mode or sends an appropriate emergency-message.
If I don't send any master-heartbeat-messages, I would expect that after I start the node (send it into operational mode) it takes at most two seconds for the node to either return to pre-operational mode or send an appropriate emergency-message or perhaps even both. But in the two examples I tried, nothing happened. If I never send any heartbeat, the node never expects one and just keeps on running.
The two examples are very different from each other. I am not sure whether they use the same CANopen-stack library perhaps.
Is there an explanation?

If you read CANopen User Manual, section 1.3.1.6, page 39, you will notice that the heartbeat consumer is first activated upon receiving a heartbeat from the producer. I would assume then that, since in your example the first heartbeat is never sent, the consumer is not activated.

The impact of a distributed application configuration on node discovery via net_adm:ping/0

I am experiencing different behavior with respect to net_adm:ping/1 when being done in the context of a Distributed Application.
I have an application that pings a well-known node on start-up and in that way discovers all nodes in a mesh of connected nodes.
When I start this application on a single node (non-distributed configuration), the net_adm:ping/1 followed by a nodes/0 reports 4 other nodes (this is correct). The 4 nodes are on 2 different physical machines, so what is returned is the following n1#machine_1, n2#machine_2, n3#machine_2, n4#machine_1 (ip addresses are actually returned, not machine_x).
When part of a two-node distributed application, on the node where the application starts, the net_adm:ping/1 followed by a nodes/0 reports 2 nodes, one from each machine(n1#machine1, n2#machine2). A second call to nodes/0 after about a 750 ms delay results in the correct 5 nodes being found. Two of the three missing nodes are required for my application to work and so, not finding them, the application dies.
I am using R15B02
Is latency regarding the transitive node-discovery process known to be different when some of the nodes in the mesh are participating in distributed application configuration?

The kernel application documentation mentions the way to synchronize nodes in order to stop the boot phase until ready to move forward and everything is in place. Here are the options:
sync_nodes_mandatory = [NodeName]
Specifies which other nodes must be alive in order for this node to start properly. If some node in the list does not start within the specified time, this node will not start either. If this parameter is undefined, it defaults to [].
sync_nodes_optional = [NodeName]
Specifies which other nodes can be alive in order for this node to start properly. If some node in this list does not start within the specified time, this node starts anyway. If this parameter is undefined, it defaults to the empty list.
A file using them could look as follows:
[{kernel,
[{sync_nodes_mandatory, [b#ferdmbp, c#ferdmbp]},
{sync_nodes_timeout, 30000}]
}].
Starting the node a#ferdmbp by calling erl -sname a -config config-file-above. The downside of this approach is that each node needs its own config file.

how do I remove an extra node

I have a group of erlang nodes that are replicating their data through Mnesia's "extra_db_nodes"... I need to upgrade hardware and software so I have to detach some nodes as I make my way from node to node.
How does one remove a node and still preserve the data that was inserted?
[update] removing nodes is as important as adding them. Over time as your cluster grows it must also contract. If not then Mnesia is going to be busy trying to send data to nonexistent nodes filling up queues and keeping the network busy.
[final update] after pouring through the erlang/mnesia source code I was able to determine that it is not possible to completely disassociate nodes. While del_table_copy removes the linkage between tables it is incomplete. I would close this question but none of the close descriptions are adequate.

I wish I had found this a long time ago: http://weblambdazero.blogspot.com/2008/08/erlang-tips-and-tricks-mnesia.html
basically, with a properly functioning cluster....
login to the cluster to be removed
stop mnesia
mnesia:stop().
login to a different node on the cluster
delete the schema
mnesia:del_table_copy(schema, node#host.domain).

I'm extremely late to the party, but came across this info in the doc when looking for a solution to the same problem:
"The function call
mnesia:del_table_copy(schema,
mynode#host) deletes the node
'mynode#host' from the Mnesia system.
The call fails if mnesia is running on
'mynode#host'. The other mnesia nodes
will never try to connect to that node
again. Note, if there is a disc
resident schema on the node
'mynode#host', the entire mnesia
directory should be deleted. This can
be done with mnesia:delete_schema/1.
If mnesia is started again on the the
node 'mynode#host' and the directory
has not been cleared, mnesia's
behaviour is undefined."
(http://www.erlang.org/doc/apps/mnesia/Mnesia_chap5.html#id74278)
I think the following might do what you desire:
AllTables = mnesia:system_info(tables),
DataTables = lists:filter(fun(Table) -> Table =/= schema end,
AllTables),
RemoveTableCopy = fun(Table,Node) ->
Nodes = mnesia:table_info(Table,ram_copies) ++
mnesia:table_info(Table,disc_copies) ++
mnesia:table_info(Table,disc_only_copies),
case lists:member(Node,Nodes) of
true -> mnesia:del_table_copy(Table,Node);
false -> ok
end
end,
[RemoveTableCopy(Tbl,'gone#gone_host') || Tbl <- DataTables].
rpc:call('gone#gone_host',mnesia,stop,[]),
rpc:call('gone#gone_host',mnesia,delete_schema,[SchemaDir]),
RemoveTablecopy(schema,'gone#gone_host').
Though, I haven't tested it since my scenario is slightly different.

I've certainly used this method to perform this (supporting the mnesia:del_table_copy/2 use). See removeNode/1 below:
-module(tool_bootstrap).
-export([bootstrapNewNode/1, closedownNode/0,
finalBootstrap/0, removeNode/1]).
-include_lib("records.hrl").
-include_lib("stdlib/include/qlc.hrl").
bootstrapNewNode(Node) ->
%% Make the given node part of the family and start the cloud on it
mnesia:change_config(extra_db_nodes, [Node]),
%% Now make the other node set things up
rpc:call(Node, tool_bootstrap, finalBootstrap, []).
removeNode(Node) ->
rpc:call(Node, tool_bootstrap, closedownNode, []),
mnesia:del_table_copy(schema, Node).
finalBootstrap() ->
%% Code removed to actually copy over my tables etc...
application:start(cloud).
closedownNode() ->
application:stop(cloud), mnesia:stop().

If you have replicated the table (added table copies) on nodes other than the one you're removing, then you're already fine - just remove the node.
If you wanted to be slightly tidier you'd delete the table copies from the node you're about to remove first via mnesia:del_table_copy/2.
Generally, mnesia gracefully handles node loss and detects node rejoin (rebooted nodes obtain new table copies from nodes that kept running, nodes that didn't reboot are detected as a network partition event). Mnesia does not consume CPU or network traffic for nodes that have gone down. I think, though I haven't confirmed it in the source, mnesia won't reconnect to nodes that have gone down automatically - the node that goes down is expected to reboot (mnesia) and reconnect.
mnesia:add_table_copy/3, mnesia:move_table_copy/3 and mnesia:del_table_copy/2 are the functions you should look at for live schema management.
The extra_db_nodes parameter should only be used when initialising a new DB node - once a new node has a copy of the schema it doesn't need the extra_db_nodes parameter.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart