I have ServiceA, which produces DomainChangeEvents and commits them to a Kafka topic; ServiceB then consumes these events from the topic and applies the changes to a read model held in memory. Some DomainChangeEvents are reset events, which reset the domain to its starting point. On restart of ServiceB I want to read ChangeEvents from the last reset onwards and rebuild the domain from there.
ServiceB is launched in Docker as a replicated service.
Since I want all ChangeEvents in each replica of ServiceB, I cannot give them the same group.id, or the messages will be load-balanced and no replica will see all events. How can I configure ServiceB to continue from the latest reset event after a restart?
I tried setting a random group.id on ServiceB and committing the reset message after consuming it, but after a restart the group.id is different, so all messages are consumed from the start again.
I also thought about giving the Docker replicas different configurations, but from what I've read a Docker service is configured identically across all replicas, so that's not an option.
A possible solution would be storing the points you want your different consumers to start from, by manually persisting the offsets to, for example, a database (see the sketch after the table below).
A table that would look like:
Topic    Partition    Offset
topicA   0            112
topicA   1            125
topicB   0            2313
topicB   1            2984
topicB   2            2554
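For illustration, a minimal sketch of such a store, assuming a hypothetical reset_offsets table matching the layout above and a MySQL-style REPLACE INTO upsert:

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Persist the offset of each reset event when it is consumed, and
// load it again on startup to know where to seek.
public class ResetOffsetStore {
    private final Connection conn;

    public ResetOffsetStore(Connection conn) { this.conn = conn; }

    public void saveResetPoint(String topic, int partition, long offset) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "REPLACE INTO reset_offsets (topic, partition, offset) VALUES (?, ?, ?)")) {
            ps.setString(1, topic);
            ps.setInt(2, partition);
            ps.setLong(3, offset);
            ps.executeUpdate();
        }
    }

    public long loadResetPoint(String topic, int partition) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "SELECT offset FROM reset_offsets WHERE topic = ? AND partition = ?")) {
            ps.setString(1, topic);
            ps.setInt(2, partition);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getLong(1) : 0L; // no reset yet: start from the beginning
            }
        }
    }
}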
Those would be your "latest reset" points, i.e. the positions your consumers want to start from. The problem with the subscribe() method, as you correctly said, is that it depends on the group.id parameter and plays the consumer rebalancing and coordination game.
In order to consume from a fixed point (or a set of points in different partitions), you should call assign() instead. With this method you manually specify the list of partitions for your consumer: no group.id, no dynamic partition assignment, no offset loading, which is what you seem to need.
After assigning the partitions, you should call seek(). With seek() you tell the consumer at which offset it should start reading the partition that was specified in the assign() call.
For example, to start reading from the "latest resets" of any topic, you would do something like:
import java.util.Collections;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// Seek the consumer to a stored "reset" position,
// e.g. topicA's partition 0 at offset 112 from the table above.
public void setStartPosition(KafkaConsumer<String, String> consumer,
                             TopicPartition partition, long offset) {
    consumer.assign(Collections.singletonList(partition)); // e.g. partition 0
    consumer.seek(partition, offset);                      // e.g. offset 112
}
Calling this method will position your consumer at exactly the desired offset in each partition. I'm not really sure if I'm answering your issue, but I hope it helps!
I have a program built around a singly linked list. Various other programs create data of some form, and this data is sent to this linked-list module to be added. As long as I have RAM available, the program works as intended. Periodically (about once a year) I archive the entire linked list to disk; due to a requirement, I archive all of it. So far so good.
What happens if I want to add a new node to the list while RAM is full and I haven't yet archived and freed the memory? This might occur when the producer count goes up, or when, regardless of producer count, more data is created depending on where the program is used, etc. I couldn't find a clear solution for scaling the in-memory linked list. There is a workaround in my head, but I don't even know if it works, so I thought it better to ask here.
When the RAM starts to get almost full, I would create a new instance of the linked-list program: just another machine in the cloud, a new physical computer on premises, whatever.
I do have a service discovery module (something like ZooKeeper); it will detect the newly created machine and add it to the list.
When the first instance is almost at its limit, it will check whether an available instance exists; if there is one, it will relay the node to that instance and update its own last node's next pointer to something special. If you then traverse the list from start to finish across all the machines, every time you reach this special node it tells you which machine holds the next node, and traversal continues on that machine (see the sketch after the example below).
Since this is not a hash map or anything of that nature, I can't simply replicate the service and, for example, route an incoming request to a particular machine based on a key.
Rather than archiving part of the old data, loading it back into RAM, and continuing like that, I thought it would be better to have the last pointer point to a different machine and continue reading from there. A network call seemed acceptable because this program will be used on an intranet, but I still couldn't find a solid solution on paper.
Is there such an example that I can study to find a better solution? Is this solution feasible?
An example:
Machine 1:
1st node : [data:x, *next: 2nd Node address],
2nd node : [data:123, *next: 3rd Node address],
...
// at this point RAM is almost full
// receive next instance's ip
(n-1)th node : [data:987, *next: nth Node address],
nth node : [data:x2t, type: LastNodeInMachine, *next: nullptr]
Machine 2:
1st node == (n+1)th node : [data:x, *next: 2nd Node address],
... and so on
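In code, the idea would be roughly this (Node, RemoteLink, and the traversal are a hypothetical sketch of the "special node" idea, not a working distributed list):

// A node whose successor may live on another machine.
class Node {
    String data;
    Node next;          // next node on this machine, or null
    RemoteLink remote;  // set only on the last node of a machine ("LastNodeInMachine")
}

class RemoteLink {
    String host; // machine holding the (n+1)th node, learned via service discovery
    int port;
}

class ListTraversal {
    // Follow local pointers; when the chain ends with a remote link,
    // hand the traversal over to the next machine with a network call.
    void traverse(Node head) {
        for (Node n = head; n != null; n = n.next) {
            process(n.data);
            if (n.next == null && n.remote != null) {
                continueTraversalOn(n.remote.host, n.remote.port); // hypothetical RPC
            }
        }
    }

    void process(String data) { /* consume the payload */ }
    void continueTraversalOn(String host, int port) { /* hypothetical network call */ }
}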
I have two master nodes connected to the same CAN bus, and both send data to my PC.
first master ID = 0xFFA1
second master ID = 0xFFA2
Since the first master's ID is lower than the second's, it wins arbitration and takes control of the bus more often, which delays the second master's data.
Is there a way to load-balance the two nodes so that each sends an almost equal number of messages?
I tried making the first node alternate between the two IDs 0xFFA1 and 0xFFB2 while the second node sends with ID 0xFFB1, and it didn't help.
There is no such thing as a "master" in CAN, nor in higher-layer protocols like CANopen for that matter (a "master" in CANopen is just a supervisor node). Who gets to send what is defined by the CAN identifiers: CAN primarily focuses on data, not nodes. What matters is what is sent, rather than who is sending or receiving, since every message is broadcast.
It sounds as if you have two nodes that wildly spam the bus with identifier 0xFFA1 and 0xFFA2 messages as fast as they are able, leading to 100% bus load. The node sending 0xFFA2 will then "starve", since the lower identifier wins arbitration every time. Sending data "as fast as you are able" is never the correct way to use CAN.
Instead you need to define a higher-layer protocol that dictates real-time characteristics. In control systems this is most commonly done by having nodes send data at fixed intervals, such as once per 10 ms or 100 ms; a sketch of this follows below. This alone should fix your starvation problem.
If you want to prevent nodes from sending at the same time, you could provide a means for them to synchronize. A trick used in CANopen and other protocols is to have one node send out a "sync" message at given fixed time intervals.
After reading this sync message, all nodes should act within x ms of receiving it.
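As a sketch of "send at fixed intervals" (CanBus and its send() are hypothetical stand-ins for whatever CAN driver API you use):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Send one frame per fixed interval instead of "as fast as possible",
// so the bus never saturates and lower-priority IDs are not starved.
public class PeriodicSender {
    interface CanBus { void send(int id, byte[] data); }

    public static void start(CanBus bus, int id, long periodMs) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
                () -> bus.send(id, samplePayload()), // e.g. id = 0xFFA1, every 10 ms
                0, periodMs, TimeUnit.MILLISECONDS);
    }

    static byte[] samplePayload() { return new byte[8]; } // placeholder data
}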
According to the NVMe specification, the BAR has tail and head fields for each queue. For example:
Submission Queue y Tail Doorbell (SQyTDBL):
Start: 1000h + (2y * (4 << CAP.DSTRD))
End: 1003h + (2y * (4 << CAP.DSTRD))
Submission Queue y Head Doorbell (SQyHDBL):
Start: 1000h + ((2y + 1) * (4 << CAP.DSTRD))
End: 1003h + ((2y + 1) * (4 << CAP.DSTRD))
Are these the queues themselves, or merely pointers to them? If they are the queues themselves, I would assume DSTRD indicates the maximum length of all queues.
Moreover, the specification talks about two optional regions: Host Memory Buffer (HMB) and Controller Memory Buffer (CMB).
HMB: a region within the host's DRAM (PCIe root)
CMB: a region within the NVMe controller's DRAM (inside the SSD)
If both are optional, where are the queues located then? Since an endpoint PCIe device only works with BARs and PCI headers, I don't see any other place they might be located other than a BAR.
Sorry, but I am doing this from memory; I have implemented an FPGA NVMe host, so hopefully my memory is enough to answer your questions and more. If I get something wrong, at least you know why. I'll be giving reference sections from the specification, which you can find here: https://nvmexpress.org/wp-content/uploads/NVM-Express-1_4-2019.06.10-Ratified.pdf. Also, as a note before I really answer your question, I want to clear up some confusion: understanding the spec takes some time, and I honestly recommend reading it bottom to top; the last few sections give context for the first few, as strange as that sounds.
These are the submission and completion queue doorbells, specifically the submission queue tail and completion queue head respectively (Section 3.1). More on this later; I just wanted to correct the misconception that you, as the host, access the submission queue head. You do not; only the controller (traditionally the drive) does. A simple reminder: a submission is you asking the drive to do something, a completion is the drive telling you how it went. Read Section 7.2 for more info.
Before you can send anything to these queues, you must first set them up. Baseline, these queues do not exist in the system; you must use the admin queue to create them.
28h-2Fh  ASQ  Admin Submission Queue Base Address
30h-37h  ACQ  Admin Completion Queue Base Address
Your statement about DSTRD is a big misunderstanding. This field comes from the capabilities register (offset 0x0, Figure 3.1.1). It is the controller (drive) telling you the "doorbell stride", i.e. how many bytes sit between consecutive doorbells. I've never seen a drive report anything but 0 for this value, since, well, why would you want dead space between doorbell registers?
Please be careful with the size of your writes: in my experience most NVMe drives require writes of at least 2 dwords (8 bytes), even if you only intend to send 1 dword of data. Just a note.
On to actually helping you use this thing as a host: please see Section 7.6.1 for the initialization sequence. Notice that you must set up multiple registers, read certain parameters, and do other such things.
Assuming you or someone else has done the initialization, let me now answer the core of your question: how to use these queues. The thing is, this answer spans many sections of the spec and is the core of it, so I am going to break it down as best I can for a simple write command. Please note you CANNOT write until you have first created the queues using the admin queues, which use different opcodes from a different section of the spec; sorry, I cannot write all of this out.
STEPS TO WRITING DATA TO AN NVMe DRIVE.
When creating the submission queue you specify the size of that specific queue: the number of commands that can be placed in the queue at one time for processing. Along with this you specify the queue base address. For this example, let's assume you set the base address to 0x1000_0000 and the size to 16 (0x10). Figure 105 tells us that every submission queue entry is 64 bytes (0x40), so queue entry 0 is at 0x1000_0000, entry 1 at 0x1000_0040, entry 2 at 0x1000_0080, and so on for our 16 entries; after that it wraps back around.
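That address arithmetic, as a small sketch:

// Submission queue entry addresses for the example above:
// base 0x1000_0000, 16 entries, 64 bytes (0x40) per entry (Figure 105).
static long sqEntryAddress(long base, int index, int queueSize) {
    return base + (long) (index % queueSize) * 64; // wraps after the last entry
}
// sqEntryAddress(0x1000_0000L, 0, 16)  -> 0x1000_0000
// sqEntryAddress(0x1000_0000L, 2, 16)  -> 0x1000_0080
// sqEntryAddress(0x1000_0000L, 16, 16) -> 0x1000_0000 (wrapped around)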
You first store the data to be written; let's say you were given 512 bytes (0x200) of data. For simplicity, you place that data at 0x2000_0000-0x2000_0200.
You create the submission queue command. This is not a simple process, and I'm not going to document all of it for you, but understand that you should be referencing Figure 104, Figure 346, and Section 6.15. Even that is not enough: you will also need to understand PRP vs. SGL and decide which you are using (PRP is easier to start with), and NLB (number of logical blocks), which determines your write size. With NVMe you do not specify writes in bytes but in logical blocks, whose size is set by the controller (drive); it may implement multiple block sizes, but that is up to the drive, not you as the host, and you just pick from what it supports (Section 5.15.2.1, Figure 245). Look at Identify Namespace to tell you the LBA (logical block address) size; this will lead you down a rabbit hole to determine the actual size, but that's OK, the information is there.
OK, so you've finished this mess and created the submission command. Let's assume the host has already completed 2 commands on this queue (at start this will be 0; I'm picking 2 just to make the example clearer). What you now need to do is place this command at 0x1000_0080.
Now let's assume this is queue 1 (in the equation you posted, the queue number is the y value; note that queue 0 is the admin queue). What you now need to do is poke the controller's submission queue tail doorbell with the new tail position, i.e. one past the last entry you loaded (this is how you can queue several commands at once and only tell the drive when you are ready). In this case the new tail is 3, so you write the value 3 to register 0x1008.
At this point the drive goes "aha, the host has told me there are new commands to fetch". The controller goes to queue base address + command size * 2 and fetches 64 bytes of data, i.e. 1 command (address 0x1000_0080). The controller decodes this command as a write, which means the controller (drive) must read data from some address and put it where it was told to. Your write command thus tells the drive to go to address 0x2000_0000 and read 512 bytes of data, and it will, if you scope the PCIe bus. The drive then fills out a completion queue entry (16 bytes, specified in Section 4.6) and places it in the completion queue address you specified at queue creation (plus 0x20, since the first two entries are already used). Then the controller generates an MSI-X interrupt.
At this point you must go to wherever the completion queue lives and read the response to check the status; if you queued multiple submissions, also check the SQID to see what finished, since jobs can complete out of order. You then must write to the completion queue head doorbell (0x100C) to indicate that you have retrieved the completion entry (success or failure). Notice that you never interact with the submission queue head (that's up to the controller, since only it knows when a submission queue entry has been processed), and only the controller advances the completion queue tail, since only it can create new entries.
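Putting the doorbell addresses from the formulas you quoted into code (dstrd is CAP.DSTRD; the actual MMIO write into the BAR is left out):

// Doorbell offsets per the spec formulas: stride = 4 << CAP.DSTRD.
// Queue y's submission queue tail doorbell is at 0x1000 + (2y) * stride,
// its completion queue head doorbell at 0x1000 + (2y + 1) * stride.
static long sqTailDoorbell(int y, int dstrd) {
    return 0x1000L + (2L * y) * (4L << dstrd);
}
static long cqHeadDoorbell(int y, int dstrd) {
    return 0x1000L + (2L * y + 1) * (4L << dstrd);
}
// With dstrd = 0: queue 0 (admin) -> 0x1000 / 0x1004, queue 1 -> 0x1008 / 0x100C.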
I'm sorry this is so long and not well formatted, but hopefully you now have a slightly better understanding of NVMe. It's a bit of a mess at first, but once you get it, it all makes sense. Just remember that my example assumed you had created a queue, which baseline doesn't exist: first you need to set up the admin submission and completion queues (0x28 and 0x30), which have queue ID 0 and thus tail/head doorbells at addresses 0x1000 and 0x1004 respectively. You then must reference Section 5 to find the opcodes to make stuff happen, but I have faith you can figure it out from what I've given you. If you have any more questions, put a comment down and I'll see what I can do.
I'm trying to figure out the best, or at least a reasonable, approach to defining alerts in InfluxDB. For example, I might use the CPU batch TICKscript that comes with Telegraf. This could be set up as a global monitor/alert for all hosts monitored by Telegraf.
What is the approach when you want to deviate from that setup for one host, i.e. instead of alerting at X% for a specific server, we want to alert at Y%?
I'm happy that a distinct TICKscript could be created for the custom values, but how do I go about excluding the host from the original "global" one?
This is a simple scenario, but it needs to meet the needs of 10,000 hosts, of which there will be hundreds of exceptions, and it will also encompass tens or hundreds of global alert definitions.
I'm struggling to see how you could use the platform as the primary source of monitoring/alerting.
As said in the comments, you can use the sideload node to achieve that.
Say you want to ensure that your InfluxDB servers are not overloaded. You may want to allow 100 measurements by default; only on one server, which happens to receive a massive number of datapoints, do you want to limit it to 10 (a value that the _internal database easily exceeds, but good for our example).
Given the following excerpt from a tick script
var data = stream
    |from()
        .database(db)
        .retentionPolicy(rp)
        .measurement(measurement)
        .groupBy(groupBy)
        .where(whereFilter)
    |eval(lambda: "numMeasurements")
        .as('value')

var customized = data
    |sideload()
        .source('file:///etc/kapacitor/customizations/demo/')
        .order('hosts/host-{{.hostname}}.yaml')
        .field('maxNumMeasurements', 100)
    |log()

var trigger = customized
    |alert()
        .crit(lambda: "value" > "maxNumMeasurements")
and with the name of the server with the exception being influxdb and the file /etc/kapacitor/customizations/demo/hosts/host-influxdb.yaml looking as follows:
maxNumMeasurements: 10
a critical alert will be triggered if value (and hence numMeasurements) exceeds 10 while the hostname tag equals influxdb, or if value exceeds 100 on any other host.
There is an example in the documentation handling scheduled downtimes using sideload.
Furthermore, I have created an example, available on GitHub, using docker-compose.
Note that there is a caveat with the example: the alert flaps because of a second, dynamically generated database. But it should be sufficient to show how to approach the problem.
What is the cost of using sideload nodes in terms of performance and computation if you have over 10,000 servers?
Managing alerts manually in Chronograf/Kapacitor is not feasible for a large number of custom alerts.
At AMMP Technologies we need to manage alerts per database, customer, and customer object, and the number can go into the thousands. We've opted for a custom solution in which we keep a standard set of template TICKscripts (not to be confused with Kapacitor templates) and provide an interface to the user that exposes only the relevant variables. A service (written in Python) then combines the values of those variables with a TICKscript and deploys (updates, or deletes) the task on the Kapacitor server via the Kapacitor API. This is automated, so that data for new customers/objects is combined with the templates and deployed to Kapacitor without manual steps. A rough sketch of the deployment call follows below.
You obviously need to design your tasks to be specific enough that they don't overlap, and generic enough that creating tasks for every little thing isn't too much work.
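Our service is written in Python, but the deployment boils down to a plain HTTP call against Kapacitor's task API; here is a rough Java sketch of the same idea (host, task id, dbrp, and the script are example values, and error handling is omitted):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Deploy a rendered tickscript as a task via Kapacitor's REST API
// (POST /kapacitor/v1/tasks on the default port 9092).
public class TaskDeployer {
    public static void main(String[] args) throws Exception {
        String script = "stream |from().measurement('cpu') |alert().crit(lambda: \"usage_idle\" < 10)";
        String body = "{"
                + "\"id\": \"cpu-alert-customerA\","
                + "\"type\": \"stream\","
                + "\"dbrps\": [{\"db\": \"telegraf\", \"rp\": \"autogen\"}],"
                + "\"script\": \"" + script.replace("\"", "\\\"") + "\","
                + "\"status\": \"enabled\"}";

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9092/kapacitor/v1/tasks"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}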
I have a strange finding about the heartbeat protocol in CANopen. Maybe somebody else has seen something like this, and maybe it is supposed to work this way. Anyway, here's what it's about:
In CANopen there are two timeout-based life-guarding mechanisms. The first is node guarding, which I will not mention further, since it's considered old news.
The other is called heartbeat. It is pretty simple: any participant on the network sends a regular message stating its node ID and its state. The frequency is defined by object 0x1017sub0 and is called the heartbeat producer time. If it is set to zero, no heartbeat is sent.
Any other participant can then define a number of nodes it wants to find on the network, plus the maximum time there may be between two consecutive heartbeat messages. This information is stored in object 0x1016sub1..n as 32-bit entries, one for each node this particular node wants to listen to.
The entries consist of the node ID (bits 22 to 16) and the mentioned maximum time that may elapse between heartbeats, called the heartbeat consumer time (bits 15..0). Again, if an entry is zero, it is ignored.
As you may have gathered, there is no distinction between the network master (node ID 1) and slaves (node IDs 2 to 127).
So far for the theory; now for my problem:
I configure one of the slave nodes in my network as a heartbeat consumer for the master, so there's an entry in object 0x1016sub1 that looks like this: 0x000107D0, meaning that a heartbeat message from the master is expected at least every two seconds.
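The bit layout described above makes that entry easy to verify (a small sketch):

// Heartbeat consumer entry (object 0x1016sub1):
// node ID in bits 22..16, consumer time in milliseconds in bits 15..0.
static int consumerEntry(int nodeId, int timeMs) {
    return (nodeId << 16) | (timeMs & 0xFFFF);
}
// consumerEntry(1, 2000) == 0x000107D0:
// expect node 1's (the master's) heartbeat at least every 2000 ms.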
I have observed that this works in two examples: if I send a master heartbeat for a while and then stop, the node either returns to pre-operational mode or sends an appropriate emergency message.
If I never send any master heartbeat messages, I would expect that after I start the node (send it into operational mode) it takes at most two seconds for the node to return to pre-operational mode, send an appropriate emergency message, or perhaps both. But in the two examples I tried, nothing happened: if I never send any heartbeat, the node never expects one and just keeps running.
The two examples are very different from each other; I am not sure whether they use the same CANopen stack library.
Is there an explanation?
If you read the CANopen User Manual, section 1.3.1.6, page 39, you will notice that the heartbeat consumer is first activated upon receiving a heartbeat from the producer. I would assume, then, that since in your example the first heartbeat is never sent, the consumer is never activated.