How the CAN message is re-transmitted after arbitration - can-bus

I understand the CAN arbitration process. But I am very curious about how the Node which loses the arbitration re-transmit its message until success.
As I know many CAN messages are repeatably send on the CAN bus. For example, Node A and Node B simultaneously send messages every 100ms.
Assuming Node A has low identifier value and Node B has high identifier value, then Node A will always win the arbitration and repeatably send message on CAN bus. As Node A and Node B send the message always at the same time, seems Node B will always lose the arbitration and the message can not reach to other Nodes forever...
What CAN mechanism is used for this situation?

Node B will try again when Node A ends the tx, which happens much earlier than 100ms.

Related

Node with higher ID always losing arbitration in CAN Bus

I am a beginner in CAN Protocol, referring Texas Instruments Application Report SLOA101B - Introduction to the Controller Area Network (CAN).
What happens when 2 nodes are continuously sending CAN frames, will the node sending the frame higher CAN ID always lose arbitration?
In my understanding, in the initial arbitration, the node with lower ID wins, then sends the data frame, after which the bus goes to 3 recessive IFS, then again both nodes finds the bus as idle and starts arbitration, here also the node with lower ID wins the arbitration and so on. This means the node sending the frame higher CAN ID always loses arbitration.
Yes that is correct. But apart from identifier, arbitration also involves the RTR/SRR bit as well as the IDE (extended identifier) bits. So an 11 bit frame with identifier 0x123 always wins over an 29 bit frame with identifier 0x123.
Though note that continuous, 100% bus load often means there is something seriously wrong with the bus.

CAN bus sending data from two masters with equal balance

I have two master nodes connected to the same CAN bus, both send data to my PC.
first master ID = 0xFFA1
second master ID = 0xFFA2
Since the first master ID is lower than the second it takes control of the bus more than the second master. And this causes some delay in the data.
Is there a way to make load balancing between two nodes so that each node send an almost equal amount of messages.
I tried making the first node send data while switching between two IDs 0xFFA1 and 0xFFB2,
and the second node sends data with ID 0xFFB1. And it didn't help.
There is no such thing as "masters" in CAN, nor in higher layer protocols like CANopen for that matter (a "master" in CANopen is just a supervisor node). Who gets to send what is defined by the CAN identifiers - CAN is primarily focusing on data, not nodes. What matters is what is sent, rather than who is sending/receiving, since every message is broadcasted.
It sounds as if you have 2 nodes that wildly spam the bus with identifier 0xFFA1 and 0xFFA2 messages, as fast as they are able, leading to 100% bus load. Then the node sending 0xFFA2 will "starve". Sending data "as fast as you are able" is never the correct way to use CAN.
Instead you need to define a higher layer protocol that dictates real-time characteristics. In control systems, this is most commonly done by having nodes send data at fixed intervals, such as once per 10ms or 100ms. This alone should fix your starvation problem.
If you want to prevent nodes from sending at the same time, then you could provide a means for them to synchronize. A trick used in CANopen and other protocols, is to have one node send out a "sync" message at given fixed time intervals.
After reading this sync message, all nodes should act within x ms from receiving it.

Missing master heartbeat does not cause node to react in a CANopen system

I have a strange finding about the heartbeat-protocol in CANopen. Maybe somebody else has seen something like this and maybe it is supposed to work like this... Anyway, here's what it's about:
In CANopen there are two timeout-based life-guarding mechanisms: the first is node guarding, which I will not mention further, since it's considered old news.
The other one is called heartbeat. It is pretty simple: Any participant on the network sends a regular message stating its node ID and its state. The frequency is defined by object 0x1017sub0 and is called heartbeat-producer-time. If it is set to zero, no heartbeat is being sent.
Any other participant can then define a number of nodes it wants to find on the network plus the maximum time there may be between two consecutive heartbeat-messages. This information is stored in object 0x1016sub1..n as 32-bit entries for as many nodes as this particular node wants to listen to.
The entries consist of the node ID (bits 22 to 16) and the mentioned maximum time that may elaps between heartbeats, called the heartbeat-consumer-time (in bits 15..0). Again if the entry is zero, it is being ignored.
As you may have gathered, there is no distinction between network-master (node ID 1) and slaves (node IDs 2 to 127).
So far the theory, now for my problem:
I configure one of the slave-nodes in my network as a heartbeat-consumer for the master, so there's an entry in object 0x1016sub1 that looks like this: 0x000107D0. Meaning that a heartbeat-message from the master is expected after at least two seconds.
I have observed that this works in two examples. If I send a master-heartbeat for a time and then stop, the node either returns to pre-operational mode or sends an appropriate emergency-message.
If I don't send any master-heartbeat-messages, I would expect that after I start the node (send it into operational mode) it takes at most two seconds for the node to either return to pre-operational mode or send an appropriate emergency-message or perhaps even both. But in the two examples I tried, nothing happened. If I never send any heartbeat, the node never expects one and just keeps on running.
The two examples are very different from each other. I am not sure whether they use the same CANopen-stack library perhaps.
Is there an explanation?
If you read CANopen User Manual, section 1.3.1.6, page 39, you will notice that the heartbeat consumer is first activated upon receiving a heartbeat from the producer. I would assume then that, since in your example the first heartbeat is never sent, the consumer is not activated.

What guarantees does erlang's "monitor" give?

While reading the ERTS user's guide, I found this section:
The only signal ordering guarantee given is the following. If an entity sends multiple signals to the same destination
entity, the order will be preserved. That is, if A sends a signal S1 to B, and later sends the signal S2 to B, S1 is
guaranteed not to arrive after S2.
I've also happened across this while doing further research googling:
Erlang Reference Manual, 13.5:
Message sending is asynchronous and safe, the message is guaranteed to eventually reach the recipient, provided that the recipient exists.
That seems very vague and I'd like to know what guarantees I can rely on in the following scenario:
A,B are processes on two different nodes.
Assume A does not crash and B was a valid node at some point.
A and B monitor each other.
A sends messages M1,M2,M3 to B
In the above scenario, is it possible that B receives M1,M3 (M2 is dropped),
without any sort of 'DOWN'/'EXIT'/heartbeat timeout being received at A?
There are no other guarantees other than the ordering guarantee. Note that by default you don't even know who the sender is, unless the sender encodes this in the message.
Your example could happen:
A sends M1 and M2
B receives M1
The node on which B resides gets disconnected
The node on which B resides comes up again
A sends M3 to B
B receives M3
M2 can be lost on the network link in this scenario. It is highly unlikely this happens, but it can happen. The usual trick is to have some kind of notion of such errors. Either by having a timeout trigger, or by monitoring the node or Pid which is the recipient of the message.
Updated scenario:
In the updated scenario, provided I read it correctly, then A would get a 'DOWN' style message at some point, and likewise, it would get a message telling you that the node is up again, if you monitor the node.
Though often, such things are better modeled using an idempotent protocol if at all possible.
Reading through the erlang mailing-list and the academic faq, it seems like there are a few guarantees provided by the ERTS implementation, however I was not able to determine whether or not they are guaranteed at a language/specification level, too.
If you assume TCP is "reliable", then the current implementation guarantees that
given A,B are processes on different nodes (&hosts) and A monitors B, A sends to B, assuming A doesn't crash, any message delivery failures* between the two nodes or host/node/process failures on B will lead to A getting a 'DOWN' message (or 'EXIT' in the case of links). [ See 1 and 2 ]
*From what I have read on the mailing-list thread , this property is almost entirely based on the fact that TCP is used, so "message delivery failure" means any situation where TCP decides that a failure has occurred/the connection needs to be closed.
The academic faq talks about this like it's also a language/specification level guarantee, however I couldn't yet find anything to back that up.

How is the detection of terminated nodes in Erlang working? How is net_ticktime influencing the control of node liveness in Erlang?

I set net_ticktime value to 600 seconds.
net_kernel:set_net_ticktime(600)
In Erlang documentation for net_ticktime = TickTime:
Specifies the net_kernel tick time. TickTime is given in seconds. Once every TickTime/4 second, all connected nodes are ticked (if anything else has been written to a node) and if nothing has been received from another node within the last four (4) tick times that node is considered to be down. This ensures that nodes which are not responding, for reasons such as hardware errors, are considered to be down.
The time T, in which a node that is not responding is detected:
MinT < T < MaxT where:
MinT = TickTime - TickTime / 4
MaxT = TickTime + TickTime / 4
TickTime is by default 60 (seconds). Thus, 45 < T < 75 seconds.
Note: Normally, a terminating node is detected immediately.
My Problem:
My TickTime is 600 (seconds). Thus, 450 (7.5 minutes)< T < 750 seconds (12.5 minutes). Although, when I set net_ticktime to all distributed nodes in Erlang to value 600 when some node fails (eg. when I close Erlang shell) then the other nodes get message immediately and not according to definition of ticktime.
However it is noted that normally a terminating node is detected immediately but I could not find explanation (neither in Erlang documentation, or Erlang ebook or other Erlang based sources) of this immediate response principle for node termination in distributed Erlang. Are nodes in distributed environment pinged periodically with smaller intervals than net_ticktime or does the terminating node send some kind of message to other nodes before it terminates? If it does send a message are there any scenarios when upon termination node cannot send this message and must be pinged to investigate its liveliness?
Also it is noted in Erlang documentation that Distributed Erlang is not very scalable for clusters larger than 100 nodes as every node keeps links to all nodes in the cluster. Is the algorithm for investigating liveliness of nodes (pinging, announcing termination) modified with increasing size of the cluster?
When two Erlang nodes connect, a TCP connection is made between them. The failure you are inducing would cause the underlying OS to close the connection, effectively notifying the other node very quickly.
The network tick is used to detect a connection to a distant node that appears to be up but is not actually passing traffic, such as may occur when a network event isolates a node.
If you want to simulate a failure that would require a tick to detect, use a firewall to block the traffic on the connection created when the nodes first ping.

Resources