CAN J1939 device stops responding after communication timeout - can-bus

I'm a higher-layer guy; I don't know, and don't want to know, much about CAN bus, J1939 or even particular ECUs. I just don't like the software solution, so I'd like to ask whether the customer's requirements are legitimate.
If a particular ECU doesn't receive a CAN frame within a 300 ms timeout after power-up, it stops responding to any further frames and must be power cycled. This is information from the customer's technicians; I just have to believe it.
It is possible to power up the ECU after the CAN driver thread is ready, but that would require some extra wiring by end customers.
The software solutions are all bad or worse: running FreeRTOS before important checks, moving the CAN driver code into code shared with other products, or starting the CAN peripheral in the bootloader and leaving it running without software control until the driver starts.
The sensitive part is that our specification contains no explicit demand to start the CAN driver within such a short time. The customer says that it's part of the J1939 specification.
Can someone confirm or disprove that J1939 allows devices to unrecoverably stop receiving after 300 ms of silence, or requires devices to start transmitting within 300 ms after power-up? Or at least point me to the parts of the J1939 standard that might cover this?
Thank you

If a particular ECU doesn't receive a CAN frame within a 300 ms timeout after power-up, it stops responding to any further frames and must be power cycled. This is information from the customer's technicians; I just have to believe it.
This does of course entirely depend on what task it is performing.
Generally, an ECU, as in an automotive computer in a car, truck etc., is never allowed to hang up or latch up. The normal course of action would be for the ECU to either reboot/reset itself or revert to a fail-safe mode.
But in the case of tractors and heavy machinery, the normal safe mode is "stop everything".
It is possible to power up the ECU after the CAN driver thread is ready, but that would require some extra wiring by end customers.
I don't know what this is supposed to mean. What is "extra wiring"? Something to keep other nodes in common mode while one is rebooting? Terminating resistors? Some dirty power-up delay circuit?
The software solutions are all bad or worse: running FreeRTOS before important checks, moving the CAN driver code into code shared with other products, or starting the CAN peripheral in the bootloader and leaving it running without software control until the driver starts.
Generally speaking, it's customary to initialize critical hardware like clocks, watchdogs, prescalers, pull resistors etc. very early on. Initializing hardware peripherals may or may not be critical. It's customary to do this after the CRT has been executed, at the beginning of main(), and the order of initialization usually matters a lot.
If you have a delay longer than 300ms from power-on reset to the start of main(), something is terribly wrong with the program.
The sensitive part is that our specification contains no explicit demand to start the CAN driver within such a short time. The customer says that it's part of the J1939 specification.
I haven't worked much with J1939 and I don't remember what it says specifically, but 300ms is an eternity in a real-time system! It's not a "short time".
In general, correctly designed mission-/safety-critical CAN control systems in automotive/industrial settings work like this:
All data is sent repeatedly at fixed intervals, regardless of whether it has changed or not. Commonly once per 10 ms or once per 100 ms.
A node which has not received new data will use the previously received data for now.
There is a timeout from the point of when last valid data was received, when the receiving node must stop using old data and revert to a fail-safe mode. This time is often relative to how fast the controlled object can move. It's common to have timeouts after some multiple of 100ms.
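As a rough illustration of the receiving side of this scheme, here is a minimal Python sketch. The class name `SignalMonitor`, the 300 ms timeout and the fail-safe value are assumptions made up for this example; real firmware would do the same thing in C against its own timer tick.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SignalMonitor:
    """Tracks one cyclically received signal and its reception timeout."""
    timeout_ms: int                      # assumed, e.g. three missed 100 ms cycles
    failsafe_value: int                  # value used once the data is stale
    last_value: Optional[int] = None
    last_rx_ms: Optional[int] = None

    def on_frame(self, value: int, now_ms: int) -> None:
        """Call whenever a valid frame carrying the signal arrives."""
        self.last_value = value
        self.last_rx_ms = now_ms

    def current(self, now_ms: int) -> int:
        """Return the last received value, or the fail-safe value once stale."""
        if self.last_rx_ms is None:
            return self.failsafe_value   # nothing received yet
        if now_ms - self.last_rx_ms > self.timeout_ms:
            return self.failsafe_value   # timeout expired: stop using old data
        return self.last_value           # keep using the previous data for now
```

Between frames the node keeps using the previous value; only when the timeout has elapsed since the last valid frame does it fall back to the fail-safe value, and it recovers as soon as a new frame arrives.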
I would say that your customer's requirements are very reasonable, it's nothing out of the ordinary.

My colleague answered that there's no such demand, only a vague 300 ms timeout.

Related

Inhibit Time in Tx-PDO

Objects 180Nh have the following subindices:
0x00:----
0x01:----
0x02:----
0x03 (inhibit time): This subindex contains a time lock in 100 µs steps (see following figure). This can be used to set a time that must elapse after the sending of a PDO before the PDO is sent another time. This time only applies for asynchronous PDOs. This is intended to prevent PDOs from being sent continuously if the mapped object constantly changes.
0x04 (compatibility entry): This subindex has no function and exists only for compatibility reasons.
0x05 (event timer): This time (in ms) can be used to trigger an Event which handles the copying of the data and the sending of the PDO.
From the description above, I understand that when the event occurs and the Tx-PDO is sent, a certain time is blocked; if another event occurs within this interval, it is executed in the next section.
Why should the whole section be implemented? Why are the second, third, and fourth events executed in the last part?
Shouldn't the third and fourth events be executed separately?
By default, common CANopen device profiles like for example CiA 401 "generic I/O module" are configured to suit large automation networks. That is: a large network with lots of nodes where it is important to keep bus traffic low. On such networks nodes only transmit PDOs when there has been a data update (an internal event has occurred).
However, such a setup is very much unsuitable when CANopen is used for real-time control systems, for example a PLC controlling a bunch of actuator I/O modules that control the motions of a machine, which could also be a safety-related application. In such systems, it is customary to always send data repeatedly at even intervals, even if it has not changed, for example all data once every 10 ms or 100 ms.
Only the last data sent is used by the receiving node(s), so in case data goes missing/corrupt, new reliable data will arrive soon again. And if no data arrives at all, that's an indication that something is broken and the system ought to revert to a safe state, after receiving no new data in a certain time period. This is how mobile/automotive control systems are most commonly designed, since it is safe, deterministic and proven in use. Custom, non-standard CAN bus protocols by OEM are often implemented exactly like this.
Now, to achieve this with CANopen, we have to configure the TPDO communication parameters: the event timer sets the interval, and the inhibit time prevents the node from spamming extra data as soon as something has changed. If I remember correctly, we also need to set the 180N:2 transmission type to asynchronous (which sounds counter-intuitive).
With a setup like this, only the most recent event matters. The most up to date data will always get sent, at fixed intervals.
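To make the interplay of the two timers concrete, here is a small Python simulation. It is only a sketch of how I understand the event timer and inhibit time to interact for an event-driven TPDO (transmission types 254/255 in CiA 301, as far as I recall); the function name and the assumption that the event timer starts at t = 0 are made up for this example, and a real stack may differ in details.

```python
def tpdo_send_times(event_times_us, inhibit_100us, event_timer_ms, end_us):
    """Return the times (in µs) at which an event-driven TPDO would be sent.

    event_times_us: times at which the mapped data changes (internal events)
    inhibit_100us:  sub-index 3, inhibit time in multiples of 100 µs
    event_timer_ms: sub-index 5, event timer in ms (0 disables it)
    end_us:         end of the simulated time span
    """
    inhibit_us = inhibit_100us * 100
    period_us = event_timer_ms * 1000
    events = sorted(event_times_us)
    sends = []
    last_tx = None   # no TPDO sent yet
    i = 0            # index of the first event not yet covered by a TPDO
    while True:
        triggers = []
        if i < len(events):
            triggers.append(events[i])
        if period_us > 0:
            # Assumption: the event timer restarts at every transmission.
            start = 0 if last_tx is None else last_tx
            triggers.append(start + period_us)
        if not triggers:
            break
        trigger = min(triggers)
        # The inhibit time delays the TPDO until it has elapsed since last_tx.
        tx = trigger if last_tx is None else max(trigger, last_tx + inhibit_us)
        if tx > end_us:
            break
        sends.append(tx)
        last_tx = tx
        # One TPDO carries the latest data, so it covers every event up to tx.
        while i < len(events) and events[i] <= tx:
            i += 1
    return sends
```

With events at 1, 2, 3 and 50 ms, an inhibit time of 10 ms (`inhibit_100us=100`) and an event timer of 100 ms, `tpdo_send_times([1000, 2000, 3000, 50000], 100, 100, 250000)` returns `[1000, 11000, 50000, 150000, 250000]`: the events at 2 ms and 3 ms fall inside the inhibit window and are merged into the single TPDO at 11 ms, which is why several events appear to be "executed" together at the end of the window.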

Time to send SDO

I am working on CANopen architecture and had three questions:
1- When the 'synchronous window' is closed until the next SYNC message, should we send the SDO message? Can we not send a message during this period?
2- Is it possible not to send the PDO message during the synchronous window?
3- What is the answer that the slaves give in the SYNC message?
Disclaimer: I don't have exact answers but I just wanted to share my assumptions & thoughts.
CiA 301 doesn't mention the relation between synchronous window and SDOs. In normal operation after the initial configuration, one may assume that SDOs aren't present on the system, or at least they are rare compared to PDOs. Although not strictly necessary, SDOs are generally initiated by a device which has the master role, and that device also produces the SYNC messages (again, it's not strictly necessary but it's the usual/common implementation). So, the master device may adjust the timing of SDO requests according to the synchronous window.
Here is a quote from CiA 301:
If the synchronous window length expires all synchronous TPDOs may be
discarded and an EMCY message may be transmitted; all synchronous
RPDOs may be discarded until the next SYNC message is received.
Synchronous RPDO processing is resumed with the next SYNC message.
CiA 301 uses the word "may" (see the quote above). So I'm not sure if it's mandatory or not. In my opinion, it makes sense to follow the advice and abort synchronous TPDO transmissions after the synchronous window and send an EMCY message. Event-driven (non-synchronous) TPDOs can be sent within or after the synchronous window.
There is no direct response to SYNC messages. On SYNC reception, SYNC consumers (slaves) sample their inputs, drive their outputs according to the previous RPDOs, and start transmitting their TPDOs containing the previous samples (or the current ones? I'm not sure about this).
Synchronous windows are for specific PDO synchronization only. For hard real-time systems, data might be required to arrive within certain fixed time intervals - not too early, not too late. That is, it acts as a real-time deadline. If such features are enabled, you need to take that in consideration when doing the specific CANopen bus implementation.
For example, if some SDO communication occupied the bus so that the PDO couldn't meet its time window, that would be a problem. But this is easily solved by giving the PDO a lower COB-ID than the SDO, which should already be the case in most default device profile setups like "DS401 GPIO module". Other than that, you would have to make sure that there are no ridiculous bus loads and that nodes don't hang up or get busy doing other things.
In systems with hard real-time requirements you probably don't want to allow any SDO communication during operational mode to begin with.
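The COB-ID point can be illustrated with the default identifiers of the CiA 301 pre-defined connection set. The sketch below is in Python only for readability; the function names are made up, and it models arbitration simply as "lowest identifier wins", which is how CAN resolves simultaneous transmission attempts.

```python
def default_cob_ids(node_id: int) -> dict:
    """Default COB-IDs from the CiA 301 pre-defined connection set."""
    if not 1 <= node_id <= 127:
        raise ValueError("CANopen node IDs are 1..127")
    return {
        "SYNC": 0x080,
        "TPDO1": 0x180 + node_id,
        "RPDO1": 0x200 + node_id,
        "SDO server->client": 0x580 + node_id,
        "SDO client->server": 0x600 + node_id,
    }


def arbitration_order(cob_ids: dict) -> list:
    """Names sorted by bus priority: the lowest CAN identifier wins arbitration."""
    return sorted(cob_ids, key=cob_ids.get)
```

For node 5, `arbitration_order(default_cob_ids(5))` puts SYNC first, then the PDOs, then both SDO directions, so with the default mapping a pending PDO always wins arbitration against a pending SDO frame.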
What is the answer that the slaves give in the SYNC message?
That question doesn't make any sense. You need to study what the SYNC message does and what it is for.

Is it possible to setup timeout for receiving data over USB in STM32 MCUs?

I'm wondering if this is possible to setup a timeout for receiving data over USB interface in STM32 microcontrollers. Such approach is possible for example in UART connection (please refer to AN3109, section 2. Receive DMA timeout).
I can't find anything similar related to USB interface. What's more, it is said that DMA for USB should be enabled only if really necessary because data transfer shall be aligned to 32-bit word.
You have a receive callback function (if you use the HAL) in your ...._if.c file. Copy the received chars to a buffer and implement the timeout there.
What you refer to in the case of UART is either the DMA receive timeout, as you've said, or (when not using DMA) an IDLE interrupt. I'm not aware of such a thing coming "out of the box" for USB CDC; you'd have to implement this timeout yourself, which shouldn't be too hard. Have a timer (hardware or software) that you re-trigger every time you receive data. Set its period to the timeout value of your choice and do the protocol parsing after the timeout elapses.
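The re-triggered timer idea looks roughly like this. It is a logic-only sketch in Python with made-up names (`IdleTimeoutReceiver`, `on_data`, `poll`); on an STM32 the same structure would be written in C, with `on_data` called from the USB CDC receive callback and `poll` from a timer tick.

```python
class IdleTimeoutReceiver:
    """Collects received chunks and hands them over after an idle period."""

    def __init__(self, timeout_ms: int):
        self.timeout_ms = timeout_ms
        self.buffer = bytearray()
        self.deadline_ms = None          # None means the timer is not running

    def on_data(self, chunk: bytes, now_ms: int) -> None:
        """Call from the receive callback: store data and re-trigger the timer."""
        self.buffer.extend(chunk)
        self.deadline_ms = now_ms + self.timeout_ms

    def poll(self, now_ms: int):
        """Call periodically: return the buffered message once the line is idle."""
        if self.deadline_ms is not None and now_ms >= self.deadline_ms:
            message = bytes(self.buffer)
            self.buffer.clear()
            self.deadline_ms = None      # stop the timer until new data arrives
            return message
        return None
```

Every chunk pushes the deadline further out, so the message is only handed to the parser once nothing has arrived for `timeout_ms`.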
If I had to add anything: these kinds of problems (not knowing how many bytes to receive) are typically solved at the protocol level. Assuming a binary protocol, one way of achieving this is having frame start and end bytes which never occur in the data (and if they do, you escape them), in which case you receive everything starting after the "start byte" until you receive the "end byte". Yet another way is having a "start byte" and a field indicating how many bytes there are to receive. All of it should of course be checksummed in some way.
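Here is a minimal sketch of the start/end byte variant with escaping and a checksum. The concrete byte values, the XOR-based escaping and the 8-bit sum checksum are all assumptions picked for the example, not part of any particular protocol; a CRC would be a better choice in practice.

```python
START, END, ESC = 0x02, 0x03, 0x10   # assumed marker values
XOR_MASK = 0x20                      # escaped bytes are sent as ESC, byte ^ 0x20


def encode_frame(payload: bytes) -> bytes:
    """Wrap payload plus an 8-bit sum checksum in START/END with byte stuffing."""
    body = payload + bytes([sum(payload) & 0xFF])
    out = bytearray([START])
    for b in body:
        if b in (START, END, ESC):
            out += bytes([ESC, b ^ XOR_MASK])   # marker bytes never appear raw
        else:
            out.append(b)
    out.append(END)
    return bytes(out)


def decode_stream(data: bytes) -> list:
    """Return the payloads of all complete frames with a valid checksum."""
    frames = []
    body = None          # None means we are waiting for a START byte
    escaped = False
    for b in data:
        if b == START:
            body = bytearray()                  # (re)start a frame
            escaped = False
        elif body is None:
            continue                            # ignore noise between frames
        elif b == END:
            if body and (sum(body[:-1]) & 0xFF) == body[-1]:
                frames.append(bytes(body[:-1]))
            body = None
        elif escaped:
            body.append(b ^ XOR_MASK)
            escaped = False
        elif b == ESC:
            escaped = True
        else:
            body.append(b)
    return frames
```

Because the markers never occur inside a frame, the receiver can resynchronize on the next START byte no matter how the interface chops the data into chunks, and no timing is involved.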
Having said that, if you have an option to change the protocol, you really should do so. Relying on timings in your communication, especially on such low level only invites problems and headaches in the long run. You introduce tight coupling between your protocol layer and interface layer. This is going to backfire on you every time you decide to use a different interface, as you'll have to re-invent the same thing again. Not to mention how painful it's going to be when you decide to move to TCP/IP with all its greatness - network jitter, dropped packets etc.

What happens if a bus-off error occurs in a CAN controller while a car is in motion?

I know that in a CAN controller if the error count reaches some threshold (say 255), bus off will occur which means that a particular CAN node will get switched off from the CAN network. So there won't be any communication at all. But what if the above said scenario happens while the car is moving which contains the ECU (includes the CAN controller)?
Is there any auto-recovery mechanism in a CAN controller to avoid any of the above situations?
During bus off, the node will be isolated.
CAN waits for the mandatory time period, 128 x 11 bits (1408 bits - 5.6 ms for a 250 kbit/s system) of time, and then tries to re-initialize the node.
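The arithmetic behind that figure, as a small Python sketch. It assumes the bus is otherwise idle so that the 128 sequences of 11 recessive bits follow back to back, which makes the result a lower bound; the function name is made up for illustration.

```python
def bus_off_recovery_ms(bitrate_bps: int) -> float:
    """Minimum bus-off recovery time: 128 sequences of 11 recessive bits."""
    recovery_bits = 128 * 11                 # 1408 bit times
    return recovery_bits / bitrate_bps * 1000.0
```

For 250 kbit/s this gives 5.632 ms, for 500 kbit/s 2.816 ms and for 1 Mbit/s 1.408 ms; bus traffic in between stretches the real recovery time.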
Yes, if a CAN node's transmit error count exceeds 255, the node will turn off (go bus-off) and potentially reset itself. A good implementation will not continue resetting a node if the problem persists.
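For reference, a minimal sketch of the fault confinement states as I understand them from the CAN specification (ISO 11898-1): a node becomes error-passive once either counter exceeds 127, and bus-off once the transmit error counter exceeds 255. The function name is an assumption for illustration; real controllers implement this in hardware.

```python
def can_error_state(tec: int, rec: int) -> str:
    """Classify a node from its transmit (TEC) and receive (REC) error counters."""
    if tec > 255:
        return "bus-off"          # only the transmit counter can cause bus-off
    if tec > 127 or rec > 127:
        return "error-passive"    # node may still communicate, with restrictions
    return "error-active"         # normal operation
```

Note that a high receive error counter on its own only makes the node error-passive; it does not take the node off the bus.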
In addition to this safety mechanism, ECUs (electronic control units) also time the duration between valid transmissions of the messages they expect to receive. Therefore, if the engine controller goes offline, nearly every ECU in the vehicle will report "Lost Communication with the Engine Controller."
Typically, these types of CAN problems are identified by DTCs (diagnostic trouble codes) beginning with U, like this one: http://www.obd-codes.com/u0115
Depending on the severity of the issue, the vehicle might enter a "limp home" mode, or might be totally disabled. Problems with the CAN bus on a vehicle are extremely rare, unless there has been some tampering.
The recovery mechanism depends on the software stack that's being used. Most new vehicles have AUTOSAR compliant software implementations. In the AUTOSAR communication stack, the CanSM (state manager) module has configurable BusOff Monitoring and Recovery. You can read more at http://autosar.org .
A BusOff, however, is a serious situation in a running vehicle. How this is handled at the vehicle level is very specific to the system design. But in most cases the system would go into a safe mode of operation, and all parameters would take pre-set fault values to let the vehicle run with reduced functionality. You would see the warning lamps on the dash light up to alert the driver. ECUs typically comply with some ASIL (https://en.wikipedia.org/wiki/Automotive_Safety_Integrity_Level) level as defined by ISO 26262. This makes sure that such situations are thought of and taken care of during design and development.
Nothing spectacular will happen, even if the Engine Control Unit loses CAN communication. The car will continue running.
When bus-off occurs, the CAN network isolates that node, and the node is then reset so that it is able to start communicating again.
As you mentioned, after reaching a specific error count, that node gets disconnected/prohibited from transmitting anything on the bus. This is a description for the bus side.
On the controller side, every CAN controller generates an interrupt on BUS_OFF. It is the responsibility of the host software to reset the CAN controller and bring it back to the normal state.
This is strictly followed for every CAN controller in any car. And this all happens in a few milliseconds... So for the driver, nothing happens!
When the ECU detects a BUS_OFF fault, the ECU should stop its transmissions, so this is a good question to ask.
There is an auto-recovery mechanism:
For the first three detections, the CAN controller resets its registers without a delay
For the next detections, the ECU waits 1 second before the reset
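That recovery policy can be written down in a few lines. This is only a sketch of the rule as described in this answer (three immediate resets, then a 1 second wait); the function name and the parameter defaults are assumptions, and other software stacks use different counts and delays.

```python
def recovery_delay_ms(bus_off_count: int,
                      fast_attempts: int = 3,
                      slow_delay_ms: int = 1000) -> int:
    """Delay before resetting the CAN controller after the n-th bus-off in a row."""
    if bus_off_count < 1:
        raise ValueError("bus_off_count starts at 1")
    if bus_off_count <= fast_attempts:
        return 0                  # first detections: reset immediately
    return slow_delay_ms          # persistent fault: back off before retrying
```

The counter would be cleared again once the node has communicated successfully for a while; when exactly to do that is a design decision this sketch leaves open.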
There is something called limp-home mode for cars. That is the condition when all the ECUs in the car network fail. A set of default parameters for the ECUs is then initialized, and the system, i.e. your car, can continue running only for some time before it has to be properly serviced by the OEM.
I know this is an old thread, but the answers are a bit different from the situation I have observed, in relation to the OP question.
From experience: I have an issue where my ECU stops communicating with the diagnostic tools while the engine is running; apparently it has entered the CAN bus-off state. The only reason I know is that I have an OBD2 plug-in monitor for engine parameters. I don't get ANY DTC, well, most of the time anyway. Sometimes I get DTCs that are not applicable to my vehicle, and some U codes.
That said, the vehicle continues to run just fine, and if I didn't have the plug-in monitor, I would have no idea there was a problem! I'm now pretty sure the ECU for the engine is having communication problems, hitting the error counter limit and shutting off; it's the only thing that makes sense. I checked the CAN signals with a 2-channel oscilloscope, and they are a bit noisy compared to one of my other cars, so my next step is to swap the ECU and see if that fixes it. I already swapped out the TIPM (Total Integrated Power Module), which serves as a router of sorts between the 2 CAN networks and the OBD2 port. That apparently wasn't it.
If a CAN node's Tx error counter exceeds 255, the node will turn off and be isolated. The Rx error counter on its own only makes the node error-passive.
What happens if a bus-off error occurs in a CAN controller while a car is in motion?
1) Hot swapping can be done in a CAN network.
E.g.: assume four nodes (ECUs) are connected in a CAN bus network. If we disconnect one ECU, the CAN bus still works properly.
2) In the bus-off condition, the node can hear every signal on the bus network but it can't transmit messages (signals), whether the car is in motion or at rest.
E.g.: ECUs (ABS) are used for better performance, but the actual work is done by the actuator (disc brake).

How can I determine the quality of a connection in iOS?

I'm familiar with using Reachability to determine the type of internet connection (if any) being used on an iOS device. Unfortunately that's not a decent indicator of connection quality. Wifi with low signal strength is pretty sketchy and 3G with anything less than 3 bars is a disaster (not to mention networks that only allow EDGE connections).
How can I determine the quality of my connection so I can help my users decide if they should be downloading larger files on their current connection?
A pragmatic approach would be to download one moderately large-sized file hosted on a reliable, worldwide CDN, at the start of your application. You know the filesize beforehand, you just have to measure the time it takes, make a simple computation and then you've got your estimate of the quality of the connection.
For example, jQuery UI source code, unminified, gzipped weighs roughly 90kB. Downloading it from http://ajax.googleapis.com/ajax/libs/jqueryui/1.8.14/jquery-ui.js takes 327ms here on my Mac. So one can assume I have at least a decent connection that can handle approximately 300kB/s (and in fact, it can handle much more).
The trick is to find the good balance between the original file size and the latency of the network, as the full download speed is never reached on a small file like this. On the other hand, downloading 1MB right after launching your application will surely penalize most of your users, even if it will allow you to measure more precisely the speed of the connection.
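The "simple computation" is just size divided by time. A tiny sketch (in Python for brevity; an iOS app would do the same in Swift or Objective-C), with a made-up function name:

```python
def estimate_throughput_kBps(size_bytes: int, elapsed_s: float) -> float:
    """Average throughput in kB/s for one finished download."""
    if elapsed_s <= 0:
        raise ValueError("elapsed time must be positive")
    return size_bytes / 1000.0 / elapsed_s
```

For the numbers above, `estimate_throughput_kBps(90_000, 0.327)` comes out at roughly 275 kB/s. The elapsed time includes connection setup and latency, so for a small file this underestimates the sustained speed.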
Cyrille's answer is a good pragmatic answer, but is not really in the end a great solution in the mobile context for these reasons:
It involves doing a test "at the start of your application" by which I assume he means when your app launches. But your app may execute for a long while, may go background and then back into the foreground, and all the while the user is changing network contexts with changes in underlying network performance - so that initial test result may bear no relationship to the "current" performance of the network connection.
For the reason he rightly points out, that it is "penalizing" your user by making them download a test file over what may already be constrained network conditions.
You also suggest in your original post that you want your user to decide if they should download based on information you present to them. But I would suggest that this is not a good way to approach interacting with mobile users: you should not be asking them to make complicated decisions. If absolutely necessary, only ask if they want to download the file if you think it may present a problem, but keep it that simple: "Do you want to download XYZ file (100 MB)?" I personally would avoid even that.
Instead of downloading a test file, the better solution is to monitor and adapt. Measure the performance of the connection as you go along, keep track of the "freshness" of that information you have about how well the connection is performing, and only present your user with a decision to make if based on the on-going performance of the connection it seems necessary.
EDIT: For example, if you determine a patience threshold that in your opinion represents tolerable download performance, keep track of each download that the user does in order to determine if that threshold is being reached. That way, instead of clogging up the user's connection with test downloads, you're using the real-world activity as the determining factor for "quality of the connection", which is ultimately about the end-user experience of the quality of the connection. If you decide to provide the user with the ability to cancel downloads, then you have an excellent "input" about the user's actual patience threshold, and can adapt your functionality to that situation by subsequently giving them the choice before they start the download. If you've flipped into this type of "confirmation" mode, but then find that files are starting to download faster, you could dynamically exit the confirmation mode.
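A rough sketch of the monitor-and-adapt idea, written in Python only to show the logic (an iOS app would implement it in Swift or Objective-C). The class name, the window size, the 60 second freshness limit and the threshold are all assumptions for the example.

```python
from collections import deque


class ConnectionMonitor:
    """Estimates connection quality from the user's real downloads."""

    def __init__(self, min_kBps: float, max_age_s: float = 60.0, window: int = 5):
        self.min_kBps = min_kBps              # assumed "patience threshold"
        self.max_age_s = max_age_s            # older samples are considered stale
        self.samples = deque(maxlen=window)   # (timestamp_s, throughput in kB/s)

    def record_download(self, size_bytes: int, elapsed_s: float, now_s: float) -> None:
        """Call after every real download the app performs."""
        self.samples.append((now_s, size_bytes / 1000.0 / elapsed_s))

    def estimate(self, now_s: float):
        """Average throughput of the fresh samples, or None if nothing is fresh."""
        fresh = [rate for t, rate in self.samples if now_s - t <= self.max_age_s]
        return sum(fresh) / len(fresh) if fresh else None

    def should_confirm(self, now_s: float) -> bool:
        """Ask the user before a large download only if recent data looks slow."""
        est = self.estimate(now_s)
        return est is not None and est < self.min_kBps
```

With no fresh measurements the sketch does not bother the user at all; confirmation mode switches on after slow downloads and switches off again by itself once faster ones pull the average back above the threshold.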
Rob's answer is very good, but for a more specific implementation, start with Apple's Simple Ping example source code (https://developer.apple.com/library/archive/samplecode/SimplePing/Introduction/Intro.html#//apple_ref/doc/uid/DTS10000716)
Target the domain for the server that you want to monitor connection quality to. Use the ping library to "ping" it on a regular basis (say 1 or 10 seconds depending upon your UI needs). Measure how long it takes to get a response to your ping (or if it never returns) to develop an estimate of the connection quality to communicate to your user.
