SocketCAN error-passive with RevPi CAN Connect. Hardware or Software issue? - can-bus

We have bought a CAN Connect module for our RevPi Connect and set it up as can be seen in this thread on the Kunbus forum. As far as we understand, this provides the proper termination.
We have also followed this guide provided by Revolution Pi.
When checking that the driver for the CAN module is properly loaded, everything looks fine:
$ dmesg | grep can
[ 4.616900] hi3110 spi0.0 can0: 3110 successfully initialized.
[ 107.049422] IPv6: ADDRCONF(NETDEV_UP): can0: link is not ready
[ 107.049463] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
Checking ip statistics gives this result:
$ ip -det -statistics link show can0
5: can0: mtu 16 qdisc pfifo_fast state UP mode DEFAULT group default qlen 10
link/can promiscuity 0
can state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 0
bitrate 50000 sample-point 0.875
tq 1250 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
hi3110: tseg1 2..16 tseg2 2..8 sjw 1..4 brp 1..64 brp-inc 1
clock 16000000
re-started bus-errors arbit-lost error-warn error-pass bus-off
0 0 0 0 0 0 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
RX: bytes packets errors dropped overrun mcast
0 0 0 0 0 0
TX: bytes packets errors dropped carrier collsns
0 0 0 0 0 0
However, when we run candump and send a message using cansend, candump does not show any traffic.
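For reference, the commands we use are roughly the following (the interface is already up at 50 kbps as shown above; the ID and payload are arbitrary examples):
$ candump can0 &
$ cansend can0 123#DEADBEEF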
If we check dmesg for CAN again, we get this result:
$ dmesg | grep can
[ 4.616900] hi3110 spi0.0 can0: 3110 successfully initialized.
[ 107.049422] IPv6: ADDRCONF(NETDEV_UP): can0: link is not ready
[ 107.049463] IPv6: ADDRCONF(NETDEV_CHANGE): can0: link becomes ready
[ 292.285353] can: controller area network core (rev 20170425 abi 9)
[ 292.297811] can: raw protocol (rev 20170425)
And if we check ip -statistics again, we can see that the state of the connection has gone into ERROR-PASSIVE:
$ ip -det -statistics link show can0
5: can0: mtu 16 qdisc pfifo_fast state UP mode DEFAULT group default qlen 10
link/can promiscuity 0
can state ERROR-PASSIVE (berr-counter tx 128 rx 0) restart-ms 0
bitrate 50000 sample-point 0.875
tq 1250 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
hi3110: tseg1 2..16 tseg2 2..8 sjw 1..4 brp 1..64 brp-inc 1
clock 16000000
re-started bus-errors arbit-lost error-warn error-pass bus-off
0 0 0 1 1 0 numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
RX: bytes packets errors dropped overrun mcast
0 0 0 0 0 0
TX: bytes packets errors dropped carrier collsns
0 0 0 0 0 0
We are trying to communicate with a computer with an Ixxat USB2CAN v2, at 50 kbps, with both high-speed and FT settings, but nothing seems to work. Resetting the connection seems to put it back into the ERROR-ACTIVE state.
When communicating with the mentioned computer from another computer running SocketCAN and a second Ixxat USB2CAN v2 device, everything works fine.
I should perhaps also mention that if we turn loopback on, we can see the messages we are sending.
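For completeness, we toggle loopback like this (same 50 kbps bitrate as above):
$ sudo ip link set can0 down
$ sudo ip link set can0 type can bitrate 50000 loopback on
$ sudo ip link set can0 up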
We are struggling to understand whether this is a hardware or a software issue. Is our termination in order? Are there any magic settings in SocketCAN we have overlooked? Where should we start debugging this issue?
Any help would be greatly appreciated.

As it turns out, the schematic printed directly on the device is wrong. The schematic on their website is correct.
Hopefully this can help someone else avoid pulling their hair out in the future.

Related

How does Standard CAN win bus access in arbitration against EXT CAN?

I have read some documents, and all of them say that Std CAN has higher priority than Ext CAN because the SRR bit is always recessive in Ext CAN when they have the same ID, but from my understanding it depends.
https://copperhilltech.com/blog/controller-area-network-can-bus-tutorial-extended-can-protocol/
To simplify, let's say we have message ID 0x1 (Std CAN) and 0x1 (Ext CAN) being sent simultaneously on the same bus.
The arbitration field of the Std CAN frame compared to the Ext CAN frame should look like this:
Std CAN: 0 0 0 0 0 0 0 0 0 0 1 [0] (the bracketed bit is RTR)
Ext CAN: 0 0 0 0 0 0 0 0 0 0 0 [1] [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 [0] (the bracketed bits are SRR, IDE and RTR)
At the 11th bit, the node sending Std CAN transmits 1 (recessive) while the node sending Ext CAN transmits 0 (dominant), so the Ext CAN frame wins bus access and the Std CAN node switches to listening and stops transmitting. The SRR and IDE bits are therefore never reached to decide whether the message is Ext CAN or Std CAN.
Is my above understanding correct?
Thank you in advance,
Yes, a 29-bit frame with RTR set has higher priority than an 11-bit frame without RTR, given that the first 11 bits of the identifiers are identical. So saying that standard frames have higher priority than extended frames is a simplification.
RTR frames are a bit of an oddball case overall, as they may also have a varying length in the DLC field even though there is no data at all in the frame.

False high values in Grafana causing false alerts

I configured alerting in Grafana yesterday and am now getting alerts from two servers. It's always the same two servers that report high IO, high CPU, or something else.
The thing is, they do not actually have such high load; in fact, they're almost idle. All servers are configured exactly the same via Ansible, so the Telegraf config is the same on all servers.
Also, if I filter the stats in Grafana to the corresponding server, the data displayed in the graph is correct, as you can see in the screenshot below. Still, the rule test results in a false positive.
I checked vmstat, which also displays correct information:
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
1 0 47100 151152 20948 454556 2 2 16 38 2 1 2 1 96 0 1
0 0 47100 151136 20948 454592 0 0 0 0 125 135 0 1 96 0 2
0 0 47100 150408 20956 454584 0 0 0 84 222 282 1 3 93 0 4
0 0 47100 150424 20956 454592 0 0 0 0 151 225 0 0 97 0 2
0 0 47100 150424 20956 454592 0 0 0 0 115 140 0 0 96 0 4
0 0 47100 150424 20956 454592 0 0 0 0 109 125 0 0 97 0 2
0 0 47100 150424 20956 454592 0 0 0 0 121 131 0 0 98 0 2
0 0 47100 150412 20972 454576 0 0 0 92 139 208 0 1 96 0 3
0 0 47100 150456 20972 454592 0 0 0 0 65 117 0 0 99 0 1
0 0 47100 150876 20972 454592 0 0 0 16 692 705 2 4 88 0 5
And here is the telegraf.log, in case something is wrong there:
2017-07-07T09:22:04Z I! Starting Telegraf (version 1.3.3)
2017-07-07T09:22:04Z I! Loaded outputs: influxdb
2017-07-07T09:22:04Z I! Loaded inputs: inputs.diskio inputs.processes inputs.swap inputs.system inputs.redis inputs.disk inputs.kernel inputs.mem inputs.net inputs.nginx inputs.postgresql inputs.cpu
2017-07-07T09:22:04Z I! Tags enabled: environment=production host=om-1-prod rails_env=production role=telegraf
2017-07-07T09:22:04Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"om-1-prod", Flush Interval:10s
Any ideas what's wrong here?
I kept monitoring the servers manually and found high peaks that only last for a short period of time.
So the issue here is that these peaks aren't visible in the selected time range within Grafana. The data gets aggregated to a smaller average, which then makes it look like there were only about 40 IOPS. If I zoom into the corresponding time range, I can see these peaks.
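If you want the alert query itself to catch such short spikes, one option is to aggregate with max() instead of mean() over a small GROUP BY interval. A sketch against the InfluxDB 1.x CLI (the measurement and field names assume Telegraf's default diskio plugin and the database name telegraf, so adjust to your setup):
influx -database 'telegraf' -execute 'SELECT max("io_time") FROM "diskio" WHERE time > now() - 5m GROUP BY time(1m), "host"'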
Long story short: there is no issue with Grafana, Telegraf, or InfluxDB. The problem existed between keyboard and chair.

write: no buffer space available socket-can/linux-can

I'm running a program with two CAN channels (using TowerTech CAN Cape TT3201).
The two channels are can0 (500k) and can1 (125k). The can0 channel works perfectly, but can1 fails with a write: No buffer space available error.
I'm using ValueCAN3/VehicleSpy to check the messages.
This is before I run the program. can0 and can1 both seem to send, but only can0 shows up in VehicleSpy.
root#cantool:~# cansend can0 100#00
root#cantool:~# cansend can1 100#20
This is after I try running the program
root#cantool:~# cansend can1 100#20
write: No buffer space available
root#cantool:~# cansend can0 111#10
While my program is running, I get this error for all messages to be sent on can1:
2016-11-02 15:36:03,052 - can.socketcan.native.tx - WARNING - Failed to send: 0.000000 12f83018 010 1 00
2016-11-02 15:36:03,131 - can.socketcan.native.tx - WARNING - Failed to send: 0.000000 0af81118 010 6 00 00 00 00 00 00
2016-11-02 15:36:03,148 - can.socketcan.native.tx - WARNING - Failed to send: 0.000000 12f81018 010 6 00 00 00 00 00 00
2016-11-02 15:36:03,174 - can.socketcan.native.tx - WARNING - Failed to send: 0.000000 0af87018 010 3 00 00 00
2016-11-02 15:36:03,220 - can.socketcan.native.tx - WARNING - Failed to send: 0.000000 12f89018 010 4 00 00 00 00
2016-11-02 15:36:03,352 - can.socketcan.native.tx - WARNING - Failed to send: 0.000000 12f83018 010 1 00
However, sometimes the whole program works perfectly (if the module is rebooted, or at some random instances).
How do I fix this?
root#cantool:~# uname -r
4.1.15-ti-rt-r43
After doing some digging, I found this:
root#cantool:~# ip -details link show can0
4:can0: <NOARP,UP,LOWER_UP, ECHO> mtu 16 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
link/can promiscuity 0
can state ERROR-ACTIVE (berr-counter tx 0 rx 0) restart-ms 100
bitrate 500000 sample-point 0.875
tq 125 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
c_can: tseg1 2..16 tseg2 1..8 sjw 1..4 brp 1..1024 brp-inc 1
clock 24000000
root#cantool:~# ip -details link show can1
5: can1: <NOARP,UP,LOWER_UP, ECHO> mtu 16 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
link/can promiscuity 0
can state STOPPED restart-ms 100
bitrate 125000 sample-point 0.875
tq 500 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
c_can: tseg1 2..16 tseg2 2..8 sjw 1..4 brp 1..64 brp-inc 1
clock 8000000
It turns out that can1 is STOPPED for some reason.
However, when I try:
ip link set can1 type can restart
RTNETLINK answers: Invalid argument
After you enable the can0 interface with sudo ifconfig can0 up, run:
sudo ifconfig can0 txqueuelen 1000
This will increase the number of frames allowed per kernel transmission queue for the queuing discipline. More info here
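If you prefer iproute2 over the deprecated ifconfig, the equivalent would be along these lines (same illustrative queue length):
sudo ip link set can0 up
sudo ip link set can0 txqueuelen 1000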
... sometimes the whole program works perfectly (if the module is rebooted or some random instances).
The reason it works when you restart the SocketCAN interface is that restarting clears out just enough buffer space for transmission to succeed again.
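To confirm that the transmit queue on can1 really is filling up, you can also watch the qdisc statistics while your program runs (a generic check, nothing specific to this cape):
tc -s qdisc show dev can1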

Which of the IDS 11.70 onconfig parameters can be changed to maximize performance for a DSS app?

Informix 11.70.TC5DE,
Windows Vista with Dual Core Processor, 8GB RAM, 1TB HDD:
During the installation of this server, I specified it was going to be used for a data warehousing application. These are the onconfig parameters the install script generated.
Can any of these parameters be changed to maximize the performance of the server?
#(onconfig.ol_informix1170) - for data warehousing app.
ROOTNAME rootdbs
ROOTPATH C:\PROGRA~1\IBM\Informix\11.70\OL_INF~2\dbspaces\rootdbs.000
ROOTOFFSET 0
ROOTSIZE 312992
MIRROR 0
MIRRORPATH
MIRROROFFSET 0
PHYSFILE 49152
PLOG_OVERFLOW_PATH
PHYSBUFF 512
LOGFILES 6
LOGSIZE 10000
DYNAMIC_LOGS 2
LOGBUFF 256
LTXHWM 70
LTXEHWM 80
MSGPATH C:\PROGRA~1\IBM\Informix\11.70\ol_informix1170_1.log
CONSOLE C:\PROGRA~1\IBM\Informix\11.70\ol_informix1170_1.con
TBLTBLFIRST 0
TBLTBLNEXT 0
TBLSPACE_STATS 1
DBSPACETEMP tempdbs
SBSPACETEMP
SBSPACENAME sbspace
SYSSBSPACENAME
ONDBSPACEDOWN 2
SERVERNUM 6
DBSERVERNAME ol_informix1170_1
DBSERVERALIASES dr_informix1170_1
NETTYPE olsoctcp,1,150,NET
LISTEN_TIMEOUT 60
MAX_INCOMPLETE_CONNECTIONS 1024
FASTPOLL 1
NS_CACHE host=900,service=900,user=900,group=900
MULTIPROCESSOR 0
VPCLASS cpu,num=1,noage
VP_MEMORY_CACHE_KB 0
SINGLE_CPU_VP 1
#VPCLASS aio,num=1
CLEANERS 2
AUTO_AIOVPS 1
DIRECT_IO 0
LOCKS 2000
DEF_TABLE_LOCKMODE page
RESIDENT 0
SHMBASE 0xc000000L
SHMVIRTSIZE 209920
SHMADD 6560
EXTSHMADD 8192
SHMTOTAL 0
SHMVIRT_ALLOCSEG 0,3
#SHMNOACCESS 0x70000000-0x7FFFFFFF
CKPTINTVL 300
AUTO_CKPTS 1
RTO_SERVER_RESTART 60
BLOCKTIMEOUT 3600
CONVERSION_GUARD 2
RESTORE_POINT_DIR $INFORMIXDIR\tmp
TXTIMEOUT 300
DEADLOCK_TIMEOUT 60
HETERO_COMMIT 0
TAPEDEV \\.\TAPE0
TAPEBLK 16
TAPESIZE 0
LTAPEDEV
LTAPEBLK 16
LTAPESIZE 0
BAR_ACT_LOG $INFORMIXDIR\tmp\bar_act.log
BAR_DEBUG_LOG $INFORMIXDIR\tmp\bar_dbug.log
BAR_DEBUG 0
BAR_MAX_BACKUP 0
BAR_RETRY 1
BAR_NB_XPORT_COUNT 20
BAR_XFER_BUF_SIZE 15
RESTARTABLE_RESTORE ON
BAR_PROGRESS_FREQ 0
BAR_BSALIB_PATH
BACKUP_FILTER
RESTORE_FILTER
BAR_PERFORMANCE 0
BAR_CKPTSEC_TIMEOUT 15
ISM_DATA_POOL ISMData
ISM_LOG_POOL ISMLogs
DD_HASHSIZE 31
DD_HASHMAX 10
DS_HASHSIZE 31
DS_POOLSIZE 127
PC_HASHSIZE 31
PC_POOLSIZE 127
PRELOAD_DLL_FILE
STMT_CACHE 0
STMT_CACHE_HITS 0
STMT_CACHE_SIZE 512
STMT_CACHE_NOLIMIT 0
STMT_CACHE_NUMPOOL 1
USEOSTIME 0
STACKSIZE 64
ALLOW_NEWLINE 0
USELASTCOMMITTED NONE
FILLFACTOR 90
MAX_FILL_DATA_PAGES 0
BTSCANNER num=1,threshold=5000,rangesize=-1,alice=6,compression=default
ONLIDX_MAXMEM 188928
MAX_PDQPRIORITY 100
DS_MAX_QUERIES 1
DS_TOTAL_MEMORY 188928
DS_MAX_SCANS 1
DS_NONPDQ_QUERY_MEM 188928
DATASKIP
OPTCOMPIND 2
DIRECTIVES 1
EXT_DIRECTIVES 0
OPT_GOAL -1
IFX_FOLDVIEW 0
AUTO_REPREPARE 1
USTLOW_SAMPLE 0
RA_PAGES 64
RA_THRESHOLD 16
BATCHEDREAD_TABLE 1
BATCHEDREAD_INDEX 1
BATCHEDREAD_KEYONLY 0
EXPLAIN_STAT 1
#SQLTRACE level=low,ntraces=1000,size=2,mode=global
#DBCREATE_PERMISSION informix
#DB_LIBRARY_PATH
IFX_EXTEND_ROLE 1
SECURITY_LOCALCONNECTION
UNSECURE_ONSTAT
ADMIN_USER_MODE_WITH_DBSA
ADMIN_MODE_USERS
PLCY_POOLSIZE 127
PLCY_HASHSIZE 31
USRC_POOLSIZE 127
USRC_HASHSIZE 31
STAGEBLOB
OPCACHEMAX 0
SQL_LOGICAL_CHAR OFF
SEQ_CACHE_SIZE 10
ENCRYPT_HDR
ENCRYPT_SMX
ENCRYPT_CDR 0
ENCRYPT_CIPHERS
ENCRYPT_MAC
ENCRYPT_MACFILE
ENCRYPT_SWITCH
CDR_EVALTHREADS 1,2
CDR_DSLOCKWAIT 5
CDR_QUEUEMEM 4096
CDR_NIFCOMPRESS 0
CDR_SERIAL 0
CDR_DBSPACE
CDR_QHDR_DBSPACE
CDR_QDATA_SBSPACE
CDR_SUPPRESS_ATSRISWARN
CDR_DELAY_PURGE_DTC 0
CDR_LOG_LAG_ACTION ddrblock
CDR_LOG_STAGING_MAXSIZE 0
CDR_MAX_DYNAMIC_LOGS 0
DRAUTO 0
DRINTERVAL 30
DRTIMEOUT 30
HA_ALIAS
DRLOSTFOUND $INFORMIXDIR\etc\dr.lostfound
DRIDXAUTO 0
LOG_INDEX_BUILDS
SDS_ENABLE
SDS_TIMEOUT 20
SDS_TEMPDBS
SDS_PAGING
SDS_LOGCHECK 0
UPDATABLE_SECONDARY 0
FAILOVER_CALLBACK
FAILOVER_TX_TIMEOUT 0
TEMPTAB_NOLOG 0
DELAY_APPLY 0
STOP_APPLY 0
LOG_STAGING_DIR
RSS_FLOW_CONTROL 0
ENABLE_SNAPSHOT_COPY 0
SMX_COMPRESS 0
ON_RECVRY_THREADS 2
OFF_RECVRY_THREADS 5
DUMPDIR $INFORMIXDIR\tmp
DUMPSHMEM 1
DUMPGCORE 0
DUMPCORE 0
DUMPCNT 1
ALARMPROGRAM $INFORMIXDIR\etc\alarmprogram.bat
ALRM_ALL_EVENTS 0
#SYSALARMPROGRAM $INFORMIXDIR\etc\evidence.bat
STORAGE_FULL_ALARM 600,3
RAS_PLOG_SPEED 10982
RAS_LLOG_SPEED 0
EILSEQ_COMPAT_MODE 0
QSTATS 0
WSTATS 0
#VPCLASS MQ,noyield
MQSERVER
MQCHLLIB
MQCHLTAB
#VPCLASS jvp,num=1
#JVPJAVAHOME $INFORMIXDIR\extend\krakatoa\jre
#JVPHOME $INFORMIXDIR\extend\krakatoa
JVPPROPFILE $INFORMIXDIR\extend\krakatoa\.jvpprops
JVPLOGFILE $INFORMIXDIR\jvp.log
#JDKVERSION 1.5
#JVPJAVALIB \bin
#JVPJAVAVM jvm
#JVPARGS -verbose:jni
#JVPCLASSPATH $INFORMIXDIR\extend\krakatoa\krakatoa_g.jar;$INFORMIXDIR\extend\krakatoa\jdbc_g.jar
JVPARGS -Dcom.ibm.tools.attach.enable=no
JVPCLASSPATH $INFORMIXDIR\extend\krakatoa\krakatoa.jar;$INFORMIXDIR\extend\krakatoa\jdbc.jar
BUFFERPOOL default,buffers=10000,lrus=8,lru_min_dirty=50.00,lru_max_dirty=60.50
BUFFERPOOL size=4K,buffers=13108,lrus=16,lru_min_dirty=70.00,lru_max_dirty=80.00
AUTO_LRU_TUNING 1
USERMAPPING OFF
SP_AUTOEXPAND 1
SP_THRESHOLD 0
SP_WAITTIME 30
DEFAULTESCCHAR \
LOW_MEMORY_RESERVE 0
LOW_MEMORY_MGR 0
REMOTE_SERVER_CFG
REMOTE_USERS_CFG
S6_USE_REMOTE_SERVER_CFG 0
GSKIT_VERSION
NETTYPE drsoctcp,1,150,NET
If it is a multiprocessor machine, definitely consider turning on MULTIPROCESSOR by setting it to a non-zero value.
The ONCONFIG parameters of greatest interest to you for DSS are those related to Parallel Data Query, or PDQ: the block that commences with MAX_PDQPRIORITY. It is worth perusing the fine manual on these specifically, because the inter-relationship between them and some other parameters is too complex to go into here.
But in essence, DS_MAX_QUERIES is the maximum number of parallel queries permitted at any time, and DS_MAX_SCANS determines the number of IO threads for scanning your tables. DS_TOTAL_MEMORY determines the amount of memory allocated for PDQ processing, and there is an algorithm in the manual that shows how these variables and the user's PDQPRIORITY setting combine.
You might also want to consider lifting the RA_PAGES and RA_THRESHOLD values - these determine how many pages are read into memory as 'blocks' before grabbing the next batch. If you want to favour table scans (which generally you do in DSS), then increasing these to something like 256 and 128 might improve performance.
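Purely as an illustration of the parameters mentioned above (the numbers are placeholders to experiment with, not a tuned recommendation), the affected onconfig lines might end up looking like:
MULTIPROCESSOR 1
VPCLASS cpu,num=2,noage
SINGLE_CPU_VP 0
RA_PAGES 256
RA_THRESHOLD 128
MAX_PDQPRIORITY 100
DS_MAX_QUERIES 2
DS_MAX_SCANS 2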
My experience is with SMP and MPP unix boxes, rather than Windows, so I'm not sure how much you can wring out of your architecture, but this is where you want to start.
I would recommend identifying a good DSS query that runs for a decent length of time, and changing one parameter at a time to see the effect. SET EXPLAIN ON is your friend here, too.
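A minimal test session in dbaccess might look something like this (sales_fact is a hypothetical table standing in for one of your own long-running DSS queries; the query plan lands in sqexplain.out):
SET EXPLAIN ON;
SET PDQPRIORITY 100;
-- hypothetical DSS-style query; substitute one of your own
SELECT region, SUM(amount) FROM sales_fact GROUP BY region;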
One last thing - 11.7 supports table compression, and the tests I've seen show dramatic improvements in a DSS environment with large reads and irregular writes.

How to interpret the memory usage figures?

Can someone explain this in a practical way? The sample represents usage for one low-traffic Rails site using Nginx and a cluster of 3 Mongrels. I ask because I am aiming to learn about page caching and wonder whether these figures have significant meaning for that process. Thank you. Great site!
me#vps:~$ free -m
total used free shared buffers cached
Mem: 512 506 6 0 15 103
-/+ buffers/cache: 387 124
Swap: 1023 113 910
Physical memory is all used up. Why? Because it's there, the system should be using it.
You'll note also that the system is using 113M of swap space. Bad? Good? It depends.
See also that there's 103M of cached disk; this means that the system has decided it's better to cache 103M of disk and swap out those 113M; maybe you have some processes holding memory that is not actually being used, and it has therefore been paged out to disk.
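A quick sanity check on the numbers above: the memory actually available to applications is free + buffers + cached = 6 + 15 + 103 = 124 MB, which is exactly the 'free' figure on the -/+ buffers/cache line, and 506 - 15 - 103 ≈ 387 MB matches the 'used' figure there.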
As the other poster said, you should be using other tools to see what's happening:
Your perception: is the site running appropriately when you use it?
Benchmarking: what response times are your clients seeing?
More fine-grained diagnostics:
top: you can see live which processes are using memory and CPU
vmstat: it produces this kind of output:
alex#armitage:~$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 1 71184 156520 92524 316488 1 5 12 23 362 250 13 6 80 1
0 0 71184 156340 92528 316508 0 0 0 1 291 608 10 1 89 0
0 0 71184 156364 92528 316508 0 0 0 0 308 674 9 2 89 0
0 0 71184 156364 92532 316504 0 0 0 72 295 723 9 0 91 0
1 0 71184 150892 92532 316508 0 0 0 0 370 722 38 0 62 0
0 0 71184 163060 92532 316508 0 0 0 0 303 611 17 2 81 0
which will show you whether swap is hurting you (high numbers in the si and so columns) and gives an easier-to-read view of performance over time.
By my reading of this, you have used almost all your memory, have 6 MB free, and are using about 10% of your swap. A more useful tool is top, or perhaps ps, to see how much RAM each of your individual Mongrels is using. Because you're going into swap, you're probably getting some slowdowns. You might find that running only 2 Mongrels rather than 3 actually responds faster, because it likely wouldn't go into swap.
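For example, assuming the Mongrel processes show up under the name mongrel_rails (adjust to however they appear on your system), this lists each one's resident memory in kilobytes:
ps -o pid,rss,args -C mongrel_rails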
Page caching will for sure help a ton with response time, so if your pages are cacheable (e.g. they don't have content that is unique to the individual user), I would say definitely check it out.
