Is it possible to imitate arbitration on CAN bus? - can-bus

Setup
I have two nodes connected to one CAN bus. The first node is a black-box, controlled by some real-time hardware. The second node is a Linux machine with attached PEAK-USB CAN controller:
+--------+               +----------+
| HW CAN |--- CAN BUS ---| Linux PC |
+--------+               +----------+
In order to investigate a problem related to occasional frame loss, I want to mimic the CAN arbitration process. To do that I set the CAN bit-rate to 125 kbit/s and flood the bus with random CAN frames with a 1 ms delay between them, monitoring the bus load with canbusload from can-utils. I also monitor CAN error frames by running candump can0,0~0,#ffffffff and the overall CAN statistics with ip -s -d link show can:
26: can0: <NOARP,UP,LOWER_UP,ECHO> mtu 16 qdisc pfifo_fast state UNKNOWN mode DEFAULT group default qlen 10
link/can promiscuity 0
can state ERROR-ACTIVE restart-ms 0
bitrate 125000 sample-point 0.875
tq 500 prop-seg 6 phase-seg1 7 phase-seg2 2 sjw 1
pcan_usb: tseg1 1..16 tseg2 1..8 sjw 1..4 brp 1..64 brp-inc 1
clock 8000000
re-started bus-errors arbit-lost error-warn error-pass bus-off
0 0 0 0 0 0
RX: bytes packets errors dropped overrun mcast
120880 15110 0 0 0 0
TX: bytes packets errors dropped carrier collsns
234123 123412 0 0 0 0
Problem
Now the problem is that the given setup runs for hours with zero collisions (arbitration losses) or any other kind of error frames, even when the load is at 99%. When I reduce the delay to increase the bus load further, write(2) fails with either "ENOBUFS 105 No buffer space available" or "EAGAIN 11 Resource temporarily unavailable" - which error I get depends on whether I modify the qlen parameter or leave it at the default.
As I understand it, the load I generate is either not enough or too much. What would be the right way to make the two nodes enter arbitration? A successful result would be a received CAN error frame corresponding to the CAN_ERR_LOSTARB constant from can/error.h and a non-zero collsns counter.
Source code
HW Node (Arduino Due with CAN board)
#include <due_can.h>

CAN_FRAME input, output;

// the setup function runs once when you press reset or power the board
void setup() {
    Serial.begin(9600);
    Serial.println("start");
    // Can0.begin(CAN_BPS_10K);
    Can0.begin(CAN_BPS_125K);
    // Can0.begin(CAN_BPS_250K);
    output.id = 0x303;
    output.length = 8;
    output.data.low = 0x12abcdef;
    output.data.high = 0x24abcdef;
}

// the loop function runs over and over again forever
void loop() {
    Can0.sendFrame(output);
    Can0.read(input);
    delay(1);
}
Linux node
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <net/if.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <linux/can.h>
#include <linux/can/raw.h>
int main(int argc, char *argv[])
{
    int s;
    int nbytes;
    struct sockaddr_can addr;
    struct can_frame frame;
    struct ifreq ifr;
    const char *ifname = "can0";

    if ((s = socket(PF_CAN, SOCK_RAW, CAN_RAW)) < 0) {
        perror("Error while opening socket");
        return -1;
    }

    strcpy(ifr.ifr_name, ifname);
    ioctl(s, SIOCGIFINDEX, &ifr);

    addr.can_family = AF_CAN;
    addr.can_ifindex = ifr.ifr_ifindex;
    printf("%s at index %d\n", ifname, ifr.ifr_ifindex);

    if (bind(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("Error in socket bind");
        return -2;
    }

    frame.can_id = 0x304;
    frame.can_dlc = 2;
    frame.data[0] = 0x11;
    frame.data[1] = 0x22;

    /* argv[1] is the delay in milliseconds; usleep() expects microseconds */
    int sleep_us = (argc > 1 ? atoi(argv[1]) : 1) * 1000;

    for (;;) {
        nbytes = write(s, &frame, sizeof(struct can_frame));
        if (nbytes == -1) {
            perror("write");
            return 1;
        }
        usleep(sleep_us);
    }

    return 0;
}

According to subsection 4.1.2 of the SocketCAN documentation (RAW socket option CAN_RAW_ERR_FILTER), error frames are not delivered by default, which is why the lost-arbitration counter reported by ip was not increasing.
To enable reporting of all error classes, you need to add these two lines:
can_err_mask_t err_mask = CAN_ERR_MASK;
setsockopt(socket_can, SOL_CAN_RAW, CAN_RAW_ERR_FILTER, &err_mask, sizeof(err_mask));
However, this feature is not available for all drivers and devices, because it requires the hardware to support a loopback mode. In the case of the PEAK-USB, it seems that firmware versions below 4.x do not provide loopback [source], so SocketCAN won't be able to detect lost arbitration.
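For completeness, here is a minimal sketch of the receive side (untested; it assumes the adapter actually reports arbitration loss and that s is the RAW socket from the program above): enable error frames once, then check received frames for CAN_ERR_LOSTARB.
#include <linux/can/error.h>   /* CAN_ERR_LOSTARB, CAN_ERR_MASK, CAN_ERR_LOSTARB_UNSPEC */

/* Hypothetical helper: returns 1 if an arbitration-lost error frame was read. */
static int check_lost_arbitration(int s)
{
    /* Subscribe to every error class; in a real program do this once after bind(). */
    can_err_mask_t err_mask = CAN_ERR_MASK;
    setsockopt(s, SOL_CAN_RAW, CAN_RAW_ERR_FILTER, &err_mask, sizeof(err_mask));

    struct can_frame rx;
    if (read(s, &rx, sizeof(rx)) != sizeof(rx))
        return 0;

    if ((rx.can_id & CAN_ERR_FLAG) && (rx.can_id & CAN_ERR_LOSTARB)) {
        /* data[0] holds the bit position where arbitration was lost,
           or CAN_ERR_LOSTARB_UNSPEC if the controller cannot tell. */
        printf("arbitration lost at bit %d\n", rx.data[0]);
        return 1;
    }
    return 0;
}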

Related

Packet Lost using UART Driver of Telit's LE910Cx MCU

It takes up to 40 minutes before a packet is lost (at a rate of 1 packet every few minutes).
The MCU uses Linux kernel 3.18.48.
Using a scope on the UART's Rx pin, I can see the packets (about 15 bytes long) are sent correctly.
But read() does not return with any of the packet's bytes (VMIN = 1, VTIME = 0, configured to return as soon as at least 1 byte is in the Rx buffer).
This code is used in 4 other projects on different hardware boards, and we have never seen this issue before.
Can you share ideas on how to tackle such an issue? How can I debug the UART driver to better understand where the packet got lost?
Thanks.
[Logic analyzer capture of the lost packet]
E_UARTDRV_STATUS UartDrv_Open(void *pUart, S_UartDrv_InitData *init_data)
{
struct termios tty;
struct serial_struct serial;
/*
* O_RDWR - Opens the port for reading and writing
* O_NOCTTY - The port never becomes the controlling terminal of the process.
* O_NDELAY - Use non-blocking I/O.
* On some systems this also means the RS232 DCD signal line is ignored.
* Note well: if present, the O_EXCL flag is silently ignored by the kernel when opening a serial device like a modem.
* On modern Linux systems programs like ModemManager will sometimes read and write to your device and possibly corrupt your program state.
* To avoid problems with programs like ModemManager you should set TIOCEXCL on the terminal after associating a terminal with the device.
* You cannot open with O_EXCL because it is silently ignored.
*/
fd = open(init_data->PortName, O_RDWR | O_NOCTTY | O_NDELAY);
if (fd == -1) { // if there is an invalid descriptor, print the reason
SYS_LOG_ERR_V("fd invalid whilst trying to open com port %s: %s\n", init_data->PortName, strerror(errno));
return UARTDRV_STATUS_ERROR;
}
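/* (Added sketch, not in the original driver.) As the comment above notes,
 * O_EXCL is ignored for serial devices, so exclusive access has to be
 * requested explicitly with TIOCEXCL after the port is open: */
if (ioctl(fd, TIOCEXCL) < 0) {
SYS_LOG_ERR_V("Error failed to set TIOCEXCL: %s\n", strerror(errno));
/* treated as non-fatal in this sketch */
}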
if (tcflush(fd, TCIOFLUSH) < 0) {
SYS_LOG_ERR_V("Error failed to flush input output buffers %s\n", strerror(errno));
return UARTDRV_STATUS_ERROR;
}
// Enable low latency...this should affect the file /sys/bus/usb-serial/devices/ttyUSB0/latency_timer
if (ioctl(fd, TIOCGSERIAL, &serial) < 0) {
SYS_LOG_ERR_V("Error failed to get latency current value: %s\n", strerror(errno));
return UARTDRV_STATUS_ERROR;
}
serial.flags |= ASYNC_LOW_LATENCY;
if (ioctl(fd, TIOCSSERIAL, &serial) < 0) {
SYS_LOG_ERR_V("Error failed to set Low latency: %s\n", strerror(errno));
return UARTDRV_STATUS_ERROR;
}
if (fcntl(fd, F_SETFL, 0) < 0) {
SYS_LOG_ERR_V("Error failed to set file flags: %s\n", strerror(errno));
return UARTDRV_STATUS_ERROR;
}
/* Get current configuration */
if (tcgetattr(fd, &tty) < 0) {
SYS_LOG_ERR_V("Error failed to get current configuration: %s\n", strerror(errno));
return UARTDRV_STATUS_ERROR;
}
if (cfsetospeed(&tty, init_data->baud) < 0) {
SYS_LOG_ERR_V("Error failed to set output baud rate: %s\n", strerror(errno));
return UARTDRV_STATUS_ERROR;
}
if (cfsetispeed(&tty, init_data->baud) < 0) {
SYS_LOG_ERR_V("Error failed to set input baud rate: %s\n", strerror(errno));
return UARTDRV_STATUS_ERROR;
}
tty.c_cflag |= (CLOCAL | CREAD); /* Enable the receiver and set local mode */
tty.c_cflag &= ~CSIZE;
tty.c_cflag |= CS8; /* 8-bit characters */
tty.c_cflag &= ~PARENB; /* no parity bit */
tty.c_cflag &= ~CSTOPB; /* only need 1 stop bit */
/*
* Input flags - Turn off input processing
* convert break to null byte, no CR to NL translation,
* no NL to CR translation, don't mark parity errors or breaks
* no input parity check, don't strip high bit off,
* no XON/XOFF software flow control
* BRKINT - If this bit is set and IGNBRK is not set, a break condition clears the terminal input and output queues and raises a SIGINT signal for the foreground process group associated with the terminal.
* If neither BRKINT nor IGNBRK are set, a break condition is passed to the application as a single '\0' character if PARMRK is not set, or otherwise as a three-character sequence '\377', '\0', '\0'.
* INPCK - If this bit is set, input parity checking is enabled. If it is not set, no checking at all is done for parity errors on input; the characters are simply passed through to the application.
* Parity checking on input processing is independent of whether parity detection and generation on the underlying terminal hardware is enabled; see Control Modes.
* For example, you could clear the INPCK input mode flag and set the PARENB control mode flag to ignore parity errors on input, but still generate parity on output.
* If this bit is set, what happens when a parity error is detected depends on whether the IGNPAR or PARMRK bits are set. If neither of these bits are set, a byte with a parity error is passed to the application as a '\0' character.
*/
tty.c_iflag &= ~(BRKINT | PARMRK | ISTRIP | INLCR | IGNCR | ICRNL | IXON);
/*
* IGNBRK - If this bit is set, break conditions are ignored.
* A break condition is defined in the context of asynchronous serial data transmission as a series of zero-value bits longer than a single byte.
*/
tty.c_iflag |= IGNBRK;
/*
* No line processing
* echo off, echo newline off, canonical mode off,
* extended input processing off, signal chars off
*/
tty.c_lflag &= ~(ECHO | ECHONL | ICANON | ISIG | IEXTEN);
/*
* Output flags - Turn off output processing
* no CR to NL translation, no NL to CR-NL translation,
* no NL to CR translation, no column 0 CR suppression,
* no Ctrl-D suppression, no fill characters, no case mapping,
* no local output processing
*
* c_oflag &= ~(OCRNL | ONLCR | ONLRET | ONOCR | ONOEOT| OFILL | OLCUC | OPOST);
*/
tty.c_oflag = 0;
/* fetch bytes as they become available */
tty.c_cc[VMIN] = 0;
tty.c_cc[VTIME] = 1; // timeout in 10th of second
if (tcsetattr(fd, TCSANOW, &tty) != 0) {
SYS_LOG_ERR_V("Error failed to set new configuration: %s\n", strerror(errno));
return UARTDRV_STATUS_ERROR;
}
uartPeripheral.fd = fd;
return UARTDRV_STATUS_SUCCESS;
}
uint8_t *UartDrv_Rx(S_UartDrv_Handle *handle, uint16_t bytesToRead, uint16_t *numBytesRead)
{
ssize_t n_read;
uint16_t n_TotalReadBytes = 0;
struct timespec timestamp;
struct timespec now;
long diff_ms;
bool Timeout = false, isPartialRead = false;
if (handle == NULL) {
SYS_LOG_ERR("UartDrv Error: async rx error - peripheral error");
exit(EXIT_FAILURE);
}
if (bytesToRead > sizeof(uartPeripheral.buffer_rx)) {
ESILOG_ERR_V("Param Error: Invalid length %u, max length %zu", bytesToRead, sizeof(uartPeripheral.buffer_rx));
*numBytesRead = 0;
return NULL;
}
while(n_TotalReadBytes < bytesToRead && !Timeout) {
do {
n_read = read(uartPeripheral.fd, &uartPeripheral.buffer_rx[n_TotalReadBytes], bytesToRead - n_TotalReadBytes);
if (isPartialRead) {
clock_gettime(CLOCK_REALTIME, &now);
diff_ms = (now.tv_sec - timestamp.tv_sec)*1000;
diff_ms += (now.tv_nsec - timestamp.tv_nsec)/1000000;
if (diff_ms > UART_READ_TIMEOUT_MS) {
SYS_LOG_ERR("UartDrv_Rx: Error, timeout while reading\r\n");
Timeout = true;
}
}
} while ((n_read != -1) && (n_read == 0) && !Timeout);
if (n_read == -1) {
if (errno == EINTR) {
ESILOG_WARN("Uart Interrupted");
continue;
}
ESILOG_ERR_V("Uart Error: [%d, %s]", errno, strerror(errno));
exit(EXIT_FAILURE);
}
n_TotalReadBytes += (uint16_t)n_read;
if (n_TotalReadBytes < bytesToRead) {
//SYS_LOG_DBG_V("UartDrv_Rx: couldn't fetch all bytes, read %hu, expected %hu, continue reading %s\r\n", n_TotalReadBytes, bytesToRead, isPartialRead? "During Partial read": "");
if (!isPartialRead) {
isPartialRead = true;
clock_gettime(CLOCK_REALTIME, &timestamp);
}
}
}
*numBytesRead = n_TotalReadBytes;
return uartPeripheral.buffer_rx;
}

bpftrace doesn’t recognise a syscall argument as negative

Here is a simple bpftrace script:
#!/usr/bin/env bpftrace
tracepoint:syscalls:sys_enter_kill
{
$tpid = args->pid;
printf("%d %d %d\n", $tpid, $tpid < 0, $tpid >= 0);
}
It traces kill syscalls, prints the target PID and two additional values: whether it is negative, and whether it is non-negative.
Here is the output that I get:
# ./test.bt
Attaching 1 probe...
-1746 0 1
-2202 0 1
4160 0 1
4197 0 1
4197 0 1
-2202 0 1
-1746 0 1
Weirdly, both positive and negative pids appear to be positive for the comparison operator.
Just as a sanity check, if I replace the assignment line with:
$tpid = -10;
what I get is exactly what I expect:
# ./test.bt
Attaching 1 probe...
-10 1 0
-10 1 0
-10 1 0
What am I doing wrong?
As you've discovered, bpftrace assigns a u64 type to your $tpid variable. Yet, according to the tracepoint format documentation, args->pid should be of type pid_t, i.e. int.
# cat /sys/kernel/debug/tracing/events/syscalls/sys_enter_kill/format
name: sys_enter_kill
ID: 185
format:
field:unsigned short common_type; offset:0; size:2; signed:0;
field:unsigned char common_flags; offset:2; size:1; signed:0;
field:unsigned char common_preempt_count; offset:3; size:1; signed:0;
field:int common_pid; offset:4; size:4; signed:1;
field:int __syscall_nr; offset:8; size:4; signed:1;
field:pid_t pid; offset:16; size:8; signed:0;
field:int sig; offset:24; size:8; signed:0;
print fmt: "pid: 0x%08lx, sig: 0x%08lx", ((unsigned long)(REC->pid)), ((unsigned long)(REC->sig))
The bpftrace function that assigns this type is TracepointFormatParser::adjust_integer_types(). This change was introduced by commit 42ce08f to address issue #124.
For the above tracepoint description, bpftrace generates the following structure:
struct _tracepoint_syscalls_sys_enter_kill
{
unsigned short common_type;
unsigned char common_flags;
unsigned char common_preempt_count;
int common_pid;
int __syscall_nr;
u64 pid;
s64 sig;
};
When it should likely generate:
struct _tracepoint_syscalls_sys_enter_kill
{
unsigned short common_type;
unsigned char common_flags;
unsigned char common_preempt_count;
int common_pid;
int __syscall_nr;
u32 pad1;
pid_t pid;
u32 pad2;
int sig;
};
bpftrace seems to be confused by the size parameter that doesn't match the type in the above description. All syscall arguments get size 8 (on 64-bit at least), but that doesn't mean all 8 bytes are used. I think it would be worth opening an issue on bpftrace.
There is something strange going on with integer types in bpftrace (see #554, #772, #834 for details).
It seems that in my case args->pid gets treated as a 64-bit value by default, while it is actually not. So the solution is to explicitly cast it:
$tpid = (int32)args->pid;
And now it works as expected:
# bpftrace test.bt
Attaching 1 probe...
-2202 1 0
-1746 1 0
-2202 1 0
4160 0 1
4197 0 1

What exactly are the transaction metrics reported by NVPROF?

I'm trying to figure out what exactly each of the metrics reported by nvprof is. More specifically, I can't figure out which transactions count as System Memory and Device Memory reads and writes. I wrote a very basic program just to help figure this out.
#define TYPE float
#define BDIMX 16
#define BDIMY 16
#include <cuda.h>
#include <cstdio>
#include <iostream>
__global__ void kernel(TYPE *g_output, TYPE *g_input, const int dimx, const int dimy)
{
__shared__ float s_data[BDIMY][BDIMX];
int ix = blockIdx.x * blockDim.x + threadIdx.x;
int iy = blockIdx.y * blockDim.y + threadIdx.y;
int in_idx = iy * dimx + ix; // index for reading input
int tx = threadIdx.x; // thread’s x-index into corresponding shared memory tile
int ty = threadIdx.y; // thread’s y-index into corresponding shared memory tile
s_data[ty][tx] = g_input[in_idx];
__syncthreads();
g_output[in_idx] = s_data[ty][tx] * 1.3;
}
int main(){
int size_x = 16, size_y = 16;
dim3 numTB;
numTB.x = (int)ceil((double)(size_x)/(double)BDIMX) ;
numTB.y = (int)ceil((double)(size_y)/(double)BDIMY) ;
dim3 tbSize;
tbSize.x = BDIMX;
tbSize.y = BDIMY;
float* a,* a_out;
float *a_d = (float *) malloc(size_x * size_y * sizeof(TYPE));
cudaMalloc((void**)&a, size_x * size_y * sizeof(TYPE));
cudaMalloc((void**)&a_out, size_x * size_y * sizeof(TYPE));
for(int index = 0; index < size_x * size_y; index++){
a_d[index] = index;
}
cudaMemcpy(a, a_d, size_x * size_y * sizeof(TYPE), cudaMemcpyHostToDevice);
kernel <<<numTB, tbSize>>>(a_out, a, size_x, size_y);
cudaDeviceSynchronize();
return 0;
}
Then I run nvprof --metrics all to see all the metrics. This is the part I'm interested in:
Metric Name Metric Description Min Max Avg
Device "Tesla K40c (0)"
Kernel: kernel(float*, float*, int, int)
local_load_transactions Local Load Transactions 0 0 0
local_store_transactions Local Store Transactions 0 0 0
shared_load_transactions Shared Load Transactions 8 8 8
shared_store_transactions Shared Store Transactions 8 8 8
gld_transactions Global Load Transactions 8 8 8
gst_transactions Global Store Transactions 8 8 8
sysmem_read_transactions System Memory Read Transactions 0 0 0
sysmem_write_transactions System Memory Write Transactions 4 4 4
tex_cache_transactions Texture Cache Transactions 0 0 0
dram_read_transactions Device Memory Read Transactions 0 0 0
dram_write_transactions Device Memory Write Transactions 40 40 40
l2_read_transactions L2 Read Transactions 70 70 70
l2_write_transactions L2 Write Transactions 46 46 46
I understand the shared and global accesses. The global accesses are coalesced and since there are 8 warps, there are 8 transactions.
But I can't figure out the system memory and device memory write transaction numbers.
It helps if you have a model of the GPU memory hierarchy with both logical and physical spaces, such as the one here.
Referring to the "overview tab" diagram:
gld_transactions refers to transactions issued from the warp targeting the global logical space. On the diagram, this would be the line from the "Kernel" box on the left to the "global" box to the right of it, and the logical data movement direction would be from right to left.
gst_transactions refers to the same line as above, but logically from left to right. Note that these logical global transactions could hit in a cache and not go anywhere after that. From the metrics standpoint, those transaction types only refer to the indicated line on the diagram.
dram_write_transactions refers to the line on the diagram which connects device memory on the right with the L2 cache, and the logical data flow is from left to right on this line. Since the L2 cache line is 32 bytes (whereas the L1 cache line and the size of a global transaction are 128 bytes), the device memory transactions are also 32 bytes, not 128 bytes. So a global write transaction that passes through L1 (it is a write-through cache, if enabled) and L2 will generate 4 dram_write transactions; with 8 global store transactions, that accounts for 8 × 4 = 32 of the 40 transactions.
System memory transactions target zero-copy host memory. Your code doesn't seem to use any, so I can't explain those.
Note that in some cases, for some metrics, on some GPUs, the profiler may have some "inaccuracy" when launching very small numbers of threadblocks. For example, some metrics are sampled on a per-SM basis and scaled. (device memory transactions are not in this category, however). If you have disparate work being done on each SM (perhaps due to a very small number of threadblocks launched) then the scaling can be misleading/less accurate. Generally if you launch a larger number of threadblocks, these usually become insignificant.
This answer may also be of interest.

Amount of local memory per CUDA thread

I read in the NVIDIA documentation (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications, table #12) that the amount of local memory per thread is 512 KB for my GPU (GTX 580, compute capability 2.0).
I tried unsuccessfully to check this limit on Linux with CUDA 6.5.
Here is the code I used (its only purpose is to test the local memory limit; it doesn't do any useful computation):
#include <iostream>
#include <stdio.h>
#define MEMSIZE 65000 // 65000 -> out of memory, 60000 -> ok
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=false)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if( abort )
exit(code);
}
}
inline void gpuCheckKernelExecutionError( const char *file, int line)
{
gpuAssert( cudaPeekAtLastError(), file, line);
gpuAssert( cudaDeviceSynchronize(), file, line);
}
__global__ void kernel_test_private(char *output)
{
int c = blockIdx.x*blockDim.x + threadIdx.x; // absolute col
int r = blockIdx.y*blockDim.y + threadIdx.y; // absolute row
char tmp[MEMSIZE];
for( int i = 0; i < MEMSIZE; i++)
tmp[i] = 4*r + c; // dummy computation in local mem
for( int i = 0; i < MEMSIZE; i++)
output[i] = tmp[i];
}
int main( void)
{
printf( "MEMSIZE=%d bytes.\n", MEMSIZE);
// allocate memory
char output[MEMSIZE];
char *gpuOutput;
cudaMalloc( (void**) &gpuOutput, MEMSIZE);
// run kernel
dim3 dimBlock( 1, 1);
dim3 dimGrid( 1, 1);
kernel_test_private<<<dimGrid, dimBlock>>>(gpuOutput);
gpuCheckKernelExecutionError( __FILE__, __LINE__);
// transfer data from GPU memory to CPU memory
cudaMemcpy( output, gpuOutput, MEMSIZE, cudaMemcpyDeviceToHost);
// release resources
cudaFree(gpuOutput);
cudaDeviceReset();
return 0;
}
And the compilation command line:
nvcc -o cuda_test_private_memory -Xptxas -v -O2 --compiler-options -Wall cuda_test_private_memory.cu
The compilation is ok, and reports:
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z19kernel_test_privatePc' for 'sm_20'
ptxas info : Function properties for _Z19kernel_test_privatePc
65000 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 21 registers, 40 bytes cmem[0]
I got an "out of memory" error at runtime on the GTX 580 when I reached 65000 bytes per thread. Here is the exact output of the program in the console:
MEMSIZE=65000 bytes.
GPUassert: out of memory cuda_test_private_memory.cu 48
I also did a test with a GTX 770 GPU (on Linux with CUDA 6.5). It ran without error for MEMSIZE=200000, but the "out of memory error" occurred at runtime for MEMSIZE=250000.
How can this behavior be explained? Am I doing something wrong?
It seems you are running into not a local memory limitation but a stack size limitation:
ptxas info : Function properties for _Z19kernel_test_privatePc
65000 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
The variable that you had intended to be local is on the (GPU thread) stack, in this case.
Based on the information provided by @njuffa here, the available stack size limit is the lesser of:
The maximum local memory size (512KB for cc2.x and higher)
GPU memory/(#of SMs)/(max threads per SM)
Clearly, the first limit is not the issue. I assume you have a "standard" GTX 580, which has 1.5 GB of memory and 16 SMs. A cc2.x device has a maximum of 1536 resident threads per multiprocessor. This means we have 1536 MB / 16 / 1536 = 65536 bytes of stack per thread. There is some overhead and other memory usage that subtracts from the total available memory, so the actual stack size limit is somewhat below 65536 bytes, somewhere between 60000 and 65000 in your case, apparently.
I suspect a similar calculation on your GTX770 would yield a similar result, i.e. a maximum stack size between 200000 and 250000.
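For reference, the default per-thread stack size can be inspected, and a larger one requested, through the CUDA runtime limits API. A minimal host-side sketch (error checking mostly omitted; whether a larger request is honoured still depends on the available device memory, along the lines of the calculation above):
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t stack_bytes = 0;
    cudaDeviceGetLimit(&stack_bytes, cudaLimitStackSize);       // current per-thread stack
    printf("default per-thread stack: %zu bytes\n", stack_bytes);

    // Request a larger per-thread stack (the runtime may round or reject the request).
    cudaError_t err = cudaDeviceSetLimit(cudaLimitStackSize, 128 * 1024);
    if (err != cudaSuccess)
        printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));

    cudaDeviceGetLimit(&stack_bytes, cudaLimitStackSize);
    printf("per-thread stack now: %zu bytes\n", stack_bytes);
    return 0;
}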

Memory bandwidth measurement with memset,memcpy

I am trying to understand the performance of memory operations with memcpy/memset. I measure the time needed for a loop containing memset and memcpy; see the attached code (it is in C++11, but in plain C the picture is the same). It is understandable that memset is faster than memcpy. But this is more or less the only thing I understand... The biggest question is:
Why is there such a strong dependence on the number of loop iterations?
The application is single threaded! And the CPU is: AMD FX(tm)-4100 Quad-Core Processor.
And here are some numbers:
memset: iters=1 0.0625 GB in 0.1269 s : 0.4927 GB per second
memcpy: iters=1 0.0625 GB in 0.1287 s : 0.4857 GB per second
memset: iters=4 0.25 GB in 0.151 s : 1.656 GB per second
memcpy: iters=4 0.25 GB in 0.1678 s : 1.49 GB per second
memset: iters=16 1 GB in 0.2406 s : 4.156 GB per second
memcpy: iters=16 1 GB in 0.3184 s : 3.14 GB per second
memset: iters=128 8 GB in 1.074 s : 7.447 GB per second
memcpy: iters=128 8 GB in 1.737 s : 4.606 GB per second
The code:
/*
-- Compilation and run:
g++ -O3 -std=c++11 -o mem-speed mem-speed.cc && ./mem-speed
-- Output example:
*/
#include <cstdio>
#include <chrono>
#include <memory>
#include <string.h>
using namespace std;
const uint64_t _KB=1024, _MB=_KB*_KB, _GB=_KB*_KB*_KB;
std::pair<double,char> measure_memory_speed(uint64_t buf_size,int n_iters)
{
// without returning something from the buffers, the compiler will optimize memset() and memcpy() calls
char retval=0;
unique_ptr<char[]> buf1(new char[buf_size]), buf2(new char[buf_size]);
auto time_start = chrono::high_resolution_clock::now();
for( int i=0; i<n_iters; i++ )
{
memset(buf1.get(),123,buf_size);
retval += buf1[0];
}
auto t1 = chrono::duration_cast<std::chrono::nanoseconds>(chrono::high_resolution_clock::now() - time_start);
time_start = chrono::high_resolution_clock::now();
for( int i=0; i<n_iters; i++ )
{
memcpy(buf2.get(),buf1.get(),buf_size);
retval += buf2[0];
}
auto t2 = chrono::duration_cast<std::chrono::nanoseconds>(chrono::high_resolution_clock::now() - time_start);
printf("memset: iters=%d %g GB in %8.4g s : %8.4g GB per second\n",
n_iters,n_iters*buf_size/double(_GB),(double)t1.count()/1e9, n_iters*buf_size/double(_GB) / (t1.count()/1e9) );
printf("memcpy: iters=%d %g GB in %8.4g s : %8.4g GB per second\n",
n_iters,n_iters*buf_size/double(_GB),(double)t2.count()/1e9, n_iters*buf_size/double(_GB) / (t2.count()/1e9) );
printf("\n");
double avr = n_iters*buf_size/_GB * (1e9/t1.count()+1e9/t2.count()) / 2;
retval += buf1[0]+buf2[0];
return std::pair<double,char>(avr,retval);
}
int main(int argc,const char **argv)
{
uint64_t n=64;
if( argc==2 )
n = atoi(argv[1]);
for( int i=0; i<=10; i++ )
measure_memory_speed(n*_MB,1<<i);
return 0;
}
Surely this is just down to caching: the instruction caches are loaded, so the code runs faster after the first iteration, and the data caches speed up access for the memset/memcpy on further iterations. The cache memory is inside the processor, so it doesn't have to fetch from or write to the slower external memory as often, and therefore runs faster.
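One way to test that explanation is to warm the buffers once before the timed loops, so that first-touch page faults and cold caches are excluded from the measurement. A minimal sketch of such a warm-up pass (a hypothetical addition, placed just after the buffers are allocated in measure_memory_speed):
// Hypothetical warm-up pass, not in the original code: touch every byte of
// both buffers once before starting the timers.
memset(buf1.get(), 0, buf_size);
memset(buf2.get(), 0, buf_size);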

Resources