LLVM is optimizing the intrinsic code as well - clang

We have some code that is written manually with intrinsics, but LLVM optimizes it further because of the -ffast-math flag. The hand-written intrinsic version performs better than the LLVM-optimized one.
Example source code:
template <>
inline __m256 simd_evaluate_polynomial<__m256, APPROX_DEFAULT>(__m256 x, const std::array<__m256, APPROX_DEFAULT + 1>& coeff)
{
    __m256 power = _mm256_set1_ps(1.0f);
    __m256 res = _mm256_set1_ps(0.0f);
    for (unsigned int i = 0; i <= APPROX_DEFAULT; i++) {
        __m256 term = _mm256_mul_ps(coeff[i], power);
        power = _mm256_mul_ps(power, x);
        res = _mm256_add_ps(res, term);
    }
    return res;
}
For the function above, LLVM generates the following assembly (shown with profiler timings):
Address Source Line Assembly CPU Time: Total CPU Time: Self
0x1402bbf7d 0 Block 1:
0x1402bbf7d 19 vmovaps ymm5, ymmword ptr [rip+0x50e4b5b] 0.1% 15.584ms
0x1402bbf85 19 vfmadd213ps ymm5, ymm3, ymmword ptr [rip+0x50e4b32] 0.1% 15.595ms
0x1402bbf8e 19 vfmadd213ps ymm5, ymm3, ymmword ptr [rip+0x50e4b09] 0.6% 93.654ms
0x1402bbf97 19 vfmadd213ps ymm5, ymm3, ymmword ptr [rip+0x50e4ae0] 0.2% 31.178ms
0x1402bbfa0 21 vfmadd213ps ymm5, ymm3, ymmword ptr [rip+0x50e4ab7] 0.3% 46.992ms
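For context: the vfmadd213ps chain above means LLVM has contracted the multiply/add pairs into fused multiply-adds and re-associated the evaluation into Horner form, which -ffast-math permits. As a sketch (hypothetical function, not part of our source), the generated code corresponds roughly to:
// Horner-form evaluation equivalent to the assembly above;
// _mm256_fmadd_ps(a, b, c) computes a*b + c in one instruction.
__m256 horner_sketch(__m256 x, const std::array<__m256, APPROX_DEFAULT + 1>& coeff)
{
    __m256 res = coeff[APPROX_DEFAULT];
    for (unsigned int i = APPROX_DEFAULT; i-- > 0; )
        res = _mm256_fmadd_ps(res, x, coeff[i]);  // res = res*x + coeff[i]
    return res;
}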
Can anyone please explain why this is happening?


cJSON Memory leak when freeing cJSON object

I am facing an issue while using the cJSON library. I suspect there is a memory leak that breaks the code after a certain time (40 minutes to 1 hour).
I have copied my code below:
void my_work_handler_5(struct k_work *work)
{
    char *ptr1[6];
    int y = 0;
    static int counterdo = 0;
    char *desc6 = "RSRP";
    char *id6 = "dBm";
    char *type6 = "RSRP";
    char rsrp_str[100];
    snprintf(rsrp_str, sizeof(rsrp_str), "%d", rsrp_current);
    sensor5 = cJSON_CreateObject();
    cJSON_AddItemToObject(sensor5, "description", cJSON_CreateString(desc6));
    cJSON_AddItemToObject(sensor5, "Time", cJSON_CreateString(time_string));
    cJSON_AddItemToObject(sensor5, "value", cJSON_CreateNumber(rsrp_current));
    cJSON_AddItemToObject(sensor5, "unit", cJSON_CreateString(id6));
    cJSON_AddItemToObject(sensor5, "type", cJSON_CreateString(type6));
    /* print everything */
    ptr1[counterdo] = cJSON_Print(sensor5);
    printk("Counterdo value is : %d\n", counterdo);
    cJSON_Delete(sensor5);
    counterdo = counterdo + 1;
    if (counterdo == 6) {
        for (y = 0; y <= counterdo; y++) {
            free(ptr1[y]);
        }
        counterdo = 0;
    }
    return;
}
I read some other threads about freeing the memory and tried to do the same. Can anyone let me know if this is the right approach to free the space allocated to the cJSON object?
Regards,
Adeel.
Since cJSON is a portable library with no dependencies, it is better to look for a potential issue in your code on a PC: there are specialized tools available in that environment to facilitate the investigation. I am assuming here that you have a Linux system, a Windows system with WSL or WSL2 installed, or a Linux virtual machine available, with gcc and valgrind installed.
A minimal, self-contained, portable version of your code could be:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cJSON.h>

static int rsrp_current = 1;
static char *time_string = NULL;

void
my_work_handler_5 ()
{
  char *ptr1[6];
  int y = 0;
  static int counterdo = 0;
  char *desc6 = "RSRP";
  char *id6 = "dBm";
  char *type6 = "RSRP";
  char rsrp_str[100];
  snprintf (rsrp_str, sizeof (rsrp_str), "%d", rsrp_current);
  cJSON *sensor5 = cJSON_CreateObject ();
  cJSON_AddItemToObject (sensor5, "description", cJSON_CreateString (desc6));
  cJSON_AddItemToObject (sensor5, "Time", cJSON_CreateString (time_string));
  cJSON_AddItemToObject (sensor5, "value", cJSON_CreateNumber (rsrp_current));
  cJSON_AddItemToObject (sensor5, "unit", cJSON_CreateString (id6));
  cJSON_AddItemToObject (sensor5, "type", cJSON_CreateString (type6));
  /* print everything */
  ptr1[counterdo] = cJSON_Print (sensor5);
  printf ("Counterdo value is : %d\n", counterdo);
  cJSON_Delete (sensor5);
  counterdo = counterdo + 1;
  if (counterdo == 6)
    {
      for (y = 0; y <= counterdo; y++)
        {
          free (ptr1[y]);
        }
      counterdo = 0;
    }
  return;
}

int
main (int argc, char **argv)
{
  time_t curtime;
  time (&curtime);
  for (int n = 0; n < 3 * 6; n++)
    {
      my_work_handler_5 ();
    }
}
Build procedure:
wget https://github.com/DaveGamble/cJSON/archive/v1.7.14.tar.gz
tar zxf v1.7.14.tar.gz
gcc -g -O0 -IcJSON-1.7.14 -o cjson cjson.c cJSON-1.7.14/cJSON.c
Running valgrind on the program:
valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --verbose ./cjson
...indicates that some memory is being freed that was not previously allocated: Invalid free() / delete / delete[] / realloc():
==6747==
==6747== HEAP SUMMARY:
==6747== in use at exit: 0 bytes in 0 blocks
==6747== total heap usage: 271 allocs, 274 frees, 14,614 bytes allocated
==6747==
==6747== All heap blocks were freed -- no leaks are possible
==6747==
==6747== ERROR SUMMARY: 21 errors from 2 contexts (suppressed: 0 from 0)
==6747==
==6747== 3 errors in context 1 of 2:
==6747== Invalid free() / delete / delete[] / realloc()
==6747== at 0x483CA3F: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==6747== by 0x1094DA: my_work_handler_5 (cjson.c:42)
==6747== by 0x10955A: main (cjson.c:59)
==6747== Address 0x31 is not stack'd, malloc'd or (recently) free'd
==6747==
==6747==
==6747== 18 errors in context 2 of 2:
==6747== Conditional jump or move depends on uninitialised value(s)
==6747== at 0x483C9F5: free (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==6747== by 0x1094DA: my_work_handler_5 (cjson.c:42)
==6747== by 0x10955A: main (cjson.c:59)
==6747== Uninitialised value was created by a stack allocation
==6747== at 0x109312: my_work_handler_5 (cjson.c:11)
==6747==
==6747== ERROR SUMMARY: 21 errors from 2 contexts (suppressed: 0 from 0)
Replacing:
  for (y = 0; y <= counterdo; y++)
    {
      free (ptr1[y]);
    }
by:
  for (y = 0; y < counterdo; y++)
    {
      free (ptr1[y]);
    }
and executing valgrind again:
==6834==
==6834== HEAP SUMMARY:
==6834== in use at exit: 1,095 bytes in 15 blocks
==6834== total heap usage: 271 allocs, 256 frees, 14,614 bytes allocated
==6834==
==6834== Searching for pointers to 15 not-freed blocks
==6834== Checked 75,000 bytes
==6834==
==6834== 1,095 bytes in 15 blocks are definitely lost in loss record 1 of 1
==6834== at 0x483DFAF: realloc (in /usr/lib/x86_64-linux-gnu/valgrind/vgpreload_memcheck-amd64-linux.so)
==6834== by 0x10B161: print (cJSON.c:1209)
==6834== by 0x10B25F: cJSON_Print (cJSON.c:1248)
==6834== by 0x1094AB: my_work_handler_5 (cjson.c:30)
==6834== by 0x10959C: main (cjson.c:59)
==6834==
==6834== LEAK SUMMARY:
==6834== definitely lost: 1,095 bytes in 15 blocks
==6834== indirectly lost: 0 bytes in 0 blocks
==6834== possibly lost: 0 bytes in 0 blocks
==6834== still reachable: 0 bytes in 0 blocks
==6834== suppressed: 0 bytes in 0 blocks
==6834==
==6834== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
Some memory is definitely being leaked.
The reason is that char *ptr1[6] is not static, and is therefore created on the stack every time my_work_handler_5() is called. The pointers returned by cJSON_Print() are therefore lost between two calls, and free() is called on arbitrary pointer values, since ptr1[] is not initialized, as it could have been with:
char *ptr1[6] = { NULL, NULL, NULL, NULL, NULL, NULL };
Since you are only freeing memory every 6 calls, this causes the memory leak you were suspecting.
Replacing:
char *ptr1[6];
by:
static char *ptr1[6];
compiling, and running valgrind again:
==6927==
==6927== HEAP SUMMARY:
==6927== in use at exit: 0 bytes in 0 blocks
==6927== total heap usage: 271 allocs, 271 frees, 14,614 bytes allocated
==6927==
==6927== All heap blocks were freed -- no leaks are possible
==6927==
==6927== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
The modified version of the program should now work on your bare-metal system.
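For reference, here is a minimal sketch of the corrected handler for your original Zephyr code, combining both fixes (static storage for ptr1[] and a strict '<' bound in the free loop):
void my_work_handler_5(struct k_work *work)
{
    static char *ptr1[6];   /* static: the pointers now survive between calls */
    static int counterdo = 0;

    sensor5 = cJSON_CreateObject();
    cJSON_AddItemToObject(sensor5, "description", cJSON_CreateString("RSRP"));
    cJSON_AddItemToObject(sensor5, "Time", cJSON_CreateString(time_string));
    cJSON_AddItemToObject(sensor5, "value", cJSON_CreateNumber(rsrp_current));
    cJSON_AddItemToObject(sensor5, "unit", cJSON_CreateString("dBm"));
    cJSON_AddItemToObject(sensor5, "type", cJSON_CreateString("RSRP"));
    ptr1[counterdo] = cJSON_Print(sensor5);
    cJSON_Delete(sensor5);
    counterdo++;
    if (counterdo == 6) {
        for (int y = 0; y < counterdo; y++) {  /* '<', not '<=' */
            free(ptr1[y]);
        }
        counterdo = 0;
    }
}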

Decode UDP message with LUA

I'm relatively new to lua and programming in general (self taught), so please be gentle!
Anyway, I wrote a Lua script to read a UDP message from a game. The structure of the message is:
DATAxXXXXaaaaBBBBccccDDDDeeeeFFFFggggHHHH
DATAx = 4-letter ID, where x is a control character
XXXX = integer identifying the group of the data (the groups are known)
aaaa...HHHH = 8 single-precision floating-point numbers
Those last eight floats are the numbers I need to decode.
If I print the message as received, it's something like:
DATA*{V???A?A?...etc.
Using string.byte(), I'm getting a stream of bytes like this (I have "formatted" the bytes to reflect the structure above):
68 65 84 65/42/20 0 0 0/237 222 28 66/189 59 182 65/107 42 41 65/33 173 79 63/0 0 128 63/146 41 41 65/0 0 30 66/0 0 184 65
The first 5 bytes are of course the DATA*. The next 4 are the 20th group of data. The remaining bytes are the ones I need to decode; they are equal to these values:
237 222 28 66 = 39.218
189 59 182 65 = 22.779
107 42 41 65 = 10.573
33 173 79 63 = 0.8114
0 0 128 63 = 1.0000
146 41 41 65 = 10.573
0 0 30 66 = 39.500
0 0 184 65 = 23.000
I've found C# code that does the decoding with BitConverter.ToSingle(), but I haven't found anything like it for Lua.
Any idea?
What Lua version do you have?
This code works in Lua 5.3:
local str = "DATA*\20\0\0\0\237\222\28\66\189\59\182\65..."
-- Read two float values starting from position 10 in the string
print(string.unpack("<ff", str, 10)) --> 39.217700958252 22.779169082642 18
-- 18 (third returned value) is the next position in the string
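To decode all eight floats from the question in one call, a sketch (this assumes str holds the full 41-byte message, not the truncated literal above):
local vals = { string.unpack("<ffffffff", str, 10) }
-- vals[1] through vals[8] are the eight floats; vals[9] (= 42) is the next position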
For Lua 5.1 you have to write a special function (or take one from François Perrad's git repo):
local function binary_to_float(str, pos)
    local b1, b2, b3, b4 = str:byte(pos, pos+3)  -- little-endian: b4 is the high byte
    local sign = b4 > 0x7F and -1 or 1
    local expo = (b4 % 0x80) * 2 + math.floor(b3 / 0x80)
    local mant = ((b3 % 0x80) * 0x100 + b2) * 0x100 + b1
    local n
    if mant + expo == 0 then
        n = sign * 0.0                     -- signed zero
    elseif expo == 0xFF then
        n = (mant == 0 and sign or 0) / 0  -- +/-inf, or nan if the mantissa is nonzero
    else
        n = sign * (1 + mant / 0x800000) * 2.0^(expo - 0x7F)
    end
    return n
end
local str = "DATA*\20\0\0\0\237\222\28\66\189\59\182\65..."
print(binary_to_float(str, 10)) --> 39.217700958252
print(binary_to_float(str, 14)) --> 22.779169082642
It's the little-endian byte order of IEEE-754 single-precision binary values:
E.g., 0 0 128 63 is:
00111111 10000000 00000000 00000000
(63) (128) (0) (0)
Understanding why that equals 1 requires the very basics of IEEE-754 representation, namely its use of an exponent and a mantissa. See here to start.
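As a quick worked check with the values from the question: the bytes 0 0 128 63 read little-endian form the word 0x3F800000, i.e. sign 0, exponent 0x7F = 127, mantissa 0, so the value is (+1) * (1 + 0/2^23) * 2^(127-127) = 1.0. Using binary_to_float() from the answer above (hypothetical usage, not part of either original answer):
print(binary_to_float("\0\0\128\63", 1)) --> 1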
See Egor's answer above for how to use string.unpack() in Lua 5.3, and for one possible implementation you could use in earlier versions.

iOS Neon assembler sample questions

I'm trying http://api.madewithmarmalade.com/ExampleArmASM.html on iOS; the program runs if I comment out the loop, and res is printed as 28. But if I don't comment the loop out, it aborts without printing res.
Any hint why, and how to fix it?
Thanks in advance.
My code is as follows:
#include <stdio.h>
#include <stdlib.h>

#define ARRAY_SIZE 512

#if defined __arm__ && defined __ARM_NEON__
static int computeSumNeon(const int a[])
{
    // Computes the sum of all elements in the input array
    int res = 0;
    asm(".align 4 \n\t" // dennis warning avoiding
        "vmov.i32 q8, #0 \n\t" // clear our accumulator register
        "mov r3, #512 \n\t" // loop condition n = ARRAY_SIZE
        // ".loop1: \n\t" // no loop: adding elements 0-7 works, res = 28
        "vld1.32 {d0, d1, d2, d3}, [%[input]]! \n\t" // load 8 elements into d0-d3 = q0, q1
        "pld [%[input]] \n\t" // preload next set of elements
        "vadd.i32 q8, q0, q8 \n\t" // q8 += q0
        "vadd.i32 q8, q1, q8 \n\t" // q8 += q1
        "subs r3, r3, #8 \n\t" // n -= 8
        // "bne .loop1 \n\t" // n == 0?
        "vpadd.i32 d0, d16, d17 \n\t" // d0[0] = d16[0] + d16[1], d0[1] = d17[0] + d17[1]
        "vpaddl.u32 d0, d0 \n\t" // d0[0] = d0[0] + d0[1]
        "vmov.32 %[result], d0[0] \n\t"
        : [result] "=r" (res), [input] "+r" (a)
        :
        : "q0", "q1", "q8", "r3");
    return res;
}
#else
static int computeSumNeon(const int a[])
{
    int i, res = 0;
    for (i = 0; i < ARRAY_SIZE; i++)
        res += a[i];
    return res;
}
#endif
...
@implementation AppDelegate
- (BOOL)application:(UIApplication *)application didFinishLaunchingWithOptions:(NSDictionary *)launchOptions {
    // Override point for customization after application launch.
    //int* inp;
    int inp[ARRAY_SIZE];
    //posix_memalign((void**)&inp, 64, ARRAY_SIZE*sizeof(int)); // Align to cache line size (64 bytes on a Cortex-A8)
    // Initialise the array with consecutive integers.
    int i;
    for (i = 0; i < ARRAY_SIZE; i++)
    {
        inp[i] = i;
    }
    for (i = 0; i < ARRAY_SIZE; i++)
    {
        printf("%i,", inp[i]);
    }
    printf("\n\n sum 0-7:%i\n", 0+1+2+3+4+5+6+7);
    int res = 0;
    res = computeSumNeon(inp);
    printf("res NEO :%i\n", res);
    // free(inp); // error: pointer being freed was not allocated !!!
    UISplitViewController *splitViewController = (UISplitViewController *)self.window.rootViewController;
    UINavigationController *navigationController = [splitViewController.viewControllers lastObject];
    navigationController.topViewController.navigationItem.leftBarButtonItem = splitViewController.displayModeButtonItem;
    splitViewController.delegate = self;
    return YES;
}
- (void)applicationWillResignActive:(UIApplication *)application {
- (void)applicationWillResignActive:(UIApplication *)application {
...
==== Assembly code generated:
.align 1
.code 16 # #computeSumNeon
.thumb_func _computeSumNeon
_computeSumNeon:
Lfunc_begin3:
.loc 18 133 0 is_stmt 1 # ...
.cfi_startproc
# BB#0:
sub sp, #8
movs r1, #0
str r0, [sp, #4]
.loc 18 135 9 prologue_end # ...
Ltmp18:
str r1, [sp]
.loc 18 136 5 # ...
ldr r0, [sp, #4]
# InlineAsm Start
.align 4
vmov.i32 q8, #0x0
movw r3, #504
.loop1:
vld1.32 {d0, d1, d2, d3}, [r0]!
vadd.i32 q8, q0, q8
vadd.i32 q8, q1, q8
subs r3, #8
bne .loop1
vpadd.i32 d0, d16, d17
vpaddl.u32 d0, d0
vmov.32 r1, d0[0]
# InlineAsm End
str r1, [sp]
str r0, [sp, #4]
.loc 18 155 12 # ...
ldr r0, [sp]
.loc 18 155 5 is_stmt 0 # ...
add sp, #8
bx lr
Ltmp19:
Lfunc_end3:
.cfi_endproc

Golang append memory allocation VS. STL push_back memory allocation

I compared Go's append function with the STL's vector::push_back and found different memory allocation strategies, which confused me. The code is as follows:
// CPP STL code
void getAlloc() {
    vector<double> arr;
    int s = 9999999;
    int precap = arr.capacity();
    for (int i = 0; i < s; i++) {
        if (precap < i) {
            arr.push_back(rand() % 12580 * 1.0);
            precap = arr.capacity();
            printf("%d %p\n", precap, &arr[0]);
        } else {
            arr.push_back(rand() % 12580 * 1.0);
        }
    }
    printf("\n");
    return;
}
// Golang code
func getAlloc() {
    arr := []float64{}
    size := 9999999
    pre := cap(arr)
    for i := 0; i < size; i++ {
        if pre < i {
            arr = append(arr, rand.NormFloat64())
            pre = cap(arr)
            log.Printf("%d %p\n", pre, &arr)
        } else {
            arr = append(arr, rand.NormFloat64())
        }
    }
    return
}
But in Go the printed memory address stays the same as the slice keeps growing, which really confused me.
By the way, the memory allocation strategy, i.e. how much the capacity expands each time, is different in these two implementations (STL vs. Go). Is there any advantage or disadvantage to either? Here is the simplified output of the code above [capacity and first element address]:
Golang CPP STL
2 0xc0800386c0 2 004B19C0
4 0xc0800386c0 4 004AE9B8
8 0xc0800386c0 6 004B29E0
16 0xc0800386c0 9 004B2A18
32 0xc0800386c0 13 004B2A68
64 0xc0800386c0 19 004B2AD8
128 0xc0800386c0 28 004B29E0
256 0xc0800386c0 42 004B2AC8
512 0xc0800386c0 63 004B2C20
1024 0xc0800386c0 94 004B2E20
1280 0xc0800386c0 141 004B3118
1600 0xc0800386c0 211 004B29E0
2000 0xc0800386c0 316 004B3080
2500 0xc0800386c0 474 004B3A68
3125 0xc0800386c0 711 004B5FD0
3906 0xc0800386c0 1066 004B7610
4882 0xc0800386c0 1599 004B9768
6102 0xc0800386c0 2398 004BC968
7627 0xc0800386c0 3597 004C1460
9533 0xc0800386c0 5395 004B5FD0
11916 0xc0800386c0 8092 004C0870
14895 0xc0800386c0 12138 004D0558
18618 0xc0800386c0 18207 004E80B0
23272 0xc0800386c0 27310 0050B9B0
29090 0xc0800386c0 40965 004B5FD0
36362 0xc0800386c0 61447 00590048
45452 0xc0800386c0 92170 003B0020
56815 0xc0800386c0 138255 00690020
71018 0xc0800386c0 207382 007A0020
....
UPDATE:
See comments for Golang memory allocation strategy.
For STL, the strategy depends on the implementation. See this post for further information.
Your Go and C++ code fragments are not equivalent. In the C++ function, you are printing the address of the first element in the vector, while in the Go example you are printing the address of the slice itself.
Like a C++ std::vector, a Go slice is a small data type that holds a pointer to an underlying array that holds the data. That data structure has the same address throughout the function. If you want the address of the first element in the slice, you can use the same syntax as in C++: &arr[0].
You're getting the pointer to the slice header, not the actual backing array. You can think of the slice header as a struct like:
type SliceHeader struct {
    len, cap     int
    backingArray unsafe.Pointer
}
When you append and the backing array is reallocated, the pointer backingArray will likely be changed (not necessarily, but probably). However, the location of the struct holding the length, cap, and pointer to the backing array doesn't change -- it's still on the stack right where you declared it. Try printing &arr[0] instead of &arr and you should see behavior closer to what you expect.
This is pretty much the same behavior as std::vector, incidentally. Think of a slice as closer to a vector than a magic dynamic array.
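To see the difference directly, here is a minimal runnable sketch (not from the original post) printing both addresses as the slice grows:
package main

import "fmt"

func main() {
    arr := []float64{}
    for i := 0; i < 5; i++ {
        arr = append(arr, float64(i))
        // &arr is the address of the slice header: constant for the whole function.
        // &arr[0] is the address of the backing array: it changes when append reallocates.
        fmt.Printf("cap=%d  &arr=%p  &arr[0]=%p\n", cap(arr), &arr, &arr[0])
    }
}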

How CUDA constant memory allocation works?

I'd like to get some insight into how constant memory is allocated (using CUDA 4.2). I know that the total available constant memory is 64 KB, but when is this memory actually allocated on the device? Does this limit apply to each kernel, each CUDA context, or the whole application?
Let's say there are several kernels in a .cu file, each using less than 64 KB of constant memory, but the total constant memory usage is more than 64 KB. Is it possible to call these kernels sequentially? What happens if they are called concurrently using different streams?
What happens if there is a large CUDA dynamic library with lots of kernels each using different amounts of constant memory?
What happens if there are two applications, each requiring more than half of the available constant memory? The first application runs fine, but when will the second app fail? At app start, at the cudaMemcpyToSymbol() calls, or at kernel execution?
Parallel Thread Execution ISA Version 3.1 section 5.1.3 discusses constant banks.
Constant memory is restricted in size, currently limited to 64KB which
can be used to hold statically-sized constant variables. There is an
additional 640KB of constant memory, organized as ten independent 64KB
regions. The driver may allocate and initialize constant buffers in
these regions and pass pointers to the buffers as kernel function
parameters. Since the ten regions are not contiguous, the driver
must ensure that constant buffers are allocated so that each buffer
fits entirely within a 64KB region and does not span a region
boundary.
A simple program can be used to illustrate the use of constant memory.
__constant__ int    kd_p1;
__constant__ short  kd_p2;
__constant__ char   kd_p3;
__constant__ double kd_p4;
__constant__ float  kd_floats[8];

__global__ void parameters(int p1, short p2, char p3, double p4, int* pp1, short* pp2, char* pp3, double* pp4)
{
    *pp1 = p1;
    *pp2 = p2;
    *pp3 = p3;
    *pp4 = p4;
    return;
}

__global__ void constants(int* pp1, short* pp2, char* pp3, double* pp4)
{
    *pp1 = kd_p1;
    *pp2 = kd_p2;
    *pp3 = kd_p3;
    *pp4 = kd_p4;
    return;
}
Compile this for compute_30, sm_30 and run cuobjdump -sass <executable or obj> to disassemble; you should see:
Fatbin elf code:
================
arch = sm_30
code version = [1,6]
producer = cuda
host = windows
compile_size = 32bit
identifier = c:/dev/constant_banks/kernel.cu
code for sm_30
Function : _Z10parametersiscdPiPsPcPd
/*0008*/ /*0x10005de428004001*/ MOV R1, c [0x0] [0x44]; // stack pointer
/*0010*/ /*0x40001de428004005*/ MOV R0, c [0x0] [0x150]; // pp1
/*0018*/ /*0x50009de428004005*/ MOV R2, c [0x0] [0x154]; // pp2
/*0020*/ /*0x0001dde428004005*/ MOV R7, c [0x0] [0x140]; // p1
/*0028*/ /*0x13f0dc4614000005*/ LDC.U16 R3, c [0x0] [0x144]; // p2
/*0030*/ /*0x60011de428004005*/ MOV R4, c [0x0] [0x158]; // pp3
/*0038*/ /*0x70019de428004005*/ MOV R6, c [0x0] [0x15c]; // pp4
/*0048*/ /*0x20021de428004005*/ MOV R8, c [0x0] [0x148]; // p4
/*0050*/ /*0x30025de428004005*/ MOV R9, c [0x0] [0x14c]; // p4
/*0058*/ /*0x1bf15c0614000005*/ LDC.U8 R5, c [0x0] [0x146]; // p3
/*0060*/ /*0x0001dc8590000000*/ ST [R0], R7; // *pp1 = p1
/*0068*/ /*0x0020dc4590000000*/ ST.U16 [R2], R3; // *pp2 = p2
/*0070*/ /*0x00415c0590000000*/ ST.U8 [R4], R5; // *pp3 = p3
/*0078*/ /*0x00621ca590000000*/ ST.64 [R6], R8; // *pp4 = p4
/*0088*/ /*0x00001de780000000*/ EXIT;
/*0090*/ /*0xe0001de74003ffff*/ BRA 0x90;
/*0098*/ /*0x00001de440000000*/ NOP CC.T;
/*00a0*/ /*0x00001de440000000*/ NOP CC.T;
/*00a8*/ /*0x00001de440000000*/ NOP CC.T;
/*00b0*/ /*0x00001de440000000*/ NOP CC.T;
/*00b8*/ /*0x00001de440000000*/ NOP CC.T;
...........................................
Function : _Z9constantsPiPsPcPd
/*0008*/ /*0x10005de428004001*/ MOV R1, c [0x0] [0x44]; // stack pointer
/*0010*/ /*0x00001de428004005*/ MOV R0, c [0x0] [0x140]; // pp1
/*0018*/ /*0x10009de428004005*/ MOV R2, c [0x0] [0x144]; // pp2
/*0020*/ /*0x0001dde428004c00*/ MOV R7, c [0x3] [0x0]; // kd_p1
/*0028*/ /*0x13f0dc4614000c00*/ LDC.U16 R3, c [0x3] [0x4]; // kd_p2
/*0030*/ /*0x20011de428004005*/ MOV R4, c [0x0] [0x148]; // pp3
/*0038*/ /*0x30019de428004005*/ MOV R6, c [0x0] [0x14c]; // pp4
/*0048*/ /*0x20021de428004c00*/ MOV R8, c [0x3] [0x8]; // kd_p4
/*0050*/ /*0x30025de428004c00*/ MOV R9, c [0x3] [0xc]; // kd_p4
/*0058*/ /*0x1bf15c0614000c00*/ LDC.U8 R5, c [0x3] [0x6]; // kd_p3
/*0060*/ /*0x0001dc8590000000*/ ST [R0], R7;
/*0068*/ /*0x0020dc4590000000*/ ST.U16 [R2], R3;
/*0070*/ /*0x00415c0590000000*/ ST.U8 [R4], R5;
/*0078*/ /*0x00621ca590000000*/ ST.64 [R6], R8;
/*0088*/ /*0x00001de780000000*/ EXIT;
/*0090*/ /*0xe0001de74003ffff*/ BRA 0x90;
/*0098*/ /*0x00001de440000000*/ NOP CC.T;
/*00a0*/ /*0x00001de440000000*/ NOP CC.T;
/*00a8*/ /*0x00001de440000000*/ NOP CC.T;
/*00b0*/ /*0x00001de440000000*/ NOP CC.T;
/*00b8*/ /*0x00001de440000000*/ NOP CC.T;
.....................................
I have annotated the SASS on the right.
On sm_30 you can see that kernel parameters are passed in constant bank 0 starting at offset 0x140.
User defined __constant__ variables are defined in constant bank 3.
If you execute cuobjdump --dump-elf <executable or obj> you can find other interesting constant information.
32bit elf: abi=6, sm=30, flags = 0x1e011e
Sections:
Index Offset Size ES Align Type Flags Link Info Name
1 34 142 0 1 STRTAB 0 0 0 .shstrtab
2 176 19b 0 1 STRTAB 0 0 0 .strtab
3 314 d0 10 4 SYMTAB 0 2 a .symtab
4 3e4 50 0 4 CUDA_INFO 0 3 b .nv.info._Z9constantsPiPsPcPd
5 434 30 0 4 CUDA_INFO 0 3 0 .nv.info
6 464 90 0 4 CUDA_INFO 0 3 a .nv.info._Z10parametersiscdPiPsPcPd
7 4f4 160 0 4 PROGBITS 2 0 a .nv.constant0._Z10parametersiscdPiPsPcPd
8 654 150 0 4 PROGBITS 2 0 b .nv.constant0._Z9constantsPiPsPcPd
9 7a8 30 0 8 PROGBITS 2 0 0 .nv.constant3
a 7d8 c0 0 4 PROGBITS 6 3 a00000b .text._Z10parametersiscdPiPsPcPd
b 898 c0 0 4 PROGBITS 6 3 a00000c .text._Z9constantsPiPsPcPd
.section .strtab
.section .shstrtab
.section .symtab
index value size info other shndx name
0 0 0 0 0 0 (null)
1 0 0 3 0 a .text._Z10parametersiscdPiPsPcPd
2 0 0 3 0 7 .nv.constant0._Z10parametersiscdPiPsPcPd
3 0 0 3 0 b .text._Z9constantsPiPsPcPd
4 0 0 3 0 8 .nv.constant0._Z9constantsPiPsPcPd
5 0 0 3 0 9 .nv.constant3
6 0 4 1 0 9 kd_p1
7 4 2 1 0 9 kd_p2
8 6 1 1 0 9 kd_p3
9 8 8 1 0 9 kd_p4
10 16 32 1 0 9 kd_floats
11 0 192 12 10 a _Z10parametersiscdPiPsPcPd
12 0 192 12 10 b _Z9constantsPiPsPcPd
The kernel parameter constant bank is versioned per launch so that concurrent kernels can be executed. The compiler-generated and user-defined constants are per CUmodule. It is the responsibility of the developer to manage the coherency of this data. For example, the developer has to ensure that a cudaMemcpyToSymbol is performed in a safe manner.
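As an illustration, here is a minimal host-side sketch (assumed usage, not from the original answer; dp1..dp4 are hypothetical device pointers previously allocated with cudaMalloc) of initializing the __constant__ variables above before a launch:
int    h_p1 = 42;
double h_p4 = 3.14;
cudaMemcpyToSymbol(kd_p1, &h_p1, sizeof(h_p1)); // writes kd_p1 in constant bank 3
cudaMemcpyToSymbol(kd_p4, &h_p4, sizeof(h_p4)); // writes kd_p4 in constant bank 3
constants<<<1, 1>>>(dp1, dp2, dp3, dp4);        // kernel reads kd_* from bank 3
cudaDeviceSynchronize();                        // don't overwrite kd_* while a kernel may still read it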
