Calling InterlockedAdd on RWByteAddressBuffer multiple times gives unexpected results (on NVidia)

I wanted to move away from using a counter buffer for some compute shader routines, and ran into unexpected behaviour on NVidia cards.
I made a heavily simplified example (it makes no practical sense on its own, but it is the smallest one that reproduces the issue I encounter).
I want to perform conditional writes to several locations in a buffer (to simplify further, I only run a single thread, since the behaviour can be reproduced that way too).
I write 4 uints, then 2 uint3s (using InterlockedAdd to "simulate" conditional writes).
So I use a single buffer (with raw access on the UAV), with the following simple layout:
0 -> First counter
4 -> Second counter
8 to 24 -> First 4 uints to write
24 to 48 -> Pair of uint3 to write
I also clear the buffer every frame (0 for each counter, and arbitrary value for the rest, 12345 in this case).
I copy the buffer to a staging resource in order to check the values, so yes, my pipeline binding is correct, but I can post the code if asked.
Now when I call the compute shader, performing only the 4 increments, as here:
RWByteAddressBuffer RWByteBuffer : BACKBUFFER;
#define COUNTER0_LOCATION 0
#define COUNTER1_LOCATION 4
#define PASS1_LOCATION 8
#define PASS2_LOCATION 24
[numthreads(1,1,1)]
void CS(uint3 tid : SV_DispatchThreadID)
{
uint i0,i1,i2,i3;
RWByteBuffer.InterlockedAdd(COUNTER0_LOCATION, 1, i0);
RWByteBuffer.Store(PASS1_LOCATION + i0 * 4, 10);
RWByteBuffer.InterlockedAdd(COUNTER0_LOCATION, 1, i1);
RWByteBuffer.Store(PASS1_LOCATION + i1 * 4, 20);
RWByteBuffer.InterlockedAdd(COUNTER0_LOCATION, 1, i2);
RWByteBuffer.Store(PASS1_LOCATION + i2 * 4, 30);
RWByteBuffer.InterlockedAdd(COUNTER0_LOCATION, 1, i3);
RWByteBuffer.Store(PASS1_LOCATION + i3 * 4, 40);
}
I then obtain the following results (formatted a little):
4,0,
10,20,30,40,
12345,12345,12345,12345,12345,12345,12345,12345,12345
Which is correct: the counter is 4 since I called InterlockedAdd 4 times (the second counter was not touched), I get 10 to 40 in the right locations, and the rest has the default values.
Now if I want to reuse those indices in order to write them to another location:
[numthreads(1,1,1)]
void CS(uint3 tid : SV_DispatchThreadID)
{
uint i0,i1,i2,i3;
RWByteBuffer.InterlockedAdd(COUNTER0_LOCATION, 1, i0);
RWByteBuffer.Store(PASS1_LOCATION + i0 * 4, 10);
RWByteBuffer.InterlockedAdd(COUNTER0_LOCATION, 1, i1);
RWByteBuffer.Store(PASS1_LOCATION + i1 * 4, 20);
RWByteBuffer.InterlockedAdd(COUNTER0_LOCATION, 1, i2);
RWByteBuffer.Store(PASS1_LOCATION + i2 * 4, 30);
RWByteBuffer.InterlockedAdd(COUNTER0_LOCATION, 1, i3);
RWByteBuffer.Store(PASS1_LOCATION + i3 * 4, 40);
uint3 inds = uint3(i0, i1, i2);
uint3 inds2 = uint3(i1,i2,i3);
uint writeIndex;
RWByteBuffer.InterlockedAdd(COUNTER1_LOCATION, 1, writeIndex);
RWByteBuffer.Store3(PASS2_LOCATION + writeIndex * 12, inds);
RWByteBuffer.InterlockedAdd(COUNTER1_LOCATION, 1, writeIndex);
RWByteBuffer.Store3(PASS2_LOCATION + writeIndex * 12, inds2);
}
Now if I run that code on an Intel card (tried HD4000 and HD4600) or on an ATI 290, I get the expected results, e.g.:
4,2,
10,20,30,40,
0,1,2,1,2,3
But running it on NVidia (tried 970m, gtx1080, gtx570), I get the following:
4,2,
40,12345,12345,12345,
0,0,0,0,0,0
So it seems InterlockedAdd suddenly returns 0 as its original value (it still increments properly, as the counter is 4, but every store lands in the first slot, so we end up with 40, the last value written).
We can also see that only 0 was written for i1, i2 and i3.
If I "reserve memory" instead, i.e. call InterlockedAdd only once per counter (incrementing by 4 and 2, respectively):
[numthreads(1,1,1)]
void CSB(uint3 tid : SV_DispatchThreadID)
{
uint i0;
RWByteBuffer.InterlockedAdd(COUNTER0_LOCATION, 4, i0);
uint i1 = i0 + 1;
uint i2 = i0 + 2;
uint i3 = i0 + 3;
RWByteBuffer.Store(PASS1_LOCATION + i0 * 4, 10);
RWByteBuffer.Store(PASS1_LOCATION + i1 * 4, 20);
RWByteBuffer.Store(PASS1_LOCATION + i2 * 4, 30);
RWByteBuffer.Store(PASS1_LOCATION + i3 * 4, 40);
uint3 inds = uint3(i0, i1, i2);
uint3 inds2 = uint3(i1,i2,i3);
uint writeIndex;
RWByteBuffer.InterlockedAdd(COUNTER1_LOCATION, 2, writeIndex);
uint writeIndex2 = writeIndex + 1;
RWByteBuffer.Store3(PASS2_LOCATION + writeIndex * 12, inds);
RWByteBuffer.Store3(PASS2_LOCATION + writeIndex2 * 12, inds2);
}
Then this works on all cards, but I have some cases where I have to rely on the earlier behaviour.
As a side note, if I use structured buffers with a counter flag on the UAV instead of a location in a byte address buffer, and do:
RWStructuredBuffer<uint> rwCounterBuffer1;
RWStructuredBuffer<uint> rwCounterBuffer2;
RWByteAddressBuffer RWByteBuffer : BACKBUFFER;
#define PASS1_LOCATION 8
#define PASS2_LOCATION 24
[numthreads(1,1,1)]
void CS(uint3 tid : SV_DispatchThreadID)
{
uint i0 = rwCounterBuffer1.IncrementCounter();
uint i1 = rwCounterBuffer1.IncrementCounter();
uint i2 = rwCounterBuffer1.IncrementCounter();
uint i3 = rwCounterBuffer1.IncrementCounter();
RWByteBuffer.Store(PASS1_LOCATION + i0 * 4, 10);
RWByteBuffer.Store(PASS1_LOCATION + i1 * 4, 20);
RWByteBuffer.Store(PASS1_LOCATION + i2 * 4, 30);
RWByteBuffer.Store(PASS1_LOCATION + i3 * 4, 40);
uint3 inds = uint3(i0, i1, i2);
uint3 inds2 = uint3(i1,i2,i3);
uint writeIndex1= rwCounterBuffer2.IncrementCounter();
uint writeIndex2= rwCounterBuffer2.IncrementCounter();
RWByteBuffer.Store3(PASS2_LOCATION + writeIndex1* 12, inds);
RWByteBuffer.Store3(PASS2_LOCATION + writeIndex2* 12, inds2);
}
This works correctly across all cards, but has all sorts of other issues (which are off topic for this question).
This is running on DirectX 11 (I did not try it on DX12; that is not relevant to my use case, apart from plain curiosity).
So is it a bug on NVidia?
Or is there something wrong with the first approach?
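For reference, the behaviour the first approach relies on can be modelled on the CPU. The block below is a minimal Python sketch, not HLSL and not a description of what any driver does. It assumes only that InterlockedAdd hands back the value stored before the addition; the helper names (interlocked_add, store, run_shader) and the 48-byte buffer size are made up for illustration. With a single thread, this model gives the same contents as the Intel and ATI runs above.

```python
import struct

# Byte offsets taken from the layout in the question.
COUNTER0, COUNTER1, PASS1, PASS2 = 0, 4, 8, 24


def interlocked_add(buf, offset, value):
    """Add value to the uint32 at offset and return the previous value."""
    (old,) = struct.unpack_from("<I", buf, offset)
    struct.pack_into("<I", buf, offset, (old + value) & 0xFFFFFFFF)
    return old


def store(buf, offset, *values):
    """Write one or more uint32 values starting at offset (Store / Store3)."""
    struct.pack_into("<%dI" % len(values), buf, offset, *values)


def run_shader():
    # Cleared buffer: both counters at 0, the other ten uints at 12345.
    buf = bytearray(struct.pack("<12I", 0, 0, *([12345] * 10)))

    indices = []
    for value in (10, 20, 30, 40):
        index = interlocked_add(buf, COUNTER0, 1)
        store(buf, PASS1 + index * 4, value)
        indices.append(index)

    i0, i1, i2, i3 = indices
    for triple in ((i0, i1, i2), (i1, i2, i3)):
        write_index = interlocked_add(buf, COUNTER1, 1)
        store(buf, PASS2 + write_index * 12, *triple)

    return list(struct.unpack("<12I", buf))


print(run_shader())  # [4, 2, 10, 20, 30, 40, 0, 1, 2, 1, 2, 3]
```

The model says nothing about why the NVidia result differs; it only pins down what a single thread should observe if each call returns the pre-increment value.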

Related

How would I be able to make a register-based virtual machine code off of a Binary Tree for math interpretation

My code is written in Dart, but the question is more general: it is about the Binary Tree data structure and Register-based VM implementation. I have commented the code so that you can follow it even if you do not know Dart.
So, here are my nodes:
enum NodeType {
numberNode,
addNode,
subtractNode,
multiplyNode,
divideNode,
plusNode,
minusNode,
}
NumberNode has a number value in it.
AddNode, SubtractNode, MultiplyNode and DivideNode are really just Binary Op nodes.
PlusNode and MinusNode are Unary Operator nodes.
The tree is generated based on Order of Operations: unary operations first, then multiplication and division, and then addition and subtraction. E.g. "1 + 2 * -3" becomes "(1 + (2 * (-3)))".
Here is my code that tries to walk over the AST:
/// Converts tree to Register-based VM code
List<Opcode> convertNodeToCode(Node node) {
List<Opcode> result = [const Opcode(OpcodeKind.loadn, 2, -1)];
bool counterHasBeenZero = false;
bool binOpDebounce = false;
int counter = 0;
List<Opcode> convert(Node node) {
switch (node.nodeType) {
case NodeType.numberNode:
counter = counter == 0 ? 1 : 0;
if (counter == 0 && !counterHasBeenZero) {
counterHasBeenZero = true;
} else {
counter = 1;
}
return [Opcode(OpcodeKind.loadn, counter, (node as NumberNode).value)];
case NodeType.addNode:
var aNode = node as AddNode;
return convert(aNode.nodeA) +
convert(aNode.nodeB) +
[
const Opcode(
OpcodeKind.addn,
0,
1,
)
];
case NodeType.subtractNode:
var sNode = node as SubtractNode;
var result = convert(sNode.nodeA) +
convert(sNode.nodeB) +
(binOpDebounce
? [
const Opcode(
OpcodeKind.subn,
0,
0,
1,
)
]
: [
const Opcode(
OpcodeKind.subn,
0,
1,
)
]);
if (!binOpDebounce) binOpDebounce = true;
return result;
case NodeType.multiplyNode:
var mNode = node as MultiplyNode;
var result = convert(mNode.nodeA) +
convert(mNode.nodeB) +
(binOpDebounce
? [
const Opcode(
OpcodeKind.muln,
0,
0,
1,
)
]
: [
const Opcode(
OpcodeKind.muln,
0,
1,
)
]);
if (!binOpDebounce) binOpDebounce = true;
return result;
case NodeType.divideNode:
var dNode = node as DivideNode;
var result = convert(dNode.nodeA) +
convert(dNode.nodeB) +
(binOpDebounce
? [
const Opcode(
OpcodeKind.divn,
0,
0,
1,
)
]
: [
const Opcode(
OpcodeKind.divn,
0,
1,
)
]);
if (!binOpDebounce) binOpDebounce = true;
return result;
case NodeType.plusNode:
return convert((node as PlusNode).node);
case NodeType.minusNode:
return convert((node as MinusNode).node) +
[Opcode(OpcodeKind.muln, 1, 2)];
default:
throw Exception('Non-existent node type');
}
}
return result + convert(node) + [const Opcode(OpcodeKind.exit)];
}
I tried to use just 2-3 registers, with a counter to track which register each number was loaded into, but the code gets ugly really quickly, and once Order of Operations is involved it becomes really hard to track where the numbers are with the counter.
Basically, my approach is to store each number in register 1 or 0, load it when needed, and combine the registers so that the result ends up in register 0. For example, 1 + 2 + 3 + 4 becomes [r2 = -1.0, r1 = 1.0, r0 = 2.0, r0 = r1 + r0, r1 = 3.0, r0 = r1 + r0, r1 = 4.0, r0 = r1 + r0, exit].
When I tried this with multiplication, though, it multiplied the wrong numbers, possibly because of the order of operations.
I tried to see if this way could be done as well:
// (1 + (2 * ((-2) + 3) * 5))
const code = [
// (-2)
Opcode(OpcodeKind.loadn, 1, -2), // r1 = -2;
// (2 + 3)
Opcode(OpcodeKind.loadn, 1, 2), // r1 = 2;
Opcode(OpcodeKind.loadn, 2, 3), // r2 = 3;
Opcode(OpcodeKind.addn, 2, 1, 2), // r2 = r1 + r2;
// (2 * (result) * 5)
Opcode(OpcodeKind.loadn, 1, 2), // r1 = 2;
Opcode(OpcodeKind.loadn, 3, 5), // r3 = 5;
Opcode(OpcodeKind.muln, 2, 1, 2), // r2 = r1 * r2;
Opcode(OpcodeKind.muln, 2, 2, 3), // r2 = r2 * r3;
// (1 + (result))
Opcode(OpcodeKind.loadn, 1, 1), // r1 = 1;
Opcode(OpcodeKind.addn, 1, 1, 2), // r1 = r1 + r2;
Opcode(OpcodeKind.exit), // exit halt
];
I knew this method would not work because if I'm going to iterate through the nodes I need to know the position of the numbers and registers beforehand, so I'd have to use another method or way to find the number/register.
You don't need to read all of above; those were just my attempts to try to produce register-based virtual machine code.
I want to see how you guys would do it or how you would make it.
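One common way to avoid tracking registers with a counter is to make the destination register a parameter of the recursive walk: a node compiled "into register d" may use d and anything above it as scratch space, and never touches registers below d. The block below is a minimal sketch of that idea in Python rather than Dart. The tuple-based node shapes and the compile_expr and run helpers are hypothetical; the opcode names (loadn, addn, subn, muln, divn, exit) and the three-operand form (destination, left, right) follow the listings in the question. Negation is done by multiplying by -1, as in the question.

```python
import operator

BINOPS = {"add": "addn", "sub": "subn", "mul": "muln", "div": "divn"}
ARITH = {
    "addn": operator.add,
    "subn": operator.sub,
    "muln": operator.mul,
    "divn": operator.truediv,
}


def compile_expr(node, dest=0):
    """Emit code that leaves the value of node in register dest.

    Registers above dest are free scratch space; registers below dest
    hold live values and are never written.
    """
    kind = node[0]
    if kind == "num":
        return [("loadn", dest, node[1])]
    if kind == "plus":
        return compile_expr(node[1], dest)
    if kind == "minus":
        # Negate by multiplying with -1 held in the next free register.
        return compile_expr(node[1], dest) + [
            ("loadn", dest + 1, -1),
            ("muln", dest, dest, dest + 1),
        ]
    left = compile_expr(node[1], dest)        # result stays in dest
    right = compile_expr(node[2], dest + 1)   # must not clobber dest
    return left + right + [(BINOPS[kind], dest, dest, dest + 1)]


def run(code):
    """Tiny interpreter used only to check the generated code."""
    regs = {}
    for op, *args in code:
        if op == "exit":
            break
        if op == "loadn":
            regs[args[0]] = float(args[1])
        else:
            dst, a, b = args
            regs[dst] = ARITH[op](regs[a], regs[b])
    return regs[0]


# 1 + 2 * -3, parsed as (1 + (2 * (-3)))
tree = ("add", ("num", 1), ("mul", ("num", 2), ("minus", ("num", 3))))
program = compile_expr(tree) + [("exit",)]
for instruction in program:
    print(instruction)
print(run(program))  # -5.0
```

The number of registers used grows with the depth of the tree, which is the price for not having to track anything: the left operand always sits in dest, the right one in dest + 1.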

Parse int and float values from Uint8List Dart

I'm trying to parse int and double values which I receive from a bluetooth device using this lib: https://github.com/Polidea/FlutterBleLib
I receive the following Uint8List data: 31,212,243,57,0,224,7,1,6,5,9,21,0,1,0,0,0,91,228
I found some help here: How do I read a 16-bit int from a Uint8List in Dart?
On Android I have done some similar work, but the library there had a so-called Value Interpreter, to which I only passed the data and received back a float/int.
Example code from Android:
int offset = 0;
final double spOPercentage = ValueInterpreter.getFloatValue(value, FORMAT_SFLOAT, offset);
Where value is a byte array
Another example from the Android code; this code is from the library:
public static Float getFloatValue(@NonNull byte[] value, int formatType, @IntRange(from = 0L) int offset) {
if (offset + getTypeLen(formatType) > value.length) {
return null;
} else {
switch(formatType) {
case 50:
return bytesToFloat(value[offset], value[offset + 1]);
case 52:
return bytesToFloat(value[offset], value[offset + 1], value[offset + 2], value[offset + 3]);
default:
return null;
}
}
}
private static float bytesToFloat(byte b0, byte b1) {
int mantissa = unsignedToSigned(unsignedByteToInt(b0) + ((unsignedByteToInt(b1) & 15) << 8), 12);
int exponent = unsignedToSigned(unsignedByteToInt(b1) >> 4, 4);
return (float)((double)mantissa * Math.pow(10.0D, (double)exponent));
}
private static float bytesToFloat(byte b0, byte b1, byte b2, byte b3) {
int mantissa = unsignedToSigned(unsignedByteToInt(b0) + (unsignedByteToInt(b1) << 8) +
(unsignedByteToInt(b2) << 16), 24);
return (float)((double)mantissa * Math.pow(10.0D, (double)b3));
}
private static int unsignedByteToInt(byte b) {
return b & 255;
}
In flutter/dart I want to write my own value interpreter.
The starting example code is:
int offset = 1;
ByteData bytes = list.buffer.asByteData();
bytes.getUint16(offset);
I don't understand how the data is manipulated here in Dart to get an int value from different positions in the data list. I need some explanation of how to do this; it would be great if anyone could teach me about it.
Having the following:
values [31, 212, 243, 57, 0, 224, 7, 1, 6, 5, 9, 21, 0, 1, 0, 0, 0, 91, 228];
index 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
When you call (assuming values is a Uint8List):
values.buffer.asByteData().getUint16(0);
you interpret [31, 212] as a single unsigned integer two bytes long.
If you want to get a Uint16 from bytes 9 and 10, [5, 9], you'd call:
values.buffer.asByteData().getUint16(9);
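The arithmetic behind those calls can be spelled out as follows. This is a small Python sketch for illustration, not Dart; read_u16 is a made-up helper. It reads two bytes at an offset and combines them either big-endian (the default byte order of Dart's getUint16) or little-endian, which is the order Bluetooth characteristics normally use.

```python
import struct

data = bytes([31, 212, 243, 57, 0, 224, 7, 1, 6, 5, 9, 21, 0, 1, 0, 0, 0, 91, 228])


def read_u16(buf, offset, little_endian=False):
    """Read an unsigned 16-bit integer starting at offset."""
    fmt = "<H" if little_endian else ">H"
    return struct.unpack_from(fmt, buf, offset)[0]


print(read_u16(data, 0))        # 31 * 256 + 212 = 8148
print(read_u16(data, 9))        # 5 * 256 + 9 = 1289
print(read_u16(data, 0, True))  # 31 + 212 * 256 = 54303
```

The offset is a byte position in the list, and the byte order decides which of the two bytes is the high one.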
Regarding your comment (Parse int and float values from Uint8List Dart):
I have this Uint8List and the values are: 31, 212, 243, 57, 0, 224, 7, 1, 6, 5, 9, 21, 0, 1, 0, 0, 0, 91, 228
I use the code below:
ByteData bytes = list.buffer.asByteData();
int offset = 1;
double value = bytes.getFloat32(offset);
and the value that I expected should be something between 50 and 150.
More info on what I am doing can be found here: bluetooth.com/wp-content/uploads/Sitecore-Media-Library/Gatt/… name="SpO2PR-Spot-Check - SpO2"
This property is of type SFLOAT, which according to https://www.bluetooth.com/specifications/assigned-numbers/format-types/ looks like this:
0x16 SFLOAT IEEE-11073 16-bit SFLOAT
As Dart does not seem to have an easy way to get that format, you might have to create a parser yourself using raw bytes.
These might be helpful:
https://stackoverflow.com/a/51391743/6413439
https://stackoverflow.com/a/16474957/6413439
Here is something that I used to convert sfloat to double in dart for our flutter app.
// Requires: import 'dart:math'; (for pow)
double sfloat2double(int ieee11073) {
var reservedValues = {
0x07FE: 'PositiveInfinity',
0x07FF: 'NaN',
0x0800: 'NaN',
0x0801: 'NaN',
0x0802: 'NegativeInfinity'
};
// The low 12 bits hold the mantissa.
var mantissa = ieee11073 & 0x0FFF;
if (reservedValues.containsKey(mantissa)){
return 0.0; // basically error
}
// Sign-extend the 12-bit two's complement mantissa.
if ((mantissa & 0x0800) != 0){
mantissa = -((~mantissa & 0x0FFF) + 1);
}
// The high 4 bits hold the exponent, also in two's complement.
var exponent = (ieee11073 >> 12) & 0x0F;
if ((exponent & 0x8) != 0){
exponent = -((~exponent & 0x0F) + 1);
}
var magnitude = pow(10, exponent);
// pow returns a num, so convert explicitly.
return (mantissa * magnitude).toDouble();
}
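As a cross-check of the decoding steps, here is the same idea as a short Python sketch applied to the packet from the question. It assumes the SFLOAT sits at offset 1 and is stored little-endian, as the Android bytesToFloat(b0, b1) code implies; sfloat_to_float and read_sfloat are made-up names. Like the Dart function above, it tests only the mantissa against the reserved values and returns None for them.

```python
RESERVED = {0x07FE, 0x07FF, 0x0800, 0x0801, 0x0802}


def sfloat_to_float(raw):
    """Decode a 16-bit SFLOAT: 4-bit exponent, 12-bit mantissa, base 10."""
    mantissa = raw & 0x0FFF
    if mantissa in RESERVED:
        return None  # infinity, NaN or reserved
    if mantissa >= 0x0800:   # sign-extend the 12-bit mantissa
        mantissa -= 0x1000
    exponent = (raw >> 12) & 0x0F
    if exponent >= 0x8:      # sign-extend the 4-bit exponent
        exponent -= 0x10
    if exponent >= 0:
        return float(mantissa * 10 ** exponent)
    return mantissa / 10 ** (-exponent)


def read_sfloat(buf, offset):
    """Combine two little-endian bytes and decode them as an SFLOAT."""
    raw = buf[offset] | (buf[offset + 1] << 8)
    return sfloat_to_float(raw)


data = bytes([31, 212, 243, 57, 0, 224, 7, 1, 6, 5, 9, 21, 0, 1, 0, 0, 0, 91, 228])
print(read_sfloat(data, 1))  # 98.0
```

Bytes 212 and 243 combine to 0xF3D4: the mantissa is 0x3D4 = 980 and the exponent nibble 0xF is -1, giving 98.0, which falls in the 50 to 150 range the question expects.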

Sage: Polynomial ring over finite field - inverting a polynomial non-prime

I'm trying to recreate the wiki's example procedure, available here:
https://en.wikipedia.org/wiki/NTRUEncrypt
I've run into an issue while attempting to invert the polynomials.
The SAGE code below seems to be working fine for the given p=3, which is a prime number.
However, the representation of the polynomial in the field generated by q=32 ends up wrong, because it behaves as if the modulus was 2.
Here's the code in play:
F = PolynomialRing(GF(32),'a')
a = F.gen()
Ring = F.quotient(a^11 - 1, 'x')
x = Ring.gen()
pollist = [-1, 1, 1, 0, -1, 0, 1, 0, 0, 1, -1]
fq = Ring(pollist)
print(fq)
print(fq^(-1))
The Ring is described as follows:
Univariate Quotient Polynomial Ring in x over Finite Field in z5 of size 2^5 with modulus a^11 + 1
And the result:
x^10 + x^9 + x^6 + x^4 + x^2 + x + 1
x^5 + x + 1
I've tried to replace the Finite Field with IntegerModRing(32), but the inversion ends up demanding a field, as implied by the message:
NotImplementedError: The base ring (=Ring of integers modulo 32) is not a field
Any suggestions as to how I could obtain the correct inverse of f (mod q) would be greatly appreciated.
GF(32) is the finite field with 32 elements, not the integers modulo 32. You must use Zmod(32) (or IntegerModRing(32), as you suggested) instead.
As you point out, Sage psychotically bans you from computing inverses in ℤ/32ℤ[a]/(a¹¹-1) because that is not a field, and not even a factorial ring. It can, however, compute those inverses when they exist, only you must ask more kindly:
sage: F.<a> = Zmod(32)[]
sage: fq = F([-1, 1, 1, 0, -1, 0, 1, 0, 0, 1, -1])
sage: print(fq)
31*a^10 + a^9 + a^6 + 31*a^4 + a^2 + a + 31
sage: print(fq.inverse_mod(a^11 - 1))
16*a^8 + 4*a^7 + 10*a^5 + 28*a^4 + 9*a^3 + 13*a^2 + 21*a + 1
Not ideal, admittedly.
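Since ℤ/32ℤ is not a field, it can be reassuring to compute the inverse independently and multiply it back. The block below is a sketch in plain Python, not Sage, and it is not what inverse_mod does internally. It uses a swapped-in technique: invert f modulo 2 with the extended Euclidean algorithm over GF(2), then lift that inverse to modulo 32 with the Newton step g ← g·(2 − f·g), which doubles the number of correct bits each time. All function names are made up for illustration, and it assumes q is a power of two and f is given as exactly n coefficients, lowest degree first.

```python
def cyclic_mul(a, b, n, q):
    """Multiply two coefficient lists modulo (x^n - 1) and modulo q."""
    out = [0] * n
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            k = (i + j) % n
            out[k] = (out[k] + ai * bj) % q
    return out


def gf2_divmod(a, b):
    """Divide GF(2) polynomials stored as bit masks (bit i = coeff of x^i)."""
    quotient = 0
    while a.bit_length() >= b.bit_length():
        shift = a.bit_length() - b.bit_length()
        quotient ^= 1 << shift
        a ^= b << shift
    return quotient, a


def gf2_mul(a, b):
    """Carry-less product of two GF(2) polynomials stored as bit masks."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def gf2_inverse(f, m):
    """Extended Euclid over GF(2): return s with s * f = 1 (mod m)."""
    r0, r1, s0, s1 = m, f, 0, 1
    while r1:
        quotient, remainder = gf2_divmod(r0, r1)
        r0, r1 = r1, remainder
        s0, s1 = s1, s0 ^ gf2_mul(quotient, s1)
    if r0 != 1:
        raise ValueError("polynomial is not invertible modulo 2")
    return gf2_divmod(s0, m)[1]


def inverse_mod_power_of_two(f, n, q):
    """Inverse of f in (Z/qZ)[x]/(x^n - 1), for q a power of two."""
    f = [c % q for c in f]
    f_bits = sum((c & 1) << i for i, c in enumerate(f))
    g_bits = gf2_inverse(f_bits, (1 << n) | 1)  # x^n - 1 equals x^n + 1 mod 2
    g = [(g_bits >> i) & 1 for i in range(n)]
    precision = 2
    while precision < q:
        fg = cyclic_mul(f, g, n, q)
        correction = [(-c) % q for c in fg]     # -f*g
        correction[0] = (correction[0] + 2) % q  # 2 - f*g
        g = cyclic_mul(g, correction, n, q)
        precision *= precision                   # 2 -> 4 -> 16 -> 256
    return g


f = [-1, 1, 1, 0, -1, 0, 1, 0, 0, 1, -1]
fq = inverse_mod_power_of_two(f, 11, 32)
print(fq)
print(cyclic_mul(f, fq, 11, 32) == [1] + [0] * 10)
```

The final line multiplies f by the computed inverse in the quotient ring and compares the product with 1; the same cyclic_mul check can be applied to a candidate inverse obtained from any other tool.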

How to get the bytes that an unsigned integer is composed of?

Supposing I've this number:
local uint = 2000;
how can I get the bytes that it's composed of? For instance:
print(sepbytes(uint));
-- 7, 208
My try:
local function sepbytes(cur)
local t = {};
repeat
cur = cur / 16;
table.insert(t, 1, cur);
until cur <= 0
return table.unpack(t);
end
print(sepbytes(2000));
This results in:
0 9.8813129168249e-324, +(lots of numbers)...
Expected result:
7, 208
Based on the comments, if I want 2 fixed bytes (that's the current case), I can use @ajcr's solution:
local function InParts(num)
return ((num & (0xff << 8)) >> 8), (num & 0xff);
end
@EgorSkriptunoff's solution (Lua 5.3) works for any number of bytes, by changing the size in the format string:
local function InParts(num)
return string.pack(">I2", num):byte(1, -1);
end
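The original attempt went wrong in two places: a byte covers 256 values, not 16, and the division has to be an integer division that also keeps the remainder. A minimal sketch of the corrected loop, written in Python for illustration (sep_bytes is a made-up name mirroring the question's function):

```python
def sep_bytes(value):
    """Split a non-negative integer into its bytes, most significant first."""
    parts = []
    while True:
        value, low = divmod(value, 256)  # quotient and low byte
        parts.insert(0, low)
        if value == 0:
            return parts


print(sep_bytes(2000))                   # [7, 208]
print(list((2000).to_bytes(2, "big")))   # [7, 208]
```

2000 is 7 * 256 + 208, so the high byte is 7 and the low byte is 208; the second line shows the fixed-width equivalent of the string.pack approach.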

How can I access my constant memory in my kernel?

I can't manage to access the data in my constant memory and I don't know why. Here is a snippet of my code:
#define N 10
__constant__ int constBuf_d[N];
__global__ void foo( int *results, int *constBuf )
{
int tdx = threadIdx.x;
int idx = blockIdx.x * blockDim.x + tdx;
if( idx < N )
{
results[idx] = constBuf[idx];
}
}
// main routine that executes on the host
int main(int argc, char* argv[])
{
int *results_h = new int[N];
int *results_d = NULL;
cudaMalloc((void **)&results_d, N*sizeof(int));
int arr[10] = { 16, 2, 77, 40, 12, 3, 5, 3, 6, 6 };
int *cpnt;
cudaError_t err = cudaGetSymbolAddress((void **)&cpnt, "constBuf_d");
if( err )
cout << "error!";
cudaMemcpyToSymbol((void**)&cpnt, arr, N*sizeof(int), 0, cudaMemcpyHostToDevice);
foo <<< 1, 256 >>> ( results_d, cpnt );
cudaMemcpy(results_h, results_d, N*sizeof(int), cudaMemcpyDeviceToHost);
for( int i=0; i < N; ++i )
printf("%i ", results_h[i] );
}
For some reason, I only get "0" in results_h. I'm running CUDA 4.0 with a card with capability 1.1.
Any ideas? Thanks!
If you add proper error checking to your code, you will find that the cudaMemcpyToSymbol call is failing with an invalid device symbol error. You either need to pass the symbol by name, or use cudaMemcpy instead. So this:
cudaGetSymbolAddress((void **)&cpnt, "constBuf_d");
cudaMemcpy(cpnt, arr, N*sizeof(int), cudaMemcpyHostToDevice);
or
cudaMemcpyToSymbol("constBuf_d", arr, N*sizeof(int), 0, cudaMemcpyHostToDevice);
or
cudaMemcpyToSymbol(constBuf_d, arr, N*sizeof(int), 0, cudaMemcpyHostToDevice);
will work. Having said that, passing a constant memory address as an argument to a kernel is the wrong way to use constant memory: it prevents the compiler from generating instructions that access the memory via the constant memory cache. Compare the compute capability 1.2 PTX generated for your kernel:
.entry _Z3fooPiS_ (
.param .u32 __cudaparm__Z3fooPiS__results,
.param .u32 __cudaparm__Z3fooPiS__constBuf)
{
.reg .u16 %rh<4>;
.reg .u32 %r<12>;
.reg .pred %p<3>;
.loc 16 7 0
$LDWbegin__Z3fooPiS_:
mov.u16 %rh1, %ctaid.x;
mov.u16 %rh2, %ntid.x;
mul.wide.u16 %r1, %rh1, %rh2;
cvt.s32.u16 %r2, %tid.x;
add.u32 %r3, %r2, %r1;
mov.u32 %r4, 9;
setp.gt.s32 %p1, %r3, %r4;
@%p1 bra $Lt_0_1026;
.loc 16 14 0
mul.lo.u32 %r5, %r3, 4;
ld.param.u32 %r6, [__cudaparm__Z3fooPiS__constBuf];
add.u32 %r7, %r6, %r5;
ld.global.s32 %r8, [%r7+0];
ld.param.u32 %r9, [__cudaparm__Z3fooPiS__results];
add.u32 %r10, %r9, %r5;
st.global.s32 [%r10+0], %r8;
$Lt_0_1026:
.loc 16 16 0
exit;
$LDWend__Z3fooPiS_:
} // _Z3fooPiS_
with this kernel:
__global__ void foo2( int *results )
{
int tdx = threadIdx.x;
int idx = blockIdx.x * blockDim.x + tdx;
if( idx < N )
{
results[idx] = constBuf_d[idx];
}
}
which produces
.entry _Z4foo2Pi (
.param .u32 __cudaparm__Z4foo2Pi_results)
{
.reg .u16 %rh<4>;
.reg .u32 %r<12>;
.reg .pred %p<3>;
.loc 16 18 0
$LDWbegin__Z4foo2Pi:
mov.u16 %rh1, %ctaid.x;
mov.u16 %rh2, %ntid.x;
mul.wide.u16 %r1, %rh1, %rh2;
cvt.s32.u16 %r2, %tid.x;
add.u32 %r3, %r2, %r1;
mov.u32 %r4, 9;
setp.gt.s32 %p1, %r3, %r4;
@%p1 bra $Lt_1_1026;
.loc 16 25 0
mul.lo.u32 %r5, %r3, 4;
mov.u32 %r6, constBuf_d;
add.u32 %r7, %r5, %r6;
ld.const.s32 %r8, [%r7+0];
ld.param.u32 %r9, [__cudaparm__Z4foo2Pi_results];
add.u32 %r10, %r9, %r5;
st.global.s32 [%r10+0], %r8;
$Lt_1_1026:
.loc 16 27 0
exit;
$LDWend__Z4foo2Pi:
} // _Z4foo2Pi
Note that in the second case, constBuf_d is accessed via ld.const.s32, rather than ld.global.s32, so that constant memory cache is used.
Excellent answer, @talonmies. But I would like to mention that there have been changes in CUDA 5. In cudaMemcpyToSymbol(), a char * argument is no longer supported.
The CUDA 5 release notes read:
The use of a character string to indicate a device symbol, which was possible with certain API functions, is no longer supported. Instead, the symbol should be used directly.
Instead, the copy to constant memory has to be made as follows:
cudaMemcpyToSymbol( dev_x, x, N * sizeof(float) );
In this case "dev_x" is the constant memory symbol and "x" is a pointer to the host memory that needs to be copied into dev_x.
