I am validating DPDK receive functionality. To do this, I replay a pcap from an external sender and
added code to l2fwd to dump the received packets to a pcap. The pcap dumped by l2fwd contains all the packets from the sender, but some of them are out of sequence.
The sender has already been validated.
DPDK version in use: 21.11
Link to the pcap used: https://wiki.wireshark.org/uploads/__moin_import__/attachments/SampleCaptures/tcp-ecn-sample.pcap
The out-of-order packets are random. On the first run I saw no reordered packets, but on the second run I reproduced the issue: the 2nd, 3rd and 4th packets arrived in the order 3, 4, 2.
Below is a snippet from the l2fwd example, with our modifications marked as //TESTCODE:
/* Read packet from RX queues. 8< */
for (i = 0; i < qconf->n_rx_port; i++) {
portid = qconf->rx_port_list[i];
nb_rx = rte_eth_rx_burst(portid, 0,
pkts_burst, MAX_PKT_BURST);
port_statistics[portid].rx += nb_rx;
for (j = 0; j < nb_rx; j++) {
m = pkts_burst[j];
// TESTCODE_STARTS
uint8_t* pkt = rte_pktmbuf_mtod(m, uint8_t*);
dump_to_pcap(pkt, rte_pktmbuf_pkt_len(m));
// TESTCODE_ENDS
rte_prefetch0(rte_pktmbuf_mtod(m, void *));
l2fwd_simple_forward(m, portid);
}
}
/* >8 End of read packet from RX queues. */
Below is the code for dump_to_pcap:
static int
dump_to_pcap(uint8_t* pkt, int pkt_len)
{
static FILE* fp = NULL;
static int init_file = 0;
if (0 == init_file) {
printf("Creating pcap\n");
char pcap_filename[256] = { 0 };
char Two_pcap_filename[256] = { 0 };
currentDateTime(pcap_filename);
sprintf(Two_pcap_filename,".\\Rx_%d_%s.pcap", 0, pcap_filename);
printf("FileSName to Create: %s\n", Two_pcap_filename);
fp = fopen(Two_pcap_filename, "wb");
if (NULL == fp) {
printf("Unable to open file\n");
fp = NULL;
}
else {
printf("File create success..\n");
init_file = 1;
typedef struct pcap_file_header1 {
unsigned int magic; // a 32-bit "magic number"
unsigned short version_major; //a 16-bit major version number
unsigned short version_minor; //a 16-bit minor version number
unsigned int thiszone; //a 32-bit "time zone offset" field that's actually not used, so you can (and probably should) just make it 0
unsigned int sigfigs; //a 32-bit "time stamp accuracy" field that's not actually used,so you can (and probably should) just make it 0;
unsigned int snaplen; //a 32-bit "snapshot length" field
unsigned int linktype; //a 32-bit "link layer type" field
}dumpFileHdr;
dumpFileHdr file_hdr;
file_hdr.magic = 2712847316; //0xa1b2c3d4;
file_hdr.version_major = 2;
file_hdr.version_minor = 4;
file_hdr.thiszone = 0;
file_hdr.sigfigs = 0;
file_hdr.snaplen = 65535;
file_hdr.linktype = 1;
fwrite((void*)(&file_hdr), sizeof(dumpFileHdr), 1, fp);
//printf("Pcap Header written\n");
}
}
typedef struct pcap_pkthdr1 {
unsigned int ts_sec; /* time stamp */
unsigned int ts_usec;
unsigned int caplen; /* length of portion present */
unsigned int len; /* length this packet (off wire) */
}dumpPktHdr;
dumpPktHdr pkt_hdr;
static int ts_sec = 1;
pkt_hdr.ts_sec = ts_sec++;
pkt_hdr.ts_usec = 0;
pkt_hdr.caplen = pkt_hdr.len = pkt_len;
if (NULL != fp) {
fwrite((void*)(&pkt_hdr), sizeof(dumpPktHdr), 1, fp);
fwrite((void*)(pkt), pkt_len, 1, fp);
fflush(fp);
}
return 0;
}
I have an already-initialized array that I want to use in each thread of a kernel call (each thread uses a different part of the array, so there are no dependencies). I allocate memory for the array on the device with cudaMalloc and copy it from host to device with cudaMemcpy.
I pass the pointer returned by cudaMalloc to the kernel call so that each thread can use it.
int SIZE = 100;
int* data = new int[SIZE];
int* d_data = 0;
cutilSafeCall( cudaMalloc(&d_data, SIZE * sizeof(int)) );
for (int i = 0; i < SIZE; i++)
data[i] = i;
cutilSafeCall( cudaMemcpy(d_data, data, SIZE * sizeof(int), cudaMemcpyHostToDevice) );
This code was taken from here.
The kernel call:
kernel<<<blocks, threads>>> (results, d_data);
I keep track of the results from each thread using the struct Result. The following code works without errors:
__global__ void mainKernel(Result res[], int* data){
int x = data[0];
}
But when I assign that value to res:
__global__ void mainKernel(Result res[], int* data){
int threadId = (blockIdx.x * blockDim.x) + threadIdx.x;
int x = data[0];
res[threadId].x = x;
}
An error is raised:
cudaSafeCall() Runtime API error in file , line 355 : an illegal memory access was encountered.
The same error appears with any operation involving the use of that pointer
__global__ void mainKernel(Result res[], int* data){
int threadId = (blockIdx.x * blockDim.x) + threadIdx.x;
int x = data[0];
if (x > 10)
res[threadId].x = 5;
}
There is no problem with the definition of res. Assigning any other value to res[threadId].x does not give me any error.
This is the output of running cuda-memcheck:
========= Invalid __global__ read of size 4
========= at 0x00000150 in mainKernel(Result*, int*)
========= by thread (86,0,0) in block (49,0,0)
========= Address 0x13024c0000 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x150d6d]
========= Host Frame:./out [0x2cc4b]
========= Host Frame:./out [0x46c23]
========= Host Frame:./out [0x3e37]
========= Host Frame:./out [0x3ca1]
========= Host Frame:./out [0x3cd6]
========= Host Frame:./out [0x39e9]
========= Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf5) [0x21ec5]
========= Host Frame:./out [0x31b9]
EDIT:
This is an example of the full code:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <string.h>
#include <iostream>
#include <assert.h>
typedef struct
{
int x,y,z;
} Result;
__global__ void mainKernel(Result pResults[], int* dataimage)
{
int threadId = (blockIdx.x * blockDim.x) + threadIdx.x;
int xVal = dataimage[0];
if (xVal > 10)
pResults[threadId].x = 5;
}
int main (int argc, char** argv)
{
int NUM_THREADS = 5*5;
int SIZE = 100;
int* data = new int[SIZE];
int* d_data = 0;
cutilSafeCall( cudaMalloc(&d_data, SIZE * sizeof(int)) );
for (int i = 0; i < SIZE; i++)
data[i] = i;
cutilSafeCall( cudaMemcpy(d_data, data, SIZE * sizeof(int), cudaMemcpyHostToDevice) );
unsigned int GPU_ID = 1; // not actually :-)
// unsigned int GPU_ID = cutGetMaxGflopsDeviceId() ;
cudaSetDevice(GPU_ID);
Result * results_GPU = 0;
cutilSafeCall( cudaMalloc( &results_GPU, NUM_THREADS * sizeof(Result)) );
Result * results_CPU = 0;
cutilSafeCall( cudaMallocHost( &results_CPU, NUM_THREADS * sizeof(Result)) );
mainKernel<<<5,5>>> ( results_GPU, d_data );
cudaThreadSynchronize();
cutilSafeCall( cudaMemcpy(results_CPU, results_GPU, NUM_THREADS * sizeof(Result),cudaMemcpyDeviceToHost) );
cutilSafeCall(cudaFree(results_GPU));
cutilSafeCall(cudaFreeHost(results_CPU));
cudaThreadExit();
} // ()
Your problem lies in this sequence of calls:
cutilSafeCall( cudaMalloc(&d_data, SIZE * sizeof(int)) );
for (int i = 0; i < SIZE; i++)
data[i] = i;
cutilSafeCall( cudaMemcpy(d_data, data, SIZE * sizeof(int), cudaMemcpyHostToDevice) );
unsigned int GPU_ID = 1;
cudaSetDevice(GPU_ID);
Result * results_GPU = 0;
cutilSafeCall( cudaMalloc( &results_GPU, NUM_THREADS * sizeof(Result)) );
Result * results_CPU = 0;
cutilSafeCall( cudaMallocHost( &results_CPU, NUM_THREADS * sizeof(Result)) );
mainKernel<<<5,5>>> ( results_GPU, d_data );
What is effectively happening is that you are allocating d_data and running your kernel on different GPUs, and d_data is not valid on the GPU you are launching the kernel on.
In detail, because you call cudaMalloc for d_data before cudaSetDevice, you are allocating d_data on the default device, and then explicitly allocating results_GPU and running the kernel on device 1. Clearly device 1 and the default device are not the same GPU (enumeration of devices usually starts at 0 in the runtime API).
If you change the code like this:
unsigned int GPU_ID = 1;
cutilSafeCall(cudaSetDevice(GPU_ID));
cutilSafeCall( cudaMalloc(&d_data, SIZE * sizeof(int)) );
for (int i = 0; i < SIZE; i++)
data[i] = i;
cutilSafeCall( cudaMemcpy(d_data, data, SIZE * sizeof(int), cudaMemcpyHostToDevice) );
Result * results_GPU = 0;
cutilSafeCall( cudaMalloc( &results_GPU, NUM_THREADS * sizeof(Result)) );
Result * results_CPU = 0;
cutilSafeCall( cudaMallocHost( &results_CPU, NUM_THREADS * sizeof(Result)) );
mainKernel<<<5,5>>> ( results_GPU, d_data );
i.e. select the non-default device before any allocations are made, the problem should disappear. The reason this doesn't happen with your very simple kernel:
__global__ void mainKernel(Result res[], int* data){
int x = data[0];
}
is simply that the CUDA compiler performs very aggressive optimisations by default, and because the result of the read of data[0] isn't actually used, the entire read can be optimised away, leaving an empty stub kernel which doesn't do anything. Only when the result of the load from memory is used in a memory write will the code not be optimised away during compilation. You can confirm this yourself by disassembling the code emitted by the compiler, if you are curious.
Note that there are ways to make this work on multi-GPU systems which support it, via peer-to-peer access, but that must be explicitly configured in your code for that facility to be used.
I have a CUDA kernel which takes an edge image and processes it to create a smaller, 1D array of the edge pixels. Now here is the strange behaviour: every time I run the kernel and calculate the number of edge pixels in "d_nlist" (see the code near the printf), I get a greater pixel count, even when I use the same image and stop the program completely before re-running it. As a result, each run takes longer than the last until, eventually, it throws an uncaught exception.
My question is, how can I stop this from happening so that I can get consistent results each time I run the kernel?
My device is a Geforce 620.
Constants:
THREADS_X = 32
THREADS_Y = 4
PIXELS_PER_THREAD = 4
MAX_QUEUE_LENGTH = THREADS_X * THREADS_Y * PIXELS_PER_THREAD
IMG_WIDTH = 256
IMG_HEIGHT = 256
IMG_SIZE = IMG_WIDTH * IMG_HEIGHT
BLOCKS_X = IMG_WIDTH / (THREADS_X * PIXELS_PER_THREAD)
BLOCKS_Y = IMG_HEIGHT / THREADS_Y
The kernel is as follows:
__global__ void convert2DEdgeImageTo1DArray( unsigned char const * const image,
unsigned int* const list, int* const glob_index ) {
unsigned int const x = blockIdx.x * THREADS_X*PIXELS_PER_THREAD + threadIdx.x;
unsigned int const y = blockIdx.y * THREADS_Y + threadIdx.y;
volatile int qindex = -1;
volatile __shared__ int sh_qindex[THREADS_Y];
volatile __shared__ int sh_qstart[THREADS_Y];
sh_qindex[threadIdx.y] = -1;
// Start by making an array
volatile __shared__ unsigned int sh_queue[MAX_QUEUE_LENGTH];
// Fill the queue
for(int i=0; i<PIXELS_PER_THREAD; i++)
{
int const xx = i*THREADS_X + x;
// Read one image pixel from global memory
unsigned char const pixel = image[y*IMG_WIDTH + xx];
unsigned int const queue_val = (y << 16) + xx;
if(pixel)
{
do {
qindex++;
sh_qindex[threadIdx.y] = qindex;
sh_queue[threadIdx.y*THREADS_X*PIXELS_PER_THREAD + qindex] = queue_val;
} while (sh_queue[threadIdx.y*THREADS_X*PIXELS_PER_THREAD + qindex] != queue_val);
}
// Reload index from smem (last thread to write to smem will have updated it)
qindex = sh_qindex[threadIdx.y];
}
// Let thread 0 reserve the space required in the global list
__syncthreads();
if(threadIdx.x == 0 && threadIdx.y == 0)
{
// Find how many items are stored in each list
int total_index = 0;
#pragma unroll
for(int i=0; i<THREADS_Y; i++)
{
sh_qstart[i] = total_index;
total_index += (sh_qindex[i] + 1u);
}
// Calculate the offset in the global list
unsigned int global_offset = atomicAdd(glob_index, total_index);
#pragma unroll
for(int i=0; i<THREADS_Y; i++)
{
sh_qstart[i] += global_offset;
}
}
__syncthreads();
// Copy local queues to global queue
for(int i=0; i<=qindex; i+=THREADS_X)
{
if(i + threadIdx.x > qindex)
break;
unsigned int qvalue = sh_queue[threadIdx.y*THREADS_X*PIXELS_PER_THREAD + i + threadIdx.x];
list[sh_qstart[threadIdx.y] + i + threadIdx.x] = qvalue;
}
}
The following is the method which calls the kernel:
void call2DTo1DKernel(unsigned char const * const h_image)
{
// Device side allocation
unsigned char *d_image = NULL;
unsigned int *d_list = NULL;
int h_nlist, *d_nlist = NULL;
cudaMalloc((void**)&d_image, sizeof(unsigned char)*IMG_SIZE);
cudaMalloc((void**)&d_list, sizeof(unsigned int)*IMG_SIZE);
cudaMalloc((void**)&d_nlist, sizeof(int));
// Time measurement initialization
cudaEvent_t start, stop, startio, stopio;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventCreate(&startio);
cudaEventCreate(&stopio);
// Start timer w/ io
cudaEventRecord(startio,0);
// Copy image data to device
cudaMemcpy((void*)d_image, (void*)h_image, sizeof(unsigned char)*IMG_SIZE, cudaMemcpyHostToDevice);
// Start timer
cudaEventRecord(start,0);
// Kernel call
// Phase 1 : Convert 2D binary image to 1D pixel array
dim3 dimBlock1(THREADS_X, THREADS_Y);
dim3 dimGrid1(BLOCKS_X, BLOCKS_Y);
convert2DEdgeImageTo1DArray<<<dimGrid1, dimBlock1>>>(d_image, d_list, d_nlist);
// Stop timer
cudaEventRecord(stop,0);
cudaEventSynchronize(stop);
// Stop timer w/ io
cudaEventRecord(stopio,0);
cudaEventSynchronize(stopio);
// Time measurement
cudaEventElapsedTime(&et,start,stop);
cudaEventElapsedTime(&etio,startio,stopio);
// Time measurement deinitialization
cudaEventDestroy(start);
cudaEventDestroy(stop);
cudaEventDestroy(startio);
cudaEventDestroy(stopio);
// Get list size
cudaMemcpy((void*)&h_nlist, (void*)d_nlist, sizeof(int), cudaMemcpyDeviceToHost);
// Report on console
printf("%d pixels processed...\n", h_nlist);
// Device side dealloc
cudaFree(d_image);
cudaFree(d_space);
cudaFree(d_list);
cudaFree(d_nlist);
}
Thank you very much in advance for your help everyone.
As a preamble, let me suggest some troubleshooting steps that are useful:
1. Instrument your code with proper CUDA error checking.
2. Run your code with cuda-memcheck, e.g. cuda-memcheck ./myapp
If you follow these steps, you'll find that your kernel is failing, and that the failures involve global writes of size 4. That focuses attention on the last segment of your kernel, beginning with the comment // Copy local queues to global queue
Regarding your code, then, you have at least 2 problems:
1. The addressing/indexing in the final segment of your kernel, where you write the individual queues out to global memory, is broken. I'm not going to try to debug this for you.
2. You are not initializing your d_nlist variable to zero. Therefore, when you do an atomic add to it, you are adding your values to a junk value, which will tend to increase as you repeat the process.
Here's some code with these problems removed (the queue copy code is commented out rather than fixed) and error checking added. It produces repeatable results for me:
$ cat t216.cu
#include <stdio.h>
#include <stdlib.h>
#define THREADS_X 32
#define THREADS_Y 4
#define PIXELS_PER_THREAD 4
#define MAX_QUEUE_LENGTH (THREADS_X*THREADS_Y*PIXELS_PER_THREAD)
#define IMG_WIDTH 256
#define IMG_HEIGHT 256
#define IMG_SIZE (IMG_WIDTH*IMG_HEIGHT)
#define BLOCKS_X (IMG_WIDTH/(THREADS_X*PIXELS_PER_THREAD))
#define BLOCKS_Y (IMG_HEIGHT/THREADS_Y)
#define cudaCheckErrors(msg) \
do { \
cudaError_t __err = cudaGetLastError(); \
if (__err != cudaSuccess) { \
fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
msg, cudaGetErrorString(__err), \
__FILE__, __LINE__); \
fprintf(stderr, "*** FAILED - ABORTING\n"); \
exit(1); \
} \
} while (0)
__global__ void convert2DEdgeImageTo1DArray( unsigned char const * const image,
unsigned int* const list, int* const glob_index ) {
unsigned int const x = blockIdx.x * THREADS_X*PIXELS_PER_THREAD + threadIdx.x;
unsigned int const y = blockIdx.y * THREADS_Y + threadIdx.y;
volatile int qindex = -1;
volatile __shared__ int sh_qindex[THREADS_Y];
volatile __shared__ int sh_qstart[THREADS_Y];
sh_qindex[threadIdx.y] = -1;
// Start by making an array
volatile __shared__ unsigned int sh_queue[MAX_QUEUE_LENGTH];
// Fill the queue
for(int i=0; i<PIXELS_PER_THREAD; i++)
{
int const xx = i*THREADS_X + x;
// Read one image pixel from global memory
unsigned char const pixel = image[y*IMG_WIDTH + xx];
unsigned int const queue_val = (y << 16) + xx;
if(pixel)
{
do {
qindex++;
sh_qindex[threadIdx.y] = qindex;
sh_queue[threadIdx.y*THREADS_X*PIXELS_PER_THREAD + qindex] = queue_val;
} while (sh_queue[threadIdx.y*THREADS_X*PIXELS_PER_THREAD + qindex] != queue_val);
}
// Reload index from smem (last thread to write to smem will have updated it)
qindex = sh_qindex[threadIdx.y];
}
// Let thread 0 reserve the space required in the global list
__syncthreads();
if(threadIdx.x == 0 && threadIdx.y == 0)
{
// Find how many items are stored in each list
int total_index = 0;
#pragma unroll
for(int i=0; i<THREADS_Y; i++)
{
sh_qstart[i] = total_index;
total_index += (sh_qindex[i] + 1u);
}
// Calculate the offset in the global list
unsigned int global_offset = atomicAdd(glob_index, total_index);
#pragma unroll
for(int i=0; i<THREADS_Y; i++)
{
sh_qstart[i] += global_offset;
}
}
__syncthreads();
// Copy local queues to global queue
/*
for(int i=0; i<=qindex; i+=THREADS_X)
{
if(i + threadIdx.x > qindex)
break;
unsigned int qvalue = sh_queue[threadIdx.y*THREADS_X*PIXELS_PER_THREAD + i + threadIdx.x];
list[sh_qstart[threadIdx.y] + i + threadIdx.x] = qvalue;
}
*/
}
void call2DTo1DKernel(unsigned char const * const h_image)
{
// Device side allocation
unsigned char *d_image = NULL;
unsigned int *d_list = NULL;
int h_nlist=0, *d_nlist = NULL;
cudaMalloc((void**)&d_image, sizeof(unsigned char)*IMG_SIZE);
cudaMalloc((void**)&d_list, sizeof(unsigned int)*IMG_SIZE);
cudaMalloc((void**)&d_nlist, sizeof(int));
cudaCheckErrors("cudamalloc fail");
// Time measurement initialization
cudaEvent_t start, stop, startio, stopio;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventCreate(&startio);
cudaEventCreate(&stopio);
float et, etio;
// Start timer w/ io
cudaEventRecord(startio,0);
cudaMemcpy(d_nlist, &h_nlist, sizeof(int), cudaMemcpyHostToDevice);
// Copy image data to device
cudaMemcpy((void*)d_image, (void*)h_image, sizeof(unsigned char)*IMG_SIZE, cudaMemcpyHostToDevice);
cudaCheckErrors("cudamemcpy 1");
// Start timer
cudaEventRecord(start,0);
// Kernel call
// Phase 1 : Convert 2D binary image to 1D pixel array
dim3 dimBlock1(THREADS_X, THREADS_Y);
dim3 dimGrid1(BLOCKS_X, BLOCKS_Y);
convert2DEdgeImageTo1DArray<<<dimGrid1, dimBlock1>>>(d_image, d_list, d_nlist);
cudaDeviceSynchronize();
cudaCheckErrors("kernel fail");
// Stop timer
cudaEventRecord(stop,0);
cudaEventSynchronize(stop);
// Stop timer w/ io
cudaEventRecord(stopio,0);
cudaEventSynchronize(stopio);
// Time measurement
cudaEventElapsedTime(&et,start,stop);
cudaEventElapsedTime(&etio,startio,stopio);
// Time measurement deinitialization
cudaEventDestroy(start);
cudaEventDestroy(stop);
cudaEventDestroy(startio);
cudaEventDestroy(stopio);
// Get list size
cudaMemcpy((void*)&h_nlist, (void*)d_nlist, sizeof(int), cudaMemcpyDeviceToHost);
cudaCheckErrors("cudaMemcpy 2");
// Report on console
printf("%d pixels processed...\n", h_nlist);
// Device side dealloc
cudaFree(d_image);
// cudaFree(d_space);
cudaFree(d_list);
cudaFree(d_nlist);
}
int main(){
unsigned char *image;
image = (unsigned char *)malloc(IMG_SIZE * sizeof(unsigned char));
if (image == 0) {printf("malloc fail\n"); return 0;}
for (int i =0 ; i<IMG_SIZE; i++)
image[i] = rand()%2;
call2DTo1DKernel(image);
call2DTo1DKernel(image);
call2DTo1DKernel(image);
call2DTo1DKernel(image);
call2DTo1DKernel(image);
cudaCheckErrors("some error");
return 0;
}
$ nvcc -arch=sm_20 -O3 -o t216 t216.cu
$ ./t216
32617 pixels processed...
32617 pixels processed...
32617 pixels processed...
32617 pixels processed...
32617 pixels processed...
$ ./t216
32617 pixels processed...
32617 pixels processed...
32617 pixels processed...
32617 pixels processed...
32617 pixels processed...
$
Consider the code below:
#include <stdio.h>
#include <stdlib.h>
#define FORCE_CAST(var, type) *(type*)&var
struct processor_status_register
{
unsigned int cwp:5;
unsigned int et:1;
unsigned int ps:1;
unsigned int s:1;
unsigned int pil:4;
unsigned int ef:1;
unsigned int ec:1;
unsigned int reserved:6;
unsigned int c:1;
unsigned int v:1;
unsigned int z:1;
unsigned int n:1;
unsigned int ver:4;
unsigned int impl:4;
}__attribute__ ((__packed__));
struct registers
{
unsigned long* registerSet;
unsigned long* globalRegisters;
unsigned long* cwptr;
unsigned long wim, tbr, y, pc, npc;
unsigned short registerWindows;
/* Though Intel x86 architecture allows un-aligned memory access, SPARC mandates memory accesses to be 8 byte aligned. Without __attribute__ ((aligned (8))) or a preceding dummy byte e.g. unsigned short dummyByte, the code below crashes with a dreaded Bus error and Core dump. For more details, follow the links below:
http://blog.jgc.org/2007/04/debugging-solaris-bus-error-caused-by.html
https://groups.google.com/forum/?fromgroups=#!topic/comp.unix.solaris/8SgFiMudGL4
*/
struct processor_status_register __attribute__ ((aligned (8))) psr;
}__attribute__ ((__packed__));
int getBit(unsigned long bitStream, int position)
{
int bit;
bit = (bitStream & (1 << position)) >> position;
return bit;
}
char* showBits(unsigned long bitStream, int startPosition, int endPosition)
{
// Allocate one extra byte for NULL character
char* bits = (char*)malloc(endPosition - startPosition + 2);
int bitIndex;
for(bitIndex = 0; bitIndex <= endPosition; bitIndex++)
bits[bitIndex] = (getBit(bitStream, endPosition - bitIndex)) ? '1' : '0';
bits[bitIndex] = '\0';
return bits;
}
int main()
{
struct registers sparcRegisters; short isLittleEndian;
// Check for Endianness
unsigned long checkEndian = 0x00000001;
if(*((char*)(&checkEndian)))
{printf("Little Endian\n"); isLittleEndian = 1;} // Little Endian architecture detected
else
{printf("Big Endian\n"); isLittleEndian = 0;} // Big Endian architecture detected
unsigned long registerValue = 0xF30010A7;
unsigned long swappedRegisterValue = isLittleEndian ? registerValue :
__builtin_bswap32(registerValue);
sparcRegisters.psr = FORCE_CAST(swappedRegisterValue, struct
processor_status_register);
registerValue = isLittleEndian ? FORCE_CAST (sparcRegisters.psr,
unsigned long) : __builtin_bswap32(FORCE_CAST (sparcRegisters.psr,
unsigned long));
printf("\nPSR=0x%0X, IMPL=%u, VER=%u, CWP=%u\n", registerValue,
sparcRegisters.psr.impl, sparcRegisters.psr.ver,
sparcRegisters.psr.cwp);
printf("PSR=%s\n",showBits(registerValue, 0, 31));
sparcRegisters.psr.cwp = 7;
sparcRegisters.psr.et = 1;
sparcRegisters.psr.ps = 0;
sparcRegisters.psr.s = 1;
sparcRegisters.psr.pil = 0;
sparcRegisters.psr.ef = 0;
sparcRegisters.psr.ec = 0;
sparcRegisters.psr.reserved = 0;
sparcRegisters.psr.c = 0;
sparcRegisters.psr.v = 0;
sparcRegisters.psr.z = 0;
sparcRegisters.psr.n = 0;
sparcRegisters.psr.ver = 3;
sparcRegisters.psr.impl = 0xF;
registerValue = isLittleEndian ? FORCE_CAST (sparcRegisters.psr,
unsigned long) : __builtin_bswap32(FORCE_CAST (sparcRegisters.psr,
unsigned long));
printf("\nPSR=0x%0X, IMPL=%u, VER=%u, CWP=%u\n", registerValue,
sparcRegisters.psr.impl, sparcRegisters.psr.ver,
sparcRegisters.psr.cwp);
printf("PSR=%s\n\n",showBits(registerValue, 0, 31));
return 0;
}
I used gcc-4.7.2 on Solaris 10 on SPARC to compile the code above,
which produces this Big-Endian output:
Big Endian
PSR=0xF30010A7, IMPL=3, VER=15, CWP=20
PSR=11110011000000000001000010100111
PSR=0x3F00003D, IMPL=15, VER=3, CWP=7
PSR=00111111000000000000000000111101
I used gcc-4.4 on Ubuntu-10.04 on Intel x86 to compile the
same code, which produces this Little-Endian output:
Little Endian
PSR=0xF30010A7, IMPL=15, VER=3, CWP=7
PSR=11110011000000000001000010100111
PSR=0xF30000A7, IMPL=15, VER=3, CWP=7
PSR=11110011000000000000000010100111
While the latter is as expected, can anyone please explain the
Big-Endian counterpart? Assuming the showBits() function is
correct, how can PSR=0x3F00003D give rise to IMPL=15, VER=3, CWP=7?
How is the bit-field arranged and interpreted in
memory on a Big-Endian system?
... PSR=0x3F00003D give rise to IMPL=15, VER=3, CWP=7 values?
It can't. I don't know why you're calling __builtin_bswap32, but 0x3F00003D does not represent the memory of the sparcRegisters struct as you initialized it.
Let's check this code:
sparcRegisters.psr.cwp = 7;
sparcRegisters.psr.et = 1;
sparcRegisters.psr.ps = 0;
sparcRegisters.psr.s = 1;
sparcRegisters.psr.pil = 0;
sparcRegisters.psr.ef = 0;
sparcRegisters.psr.ec = 0;
sparcRegisters.psr.reserved = 0;
sparcRegisters.psr.c = 0;
sparcRegisters.psr.v = 0;
sparcRegisters.psr.z = 0;
sparcRegisters.psr.n = 0;
sparcRegisters.psr.ver = 3;
sparcRegisters.psr.impl = 0xF;
The individual translations are as follows:
7 => 00111
1 => 1
0 => 0
1 => 1
0 => 0000
0 => 0
0 => 0
0 => 000000
0 => 0
0 => 0
0 => 0
0 => 0
3 => 0011
F => 1111
In memory the structure therefore becomes 00111101000000000000000000111111, which is 0x3D00003F in big-endian.
You can confirm this with the following code (tested using CC on Solaris):
#include <stdio.h>
#include <string.h>
struct processor_status_register
{
unsigned int cwp:5;
unsigned int et:1;
unsigned int ps:1;
unsigned int s:1;
unsigned int pil:4;
unsigned int ef:1;
unsigned int ec:1;
unsigned int reserved:6;
unsigned int c:1;
unsigned int v:1;
unsigned int z:1;
unsigned int n:1;
unsigned int ver:4;
unsigned int impl:4;
}__attribute__ ((__packed__));
int getBit(unsigned long bitStream, int position)
{
int bit;
bit = (bitStream & (1 << position)) >> position;
return bit;
}
char* showBits(unsigned long bitStream, int startPosition, int endPosition)
{
// Allocate one extra byte for NULL character
static char bits[33];
memset(bits, 0, 33);
int bitIndex;
for(bitIndex = 0; bitIndex <= endPosition; bitIndex++)
{
bits[bitIndex] = (getBit(bitStream, endPosition - bitIndex)) ? '1' : '0';
}
return bits;
}
int main()
{
processor_status_register psr;
psr.cwp = 7;
psr.et = 1;
psr.ps = 0;
psr.s = 1;
psr.pil = 0;
psr.ef = 0;
psr.ec = 0;
psr.reserved = 0;
psr.c = 0;
psr.v = 0;
psr.z = 0;
psr.n = 0;
psr.ver = 3;
psr.impl = 0xF;
unsigned long registerValue = 0;
memcpy(&registerValue, &psr, sizeof(registerValue));
printf("\nPSR=0x%0X, IMPL=%u, VER=%u, CWP=%u\n", registerValue,
psr.impl, psr.ver,
psr.cwp);
printf("PSR=%s\n\n",showBits(registerValue, 0, 31));
return 0;
}
The output of this is:
PSR=0x3D00003F, IMPL=15, VER=3, CWP=7
PSR=00111101000000000000000000111111
I'm trying to implement the Particle Swarm Optimization on CUDA. I'm partially initializing data arrays on host, then I allocate memory on CUDA and copy it there, and then try to proceed with the initialization.
The problem is, when I'm trying to modify array element like so
__global__ void kernelInit(
float* X,
size_t pitch,
int width,
float X_high,
float X_low
) {
// Silly, but pretty reliable way to address array elements
unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
int r = tid / width;
int c = tid % width;
float* pElement = (float*)((char*)X + r * pitch) + c;
*pElement = *pElement * (X_high - X_low) - X_low;
//*pElement = (X_high - X_low) - X_low;
}
It corrupts the values and gives me 1.#INF00 as array element. When I uncomment the last line *pElement = (X_high - X_low) - X_low; and comment the previous, it works as expected: I get values like 15.36 and so on.
I believe the problem is with my memory allocation and copying, and/or with addressing the specific array element. I read the CUDA manual on both of these topics, but I can't spot the error: I still get a corrupt array if I do anything with an element of the array. For example, *pElement = *pElement * 2 gives unreasonably big results like 779616...00000000.00000 when the initial *pElement is expected to be just a float in [0;1].
Here is the full source. Initialization of arrays begins in main (bottom of the source), then f1 function does the work for CUDA and launches the initialization kernel kernelInit:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <cuda.h>
#include <cuda_runtime.h>
const unsigned f_n = 3;
const unsigned n = 2;
const unsigned p = 64;
typedef struct {
unsigned k_max;
float c1;
float c2;
unsigned p;
float inertia_factor;
float Ef;
float X_low[f_n];
float X_high[f_n];
float X_min[n][f_n];
} params_t;
typedef void (*kernelWrapperType) (
float *X,
float *X_highVec,
float *V,
float *X_best,
float *Y,
float *Y_best,
float *X_swarmBest,
bool &termination,
const float &inertia,
const params_t *params,
const unsigned &f
);
typedef float (*twoArgsFuncType) (
float x1,
float x2
);
__global__ void kernelInit(
float* X,
size_t pitch,
int width,
float X_high,
float X_low
) {
// Silly, but pretty reliable way to address array elements
unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
int r = tid / width;
int c = tid % width;
float* pElement = (float*)((char*)X + r * pitch) + c;
*pElement = *pElement * (X_high - X_low) - X_low;
//*pElement = (X_high - X_low) - X_low;
}
__device__ float kernelF1(
float x1,
float x2
) {
float y = pow(x1, 2.f) + pow(x2, 2.f);
return y;
}
void f1(
float *X,
float *X_highVec,
float *V,
float *X_best,
float *Y,
float *Y_best,
float *X_swarmBest,
bool &termination,
const float &inertia,
const params_t *params,
const unsigned &f
) {
float *X_d = NULL;
float *Y_d = NULL;
unsigned length = n * p;
const cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();
size_t pitch;
size_t dpitch;
cudaError_t err;
unsigned width = n;
unsigned height = p;
err = cudaMallocPitch (&X_d, &dpitch, width * sizeof(float), height);
pitch = n * sizeof(float);
err = cudaMemcpy2D(X_d, dpitch, X, pitch, width * sizeof(float), height, cudaMemcpyHostToDevice);
err = cudaMalloc (&Y_d, sizeof(float) * p);
err = cudaMemcpy (Y_d, Y, sizeof(float) * p, cudaMemcpyHostToDevice);
dim3 threads; threads.x = 32;
dim3 blocks; blocks.x = (length/threads.x) + 1;
kernelInit<<<threads,blocks>>>(X_d, dpitch, width, params->X_high[f], params->X_low[f]);
err = cudaMemcpy2D(X, pitch, X_d, dpitch, n*sizeof(float), p, cudaMemcpyDeviceToHost);
err = cudaFree(X_d);
err = cudaMemcpy(Y, Y_d, sizeof(float) * p, cudaMemcpyDeviceToHost);
err = cudaFree(Y_d);
}
float F1(
float x1,
float x2
) {
float y = pow(x1, 2.f) + pow(x2, 2.f);
return y;
}
/*
* Generates random float in [0.0; 1.0]
*/
float frand(){
return (float)rand()/(float)RAND_MAX;
}
/*
* This is the main routine which declares and initializes the integer vector, moves it to the device, launches kernel
* brings the result vector back to host and dumps it on the console.
*/
int main() {
const params_t params = {
100,
0.5,
0.5,
p,
0.98,
0.01,
{-5.12, -2.048, -5.12},
{5.12, 2.048, 5.12},
{{0, 1, 0}, {0, 1, 0}}
};
float X[p][n];
float X_highVec[n];
float V[p][n];
float X_best[p][n];
float Y[p] = {0};
float Y_best[p] = {0};
float X_swarmBest[n];
kernelWrapperType F_wrapper[f_n] = {&f1, &f1, &f1};
twoArgsFuncType F[f_n] = {&F1, &F1, &F1};
for (unsigned f = 0; f < f_n; f++) {
printf("Optimizing function #%u\n", f);
srand ( time(NULL) );
for (unsigned i = 0; i < p; i++)
for (unsigned j = 0; j < n; j++)
X[i][j] = X_best[i][j] = frand();
for (int i = 0; i < n; i++)
X_highVec[i] = params.X_high[f];
for (unsigned i = 0; i < p; i++)
for (unsigned j = 0; j < n; j++)
V[i][j] = frand();
for (unsigned i = 0; i < p; i++)
Y_best[i] = F[f](X[i][0], X[i][1]);
for (unsigned i = 0; i < n; i++)
X_swarmBest[i] = params.X_high[f];
float y_swarmBest = F[f](X_highVec[0], X_highVec[1]);
bool termination = false;
float inertia = 1.;
for (unsigned k = 0; k < params.k_max; k++) {
F_wrapper[f]((float *)X, X_highVec, (float *)V, (float *)X_best, Y, Y_best, X_swarmBest, termination, inertia, &params, f);
}
for (unsigned i = 0; i < p; i++)
{
for (unsigned j = 0; j < n; j++)
{
printf("%f\t", X[i][j]);
}
printf("F = %f\n", Y[i]);
}
getchar();
}
}
Update: I tried adding error handling like so
err = cudaMallocPitch (&X_d, &dpitch, width * sizeof(float), height);
if (err != cudaSuccess) {
fprintf(stderr, "%s\n", cudaGetErrorString(err));
exit(1);
}
after each API call, but it printed nothing and did not exit (I still get all the results and the program runs to the end).
This is an unnecessarily complex piece of code for what should be a simple repro case, but this immediately jumps out:
const unsigned n = 2;
const unsigned p = 64;
unsigned length = n * p;
dim3 threads; threads.x = 32;
dim3 blocks; blocks.x = (length/threads.x) + 1;
kernelInit<<<threads,blocks>>>(X_d, dpitch, width, params->X_high[f], params->X_low[f]);
So you are firstly computing the incorrect number of blocks, and then reversing the order of the blocks per grid and threads per block arguments in the kernel launch. That may well lead to out of bounds memory access, either hosing something in GPU memory or causing an unspecified launch failure, which your lack of error handling might not be catching. There is a tool called cuda-memcheck which has been shipped with the toolkit since about CUDA 3.0. If you run it, it will give you valgrind style memory access violation reports. You should get into the habit of using it, if you are not already doing so.
As for infinite values, that is to be expected, isn't it? Your code starts with values in (0,1), and then does
X[i] = X[i] * (5.12--5.12) - -5.12
100 times, which is the rough equivalent of multiplying by 10^100, which is then followed by
X[i] = X[i] * (2.048--2.048) - -2.048
100 times, which is the rough equivalent of multiplying by 4^100, finally followed by
X[i] = X[i] * (5.12--5.12) - -5.12
again. So your results should be of the order of 1E250, which is much larger than 3.4E38, the rough upper limit of representable numbers in IEEE 754 single precision.