C bit field and the memory layout

I am writing a program that parses the IP header, and the structure of the IP header is defined as follows.
struct ip_hdr {
uint8_t ihl : 4;
uint8_t version : 4;
uint8_t tos;
uint16_t len;
uint16_t id;
uint16_t flags : 3;
uint16_t frag_offset : 13;
uint8_t ttl;
uint8_t proto;
uint16_t csum;
uint32_t saddr;
uint32_t daddr;
uint8_t data[];
} __attribute__((packed));
After receiving the data, I found some problems with reading the flags field and the frag_offset field. Together these two fields occupy two bytes. While debugging, I found that 0x4000 is stored at that position. According to the definition of the structure, I think it should look like this:
byte0 byte1
010 00000 00000000
___ ______________
flags frag_offset
so flags should be 2 (binary 010) and frag_offset should be 0. However, the values I read are flags = 0 and frag_offset = 8. Why is this happening?
P.S. I am working on an Intel-based machine.
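One way to sidestep the question of how the compiler lays out bit fields is to build the 16-bit word from its two bytes and extract the fields with shifts and masks. This is a minimal sketch, assuming the two bytes are in network (big-endian) order as they appear on the wire; the helper names are hypothetical and not part of the code above:

```c
#include <stdint.h>

/* Combine the two bytes at the flags/fragment-offset position,
   most significant byte first (network byte order). */
static uint16_t ip_flags_frag_word(const uint8_t *p)
{
    return (uint16_t)(((uint16_t)p[0] << 8) | p[1]);
}

/* Top 3 bits of the word. */
static unsigned ip_flags(const uint8_t *p)
{
    return ip_flags_frag_word(p) >> 13;
}

/* Low 13 bits of the word. */
static unsigned ip_frag_offset(const uint8_t *p)
{
    return ip_flags_frag_word(p) & 0x1FFFu;
}
```

For the bytes 0x40 0x00 this gives flags == 2 (binary 010) and frag_offset == 0. Reading the same two bytes as a little-endian uint16_t yields 0x0040 instead. With a compiler that allocates bit fields starting from the least significant bit, as GCC does on x86, the low 3 bits (flags) are then 0 and the remaining 13 bits (frag_offset) are 8, which matches the values observed.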

Related

Number of thread increase but no effect on runtime

I have tried to implement an alpha image blending algorithm in CUDA C. The code compiles without errors. As I understand the threading model, running the code with more threads should decrease the runtime, but I got a weird pattern instead: with 1 thread the runtime was 8.060539e-01 sec, with 4 threads 7.579031e-01 sec, with 8 threads 7.810102e-01 sec, and with 256 threads 7.875319e-01 sec.
Here is my code:
#include <stdio.h>
#include <stdlib.h>
#include "timer.h"
#define STB_IMAGE_IMPLEMENTATION
#include "stb_image.h"
#define STB_IMAGE_WRITE_IMPLEMENTATION
#include "stb_image_write.h"
__global__ void image_blend(unsigned char *Pout, unsigned char *pin1, unsigned char *pin2, int width, int height, int channels, float alpha){
int col = threadIdx.x + blockIdx.x*blockDim.x;
int row = threadIdx.y + blockIdx.y*blockDim.y;
if(col<width && row<height){
size_t img_size = width * height * channels;
if (Pout != NULL)
{
for (size_t i = 0; i < img_size; i++)
{
Pout[i] = ((1.0 - alpha) * pin1[i] + alpha * pin2[i]);
}
}
}
}
int main(int argc, char* argv[]){
int thread_count;
double start, finish;
float alpha;
int width, height, channels;
unsigned char *new_img;
thread_count = strtol(argv[1], NULL, 10);
printf("Enter the value for alpha:");
scanf("%f", &alpha);
unsigned char *apple = stbi_load("apple.jpg", &width, &height, &channels, 0);
unsigned char *orange = stbi_load("orange.jpg", &width, &height, &channels, 0);
size_t img_size = width * height * channels;
//unsigned char *new_img = malloc(img_size);
cudaMallocManaged(&new_img,img_size*sizeof(unsigned char));
cudaMallocManaged(&apple,img_size* sizeof(unsigned char));
cudaMallocManaged(&orange, img_size*sizeof(unsigned char));
GET_TIME(start);
image_blend<<<1,16,thread_count>>>(new_img,apple, orange, width, height, channels,alpha);
cudaDeviceSynchronize();
GET_TIME(finish);
stbi_write_jpg("new_image.jpg", width, height, channels, new_img, 100);
cudaFree(new_img);
cudaFree(apple);
cudaFree(orange);
printf("\n Elapsed time for cuda = %e seconds\n", finish-start);
}
After getting this weird runtime pattern I am a bit skeptical about my implementation. Can anyone tell me why I get these runtimes even though my code seems to have no bug?
Let's start here:
image_blend<<<1,16,thread_count>>>(new_img,apple, orange, width, height, channels,alpha);
It seems evident you don't understand the kernel launch syntax:
<<<1,16,thread_count>>>
The first number (1) is the number of blocks to launch.
The second number (16) is the number of threads per block.
The third number (thread_count) is the size of the dynamically allocated shared memory in bytes.
So our first observation will be that although you claimed to have changed the thread count, you didn't. You were changing the number of bytes of dynamically allocated shared memory. Since your kernel code doesn't use shared memory, this is a completely meaningless variable.
Let's also observe your kernel code:
for (size_t i = 0; i < img_size; i++)
{
Pout[i] = ((1.0 - alpha) * pin1[i] + alpha * pin2[i]);
}
For every thread that passes your if test, each one of those threads will execute the entire for-loop and will process the entire image. That is not the general idea with writing CUDA kernels. The general idea is to break up the work so that each thread does a portion of the work, not the whole activity.
These are very basic observations. If you take advantage of an orderly introduction to CUDA, such as here, you can get beyond some of these basic concepts.
We could also point out that your kernel nominally expects a 2D launch, and you are not providing one, and perhaps many other observations. Another important concept that you are missing is that you cannot do this:
unsigned char *apple = stbi_load("apple.jpg", &width, &height, &channels, 0);
...
cudaMallocManaged(&apple,img_size* sizeof(unsigned char));
and expect anything sensible to come from that. If you want to see how data is moved from a host allocation to the device, study nearly any CUDA sample code, such as vectorAdd. Using a managed allocation doesn't allow you to overwrite the pointer like you are doing and get anything useful from that.
I'll provide an example of how one might go about doing what I think you are suggesting, without providing a complete tutorial on CUDA. To provide an example, I'm going to skip the STB image loading routines. To understand the work you are trying to do here, the actual image content does not matter.
Here's an example of an image processing kernel (1D) that will:
Process the entire image, only once
Use less time, roughly speaking, as you increase the thread count.
You haven't provided your timer routine/code, so I'll provide my own:
$ cat t2130.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
unsigned long long dtime_usec(unsigned long long start=0){
timeval tv;
gettimeofday(&tv, 0);
return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
unsigned char *i_load(int w, int h, int c, int init){
unsigned char *res = new unsigned char[w*h*c];
for (int i = 0; i < w*h*c; i++) res[i] = init;
return res;
}
__global__ void image_blend(unsigned char *Pout, unsigned char *pin1, unsigned char *pin2, int width, int height, int channels, float alpha){
if (Pout != NULL)
{
size_t img_size = width * height * channels;
for (size_t i = blockIdx.x*blockDim.x+threadIdx.x; i < img_size; i+=gridDim.x*blockDim.x) // grid-stride loop
{
Pout[i] = ((1.0 - alpha) * pin1[i] + alpha * pin2[i]);
}
}
}
int main(int argc, char* argv[]){
int threads_per_block = 64;
unsigned long long dt;
float alpha;
int width = 1920;
int height = 1080;
int channels = 3;
size_t img_size = width * height * channels;
int thread_count = img_size;
if (argc > 1) thread_count = atoi(argv[1]);
unsigned char *new_img, *m_apple, *m_orange;
printf("Enter the value for alpha:");
scanf("%f", &alpha);
unsigned char *apple = i_load(width, height, channels, 10);
unsigned char *orange = i_load(width, height, channels, 70);
//unsigned char *new_img = malloc(img_size);
cudaMallocManaged(&new_img,img_size*sizeof(unsigned char));
cudaMallocManaged(&m_apple,img_size* sizeof(unsigned char));
cudaMallocManaged(&m_orange, img_size*sizeof(unsigned char));
memcpy(m_apple, apple, img_size);
memcpy(m_orange, orange, img_size);
int blocks;
if (thread_count < threads_per_block) {threads_per_block = thread_count; blocks = 1;}
else {blocks = thread_count/threads_per_block;}
printf("running with %d blocks of %d threads\n", blocks, threads_per_block);
dt = dtime_usec(0);
image_blend<<<blocks, threads_per_block>>>(new_img,m_apple, m_orange, width, height, channels,alpha);
cudaDeviceSynchronize();
dt = dtime_usec(dt);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) printf("CUDA Error: %s\n", cudaGetErrorString(err));
else printf("\n Elapsed time for cuda = %e seconds\n", dt/(float)USECPSEC);
cudaFree(new_img);
cudaFree(m_apple);
cudaFree(m_orange);
}
$ nvcc -o t2130 t2130.cu
$ ./t2130 1
Enter the value for alpha:0.2
running with 1 blocks of 1 threads
Elapsed time for cuda = 5.737880e-01 seconds
$ ./t2130 2
Enter the value for alpha:0.2
running with 1 blocks of 2 threads
Elapsed time for cuda = 3.230150e-01 seconds
$ ./t2130 32
Enter the value for alpha:0.2
running with 1 blocks of 32 threads
Elapsed time for cuda = 4.865200e-02 seconds
$ ./t2130 64
Enter the value for alpha:0.2
running with 1 blocks of 64 threads
Elapsed time for cuda = 2.623300e-02 seconds
$ ./t2130 128
Enter the value for alpha:0.2
running with 2 blocks of 64 threads
Elapsed time for cuda = 1.546000e-02 seconds
$ ./t2130
Enter the value for alpha:0.2
running with 97200 blocks of 64 threads
Elapsed time for cuda = 5.809000e-03 seconds
$
(CentOS 7, CUDA 11.4, V100)
The key methodology that allows the kernel to do all the work (only once) while making use of an "arbitrary" number of threads efficiently is the grid-stride loop.
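To see why the grid-stride loop covers the whole image exactly once for any thread count, the index pattern can be emulated on the host in plain C. This is only an illustrative sketch: the function names are hypothetical, total_threads stands in for gridDim.x*blockDim.x, and the emulated threads run one after another rather than concurrently as they would on a GPU:

```c
#include <stddef.h>

/* Host-side emulation of a grid-stride loop: "thread" tid of
   total_threads visits indices tid, tid + total_threads, ...
   (total_threads stands in for gridDim.x * blockDim.x). */
static void blend_as_thread(unsigned char *out, const unsigned char *in1,
                            const unsigned char *in2, size_t n, float alpha,
                            size_t tid, size_t total_threads)
{
    for (size_t i = tid; i < n; i += total_threads)
        out[i] = (unsigned char)((1.0f - alpha) * in1[i] + alpha * in2[i]);
}

/* Run every emulated thread in turn; on a GPU these run concurrently. */
static void blend_all(unsigned char *out, const unsigned char *in1,
                      const unsigned char *in2, size_t n, float alpha,
                      size_t total_threads)
{
    for (size_t tid = 0; tid < total_threads; tid++)
        blend_as_thread(out, in1, in2, n, alpha, tid, total_threads);
}
```

Index i is handled only by the emulated thread with tid == i % total_threads, so every element below n is written once whether total_threads is 1, 64, or larger than n.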

Global device memory size limit when using statically allocated memory in CUDA

I thought the maximum size of global memory was limited only by the GPU device, no matter whether it is allocated statically using __device__ __managed__ or dynamically using cudaMalloc.
But I found that with the __device__ __managed__ approach, the maximum array size I can declare is much smaller than the GPU device limit.
The minimal working example is as follows:
#include <stdio.h>
#include <cuda_runtime.h>
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
#define MX 64
#define MY 64
#define MZ 64
#define NX 64
#define NY 64
#define M (MX * MY * MZ)
__device__ __managed__ float A[NY][NX][M];
__device__ __managed__ float B[NY][NX][M];
__global__ void swapAB()
{
int tid = blockIdx.x * blockDim.x + threadIdx.x;
for(int j = 0; j < NY; j++)
for(int i = 0; i < NX; i++)
A[j][i][tid] = B[j][i][tid];
}
int main()
{
swapAB<<<M/256,256>>>();
gpuErrchk( cudaPeekAtLastError() );
gpuErrchk( cudaDeviceSynchronize() );
return 0;
}
It uses 64^5 * 2 * 4 / 2^30 GB = 8 GB of global memory, and I compile and run it on an NVIDIA Tesla K40c GPU, which has 12 GB of global memory.
Compiler cmd:
nvcc test.cu -gencode arch=compute_30,code=sm_30
Output warning:
warning: overflow in implicit constant conversion.
When I ran the generated executable, an error says:
GPUassert: an illegal memory access was encountered test.cu
Surprisingly, if I instead allocate global memory of the same size (8 GB) dynamically via the cudaMalloc API, there is neither a compile warning nor a runtime error.
I'm wondering whether there is any special limitation on the allocatable size of static global device memory in CUDA.
Thanks!
PS: OS and CUDA: CentOS 6.5 x64, CUDA-7.5.
This would appear to be a limitation of the CUDA runtime API. The root cause is this function (in CUDA 7.5):
__cudaRegisterVar(
void **fatCubinHandle,
char *hostVar,
char *deviceAddress,
const char *deviceName,
int ext,
int size,
int constant,
int global
);
which only accepts a signed int for the size of any statically declared device variable. This limits the maximum size to 2^31 - 1 (2147483647) bytes. The warning you see is because the CUDA front end is emitting boilerplate code containing calls to __cudaRegisterVar like this:
__cudaRegisterManagedVariable(__T26, __shadow_var(A,::A), 0, 4294967296, 0, 0);
__cudaRegisterManagedVariable(__T26, __shadow_var(B,::B), 0, 4294967296, 0, 0);
It is the 4294967296 which is the source of the problem. The size will overflow the signed integer and cause the API call to blow up. So it seems you are limited to 2 GB per static variable for the moment. I would recommend raising this as a bug with NVIDIA if it is a serious problem for your application.
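The arithmetic behind that number can be checked in a few lines of plain C. This is a small sketch assuming a 32-bit int and a 4-byte float, as on the platform in the question; the helper names are made up for illustration:

```c
#include <limits.h>
#include <stdint.h>

/* Size in bytes of float A[NY][NX][MX*MY*MZ] with every dimension 64. */
static uint64_t static_array_bytes(void)
{
    uint64_t elems = 1;
    for (int d = 0; d < 5; d++)
        elems *= 64;            /* 64^5 elements */
    return elems * sizeof(float);
}

/* Nonzero if the size cannot be represented in a signed int parameter. */
static int exceeds_int(uint64_t bytes)
{
    return bytes > (uint64_t)INT_MAX;
}
```

Here static_array_bytes() evaluates to 4294967296 (4 GB per array), and exceeds_int() reports that this does not fit in an int, which is consistent with the overflow warning emitted at compile time.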

Why does 64-bit aligned XOR not run faster than 32-bit aligned XOR?

I want to compare the speed of XORing two blocks of memory. I ran an experiment on a 64-bit machine (4 MB cache), XORing two regions of memory using 32-bit aligned and 64-bit aligned accesses respectively. I expected the 64-bit aligned XOR to be much faster than the 32-bit aligned XOR, but the two run at about the same speed.
code:
void region_xor_w32( unsigned char *r1, /* Region 1 */
unsigned char *r2, /* Region 2 */
unsigned char *r3, /* Sum region */
int nbytes) /* Number of bytes in region */
{
uint32_t *l1;
uint32_t *l2;
uint32_t *l3;
uint32_t *ltop;
unsigned char *ctop;
ctop = r1 + nbytes;
ltop = (uint32_t *) ctop;
l1 = (uint32_t *) r1;
l2 = (uint32_t *) r2;
l3 = (uint32_t *) r3;
while (l1 < ltop) {
*l3 = ((*l1) ^ (*l2));
l1++;
l2++;
l3++;
}
}
void region_xor_w64( unsigned char *r1, /* Region 1 */
unsigned char *r2, /* Region 2 */
unsigned char *r3, /* Sum region */
int nbytes) /* Number of bytes in region */
{
uint64_t *l1;
uint64_t *l2;
uint64_t *l3;
uint64_t *ltop;
unsigned char *ctop;
ctop = r1 + nbytes;
ltop = (uint64_t *) ctop;
l1 = (uint64_t *) r1;
l2 = (uint64_t *) r2;
l3 = (uint64_t *) r3;
while (l1 < ltop) {
*l3 = ((*l1) ^ (*l2));
l1++;
l2++;
l3++;
}
}
Result:
I believe this is due to data starvation. That is, your CPU is so fast and your code is so efficient that your memory subsystem simply can't keep up. Even XORing in a 32-bit aligned way takes less time than fetching data from memory. That's why both 32-bit and 64-bit aligned approaches have the same speed — that of your memory subsystem.
To demonstrate, I've reproduced your experiment, but this time with four different ways of XORing:
non-aligned (i.e. byte-aligned) XORing;
32-bit aligned XORing;
64-bit aligned XORing;
128-bit aligned XORing.
The last one was implemented via _mm_xor_si128(), which is a part of the SSE2 instruction set.
Switching to 128-bit processing gave no performance boost. Switching to per-byte processing, on the other hand, slowed everything down, because in that case the memory subsystem still outpaces the CPU.
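The answer does not include its code. As a rough sketch of what a 128-bit variant could look like (not the answerer's actual implementation), the function below uses the SSE2 intrinsics _mm_loadu_si128, _mm_xor_si128 and _mm_storeu_si128 when the compiler defines __SSE2__, and falls back to a byte loop otherwise. The name region_xor_w128 and the size_t length parameter are assumptions made for illustration:

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* XOR nbytes of r1 and r2 into r3. Uses 128-bit SSE2 operations where
   available, and a plain byte loop for the tail or on other targets. */
static void region_xor_w128(const unsigned char *r1, const unsigned char *r2,
                            unsigned char *r3, size_t nbytes)
{
    size_t i = 0;
#if defined(__SSE2__)
    for (; i + 16 <= nbytes; i += 16) {
        /* Unaligned loads/stores, so the buffers need no special alignment. */
        __m128i a = _mm_loadu_si128((const __m128i *)(r1 + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(r2 + i));
        _mm_storeu_si128((__m128i *)(r3 + i), _mm_xor_si128(a, b));
    }
#endif
    for (; i < nbytes; i++)
        r3[i] = r1[i] ^ r2[i];
}
```

If the memory subsystem is the bottleneck, as the answer argues, a routine like this would be expected to run at roughly the same speed as the 32-bit and 64-bit versions on large regions.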

Get a 32-bit number in iOS

How can I get a 32-bit number in Objective-C from a byte array that is passed in, similar to this Java code:
ByteBuffer bb = ByteBuffer.wrap(truncation);
return bb.getInt();
where truncation is the byte array.
It returns a 32-bit number. Is this possible in Objective-C?
If the number is encoded in little-endian within the buffer, then use:
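int32_t getInt32LE(const uint8_t *buffer)
{
    // Accumulate in an unsigned type: left-shifting a negative int32_t is undefined.
    uint32_t value = 0;
    unsigned length = 4;
    while (length > 0)
    {
        value <<= 8;
        value |= buffer[--length];  // last byte is the most significant
    }
    return (int32_t)value;
}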
If the number is encoded in big-endian within the buffer, then use:
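int32_t getInt32BE(const uint8_t *buffer)
{
    // Accumulate in an unsigned type: left-shifting a negative int32_t is undefined.
    uint32_t value = 0;
    for (unsigned i = 0; i < 4; i++)
    {
        value <<= 8;
        value |= *buffer++;  // first byte is the most significant
    }
    return (int32_t)value;
}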
UPDATE If you are using data created on the same host then endianness is not an issue, in which case you can use a union as a bridge between the buffer and integers, which avoids some unpleasant casting:
union
{
uint8_t b[sizeof(int32_t)];
int32_t i;
} u;
memcpy(u.b, buffer, sizeof(u.b));
// value is u.i
Depending on the endianness:
uint32_t n = (uint32_t)b0 << 24 | (uint32_t)b1 << 16 | (uint32_t)b2 << 8 | b3;
or
uint32_t n = (uint32_t)b3 << 24 | (uint32_t)b2 << 16 | (uint32_t)b1 << 8 | b0;
If you just want to read 4 bytes and assign that value to an integer:
int32_t number;
memcpy(&number, truncation, sizeof(number));
About endianness
From your question it seemed clear (to me) that the bytes were already ordered correctly. However, if you have to reorder these bytes, use ntohl() after memcpy():
number=ntohl(number);
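Putting the two steps together, a minimal sketch of a helper that reads a big-endian (network order) 32-bit value might look like this. The function name read_u32_network is hypothetical, and it assumes a POSIX system where ntohl() is declared in arpa/inet.h:

```c
#include <arpa/inet.h>  /* ntohl (POSIX) */
#include <stdint.h>
#include <string.h>

/* Read a 32-bit big-endian (network order) value from a byte buffer. */
static uint32_t read_u32_network(const uint8_t *buffer)
{
    uint32_t number;
    memcpy(&number, buffer, sizeof number);  /* raw bytes, host layout */
    return ntohl(number);                    /* network -> host order */
}
```

Java's ByteBuffer reads in big-endian order by default, so for a buffer holding 0x12 0x34 0x56 0x78 both bb.getInt() and this helper produce 0x12345678, regardless of the host's own byte order.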

How is a pointer stored in memory?

I am working on an OS project and I am just wondering how a pointer is stored in memory? I understand that a pointer is 4 bytes, so how is the pointer spread amongst the 4 bytes?
My issue is that I am trying to store a pointer in a 4-byte slot of memory. Let's say the pointer is 0x7FFFFFFF. What is stored in each of the 4 bytes?
A pointer is stored the same way as any other multi-byte value. The 4 bytes are stored according to the endianness of the system. Let's say the 4 bytes start at address 0x1000:
Big endian (most significant byte first):
Address Byte
0x1000 0x7F
0x1001 0xFF
0x1002 0xFF
0x1003 0xFF
Little endian (least significant byte first):
Address Byte
0x1000 0xFF
0x1001 0xFF
0x1002 0xFF
0x1003 0x7F
Btw, a 4-byte address implies a 32-bit system. A 64-bit system has 8-byte addresses.
EDIT:
To reference each individual part of the pointer, you need to use a pointer. :)
Say you have:
int i = 0;
int *pi = &i; // say pi == 0x7fffffff
int **ppi = &pi; // from the above example, ppi == 0x1000
Simple pointer arithmetic would get you the pointer to each byte.
You should read up on Endianness. Normally you wouldn't work with just one byte of a pointer at a time, though, so the order of the bytes isn't relevant.
Update: Here's an example of making a fake pointer with a known value and then printing out each of its bytes:
#include <stdio.h>
int main(int argc, char* argv[]) {
    int *p = (int *) 0x12345678;              // fake pointer with a known value
    unsigned char *cp = (unsigned char *) &p; // view the pointer's own bytes
    size_t i;
    for (i = 0; i < sizeof(p); i++)
        printf("%zu: %.2x\n", i, cp[i]);
    return 0;
}
