Having brushed up on memory alignment, I found this example:
#include <stdio.h>
#include <stdlib.h>

typedef struct {
    int x;
    char y;
    long long z;
} Example;

typedef struct {
    int a;     // 4 bytes, padded to 8
    Example e; // 16 bytes
} Example2;

int main() {
    // sizeof yields a size_t, so %zu is the matching format specifier
    printf("%zu\n", sizeof(Example));
    printf("%zu\n", sizeof(Example2));
    return 0;
}
which prints 16, 24 to the screen, although I would have expected 16, 32. Example2's second member, e, has a sizeof of 16 bytes and is thus the largest member of Example2, which defines the alignment of this struct, does it not? So with 16-byte alignment I expected 8 bytes of padding to be inserted after the a member of Example2, but none seems to be inserted. Why is that?


Getting wrong output while executing sizeof(char)

My question: sizeof(char) is 1 byte, so why am I getting the wrong output when I run the code below? Kindly help me. Thank you.
typedef struct {
    int x;
    int y;
    char a;
} Point2D;

int main() {
    Point2D *myPoint = malloc(sizeof(Point2D));
    NSLog(@"sizeof(Point2D): %zu", sizeof(Point2D));
    free(myPoint);
    return 0;
}
Output: sizeof(Point2D): 12 // But it should return 9 [int + int + char = 4 + 4 + 1]
Note: With a struct that contains only char members, I get the expected output:
typedef struct {
    char a;
    char b;
} Point2D;

int main() {
    Point2D *myPoint = malloc(sizeof(Point2D));
    NSLog(@"sizeof(Point2D): %zu", sizeof(Point2D));
    free(myPoint);
    return 0;
}
Output: sizeof(Point2D): 2
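You are not getting "wrong" output. When an (Objective-)C compiler lays out a struct, it is allowed to insert padding between fields and after the last field so that each field starts at a suitable memory alignment for its type. Here, 3 bytes are added after a so that the size of the struct is a multiple of the alignment of int.
If you need the size of a struct to be exactly the sum of its field sizes, you can use __attribute__((__packed__)), a GCC/Clang extension. For example: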
typedef struct {
    int x;
    int y;
    char a;
} __attribute__((__packed__)) Point2D;
This struct has a size of 9. However, access to the fields may be slower because the CPU has to deal with values that are not optimally aligned.

cudaMemcpy invalid argument: in simple vector example

The following example:
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cuda.h>
#include <math.h>
#define N 100
#define t_num 256
int main(){
    int vector_one_h[t_num], vector_one_g[t_num];
    cudaError_t err = cudaMalloc((void**)&vector_one_g, t_num * sizeof(int));
    printf("Cuda malloc vector swap one: %s \n", cudaGetErrorString(err));
    printf("Device Vector: %p \n:" , vector_one_g);
    for(int m = 0; m < t_num; m++){
        vector_one_h[m] = rand() % N;
    }
    err = cudaMemcpy(vector_one_g, vector_one_h, t_num * sizeof(int), cudaMemcpyHostToDevice);
    printf("Cuda mem copy vector swap one: %s \n", cudaGetErrorString(err));
    return 0;
}
Will return:
Cuda malloc vector swap one: no error
Device Vector: 0x7ffcf028eea0
:Cuda mem copy vector swap one: invalid argument
So why is cudaMemcpy receiving an invalid argument?
From the documentation for cudaMemcpy(), I thought the problem might be that I need to pass the second argument as an address, &vector_one_h, but putting that in the code returns exactly the same error.
And also, while there are many posts about cudaMemcpy invalid arguments, I believe this is not a duplicate as most of the other questions have very complicated examples while this is a very simple and minimal example.
Try changing the first line to:
int vector_one_h[t_num], *vector_one_g;
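cudaMalloc needs the address of a pointer variable in which to store the device address. With vector_one_g declared as an array, the value later passed to cudaMemcpy is the address of that host array (note the stack-like 0x7ffc... address that was printed), not a device pointer, hence the invalid argument.
BTW, prefixing an array name with & does not change the address that is passed. In most expressions an array name decays to a pointer to its first element, and &array yields the same address with a different type (pointer to the whole array).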

Global device memory size limit when using statically allocated memory in CUDA

I thought the maximum size of global memory should be limited only by the GPU device, no matter whether it is allocated statically using __device__ __managed__ or dynamically using cudaMalloc.
But I found that with the __device__ __managed__ approach, the maximum array size I can declare is much smaller than the GPU device limit.
The minimal working example is as follows:
#include <stdio.h>
#include <cuda_runtime.h>
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}
#define MX 64
#define MY 64
#define MZ 64
#define NX 64
#define NY 64
#define M (MX * MY * MZ)
__device__ __managed__ float A[NY][NX][M];
__device__ __managed__ float B[NY][NX][M];
__global__ void swapAB()
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    for(int j = 0; j < NY; j++)
        for(int i = 0; i < NX; i++)
            A[j][i][tid] = B[j][i][tid];
}

int main()
{
    gpuErrchk( cudaPeekAtLastError() );
    gpuErrchk( cudaDeviceSynchronize() );
    return 0;
}
It uses 64^5 * 2 * 4 bytes = 2^33 bytes = 8 GB of global memory, and I compile and run it on an NVIDIA Tesla K40c GPU, which has 12 GB of global memory.
Compiler cmd:
nvcc test.cu -gencode arch=compute_30,code=sm_30
Output warning:
warning: overflow in implicit constant conversion.
When I ran the generated executable, an error says:
GPUassert: an illegal memory access was encountered test.cu
Surprisingly, if I instead allocate global memory of the same size (8 GB) dynamically via the cudaMalloc API, there is no compile warning and no runtime error.
I'm wondering whether there is any special limit on the allocatable size of static global device memory in CUDA.
PS: OS and CUDA: CentOS 6.5 x64, CUDA-7.5.
This would appear to be a limitation of the CUDA runtime API. The root cause is this function (in CUDA 7.5):
void __cudaRegisterVar(
    void **fatCubinHandle,
    char *hostVar,
    char *deviceAddress,
    const char *deviceName,
    int ext,
    int size,
    int constant,
    int global
);
which only accepts a signed int for the size of any statically declared device variable. This would limit the maximum size to 2^31 - 1 (2147483647) bytes. The warning you see is because the CUDA front end is emitting boilerplate code containing calls to __cudaRegisterVar like this:
__cudaRegisterManagedVariable(__T26, __shadow_var(A,::A), 0, 4294967296, 0, 0);
__cudaRegisterManagedVariable(__T26, __shadow_var(B,::B), 0, 4294967296, 0, 0);
It is the 4294967296 (2^32, the size in bytes of each array) which is the source of the problem. The size will overflow the signed integer and cause the API call to blow up. So it seems you are limited to 2 GB per static variable for the moment. I would recommend raising this as a bug with NVIDIA if it is a serious problem for your application.

How do I return calculated Histogram results from OpenCV back to LabVIEW?

I have no problems creating my DLL and sending images back & forth between LabVIEW and OpenCV DLL's.
Here is the code I am working with.
#include <opencv2/imgproc/imgproc.hpp>
#include <opencv2/highgui/highgui.hpp>
using namespace std;
using namespace cv;
// extern C
extern "C" {
_declspec (dllexport) int MyHistCalc(unsigned char *imageIN, int rows, int cols, double threshold1, double threshold2, int kernel_size, unsigned char *imageOUT, unsigned char *HistCalcImageOUT);
_declspec (dllexport) int MyHistCalc(unsigned char *imageIN,
int rows,
int cols,
double threshold1,
double threshold2,
int kernel_size,
unsigned char *imageOUT,
unsigned char *HistCalcImageOUT) ...
I am unsure whether my problem is that I am receiving image_histcalcoutput incorrectly in LabVIEW or that the DLL is not returning the result correctly.
Please find screenshot of my VI attached.

malloc using 4 bytes for char

I am writing code to examine how memory is managed between the stack and the heap, for a piece of coursework.
#include <stdio.h>
#include <stdlib.h>

#define NUM_OF_CHARS 100

// function prototype
void f(void);

int main()
{
    f();
    return 0;
}

void f(void)
{
    char *ptr1;
    ptr1 = (char *) malloc(NUM_OF_CHARS * sizeof(int));
    printf("Address array 1: %016lx\n", (long)ptr1);
    char *ptr2;
    ptr2 = (char *) malloc(NUM_OF_CHARS * sizeof(int));
    printf("Address array 2: %016lx\n", (long)ptr2);
}
When I run this code I get the following:
Address array 1: 000000000209e010
Address array 2: 000000000209e1b0
My expectation was to see a difference of 100 bytes between the addresses, but the difference is 416 bytes. When I changed NUM_OF_CHARS to other values (200, 300, ...), the result was always NUM_OF_CHARS*4 + 16, so it seems that malloc is allocating 4 bytes for each char rather than 1 byte, plus 16 bytes of some overhead.
Can anyone explain what is happening here?
Memory allocation is platform and library dependent. The only thing malloc guarantees is that the block it returns is at least as large as what you asked for and suitably aligned; it promises nothing beyond that.
There is no guarantee that the addresses of successive allocations will be contiguous, because of memory alignment and the allocator's own bookkeeping.
Also, your code sizes the allocation with sizeof(int) rather than sizeof(char). This is most likely the reason why you see a difference of NUM_OF_CHARS*4, while the remaining 16 bytes can most likely be attributed to allocator overhead and alignment padding.
