Resizing mmap() shared memory with mremap() fails

I am working on a project in which I need to share a large array of std::pair between processes created with fork(). The size of the array is not known at the start of the program; later on, the parent process communicates with the child processes to determine it, and then resizes the shared memory region by calling mremap(). So far I have had no luck with mmap() and mremap().
It seems to me that mremap() fails when the new size becomes larger than 4096 bytes (the system's page size?).
I have created a small example to help me understand the problem better. This code creates a shared memory region using mmap(), then grows it with mremap() in several steps. There is only one process in this example.
I am also rounding the requested shared memory size up to a multiple of 4096 bytes.
#include <iostream>
#include <sys/types.h> /* pid_t */
#include <sys/wait.h>
#include <unistd.h>    /* _exit, fork */
#include <stdlib.h>    /* exit */
#include <errno.h>     /* errno */
#include <time.h>      /* time */
#include <sys/mman.h>
#include <vector>
#include <map>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#define page_size 4096
#define def_size 32

void write_shmem (long *shm_ptr, long npt)
{
    for (long pt = 0; pt < npt; ++pt)
    {
        shm_ptr[pt] = 77;
        //std::cout << "shm_ptr[" << pt << "] = " << shm_ptr[pt] << "\n";
    }
}

long read_shmem (long *shm_ptr, long npt)
{
    long bad_indx = -1;
    for (long pt = 0; pt < npt; ++pt)
    {
        if (shm_ptr[pt] != 77)
        {
            bad_indx = pt;
            break;
        }
    }
    return bad_indx;
}

int main ()
{
    void *add;
    long *shm_arcs;
    long padding = 0;
    long init_shmem_size = sizeof (long) * def_size;
    long resid = init_shmem_size % page_size;
    if (resid != 0) padding = page_size - (init_shmem_size % page_size);
    //std::cout << " requested " << init_shmem_size << " bytes, translated to " << padding+init_shmem_size << "\n";
    init_shmem_size += padding;

    add = mmap(NULL, init_shmem_size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, -1, 0); // initial size is set to npt.
    if (add == MAP_FAILED)
    {
        std::cout << "Mapped failed at parent \n";
        exit(-1);
    }
    shm_arcs = (long *) add;

    write_shmem (shm_arcs, def_size);
    long bd = read_shmem (shm_arcs, def_size);
    if (bd != -1)
    {
        std::cout << "Shmem test failed at " << bd << "\n";
        exit(-1);
    }
    std::cout << "Parent: shared memory is created and verified at address " << shm_arcs << " data size: " << init_shmem_size << ".\n ";

    long *shm_new = shm_arcs;
    long nsize = def_size;
    for (int i = 0; i < 1000; i++)
    {
        nsize += def_size;
        long bsize = nsize * sizeof(long);
        if ((bsize % page_size) != 0) padding = page_size - (bsize % page_size);
        else padding = 0;
        //std::cout << nsize << " elements requested " << bsize << " bytes, translated to " << padding+bsize << "\n";
        bsize += padding;

        add = mremap(shm_new, init_shmem_size, bsize, MREMAP_MAYMOVE);
        if (add != shm_new)
        {
            std::cout << "Old mem unmapped ---------------";
            munmap(shm_new, init_shmem_size);
        }
        if (add == MAP_FAILED)
        {
            std::cout << "Mapped failed at parent at " << i << "\n";
            exit(-1);
        }
        shm_new = (long *) add;

        write_shmem (shm_new, nsize);
        long bd = read_shmem (shm_arcs, nsize);
        if (bd != -1)
        {
            std::cout << "Shmem test failed at " << bd << "\n";
            exit(-1);
        }
        std::cout << i << " - Parent: shared memory is created and verified at address " << shm_new << " with " << nsize << " elements, data size : " << bsize << ".\n ";
        init_shmem_size = bsize;
    }
    return 0;
}
This is the output of the code:
Parent: shared memory is created and verified at address 0x7f00ae91b000 data size: 4096.
0 - Parent: shared memory is created and verified at address 0x7f00ae91b000 with 64 elements, data size : 4096.
1 - Parent: shared memory is created and verified at address 0x7f00ae91b000 with 96 elements, data size : 4096.
2 - Parent: shared memory is created and verified at address 0x7f00ae91b000 with 128 elements, data size : 4096.
3 - Parent: shared memory is created and verified at address 0x7f00ae91b000 with 160 elements, data size : 4096.
4 - Parent: shared memory is created and verified at address 0x7f00ae91b000 with 192 elements, data size : 4096.
5 - Parent: shared memory is created and verified at address 0x7f00ae91b000 with 224 elements, data size : 4096.
6 - Parent: shared memory is created and verified at address 0x7f00ae91b000 with 256 elements, data size : 4096.
7 - Parent: shared memory is created and verified at address 0x7f00ae91b000 with 288 elements, data size : 4096.
8 - Parent: shared memory is created and verified at address 0x7f00ae91b000 with 320 elements, data size : 4096.
9 - Parent: shared memory is created and verified at address 0x7f00ae91b000 with 352 elements, data size : 4096.
10 - Parent: shared memory is created and verified at address 0x7f00ae91b000 with 384 elements, data size : 4096.
11 - Parent: shared memory is created and verified at address 0x7f00ae91b000 with 416 elements, data size : 4096.
12 - Parent: shared memory is created and verified at address 0x7f00ae91b000 with 448 elements, data size : 4096.
13 - Parent: shared memory is created and verified at address 0x7f00ae91b000 with 480 elements, data size : 4096.
14 - Parent: shared memory is created and verified at address 0x7f00ae91b000 with 512 elements, data size : 4096.
Bus error
I am running it on Linux.
What am I doing wrong here? If this is not the best method to do this, I would appreciate it if you could point me to a better way.
Thanks

You should not attempt to unmap the old address after mremap(), as it has already been unmapped and may overlap with the new address range. Your "Old mem unmapped" message will not be printed because cout is buffered and the process dies before the buffer is flushed.
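For illustration, a rough sketch of what the resize step can look like with that munmap() call removed (old_size and new_size stand in for init_shmem_size and bsize from the code above):

void *grown = mremap(shm_new, old_size, new_size, MREMAP_MAYMOVE);
if (grown == MAP_FAILED)
{
    std::cout << "mremap failed: " << strerror(errno) << "\n";
    exit(-1);
}
// If the mapping moved, mremap() has already released the old range;
// an explicit munmap(shm_new, old_size) here can tear down pages that
// now belong to the new (possibly overlapping) mapping.
shm_new  = (long *) grown;
old_size = new_size;

Checking for MAP_FAILED first also avoids treating the error value as a new mapping address.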

Related

472 bytes in 1 blocks are still reachable in loss record 1 of 1 error while opening a file in C

When opening a file called output, for some reason the tests report the error "472 bytes in 1 blocks are still reachable in loss record 1 of 1". The program works okay; it is only a memory error that needs to be fixed. The goal of the program is to take a .raw file, scan it 512 bytes at a time for a .jpg file header at the start of each block, and recover 50 .jpg pictures (they follow one after the other). The error happens on the line marked with the comment. What do I do to make this mistake go away? I've tried everything I can think of; nothing seems to help.
int x = 0;
char buffer[8];
uint8_t bytes[512];
FILE *output;

// Going through the input file 512 bytes at a time
while (fread(bytes, 1, 512, input))
{
    // Checking for .jpg header
    if (bytes[0] == 0xff && bytes[1] == 0xd8 && bytes[2] == 0xff && ((bytes[3] & 0xf0) == 0xe0))
    {
        if (x > 0)
        {
            // Closing the output file if I used it before
            fclose(output);
        }
        // Creating file name string
        sprintf(buffer, "%03i.jpg", x);
        x++;
        // Creating and opening a file to append to
        output = fopen(buffer, "a"); // 472 bytes in 1 blocks are still reachable in loss record 1 of 1
        // Writing the header to the output file
        fwrite(bytes, 1, 512, output);
    }
    else if (x > 0)
    {
        // Appending bytes until a new header is found
        fwrite(bytes, 1, 512, output);
    }
}

What exactly are the transaction metrics reported by NVPROF?

I'm trying to figure out what exactly each of the metrics reported by "nvprof" is. More specifically, I can't figure out which transactions count as System Memory and Device Memory reads and writes. I wrote a very basic piece of code just to help figure this out.
#define TYPE float
#define BDIMX 16
#define BDIMY 16
#include <cuda.h>
#include <cstdio>
#include <iostream>

__global__ void kernel(TYPE *g_output, TYPE *g_input, const int dimx, const int dimy)
{
    __shared__ float s_data[BDIMY][BDIMX];
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    int in_idx = iy * dimx + ix; // index for reading input
    int tx = threadIdx.x;        // thread's x-index into corresponding shared memory tile
    int ty = threadIdx.y;        // thread's y-index into corresponding shared memory tile
    s_data[ty][tx] = g_input[in_idx];
    __syncthreads();
    g_output[in_idx] = s_data[ty][tx] * 1.3;
}

int main(){
    int size_x = 16, size_y = 16;
    dim3 numTB;
    numTB.x = (int)ceil((double)(size_x)/(double)BDIMX);
    numTB.y = (int)ceil((double)(size_y)/(double)BDIMY);
    dim3 tbSize;
    tbSize.x = BDIMX;
    tbSize.y = BDIMY;
    float *a, *a_out;
    float *a_d = (float *) malloc(size_x * size_y * sizeof(TYPE));
    cudaMalloc((void**)&a, size_x * size_y * sizeof(TYPE));
    cudaMalloc((void**)&a_out, size_x * size_y * sizeof(TYPE));
    for(int index = 0; index < size_x * size_y; index++){
        a_d[index] = index;
    }
    cudaMemcpy(a, a_d, size_x * size_y * sizeof(TYPE), cudaMemcpyHostToDevice);
    kernel <<<numTB, tbSize>>>(a_out, a, size_x, size_y);
    cudaDeviceSynchronize();
    return 0;
}
Then I run nvprof --metrics all on the executable to see all the metrics. This is the part I'm interested in:
    Metric Name                   Metric Description                     Min   Max   Avg
Device "Tesla K40c (0)"
  Kernel: kernel(float*, float*, int, int)
    local_load_transactions       Local Load Transactions                  0     0     0
    local_store_transactions      Local Store Transactions                 0     0     0
    shared_load_transactions      Shared Load Transactions                 8     8     8
    shared_store_transactions     Shared Store Transactions                8     8     8
    gld_transactions              Global Load Transactions                 8     8     8
    gst_transactions              Global Store Transactions                8     8     8
    sysmem_read_transactions      System Memory Read Transactions          0     0     0
    sysmem_write_transactions     System Memory Write Transactions         4     4     4
    tex_cache_transactions        Texture Cache Transactions               0     0     0
    dram_read_transactions        Device Memory Read Transactions          0     0     0
    dram_write_transactions       Device Memory Write Transactions        40    40    40
    l2_read_transactions          L2 Read Transactions                    70    70    70
    l2_write_transactions         L2 Write Transactions                   46    46    46
I understand the shared and global accesses. The global accesses are coalesced and since there are 8 warps, there are 8 transactions.
But I can't figure out the system memory and device memory write transaction numbers.
It helps if you have a model of the GPU memory hierarchy with both logical and physical spaces, such as the one here.
Referring to the "overview tab" diagram:
gld_transactions refer to transactions issued from the warp targeting the global logical space. On the diagram, this would be the line from the "Kernel" box on the left to the "global" box to the right of it, and the logical data movement direction would be from right to left.
gst_transactions refer to the same line as above, but logically from left to right. Note that these logical global transactions could hit in a cache and not go anywhere after that. From the metrics standpoint, those transaction types only refer to the indicated line on the diagram.
dram_write_transactions refer to the line on the diagram which connects device memory on the right with the L2 cache, and the logical data flow is from left to right on this line. Since the L2 cacheline is 32 bytes (whereas the L1 cacheline, and the size of a global transaction, is 128 bytes), the device memory transactions are also 32 bytes, not 128 bytes. So a global write transaction that passes through L1 (it is a write-through cache, if enabled) and L2 will generate 4 dram_write transactions. This should explain 32 out of the 40 transactions.
System memory transactions target zero-copy host memory. You don't seem to be using any, so I can't explain those.
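For context, a rough sketch of how zero-copy (mapped, pinned) host memory is typically set up with the runtime API; a kernel that reads or writes through devPtr would then be expected to generate sysmem transactions (hostPtr/devPtr are illustrative names, not part of the question's code):

// Allocate mapped pinned host memory and get a device pointer to it,
// so kernel accesses go over the bus to system memory (zero-copy).
float *hostPtr = nullptr, *devPtr = nullptr;
size_t bytes = 256 * sizeof(float);
cudaSetDeviceFlags(cudaDeviceMapHost);                        // before the context is created
cudaHostAlloc((void**)&hostPtr, bytes, cudaHostAllocMapped);
cudaHostGetDevicePointer((void**)&devPtr, hostPtr, 0);
// kernel<<<grid, block>>>(devPtr, ...);  // accesses through devPtr are zero-copy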
Note that in some cases, for some metrics, on some GPUs, the profiler may have some "inaccuracy" when launching very small numbers of threadblocks. For example, some metrics are sampled on a per-SM basis and scaled. (device memory transactions are not in this category, however). If you have disparate work being done on each SM (perhaps due to a very small number of threadblocks launched) then the scaling can be misleading/less accurate. Generally if you launch a larger number of threadblocks, these usually become insignificant.
This answer may also be of interest.

Amount of local memory per CUDA thread

I read in the NVIDIA documentation (http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#features-and-technical-specifications, table #12) that the amount of local memory per thread is 512 KB for my GPU (GTX 580, compute capability 2.0).
I tried unsuccessfully to check this limit on Linux with CUDA 6.5.
Here is the code I used (its only purpose is to test the local memory limit; it doesn't do any useful computation):
#include <iostream>
#include <stdio.h>

#define MEMSIZE 65000 // 65000 -> out of memory, 60000 -> ok

inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=false)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort)
            exit(code);
    }
}

inline void gpuCheckKernelExecutionError(const char *file, int line)
{
    gpuAssert(cudaPeekAtLastError(), file, line);
    gpuAssert(cudaDeviceSynchronize(), file, line);
}

__global__ void kernel_test_private(char *output)
{
    int c = blockIdx.x*blockDim.x + threadIdx.x; // absolute col
    int r = blockIdx.y*blockDim.y + threadIdx.y; // absolute row
    char tmp[MEMSIZE];
    for (int i = 0; i < MEMSIZE; i++)
        tmp[i] = 4*r + c; // dummy computation in local mem
    for (int i = 0; i < MEMSIZE; i++)
        output[i] = tmp[i];
}

int main(void)
{
    printf("MEMSIZE=%d bytes.\n", MEMSIZE);

    // allocate memory
    char output[MEMSIZE];
    char *gpuOutput;
    cudaMalloc((void**) &gpuOutput, MEMSIZE);

    // run kernel
    dim3 dimBlock(1, 1);
    dim3 dimGrid(1, 1);
    kernel_test_private<<<dimGrid, dimBlock>>>(gpuOutput);
    gpuCheckKernelExecutionError(__FILE__, __LINE__);

    // transfer data from GPU memory to CPU memory
    cudaMemcpy(output, gpuOutput, MEMSIZE, cudaMemcpyDeviceToHost);

    // release resources
    cudaFree(gpuOutput);
    cudaDeviceReset();
    return 0;
}
And the compilation command line:
nvcc -o cuda_test_private_memory -Xptxas -v -O2 --compiler-options -Wall cuda_test_private_memory.cu
The compilation is ok, and reports:
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z19kernel_test_privatePc' for 'sm_20'
ptxas info : Function properties for _Z19kernel_test_privatePc
65000 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 21 registers, 40 bytes cmem[0]
I got an "out of memory" error at runtime on the GTX 580 when I reached 65000 bytes per thread. Here is the exact output of the program in the console:
MEMSIZE=65000 bytes.
GPUassert: out of memory cuda_test_private_memory.cu 48
I also did a test with a GTX 770 GPU (on Linux with CUDA 6.5). It ran without error for MEMSIZE=200000, but the "out of memory error" occurred at runtime for MEMSIZE=250000.
How can this behavior be explained? Am I doing something wrong?
It seems you are running into not a local memory limitation but a stack size limitation:
ptxas info : Function properties for _Z19kernel_test_privatePc
65000 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
The variable that you had intended to be local is on the (GPU thread) stack, in this case.
Based on the information provided by @njuffa here, the available stack size limit is the lesser of:
The maximum local memory size (512KB for cc2.x and higher)
GPU memory/(#of SMs)/(max threads per SM)
Clearly, the first limit is not the issue. I assume you have a "standard" GTX580, which has 1.5GB memory and 16 SMs. A cc2.x device has a maximum of 1536 resident threads per multiprocessor. This means we have 1536MB/16/1536 = 1MB/16 = 65536 bytes stack. There is some overhead and other memory usage that subtracts from the total available memory, so the stack size limit is some amount below 65536, somewhere between 60000 and 65000 in your case, apparently.
I suspect a similar calculation on your GTX770 would yield a similar result, i.e. a maximum stack size between 200000 and 250000.
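If the per-thread stack is the limiter, it can also be queried and raised at run time through the CUDA runtime API; a minimal sketch (the 128 KB request is only an example, and it is still bounded by the limits described above):

size_t stackBytes = 0;
cudaDeviceGetLimit(&stackBytes, cudaLimitStackSize);
printf("default stack size per thread: %zu bytes\n", stackBytes);

// Request a larger per-thread stack (subject to the memory/SM/thread bound above).
cudaError_t err = cudaDeviceSetLimit(cudaLimitStackSize, 128 * 1024);
if (err != cudaSuccess)
    printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));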

implications of using _mm_shuffle_ps on integer vector

The SSE intrinsics include _mm_shuffle_ps(xmm1, xmm2, immx), which allows one to pick 2 elements from xmm1 concatenated with 2 elements from xmm2. However, this is for floats (implied by the _ps, packed single). If you cast your packed integers (__m128i), then you can use _mm_shuffle_ps as well:
#include <iostream>
#include <immintrin.h>
#include <sstream>

using namespace std;

template <typename T>
std::string __m128i_toString(const __m128i var) {
    std::stringstream sstr;
    const T* values = (const T*) &var;
    if (sizeof(T) == 1) {
        for (unsigned int i = 0; i < sizeof(__m128i); i++) {
            sstr << (int) values[i] << " ";
        }
    } else {
        for (unsigned int i = 0; i < sizeof(__m128i) / sizeof(T); i++) {
            sstr << values[i] << " ";
        }
    }
    return sstr.str();
}

int main(){
    cout << "Starting SSE test" << endl;
    cout << "integer shuffle" << endl;
    int A[] = {1, -2147483648, 3, 5};
    int B[] = {4, 6, 7, 8};
    __m128i pC;
    __m128i* pA = (__m128i*) A;
    __m128i* pB = (__m128i*) B;
    *pA = (__m128i)_mm_shuffle_ps((__m128)*pA, (__m128)*pB, _MM_SHUFFLE(3, 2, 1, 0));
    pC = _mm_add_epi32(*pA, *pB);
    cout << "A[0] = " << A[0] << endl;
    cout << "A[1] = " << A[1] << endl;
    cout << "A[2] = " << A[2] << endl;
    cout << "A[3] = " << A[3] << endl;
    cout << "B[0] = " << B[0] << endl;
    cout << "B[1] = " << B[1] << endl;
    cout << "B[2] = " << B[2] << endl;
    cout << "B[3] = " << B[3] << endl;
    cout << "pA = " << __m128i_toString<int>(*pA) << endl;
    cout << "pC = " << __m128i_toString<int>(pC) << endl;
}
Snippet of the relevant corresponding assembly (Mac OS X, MacPorts GCC 4.8, -march=native, on an Ivy Bridge CPU):
vshufps $228, 16(%rsp), %xmm1, %xmm0
vpaddd 16(%rsp), %xmm0, %xmm2
vmovdqa %xmm0, 32(%rsp)
vmovaps %xmm0, (%rsp)
vmovdqa %xmm2, 16(%rsp)
call __ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
....
Thus it seemingly works fine on integers, which I expected since the registers are type-agnostic; however, there must be a reason why the docs say this instruction is only for floats. Does anyone know of any downsides or implications I have missed?
There is no equivalent to _mm_shuffle_ps for integers. To achieve the same effect in this case you can do
SSE2
*pA = _mm_shuffle_epi32(_mm_unpacklo_epi32(*pA, _mm_shuffle_epi32(*pB, 0xe)),0xd8);
SSE4.1
*pA = _mm_blend_epi16(*pA, *pB, 0xf0);
or change to the floating point domain like this
*pA = _mm_castps_si128(
          _mm_shuffle_ps(_mm_castsi128_ps(*pA),
                         _mm_castsi128_ps(*pB), _MM_SHUFFLE(3, 2, 1, 0)));
But changing domains may incur bypass latency delays on some CPUs. Keep in mind that, according to Agner:
The bypass delay is important in long dependency chains where latency is a bottleneck, but
not where it is throughput rather than latency that matters.
You have to test your code and see which method above is more efficient.
Fortunately, on most Intel/AMD CPUs, there is usually no penalty for using shufps between most integer-vector instructions. Agner says:
For example, I found no delay when mixing PADDD and SHUFPS [on Sandybridge].
Nehalem does have a 2-cycle bypass-delay latency to/from SHUFPS, but even then a single SHUFPS is often still faster than multiple other instructions. Extra instructions have latency too, as well as costing throughput.
The reverse (integer shuffles between FP math instructions) is not as safe:
In Agner Fog's microarchitecture guide, on page 112 in Example 8.3a, he shows that using PSHUFD (_mm_shuffle_epi32) instead of SHUFPS (_mm_shuffle_ps) when in the floating point domain causes a bypass delay of four clock cycles. In Example 8.3b he uses SHUFPS to remove the delay (which works in his example).
On Nehalem there are actually five domains. Nehalem seems to be the most affected (the bypass delays did not exist before Nehalem). On Sandy Bridge the delays are less significant, and this is even more true on Haswell. In fact, on Haswell Agner said he found no delays with SHUFPS or PSHUFD (see page 140).
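For what it's worth, here is a self-contained sketch (assuming SSE4.1 is available, e.g. compiled with -msse4.1) that exercises both the blend and the cast-through-float selections shown above; both variants should print 1 2 7 8:

#include <immintrin.h>
#include <cstdio>

int main()
{
    alignas(16) int A[4] = {1, 2, 3, 5};
    alignas(16) int B[4] = {4, 6, 7, 8};
    __m128i va = _mm_load_si128((const __m128i*) A);
    __m128i vb = _mm_load_si128((const __m128i*) B);

    // SSE4.1: low two ints from A, high two ints from B, staying in the integer domain.
    __m128i blended = _mm_blend_epi16(va, vb, 0xf0);

    // Same selection via SHUFPS, using explicit cast intrinsics.
    __m128i shuffled = _mm_castps_si128(
        _mm_shuffle_ps(_mm_castsi128_ps(va), _mm_castsi128_ps(vb),
                       _MM_SHUFFLE(3, 2, 1, 0)));

    alignas(16) int out1[4], out2[4];
    _mm_store_si128((__m128i*) out1, blended);
    _mm_store_si128((__m128i*) out2, shuffled);
    printf("blend : %d %d %d %d\n", out1[0], out1[1], out1[2], out1[3]);
    printf("shufps: %d %d %d %d\n", out2[0], out2[1], out2[2], out2[3]);
    return 0;
}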

Memory bandwidth measurement with memset,memcpy

I am trying to understand the performance of memory operations with memcpy/memset. I measure the time needed for a loop containing memset and memcpy; see the attached code (it is C++11, but in plain C the picture is the same). It is understandable that memset is faster than memcpy, but that is more or less the only thing I understand... The biggest question is:
Why is there such a strong dependence on the number of loop iterations?
The application is single-threaded, and the CPU is an AMD FX(tm)-4100 Quad-Core Processor.
And here are some numbers:
memset: iters=1 0.0625 GB in 0.1269 s : 0.4927 GB per second
memcpy: iters=1 0.0625 GB in 0.1287 s : 0.4857 GB per second
memset: iters=4 0.25 GB in 0.151 s : 1.656 GB per second
memcpy: iters=4 0.25 GB in 0.1678 s : 1.49 GB per second
memset: iters=16 1 GB in 0.2406 s : 4.156 GB per second
memcpy: iters=16 1 GB in 0.3184 s : 3.14 GB per second
memset: iters=128 8 GB in 1.074 s : 7.447 GB per second
memcpy: iters=128 8 GB in 1.737 s : 4.606 GB per second
The code:
/*
  -- Compilation and run:
  g++ -O3 -std=c++11 -o mem-speed mem-speed.cc && ./mem-speed
  -- Output example:
*/
#include <cstdio>
#include <chrono>
#include <memory>
#include <string.h>

using namespace std;

const uint64_t _KB=1024, _MB=_KB*_KB, _GB=_KB*_KB*_KB;

std::pair<double,char> measure_memory_speed(uint64_t buf_size, int n_iters)
{
    // without returning something from the buffers, the compiler will optimize memset() and memcpy() calls
    char retval = 0;
    unique_ptr<char[]> buf1(new char[buf_size]), buf2(new char[buf_size]);

    auto time_start = chrono::high_resolution_clock::now();
    for( int i=0; i<n_iters; i++ )
    {
        memset(buf1.get(), 123, buf_size);
        retval += buf1[0];
    }
    auto t1 = chrono::duration_cast<std::chrono::nanoseconds>(chrono::high_resolution_clock::now() - time_start);

    time_start = chrono::high_resolution_clock::now();
    for( int i=0; i<n_iters; i++ )
    {
        memcpy(buf2.get(), buf1.get(), buf_size);
        retval += buf2[0];
    }
    auto t2 = chrono::duration_cast<std::chrono::nanoseconds>(chrono::high_resolution_clock::now() - time_start);

    printf("memset: iters=%d %g GB in %8.4g s : %8.4g GB per second\n",
           n_iters, n_iters*buf_size/double(_GB), (double)t1.count()/1e9, n_iters*buf_size/double(_GB) / (t1.count()/1e9) );
    printf("memcpy: iters=%d %g GB in %8.4g s : %8.4g GB per second\n",
           n_iters, n_iters*buf_size/double(_GB), (double)t2.count()/1e9, n_iters*buf_size/double(_GB) / (t2.count()/1e9) );
    printf("\n");

    double avr = n_iters*buf_size/_GB * (1e9/t1.count()+1e9/t2.count()) / 2;
    retval += buf1[0]+buf2[0];
    return std::pair<double,char>(avr, retval);
}

int main(int argc, const char **argv)
{
    uint64_t n = 64;
    if( argc==2 )
        n = atoi(argv[1]);
    for( int i=0; i<=10; i++ )
        measure_memory_speed(n*_MB, 1<<i);
    return 0;
}
Surely this is just down to caches warming up: the code runs faster after the first iteration because the instruction cache has been loaded, and the data cache speeds up access for the memset/memcpy on further iterations. The cache memory is inside the processor, so it doesn't have to fetch data from, or write it back to, the slower external memory as often, and therefore runs faster.
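If that is the explanation, one way to check it is to do an untimed warm-up pass over the buffer before starting the clock, so the timed loop only sees the steady state. A rough sketch of that idea (buffer size and iteration count are arbitrary, and the sink variable mirrors the retval trick above to keep the compiler from eliminating the stores):

#include <cstdio>
#include <chrono>
#include <memory>
#include <string.h>

int main()
{
    const size_t buf_size = 64ull * 1024 * 1024;   // arbitrary example size
    const int n_iters = 16;
    std::unique_ptr<char[]> buf(new char[buf_size]);
    char sink = 0;

    memset(buf.get(), 0, buf_size);                // warm-up pass, not timed

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < n_iters; i++)
    {
        memset(buf.get(), 123, buf_size);
        sink += buf[0];
    }
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                  std::chrono::high_resolution_clock::now() - t0).count();

    double gb = n_iters * (double)buf_size / (1024.0 * 1024.0 * 1024.0);
    printf("memset: %g GB in %g s : %g GB/s (sink=%d)\n", gb, ns / 1e9, gb / (ns / 1e9), sink);
    return 0;
}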
