Low copy performance in OpenCL (enqueueWriteBuffer & enqueueReadBuffer)

I was getting very low performance while copying memory between the GPU and the CPU (both ways) with enqueueWriteBuffer and enqueueReadBuffer, so I wrote a test to make sure the problem really was in those two functions. I still get very low performance.
My tests perform several copies, including a 1 GB copy, and the best result is still around 3 GB/s. In contrast, the CUDA sample "bandwidthTest.exe" achieves around 12 GB/s with a copy size of 30 MB. I am running all the tests on a laptop with an NVIDIA GTX 1050 and CUDA 10.0.
Any ideas why the performance might be so low?
This is the code I am using for testing. I am building it with Qt, so there are some dependencies (QTime, QDebug):
#include <QCoreApplication>
#include "CL/cl.hpp"
#include <vector>
#include <QDebug>
#include <QTime>
int main(int argc, char *argv[])
{
    QCoreApplication a(argc, argv);

    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    std::vector<std::string> platformsNames;
    for (auto &platform : platforms) {
        platformsNames.push_back(platform.getInfo<CL_PLATFORM_NAME>());
    }

    std::vector<cl::Device> all_devices;
    platforms[0].getDevices(CL_DEVICE_TYPE_ALL, &all_devices);
    auto device = all_devices[0];
    qDebug() << "Platform used: " << platformsNames[0].c_str() << ". Device used: " << device.getInfo<CL_DEVICE_NAME>().c_str();
    qDebug() << "Max work group size: " << device.getInfo<CL_DEVICE_MAX_WORK_GROUP_SIZE>();
    qDebug() << "Max items size: " << device.getInfo<CL_DEVICE_MAX_WORK_ITEM_SIZES>();

    cl::Context context(device);
    cl::CommandQueue queue(context);

    // 1KB, 1MB, 10MB, 100MB, 1GB
    size_t sizes[] = {1024, 1024 * 1024, 10 * 1024 * 1024, 100 * 1024 * 1024, 1024 * 1024 * 1024};

    qDebug() << "Write Buffers";
    for (auto size : sizes) {
        cl::Buffer buffer(context, CL_MEM_READ_WRITE, size);
        std::vector<unsigned char> t(size);
        QTime timerFFT;
        timerFFT.start();
        auto iterations = 100.0f;
        for (auto i = 0; i < iterations; i++) {
            int err = queue.enqueueWriteBuffer(buffer, CL_TRUE, 0, size, t.data());
        }
        auto elapsed = timerFFT.elapsed() / 1000.0f;
        qDebug() << "GB/s: " << size / (1024.0f * 1024.0f * 1024.0f) / elapsed * iterations;
    }

    qDebug() << "Read Buffers";
    for (auto size : sizes) {
        cl::Buffer buffer(context, CL_MEM_READ_WRITE, size);
        std::vector<unsigned char> t(size);
        QTime timerFFT;
        timerFFT.start();
        auto iterations = 100.0f;
        for (auto i = 0; i < iterations; i++) {
            int err = queue.enqueueReadBuffer(buffer, CL_TRUE, 0, size, t.data());
        }
        auto elapsed = timerFFT.elapsed() / 1000.0f;
        qDebug() << "GB/s: " << size / (1024.0f * 1024.0f * 1024.0f) / elapsed * iterations;
    }

    return a.exec();
}
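One factor worth ruling out (not verified here): enqueueWriteBuffer/enqueueReadBuffer from ordinary pageable host memory usually involves an internal staging copy, while CUDA's bandwidthTest reaches its peak figures with pinned memory. A minimal sketch of the usual OpenCL counterpart, assuming the same context, queue, size and host vector t as in the test above:

// Hedged sketch, not the original test: let the runtime back the buffer
// with pinned host memory and access it through map/unmap.
cl::Buffer pinned(context, CL_MEM_READ_WRITE | CL_MEM_ALLOC_HOST_PTR, size);
cl_int err = CL_SUCCESS;
void *ptr = queue.enqueueMapBuffer(pinned, CL_TRUE, CL_MAP_WRITE, 0, size,
                                   nullptr, nullptr, &err);
std::memcpy(ptr, t.data(), size);          // host-side copy into the mapped region
queue.enqueueUnmapMemObject(pinned, ptr);  // hand the region back to the runtime
queue.finish();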

Related

How to use clEnqueueWriteBufferRect in OpenCL

I want to use clEnqueueReadBufferRect in OpenCL. To do so, I need to define the region, which is one of its arguments. But there is an inconsistency between the OpenCL references.
The online reference mentions:
The (width, height, depth) in bytes of the 2D or 3D rectangle being read or written. For a 2D rectangle copy, the depth value given by region [2] should be 1.
but the reference book, page 77, says:
region defines the (width in bytes, height in rows, depth in slices) of the 2D or 3D rectangle being read or written. For a 2D rectangle copy, the depth value given by region [2] should be 1. The values in region cannot be 0
Unfortunately, neither description worked for me: I have to provide the region as (width in columns, height in rows, depth in slices). When I defined the width in bytes rather than columns, I got the error CL_INVALID_VALUE. Now which one is correct?
#define WGX 16
#define WGY 16

#include "misc.hpp"

int main(int argc, char** argv)
{
    int i;
    int n = 1000;
    int filterWidth = 3;
    int filterRadius = (int) filterWidth / 2;
    int padding = filterRadius * 2;
    double h = 1.0 / n;

    int width_x[2];
    int height_x[2];
    int deviceWidth[2];
    int deviceHeight[2];
    int deviceDataSize[2];

    for (i = 0; i < 2; ++i)
    {
        set_domain_length(n, n, height_x[i], width_x[i], i);
    }

    float* x = new float [height_x[0] * width_x[0]];
    init_unknown(x, height_x[0], width_x[0], 0);
    set_bndryCond(x, width_x[0], h);

    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    assert(platforms.size() > 0);
    cl::Platform myPlatform = platforms[0];

    std::vector<cl::Device> devices;
    myPlatform.getDevices(CL_DEVICE_TYPE_GPU, &devices);
    assert(devices.size() > 0);
    cl::Device myDevice = devices[0];
    cl_display_info(myPlatform, myDevice);

    cl::Context context(myDevice);

    std::ifstream kernelFile("iterative_scheme.cl");
    std::string src(std::istreambuf_iterator<char>(kernelFile), (std::istreambuf_iterator<char>()));
    cl::Program::Sources sources(1, std::make_pair(src.c_str(), src.length() + 1));
    cl::Program program(context, sources);

    cl::CommandQueue queue(context, myDevice);

    deviceWidth[0] = roundUp(width_x[0], WGX);
    deviceHeight[0] = height_x[0];
    deviceDataSize[0] = deviceWidth[0] * deviceHeight[0] * sizeof(float);

    cl::Buffer buffer_x;
    try
    {
        buffer_x = cl::Buffer(context, CL_MEM_READ_WRITE, deviceDataSize[0]);
    } catch (cl::Error& error)
    {
        std::cout << " ---> Problem in creating buffer(s) " << std::endl;
        std::cout << " ---> " << getErrorString(error) << std::endl;
        exit(0);
    }

    cl::size_t<3> buffer_origin;
    buffer_origin[0] = 0;
    buffer_origin[1] = 0;
    buffer_origin[2] = 0;
    cl::size_t<3> host_origin;
    host_origin[0] = 0;
    host_origin[1] = 0;
    host_origin[2] = 0;
    cl::size_t<3> region;
    region[0] = (size_t)(deviceWidth[0] * sizeof(float));
    region[1] = (size_t)(height_x[0]);
    region[2] = 1;

    std::cout << "===> Start writing data to device" << std::endl;
    try
    {
        queue.enqueueWriteBufferRect(buffer_x, CL_TRUE, buffer_origin, host_origin, region,
                                     deviceWidth[0] * sizeof(float), 0, width_x[0] * sizeof(float), 0, x);
    } catch (cl::Error& error)
    {
        std::cout << " ---> Problem in writing data from Host to Device: " << std::endl;
        std::cout << " ---> " << getErrorString(error) << std::endl;
        exit(0);
    }

    // Build the program
    std::cout << "===> Start building program" << std::endl;
    try
    {
        program.build("-cl-std=CL2.0");
        std::cout << " ---> Build Successfully " << std::endl;
    } catch (cl::Error& error)
    {
        std::cout << " ---> Problem in building program " << std::endl;
        std::cout << " ---> " << getErrorString(error) << std::endl;
        std::cout << " ---> " << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(myDevice) << std::endl;
        exit(0);
    }

    std::cout << "===> Start reading data from device" << std::endl;
    // read result y and residual from the device
    buffer_origin[0] = (size_t)(filterRadius * sizeof(float));
    buffer_origin[1] = (size_t)filterRadius;
    buffer_origin[2] = 0;
    host_origin[0] = (size_t)(filterRadius * sizeof(float));
    host_origin[1] = (size_t)filterRadius;
    host_origin[2] = 0;
    // region of x
    region[0] = (size_t)((width_x[0] - padding) * sizeof(float));
    region[1] = (size_t)(height_x[0] - padding);
    region[2] = 1;
    try
    {
        queue.enqueueReadBufferRect(buffer_x, CL_TRUE, buffer_origin, host_origin,
                                    region, deviceWidth[0] * sizeof(float), 0, deviceWidth[0] * sizeof(float), 0, x);
    } catch (cl::Error& error)
    {
        std::cout << " ---> Problem reading buffer in device: " << std::endl;
        std::cout << " ---> " << getErrorString(error) << std::endl;
        exit(0);
    }

    delete[] (x);
    return 0;
}
The online reference link you provided says:
region
The (width in bytes, height in rows, depth in slices) of the 2D or 3D rectangle being read or written. For a 2D rectangle copy, the depth value given by region[2] should be 1. The values in region cannot be 0.
This is consistent with what you quoted later as the "reference book". That's because your first link points to OpenCL 2.0 while the second points to 1.2.
The inconsistency you mention exists between the online manual of 1.2 and the PDF of 1.2, but the online manual of 2.0 is consistent with the PDF. So I assume it was a bug in the 1.2 online manual which was fixed in 2.0.
otherwise, when I defined them as byte not rows/columns
What's a "column", and how is it different from bytes?
The "elements" of a buffer rect copy are always bytes. If you're reading/writing a 1D rect from a buffer, it simply transfers region[0] bytes. The reason the API has "rows" and "slices" is that with 2D/3D regions you can have padding between rows of data, but you can't have padding between elements in a 1D region.
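To make those semantics concrete, a hedged sketch (W, H, devW, buf and hostPtr are illustrative names, not from the question): copying a W x H float sub-rectangle out of a buffer whose rows are padded to devW floats:

cl::size_t<3> origin;                  // same (0,0,0) origin on both sides here
origin[0] = origin[1] = origin[2] = 0;
cl::size_t<3> region;
region[0] = W * sizeof(float);         // width in BYTES
region[1] = H;                         // height in rows
region[2] = 1;                         // depth of 1 => 2D copy
// Row pitches are in bytes too, and must be 0 or >= region[0]:
size_t buffer_row_pitch = devW * sizeof(float);  // padded device rows
size_t host_row_pitch   = W * sizeof(float);     // tightly packed host rows
queue.enqueueReadBufferRect(buf, CL_TRUE, origin, origin, region,
                            buffer_row_pitch, 0, host_row_pitch, 0, hostPtr);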
I found out the reason for the problem. According to the online reference:
CL_INVALID_VALUE if host_row_pitch is not 0 and is less than region[0].
so the enqueueWriteBufferRect call should change as follows:
queue.enqueueWriteBufferRect(buffer_x, CL_TRUE, buffer_origin, host_origin, region,
                             deviceWidth[0] * sizeof(float), 0, deviceWidth[0] * sizeof(float), 0, x);
which means host_row_pitch = deviceWidth[0] * sizeof(float) instead of host_row_pitch = width_x[0] * sizeof(float).

How to increase BatchSize with Tensorflow's C++ API?

I took the code from https://gist.github.com/kyrs/9adf86366e9e4f04addb (which takes an OpenCV cv::Mat image as input and converts it to a tensor) and I use it to label images with the model inception_v3_2016_08_28_frozen.pb from the TensorFlow tutorial (https://www.tensorflow.org/tutorials/image_recognition#usage_with_the_c_api). Everything worked fine with a batch size of 1. However, when I increase the batch size to 2 (or greater), the size of finalOutput (which is of type std::vector<tensorflow::Tensor>) is zero.
Here's the code to reproduce the error:
// Only for Visual Studio
#define COMPILER_MSVC
#define NOMINMAX

#include <string>
#include <iostream>
#include <fstream>
#include <opencv2/opencv.hpp>
#include <opencv2/imgproc/imgproc.hpp>
#include "tensorflow/core/public/session.h"
#include "tensorflow/core/platform/env.h"
#include "tensorflow/core/framework/tensor.h"

int batchSize = 2;
int height = 299;
int width = 299;
int depth = 3;
int mean = 0;
int stdev = 255;

// Set image paths
cv::String pathFilenameImg1 = "D:/IMGS/grace_hopper.jpg";
cv::String pathFilenameImg2 = "D:/IMGS/lenna.jpg";
// Set model paths
std::string graphFile = "D:/Tensorflow/models/inception_v3_2016_08_28_frozen.pb";
std::string labelfile = "D:/Tensorflow/models/imagenet_slim_labels.txt";
std::string InputName = "input";
std::string OutputName = "InceptionV3/Predictions/Reshape_1";

void read_prepare_image(cv::String pathImg, cv::Mat &imgPrepared) {
    // Read color image:
    cv::Mat imgBGR = cv::imread(pathImg);
    // Resize the image to fit the model's expected size:
    cv::Size s(height, width);
    cv::Mat imgResized;
    cv::resize(imgBGR, imgResized, s, 0, 0, cv::INTER_CUBIC);
    // Convert the image to float and normalize the data:
    imgResized.convertTo(imgPrepared, CV_32FC1);
    imgPrepared = imgPrepared - mean;
    imgPrepared = imgPrepared / stdev;
}

int main()
{
    // Read and prepare images using OpenCV:
    cv::Mat img1, img2;
    read_prepare_image(pathFilenameImg1, img1);
    read_prepare_image(pathFilenameImg2, img2);

    // Create a tensor for storing the data
    tensorflow::Tensor input_tensor(tensorflow::DT_FLOAT, tensorflow::TensorShape({ batchSize, height, width, depth }));
    auto input_tensor_mapped = input_tensor.tensor<float, 4>();

    // Copy image data into the tensor:
    for (int b = 0; b < batchSize; ++b) {
        const float *source_data;
        if (b == 0)
            source_data = (float*)img1.data;
        else
            source_data = (float*)img2.data;
        for (int y = 0; y < height; ++y) {
            const float* source_row = source_data + (y * width * depth);
            for (int x = 0; x < width; ++x) {
                const float* source_pixel = source_row + (x * depth);
                const float* source_B = source_pixel + 0;
                const float* source_G = source_pixel + 1;
                const float* source_R = source_pixel + 2;
                input_tensor_mapped(b, y, x, 0) = *source_R;
                input_tensor_mapped(b, y, x, 1) = *source_G;
                input_tensor_mapped(b, y, x, 2) = *source_B;
            }
        }
    }

    // Load the graph:
    tensorflow::GraphDef graph_def;
    ReadBinaryProto(tensorflow::Env::Default(), graphFile, &graph_def);

    // Create a session with the graph
    std::unique_ptr<tensorflow::Session> session_inception(tensorflow::NewSession(tensorflow::SessionOptions()));
    session_inception->Create(graph_def);

    // Run the loaded graph
    std::vector<tensorflow::Tensor> finalOutput;
    session_inception->Run({ { InputName, input_tensor } }, { OutputName }, {}, &finalOutput);

    // Get top 5 classes:
    std::cerr << "final output size = " << finalOutput.size() << std::endl;
    tensorflow::Tensor output = std::move(finalOutput.at(0));
    auto scores = output.flat<float>();
    std::cerr << "scores size=" << scores.size() << std::endl;

    std::ifstream label(labelfile);
    std::string line;
    std::vector<std::pair<float, std::string>> sorted;
    for (unsigned int i = 0; i <= 1000; ++i) {
        std::getline(label, line);
        sorted.emplace_back(scores(i), line);
    }
    std::sort(sorted.begin(), sorted.end());
    std::reverse(sorted.begin(), sorted.end());
    std::cout << "size of the sorted file is " << sorted.size() << std::endl;
    for (unsigned int i = 0; i < 5; ++i)
        std::cout << "The output of the current graph has category " << sorted[i].second << " with probability " << sorted[i].first << std::endl;
}
Am I missing anything? Any ideas?
Thanks in advance!
I had the same problem. When I changed to the model used in https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/benchmark (a different version of Inception), bigger batch sizes work correctly.
Note that you need to change the input size from 299,299,3 to 224,224,3, and the input and output layer names to input:0 and output:0.
Probably the graph in the protobuf file had a fixed batch size of 1, and I was only changing the shape of the input, not the graph. The graph has to accept a variable batch size by setting the shape to (None, height, width, channels). This is done when the graph is frozen, and since the graph we have is already frozen, there is no way to change the batch size at this point.
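A side note, not from the original answer: the question's code discards the tensorflow::Status returned by Session::Run (and by Create and ReadBinaryProto), which is why the failure only shows up indirectly as an empty finalOutput. A sketch of checking it, which should print the underlying shape error directly:

// Hedged sketch: surface the real error instead of a silently empty output.
tensorflow::Status status = session_inception->Run(
    { { InputName, input_tensor } }, { OutputName }, {}, &finalOutput);
if (!status.ok()) {
    // With a graph frozen at batch size 1, this is where a shape mismatch
    // on the "input" placeholder would be reported.
    std::cerr << "Run failed: " << status.ToString() << std::endl;
    return 1;
}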

Why do my in-kernel dynamic memory allocations fail for larger grid sizes?

I need to dynamically allocate some arrays inside the kernel function. Here is the code:
__global__
void kernel3(int N)
{
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index < N)
    {
        float *cost = new float[100];
        for (int i = 0; i < 100; i++)
            cost[i] = 1;
    }
}
and
int main()
{
    cudaDeviceSynchronize();
    cudaThreadSynchronize();
    size_t mem_tot_01 = 0;
    size_t mem_free_01 = 0;
    cudaMemGetInfo(&mem_free_01, &mem_tot_01);
    cout << "Free memory " << mem_free_01 << endl;
    cout << "Total memory " << mem_tot_01 << endl;
    system("pause");

    int blocksize = 256;
    int aaa = 16000;
    int numBlocks = (aaa + blocksize - 1) / blocksize;
    kernel3<<<numBlocks, blocksize>>>(aaa);
    cudaDeviceSynchronize();
    cudaError_t err1 = cudaGetLastError();
    if (err1 != cudaSuccess)
    {
        printf("Error: %s\n", cudaGetErrorString(err1));
        system("pause");
    }

    cudaMemGetInfo(&mem_free_01, &mem_tot_01);
    cout << "Free memory " << mem_free_01 << endl;
    cout << "Total memory " << mem_tot_01 << endl;
    system("pause");
}
In the first round of cudaMemGetInfo:
Free memory: 3600826368
Total memory: 4294967297
and I got an error:
Error: unspecified launch failure
I tried changing "int aaa" to a smaller value; then there is no error, but the reported free memory does not match what I allocated.
What's wrong with it? The memory should be enough: 16000 x 100 x 32 = 5.12e7 < 3600826368
The device memory allocated by the new operator comes from a runtime heap of fixed size.
If your code requires a large amount of runtime heap memory, you will probably need to increase the size of the heap before running any kernels. NVIDIA provides the cudaDeviceSetLimit API for this purpose, used with the cudaLimitMallocHeapSize flag to set the size of the heap.
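A minimal sketch of that call for the code above (the 128 MB figure is an arbitrary, generous example; the kernel itself needs roughly 16000 threads x 100 floats x 4 bytes, about 6.4 MB, plus per-allocation overhead):

// Must run before the first kernel launch that uses in-kernel new/malloc.
size_t heapSize = 128 * 1024 * 1024;
cudaError_t err = cudaDeviceSetLimit(cudaLimitMallocHeapSize, heapSize);
if (err != cudaSuccess)
    printf("cudaDeviceSetLimit: %s\n", cudaGetErrorString(err));
kernel3<<<numBlocks, blocksize>>>(aaa);  // now launched against the larger heap

Note also that kernel3 never calls delete[] on cost, so repeated launches will eventually exhaust even an enlarged heap.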

CUDA memory access error: cudaErrorIllegalAddress, image processing (stereo vision)

I'm using CUDA for image processing, but I always get 'cudaErrorIllegalAddress: an illegal memory access was encountered'.
What I did is below.
First, I load the converted image (RGB to gray) to the device, using 'cudaMallocPitch' and 'cudaMemcpy2D':
unsigned char *dev_srcleft;
size_t dev_srcleftPitch;
cudaMallocPitch((void**)&dev_srcleft, &dev_srcleftPitch, COLS * sizeof(int), ROWS);
cudaMemcpy2D(dev_srcleft, dev_srcleftPitch, host_srcConvertL.data, host_srcConvertL.step,
             COLS, ROWS, cudaMemcpyHostToDevice);
Then I allocate a 2D array to store the result. The result value is described with 27 bits, so I use 'int', which is 4 bytes = 32 bits: it has ample size, and atomic operations (atomicOr, atomicXor) are needed for performance. My device does not support 64-bit atomic operations.
int *dev_leftTrans;
cudaMallocPitch((void**)&dev_leftTrans, &dev_leftTransPitch, COLS * sizeof(int), ROWS);
cudaMemset2D(dev_leftTrans, dev_leftTransPitch, 0, COLS, ROWS);
Memory allocation and cudaMemcpy2D work fine, and I check that with:
Mat temp_output(ROWS, COLS, 0);
cudaMemcpy2D(temp_output.data, temp_output.step, dev_srcleft, dev_srcleftPitch, COLS, ROWS, cudaMemcpyDeviceToHost);
imshow("temp", temp_output);
Then, the kernel code:
__global__ void TestKernel(unsigned char *src, size_t src_pitch,
                           int *dst, size_t dst_pitch,
                           unsigned int COLS, unsigned int ROWS)
{
    const unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    unsigned char src_val = src[x + y * src_pitch];
    dst[x + y * dst_pitch] = src_val;
}

dim3 dimblock(3, 3);
dim3 dimGrid(ceil((float)COLS / dimblock.x), ceil((float)ROWS / dimblock.y));
TestKernel<<<dimGrid, dimblock, dimblock.x * dimblock.y * sizeof(char)>>>
    (dev_srcleft, dev_srcleftPitch, dev_leftTrans, dev_leftTransPitch, COLS, ROWS);
The parameters COLS and ROWS are the size of the image.
I think the error occurs in TestKernel.
Reading src_val from global memory works fine, but when I try to write to dst, it blows up with cudaErrorIllegalAddress.
I don't know what is wrong, and I've been struggling with it for 4 days. Please help.
Below is my full code:
#include <cuda.h>
#include <cuda_runtime.h>
#include <cuda_runtime_api.h>
#include <device_functions.h>
#include <cuda_device_runtime_api.h>
#include <device_launch_parameters.h>
#include <math.h>
#include <iostream>
#include <opencv2\opencv.hpp>
#include <string>

#define HANDLE_ERROR(err) (HandleError(err, __FILE__, __LINE__))

static void HandleError(cudaError_t err, const char *file, int line)
{
    if (err != cudaSuccess)
    {
        printf("%s in %s at line %d\n", cudaGetErrorString(err), file, line);
        exit(EXIT_FAILURE);
    }
}

using namespace std;
using namespace cv;

string imagePath = "Ted";
string imagePathL = imagePath + "imL.png";
string imagePathR = imagePath + "imR.png";

__global__ void TestKernel(unsigned char *src, size_t src_pitch,
                           int *dst, size_t dst_pitch,
                           unsigned int COLS, unsigned int ROWS)
{
    const unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if ((COLS < x) && (ROWS < y)) return;
    unsigned char src_val = src[x + y * src_pitch];
    dst[x + y * dst_pitch] = src_val;
}

int main(void)
{
    //Print_DeviceProperty();

    // Left image load
    Mat host_srcImgL = imread(imagePathL, CV_LOAD_IMAGE_UNCHANGED);
    if (host_srcImgL.empty()) { cout << "Left Image Load Fail!" << endl; return 1; }
    Mat host_srcConvertL;
    cvtColor(host_srcImgL, host_srcConvertL, CV_BGR2GRAY);

    // Right image load
    Mat host_srcImgR = imread(imagePathR, CV_LOAD_IMAGE_UNCHANGED);
    if (host_srcImgR.empty()) { cout << "Right Image Load Fail!" << endl; return 1; }
    Mat host_srcConvertR;
    cvtColor(host_srcImgR, host_srcConvertR, CV_BGR2GRAY);

    // Create parameters
    unsigned int COLS = host_srcConvertL.cols;
    unsigned int ROWS = host_srcConvertR.rows;
    unsigned int SIZE = COLS * ROWS;

    imshow("Left source image", host_srcConvertL);
    imshow("Right source image", host_srcConvertR);

    unsigned char *dev_srcleft, *dev_srcright, *dev_disp;
    int *dev_leftTrans, *dev_rightTrans;
    size_t dev_srcleftPitch, dev_srcrightPitch, dev_dispPitch, dev_leftTransPitch, dev_rightTransPitch;

    cudaMallocPitch((void**)&dev_srcleft, &dev_srcleftPitch, COLS, ROWS);
    cudaMallocPitch((void**)&dev_srcright, &dev_srcrightPitch, COLS, ROWS);
    cudaMallocPitch((void**)&dev_disp, &dev_dispPitch, COLS, ROWS);
    cudaMallocPitch((void**)&dev_leftTrans, &dev_leftTransPitch, COLS * sizeof(int), ROWS);
    cudaMallocPitch((void**)&dev_rightTrans, &dev_rightTransPitch, COLS * sizeof(int), ROWS);

    cudaMemcpy2D(dev_srcleft, dev_srcleftPitch, host_srcConvertL.data, host_srcConvertL.step,
                 COLS, ROWS, cudaMemcpyHostToDevice);
    cudaMemcpy2D(dev_srcright, dev_srcrightPitch, host_srcConvertR.data, host_srcConvertR.step,
                 COLS, ROWS, cudaMemcpyHostToDevice);
    cudaMemset(dev_disp, 255, dev_dispPitch * ROWS);

    dim3 dimblock(3, 3);
    dim3 dimGrid(ceil((float)COLS / dimblock.x), ceil((float)ROWS / dimblock.y));

    cudaEvent_t start, stop;
    float elapsedtime;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);

    TestKernel<<<dimGrid, dimblock, dimblock.x * dimblock.y * sizeof(char)>>>
        (dev_srcleft, dev_srcleftPitch, dev_leftTrans, dev_leftTransPitch, COLS, ROWS);
    /*TestKernel<<<dimGrid, dimblock, dimblock.x * dimblock.y * sizeof(char)>>>
        (dev_srcright, dev_srcrightPitch, dev_rightTrans, dev_rightTransPitch, COLS, ROWS);*/
    cudaThreadSynchronize();

    cudaError_t res = cudaGetLastError();
    if (res != cudaSuccess)
        printf("%s : %s\n", cudaGetErrorName(res), cudaGetErrorString(res));

    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&elapsedtime, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cout << elapsedtime << "msec" << endl;

    Mat temp_output(ROWS, COLS, 0);
    cudaMemcpy2D((int*)temp_output.data, temp_output.step, dev_leftTrans, dev_leftTransPitch, COLS, ROWS, cudaMemcpyDeviceToHost);
    imshow("temp", temp_output);
    waitKey(0);
    return 0;
}
My environment is VS2013 and CUDA v6.5.
The device's properties are below:
Major revision number: 3
Minor revision number: 0
Name: GeForce GTX 760 (192-bit)
Total global memory: 1610612736
Total shared memory per block: 49152
Total registers per block: 65536
Warp size: 32
Maximum memory pitch: 2147483647
Maximum threads per block: 1024
Maximum dimension 0 of block: 1024
Maximum dimension 1 of block: 1024
Maximum dimension 2 of block: 64
Maximum dimension 0 of grid: 2147483647
Maximum dimension 1 of grid: 65535
Maximum dimension 2 of grid: 65535
Clock rate: 888500
Total constant memory: 65536
Texture alignment: 512
Concurrent copy and execution: Yes
Number of multiprocessors: 6
Kernel execution timeout: Yes
One problem is that your kernel doesn't do any thread-checking.
When you define a grid of blocks like this:
dim3 dimGrid(ceil((float)COLS / dimblock.x), ceil((float)ROWS / dimblock.y));
you will often be launching extra blocks. The reason is that if COLS or ROWS is not evenly divisible by the block dimensions (3 in this case), then you will get extra blocks to cover the remainder in each case.
These extra blocks will have some threads that are doing useful work, and some that will access out-of-bounds. To protect against this, it's customary to put a thread-check in your kernel to prevent out-of-bounds accesses:
const unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
const unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
if ((x < COLS) && (y < ROWS)) {  // add this
    unsigned char src_val = src[x + y * src_pitch];
    dst[x + y * dst_pitch] = src_val;
}  // add this
This means that only the threads that have a valid (in-bounds) x and y will actually do any accesses.
As an aside, (3,3) may not be a particularly good choice of block dimensions for performance reasons. It's usually a good idea to create block dimensions whose product is a multiple of 32, so (32,4) or (16,16) might be examples of better choices.
Another problem in your code is pitch usage for dst array.
Pitch is always in bytes, so first you need to cast dst pointer to char*, calculate row offset and then cast it back to int*:
int* dst_row = (int*)(((char*)dst) + y * dst_pitch);
dst_row[x] = src_val;
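Putting both answers together, a sketch of what the corrected kernel might look like (my combination of the two fixes, not verbatim from either answer):

__global__ void TestKernel(unsigned char *src, size_t src_pitch,
                           int *dst, size_t dst_pitch,
                           unsigned int COLS, unsigned int ROWS)
{
    const unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    const unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if ((x < COLS) && (y < ROWS)) {
        // src elements are 1 byte, so the byte pitch indexes rows directly.
        unsigned char src_val = src[x + y * src_pitch];
        // dst elements are 4 bytes: advance by whole rows in bytes, then by column.
        int *dst_row = (int*)(((char*)dst) + y * dst_pitch);
        dst_row[x] = src_val;
    }
}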

Received throughput issue with saturated traffic

I am using the NS3 (v3.13) Wi-Fi model in an infrastructure topology, configured as follows (simulation file attached):
Single AP (BSS)
Multiple STAs (stations)
Application duration = 10s
Saturated downlink traffic (OnOffApplication with OnTime=2s and OffTime=0) from AP to all STAs
Phy: 802.11a
Default YansWifiChannelHelper and YansWifiPhyHelper
Rate control: ConstantRateWifiManager
Mobility: ConstantPositionMobilityModel (STAs are positioned on a circle of 2 meters radius around the AP)
Although everything else works, at a high bitrate (saturated traffic), when the number of STAs per BSS increases a lot, some STAs don't receive a single byte!
Experiments:
With OnOffApplication DataRate = 60Mb/s, Phy DataMode=OfdmRate54Mbps and 30 STAs: one STA receives packets with a bitrate of 7.2Mb/s and another with 15.3Mb/s (the other 28 STAs don't receive any byte)
With OnOffApplication DataRate = 60Mb/s, DataMode=OfdmRate6Mbps and 30 STAs: one STA receives packets with a bitrate of 1.95Mb/s and another with 4.3Mb/s (the other 28 STAs don't receive any byte)
I think the problem comes from the OnOffApplication configuration; how should I configure it to simulate full-buffer downlink traffic?
Thanks in advance for any suggestion.
#include "ns3/core-module.h"
#include "ns3/point-to-point-module.h"
#include "ns3/network-module.h"
#include "ns3/applications-module.h"
#include "ns3/wifi-module.h"
#include "ns3/mobility-module.h"
#include "ns3/csma-module.h"
#include "ns3/internet-module.h"
#include "ns3/flow-monitor-helper.h"
#include "ns3/flow-monitor-module.h"
#include "ns3/gnuplot.h"
#include "ns3/constant-velocity-helper.h"
#include "ns3/integer.h"
#include "ns3/mpi-interface.h"
#include "math.h"
#include <iostream>

/**
 * PARAMETERS
 */
#define StaNb 30
#define Distance 2
#define Duration 10
#define DataRate 90000000
#define PacketSize 1500
#define couleur(param) printf("\033[%sm", param)

using namespace ns3;

class Experiment {
public:
    Experiment();
    void CreateArchi(void);
    void CreateApplis();

private:
    Ptr<ListPositionAllocator> positionAllocAp;
    Ptr<ListPositionAllocator> positionAllocSta;
    Ptr<GridPositionAllocator> positionAllocStaCouloir;
    Ptr<RandomDiscPositionAllocator> positionAllocStaAmphi;
    std::vector<Ptr<ConstantPositionMobilityModel> > constant;
    NodeContainer m_wifiAP, m_wifiQSta;
    NetDeviceContainer m_APDevice;
    NetDeviceContainer m_QStaDevice;
    YansWifiChannelHelper m_channel;
    Ptr<YansWifiChannel> channel;
    YansWifiPhyHelper m_phyLayer_Sta, m_phyLayer_AP;
    WifiHelper m_wifi;
    QosWifiMacHelper m_macSta, m_macAP;
    InternetStackHelper m_stack;
    Ipv4InterfaceContainer m_StaInterface;
    Ipv4InterfaceContainer m_ApInterface;
    Ssid m_ssid;
};

Experiment::Experiment() {
    positionAllocStaCouloir = CreateObject<GridPositionAllocator>();
    positionAllocAp = CreateObject<ListPositionAllocator>();
    positionAllocSta = CreateObject<ListPositionAllocator>();
    positionAllocStaAmphi = CreateObject<RandomDiscPositionAllocator>();
    m_wifi = WifiHelper::Default();
    constant.resize(StaNb + 1);
    for (int i = 0; i < StaNb + 1; i++) {
        constant[i] = CreateObject<ConstantPositionMobilityModel>();
    }
}

void Experiment::CreateArchi(void) {
    m_wifiQSta.Create(StaNb);
    m_wifiAP.Create(1);
    m_ssid = Ssid("BSS_circle");
    m_channel = YansWifiChannelHelper::Default();
    channel = m_channel.Create();
    m_wifi.SetStandard(WIFI_PHY_STANDARD_80211a);
    m_wifi.SetRemoteStationManager("ns3::ConstantRateWifiManager", "DataMode",
                                   StringValue("OfdmRate6Mbps"));
    m_phyLayer_Sta = YansWifiPhyHelper::Default();
    m_phyLayer_AP = YansWifiPhyHelper::Default();
    m_phyLayer_Sta.SetChannel(channel);
    m_phyLayer_AP.SetChannel(channel);

    positionAllocAp->Add(Vector3D(0.0, 0.0, 0.0));
    MobilityHelper mobilityAp;
    mobilityAp.SetPositionAllocator(positionAllocAp);
    mobilityAp.SetMobilityModel("ns3::ConstantPositionMobilityModel");
    mobilityAp.Install(m_wifiAP.Get(0));
    constant[0]->SetPosition(Vector3D(0.0, 0.0, 0.0));

    float deltaAngle = 2 * M_PI / StaNb;
    float angle = 0.0;
    double x = 0.0;
    double y = 0.0;
    for (int i = 0; i < StaNb; i++) {
        x = cos(angle) * Distance;
        y = sin(angle) * Distance;
        positionAllocSta->Add(Vector3D(x, y, 0.0));
        MobilityHelper mobilitySta;
        mobilitySta.SetPositionAllocator(positionAllocSta);
        mobilitySta.SetMobilityModel("ns3::ConstantPositionMobilityModel");
        mobilitySta.Install(m_wifiQSta.Get(i));
        constant[i]->SetPosition(Vector3D(x, y, 0.0));
        angle += deltaAngle;
    }

    m_macSta = QosWifiMacHelper::Default();
    m_macSta.SetType("ns3::StaWifiMac", "ActiveProbing", BooleanValue(true),
                     "Ssid", SsidValue(m_ssid));
    m_macAP = QosWifiMacHelper::Default();
    m_macAP.SetType("ns3::ApWifiMac", "Ssid", SsidValue(m_ssid),
                    "BeaconInterval", TimeValue(Time(std::string("100ms"))));
    m_APDevice.Add(m_wifi.Install(m_phyLayer_AP, m_macAP, m_wifiAP));
    for (int i = 0; i < StaNb; i++) {
        m_QStaDevice.Add(
            m_wifi.Install(m_phyLayer_Sta, m_macSta, m_wifiQSta.Get(i)));
    }

    m_stack.Install(m_wifiAP);
    m_stack.Install(m_wifiQSta);
    Ipv4AddressHelper address;
    address.SetBase("192.168.1.0", "255.255.255.0");
    m_ApInterface.Add(address.Assign(m_APDevice.Get(0)));
    for (int i = 0; i < StaNb; i++) {
        m_StaInterface.Add(address.Assign(m_QStaDevice.Get(i)));
    }
    Ipv4GlobalRoutingHelper::PopulateRoutingTables();
}

void Experiment::CreateApplis() {
    ApplicationContainer source;
    OnOffHelper onoff("ns3::UdpSocketFactory", Address());
    onoff.SetAttribute("OnTime", RandomVariableValue(ConstantVariable(2)));
    onoff.SetAttribute("OffTime", RandomVariableValue(ConstantVariable(0)));
    onoff.SetAttribute("DataRate", StringValue("500kb/s"));
    for (int i = 0; i < StaNb; i++) {
        AddressValue remoteAddress(
            InetSocketAddress(m_StaInterface.GetAddress(i), 5010));
        onoff.SetAttribute("Remote", remoteAddress);
        source.Add(onoff.Install(m_wifiAP.Get(0)));
        source.Start(Seconds(3.0));
        source.Stop(Seconds(Duration));
    }

    ApplicationContainer sinks;
    PacketSinkHelper packetSinkHelper("ns3::UdpSocketFactory",
                                      Address(InetSocketAddress(Ipv4Address::GetAny(), 5010)));
    for (int i = 0; i < StaNb; i++) {
        sinks.Add(packetSinkHelper.Install(m_wifiQSta.Get(i)));
        sinks.Start(Seconds(3.0));
        sinks.Stop(Seconds(Duration));
    }
}

int main(int argc, char *argv[]) {
    Experiment exp = Experiment();
    Config::SetDefault("ns3::WifiRemoteStationManager::RtsCtsThreshold",
                       StringValue("2346"));
    exp.CreateArchi();
    exp.CreateApplis();

    FlowMonitorHelper flowmon;
    Ptr<FlowMonitor> monitor = flowmon.InstallAll();
    Simulator::Stop(Seconds(Duration));
    Simulator::Run();
    monitor->CheckForLostPackets();

    Ptr<Ipv4FlowClassifier> classifier = DynamicCast<Ipv4FlowClassifier>(
        flowmon.GetClassifier());
    std::map<FlowId, FlowMonitor::FlowStats> stats = monitor->GetFlowStats();
    int c = 0;
    for (std::map<FlowId, FlowMonitor::FlowStats>::const_iterator i =
             stats.begin(); i != stats.end(); ++i) {
        Ipv4FlowClassifier::FiveTuple t = classifier->FindFlow(i->first);
        std::cout << "Flow " << i->first << " (" << t.sourceAddress << " -> "
                  << t.destinationAddress << ")\n";
        std::cout << "  Tx Bytes : " << i->second.txBytes << "\n";
        std::cout << "  Rx Bytes : " << i->second.rxBytes << "\n";
        couleur("33");
        std::cout << "  Bitrate : "
                  << i->second.rxBytes * 8.0
                         / (i->second.timeLastRxPacket.GetSeconds()
                            - i->second.timeFirstRxPacket.GetSeconds())
                         / 1000000 << " Mbps\n\n";
        couleur("0");
        if (i->second.rxBytes > 0)
            c++;
    }
    std::cout << "Number of receiving nodes : " << c << "\n";
    Simulator::Destroy();
}
I think the medium is too busy.
You need to tune the OnOff data rate down, e.g. to 1 Mbps. In practice, full-buffer 720p video needs no more than 1 Mbps.
You may also check traces using pcap, ASCII tracing, or NetAnim to see whether packets are being dropped, never sent, or whether there is a bug in your code.
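As an illustration of the tracing suggestion, a hedged sketch using the helpers already present in the question's code (EnablePcap is inherited from ns-3's trace helpers; call this before Simulator::Run, e.g. at the end of CreateArchi where these members are accessible):

// One .pcap file per device; inspect in Wireshark to see drops and retries.
m_phyLayer_AP.EnablePcap("ap-trace", m_APDevice);
m_phyLayer_Sta.EnablePcap("sta-trace", m_QStaDevice);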
