How to normalise a vector with Thrust?

I am currently learning Thrust. I have a question: how do I normalise a vector with Thrust?
I have code that works, but I want to know whether this is the optimal method.
template <typename T>
struct square
{
    __host__ __device__
    T operator()(T x) const
    {
        return x * x;
    }
};
thrust::device_vector<float> d_x(2);
thrust::device_vector<float> d_y(2);
thrust::device_vector<float> d_z(2);
d_x[0] = 3;
d_x[1] = 4;
square<float> unary_op;
thrust::plus<float> binary_op;
float init = 0;
// compute norm
float norm = std::sqrt( thrust::transform_reduce(d_x.begin(), d_x.end(), unary_op, init, binary_op) );
thrust::fill(d_y.begin(), d_y.end(), 1/norm);
thrust::transform(d_x.begin(), d_x.end(), d_y.begin(), d_z.begin(), thrust::multiplies<float>());

This should be more efficient because it does not need to use storage or bandwidth for d_y or d_z:
#include <thrust/device_vector.h>
#include <thrust/inner_product.h>
#include <thrust/transform.h>
#include <thrust/functional.h>
#include <cmath>
int main()
{
    thrust::device_vector<float> d_x(2);
    d_x[0] = 3;
    d_x[1] = 4;

    // norm = sqrt(x . x); inner_product needs the second range and an initial value
    float norm = std::sqrt(thrust::inner_product(d_x.begin(), d_x.end(), d_x.begin(), 0.0f));

    using namespace thrust::placeholders;
    thrust::transform(d_x.begin(), d_x.end(), d_x.begin(), _1 / norm);

    return 0;
}
You'll want to make your problem size a few orders of magnitude larger, of course.
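As a quick check of the result (a sketch using the two-element example above; it needs <thrust/host_vector.h> and <iostream>, and goes just before the return): with d_x = {3, 4} the norm is 5, so the normalised vector should hold {0.6, 0.8}.

// Copy the normalised vector back to the host and print it;
// the expected output for the {3, 4} example is 0.6 and 0.8.
thrust::host_vector<float> h_x = d_x;
for (size_t i = 0; i < h_x.size(); ++i)
    std::cout << h_x[i] << std::endl;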

Related

How to use RealSense's spatial_filter on an OpenCV Mat?

I want to apply RealSense library's depth filtering (rs2::spatial_filter) on an OpenCV Mat, but it seems like the filter is not being applied. The original depth image and the supposedly filtered depth image look exactly the same.
To load raw depth data into rs2::frame, I used a modified version of @Groch88's answer. One of the changes I made was changing the depth format from RS2_FORMAT_Z16 to RS2_FORMAT_DISTANCE (to be able to load a float depth map) and not loading the RGB part of the frame. The whole source is below.
Why do the original and the filtered images look exactly the same? Am I missing something obvious?
main.cpp:
#include <iostream>
#include <opencv2/opencv.hpp>
#include <librealsense2/rs.hpp>
#include "rsImageConverter.h"

int main()
{
    cv::Mat rawDepthImg = load_raw_depth("/path/to/depth/image"); // loads float depth image

    rsImageConverter ic{rawDepthImg.cols, rawDepthImg.rows, sizeof(float)};
    if ( !ic.convertFrame(rawDepthImg.data, rawDepthImg.data) )
    {
        fprintf(stderr, "Could not load depth.\n");
        exit(1);
    }
    rs2::frame rsDepthFrame = ic.getDepth();

    // Filter
    // https://dev.intelrealsense.com/docs/post-processing-filters
    rs2::spatial_filter spat_filter;
    spat_filter.set_option(RS2_OPTION_FILTER_MAGNITUDE, 2.0f);
    spat_filter.set_option(RS2_OPTION_FILTER_SMOOTH_ALPHA, 0.5f);
    spat_filter.set_option(RS2_OPTION_FILTER_SMOOTH_DELTA, 20.0f);

    // Apply filtering
    rs2::frame rsFilteredDepthFrame;
    rsFilteredDepthFrame = spat_filter.process(rsDepthFrame);

    // Copy filtered depth to OpenCV Mat
    cv::Mat filteredDepth = cv::Mat::zeros(rawDepthImg.size(), CV_32F);
    memcpy(filteredDepth.data, rsFilteredDepthFrame.get_data(), rawDepthImg.cols * rawDepthImg.rows * sizeof(float));

    // Display (un)filtered images
    cv::imshow("Original depth", rawDepthImg);       // Original image is being shown
    cv::imshow("Filtered depth", filteredDepth);     // A depth image that looks exactly like the original unfiltered depth map is shown
    cv::imshow("Diff", filteredDepth - rawDepthImg); // A black image is being shown
    cv::waitKey(0);
    return 0;
}
rsImageConverter.h (edited version of @Doch88's code):
#include <librealsense2/rs.hpp>
#include <librealsense2/hpp/rs_internal.hpp>

class rsImageConverter
{
public:
    rsImageConverter(int w, int h, int bppDepth);
    bool convertFrame(uint8_t* depth_data, uint8_t* color_data);
    rs2::frame getDepth() const;

private:
    int w = 640;
    int h = 480;
    int bppDepth = sizeof(float);
    rs2::software_device dev;
    rs2::software_sensor depth_sensor;
    rs2::stream_profile depth_stream;
    rs2::syncer syncer;
    rs2::frame depth;
    int ind = 0;
};
rsImageConverter.cpp (edited version of @Doch88's code):
#include "rsImageConverter.h"
rsImageConverter::rsImageConverter(int w, int h, int bppDepth) :
w(w),
h(h),
bppDepth(bppDepth),
depth_sensor(dev.add_sensor("Depth")) // initializing depth sensor
{
rs2_intrinsics depth_intrinsics{ w, h, (float)(w / 2), (float)(h / 2), (float) w , (float) h , RS2_DISTORTION_BROWN_CONRADY ,{ 0,0,0,0,0 } };
depth_stream = depth_sensor.add_video_stream({ RS2_STREAM_DEPTH, 0, 0,
w, h, 60, bppDepth,
RS2_FORMAT_DISTANCE, depth_intrinsics });
depth_sensor.add_read_only_option(RS2_OPTION_DEPTH_UNITS, 1.f); // setting depth units option to the virtual sensor
depth_sensor.open(depth_stream);
depth_sensor.start(syncer);
}
bool rsImageConverter::convertFrame(uint8_t* depth_data, uint8_t* color_data)
{
depth_sensor.on_video_frame({ depth_data, // Frame pixels
[](void*) {}, // Custom deleter (if required)
w*bppDepth, bppDepth, // Stride and Bytes-per-pixel
(rs2_time_t)ind * 16, RS2_TIMESTAMP_DOMAIN_HARDWARE_CLOCK, ind, // Timestamp, Frame# for potential sync services
depth_stream });
ind++;
rs2::frameset fset = syncer.wait_for_frames();
depth = fset.first_or_default(RS2_STREAM_DEPTH);
return depth;
}
rs2::frame rsImageConverter::getDepth() const
{
return depth;
}
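One quick sanity check that might be worth adding right after the process() call (a hedged debugging sketch, not a confirmed fix): if a processing block rejects a frame, for instance because of its format, it may hand the input back unchanged, in which case the two frames can share the same underlying buffer:

// Hypothetical check, placed after spat_filter.process(rsDepthFrame):
// if the filter passed the frame through untouched, the output may
// reference the very same buffer as the input.
if (rsFilteredDepthFrame.get_data() == rsDepthFrame.get_data())
    std::cerr << "spatial_filter returned the input frame unchanged\n";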

Differentiation with Fourier Sine Transform (FFTW)

How can we calculate the first derivative of a function with FFTW_RODFT00 (sine transform)?
I have had some luck calculating the 2nd derivative: since sqrt(-1)^2 = -1, the coefficients stay (formally) imaginary, so the result can be handed straight back to the inverse sine transform to recover d²f(x). With the first derivative, on the other hand, I fear the single multiplication by i = sqrt(-1) leaves us with a real (cosine) spectrum, which would have to be inverse transformed with FFTW_REDFT00?
Here's my code:
#include "math.h"
#include "fftw3w.h"
#include "common.h"
//#include <complex>
#define M_PI 3.14159265358979323846
#define FFN_R2R 3
float b[FFN_R2R]={1,2,3};
static void normalize_r2r(void){
const float const_k=1.f/(2*(FFN_R2R+1));
for(unsigned int i=0; i<FFN_R2R; ++i) b[i]*=const_k;
}
static void delta_r2r(void) {
const float fM_PI= (float) M_PI;
fftwf_plan gplan[2];
gplan[0]= fftwf_plan_r2r_1d(FFN_R2R, b, b, FFTW_RODFT00, FFTW_ESTIMATE);
gplan[1]= fftwf_plan_r2r_1d(FFN_R2R, b, b, FFTW_RODFT00, FFTW_ESTIMATE);
fftwf_execute(gplan[0]);
/**/
const float rcpL=1.f/(1*(FFN_R2R+1));
unsigned int i;
float k;
for(i=0, k=0.f; i<FFN_R2R; ++i, ++k){
b[i]*= -( fM_PI*rcpL*(k+1.f) * fM_PI*rcpL*(k+1.f) );
}
fftwf_execute(gplan[1]);
normalize_r2r();
for(uint i=0; i<FFN_R2R; ++i) printf("%f \n", b[i]);
}
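For the first derivative, differentiating the sine series term by term does indeed turn it into a cosine series: d/dx of sum b_k sin(pi k x / L) is sum b_k (pi k / L) cos(pi k x / L). So the spectrum has to go back through FFTW_REDFT00 (DCT-I) rather than FFTW_RODFT00, on a grid that now includes the two endpoints (size N+2 instead of N, with zero coefficients padded at k = 0 and k = N+1). A minimal, self-contained sketch of that path (assuming the stock single-precision API from <fftw3.h>, linked with -lfftw3f, rather than the project's fftw3w.h wrapper; unit grid spacing, L = N+1):

#include <stdio.h>
#include <math.h>
#include <fftw3.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif
#define N 3                        /* number of interior points (DST-I size) */

int main(void)
{
    float b[N] = {1, 2, 3};        /* f sampled at x = 1..N, f(0)=f(N+1)=0   */
    float c[N + 2];                /* cosine coefficients, k = 0..N+1        */
    const float L = N + 1;         /* interval length with unit spacing      */

    /* Forward sine transform: b[] now holds the sine coefficients. */
    fftwf_plan fwd = fftwf_plan_r2r_1d(N, b, b, FFTW_RODFT00, FFTW_ESTIMATE);
    fftwf_execute(fwd);

    /* Scale coefficient k by pi*k/L and pad the cosine spectrum with
       zero endpoints (k = 0 and k = N+1). */
    c[0] = 0.f;
    for (int k = 0; k < N; ++k)
        c[k + 1] = b[k] * (float)M_PI * (k + 1) / L;
    c[N + 1] = 0.f;

    /* Inverse cosine transform of size N+2 yields f' at x = 0..N+1. */
    fftwf_plan inv = fftwf_plan_r2r_1d(N + 2, c, c, FFTW_REDFT00, FFTW_ESTIMATE);
    fftwf_execute(inv);

    /* Same normalisation as the RODFT00 round trip: divide by 2*(N+1). */
    for (int i = 0; i < N + 2; ++i)
        printf("%f\n", c[i] / (2.f * (N + 1)));

    fftwf_destroy_plan(fwd);
    fftwf_destroy_plan(inv);
    return 0;
}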

histogram kernel memory issue

I am trying to implement an algorithm to process images with more than 256 bins.
The main issue with processing a histogram in this case is that it is impossible to allocate more than 32 KB as a local array on the GPU.
All the algorithms I found for 8-bit-per-pixel images use a fixed-size local array.
The histogram is first accumulated in that array, then a barrier is raised, and finally the partial result is added to the output vector.
I am working with IR images, which have a dynamic range of more than 32K bins.
So I cannot allocate a fixed-size array inside the GPU.
My algorithm uses an atomic_add in order to build the output histogram directly.
I am interfacing with OpenCV, so, to manage the possible case of saturation, my bins use floating point, in single or double precision depending on the GPU's capabilities.
OpenCV doesn't support unsigned int, long, or unsigned long as matrix types.
I get an error... I think this error is a kind of segmentation fault.
After several days I still have no idea what is wrong.
Here is my code:
histogram.cl:
#pragma OPENCL EXTENSION cl_khr_fp64: enable
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics: enable

static void Atomic_Add_f64(__global double *val, double delta)
{
    union {
        double f;
        ulong i;
    } old;
    union {
        double f;
        ulong i;
    } new;
    do {
        old.f = *val;
        new.f = old.f + delta;
    }
    while (atom_cmpxchg((volatile __global ulong *)val, old.i, new.i) != old.i);
}

static void Atomic_Add_f32(__global float *val, float delta)
{
    union {
        float f;
        uint i;
    } old;
    union {
        float f;
        uint i;
    } new;
    do {
        old.f = *val;
        new.f = old.f + delta;
    }
    // A float is 32 bits wide, so the compare-exchange must be the 32-bit
    // atomic_cmpxchg on uint, not the 64-bit atom_cmpxchg on ulong.
    while (atomic_cmpxchg((volatile __global uint *)val, old.i, new.i) != old.i);
}
__kernel void khist(
    __global const uchar* _src,
    const int src_steps,
    const int src_offset,
    const int rows,
    const int cols,
    __global uchar* _dst,
    const int dst_steps,
    const int dst_offset)
{
    const int gid = get_global_id(0);
    // printf("This message has been printed from the OpenCL kernel %d \n", gid);
    if (gid < rows)
    {
        __global const _Sty* src = (__global const _Sty*)_src;
        __global _Dty* dst = (__global _Dty*) _dst;

        const int src_step1 = src_steps / sizeof(_Sty);
        const int dst_step1 = dst_steps / sizeof(_Dty);

        src += mad24(gid, src_step1, src_offset);
        dst += mad24(gid, dst_step1, dst_offset);

        _Dty one = (_Dty)1;
        for (int c = 0; c < cols; c++)
        {
            const _Rty idx = (_Rty)(*(src + c + src_offset));
            ATOMIC_FUN(dst + idx + dst_offset, one);
        }
    }
}
The Atomic_Add_f64 function comes directly from here and there; the _Sty, _Rty, _Dty and ATOMIC_FUN macros are injected as -D build options by the host code below.
main.cpp:
#include <opencv2/core.hpp>
#include <opencv2/core/ocl.hpp>
#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <ctime>
#include <fstream>
#include <iostream>
#include <iterator>
#include <sstream>

int main()
{
    cv::Mat_<unsigned short> a(480, 640);
    cv::RNG rng(std::time(nullptr));
    std::for_each(a.begin(), a.end(), [&](unsigned short& v){ v = rng.uniform(0, 100); });

    bool ret = false;
    cv::String file_content;
    {
        std::ifstream file_stream("../test/histogram.cl");
        std::ostringstream file_buf;
        file_buf << file_stream.rdbuf();
        file_content = file_buf.str();
    }

    int output_flag = cv::ocl::Device::getDefault().doubleFPConfig() == 0 ? CV_32F : CV_64F;
    cv::String atomic_fun = output_flag == CV_32F ? "Atomic_Add_f32" : "Atomic_Add_f64";

    cv::ocl::ProgramSource source(file_content);
    // std::cout << source.source() << std::endl;
    cv::ocl::Kernel k;
    cv::UMat src;
    cv::UMat dst = cv::UMat::zeros(1, 65536, output_flag);

    a.copyTo(src);

    atomic_fun = cv::format("-D _Sty=%s -D _Rty=%s -D _Dty=%s -D ATOMIC_FUN=%s",
                            cv::ocl::typeToStr(src.depth()),
                            cv::ocl::typeToStr(src.depth()), // this is to manage cases like a matrix of unsigned short stored as a matrix of float.
                            cv::ocl::typeToStr(output_flag),
                            atomic_fun.c_str());

    ret = k.create("khist", source, atomic_fun);
    std::cout << "check create : " << ret << std::endl;

    k.args(cv::ocl::KernelArg::ReadOnly(src), cv::ocl::KernelArg::WriteOnlyNoSize(dst));

    std::size_t sz = a.rows;
    ret = k.run(1, &sz, nullptr, false);
    std::cout << "check " << ret << std::endl;

    cv::Mat b;
    dst.copyTo(b);

    std::copy_n(b.ptr<double>(0), 101, std::ostream_iterator<double>(std::cout, " "));
    std::cout << std::endl;

    return EXIT_SUCCESS;
}
Hello, I managed to fix it.
I don't really know where the issue came from, but if I treat the output as a pointer rather than a matrix, it works.
The changes I made are these:
histogram.cl:
__kernel void khist(
    __global const uchar* _src,
    const int src_steps,
    const int src_offset,
    const int rows,
    const int cols,
    __global _Dty* _dst)
{
    const int gid = get_global_id(0);
    if (gid < rows)
    {
        __global const _Sty* src = (__global const _Sty*)_src;
        __global _Dty* dst = _dst;

        const int src_step1 = src_steps / sizeof(_Sty);
        src += mad24(gid, src_step1, src_offset);

        ulong one = 1;
        for (int c = 0; c < cols; c++)
        {
            const _Rty idx = (_Rty)(*(src + c + src_offset));
            ATOMIC_FUN(dst + idx, one);
        }
    }
}
main.cpp:
k.args(cv::ocl::KernelArg::ReadOnly(src),cv::ocl::KernelArg::PtrWriteOnly(dst));
The rest of the code is the same in the two files.
For me it works fine.
If someone knows why it works when the output is declared as a pointer rather than a vector (a matrix of one row), I am interested. My best guess after the fact: the original kernel offset dst by mad24(gid, dst_step1, dst_offset), so every work-item except gid 0 indexed far beyond the one-row histogram, and dst_offset was added a second time for every element; the pointer version drops both offsets.
Nevertheless, my issue is fixed :).

Is there any way to convert an Eigen::Matrix back to itk::image?

I used the Eigen library to convert several itk::image images into matrices and do some dense linear algebra computations on them. Finally, I have the output as a matrix, but I need it in itk::image form. Is there any way to do this?
const unsigned int numberOfPixels = importSize[0] * importSize[1];
float* array1 = inverseU.data();
float* localBuffer = new float[numberOfPixels];
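// NB: memcpy's length argument is in bytes, so the line below copies
// numberOfPixels bytes rather than numberOfPixels floats
// (which would be numberOfPixels * sizeof(float) bytes).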
std::memcpy(localBuffer, array1, numberOfPixels);
const bool importImageFilterWillOwnTheBuffer = true;
importFilter->SetImportPointer(localBuffer,numberOfPixels,importImageFilterWillOwnTheBuffer);
importFilter->Update();
inverseU is the Eigen matrix (float); importSize is the size of this matrix. When I take importFilter->GetOutput() and write the result to a file, the image I get is not correct.
This is the matrix inverseU:
https://drive.google.com/file/d/0B3L9EtRhN11QME16SGtfSDJzSWs/view?usp=sharing . It is supposed to produce a retinal fundus image; I obtained the matrix after deblurring.
Take a look at the ImportImageFilter of itk. In particular, it may be used to build an itk::Image starting from a C-style array (example).
Someone recently asked how to convert a CImg image to ITK image. My answer might be a starting point...
A way to get the array out of an Eigen matrix A may be found here:
double* array = A.data();
EDIT: here is a piece of code to turn a matrix of float into a PNG image saved with ITK. First, the matrix is converted to an itk::Image of float. Then, this image is rescaled and cast to an image of unsigned char, using the RescaleIntensityImageFilter as explained here. Finally, the image is saved in PNG format.
#include <iostream>
#include <itkImage.h>
using namespace itk;
using namespace std;

#include <Eigen/Dense>
using Eigen::MatrixXf;

#include <itkImportImageFilter.h>
#include <itkImageFileWriter.h>
#include "itkRescaleIntensityImageFilter.h"

void eigen_To_ITK (MatrixXf mat)
{
    const unsigned int Dimension = 2;
    typedef itk::Image<unsigned char, Dimension> UCharImageType;
    typedef itk::Image< float, Dimension > FloatImageType;
    typedef itk::ImportImageFilter< float, Dimension > ImportFilterType;
    ImportFilterType::Pointer importFilter = ImportFilterType::New();
    typedef itk::RescaleIntensityImageFilter< FloatImageType, UCharImageType > RescaleFilterType;
    RescaleFilterType::Pointer rescaleFilter = RescaleFilterType::New();
    typedef itk::ImageFileWriter< UCharImageType > WriterType;
    WriterType::Pointer writer = WriterType::New();

    FloatImageType::SizeType imsize;
    imsize[0] = mat.rows();
    imsize[1] = mat.cols();

    ImportFilterType::IndexType start;
    start.Fill( 0 );
    ImportFilterType::RegionType region;
    region.SetIndex( start );
    region.SetSize( imsize );
    importFilter->SetRegion( region );

    const itk::SpacePrecisionType origin[ Dimension ] = { 0.0, 0.0 };
    importFilter->SetOrigin( origin );
    const itk::SpacePrecisionType spacing[ Dimension ] = { 1.0, 1.0 };
    importFilter->SetSpacing( spacing );

    const unsigned int numberOfPixels = imsize[0] * imsize[1];
    const bool importImageFilterWillOwnTheBuffer = true;
    float * localBuffer = new float[ numberOfPixels ];
    float * it = localBuffer;
    memcpy(it, mat.data(), numberOfPixels * sizeof(float));
    importFilter->SetImportPointer( localBuffer, numberOfPixels, importImageFilterWillOwnTheBuffer );

    rescaleFilter->SetInput(importFilter->GetOutput());
    rescaleFilter->SetOutputMinimum(0);
    rescaleFilter->SetOutputMaximum(255);

    writer->SetFileName( "output.png" );
    writer->SetInput(rescaleFilter->GetOutput() );
    writer->Update();
}

int main()
{
    const int rows = 42;
    const int cols = 90;

    MatrixXf mat1(rows, cols);
    mat1.topLeftCorner(rows/2, cols/2) = MatrixXf::Zero(rows/2, cols/2);
    mat1.topRightCorner(rows/2, cols/2) = MatrixXf::Identity(rows/2, cols/2);
    mat1.bottomLeftCorner(rows/2, cols/2) = -MatrixXf::Identity(rows/2, cols/2);
    mat1.bottomRightCorner(rows/2, cols/2) = MatrixXf::Zero(rows/2, cols/2);
    mat1 += 0.1 * MatrixXf::Random(rows, cols);

    eigen_To_ITK (mat1);
    cout << "running fine" << endl;
    return 0;
}
The program is built using CMake. Here is the CMakeLists.txt:
cmake_minimum_required(VERSION 2.8 FATAL_ERROR)
project(ItkTest)
find_package(ITK REQUIRED)
include(${ITK_USE_FILE})
# to include eigen. This path may need to be changed
include_directories(/usr/local/include/eigen3)
add_executable(MyTest main.cpp)
target_link_libraries(MyTest ${ITK_LIBRARIES})

Why do operations with an array corrupt the values?

I'm trying to implement Particle Swarm Optimization on CUDA. I partially initialize the data arrays on the host, then allocate memory on the device and copy the data there, and then try to proceed with the initialization.
The problem is, when I try to modify an array element like so
__global__ void kernelInit(
    float* X,
    size_t pitch,
    int width,
    float X_high,
    float X_low
) {
    // Silly, but pretty reliable way to address array elements
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int r = tid / width;
    int c = tid % width;
    float* pElement = (float*)((char*)X + r * pitch) + c;
    *pElement = *pElement * (X_high - X_low) - X_low;
    //*pElement = (X_high - X_low) - X_low;
}
It corrupts the values and gives me 1.#INF00 as the array element. When I uncomment the last line *pElement = (X_high - X_low) - X_low; and comment out the previous one, it works as expected: I get values like 15.36 and so on.
I believe the problem is either with my memory allocation and copying, and/or with addressing the specific array element. I read the CUDA manual on both topics, but I can't spot the error: I still get a corrupt array if I do anything with the array element. For example, *pElement = *pElement * 2 gives unreasonably big results like 779616...00000000.00000 when the initial pElement is expected to be just a float in [0;1].
Here is the full source. Initialization of the arrays begins in main (bottom of the source); then the f1 function does the CUDA work and launches the initialization kernel kernelInit:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <cuda.h>
#include <cuda_runtime.h>

const unsigned f_n = 3;
const unsigned n = 2;
const unsigned p = 64;

typedef struct {
    unsigned k_max;
    float c1;
    float c2;
    unsigned p;
    float inertia_factor;
    float Ef;
    float X_low[f_n];
    float X_high[f_n];
    float X_min[n][f_n];
} params_t;
typedef void (*kernelWrapperType) (
    float *X,
    float *X_highVec,
    float *V,
    float *X_best,
    float *Y,
    float *Y_best,
    float *X_swarmBest,
    bool &termination,
    const float &inertia,
    const params_t *params,
    const unsigned &f
);

typedef float (*twoArgsFuncType) (
    float x1,
    float x2
);

__global__ void kernelInit(
    float* X,
    size_t pitch,
    int width,
    float X_high,
    float X_low
) {
    // Silly, but pretty reliable way to address array elements
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int r = tid / width;
    int c = tid % width;
    float* pElement = (float*)((char*)X + r * pitch) + c;
    *pElement = *pElement * (X_high - X_low) - X_low;
    //*pElement = (X_high - X_low) - X_low;
}

__device__ float kernelF1(
    float x1,
    float x2
) {
    float y = pow(x1, 2.f) + pow(x2, 2.f);
    return y;
}
void f1(
    float *X,
    float *X_highVec,
    float *V,
    float *X_best,
    float *Y,
    float *Y_best,
    float *X_swarmBest,
    bool &termination,
    const float &inertia,
    const params_t *params,
    const unsigned &f
) {
    float *X_d = NULL;
    float *Y_d = NULL;
    unsigned length = n * p;
    const cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();
    size_t pitch;
    size_t dpitch;
    cudaError_t err;
    unsigned width = n;
    unsigned height = p;

    err = cudaMallocPitch (&X_d, &dpitch, width * sizeof(float), height);
    pitch = n * sizeof(float);
    err = cudaMemcpy2D(X_d, dpitch, X, pitch, width * sizeof(float), height, cudaMemcpyHostToDevice);

    err = cudaMalloc (&Y_d, sizeof(float) * p);
    err = cudaMemcpy (Y_d, Y, sizeof(float) * p, cudaMemcpyHostToDevice);

    dim3 threads; threads.x = 32;
    dim3 blocks; blocks.x = (length/threads.x) + 1;

    kernelInit<<<threads,blocks>>>(X_d, dpitch, width, params->X_high[f], params->X_low[f]);

    err = cudaMemcpy2D(X, pitch, X_d, dpitch, n*sizeof(float), p, cudaMemcpyDeviceToHost);
    err = cudaFree(X_d);
    err = cudaMemcpy(Y, Y_d, sizeof(float) * p, cudaMemcpyDeviceToHost);
    err = cudaFree(Y_d);
}
float F1(
    float x1,
    float x2
) {
    float y = pow(x1, 2.f) + pow(x2, 2.f);
    return y;
}

/*
 * Generates random float in [0.0; 1.0]
 */
float frand(){
    return (float)rand()/(float)RAND_MAX;
}

/*
 * This is the main routine which declares and initializes the integer vector, moves it to the device, launches kernel
 * brings the result vector back to host and dumps it on the console.
 */
int main() {
    const params_t params = {
        100,
        0.5,
        0.5,
        p,
        0.98,
        0.01,
        {-5.12, -2.048, -5.12},
        {5.12, 2.048, 5.12},
        {{0, 1, 0}, {0, 1, 0}}
    };

    float X[p][n];
    float X_highVec[n];
    float V[p][n];
    float X_best[p][n];
    float Y[p] = {0};
    float Y_best[p] = {0};
    float X_swarmBest[n];

    kernelWrapperType F_wrapper[f_n] = {&f1, &f1, &f1};
    twoArgsFuncType F[f_n] = {&F1, &F1, &F1};

    for (unsigned f = 0; f < f_n; f++) {
        printf("Optimizing function #%u\n", f);
        srand ( time(NULL) );
        for (unsigned i = 0; i < p; i++)
            for (unsigned j = 0; j < n; j++)
                X[i][j] = X_best[i][j] = frand();
        for (int i = 0; i < n; i++)
            X_highVec[i] = params.X_high[f];
        for (unsigned i = 0; i < p; i++)
            for (unsigned j = 0; j < n; j++)
                V[i][j] = frand();
        for (unsigned i = 0; i < p; i++)
            Y_best[i] = F[f](X[i][0], X[i][1]);
        for (unsigned i = 0; i < n; i++)
            X_swarmBest[i] = params.X_high[f];
        float y_swarmBest = F[f](X_highVec[0], X_highVec[1]);

        bool termination = false;
        float inertia = 1.;

        for (unsigned k = 0; k < params.k_max; k++) {
            F_wrapper[f]((float *)X, X_highVec, (float *)V, (float *)X_best, Y, Y_best, X_swarmBest, termination, inertia, &params, f);
        }

        for (unsigned i = 0; i < p; i++)
        {
            for (unsigned j = 0; j < n; j++)
            {
                printf("%f\t", X[i][j]);
            }
            printf("F = %f\n", Y[i]);
        }
        getchar();
    }
}
Update: I tried adding error handling like so
err = cudaMallocPitch (&X_d, &dpitch, width * sizeof(float), height);
if (err != cudaSuccess) {
    fprintf(stderr, "%s\n", cudaGetErrorString(err));
    exit(1);
}
after each API call, but it reported nothing and didn't exit (I still get all the results, and the program runs to the end).
This is an unnecessarily complex piece of code for what should be a simple repro case, but this immediately jumps out:
const unsigned n = 2;
const unsigned p = 64;
unsigned length = n * p;

dim3 threads; threads.x = 32;
dim3 blocks; blocks.x = (length/threads.x) + 1;

kernelInit<<<threads,blocks>>>(X_d, dpitch, width, params->X_high[f], params->X_low[f]);
So you are first computing the incorrect number of blocks, and then reversing the order of the blocks-per-grid and threads-per-block arguments in the kernel launch. That may well lead to out-of-bounds memory access, either hosing something in GPU memory or causing an unspecified launch failure, which your lack of error handling might not be catching. There is a tool called cuda-memcheck, which has shipped with the toolkit since about CUDA 3.0. If you run it, it will give you valgrind-style memory access violation reports. You should get into the habit of using it, if you are not already doing so.
As for the infinite values, that is to be expected, isn't it? Your code starts with values in (0,1) and then does
X[i] = X[i] * (5.12--5.12) - -5.12
100 times, which is the rough equivalent of multiplying by 10^100, which is then followed by
X[i] = X[i] * (2.048--2.048) - -2.048
100 times, which is the rough equivalent of multiplying by 4^100, finally followed by
X[i] = X[i] * (5.12--5.12) - -5.12
again. So your results should be of the order of 1E250, which is much larger than 3.4E38, the rough upper limit of representable numbers in IEEE 754 single precision.
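For reference, a minimal corrected launch addressing both points (a sketch reusing the question's variable names): the blocks-per-grid argument comes first, threads-per-block second, and the block count is rounded up rather than unconditionally incremented:

// Corrected launch configuration (sketch): grid size first, block size
// second, with the block count computed as ceil(length / threads.x).
dim3 threads(32);
dim3 blocks((length + threads.x - 1) / threads.x);
kernelInit<<<blocks, threads>>>(X_d, dpitch, width, params->X_high[f], params->X_low[f]);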
