Data Clauses (output is zero when i use OpenACC)

I want to reduce the runtime of my code by using OpenACC, but unfortunately when I use OpenACC the output becomes zero.
sajad
#include <stdio.h>
#include <math.h>
#include <stdlib.h>
#include <assert.h>
#include <openacc.h>
#include<time.h>
#include <string.h>
#include <malloc.h>
#define NX 201
#define NY 101
#define NZ 201
int main(void)
{
int i, j, k, l, m;
static double tr, w;
static double dt = 9.5e-9, t;
static double cu[NZ];
static double AA[NX][NY][NZ] , CC[NX][NY][NZ] , BB[NX][NY][NZ] ;
static double A[NX][NY][NZ] , B[NX][NY][NZ] , C[NX][NY][NZ] ;
FILE *file;
file = fopen("BB-and-A.csv", "w");
t = 0.;
#pragma acc data copyin( tr, w,dt, t),copy(B ,A , C,AA , CC,BB,cu )
{
for (l = 1; l < 65; l++) {
#pragma acc kernels loop private(i, j,k)
for (i = 1; i < NX - 1; i++) {
for (j = 0; j < NY - 1; j++) {
for (k = 1; k < NZ - 1; k++) {
A[i][j][k] = A[i][j][k]
+ 1. * (B[i][j][k] - AA[i][j][k - 1]);
}
}
}
#pragma acc kernels loop private(i, j,k)
for (i = 1; i < NX - 1; i++) { /* BB */
for (j = 1; j < NY - 1; j++) {
for (k = 0; k < NZ - 1; k++) {
B[i][j][k] = B[i][j][k]
+ 1.* (BB[i][j][k] - A[i - 1][j][k]);
}
}
}
#pragma acc kernels
for (m = 1; m < NZ - 1; m++) {
tr = t - (double)(m)*5 / 1.5e8;
if (tr <= 0.)
cu[m] = 0.;
else {
w = (tr / 0.25e-6)*(tr / 0.25e-6);
cu[m] =1666*w / (w + 1.)*exp(-tr / 2.5e-6) ;
cu[m] = 2*cu[m];
}
A[10][60][m] = -cu[m];
}
#pragma acc update self(B)
fprintf(file, "%e, %e \n", t*1e6, -B[22][60][10] );
t = t + dt;
}
}
fclose(file);
}

The problem here is the "copyin( tr, w,dt, t)", and in particular the "t" variable. By putting these scalars in a data clause, you take on the job of managing the synchronization between the host and device copies. Hence, when you update the variable on the host (i.e. "t = t + dt;"), you then need to update the device copy with the new value. Without that update the device copy of "t" stays at its initial value of 0, so "tr" is never positive and "cu" is always set to zero.
Also, there's a potential race condition on "tr", since the device code will now use the shared device variable instead of a private copy.
The easiest thing to do, though, is simply not to put these scalars in a data clause. By default, OpenACC privatizes scalars, so there's no need to manage them yourself. In t's case, its value will be passed as an argument to the CUDA kernel.
To fix your code change:
#pragma acc data copyin( tr, w,dt, t),copy(B ,A , C,AA , CC,BB,cu )
to:
#pragma acc data copy(B ,A , C,AA , CC,BB,cu )
Note that there's no need to put the loop indices in a private clause since they are implicitly private.
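As a minimal sketch of the two options described above (not the original program): the first function leaves the scalars out of the data clause, and the second keeps "t" in a data clause and refreshes the device copy after each host update. The names NZ, STEPS, run_scalars_implicit and run_scalar_in_data_clause, and the simplified "tr" formula, are assumptions made for illustration. A compiler without OpenACC support ignores the pragmas and runs the loops serially, and both versions should then give the same result.

```c
#include <assert.h>

#define NZ 16    /* hypothetical small array size, for illustration only */
#define STEPS 8  /* hypothetical number of time steps */

/* Option 1 (recommended in the answer): only the array is in a data clause.
   The scalars t and dt are left to the default rules, so their current
   host values are passed to the device at every kernel launch. */
void run_scalars_implicit(double *cu, double dt)
{
    assert(dt > 0.0);
    double t = 0.0;
    #pragma acc data copy(cu[0:NZ])
    {
        for (int l = 0; l < STEPS; l++) {
            #pragma acc parallel loop
            for (int m = 0; m < NZ; m++) {
                /* declared inside the loop, so private to each iteration */
                double tr = t - 0.5 * (double)m * dt;
                cu[m] = (tr <= 0.0) ? 0.0 : tr;
            }
            t += dt;  /* host-only update; picked up at the next launch */
        }
    }
}

/* Option 2: t is placed in a data clause, so the device holds its own copy.
   Every host-side change must then be pushed with an update directive. */
void run_scalar_in_data_clause(double *cu, double dt)
{
    assert(dt > 0.0);
    double t = 0.0;
    #pragma acc data copyin(t) copy(cu[0:NZ])
    {
        for (int l = 0; l < STEPS; l++) {
            #pragma acc parallel loop
            for (int m = 0; m < NZ; m++) {
                double tr = t - 0.5 * (double)m * dt;
                cu[m] = (tr <= 0.0) ? 0.0 : tr;
            }
            t += dt;
            /* without this line the device would keep using the stale t */
            #pragma acc update device(t)
        }
    }
}
```

Either function can be called from main with a zero-initialized array of NZ doubles and a positive dt. To actually offload, an OpenACC-enabled build is needed (for example "nvc -acc" or "gcc -fopenacc", depending on the toolchain available).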

Related

Exception thrown at 0x00007FFD9ABF024E (ucrtbased.dll) in myapp.exe: 0xC0000005: Access violation reading location

I'm trying to create a char matrix using dynamic allocation (char**). It represents a board whose margins are the '#' character and whose interior is ASCII 32 (blank space). When I run the code, this message appears: "Exception thrown at 0x00007FFD9ABF024E (ucrtbased.dll) in myapp.exe: 0xC0000005: Access violation reading location " in some cpp file.
Here's my code:
#include <iostream>
using namespace std;
char** allocateBoard(int n)
{
char** Board = 0;
Board = new char* [n+2];
int i;
for (i = 0; i < n + 2; i++)
{
Board[i] = new char[n * 2 + 2];
}
return Board;
}
void initBoard(char**& Board, int n)
{
int i, j;
for (i = 0; i < n; i++)
{
for (j = 0; j < n * 2; j++)
{
if (i == 0 || i == n - 1) Board[i][j] = '#';
else if (j == 0 || j == n * 2 - 1) Board[i][j] = '#';
else Board[i][j] = 32;
}
}
}
void showBoard(char** Board, int n)
{
int i, j;
for (i = 0; i < n; i++)
{
for (j = 0; j < n * 2; j++)
{
cout << Board[i][j];
}
cout << endl;
}
}
int main()
{
int n = 4;
char** Board = 0;
Board = allocateBoard(n);
initBoard(Board, n);
showBoard(Board, n);
cout << endl;
showBoard(Board, n);
for (int i = 0; i < n * 2 + 4; i++)
{
delete[] Board[i];
}
delete[] Board;
return 0;
}
Does anyone know where the problem is? As a complete beginner, I can't see where the mistake is. I've allocated more space in the matrix than I'm actually using, so I can't figure out why this message appears. Is the deallocation the problem?
Thanks!

stack smashing in C code about making a histogram

I need to make a C program that will make a histogram of all the letters present in a phrase the user gives. When I run it, it does that but then gives a "*** stack smashing detected ***: terminated" error. Where would this error be coming from? (For ease, I have set max to 3 for now. In the future I'll have it find the max.)
Thank you
Andrew
#include <stdio.h>
#include <ctype.h>
#include <string.h>
static void ReadText(int histo[26],int max) {
char phrase[100];
int i;
char Letter;
char toArray;
// read in phrase
printf("Enter Phrase: "); // reads in phrase with spaces between words
scanf("%[^\n]",phrase);
// count the number of certain letters that occur
for(i = 0; i <= strlen(phrase);++i) {
Letter = phrase[i];
if(isalpha(Letter) != 0){
Letter = tolower(Letter);
toArray = Letter - 97;
histo[(int)toArray] = histo[(int)toArray] + 1;
}
}
}
static void DrawHist(int histo[26], int max){
int i;
int j;
int histo2[50];
for(i = 0; i <= 26; i++) {
histo2[i+i] = histo[i];
if(i < 25) {
histo2[i+i+1] = 0;
}
}
// (i = 1; i <= 50; i++) {
// printf("%d",histo2[i]);
//}
//printf("\n");
for(i=max;i>0;--i) {
for(j=0;j<=51;++j) {
if((j < 51) && (histo2[j] >= i)) {
printf("|");
}
else if((j < 51) && (histo2[j] < i)){
printf(" ");
}
else if(j == 51){
printf("\n");
}
}
}
printf("+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-\n");
printf("A B C D E F G H I J K L M N O P Q R S T U V W X Y Z\n");
}
int main() {
int histo[26] = {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0};
int max = 3;
//int i;
ReadText(histo,max);
//for(i = 0; i<26;++i) {
// printf("%d",histo[i]);
//}
DrawHist(histo,max);
return 0;
}

Clang memory allocation

Could anyone please help me understand why Clang reallocates the same memory address for different variables while their lifetimes intersect?
I am using a sample program (below) to show the problem.
When I compile the program with clang -O0, variable j in function ok has the same memory address as variable solutions in function nqueens.
Function ok is called inside function nqueens, which means that the lifetimes of the variables intersect; the same stack space cannot be used/reused for both functions.
Compiling the program with gcc or clang at -O1, however, they are assigned different memory addresses.
Any help is appreciated!
#include <stdlib.h>
#include <stdio.h>
#include <memory.h>
#include <alloca.h>
/* Checking information */
static int solutions[] = {
1,
0,
0,
2,
10, /* 5 */
4,
40,
92,
352,
724, /* 10 */
2680,
14200,
73712,
365596,
};
#define MAX_SOLUTIONS sizeof(solutions)/sizeof(int)
int total_count;
int sharedVar = 0;
int ok(int n, char *a)
{
int i, j;
char p, q;
printf("jjjjjjjjj: %d, %p\n", n,&j);
for (i = 0; i < n; i++) {
p = a[i];
for (j = i + 1; j < n; j++) {
q = a[j];
if (q == p || q == p - (j - i) || q == p + (j - i))
return 0;
}
}
return 1;
}
void nqueens (int n, int j, char *a, int *solutions)
{
int i,res;
sharedVar = sharedVar * j - n;
if (n == j) {
/* good solution, count it */
*solutions = 1;
return;
}
printf("solutions: %d, %p\n", j, &solutions);
*solutions = 0;
/* try each possible position for queen <j> */
for (i = 0; i < n; i++) {
a[j] = (char) i;
if (ok(j + 1, a)) {
nqueens(n, j + 1, a,&res);
*solutions += res;
}
}
}
int main()
{
int size = 3;
char *a;
// printf("total_count: %p\n", &total_count);
total_count=0;
a = (char *)alloca(size * sizeof(char));
printf("Computing N-Queens algorithm (n=%d) ", size);
sharedVar = -5;
nqueens(size, 0, a, &total_count);
printf("completed!\n");
printf("sharedVar: %d\n", sharedVar);
}

Implementation of LASSO in C

I am trying to understand the LASSO algorithm for linear regression. I have implemented the algorithm using a naive coordinate descent method for the optimization. However, the coefficients that I obtained from my code did not match those obtained from the 'glmnet' package for LASSO in R. I would like to understand how I could make the algorithm more accurate, so that the coefficients match those obtained from R. I think they use coordinate descent as well.
Note: I have generated some toy data with 11 observations, and 6
features(x,x^2 ,x^3,...,x^6). The last column contains the y values
generated from a dummy function (e^(-x^2)). I wanted to use LASSO to
estimate this function. Also, I have randomly picked the initial
weight vector, multiple times to crosscheck my results.
Here is my code:
#include<stdio.h>
#include<stdlib.h>
#include<string.h>
#include<math.h>
#include<time.h>
int num_dim = 6;
int num_obs = 11;
/*Computes the normalization factor*/
float norm_feature(int j,double arr[][7],int n){
float sum = 0.0;
int i;
for(i=0;i<n;i++){
sum = sum + pow(arr[i][j],2);
}
return sum;
}
/*Computes the partial sum*/
float approx(int dim,int d_ignore,float weights[],double arr[][7],int i){
int flag = 1;
if(d_ignore == -1)
flag = 0;
int j;
float sum = 0.0;
for(j=0;j<dim;j++){
if(j != d_ignore)
sum = sum + weights[j]*arr[i][j];
else
continue;
}
return sum;
}
/* Computes rho-j */
float rho_j(double arr[][7],int n,int j,float weights[7]){
float sum = 0.0;
int i;
float partial_sum ;
for(i=0;i<n;i++){
partial_sum = approx(num_dim,j,weights,arr,i);
sum = sum + arr[i][j]*(arr[i][num_dim]-partial_sum);
}
return sum;
}
float intercept(float arr1[7],double arr[][7],int dim) {
int i;
float sum =0.0;
for (i = 0; i < num_obs; i++) {
sum = sum + pow((arr[i][num_dim]) - approx(num_dim, -1, arr1, arr, i), 1);
}
return sum;
}
int main(){
double data[num_obs][7];
int i=0,j=0;
float a = 1.0;
float lambda = 0.1; //Setting lambda
float weights[7]; //weights[6] contains the intercept
srand((unsigned int) time(NULL));
/*Generating the data matrix */
for(i=0;i<11;i++)
data[i][0] = ((float)rand()/(float)(RAND_MAX)) * a;
for(i=0;i<11;i++)
for(j=1;j<6;j++)
data[i][j] = pow(data[i][0],j+1);
for(i=0;i<11;i++)
data[i][6] = exp(-pow(data[i][0],2)); // the last column in the data matrix contains the y values generated by the dummy function
/*Printing the data matrix */
printf("Data Matrix:\n");
for(i=0;i<11;i++){
for(j=0;j<7;j++){
printf("%lf ",data[i][j]);}
printf("\n");}
printf("\n");
int seed =0;
while(seed<20) {
//Initializing the weight vector
for (i = 0; i < 7; i++)
weights[i] = ((float) rand() / (float) (RAND_MAX)) * a;
int iter = 500;
int t = 0;
int r, l;
double rho[num_dim];
for (i = 0; i < 6; i++) {
rho[i] = rho_j(data, num_obs, r, weights);
}
// Intercept initialization
weights[num_dim] = intercept(weights,data,num_dim);
printf("Weights initialization: ");
for (i = 0; i < (num_dim+1); i++)
printf("%f ", weights[i]);
printf("\n");
while (t < iter) {
for (r = 0; r < num_dim; r++) {
rho[r] = rho_j(data, num_obs, r, weights);
//printf("rho %d:%f ",r,rho[r]);
if (rho[r] < -lambda / 2)
weights[r] = (rho[r] + lambda / 2) / norm_feature(r, data, num_obs);
else if (rho[r] > lambda / 2)
weights[r] = (rho[r] - lambda / 2) / norm_feature(r, data, num_obs);
else
weights[r] = 0;
weights[num_dim] = intercept(weights, data, num_dim);
}
/* printf("Iter(%d): ", t);
for (l = 0; l < 7; l++)
printf("%f ", weights[l]);
printf("\n");*/
t++;
}
//printf("\n");
printf("Final Weights: ");
for (i = 0; i < 7; i++)
printf("%f ", weights[i]);
printf("\n");
printf("\n");
seed++;
}
return 0;
}
PseudoCode:

integrate_adaptive and integrate_times give different answers for negative step size

I'm using the odeint library in Boost. When using the integrate_adaptive function, the results are as expected. However, when using integrate_times, the ODE is evaluated at very different times that are outside the range of integration. This is a problem for me because my ODE is not defined for some of the values that it is being evaluated at.
The code below demonstrates the issue. The x values for which the ODE is evaluated are printed to the screen.
#include <iostream>
#include <complex>
#include <vector>
#include <boost/numeric/odeint.hpp>
struct observe
{
std::vector<std::vector<std::complex<double> > > & y;
std::vector<double>& x_ode;
observe(std::vector<std::vector<std::complex<double> > > &p_y, std::vector<double> &p_x_ode) : y(p_y), x_ode(p_x_ode) { };
void operator()(const std::vector<std::complex<double> > &y_temp, double x_temp)
{
y.push_back(y_temp);
x_ode.push_back(x_temp);
}
};
class Direct
{
std::complex<double> alpha;
std::complex<double> beta;
std::complex<double> R;
std::vector<std::vector<std::complex<double> > > H0_create(const double y);
public:
Direct(std::complex<double> p_alpha, std::complex<double> p_beta, double p_R) : alpha(p_alpha), beta(p_beta), R(p_R) { }
void operator() (const std::vector<std::complex<double> > &y, std::vector<std::complex<double> > &dydx, const double x)
{
std::vector<std::vector<std::complex<double> > > H0 = H0_create(x);
for(int ii = 0; ii < 6; ii++)
{
dydx[ii] = 0.0;
for(int jj = 0; jj < 6; jj++)
{
dydx[ii] += H0[ii][jj]*y[jj];
}
}
}
};
std::vector<std::vector<std::complex<double> > > Direct::H0_create(const double x)
{
std::complex<double> i = std::complex<double>(0.0,1.0);
std::cout << x << std::endl;
double U = sin(x*3.14159/2.0);
double Ux = cos(x*3.14159/2.0);
std::complex<double> S = alpha*alpha + beta*beta + i*R*alpha*U;
std::vector<std::vector<std::complex<double> > > H0(6);
for(int ii = 0; ii < 6; ii++)
{
H0[ii] = std::vector<std::complex<double> >(6);
}
H0[0][1] = 1.0;
H0[1][0] = S;
H0[1][2] = R*Ux;
H0[1][3] = i*alpha*R;
H0[2][0] = -i*alpha;
H0[2][4] = -i*beta;
H0[3][1] = -i*alpha/R;
H0[3][2] = -S/R;
H0[3][5] = -i*beta/R;
H0[4][5] = 1.0;
H0[5][3] = i*beta*R;
H0[5][4] = S;
return H0;
}
int main()
{
int N = 10;
double x0 = 1.0;
double xf = 0.0;
std::vector<double> x_ode(N);
double delta_x0 = (xf-x0)/(N-1.0);
for(int ii = 0; ii < N; ii++)
{
x_ode[ii] = x0 + ii*delta_x0;
}
x_ode[N-1] = xf;
std::vector<std::vector<std::complex<double> > > y_temp;
std::vector<double> x_temp;
std::complex<double> i = std::complex<double>(0.0,1.0);
std::complex<double> alpha = 0.001*i;
double beta = 0.45;
double R = 500.0;
std::complex<double> lambda = -sqrt(alpha*alpha + beta*beta + i*R*alpha);
// Define Initial Conditions
std::vector<std::complex<double> > ICs = {1, lambda, -i*alpha/lambda,0,0,0};
// Initialize ODE class
Direct direct(alpha,beta,R);
{
using namespace boost::numeric::odeint;
double abs_tol = 1.0e-10;
double rel_tol = 1.0e-6;
std::cout << "integrate_adaptive x values:\n";
size_t steps1 = integrate_adaptive(make_controlled<runge_kutta_cash_karp54<std::vector<std::complex<double> > > >(abs_tol, rel_tol), direct, ICs, x0, xf, delta_x0, observe(y_temp,x_temp));
std::cout << "\n\nintegrate_times x values:\n";
size_t steps2 = integrate_times(make_controlled<runge_kutta_cash_karp54<std::vector<std::complex<double> > > >(abs_tol, rel_tol), direct, ICs, x_ode.begin(), x_ode.end(), delta_x0, observe(y_temp,x_temp));
}
return 0;
}
I am compiling and running by using these commands:
g++ main.cpp -std=c++11; ./a.out
The code produces this output:
integrate_adaptive x values:
1
0.977778
0.966667
0.933333
0.888889
0.902778
0.888889
0.849758
0.830193
0.771496
0.693235
0.717692
0.693235
0.654104
0.634539
0.575842
0.497581
0.522037
0.497581
0.45845
0.438885
0.380188
0.301927
0.326383
0.301927
0.262796
0.24323
0.184534
0.106273
0.130729
0.106273
0.0850181
0.0743908
0.042509
0
0.0132841
integrate_times x values:
1
0.977778
0.966667
0.933333
0.888889
0.902778
0.888889
0.84944
0.829716
0.770543
0.691645
0.716301
0.777778
0.738329
0.718605
0.659432
0.580534
0.60519
0.666667
0.627218
0.607494
0.54832
0.469423
0.494078
0.555556
0.512422
0.490855
0.426154
0.339886
0.366845
0.444444
0.397898
0.374625
0.304806
0.211714
0.240805
0.333333
0.281908
0.256196
0.179058
0.0762077
0.108348
0.222222
0.170797
0.145085
0.0679468
-0.0349035
-0.00276275
0.111111
0.059686
0.0339734
-0.0431643
-0.146015
-0.113874
0.111111
0.0671073
0.0451054
-0.0209003
-0.108908
-0.0814056
The range of integration is from x = 1 to 0 but the ODE is being evaluated at x values less than 0 when using integrate_times.

This is a bug in odeint that is triggered by the negative time steps in your problem. I have created an issue on GitHub:
https://github.com/headmyshoulder/odeint-v2/issues/99
and I have implemented a fix. Please check out the latest odeint version from github and see if the problem remains. If so - feel free to open a new issue on github.
Thanks for pointing out that problem - and sorry for the bug.
Another note: I would suggest using a dense-output stepper for the integrate_times routine, as this is much more efficient (a factor of 2 in the best case). It basically does what you implemented as a fix above: it uses adaptive time steps and interpolates at the intermediate points as required.
