Loop unrolling in Metal kernels (iOS)

I need to force the Metal compiler to unroll a loop in my kernel compute function. So far I've tried putting #pragma unroll(num_times) before a for loop, but the compiler ignores that directive.
It seems the compiler doesn't unroll loops automatically: I compared execution times for 1) code with a for loop and 2) the same code with the loop unrolled by hand. The hand-unrolled version was 3 times faster.
E.g.: I want to go from this:
for (int i=0; i<3; i++) {
    do_stuff();
}
to this:
do_stuff();
do_stuff();
do_stuff();
Is there even something like loop unrolling in Metal's C++-based language? If so, how can I let the compiler know that I want a loop unrolled?

Metal's shading language is a subset of C++11, so you can try using template metaprogramming to unroll loops. The following compiles in Metal, though I haven't had time to test it properly:
template <unsigned N> struct unroll {
    template <class F>
    static void call(F f) {
        f();
        unroll<N-1>::call(f);
    }
};

template <> struct unroll<0u> {
    template <class F>
    static void call(F f) {}
};

kernel void test() {
    unroll<3>::call(do_stuff);
}
Please let me know if it works! You'll probably have to add some arguments to call to pass arguments to do_stuff.
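For instance, here is an untested sketch of one way to forward arguments through the recursion. The thread-index parameter and the do_stuff(uint) signature are made up for illustration; only the forwarding pattern is the point:
template <unsigned N> struct unroll {
    template <class F, class... Args>
    static void call(F f, Args... args) {
        f(args...);                       // body of the "current" iteration
        unroll<N-1>::call(f, args...);    // recurse; each level becomes a separate call site
    }
};

template <> struct unroll<0u> {
    template <class F, class... Args>
    static void call(F, Args...) {}      // base case: nothing left to unroll
};

// Hypothetical usage inside a kernel, assuming do_stuff(uint) exists:
kernel void test(uint tid [[thread_position_in_grid]]) {
    unroll<3>::call(do_stuff, tid);
}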
See also: Self-unrolling macro loop in C/C++

Related

Variadic Dispatch Function

I have an interface wherein the types of the parameters mostly encode their own meanings. I have a function that takes one of these parameters. I'm trying to make a function that takes a set of these parameters and performs the function on each one in order.
#include <iostream>
#include <vector>
enum param_type{typeA,typeB};

template <param_type PT> struct Container{
    int value;
    Container(int v):value(v){}
};

int f(Container<typeA> param){
    std::cout<<"Got typeA with value "<<param.value<<std::endl;
    return param.value;
}

int f(Container<typeB> param){
    std::cout<<"Got typeB with value "<<param.value<<std::endl;
    return param.value;
}
My current solution uses a recursive variadic template to delegate the work.
void g(){}

template <typename T,typename...R>
void g(T param,R...rest){
    f(param);
    g(rest...);
}
I would like to use a packed parameter expansion, but I can't seem to get that to work without also using the return values. (In my particular case the functions are void.)
template <typename...T> // TODO: Use concepts once they exist.
void h(T... params){
    // f(params);...
    // f(params)...; // Fail to compile.
    // {f(params)...};
    std::vector<int> v={f(params)...}; // Works
}
Example usage
int main(){
    auto a=Container<typeA>(5);
    auto b=Container<typeB>(10);
    g(a,b);
    h(a,b);
    return 0;
}
Is there an elegant syntax for this expansion in C++?
In C++17: use a fold expression with the comma operator.
template <typename... Args>
void g(Args... args)
{
    ((void)f(args), ...);
}
Before C++17: follow each call with the comma operator and 0, then expand into the braced initializer list of an int array. The extra 0 ensures that a zero-sized array is not created when the pack is empty.
template <typename... Args>
void g(Args... args)
{
    int arr[] {0, ((void)f(args), 0)...};
    (void)arr; // suppress unused variable warning
}
In both cases, the function call expression is cast to void to avoid accidentally invoking a user-defined operator,.
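As a small illustration of that last point (the Tracked type and its comma overload are invented for the example), the unparenthesized calls could otherwise pick up a user-defined overload:
#include <iostream>

struct Tracked {
    // A user-defined comma operator; without the (void) casts, the expansion
    // (f(args), ...) over Tracked values could select this overload.
    Tracked operator,(Tracked) const {
        std::cout << "user-defined operator, selected\n";
        return *this;
    }
};

Tracked f(int x) { std::cout << "f(" << x << ")\n"; return Tracked{}; }

template <typename... Args>
void g(Args... args)
{
    ((void)f(args), ...); // void operands force the built-in comma operator
}

int main() { g(1, 2, 3); }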

FatalExecutionEngineError on accessing a pointer set with memcpy_s

See update 1 below for my guess as to why the error is happening
I'm trying to develop an application with some C#/WPF and C++. I am having a problem on the C++ side, in a part of the code that involves optimizing an object using GNU Scientific Library (GSL) optimization functions. I will avoid including any of the C#/WPF/GSL code to keep this question more generic, and because the problem is within my C++ code.
For the minimal, complete and verifiable example below, here is what I have: a class Foo and a class Optimizer. An object of class Optimizer is a member of class Foo, so that objects of Foo can optimize themselves when required.
GSL optimization functions take in external parameters through a void pointer. I first define a struct Params to hold all the required parameters. Then I define an object of Params and convert it into a void pointer. A copy of this data is made with memcpy_s, and a member void pointer optimParamsPtr of the Optimizer class points to it so the parameters can be accessed when the optimizer runs later. When optimParamsPtr is accessed by CostFn(), I get the following error.
Managed Debugging Assistant 'FatalExecutionEngineError' : 'The runtime
has encountered a fatal error. The address of the error was at
0x6f25e01e, on thread 0x431c. The error code is 0xc0000005. This error
may be a bug in the CLR or in the unsafe or non-verifiable portions of
user code. Common sources of this bug include user marshaling errors
for COM-interop or PInvoke, which may corrupt the stack.'
Just to check the validity of the void pointer I made, I call CostFn() at line 81 with the void * pointer passed as an argument to InitOptimizer(), and everything works. But at line 85, when the same CostFn() is called with optimParamsPtr pointing to the data copied by memcpy_s, I get the error. So I am guessing something goes wrong in the memcpy_s step. Does anyone have any ideas as to what?
#include "pch.h"
#include <iostream>
using namespace System;
using namespace System::Runtime::InteropServices;
using namespace std;
// An optimizer for various kinds of objects
class Optimizer // GSL requires this to be an unmanaged class
{
public:
double InitOptimizer(int ptrID, void *optimParams, size_t optimParamsSize);
void FreeOptimizer();
void * optimParamsPtr;
private:
double cost = 0;
};
ref class Foo // A class whose objects can be optimized
{
private:
int a; // An internal variable that can be changed to optimize the object
Optimizer *fooOptimizer; // Optimizer for a Foo object
public:
Foo(int val) // Constructor
{
a = val;
fooOptimizer = new Optimizer;
}
~Foo()
{
if (fooOptimizer != NULL)
{
delete fooOptimizer;
}
}
void SetA(int val) // Mutator
{
a = val;
}
int GetA() // Accessor
{
return a;
}
double Optimize(int ptrID); // Optimize object
// ptrID is a variable just to change behavior of Optimize() and show what works and what doesn't
};
ref struct Params // Parameters required by the cost function
{
int cost_scaling;
Foo ^ FooObj;
};
double CostFn(void *params) // GSL requires cost function to be of this type and cannot be a member of a class
{
// Cast void * to Params type
GCHandle h = GCHandle::FromIntPtr(IntPtr(params));
Params ^ paramsArg = safe_cast<Params^>(h.Target);
h.Free(); // Deallocate
// Return the cost
int val = paramsArg->FooObj->GetA();
return (double)(paramsArg->cost_scaling * val);
}
double Optimizer::InitOptimizer(int ptrID, void *optimParamsArg, size_t optimParamsSizeArg)
{
optimParamsPtr = ::operator new(optimParamsSizeArg);
memcpy_s(optimParamsPtr, optimParamsSizeArg, optimParamsArg, optimParamsSizeArg);
double ret_val;
// Here is where the GSL stuff would be. But I replace that with a call to CostFn to show the error
if (ptrID == 1)
{
ret_val = CostFn(optimParamsArg); // Works
}
else
{
ret_val = CostFn(optimParamsPtr); // Doesn't work
}
return ret_val;
}
// Release memory used by unmanaged variables in Optimizer
void Optimizer::FreeOptimizer()
{
if (optimParamsPtr != NULL)
{
delete optimParamsPtr;
}
}
double Foo::Optimize(int ptrID)
{
// Create and initialize params object
Params^ paramsArg = gcnew Params;
paramsArg->cost_scaling = 11;
paramsArg->FooObj = this;
// Convert Params type object to void *
void * paramsArgVPtr = GCHandle::ToIntPtr(GCHandle::Alloc(paramsArg)).ToPointer();
size_t paramsArgSize = sizeof(paramsArg); // size of memory block in bytes pointed to by void pointer
double result = 0;
// Initialize optimizer
result = fooOptimizer->InitOptimizer(ptrID, paramsArgVPtr, paramsArgSize);
// Here is where the loop that does the optimization will be. Removed from this example for simplicity.
return result;
}
int main()
{
Foo Foo1(2);
std::cout << Foo1.Optimize(1) << endl; // Use orig void * arg in line 81 and it works
std::cout << Foo1.Optimize(2) << endl; // Use memcpy_s-ed new void * public member of Optimizer in line 85 and it doesn't work
}
Just to reiterate: I need to copy the params to a member of the optimizer because the optimizer will run throughout the lifetime of the Foo object, so the data needs to exist as long as the Optimizer object exists and not just within the scope of Foo::Optimize().
/clr support needs to be selected in the project properties for the code to compile. I am running on an x64 solution platform.
Update 1: While trying to debug this, I got suspicious of the way I get the size of paramsArg at line 109. It looks like I am getting the size of paramsArg as the size of the int cost_scaling plus the size of the memory storing the address of FooObj, instead of the size of the memory storing FooObj itself. I realized this after stumbling across this answer to another post. I confirmed it by checking the value of paramsArgSize after adding some dummy double members to the Foo class; as expected, the value doesn't change. I suppose this explains why I get the error. A solution would be to write code that correctly calculates the size of a Foo class object and pass that instead of using sizeof, but that is turning out to be too complicated and is probably another question in itself; for example, how does one get the size of a ref class object? Anyway, hopefully someone will find this helpful.
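As a rough plain-C++ analogy of what the update describes (ordinary structs stand in for the managed types here), sizeof on the wrapper only counts the reference, never the object it refers to:
#include <cstdio>

struct Inner  { double a, b, c; };               // stands in for the Foo object
struct Params { int cost_scaling; Inner *obj; }; // obj is only a reference to it

int main() {
    // On a typical 64-bit target this prints 16 and 24: the Inner object's
    // size never shows up in sizeof(Params), so copying sizeof(Params) bytes
    // copies the pointer value, not the object behind it.
    std::printf("sizeof(Params) = %zu, sizeof(Inner) = %zu\n",
                sizeof(Params), sizeof(Inner));
    return 0;
}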

OpenMP task with and without parallel

Node *head = &node1;
while (head)
{
    #pragma omp task
    cout<<head->value<<endl;
    head = head->next;
}

#pragma omp parallel
{
    #pragma omp single
    {
        Node *head = &node1;
        while (head)
        {
            #pragma omp task
            cout<<head->value<<endl;
            head = head->next;
        }
    }
}
In the first block I just create tasks without a parallel directive, while in the second block I use the parallel directive and the single directive, which is the common pattern I have seen in papers.
I wonder what the difference between them is. BTW, I know the basic meaning of these directives.
The code in my comment:
void traverse(node *root)
{
    if (root->left)
    {
        #pragma omp task
        traverse(root->left);
    }
    if (root->right)
    {
        #pragma omp task
        traverse(root->right);
    }
    process(root);
}
The difference is that in the first block you are not really creating any tasks, since the block itself is not nested (neither syntactically nor dynamically) inside an active parallel region. In the second block the task construct is syntactically nested inside the parallel region and would queue explicit tasks if the region happens to be active at run time (an active parallel region is one that executes with a team of more than one thread). Dynamic nesting is less obvious. Observe the following example:
void foo(void)
{
    int i;
    for (i = 0; i < 10; i++)
        #pragma omp task
        bar();
}

int main(void)
{
    foo();

    #pragma omp parallel num_threads(4)
    {
        #pragma omp single
        foo();
    }

    return 0;
}
The first call to foo() happens outside of any parallel regions. Hence the task directive does (almost) nothing and all calls to bar() happen serially. The second call to foo() comes from inside the parallel region and hence new tasks would be generated inside foo(). The parallel region is active since the number of threads was fixed to 4 by the num_threads(4) clause.
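If in doubt, the runtime can report whether a given call site is inside an active parallel region. A small probe (a sketch, assuming a compiler with OpenMP enabled, e.g. g++ -fopenmp) might look like this:
#include <cstdio>
#include <omp.h>

// omp_in_parallel() returns non-zero only inside an active parallel region,
// i.e. exactly when the task directive in foo() would actually defer work.
void probe(const char *where)
{
    std::printf("%s: in_parallel=%d, threads=%d\n",
                where, omp_in_parallel(), omp_get_num_threads());
}

int main()
{
    probe("serial call");              // in_parallel=0, threads=1

    #pragma omp parallel num_threads(4)
    {
        #pragma omp single
        probe("inside parallel");      // in_parallel=1, threads=4
    }
    return 0;
}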
This different behaviour of the OpenMP directives is a design feature. The main idea is to be able to write code that can execute both serially and in parallel.
Still, the presence of the task construct in foo() means some code transformation takes place, e.g. foo() is transformed into something like:
void foo_omp_fn_1(void *omp_data)
{
    bar();
}

void foo(void)
{
    int i;
    for (i = 0; i < 10; i++)
        OMP_make_task(foo_omp_fn_1, NULL);
}
Here OMP_make_task() is a hypothetical (not publicly available) function from the OpenMP support library that queues a call to the function supplied as its first argument. If OMP_make_task() detects that it is running outside an active parallel region, it simply calls foo_omp_fn_1() instead. This adds some overhead to the call to bar() in the serial case: instead of main -> foo -> bar, the call goes main -> foo -> OMP_make_task -> foo_omp_fn_1 -> bar. The implication is slower execution of the serial code.
This is even more obviously illustrated with the worksharing directive:
void foo(void)
{
    int i;
    #pragma omp for
    for (i = 0; i < 12; i++)
        bar();
}

int main(void)
{
    foo();

    #pragma omp parallel num_threads(4)
    {
        foo();
    }

    return 0;
}
The first call to foo() would run the loop serially. The second call would distribute the 12 iterations among the 4 threads, i.e. each thread would only execute 3 iterations. Once again, some code transformation magic is used to achieve this, and the serial loop would run slower than if no #pragma omp for were present in foo().
The lesson here is to never add OpenMP constructs where they are not really necessary.

CUDA cudaMemcpy: invalid argument

Here is my code:
struct S {
    int a, b;
    float c, d;
};

class A {
private:
    S* d;
    S h[3];
public:
    A() {
        cutilSafeCall(cudaMalloc((void**)&d, sizeof(S)*3));
    }
    void Init();
};

void A::Init() {
    for (int i=0;i<3;i++) {
        h[i].a = 0;
        h[i].b = 1;
        h[i].c = 2;
        h[i].d = 3;
    }
    cutilSafeCall(cudaMemcpy(d, h, 3*sizeof(S), cudaMemcpyHostToDevice));
}

A a;
A a;
In fact it is part of a complex program which contains CUDA and OpenGL. When I debug this program, it fails at the cudaMemcpy call with the error information
cudaSafeCall() Runtime API error 11: invalid argument.
Actually, this program was transformed from another one that runs correctly. But in that one, I used the two variables S* d and S h[3] in the main function instead of in a class. What is weirder is that when I implement this class A in a small standalone program, it works fine.
I've also updated my driver, but the error still exists.
Could anyone give me a hint on why this happens and how to solve it? Thanks.
Because the memory operations in CUDA are blocking, they create a synchronization point. So other errors, if not checked with cudaThreadSynchronize, will show up as errors on the memory calls.
So if an error is received on a memory operation, try to place a cudaThreadSynchronize before it and check the result.
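A minimal sketch of that check, reusing d, h, and S from the question (cudaDeviceSynchronize is the non-deprecated spelling of cudaThreadSynchronize):
// Flush any error left over from earlier asynchronous work before blaming the copy.
cudaError_t err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    printf("pending error before memcpy: %s\n", cudaGetErrorString(err));

// Now the status returned by the copy itself is meaningful.
err = cudaMemcpy(d, h, 3 * sizeof(S), cudaMemcpyHostToDevice);
if (err != cudaSuccess)
    printf("cudaMemcpy failed: %s\n", cudaGetErrorString(err));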
Be sure that the first cudaMalloc statement is actually being executed. If it is a problem with the initialization of CUDA, as @harrism indicated, then it would fail in this statement. Try placing printf statements and check whether proper initialization is performed. I think invalid-argument errors are generally caused by using uninitialized memory areas.
Add a printf to your constructor showing the address of the cudaMalloc'ed memory area:
A()
{
    d = NULL;
    cutilSafeCall(cudaMalloc((void**)&d, sizeof(S)*3));
    printf("D: %p\n", d);
}
Try making the memory copy with an area that is allocated locally, namely move the cudaMalloc to just above the cudaMemcpy (just for testing):
void A::Init()
{
    for (int i=0;i<3;i++)
    {
        h[i].a = 0;
        h[i].b = 1;
        h[i].c = 2;
        h[i].d = 3;
    }
    cutilSafeCall(cudaMalloc((void**)&d, sizeof(S)*3)); // here!..
    cutilSafeCall(cudaMemcpy(d, h, 3*sizeof(S), cudaMemcpyHostToDevice));
}
Good luck.

what is wrong with following pthread program?

I am not able to get this pthreads program in C to work. Please tell me what is wrong with the following program. I am getting neither an error nor the expected output.
void *worker(void * arg)
{
    int i;
    int *id=(int *)arg;
    printf("Thread %d starts\n", *id );
}

void main(int argc, char **argv)
{
    int thrd_no,i,*thrd_id,rank=0;
    void *exit_status;
    pthread_t *threads;

    thrd_no=atoi(argv[1]-1);
    thrd_id= malloc(sizeof(int)*(thrd_no));
    threads=malloc(sizeof(pthread_t)*(thrd_no));

    for(i=0;i<thrd_no;i++)
    {
        rank=i+1;
        thrd_id[i]=pthread_create(&threads[i], NULL, worker, &rank);
    }
    for(i=0;i<thrd_no;i++)
    {
        pthread_join(threads[i], &exit_status);
    }
}
thrd_no = atoi(argv[1] - 1); likely doesn't do what you intended; the way argv is normally passed into a new process and parsed into a C array, argv[1] - 1 is probably pointing at \0 (specifically, the \0 at the end of argv[0]). (More generally, indexing backwards off the start of a string is rarely correct.) The result is that atoi() will return 0 and no threads will be created. What did you actually intend to do there?
You are passing the same address, &rank, to each thread, so id and *id are the same for all your workers.
You would do better to allocate on the heap the data you pass to each worker routine.
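A minimal sketch of that heap-based approach (untested; the thread count is hard-coded for illustration instead of parsing argv):
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

/* Each thread receives its own int*, so there is no shared "rank" to race on. */
static void *worker(void *arg)
{
    int *id = (int *)arg;
    printf("Thread %d starts\n", *id);
    free(id);              /* the worker owns its argument */
    return NULL;
}

int main(void)
{
    enum { THRD_NO = 4 };  /* fixed count for the sketch */
    pthread_t threads[THRD_NO];

    for (int i = 0; i < THRD_NO; i++)
    {
        int *rank = (int *)malloc(sizeof *rank);
        *rank = i + 1;
        pthread_create(&threads[i], NULL, worker, rank);
    }
    for (int i = 0; i < THRD_NO; i++)
        pthread_join(threads[i], NULL);

    return 0;
}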
Alternatively, you might include <stdint.h> and use intptr_t to pass the rank by value, e.g.
void *worker(void *p)
{
    intptr_t rk = (intptr_t) p;
    // etc.
}
and call
intptr_t rank = i + 1;
thrd_id[i]=pthread_create(&threads[i], NULL, worker, (void*)rank);
You should learn to use a debugger and compile with all warnings and debug information, i.e. gcc -Wall -g (and improve your code till it gets no warnings, then use gdb)
The code segment
rank=i+1;
thrd_id[i]=pthread_create(&threads[i], NULL, worker, &rank);
will produce a race condition.
