I was looking to move away from using a counter buffer for some compute shader routines, and ran into some unexpected behaviour on Nvidia cards.
I made a really simplified example (it does not make sense on its own, but it's the smallest case that reproduces the issue I encounter).
I want to perform conditional writes to several locations in a buffer (also for simplicity, I only run a single thread, since the behaviour can be reproduced that way too).
I will write 4 uints, then 2 uint3 (using InterlockedAdd to "simulate conditional writes").
So I use a single buffer (with raw access on the UAV), with the following simple layout:
0 -> First counter
4 -> Second counter
8 to 24 -> The 4 uints to write
24 to 48 -> The pair of uint3 to write
I also clear the buffer every frame (0 for each counter, and an arbitrary value for the rest, 12345 in this case).
I copy the buffer to a staging resource in order to check the values, so yes, my pipeline bindings are correct, but I can post that code if asked.
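For reference, the byte layout above can be pictured as something like this C-style struct on the CPU side (purely an illustration; the field names are mine):
#include <stdint.h>

/* CPU-side picture of the 48-byte raw buffer, offsets as listed above. */
typedef struct {
    uint32_t counter0;     /* offset 0 */
    uint32_t counter1;     /* offset 4 */
    uint32_t pass1[4];     /* offset 8:  the 4 uints written in the first pass */
    uint32_t pass2[2][3];  /* offset 24: the two uint3 written in the second pass */
} DebugBufferLayout;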
Now when I call the compute shader, performing only 4 increments, as here:
RWByteAddressBuffer RWByteBuffer : BACKBUFFER;

#define COUNTER0_LOCATION 0
#define COUNTER1_LOCATION 4
#define PASS1_LOCATION 8
#define PASS2_LOCATION 24

[numthreads(1,1,1)]
void CS(uint3 tid : SV_DispatchThreadID)
{
    uint i0, i1, i2, i3;
    RWByteBuffer.InterlockedAdd(COUNTER0_LOCATION, 1, i0);
    RWByteBuffer.Store(PASS1_LOCATION + i0 * 4, 10);
    RWByteBuffer.InterlockedAdd(COUNTER0_LOCATION, 1, i1);
    RWByteBuffer.Store(PASS1_LOCATION + i1 * 4, 20);
    RWByteBuffer.InterlockedAdd(COUNTER0_LOCATION, 1, i2);
    RWByteBuffer.Store(PASS1_LOCATION + i2 * 4, 30);
    RWByteBuffer.InterlockedAdd(COUNTER0_LOCATION, 1, i3);
    RWByteBuffer.Store(PASS1_LOCATION + i3 * 4, 40);
}
I then obtain the following results (formatted a little):
4,0,
10,20,30,40,
12345,12345,12345,12345,12345,12345,12345,12345,12345
This is correct: the first counter is 4 since I called InterlockedAdd 4 times and the second one was never touched; I get 10 through 40 in the right locations, and the rest keeps its default values.
Now if I want to reuse those indices in order to write them to another location:
[numthreads(1,1,1)]
void CS(uint3 tid : SV_DispatchThreadID)
{
    uint i0, i1, i2, i3;
    RWByteBuffer.InterlockedAdd(COUNTER0_LOCATION, 1, i0);
    RWByteBuffer.Store(PASS1_LOCATION + i0 * 4, 10);
    RWByteBuffer.InterlockedAdd(COUNTER0_LOCATION, 1, i1);
    RWByteBuffer.Store(PASS1_LOCATION + i1 * 4, 20);
    RWByteBuffer.InterlockedAdd(COUNTER0_LOCATION, 1, i2);
    RWByteBuffer.Store(PASS1_LOCATION + i2 * 4, 30);
    RWByteBuffer.InterlockedAdd(COUNTER0_LOCATION, 1, i3);
    RWByteBuffer.Store(PASS1_LOCATION + i3 * 4, 40);

    uint3 inds = uint3(i0, i1, i2);
    uint3 inds2 = uint3(i1, i2, i3);

    uint writeIndex;
    RWByteBuffer.InterlockedAdd(COUNTER1_LOCATION, 1, writeIndex);
    RWByteBuffer.Store3(PASS2_LOCATION + writeIndex * 12, inds);
    RWByteBuffer.InterlockedAdd(COUNTER1_LOCATION, 1, writeIndex);
    RWByteBuffer.Store3(PASS2_LOCATION + writeIndex * 12, inds2);
}
Now if I run that code on an Intel card (tried HD4000 and HD4600) or an ATI card (a 290), I get the expected results, e.g.:
4,2,
10,20,30,40,
0,1,2,1,2,3
But running it on NVidia (tried a 970m, a GTX 1080 and a GTX 570), I get the following:
4,2,
40,12345,12345,12345,
0,0,0,0,0,0
So it seems InterlockedAdd suddenly returns 0 as its original-value output (it still increments properly, since the counter ends up at 4), and because every returned index is 0, all four stores land in the first slot and we end up with 40, the last value written.
Also, we can see that only 0 got written for i1, i2, i3.
If I "reserve memory" instead, i.e. call InterlockedAdd only once per counter (incrementing by 4 and 2, respectively):
[numthreads(1,1,1)]
void CSB(uint3 tid : SV_DispatchThreadID)
{
    uint i0;
    RWByteBuffer.InterlockedAdd(COUNTER0_LOCATION, 4, i0);
    uint i1 = i0 + 1;
    uint i2 = i0 + 2;
    uint i3 = i0 + 3;

    RWByteBuffer.Store(PASS1_LOCATION + i0 * 4, 10);
    RWByteBuffer.Store(PASS1_LOCATION + i1 * 4, 20);
    RWByteBuffer.Store(PASS1_LOCATION + i2 * 4, 30);
    RWByteBuffer.Store(PASS1_LOCATION + i3 * 4, 40);

    uint3 inds = uint3(i0, i1, i2);
    uint3 inds2 = uint3(i1, i2, i3);

    uint writeIndex;
    RWByteBuffer.InterlockedAdd(COUNTER1_LOCATION, 2, writeIndex);
    uint writeIndex2 = writeIndex + 1;
    RWByteBuffer.Store3(PASS2_LOCATION + writeIndex * 12, inds);
    RWByteBuffer.Store3(PASS2_LOCATION + writeIndex2 * 12, inds2);
}
Then this works on all cards, but I have some cases when I have to rely on the earlier behaviour.
As a side note, if I use structured buffers with a counter flag on the UAV, instead of a location in a byte address buffer, and do:
RWStructuredBuffer<uint> rwCounterBuffer1;
RWStructuredBuffer<uint> rwCounterBuffer2;
RWByteAddressBuffer RWByteBuffer : BACKBUFFER;

#define PASS1_LOCATION 8
#define PASS2_LOCATION 24

[numthreads(1,1,1)]
void CS(uint3 tid : SV_DispatchThreadID)
{
    uint i0 = rwCounterBuffer1.IncrementCounter();
    uint i1 = rwCounterBuffer1.IncrementCounter();
    uint i2 = rwCounterBuffer1.IncrementCounter();
    uint i3 = rwCounterBuffer1.IncrementCounter();

    RWByteBuffer.Store(PASS1_LOCATION + i0 * 4, 10);
    RWByteBuffer.Store(PASS1_LOCATION + i1 * 4, 20);
    RWByteBuffer.Store(PASS1_LOCATION + i2 * 4, 30);
    RWByteBuffer.Store(PASS1_LOCATION + i3 * 4, 40);

    uint3 inds = uint3(i0, i1, i2);
    uint3 inds2 = uint3(i1, i2, i3);

    uint writeIndex1 = rwCounterBuffer2.IncrementCounter();
    uint writeIndex2 = rwCounterBuffer2.IncrementCounter();
    RWByteBuffer.Store3(PASS2_LOCATION + writeIndex1 * 12, inds);
    RWByteBuffer.Store3(PASS2_LOCATION + writeIndex2 * 12, inds2);
}
This works correctly across all cards, but it has all sorts of issues (which are off topic for this question).
This is running on DirectX 11 (I did not try DX12; that's not relevant to my use case, apart from plain curiosity).
So is it a bug on NVidia?
Or is there something wrong with the first approach?
I'm trying to create a binary perceptron classifier using SAS, to develop my skills with SAS. The data has been cleaned and split into training and test sets. Due to my inexperience, I expanded the label vector into a table of seven identical columns, corresponding to the seven weights, to make the calculations more straightforward; at least, given my limited experience, this seemed a usable method. Anyway, I run the following:
PROC IML;
W = {0, 0, 0, 0, 0, 0, 0};

USE Work.X_train;
XVarNames = {"Pclass" "Sex" "Age" "FamSize" "EmbC" "EmbQ" "EmbS"};
READ ALL VAR XVarNames INTO X_trn;

USE Work.y_train;
YVarNames = {"S1" "S2" "S3" "S4" "S5" "S6" "S7"};
READ ALL VAR YVarNames INTO y_trn;

DO i = 1 to 668;
    IF W`*X_trn[i] > 0 THEN Z = {1, 1, 1, 1, 1, 1, 1};
    ELSE Z = {0, 0, 0, 0, 0, 0, 0};
    W = W+(y_trn[i]`-Z)#X_trn[i]`;
END;

PRINT W;
RUN;
and the result is a column vector with seven entries, each having the value -2.373. The particular value isn't important, but clearly a weight vector comprised of identical values is not useful. The question, then, is: what error in the code am I making that produces this result?
My intuition is that something about how I am pulling each row of observations from X_trn and y_trn into the equation is producing this error. Otherwise, it might be due to the matrix arithmetic in the W = line, but the orientation of all of the vectors seems to be appropriate.
I can't find any info about this error on google, so I'm posting here to see if anyone knows.
Basically, my code has a snippet that looks something like this:
int rc = pthread_cond_timedwait(&cond, &mutex, &ts);
if ((0 != rc) && (ETIMEDOUT != rc)) {
    assert(false); // This should not happen.
}
Occasionally, my program will crash and the corefile will show that rc = 454.
454 does not map to any of the error codes in errno.h. In addition, looking at the list of possible return values that can be given by pthread_cond_timedwait(), none of them resemble 454.
I've looked into the parameters passed in, but I don't really know how to interpret them or where I would be able to learn how.
(gdb) p *mutex
$20 = {m_lock = {m_owner = 100179, m_flags = 0, m_ceilings = {0, 0}, m_spare = {0, 0, 0, 0}}, m_type = PTHREAD_MUTEX_ERRORCHECK, m_owner = 0x80a004c00, m_count = 0, m_refcount = 0, m_spinloops = 0, m_yieldloops = 0, m_qe = {tqe_next = 0x0, tqe_prev = 0x80a004f10}}
(gdb) p *cond
$21 = {c_lock = {m_owner = 0, m_flags = 0, m_ceilings = {0, 0}, m_spare = {0, 0, 0, 0}}, c_kerncv = {c_has_waiters = 1, c_flags = 0, c_spare = {0, 0}}, c_pshared = 0, c_clockid = 0}
(gdb) p ts
$22 = {tv_sec = 1400543215, tv_nsec = 0}
The internals of "cond" look suspicious to me but, as I mentioned, I have no way to be sure.
Since it's FreeBSD, we can look at the source to see where you're getting the mysterious 454 return from. Using the source archives at fxr.watson.org, I searched for the symbol pthread_cond_timedwait and the only credible references are in the GLIBC27 code, so we'll look there.
In the source file pthread_cond_timedwait.c, we see the function __pthread_cond_timedwait. There are three returns: the first is EINVAL, the second is the return value from __pthread_mutex_unlock_usercnt, and the third is the return value from __pthread_mutex_cond_lock. Now that I've given you the tools to find the answer, you can chase down the rest of it yourself. The 454 must have come from one of the unlock or lock calls.
The versioned_symbol macro at the bottom of the source file is what exposes the local __pthread_cond_timedwait as the global pthread_cond_timedwait function.
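If you want the next crash to carry more information, a minimal sketch along these lines (the wrapper name and the message format are mine) logs the unexpected code and its strerror text before asserting:
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical wrapper: wait on cond with a deadline and report any return
 * code other than the two expected ones (0 and ETIMEDOUT) before asserting. */
static int timedwait_checked(pthread_cond_t *cond, pthread_mutex_t *mutex,
                             const struct timespec *ts)
{
    int rc = pthread_cond_timedwait(cond, mutex, ts);
    if (rc != 0 && rc != ETIMEDOUT) {
        /* strerror() of an out-of-range value such as 454 usually prints
         * "Unknown error", but the raw number ends up in the log and core. */
        fprintf(stderr, "pthread_cond_timedwait returned %d (%s)\n",
                rc, strerror(rc));
        assert(!"unexpected pthread_cond_timedwait return code");
    }
    return rc;
}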
I want to copy some data from a buffer in global device memory to the local memory of a processing core, but with a twist.
I know about async_work_group_copy, and it's nice (or rather, it's clunky and annoying, but it works). However, my data is not contiguous: it is strided, i.e. there might be X bytes between every two consecutive Y bytes I want to copy.
Obviously I'm not going to copy all the useless data, and it might not even fit in my local memory. What can I do instead? I want to avoid writing actual kernel code to do the copying, e.g.
threadId = get_local_id(0);
if (threadId < length) {
    unsigned offset = threadId * stride;
    localData[threadId] = globalData[offset];
}
You can use the async_work_group_strided_copy() OpenCL API call.
Here is a small example in pyopencl, thanks to @DarkZeros' comment. Let's assume a small strip of an RGB image, say 4 by 1, like this:
img = np.array([58, 83, 39, 157, 190, 199, 64, 61, 5, 214, 141, 6], dtype=np.int32)
and if you want to access the four red channels, i.e. [58 157 64 214], you'd do:
def test_asyc_copy_stride_to_local(self):
    # Create context, queue, program first
    # (assumes "import numpy as np", "import pyopencl as cl" and "mf = cl.mem_flags")
    ....
    # number of R channels
    nb_of_el = 4
    img = np.array([58, 83, 39, 157, 190, 199, 64, 61, 5, 214, 141, 6], dtype=np.int32)
    cl_input = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=img)
    # buffer used to check if the copy is correct
    cl_output = cl.Buffer(ctx, mf.WRITE_ONLY, size=nb_of_el * np.dtype('int32').itemsize)
    lcl_buf = cl.LocalMemory(nb_of_el * np.dtype('int32').itemsize)
    prog.asynCopyToLocalWithStride(queue, (nb_of_el,), None, cl_input, cl_output, lcl_buf)
    result = np.zeros(nb_of_el, dtype=np.int32)
    cl.enqueue_copy(queue, result, cl_output).wait()
    print(result)
The kernel:
kernel void asynCopyToLocalWithStride(global int *in, global int *out, local int *localBuf){
    const int idx = get_global_id(0);
    localBuf[idx] = 0;
    // copy 4 elements, the stride = 3 (RGB)
    event_t ev = async_work_group_strided_copy(localBuf, in, 4, 3, 0);
    wait_group_events(1, &ev);
    out[idx] = localBuf[idx];
}
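For completeness, with the plain C host API the counterpart of cl.LocalMemory is simply a size with a NULL argument pointer when setting the local kernel argument; a minimal sketch (the helper and variable names are mine, error checking omitted) could look like:
#include <CL/cl.h>

/* Hypothetical helper: launch the kernel above with a single work group of
 * nb_of_el work items. Queue, kernel and buffers are assumed to exist. */
static void run_strided_copy(cl_command_queue queue, cl_kernel copy_kernel,
                             cl_mem input_buf, cl_mem output_buf, size_t nb_of_el)
{
    clSetKernelArg(copy_kernel, 0, sizeof(cl_mem), &input_buf);
    clSetKernelArg(copy_kernel, 1, sizeof(cl_mem), &output_buf);
    /* Local memory argument: pass only the size, the data pointer must be NULL. */
    clSetKernelArg(copy_kernel, 2, nb_of_el * sizeof(cl_int), NULL);
    size_t gws = nb_of_el, lws = nb_of_el;
    clEnqueueNDRangeKernel(queue, copy_kernel, 1, NULL, &gws, &lws, 0, NULL, NULL);
}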
I want to solve constraints that contain quantifiers using the Z3 C API.
I am struggling with functions like "Z3_mk_exists()", as I can't find any examples, either online or in the test examples in the tar file.
I don't exactly understand all the arguments required by these functions, or their exact significance.
Can anyone help?
Thanks.
Kaustubh.
Here is a complete example with universal quantifiers. Comments are inline:
Z3_config cfg = Z3_mk_config();
Z3_set_param_value(cfg, "MODEL", "true");
Z3_context ctx = Z3_mk_context(cfg);
Z3_sort intSort = Z3_mk_int_sort(ctx);
/* Some constant integers */
Z3_ast zero = Z3_mk_int(ctx, 0, intSort);
Z3_ast one = Z3_mk_int(ctx, 1, intSort);
Z3_ast two = Z3_mk_int(ctx, 2, intSort);
Z3_ast three = Z3_mk_int(ctx, 3, intSort);
Z3_ast four = Z3_mk_int(ctx, 4, intSort);
Z3_ast five = Z3_mk_int(ctx, 5, intSort);
We create an uninterpreted function for fibonacci: fib(n). We'll specify its meaning with a universal quantifier.
Z3_func_decl fibonacci = Z3_mk_fresh_func_decl(ctx, "fib", 1, &intSort, intSort);
/* fib(0) and fib(1) */
Z3_ast fzero = Z3_mk_app(ctx, fibonacci, 1, &zero);
Z3_ast fone = Z3_mk_app(ctx, fibonacci, 1, &one);
We're starting to specify the meaning of fib(n). The base cases don't require quantifiers. We have fib(0) = 0 and fib(1) = 1.
Z3_ast fib0 = Z3_mk_eq(ctx, fzero, zero);
Z3_ast fib1 = Z3_mk_eq(ctx, fone, one);
This is a bound variable. They're used within quantified expressions. Indices should start from 0. We have only one in this case.
Z3_ast x = Z3_mk_bound(ctx, 0, intSort);
This represents fib(_), where _ is the bound variable.
Z3_ast fibX = Z3_mk_app(ctx, fibonacci, 1, &x);
The pattern is what will trigger the instantiation. We use fib(_) again. This means (more or less) that Z3 will instantiate the axiom whenever it sees fib("some term").
Z3_pattern pattern = Z3_mk_pattern(ctx, 1, &fibX);
This symbol is only used for debugging as far as I understand. It gives a name to the _.
Z3_symbol someName = Z3_mk_int_symbol(ctx, 0);
/* _ > 1 */
Z3_ast xGTone = Z3_mk_gt(ctx, x, one);
Z3_ast xOne[2] = { x, one };
Z3_ast xTwo[2] = { x, two };
/* _ - 1 */
Z3_ast fibXminusOne = Z3_mk_sub(ctx, 2, xOne);
/* _ - 2 */
Z3_ast fibXminusTwo = Z3_mk_sub(ctx, 2, xTwo);
Z3_ast toSum[2] = { Z3_mk_app(ctx, fibonacci, 1, &fibXminusOne), Z3_mk_app(ctx, fibonacci, 1, &fibXminusTwo) };
/* f(_ - 1) + f(_ - 2) */
Z3_ast fibSum = Z3_mk_add(ctx, 2, toSum);
This is now the body of the axiom. It says: _ > 1 => fib(_) = fib(_ - 1) + fib(_ - 2), where _ is the bound variable.
Z3_ast axiomTree = Z3_mk_implies(ctx, xGTone, Z3_mk_eq(ctx, fibX, fibSum));
At last we can build the quantifier, using the pattern, the bound variable, its name and the axiom body (Z3_TRUE says it's a forall quantifier). The 0 in the argument list is the weight, which influences instantiation priority; the Z3 docs recommend using 0 if you don't know what to put.
Z3_ast fibN = Z3_mk_quantifier(ctx, Z3_TRUE, 0, 1, &pattern, 1, &intSort, &someName, axiomTree);
We finally add the axiom(s) to the context.
Z3_assert_cnstr(ctx, fib0);
Z3_assert_cnstr(ctx, fib1);
Z3_assert_cnstr(ctx, fibN);
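From here you would typically check satisfiability and inspect the model. A minimal sketch using the same deprecated pre-4.0 C API as the code above (the fib(5) = 5 query is just an illustration, and printf needs <stdio.h>):
/* Is fib(5) = 5 consistent with the axioms above? ("five" was created earlier.) */
Z3_ast fibFive = Z3_mk_app(ctx, fibonacci, 1, &five);
Z3_assert_cnstr(ctx, Z3_mk_eq(ctx, fibFive, five));

Z3_model model = 0;
Z3_lbool result = Z3_check_and_get_model(ctx, &model);
printf("result: %s\n", result == Z3_L_TRUE ? "sat"
                     : result == Z3_L_FALSE ? "unsat" : "unknown");
if (model) {
    /* Print the model, which includes an interpretation for fib. */
    printf("%s\n", Z3_model_to_string(ctx, model));
    Z3_del_model(ctx, model);
}
Z3_del_context(ctx);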