Why are NaNs being produced when assigning values in a cube? - Armadillo

I'm having a weird issue with Armadillo and RcppArmadillo. I'm creating a cube filled with zeros, and I want specific elements to be set to one. However, when I use an assignment to do that, the values of other elements change slightly and often become NaN. Does anyone have any idea what could be causing this?
#include <RcppArmadillo.h>
using namespace arma;
// [[Rcpp::depends(RcppArmadillo)]]
// [[Rcpp::export]]
cube testc() {
    cube tester = cube(10,10,2);
    uvec indexes = {25,125};
    for(unsigned int i=0; i<indexes.n_elem; i++) {
        tester(indexes(i)) = 1.0;
    }
    cout << tester;
    return tester;
}
This error does not happen when I assign each element individually (tester(25)=1.0 followed by tester(125)=1.0), but that is impractical if I have a larger number of elements to replace. The NaNs show up both in the cout output and in the R object, which makes me think the issue is independent of Rcpp.

Your cube object is not initialized with zeros, so its elements hold arbitrary values, which may include NaN.
From the documentation:
cube(n_rows, n_cols, n_slices)     (memory is not initialised)
cube(n_rows, n_cols, n_slices, fill_type)     (memory is initialised)
When using the cube(n_rows, n_cols, n_slices) or cube(size(X)) constructors, by default the memory is uninitialised (ie. may contain garbage); memory can be explicitly initialised by specifying the fill_type, as per the Mat class (except for fill::eye)
Examples of explicit initialization with zeros:
cube A(10,10,2,fill::zeros);
cube B(10,10,2);
B.zeros();
cube C;
C.zeros(10,10,2);
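To illustrate why "garbage" memory can show up as NaN, here is a minimal sketch in plain C++ (no Armadillo needed). It assumes IEEE 754 doubles, and the helper names bytes_as_doubles, simulated_garbage and zero_filled are made up for this example. Real uninitialised memory is unpredictable, so the all-0xFF pattern below is only a stand-in for it.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstring>
#include <vector>

// Reinterpret raw bytes as doubles, the way an uninitialised buffer is read.
// bytes.size() must be a multiple of sizeof(double).
std::vector<double> bytes_as_doubles(const std::vector<unsigned char>& bytes) {
    assert(bytes.size() % sizeof(double) == 0);
    std::vector<double> out(bytes.size() / sizeof(double));
    if (!out.empty()) {
        std::memcpy(out.data(), bytes.data(), bytes.size());
    }
    return out;
}

// Stand-in for garbage: every byte is 0xFF. For an IEEE 754 double this means
// an all-ones exponent and a non-zero mantissa, which is a NaN.
std::vector<double> simulated_garbage(std::size_t n) {
    return bytes_as_doubles(std::vector<unsigned char>(n * sizeof(double), 0xFF));
}

// All-zero bytes decode to 0.0, the value that fill::zeros stores.
std::vector<double> zero_filled(std::size_t n) {
    return bytes_as_doubles(std::vector<unsigned char>(n * sizeof(double), 0x00));
}
```

Checking the results with std::isnan shows that every element of simulated_garbage(n) is NaN, while zero_filled(n) contains only 0.0.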


How to bind a variable number of textures to Metal shader?

On the CPU I'm gathering an array of MTLTexture objects that I want to send to the fragment shader. There can be any number of these textures at any given moment. How can I send a variable-length array of MTLTextures to a fragment shader?
var txrs: [MTLTexture] = []
for ... {
    ...
}
// Send array of textures to fragment shader.
fragment half4 my_fragment(Vertex v [[stage_in]], <array of textures>, ...) {
    for(int i = 0; i < num_textures; i++) {
        texture2d<half> txr = array_of_textures[i];
        ...
    }
}
The array approach suggested in the other answer won't work, because each texture takes up its own bind point, and you will run out once you reach 31.
Instead, you need to use argument buffers.
For this to work, you need tier 2 argument buffer support. You can check for it with the argumentBuffersSupport property on an MTLDevice.
You can read more about argument buffers here or watch this talk about bindless rendering.
The basic idea is to use MTLArgumentEncoder to encode the textures you need into an argument buffer. Unfortunately, I don't think there's a direct way to just encode a bunch of MTLTextures, so instead you create a struct in your shader like this:
struct SingleTexture
{
    texture2d<half> texture;
};
The texture in this struct has an implicit id of 0. To learn more about id, read the Argument Buffers section in the spec, but it's basically a unique index for each entry in the argument buffer (AB).
Then, change your function signature to
fragment half4 my_fragment(Vertex v [[stage_in]], device ushort& textureCount [[ buffer(0) ]], device SingleTexture* textures [[ buffer(1) ]])
You will then need to bind the count as a 2-byte (or 4-byte) buffer; in most cases uint16_t is enough and you don't need uint32_t. You can use the set<...>Bytes function on an encoder for that.
Then, you will need to compile that function to an MTLFunction, and from it you can create an MTLArgumentEncoder using the newArgumentEncoderForBufferAtIndex method. You will use buffer index 1 in this case, because that's where your AB is bound in the function.
From the MTLArgumentEncoder you can get encodedLength, which is basically the size of one SingleTexture struct in the AB. Multiply it by the number of textures to get the size of the buffer you need to encode your argument buffer into.
After that, in your setup code, you can just do this:
for(size_t i = 0; i < textureCount; i++)
{
    // We basically just offset into an array of SingleTexture
    [argumentEncoder setArgumentBuffer:<your buffer you just created> offset:argumentEncoder.encodedLength * i];
    [argumentEncoder setTexture:textures[i] atIndex:0];
}
When you are done encoding the buffer, you can hold on to it until your texture array changes (you don't need to re-encode it every frame).
Then, you need to bind the argument buffer to buffer binding point 1, just as you would bind any other buffer.
The last thing you need to do is make sure all the resources referenced indirectly are resident on the GPU. Since you encoded your textures into the AB, the driver has no way of knowing whether you used them or not, because you are not binding them directly.
To do that, use useResource or the useResources variation on the encoder you are using, like this:
[encoder useResources:&textures[0] count:textureCount usage:MTLResourceUsageRead];
This is a bit of a mouthful, but it is the proper way to bind anything you want to your shaders.

In Lua Torch, the product of two zero matrices has nan entries

I have encountered a strange behavior of the torch.mm function in Lua/Torch. Here is a simple program that demonstrates the problem.
iteration = 0;
a = torch.Tensor(2, 2);
b = torch.Tensor(2, 2);
prod = torch.Tensor(2,2);
a:zero();
b:zero();
repeat
   prod = torch.mm(a,b);
   ent = prod[{2,1}];
   iteration = iteration + 1;
until ent ~= ent
print ("error at iteration " .. iteration);
print (prod);
The program consists of one loop, in which the program multiplies two zero 2x2 matrices and tests if entry ent of the product matrix is equal to nan. It seems that the program should run forever since the product should always be equal to 0, and hence ent should be 0. However, the program prints:
error at iteration 548
0.000000 0.000000
nan nan
[torch.DoubleTensor of size 2x2]
Why is this happening?
The problem disappears if I replace prod = torch.mm(a,b) with torch.mm(prod,a,b), which suggests that something is wrong with the memory allocation.
My version of Torch was compiled without BLAS & LAPACK libraries. After I recompiled torch with OpenBLAS, the problem disappeared. However, I am still interested in its cause.
The part of code that auto-generates the Lua wrapper for torch.mm can be found here.
When you write prod = torch.mm(a,b) within your loop it corresponds to the following C code behind the scenes (generated by this wrapper thanks to cwrap):
/* this is the tensor that will hold the results */
arg1 = THDoubleTensor_new();
THDoubleTensor_resize2d(arg1, arg5->size[0], arg6->size[1]);
arg3 = arg1;
/* .... */
luaT_pushudata(L, arg1, "torch.DoubleTensor");
/* effective matrix multiplication operation that will fill arg1 */
So:
1. a new result tensor is created and resized with the proper dimensions,
2. this new tensor is NOT initialized, i.e. there is no calloc or explicit fill here, so it points to junk memory and could contain NaNs,
3. this tensor is pushed on the stack so as to be available on the Lua side as the return value.
The last point means that this returned tensor is different from the initial prod one (i.e. within the loop, prod shadows the initial value).
On the other hand calling torch.mm(prod,a,b) does use your initial prod tensor to store the results (behind the scenes there is no need to create a dedicated tensor in that case). Since in your code snippet you do not initialize / fill it with given values it could also contain junk.
In both cases the core operation is a gemm multiplication, C = beta * C + alpha * A * B, with beta=0 and alpha=1. The naive implementation looks like this:
real *a_ = a;
for(i = 0; i < m; i++)
{
  real *b_ = b;
  for(j = 0; j < n; j++)
  {
    real sum = 0;
    for(l = 0; l < k; l++)
      sum += a_[l*lda]*b_[l];
    b_ += ldb;
    /*
     * WARNING: beta*c[j*ldc+i] could give NaN even if beta=0
     * if the other operand c[j*ldc+i] is NaN!
     */
    c[j*ldc+i] = beta*c[j*ldc+i]+alpha*sum;
  }
  a_++;
}
Comments are mine.
1. With torch.mm(a,b): at each iteration, a new result tensor is created without being initialized (it could contain NaNs). So every iteration presents a risk of returning NaNs (see the warning above).
2. With torch.mm(prod,a,b): there is the same risk, since you did not initialize the prod tensor. BUT: this risk only exists at the first iteration of the repeat / until loop, since right after that prod is filled with 0s and re-used for the subsequent iterations.
So this is why you do not observe a problem here (it is less frequent).
In case 1: this should be improved at the Torch level, i.e. make sure the wrapper initializes the output (e.g. with THDoubleTensor_fill(arg1, 0);).
In case 2: you should initialize prod up front and use the torch.mm(prod,a,b) construct to avoid any NaN problem.
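To make the beta * C issue concrete, here is a minimal standalone C++ sketch. It is not the Torch code: naive_gemm and its skip_c_if_beta_zero flag are hypothetical names, and the matrices are assumed square and column-major for brevity. With the flag off it reproduces the NaN propagation; with the flag on it never reads C when beta is 0. As far as I know, standard BLAS implementations document the same convention (C need not be set on input when beta is zero), which would be consistent with the problem disappearing after linking OpenBLAS.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Naive column-major C = beta * C + alpha * A * B for square n x n matrices.
// If skip_c_if_beta_zero is true, C is not read when beta == 0, so junk
// (including NaN) left in an uninitialised C cannot leak into the result.
void naive_gemm(std::size_t n, double alpha, const std::vector<double>& a,
                const std::vector<double>& b, double beta,
                std::vector<double>& c, bool skip_c_if_beta_zero) {
    assert(a.size() == n * n && b.size() == n * n && c.size() == n * n);
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (std::size_t l = 0; l < n; ++l) {
                sum += a[l * n + i] * b[j * n + l];  // A(i,l) * B(l,j)
            }
            if (skip_c_if_beta_zero && beta == 0.0) {
                c[j * n + i] = alpha * sum;  // old value of C is never read
            } else {
                // 0 * NaN is NaN, so a NaN already in C survives beta == 0
                c[j * n + i] = beta * c[j * n + i] + alpha * sum;
            }
        }
    }
}
```

Calling it with zero matrices a and b and a result buffer pre-filled with NaN leaves the NaNs in place when the flag is false and yields all zeros when it is true.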
EDIT: this problem is now fixed (see this pull request).

mlpack sparse coding solution not found

I am trying to learn how to use the Sparse Coding algorithm with the mlpack library. When I call Encode() on my instance of mlpack::sparse_coding::SparseCoding, I get the error
[WARN] There are 63 inactive atoms. They will be reinitialized randomly.
error: solve(): solution not found
Is it simply that the algorithm cannot learn a latent representation of the data, or is it perhaps my usage? The relevant section follows.
EDIT: One line was modified to fix an unrelated error, but the original error remains.
double* Application::GetSparseCodes(arma::mat* trainingExample, int atomCount)
{
    double* latentRep = new double[atomCount];
    mlpack::sparse_coding::SparseCoding<mlpack::sparse_coding::DataDependentRandomInitializer> sc(*trainingExample, Utils::ATOM_COUNT, 1.0);
    arma::mat& latentRepMat = sc.Codes();
    for (int i = 0; i < atomCount; i++)
        latentRep[i] = latentRepMat.at(i, 0);
    return latentRep;
}
Some relevant parameters
const static int IMAGE_WIDTH = 20;
const static int IMAGE_HEIGHT = 20;
const static int ATOM_COUNT = 64;
const static int MAX_ITERATIONS = 100000;
This could be one of a handful of issues but given the description it's a little difficult to tell which of these it is (or if it is something else entirely). However, these three ideas should provide a good place to start:
1. Matrices in mlpack are column-major. That means each observation should be a column. If you use mlpack::data::Load() to load, e.g., a CSV file (which generally has one row per observation), it will automatically transpose the dataset. SparseCoding will act oddly if you pass it transposed data. See also http://www.mlpack.org/doxygen.php?doc=matrices.html.
2. If there are 63 inactive atoms, then only one atom is actually active (given that ATOM_COUNT is 64). This means that the algorithm has found that the best way to represent the dictionary (at a given step) uses only one atom. This could happen if the matrix you are passing consists of all zeros.
3. mlpack can provide verbose output, which may also be helpful for debugging. Usually this is enabled by using mlpack's CLI class to parse command-line input, but you can also enable it directly with mlpack::Log::Info.ignoreInput = false. You may get a lot of output that way, but it will give you a better look at what is going on...
The mlpack project has its own mailing list where you may be likely to get a quicker or more comprehensive response, by the way.

Any way to check if a bool** pointer to pointer contains a true value in C without loops?

I am trying to optimise an iOS app and just wanted some advice on an issue I am having.
I have a bool** which can hold up to 1024 * 1024 elements. Each element defaults to false but they can also be changed to true at random.
I wanted to know if there is a streamlined way to check if a true value is contained by any of the elements, as, in a worst case scenario over a million iterations would be needed to do this check using two loops.
I could be completely wrong, but I had thought about casting the memory to an int, in the belief that, since a false value is equal to 0, the result of the cast would be 0 if all the elements are false. This is not the case, however.
What I may have to go with is to keep a tally of the number of true values as they are toggled but this could turn quite messy very quickly.
I hope I have made myself clear enough without code but, if you need to see any code, just ask.
So I decided to go with mvp's answer. Then when I need to check if a certain bit is set:
uint32_t mask32_t[] = {
    0x01, 0x02, 0x04, 0x08,
    0x10, 0x20, 0x40, 0x80,
    0x100, 0x200, 0x400, 0x800,
    0x1000, 0x2000, 0x4000, 0x8000,
    0x10000, 0x20000, 0x40000, 0x80000,
    0x100000, 0x200000, 0x400000, 0x800000,
    0x1000000, 0x2000000, 0x4000000, 0x8000000,
    0x10000000, 0x20000000, 0x40000000, 0x80000000
};
bool bitIsSet(uint32_t word, int n) {
    return ( word & mask32_t[ n ] ) != 0x00;
}
bool isSetAtPoint( uint32_t** arr, int x, int y ) {
    // x / 32 selects the word, x % 32 the bit within it (x must be non-negative)
    return bitIsSet( arr[ x / 32 ] [ y ], x % 32 );
}
Probably the best optimization you can make is to convert your 1024x1024 array of booleans into an array of bits. This approach has some benefits and drawbacks:
+ The bit array will consume 8 times less memory (assuming one byte per bool): 1024*1024/8 bytes = 128 KB.
+ You can test many bits at once by comparing whole 32-bit integers against zero. In other words, finding the first non-zero bit can be up to 32 times as fast.
- You need to write custom routines to read and write bits in this array. However, this is a rather simple task - just a bit of bit twiddling :).
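As a rough sketch of what such routines could look like, here is a small C++ class. BitGrid and its method names are invented for this example, and it stores the grid as one flat vector of 32-bit words rather than a pointer-to-pointer array. Note that any() is still a loop, but it tests 32 cells per comparison and stops at the first non-zero word.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical 2D bit grid: one bit per cell, packed into 32-bit words.
class BitGrid {
public:
    BitGrid(std::size_t width, std::size_t height)
        : width_(width), height_(height),
          words_((width * height + 31) / 32, 0u) {}

    void set(std::size_t x, std::size_t y, bool value) {
        assert(x < width_ && y < height_);
        const std::size_t bit = y * width_ + x;
        const std::uint32_t mask = std::uint32_t{1} << (bit % 32);
        if (value) {
            words_[bit / 32] |= mask;
        } else {
            words_[bit / 32] &= ~mask;
        }
    }

    bool get(std::size_t x, std::size_t y) const {
        assert(x < width_ && y < height_);
        const std::size_t bit = y * width_ + x;
        return ((words_[bit / 32] >> (bit % 32)) & 1u) != 0;
    }

    // True if at least one cell is set; checks 32 cells per comparison.
    bool any() const {
        for (std::uint32_t w : words_) {
            if (w != 0) return true;
        }
        return false;
    }

private:
    std::size_t width_;
    std::size_t height_;
    std::vector<std::uint32_t> words_;
};
```

For a 1024x1024 grid this means at most 32768 word comparisons in the worst case instead of over a million element checks.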
Basically, the answer is no. You need to write a loop to check.
You cannot cast a composite value to an int. You don't report a compiler error, so I suspect you actually tried casting the pointer to your array to int, which is legal; however, the result will only be 0 if the pointer is NULL (that is, not pointing to anything.)
You can streamline the checking code by using a bitvector instead of an array of bools. (You'd also save quite a bit of memory.) However, that involves a lot more code, and it's fairly messy. If you were using C++, you would have access to std::bitset, which is a lot easier than writing your own, but has the disadvantage of needing a compile-time size.
Using an integer array has already been suggested here and is probably the best option, but here's another approach.
If the number of true nodes is very low compared to the number of false nodes, you could turn the problem upside down: store the coordinates of the true nodes instead.
I'm thinking of a hash set or BST where you would use the position of the true node as the key. Then checking for the existence of a true value would be trivial: check whether your data structure has any coordinates stored in it.
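A minimal sketch of that idea in C++, assuming the coordinates fit in 32 bits each; TrueCells and its methods are hypothetical names. The (x, y) pair is packed into a single 64-bit key so that the standard hash for integers can be used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_set>

// Hypothetical sparse alternative: remember only the coordinates that are true.
class TrueCells {
public:
    void set(std::uint32_t x, std::uint32_t y, bool value) {
        if (value) {
            cells_.insert(key(x, y));
        } else {
            cells_.erase(key(x, y));
        }
    }

    bool get(std::uint32_t x, std::uint32_t y) const {
        return cells_.count(key(x, y)) != 0;
    }

    // No scan needed: the set is non-empty exactly when some cell is true.
    bool any() const { return !cells_.empty(); }

    std::size_t count() const { return cells_.size(); }

private:
    // Pack (x, y) into one 64-bit key: y in the high half, x in the low half.
    static std::uint64_t key(std::uint32_t x, std::uint32_t y) {
        return (static_cast<std::uint64_t>(y) << 32) | x;
    }

    std::unordered_set<std::uint64_t> cells_;
};
```

The trade-off is that each true cell now costs far more memory than a single bit, so this only pays off while true values stay rare.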

How to declare local memory in OpenCL?

I'm running the OpenCL kernel below with a two-dimensional global work size of 1000000 x 100 and a local work size of 1 x 100.
__kernel void myKernel(
    const int length,
    const int height,
    and a bunch of other parameters) {
    //declare some local arrays to be shared by all 100 work items in this group
    __local float LP [length];
    __local float LT [height];
    __local int bitErrors = 0;
    __local bool failed = false;
    //here come my actual computations which utilize the space in LP and LT
}
This, however, refuses to compile, since the parameters length and height are not known at compile time. But it is not clear to me at all how to do this correctly. Should I use pointers with memalloc? How do I handle this in a way that the memory is only allocated once for the entire workgroup and not once per work item?
All that I need is 2 arrays of floats, 1 int and 1 boolean that are shared among the entire workgroup (so all 100 work items). But I fail to find any method that does this correctly...
It's relatively simple, you can pass the local arrays as arguments to your kernel:
kernel void myKernel(const int length, const int height, local float* LP,
local float* LT, a bunch of other parameters)
You then set the kernel argument with a value of NULL and a size equal to the size you want to allocate for the argument (in bytes). Therefore it should be:
clSetKernelArg(kernel, 2, length * sizeof(cl_float), NULL);
clSetKernelArg(kernel, 3, height * sizeof(cl_float), NULL);
Local memory is always shared by the workgroup (as opposed to private memory), so I think the bool and int should be fine, but if not you can always pass those as arguments too.
Not really related to your problem (and not necessarily relevant, since I do not know what hardware you plan to run this on), but GPUs at least don't particularly like work-group sizes that are not a multiple of a particular power of two (I think it was 32 for NVIDIA, 64 for AMD). That means the hardware will probably create workgroups of 128 items, of which the last 28 are basically wasted. So if you are running OpenCL on a GPU, it might help performance to use workgroups of size 128 directly (and change the global work size accordingly).
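The arithmetic behind the "128 items, 28 wasted" figure can be sketched in a few lines of host-side C++. The helper names round_up and idle_items are made up for this example, and the granularity values 32 and 64 are just the ones guessed at above, not something queried from a device.

```cpp
#include <cassert>
#include <cstddef>

// Round value up to the next multiple of `multiple` (which must be > 0).
std::size_t round_up(std::size_t value, std::size_t multiple) {
    assert(multiple > 0);
    return ((value + multiple - 1) / multiple) * multiple;
}

// Number of work items in the padded group that have no real work to do.
std::size_t idle_items(std::size_t needed, std::size_t multiple) {
    return round_up(needed, multiple) - needed;
}
```

With 100 work items per group, both a granularity of 32 and one of 64 round up to 128, leaving 28 idle items. If you pad the local size yourself, the kernel has to make sure the extra work items do not touch data beyond the original 100.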
As a side note: I never understood why everyone uses the underscore variant for kernel, local and global, seems much uglier to me.
You could also declare your arrays like this:
__local float LP[LENGTH];
And pass LENGTH as a define when you compile your kernel:
int lp_size = 128; // this is an example; could be dynamically calculated
char compileArgs[64];
sprintf(compileArgs, "-DLENGTH=%d", lp_size);
clBuildProgram(program, 0, NULL, compileArgs, NULL, NULL);
You do not have to allocate all your local memory outside the kernel, especially when it is a simple variable instead of an array.
The reason your code cannot compile is that OpenCL does not support local memory initialization. This is specified in the documentation (https://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/local.html). It is also not possible in CUDA (see "Is there a way of setting default value for shared memory array?").
P.S.: Grizzly's answer is good enough. I would have posted this as a comment, but I am restricted by the reputation policy. Sorry.
