Stack overflow when trying to initialize two fixed size cubes - armadillo

#include <armadillo>
int main()
{
arma::cube::fixed<28, 28, 100> a;
arma::cube::fixed<28, 28, 100> b;
}
This code is giving me the following error
Unhandled exception at 0x000000013F6034A7 in mnist.exe: 0xC00000FD: Stack overflow (parameters: 0x0000000000000001, 0x0000000000133000).
Any ideas why? Cause I'm clueless.

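Fixed-size matrices and cubes are only recommended for very small sizes, because a fixed-size object declared as a local variable lives on the stack, and stack size is limited. In your case, each cube takes 28*28*100*8 = 627200 bytes, or about 612 KB (the 8 is the number of bytes taken up by a double-precision floating-point number), so the two cubes together need about 1.2 MB. That is more than the usual 1 MB default stack of a Windows executable. There is also some overhead due to internal matrix and cube housekeeping.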
Instead of using fixed size cubes, you would be better off using standard cubes:
arma::cube a(28,28,100);
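To make the arithmetic concrete, here is a minimal sketch in plain C++ (no Armadillo, so it compiles on its own). The 1 MB stack figure is an assumption about the asker's toolchain (it is the MSVC linker default), and cubeBytes, fitsOnStack and makeHeapCube are hypothetical helper names, not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bytes of element storage for a rows x cols x slices cube of doubles.
// Ignores the library's own bookkeeping overhead.
constexpr std::size_t cubeBytes(std::size_t rows, std::size_t cols, std::size_t slices) {
    return rows * cols * slices * sizeof(double);
}

// Assumption: 1 MB stack (MSVC linker default); other platforms differ.
constexpr std::size_t kAssumedStackBytes = 1024 * 1024;

constexpr bool fitsOnStack(std::size_t bytes) {
    return bytes < kAssumedStackBytes;
}

// Heap-backed alternative: only a small handle lives on the stack,
// the elements themselves are allocated dynamically.
std::vector<double> makeHeapCube(std::size_t rows, std::size_t cols, std::size_t slices) {
    assert(rows > 0 && cols > 0 && slices > 0);
    return std::vector<double>(rows * cols * slices, 0.0);
}
```

One cube fits under the assumed limit, two do not, which matches the crash appearing only once the second cube is declared.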

Related

cuda 2D layered tex size: too large?

The following is the code to create a 3D array in CUDA with width = 809, height = 127, and number of layers = 2160:
cudaArray *sinor;
cudaExtent volumeSize = make_cudaExtent(809, 127, 2160);
const cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc<float>();
gpuErrchk(cudaMalloc3DArray(&sinor, &channelDesc, volumeSize, cudaArrayLayered));
The last line returns an "invalid argument" error. Is that because my number of layers is too large? I tried 1940, and it was fine. If I cannot use such a large number of layers, what is the workaround here? Thanks a lot.
You can find the texture layer depth limit on the documentation here. As you inferred, the depth limit for layered textures and surfaces is 2048.
As was suggested in comments, your only real workaround here is to split your data over multiple texture objects and select between the objects based on index within the virtual combined textures.
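The index bookkeeping for that split could look roughly like the sketch below. It is host-side C++ only, it assumes the 2048-layer limit quoted above, and LayerLocation, texturesNeeded and locateLayer are hypothetical names; creating the texture objects themselves is not shown.

```cpp
#include <cassert>

// Layer limit quoted in the answer above for layered textures.
constexpr int kMaxLayersPerTexture = 2048;

struct LayerLocation {
    int textureIndex;  // which texture object holds the layer
    int localLayer;    // layer index inside that texture object
};

// Number of texture objects needed to hold totalLayers layers.
constexpr int texturesNeeded(int totalLayers) {
    return (totalLayers + kMaxLayersPerTexture - 1) / kMaxLayersPerTexture;
}

// Map a layer index in the virtual combined texture to (object, local layer).
inline LayerLocation locateLayer(int globalLayer) {
    assert(globalLayer >= 0);
    return {globalLayer / kMaxLayersPerTexture, globalLayer % kMaxLayersPerTexture};
}
```

For 2160 layers this gives two texture objects, the first holding layers 0 to 2047 and the second the remaining 112.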

Cuda: Operating on images (linearized 2d arrays) with a single column of constant values

I am processing images that are long, usually a few hundred thousand pixels in length. The height is usually in the 500-1000 pixel range. The process involves modifying the images on a column-by-column basis. So, for example, I have a column of constant values that needs to be subtracted from each column in the image.
Currently I split the image into smaller blocks, put them into linearized 2d arrays. Then I make a linearized 2d array from the column of constant values that is the same size as the smaller block. Then a (image array - constant array) operation is done until the full image is processed.
Should I copy the constant column to the device, and then just operate column by column? Or should I try to make as large of a "constant array" as possible, and then perform the subtraction. I am not looking for 100% optimization or even close to that, but an idea about what the right approach to take is.
How can I optimize this process? Any resources to learn more about this type of processing would be appreciated.
Constant memory is up to 64KB, so assuming your pixels are 4 bytes or less, then you should be able to handle an image height up to about 16K pixels, and still put the entire "constant column" in constant memory.
After that, you don't need to process things "column by column". Constant memory is optimized for access when every thread is requesting the same value from constant memory, which perfectly describes your case.
Therefore, your thread code can be trivially simple:
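#define MAX_COL_SIZE 1024
// Column of constants, one value per image row, kept in constant memory
__constant__ float const_column[MAX_COL_SIZE];
// One thread per image column; each thread walks down its own column
__global__ void img_col_kernel(float *in, float *out, int num_cols, int col_size){
  int idx = threadIdx.x + blockDim.x*blockIdx.x;
  if (idx < num_cols)
    for (int i = 0; i < col_size; i++)
      // pixel (row i, column idx) of the linearized image
      out[idx+i*num_cols] = in[idx+i*num_cols] - const_column[i];
}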
(coded in browser, not tested)
Set up const_column in your host code using cudaMemcpyToSymbol prior to calling img_col_kernel. Call the kernel with a 1D grid including a total number of threads equal to or greater than your image width (num_cols). Pass the "linearized 2D" pointers to your input and output images to the kernel (in and out). The above kernel should run pretty fast, and essentially be bound by memory bandwidth for images of width 1000 or more. For small images, you may want to increase the number of threads by dividing your image vertically into say, 4 pieces, and operate with 4 times as many threads (and 4 regions of constant memory).
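If you want something to check the kernel output against, a plain CPU reference with the same indexing is easy to write. This is only a sketch: subtractColumnCpu is a hypothetical name, and it assumes the same layout as the kernel, i.e. the image is stored row by row so that pixel (row i, column c) sits at c + i*num_cols.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for the kernel above: subtract column[i] from every
// pixel in row i of a row-major linearized image.
std::vector<float> subtractColumnCpu(const std::vector<float>& in,
                                     const std::vector<float>& column,
                                     std::size_t num_cols) {
    const std::size_t col_size = column.size();  // image height
    assert(in.size() == num_cols * col_size);
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < col_size; ++i)
        for (std::size_t c = 0; c < num_cols; ++c)
            out[c + i * num_cols] = in[c + i * num_cols] - column[i];
    return out;
}
```

Comparing this against the GPU result on a small image is a cheap way to catch an indexing mistake before running on the full-size data.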

DirectX11 wireframe z-fighting help (or why D3D11_RASTERIZER_DESC.DepthBias is an INT?)

I'm trying to use the DepthBias property on the rasterizer state in DirectX 11 (D3D11_RASTERIZER_DESC) to help with the z-fighting that occurs when I render in wireframe mode over solid polygons (wireframe overlay), and it seems that setting it to any value doesn't change the result. But I noticed something strange: the value is defined as an INT rather than a FLOAT. That doesn't make sense to me, and in any case it doesn't work as expected. How do we properly set that value if it is an INT that needs to be interpreted as a UNORM in the shader pipeline?
Here's what I do:
Render all geometry
Set the rasterizer to render in wireframe
Render all geometry again
I can clearly see the wireframe overlay, but the z-fighting is horrible. I tried setting the DepthBias to a lot of different values, such as 0.000001, 0.1, 1, 10, 1000 and all their negative equivalents, still with no results. Obviously, I'm aware that when casting the float to an integer, all the decimals get cut off... meh?
D3D11_RASTERIZER_DESC RasterizerDesc;
ZeroMemory(&RasterizerDesc, sizeof(RasterizerDesc));
RasterizerDesc.FillMode = D3D11_FILL_WIREFRAME;
RasterizerDesc.CullMode = D3D11_CULL_BACK;
RasterizerDesc.FrontCounterClockwise = FALSE;
RasterizerDesc.DepthBias = ???
RasterizerDesc.SlopeScaledDepthBias = 0.0f;
RasterizerDesc.DepthBiasClamp = 0.0f;
RasterizerDesc.DepthClipEnable = TRUE;
RasterizerDesc.ScissorEnable = FALSE;
RasterizerDesc.MultisampleEnable = FALSE;
RasterizerDesc.AntialiasedLineEnable = FALSE;
Has anyone figured out how to set the DepthBias properly? Or perhaps it is a bug in DirectX (which I doubt), or maybe there's a better way to achieve this than using DepthBias?
Thank you!
http://msdn.microsoft.com/en-us/library/windows/desktop/cc308048(v=vs.85).aspx
The meaning of the number depends on whether your depth buffer is UNORM or floating point. In most cases you're just looking for the smallest possible value that gets rid of your z-fighting rather than any specific value. Small values are a small bias, large values are a large bias, but how that equates to a shift numerically depends on the format of your depth buffer.
As for the values you've tried, anything less than 1 would have rounded to zero and had no effect. 1, 10, 1000 may simply not have been enough to fix the issue. In the case of a D24 UNORM depth buffer, the formula would suggest a depth bias of 1000 would offset depth by: 1000 * (1 / 2^24), which equals 0.0000596, a not very significant shift in z-buffering terms.
Does a large value of 100,000 or 1,000,000 fix the z-fighting?
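To get a feel for the numbers, the UNORM case can be worked out with a couple of helpers. This is a rough sketch, not D3D API: unormDepthOffset and unormBiasForOffset are hypothetical names, and only the constant DepthBias term is modelled (slope-scaled bias and the clamp are ignored).

```cpp
#include <cassert>
#include <cmath>

// Depth shift produced by an integer DepthBias on an n-bit UNORM
// depth buffer: bias * (1 / 2^n).
inline double unormDepthOffset(int depthBias, int depthBits) {
    assert(depthBits > 0 && depthBits <= 24);
    return std::ldexp(static_cast<double>(depthBias), -depthBits);
}

// Inverse: the integer bias closest to a wanted depth shift.
inline int unormBiasForOffset(double offset, int depthBits) {
    assert(depthBits > 0 && depthBits <= 24);
    assert(std::fabs(offset) <= 1.0);
    return static_cast<int>(std::lround(std::ldexp(offset, depthBits)));
}
```

With a D24 buffer, a bias of 1000 comes out at roughly 0.0000596, and a shift of 0.001 needs a bias in the region of 16777, which is why single- and double-digit values showed no visible effect.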
If anyone cares, I made myself a macro to make it easier. Note that this macro will only work if you are using a 32bit float depth buffer format. A different macro might be needed if you are using a different depth buffer format.
#define DEPTH_BIAS_D32_FLOAT(d) ((INT)((d) / (1 / pow(2.0, 23))))
That way you can simply set your depth bias using standard values, such as:
RasterizerDesc.DepthBias = DEPTH_BIAS_D32_FLOAT(-0.00001);
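One caveat worth checking against the depth bias documentation linked above: as I read it, for floating-point depth buffers the constant term is DepthBias * 2^(exponent(max z in primitive) - r), with r = 23 mantissa bits for float32. The macro therefore matches only when the largest depth in the primitive has exponent 0. A small sketch of that reading (float32DepthOffset is a hypothetical name; slope-scaled bias and the clamp are ignored):

```cpp
#include <cassert>
#include <cmath>

// Constant depth-bias term for a float32 depth buffer, assuming the
// formula DepthBias * 2^(exponent(maxZ) - r) with r = 23 mantissa bits.
// maxZ is the largest depth value in the primitive being drawn.
inline double float32DepthOffset(int depthBias, float maxZ) {
    assert(maxZ > 0.0f);
    const int r = 23;                       // float32 mantissa bits
    const int exponent = std::ilogb(maxZ);  // unbiased binary exponent
    return std::ldexp(static_cast<double>(depthBias), exponent - r);
}
```

Under this assumption the same integer bias gives half the shift for depths in [0.5, 1) compared with a depth of exactly 1.0, so the effective offset shrinks as geometry gets closer to the near plane.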

Stack direction and buffer overflow

In a downward-growing stack, what's the rationale for stack variables to be written in an upward direction? For example, say I have char buf[200] at memory address 0x400. When I write to this array, I will write from 0x400 up to 0x4C8 (0x400 plus 200 bytes), which is toward previous stack frames. This makes the program vulnerable to buffer overflows that can take control of the program by overwriting return pointers, etc. So why not just write the array from 0x4C8 down to 0x400?
It doesn't matter; when you try to write beyond 200 bytes, you are still trying to write to an address that does not belong to the array (out of bounds), hence buffer overflow.

My preallocation of a matrix gives out of memory error in MATLAB

I use zeros to initialize my matrix like this:
height = 352
width = 288
nFrames = 120
imgYuv=zeros([height,width,3,nFrames]);
However, when I set the value of nFrames larger than 120, MATLAB gives me an error message saying out of memory.
The original function is
[imgYuv, S, A]= changeYuv(fileName, width, height, idxFrame, nFrames)
my command is
[imgYuv,S,A]=changeYuv('tilt.yuv',352,288,1:120,120);
Can anyone please tell me what's going on here?
PS: one of the purposes of the function is to load a yuv video which consists more than 2000 frames. Is there any possibility to implement that?
There are three ways to avoid the error:
1. Process a limited number of frames at any given time.
2. Work with integer arrays. Most movies are in 8-bit format, while Matlab normally works with doubles. uint8 takes 1 byte per element, while double takes 8 bytes. Thus, if you create your array as B = zeros(height,width,3,nFrames,'uint8'), it only uses 1/8th of the memory. This might work for 120 frames, though for 2000 frames you'll run into trouble again. Note that not all Matlab functions work for integer arrays; you may have to reimplement those that require double.
3. Buy more RAM.
Yes, you (or rather, your Matlab session) are running out of memory.
Get out your calculator and find the product height x width x 3 x nFrames x 8 which will tell you how much memory you have tried to get in your call to zeros. That will be a number either close to or in excess of the RAM available to Matlab on your computer.
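If you'd rather not reach for the calculator, the same product is a one-liner. The sketch below is C++ rather than Matlab, and videoArrayBytes is a hypothetical helper name; it just multiplies the dimensions by the element size (8 bytes for double, 1 for uint8).

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed for a height x width x channels x nFrames array.
// 64-bit arithmetic so the 2000-frame case does not overflow.
inline std::uint64_t videoArrayBytes(std::uint64_t height, std::uint64_t width,
                                     std::uint64_t channels, std::uint64_t nFrames,
                                     std::uint64_t bytesPerElement) {
    assert(bytesPerElement > 0);
    return height * width * channels * nFrames * bytesPerElement;
}
```

For 352 x 288 x 3 this gives about 292 MB at 120 frames of double, about 4.9 GB at 2000 frames of double, and about 608 MB at 2000 frames of uint8.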
Your command is:
[imgYuv,S,A]=changeYuv('tilt.yuv',352,288,1:120,120);
That is:
352*288*120*120 = 1459814400
That is 1.4 * 10^9. If one object has 4 bytes, then you need 6GB. That is a lot of memory...
Referencing the code I've seen in your withdrawn post, you're calculating the difference between adjacent frame histograms. One option to avoid massive memory allocation might be to just hold two frames in memory, instead of reading all the frames at once.
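The two-frame idea can be sketched as follows. This is C++ rather than Matlab and makes several assumptions: frames are 8-bit, the "difference" is taken as the sum of absolute bin differences (the withdrawn post may have used something else), and loadFrame is a hypothetical callback standing in for whatever reads frame k from the YUV file.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <functional>
#include <utility>
#include <vector>

using Frame = std::vector<std::uint8_t>;
using Histogram = std::array<long, 256>;

// 256-bin histogram of one 8-bit frame.
Histogram histogramOf(const Frame& f) {
    Histogram h{};
    for (std::uint8_t v : f) ++h[v];
    return h;
}

// Histogram difference between each pair of adjacent frames, holding
// only two frame buffers in memory at any time.
std::vector<long> adjacentHistogramDiffs(
        int nFrames, const std::function<void(int, Frame&)>& loadFrame,
        std::size_t frameBytes) {
    assert(frameBytes > 0);
    std::vector<long> diffs;
    if (nFrames < 2) return diffs;
    Frame prev(frameBytes), cur(frameBytes);
    loadFrame(0, prev);
    for (int k = 1; k < nFrames; ++k) {
        loadFrame(k, cur);  // overwrites the older buffer
        const Histogram hp = histogramOf(prev), hc = histogramOf(cur);
        long d = 0;
        for (int b = 0; b < 256; ++b) d += std::labs(hc[b] - hp[b]);
        diffs.push_back(d);
        std::swap(prev, cur);  // current frame becomes the previous one
    }
    return diffs;
}
```

Memory use stays at two frames plus two small histograms no matter how many frames the video has, so 2000 frames cost no more than 2.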
The function B = zeros([d1 d2 d3...]) creates a multi-dimensional array with d1*d2*d3*... elements.
Depending on width and height, a 3rd dimension of 3 and a 4th dimension of 120 (which effectively gives width*height*360 elements) may result in a very large array. Every machine has memory limits; maybe you reached them... ;)
