Similar to the SpriteKit featured game "Adventure" from WWDC, I am trying to load my background image via tiles. I have created a texture atlas that contains 6,300 "tiles" that are each 100x100 pixels in size. The complete background image is a total of 30,000x2048 (for retina displays). The idea is that the background will move from right to left (side-scroller). The first column and the last column match so that the background appears continuous.
When the application runs, it loads my initial loading screen and title images and spikes to 54 MB in the memory tab, with a CPU usage of 16%. This stays the same as I navigate through the menus until I choose my level, which tells a background thread to load the level assets (which include the aforementioned background image). The entire .atlas folder is only 35.4 MB. I don't believe that this is a problem, since the Adventure .atlas folder (from WWDC) is only 32.7 MB.
Once I select the level, it loads approximately 20 of the textures in the .atlas folder before I start receiving memory warnings and the application crashes. I've checked Instruments for leaks and it doesn't show any memory leaks. I don't receive any errors (not even an EXC_BAD_ACCESS). I've looked at my device console and found a few lines from around the crash, but they don't make much sense to me. I've also checked for zombies, but haven't found any.
CoreLevel.m
dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_HIGH, 0), ^{
    // Used to determine time spent to load
    NSDate *startDate = [NSDate date];
    // Atlas to load
    SKTextureAtlas *tileAtlas = [SKTextureAtlas atlasNamed:@"Day"];
    // Make sure the array is empty before storing the tiles
    sBackgroundTiles = nil;
    sBackgroundTiles = [[NSMutableArray alloc] initWithCapacity:6300];
    // For each row (21 total rows)
    for (int y = 0; y < 21; y++) {
        // For each column (100 total columns)
        for (int x = 1; x <= 100; x++) {
            // Compute the tile number for this row/column
            int tileNumber = (y * 300) + x;
            // Create a SpriteNode for that tile
            SKSpriteNode *tileNode = [SKSpriteNode spriteNodeWithTexture:[tileAtlas textureNamed:[NSString stringWithFormat:@"tile_%d.png", tileNumber]]];
            // Position the SpriteNode
            CGPoint position = CGPointMake((x * 100), (y * 100));
            tileNode.position = position;
            // Set the layer
            tileNode.zPosition = -1.0f;
            tileNode.blendMode = SKBlendModeReplace;
            // Add to array
            [(NSMutableArray *)sBackgroundTiles addObject:tileNode];
        }
    }
    NSLog(@"Loaded all world tiles in %f seconds", [[NSDate date] timeIntervalSinceDate:startDate]);
});
This is what seems to pertain to the crash from the Debug console:
com.apple.debugserver-300.2[9438] <Warning>: 1 +0.000000 sec [24de/1807]: error: ::read ( -1, 0x4069ec, 18446744069414585344 ) => -1 err = Bad file descriptor (0x00000009)
com.apple.debugserver-300.2[9438] <Warning>: Exiting.
com.apple.launchd[1] (UIKitApplication:tv.thebasement.Coin-Voyage[0x641d][9441]) <Notice>: (UIKitApplication:tv.thebasement.Coin-Voyage[0x641d]) Exited: Killed: 9
I don't have enough reputation to post images, so here is a link to a screenshot of my allocations in Instruments:
http://postimg.org/image/j17xl39et/
Any help and advice is much appreciated! If I've left out some pertinent information, I'm glad to update.
The file size of an image file (PNG, JPG, atlas folder, etc) tells you nothing about the memory usage.
Instead you have to calculate the texture memory usage using the formula:
width * height * (color bit depth / 8) = texture size in bytes
For example an image with dimensions 4096x4096 pixels and 32 bits color depth (4 bytes) uses this much memory when loaded as a texture (uncompressed):
4096 * 4096 * 4 = 67108864 bytes (64 Megabytes)
According to your specs (6,300 tiles, each 100x100 pixels, assuming they're all unique) you're way, wayyyyyy above any reasonable limit for texture memory usage (roughly 240 Megabytes for the background alone; see the calculation below). The 35 Megabyte size of the atlas folder (which is huge for an atlas, btw) only reflects the PNG compression on disk; once loaded, the textures are decompressed to their full, uncompressed size in memory.
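Running your numbers through that formula (my arithmetic, assuming uncompressed RGBA8888 textures, i.e. 4 bytes per pixel, and ignoring any padding the atlas adds):
6300 * (100 * 100) * 4 = 252,000,000 bytes (~240 Megabytes)
That is for the background tiles alone, before the rest of your scene, and by itself is likely to exhaust the memory budget iOS grants a single app on devices of that era.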
I am resizing this test picture:
Mat im = Mat::zeros(Size(832*3,832*3),CV_8UC3);
putText(im,"HI THERE",Point2i(10,90),1,7,Scalar(255,255,255),2);
using the standard
cv::resize(im,out,Size(416,416),0,0,INTER_NEAREST);
and using the CUDA version of resize:
static void gpuResize(Mat in, Mat &out){
    double k = in.cols/416.;
    cuda::GpuMat gpuInImage;
    cuda::GpuMat gpuOutImage;
    gpuInImage.upload(in);
    const Size2i &newSize = Size(416, in.rows / k);
    //cout << "newSize " << newSize << endl;
    cuda::resize(gpuInImage, gpuOutImage, newSize, INTER_NEAREST);
    gpuOutImage.download(out);
}
Measuring the time shows that cv::resize is ~25 times faster. What am I doing wrong? I am on a GTX 1080 Ti video card, but I also observe the same situation on a Jetson Nano. Are there any alternative methods to resize an image faster than cv::resize with NVIDIA hardware acceleration?
I was doing similar things today, and had the same results on my Jetson NX running in the NVP model 2 mode (15W, 6 core).
Using the CPU to resize an image 10,000 times was faster than resizing the same image 10,000 times with the GPU.
This was my code for the CPU:
cv::Mat cpu_original_image = cv::imread("test.png"); // 1400x690 RGB image
for (size_t count = 0; count < number_of_times_to_iterate; count ++)
{
cv::Mat cpu_resized_image;
cv::resize(cpu_original_image, cpu_resized_image, desired_image_size);
}
This was my code for the GPU:
cv::cuda::GpuMat gpu_original_image;
gpu_original_image.upload(cpu_original_image);
for (size_t count = 0; count < number_of_times_to_iterate; count ++)
{
cv::cuda::GpuMat gpu_resized_image;
cv::cuda::resize(gpu_original_image, gpu_resized_image, desired_image_size);
}
My timing code (not shown above) covered only the for() loops; it didn't include imread() or upload().
When called in a loop 10K times, my results were:
CPU: 5786.930 milliseconds
GPU: 9678.054 milliseconds (plus an additional 170.587 milliseconds for the upload())
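For reference, a simplified sketch of how such a loop can be timed with std::chrono (my illustration, not my exact code; the helper name is mine):
#include <chrono>
#include <opencv2/imgproc.hpp>

// Times one run of the CPU resize loop shown above and returns milliseconds.
static double time_cpu_resize_ms(const cv::Mat &src, const cv::Size &dst_size, size_t iterations)
{
    const auto t0 = std::chrono::steady_clock::now();
    for (size_t count = 0; count < iterations; count ++)
    {
        cv::Mat resized;
        cv::resize(src, resized, dst_size);
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
Called as time_cpu_resize_ms(cpu_original_image, desired_image_size, number_of_times_to_iterate).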
Then I made 1 change to each loop. I moved the "resized" mat outside of the loop to prevent it from being created and destroyed at each iteration. My code then looked like this:
cv::Mat cpu_original_image = cv::imread("test.png"); // 1400x690 RGB image
cv::Mat cpu_resized_image;
for (size_t count = 0; count < number_of_times_to_iterate; count ++)
{
cv::resize(cpu_original_image, cpu_resized_image, desired_image_size);
}
...and for the GPU:
cv::cuda::GpuMat gpu_original_image;
gpu_original_image.upload(cpu_original_image);
cv::cuda::GpuMat gpu_resized_image;
for (size_t count = 0; count < number_of_times_to_iterate; count ++)
{
cv::cuda::resize(gpu_original_image, gpu_resized_image, desired_image_size);
}
The for() loop timing results are now:
CPU: 5768.181 milliseconds (basically unchanged)
GPU: 2827.898 milliseconds (from 9.7 seconds to 2.8 seconds)
This looks much better! GPU resize is now faster than CPU resize...as long as you're doing lots of work with the GPU and not a single resize. And as long as you don't continuously re-allocate temporary GPU mats, as that seems to be quite expensive.
But after all this, to go back to your original question: if all you are doing is resizing a single image once, or resizing many images once each, the GPU resize won't help you since uploading each image to the GPU mat will take longer than the original resize! Here are my results when trying that on a Jetson NX:
single image resize on CPU: 3.565 milliseconds
upload mat to GPU: 186.966 milliseconds
allocation of 2nd GPU mat and gpu resize: 225.925 milliseconds
So on the CPU the NX can do it in < 4 milliseconds, while on the GPU it takes over 400 milliseconds.
I have a task: to multiply a big row vector (10,000 elements) by a big column-major matrix (10,000 rows, 400 columns). I decided to go with ARM NEON since I'm curious about this technology and would like to learn more about it.
Here's a working example of vector matrix multiplication I wrote:
//float* vec_ptr - a pointer to vector
//float* mat_ptr - a pointer to matrix
//float* out_ptr - a pointer to output vector
//int matCols - matrix columns
//int vecRows - vector rows, the same as matrix
for (int i = 0, max_i = matCols; i < max_i; i++) {
    for (int j = 0, max_j = vecRows - 3; j < max_j; j+=4, mat_ptr+=4, vec_ptr+=4) {
        float32x4_t mat_val = vld1q_f32(mat_ptr);          //get 4 elements from matrix
        float32x4_t vec_val = vld1q_f32(vec_ptr);          //get 4 elements from vector
        float32x4_t out_val = vmulq_f32(mat_val, vec_val); //multiply vectors
        float32_t total_sum = vaddvq_f32(out_val);         //sum elements of vector together
        out_ptr[i] += total_sum;
    }
    vec_ptr = &myVec[0]; //switch ptr back again to zero element
}
The problem is that it takes a very long time to compute: 30 ms on an iPhone 7+, when my goal is 1 ms or even less if possible. The current execution time is understandable, since I run the inner multiplication 400 * (10,000 / 4) = 1,000,000 times.
Also, I tried to process 8 elements per iteration instead of 4. It seems to help, but the numbers are still very far from my goal.
I understand that I might be making some horrible mistakes, since I'm a newbie with ARM NEON. I would be happy if someone could give me some tips on how to optimize my code.
Also, is it worth doing big vector-matrix multiplications via ARM NEON at all? Is this technology a good fit for such a purpose?
Your code is completely flawed: it iterates 16 times assuming both matCols and vecRows are 4. What's the point of SIMD then?
And the major performance problem lies in float32_t total_sum = vaddvq_f32(out_val);:
You should never convert a vector to a scalar inside a loop, since it causes a pipeline hazard that costs around 15 cycles every time.
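For illustration only, here is a sketch of mine (not the solution below): even keeping the question's loop structure, the hazard goes away if you accumulate in a NEON register and do the horizontal add just once per output element (acc and vec_start are names I made up; vec_start is the start of the vector):
for (int i = 0; i < matCols; i++) {
    float32x4_t acc = vdupq_n_f32(0.0f);               // running sum stays in a register
    const float32_t *vec = vec_start;                  // rewind to the start of the vector
    for (int j = 0; j <= vecRows - 4; j += 4, mat_ptr += 4, vec += 4) {
        acc = vmlaq_f32(acc, vld1q_f32(mat_ptr), vld1q_f32(vec));  // acc += mat * vec
    }
    out_ptr[i] = vaddvq_f32(acc);                      // one horizontal add per output element
}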
The solution:
float32x4x4_t myMat;
float32x2_t myVecLow, myVecHigh;
myVecLow = vld1_f32(&pVec[0]);
myVecHigh = vld1_f32(&pVec[2]);
myMat = vld4q_f32(pMat);
myMat.val[0] = vmulq_lane_f32(myMat.val[0], myVecLow, 0);
myMat.val[0] = vmlaq_lane_f32(myMat.val[0], myMat.val[1], myVecLow, 1);
myMat.val[0] = vmlaq_lane_f32(myMat.val[0], myMat.val[2], myVecHigh, 0);
myMat.val[0] = vmlaq_lane_f32(myMat.val[0], myMat.val[3], myVecHigh, 1);
vst1q_f32(pDst, myMat.val[0]);
Compute all four rows in a single pass
Do a matrix transpose (rotation) on the fly with vld4
Do vector-scalar multiply-accumulate instead of vector-vector multiply plus horizontal add, which causes the pipeline hazards
You were asking if SIMD is suitable for matrix operations? A simple "yes" would be a monumental understatement. You don't even need a loop for this.
For example, on my 940M video card, a canvas created with the following code takes 500 MB of video memory:
var c = document.createElement('canvas');
var ctx = c.getContext('webgl');
c.width = c.height = 4096;
At the same time, an OpenGL context of the same size uses only 100 MB of video memory:
glutInit(&argc, argv);
glutInitDisplayMode(GLUT_SINGLE);
int s = 4096;
glutInitWindowSize(s, s);
glutCreateWindow("Hello world :D");
Why does WebGL use so much memory? Is it possible to reduce the amount of memory used for a context of the same size?
As LJ pointed out, the canvas is double buffered, antialiased, and has alpha and a depth buffer by default. You made the canvas 4096 x 4096, so that's
16 million pixels * 4 bytes (RGBA), or 64 MB, for one buffer
You get that times at least 4
front buffer = 1
antialiased backbuffer = 2 to 16
depth buffer = 1
So that's 256 MB to 1152 MB, depending on what the browser picks for antialiasing.
To answer your question, you can try not asking for a depth buffer, an alpha buffer, and/or antialiasing:
var c = document.createElement('canvas');
var ctx = c.getContext('webgl', { alpha: false, depth: false, antialias: false});
c.width = c.height = 4096;
Whether the browser actually doesn't allocate an alpha channel or does but just ignores it is up to the browser and driver. Whether it will actually not allocate a depth buffer is also up to the browser. Passing antialias: false should at least make the 2nd buffer 1x instead of 2x to 16x.
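You can also check what the browser actually honored; getContextAttributes() returns the attributes of the context that was created (a quick sketch using the ctx from above):
var attrs = ctx.getContextAttributes();
console.log(attrs.alpha, attrs.depth, attrs.antialias);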
I have an SKTextureAtlas with about 90 PNG images. Every image has a resolution of 2000 x 70 pixels and a file size of ~1 KB.
Now I put these images from the atlas into an array like this:
var dropBarAtlas = SKTextureAtlas(named: "DropBar")
for i in 0..<dropBarAtlas.textureNames.count {
    var textureName = NSString(format: "DropBar%i", i)
    var texture = dropBarAtlas.textureNamed(textureName)
    dropFrames.addObject(texture)
}
Then I preload the array with the textures in didMoveToView:
SKTexture.preloadTextures(dropFrames, withCompletionHandler: { () -> Void in})
To play the animation at 30 fps I use SKAction.animateWithTextures:
var animateDropBar = SKAction.animateWithTextures(dropFrames, timePerFrame: 0.033)
dropBar.runAction(animateDropBar)
My problem is that when I preload the textures the memory usage increases to about 300 MB.
Is there a more performant solution?
And which frame rate and image size are recommended for SKAction.animateWithTextures?
You should keep in mind that the image file size (1 KB in your example) has nothing to do with the amount of memory required to store the same image in RAM. You can calculate the amount of memory required with this formula:
width x height x bytes per pixel = size in memory
If you are using the standard RGBA8888 pixel format, this means that each of your images will require about 0.5 megabytes of RAM, because RGBA8888 uses 4 bytes per pixel: 1 byte each for red, green, and blue, plus 1 byte for alpha transparency. You can read more here.
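For the 90 frames in your animation, that works out roughly as follows (my estimate, assuming RGBA8888 and no padding):
2000 * 70 * 4 = 560,000 bytes (~0.53 MB) per frame
0.53 MB * 90 frames ≈ 48 MB at 1x, or about 4 times that (~190 MB) if the @2x retina versions are loaded
which is in the same ballpark as the ~300 MB you are seeing once padding and other overhead are added.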
So what you can do is optimize your textures and use different texture formats. Here is another example of texture optimization.
I am working on a CUDA program and I wanted to speed up computation using constant memory, but it turned out that using constant memory makes my code ~30% slower.
I know that constant memory is good at broadcasting reads to whole warps, and I thought that my program could take advantage of it.
Here is the constant memory code:
__constant__ float4 constPlanes[MAX_PLANES_COUNT];
__global__ void faultsKernelConstantMem(const float3* vertices, unsigned int vertsCount, int* displacements, unsigned int planesCount) {
    unsigned int blockId = __mul24(blockIdx.y, gridDim.x) + blockIdx.x;
    unsigned int vertexIndex = __mul24(blockId, blockDim.x) + threadIdx.x;
    if (vertexIndex >= vertsCount) {
        return;
    }
    float3 v = vertices[vertexIndex];
    int displacementSteps = displacements[vertexIndex];
    //__syncthreads();
    for (unsigned int planeIndex = 0; planeIndex < planesCount; ++planeIndex) {
        float4 plane = constPlanes[planeIndex];
        if (v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w > 0) {
            ++displacementSteps;
        }
        else {
            --displacementSteps;
        }
    }
    displacements[vertexIndex] = displacementSteps;
}
The global memory code is the same, but it has one more parameter (a pointer to the array of planes) and uses it instead of the constant array.
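For clarity, the global memory variant's interface looks roughly like this (my paraphrase of that description, not the exact code; the kernel name is mine):
__global__ void faultsKernelGlobalMem(const float3* vertices, unsigned int vertsCount, int* displacements, const float4* planes, unsigned int planesCount) {
    // ...identical body, except the loop reads
    //     float4 plane = planes[planeIndex];
    // instead of constPlanes[planeIndex].
}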
I thought that those first global memory reads
float3 v = vertices[vertexIndex];
int displacementSteps = displacements[vertexIndex];
may cause "desynchronization" of threads and then they will not take an advantage of broadcasting of constant memory reads so I've tried to call __syncthreads(); before reading constant memory but it did not changed anything.
What is wrong? Thanks in advance!
System:
CUDA Driver Version: 5.0
CUDA Capability: 2.0
Parameters:
number of vertices: ~2.5 millions
number of planes: 1024
Results:
constant mem version: 46 ms
global mem version: 35 ms
EDIT:
So I've tried many things to make the constant memory version faster, such as:
1) Commenting out the two global memory reads to see if they have any impact; they do not. The global memory version was still faster.
2) Processing more vertices per thread (from 8 to 64) to take advantage of the CM cache. This was even slower than one vertex per thread.
2b) Using shared memory to store displacements and vertices: load all of them at the beginning, process, and save all displacements. Again, slower than the CM example shown.
After this experience I really do not understand how CM read broadcasting works and how it can be "used" correctly in my code. This code probably cannot be optimized with CM.
EDIT2:
Another day of tweaking, I've tried:
3) Processing more vertices (8 to 64) per thread with memory coalescing (each thread strides by the total number of threads); this gives better results than a stride of 1, but still no speedup.
4) Replacing this if statement
if (v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w > 0) {
++displacementSteps;
}
else {
--displacementSteps;
}
which leads to 'unpredictable' branching, with a little bit of math to avoid the branch, using this code:
float dist = v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w;
int distInt = (int)(dist * (1 << 29)); // distance is in range (0 - 2), stretch it to int range
int sign = 1 | (distInt >> (sizeof(int) * CHAR_BIT - 1)); // compute sign without using ifs
displacementSteps += sign;
Unfortunately this is a lot slower (~30%) than using the if, so ifs are not as big an evil as I thought.
EDIT3:
I am concluding this question: this problem probably cannot be improved by using constant memory. These are my results*:
*Times reported as median from 15 independent measurements. When constant memory was not large enough for saving all planes (4096 and 8192), kernel was invoked multiple times.
Although a compute capability 2.0 chip has 64k of constant memory, each of the multi-processors has only 8k of constant-memory cache. Your code has each thread requiring access to all 16k of the constant memory, so you are losing performance through cache misses. To effectively use constant memory for the plane data, you will need to restructure your implementation.
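To make that last point concrete, here is one possible restructuring, sketched by me rather than taken from the answer; note that it sidesteps constant memory entirely by staging the planes through shared memory in 8 KB tiles that each block loads cooperatively:
#define PLANES_PER_TILE 512   // 512 * sizeof(float4) = 8 KB of shared memory

__global__ void faultsKernelSharedMem(const float3* vertices, unsigned int vertsCount, int* displacements, const float4* planes, unsigned int planesCount) {
    __shared__ float4 tile[PLANES_PER_TILE];
    // 1D grid for simplicity, unlike the 2D-grid indexing in the original kernel
    unsigned int vertexIndex = blockIdx.x * blockDim.x + threadIdx.x;
    bool active = (vertexIndex < vertsCount);
    float3 v = active ? vertices[vertexIndex] : make_float3(0.0f, 0.0f, 0.0f);
    int displacementSteps = active ? displacements[vertexIndex] : 0;
    for (unsigned int base = 0; base < planesCount; base += PLANES_PER_TILE) {
        unsigned int tileSize = planesCount - base;
        if (tileSize > PLANES_PER_TILE) tileSize = PLANES_PER_TILE;
        // the whole block cooperatively loads this tile of planes
        for (unsigned int i = threadIdx.x; i < tileSize; i += blockDim.x) {
            tile[i] = planes[base + i];
        }
        __syncthreads();
        for (unsigned int planeIndex = 0; planeIndex < tileSize; ++planeIndex) {
            float4 plane = tile[planeIndex];
            if (v.x * plane.x + v.y * plane.y + v.z * plane.z + plane.w > 0) {
                ++displacementSteps;
            }
            else {
                --displacementSteps;
            }
        }
        __syncthreads();   // don't refill the tile while other threads are still reading it
    }
    if (active) {
        displacements[vertexIndex] = displacementSteps;
    }
}
With 1024 planes this processes the planes in two tiles, and every plane value is read from on-chip memory after a single coalesced load from global memory.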