Why do Metal imageblocks only bulk copy in one direction? - metal

A primary feature of Metal's imageblocks is that you can write an image from an imageblock to a texture in device memory in one bulk operation.
It would seem natural for this feature to go both ways, i.e. that you could also read an image from device memory into the imageblock in one bulk operation.
But it seems there is no such operation.
Any idea why they would have bulk write operations going one direction but not the other?
Is writing an image in device memory into an imageblock still best done one pixel at a time?
(I'm focused on explicit imageblocks in compute kernels)

Related

Is CGPathContainsPoint() hardware accelerated?

I'm doing an iOS game and would like to use this method for collision detection.
As there are plenty (50+) of points to check every frame, I wondered if this method runs on the iDevice's graphics hardware.
Following up on @DavidRönnqvist's point: it doesn't matter whether it's "hardware accelerated" or not. What matters is whether it is fast enough for your purpose; you can then use Instruments to check where it is eating time and try to improve things.
Moving code to the GPU doesn't automatically make it faster; it can in fact make it much slower since you have to haul all the data over to GPU memory, which is expensive. Ideally to run on the GPU, you want to move all the data once, then do lots of expensive vector operations, and then move the data back (or just put it on the screen). If you can't make the problem look like that, then the GPU isn't the right tool.
It is possible that it is NEON accelerated, but again that's kind of irrelevant; the compiler NEON-accelerates lots of things (and running on the NEON doesn't always mean it runs faster, either). That said, I'd bet this kind of problem would run best on the NEON if you can test lots of points (hundreds or thousands) against the same curves.
You should assume that CGPathContainsPoint() is written to be pretty fast for the general case of "I have one random curve and one random point." If your problem looks like that, it seems unlikely that you will beat the Apple engineers on their own hardware (and 50 points isn't much more than 1). I'd assume, for instance, that they're already checking the bounding box for you and that your re-check is wasting time (but I'd profile it to be sure).
But if you can change the problem to something else, like "I have a known curve and tens of thousands of points," then you can probably hand-code a better solution and should look at Accelerate or even hand-written NEON to attack it.
Profile first, then optimize. Don't assume that "vector processor" is exactly equivalent to "fast," even when your problem is "mathy." That goes double for the graphics processor.
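CGPathContainsPoint's implementation is opaque, but for a polygonal path an even-odd containment test is cheap enough that 50 points per frame is unlikely to matter. A minimal sketch of that test (ignoring curves and winding rules, which the real API handles):

```cpp
#include <cstddef>
#include <vector>

struct Point { double x, y; };

// Even-odd (crossing number) test for a closed polygon: count how many
// edges cross a horizontal ray from p to +infinity. An odd count means
// the point is inside. This is a sketch of what a point-in-path test
// does for polygonal paths, not CGPathContainsPoint's actual code.
bool containsPoint(const std::vector<Point>& poly, Point p) {
    bool inside = false;
    for (std::size_t i = 0, j = poly.size() - 1; i < poly.size(); j = i++) {
        // Does edge (j, i) straddle the horizontal line through p?
        if ((poly[i].y > p.y) != (poly[j].y > p.y)) {
            double xCross = poly[j].x + (p.y - poly[j].y) *
                            (poly[i].x - poly[j].x) / (poly[i].y - poly[j].y);
            if (p.x < xCross) inside = !inside;
        }
    }
    return inside;
}
```

Running this for 50 points is trivial on the CPU; batching thousands of points against one fixed path is the shape of problem where Accelerate or NEON would start to pay off.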

How important is it to send interleaved vertex data on iOS?

I am using Assimp to import some 3d models.
Assimp is great, but it stores everything in a non-interleaved vertex format.
According to the Apple OpenGL ES Programming Guide, interleaved vertex data is preferred on ios: https://developer.apple.com/library/ios/#documentation/3DDrawing/Conceptual/OpenGLES_ProgrammingGuide/TechniquesforWorkingwithVertexData/TechniquesforWorkingwithVertexData.html#//apple_ref/doc/uid/TP40008793-CH107-SW8
I am using vertex array objects to consolidate all the buffer-related state changes - is it still worth the effort to interleave all the vertex data?
Because interleaved vertex data increases the locality of vertex data, it allows the GPU to cache much more efficiently and generally to be a lot lighter on memory bandwidth at that stage in the pipeline.
How much difference it makes obviously depends on a bunch of other factors — whether memory access is a bottleneck (though it usually is, since texturing is read intensive), how spaced out your vertex data is if not interleaved and the specifics of how that particular GPU does fetching and caching.
Uploading multiple vertex buffers and bundling them into a vertex array would in theory allow the driver to perform this optimisation behind your back (either by duplicating memory, or once it becomes reasonably confident that the buffers in the array aren't generally in use elsewhere), but I'm not confident that it will. Another way to look at it is that you should be able to make the optimisation yourself at the very end of your data pipeline, so you needn't plan for it in advance or change your toolset. It's an optimisation, so if it's significant work to implement, the general rule against premature optimisation applies: wait until you have hard performance data.
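Since the interleaving can happen at the very end of the data pipeline, the step itself is small. A sketch for position + normal streams as Assimp delivers them (the two-attribute layout is an assumption; adapt it to whatever attributes your meshes carry):

```cpp
#include <cstddef>
#include <vector>

// Interleave separate position (xyz) and normal (xyz) streams into one
// [px py pz nx ny nz] buffer, so each vertex's attributes sit together
// in memory and the GPU's vertex fetch stays cache-friendly.
std::vector<float> interleave(const std::vector<float>& positions,
                              const std::vector<float>& normals) {
    std::size_t vertexCount = positions.size() / 3;
    std::vector<float> out;
    out.reserve(vertexCount * 6);
    for (std::size_t v = 0; v < vertexCount; ++v) {
        for (int i = 0; i < 3; ++i) out.push_back(positions[3 * v + i]);
        for (int i = 0; i < 3; ++i) out.push_back(normals[3 * v + i]);
    }
    return out;
}
```

With this layout both attribute pointers use the same stride of 6 * sizeof(float), with the normal attribute at a 3 * sizeof(float) offset.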

People counting using OpenCV

I'm starting to research how to implement a system that counts the flow of people through a space.
The end goal is something like http://www.youtube.com/watch?v=u7N1MCBRdl0 . I'm working with OpenCV to start building it, and I'm reading and studying the subject. I'd appreciate any pointers to source code examples, articles, or anything else that could get me there faster.
I started by studying the blobtrack.exe sample, but I didn't get good results.
Thanks in advance.
Blob detection is the correct way to do this, as long as you choose good threshold values and your lighting is even and consistent; but the real problem here is writing a tracking algorithm that can keep track of multiple blobs, being resistant to dropped frames. Basically you want to be able to assign persistent IDs to each blob over multiple frames, keeping in mind that due to changing lighting conditions and due to people walking very close together and/or crossing paths, the blobs may drop out for several frames, split, and/or merge.
To do this 'properly' you'd want a fuzzy ID assignment algorithm that is resistant to dropped frames (i.e. the blob's ID persists, and ideally its motion is predicted, if the blob drops out for a frame or two). You'd probably also want to keep a history of ID merges and splits, so that if two IDs merge into one, and then that one splits back into two, you can re-assign the individual merged IDs to the resulting two blobs.
In my experience the openFrameworks openCv basic example is a good starting point.
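As a starting point for the ID-assignment idea above, here is a minimal greedy nearest-neighbour sketch. It has none of the robustness discussed (no motion prediction, no merge/split history, no tolerance for dropped frames); names and the distance threshold are illustrative:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Blob { double x, y; int id; };

// Each blob in the current frame inherits the ID of the closest
// unclaimed blob from the previous frame (within maxDist); otherwise
// it is treated as a new person and gets a fresh ID.
void assignIds(const std::vector<Blob>& prev, std::vector<Blob>& curr,
               double maxDist, int& nextId) {
    std::vector<bool> claimed(prev.size(), false);
    for (Blob& b : curr) {
        double best = maxDist;
        int bestIdx = -1;
        for (std::size_t i = 0; i < prev.size(); ++i) {
            if (claimed[i]) continue;
            double d = std::hypot(prev[i].x - b.x, prev[i].y - b.y);
            if (d < best) { best = d; bestIdx = static_cast<int>(i); }
        }
        if (bestIdx >= 0) {
            b.id = prev[bestIdx].id;
            claimed[bestIdx] = true;
        } else {
            b.id = nextId++;
        }
    }
}
```

A production tracker would replace the greedy matching with a global assignment (e.g. Hungarian algorithm) and keep IDs alive for a few frames after a blob disappears.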
I won't mark this as the right answer.
It's just an option for those who can read Portuguese or use a translator. It's my graduation project, and it explains one approach to counting people.
Limitations:
It does not behave well in environments where the background lighting changes a lot.
It must be configured for each location where it will be used.
Advantages:
It's fast!
I used OpenCV for the basic features, such as capturing frames and iterating over pixels, but the algorithm that counts people I wrote myself.
You can check it in this paper.
Final opinion about this project: it's not ready to go live or become a product, but it works very well as a base for study.

Is it practical to include an adaptive or optimizing memory strategy into a library?

I have a library that does I/O. There are a couple of external knobs for tuning the sizes of the memory buffers used internally. When I ran some tests I found that the sizes of the buffers can affect performance significantly.
But the optimum size seems to depend on a bunch of things: the available memory on the PC, the size of the files being processed (which varies from very small to huge), the number of files, the speed of the output stream relative to the input stream, and I'm not sure what else.
Does it make sense to build an adaptive memory strategy into the library? or is it better to just punt on that, and let the users of the library figure out what to use?
Has anyone done something like this - and how hard is it? Did it work?
Given different buffer sizes, I suppose the library could track the time it takes for various operations, and then it could make some decisions about which size was optimal. I could imagine having the library rotate through various buffer sizes in the initial I/O rounds... and then it eventually would do the calculations and adjust the buffer size in future rounds depending on the outcomes. But then, how often to re-check? How often to adjust?
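The probing scheme described above might be sketched like this (the function name and candidate list are illustrative, not from any particular library, and the throughput measurement itself is left abstract):

```cpp
#include <cstddef>
#include <vector>

// Pick the buffer size with the highest measured throughput.
// 'throughput' maps a candidate size to the bytes/sec observed while
// actually running a few I/O rounds at that size; how that measurement
// is taken (and over how many rounds) is up to the library.
template <typename Measure>
std::size_t pickBufferSize(const std::vector<std::size_t>& candidates,
                           Measure throughput) {
    std::size_t best = candidates.front();
    double bestRate = -1.0;
    for (std::size_t size : candidates) {
        double rate = throughput(size);  // e.g. average of N probe rounds
        if (rate > bestRate) {
            bestRate = rate;
            best = size;
        }
    }
    return best;
}
```

The "how often to re-check" question maps to simply re-running the probe every M rounds, since file sizes and stream speeds drift over time.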
The adaptive approach is sometimes referred to as "autonomic", using the analogy of a human's autonomic nervous system: you don't consciously control your heart rate and respiration; your autonomic nervous system does that.
You can read about some of this here, and here (apologies for the plugs, but I wanted to show that the concept is being taken seriously, and is manifesting in real products.)
My experience of using products that try to do this is that they do actually work, but they can make me unhappy: there is a tendency for them to take a "father knows best" approach. You make some (you believe small) change to your app or the environment, and something unexpected happens. You don't know why, and you don't know if it's good. So my rule for autonomy is:
Tell me what you are doing and why
Now sometimes the underlying math is quite complex - consider that some autonomic systems do trending and hence make predictive changes (the number of requests of this type is growing; let's provision more of resource X), so the mathematical models are non-trivial. Hence simple explanations are not always available. However, some level of feedback to the watching humans can be reassuring.

Fastest method to compute convolution

I have to apply a convolution filter to each row of many images. The classic case is 360 images of 1024x1024 pixels; in my use case it is 720 images of 560x600 pixels.
The problem is that my code is much slower than what is advertised in articles.
I have implemented the naive convolution, and it takes 2m 30s. I then switched to an FFT using FFTW, with complex-to-complex transforms, filtering two rows in each transform. I'm now at around 20s.
The thing is that articles advertise around 10s, and even less, for the classic case.
So I'd like to ask the experts here if there could be a faster way to compute the convolution.
Numerical Recipes suggests avoiding the bit-reversal reordering done in the FFT and adapting the frequency-domain filter function accordingly, but there is no code example of how this could be done.
Maybe I lose time copying data. With a real-to-real transform I wouldn't have to copy the data into complex values, but I'd still have to zero-pad.
EDIT: see my own answer below for progress feedback and further information on solving this issue.
Question (precise reformulation):
I'm looking for an algorithm or piece of code to apply a very fast convolution to a discrete, non-periodic function (512 to 2048 values). Apparently the discrete Fourier transform is the way to go. However, I'd like to avoid copying data, converting to complex, and the butterfly reordering.
FFT is the fastest technique known for convolving signals, and FFTW is the fastest free library available for computing the FFT.
The key to getting maximum performance (outside of hardware; the GPU is a good suggestion) will be to pad your signals to a power of two. When creating your FFTW plan, use the FFTW_PATIENT planner flag to get the best performance. It's highly unlikely that you will hand-roll a faster implementation than what FFTW provides (forget about N.R.). Also be sure to use the real version of the forward 1D FFT, not the complex version, and use single precision (float) if you can.
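To make the zero-padding concrete, here is a minimal, self-contained sketch of FFT convolution using a textbook recursive radix-2 transform. Nothing here is FFTW: a planned, real-input FFTW transform will be considerably faster than this allocation-heavy recursion, but the padding and pointwise-multiply structure is the same.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Textbook recursive radix-2 Cooley-Tukey FFT; a.size() must be a
// power of two. invert=true computes the inverse transform (the 1/n
// normalization is folded in as /2 per level).
void fft(std::vector<cd>& a, bool invert) {
    std::size_t n = a.size();
    if (n == 1) return;
    std::vector<cd> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = a[2 * i];
        odd[i]  = a[2 * i + 1];
    }
    fft(even, invert);
    fft(odd, invert);
    const double PI = std::acos(-1.0);
    double ang = 2 * PI / n * (invert ? -1 : 1);
    for (std::size_t i = 0; i < n / 2; ++i) {
        cd w = std::polar(1.0, ang * static_cast<double>(i));
        a[i]         = even[i] + w * odd[i];
        a[i + n / 2] = even[i] - w * odd[i];
        if (invert) { a[i] /= 2; a[i + n / 2] /= 2; }
    }
}

// Linear convolution of x and h: zero-pad both to the next power of
// two >= len(x)+len(h)-1, transform, multiply pointwise, invert.
std::vector<double> convolve(const std::vector<double>& x,
                             const std::vector<double>& h) {
    std::size_t need = x.size() + h.size() - 1, n = 1;
    while (n < need) n <<= 1;
    std::vector<cd> fx(x.begin(), x.end()), fh(h.begin(), h.end());
    fx.resize(n);  // zero-padding
    fh.resize(n);
    fft(fx, false);
    fft(fh, false);
    for (std::size_t i = 0; i < n; ++i) fx[i] *= fh[i];
    fft(fx, true);
    std::vector<double> out(need);
    for (std::size_t i = 0; i < need; ++i) out[i] = fx[i].real();
    return out;
}
```

For row-wise filtering of many images you would transform the filter once, then reuse its spectrum for every row, which removes half the transforms from the loop.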
If FFTW is not cutting it for you, then I would look at Intel's (very affordable) IPP library. They have hand-tuned FFTs for Intel processors, optimized for images with various bit depths.
Paul
CenterSpace Software
You may want to add image processing as a tag.
But this article may be of interest, especially given the assumption that the image size is a power of 2. You can also see where they optimize the FFT. I expect that the articles you are looking at made some assumptions and then optimized the equations for those.
http://www.gamasutra.com/view/feature/3993/sponsored_feature_implementation_.php
If you want to go faster you may want to use the GPU to actually do the work.
This book may be helpful for you, if you go with the GPU:
http://www.springerlink.com/content/kd6qm361pq8mmlx2/
This answer is to collect progress report feedback on this issue.
Edit 11 oct.:
The execution time I measured doesn't reflect the effective time of the FFT. I noticed that when my program ends, the CPU is still busy with system time, up to 42%, for about 10s. When I wait until the CPU is back to 0% before restarting my program, I get an execution time of 15.35s, which comes from the GPU processing. I get the same time if I comment out the FFT filtering.
So the FFT is in fact currently faster than the GPU and was simply hindered by a competing system task. I don't know yet what this system task is. I suspect it results from the allocation of a huge heap block into which I copy the processing result before writing it to disk. For the input data I use a memory map.
I'll now change my code to get an accurate measurement of the FFT processing time. Making it faster is still worthwhile, because there is room to optimize the GPU processing, for instance by pipelining the transfer of the data to process.
