Can I encode and decode JSON on the GPU using MetalKit? - ios

I have a situation where my database is a huge JSON, to the point that decoding and encoding it takes too long and my user experience suffers.
I am constantly syncing my DB with a device that communicates via BLE, and the DB gets bigger over time.
I used MetalKit in the past to speed up image filtering, but I am not a pro Metal programmer and do not have the tools to determine whether I can decode/encode my JSON using Metal.

The tasks that can be improved via the GPU are the ones that can be parallelized. Since the GPU has many more cores than a CPU, tasks that can be divided into smaller independent pieces (like image processing) are ideal for it. Encoding and decoding JSON needs a lot of serial processing, and in that scenario you should stay on the CPU.
I cannot see how you could efficiently parallelize the serialization and deserialization of JSON. If your JSON has an array with lots of small elements (all with the same structure), then maybe in that particular scenario the GPU could improve performance.
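To make the serial dependency concrete, here is a tiny C sketch (illustrative only, not Metal code): a worker that starts in the middle of the buffer cannot tell whether a ',' or ']' is structural or part of a string literal without the quote/escape/nesting state carried from every earlier byte, which is exactly what breaks a naive split across GPU threads.

    /* Illustration of why JSON decoding resists naive parallel splitting:
       the meaning of each byte depends on state accumulated from all
       previous bytes (inside a string? escaped? current nesting depth). */
    #include <stdbool.h>
    #include <stdio.h>

    int main(void) {
        const char *json = "[{\"name\": \"a,b]\", \"value\": 1}, {\"name\": \"c\", \"value\": 2}]";
        bool in_string = false, escaped = false;
        int depth = 0;
        for (const char *p = json; *p != '\0'; ++p) {
            if (escaped)                     { escaped = false; continue; }
            if (in_string && *p == '\\')     { escaped = true;  continue; }
            if (*p == '"')                   { in_string = !in_string; continue; }
            if (in_string)                   continue;   /* ',' and ']' here are just data */
            if (*p == '[' || *p == '{')      depth++;
            else if (*p == ']' || *p == '}') depth--;
        }
        /* depth ends at 0 only because the state was tracked serially from byte 0 */
        printf("final nesting depth: %d\n", depth);
        return 0;
    }

If the elements really are small and identically structured, splitting the array at element boundaries found by a serial scan like this and decoding the chunks on a few CPU threads is usually a more realistic win than a Metal port.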

Related

How to specify the code segment to record using the linux perf tool?

Disclaimer: I am new to perf and still trying to learn the ins/outs.
Currently, I have encountered this scenario: I need to analyze the performance of a program. This program consists of reading data, preprocessing the data, processing the data, and outputting results. I am only interested in the performance of the data-processing part and would like to use perf for various performance analyses. However, every usage I know of so far analyzes the other parts (reading data, preprocessing data, outputting results) as well. The time spent in those parts is relatively large, which makes it much harder for me to interpret perf's results.
Therefore, I would like to tell perf to analyze only a certain section of the code (the data processing).
As far as I know, VTune can achieve this by embedding __itt_resume and __itt_pause in my code. I am not sure whether a similar analysis is possible with perf.
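For reference, this is roughly what the VTune pattern I am referring to looks like (a minimal sketch; the phase functions are placeholders for my real code, and calling __itt_pause() early keeps collection off until the interesting region):

    /* Sketch of the VTune ITT pause/resume pattern mentioned above.
       __itt_pause()/__itt_resume() come from ittnotify.h, which ships
       with VTune; link against libittnotify. */
    #include <ittnotify.h>

    static void read_data(void)       { /* ... */ }
    static void preprocess_data(void) { /* ... */ }
    static void process_data(void)    { /* the only part I want profiled */ }
    static void output_results(void)  { /* ... */ }

    int main(void) {
        __itt_pause();            /* collection off during setup */
        read_data();
        preprocess_data();

        __itt_resume();           /* collect only around the region of interest */
        process_data();
        __itt_pause();

        output_results();
        return 0;
    }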
Any content on this would be greatly appreciated.

H2O: Iterating over data bigger than memory without loading all data into memory

Is there a way I can use H2O to iterate over data that is larger than the cumulative memory size of the cluster? I have a big data set which I need to iterate through in batches and feed into TensorFlow for gradient descent. At any given time, I only need to load one batch (or a handful) in memory. Is there a way to set up H2O to perform this kind of iteration without loading the entire data set into memory?
Here's a related question that was answered over a year ago, but doesn't solve my problem: Loading data bigger than the memory size in h2o
The short answer is this isn't what H2O was designed to do.
So unfortunately the answer today is no.
The longer answer... (Assuming that the intent of the question is regarding model training in H2O-3.x...)
I can think of at least two ways one might want to use H2O in this way: one-pass streaming, and swapping.
Think of one-pass streaming as having a continuous data stream feeding in, and the data constantly being acted on and then thrown away (or passed along).
Think of swapping in the operating-system sense of the word: there is fast storage (memory) and slow storage (disk), and the algorithms continuously sweep over the data, faulting (swapping) it from disk into memory.
Swapping just gets worse and worse from a performance perspective as the data gets bigger. H2O is never tested this way, so you are on your own. Maybe you can figure out how to enable an unsupported swapping mode from clues/hints in the other referenced Stack Overflow question (or the source code), but nobody runs that way. H2O was architected to be fast for machine learning by holding data in memory. Machine learning algorithms sweep over the data iteratively, again and again. If every data touch hits the disk, it's just not the experience the in-memory H2O-3 platform was designed to provide.
The streaming use case, especially for some algorithms like Deep Learning and DRF, definitely makes more sense for H2O. H2O algorithms support checkpoints, and you can imagine a scenario where you read some data, train a model, then purge that data and read in new data, and continue training from the checkpoint. In the deep learning case, you'd be updating the neural network weights with the new data. In the DRF case, you'd be adding new trees based on the new data.

Can I use Storm on a census database?

From what I know about Storm so far, it's used to analyze Twitter tweets to find trending topics, but can it be used to analyze data from the government's census? And since the data is structured, is Storm suitable for that?
Storm is generally used for processing unending streams of data, e.g. logs, the Twitter stream, or in my case the output of a web crawler.
I believe census-type data would come in the form of a fixed report, which could be treated as a stream but would probably lend itself better to processing via something like MapReduce on Hadoop (possibly with Cascading or Scalding as layers of abstraction over the details).
The structured nature of the data wouldn't prevent the use of any of these technologies; that's more related to the problem you are trying to solve.
Storm is designed for streaming data processing, where data arrives continuously. Your application already has all the data it needs to process, so batch processing is a better fit. If the data is structured, you can use R or other tools for the analysis, or write scripts to convert the data so that it can go into R as input. Only if it is a humongous dataset and you want to process it faster should you think about getting into Hadoop and writing your program around the analysis you have to do. Suggesting an architecture is only possible if you provide more details about the data size and the sort of analysis you want to do on it. If it's a smaller dataset, both Hadoop and Storm would be overkill for the problem that has to be solved.
--gtaank

How important is it to send interleaved vertex data on ios

I am using Assimp to import some 3d models.
Assimp is great, but it stores everything in a non-interleaved vertex format.
According to the Apple OpenGL ES Programming Guide, interleaved vertex data is preferred on ios: https://developer.apple.com/library/ios/#documentation/3DDrawing/Conceptual/OpenGLES_ProgrammingGuide/TechniquesforWorkingwithVertexData/TechniquesforWorkingwithVertexData.html#//apple_ref/doc/uid/TP40008793-CH107-SW8
I am using vertex array objects to consolidate all the buffer-related state changes - is it still worth the effort to interleave all the vertex data?
Because interleaving increases the locality of vertex data, it allows the GPU to cache much more efficiently and, in general, to be a lot lighter on memory bandwidth at that stage in the pipeline.
How much difference it makes obviously depends on a bunch of other factors: whether memory access is a bottleneck (though it usually is, since texturing is read intensive), how spread out your vertex data is when not interleaved, and the specifics of how that particular GPU does fetching and caching.
Uploading multiple vertex buffers and bundling them into a vertex array would in theory allow the driver to perform this optimisation behind your back (either by duplicating memory, or once it becomes reasonably confident that the buffers in the array aren't generally in use elsewhere), but I'm not confident that it will. The other way to look at it is that you should be able to make the optimisation yourself at the very end of your data pipeline, so you needn't plan for it in advance or change your toolset; a sketch of that re-packing step is below. It's an optimisation, so if it's significant work to implement, the general rule against premature optimisation applies: wait until you have hard performance data.
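For illustration, here is one possible version of that last-minute re-packing (a sketch only; the attribute slots and the exact vertex layout are assumptions, not something your code must match): copy Assimp's separate arrays into one interleaved buffer, then point each attribute into it by stride and offset.

    /* Re-pack Assimp's separate position/normal/UV arrays into one
       interleaved buffer, then bind each attribute by stride + offset. */
    #include <assimp/mesh.h>           /* struct aiMesh */
    #include <OpenGLES/ES2/gl.h>
    #include <stddef.h>
    #include <string.h>

    enum { ATTRIB_POSITION = 0, ATTRIB_NORMAL = 1, ATTRIB_TEXCOORD = 2 };

    typedef struct {
        float position[3];
        float normal[3];
        float texCoord[2];
    } Vertex;                          /* 32 bytes per vertex */

    /* 'out' must hold mesh->mNumVertices entries; assumes positions,
       normals and one UV set are present. */
    static void interleaveMesh(const struct aiMesh *mesh, Vertex *out) {
        for (unsigned int i = 0; i < mesh->mNumVertices; ++i) {
            memcpy(out[i].position, &mesh->mVertices[i], 3 * sizeof(float));
            memcpy(out[i].normal,   &mesh->mNormals[i],  3 * sizeof(float));
            memcpy(out[i].texCoord, &mesh->mTextureCoords[0][i], 2 * sizeof(float));
        }
    }

    /* Called while the VAO and the single interleaved VBO are bound. */
    static void setVertexAttribPointers(void) {
        glVertexAttribPointer(ATTRIB_POSITION, 3, GL_FLOAT, GL_FALSE,
                              sizeof(Vertex), (void *)offsetof(Vertex, position));
        glVertexAttribPointer(ATTRIB_NORMAL,   3, GL_FLOAT, GL_FALSE,
                              sizeof(Vertex), (void *)offsetof(Vertex, normal));
        glVertexAttribPointer(ATTRIB_TEXCOORD, 2, GL_FLOAT, GL_FALSE,
                              sizeof(Vertex), (void *)offsetof(Vertex, texCoord));
        glEnableVertexAttribArray(ATTRIB_POSITION);
        glEnableVertexAttribArray(ATTRIB_NORMAL);
        glEnableVertexAttribArray(ATTRIB_TEXCOORD);
    }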

Fastest method to compute convolution

I have to apply a convolution filter to each row of many images. The classic case is 360 images of 1024x1024 pixels. In my use case it is 720 images of 560x600 pixels.
The problem is that my code is much slower than what is advertised in articles.
I have implemented the naive convolution, and it takes 2m 30s. I then switched to an FFT using FFTW. I used complex-to-complex transforms, filtering two rows in each transform. I'm now around 20s.
The thing is that the articles advertise around 10s, and even less, for the classic case.
So I'd like to ask the experts here if there could be a faster way to compute the convolution.
Numerical Recipes suggests avoiding the reordering done in the DFT and adapting the frequency-domain filter function accordingly, but there is no code example of how this could be done.
Maybe I lose time copying data. With a real-to-real transform I wouldn't have to copy the data into complex values, but I have to pad with zeros anyway.
EDIT: see my own answer below for progress feedback and further information on solving this issue.
Question (precise reformulation):
I'm looking for an algorithm or piece of code to apply a very fast convolution to a discrete, non-periodic function (512 to 2048 values). Apparently the discrete Fourier transform is the way to go. However, I'd like to avoid the data copy and conversion to complex values, and avoid the butterfly reordering.
FFT is the fastest technique known for convolving signals, and FFTW is the fastest free library available for computing the FFT.
The key for you to get maximum performance (outside of hardware ... the GPU is a good suggestion) will be to pad your signals to a power of two. When using FFTW, use the FFTW_PATIENT setting when creating your plan to get the best performance. It's highly unlikely that you will hand-roll a faster implementation than what FFTW provides (forget about N.R.). Also be sure to use the real version of the forward 1D FFT, not the complex version, and use single (floating-point) precision if you can.
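A minimal single-precision sketch of that recipe (the padded length and variable names are illustrative; pad to at least row length + filter length - 1 so the circular convolution doesn't wrap around):

    /* Row convolution via FFTW's real transforms, single precision.
       Build with: cc conv.c -lfftw3f -lm                              */
    #include <fftw3.h>
    #include <string.h>

    #define N 1024                 /* padded length, power of two, >= row + kernel - 1 */

    int main(void) {
        float *row    = fftwf_alloc_real(N);
        float *kernel = fftwf_alloc_real(N);
        fftwf_complex *ROW = fftwf_alloc_complex(N / 2 + 1);
        fftwf_complex *KER = fftwf_alloc_complex(N / 2 + 1);

        /* FFTW_PATIENT spends time searching for a fast plan up front;
           the same plans are reused for every row of every image.      */
        fftwf_plan fwd_row = fftwf_plan_dft_r2c_1d(N, row,    ROW, FFTW_PATIENT);
        fftwf_plan fwd_ker = fftwf_plan_dft_r2c_1d(N, kernel, KER, FFTW_ESTIMATE);
        fftwf_plan inv_row = fftwf_plan_dft_c2r_1d(N, ROW, row, FFTW_PATIENT);

        /* Planning may scribble on the arrays, so fill them afterwards:
           copy the 560 real samples into `row`, the filter taps into
           `kernel`, and leave the rest as zero padding.                 */
        memset(row,    0, N * sizeof(float));
        memset(kernel, 0, N * sizeof(float));
        /* ... fill row[0..559] and kernel[0..taps-1] here ... */

        fftwf_execute(fwd_ker);                 /* kernel spectrum, computed once */
        fftwf_execute(fwd_row);                 /* per row */

        for (int k = 0; k < N / 2 + 1; ++k) {   /* pointwise complex product */
            float re = ROW[k][0] * KER[k][0] - ROW[k][1] * KER[k][1];
            float im = ROW[k][0] * KER[k][1] + ROW[k][1] * KER[k][0];
            ROW[k][0] = re / N;                 /* fold in FFTW's 1/N scaling */
            ROW[k][1] = im / N;
        }
        fftwf_execute(inv_row);                 /* `row` now holds the filtered row */

        fftwf_destroy_plan(fwd_row);
        fftwf_destroy_plan(fwd_ker);
        fftwf_destroy_plan(inv_row);
        fftwf_free(row);  fftwf_free(kernel);
        fftwf_free(ROW);  fftwf_free(KER);
        return 0;
    }

For the 720-image workload you would keep the plans and the kernel spectrum around and only redo the per-row transforms; fftwf_plan_many_dft_r2c can batch whole images if you want to go further.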
If FFTW is not cutting it for you, then I would look at Intel's (very affordable) IPP library. They have hand-tuned FFTs for Intel processors, optimized for images with various bit depths.
Paul
CenterSpace Software
You may want to add image processing as a tag.
But this article may be of interest, especially with the assumption that the image is a power of 2. You can also see where they optimize the FFT. I expect that the articles you are looking at made some assumptions and then optimized the equations for those.
http://www.gamasutra.com/view/feature/3993/sponsored_feature_implementation_.php
If you want to go faster you may want to use the GPU to actually do the work.
This book may be helpful for you, if you go with the GPU:
http://www.springerlink.com/content/kd6qm361pq8mmlx2/
This answer is to collect progress report feedback on this issue.
Edit, 11 Oct:
The execution time I measured doesn't reflect the effective time of the FFT. I noticed that when my program ends, the CPU is still busy with system time, up to 42%, for about 10s. When I wait until the CPU is back to 0% before restarting my program, I then get a 15.35s execution time, which comes from the GPU processing. I get the same time if I comment out the FFT filtering.
So the FFT is in fact currently faster than the GPU and was simply hindered by a competing system task. I don't know yet what this system task is. I suspect it results from the allocation of a huge heap block into which I copy the processing result before writing it to disk. For the input data I use a memory map.
I'll now change my code to get an accurate measurement of the FFT processing time. Making it faster is still relevant, because there is room to optimize the GPU processing, for instance by pipelining the transfer of the data to process.
