Delta Compression of Stream - stream

I would like to use delta compression to compress a stream of data. I expect this to be very efficient, because it's a stream of increasing 32bit integers.
There seems plenty of libraries for creating a delta between two binary objects, but I couldn't find anything for stream encoding.
Any suggestions?

You might consider the JavaFastPFOR library, which provides simple delta encoding in the me.lemire.integercompression.Delta class, delta-based algorithms that compress blocks of integers, and other algorithms for compressing integer arrays. This library claims to have been optimized for speed (based on a C++ library with speed goals) and the GitHub readme offers scholarly papers documenting its algorithms.

Related

Julia ML: Is there a recommended data format for loading data to Flux, Knet, Deep Learning Libraries

I use Tensorflow for deep learning work, but I was interested in some of the features of Julia for ML. Now in Tensorflow, there is a clear standard that protocol buffers--meaning TFRecords format is the best way to load sizable datasets to the GPUs for model training. I have been reading the Flux, KNET, documentation as well as other forum posts looking to see if there is any particular recommendation on the most efficient data format. But I have not found one.
My question is, is there a recommended data format for the Julia ML libraries to facilitate training? In other words, are there any clear dataset formats that I should avoid because of bad performance?
Now, I know that there is a Protobuf.jl library so users can still use protocol buffers. I was planning to use protocol buffers for now, since I can then use the same data format for Tensorflow and Julia. However, I also found this interesting Reddit post about how the user is not using protocol buffers and just using straight Julia Vectors.
https://www.reddit.com/r/MachineLearning/comments/994dl7/d_hows_julia_language_mit_for_ml/
I get that the Julia ML libraries are likely data storage format agnostic. Meaning that no matter what format in which the data is stored, the data gets decoded to some sort of vector or matrix format anyway. So in that case I can use whatever format. But just wanted to make sure I did not miss anything in the documentation or such about problems or low performance due to using the wrong data storage format.
For in-memory use just use arrays and vectors. They're just big contiguous lumps of memory with some metadata. You can't really get any better than that.
For serializing to another Julia process, Julia will handle that for you and use the stdlib Serialization module.
For serializing to disk you should either Just use Serialization.serialize (possibly compressed) or, if you think you might need to read from another program or if you think you'll change Julia version before you're done with the data you can use BSON.jl or Feather.jl.
In the near future, JLSO.jl will be a good option for replacing Serialization.

What to look for in optimizing already heavily optimized embeded C++ code in term of memory?

What to look for to optimize heavily optimized embedded DSP code in terms of memory that is outside of the obvious?
I need to reduce the memory by at least 10 percent.
In DSP applications, the requirements for precision and/or quantization of data types and saved intermediate data can often be analysed. If the minimum requirements are not multiples of 256 or 8-bits, it might be possible to reformat and pack the data type elements in non-byte-aligned structures or arrays to save data memory. Of course, this comes with the trade-off of a higher computational cost and code footprint accessing said data, which may or may not be significant in your application.

How possible vector operations on a matrix that does not fit memory

How is it possible to make calculations on a matrix with size 6GB and RAM is 4GB? What techniques are used in this case? Is there any open source solution or tool using files during vector operations?
Yes, the famous Hadoop is an open source computing platform, which can be used for operations on pretty big matrices (and not only for that).
For examples, please read this page.

Are algorithms like Huffman coding actually used in production?

Currently, I am developing an app that needs to store large amount of text on an iPad. My question is, are algorithms like Huffman coding actually used in production? I just need a very simple compression algorithm (there's not going to be a huge amount of text and it only needs a more efficient method of storage), so would something like Huffamn work? Should I look into some other kind of compression library?
From Wikipedia on the subject:
Huffman coding today is often used as a "back-end" to some other compression methods. DEFLATE (PKZIP's algorithm) and multimedia codecs such as JPEG and MP3 have a front-end model and quantization followed by Huffman coding (or variable-length prefix-free codes with a similar structure, although perhaps not necessarily designed by using Huffman's algorithm).
So yes, Huffman coding is used in production. Quite a lot, even.
Huffman coding (also entropy coding) is used very widely. Anything you imagine that is being compressed, with exceptions of some very old schemes, uses them. Image compression, Zip and RAR archives, every imaginable codec and so on.
Keep in mind that Huffman coding is lossless and requires you to know all of the data you're compressing in advance. If you're doing lossy compression, you need to perform some transformations on your data to reduce its entropy first (removing and quantizing DCT coefficients in JPEG compression). If you want Huffnam coding to work on real-time data (you don't know every bit in advance), adaptive Huffman coding is used. You can find a whole lot on this topic in signal processing literature.
Some of the pre-Huffman compression include schemes like runlength coding (fax machines). Runlength coding is still sometimes used (JPEG, again) in combination with Huffman coding.
Yes, they are used in production.
As others have mentioned, true Huffman requires you to analyze the entire corpus first to get the most efficient encoding, so it isn't typically used by itself.
Probably shortly after you were born, I implemented Huffman compression on the Psion Series 3 handheld computer in C in order to compress data which was preloaded onto data packs and only decompressed on the handheld. In those days, space was tight and there was no built-in compression library.
Like most software which is well-specified, I would strongly consider using any feature built into the iOS or standard packages available in your development environment.
This will save a lot of debugging and allow you to concentrate on the most significant portions of your app which add value.
Large amounts of text will be amenable to zip-style compression. And it will be unlikely that spending effort improving its performance (in either space or time) will pay off in the long run.
There's an iOS embedded mechanism to support zlib algorithm (zlib.h in Objective-C).
You may implement your own compression functionality and utilize iOS embedded zlib functions. And compare the performance.
I think the embedded zlib functionality will be faster and will give higher compression ratio.
Huffman codes are the backbone to many "real world" production algorithms. The common compression algorithms today improve upon Huffman codes by transforming their data to improve compression ratios. Of course, there are many application specific techniques used to do this.
As for whether or not you should use Huffman codes, my question is why should you when you can achieve better compression and ease of code by using an already implemented 3rd party library?
Yes, I'm using a huffman compression in my web app for storing a complete snapshot of my engine in an hidden input field. First off it was just curiosity but it offload my SESSION memory moving it to the client browser memory and i used it to store it in a file to backup and exchange that snapshot with my collegue. Man, you have to see their faces when you can just load a file in an admin panel to load the engine in the web!!! It's basically a serialized compressed and base64 encoded array. It helps me to save about 15% bandwith but I think I can do it better now.
Yes, you're using Huffman coding (decoding) to read this page, since web pages are compressed to the gzip format. So it is used pretty much every nanosecond by web browsers all over the planet.
Huffman coding is almost never used by itself, but rather always with some higher-order modeling of the data to give the Huffman algorithm what it needs to work with. So LZ77 models text and other byte-oriented data as repeating strings coded as literals and length/distance pairs, which is then fed to Huffman coding, e.g. in the deflate compressed format using zlib. Or with difference or other prediction coding of pixels for PNG, followed by Huffman coding.
As for what you should use, look at lz4, zlib, and zstd.

Fastest method to compute convolution

I have to apply a convolution filter on each row of many images. The classic is 360 images of 1024x1024 pixels. In my use case it is 720 images 560x600 pixels.
The problem is that my code is much slower than what is advertised in articles.
I have implemented the naive convolution, and it takes 2m 30s. I then switched to FFT using fftw. I used complex 2 complex, filtering two rows in each transform. I'm now around 20s.
The thing is that articles advertise around 10s and even less for the classic condition.
So I'd like to ask the experts here if there could be a faster way to compute the convolution.
Numerical recipes suggest to avoid the sorting done in the dft and adapt the frequency domain filter function accordingly. But there is no code example how this could be done.
Maybe I lose time in copying data. With real 2 real transform I wouldn't have to copy the data into the complexe values. But I have to pad with 0 anyway.
EDIT: see my own answer below for progress feedback and further information on solving this issue.
Question (precise reformulation):
I'm looking for an algorithm or piece of code to apply a very fast convolution to a discrete non periodic function (512 to 2048 values). Apparently the discrete time Fourier transform is the way to go. Though, I'd like to avoid data copy and conversion to complex, and avoid the butterfly reordering.
FFT is the fastest technique known for convolving signals, and FFTW is the fastest free library available for computing the FFT.
The key for you to get maximum performance (outside of hardware ... the GPU is a good suggestion) will be to pad your signals to a power of two. When using FFTW use the 'patient' setting when creating your plan to get the best performance. It's highly unlikely that you will hand-roll a faster implementation than what FFTW provides (forget about N.R.). Also be sure to be using the Real version of the forward 1D FFT and not the Complex version; and only use single (floating point) precision if you can.
If FFTW is not cutting it for you, then I would look at Intel's (very affordable) IPP library. The have hand tuned FFT's for Intel processors that have been optimized for images with various bit depths.
Paul
CenterSpace Software
You may want to add image processing as a tag.
But, this article may be of interest, esp with the assumption the image is a power or 2. You can also see where they optimize the FFT. I expect that the articles you are looking at made some assumptions and then optimized the equations for those.
http://www.gamasutra.com/view/feature/3993/sponsored_feature_implementation_.php
If you want to go faster you may want to use the GPU to actually do the work.
This book may be helpful for you, if you go with the GPU:
http://www.springerlink.com/content/kd6qm361pq8mmlx2/
This answer is to collect progress report feedback on this issue.
Edit 11 oct.:
The execution time I measured doesn't reflect the effective time of the FFT. I noticed that when my program ends, the CPU is still busy in system time up to 42% for 10s. When I wait until the CPU is back to 0%, before restarting my program I then get the 15.35s execution time which comes from the GPU processing. I get the same time if I comment out the FFT filtering.
So the FFT is in fact currently faster then the GPU and was simply hindered by a competing system task. I don't know yet what this system task is. I suspect it results from the allocation of a huge heap block where I copy the processing result before writing it to disk. For the input data I use a memory map.
I'll now change my code to get an accurate measurement of the FFT processing time. Making it faster is still actuality because there is room to optimize the GPU processing like for instance by pipelining the transfer of data to process.

Resources