Why does binary serialization have performance and size benefits? - binary-data

Why is binary serialization faster, and why does it take up relatively little disk space, compared to text-based serialization?
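One way to see both effects is to compare a JSON (text) encoding of integers with a fixed-width binary encoding. This is a minimal Python sketch; the sample values are arbitrary:

```python
import json
import struct

# The same list of 32-bit integers, serialized both ways.
values = list(range(1000, 2000))

# Text form: every digit, comma, and space costs a byte, and reading it
# back requires parsing characters into numbers.
text_form = json.dumps(values).encode("utf-8")

# Binary form: exactly 4 bytes per integer, and reading it back is a
# straight memory copy with no parsing.
binary_form = struct.pack(f"{len(values)}i", *values)

print(len(text_form), len(binary_form))
```

The binary form is both smaller (fixed 4 bytes per value versus ~6 characters of text) and cheaper to decode, since no character-to-number conversion is needed.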


Can I encode and decode a JSON on the GPU using MetalKit?

I have a situation where my database is a huge JSON blob, to the point that decoding and encoding it take too long and the user experience suffers.
I am syncing my DB constantly with a device that is communicating via BLE, and the DB gets bigger with time.
I used MetalKit in the past to speed up image filtering, but I am not a professional Metal programmer and do not have the tools to determine whether I can decode/encode my JSON using Metal.
The tasks that can be improved via the GPU are the ones that can be parallelized. Since a GPU has many more cores than a CPU, a task that can be divided into smaller independent tasks (like image processing) is ideal for the GPU. Encoding and decoding JSON requires a lot of serial processing, and in that scenario you should stay on the CPU.
I cannot see how you could efficiently parallelize the serialization and deserialization of JSON. If your JSON has an array with lots of small elements (all with the same structure), then perhaps in that particular scenario the GPU could improve performance.
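For completeness, that special case (many small, same-shaped elements) is usually exploited by making each record independently decodable, e.g. newline-delimited JSON. Here is a Python sketch of the decomposition; note that CPython's json.loads holds the GIL, so the threads illustrate the structure rather than deliver a real speedup — processes or a native/GPU parser would be needed for that:

```python
import json
from concurrent.futures import ThreadPoolExecutor

# Newline-delimited JSON: every line is an independent document, so each
# line can be decoded without seeing the rest of the stream.
ndjson = "\n".join(json.dumps({"id": i, "value": i * i}) for i in range(100))

# Each line is handed to a worker independently; results come back in order.
with ThreadPoolExecutor(max_workers=4) as pool:
    records = list(pool.map(json.loads, ndjson.splitlines()))

print(records[3])  # {'id': 3, 'value': 9}
```

The key property is that no record depends on bytes outside its own line — exactly the structure a parallel decoder needs.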

directories and files to hdf5 groups and datasets

This is probably a really stupid question, or at least a very simple one. Please just point me in the right direction if it is not worth a detailed reply.
My understanding is that HDF5 is good to store hierarchical data. I use a file system to store my data --- root directory, sub-directories, data files (txt), and metadata text files. The directory names are usually descriptive as well. So it seems natural to bundle these data into a hdf5 file (or files) using directories as groups and data files as datasets.
My question is, are there any advantages in doing so? I want to be able to select and combine datasets by using groups and/or attributes (like SELECT in a database). Also, are there tools to do this?
Sure, this is possible.
For example, we have a web application for visualizing scientific data that relies on a single 250 GB HDF5 file with 30,000 groups, each of which contains multiple datasets. The groups and datasets have attributes. The web app accesses only this single HDF5 file to retrieve all information.
The advantage of using an HDF5 file is that it is quite portable and can be used from many different languages (C++, Java, Python, etc.). It's also really efficient for storing binary data, and if you combine compression and chunking you can even increase performance on today's multi-core CPUs.
However, HDF5 is quite different from an RDBMS. You can't really use SELECT like in a database; you have to iterate (possibly recursively) through the groups/datasets. There are some libraries (Pandas, PyTables) that are built on top of HDF5 and provide a higher abstraction. The downside is that you might lose some portability.
Another approach is to use a hybrid approach:
You can store the meta-information in a RDBMS and the binary data in one or multiple HDF5 files. This might give you best of both worlds.
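The recursive iteration mentioned above can be sketched with h5py. This is a minimal example; the file name, group layout, and the "kind" attribute are hypothetical stand-ins for whatever metadata your directories carry:

```python
import h5py
import numpy as np

# Build a tiny file: groups play the role of directories, datasets the
# role of data files, and attributes carry the metadata.
with h5py.File("demo.h5", "w") as f:
    for name, kind in [("run1", "calib"), ("run2", "science")]:
        grp = f.create_group(name)
        grp.attrs["kind"] = kind
        grp.create_dataset("data", data=np.arange(5))

# Recursively visit every object and collect datasets whose parent group
# matches an attribute -- the HDF5 analogue of a SELECT ... WHERE.
matches = []
with h5py.File("demo.h5", "r") as f:
    def visit(name, obj):
        if isinstance(obj, h5py.Dataset) and obj.parent.attrs.get("kind") == "science":
            matches.append(name)
    f.visititems(visit)

print(matches)  # ['run2/data']
```

visititems walks the whole hierarchy for you, so the "SELECT" is just a predicate inside the callback.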
Here is also a list of useful libraries:
Python:
h5py - simple Python HDF5 package
PyTables - high-level abstraction over HDF5 datasets (support for tables)
Pandas - data analysis library that supports HDF5 as a backend
Java:
JHI5 - the low-level JNI wrappers: very flexible, but also quite tedious to use
Java HDF object package - a high-level interface based on JHI5
JHDF5 - high-level interface building on the JHI5 layer
R:
rhdf5 (Bioconductor)
Viewers:
vitables - supports PyTables
HDF5View - official HDF5 Java viewer

Any point in chunking a 1-D dataset

I'm a newcomer to the HDF5 World. My data is composed of a series of 1D datasets. My application needs to read one dataset at a time, and when it reads a dataset, it needs to read the dataset in its entirety.
I have a basic understanding of HDF5 chunking: a chunk is laid out contiguously on the disk and is fetched in one read operation.
I see how chunking will be helpful when you have a multi-dimensional array and you need to frequently access items that are not contiguous. On the other hand, I don't see chunking being useful in my case: my dataset is 1-dimensional and will always be read in its entirety.
Is my analysis correct? If not, please help me understand how chunking will help my cause.
Chunking allows you to handle files that are too big to fit into memory, so that they can be processed in chunks. This is not something specific to HDF.

What HDF offers you is a storage capability in an open-source, transparent binary format with some nice features like metadata. If you can read the file into memory at once and are not interested in alternative ways of storing your files, then I do not see the necessity of using HDF.

However, if you want to store similar files and possibly related results in a hierarchical, i.e. folder-like, way in one file to improve your workflow, or if you have files that need to be processed in chunks because they do not fit into memory at once, then HDF might be just what you are looking for.
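The "process it in chunks" idea from this answer, independent of HDF5, amounts to reading fixed-size blocks so the whole file never has to sit in memory. A minimal Python sketch, with a made-up helper name:

```python
import io

def iter_chunks(stream, chunk_size=4):
    """Yield fixed-size blocks so the whole stream never sits in memory."""
    while True:
        block = stream.read(chunk_size)
        if not block:  # empty read means end of stream
            return
        yield block

# A 10-byte "file" read in 4-byte chunks: two full blocks and a remainder.
data = io.BytesIO(bytes(range(10)))
sizes = [len(b) for b in iter_chunks(data)]
print(sizes)  # [4, 4, 2]
```

An HDF5 chunked dataset applies the same idea at the storage layer: each chunk is stored contiguously and fetched in one read.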

Delta Compression of Stream

I would like to use delta compression to compress a stream of data. I expect this to be very efficient, because it's a stream of increasing 32bit integers.
There seem to be plenty of libraries for creating a delta between two binary objects, but I couldn't find anything for stream encoding.
Any suggestions?
You might consider the JavaFastPFOR library, which provides simple delta encoding in the me.lemire.integercompression.Delta class, delta-based algorithms that compress blocks of integers, and other algorithms for compressing integer arrays. The library claims to be optimized for speed (it is based on a C++ library with the same goal), and the GitHub readme points to scholarly papers documenting its algorithms.
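For illustration, the core idea — store successive differences, then pack each (small) delta into as few bytes as possible — can be sketched in Python with a variable-length byte encoding. The function names are made up, and real libraries like JavaFastPFOR use far more optimized block-based schemes:

```python
def delta_varint_encode(values):
    """Encode a non-decreasing sequence of 32-bit ints as varint deltas."""
    out = bytearray()
    prev = 0
    for v in values:
        delta = v - prev          # increasing stream => small non-negative deltas
        prev = v
        while True:
            byte = delta & 0x7F   # low 7 bits
            delta >>= 7
            if delta:
                out.append(byte | 0x80)  # high bit set: more bytes follow
            else:
                out.append(byte)
                break
    return bytes(out)

def delta_varint_decode(data):
    values, current, shift, delta = [], 0, 0, 0
    for byte in data:
        delta |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7            # continuation: accumulate more bits
        else:
            current += delta      # undo the delta step
            values.append(current)
            delta, shift = 0, 0
    return values

seq = [100, 105, 106, 200, 1000]
encoded = delta_varint_encode(seq)
print(len(encoded))  # 6 bytes, versus 20 bytes as raw 32-bit integers
```

Because the stream is increasing, most deltas fit in one byte, which is where the compression comes from.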

Are algorithms like Huffman coding actually used in production?

Currently, I am developing an app that needs to store a large amount of text on an iPad. My question is, are algorithms like Huffman coding actually used in production? I just need a very simple compression algorithm (there's not going to be a huge amount of text and it only needs a more efficient method of storage), so would something like Huffman work? Should I look into some other kind of compression library?
From Wikipedia on the subject:
Huffman coding today is often used as a "back-end" to some other compression methods. DEFLATE (PKZIP's algorithm) and multimedia codecs such as JPEG and MP3 have a front-end model and quantization followed by Huffman coding (or variable-length prefix-free codes with a similar structure, although perhaps not necessarily designed by using Huffman's algorithm).
So yes, Huffman coding is used in production. Quite a lot, even.
Huffman coding (entropy coding in general) is used very widely. Almost anything you can imagine that is being compressed, with the exception of some very old schemes, uses it: image compression, Zip and RAR archives, every imaginable codec, and so on.
Keep in mind that Huffman coding is lossless and requires you to know all of the data you're compressing in advance. If you're doing lossy compression, you need to perform some transformations on your data to reduce its entropy first (e.g. quantizing and discarding DCT coefficients in JPEG compression). If you want Huffman coding to work on real-time data (where you don't know every bit in advance), adaptive Huffman coding is used. You can find a whole lot on this topic in the signal processing literature.
Some pre-Huffman compression schemes include run-length coding (fax machines). Run-length coding is still sometimes used (in JPEG, again) in combination with Huffman coding.
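A minimal sketch of run-length coding as used in that fax-machine setting — long runs of identical symbols collapse to (symbol, count) pairs. Python, with made-up helper names:

```python
from itertools import groupby

def rle_encode(data):
    """Run-length coding: collapse runs of identical symbols to (symbol, count)."""
    return [(sym, len(list(run))) for sym, run in groupby(data)]

def rle_decode(pairs):
    return "".join(sym * count for sym, count in pairs)

# A fax-style scanline: long runs of white (W) and black (B) pixels.
scanline = "WWWWBBBWWWWWWB"
encoded = rle_encode(scanline)
print(encoded)  # [('W', 4), ('B', 3), ('W', 6), ('B', 1)]
```

In practice the run lengths themselves are then Huffman-coded (short, common runs get short codes), which is the combination the answer above describes.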
Yes, they are used in production.
As others have mentioned, true Huffman requires you to analyze the entire corpus first to get the most efficient encoding, so it isn't typically used by itself.
Probably shortly after you were born, I implemented Huffman compression on the Psion Series 3 handheld computer in C in order to compress data which was preloaded onto data packs and only decompressed on the handheld. In those days, space was tight and there was no built-in compression library.
As with most well-specified software tasks, I would strongly consider using any feature built into iOS or the standard packages available in your development environment.
This will save a lot of debugging and allow you to concentrate on the most significant portions of your app, the ones that add value.
Large amounts of text will be amenable to zip-style compression, and it is unlikely that spending effort improving on its performance (in either space or time) will pay off in the long run.
There's a mechanism embedded in iOS to support the zlib algorithm (zlib.h in Objective-C).
You could implement your own compression functionality and compare its performance against the embedded zlib functions.
I think the embedded zlib functionality will be faster and will give a higher compression ratio.
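As a quick illustration of the zlib round trip (sketched in Python rather than Objective-C; the underlying algorithm is the same one zlib.h exposes, and the sample text is made up):

```python
import zlib

# Repetitive natural-language text, the kind an app might store.
text = ("This is the kind of repetitive natural-language text an app might "
        "store. " * 50).encode("utf-8")

compressed = zlib.compress(text, level=9)  # level 9 = best compression
restored = zlib.decompress(compressed)

print(len(text), len(compressed))
```

For text this typically shrinks the payload dramatically, with no tuning and no hand-rolled Huffman code, which is the point of the answer above.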
Huffman codes are the backbone to many "real world" production algorithms. The common compression algorithms today improve upon Huffman codes by transforming their data to improve compression ratios. Of course, there are many application specific techniques used to do this.
As for whether or not you should use Huffman codes, my question is why should you when you can achieve better compression and ease of code by using an already implemented 3rd party library?
Yes, I'm using Huffman compression in my web app for storing a complete snapshot of my engine in a hidden input field. At first it was just curiosity, but it offloads my SESSION memory by moving it to the client browser's memory, and I used it to store the snapshot in a file to back it up and exchange it with my colleagues. Man, you have to see their faces when you can just load a file in an admin panel to load the engine in the web! It's basically a serialized, compressed, and base64-encoded array. It saves me about 15% bandwidth, but I think I can do better now.
Yes, you're using Huffman coding (decoding) to read this page, since web pages are compressed to the gzip format. So it is used pretty much every nanosecond by web browsers all over the planet.
Huffman coding is almost never used by itself, but rather always with some higher-order modeling of the data to give the Huffman algorithm what it needs to work with. So LZ77 models text and other byte-oriented data as repeating strings coded as literals and length/distance pairs, which is then fed to Huffman coding, e.g. in the deflate compressed format using zlib. Or with difference or other prediction coding of pixels for PNG, followed by Huffman coding.
As for what you should use, look at lz4, zlib, and zstd.