Currently, I am developing an app that needs to store a large amount of text on an iPad. My question is: are algorithms like Huffman coding actually used in production? I just need a very simple compression algorithm (there's not going to be a huge amount of text and it only needs a more efficient method of storage), so would something like Huffman coding work, or should I look into some other kind of compression library?
From Wikipedia on the subject:
Huffman coding today is often used as a "back-end" to some other compression methods. DEFLATE (PKZIP's algorithm) and multimedia codecs such as JPEG and MP3 have a front-end model and quantization followed by Huffman coding (or variable-length prefix-free codes with a similar structure, although perhaps not necessarily designed by using Huffman's algorithm).
So yes, Huffman coding is used in production. Quite a lot, even.
Huffman coding (a form of entropy coding) is used very widely. Almost anything you can imagine that is compressed, with the exception of some very old schemes, uses it: image compression, Zip and RAR archives, every imaginable codec, and so on.
Keep in mind that Huffman coding is lossless and requires you to know all of the data you're compressing in advance. If you're doing lossy compression, you need to perform some transformations on your data to reduce its entropy first (quantizing and discarding DCT coefficients in JPEG compression, for example). If you want Huffman coding to work on real-time data (where you don't know every bit in advance), adaptive Huffman coding is used. You can find a whole lot on this topic in the signal processing literature.
Some pre-Huffman compression schemes include run-length coding (used in fax machines). Run-length coding is still sometimes used (in JPEG, again) in combination with Huffman coding.
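For illustration, here is a toy sketch of byte-oriented run-length coding in C. The `rle_encode` helper and its (count, byte) output format are made up for this example; fax-era schemes encode run lengths differently before any entropy coding.

```c
#include <stddef.h>
#include <stdio.h>

/* Toy run-length encoder: writes (count, byte) pairs into `out`.
   Returns the number of bytes written, or 0 if `out` is too small.
   Runs longer than 255 are split so the count fits in one byte. */
size_t rle_encode(const unsigned char *in, size_t in_len,
                  unsigned char *out, size_t out_cap) {
    size_t o = 0;
    for (size_t i = 0; i < in_len; ) {
        size_t run = 1;
        while (i + run < in_len && in[i + run] == in[i] && run < 255)
            run++;
        if (o + 2 > out_cap) return 0;           /* not enough room   */
        out[o++] = (unsigned char)run;           /* run length        */
        out[o++] = in[i];                        /* the repeated byte */
        i += run;
    }
    return o;
}

int main(void) {
    const unsigned char data[] = "aaaabbbcccccccd";
    unsigned char packed[64];
    size_t n = rle_encode(data, sizeof(data) - 1, packed, sizeof(packed));
    printf("%zu input bytes -> %zu encoded bytes\n", sizeof(data) - 1, n);
    return 0;
}
```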
Yes, they are used in production.
As others have mentioned, true Huffman requires you to analyze the entire corpus first to get the most efficient encoding, so it isn't typically used by itself.
Probably shortly after you were born, I implemented Huffman compression on the Psion Series 3 handheld computer in C in order to compress data which was preloaded onto data packs and only decompressed on the handheld. In those days, space was tight and there was no built-in compression library.
As with most well-specified functionality, I would strongly consider using any feature built into iOS or the standard packages available in your development environment.
This will save a lot of debugging and allow you to concentrate on the most significant portions of your app which add value.
Large amounts of text will be amenable to zip-style compression, and it is unlikely that spending effort improving its performance (in either space or time) will pay off in the long run.
There's a built-in iOS mechanism supporting the zlib algorithm (zlib.h, callable from Objective-C).
You could implement your own compression functionality, use the built-in zlib functions, and compare the performance.
I think the built-in zlib functionality will be faster and will give a higher compression ratio.
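For example, the one-shot zlib helpers are enough to squeeze a text blob before writing it to disk. Here is a minimal C sketch, assuming you link against the libz that ships with the SDK; the sample text, buffer handling, and error handling are placeholders only.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

int main(void) {
    const char *text = "Some large-ish block of text to store on the device...";
    uLong src_len = (uLong)strlen(text) + 1;        /* include the NUL */

    /* compressBound() gives a worst-case size for the compressed output. */
    uLongf dst_len = compressBound(src_len);
    Bytef *compressed = malloc(dst_len);
    if (compress(compressed, &dst_len, (const Bytef *)text, src_len) != Z_OK) {
        fprintf(stderr, "compress failed\n");
        return 1;
    }
    printf("%lu bytes -> %lu bytes\n",
           (unsigned long)src_len, (unsigned long)dst_len);

    /* Round-trip: the original size must be stored alongside the data. */
    uLongf out_len = src_len;
    Bytef *restored = malloc(out_len);
    if (uncompress(restored, &out_len, compressed, dst_len) != Z_OK) {
        fprintf(stderr, "uncompress failed\n");
        return 1;
    }
    printf("restored: %s\n", (const char *)restored);

    free(compressed);
    free(restored);
    return 0;
}
```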
Huffman codes are the backbone of many "real world" production algorithms. Common compression algorithms today improve upon Huffman codes by transforming their data to improve compression ratios. Of course, there are many application-specific techniques used to do this.
As for whether or not you should use Huffman codes, my question is: why should you, when you can achieve better compression with less code by using an already-implemented third-party library?
Yes, I'm using Huffman compression in my web app to store a complete snapshot of my engine in a hidden input field. At first it was just curiosity, but it offloads my SESSION memory to the client browser's memory, and I also use it to store the snapshot in a file for backup and to exchange it with my colleagues. You should see their faces when you can just load a file in an admin panel to restore the engine in the web app! It's basically a serialized, compressed, and base64-encoded array. It saves me about 15% of bandwidth, but I think I can do better now.
Yes, you're using Huffman coding (decoding) to read this page, since web pages are typically compressed to the gzip format. So it is used pretty much every nanosecond by web browsers all over the planet.
Huffman coding is almost never used by itself, but rather always with some higher-order modeling of the data to give the Huffman algorithm what it needs to work with. So LZ77 models text and other byte-oriented data as repeating strings coded as literals and length/distance pairs, which is then fed to Huffman coding, e.g. in the deflate compressed format using zlib. Or with difference or other prediction coding of pixels for PNG, followed by Huffman coding.
As for what you should use, look at lz4, zlib, and zstd.
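If you try zstd, its one-shot API is similarly small. A rough C sketch, assuming zstd.h is available and you link against libzstd; the sample text and compression level are placeholders. One design point it illustrates: a zstd frame records the original content size, so the decompressor can query it instead of needing a side channel.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zstd.h>

int main(void) {
    const char *text = "Another block of text to squeeze down before storing it.";
    size_t src_len = strlen(text) + 1;

    /* Worst-case output size, then one-shot compression at level 3. */
    size_t cap = ZSTD_compressBound(src_len);
    void *packed = malloc(cap);
    size_t packed_len = ZSTD_compress(packed, cap, text, src_len, 3);
    if (ZSTD_isError(packed_len)) {
        fprintf(stderr, "compress: %s\n", ZSTD_getErrorName(packed_len));
        return 1;
    }

    /* The frame header stores the original size (real code should also
       handle ZSTD_CONTENTSIZE_UNKNOWN / ZSTD_CONTENTSIZE_ERROR). */
    unsigned long long orig = ZSTD_getFrameContentSize(packed, packed_len);
    char *restored = malloc((size_t)orig);
    size_t got = ZSTD_decompress(restored, (size_t)orig, packed, packed_len);
    if (ZSTD_isError(got)) {
        fprintf(stderr, "decompress: %s\n", ZSTD_getErrorName(got));
        return 1;
    }
    printf("%zu -> %zu bytes, restored: %s\n", src_len, packed_len, restored);

    free(packed);
    free(restored);
    return 0;
}
```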
Related
I use Tensorflow for deep learning work, but I was interested in some of the features of Julia for ML. In Tensorflow there is a clear standard: protocol buffers, meaning the TFRecords format, are the best way to load sizable datasets onto the GPUs for model training. I have been reading the Flux and Knet documentation, as well as other forum posts, looking for any particular recommendation on the most efficient data format, but I have not found one.
My question is, is there a recommended data format for the Julia ML libraries to facilitate training? In other words, are there any clear dataset formats that I should avoid because of bad performance?
Now, I know that there is a Protobuf.jl library, so users can still use protocol buffers. I was planning to use protocol buffers for now, since I can then use the same data format for Tensorflow and Julia. However, I also found this interesting Reddit post about a user who is not using protocol buffers and is just using plain Julia vectors.
https://www.reddit.com/r/MachineLearning/comments/994dl7/d_hows_julia_language_mit_for_ml/
I get that the Julia ML libraries are likely agnostic to the data storage format, meaning that no matter what format the data is stored in, it gets decoded into some sort of vector or matrix format anyway, so in that case I can use whatever format I like. But I just wanted to make sure I did not miss anything in the documentation about problems or low performance due to using the wrong data storage format.
For in-memory use just use arrays and vectors. They're just big contiguous lumps of memory with some metadata. You can't really get any better than that.
For serializing to another Julia process, Julia will handle that for you using the stdlib Serialization module.
For serializing to disk, you should either just use Serialization.serialize (possibly compressed) or, if you think you might need to read the data from another program or that you'll change Julia versions before you're done with it, use BSON.jl or Feather.jl.
In the near future, JLSO.jl will be a good option for replacing Serialization.
I'm thinking of using Huffman coding to make an app that takes pictures right from the iPhone camera and compresses them. Would it be possible for the hardware to handle the complex computation and the building of the tree? In other words, is it doable?
Thank you
If you mean the image files (like JPG, PNG, etc.), then you should know that they are already compressed with algorithms specific to images. The resulting files would not Huffman-compress much, if at all.
If you mean that you are going to take the UIImage raw pixel data and compress it, you could do that. I am sure that the iPhone could handle it.
If this is for a fun project, then go for it. If you want this to be a useful and widely used app, you will have some challenges:
It is very unlikely that Huffman will be better than the standard image compression used in JPG, PNG, etc.
Apple has already seen a need for better compression and implemented HEIF in iOS 11 (see the WWDC video about HEIF).
They did a lot of work in the OS and the Photos app to make sure HEIF is used locally, but if you share the photo it is turned into something anyone could use (e.g. JPG).
All of the compression they implement uses hardware acceleration. You could do this too, but the code is a lot harder than Huffman.
So, for learning and fun, it's a good project (it might be easier to do as a Mac app instead), but for something meant to be real, it would be extremely hard to overcome the above issues.
There are two parts, encoding and decoding. The encoding process involves constructing a tree or a table-based representation of a tree. The decoding process covers reading the Huffman-encoded bytes and undoing a delta. It would likely be difficult to get much of a speed advantage in the encoding compared to PNG, but for decoding a very effective speedup can be gained by moving the decoding logic to the GPU with Metal. You can have a look at the full source code of an example that does just that for grayscale images on GitHub: Metal Huffman.
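To give a feel for the encoding side, here is a rough CPU-side sketch in C of the tree-building step over byte frequencies. It is only an illustration of the classic algorithm, not code from the Metal Huffman project; a real encoder would go on to turn the tree into a canonical code table and pack bits.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct Node {
    unsigned long freq;
    int symbol;                 /* byte value for leaves, -1 for internal nodes */
    struct Node *left, *right;
} Node;

static Node *new_node(unsigned long freq, int symbol, Node *l, Node *r) {
    Node *n = malloc(sizeof *n);
    n->freq = freq; n->symbol = symbol; n->left = l; n->right = r;
    return n;
}

/* Build a Huffman tree from a 256-entry byte-frequency table by repeatedly
   merging the two lowest-frequency nodes (O(n^2) scan; fine for 256 symbols). */
static Node *build_tree(const unsigned long freq[256]) {
    Node *pool[256];
    int count = 0;
    for (int i = 0; i < 256; i++)
        if (freq[i]) pool[count++] = new_node(freq[i], i, NULL, NULL);
    while (count > 1) {
        int a = 0, b = 1;                       /* indices of the two smallest */
        if (pool[b]->freq < pool[a]->freq) { a = 1; b = 0; }
        for (int i = 2; i < count; i++) {
            if (pool[i]->freq < pool[a]->freq)      { b = a; a = i; }
            else if (pool[i]->freq < pool[b]->freq) { b = i; }
        }
        Node *merged = new_node(pool[a]->freq + pool[b]->freq, -1, pool[a], pool[b]);
        pool[b] = pool[--count];                /* drop b by swapping in the last node */
        if (a == count) a = b;                  /* a may have been the node we moved   */
        pool[a] = merged;
    }
    return count ? pool[0] : NULL;
}

/* Walk the tree, printing the bit string assigned to each byte value. */
static void print_codes(const Node *n, char *bits, int depth) {
    if (!n) return;
    if (n->symbol >= 0) {
        bits[depth] = '\0';
        printf("0x%02x -> %s\n", n->symbol, depth ? bits : "0");
        return;
    }
    bits[depth] = '0'; print_codes(n->left,  bits, depth + 1);
    bits[depth] = '1'; print_codes(n->right, bits, depth + 1);
}

int main(void) {
    const unsigned char sample[] = "this is an example of a huffman tree";
    unsigned long freq[256] = {0};
    for (const unsigned char *p = sample; *p; p++) freq[*p]++;
    char bits[300];
    print_codes(build_tree(freq), bits, 0);
    return 0;
}
```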
I'm an aspiring data scientist (presently just a software developer) and I've just got this (foolish?) idea.
So far (as far as I know), we've been using compression algorithms based on replacing the standard way of encoding data with a smarter one. What if we could compress data by comprehension? For example, by generating a kind of abstract from which we can recover the original data.
Just think of how our minds work, by associating ideas with one another.
Can machine learning techniques learn and understand the data (and how it is represented on disk) so that it can be regenerated from an abstract produced by the algorithm?
Sure, but then you would have to transfer a representation of the associations and "comprehension" to the other end in order to decompress. That representation will likely be much larger than the data you were trying to compress.
There are actually similar ideas that are, at least to some degree, already realized.
For instance, an autoencoder allows for the compression (encoder part) and reconstruction (decoder part) of the original data.
This technique, coupled with the idea of a thought vector, which in a sense "encodes" concept/meaning/comprehension in a single vector, would result in something like what you have described.
I was wondering if the type of photo used to train an object detector makes a difference; I can't seem to find anything about this online. I am using OpenCV and dlib, if that makes a difference, but I am interested in a more general answer if possible.
Am I correct in assuming that lossless file formats would be better than lossy formats? And that, if training for an object, JPG would be better than PNG, since PNGs are optimized for text and graphs?
As long as the compression doesn't introduce noticeable artifacts, it generally won't matter. Also, many real-world computer vision systems need to deal with video or images acquired from less-than-ideal sources, so you usually shouldn't assume you will get super-high-quality images anyway.
I'm curious as to what kind of experiences people have had with libgd. I am looking for an alternative to GDI+ (something faster). I have tried ImageMagick, but can't get the performance I need out of it.
I have heard that libgd is fast but less feature-rich, and that ImageMagick is slow but more feature-rich.
I only need very simple image processing procedures (scale, crop) on a limited number of formats, mainly JPG. However, I need very high-quality interpolation and it has to be fast (well, faster than GDI+).
I am considering trying libgd; does this seem like a good fit, given my requirements?
My own experiences with libgd are very good.
I've used it on a website where I'm taking jpeg images from disk, adding a text title to them, and rendering them back out, and it feels just about as fast as if the file was being served straight from disk.
Given that each time it's decoding the jpeg, altering it, and then re-encoding it, that's pretty good!
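For reference, that decode, alter, re-encode path is only a handful of libgd calls. Here is a rough C sketch that scales a JPEG to half size with resampled (interpolated) copying, assuming gd.h with JPEG support is available; the file names and quality setting are placeholders.

```c
#include <stdio.h>
#include <gd.h>

int main(void) {
    /* Load the source JPEG from disk. */
    FILE *in = fopen("input.jpg", "rb");
    if (!in) { perror("input.jpg"); return 1; }
    gdImagePtr src = gdImageCreateFromJpeg(in);
    fclose(in);
    if (!src) { fprintf(stderr, "not a decodable JPEG\n"); return 1; }

    /* Scale to half size using resampled (interpolated) copying. */
    int dw = gdImageSX(src) / 2, dh = gdImageSY(src) / 2;
    gdImagePtr dst = gdImageCreateTrueColor(dw, dh);
    gdImageCopyResampled(dst, src, 0, 0, 0, 0,
                         dw, dh, gdImageSX(src), gdImageSY(src));

    /* Re-encode at quality 85 and write the result back out. */
    FILE *out = fopen("output.jpg", "wb");
    if (!out) { perror("output.jpg"); return 1; }
    gdImageJpeg(dst, out, 85);
    fclose(out);

    gdImageDestroy(src);
    gdImageDestroy(dst);
    return 0;
}
```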