Encoding image data as UTF-8 string with gzip compression - ruby-on-rails

I am trying to store image data from a file into a PostgreSQL database as a base64 string that is compressed by gzip to save space. I am using the following code to encode the image:
@file = File.open("#{Rails.root.to_s}/public/" << @ad_object.image_url).read
@base64 = Base64.encode64(@file)
@compressed = ActiveSupport::Gzip.compress(@base64)
@compressed.force_encoding('UTF-8')
@ad_object.imageData = @compressed
When I try to save the object, I get the following error:
ActiveRecord::StatementInvalid (PG::Error: ERROR: invalid byte sequence for encoding "UTF8": 0x8b
In the Rails console, any gzip compression outputs the data with ASCII-8BIT (binary) encoding. I have tried setting my internal and external encodings to UTF-8, but the results have not changed. How can I get this compressed data into a UTF-8 string?

This doesn't make much sense for a number of reasons.
gzip is a binary format. There's absolutely no point base64-encoding something and then gzipping it, since the output is binary anyway and base64 is only for transmitting data over protocols that are not 8-bit clean. Just gzip the file directly.
Most image data is already compressed with a codec like PNG or JPEG that is much more efficient at compressing image data than gzip is; gzipping it will usually make the image slightly bigger. Gzip will never be as efficient for image data as the lossless PNG format, so if your image data is uncompressed, compress it as PNG instead of gzipping it.
When representing binary data there isn't really a text encoding concern, because it isn't text. It won't be valid utf-8, and trying to tell the system it is will just cause further problems.
Do away entirely with the base64 encoding and gzip steps. As mu is too short says, just use the Rails binary field and let Rails handle the encoding and sending of the binary data.
Just use bytea fields in the database and store the PNG or JPEG images directly. These are hex-encoded on the wire for transmission, which takes 2x the space of the binary, but they're stored on disk in binary form. PostgreSQL automatically compresses bytea fields on disk if they benefit from compression, but most image data won't.
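As a concrete sketch of what that looks like (the migration, column and attribute names here are illustrative, not taken from the question):

# Migration: a :binary column maps to bytea on PostgreSQL.
class AddImageDataToAdObjects < ActiveRecord::Migration
  def change
    add_column :ad_objects, :image_data, :binary
  end
end

# Store the raw file bytes; no Base64, gzip or force_encoding needed.
@ad_object.image_data = File.binread(Rails.root.join('public', @ad_object.image_url))
@ad_object.save!

ActiveRecord and the pg gem take care of escaping the binary value on the way to the database and hand you back the raw bytes when you read the attribute.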
To minimize the size of the image, choose an appropriate compression format like PNG for lossless compression or JPEG for photographs. Downsample the image as much as you can before compression, and use the strongest compression that produces acceptable quality (for lossy codecs like JPEG). Do not attempt to further compress the image with gzip/LZMA/etc.; it will achieve nothing.
You'll still have the data double in size when transmitted as hex escapes over the wire. Solving that requires either the use of the PostgreSQL binary protocol (difficult and complicated) or a binary-clean side-band to transmit the image data. If the Pg gem supports SSL compression you can use it to compress the protocol traffic, which will reduce the cost of the hex escaping considerably.
If keeping the size down to the absolute minimum is necessary, I would not use the PostgreSQL wire protocol to send the images to the device. It's designed for performance and reliability more than for absolute minimum size.

Related

How are objects stored in NSData? Will the data size (in bytes) be the same as before storing it as NSData?

To expand on my question: I have an NSString containing a URL to a video file that weighs 2 MB. I use dataWithContentsOfURL to load it into an NSData. I added a breakpoint and checked the size of the data; it was way too high (more than 12 MB). I just took the byte count and did the math on it (bytes / 1024 / 1024).
Then I save the data as a video file with UISaveVideoAtPathToSavedPhotosAlbum, and I used attributesOfItemAtPath to get the FileSize attribute of that saved file. It shows up correctly (2 MB). All I want to know is how the objects are encoded, and why there is this drastic change in size (in bytes) when converting the URL contents to a data object.
All I want to know is how the objects are encoded, and why there is this drastic change in size (in bytes) when converting the URL contents to a data object.
An NSData is not "encoded" in any way; the bytes stored in one are the raw bytes. There is some overhead in the object structure, but it is small.
Reading in a 2 MB file will take a little more than 2 MB; the overhead is small enough that you might not notice it in the memory usage. You can read much larger files than this and see no significant memory usage growth above the file size.
Your increase in memory usage is due to something else. Accidental multiple copies of the NSData? Maybe you are unintentionally uncompressing compressed video? Etc. You'll have to hunt for the cause.

Finding the size of a deflated chunk of data

Let's suppose I have a chunk of deflated data (that is, the compressed data right after the first 2 bytes and before the Adler-32 check in a zlib-compressed file).
How can I know the length of that chunk? How can I find where it ends?
You would need to get that information from some metadata external to the zlib block. (The same is true of the uncompressed size.) Or you could decompress and see where you end up.
Compressed blocks in deflate format are self-delimiting, so the decoder will terminate at the correct point (unless the datastream has been corrupted).
Most file storage and data transmission formats which include compressed data make this metadata available. But since it is not necessary for decompression, it is not stored in the compressed stream.
The only way to find where it ends is to decompress it; deflate streams are self-terminating.
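A rough sketch of that decompress-and-see approach using Ruby's zlib bindings (assuming data holds the raw deflate bytes, i.e. the zlib stream minus its 2-byte header and 4-byte Adler-32 trailer, possibly followed by other bytes):

require 'zlib'

# A negative window-bits value tells zlib to expect a raw deflate stream
# with no zlib header or trailer.
inflater = Zlib::Inflate.new(-Zlib::MAX_WBITS)
uncompressed = inflater.inflate(data)

inflater.finished?   # => true once the final deflate block has been reached
inflater.total_in    # => compressed bytes consumed, i.e. where the chunk ends
inflater.total_out   # => uncompressed size
inflater.close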

Is there any way (command-line tools) to calculate an MD5 hash for .NEF (also .CR2, .TIFF) files regardless of any metadata?

Is there any way (command-line tools) to calculate an MD5 hash for .NEF (also .CR2, .TIFF) files regardless of any metadata, e.g. EXIF, IPTC, XMP and so on?
The MD5 hash should stay the same even if we update any metadata inside the image file.
I searched for a while; the closest solution is:
exiftool test.nef -all= -o - -m | md5
but 'exiftool -all=' still keeps a set of EXIF tags in the output file, and the MD5 hash changes if I update the remaining tags.
ImageMagick has a method for doing exactly this. It is installed on most Linux distros and is available for OSX (ideally via homebrew) and also Windows. There is an escape for the image signature which includes only pixel data and not metadata - you use it like this:
identify -format %# _DSC2007.NEF
feb37d5e9cd16879ee361e7987be7cf018a70dd466d938772dd29bdbb9d16610
I know it does what you want: the calculated checksum does not change when you modify the metadata on PNG files, for example, and it calculates the checksum correctly for CR2 and NEF files. However, I am not in the habit of modifying RAW files such as yours and have not tested that it does the right thing in that case, though I would be startled if it didn't! So please test before use.
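If you want to script that test, a small Ruby wrapper around the same identify escape works (a sketch; it assumes ImageMagick's identify is on the PATH, and the file names are just placeholders):

require 'open3'

# %# is ImageMagick's escape for the signature computed over pixel data only.
def pixel_signature(path)
  out, err, status = Open3.capture3('identify', '-format', '%#', path)
  raise "identify failed: #{err}" unless status.success?
  out.strip
end

# Expected to be equal if only the metadata differs between the two files.
pixel_signature('original.nef') == pixel_signature('metadata_edited.nef')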
The reason that there is still some Exif data left is that the image data for a NEF file (and similar TIFF-based file types) is located within that Exif block. Remove that and you have removed the image data. See ExifTool FAQ 7, which has an example shortcut tag that may help you out.
I assume your intention is to verify the actual image data has not been tampered with.
An alternative to stripping the metadata is to convert the image to a format that has no metadata.
ImageMagick is a well-known open source toolkit (Apache 2 license) for image manipulation and conversion. It provides libraries with various language bindings as well as command-line tools for various operating systems.
You could try:
convert test.nef bmp:- | md5
This converts test.nef to bmp on stdout and pipes it to md5.
AFAIR bmp has no support for metadata and I'm not sure if ImageMagick even preserves metadata across conversions.
This will only work with single-image files (i.e. not multi-image TIFFs or GIF animations). There is also a slight possibility that some changes to the image could produce the same converted output because of color-space conversions, but such changes would not be visible.

Does binary encoding of AVRO compress data?

In one of our projects we are using Kafka with Avro to transfer data across applications. Data is added to an Avro object and the object is binary-encoded to write to Kafka. We use binary encoding as it is generally described as a minimal representation compared to other formats.
The data is usually a JSON string, and when it is saved in a file it uses up to 10 MB of disk. However, when the file is compressed (.zip), it uses only a few KB. We are concerned about storing such data in Kafka, so we are trying to compress it before writing to a Kafka topic.
When the length of the binary-encoded message (i.e. the length of the byte array) is measured, it is proportional to the length of the data string, so I assume binary encoding is not reducing the size at all.
Could someone tell me if binary encoding compresses data? If not, how can I apply compression?
Thanks!
Does binary encoding compress data?
Yes and no, it depends on your data.
Yes, in the sense that Avro binary encoding stores the schema only once per .avro file, regardless of how many records are in that file, and therefore saves space by not storing the JSON key names over and over. Avro serialization also does a little compression when storing int and long values, using variable-length zig-zag coding (which only helps for small values). For the rest, Avro does not "compress" data.
No, in the sense that in some extreme cases Avro-serialized data can be bigger than the raw data, e.g. an .avro file with a single record containing only one string field. The schema overhead can outweigh the savings from not storing the key name.
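To get a feel for the variable-length zig-zag part, here is a standalone Ruby sketch of the idea (an illustration only, not Avro library code):

# Zig-zag maps signed integers of small magnitude to small unsigned integers:
# 0 => 0, -1 => 1, 1 => 2, -2 => 3, ...
def zigzag64(n)
  (n << 1) ^ (n >> 63)
end

# Variable-length encoding: 7 payload bits per byte, high bit set on every
# byte except the last one.
def varint_bytes(u)
  bytes = []
  loop do
    chunk = u & 0x7f
    u >>= 7
    bytes << (u.zero? ? chunk : chunk | 0x80)
    break if u.zero?
  end
  bytes
end

varint_bytes(zigzag64(3)).length          # => 1 byte
varint_bytes(zigzag64(-3)).length         # => 1 byte
varint_bytes(zigzag64(1_000_000)).length  # => 3 bytes

So small ints and longs take only a byte or two, but this is not general-purpose compression of the payload.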
If not, how can I apply compression?
According to the Avro codecs documentation, Avro has built-in compression codecs and optional ones. Just add one line while writing object container files:
dataFileWriter.setCodec(CodecFactory.deflateCodec(6)); // using deflate
or
dataFileWriter.setCodec(CodecFactory.snappyCodec()); // using the snappy codec
To use Snappy you need to include the snappy-java library in your dependencies.
If you plan to store your data in Kafka, also consider using the Kafka producer's compression support, set through the producer configuration properties:
props.put("compression.codec", "snappy"); // "compression.type" in newer Kafka clients
The compression is completely transparent to the consumer side; all consumed messages are automatically decompressed.

File Format Conversion to TIFF. Some issues?

I have a proprietary image format, SNG, which consists of a continuous array of image data, with the image meta information in a separate HDR file.
Now I need to convert this SNG format to the standard TIFF 6.0 format, so I studied the TIFF format, i.e. its header, Image File Directories (IFDs) and strip-based image data.
Now I have a few concerns about this conversion. Please assist me.
SNG continuous data vs. TIFF strip data: should I convert the SNG data to TIFF as continuous data in one strip (a data load/edit-time problem?) or create logical StripOffsets into the SNG image data?
The SNG data header contains only the necessary meta information, so while converting SNG to TIFF some information cannot be retrieved, such as NewSubFileType, the Software tag, etc.
This raises the concern of whether such missing directory information (NewSubFileType, Software tag, etc.) is required for a valid TIFF file.
Encoding of each pixel component of the RGB sample in the SNG data:
Here each pixel component in an SNG image data strip is encoded as:
Out^[i] := round( LineBuffer^[i * 3] * 0.072169 + LineBuffer^[i * 3 + 1] * 0.715160 + LineBuffer^[i * 3 + 2] * 0.212671);
The only thing I can deduce from this is that each pixel is represented by 3 RGB components and that a coefficient is multiplied with each component so the SNG viewer can work with the RGB color information of the SNG image data. (The developer who earlier worked on this has left; now I am following the trace :))
Thus, while converting this to TIFF, the same would have to be decoded. This raises the concern of how the RGB information for the TIFF is produced, or rather, do we even need this information?
Please assist...
Are you able to load this into a standard windows bitmap handle? If so, there are probably a bunch of free and commercial libraries for saving it as TIFF.
The standard for TIFF is libtiff -- it's a C library. Here's a version for Delphi made by an expert in the TIFF format:
http://www.awaresystems.be/imaging/tiff/delphi.html
There seems to be a lot of choices.
I think the approach of
Loading your format into an in-memory standard bitmap (which you need to do to show it, right?)
Using a pre-existing TIFF encoding library to save as TIFF
Will be a lot easier than trying to do a direct format-to-format conversion. The only reasons I wouldn't do it this way are:
The bitmap is too big to keep in memory
The original format is lossy and I will lose more quality in the re-encoding -- but you'd have to be saving in a standard lossy format (JPEG) to save quality.
Disclaimer: I work for Atalasoft.
We make .NET imaging codecs (including TIFF) -- that are a lot easier to use than LibTiff -- you can call them in Delphi through COM. We can convert standard windows bitmaps to TIFF or PDF (or other formats) with a couple of lines of code.
One approach, if you have a Windows application which handles and can print this format, would be to let it do the work for you, and call it to print the file to one of the many available 'printer drivers' which support direct output to TIFF.
