Safety of xz archive format - storage

While looking for a good option to store large amounts of data (coming mostly from numerical computations) long-term, I arrived at the xz archive format (tar.xz). The default LZMA compression there produces significantly smaller archives (for my type of data) than the more common tar.gz (both with reasonable compression options).
However, the first Google search on the safety of long-term usage of xz turned up the following web page (written by one of the developers of lzip), titled
Xz format inadequate for long-term archiving
listing several reasons, including:
xz being a container format as opposed to simple compressed data preceded by a necessary header
xz format fragmentation
unreasonable extensibility
poor header design and lack of field length protection
4-byte alignment and use of padding all over the place
inability to append trailing data to an already created archive
multiple issues with xz error detection
no options for data recovery
While some of these concerns seem a bit artificial, I wonder whether there is any solid justification for not using xz as an archive format for long-term archiving.
What should I be concerned about if I choose xz as a file format?
(I guess access to an xz program itself should not be an issue even 30 years from now.)
A couple of notes:
The data stored are results of numerical computations, some of which are published in different conferences and journals. And while storing results does not necessarily imply research reproducibility, it is an important component.
While using the more standard tar.gz or even plain zip might be a more obvious choice, the ability to cut the archive size by about 30% is very appealing to me.
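For what it's worth, that size difference is easy to measure yourself. Below is a minimal Python sketch using only the standard library's tarfile module to build both a tar.gz and a tar.xz of the same data and print the sizes; the directory name "results" is a placeholder for the actual data.

```python
import os
import tarfile

# Compress the same directory with gzip and with xz (LZMA) and compare sizes.
# "results" is a placeholder for the directory of numerical results.
src = "results"

for ext, mode in (("tar.gz", "w:gz"), ("tar.xz", "w:xz")):
    out = f"archive.{ext}"
    with tarfile.open(out, mode) as tar:
        tar.add(src)
    print(f"{out}: {os.path.getsize(out)} bytes")
```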

If you read the page you linked carefully, you'll find things like this:
https://www.nongnu.org/lzip/xz_inadequate.html#misguided
"the xz format specification sets more strict requirements for the integrity of the padding than for the integrity of the payload. The specification does not guarantee that the integrity of the decompressed data will be verified, but it mandates that the decompression must be aborted as soon as a damaged padding byte is found."
What compressed format does any of the following?
Uses padding.
Protects padding with a CRC.
Aborts if the padding is corrupt.

Maybe the right question is, "is there any solid justification for using such a poorly designed format as xz for long-term archiving when properly designed formats exist?"
The IANA Time Zone Database, for example, is using gzip and lzip to distribute their tarballs, which are archived forever.
http://www.iana.org/time-zones

Related

Advantage for hex formats like SREC or Intel HEX

Can someone explain the benefit of using hex formats (e.g. Motorola S-Record or Intel HEX) over direct binary images, for example for firmware or memory dumps?
I understand that it is useful to have some meta information about the binary file, like the memory areas used, checksums for data integrity, and so on…
However, the fact that the actual data size is doubled, because everything is saved in a hex-ASCII representation, confuses me.
Is the reason for using a hex-ASCII representation only portability, to prevent problems with systems that have a different byte endianness, or are there other benefits?
I found many tutorials on how to convert binary to hex and back, and the specifications of the various formats, but no information about their advantages and disadvantages.
There are several advantages of hex-ASCII formats over a binary image. Here are a couple to start with.
Transport. An ASCII file can be transferred through most media. Binary files may not be transferred intact across some media.
Validity checks. The hex file has a checksum on each line (a small checksum-verification sketch follows this list). The binary may not have any checks.
Size. Data is written to selected memory addresses in the hex file. The binary image has limited addressing information and is likely to contain every memory location from the start address to the end of the file.
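To make the per-line validity check concrete, here is a small Python sketch that verifies the checksum of one Intel HEX record. The record shown is a standard well-formed example; the function itself is just an illustration, not part of any particular toolchain.

```python
def verify_ihex_record(record: str) -> bool:
    """Check the per-line checksum of an Intel HEX record.

    The last byte of a record is the two's-complement checksum of all the
    preceding bytes, so the sum of every byte in the record, including the
    checksum itself, must be 0 modulo 256.
    """
    if not record.startswith(":"):
        return False
    try:
        data = bytes.fromhex(record[1:].strip())
    except ValueError:
        return False
    return len(data) > 0 and sum(data) % 256 == 0

# A well-formed sample record: 16 data bytes written to address 0x0100.
print(verify_ihex_record(":10010000214601360121470136007EFE09D2190140"))  # True
# Corrupt a single character and the per-line check catches it.
print(verify_ihex_record(":10010000214601360121470136007EFE09D2190141"))  # False
```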

Where to store decrypted files?

I am encrypting downloaded files and saving them locally in app's documents directory.
To read them, you must decrypt those files and store them somewhere temporarily.
My concerns are:
1. If I store them in the documents directory while they are being used, then for that time window anyone can get those files using tools like iExplorer.
2. My idea is to store them in memory while they are being used and flush the vault after use.
This option is good for small files, but for large files, say 50 MB, or a video of 100 MB, I am afraid the app will receive a memory warning and as a result terminate abruptly.
I want to know the best approach for doing this.
There is no perfectly secure way to store local files. If a person has full access to the device, they can always find a way to decrypt the files, as long as your application is able to decrypt them.
The only question is: How much effort is necessary to decrypt the files?
If your only concern is that a person may use iExplorer to copy and open these files, a simple local symmetric encryption will do the trick.
Just embed a random symmetric key in your application and encrypt the data block by block while you download it.
You can use the convenient "Security Transforms" framework to do the symmetric encryption. There are some good examples in the Apple documentation.
When you load the files, you can use the same key to decrypt them while you load them from the file system.
Just to make things clear: this is not perfect protection for the files. But to decrypt them, an attacker has to get hold of your app binary, analyse it in a debugger, and search for the decryption code to extract your symmetric key. That is a lot of effort just to decrypt the files.
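The original answer is about Apple's frameworks, but the "encrypt block by block while you download" idea is language-independent. Below is a rough Python sketch using the third-party cryptography package with AES in CTR mode (my choice of primitive, not something prescribed by the answer), since a streaming mode lets you encrypt and later decrypt in chunks without ever holding a 100 MB file in memory.

```python
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

# Key and nonce are generated here purely for illustration; in the scenario
# above the key would be embedded in the app and the nonce stored with the file.
key = os.urandom(32)    # 256-bit AES key
nonce = os.urandom(16)  # CTR nonce / initial counter block

def encrypt_chunks_to_file(chunks, out_path):
    """Encrypt an iterable of downloaded chunks block by block with AES-CTR."""
    enc = Cipher(algorithms.AES(key), modes.CTR(nonce)).encryptor()
    with open(out_path, "wb") as f:
        for chunk in chunks:
            f.write(enc.update(chunk))
        f.write(enc.finalize())  # empty for CTR, kept for completeness

def decrypt_file_in_blocks(path, block_size=64 * 1024):
    """Yield decrypted data in fixed-size blocks, so large files never sit fully in memory."""
    dec = Cipher(algorithms.AES(key), modes.CTR(nonce)).decryptor()
    with open(path, "rb") as f:
        while block := f.read(block_size):
            yield dec.update(block)

# Usage sketch: encrypt_chunks_to_file(response.iter_content(8192), "video.enc")
#               for plaintext_block in decrypt_file_in_blocks("video.enc"): ...
```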
Split your files into smaller sizes before saving them, then decrypt on load.
Later edit: I noticed this is mentioned in the comments. I agree splitting files isn't the easiest thing in the world, but presumably you'll only need this for video. About 100 MB is a lot of text or audio. If your PDF weighs that much, it's probably scanned text, and you can turn it into a series of images.
And yes, splitting is better done server-side; you don't want the user wasting battery on video processing.
Decrypt them, obfuscate them with a toy algorithm (e.g. XOR with a constant block), and store them in documents. When needed, load and de-obfuscate them.
Since the problem has no solution in theory (a determined enough attacker can read your process memory after all), it's as good a solution as any.
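For completeness, a tiny Python illustration of the XOR-with-a-constant-block idea; as the answer says, this is obfuscation rather than real encryption.

```python
def xor_obfuscate(data: bytes, key: bytes) -> bytes:
    """Toy obfuscation: XOR every byte with a repeating constant key block.
    Applying the same function again restores the original data."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))

KEY = b"constant-block"  # illustrative constant, not a secret
blob = xor_obfuscate(b"decrypted file contents", KEY)
assert xor_obfuscate(blob, KEY) == b"decrypted file contents"
```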

Matlab Parse Binary File

I am looking to speed up the reading of a data file which has been converted from binary (I understand that "binary" can mean a lot of different things; I do not know what type of binary file I have, just that it's a binary file) to plain text. I looked into reading files quickly a while ago and was informed that reading/parsing a binary file is faster than text. So I would like to parse/read the binary file (the one that was converted to plain text) in an effort to speed up the program.
I'm using Matlab for this project (I have a Matlab "program" that needs the data in the file). I guess I need some information on the different "types" of binary, but I really want information on how to read/parse said binary file (I know what I'm looking for in plain text, so I imagine I'll need to convert that to binary, search the file, then pull the result out into plain text). The file is a log file, if that helps in any way.
Thanks.
There are several issues in what you are asking; however, the key point is that you need to know the format of the file you are reading. If you can say "at position xx, I can expect to find data yy", that's what you need to know. In your question/comments you talk about searching for strings. You can also do that (much like with a text file): "when I find xxxx in the file, give me the following data up to the nth character, or up to the next yyyy".
You want to look at the documentation for fread. In the documentation there are snippets of code that will get you started, but as I (and others) said, you need to know the format of your binary files. You can use a hex editor to ascertain some information if you are desperate, but the documentation for the program that outputs these files should be quicker.
Regarding different "binary files": there is least significant byte first or last (endianness). You really don't need to know about that for this work. There are also other platform-dependent issues which I am almost certain you don't need to know about (unless you are moving the binary files between Mac, PC, and Unix machines). If you read almost to the bottom of the fread documentation, there is a section entitled "Reading Files Created on Other Systems" which talks about these issues and how to deal with them.
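The answer is MATLAB-specific (fread), but the underlying "known layout at known offsets" idea may be easier to see in a short sketch. The Python version below uses the struct module and an entirely hypothetical record layout, since the real format of the log file is unknown; with fread you would express the same layout as precision strings.

```python
import struct

# Hypothetical layout: each log record is a 4-byte little-endian unsigned
# integer timestamp followed by two 8-byte little-endian doubles.
RECORD = struct.Struct("<Idd")

def read_records(path):
    """Read fixed-size records sequentially from a binary file."""
    with open(path, "rb") as f:
        while chunk := f.read(RECORD.size):
            if len(chunk) < RECORD.size:
                break  # truncated file, or the assumed layout is wrong
            yield RECORD.unpack(chunk)

# Usage sketch: for timestamp, x, y in read_records("data.log"): ...
```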
Another comment I have to make: you say that "reading/parsing a binary file is faster than text". This is not true (or even if it is, odds are you won't notice the performance gain). In terms of development time, however, reading/parsing a text file will save you huge amounts of time.
The simple way to store data in a binary file is to use the 'save' command.
If you load from a saved variable it should be significantly faster than if you load from a text file.

Feature extraction from Image metadata

I am working on a security problem where I am trying to identify malicious images. I have to mine for attributes from the images (most likely from the metadata) that can be fed into Weka to run various machine learning algorithms, in order to detect malicious images.
Since the image metadata can be corrupted in various different ways, I am finding it difficult to identify the features to look at in the image metadata, which I can quantify for the learning algorithms.
I had earlier used information like pixel data, etc., via tools like ImageJ to help me classify images; however, I am looking for a better way (with regard to security) to identify and quantify features from the image/image metadata.
Any suggestion on the tools and the features?
As mentioned before, this is not a learning problem.
The problem is that one exploit is not *similar* to another exploit. They exploit individual, separate bugs in individual, different (!) libraries, things such as missing bounds checking. It's not so much a property of the file as of the library that uses it. Nine out of ten libraries will not care; one will misbehave because of a programming error.
The best you can do to detect such files is to write the most pedantic and at the same time most robust format verifier you can come up with, and reject any image that doesn't 1000% fit the specifications, assuming that the libraries do not have errors in processing images that are actually valid.
I would strongly recommend that you start by investigating how the exploits actually work. Understanding what you are trying to "learn" may guide you to some way of detecting them in general (or to understanding why no general detection is possible...).
Here is a simple example of the ideas of how one or two of these exploits might work:
Assume we have a very simple file format, like BMP. For compression, it supports a simple run-length encoding, so that identical pixels can be stored efficiently as (count x color) pairs. This does not work well with photos, but is quite compact for line art. Consider the following image data:
Width: 4
Height: 4
Colors: 1 = red, 2 = blue, 3 = green, 4 = black
Pixel data: 2x1 (red), 4x2 (blue), 2x3, 5x1, 1x0, 4x1
How many errors in the file do you spot? They may cause some trusting library code to fail, but any modern library (written with knowledge of this kind of attack and of the fact that files may be corrupted by transmission and hard disk errors) should just skip them and maybe even produce a partial image. See, maybe it was not an attack, but just a programming error in the program that produced the image...
Heck, not even every out-of-bounds value has to be an attack. Think of CDs. Everybody used "overburning" at some point to put more data on a CD than the specifications intended. Yes, some drive might crash because you overburned a CD. But I wouldn't consider every CD with more than 650 MB an attack just because it broke the Yellow Book specification of what a CD is.
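As a concrete version of the "most pedantic format verifier" suggested above, here is a small Python sketch that checks the toy RLE image from the example; the function and its error messages are my own illustration, not any real image library.

```python
def validate_rle_image(width, height, palette, runs):
    """Pedantically check a toy run-length-encoded image.

    `runs` is a list of (count, color) pairs. Anything that does not match
    the declared dimensions and palette exactly is rejected rather than
    silently "fixed" by trusting the file.
    """
    errors = []
    total = 0
    for count, color in runs:
        if count <= 0:
            errors.append(f"non-positive run length {count}")
        if color not in palette:
            errors.append(f"color {color} is not in the palette")
        total += count
    if total != width * height:
        errors.append(f"runs cover {total} pixels, but the image declares {width * height}")
    return errors

# The sample image data from the answer above:
print(validate_rle_image(4, 4, {1, 2, 3, 4},
                         [(2, 1), (4, 2), (2, 3), (5, 1), (1, 0), (4, 1)]))
# -> ['color 0 is not in the palette', 'runs cover 18 pixels, but the image declares 16']
```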

What are the differences or advantages of using a binary file vs XML with TClientDataSet?

Is there any difference or advantage to using a binary file or an XML file with TClientDataSet?
Binary will be smaller and faster.
XML will be more portable and human readable.
The Binary file will be a little smaller.
The main advantage of the XML format is that you can pass it around via http(s) protocols.
Binary is smaller and faster, but only readable by TClientDataSets.
XML is larger and slower (both are not that bad, i.e. not by orders of magnitude bigger or slower).
XML is readable by people (not recommended in general, but it is doable), and software.
Therefore it is more portable (as Nick wrote).
TClientDataSets can load and save their own style of XML, or you can use the Delphi XML Mapper tool to read and write any kind of XML.
XSLT can for instance be used to transform those XML files into any kind of text, including other XML, HTML, CSV, fixed columns, etc.
In contrast to what Tim indicates, both binary and XML can be transferred over HTTP and HTTPS. However, XML is often preferred because it is easier to trace.
Without having tested it, I guess the binary format would be quite a lot faster to read and write. You'd better do your own benchmarks for that, though.
Another advantage of binary might be that it cannot easily be edited, which prevents people from mucking up the data outside the application.
When using Delphi 2009, we have noticed that if the file has an extension of .XML, it will not save in binary format over an existing dfXMLUTF8 format, even with LoadFromFile/SaveToFile. Changing the file extension to something else (.DAT, for example) allows saving the file as dfBinary. Our experience is that the binary file, in addition to being somewhat more difficult for the end-user to manipulate (a plus!), is approximately 50% smaller than the dfXMLUTF8 format file.
