Unconcatenating files - parsing

I have a corrupted 7-zip archive that I am extracting manually using the method outlined by Igor Pavlov at this link. An intermediate result is a large file that is a bunch of files cat'ed together that must be separated manually. I understand that some file formats will need to be extracted manually by a human using discretion (text files, etc.) but many file formats encode the size of the file as part of the file itself (e.g. .zip). Furthermore, some files can be parsed and their size can be deduced with just a little information about the file format (e.g. .pdf). Let's say the large file consists of the following files concatenated together:
Key: <filename>(<contents>)
badfile(aaaaaaaaaaabbbbbbbbbcccccccdddddddd) -> zip1.zip(aaaaaaaaaaa)
badfile2(bbbbbbbbbcccccccdddddddd)
I am looking for a program that I can run on a large file (call it badfile) that can determine the type and size of the first logical file (let's say it's a .zip file) contained within and create a new file to hold the contents (e.g. zip1.zip since filenames are lost) and chop the file off the front of badfile. This would allow me to run the program in a loop to extract files with known types and/or pause and let the user handle the difficult cases. Does such a program exist? I know that the *nix command file(1) will do a lot of the work here, but there would be a lot of effort in encoding rules for sizing files (e.g. .pdf) that I would prefer to not duplicate.

I believe this question should be closed as off topic because it asks for an existing program to solve the problem, but the open bounty prevents a close vote. That said:
Does such a program exist?
Yes, such programs exist; they are called data carving tools.
Some common ones include scalpel, foremost, and PhotoRec.
A list of other tools is available here.
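For the .zip case mentioned in the question, the size really can be recovered from the format itself. The following is only a minimal sketch, not how scalpel, foremost, or PhotoRec work internally: it assumes the archive starts at offset 0, is not ZIP64 or multi-disk, and fits in memory. The names `first_zip_length`, `badfile`, `zip1`, and `badfile2` are made up for illustration. It looks for the end-of-central-directory record and accepts it only if the central directory ends exactly where that record begins.

```python
import io
import struct
import zipfile
from typing import Optional

LOCAL_SIG = b"PK\x03\x04"  # local file header signature (start of a non-empty ZIP)
EOCD_SIG = b"PK\x05\x06"   # end-of-central-directory (EOCD) signature
EOCD_MIN = 22              # fixed EOCD size, excluding the trailing comment


def first_zip_length(blob: bytes) -> Optional[int]:
    """Return the byte length of a ZIP archive starting at blob[0], or None."""
    if not blob.startswith(LOCAL_SIG):
        return None
    pos = blob.find(EOCD_SIG)
    while pos != -1:
        if pos + EOCD_MIN <= len(blob):
            # Bytes 12..21 of the EOCD: central directory size, offset, comment length.
            cd_size, cd_offset, comment_len = struct.unpack_from("<IIH", blob, pos + 12)
            # Consistency check: the central directory must end where the EOCD starts.
            if cd_offset + cd_size == pos:
                return pos + EOCD_MIN + comment_len
        pos = blob.find(EOCD_SIG, pos + 1)
    return None


# Demo: build a small archive in memory and append unrelated bytes after it.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("a.txt", "hello")
archive = buf.getvalue()
badfile = archive + b"bbbbbbbbbcccccccdddddddd"

n = first_zip_length(badfile)
zip1, badfile2 = badfile[:n], badfile[n:]  # carved archive and the remainder
```

For a really large file you would probably want `mmap` instead of reading everything into memory, and a carving tool remains the better choice for formats whose length is harder to derive.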

Related

Is there a way to find file type?

I would like to get the type of a file without using the extension. Is there a way to use the metadata of a file to tell whether or not it is a video file?
I have tried using extensions, but searching each file extension and comparing it with a list of extensions is quite time consuming.
Yes, it is possible to determine the file type without using the file extension. You can do this by reading the file header, sometimes also referred to as the file signature, which occupies the first few bytes of the file.
How many bytes does the file header/signature occupy? This varies from file type to file type, so you should look up the header/signature details for the specific file type you want to identify.
You can find a list of some of the more popular signatures at List of file signatures - Wikipedia.
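As a rough illustration of the idea, here is a minimal sketch that checks the first 12 bytes against three common video container signatures. The function names and the returned labels are made up for this example, the table is far from complete, and a match only identifies the container, not whether the file is actually playable.

```python
from typing import Optional


def sniff_video_container(head: bytes) -> Optional[str]:
    """Guess a video container from the first 12 bytes of a file."""
    if head[4:8] == b"ftyp":                    # ISO base media: MP4, MOV, 3GP...
        return "mp4"
    if head.startswith(b"\x1a\x45\xdf\xa3"):    # EBML header: Matroska, WebM
        return "matroska"
    if head.startswith(b"RIFF") and head[8:12] == b"AVI ":
        return "avi"
    return None


def sniff_file(path: str) -> Optional[str]:
    """Read only the header of the file at `path` and sniff it."""
    with open(path, "rb") as f:
        return sniff_video_container(f.read(12))
```

Because only a handful of bytes are read per file, this stays cheap even over a large directory tree.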
PS: Most programs stopped relying only on file extensions for determining the file type way back when the first Windows came out. The main reason was that file extensions were originally limited to three characters (a limit of old file systems such as FAT12 or FAT16), so the world quickly ran out of possible extensions, and multiple programs began to use the same extension for completely different file types. Storing a header/signature at the beginning of the file removes the dependence on this file system limitation.

Convert a filetype to its original state

How can I change a file type?
A year ago I wrote a few articles that should be viewable in any text program. However, when I recently opened them, they displayed as symbols and alphanumeric characters. On Linux, the 'file' now appears as an archive folder type that contains .xml files. On Windows, its type is shown as 'file'; it has no extension.
Is there any way to recover the original readable alphanumeric information in these files?
My preference would be to salvage the original information rather than redo it.
First off, the extension doesn't actually mean anything for the contents of the file; its only purpose is as a hint to the OS for deciding which application should open the file. You can prove this by renaming something like an exe to have a txt extension: it will then open in Notepad as a lot of seemingly random characters, and renaming it back to exe will allow it to run again.
Based on your description, the files you mention are some form of binary file. The bad news is that you need to know either which application created the file, in order to open it, or what the original file extension was (which would be a hint to the former).
If you don't know either of those pieces of information, you can of course use trial and error: guess what the extension might be, rename the file, open it with the associated application, and see whether it worked.
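Since the question mentions that Linux shows the file as an archive containing .xml files, one guess (and it is only a guess) is a ZIP-based document format such as OpenDocument or Office Open XML. A minimal sketch to narrow down the trial and error, with `guess_zip_document` being a made-up helper name:

```python
import zipfile
from typing import Optional


def guess_zip_document(path_or_file) -> Optional[str]:
    """Return a hint about a ZIP-based document, or None if it is not a ZIP."""
    if not zipfile.is_zipfile(path_or_file):
        return None
    with zipfile.ZipFile(path_or_file) as z:
        names = set(z.namelist())
        if "word/document.xml" in names:
            return "docx"                        # Office Open XML text document
        if "mimetype" in names:
            # OpenDocument (and EPUB) store their MIME type in this member.
            return z.read("mimetype").decode("ascii", "replace").strip()
        return "zip"                             # some other ZIP container
```

If this reports something like `application/vnd.oasis.opendocument.text`, renaming the file to `.odt` and opening it in a word processor would be the first thing to try.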

iOS CoreData: "Data Model Version Compiler" error

I created a data model file "ChatModel.xcdatamodeld" in my project. Then I merged branches on GitHub. There were conflicts in "project.pbxproj", which I fixed. Then this error happened:
"/Users/mac/zhongqing-ios/Zhongqing/Zhongqing/Model/ChatModel.xcdatamodeld: Could not create bundle folder for versioned model at '/Users/mac/Library/Developer/Xcode/DerivedData/Zhongqing-chngcirectbawjenegkxtgdfgoux/Build/Products/Debug-iphonesimulator/Zhongqing.app/ChatModel.momd'".
"/Users/mac/zhongqing-ios/Zhongqing/ChatModel.xcdatamodeld: Unable to write VersionInfo.plist for versioned model at '/Users/mac/Library/Developer/Xcode/DerivedData/Zhongqing-chngcirectbawjenegkxtgdfgoux/Build/Products/Debug-iphonesimulator/Zhongqing.app/ChatModel.momd'".
Each time, I have to delete the Derived Data before the project will run.
And then the error happens again.
Although some files are readable, they should be treated like binary files. .pbxproj files are a good example.
From Pro Git:
Some files look like text files but for all intents and purposes are to be treated as binary data. For instance, Xcode projects on the Mac contain a file that ends in .pbxproj, which is basically a JSON (plain text javascript data format) dataset written out to disk by the IDE that records your build settings and so on. Although it’s technically a text file, because it’s all ASCII, you don’t want to treat it as such because it’s really a lightweight database — you can’t merge the contents if two people changed it, and diffs generally aren’t helpful. The file is meant to be consumed by a machine. In essence, you want to treat it like a binary file.

The best way to verify a zip file function in Elixir

I have a path library which has a zip function in it, and while writing the unit tests I tried to find the best way to verify that the zip function works correctly.
Can someone please show me the best way to do that?
The few ways I can think of are:
Comparing the md5 of the resultant zip file against a sample zip file
Listing out the contents of the zip file and ensuring the contents are correct
However, both ways seem a little long-winded and, I am guessing, not exactly idiomatic Elixir. Can someone please show me a better way?
I would create a directory of test files to zip up in your unit test, zip it up using a trusted utility and get the resulting md5. Then for your unit test, perform the zip function, take the md5, and compare with your verified md5.
If I am correct, there are some parameters that can affect the result of zip processing (like block size). So, unless your purpose is to check exhaustively that your zip process reproduces all or a subset of the zip features, I find it more interesting to validate that various files from your system (and/or some random files) can be compressed with your function, uncompressed outside it using different unzip tools, and finally compared to confirm that they are identical to the originals. (Of course it takes more time, but it can easily be automated.)
Concerning the use of MD5 rather than file comparison: on the one hand, file comparison is easy to write, and on the other hand, you need to read the whole file to calculate the MD5 anyway. So I don't think this trick is worth using to speed up the comparison (especially if the files are different :o).
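The round-trip idea is independent of the language, so here is a minimal sketch of it in Python rather than Elixir. `make_zip` is a hypothetical stand-in for the function under test, and Python's own `zipfile` reader plays the role of the independent unzip tool; the same shape should carry over to an ExUnit test.

```python
import io
import zipfile


def make_zip(files: dict) -> bytes:
    """Hypothetical stand-in for the zip function under test."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
        for name, data in files.items():
            z.writestr(name, data)
    return buf.getvalue()


def unzip_all(archive: bytes) -> dict:
    """Independent reader: map each member name to its decompressed bytes."""
    with zipfile.ZipFile(io.BytesIO(archive)) as z:
        return {name: z.read(name) for name in z.namelist()}


# Round trip: what comes out must equal what went in, byte for byte.
files = {"a.txt": b"hello", "dir/b.bin": bytes(range(256))}
restored = unzip_all(make_zip(files))
assert restored == files
```

Comparing the extracted contents instead of a checksum of the archive keeps the test stable when compression settings or embedded timestamps change the archive bytes.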

join two different files into one file inside a zip (or elsewhere compressed) file

I have two files inside a zip file.
Now, imagine these two files are big... REALLY big... so big that I can't uncompress them into my old, poor, tiny hard disk.
However, they are simple txt files, so the zipped version is quite small.
I need to JOIN the two files into ONE single file.
As they're too big to extract, I need to do this INSIDE the zip.
Is there a way to do this?
Example:
"compressed.zip" contains "part_1.txt" and "part_2.txt".
I want "compressed.zip" to contain one file, called "part_1_and_2.txt".
(If it's not possible with zip, I can pick another compressor... but the idea is the same: each uncompressed file is bigger than total capacity of my hard disk)
Tnx!
It seems like you just need to ensure that the storage requirements are low; I don't think the operation needs to occur "within the zip file" per se. You can do this with command-line tools (in Linux or with similar tools via Cygwin) in the following way:
Start with a tarred, gzipped file with your input files in it. Let that be compressed.tar.gz. Then you can extract the contents of the gzipped tar archive to standard output and pipe it back to gzip:
tar xzf compressed.tar.gz -O | gzip > part_1_and_2.txt.gz
The resultant compressed file is the text of part_1.txt and part_2.txt concatenated (though I suppose it is not the same as having a tar archive that contains one file, but perhaps this will be sufficient).
If you need to do this within a program, I would guess that libtar and zlib can perform this functionality programmatically, or you can run a script from your program.
You can use libzip (which in turn uses zlib) to read uncompressed data from the input files in turn and write a new output zip file. You would not need to store all of an uncompressed file on the mass storage or in memory. You can read and write a small chunk at a time, as you would without compression. I presume that you have room on your mass storage for all three of those zip files.
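The same chunked approach can be sketched with Python's standard `zipfile` module instead of libzip. This is an illustration under assumptions, not a drop-in tool: `join_members` and its parameters are names invented here, and there must be room on disk for the new archive next to the old one. Each member is decompressed and recompressed one chunk at a time, so neither uncompressed file is ever stored whole.

```python
import shutil
import zipfile


def join_members(src, dst, names, joined_name, chunk_size=1 << 20):
    """Stream the members `names` of zip `src` into one member of a new zip `dst`."""
    with zipfile.ZipFile(src) as zin, \
         zipfile.ZipFile(dst, "w", zipfile.ZIP_DEFLATED) as zout:
        # force_zip64 allows the joined member to grow beyond 4 GiB.
        with zout.open(joined_name, "w", force_zip64=True) as out:
            for name in names:
                with zin.open(name) as part:
                    # Copy decompressed data in fixed-size chunks.
                    shutil.copyfileobj(part, out, chunk_size)
```

Usage for the example in the question would be along the lines of `join_members("compressed.zip", "joined.zip", ["part_1.txt", "part_2.txt"], "part_1_and_2.txt")`, after which the original archive can be deleted.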

Resources