Delete Files from .tar Archive while Extracting - tar

I am wondering whether I can incrementally delete the contents of an archive (at a minimum .tar, though ideally also .tar.xz or similar) as I extract them. The ultimate goal is to avoid needing twice the space required for the files while downloading and extracting.
I have looked around a bit, and it seems tar has a --remove-files option, but it only seems to work when creating an archive. The --delete option seems like it could be used alongside extraction as well, but apparently only on plain .tar archives. Would it be possible in this case to replace the .tar.xz with an uncompressed .tar in place, obtain the list of contents somehow, and then iteratively extract entries and remove them with --delete? Or is there a better way of doing this?
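To make the idea concrete, here is a rough sketch of what I have in mind, assuming GNU tar and a hypothetical archive named archive.tar.xz:

# Decompress in place first; xz -d replaces archive.tar.xz with archive.tar
xz -d archive.tar.xz
# Capture the member list up front, then extract and delete one entry at a time
tar -tf archive.tar > members.txt
while IFS= read -r member; do
    tar -xf archive.tar "$member"
    tar --delete -f archive.tar "$member"
done < members.txt

I am not sure whether the repeated --delete passes (each of which rewrites the remainder of the archive) would make this impractically slow for archives with many members.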

Related

Unconcatenating files

I have a corrupted 7-zip archive that I am extracting manually using the method outlined by Igor Pavlov at this link. An intermediate result is a large file that is a bunch of files cat'ed together that must be separated manually. I understand that some file formats will need to be extracted manually by a human using discretion (text files, etc.) but many file formats encode the size of the file as part of the file itself (e.g. .zip). Furthermore, some files can be parsed and their size can be deduced with just a little information about the file format (e.g. .pdf). Let's say the large file consists of the following files concatenated together:
Key: <filename>(<contents>)
badfile(aaaaaaaaaaabbbbbbbbbcccccccdddddddd) -> zip1.zip(aaaaaaaaaaa)
badfile2(bbbbbbbbbcccccccdddddddd)
I am looking for a program that I can run on a large file (call it badfile) that can determine the type and size of the first logical file (let's say it's a .zip file) contained within and create a new file to hold the contents (e.g. zip1.zip since filenames are lost) and chop the file off the front of badfile. This would allow me to run the program in a loop to extract files with known types and/or pause and let the user handle the difficult cases. Does such a program exist? I know that the *nix command file(1) will do a lot of the work here, but there would be a lot of effort in encoding rules for sizing files (e.g. .pdf) that I would prefer to not duplicate.
I believe this question should be closed as off topic, since it asks for existing programs to solve the problem, but the open bounty prevents a close vote. However:
Does such a program exist?
Yes, such programs exist; they are called data carving tools.
Some common ones include scalpel, foremost, and PhotoRec.
A list of other tools is available here.
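As a rough sketch (assuming the concatenated blob is named badfile, as in the question), foremost can be pointed at it directly and told which file types to carve:

# Carve zip and pdf data out of badfile into the carved/ output directory
foremost -t zip,pdf -i badfile -o carved

PhotoRec runs interactively, while scalpel works from a configuration file in which you enable the signatures you want to search for.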

join two different files into one file inside a zip (or otherwise compressed) file

I have two files inside a zip file.
Now, imagine these two files are big... REALLY big... so big that I can't uncompress them into my old, poor, tiny hard disk.
However, they are simple txt files, so the zipped version is quite small.
I need to JOIN the two files into ONE single file.
As they're too big to extract, I need to do this INSIDE the zip.
Is there a way to do this?
Example:
"compressed.zip" contains "part_1.txt" and "part_2.txt".
I want "compressed.zip" to contain one file, called "part_1_and_2.txt".
(If it's not possible with zip, I can pick another compressor... but the idea is the same: each uncompressed file is bigger than the total capacity of my hard disk)
Tnx!
It seems like you just need to ensure that the storage requirements are low; I don't think the operation needs to occur "within the zip file" per se. You can do this with command-line tools (in Linux or with similar tools via Cygwin) in the following way:
Start with a tarred, gzipped file with your input files in it. Let that be compressed.tar.gz. Then you can extract the contents of the gzipped tar archive to standard output and pipe it back to gzip:
tar xzf compressed.tar.gz -O | gzip > part_1_and_2.txt.gz
The resultant compressed file is the text of part_1.txt and part_2.txt concatenated (though I suppose it is not the same as having a tar archive that contains one file, but perhaps this will be sufficient).
If you need to do this within a program, I would guess that libtar and zlib can perform this functionality programmatically, or you can run a script from your program.
You can use libzip (which in turn uses zlib) to read uncompressed data from the input files in turn and write a new output zip file. You would not need to store all of an uncompressed file on the mass storage or in memory. You can read and write a small chunk at a time, as you would without compression. I presume that you have room on your mass storage for all three of those zip files.
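If the data really is sitting in a zip file as described, a similar streaming trick may work without writing any code, assuming an Info-ZIP style unzip that can extract to a pipe:

# Stream both members to stdout and recompress the concatenation
unzip -p compressed.zip part_1.txt part_2.txt | gzip > part_1_and_2.txt.gz

The members come out in the order they appear in the archive, so it is worth checking with unzip -l that part_1.txt really precedes part_2.txt.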

What is the most reliable way to move or copy large files (> 100 MB) on iOS?

Right now I'm moving very large files in iOS with this method:
[fileManager moveItemAtURL:srcURL toURL:toURL error:&error];
This is a method from NSFileManager.
Because the files are so large I try to move them instead of copying and then deleting the source file.
Is there a safer way to do this?
A file move is an extremely lightweight operation; it doesn't involve copying anything, as it simply moves a directory entry from one point in the filesystem to another.
It should be quite safe.
If you really really want to be paranoid, then:
copy all bytes from A to B
verify B is coherent
delete A
That is what the "atomically" variants of the write/copy APIs do under the covers, save for the verification part, because the filesystem itself should take care of that.
What you are doing is correct and efficient. Moving a file (within the same file system) is essentially instant, whereas a copy and delete is very slow. Please note that moving a file to a different file system is actually done with a copy and delete.

Archive format suggestions for exporting iPad app data? Tarball?

I have a nascent iPad application, which stores "documents" internally on the device in the file system as a series of distinct files in a folder.
I'd like to try incorporating an import/export function through iTunes, using the new features in OS 3.2 for this. I want to put all the document pieces that I keep internally into one container file for export.
So, smart folks of Stack Overflow: what's the simplest solution that will put a file hierarchy (or a flat list in a pinch) into one file? In theory the "archive"/container won't need to be manipulated outside the app, so random access isn't super important here, although it would be a bonus of course.
A tar file type thing springs to mind immediately. Roll my own? Any other thoughts or gotchas? (And if anyone can point me to code that reads/writes from a tar file, I'm all ears.)
Thanks!
Update: Made community wiki, since there's no single right answer here.
Try libarchive, which is a BSD-derived library with a friendly license (easier for iPhone OS) for handling archive files.

keep rsync from removing unfinished source files

I have two machines, speed and mass. speed has a fast Internet connection and is running a crawler which downloads a lot of files to disk. mass has a lot of disk space. I want to move the files from speed to mass after they're done downloading. Ideally, I'd just run:
$ rsync --remove-source-files speed:/var/crawldir .
but I worry that rsync will unlink a source file that hasn't finished downloading yet. (I looked at the source code and I didn't see anything protecting against this.) Any suggestions?
It seems to me the problem is transferring a file before it's complete, not that you're deleting it.
If this is Linux, it's possible for a file to be open by process A while process B unlinks it. There's no error, but of course A is wasting its time. So the fact that rsync deletes the source file is not, in itself, a problem.
The problem is that rsync deletes the source file as soon as it has been copied, and if the file was still being written to disk when the copy was made, you'll be left with only a partial copy.
How about this: Mount mass as a remote file system (NFS would work) in speed. Then just web-crawl the files directly.
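Something along these lines, assuming mass exports a directory for this (the export and mount point names here are made up):

# On speed: mount the big disk from mass and point the crawler's output directory at it
sudo mount -t nfs mass:/export/crawl /mnt/crawl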
How much control do you have over the download process? If you roll your own, you can have the file being downloaded go to a temp directory or have a temporary name until it's finished downloading, and then mv it to the correct name when it's done. If you're using third party software, then you don't have as much control, but you still might be able to do the temp directory thing.
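For example, a hand-rolled fetch step could look roughly like this (wget and the names here are just placeholders):

# Download under a temporary name, then rename into the directory rsync watches
wget -O incoming/file123.part "$url" && mv incoming/file123.part /var/crawldir/file123

As long as incoming/ and /var/crawldir are on the same filesystem, the mv is an atomic rename, so rsync never sees a half-written file.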
Rsync can exclude files matching certain patterns. Even if you can't modify the crawler to download files to a temporary directory, it may have a convention of naming files differently while they are being downloaded (for example, foo.downloading for a file named foo), and you can use that to exclude files which are still being downloaded from being copied.
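With a naming convention like that, the exclusion is a one-liner (sketched against the command from the question):

# Skip anything still carrying the crawler's in-progress suffix
rsync --remove-source-files --exclude='*.downloading' speed:/var/crawldir .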
If you have control over the crawling process, or it has predictable output, the above solutions (storing in a tempfile until finished and then mv'ing it to the completed-downloads place, or ignoring files with a '.downloading' kind of name) might work. If all of that is beyond your control, you can make sure a file is not open in any process by running 'lsof $filename' and checking whether there is any output. Clearly, if no one has the file open, it's safe to move it over.
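A rough sketch of that check, assuming the downloads land in /var/crawldir and the destination mount is /mnt/mass (hypothetical paths):

for f in /var/crawldir/*; do
    # lsof exits non-zero (and prints nothing) when no process has the file open
    if ! lsof "$f" > /dev/null 2>&1; then
        mv "$f" /mnt/mass/
    fi
done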
