keep rsync from removing unfinished source files

I have two machines, speed and mass. speed has a fast Internet connection and is running a crawler which downloads a lot of files to disk. mass has a lot of disk space. I want to move the files from speed to mass after they're done downloading. Ideally, I'd just run:
$ rsync --remove-source-files speed:/var/crawldir .
but I worry that rsync will unlink a source file that hasn't finished downloading yet. (I looked at the source code and I didn't see anything protecting against this.) Any suggestions?

It seems to me the problem is transferring a file before it's complete, not that you're deleting it.
If this is Linux, a file can be open in process A while process B unlinks it. Neither process gets an error, but of course A's work may be wasted. So the fact that rsync deletes the source file is not the problem.
The problem is that rsync copies the source and only then deletes it; if the file is still being written when rsync copies it, the destination ends up with a partial file.
How about this: mount mass as a remote file system (NFS would work) on speed, and have the crawler write the files to it directly.
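For example (a rough sketch only; the export path and mount point are made up), you could export a directory from mass and mount it on speed:

# on mass: export a directory with plenty of space, e.g. via an /etc/exports entry
# on speed: mount it and point the crawler at it
sudo mount -t nfs mass:/export/crawl /mnt/crawl
# then configure the crawler to write into /mnt/crawl instead of /var/crawldir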

How much control do you have over the download process? If you roll your own, you can have each file download into a temporary directory or under a temporary name, and then mv it to its final name once the download finishes. If you're using third-party software you have less control, but you might still be able to do the temporary-directory trick.
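For instance, a download step along these lines (the URL and file names are just placeholders, and it assumes the staging directory is on the same filesystem as /var/crawldir so the mv is an atomic rename) means rsync never sees a half-written file:

wget -O /var/crawltmp/foo.part "$url" && mv /var/crawltmp/foo.part /var/crawldir/foo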

Rsync can exclude files matching certain patterns. Even if you can't make the downloader write to a temporary directory, it may have a convention of naming files differently while they download (for example, foo.downloading for a file that will end up as foo), and you can use that to exclude files which are still being downloaded from being copied.
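If the crawler does follow such a convention, something like this should skip in-progress files (the '.downloading' suffix is only an example, and -a is added so the directory is copied recursively):

$ rsync -a --remove-source-files --exclude='*.downloading' speed:/var/crawldir .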

If you have control over the crawling process, or it has predictable output, the above solutions (storing in a temporary file until finished and then mv'ing it to the completed-downloads directory, or ignoring files with a '.downloading'-style name) should work. If all of that is beyond your control, you can check that no process has the file open by running 'lsof $filename' and looking at whether it prints anything: if nothing has the file open, it's safe to move it over.
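A small loop along these lines, run on speed (lsof can only see local processes, so this pushes rather than pulls; the destination path on mass is hypothetical), would move only the files nobody has open:

for f in /var/crawldir/*; do
  if ! lsof "$f" > /dev/null 2>&1; then          # lsof exits non-zero when nothing has the file open
    rsync --remove-source-files "$f" mass:/srv/crawl/
  fi
done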

Related

Delete Files from .tar Archive while Extracting

I am wondering whether I can incrementally delete the contents of an archive (at a minimum .tar, though ideally also .tar.xz or similar) as I extract them. The goal is to avoid needing twice the files' size in free space while downloading and extracting.
I have looked around a bit, and tar seems to have a --remove-files option, but it only appears to apply when creating an archive. The --delete option works on an existing archive as well, but only on a plain .tar. Would it be possible in this case to replace the .tar.xz with an uncompressed .tar in place, obtain the list of its contents, and then iteratively extract members and remove them with --delete? Or is there a better way of doing this?
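Concretely, I'm imagining something like this rough, untested sketch with GNU tar (archive.tar.xz is a placeholder name, and each --delete pass rewrites the remaining archive, so it would be slow on a huge one):

xz -d archive.tar.xz                     # leaves an uncompressed archive.tar in place
tar -tf archive.tar > members.txt        # snapshot the member list before modifying the archive
while read -r member; do
  tar -xf archive.tar "$member"          # extract one member
  tar --delete -f archive.tar "$member"  # then drop it from the archive, shrinking it
done < members.txt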

join two different files into one file inside a zip (or otherwise compressed) file

I have two files inside a zip file.
Now, imagine these two files are big... REALLY big... so big that I can't uncompress them into my old, poor, tiny hard disk.
However, they are simple txt files, so the zipped version is quite small.
I need to JOIN the two files into ONE single file.
As they're too big to extract, I need to do this INSIDE the zip.
Is there a way to do this?
Example:
"compressed.zip" contains "part_1.txt" and "part_2.txt".
I want "compressed.zip" to contain one file, called "part_1_and_2.txt".
(If it's not possible with zip, I can pick another compressor... but the idea is the same: each uncompressed file is bigger than the total capacity of my hard disk.)
Tnx!
It seems like you just need to keep the storage requirements low; I don't think the operation has to happen "within the zip file" per se. You can do this with command-line tools (on Linux, or with similar tools via Cygwin) in the following way:
Start with a tarred, gzipped file with your input files in it. Let that be compressed.tar.gz. Then you can extract the contents of the gzipped tar archive to standard output and pipe it back to gzip:
tar xzf compressed.tar.gz -O | gzip > part_1_and_2.txt.gz
The resultant compressed file is the text of part_1.txt and part_2.txt concatenated (though I suppose it is not the same as having a tar archive that contains one file, but perhaps this will be sufficient).
If you need to do this within a program, I would guess that libtar and zlib can perform this functionality programmatically, or you can run a script from your program.
You can use libzip (which in turn uses zlib) to read uncompressed data from the input entries in turn and write a new output zip file. You would not need to store an entire uncompressed file on mass storage or in memory; you can read and write a small chunk at a time, just as you would without compression. I presume you have room on your mass storage for the zip files involved.
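If a command-line route is acceptable for the zip case as well, a similar streaming trick with Info-ZIP's unzip (assuming it is installed) never writes the uncompressed text to disk; like the answer above, it produces a gzipped result rather than a new zip:

( unzip -p compressed.zip part_1.txt; unzip -p compressed.zip part_2.txt ) | gzip > part_1_and_2.txt.gz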

Safely write to files with conflicting names in an NSOperationQueue

This is probably a pretty basic NSOperationQueue question, but maybe it will help some other people out who are just learning this as well.
I'm trying to copy multiple .plist files from the ~/Documents directory to the ~/Library directory in an iOS application. I want to use NSOperations to copy each file to speed this up and take the import process off the main thread.
In my implementation, it's possible that two files with the same name could be copied into the same place. What I'd like to do is make sure that one of the operations changes its filename to one that doesn't already exist before it writes the file. What would be the most straightforward way to go about this?
Thanks,
-c

What is the most reliable way to move or copy large files (> 100 MB) on iOS?

Right now I'm moving very large files in iOS with this method:
[fileManager moveItemAtURL:srcURL toURL:toURL error:&error];
This is a method from NSFileManager.
Because the files are so large I try to move them instead of copying and then deleting the source file.
Is there a safer way to do this?
A file move is an extremely lightweight operation; it doesn't involve copying anything, as it simply moves a directory entry from one point in the filesystem to another.
It should be quite safe.
If you really really want to be paranoid, then:
copy all bytes from A to B
verify B is coherent
delete A
Which is what the "atomically" variants of the write/copy APIs do under the covers, save for the verification step, because the filesystem itself should take care of that.
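In shell terms that paranoid sequence would look something like the sketch below (purely illustrative, with made-up paths; on iOS you'd do the equivalent through the NSFileManager/NSData APIs):

cp /path/A /path/B             # copy all bytes from A to B
cmp -s /path/A /path/B \
  && rm /path/A                # delete A only if the byte-for-byte comparison succeeds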
What you are doing is correct and efficient. Moving a file (within the same file system) is essentially instant, whereas a copy followed by a delete is very slow. Note that moving a file to a different file system is in fact implemented as a copy and delete.

Temporary file deleted

I'm working on ed (yes, the editor) source code.
The program uses a scratch file, opened with tmpfile, as a buffer.
But whenever I run the program, lsof always reports the temporary file as deleted! (and in fact it's not there). Why?
Because a file can exist on disk without having a filename associated with it, many programs will open a file and then promptly unlink it. The file's contents can still be modified and read through the open file handles, and won't actually be removed from disk until all of those handles are closed.
(This is for *nix/POSIX platforms, AFAICT; Windows handles files differently, preventing unlinking while a program still has a handle to the file open, which is why reboots are often needed during upgrades: to force those open file handles to be closed so the file contents can be replaced.)
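You can watch the same behaviour from a shell on Linux (the filename is arbitrary):

exec 3<> /tmp/scratch          # open /tmp/scratch read-write on file descriptor 3
rm /tmp/scratch                # unlink it; the inode survives because fd 3 still refers to it
echo "still here" >&3          # writes keep going to the now-nameless inode
lsof -p $$ | grep scratch      # lsof lists the file with "(deleted)" after its name
exec 3>&-                      # closing the descriptor finally releases the disk space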
