Tar subdirectories separately - tar

I need to tar a large number of files, roughly 60 million. They are organized in year/month/day directories, and each day has about 700 files. Is there a "neat" way of first tarring the daily directories, then putting those into monthly tar files, and finally combining them into yearly tar files?
Of course I can try to write a script to do that, but I thought perhaps there is something "out there", or even a built-in function, that I can use for this task.

OK, it is not nice to answer your own question, but here is what I did using just the command line:
for m in /mypath/<year>/*; do for d in "$m"/*; do tar -cf "<year>$(basename "$m")$(basename "$d").tar" "$d"; done; done
As a result I get daily tar files named yyyymmdd.tar.
Bundling those into monthly or yearly tar files afterwards is not difficult.
Any other, more elegant solutions are always welcome.
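A slightly more defensive sketch of the same loop, written as a shell function. The function name tar_days, its three arguments, and the separate output directory are assumptions made for illustration; only the yyyymmdd.tar naming comes from the one-liner above. Unlike the one-liner, it uses -C so that each archive contains just the day directory rather than the full path.

```shell
#!/bin/sh
# Sketch: create one tar file per day directory under <root>/<year>/<month>/<day>.
# tar_days and its arguments are hypothetical names, not an existing tool.
# Usage: tar_days <root> <year> <output-dir>
tar_days() {
    root=$1      # directory that contains the year directories
    year=$2      # year directory to process
    out=$3       # directory that receives the yyyymmdd.tar files
    mkdir -p "$out" || return 1
    for m in "$root/$year"/*/; do
        [ -d "$m" ] || continue          # skip if the glob matched nothing
        month=$(basename "$m")
        for d in "$m"*/; do
            [ -d "$d" ] || continue
            day=$(basename "$d")
            # -C makes tar store only the day directory name, not the full path
            tar -cf "$out/$year$month$day.tar" -C "$m" "$day" || return 1
        done
    done
}
```

The daily files for one month can then be bundled with a glob on the file names, for example all files matching yyyymm??.tar, and the monthly files for a year in the same way.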

Related

Exclude a directory from `podman/docker export` stream and save to a file

I have a container that I want to export as a .tar file. So far I have used podman run with tar --exclude=/dir1 --exclude=/dir2 …, writing to a file in a bind-mounted host directory. Recently this has been giving me tar: .: file changed as we read it errors, which podman/docker export would avoid. Besides, I suppose export is more efficient. So I'm trying to migrate to export, but the major obstacle is that I can't find a way to exclude paths from the tar stream.
If possible, I'd like to avoid modifying a tar archive already saved on disk, and instead modify the stream before it gets saved to a file.
I've been banging my head against this for hours: trying useless advice from ChatGPT, looking at cpio, and attempting to pipe podman export into a tar --exclude … command. With the last approach I had some small success at one point, but I couldn't make tar save the result to a file with a specific name.
Any suggestions?
(Note: I make no distinction between docker and podman here, since their export commands behave the same, and naming both helps searchability.)
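One approach that may be worth trying, although nothing in this thread confirms it: GNU tar can act as a filter in --delete mode, reading an archive on stdin and writing the reduced archive to stdout, so ordinary shell redirection decides the output file name. The helper name filter_tar_stream and the container name mycontainer are placeholders. The sketch assumes GNU tar, and it assumes the member names in the export stream have no leading slash (dir1 rather than /dir1), which should be checked with tar -t first. It is worth verifying on a small container before relying on it.

```shell
#!/bin/sh
# Sketch, assuming GNU tar: drop members from a tar stream before it reaches disk.
# filter_tar_stream is a hypothetical helper. It reads a tar archive on stdin,
# removes the named members (a directory is removed together with its contents),
# and writes the remaining archive to stdout.
# tar reports an error if a named member is not present in the stream.
filter_tar_stream() {
    tar --delete -f - "$@"
}

# Intended use (not executed here; mycontainer and the paths are placeholders):
#   podman export mycontainer | filter_tar_stream dir1 dir2 > /some/dir/export.tar
```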

What does "dump" mean in the context of the GNU tar program?

The man page for tar uses the word "dump" and its forms several times. What does it mean? For example (manual page for tar 1.26):
"-h, --dereference    follow symlinks; archive and dump the files they point to"
Many popular systems have a "trash can" or "recycle bin." I don't want the files dumped there, but it kind of sounds that way.
At present, I don't want tar to write or delete any file, except that I want tar to create or update a single tarball.
FYI, the man page for the tar installed on the system I am using at the moment is a lot shorter than what appears to be the current version. And the description of -h, --dereference there seems very different to me:
"When reading or writing a file to be archived, tar accesses the file that a symbolic link points to, rather than the symlink itself. See section Symbolic Links."
P.S. I could not get "block quote" to work properly in this post.
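In tar's documentation, to "dump" a file simply means to write it into the archive; nothing is moved to a trash can or deleted. As the GNU tar manual puts it: "File system backups are also called dumps."
(raymond-chen, quoting the GNU tar manual)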
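To see what "dump" amounts to in practice, here is a small experiment, assuming a scratch directory is acceptable. It archives a symlink with and without -h and inspects the entry type in each archive. The helper name entry_type and all file names are made up for the demonstration. The source files are left untouched; tar only writes the two archives.

```shell
#!/bin/sh
# Sketch: compare how tar stores a symlink with and without -h (--dereference).
# All names below are made up for the demonstration.
demo=$(mktemp -d)
echo "hello" > "$demo/target.txt"
ln -s target.txt "$demo/link.txt"

# Without -h the archive records the symlink itself.
tar -cf "$demo/plain.tar" -C "$demo" link.txt
# With -h tar follows the link and stores ("dumps") the file it points to.
tar -chf "$demo/deref.tar" -C "$demo" link.txt

# In a verbose listing the first character of a line is the entry type:
# 'l' for a symbolic link, '-' for a regular file.
entry_type() {
    tar -tvf "$1" | cut -c1
}
echo "plain: $(entry_type "$demo/plain.tar")"
echo "deref: $(entry_type "$demo/deref.tar")"
```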

tarring and untarring between two remote hosts

I have two systems that I'm splitting processing between, and I'm trying to find the most efficient way to move the data between the two. I've figured out how to tar and gzip to an archive on the first server ("serverA") and then use rsync to copy to the remote host ("serverB"). However, when I untar/unzip the data there, it saves the archive including the full path name from the original server. So if on server A my data is in:
/serverA/directory/a/lot/of/subdirs/myData/*
and, using this command:
tar -zcvf /serverA/directory/a/lot/of/subdirs/myData-archive.tar.gz /serverA/directory/a/lot/of/subdirs/myData/
Everything in .../myData is successfully tarred and zipped in myData-archive.tar.gz
However, after copying the archive, when I try to untar/unzip on the second host (I manually log in here to finish the processing, the first step of which is to untar/unzip) using this command:
tar -zxvf /serverB/current/directory/myData-archive.tar.gz
It untars everything in my current directory (serverB/current/directory/), however it looks like this:
/serverB/current/directory/serverA/directory/a/lot/of/subdirs/myData/Data*ext
How should I formulate both the tar commands so that my data ends up in a directory called
/serverB/current/directory/dataHERE/
?
I know I'll need the -C flag to untar into a different directory (in my case, /serverB/current/directory/dataHERE), but I still can't figure out how to keep the entire path from being included when the archive gets untarred. I've seen similar posts, but none that I saw discussed how to do this when moving between two different hosts.
UPDATE: per one of the answers in this question, I changed my commands to:
tar/zip on serverA:
tar -zcvf /serverA/directory/a/lot/of/subdirs/myData-archive.tar.gz serverA/directory/a/lot/of/subdirs/myData/ -C /serverA/directory/a/lot/of/subdirs/ myData
and, untar/unzip:
tar -zxvf /serverB/current/directory/myData-archive.tar.gz -C /serverB/current/directory/dataHERE
And now, not only does it untar/unzip the data to:
/serverB/current/directory/dataHERE/
like I wanted, but it also puts another copy of the data here:
/serverB/current/directory/serverA/directory/a/lot/of/subdirs/myData/
which I don't want. How do I need to fix my commands so that it only puts data in the first place?
On serverA do
( cd /serverA/directory/a/lot/of/subdirs; tar -zcvf myData-archive.tar.gz myData; )
After some more messing around, I figured out how to achieve what I wanted:
To tar on serverA:
tar -zcvf /serverA/directory/a/lot/of/subdirs/myData-archive.tar.gz -C /serverA/directory/a/lot/of/subdirs/ myData
Then to untar on serverB:
tar -zxvf /serverB/current/directory/myData-archive.tar.gz -C /serverB/current/directory/dataHERE
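The effect of -C can be checked locally before involving two hosts. The sketch below wraps the two commands from this answer in functions; pack_dir and unpack_into are hypothetical names, and the paths passed to them are whatever applies on each server.

```shell
#!/bin/sh
# Sketch of the -C approach. pack_dir and unpack_into are hypothetical helpers.

# pack_dir <archive> <parent-dir> <name>
# Stores <name> relative to <parent-dir>, so no leading path ends up in the archive.
pack_dir() {
    tar -zcf "$1" -C "$2" "$3"
}

# unpack_into <archive> <dest-dir>
# Extracts below <dest-dir>, creating it first because tar -C needs it to exist.
unpack_into() {
    mkdir -p "$2" && tar -zxf "$1" -C "$2"
}
```

With these commands the files end up in dataHERE/myData/. If they should sit directly in dataHERE/, one option is to point -C at myData itself when packing and give . as the member to archive.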

Rsync folder with a million files, but very small incremental daily updates

We run rsync on a large folder containing close to a million files (HTML, JSP, GIF/JPG, etc.). Of course, we only need to update files incrementally. Only a few JSP and HTML files change in this folder, and we need this server to rsync them to the same folder on a different server.
Sometimes rsync is running quite slow these days, so one of our IT team members created this command:
find /usr/home/foldername \
  -type f -name '*.jsp' -exec \
  grep -l '<ssi:include src=[^$]*\${' {} \;
This looks for only specific files which have JSP extension and which contain certain kinds of text, because these are the files which we need to rsync. But this command is consuming a lot of memory. I think it's a stupid way to rsync, but I'm being told this is how things will work.
Some googling suggests that this should work on this folder too:
rsync -a --update --progress --rsh --partial /usr/home/foldername /destination/server
I'm worried that this will be too slow on a daily basis, but I can't imagine why this will be slower than that silly find option that our IT folks are recommending. Any ideas about large directory rsyncs in the real world?
A find command will not be faster than the rsync scan, and the grep command must be slower than rsync because it requires reading all the text from all the .jsp files.
The only ways find-and-grep could be faster are:
1. The timestamps on your files do not match, so rsync has to checksum the contents (on both sides!).
This seems unlikely, since you're using -a, which syncs the timestamps properly (because -a implies -t). However, it can happen if the file systems on the two machines support different timestamp precision (e.g. Linux vs. Windows), in which case the --modify-window option is what you need.
2. Many more files have changed than the ones you care about, and rsync is transferring those as well.
If this is the case, you can limit the transfer to .jsp files like this:
--include '*.jsp' --include '*/' --exclude '*'
(Include all .jsp files and all directories, but exclude everything else.)
3. rsync does the scan up front, then the compare (possibly using lots of RAM), then the transfer, whereas find/grep/copy handles each file as it goes.
This used to be a problem, but rsync ought to do an incremental recursive scan as long as both local and remote versions are 3.0.0 or greater, and you don't use any of the fancy delete or delay options that force an up-front scan (see --recursive in the documentation).
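Put together, the filter rules and the timestamp tolerance from the answer above might be combined as follows. sync_jsp is a hypothetical wrapper, the one-second --modify-window value is an assumption suited to file systems with coarse timestamps, and --prune-empty-dirs is an addition so that directories containing no .jsp files are not created on the destination.

```shell
#!/bin/sh
# Sketch: transfer only .jsp files while keeping the directory layout.
# sync_jsp is a hypothetical wrapper; source and destination are placeholders.
# Usage: sync_jsp <source-dir> <destination>
sync_jsp() {
    # The trailing slash on the source copies its contents, not the directory itself.
    rsync -a --prune-empty-dirs --modify-window=1 \
        --include='*.jsp' --include='*/' --exclude='*' \
        "${1%/}/" "$2"
}
```

Running the same rsync command with -n (--dry-run) added first shows what would be transferred without copying anything.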

Linux tar help to extract folders

I more or less found the answer on Stack Overflow but am still confused, so I need some help.
I have a tar file which contains files and folders like this: usr/CCS/HMS*
I would like to extract all files and folders under usr/CCS/HMS*, but into a different filesystem; the new filesystem is /usr/TRAINP.
HMS* should replace TRAINP*. TRAINP has folders like TRAINP/TRAINP.GL, TRAINP.AR, etc.
The backup contains folders like usr/CCS/HMS/HMS.GL, usr/CCS/HMS.AR.
When I extract, it restores under /usr/TRAINP. I want usr/CCS/HMS* to replace /usr/TRAINP. This is a kind of database restore under a different name.
Thanks a lot in advance.
Tar does not rename the contents when extracting unless told to (GNU tar has a --transform option for that). The most portable approach is to extract somewhere in the target filesystem and then move the results where you want them.
For example:
cd /usr/CCS/TRAINP1
tar xf archive.tar usr/CCS/HMS1
mv usr/CCS/HMS1/* .
Or, if the TRAINP directories do not exist:
cd /
tar xf archive.tar usr/CCS
cd usr/CCS
for file in HMS*; do mv "$file" "TRAINP${file#HMS}"; done
Of course there are many variations and alternatives that will yield the same result. Note my example assumes usr/CCS belongs in /usr/CCS.
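As an alternative to extracting and then moving, GNU tar (not every tar implementation) can rewrite member names during extraction with --transform. The sketch below assumes GNU tar and the archive layout described in the question; restore_renamed is a hypothetical helper, and the two sed-style expressions are one possible mapping from usr/CCS/HMS* to TRAINP*, not something taken from the answer.

```shell
#!/bin/sh
# Sketch, assuming GNU tar: rename members while extracting with --transform.
# restore_renamed is a hypothetical helper; HMS and TRAINP follow the question.
# Usage: restore_renamed <archive> <dest-dir>
restore_renamed() {
    mkdir -p "$2" || return 1
    # The first expression drops the usr/CCS/ prefix; the second renames HMS to TRAINP.
    # Caveat: the g flag also rewrites "HMS" inside file names deeper in the tree.
    tar -xf "$1" -C "$2" --wildcards \
        --transform='s,^usr/CCS/,,' \
        --transform='s,HMS,TRAINP,g' \
        'usr/CCS/HMS*'
}
```

Listing the archive with tar -t and trying the extraction into an empty scratch directory first is a cheap way to confirm that the resulting names are the intended ones.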
