tar/untar loses file modification dates - tar

When I tar a file and then untar it, I lose the file's modification and creation dates. Is there a way to preserve them? I need this because I have a Jenkins job that runs the aws sync command after untarring, and it keeps re-uploading the same files to S3.

tar --atime-preserve will preserve the access time of the files being read, but creation (birth) times are generally not preserved. GNU tar stores and restores modification times by default, though other implementations and extraction options (such as -m/--touch) can discard them.
If the timestamps on the receiving file system are very important, you might have more success with the dump and restore commands (provided they are supported). They were designed for backup and restore, so I believe they offer more options for handling timestamps.
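As a minimal sketch (the archive and directory names are placeholders), with GNU tar the access times of the source files can be kept at creation time, and modification times are restored on extraction unless -m/--touch is given:
tar --atime-preserve -cf backup.tar project/   # keep access times on the files being archived
tar -xpf backup.tar                            # -p restores permissions; mtimes are restored by default
Whether the restored timestamps are enough for aws sync to skip unchanged files is worth verifying in the Jenkins job itself.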

Related

Exclude a directory from `podman/docker export` stream and save to a file

I have a container that I want to export as a .tar file. I have been using podman run with a tar --exclude=/dir1 --exclude=/dir2 … that outputs to a file on a bind-mounted host directory. But recently this has been giving me some tar: .: file changed as we read it errors, which podman/docker export would avoid. Besides, I suppose the export is more efficient. So I'm trying to migrate to using the export, but the major obstacle is that I can't seem to find a way to exclude paths from the tar stream.
If possible, I'd like to avoid modifying a tar archive already saved on disk, and instead modify the stream before it gets saved to a file.
I've been banging my head for multiple hours, trying useless advice from ChatGPT, looking at cpio, and attempting to pipe the podman export into a tar --exclude … command. With the last approach I did have some small success at one point, but I couldn't make tar save the result to a file with the name I wanted.
Any suggestions?
(Note: I make no distinction between docker and podman here, as their export commands are completely the same, and mentioning both is useful for searchability.)
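One direction that may be worth trying (a sketch only, not a tested answer; mycontainer, out.tar and the directory names are placeholders): the GNU tar manual notes that --delete has been reported to work when tar acts as a filter from stdin to stdout, and paths in an export stream are typically relative (dir1/, not /dir1), so the stream could be filtered before it ever reaches the disk:
podman export mycontainer | tar -f - --delete dir1 dir2 > out.tar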

What does "dump" mean in the context of the GNU tar program?

The man page for tar uses the word "dump" and its forms several times. What does it mean? For example (manual page for tar 1.26):
"-h, --dereferencefollow symlinks; archive and dump the files they point to"
Many popular systems have a "trash can" or "recycle bin." I don't want the files dumped there, but it kind of sounds that way.
At present, I don't want tar to write or delete any file, except that I want tar to create or update a single tarball.
FYI, the man page for the tar installed on the system I am using at the moment is a lot shorter than what appears to be the current version. And the description of -h, --dereference there seems very different to me:
"When reading or writing a file to be archived, tar accesses the file that a symbolic link points to, rather than the symlink itself. See section Symbolic Links."
P.S. I could not get "block quote" to work properly in this post.
File system backups are also called dumps.
— Raymond Chen, quoting the GNU tar manual

Is there a way to force a particular line in my Dockerfile to always rebuild, whilst still benefiting from caching on the preceding layers? [duplicate]

This question already has answers here:
Disable cache for specific RUN commands
(9 answers)
Closed 1 year ago.
I frequently seem to have to write Dockerfiles like this (line numbers added for clarity):
1. FROM somebase
2. RUN cp /some/local/stuff /some/docker/container/path
3. RUN some-other-local-commands
4. RUN wget http://some.remote.server/some.remote.path.for.example.json
5. RUN some-other-local-commands-which-may-depend-on-the-json
On line (4), I'm fetching a remote resource. Let's assume for now that it's a JSON file. It might change from time to time, maybe not on every build, but perhaps every few hours or days.
What this means is that every time I build my container, I want to ensure the freshest JSON file is fetched. One way to force this is to add the --no-cache parameter to my docker build command, but that forces all of the lines/layers to rebuild, including (1)-(3), where that is likely not necessary. Is there a pattern or technique to automatically 'taint' or 'mark' line (4) so that Docker knows it always has to re-run the wget (presumably this would also have to force a rebuild of line (5)), whilst still getting the layer caching behaviour for lines (1)-(3) when Docker detects the prerequisite files haven't changed?
If the specific thing you're trying to trigger rebuilds on is the result of RUN wget ... against a specific URL, Docker actually has native support for this.
There are two similar instructions for copying files into an image. COPY only copies files from the build context. ADD can also fetch external URLs and unpack local archives (but not both at the same time). The general recommendation is to use COPY, unless you need one of the specific things ADD does differently.
So you should be able to say
ADD http://some.remote.server/some.remote.path.for.example.json .
RUN some-other-local-commands-which-may-depend-on-the-json
and the RUN command will use the Docker layer cache based on the contents of the fetched file.
If this approach doesn't work for you (maybe you need special authentication to fetch the file) you can also fetch the file outside of Docker before you run docker build, and then COPY it in. Again, it will work like any other file you COPY in, and layer caching will take effect based on whether the file has changed or not.
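A minimal sketch of that second approach (the URL comes from the question; the use of curl and the image tag are assumptions, and the Dockerfile is assumed to COPY the downloaded file):
curl -fsSL -o some.remote.path.for.example.json http://some.remote.server/some.remote.path.for.example.json
docker build -t myimage .    # the Dockerfile then does: COPY some.remote.path.for.example.json .  (cache busts only when the file content changes)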

How to use azcopy version 10.3.0 to copy and then delete files from Blob storage to a VM using PowerShell

Some CSV files keep getting placed in an Azure storage container. I need to continuously move the files to a VM. I am using the following PowerShell script, running on my VM.
while($true){
.\azcopy sync "source blob" "destination folder on VM" --include-pattern "*.csv" --log-level ERROR
Start-Sleep -Seconds 60
}
The files are getting copied, but how do I delete them from the source? They are not needed any more after they have been copied to the VM.
I scoured the interwebs for this and still none of the "solutions" were sufficient, so I had to write my own. Below is the code I wrote; it will do the following:
a. Download AzCopy if it doesn't exist.
b. Copy data from Azure blob storage.
c. Upon successful copy, delete the data from the source.
https://github.com/SlkRck/demoscripts/blob/master/copy-blobstorage.ps1
You can use the azcopy remove command after the azcopy sync operation has completed. I should mention here that azcopy sync blocks until it finishes, so it is safe to run the azcopy rm command once the sync operation returns.
Note that if you only want to remove the .csv files, you should add --include-pattern="*.csv" to the command.
I'm using the latest version of azcopy, v10.3.1. If you prefer to use v10.3.0, first run azcopy remove --help for the details of the command and its parameters.
A sample command looks like this:
while($true){
.\azcopy sync "source blob" "destination folder on VM" --include-pattern "*.csv" --log-level ERROR
Start-Sleep -Seconds 60
#after the copy operation is completed, use remove command as below.
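#note: the SAS token used in the URL must also grant delete permission, or azcopy rm will fail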
azcopy.exe rm "https://xxx.blob.core.windows.net/test4?sastoken" --include-pattern="*.csv" --recursive=true
}
One thing to mention: in my test I used a SAS token, for testing purposes.

GSUTIL not re-uploading a file that has already been uploaded earlier that day

I'm running GSUTIL v3.42 from a Windows CMD script on a Windows Server 2008 R2 machine using Python 2.7.6. Files to be uploaded arrive in an "outgoing" directory and are uploaded in parallel by GSUTIL to an "incoming" bucket. The script requests a listing of the "incoming" bucket after uploading has finished and then compares the files listed with those it attempted to upload, in order to detect any upload failures. Another, separate script moves files from the "incoming" bucket to a "processed" bucket afterwards.
If I attempt to upload the identical file (same name/size/content/date etc.) a second time, it doesn't upload, although I get no errors and nothing in my logging to indicate failure. I am not using the "no clobber" option, so I would expect gsutil to just upload the file.
In the scenario below, assume that the file has been successfully uploaded and moved to the "processed" bucket already on that day. In case timings matter, the second upload is being attempted within half an hour of the first.
1. File A arrives in "outgoing" directory.
2. I get a file listing of "outgoing" and write this to dirListing.txt
3. I perform a GSUTIL upload using
   type dirListing.txt | python gsutil -m cp -I -L myGsutilLogFile.txt gs://myIncomingBucket
4. I then perform a GSUTIL listing
   python gsutil ls -l -h gs://myIncomingBucket > bucketListing.txt
5. File match dirListing.txt and bucketListing.txt to detect mismatches and hence upload failures.
On the second run, File A isn't being uploaded in step 3 and consequently isn't returned in step 4, causing a mismatch in step 5. [I've checked the content of all of the relevant files and it's definitely in dirListing.txt and not in bucketListing.txt]
I need the ability to re-process a file in case the separate script that moves the file from the "incoming" to the "processed" bucket fails for some reason or doesn't do what it should do. I have to upload in parallel because there are normally hundreds of files on each run.
Is what I've described above expected behaviour from GSUTIL? (I haven't seen anything in the documentation that suggests this) If so, is there any way of forcing GSUTIL to re-attempt the upload? Or am I missing something obvious, please? I have debug output from GSUTIL if that's necessary/useful.
From the above, it looks like you're uploading using "-L" to log to a manifest file. If you're using the same manifest file, and the file has already been uploaded once, then gsutil will not try to re-upload the file. From the docs on "-L" in "gsutil help cp":
If the log file already exists, gsutil will use the file as an
input to the copy process, and will also append log items to the
existing file. Files/objects that are marked in the existing log
file as having been successfully copied (or skipped) will be
ignored.
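Based on that behaviour, one workaround (a sketch only; the file names come from the question, and the del assumes the Windows CMD environment described above) is to remove or rotate the manifest file before each run, so that objects logged by an earlier run are attempted again:
del myGsutilLogFile.txt
type dirListing.txt | python gsutil -m cp -I -L myGsutilLogFile.txt gs://myIncomingBucket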
