wget files with a given extension from an S3 bucket_contents.html URL

Problem outline
I'm trying to get all the files from a URL: https://archive-gw-1.kat.ac.za/public/repository/10.48479/7epd-w356/data/basic_products/bucket_contents.html
which appears to be a list of contents of an S3 bucket with associated download links.
When I attempt to download all the files with the extension *.jpeg, all I get back is the directory structure leading down to a subdirectory, with no files downloaded.
Things I've tried
To do this I've tried all the variations of options for:
$ wget -r -np -A '*.jpeg' https://archive-gw-1.kat.ac.za/public/repository/10.48479/7epd-w356/data/basic_products/
...that I can think of, but none have actually downloaded the jpeg files.
If you provide the path to a specific file e.g.
$ wget https://archive-gw-1.kat.ac.za/public/repository/10.48479/7epd-w356/data/basic_products/Abell_133_hi.jpeg
...the file downloads fine, which would suggest that I must surely be mishandling the wildcard aspect of the download?
Thoughts (which could be wrong, owing to my limited knowledge of wget and web protocols)
Could the fact that the contents are listed in a bucket_contents.html rather than an index.html be causing problems?
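One hedged workaround sketch, assuming bucket_contents.html contains plain absolute links ending in .jpeg (pre-signed links with query strings would need a different pattern): scrape the listing page and feed the extracted URLs back to wget, which sidesteps whatever is stopping the recursive crawl (robots.txt, or links that point outside the start path):
$ wget -qO- https://archive-gw-1.kat.ac.za/public/repository/10.48479/7epd-w356/data/basic_products/bucket_contents.html \
      | grep -oE 'https?://[^"]+\.jpeg' \
      | wget -nc -i -
Here wget -qO- prints the page to stdout, grep -oE pulls out the .jpeg URLs, and wget -i - downloads each URL read from stdin (-nc skips files already present).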

Related

Exclude a directory from `podman/docker export` stream and save to a file

I have a container that I want to export as a .tar file. So far I've used podman run with a tar --exclude=/dir1 --exclude=/dir2 … command that writes the archive to a file on a bind-mounted host directory. Recently this has been giving me tar: .: file changed as we read it errors, which podman/docker export would avoid; besides, I suppose the export is more efficient. So I'm trying to migrate to using export, but the major obstacle is that I can't seem to find a way to exclude paths from the tar stream.
If possible, I'd like to avoid modifying a tar archive already saved on disk, and instead modify the stream before it gets saved to a file.
I've been banging my head for hours, trying useless advice from ChatGPT, looking at cpio, and attempting to pipe the podman export output into a tar --exclude … command. With the last I had some small success at one point, but couldn't make tar save the result to a file with the name I wanted.
Any suggestions?
(note: I make no distinction between docker and podman here, as their export commands are essentially the same, and it helps searchability)
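One sketch that may be worth trying, assuming GNU tar (whose manual notes that --delete has been reported to work properly when tar acts as a filter from stdin to stdout, though not on compressed archives); mycontainer, dir1 and dir2 are placeholders:
# export the container filesystem, drop unwanted paths from the stream, save the rest;
# member names in the export stream are relative, so no leading slash on dir1/dir2
$ podman export mycontainer | tar -f - --delete dir1 dir2 > mycontainer.tar
The same pipeline should work with docker export; if you need compression, apply it after the --delete step (e.g. pipe through gzip).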

What does "dump" mean in the context of the GNU tar program?

The man page for tar uses the word "dump" and its forms several times. What does it mean? For example (manual page for tar 1.26):
"-h, --dereferencefollow symlinks; archive and dump the files they point to"
Many popular systems have a "trash can" or "recycle bin." I don't want the files dumped there, but it kind of sounds that way.
At present, I don't want tar to write or delete any file, except that I want tar to create or update a single tarball.
FYI, the man page for the tar installed on the system I am using at the moment is a lot shorter than what appears to be the current version. And the description of -h, --dereference there seems very different to me:
"When reading or writing a file to be archived, tar accesses the file that a symbolic link points to, rather than the symlink itself. See section Symbolic Links."
File system backups are also called dumps.
—raymond-chen, quoting the GNU tar manual
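Read that way, "dumping" a file just means writing a copy of its contents into the archive; nothing is deleted or sent to any trash can. A hedged illustration of -h with hypothetical paths:
$ tar -chf backup.tar mydir/    # create backup.tar; -h archives (dumps) the files that symlinks in mydir point to
$ tar -uhf backup.tar mydir/    # later: append only files newer than the copies already in backup.tar
Only backup.tar is written; the files under mydir/ are read, never modified.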

docker cp not working

I'm following this tutorial and when I get to the part where I call:
cp /tf_files/stripped_retrained_graph.pb bazel-bin/tensorflow/examples/android/assets/stripped_output_graph.pb
and
cp /tf_files/retrained_labels.txt bin/tensorflow/examples/android/assets/imagenet_comp_graph_label_strings.txt
They both say "No such file or directory".
I can cd to the tf_files folder and confirm that the files are there.
I can also cd to /tensorflow/tensorflow/examples/android/assets and call ls which shows there's just a BUILD file there.
In the cp command is there supposed to already be a stripped_output_graph.pb file in the destination which gets replaced? Or is it meant to just be creating a new file there?
Is there some way of doing cp [source] [current directory] rather than specifying the destination as a path?
I've tried removing the file path part in the hope that it just uses the source filename, but that doesn't work.
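Aside: plain cp does accept a directory, or just ., as the destination, and then keeps the source filename; the tutorial renames the files on copy, though, so its explicit destination filenames are still needed. A quick sketch using the tutorial's paths:
# copy into the current directory, keeping the source filename
cp /tf_files/retrained_labels.txt .
# copy into the assets directory, still keeping the source filename
cp /tf_files/retrained_labels.txt /tensorflow/tensorflow/examples/android/assets/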
Calling
cp /tf_files/stripped_retrained_graph.pb /tensorflow/tensorflow/examples/android/assets/stripped_output_graph.pb
and
cp /tf_files/retrained_labels.txt /tensorflow/tensorflow/examples/android/assets/imagenet_comp_graph_label_strings.txt
finally worked; it wasn't at all obvious that I'd have to change the destination path, or what it should be, though.
Also, I accidentally saved a file as .p rather than .pb, but managed to remove it using $ docker exec <container> rm -rf /tensorflow/tensorflow/examples/android/assets/stripped_output_graph.p
Now I managed to copy the files in correctly, but then when I installed the app it was still just running the regular demo app.
Not sure why it didn’t work, so frustrating.
When I rebuilt it after copying the files in, I got some conflict messages. Are these normal to have?
It looks like a different labels file may be taking priority over mine. How can I reach the external/inception5h/imagenet_comp_graph_label_strings.txt file to delete it, so that my file is used instead?
Does the “external” part mean that I can’t actually access it?

Rsync folder with a million files, but very small incremental daily updates

We run an rsync on a large folder. It has close to a million files in it, including HTML, JSP, and GIF/JPG files. Of course, we only need to update files incrementally: just a few JSP and HTML files change in this folder, and we need this server to rsync them to a different server, into the same folder.
Sometimes rsync is running quite slow these days, so one of our IT team members created this command:
find /usr/home/foldername -type f -name '*.jsp' \
    -exec grep -l '<ssi:include src=[^$]*${' {} \;
This looks only for files with the .jsp extension that contain a certain kind of text, because those are the files we need to rsync. But this command is consuming a lot of memory. I think it's a stupid way to rsync, but I'm being told this is how things will work.
Some googling suggests that this should work on this folder too:
rsync -a --update --progress --rsh=ssh --partial /usr/home/foldername /destination/server
I'm worried that this will be too slow on a daily basis, but I can't imagine why this will be slower than that silly find option that our IT folks are recommending. Any ideas about large directory rsyncs in the real world?
A find command will not be faster than the rsync scan, and the grep command must be slower than rsync because it requires reading all the text from all the .jsp files.
The only way a find-and-grep could be faster is if:
1. The timestamps on your files do not match, so rsync has to checksum the contents (on both sides!). This seems unlikely, since you're using -a, which syncs the timestamps properly (because -a implies -t). However, it can happen if the file systems on the different machines allow different timestamp precision (e.g. Linux vs. Windows), in which case the --modify-window option is what you need.
2. There are many more files changed than the ones you care about, and rsync is transferring those as well. If this is the case, you can limit the transfer to .jsp files with --include '*.jsp' --include '*/' --exclude '*' (include all .jsp files and all directories, but exclude everything else), as in the sketch after this list.
3. rsync does the scan up front, then does the compare (possibly using lots of RAM), then does the transfer, whereas find/grep/copy does it as it goes. This used to be a problem, but rsync ought to do an incremental recursive scan as long as both local and remote versions are 3.0.0 or greater, and you don't use any of the fancy delete or delay options that force an up-front scan (see --recursive in the documentation).
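For reference, a hedged sketch of what the filtered daily run might look like, combining the command from the question with the rules from point 2 (remoteserver is a placeholder for the destination host):
$ rsync -a --update --partial \
      --include '*.jsp' --include '*/' --exclude '*' \
      /usr/home/foldername remoteserver:/destination/server
Add --prune-empty-dirs if you don't want directories that end up containing no .jsp files created on the destination.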

GSUTIL not re-uploading a file that has already been uploaded earlier that day

I'm running GSUTIL v3.42 from a Windows CMD script on a Windows server 2008 R2 using Python 2.7.6. Files to be uploaded arrive in an "outgoing" directory and are uploaded in parallel by GSUTIL to an "incoming" bucket. The script requests a listing of the "incoming" bucket after uploading has finished and then compares the files listed with those it attempted to upload, in order to detect any upload failures. Another separate script moves files from the "incoming" bucket to a "processed" bucket afterwards.
If I attempt to upload the identical file (same name/size/content/date etc.) a second time, it doesn't upload, although I get no errors and nothing in my logging to indicate failure. I am not using the "no clobber" option, so I would expect gsutil to just upload the file.
In the scenario below, assume that the file has been successfully uploaded and moved to the "processed" bucket already on that day. In case timings matter, the second upload is being attempted within half an hour of the first.
1. File A arrives in the "outgoing" directory.
2. I get a file listing of "outgoing" and write this to dirListing.txt.
3. I perform a GSUTIL upload using:
type dirListing.txt | python gsutil -m cp -I -L myGsutilLogFile.txt gs://myIncomingBucket
4. I then perform a GSUTIL listing:
python gsutil ls -l -h gs://myIncomingBucket > bucketListing.txt
5. I compare dirListing.txt and bucketListing.txt to detect mismatches, and hence upload failures.
On the second run, File A isn't being uploaded in step 3 and consequently isn't returned in step 4, causing a mismatch in step 5. [I've checked the content of all of the relevant files and it's definitely in dirListing.txt and not in bucketListing.txt]
I need the ability to re-process a file in case the separate script that moves the file from the "incoming" to the "processed" bucket fails for some reason or doesn't do what it should do. I have to upload in parallel because there are normally hundreds of files on each run.
Is what I've described above expected behaviour from GSUTIL? (I haven't seen anything in the documentation that suggests this) If so, is there any way of forcing GSUTIL to re-attempt the upload? Or am I missing something obvious, please? I have debug output from GSUTIL if that's necessary/useful.
From the above, it looks like you're uploading using "-L" to log to a manifest file. If you're using the same manifest file, and the file has already been uploaded once, then gsutil will not try to re-upload the file. From the docs on "-L" in "gsutil help cp":
If the log file already exists, gsutil will use the file as an
input to the copy process, and will also append log items to the
existing file. Files/objects that are marked in the existing log
file as having been successfully copied (or skipped) will be
ignored.
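So if the script reuses the same manifest file between runs, files it has already uploaded will keep being skipped. One workaround sketch, assuming the manifest is only needed within a single run, is to clear it at the start of the CMD script before the copy:
if exist myGsutilLogFile.txt del myGsutilLogFile.txt
type dirListing.txt | python gsutil -m cp -I -L myGsutilLogFile.txt gs://myIncomingBucket
Alternatively, write each run's manifest to a differently named (e.g. timestamped) log file so earlier runs can't mark files as already copied.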
