I want to download all accessible HTML files under www.site.com/en/. However, the site has a lot of linked URLs with query-string parameters (e.g. pages 1, 2, 3... for each product category), and I want wget NOT to download these links. I'm using
-R "*\?*"
But it's not perfect, because each file is only removed after it has been downloaded.
Is there some way, for example, to filter the links wget follows with a regex?
It is possible to avoid those files with a regex: use --reject-regex '(.*)\?(.*)'. Note that this option only works with wget version 1.15 or later, so I would recommend checking your wget version first.
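For example, against the site from the question the whole invocation might look like this (an untested sketch; add whatever other options you already use):
$ wget --recursive --no-parent --reject-regex '(.*)\?(.*)' http://www.site.com/en/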
Problem outline
I'm trying to get all the files from a URL: https://archive-gw-1.kat.ac.za/public/repository/10.48479/7epd-w356/data/basic_products/bucket_contents.html
which appears to be a list of contents of an S3 bucket with associated download links.
When I attempt to download all the files with the extension *.jpeg, I'm simply returned the directory structure leading up to a subdirectory, with no downloaded files.
Things I've tried
To do this I've tried all the variations of leading parameters for:
$ wget -r -np -A '*.jpeg' https://archive-gw-1.kat.ac.za/public/repository/10.48479/7epd-w356/data/basic_products/
...that I can think of, but none have actually downloaded the jpeg files.
If you provide the path to a specific file e.g.
$ wget https://archive-gw-1.kat.ac.za/public/repository/10.48479/7epd-w356/data/basic_products/Abell_133_hi.jpeg
...the files can be downloaded, which would surely suggest that I'm mishandling the wildcard aspect of the download?
Thoughts (which could be wrong, owing to my limited knowledge of wget and website protocols)
Could the fact that the contents are listed in a bucket_contents.html rather than an index.html be causing problems?
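If that is the issue, one untested variation is to start the recursion from the bucket_contents.html page itself, so wget has an HTML listing to parse for links (this is only a sketch; the -nd and -e robots=off flags are assumptions, not part of the original attempt):
$ wget -r -np -nd -A '*.jpeg' -e robots=off https://archive-gw-1.kat.ac.za/public/repository/10.48479/7epd-w356/data/basic_products/bucket_contents.html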
The man page for tar uses the word "dump" and its forms several times. What does it mean? For example (manual page for tar 1.26):
"-h, --dereferencefollow symlinks; archive and dump the files they point to"
Many popular systems have a "trash can" or "recycle bin." I don't want the files dumped there, but it kind of sounds that way.
At present, I don't want tar to write or delete any file, except that I want tar to create or update a single tarball.
FYI, the man page for the tar installed on the system I am using at the moment is a lot shorter than what appears to be the current version. And the description of -h, --dereference there seems very different to me:
"When reading or writing a file to be archived, tar accesses the file that a symbolic link points to, rather than the symlink itself. See section Symbolic Links."
File system backups are also called dumps.
— Raymond Chen, quoting the GNU tar manual
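In other words, as the quoted manual text suggests, "dump" here just means writing file contents into the archive: with --dereference the contents of the symlink targets are dumped into the tarball instead of the symlinks themselves, and nothing outside the tarball is written or deleted. A minimal sketch (hypothetical file names):
$ ln -s /etc/hostname link-to-hostname
$ tar -chf backup.tar link-to-hostname   # -h: archive the file the symlink points to
$ tar -tvf backup.tar                    # the member is listed as a regular file, not a symlink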
I am completing this tutorial and am at the part where you download the code for the tutorial. The request we send to Github is:
wget https://github.com/GoogleCloudPlatform/cloudml-samples/archive/master.zip
I understand that this downloads the archive to GCP, and I can see the files in the Cloud Shell, but is there a way to see the files through the Google Console GUI? I would like to browse the files I have downloaded to understand their structure better.
Clicking the pencil icon in the top right corner opens the Cloud Shell code editor.
Quoting the documentation:
"The built-in code editor is based on Orion. You can use the code
editor to browse file directories as well as view and edit files, with
continued access to the Cloud Shell. The code editor is available by
default with every Cloud Shell instance."
You can find more info here: https://cloud.google.com/shell/docs/features#code_editor
If you prefer to use the command line to view files, you can install the tree Unix CLI command and run it in Cloud Shell to list the contents of directories in a tree-like format.
Install tree: $ sudo apt-get install tree
Run it: $ tree ./ -h --filelimit 4
-h shows human-readable sizes for files/directories,
and --filelimit sets the maximum number of entries a directory may contain before tree stops descending into it.
Use $ man tree to see the available parameters for the command, or check the online man page: https://linux.die.net/man/1/tree
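Putting it together for the tutorial archive (a sketch; it assumes unzip is available in your Cloud Shell instance, as it normally is, and that the zip unpacks into cloudml-samples-master/):
$ wget https://github.com/GoogleCloudPlatform/cloudml-samples/archive/master.zip
$ unzip -q master.zip
$ tree cloudml-samples-master/ -h --filelimit 4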
There is a web page www.somepage.com/images/
I know some of the images there (e.g. www.somepage.com/images/cat_523.jpg, www.somepage.com/images/dog_179.jpg)
I know there are some more, but I don't know the names of those photos. How can I scan the whole /images/ folder?
You can use wget to download all the files:
--no-parent keeps it from ascending above the starting directory, so it only grabs files below it in the directory hierarchy;
--recursive makes it look into subfolders.
wget --recursive --no-parent -A jpeg,jpg,bmp,gif,png http://example.com/
If the images appear on the page as img tags, you could also try searching the page source for img tags. From the terminal, you could use a tool such as wget to download the web page and then grep the file for img tags.
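A minimal sketch of that approach (assuming the listing page at www.somepage.com/images/ actually returns HTML; the pattern may need adjusting):
$ wget -qO page.html http://www.somepage.com/images/
$ grep -oE '<img[^>]+src="[^"]+"' page.html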
I have a site (http://a-site.com) with many "Follow" links like that, i.e. links whose URL ends in /follow_user. How can I use wget to crawl the site and grep all similar links into a file?
I tried this, but the command only gets me the similar links on one page and does not recursively follow other links to find more:
$ wget -erobots=off --no-verbose -r --quiet -O - http://a-site.com 2>&1 | \
grep -o '['"'"'"][^"'"'"']*/follow_user['"'"'"]'
You may want to use the --accept-regex option of wget rather than piping through grep:
wget -r --accept-regex '['"'"'"][^"'"'"']*/follow_user['"'"'"]' http://a-site.com
(not tested, the regex may need adjustment or specification of --regex-type (see man wget), and of course add other options you find useful).
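If the goal is a list of matching URLs in a file rather than a copy of the pages, another untested variation is to crawl in spider mode and grep wget's log output (the exact log format and pattern are assumptions and may need tweaking):
$ wget -r -erobots=off --spider --no-verbose http://a-site.com 2>&1 | grep -oE 'https?://[^ ]*/follow_user[^ "]*' | sort -u > follow_links.txt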