Merge these wget & egrep commands for recursive download of sitemap

I am trying to find a way to make these work together. I can run this successfully using Wget for Windows:
wget --html-extension -r http://www.sitename.com
This downloads every file on my server that is directly linked from the root domain. I'd rather download only the pages in my sitemap. For that, I found the following trick, which uses Cygwin:
wget --quiet https://www.sitename.com/sitemap.xml --output-document - | egrep -o \
"http://www\.sitename\.com[^<]+" | wget --spider -i - --wait 1
However, this only checks that the pages exist; it does not download them as static HTML files the way the first wget command does.
Is there a way to merge these and download the sitemap pages as local HTML files?

If you look at the man page for wget, you will see that the --spider entry is as follows:
--spider
When invoked with this option, Wget will behave as a Web spider, which means that it will not download the pages, just check that they are there.
All you need to do to actually download the pages is remove --spider from your command:
wget --quiet https://www.sitename.com/sitemap.xml --output-document - | egrep -o \
"https?://www\.sitename\.com[^<]+" | wget -i - --wait 1

Related

Jenkins check file exists inside zip file

Is there a way to check whether a file exists inside a zip file without unzipping it? I'm using Artifactory, and curl alone hasn't worked for me. Can you advise?
I tried the following:
sh(script: "curl -o /dev/null -sf <artifactory url>")
but this always returns success. And this:
unzip -l <file.zip> | grep -q <file name>
works, but requires unzip to be installed.
As mentioned on this page, archive search is disabled by default from Artifactory 7.15.3 onward. Can you confirm whether your version is above this? If so, you can enable the feature by navigating to Admin > Artifactory > Archive Search Enabled and ticking the checkbox. But be aware that, with this feature enabled, Artifactory writes a lot of information to the database for every archive file, which can impact performance.
You can then search for items inside a zip file. Below is an example command where I am searching for .class files in my zip with curl. You can do something similar in Jenkins.
$ curl -uadmin:password -X GET "http://localhost:8082/artifactory/api/search/archive?name=*.class&repos=test" -H "Content-type: application/json"
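In a pipeline you could turn that search into a pass/fail check by testing whether the response actually lists the entry you asked for; a sketch reusing the host and credentials from the example above (MyClass.class is an illustrative name, and an empty result simply won't contain it):
$ curl -s -uadmin:password "http://localhost:8082/artifactory/api/search/archive?name=MyClass.class&repos=test" | grep -q 'MyClass.class' && echo found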
You can make use of the shell commands unzip and grep.
unzip -l my_file.zip | grep -q file_to_search
# Example
$ unzip -l 99bottles.zip | grep -q 99bottles.pdf && echo $?
0
P.S. If the zip contains a directory structure, grep for the full path of the file name.
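For instance (a sketch with illustrative names), if the archive stores the file under a subdirectory, match the full path as listed by unzip -l:
$ unzip -l release.zip | grep -q 'docs/report.pdf' && echo present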

wget command not found in git bash

I've already tried pip install wget in my cmd, which reads
>pip install wget
Requirement already satisfied: wget in c:\users\user\...\python\python38-32\lib\site-packages (3.2)
however, when I try the command in Git Bash, it keeps showing
$ wget
bash: wget: command not found
I've made sure both the Python and Git directories are in PATH.
What am I doing wrong here?
If you would like to use curl on Git Bash, here is an example:
$ curl -kLSs https://github.com/opscode/chef-repo/tarball/master -o master.tar.gz
$ ls master.tar.gz
master.tar.gz
-L follow redirects
-o (lower-case o) write output to a file instead of stdout
-Ss silent mode, but show errors, if any
-k allow curl to proceed even for server connections otherwise considered insecure
Reference: curl manpage.
With the command:
pip install wget
you installed this Python library https://pypi.org/project/wget/, so you can use that from inside Python:
import wget
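That library works, but only from inside Python; for instance, you could call it from Git Bash like this (the URL is illustrative):
$ python -c "import wget; wget.download('https://example.com/file.txt')"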
I imagine what you actually want is to be able to use wget from inside Git Bash. To do that, install Wget for Windows and add the executable to the path. Or, alternatively, use curl.
If you are just looking to have wget in Git Bash without pip or any other dependency, you can follow the nice and quick tutorial on this page:
How to add more to Git Bash on Windows
The essence of it is:
Download the wget binaries for Windows from eternallybored (preferably as ZIP)
extract wget.exe from the zip
copy the EXE file to your Git Bash binaries folder, e.g. "c:\Program Files\Git\mingw64\bin"
done :)
Quick and dirty replacement for the single-argument, fetch-a-file use case:
alias wget='curl -O'
-O, --remote-name Write output to a file named as the remote file
Maybe give the alias a different name so you don't try to use wget flags in curl.
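To make the alias persist across sessions, append it to your ~/.bashrc under that different name (wetter here is just an illustrative choice):
$ echo "alias wetter='curl -O'" >> ~/.bashrc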

How to add a file to an image in Dockerfile without using the ADD or COPY directive

I need the contents of a large *.zip file (5 GB) in my Docker container in order to compile a program. The *.zip file resides on my local machine. The strategy for this would be:
COPY program.zip /tmp/
RUN cd /tmp \
&& unzip program.zip \
&& make
After having done this I would like to remove the unzipped directory and the original *.zip file because they are not needed any more. The problem is that the COPY (and also the ADD) directive adds a layer to the image containing program.zip, which is problematic as my image will then be at least 5 GB big. Is there a way to add a file to a container without using the COPY or ADD directive? wget will not work as the mentioned *.zip file is on my local machine, and curl file://localhost/home/user/program.zip -o /tmp/program.zip will not work either.
It is not straightforward, but it can be done via wget or curl with a little support from Python. (All three tools should usually be available on a *nix system.)
wget will not work when no URL is given, and
curl file://localhost/home/user/program.zip -o /tmp/
will not work from within a Dockerfile's RUN instruction. Hence, we need a server which wget and curl can access and download program.zip from.
To do this we set up a little Python server to serve our HTTP requests, using Python's http.server module. (Under Python 2 the equivalent module is SimpleHTTPServer; the examples below assume Python 3.)
python -m http.server --bind 192.168.178.20 8000
The Python server serves all files in the directory it is started in, so you should either start the server in the directory that contains the file you want to download during your image build, or create a temporary directory containing your program. For illustration, let's create the file foo.txt, which we will later download via wget in our Dockerfile:
echo "foo bar" > foo.txt
When starting the HTTP server, it is important that we bind it to the IP address of our local machine on the LAN (192.168.178.20 in these examples; substitute your own). Furthermore, we open port 8000. Having done this, we should see the following output:
python3 -m http.server --bind 192.168.178.20 8000
Serving HTTP on 192.168.178.20 port 8000 ...
Now we write a Dockerfile to illustrate how this works. (We will assume that the file foo.txt should be downloaded into /tmp.)
FROM debian:latest
RUN apt-get update -qq \
&& apt-get install -y wget
RUN cd /tmp \
&& wget http://192.168.178.20:8000/foo.txt
Now we start the build with
docker build -t test .
During the build you will see the following output on our python server:
172.17.0.21 - - [01/Nov/2014 23:32:37] "GET /foo.txt HTTP/1.1" 200 -
and the build output of our image will be:
Step 2 : RUN cd /tmp && wget http://192.168.178.20:8000/foo.txt
---> Running in 49c10e0057d5
--2014-11-01 22:56:15-- http://192.168.178.20:8000/foo.txt
Connecting to 192.168.178.20:8000... connected.
HTTP request sent, awaiting response... 200 OK
Length: 25872 (25K) [text/plain]
Saving to: `foo.txt'
0K .......... .......... ..... 100% 129M=0s
2014-11-01 22:56:15 (129 MB/s) - `foo.txt' saved [25872/25872]
---> 5228517c8641
Removing intermediate container 49c10e0057d5
Successfully built 5228517c8641
You can then check whether it really worked by starting and entering a container from the image you just built:
docker run -i -t --rm test bash
You can then look in /tmp for foo.txt.
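Alternatively, a one-off check without an interactive shell:
$ docker run --rm test ls -l /tmp/foo.txt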
We can now add any file to our image without the file itself being baked into a layer, because the download, build, and cleanup all happen inside a single RUN instruction. Assuming you want to add a program of about 5 GB, as mentioned in the question, we could do:
FROM debian:latest
RUN apt-get update -qq \
&& apt-get install -y wget
RUN cd /tmp \
&& wget http://conventiont:8000/program.zip \
&& unzip program.zip \
&& cd program \
&& make \
&& make install \
&& cd /tmp \
&& rm -f program.zip \
&& rm -rf program
In this way we will not be left with 10 GB of cruft.
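You can confirm that no layer retained the archive by inspecting the per-layer sizes of the finished image:
$ docker history test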
There's no built-in way to do this; there is a feature request at https://github.com/docker/docker/issues/3156.
Could you not map a local folder into the container when it is launched, and then copy the files you need?
sudo docker run -d -P --name myContainerName -v /localpath/zip_extract:/container/path/ yourImageName
https://docs.docker.com/userguide/dockervolumes/
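With the volume mounted, the container can then unpack the archive straight from the mount, so it never enters an image layer (paths follow the example above, and program.zip is assumed to sit in the host folder):
$ sudo docker exec myContainerName unzip /container/path/program.zip -d /tmp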
I have posted a similar answer here: https://stackoverflow.com/a/37542913/909579
You can use docker-squash to squash newly created layers. That will essentially remove the archive from final image if you remove it in subsequent RUN instruction.

Getting Wget and JQL to run

We are trying to run the command:
wget -O xyz.xls --user=COldPolar --password=GlacierICe --ignore-length=on "http://Colder.near.com:8080/sr/jira.issueviews:searchrequest-excel-current-fields/temp/SearchRequest.xls?&runQuery=true(jqlQuery=project%3DCCD)&tempMax=1000"
This returns a 3 KB output.
If we open IE and use the URL "http://Colder.near.com:8080/sr/jira.issueviews:searchrequest-excel-current-fields/temp/SearchRequest.xls?&runQuery=true(jqlQuery=project%3DCCD)&tempMax=1000", we can save a 1.7 MB file. Please advise how to get wget to work.
If you can use cURL you can do:
curl -o xyz.xls -u COldPolar:GlacierICe 'http://Colder.near.com:8080/sr/jira.issueviews:searchrequest-excel-current-fields/temp/SearchRequest.xls?&runQuery=true(jqlQuery=project%3DCCD)&tempMax=1000'
The way I managed to get wget to work was to do this first:
wget --save-cookies cookies.txt --post-data 'os_username=COldPolar&os_password=GlacierICe&os_cookie=true' http://Colder.near.com:8080/login.jsp
And then:
wget -O xyz.xls --load-cookies cookies.txt "http://Colder.near.com:8080/sr/jira.issueviews:searchrequest-excel-current-fields/temp/SearchRequest.xls?&runQuery=true(jqlQuery=project%3DCCD)&tempMax=1000"
One of the things to watch out for is a self-signed certificate. You can rule this out by running with --no-check-certificate.
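If the download still comes back suspiciously small, a quick way to tell whether you received an HTML login page instead of the spreadsheet is to look at the first bytes of the saved file:
$ head -c 200 xyz.xls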

Grep from wget without saving files

I am trying to download a site (with permission) and grep for a particular text in it. The problem is that I want to grep on the fly, without saving any files to the local drive. The following command does not help:
wget --mirror site.com -O - | grep TEXT
The wget manual (man page) says the usage of the command should be:
wget [option]... [URL]...
In your case, that would be:
wget --mirror -O - site.com | grep TEXT
You can use curl:
curl -s http://www.site.com | grep TEXT
How about this:
wget -qO- site.com | grep TEXT
or:
curl -vs site.com 2>&1 | grep TEXT
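Note that --mirror is designed to write files; to sweep a whole site without touching disk, one workable sketch (assuming the site publishes a sitemap.xml; names are illustrative) is to walk the URL list and grep each page as it streams:
$ curl -s https://site.com/sitemap.xml | grep -oE 'https?://[^<]+' | while read -r url; do curl -s "$url" | grep -H --label="$url" TEXT; done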
