Spider a Website and Return URLs Only - grep

I'm looking for a way to pseudo-spider a website. The key is that I don't actually want the content, but rather a simple list of URIs. I can get reasonably close to this idea with Wget using the --spider option, but when piping that output through a grep, I can't seem to find the right magic to make it work:
wget --spider --force-html -r -l1 http://somesite.com | grep 'Saving to:'
The grep filter seems to have absolutely no effect on the wget output. Have I got something wrong, or is there another tool I should try that's more geared toward providing this kind of limited result set?
UPDATE
So I just found out offline that, by default, wget writes to stderr. I missed that in the man pages (in fact, I still haven't found it if it's in there). Once I redirected stderr to stdout, I got closer to what I need:
wget --spider --force-html -r -l1 http://somesite.com 2>&1 | grep 'Saving to:'
I'd still be interested in other/better means for doing this kind of thing, if any exist.

The absolute last thing I want to do is download and parse all of the content myself (i.e. create my own spider). Once I learned that Wget writes to stderr by default, I was able to redirect it to stdout and filter the output appropriately.
wget --spider --force-html -r -l2 $url 2>&1 \
| grep '^--' | awk '{ print $3 }' \
| grep -v '\.\(css\|js\|png\|gif\|jpg\)$' \
> urls.m3u
This gives me a list of the content-resource URIs (resources that aren't images, CSS, or JS files) that are spidered. From there, I can send the URIs off to a third-party tool for processing to meet my needs.
The output still needs to be streamlined slightly (as shown, it produces duplicates), but it's almost there and I haven't had to do any parsing myself.
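One more stage on the end of the same pipeline takes care of the duplicates; this is only a sketch, assuming the order of the URIs doesn't matter (sort -u discards the crawl order; awk '!seen[$0]++' would keep the first occurrence of each line instead):
wget --spider --force-html -r -l2 $url 2>&1 \
| grep '^--' | awk '{ print $3 }' \
| grep -v '\.\(css\|js\|png\|gif\|jpg\)$' \
| sort -u \
> urls.m3u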

Create a few regular expressions to extract the addresses from all
<a href="(ADDRESS_IS_HERE)">.
Here is the solution I would use:
wget -q http://example.com -O - | \
tr "\t\r\n'" ' "' | \
grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' | \
sed -e 's/^.*"\([^"]\+\)".*$/\1/g'
This will output all http, https, ftp, and ftps links from a webpage. It will not give you relative URLs, only absolute ones (see the sketch after the example output below for one way to handle relative links).
Explanation regarding the options used in the series of piped commands:
wget -q makes it not have excessive output (quiet mode).
wget -O - makes it so that the downloaded file is echoed to stdout, rather than saved to disk.
tr is the Unix character translator, used here to translate newlines and tabs into spaces and to convert single quotes into double quotes, which simplifies the regular expressions.
grep -i makes the search case-insensitive.
grep -o makes it output only the matching portions.
sed is the Stream EDitor unix utility which allows for filtering and transformation operations.
sed -e just lets you feed it an expression.
Running this little script on "http://craigslist.org" yielded quite a long list of links:
http://blog.craigslist.org/
http://24hoursoncraigslist.com/subs/nowplaying.html
http://craigslistfoundation.org/
http://atlanta.craigslist.org/
http://austin.craigslist.org/
http://boston.craigslist.org/
http://chicago.craigslist.org/
http://cleveland.craigslist.org/
...
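As a hedged follow-up sketch for the relative links mentioned above (assuming they are written with a leading slash, e.g. href="/about"), the same tr/grep/sed approach can be pointed at quoted paths and the base URL prepended afterwards:
base="http://example.com"
wget -q "$base" -O - | \
tr "\t\r\n'" ' "' | \
grep -i -o '<a[^>]\+href[ ]*=[ \t]*"/[^"]*"' | \
sed -e 's/^.*"\([^"]\+\)".*$/\1/g' | \
sed -e "s|^|$base|"   # prefix the base URL to each extracted path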

I've used a tool called xidel
xidel http://server -e '//a/@href' |
grep -v "http" |
sort -u |
xargs -L1 -I {} xidel http://server/{} -e '//a/@href' |
grep -v "http" | sort -u
A little hackish, but it gets you closer! This only goes one level deep; imagine packing it up into a self-recursive script!
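A minimal sketch of that self-recursive idea, reusing the same xidel -e extraction; the placeholder http://server and the depth limit are mine, not part of the one-liner above:
#!/bin/bash
# Recursively collect relative hrefs up to a fixed depth using xidel.
crawl() {
  local path="$1" depth="$2"
  [ "$depth" -le 0 ] && return
  xidel "http://server/$path" -e '//a/@href' | grep -v "http" | sort -u | \
  while read -r link; do
    echo "$link"
    crawl "$link" $((depth - 1))
  done
}
crawl "" 2 | sort -u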

Related

Dockerhub: listing all available versions of a given image?

I'm looking for a way to list all publicly available versions of an image from Dockerhub. Is there a way this could be achieved?
Specifically, I'm interested in the openjdk:8-jdk-alpine images.
Dockerhub typically only lists the latest version of each image, and there are no links to historic versions. For openjdk, it's currently 8u191-jdk-alpine3.8:
However, it is possible to pull older versions if we know their image digest ID:
openjdk:8-jdk-alpine@sha256:1fd5a77d82536c88486e526da26ae79b6cd8a14006eb3da3a25eb8d2d682ccd6
openjdk:8-jdk-alpine@sha256:c5c705b462abab858066d412b3f871865684d8f837571c98b68e78c505dc7549
With some luck, I was able to find these digests for OpenJDK 8 (Java versions 1.8.0_171 and 1.8.0_151 respectively), by googling openjdk8 alpine digest and looking at github tickets, which included the image digest.
But, is there a systematic way for listing all publicly available digests?
Looking at the docker search documentation, there doesn't seem to be an option for listing image versions, only searching by name.
You don't need digests to pull "old" images; you can use their tags instead (even if they are not displayed on Docker Hub).
I use the following command to retrieve tags of a particular image, parsing the output of https://registry.hub.docker.com/v1/repositories/$REPOSITORY/tags :
REPOSITORY=openjdk # can be "<registry>/<image_name>" ("google/cloud-sdk" for example)
wget -q https://registry.hub.docker.com/v1/repositories/$REPOSITORY/tags -O - | \
jq -r '.[].name'
Result for REPOSITORY=openjdk (1593 tags at the time of writing) looks like:
latest
10
10-ea
10-ea-32
10-ea-32-experimental
10-ea-32-jdk
10-ea-32-jdk-experimental
10-ea-32-jdk-slim
10-ea-32-jdk-slim-experimental
10-ea-32-jre
[...]
If you can't/don't want to install jq (a tool to manipulate JSON), then you could use:
wget -q https://registry.hub.docker.com/v1/repositories/$REPOSITORY/tags -O - | \
sed -e 's/[][]//g' -e 's/"//g' -e 's/ //g' | \
tr '}' '\n' | \
awk -F: '{print $3}'
(I'm pretty sure I got this command from another question, but I can't find where)
You can of course filter the output of this command and keep only tags you're interested in:
wget -q https://registry.hub.docker.com/v1/repositories/$REPOSITORY/tags -O - | \
jq -r '.[].name | select(match("^8.*jdk-alpine"))'
or:
wget -q https://registry.hub.docker.com/v1/repositories/$REPOSITORY/tags -O - | \
jq -r '.[].name' | \
grep -E '^8.*jdk-alpine'
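For repeated use, the same v1 tags endpoint can be wrapped in a small helper; this is only a convenience sketch, and the function name list_tags is my own, not part of any Docker tooling:
# List all tags of a Docker Hub repository via the v1 tags endpoint shown above
list_tags() {
  wget -q "https://registry.hub.docker.com/v1/repositories/$1/tags" -O - | jq -r '.[].name'
}
# Usage: keep only the 8*-jdk-alpine tags
list_tags openjdk | grep -E '^8.*jdk-alpine'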

Linux: Search through sub-folders recursively for a file that contains a string and move it to another file

So far, I have this command on my terminal and it doesn't do anything.
Essentially it's to look for any file that contains the word bango and move it to another directory.
grep -r ".*bango.*" /Users/user/Desktop/drums | xargs mv /Users/user/Desktop/bango
grep has an option to list only the names of matching files; use that instead of printing the matches themselves.
Also, xargs can build commands with a placeholder for each argument.
Try to use
grep -rlE ".*bango.*" /Users/user/Desktop/drums | xargs -I # mv # /Users/user/Desktop/bango
The -E option enables extended regular expressions.
However, a regular expression isn't needed here; you can use grep's faster fixed-string matching instead:
grep -rlF "bango" /Users/user/Desktop/drums | xargs -I # mv # /Users/user/Desktop/bango
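If any of the file names contain spaces, the plain pipe above will split them apart. A hedged variant, assuming your grep supports --null (GNU grep and recent BSD grep do), passes NUL-delimited names to xargs -0:
# NUL-delimit the matching file names so paths with spaces survive the pipe
grep -rlF --null "bango" /Users/user/Desktop/drums | xargs -0 -I # mv # /Users/user/Desktop/bango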

How do I grep for a pattern in which the pattern is a shell-expansion that generates a list?

echo $'one\ntwo\nthree' | grep -F -v $(echo three$'\n'one)
The output should, in theory, be the string two.
I've read that the -F option makes grep treat the pattern as a list of fixed strings, one per line, combined as if joined by 'or'.
The only mistake is some missing double quotes:
echo $'one\ntwo\nthree' | grep -F -v "$(echo three$'\n'one)"
Also, keep in mind that this will also filter out "threesome", "someone", etc., because -F still matches substrings anywhere in a line.
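If that substring behavior is a problem, adding -x (whole-line match) to the same command is a small sketch of a stricter variant:
# -x restricts each fixed string to matching a whole line, so "threesome" would survive
echo $'one\ntwo\nthree' | grep -F -x -v "$(echo three$'\n'one)"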
(@etan-reisner points out that running set -x before the original and the fixed command can be used to observe the difference the double quotes make here and, more generally, is a useful way to debug bash commands.)

How to pass a URL to Wget

If I have a document with many links and I want to download one particular picture, named www.website.de/picture/example_2015-06-15.jpeg, how can I write a command that automatically downloads exactly that one, as extracted from my document?
My idea was this, but I get an error message like "wget: URL is missing":
grep -E 'www.website.de/picture/example_2015-06-15.jpeg' document | wget
Use xargs:
grep etc... | xargs wget
It reads its stdin (grep's output) and passes that text as command-line arguments to whatever command you tell it to run.
For example,
echo hello | xargs echo 'from xargs '
produces:
from xargs hello
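A hedged, concrete version for this question (using grep -o so only the URL itself, rather than the whole matching line, is handed over) might look like:
# -o prints only the matched URL; xargs -n 1 runs one wget per URL found
grep -oE 'www\.website\.de/picture/example_2015-06-15\.jpeg' document | xargs -n 1 wget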
Using backticks would be the easiest way of doing it:
wget `grep -E 'www.website.de/picture/example_2015-06-15.jpeg' document`
This will do too:
wget "$(grep -E 'www.website.de/picture/example_2015-06-15.jpeg' document)"

How do I use tshark to print request-response pairs from a pcap file?

Given a pcap file, I'm able to extract a lot of information from the reconstructed HTTP request and responses using the neat filters provided by Wireshark. I've also been able to split the pcap file into each TCP stream.
The trouble I'm running into now is that, of all the cool filters I'm able to use with tshark, I can't find one that will let me print out full request/response bodies. I'm calling something like this:
tshark -r dump.pcap -R "tcp.stream==123 and http.request" -T fields -e http.request.uri
Is there some filter name I can pass to -e to get the request/response body? The closest I've come is to use the -V flag, but it also prints out a bunch of information I don't necessarily want and would rather not have to kludge out with a "dumb" filter.
If you are willing to switch to another tool, tcptrace can do this with the -e option. It also has an HTTP analysis extension (the -xHTTP option) that generates the HTTP request/response pairs for each TCP stream.
Here is a usage example:
tcptrace --csv -xHTTP -f'port=80' -lten capturefile.pcap
--csv to format the output as comma-separated values
-xHTTP for HTTP request/response pairs, written to 'http.times'; this also switches on -e to dump the TCP stream payloads, so you don't really need -e as well
-f'port=80' to filter out non-web traffic
-l for long output form
-t to give me progress indication
-n to turn off hostname resolution (much faster without this)
If you captured a pcap file, you can do the following to show all requests+responses.
filename="capture_file.pcap"
for stream in `tshark -r "$filename" -2 -R "tcp and (http.request or http.response)" -T fields -e tcp.stream | sort -n | uniq`; do
echo "==========BEGIN REQUEST=========="
tshark -q -r "$filename" -z follow,tcp,ascii,$stream;
echo "==========END REQUEST=========="
done;
I just made diyism's answer a bit easier to understand (you don't need sudo, and a multi-line script is, in my opinion, easier to read).
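If you'd rather keep each request/response conversation in its own file instead of one long dump, a small variation of the same loop could redirect each stream (the stream_$stream.txt naming is just my own choice):
filename="capture_file.pcap"
for stream in `tshark -r "$filename" -2 -R "tcp and (http.request or http.response)" -T fields -e tcp.stream | sort -n | uniq`; do
  # one file per TCP stream, e.g. stream_12.txt
  tshark -q -r "$filename" -z follow,tcp,ascii,$stream > "stream_${stream}.txt"
done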
This probably wasn't an option when the question was asked but newer versions of tshark can "follow" conversations.
tshark -nr dump.pcap -qz follow,tcp,ascii,123
I know this is a super old question. I'm just adding this for anyone that ends up here looking for a current solution.
I use this line to show the request and response bodies from the last 10 seconds (https://gist.github.com/diyism/eaa7297cbf2caff7b851):
sudo tshark -a duration:10 -w /tmp/input.pcap;for stream in `sudo tshark -r /tmp/input.pcap -R "tcp and (http.request or http.response) and !(ip.addr==192.168.0.241)" -T fields -e tcp.stream | sort -n | uniq`; do sudo tshark -q -r /tmp/input.pcap -z follow,tcp,ascii,$stream; done;sudo rm /tmp/input.pcap