How do I get the URLs out of an HTML file?

I need to get a long list of valid URLs for testing my DNS server. I found a web page that has a ton of links in it that would probably yield quite a lot of good links (http://www.cse.psu.edu/~groenvel/urls.html), and I figured that the easiest way to do this would be to download the HTML file and simply grep for the URLs. However, I can't get it to output only the links.
I know there are lots of ways to do this. I'm not picky how it's done.
Given the URL above, I want a list of all of the URLs (one per line) like this:
http://www.cse.psu.edu/~groenvel/
http://www.acard.com/
http://www.acer.com/
...

Method 1
Step 1:
wget "http://www.cse.psu.edu/~groenvel/urls.html"
Step 2:
perl -0ne 'print "$1\n" while (/a href=\"(.*?)\">.*?<\/a>/igs)' /PATH_TO_YOUR/urls.html | grep 'http://' > /PATH_TO_YOUR/urls.txt
Just replace "/PATH_TO_YOUR/" with your file path. This yields a text file containing only URLs.
Method 2
If you have lynx installed you could do this in one step:
Step 1:
lynx --dump http://www.cse.psu.edu/~groenvel/urls.html | awk '/(http|https):\/\// {print $2}' > /PATH_TO_YOUR/urls.txt
Method 3
Using curl:
Step 1:
curl -s http://www.cse.psu.edu/~groenvel/urls.html | egrep -o "(http|https):.*\">" | awk 'BEGIN {FS="\""};{print $1}' > /PATH_TO_YOUR/urls.txt
The -s flag keeps curl's progress meter off the terminal so only the page body goes down the pipe.
Method 4
Using wget:
wget -qO- http://www.cse.psu.edu/~groenvel/urls.html | egrep -o "(http|https):.*\">" | awk 'BEGIN {FS="\""};{print $1}' > /PATH_TO_YOUR/urls.txt
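All four methods hinge on the same idea: fetch the page, then cut each href value out of the markup. If your grep supports -P (GNU grep does; BSD/macOS grep does not), the extraction step can be a single pattern. A sketch on some stand-in markup, since it hedges on PCRE support:

```shell
# Sample markup standing in for the downloaded page (hypothetical links).
html='<a href="http://www.acard.com/">ACARD</a> <a href="https://www.acer.com/">Acer</a>'

# \K resets the start of the reported match, so only the URL itself is printed;
# [^"]* stops at the closing quote of the href attribute.
urls=$(printf '%s\n' "$html" | grep -oP 'href="\K(http|https)://[^"]*')
printf '%s\n' "$urls"
```

This prints one URL per line with no surrounding markup, so it can replace the grep-plus-awk stages in Methods 3 and 4.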

You need wget, grep, and sed.
I will try a solution and update my post later.
Update:
wget [the_url];
cat urls.html | egrep -i '<a href=".*">' | sed -e 's/.*<a href="\([^"]*\)">.*/\1/I'
The [^"]* keeps the match from greedily running past the first closing quote on the line.


grep -o search stop at first instance of second expression, rather than last? Greedy?

Not sure how to phrase this question.
This is an example line.
30/Oct/2019:00:17:22 +0000|v1|177.95.140.78|www.somewebsite.com|200|162512|-|-|0.000|GET /product/short-velvet-cloak-with-hood/?attribute_pa_color=dark-blue&attribute_pa_accent-color=gold&attribute_pa_size=small HTTP/1.0|0|0|-
I need to extract attribute_pa_color=
So I have
cat somewebsite.access.log.2.csv | grep -o "?.*=" > just-parameters.txt
Which works, but if there are multiple parameters in the URL it returns all of them.
So instead of stopping the match at the first instance of "=", it's taking the last instance of "=" in the line.
How can I make it stop at the first?
I tried this
cat somewebsite.access.log.2.csv | grep -o "?(.*?)=" > just-parameters2.txt
cat somewebsite.access.log.2.csv | grep -o "\?(.*?)=" > just-parameters2.txt
Both return nothing
Also I need each unique parameter so once I created the file I ran
sort just-parameters.txt | uniq > clean.txt
Which does not appear to work. Is it possible to remove duplicates and have it be part of the same command?
You can try something like this with awk:
awk -F'[?&]' '{print $2}' somewebsite.access.log.2.csv|sort -u > clean.txt
This will work if attribute_pa_color is the first parameter in the URL.
If you want to extract only the text attribute_pa_color= you can try something like:
awk -F'[?&]' '{print $2}' somewebsite.access.log.2.csv|awk -F\= '{print $1"="}'|sort -u > clean.txt
Instead of using a second awk you can try something like:
awk -F'[?&]' '{split($2,a,"=");print a[1]}' somewebsite.access.log.2.csv|sort -u > clean.txt
This splits internally in awk using "=" as the delimiter (the delimiter must be quoted, or awk treats it as an empty variable).
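grep itself can also stop at the first "=" without Perl regex support: a character class that excludes "=" can stand in for the non-greedy match, since [^=]* cannot cross an "=". A sketch on a line in the question's log format:

```shell
# A sample request field in the access-log format from the question.
line='GET /product/cloak/?attribute_pa_color=dark-blue&attribute_pa_size=small HTTP/1.0'

# '?' is a literal in grep's default (basic) regex; [^=]* cannot contain '=',
# so the match ends at the first '=' after the '?'.
param=$(printf '%s\n' "$line" | grep -o '?[^=]*=')
printf '%s\n' "$param"
```

Piping that through sort -u answers the second part of the question (dedupe in the same command) just like the awk versions above.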

grep - Get word from string

I have a bunch of strings that I have to fetch the 'port_num' from -
"76 : client=new; tags=circ, LINK; port_num=switch01; far_port=Gi1/0"
The word might be in a different place in the string and it might be a different length, but it always says 'port_num=' before it and ';' after it...
I only want this bit- 'switch01'
Currently I use-
| grep -Eo 'port_num=.+' | cut -d"=" -f2 | cut -d";" -f1
But there has got to be a better way
You can try grep -oP '(?<=port_num=).+(?=;)', if you run this:
echo "76 : client=new; tags=circ, LINK; port_num=switch01; far_port=Gi1/0" \
| grep -oP '(?<=port_num=).+(?=;)'
result will be:
switch01
Updated answer: grep -oP '(?<=port_num=)[^;]+(?=;)'
This is what I would use:
... | grep -E 'port_num=.+' | sed 's/^.*port_num=\([^;]*\).*$/\1/'
This works with or without the -o on grep, and the availability of -P will depend on the version of grep you have. (e.g., my grep does not have it). I'm not saying the other answers that rely on -P aren't any good -- they look fine to me. But grep -P will be less portable.
IMHO, piping grep with sed allows each utility to do what it specializes in -- grep is for selecting lines, sed is for modifying lines.
This can be done in a simple sed command:
s="76 : client=new; tags=circ, LINK; port_num=switch01; far_port=Gi1/0"
sed 's/.*port_num=\([^;]*\);.*/\1/' <<< "$s"
switch01
... | grep -Po 'port_num.+(?=;)'
This uses grep's Perl Compatible Regular Expression (PCRE) syntax. The (?=;) is a look-ahead assertion which looks for a match with ";" but doesn't include it in the matched output.
This produces:
port_num=switch01
As @Vladimir Kovpak noted, if you want to exclude the "port_num=" string from this output, add a look-behind assertion:
... | grep -Po '(?<=port_num).+(?=;)'
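Another PCRE option is \K, which discards everything matched before it instead of using a look-behind; unlike a look-behind, the part before \K may be variable-length. A sketch on the sample string from the question (still assumes a grep with -P support):

```shell
s="76 : client=new; tags=circ, LINK; port_num=switch01; far_port=Gi1/0"

# 'port_num=' is matched but dropped from the output by \K;
# [^;]+ stops the match at the following semicolon.
port=$(printf '%s\n' "$s" | grep -oP 'port_num=\K[^;]+')
printf '%s\n' "$port"
```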

Search file for usernames, and sort number of instances for each user in file?

I am tasked with taking a file that has line entries that include the string username=xxxx:
$ cat file.txt
Yadayada username=jdoe blablabla
Yadayada username=jdoe blablabla
Yadayada username=jdoe blablabla
Yadayada username=dsmith blablabla
Yadayada username=dsmith blablabla
Yadayada username=sjones blablabla
And finding how many times each user in the file shows up, which I can do manually by feeding username=jdoe for example:
$ grep -r "username=jdoe" file.txt | wc -l | tr -d ' '
3
What's the best way to report each user in the file, and the number of lines for each user, sorted from highest to lowest instances:
3 jdoe
2 dsmith
1 sjones
Been thinking of how to approach this, but drawing blanks, figured I'd check with our gurus on this forum. :)
TIA,
Don
In GNU awk:
$ awk '
BEGIN { RS="[ \n]" }
/=/ {
split($0,a,"=")
u[a[2]]++ }
END {
PROCINFO["sorted_in"]="#val_num_desc"
for(i in u)
print u[i],i
}' file
3 jdoe
2 dsmith
1 sjones
Using grep :
$ grep -o 'username=[^ ]*' file | cut -d "=" -f 2 | sort | uniq -c | sort -nr
Awk alone:
awk '
{sub(/.*username=/,""); sub(/ .*/,"")}
{a[$0]++}
END {for(i in a) printf "%d\t%s\n",a[i],i | "sort -nr"}
' file.txt
This uses awk's sub() function to achieve what grep -o does in other answers. It embeds the call to sort within the awk script. You could of course use that pipe after the awk script rather than within it if you prefer.
Oh, and unlike the other awk solutions presented here, this one is portable to non-GNU-awk environments (like BSD and macOS), and it doesn't depend on the username being in a predictable location on each line (i.e., $2).
Why might awk be a better choice than simpler tools like uniq? It probably wouldn't, for a super simple requirement like this. But good to have in your toolbox if you want something with the capability of a little more text processing.
Using sed, uniq, and sort:
sed 's/.*username=\([^ ]*\).*/\1/' file.txt | sort | uniq -c | sort -nr
If there are lines without usernames:
sed -n 's/.*username=\([^ ]*\).*/\1/p' input | sort | uniq -c | sort -nr
$ awk -F'[= ]' '{print $3}' file | sort | uniq -c | sort -nr
3 jdoe
2 dsmith
1 sjones
The following awk may help you too.
awk -F"[ =]" '{a[$3]++} END{for(i in a){print a[i],i | "sort -nr"}}' Input_file
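If GNU grep is available, -oP with \K gives a compact alternative to the sed and awk pipelines above. A sketch on the sample data from the question; \S+ assumes a username never contains whitespace:

```shell
# Recreate the sample lines and count usernames, highest first.
counts=$(printf '%s\n' \
  'Yadayada username=jdoe blablabla' \
  'Yadayada username=jdoe blablabla' \
  'Yadayada username=jdoe blablabla' \
  'Yadayada username=dsmith blablabla' \
  'Yadayada username=dsmith blablabla' \
  'Yadayada username=sjones blablabla' |
  grep -oP 'username=\K\S+' |   # keep only the name after 'username='
  sort | uniq -c | sort -nr)    # uniq -c needs sorted input; then sort by count
printf '%s\n' "$counts"
```

The output matches the requested format (count, then name), modulo the leading spaces uniq -c pads with.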

How to grep out substring which can change?

Basically I have a very large text file and each line contains
tag=yyyyy;id=xxxxx;db_ref=zzzzz;
What I want is to grep out the id, but the id can change in length and form. I was wondering if it's possible to use grep -o and then grep for "id=" and extract everything that comes after it until the semicolon?
You could do:
$ grep -o 'id=[^;]*' file
And if you don't want to include the id= part you can use a positive look-behind:
$ grep -Po '(?<=id=)[^;]*' file
Try:
grep -Po "(?<=id=)[^;]*" file
Via grep:
grep -o 'id=[^;]*'
Via awk:
awk -F';' '{ print $2}' testlog
id=xxxxx
Edit: see sudo_O's answer for the look-behind. It's more to the point of your question, IMO.
You could try this awk. It should also work if there are multiple id= entries per line and it would not give a false positive for ...;pid=blabla;...
awk '/^id=/' RS=\; file
Try the following:
grep -oP 'id=\K[^;]*' file
perl -lne 'print $1 if(/id=([^\;]*);/)' your_file
tested:
> echo "tag=yyyyy;id=xxxxx;db_ref=zzzzz; "|perl -lne 'print $1 if(/id=([^\;]*);/)'
xxxxx
>
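A lookaround-free take on the same idea, for systems without grep -P: split the line on semicolons first, then select and trim the id field. A sketch with tr, grep, and cut:

```shell
line='tag=yyyyy;id=xxxxx;db_ref=zzzzz;'

# One key=value pair per line, keep the id pair, drop the 'id=' prefix.
id=$(printf '%s\n' "$line" | tr ';' '\n' | grep '^id=' | cut -d= -f2)
printf '%s\n' "$id"
```

Anchoring on ^id= also avoids false positives on fields that merely end in "id=" (e.g. a hypothetical db_id= field).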

Simple Grep Issue

I am trying to parse items out of a file I have. I can't figure out how to do this with grep.
Here is the syntax:
<FQDN>Compname.dom.domain.com</FQDN>
<FQDN>Compname1.dom.domain.com</FQDN>
<FQDN>Compname2.dom.domain.com</FQDN>
I want to spit out just the bits between the > and the <.
Can anyone assist?
Thanks
grep can do some text extraction. However, I'm not sure if this is what you want:
grep -Po "(?<=>)[^<]*"
Test:
kent$ echo "<FQDN>Compname.dom.domain.com</FQDN>
dquote>
dquote> <FQDN>Compname1.dom.domain.com</FQDN>
dquote>
dquote> <FQDN>Compname2.dom.domain.com</FQDN>"|grep -Po "(?<=>)[^<]*"
Compname.dom.domain.com
Compname1.dom.domain.com
Compname2.dom.domain.com
Grep isn't what you are looking for.
Try sed with a regular expression: http://unixhelp.ed.ac.uk/CGI/man-cgi?sed
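A sed version of that suggestion could look like this (a sketch; the capture group keeps whatever sits between the FQDN tags, and -n with the p flag prints only lines where the substitution matched):

```shell
# The three sample lines from the question, fed via a pipe.
names=$(printf '%s\n' \
  '<FQDN>Compname.dom.domain.com</FQDN>' \
  '<FQDN>Compname1.dom.domain.com</FQDN>' \
  '<FQDN>Compname2.dom.domain.com</FQDN>' |
  sed -n 's/.*<FQDN>\([^<]*\)<\/FQDN>.*/\1/p')
printf '%s\n' "$names"
```

Unlike the grep -P answers, this works with any POSIX sed, so it is the portable choice on BSD or Solaris.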
You can do it like you want with grep:
grep -oP '<FQDN>\K[^<]+' FILE
Output:
Compname.dom.domain.com
Compname1.dom.domain.com
Compname2.dom.domain.com
As others have said, grep is not the ideal tool for this. However:
$ echo '<FQDN>Compname.dom.domain.com</FQDN>' | egrep -io '[a-z]+\.[^<]+'
Compname.dom.domain.com
Remember that grep's purpose is to MATCH things. The -o option shows you what it matched. In order to make regex conditions that are not part of the expression that is returned, you'd need to use lookahead or lookbehind, which most command-line grep does not support because it's part of PCRE rather than ERE.
$ echo '<FQDN>Compname.dom.domain.com</FQDN>' | grep -Po '(?<=>)[^<]+'
Compname.dom.domain.com
The -P option will work in most Linux environments, but not in *BSD or OSX or Solaris, etc.
