I am trying to return the text between <h1> and </h1> header tags and also between <h3> and </h3> tags from a text file. The file has only one <h1> and four <h3> tags; I'm really only interested in the first two <h3> tags, but I want the text between them.
These grep commands produce the correct output one at a time, but when I try to combine them I have problems.
grep -o -P '(?<=<h1>).*(?=</h1>)' file.txt
grep -o -P '(?<=<h3>).*(?=</h3>)' file.txt
I tried
grep -e '(?<=<h1>).*(?=</h1>)'-e'(?<=<h3>).*(?=</h3>)'
grep -o '(?<=<h1>).*(?=</h1>)'\'(?<=<h3>).*(?=<h3>)'
I'm not sure what -P does other than what the man page says: Perl regular expression. But it only seems to allow one pattern at a time. Is there another command that I could use to pull the text between both sets of tags?
I spent several hours trying to figure out what I am doing wrong. Thanks for any help in advance.
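One pattern passed to -P can carry both lookarounds joined by |, so a single grep pass covers both tags. This is only a sketch and assumes the text of interest sits on the same line as its tags, as in the two commands that already work:
grep -o -P '(?<=<h1>).*(?=</h1>)|(?<=<h3>).*(?=</h3>)' file.txt
Since the file has one <h1> and only the first two <h3> matches are wanted, piping through head -n 3 would keep just those, assuming the <h1> appears before the <h3> tags.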
I want to grep the string toze359485948584 from multiple different binary files within a specific folder. The first part of the string stays the same, but the 12 digits after the word toze can change.
When I use
grep -a -o -E -r 'toze' /my folder/
I get the output toze
but when I use
grep -a -o -E -r 'toze[0-9]' /my folder/
I get no output at all.
The word toze is the same in all the binary files within that folder, but the 12 digits following it are different from file to file.
Example of file:
:?5o2g0?2?76=1?7?5 clasFSCl??˹?t0?l?Ah?Ob??9??$[??Te?J? ????C?'fھ???ӽ?Agj?(m?r??q[4 '?E??'黼}v?seUC?ؑFh??0?-?:??ꅜP?~0?zMANP1?p?????cBMac60:30:d4:2d:0d:c2???ɜm0SrNm9I4l6?5?5?=?4!3L2?2?5}3
6?636?5{1(1?/?.uDX3X3JWLHG7F?????cWMac60:30:d4:2b:ef:ab?????c
/U/]-?5?6m+?.?-?*?*a-4;6'.?-?0x*?.?,00?faic??˵?i0toze359485948584??˹?t#0!inst00008010-001348443E100026?????d:08seid0040E3FF32F48800180401178969456532CBE6122F11BB554?????n*0(srvn :??j?^<?`m4,G????##???180718064325Z?????d0tsid928C7F80C073CA01???ٚR? 0?NvMR1???????T0DGSTo8En?HC??G??]???Q???????s0
,?0M/540K21 clasNvMR??˹?t0instF5?l?Ah?Ob??9??$[??Te?J? ????C?'fھ???ӽ?Agj?(m?r??q[4 '?E??'黼}v?seUC?ؑFh??0?-?:?????l?0?bbcl1?
RiMcP?SYS?Hs9v>B|B?AC?#?A?=$;U<?;?>?C?9?:E9?4X<7?:6?9?5-4?4?68?8?355L5$2
Because there is more than one digit after toze, you can try something like:
grep -a -o -E -r 'toze[0-9]+' "/my folder/"
If you are willing to loop over the files and handle them one by one, you can simplify the work with:
strings "$file" | grep -a -o -E 'toze[0-9]+'
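A minimal sketch of that loop, assuming the files sit directly under /my folder/ and that exactly 12 digits follow toze:
for file in "/my folder"/*; do
    strings "$file" | grep -o -E 'toze[0-9]{12}'   # {12} matches exactly twelve digits
done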
I want the output of the sed file edit to go into my log file named d_selinuxlog.txt. Currently, grep outputs the specified string as well as three other strings above and below it in the edited file.
#!/bin/bash
{ getenforce;
sed -i 's/SELINUX=enforcing/SELINUX=disabled/' /etc/selinux/config;
grep "SELINUX=*" /etc/selinux/config > /home/neb/scropts/logs/d_selinuxlog.txt;
setenforce 0;
getenforce; }
I want to see just SELINUX=disabled in the log file.
All the lines containing SELINUX are going to match, even the commented ones, so you need to omit those and drop the * from the match.
grep "SELINUX=" /etc/selinux/config | grep -v "#"
This is my output
17:52:07 alvaro#lykan /home/alvaro
$ grep "SELINUX=" /etc/selinux/config | grep -v "#"
SELINUX=disabled
17:52:22 alvaro#lykan /home/alvaro
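An equivalent single grep anchors the pattern at the start of the line, which also skips the commented lines; a sketch reusing the log path from the script above:
grep "^SELINUX=" /etc/selinux/config > /home/neb/scropts/logs/d_selinuxlog.txt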
If I have a document with many links and I want to download specifically one picture with the name www.website.de/picture/example_2015-06-15.jpeg, how can I write a command that automatically downloads exactly the one I extracted out of my document?
My idea would be this, but I get an error message like "wget: URL is missing":
grep -E 'www.website.de/picture/example_2015-06-15.jpeg' document | wget
Use xargs:
grep etc... | xargs wget
It takes its stdin (grep's output), and passes that text as command line arguments to whatever application you tell it to.
For example,
echo hello | xargs echo 'from xargs '
produces:
from xargs hello
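If the matching line contains more text than the URL itself, adding -o keeps only the matched part before xargs hands it to wget; a sketch using the URL from the question:
grep -o 'www.website.de/picture/example_2015-06-15.jpeg' document | xargs wget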
Using backticks would be the easiest way of doing it:
wget `grep -E 'www.website.de/picture/example_2015-06-15.jpeg' document`
This will do too:
wget "$(grep -E 'www.website.de/picture/example_2015-06-15.jpeg' document)"
I need the combination of these two commands; is there a way to just grep once? Because the file may be really big, >1 GB.
$ grep -w 'word1' infile
$ grep -w 'word2' infile
I don't need them on the same line, as in grepping for two words existing on the same line; I just need to avoid iterating over the whole file twice.
Use this:
grep -E -w "word1|word2" infile
or
egrep -w "word1|word2" infile
It will match lines containing either word1, word2, or both.
From man grep:
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below).
Test
$ cat file
The fish [ate] the bird.
[This is some] text.
Here is a number [1001] and another [1201].
$ grep -E -w "is|number" file
[This is some] text.
Here is a number [1001] and another [1201].
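If there are many words, grep can also read its patterns from a file with -f while still scanning the input only once; a sketch, where patterns.txt is just a placeholder name:
printf '%s\n' word1 word2 > patterns.txt   # patterns.txt is a placeholder name
grep -w -f patterns.txt infile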
I'm looking for a way to pseudo-spider a website. The key is that I don't actually want the content, but rather a simple list of URIs. I can get reasonably close to this idea with Wget using the --spider option, but when piping that output through a grep, I can't seem to find the right magic to make it work:
wget --spider --force-html -r -l1 http://somesite.com | grep 'Saving to:'
The grep filter seems to have absolutely no effect on the wget output. Have I got something wrong, or is there another tool I should try that's more geared towards providing this kind of limited result set?
UPDATE
So I just found out offline that, by default, wget writes to stderr. I missed that in the man pages (in fact, I still haven't found it if it's in there). Once I redirected stderr to stdout, I got closer to what I need:
wget --spider --force-html -r -l1 http://somesite.com 2>&1 | grep 'Saving to:'
I'd still be interested in other/better means for doing this kind of thing, if any exist.
The absolute last thing I want to do is download and parse all of the content myself (i.e. create my own spider). Once I learned that Wget writes to stderr by default, I was able to redirect it to stdout and filter the output appropriately.
wget --spider --force-html -r -l2 $url 2>&1 \
| grep '^--' | awk '{ print $3 }' \
| grep -v '\.\(css\|js\|png\|gif\|jpg\)$' \
> urls.m3u
This gives me a list of the URIs of the content resources (resources that aren't images, CSS, or JS source files) that are spidered. From there, I can send the URIs off to a third-party tool for processing to meet my needs.
The output still needs to be streamlined slightly (it contains duplicates, as shown above), but it's almost there and I haven't had to do any parsing myself.
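The duplicates can be removed afterwards with sort -u; a sketch, where urls_unique.m3u is just a placeholder output name:
sort -u urls.m3u > urls_unique.m3u   # urls_unique.m3u is a placeholder name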
Create a few regular expressions to extract the addresses from all
<a href="(ADDRESS_IS_HERE)">.
Here is the solution I would use:
wget -q http://example.com -O - | \
tr "\t\r\n'" ' "' | \
grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' | \
sed -e 's/^.*"\([^"]\+\)".*$/\1/g'
This will output all http, https, ftp, and ftps links from a webpage. It will not give you relative URLs, only full URLs.
Explanation regarding the options used in the series of piped commands:
wget -q makes it not have excessive output (quiet mode).
wget -O - makes it so that the downloaded file is echoed to stdout, rather than saved to disk.
tr is the unix character translator, used in this example to translate newlines and tabs to spaces, as well as convert single quotes into double quotes so we can simplify our regular expressions.
grep -i makes the search case-insensitive.
grep -o makes it output only the matching portions.
sed is the Stream EDitor unix utility which allows for filtering and transformation operations.
sed -e just lets you feed it an expression.
Running this little script on "http://craigslist.org" yielded quite a long list of links:
http://blog.craigslist.org/
http://24hoursoncraigslist.com/subs/nowplaying.html
http://craigslistfoundation.org/
http://atlanta.craigslist.org/
http://austin.craigslist.org/
http://boston.craigslist.org/
http://chicago.craigslist.org/
http://cleveland.craigslist.org/
...
I've used a tool called xidel
xidel http://server -e '//a/@href' |
grep -v "http" |
sort -u |
xargs -L1 -I {} xidel http://server/{} -e '//a/@href' |
grep -v "http" | sort -u
A little hackish, but it gets you closer! This is only the first level. Imagine packing this up into a self-recursive script!
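A rough sketch of what that recursion could look like, keeping the same http://server placeholder as the one-liner above; the depth limit of 2 and the duplicate handling are assumptions:
#!/bin/bash
# Sketch of the recursion hinted at above; http://server is the same placeholder
# as in the one-liner, and the depth limit of 2 is an assumption.
crawl() {
    local path=$1 depth=$2
    [ "$depth" -le 0 ] && return
    # Extract hrefs, keep only relative links (as the one-liner does), dedupe,
    # then recurse one level deeper on each of them.
    xidel "http://server/$path" -e '//a/@href' 2>/dev/null |
        grep -v "http" | sort -u |
        while read -r link; do
            echo "$link"
            crawl "$link" $((depth - 1))
        done
}
crawl "" 2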