I've used grep -F extensively at work (CentOS) to ignore regex pattern in the match. Now here's what I'm trying at home (Ubuntu 14.04):
$ cat file
Here is
the -F
you were looking
for!
~$ grep -F '-F' file
_
The underscore is meant to show a blinking cursor, as if it's waiting for input. Could it be because Ubuntu's grep doesn't follow all POSIX switches (I read that -F was specified by POSIX) or am I making a mistake somewhere?'
===== Update ======
Interestingly, it fails only when there's a newline following -F. If you change the text to, say, -F option, then the line matches. A bug in grep?
Specifying the same argument twice has no effect, so
grep -F '-F' file
is the same as grep -F file, which of course searches for the fixed-string file in its standard input.
The single-quotes are a red-herring. They protect the -F from any expansion by the shell (like glob-expansion, variable expansion, command substitution, etc.), and are removed by the shell before grep sees the -F.
What you need to do is use grep's -e pattern argument:
grep -F -e '-F' file
Given that context, the -F will be interpreted as the pattern. The single-quotes are still redundant. You could single-quote every other arg. I left them in because it helps humans come to the right conclusion at first glance, and it's generally not bad practice to quote stuff that could contain shell meta-characters in a future version of the script, even if it's currently safe.
Related
After reading the xargs man page, I am unable to understand the difference in exit codes from the following xargs invocations.
(The original purpose was to combine find and grep to check if an expressions exists in ALL the given files when I came across this behaviour)
To reproduce:
(use >>! if using zsh to force creation of file)
# Create the input files.
echo "a" >> 1.txt
echo "ab" >> 2.txt
# The end goal is to check for a pattern (in this case simply 'b') inside
# ALL the files returned by a find search.
find . -name "1.txt" -o -name "2.txt" | xargs -I {} grep -q "b" {}
echo $?
123 # Works as expected since 'b' is not present in 1.txt
find . -name "1.txt" -o -name "2.txt" | xargs grep -q "b"
echo $?
0 # Am more puzzled by why the behaviour is inconsistent
The EXIT_STATUS section on the man page says:
xargs exits with the following status:
0 if it succeeds
123 if any invocation of the command exited with status 1-125
124 if the command exited with status 255
125 if the command is killed by a signal
126 if the command cannot be run
127 if the command is not found
1 if some other error occurred.
I would have thought, that 123 if any invocation of the command exited with status 1-125 should apply irrespective of whether or not -I is used ?
Could you share any insights to explain this conundrum please?
Here is evidence of the effect of -I option with xargs with the help of a wrapper script which shows the number of invocations:
cat ./grep.sh
#/bin/bash
echo "I am being invoked at $(date +%Y%m%d_%H-%M-%S)"
grep $#
(the actual command being invoked, in this case grep doesn't really matter)
Now execute the same commands as in the question using the wrapper script instead:
❯ find . -name "1.txt" -o -name "2.txt" | xargs -I {} ./grep.sh -q "b" {}
I am being invoked at 20190410_09-46-29
I am being invoked at 20190410_09-46-30
❯ find . -name "1.txt" -o -name "2.txt" | xargs ./grep.sh -q "b"
I am being invoked at 20190410_09-46-53
I have just discovered a comment on the answer of a similar question that answers this question (complete credit to https://superuser.com/users/49184/daniel-andersson for his wisdom):
https://superuser.com/questions/557203/xargs-i-behaviour#comment678705_557230
Also, unquoted blanks do not terminate input items; instead the separator is the newline character. — this is central to understanding the behavior. Without -I, xargs only sees the input as a single field, since newline is not a field separator. With -I, suddenly newline is a field separator, and thus xargs sees three fields (that it iterates over). That is a real subtle point, but is explained in the man page quoted.
-I replace-str
Replace occurrences of replace-str in the initial-arguments
with names read from standard input. Also, unquoted blanks do
not terminate input items; instead the separator is the
newline character. Implies -x and -L 1.
Based on that,
find . -name "1.txt" -o -name "2.txt"
#returns
# ./1.txt
# ./2.txt
xargs -I {} grep -q "b" {}
# interprets the above as two separate lines since,
# with -I option the newline is now a *field separator*.
# So this results in TWO invocations of grep and since one of them fails,
# the overall output is 123 as documented in the EXIT_STATUS section
xargs grep -q "b"
# interprets the above as a single input field,
# so a single grep invocation which returns a successful exit code of 0 since the pattern was found in one of the files.
echo $'one\ntwo\nthree' | grep -F -v $(echo three$'\n'one)
Output should in theory be the string two
I've read that the -F command lets grep interpret each line as a list connected by 'or' qualifier.
Only mistake is some missing double-quotes:
echo $'one\ntwo\nthree' | grep -F -v "$(echo three$'\n'one)"
Also, keep in mind that this will also filter out "threesome", "someone", etc...
(#etan-reisner points out that running set -x before the original and the fixed command can be used to observe the difference the double-quotes make here, and, more generally, is a useful way to debug bash commands.)
I would need the combination of the 2 commands, is there a way to just grep once? Because the file may be really big, >1gb
$ grep -w 'word1' infile
$ grep -w 'word2' infile
I don't need them on the same line like grep for 2 words existing on the same line. I just need to avoid redundant iteration of the whole file
use this:
grep -E -w "word1|word2" infile
or
egrep -w "word1|word2" infile
It will match lines matching either word1, word2 or both.
From man grep:
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below).
Test
$ cat file
The fish [ate] the bird.
[This is some] text.
Here is a number [1001] and another [1201].
$ grep -E -w "is|number" file
[This is some] text.
Here is a number [1001] and another [1201].
I'm trying to do this:
run "echo -n 'foo' > bar.txt"
and the contents of bar.txt ends up being:
-n foo \n
(With \n representing an actual newline)
I use run for other commands like rm -rf and, to my knowledge, it works fine.
I just found this in man echo:
Some shells may provide a builtin echo command which is similar or identical to this utility. Most notably, the builtin echo in sh(1) does not accept the -n option. Consult the builtin(1) manual page.
My version of bash has an echo builtin but seems to be respecting the -n flag. It looks like the shell on your deployment machine doesn't, in which case using the full path to the echo binary might do what you want here:
run "/bin/echo -n 'foo' > bar.txt"
It appears as though the -n flag isn't being interpreted as a flag by the shell. If, from the command line, one executes echo -Y hi, the output will be -Y hi.
I'm looking for a way to pseudo-spider a website. The key is that I don't actually want the content, but rather a simple list of URIs. I can get reasonably close to this idea with Wget using the --spider option, but when piping that output through a grep, I can't seem to find the right magic to make it work:
wget --spider --force-html -r -l1 http://somesite.com | grep 'Saving to:'
The grep filter seems to have absolutely no affect on the wget output. Have I got something wrong or is there another tool I should try that's more geared towards providing this kind of limited result set?
UPDATE
So I just found out offline that, by default, wget writes to stderr. I missed that in the man pages (in fact, I still haven't found it if it's in there). Once I piped the return to stdout, I got closer to what I need:
wget --spider --force-html -r -l1 http://somesite.com 2>&1 | grep 'Saving to:'
I'd still be interested in other/better means for doing this kind of thing, if any exist.
The absolute last thing I want to do is download and parse all of the content myself (i.e. create my own spider). Once I learned that Wget writes to stderr by default, I was able to redirect it to stdout and filter the output appropriately.
wget --spider --force-html -r -l2 $url 2>&1 \
| grep '^--' | awk '{ print $3 }' \
| grep -v '\.\(css\|js\|png\|gif\|jpg\)$' \
> urls.m3u
This gives me a list of the content resource (resources that aren't images, CSS or JS source files) URIs that are spidered. From there, I can send the URIs off to a third party tool for processing to meet my needs.
The output still needs to be streamlined slightly (it produces duplicates as it's shown above), but it's almost there and I haven't had to do any parsing myself.
Create a few regular expressions to extract the addresses from all
<a href="(ADDRESS_IS_HERE)">.
Here is the solution I would use:
wget -q http://example.com -O - | \
tr "\t\r\n'" ' "' | \
grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' | \
sed -e 's/^.*"\([^"]\+\)".*$/\1/g'
This will output all http, https, ftp, and ftps links from a webpage. It will not give you relative urls, only full urls.
Explanation regarding the options used in the series of piped commands:
wget -q makes it not have excessive output (quiet mode).
wget -O - makes it so that the downloaded file is echoed to stdout, rather than saved to disk.
tr is the unix character translator, used in this example to translate newlines and tabs to spaces, as well as convert single quotes into double quotes so we can simplify our regular expressions.
grep -i makes the search case-insensitive
grep -o makes it output only the matching portions.
sed is the Stream EDitor unix utility which allows for filtering and transformation operations.
sed -e just lets you feed it an expression.
Running this little script on "http://craigslist.org" yielded quite a long list of links:
http://blog.craigslist.org/
http://24hoursoncraigslist.com/subs/nowplaying.html
http://craigslistfoundation.org/
http://atlanta.craigslist.org/
http://austin.craigslist.org/
http://boston.craigslist.org/
http://chicago.craigslist.org/
http://cleveland.craigslist.org/
...
I've used a tool called xidel
xidel http://server -e '//a/#href' |
grep -v "http" |
sort -u |
xargs -L1 -I {} xidel http://server/{} -e '//a/#href' |
grep -v "http" | sort -u
A little hackish but gets you closer! This is only the first level. Imagine packing this up into a self recursive script!