grep - difference between ways of specifying directory and globs - grep

Say I'm in a project folder and want to grep a keyword using grep -rni. What's the difference between these 3 commands?
grep -rni . -e "keyword"
grep -rni * -e "keyword"
grep -rni **/* -e "keyword"
I tested this and noticed that the first two commands return the same number of matches, although in different ordering. The third one returned significantly more matches than the first two, however.
Is there any reason to use the third one ever? Is the reason it's returning more matches duplicates?

First of all, the difference has nothing to do with the arguments -n and -i.
From grep man page:
-n, --line-number
Prefix each line of output with the 1-based line number within its input file.
-i, --ignore-case
Ignore case distinctions in patterns and input data, so that characters that differ only in case match each other.
-r, --recursive
Read all files under each directory, recursively, following symbolic links only if they are on the command line. Note that if no file operand is given, grep searches the working directory. This is equivalent to the
-d recurse option.
So, the difference is actually on how the strings * and **/* are interpreted by the shell.
With . you pass the current directory as an argument to grep. No mystery here because it is grep the one who walks the current working directory.
With * you pass every file in the current directory as an argument to grep (this include directories).
Now, suppose you have the following directory structure:
├── file.txt
├── one
│   └── file.txt
└── two
└── file.txt
Running grep -rni * -e keyword is translated to:
grep -rni file.txt one two -e keyword
This conditions grep to iterate files and nested directories in that order.
Finally, grep -rni **/* -e keyword will translate to this command line:
grep -rni file.txt one one/file.txt two two/file.txt -e keyword
The problem with this last approach is that some files will be processed more than once. For instance: one/file.txt will be processed twice: once because it is explicitly in the argument list, and another time because it belongs to the directory one, which is also in the argument list.

Related

Find multiple files containing a certian string and list the files that don't containt the string

In the midst of building a site checker I have ran unto a problem, the client needs to check all of their pages for certain strings if they are included in the code and then list the files that do not have the code yet.
Tried with multiple grep commands with no success. The -v supposedly exports the inverted match of the results, but that does not happen. Currently I am missing the part of the code telling grep to only search in specific files (example files names code.php) in all sub folders.
With the current code it searches all the files even unnecessary ones.
grep -vrn '.' -e "SRING" > list.txt
I'd like to export a list of files (preferably that it only checks files in all sub folders with the same name) that do not posses the sting that I am looking for.
grep -c STRING files will give you a count of lines with STRING for each file.
You can optomize it somewhat with -m1 to stop after the first match.
You can pipe that to sed to grab the files with zero matches:
grep -cm1 STRING files | sed -n '/:0$/s/:0//p'
That gets you one file per line.
You can pipe that to xargs to merge it into a one-line list.
If your STRING is just that and not a regex, you could use the -F flag with grep to specify it's a fixed string, and that will also speed things up. So maybe...
grep -Fcm1 STRING files | sed -n '/:0$/s/:0//p' > list.txt
...in that case

why doesn't recursive grep appear to work?

The man page for grep says:
-r, --recursive
Read all files under each directory, recursively
OK, then how is this possible:
# grep -r BUILD_AP24_TRUE apache
# grep BUILD_AP24_TRUE apache/Makefile.in
#BUILD_AP24_TRUE#mod_shib_24_la_DEPENDENCIES = $(am__DEPENDENCIES_1) \
(...)
There are two likely causes this type of problem:
grep is aliased to something that excluded the files you were interested in.
The file of interest is in a symlinked directory or is itself a symlink.
In your case, it appears to be the second cause. One solution is to use grep -R in place of grep -r. From man grep:
-r, --recursive
Read all files under each directory, recursively, following symbolic
links only if they are on the command line. Note that if no file
operand is given, grep searches the working directory.
This is equivalent to the -d recurse option.
-R, --dereference-recursive
Read all files under each directory, recursively.
Follow all symbolic links, unlike -r.

grep recursive filename matching (grep -ir "xyz" *.cpp) does not work

while
grep -ir "xyz" * recursively searches through the directories and tell me that the text is present in ./x/y/z/abc.cpp
However ,
grep -ir "xyz" *.cpp offers no result.
Isn't the second command supposed to recursively grep all cpp files inside the directory ?
What am I missing here?
Grep will recurse through any directories you match with your glob pattern. (In your case, you probably do not have any directories that match the pattern "*.cpp") You could explicitly specify them: grep -ir "xyz" *.cpp */*.cpp */*/*.cpp */*/*/*.cpp, etc. You can also use the --include option (see the example below)
If you are using GNU grep, then you can use the following:
grep -ir --include "*.cpp" "xyz" .
The command above says to search recursively starting in current directory ignoring case on the pattern and to only search in files that match the glob pattern "*.cpp".
OR if you are on some other Unix platform, you can use this:
find ./ -type f -name "*.cpp" -print0 | xargs -0 grep -i "xyz"
If you are sure that none of your files have spaces in their names, you can omit the -print0 argument to find and the -0 to xargs
The command above says the following: find all files (-type f) under the current directory (./) that match the name glob/wildcard "*.cpp" (-name "*.cpp") and then print them out delimited by a null (-print0). That list of files found should be written to the stdin of the next command: xargs. xargs should read from stdin (default behavior) and split its input on nulls (-0) and then call the grep command with the specified options (grep -i "xyz") on that list of files.
If you are interested in learning more about why grep -ir "xyz" *.cpp does not work the way you think it should, you should search for "shell globbing" (here is a good first article on the subject). I'll also try to provide a quick explanation. When you type in the command grep -ir "xyz" *.cpp and hit enter, there are two programs that are involved in executing your command. The first program is your shell (and unless you've done something to customize things, you are probably usually the bash shell - if you've never heard of a shell or bash, that's where you should start looking, there are tons of good articles). Suffice it say that a shell is just a program that is designed to let you navigate the filesystem on your computer and run other programs. (In Windows, when you double click on an icon to launch a program, or open a folder to access a file, the program that you are running is explorer.exe and it is the Windows graphical shell). So, when you type the command grep -ir "xyz" *.cpp, before grep is run, the shell handles reading your command and does a few things. One of the things is does is expand glob patterns (things like *.txt or [0-9]+.pdf). Like I said, if you want to understand it, go read more about it, but the thing you should take away is that the grep command never sees the *.cpp. What happens is, the shell looks in the current directory for any files or directories with a name that match the pattern *.cpp and then replaces them on the command line BEFORE it runs the grep command. (If it doesn't find anything that matches, then it will leave the *.cpp there and grep will see it, but grep because doesn't normally do glob matching, this doesn't do anything for you).
Alternatively, when you type in grep -ir "xyz" *, what happens is that the shell replaces the * with the name of every file and directory in the current directory (because * matches anything). Let's say you had a directory that contained file1, file2, and dir1, and dir2, then the shell would perform its replacements and then execute a command that looked like this grep -ir "xyz" file1 file2 dir1 dir2, which means grep would search file1 and file2 for a line with the string xyz, and because of the -ir it also search recursively through dir1 and dir2 and search any files found for that string as well. Lastly, if you've followed everything I've said so far, then it will make sense to you that grep does have a way to use glob patterns on recursive searches, and that is to use the --include option, as in the command I described earlier: grep -ir --include "*.cpp" "xyz" ., and the reason why we put the *.cpp in quotes in that command is to prevent the shell from trying to expand the glob pattern before we run the command.

How to grep for two words existing on the same line? [duplicate]

This question already has answers here:
Match two strings in one line with grep
(23 answers)
Closed 3 years ago.
How do I grep for lines that contain two input words on the line? I'm looking for lines that contain both words, how do I do that? I tried pipe like this:
grep -c "word1" | grep -r "word2" logs
It just stucks after the first pipe command.
Why?
Why do you pass -c? That will just show the number of matches. Similarly, there is no reason to use -r. I suggest you read man grep.
To grep for 2 words existing on the same line, simply do:
grep "word1" FILE | grep "word2"
grep "word1" FILE will print all lines that have word1 in them from FILE, and then grep "word2" will print the lines that have word2 in them. Hence, if you combine these using a pipe, it will show lines containing both word1 and word2.
If you just want a count of how many lines had the 2 words on the same line, do:
grep "word1" FILE | grep -c "word2"
Also, to address your question why does it get stuck : in grep -c "word1", you did not specify a file. Therefore, grep expects input from stdin, which is why it seems to hang. You can press Ctrl+D to send an EOF (end-of-file) so that it quits.
Prescription
One simple rewrite of the command in the question is:
grep "word1" logs | grep "word2"
The first grep finds lines with 'word1' from the file 'logs' and then feeds those into the second grep which looks for lines containing 'word2'.
However, it isn't necessary to use two commands like that. You could use extended grep (grep -E or egrep):
grep -E 'word1.*word2|word2.*word1' logs
If you know that 'word1' will precede 'word2' on the line, you don't even need the alternatives and regular grep would do:
grep 'word1.*word2' logs
The 'one command' variants have the advantage that there is only one process running, and so the lines containing 'word1' do not have to be passed via a pipe to the second process. How much this matters depends on how big the data file is and how many lines match 'word1'. If the file is small, performance isn't likely to be an issue and running two commands is fine. If the file is big but only a few lines contain 'word1', there isn't going to be much data passed on the pipe and using two command is fine. However, if the file is huge and 'word1' occurs frequently, then you may be passing significant data down the pipe where a single command avoids that overhead. Against that, the regex is more complex; you might need to benchmark it to find out what's best — but only if performance really matters. If you run two commands, you should aim to select the less frequently occurring word in the first grep to minimize the amount of data processed by the second.
Diagnosis
The initial script is:
grep -c "word1" | grep -r "word2" logs
This is an odd command sequence. The first grep is going to count the number of occurrences of 'word1' on its standard input, and print that number on its standard output. Until you indicate EOF (e.g. by typing Control-D), it will sit there, waiting for you to type something. The second grep does a recursive search for 'word2' in the files underneath directory logs (or, if it is a file, in the file logs). Or, in my case, it will fail since there's neither a file nor a directory called logs where I'm running the pipeline. Note that the second grep doesn't read its standard input at all, so the pipe is superfluous.
With Bash, the parent shell waits until all the processes in the pipeline have exited, so it sits around waiting for the grep -c to finish, which it won't do until you indicate EOF. Hence, your code seems to get stuck. With Heirloom Shell, the second grep completes and exits, and the shell prompts again. Now you have two processes running, the first grep and the shell, and they are both trying to read from the keyboard, and it is not determinate which one gets any given line of input (or any given EOF indication).
Note that even if you typed data as input to the first grep, you would only get any lines that contain 'word2' shown on the output.
Footnote:
At one time, the answer used:
grep -E 'word1.*word2|word2.*word1' "$#"
grep 'word1.*word2' "$#"
This triggered the comments below.
you could use awk. like this...
cat <yourFile> | awk '/word1/ && /word2/'
Order is not important. So if you have a file and...
a file named , file1 contains:
word1 is in this file as well as word2
word2 is in this file as well as word1
word4 is in this file as well as word1
word5 is in this file as well as word2
then,
/tmp$ cat file1| awk '/word1/ && /word2/'
will result in,
word1 is in this file as well as word2
word2 is in this file as well as word1
yes, awk is slower.
The main issue is that you haven't supplied the first grep with any input. You will need to reorder your command something like
grep "word1" logs | grep "word2"
If you want to count the occurences, then put a '-c' on the second grep.
git grep
Here is the syntax using git grep combining multiple patterns using Boolean expressions:
git grep -e pattern1 --and -e pattern2 --and -e pattern3
The above command will print lines matching all the patterns at once.
If the files aren't under version control, add --no-index param.
Search files in the current directory that is not managed by Git.
Check man git-grep for help.
See also:
How to use grep to match string1 AND string2?
Check if all of multiple strings or regexes exist in a file.
How to run grep with multiple AND patterns?
For multiple patterns stored in the file, see: Match all patterns from file at once.
You cat try with below command
cat log|grep -e word1 -e word2
Use grep:
grep -wE "string1|String2|...." file_name
Or you can use:
echo string | grep -wE "string1|String2|...."

To understand the practical use of Grep's option -H in different situations

This question is based on this answer.
Why do you get the same output from the both commands?
Command A
$sudo grep muel * /tmp
masi:muel
Command B
$sudo grep -H muel * /tmp
masi:muel
Rob's comment suggests me that Command A should not give me masi:, but only muel.
In short, what is the practical purpose of -H?
Grep will list the filenames by default if more than one filename is given. The -H option makes it do that even if only one filename is given. In both your examples, more than one filename is given.
Here's a better example:
$ grep Richie notes.txt
Richie wears glasses.
$ grep -H Richie notes.txt
notes.txt:Richie wears glasses.
It's more useful when you're giving it a wildcard for an unknown number of files, and you always want the filenames printed even if the wildcard only matches one file.
If you grep a single file, -H makes a difference:
$ grep muel mesi
muel
$ grep -H muel mesi
masi:muel
This could be significant in various scripting contexts. For example, a script (or a non-trivial piped series of commands) might not be aware of how many files it's actually dealing with: one, or many.
When you grep from multiple files, by default it shows the name of the file where the match was found. If you specify -H, the file name will always be shown, even if you grep from a single file. You can specify -h to never show the file name.
Emacs has grep interface (M-x grep, M-x lgrep, M-x rgrep). If you ask Emacs to search for foo in the current directory, then Emacs calls grep and process the grep output and then present you with results with clickable links. Clickable links, just like Google.
What Emacs does is that it passes two options to grep: -n (show line number) and -H (show filenames even if only one file. the point is consistency) and then turn the output into clickable links.
In general, consistency is good for being a good API, but consistency conflicts with DWIM.
When you directly use grep, you want DWIM, so you don't pass -H.

Resources