I downloaded a very large hosts list to block ads.
The problem is that it breaks the functionality of some sites, such as forums/discussions and/or pictures, so I want to remove some sites from the hosts file.
Let's say I want to remove a.com and b.com from hosts.
These methods work.
grep -ve a.com -e b.com hosts > new_hosts
or
egrep -v 'a.com|b.com' hosts > new_hosts
Both work fine. But as the number of patterns grows, I want to keep the patterns in a file.
If I use this
grep -vf pattern.txt hosts > new_hosts
Only the last pattern will be removed.
If pattern.txt contains
a.com
b.com
Only b.com is omitted from new_hosts; a.com is still written to new_hosts.
So what grep command should I use with a pattern file?
If you have a hosts file that you want to compare with another file containing entries you want to eliminate, this will be easier with uniq than with grep.
Just combine the files and run something like this:
cat hosts badfile badfile | sort | uniq -u > new_hosts
Badfile is added twice because an entry that is not already present in hosts would otherwise appear only once and survive uniq -u; duplicating it guarantees that every copy is eliminated.
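A quick illustration with throwaway files (contents hypothetical):
printf 'a.com\nc.com\n' > hosts
printf 'a.com\nb.com\n' > badfile
cat hosts badfile badfile | sort | uniq -u
This prints only c.com. With badfile included just once, b.com would appear a single time in the combined output and wrongly survive uniq -u.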
Thanks for the feedback, guys. Since most of you suspected the error was in pattern.txt, I suspect it was Windows Notepad that caused it.
A new line from Windows Notepad is terminated by 0D 0A (hex), but I read somewhere that a newline for grep should be just 0A (hex).
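For reference, the carriage returns can also be stripped on the command line rather than in an editor; a minimal sketch using tr (pattern_unix.txt is a hypothetical output name):
tr -d '\r' < pattern.txt > pattern_unix.txt
Tools like dos2unix perform the same conversion in place.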
After editing the pattern.txt using Notepad++, this command finally works :-)
grep -vf pattern.txt hosts > new_hosts
Or maybe this is better
fgrep -vf pattern.txt hosts > new_hosts
Both are working perfectly :-)
I am a beginner at bash scripting. I just started to write a script that checks whether the contents of b.txt can all be found in a.txt (line by line, preferably). My code is as follows:
grep -Ffw b.txt a.txt
As you can see, I want to use fixed strings instead of regex. I want to check everything from the b.txt file, because there are some strings inside b.txt and I want to check whether all of them exist in a.txt. And of course I also want to match whole words only. So those are the requirements; however, when I run this command it returns an error: grep: w: No such file or directory
I am thinking that maybe there are some limitations on the flags in bash? Sorry, I am not really familiar with the language and haven't read much of the man page, etc. If anyone could help me solve the puzzle it would be appreciated :) In addition, if possible I would like to add a -q to suppress the output when there is a match; I didn't include it in the example since I couldn't even get it to work with three flags. So can anyone give me some hints here? Thanks in advance!
Here is some explanation from the man page:
OPTIONS
Generic Program Information
...
-F, --fixed-strings
Interpret PATTERNS as fixed strings, ...
-f FILE, --file=FILE
Obtain patterns from FILE, ...
-w, --word-regexp
Select only those lines ...
As you can see, the options -F and -w stand alone (hence the comma immediately after -F and -w), but the -f switch is followed by FILE, which means the option and the file name belong together.
If you want to preserve the order Ffw, that's possible, but then you need to do something like:
grep -Ff b.txt -w a.txt
As mentioned by @kvantour, the solution is simply placing the -f immediately before the b.txt file:
grep -Fwf b.txt a.txt
I should have thought of that when it said 'No such file or directory': it was a clear indication that the flags after -f were already being treated as the path.
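As for the -q mentioned in the question, it combines the same way; a minimal sketch (file names as in the question; exit status 0 means at least one line of a.txt matched):
grep -Fqwf b.txt a.txt && echo "at least one pattern matched"
Note that -q only reports whether anything matched at all; it does not verify that every line of b.txt is present in a.txt.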
I am using grep (BSD grep) 2.5.1-FreeBSD on macOS and I have found the following behavior.
I have two *.tex files. Each one of these contains the following lines
$k$-th bit of
$(i-m)$-th bit of
respectively. When I ran
grep --color -rnw . -e '\$-th bit of' --include="*.tex"
I got only the second file, i.e., $(i-m)$-th bit of, while I expected both lines. Could you please help me understand this behavior?
Never use -r or --include or any other grep option to find files. The GNU guys really screwed up by adding those options to grep when there's a perfectly good tool named find for finding files, and now they've turned grep into a convoluted mush of finding files and Globally matching a Regular Expression within a file and Printing the result (g/re/p).
Keep it simple: find the files with find, then g/re/p within them using grep:
find . -name '*.tex' -exec grep --color -n '\$-th bit of' {} +
As others pointed out, your g/re/p problem was the -w argument, so I've removed it above.
I have the same version of grep.
It is caused by your use of the -w option:
-w, --word-regexp
The expression is searched for as a word (as if surrounded by `[[:<:]]' and `[[:>:]]'; see re_format(7)).
The matched part of the string $k$-th bit of is bounded on the left-hand side by a word character (i.e. k) so the match is treated as being inside a "word" and it can't therefore satisfy the "searched for as a whole word" requirement.
Try without -w and it will work fine.
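In other words, the original command without -w should match both lines:
grep --color -rn . -e '\$-th bit of' --include="*.tex"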
I stopped a process to troubleshoot something. Now I would like to start the process where it left off, on CentOS 6.4.
The script I stopped runs a Perl script in a loop to process all of the files in /dev/shm/split/. These files were split from a larger file into many parts. They are named as follows:
file.txt.aa
file.txt.ab
file.txt.ac
...and so on.
I have identified that the script left off at file.txt.fy. So, I would like to remove all of the files in /dev/shm/split/ that are from file.txt.aa through file.txt.fy.
I tried to create a whitelist for the rm command by doing:
ls /dev/shm/split/ > whitelist
cat whitelist | egrep -v 'file.txt.[aa-fz]' | tee whitelist.tmp
This did not do what I had intended.
Please help me! Thank you!
The problem with your command is that a square-bracket pattern matches only a single character, so [aa-fz] cannot match a two-character sequence. You should use something like this instead:
ls file.txt.[a-e]? file.txt.f[a-y]
Basically, decompose your range into two globs: the first matches .aa through .ez, and the second .fa through .fy (inclusive).
Note that I have used the ls command here. I always find it a good idea to first echo or ls the files you're going to operate on when the operation is potentially destructive. When you're sure it produces the right output, go ahead and use rm instead of ls.
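Once the listing matches what you expect, the same globs can be handed to rm (a sketch, assuming the files live in /dev/shm/split/ as in the question):
rm /dev/shm/split/file.txt.[a-e]? /dev/shm/split/file.txt.f[a-y]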
How do I grep for lines that contain two given words on the same line? I'm looking for lines that contain both words. I tried a pipe like this:
grep -c "word1" | grep -r "word2" logs
It just gets stuck after the first pipe command.
Why?
Why do you pass -c? That will just show the number of matches. Similarly, there is no reason to use -r. I suggest you read man grep.
To grep for 2 words existing on the same line, simply do:
grep "word1" FILE | grep "word2"
grep "word1" FILE will print all lines that have word1 in them from FILE, and then grep "word2" will print the lines that have word2 in them. Hence, if you combine these using a pipe, it will show lines containing both word1 and word2.
If you just want a count of how many lines had the 2 words on the same line, do:
grep "word1" FILE | grep -c "word2"
Also, to address your question of why it gets stuck: in grep -c "word1", you did not specify a file. Therefore, grep expects input from stdin, which is why it seems to hang. You can press Ctrl+D to send an EOF (end-of-file) so that it quits.
Prescription
One simple rewrite of the command in the question is:
grep "word1" logs | grep "word2"
The first grep finds lines with 'word1' from the file 'logs' and then feeds those into the second grep which looks for lines containing 'word2'.
However, it isn't necessary to use two commands like that. You could use extended grep (grep -E or egrep):
grep -E 'word1.*word2|word2.*word1' logs
If you know that 'word1' will precede 'word2' on the line, you don't even need the alternatives and regular grep would do:
grep 'word1.*word2' logs
The 'one command' variants have the advantage that there is only one process running, and so the lines containing 'word1' do not have to be passed via a pipe to the second process. How much this matters depends on how big the data file is and how many lines match 'word1'. If the file is small, performance isn't likely to be an issue and running two commands is fine. If the file is big but only a few lines contain 'word1', there isn't going to be much data passed on the pipe and using two command is fine. However, if the file is huge and 'word1' occurs frequently, then you may be passing significant data down the pipe where a single command avoids that overhead. Against that, the regex is more complex; you might need to benchmark it to find out what's best — but only if performance really matters. If you run two commands, you should aim to select the less frequently occurring word in the first grep to minimize the amount of data processed by the second.
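If performance does matter, a quick way to compare the approaches on your own data (big.log is a hypothetical file name) is:
time grep -E 'word1.*word2|word2.*word1' big.log > /dev/null
time grep word1 big.log | grep word2 > /dev/null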
Diagnosis
The initial script is:
grep -c "word1" | grep -r "word2" logs
This is an odd command sequence. The first grep is going to count the number of occurrences of 'word1' on its standard input, and print that number on its standard output. Until you indicate EOF (e.g. by typing Control-D), it will sit there, waiting for you to type something. The second grep does a recursive search for 'word2' in the files underneath directory logs (or, if it is a file, in the file logs). Or, in my case, it will fail since there's neither a file nor a directory called logs where I'm running the pipeline. Note that the second grep doesn't read its standard input at all, so the pipe is superfluous.
With Bash, the parent shell waits until all the processes in the pipeline have exited, so it sits around waiting for the grep -c to finish, which it won't do until you indicate EOF. Hence, your code seems to get stuck. With Heirloom Shell, the second grep completes and exits, and the shell prompts again. Now you have two processes running, the first grep and the shell, and they are both trying to read from the keyboard, and it is not determinate which one gets any given line of input (or any given EOF indication).
Note that even if you typed data as input to the first grep, you would only get any lines that contain 'word2' shown on the output.
Footnote:
At one time, the answer used:
grep -E 'word1.*word2|word2.*word1' "$@"
grep 'word1.*word2' "$@"
(script-style, with the files to search passed as arguments). This is what triggered the comments.
You could use awk, like this:
awk '/word1/ && /word2/' <yourFile>
Order is not important. So, if a file named file1 contains:
word1 is in this file as well as word2
word2 is in this file as well as word1
word4 is in this file as well as word1
word5 is in this file as well as word2
then,
/tmp$ awk '/word1/ && /word2/' file1
will result in,
word1 is in this file as well as word2
word2 is in this file as well as word1
Yes, awk is slower than grep.
The main issue is that you haven't supplied the first grep with any input. You need to reorder your command to something like:
grep "word1" logs | grep "word2"
If you want to count the occurrences, then put a -c on the second grep.
git grep
Here is the syntax using git grep combining multiple patterns using Boolean expressions:
git grep -e pattern1 --and -e pattern2 --and -e pattern3
The above command will print lines matching all the patterns at once.
If the files aren't under version control, add the --no-index option to search files in the current directory that are not managed by Git.
Check man git-grep for help.
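For example (pattern names hypothetical), to find lines that contain both "error" and "timeout" in a directory that is not a Git checkout:
git grep --no-index -e error --and -e timeout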
See also:
How to use grep to match string1 AND string2?
Check if all of multiple strings or regexes exist in a file.
How to run grep with multiple AND patterns?
For multiple patterns stored in the file, see: Match all patterns from file at once.
You can try the command below, but note that multiple -e options match lines containing either word, not both:
grep -e word1 -e word2 log
To require both words on the same line, chain two greps as shown above.
Use grep with alternation (note that this matches lines containing any one of the strings):
grep -wE "string1|String2|...." file_name
Or you can use:
echo string | grep -wE "string1|String2|...."
This question is based on this answer.
Why do you get the same output from both commands?
Command A
$ sudo grep muel * /tmp
masi:muel
Command B
$ sudo grep -H muel * /tmp
masi:muel
Rob's comment suggests that Command A should not give me masi:, but only muel.
In short, what is the practical purpose of -H?
Grep will list the filenames by default if more than one filename is given. The -H option makes it do that even if only one filename is given. In both your examples, more than one filename is given.
Here's a better example:
$ grep Richie notes.txt
Richie wears glasses.
$ grep -H Richie notes.txt
notes.txt:Richie wears glasses.
It's more useful when you're giving it a wildcard for an unknown number of files, and you always want the filenames printed even if the wildcard only matches one file.
If you grep a single file, -H makes a difference:
$ grep muel masi
muel
$ grep -H muel masi
masi:muel
This could be significant in various scripting contexts. For example, a script (or a non-trivial piped series of commands) might not be aware of how many files it's actually dealing with: one, or many.
When you grep from multiple files, by default it shows the name of the file where the match was found. If you specify -H, the file name will always be shown, even if you grep from a single file. You can specify -h to never show the file name.
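A quick sketch of the three behaviours side by side (using the masi file from the question and a hypothetical second file masi2):
grep muel masi            # one file: no name prefix by default
grep -H muel masi         # masi:muel (file name forced on)
grep -h muel masi masi2   # file names suppressed even for multiple files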
Emacs has a grep interface (M-x grep, M-x lgrep, M-x rgrep). If you ask Emacs to search for foo in the current directory, Emacs calls grep, processes the grep output, and presents you with results as clickable links, just like Google.
What Emacs does is pass two options to grep: -n (show line numbers) and -H (show file names even if there is only one file; the point is consistency), and then turn the output into clickable links.
In general, consistency makes for a good API, but consistency conflicts with DWIM ("do what I mean").
When you use grep directly, you want DWIM, so you don't pass -H.
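For illustration, this is roughly the kind of command Emacs generates under the hood (the exact flags vary by version and configuration, and the file pattern here is hypothetical):
grep --color -nH -e foo *.el
The -n and -H options guarantee that every match carries a file:line: prefix, which Emacs then parses into the clickable links.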