Listing non matching entries using 'grep -f' - parsing

The following command gives me a list of matching expressions:
grep -f /tmp/list Filename* > /tmp/output
The list file is then parsed and used to search Filename* for the parsed string. The results are then saved to output.
How would I output the parsed string from list in the case where there is no match in Filename*?
Contents of the list file could be:
ABC
BLA
ZZZ
HJK
Example Files:
Filename1:5,ABC,123
Filename2:5,ZZZ,342
Result of Running Command:
BLA
HJK
Stack Overflow question 2480584 looks like it may be relevant, through the use of an if statement. However, I'm not sure how to write the parsed string to the output file. Would it require some kind of read-line loop?
TIA,
Mic

Obviously, grep -f list Filename* gives all matches of patterns from the file list in the files specified by Filename*, i.e.,
Filename1:5,ABC,123
Filename2:5,ZZZ,342
in your example.
By adding the -o (only print matching expression) and -h (do not print filename) flags, we can turn this into:
ABC
ZZZ
Now you want all patterns from list that are not contained in this list, which can be achieved by
grep -f list Filename* -o -h | grep -f /dev/stdin -v list
where the second grep takes its patterns from the output of the first and, by using the -v flag, gives all the lines of the file list that do not match those patterns.
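As a minimal sketch of that pipeline on the question's sample data: the scratch directory and the added -F flag (treat every pattern as a fixed string rather than a regex) are my own assumptions, not part of the answer above.

```shell
# Recreate the question's sample data in a scratch directory (illustrative paths).
workdir=$(mktemp -d)
printf '%s\n' ABC BLA ZZZ HJK > "$workdir/list"
printf '%s\n' '5,ABC,123' > "$workdir/Filename1"
printf '%s\n' '5,ZZZ,342' > "$workdir/Filename2"

# First grep: print only the matched text (-o) without filename prefixes (-h).
# Second grep: read those strings as patterns from stdin and print the lines
# of the list that match none of them (-v).
unmatched=$(grep -F -o -h -f "$workdir/list" "$workdir"/Filename* |
  grep -F -v -f /dev/stdin "$workdir/list")

printf '%s\n' "$unmatched"
rm -r "$workdir"
```

On the sample data this should print BLA and HJK, one per line.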

This does it:
$ grep -v "$(cat Filename* | cut -d, -f2)" /tmp/list
BLA
HJK
Explanation
$ cat Filename* | cut -d, -f2
ABC
ZZZ
grep -v then inverts the match: it prints the lines of /tmp/list that match none of those strings.
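The read-line loop the question hints at could look roughly like the sketch below. It is slower than the single-pipeline answers because it starts one grep per pattern, and the scratch directory, the variable names, and the -F flag are assumptions of mine.

```shell
# Recreate the question's sample data in a scratch directory (illustrative paths).
workdir=$(mktemp -d)
printf '%s\n' ABC BLA ZZZ HJK > "$workdir/list"
printf '%s\n' '5,ABC,123' > "$workdir/Filename1"
printf '%s\n' '5,ZZZ,342' > "$workdir/Filename2"

# For each pattern, ask grep quietly (-q) whether any file contains it;
# if none does, write the pattern itself to the output file.
while IFS= read -r pattern; do
  if ! grep -q -F -e "$pattern" "$workdir"/Filename*; then
    printf '%s\n' "$pattern"
  fi
done < "$workdir/list" > "$workdir/output"

unmatched=$(cat "$workdir/output")
printf '%s\n' "$unmatched"
rm -r "$workdir"
```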

Related

How to grep lines non-repeatedly for same command?

I have a space-separated file that looks like this:
$ cat in_file
GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1 Chal_sti_synt_C
GCF_000046845.1_ASM4684v1_protein.faa WP_004927566.1 Chal_sti_synt_C
GCF_000046845.1_ASM4684v1_protein.faa WP_004919950.1 FAD_binding_3
GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1 FAD_binding_3
I am using the following shell script utilizing grep to search for strings:
$ cat search_script.sh
grep "GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1" Pfam_anntn_temp.txt
grep "GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1" Pfam_anntn_temp.txt
The problem is that I want each grep command to return only the first instance of the string it finds exclusive of the previous identical grep command's output.
I need an output which would look like this:
$ cat out_file
GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1 Chal_sti_synt_C
GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1 FAD_binding_3
in which line 1 is exclusively the output of the first grep command and line 2 is exclusively the output of the second grep command. How do I do it?
P.S. I am running this on a big file (>125,000 lines). So, search_script.sh is mostly composed of unique grep commands. It is the identical commands' execution that is messing up my downstream analysis.
I'm assuming you are generating search_script.sh automatically from the contents of in_file. If you can count how many times you'll repeat the same grep command you can just use grep once and use head, for example if you know you'll be using it 2 times:
grep "foo" bar.txt | head -2
Will output the first 2 occurrences of "foo" in bar.txt.
If you have to do the grep commands separately, for example if you have other code in between the grep commands, you can mix head and tail:
grep "foo" bar.txt | head -1 | tail -1
Some other commands...
grep "foo" bar.txt | head -2 | tail -1
head -n displays the first n lines of the input
tail -n displays the last n lines of the input
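If the repeated commands are generated automatically, the head/tail pair can be wrapped in a small function. This is only a sketch: nth_match is a hypothetical helper name, and it uses sed -n 'Np' in place of head | tail to pick the N-th matching line.

```shell
# nth_match (hypothetical helper): print only the N-th line of FILE matching PATTERN.
# $1 = pattern, $2 = file, $3 = which occurrence to print (1-based)
nth_match() {
  grep -e "$1" "$2" | sed -n "${3}p"
}

# Small illustrative input file.
workdir=$(mktemp -d)
printf '%s\n' 'foo one' 'bar' 'foo two' 'foo three' > "$workdir/bar.txt"

first=$(nth_match foo "$workdir/bar.txt" 1)
second=$(nth_match foo "$workdir/bar.txt" 2)
printf '%s\n%s\n' "$first" "$second"
rm -r "$workdir"
```

Each call rereads the file, so for a very large input a single pass (for example with awk) may be preferable.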
If you really MUST always use the same command, but ensure that the outputs always differ, the only way I can think of to achieve this is using temporary files and a complex sequence of commands:
cat foo.bar.txt.tmp 2>&1 | xargs -I xx echo "| grep -v \\'xx\\' " | tr '\n' ' ' | xargs -I xx sh -c "grep 'foo' bar.txt xx | head -1 | tee -a foo.bar.txt.tmp"
So to explain this command, given foo as a search string and bar.txt as the filename, then foo.bar.txt.tmp is a unique name for a temporary file. The temporary file will hold the strings that have already been output:
cat foo.bar.txt.tmp 2>&1 : outputs the contents of the temporary file. If none is present, will output an error message to stdout, (important because if the output was empty the rest of the command wouldn't work.)
xargs -I xx echo "| grep -v \\'xx\\' " adds | grep -v to the start of each line in the temporary file, grep -v something excludes lines that include something.
tr '\n' ' ' replaces newlines with spaces, to have on a single string a sequence of grep -vs.
xargs -I xx sh -c "grep 'foo' bar.txt xx | head -1 | tee -a foo.bar.txt.tmp" runs a new command, grep 'foo' bar.txt xx | head -1 | tee -a foo.bar.txt.tmp, replacing xx with the previous output. xx should be the sequence of grep -vs that exclude previous outputs.
head -1 makes sure only one line is output at a time
tee -a foo.bar.txt.tmp appends the new output to the temporary file.
Just be sure to clear the temporary files, rm *.tmp, at the end of your script.
If I understand the question correctly and you want to remove duplicates based on the last field of each line, then try the following (this should be an easy task for awk):
awk '!a[$NF]++' Input_file
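For readers unfamiliar with the idiom, here is the same one-liner written out in long form and run on the question's in_file. The array name seen and the scratch directory are assumptions for illustration.

```shell
workdir=$(mktemp -d)
cat > "$workdir/in_file" <<'EOF'
GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1 Chal_sti_synt_C
GCF_000046845.1_ASM4684v1_protein.faa WP_004927566.1 Chal_sti_synt_C
GCF_000046845.1_ASM4684v1_protein.faa WP_004919950.1 FAD_binding_3
GCF_000046845.1_ASM4684v1_protein.faa WP_004920342.1 FAD_binding_3
EOF

# Long form of '!a[$NF]++': print a line only the first time its last field is seen.
deduped=$(awk '{ if (!($NF in seen)) { seen[$NF] = 1; print } }' "$workdir/in_file")
printf '%s\n' "$deduped"
rm -r "$workdir"
```

On this sample it keeps the first Chal_sti_synt_C line and the first FAD_binding_3 line, which is not the same pair as the out_file in the question, so it only fits if deduplicating on the last field is really what is wanted.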

print filename if several matches are present in file

I want to print the filename only if ALL the matches are present, possibly on different lines.
grep -l -w '10B\|01A\|gencode' */$a*filename.vcf
this prints out the filename, but not only if ALL three matches are present.
Would you consider trying awk? It may solve it as follows (note that this matches only when all three patterns occur on the same line):
awk '/10B/&&/01A/&&/gencode/{print FILENAME}' */$a*filename.vcf
Try the following; I just edited your solution a bit (it requires the three patterns on one line, in this order):
grep -l '10B.*01A.*gencode' Input_file
With grep's -P (Perl-compatible regex) option and positive lookaheads, (?=regex), you can match the patterns in any order within a line:
grep -lwP '(?=.*?10B)(?=.*?01A)(?=.*?gencode)' /path/to/infile
grep -l 'pattern1' files ... | xargs grep -l 'pattern2' | xargs grep -l 'pattern3'
From the grep manual:
-l, --files-with-matches
Suppress normal output; instead print the name of each input file from which output would normally have been printed. The scanning will stop on the first match. (-l is specified by POSIX.)
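Since the question asks about matches on different lines, an awk variant that remembers what it has seen per file may be closer to what is wanted. This is a sketch under assumptions: the file names and contents are made up, and like the awk answer above it matches substrings rather than whole words (no equivalent of -w).

```shell
workdir=$(mktemp -d)
printf '%s\n' 'x 10B' 'y 01A' 'z gencode' > "$workdir/all.vcf"
printf '%s\n' 'x 10B' 'y 01A' > "$workdir/partial.vcf"

# Track each pattern per file and print the file name once all three have been seen.
# The flags are reset whenever a new file starts (FNR == 1).
hits=$(awk '
  FNR == 1  { a = b = c = printed = 0 }
  /10B/     { a = 1 }
  /01A/     { b = 1 }
  /gencode/ { c = 1 }
  a && b && c && !printed { print FILENAME; printed = 1 }
' "$workdir"/*.vcf)

printf '%s\n' "$hits"
rm -r "$workdir"
```

Only all.vcf is reported here, because partial.vcf never contains gencode.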

grep is not working inside while loop

I have two files
File1
area a
area b
areaf
File2
area a :aaaa
area b:bbbb
area3:abc
areaf:hsg
area4:uhg
area5:yutr
while read -r line
do
grep -w ^$line File2 | cut -d ":" -f2
done < File1
Desired output
aaaa
bbbb
hsg
actual output
grep: can't open a
area a
grep: can't open b
area3:abc
areaf:hsg
area4:uhg
area5:yutr
But when I run grep -w ^"area a" File2 | cut -d ":" -f2 it gives the correct output:
aaaa
Please assist me with this. I tried a for loop as well, with no success. grep is not working inside the loop.
Your variable line might contain "special characters": for example, a space that the shell interprets as a separator, or characters that grep interprets as pattern metacharacters.
You need both to use fgrep (equivalently, grep -F) and to quote your variable (I'm not sure -w adds anything to that command; why do you feel the need for it?):
fgrep -w "$line"
But doing so, you lose the ability to anchor the match at the start of the line.
Another option, if the "start of line" match is required, is to escape the search string:
while read -r line
do
line=$(echo "$line" | sed -e 's/[]\/$*.^|[]/\\&/g')
grep -w "^$line" File2 | cut -d ":" -f2
done < File1
You can achieve the same result without a loop, since grep can read patterns from a file via the -f option. This will be more robust, because the shell never gets a chance to split the patterns into words:
grep -f File1 File2 | cut -d: -f2
Gives:
aaaa
bbbb
hsg
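If the keys should be compared literally against the part before the colon, an awk lookup sidesteps regex escaping entirely. Note that this swaps grep for awk, so it is an alternative technique rather than a fix of the loop; the scratch directory and the names keys and k are assumptions, and it assumes File1 is not empty.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'area a' 'area b' 'areaf' > "$workdir/File1"
printf '%s\n' 'area a :aaaa' 'area b:bbbb' 'area3:abc' 'areaf:hsg' \
  'area4:uhg' 'area5:yutr' > "$workdir/File2"

values=$(awk -F: '
  NR == FNR { keys[$0]; next }          # first file: remember each key
  { k = $1; sub(/[ \t]+$/, "", k) }     # second file: trim trailing blanks from field 1
  k in keys { print $2 }                # literal comparison, no regex involved
' "$workdir/File1" "$workdir/File2")

printf '%s\n' "$values"
rm -r "$workdir"
```

On the question's data this should print aaaa, bbbb and hsg, one per line.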

How to grep -w for 2 words that might or might not occur in the same line?

I would need the combination of the 2 commands, is there a way to just grep once? Because the file may be really big, >1gb
$ grep -w 'word1' infile
$ grep -w 'word2' infile
I don't need them on the same line like grep for 2 words existing on the same line. I just need to avoid redundant iteration of the whole file
Use this:
grep -E -w "word1|word2" infile
or
egrep -w "word1|word2" infile
It will match lines matching either word1, word2 or both.
From man grep:
-E, --extended-regexp
Interpret PATTERN as an extended regular expression (ERE, see below).
Test
$ cat file
The fish [ate] the bird.
[This is some] text.
Here is a number [1001] and another [1201].
$ grep -E -w "is|number" file
[This is some] text.
Here is a number [1001] and another [1201].
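If the words might contain regex metacharacters, a variant with fixed strings should behave the same way while still reading the file only once. This is a sketch: the use of -F with one -e per word is my suggestion, not part of the answer above.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'The fish [ate] the bird.' '[This is some] text.' \
  'Here is a number [1001] and another [1201].' > "$workdir/file"

# One pass over the file; each -e adds a pattern, -F disables regex interpretation,
# and -w still requires whole-word matches.
matches=$(grep -w -F -e 'is' -e 'number' "$workdir/file")

printf '%s\n' "$matches"
rm -r "$workdir"
```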

Determining word count using grep (in cases where there are multiple words in a line)

Is it possible to determine the number of times a particular word appears using grep?
I tried the "-c" option, but this returns the number of lines in which the word appears, not the number of occurrences.
For example if I have a file with
some words and matchingWord and matchingWord
and then another matchingWord
running grep on this file for "matchingWord" with the "-c" option will only return 2 ...
note: this is the grep command line utility on a standard unix os
grep -o string file will return all matching occurrences of string. You can then do grep -o string file | wc -l to get the count you're looking for.
I think that using grep -i -o string file | wc -l should give you the correct output, what happens when you do grep -i -o string file on the file?
You can simply count words (-w) with the wc program:
> echo "foo foo" | grep -o "foo" | wc -w
> 2
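To see the difference between -c and the grep -o ... | wc -l approach from the answers above, here is a small sketch on the question's two sample lines. The file name sample.txt and the variable names are assumptions.

```shell
workdir=$(mktemp -d)
printf '%s\n' 'some words and matchingWord and matchingWord' \
  'and then another matchingWord' > "$workdir/sample.txt"

# -c counts matching lines, not individual matches.
line_count=$(grep -c 'matchingWord' "$workdir/sample.txt")

# -o prints each match on its own line, so counting lines counts occurrences.
# tr strips the padding that some wc implementations add.
occurrences=$(grep -o 'matchingWord' "$workdir/sample.txt" | wc -l | tr -d ' ')

printf 'lines: %s, occurrences: %s\n' "$line_count" "$occurrences"
rm -r "$workdir"
```

This should report 2 matching lines and 3 occurrences.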

Resources