I have two files (LIST.txt and FILE1.txt). I'm trying to use grep to get output in the same order as LIST.txt.
LIST.txt
rs201196551
rs8071824
rs74620303
FILE1.txt
rs201196551 red
rs74620303 blue
rs9000000 pink
rs8071824 purple
I used this code: grep -wFf LIST.txt FILE1.txt > OUTPUT.txt
And I obtained this output:
rs201196551 red
rs74620303 blue
rs8071824 purple
But actually I expect this output:
rs201196551 red
rs8071824 purple
rs74620303 blue
(in the same order as LIST.txt).
I don't think you can change the output order of grep without additional tools. However, here is an awk script that buffers the output in the order of the list file:
$ awk '
NR==FNR {                    # process the list file
    a[$0]=++c                # map each list entry to its position in the list
    next                     # process next list item
}
{                            # process file1
    for(i in a)              # for each list item
        if($1==i) {          # see if it is the first word
            b[a[i]]=b[a[i]] (b[a[i]]==""?"":ORS) $0  # append to output buffer
            next             # no more candidates after match
        }
}
END {                        # in the end
    for(i=1;i<=c;i++)        # start outputting
        if(b[i]!="")         # skip empties
            print b[i]
}' list file1
Output:
rs201196551 red
rs8071824 purple
rs74620303 blue
Update: From the comments, thanks @Sundeep:
$ awk '
NR==FNR {        # hash the haystack instead, i.e. file1
    a[$1]=$0
    next
}
($0 in a) {      # now read the needles from the list and look them up in a
    print a[$0]
}' file1 list
Output:
rs201196551 red
rs8071824 purple
rs74620303 blue
However, if file1 contains several lines with the same $1, only the last of them is kept, because a[$1]=$0 overwrites the earlier ones.
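If lines with a repeated key in file1 matter, one possible variant appends each line to the entry for its key instead of overwriting it. This is only a minimal sketch: it recreates the sample data from the question in a temporary directory and adds one hypothetical extra line (rs8071824 green) to show the effect.

```shell
# Build hypothetical sample files in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' rs201196551 rs8071824 rs74620303 > "$tmp/LIST.txt"
printf '%s\n' 'rs201196551 red' 'rs74620303 blue' 'rs9000000 pink' \
              'rs8071824 purple' 'rs8071824 green' > "$tmp/FILE1.txt"

# First file (FILE1.txt): collect every line under its first field.
# Second file (LIST.txt): print the collected lines in list order.
ordered=$(awk '
NR==FNR { a[$1] = a[$1] (a[$1]=="" ? "" : ORS) $0; next }
($0 in a) { print a[$0] }
' "$tmp/FILE1.txt" "$tmp/LIST.txt")

printf '%s\n' "$ordered"
rm -rf "$tmp"
```

Both rs8071824 lines are printed together, in the position that rs8071824 has in LIST.txt.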
I have a corpus file and a rules file. I am trying to find the words in the corpus that contain a string from the rules file.
# cat corpus.txt
this is a paragraph number one
second line
third line
# cat rule.txt
a
b
c
This returns 2 lines
# grep -F0 -f rule.txt corpus.txt
this is a paragraph number one
second line
But I am expecting 4 words like this...
a
paragraph
number
second
I am trying to achieve these results using grep or awk.
Assuming words are separated by white space:
awk '{print "\\S*" $1 "\\S*"}' rule.txt | grep -m 4 -o -f - corpus.txt
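The same result can be sketched in awk alone, which avoids turning the rules into regular expressions (so characters such as . or * in a rule would be taken literally). This is a minimal sketch under the same assumption of whitespace-separated words; it recreates the question's sample files in a temporary directory.

```shell
# Recreate the sample files from the question in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 'this is a paragraph number one' 'second line' 'third line' > "$tmp/corpus.txt"
printf '%s\n' a b c > "$tmp/rule.txt"

words=$(awk '
NR==FNR { rules[$1]; next }              # first file: remember each rule
{
    for (i = 1; i <= NF; i++)            # second file: test every word
        for (r in rules)
            if (index($i, r)) {          # literal substring test, no regex
                print $i
                break                    # print each word at most once
            }
}' "$tmp/rule.txt" "$tmp/corpus.txt")

printf '%s\n' "$words"
rm -rf "$tmp"
```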
I have a list of files, and I want to look for some specific keywords in those files. The output should be a line for each file with matches, showing the words that we found just once. For example, if I have the following file test.txt
one,two,three
four,five,six,
seven,eight,nine
and I do a grep of the words five and eight, it should return something like this:
test.txt:five,eight
I'm not interested in the lines, or the number of matches. I just want to know which words matched in each file. How can I do that?
GNU grep + awk solution:
Let's say we have file test1.txt with contents:
one,two,three
four,five,six,
seven,eight,nine
and test2.txt with contents:
one
two
three, four, five
Finding matches for words five and eight:
grep -Hwo '\(five\|eight\)' test* \
| awk -F':' '{ a[$1]=(a[$1])? a[$1]","$2:$2 }END{ for(i in a) print i FS a[i] }'
The output:
test1.txt:five,eight
test2.txt:five
grep details:
-H - Print the file name for each match
-w - Select only those lines containing matches that form whole words
-o - Print only the matched (non-empty) parts of matching lines
awk details:
-F':' - field separator
a[$1]=(a[$1])? a[$1]","$2:$2 - using filename $1 as array key for accumulating all matched words
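Since the question asks for each word to be shown just once per file, the accumulation step can also skip repeats. The following is a sketch rather than a definitive solution: it assumes file names without a colon, uses -e instead of the alternation pattern, and adds a hypothetical extra "five" to test1.txt to show the de-duplication. It also records the order in which files first appear, because the order of for(i in a) is unspecified in awk.

```shell
# Recreate the sample files (with one extra, hypothetical "five") in a temp dir.
tmp=$(mktemp -d)
printf '%s\n' 'one,two,three' 'four,five,six,' 'seven,eight,nine,five' > "$tmp/test1.txt"
printf '%s\n' one two 'three, four, five' > "$tmp/test2.txt"

matches=$(cd "$tmp" && grep -Hwo -e five -e eight test1.txt test2.txt |
  awk -F: '
    !seen[$0]++ {                          # first time this file:word pair occurs
        if (!($1 in a)) { order[++n] = $1; a[$1] = $2 }
        else              a[$1] = a[$1] "," $2
    }
    END { for (i = 1; i <= n; i++) print order[i] FS a[order[i]] }
  ')

printf '%s\n' "$matches"
rm -rf "$tmp"
```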
I have a fasta file like the test one here:
>HWI-D00196:168:C66U5ANXX:3:1106:16404:19663 1:N:0:GCCAAT
CCTAGCACCATGATTTAATGTTTCTTTTGTACGTTCTTTCTTTGGAAACTGCACTTGTTGCAACCTTGCAAGCCATATAAACACATTTCAGATATAAGGCT
>HWI-D00196:168:C66U5ANXX:3:1106:16404:19663 2:N:0:GCCAAT
AAAACATAAATTTGAGCTTGACAAAAATTAAAAATGAGCCCAGCCTTATATCTGAAATGTGTTTATATGGCTTGCAAGGTTGCAACAAGTGCAGTTTCCAA
>HWI-D00196:168:C66U5ANXX:4:1304:10466:100132 1:N:0:GCCAAT
ATATTTGAATTATCAGAAATAAACACAAAGAAAACCTAGAACAGATAATTTCTTCCACATTATTGATCAGATACAGATTTCAAGGGTACCGTTGTGAATTG
>HWI-D00196:168:C66U5ANXX:4:1304:10466:100132 2:N:0:GCCAAT
AAACGATTGATAGATCTATTTGCATTATAAAAACATTAAAAAAACAAAATACTGATTAAATGTCGTCTTTCTATTCCACAATTTTATAGATCTCACTGTAT
>HWI-D00196:168:C66U5ANXX:4:1307:12056:64030 1:N:0:GCCAAT
CTTACTTTGCCTCTCTCAGCCAATGTCTCCTGAGTCTAATTTTTTGGAGGCTAAGCTATGAGCTAATGATGGGTTCCATTTGGGGCCAATGCTTCAGCCTG
>HWI-D00196:168:C66U5ANXX:4:1307:12056:64030 2:N:0:GCCAAT
CTATTAGTTCTTATCTTTGCCTGCAAATATAAGACTAGCGCTTGAGTAGCTGACAGAGACAAAGTAAGCTGGAGTGTTTATCACCTGGTCACTCCAATTGT
When I type in a simple grep command like:
grep -B1 "CTT" test.fasta
I get a really strange output in which "--" is sometimes placed on a new line above the grep hit, like so:
>HWI-D00196:168:C66U5ANXX:4:1304:10466:100132 2:N:0:GCCAAT
AAACGATTGATAGATCTATTTGCATTATAAAAACATTAAAAAAACAAAATACTGATTAAATGTCGTCTTTCTATTCCACAATTTTATAGATCTCACTGTAT
--
>HWI-D00196:168:C66U5ANXX:4:1307:12056:64030 2:N:0:GCCAAT
CTATTAGTTCTTATCTTTGCCTGCAAATATAAGACTAGCGCTTGAGTAGCTGACAGAGACAAAGTAAGCTGGAGTGTTTATCACCTGGTCACTCCAATTGT
I can't figure out why some fasta entries have this and others don't. I don't get this problem when I remove the -B1. I can remove those lines from my file with a grep -v "--" statement, but I'd really like to understand what's going on here.
You are asking for one line of leading context by using the -B1 option. This means grep will display both the line which matched and the line directly before it. Groups of output lines that are not adjacent in the input are separated by -- on a line by itself, as shown below:
$ man grep | grep -B1 context
-A num, --after-context=num
Print num lines of trailing context after each match. See also
--
-B num, --before-context=num
Print num lines of leading context before each match. See also
--
-C[num, --context=num]
Print num lines of leading and trailing context surrounding each
--
--context[=num]
Print num lines of leading and trailing context. The default is
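The reason you aren't seeing -- between every match is that the separator is only printed between groups of output lines that are not adjacent in the input. Consecutive matching lines form a single group, and the context is only displayed above that group. See the following example: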
seq 13 | grep -B1 1
1
--
9
10
11
12
13
The seq command produces all the numbers between 1 and 13. Only the first line and the lines from 10 on contain a 1, so you see the 1 in its own group, then --, then the one line context, then the group of consecutive matching lines.
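If the goal is simply the header plus the sequence for every matching read, with no separator lines at all, one alternative is to let awk remember the most recent header. This is a minimal sketch, assuming every record is exactly one header line starting with > followed by one sequence line; the three short records are made up for illustration.

```shell
hits=$(printf '%s\n' '>read1' 'AACTTG' '>read2' 'GGGGAA' '>read3' 'CTTAAC' |
  awk '
    /^>/  { header = $0; next }      # remember the latest header line
    /CTT/ { print header; print }    # sequence matches: print header and sequence
  ')

printf '%s\n' "$hits"
```

Only read1 and read3 are printed, each with its header, and no -- line appears because awk has no notion of context groups.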
The GREP_COLORS section of the grep man page says:
Specifies the colors and other attributes used to highlight various parts of the output. Its value is a colon-separated list
of capabilities that defaults to
ms=01;31:mc=01;31:sl=:cx=:fn=35:ln=32:bn=32:se=36 with the rv and
ne boolean capabilities omitted (i.e., false).
and
se=36 SGR substring for separators that are inserted between
selected line fields (:), between context line fields, (-), and
between groups of adjacent lines when nonzero context is
specified (--). The default is a cyan text foreground over the
terminal's default background.
Consider the file sample.txt:
$ cat sample.txt
ABBB
AAB
AAB
S
S
S
AABB
ABAA
BAA
CCC
$ grep -B2 'AAB' sample.txt
ABBB
AAB
AAB
--
S
S
AABB
Here -- is grep's way of telling you that the AAB before the -- and the S after it are not adjacent lines in the actual file.
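Adding -n makes the gap visible, because grep then prefixes every output line with its line number, using : after the number for matching lines and - for context lines. A small sketch that recreates sample.txt as a temporary file:

```shell
tmp=$(mktemp)
printf '%s\n' ABBB AAB AAB S S S AABB ABAA BAA CCC > "$tmp"

numbered=$(grep -n -B2 'AAB' "$tmp")

printf '%s\n' "$numbered"
rm -f "$tmp"
```

The line numbers jump from 3 to 5 across the -- line, which shows that line 4 was skipped.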
I have several files that look like this:
abcd
several lines
abcd
several lines
abcd
several lines
.
.
.
What I want to do (preferably using grep) is to get the 20 lines immediately following the LAST abcd line.
Any help is appreciated.
Thanks
Use -A option:
-A NUM, --after-context=NUM
Print NUM lines of trailing context after matching lines. Places a line
containing a group separator (--) between contiguous groups of matches.
With the -o or --only-matching option, this has no effect and a warning
is given.
So:
$ grep -A 20 abcd file.txt
will give you each abcd line plus the 20 lines after it. To keep only the last abcd line and the 20 lines that follow it (21 lines in total), use tail:
$ grep -A 20 abcd file.txt | tail -21
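Note that tail -21 relies on at least 20 lines following the last abcd line; with fewer, it also picks up lines that belong to an earlier match. One way around this is to locate the last match with grep -n and print the range with sed. This is only a sketch: the file contents are made up, and a count of 3 stands in for 20 to keep the example short. It prints the lines after the last abcd line, not the abcd line itself.

```shell
tmp=$(mktemp)
printf '%s\n' abcd l1 l2 l3 l4 abcd m1 m2 m3 m4 m5 > "$tmp"
count=3                                   # stand-in for 20

# Line number of the last line containing abcd (empty if there is none).
last=$(grep -n 'abcd' "$tmp" | tail -n 1 | cut -d: -f1)

after=""
if [ -n "$last" ]; then
  # Print lines last+1 .. last+count; sed stops quietly at end of file.
  after=$(sed -n "$((last + 1)),$((last + count))p" "$tmp")
fi

printf '%s\n' "$after"
rm -f "$tmp"
```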
You can do this:
awk '/abcd/ {n=NR} {a[NR]=$0} END {if (n) for (i=n;i<=n+20 && i<=NR;i++) print a[i]}' file
It searches for the pattern abcd and updates n on every match, so only the line number of the last match is kept.
It also stores every line in the array a.
In the END section it prints the last matching line and up to 20 lines after it.
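Storing the whole file in an array can be costly for large inputs. A possible variant keeps only the lines that follow the most recent match and restarts the buffer whenever abcd appears again. This sketch prints the lines after the last abcd line (not the abcd line itself); the input and N=3 are hypothetical stand-ins for the real file and 20.

```shell
tail_lines=$(printf '%s\n' abcd l1 l2 abcd m1 m2 m3 m4 |
  awk -v N=3 '
    /abcd/        { seen = 1; n = 0; next }   # new match: restart the buffer
    seen && n < N { buf[++n] = $0 }           # keep at most N lines after it
    END           { for (i = 1; i <= n; i++) print buf[i] }
  ')

printf '%s\n' "$tail_lines"
```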
I would like to search for a certain pattern (say, Bar line) and also print the lines above and below it (e.g. 1 or 2 lines on each side of the pattern).
Foo line
Bar line
Baz line
....
Foo1 line
Bar line
Baz1 line
....
Use grep with the parameters -A and -B to indicate the number of lines After and Before the match that you want to print:
grep -A1 -B1 yourpattern file
-An prints n lines "after" the match.
-Bm prints m lines "before" the match.
If both numbers are the same, just use -C:
grep -C1 yourpattern file
Test
$ cat file
Foo line
Bar line
Baz line
hello
bye
hello
Foo1 line
Bar line
Baz1 line
Let's grep:
$ grep -A1 -B1 Bar file
Foo line
Bar line
Baz line
--
Foo1 line
Bar line
Baz1 line
To get rid of the group separator, you can use --no-group-separator:
$ grep --no-group-separator -A1 -B1 Bar file
Foo line
Bar line
Baz line
Foo1 line
Bar line
Baz1 line
From man grep:
-A NUM, --after-context=NUM
Print NUM lines of trailing context after matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
-B NUM, --before-context=NUM
Print NUM lines of leading context before matching lines.
Places a line containing a group separator (--) between
contiguous groups of matches. With the -o or --only-matching
option, this has no effect and a warning is given.
-C NUM, -NUM, --context=NUM
Print NUM lines of output context. Places a line containing a
group separator (--) between contiguous groups of matches. With
the -o or --only-matching option, this has no effect and a
warning is given.
grep is the tool for you, but it can also be done with awk:
awk '{a[NR]=$0} $0~s {f=NR} END {for (i=f-B;i<=f+A;i++) print a[i]}' B=1 A=2 s="Bar" file
NB: this prints the context of only one hit (the last one).
or with grep:
grep -A2 -B1 "Bar" file
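To cover every hit in awk rather than only the last one, the matching line numbers can be collected and their context printed in the END block. This is a sketch, not a replacement for grep: the variable names B, A and s follow the one-liner above, the input is the test file from this thread, and the whole file is held in memory.

```shell
context=$(printf '%s\n' 'Foo line' 'Bar line' 'Baz line' hello bye hello \
                        'Foo1 line' 'Bar line' 'Baz1 line' |
  awk -v B=1 -v A=1 -v s='Bar' '
    { a[NR] = $0 }                    # keep every line
    $0 ~ s { hit[++h] = NR }          # remember each matching line number
    END {
      for (k = 1; k <= h; k++)
        for (i = hit[k] - B; i <= hit[k] + A; i++)
          if (i >= 1 && i <= NR && !printed[i]++)   # stay in range, no repeats
            print a[i]
    }
  ')

printf '%s\n' "$context"
```

Unlike grep, this prints no -- between groups, so the result resembles grep --no-group-separator -A1 -B1 Bar file.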