Isolating a part of a text file using grep

I have a big file like this small example:
chr1 HAVANA transcript 69091 70008 . + . gene_id "ENSG00000186092.4"; transcript_id "ENST00000335137.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "OR4F5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "OR4F5-001"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS30547.1"; havana_gene "OTTHUMG00000001094.1"; havana_transcript "OTTHUMT00000003223.1";
chr1 HAVANA exon 69091 70008 . + . gene_id "ENSG00000186092.4"; transcript_id "ENST00000335137.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "OR4F5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "OR4F5-001"; exon_number 1; exon_id "ENSE00002319515.1"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS30547.1"; havana_gene "OTTHUMG00000001094.1"; havana_transcript "OTTHUMT00000003223.1";
chr1 HAVANA CDS 69091 70005 . + 0 gene_id "ENSG00000186092.4"; transcript_id "ENST00000335137.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "OR4F5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "OR4F5-001"; exon_number 1; exon_id "ENSE00002319515.1"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS30547.1"; havana_gene "OTTHUMG00000001094.1"; havana_transcript "OTTHUMT00000003223.1";
Each line starts with "chr". I want to make a new file containing only the lines whose 3rd column is "CDS". How can I express this condition with grep? I used the following command:
grep -i CDS infile.txt > outfile
but it returns every line that contains "CDS" anywhere, regardless of the column. How can I fix it?
I want to get this from the small example:
chr1 HAVANA CDS 69091 70005 . + 0 gene_id "ENSG00000186092.4"; transcript_id "ENST00000335137.3"; gene_type "protein_coding"; gene_status "KNOWN"; gene_name "OR4F5"; transcript_type "protein_coding"; transcript_status "KNOWN"; transcript_name "OR4F5-001"; exon_number 1; exon_id "ENSE00002319515.1"; level 2; tag "basic"; tag "appris_principal"; tag "CCDS"; ccdsid "CCDS30547.1"; havana_gene "OTTHUMG00000001094.1"; havana_transcript "OTTHUMT00000003223.1";

The clean solution is to explicitly check the third column, with awk:
awk '$3 == "CDS"' infile.txt
In your sample, every other occurrence of CDS is part of a longer word (such as "CCDS"), so
grep -w 'CDS' infile.txt
would work as well by requiring a whole-word match, but that relies on the rest of the file looking like the sample.
A grep solution that checks the third column could look like this (requires GNU grep for \s, \S and \>):
grep -E '^(\S+\s+){2}CDS\>' infile.txt
or POSIX conformant:
grep -E '^([^[:blank:]]+[[:blank:]]+){2}CDS([[:blank:]]|$)' infile.txt
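Whichever variant you pick, redirect its output to create the new file, for example:
awk '$3 == "CDS"' infile.txt > outfile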

Related

Passing multiple parameters to GNU parallel

I have a sample script called sample.sh which takes three inputs X,Y and Z
$ cat sample.sh
#! /bin/bash
X=$1
Y=$2
Z=$3
file=X${X}_Y${Y}_Z${Z}
echo "$(hostname) $(date)" >> "./$file"
Now I can give parameters in the following way:
parallel ./sample.sh {1} {2} {3} ::: 1.0000 1.1000 ::: 2.0000 2.1000 ::: 3.0000 3.1000
Or I could do:
parallel ./sample.sh {1} {2} {3} :::: xlist ylist zlist
where xlist, ylist and zlist are files which contain the parameter list.
But what if I want to have one file called parameter.dat?
$ cat parameter.dat
#xlist
1.0000 1.1000
#ylist
2.0000 2.1000
#zlist
3.0000 3.1000
I can use awk to read parameter.dat and produce temporary files called xlist, ylist and so on...
But is there a better way using gnu-parallel itself?
Ultimately, I want to simply append further xlist, ylist and zlist lines to parameter.dat and run sample.sh with the last instance of each, so that parameter.dat itself keeps a record of the parameter runs I have already done.
I am looking for an elegant way to do this.
Edit: My current solution is:
#! /bin/bash
tail -1 < parameter.dat | head -1 | awk '{$1=$1};1' | tr ' ' '\n' > zlist # last line: z values
tail -3 < parameter.dat | head -1 | awk '{$1=$1};1' | tr ' ' '\n' > ylist # third line from the end: y values
tail -5 < parameter.dat | head -1 | awk '{$1=$1};1' | tr ' ' '\n' > xlist # fifth line from the end: x values
parallel ./sample.sh {1} {2} {3} :::: xlist ylist zlist
rm xlist ylist zlist
There is no built-in way of doing what you want, and your solution is not too bad.
If you control parameter.dat and it is not too big (under 128 KB, so it fits on a single command line) I would probably do:
$ cat parameter.dat
::: x valueX1 ValueX2
::: y valueY1 ValueY2
::: z ValueZ1 ValueZ2 ValueZ3
# Note: the $() is deliberately unquoted, and the ::: separators are part of parameter.dat
$ parallel --header : ./sample.sh $(cat parameter.dat)
--header : tells GNU parallel to treat the first value of each input source as its name rather than as a value. It also means you can use {x}, {y} and {z} in the command template.
This makes it easy to add another parameter, and there are no temporary files to clean up.
You are, however, restricted: because the expansion is unquoted, your values cannot contain spaces or characters that the shell expands (e.g. ? and *). Other shell-special characters (e.g. $ ' " `) are fine, as the shell does not re-evaluate them after expansion.
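As a quick illustration of the named replacement strings (a sketch; the names come from the first word on each line of parameter.dat):
$ parallel --header : echo X={x} Y={y} Z={z} $(cat parameter.dat)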

joining 2 files on matching column values using awk

I know there have been similar questions posted but I'm still having a bit of trouble getting the output I want using awk FNR==NR...
I have 2 files as such
File 1:
123|this|is|good
456|this|is|better
...
File 2:
aaa|123
bbb|456
...
I want to join file 2's column 2 against file 1's column 1, and output columns 2, 3 and 4 of file 1 followed by column 1 of file 2.
Thanks in advance.
With awk you could do something like
awk -F \| 'BEGIN { OFS = FS } NR == FNR { val[$2] = $1; next } $1 in val { $(NF + 1) = val[$1]; print }' file2 file1
NF is the number of fields in a record (line by default), so $NF is the last field, and $(NF + 1) is the field after that. By assigning the saved value from the pass over file2 to it, a new field is appended to the record before it is printed.
One thing to note: This behaves like an inner join, i.e., only records are printed whose key appears in both files. To make this a right join, you can use
awk -F \| 'BEGIN { OFS = FS } NR == FNR { val[$2] = $1; next } { $(NF + 1) = val[$1]; print }' file2 file1
That is, you can drop the $1 in val condition on the append-and-print action. If $1 is not in val, val[$1] is empty, and an empty field will be appended to the record before printing.
But it's probably better to use join:
join -1 1 -2 2 -t \| file1 file2
If you don't want the key field to be part of the output, pipe the output of either of those commands through cut -d \| -f 2- to get rid of it, i.e.
join -1 1 -2 2 -t \| file1 file2 | cut -d \| -f 2-
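Note that join expects both files to be sorted on their join fields. If yours are not, you can sort them on the fly with process substitutions, e.g.:
join -1 1 -2 2 -t \| <(sort -t '|' -k 1,1 file1) <(sort -t '|' -k 2,2 file2)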
If the files have the same number of lines in the same order, then
paste -d '|' file1 file2 | cut -d '|' -f 2-5
this|is|good|aaa
this|is|better|bbb
I see in a comment to Wintermute's answer that the files aren't sorted. With bash, process substitutions are handy to sort on the fly:
paste -d '|' <(sort -t '|' -k 1,1 file1) <(sort -t '|' -k 2,2 file2) |
cut -d '|' -f 2-5
To reiterate: this solution requires a one-to-one, line-by-line correspondence between the two files.

fgrep with pattern in fixed positions

I have a file x with lines such as:
123
345
789
830
I want to grep the lines of another file x1 that match these patterns, but only where the pattern appears at positions 151-153 of the line.
Something like the following might work with bash:
pfx=$(printf '%*s' 150 | tr ' ' .) # 150 dots, one per character to skip
grep -f <(sed -e 's/^/^'"$pfx"'/' x) x1 # anchor each pattern at position 151
I feel like there should be a better way to do this but they are all more complicated.
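A simpler alternative, if awk is acceptable, is to compare the substring directly; a sketch assuming, as in your sample, that every pattern is exactly three characters long:
awk 'NR == FNR { pat[$0]; next } substr($0, 151, 3) in pat' x x1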

grep that matches around the first match

I would like to grep a specific word 'foo' inside specific files, get the N lines around each match, and show only those context blocks that also contain a second word, 'bar'.
I found this but it doesn't really work...
find . | grep -E '.*?\.(c|asm|mac|inc)$' | \
xargs grep --color -C3 -rie 'foo' | \
xargs -n1 --delimiter='--' | grep --color -l 'bar'
For instance I have the file 'a':
a
b
c
d
bar
f
foo
g
h
i
j
bar
l
The file b:
a
bar
c
d
e
foo
g
h
i
j
k
With grep -C2 I expect the output below: file a matches because bar falls within the -C2 context of foo, but I expect no match from the second file because its bar is not within the -C2 context of foo...
--
./foo- bar
./foo- f
./foo- **foo**
./foo- g
./foo- h
--
Any ideas?
You could do this pretty simply with a "while read line" loop:
find -regextype posix-extended -regex "./file[a-z]" | while IFS= read -r line; do grep -nHC2 "foo" "$line" | grep --color bar; done
Output:
./filea-5-bar
./filec-46-... host pwns.me [94.23.120.252]: 451 4.7.1 Local bar configuration error ...
In this example, I created the following files:
filea - your example a
fileb - your example b
filec - some random exim log output with foo and bar tossed in 2 lines apart
filed - the same exim log output, but with foo and bar tossed in 3 lines apart
You could also pipe the output after done, to alter the format:
; done | sed -E 's/-([0-9]{1,6})-/: line: \1 ::: /'
Formatted output
./filea: line: 5 ::: bar
./filec: line: 46 ::: ... host pwns.me [94.23.120.252]: 451 4.7.1 Local bar configuration error ...
I think I only understand the first line of your question and this does what I think you mean!
#!/bin/bash
N=2
pattern1=a
pattern2=z
matchinglines=( $(awk -v p="$pattern1" '$0~p{print NR}' file) ) # Build an array of matching line numbers
for x in "${matchinglines[@]}"
do
((start=x-N))
[[ $start -lt 1 ]] && start=1 # Avoid passing negative line numbers to sed
((end=x+N))
echo DEBUG: Checking block between lines $start and $end
sed -ne "${start},${end}p" file | grep -q "$pattern2"
[[ $? -eq 0 ]] && sed -ne "${start},${end}p" file
done
You need to set pattern1 and pattern2 at the start of the script. It uses awk to build an array of the line numbers that match your first pattern, then loops through the array, setting the start and end of a range +/-N either side of each matching line number. It then uses sed to extract that block and passes it through grep to see if it contains pattern2, printing the block if it does. It may not be the most efficient approach, but it is easy to understand and maintain.
It assumes your file is called file.
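For comparison, the same logic fits in a single awk pass that buffers the file and checks each block in memory; a sketch with the patterns and context width hard-coded (here foo, bar and N=2):
awk -v N=2 -v p1='foo' -v p2='bar' '
    { line[NR] = $0 }                 # buffer every line
    $0 ~ p1 { hit[++h] = NR }         # record where the first pattern matches
    END {
        for (i = 1; i <= h; i++) {
            start = hit[i] - N; if (start < 1) start = 1
            end = hit[i] + N; if (end > NR) end = NR
            ok = 0
            for (j = start; j <= end; j++) if (line[j] ~ p2) ok = 1
            if (ok) for (j = start; j <= end; j++) print line[j]
        }
    }' file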
pipe it twice
grep -C2 'foo' file | grep 'bar'
The first grep extracts the context blocks around foo and the second filters them for bar; note that grep is line-oriented, so this prints only the matching bar lines rather than the whole blocks.

How to grep out a batch of consecutive lines from a file

I want to grep out a batch of consecutive lines that starts at a specific pattern and ends at another specific pattern.
E.g. the content of file looks like this :
line 1
line 2
.
.
.
my_start_pattern
.
.
.
my_end_pattern
.
.
.
line n
The output of grep should look like the following:
my_start_pattern
.
.
.
my_end_pattern
Thanks.
Don't know if grep can do this, but awk can.
awk '/start pattern/,/end pattern/' data_file_name
(leave off the file name if you want to filter from stdin)
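If you prefer to stay closer to sed, the equivalent range address does the same job; a sketch using the patterns from the example:
sed -n '/my_start_pattern/,/my_end_pattern/p' data_file_name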
