removing lines that don't match using grep

I have a file containing gene names of interest (24423 genes) and another file containing the lengths of all the genes (41306 genes). I want the lengths of only those 24423 genes, but when I grep using grep -wf file1 file2, or even grep -Fwf file1 file2, I get some excess genes: some genes in my list may have only the sense or the anti-sense strand, whereas the reference file may contain both, and that is reflected in the output.
Is there a way to remove from the reference file (file2) all the lines that don't match?
Thank you.
P.S. The question is also on biostars.org
edit -
file1
A1BG
A1BG-AS1
TSPAN6
MYB
MYB-AS1
file2
A1BG 2941
A1BG-AS1 560
TSPAN6 7923
MYB-AS1 362
MYB-AS2 713
MYB-AS3 396
desired_output
A1BG 2941
A1BG-AS1 560
TSPAN6 7923
MYB-AS1 362
But I always get MYB-AS2 and MYB-AS3 as well.

$ cat f1
A1BG
A1BG-AS1
TSPAN6
MYB
MYB-AS1
$ cat f2
A1BG 2941
A1BG-AS1 560
TSPAN6 7923
MYB-AS1 362
MYB-AS2 713
MYB-AS3 396
$ grep -Fwf f1 f2
A1BG 2941
A1BG-AS1 560
TSPAN6 7923
MYB-AS1 362
MYB-AS2 713
MYB-AS3 396
grep won't help here because MYB will also match MYB-AS1, MYB-AS2, etc.: with -w, the hyphen is a non-word character and therefore acts as a word boundary.
Use awk instead:
$ awk 'NR==FNR{a[$1]; next} $1 in a' f1 f2
A1BG 2941
A1BG-AS1 560
TSPAN6 7923
MYB-AS1 362
NR==FNR{a[$1]; next} builds an array with the first field of the first file as keys.
$1 in a prints lines from the second file whose first field is a key in the array; the entire field has to match.
See also http://backreference.org/2010/02/10/idiomatic-awk/ for more examples and explanation of this type of two-file processing.
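If you want to stay with grep, one workaround is to anchor each gene name to the start of the line and require a trailing space, so that MYB can no longer match MYB-AS1 (a sketch using bash process substitution; it assumes the gene names contain no regex metacharacters):
$ grep -f <(sed 's/.*/^& /' f1) f2
A1BG 2941
A1BG-AS1 560
TSPAN6 7923
MYB-AS1 362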

Related

how to change the output file format in psql from default to csv?

I am using Linux and have connected to a PostgreSQL server with psql. After using the \o command to export to a file, the output has columns separated by "|" and a header rule made of "-" and "+" characters. Please see below:
abc | cde | fgh | xyz
----+-----+-----+-----
123 | 321 | 123 | 123
123 | 321 | 222 | 111
923 | 238 | 928 | 192
etc.
This format might be the default, but it is not very useful for data analysis.
Can I change the output file format to ".csv" with some additional command in psql?
Thanks,
You can export CSV from psql. There's a dedicated command to do so. I've written a longer article on this.
The gist is this:
\copy (SELECT ...) TO 'local_file.csv' WITH (FORMAT csv, HEADER)
This will copy the data as CSV to your local drive (i.e. the machine psql is running on).
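For example, to export the sample table from the question (my_table is a hypothetical name, standing in for whatever the real table is called):
\copy (SELECT abc, cde, fgh, xyz FROM my_table) TO 'my_table.csv' WITH (FORMAT csv, HEADER)
Wrapping a SELECT in parentheses like this lets you export an arbitrary query rather than a whole table.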

Grep -w is ignoring hyphen[-]

I have text file sample.txt like following
ID=Sam-S-PA.path1;Name=Sam-S-PA 23 Hz42
ID=GlcAT-S-PA.path1;Name=GlcAT-S-PA 45 iu7s
ID=TfIIA-S-PA.path1;Name=TfIIA-S-PA 76 5ghz
ID=S-PA.path1;Name=S-PA 69 ivcs
ID=TyrRS-PA.path1;Name=TyrRS-PA 51 Pqas
ID=HisRS-PA.path1;Name=HisRS-PA 32 Majs
I would like to extract the row containing only S-PA, using grep. I tried the following command:
grep -w "S-PA" sample.txt
But it gave output that included all the entries, which I don't want. I want the following output:
ID=S-PA.path1;Name=S-PA 69 ivcs
Kindly guide me. Thanks in advance.
Using negative look-ahead and look-behind:
$ grep -P '(?<![\w-])S-PA(?![\w-])' sample.txt
ID=S-PA.path1;Name=S-PA 69 ivcs
Effectively you include - in the "word" for word-boundary purposes.
(?<![\w-]) ensures that S-PA is not preceded by a word character or -.
Similarly, (?![\w-]) ensures the same for the character that follows.
Using a regex anchored to the start of the line:
grep -E "^ID=S-PA\." sample.txt
or
egrep "^ID=S-PA\." sample.txt
It seems you want to match =S-PA followed by a space. Use
grep '=S-PA ' sample.txt
or
grep '=S-PA[[:blank:]]' sample.txt
where [[:blank:]] matches either a regular space or a tab char.
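If PCRE look-arounds are not available, an exact field comparison in awk works as well (a sketch that splits each line on =, ; and whitespace, so S-PA has to stand alone as a field):
$ awk -F'[=;[:blank:]]+' '{for (i = 1; i <= NF; i++) if ($i == "S-PA") {print; next}}' sample.txt
ID=S-PA.path1;Name=S-PA 69 ivcs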

grep invert match on two files

I have two text files containing one column each, for example -
File_A File_B
1 1
2 2
3 8
If I do grep -f File_A File_B > File_C, I get File_C containing 1 and 2. I want to know how to use grep -v on two files so that I can get the non-matching values, 3 and 8 in the above example.
Thanks.
You can also use comm, if your implementation supports an empty output delimiter:
$ # -3 means suppress lines common to both input files
$ # by default, tab character appears before lines from second file
$ comm -3 f1 f2
3
	8
$ # change it to empty string
$ comm -3 --output-delimiter='' f1 f2
3
8
Note: comm requires sorted input, so use comm -3 --output-delimiter='' <(sort f1) <(sort f2) if they are not already sorted
You can also pass the common lines obtained from grep as input to grep -v. Tested with GNU grep; some versions might not support all these options:
$ grep -Fxf f1 f2 | grep -hxvFf- f1 f2
3
8
-F option to match strings literally, not as regex
-x option to match whole lines only
-h to suppress file name prefix
f- to accept stdin instead of file input
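For reference, the first stage alone extracts the lines common to both files, which the second stage then inverts (shown here with the File_A/File_B sample data saved as f1 and f2):
$ grep -Fxf f1 f2
1
2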
With awk, the same two-file idiom gives the values unique to each file; run it in both directions:
$ awk 'NR==FNR{a[$0]; next} !($0 in a)' f1 f2
8
$ awk 'NR==FNR{a[$0]; next} !($0 in a)' f2 f1
3
To understand the meaning of NR and FNR, check the output of printing them below:
awk '{print NR,FNR}' f1 f2
1 1
2 2
3 3
4 1
5 2
6 3
The condition NR==FNR selects the first file, since NR and FNR are equal only while the first file is being read.
With the GNU diff command (to compare files line by line):
diff --suppress-common-lines -y f1 f2 | column -t
The output (the left column contains lines from f1, the right column lines from f2):
3 | 8
-y, --side-by-side - output in two columns
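And since the question asks about grep -v specifically, you can also run the inverse match directly in both directions (-F and -x keep the comparison literal and whole-line):
$ grep -Fxvf f1 f2
8
$ grep -Fxvf f2 f1
3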

fgrep with pattern in fixed positions

I am having a file x with lines say
123
345
789
830
I want to grep the lines matching these patterns from another file, x1, but only those lines where the pattern appears at positions 151-153 of the line in x1.
Something like the following might work with bash:
pfx=$(printf %*s 150 | tr ' ' .)   # 150 dots: one for each character to skip
grep -f <(sed -e 's/^/^'"$pfx"'/' x) x1   # anchor each pattern after 150 arbitrary characters
I feel like there should be a better way to do this, but the alternatives I can think of are all more complicated.
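An awk alternative that avoids building the 150-dot prefix (a sketch; it assumes every pattern in x is exactly three characters long, matching positions 151-153):
$ awk 'NR==FNR{p[$0]; next} substr($0, 151, 3) in p' x x1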

print every nth line into a row using gawk

I have a very large file from which I need to obtain every nth line and print it.
My data:
1 937 4.320194
2 667 4.913314
3 934 1.783326
4 940 -0.299312
5 939 2.309559
6 936 3.229496
7 611 -1.41808
8 608 -1.154019
9 606 2.159683
10 549 0.767828
I want my data to look like this:
1 937 4.320194
3 934 1.783326
5 939 2.309559
7 611 -1.41808
9 606 2.159683
This is of course an example; I want every 10th line of my huge data file. This is what I have tried so far:
NF == 6 {
if(NR%10) {print;}
}
To print every second line, starting with the first:
awk 'NR%2==1' file.txt
To print every tenth line, starting with the tenth line:
awk 'NR%10==0' file.txt
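Applied to the sample data above, the first of these commands produces exactly the desired output:
$ awk 'NR%2==1' file.txt
1 937 4.320194
3 934 1.783326
5 939 2.309559
7 611 -1.41808
9 606 2.159683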
To use this in a script, add the following to a file called script.awk:
BEGIN {
    print "Processing file"
}
NR%10==0
END {
    print "Finished processing"
}
Then execute:
awk -f script.awk file.txt
With sed, you can do a lot of variations on this quite easily with the first~step command. For instance:
# Odd lines
sed -n 1~2p file
# Every tenth line (10, 20, 30, ...)
sed -n 10~10p file
# Every tenth line (1, 11, 21, ...)
sed -n 1~10p file
# First plus every tenth (1, 10, 20, 30, ...)
sed -n -e 1p -e 10~10p file
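Note that first~step addressing is a GNU sed extension, so these commands may not work with other sed implementations.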
Piece of cake: awk 'NR % 10 == 1' test.txt
It's not (g)awk, but it'll work:
grep '^[[:digit:]]*0[[:blank:]]' myfile should do the trick; it relies on the first field being the line number and matches the lines whose number ends in 0.
Doing it directly in the Command Prompt (Windows):
Put gawk.exe in the folder containing the file, start a Command Prompt in that folder, and write
gawk "NR%n==x" oldfile.txt>newfile.txt
where n selects every nth line to print and x is the starting line.
E.g. n=10 and x=1 prints lines 1, 11, 21, 31, 41, ... of the original file into the new file.
E.g. n=20 and x=5 prints lines 5, 25, 45, 65, ... of the original file into the new file.
