I have a big file with many lines starting like this:
22 16052167 rs375684679 A AAAAC . PASS DR2=0.02;AF=0.4728;IMP GT:DS
In these lines, DR2 values range from 0 to 1 and I would like to extract the lines that contain DR2 values higher than 0.8.
I've tried both sed and awk solutions, but neither seems to work... I've tried the following:
grep "DR2=[0-1]\.[8-9]*" myfile
grep 'DR2=\(1\|0\.[89]\)' myfile
This matches lines with a value greater than or equal to 0.8. If you insist on strictly greater than, then I'll have to add some complexity to prevent 0.8 from matching.
The trick is that you need two separate subpatterns: one to match 1, and one to match 0.8 up to (but not including) 1.
grep: grep -E 'DR2=([1-9]|0[.][89])'
sed: sed -E -n '/DR2=([1-9]|0[.][89])/p'
awk: awk '/DR2=([1-9]|0[.][89])/'
These 3 solutions are all based on a single regular expression and all do the same thing (see Ruud Helderman's solution).
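If you need strictly greater than 0.8 (i.e. excluding DR2=0.8 itself), one possible sketch, assuming DR2 values are always written as 0.x or 1, is to require a nonzero digit after 0.8:
grep -E 'DR2=(0\.8[0-9]*[1-9]|0\.9|1)' myfile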
With awk, however, you could do an arithmetic check if your limits are a bit more tricky. Let's say I want the value of DR2 to be between 0.53 and 1.39.
awk '! match($0,/DR2=/) { next }
{ val = substr($0,RSTART+RLENGTH)+0 }
( 0.53 < val) && ( val < 1.39 )'
Whenever you have tag=value pairs in your data, I find it best to first create an array of those pairings (f[] below); then you can just access the values by their tags. You didn't provide any input with DR2 above 0.8 to test against, so using the data you did provide:
$ awk '{split($8,t,/[=;]/); for (i=1; i in t; i+=2) f[t[i]]=t[i+1]} f["DR2"] > 0.01' file
22 16052167 rs375684679 A AAAAC . PASS DR2=0.02;AF=0.4728;IMP GT:DS
or using variables for the tag and value:
$ awk -v tag='DR2' -v val='0.8' '{split($8,t,/[=;]/); for (i=1; i in t; i+=2) f[t[i]]=t[i+1]} f[tag] > val' file
$
$ awk -v tag='DR2' -v val='0.01' '{split($8,t,/[=;]/); for (i=1; i in t; i+=2) f[t[i]]=t[i+1]} f[tag] > val' file
22 16052167 rs375684679 A AAAAC . PASS DR2=0.02;AF=0.4728;IMP GT:DS
$
$ awk -v tag='AF' -v val='0.4' '{split($8,t,/[=;]/); for (i=1; i in t; i+=2) f[t[i]]=t[i+1]} f[tag] > val' file
22 16052167 rs375684679 A AAAAC . PASS DR2=0.02;AF=0.4728;IMP GT:DS
$
$ awk -v tag='AF' -v val='0.5' '{split($8,t,/[=;]/); for (i=1; i in t; i+=2) f[t[i]]=t[i+1]} f[tag] > val' file
$
or using compound conditions:
$ awk '{split($8,t,/[=;]/); for (i=1; i in t; i+=2) f[t[i]]=t[i+1]}
(f["AF"] > 0.4) && (f["AF"] < 0.5) && (f["DR2"] >= 0.02)
' file
22 16052167 rs375684679 A AAAAC . PASS DR2=0.02;AF=0.4728;IMP GT:DS
The point is that whatever comparisons you want to do with the values of those tags are trivial, and you don't need to write more code to isolate and save those tags and their values.
Related
I have a file f1:
line1
line2
line3
line4
..
..
I want to delete all the lines which are in another file f2:
line2
line8
..
..
I tried something with cat and sed, which wasn't even close to what I intended. How can I do this?
grep -v -x -f f2 f1 should do the trick.
Explanation:
-v to select non-matching lines
-x to match whole lines only
-f f2 to get patterns from f2
One can instead use grep -F or fgrep to match fixed strings from f2 rather than patterns (in case you want to remove the lines in a "what you see is what you get" manner rather than treating the lines in f2 as regex patterns).
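For example, combining those flags (a sketch):
grep -v -x -F -f f2 f1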
Try comm instead (assuming f1 and f2 are "already sorted")
comm -2 -3 f1 f2
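If they are not already sorted, a sketch using bash process substitution:
comm -2 -3 <(sort f1) <(sort f2)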
For exclude files that aren't too huge, you can use AWK's associative arrays.
awk 'NR == FNR { list[tolower($0)]=1; next } { if (! list[tolower($0)]) print }' exclude-these.txt from-this.txt
The output will be in the same order as the "from-this.txt" file. The tolower() function makes it case-insensitive, if you need that.
The algorithmic complexity will probably be O(n + m), where n is the size of exclude-these.txt and m is the size of from-this.txt.
Similar to Dennis Williamson's answer (mostly syntactic changes, e.g. setting the file number explicitly instead of the NR == FNR trick):
awk '{if (f==1) { r[$0] } else if (! ($0 in r)) { print $0 } } ' f=1 exclude-these.txt f=2 from-this.txt
Accessing r[$0] creates the entry for that line, no need to set a value.
Assuming awk uses a hash table with constant lookup and (on average) constant update time, the time complexity of this will be O(n + m), where n and m are the lengths of the files. In my case, n was ~25 million and m ~14000. The awk solution was much faster than sort, and I also preferred keeping the original order.
if you have Ruby (1.9+)
#!/usr/bin/env ruby
b = File.read("file2").split
open("file1").each do |x|
  x.chomp!
  puts x if !b.include?(x)
end
This has O(N^2) complexity. If you care about performance, here's another version:
b=File.read("file2").split
a=File.read("file1").split
(a-b).each {|x| puts x}
which uses a hash to perform the subtraction, so the complexity is O(n + m), where n and m are the sizes of a and b.
Here's a little benchmark of the above, courtesy of user576875, but with 100K lines:
$ for i in $(seq 1 100000); do echo "$i"; done|sort --random-sort > file1
$ for i in $(seq 1 2 100000); do echo "$i"; done|sort --random-sort > file2
$ time ruby test.rb > ruby.test
real 0m0.639s
user 0m0.554s
sys 0m0.021s
$ time sort file1 file2 | uniq -u > sort.test
real 0m2.311s
user 0m1.959s
sys 0m0.040s
$ diff <(sort -n ruby.test) <(sort -n sort.test)
$
diff was used to show there are no differences between the 2 files generated.
Some timing comparisons between various other answers:
$ for n in {1..10000}; do echo $RANDOM; done > f1
$ for n in {1..10000}; do echo $RANDOM; done > f2
$ time comm -23 <(sort f1) <(sort f2) > /dev/null
real 0m0.019s
user 0m0.023s
sys 0m0.012s
$ time ruby -e 'puts File.readlines("f1") - File.readlines("f2")' > /dev/null
real 0m0.026s
user 0m0.018s
sys 0m0.007s
$ time grep -xvf f2 f1 > /dev/null
real 0m43.197s
user 0m43.155s
sys 0m0.040s
sort f1 f2 | uniq -u isn't even a symmetric difference, because it removes lines that appear multiple times in either file.
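A quick illustration with made-up two-line files: 'a' is only in f1 and should survive, but uniq -u drops it because it appears twice, and 'c' from f2 leaks into the output:
$ printf 'a\na\nb\n' > f1; printf 'c\n' > f2
$ sort f1 f2 | uniq -u
b
c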
comm can also be used with stdin and here strings:
echo $'a\nb' | comm -23 <(sort) <(sort <<< $'c\nb') # a
Seems to be a job suitable for the SQLite shell:
create table file1(line text);
create index if1 on file1(line ASC);
create table file2(line text);
create index if2 on file2(line ASC);
-- comment: if you have | in your files then specify another separator, e.g. .separator "any_improbable_string"
.import 'file1.txt' file1
.import 'file2.txt' file2
.output result.txt
select * from file2 where line not in (select line from file1);
.q
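A sketch of running it non-interactively (diff.sql and temp.db are made-up names here; the statements above are assumed to be saved in diff.sql):
$ sqlite3 temp.db < diff.sql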
Did you try this with sed?
sed 's#^#sed -i '"'"'s%#g' f2 > f2.sh
sed -i 's#$#%%g'"'"' f1#g' f2.sh
sed -i '1i#!/bin/bash' f2.sh
sh f2.sh
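For illustration, with the sample f2 above, the generated f2.sh would look roughly like this (note that each command blanks out matching text rather than deleting whole lines):
#!/bin/bash
sed -i 's%line2%%g' f1
sed -i 's%line8%%g' f1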
Not a 'programming' answer but here's a quick and dirty solution: just go to http://www.listdiff.com/compare-2-lists-difference-tool.
Obviously won't work for huge files but it did the trick for me. A few notes:
I'm not affiliated with the website in any way (if you still don't believe me, then you can just search for a different tool online; I used the search term "set difference list online")
The linked website seems to make network calls on every list comparison, so don't feed it any sensitive data
A Python way of filtering one list using another list.
Load files:
>>> f1 = open('f1').readlines()
>>> f2 = open('f2').readlines()
Remove '\n' string at the end of each line:
>>> f1 = [i.replace('\n', '') for i in f1]
>>> f2 = [i.replace('\n', '') for i in f2]
Print only the f1 lines that are not in the f2 file:
>>> [a for a in f1 if a not in f2]
$ cat values.txt
apple
banana
car
taxi
$ cat source.txt
fruits
mango
king
queen
number
23
43
sentence is long
so what
...
...
I made a small shell script to "weed" out the values in the source file which are present in the values.txt file.
$ cat weed_out.sh
from=$1
cp -p "$from" "$from.final"
for x in $(cat values.txt)
do
  grep -v "$x" "$from.final" > "$from.final.tmp"
  mv "$from.final.tmp" "$from.final"
done
executing...
$ ./weed_out.sh source.txt
and you get a nicely cleaned up file....
I have a problem with this Linux command:
ls | grep -E 'i{2,3}'
It should match files that have at least 2 and at most 3 i's, but it doesn't work.
This is the output
ls:
life.py, viiva.txt, viiiiiiiiiva.txt
grep:
viiva.txt, viiiiiiiiiva.txt (with the first 3 i's highlighted)
Thanks for the help.
Issue with OP's attempt: grep -E 'i{2,3}' will match two or three consecutive occurrences of i anywhere in the input, so 4 or more consecutive i's also produce a valid match.
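A quick illustration:
$ echo 'viiiiiiiiiva.txt' | grep -E 'i{2,3}'
viiiiiiiiiva.txt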
Parsing ls output is not recommended, see Why not parse ls (and what to do instead)?. If you wish to pass the filenames after filtering to some other command, find is a good option.
$ ls
1i2i3i.txt aibi.txt II.txt life.py viiiiiiiiiva.txt viiva.txt
$ # files with 2 or 3 consecutive i
$ # note that the regex will act on entire filename, thus anchors are not needed
$ find -type f -regextype egrep -regex '[^i]*i{2,3}[^i]*'
./viiva.txt
$ # files with 2 or 3 i anywhere in the name
$ find -type f -regextype egrep -regex '[^i]*i[^i]*i[^i]*(i[^i]*)?'
./aibi.txt
./1i2i3i.txt
./viiva.txt
$ # files with 2 or 3 i anywhere in the name, ignoring case
$ find -type f -regextype egrep -iregex '[^i]*i[^i]*i[^i]*(i[^i]*)?'
./II.txt
./aibi.txt
./1i2i3i.txt
./viiva.txt
If filenames won't cause an issue, you can use grep -xE or grep -ixE with the above regex, where the x option makes sure the regex matches the whole line instead of anywhere in the line (see the sketch after the awk examples below). Or you can also use awk:
$ # NF will give number of fields after splitting on i
$ ls | awk -F'i' 'NF>=3 && NF<=4'
1i2i3i.txt
aibi.txt
viiva.txt
$ ls | awk -F'[iI]' 'NF>=3 && NF<=4'
1i2i3i.txt
aibi.txt
II.txt
viiva.txt
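For comparison, the grep -x variant mentioned above (a sketch using the consecutive-i regex; -x makes the regex match the entire filename):
$ ls | grep -xE '[^i]*i{2,3}[^i]*'
viiva.txt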
I am not able to get the desired output when the data field has a pipe in it.
The input (sample file tst) is:
hdr1|"hdr2|tst"|"hdr3|tst|tst"|hdr4|"hdr5|tst|tst"
lbl1|"lbl2|tst"|"lbl3|tst|tst"|lbl4|"lbl5|tst|tst"
I tried this command but don't get the expected output: cut -f2,3 -d"|" tst
The expected output is:
"hdr2|tst"|"hdr3|tst|tst"
"lbl2|tst"|"lbl3|tst|tst"
Is there an easy way to get this output? I don't want to go with sed because the tool I'm embedding this command in doesn't allow the backslash character (\).
Also, I'm using an old version of gawk, so this command doesn't give the desired output:
gawk -v FPAT='[^|]*|("[^"]*")+' '{print $2, $3}' OFS="|"
Output of gawk --version
GNU Awk 3.1.7
Output of cat -vet tst
hdr1|"hdr2|tst"|"hdr3|tst|tst"|hdr4|"hdr5|tst|tst"$
lbl1|"lbl2|tst"|"lbl3|tst|tst"|lbl4|"lbl5|tst|tst"$
Upgrading your gawk version is by far the best approach, as you're missing a few bug fixes and a ton of extremely useful functionality introduced since gawk 3.1.7 came out 10+ years ago (we're currently on gawk version 5.1!). But if you can't do that for some reason, here's what you can do if you don't have FPAT, using any awk in any shell on every UNIX box:
$ cat tst.awk
BEGIN { OFS="|" }
{
orig = $0
$0 = i = ""
while ( (orig != "") && match(orig,/[^|]*|("[^"]*")+/) ) {
$(++i) = substr(orig,RSTART,RLENGTH)
orig = substr(orig,RSTART+RLENGTH+1)
}
print $2, $3
}
$ awk -f tst.awk file
"hdr2|tst"|"hdr3|tst|tst"
"lbl2|tst"|"lbl3|tst|tst"
Just to verify that it's identifying all of the fields correctly:
$ cat tst.awk
BEGIN { OFS="|" }
{
orig = $0
$0 = i = ""
while ( (orig != "") && match(orig,/[^|]*|("[^"]*")+/) ) {
$(++i) = substr(orig,RSTART,RLENGTH)
orig = substr(orig,RSTART+RLENGTH+1)
}
print NF " <" $0 ">"
for (i=1; i<=NF; i++) {
print "\t" i " <" $i ">"
}
}
$ awk -f tst.awk file
5 <hdr1|"hdr2|tst"|"hdr3|tst|tst"|hdr4|"hdr5|tst|tst">
1 <hdr1>
2 <"hdr2|tst">
3 <"hdr3|tst|tst">
4 <hdr4>
5 <"hdr5|tst|tst">
5 <lbl1|"lbl2|tst"|"lbl3|tst|tst"|lbl4|"lbl5|tst|tst">
1 <lbl1>
2 <"lbl2|tst">
3 <"lbl3|tst|tst">
4 <lbl4>
5 <"lbl5|tst|tst">
If you don't have embedded double quotes, you can substitute the pipes inside quoted fields with another, unused character (I used ~) and, after extraction, switch back to the original values. Obviously this requires that the replacement character does not appear in the text.
$ awk 'BEGIN{OFS=FS="\""} {for(i=2;i<NF;i+=2) gsub("\\|","~",$i)}1' file |
awk 'BEGIN{OFS=FS="|"} {print $2,$3}' |
sed 's/~/|/g'
"hdr2|tst"|"hdr3|tst|tst"
"lbl2|tst"|"lbl3|tst|tst"
Not sure it's simpler than the single awk script though.
The main problem here is the document format design; it requires another patch if there are embedded double quotes, escaped pipes, etc.
How can I do an exact match using grep -v?
For example: the following command
for i in 0 0.3 a; do echo $i | grep -v "0"; done
returns a. But I want it to return 0.3 a.
Using
grep -v "0\b"
is not working
for i in 0 0.3 a; do echo $i | grep -v "^0$"; done
You need to match the start and end of the string with ^ and $.
So we say: match the beginning of a line, the character 0, and then the end of the line.
$ for i in 0 0.3 a; do echo $i | grep -v "^0$"; done
0.3
a
The safest way for single-column entries is using awk. Normally I would use grep with the -w flag, but since you want to exactly match an integer that could also appear as part of a float, it is a bit more tricky. The dot character makes it hard to use any of:
grep -vw 0
grep -v '\b0\b'
grep -v '\<0\>'
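For instance, -w treats the dot as a word boundary, so the 0 in 0.3 still matches and that line gets thrown away too (made-up input):
$ printf '0\n0.3\na\n' | grep -vw 0
a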
The proposed solution will also only work on perfectly clean lines; what if there is a stray space before or after your zero? The line will fail to be filtered. So the safest would be:
single column file:
awk '($1!="0")' file
multi-word file (adapt the field separator FS to fit your needs):
awk '{for(i=1;i<=NF;++i) if($i == "0") next}1' file
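A quick illustration of the multi-word version (made-up input on stdin):
$ printf 'a 0 b\n0.3 7\n' | awk '{for(i=1;i<=NF;++i) if($i == "0") next}1'
0.3 7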
The command 'grep -c blah *' lists all the files, like below.
% grep -c jill *
file1:1
file2:0
file3:0
file4:0
file5:0
file6:1
%
What I want is:
% grep -c jill * | grep -v ':0'
file1:1
file6:1
%
Instead of piping and grep'ing the output like above, is there a flag to suppress listing files with 0 counts?
SJ
How to grep nonzero counts:
grep -rIcH 'string' . | grep -v ':0$'
-r Recurse subdirectories.
-I Ignore binary files (thanks @tongpu, warlock).
-c Show count of matches. Annoyingly, includes 0-count files.
-H Show file name, even if only one file (thanks @CraigEstey).
'string' your string goes here.
. Start from the current directory.
| grep -v ':0$' Remove 0-count files. (thanks @LaurentiuRoescu)
(I realize the OP was excluding the pipe trick, but this is what works for me.)
Just use awk. e.g. with GNU awk for ENDFILE:
awk '/jill/{c++} ENDFILE{if (c) print FILENAME":"c; c=0}' *
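With the sample files from the question (assuming the same contents), this would print something like:
file1:1
file6:1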