separate 8th field - printing

I cannot split the fields of my file the way I need:
chr2 215672546 rs6435862 G T 54.00 LowDP;sb DP=10;TI=NM_000465;GI=BARD1;FC=Silent ... ...
I would like to print the first seven fields and, from the 8th field, only DP=10 and GI=BARD1. The DP and GI entries are always in the 8th field. The fields continue (...), so the 8th field is not the last one.
I know how to extract the 8th field:
awk '{print $8}' PLZ-10_S2.vcf | awk -F ";" '/DP/ {OFS="\t"} {print $1}'
and of course how to extract the first seven fields, but how do I combine the two? All fields are separated by tabs.

If DP= and GI= are always in the same position within $8:
$ awk 'BEGIN{FS=OFS="\t"} {split($8,a,/;/); $8=a[1]";"a[3]} 1' file
chr2 215672546 rs6435862 G T 54.00 LowDP;sb DP=10;GI=BARD1 ... ...
If not:
$ awk 'BEGIN{FS=OFS="\t"} {split($8,a,/;/); $8=""; for (i=1;i in a;i++) $8 = $8 (a[i] ~ /^(DP|GI)=/ ? ($8?";":"") a[i] : "")} 1' file
chr2 215672546 rs6435862 G T 54.00 LowDP;sb DP=10;GI=BARD1 ... ...

One way is to split() the eighth field on semicolons and loop over the pieces, keeping only those that begin with DP= or GI=:
awk '
BEGIN { FS = OFS = "\t" }
{
n = split( $8, arr8, /;/ ) # split() returns the number of pieces; more portable than length(arr8)
$8 = ""
for ( i = 1; i <= n; i++ ) {
if ( arr8[i] ~ /^(DP|GI)=/ ) {
$8 = $8 arr8[i] ";"
}
}
$8 = substr( $8, 1, length($8) - 1 )
print $0
}
' infile
It yields:
chr2 215672546 rs6435862 G T 54.00 LowDP;sb DP=10;GI=BARD1 ... ...
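For comparison, the same filtering can be sketched in Python. This is only an illustration of the idea used in the awk answers, not a replacement for them; the function name keep_info_keys and the sample line (with real tabs) are made up for the example.

```python
def keep_info_keys(line, keys=("DP", "GI")):
    """Return the tab-separated line with only the given keys left in field 8."""
    fields = line.rstrip("\n").split("\t")
    # Field 8 is index 7; keep the ";"-separated entries whose key is wanted.
    kept = [item for item in fields[7].split(";")
            if item.split("=", 1)[0] in keys]
    fields[7] = ";".join(kept)
    return "\t".join(fields)


sample = ("chr2\t215672546\trs6435862\tG\tT\t54.00\tLowDP;sb\t"
          "DP=10;TI=NM_000465;GI=BARD1;FC=Silent\t...\t...")
print(keep_info_keys(sample))
```

Like the second awk one-liner, this does not depend on the position of DP= and GI= inside the 8th field.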

Related

Adding numbers in a dynamic string separated by some token in the kusto table

Suppose there is a table like below:
datatable(str:string) [
"a,b,2,10,d,e;a,b,c,14,d,e;a,b,c,10,d,e",
"a,b,c,11,d,e;a,b,c,12,d,e;a,b,c,13,d,e;a,b,c,10,d,e",
"a,b,c,20,d,e;a,b,c,25,d,e",
]
I need to sum the 4th value of every semicolon-separated record in each string,
e.g. the answer for the table above is
10+14+10=34
11+12+13+10=46
20+25=45
I tried the query below, which works for a single row:
let calculateCostForARow = (str:string) {
print row = split(str,";")
| mv-expand row
| parse row with * "," * "," * "," cost:long "," *
| summarize sum(cost)
};
calculateCostForARow("a,b,c,11,d,e;a,b,c,12,d,e;a,b,c,13,d,e;a,b,c,10,d,e")
but it doesn't work on a table, because toscalar() has problems when it is applied to each row:
let calculateCostForARow = (str:string) {
toscalar(print row = split(str,";")
| mv-expand row
| parse row with * "," * "," * "," cost:long "," *
| summarize sum(cost))
};
datatable(str:string) [
"a,b,c,10,d,e;a,b,c,10,d,e;a,b,c,10,d,e",
"a,b,c,10,d,e;a,b,c,10,d,e;a,b,c,10,d,e;a,b,c,10,d,e",
"a,b,c,10,d,e;a,b,c,10,d,e",
]
| project calculateCostForARow(str)
Are there other ways to do this?
You could try this, using mv-apply:
datatable(str:string) [
"a,b,2,10,d,e;a,b,c,14,d,e;a,b,c,10,d,e",
"a,b,c,11,d,e;a,b,c,12,d,e;a,b,c,13,d,e;a,b,c,10,d,e",
"a,b,c,20,d,e;a,b,c,25,d,e",
]
| mv-apply s = split(str, ";") on (
summarize result = sum(tolong(split(s, ",", 3)[0]))
)
str | result
a,b,2,10,d,e;a,b,c,14,d,e;a,b,c,10,d,e | 34
a,b,c,11,d,e;a,b,c,12,d,e;a,b,c,13,d,e;a,b,c,10,d,e | 46
a,b,c,20,d,e;a,b,c,25,d,e | 45
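To cross-check the expected sums outside Kusto, here is a small Python sketch of the same arithmetic (split on ";", take the 4th comma-separated value, add them up). The helper name sum_fourth_values is invented for the example; it says nothing about how mv-apply works internally.

```python
rows = [
    "a,b,2,10,d,e;a,b,c,14,d,e;a,b,c,10,d,e",
    "a,b,c,11,d,e;a,b,c,12,d,e;a,b,c,13,d,e;a,b,c,10,d,e",
    "a,b,c,20,d,e;a,b,c,25,d,e",
]


def sum_fourth_values(s):
    # Split into records on ";", then take the 4th comma-separated value (index 3).
    return sum(int(record.split(",")[3]) for record in s.split(";"))


results = [sum_fourth_values(s) for s in rows]
print(results)  # [34, 46, 45]
```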

How to parse values with AWK when column number is inconsistent

Input file:
6 31236622 HLA_C*05:01:01:01 A T . PASS AF=0.07724;MAF=0.07724;R2=0.98466;IMPUTED GT:DS:HDS:GP 1|0:0.999:0.999,0.000:0.001,0.999,0.000 0|0:0:0,0:1,0,0 1|1:1.994:0.995,1.000:0.000,0.006,0.994
6 29910248 HLA_A*01:01 A T . PASS AF=0.15969;MAF=0.15969;R2=0.97333;IMPUTED GT:DS:HDS:GP 0|0:0:0,0:1,0,0 1|0:1.000:1.000,0.000:0.000,1.000,0.000 0|0:0:0,0:1,0,0
6 31322134 HLA_B*55:01 A T . PASS AF=0.01091;MAF=0.01091;R2=0.94511;IMPUTED GT:DS:HDS:GP 0|0:0:0,0:1,0,0 0|0:0:0,0:1,0,0 0|0:0:0,0:1,0,0
6 31322132 HLA_B*55 A T . PASS AF=0.01091;MAF=0.01091;R2=0.94485;IMPUTED GT:DS:HDS:GP 0|0:0:0,0:1,0,0 0|0:0:0,0:1,0,0 0|0:0:0,0:1,0,0
6 31322006 HLA_B*44:02:01:01 A T . PASS AF=0.08074;MAF=0.08074;R2=0.97706;IMPUTED GT:DS:HDS:GP 1|0:0.999:0.999,0.000:0.001,0.999,0.000 0|0:0:0,0:1,0,0 1|1:1.997:0.998,0.999:0.000,0.003,0.997
I want to parse a specific number from each column after the "GT:DS:HDS:GP" column, specifically, the numbers after "x|x:". So desired output is:
0.999, 0, 1.994
0, 1.000, 0
0, 0, 0
0, 0, 0
0.999, 0, 1.997
To parse the desired values from (e.g.) line 4, I can use:
awk -F: '{for (i=5; i<=NF; i+=3) printf "%s%s", $i, (i+3 <= NF ? ", " : ORS)}'
Line 5 would require:
awk -F: '{for (i=8; i<=NF; i+=3) printf "%s%s", $i, (i+3 <= NF ? ", " : ORS)}'
So the problem with the input file is that column 3 (space delimited) contains a variable number of colons, which makes colons a poor delimiter for this particular input file (but the desired values are surrounded by colons!)
I thought about using "|" as the delimiter, with substr($i,3,?), but the desired values have an inconsistent number of digits (hence the "?").
Is there a flexible awk approach that gives the desired output?
You may try this awk:
awk -v OFS=', ' '$9 == "GT:DS:HDS:GP" {for (i=10; i<=NF; ++i) if ($i ~ /^[0-9]+\|[0-9]+:/ && split($i, a, /:/)) printf "%s", (i == 10 ? "" : OFS) a[2]; print ""}' file
0.999, 0, 1.994
0, 1.000, 0
0, 0, 0
0, 0, 0
0.999, 0, 1.997
An expanded form:
awk -v OFS=', ' '
$9 == "GT:DS:HDS:GP" {
for (i=10; i<=NF; ++i)
if ($i ~ /^[0-9]+\|[0-9]+:/ && split($i, a, /:/))
printf "%s", (i == 10 ? "" : OFS) a[2]
print ""
}' file
Why do you care about the space-delimited columns at all?
awk '{ sub(/.* GT:DS:HDS:GP */, "");
i = split($0, n, /[0-9]\|[0-9]:/);
sep = "";
for(x=2; x<=i; x++) {
sub(/:.*/, "", n[x]); printf("%s%s", sep, n[x]); sep=", " }
printf "\n"; }' file
We successively pick apart each line, first by removing everything through GT:DS:HDS:GP from the line, then by splitting the remaining string into n on the specified delimiter, and then cleaning up the resulting fields by removing everything after the first colon in each, and printing the result. (We skip the first one, which only contains the useless short or empty string before the first delimiter.)
Output for your sample:
0.999, 0, 1.994
0, 1.000, 0
0, 0, 0
0, 0, 0
0.999, 0, 1.997
I have no idea what these fields stand for, so I just picked single-letter variable names; you can probably improve readability by giving the variables more descriptive names.
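The "ignore the leading columns" idea can also be sketched in Python, assuming every line contains the literal format column GT:DS:HDS:GP and that the wanted value is always the second colon-separated piece of each sample column. The function name second_subfields is hypothetical.

```python
def second_subfields(line, fmt="GT:DS:HDS:GP"):
    # Drop everything up to and including the format column.
    _, _, samples = line.partition(fmt)
    # Each remaining whitespace-separated column is one sample;
    # take the second ":"-separated piece of each.
    return [sample.split(":")[1] for sample in samples.split()]


line = ("6 31322006 HLA_B*44:02:01:01 A T . PASS "
        "AF=0.08074;MAF=0.08074;R2=0.97706;IMPUTED GT:DS:HDS:GP "
        "1|0:0.999:0.999,0.000:0.001,0.999,0.000 0|0:0:0,0:1,0,0 "
        "1|1:1.997:0.998,0.999:0.000,0.003,0.997")
print(", ".join(second_subfields(line)))  # 0.999, 0, 1.997
```

Because the colons in column 3 are thrown away before any splitting happens, their number no longer matters.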

field separator in awk

I have the following "input.file":
10 61694 rs546443136 T G . PASS RefPanelAF=0.0288539;AC=0;AN=1186;INFO=1.24991e-09 GT:DS:GP 0/0:0.1:0.9025,0.095,0.0025 0/0:0.1:0.9025,0.095,0.0025 0/0:0.1:0.9025,0.095,0.0025
My desired output.file is:
0.1, 0.1, 0.1
Using an awk script called "parse.awk":
BEGIN {FS = ":"}
{for (i = 4; i <= NF; i += 2) printf ("%s%c", $i, i +2 <= NF ? "," : "\n ");}
which is invoked with:
awk -f parse.awk <input.file >output.file
my current output.file is as follows:
0.1,0.1,0.1
i.e. no spaces.
Changing parse.awk to:
BEGIN {FS = ":"}
{for (i = 4; i <= NF; i += 2) printf ("%s%c", $i, i +2 <= NF ? ", " : "\n ");}
did not change the output.file. What change(s) to parse.awk will yield the desired output.file?
You may use this awk:
awk -F: -v OFS=', ' '{
for (i = 4; i <= NF; i += 2) printf "%s%s", $i, (i < NF-1 ? OFS : ORS)}' file
0.1, 0.1, 0.1
Could you please try the following. It was written and tested at https://ideone.com/e26q7u
awk '
BEGIN {FS = ":"}
val!=""{ print val; val=""}
{for (i = 4; i <= NF; i += 2){
val=(val==""?"":val", ")$i
}
}
END{
if(val!=""){
print val
}
}
' Input_file
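The problem is that when you changed the output separator from a single comma (",") to a comma with a space (", "), you did not change the format string from %c to %s. With %c, awk prints only the first character of a string argument, so ", " still comes out as ",". The trailing space in "\n " is also dropped here, because with %s it would be printed at the start of the next output line. This is how to fix your script:
BEGIN {FS = ":"}
{for (i = 4; i <= NF; i += 2) printf ("%s%s", $i, i +2 <= NF ? ", " : "\n");}
# the second conversion is now %s instead of %c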
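As an aside, the "separator between items, newline at the end" logic is what a join does. A minimal Python sketch of the same extraction, assuming the single input line from the question and the same colon-based field numbering as the awk script:

```python
line = ("10 61694 rs546443136 T G . PASS "
        "RefPanelAF=0.0288539;AC=0;AN=1186;INFO=1.24991e-09 GT:DS:GP "
        "0/0:0.1:0.9025,0.095,0.0025 0/0:0.1:0.9025,0.095,0.0025 "
        "0/0:0.1:0.9025,0.095,0.0025")

fields = line.split(":")
# awk fields are 1-based, so $4, $6, $8 are indexes 3, 5, 7 here.
values = fields[3::2]
print(", ".join(values))  # 0.1, 0.1, 0.1
```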

convert data formatting in a lua file

Hello, I need to convert 720 data sets from one-line records to the format below.
At the moment I have them in an OpenOffice file with each number in its own column, but I have no idea how to convert them to that format.
12 -8906.071289 560.890564 93.236107 0 test2
13 -846.814636 -526.218323 10.981694 0 southshore
to
[12] = {
[1] = "test2",
[2] = "-8906.071289",
[3] = "560.890564",
[4] = "93.236107",
[5] = "0",
},
[13] = {
[1] = "Southshore",
[2] = "-846.814636",
[3] = "-526.218323",
[4] = "10.981694",
[5] = "0",
},
One possibility in Lua. Run it with lua program.lua datafile, where program.lua is whatever name you give this file and datafile is your external data file. To test, run just lua program.lua; the script then reads the sample data from its own leading comment.
--[[
12 -8906.071289 560.890564 93.236107 0 test2
13 -846.814636 -526.218323 10.981694 0 southshore
--]]
local filename = arg[1] or arg[0] --data from 1st command line argument or this file
local index,head,tail
print '{'
for line in io.lines(filename) do
  if line:match '^%d+' then
    head, line, tail = line:match '^(%d+)%s+(.-)(%S+)$'
    print(' [' .. head .. '] = {\n [1] = "' .. tail .. '",')
    index = 1
    for line in line:gmatch '%S+' do
      index = index + 1
      print(' [' .. index .. '] = "' .. line .. '",')
    end
    print ' },'
  end
end
print '}'
This awk program does it:
{
print "[" $1 "] = {"
print "\t[" 1 "] = \"" $NF "\","
for (i=2; i<NF; i++) {
print "\t[" i "] = \"" $i "\","
}
print "},"
}
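The same reshuffling (last column becomes entry 1, the middle columns become entries 2 and up) can be sketched in Python as well. This assumes whitespace-separated input lines of the form shown in the question; the function name to_lua_entry is made up for the example.

```python
def to_lua_entry(line):
    """Format one 'id x y z flag name' line as a Lua table entry."""
    parts = line.split()
    key, values, name = parts[0], parts[1:-1], parts[-1]
    out = ["[%s] = {" % key, '\t[1] = "%s",' % name]
    # The remaining values are numbered from 2, after the name in slot 1.
    for index, value in enumerate(values, start=2):
        out.append('\t[%d] = "%s",' % (index, value))
    out.append("},")
    return "\n".join(out)


print(to_lua_entry("12 -8906.071289 560.890564 93.236107 0 test2"))
```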

How to use some text processing(awk etc..) to put some character in a text file at certain lines

I have a text file of hex values, one value per line, with many such values one below another. To analyse the values I need to put some kind of delimiter/marker, say a '#', in this file before line numbers 32, 47, 62, 77, ...; the difference between two consecutive line numbers in this pattern is always 15.
I am trying to do it using awk. I tried a few things but they didn't work.
What is the awk command to do it?
Any other solution involving some other language/script/tool is also welcome.
Thank you.
-AD
This is how you can use AWK for it,
awk 'BEGIN{ i=0; } \
{if (FNR<32) {print $0} \
else {if (i++%15) {print $0} else {printf "#%s\n",$0}}\
}' inputfile.txt > outputfile.txt
How it works,
BEGIN sets a counter that starts at 0 when record 32 is reached
FNR<32 prints the first 31 records unchanged (record 32 is the first that needs a #)
input lines are called records and FNR is an AWK variable that counts them
i++%15 is 0 on records 32, 47, 62, 77, ..., so those get a # prefix; every other record is printed as is
$0 is the record (the line) itself
The trailing '\' characters are only line continuations; you can drop them and type the whole command on a single line.
Or, you can use it as an AWK file,
# File: comment.awk
BEGIN { i = 0 }
{
    if (FNR < 32) {
        print $0            # records 1..31 are copied unchanged
    } else if (i++ % 15) {
        print $0            # not a multiple of 15 since record 32
    } else {
        printf "#%s\n", $0  # records 32, 47, 62, 77, ...
    }
}
And run it as,
awk -f comment.awk inputfile.txt > outputfile.txt
Hope this will help you to use more AWK.
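Whichever tool you use, the line numbers from the question (32, 47, 62, 77, ...) come down to one modular test. A quick Python check of that arithmetic, with is_marked, first and step as made-up names for the example:

```python
def is_marked(line_number, first=32, step=15):
    # True for line 32 and for every 15th line after it.
    return line_number >= first and (line_number - first) % step == 0


print([n for n in range(1, 100) if is_marked(n)])  # [32, 47, 62, 77, 92]
```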
Python:
first = 32  # first line that gets a marker in front of it
step = 15   # distance between marked lines
with open("file.txt") as f_in, open("file_out.txt", "w") as f_out:
    for num, line in enumerate(f_in, start=1):
        if num >= first and (num - first) % step == 0:
            f_out.write("#\n")
        f_out.write(line)
Haskell:
offset = 31;
chunk_size = 15;
main = do
  {
    (h, t) <- fmap (splitAt offset . lines) getContents;
    mapM_ putStrLn h;
    mapM_ ((putStrLn "#" >>) . mapM_ putStrLn) $
      map (take chunk_size) $
      takeWhile (not . null) $
      iterate (drop chunk_size) t;
  }
