field separator in awk - parsing

I have the following "input.file":
10 61694 rs546443136 T G . PASS RefPanelAF=0.0288539;AC=0;AN=1186;INFO=1.24991e-09 GT:DS:GP 0/0:0.1:0.9025,0.095,0.0025 0/0:0.1:0.9025,0.095,0.0025 0/0:0.1:0.9025,0.095,0.0025
My desired output.file is:
0.1, 0.1, 0.1
Using an awk script called "parse.awk":
BEGIN {FS = ":"}
{for (i = 4; i <= NF; i += 2) printf ("%s%c", $i, i +2 <= NF ? "," : "\n ");}
which is invoked with:
awk -f parse.awk <input.file >output.file
My current output.file is as follows:
0.1,0.1,0.1
i.e. no spaces.
Changing parse.awk to:
BEGIN {FS = ":"}
{for (i = 4; i <= NF; i += 2) printf ("%s%c", $i, i +2 <= NF ? ", " : "\n ");}
did not change the output.file. What change(s) to parse.awk will yield the desired output.file?

You may use this awk:
awk -F: -v OFS=', ' '{
  for (i = 4; i <= NF; i += 2) printf "%s%s", $i, (i < NF-1 ? OFS : ORS)
}' file
0.1, 0.1, 0.1
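For reference, with FS set to ":" the sample line splits so that the DS values land in fields 4, 6 and 8, which is why the loop starts at 4 and steps by 2. A quick way to check this (my own illustration, not part of the original answers) is to print every field with its index:
awk -F: '{ for (i = 1; i <= NF; i++) print i, $i }' input.file
Fields 4, 6 and 8 each print 0.1; the other fields hold the surrounding text.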

Could you please try the following. Written and tested at
https://ideone.com/e26q7u
awk '
BEGIN { FS = ":" }
val != "" { print val; val = "" }
{
  for (i = 4; i <= NF; i += 2) {
    val = (val == "" ? "" : val ", ") $i
  }
}
END {
  if (val != "") {
    print val
  }
}
' Input_file

The problem is that when you changed the output separator from a single comma (",") to a comma followed by a space (", "), you did not change the format string from %c to %s. %c prints only a single character, so the space is dropped. This is how to fix your script:
BEGIN {FS = ":"}
{for (i = 4; i <= NF; i += 2) printf ("%s%s", $i, i +2 <= NF ? ", " : "\n ");}
# %c changed to %s
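To see the difference (a minimal demo, not from the original post): %c treats its argument as a single character, so a multi-character string such as ", " is truncated to its first character, while %s prints the whole string.
echo | awk '{ printf ("[%c] vs [%s]\n", ", ", ", ") }'
This prints [,] vs [, ].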

Related

convert data formatting in a lua file

Hello, I need to convert 720 data sets from a one-liner to the format below.
At the moment I have them in an OpenOffice file with each number in a column, but I have no idea how to convert that formatting.
12 -8906.071289 560.890564 93.236107 0 test2
13 -846.814636 -526.218323 10.981694 0 southshore
to
[12] = {
    [1] = "test2",
    [2] = "-8906.071289",
    [3] = "560.890564",
    [4] = "93.236107",
    [5] = "0",
},
[13] = {
    [1] = "Southshore",
    [2] = "-846.814636",
    [3] = "-526.218323",
    [4] = "10.981694",
    [5] = "0",
},
One possibility in Lua. Run with program.lua datafile
where program.lua is whatever name you give this file, and datafile is, well, your external data file. Test with just program.lua
--[[
12 -8906.071289 560.890564 93.236107 0 test2
13 -846.814636 -526.218323 10.981694 0 southshore
--]]
local filename = arg[1] or arg[0]  -- data from 1st command line argument or this file
local index, head, tail
print '{'
for line in io.lines(filename) do
  if line:match '^%d+' then
    head, line, tail = line:match '^(%d+)%s+(.-)(%S+)$'
    print(' [' .. head .. '] = {\n [1] = "' .. tail .. '",')
    index = 1
    for line in line:gmatch '%S+' do
      index = index + 1
      print(' [' .. index .. '] = "' .. line .. '",')
    end
    print ' },'
  end
end
print '}'
This awk program does it:
{
    print "[" $1 "] = {"
    print "\t[" 1 "] = \"" $NF "\","
    for (i = 2; i < NF; i++) {
        print "\t[" i "] = \"" $i "\","
    }
    print "},"
}
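Assuming the program above is saved as, say, to_lua.awk and the numbers live in datafile (both names are placeholders, not from the original answer), it can be run like this:
awk -f to_lua.awk datafile > converted.lua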

Elixir parse binary data?

For example:
I have a binary that looks like this:
bin1 = "2\nok\n3\nbcd\n\n"
or
bin2 = "2\nok\n3\nbcd\n1\na\n\n"
and so on...
The format is
byte_size \n bytes \n byte_size \n bytes \n \n
I want to parse the binary to get
["ok", "bcd"]
How can I implement this in Elixir or Erlang?
For reference, here is a Go version that parses this:
func (c *Client) parse() []string {
    resp := []string{}
    buf := c.recv_buf.Bytes()
    var idx, offset int
    idx = 0
    offset = 0
    for {
        idx = bytes.IndexByte(buf[offset:], '\n')
        if idx == -1 {
            break
        }
        p := buf[offset : offset+idx]
        offset += idx + 1
        //fmt.Printf("> [%s]\n", p);
        if len(p) == 0 || (len(p) == 1 && p[0] == '\r') {
            if len(resp) == 0 {
                continue
            } else {
                c.recv_buf.Next(offset)
                return resp
            }
        }
        size, err := strconv.Atoi(string(p))
        if err != nil || size < 0 {
            return nil
        }
        if offset+size >= c.recv_buf.Len() {
            break
        }
        v := buf[offset : offset+size]
        resp = append(resp, string(v))
        offset += size + 1
    }
    return []string{}
}
Thanks
A more flexible solution:
result = bin
  |> String.split("\n")
  |> Stream.chunk(2)
  |> Stream.map(&parse_bytes/1)
  |> Enum.filter(fn s -> s != "" end)

def parse_bytes(["", ""]), do: ""
def parse_bytes([byte_size, bytes]) do
  byte_size_int = byte_size |> String.to_integer
  <<parsed :: binary-size(byte_size_int)>> = bytes
  parsed
end
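As posted, the def clauses have to live inside a module before the pipeline can call them, and Stream.chunk/2 has since been deprecated in favour of Stream.chunk_every/2. A self-contained sketch (module and function names are my own, not from the answer):
defmodule BinParser do
  def parse(bin) do
    bin
    |> String.split("\n")
    |> Stream.chunk_every(2)              # Stream.chunk(2) on older Elixir
    |> Stream.map(&parse_bytes/1)
    |> Enum.filter(fn s -> s != "" end)
  end

  defp parse_bytes(["", ""]), do: ""
  defp parse_bytes([byte_size, bytes]) do
    byte_size_int = String.to_integer(byte_size)
    <<parsed::binary-size(byte_size_int)>> = bytes
    parsed
  end
end

# BinParser.parse("2\nok\n3\nbcd\n\n")       #=> ["ok", "bcd"]
# BinParser.parse("2\nok\n3\nbcd\n1\na\n\n") #=> ["ok", "bcd", "a"]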
I wrote a solution:
defp parse("\n") do
  []
end

defp parse(data) do
  {offset, _} = :binary.match(data, "\n")
  size = String.to_integer(binary_part(data, 0, offset))
  value = binary_part(data, offset + 1, size)
  len = offset + 1 + size + 1
  [value] ++ parse(binary_part(data, len, byte_size(data) - len))
end
The Elixir mailing list provides another one:
defp parse_binary("\n"), do: []
defp parse_binary(binary) do
  {size, "\n" <> rest} = Integer.parse(binary)
  <<chunk :: [binary, size(size)], "\n", rest :: binary>> = rest
  [chunk|parse_binary(rest)]
end
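The [binary, size(size)] modifier list is old Elixir syntax; on current versions the same pattern is written with binary-size. A small adaptation (mine, not from the mailing list post):
defp parse_binary("\n"), do: []
defp parse_binary(binary) do
  {size, "\n" <> rest} = Integer.parse(binary)
  <<chunk::binary-size(size), "\n", rest::binary>> = rest
  [chunk | parse_binary(rest)]
end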

Parsing nested indented text into lists

Hi,
maybe someone can give me some help getting started.
I have nested indented text similar to this, which I need to parse into a nested list structure like the following:
TXT = r"""
Test1
    NeedHelp
        GotStuck
            Sometime
            NoLuck
    NeedHelp2
        StillStuck
        GoodLuck
"""
Nested_Lists = ['Test1',
                ['NeedHelp',
                 ['GotStuck',
                  ['Sometime',
                   'NoLuck']]],
                ['NeedHelp2',
                 ['StillStuck',
                  'GoodLuck']]
                ]
Nested_Lists = ['Test1', ['NeedHelp', ['GotStuck', ['Sometime', 'NoLuck']]], ['NeedHelp2', ['StillStuck', 'GoodLuck']]]
Any help for Python 3 would be appreciated.
You could exploit Python tokenizer to parse the indented text:
from tokenize import NAME, INDENT, DEDENT, tokenize
def parse(file):
    stack = [[]]
    lastindent = len(stack)

    def push_new_list():
        stack[-1].append([])
        stack.append(stack[-1][-1])
        return len(stack)

    for t in tokenize(file.readline):
        if t.type == NAME:
            if lastindent != len(stack):
                stack.pop()
                lastindent = push_new_list()
            stack[-1].append(t.string)  # add to current list
        elif t.type == INDENT:
            lastindent = push_new_list()
        elif t.type == DEDENT:
            stack.pop()

    return stack[-1]
Example:
from io import BytesIO
from pprint import pprint
pprint(parse(BytesIO(TXT.encode('utf-8'))), width=20)
Output
['Test1',
 ['NeedHelp',
  ['GotStuck',
   ['Sometime',
    'NoLuck']]],
 ['NeedHelp2',
  ['StillStuck',
   'GoodLuck']]]
I hope you can understand my solution. If not, ask.
def nestedbyindent(string, indent_char=' '):
    splitted, i = string.splitlines(), 0

    def first_non_indent_char(string):
        for i, c in enumerate(string):
            if c != indent_char:
                return i
        return -1

    def subgenerator(indent):
        nonlocal i
        while i < len(splitted):
            s = splitted[i]
            title = s.lstrip()
            if not title:
                i += 1
                continue
            curr_indent = first_non_indent_char(s)
            if curr_indent < indent:
                break
            elif curr_indent == indent:
                i += 1
                yield title
            else:
                yield list(subgenerator(curr_indent))

    return list(subgenerator(-1))
>>> nestedbyindent(TXT)
['Test1', ['NeedHelp', ['GotStuck', ['Sometime', 'NoLuck']],
 'NeedHelp2', ['StillStuck', 'GoodLuck']]]
Here is an answer written in a very non-Pythonic and verbose way, but it seems to work.
TXT = r"""
Test1
    NeedHelp
        GotStuck
            Sometime
            NoLuck
    NeedHelp2
        StillStuck
        GoodLuck
"""
outString = '['
level = 0
first = 1
for i in TXT.split("\n")[1:]:
    count = 0
    for j in i:
        if j != ' ':
            break
        count += 1
    count //= 4  # integer division: 4 spaces = 1 indent
    if i.lstrip() != '':
        itemStr = "'" + i.lstrip() + "'"
    else:
        itemStr = ''
    if level < count:
        if first:
            outString += '[' * (count - level) + itemStr
            first = 0
        else:
            outString += ',' + '[' * (count - level) + itemStr
    elif level > count:
        outString += ']' * (level - count) + ',' + itemStr
    else:
        if first:
            outString += itemStr
            first = False
        else:
            outString += ',' + itemStr
    level = count
if len(outString) > 1:
    outString = outString[:-1] + ']'
else:
    outString = '[]'
output = eval(outString)
# ['Test1', ['NeedHelp', ['GotStuck', ['Sometime', 'NoLuck']], 'NeedHelp2', ['StillStuck', 'GoodLuck']]]
Riffing off of this answer, if entire lines should be retained, and if those lines consist of more than just variable names, t.type == NAME can be substituted with t.type == NEWLINE, and that if-statement can append the stripped line instead of t.string. Something like this:
from tokenize import NEWLINE, INDENT, DEDENT, tokenize

def parse(file):
    stack = [[]]
    lastindent = len(stack)

    def push_new_list():
        stack[-1].append([])
        stack.append(stack[-1][-1])
        return len(stack)

    for t in tokenize(file.readline):
        if t.type == NEWLINE:
            if lastindent != len(stack):
                stack.pop()
                lastindent = push_new_list()
            stack[-1].append(t.line.strip())  # add entire line to current list
        elif t.type == INDENT:
            lastindent = push_new_list()
        elif t.type == DEDENT:
            stack.pop()

    return stack[-1]
Otherwise, the lines get split into separate tokens at spaces, parentheses, brackets, and so on.

separate 8th field

I could not separate my file:
chr2 215672546 rs6435862 G T 54.00 LowDP;sb DP=10;TI=NM_000465;GI=BARD1;FC=Silent ... ...
I would like to print the first seven fields, and from the 8th field print just DP=10 and GI=BARD1. The DP and GI info is always in the 8th field. The fields continue (...) so the 8th field is not the last.
I know how to extract the 8th field:
awk '{print $8}' PLZ-10_S2.vcf | awk -F ";" '/DP/ {OFS="\t"} {print $1}'
and of course how to extract the first seven fields, but how do I pipe it all together? All fields are tab-separated.
If DP= and GI= are always in the same position within $8:
$ awk 'BEGIN{FS=OFS="\t"} {split($8,a,/;/); $8=a[1]";"a[3]} 1' file
chr2 215672546 rs6435862 G T 54.00 LowDP;sb DP=10;GI=BARD1 ... ...
If not:
$ awk 'BEGIN{FS=OFS="\t"} {split($8,a,/;/); $8=""; for (i=1;i in a;i++) $8 = $8 (a[i] ~ /^(DP|GI)=/ ? ($8?";":"") a[i] : "")} 1' file
chr2 215672546 rs6435862 G T 54.00 LowDP;sb DP=10;GI=BARD1 ... ...
One way is to split() the eighth field on semicolons and traverse all the results to check which of them begin with DP or GI:
awk '
BEGIN { FS = OFS = "\t" }
{
    split($8, arr8, /;/)
    $8 = ""
    for (i = 1; i <= length(arr8); i++) {
        if (arr8[i] ~ /^(DP|GI)/) {
            $8 = $8 arr8[i] ";"
        }
    }
    $8 = substr($8, 1, length($8) - 1)
    print $0
}
' infile
It yields:
chr2 215672546 rs6435862 G T 54.00 LowDP;sb DP=10;GI=BARD1 ... ...

How to remove spaces from a string in Lua?

I want to remove all spaces from a string in Lua. This is what I have tried:
string.gsub(str, "", "")
string.gsub(str, "% ", "")
string.gsub(str, "%s*", "")
This does not seem to work. How can I remove all of the spaces?
It works, you just have to assign the actual result/return value. Use one of the following variations:
str = str:gsub("%s+", "")
str = string.gsub(str, "%s+", "")
I use %s+ as there's no point in replacing an empty match (i.e. there's no space). This just doesn't make any sense, so I look for at least one space character (using the + quantifier).
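For illustration (a small sketch of my own, not from the original answer), compare discarding and assigning the return value:
local str = "a b  c"
string.gsub(str, "%s+", "")  -- return value discarded; str is unchanged (Lua strings are immutable)
print(str)                   --> a b  c
str = str:gsub("%s+", "")    -- keep the first return value (the new string)
print(str)                   --> abc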
You can use the following function:
function all_trim(s)
  return s:match"^%s*(.*)":match"(.-)%s*$"
end
Or shorter:
function all_trim(s)
  return s:match( "^%s*(.-)%s*$" )
end
Usage:
str = " aa "
print(all_trim(str) .. "e")
Output is:
aae
The fastest way is to use trim.so compiled from trim.c:
/* trim.c - based on http://lua-users.org/lists/lua-l/2009-12/msg00951.html
   from Sean Conner */

#include <stddef.h>
#include <ctype.h>
#include <lua.h>
#include <lauxlib.h>

int trim(lua_State *L)
{
  const char *front;
  const char *end;
  size_t size;

  front = luaL_checklstring(L, 1, &size);
  end = &front[size - 1];

  for ( ; size && isspace(*front) ; size-- , front++)
    ;
  for ( ; size && isspace(*end) ; size-- , end--)
    ;

  lua_pushlstring(L, front, (size_t)(end - front) + 1);
  return 1;
}

int luaopen_trim(lua_State *L)
{
  lua_register(L, "trim", trim);
  return 0;
}
compile something like:
gcc -shared -fpic -O -I/usr/local/include/luajit-2.1 trim.c -o trim.so
More detailed (with comparison to the other methods): http://lua-users.org/wiki/StringTrim
Usage:
local trim15 = require("trim")  -- at the beginning of the file
local tr = trim(" a z z z z z ")  -- anywhere in the code
For LuaJIT all methods from Lua wiki (except for, possibly, native C/C++) were awfully slow in my tests. This implementation showed the best performance:
function trim (str)
  if str == '' then
    return str
  else
    local startPos = 1
    local endPos = #str

    while (startPos < endPos and str:byte(startPos) <= 32) do
      startPos = startPos + 1
    end

    if startPos >= endPos then
      return ''
    else
      while (endPos > 0 and str:byte(endPos) <= 32) do
        endPos = endPos - 1
      end
      return str:sub(startPos, endPos)
    end
  end
end -- .function trim
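Example usage (my own quick check, not from the original answer):
print("[" .. trim("  hello world  ") .. "]")  --> [hello world]
print("[" .. trim("") .. "]")                 --> []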
If anyone is looking to remove all spaces in a bunch of strings, including spaces in the middle of a string, then this works for me:
function noSpace(str)
  local normalisedString = string.gsub(str, "%s+", "")
  return normalisedString
end

test = "te st"
print(noSpace(test))
Might be that there is an easier way though, I'm no expert!
