flat file parsing in shell - parsing

I have a fixed-length file of the format:
Name       Age        Party              Role
---------- ---------- ------------------ --------------
Shubham    27         XYZ                User
Drek       28         ABC                Admin
Raj        23         USR                User
Now I want to write a shell script/command to output a file containing all Parties with Age<25
In this case, it should print something like this:
Party
-----------------
USR
I am new to awk and shell. I tried using awk and substr, but it is way too expensive since my file is huge (>200000 lines with multiple columns). Is there a neat way to do this?
Update
Any of the fields can have spaces within them. The real idea is that the file is a fixed-length file, so the length of each record is fixed (Name:10, Age:10, Party:20, Role:10). The records, however, can contain anything in the data, including spaces and blanks. For instance:
Name       Age        Party              Role
---------- ---------- ------------------ --------------
Shub A     27         XYZ & A            User
Drek GH    28         ABC & C            Admin
Raj        23         USR                User
and so on.
Now I want to use Name to do a select, such that my script prints Party records where Name = "Shub A". So here the output should be:
Party
-------------------
XYZ & A
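For reference, with a fixed-length layout every field lives at a fixed character range, so it can be sliced out by position alone. A minimal sketch, assuming no separator between the fields so that Name occupies columns 1-10 and Party columns 21-40 (treat these offsets as assumptions to adjust to your real layout):
# Name padded to 10 characters ("Shub A" plus 4 spaces), Party taken from columns 21-40
grep '^Shub A    ' file | cut -c21-40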

$ awk '($2+0) < 25{print $3}' input
Party
------------------
USR
Update
Various for loops to determine which field contains the number (n); the name is then in $1..n-1 and the party field in $n+1..NF-1:
/Shub A/ {
    # determine which field contains a number
    for (i = 1; i < NF; i++) {
        if ($i ~ /[0-9]+$/) {
            break
        }
    }
    # the name fields come before the number...
    for (j = 1; j < i; j++) {
        printf "%s ", $j
    }
    # ...and the party fields come after it ($NF is the Role)
    for (k = i + 1; k < NF; k++) {
        printf "%s ", $k
    }
    print ""   # end the record with a newline
}
Output:
Shub A XYZ & A
...or you can try to split on "2 spaces or more", i.e.:
$ awk -F"  +" '/^Shub/{print $3}' input
XYZ & A

Try:
gawk 'BEGIN{ FIELDWIDTHS = "11 11 19 14" } NR<3 || $1~/^Shub A +$/{print $3}' file
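FIELDWIDTHS is a gawk extension; if GNU awk isn't available, roughly the same idea can be sketched in plain awk with substr, reusing the widths from the command above (treat the exact offsets as assumptions to adjust to your real layout):
awk '{
    name  = substr($0,  1, 11)      # first column, width 11 as above
    party = substr($0, 23, 19)      # third column starts after 11+11 characters
    sub(/ +$/, "", name)            # strip the fixed-width padding
    if (NR < 3 || name == "Shub A") print party
}' file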

Try this, if it works for you:
awk 'NR<3||($2+0)<25{a[++i]=$3}END{for(x in a)print a[x]}' file

If you know that none of your ages fill the full 10 digits, you can probably just do:
< input-file cut -b 11-30 | awk '$1 < 25' | cut -b 11-

Something like this should work. It prints the first two lines (the header) and after that checks whether the second field is lower than 25.
awk 'FNR < 3 || $2 < 25 { print $3 }' infile
It yields:
Party
------------------
USR
EDIT: This was posted before the update and doesn't work for it. Take a look at the other answers.

Related

Merging >2 files with AWK or JOIN?

Merging 2 files using AWK is a well-covered topic on Stack Overflow. However, the technique of reading 3 files into an array gets more complicated. As I'm formatting the output to go into an R script, I'm going to need to add lots of syntax, so I don't think I can use JOIN. Here is a simplistic version I have working so far:
awk 'FNR==1{f++}
f==1{a[FNR]=$1;next}
f==2{b[FNR]=$1;next}
{print a[FNR], "<- c(", b[FNR], ",", $1, ")"}' words.txt x.txt y.txt
Where:
$ cat words.txt
word1
word2
word3
$ cat x.txt
1
2
3
$ cat y.txt
11
22
33
The output is then
word1 <- c(1, 11)
word2 <- c(2, 22)
word3 <- c(3, 33)
The best way I can summarize this technique is
Create a variable f to keep track of which file you're processing
For file 1 read the values into array a
For file 2 read the values into array b
Fall through to file three, where you concatenate your final output
As a beginner to AWK this works, but I find it a bit awkward, and I worry that coming back to the code in 6 months I'll no longer understand it. Is this the best way to merge these 3 files in AWK? Could JOIN actually handle this level of formatting for the final output?
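On the JOIN part of the question: join matches rows on a key column rather than on line number, so you would first have to manufacture a key, for instance with nl. A rough sketch (bash process substitution, filenames as in the question):
# number each file's lines, join on that number, then format with awk
join <(nl -ba words.txt) <(nl -ba x.txt) | join - <(nl -ba y.txt) |
awk '{print $2, "<- c(" $3 ", " $4 ")"}'
This works for the short example, but join also expects its inputs sorted on the key, so for anything longer the paste approaches below are simpler.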
A variation of @RavinderSingh13's solution:
$ paste {words,x,y}.txt | awk '{print $1, "<- c(" $2 ", " $3 ")"}'
EDIT: Could you please try the following.
paste words.txt x.txt y.txt | awk '{$2="<- c("$2", "$3")";$3="";sub(/ +$/,"")} 1'
Output will be as follows.
word1 <- c(1, 11)
word2 <- c(2, 22)
word3 <- c(3, 33)
In case you simply want to combine the 3 files' contents column-wise, then try the following.
paste words.txt x.txt y.txt
word1 1 11
word2 2 22
word3 3 33
If it's for readability, you can change the file checking method, as well as the variable names.
Try these please:
awk 'ARGIND==1{words[FNR]=$1;}
ARGIND==2{xcol[FNR]=$1;}
ARGIND==3{print words[FNR], "<- c(", xcol[FNR], ",", $1, ")"}' words.txt x.txt y.txt
The above file-checking method (ARGIND) is specific to GNU awk.
Changing to another method, and also changing the file reading order, would look like this:
awk 'FILENAME=="words.txt"{print $1, "<- c(", xcol[FNR], ",", ycol[FNR], ")";}
FILENAME=="x.txt"{xcol[FNR]=$1;}
FILENAME=="y.txt"{ycol[FNR]=$1;}' x.txt y.txt words.txt
As you can also see here, the file reading order and the block order can be different.
Since words.txt holds the first column, the main column so to speak, it's sensible to read it last.
You can also use FILENAME==ARGV[1], FILENAME==ARGV[2], etc. to check files, and put comments inside (using an awk script file loaded with awk -f scriptfile is better when you have comments):
awk 'FILENAME==ARGV[1]{xcol[FNR]=$1;} #Read column B, x column
FILENAME==ARGV[2]{ycol[FNR]=$1;} # Read column C, y column
FILENAME==ARGV[3]{print $1, "<- c(", xcol[FNR], ",", ycol[FNR], ")";}' x.txt y.txt words.txt

Elegant way of using AWK

I have a file that has fields separated by multiple characters. For example:
abc sometext def;ghi=123;
abc sometext def;ghi=123;
abc sometext def;ghi=123;
Now I want to parse the file in AWK to extract the fields. For example, to get all the values of 'ghi':
awk '{print $3}' inputFile.txt | awk 'BEGIN {FS = "="} { print $NF }'
Is there any way to parse the file in one shot instead of using multiple pipes and AWK commands?
Yes, you can use the split function in awk
awk '{split($3,a,"=");print a[2]}'
123;
123;
123;
This divides field number 3 using = as the separator into an array a, then prints the second element of the array, a[2].
If the number of = separated parts in field number 3 varies and you want the last one, do like this:
awk '{n=split($3,a,"=");print a[n]}'
123;
123;
123;
In your case, this will do too:
awk -F= '{print $NF}'
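Note that with -F= the trailing semicolon stays attached to the value (the 123; shown earlier); a small variation that strips it, assuming every value is terminated by ;:
awk -F= '{ sub(/;$/, "", $NF); print $NF }' inputFile.txt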
This can also be accomplished using multiple field separators in awk:
$ awk -F"[=;]" '{print $3}' file
123
123
123
This tells awk to use field separators = or ;. Based on that, the numbers you want are in the 3rd position.
If you expect the ghi part to be changeable and important, you can also use grep with a look-behind:
$ grep -Po '(?<=ghi=)\d+' file
123
123
123
This will print all digits after ghi=.
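grep -P (Perl-compatible regular expressions) is specific to GNU grep; where it isn't available, roughly the same extraction can be sketched with sed:
# print only the digits that follow ghi= on each matching line
sed -n 's/.*ghi=\([0-9]*\).*/\1/p' file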

creating a file with a unique string per line on the command line

I am trying to create a file (using AWK, but I do not mind switching if another command is easier) that has a unique string on each line (183745 lines total). I am trying to make a file like this:
line1
line2
line3
....
line183745
With poor knowledge of AWK, and failure to find a similar example, I have unsuccessfully tried (with 10 lines for this example):
awk '{ i = 1; while (i < 10) { print "line$i \n"}; i++ }'
And this leads to no error or output. Thank you.
Why make it complicated?
seq -f "line%06g" 3
line000001
line000002
line000003
seq -f "line%06g" 183745 >newfile
You'll need to put this in a BEGIN block, as you're not processing any lines of input.
awk 'BEGIN { i = 1 ; while (i <= 10) { print "line"i ; i++ } }'
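Scaled up to the question's full count, the same BEGIN idea with a for loop and the output redirected to a file (newfile is just an example name):
awk 'BEGIN { for (i = 1; i <= 183745; i++) print "line" i }' > newfile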
awk acts like a filter by default. In your case, it's simply blocking on input. Unblock it by explicitly not having input, for example.
awk '...' </dev/null
If I had to do this, I would do it with seq or in vim.
But since others have already posted seq and classic awk solutions, I'll add another awk solution for fun.
A very "useful" command, yes, could help us:
awk '$0="line"NR;NR==183745{exit}'
test with 1-10, for example:
kent$ yes|awk '$0="line"NR;NR==10{exit}'
line1
line2
line3
line4
line5
line6
line7
line8
line9
line10

How to horizontally mirror ascii art?

So ... I know that I can reverse the order of lines in a file using tac or a few other tools, but how do I reorder in the other dimension, i.e. horizontally? I'm trying to do it with the following awk script:
{
    out = "";
    for (i = length($0); i > 0; i--) {
        out = out substr($0, i, 1);
    }
    print out;
}
This seems to reverse the characters, but it's garbled, and I'm not seeing why. What am I missing?
I'm doing this in awk, but is there something better? sed, perhaps?
Here's an example. Input data looks like this:
$ cowsay <<<"hello"
_______
< hello >
-------
\ ^__^
\ (oo)\_______
(__)\ )\/\
||----w |
|| ||
And the output looks like this:
$ cowsay <<<"hello" | rev
_______
> olleh <
-------
^__^ \
_______\)oo( \
\/\) \)__(
| w----||
|| ||
Note that the output is identical whether I use rev or my own awk script. As you can see, things ARE reversed, but ... it's mangled.
rev is nice, but it doesn't pad input lines. It just reverses them.
The "mangling" you're seeing is because one line may be 20 characters long, and the next may be 15 characters long. In your input text they share a left-hand column. But in your output text, they need to share a right-hand column.
So you need padding. Oh, and asymmetric reversal, as Joachim said.
Here's my revawk:
#!/usr/bin/awk -f
#
length($0) > max {
    max = length($0);
}
{
    # Reverse the line...
    for (i = length($0); i > 0; i--) {
        o[NR] = o[NR] substr($0, i, 1);
    }
}
END {
    for (i = 1; i <= NR; i++) {
        # prepend the output with sufficient padding
        fmt = sprintf("%%%ds%%s\n", max - length(o[i]));
        printf(fmt, "", o[i]);
    }
}
(I did this in gawk; I don't think I used any gawkisms, but if you're using a more classic awk variant, you may need to adjust this.)
Use this the same way you'd use rev.
ghoti@pc:~$ echo hello | cowsay | ./revawk | tr '[[]()<>/\\]' '[][)(><\\/]'
                    _______
                   < olleh >
                    -------
            ^__^   /
    _______/(oo)  /
/\/(       /(__)
   | w----||
   ||     ||
If you're moved to do so, you might even run the translate from within the awk script by adding it to the last printf line:
printf(fmt," ",o[i]) | "tr '[[]()<>/\\]' '[][)(><\\/]'";
But I don't recommend it, as it makes the revawk command less useful for other applications.
Your lines aren't the same length, so reversing the cow will break it. What you need to do is to "pad" the lines to be the same length, then reverse.
For example;
cowsay <<<"hello" | awk '{printf "%-40s\n", $0}' | rev
will pad it to 40 columns, and then reverse.
EDIT: @ghoti did a script that sure beats this simplistic reverse; have a look at his answer.
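A variation on the same padding idea that avoids hard-coding the 40, using wc -L (GNU coreutils) to measure the longest line first; just a sketch:
max=$(cowsay <<<"hello" | wc -L)     # width of the widest line
cowsay <<<"hello" | awk -v w="$max" '{ printf "%-" w "s\n", $0 }' | rev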
Here's one way using GNU awk and rev
Run like:
awk -f ./script.awk <(echo "hello" | cowsay){,} | rev
Contents of script.awk:
FNR==NR {
    if (length > max) {
        max = length
    }
    next
}
{
    while (length < max) {
        $0 = $0 OFS
    }
}1
Alternatively, here's the one-liner:
awk 'FNR==NR { if (length > max) max = length; next } { while (length < max) $0=$0 OFS }1' <(echo "hello" | cowsay){,} | rev
Results:
                    _______
                   > olleh <
                    -------
            ^__^   \
    _______\)oo(  \
\/\)       \)__(
   | w----||
   ||     ||
----------------------------------------------------------------------------------------------
Here's another way just using GNU awk:
Run like:
awk -f ./script.awk <(echo "hello" | cowsay){,}
Contents of script.awk:
BEGIN {
    FS=""
}
FNR==NR {
    if (length > max) {
        max = length
    }
    next
}
{
    while (length < max) {
        $0 = $0 OFS
    }
    for (i = NF; i >= 1; i--) {
        printf (i != 1) ? $i : $i ORS
    }
}
Alternatively, here's the one-liner:
awk 'BEGIN { FS="" } FNR==NR { if (length > max) max = length; next } { while (length < max) $0=$0 OFS; for (i=NF; i>=1; i--) printf (i!=1) ? $i : $i ORS }' <(echo "hello" | cowsay){,}
Results:
                    _______
                   > olleh <
                    -------
            ^__^   \
    _______\)oo(  \
\/\)       \)__(
   | w----||
   ||     ||
----------------------------------------------------------------------------------------------
Explanation:
Here's an explanation of the second answer. I'm assuming a basic knowledge of awk:
FS="" # set the file separator to read only a single character
# at a time.
FNR==NR { ... } # this returns true for only the first file in the argument
# list. Here, if the length of the line is greater than the
# variable 'max', then set 'max' to the length of the line.
# 'next' simply means consume the next line of input
while ... # So when we read the file for the second time, we loop
# through this file, adding OFS (output FS; which is simply
# a single space) to the end of each line until 'max' is
# reached. This pad's the file nicely.
for ... # then loop through the characters on each line in reverse.
# The printf statement is short for ... if the character is
# not at the first one, print it; else, print it and ORS.
# ORS is the output record separator and is a newline.
Some other things you may need to know:
The {,} suffix is brace expansion, a shorthand for repeating the input file name twice.
Unfortunately, it's not standard Bourne shell. However, you could instead use:
<(echo "hello" | cowsay) <(echo "hello" | cowsay)
Also, in the first example, the trailing 1 after { ... } is short for { print $0 }
HTH.
You could also do it with bash, coreutils and sed (to make it work with zsh the while loop needs to be wrapped in tr ' ' '\x01' | while ... | tr '\x01' ' ', not sure why yet):
say=hello
longest=$(cowsay "$say" | wc -L)
echo "$say" | rev | cowsay | sed 's/\\/\\\\/g' | rev |
while read; do printf "%*s\n" $longest "$REPLY"; done |
tr '[[]()<>/\\]' '[][)(><\\/]'
Output:
                    _______
                   < hello >
                    -------
            ^__^   /
    _______/(oo)  /
/\/(       /(__)
   | w----||
   ||     ||
This leaves a lot of excess spaces at the end, append | sed 's/ *$//' to remove.
Explanation
The cowsay output needs to be quoted, especially the backslashes which sed takes care of by duplicating them. To get the correct line width printf '%*s' len str is used, which uses len as the string length parameter. Finally asymmetrical characters are replaced by their counterparts, as done in ghoti's answer.
I don't know if you can do this in AWK, but here are the needed steps:
Identify the length of your original's longest line; you will need it to give proper spacing to any shorter lines.
            (__)\       )\/\
For the last char on each line, map out how many start-of-line spaces it needs, based on what you acquired from the first step.
< hello >
//Needs ??? extra spaces, because it ends right after '>'.
//It does not have spaces after it, making it miss its correct position after the reverse.
            (__)\       )\/\
< hello >???????????????
For each line, apply the line's needed number of spaces, followed by the original chars in reverse order.
                    _______
                   > olleh <
                    -------
            ^__^   \
    _______\)oo(  \
\/\)       \)__(
   | w----||
   ||     ||
Finally, replace all characters that are not horizontally symmetrical with their horizontally-opposite chars. (< to >, [ to ], etc)
                    _______
                   < olleh >
                    -------
            ^__^   /
    _______/(oo)  /
/\/(       /(__)
   | w----||
   ||     ||
Two things to watch out for:
Text, as you can see, will not read correctly once reversed.
Characters like $, % and & are not horizontally symmetrical,
but also might not have an opposite unless you use specialized
Unicode blocks.
I would say that you may need each line to be a fixed column width so each line is the same length. So if the first line is a single character followed by a LF, you'll need to pad it with whitespace to the full width before reversing.

Use grep -A1 to return a value from the second line as long as a numeric condition on the first line is met

I have log entries that are paired two lines each. I have to parse the first line to extract
a number to know if it is greater than 5000. If this number is greater than 5000 then I need to return the second line, which will also be parsed to retrieve an ID.
I know how to grep all of the info and to parse it. I don't know how to make the grep ignore
things if they are less than a particular value. Note that I am not committed to using grep if some
other means like awk/sed can be substituted.
Raw data (the two lines are separated here for clarity). The target of my grep is the number 5001 following "credits extracted = "; if this is over 5000 then I want to return the number "12345" from the second line.
--------------------------
2012-03-16T23:26:12.082358 0x214d000 DEBUG ClientExtractAttachmentsPlayerMailTask for envelope 22334455 finished: credits extracted = 5001, items extracted count = 0, status = 0. [Mail.heomega.mail.Mail](PlayerMailTasks.cpp:OnExtractAttachmentsResponse:944)
2012-03-16T23:26:12.082384 0x214d000 DEBUG Mail Cache found cached mailbox for: 12345 [Mail.heomega.mail.Mail](MailCache.cpp:GetCachedMailbox:772)
Snippets --------------------------
-- Find the number of credits extracted, without the comma noise:
grep "credits extracted = " fileName.log | awk '{print $12}' | awk -F',' '{print $1}'
-- Find the second line's ID no matter what the value of credits extracted is:
grep -A1 "credits extracted = " fileName.log | grep "cached mailbox for" | awk -F, '{print $1}' | awk '{print $10}'
-- An 'if' statement symbolizing the logic I need to acquire:
v_CredExtr=5001; v_ID=12345; if [ $v_CredExtr -gt 5000 ]; then echo $v_ID; fi;
You can do everything with a single AWK filter I believe:
#!/usr/bin/awk -f
/credits extracted =/ {
    credits = substr($12, 1, length($12) - 1) + 0
    if (credits > 5000)
        show_id = 1
    next
}
show_id == 1 {
    print $10
    show_id = 0
}
Obviously, you can put the whole AWK script in a quoted string inside a shell script, even multi-line. I showed it here as its own script file for clarity.
P.S: Please notify when it works ;-)
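The same filter collapsed into a one-liner, as suggested above (fileName.log as in the question):
awk '/credits extracted =/ { if (substr($12, 1, length($12) - 1) + 0 > 5000) show_id = 1; next } show_id { print $10; show_id = 0 }' fileName.log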
