I know it's a ridiculously simple problem, but I'd like to print the first line in many files for which a given field condition is met:
$ awk ' ( $3>=0.2 ) { print $3, $5 } ' Data.out
I've tried to insert END in a few places to exit printing, but I can't get it to work... The above prints ALL the lines for which $3>=0.2...
The first thing that springs to mind is to add exit:
awk '$3 >= 0.2 { print $3, $5; exit }' file
But unless that's all you want to do, you will need a flag:
awk '$3 >= 0.2 && !f { print $3, $5; f=1 }' file
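For example, with some made-up sample data (hypothetical file demo.txt, just to show the flag in action):
$ printf 'a b 0.1 d x\na b 0.3 d y\na b 0.1 d z\na b 0.5 d w\n' > demo.txt
$ awk '$3 >= 0.2 && !f { print $3, $5; f=1 }' demo.txt
0.3 y
The fourth line also matches, but f is already set by then, so only the first match is printed.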
The command you are looking for is nextfile:
gawk '$3 >= 0.2 { print $3, $5; nextfile }' *.out
If you're not using gawk, the behaviour can be simulated in other awks (see the sketch below). Note that nextfile has, however, made it into the 2012 POSIX standard, as per the note on the GNU page.
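The usual simulation is a per-file flag; a sketch along these lines (essentially the same idea as the multi-file script in the answer below):
# reset the flag at the start of each file; print only the first match per file
awk 'FNR==1 { done=0 } $3 >= 0.2 && !done { print $3, $5; done=1 }' *.out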
awk ' ( $3>=0.2 ) { print $3, $5; exit } ' Data.out
The problem with exit (apart from the apparent compatibility issue you have) is that it will not process the next file at all if you have multiple files. Here is a script for multiple files:
awk 'FNR==1 { f=1 }
     $3>=0.2 { if (f) print $3, $5; f=0 }' file1 file2 ...
You may be able to optimize by e.g. closing the input file after the first match, but this should at least get you started.
I have a quite large fixed format file without spaces (file1):
file1:
0808563800555550000367120000500000
0005555566369330000078020000500000
01066666780000000008933600009000005635
0904251263088000000786590056500000
0000469011009904440425120444444440
I want to extract lines with fields 4-8,11-15 and 20-24 when fields 4-8 (only) are in a list of IDs in file2
file2:
55555
42512
The desired outputs are:
55555 36933 07802
42512 08800 78659
I have tried the following combination of cut | grep commands:
cut -c 4-8,11-15,20-24 file1 --output-delimiter=' ' | grep -w -F -f file2
It works fine and the speed is very good, but the problem is that I get lines where the lookup ID (fields 4-8) is not in the first column of the cut data, because grep checks all three columns after cut, not only the first one.
Here are the outputs of the command above:
85638 55555 36712
55555 36933 07802
66666 00000 89336
42512 08800 78659
04690 00990 42512
I know one could write the output to a file and then use, for example, awk, but I thought there might be a much simpler approach that avoids longer processing time (for example, making grep match only in a specific cut column).
Any help would be much appreciated, and many thanks!
With GNU awk for FIELDWIDTHS:
$ awk -v FIELDWIDTHS='3 5 2 5 4 5 *' 'NR==FNR{a[$0]; next} $2 in a{ print $2, $4, $6 }' file2 file1
55555 36933 07802
42512 08800 78659
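If your awk doesn't support FIELDWIDTHS, a portable sketch using substr() with the same column positions (plain POSIX awk) would be:
awk 'NR==FNR { a[$0]; next }
     substr($0, 4, 5) in a { print substr($0, 4, 5), substr($0, 11, 5), substr($0, 20, 5) }' file2 file1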
Would you please try the following:
cut -c 4-8,11-15,20-24 file1 --output-delimiter=' ' | grep -wf <(sed 's/^/^/' file2)
Each line in file2 is prepended with a caret ^ to anchor it to the start of each line of the cut output. It may be a bit slower than before due to the lack of the -F option.
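For illustration, the process substitution just turns file2 into anchored patterns:
$ sed 's/^/^/' file2
^55555
^42512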
I have a program whose output is a summary file with a header and a few columns of results.
I want to show only two pieces of data: the file name and the best period prediction. I use this command:
program input_file | gawk 'NR==2 {print $3}; NR==4 {print $2}'
As the result I obtain the output in one column, two lines. What do I have to do to get this result in one line, two columns?
You could use:
program input_file | gawk 'NR==2 {heading = $3}; NR==4 {print heading " = " $2}'
This saves the value in $3 on line 2 in variable heading and prints the heading and the value from column 2 when it reads line 4.
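An equivalent sketch using printf, which prints the first value without a trailing newline so both values end up on one line:
program input_file | gawk 'NR==2 { printf "%s = ", $3 } NR==4 { print $2 }'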
I'm totally new to AWK; however, I think this is the best way to solve my problem, and a good time to learn AWK.
I am trying to read a large data file that is created by a simulation program. The output is made to be readable by humans, so its formatting isn't very consistent. An example of the output is in this image
http://i.imgur.com/0kf8l.png
I need a way to find a line like "He 2 4686A -2.088 0.0071", by specifying the "He 2 4686A" part and get the following two numbers. The problem is the line "He 2 4686A -2.088 0.0071" can appear anywhere in the table.
I know how to find the entry "He 2 4686A", but I don't know which of the 4 columns it's in. So I don't know how to address the values that follow it.
A command that lets me just read the next two words, or one that tells me the location of the pattern once a match is found, would help.
/He 2 4686A/ finds the line
Ca A 3970A -0.900 0.1100 He 2 4686A -2.088 0.0071 S 3 18.67m -0.371 0.3721 Ar 4 444.7A -2.124 0.0066
Any help is appreciated.
The first step should be to bring what seems to be 4 columns of records into a 1-column format... then it's easy with awk, because you can then filter on the first three fields - like:
echo "He 2 4686A -2.088 0.0071" | \
awk '$1 == "He" && $2 == 2 && $3 == "4686A" {print $4, $5}'
which gives
-2.088 0.0071
So, for me, the only challenge is to transform your data to a one-column format... and from the picture that looks simple, because the columns seem to have a fixed length which you can count.
Assuming that your column width is 30 characters (difficult to tell from a picture; beware of tabs) and your data is in input_file, you could first "cut" the data into 4 columns and then pipe the output to another awk process:
awk '{
    # split each line into four 30-character columns, one per output record
    print substr($0,  1, 30)
    print substr($0, 31, 30)
    print substr($0, 61, 30)
    print substr($0, 91, 30)
}' input_file | \
awk '$1 == "He" && $2 == 2 && $3 == "4686A" { print $4, $5 }'
If you really just need the next two numbers after an anchor, then I would say the grep solution from Costa is best for you; this approach, however, gives you the possibility to implement further logic...
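For instance, if knowing the location of the match is enough, a sketch using match() to find the anchor anywhere on the line and grab the next two tokens (this assumes single spaces inside "He 2 4686A"; widen the anchor with [ \t]+ if the table mixes tabs):
awk 'match($0, /He 2 4686A/) {
    rest = substr($0, RSTART + RLENGTH)   # everything after the anchor
    split(rest, t, /[ \t]+/)              # t[1] is empty due to leading blanks
    print t[2], t[3]
}' input_file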
If you're not dead set on using awk, grep would be the easiest way...
egrep -o "He 2 4686A \-?[0-9.]+ \-?[0-9.]+" output.txt
EDIT: The above would work only if the spacing was done with a whitespace, which doesn't seem to be your case. In order to handle tabs and/or repeating whitespaces...
egrep -o "He[ \t]+2[ \t]+4686A[ \t]+\-?[0-9.]+[ \t]+\-?[0-9.]+" output.txt
I'd like some help understanding why this:
palabra=s_gonzalez
i=10
awk -vvar1=$palabra -vvvar2=$i '( $1 == var1 ) && ( $2 == var2 ) {print $0}' As
is not printing anything. The As file contains:
r_castillo 10
flores 6
s_gonzalez 10
o_gutzwiller 12
h_ji 4
Thanks in advance for any suggestion.
Look at your second assignment:
-vvvar2=$i
You misspelled var2: the extra v makes the awk variable vvar2, so var2 is never set and $2 == var2 never matches.
As a technique to avoid this sort of problem, you can assign variables without -v. I would rewrite the command:
awk '$1==var1 && $2==var2' var1=$palabra var2=$i As
It always seems simpler to me to assign variables as arguments after the program rather than as -v options before the program. (-v assignments are available in the BEGIN block, but that is irrelevant in this case.)
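A quick demonstration of that difference (hypothetical one-line file tmp):
$ echo x > tmp
$ awk -v a=1 'BEGIN { print "BEGIN: a=" a ", b=" b } { print "main: a=" a ", b=" b }' b=2 tmp
BEGIN: a=1, b=
main: a=1, b=2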
I have to parse some information out of big log file lines.
It's something like:
abc.log:2012-03-03 11:12:12,457 ABC[123.RPH.-101] XYZ: Query=get_data #a=0,#b=1 Rows=10Time=100
There are many log lines like above in the logfiles. I need to extract information like
datetime i.e. 2012-03-03 11:12:12,457
job details i.e. 123.RPH.-101
Query i.e. get_data (no parameters)
Rows i.e. 10
Time i.e. 100
So output should look like
2012-03-03 11:12:12,457|123|-101|get_data|10|100
I have tried various combinations with awk but I'm not getting it right.
Well, this is really horrible, but since sed is in the tags and there are no answers yet...
sed -r -e 's/[^0-9]*//' \
    -e 's/[^ ]*\[([^.]*)\.[^.]*\.([^]]*)\]/| \1 | \2/' \
    -e 's/[^ ]* Query=/| /' \
    -e 's/ [^ ]* Rows=/ | /' \
    -e 's/Time=/ | /' my_logfile
My solution is in gawk: it uses the gawk extension to match(), which captures groups into an array.
You didn't give specification of file format, so you may have to adjust the regexes.
Script invocation:
gawk -v OFS='|' -f script.awk
{
    # timestamp, e.g. 2012-03-03 11:12:12,457
    match($0, /[0-9]+-[0-9]+-[0-9]+ [0-9]+:[0-9]+:[0-9]+,[0-9]+/)
    date_time = substr($0, RSTART, RLENGTH)
    # job details from [123.RPH.-101]; dots escaped so they match literally
    match($0, /\[([0-9]+)\.RPH\.(-?[0-9]+)\]/, matches)
    job_detail_1 = matches[1]
    job_detail_2 = matches[2]
    match($0, /Query=(\w+)/, matches)
    query = matches[1]
    match($0, /Rows=([0-9]+)/, matches)
    rows = matches[1]
    match($0, /Time=([0-9]+)/, matches)
    time = matches[1]
    print date_time, job_detail_1, job_detail_2, query, rows, time
}
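Assuming the sample line from the question is saved in abc.log, a run would look like:
$ gawk -v OFS='|' -f script.awk abc.log
2012-03-03 11:12:12,457|123|-101|get_data|10|100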
Here's another, less fancy, AWK solution (but it works in mawk too):
BEGIN { OFS = "|" }
{
    # grab the text between the brackets in field 3, e.g. [123.RPH.-101]
    i = match($3, /\[[^]]+\]/)
    job = substr($3, i + 1, RLENGTH - 2)
    split($5, X, "=")   # Query=get_data
    query = X[2]
    split($7, X, "=")   # Rows=10
    rows = X[2]
    split($8, X, "=")   # Time=100
    time = X[2]
    print $1 " " $2, job, query, rows, time
}
Note that this assumes the Rows=10 and Time=100 strings are separated by a space; that is, there was a typo in the question's example.
TXR:
#(collect :vars ())
#file:#year-#mon-#day #hh:#mm:#ss,#ms #jobname[#job1.RPH.#job2] #queryname: Query=#query #params Rows=#{rows /[0-9]+/}Time=#time
#(output)
#year-#mon-#day #hh-#mm-#ss,#ms|#job1|#job2|#query|#rows|#time
#(end)
#(end)
Run:
$ txr data.txr data.log
2012-03-03 11-12-12,457|123|-101|get_data|10|100
Here is one way to make the program assert that every line in the log file must match the pattern. First, do not allow gaps in the collection. This means that nonmatching material cannot be skipped to just look for the lines which match:
#(collect :gap 0 :vars ())
Secondly, at the end of the script we add this:
#(eof)
This specifies a match on the end of file. If the #(collect) bails early because of a nonmatching line (due to the :gap 0 constraint), the #(eof) will fail and so the script will terminate with a failed status.
In this type of task, field splitting regex hacks will backfire because they can blindly produce incorrect results for some subset of the input being processed. If the input contains a vast number of lines, there is no easy way to check for mistakes. It's best to have a very specific match that is likely to reject anything which doesn't resemble the examples on which the pattern is based.
You just need the right field separators:
awk -F '[][ =.]' -v OFS='|' '{print $1 " " $2, $4, $6, $10, $15, $17}'
I'm assuming the "abc.log:" is not actually in the log file.
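For reference, this is how that separator set splits the sample line (assuming, as the earlier answer noted, that Rows=10 and Time=100 are space-separated):
# FS='[][ =.]' splits on ']', '[', space, '=' and '.':
# $1=2012-03-03  $2=11:12:12,457  $3=ABC  $4=123  $5=RPH  $6=-101  $7=(empty)
# $8=XYZ:  $9=Query  $10=get_data  $11=#a  $12=0,#b  $13=1
# $14=Rows  $15=10  $16=Time  $17=100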