Parsing log lines using awk

I have to parse some information out of big log file lines.
It's something like:
abc.log:2012-03-03 11:12:12,457 ABC[123.RPH.-101] XYZ: Query=get_data #a=0,#b=1 Rows=10Time=100
There are many log lines like above in the logfiles. I need to extract information like
datetime i.e. 2012-03-03 11:12:12,457
job details i.e. 123.RPH.-101
Query i.e. get_data (no parameters)
Rows i.e. 10
Time i.e. 100
So the output should look like:
2012-03-03 11:12:12,457|123|-101|get_data|10|100
I have tried various permutations and combinations with awk but am not getting it right.

Well, this is really horrible, but since sed is in the tags and there are no answers yet...
sed -e 's/[^0-9]*//' -re 's/[^ ]*\[([^.]*)\.[^.]*\.([^]]*)\]/| \1 | \2/' -e 's/[^ ]* Query=/| /' -e 's/ [^ ]* Rows=/ | /' -e 's/Time=/ | /' my_logfile
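Applied to the sample line above, this produces:
2012-03-03 11:12:12,457 | 123 | -101 | get_data | 10 | 100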

My solution in gawk: it uses the gawk extension to match(), which captures groups into an array.
You didn't give a specification of the file format, so you may have to adjust the regexes.
Script invocation:
gawk -v OFS='|' -f script.awk
{
    # Timestamp, e.g. 2012-03-03 11:12:12,457
    match($0, /[0-9]+-[0-9]+-[0-9]+ [0-9]+:[0-9]+:[0-9]+,[0-9]+/)
    date_time = substr($0, RSTART, RLENGTH)

    # Job details, e.g. [123.RPH.-101]; the dots are escaped to match literally
    match($0, /\[([0-9]+)\.RPH\.(-?[0-9]+)\]/, matches)
    job_detail_1 = matches[1]
    job_detail_2 = matches[2]

    match($0, /Query=(\w+)/, matches)
    query = matches[1]

    match($0, /Rows=([0-9]+)/, matches)
    rows = matches[1]

    match($0, /Time=([0-9]+)/, matches)
    time = matches[1]

    print date_time, job_detail_1, job_detail_2, query, rows, time
}
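For the sample line this prints:
2012-03-03 11:12:12,457|123|-101|get_data|10|100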

Here's another, less fancy, AWK solution (but works in mawk too):
BEGIN { OFS = "|" }
{
    # Job details are the bracketed part of field 3, e.g. ABC[123.RPH.-101]
    i = match($3, /\[[^]]+\]/)
    job = substr($3, i + 1, RLENGTH - 2)
    split($5, X, "=")   # Query=get_data
    query = X[2]
    split($7, X, "=")   # Rows=10
    rows = X[2]
    split($8, X, "=")   # Time=100
    time = X[2]
    print $1 " " $2, job, query, rows, time
}
Note that this assumes the Rows=10 and Time=100 strings are separated by a space, i.e. that there was a typo in the question's example.

TXR:
#(collect :vars ())
#file:#year-#mon-#day #hh:#mm:#ss,#ms #jobname[#job1.RPH.#job2] #queryname: Query=#query #params Rows=#{rows /[0-9]+/}Time=#time
#(output)
#year-#mon-#day #hh:#mm:#ss,#ms|#job1|#job2|#query|#rows|#time
#(end)
#(end)
Run:
$ txr data.txr data.log
2012-03-03 11:12:12,457|123|-101|get_data|10|100
Here is one way to make the program assert that every line in the log file matches the pattern. First, do not allow gaps in the collection, which means that nonmatching material cannot be skipped over in order to find lines which do match:
#(collect :gap 0 :vars ())
Secondly, at the end of the script we add this:
#(eof)
This specifies a match on the end of file. If the #(collect) bails early because of a nonmatching line (due to the :gap 0 constraint), the #(eof) will fail and so the script will terminate with a failed status.
In this type of task, field splitting regex hacks will backfire because they can blindly produce incorrect results for some subset of the input being processed. If the input contains a vast number of lines, there is no easy way to check for mistakes. It's best to have a very specific match that is likely to reject anything which doesn't resemble the examples on which the pattern is based.
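Putting the two changes together, the strict version of the script looks like this:
#(collect :gap 0 :vars ())
#file:#year-#mon-#day #hh:#mm:#ss,#ms #jobname[#job1.RPH.#job2] #queryname: Query=#query #params Rows=#{rows /[0-9]+/}Time=#time
#(output)
#year-#mon-#day #hh:#mm:#ss,#ms|#job1|#job2|#query|#rows|#time
#(end)
#(end)
#(eof)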

You just need the right field separators:
awk -F '[][ =.]' -v OFS='|' '{print $1 " " $2, $4, $6, $10, $15, $17}'
I'm assuming the "abc.log:" is not actually in the log file.
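Given that assumption (and the space between Rows=10 and Time=100 noted earlier), the one-liner prints:
2012-03-03 11:12:12,457|123|-101|get_data|10|100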

Related

Extracting lines from a fixed-format file without spaces, based on a column and a list of IDs

I have a quite large fixed format file without spaces (file1):
file1:
0808563800555550000367120000500000
0005555566369330000078020000500000
01066666780000000008933600009000005635
0904251263088000000786590056500000
0000469011009904440425120444444440
I want to extract character positions 4-8, 11-15 and 20-24 from lines where positions 4-8 (only) match a list of IDs in file2
file2:
55555
42512
The desired outputs are:
55555 36933 07802
42512 08800 78659
I have tried the following combination of cut | grep commands:
cut -c 4-8,11-15,20-24 file1 --output-delimiter=' ' | grep -w -F -f file2
It works fine and the speed is very good, but the problem is that I also get lines where the lookup ID (positions 4-8) is not in the first column of the cut data, because grep checks all three cut columns, not only the first one.
Here are the outputs of the command above:
85638 55555 36712
55555 36933 07802
66666 00000 89336
42512 08800 78659
04690 00990 42512
I know one could write the output to a file and then use, for example, awk, but I thought there might be a simpler approach that avoids the extra processing time (for example, making grep match only against a specific cut column).
Any help will be much appreciated, many thanks!
With GNU awk for FIELDWIDTHS, which splits each record by character counts (here $2 covers positions 4-8, $4 covers 11-15, $6 covers 20-24, and the final * takes the remainder):
$ awk -v FIELDWIDTHS='3 5 2 5 4 5 *' 'NR==FNR{a[$0]; next} $2 in a{ print $2, $4, $6 }' file2 file1
55555 36933 07802
42512 08800 78659
Would you please try the following:
cut -c 4-8,11-15,20-24 file1 --output-delimiter=' ' | grep -wf <(sed 's/^/^/' file2)
Each line in file2 is prefixed with a caret ^ to anchor it to the start of each line of the cut output.
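With the file2 shown above, the process substitution feeds grep these anchored patterns:
^55555
^42512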
It may be a bit slower than before because the -F (fixed-string) option can no longer be used.

extract the character adjacent to a selected letter

I have this text file:
# cat letter.txt
this
is
just
a
test
to
check
if
grep
works
The letter "e" appears in 3 words.
# grep e letter.txt
test
check
grep
Is there any way to return the letter printed to the left of the selected character?
expected.txt
t
h
r
With your shown samples, could you please try the following awk:
awk '/e/{print substr($0,index($0,"e")-1,1)}' Input_file
Explanation: adding a detailed explanation for the above.
awk '                                    ##Starting awk program from here.
/e/{                                     ##Check if the current line contains "e"; if so, do the following.
  print substr($0,index($0,"e")-1,1)     ##Print 1 character, starting at the index of "e" minus 1 (the character before "e").
}
' Input_file                             ##Mentioning Input_file name here.
You can use positive lookahead to match a character that is followed by an e, without making the e part of the match.
cat letter.txt | grep -oP '.(?=e)'
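For letter.txt this outputs:
t
h
r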
With sed:
sed -nE 's/.*(.)e.*/\1/p' letter.txt
Assuming you have this input file:
cat file
this
is
just
a
test
to
check
if
grep
works
egg
element
You may use this grep + sed solution to find the letter (or empty string) before each e:
grep -oE '(^|.)e' file | sed 's/.$//'
t
h
r
l
m
Or alternatively this single awk command should also work:
awk -F 'e' 'NF > 1 {
    for (i=1; i<NF; i++) print substr($i, length($i), 1)
}' file
This might work for you (GNU sed):
sed -nE '/(.)e/{s//\n\1\n/;s/^[^\n]*\n//;P;D}' file
Turn off implicit printing and enable extended regexp -nE.
Focus only on lines that meet the requirements i.e. contain a character before e.
Surround the required character by newlines.
Remove any characters before and including the first newline.
Print the first line (up to the second newline).
Delete the first line (including the newline).
Repeat.
N.B. This solution prints each such character on a separate line.
To print all such characters from a line together on one line, use:
sed -nE '/(.e)/{s//\n\1/g;s/^/e/;s/e[^\n]*\n?//g;s/\B/ /g;p}' file
N.B. Remove the s/\B/ /g if space separation is not needed.
With GNU awk you can use empty string as FS to split the input as individual characters:
awk -v FS= '/[e]/ {for(i=2;i<=NF;i++) if ($i=="e") print $(i-1)}' file
t
h
r
The for loop starts at i=2, which excludes an "e" at the beginning of a word.
Edit: print an empty string if "e" is the first character in the word. For example, with this input:
cat file2
grep
erroneously
egg
Wednesday
effectively
awk -v FS= '/^[e]/ {print ""} /[e]/ {for(i=2;i<=NF;i++) if ($i=="e") print $(i-1)}' file2
r
n
W
n
f
v

Match Lines From Two Lists With Wildcards In One List

I have two lists, one of which contains wildcards (in this case represented by *). I would like to compare the two lists and create an output of those that match, with each wildcard * representing a single character.
For example:
File 1
123456|Jane|Johnson|Pharmacist|janejohnson@gmail.com
09876579|Frank|Roberts|Butcher|frankie1@hotmail.com
092362936|Joe|Jordan|Joiner|joe@joesjoinery.com
928|Bob|Horton|Farmer|bhorton@farmernews.co.uk
File 2
1***6|Jane|Johnson|Pharmacist|janejohnson@gmail.com
09876579|Frank|Roberts|Butcher|f**1@hotmail.com
092362936|Joe|Jordan|J*****|joe@joesjoinery.com
928|Bob|Horton|Farmer|b*****n@f*********.co.uk
Output
092362936|Joe|Jordan|Joiner|joe@joesjoinery.com
928|Bob|Horton|Farmer|bhorton@farmernews.co.uk
Explanation
The first two lines are not considered matches because the number of *s is not equal to the number of characters shown in the first file. The latter two are, so they are added to output.
I have tried to reason out ways to do this in awk and with join, but I don't know any way to even start trying to achieve this. Any help would be greatly appreciated.
$ cat tst.awk
NR==FNR {
    file1[$0]
    next
}
{
    # Make every non-* char literal (see https://stackoverflow.com/a/29613573/1745001):
    gsub(/[^^*]/,"[&]")   # Convert every char X to [X] except ^ and *
    gsub(/\^/,"\\^")      # Convert every ^ to \^

    # Convert every * to .:
    gsub(/\*/,".")

    # Add line start/end anchors
    $0 = "^" $0 "$"

    # See if the current file2 line matches any line from file1
    # and if so print that line from file1:
    for ( line in file1 ) {
        if ( line ~ $0 ) {
            print line
        }
    }
}
$ awk -f tst.awk file1 file2
092362936|Joe|Jordan|Joiner|joe@joesjoinery.com
928|Bob|Horton|Farmer|bhorton@farmernews.co.uk
sed 's/\./\\./g; s/\*/./g' file2 | xargs -I{} grep {} file1
Explanation:
I'd take advantage of regular expression matching. To do that, we need to turn every asterisk * into a dot ., which represents any single character in a regular expression. As a side effect of enabling regular expressions, we need to escape all special characters, particularly the dot, for them to be taken literally: in a regular expression, \. represents a literal dot (as opposed to any character).
The first step performs these substitutions with sed; the second passes every resulting line as a search pattern to grep and searches file1 for that pattern. The glue that allows this is xargs, where {} is a placeholder representing a single line from the output of the sed command.
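As an illustration, the first line of file2 above becomes:
1...6|Jane|Johnson|Pharmacist|janejohnson@gmail\.com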
Note:
This is not a general, safe solution you can simply copy and paste: you should watch out for any characters in the file containing the asterisks that grep treats as special in regular expressions.
Update:
jhnc extends the escaping to all of the following characters: .\^$[], thus accounting for almost all sorts of email addresses, and avoids xargs by using -f - to pass the output of sed to grep as its list of search patterns:
sed 's/[.\\^$[]/\\&/g; s/[*]/./g' file2 | grep -f - file1
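For instance, the last line of file2 becomes the pattern:
928|Bob|Horton|Farmer|b.....n@f.........\.co\.uk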
This solution is both more general and more efficient.

grep to find words with unique letters

How can I use grep to find occurrences of words from a dictionary file which contain a given set of letters, with the restriction that each letter occurs once and only once?
E.g. if the letters are abc then the expected output is:
cab
EDIT:
Given a dictionary file (that is, a file containing one word per line, such as /usr/share/dict/words on the Mac OS X operating system) and a set of (unique) characters, I want to print out all of the dictionary file's words that contain each character of the input set once and only once. For example, if the set of characters is {a,b,c}, print out all (3-letter) words that contain each character of the set.
I am looking, preferably, for a solution that uses just grep expressions.
Given a series of letters, for example abc, you can convert each one to a lookahead, like this:
^(?=[^a]*a[^a]*$)(?=[^b]*b[^b]*$)(?=[^c]*c[^c]*$)
Each lookahead asserts that its letter occurs exactly once on the line (the $ inside each lookahead is what makes "only once" hold through the end of the line). Lookaheads are not part of POSIX extended regular expressions, so you need grep's Perl-compatible flag -P to use this regex.
To create this regex from a string, you could use sed (an exercise for the reader)
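One possible answer to that exercise, as a minimal sketch (assuming the letters arrive as a single word on standard input): rewrite every character into its lookahead and prepend a caret.
echo abc | sed -E 's/(.)/(?=[^\1]*\1[^\1]*$)/g; s/^/^/'
This prints the regex above, which can be fed straight to grep -P:
grep -P "$(echo abc | sed -E 's/(.)/(?=[^\1]*\1[^\1]*$)/g; s/^/^/')" /usr/share/dict/words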
grep -E '^[abc]{3}.$' <Dictionary file> | grep -v -e 'a.*a' -e 'b.*b' -e 'c.*c'
i.e. find all three-letter strings built from the input letters and pipe them through an inverted grep (-v) to remove strings with doubled letters.
I'm using the '.' after {3} because my dictionary file is Windows-based, so each line has an extra carriage return. That's probably not necessary for you.
Below is a Perl solution. Note, you'll need to add more words to the dictionary, and read the input into the $input variable. An array of valid words will end up in @results.
#!/usr/bin/env perl
use Data::Dumper;

my $input = "abc";

my @dictionary = qw(aaa aac aad aal aam aap aar aas aat aaw aba abc abd abf abg
abh abm abn abo abr abs abv abw aca acc ace aci ack acl acp acs act acv ada adb
adc add adf adh adl adn ado adp adq adr ads adt adw aea aeb aec aed aef aes aev
afb afc afe aff afg afi afk afl afn afp aft afu afv agb agc agl agm agn ago agp
...
PUT A REAL DICTIONARY HERE!
...
zie zif zig zii zij zik zil zim zin zio zip zir zis zit ziu ziv zlm zlo zlx zma
zme zmi zmu zna zoa zob zoe zog zoi zol zom zon zoo zor zos zot zou zov zoy zrn
zsr zub zud zug zui zuk zul zum zun zuo zur zus zut zuz zva zwo zye zzz);

# Generate a lookahead expression for each character in the input word
my $regexp = join("", map { "(?=.*$_)" } split(//, $input));

my @results;
foreach my $word (@dictionary) {
    # If the size of the input doesn't match the dictionary word, skip to the
    # next word.
    if (length($input) != length($word)) {
        next;
    }
    if ($word =~ /$regexp/) {
        push(@results, $word);
    }
}

print Dumper @results;
The solution I found uses grep first to extract all n-letter words that contain only letters from the input set; at this stage some letters might appear more than once and some might not appear at all (again, I am assuming the input letters are unique). It then runs a series of single-letter greps to make sure each letter occurs at least once. Because the words have length n, this guarantees that each letter occurs once and only once. For example, if the input character set is {a,b,c} the solution is:
grep -E '^[abc]{3}$' /usr/share/dict/words | grep a | grep b | grep c
A simple bash script can be written which builds this grep pipeline and executes it against the words file, using $1 as the input letter set. It might not be the most efficient way of generating the command, but as I am not familiar with sed or awk it does solve my problem. The script I created is:
#!/bin/bash
# Build: grep -E '^[<letters>]{<n>}$' words | grep <l1> | grep <l2> ...
# (bash, not sh, because ${1:$ix1:1} substring expansion is a bashism)
slen=${#1}
g2="'^[$1]{$slen}\$'"
g3=""
ix1=0
while [ $ix1 -lt $slen ]
do
    g3="$g3 | grep ${1:$ix1:1}"
    ix1=$((ix1+1))
done
eval grep -E $g2 /usr/share/dict/words $g3
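A sample invocation, assuming the script is saved as findwords.sh (the exact output depends on your words file):
$ bash findwords.sh abc
cab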

reading semi-formatted data

I'm totally new to AWK, however I think this is the best way to solve my problem and a good time to learn AWK.
I am trying to read a large data file that is created by a simulation program. The output is made to be readable by humans, so its formatting isn't very consistent. An example of the output is in this image
http://i.imgur.com/0kf8l.png
I need a way to find a line like "He 2 4686A -2.088 0.0071", by specifying the "He 2 4686A" part and get the following two numbers. The problem is the line "He 2 4686A -2.088 0.0071" can appear anywhere in the table.
I know how to find the entry "He 2 4686A", but I don't know which of the 4 columns it's in. So I don't know how to address the values that follow it.
Either a command that lets me just read the next two words, or one that tells me the location of the pattern once a match is found, would help.
/He 2 4686A/ finds the line
Ca A 3970A -0.900 0.1100 He 2 4686A -2.088 0.0071 S 3 18.67m -0.371 0.3721 Ar 4 444.7A -2.124 0.0066
Any help is appreciated.
The first step should be to bring what seems to be 4 columns of records into a 1-column format... then it's easy with awk, because you can filter on the first few fields, like:
echo "He 2 4686A -2.088 0.0071" | \
awk '$1 == "He" && $2 == 2 && $3 == "4686A" {print $4, $5}'
which gives
-2.088 0.0071
So, for me, the only challenge is to transform your data into a one-column format... and from the picture that looks simple, because the columns seem to have a fixed width that you can count.
Assuming that your column width is 30 characters (difficult to tell from a picture; beware of tabs) and your data is in input_file, you could first "cut" the data into 4 columns and then pipe the output to another awk process:
awk '{
    # Print each 30-character column as its own record
    print substr($0,  1, 30)
    print substr($0, 31, 30)
    print substr($0, 61, 30)
    print substr($0, 91, 30)
}' input_file | \
awk '$1 == "He" && $2 == 2 && $3 == "4686A" {print $4, $5}'
If you really just need the next two numbers after an anchor, then the grep solution from Costa below is best for you; this approach, however, gives you the possibility to implement further logic...
If you're not dead set on using awk, grep would be the easiest way...
egrep -o "He 2 4686A \-?[0-9.]+ \-?[0-9.]+" output.txt
EDIT: The above works only if the spacing is done with single spaces, which doesn't seem to be your case. In order to handle tabs and/or repeated whitespace...
egrep -o "He[ \t]+2[ \t]+4686A[ \t]+\-?[0-9.]+[ \t]+\-?[0-9.]+" output.txt
