How do I check if a token comes before another in a file? - grep

I am trying to find, among a bunch of files that hold SQL statements, whether we ever SELECT from a table before we INSERT into it. It seems like it should be a one-liner with grep.
I've come up with
grep -zl "FROM (\S*).*INSERT INTO \0"
The -z treats the input as one line, and then the back reference does the rest.
However testing with
echo "SELECT a FROM x INSERT INTO x VALUES(1);" | grep -zl "FROM (\S*).*INSERT INTO \0"
produces no result.
In fact even echo "aa aa" | grep "(\S*) \0" returns nothing.
What am I missing?

First, let's solve it for x:
echo "SELECT a FROM x INSERT INTO x VALUES(1);" | grep -E "FROM (\S)*x.*INSERT INTO (\S)*x"
However, you may have many tables and you are interested in all of them. So, this is how you can list the table names:
select TABLE_NAME
from information_schema.tables;
Now, let's generate the grep for each table:
select CONCAT('sudo bash foo.sh "your script" ', TABLE_NAME)
from information_schema.tables;
and implement foo.sh as follows:
echo "$1" | grep -E "FROM (\S)*$2.*INSERT INTO (\S)*$2"
The query generates the grep for each table. Naturally, you can filter your query down to a selection of tables instead, and you might also need to handle cases like
select ... from yourschema.yourtable
or
select ... from `yourtable`
but start with the proof-of-concept I have given and see whether that's enough for you.
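As a rough sketch of that extension (an assumption on my part: MySQL-style backtick quoting and dotted schema prefixes, both made optional in the pattern), foo.sh could become:
echo "$1" | grep -E "FROM ([[:alnum:]_]+\.)?\`?$2\`?.*INSERT INTO ([[:alnum:]_]+\.)?\`?$2\`?"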

grep solution:
Use the -P option for Perl-style regular expressions, where back-references are written \1 through \9 (there is no \0 back-reference):
grep -zPl "FROM ([[:alnum:]]+).*INSERT INTO \1"
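A quick sanity check with the test line from the question (with -l and piped input, grep prints the pseudo-filename (standard input) on a match):
echo "SELECT a FROM x INSERT INTO x VALUES(1);" | grep -zPl "FROM ([[:alnum:]]+).*INSERT INTO \1"
(standard input)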
Matching SELECT statements before INSERTs, full solution:
The reported problem is more complicated than described: a SELECT and its corresponding INSERT statement are not necessarily adjacent.
For instance:
SELECT a FROM x1
INSERT INTO x1 VALUES(1);
SELECT a FROM x2
SELECT a FROM x3
SELECT a FROM y1
SELECT a FROM x2
INSERT INTO x3 VALUES(1);
INSERT INTO x2 VALUES(1);
INSERT INTO y1 VALUES(1);
INSERT INTO x3 VALUES(2);
SELECT a FROM y2
INSERT INTO y1 VALUES(1);
Only the second INSERT into x3 and the last INSERT into y1 are unmatched, and there are nested and duplicated statements.
We do not know all the table names ahead of time, so we need a lookup structure: add every table name seen in a SELECT (duplicates collapse into one entry), and remove it again on the matching INSERT.
This is implemented with a gawk (the default awk on most Linux machines) associative array, scanning the input SQL file once.
gawk script: script.awk
/SELECT .* FROM / { # for each line matching "SELECT .* FROM"
    # extract the table name from the current line
    tableName = gensub(/.*FROM[[:space:]]+([[:alnum:]]+).*/, "\\1", 1);
    # add the tableName to the associative array (used as a set)
    tableNamesStack[tableName] = 1;
}
/INSERT INTO / { # for each line matching "INSERT INTO "
    # extract the table name from the current line
    tableName = gensub(/.*INTO[[:space:]]+([[:alnum:]]+).*/, "\\1", 1);
    # if the current tableName was seen in a SELECT
    if (tableName in tableNamesStack) {
        # remove it: this INSERT is matched
        delete tableNamesStack[tableName];
    } else {
        # tableName was never SELECTed first: report and continue
        printf ("Unmatched INSERT statement in line %d, for table %s\n", NR, tableName);
    }
}
Running script.awk:
gawk -f script.awk input.sql
Unmatched INSERT statement in line 10, for table x3
Unmatched INSERT statement in line 12, for table y1

Related

Parsing a simple string with awk or sed in Linux

original string :
A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/
Depth of directories will vary, but the /trunk part will always remain the same.
And a single character in front of /trunk is the indicator of that line.
desired output :
A /trunk/apple
B /trunk/apple
Z /trunk/orange
Q /trunk/melon/juice/venti/straw
*** edit
I'm sorry, I made a mistake by adding a slash at the end of each path in the original string, which made the output confusing. The original string didn't have the slash in front of the capital letters, but I'll leave it be.
my attempt :
echo $str1 | sed 's/\(.\/trunk\)/\n\1/g'
I feel like it should work but it doesn't.
With GNU awk for multi-char RS and RT:
$ awk -v RS='([^/]+/){2}[^/\n]+' 'RT{sub("/",OFS,RT); print RT}' file
A trunk/apple
B trunk/apple
Z trunk/orange
I'm setting RS to a regexp describing each string you want to match, i.e. 2 repetitions of (non-/s followed by a /) and then a final string of non-/s (and non-newlines, for the last string on the input line). RT is automatically set to each of the matching strings, so then I just change the first / to a blank and print the result.
If each path isn't always 3 levels deep but does always start with something/trunk/, e.g.:
$ cat file
A/trunk/apple/banana/B/trunk/apple/Z/trunk/orange
then:
$ awk -v RS='[^/]+/trunk/' 'RT{if (NR>1) print pfx $0; pfx=gensub("/"," ",1,RT)} END{printf "%s%s", pfx, $0}' file
A trunk/apple/banana/
B trunk/apple/
Z trunk/orange
To deal with more complex sample input, where there could be any number of / and values after trunk in a single line, please try the following.
awk '
{
    gsub(/[^/]*\/trunk/,OFS"&")
    sub(/^ /,"")
    sub(/\//,OFS"&")
    gsub(/ +[^/]*\/trunk\/[^[:space:]]+/,"\n&")
    sub(/\n/,OFS)
    gsub(/\n /,ORS)
    gsub(/\/trunk/,OFS"&")
    sub(/[[:space:]]+/,OFS)
}
1
' Input_file
Explanation: a detailed breakdown of the above.
awk '                                            ##Start the awk program.
{
    gsub(/[^/]*\/trunk/,OFS"&")                  ##Globally prefix each run of non-/ chars followed by /trunk with a space.
    sub(/^ /,"")                                 ##Remove the leading space added by the previous step.
    sub(/\//,OFS"&")                             ##Put a space before the first /.
    gsub(/ +[^/]*\/trunk\/[^[:space:]]+/,"\n&")  ##Globally prefix each space-preceded .../trunk/... run with a newline.
    sub(/\n/,OFS)                                ##Replace the first newline with a space.
    gsub(/\n /,ORS)                              ##Globally replace newline-space with the output record separator.
    gsub(/\/trunk/,OFS"&")                       ##Globally put a space before each /trunk.
    sub(/[[:space:]]+/,OFS)                      ##Collapse the first run of whitespace to a single space.
}
1                                                ##Print the edited/non-edited line.
' Input_file                                     ##Mention the Input_file name here.
With your shown samples, please try the following awk code.
awk '{gsub(/\/trunk/,OFS "&");gsub(/trunk\/[^/]*\//,"&\n")} 1' Input_file
In awk you can try this solution. It deals with the special requirement of removing forward slashes when the next character is upper case. It won't win a design award, but it works.
$ echo "A/trunk/apple/B/trunk/apple/Z/trunk/orange" |
awk -F '' '{ x=""; for(i=1;i<=NF;i++){
if($(i+1)~/[A-Z]/&&$i=="/"){$i=""};
if($i~/[A-Z]/){ printf x""$i" "}
else{ x="\n"; printf $i } }; print "" }'
A /trunk/apple
B /trunk/apple
Z /trunk/orange
Also works for n words. Actually works with anything that follows the given pattern.
$ echo "A/fruits/apple/mango/B/anything/apple/pear/banana/Z/ball/orange/anything" |
awk -F '' '{ x=""; for(i=1;i<=NF;i++){
if($(i+1)~/[A-Z]/&&$i=="/"){$i=""};
if($i~/[A-Z]/){ printf x""$i" "}
else{ x="\n"; printf $i } }; print "" }'
A /fruits/apple/mango
B /anything/apple/pear/banana
Z /ball/orange/anything
This might work for you (GNU sed):
sed 's/[^/]*/& /;s/\//\n/3;P;D' file
Separate the first word from the first / by a space.
Replace the third / by a newline.
Print/delete the first line and repeat.
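For instance, with the string from before the question's edit (no trailing slashes):
$ echo "A/trunk/apple/B/trunk/apple/Z/trunk/orange" | sed 's/[^/]*/& /;s/\//\n/3;P;D'
A /trunk/apple
B /trunk/apple
Z /trunk/orange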
If the first word has the property that it is only one character long:
sed 's/./& /;s#/\(./\)#\n\1#;P;D' file
Or if the first word has the property that it begins with an upper case character:
sed 's/[[:upper:]][^/]*/& /;s#/\([[:upper:]][^/]*/\)#\n\1#;P;D' file
Or if the first word has the property that it is followed by /trunk/:
sed -E 's#([^/]*)(/trunk/)#\n\1 \2#g;s/.//' file
With GNU sed:
$ str="A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/"
$ sed -E 's|/?(.)(/trunk/)|\n\1 \2|g;s|/$||' <<< "$str"
A /trunk/apple
B /trunk/apple
Z /trunk/orange/citrus
Q /trunk/melon/juice/venti/straw
Note the first empty output line. If it is undesirable we can separate the processing of the first output line:
$ sed -E 's|(.)|\1 |;s|/(.)(/trunk/)|\n\1 \2|g;s|/$||' <<< "$str"
A /trunk/apple
B /trunk/apple
Z /trunk/orange/citrus
Q /trunk/melon/juice/venti/straw
Using GNU awk you could use FPAT to define the contents of each field with a pattern.
When looping over the fields, replace the first / with a space followed by /:
str1="A/trunk/apple/B/trunk/apple/Z/trunk/orange"
echo $str1 | awk -v FPAT='[^/]+/trunk/[^/]+' '{
for(i=1;i<=NF;i++) {
sub("/", " /", $i)
print $i
}
}'
The pattern matches
[^/]+ Match 1+ chars other than /
/trunk/[^/]+ Match /trunk/ followed by 1+ chars other than /
Output
A /trunk/apple
B /trunk/apple
Z /trunk/orange
Other patterns that can be used with FPAT after the updated question:
Matching a word boundary \\<, an uppercase char A-Z, then /trunk followed by repetitions of / and lowercase chars:
FPAT='\\<[A-Z]/trunk(/[a-z]+)*'
If the directory names after /trunk are at least 2 characters long:
FPAT='\\<[A-Z]/trunk(/[^/]{2,})*'
If there can be no separate folders that consist of a single uppercase char A-Z:
FPAT='\\<[A-Z]/trunk(/([^/A-Z][^/]*|[^/]{2,}))*'
Output
A /trunk/apple
B /trunk/apple
Z /trunk/orange/citrus
Q /trunk/melon/juice/venti/straw
Assuming your data will always be in the format provided as a single string, you can try this sed.
$ sed 's/$/\//;s|\([A-Z]\)\([a-z/]*\)/\([a-z]*\?\)|\1 \2\3\n|g' input_file
$ echo "A/trunk/apple/pine/skunk/B/trunk/runk/bunk/apple/Z/trunk/orange/T/fruits/apple/mango/P/anything/apple/pear/banana/L/ball/orange/anything/S/fruits/apple/mango/B/rupert/cream/travel/scout/H/tall/mountains/pottery/barnes" | sed 's/$/\//;s|\([A-Z]\)\([a-z/]*\)/\([a-z]*\?\)|\1 \2\3\n|g'
A /trunk/apple/pine/skunk
B /trunk/runk/bunk/apple
Z /trunk/orange
T /fruits/apple/mango
P /anything/apple/pear/banana
L /ball/orange/anything
S /fruits/apple/mango
B /rupert/cream/travel/scout
H /tall/mountains/pottery/barnes
Some fun with perl, where you can use a non-consuming regex to autosplit into the @F array, then just print however you want.
perl -lanF'/(?=.{1,2}trunk)/' -e 'print "$F[2*$_] $F[2*$_+1]" for 0..$#F/2'
Step #1: Split
perl -lanF'/(?=.{1,2}trunk)/'
This will take the input stream, and split each line whenever the pattern .{1,2}trunk is encountered.
Because we want to retain trunk and the preceding 1 or 2 chars, we wrap the split pattern in (?=), a non-consuming lookahead.
This splits things up this way:
$ echo A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/ | perl -lanF'/(?=.{1,2}trunk)/' -e 'print join " ", @F'
A /trunk/apple/ B /trunk/apple/ Z /trunk/orange/citrus/ Q /trunk/melon/juice/venti/straw/
Step #2: Format output:
The @F array contains pairs that we want to print in order, so we'll iterate over half of the array indices, and print 2 at a time:
print "$F[2*$_] $F[2*$_+1]" for 0..$#F/2 --> Double the iterator, and print pairs
using perl -l means each print has an implicit \n at the end
The results:
$ echo A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/ | perl -lanF'/(?=.{1,2}trunk)/' -e 'print "$F[2*$_] $F[2*$_+1]" for 0..$#F/2'
A /trunk/apple/
B /trunk/apple/
Z /trunk/orange/citrus/
Q /trunk/melon/juice/venti/straw/
Endnote: Perl obfuscation that didn't work.
Any array in perl can be assigned to a hash, giving it the format (key, val, key, val, ...).
So %F=@F; print "$_ $F{$_}" for keys %F seems like it would be really slick.
But you lose order:
$ echo A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/ | perl -lanF'/(?=.{1,2}trunk)/' -e '%F=@F; print "$_ $F{$_}" for keys %F'
Z /trunk/orange/citrus/
A /trunk/apple/
Q /trunk/melon/juice/venti/straw/
B /trunk/apple/
Update
With your new data file:
$ cat file
A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/
This GNU awk solution:
awk '
{
    sub(/[/]$/,"")
    gsub(/[[:upper:]]{1}/,"& ")
    print gensub(/([/])([[:upper:]])/,"\n\\2","g")
}' file
A /trunk/apple
B /trunk/apple
Z /trunk/orange/citrus
Q /trunk/melon/juice/venti/straw

Merge two files by one column - awk

I have two different scripts to merge files by one matching column.
file1.tsv - 4 columns separated by tab
1 LAK c.66H>T p.Ros49Kos
2 OLD c.11A+1>R p.Ill1639Los
3 SRP c.96V-T>X p.Zub%D23
4 HRP c.1S>T p.Lou33aa
file2.tsv - 14 columns, separated by tab
LAK "empty_column" c.66H>T ......
SRP "empty_column" c.96-T>X ......
Output.tsv - all columns from file2.tsv and, appended after them, the 1st column of file1.tsv when the rows match.
LAK "empty_column" c.66H>T ......1
SRP "empty_column" c.96-T>X ......3
I am using these two scripts, but they don't work:
awk -v FILE_A="file1.tsv" -v OFS="\t" 'BEGIN { while ( ( getline < FILE_A ) > 0 ) { VAL = $0 ; sub( /^[^ ]+ /, "", VAL ) ; DICT[ $3 ] = VAL } } { print $0, DICT[ $3 ] }' file2.tsv
or
awk 'NR==FNR{h[$3] = $1; next} {print h[$3]}' file1.tsv file2.tsv
Thanks for help.
You might want to use the join command to join column 2 of the first file with column 1 of the second:
join --nocheck-order -1 2 -2 1 file1.tsv file2.tsv
A few notes:
This is only the first step; after it you still have the task of cutting out unwanted columns or rearranging them. I suggest looking into the cut command, or using awk this time.
The join command expects the text in both files to be sorted in the same order (alphabetical or otherwise).
Alternatively, import them into a temporary sqlite3 database and perform a join there.
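For completeness, the asker's second awk attempt is close; a minimal sketch of a fix (assuming tab-separated files and that column 3 is the shared key) stores column 1 of file1.tsv keyed on column 3, then appends it to each matching line of file2.tsv:
awk 'BEGIN{FS=OFS="\t"} NR==FNR{id[$3]=$1; next} $3 in id{print $0, id[$3]}' file1.tsv file2.tsv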

Extracting specific part of each line in file based on prior string

I have a file with lines like the one here:
intergenic NONE(dist=NONE),ENSG00000223972(dist=1692) 1 10177 10177 - C 1 10177 rs367896724 A AC 100 PASS AC=2130;AF=0.425319;AN=5008;NS=2504;DP=103152;EAS_AF=0.3363;AMR_AF=0.3602;AFR_AF=0.4909;EUR_AF=0.4056;SAS_AF=0.4949;AA=|||unknown(NO_COVERAGE);VT=INDEL
What I would like to do is extract the parts I require using the start and end chars. So for instance I would like to extract the value of AFR_AF. What I know is that this value begins with AFR_AF and ends with ; (the whole thing looks like AFR_AF=0.4909;), so I want the 0.4909.
I would like to extract multiple parts of each line like this if possible. Is this possible using something like awk?
A portable solution with awk:
# extract.awk
BEGIN {
    FS = "="
    RS = ";"
    search["AFR_AF"] = 1
    # Add more items as you wish
    search["FOO_BAR"] = 1
    search["HELLO_WORLD"] = 1
}
$1 in search {
    print $2
}
Run it like this:
awk -f extract.awk input.file
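Given the sample line from the question, the only key from search that occurs is AFR_AF, so the script should print:
0.4909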
Explanation:
Using ; as the record separator (RS), awk sees records like this (instead of line by line):
foo=bar
hello=world
no equal sign in this record
...
Since we set the field separator (FS) to =, we can check whether the first field $1 contains a certain value and print the value $2 in that case.
The search itself is implemented with an associative array; $1 in search checks whether $1 is a key of that array.
grep with -o and -P should help:
grep -oP 'AFR_AF=\K[^;]*' file
or if you want multiple values in one shot, for example:
grep -oP '(AFR_AF=|VT=)\K[^;]*' file
will give
0.4909
INDEL

AWK - Merge multiple lines in two particular columns into one line?

Newbie here... I'm confused about how to merge particular columns from multiple lines and print them as one row. For example I have this kind of data in a .csv file (separated by commas):
ID1,X1,X2,X3,X4,X5,X6,T,C
ID2,X1,X2,X3,X4,X5,X6,G,A
ID3,X1,X2,X3,X4,X5,X6,C,G
ID4,X1,X2,X3,X4,X5,X6,A,A
I plan to select only the 8th and 9th columns of each row, and print them all in one row separated by whitespace, so that the result looks like this:
T C G A C G A A
To do that, I tried this AWK code:
awk -F "," '{printf "%s ",$8, "%s ",$9}' FILE > outputfile
But it printed all of column 8 first and then all of column 9:
T G C A C A G A
Any suggestions are very welcomed.
Thank you very much for your kind help.
like this?
kent$ awk -F, '{t=$8 OFS $9;s=s?s OFS t:t}END{print s}' file
T C G A C G A A
Try this awk:
awk -F "," '{printf "%s %s ", $8,$9}' yourfile
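Note that printf emits no final newline; if that matters, a small variation adds one in an END block:
awk -F, '{printf "%s %s ", $8, $9} END{print ""}' yourfile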

Parsing log lines using awk

I have to parse some information out of big log file lines.
It's something like
abc.log:2012-03-03 11:12:12,457 ABC[123.RPH.-101] XYZ: Query=get_data #a=0,#b=1 Rows=10Time=100
There are many log lines like above in the logfiles. I need to extract information like
datetime i.e. 2012-03-03 11:12:12,457
job details i.e. 123.RPH.-101
Query i.e. get_data (no parameters)
Rows i.e. 10
Time i.e. 100
So output should look like
2012-03-03 11:12:12,457|123|-101|get_data|10|100
I have tried various permutations with awk but am not getting it right.
Well, this is really horrible, but since sed is in the tags and there are no answers yet...
sed -e 's/[^0-9]*//' -re 's/[^ ]*\[([^.]*)\.[^.]*\.([^]]*)\]/| \1 | \2/' -e 's/[^ ]* Query=/| /' -e 's/ [^ ]* Rows=/ | /' -e 's/Time=/ | /' my_logfile
My solution in gawk: it uses gawk's extension to match() (the third, array argument).
You didn't give a specification of the file format, so you may have to adjust the regexes.
Script invocation:
gawk -v OFS='|' -f script.awk
{
    match($0, /[0-9]+-[0-9]+-[0-9]+ [0-9]+:[0-9]+:[0-9]+,[0-9]+/)
    date_time = substr($0, RSTART, RLENGTH)
    match($0, /\[([0-9]+).RPH.(-?[0-9]+)\]/, matches)
    job_detail_1 = matches[1]
    job_detail_2 = matches[2]
    match($0, /Query=(\w+)/, matches)
    query = matches[1]
    match($0, /Rows=([0-9]+)/, matches)
    rows = matches[1]
    match($0, /Time=([0-9]+)/, matches)
    time = matches[1]
    print date_time, job_detail_1, job_detail_2, query, rows, time
}
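Run against the sample line, this should print:
2012-03-03 11:12:12,457|123|-101|get_data|10|100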
Here's another, less fancy, AWK solution (but works in mawk too):
BEGIN { OFS = "|" }
{
    i = match($3, /\[[^]]+\]/)
    job = substr($3, i + 1, RLENGTH - 2)
    split($5, X, "=")
    query = X[2]
    split($7, X, "=")
    rows = X[2]
    split($8, X, "=")
    time = X[2]
    print $1 " " $2, job, query, rows, time
}
Note that this assumes the Rows=10 and Time=100 strings are separated by a space, i.e. that there is a typo in the question's example.
TXR:
#(collect :vars ())
#file:#year-#mon-#day #hh:#mm:#ss,#ms #jobname[#job1.RPH.#job2] #queryname: Query=#query #params Rows=#{rows /[0-9]+/}Time=#time
#(output)
#year-#mon-#day #hh:#mm:#ss,#ms|#job1|#job2|#query|#rows|#time
#(end)
#(end)
Run:
$ txr data.txr data.log
2012-03-03 11:12:12,457|123|-101|get_data|10|100
Here is one way to make the program assert that every line in the log file must match the pattern. First, do not allow gaps in the collection. This means that nonmatching material cannot be skipped just to look for the lines which match:
#(collect :gap 0 :vars ())
Secondly, at the end of the script we add this:
#(eof)
This specifies a match on the end of file. If the #(collect) bails early because of a nonmatching line (due to the :gap 0 constraint), the #(eof) will fail and so the script will terminate with a failed status.
In this type of task, field splitting regex hacks will backfire because they can blindly produce incorrect results for some subset of the input being processed. If the input contains a vast number of lines, there is no easy way to check for mistakes. It's best to have a very specific match that is likely to reject anything which doesn't resemble the examples on which the pattern is based.
You just need the right field separators:
awk -F '[][ =.]' -v OFS='|' '{print $1 " " $2, $4, $6, $10, $15, $17}'
I'm assuming the "abc.log:" is not actually in the log file.
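Assuming the space between Rows=10 and Time=100 noted in an earlier answer, this should print:
2012-03-03 11:12:12,457|123|-101|get_data|10|100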
