Merge two files by one column - awk - join

I have two different scripts to merge files by one matching column.
file1.tsv - 4 columns separated by tab
1 LAK c.66H>T p.Ros49Kos
2 OLD c.11A+1>R p.Ill1639Los
3 SRP c.96V-T>X p.Zub%D23
4 HRP c.1S>T p.Lou33aa
file2.tsv - 14 columns, separated by tab
LAK "empty_column" c.66H>T ......
SRP "empty_column" c.96-T>X ......
Output.tsv - all columns from file2.tsv, with the 1st column of file1 appended when the keys match.
LAK "empty_column" c.66H>T ......1
SRP "empty_column" c.96-T>X ......3
I am using these two scripts, but neither works:
awk -v FILE_A="file1.tsv" -v OFS="\t" '
    BEGIN {
        while ( ( getline < FILE_A ) > 0 ) {
            VAL = $0
            sub( /^[^ ]+ /, "", VAL )
            DICT[ $3 ] = VAL
        }
    }
    { print $0, DICT[ $3 ] }
' file2.tsv
or
awk 'NR==FNR{h[$3] = $1; next} {print h[$3]}' file1.tsv file2.tsv
Thanks for any help.

You might want to use the join command to join column 2 of the first file with column 1 of the second:
join --nocheck-order -1 2 -2 1 file1.tsv file2.tsv
A few notes:
This is only the first step; after this, you still have the task of cutting out unwanted columns or rearranging them. I suggest looking into the cut command, or using awk this time.
The join command expects both files to be sorted on the join field (alphabetically or otherwise); --nocheck-order merely suppresses the order check, so unsorted input can still produce incomplete results.
Alternatively, import them into a temporary sqlite3 database and perform a join there.
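For completeness, the whole merge can also be done in a single awk pass, which sidesteps the sorting requirement entirely. A minimal sketch, assuming tab-separated input and that file1's column 2 is the key, as in the join invocation above:
awk -F'\t' -v OFS='\t' '
    NR == FNR { id[$2] = $1; next }             # file1: remember column 1, keyed by column 2
    { print $0, (($1 in id) ? id[$1] : "") }    # file2: append the matching number, or empty
' file1.tsv file2.tsv > Output.tsv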

Related

Grep: how do I check if a token comes before another in a file?

I am trying to find, among a bunch of files that hold SQL statements, whether we ever SELECT from a table before we INSERT into it. It seems like it should be a one-liner with grep.
I've come up with
grep -zl "FROM (\S*).*INSERT INTO \0
The -z treats the input as one line, and then the back reference does the rest.
However testing with
echo "SELECT a FROM x INSERT INTO x VALUES(1);" | grep -zl "FROM (\S*).*INSERT INTO \0"
produces no result.
In fact even echo "aa aa" | grep "(\S*) \0" returns nothing.
What am I missing?
First, let's solve it for x:
echo "SELECT a FROM x INSERT INTO x VALUES(1);" | grep -E "FROM (\S)*x.*INSERT INTO (\S)*x"
However, you may have many tables and you are interested in all of them. So, this is how you can list the table names:
select TABLE_NAME
from information_schema.tables;
Now, let's generate the grep for each table:
select CONCAT('sudo bash foo.sh "your script" ', TABLE_NAME)
from information_schema.tables;
and implement foo.sh as follows:
echo "$1" | grep -E "FROM (\S)*$2.*INSERT INTO (\S)*$2"
The query generates the grep for each table. Naturally, you can filter your query down to a selection of tables instead, and you might also need to handle cases like
select ... from yourschema.yourtable
or
select ... from `yourtable`
but start with the proof-of-concept I have given and see whether that's enough for you.
grep solution:
Use the -P option to enable Perl-compatible regular expressions, where (...) groups and \1 back-references work. This is also what the original attempt was missing: \0 is not a back-reference (the first capture group is \1), and with grep's default BRE syntax the group itself would have to be written \(...\).
grep -zPl "FROM ([[:alnum:]]+) INSERT INTO \1 VALUES"
Matching SELECT statements before INSERT, full solution:
The reported problem is more complicated than described, if we assume that SELECT statements and their corresponding INSERT statements are not in sequence.
For instance:
SELECT a FROM x1
INSERT INTO x1 VALUES(1);
SELECT a FROM x2
SELECT a FROM x3
SELECT a FROM y1
SELECT a FROM x2
INSERT INTO x3 VALUES(1);
INSERT INTO x2 VALUES(1);
INSERT INTO y1 VALUES(1);
INSERT INTO x3 VALUES(2);
SELECT a FROM y2
INSERT INTO y1 VALUES(1);
Here only the second INSERT into x3 and the second INSERT into y1 are unmatched, and the statements interleave, with duplicated SELECTs.
We do not know all table names ahead of time.
We need a set data structure: add every table name seen in a SELECT statement (duplicates collapse into one entry), and remove the name again on a matching INSERT.
This is implemented with a gawk associative array (gawk is the standard awk on most Linux machines), scanning the input SQL file once.
gawk script: script.awk
/SELECT .* FROM / {    # for each line matching RegExp "SELECT .* FROM"
    # read the table name from the current line
    tableName = gensub(/.*FROM[[:space:]]+([[:alnum:]]+).*/, "\\1", 1);
    # add the tableName to the associative array (used as a set)
    tableNamesStack[tableName] = 1;
}
/INSERT INTO / {       # for each line matching RegExp "INSERT INTO "
    # read the table name from the current line
    tableName = gensub(/.*INTO[[:space:]]+([[:alnum:]]+).*/, "\\1", 1);
    # if the current tableName is in the set
    if (tableName in tableNamesStack) {
        # remove the current tableName from the set
        delete tableNamesStack[tableName];
    } else {
        # current tableName is missing from the set: report and continue
        printf ("Unmatched INSERT statement in line %d, for table %s\n", NR, tableName);
    }
}
Running script.awk:
gawk -f script.awk input.sql
Unmatched INSERT statement in line 10, for table x3
Unmatched INSERT statement in line 12, for table y1
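If you also want to report tables that are SELECTed but never INSERTed into afterwards (an extension beyond what was asked, sketched here as an assumption about what might be useful), an END block appended to script.awk can dump whatever is left in the set:
END {    # table names still in the set had a SELECT but no matching INSERT
    for (tableName in tableNamesStack)
        printf ("SELECT without matching INSERT for table %s\n", tableName);
}
For the sample input above this would flag y2.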

How to add text from two files in awk

I have two tab-delimited files as shown below:
FileA.txt
1 a,b,c
2 b,c,e
3 e,d,f,a
FileB.txt
a xxx
b xyx
c zxxy
I would need the output in the below way:
Output:
1 a,b,c xxx,xyx,zxxy
2 b,c,e xyx,zxxy,e
3 e,d,f,a e,d,f,xxx
The comma-separated values in $2 of FileA are to be used as keys to search for a match in $1 of FileB, adding a new column in the output with the corresponding values from $2 of FileB. In case of no match it should print the original value. Any help on how to do this?
awk to the rescue!
$ awk 'NR==FNR {a[$1]=$2; next}            # fileB: map key -> value
       {NF++; s=""; n=split($2,t,",")      # append an empty field for the result
        for(i=1;i<=n;i++) {k=t[i]
          $NF=$NF s ((k in a)?a[k]:k)      # mapped value, or the key itself if no match
          s=","}}1' fileB fileA | column -t
1 a,b,c xxx,xyx,zxxy
2 b,c,e xyx,zxxy,e
3 e,d,f,a e,d,f,xxx

gawk: presenting two operations' outcomes in two rows

I have a program whose output is a summary file with a header and a few columns of results.
I want to show only two pieces of data, the file name and the best period prediction, so I use this command:
program input_file | gawk 'NR==2 {print $3}; NR==4 {print $2}'
As a result I obtain the output in one column, on two lines. What do I have to do to get this result on one line, in two columns?
You could use:
program input_file | gawk 'NR==2 {heading = $3}; NR==4 {print heading " = " $2}'
This saves the value in $3 on line 2 in variable heading and prints the heading and the value from column 2 when it reads line 4.
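The same effect can be had with printf, which withholds the newline so that both values end up on one line:
program input_file | gawk 'NR==2 {printf "%s = ", $3}; NR==4 {print $2}'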

How to print unjoined lines using the join command or awk?

The join command prints the common lines of 2 files. But is there any way to print the lines that didn't match?
file 1
a 1
b 2
c 3
file2
a 3
b 3
output
c 3
Using the join command:
join -a1 -v1 file1 file2
-a1 prints the non-matching (unpairable) lines of the first file; -v1 suppresses the normal joined output, so only the unmatched lines remain. In fact -v1 alone is sufficient, since -v implies the corresponding -a.
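With the sample files above (both already sorted on the join field), this prints:
$ join -v1 file1 file2
c 3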
To do the same keyed on the first field, here's one way using awk:
awk 'FNR==NR { a[$1]; next } !($1 in a)' file2 file1
Results:
c 3

reading semi-formatted data

I'm totally new to AWK; however, I think it is the best way to solve my problem, and a good time to learn AWK.
I am trying to read a large data file that is created by a simulation program. The output is made to be readable by humans, so its formatting isn't very consistent. An example of the output is in this image:
http://i.imgur.com/0kf8l.png
I need a way to find a line like "He 2 4686A -2.088 0.0071" by specifying the "He 2 4686A" part, and to get the two numbers that follow. The problem is that the entry "He 2 4686A -2.088 0.0071" can appear anywhere in the table.
I know how to find the entry "He 2 4686A", but I don't know which of the 4 columns it's in, so I don't know how to address the values that follow it.
A command that lets me just read the next two words, or one that tells me the location of the pattern once a match is found, would both help.
/He 2 4686A/ finds the line
Ca A 3970A -0.900 0.1100 He 2 4686A -2.088 0.0071 S 3 18.67m -0.371 0.3721 Ar 4 444.7A -2.124 0.0066
Any help is appreciated.
The first step should be to bring what seem to be 4 columns of records into a 1-column format... then it's easy with awk, because each record then has just 5 fields you can filter on - like:
echo "He 2 4686A -2.088 0.0071" | \
awk '$1 == "He" && $2 == 2 && $3 == "4686A" {print $4, $5}'
which gives
-2.088 0.0071
So, for me, the only challenge is to transform your data into one-column format... and from the picture that looks simple, because it seems that the columns have a fixed width which you can count.
Assuming that your column width is 30 characters (difficult to tell from a picture; beware of tabs) and your data is in input_file, you could first "cut" each line into its 4 columns and then pipe the output to another awk process:
awk '{
print substr($0,1,30)
print substr($0,31,30)
print substr($0,61,30)
print substr($0,91,30)
}' input_file | \
awk '$1 == "He" && $2 == 2 && $3 == "4686A" {print $4, $5}'
If you really just need the next two numbers behind an anchor, then I would say the grep solution from Costa (below) is best for you; however, this approach gives you the possibility of implementing further logic...
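Another way around not knowing which of the 4 columns the entry falls in is to scan the fields of each line for the anchor and print the two fields that follow it. A sketch, assuming the fields are separated by whitespace (and reusing the input_file name from above):
awk '{
    for (i = 1; i <= NF - 4; i++)
        # anchor on the three identifying fields, then print the two numbers after them
        if ($i == "He" && $(i+1) == "2" && $(i+2) == "4686A")
            print $(i+3), $(i+4)
}' input_file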
If you're not dead set on using awk, grep would be the easiest way...
egrep -o "He 2 4686A \-?[0-9.]+ \-?[0-9.]+" output.txt
EDIT: The above only works if the fields are separated by single spaces, which doesn't seem to be your case. In order to handle tabs and/or repeated whitespace...
egrep -o "He[ \t]+2[ \t]+4686A[ \t]+\-?[0-9.]+[ \t]+\-?[0-9.]+" output.txt
