Compare specific columns across multiple files and print matched specific column - join

I have six files in CSV format. I am trying to compare $3, $4, $5 across the files and, where they match, print $6 from every file along with columns $2, $3, $4, $5 from file 1.
Input file 1:
Blink,Seeddensity(g/cm^3),1_0002,VU10,37586764,0.458533399568206
Blink,Seeddensity(g/cm^3),1_0004,VU08,37687622,0.548181169267479
Blink,Seeddensity(g/cm^3),1_0006,VU02,6629660,0.553099787284982
Input file 2:
Farmcpu,Seeddensity(g/cm^3),1_0002,VU10,37586764,0.907010463957269
Farmcpu,Seeddensity(g/cm^3),1_0004,VU08,37687622,0.782521980037194
Farmcpu,Seeddensity(g/cm^3),1_0006,VU02,6629660,0.589126094555234
Input file 3:
GLM,Seeddensity(g/cm^3),1_0002,VU10,37586764,0.24089
GLM,Seeddensity(g/cm^3),1_0004,VU08,37687622,0.25771
GLM,Seeddensity(g/cm^3),1_0006,VU02,6629660,0.31282
Desired output:
Trait Marker Chr Pos Blink Farmcpu GLM
Seeddensity(g/cm^3) 2_27144 VU08 36984438 1.7853934213866E-11 0.907010463957269 0.24089
Seeddensity(g/cm^3) 2_13819 VU08 21705264 3.98653459293212E-09 0.782521980037194 0.25771
Seeddensity(g/cm^3) 2_07286 VU01 38953729 3.16663946775461E-07 0.589126094555234 0.31282
I have tried several awk commands; this is the closest one I found, but it only works across two files:
awk 'NR==FNR{ a[$2,$3,$4,$5]=$1; next } { s=SUBSEP; k=$2 s $3 s $4 s $5 }k in a{ print $0,a[k] }' File1 File2 > output
join <(sort File1) <(sort File2) | join - <(sort File3) | join - <(sort File4) | join - <(sort File5) | join - <(sort File6) > output
I believe join fails because the first column is not the same across files, so I tried this command:
join -t, -j3 -o 1.2,1.3,1.4,1.5,1.6,2.6,3.6,4.6,5.6,6.6 <(sort -k 3 File1) <(sort -k 3 File2) <(sort -k 3 File3) <(sort -k 3 File4) <(sort -k 3 File5) <(sort -k 3 File6) > output
But I get an error message:
join: invalid file number in field spec: ‘3.6’
For two files the following command works, but I am not sure how to use it for multiple files:
join -t, -j3 -o 1.2,1.3,1.4,1.5,1.6,2.6 <(sort -k 3 File1) <(sort -k 3 File2) > output

Assuming you actually want CSV output then using GNU awk for ARGIND:
$ cat tst.awk
BEGIN { FS=OFS="," }
{ key = $3 FS $4 FS $5 }
ARGIND < (ARGC-1) {
    val[key,ARGIND] = $6
    next
}
{
    sfx = ""
    for (i=1; i<ARGIND; i++) {
        if ( (key,i) in val ) {
            sfx = sfx OFS val[key,i]
        }
        else {
            next
        }
    }
    print $2, $3, $4, $5, $6 sfx
}
$ awk -f tst.awk file2 file3 file1
Seeddensity(g/cm^3),1_0002,VU10,37586764,0.458533399568206,0.907010463957269,0.24089
Seeddensity(g/cm^3),1_0004,VU08,37687622,0.548181169267479,0.782521980037194,0.25771
Seeddensity(g/cm^3),1_0006,VU02,6629660,0.553099787284982,0.589126094555234,0.31282
With any other awk, just add the line FNR==1 { ARGIND++ } at the start of the script.
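For non-GNU awks, here is a minimal runnable sketch of that adaptation, using a two-file subset of the question's data; argind is a hand-rolled counter standing in for ARGIND (ARGC itself is POSIX, so ARGC-1 still gives the number of input files):

```shell
# Portable variant of tst.awk: emulate GNU awk's ARGIND by incrementing
# a counter each time a new input file starts (FNR==1).
cd "$(mktemp -d)"
printf 'Blink,Seeddensity(g/cm^3),1_0002,VU10,37586764,0.458\n' > file1
printf 'Farmcpu,Seeddensity(g/cm^3),1_0002,VU10,37586764,0.907\n' > file2
out=$(awk '
BEGIN { FS = OFS = "," }
FNR == 1 { argind++ }                       # replaces ARGIND
{ key = $3 FS $4 FS $5 }
argind < (ARGC-1) { val[key, argind] = $6; next }
{
    sfx = ""
    for (i = 1; i < argind; i++) {
        if ((key, i) in val) sfx = sfx OFS val[key, i]
        else next
    }
    print $2, $3, $4, $5, $6 sfx
}' file2 file1)
echo "$out"
```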

Related

Why using awk in bash like vlookup in Excel give empty output file?

I am still unclear on how to use awk properly, but I know it will be useful for what I want.
I have two files, both are tab delimited:
transcriptome.txt (with billions of lines):
>TRINITY_DN261_c0_g1_i1 GATATTTATCCGAATATTCAATATGAT
>TRINITY_DN299_c0_g1_i1 GGACACGGGCCTCAAGCCAAGTCAAAACCACCACAAAG
>TRINITY_DN216_c0_g1_i1 GTTCAATATTCAATGACTGAAGGGCCCGCTGATTTTCCCCTATAAA
>TRINITY_DN220_c0_g1_i1 GGGAGATAATAACAATGATAACACACAAAATTCCAATG
selected_genes.txt (thousands of lines):
>TRINITY_DN261_c0_g1_i1 1
>TRINITY_DN220_c0_g1_i1 0
I want this output (first column of selected_genes.txt and second column of transcriptome.txt):
>TRINITY_DN261_c0_g1_i1 GATATTTATCCGAATATTCAATATGAT
>TRINITY_DN220_c0_g1_i1 GGGAGATAATAACAATGATAACACACAAAATTCCAATG
Usually I use the vlookup function in Excel.
I try to obtain my result with awk, like in many threads (stackexchange1, stackexchange2, stackoverflow1,stackoverflow2, stackoverflow3, and others..)
So I tried to use advice from these threads, but my output is either blank or just a copy of my selected_genes.txt file.
I checked: both files are UTF-8 with CRLF line endings. Also,
awk '{print $1}' transcriptome.txt
awk '{print $1}' selected_genes.txt
both correctly print the first column of my files, so the problem does not come from them.
Here is what I tried:
awk -F, 'FNR==NR {a[$1]=$1; next}; $1 in a {print a[$2]}' selected_genes.txt transcriptome.txt > output.txt
# Blank result
awk -F 'FNR==NR{var[$1]=$1;next;}{print var[$1]FS$2}' selected_genes.txt transcriptome.txt > output.txt
# Blank result
awk 'NR == FNR{a[$1] = $2;next}; {print $1, $1 in a?a[$1]: "NA"}' selected_genes.txt transcriptome.txt > output.txt
# Print only transcriptome.txt with first column and NAs
awk -F, 'FNR==NR{var[$1]=$1}FNR!=NR{print(var[$2]","$1)}' selected_genes.txt transcriptome.txt > output.txt
# Print only selected_genes.txt
I could not produce the desired output.
Any advice explaining what is wrong with my code would be appreciated.
An awk classic. Hash the thousands-of-lines genes file into an array (a), so you don't waste memory on the huge file, and look up $1 from the billions-of-lines transcriptome file:
$ awk '
# { sub(/\r$/,"") }  # uncomment to strip Windows-style (CRLF) line endings
NR==FNR {            # hash $1 of the genes file into array a
    a[$1]
    next
}
($1 in a) {          # look up $1 from the transcriptome
    print
}' genes transcriptome  # mind the order: genes file first
>TRINITY_DN261_c0_g1_i1 GATATTTATCCGAATATTCAATATGAT
>TRINITY_DN220_c0_g1_i1 GGGAGATAATAACAATGATAACACACAAAATTCCAATG
Your code:
awk -F, 'FNR==NR{a[$1]=$1; next}; $1 in a {print a[$2]}'
will not work since you're trying to print a[$2], which doesn't exist. Also note that -F, sets the field separator to a comma, but your files are tab-delimited, so with -F, the whole line ends up in $1. Change to
awk 'FNR==NR{a[$1]; next} $1 in a' selected_genes.txt transcriptome.txt
which should give you the expected output.
The second expression is shorthand for ($1 in a) {print $0}
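A tiny check of that shorthand on throwaway data: a condition with no action block defaults to printing the whole record.

```shell
# A pattern with no action defaults to { print $0 }.
out=$(printf 'a 1\nb 2\n' | awk '$1 == "a"')
echo "$out"
```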
There's a better tool in the box than awk for this sort of file merge on a common field, especially for big files: join(1)
$ join -t $'\t' -11 -21 -o 0,2.2 \
<(sort -t $'\t' -k1,1 selected_genes.txt) \
<(sort -t $'\t' -k1,1 transcriptome.txt)
>TRINITY_DN220_c0_g1_i1 GGGAGATAATAACAATGATAACACACAAAATTCCAATG
>TRINITY_DN261_c0_g1_i1 GATATTTATCCGAATATTCAATATGAT
The only caveat is that the files to be joined have to be sorted on the join column, hence the uses of sort.
In database terms, it does an INNER JOIN of the two files: for each row of the first file, every row of the second file with a matching join column produces one row of output. The -o 0,2.2 makes each output line consist of the join column (0) followed by the second column of the second file (2.2).
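If you also wanted to keep genes that have no match in transcriptome.txt (a LEFT JOIN), join can do that too. A sketch with made-up two-line files, where -a 1 keeps unpairable lines from the first file and -e NA (my choice of filler) replaces the fields the second file could not supply; -e only takes effect together with -o:

```shell
# LEFT JOIN sketch: unpairable selected_genes.txt lines survive with "NA"
# in place of the missing sequence column.
cd "$(mktemp -d)"
printf '>g1\t1\n>g3\t0\n' > selected_genes.txt
printf '>g1\tAAA\n>g2\tCCC\n' > transcriptome.txt
sort -k1,1 selected_genes.txt > g.sorted
sort -k1,1 transcriptome.txt > t.sorted
out=$(join -t "$(printf '\t')" -a 1 -e NA -o 0,2.2 g.sorted t.sorted)
echo "$out"
```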
Another interesting option:
$ grep -F -f <(sed -e 's/[^\t]*$//' selected_genes.txt) transcriptome.txt
>TRINITY_DN261_c0_g1_i1 GATATTTATCCGAATATTCAATATGAT
>TRINITY_DN220_c0_g1_i1 GGGAGATAATAACAATGATAACACACAAAATTCCAATG
will, very efficiently, show just the lines of transcriptome.txt that contain the first column of some line of selected_genes.txt (the sed strips everything after the first tab but keeps the tab, so an ID only matches when followed by a tab). This was faster than the other approaches by a large margin in my tests.

awk to parse file and export as variable

I am parsing a text file
Lines File Name Gen LnkLN LINK Time
----- -------------------- ---- ----- ---- ------------------------
00090 TEST1_1519230912 0 00092 .X.X Wed Feb 21 16:35:14 2018
00091 TEST2_1619330534 0 00093 .X.X Wed Feb 21 16:35:14 2018
using code
awk '{if (($1 ~ /^[0-9A-Fa-f]+$/) && (length($1)==5)) {
    if (! c[$4]) TLN=TLN $4 ","
    c[$4]=$4;
    if (! d[$3]) TGN=TGN $3 ","
    d[$3]=$3
    if (! b[$2]) TLNK=TLNK $2 ","
    b[$2]=$2
    }
} END {print "TLines="TLN,"TGEN="TGN,"TLink="TLNK}' /var/tmp/slink.jnk
I get the output:
TLines=00092,00093, TGEN=0,0, TLink=TEST1_1519230912,TEST2_1619330534,
I have two questions about this.
First, I don't understand why the value for TGN is printed twice in the output ("0,0,"). If the file has duplicate values for a field, I want only one value in the output.
Second, I redirect this output into another file and source that file to set the values as environment variables for later parts of the script. Is there a better way to use them as variables inside the script, without creating and sourcing another file?
Use the in operator to check whether a key has already been seen; this avoids the case where the stored value itself evaluates to false. That is exactly what happens with your 0 value, and why it is repeated in your output.
$ awk '{if (($1 ~ /^[0-9A-Fa-f]+$/) && (length($1)==5)) {
    if (!($4 in c)) TLN=TLN $4 ","
    c[$4]
    if (!($3 in d)) TGN=TGN $3 ","
    d[$3]
    if (!($2 in b)) TLNK=TLNK $2 ","
    b[$2]
    }
} END {print "TLines="TLN,"TGEN="TGN,"TLink="TLNK}' f
Output:
TLines=00092,00093, TGEN=0, TLink=TEST1_1519230912,TEST2_1619330534,
EDIT
Above I've kept things close to your original version, but as mentioned in the comments, a more idiomatic and nicer version would be:
$ awk '($1 ~ /^[0-9A-Fa-f]+$/) && (length($1)==5) {
    if (!c[$4]++) TLN=TLN $4 ","
    if (!d[$3]++) TGN=TGN $3 ","
    if (!b[$2]++) TLNK=TLNK $2 ","
} END {print "TLines="TLN,"TGEN="TGN,"TLink="TLNK}' f
END EDIT
For setting the variables, this worked for me (where a.awk contains the awk code, above):
$ eval "$(awk -f a.awk f)"
$ echo $TLines
00092,00093,
$ echo $TGEN
0,
$ echo $TLink
TEST1_1519230912,TEST2_1619330534,
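The pitfall is easy to reproduce in isolation (throwaway data): storing the string "0" makes the value-truthiness test succeed again on the next row, while the in test does not.

```shell
# Value-truthiness vs key-existence: the stored "0" is falsy, so the
# first version appends the duplicate; `in` checks existence instead.
bad=$(printf 'a 0\nb 0\n' | awk '{ if (!c[$2]) s = s $2 ","; c[$2] = $2 } END { print s }')
good=$(printf 'a 0\nb 0\n' | awk '{ if (!($2 in c)) s = s $2 ","; c[$2] } END { print s }')
echo "$bad $good"
```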

joining 2 files on matching column values using awk

I know there have been similar questions posted but I'm still having a bit of trouble getting the output I want using awk FNR==NR...
I have 2 files as such
File 1:
123|this|is|good
456|this|is|better
...
File 2:
aaa|123
bbb|456
...
So I want to join file 2, column 2 against file 1, column 1, and output columns 2, 3, 4 of file 1 together with column 1 of file 2.
Thanks in advance.
With awk you could do something like
awk -F \| 'BEGIN { OFS = FS } NR == FNR { val[$2] = $1; next } $1 in val { $(NF + 1) = val[$1]; print }' file2 file1
NF is the number of fields in a record (line by default), so $NF is the last field, and $(NF + 1) is the field after that. By assigning the saved value from the pass over file2 to it, a new field is appended to the record before it is printed.
One thing to note: This behaves like an inner join, i.e., only records are printed whose key appears in both files. To make this a right join, you can use
awk -F \| 'BEGIN { OFS = FS } NR == FNR { val[$2] = $1; next } { $(NF + 1) = val[$1]; print }' file2 file1
That is, you can drop the $1 in val condition on the append-and-print action. If $1 is not in val, val[$1] is empty, and an empty field will be appended to the record before printing.
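A quick runnable sketch of the inner-join variant, using the question's data plus one extra unmatched row (the 789 line is my addition, there to show that unmatched keys are silently dropped):

```shell
# Inner-join behavior: rows of file1 whose key is absent from file2
# (here the made-up 789 row) do not appear in the output.
cd "$(mktemp -d)"
printf '123|this|is|good\n456|this|is|better\n789|no|match|here\n' > file1
printf 'aaa|123\nbbb|456\n' > file2
out=$(awk -F \| 'BEGIN { OFS = FS }
    NR == FNR { val[$2] = $1; next }
    $1 in val { $(NF + 1) = val[$1]; print }' file2 file1)
echo "$out"
```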
But it's probably better to use join:
join -1 1 -2 2 -t \| file1 file2
If you don't want the key field to be part of the output, pipe the output of either of those commands through cut -d \| -f 2- to get rid of it, i.e.
join -1 1 -2 2 -t \| file1 file2 | cut -d \| -f 2-
If the files have the same number of lines in the same order, then
paste -d '|' file1 file2 | cut -d '|' -f 2-5
this|is|good|aaa
this|is|better|bbb
I see in a comment to Wintermute's answer that the files aren't sorted. With bash, process substitutions are handy to sort on the fly:
paste -d '|' <(sort -t '|' -k 1,1 file1) <(sort -t '|' -k 2,2 file2) |
cut -d '|' -f 2-5
To reiterate: this solution requires a one-to-one correspondence between the lines of the two files.

merge 2 lists based on first 2 columns

I need to merge 2 lists based on columns 1 and 2
file1:
client1,server1,3000.00
client1,server2,2500.00
client1,server3,1500.00
client2,server1,4500.00
client2,server2,2300.00
client2,server3,1230.00
client3,server1,3400.00
client3,server2,4500.00
client3,server3,1245.00
client4,server1,3400.00
client5,server2,4500.00
client6,server3,1245.00
client7,server1,3400.00
client7,server2,4500.00
client8,server3,1245.00
client8,server1,3400.00
client8,server2,4500.00
client9,server3,1245.00
file2:
client1,server1,windows,250g
client1,server2,linux,450g
client1,server3,linux,400g
client2,server1,windows,250g
client2,server2,linux,450g
client2,server3,linux,400g
client3,server1,windows,250g
client3,server2,linux,450g
client3,server3,linux,400g
What I need is to update file2 with the column 1 and 2 values from file1 that are missing from file2, adding commas to keep the same number of columns.
With this example the output should look like this:
client1,server1,windows,250g
client1,server2,linux,450g
client1,server3,linux,400g
client2,server1,windows,250g
client2,server2,linux,450g
client2,server3,linux,400g
client3,server1,windows,250g
client3,server2,linux,450g
client3,server3,linux,400g
client4,server1,,
client5,server2,,
client6,server3,,
client7,server1,,
client7,server2,,
client8,server3,,
client8,server1,,
client8,server2,,
client9,server3,,
I have tried awk and join but I am not able to get this result.
If creating a new file is easier, that is no issue.
Thanks for your help.
Another awk way:
awk -F, -v OFS="," 'NR!=FNR{NF--;NF+=2}!a[$1 FS $2]++' file2 file1
or
awk -F, 'NR!=FNR{$0=$1 FS $2",,"}!a[$1 FS $2]++' file2 file1
Shortest:
awk -F, '{x=$1","$2}NR!=FNR{$0=x",,"}!a[x]++' file2 file1
give this line a try:
awk -F, '{k=$1 FS $2}NR==FNR{a[k]++;print;next}!a[k]{print k",,"}' file2 file1
Using the join command. The problem is that join cannot join on multiple fields, so we temporarily manipulate the first comma:
join -t , -o 0,2.2,2.3 -a 1 <(sed 's/,/:/' file1) <(sed 's/,/:/' file2) | sed 's/:/,/'
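A runnable sketch of that compound-key trick with trimmed-down versions of the question's files (both already sorted on the joined key):

```shell
# The first comma becomes ":" so join treats "client,server" as a single
# key field; -a 1 keeps file1 rows missing from file2, -o pads them with
# empty fields, and the final sed restores the comma.
cd "$(mktemp -d)"
printf 'client1,server1,3000.00\nclient4,server1,3400.00\n' > file1
printf 'client1,server1,windows,250g\n' > file2
sed 's/,/:/' file1 > f1.key
sed 's/,/:/' file2 > f2.key
out=$(join -t , -o 0,2.2,2.3 -a 1 f1.key f2.key | sed 's/:/,/')
echo "$out"
```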

Joining 2 files on the first field

I would like to compare the two files: where file1 $1 equals file2 $1, display file1 $1,$2,$3,$4,$5 and file2 $2,$5, together with the difference file1 $5 - file2 $5.
input file 1.txt
1,raja,AP,NIND,14:51:56.46
2,mona,KR,SIND,12:41:46.36
3,JO,TM,SIND,18:31:56.36
4,andrew,sind,13:43:23.12
5,drew,sind,17:53:53.42
input file 2.txt
5,raju,UP,NIND,11:51:56.46
6,NAG,KR,SIND,12:41:46.36
7,JO,TM,SIND,18:31:56.36
8,andrew,sind,kkd,14:43:23.12
4,andrew,sind,ggf,15:53:53.42
10,asJO,TM,SIND,16:31:56.36
3,sandrew,sind,gba,9:43:23.12
2,xcandrew,sind,sds,6:53:53.42
1,cv,GTM,SIND,5:31:56.36
9,mnJO,TM,SIND,2:20:56.36
output:
1,raja,AP,NIND,14:51:56.46,cv,5:31:56.36
2,mona,KR,SIND,12:41:46.36,xcandrew,6:53:53.42
3,JO,TM,SIND,18:31:56.36,sandrew,9:43:23.12
4,andrew,sind,13:43:23.12,andrew,15:53:53.42
5,drew,sind,17:53:53.42,raju,11:51:56.46
With awk you would do:
$ awk 'NR==FNR{a[$1]=$0;next}$1 in a{print a[$1],$2,$5}' FS=, OFS=, f1 f2
5,drew,sind,17:53:53.42,raju,11:51:56.46
4,andrew,sind,13:43:23.12,andrew,
3,JO,TM,SIND,18:31:56.36,sandrew,
2,mona,KR,SIND,12:41:46.36,xcandrew,
1,raja,AP,NIND,14:51:56.46,cv,5:31:56.36
If you want the output sorted then pipe to sort:
$ awk 'NR==FNR{a[$1]=$0;next}$1 in a{print a[$1],$2,$5}' FS=, OFS=, f1 f2 | sort
1,raja,AP,NIND,14:51:56.46,cv,5:31:56.36
2,mona,KR,SIND,12:41:46.36,xcandrew,
3,JO,TM,SIND,18:31:56.36,sandrew,
4,andrew,sind,13:43:23.12,andrew,
5,drew,sind,17:53:53.42,raju,11:51:56.46
Alternative using join:
$ join -j1 -t, -o 1.1,1.2,1.3,1.4,1.5,2.2,2.5 <(sort f1) <(sort f2)
1,raja,AP,NIND,14:51:56.46,cv,5:31:56.36
2,mona,KR,SIND,12:41:46.36,xcandrew,
3,JO,TM,SIND,18:31:56.36,sandrew,
4,andrew,sind,13:43:23.12,,andrew,
5,drew,sind,17:53:53.42,,raju,11:51:56.46
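Neither command computes the requested difference of file1 $5 minus file2 $5. A hedged sketch, assuming those fields are HH:MM:SS.ss timestamps and the difference is wanted in seconds (the secs helper is my addition, shown on a one-line subset of the data):

```shell
# Extend the awk join with a helper that converts HH:MM:SS.ss to seconds,
# then append file1's time minus file2's time to each matched row.
cd "$(mktemp -d)"
printf '1,raja,AP,NIND,14:51:56.46\n' > f1
printf '1,cv,GTM,SIND,5:31:56.36\n' > f2
out=$(awk '
function secs(t,  p) { split(t, p, ":"); return p[1]*3600 + p[2]*60 + p[3] }
NR == FNR { a[$1] = $0; t1[$1] = $5; next }
$1 in a { print a[$1], $2, $5, secs(t1[$1]) - secs($5) }
' FS=, OFS=, f1 f2)
echo "$out"
```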