merge 2 lists based on first 2 columns - join

I need to merge 2 lists based on columns 1 and 2.
file1:
client1,server1,3000.00
client1,server2,2500.00
client1,server3,1500.00
client2,server1,4500.00
client2,server2,2300.00
client2,server3,1230.00
client3,server1,3400.00
client3,server2,4500.00
client3,server3,1245.00
client4,server1,3400.00
client5,server2,4500.00
client6,server3,1245.00
client7,server1,3400.00
client7,server2,4500.00
client8,server3,1245.00
client8,server1,3400.00
client8,server2,4500.00
client9,server3,1245.00
file2:
client1,server1,windows,250g
client1,server2,linux,450g
client1,server3,linux,400g
client2,server1,windows,250g
client2,server2,linux,450g
client2,server3,linux,400g
client3,server1,windows,250g
client3,server2,linux,450g
client3,server3,linux,400g
What I need is to update file2 with the rows of file1 whose columns 1 and 2 are missing from file2, adding commas to keep the same number of columns.
With this example the output should look like this:
client1,server1,windows,250g
client1,server2,linux,450g
client1,server3,linux,400g
client2,server1,windows,250g
client2,server2,linux,450g
client2,server3,linux,400g
client3,server1,windows,250g
client3,server2,linux,450g
client3,server3,linux,400g
client4,server1,,
client5,server2,,
client6,server3,,
client7,server1,,
client7,server2,,
client8,server3,,
client8,server1,,
client8,server2,,
client9,server3,,
I have tried with awk and join but I am not able to get this result.
If creating a new file is easier, that's no issue.
Thanks for your help.

Another awk way (here test2 and test stand for file2 and file1):
awk -F, -v OFS="," 'NR!=FNR{NF--;NF+=2}!a[$1 FS $2]++' test2 test
or
awk -F, 'NR!=FNR{$0=$1 FS $2",,"}!a[$1 FS $2]++' test2 test
Shortest:
awk -F, '{x=$1","$2}NR!=FNR{$0=x",,"}!a[x]++' test2 test
All three print file2 as-is, rewrite each file1 line to its first two fields plus two empty fields, and use the array a to skip key pairs already seen.

give this line a try:
awk -F, '{k=$1 FS $2}NR==FNR{a[k]++;print;next}!a[k]{print k",,"}' file2 file1
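Since the question allows creating a new file, you can capture the result and swap it in; a minimal sketch (file2.new is just an arbitrary temporary name):
awk -F, '{k=$1 FS $2}NR==FNR{a[k]++;print;next}!a[k]{print k",,"}' file2 file1 > file2.new
mv file2.new file2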

Using the join command. The problem is that join cannot join on multiple fields, so we temporarily turn the first comma into another delimiter to form a single join key:
join -t , -o 0,2.2,2.3 -a 1 <(sed 's/,/:/' file1) <(sed 's/,/:/' file2) | sed 's/:/,/'
-a 1 also prints the file1 lines that have no match, and -o 0,2.2,2.3 outputs the join key plus file2's fields 2 and 3 (which come out empty for unmatched lines).
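One caveat: join expects its inputs sorted on the join field, and the sample files happen to be sorted already. If your real files aren't, a sketch that sorts on the fly inside the process substitutions:
join -t , -o 0,2.2,2.3 -a 1 <(sed 's/,/:/' file1 | sort -t, -k1,1) <(sed 's/,/:/' file2 | sort -t, -k1,1) | sed 's/:/,/'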

print the rest of input along with matching line

I am new to Linux and I am experimenting with basic terminal commands. I found out that I can list all users using compgen -u, but what if I only want to display the last lines of the output?
OK, let's say the output of compgen -u goes like this:
extra
extra
extra
extra
extra
extra
extra
extra
extra
John
William
Kate
Harold
I can only use grep to find a single string (e.g. compgen -u | grep John). But what if I want to use grep to display John as well as all the remaining entries after it?
A sed or awk solution would be easier, but if you can only use grep, then the option --after-context (or -A) might do:
grep -A 5 John file
The drawback is that you need to know the number of lines to display after the match (or use an arbitrarily big number to get the rest of the file).
compgen -u | grep -A "$(compgen -u | wc -l)" John
Explanation:
From man grep:
-A NUM, --after-context=NUM
    Print NUM lines of trailing context after matching lines. Places a line containing a group separator (described under --group-separator) between contiguous groups of matches.
grep -A -- print this many rows after the matching pattern
$() -- command substitution: execute a command and substitute its output
compgen -u | wc -l -- the total number of lines of the command's output, a safe upper bound on how many lines can follow the match.
You can use the following one-liner:
n=$( compgen -u | grep -n John | head -1 | cut -d ":" -f 1 ) && compgen -u | tail -n +$n
This finds the line number of the first occurrence of John and prints everything starting from that line.
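For reference, the sed and awk solutions hinted at above are shorter, since both can print from the first match to the end of input directly; a minimal sketch of each:
compgen -u | sed -n '/^John$/,$p'
compgen -u | awk '/^John$/{found=1} found'
The sed range /^John$/,$ selects from the first line that is exactly John through the last line; the awk version sets a flag at the match and prints every line once the flag is set.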

Why does using awk in bash like vlookup in Excel give an empty output file?

How to use awk properly is still unclear to me, but I know it will be useful for what I want.
I have two files, both are tab delimited:
transcriptome.txt (with billions of lines):
>TRINITY_DN261_c0_g1_i1 GATATTTATCCGAATATTCAATATGAT
>TRINITY_DN299_c0_g1_i1 GGACACGGGCCTCAAGCCAAGTCAAAACCACCACAAAG
>TRINITY_DN216_c0_g1_i1 GTTCAATATTCAATGACTGAAGGGCCCGCTGATTTTCCCCTATAAA
>TRINITY_DN220_c0_g1_i1 GGGAGATAATAACAATGATAACACACAAAATTCCAATG
selected_genes.txt (thousands of lines):
>TRINITY_DN261_c0_g1_i1 1
>TRINITY_DN220_c0_g1_i1 0
I want this output (first column of selected_genes.txt and second column of transcriptome.txt):
>TRINITY_DN261_c0_g1_i1 GATATTTATCCGAATATTCAATATGAT
>TRINITY_DN220_c0_g1_i1 GGGAGATAATAACAATGATAACACACAAAATTCCAATG
Usually I use the vlookup function in Excel.
I tried to obtain my result with awk, as in many threads (stackexchange1, stackexchange2, stackoverflow1, stackoverflow2, stackoverflow3, and others).
I followed the advice from these threads, but my output is either blank or just a copy of my selected_genes.txt file.
I checked: my 2 files are in UTF-8, with CRLF line endings. Also,
awk '{print $1}' transcriptome.txt
awk '{print $1}' selected_genes.txt
both print the first column of my files correctly, so the problem doesn't come from them.
Here is what I tried:
awk -F, 'FNR==NR {a[$1]=$1; next}; $1 in a {print a[$2]}' selected_genes.txt transcriptome.txt > output.txt
# Blank result
awk -F 'FNR==NR{var[$1]=$1;next;}{print var[$1]FS$2}' selected_genes.txt transcriptome.txt > output.txt
# Blank result
awk 'NR == FNR{a[$1] = $2;next}; {print $1, $1 in a?a[$1]: "NA"}' selected_genes.txt transcriptome.txt > output.txt
# Print only transcriptome.txt with first column and NAs
awk -F, 'FNR==NR{var[$1]=$1}FNR!=NR{print(var[$2]","$1)}' selected_genes.txt transcriptome.txt > output.txt
# Print only selected_genes.txt
I didn't manage to produce the wanted output.
Any advice explaining what the problem with my code is would be appreciated.
An awk classic. Hash the smaller genes file (thousands of lines) into an array a, so you don't waste memory, and then look up $1 of each line of the huge transcriptome file (billions of lines):
$ awk '
# { sub(/\r$/,"") }    # uncomment to strip Windows-style CRLF line endings
NR==FNR {              # first file: the genes list
    a[$1]              # hash $1 of the genes file into a
    next
}
($1 in a) {            # second file: print transcriptome lines whose $1 was hashed
    print
}' genes transcriptome # mind the order: genes first
>TRINITY_DN261_c0_g1_i1 GATATTTATCCGAATATTCAATATGAT
>TRINITY_DN220_c0_g1_i1 GGGAGATAATAACAATGATAACACACAAAATTCCAATG
Your code
awk -F, 'FNR==NR{a[$1]=$1; next}; $1 in a {print a[$2]}'
will not work, for two reasons: you print a[$2], which doesn't exist, and -F, is the wrong field separator for tab-delimited files (the lines contain no commas, so $1 is the whole line and the keys never match). Change it to
awk 'FNR==NR{a[$1]; next} $1 in a' selected_genes.txt transcriptome.txt
which should give you the expected output.
The second expression is shorthand for ($1 in a) {print $0}.
There's a better tool in the box than awk for this sort of file merge on a common field, especially for big files: join(1)
$ join -t $'\t' -11 -21 -o 0,2.2 \
<(sort -t $'\t' -k1,1 selected_genes.txt) \
<(sort -t $'\t' -k1,1 transcriptome.txt)
>TRINITY_DN220_c0_g1_i1 GGGAGATAATAACAATGATAACACACAAAATTCCAATG
>TRINITY_DN261_c0_g1_i1 GATATTTATCCGAATATTCAATATGAT
The only caveat is that the files to be joined have to be sorted on the join column, hence the two sort calls.
In database terms, this is an INNER JOIN of the two files: for each row of the first file, every row of the second file with a matching join column produces one row of output. The -o 0,2.2 makes each output line consist of the join column followed by the second column of the second file.
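As a side note, if you also wanted to keep selected genes that are absent from the transcriptome (a LEFT OUTER JOIN), join's -a and -e options can do that; a sketch, with NA as an arbitrary placeholder for the missing sequence:
join -t $'\t' -a 1 -e 'NA' -o 0,2.2 \
    <(sort -t $'\t' -k1,1 selected_genes.txt) \
    <(sort -t $'\t' -k1,1 transcriptome.txt)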
Another interesting option:
$ grep -F -f <(sed -e 's/[^\t]*$//' selected_genes.txt) transcriptome.txt
>TRINITY_DN261_c0_g1_i1 GATATTTATCCGAATATTCAATATGAT
>TRINITY_DN220_c0_g1_i1 GGGAGATAATAACAATGATAACACACAAAATTCCAATG
will, very efficiently, show just the lines of transcriptome.txt that contain the first column of some line of selected_genes.txt. The sed strips everything after the last tab, so each fixed-string pattern is the key plus its trailing tab, which in practice only matches at the start of a transcriptome line. This was faster than the other approaches by a large margin in my tests.

Retrieve only the matching pattern of a list in another list with grep

I have these kinds of files:
file1:
TMCS09g1008676.1
MAMBO.3.3.2.1
TMCS09g1008678.1
TMCS09g1008678.2
CSH.1.2
TMCS09g1008681.3
TMCS09g1008682.1
TMCS09g1008683
file2:
TMCS09g1008676
MAMBO.3.3.2
TMCS09g1008678
TMCS09g1008679
CSH.1
TMCS09g1008681
TMCS09g1008682
TMCS09g1008683
What I want to do is to retrieve only the matching part of file2 in file1: basically only the "red" (highlighted) part of the terminal output of grep -w -F -f file2 file1.
Is there a way to achieve this?
Thanks in advance.
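One way, assuming GNU grep: adding the -o (--only-matching) flag to the same command prints just the matched part of each line instead of the whole line:
grep -o -w -F -f file2 file1
With the sample files this should print TMCS09g1008676, MAMBO.3.3.2, TMCS09g1008678 (twice), CSH.1, and so on, one match per line.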

joining 2 files on matching column values using awk

I know there have been similar questions posted but I'm still having a bit of trouble getting the output I want using awk FNR==NR...
I have 2 files as such
File 1:
123|this|is|good
456|this|is|better
...
File 2:
aaa|123
bbb|456
...
So I want to join file2's column 2 to file1's column 1, and output file1's columns 2, 3, 4 followed by file2's column 1.
Thanks in advance.
With awk you could do something like
awk -F \| 'BEGIN { OFS = FS } NR == FNR { val[$2] = $1; next } $1 in val { $(NF + 1) = val[$1]; print }' file2 file1
NF is the number of fields in a record (line by default), so $NF is the last field, and $(NF + 1) is the field after that. By assigning the saved value from the pass over file2 to it, a new field is appended to the record before it is printed.
One thing to note: This behaves like an inner join, i.e., only records are printed whose key appears in both files. To make this a right join, you can use
awk -F \| 'BEGIN { OFS = FS } NR == FNR { val[$2] = $1; next } { $(NF + 1) = val[$1]; print }' file2 file1
That is, you can drop the $1 in val condition on the append-and-print action. If $1 is not in val, val[$1] is empty, and an empty field will be appended to the record before printing.
But it's probably better to use join:
join -1 1 -2 2 -t \| file1 file2
If you don't want the key field to be part of the output, pipe the output of either of those commands through cut -d \| -f 2- to get rid of it, i.e.
join -1 1 -2 2 -t \| file1 file2 | cut -d \| -f 2-
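With the sample data above (both files are already sorted on their join fields, as join requires), that last pipeline should print:
this|is|good|aaa
this|is|better|bbb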
If the files have the same number of lines in the same order, then
paste -d '|' file1 file2 | cut -d '|' -f 2-5
this|is|good|aaa
this|is|better|bbb
I see in a comment to Wintermute's answer that the files aren't sorted. With bash, process substitutions are handy to sort on the fly:
paste -d '|' <(sort -t '|' -k 1,1 file1) <(sort -t '|' -k 2,2 file2) |
cut -d '|' -f 2-5
To reiterate: this solution requires a one-to-one correspondence between the files.

Compare two files and make union

I tested the line below to compare the 1st columns of 2 files and make a union. However, when file2 has different values under an identical 1st column, all but one were eliminated. Below I attach the sample files, the obtained result, and the desired result.
awk -F, 'BEGIN{OFS=","}FNR==NR{a[$1]=$1","$2;next}($1 in a && $2=$2","a[$1])' file2.csv file1.csv >testout.txt
file1
John,red
John,blue
Mike,red
Mike,blue
Carl,red
Carl,blue
file2
John,V1
John,V2
Kent,V1
Kent,V2
Mike,V1
Mike,V2
obtained result
John,red,John,V2
John,blue,John,V2
Mike,red,Mike,V2
Mike,blue,Mike,V2
desired result
John,red,John,V1
John,red,John,V2
John,blue,John,V1
John,blue,John,V2
Mike,red,Mike,V1
Mike,red,Mike,V2
Mike,blue,Mike,V1
Mike,blue,Mike,V2
try this one-liner:
awk -F, -v OFS="," 'NR==FNR{a[$0];next}{for(x in a)if(x~"^"$1FS)print $0,x}' file2 file1
test:
kent$ awk -F, -v OFS="," 'NR==FNR{a[$0];next}{for(x in a)if(x~"^"$1FS)print $0,x}' f2 f1
John,red,John,V1
John,red,John,V2
John,blue,John,V1
John,blue,John,V2
Mike,red,Mike,V1
Mike,red,Mike,V2
Mike,blue,Mike,V1
Mike,blue,Mike,V2
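One caveat about this answer: for (x in a) visits array keys in an unspecified order, so the V1/V2 lines may come out swapped. GNU awk can pin the traversal order via PROCINFO["sorted_in"]; a gawk-only sketch:
gawk -F, -v OFS="," 'BEGIN{PROCINFO["sorted_in"]="@ind_str_asc"}NR==FNR{a[$0];next}{for(x in a)if(x~"^"$1 FS)print $0,x}' file2 file1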
Using join can also do that:
join -t, -1 1 -2 1 --nocheck-order -o 1.1,1.2,2.1,2.2 file1 file2
Note that --nocheck-order only silences the sorted-input warning; join's merge still assumes sorted input, which happens to be good enough for this sample data.
Output:
John,red,John,V1
John,red,John,V2
John,blue,John,V1
John,blue,John,V2
Mike,red,Mike,V1
Mike,red,Mike,V2
Mike,blue,Mike,V1
Mike,blue,Mike,V2