I have two tab delim files as shown below:
FileA.txt
1 a,b,c
2 b,c,e
3 e,d,f,a
FileB.txt
a xxx
b xyx
c zxxy
I would need the output in the below way:
Output:
1 a,b,c xxx,xyx,zxxy
2 b,c,e xyx,zxxy,e
3 e,d,f,a e,d,f,xxx
The comma-separated values in $2 of FileA are to be used as keys to search for a match in $1 of FileB, and a new column should be added to the output with the corresponding values from $2 of FileB. In case of no match it should print the original value. Any help on how to do this?
awk to the rescue!
$ awk 'NR==FNR {a[$1]=$2; next}            # first pass (fileB): build the lookup table
       {NF++; s=""; n=split($2,t,",");     # second pass (fileA): add a field, split $2 on commas
        for(i=1;i<=n;i++) {k=t[i];         # for each key, append the mapped value if it exists,
          $NF=$NF s ((k in a)?a[k]:k);     # otherwise append the key itself
          s=","}}1' fileB fileA | column -t
1 a,b,c xxx,xyx,zxxy
2 b,c,e xyx,zxxy,e
3 e,d,f,a e,d,f,xxx
Merging 2 files using AWK is a well-covered topic on Stack Overflow. However, the technique gets more complicated when reading 3 files into arrays. As I'm formatting the output to go into an R script, I'll need to add a lot of extra syntax, so I don't think I can use join. Here is a simplistic version I have working so far:
awk 'FNR==1{f++}
f==1{a[FNR]=$1;next}
f==2{b[FNR]=$1;next}
{print a[FNR], "<- c(", b[FNR], ",", $1, ")"}' words.txt x.txt y.txt
Where:
$ cat words.txt
word1
word2
word3
$ cat x.txt
1
2
3
$ cat y.txt
11
22
33
The output is then
word1 <- c(1, 11)
word2 <- c(2, 22)
word3 <- c(3, 33)
The best way I can summarize this technique is:
Create a variable f to keep track of which file you're processing
For file 1 read the values into array a
For file 2 read the values into array b
Fall through to file three, where you concatenate your final output
As an AWK beginner I have this working, but I find it a bit clunky, and I worry that if I come back to the code in 6 months I'll no longer understand it. Is this the best way to merge these 3 files in AWK? Could join actually handle this level of formatting in the final output?
A variation of @RavinderSingh13's solution:
$ paste {words,x,y}.txt | awk '{print $1, "<- c(" $2 ", " $3 ")"}'
EDIT: Could you please try the following.
paste words.txt x.txt y.txt | awk '{$2="<- c("$2", "$3")";$3="";sub(/ +$/,"")} 1'
Output will be as follows.
word1 <- c(1, 11)
word2 <- c(2, 22)
word3 <- c(3, 33)
In case you simply want to combine the 3 files' contents column-wise, then try the following.
paste words.txt x.txt y.txt
word1 1 11
word2 2 22
word3 3 33
If it's for readability, you can change the file-checking method as well as the variable names.
Try these please:
awk 'ARGIND==1{words[FNR]=$1;}
ARGIND==2{xcol[FNR]=$1;}
ARGIND==3{print words[FNR], "<- c(", xcol[FNR], ",", $1, ")"}' words.txt x.txt y.txt
The above file-checking method (ARGIND) is specific to GNU awk.
Changing to another method, and also changing the file reading order, would be:
awk 'FILENAME=="words.txt"{print $1, "<- c(", xcol[FNR], ",", ycol[FNR], ")";}
FILENAME=="x.txt"{xcol[FNR]=$1;}
FILENAME=="y.txt"{ycol[FNR]=$1;}' x.txt y.txt words.txt
As you can also see here, file reading order and block order can be different.
Since words.txt holds the first column (the main column, so to speak), it's sensible to read it last.
You can also use FILENAME==ARGV[1], FILENAME==ARGV[2], etc. to check the files, and put comments inside (putting the awk code into a script file and loading it with awk -f scriptfile works better once you have comments; see the sketch after the code below):
awk 'FILENAME==ARGV[1]{xcol[FNR]=$1;} #Read column B, x column
FILENAME==ARGV[2]{ycol[FNR]=$1;} # Read column C, y column
FILENAME==ARGV[3]{print $1, "<- c(", xcol[FNR], ",", ycol[FNR], ")";}' x.txt y.txt words.txt
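As mentioned above, once comments are involved it can be cleaner to keep the program in its own file and load it with awk -f. A minimal sketch, assuming a hypothetical script name merge.awk and the same three input files:
# merge.awk - merge x.txt, y.txt and words.txt into R assignments
FILENAME=="x.txt"     { xcol[FNR]=$1; next }   # column B, x values
FILENAME=="y.txt"     { ycol[FNR]=$1; next }   # column C, y values
FILENAME=="words.txt" { print $1, "<- c(", xcol[FNR], ",", ycol[FNR], ")" }
Run it as: awk -f merge.awk x.txt y.txt words.txt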
I have two different scripts to merge files by one matching column.
file1.tsv - 4 columns separated by tab
1 LAK c.66H>T p.Ros49Kos
2 OLD c.11A+1>R p.Ill1639Los
3 SRP c.96V-T>X p.Zub%D23
4 HRP c.1S>T p.Lou33aa
file2.tsv - 14 columns, separated by tab
LAK "empty_column" c.66H>T ......
SRP "empty_column" c.96-T>X ......
Output.tsv - all columns from file2.tsv and, appended behind them, the 1st column of file1.tsv if there is a match.
LAK "empty_column" c.66H>T ......1
SRP "empty_column" c.96-T>X ......3
I am using these two scripts, but they don't work:
awk -v FILE_A="file1.tsv" -v OFS="\t" '
    BEGIN { while ( ( getline < FILE_A ) > 0 ) { VAL = $0 ; sub( /^[^ ]+ /, "", VAL ) ; DICT[ $3 ] = VAL } }
    { print $0, DICT[ $3 ] }' file2.tsv
or
awk 'NR==FNR{h[$3] = $1; next} {print h[$3]}' file1.tsv file2.tsv
Thanks for the help.
You might want to use the join command to join column 2 of the first file with column 1 of the second:
join --nocheck-order -1 2 -2 1 file1.tsv file2.tsv
A few notes:
This is only the first step; after this, you still have the task of cutting out unwanted columns or rearranging them. I suggest looking into the cut command, or using awk this time.
The join command expects the text in both files to be in the same order (alphabetical or otherwise).
Alternatively, import them into a temporary sqlite3 database and perform a join there.
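If you go the sqlite3 route, a minimal sketch might look like the following. Note the table and column names here are made up, and file2.tsv is assumed to be cut down to three columns just to keep the example short; the join is on the gene-name column, matching the join command above:
sqlite3 :memory: <<'SQL'
CREATE TABLE f1(num TEXT, gene TEXT, cvar TEXT, pvar TEXT);   -- layout of file1.tsv
CREATE TABLE f2(gene TEXT, note TEXT, cvar TEXT);             -- file2.tsv, shortened to 3 columns
.mode tabs
.import file1.tsv f1
.import file2.tsv f2
SELECT f2.*, f1.num FROM f2 LEFT JOIN f1 ON f1.gene = f2.gene;
SQL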
In the data given below (which is tab separated):
# data
1 xyz alfa x=abc;z=cbe;d=fed xt
2 xyz alfa y=cde;z=xy ft
3 xyb delta xy=def zf
I want to add a suffix _LT to the elements (the values of the variables) of the 4th column, after splitting at ;.
Output like:
1 xyz alfa x=abc_LT;z=cbe_LT;d=fed_LT xt
2 xyz alfa y=cde_LT;z=xy_LT ft
I am able to add a suffix to specific columns, but I can't do the split (at the delimiter), add the suffix, and merge back.
awk -v PRE='_LT' '{$4=$4PRE; print}' OFS="\t" data.txt > data_LT.txt
You can use the split function, loop, and merge... or use substitutions:
$ awk -v PRE='_LT' '{gsub(/;/,PRE";",$4); sub(/$/,PRE,$4); print}' OFS='\t' data.txt
1 xyz alfa x=abc_LT;z=cbe_LT;d=fed_LT xt
2 xyz alfa y=cde_LT;z=xy_LT ft
3 xyb delta xy=def_LT zf
gsub(/;/,PRE";",$4) replaces every ; with _LT; in the 4th column only
sub(/$/,PRE,$4) appends _LT to the end of the 4th column
Another thought is to use split() in awk,
awk -v PRE='_LT' '{
    n=split($4,a,/;/)                  # split the 4th field on ";"
    b=""
    for(i=1;i<=n;i++){                 # walk the pieces in order, appending the suffix to each
        b=b a[i] PRE
        if(i<n) b=b ";"
    }
    $4=b; print $0
}' OFS='\t' data.txt
n=split($4,a,/;/) splits $4 on the separator ';', and the loop rebuilds the field with the suffix appended to each piece, which is then printed.
Using Perl
$ cat everestial
1 xyz alfa x=abc;z=cbe;d=fed xt
2 xyz alfa y=cde;z=xy ft
3 xyb delta xy=def zf
$ perl -F'/\s+/' -lane ' $F[3]=~s/(.+?)=(.+?)\b/${1}=${2}_LT/g; print join("\t",@F) ' everestial
1 xyz alfa x=abc_LT;z=cbe_LT;d=fed_LT xt
2 xyz alfa y=cde_LT;z=xy_LT ft
3 xyb delta xy=def_LT zf
Newbie here... I'm confused about how to merge particular columns from multiple lines and print them in one row. For example, I have this kind of data in a .csv file (separated by commas):
ID1,X1,X2,X3,X4,X5,X6,T,C
ID2,X1,X2,X3,X4,X5,X6,G,A
ID3,X1,X2,X3,X4,X5,X6,C,G
ID4,X1,X2,X3,X4,X5,X6,A,A
I plan to select only the 8th and 9th columns from each row and print them all in one row, separated by whitespace, so that the result will be like this:
T C G A C G A A
To do that, I tried this AWK code:
awk -F "," '{printf "%s ",$8, "%s ",$9}' FILE > outputfile
But the result it gave merged all of column 8 first, then all of column 9:
T G C A C A G A
Any suggestions are very welcome.
Thank you very much for your kind help.
like this?
kent$ awk -F, '{t=$8 OFS $9;s=s?s OFS t:t}END{print s}' file
T C G A C G A A
Try this awk:
awk -F "," '{printf "%s %s ", $8,$9}' yourfile
The join command prints the common lines of 2 files. But is there any way to print the lines that didn't match?
file 1
a 1
b 2
c 3
file2
a 3
b 3
output
c 3
Using join command:
join -a1 -v1 file1 file2
-a1 prints the non-matching (unpairable) lines of the first file; -v1 suppresses the normal joined output, so only the non-matching lines remain.
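Note that join expects both inputs to be sorted on the join field. If they aren't, a sketch like this (assuming a bash-style shell for the process substitution) sorts them on the fly:
join -a1 -v1 <(sort file1) <(sort file2)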
To join on the first field, here's one way using awk:
awk 'FNR==NR { a[$1]; next } !($1 in a)' file2 file1
Results:
c 3
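For the reverse question (lines of file2 whose first field does not appear in file1), the same idea works with the file arguments swapped; with the sample data above this happens to print nothing, since both a and b occur in file 1:
awk 'FNR==NR { a[$1]; next } !($1 in a)' file1 file2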