How to print unjoined lines using the join command or awk? - join

The join command prints the lines that two files have in common. Is there any way to print the lines that didn't match?
file1
a 1
b 2
c 3
file2
a 3
b 3
output
c 3

Using join command:
join -a1 -v1 file1 file2
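-a1 = also print the unpairable (non-matching) lines of the first file. -v1 = print only those unpairable lines and suppress the normal joined output, so -v1 alone would be enough here. Both files must be sorted on the join field.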
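A minimal end-to-end sketch of the join approach, assuming the inputs may not be sorted yet. The temporary directory and the .sorted file names are made up for illustration, and only -v1 is passed.

```shell
# Scratch copies of the sample data from the question (paths are illustrative only).
tmp=$(mktemp -d)
printf '%s\n' 'c 3' 'a 1' 'b 2' > "$tmp/file1"   # deliberately out of order
printf '%s\n' 'b 3' 'a 3'       > "$tmp/file2"

# join needs both inputs sorted on the join field (field 1 by default).
sort -k1,1 "$tmp/file1" > "$tmp/file1.sorted"
sort -k1,1 "$tmp/file2" > "$tmp/file2.sorted"

# -v1: print only the lines of the first file that have no partner in the second.
result=$(join -v1 "$tmp/file1.sorted" "$tmp/file2.sorted")
printf '%s\n' "$result"

rm -r "$tmp"
```

With the sample data this should print the single line c 3.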

To join on the first field, here's one way using awk:
awk 'FNR==NR { a[$1]; next } !($1 in a)' file2 file1
Results:
c 3
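The one-liner relies on the common two-file awk idiom, which can be hard to read at first. Below is the same logic spread out with comments, as a sketch on scratch copies of the sample data (the temporary directory and the array name seen are illustrative choices, not part of the original answer). Unlike join, this does not need sorted input.

```shell
tmp=$(mktemp -d)
printf '%s\n' 'a 1' 'b 2' 'c 3' > "$tmp/file1"
printf '%s\n' 'a 3' 'b 3'       > "$tmp/file2"

result=$(awk '
    FNR == NR { seen[$1]; next }   # first file on the command line (file2): remember its keys
    !($1 in seen)                  # second file (file1): print lines whose key was never seen
' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$result"

rm -r "$tmp"
```

FNR == NR is only true while awk is reading the first file named on the command line, which is why file2 is listed before file1.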

Related

Extracting lines from a fixed format without spaces file based on a column and list of inquiring IDs

I have a quite large fixed format file without spaces (file1):
file1:
0808563800555550000367120000500000
0005555566369330000078020000500000
01066666780000000008933600009000005635
0904251263088000000786590056500000
0000469011009904440425120444444440
I want to extract character positions 4-8, 11-15 and 20-24 from each line, but only from lines whose characters 4-8 appear in a list of IDs in file2
file2:
55555
42512
The desired outputs are:
55555 36933 07802
42512 08800 78659
I have tried the following combination of cut | grep commands:
cut -c 4-8,11-15,20-24 file1 --output-delimiter=' ' | grep -w -F -f file2
It works and is very fast, but it also returns lines where the lookup ID is not in the first column of the cut output (characters 4-8), because grep checks all three columns produced by cut, not only the first one.
Here are the outputs of the command above:
85638 55555 36712
55555 36933 07802
66666 00000 89336
42512 08800 78659
04690 00990 42512
I know I could write the output to a file and then use, for example, awk, but I thought there might be a simpler approach that avoids the longer processing time (for example, making grep match only in a specific column of the cut output).
Any help would be much appreciated. Many thanks!
With GNU awk for FIELDWIDTHS:
$ awk -v FIELDWIDTHS='3 5 2 5 4 5 *' 'NR==FNR{a[$0]; next} $2 in a{ print $2, $4, $6 }' file2 file1
55555 36933 07802
42512 08800 78659
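FIELDWIDTHS is a GNU awk extension. For other awk implementations, a sketch of the same idea using substr(), which is available in every POSIX awk, might look like this. The temporary directory and the array name ids are illustrative assumptions.

```shell
tmp=$(mktemp -d)
printf '%s\n' \
    '0808563800555550000367120000500000' \
    '0005555566369330000078020000500000' \
    '01066666780000000008933600009000005635' \
    '0904251263088000000786590056500000' \
    '0000469011009904440425120444444440' > "$tmp/file1"
printf '%s\n' '55555' '42512' > "$tmp/file2"

result=$(awk '
    NR == FNR { ids[$0]; next }          # file2: load the list of IDs
    { id = substr($0, 4, 5) }            # file1: characters 4-8 are the lookup ID
    id in ids { print id, substr($0, 11, 5), substr($0, 20, 5) }
' "$tmp/file2" "$tmp/file1")
printf '%s\n' "$result"

rm -r "$tmp"
```

Only characters 4-8 are tested against the ID list, so lines where an ID happens to appear in one of the other two slices are not printed.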
Would you please try the following:
cut -c 4-8,11-15,20-24 file1 --output-delimiter=' ' | grep -wf <(sed 's/^/^/' file2)
The sed command prepends a caret (^) to each line of file2, so that each ID is anchored to the start of a line in the output of cut.
It may be a bit slower than before because the -F option can no longer be used.

Merge two files by one column - awk

I have two different scripts to merge files by one matching column.
file1.tsv - 4 columns separated by tab
1 LAK c.66H>T p.Ros49Kos
2 OLD c.11A+1>R p.Ill1639Los
3 SRP c.96V-T>X p.Zub%D23
4 HRP c.1S>T p.Lou33aa
file2.tsv - 14 columns, separated by tab
LAK "empty_column" c.66H>T ......
SRP "empty_column" c.96-T>X ......
Output.tsv - all columns from file2.tsv, followed by the 1st column of file1.tsv when there is a match.
LAK "empty_column" c.66H>T ......1
SRP "empty_column" c.96-T>X ......3
I am using these two scripts, but they don't work:
awk -v FILE_A="file1.tsv" -v OFS="\t" 'BEGIN { while ( ( getline < FILE_A ) > 0 ) { VAL = $0 ; sub( /^[^ ]+ /, "", VAL ) ; DICT[ $3 ] = VAL } } { print $0, DICT[ $3 ] }' file2.tsv
or
awk 'NR==FNR{h[$3] = $1; next} {print h[$3]}' file1.tsv file2.tsv
Thanks for help.
You might want to use the join command to join column 2 of the first file with column 1 of the second:
join --nocheck-order -1 2 -2 1 file1.tsv file2.tsv
A few notes:
This is only the first step; afterwards you still need to cut out or rearrange the unwanted columns. I suggest looking into the cut command, or using awk for that.
The join command expects both files to be sorted on the join field. --nocheck-order only disables the check; it does not make unsorted input join correctly.
Alternatively, import both files into a temporary sqlite3 database and perform the join there.
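If sorting is inconvenient, the lookup can also be sketched in awk alone. This assumes, as the join answer does, that column 2 of file1.tsv is matched against column 1 of file2.tsv. The sample file2.tsv below is a shortened 3-column stand-in, since the question elides most of its 14 columns, and the array name id is an illustrative choice.

```shell
tmp=$(mktemp -d)
# Stand-in data: file2.tsv is shortened to 3 columns here; the real file has 14.
printf '%s\t%s\t%s\t%s\n' \
    1 LAK 'c.66H>T'   'p.Ros49Kos' \
    2 OLD 'c.11A+1>R' 'p.Ill1639Los' \
    3 SRP 'c.96V-T>X' 'p.Zub%D23' \
    4 HRP 'c.1S>T'    'p.Lou33aa' > "$tmp/file1.tsv"
printf '%s\t%s\t%s\n' \
    LAK empty_column 'c.66H>T' \
    SRP empty_column 'c.96-T>X' > "$tmp/file2.tsv"

result=$(awk -F'\t' -v OFS='\t' '
    NR == FNR { id[$2] = $1; next }              # file1.tsv: column 2 -> column 1
    { print $0, (($1 in id) ? id[$1] : "") }     # file2.tsv: append the match, if any
' "$tmp/file1.tsv" "$tmp/file2.tsv")
printf '%s\n' "$result"

rm -r "$tmp"
```

Rows of file2.tsv without a match are still printed, with an empty last column; whether that is wanted depends on how "if match" in the question is read.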

How to use awk, to split a particular column (using delimiter), then add suffix and then merge?

In the data given below (which is tab separated):
# data
1 xyz alfa x=abc;z=cbe;d=fed xt
2 xyz alfa y=cde;z=xy ft
3 xyb delta xy=def zf
I want to add the suffix _LT to each element (the value of each variable) in the 4th column, after splitting it at ;.
Output like:
1 xyz alfa x=abc_LT;z=cbe_LT;d=fed_LT xt
2 xyz alfa y=cde_LT;z=xy_LT ft
I am able to add a suffix to a specific column, but I can't work out how to split (at the delimiter), add the suffix and merge again.
awk -v PRE='_LT' '{$4=$4PRE; print}' OFS="\t" data.txt > data_LT.txt
You can use the split function, loop and merge, or use substitutions:
$ awk -v PRE='_LT' '{gsub(/;/,PRE";",$4); sub(/$/,PRE,$4); print}' OFS='\t' data.txt
1 xyz alfa x=abc_LT;z=cbe_LT;d=fed_LT xt
2 xyz alfa y=cde_LT;z=xy_LT ft
3 xyb delta xy=def_LT zf
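gsub(/;/,PRE";",$4) replaces every ; in the 4th column with _LT; which puts the suffix in front of each separator
sub(/$/,PRE,$4) appends _LT to the end of the 4th column, covering the last element, which has no trailing ;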
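A possible single-substitution variant, offered as a sketch rather than as part of the answer above: in the replacement string of gsub, & stands for the matched text, so matching each run of non-semicolon characters and appending the suffix covers every element in one call. The temporary directory is illustrative.

```shell
tmp=$(mktemp -d)
printf '%s\t%s\t%s\t%s\t%s\n' \
    1 xyz alfa  'x=abc;z=cbe;d=fed' xt \
    2 xyz alfa  'y=cde;z=xy'        ft \
    3 xyb delta 'xy=def'            zf > "$tmp/data.txt"

# [^;]+ matches each element between semicolons; & re-inserts the matched text.
result=$(awk -F'\t' -v OFS='\t' -v PRE='_LT' '
    { gsub(/[^;]+/, "&" PRE, $4); print }
' "$tmp/data.txt")
printf '%s\n' "$result"

rm -r "$tmp"
```

Assigning to $4 makes awk rebuild the line with OFS, so the output stays tab separated.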
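Another option is to use split() in awk:
awk -v PRE='_LT' '{
n=split($4,a,/;/);
b="";
for(i=1;i<=n;i++){ b=b a[i] PRE; if(i!=n){b=b";"} }
{$4=b; print $0}
}' OFS='\t' data.txt
n=split($4,a,/;/) splits $4 at each ; into the array a and returns the number of elements. The loop appends _LT to each element and rejoins the elements with ;. It uses a numeric index because for (i in a) does not guarantee the order of iteration.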
Using Perl
$ cat everestial
1 xyz alfa x=abc;z=cbe;d=fed xt
2 xyz alfa y=cde;z=xy ft
3 xyb delta xy=def zf
$ perl -F'/\s+/' -lane ' $F[3]=~s/(.+?)=(.+?)\b/${1}=${2}_LT/g; print join("\t",@F) ' everestial
1 xyz alfa x=abc_LT;z=cbe_LT;d=fed_LT xt
2 xyz alfa y=cde_LT;z=xy_LT ft
3 xyb delta xy=def_LT zf

How to add text from two files in awk

I have two tab delim files as shown below:
FileA.txt
1 a,b,c
2 b,c,e
3 e,d,f,a
FileB.txt
a xxx
b xyx
c zxxy
I need the output in the following form:
Output:
1 a,b,c xxx,xyx,zxxy
2 b,c,e xyx,zxxy,e
3 e,d,f,a e,d,f,xxx
The comma-separated values in $2 of FileA are to be used as keys to search for a match in $1 of FileB, and a new column is added to the output with the corresponding values from $2 of FileB. In case of no match, the original value should be printed. Any help on how to do this?
awk to the rescue!
$ awk 'NR==FNR {a[$1]=$2; next}
{NF++; s=""; n=split($2,t,",");
for(i=1;i<=n;i++) {k=t[i];
$NF=$NF s ((k in a)?a[k]:k);
s=","}}1' fileB fileA | column -t
1 a,b,c xxx,xyx,zxxy
2 b,c,e xyx,zxxy,e
3 e,d,f,a e,d,f,xxx
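The same lookup can be spelled out step by step. This is a sketch with made-up variable names (map, keys, out) that builds the new column in a separate string instead of growing $NF, and keeps the output tab separated rather than piping it through column -t.

```shell
tmp=$(mktemp -d)
printf '%s\t%s\n' 1 'a,b,c' 2 'b,c,e' 3 'e,d,f,a' > "$tmp/FileA.txt"
printf '%s\t%s\n' a xxx b xyx c zxxy             > "$tmp/FileB.txt"

result=$(awk -F'\t' -v OFS='\t' '
    NR == FNR { map[$1] = $2; next }     # FileB.txt: key -> replacement value
    {
        n = split($2, keys, ",")         # FileA.txt: split the comma list in column 2
        out = ""
        for (i = 1; i <= n; i++) {
            k = keys[i]
            # keep the original key when FileB has no entry for it
            out = out (i > 1 ? "," : "") ((k in map) ? map[k] : k)
        }
        print $1, $2, out
    }
' "$tmp/FileB.txt" "$tmp/FileA.txt")
printf '%s\n' "$result"

rm -r "$tmp"
```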

AWK - Merge multiple lines in two particular columns into one line?

Newbie here. I'm confused about how to merge the values of particular columns across multiple lines and print them in one row. For example, I have this kind of data in a .csv file (comma-separated):
ID1,X1,X2,X3,X4,X5,X6,T,C
ID2,X1,X2,X3,X4,X5,X6,G,A
ID3,X1,X2,X3,X4,X5,X6,C,G
ID4,X1,X2,X3,X4,X5,X6,A,A
I want to select only the 8th and 9th columns of each row and print them all in one row, separated by whitespace, so that the result looks like this:
T C G A C G A A
To do that, I tried this awk code:
awk -F "," '{printf "%s ",$8, "%s ",$9}' FILE > outputfile
But the result lists all the values from the 8th column, followed by all the values from the 9th column:
T G C A C A G A
Any suggestions are very welcome.
Thank you very much for your kind help.
Like this?
kent$ awk -F, '{t=$8 OFS $9;s=s?s OFS t:t}END{print s}' file
T C G A C G A A
Try this awk:
awk -F "," '{printf "%s %s ", $8,$9}' yourfile
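The printf version leaves a trailing space and no final newline. If that matters, one possible alternative without awk is sketched below, using cut, tr and paste on a scratch copy of the sample data (the temporary path is illustrative).

```shell
tmp=$(mktemp -d)
printf '%s\n' \
    'ID1,X1,X2,X3,X4,X5,X6,T,C' \
    'ID2,X1,X2,X3,X4,X5,X6,G,A' \
    'ID3,X1,X2,X3,X4,X5,X6,C,G' \
    'ID4,X1,X2,X3,X4,X5,X6,A,A' > "$tmp/FILE"

# cut keeps columns 8 and 9, tr puts every value on its own line,
# and paste -s joins all lines into one, separated by single spaces.
result=$(cut -d, -f8,9 "$tmp/FILE" | tr ',' '\n' | paste -s -d ' ' -)
printf '%s\n' "$result"

rm -r "$tmp"
```

With the sample data this should print T C G A C G A A on a single line.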
