Merging >2 files with AWK or JOIN? - join

Merging 2 files using AWK is a well covered topic on StackOverflow. However, the technique of reading 3 files into an array gets more complicated. As I'm formatting the output to go into an R script, I'm going to need to add lots of syntax so I don't think I can use JOIN. Here is a simplistic version I have working so far:
awk 'FNR==1{f++}
f==1{a[FNR]=$1;next}
f==2{b[FNR]=$1;next}
{print a[FNR], "<- c(", b[FNR], ",", $1, ")"}' words.txt x.txt y.txt
Where:
$ cat words.txt
word1
word2
word3
$ cat x.txt
1
2
3
$ cat y.txt
11
22
33
The output is then
word1 <- c(1, 11)
word2 <- c(2, 22)
word3 <- c(3, 33)
The best way I can summarize this technique is
Create a variable f to keep track of which file you're processing
For file 1 read the values into array a
For file 2 read the values into array b
Fall through to file three, where you concatenate your final output
As a beginner to AWK, this works, but I find it a bit awkward and I worry coming back to the code in 6 months, I'll no longer understand it. Is this the best way to merge these 3 files in AWK? Could JOIN actually handle this level of formatting the final output?

A variation of @RavinderSingh13's solution:
$ paste {words,x,y}.txt | awk '{print $1, "<- c(" $2 ", " $3 ")"}'

EDIT: Could you please try following.
paste words.txt x.txt y.txt | awk '{$2="<- c("$2", "$3")";$3="";sub(/ +$/,"")} 1'
Output will be as follows.
word1 <- c(1, 11)
word2 <- c(2, 22)
word3 <- c(3, 33)
In case you simply want to place the contents of the 3 files side by side, column-wise, then try the following.
paste words.txt x.txt y.txt
word1 1 11
word2 2 22
word3 3 33

If it's for readability, you can change the file checking method, as well as the variable names.
Try these please:
awk 'ARGIND==1{words[FNR]=$1;}
ARGIND==2{xcol[FNR]=$1;}
ARGIND==3{print words[FNR], "<- c(", xcol[FNR], ",", $1, ")"}' words.txt x.txt y.txt
The file check above (ARGIND) is specific to GNU awk.
A portable alternative, which also changes the file reading order, would be:
awk 'FILENAME=="words.txt"{print $1, "<- c(", xcol[FNR], ",", ycol[FNR], ")";}
FILENAME=="x.txt"{xcol[FNR]=$1;}
FILENAME=="y.txt"{ycol[FNR]=$1;}' x.txt y.txt words.txt
As you can see here, the file reading order and the block order can differ.
Since words.txt holds the first (main) column, it is sensible to read it last.
You can also use FILENAME==ARGV[1], FILENAME==ARGV[2], etc. to check files, and put comments inside (for commented code, an awk script file loaded with awk -f scriptfile is better):
awk 'FILENAME==ARGV[1]{xcol[FNR]=$1;} #Read column B, x column
FILENAME==ARGV[2]{ycol[FNR]=$1;} # Read column C, y column
FILENAME==ARGV[3]{print $1, "<- c(", xcol[FNR], ",", ycol[FNR], ")";}' x.txt y.txt words.txt
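If more files get added later, one array per file becomes tedious. A possible generalization, as a minimal sketch only: keep every value in a single array keyed by (file number, line number) and build the output in END. The names cell, nfiles and nrows, and the temporary directory, are made up for illustration. It assumes every file has the same number of lines and none is empty (an empty file never triggers FNR==1).

```shell
# Recreate the sample files from the question in a scratch directory.
tmp=$(mktemp -d)
printf 'word1\nword2\nword3\n' > "$tmp/words.txt"
printf '1\n2\n3\n'             > "$tmp/x.txt"
printf '11\n22\n33\n'          > "$tmp/y.txt"

result=$(awk '
FNR == 1 { nfiles++ }                      # a new input file has started
{
    cell[nfiles, FNR] = $1                 # value of file nfiles, line FNR
    if (FNR > nrows) nrows = FNR
}
END {
    for (r = 1; r <= nrows; r++) {
        line = cell[1, r] " <- c("         # first file supplies the name
        for (f = 2; f <= nfiles; f++) {
            sep = (f > 2) ? ", " : ""
            line = line sep cell[f, r]
        }
        print line ")"
    }
}' "$tmp/words.txt" "$tmp/x.txt" "$tmp/y.txt")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Adding a fourth file then only means appending it to the command line.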

Related

Extracting lines from a fixed format without spaces file based on a column and list of inquiring IDs

I have a quite large fixed format file without spaces (file1):
file1:
0808563800555550000367120000500000
0005555566369330000078020000500000
01066666780000000008933600009000005635
0904251263088000000786590056500000
0000469011009904440425120444444440
I want to extract lines with fields 4-8,11-15 and 20-24 when fields 4-8 (only) are in a list of IDs in file2
file2:
55555
42512
The desired outputs are:
55555 36933 07802
42512 08800 78659
I have tried the following combination of cut | grep commands:
cut -c 4-8,11-15,20-24 file1 --output-delimiter=' ' | grep -w -F -f file2
It works fine and the speed is very good, but the problem is that I am getting lines where the lookup ID (fields 4-8) is not in the first column of the cut data, because grep checks all three columns after cut, not only the first one.
Here are the outputs of the command above:
85638 55555 36712
55555 36933 07802
66666 00000 89336
42512 08800 78659
04690 00990 42512
I know one could write the output to a file and then use, for example, awk, but I thought there could be a much simpler approach that avoids longer processing time (for example, making grep pick only matches in a specific cut column).
Any help will be very appreciated and many thanks!
With GNU awk for FIELDWIDTHS:
$ awk -v FIELDWIDTHS='3 5 2 5 4 5 *' 'NR==FNR{a[$0]; next} $2 in a{ print $2, $4, $6 }' file2 file1
55555 36933 07802
42512 08800 78659
Would you please try the following:
cut -c 4-8,11-15,20-24 file1 --output-delimiter=' ' | grep -wf <(sed 's/^/^/' file2)
Each line in file2 is prepended by a caret ^ character to anchor to
the start of the line of the output by cut.
It may be a bit slower than before due to the lack of -F option.
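If GNU awk is not available, the same lookup can probably be done in POSIX awk with substr() instead of FIELDWIDTHS. This is a sketch only: the array name ids, the variable id and the scratch directory are illustrative, and the character offsets are the ones given in the question.

```shell
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
0808563800555550000367120000500000
0005555566369330000078020000500000
01066666780000000008933600009000005635
0904251263088000000786590056500000
0000469011009904440425120444444440
EOF
printf '55555\n42512\n' > "$tmp/file2"

result=$(awk '
NR == FNR { ids[$0]; next }          # first file: remember every lookup ID
{ id = substr($0, 4, 5) }            # characters 4-8 of the fixed-width line
id in ids {                          # only the ID column is tested
    print id, substr($0, 11, 5), substr($0, 20, 5)
}' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

Because the test is a hash lookup on the ID column alone, matches in the other two columns are ignored.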

Parsing simple string with awk or sed in linux

original string :
A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/
Depth of directories will vary, but /trunk part will always remain the same.
And a single character in front of /trunk is the indicator of that line.
desired output :
A /trunk/apple
B /trunk/apple
Z /trunk/orange
Q /trunk/melon/juice/venti/straw
*** edit
I'm sorry I made a mistake by adding a slash at the end of each path in the original string which made the output confusing. Original string didn't have the slash in front of the capital letter, but I'll leave it be.
my attempt :
echo $str1 | sed 's/\(.\/trunk\)/\n\1/g'
I feel like it should work but it doesn't.
With GNU awk for multi-char RS and RT:
$ awk -v RS='([^/]+/){2}[^/\n]+' 'RT{sub("/",OFS,RT); print RT}' file
A trunk/apple
B trunk/apple
Z trunk/orange
I'm setting RS to a regexp describing each string you want to match, i.e. 2 repetitions of non-/s followed by / and then a final string of non-/s (and non-newline for the last string on the input line). RT is automatically set to each of the matching strings, so then I just change the first / to a blank and print the result.
If each path isn't always 3 levels deep but does always start with something/trunk/, e.g.:
$ cat file
A/trunk/apple/banana/B/trunk/apple/Z/trunk/orange
then:
$ awk -v RS='[^/]+/trunk/' 'RT{if (NR>1) print pfx $0; pfx=gensub("/"," ",1,RT)} END{printf "%s%s", pfx, $0}' file
A trunk/apple/banana/
B trunk/apple/
Z trunk/orange
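For awks without multi-character RS and RT, roughly the same split can be sketched in POSIX awk by walking the /-separated fields. This is an assumption-laden sketch, not a drop-in replacement: it treats any component immediately followed by trunk as the indicator, keeps the leading slash as in the desired output of the question, and the variable names out and sep are made up.

```shell
str='A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/'

result=$(printf '%s\n' "$str" | awk -F/ '{
    out = ""
    for (i = 1; i <= NF; i++) {
        if ($i == "") continue             # empty field left by a trailing slash
        if ($(i + 1) == "trunk") {         # this component is an indicator
            sep = (out == "") ? "" : "\n"  # start a new output line, except for the first
            out = out sep $i " "
        } else {
            out = out "/" $i               # ordinary path component
        }
    }
    print out
}')

printf '%s\n' "$result"
```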
To handle more complex input, where any number of path components can follow trunk on a single line, please try the following.
awk '
{
gsub(/[^/]*\/trunk/,OFS"&")
sub(/^ /,"")
sub(/\//,OFS"&")
gsub(/ +[^/]*\/trunk\/[^[:space:]]+/,"\n&")
sub(/\n/,OFS)
gsub(/\n /,ORS)
gsub(/\/trunk/,OFS"&")
sub(/[[:space:]]+/,OFS)
}
1
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
{
gsub(/[^/]*\/trunk/,OFS"&") ##Globally substituting everything from / to till next / followed by trunk/ with space and matched value.
sub(/^ /,"") ##Substituting starting space with NULL here.
sub(/\//,OFS"&") ##Substituting first / with space / here.
gsub(/ +[^/]*\/trunk\/[^[:space:]]+/,"\n&") ##Globally substituting spaces followed by everything till / trunk till space comes with new line and matched values.
sub(/\n/,OFS) ##Substituting new line with space.
gsub(/\n /,ORS) ##Globally substituting new line space with ORS.
gsub(/\/trunk/,OFS"&") ##Globally substituting /trunk with OFS and matched value.
sub(/[[:space:]]+/,OFS) ##Substituting spaces with OFS here.
}
1 ##Printing edited/non-edited line here.
' Input_file ##Mentioning Input_file name here.
With your shown samples, please try following awk code.
awk '{gsub(/\/trunk/,OFS "&");gsub(/trunk\/[^/]*\//,"&\n")} 1' Input_file
In awk you can try this solution. It deals with the special requirement of removing forward slashes when the next character is upper case. Will not win a design award but works.
$ echo "A/trunk/apple/B/trunk/apple/Z/trunk/orange" |
awk -F '' '{ x=""; for(i=1;i<=NF;i++){
if($(i+1)~/[A-Z]/&&$i=="/"){$i=""};
if($i~/[A-Z]/){ printf x""$i" "}
else{ x="\n"; printf $i } }; print "" }'
A /trunk/apple
B /trunk/apple
Z /trunk/orange
Also works for n words. Actually works with anything that follows the given pattern.
$ echo "A/fruits/apple/mango/B/anything/apple/pear/banana/Z/ball/orange/anything" |
awk -F '' '{ x=""; for(i=1;i<=NF;i++){
if($(i+1)~/[A-Z]/&&$i=="/"){$i=""};
if($i~/[A-Z]/){ printf x""$i" "}
else{ x="\n"; printf $i } }; print "" }'
A /fruits/apple/mango
B /anything/apple/pear/banana
Z /ball/orange/anything
This might work for you (GNU sed):
sed 's/[^/]*/& /;s/\//\n/3;P;D' file
Separate the first word from the first / by a space.
Replace the third / by a newline.
Print/delete the first line and repeat.
If the first word has the property that it is only one character long:
sed 's/./& /;s#/\(./\)#\n\1#;P;D' file
Or if the first word has the property that it begins with an upper case character:
sed 's/[[:upper:]][^/]*/& /;s#/\([[:upper:]][^/]*/\)#\n\1#;P;D' file
Or if the first word has the property that it is followed by /trunk/:
sed -E 's#([^/]*)(/trunk/)#\n\1 \2#g;s/.//' file
With GNU sed:
$ str="A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/"
$ sed -E 's|/?(.)(/trunk/)|\n\1 \2|g;s|/$||' <<< "$str"
A /trunk/apple
B /trunk/apple
Z /trunk/orange/citrus
Q /trunk/melon/juice/venti/straw
Note the first empty output line. If it is undesirable we can separate the processing of the first output line:
$ sed -E 's|(.)|\1 |;s|/(.)(/trunk/)|\n\1 \2|g;s|/$||' <<< "$str"
A /trunk/apple
B /trunk/apple
Z /trunk/orange/citrus
Q /trunk/melon/juice/venti/straw
Using gnu awk you could use FPAT to set contents of each field using a pattern.
When looping the fields, replace the first / with " /" (a space followed by a slash).
str1="A/trunk/apple/B/trunk/apple/Z/trunk/orange"
echo $str1 | awk -v FPAT='[^/]+/trunk/[^/]+' '{
for(i=1;i<=NF;i++) {
sub("/", " /", $i)
print $i
}
}'
The pattern matches
[^/]+ Match any char except /
/trunk/[^/]+ Match /trunk/ and any char except /
Output
A /trunk/apple
B /trunk/apple
Z /trunk/orange
Other patterns that can be used by FPAT after the updated question:
Matching a word boundary \\< and an uppercase char A-Z and after /trunk repeat / and lowercase chars
FPAT='\\<[A-Z]/trunk(/[a-z]+)*'
If the length of the strings for the directories after /trunk are at least 2 characters:
FPAT='\\<[A-Z]/trunk(/[^/]{2,})*'
If there can be no separate folders that consist of a single uppercase char A-Z
FPAT='\\<[A-Z]/trunk(/([^/A-Z][^/]*|[^/]{2,}))*'
Output
A /trunk/apple
B /trunk/apple
Z /trunk/orange/citrus
Q /trunk/melon/juice/venti/straw
Assuming your data will always be in the format provided as a single string, you can try this sed.
$ sed 's/$/\//;s|\([A-Z]\)\([a-z/]*\)/\([a-z]*\?\)|\1 \2\3\n|g' input_file
$ echo "A/trunk/apple/pine/skunk/B/trunk/runk/bunk/apple/Z/trunk/orange/T/fruits/apple/mango/P/anything/apple/pear/banana/L/ball/orange/anything/S/fruits/apple/mango/B/rupert/cream/travel/scout/H/tall/mountains/pottery/barnes" | sed 's/$/\//;s|\([A-Z]\)\([a-z/]*\)/\([a-z]*\?\)|\1 \2\3\n|g'
A /trunk/apple/pine/skunk
B /trunk/runk/bunk/apple
Z /trunk/orange
T /fruits/apple/mango
P /anything/apple/pear/banana
L /ball/orange/anything
S /fruits/apple/mango
B /rupert/cream/travel/scout
H /tall/mountains/pottery/barnes
Some fun with perl, where you can use a non-consuming regex to autosplit into the @F array, then just print however you want.
perl -lanF'/(?=.{1,2}trunk)/' -e 'print "$F[2*$_] $F[2*$_+1]" for 0..$#F/2'
Step #1: Split
perl -lanF'/(?=.{1,2}trunk)/'
This will take the input stream, and split each line whenever the pattern .{1,2}trunk is encountered
Because we want to retain trunk and the preceding 1 or 2 chars, we wrap the split pattern in (?=) for a non-consuming forward lookahead
This splits things up this way:
$ echo A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/ | perl -lanF'/(?=.{1,2}trunk)/' -e 'print join " ", @F'
A /trunk/apple/ B /trunk/apple/ Z /trunk/orange/citrus/ Q /trunk/melon/juice/venti/straw/
Step 2: Format output:
The @F array contains pairs that we want to print in order, so we'll iterate half of the array indices, and print 2 at a time:
print "$F[2*$_] $F[2*$_+1]" for 0..$#F/2 --> Double the iterator, and print pairs
using perl -l means each print has an implicit \n at the end
The results:
$ echo A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/ | perl -lanF'/(?=.{1,2}trunk)/' -e 'print "$F[2*$_] $F[2*$_+1]" for 0..$#F/2'
A /trunk/apple/
B /trunk/apple/
Z /trunk/orange/citrus/
Q /trunk/melon/juice/venti/straw/
Endnote: Perl obfuscation that didn't work.
Any array in perl can be cast as a hash, of the format (key,val,key,val....)
So %F=@F; print "$_ $F{$_}" for keys %F seems like it would be really slick
But you lose order:
$ echo A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/ | perl -lanF'/(?=.{1,2}trunk)/' -e '%F=@F; print "$_ $F{$_}" for keys %F'
Z /trunk/orange/citrus/
A /trunk/apple/
Q /trunk/melon/juice/venti/straw/
B /trunk/apple/
Update
With your new data file:
$ cat file
A/trunk/apple/B/trunk/apple/Z/trunk/orange/citrus/Q/trunk/melon/juice/venti/straw/
This GNU awk solution:
awk '
{
sub(/[/]$/,"")
gsub(/[[:upper:]]{1}/,"& ")
print gensub(/([/])([[:upper:]])/,"\n\\2","g")
}' file
A /trunk/apple
B /trunk/apple
Z /trunk/orange/citrus
Q /trunk/melon/juice/venti/straw

extract the adjacent character of selected letter

I have this text file:
# cat letter.txt
this
is
just
a
test
to
check
if
grep
works
The letter "e" appear in 3 words.
# grep e letter.txt
test
check
grep
Is there any way to return the letter printed on left of the selected character?
expected.txt
t
h
r
With shown samples in awk, could you please try following.
awk '/e/{print substr($0,index($0,"e")-1,1)}' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
/e/{ ##Looking if current line has e in it then do following.
print substr($0,index($0,"e")-1,1)
##Printing sub string from starting value of index e-1 and print 1 character from there.
}
' Input_file ##Mentioning Input_file name here.
You can use positive lookahead to match a character that is followed by an e, without making the e part of the match.
cat letter.txt | grep -oP '.(?=e)'
With sed:
sed -nE 's/.*(.)e.*/\1/p' letter.txt
Assuming you have this input file:
cat file
this
is
just
a
test
to
check
if
grep
works
egg
element
You may use this grep + sed solution to find letter or empty string before e:
grep -oE '(^|.)e' file | sed 's/.$//'
t
h
r
l
m
Or alternatively this single awk command should also work:
awk -F 'e' 'NF > 1 {
for (i=1; i<NF; i++) print substr($i, length($i), 1)
}' file
This might work for you (GNU sed):
sed -nE '/(.)e/{s//\n\1\n/;s/^[^\n]*\n//;P;D}' file
Turn off implicit printing and enable extended regexp -nE.
Focus only on lines that meet the requirements i.e. contain a character before e.
Surround the required character by newlines.
Remove any characters before and including the first newline.
Print the first line (up to the second newline).
Delete the first line (including the newline).
Repeat.
N.B. The solution will print each such character on a separate line.
To print all such characters from one input line together on a single line, use:
sed -nE '/(.e)/{s//\n\1/g;s/^/e/;s/e[^\n]*\n?//g;s/\B/ /g;p}' file
N.B. Remove the s/\B/ /g if space separation is not needed.
With GNU awk you can use empty string as FS to split the input as individual characters:
awk -v FS= '/[e]/ {for(i=2;i<=NF;i++) if ($i=="e") print $(i-1)}' file
t
h
r
The for loop starts at i=2, so an "e" at the start of a line is skipped.
Edit:
To print an empty string when e is the first character of the word, add a /^[e]/ rule.
For example, this input:
cat file2
grep
erroneously
egg
Wednesday
effectively
awk -v FS= '/^[e]/ {print ""} /[e]/ {for(i=2;i<=NF;i++) if ($i=="e") print $(i-1)}' file2
r
n
W
n
f
v
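An empty FS that splits into single characters is an extension (POSIX leaves it unspecified), so here is a portable sketch of the same idea using substr(). The sample words are taken from the inputs shown in this thread; the variable n is illustrative. Like the loop above, it starts at position 2, so a leading e produces no output.

```shell
result=$(printf 'test\negg\nelement\n' | awk '{
    n = length($0)
    for (i = 2; i <= n; i++)               # position 1 has no left neighbour
        if (substr($0, i, 1) == "e")
            print substr($0, i - 1, 1)     # character immediately left of this e
}')

printf '%s\n' "$result"
```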

How to join 2 files using a pattern

is it possible to join these files based on first column pattern by using awk ?
Thanks
file1
qwex-123d-947774-sm-shebha
qwex-123d-947774-sm-shebhb
qwex-123d-947774-sm-shebhd
qwex-23d-947774-sm-shebha
qwex-23d-947774-sm-shebhb
qwex-235d-947774-sm-shebhd
file2
qwex-235d none1
qwex-23d none2
output
qwex-23d none2 qwex-23d-947774-sm-shebha
qwex-23d none2 qwex-23d-947774-sm-shebhb
qwex-235d none1 qwex-235d-947774-sm-shebhd
this awk one-liner should do:
awk 'NR==FNR{a[$1]=$0;next}{for(x in a)if($0~"^"x){print a[x], $0;break}}' file2 file1
Note that this one-liner is at risk if the keys in file2 contain characters that have a special meaning in a regex, like qwex$-23d.
If that is the case, ~ should not be used; instead, compare the strings literally.
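A literal comparison could look like the sketch below, which uses index() so that nothing in the key is interpreted as a regex. Two things here are assumptions rather than part of the answer above: the key is taken from the first column of file2, and a "-" is appended so that a key only matches when it is followed by a hyphen in file1. The array name row and the scratch directory are illustrative.

```shell
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
qwex-123d-947774-sm-shebha
qwex-123d-947774-sm-shebhb
qwex-123d-947774-sm-shebhd
qwex-23d-947774-sm-shebha
qwex-23d-947774-sm-shebhb
qwex-235d-947774-sm-shebhd
EOF
printf 'qwex-235d none1\nqwex-23d none2\n' > "$tmp/file2"

result=$(awk '
NR == FNR { row[$1] = $0; next }           # file2: key -> whole line
{
    for (key in row)
        if (index($0, key "-") == 1) {     # literal prefix test, no regex involved
            print row[key], $0
            break
        }
}' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$result"
rm -rf "$tmp"
```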

AWK - Merge multiple lines in two particular columns into one line?

Newbie here.. I'm confused how to merge multiple lines in particular columns and print into one row. For example I have this kind of data in .csv file (separated by comma):
ID1,X1,X2,X3,X4,X5,X6,T,C
ID2,X1,X2,X3,X4,X5,X6,G,A
ID3,X1,X2,X3,X4,X5,X6,C,G
ID4,X1,X2,X3,X4,X5,X6,A,A
I plan to select only the 8th and 9th columns per-row, and print them all in one row and separated using whitespace, so that the result will be like this:
T C G A C G A A
To do that, I tried to use AWK code :
awk -F "," '{printf "%s ",$8, "%s ",$9}' FILE > outputfile
But it gave result the merge between all in col 8th then all in col 9th:
T G C A C A G A
Any suggestions are very welcomed.
Thank you very much for your kind help.
like this?
kent$ awk -F, '{t=$8 OFS $9;s=s?s OFS t:t}END{print s}' file
T C G A C G A A
Try this awk:
awk -F "," '{printf "%s %s ", $8,$9}' yourfile
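The printf one-liner above leaves a trailing space and no final newline. If that matters, a small variation (a sketch; the variable sep is made up) prints the separator before every pair except the first and ends the line in END:

```shell
result=$(printf '%s\n' \
    'ID1,X1,X2,X3,X4,X5,X6,T,C' \
    'ID2,X1,X2,X3,X4,X5,X6,G,A' \
    'ID3,X1,X2,X3,X4,X5,X6,C,G' \
    'ID4,X1,X2,X3,X4,X5,X6,A,A' |
awk -F, '
{ printf "%s%s %s", sep, $8, $9; sep = " " }   # sep is empty for the first row only
END { print "" }                               # terminate the single output line
')

printf '%s\n' "$result"
```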

Resources