I have two text files, each containing one column, for example:
File_A  File_B
1       1
2       2
3       8
If I do grep -f File_A File_B > File_C, I get File_C containing 1 and 2. I want to know how to use grep -v on two files so that I can get the non-matching values, 3 and 8 in the above example.
Thanks.
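A minimal sketch of the grep -v approach (GNU grep; -F matches literally and -x matches whole lines, so 1 does not also match 10):
$ grep -vFxf File_B File_A   # values only in File_A
3
$ grep -vFxf File_A File_B   # values only in File_B
8
$ # both together
$ { grep -vFxf File_B File_A; grep -vFxf File_A File_B; } > File_C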
You can also use comm, if your version allows an empty output delimiter:
$ # -3 means suppress lines common to both input files
$ # by default, tab character appears before lines from second file
$ comm -3 f1 f2
3
	8
$ # change it to empty string
$ comm -3 --output-delimiter='' f1 f2
3
8
Note: comm requires sorted input, so use comm -3 --output-delimiter='' <(sort f1) <(sort f2) if they are not already sorted
You can also pass the common lines found by grep as input to grep -v. Tested with GNU grep; some versions might not support all these options.
$ grep -Fxf f1 f2 | grep -hxvFf- f1 f2
3
8
-F option to match strings literally, not as regex
-x option to match whole lines only
-h option to suppress the file name prefix
-f - (written above as f-) to read the patterns from stdin instead of from a file
awk 'NR==FNR{a[$0];next} {if ($0 in a) delete a[$0]; else print} END{for (k in a) print k}' f1 f2
8
3
To understand the meaning of NR and FNR, check the output of printing them:
awk '{print NR,FNR}' f1 f2
1 1
2 2
3 3
4 4
5 1
6 2
7 3
8 4
The condition NR==FNR is used to process only the first file, as NR and FNR are equal only while the first file is being read.
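This two-file pattern is a common awk idiom: collect keys while reading the first file, then test membership while reading the second. A minimal sketch using the f1/f2 files from the question:
$ # lines of f2 that are absent from f1
$ awk 'NR==FNR{seen[$0];next} !($0 in seen)' f1 f2
8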
With the GNU diff command (which compares files line by line):
diff --suppress-common-lines -y f1 f2 | column -t
The output (the left column contains lines from f1, the right column lines from f2):
3 | 8
-y, --side-by-side - output in two columns
I need to merge 2 lists based on columns 1 and 2.
file1:
client1,server1,3000.00
client1,server2,2500.00
client1,server3,1500.00
client2,server1,4500.00
client2,server2,2300.00
client2,server3,1230.00
client3,server1,3400.00
client3,server2,4500.00
client3,server3,1245.00
client4,server1,3400.00
client5,server2,4500.00
client6,server3,1245.00
client7,server1,3400.00
client7,server2,4500.00
client8,server3,1245.00
client8,server1,3400.00
client8,server2,4500.00
client9,server3,1245.00
file2:
client1,server1,windows,250g
client1,server2,linux,450g
client1,server3,linux,400g
client2,server1,windows,250g
client2,server2,linux,450g
client2,server3,linux,400g
client3,server1,windows,250g
client3,server2,linux,450g
client3,server3,linux,400g
What I need is to update file2 with the missing values from columns 1 and 2 of file1, adding commas to keep the same number of columns.
With this example the output should look like this:
client1,server1,windows,250g
client1,server2,linux,450g
client1,server3,linux,400g
client2,server1,windows,250g
client2,server2,linux,450g
client2,server3,linux,400g
client3,server1,windows,250g
client3,server2,linux,450g
client3,server3,linux,400g
client4,server1,,
client5,server2,,
client6,server3,,
client7,server1,,
client7,server2,,
client8,server3,,
client8,server1,,
client8,server2,,
client9,server3,,
I have tried with awk and join but I am not able to get this result.
If creating a new file is easier, that is no issue.
Thanks for your help.
Another awk way
awk -F, -vOFS="," 'NR!=FNR{NF--;NF+=2}!a[$1 FS $2]++' file2 file1
or
awk -F, 'NR!=FNR{$0=$1 FS $2",,"}!a[$1 FS $2]++' file2 file1
Shortest
awk -F, '{x=$1","$2}NR!=FNR{$0=x",,"}!a[x]++' file2 file1
Give this line a try:
awk -F, '{k=$1 FS $2}NR==FNR{a[k]++;print;next}!a[k]{print k",,"}' file2 file1
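The same logic spelled out with comments (a sketch):
awk -F, '
  { k = $1 FS $2 }                   # composite key: columns 1 and 2
  NR == FNR { a[k]++; print; next }  # first file (file2): record keys, print lines untouched
  !a[k] { print k",," }              # second file (file1): unseen keys get two empty columns
' file2 file1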
Using the join command. The problem is that join cannot join on multiple fields, so we temporarily replace the first comma:
join -t , -o 0,2.2,2.3 -a 1 <(sed 's/,/:/' file1) <(sed 's/,/:/' file2) | sed 's/:/,/'
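A breakdown of the options in that command:
-t , to use comma as the field separator
-o 0,2.2,2.3 to print the join field, then fields 2 and 3 of file2 (these stay empty for lines of file1 with no match)
-a 1 to also print the unpairable lines from file1
The sed calls turn client1,server1,... into client1:server1,... so that columns 1 and 2 act as a single join field; the final sed restores the comma.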
I tested the line below to compare the 1st columns of 2 files and make a union. However, when file2 has different values with an identical 1st column, all but the last one are eliminated. Below I attach the sample files, the obtained result, and the desired result.
awk -F, 'BEGIN{OFS=","}FNR==NR{a[$1]=$1","$2;next}($1 in a && $2=$2","a[$1])' file2.csv file1.csv >testout.txt
file1
John,red
John,blue
Mike,red
Mike,blue
Carl,red
Carl,blue
file2
John,V1
John,V2
Kent,V1
Kent,V2
Mike,V1
Mike,V2
obtained result
John,red,John,V2
John,blue,John,V2
Mike,red,Mike,V2
Mike,blue,Mike,V2
desired result
John,red,John,V1
John,red,John,V2
John,blue,John,V1
John,blue,John,V2
Mike,red,Mike,V1
Mike,red,Mike,V2
Mike,blue,Mike,V1
Mike,blue,Mike,V2
Try this one-liner:
awk -F, -v OFS="," 'NR==FNR{a[$0];next}{for(x in a)if(x~"^"$1FS)print $0,x}' file2 file1
test:
kent$ awk -F, -v OFS="," 'NR==FNR{a[$0];next}{for(x in a)if(x~"^"$1FS)print $0,x}' f2 f1
John,red,John,V1
John,red,John,V2
John,blue,John,V1
John,blue,John,V2
Mike,red,Mike,V1
Mike,red,Mike,V2
Mike,blue,Mike,V1
Mike,blue,Mike,V2
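Spelled out with comments, the same logic reads (a sketch):
awk -F, -v OFS="," '
  NR == FNR { a[$0]; next }   # first pass (file2): store each full line as an array key
  {                           # second pass (file1):
    for (x in a)              #   scan all stored file2 lines
      if (x ~ "^" $1 FS)      #   keep those starting with this line's first column
        print $0, x           #   print the pair (a cross join per matching name)
  }
' file2 file1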
Using join can also do this:
join -t, -1 1 -2 1 --nocheck-order -o 1.1,1.2,2.1,2.2 file1 file2
Output:
John,red,John,V1
John,red,John,V2
John,blue,John,V1
John,blue,John,V2
Mike,red,Mike,V1
Mike,red,Mike,V2
Mike,blue,Mike,V1
Mike,blue,Mike,V2
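Note that join expects its inputs to be sorted on the join field; --nocheck-order only silences the warning, and unsorted input can silently lose matches. If the files are not already sorted, a safer variant is:
join -t, -o 1.1,1.2,2.1,2.2 <(sort file1) <(sort file2)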
I am trying hard to get the output the way I'd like.
Current Output:
###Server1###
2
###Server2###
0
###Server3###
5
###Server4###
0
Required Output:
###Server1###
2
###Server3###
5
All I am looking for is to grep and ignore any line, and the previous line, that contains 0 (zero) anywhere in the line. I am using the bash shell.
This is a possible approach:
$ grep -B 1 "^\s*[1-9]$" file
###Server1###
2
--
###Server3###
5
To get rid of the group separator, we can also do:
$ grep --no-group-separator -B 1 "^\s*[1-9]$" file
###Server1###
2
###Server3###
5
Explanation
Instead of using grep -v to find the inverse, I think it is easier to look for the lines containing a single digit that is not 0. This is done with the "^\s*[1-9]$" expression, which allows spaces before the digit.
With -B 1 we make it print also the line before the matched one.
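If the counts can grow beyond a single digit, a pattern matching any non-zero integer works the same way (a sketch, GNU grep syntax):
$ grep --no-group-separator -B 1 '^\s*[0-9]*[1-9][0-9]*\s*$' file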
Code for GNU sed:
$ sed '$!N;/\s*\b0\b\s*/d' file
###Server1###
2
###Server3###
5
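The same script with each command annotated (GNU sed accepts comment lines inside a script):
sed '
  # append the next line to the pattern space (except on the last line)
  $!N
  # delete the two-line pair if it contains a standalone 0
  /\s*\b0\b\s*/d
' file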
I have a huge file in which I need to obtain every nth line and print it as a row.
My data:
1 937 4.320194
2 667 4.913314
3 934 1.783326
4 940 -0.299312
5 939 2.309559
6 936 3.229496
7 611 -1.41808
8 608 -1.154019
9 606 2.159683
10 549 0.767828
I want my data to look like this:
1 937 4.320194
3 934 1.783326
5 939 2.309559
7 611 -1.41808
9 606 2.159683
This is of course an example; I want every 10th line of my huge data file. I tried this so far:
NF == 6 {
    if (NR % 10) { print; }
}
To print every second line, starting with the first:
awk 'NR%2==1' file.txt
To print every tenth line, starting with the tenth line:
awk 'NR%10==0' file.txt
To use this in a script, add the following to a file called script.awk:
BEGIN {
print "Processing file"
}
NR%10==0
END {
print "Finished processing"
}
Then execute:
awk -f script.awk file.txt
With GNU sed, you can do a lot of variations on this quite easily with its first~step address form. For instance:
# Odd lines
sed -n 1~2p file
# Every tenth line (10, 20, 30, ...)
sed -n 10~10p file
# Every tenth line (1, 11, 21, ...)
sed -n 1~10p file
# First plus every tenth (1, 10, 20, 30, ...)
sed -n -e 1p -e 10~10p file
Piece of cake: awk 'NR % 10 == 1' test.txt
It's not (g)awk, but it'll work:
grep '^[[:digit:]]*0[[:blank:]]' myfile should do the trick (the pattern relies on the line number being the first column of the data, and matches the lines whose number ends in 0).
Doing it directly in the Command Prompt (Windows): put gawk.exe in the folder where the file is, start a Command Prompt in that folder, and write
gawk "NR%n==x" oldfile.txt>newfile.txt
n is the step (every nth line is printed) and x is the starting line.
E.g. n=10 and x=1 prints lines 1, 11, 21, 31, 41, ... through the end of the original file into the new file.
E.g. n=20 and x=5 prints lines 5, 25, 45, 65, ... through the end of the original file into the new file.
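For instance, the first case as a concrete command:
gawk "NR%10==1" oldfile.txt>newfile.txt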