Trying to output all possible combinations of joining two files - join

I have a folder of 24 different files that all have the same tab-separated format. This is an example:
zinc-n with-iodide-n 8.0430 X
zinc-n with-amount-of-supplement-n 12.7774 X
zinc-n with-value-of-horizon-n 14.5585 X
zirconium-n as-valence-n 11.3255 X
zirconium-n for-form-of-norm-n 15.4607 X
I want to join the files in every possible combination of 2.
For instance, I want to join File 1 and File 2, File 1 and File 3, File 1 and File 4... and so on, until I have an output of 276 files (all C(24,2) = 276 unique pairs), joining EACH file with EACH other file, considering all the UNIQUE combinations.
I know this can be done for instance in the Terminal with cat.
i.e.
cat File1 File2 > File1File2
cat File1 File3 > File1File3
... and so on.
But to do this by hand for each unique combination would be an extremely laborious process.
Is there a way to automate this process and join all of the unique combinations using a command line in Terminal, with grep for instance? Or perhaps another suggestion for a solution more optimized than cat?

You can try with Python. I use the combinations() function from the itertools module and concatenate the contents of each pair of files. Note that I use a cache to avoid reading each file many times, but with many large files this could exhaust your memory, so use the approach that best fits your data:
import sys
import itertools

seen = {}  # cache: filename -> contents, so each file is read only once
for files in itertools.combinations(sys.argv[1:], 2):
    outfile = ''.join(files)
    with open(outfile, 'w') as oh:
        if files[0] in seen:
            f1_data = seen[files[0]]
        else:
            with open(files[0], 'r') as fh:
                f1_data = fh.read()
            seen[files[0]] = f1_data
        if files[1] in seen:
            f2_data = seen[files[1]]
        else:
            with open(files[1], 'r') as fh:
                f2_data = fh.read()
            seen[files[1]] = f2_data
        # write the two contents back to back
        oh.write(f1_data)
        oh.write(f2_data)
A test, assuming the following content of three files:
==> file1 <==
file1 one
f1 two
==> file2 <==
file2 one
file2 two
==> file3 <==
file3 one
f3 two
f3 three
Run the script like:
python3 script.py file[123]
And it will create three new files with content:
==> file1file2 <==
file1 one
f1 two
file2 one
file2 two
==> file1file3 <==
file1 one
f1 two
file3 one
f3 two
f3 three
==> file2file3 <==
file2 one
file2 two
file3 one
f3 two
f3 three
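With the 24 files from the question this yields the 276 unique pairings. Run it from inside the folder as, for example, python3 script.py * (assuming the folder contains only the input files).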

Grepping twice using result of first Grep in Large file

I am given a list of IDs which I need to trace back to a name in a file.
The file ID contains:
1
2
3
4
5
6
The IDs are contained in a large 2 GB file called result.txt:
ABC=John,dhds,72828,73737,3939,92929
CDE=John,uubad,32424,ajdaio,343533
FG1=Peter,iasisaio,097282,iosoido
WER=Ann,97391279,89719379,7391739
result,**id=1**,iuhdihdio,ihwoihdoih,iuqhwiuh,ABC
result2,**id=2**,9729179,hdqihi,hidqi,82828,CDE
result3,**id=3**,biasi,8u9829,90u209w,jswjso,FG1
So I cat the ID file into a variable. I then use this variable in a loop to grep out the values that link back to the name, using grep and cut -d on result.txt, and output to a variable, so the variable contains ABC CDE FG1.
In the same loop I pass the output of the grep to a second grep on result.txt to get the name, i.e. I re-grep the file for ABC CDE FG1.
I do get the answer, but it takes a long time. Is there a more efficient way?
Thanks
Making some assumptions about your requirement: IDs that are not found in the big file will not be shown in the output, and the desired output is in the format shown below.
Here are mock input files - f1 for the ids and f2 for the large file:
[mathguy@localhost test]$ cat f1
1
2
3
4
5
6
[mathguy@localhost test]$ cat f2
ABC=John,dhds,72828,73737,3939,92929
CDE=John,uubad,32424,ajdaio,343533
FG1=Peter,iasisaio,097282,iosoido
WER=Ann,97391279,89719379,7391739
result,**id=1**,iuhdihdio,ihwoihdoih,iuqhwiuh,ABC
result2,**id=2**,9729179,hdqihi,hidqi,82828,CDE
result3,**id=3**,biasi,8u9829,90u209w,jswjso,FG1
Proposed solution and output:
[mathguy@localhost test]$ sed 's/.*/\*\*id=&\*\*/' f1 | grep -Ff - f2 | \
> sed -E 's/^.*\*\*id=([[:digit:]]*)\*\*.*,([^,]*)$/\1 \2/'
1 ABC
2 CDE
3 FG1
The hard work here is done by grep -F, which might be just fast enough for your needs. There is some prep work and some clean-up work done by sed, but both of those operate on small datasets.
First we take the ids from the input file and output strings in the format **id=<number>**. These are presented as fixed-string patterns to grep -F via the option -f (take the patterns from a file, in this case from stdin, invoked as -; that is, from the output of sed).
After we find the needed lines in the big file, the final sed just extracts the id and the name from each line.
Note: this assumes that each id is found only once in the big file. (Actually the command will work regardless; but if there are duplicate lines for an id, your business users will have to tell you how to handle them. What if you get contradictory names for the same id? Etc.)
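If the grep pipeline is still not fast enough, a single pass over the big file in Python avoids regex matching entirely. A minimal sketch, assuming the same f1/f2 mock files as above (the script name one_pass.py is made up):
# one_pass.py - map each **id=N** marker in the big file back to its code
ids = set()
with open('f1') as fh:                    # small file of ids
    for line in fh:
        ids.add(line.strip())

with open('f2') as fh:                    # large file, read line by line
    for line in fh:
        line = line.rstrip('\n')
        start = line.find('**id=')
        if start == -1:
            continue                      # no id marker on this line
        end = line.find('**', start + 5)  # closing ** of the marker
        if end == -1:
            continue
        id_ = line[start + 5:end]
        if id_ in ids:
            # the code (e.g. ABC) is the last comma-separated field
            print(id_, line.rsplit(',', 1)[-1])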

Delete lines during file compare without deleting line numbers or injecting new blank lines

file2 has a small list of numbers and file1 has a big list of numbers; file2 duplicates some of the numbers in file1. I want to remove the numbers duplicated in file2 from file1, without deleting any data from file2, but at the same time without deleting the line numbers in file1. (I use the PyCharm IDE, and that assigns the line numbers.) This code does remove the duplicate data from file1 and does not remove the data from file2, which is what I want; however, it is deleting the duplicate numbers together with their lines and rewriting file1, which is what I don't want.
import fileinput

# small file2
with open('file2.txt') as fin:
    exclude = set(line.rstrip() for line in fin)

# big file1
for line in fileinput.input('file1.txt', inplace=True):
    if line.rstrip() not in exclude:
        print(line)
Example of what is happening when file2 contains 34344.
file-1 at start:
54545
34344
23232
78787
file-1 end:
54545
23232
78787
What I want:
file-1 start:
54545
34344
23232
78787
file-1 end:
54545

23232
78787
You just need to print an empty line when you find a line that is in the exclude set.
import fileinput

# small file2
with open('file2.txt') as fin:
    exclude = set(line.rstrip() for line in fin)

# big file1
for line in fileinput.input('file1.txt', inplace=True):
    if line.rstrip() not in exclude:
        print(line, end='')
    else:
        print('')
If file1.txt is:
54545
1313
23232
13551
And file2.txt is:
1313
13551
After running the script, file1.txt becomes (lines 2 and 4 are now empty):
54545

23232

Small note on efficiency
As you said, this code does in fact rewrite all the lines, those edited and those not. Deleting and rewriting only a few lines in the middle of a file is not easy, and in any case I am not sure it would be more efficient here, as you do not know a priori which lines should be edited: you will always need to read and process the full file line by line to find them. As far as I know, you will hardly find a solution meaningfully more efficient than this one. Happy to be proven wrong if anybody knows of one.
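For reference, fileinput with inplace=True itself rewrites the whole file behind the scenes: the original is moved aside as a backup and standard output is redirected to a new file under the original name. A rough hand-rolled equivalent, as a sketch (the rewrite helper is illustrative, not part of fileinput):
import os

def rewrite(path, transform):
    # move the original aside, then write the transformed lines back
    backup = path + '.bak'
    os.replace(path, backup)
    with open(backup) as src, open(path, 'w') as dst:
        for line in src:
            dst.write(transform(line))
    os.remove(backup)

# usage: blank out excluded lines, keeping line numbers stable
exclude = {'1313', '13551'}
rewrite('file1.txt', lambda l: l if l.rstrip() not in exclude else '\n')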

grep or awk? : Find similar substring and output in two fields

I have two files. I want to find the strings of file1 as substrings in file2, but I want the output (when matching) to contain both strings in two fields divided by ':',
so that my output has the form "file1string:file2string".
Example
File 1
464697uifs4h44yy
48oo895i6iu8gg11
j4h5y7yu4g655h44
jyyuthvcxx22zerc
File 2
j4h5y7yu4g655h447ijj651cvpijgtkk
strxzdokui464697uifs4rdffgjfudjh
kjhbdfgfx1154m87gjgbgcsqubyu6u3k
gfhgysj4h5y7yu4g655h44jkhgfhhfhu
Desired output
j4h5y7yu4g655h44:j4h5y7yu4g655h447ijj651cvpijgtkk
j4h5y7yu4g655h44:gfhgysj4h5y7yu4g655h44jkhgfhhfhu
I used:
fgrep -f file1 file2 > output
but this gives only the matching lines from file2.
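If a short script is acceptable, a nested scan gets both fields easily. A minimal Python sketch, assuming the file names from the question (the script name pair_matches.py is made up):
# pair_matches.py - print "file1string:file2string" for every substring hit
with open('file1') as fh:
    needles = [line.strip() for line in fh if line.strip()]

with open('file2') as fh:
    for line in fh:
        hay = line.strip()
        for needle in needles:
            if needle in hay:          # plain substring test, no regex
                print(needle + ':' + hay)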

How to join 2 files using a pattern

Is it possible to join these files based on a first-column pattern by using awk?
Thanks.
file1
qwex-123d-947774-sm-shebha
qwex-123d-947774-sm-shebhb
qwex-123d-947774-sm-shebhd
qwex-23d-947774-sm-shebha
qwex-23d-947774-sm-shebhb
qwex-235d-947774-sm-shebhd
file2
qwex-235d none1
qwex-23d none2
output
qwex-23d none2 qwex-23d-947774-sm-shebha
qwex-23d none2 qwex-23d-947774-sm-shebhb
qwex-235d none1 qwex-235d-947774-sm-shebhd
this awk one-liner should do it:
awk 'NR==FNR{a[$1]=$0;next}{for(x in a)if($0~"^"x){print a[x], $0;break}}' file2 file1
Note that this line is risky if the keys in your file2 contain characters that have special meaning in regex, like qwex$-23d.
If that is the case, ~ should not be used; instead, compare the strings literally (for example with awk's index() or substr()).
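As a sketch of that literal comparison, in Python rather than awk (the trailing dash after the key is assumed from the sample data; the script name prefix_join.py is made up):
# prefix_join.py - join file1 lines to file2 rows by a literal first-column prefix
with open('file2') as fh:
    keys = {}                               # key -> full file2 line
    for line in fh:
        line = line.strip()
        if line:
            keys[line.split()[0]] = line

with open('file1') as fh:
    for line in fh:
        line = line.strip()
        for key, full in keys.items():
            if line.startswith(key + '-'):  # literal comparison, no regex
                print(full, line)
                break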

Addressing a specific occurrence of a character in sed

How do I remove or address a specific occurrence of a character in sed?
I'm editing a CSV file and I want to remove all text between the third and the fifth occurrence of the comma (that is, dropping fields four and five). Is there any way to achieve this using sed?
E.g:
% cat myfile
one,two,three,dropthis,dropthat,six,...
% sed -i 's/someregex//' myfile
% cat myfile
one,two,three,,six,...
If it is okay to consider the cut command, then:
$ cut -d, -f1-3,6- file
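Note that cut drops fields 4 and 5 entirely (giving one,two,three,six,... for the example above) rather than leaving the empty field shown in the question, and, like the other field-splitting approaches below, it does not understand quoted commas.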
awk, or any other tool that is able to split strings on delimiters, is better for this job than sed.
$ cat file
1,2,3,4,5,6,7,8,9,10
Ruby (1.9+):
$ ruby -ne 's = $_.split(","); s[3,2] = nil; puts s.compact.join(",")' file
1,2,3,6,7,8,9,10
Using awk (note that the gsub also collapses any other empty fields the line may have):
$ awk 'BEGIN{FS=OFS=","}{$4=$5="";gsub(/,,+/,",")}1' file
1,2,3,6,7,8,9,10
A real parser in action
#!/usr/bin/python3
import csv

with open('my-data.csv', newline='') as inf, \
     open('stripped-data.csv', 'w', newline='') as outf:
    cr = csv.reader(inf)
    cw = csv.writer(outf)
    for row in cr:
        # keep fields 1-3 and 6 onwards, dropping fields 4 and 5
        cw.writerow(row[0:3] + row[5:])
But do note the preface to the csv module:
The so-called CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases. There is no “CSV standard”, so the format is operationally defined by the many applications which read and write it. The lack of a standard means that subtle differences often exist in the data produced and consumed by different applications. These differences can make it annoying to process CSV files from multiple sources. Still, while the delimiters and quoting characters vary, the overall format is similar enough that it is possible to write a single module which can efficiently manipulate such data, hiding the details of reading and writing the data from the programmer.
$ cat my-data.csv
1
1,2
1,2,3
1,2,3,4,
1,2,3,4,5
1,2,3,4,5,6
1,2,3,4,5,6,
1,2,,4,5,6
1,2,"3,3",4,5,6
1,"2,2",3,4,5,6
,,3,4,5
,,,4,5
,,,,5
$ python3 csvdrop.py
$ cat stripped-data.csv
1
1,2
1,2,3
1,2,3
1,2,3
1,2,3,6
1,2,3,6,
1,2,,6
1,2,"3,3",6
1,"2,2",3,6
,,3
,,
,,