I am given a list of IDs which I need to trace back to a name in a file.
The ID file contains:
1
2
3
4
5
6
The IDs are contained in a large 2 GB file called result.txt:
ABC=John,dhds,72828,73737,3939,92929
CDE=John,uubad,32424,ajdaio,343533
FG1=Peter,iasisaio,097282,iosoido
WER=Ann,97391279,89719379,7391739
result,**id=1**,iuhdihdio,ihwoihdoih,iuqhwiuh,ABC
result2,**id=2**,9729179,hdqihi,hidqi,82828,CDE
result3,**id=3**,biasi,8u9829,90u209w,jswjso,FG1
So I cat the ID file into a variable.
I then use this variable in a loop, using grep and cut -d on result.txt to pull out the values that link back to the name, and store the output in a variable,
so the variable contains ABC CDE FG1.
In the same loop I pass the output of that grep to a second grep on result.txt to get the name,
i.e. I re-grep the file for ABC CDE FG1.
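Roughly, the loop looks like this (a simplified sketch of what I am doing; the exact cut fields are illustrative):
ids=$(cat ID)
for id in $ids
do
    # first grep + cut: pull out the value that links to the name, e.g. ABC
    code=$(grep "id=$id" result.txt | cut -d, -f6)
    # second grep on result.txt: turn that value into the name, e.g. John
    grep "^$code=" result.txt | cut -d= -f2 | cut -d, -f1
done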
I do get the answer, but it takes a long time. Is there a more efficient way?
Thanks
Making some assumptions about your requirement... IDs that are not found in the big file will not be shown in the output; the desired output is in the format shown below.
Here are mock input files - f1 for the IDs and f2 for the large file:
[mathguy@localhost test]$ cat f1
1
2
3
4
5
6
[mathguy@localhost test]$ cat f2
ABC=John,dhds,72828,73737,3939,92929
CDE=John,uubad,32424,ajdaio,343533
FG1=Peter,iasisaio,097282,iosoido
WER=Ann,97391279,89719379,7391739
result,**id=1**,iuhdihdio,ihwoihdoih,iuqhwiuh,ABC
result2,**id=2**,9729179,hdqihi,hidqi,82828,CDE
result3,**id=3**,biasi,8u9829,90u209w,jswjso,FG1
Proposed solution and output:
[mathguy@localhost test]$ sed 's/.*/\*\*id=&\*\*/' f1 | grep -Ff - f2 | \
> sed -E 's/^.*\*\*id=([[:digit:]]*)\*\*.*,([^,]*)$/\1 \2/'
1 ABC
2 CDE
3 FG1
The hard work here is done by grep -F, which might be just fast enough for your needs. There is some prep work and some clean-up work done by sed, but both of those operate on small datasets.
First we take the IDs from the input file and output strings in the format **id=<number>**. These are presented as fixed-string patterns to grep -F via the option -f (take the patterns from a file, in this case from stdin, invoked as -; that is, from the output of sed).
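To illustrate, the prep step on its own produces these fixed-string patterns:
[mathguy@localhost test]$ sed 's/.*/\*\*id=&\*\*/' f1
**id=1**
**id=2**
**id=3**
**id=4**
**id=5**
**id=6**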
After we find the needed lines from the big file, the final sed just extracts the id and the name from each line.
Note: this assumes that each id is found only once in the big file. (Actually the command will work regardless; but if there are duplicate lines for an id, your business users will have to tell you how to handle them. What if you get contradictory names for the same id? Etc.)
file1 has a big list of numbers; file2 has a small list of numbers, duplicating some of the numbers in file1. I want to remove the numbers duplicated in file2 from file1, without deleting any data from file2, but at the same time without changing the line numbering in file1. (I use the PyCharm IDE, which displays the line numbers.) This code does remove the duplicate data from file1 and does not remove the data from file2, which is what I want; however, it deletes the duplicate numbers together with their lines and rewrites file1 with the remaining lines shifted up, which is what I don't want.
import fileinput

# small file2
with open('file2.txt') as fin:
    exclude = set(line.rstrip() for line in fin)

# big file1
for line in fileinput.input('file1.txt', inplace=True):
    if line.rstrip() not in exclude:
        print(line)
Example of what is happening, where file2 contains 34344.
file-1 at start:
54545
34344
23232
78787
file-1 end:
54545
23232
78787
What I want:
file-1 start:
54545
34344
23232
78787
file-1 end (with a blank line where 34344 was):
54545

23232
78787
You just need to print an empty line when you find a value that is in the exclude set. (Note that line still contains its trailing newline, which is why the normal case uses end=''.)
import fileinput

# small file2
with open('file2.txt') as fin:
    exclude = set(line.rstrip() for line in fin)

# big file1
for line in fileinput.input('file1.txt', inplace=True):
    if line.rstrip() not in exclude:
        print(line, end='')
    else:
        print('')
If file1.txt is:
54545
1313
23232
13551
And file2.txt is:
1313
13551
After running the script above, file1.txt becomes (with blank lines where 1313 and 13551 were):
54545

23232

Small note on efficiency
As you said, this code is in fact rewriting all the lines, both those edited and those not. Deleting and rewriting only a few lines in the middle of a file is not easy, and in any case I am not sure it would be more efficient in your case, as you do not know a priori which lines should be edited: you will always need to read and process the full file line by line to find out which lines to change. As far as I know, you will hardly find a solution really more efficient than this one. Glad to be proven wrong if anybody knows of one.
Is it possible to join these files based on a first-column pattern by using awk?
Thanks
file1
qwex-123d-947774-sm-shebha
qwex-123d-947774-sm-shebhb
qwex-123d-947774-sm-shebhd
qwex-23d-947774-sm-shebha
qwex-23d-947774-sm-shebhb
qwex-235d-947774-sm-shebhd
file2
qwex-235d none1
qwex-23d none2
output
qwex-23d none2 qwex-23d-947774-sm-shebha
qwex-23d none2 qwex-23d-947774-sm-shebhb
qwex-235d none1 qwex-235d-947774-sm-shebhd
This awk one-liner should do it:
awk 'NR==FNR{a[$1]=$0;next}{for(x in a)if($0~"^"x){print a[x], $0;break}}' file2 file1
Note that this line is risky if the join keys in your file2 contain characters which have a special meaning in regex, like qwex$-23d.
If that is the case, ~ should not be used; instead, we should compare the strings literally, as in the variant below.
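A literal variant of the same one-liner, using index() to require that the first column of file2 is a literal prefix of the file1 line instead of matching it as a regex:
awk 'NR==FNR{a[$1]=$0;next}{for(x in a)if(index($0,x)==1){print a[x], $0;break}}' file2 file1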
How do I remove or address a specific occurrence of a character in sed?
I'm editing a CSV file and I want to remove all text between the third and the fifth occurrence of the comma (that is, dropping fields four and five). Is there any way to achieve this using sed?
E.g:
% cat myfile
one,two,three,dropthis,dropthat,six,...
% sed -i 's/someregex//' myfile
% cat myfile
one,two,three,,six,...
If it is okay to consider the cut command:
$ cut -d, -f1-3,6- file
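Applied to the example from the question:
$ cut -d, -f1-3,6- myfile
one,two,three,six,...
Note that cut joins the remaining fields with a single comma rather than leaving an empty placeholder field behind.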
awk, or any other tool that can split strings on delimiters, is better for this job than sed.
$ cat file
1,2,3,4,5,6,7,8,9,10
Ruby (1.9+)
$ ruby -ne 's=$_.split(","); s[3,2]=nil; puts s.compact.join(",")' file
1,2,3,6,7,8,9,10
using awk
$ awk 'BEGIN{FS=OFS=","}{$4=$5="";}{gsub(/,,*/,",")}1' file
1,2,3,6,7,8,9,10
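For completeness, since the question asked about sed: a plain-sed sketch that drops fields four and five (assuming no quoted commas in the data):
$ sed 's/^\(\([^,]*,\)\{3\}\)\([^,]*,\)\{2\}/\1/' file
1,2,3,6,7,8,9,10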
A real parser in action
#!/usr/bin/python
import csv

# read my-data.csv, drop fields four and five (row[3] and row[4]),
# and write the remaining fields to stripped-data.csv
cr = csv.reader(open('my-data.csv', 'rb'))
cw = csv.writer(open('stripped-data.csv', 'wb'))
for row in cr:
    cw.writerow(row[0:3] + row[5:])
(This is Python 2 style; on Python 3, open both files in text mode with newline='' instead.)
But do note the preface to the csv module:
The so-called CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases. There is no “CSV standard”, so the format is operationally defined by the many applications which read and write it. The lack of a standard means that subtle differences often exist in the data produced and consumed by different applications. These differences can make it annoying to process CSV files from multiple sources. Still, while the delimiters and quoting characters vary, the overall format is similar enough that it is possible to write a single module which can efficiently manipulate such data, hiding the details of reading and writing the data from the programmer.
$ cat my-data.csv
1
1,2
1,2,3
1,2,3,4,
1,2,3,4,5
1,2,3,4,5,6
1,2,3,4,5,6,
1,2,,4,5,6
1,2,"3,3",4,5,6
1,"2,2",3,4,5,6
,,3,4,5
,,,4,5
,,,,5
$ python csvdrop.py
$ cat stripped-data.csv
1
1,2
1,2,3
1,2,3
1,2,3
1,2,3,6
1,2,3,6,
1,2,,6
1,2,"3,3",6
1,"2,2",3,6
,,3
,,
,,