How to split paired-end fastq files? - grep

I have Illumina paired-end reads contained within one .fastq file, denoted as '/1' for forward reads and '/2' for reverse reads.
I am using grep to pull out the individual reads and place them into 2 respective files (one for forward reads and one for reverse.
grep -A 3 "/1$" sample21_pe.unmapped.fq > sample21_1_rfa.fq
grep -A 3 "/2$" sample21_pe.unmapped.fq > sample21_2_rfa.fq
However, when I try to use the files (fastqc, assembly, etc), they do not work. When running
fastqc i get the following error:
Failed to process file sample21_1_rfa.fq
uk.ac.babraham.FastQC.Sequence.SequenceFormatException: ID line didn't start with '#'
at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext(FastQFile.java:134)
at uk.ac.babraham.FastQC.Sequence.FastQFile.next(FastQFile.java:105)
at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:76)
at java.lang.Thread.run(Thread.java:662)
But, if you look at the files they identifier does indeed start with an '#'. Any advice on why these files aren't working? I had originally converted .bam files into the .fastq files with
samtools bam2fq
Here are samples of each individual file:
merged .fastq
#HISEQ:534:CB14TANXX:4:1101:1091:2161/1
GAGAAGCTCGTCCGGCTGGAGAATGTTGCGCTTGCGGTCCGGAGAGGACAGAAATTCGTTGATGTTAACGGTGCGCTCGCCGCGGACGCTCTTGATGGTGACGTCGGCGTTGAGCGTGACGCACG
+
B/</<//B<BFF<FFFFFF/BFFFFFFB<BFFF<B/7FFF7B/B/FF/F/<<F/FFBFFFBBFFFBFB/FF<BBB<B/B//BBFFFFFFF/B/FF/B77B//B7B7F/7F###############
#HISEQ:534:CB14TANXX:4:1101:1091:2161/2
TGACGCCTGCCGTCAGGTAGGTTCTCCGCAGATCCGAAATCTCGCGACGCTCGGCGGCAACATCTGCCAGTCGTCCGTGGCGGGCGACGGTCTCGCGGCGTGCGTCACGCTCAACGCCGACGTAC
+
/B<B//F/F//B<///<FB/</F<<FFFFF<FFBF/FF<//FB/F//F7FBFFFF/B</7<F//<BB7/7BB7/B<F7BF<BFFFB7B#####################################
#HISEQ:534:CB14TANXX:4:1101:1637:2053/1
NGTTTACCATACAACAATCTTGCGACCTATTCAAATCATCTATATGCCTTATCAAGTTTTCATAGCTTTCAAGATTCTCAATTTCCTCACGTCTCGCTTTGCTCAACCTACAAAAACTCTCTTCT
+
#<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFFFFFFFFFFFB/BFBBFBB<<<<FFFFFFBB<FBFFBFF
#HISEQ:534:CB14TANXX:4:1101:1637:2053/2
TCGGTCGTTGGGAAAAGACCTGTGGTAAACATCCTACGCAAAAGCCATTGCGGTTACTCGTTCGTATGATTCTTGCATCAACTAATCAAGGCGATTGGGTTCTCGACCCATTTTGTGGAAGTTCG
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFBFFFFFB<FF<<BBFB
#HISEQ:534:CB14TANXX:4:1101:1792:2218/1
TCTATCGGCTGACCGATAAGCTGTCGCCTGCCGACCGTCCTGCCATGGGACGGCGCATCGCACAGCTCACCCTGGACTAACTCTCCAACACCATGATGCTGACACGCTCGGCAAAAACACCCGAT
+
<<B/<B</FF/<B/<//F<//FF<<<FF//</7/F<</FFF####################################################################################
#HISEQ:534:CB14TANXX:4:1101:1792:2218/2
TGCCGGAGGGCGTCGATGGTGGCATCGAGCTTTTTTGCCGAGCGTGTCAGCATGATGGTGTTGTAGAGATAGTCCATGGTGAGCTGTGCGATGCGCCGTTCCATGGCAGGACGGTCGGCAGGCGC
+
BBBBBFFFFFFFFBFFFBBFFFFFFFFFFFBBFFFF/FF<F7FF//F/FBB/FFBFFF/F7BFF<F/FFFFFFFFB/7BB<7BFFFFFFFFFFFFF<B///B/7B/7/B//77BB//7B/B7/B#
#HISEQ:534:CB14TANXX:4:1101:1903:2238/1
TATTCCAGCGACCGTTATAATCAAACTCAACTACATAGTCATTGCGGATTGCTTCAAGAAATTTTTTCCAGACTATTTCATCAATATTTATTTTGGGAACTGGTGCAACAGCAATTCTTTTTAAA
+
BBBBBFFFFFFFFFFFFFBFF/FFBFFBFFFFFFFF/FFFFFF<<FFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFBF/B/<B<B/FBF7/<FFFFFFF/BB/7///7FF<BFFF//B/FFF###
#HISEQ:534:CB14TANXX:4:1101:1903:2238/2
TAAGGTTGGAGAAGCAACAATTTACCGTGATATTGATTTGCTCCGAACATATTTTCATGCGCCACTCGAGTTTGACAGGGAGAAAGGCGGGTATTATTATTTTAATGAAAAATGGGATTTTGCCC
+
B<BBBFFFFFFF<FFFFFFFFFFFFFFFFFF/BFFFFFFF<<FF<F<FFF/FF/FFFFBFB</<//<B/////<<FFFFB/<F<BFF/7/</7/7FB/B/BFF<//7BFF###############
#HISEQ:534:CB14TANXX:4:1101:2107:2125/1
TGTAGTATTTATTACATCATATAGAATGTAGATATAAAAGATGAAAAAGCTATAATTTCTTTGATAATATAAGGAGGGAATAACACTATGAGGATTGATAGAGCAGGAATCGAGGATTTGCCGAT
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFF/FFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBFFFFFFFBB<FBB7BFF#
#HISEQ:534:CB14TANXX:4:1101:2107:2125/2
TACCACTATCGGCAAATCCTCGATTCCTGCTCTATCAATCCTCATAGTGTTATTCCCTCCTTATATTATCAAAGAAATTATAGCTTTTTCATCTTTTATATCTACATTCTATATGATGTAATAAA
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFFBFBFFFFFFFBBFFFFFFFBF7F/B/BBF7/</FF/77F/77BB#
#HISEQ:534:CB14TANXX:4:1101:2023:2224/1
TCACCAGCTCGGCACGCTTGTCCTTGACCTCCTGCTCGATCTGACCGTCCATCTTGGCTGCCACGGTGTTCTCCTCGGCGGAGTAGGCAAAGCAGCCCAGACGGTCGAACTGTATCTCCTTGACA
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFB<<B7BBFBFFF<FFBBFFFBF/7B/<B<
#HISEQ:534:CB14TANXX:4:1101:2023:2224/2
TCGAGGATCTGTGCAACTTTGTCAAGGAGATACAGTTCGACCGTCTGGGCTGCTTTGCCTACTCCGCCGAGGAGAACACCGTGGCAGCCAAGATGGACGGTCAGATCGAGCAGGAGGTCAAGGAC
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFBFBFFFFFFFFFFFFFFFFFFFFFFFFFBBFFFFFFFFFFFFF<7BF/<<BB###
#HISEQ:534:CB14TANXX:4:1101:2038:2235/1
TTTATGCGAATGTAGAGTGGCTTCTCCACTGCCTCGGTGAAGCCCACGCGCGAGATGAGCGAATTAAGCTGCTTTGCAGTGAATTGCATTGCATATACACCTGCGTCGGCTTGAATACTTGTGCT
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFF//BFFFFFFFFFFFFF<B<BB###
#HISEQ:534:CB14TANXX:4:1101:2038:2235/2
AATCCGCTCGTGAAAGCTCCCGATAACGCCACAGTGAACACCGTGGAGTTCTCTGATACCGAAGATTTCGCACGCAGCACAAGTATTCAAGCCGACGCAGGTGTATATGCAATGCAATTCACTGC
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBFFFFFFFFFFFFFFFFFFFFFFF
#HISEQ:534:CB14TANXX:4:1101:2271:2041/1
NACACTTGTCGATGATCTTGCCAAGCTGCTTCTTGCCCACCAGGAAGCCGATCTCCAGATCAAACTCGTGGCCGGGAACACTCCGGTCCACAAAGCCCAGGTCCTGGGGAATGGGCTCATCGTAG
+
#<</BB/F/BB/F<FFFFFFFFF/<BFFFFFFFF<<FFBFFFFFFBFBFBBB<<FFFFBFFF/<B/FFFFFFFFFFFFFFFFF<FB<<BFF77BFFF/<BFFFB<</BB</7BFFFB########
#HISEQ:534:CB14TANXX:4:1101:2271:2041/2
GACTCATCTACAATGAGCCCATTCCCCAGGACCTGGGCTTTGTGGACCGGAGTGTTCCCGGCCACGAGTTTGATCTGGAGATCGGCTTCCTGGTGGGCAAGAAGCAGCTTGGCAAGATCATCGCC
+
<<BBBFFF<F/BFFFBFBF<BFF<<F/FFFBFFFF<<FFFFBFFFFFFBFFF/<B<F/<</<FFF//FFFFF/<<F/B/B/7/FF<<FF/7B/BBB/7///7////<B/B/BB/B/B/B/7BB##
Example of forward reads after being pulled out and placed into their own .fastq file:
#HISEQ:534:CB14TANXX:4:1101:1091:2161/1
GAGAAGCTCGTCCGGCTGGAGAATGTTGCGCTTGCGGTCCGGAGAGGACAGAAATTCGTTGATGTTAACGGTGCGCTCGCCGCGGACGCTCTTGATGGTGACGTCGGCGTTGAGCGTGACGCACG
+
B/</<//B<BFF<FFFFFF/BFFFFFFB<BFFF<B/7FFF7B/B/FF/F/<<F/FFBFFFBBFFFBFB/FF<BBB<B/B//BBFFFFFFF/B/FF/B77B//B7B7F/7F###############
--
#HISEQ:534:CB14TANXX:4:1101:1637:2053/1
NGTTTACCATACAACAATCTTGCGACCTATTCAAATCATCTATATGCCTTATCAAGTTTTCATAGCTTTCAAGATTCTCAATTTCCTCACGTCTCGCTTTGCTCAACCTACAAAAACTCTCTTCT
+
#<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFFFFFFFFFFFB/BFBBFBB<<<<FFFFFFBB<FBFFBFF
--
#HISEQ:534:CB14TANXX:4:1101:1792:2218/1
TCTATCGGCTGACCGATAAGCTGTCGCCTGCCGACCGTCCTGCCATGGGACGGCGCATCGCACAGCTCACCCTGGACTAACTCTCCAACACCATGATGCTGACACGCTCGGCAAAAACACCCGAT
+
<<B/<B</FF/<B/<//F<//FF<<<FF//</7/F<</FFF####################################################################################
--
#HISEQ:534:CB14TANXX:4:1101:1903:2238/1
TATTCCAGCGACCGTTATAATCAAACTCAACTACATAGTCATTGCGGATTGCTTCAAGAAATTTTTTCCAGACTATTTCATCAATATTTATTTTGGGAACTGGTGCAACAGCAATTCTTTTTAAA
+
BBBBBFFFFFFFFFFFFFBFF/FFBFFBFFFFFFFF/FFFFFF<<FFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFBF/B/<B<B/FBF7/<FFFFFFF/BB/7///7FF<BFFF//B/FFF###
--
#HISEQ:534:CB14TANXX:4:1101:2107:2125/1
TGTAGTATTTATTACATCATATAGAATGTAGATATAAAAGATGAAAAAGCTATAATTTCTTTGATAATATAAGGAGGGAATAACACTATGAGGATTGATAGAGCAGGAATCGAGGATTTGCCGAT
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFF/FFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBFFFFFFFBB<FBB7BFF#
--
#HISEQ:534:CB14TANXX:4:1101:2023:2224/1
TCACCAGCTCGGCACGCTTGTCCTTGACCTCCTGCTCGATCTGACCGTCCATCTTGGCTGCCACGGTGTTCTCCTCGGCGGAGTAGGCAAAGCAGCCCAGACGGTCGAACTGTATCTCCTTGACA
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFB<<B7BBFBFFF<FFBBFFFBF/7B/<B<
--
#HISEQ:534:CB14TANXX:4:1101:2038:2235/1
TTTATGCGAATGTAGAGTGGCTTCTCCACTGCCTCGGTGAAGCCCACGCGCGAGATGAGCGAATTAAGCTGCTTTGCAGTGAATTGCATTGCATATACACCTGCGTCGGCTTGAATACTTGTGCT
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFF//BFFFFFFFFFFFFF<B<BB###
--
#HISEQ:534:CB14TANXX:4:1101:2271:2041/1
NACACTTGTCGATGATCTTGCCAAGCTGCTTCTTGCCCACCAGGAAGCCGATCTCCAGATCAAACTCGTGGCCGGGAACACTCCGGTCCACAAAGCCCAGGTCCTGGGGAATGGGCTCATCGTAG
+
#<</BB/F/BB/F<FFFFFFFFF/<BFFFFFFFF<<FFBFFFFFFBFBFBBB<<FFFFBFFF/<B/FFFFFFFFFFFFFFFFF<FB<<BFF77BFFF/<BFFFB<</BB</7BFFFB########
--
#HISEQ:534:CB14TANXX:4:1101:2678:2145/1
CTGTACATAGTACGTATTTGACGCCTGCGTCGATGTAGCGTTTGAGGAAGGGAAGCAGCGGTTCTGCAGAGTCCTCTTTCCATCCGTTGATGCTAATCATTCCGTTGCGTACATCCGCTCCGAGA
+
BBBBBFFFFFFF<FFF<FFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<BFFF7BFFFFFFFF<BBFFFFFFFFBBFBBB<FFBFFFFFFFFFFFFB<BFFFFFFBFB/BFFF####
--
#HISEQ:534:CB14TANXX:4:1101:2972:2114/1
CTCTGTGCCGATCCCTTTGCCTTTGCGTTTTGAGGAAAGGAAACCACCTTCTGGGTCGGTGAGGATAGTTCCGGTGAAGGTGTTGTCCACCGCCAGGCATAGGGAATAGCTGTCAGCCTTTGCTC
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFB/FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFBFFFF<FFFFFFFFFF<BFFFFF
--
#HISEQ:534:CB14TANXX:4:1101:2940:2222/1
CTAATTTTTTCATTATATTACTAATTTTGTAATTGGTAAAATATTATAATATCCTTGTACATTAAGACCCCAATAATCAGAAGAAGTAAAATTAATTCCTGCAACAGTTCTTAAATATCCATTAG
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FBFFFFFFFFFFFFFFF/FBFBFFBFFFFF/<F<FFFFFFFFFF<FFFFFFBFFFFFFFFF</FBFBBF<F/7//FFBFBBFFF/<7BF#
--
#HISEQ:534:CB14TANXX:4:1101:3037:2180/1
CGTCAGTTCCGCAACGATAAAGAGTTCCGCATTGCAGTCACCTGTACGCTGGTAGCCACCGGAACCGATGTCAAGCCGTTGGAGGTGGTGATGTTCATGCGCGACGTAGCTTCCGAGCCGTTATA
+
B/BBBBBFFFFFFF<FFBFFFFFF<FFFFBFFFFFFF<BBFFFFFFFFFFFFFFFFFBFF/FFFFBFFBFFFFBFF/7F/BFB/BBFFFFFFFFBFF<BBF<7BBFFFFFFBBFFF/B#######
--
#HISEQ:534:CB14TANXX:4:1101:3334:2171/1
ACCGATGTACATACCCGGACGGGTACGCACATGCTCCATATCGCTCAAGTGGCGGATATTGTCATCTGTATATTCTACAGGTTGCTCCTGAGGGGTATTTGCCAGTTCTTCGGCAGCACCCTTTT
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFBFFFFFFFFFFFBFFFFFFFFFF</<BFFFFFFFFBBFFFFFFBF</BB///BF<FFFFF<</<B
--
#HISEQ:534:CB14TANXX:4:1101:3452:2185/1
CGCAGACGGATTTGCTTGAAGTCCGTCTCATCGTATTCCGACAACTCATCGAGGAACACACGCTTGTATTGACTGATACCCTTGATTTTCTCCGGGTCGTCAAGACCACTGAAATCAATCTTGCC
+
BBBBBFFFF<FFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFBFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFBFFFFFBFFBFFFFFFFFFB/77B/FBBFFF/<FFF/77BBFFFBFFBBB
--
Any advice would be appreciated. Thanks!

In general, this operation is called deinterlace fastq or deinterleave fastq. The question already has the answer here:
deinterleave fastq file
https://www.biostars.org/p/141256/
I am copying it here, with minor reformatting for clarity:
paste - - - - - - - - < interleaved.fq \
| tee >(cut -f 1-4 | tr "\t" "\n" > read1.fq) \
| cut -f 5-8 | tr "\t" "\n" > read2.fq
This command converts the interlaced fastq file into 8-column tsv file, cuts columns 1-4 (read 1 lines), changes from tsv to fastq format (by replacing tabs with newlines) and redirects the output to read1.fq. In the same STDOUT stream (for speed), using tee, it cuts columns 5-8 (read 2 lines), etc, and redirects the output to read2.fq.
You can also use these command line tools:
iamdelf/deinterlace: Deinterlaces paired-end FASTQ files into first and second strand files.
https://github.com/iamdelf/deinterlace
deinterleave FASTQ files
https://gist.github.com/nathanhaigh/3521724
Or online tools with Galaxy web UI, for example this tool: "FASTQ splitter on joined paired end reads", installed on several public Galaxy instances, such as https://usegalaxy.org/ .
Avoid using a regex for simple fastq file parsing if you can use line numbers, both for speed (pattern matching is slower than simple counting) and for robustness.
Highly unlikely, but a pattern like ^#.*/1$ (or whatever the readers might change it to, while reusing this code later) can match also the base quality line. A good general rule is to simply rely on fastq spec, which says 4 lines per record.
Note that #, /, 1, and 2 characters are allowed in Illumina Phred scores: https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/QualityScoreEncoding_swBS.htm .
A one-liner that pulls out such (admittedly, very rare) reads is left as an exercise to the reader.

The fastq format uses 4 lines per read.
Your snippet has 5, as there are -- lines. That could cause confusion to softwares expecting a 4 line format.
You can add --no-group-separator to the grep call to avoid adding that separator.
I usually follow these steps to convert bam to fastq.gz
samtools bam2fq myBamfile.bam > myBamfile.fastq
cat myBamfile.fastq | grep '^#.*/1$' -A 3 --no-group-separator > sample_1.fastq
cat myBamfile.fastq | grep '^#.*/2$' -A 3 --no-group-separator > sample_2.fastq
gzip sample_1.fastq
gzip sample_1.fastq
Once you have the two files, you should order them to be sure that the reads are really paired.

We can split FASTQ files using Seqkit.
seqkit split2 -p 2 sample21_pe.unmapped.fq
https://bioinf.shenwei.me/seqkit/usage/#split2
Example 4 will help this question.
I'm not sure if it recognize the read ID. It split and write alternately into 1st-output-file and 2nd-output-file.

Related

Grepping twice using result of first Grep in Large file

Am given a list if ID which I need to trace back a name in a file
file: ID contains
1
2
3
4
5
6
The ID are contained in a Large 2 GB file called result.txt
ABC=John,dhds,72828,73737,3939,92929
CDE=John,uubad,32424,ajdaio,343533
FG1=Peter,iasisaio,097282,iosoido
WER=Ann,97391279,89719379,7391739
result,**id=1**,iuhdihdio,ihwoihdoih,iuqhwiuh,ABC
result2,**id=2**,9729179,hdqihi,hidqi,82828,CDE
result3,**id=3**,biasi,8u9829,90u209w,jswjso,FG1
So I cat the ID file into a variable
I then use this variable in a loop to grep out the values to link back to the name using grep and cut -d from results.txt and output to a variable
so variable contains ABS CDE FG1
In the same loop I pass the output of the grep to perform another grep on results.txt, to get the name
ie regrets file for ABC CDE FG1
I do get the answer but takes a long time is their a more efficient way?
Thanks
Making some assumptions about your requirement... ID's that are not found in the big file will not be shown in the output; the desired output is in the format shown below.
Here are mock input files - f1 for the id's and f2 for the large file:
[mathguy#localhost test]$ cat f1
1
2
3
4
5
6
[mathguy#localhost test]$ cat f2
ABC=John,dhds,72828,73737,3939,92929
CDE=John,uubad,32424,ajdaio,343533
FG1=Peter,iasisaio,097282,iosoido
WER=Ann,97391279,89719379,7391739
result,**id=1**,iuhdihdio,ihwoihdoih,iuqhwiuh,ABC
result2,**id=2**,9729179,hdqihi,hidqi,82828,CDE
result3,**id=3**,biasi,8u9829,90u209w,jswjso,FG1
Proposed solution and output:
[mathguy#localhost test]$ sed 's/.*/\*\*id=&\*\*/' f1 | grep -Ff - f2 | \
> sed -E 's/^.*\*\*id=([[:digit:]]*)\*\*.*,([^,]*)$/\1 \2/'
1 ABC
2 CDE
3 FG1
The hard work here is done by grep -F which might be just fast enough for your needs. There is some prep work and some clean-up work done by sed, but those are both on small datasets.
First we take the id's from the input file and we output strings in the format **id=<number>**. The output is presented as the fixed-character patterns to grep -F via the option -f (take the patterns from file, in this case from stdin, invoked as -; that is, from the output of sed).
After we find the needed lines from the big file, the final sed just extracts the id and the name from each line.
Note: this assumes that each id is only found once in the big file. (Actually the command will work regardless; but if there are duplicate lines for an id, your business users will have to tell you how to handle. What if you get contradictory names for the same id? Etc.)

How to grep multiple lines using a .txt vocab, matching only first word as variable?

I'm trying to reduce a .sm file1 - around 10 GB by filtering it using a fair long set of words (around 180.108 items) listed in a text file file2.
File1 is structured as follows:
word <http://internet.address.com> 1
i.e. one word followed by a blank space, an internet address, and a number.
File2 is a simple .txt file, a list of words, one on each line.
My aim is to create a third file File3 containing only those lines in file1 whose first word matches with the word-list of file2, and disregard the rest.
My attempt is the following:
grep -w -F -f file2.txt file1.sm > file3.sm
I've also attempted something along this line:
gawk 'FNR==NR {a[$1]; next } !($2 in a)' file2.txt file1.sm > file3.sm
but with no success. I understand /^ and \b might play a part here, but I don't know how to fit them in the syntax. I've looked around extensively but no solution seems to fit.
My problem is that here grep reads the entire file1's line, and it can happen that the matching word lies in the webpage address, which I'm not interested in finding out.
sed 's/^/^/' file2.txt | grep -f - file1.sm
join is the best tool for this, not grep/awk:
join -t' ' <(sort file1.sm) <(sort file2.txt) >file3.sm

Counting the number of times each pattern in a file appears in a separate file

I am trying to scan a file (test.txt), something like this:
make
bake
baker
makes
take
cook
sbake
for patterns listed in a separate file (ref.txt):
ake
make
bake
look
I have tried looping with grep like so:
while read seq; do grep -c "$seq" test.txt; done > out.txt < ref.txt
However, it doesn't count partial matches only exact matches (or inconsistent in counting partial matches) and I output:
4
1
2
0
instead of
6
2
3
0
Thanks for any help!
See why-is-using-a-shell-loop-to-process-text-considered-bad-practice for some, but not all, of the reasons not to try to do this with a shell loop.
The standard UNIX tool for manipulating text is awk:
$ awk 'NR==FNR{cnt[$0]=0;next} {for (re in cnt) cnt[re]+=gsub(re,"&")} END{for (re in cnt) print re, cnt[re]}' ref.txt test.txt
ake 6
bake 3
look 0
make 2
The above assumes the text in your ref.txt file doesn't contain any regexp metacharacters or if it does then a regexp match is desirable. If it can but you need a string instead of regexp match, you'd need a slightly different solution.
$ while read -r line; do grep -c $line test.txt ; done < ref.txt
6
2
3
0

How can I extract some data out of the middle of a noisy file using Perl 6?

I would like to do this using idiomatic Perl 6.
I found a wonderful contiguous chunk of data buried in a noisy output file.
I would like to simply print out the header line starting with Cluster Unique and all of the lines following it, up to, but not including, the first occurrence of an empty line. Here's what the file looks like:
</path/to/projects/projectname/ParameterSweep/1000.1.7.dir> was used as the working directory.
....
Cluster Unique Sequences Reads RPM
1 31 3539 3539
2 25 2797 2797
3 17 1679 1679
4 21 1636 1636
5 14 1568 1568
6 13 1548 1548
7 7 1439 1439
Input file: "../../filename.count.fa"
...
Here's what I want parsed out:
Cluster Unique Sequences Reads RPM
1 31 3539 3539
2 25 2797 2797
3 17 1679 1679
4 21 1636 1636
5 14 1568 1568
6 13 1548 1548
7 7 1439 1439
One-liner version
.say if /Cluster \s+ Unique/ ff^ /^\s*$/ for lines;
In English
Print every line from the input file starting with the once containing the phrase Cluster Unique and ending just before the next empty line.
Same code with comments
.say # print the default variable $_
if # do the previous action (.say) "if" the following term is true
/Cluster \s+ Unique/ # Match $_ if it contains "Cluster Unique"
ff^ # Flip-flop operator: true until preceding term becomes true
# false once the term after it becomes true
/^\s*$/ # Match $_ if it contains an empty line
for # Create a loop placing each element of the following list into $_
lines # Create a list of all of the lines in the file
; # End of statement
Expanded version
for lines() {
.say if (
$_ ~~ /Cluster \s+ Unique/ ff^ $_ ~~ /^\s*$/
)
}
lines() is like <> in perl5. Each line from each file listed on the command line is read in one at a time. Since this is in a for loop, each line is placed in the default variable $_.
say is like print except that it also appends a newline. When written with a starting ., it acts directly on the default variable $_.
$_ is the default variable, which in this case contains one line from the file.
~~ is the match operator that is comparing $_ with a regular expression.
// Create a regular expression between the two forward slashes
\s+ matches one or more spaces
ff is the flip-flop operator. It is false as long as the expression to its left is false. It becomes true when the expression to its left is evaluated as true. It becomes false when the expression to its right becomes true and is never evaluated as true again. In this case, if we used ^ff^ instead of ff^, then the header would not be included in the output.
When ^ comes before (or after) ff, it modifies ff so that it is also false the iteration that the expression to its left (or right) becomes true.
/^\*$/ matches an empty line
^ matches the beginning of a string
\s* matches zero or more spaces
$ matches the end of a string
By the way, the flip-flop operator in Perl 5 is .. when it is in a scalar context (it's the range operator in list context). But its features are not quite as rich as in Perl 6, of course.
I would like to do this using idiomatic Perl 6.
In Perl, the idiomatic way to locate a chunk in a file is to read the file in paragraph mode, then stop reading the file when you find the chunk you are interested in. If you are reading a 10GB file, and the chunk is found at the top of the file, it's inefficient to continue reading the rest of the file--much less perform an if test on every line in the file.
In Perl 6, you can read a paragraph at a time like this:
my $fname = 'data.txt';
my $infile = open(
$fname,
nl => "\n\n", #Set what perl considers the end of a line.
); #Removed die() per Brad Gilbert's comment.
for $infile.lines() -> $para {
if $para ~~ /^ 'Cluster Unique'/ {
say $para.chomp;
last; #Quit reading the file.
}
}
$infile.close;
# ^ Match start of string.
# 'Cluster Unique' By default, whitespace is insignificant in a perl6 regex. Quotes are one way to make whitespace significant.
However, in perl6 rakudo/moarVM the open() function does not read the nl argument correctly, so you currently can't set paragraph mode.
Also, there are certain idioms that are considered by some to be bad practice, like:
Postfix if statements, e.g. say 'hello' if $y == 0.
Relying on the implicit $_ variable in your code, e.g. .say
So, depending on what side of the fence you live on, that would be considered a bad practice in Perl.

Addressing a specific occurrence of a character in sed

How do I remove or address a specific occurrence of a character in sed?
I'm editing a CSV file and I want to remove all text between the third and the fifth occurrence of the comma (that is, dropping fields four and five) . Is there any way to achieve this using sed?
E.g:
% cat myfile
one,two,three,dropthis,dropthat,six,...
% sed -i 's/someregex//' myfile
% cat myfile
one,two,three,,six,...
If it is okay to consider cut command then:
$ cut -d, -f1-3,6- file
awk or any other tools that are able to split strings on delimiters are better for the job than sed
$ cat file
1,2,3,4,5,6,7,8,9,10
Ruby(1.9+)
$ ruby -ne 's=$_.split(","); s[2,3]=nil ;puts s.compact.join(",") ' file
1,2,6,7,8,9,10
using awk
$ awk 'BEGIN{FS=OFS=","}{$3=$4=$5="";}{gsub(/,,*/,",")}1' file
1,2,6,7,8,9,10
A real parser in action
#!/usr/bin/python
import csv
import sys
cr = csv.reader(open('my-data.csv', 'rb'))
cw = csv.writer(open('stripped-data.csv', 'wb'))
for row in cr:
cw.writerow(row[0:3] + row[5:])
But do note the preface to the csv module:
The so-called CSV (Comma Separated
Values) format is the most common
import and export format for
spreadsheets and databases. There is
no “CSV standard”, so the format is
operationally defined by the many
applications which read and write it.
The lack of a standard means that
subtle differences often exist in the
data produced and consumed by
different applications. These
differences can make it annoying to
process CSV files from multiple
sources. Still, while the delimiters
and quoting characters vary, the
overall format is similar enough that
it is possible to write a single
module which can efficiently
manipulate such data, hiding the
details of reading and writing the
data from the programmer.
$ cat my-data.csv
1
1,2
1,2,3
1,2,3,4,
1,2,3,4,5
1,2,3,4,5,6
1,2,3,4,5,6,
1,2,,4,5,6
1,2,"3,3",4,5,6
1,"2,2",3,4,5,6
,,3,4,5
,,,4,5
,,,,5
$ python csvdrop.py
$ cat stripped-data.csv
1
1,2
1,2,3
1,2,3
1,2,3
1,2,3,6
1,2,3,6,
1,2,,6
1,2,"3,3",6
1,"2,2",3,6
,,3
,,
,,

Resources