Find lines that end with "+" - grep

I am unable to form a grep regex that will find only those lines that end with a + sign. Example:
Should match - This is a sample line +
Should not match - This is another with + in between, or This is another with + in between and ends with +

Use $ to anchor the match at the end of the line:
grep '+$' file
Example
$ cat a
This is a sample line +
This is another with + in between
hello
$ grep '+$' a
This is a sample line +
Update
What if I want to display lines which only have a + at the end? Even if a
line is like This is a line with + in bw and in the end +, I
don't want this line to be matched.
Then you can use awk:
awk '/\+$/ && split($0, a, "\\+")==2' file
Explanation
/\+$/ matches lines ending with +.
split($0, a, "\\+")==2 splits the line into pieces using + as the delimiter. The return value is the number of pieces, so 2 means the line contains exactly one +.
Example
$ cat a
This is a sample line +
This is another with + in between
Hello + and +
hello
$ awk '/\+$/ && split($0, a, "\\+")==2' a
This is a sample line +

Specify that the ending + must not be preceded by any other +; as a regular expression:
grep '^[^+]*+$'
Output when tested on the 2nd version of fedorqui's a file:
This is a sample line +
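A reproducible check of that approach (the sample file is rebuilt inline; the pattern allows only non-+ characters before a single, final +):

```shell
# Build the sample file; only the first line has its single '+' at the very end.
printf '%s\n' 'This is a sample line +' \
              'This is another with + in between' \
              'Hello + and +' > a
# ^[^+]*+$ : zero or more non-+ characters, then one literal '+' as the last character.
grep '^[^+]*+$' a
```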

Related

Counting specific lines that don't contain specific word

I have a question: I have a file like this
@HWI-ST273:296:C0EFRACXX:2:2101:17125:145325/1
TTAATACACCCAACCAGAAGTTAGCTCCTTCACTTTCAGCTAAATAAAAG
+
8?8A;DDDD;#?++8A?;C;F92+2A#19:1*1?DDDECDE?B4:BDEEI
@BBBB-ST273:296:C0EFRACXX:2:1303:5281:183410/1
TAGCTCCTTCGCTTTCAGCTAAATAAAAGCCCAGTACTTCTTTTTTACCA
+
CCBFFFFFFHHHHJJJJJJJJJIIJJJJJJJJJJJJJJJJJJJIJJJJJI
@HWI-ST273:296:C0EFRACXX:2:1103:16617:140195/1
AAGTTAGCTCCTTCGCTTTCAGCTAAATAAAAGCCCAGTACTTCTTTTTT
+
#C#FF?EDGFDHH#HGHIIGEGIIIIIEDIIGIIIGHHHIIIIIIIIIII
@HWI-ST273:296:C0EFRACXX:2:1207:14316:145263/1
AATACACCCAACCAGAAGTTAGCTCCTTCGCTTTCAGCTAAATAAAAGCC
+
CCCFFFFFHHHHHJJJJJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJIJ
I'm interested just in the lines that start with '@HWI', but I want to count all the lines that do not start with '@HWI'. In the example shown, the result will be 1 because there is one line that starts with '@BBB'.
To be more clear: I just want to count the first lines of the four-line records that are not '@HWI'. I hope I'm clear enough; please tell me if you need more clarification.
With GNU sed, you can use its extended address syntax to print every fourth line, then use grep to count the ones that don't start with @HWI:
sed -n '1~4p' file.fastq | grep -cv '^@HWI'
Otherwise, you can use e.g. Perl:
perl -ne 'print if 1 == $. % 4' -- file.fastq | grep -cv '^@HWI'
$. contains the current line number; % is the modulo operator.
But once we're running Perl, we don't need grep anymore:
perl -lne '++$c if 1 == $. % 4 and !/^@HWI/; END { print $c + 0 }' -- file.fastq
-l removes newlines from input and adds them to output; printing $c + 0 ensures 0 is shown when nothing matches.
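The same header-line count can also be done in plain awk; a minimal sketch, with toy four-line records standing in for the real file:

```shell
# Build a tiny FASTQ-like file: two records, one header not starting with @HWI.
printf '%s\n' '@HWI-ST273:1/1' 'ACGT' '+' 'IIII' \
              '@BBBB-ST273:2/1' 'TTGA' '+' 'IIII' > file.fastq
# Every 4th line starting at line 1 is a header; count those not matching ^@HWI.
awk 'NR % 4 == 1 && !/^@HWI/ { c++ } END { print c + 0 }' file.fastq
```

For the sample above this prints 1, matching the grep -cv pipeline.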

How to split paired-end fastq files?

I have Illumina paired-end reads contained within one .fastq file, denoted as '/1' for forward reads and '/2' for reverse reads.
I am using grep to pull out the individual reads and place them into two respective files (one for forward reads and one for reverse):
grep -A 3 "/1$" sample21_pe.unmapped.fq > sample21_1_rfa.fq
grep -A 3 "/2$" sample21_pe.unmapped.fq > sample21_2_rfa.fq
However, when I try to use the files (fastqc, assembly, etc.), they do not work. When running
fastqc I get the following error:
Failed to process file sample21_1_rfa.fq
uk.ac.babraham.FastQC.Sequence.SequenceFormatException: ID line didn't start with '@'
at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext(FastQFile.java:134)
at uk.ac.babraham.FastQC.Sequence.FastQFile.next(FastQFile.java:105)
at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:76)
at java.lang.Thread.run(Thread.java:662)
But if you look at the files, the identifiers do indeed start with an '@'. Any advice on why these files aren't working? I had originally converted .bam files into .fastq files with
samtools bam2fq
Here are samples of each individual file:
merged .fastq
@HISEQ:534:CB14TANXX:4:1101:1091:2161/1
GAGAAGCTCGTCCGGCTGGAGAATGTTGCGCTTGCGGTCCGGAGAGGACAGAAATTCGTTGATGTTAACGGTGCGCTCGCCGCGGACGCTCTTGATGGTGACGTCGGCGTTGAGCGTGACGCACG
+
B/</<//B<BFF<FFFFFF/BFFFFFFB<BFFF<B/7FFF7B/B/FF/F/<<F/FFBFFFBBFFFBFB/FF<BBB<B/B//BBFFFFFFF/B/FF/B77B//B7B7F/7F###############
@HISEQ:534:CB14TANXX:4:1101:1091:2161/2
TGACGCCTGCCGTCAGGTAGGTTCTCCGCAGATCCGAAATCTCGCGACGCTCGGCGGCAACATCTGCCAGTCGTCCGTGGCGGGCGACGGTCTCGCGGCGTGCGTCACGCTCAACGCCGACGTAC
+
/B<B//F/F//B<///<FB/</F<<FFFFF<FFBF/FF<//FB/F//F7FBFFFF/B</7<F//<BB7/7BB7/B<F7BF<BFFFB7B#####################################
@HISEQ:534:CB14TANXX:4:1101:1637:2053/1
NGTTTACCATACAACAATCTTGCGACCTATTCAAATCATCTATATGCCTTATCAAGTTTTCATAGCTTTCAAGATTCTCAATTTCCTCACGTCTCGCTTTGCTCAACCTACAAAAACTCTCTTCT
+
#<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFFFFFFFFFFFB/BFBBFBB<<<<FFFFFFBB<FBFFBFF
@HISEQ:534:CB14TANXX:4:1101:1637:2053/2
TCGGTCGTTGGGAAAAGACCTGTGGTAAACATCCTACGCAAAAGCCATTGCGGTTACTCGTTCGTATGATTCTTGCATCAACTAATCAAGGCGATTGGGTTCTCGACCCATTTTGTGGAAGTTCG
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFBFFFFFB<FF<<BBFB
@HISEQ:534:CB14TANXX:4:1101:1792:2218/1
TCTATCGGCTGACCGATAAGCTGTCGCCTGCCGACCGTCCTGCCATGGGACGGCGCATCGCACAGCTCACCCTGGACTAACTCTCCAACACCATGATGCTGACACGCTCGGCAAAAACACCCGAT
+
<<B/<B</FF/<B/<//F<//FF<<<FF//</7/F<</FFF####################################################################################
@HISEQ:534:CB14TANXX:4:1101:1792:2218/2
TGCCGGAGGGCGTCGATGGTGGCATCGAGCTTTTTTGCCGAGCGTGTCAGCATGATGGTGTTGTAGAGATAGTCCATGGTGAGCTGTGCGATGCGCCGTTCCATGGCAGGACGGTCGGCAGGCGC
+
BBBBBFFFFFFFFBFFFBBFFFFFFFFFFFBBFFFF/FF<F7FF//F/FBB/FFBFFF/F7BFF<F/FFFFFFFFB/7BB<7BFFFFFFFFFFFFF<B///B/7B/7/B//77BB//7B/B7/B#
@HISEQ:534:CB14TANXX:4:1101:1903:2238/1
TATTCCAGCGACCGTTATAATCAAACTCAACTACATAGTCATTGCGGATTGCTTCAAGAAATTTTTTCCAGACTATTTCATCAATATTTATTTTGGGAACTGGTGCAACAGCAATTCTTTTTAAA
+
BBBBBFFFFFFFFFFFFFBFF/FFBFFBFFFFFFFF/FFFFFF<<FFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFBF/B/<B<B/FBF7/<FFFFFFF/BB/7///7FF<BFFF//B/FFF###
@HISEQ:534:CB14TANXX:4:1101:1903:2238/2
TAAGGTTGGAGAAGCAACAATTTACCGTGATATTGATTTGCTCCGAACATATTTTCATGCGCCACTCGAGTTTGACAGGGAGAAAGGCGGGTATTATTATTTTAATGAAAAATGGGATTTTGCCC
+
B<BBBFFFFFFF<FFFFFFFFFFFFFFFFFF/BFFFFFFF<<FF<F<FFF/FF/FFFFBFB</<//<B/////<<FFFFB/<F<BFF/7/</7/7FB/B/BFF<//7BFF###############
@HISEQ:534:CB14TANXX:4:1101:2107:2125/1
TGTAGTATTTATTACATCATATAGAATGTAGATATAAAAGATGAAAAAGCTATAATTTCTTTGATAATATAAGGAGGGAATAACACTATGAGGATTGATAGAGCAGGAATCGAGGATTTGCCGAT
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFF/FFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBFFFFFFFBB<FBB7BFF#
@HISEQ:534:CB14TANXX:4:1101:2107:2125/2
TACCACTATCGGCAAATCCTCGATTCCTGCTCTATCAATCCTCATAGTGTTATTCCCTCCTTATATTATCAAAGAAATTATAGCTTTTTCATCTTTTATATCTACATTCTATATGATGTAATAAA
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFFBFBFFFFFFFBBFFFFFFFBF7F/B/BBF7/</FF/77F/77BB#
@HISEQ:534:CB14TANXX:4:1101:2023:2224/1
TCACCAGCTCGGCACGCTTGTCCTTGACCTCCTGCTCGATCTGACCGTCCATCTTGGCTGCCACGGTGTTCTCCTCGGCGGAGTAGGCAAAGCAGCCCAGACGGTCGAACTGTATCTCCTTGACA
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFB<<B7BBFBFFF<FFBBFFFBF/7B/<B<
@HISEQ:534:CB14TANXX:4:1101:2023:2224/2
TCGAGGATCTGTGCAACTTTGTCAAGGAGATACAGTTCGACCGTCTGGGCTGCTTTGCCTACTCCGCCGAGGAGAACACCGTGGCAGCCAAGATGGACGGTCAGATCGAGCAGGAGGTCAAGGAC
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFBFBFFFFFFFFFFFFFFFFFFFFFFFFFBBFFFFFFFFFFFFF<7BF/<<BB###
@HISEQ:534:CB14TANXX:4:1101:2038:2235/1
TTTATGCGAATGTAGAGTGGCTTCTCCACTGCCTCGGTGAAGCCCACGCGCGAGATGAGCGAATTAAGCTGCTTTGCAGTGAATTGCATTGCATATACACCTGCGTCGGCTTGAATACTTGTGCT
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFF//BFFFFFFFFFFFFF<B<BB###
@HISEQ:534:CB14TANXX:4:1101:2038:2235/2
AATCCGCTCGTGAAAGCTCCCGATAACGCCACAGTGAACACCGTGGAGTTCTCTGATACCGAAGATTTCGCACGCAGCACAAGTATTCAAGCCGACGCAGGTGTATATGCAATGCAATTCACTGC
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBFFFFFFFFFFFFFFFFFFFFFFF
@HISEQ:534:CB14TANXX:4:1101:2271:2041/1
NACACTTGTCGATGATCTTGCCAAGCTGCTTCTTGCCCACCAGGAAGCCGATCTCCAGATCAAACTCGTGGCCGGGAACACTCCGGTCCACAAAGCCCAGGTCCTGGGGAATGGGCTCATCGTAG
+
#<</BB/F/BB/F<FFFFFFFFF/<BFFFFFFFF<<FFBFFFFFFBFBFBBB<<FFFFBFFF/<B/FFFFFFFFFFFFFFFFF<FB<<BFF77BFFF/<BFFFB<</BB</7BFFFB########
@HISEQ:534:CB14TANXX:4:1101:2271:2041/2
GACTCATCTACAATGAGCCCATTCCCCAGGACCTGGGCTTTGTGGACCGGAGTGTTCCCGGCCACGAGTTTGATCTGGAGATCGGCTTCCTGGTGGGCAAGAAGCAGCTTGGCAAGATCATCGCC
+
<<BBBFFF<F/BFFFBFBF<BFF<<F/FFFBFFFF<<FFFFBFFFFFFBFFF/<B<F/<</<FFF//FFFFF/<<F/B/B/7/FF<<FF/7B/BBB/7///7////<B/B/BB/B/B/B/7BB##
Example of forward reads after being pulled out and placed into their own .fastq file:
@HISEQ:534:CB14TANXX:4:1101:1091:2161/1
GAGAAGCTCGTCCGGCTGGAGAATGTTGCGCTTGCGGTCCGGAGAGGACAGAAATTCGTTGATGTTAACGGTGCGCTCGCCGCGGACGCTCTTGATGGTGACGTCGGCGTTGAGCGTGACGCACG
+
B/</<//B<BFF<FFFFFF/BFFFFFFB<BFFF<B/7FFF7B/B/FF/F/<<F/FFBFFFBBFFFBFB/FF<BBB<B/B//BBFFFFFFF/B/FF/B77B//B7B7F/7F###############
--
@HISEQ:534:CB14TANXX:4:1101:1637:2053/1
NGTTTACCATACAACAATCTTGCGACCTATTCAAATCATCTATATGCCTTATCAAGTTTTCATAGCTTTCAAGATTCTCAATTTCCTCACGTCTCGCTTTGCTCAACCTACAAAAACTCTCTTCT
+
#<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFFFFFFFFFFFB/BFBBFBB<<<<FFFFFFBB<FBFFBFF
--
@HISEQ:534:CB14TANXX:4:1101:1792:2218/1
TCTATCGGCTGACCGATAAGCTGTCGCCTGCCGACCGTCCTGCCATGGGACGGCGCATCGCACAGCTCACCCTGGACTAACTCTCCAACACCATGATGCTGACACGCTCGGCAAAAACACCCGAT
+
<<B/<B</FF/<B/<//F<//FF<<<FF//</7/F<</FFF####################################################################################
--
@HISEQ:534:CB14TANXX:4:1101:1903:2238/1
TATTCCAGCGACCGTTATAATCAAACTCAACTACATAGTCATTGCGGATTGCTTCAAGAAATTTTTTCCAGACTATTTCATCAATATTTATTTTGGGAACTGGTGCAACAGCAATTCTTTTTAAA
+
BBBBBFFFFFFFFFFFFFBFF/FFBFFBFFFFFFFF/FFFFFF<<FFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFBF/B/<B<B/FBF7/<FFFFFFF/BB/7///7FF<BFFF//B/FFF###
--
@HISEQ:534:CB14TANXX:4:1101:2107:2125/1
TGTAGTATTTATTACATCATATAGAATGTAGATATAAAAGATGAAAAAGCTATAATTTCTTTGATAATATAAGGAGGGAATAACACTATGAGGATTGATAGAGCAGGAATCGAGGATTTGCCGAT
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFF/FFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBFFFFFFFBB<FBB7BFF#
--
@HISEQ:534:CB14TANXX:4:1101:2023:2224/1
TCACCAGCTCGGCACGCTTGTCCTTGACCTCCTGCTCGATCTGACCGTCCATCTTGGCTGCCACGGTGTTCTCCTCGGCGGAGTAGGCAAAGCAGCCCAGACGGTCGAACTGTATCTCCTTGACA
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFB<<B7BBFBFFF<FFBBFFFBF/7B/<B<
--
@HISEQ:534:CB14TANXX:4:1101:2038:2235/1
TTTATGCGAATGTAGAGTGGCTTCTCCACTGCCTCGGTGAAGCCCACGCGCGAGATGAGCGAATTAAGCTGCTTTGCAGTGAATTGCATTGCATATACACCTGCGTCGGCTTGAATACTTGTGCT
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFF//BFFFFFFFFFFFFF<B<BB###
--
@HISEQ:534:CB14TANXX:4:1101:2271:2041/1
NACACTTGTCGATGATCTTGCCAAGCTGCTTCTTGCCCACCAGGAAGCCGATCTCCAGATCAAACTCGTGGCCGGGAACACTCCGGTCCACAAAGCCCAGGTCCTGGGGAATGGGCTCATCGTAG
+
#<</BB/F/BB/F<FFFFFFFFF/<BFFFFFFFF<<FFBFFFFFFBFBFBBB<<FFFFBFFF/<B/FFFFFFFFFFFFFFFFF<FB<<BFF77BFFF/<BFFFB<</BB</7BFFFB########
--
@HISEQ:534:CB14TANXX:4:1101:2678:2145/1
CTGTACATAGTACGTATTTGACGCCTGCGTCGATGTAGCGTTTGAGGAAGGGAAGCAGCGGTTCTGCAGAGTCCTCTTTCCATCCGTTGATGCTAATCATTCCGTTGCGTACATCCGCTCCGAGA
+
BBBBBFFFFFFF<FFF<FFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<BFFF7BFFFFFFFF<BBFFFFFFFFBBFBBB<FFBFFFFFFFFFFFFB<BFFFFFFBFB/BFFF####
--
@HISEQ:534:CB14TANXX:4:1101:2972:2114/1
CTCTGTGCCGATCCCTTTGCCTTTGCGTTTTGAGGAAAGGAAACCACCTTCTGGGTCGGTGAGGATAGTTCCGGTGAAGGTGTTGTCCACCGCCAGGCATAGGGAATAGCTGTCAGCCTTTGCTC
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFB/FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFBFFFF<FFFFFFFFFF<BFFFFF
--
@HISEQ:534:CB14TANXX:4:1101:2940:2222/1
CTAATTTTTTCATTATATTACTAATTTTGTAATTGGTAAAATATTATAATATCCTTGTACATTAAGACCCCAATAATCAGAAGAAGTAAAATTAATTCCTGCAACAGTTCTTAAATATCCATTAG
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FBFFFFFFFFFFFFFFF/FBFBFFBFFFFF/<F<FFFFFFFFFF<FFFFFFBFFFFFFFFF</FBFBBF<F/7//FFBFBBFFF/<7BF#
--
@HISEQ:534:CB14TANXX:4:1101:3037:2180/1
CGTCAGTTCCGCAACGATAAAGAGTTCCGCATTGCAGTCACCTGTACGCTGGTAGCCACCGGAACCGATGTCAAGCCGTTGGAGGTGGTGATGTTCATGCGCGACGTAGCTTCCGAGCCGTTATA
+
B/BBBBBFFFFFFF<FFBFFFFFF<FFFFBFFFFFFF<BBFFFFFFFFFFFFFFFFFBFF/FFFFBFFBFFFFBFF/7F/BFB/BBFFFFFFFFBFF<BBF<7BBFFFFFFBBFFF/B#######
--
@HISEQ:534:CB14TANXX:4:1101:3334:2171/1
ACCGATGTACATACCCGGACGGGTACGCACATGCTCCATATCGCTCAAGTGGCGGATATTGTCATCTGTATATTCTACAGGTTGCTCCTGAGGGGTATTTGCCAGTTCTTCGGCAGCACCCTTTT
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFBFFFFFFFFFFFBFFFFFFFFFF</<BFFFFFFFFBBFFFFFFBF</BB///BF<FFFFF<</<B
--
@HISEQ:534:CB14TANXX:4:1101:3452:2185/1
CGCAGACGGATTTGCTTGAAGTCCGTCTCATCGTATTCCGACAACTCATCGAGGAACACACGCTTGTATTGACTGATACCCTTGATTTTCTCCGGGTCGTCAAGACCACTGAAATCAATCTTGCC
+
BBBBBFFFF<FFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFBFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFBFFFFFBFFBFFFFFFFFFB/77B/FBBFFF/<FFF/77BBFFFBFFBBB
--
Any advice would be appreciated. Thanks!
In general, this operation is called deinterlace fastq or deinterleave fastq. This question already has an answer here:
deinterleave fastq file
https://www.biostars.org/p/141256/
I am copying it here, with minor reformatting for clarity:
paste - - - - - - - - < interleaved.fq \
| tee >(cut -f 1-4 | tr "\t" "\n" > read1.fq) \
| cut -f 5-8 | tr "\t" "\n" > read2.fq
This command converts the interleaved fastq file into an 8-column TSV file, cuts columns 1-4 (the read 1 lines), converts back from TSV to fastq format (by replacing tabs with newlines), and redirects the output to read1.fq. In the same STDOUT stream (for speed), using tee, it cuts columns 5-8 (the read 2 lines) and redirects the output to read2.fq.
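If process substitution is not available (it is a bash feature), the same four-lines-per-record bookkeeping can be sketched in plain awk; the file names here are placeholders and the toy input stands in for real data:

```shell
# Build a toy interleaved file: a read 1 record followed by its read 2 record.
printf '%s\n' '@r1/1' 'ACGT' '+' 'IIII' '@r1/2' 'TTGA' '+' 'JJJJ' > interleaved.fq
# Lines 1-4 of every 8-line group belong to read 1, lines 5-8 to read 2.
awk '{ print > ((NR - 1) % 8 < 4 ? "read1.fq" : "read2.fq") }' interleaved.fq
```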
You can also use these command line tools:
iamdelf/deinterlace: Deinterlaces paired-end FASTQ files into first and second strand files.
https://github.com/iamdelf/deinterlace
deinterleave FASTQ files
https://gist.github.com/nathanhaigh/3521724
Or use online tools with the Galaxy web UI, for example the "FASTQ splitter on joined paired end reads" tool, installed on several public Galaxy instances such as https://usegalaxy.org/ .
Avoid using a regex for simple fastq file parsing if you can use line numbers, both for speed (pattern matching is slower than simple counting) and for robustness.
Highly unlikely, but a pattern like ^@.*/1$ (or whatever readers might change it to while reusing this code later) can also match a base quality line. A good general rule is to simply rely on the fastq spec, which says 4 lines per record.
Note that the @, /, 1, and 2 characters are all allowed in Illumina Phred quality strings: https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/QualityScoreEncoding_swBS.htm .
A one-liner that pulls out such (admittedly, very rare) reads is left as an exercise to the reader.
The fastq format uses 4 lines per read.
Your snippet has 5, as there are -- separator lines. That could confuse software expecting the 4-line format.
You can add --no-group-separator to the grep call to avoid adding that separator.
I usually follow these steps to convert bam to fastq.gz
samtools bam2fq myBamfile.bam > myBamfile.fastq
cat myBamfile.fastq | grep '^@.*/1$' -A 3 --no-group-separator > sample_1.fastq
cat myBamfile.fastq | grep '^@.*/2$' -A 3 --no-group-separator > sample_2.fastq
gzip sample_1.fastq
gzip sample_2.fastq
Once you have the two files, you should order them to make sure that the reads are really paired.
We can split FASTQ files using Seqkit.
seqkit split2 -p 2 sample21_pe.unmapped.fq
https://bioinf.shenwei.me/seqkit/usage/#split2
Example 4 in the usage documentation covers this question.
I'm not sure whether it recognizes the read IDs, though; it splits and writes reads alternately into the first and second output files.

Parsing text from .txt files

I have a tab-delimited log file, but I need only a few characters from the lines marked 30.10 at the beginning.
Using the command
awk '/^30.10/{print}' FOOD_ORDERS_201907041307.DEL
I get this output:
30.1006 35470015000205910002019070420190705 00000014870000000034
30.1006 35470015000205900002019070420190705 00000014890000000029
30.1006 35470023000205920002019070420190705 00000014900000000011
What I need to extract is 3547 and the last n characters from the very end, after the zeros.
So, the expected output will be:
3547
34
29
11
But if the last 10 characters contain leading zeros followed by a number, I need that number.
While your question is unclear, your answer to Ed Morton's comment provides a bit more clarity on what you are trying to achieve. Where it is still unclear is exactly what you want from the third field. From your question and the various comments, it appears that if the line begins with 30.10 you want the first 4 digits of the second field and the rightmost digits in [1-9] from the third field.
If that accurately captures what you need, then awk with a combination of the substr, match and length string functions can isolate the digits you are interested in. For example:
awk '/^30\.10/ {
l = match($3, /[1-9]+$/)
print substr($2, 1, 4) " " substr($3, l, length($3) - l + 1)
}' test
Would take the input file (borrowed from Dudi Boy's answer), e.g.
$ cat test
30.1006 35470015000205910002019070420190705 00000014870000000034
30.1006 35470015000205900002019070420190705 00000014890000001143
30.1006 35470015000205900002019070420190705 00000014890000000029
30.1006 35470023000205920002019070420190705 00000014900000000011
and return to you:
3547 34
3547 1143
3547 29
3547 11
Let me know if that accurately captures what you need.
Here is a simple awk script to do the task:
script.awk
/^30\.10/ { # for each line starting with 30.10
last2chars = substr($3, length($3)-1); # extract last 2 chars from 3rd field into variable last2chars
if($3 ~ /00001143$/) last2chars = 1143; # if 3rd field ends with 1143, update variable last2chars respectively
print last2chars; # output variable last2chars
}
input.txt
30.1006 35470015000205910002019070420190705 00000014870000000034
30.1006 35470015000205900002019070420190705 00000014890000001143
30.1006 35470015000205900002019070420190705 00000014890000000029
30.1006 35470023000205920002019070420190705 00000014900000000011
running:
awk -f script.awk input.txt
output:
34
1143
29
11
GOT Part of it!
awk '/^30.10/{print}' FOOD_ORDERS_201907041307.DEL | sed 's/.*\(..\)/\1/'
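Reading the requirement as "take the last 10 characters of the third field and strip the leading zeros" (an assumption based on the sample data; orders.txt stands in for the real log file), a single awk pass covers both pieces:

```shell
# Two sample lines copied from the question.
printf '%s\n' \
  '30.1006 35470015000205910002019070420190705 00000014870000000034' \
  '30.1006 35470015000205900002019070420190705 00000014890000001143' > orders.txt
awk '/^30\.10/ {
    t = substr($3, length($3) - 9)   # last 10 characters of the 3rd field
    sub(/^0+/, "", t)                # strip the leading zeros
    print substr($2, 1, 4), t        # first 4 digits of field 2, then the number
}' orders.txt
```

Unlike the [1-9]+$ match, this also handles trailing numbers that contain internal zeros (e.g. ...0000001043 yields 1043).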

Join multiple lines into One (.cap file) CentOS

A single entry spans multiple lines, and entries are separated by blank lines.
Each entry has to be made into a single line followed by a delimiter (;).
Sample Input:
Name:Sid
ID:123

Name:Jai
ID:234

Name:Arun
ID:12
Tried replacing the blank lines with cat test.cap | tr -s '[:space:]' ';'
Output:
Name:Sid;ID:123;Name:Jai;ID:234;Name:Arun;ID:12;
Expected Output:
Name:SidID:123;Name:JaiID:234;Name:ArunID:12;
The same happens with xargs.
I've used a sed command as well, but it only joined two lines into one, whereas I have 132 lines in one entry and 1000 such entries in one file.
You may use
cat file | awk 'BEGIN { FS = "\n"; RS = "\n\n"; ORS=";" } { gsub(/\n/, "", $0); print }' | sed 's/;;*$//' > output.file
Output:
Name:SidID:123;Name:JaiID:234;Name:ArunID:12
Notes:
FS = "\n" sets the field separator to a newline
RS = "\n\n" sets the record separator to a double newline, i.e. a blank line (multi-character RS requires GNU awk)
gsub(/\n/, "", $0) removes all newlines from a matched record
sed 's/;;*$//' removes the trailing ; added by awk
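The two-character RS above relies on GNU awk; a portable variant uses paragraph mode, where an empty RS splits records on blank lines (test.cap is the file name from the question):

```shell
# Recreate the sample input, entries separated by blank lines.
printf 'Name:Sid\nID:123\n\nName:Jai\nID:234\n\nName:Arun\nID:12\n' > test.cap
# Paragraph mode: RS="" makes each blank-line-separated block one record.
awk 'BEGIN { RS = ""; ORS = ";" } { gsub(/\n/, "") } 1' test.cap
```

This keeps the trailing ; (matching the expected output in the question); pipe through sed 's/;$//' if you want it dropped.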
Could you please try the following.
awk 'NF{val=(val?$0~/^ID/?val $0";":val $0:$0)} END{print val}' Input_file
Output will be as follows.
Name:SidID:123;Name:JaiID:234;Name:ArunID:12;
Explanation of the above code:
awk ' ##Starting awk program here.
NF{ ##Checking condition if a LINE is NOT NULL and having some value in it.
val=(val?$0~/^ID/?val $0";":val $0:$0) ##Append the current line to val; if the line starts with ID, also append a trailing semicolon.
}
END{ ##Starting END section of awk here.
print val ##Printing value of variable val here.
}
' Input_file ##Mentioning Input_file name here.
This might work for you (GNU sed):
sed -r '/./{N;s/\n//;H};$!d;x;s/.//;s/\n|$/;/g' file
If it is not a blank line, append the following line and remove the newline between them. Append the result to the hold space and if it is not the end of the file, delete the current line. At the end of the file, swap to the hold space, remove the first character (which will be a newline) and then replace all newlines (append an extra semi-colon for the last line only) with semi-colons.
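An end-to-end check of that sed command on the question's sample (GNU sed; entries separated by blank lines as described):

```shell
# Recreate the blank-line-separated input.
printf 'Name:Sid\nID:123\n\nName:Jai\nID:234\n\nName:Arun\nID:12\n' > file
# Join each entry onto one line, then emit everything as ;-terminated entries.
sed -r '/./{N;s/\n//;H};$!d;x;s/.//;s/\n|$/;/g' file
```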

ANTLR4 lexer rules not matching correct block of text

I am trying to understand how ANTLR4 works with lexer and parser rules, but I am missing something in the following example.
I am trying to parse a file and match all mathematical additions (e.g. 1+2+3, etc.). My file contains the following text:
start
4 + 5 + 22 + 1
other text other text test test
test test other text
55 other text
another text 2 + 4 + 255
number 44
end
and I would like to match
4 + 5 + 22 + 1
and
2 + 4 + 255
My grammar is as follows:
grammar Hello;
hi : expr+ EOF;
expr : NUM (PLUS NUM)+;
PLUS : '+' ;
NUM : [0-9]+ ;
SPACE : [\n\r\t ]+ ->skip;
OTHER : [a-z]+ ;
My abstract syntax tree is visualized as follows (tree image omitted):
Why does rule 'expr' match the text 'start'? I also get the error "extraneous input 'start' expecting NUM".
If I make the following change in my grammar
OTHER : [a-z]+ ->skip;
the error is gone. In addition, in the image above, the text '55 other text
another text' matches the expression as a node in the AST. Why is this happening?
Does all of the above have to do with the way the lexer matches input? I know that the lexer looks for the first, longest matching rule, but how can I change my grammar so that it matches only the additions?
Why does rule 'expr' match the text 'start'?
It doesn't. When a token shows up red in the tree, that indicates an error. The token did not match any of the possible alternatives, so an error was produced and the parser continued with the next token.
In addition in the image above text '55 other text another text' matches the expression as a node in the AST. Why is this happening?
After you skipped the OTHER tokens, your input basically looks like this:
4 + 5 + 22 + 1 55 2 + 4 + 255 44
4 + 5 + 22 + 1 can be parsed as an expression, no problem. After that the parser either expects a + (continuing the expression) or a number (starting a new expression). So when it sees 55, that indicates the start of a new expression. Now it expects a + (because the grammar says that PLUS NUM must appear at least once after the first number in an expression). What it actually gets is the number 2. So it produces an error and ignores that token. Then it sees a +, which is what it expected. And then it continues that way until the 44, which again starts a new expression. Since that isn't followed by a +, that's another error.
Does all of the above have to do with the way the lexer matches input?
Not really. The token sequence for "start 4 + 5" is OTHER NUM PLUS NUM, or just NUM PLUS NUM if you skip the OTHERs. The token sequence for "55 skippedtext 2 + 4" is NUM NUM PLUS NUM. I assume that's exactly what you'd expect.
Instead, what seems to be confusing you is how ANTLR recovers from errors (or maybe the fact that it recovers from errors at all).
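To directly answer the last part of the question: one sketch (an assumption about the desired behavior, not the only possible design) is to keep OTHER unskipped and give the parser an explicit rule for stray tokens, so only genuine additions become expr nodes:

```antlr
grammar Hello;

hi   : (expr | junk)* EOF ;   // accept additions, explicitly discard the rest
expr : NUM (PLUS NUM)+ ;      // an addition needs at least one "+ NUM" tail
junk : NUM | OTHER ;          // a stray number or word, matched but ignored

PLUS  : '+' ;
NUM   : [0-9]+ ;
SPACE : [\n\r\t ]+ -> skip ;
OTHER : [a-z]+ ;
```

With this grammar, 'start', the lone '55' and '44', and the words parse as junk, while '4 + 5 + 22 + 1' and '2 + 4 + 255' parse as expr, and no error recovery is needed.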
