Match patterns in files and cut/delete lines in between - parsing

I have a text (input.txt) file containing the following pattern:
#########
##### AV1
#########
great picture
good tv
decent sound
vibrant color
#########
#########
#### AV2 new TV
#########
we are testing this out now
this is 4K
need HDMI
#########
#########
### AV3 not working
#########
not enough ports
buy new device
#########
What I am in need for is the following output of 3 seperate files:
File 1.txt
#########
##### AV1
#########
great picture
good tv
decent sound
vibrant color
#########
File 2.txt
#########
#### AV2 new TV
#########
we are testing this out now
this is 4K
need HDMI
#########
File 3.txt
#########
### AV3 not working
#########
not enough ports
buy new device
#########
I am not sure how to search and match patterns in 3 consecutive lines and then ignore the texts in between and find the #### at the end. Then I need to take the texts in between and output a new file.
I am able to "greedily" match the first 3 lines with the following regex:
/^#*\_^#*\_^#.*$

You may use this awk:
awk '
/^#{7,}/ {
++n
if (n%3 == 1) {
close(fn)
fn = "file" (++c) ".txt"
}
}
{
print > fn
}' file

When your requirement may be formulated as "split after the first line of each pair of lines with 9 hashes", you can use
csplit -b "%d" --suppress-matched -f "File " \
<(sed -rz 's/(#{9})\n\1/\1\n\n\1/g' input.txt) /^$/ '{*}'
Explanation:
-b "%d": Get filenumbers without leading 0
--suppress-matched: Skip the empty lines that will be inserted
-f "File ": basename of the files created
<(...): use output of the command like it is a file
sed -rz 's/(#{9})\n\1/\1\n\n\1/g' input.txt: create an empty line between 2 patterns with 9 times #
/^$/: Empty line
'{*}': Repeat csplit pattern for the complete file

Related

How to split paired-end fastq files?

I have Illumina paired-end reads contained within one .fastq file, denoted as '/1' for forward reads and '/2' for reverse reads.
I am using grep to pull out the individual reads and place them into 2 respective files (one for forward reads and one for reverse.
grep -A 3 "/1$" sample21_pe.unmapped.fq > sample21_1_rfa.fq
grep -A 3 "/2$" sample21_pe.unmapped.fq > sample21_2_rfa.fq
However, when I try to use the files (fastqc, assembly, etc), they do not work. When running
fastqc i get the following error:
Failed to process file sample21_1_rfa.fq
uk.ac.babraham.FastQC.Sequence.SequenceFormatException: ID line didn't start with '#'
at uk.ac.babraham.FastQC.Sequence.FastQFile.readNext(FastQFile.java:134)
at uk.ac.babraham.FastQC.Sequence.FastQFile.next(FastQFile.java:105)
at uk.ac.babraham.FastQC.Analysis.AnalysisRunner.run(AnalysisRunner.java:76)
at java.lang.Thread.run(Thread.java:662)
But, if you look at the files they identifier does indeed start with an '#'. Any advice on why these files aren't working? I had originally converted .bam files into the .fastq files with
samtools bam2fq
Here are samples of each individual file:
merged .fastq
#HISEQ:534:CB14TANXX:4:1101:1091:2161/1
GAGAAGCTCGTCCGGCTGGAGAATGTTGCGCTTGCGGTCCGGAGAGGACAGAAATTCGTTGATGTTAACGGTGCGCTCGCCGCGGACGCTCTTGATGGTGACGTCGGCGTTGAGCGTGACGCACG
+
B/</<//B<BFF<FFFFFF/BFFFFFFB<BFFF<B/7FFF7B/B/FF/F/<<F/FFBFFFBBFFFBFB/FF<BBB<B/B//BBFFFFFFF/B/FF/B77B//B7B7F/7F###############
#HISEQ:534:CB14TANXX:4:1101:1091:2161/2
TGACGCCTGCCGTCAGGTAGGTTCTCCGCAGATCCGAAATCTCGCGACGCTCGGCGGCAACATCTGCCAGTCGTCCGTGGCGGGCGACGGTCTCGCGGCGTGCGTCACGCTCAACGCCGACGTAC
+
/B<B//F/F//B<///<FB/</F<<FFFFF<FFBF/FF<//FB/F//F7FBFFFF/B</7<F//<BB7/7BB7/B<F7BF<BFFFB7B#####################################
#HISEQ:534:CB14TANXX:4:1101:1637:2053/1
NGTTTACCATACAACAATCTTGCGACCTATTCAAATCATCTATATGCCTTATCAAGTTTTCATAGCTTTCAAGATTCTCAATTTCCTCACGTCTCGCTTTGCTCAACCTACAAAAACTCTCTTCT
+
#<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFFFFFFFFFFFB/BFBBFBB<<<<FFFFFFBB<FBFFBFF
#HISEQ:534:CB14TANXX:4:1101:1637:2053/2
TCGGTCGTTGGGAAAAGACCTGTGGTAAACATCCTACGCAAAAGCCATTGCGGTTACTCGTTCGTATGATTCTTGCATCAACTAATCAAGGCGATTGGGTTCTCGACCCATTTTGTGGAAGTTCG
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFBFFFFFB<FF<<BBFB
#HISEQ:534:CB14TANXX:4:1101:1792:2218/1
TCTATCGGCTGACCGATAAGCTGTCGCCTGCCGACCGTCCTGCCATGGGACGGCGCATCGCACAGCTCACCCTGGACTAACTCTCCAACACCATGATGCTGACACGCTCGGCAAAAACACCCGAT
+
<<B/<B</FF/<B/<//F<//FF<<<FF//</7/F<</FFF####################################################################################
#HISEQ:534:CB14TANXX:4:1101:1792:2218/2
TGCCGGAGGGCGTCGATGGTGGCATCGAGCTTTTTTGCCGAGCGTGTCAGCATGATGGTGTTGTAGAGATAGTCCATGGTGAGCTGTGCGATGCGCCGTTCCATGGCAGGACGGTCGGCAGGCGC
+
BBBBBFFFFFFFFBFFFBBFFFFFFFFFFFBBFFFF/FF<F7FF//F/FBB/FFBFFF/F7BFF<F/FFFFFFFFB/7BB<7BFFFFFFFFFFFFF<B///B/7B/7/B//77BB//7B/B7/B#
#HISEQ:534:CB14TANXX:4:1101:1903:2238/1
TATTCCAGCGACCGTTATAATCAAACTCAACTACATAGTCATTGCGGATTGCTTCAAGAAATTTTTTCCAGACTATTTCATCAATATTTATTTTGGGAACTGGTGCAACAGCAATTCTTTTTAAA
+
BBBBBFFFFFFFFFFFFFBFF/FFBFFBFFFFFFFF/FFFFFF<<FFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFBF/B/<B<B/FBF7/<FFFFFFF/BB/7///7FF<BFFF//B/FFF###
#HISEQ:534:CB14TANXX:4:1101:1903:2238/2
TAAGGTTGGAGAAGCAACAATTTACCGTGATATTGATTTGCTCCGAACATATTTTCATGCGCCACTCGAGTTTGACAGGGAGAAAGGCGGGTATTATTATTTTAATGAAAAATGGGATTTTGCCC
+
B<BBBFFFFFFF<FFFFFFFFFFFFFFFFFF/BFFFFFFF<<FF<F<FFF/FF/FFFFBFB</<//<B/////<<FFFFB/<F<BFF/7/</7/7FB/B/BFF<//7BFF###############
#HISEQ:534:CB14TANXX:4:1101:2107:2125/1
TGTAGTATTTATTACATCATATAGAATGTAGATATAAAAGATGAAAAAGCTATAATTTCTTTGATAATATAAGGAGGGAATAACACTATGAGGATTGATAGAGCAGGAATCGAGGATTTGCCGAT
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFF/FFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBFFFFFFFBB<FBB7BFF#
#HISEQ:534:CB14TANXX:4:1101:2107:2125/2
TACCACTATCGGCAAATCCTCGATTCCTGCTCTATCAATCCTCATAGTGTTATTCCCTCCTTATATTATCAAAGAAATTATAGCTTTTTCATCTTTTATATCTACATTCTATATGATGTAATAAA
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFFBFBFFFFFFFBBFFFFFFFBF7F/B/BBF7/</FF/77F/77BB#
#HISEQ:534:CB14TANXX:4:1101:2023:2224/1
TCACCAGCTCGGCACGCTTGTCCTTGACCTCCTGCTCGATCTGACCGTCCATCTTGGCTGCCACGGTGTTCTCCTCGGCGGAGTAGGCAAAGCAGCCCAGACGGTCGAACTGTATCTCCTTGACA
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFB<<B7BBFBFFF<FFBBFFFBF/7B/<B<
#HISEQ:534:CB14TANXX:4:1101:2023:2224/2
TCGAGGATCTGTGCAACTTTGTCAAGGAGATACAGTTCGACCGTCTGGGCTGCTTTGCCTACTCCGCCGAGGAGAACACCGTGGCAGCCAAGATGGACGGTCAGATCGAGCAGGAGGTCAAGGAC
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFBFBFFFFFFFFFFFFFFFFFFFFFFFFFBBFFFFFFFFFFFFF<7BF/<<BB###
#HISEQ:534:CB14TANXX:4:1101:2038:2235/1
TTTATGCGAATGTAGAGTGGCTTCTCCACTGCCTCGGTGAAGCCCACGCGCGAGATGAGCGAATTAAGCTGCTTTGCAGTGAATTGCATTGCATATACACCTGCGTCGGCTTGAATACTTGTGCT
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFF//BFFFFFFFFFFFFF<B<BB###
#HISEQ:534:CB14TANXX:4:1101:2038:2235/2
AATCCGCTCGTGAAAGCTCCCGATAACGCCACAGTGAACACCGTGGAGTTCTCTGATACCGAAGATTTCGCACGCAGCACAAGTATTCAAGCCGACGCAGGTGTATATGCAATGCAATTCACTGC
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBBFFFFFFFFFFFFFFFFFFFFFFF
#HISEQ:534:CB14TANXX:4:1101:2271:2041/1
NACACTTGTCGATGATCTTGCCAAGCTGCTTCTTGCCCACCAGGAAGCCGATCTCCAGATCAAACTCGTGGCCGGGAACACTCCGGTCCACAAAGCCCAGGTCCTGGGGAATGGGCTCATCGTAG
+
#<</BB/F/BB/F<FFFFFFFFF/<BFFFFFFFF<<FFBFFFFFFBFBFBBB<<FFFFBFFF/<B/FFFFFFFFFFFFFFFFF<FB<<BFF77BFFF/<BFFFB<</BB</7BFFFB########
#HISEQ:534:CB14TANXX:4:1101:2271:2041/2
GACTCATCTACAATGAGCCCATTCCCCAGGACCTGGGCTTTGTGGACCGGAGTGTTCCCGGCCACGAGTTTGATCTGGAGATCGGCTTCCTGGTGGGCAAGAAGCAGCTTGGCAAGATCATCGCC
+
<<BBBFFF<F/BFFFBFBF<BFF<<F/FFFBFFFF<<FFFFBFFFFFFBFFF/<B<F/<</<FFF//FFFFF/<<F/B/B/7/FF<<FF/7B/BBB/7///7////<B/B/BB/B/B/B/7BB##
Example of forward reads after being pulled out and placed into their own .fastq file:
#HISEQ:534:CB14TANXX:4:1101:1091:2161/1
GAGAAGCTCGTCCGGCTGGAGAATGTTGCGCTTGCGGTCCGGAGAGGACAGAAATTCGTTGATGTTAACGGTGCGCTCGCCGCGGACGCTCTTGATGGTGACGTCGGCGTTGAGCGTGACGCACG
+
B/</<//B<BFF<FFFFFF/BFFFFFFB<BFFF<B/7FFF7B/B/FF/F/<<F/FFBFFFBBFFFBFB/FF<BBB<B/B//BBFFFFFFF/B/FF/B77B//B7B7F/7F###############
--
#HISEQ:534:CB14TANXX:4:1101:1637:2053/1
NGTTTACCATACAACAATCTTGCGACCTATTCAAATCATCTATATGCCTTATCAAGTTTTCATAGCTTTCAAGATTCTCAATTTCCTCACGTCTCGCTTTGCTCAACCTACAAAAACTCTCTTCT
+
#<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFFFFFFFFFFFB/BFBBFBB<<<<FFFFFFBB<FBFFBFF
--
#HISEQ:534:CB14TANXX:4:1101:1792:2218/1
TCTATCGGCTGACCGATAAGCTGTCGCCTGCCGACCGTCCTGCCATGGGACGGCGCATCGCACAGCTCACCCTGGACTAACTCTCCAACACCATGATGCTGACACGCTCGGCAAAAACACCCGAT
+
<<B/<B</FF/<B/<//F<//FF<<<FF//</7/F<</FFF####################################################################################
--
#HISEQ:534:CB14TANXX:4:1101:1903:2238/1
TATTCCAGCGACCGTTATAATCAAACTCAACTACATAGTCATTGCGGATTGCTTCAAGAAATTTTTTCCAGACTATTTCATCAATATTTATTTTGGGAACTGGTGCAACAGCAATTCTTTTTAAA
+
BBBBBFFFFFFFFFFFFFBFF/FFBFFBFFFFFFFF/FFFFFF<<FFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFBF/B/<B<B/FBF7/<FFFFFFF/BB/7///7FF<BFFF//B/FFF###
--
#HISEQ:534:CB14TANXX:4:1101:2107:2125/1
TGTAGTATTTATTACATCATATAGAATGTAGATATAAAAGATGAAAAAGCTATAATTTCTTTGATAATATAAGGAGGGAATAACACTATGAGGATTGATAGAGCAGGAATCGAGGATTTGCCGAT
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFF/FFFFFFFFFFFFFFFFFFFFFFFFFFFFBBBFFFFFFFBB<FBB7BFF#
--
#HISEQ:534:CB14TANXX:4:1101:2023:2224/1
TCACCAGCTCGGCACGCTTGTCCTTGACCTCCTGCTCGATCTGACCGTCCATCTTGGCTGCCACGGTGTTCTCCTCGGCGGAGTAGGCAAAGCAGCCCAGACGGTCGAACTGTATCTCCTTGACA
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFFFFFFFFB<<B7BBFBFFF<FFBBFFFBF/7B/<B<
--
#HISEQ:534:CB14TANXX:4:1101:2038:2235/1
TTTATGCGAATGTAGAGTGGCTTCTCCACTGCCTCGGTGAAGCCCACGCGCGAGATGAGCGAATTAAGCTGCTTTGCAGTGAATTGCATTGCATATACACCTGCGTCGGCTTGAATACTTGTGCT
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFF//BFFFFFFFFFFFFF<B<BB###
--
#HISEQ:534:CB14TANXX:4:1101:2271:2041/1
NACACTTGTCGATGATCTTGCCAAGCTGCTTCTTGCCCACCAGGAAGCCGATCTCCAGATCAAACTCGTGGCCGGGAACACTCCGGTCCACAAAGCCCAGGTCCTGGGGAATGGGCTCATCGTAG
+
#<</BB/F/BB/F<FFFFFFFFF/<BFFFFFFFF<<FFBFFFFFFBFBFBBB<<FFFFBFFF/<B/FFFFFFFFFFFFFFFFF<FB<<BFF77BFFF/<BFFFB<</BB</7BFFFB########
--
#HISEQ:534:CB14TANXX:4:1101:2678:2145/1
CTGTACATAGTACGTATTTGACGCCTGCGTCGATGTAGCGTTTGAGGAAGGGAAGCAGCGGTTCTGCAGAGTCCTCTTTCCATCCGTTGATGCTAATCATTCCGTTGCGTACATCCGCTCCGAGA
+
BBBBBFFFFFFF<FFF<FFFFFFFFBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<BFFF7BFFFFFFFF<BBFFFFFFFFBBFBBB<FFBFFFFFFFFFFFFB<BFFFFFFBFB/BFFF####
--
#HISEQ:534:CB14TANXX:4:1101:2972:2114/1
CTCTGTGCCGATCCCTTTGCCTTTGCGTTTTGAGGAAAGGAAACCACCTTCTGGGTCGGTGAGGATAGTTCCGGTGAAGGTGTTGTCCACCGCCAGGCATAGGGAATAGCTGTCAGCCTTTGCTC
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFB/FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFBFFFF<FFFFFFFFFF<BFFFFF
--
#HISEQ:534:CB14TANXX:4:1101:2940:2222/1
CTAATTTTTTCATTATATTACTAATTTTGTAATTGGTAAAATATTATAATATCCTTGTACATTAAGACCCCAATAATCAGAAGAAGTAAAATTAATTCCTGCAACAGTTCTTAAATATCCATTAG
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFF<FBFFFFFFFFFFFFFFF/FBFBFFBFFFFF/<F<FFFFFFFFFF<FFFFFFBFFFFFFFFF</FBFBBF<F/7//FFBFBBFFF/<7BF#
--
#HISEQ:534:CB14TANXX:4:1101:3037:2180/1
CGTCAGTTCCGCAACGATAAAGAGTTCCGCATTGCAGTCACCTGTACGCTGGTAGCCACCGGAACCGATGTCAAGCCGTTGGAGGTGGTGATGTTCATGCGCGACGTAGCTTCCGAGCCGTTATA
+
B/BBBBBFFFFFFF<FFBFFFFFF<FFFFBFFFFFFF<BBFFFFFFFFFFFFFFFFFBFF/FFFFBFFBFFFFBFF/7F/BFB/BBFFFFFFFFBFF<BBF<7BBFFFFFFBBFFF/B#######
--
#HISEQ:534:CB14TANXX:4:1101:3334:2171/1
ACCGATGTACATACCCGGACGGGTACGCACATGCTCCATATCGCTCAAGTGGCGGATATTGTCATCTGTATATTCTACAGGTTGCTCCTGAGGGGTATTTGCCAGTTCTTCGGCAGCACCCTTTT
+
BBBBBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFFFBFFFFFFFFFFFBFFFFFFFFFF</<BFFFFFFFFBBFFFFFFBF</BB///BF<FFFFF<</<B
--
#HISEQ:534:CB14TANXX:4:1101:3452:2185/1
CGCAGACGGATTTGCTTGAAGTCCGTCTCATCGTATTCCGACAACTCATCGAGGAACACACGCTTGTATTGACTGATACCCTTGATTTTCTCCGGGTCGTCAAGACCACTGAAATCAATCTTGCC
+
BBBBBFFFF<FFFFFFFFFFFFFFFFFFFFFFFFFFF<FFFFFBFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFBFFFFFBFFBFFFFFFFFFB/77B/FBBFFF/<FFF/77BBFFFBFFBBB
--
Any advice would be appreciated. Thanks!
In general, this operation is called deinterlace fastq or deinterleave fastq. The question already has the answer here:
deinterleave fastq file
https://www.biostars.org/p/141256/
I am copying it here, with minor reformatting for clarity:
paste - - - - - - - - < interleaved.fq \
| tee >(cut -f 1-4 | tr "\t" "\n" > read1.fq) \
| cut -f 5-8 | tr "\t" "\n" > read2.fq
This command converts the interlaced fastq file into 8-column tsv file, cuts columns 1-4 (read 1 lines), changes from tsv to fastq format (by replacing tabs with newlines) and redirects the output to read1.fq. In the same STDOUT stream (for speed), using tee, it cuts columns 5-8 (read 2 lines), etc, and redirects the output to read2.fq.
You can also use these command line tools:
iamdelf/deinterlace: Deinterlaces paired-end FASTQ files into first and second strand files.
https://github.com/iamdelf/deinterlace
deinterleave FASTQ files
https://gist.github.com/nathanhaigh/3521724
Or online tools with Galaxy web UI, for example this tool: "FASTQ splitter on joined paired end reads", installed on several public Galaxy instances, such as https://usegalaxy.org/ .
Avoid using a regex for simple fastq file parsing if you can use line numbers, both for speed (pattern matching is slower than simple counting) and for robustness.
Highly unlikely, but a pattern like ^#.*/1$ (or whatever the readers might change it to, while reusing this code later) can match also the base quality line. A good general rule is to simply rely on fastq spec, which says 4 lines per record.
Note that #, /, 1, and 2 characters are allowed in Illumina Phred scores: https://support.illumina.com/help/BaseSpace_OLH_009008/Content/Source/Informatics/BS/QualityScoreEncoding_swBS.htm .
A one-liner that pulls out such (admittedly, very rare) reads is left as an exercise to the reader.
The fastq format uses 4 lines per read.
Your snippet has 5, as there are -- lines. That could cause confusion to softwares expecting a 4 line format.
You can add --no-group-separator to the grep call to avoid adding that separator.
I usually follow these steps to convert bam to fastq.gz
samtools bam2fq myBamfile.bam > myBamfile.fastq
cat myBamfile.fastq | grep '^#.*/1$' -A 3 --no-group-separator > sample_1.fastq
cat myBamfile.fastq | grep '^#.*/2$' -A 3 --no-group-separator > sample_2.fastq
gzip sample_1.fastq
gzip sample_1.fastq
Once you have the two files, you should order them to be sure that the reads are really paired.
We can split FASTQ files using Seqkit.
seqkit split2 -p 2 sample21_pe.unmapped.fq
https://bioinf.shenwei.me/seqkit/usage/#split2
Example 4 will help this question.
I'm not sure if it recognize the read ID. It split and write alternately into 1st-output-file and 2nd-output-file.

Inject PostScript code before 'showpage'

I want to print a book on a laser printer, so I have prepared a PostScript file ready for printing, with reordered and 2-up'ed pages. Now I want to add booklet marks on appropriate pages, like on the following picture:
From other SO questions I know that its showpage command that separates individual pages in PS file, so I wrote simple Perl script which counts showpage's occurences and prepends PostSript code if necessary, just to test if this approach works:
#!/usr/bin/env perl
use strict;
use warnings;
my $bookletSheets = 6;
my $occurence = 1;
while (my $line = <>) {
if ($line !~ /^showpage/) {
print $line;
next;
}
my $mod = $occurence % (2*$bookletSheets);
if ($mod == 1) {
print " 1 setlinewidth\n";
print " 5 5 newpath moveto\n";
print "-5 5 lineto\n";
print "-5 -5 lineto\n";
print " 5 -5 lineto\n";
print " 5 5 lineto\n";
print "0 setgray\n";
print "stroke\n";
print "%NOP\n"
}
print $line;
$occurence++;
}
But after running:
$ cat book.ps | ./preprocess.pl > book-marked.ps
I can see no signs of additional marks in document viewer, despite the code got injected correctly. What have I done wrong?
There are some links I've based my thinking on:
https://stackoverflow.com/a/532220/1447225
http://newsgroups.derkeiler.com/Archive/Comp/comp.lang.postscript/2007-11/msg00097.html
http://www.physics.emory.edu/faculty/weeks//graphics/howtops1.html
After some more investigation, it turned out that the lack of image was caused by BoundingBox clipping. Bounding box was shifted because the PS file was obtained from PDF with stripped margins.

duplicate grep output when comparing two files

I have literally been at this for 5 hours, I have busybox on my device, and I unfortunately do not have -X in grep to make my life easier.
edit;
I have two list both of them have mac addresses, essentially I am just wanting to achieve offline mac address lookup so I don't have to keep looking it up online
list.txt has vendor mac prefix of course this isn't the complete list but just for an example
00:13:46
00:15:E9
00:17:9A
00:19:5B
00:1B:11
00:1C:F0
scan will have list of different mac addresses unknown to which vendor they go to. Which will be full length mac addresses. when ever there is a match I want the line in scan to be output.
Pretty much it does that, but it outputs everything from the scan file, and then it will output matching one at the end, and causing duplicate. I tried sort -u, but it has no effect its as if there is two different output from two different methods, the reason why I say that is because it will instantly output scan file that has everything in it, and couple seconds later it will output the matching one.
From searching I came across this
#!/bin/bash
while read line; do
grep -F 'list' 'scan'
done < list.txt
which displays the duplicate result when/if found, the output is pretty much echoing my scan file then displaying the matched pattern, this creating duplicate
This is frustrating me that I have not found a solution after click on all the links in google up to page 9.
Please someone help me.
I don't know if the Busybox sed supports this out of the box, but it should be easy to do in Awk or Perl instead then.
Create a sed script to print lines from file2 which are covered by a prefix in file1 by transforming each line in file1 into a sed command to print a match for that regular expression:
sed 's%.*%/&/p%' file1 | sed -n -f - file2
The same in Awk:
awk 'NR==FNR { a[++i]="^" $0; next }
{ for (j=1; j<=i; ++j) if ($0 ~ a[j]) print }' file1 file2
Ok guys I did a nested for loop (probably very in efficient) but I got it working printing the matching mac addresses using this
#!/usr/bin/bash
for scanlist in `cat scan | cut -d: -f1,2,3`
do
for listt in `cat list`
do
if [[ $scanlist == $listt ]]; then
grep $scanlist scan
fi
done
done
if anyone can make this more elegant but it works for me for now. I think the problem I had was one list contained just 00:11:22 while my other list contained 00:11:22:33:44:55 that is why I cut it on my scanlist to make same length as my other list. So this only output the matches instead of doing duplicate output.

Matching pattern across multiple files: perl or grep?

I have a pattern.txt file which looks like this:
2gqt+FAD+A+601 2i0z+FAD+A+501
1n1e+NDE+A+400 2qzl+IXS+A+449
1llf+F23+A+800 1y0g+8PP+A+320
1ewf+PC1+A+577 2a94+AP0+A+336
2ydx+TXP+E+1339 3g8i+RO7+A+1
1gvh+HEM+A+1398 1v9y+HEM+A+1140
2i0z+FAD+A+501 3m2r+F43+A+1
1h6d+NDP+A+500 3rt4+LP5+C+501
1w07+FAD+A+1660 2pgn+FAD+A+612
2qd1+PP9+A+701 3gsi+FAD+A+902
There is another file called data (approx 8gb in size) which has lines like this.
2gqt+FAD+A+601 2i0z+FAD+A+501 0.874585 0.785412
1n1e+NDE+A+400 2qzl+IXS+A+449 0.145278 0.589452
1llf+F23+A+800 1y0g+8PP+A+320 0.784512 0.341786
1ewf+PC1+A+577 2a94+AP0+A+336 0.362542 0.784785
2ydx+TXP+E+1339 3g8i+RO7+A+1 0.251452 0.365298
1gvh+HEM+A+1398 1v9y+HEM+A+1140 0.784521 0.625893
2i0z+FAD+A+501 3m2r+F43+A+1 0.369856 0.354842
1h6d+NDP+A+500 3rt4+LP5+C+501 0.925478 0.365895
1w07+FAD+A+1660 2pgn+FAD+A+612 0.584785 0.325863
2qd1+PP9+A+701 3gsi+FAD+A+902 0.874526 0.125453
However the data file is not as simple as it looks like given above. The large size of the file is due to the fact that there are approx 18000 lines in it which begin the string in the first column of every line. i.e. 18000 lines beginning with 2gqt+FAD+A+601, followed by 18000 lines beginning with 1n1e+NDE+A+400. But there will be only one such line which matches the given pattern as in pattern.txt
I am trying to match the lines in pattern.txt with data and want to print out:
2gqt+FAD+A+601 2i0z+FAD+A+501 0.785412
1n1e+NDE+A+400 2qzl+IXS+A+449 0.589452
1llf+F23+A+800 1y0g+8PP+A+320 0.341786
1ewf+PC1+A+577 2a94+AP0+A+336 0.784785
2ydx+TXP+E+1339 3g8i+RO7+A+1 0.365298
1gvh+HEM+A+1398 1v9y+HEM+A+114 0 0.625893
2i0z+FAD+A+501 3m2r+F43+A+1 0.354842
1h6d+NDP+A+500 3rt4+LP5+C+501 0.365895
1w07+FAD+A+1660 2pgn+FAD+A+612 0.325863
2qd1+PP9+A+701 3gsi+FAD+A+902 0.125453
As of now I am using something in perl, like this:
use warnings;
open AS, "combi_output_2_fixed.txt";
open AQ, "NAMES.txt";
#arr=<AS>;
#arr1=<AQ>;
foreach $line(#arr)
{
#split=split(' ',$line);
foreach $line1(#arr1)
{
#split1=split(' ',$line1);
if($split[0] eq $split1[0] && $split[1] eq $split1[1])
{ print $split1[0],"\t",$split1[1],"\t",$split1[3],"\n";}
}
}
close AQ;
close AS;
Doing this uses up the entire memory: and shows Out of memory error message..
I am aware that this can be done using grep. but do not know hw to do it.
Can anyone please let me know how I can do this using grep -F AND WITHOUT USING UP THE ENTIRE MEMORY?
Thanks.
Does pattern.txt fit in memory?
If it does, you could use a command like grep -F -f pattern.txt data.txt to match lines in data.txt against the patterns. You would get the full line though, and extra processing would be required to get only the second column of numbers.
Or you could fix the Perl script. The reason you run out of memory is because you read the 8gb file entirely to memory, when you could be processing it line-by-line like grep. For the 8GB file you should use code like this:
open FH, "<", "data.txt";
while ($line = <FH>) {
# check $line against list of patterns ...
}
Try This
grep "`more pattern.txt`" data.txt | awk -F' ' '{ print $1 " " $2 " " $4}'

Addressing a specific occurrence of a character in sed

How do I remove or address a specific occurrence of a character in sed?
I'm editing a CSV file and I want to remove all text between the third and the fifth occurrence of the comma (that is, dropping fields four and five) . Is there any way to achieve this using sed?
E.g:
% cat myfile
one,two,three,dropthis,dropthat,six,...
% sed -i 's/someregex//' myfile
% cat myfile
one,two,three,,six,...
If it is okay to consider cut command then:
$ cut -d, -f1-3,6- file
awk or any other tools that are able to split strings on delimiters are better for the job than sed
$ cat file
1,2,3,4,5,6,7,8,9,10
Ruby(1.9+)
$ ruby -ne 's=$_.split(","); s[2,3]=nil ;puts s.compact.join(",") ' file
1,2,6,7,8,9,10
using awk
$ awk 'BEGIN{FS=OFS=","}{$3=$4=$5="";}{gsub(/,,*/,",")}1' file
1,2,6,7,8,9,10
A real parser in action
#!/usr/bin/python
import csv
import sys
cr = csv.reader(open('my-data.csv', 'rb'))
cw = csv.writer(open('stripped-data.csv', 'wb'))
for row in cr:
cw.writerow(row[0:3] + row[5:])
But do note the preface to the csv module:
The so-called CSV (Comma Separated
Values) format is the most common
import and export format for
spreadsheets and databases. There is
no “CSV standard”, so the format is
operationally defined by the many
applications which read and write it.
The lack of a standard means that
subtle differences often exist in the
data produced and consumed by
different applications. These
differences can make it annoying to
process CSV files from multiple
sources. Still, while the delimiters
and quoting characters vary, the
overall format is similar enough that
it is possible to write a single
module which can efficiently
manipulate such data, hiding the
details of reading and writing the
data from the programmer.
$ cat my-data.csv
1
1,2
1,2,3
1,2,3,4,
1,2,3,4,5
1,2,3,4,5,6
1,2,3,4,5,6,
1,2,,4,5,6
1,2,"3,3",4,5,6
1,"2,2",3,4,5,6
,,3,4,5
,,,4,5
,,,,5
$ python csvdrop.py
$ cat stripped-data.csv
1
1,2
1,2,3
1,2,3
1,2,3
1,2,3,6
1,2,3,6,
1,2,,6
1,2,"3,3",6
1,"2,2",3,6
,,3
,,
,,

Resources