Inject PostScript code before 'showpage' - printing

I want to print a book on a laser printer, so I have prepared a PostScript file ready for printing, with reordered and 2-up'ed pages. Now I want to add booklet marks on the appropriate pages.
From other SO questions I know that it's the showpage command that separates individual pages in a PS file, so I wrote a simple Perl script which counts occurrences of showpage and prepends PostScript code where necessary, just to test whether this approach works:
#!/usr/bin/env perl
use strict;
use warnings;
my $bookletSheets = 6;
my $occurence = 1;
while (my $line = <>) {
    if ($line !~ /^showpage/) {
        print $line;
        next;
    }
    my $mod = $occurence % (2 * $bookletSheets);
    if ($mod == 1) {
        print " 1 setlinewidth\n";
        print " 5 5 newpath moveto\n";
        print "-5 5 lineto\n";
        print "-5 -5 lineto\n";
        print " 5 -5 lineto\n";
        print " 5 5 lineto\n";
        print "0 setgray\n";
        print "stroke\n";
        print "%NOP\n";
    }
    print $line;
    $occurence++;
}
But after running:
$ cat book.ps | ./preprocess.pl > book-marked.ps
I can see no sign of the additional marks in a document viewer, even though the code was injected correctly. What have I done wrong?
Here are some links I based my thinking on:
https://stackoverflow.com/a/532220/1447225
http://newsgroups.derkeiler.com/Archive/Comp/comp.lang.postscript/2007-11/msg00097.html
http://www.physics.emory.edu/faculty/weeks//graphics/howtops1.html

After some more investigation, it turned out that the missing marks were caused by BoundingBox clipping: the bounding box was shifted because the PS file was obtained from a PDF with stripped margins, so the marks were drawn outside the visible area.
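
If you hit the same symptom, a quick sanity check is to look at the DSC comments of the generated file. A minimal sketch, assuming the file carries standard %%BoundingBox comments (book-marked.ps is the output file from the example above):

perl -ne 'print if /^%%(Page)?BoundingBox/' book-marked.ps

If the reported box does not start near 0 0, marks drawn around the origin (like the 10x10 square above) can fall outside the clipped region.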

Counting specific lines that don't contain specific word

Please, I have a question: I have a file like this:
@HWI-ST273:296:C0EFRACXX:2:2101:17125:145325/1
TTAATACACCCAACCAGAAGTTAGCTCCTTCACTTTCAGCTAAATAAAAG
+
8?8A;DDDD;#?++8A?;C;F92+2A#19:1*1?DDDECDE?B4:BDEEI
@BBBB-ST273:296:C0EFRACXX:2:1303:5281:183410/1
TAGCTCCTTCGCTTTCAGCTAAATAAAAGCCCAGTACTTCTTTTTTACCA
+
CCBFFFFFFHHHHJJJJJJJJJIIJJJJJJJJJJJJJJJJJJJIJJJJJI
@HWI-ST273:296:C0EFRACXX:2:1103:16617:140195/1
AAGTTAGCTCCTTCGCTTTCAGCTAAATAAAAGCCCAGTACTTCTTTTTT
+
#C#FF?EDGFDHH#HGHIIGEGIIIIIEDIIGIIIGHHHIIIIIIIIIII
@HWI-ST273:296:C0EFRACXX:2:1207:14316:145263/1
AATACACCCAACCAGAAGTTAGCTCCTTCGCTTTCAGCTAAATAAAAGCC
+
CCCFFFFFHHHHHJJJJJJJIJJJJJJJJJJJJJJJJJJJJJJJJJJJIJI
I'm interested only in the lines that start with '@HWI', and I want to count all the header lines that do not start with '@HWI'. In the example shown, the result will be 1, because there's one line that starts with '@BBB'.
To be more clear: I just want to count the first lines of the four-line repeating records that do not start with '@HWI'; I hope I'm clear enough. Please tell me if you need more clarification.
With GNU sed, you can use its extended address to print every fourth line, then use grep to count the ones that don't start with @HWI:
sed -n '1~4p' file.fastq | grep -cv '^@HWI'
Otherwise, you can use e.g. Perl:
perl -ne 'print if 1 == $. % 4' -- file.fastq | grep -cv '^@HWI'
$. contains the current line number, % is the modulo operator.
But once we're running Perl, we don't need grep anymore:
perl -lne '++$c if 1 == $. % 4 and !/^\@HWI/; END { print $c // 0 }' -- file.fastq
-l removes newlines from input and adds them to output.

AppleScript parsing html from site

What I'm trying to do is to get the names of all TV shows on this Wikipedia page.
Ok, so I did this first:
property showsWebList : {}
tell application "Safari"
    set loadDelay to 2 -- in seconds; test for your system
    make new document at end of every document
    set URL of document 1 to "http://en.wikipedia.org/wiki/List_of_television_programs_by_name"
    delay loadDelay
    set nrOfUls to do JavaScript "document.getElementById('mw-content-text').querySelectorAll('ul').length;" in document 1
    set nrOfUls to nrOfUls - 1 as number
    log nrOfUls
    repeat with ws from 1 to nrOfUls
        delay loadDelay
        set nrOfLis to do JavaScript "document.getElementById('mw-content-text').getElementsByTagName('UL')[" & ws & "].querySelectorAll('li').length;" in document 1
        set nrOfLis to nrOfLis - 1 as number
        log nrOfLis
        repeat with rs from 0 to nrOfLis
            delay 0.3
            set aShow to do JavaScript "document.getElementById('mw-content-text').getElementsByTagName('UL')[" & ws & "].getElementsByTagName('LI')[" & rs & "].getElementsByTagName('I')[0].getElementsByTagName('A')[0].innerHTML;" in document 1
            if aShow is not "" or "missing value" then
                copy aShow to end of showsWebList
            end if
        end repeat
    end repeat
end tell
And this works exactly how I want it to. The problem is that it takes 15 minutes until it's done, and you have to keep the Safari document in front the whole time. So my thought was to grab the whole HTML and parse it myself. Not that easy. This is how my code looks now:
tell application "Safari"
    make new document at end of every document
    set URL of document 1 to "http://en.wikipedia.org/wiki/List_of_television_programs_by_name"
    delay 4
    set orgHTML to do JavaScript "document.getElementById('mw-content-text').innerHTML;" in document 1
    set orgHTML to orgHTML as text
    set readyText to my extractBetween(orgHTML, "<li><i><a ", "</a></i></li>")
    log (item 0 of readyText)
    set removeArray to my extractBetween(readyText, "href", ">")
    set completeArray to {}
    repeat with rt from 0 to (count readyText)
        repeat with ra from 0 to (count removeArray)
            if (item ra of removeArray) is in (item rt of readyText) then
                set completeName to trim_line((item rt of readyText), (item ra of removeArray), 1)
                set end of completeArray to completeName
            end if
        end repeat
    end repeat
    log completeArray
end tell

on extractBetween(SearchText, startText, endText)
    set tid to AppleScript's text item delimiters -- save them for later.
    set AppleScript's text item delimiters to startText -- find the first one.
    set liste to text items of SearchText
    set AppleScript's text item delimiters to endText -- find the end one.
    set extracts to {}
    repeat with subText in liste
        if subText contains endText then
            copy text item 1 of subText to end of extracts
        end if
    end repeat
    set AppleScript's text item delimiters to tid -- back to original values.
    return extracts
end extractBetween

on trim_line(this_text, trim_chars, trim_indicator)
    -- 0 = beginning, 1 = end, 2 = both
    set x to the length of the trim_chars
    -- TRIM BEGINNING
    if the trim_indicator is in {0, 2} then
        repeat while this_text begins with the trim_chars
            try
                set this_text to characters (x + 1) thru -1 of this_text as string
            on error
                -- the text contains nothing but the trim characters
                return ""
            end try
        end repeat
    end if
    -- TRIM ENDING
    if the trim_indicator is in {1, 2} then
        repeat while this_text ends with the trim_chars
            try
                set this_text to characters 1 thru -(x + 1) of this_text as string
            on error
                -- the text contains nothing but the trim characters
                return ""
            end try
        end repeat
    end if
    return this_text
end trim_line
Not that smooth and not working. Somehow it seems like I can't get the items out of the list, because it doesn't see it as a list item. Can someone help me out?
Cheers
I would recommend a different approach. Download the source, and then just grab the title between tags. The whole script takes under two seconds. Start with:
property baseURL : "http://en.wikipedia.org/wiki/List_of_television_programs_by_name"
set rawHTML to do shell script "curl '" & baseURL & "'"
set preTag to "\" title=\"" -- " title="
set otid to AppleScript's text item delimiters
set AppleScript's text item delimiters to preTag
set rawList to text items of rawHTML
set nameList to {}
repeat with eachLine in rawList
    set theOff to offset of ">" in eachLine
    set thisName to text 1 thru (theOff - 2) of eachLine
    -- add some error checking here to skip the opening non-title hits, and to fine-tune the precise title string
    set nameList to nameList & return & thisName
end repeat
set AppleScript's text item delimiters to otid
return nameList
Add a little error checking, and tweak which preTag and postTag fit best.
I suggest you make use of a specialized 3rd-party tool for this task, which can greatly speed things up.
Here's a solution using the multi-platform web-scraping CLI xidel:
A shell command to demonstrate its brevity and speed (takes less than 1 sec. on my system) - extracts all show names from the page:
xidel -e '//*[@id="mw-content-text"]/ul/li/i/a' https://en.wikipedia.org/wiki/List_of_television_programs_by_name
An equivalent AppleScript snippet - be sure to fill in the path to where you place xidel on your system below:
set targetUrl to "https://en.wikipedia.org/wiki/List_of_television_programs_by_name"
set xPathExpr to "//*[@id=\"mw-content-text\"]/ul/li/i/a"
# Fill in the path to `xidel` on your system here:
set xidelPath to "/path/to/xidel"
# Perform scraping and convert result into an AppleScript list.
set showNames to paragraphs of ¬
(do shell script ¬
quoted form of xidelPath & " -e " & quoted form of xPathExpr & " " & ¬
quoted form of targetUrl)
Here's another solution: use JavaScript to get the names without any AppleScript loop.
The JavaScript takes less than one second to get the names.
tell application "Safari"
    make new document at end of every document with properties {URL:"http://en.wikipedia.org/wiki/List_of_television_programs_by_name"}
    delay 2 -- in seconds; test for your system
    set showsWebList to do JavaScript "var a=new Array();var ul=document.getElementById('mw-content-text').querySelectorAll('UL'); for (var i=1;i<ul.length;i++){li=ul[i].querySelectorAll('LI'); for (var j=0; j< li.length; j++){try {var t=li[j].getElementsByTagName('I')[0].getElementsByTagName('A')[0].innerText; a.push(t)} catch(e) {}}} a;" in document 1
end tell
curl/sed/perl solution:
do shell script "curl 'http://en.wikipedia.org/wiki/List_of_television_programs_by_name' | sed -n '/0-9/,/NewPP/p' | sed -n '/^<li/ s/^.*title=.\\([^\"]*\\).*$/\\1/p' | perl -n -mHTML::Entities -e ' ; print HTML::Entities::decode_entities($_);'"
Here's another solution using awk, with a very simple script: if the line begins with <li><i>, remove the HTML tags (gsub) and print it. Then, using every paragraph of, the return-separated output is converted into an AppleScript list.
set theURL to "http://en.wikipedia.org/wiki/List_of_television_programs_by_name"
every paragraph of (do shell script "curl " & theURL & " | awk '/^\\<li\\>\\<i\\>/{gsub(\"<[^>]*>\", \"\");print}'")

Powershell parse parts of a text file and save to CSV

All, I'm very new to PowerShell and am hoping someone can get me going on what I think would be a simple script.
I need to parse a text file, capture certain lines from it, and save those lines as a csv file.
For example, each alert is in its own text file. Each file is similar to this:
--start of file ---
Name John Smith
Dept Accounting
Codes bas-2349,cav-3928,deg-3942
      iye-2830,tel-3890
Urls hxxp://blah.com
     hxxp://foo.com, hxxp://foo2.com
Some text I dont care about
More text i dont care about
Comments
---------
"here is a multi line
comment I need
to capture"
Some text I dont care about
More text i dont care about
Date 3/12/2013
---END of file---
For each text file, I want to write only the Name, Codes, and Urls to a CSV file. Could someone help me get going on this?
I'm more of a Perl guy, so I know I could write a regex for capturing a single line beginning with Name. However, I am completely lost on how I could read the "Codes" line when it might be one line or it might be X lines long, until I run into the Urls field.
Any help would be greatly appreciated!
Text parsing usually means regex. With regex, sometimes you need anchors to know when to stop a match and that can make you care about text you otherwise wouldn't. If you can specify that first line of "Some text I don't care about" you can use that to "anchor" your match of the URLs so you know when to stop matching.
$regex = @'
(?ms)Name (.+)?
Dept .+?
Codes (.+)?
Urls (.+)?
Some text I dont care about.+
Comments
---------
(.+)?
Some text I dont care about
'@
$file = 'c:\somedir\somefile.txt'
if ([IO.File]::ReadAllText($file) -match $regex)
{
    $Name = $matches[1]
    $Codes = $matches[2] -replace '\s+',','
    $Urls = $matches[3] -replace '\s+',','
    $comment = $matches[4] -replace '\s+',' '
}
$Name
$Codes
$Urls
$comment
If the file is not too big to be processed in memory, the simple way is to read it as an array of strings. (What "too big" means depends on your system; anything sub-gigabyte should work without too much of a hiccup.)
After you've read the file, set up head and tail counters pointing to element zero. Move the tail pointer forward row by row until you find the Date row. You can match rows with regexes. Now you know the start and end of a single record. For the next record, set the head counter to tail+1, the tail to tail+2, and start scanning rows again. Lather, rinse, repeat until the end of the array is reached.
When a record is matched, you can extract the name with a regex. Codes and Urls are a bit trickier: match the Codes row with a regex, then collect it and the following rows as long as they still match the continuation pattern. The same goes for the Urls data. If the file always has whitespace padding on rows that continue the previous Urls and Codes values, you could match the leading whitespace with a regex to pick up those data rows too.
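
As a rough illustration of that record-scanning idea, here is a sketch in Perl rather than PowerShell (the file name alert.txt is hypothetical, and it assumes continuation rows start with whitespace, as in the sample above):

use strict;
use warnings;

# Read the whole file into an array of rows (fine for sub-gigabyte files).
open my $fh, '<', 'alert.txt' or die "alert.txt: $!";    # hypothetical name
chomp(my @rows = <$fh>);
close $fh;

my $head = 0;
for my $tail (0 .. $#rows) {
    next unless $rows[$tail] =~ /^Date /;     # the Date row ends a record
    my %rec;
    my $field = '';
    for my $row (@rows[$head .. $tail]) {
        if ($row =~ /^(Name|Codes|Urls) (.*)/) {
            ($field, $rec{$1}) = ($1, $2);    # a field we care about starts
        } elsif ($field and $row =~ /^\s+(\S.*)/) {
            $rec{$field} .= ",$1";            # whitespace-padded continuation
        } else {
            $field = '';                      # anything else ends the field
        }
    }
    print join("\t", map { $rec{$_} // '' } qw(Name Codes Urls)), "\n";
    $head = $tail + 1;                        # next record starts after Date
}

Writing the collected values out as CSV is then a separate, easy step (Export-Csv on the PowerShell side).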
Maybe something like this would do it:
foreach ($Line in gc file.txt) {
    switch -regex ($Line) {
        '^(Name|Dept|Codes|Urls)' {
            $Capture = $true
            break
        }
        '^[A-Za-z0-9_-]+' {
            $Capture = $false
            break
        }
    }
    if ($Capture) {
        $Line
    }
}
If you want the end result as a CSV file then you may use the Export-Csv cmdlet.
Given that c:\temp\file.txt contains:
Name John Smith
Dept Accounting
Codes bas-2349,cav-3928,deg-3942
      iye-2830,tel-3890
Urls hxxp://blah.com
     hxxp://foo.com
     hxxp://foo2.com
Some text I dont care about
More text i dont care about
.
.
Date 3/12/2013
You can use regular expressions like this:
$a = Get-Content C:\temp\file.txt
$b = [regex]::match($a, "^.*Codes (.*)Urls (.*)Some.*$", "Multiline")
$codes = $b.groups[1].value -replace '[ ]{2,}',','
$urls = $b.groups[2].value -replace '[ ]{2,}',','
If all files have the same structure you could do something like this:
$srcdir = "C:\Test"
$outfile = "$srcdir\out.csv"
$re = '^Name (.*(?:\r\n .*)*)\r\n' +
      'Dept .*(?:\r\n .*)*\r\n' +
      'Codes (.*(?:\r\n .*)*)\r\n' +
      'Urls (.*(?:\r\n .*)*)' +
      '[\s\S]*$'
Get-ChildItem $srcdir -Filter *.txt | % {
    [io.file]::ReadAllText($_.FullName)
} | Select-String $re | % {
    $f = $_.Matches | % { $_.Groups } | ? { $_.Index -gt 0 }
    New-Object -TypeName PSObject -Prop @{
        'Name' = $f[0].Value;
        'Codes' = $f[1].Value;
        'Urls' = $f[2].Value;
    }
} | Export-Csv $outfile -NoTypeInformation

Matching pattern across multiple files: perl or grep?

I have a pattern.txt file which looks like this:
2gqt+FAD+A+601 2i0z+FAD+A+501
1n1e+NDE+A+400 2qzl+IXS+A+449
1llf+F23+A+800 1y0g+8PP+A+320
1ewf+PC1+A+577 2a94+AP0+A+336
2ydx+TXP+E+1339 3g8i+RO7+A+1
1gvh+HEM+A+1398 1v9y+HEM+A+1140
2i0z+FAD+A+501 3m2r+F43+A+1
1h6d+NDP+A+500 3rt4+LP5+C+501
1w07+FAD+A+1660 2pgn+FAD+A+612
2qd1+PP9+A+701 3gsi+FAD+A+902
There is another file called data (approx 8gb in size) which has lines like this.
2gqt+FAD+A+601 2i0z+FAD+A+501 0.874585 0.785412
1n1e+NDE+A+400 2qzl+IXS+A+449 0.145278 0.589452
1llf+F23+A+800 1y0g+8PP+A+320 0.784512 0.341786
1ewf+PC1+A+577 2a94+AP0+A+336 0.362542 0.784785
2ydx+TXP+E+1339 3g8i+RO7+A+1 0.251452 0.365298
1gvh+HEM+A+1398 1v9y+HEM+A+1140 0.784521 0.625893
2i0z+FAD+A+501 3m2r+F43+A+1 0.369856 0.354842
1h6d+NDP+A+500 3rt4+LP5+C+501 0.925478 0.365895
1w07+FAD+A+1660 2pgn+FAD+A+612 0.584785 0.325863
2qd1+PP9+A+701 3gsi+FAD+A+902 0.874526 0.125453
However, the data file is not as simple as it looks above. Its large size is due to the fact that there are approx 18000 lines which begin with the string in the first column of every line, i.e. 18000 lines beginning with 2gqt+FAD+A+601, followed by 18000 lines beginning with 1n1e+NDE+A+400. But there will be only one such line which matches the given pair in pattern.txt.
I am trying to match the lines in pattern.txt with data and want to print out:
2gqt+FAD+A+601 2i0z+FAD+A+501 0.785412
1n1e+NDE+A+400 2qzl+IXS+A+449 0.589452
1llf+F23+A+800 1y0g+8PP+A+320 0.341786
1ewf+PC1+A+577 2a94+AP0+A+336 0.784785
2ydx+TXP+E+1339 3g8i+RO7+A+1 0.365298
1gvh+HEM+A+1398 1v9y+HEM+A+1140 0.625893
2i0z+FAD+A+501 3m2r+F43+A+1 0.354842
1h6d+NDP+A+500 3rt4+LP5+C+501 0.365895
1w07+FAD+A+1660 2pgn+FAD+A+612 0.325863
2qd1+PP9+A+701 3gsi+FAD+A+902 0.125453
As of now I am using something in Perl, like this:
use warnings;
open AS, "combi_output_2_fixed.txt";
open AQ, "NAMES.txt";
@arr = <AS>;
@arr1 = <AQ>;
foreach $line (@arr) {
    @split = split(' ', $line);
    foreach $line1 (@arr1) {
        @split1 = split(' ', $line1);
        if ($split[0] eq $split1[0] && $split[1] eq $split1[1]) {
            print $split1[0], "\t", $split1[1], "\t", $split1[3], "\n";
        }
    }
}
close AQ;
close AS;
Doing this uses up the entire memory and shows an "Out of memory" error message.
I am aware that this can be done using grep, but I do not know how to do it.
Can anyone please let me know how I can do this using grep -F, without using up the entire memory?
Thanks.
Does pattern.txt fit in memory?
If it does, you could use a command like grep -F -f pattern.txt data.txt to match lines in data.txt against the patterns. You would get the full line though, and extra processing would be required to get only the second column of numbers.
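
For instance, a minimal sketch of that extra processing, assuming (as in the sample data) that the value you want is the fourth whitespace-separated field:

grep -F -f pattern.txt data.txt | perl -lane 'print join "\t", @F[0, 1, 3]'

grep -F does the cheap fixed-string matching over the 8 GB file, and the Perl one-liner only reshapes the few matching lines.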
Or you could fix the Perl script. The reason you run out of memory is that you read the 8 GB file entirely into memory, when you could be processing it line by line like grep does. For the 8 GB file you should use code like this:
open FH, "<", "data.txt";
while ($line = <FH>) {
# check $line against list of patterns ...
}
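
Fleshing that skeleton out, here is a hedged sketch of the full streaming version. It assumes the pattern file is called pattern.txt and the big file data.txt, and that a line matches when its first two columns equal a pattern pair:

use strict;
use warnings;

# Load the small pattern file into a hash for O(1) lookups.
open my $pat, '<', 'pattern.txt' or die "pattern.txt: $!";
my %want;
while (my $line = <$pat>) {
    my ($c1, $c2) = split ' ', $line;
    next unless defined $c2;    # skip blank or malformed lines
    $want{"$c1 $c2"} = 1;
}
close $pat;

# Stream the 8 GB data file; only one line is in memory at a time.
open my $data, '<', 'data.txt' or die "data.txt: $!";
while (my $line = <$data>) {
    my @f = split ' ', $line;
    print join("\t", @f[0, 1, 3]), "\n"
        if @f >= 4 and $want{"$f[0] $f[1]"};
}
close $data;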
Try this:
grep "`more pattern.txt`" data.txt | awk -F' ' '{ print $1 " " $2 " " $4}'

Need help parsing file/writing script

Hey all, I have been doing nothing but web development over the last few years and haven't written any Java or C++ in what feels like forever. I don't necessarily need to use these languages, so I'm entirely open to suggestion. I was given an email list by a client to import into their mailchimp account yesterday and unfortunately, Mailchimp couldn't read the file. It's a text file, but I don't believe it's tab delimited (which would make this much, much easier for me).
A small portion of the file (I've changed last names and email addresses) can be viewed here: http://sparktoignite.com/patients.txt
If anyone has suggestions on how I can get this into a Mailchimp readable format (csv, tab delimited txt, excel) please let me know. I feel like 3 years ago I would've been able to do this in a matter of minutes, but given that I haven't touched anything other than RoR, PHP, and jQuery for the last few years, I don't know where to start.
Thanks!
If you are on *nix, you can use tools like awk:
awk -F"|" 'NR>2{$1=$1}1' OFS="," file > newfile.xls
However, you stated that you know PHP, so why not stick to something you know? You can use the fgetcsv()/fputcsv() functions:
$output = fopen("out.csv", "w");
$handle = fopen("file", "r");
if ($handle) {
    // skip the two header lines
    $line = fgetcsv($handle, 2048, "|");
    $line = fgetcsv($handle, 2048, "|");
    while (($data = fgetcsv($handle, 2048, "|")) !== FALSE) {
        $num = count($data);
        fputcsv($output, $data, ',');
    }
    fclose($handle);
    fclose($output);
}
In bash, this outputs a TAB-delimited file:
cat patients.txt | tr -d '[:blank:]' | tr "|" "\t" > output.txt
If you prefer CSV, just change the last "\t" to ",":
cat patients.txt | tr -d '[:blank:]' | tr "|" "," > output.csv
It messes up the header though. If you need to preserve the header, the first couple of lines need to be handled separately:
head -n2 patients.txt > output.txt
tail -n+3 patients.txt | tr -d '[:blank:]' | tr "|" "\t" >> output.txt
