Use array variable in Automator as filename input

I am trying to create a workflow that converts a list of URLs into plain text using Instapaper, and then saves the text in text documents on my machine.
So far, I have been able to grab the list of URLs, get the title of each webpage, and convert the URLs to plain text.
I have the list of titles saved in a variable called "Article Titles." The plain text of each article is then passed from "Get Text from Webpage" to "New Text File".
I tried putting the Article Titles variable in the Save As input of the "New Text File" action, but no files are generated (unlike when I entered a generic title into the Save As field, although then every file got the same name). I suspect that I can't use a variable containing an array as a Save As input, but I'd like each new file to have its respective name.
How can I have the action iterate over the array of titles so that each item of plain text from "Get Text from Webpage" is saved under its title from the "Article Titles" variable?

One thing that frustrates many Automator users is the problem you hit when you want to pass more than one variable to an action. There are ways around it, like handing things off to an external script.
But in this case a simple AppleScript, mixed with the bit of script #adayzdone gave you, will get you what I think you want.
You just need to pass the list of URLs to this "Run AppleScript" action:
on run {input, parameters}
    set docPath to POSIX path of (path to documents folder)
    repeat with i from 1 to count of items of input
        set this_item to item i of input
        set thePage to (do shell script "curl " & quoted form of this_item)
        set theTitle to docPath & "/" & (do shell script "echo " & quoted form of thePage & " | grep -o \\<title\\>.*\\</title\\> | sed -E 's/<\\/?title>//g'")
        set t_text to (do shell script "echo " & quoted form of thePage & "|textutil -format html -convert txt -stdin -output \"" & theTitle & ".txt\"")
    end repeat
end run
Update - passing the text on to the next action
This still does what the script above does, but it will now also pass a list of the text contents of all the URLs on to the next action.
I have tested it with 'Text to Speech' and it reads multiple text contents.
on run {input, parameters}
    set docPath to POSIX path of (path to documents folder)
    set bigList to {}
    repeat with i from 1 to count of items of input
        set this_item to item i of input
        set thePage to (do shell script "curl " & quoted form of this_item)
        set theTitle to docPath & "/" & (do shell script "echo " & quoted form of thePage & " | grep -o \\<title\\>.*\\</title\\> | sed -E 's/<\\/?title>//g'")
        set t_text to (do shell script "echo " & quoted form of thePage & "|textutil -format html -convert txt -stdin -output \"" & theTitle & ".txt\"")
        set t_text_for_action to (do shell script "echo " & quoted form of thePage & "|textutil -format html -convert txt -stdin -stdout")
        copy t_text_for_action to end of bigList
    end repeat
    return bigList --> text list can now be passed to the next action
end run
If you want to test it, may I suggest a page with a small amount of text on it, like http://www.javascripter.net/
Update 2 - saving the text to an audio file using the Unix command 'say'
OK, there are a couple of things here.
1. For the same reason as before, I have kept everything in one script: passing the text objects and the titles together to the next action would be a pain, if not impossible.
2. The script uses the Unix say command and its output option to save the text as an AIFF file. It also names the file by the title.
3. I had a problem where, instead of saving the file, the script started speaking the text. It turned out that the URL I was testing on (http://www.javascripter.net) had a title tag in caps, so the grep and sed part of #adayzdone's script was returning "", which threw the say command off. I fixed this by using the -i (ignore case) option in grep and adding a caps version of the expression to sed with the "|" (or) alternation.
4. The title being returned also contained characters that stopped the file from being saved with its extension, so the system could not recognise the file type. This is fixed by a simple handler that returns the title text with only allowed characters.
5. It works.
on run {input, parameters}
    set docPath to POSIX path of (path to documents folder)
    repeat with i from 1 to count of items of input
        set this_item to item i of input
        set thePage to (do shell script "curl -A \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.112 Safari/534.30\" " & quoted form of this_item)
        set theTitle to replaceBadChars((do shell script "echo " & quoted form of thePage & " | grep -io \\<title\\>.*\\</title\\> | sed -E 's/<\\/?title>|<\\/?TITLE>//g'"))
        set t_text_for_action to (do shell script "echo " & quoted form of thePage & "|textutil -format html -convert txt -stdin -stdout")
        do shell script "cd " & quoted form of docPath & " ;say -o \"" & theTitle & "\" , " & quoted form of t_text_for_action
    end repeat
end run
on replaceBadChars(TEXT_)
    log TEXT_
    set OkChars to {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z", "1", "2", "3", "4", "5", "6", "7", "8", "9", "0", "_", space}
    set TEXT_ to characters of TEXT_
    repeat with i from 1 to count of items in TEXT_
        set this_char to item i of TEXT_
        if this_char is not in OkChars then
            set item i of TEXT_ to "_"
        end if
    end repeat
    set TEXT_ to TEXT_ as string
    do shell script " echo " & quoted form of TEXT_
end replaceBadChars

Related

How to grep umlauts and other accented text characters via AppleScript

I have a problem trying to execute shell scripts from AppleScript. I do a grep, but as soon as the search string contains special characters it doesn't work as intended.
(The script reads a list of subfolders in a directory and checks whether any of the subfolders appear in a file.)
Here is my script:
set searchFile to "/tmp/output.txt"
set theCommand to "/usr/local/bin/pdftotext -enc UTF-8 some.pdf" & space & searchFile
do shell script theCommand
tell application "Finder"
    set companies to get name of folders of folder ("/path/" as POSIX file)
end tell
repeat with company in companies
    set theCommand to "grep -c " & quoted form of company & space & quoted form of searchFile
    try
        do shell script theCommand
        set CompanyName to company as string
        return CompanyName
    on error
    end try
end repeat
return false
The problem is, for example, with strings containing umlauts. theCommand is somehow encoded differently than when I type it on the command line directly:
$ grep -c 'Württemberg' '/tmp/output.txt' --> typed on command line
3
$ grep -c 'Württemberg' '/tmp/output.txt' --> copy & pasted from AppleScript
0
$ grep -c 'rttemberg' '/tmp/output.txt' --> no umlauts, no problems
3
The "ü" from the first and the second line are different; a echo 'Württemberg' | openssl base64 shows this.
I tried several encoding tricks at different places, basically everything I could find or think of.
Does anyone have any idea? How can I check which encoding a string has?
Thanks in advance!
Sebastian
Overview
This can work by escaping each accented character in each company name before it is used in the grep command.
So, you'll need to escape each one of those characters (i.e. those which have an accent) with double backslashes (i.e. \\). For example:
The ü in Württemberg will need to become \\ü
The ö in Königsberg will need to become \\ö
The ß in Einbahnstraße will need to become \\ß
Why is this necessary:
These accented characters, such as the u with diaeresis, are certainly getting encoded differently. Exactly which encoding they receive is difficult to ascertain. My assumption is that the encoding pattern used begins with a backslash, which is why escaping those characters with backslashes fixes the issue. For the u with diaeresis, for instance, the C/C++ escape for ü is written as \u00FC.
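One way to see for yourself that two visually identical strings can carry different bytes (and therefore fail to match in grep) is to compare them byte by byte. The sketch below is plain Ruby, used here only as an illustration and not part of the AppleScript solution:
# Illustrative only: both strings display as "ü" but have different byte sequences.
precomposed = "\u00FC"   # U+00FC LATIN SMALL LETTER U WITH DIAERESIS (one code point)
decomposed  = "u\u0308"  # "u" followed by U+0308 COMBINING DIAERESIS (two code points)

puts precomposed == decomposed   # false
p precomposed.bytes              # [195, 188]      (UTF-8 for U+00FC)
p decomposed.bytes               # [117, 204, 136] ("u" plus UTF-8 for U+0308)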
Solution
In the complete script below you'll notice the following:
set accentedChars to {"ü", "ö", "ß", "á", "ė"} has been added to hold a list of all the characters that will need to be escaped. You'll need to state each one explicitly, as there doesn't seem to be a way to infer whether a character has an accent.
Before assigning the grep command to the theCommand variable, we first escape the necessary characters via the line reading:
set company to escapeChars(company, accentedChars)
As you can see here we are passing two arguments to the escapeChars sub-routine, (i.e. the non-escaped company variable and the list of accented characters).
In the escapeChars sub-routine we iterate over each character in the accentedChars list and invoke the findAndReplace sub-routine. This escapes with backslashes any instances of those characters found in the company variable.
Complete script:
set searchFile to "/tmp/output.txt"
set accentedChars to {"ü", "ö", "ß", "á", "ė"}
set theCommand to "/usr/local/bin/pdftotext -enc UTF-8 some.pdf" & ¬
    space & searchFile
do shell script theCommand
tell application "Finder"
    set companies to get name of folders of folder ("/path/" as POSIX file)
end tell
repeat with company in companies
    set company to escapeChars(company, accentedChars)
    set theCommand to "grep -c " & quoted form of company & ¬
        space & quoted form of searchFile
    try
        do shell script theCommand
        set CompanyName to company as string
        return CompanyName
    on error
    end try
end repeat
return false
(**
* Checks each character of a given word. If any characters of the word
* match a character in the given list of characters they will be escaped.
*
* #param {text} searchWord - The word to check the characters of.
* #param {text} charactersList - List of characters to be escaped.
* #returns {text} The new text with the item(s) replaced.
*)
on escapeChars(searchWord, charactersList)
    repeat with char in charactersList
        set searchWord to findAndReplace(char, ("\\" & char), searchWord)
    end repeat
    return searchWord
end escapeChars
(**
* Replaces all occurrences of findString with replaceString
*
* #param {text} findString - The text string to find.
* #param {text} replaceString - The replacement text string.
* #param {text} searchInString - Text string to search.
* #returns {text} The new text with the item(s) replaced.
*)
on findAndReplace(findString, replaceString, searchInString)
    set oldTIDs to text item delimiters of AppleScript
    set text item delimiters of AppleScript to findString
    set searchInString to text items of searchInString
    set text item delimiters of AppleScript to replaceString
    set searchInString to "" & searchInString
    set text item delimiters of AppleScript to oldTIDs
    return searchInString
end findAndReplace
Note about current counts:
Currently your grep pattern only reports the number of lines the word was found on, not how many instances of the word were found.
If you want the actual number of instances of the word, use the -o option with grep to output each occurrence, then pipe that to wc with the -l option to count the number of lines. For example:
grep -o 'Württemberg' /tmp/output.txt | wc -l
and in your AppleScript that would be:
set theCommand to "grep -o " & quoted form of company & space & ¬
quoted form of searchFile & "| wc -l"
Tip: If you want to remove the leading spaces in the count that gets logged, pipe it to sed to strip the spaces. For example, via your script:
set theCommand to "grep -o " & quoted form of company & space & ¬
quoted form of searchFile & "| wc -l | sed -e 's/ //g'"
and the equivalent via the command line:
grep -o 'Württemberg' /tmp/output.txt | wc -l | sed -e 's/ //g'

How to build parameters for System.cmd?

I want to use System.cmd to run "convert" from ImageMagick, but I am having difficulty.
System.cmd("convert", ["origin.jpg", "-fill", "black", "-pointsize", "12", "-gravity", "SouthWest", "-draw", "\"text +4,+4 'Test hello world!'\"", "output.jpg"])
The value I am passing for the -draw argument is \"text +4,+4 'Test hello world!'\", but ImageMagick expects text +4,+4 'Test hello world!' without the escaped double quotes.
How can I do it?
You don't need the outer double quotes here with System.cmd/3. You need them when running the command from the shell because the argument contains spaces, and without the outer double quotes the shell would split it into separate words and end up passing the equivalent of ["text", "+4,+4", "Test hello world!"]. The following should work:
System.cmd(..., [..., "text +4,+4 'Test hello world!'", ...])
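For comparison only, the same shell-versus-argument-list distinction exists in Ruby (which appears elsewhere on this page); this is an analogy rather than Elixir code, but it may make the quoting clearer:
# String form goes through a shell, so the -draw value needs surrounding double quotes:
system(%q{convert origin.jpg -draw "text +4,+4 'Test hello world!'" output.jpg})

# Array form passes each argument directly, so no extra double quotes are needed:
system("convert", "origin.jpg", "-draw", "text +4,+4 'Test hello world!'", "output.jpg")
Both calls hand convert exactly the same -draw value.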

Why do I get an array of chars when splitting a string using \d* regex notation?

For example, my string is:
"this8is8my8string"
And here are the two varied results:
2.1.0 :084 > str.split(%r{\d})
=> ["this", "is", "my", "string"]
2.1.0 :085 > str.split(%r{\d*})
=> ["t", "h", "i", "s", "i", "s", "m", "y", "s", "t", "r", "i", "n", "g"]
I don't quite understand why the string is being split between characters when there are no digits between them. Could somebody clarify what is going on in the second version?
Because * means "zero or more," and:
"this8is8my8string"
^^ there's 0 or more digits between the t and the h
^^ there's 0 or more digits between the h and the i
^^ there's 0 or more digits between the i and the s
^^ well... you get the point
You're probably looking for +. \d+ would mean "one or more digits."
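The difference is easier to see when two digits appear in a row; a quick irb check (standard String#split behaviour) looks something like this:
str = "this88is8my8string"
str.split(/\d/)   # => ["this", "", "is", "my", "string"]  (empty string between the two 8s)
str.split(/\d+/)  # => ["this", "is", "my", "string"]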
Also, on a slightly related note: regexes in Ruby are usually written as regex literals, like /\d*/, not with the %r form. That actually threw me off a little when I read your question; it looks very unusual. I suggest using the /.../ literal format to make your code easier to read for most Rubyists.

What do % signs mean in a url?

When I copy and paste the URL of this Wikipedia article, it looks like this:
http://en.wikipedia.org/wiki/Gruy%C3%A8re_%28cheese%29
However, if you paste this back into the address bar, the percent signs disappear and what appear to be Unicode characters (and maybe special URL characters) take their place.
Are these abbreviations for Unicode and special URL characters?
I'm used to seeing \u00ff, etc. in JavaScript.
The reference you're looking for is RFC 3987: Internationalized Resource Identifiers, specifically the section on mapping IRIs to URIs.
RFC 3986: Uniform Resource Identifiers specifies that reserved characters must be percent-encoded, but it also specifies that percent-encoded characters are decoded to US-ASCII, which does not include characters such as è.
RFC 3987 specifies that non-ASCII characters should first be encoded as UTF-8 so they can be percent-encoded as per RFC 3986. If you'll permit me to illustrate in Python:
>>> u'è'.encode('utf-8')
'\xc3\xa8'
Here I've asked Python to encode the Unicode è to a string of bytes using UTF-8. The bytes returned are 0xc3 and 0xa8. Percent-encoded, this looks like %C3%A8.
The parentheses also appearing in your URL do fit in US-ASCII, so they are percent-escaped with their US-ASCII code points, which are also valid UTF-8.
So, no, there is no simple 16×16 table—such a table could never represent the richness of Unicode. But there is a method to the apparent madness.
% in a URI is followed by two characters from 0-9A-F, and is the escaped version of writing the character with that hex code. Doing this means you can write a URI with characters that might have special meaning in other languages.
Common examples are %20 for a space, and %5B and %5D for [ and ], respectively.
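To see these escapes in action, here is a small round trip using the Gruyère title from the question. Ruby is used purely for illustration; any language's URI/CGI helpers behave the same way:
require 'erb'
require 'cgi'

# Percent-encode everything outside the unreserved set: space -> %20, ( -> %28, è -> %C3%A8.
ERB::Util.url_encode("Gruyère (cheese)")   # => "Gruy%C3%A8re%20%28cheese%29"

# Decode the percent escapes back into characters.
CGI.unescape("Gruy%C3%A8re_%28cheese%29")  # => "Gruyère_(cheese)"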
It's just a different syntactical convention for what you're used to from JavaScript. URL syntax is simply different from that of JavaScript, in other words, and % is the way one introduces a two-hex-digit character code in that syntax.
Some characters must be escaped in order to be part of a URL/URI. For example, the / character has meaning; it's a metacharacter, in other words. If you need a / in the middle of a path component (which admittedly would be a little weird), you'd have to escape it. It's analogous to the need to escape quote characters in JavaScript string constants.
It is important to note that the % sign serves two primary purposes. One is to encode special characters that have a reserved meaning in a URL, and the other is to encode characters outside of what you can type on your keyboard, for example %C3%A8 for è and %2F for a forward slash /.
Using JavaScript, we can create an encoding chart:
http://jsfiddle.net/CG8gx/3/
["\x00", "\x01", "\x02", "\x03", "\x04", "\x05",
"\x06", "\x07", "\b", "\t", "\n", "\v", "\f", "\r", "\x0E", "\x0F",
"\x10", "\x11", "\x12", "\x13", "\x14", "\x15", "\x16", "\x17",
"\x18", "\x19", "\x1A", "\x1B", "\x1C", "\x1D", "\x1E", "\x1F", " ",
"!", "\"", "#", "$", "%", "&", "'", "(", ")", "*", "+", ",", "-", ".",
"/", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", ":", ";", "<",
"=", ">", "?", "#", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J",
"K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X",
"Y", "Z", "[", "\", "]", "^", "_", "`", "a", "b", "c", "d", "e", "f",
"g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t",
"u", "v", "w", "x", "y", "z", "{", "|", "}", "~", "\x7F"]
I don't know the technical details of this, but if you change the beginning of the URL so that it is no longer recognized as a URL, it will copy and paste correctly. For example, if you add or remove a character at the beginning of the URL when you copy it, it will paste without the percent signs, as follows:
_ttps://en.wikipedia.org/wiki/Gruyère_cheese

Unable to manipulate a byte array

I'm trying to pass a byte array from inside my Rails app to another Ruby script (still inside my Rails app), for example:
`./app/animations/fade.sh "\x01\x01\x04\x00" &`
Yields ArgumentError (string contains null byte)
I suppose I'm stumped on how to form this string and then pass it to my script, which will use it in this sort of fashion:
#sp.write ["#{ARGV[0]}", "f", "\x12"]
I'd like to form the string (in my Rails app) like this if possible:
led = "\x01#{led.id}\x04\x00"
But I keep getting the ArgumentError (string contains null byte) error. Is there a way I can form this string from elements in my Rails app and then pass it to my external script?
You should just pass the data in through standard input, not the command line. You can use IO.popen for this purpose:
IO.popen("./app/animations/fade.sh", "w+") do |f|
f.write "\x01\x01\x04\x00"
end
And on the reading side:
input = $stdin.read
#sp.write [input, "f", "\x12"]
(By the way, it's more common to name Ruby scripts .rb instead of .sh; if fade.sh is meant to be a Ruby script, as I assume from the syntax you used in its example contents, you might want to name it fade.rb)
You could use Base64 to pass the byte string around:
$ cat > test.sh
echo $1 | base64 -d
$ chmod a+x test.sh
and then from ruby:
irb
>> require 'base64'
=> true
>> `./test.sh "#{Base64.encode64 "\x01\x01\x04\x00"}"`
=> "\x01\x01\x04\x00"
Can your script accept input from STDIN instead? Perhaps using read.
If you can't do this, you could encode your nulls and escape your encoding.
E.g. 48656c6c6f0020576f726c64 (the bytes of "Hello", a null, a space, and "World") could be encoded as 48656c6c6f20012020576f726c64,
which in turn would be decoded again if both sides agree that 2020 = 20 and 2001 = 00.
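A plainer variant of the same idea, sketched here in Ruby (this is not the exact 2020/2001 escape scheme above), is to hex-encode the whole byte string on the sending side and decode it on the receiving side, which keeps null bytes off the command line entirely. The script path and the #sp.write line are the ones from the question:
# Rails side: turn the raw bytes into a hex string that is safe to pass as an argument.
bytes = "\x01\x01\x04\x00"
hex = bytes.unpack("H*").first       # => "01010400"
`./app/animations/fade.sh #{hex} &`

# Script side (fade.sh, assuming it is a Ruby script): decode the hex argument.
decoded = [ARGV[0]].pack("H*")       # => "\x01\x01\x04\x00"
#sp.write [decoded, "f", "\x12"]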
Update: I think encoding is what you'll have to do, because I tried using read and it turned out to be a little too difficult. There's probably another option, but I don't see it yet.
Here's my script and two test runs:
dlamblin$ cat test.sh
echo "reading two lines of input, first line is length of second."
read len
read ans
echo "C string length of second line is:" ${#ans}
for ((c=0; c<$len; c++))
do
/bin/echo -n "${ans:$c:1},"
done
echo ' '
exit
dlamblin$ echo -e '12\0012Hello \0040World' | sh test.sh
reading two lines of input, first line is length of second.
C string length of second line is: 12
H,e,l,l,o, , ,W,o,r,l,d,
dlamblin$ echo -e '12\0012Hello \0000World' | sh test.sh
reading two lines of input, first line is length of second.
C string length of second line is: 5
H,e,l,l,o,,,,,,,,
# Octals \0000, \0012 and \0040 are NUL, NL and SP respectively
