When I copy and paste the URL of this Wikipedia article, it looks like this:
http://en.wikipedia.org/wiki/Gruy%C3%A8re_%28cheese%29
However, if you paste this back into the address bar, the percent signs disappear and what appear to be Unicode characters (and maybe special URL characters) take their place.
Are these abbreviations for Unicode and special URL characters?
I'm used to seeing \u00ff, etc. in JavaScript.
The reference you're looking for is RFC 3987: Internationalized Resource Identifiers, specifically the section on mapping IRIs to URIs.
RFC 3986: Uniform Resource Identifiers specifies that reserved characters must be percent-encoded, but it also specifies that percent-encoded characters are decoded to US-ASCII, which does not include characters such as è.
RFC 3987 specifies that non-ASCII characters should first be encoded as UTF-8 so they can be percent-encoded as per RFC 3986. If you'll permit me to illustrate in Python:
>>> u'è'.encode('utf-8')
'\xc3\xa8'
Here I've asked Python to encode the Unicode è to a string of bytes using UTF-8. The bytes returned are 0xc3 and 0xa8. Percent-encoded, this looks like %C3%A8.
The parentheses that also appear in your URL do fit in US-ASCII, so they are percent-escaped with their US-ASCII code points, which are also valid UTF-8.
So, no, there is no simple 16×16 table—such a table could never represent the richness of Unicode. But there is a method to the apparent madness.
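The two steps described above (UTF-8 encode, then percent-escape each byte) can be sketched in Ruby as well. This is only an illustration: the variable names are made up for the example, and `URI.encode_www_form_component` is one standard-library helper among several that happens to escape both the parentheses and the è.

```ruby
require 'uri'

# Step 1: encode the character as UTF-8 and look at the raw bytes.
bytes = "\u00E8".encode('UTF-8').bytes   # the two bytes 0xC3 and 0xA8

# Step 2: write each byte as % followed by two hex digits.
escaped = bytes.map { |b| format('%%%02X', b) }.join

# The standard library performs both steps at once. This helper targets
# form data, so it would turn a space into "+" rather than "%20".
title   = "Gruy\u00E8re_(cheese)"
encoded = URI.encode_www_form_component(title)
decoded = URI.decode_www_form_component(encoded)

puts escaped
puts encoded
puts decoded
```

Decoding simply reverses the process, which is what the browser does when it shows the readable form in the address bar.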
A % in a URI is followed by two hexadecimal digits (0-9, A-F); together they are the escaped form of the character with that hex code. This lets you write a URI containing characters that would otherwise have special meaning.
Common examples are %20 for a space and %5B and %5D for [ and ], respectively.
It's just a different syntactical convention for what you're used to from JavaScript. URL syntax is simply different from that of JavaScript, in other words, and % is the way one introduces a two-hex-digit character code in that syntax.
Some characters must be escaped in order to be part of a URL/URI. For example, the / character has meaning; it's a metacharacter, in other words. If you need a / in the middle of a path component (which admittedly would be a little weird), you'd have to escape it. It's analogous to the need to escape quote characters in JavaScript string constants.
It is important to note that the % sign serves two primary purposes. One is to encode special characters and the other is to encode Unicode characters beyond what you can type with your hardware/keyboard. For example, %C3%A8 encodes è, and %2F encodes a forward slash /.
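Both purposes can be seen side by side in a short Ruby sketch. It assumes `ERB::Util.url_encode` from the standard library as the escaping helper; other helpers differ in exactly which characters they leave alone.

```ruby
require 'erb'

# A reserved character, a space, and a non-ASCII character all go through
# the same mechanism: each byte becomes % plus two hex digits.
slash = ERB::Util.url_encode('/')
space = ERB::Util.url_encode(' ')
grave = ERB::Util.url_encode("\u00E8")

puts slash
puts space
puts grave
```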
Using JavaScript we can create an encoding chart:
http://jsfiddle.net/CG8gx/3/
["\x00", "\x01", "\x02", "\x03", "\x04", "\x05",
"\x06", "\x07", "\b", "\t", "\n", "\v", "\f", "\r", "\x0E", "\x0F",
"\x10", "\x11", "\x12", "\x13", "\x14", "\x15", "\x16", "\x17",
"\x18", "\x19", "\x1A", "\x1B", "\x1C", "\x1D", "\x1E", "\x1F", " ",
"!", "\"", "#", "$", "%", "&", "'", "(", ")", "*", "+", ",", "-", ".",
"/", "0", "1", "2", "3", "4", "5", "6", "7", "8", "9", ":", ";", "<",
"=", ">", "?", "@", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J",
"K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X",
"Y", "Z", "[", "\\", "]", "^", "_", "`", "a", "b", "c", "d", "e", "f",
"g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t",
"u", "v", "w", "x", "y", "z", "{", "|", "}", "~", "\x7F"]
I don't know the technical details of this. But if you change the beginning of the URL so that it is no longer recognized as a URL, it will copy and paste correctly. For example, if you add or remove a character at the beginning of the URL when you copy it, it will paste without the percent signs, as follows:
_ttps://en.wikipedia.org/wiki/Gruyère_cheese
Why would you ever use %w[] considering arrays in Rails are type-agnostic?
This is the most concise way to define an array of strings, because you don't have to type quotes and commas.
%w(abc def xyz)
Instead of
['abc', 'def', 'xyz']
Duplicate question of
http://stackoverflow.com/questions/1274675/what-does-warray-mean
http://stackoverflow.com/questions/5475830/what-is-the-w-thing-in-ruby
For more details you can follow https://simpleror.wordpress.com/2009/03/15/q-q-w-w-x-r-s/
These are the types of percent strings in Ruby:
%w : Array of Strings
%i : Array of Symbols
%q : String
%r : Regular Expression
%s : Symbol
%x : Backtick (capture subshell result)
Let's take an example.
Say you have a sentence like
Thanks for contributing an answer to Stack Overflow!
When you write
%w(Thanks for contributing an answer to Stack Overflow!)
you get the output
=> ["Thanks", "for", "contributing", "an", "answer", "to", "Stack", "Overflow!"]
To keep several words together as a single element of the array, escape the space between them with \
For example,
%w(Thanks for contributing an answer to Stack\ Overflow!)
gives
=> ["Thanks", "for", "contributing", "an", "answer", "to", "Stack Overflow!"]
Here the Ruby interpreter splits the input on spaces. A \ at the end of a word merges the next word with it, and the pair is pushed into the array as a single string.
You can also use other delimiters, as below
%w[2 4 5 6]
If you write
%w("abc" "def")
then the output would be
=> ["\"abc\"", "\"def\""]
%w(abc def xyz) is a shortcut for ["abc", "def","xyz"]. Meaning it's a notation to write an array of strings separated by spaces instead of commas and without quotes around them.
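The points made in the answers above can be checked in a few lines of Ruby. The variable names are only for illustration.

```ruby
# %w builds an array of strings; %i builds an array of symbols.
words   = %w[abc def xyz]
symbols = %i[abc def]

# A backslash-escaped space keeps two words in one element.
joined  = %w[Stack\ Overflow rocks]

# Quotes have no special meaning inside %w; they end up in the strings.
quoted  = %w["abc" "def"]

p words
p symbols
p joined
p quoted
```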
For example, my string is:
"this8is8my8string"
And here are the two varied results:
2.1.0 :084 > str.split(%r{\d})
=> ["this", "is", "my", "string"]
2.1.0 :085 > str.split(%r{\d*})
=> ["t", "h", "i", "s", "i", "s", "m", "y", "s", "t", "r", "i", "n", "g"]
I don't quite understand why the string is being split into characters if there are no digits between them. Could somebody clarify what is going on in the second version?
Because * means "zero or more," and:
"this8is8my8string"
 ^^ there's 0 or more digits between the t and the h
  ^^ there's 0 or more digits between the h and the i
   ^^ there's 0 or more digits between the i and the s
    ^^ well... you get the point
You're probably looking for +. \d+ would mean "one or more digits."
Also, on a slightly related topic: typically regexen in Ruby are seen as regex literals, like /\d*/, not with the %r format. That actually threw me off a little when I read your question; it seems very strange. I suggest using the /.../ literal format to make your code easier to read for most Rubyists.
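A small Ruby sketch of the difference, using the /.../ literal form. The key observation is that \d* is able to match the empty string, which is why it finds a "separator" between every pair of characters.

```ruby
str = 'this8is8my8string'

# \d+ needs at least one digit, so it only matches the real separators.
one_or_more = str.split(/\d+/)

# A run of several digits still counts as a single separator.
digit_run = 'this88is'.split(/\d+/)

# \d* happily matches nothing at all: the match at the start of 'abc'
# is the empty string.
zero_width = 'abc'[/\d*/]

p one_or_more
p digit_run
p zero_width
```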
I am trying to create a workflow that converts a list of URLs into plain text using Instapaper, and then saves the text in text documents on my machine.
So far, I have been able to grab the list of URLs, get the title of each webpage, and convert the URLs to plain text.
I have the list of titles saved in a variable "Article Titles." The plain text of each article is then passed from "Get Text from Webpage" to "New Text File."
I tried putting the Article Titles variable in the Save As input of the "New Text File" action, but no files are generated. (When I simply entered a generic title into the Save As field, files were generated, but they all had the same name.) I suspect that I can't use a variable containing an array as a Save As input. But I'd like each new file to have its respective name.
How can I have the action iterate over the array of titles so that each item of plain text from "Get Text from Webpage" is saved with its title from the "Article Titles" variable?
What frustrates many people is the problem of passing more than one variable to an action. There are ways around it, like saving to an external script.
But in this case a simple AppleScript, mixed with the bit of script @adayzdone gave you, will get you what I think you want.
You just need to pass the list of URLs to this 'Run AppleScript' action:
on run {input, parameters}
    set docPath to POSIX path of (path to documents folder)
    repeat with i from 1 to count of items of input
        set this_item to item i of input
        set thePage to (do shell script "curl " & quoted form of this_item)
        set theTitle to docPath & "/" & (do shell script "echo " & quoted form of thePage & " | grep -o \\<title\\>.*\\</title\\> | sed -E 's/<\\/?title>//g'")
        set t_text to (do shell script "echo " & quoted form of thePage & "|textutil -format html -convert txt -stdin -output \"" & theTitle & ".txt\"")
    end repeat
end run
** Update for passing the text on to the next action. **
This still does what the script above does, but it now also passes a list of the text contents from all the URLs on to the next action.
I have tested it with 'Text to Speech' and it reads the multiple text contents.
on run {input, parameters}
    set docPath to POSIX path of (path to documents folder)
    set bigList to {}
    repeat with i from 1 to count of items of input
        set this_item to item i of input
        set thePage to (do shell script "curl " & quoted form of this_item)
        set theTitle to docPath & "/" & (do shell script "echo " & quoted form of thePage & " | grep -o \\<title\\>.*\\</title\\> | sed -E 's/<\\/?title>//g'")
        set t_text to (do shell script "echo " & quoted form of thePage & "|textutil -format html -convert txt -stdin -output \"" & theTitle & ".txt\"")
        set t_text_for_action to (do shell script "echo " & quoted form of thePage & "|textutil -format html -convert txt -stdin -stdout")
        copy t_text_for_action to end of bigList
    end repeat
    return bigList --> text list can now be passed to the next action
end run
If you want to test, may I suggest a page that has a small amount of text on it, like http://www.javascripter.net/
Update 2 - Save the text to an audio file using the Unix command 'say'.
OK, there are a couple of things here.
1. For the same reason as before, I kept everything in one script: passing the text and the titles together to the next action would be a pain, if not impossible.
2. The script uses the Unix say command and its -o (output) option to save the text as an AIFF file. It also names the file after the title.
3. I had a problem where, instead of saving the file, it started speaking the text. It turned out that the URL I was testing with (http://www.javascripter.net) had its title tag in caps, so the grep and sed part of @adayzdone's script was returning "", which threw off the say command. I fixed this by using the -i (ignore case) option in grep, and by using the "|" (or) operator in sed with an upper-case version of the expression.
4. The title being returned also contained other characters that prevented the file from being saved as a recognisable file, because the extension was not being added. This is fixed by a simple handler that returns the title text with only allowed characters.
5. It works.
on run {input, parameters}
    set docPath to POSIX path of (path to documents folder)
    repeat with i from 1 to count of items of input
        set this_item to item i of input
        set thePage to (do shell script "curl -A \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.30 (KHTML, like Gecko) Chrome/12.0.742.112 Safari/534.30\" " & quoted form of this_item)
        set theTitle to replaceBadChars((do shell script "echo " & quoted form of thePage & " | grep -io \\<title\\>.*\\</title\\> | sed -E 's/<\\/?title>|<\\/?TITLE>//g'"))
        set t_text_for_action to (do shell script "echo " & quoted form of thePage & "|textutil -format html -convert txt -stdin -stdout")
        do shell script "cd " & quoted form of docPath & " ;say -o \"" & theTitle & "\" , " & quoted form of t_text_for_action
    end repeat
end run
on replaceBadChars(TEXT_)
    log TEXT_
    set OkChars to {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "A", "B", "C", "D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S", "T", "U", "V", "W", "X", "Y", "Z", "1", "2", "3", "4", "5", "6", "7", "8", "9", "0", "_", space}
    set TEXT_ to characters of TEXT_
    repeat with i from 1 to count of items in TEXT_
        set this_char to item i of TEXT_
        if this_char is not in OkChars then
            set item i of TEXT_ to "_"
        end if
    end repeat
    set TEXT_ to TEXT_ as string
    do shell script " echo " & quoted form of TEXT_
end replaceBadChars
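For comparison only, the whitelist idea behind the replaceBadChars handler can be written as a single substitution in a language with regular expressions. The sketch below uses Ruby; the method name replace_bad_chars and the sample title are hypothetical, and the set of allowed characters mirrors the OkChars list above (letters, digits, underscore, space).

```ruby
# Replace every character outside the allowed set with an underscore.
def replace_bad_chars(title)
  title.gsub(/[^A-Za-z0-9_ ]/, '_')
end

puts replace_bad_chars('Hello, World! (Test)')
```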
I am seeing weird behaviour in my params, which are passed as UTF-8 but whose special characters are not handled well.
Instead of one special character, I get two characters: the plain letter plus the accent.
Parameters: {"name"=>"Mylène.png", "_cardbiz_session"=>"be1d5b7a2f27c7c4979ac4c16fe8fc82", "authenticity_token"=>"9vmJ02DjgKYCpoBNUcWwUlpxDXA8ddcoALHXyT6wrnM=", "asset"=>{"file"=>#<ActionDispatch::Http::UploadedFile:0x007f94d38d37d0 @original_filename="Mylène.png", @content_type="image/png", @headers="Content-Disposition: form-data; name=\"asset[file]\"; filename=\"Myle\xCC\x80ne.png\"\r\nContent-Type: image/png\r\n", @tempfile=#<File:/var/folders/q5/yvy_v9bn5wl_s5ccy_35qsmw0000gn/T/RackMultipart20130805-51100-1eh07dp>>}, "id"=>"copie-de-sm"}
I log this:
logger.debug file_name
logger.debug file_name.chars.map(&:to_s).inspect
Each time, same result:
Mylène
["M", "y", "l", "e", "̀", "n", "e"]
As I am trying to match the filename against existing names that are properly encoded in UTF-8, you can see my problem ;)
Encodings are UTF-8 everywhere.
I'm working with Ruby 1.9.3 and Rails 3.2.14.
I added #encoding: utf-8 at the top of every file involved.
If anyone has an idea, I'll take it!
I also opened an issue here: https://github.com/carrierwaveuploader/carrierwave/issues/1185 but I'm not sure whether it's a CarrierWave issue or me missing something...
It seems to be linked to Mac OS X.
https://www.ruby-forum.com/topic/4407424 explains it and refers to https://bugs.ruby-lang.org/issues/7267 for more details and discussion.
Mac OS X decomposes special characters, producing utf8-mac instead of plain utf-8...
Since you can't know the encoding of a file name, just presuppose it.
Thanks to our Linux guy, on whose machine it works properly. ;)
file_name.encode!('utf-8', 'utf-8-mac').chars.map(&:to_s)
Perhaps you have a combining character and a problem with Unicode equivalence.
When I check the codepoints with:
#encoding: utf-8
Parameters = {"name"=>"Mylène.png",}
p Parameters['name'].codepoints.to_a
I get Myl\u00E8ne.png, but I think that's a conversion problem from copying the text. It would be helpful if you could provide a file with the raw data.
I expect you have an e followed by a combining grave accent.
The solution would be Unicode normalization. (Sorry, I don't know how to do it in Ruby. Perhaps somebody else has an answer for it.)
You found your problem, so you no longer need this.
But in the meantime I found a mechanism to normalize Unicode strings:
#encoding: utf-8
text = "Myl\u00E8ne.png" #"Mylène.png"
text2 = "Myle\u0300ne.png" #"Mylène.png"
puts text #Mylène.png
puts text2 #Mylène.png
p text == text2 #false
#http://apidock.com/rails/ActiveSupport/Multibyte/Unicode/normalize
require 'active_support'
p text #"Myl\u00E8ne.png"
p ActiveSupport::Multibyte::Unicode.normalize(text, :d) #"Myle\u0300ne.png"
p text2 #"Myle\u0300ne.png"
p ActiveSupport::Multibyte::Unicode.normalize(text2, :c)#"Myl\u00E8ne.png"
Maybe there is an easier way, but up to now I found none.
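On Ruby 2.2 and later (newer than the 1.9.3 used in the question), the standard String#unicode_normalize method covers the same ground without ActiveSupport. A minimal sketch, reusing the two spellings of the file name from above:

```ruby
composed   = "Myl\u00E8ne.png"    # single code point U+00E8
decomposed = "Myle\u0300ne.png"   # e followed by combining grave U+0300

# The strings look identical when printed but compare as different.
same_before = (composed == decomposed)

# NFC composes, NFD decomposes; normalizing both sides to the same
# form makes the comparison succeed.
nfc_match = (decomposed.unicode_normalize(:nfc) == composed)
nfd_match = (composed.unicode_normalize(:nfd) == decomposed)

p same_before
p nfc_match
p nfd_match
```

Normalizing incoming file names to one form (NFC is a common choice) before comparing them with stored names avoids the mismatch regardless of the client's operating system.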
I'm trying to parse the words out of a string and put them into an array. I've tried the following:
@string1 = "oriented design, decomposition, encapsulation, and testing. Uses "
puts @string1.scan(/\s([^\,\.\s]*)/)
It seems to do the trick, but it's a bit shaky (I should include more special characters, for example). Is there a better way to do this in Ruby?
Optional: I have a CS course description. I intend to extract all the words from it and place them in a string array, remove the most common words in the English language from that array, and then use the rest of the words as tags that users can use to search for CS courses.
The split command.
words = @string1.split(/\W+/)
will split the string into an array based on a regular expression. \W means any "non-word" character, and the "+" means one or more of them, so a run of consecutive delimiters counts as a single separator.
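Applied to the string from the question, and extended to the tagging idea mentioned there, this might look as follows. The STOP_WORDS list is a tiny made-up placeholder, not a real list of common English words.

```ruby
string1 = 'oriented design, decomposition, encapsulation, and testing. Uses '

# Split on runs of non-word characters; trailing empty strings are dropped.
words = string1.split(/\W+/)

# Hypothetical stop-word list, just large enough for this example.
STOP_WORDS = %w[and the of a].freeze

# Lower-case the words, drop the stop words, and remove duplicates.
tags = words.map(&:downcase).reject { |w| STOP_WORDS.include?(w) }.uniq

p words
p tags
```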
For me, the best way to split sentences is:
line.split(/[^[[:word:]]]+/)
It works perfectly even with multilingual words and punctuation marks:
line = 'English words, Polski Żurek!!! crème fraîche...'
line.split(/[^[[:word:]]]+/)
=> ["English", "words", "Polski", "Żurek", "crème", "fraîche"]
Well, you could split the string on spaces if that's your delimiter of interest:
@string1.split(' ')
Or split using one of these regex tokens:
\W # Any non-word character
\s # Any whitespace character
\b # A word boundary (a zero-width position, not a character)
Hint: try testing each of these on http://rubular.com
And note that ruby 1.9 has some differences from 1.8
For Rails you can use something like this:
@string1.split(/\s/).delete_if(&:blank?)
I would write something like this:
@string
.split(/,+|\s+/) # any ',' or any whitespace characters (space, tab, newline)
.reject(&:empty?)
.map { |w| w.gsub(/\A\W+|\W+\z/, '') } # \A\W+ => any leading punctuation; \W+\z => any trailing punctuation
irb(main):047:0> @string1 = "oriented design, 'with', !!qwe, and testing. can't rubyisgood#)(*#%)(*, and,rails,is,good"
=> "oriented design, 'with', !!qwe, and testing. can't rubyisgood#)(*#%)(*, and,rails,is,good"
irb(main):048:0> @string1.split(/,+|\s+/).reject(&:empty?).map { |w| w.gsub(/\A\W+|\W+\z/, '') }
=> ["oriented", "design", "with", "qwe", "and", "testing", "can't", "rubyisgood", "and", "rails", "is", "good"]