I'm trying to write a regular expression in Ruby where I want to see if the string contains a certain word (e.g. "string"), followed by a url and link name in parenthesis.
Right now I'm doing:
string.include?("string") && string.scan(/\(([^\)]+)\)/).present?
My input in both conditionals is a string. In the first one, I'm checking if it contains the word "link" and then I will have the link and link_name in parenthesis, like this:
"Please go to link( url link_name)"
After validating that, I extract the HTML link.
Is there a way I can combine them using regular expressions?
The most important improvement you can make is to also test that the word and the parentheseses have the correct relationship. If I understand correctly, "link(url link_name)" should be a match but "(url link_name)link" or "link stuff (url link_name)" should not. So match "link", the parentheses, and their contents, and capture the contents, all at once:
"stuff link(url link_name) more stuff".match(/link\((\S+?) (\S+?)\)/)&.captures
=> ["url", "link_name"]
(&. is Ruby 2.3; use Rails' .try :captures in older versions.)
Side note: string.scan(regex).present? is more concisely written as string =~ regex.
Checking If a Word Is Contained
If you want to find matches that contain a specific word somewhere in the string, you can accomplish this through a lookahead :
# This will match any string that contains your string "{your-string-here}"
(?=.*({your-string-here}).*).*
You could consider building a string version of your expression and passing the word you are looking for using a variable :
wordToFind = "link"
if stringToTest =~ /(?=.*(#{wordToFind}).*).*/
# stringToTest contains "link"
else
# stringToTest does not contain "link"
end
Checking for a Word AND Parentheses
If you also wanted to ensure that somewhere in your string you had a set of parentheses with some content in them and your previous lookahead for a word, you could use :
# This will match any strings that contain your word and contain a set of parentheses
(?=.*({your-string-here}).*).*\([^\)]+\).*
which might be used as :
wordToFind = "link"
if stringToTest =~ /(?=.*(#{wordToFind}).*).*\([^\)]+\).*/
# stringToTest contains "link" and some non-empty parentheses
else
# stringToTest does not contain "link" or non-empty parentheses
end
def has_both?(str, word)
str.scan(/\b#{word}\b|(?<=\()[^\(\)]+(?=\))/).size == 2
end
has_both?("Wait for me, Wild Bill.", "Bill")
#=> false
has_both?("Wait (for me), Wild William.", "Bill")
#=> false
has_both?("Wait (for me), Wild Billy.", "Bill")
#=> false
has_both?("Wait (for me), Wild bill.", "Bill")
#=> false
has_both?("Wait (for me, Wild Bill.", "Bill")
#=> false
has_both?("Wait (for me), Wild Bill.", "Bill")
#=> true
has_both?("Wait ((for me), Wild Bill.", "Bill")
#=> true
has_both?("Wait ((for me)), Wild Bill.", "Bill")
#=> true
These are the calculations for
word = "Bill"
str = "Wait (for me), Wild Bill."
r = /
\b#{word}\b # match the value of the variable 'word' with word breaks for and aft
| # or
(?<=\() # match a left paren in a positive lookbehind
[^\(\)]+ # match one or more characters other than parens
(?=\)) # match a right paren in a positive lookahead
/x # free-spacing regex definition mode
#=> /
\bBill\b # match the value of the variable 'word' with word breaks for and aft
| # or
(?<=\() # match a left paren in a positive lookbehind
[^\(\)]+ # match one or more characters other than parens
(?=\)) # match a right paren in a positive lookahead
/x
arr = str.scan(r)
#=> ["for me", "Bill"]
arr.size == 2
#=> true
I would go with something like this regex:
/link\s*\(([^\)\s]+)\s*([^\)]+)?\)/i
This will find any match starting with the word link, followed by any number of spaces, then a url followed by a link name, both in parentheses. In this regex, the link name is optional, but the url is not. The matching is case-insensitive, so it will match link and LINK exactly the same.
You can use the Regexp#match method to compare the regex to a string, and check the result for matches and captures, like so:
m = /link\s*\(([^\)\s]+)\s*([^\)]+)?\)/i.match("link (stackoverflow.com StackOverflow)")
if m # the match array is not nil
puts "Matched: #{m[0]}"
puts " -- url: {m[1]}"
puts " -- link-name: #{m[2] || 'none'}"
else # the match array is nil, so no match was found
puts "No match found"
end
If you'd like to use different strings to identify the match, you can use a non-capturing group, where you change link to something like:
(?:link|site|website|url)
In this case, the (?: syntax says not to capture this part of the match. If you want to capture which term matched, simply change that from (?: to (, and adjust the capture indexes by 1 to account for the new capture value.
Here's a short Ruby test program:
data = [
[ true, "link (http://google.com Google)", "http://google.com", "Google" ],
[ true, "LiNk(ftp://website.org)", "ftp://website.org", nil ],
[ true, "link (https://facebook.com/realstanlee/ Stan Lee) linkety link", "https://facebook.com/realstanlee/", "Stan Lee" ],
[ true, "x link (https://mail.yahoo.com Yahoo! Mail)", "https://mail.yahoo.com", "Yahoo! Mail" ],
[ false, "link lunk (http://www.com)", nil, nil ]
]
data.each do |test_case|
link = /link\s*\(([^\)\s]+)\s*([^\)]+)?\)/i.match(test_case[1])
url = link ? link[1] : nil
link_name = link ? link[2] : nil
success = test_case[0] == !link.nil? && test_case[2] == url && test_case[3] == link_name
puts "#{success ? 'Pass' : 'Fail'}: '#{test_case[1]}' #{link ? 'found' : 'not found'}"
if success && link
puts " -- url: '#{url}' link_name: '#{link_name || '(no link name)'}'"
end
end
This produces the following output:
Pass: 'link (http://google.com Google)' found
-- url: 'http://google.com' link_name: 'Google'
Pass: 'LiNk(ftp://website.org)' found
-- url: 'ftp://website.org' link_name: '(no link name)'
Pass: 'link (https://facebook.com/realstanlee/ Stan Lee) linkety link' found
-- url: 'https://facebook.com/realstanlee/' link_name: 'Stan Lee'
Pass: 'x link (https://mail.yahoo.com Yahoo! Mail)' found
-- url: 'https://mail.yahoo.com' link_name: 'Yahoo! Mail'
Pass: 'link lunk (http://www.com)' not found
If you want to allow anything other than spaces between the word 'link' and the first paren, simply change the \s* to [^\(]* and you should be good to go.
Related
I have a query string which I want to separate out
created_at BETWEEN '2018-01-01T00:00:00+05:30' AND '2019-01-01T00:00:00+05:30' AND updated_at BETWEEN '2018-05-01T00:00:00+05:30' AND '2019-05-01T00:00:00+05:30' AND user_id = 5 AND status = 'closed'
Like this
created_at BETWEEN '2018-01-01T00:00:00+05:30' AND '2019-01-01T00:00:00+05:30'
updated_at BETWEEN '2018-05-01T00:00:00+05:30' AND '2019-05-01T00:00:00+05:30'
user_id = 5
status = 'closed'
This is just an example string, I want to separate the query string dynamically. I know can't just split with AND because of the pattern like BETWEEN .. AND
You might be able to do this with regex but here's a parser that may work for your use case. It can surely be improved but it should work.
require 'time'
def parse(sql)
arr = []
split = sql.split(' ')
date_counter = 0
split.each_with_index do |s, i|
date_counter = 2 if s == 'BETWEEN'
time = Time.parse(s.strip) rescue nil
date_counter -= 1 if time
arr << i+1 if date_counter == 1
end
arr.select(&:even?).each do |index|
split.insert(index + 2, 'SPLIT_ME')
end
split = split.join(' ').split('SPLIT_ME').map{|l| l.strip.gsub(/(AND)$/, '')}
split.map do |line|
line[/^AND/] ? line.split('AND') : line
end.flatten.select{|l| !l.empty?}.map(&:strip)
end
This is not really a regex, but more a simple parser.
This works by matching a regex from the start of the string until it encounters a whitespace followed by either and or between followed by a whitespace character. The result is removed from the where_cause and saved in statement.
If the start of the string now starts with a whitespace followed by between followed by a whitespace. It is added to statement and removed from where_cause with anything after that, allowing 1 and. Matching stops if the end of the string is reached or another and is encountered.
If point 2 didn't match check if the string starts with a whitespace followed by and followed by a whitespace. If this is the case remove this from where_cause.
Finally add statement to the statements array if it isn't an empty string.
All matching is done case insensitive.
where_cause = "created_at BETWEEN '2018-01-01T00:00:00+05:30' AND '2019-01-01T00:00:00+05:30' AND updated_at BETWEEN '2018-05-01T00:00:00+05:30' AND '2019-05-01T00:00:00+05:30' AND user_id = 5 AND status = 'closed'"
statements = []
until where_cause.empty?
statement = where_cause.slice!(/\A.*?(?=[\s](and|between)[\s]|\z)/mi)
if where_cause.match? /\A[\s]between[\s]/i
between = /\A[\s]between[\s].*?[\s]and[\s].*?(?=[\s]and[\s]|\z)/mi
statement << where_cause.slice!(between)
elsif where_cause.match? /\A[\s]and[\s]/i
where_cause.slice!(/\A[\s]and[\s]/i)
end
statements << statement unless statement.empty?
end
pp statements
# ["created_at BETWEEN '2018-01-01T00:00:00+05:30' AND '2019-01-01T00:00:00+05:30'",
# "updated_at BETWEEN '2018-05-01T00:00:00+05:30' AND '2019-05-01T00:00:00+05:30'",
# "user_id = 5",
# "status = 'closed'"]
Note: Ruby uses \A to match the start of the string and \z to match the end of a string instead of the usual ^ and $, which match the beginning and ending of a line respectively. See the regexp anchor documentation.
You can replace every [\s] with \s if you like. I've added them in to make the regex more readable.
Keep in mind that this solution isn't perfect, but might give you an idea how to solve the issue. The reason I say this is because it doesn't account for the words and/between in column name or string context.
The following where cause:
where_cause = "name = 'Tarzan AND Jane'"
Will output:
#=> ["name = 'Tarzan", "Jane'"]
This solution also assumes correctly structured SQL queries. The following queries don't result in what you might think:
where_cause = "created_at = BETWEEN AND"
# TypeError: no implicit conversion of nil into String
# ^ does match /\A[\s]between[\s]/i, but not the #slice! argument
where_cause = "id = BETWEEN 1 AND 2 BETWEEN 1 AND 3"
#=> ["id = BETWEEN 1 AND 2 BETWEEN 1", "3"]
I'm not certain if I understand the question, particularly in view of the previous answers, but if you simply wish to extract the indicated substrings from your string, and all column names begin with lowercase letters, you could write the following (where str holds the string given in the question):
str.split(/ +AND +(?=[a-z])/)
#=> ["created_at BETWEEN '2018-01-01T00:00:00+05:30' AND '2019-01-01T00:00:00+05:30'",
# "updated_at BETWEEN '2018-05-01T00:00:00+05:30' AND '2019-05-01T00:00:00+05:30'",
# "user_id = 5",
# "status = 'closed'"]
The regular expression reads, "match one or more spaces, followed by 'AND', followed by one or more spaces, followed by a positive lookahead that contains a lowercase letter". Being in a positive lookahead, the lowercase letter is not part of the match that is returned.
I'm trying to display an array of words from a user's post. However the method I'm using treats an apostrophe like whitespace.
<%= var = Post.pluck(:body) %>
<%= var.join.downcase.split(/\W+/) %>
So if the input text was: The baby's foot
it would output the baby s foot,
but it should be the baby's foot.
How do I accomplish that?
Accepted answer is too naïve:
▶ "It’s naïve approach".split(/[^'\w]+/)
#⇒ [
# [0] "It",
# [1] "s",
# [2] "nai",
# [3] "ve",
# [4] "approach"
# ]
this is because nowadays there is almost 2016 and many users might want to use their normal names, like, you know, José Østergaard. Punctuation is not only the apostroph, as you might notice.
▶ "It’s naïve approach".split(/[^'’\p{L}\p{M}]+/)
#⇒ [
# [0] "It’s",
# [1] "naïve",
# [2] "approach"
# ]
Further reading: Character Properties.
Along the lines of mudasobwa's answer, here's what \w and \W bring to the party:
chars = [*' ' .. "\x7e"].join
# => " !\"\#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~"
That's the usual visible lower-ASCII characters we'd see in code. See the Regexp documentation for more information.
Grabbing the characters that match \w returns:
chars.scan(/\w+/)
# => ["0123456789",
# "ABCDEFGHIJKLMNOPQRSTUVWXYZ",
# "_",
# "abcdefghijklmnopqrstuvwxyz"]
Conversely, grabbing the characters that don't match \w, or that match \W:
chars.scan(/\W+/)
# => [" !\"\#$%&'()*+,-./", ":;<=>?#", "[\\]^", "`", "{|}~"]
\w is defined as [a-zA-Z0-9_] which is not what you want to normally call "word" characters. Instead they're typically the characters we use to define variable names.
If you're dealing with only lower-ASCII characters, use the character-class
[a-zA-Z]
For instance:
chars = [*' ' .. "\x7e"].join
lower_ascii_chars = '[a-zA-Z]'
not_lower_ascii_chars = '[^a-zA-Z]'
chars.scan(/#{lower_ascii_chars}+/)
# => ["ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"]
chars.scan(/#{not_lower_ascii_chars}+/)
# => [" !\"\#$%&'()*+,-./0123456789:;<=>?#", "[\\]^_`", "{|}~"]
Instead of defining your own, you can take advantage of the POSIX definitions and character properties:
chars.scan(/[[:alpha:]]+/)
# => ["ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"]
chars.scan(/\p{Alpha}+/)
# => ["ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz"]
Regular expressions always seem like a wonderful new wand to wave when extracting information from a string, but, like the Sorcerer's Apprentice found out, they can create havoc when misused or not understood.
Knowing this should help you write a bit more intelligent patterns. Apply that to what the documentation shows and you should be able to easily figure out a pattern that does what you want.
You can use below RegEx instead of /\W+/
var.join.downcase.split(/[^'\w]+/)
/\W/ refers to all non-word characters, apostrophe is one such non-word character.
To keep the code as close to original intent, we can use /[^'\w]/ - this means that all characters that are not apostrophe and word character.
Running that string through irb with the same split call that you wrote in your comment gets this:
irb(main):008:0> "The baby's foot".split(/\W+/)
=> ["The", "baby", "s", "foot"]
However, if you use split without an explicit delimiter, you get the split you're looking for:
irb(main):009:0> "The baby's foot".split
=> ["The", "baby's", "foot"]
Does that get you what you're looking for?
I am having trouble writing this so that it will take a sentence as an argument and perform the translation on each word without affecting the punctuation.
I'd also like to continue using the partition method.
It would be nice if I could have it keep a quote together as well, such as:
"I said this", I said.
would be:
"I aidsay histay", I said.
def convert_sentence_pig_latin(sentence)
p split_sentence = sentence.split(/\W/)
pig_latin_sentence = []
split_sentence.each do |word|
if word.match(/^[^aeiou]+/x)
pig_latin_sentence << word.partition(/^[^aeiou]+/x)[2] + word.partition(/^[^aeiou]+/x)[1] + "ay"
else
pig_latin_sentence << word
end
end
rejoined_pig_sentence = pig_latin_sentence.join(" ").downcase + "."
p rejoined_pig_sentence.capitalize
end
convert_sentence_pig_latin("Mary had a little lamb.")
Your main problem is that [^aeiou] matches every character outside that range, including spaces, commas, quotation marks, etc.
If I were you, I'd use a positive match for consonants, ie. [b-df-hj-np-tv-z] I would also put that regex in a variable, so you're not having to repeat it three times.
Also, in case you're interested, there's a way to make your convert_sentence_pig_latin method a single gsub and it will do the whole sentence in one pass.
Update
...because you asked...
sentence.gsub( /\b([b-df-hj-np-tv-z])(\w+)/i ) { "#{$2}#{$1}ay" }
# iterate over and replace regexp matches using gsub
def convert_sentence_pig_latin2(sentence)
r = /^[^aeiou]+/i
sentence.gsub(/"([^"]*)"/m) {|x| x.gsub(/\w+/) {|y| y =~ r ? "#{y.partition(r)[2]}#{y.partition(r)[1]}ay" : y}}
end
puts convert_sentence_pig_latin2('"I said this", I said.')
# define instance method: String#to_pl
class String
R = Regexp.new '^[^aeiou]+', true # => /^[^aeiou]+/i
def to_pl
self.gsub(/"([^"]*)"/m) {|x| x.gsub(/\w+/) {|y| y =~ R ? "#{y.partition(R)[2]}#{y.partition(R)[1]}ay" : y}}
end
end
puts '"I said this", I said.'.to_pl
sources:
http://www.ruby-doc.org/core-2.1.0/Regexp.html
http://ruby-doc.org/core-2.0/String.html#method-i-gsub
unless (place =~ /^\./) == 0
I know the unless is like if not but what about the condtional?
=~ means matches regex
/^\./ is a regular expression:
/.../ are the delimiters for the regex
^ matches the start of the string or of a line (\A matches the start of the string only)
\. matches a literal .
It checks if the string place starts with a period ..
Consider this:
p ('.foo' =~ /^\./) == 0 # => true
p ('foo' =~ /^\./) == 0 # => false
In this case, it wouldn't be necessary to use == 0. place =~ /^\./ would suffice as a condition:
p '.foo' =~ /^\./ # => 0 # 0 evaluates to true in Ruby conditions
p 'foo' =~ /^\./ # => nil
EDIT: /^\./ is a regular expression. The start and end slashes denotes that it is a regular expression, leaving the important bit to ^\.. The first character, ^ marks "start of string/line" and \. is the literal character ., as the dot character is normally considered a special character in regular expressions.
To read more about regular expressions, see Wikipedia or the excellent regular-expressions.info website.
I'm looking to check if a string is all capitals in Rails.
How would I go about doing that?
I'm writing my own custom pluralize helper method and I would something be passing words like "WORD" and sometimes "Word" - I want to test if my word is all caps so I can return "WORDS" - with a capital "S" in the end if the word is plural (vs. "WORDs").
Thanks!
Do this:
str == str.upcase
E.g:
str = "DOG"
str == str.upcase # true
str = "cat"
str == str.upcase # false
Hence the code for your scenario will be:
# In the code below `upcase` is required after `str.pluralize` to transform
# DOGs to DOGS
str = str.pluralize.upcase if str == str.upcase
Thanks to regular expressions, it is very simple. Just use [[:upper:]] character class which allows all upper case letters, including but not limited to those present in ASCII-8.
Depending on what do you need exactly:
# allows upper case characters only
/\A[[:upper:]]*\Z/ =~ "YOURSTRING"
# additionally disallows empty strings
/\A[[:upper:]]+\Z/ =~ "YOURSTRING"
# also allows white spaces (multi-word strings)
/\A[[:upper:]\s]*\Z/ =~ "YOUR STRING"
# allows everything but lower case letters
/\A[^[:lower:]]*\Z/ =~ "YOUR 123_STRING!"
Ruby doc: http://www.ruby-doc.org/core-2.1.4/Regexp.html
Or this:
str =~ /^[A-Z]+$/
e.g.:
"DOG" =~ /^[A-Z]+$/ # 0
"cat" =~ /^[A-Z]+$/ # nil