Ruby gsub for exact number of words - ruby-on-rails

I have a piece of code that replaces words in @post.swap_content with hyperlinks, based on keywords. For example, if @post.swap_content contains the word 'michigan' and keywords contains the keyword 'Michigan', the word is replaced with the hyperlink attached to that keyword. Here is part of the function:
def execute
  all_keys = Keyword.all.pluck(:key, :link).to_h.transform_keys(&:downcase)
  # gsub (not gsub!) is used because gsub! returns nil when nothing is replaced
  @post.swap_content = @post.swap_content.to_s.gsub(/\w+/) do |word|
    url = all_keys[word.downcase]
    url ? "<a href='#{url}'>#{word}</a>" : word
  end
  @post.save!
end
My question is: how can I make gsub replace only the first two occurrences of each keyword in @post.swap_content? For example, if @post.swap_content is 'michigan, michigan and michigan, utah and utah', how can I turn only the first two occurrences of each keyword (the first two 'michigan' and the first two 'utah') into hyperlinks? I think I need to use gsub somehow, but I don't know how to limit the number of words it replaces.

You can provide a block to gsub that will be invoked with each match. You could use this to count occurrences and conditionally replace content.
str = "Dog dog dog cat cat cat"
occurrences = {}
str.gsub(/\w+/) do |match|
  # downcase so Dog and dog are counted together
  key = match.downcase
  # update a hash that counts how many times we've matched each word
  count = occurrences.store(key, occurrences.fetch(key, 0).next)
  # return the word unchanged or wrap it in a hyperlink depending on count
  count > 2 ? match : "<a>#{match}</a>"
end
# output => "<a>Dog</a> <a>dog</a> dog <a>cat</a> <a>cat</a> cat"
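Applied to the keyword lookup from the question, the same counting idea might look like the sketch below. The names link_first_occurrences, links and limit are hypothetical, and the sample hash only stands in for the result of Keyword.all.pluck(:key, :link).to_h.transform_keys(&:downcase).

```ruby
# Hypothetical helper: wrap only the first `limit` occurrences of each
# keyword in a hyperlink. `links` maps downcased keywords to URLs.
def link_first_occurrences(text, links, limit: 2)
  seen = Hash.new(0) # per-keyword counter, missing keys read as 0
  text.gsub(/\w+/) do |word|
    key = word.downcase
    url = links[key]
    next word unless url # not a keyword: leave the word unchanged
    seen[key] += 1
    seen[key] <= limit ? "<a href='#{url}'>#{word}</a>" : word
  end
end

links = { 'michigan' => '/mi', 'utah' => '/ut' } # made-up sample data
puts link_first_occurrences('michigan, Michigan and michigan, utah', links)
```

Words that are not keywords do not touch the counter, so only real keywords are limited.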

Suppose:
str = "Dog dog cat dog cat Dog cat cat"
If Ruby's regex engine supported variable-length negative lookbehinds we could write:
R = /\b(\w+)\b(?<!(?:\b\1\b.*){2})/i
str.gsub(R, '<a>\1</a>')
#=> "<a>Dog</a> <a>dog</a> <a>cat</a> dog <a>cat</a> Dog cat cat"
We can write this regular expression in free-spacing mode to make it self-documenting:
R = /
\b # assert a word break
(\w+) # match 1+ word characters and save to capture group 1
\b # assert a word break
(?<! # begin a negative lookbehind
(?: # begin a non-capture group
\b # assert a word break
\1 # match the content of capture group 1
\b # assert a word break
.* # match 0+ characters
) # end non-capture group
{2} # execute non-capture group twice
) # end negative lookbehind
/ix # assert case-independent and free-spacing regex def modes
Unfortunately, Ruby's regex engine does not support variable-length (positive or negative) lookbehinds (though one day it might). It does, however, support variable-length (positive and negative) lookaheads. We therefore could reverse the string, perform the desired replacements using gsub then reverse the resulting string, as follows:
R = /\b(\w+)\b(?!(?:.*\b\1\b){2})/i
str.reverse.gsub(R, '>a/<\1>a<').reverse
#=> "<a>Dog</a> <a>dog</a> <a>cat</a> dog <a>cat</a> Dog cat cat"
The steps are as follows.
s = str.reverse
#=> "tac tac goD tac god tac god goD"
t = s.gsub(R, '>a/<\1>a<')
#=> "tac tac goD >a/<tac>a< god >a/<tac>a< >a/<god>a< >a/<goD>a<"
t.reverse
#=> "<a>Dog</a> <a>dog</a> <a>cat</a> dog <a>cat</a> Dog cat cat"
Let's have a closer look at the regular expression.
R = /
\b # assert a word break
(\w+) # match 1+ word characters and save to capture group 1
\b # assert a word break
(?! # begin a negative lookahead
(?: # begin a non-capture group
.* # match 0+ characters
\b # assert a word break
\1 # match the content of capture group 1
\b # assert a word break
) # end non-capture group
{2} # execute non-capture group twice
) # end negative lookahead
/ix # assert case-independent and free-spacing regex def modes
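The reverse-and-lookahead trick can be wrapped in a small method so the number of occurrences to wrap becomes a parameter. This is a sketch only; wrap_first and n are hypothetical names. Because . does not match newlines, a multi-line string would be handled line by line.

```ruby
# Hypothetical helper generalising the reverse-and-lookahead trick to the
# first `n` occurrences of each word (case-insensitive).
def wrap_first(str, n)
  # Match a word unless it is followed by n or more further occurrences.
  regex = /\b(\w+)\b(?!(?:.*\b\1\b){#{n}})/i
  # The replacement is written backwards because the string is reversed.
  str.reverse.gsub(regex, '>a/<\1>a<').reverse
end

str = "Dog dog cat dog cat Dog cat cat"
puts wrap_first(str, 2)
#=> <a>Dog</a> <a>dog</a> <a>cat</a> dog <a>cat</a> Dog cat cat
puts wrap_first(str, 1)
#=> <a>Dog</a> dog <a>cat</a> dog cat Dog cat cat
```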

Related

Ruby regexp - get facebook video id from different urls with a unique regexp

I would like to extract video ids from potentially different URLs
https://www.facebook.com/{page-name}/videos/{video-id}/
https://www.facebook.com/{username}/videos/{video-id}/
https://www.facebook.com/video.php?id={video-id}
https://www.facebook.com/video.php?v={video-id}
How can I retrieve the video ids with a single ruby regex?
I haven't managed to convert this to Ruby regex but I (partially) managed to write it in standard JS regex:
^(https?://www\.facebook\.com/(?:video\.php\?v=\d+|.*?/videos/\d+))$
When I run the following code in Ruby it gives me an error:
text = "https://www.facebook.com/pili.morillo.56/videos/352355988613922/"
id = text.gsub( ^(https?://www\.facebook\.com/(?:video\.php\?v=\d+|.*?/videos/\d+))$ )
Here is the regexp I came up with: /(?<=\/videos\/)\d+?(?=\/|$)|(?<=[?&]id=)\d+?(?=&|$)|(?<=[?&]v=)\d+?(?=&|$)/
Breaking this up we can get this:
(?<=\/videos\/)\d+(?=\/|$)|
(?<=[?&]id=)\d+(?=&|$)|
(?<=[?&]v=)\d+(?=&|$)
Each of the three options follows the same simple structure: (?<=beforeMatch)target(?=afterMatch).
Here is the first as an example:
(?<=\/videos\/) # Positive lookbehind
\d+ # Matching the digits
(?=\/|$) # Positive lookahead
So, this means: match \d+ (one or more digits), as long as it's preceded by \/videos\/ and followed by \/ or the end of the line.
Therefore, we can match by 'id=', 'v=' or 'videos/'.
The full explanation:
(?<=\/videos\/) # Match as long as preceded by '\/videos\/'
\d+ # Matching the id digits
(?=\/|$) # As long as it's followed by '\/' or the EOL
| # Or
(?<=[?&]id=) # Match as long as preceded by '?id=' or '&id='
\d+ # Matching the id digits
(?=&|$) # As long as it's followed by either '&' or the EOL
| # Or
(?<=[?&]v=) # Match as long as preceded by '?v=' or '&v='
\d+ # Matching the id digits
(?=&|$) # As long as it's followed by either '&' or the EOL
Where 'EOL' means end of line.
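In Ruby, the lookaround pattern above can be used directly with String#[]. The sketch below writes it with %r{} so the slashes need no escaping. The constant name VIDEO_ID and the sample URLs (page name and ids) are made up for illustration.

```ruby
# The lookaround pattern, written with %r{} to avoid escaping slashes.
VIDEO_ID = %r{(?<=/videos/)\d+(?=/|$)|(?<=[?&]id=)\d+(?=&|$)|(?<=[?&]v=)\d+(?=&|$)}

# Sample URLs with made-up page names and ids.
urls = [
  'https://www.facebook.com/some-page/videos/352355988613922/',
  'https://www.facebook.com/video.php?id=1234567890',
  'https://www.facebook.com/video.php?v=987654321&type=2'
]

# String#[] with a regex returns the matched text, or nil if nothing matches.
ids = urls.map { |url| url[VIDEO_ID] }
p ids
#=> ["352355988613922", "1234567890", "987654321"]
```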
RE = %r[https://www.facebook.com/(?:.+?/)?video(?:.*?[/=])(.+?)(?:/?\z)]
%w[
https://www.facebook.com/{page-name}/videos/{video-id}/
https://www.facebook.com/{username}/videos/{video-id}/
https://www.facebook.com/video.php?id={video-id}
https://www.facebook.com/video.php?v={video-id}
].map { |url| url[RE, 1] }
#⇒ ["{video-id}", "{video-id}", "{video-id}", "{video-id}"]
You might use:
^https?:\/\/www\.facebook\.com\/.*?video(?:s|\.php.*?[?&](?:id|v)=)\/?([^\/&\n]+).*$
That would match
Begin of the string and begin url
^https?:\/\/www\.facebook\.com\/
Followed by:
.*? # Match any character zero or more times
video # Match video
(?: # Non capturing group
s # Match s
| # Or
\.php # Match .php
.*? # Match any character zero or more times
[?&] # Match ? or &
(?:id|v)= # Match id or v in non capturing group followed by =
) # Close non capturing group
\/? # Match optional /
( # Capturing group (group 1)
[^\/&\n]+ # Match not / or & or newline
) # Close capturing group
.* # Match any character zero or more times
$ # End of the string
text = "https://www.facebook.com/pili.morillo.56/videos/352355988613922/"
id = text.gsub(/^https?:\/\/www\.facebook\.com\/.*?video(?:s|\.php.*?[?&](?:id|v)=)\/?([^\/&\n]+).*$/, "\\1")
puts id
That will result in: 352355988613922
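One thing to keep in mind with the gsub form is that a string that does not match is returned unchanged, so a non-video URL would come back as the "id". A possible alternative, sketched here with the hypothetical names FB_VIDEO and video_id, is to read capture group 1 with String#[], which returns nil when there is no match.

```ruby
# Same pattern as above, written with %r{} so slashes need no escaping.
FB_VIDEO = %r{^https?://www\.facebook\.com/.*?video(?:s|\.php.*?[?&](?:id|v)=)/?([^/&\n]+).*$}

# Hypothetical helper: return capture group 1, or nil when the URL
# does not look like a Facebook video URL.
def video_id(url)
  url[FB_VIDEO, 1]
end

p video_id("https://www.facebook.com/pili.morillo.56/videos/352355988613922/")
#=> "352355988613922"
p video_id("https://example.com/")
#=> nil
```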

How to check username format with a regex

I need to check username format using a regular expression.
My username criteria are:
Must contain 1 or more letters, anywhere.
May contain any amount of numbers, anywhere.
Can contain up to 2 - or _, anywhere.
^[0-9\-_]*[a-z|A-Z]+[0-9\-_]*$ is what I was using but this will reject usernames such as 123hi123hi, or hi123hi. I need something less string location dependent.
I'm using Ruby on Rails to match strings against this.
A very inefficient Ruby function version for Rails is:
validate :check_username
def check_username
if self.username.count("-") > 2
errors.add(:username, "cannot contain more than 2 dashes")
elsif self.username.count("_") > 2
errors.add(:username, "cannot contain more than 2 underscores")
elsif self.username.count("a-zA-Z") < 1
errors.add(:username, "must contain a letter")
elsif (self.username =~ /^[0-9a-zA-Z\-_]+$/) !=0
errors.add(:username, "cannot contain special characters")
end
end
Here are two approaches you could use.
Construct a single regex
Because regular expressions are concerned with the ordering of characters in a string, one would have to construct a regular expression for each of the following combinations and then "or" those regexes into a single regex.
one letter, zero hyphens, zero underscores
one letter, zero hyphens, one underscore
one letter, zero hyphens, two underscores
one letter, one hyphen, zero underscores
one letter, one hyphen, one underscore
one letter, one hyphen, two underscores
one letter, two hyphens, zero underscores
one letter, two hyphens, one underscore
one letter, two hyphens, two underscores
Digits and additional letters could appear anywhere in the username.
Let's call those regular expressions t0, t1,..., t8. The desired single, overall regular expression would be:
/#{t0}|#{t1}|...|#{t8}/
Let's consider the construction of t4 (one letter, one hyphen, one underscore).
Six orders are possible for this combination.
a letter, a hyphen, an underscore
a letter, an underscore, a hyphen
a hyphen, a letter, an underscore
a hyphen, an underscore, a letter
an underscore, a letter, a hyphen
an underscore, a hyphen, a letter
We would need to construct a regular expression for each of these six orders (r0, r1,..., r5) and then "or" them to obtain t4:
t4 = /#{r0}|#{r1}|#{r2}|#{r3}|#{r4}|#{r5}/
Now let's consider the construction of a regex r0 that would implement the first of these orderings (a letter, a hyphen, an underscore):
r0 = /\A[a-z0-9]*[a-z][a-z0-9]*-[a-z0-9]*_[a-z0-9]*\z/i
"3ab4-3cd_e5".match?(r0) #=> true
"3ab4-3cde5".match?(r0) #=> false (no underscore)
"34-3cd_e5".match?(r0) #=> false (no letter before the hyphen)
"3ab4_3cd-e5".match?(r0) #=> false (underscore precedes hyphen)
Construction of each of the other five ri's would be similar.
We would then need to compute ti for each of the eight combinations other than the fifth one. t0 (one letter, zero hyphens, zero underscores) is easy:
t0 = /\A[a-z0-9]*[a-z][a-z0-9]*\z/i
By contrast, t8 (one letter, two hyphens, two underscores) would be a much longer regex than t4 (considered above), as a regular expression would have to be hand-crafted for each of 5!/(2!*2!) #=> 30 orderings (r0, r1,..., r29).
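The ordering counts quoted above (six for t4, thirty for t8) can be checked, and the alternation assembled mechanically, with a short script. This is an illustrative sketch rather than a recommendation; FILLER, orderings and combo_regex are hypothetical names.

```ruby
FILLER = '[a-z0-9]*' # digits and extra letters may appear anywhere

# All distinct orders of one required letter plus the given number of
# hyphens and underscores.
def orderings(hyphens, underscores)
  tokens = ['[a-z]'] + ['-'] * hyphens + ['_'] * underscores
  tokens.permutation.to_a.uniq
end

# Build one alternative per ordering and "or" them together.
def combo_regex(hyphens, underscores)
  alternatives = orderings(hyphens, underscores).map do |tokens|
    FILLER + tokens.map { |t| t + FILLER }.join
  end
  /\A(?:#{alternatives.join('|')})\z/i
end

p orderings(1, 1).size #=> 6
p orderings(2, 2).size #=> 30

t4 = combo_regex(1, 1)
p t4.match?("3ab4-3cd_e5") #=> true
p t4.match?("3ab4_3cd-e5") #=> true
p t4.match?("3ab4-3cde5")  #=> false (no underscore)
```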
It should now be obvious that the use of a single regular expression is simply not the right tool for validating usernames.
Do not construct a single regex
def username_valid?(username)
cnt = username.each_char.with_object(Hash.new(0)) do |c,cnt|
case c
when /\d/
when /[[:alpha:]]/
cnt[:letter] += 1
when '-'
cnt[:hyphen] += 1
when '_'
cnt[:underscore] += 1
else
return false
end
end
cnt.fetch(:letter, 0) > 0 && cnt.fetch(:hyphen, 0) <= 2 &&
cnt.fetch(:underscore, 0) <= 2
end
username_valid? "Bob" #=> true
username_valid? "Bob1_23_-" #=> true
username_valid? "z" #=> true
username_valid? "123--_" #=> false (no letters)
username_valid? "Melba1-23--_" #=> false (3 hyphens)
username_valid? "Bob1_23_-$" #=> false ($ not permitted)
A hash created with Hash.new(0), that is, with a default value of zero, is often called a counting hash. If h is such a hash and has no key k, h[k] returns the default value, so h[k] += 1 is evaluated as follows:
h[k] += 1
#=> h[k] = h[k] + 1
#=> h[k] = 0 + 1
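As a small standalone illustration of a counting hash (the sample string is made up), note that reading a missing key returns the default without storing the key:

```ruby
counts = Hash.new(0) # missing keys read as 0
"a-b_c-".each_char { |c| counts[c] += 1 }

p counts['-']       #=> 2
p counts['_']       #=> 1
p counts['z']       #=> 0 (default value; 'z' never occurred)
p counts.key?('z')  #=> false (reading the default did not store the key)
```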
The method could instead be written to return false as soon as it determines that the username is invalid.
def username_valid?(username)
cnt = username.each_char.with_object(Hash.new(0)) do |c,cnt|
case c
when /\d/
when /[[:alpha:]]/
cnt[:letter] += 1
when '-'
return false if cnt[:hyphen] == 2
cnt[:hyphen] += 1
when '_'
return false if cnt[:underscore] == 2
cnt[:underscore] += 1
else
return false
end
end
cnt.fetch(:letter, 0) > 0
end
This is a bad use for a regular expression because your data isn't structured enough. Instead, a small series of simple tests will tell you what you need to know:
def valid?(str)
str[/[a-z]/i] && str.tr('^-_', '').size <= 2
end
%w(123hi123hi hi123hi).each do |username|
username # => "123hi123hi", "hi123hi"
valid?(username) # => true, true
end
There is a loss of speed due to the use of the regular expression
/[a-z]/i
so instead
def valid?(str)
str.downcase.tr('^a-z', '').size > 0 && str.tr('^-_', '').size <= 2
end
could be used. The use of the regular expression is about 45% slower based on testing.
Breaking it down:
str[/[a-z]/i] tests for at least one letter. Since there can be more than one, this will suffice.
str.downcase.tr('^a-z', '').size converts the characters to lowercase, then strips all non-letter characters, resulting in only letters remaining, then counts how many there are:
'123hi123hi'.downcase # => "123hi123hi"
.tr('^a-z', '') # => "hihi"
.size # => 4
'hi123hi'.downcase # => "hi123hi"
.tr('^a-z', '') # => "hihi"
.size # => 4
'hi-123_hi'.downcase # => "hi-123_hi"
.tr('^a-z', '') # => "hihi"
.size # => 4
The rule
May contain any amount of numbers, anywhere
isn't worth testing so I ignored it.
This is an improved version of your regex:
^[\w-]*[A-Za-z]+[\w-]*$
But this cannot count how many - or _ characters there are, so you will need another regex to check that, or count them manually in code.
This regex checks that there are at most two [-_] characters, regardless of their position:
^[A-Za-z\d]*[-_]{0,1}[A-Za-z\d]*[-_]{0,1}[A-Za-z\d]*$
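A sketch of how the two patterns might be combined in Ruby follows. The names HAS_LETTER, AT_MOST_TWO_SEPARATORS and username_ok? are hypothetical, and \A and \z are used in place of ^ and $ because in Ruby those match at line boundaries. Note that the second pattern allows at most two of - and _ combined.

```ruby
# At least one letter, and only word characters or hyphens overall.
HAS_LETTER = /\A[\w-]*[A-Za-z]+[\w-]*\z/
# At most two separators (- or _) in total, anywhere in the string.
AT_MOST_TWO_SEPARATORS = /\A[A-Za-z\d]*[-_]{0,1}[A-Za-z\d]*[-_]{0,1}[A-Za-z\d]*\z/

def username_ok?(name)
  name.match?(HAS_LETTER) && name.match?(AT_MOST_TWO_SEPARATORS)
end

p username_ok?("123hi123hi") #=> true
p username_ok?("hi_1-2")     #=> true
p username_ok?("a-b-c-d")    #=> false (three separators)
p username_ok?("12345")      #=> false (no letter)
```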
If you're allowing only letters, numbers, dashes and underscores,
and everything else is considered a special character,
I think it's only that the pattern you have needs negation.
Instead of (self.username =~ /^[0-9a-zA-Z\-_]+$/) !=0
try (self.username =~ /^[^0-9a-zA-Z\-_]+$/) !=0
or (self.username =~ /^[\W-]+$/) > 0.
Also, why not do a count for special characters, like in the conditions above this one?

Determining length of array of strings ignoring commas/periods (letters only)

I'm trying to find the best way to determine the letter count in an array of strings. I'm splitting the string, and then looping every word, then splitting letters and looping those letters.
When I get to the point where I determine the length, the problem I have is that it's counting commas and periods too. Thus, the length in terms of letters only is inaccurate.
I know this may be a lot shorter with regex, but I'm not well versed on that yet. My code is passing most tests, but I'm stuck where it counts commas.
E.g. 'You,' should be string.length = "3"
Sample code:
def abbr(str)
new_words = []
str.split.each do |word|
new_word = []
word.split("-").each do |w| # it has to be able to handle hyphenated words as well
letters = w.split('')
if letters.length >= 4
first_letter = letters.shift
last_letter = letters.pop
new_word << "#{first_letter}#{letters.count}#{last_letter}"
else
new_word << w
end
end
new_words << new_word.join('-')
end
new_words.join(' ')
end
I tried doing gsub before looping the words, but that wouldn't work because I don't want to completely remove the commas. I just don't need them to be counted.
Any enlightenment is appreciated.
arr = ["Now is the time for y'all Rubiests",
"to come to the aid of your bowling team."]
arr.join.size
#=> 74
Without a regex
def abbr(arr)
str = arr.join
str.size - str.delete([*('a'..'z')].join + [*('A'..'Z')].join).size
end
abbr arr
#=> 58
Here and below, arr.join converts the array to a single string.
With a regex
R = /
[^a-z] # match a character that is not a lower-case letter
/ix # case-insensitive (i) and free-spacing regex definition (x) modes
def abbr(arr)
arr.join.gsub(R,'').size
end
abbr arr
#=> 58
You could of course write:
arr.join.gsub(/[^a-z]/i,'').size
#=> 58
Try this:
def abbr(str)
str.gsub(/\b\w+\b/) do |word|
if word.length >= 4
"#{word[0]}#{word.length - 2}#{word[-1]}"
else
word
end
end
end
The regex in the gsub call says "one or more word characters preceded and followed by a word boundary". The block passed to gsub operates on each word, the return from the block is the replacement for the 'word' match in gsub.
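To see how this behaves with punctuation and hyphenated words, here is the same method in a compact, self-contained form with a few made-up inputs. Since \w also matches digits and underscores, those count as word characters here.

```ruby
def abbr(str)
  # Punctuation and hyphens are never part of a \w+ match, so they are
  # neither counted nor removed.
  str.gsub(/\b\w+\b/) do |word|
    word.length >= 4 ? "#{word[0]}#{word.length - 2}#{word[-1]}" : word
  end
end

puts abbr("internationalization, localization") #=> i18n, l10n
puts abbr("You, double-barrel")                 #=> You, d4e-b4l
```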
You can check, for each character, whether its ASCII value lies in 97-122 (a-z) or 65-90 (A-Z). Each time the condition holds, increment a local variable; at the end it holds the length of the string without counting digits, special characters, or whitespace.
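A minimal Ruby sketch of that idea, with the hypothetical method name letter_count, uses Enumerable#count with a block instead of a manually incremented variable:

```ruby
# Count only characters whose code point is in A-Z (65-90) or a-z (97-122).
def letter_count(str)
  str.each_char.count do |c|
    code = c.ord
    (65..90).cover?(code) || (97..122).cover?(code)
  end
end

p letter_count("You,")             #=> 3
p letter_count("Hello, World. 42") #=> 10
```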
You can use something like that (short version):
a.map { |x| x.chars.reject { |char| [' ', ',', '.'].include? char }.size }
Long version with explanation:
a = ['a, ', 'bbb', 'c c, .'] # Initial array of strings
excluded_chars = [' ', ',', '.'] # Chars you don't want to be counted
a.map do |str| # Iterate through all strings in the array
str.chars.reject do |char| # Split each string into an array of chars
excluded_chars.include? char # and reject the excluded chars from it,
end.size # leaving ["a"], ["b", "b", "b"] and ["c", "c"]; #size counts what remains
end
# Result: [1, 3, 2]

Check if the content inside a NSString has a pattern with regex

I have an NSString that stores data coming from a UITextField. The data should follow a certain pattern:
[0-9] / [0-9] / [0-9]
The content the user types must follow this pattern. I tried something like this, but it doesn't work:
if([myString isEqual: @"[0-9]/[0-9]/[0-9]"]){
/* ...Others code here! */
}
I believe Objective-C has a specific way to handle regular expressions. How can I do this?
Thanks.
Assuming you want to allow optional spaces around the slashes as in your [0-9] / [0-9] / [0-9] example, this regex matches your pattern:
^(?:\[\d-\d\]\s*(?:/\s*|$)){3}$
Note that in your regex string, you may have to escape each backslash with a backslash.
Explain Regex
^ # the beginning of the string
(?: # group, but do not capture (3 times):
\[ # '['
\d # digits (0-9)
- # '-'
\d # digits (0-9)
\] # ']'
\s* # whitespace (\n, \r, \t, \f, and " ") (0
# or more times (matching the most amount
# possible))
(?: # group, but do not capture:
/ # '/'
\s* # whitespace (\n, \r, \t, \f, and " ")
# (0 or more times (matching the most
# amount possible))
| # OR
$ # before an optional \n, and the end of
# the string
) # end of grouping
){3} # end of grouping
$ # before an optional \n, and the end of the
# string

what does this ruby do?

unless (place =~ /^\./) == 0
I know that unless is like if not, but what about the conditional?
=~ means matches regex
/^\./ is a regular expression:
/.../ are the delimiters for the regex
^ matches the start of the string or of a line (\A matches the start of the string only)
\. matches a literal .
It checks if the string place starts with a period (.).
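The difference between ^ and \A only shows up with multi-line strings. A short sketch with a made-up string:

```ruby
multi = "foo\n.bar"

p(multi =~ /^\./)  #=> 4   (the "." starts the second line)
p(multi =~ /\A\./) #=> nil (the string itself starts with "f")

# For a plain "starts with a period" test, String#start_with? avoids
# the regex altogether.
p ".foo".start_with?(".") #=> true
p multi.start_with?(".")  #=> false
```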
Consider this:
p ('.foo' =~ /^\./) == 0 # => true
p ('foo' =~ /^\./) == 0 # => false
In this case, it wouldn't be necessary to use == 0. place =~ /^\./ would suffice as a condition:
p '.foo' =~ /^\./ # => 0 # 0 evaluates to true in Ruby conditions
p 'foo' =~ /^\./ # => nil
EDIT: /^\./ is a regular expression. The start and end slashes denote that it is a regular expression, leaving the important bit, ^\.. The first character, ^, marks "start of string/line", and \. is the literal character ., as the dot character is normally considered a special character in regular expressions.
To read more about regular expressions, see Wikipedia or the excellent regular-expressions.info website.
