Regexp does not match utf8 characters in words (\w+) [duplicate] - ruby-on-rails

This question already has answers here:
How to match unicode words with ruby 1.9?
(3 answers)
Closed 9 years ago.
Why does the following code return nil:
'The name of the city is: Ørbæk'.match(/:\s\w+/)
#=> nil
When I would expect it to return "Ørbæk"
I have tried setting the #encoding=utf-8 in the beginning of the document but it does not change anything.
PS. Ø and Æ are danish letters

The metacharacter \w is equivalent to the character class [a-zA-Z0-9_]; it matches only ASCII letters, digits, and _.
Instead use the character property \p{Word}:
'The name of the city is: Ørbæk'.match(/:\s\p{Word}+/)
# => #<MatchData ": Ørbæk">
According to Character Properties from Ruby Regexp documentation:
/\p{Word}/ - A member of one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation
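To make the difference concrete, here is a minimal sketch comparing the two patterns. The helper names are hypothetical (they are not from the question), and the sketch assumes the source file is UTF-8 encoded, which is the default in Ruby 2.0 and later.

```ruby
# Hypothetical helpers: each returns the word after ": ", or nil if nothing matches.
def ascii_word_after_colon(text)
  # \w is ASCII-only, so it cannot match a leading "Ø".
  text[/:\s(\w+)/, 1]
end

def unicode_word_after_colon(text)
  # \p{Word} is Unicode-aware; the capture group isolates the word itself.
  text[/:\s(\p{Word}+)/, 1]
end

sentence = 'The name of the city is: Ørbæk'
p ascii_word_after_colon(sentence)    # nil
p unicode_word_after_colon(sentence)  # "Ørbæk" on a UTF-8 terminal
```

Using a capture group returns just the city name rather than the leading ": " that the full match would include.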

You can use \p{Word} instead:
irb(main):001:0> 'The name of the city is: Ørbæk'.match(/:\s\p{Word}+/)
=> #<MatchData ": Ørbæk">

If the word you want to match contains just letter characters, then use \p{L} :
match(/:\s\p{L}+/)
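As a rough illustration of how \p{L} differs from \p{Word}, the sketch below runs both against a made-up label containing an underscore and a digit. The helper names and the sample string are assumptions for illustration only.

```ruby
# Hypothetical helpers comparing the two character properties.
def letters_after_colon(text)
  # \p{L} matches letters only, so it stops at "_" or a digit.
  text[/:\s(\p{L}+)/, 1]
end

def word_chars_after_colon(text)
  # \p{Word} also accepts marks, numbers, and connector punctuation such as "_".
  text[/:\s(\p{Word}+)/, 1]
end

label = 'is: Ørbæk_2'
p letters_after_colon(label)     # "Ørbæk"
p word_chars_after_colon(label)  # "Ørbæk_2"
```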

Related

.Gsub and .Scan Rails. Regex deciphering [duplicate]

This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 5 years ago.
Why does this happen:
filename
=> "/Users/user/Desktop/work/arthouse/digitization/in-process/cat.jpg"
[4] pry(DigitizedView)> filename.gsub(/.*\//,'')
=> "cat.jpg"
What is the regex in the first argument of gsub? I know the .* is any number of any characters... but what is the backslash? Why does it delete everything except the cat.jpg part?
Also,
"cat.jpg".scan(/(\w+)-(\d+)([a-z]?)/)
=> []
What is that code doing?
Let us examine what the first argument to gsub, /.*\//, means.
The first and last slashes /.../ are regex delimiters: they mark a regex literal, not a string.
There are two parts to this regex: .* and \/.
.* matches any run of characters (other than newlines), including an empty one.
\/ matches a literal slash, /. The backslash escapes the slash so that it is not read as the closing delimiter.
Because .* is greedy, the regex matches the longest possible substring that ends in a slash:
'/Users/user/Desktop/work/arthouse/digitization/in-process/'
gsub replaces that single match with ''.
Only cat.jpg, which has no slash after it, is left.
Hope that explanation helps.
edit
In the second part, /(\w+)-(\d+)([a-z]?)/
(\w+): captures one or more word characters (letters, digits, or underscore)
-: matches a literal dash
(\d+): captures one or more digits
([a-z]?): captures an optional single lowercase letter
cat.jpg contains no dash followed by digits, so no part of it matches this regex.
Therefore, scan will return an empty array.
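For comparison, here is a small sketch of what scan returns when the pattern does match. The file names with a dash are invented for illustration, and name_parts is a hypothetical helper.

```ruby
# Hypothetical helper: with capture groups, scan returns one array per match,
# holding the text captured by each group.
def name_parts(text)
  text.scan(/(\w+)-(\d+)([a-z]?)/)
end

p name_parts("cat.jpg")      # []
p name_parts("cat-12b.jpg")  # [["cat", "12", "b"]]
p name_parts("cat-7.jpg")    # [["cat", "7", ""]]
```

The optional third group yields an empty string, not nil, when no letter follows the digits.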
The Regexp /.*\// matches zero or more characters terminated by a forward slash. The String#gsub method replaces all substrings matching the pattern with the replacement value, in this case ''.
Because .* is greedy, the pattern matches the longest such substring, '/Users/user/Desktop/work/arthouse/digitization/in-process/', in a single match, and gsub replaces it with an empty string. The remaining substring, cat.jpg, is not matched because it is not followed by a '/'. So 'cat.jpg' is all that remains.
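One way to see what the pattern actually matches is to run it through scan. The sketch below contrasts the greedy .* with a non-greedy .*? variant and also shows File.basename, the standard-library call for this job. The helper names are hypothetical.

```ruby
# Hypothetical helpers for inspecting the matches.
def greedy_matches(path)
  path.scan(/.*\//)    # greedy: one match running up to the last slash
end

def lazy_matches(path)
  path.scan(/.*?\//)   # non-greedy: one match per path segment
end

def strip_directories(path)
  path.gsub(/.*\//, '')
end

path = "/Users/user/Desktop/work/arthouse/digitization/in-process/cat.jpg"
p greedy_matches(path).length  # 1
p lazy_matches(path)           # ["/", "Users/", "user/", "Desktop/", "work/", "arthouse/", "digitization/", "in-process/"]
p strip_directories(path)      # "cat.jpg"
p File.basename(path)          # "cat.jpg"
```

For real code, File.basename is usually the clearer choice; the regex version is mainly useful for understanding how gsub behaves.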

Extract year from string, check if successful

I would like to check whether a year was found within a string. Something like
if string.scan(/\d{4}/).first == TRUE
for example a string looks like "there were 3 earthquakes in 2007"
Any suggestions?
If you want to match standalone 4 digit string, you may consider a regex with word boundaries:
!('It is 2016 now.' =~ /\b\d{4}\b/).nil? # => true
or - a more real world sample usage:
if string =~ /\b\d{4}\b/
The \b\d{4}\b pattern matches any 4 digits that are neither preceded nor followed by a word character (digit, letter or underscore), so there will be no match in 02312345.
Also, if you want to restrict the match to years in the 20th or 21st century, you may use the /\b(?:19|20)\d{2}\b/ regex.
To extract the digits, use s[/\b\d{4}\b/].
'It was in 2015/16.'[/\b\d{4}\b/] # => "2015"
See the Ruby demo
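Putting the check and the extraction together, a minimal sketch might look like the following. extract_year is a hypothetical helper, and restricting the match to years starting with 19 or 20 is an assumption you may not want.

```ruby
# Hypothetical helper: returns the first standalone year as an Integer, or nil.
def extract_year(text)
  year = text[/\b(?:19|20)\d{2}\b/]
  year && year.to_i
end

if (year = extract_year("there were 3 earthquakes in 2007"))
  puts "Found year #{year}"   # Found year 2007
else
  puts "No year found"
end

p extract_year("serial number 02312345")  # nil
```

Because the helper returns nil when nothing matches, its result can be used directly in a condition, with no comparison against true.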

storing a string with '#' in it in rails [duplicate]

This question already has an answer here:
Ruby string prepend '\' character
(1 answer)
Closed 7 years ago.
I tried storing a string in rails like
string = 'abc#$123'
but the string is stored as "abc\#$123". I tried removing the "\" with string.delete("\",'') but it didn't work.
Is there any way to solve this problem ?
This is expected: irb is only escaping the # (because it is followed by $) with a backslash when it displays the string.
Ruby is not changing your string or adding an unwanted \ character. You can verify this with puts string, which prints abc#$123.
Inside a double-quoted string, #$ would start interpolation of a global variable, so the displayed form escapes the # with "\". The "\" is not part of the string; it only appears in the inspected (display) form.
So there is nothing to remove. If you print the string, the "\" does not appear.
irb(main):001:0> s = 'abc#$123'
=> "abc\#$123" # inspect form shown by irb; the backslash is not in the string
irb(main):002:0> print s
abc#$123=> nil # printed string value
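To check this outside irb, a short sketch like the one below compares the string itself with its inspect form. contains_backslash? is a hypothetical helper added for illustration.

```ruby
# Hypothetical helper: true if the text contains a literal backslash.
def contains_backslash?(text)
  text.include?('\\')
end

stored = 'abc#$123'
puts stored.length                        # 8
puts contains_backslash?(stored)          # false
puts stored                               # abc#$123
puts contains_backslash?(stored.inspect)  # true (the escape exists only in the display form)
```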

Rails, select last n characters from a string? [duplicate]

This question already has answers here:
Extracting the last n characters from a ruby string
(9 answers)
Closed 7 years ago.
I have a string "foo bar man chu" and I want to select the last 5 characters from it.
The regex expression /.{5}$/ does the job of selecting them, but how do I save them to a string in Rails? gsub(/.{5}$/,'') removes them, kind of the opposite of what I want. Thanks!
The match method returns a MatchData object describing the match; index it with [0] to get the matched text as a string:
result = "foo bar man chu".match(/.{5}$/)
puts result[0]
# prints: n chu
If the regular expression does not match, match returns nil.
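As a sketch of two shorter alternatives, String#[] accepts either a regex or a start/length pair and returns a plain String directly. Both helper names below are hypothetical.

```ruby
# Hypothetical helpers: both return the last `count` characters,
# or nil when the text is shorter than `count`.
def last_chars_by_regex(text, count)
  text[/.{#{count}}$/]
end

def last_chars_by_index(text, count)
  text[-count, count]
end

p last_chars_by_regex("foo bar man chu", 5)  # "n chu"
p last_chars_by_index("foo bar man chu", 5)  # "n chu"
p last_chars_by_index("abc", 5)              # nil
```

The index form avoids the regex entirely, which tends to be easier to read when all you need is a fixed-length suffix.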

Splitting strings using Ruby ignoring certain characters

I'm trying to split a string and count the number of words using Ruby, but I want to ignore special characters.
For example, in the string "Hello, my name is Hugo ..." I'm splitting by spaces, but the trailing ... shouldn't count because it isn't a word.
I'm using string.inner_text.split(' ').length. How can I specify that special characters (such as ... ? ! etc.) are not counted when they are separated from the text by spaces?
Thank you to everyone,
Kind Regards,
Hugo
"Hello, my name is não ...".scan /[^*!#%\^\s\.]+/
# => ["Hello,", "my", "name", "is", "não"]
The negated character class [^*!#%\^\s\.] matches any character other than *, !, #, %, ^, whitespace, and the dot. You can add more characters to the list to exclude them as well.
This is part answer, part response to @Neo's answer: why not use the proper tools for the job?
http://www.ruby-doc.org/core-1.9.3/Regexp.html says:
POSIX bracket expressions are also similar to character classes. They provide a portable alternative to the above, with the added benefit that they encompass non-ASCII characters. For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas /[[:digit:]]/ matches any character in the Unicode Nd category.
/[[:alnum:]]/ - Alphabetic and numeric character
/[[:alpha:]]/ - Alphabetic character
...
Ruby also supports the following non-POSIX character classes:
/[[:word:]]/ - A character in one of the following Unicode general categories Letter, Mark, Number, Connector_Punctuation
If you want words, use str.scan /[[:word:]]+/
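Applied to the word-counting question, that suggestion could be sketched as follows. word_count is a hypothetical helper, and the sketch assumes UTF-8 strings.

```ruby
# Hypothetical helper: counts runs of Unicode word characters.
def word_count(text)
  text.scan(/[[:word:]]+/).length
end

p word_count("Hello, my name is Hugo ...")  # 5
p word_count("Hello, my name is não ...")   # 5
p word_count("... ? !")                     # 0
```

One caveat: an apostrophe is not a word character, so a contraction such as "don't" would be counted as two words with this pattern.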
