Regular Expression for Special Characters in Rails - ruby-on-rails

I need a regex in Rails that handles special characters from European languages, e.g. é, ä, ö, ü, ß. Kindly help me.

Regular expressions will work just fine with "special" characters. If you want to match a set of special characters, you'll need to tell the expression exactly what those characters are. Your definition of "special" might not match the next guy's.
For instance, if you wanted to see if a string contains any of the characters you listed above, you can do this:
irb(main):001:0> word = "resumé"
=> "resum\303\251"
irb(main):002:0> word =~ /[éäöüß]/
=> 5
irb(main):003:0> word.gsub(/é/, 'e')
=> "resume"
I hope this helps!
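The session above is from Ruby 1.8, where strings are byte sequences. As a minimal sketch, assuming Ruby 1.9 or later and a UTF-8 source file, the same checks work on whole characters. The constant names ACCENTED and TRANSLIT are made up for this illustration:

```ruby
# Assumes a UTF-8 source file (the default since Ruby 2.0).
ACCENTED = /[éäöüß]/          # hypothetical constant name

word = "resumé"
position = word =~ ACCENTED   # character index of the first match, or nil
stripped = word.gsub(/é/, "e")

# Replace several characters in one pass via a lookup table (hypothetical mapping).
TRANSLIT = { "é" => "e", "ä" => "ae", "ö" => "oe", "ü" => "ue", "ß" => "ss" }
ascii = "Müller-Straße".gsub(ACCENTED) { |ch| TRANSLIT[ch] }

puts position   # 5
puts stripped   # resume
puts ascii      # Mueller-Strasse
```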

Related

Ruby on rails: Regex to include accented and specials characters?

In my Rails app I want to use a regex that accepts accented characters (é, ç, à, ...) and special characters (& ( ) " ' , ...). Right now this is my validation:
validates_format_of :job_title,
:with => /[a-zA-Z0-9]/,
:message => "le titre de l'offre n'est pas valide",
:multiline => true
I also want the regex to reject non-Latin characters such as Arabic, Chinese, ...
Use [:alnum:] for alphanumeric characters:
validates_format_of :job_title,
:with => /[[:alnum:]]/,
:message => "le titre de l'offre n'est pas valide",
:multiline => true
For the Latin characters you could use the \p{Latin} script character property. You would have to make sure you normalize the input first, as decomposed strings won’t match (i.e. strings containing characters using combining characters). Also this wouldn’t match things like x́ (that’s x followed by COMBINING ACUTE ACCENT) since it won’t compose into a single character, but that’s probably okay as it’s not likely to be actually used by anyone.
For the “special characters” you really need to be more specific about what you want. You say you want to allow " and ' (so called “straight” quotes), but what about “, ”, ‘ and ’ (“typographical” or “curly” quotes)? And since you are allowing European languages, what about «, », ‹, › and „? You could use the \p{Punct} class, which should match all these and more; you will need to decide whether it matches too much.
You probably also want to match spaces as well. Will just the space character be okay? What about tabs, non-breaking spaces, newlines etc.? \p{Space} should get them.
There may be other characters you need to match that these won’t pick up, e.g. currency symbols; you may need to add those too.
So a first attempt at your regex might look like this (I’ve added \A and \z to anchor the start and end, as well as * to match all characters – I think you will need them):
/\A[\p{Latin}\p{Punct}\p{Space}0-9]*\z/
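The normalization step mentioned above can be sketched as follows, assuming Ruby 2.4 or later (for String#unicode_normalize and Regexp#match?). The names JOB_TITLE and valid_job_title? are hypothetical; the pattern is the one from this answer:

```ruby
# Hypothetical constant name; the pattern is the one suggested above.
JOB_TITLE = /\A[\p{Latin}\p{Punct}\p{Space}0-9]*\z/

# Normalize to NFC first so "e" + COMBINING ACUTE ACCENT becomes a single "é".
def valid_job_title?(title)
  JOB_TITLE.match?(title.unicode_normalize(:nfc))
end

decomposed = "De\u0301veloppeur"   # "é" written as two code points

puts JOB_TITLE.match?(decomposed)              # false: raw decomposed input is rejected
puts valid_job_title?(decomposed)              # true: accepted once normalized
puts valid_job_title?("Chef d'équipe (H/F)")   # true
puts valid_job_title?("工程师")                 # false: Han script is not Latin
```

In a Rails model the normalization would probably belong in a before_validation callback, but that placement is an assumption rather than something this thread prescribes.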
A simple option is to white-list all the characters you want to accept. For example:
/[a-zA-Z0-9áéíóúÁÉÍÓÚÑñ&*]/
Instead of a-zA-Z0-9 you can use \w. It represents any word character (letter, number, underscore).
/[\wáéíóúÁÉÍÓÚÑñ&*]/
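Two caveats may be worth checking before relying on the white-list as written. Without anchors it only tests that at least one allowed character is present, and in Ruby \w covers ASCII word characters only, which is why the accented letters still have to be listed. A small sketch, with hypothetical constant names:

```ruby
# Unanchored: succeeds as soon as ONE character is in the white-list.
ANY_ALLOWED = /[a-zA-Z0-9áéíóúÁÉÍÓÚÑñ&*]/
# Anchored with +: succeeds only if EVERY character is in the white-list.
ALL_ALLOWED = /\A[a-zA-Z0-9áéíóúÁÉÍÓÚÑñ&*]+\z/

puts ANY_ALLOWED.match?("東京 R&D")    # true: "R" alone is enough
puts ALL_ALLOWED.match?("東京 R&D")    # false
puts ALL_ALLOWED.match?("Señor&Co")   # true
puts "ñ".match?(/\w/)                 # false: \w is ASCII-only in Ruby
```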

Testing for word characters in Ruby/Rails regular expressions for all languages

I know I can match a word character with \w in Ruby's regular expressions:
2.0.0p247 :003 > /[\w]+/.match('hi')
=> #<MatchData "hi">
However, as I understand, that only matches [a-zA-Z0-9_]. I'd like to also match characters that appear in standard words in other languages. Is there an easy way to do this?
UPDATE: Seems like I may have found my answer in the POSIX bracket expressions:
/[[:alnum:]]/ - Alphabetic and numeric character
/[[:alpha:]]/ - Alphabetic character
Is this what I'm looking for?
Yes, you're definitely on the right track with [[:alpha:]]. Here's a Unicode-aware example from (https://stackoverflow.com/a/3879835/499581):
/\A[[:alpha:]]+\Z/
Also, for certain punctuation, consider using:
/[[:punct:]]/
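To make the difference concrete, here is a short sketch comparing \w with the POSIX bracket class on a UTF-8 string. It also shows that \Z tolerates a trailing newline while \z does not, which may matter for validation:

```ruby
text = "héllo wörld"

ascii_word   = text[/\w+/]            # \w stops at the first non-ASCII letter
unicode_word = text[/[[:alpha:]]+/]   # [[:alpha:]] matches any Unicode letter

puts ascii_word     # h
puts unicode_word   # héllo

# \Z allows one trailing newline; \z requires the true end of the string.
puts "abc\n".match?(/\A[[:alpha:]]+\Z/)   # true
puts "abc\n".match?(/\A[[:alpha:]]+\z/)   # false
```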

Ruby: Extracting Words From String

I'm trying to parse words out of a string and put them into an array. I've tried the following thing:
@string1 = "oriented design, decomposition, encapsulation, and testing. Uses "
puts @string1.scan(/\s([^\,\.\s]*)/)
It seems to do the trick, but it's a bit shaky (I should include more special characters for example). Is there a better way to do so in ruby?
Optional: I have a cs course description. I intend to extract all the words out of it and place them in a string array, remove the most common word in the English language from the array produced, and then use the rest of the words as tags that users can use to search for cs courses.
The split command.
words = @string1.split(/\W+/)
will split the string into an array based on a regular expression. \W means any "non-word" character and the "+" means to combine multiple delimiters.
For me, the best way to split sentences is:
line.split(/[^[[:word:]]]+/)
It works perfectly even with multilingual words and punctuation marks:
line = 'English words, Polski Żurek!!! crème fraîche...'
line.split(/[^[[:word:]]]+/)
=> ["English", "words", "Polski", "Żurek", "crème", "fraîche"]
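For comparison, a quick sketch of what the ASCII-only \W does to the same line. Every accented letter is treated as a delimiter, so words get cut apart:

```ruby
line = "English words, Polski Żurek!!! crème fraîche..."

ascii_split   = line.split(/\W+/)              # \W is ASCII-based in Ruby
unicode_split = line.split(/[^[[:word:]]]+/)   # Unicode-aware word class
unicode_scan  = line.scan(/[[:word:]]+/)       # same words, collected instead of split

p ascii_split     # ["English", "words", "Polski", "urek", "cr", "me", "fra", "che"]
p unicode_split   # ["English", "words", "Polski", "Żurek", "crème", "fraîche"]
```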
Well, you could split the string on spaces if that's your delimiter of interest
@string1.split(' ')
Or split on word boundaries
\b # Any word boundary
Or on non-word characters or whitespace
\W # Any non-word character
\s # Any whitespace character
Hint: try testing each of these on http://rubular.com
And note that Ruby 1.9 has some differences from 1.8
For Rails you can use something like this:
@string1.split(/\s/).delete_if(&:blank?)
I would write something like this:
@string1
.split(/,+|\s+/) # any ',' or any whitespace characters (space, tab, newline)
.reject(&:empty?)
.map { |w| w.gsub(/\A\W+|\W+\z/, '') } # \A\W+ => any leading punctuation; \W+\z => any trailing punctuation
irb(main):047:0> @string1 = "oriented design, 'with', !!qwe, and testing. can't rubyisgood#)(*#%)(*, and,rails,is,good"
=> "oriented design, 'with', !!qwe, and testing. can't rubyisgood#)(*#%)(*, and,rails,is,good"
irb(main):048:0> @string1.split(/,+|\s+/).reject(&:empty?).map { |w| w.gsub(/\A\W+|\W+\z/, '') }
=> ["oriented", "design", "with", "qwe", "and", "testing", "can't", "rubyisgood", "and", "rails", "is", "good"]
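The optional part of the question (turning a course description into search tags) could be sketched roughly like this. STOP_WORDS is a deliberately tiny, made-up list and extract_tags is a hypothetical helper; a real stop-word list would be much longer:

```ruby
# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = %w[a an and the of to in uses with].freeze

# Hypothetical helper: lowercase, pull out words, drop stop words and duplicates.
def extract_tags(description)
  description
    .downcase
    .scan(/[[:alpha:]]+(?:'[[:alpha:]]+)*/)   # words, keeping inner apostrophes
    .reject { |word| STOP_WORDS.include?(word) }
    .uniq
end

tags = extract_tags("oriented design, decomposition, encapsulation, and testing. Uses ")
p tags   # ["oriented", "design", "decomposition", "encapsulation", "testing"]
```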

Regex to validate string having only characters (not special characters), blank spaces and numbers

I am using Ruby on Rails 3.0.9 and I would like to validate a string that can have only characters (not special characters - case insensitive), blank spaces and numbers.
In my validation code I have:
validates :name,
:presence => true,
:format => { :with => regex } # Here I should set the 'regex'
How should I write the regex?
There are a couple ways of doing this. If you only want to allow ASCII word characters (no accented characters like Ê or letters from other alphabets like Ӕ or ל), use this:
/^[a-zA-Z\d\s]*$/
If you want to allow only numbers and letters from other languages for Ruby 1.8.7, use this:
/^(?:[^\W_]|\s)*$/u
If you want to allow only numbers and letters from other languages for Ruby 1.9.x, use this:
/^[\p{Word}\w\s-]*$/
Also, if you are planning to use 1.9.x regex with unicode support in Ruby on Rails, add this line at the beginning of your .rb file:
# coding: utf-8
You're looking for:
[a-zA-Z0-9\s]+
The + says one or more so it'll not match empty string. If you need to match them as well, use * in place of +.
In addition to what has been said, assign any of these regular expressions to your regex variable, for instance:
regex = /^[a-zA-Z\d\s]*$/
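One thing that may be worth checking with the patterns above: in Ruby, ^ and $ match at every line break, so a multi-line value can pass validation on the strength of its first line. A sketch of the difference, using hypothetical constant names; the \p{L}\p{N} variant is an assumption about what "letters from other languages" should mean here:

```ruby
LINE_ANCHORED   = /^[a-zA-Z\d\s]*$/       # ^ and $ match at every line break
STRING_ANCHORED = /\A[a-zA-Z\d\s]*\z/     # \A and \z match only at the string's ends
UNICODE         = /\A[\p{L}\p{N}\s]*\z/   # hypothetical variant: any letter or digit

input = "John\n<script>"

puts LINE_ANCHORED.match?(input)               # true: the first line alone satisfies it
puts STRING_ANCHORED.match?(input)             # false
puts STRING_ANCHORED.match?("Ærøskøbing 42")   # false: ASCII letters only
puts UNICODE.match?("Ærøskøbing 42")           # true
puts UNICODE.match?("a_b")                     # false: underscore is not a letter or digit
```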

Regular expression for valid subdomain in Ruby

I'm attempting to validate a string of user input that will be used as a subdomain. The rules are as follows:
Between 1 and 63 characters in length (I take 63 from the number of characters Google Chrome appears to allow in a subdomain, not sure if it's actually a server directive. If you have better advice on valid max length, I'm interested in hearing it)
May contain a-zA-Z0-9, hyphen, underscore
May not begin or end with a hyphen or underscore
EDIT: From input below, I've added the following:
4. Should not contain consecutive hyphens or underscores.
Examples:
a => valid
0 => valid
- => not valid
_ => not valid
a- => not valid
-a => not valid
a_ => not valid
_a => not valid
aa => valid
aaa => valid
a-a-a => valid
0-a => valid
a&a => not valid
a-_0 => not valid
a--a => not valid
aaa- => not valid
My issue is I'm not sure how to specify with a RegEx that the string is allowed to be only one character, while also specifying that it may not begin or end with a hyphen or underscore.
Thanks!
You can't have underscores in proper subdomains, but do you need them? After trimming your input, do a simple string length check, then test with this:
/^[a-z\d]+(-[a-z\d]+)*$/i
With the above, you won't get consecutive - characters, e.g. a-bbb-ccc passes and a--d fails.
/^[a-z\d]+([-_][a-z\d]+)*$/i
Will allow non-consecutive underscores as well.
Update: you'll find that, in practice, underscores are disallowed and all subdomains must start with a letter. The solution above does not allow internationalised subdomains (punycode). You're better off using this:
/\A([a-z][a-z\d]*(-[a-z\d]+)*|xn--[\-a-z\d]+)\z/i
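The "length check, then regex" approach can be sketched as a small helper and run against the asker's example table. This version follows the asker's original rules (underscores allowed); valid_subdomain? and SUBDOMAIN are hypothetical names:

```ruby
# Hypothetical helper following the asker's rules (underscores allowed).
SUBDOMAIN = /\A[a-z\d]+(?:[-_][a-z\d]+)*\z/i

def valid_subdomain?(name)
  # Length is checked separately; the regex handles the character rules.
  name.length.between?(1, 63) && SUBDOMAIN.match?(name)
end

valid   = %w[a 0 aa aaa a-a-a 0-a]
invalid = %w[- _ a- -a a_ _a a&a a-_0 a--a aaa-]

puts valid.all?    { |s| valid_subdomain?(s) }   # true
puts invalid.none? { |s| valid_subdomain?(s) }   # true
puts valid_subdomain?("a" * 64)                  # false: too long
```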
I'm not familiar with Ruby regex syntax, but I'll assume it's like, say, Perl. Sounds like you want:
/^(?![-_])[-a-z\d_]{1,63}(?<![-_])$/i
Or if Ruby doesn't use the i flag, just replace [-a-z\d_] with [-a-zA-Z\d_].
The reason I'm using [-a-zA-Z\d_] instead of the shorter [-\w] is that, while nearly equivalent, \w will allow special characters such as ä rather than just ASCII-type characters. That behavior can be optionally turned off in most languages, or you can allow it if you like.
Some more information on character classes, quantifiers, and lookarounds
/^([a-z0-9][a-z0-9\-\_]{0,61}[a-z0-9]|[a-z0-9])$/i
I took it as a challenge to create a regex that should match only strings with non-repeating hyphens or underscores and also check the proper length for you:
/^([a-z0-9]([_\-](?![_\-])|[a-z0-9]){0,61}[a-z0-9]|[a-z0-9])$/i
The middle part uses a lookaround to verify that.
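A quick way to convince yourself that this pattern enforces both the length limit and the no-consecutive-separators rule is to probe the boundaries. A sketch, with NO_REPEATS as a hypothetical name for the regex above:

```ruby
# The pattern from this answer, under a hypothetical constant name.
NO_REPEATS = /^([a-z0-9]([_\-](?![_\-])|[a-z0-9]){0,61}[a-z0-9]|[a-z0-9])$/i

puts NO_REPEATS.match?("a" * 63)   # true: first char + 61 middle chars + last char
puts NO_REPEATS.match?("a" * 64)   # false: one character too many
puts NO_REPEATS.match?("a-b_c")    # true: separators are never adjacent
puts NO_REPEATS.match?("a-_0")     # false: the lookahead rejects "-_"
puts NO_REPEATS.match?("aaa-")     # false: must end with a letter or digit
```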
^[a-zA-Z]([-a-zA-Z\d]*[a-zA-Z\d])?$
This simply enforces the standard in an efficient way without backtracking. It does not check the length, but regex is inefficient at things like that. Just check the string length (1 to 63 chars).
/[^\W\_](.+?)[^\W\_]$/i should work for ya (try out http://rubular.com/ to test regular expressions)
EDIT: actually, this doesn't handle one- or two-character strings of letters/numbers. Try /([^\W\_](.+?)[^\W\_])|([a-z0-9]{1,2})/i instead, and tinker with it in rubular until you get exactly what ya want (if this doesn't take care of it already).

Resources