Regular expression for valid subdomain in Ruby - ruby-on-rails

I'm attempting to validate a string of user input that will be used as a subdomain. The rules are as follows:
Between 1 and 63 characters in length (I take 63 from the number of characters Google Chrome appears to allow in a subdomain, not sure if it's actually a server directive. If you have better advice on valid max length, I'm interested in hearing it)
May contain a-zA-Z0-9, hyphen, underscore
May not begin or end with a hyphen or underscore
EDIT: From input below, I've added the following:
4. Should not contain consecutive hyphens or underscores.
Examples:
a => valid
0 => valid
- => not valid
_ => not valid
a- => not valid
-a => not valid
a_ => not valid
_a => not valid
aa => valid
aaa => valid
a-a-a => valid
0-a => valid
a&a => not valid
a-_0 => not valid
a--a => not valid
aaa- => not valid
My issue is I'm not sure how to specify with a RegEx that the string is allowed to be only one character, while also specifying that it may not begin or end with a hyphen or underscore.
Thanks!

You can't can have underscores in proper subdomains, but do you need them? After trimming your input, do a simple string length check, then test with this:
/^[a-z\d]+(-[a-z\d]+)*$/i
With the above, you won't get consecutive - characters, e.g. a-bbb-ccc passes and a--d fails.
/^[a-z\d]+([-_][a-z\d]+)*$/i
Will allow non-consecutive underscores as well.
Update: you'll find that, in practice, underscores are disallowed and all subdomains must start with a letter. The solution above does not allow internationalised subdomains (punycode). You're better of using this
/\A([a-z][a-z\d]*(-[a-z\d]+)*|xn--[\-a-z\d]+)\z/i

I'm not familiar with Ruby regex syntax, but I'll assume it's like, say, Perl. Sounds like you want:
/^(?![-_])[-a-z\d_]{1,63}(?<![-_])$/i
Or if Ruby doesn't use the i flag, just replace [-a-z\d_] with [-a-zA-Z\d_].
The reason I'm using [-a-zA-Z\d_] instead of the shorter [-\w] is that, while nearly equivalent, \w will allow special characters such as ä rather than just ASCII-type characters. That behavior can be optionally turned off in most languages, or you can allow it if you like.
Some more information on character classes, quantifiers, and lookarounds

/^([a-z0-9][a-z0-9\-\_]{0,61}[a-z0-9]|[a-z0-9])$/i
I've took it as a challenge to create a regex that should match only strings with non-repeating hyphens or underscores and also check the proper length for you:
/^([a-z0-9]([_\-](?![_\-])|[a-z0-9]){0,61}[a-z0-9]|[a-z0-9])$/i
The middle part uses a lookaround to verify that.

^[a-zA-Z]([-a-zA-Z\d]*[a-zA-Z\d])?$
This simply enforces the standard in an efficient way without backtracking. It does not check the length, but Regex is inefficient at things like that. Just check the string length (1 to 64 chars).

/[^\W\_](.+?)[^\W\_]$/i should work for ya (try our http://rubular.com/ to test out regular expressions)
EDIT: actually, this doesn't check single/double letter/numbers. try /([^\W\_](.+?)[^\W\_])|([a-z0-9]{1,2})/i instead, and tinker with it in rubular until you get exactly what ya want (if this doesn't take care of it already).

Related

RegEx contain a combination of letters, numbers, and a special character

I'm finding a regular expression which adheres below rules.
Acceptance criteria: Password must contain a combination of letters, numbers, and at least a special character.`
Here is my Regex:
validates :password, presence: true,
format: { with: ^(?=[a-zA-Z0-9]*$)([^A-Za-z0-9])}
I am not all that great at regex, so any and all help is greatly appreciated!
You can use the following RegEx pattern
/^(?=.*\d)(?=.*([a-z]|[A-Z]))([\x20-\x7E]){8,}$/
Let's look at what it is doing:
(?=.*\d) shows that the string should contain atleast one integer.
(?=.*([a-z]|[A-Z])) shows that the string should contain atleast one alphabet either from downcase or upcase.
([\x20-\x7E]) shows that string can have special characters of ascii values 20 to 7E.
{8,} shows that string should be minimum of 8 characters long. While you have not mentioned it should be at least 8 characters long but it is good to have.
If you're unsure about the ASCII values, you can google it or you could use the following instead:
/^(?=.*\d)(?=.*([a-z]|[A-Z]))(?=.*[##$%^&+=]){8,}$/
As suggested in the comments, a better way can be:
/\A(?=.*\d)(?=.*([a-z]))(?=.*[##$%^&+=]){8,}\z/i
Here:
\A represents beginning of string.
\z represents end of string.
/i represents case in-sensitive mode.
P.S: I have not tested it yet. May be I'll test and update later if required.

Rails strip all except numbers commas and decimal points

Hi I've been struggling with this for the last hour and am no closer. How exactly do I strip everything except numbers, commas and decimal points from a rails string? The closest I have so far is:-
rate = rate.gsub!(/[^0-9]/i, '')
This strips everything but the numbers. When I try add commas to the expression, everything is getting stripped. I got the aboves from somewhere else and as far as I can gather:
^ = not
Everything to the left of the comma gets replaced by what's in the '' on the right
No idea what the /i does
I'm very new to gsub. Does anyone know of a good tutorial on building expressions?
Thanks
Try:
rate = rate.gsub(/[^0-9,\.]/, '')
Basically, you know the ^ means not when inside the character class brackets [] which you are using, and then you can just add the comma to the list. The decimal needs to be escaped with a backslash because in regular expressions they are a special character that means "match anything".
Also, be aware of whether you are using gsub or gsub!
gsub! has the bang, so it edits the instance of the string you're passing in, rather than returning another one.
So if using gsub! it would be:
rate.gsub!(/[^0-9,\.]/, '')
And rate would be altered.
If you do not want to alter the original variable, then you can use the version without the bang (and assign it to a different var):
cleaned_rate = rate.gsub!(/[^0-9,\.]/, '')
I'd just google for tutorials. I haven't used one. Regexes are a LOT of time and trial and error (and table-flipping).
This is a cool tool to use with a mini cheat-sheet on it for ruby that allows you to quickly edit and test your expression:
http://rubular.com/
You can just add the comma and period in the square-bracketed expression:
rate.gsub(/[^0-9,.]/, '')
You don't need the i for case-insensitivity for numbers and symbols.
There's lots of info on regular expressions, regex, etc. Maybe search for those instead of gsub.
You can use this:
rate = rate.gsub!(/[^0-9\.\,]/g,'')
Also check this out to learn more about regular expressions:
http://www.regexr.com/

Matching function in erlang based on string format

I have user information coming in from an outside source and I need to check if that user is active. Sometimes I have a User and a Server and other times I have User#Server. The former case is no problem, I just have:
active(User, Server) ->
do whatever.
What I would like to do with the User#Server case is something like:
active([User, "#", Server]) ->
active(User, Server).
Doesn't seem to work. When calling active in the erlang terminal with a#b for example, I get an error that there is no match. Any help would be appreciated!
You can tokenize the string to get the result:
active(UserString) ->
[User,Server] = string:tokens(UserString,"#"),
active(User,Server).
If you need something more elaborate, or with better handling of something like email addresses, it might then be time to delve into using regular expressions with the re module.
active(UserString) ->
RegEx = "^([\\w\\.-]+)#([\\w\\.-]+)$",
{match, [User,Server]} = re:run(UserString,RegEx,[{capture,all_but_first,list}]),
active(User,Server).
Note: The supplied Regex is hardly sufficient for email address validation, it's just an example that allows all alphanumeric characters including underscores (\\w), dots (\\.), and dashes (-) seperated by an at symbol. And it will fail if the match doesn't stretch the whole length of the string: (^ to $).
A note on the pattern matching, for the real solution to your problem I think #chops suggestions should be used.
When matching patterns against strings I think it's useful to keep in mind that erlang strings are really lists of integers. So the string "#" is actually the same as [64] (64 being the ascii code for #)
This means that you match pattern [User, "#", Server] will match lists like: [97,[64],98], but not "a#b" (which in list form is [97,64,98]).
To match the string you need to do [User,$#,Server]. The $ operator gives you the ascii value of the character.
However this match pattern limits the matching string to be 1 character followed by # and then one more character...
It can be improved by doing [User, $# | Server] which allows the server part to have arbitrary length, but the User variable will still only match one single character (and I don't see a way around that).

username regex in rails

I am trying to find a regex to limit what a person can use for a username on my site. I don't need to have it check to see how many characters there are in it, as another validation does this. Basically all I need to make it do is make sure that it allows: letters (capital and lowercase) numbers, dashes and underscores.
I came across this: /^[-a-z]+$/i
But it doesn't seem to allow numbers.
What am I missing?
The regex you're looking for is
/\A[a-z0-9\-_]+\z/i
Meaning one or more characters of range a-z, range 0-9, - (needs to be escaped with a backslash) and _, case insensitive (the i qualifier)
Use
/\A[\w-]+\z$/
\w is shorthand for letters, digits and underscore.
\A matches at the start of the string, \z matches at the end of the string. These tokens are called anchors, and Ruby is a bit special with regard to them: Most regex engines use ^ and $ as start/end-of-string anchors by default, whereas in Ruby they can also match at the start/end of lines (which matters if you're working with multiline strings). Therefore, it's safer (as #JustMichael pointed out) to use \A and \z because there is no such ambiguity.
Your regular expression contains a character class [-a-z] that allows the characters - (dash) and a through z. In order to expand the range of characters allowed by this character class, you will need to add more characters within the [].
Please see Character Classes or Character Sets for further information and examples.

How to make a Ruby string safe for a filesystem?

I have user entries as filenames. Of course this is not a good idea, so I want to drop everything except [a-z], [A-Z], [0-9], _ and -.
For instance:
my§document$is°° very&interesting___thisIs%nice445.doc.pdf
should become
my_document_is_____very_interesting___thisIs_nice445_doc.pdf
and then ideally
my_document_is_very_interesting_thisIs_nice445_doc.pdf
Is there a nice and elegant way for doing this?
I'd like to suggest a solution that differs from the old one. Note that the old one uses the deprecated returning. By the way, it's anyway specific to Rails, and you didn't explicitly mention Rails in your question (only as a tag). Also, the existing solution fails to encode .doc.pdf into _doc.pdf, as you requested. And, of course, it doesn't collapse the underscores into one.
Here's my solution:
def sanitize_filename(filename)
# Split the name when finding a period which is preceded by some
# character, and is followed by some character other than a period,
# if there is no following period that is followed by something
# other than a period (yeah, confusing, I know)
fn = filename.split /(?<=.)\.(?=[^.])(?!.*\.[^.])/m
# We now have one or two parts (depending on whether we could find
# a suitable period). For each of these parts, replace any unwanted
# sequence of characters with an underscore
fn.map! { |s| s.gsub /[^a-z0-9\-]+/i, '_' }
# Finally, join the parts with a period and return the result
return fn.join '.'
end
You haven't specified all the details about the conversion. Thus, I'm making the following assumptions:
There should be at most one filename extension, which means that there should be at most one period in the filename
Trailing periods do not mark the start of an extension
Leading periods do not mark the start of an extension
Any sequence of characters beyond A–Z, a–z, 0–9 and - should be collapsed into a single _ (i.e. underscore is itself regarded as a disallowed character, and the string '$%__°#' would become '_' – rather than '___' from the parts '$%', '__' and '°#')
The complicated part of this is where I split the filename into the main part and extension. With the help of a regular expression, I'm searching for the last period, which is followed by something else than a period, so that there are no following periods matching the same criteria in the string. It must, however, be preceded by some character to make sure it's not the first character in the string.
My results from testing the function:
1.9.3p125 :006 > sanitize_filename 'my§document$is°° very&interesting___thisIs%nice445.doc.pdf'
=> "my_document_is_very_interesting_thisIs_nice445_doc.pdf"
which I think is what you requested. I hope this is nice and elegant enough.
From http://web.archive.org/web/20110529023841/http://devblog.muziboo.com/2008/06/17/attachment-fu-sanitize-filename-regex-and-unicode-gotcha/:
def sanitize_filename(filename)
returning filename.strip do |name|
# NOTE: File.basename doesn't work right with Windows paths on Unix
# get only the filename, not the whole path
name.gsub!(/^.*(\\|\/)/, '')
# Strip out the non-ascii character
name.gsub!(/[^0-9A-Za-z.\-]/, '_')
end
end
In Rails you might also be able to use ActiveStorage::Filename#sanitized:
ActiveStorage::Filename.new("foo:bar.jpg").sanitized # => "foo-bar.jpg"
ActiveStorage::Filename.new("foo/bar.jpg").sanitized # => "foo-bar.jpg"
If you use Rails you can also use String#parameterize. This is not particularly intended for that, but you will obtain a satisfying result.
"my§document$is°° very&interesting___thisIs%nice445.doc.pdf".parameterize
For Rails I found myself wanting to keep any file extensions but using parameterize for the remainder of the characters:
filename = "my§doc$is°° very&itng___thsIs%nie445.doc.pdf"
cleaned = filename.split(".").map(&:parameterize).join(".")
Implementation details and ideas see source: https://github.com/rails/rails/blob/master/activesupport/lib/active_support/inflector/transliterate.rb
def parameterize(string, separator: "-", preserve_case: false)
# Turn unwanted chars into the separator.
parameterized_string.gsub!(/[^a-z0-9\-_]+/i, separator)
#... some more stuff
end
If your goal is just to generate a filename that is "safe" to use on all operating systems (and not to remove any and all non-ASCII characters), then I would recommend the zaru gem. It doesn't do everything the original question specifies, but the filename produced should be safe to use (and still keep any filename-safe unicode characters untouched):
Zaru.sanitize! " what\ēver//wëird:user:înput:"
# => "whatēverwëirduserînput"
Zaru.sanitize! "my§docu*ment$is°° very&interes:ting___thisIs%nice445.doc.pdf"
# => "my§document$is°° very&interesting___thisIs%nice445.doc.pdf"
There is a library that may be helpful, especially if you're interested in replacing weird Unicode characters with ASCII: unidecode.
irb(main):001:0> require 'unidecoder'
=> true
irb(main):004:0> "Grzegżółka".to_ascii
=> "Grzegzolka"

Resources