Need Code for US Phone numbers using Dashes or Parentheses but not spaces

I am new to RegEx and need some guidance. Right now I have the following validation for phone numbers:
(\d{3}) ?\d{3}( |-)?\d{4}|^\d{3}( |-)?\d{3}( |-)?\d{4}
Unfortunately, the system I am importing the results into does not think favorably of the numbers being separated solely by spaces or not separated at all. What would the formula look like that requires either dashes or parentheses and accepts only the following formats: XXX-XXX-XXXX or (XXX) XXX-XXXX?
Thank you for your assistance.

Start simple:
\d{3}-\d{3}-\d{4} works beautifully for numbers like 212-867-5309.
As for others, I'd say you and your users would be better off if you kept it simple. No switching, no choices. Pick a standard. Simple is good.
If you must persist, look at this web site for help. You aren't the first.
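If both formats really are required, one way to express it is a single alternation on the area-code part. The following is only a sketch in Python (the function name is hypothetical and not from the question); it accepts exactly XXX-XXX-XXXX or (XXX) XXX-XXXX and nothing else:

```python
import re

# Area code is either "XXX-" or "(XXX) "; the rest is always "XXX-XXXX".
# re.ASCII keeps \d limited to 0-9.
US_PHONE_RE = re.compile(r"(?:\d{3}-|\(\d{3}\) )\d{3}-\d{4}", re.ASCII)


def is_valid_us_phone(text):
    """Return True only for XXX-XXX-XXXX or (XXX) XXX-XXXX (hypothetical helper)."""
    return US_PHONE_RE.fullmatch(text) is not None
```

Because fullmatch is used, no ^ or $ anchors are needed, and numbers separated only by spaces or not separated at all are rejected.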

Related

Parsing one number into multiple tokens in ANTLR

I am trying to use ANTLR as a parser for my company's latest project. I have been unable to find any information on how to parse one number, say (0005039906179210835699175654) into multiple tokens (a 5 digit number, a 3 digit number, a 14 digit number, and a 6 digit number).
My current code spits back an error,
line 1:1 no viable alternative at input '0005039906179210835699175654'
Also, on another note, does anyone know how to get the name of a token by using a listener? That's just a bonus question I guess :) Thanks in advance to everyone who responds!
EDIT:
To clarify the whole problem, my company receives information in a legacy format from automated systems. This information must be parsed into POJOs for further processing. I am trying to use ANTLR as an easy, smooth, readable, and expandable solution to this. One example is this line:
U0005138606179090232769522950 0863832 18322862 0284785 3
Which must be parsed into the sections: U, 00051, 386, 06179090232769, 522950, 0863832, 18322862, 0284785, and 3. Obviously the sections separated by white space are easy to parse but I have been unable to find a way in ANTLR to parse the values not separated by white space. Any help would be appreciated, thanks!
EDIT2:
To be perfectly clear as to why I'm using ANTLR instead of just Java, my company receives messages in 5 legacy formats, and the system implemented to parse them must be easily expandable to accommodate more in the future. ANTLR is easy to read and understand. Plus, it will be easier to construct additional grammars and listeners than to try to maintain a random mess of Java.
EDIT3:
I thought of a solution but it is pretty janky. My idea is to parse the 28 character number as one token, then split it using Java from a listener since it is broken up the same way each time. I'll report back later today on whether I got it to work.
EDIT4, FINAL UPDATE:
I have chosen to go with my solution mentioned in edit3. It is not pretty, but it works and it is fast enough. Thank you very much to everyone who commented, shared ideas, and stimulated thought!
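For illustration, the splitting step from edit3 can be sketched as plain slicing. The asker did this in Java inside an ANTLR listener; the version below is Python, and the function names and the widths constant are assumptions made up for the example, using the 5/3/14/6 layout described above:

```python
# Widths of the four sections packed into the 28-character number.
FIELD_WIDTHS = (5, 3, 14, 6)


def split_fixed_width(token, widths=FIELD_WIDTHS):
    """Cut a fixed-width token into consecutive slices of the given widths."""
    if len(token) != sum(widths):
        raise ValueError(f"expected {sum(widths)} characters, got {len(token)}")
    parts = []
    pos = 0
    for width in widths:
        parts.append(token[pos:pos + width])
        pos += width
    return parts


def parse_record(line):
    """Split one legacy line: type letter, packed number, then whitespace-separated fields."""
    fields = line.split()
    head = fields[0]
    return [head[0]] + split_fixed_width(head[1:]) + fields[1:]
```

Applied to the example line from the question, parse_record returns the nine sections listed there, in order.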

How do Europeans write a list of numbers with decimals?

As I understand it, Europeans(*) write numbers with a comma for a decimal separator, so one-and-a-quarter is written as 1,25
Europeans also use commas to separate lists, so how do you write a list of decimal numbers? I, as an Englishman, would write one-and-a-quarter, one-and-a-half, one-and-three-quarters like this:
1.25, 1.5, 1.75
How do you do that in Europe?
(Why is this a programming question? Because I'm writing a program that will ask European users for a list of numbers!)
* For the purposes of this question, there are no English-speaking countries in Europe. :-)
I'm European (French), and in almost all programs here we have to use semicolons ';' as a separator, even if the numbers are only integers, because the comma doesn't look like a separator to us. In mathematics, semicolons are the only right way here to separate a list of numbers.
The most common example is when we have to enter the page numbers we want to print from a PDF: all programs ask for a semicolon-separated list, and I have always found it intuitive. I think they would have changed it if it were uncomfortable for some.
This varies by culture, and within a culture. The CLDR data contains the “list” element that specifies the list separator character, and it is the semicolon for most cultures, see the chart of number symbols (element “list”). The definition is very implicit though, and there is variation inside locales. Some people regard 1,25, 1,5, 1,75 as acceptable, while others prefer 1,25; 1,5; 1,75. There are also people who seriously think that in a strongly mathematical or numeric context, one should deviate from the locale practices and use the Anglo-Saxon notation with decimal point, hence with comma as separator.
On the practical side, I think it would not be very wrong to use “;” as number list separator when decimal comma is used, or even when decimal point is used. So you might even consider using “;” in all locales.
But when it comes to user input, it’s trickier. In principle, you should be liberal in what you accept, but since the comma can be meant to be a decimal comma, a thousands separator, or a list item separator, there is such a thing as being too liberal.
If possible, prompt for each number separately, avoiding the separator issue. If this is not possible, the crucial thing is to make it very, very clear to the user which separator is expected. I would go as far as saying that requiring the semicolon “;” is the most reliable thing to do.
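As a rough sketch of the "require a semicolon" approach, the input can be split on the list separator first and only then have its decimal separator normalized. This is Python with a hypothetical function name; it assumes no thousands separators are present and does not try to guess the user's intent:

```python
def parse_decimal_list(text, list_sep=";", decimal_sep=","):
    """Parse e.g. "1,25; 1,5; 1,75" into floats (hypothetical helper).

    Assumes the list separator is required and that the input
    contains no thousands separators.
    """
    numbers = []
    for item in text.split(list_sep):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing separator or doubled separators
        numbers.append(float(item.replace(decimal_sep, ".")))
    return numbers
```

Because the list separator is consumed before the decimal comma is touched, the comma is never ambiguous. Anything that is not a number after normalization raises ValueError, which is a natural place to re-prompt the user.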
Why ask about Europeans in general? I don't think there is one European way of doing so, and if it happens to be the case then it would be sheer luck. Europe is comprised of different cultures and each has its own rules.
You don't mention what platform you are using but you might be able to rely on your platform to get this information. In the case of .NET, you can get this information through TextInfo.ListSeparator. For example this would give you the French one (result: a semicolon):
string listSeparator = new CultureInfo("fr-FR").TextInfo.ListSeparator;
I don't think there is one way to do it. White space separating the numbers would work just the same, or you could use a semicolon (';') to separate the numbers.

Extra text after the URL in a QR code

I've seen a number of QR codes that contain a URL but also have some extra text after it. Something like:
http://www.example.com Thanks for scanning this QR code!
I've experimented with using a number of different delimiting methods (several spaces, a question mark, two dashes, one or two returns) and all work to varying degrees on various scanning programs.
Some respect the space character, others respect the return. Some think the URL isn't a URL at all when I use a return. Long story short, it's all over the map how the various scanning programs (NeoReader, iNigma, Qrafter, Beetag, OptiScan, etc.) treat characters after a URL.
Is there any consensus on whether (a) this is even a good idea and (b) if so, what the 'correct' (best practice) way to do it is? (I know I should go read the RFC for the exact definition of a URL but since the reader programs are all over the map, I suspect they didn't read it either.)
You can make it work by converting the text message into a valid URL, while trying to keep readability.
In your case it can be:
http://www.example.com?Thanks_for_scanning_this_QR_code
It's not perfect, but it can help on web analytics side to distinguish all QR code users.
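A minimal sketch of that conversion, in Python: spaces become underscores for readability and anything else that is not URL-safe is percent-encoded. The function name is hypothetical, and whether a given scanner or web server treats the bare query string the way you want is an assumption to verify:

```python
from urllib.parse import quote


def url_with_message(base_url, message):
    """Append a human-readable message to a URL as its query string (hypothetical helper)."""
    # Underscores keep the text legible; quote() percent-encodes everything
    # outside letters, digits and "_.-~" because safe is empty.
    slug = quote(message.replace(" ", "_"), safe="")
    return f"{base_url}?{slug}"
```

For the example above, url_with_message("http://www.example.com", "Thanks for scanning this QR code") yields the URL shown, with no spaces left for a scanner to trip over.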
Spaces are definitely not part of a URL, so, in that sense a space definitely should delimit the end of a URL.
The entire string is not a URL, taken as a whole of course. So yes it's asking for trouble.
As you've found, the empirical answer is that not every reader does what you want. Barcode Scanner for instance understands the split here, but does not prompt the user to launch the browser since the payload isn't a URL per se.
So: it's a bad idea.

Is it possible to create a mask to handle non-North American phone numbers?

For North American phone numbers, (999) 999-9999 works pretty well for an input mask.
However, I can't find a good example that will handle non-North American numbers. I know that the number of digits can vary, so other than restricting it to digits only, is there a good example anywhere?
There is no generic mask, really: There are too many combinations.
The only thing that is fixed is the international country code, usually prefixed by +.
According to the Wikipedia Article on telephone numbering plans, most countries conform with the E.164 numbering plan.
If I read E.164 correctly, you can safely make the following assumptions:
Country code: 1-3 digits
Network / area code and number: the remaining digits, with the full number (country code included) capped at 15 digits
I would ask for the country code, and have the "area code + number" field as an input of at most 14 digits.
You can deduce the country code with a simple RegEx such as:
^(?:(?:0(?:0|11)\s?)|\+)([17]|2([07]|[1-689]\d)|3([0-469]|[578]\d)|4([013-9]|2\d)|5([1-8]|[09]\d)|6([0-6]|[789]\d)|8([12469]|[03578]\d)|9([0-58]|[679]\d))
Followed by
(([\s\(\).-]{0,2}\d){4,13})$
to extract the national number.
For validating the national number length and validity, you'd need libphonenumber or similar.
The long RegEx above allows +, 00 or 011 before the country code and a selection of punctuation in the number which will also have to be stripped.
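To show how the two expressions fit together, here is a sketch in Python that concatenates them, with the leading plus sign escaped, and strips the punctuation from the national part. The named groups and the function name are additions for readability, not part of the original answer, and this only splits the number; it does not validate it:

```python
import re

# Prefix: "00", "011" (optionally followed by whitespace) or "+".
PREFIX = r"^(?:(?:0(?:0|11)\s?)|\+)"
# Country code alternatives from the answer, inner groups made non-capturing.
COUNTRY_CODE = (
    r"(?P<cc>[17]|2(?:[07]|[1-689]\d)|3(?:[0-469]|[578]\d)|4(?:[013-9]|2\d)"
    r"|5(?:[1-8]|[09]\d)|6(?:[0-6]|[789]\d)|8(?:[12469]|[03578]\d)"
    r"|9(?:[0-58]|[679]\d))"
)
# National number: 4 to 13 digits, each preceded by at most two separators.
NATIONAL = r"(?P<national>(?:[\s().-]{0,2}\d){4,13})$"

PHONE_RE = re.compile(PREFIX + COUNTRY_CODE + NATIONAL)


def split_international(number):
    """Return (country_code, national_digits), or None if the shape does not match."""
    match = PHONE_RE.match(number.strip())
    if match is None:
        return None
    # Drop spaces, dots, dashes and parentheses from the national part.
    return match.group("cc"), re.sub(r"\D", "", match.group("national"))
```

For instance, split_international("+49 30 123456") gives ("49", "30123456"), while a number without +, 00 or 011 in front returns None.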
You don't mention your application but this is certainly possible using regular expressions. You might want to take a look here.
Not easily. Take a look at this page for an example of why: if you only look at the German phone numbers, you'll note that there are different formats depending on where you're calling the number from. Which one do you pick? And that's just for German phone numbers; they differ from continent to continent, and from country to country.
Going with "numbers-only" is probably your safest bet.
I would allow for spaces, dashes, slashes and all that, but actually only care for numbers and the optional leading + sign. Everything else, such as assuming certain blocks of a certain length is just asking for trouble.
Maybe it is bad form to answer an old question, but libphonenumber seems like a good solution to your problem.

Regex: Match a string containing numbers and letters but not a string of just numbers

Question
I would like to be able to use a single regex (if possible) to require that a string fits [A-Za-z0-9_] but doesn't allow:
Strings containing just numbers and/or symbols.
Strings starting or ending with symbols.
Multiple symbols next to each other.
Valid
test_0123
t0e1s2t3
0123_test
te0_s1t23
t_t
Invalid
t__t
____
01230123
_0123
_test
_test123
test_
test123_
Reasons for the Rules
The purpose of this is to filter usernames for a website I'm working on. I've arrived at the rules for specific reasons.
Usernames with only numbers and/or symbols could cause problems with routing and database lookups. The route for /users/#{id} allows id to be either the user's id or user's name. So names and ids shouldn't be able to collide.
_test looks weird and I don't believe it's a valid subdomain, i.e. _test.example.com
I don't like the look of t__t as a subdomain. i.e. t__t.example.com
This matches exactly what you want:
/\A(?!_)(?:[a-z0-9]_?)*[a-z](?:_?[a-z0-9])*(?<!_)\z/i
At least one alphabetic character (the [a-z] in the middle).
Does not begin or end with an underscore (the (?!_) and (?<!_) at the beginning and end).
May have any number of numbers, letters, or underscores before and after the alphabetic character, but every underscore must be separated by at least one number or letter (the rest).
Edit: In fact, you probably don't even need the lookahead/lookbehinds due to how the rest of the regex works - the first ?: parenthetical won't allow an underscore until after an alphanumeric, and the second ?: parenthetical won't allow an underscore unless it's before an alphanumeric:
/\A(?:[a-z0-9]_?)*[a-z](?:_?[a-z0-9])*\z/i
Should work fine.
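A quick way to check that claim is to run the simplified expression against the question's own examples. The sketch below is Python rather than Ruby, so \A and \z are replaced by fullmatch, and the /i flag by re.IGNORECASE (plus re.ASCII to keep the classes to plain ASCII); the list and function names are made up for the example:

```python
import re

# Same body as the simplified regex; fullmatch stands in for \A ... \z.
USERNAME_RE = re.compile(
    r"(?:[a-z0-9]_?)*[a-z](?:_?[a-z0-9])*",
    re.IGNORECASE | re.ASCII,
)

VALID = ["test_0123", "t0e1s2t3", "0123_test", "te0_s1t23", "t_t"]
INVALID = ["t__t", "____", "01230123", "_0123", "_test", "_test123", "test_", "test123_"]


def is_valid_username(name):
    """Return True if the whole name matches the simplified pattern."""
    return USERNAME_RE.fullmatch(name) is not None


def failing_cases():
    """Return the example names the pattern classifies wrongly (ideally none)."""
    wrong = [name for name in VALID if not is_valid_username(name)]
    wrong += [name for name in INVALID if is_valid_username(name)]
    return wrong
```

An empty list from failing_cases() means the pattern agrees with every valid and invalid example given in the question.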
I'm sure that you could put all this into one regular expression, but it won't be simple and I'm not sure why you insist on it being one regex. Why not use multiple passes during validation? If the validation checks are done when users create a new account, there really isn't any reason to try to cram it into one regex. (That is, you will only be dealing with one item at a time, not hundreds or thousands or more. A few passes over a normal sized username should take very little time, I would think.)
First reject if the name doesn't contain at least one number; then reject if the name doesn't contain at least one letter; then check that the start and end are correct; etc. Each of those passes could be a simple to read and easy to maintain regular expression.
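A sketch of the multi-pass idea in Python follows. It checks the rules as stated in the question, so it requires a letter but not a digit (t_t is listed there as valid); the function name and the messages are assumptions for illustration:

```python
import re


def username_problems(name):
    """Return a list of rule violations; an empty list means the name is acceptable."""
    problems = []
    # Pass 1: character set (also rejects the empty string).
    if not re.fullmatch(r"[A-Za-z0-9_]+", name):
        problems.append("must be non-empty and use only letters, digits and underscores")
    # Pass 2: at least one letter, so names cannot collide with numeric ids.
    if not re.search(r"[A-Za-z]", name):
        problems.append("must contain at least one letter")
    # Pass 3: no underscore at either end.
    if name.startswith("_") or name.endswith("_"):
        problems.append("must not start or end with an underscore")
    # Pass 4: no two underscores in a row.
    if "__" in name:
        problems.append("must not contain consecutive underscores")
    return problems
```

A side benefit of separate passes is that each failure maps to its own message, which can be shown to the user at sign-up.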
What about:
/^(?=[^_])([A-Za-z0-9]+_?)*[A-Za-z](_?[A-Za-z0-9]+)*$/
It doesn't use a back reference.
Edit:
It succeeds for all your test cases and is Ruby-compatible.
This doesn't block "__", but it does get the rest:
([A-Za-z]|[0-9][0-9_]*)([A-Za-z0-9]|_[A-Za-z0-9])*
And here's the longer form that gets all your rules:
([A-Za-z]|([0-9]+(_[0-9]+)*([A-Za-z]|_[A-Za-z])))([A-Za-z0-9]|_[A-Za-z0-9])*
Dang, that's ugly. I'll agree with Telemachus that you probably shouldn't do this with one regex, even though it's technically possible. Regex is often a pain for maintenance.
The question asks for a single regexp, and implies that it should be a regexp that matches, which is fine, and answered by others. For interest, though, I note that these rules are rather easier to state directly as a regexp that should not match. I.e.:
x !~ /[^A-Za-z0-9_]|^_|_$|__|^\d+$/
no other characters than letters, numbers and _
can't start with a _
can't end with a _
can't have two _s in a row
can't be all digits
You can't use it this way in a Rails validates_format_of, but you could put it in a validate method for the class, and I think you'd have a much better chance of still being able to make sense of what you meant, a month or a year from now.
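The same "reject on match" idea can be sketched in Python, where Ruby's !~ becomes a search that must find nothing. The function name is hypothetical, and the explicit empty-string check is an addition, since the pattern on its own finds nothing to object to in an empty name:

```python
import re

# Any hit means the name breaks one of the five rules listed above.
REJECT_RE = re.compile(r"[^A-Za-z0-9_]|^_|_$|__|^\d+$", re.ASCII)


def is_acceptable(name):
    """Return True when the name is non-empty and nothing forbidden is found."""
    # bool(name) is an extra guard: the reject pattern never matches "".
    return bool(name) and REJECT_RE.search(name) is None
```

Note that, like the original expression, this accepts a name such as 0_1, which consists only of digits and an underscore; if that has to be rejected as well, the last alternative needs to cover underscores too.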
Here you go:
^(([a-zA-Z]([^a-zA-Z0-9]?[a-zA-Z0-9])*)|([0-9]([^a-zA-Z0-9]?[a-zA-Z0-9])*[a-zA-Z]+([^a-zA-Z0-9]?[a-zA-Z0-9])*))$
If you want to restrict the symbols you accept, simply replace every [^a-zA-Z0-9] with a [] class containing all allowed symbols.
(?=.*[a-zA-Z].*)^[A-Za-z0-9](_?[A-Za-z0-9]+)*$
This one works.
Look ahead to make sure there's at least one letter in the string, then start consuming input. Every time there is an underscore, there must be a number or a letter before the next underscore.
/^(?![\d_]+$)[A-Za-z0-9]+(?:_[A-Za-z0-9]+)*$/
Your question is essentially the same as this one, with the added requirement that at least one of the characters has to be a letter. The negative lookahead - (?![\d_]+$) - takes care of that part, and is much easier (both to read and write) than incorporating it into the basic regex as some others have tried to do.
[A-Za-z][A-Za-z0-9_]*[A-Za-z]
That would work for your first two rules (since it requires a letter at the beginning and end for the second rule, it automatically requires letters).
I'm not sure the third rule is possible using regexes.
