I'd like a regex or some function to standardize all phone number input from my iOS app.
Thus inputs like the following
(212)555-5555
2125555555
212-555-5555
212 555 5555
212-5555555
all get translated to
(212) 555-5555
Would the following regex work to split the phone numbers into three capture groups, which I can then format into the correct output string?
^\D?(\d{3})\D?\D?(\d{3})\D?(\d{4})$
Is the $ sign at the end of the regex required or does that mess things up?
Is regex the best way to do this in iOS?
^ and $ in a regex match the beginning and end of a line of input. You have not provided enough information to determine if these anchors are appropriate in this particular case.
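To make the effect of the trailing $ concrete, here is a minimal sketch in Python (chosen only for brevity; the pattern itself is the one from the question). The sample input with one digit too many is made up for illustration.

```python
import re

# The pattern from the question, with and without the trailing "$".
ANCHORED = re.compile(r"^\D?(\d{3})\D?\D?(\d{3})\D?(\d{4})$")
UNANCHORED = re.compile(r"^\D?(\d{3})\D?\D?(\d{3})\D?(\d{4})")

too_long = "212-555-55556"  # hypothetical input with an extra digit

# With "$" the whole input must be consumed, so the extra digit causes a rejection.
anchored_result = ANCHORED.match(too_long)

# Without "$" the pattern matches a prefix and silently ignores the last digit.
unanchored_result = UNANCHORED.match(too_long)

print(anchored_result)             # None
print(unanchored_result.groups())  # ('212', '555', '5555')
```

If the goal is to validate a whole text field, keeping the $ is probably what you want; if the goal is to find numbers inside longer text, it is not.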
Since you're working in iOS have you looked at the NSDataDetector class? It provides mechanisms for detecting strings which could be valid phone numbers in many different formats. This would give you phone number detection matching the behavior users see in many of the other apps on their devices.
NSDataDetector does not provide a mechanism for re-formatting phone numbers, so you would still need to determine how you want to reformat strings detected as possible phone numbers (which may contain more or fewer than 10 digits). If you do so, you should probably fall back to preserving the original format of any detected number which does not match one of your expected formats.
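The "reformat if it matches, otherwise keep the original" idea can be sketched as follows. This is only an illustration in Python using the regex from the question; the function name format_us_phone is hypothetical, and on iOS the same logic would sit on top of NSRegularExpression or NSDataDetector results.

```python
import re

# Regex from the question: three capture groups for a 10-digit number.
PHONE = re.compile(r"^\D?(\d{3})\D?\D?(\d{3})\D?(\d{4})$")

def format_us_phone(raw: str) -> str:
    """Return "(AAA) PPP-LLLL" when raw matches, else raw unchanged."""
    m = PHONE.match(raw.strip())
    if m is None:
        # Unknown format: preserve what the user typed.
        return raw
    area, prefix, line = m.groups()
    return f"({area}) {prefix}-{line}"

for sample in ["(212)555-5555", "2125555555", "212-555-5555",
               "212 555 5555", "212-5555555", "+44 20 7946 0958"]:
    print(sample, "->", format_us_phone(sample))
```

The last sample is a made-up non-US number; it does not match the pattern and is returned as typed.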
I'm working on a Guitar Chord transposer, and so from a given text file, I want to identify guitar chords. e.g. G#, Ab, F#m, etc.
I'm almost there! I have already run into a few problems due to the number sign (#).
For example, you can't include the number sign in your regex pattern. The NSRegularExpression will not initialize with this:
let fail: String = "\\b[ABCDEFG](b|#)?\\b"
let success: String = "\\b[CDEFGAB](b|\\u0023)?\\b"
I had to specifically provide the unicode character. I can live with that.
However, now that I have a NSRegularExpression object, it won't match these (sharps = number sign) when I have a line of text such as:
Am Bb G# C Dm F E
When it starts processing the G#, the sharp associated with that second capture group is not matched. (i.e. the NSTextCheckingResult's second range has a location of NSNotFound) Note, it works for Bb... it matches the 'b'
I'm wondering what I need to do here. It would seem the documentation doesn't cover this case of '#', which IS in fact sometimes used in regex patterns (I think it is related to comments or something).
One thing that would be great would be to not have to look up the unicode identifier for a #, but just use it as a String "#" then convert that so it plays nicely with the pattern. There exists the chance that \u0023 is in fact not the code associated with # ...
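As a rough illustration of "use the plain string and let a helper escape it", here is a small Python sketch. Python is used only for brevity; re.escape is Python's helper, and Foundation offers NSRegularExpression.escapedPattern(for:) for the same purpose. The accidentals list is an assumption for the example.

```python
import re

# "#" is U+0023, so the \u0023 escape does refer to the number sign.
code_point = hex(ord("#"))

# re.escape backslash-escapes characters that could be special in a pattern,
# so the accidentals can be written as plain strings.
accidentals = ["b", "#"]
note_pattern = r"[CDEFGAB](?:" + "|".join(re.escape(a) for a in accidentals) + r")?"

notes = re.findall(note_pattern, "Bb G# C")
print(code_point)  # 0x23
print(notes)       # ['Bb', 'G#', 'C']
```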
The \b word boundary is a context dependent construct. It matches in 4 contexts: 1) between start of string and a word char, 2) between a word char and end of string, 3) between word and a non-word and 4) a non-word and a word char.
Your regex is written in such a way that ultimately the regex engine sees a \b after # and that means a # will only match if there is a word char after it.
If you replace \b with (?!\w), a negative lookahead that fails the match if there is a word char immediately to the right of the current location, it will work.
So, you may use
\\b[CDEFGAB](b|\\u0023)?(?!\\w)
See the regex demo.
Details
\b - a word boundary
[CDEFGAB] - a char from the set
(b|\\u0023)? - an optional sequence of b or #
(?!\\w) - a negative lookahead failing the match (and causing backtracking into the preceding pattern! To avoid that, add + after ? to prevent backtracking into that pattern) if there is a word char immediately to the right of the current position.
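The difference between the trailing \b and the (?!\w) lookahead can be checked with a short sketch. It is written in Python for brevity, where # needs no special treatment in a non-verbose pattern; the input line is the one from the question.

```python
import re

line = "Am Bb G# C Dm F E"

# Original shape: a word boundary is required after the optional accidental.
with_boundary = re.compile(r"\b[CDEFGAB](b|#)?\b")
# Suggested shape: only require that no word character follows.
with_lookahead = re.compile(r"\b[CDEFGAB](b|#)?(?!\w)")

boundary_matches = [m.group(0) for m in with_boundary.finditer(line)]
lookahead_matches = [m.group(0) for m in with_lookahead.finditer(line)]

print(boundary_matches)   # ['Bb', 'G', 'C', 'F', 'E']
print(lookahead_matches)  # ['Bb', 'G#', 'C', 'F', 'E']
```

With \b the engine gives up the # (there is no word boundary between # and a space) and settles for G alone; with the lookahead the sharp is kept. Am and Dm are skipped by both patterns because a word character follows the note letter.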
(I'd like to first say @WiktorStribiżew has been a tremendous help and what I am writing now would not have been possible without him! I'm not concerned about StackOverflow points and rep, so if you like this answer, please upvote his answer.)
This issue took many turns and had a few issues going on. Ultimately this question should be called How do I use Regex on iOS to detect Musical Chords in a text file?
The answer is (so far), not simply.
CRASH COURSE IN MUSIC THEORY
In music you have notes. They are made up of a letter from A to G and an optional symbol called an accidental. (A note relates to the acoustic frequency of the sound you hear when that note is played.) An accidental can be a flat (represented as a ♭ or simply a b) or a sharp (represented as a ♯ or simply a #, as these are easier to type on a keyboard). An accidental serves to make a note a semitone higher (#) or lower (b). As such, an F# is the same acoustic frequency as a Gb. On a piano, the white keys are notes with no accidentals, and the black keys represent notes with an accidental. A piece of music usually won't mix accidental types: it will use either flats or sharps throughout. (Which one depends on the musical key of the composition, but this is not that relevant here.)
In terms of regex, you have something like [ABCDEFG](b|#)? to determine the note. In reality it's more complicated.
Then, a Musical Chord is comprised of the root note and its chord type. There are over 50 types of chords. They have a 'text signature' that is unique. Also, a 'major' chord has an empty signature. So in terms of pseudo-regex you have for a Chord:
[ABCDEFG](b|#)?(...|...|...)?
where the first part you recognize as the note (as before), and the last optional is to determine the chord type. The different types were omitted, but can be as simple as a m (for Minor chord), or maj7#5 (for a major 7th chord with an augmented 5th... don't worry about it. Just know there are many string constants to represent a chord type)
Then finally, with guitar you often have a corresponding bass note that changes the chord's tonality somewhat. You denote this by adding a slash and then the note, giving the general pseudoform:
[ABCDEFG](b|#)?(...|...|...)?(/[ABCDEFG](b|#)?)? // NOT real Regex
real examples: C/F or C#m/G# and so on
where the last part has a slash then the same pattern to recognize a note.
So putting these all together, in general we want to find chords that could take on many forms, such as:
F Gm C#maj7/G# F/C Am A7 A7/F# Bmaj13#11
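To show how the pseudo-pattern above could be turned into a real one, here is a minimal sketch in Python (not the Swift code this answer refers to). The chord-type list is a tiny hypothetical subset, and the names CHORD_TYPES and find_chords are made up for the example.

```python
import re

# Hypothetical, deliberately tiny list of chord-type signatures; a real list
# would have dozens of entries. Longest first so "maj7#5" is tried before "maj7".
CHORD_TYPES = sorted(["m", "7", "m7", "maj7", "maj7#5", "maj13#11"],
                     key=len, reverse=True)

NOTE = r"[A-G][b#]?"
TYPE = "|".join(re.escape(t) for t in CHORD_TYPES)

# root note, optional chord type, optional "/bass note"; the lookarounds keep
# the match from starting or ending in the middle of a longer token.
CHORD = re.compile(
    rf"(?<![\w#/])(?P<root>{NOTE})(?P<type>{TYPE})?(?:/(?P<bass>{NOTE}))?(?![\w#/])"
)

def find_chords(text):
    """Return (root, type, bass, (start, end)) for each chord found in text."""
    return [
        (m.group("root"), m.group("type") or "", m.group("bass") or "", m.span())
        for m in CHORD.finditer(text)
    ]

chords = find_chords("F Gm C#maj7/G# F/C Am A7 A7/F# Bmaj13#11")
for chord in chords:
    print(chord)
```

Each result carries the text range of the chord, which is what a transposer would need in order to replace it in place. Ordinary words that happen to start with A to G are not filtered out beyond the lookarounds, so real lyrics would need more care.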
I was hoping to find one Regex to rule them all. I ended up writing code that works, though it seems like I kind of hacked around a bit to get the results I desired.
You can see this code here, written in Swift. It is not complete for my purposes, but it will parse a string, return a list of Chord Results and their text range within the original string. From there you would have to finish the implementation to suit your needs.
There have been a few issues on iOS:
iOS does not handle the number sign (#) well at all. When providing regex patterns or match text, I either had to replace the # with its unicode \u0023, or what ultimately worked was replacing all occurrences of # with another character (such as 'S'), and then converting it back once regex did its thing. So this code I wrote often has to 'sanitize' the pattern or the input text before doing anything.
I couldn't get a regex pattern to perfectly parse a chord structure. It wasn't fully working for a chord with a bass note: it would successfully match such a chord as a whole, but I then had to split the two components, parse them separately, and recombine them.
Regex is really a bit of voodoo, and I think it sucks that for something so confusing to many people, there are also different platform-dependent implementations of it. For example, Wiktor referred me to Regex patterns he wrote to help me solve the problem on www.regex101.com, that would WORK on that website, but these would not work on iOS, and NSRegularExpression would throw an error (often it had something to do with this # character)
My solution pays absolutely no regard to performance. I just wanted it to work.
I am trying to validate fields in my iOS program:
I need to match a phone number, but the field is optional.
I thought of using the same regex that matches the number to also accept the case where there is no phone number:
[0-9\-\+\*]{4,14}
Then I thought how to also match where there is either a valid number or no number at all?
(:?[0-9\-\+\*]{4,14})?
Meaning, either match between 4 to 14 chars within the range 0-9,+,-,* or nothing.
This website is showing infinite matches for that pattern.
Any ideas?
^$|^[0-9\-\+\*]{4,14}$
As to the questions this has raised here:
Regex is a great validation method, and it is cross-platform.
There is no need for another layer of code to implement it. Simple and clean.
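A short sketch of why the unanchored optional group appears to match endlessly, and how the anchored alternation behaves. It is written in Python for brevity; the optional pattern is written with a non-capturing group (?:...), and the helper name is_valid_optional_phone is hypothetical.

```python
import re

# An optional group with no anchors can match the empty string at every
# position, which is why a regex tester reports a flood of matches.
UNANCHORED_OPTIONAL = re.compile(r"(?:[0-9\-+*]{4,14})?")
empty_matches = UNANCHORED_OPTIONAL.findall("abc")
print(empty_matches)  # ['', '', '', '']

# Anchored alternation: either nothing at all, or 4 to 14 allowed characters.
ANCHORED = re.compile(r"^$|^[0-9\-\+\*]{4,14}$")

def is_valid_optional_phone(field: str) -> bool:
    # fullmatch also rules out a stray trailing newline.
    return ANCHORED.fullmatch(field) is not None

print(is_valid_optional_phone(""))          # True
print(is_valid_optional_phone("+972-555"))  # True
print(is_valid_optional_phone("123"))       # False
```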
You should just code it. I don't know your language, but basically:
If(field.isEmpty)
should do the trick.
As the question states, why is it considered best practice to store telephone numbers as strings rather than integers in the telephone_number column?
Not sure I understand the rationale for this. Please help clear this up!
Thanks!
Telephone numbers are strings of digit characters, they are not integers.
Consider for example:
Expressing a telephone number in a different base would render it meaningless
Adding or multiplying two telephone numbers together, or any math operation on a phone number, is meaningless. The result is not another telephone number (except by coincidence).
Telephone numbers are intended to be entered "as-is" into a connected device.
Telephone numbers may have leading zeroes.
Manipulations of telephone numbers, such as adding an area code, are String operations.
Storing the string version of the telephone number makes this clear and unambiguous.
History: On old pulse-encoded dial systems, the code for each digit in a telephone number was sent as the same number of pulses as the digit (or 10 pulses for "0"). That may be why we still use digits to represent the parts of a phone number. See http://en.wikipedia.org/wiki/Pulse_dialing
What Neil Slater said is correct. I would add that there are lots of edge cases where you can't express a telephone number as a number value consistently.
For example, consider these numbers:
011-123-555-1212
+11-123-555-1212
+1 (112) 355-5121 x2
These are all potentially valid phone numbers, but they mean very different things. Yet, in integer form, they are all 111235551212.
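The collapse can be reproduced with a few lines. This is a sketch in Python, and the helper name as_integer is made up for the example.

```python
def as_integer(phone: str) -> int:
    """Keep only the digits and convert them to an integer."""
    digits = "".join(ch for ch in phone if ch.isdigit())
    return int(digits)  # a leading zero disappears here

numbers = ["011-123-555-1212", "+11-123-555-1212", "+1 (112) 355-5121 x2"]
as_ints = [as_integer(n) for n in numbers]
print(as_ints)  # [111235551212, 111235551212, 111235551212]
```

The prefix, the grouping, and the extension are all lost, so the original numbers cannot be recovered from the integer.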
If you are going to store the number for display from input, then you must use a string.
However, while it is true that no meaningful mathematical operations can be performed on a phone number, using a number in hash sets and for indexing is quicker than using a string. So provided you can guarantee or homogenise your set of numbers so that they are all consistent, you may see better performance operating on a number.
For example, in the telco world, rating calls for a given customer involves a lot of searching on their CLI, and in this situation it is faster and cheaper to search by integer. Generally, though, strings will be fine performance-wise; integers only pay off where performance matters and you have many searches to perform across a huge range of numbers, e.g. rating 250 million calls across 2 million lines and 2000 tariffs. In-memory rating also gets expensive, so being able to use a 64-bit int or uint is cheaper when dealing with these volumes.
Consider these phone numbers for example
099-1234-56789 or +91-8907-687665.
In this case, if the phone_number attribute is of type integer, it can't accept these values. It should be a string to hold these types of values, so a string is always preferred over an integer.
There are several reasons for this:
Phone numbers often start with a "0" : an integer will remove all leading "0"s
Phone numbers can have special characters: +, (, -, etc. (for example: +33 (0)6 12 23 34)
You cannot perform operations on phones : adding phones, for instance, would be meaningless
Phone number may be internationalised, i.e. different format for different people, thus not possible with integers
There might be other reasons, but I guess that's already a fair amount of those :)
In developing an iOS app containing a twitter client, I must allow for user generated hashtags (which may be created elsewhere within the app, not just in the tweet body).
I would like to ensure any such hashtags are valid for twitter, so I would like to error check the entered value for invalid characters. Bear in mind that users may be from non-English speaking countries.
I am aware of the usual limitations, such as not beginning a hashtag with a number, and no special punctuation characters, but I was wondering if there is a known list of all additional characters that are technically allowed within hashtags (i.e. international characters).
Karl, as you've rightly pointed out, any word in any language can be a valid twitter hashtag (as long as it meets a number of basic criteria). As such what you are asking for is a list of valid international word characters. I'm sure someone has compiled such a list somewhere, but using it would not be the most efficient approach to reaching what appears to be your initial goal: ensuring that a given hashtag is valid for twitter.
I believe what you are looking for is a regular expression that can match all word characters within a Unicode range. Such an expression would not be dependent on your locale and would match all characters in the modern typography that can appear as part of a word.
You didn't specify what language you are writing your app in, so I can't help you with a language specific implementation. However, the basic approach would be as follows:
Check if any of the bracket expressions or character classes already support Unicode character ranges in your language. If yes, then use them.
Check if there is regex modifier that can enable Unicode character range support for your language.
Most modern languages implement regular expressions in a fairly similar way and a lot of them borrow heavily from Perl, so I hope the following two examples will put you on the right track:
Perl:
Use POSIX bracket expressions (eg: [[:alpha:]], [[:alnum:]], [[:digit:]], etc) as they give you greater control over the characters you want to match, compared to character classes (eg: \w).
Use /u modifier to enable Unicode support when pattern matching. Under this modifier, the ASCII platform effectively becomes a Unicode platform; and hence, for example, \w will match any of the more than 100,000 word characters in Unicode.
See Perl documentation for more info:
http://perldoc.perl.org/perlre.html#Character-set-modifiers
http://perldoc.perl.org/perlrecharclass.html#POSIX-Character-Classes
Ruby:
Use POSIX bracket expressions as they encompass non-ASCII characters. For instance, /\d/ matches only the ASCII decimal digits (0-9); whereas /[[:digit:]]/ matches any character in the Unicode Nd category.
See Ruby documentation for more info:
http://www.ruby-doc.org/core-2.1.1/Regexp.html#class-Regexp-label-Character+Classes
Examples:
Given a list of hashtags, the following regex will match all hashtags that start with a word character (inc. international word characters) followed by at least one other word character, a number or an underscore:
m/^#[[:alpha:]][[:alnum:]_]+$/u # Perl
/^#[[:alpha:]][[:alnum:]_]+$/ # Ruby
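For comparison, here is a rough Python equivalent. Python's re module has no POSIX bracket expressions, so this sketch assumes [^\W\d_] as a stand-in for [[:alpha:]] and \w as a stand-in for [[:alnum:]_]; both are Unicode-aware for str patterns. The helper name is_valid_hashtag and the sample tags are made up, and the rule encoded here is the one from the examples above, not Twitter's official definition.

```python
import re

# "#", then one letter (any script), then at least one letter, digit or underscore.
HASHTAG = re.compile(r"#[^\W\d_]\w+")

def is_valid_hashtag(tag: str) -> bool:
    return HASHTAG.fullmatch(tag) is not None

for tag in ["#café", "#日本語", "#foo_bar1", "#1abc", "#foo[bar", "#a"]:
    print(tag, is_valid_hashtag(tag))
```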
Twitter allows letters, numbers, and underscores.
I checked this by generating tweets via their API. For example, tweeting
Hash tag test #foo[bar
resulted in "#foo" being marked as a hash tag, and "[bar" being unformatted text.
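The observed cut-off can be imitated with a word-character pattern. This is only an approximation in Python of the behaviour described above, not Twitter's actual tokenizer.

```python
import re

tweet = "Hash tag test #foo[bar"

# "#" followed by letters, digits or underscores; anything else ends the tag.
hashtags = re.findall(r"#\w+", tweet)
print(hashtags)  # ['#foo']
```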
Well, for starters you can't use a # in the hashtag (##hash).
The guidelines below are being quoted from Twitter's help center:
People use the hashtag symbol # before a relevant keyword or phrase (no spaces) in their Tweet to categorize those Tweets and help them show more easily in Twitter Search.
Clicking on a hashtagged word in any message shows you all other Tweets marked with that keyword.
Hashtags can occur anywhere in the Tweet – at the beginning, middle, or end.
Hashtagged words that become very popular are often Trending Topics.
Example: In the Tweet below, @eddie included the hashtag #FF. Users created this as shorthand for "Follow Friday," a weekly tradition where users recommend people that others should follow on Twitter. You'll see this on Fridays.
Using hashtags correctly:
If you Tweet with a hashtag on a public account, anyone who does a search for that hashtag may find your Tweet
Don't #spam #with #hashtags. Don't over-tag a single Tweet. (Best practices recommend using no more than 2 hashtags per Tweet.)
Use hashtags only on Tweets relevant to the topic.
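Just want to add that, in addition to alphanumeric characters and underscores, you can apparently use a dash-like character in a Twitter hashtag, as in #COVIDー19. (The character in that example is U+30FC, the Katakana-Hiragana prolonged sound mark, which looks like an em dash.)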
Only letters and numbers are allowed to be part of a hashtag. If a character other than these follows the leading # and a letter or number, the hashtag will be cut off at this point.
I would recommend that your user interface indicate this to the user by changing the text color of the input field if the user enters anything other than a letter or number.
I had the same issue to implement in golang.
It seems that [[:alpha:]] only allows letters of the English alphabet there, so this syntax cannot be used for characters of other languages.
Instead, I could use \p{L} for this purpose.
My test with \p{L} is here.
* Arabic, Hebrew, Hindi, etc. have not been confirmed yet.
For North American phone numbers, (999) 999-9999 works pretty well for an input mask.
However, I can't find a good example that will handle non-North American numbers. I know that the number of digits can vary, so other than restricting it to digits only, is there a good example anywhere?
There is no generic mask, really: There are too many combinations.
The only thing that is fixed is the international country code, usually prefixed by +.
According to the Wikipedia Article on telephone numbering plans, most countries conform with the E.164 numbering plan.
If I read E.164 correctly, you can safely make the following assumptions:
Country code: 1-3 digits
Network / Area code and Number: up to 14 digits (E.164 caps the full number, country code included, at 15 digits)
I would ask for the country code, and have the "area code + number" field as a 14-digit input.
You can deduce the country code with a simple RegEx such as:
^(?:(?:0(?:0|11)\s?)|\+)([17]|2([07]|[1-689]\d)|3([0-469]|[578]\d)|4([013-9]|2\d)|5([1-8]|[09]\d)|6([0-6]|[789]\d)|8([12469]|[03578]\d)|9([0-58]|[679]\d))
Followed by
(([\s\(\).-]{0,2}\d){4,13})$
to extract the national number.
For validating the national number length and validity, you'd need libphonenumber or similar.
The long RegEx above allows +, 00 or 011 before the country code and a selection of punctuation in the number which will also have to be stripped.
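A sketch of how the two expressions could be combined and the punctuation stripped, written in Python. The literal + is escaped as \+ so that the pattern compiles, the inner groups are made non-capturing, and named groups are added; the function name split_international and the sample numbers are made up for illustration. It does not validate the national number, which still needs libphonenumber or similar.

```python
import re

# Dialling prefix (+, 00 or 011) followed by the country code.
PREFIX_AND_COUNTRY = (
    r"^(?:(?:0(?:0|11)\s?)|\+)"
    r"(?P<cc>[17]|2(?:[07]|[1-689]\d)|3(?:[0-469]|[578]\d)|4(?:[013-9]|2\d)"
    r"|5(?:[1-8]|[09]\d)|6(?:[0-6]|[789]\d)|8(?:[12469]|[03578]\d)|9(?:[0-58]|[679]\d))"
)
# 4 to 13 digits, each optionally preceded by up to two punctuation characters.
NATIONAL = r"(?P<national>(?:[\s\(\).-]{0,2}\d){4,13})$"
PHONE = re.compile(PREFIX_AND_COUNTRY + NATIONAL)

def split_international(raw: str):
    """Return (country_code, national_digits), or None if raw does not match."""
    m = PHONE.match(raw.strip())
    if m is None:
        return None
    national_digits = re.sub(r"\D", "", m.group("national"))
    return m.group("cc"), national_digits

for sample in ["+49 30 1234567", "+1 212 555 5555", "0044 20 7946 0958", "2125555555"]:
    print(sample, "->", split_international(sample))
```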
You don't mention your application but this is certainly possible using regular expressions. You might want to take a look here.
Not easily. Take a look at this page for an example why: if you only look at the German phone numbers, you'll note that there are different formats depending on where you're calling the number from. Which one do you pick? And that's just for German phone numbers; they differ from continent to continent, and from country to country.
Going with "numbers-only" is probably your safest bet.
I would allow for spaces, dashes, slashes and all that, but actually only care for numbers and the optional leading + sign. Everything else, such as assuming certain blocks of a certain length is just asking for trouble.
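That approach amounts to a small normalisation step, sketched here in Python. The function name normalize_phone and the sample inputs are hypothetical.

```python
import re

def normalize_phone(raw: str) -> str:
    """Keep an optional leading "+" and the digits; drop everything else."""
    stripped = raw.strip()
    plus = "+" if stripped.startswith("+") else ""
    digits = re.sub(r"\D", "", stripped)
    return plus + digits

print(normalize_phone("+49 (0)30 / 123-456"))  # +49030123456
print(normalize_phone("212 555 5555"))         # 2125555555
```

No assumption is made about block lengths or separators, so the user can type the number in whatever local format they are used to.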
Maybe it is bad to answer an old question, but libphonenumber seems like a good solution to your question.