I want to extract valid (on the basis of format) mobile numbers from a text.
e.g. Input: some text (987) 456 7890, (987)-456-7890 again some text
Output: 9874567890 9874567890
The problem is that there are many valid mobile number formats around the world, for example:
text = "Denmark 11 11 11 11, 1111 1111 "
// + "Germany 03333 123456, +49 (3333) 123456 "
// + "Netherlands + 31 44 12345678 Russia +7(555)123-123 "
// + "spain 12-123-12-12 switzerland +41 11 222 22 22 "
// + "Uk (01222) 333333 India +91-12345-12345 "
// + "Australia (04) 1231 1231 USA (011) 154-123-4567 "
// + "China 1234 5678 France 01-23-45-67-89 "
// + "Poland (12) 345 67 89 Singapore 123 4567 "
// + "Thailand (01) 234-5678, (012) 34-5678 "
// + "United Kingdom 0123 456 7890, 01234 567890 "
// + "United States (987) 456 7890, (987)-456-7890+ etc."
How can I cover all mobile formats?
What are the minimum and maximum lengths of mobile numbers (with or without country code)?
How can I recognize whether a mobile number has a country code or not?
You might want to check if this fits your needs: A comprehensive regex for phone number validation
From experience I know how this works in my phone OS. It looks for long enough sequences of digits, separated by a set of allowed characters.
In principle something like:
[\+]?([0-9]|[\(\). -]){min,max}
This regex is suboptimal since it also matches long sequences of separator characters that contain no digits at all. You will probably need to filter those results out as well.
It is a very simple method with some false positives, but in my opinion false positives are better than misses.
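The filtering step mentioned above could look roughly like the following Ruby sketch. The digit bounds are assumptions (7 is an arbitrary lower limit, 15 is the maximum length E.164 allows), and `phone_candidates` is a hypothetical helper name, not part of any library.

```ruby
# Assumed bounds: 7 is an arbitrary lower limit; 15 is the E.164 maximum.
MIN_DIGITS = 7
MAX_DIGITS = 15

# An optional "+" followed by a run of digits and allowed separator characters.
CANDIDATE = /\+?[0-9\(\).\- ]{7,25}/

# Returns the digits of every run that contains a plausible number of digits.
# Runs made up only of separators end up with zero digits and are dropped.
def phone_candidates(text)
  text.scan(CANDIDATE)
      .map { |run| run.delete("^0-9") }
      .select { |digits| digits.length.between?(MIN_DIGITS, MAX_DIGITS) }
end

p phone_candidates("some text (987) 456 7890, (987)-456-7890 again some text")
# => ["9874567890", "9874567890"]
```

This still produces false positives (any long enough digit run qualifies), which matches the trade-off described above.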
You shouldn't use the list of samples you got as a guide to actual mobile phone numbers.
For example the number sequence shown for the Netherlands is incorrect, in that it doesn't cover just mobile numbers but ALL regular phone numbers (it doesn't cover such things as 0800 and 0900 numbers for which different rules apply) and is missing an element even for that.
I can only assume the list is similarly incorrect for other countries (and of course it's far from complete in that it doesn't cover all countries, but maybe you posted only a fragment).
To parse a phone number you'd have to first remove all white space and other formatting characters from what could be a phone number, then check whether it has the correct length to be one, then try to deduce whether it includes a country code or not.
If it includes a country code but doesn't start with either 00 or + (both are used to indicate an international number) it might not be a phone number after all.
Does it include an area code? If so, is the area code one associated with mobile phones? (For example, in the Netherlands all mobile phone numbers have area code 06, BUT in the past this wasn't always the case, so in an old document a number with an 06 area code may not be a mobile number anyway.)
After you've deduced that (and AFAIK mobile numbers always include an area code) you have to check if the remaining digits make up something that could be an actual phone number without area code, based on the length of the number (hint: area code + number together have to be 10 digits long here, and I think everywhere).
And all that while taking into consideration that the rules may well be different for different countries or even different networks within some countries.
And of course if you find a number that looks like a valid phone number it still may not be.
It could be some other number that just looks like a phone number but isn't.
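The first steps described above (strip formatting, check the length, look for an international prefix) might be sketched in Ruby as follows. The 7 to 15 digit range and the name `classify_candidate` are assumptions for illustration; the per-country area code and length checks are deliberately left out because they differ by country.

```ruby
# Hypothetical helper: returns nil when the input cannot be a phone number,
# otherwise a hash with the bare digits and whether an international prefix
# ("+" or "00") was present.
def classify_candidate(raw)
  stripped = raw.strip
  digits = stripped.delete("^0-9")     # drop spaces, brackets, hyphens, "+"

  if stripped.start_with?("+")
    international = true
  elsif digits.start_with?("00")
    international = true
    digits = digits[2..-1]             # remove the "00" prefix itself
  else
    international = false
  end

  # Assumed plausible length range; real limits depend on the country.
  return nil unless digits.length.between?(7, 15)

  { digits: digits, international: international }
end

p classify_candidate("+49 (3333) 123456")
p classify_candidate("(987) 456 7890")
p classify_candidate("12-34")
```

A result from this sketch only says the string could be a phone number; as noted above, it may still be some other number that merely looks like one.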
Simply searching for every matching string format is not the right way here. The better way is to use regular expressions to find all phone number matches, but BlackBerry Java doesn't have built-in support for regular expressions.
But you can use a third-party library for J2ME that implements regex processing, something like this.
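// Regex - check for valid Singapore mobile numbers.
// Requires: import java.util.regex.Pattern;
// Accepts an optional prefix (0, 65, +65, (65) or (+65), optionally followed by
// a space or hyphen) and then 8 digits starting with 8 or 9.
private static final Pattern SINGAPORE_MOBILE = Pattern.compile("^(((0|((\\+)?65([- ])?))|((\\((\\+)?65\\)([- ])?)))?[8-9]\\d{7})?$");

public static boolean isSingaporeMobileNo(String str) {
    // Note: the outermost group is optional, so an empty string also matches.
    return SINGAPORE_MOBILE.matcher(str).find();
}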
An important question came up when I tried to translate an existing iOS application into Lithuanian. I know how the Apple translation system works, especially for languages like English or Hungarian. But I don't know how to translate Lithuanian nouns in combination with numerals.
The Lithuanian grammar in conjunction with numerals works like this for the word "įvykis" (event):
Lithuanian            English
0 įvykių              0 events
1 įvykis              1 event
2 - 9 įvykiai         2 - 9 events
10 - 20 įvykių        10 - 20 events
21 įvykis             21 events
22 - 29 įvykiai       22 - 29 events
30 įvykių             30 events
(the same logic continues from 21 onward)
More information about Lithuanian noun declension by numerals can be found in this Wikipedia article.
My question is, what key values have to be filled into the "Localizable.stringsdict" for Lithuanian? For English this file looks like this:
and for Lithuanian the same file looks like this:
The entries in the last table are only partly correct. Does anyone know which keys I have to use in order to map my table into the stringsdict table? Which keys/keywords are necessary?
In the stringsdict file you can only have the keys zero, one, two, few, many, and other. That is all you actually need. iOS has its own data (based on information from the Unicode standard) that tells it which of those keys to use based on the actual number.
This is covered in the (now archived) Internationalization and Localization Guide, specifically the Handling Noun Plurals and Units Of Measure chapter with specifics about the stringsdict file in Appendix C.
You may also find language-specific rules from Unicode. Scroll down to Lithuanian and you will see the built-in rules on how each category is used with a given number.
In short, you want the following for your "events" in Lithuanian:
one - %d įvykis
few - %d įvykiai
other - %d įvykių
iOS will know to use one for 1, 21, 31, 41, etc. It will know to use few for 2~9, 22~29, etc. It will know to use other for 0, 10~20, 30, etc.
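To make the mapping concrete, here is a small Ruby sketch of how the Unicode plural rules for Lithuanian assign whole numbers to categories. It is only an illustration of the rule (iOS applies it internally, you never write this yourself), it ignores the `many` category that applies to fractional values, and `lithuanian_plural_category` is a hypothetical name.

```ruby
# Sketch of the Lithuanian plural category for non-negative integers.
# Fractional numbers (category "many") are intentionally not handled.
def lithuanian_plural_category(n)
  last_digit = n % 10
  teen = (11..19).cover?(n % 100)   # 11..19, 111..119, ... never use one/few

  if last_digit == 1 && !teen
    :one      # 1, 21, 31, ...      -> "%d įvykis"
  elsif (2..9).cover?(last_digit) && !teen
    :few      # 2..9, 22..29, ...   -> "%d įvykiai"
  else
    :other    # 0, 10..20, 30, ...  -> "%d įvykių"
  end
end

[0, 1, 2, 9, 10, 11, 20, 21, 22, 30].each do |n|
  puts "#{n}: #{lithuanian_plural_category(n)}"
end
```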
In my country (Kazakhstan) the phone number +7 (123) 456 7890 is equal to 8 (123) 456 7890.
So some people save numbers with +7 and others with 8. I need to compare and verify these two numbers. That is easy, but I wonder whether there are other countries with rules like this and how to check them?
P.S. It is for an iOS app in Swift.
I read through similar stackoverflow questions to understand financial track card data.
I think the issue I am facing might be slightly different or maybe I am really weak in regex.
Now we have a service that returns track data accidentally instead of the guest name.
My goal is that every time I receive track data I display "" (an empty string), and otherwise return the guest name. (This is a temporary solution until we fix the root cause.)
This is my regular expression, but it looks like it doesn't detect track data.
irb(main):043:0> guestname="%4234242xx12^TEST/GUEST L ^324532635645744646462"
irb(main):044:0> (/[(%[bB])(;)]\d{3,}.{9,}[(^.+^)(=)].+\?.{,2}/.match(guestname)) ? "" : guestname
=> "%4234242xx12^TEST/GUEST L ^324532635645744646462"
(Not real data)
Now, looking at the wiki for track data information I want to cover most cases, if not all:
https://en.wikipedia.org/wiki/Magnetic_stripe_card#Financial_cards
Could someone help with my regex? This is what I have:
/[(%[bB])(;)]\d{3,}.{9,}[(^.+^)(=)].+\?.{,2}/
Track 1, Format B:
Start sentinel — one character (generally '%')
Format code="B" — one character (alpha only)
Primary account number (PAN) — up to 19 characters. Usually, but not
always, matches the credit card number printed on the front of the
card.
Field Separator — one character (generally '^')
Name — 2 to 26 characters
Field Separator — one character (generally '^')
Expiration date — four characters in the form YYMM.
Service code — three characters
Discretionary data — may include Pin Verification Key Indicator (PVKI,
1 character), PIN Verification Value (PVV, 4 characters), Card
Verification Value or Card Verification Code (CVV or CVC, 3
characters)
End sentinel — one character (generally '?')
Longitudinal redundancy check (LRC) — it is one character and a
validity character calculated from other data on the track.
Track 2: This format was developed by the banking industry (ABA). This
track is written with a 5-bit scheme (4 data bits + 1 parity), which
allows for sixteen possible characters, which are the numbers 0-9,
plus the six characters : ; < = > ? . The selection of six
punctuation symbols may seem odd, but in fact the sixteen codes simply
map to the ASCII range 0x30 through 0x3f, which defines ten digit
characters plus those six symbols. The data format is as follows:
Start sentinel — one character (generally ';')
Primary account number (PAN) — up to 19 characters. Usually, but not
always, matches the credit card number printed on the front of the
card.
Separator — one char (generally '=')
Expiration date — four characters in the form YYMM.
Service code — three digits. The first digit specifies the interchange
rules, the second specifies authorisation processing and the third
specifies the range of services
Discretionary data — as in track one
End sentinel — one character (generally '?')
Longitudinal redundancy check (LRC) — it is one character and a
validity character calculated from other data on the track. Most
reader devices do not return this value when the card is swiped to the
presentation layer, and use it only to verify the input internally to
the reader.
Your example input string does not contain a format code after the first sentinel.
You are trying to parse an HTML-encoded version, which is weird.
So, I would start with html decoding. E.g. with Nokogiri:
▶ guestname="%4234242xx12^TEST/GUEST L ^324532635645744646462"
#⇒ "%4234242xx12^TEST/GUEST L ^324532635645744646462"
▶ parsed = Nokogiri::HTML.parse(guestname).text
#⇒ "%4234242xx12^TEST/GUEST L ^324532635645744646462"
OK, now we at least have a leading percent. Now let us ask ourselves: how many users have a guest name starting with a percent sign? I bet none. You might re-check yourself by running a query against your database. Since it is a temporary solution, I would definitely shut the perfectionism up and go with:
▶ parsed =~ /\A%/ ? '' : parsed
Hope it helps.
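If a bare leading-percent check turns out to be too coarse, a slightly stricter variant based on the field layout quoted in the question could look like the Ruby sketch below. The patterns are assumptions, deliberately loose (the format code is optional and the PAN may contain masking characters, as in the example), and `guest_name_or_blank` is a hypothetical helper name.

```ruby
# Track 1: "%", optional format letter, up to 19 PAN characters, "^",
# a 2 to 26 character name, "^". Loose on purpose: masked PANs are allowed.
TRACK1 = /\A%[A-Za-z]?[^\^]{1,19}\^[^\^]{2,26}\^/

# Track 2: ";", up to 19 PAN digits, "=", then the 4-digit expiration date.
TRACK2 = /\A;\d{1,19}=\d{4}/

# Returns "" when the value looks like track data, otherwise the value itself.
def guest_name_or_blank(value)
  value.match?(TRACK1) || value.match?(TRACK2) ? "" : value
end

p guest_name_or_blank("%4234242xx12^TEST/GUEST L ^324532635645744646462")
p guest_name_or_blank("TEST/GUEST L")
```

`String#match?` needs Ruby 2.4 or later; on older versions `=~` does the same job here.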
Trying to work out how to parse out phone numbers that are left in a string.
e.g.
"Hi Han, this is Chewie, Could you give me a call on 02031234567"
"Hi Han, this is Chewie, Could you give me a call on +442031234567"
"Hi Han, this is Chewie, Could you give me a call on +44 (0) 203 123 4567"
"Hi Han, this is Chewie, Could you give me a call on 0207-123-4567"
"Hi Han, this is Chewie, Could you give me a call on 02031234567 OR +44207-1234567"
And be able to consistently replace any one of them with some other item (e.g. some text, or a link).
Am assuming it's a regex type approach (I'm already doing something similar with email which works well).
I've got to
text.scan(/([^A-Z|^"]{6,})/i)
Which leaves me a leading space I can't work out how to drop (would appreciate the help there).
Is there a standard way of doing this that people use?
It also drops things into arrays, which isn't particularly helpful
i.e. if there were multiple numbers.
[["02031234567"], ["+44207-1234567"]]
as opposed to
["02031234567","+44207-1234567"]
Adding in the third use-case with spaces is difficult. I think the only way to successfully meet that acceptance criteria would be to chain a #gsub call on to your #scan.
Thus:
text.gsub(/\s+/, "").scan(/([^A-Z|^"|^\s]{6,})/i)
The following code will extract all the numbers for you:
text.scan(/(?<=[ ])[\d \-+()]+$|(?<=[ ])[\d \-+()]+(?=[ ]\w)/)
For the examples you supplied this results in:
["02031234567"]
["+442031234567"]
["+44 (0) 203 123 4567"]
["0207-123-4567"]
["02031234567", "+44207-1234567"]
To understand this regex, what we are matching is:
[\d \-+()]+ which is a sequence of one or more digits, spaces, minus, plus, opening or closing brackets (in any order - NB regex is greedy by default, so it will match as many of these characters next to each other as possible)
that must be preceded by a space (?<=[ ]) - NB the space in the positive look-behind is not captured, and therefore this makes sure that there are no leading spaces in the results
and is either at the end of the string $, or | is followed by a space then a word character (?=[ ]\w) (NB this lookahead is not captured)
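Since the original goal was to replace each number with some other item, the same pattern can be passed to `gsub` instead of `scan`. A minimal sketch, where `mask_numbers` and the `"[number]"` placeholder are made-up names for illustration:

```ruby
# Same pattern as above: a run of digits, spaces, "-", "+" and brackets that
# follows a space and either ends the line or is followed by " " plus a word.
PHONE_RUN = /(?<=[ ])[\d \-+()]+$|(?<=[ ])[\d \-+()]+(?=[ ]\w)/

# Replaces every matched run with the given replacement text.
def mask_numbers(text, replacement = "[number]")
  text.gsub(PHONE_RUN, replacement)
end

puts mask_numbers("Hi Han, this is Chewie, Could you give me a call on 02031234567 OR +44207-1234567")
# prints: Hi Han, this is Chewie, Could you give me a call on [number] OR [number]
```

Because the pattern has no capture groups, `scan` with it returns a flat array of strings rather than an array of arrays.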
This pattern will get rid of the space but not match your third case with spaces:
/([^A-Z|^"|^\s]{6,})/i
This is what I came to in the end, in case it helps somebody:
numbers = text.scan(/([^A-Z|^"]{6,})/i).collect{|x| x[0].strip }
That gives me an array of
["+442031234567", "02031234567"]
I'm sure there is a more elegant way of doing this and possibly you'd want to check the numbers for likelihood of being phonelike - e.g. using the brilliant Phony gem.
numbers = text.scan(/([^A-Z|^"]{6,})/i).collect{|x| x[0].strip }
real_numbers = numbers.keep_if{|n| Phony.plausible? PhonyRails.normalize_number(n, default_country_code: "GB")}
Which should help exclude serial numbers or the like from being identified as numbers. You'll obviously want to change the country code to something relevant for you.
I'm using the iOS ABPeoplePickerNavigationController to allow a user to select a phone number, but the number I get back is formatted like this:
+44 (0) 20 3162 0001
I can strip out the spaces and the parentheses, but the number that remains isn't really a valid phone number.
Does iOS offer any way to force ABPeoplePicker to return a valid, canonical phone number i.e.
+442031620001
or will I be forced to apply a regex or something to it?
You will have to apply a regex, but it should simply strip everything except digits and an optional + at the beginning.
STILL there is no guarantee that'll get you a valid phone number!
e.g. In the address book I could write +44 353 1232 (-0 / -1)
to name two alternates.
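The stripping step could be sketched as follows. It is written in Ruby purely as an illustration; the same steps translate directly to Swift or Objective-C. Dropping a "(0)" trunk prefix from international numbers is an extra assumption that the advice above does not mention, and `canonicalize` is a hypothetical helper name.

```ruby
# Keeps a leading "+" (if present) and all digits, drops everything else.
def canonicalize(raw)
  s = raw.strip
  international = s.start_with?("+")
  # Assumption: "(0)" in an international number is a trunk prefix to drop.
  s = s.sub(/\(0\)/, "") if international
  (international ? "+" : "") + s.delete("^0-9")
end

p canonicalize("+44 (0) 20 3162 0001")    # => "+442031620001"
p canonicalize("+44 353 1232 (-0 / -1)")  # => "+44353123201"
```

The second call shows the caveat from above: free-form address book entries still come out as something that is not a valid number.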