I need a well tested Regular Expression (.net style preferred), or some other simple bit of code that will parse a USA/CA phone number into component parts, so:
3035551234122
1-303-555-1234x122
(303)555-1234-122
1 (303) 555 -1234-122
etc...
all parse into:
AreaCode: 303
Exchange: 555
Suffix: 1234
Extension: 122
None of the answers given so far was robust enough for me, so I continued looking for something better, and I found it:
Google's library for dealing with phone numbers
I hope it is also useful for you.
This is the one I use:
^(?:(?:[\+]?(?<CountryCode>[\d]{1,3}(?:[ ]+|[\-.])))?[(]?(?<AreaCode>[\d]{3})[\-/)]?(?:[ ]+)?)?(?<Number>[a-zA-Z2-9][a-zA-Z0-9 \-.]{6,})(?:(?:[ ]+|[xX]|(i:ext[\.]?)){1,2}(?<Ext>[\d]{1,5}))?$
I got it from RegexLib I believe.
This regex works exactly as you want with your examples:
Regex regexObj = new Regex(#"\(?(?<AreaCode>[0-9]{3})\)?[-. ]?(?<Exchange>[0-9]{3})[-. ]*?(?<Suffix>[0-9]{4})[-. x]?(?<Extension>[0-9]{3})");
Match matchResult = regexObj.Match("1 (303) 555 -1234-122");
// Now you have the results in groups
matchResult.Groups["AreaCode"];
matchResult.Groups["Exchange"];
matchResult.Groups["Suffix"];
matchResult.Groups["Extension"];
Strip out anything that's not a digit first. Then all your examples reduce to:
/^1?(\d{3})(\d{3})(\d{4})(\d*)$/
To support all country codes is a little more complicated, but the same general rule applies.
Here is a well-written library used with GeoIP for instance:
http://highway.to/geoip/numberparser.inc
here's a method easier on the eyes provided by the Z Directory (vettrasoft.com),
geared towards American phone numbers:
string_o s2, s1 = "888/872.7676";
z_fix_phone_number (s1, s2);
cout << s2.print(); // prints "+1 (888) 872-7676"
phone_number_o pho = s2;
pho.store_save();
the last line stores the number to database table "phone_number".
column values: country_code = "1", area_code = "888", exchange = "872",
etc.
Related
I have a text file that I'd like to parse with records like this:
===================
name: John Doe
Education: High School Diploma
Education: Bachelor's Degree
Education: Sun Java Certified Programmer
Age: 29
===================
name: Bob Bear
Education: High School Diploma
Age: 18
===================
name: Jane Doe
Education: High School Diploma
Education: Bachelor's Degree
Education: Master's Degree
Education: AWS Certified Solution Architect Professional
Age: 25
As you can see, the fields in such a text file are fixed, but some of them repeat an arbitrary number of times. The records are separated by a fixed length ==== delimiter.
How would I write parsing logic this this sort of problem? I am think of using switch as it reads the start of the line, but the logic to handle multiple repeating fields baffles me.
A good way to approach this sort of problem is to "divide and conquer". That is, divide the overall problem into smaller sub-problems which are easier to manage and then solve each them individually. If you've planned properly then when you've finished each of the sub-problems you should have solved the whole problem.
Start by thinking about modeling. The document appears to contain a list of records, what should those records be called? What named fields should the records contain and what types should they have? How would you represent them idiomatically in go? For example, you might decide to call each record a Person with fields as such:
type Person struct {
Name string
Credentials []string
Age int
}
Next, think about what the interface (signature) of your parse function should look like. Should it emit an array of people? Should it use a visitor pattern and emit a person as soon as it's parsed? What constraints should drive the answer? Are memory or compute time constraints important? Does the user of the parser want any control over the parsing work such as canceling? Do they need metadata such as the total number of records contained in the document? Will the input always be from a file or a string, maybe from an HTTP request or a network socket? How will these choices drive your design?
func ParsePeople(string) ([]Person, error) // ?
func ParsePeople(io.Reader) ([]Person, error) // ?
func ParsePeople(io.Reader, func visitor(Person) bool) error // ?
Finally you can implement your parser to fulfill the interface that you've decided on. A straightforward approach here would be to read the input file line-by-line and take an action according to the contents of the line. For example (in pseudocode):
forEach line = inputFile.line
if line is a separator
emit or store the last parsed person, if present
create a new person to store parsed fields
else if line is a data field
parse the data
update the person with the parsed data
end
end
return the parsed records or final record, if emitting
Each line of pseudocode above represents a sub-problem that should be easier to solve than the whole.
Edit: Add explanation of why I just post a program as answer.
I am presenting a very straight forward implementation to parse the text you have given in your question. You accepted maerics answer and that is OK. I want to add some counter arguments to his answer, though. Basically the pseude-code in that answer is a non-compilable version of the code in my answer so we agree on the solution to this.
What I do not agree with is the over-engineering talk. I have to deal with code written by over-thinkers everyday. I urge you NOT to think about patterns, memory and time constraints or who might want what from this in the future.
Visitor pattern? That is something that is pretty much only useful in parsing programming languages, do not try to construct a use-case for it out of this problem. The visitor pattern is for traversing trees with different types of things in it. Here we have a list, not a tree, of things that are all the same.
Memory and time constraints? Are you parsing 5 GB of text with this? Then this might be a real concern. But even if you do, always write the simplest thing first. It will suffice. Throughout my career I only ever needed to use something other then a simple array or apply a complicated algorithm at most once per year. Still I see code everywhere that uses complicated data structures and algorithms without reason. This complicates change, is errorprone, sometimes makes things slower eventually! Do not use an observable list abstraction that notifies all observers whenever its contents change - but wait, let's add an update lock and unlock so we can control when to NOT notify everybody... No! Do not go down that route. Use a slice. Do your logic. Make everything read easy from top to bottom. I do not want to jump from A to B to C, chasing interfaces, following getters to finally find not a concrete data type but yet another interface. That is not the way to go.
These are the reasons why my code does not export anything, it is a self-contained, runnable example, a concrete solution to your concrete problem. You can read it, it is easy to follow. It is not heavily commented because it does not need to be. The three comments are not stating what happens but why it happens. Everything else is evident from the code itself. I left the note about the potential error in there on purpose. You know what kind of data you have, there is no line in there where this bug would be triggered. Do not write code to handle what cannot happen. If in the future someone would add a line without a text after the colon (remember, nobody will ever do this, do not worry about it), this will trigger a panic, point you to this line, you add another if or something, you are done. This code is future proof more then a program that tries to handle all kinds of different non-existent variations of the input.
The main point that I want to stretch is: write only what is necessary to solve the problem at hand. Everything beyond that makes your program hard to read and change, it will be untested and unnecessary.
With that said, here is my original answer:
https://play.golang.org/p/T6c51jSM5nr
package main
import (
"fmt"
"strconv"
"strings"
)
func main() {
type item struct {
name string
educations []string
age int
}
var items []item
var current item
finishItem := func() {
if current.name != "" { // handle the first ever separator
items = append(items, current)
}
current = item{}
}
lines := strings.Split(code, "\n")
for _, line := range lines {
if line == separator {
finishItem()
} else {
colon := strings.Index(line, ":")
if colon != -1 {
id := line[:colon]
value := line[colon+2:] // note potential bug if text has nothing after ':'
switch id {
case "name":
current.name = value
case "Education":
current.educations = append(current.educations, value)
case "Age":
age, err := strconv.Atoi(value)
if err == nil {
current.age = age
}
}
}
}
}
finishItem() // in case there was no separator at the end
for _, item := range items {
fmt.Printf("%s, %d years old, has educations:\n", item.name, item.age)
for _, e := range item.educations {
fmt.Printf("\t%s\n", e)
}
}
}
const separator = "==================="
const code = `===================
name: John Doe
Education: High School Diploma
Education: Bachelor's Degree
Education: Sun Java Certified Programmer
Age: 29
===================
name: Bob Bear
Education: High School Diploma
Age: 18
===================
name: Jane Doe
Education: High School Diploma
Education: Bachelor's Degree
Education: Master's Degree
Education: AWS Certified Solution Architect Professional
Age: 25`
Easy question here, probably, but searching did not find a similar question.
The # operator finds the length of a string, among other things, great. But with Lua being dynamically typed, thus no conversion operators, how does one type a number as a string in order to determine its length?
For example suppose I want to print the factorials from 1 to 9 in a formatted table.
i,F = 1,1
while i<10 do
print(i.."! == "..string.rep("0",10-#F)..F)
i=i+1
F=F*i
end
error: attempt to get length of global 'F' (a number value)
why not use tostring(F) to convert F to a string?
Alternatively,
length = math.floor(math.log10(number)+1)
Careful though, this will only work where n > 0!
There are probably a dozen ways to do this. The easy way is to use tostring as Dan mentions. You could also concatenate an empty string, e.g. F_str=""..F to get F_str as a string representation. But since you are trying to output a formatted string, use the string.format method to do all the hard work for you:
i,F = 1,1
while i<10 do
print(string.format("%01d! == %010d", i, F))
i=i+1
F=F*i
end
Isn't while tostring(F).len < 10 do useful?
For example, if I want to read the middle value from magic(5), I can do so like this:
M = magic(5);
value = M(3,3);
to get value == 13. I'd like to be able to do something like one of these:
value = magic(5)(3,3);
value = (magic(5))(3,3);
to dispense with the intermediate variable. However, MATLAB complains about Unbalanced or unexpected parenthesis or bracket on the first parenthesis before the 3.
Is it possible to read values from an array/matrix without first assigning it to a variable?
It actually is possible to do what you want, but you have to use the functional form of the indexing operator. When you perform an indexing operation using (), you are actually making a call to the subsref function. So, even though you can't do this:
value = magic(5)(3, 3);
You can do this:
value = subsref(magic(5), struct('type', '()', 'subs', {{3, 3}}));
Ugly, but possible. ;)
In general, you just have to change the indexing step to a function call so you don't have two sets of parentheses immediately following one another. Another way to do this would be to define your own anonymous function to do the subscripted indexing. For example:
subindex = #(A, r, c) A(r, c); % An anonymous function for 2-D indexing
value = subindex(magic(5), 3, 3); % Use the function to index the matrix
However, when all is said and done the temporary local variable solution is much more readable, and definitely what I would suggest.
There was just good blog post on Loren on the Art of Matlab a couple days ago with a couple gems that might help. In particular, using helper functions like:
paren = #(x, varargin) x(varargin{:});
curly = #(x, varargin) x{varargin{:}};
where paren() can be used like
paren(magic(5), 3, 3);
would return
ans = 16
I would also surmise that this will be faster than gnovice's answer, but I haven't checked (Use the profiler!!!). That being said, you also have to include these function definitions somewhere. I personally have made them independent functions in my path, because they are super useful.
These functions and others are now available in the Functional Programming Constructs add-on which is available through the MATLAB Add-On Explorer or on the File Exchange.
How do you feel about using undocumented features:
>> builtin('_paren', magic(5), 3, 3) %# M(3,3)
ans =
13
or for cell arrays:
>> builtin('_brace', num2cell(magic(5)), 3, 3) %# C{3,3}
ans =
13
Just like magic :)
UPDATE:
Bad news, the above hack doesn't work anymore in R2015b! That's fine, it was undocumented functionality and we cannot rely on it as a supported feature :)
For those wondering where to find this type of thing, look in the folder fullfile(matlabroot,'bin','registry'). There's a bunch of XML files there that list all kinds of goodies. Be warned that calling some of these functions directly can easily crash your MATLAB session.
At least in MATLAB 2013a you can use getfield like:
a=rand(5);
getfield(a,{1,2}) % etc
to get the element at (1,2)
unfortunately syntax like magic(5)(3,3) is not supported by matlab. you need to use temporary intermediate variables. you can free up the memory after use, e.g.
tmp = magic(3);
myVar = tmp(3,3);
clear tmp
Note that if you compare running times with the standard way (asign the result and then access entries), they are exactly the same.
subs=#(M,i,j) M(i,j);
>> for nit=1:10;tic;subs(magic(100),1:10,1:10);tlap(nit)=toc;end;mean(tlap)
ans =
0.0103
>> for nit=1:10,tic;M=magic(100); M(1:10,1:10);tlap(nit)=toc;end;mean(tlap)
ans =
0.0101
To my opinion, the bottom line is : MATLAB does not have pointers, you have to live with it.
It could be more simple if you make a new function:
function [ element ] = getElem( matrix, index1, index2 )
element = matrix(index1, index2);
end
and then use it:
value = getElem(magic(5), 3, 3);
Your initial notation is the most concise way to do this:
M = magic(5); %create
value = M(3,3); % extract useful data
clear M; %free memory
If you are doing this in a loop you can just reassign M every time and ignore the clear statement as well.
To complement Amro's answer, you can use feval instead of builtin. There is no difference, really, unless you try to overload the operator function:
BUILTIN(...) is the same as FEVAL(...) except that it will call the
original built-in version of the function even if an overloaded one
exists (for this to work, you must never overload
BUILTIN).
>> feval('_paren', magic(5), 3, 3) % M(3,3)
ans =
13
>> feval('_brace', num2cell(magic(5)), 3, 3) % C{3,3}
ans =
13
What's interesting is that feval seems to be just a tiny bit quicker than builtin (by ~3.5%), at least in Matlab 2013b, which is weird given that feval needs to check if the function is overloaded, unlike builtin:
>> tic; for i=1:1e6, feval('_paren', magic(5), 3, 3); end; toc;
Elapsed time is 49.904117 seconds.
>> tic; for i=1:1e6, builtin('_paren', magic(5), 3, 3); end; toc;
Elapsed time is 51.485339 seconds.
I am coding a survey that outputs a .csv file. Within this csv I have some entries that are space delimited, which represent multi-select questions (e.g. questions with more than one response). In the end I want to parse these space delimited entries into their own columns and create headers for them so i know where they came from.
For example I may start with this (note that the multiselect columns have an _M after them):
Q1, Q2_M, Q3, Q4_M
6, 1 2 88, 3, 3 5 99
6, , 3, 1 2
and I want to go to this:
Q1, Q2_M_1, Q2_M_2, Q2_M_88, Q3, Q4_M_1, Q4_M_2, Q4_M_3, Q4_M_5, Q4_M_99
6, 1, 1, 1, 3, 0, 0, 1, 1, 1
6,,,,3,1,1,0,0,0
I imagine this is a relatively common issue to deal with but I have not been able to find it in the R section. Any ideas how to do this in R after importing the .csv ? My general thoughts (which often lead to inefficient programs) are that I can:
(1) pull column numbers that have the special suffix with grep()
(2) loop through (or use an apply) each of the entries in these columns and determine the levels of responses and then create columns accordingly
(3) loop through (or use an apply) and place indicators in appropriate columns to indicate presence of selection
I appreciate any help and please let me know if this is not clear.
I agree with ran2 and aL3Xa that you probably want to change the format of your data to have a different column for each possible reponse. However, if you munging your dataset to a better format proves problematic, it is possible to do what you asked.
process_multichoice <- function(x) lapply(strsplit(x, " "), as.numeric)
q2 <- c("1 2 3 NA 4", "2 5")
processed_q2 <- process_multichoice(q2)
[[1]]
[1] 1 2 3 NA 4
[[2]]
[1] 2 5
The reason different columns for different responses are suggested is because it is still quite unpleasant trying to retrieve any statistics from the data in this form. Although you can do things like
# Number of reponses given
sapply(processed_q2, length)
#Frequency of each response
table(unlist(processed_q2), useNA = "ifany")
EDIT: One more piece of advice. Keep the code that processes your data separate from the code that analyses it. If you create any graphs, keep the code for creating them separate again. I've been down the road of mixing things together, and it isn't pretty. (Especially when you come back to the code six months later.)
I am not entirely sure what you trying to do respectively what your reasons are for coding like this. Thus my advice is more general – so just feel to clarify and I will try to give a more concrete response.
1) I say that you are coding the survey on your own, which is great because it means you have influence on your .csv file. I would NEVER use different kinds of separation in the same .csv file. Just do the naming from the very beginning, just like you suggested in the second block.
Otherwise you might geht into trouble with checkboxes for example. Let's say someone checks 3 out of 5 possible answers, the next only checks 1 (i.e. "don't know") . Now it will be much harder to create a spreadsheet (data.frame) type of results view as opposed to having an empty field (which turns out to be an NA in R) that only needs to be recoded.
2) Another important question is whether you intend to do a panel survey(i.e longitudinal study asking the same participants over and over again) . That (among many others) would be a good reason to think about saving your data to a MySQL database instead of .csv . RMySQL can connect directly to the database and access its tables and more important its VIEWS.
Views really help with survey data since you can rearrange the data in different views, conditional on many different needs.
3) Besides all the personal / opinion and experience, here's some (less biased) literature to get started:
Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology
The book is comparatively simple and leaves out panel surveys but gives a lot of R Code and examples which should be a practical start.
To prevent re-inventing the wheel you might want to check LimeSurvey, a pretty decent (not speaking of the templates :) ) tool for survey conductors. Besides I TYPO3 CMS extensions pbsurvey and ke_questionnaire (should) work well too (only tested pbsurvey).
Multiple choice items should always be coded as separate variables. That is, if you have 5 alternatives and multiple choice, you should code them as i1, i2, i3, i4, i5, i.e. each one is a binary variable (0-1). I see that you have values 3 5 99 for Q4_M variable in the first example. Does that mean that you have 99 alternatives in an item? Ouch...
First you should go on and create separate variables for each alternative in a multiple choice item. That is, do:
# note that I follow your example with Q4_M variable
dtf_ins <- as.data.frame(matrix(0, nrow = nrow(<initial dataframe>), ncol = 99))
# name vars appropriately
names(dtf_ins) <- paste("Q4_M_", 1:99, sep = "")
now you have a data.frame with 0s, so what you need to do is to get 1s in an appropriate position (this is a bit cumbersome), a function will do the job...
# first you gotta change spaces to commas and convert character variable to a numeric one
y <- paste("c(", gsub(" ", ", ", x), ")", sep = "")
z <- eval(parse(text = y))
# now you assing 1 according to indexes in z variable
dtf_ins[1, z] <- 1
And that's pretty much it... basically, you would like to reconsider creating a data.frame with _M variables, so you can write a function that does this insertion automatically. Avoid for loops!
Or, even better, create a matrix with logicals, and just do dtf[m] <- 1, where dtf is your multiple-choice data.frame, and m is matrix with logicals.
I would like to help you more on this one, but I'm recuperating after a looong night! =) Hope that I've helped a bit! =)
Thanks for all the responses. I agree with most of you that this format is kind of silly but it is what I have to work with (survey is coded and going into use next week). This is what I came up with from all the responses. I am sure this is not the most elegant or efficient way to do it but I think it should work.
colnums <- grep("_M",colnames(dat))
responses <- nrow(dat)
for (i in colnums) {
vec <- as.vector(dat[,i]) #turn into vector
b <- lapply(strsplit(vec," "),as.numeric) #split up and turn into numeric
c <- sort(unique(unlist(b))) #which values were used
newcolnames <- paste(colnames(dat[i]),"_",c,sep="") #column names
e <- matrix(nrow=responses,ncol=length(c)) #create new matrix for indicators
colnames(e) <- newcolnames
#next loop looks for responses and puts indicators in the correct places
for (i in 1:responses) {
e[i,] <- ifelse(c %in% b[[i]],1,0)
}
dat <- cbind(dat,e)
}
Suggestions for improvement are welcome.
Over the holidays, my family loves to play Boggle. Problem is, I'm terrible at Boggle. So I did what any good programmer would do: wrote a program to play for me.
At the core of the algorithm is a simple prefix trie, where each node is a dict of references to the next letters.
This is the trie:add implementation:
add([], Trie) ->
dict:store(stop, true, Trie);
add([Ch|Rest], Trie) ->
% setdefault(Key, Default, Dict) ->
% case dict:find(Key, Dict) of
% { ok, Val } -> { Dict, Val }
% error -> { dict:new(), Default }
% end.
{ NewTrie, SubTrie } = setdefault(Ch, dict:new(), Trie),
NewSubTrie = add(Rest, SubTrie),
dict:store(Ch, NewSubTrie, NewTrie).
And you can see the rest, along with an example of how it's used (at the bottom), here:
http://gist.github.com/263513
Now, this being my first serious program in Erlang, I know there are probably a bunch of things wrong with it… But my immediate concern is that it uses 800 megabytes of RAM.
So, what am I doing most-wrong? And how might I make it a bit less-wrong?
You could implement this functionality by simply storing the words in an ets table:
% create table; add words
> ets:new(words, [named_table, set]).
> ets:insert(words, [{"zed"}]).
> ets:insert(words, [{"zebra"}]).
% check if word exists
> ets:lookup(words, "zed").
[{"zed"}]
% check if "ze" has a continuation among the words
78> ets:match(words, {"ze" ++ '$1'}).
[["d"],["bra"]]
If trie is a must, but you can live with a non-functional approach, then you can try digraphs, as Paul already suggested.
If you want to stay functional, you might save some bytes of memory by using structures using less memory, for example proplists, or records, such as -record(node, {a,b,....,x,y,z}).
I don't remember how much memory a dict takes, but let's estimate. You have 2.5e6 characters and 2e5 words. If your trie had no sharing at all, that would take 2.7e6 associations in the dicts (one for each character and each 'stop' symbol). A simple purely-functional dict representation would maybe 4 words per association -- it could be less, but I'm trying to get an upper bound. On a 64-bit machine, that'd take 8*4*2.7 million bytes, or 86 megabytes. That's only a tenth of your 800M, so something's surely wrong here.
Update: dict.erl represents dicts with a hashtable; this implies lots of overhead when you have a lot of very small dicts, as you do. I'd try changing your code to use the proplists module, which ought to match my calculations above.
An alternative way to solve the problem is going through the word list and see if the word can be constructed from the dice. That way you need very little RAM, and it might be more fun to code. (optimizing and concurrency)
Look into DAWGs. They're much more compact than tries.
I don't know about your algorithm, but if you're storing that much data, maybe you should look into using Erlang's built-in digraph library to represent your trie, instead of so many dicts.
http://www.erlang.org/doc/man/digraph.html
If all words are in English, and the case doesn't matter, all characters can be encoded by numbers from 1 to 26 (and in fact, in Erlang they are numbers from 97 to 122), reserving 0 for stop. So you can use the array module as well.