I have some basic familiarity with Python and have been extracting coding sequences from GenBank records. However, I'm unsure how to handle records where the coding sequence has been modified (e.g. owing to correcting internal stop codons). An example of such a sequence is this GenBank record (or accession: XM_021385495.1 if the link does not work).
In this example, I can translate the two coding sequences that I can access, but both have internal stop codons - and according to the notes also indels! This is the way I have accessed the CDS:
1 - gb_record.seq
2 - cds.location.extract(gb_record) for each feature cds where cds.type == "CDS"
However, I need the sequence that has been corrected. As far as I can tell, I think I need to use the "transl_except" tags in the CDS feature but I am at a loss how to do this.
I wonder if anybody might be able to provide an example or some insight of how to do this?
Thanks
Jo
I've got some demo code written in Python 3 that should help explain this GenBank record.
import re
aa_convert_codon_di = {
'A':['[GRSK][CYSM].'],
'B':['[ARWM][ARWM][CTYWKSM]', '[GRSK][ARWM][TCYWKSM]'],
'C':['[TYWK][GRSK][TCYWKSM]'],
'D':['[GRSK][ARWM][TCYWKSM]'],
'E':['[GRSK][ARWM][AGRSKWM]'],
'F':['[TYWK][TYWK][CTYWKSM]'],
'G':['[GRSK][GRSK].'],
'H':['[CYSM][ARWM][TCYWKSM]'],
'I':['[ARWM][TYWK][^G]'],
'J':['[ARWM][TYWK][^G]', '[CYSM][TYWK].', '[TYWK][TYWK][AGRSKWM]'],
'K':['[ARWM][ARWM][AGRSKWM]'],
'L':['[CYSM][TYWK].', '[TYWK][TYWK][AGRSKWM]'],
'M':['[ARWM][TYWK][GRSK]'],
'N':['[ARWM][ARWM][CTYWKSM]'],
'O':['[TYWK][ARWM][GRSK]'],
'P':['[CYSM][CYSM].'],
'Q':['[CYSM][ARWM][AGRSKWM]'],
'R':['[CYSM][GRSK].', '[ARWM][GRSK][GARSKWM]'],
'S':['[TYWK][CYSM].', '[ARWM][GRSK][CTYWKSM]'],
'T':['[ARWM][CYSM].'],
'U':['[TYWK][GRSK][ARWM]'],
'V':['[GRSK][TYWK].'],
'W':['[TYWK][GRSK][GRSK]'],
'X':['...'],
'Y':['[TYWK][ARWM][CTYWKSM]'],
'Z':['[CYSM][ARWM][AGRSKWM]','[GRSK][ARWM][AGRSKWM]'],
'_':['[TYWK][ARWM][AGRSKWM]', '[TYWK][GRSK][ARWM]'],
'*':['[TYWK][ARWM][AGRSKWM]', '[TYWK][GRSK][ARWM]'],
'x':['[TYWK][ARWM][AGRSKWM]', '[TYWK][GRSK][ARWM]']}
dna_convert_aa_di = {
'ATA':'I', 'ATC':'I', 'ATT':'I', 'ATG':'M',
'ACA':'T', 'ACC':'T', 'ACG':'T', 'ACT':'T',
'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K',
'AGC':'S', 'AGT':'S', 'AGA':'R', 'AGG':'R',
'CTA':'L', 'CTC':'L', 'CTG':'L', 'CTT':'L',
'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P',
'CAC':'H', 'CAT':'H', 'CAA':'Q', 'CAG':'Q',
'CGA':'R', 'CGC':'R', 'CGG':'R', 'CGT':'R',
'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V',
'GCA':'A', 'GCC':'A', 'GCG':'A', 'GCT':'A',
'GAC':'D', 'GAT':'D', 'GAA':'E', 'GAG':'E',
'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G',
'TCA':'S', 'TCC':'S', 'TCG':'S', 'TCT':'S',
'TTC':'F', 'TTT':'F', 'TTA':'L', 'TTG':'L',
'TAC':'Y', 'TAT':'Y', 'TAA':'*', 'TAG':'*',
'TGC':'C', 'TGT':'C', 'TGA':'*', 'TGG':'W'}
dna_str = "ATGACCGAGGTGCAAGACCTTGCACTTGGATTTGTTGAACCTCATGAGGTTCCCCTGGGCCCCTGGACATCGCCTTTTTCCAGCGTTCCACCAGAGACTTCACCCAACTGCTGTGACTTTTCAAACATCATTGAGAGCGGCTTGATACAGTTAGGCCACTCTCGCAGCTGTGAAGTTGTGAAGGCAAACTCCAGCGACCCATTCCTTCTTCCTTCAGAAAAGCAACTCGAGGAGCAGCGGGAGGAAACCCAGCTCTATCCTGCAGCGAGCGGGGCTGCGCAAGAGGCAGGTGCTGCTCTCACGGCCCGAAGGCAGCTCCGAGCTGCCGGGTGCGGTCACGTCAGCGGCCGAGCTGCCCGGCGGGGTGTGCATAAGAGCGAGCTATATGTGCTGCGTGTCATCACGGAGCCTTTCAAGTCCCTCCCTCCTTCTCCACTGCTGGGGCTGCAGTGGGCACCGGGCAGGAGGAGCGGCCGCAGCCCCGCGGGGGTGGGACGAGTCTCTGGGGGCTGCGCCACTTGGAAGATTTGCATTGGGTACATTGATAGCATTGTGATTGATGGCCTATTTAATACCATAATGTGTTCTTTAGATTTCTTTTTGGAGAACTCAGAAGAAAATTTGAAGCCAGCTCCACTTTTTCCAGCACAAATGACCCTTACTGGCACAGAAATTCATTTTAAACTTTCTCTAGATAAAGAGGCTGATGATGGCTTTTATGACCTTATGGATGAACTACTGGGTGATATTTTCCGAATGTCTGCCCAAGTGAAGAGACTAGAAGCCCACCTGGAATCAGAACATTAGGAGGACTATATGAACAGTGTGTTTGATCTGTCTGAACTCAGGCAGGAGAGTATGGAGAGAGTAATAAACGTCACCAACAAGGCCTTGAAGTACAGAAGATCTCATGATAGCTATGCTTATCTCTGACTAGAGGATCAGCTTGAGTTTATGAGGCAATTTCTTCCTTGTGCTCGTGGTTTAATGTCCACACAGATATCTCTTACTGGCATCCCACTACTAAACTGTGTAAAAAGCAGGCAAGAAAGAAACTAGTTTAAATAACTTCCTATTTATGAAAATCTCTGTGTTCAGATGAGTAAGTTTGAAGACCCAAGAATTTTTGAAAGCTGGTTTAAGGTGATTATGAAGCCTTTCAAAATGACACTTCTAAACATTACTAAGAAGTGGAGCTGGATGTTTAAGTAGTACACTATAGAAATAATAAGATTGAGTCTGAATGACTTCAAAGACTTTATAAAAGTGACAGATGCTGGACTTCAAAGAGGGAGGCATTATTGTGCACTGGCAGAAATCACCGGTCACCTCTTGGCTGTGAAAGAGAGGCAGACAGCTGCTGGTGAATCCTTTGAACCTTTAAAAGAANTTGTTGCATTGTTGGAAAGCTACAGACAGAAGATGCCAGATCAAGTTTGCATCCAGTGTCAAATCAGTTGTATCCTGGGAGCCTTTAAGGGTTATGTACTTCTGGTTGGAGTAGGTGGTAGTGATAAATGAAGCTTGTCAAGGCTGGCAGCATGCATCTCTTCCCTGGAGGTCTTTTAAATCATATGGAAGAAAGACCATGAGAGCAAGAACCTGAAGGTAGATGTTGCCAGTTTGTGCATCAAGACTGGTGCCAAGAACATGCCCACAGTGTTTTTGCTGACAGATGCCCAGGTTCCAGATGAACGCTTTCTTGTGCTGATTAATGACTTGTTGGCATCAAGAGATCTTCCTGATCTGTTCAGTGGTGAAGATGAGGAGGGCAAAGTTGCAGGAGTCAGAAAAGAAGTCNNCCTGGGCTTGATGGACACCACAGAAAGCTGCTGGAGGTGGTTCTTTGGTAGAGCGCAGCAGCTGTTAAAAGTGTATGGTGAAGTAGAGTCGAAATGTTGTGCACTGGTCCAGGCAAATACAAAATTAGCAACAGCTAAAGAGAATCTAGAAACAATCTTGAAAAAGCTTATTTCTGAAAATGTGCATTGGAGCCAATCTGTTGAAAACCTCAAAGCATAAAAGAAAACTGTACTCAAGGATGTTACATCAGCAGCAGCGTTTGCATCTTTCTTTGGAGCCTTCACAAAACCATATAGTCAAGAACAGATGGAACATTTCTGGATTCTTTCTCTAAAGTCACAGGAGTGTCCTGTTCCTGTGATAGAGGGGCCAGACTCTGCCATCCTGATGAATGATGCTCCAAGAGCAGCACAGAGTAACAAGAGTCTGCTTGCTGATAGGGTGTCAGCAGAAAATGCCACTGCTCTGACACACTGTGAGCAGGGCCCTCTGATGATAGATCCCCAGAAACAGGGAATTGAATGGACACAGAATAAATACAGAACTGACTTTAAAGTCATGCATCTAGGAGAGAATGGTTATGTGTGTACTATTGATACAGCTTTGGCTTGTGGAGAGATTATACTAATTGAAAACATGGCTGAATCTATCGATCTCTTACTTGATCCCCTAACTGGAAGACATACAGGTAAAAGGGGAAGGAATACTTGCGCAATCAGAATTTCTTGAAGACAAAAAAAAAAAAAGTGTGAATTCTACAGGAATTTCCATCTCATCCTTCACACTAAGCTGGCTAACCCTCCCTGCAAGCCAGAGCTTNAGGCTCAGACCACTCTCATTATTTTCACAGATACCAGGGGCAGGCTGGAAGAACAGCTGTTGGCTGAGGTGGTGAGTGCTGAAAGGCCTGACTTGGAAAACCATACGTCAGCACTGGCGAAACAGAAGAGTGTCTCTGAAATCAAGCCCAAGCAGCTTGAGGACAACATGCTGCTCAGTCTGTCAGCTGCCCAGAGCACTTTTGTAGGTGACAGTGAACTTGAAGAGAAATTCAAGTCAACTGCAGGAGAAATGATTGTCCGCCCACATGTTCACAGCTTCTTATTTTGGCAAAAAGCTTCCACTGTAGACTCTGGAAGATTTCATATCTCTTTAGGACAAGGGCAGGAGATGGTTGTGGAGNGACAACTTGAGAAGGCTGCCAAGCCTGGCCACTGGCTTCTTCTCCAAAATATTAATGTGGTAGCCAAGTGGCTAGGAACCTTGGAAAAACTCCTCGAGCAATAGAGTGAAGAAAGTCACTGGTATTTCCGTGTCTTCACTAGTGCTGAACCAGCTCCAGCCCCAGAAGAGCACATCATTCTTCAAGGAGTACTTGAAAACTGAATTAAAATTACCAGACTATCAATAACACTGCCAGTTGTTAAGTGGATAAATGTATTCCTTTTTTTCCTTTGGCAGGATACCCTTGAACTGTGTGGCAAAGAACAGGAATTTAAGAGCATTCTTTTCTCCCTTCGTTATTTTCACACCCGTGTTGCCAGCAGACTCATTTGGCCTTCCAGGCTGCAATTAAGATACCCATACAATACTAGAGATCTCACTGTTTGCATCAGTGTGCCCTGCAACTATTTAGACACTTACACAGAGGTCAGACGCAGTGGTCAGAAAAACAAGTCTATAAAATCAGCTGATTCCAACCCTTAG"
aa_str = "MTEVQDLALGFVEPHEVPLGPWTSPFSSVPPETSPNCCDFSNIIESGLIQLGHSRSCEVVKANSSDPFLLPSEKQLEEQREETQLYPAASGAAQEAGAALTARRQLRAAGCGHVSGRAARRGVHKSELYVLRVITEPFKSLPPSPLLGLQWAPGRRSGRSPAGVGRVSGGCATWKICIGYIDSIVIDGLFNTIMCSLDFFLENSEENLKPAPLFPAQMTLTGTEIHFKLSLDKEADDGFYDLMDELLGDIFRMSAQVKRLEAHLESEHXEDYMNSVFDLSELRQESMERVINVTNKALKYRRSHDSYAYLXLEDQLEFMRQFLPCARGLMSTQISLTGIPLLNCVKSRQERNXFKXLPIYENLCVQMSKFEDPRIFESWFKVIMKPFKMTLLNITKKWSWMFKXYTIEIIRLSLNDFKDFIKVTDAGLQRGRHYCALAEITGHLLAVKERQTAAGESFEPLKEXVALLESYRQKMPDQVCIQCQISCILGAFKGYVLLVGVGGSDKXSLSRLAACISSLEVFXIIWKKDHESKNLKVDVASLCIKTGAKNMPTVFLLTDAQVPDERFLVLINDLLASRDLPDLFSGEDEEGKVAGVRKEVXLGLMDTTESCWRWFFGRAQQLLKVYGEVESKCCALVQANTKLATAKENLETILKKLISENVHWSQSVENLKAXKKTVLKDVTSAAAFASFFGAFTKPYSQEQMEHFWILSLKSQECPVPVIEGPDSAILMNDAPRAAQSNKSLLADRVSAENATALTHCEQGPLMIDPQKQGIEWTQNKYRTDFKVMHLGENGYVCTIDTALACGEIILIENMAESIDLLLDPLTGRHTGKRGRNTCAIRISXRQKKKKCEFYRNFHLILHTKLANPPCKPELXAQTTLIIFTDTRGRLEEQLLAEVVSAERPDLENHTSALAKQKSVSEIKPKQLEDNMLLSLSAAQSTFVGDSELEEKFKSTAGEMIVRPHVHSFLFWQKASTVDSGRFHISLGQGQEMVVEXQLEKAAKPGHWLLLQNINVVAKWLGTLEKLLEQXSEESHWYFRVFTSAEPAPAPEEHIILQGVLENXIKITRLSITLPVVKWINVFLFFLWQDTLELCGKEQEFKSILFSLRYFHTRVASRLIWPSRLQLRYPYNTRDLTVCISVPCNYLDTYTEVRRSGQKNKSIKSADSN"
mod_dna_str = ""
mod_aa_str = aa_str[:]
start = 0
for index in range(start, len(dna_str), 3):
    codon = dna_str[index:index+3]
    if len(mod_aa_str) == 0:
        break
    if codon in dna_convert_aa_di and dna_convert_aa_di[codon] == mod_aa_str[0]:
        mod_aa_str = mod_aa_str[1:]
    else:
        codon_match = "|".join(aa_convert_codon_di[mod_aa_str[0]])
        if len(re.findall(codon_match, codon)) > 0:
            print(index, codon_match, codon)
            mod_aa_str = mod_aa_str[1:]
Code output:
804 ... TAG
930 ... TGA
1056 ... TAG
1065 ... TAA
1209 ... TAG
1389 ... NTT
1518 ... TGA
1566 ... TAA
1800 ... NNC
2019 ... TAA
2529 ... TGA
2622 ... NAG
2985 ... NGA
3087 ... TAG
3186 ... TGA
From the note section of the CDS, we have: "inserted 5 bases in 4 codons; deleted 2 bases in 2 codons; substituted 11 bases at 11 genomic stop codons".
How does this relate to our output? The reading frame never changes, suggesting that the 2 deleted bases are absent from the given nucleotide sequence. Five unknown nucleotides (N) exist in 4 codons (unknown amino acid, X). The authors of the sequence have accounted for indels. Eleven premature stop codons are present, which are simply translated as unknown amino acids. The "transl_except" tags match the locations of the premature stop codons. The nucleotides at these sites have not been altered. The authors provide XP_021241170 as a possible corrected translation product, but it's still very bad.
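Not a full solution, but here is a minimal Biopython sketch of where those tags live (untested against this exact record; the local file name and the positions quoted in the comment are assumptions):

from Bio import SeqIO

# assumes the record was saved locally, e.g. as "XM_021385495.1.gb"
record = SeqIO.read("XM_021385495.1.gb", "genbank")

for feature in record.features:
    if feature.type != "CDS":
        continue
    cds_seq = feature.location.extract(record.seq)
    naive_protein = cds_seq.translate()
    print(naive_protein.count("*"), "stop codons in the naive translation")
    # Each transl_except qualifier is a string such as "(pos:805..807,aa:OTHER)"
    # (positions illustrative), i.e. that codon should be read as X rather than a stop.
    for exception in feature.qualifiers.get("transl_except", []):
        print(exception)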
I am working with a third party device which has some implementation of Lua, and communicates in BACnet. The documentation is pretty janky, not providing any sort of help for any more advanced programming ideas. It's simply, "This is how you set variables...". So, I am trying to just figure it out, and hoping you all can help.
I need to set a long list of variables to certain values. I have a userdata 'ME', with a bunch of variables named MVXX (e.g. - MV21, MV98, MV56, etc).
(This is all kind of background for BACnet.) Variables in BACnet all have 17 'priorities', i.e., every BACnet variable is actually a sort of list of 17 values, with priority 16 being the default. So, typically, if I were to say ME.MV12 = 23, that would set MV12's priority-16 to the desired value of 23.
However, I need to set priority 17. I can do this in the provided Lua implementation, by saying ME.MV12_PV[17] = 23. I can set any of the priorities I want by indexing that PV. (Corollaries - what is PV? What is the underscore? How do I get to these objects? Or are they just interpreted from Lua to some function in C on the backend?)
All this being said, I need to make that variable name dynamic, so that I can set whichever value I need to set, based on some other code. I have made several attempts.
This tells me the object (MV12_PV[17]) does not exist:
x = 12
ME["MV" .. x .. "_PV[17]"] = 23
But this works fine, setting priority 16 to 23:
x = 12
ME["MV" .. x] = 23
I was also trying what I think is called an evaluation, or eval. But this just prints out function followed by some random 8-digit number:
x = 12
test = assert(loadstring("MV" .. x .. "_PV[17] = 23"))
print(test)
Any help? Apologies if I am unclear - tbh, I am so far behind the 8-ball I am pretty much grabbing at straws.
Underscores can be part of Lua identifiers (variable and function names). They are just part of the variable name (like letters are) and aren't a special Lua operator like [ and ] are.
In the expression ME.MV12_PV[17] we have ME being an object with a bunch of fields, ME.MV12_PV being an array stored in the "MV12_PV" field of that object and ME.MV12_PV[17] is the 17th slot in that array.
If you want to access fields dynamically, the thing to know is that accessing a field with dot notation in Lua is equivalent to using bracket notation and passing in the field name as a string:
-- The following are all equivalent:
x.foo
x["foo"]
local fieldname = "foo"
x[fieldname]
So in your case you might want to try doing something like this:
local n = 12
ME["MV"..n.."_PV"][17] = 23
BACnet "Commmandable" Objects (e.g. Binary Output, Analog Output, and o[tionally Binary Value, Analog Value and a handful of others) actually have 16 priorities (1-16). The "17th" you are referring to may be the "Relinquish Default", a value that is used if all 16 priorities are set to NULL or "Relinquished".
Perhaps your system will allow you to write to a BACnet Property called "Relinquish Default".
I am coding a survey that outputs a .csv file. Within this csv I have some entries that are space delimited, which represent multi-select questions (i.e. questions with more than one response). In the end I want to parse these space-delimited entries into their own columns and create headers for them so I know where they came from.
For example I may start with this (note that the multiselect columns have an _M after them):
Q1, Q2_M, Q3, Q4_M
6, 1 2 88, 3, 3 5 99
6, , 3, 1 2
and I want to go to this:
Q1, Q2_M_1, Q2_M_2, Q2_M_88, Q3, Q4_M_1, Q4_M_2, Q4_M_3, Q4_M_5, Q4_M_99
6, 1, 1, 1, 3, 0, 0, 1, 1, 1
6,,,,3,1,1,0,0,0
I imagine this is a relatively common issue to deal with, but I have not been able to find it in the R section. Any ideas how to do this in R after importing the .csv? My general thoughts (which often lead to inefficient programs) are that I can:
(1) pull column numbers that have the special suffix with grep()
(2) loop through (or use an apply) each of the entries in these columns and determine the levels of responses and then create columns accordingly
(3) loop through (or use an apply) and place indicators in appropriate columns to indicate presence of selection
I appreciate any help and please let me know if this is not clear.
I agree with ran2 and aL3Xa that you probably want to change the format of your data to have a different column for each possible response. However, if munging your dataset into a better format proves problematic, it is possible to do what you asked.
process_multichoice <- function(x) lapply(strsplit(x, " "), as.numeric)
q2 <- c("1 2 3 NA 4", "2 5")
processed_q2 <- process_multichoice(q2)
[[1]]
[1] 1 2 3 NA 4
[[2]]
[1] 2 5
The reason different columns for different responses are suggested is because it is still quite unpleasant trying to retrieve any statistics from the data in this form. Although you can do things like
# Number of responses given
sapply(processed_q2, length)
#Frequency of each response
table(unlist(processed_q2), useNA = "ifany")
EDIT: One more piece of advice. Keep the code that processes your data separate from the code that analyses it. If you create any graphs, keep the code for creating them separate again. I've been down the road of mixing things together, and it isn't pretty. (Especially when you come back to the code six months later.)
I am not entirely sure what you are trying to do, or what your reasons are for coding it like this, so my advice is more general – just feel free to clarify and I will try to give a more concrete response.
1) I take it that you are coding the survey yourself, which is great because it means you have control over your .csv file. I would NEVER use different kinds of separators in the same .csv file. Just do the naming from the very beginning, as you suggested in your second block.
Otherwise you might get into trouble with checkboxes, for example. Say one respondent checks 3 out of 5 possible answers and the next checks only 1 (i.e. "don't know"). It will then be much harder to create a spreadsheet (data.frame) style view of the results, as opposed to having an empty field (which becomes an NA in R) that only needs to be recoded.
2) Another important question is whether you intend to run a panel survey (i.e. a longitudinal study asking the same participants repeatedly). That (among many other things) would be a good reason to consider saving your data in a MySQL database instead of a .csv. RMySQL can connect directly to the database and access its tables and, more importantly, its VIEWS.
Views really help with survey data since you can rearrange the data in different views, conditional on many different needs.
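For illustration only, a minimal sketch of pulling such a view into R with RMySQL (the database name, credentials and view name below are placeholders):

library(DBI)
library(RMySQL)

# placeholder connection details
con <- dbConnect(MySQL(), dbname = "survey", host = "localhost",
                 user = "survey_user", password = "secret")

# "v_responses" is a hypothetical view that already rearranges the raw answers
responses <- dbGetQuery(con, "SELECT * FROM v_responses")

dbDisconnect(con)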
3) Besides all the personal / opinion and experience, here's some (less biased) literature to get started:
Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology)
The book is comparatively simple and leaves out panel surveys but gives a lot of R Code and examples which should be a practical start.
To avoid reinventing the wheel, you might want to check out LimeSurvey, a pretty decent (not speaking of the templates :) ) tool for running surveys. Besides that, the TYPO3 CMS extensions pbsurvey and ke_questionnaire should work well too (I have only tested pbsurvey).
Multiple choice items should always be coded as separate variables. That is, if you have 5 alternatives and multiple choice, you should code them as i1, i2, i3, i4, i5, i.e. each one is a binary variable (0-1). I see that you have values 3 5 99 for Q4_M variable in the first example. Does that mean that you have 99 alternatives in an item? Ouch...
First you should go on and create separate variables for each alternative in a multiple choice item. That is, do:
# note that I follow your example with Q4_M variable
dtf_ins <- as.data.frame(matrix(0, nrow = nrow(<initial dataframe>), ncol = 99))
# name vars appropriately
names(dtf_ins) <- paste("Q4_M_", 1:99, sep = "")
Now you have a data.frame of 0s, so what you need to do is put 1s in the appropriate positions (this is a bit cumbersome); a function will do the job...
# first you've got to change spaces to commas and convert the character entry into numeric indexes
# (here x stands for a single cell of the multi-select column, e.g. "3 5 99")
y <- paste("c(", gsub(" ", ", ", x), ")", sep = "")
z <- eval(parse(text = y))
# now you assign 1s according to the indexes in z
dtf_ins[1, z] <- 1
And that's pretty much it... basically, you would like to reconsider creating a data.frame with _M variables, so you can write a function that does this insertion automatically. Avoid for loops!
Or, even better, create a matrix with logicals, and just do dtf[m] <- 1, where dtf is your multiple-choice data.frame, and m is matrix with logicals.
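For example, here is a small untested sketch of that logical-matrix idea, using made-up data in the Q4_M format from the question:

# one numeric vector of selected alternatives per respondent
split_responses <- list(c(3, 5, 99), c(1, 2))
alternatives <- sort(unique(unlist(split_responses)))

# logical matrix m: rows = respondents, columns = alternatives
m <- t(sapply(split_responses, function(resp) alternatives %in% resp))

# 0/1 indicator data.frame, then flip the selected cells to 1
dtf <- as.data.frame(matrix(0, nrow = nrow(m), ncol = ncol(m)))
names(dtf) <- paste("Q4_M", alternatives, sep = "_")
dtf[m] <- 1
dtf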
I would like to help you more on this one, but I'm recuperating after a looong night! =) Hope that I've helped a bit! =)
Thanks for all the responses. I agree with most of you that this format is kind of silly but it is what I have to work with (survey is coded and going into use next week). This is what I came up with from all the responses. I am sure this is not the most elegant or efficient way to do it but I think it should work.
colnums <- grep("_M",colnames(dat))
responses <- nrow(dat)
for (i in colnums) {
vec <- as.vector(dat[,i]) #turn into vector
b <- lapply(strsplit(vec," "),as.numeric) #split up and turn into numeric
c <- sort(unique(unlist(b))) #which values were used
newcolnames <- paste(colnames(dat[i]),"_",c,sep="") #column names
e <- matrix(nrow=responses,ncol=length(c)) #create new matrix for indicators
colnames(e) <- newcolnames
#next loop looks for responses and puts indicators in the correct places
for (i in 1:responses) {
e[i,] <- ifelse(c %in% b[[i]],1,0)
}
dat <- cbind(dat,e)
}
Suggestions for improvement are welcome.
Over the holidays, my family loves to play Boggle. Problem is, I'm terrible at Boggle. So I did what any good programmer would do: wrote a program to play for me.
At the core of the algorithm is a simple prefix trie, where each node is a dict of references to the next letters.
This is the trie:add implementation:
add([], Trie) ->
dict:store(stop, true, Trie);
add([Ch|Rest], Trie) ->
% setdefault(Key, Default, Dict) ->
% case dict:find(Key, Dict) of
% { ok, Val } -> { Dict, Val }
% error -> { dict:new(), Default }
% end.
{ NewTrie, SubTrie } = setdefault(Ch, dict:new(), Trie),
NewSubTrie = add(Rest, SubTrie),
dict:store(Ch, NewSubTrie, NewTrie).
And you can see the rest, along with an example of how it's used (at the bottom), here:
http://gist.github.com/263513
Now, this being my first serious program in Erlang, I know there are probably a bunch of things wrong with it… But my immediate concern is that it uses 800 megabytes of RAM.
So, what am I doing most-wrong? And how might I make it a bit less-wrong?
You could implement this functionality by simply storing the words in an ets table:
% create table; add words
> ets:new(words, [named_table, set]).
> ets:insert(words, [{"zed"}]).
> ets:insert(words, [{"zebra"}]).
% check if word exists
> ets:lookup(words, "zed").
[{"zed"}]
% check if "ze" has a continuation among the words
78> ets:match(words, {"ze" ++ '$1'}).
[["d"],["bra"]]
If a trie is a must, but you can live with a non-functional approach, then you can try digraphs, as Paul already suggested.
If you want to stay functional, you might save some memory by using lighter-weight structures, for example proplists, or records such as -record(node, {a,b,....,x,y,z}).
I don't remember how much memory a dict takes, but let's estimate. You have 2.5e6 characters and 2e5 words. If your trie had no sharing at all, that would take 2.7e6 associations in the dicts (one for each character and each 'stop' symbol). A simple purely-functional dict representation would take maybe 4 words per association -- it could be less, but I'm trying to get an upper bound. On a 64-bit machine, that'd take 8*4*2.7 million bytes, or 86 megabytes. That's only a tenth of your 800M, so something's surely wrong here.
Update: dict.erl represents dicts with a hashtable; this implies lots of overhead when you have a lot of very small dicts, as you do. I'd try changing your code to use the proplists module, which ought to match my calculations above.
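For what it's worth, here is a rough, untested sketch of what the add/2 from the question might look like on top of proplists (the empty trie becomes [] here):

%% sketch only: same shape as the dict-based add/2, but each node is a proplist
add([], Trie) ->
    lists:keystore(stop, 1, Trie, {stop, true});
add([Ch | Rest], Trie) ->
    SubTrie = proplists:get_value(Ch, Trie, []),
    NewSubTrie = add(Rest, SubTrie),
    lists:keystore(Ch, 1, Trie, {Ch, NewSubTrie}).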
An alternative way to solve the problem is to go through the word list and see whether each word can be constructed from the dice. That way you need very little RAM, and it might be more fun to code (optimization and concurrency).
Look into DAWGs. They're much more compact than tries.
I don't know about your algorithm, but if you're storing that much data, maybe you should look into using Erlang's built-in digraph library to represent your trie, instead of so many dicts.
http://www.erlang.org/doc/man/digraph.html
If all words are in English, and the case doesn't matter, all characters can be encoded by numbers from 1 to 26 (and in fact, in Erlang they are numbers from 97 to 122), reserving 0 for stop. So you can use the array module as well.
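As a rough, untested sketch of that idea, each node could be a 27-slot array, with slot 0 as the stop flag and slots 1-26 for the children of a-z:

%% sketch only: lowercase ASCII words, array-based trie nodes
new_node() ->
    array:new(27, {default, undefined}).

add([], Node) ->
    array:set(0, true, Node);
add([Ch | Rest], Node) when Ch >= $a, Ch =< $z ->
    Index = Ch - $a + 1,
    Child = case array:get(Index, Node) of
                undefined -> new_node();
                Existing  -> Existing
            end,
    array:set(Index, add(Rest, Child), Node).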
I need a well-tested regular expression (.NET style preferred), or some other simple bit of code, that will parse a USA/CA phone number into component parts, so:
3035551234122
1-303-555-1234x122
(303)555-1234-122
1 (303) 555 -1234-122
etc...
all parse into:
AreaCode: 303
Exchange: 555
Suffix: 1234
Extension: 122
None of the answers given so far was robust enough for me, so I continued looking for something better, and I found it:
Google's library for dealing with phone numbers
I hope it is also useful for you.
This is the one I use:
^(?:(?:[\+]?(?<CountryCode>[\d]{1,3}(?:[ ]+|[\-.])))?[(]?(?<AreaCode>[\d]{3})[\-/)]?(?:[ ]+)?)?(?<Number>[a-zA-Z2-9][a-zA-Z0-9 \-.]{6,})(?:(?:[ ]+|[xX]|(i:ext[\.]?)){1,2}(?<Ext>[\d]{1,5}))?$
I got it from RegexLib I believe.
This regex works exactly as you want with your examples:
Regex regexObj = new Regex(@"\(?(?<AreaCode>[0-9]{3})\)?[-. ]?(?<Exchange>[0-9]{3})[-. ]*?(?<Suffix>[0-9]{4})[-. x]?(?<Extension>[0-9]{3})");
Match matchResult = regexObj.Match("1 (303) 555 -1234-122");
// Now you have the results in named groups
string areaCode  = matchResult.Groups["AreaCode"].Value;    // "303"
string exchange  = matchResult.Groups["Exchange"].Value;    // "555"
string suffix    = matchResult.Groups["Suffix"].Value;      // "1234"
string extension = matchResult.Groups["Extension"].Value;   // "122"
Strip out anything that's not a digit first. Then all your examples reduce to:
/^1?(\d{3})(\d{3})(\d{4})(\d*)$/
To support all country codes is a little more complicated, but the same general rule applies.
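For example, a minimal C# sketch of that approach (the sample input is just one of the question's examples):

using System.Text.RegularExpressions;

string raw = "1 (303) 555 -1234-122";
string digits = Regex.Replace(raw, @"[^\d]", "");   // strip everything that's not a digit

Match m = Regex.Match(digits, @"^1?(\d{3})(\d{3})(\d{4})(\d*)$");
if (m.Success)
{
    string areaCode  = m.Groups[1].Value;   // "303"
    string exchange  = m.Groups[2].Value;   // "555"
    string suffix    = m.Groups[3].Value;   // "1234"
    string extension = m.Groups[4].Value;   // "122"
}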
Here is a well-written library used with GeoIP for instance:
http://highway.to/geoip/numberparser.inc
Here's a method easier on the eyes provided by the Z Directory (vettrasoft.com), geared towards American phone numbers:
string_o s2, s1 = "888/872.7676";
z_fix_phone_number (s1, s2);
cout << s2.print(); // prints "+1 (888) 872-7676"
phone_number_o pho = s2;
pho.store_save();
The last line stores the number to the database table "phone_number", with column values country_code = "1", area_code = "888", exchange = "872", etc.