How to parse files with arbitrary lengths? - parsing

I have a text file that I'd like to parse with records like this:
===================
name: John Doe
Education: High School Diploma
Education: Bachelor's Degree
Education: Sun Java Certified Programmer
Age: 29
===================
name: Bob Bear
Education: High School Diploma
Age: 18
===================
name: Jane Doe
Education: High School Diploma
Education: Bachelor's Degree
Education: Master's Degree
Education: AWS Certified Solution Architect Professional
Age: 25
As you can see, the fields in such a text file are fixed, but some of them repeat an arbitrary number of times. The records are separated by a fixed length ==== delimiter.
How would I write parsing logic this this sort of problem? I am think of using switch as it reads the start of the line, but the logic to handle multiple repeating fields baffles me.

A good way to approach this sort of problem is to "divide and conquer". That is, divide the overall problem into smaller sub-problems which are easier to manage and then solve each them individually. If you've planned properly then when you've finished each of the sub-problems you should have solved the whole problem.
Start by thinking about modeling. The document appears to contain a list of records, what should those records be called? What named fields should the records contain and what types should they have? How would you represent them idiomatically in go? For example, you might decide to call each record a Person with fields as such:
type Person struct {
Name string
Credentials []string
Age int
}
Next, think about what the interface (signature) of your parse function should look like. Should it emit an array of people? Should it use a visitor pattern and emit a person as soon as it's parsed? What constraints should drive the answer? Are memory or compute time constraints important? Does the user of the parser want any control over the parsing work such as canceling? Do they need metadata such as the total number of records contained in the document? Will the input always be from a file or a string, maybe from an HTTP request or a network socket? How will these choices drive your design?
func ParsePeople(string) ([]Person, error) // ?
func ParsePeople(io.Reader) ([]Person, error) // ?
func ParsePeople(io.Reader, func visitor(Person) bool) error // ?
Finally you can implement your parser to fulfill the interface that you've decided on. A straightforward approach here would be to read the input file line-by-line and take an action according to the contents of the line. For example (in pseudocode):
forEach line = inputFile.line
if line is a separator
emit or store the last parsed person, if present
create a new person to store parsed fields
else if line is a data field
parse the data
update the person with the parsed data
end
end
return the parsed records or final record, if emitting
Each line of pseudocode above represents a sub-problem that should be easier to solve than the whole.

Edit: Add explanation of why I just post a program as answer.
I am presenting a very straight forward implementation to parse the text you have given in your question. You accepted maerics answer and that is OK. I want to add some counter arguments to his answer, though. Basically the pseude-code in that answer is a non-compilable version of the code in my answer so we agree on the solution to this.
What I do not agree with is the over-engineering talk. I have to deal with code written by over-thinkers everyday. I urge you NOT to think about patterns, memory and time constraints or who might want what from this in the future.
Visitor pattern? That is something that is pretty much only useful in parsing programming languages, do not try to construct a use-case for it out of this problem. The visitor pattern is for traversing trees with different types of things in it. Here we have a list, not a tree, of things that are all the same.
Memory and time constraints? Are you parsing 5 GB of text with this? Then this might be a real concern. But even if you do, always write the simplest thing first. It will suffice. Throughout my career I only ever needed to use something other then a simple array or apply a complicated algorithm at most once per year. Still I see code everywhere that uses complicated data structures and algorithms without reason. This complicates change, is errorprone, sometimes makes things slower eventually! Do not use an observable list abstraction that notifies all observers whenever its contents change - but wait, let's add an update lock and unlock so we can control when to NOT notify everybody... No! Do not go down that route. Use a slice. Do your logic. Make everything read easy from top to bottom. I do not want to jump from A to B to C, chasing interfaces, following getters to finally find not a concrete data type but yet another interface. That is not the way to go.
These are the reasons why my code does not export anything, it is a self-contained, runnable example, a concrete solution to your concrete problem. You can read it, it is easy to follow. It is not heavily commented because it does not need to be. The three comments are not stating what happens but why it happens. Everything else is evident from the code itself. I left the note about the potential error in there on purpose. You know what kind of data you have, there is no line in there where this bug would be triggered. Do not write code to handle what cannot happen. If in the future someone would add a line without a text after the colon (remember, nobody will ever do this, do not worry about it), this will trigger a panic, point you to this line, you add another if or something, you are done. This code is future proof more then a program that tries to handle all kinds of different non-existent variations of the input.
The main point that I want to stretch is: write only what is necessary to solve the problem at hand. Everything beyond that makes your program hard to read and change, it will be untested and unnecessary.
With that said, here is my original answer:
https://play.golang.org/p/T6c51jSM5nr
package main
import (
"fmt"
"strconv"
"strings"
)
func main() {
type item struct {
name string
educations []string
age int
}
var items []item
var current item
finishItem := func() {
if current.name != "" { // handle the first ever separator
items = append(items, current)
}
current = item{}
}
lines := strings.Split(code, "\n")
for _, line := range lines {
if line == separator {
finishItem()
} else {
colon := strings.Index(line, ":")
if colon != -1 {
id := line[:colon]
value := line[colon+2:] // note potential bug if text has nothing after ':'
switch id {
case "name":
current.name = value
case "Education":
current.educations = append(current.educations, value)
case "Age":
age, err := strconv.Atoi(value)
if err == nil {
current.age = age
}
}
}
}
}
finishItem() // in case there was no separator at the end
for _, item := range items {
fmt.Printf("%s, %d years old, has educations:\n", item.name, item.age)
for _, e := range item.educations {
fmt.Printf("\t%s\n", e)
}
}
}
const separator = "==================="
const code = `===================
name: John Doe
Education: High School Diploma
Education: Bachelor's Degree
Education: Sun Java Certified Programmer
Age: 29
===================
name: Bob Bear
Education: High School Diploma
Age: 18
===================
name: Jane Doe
Education: High School Diploma
Education: Bachelor's Degree
Education: Master's Degree
Education: AWS Certified Solution Architect Professional
Age: 25`

Related

F#: How to examine content in a n-tuple and return true or false?

Consider this F# code:
let isSalary employee =
let (fName,lName,Occupation,Department,SalaryType,
HoursPerWeek, AnnualSalary, HourlyWage
) = employee
SalaryType = "Salary"
if(employee.SalaryType = SalaryType) then
true
else
false
Im getting errors in here, any fixes to it?
First things first, please post error messages and a much more specific question. Thanks! But luckily, I can about deduce the error messages from this problem.
Next, if you want to mutate SalaryType after you've deconstructed your employee 8-tuple, you should write using the mutable keyword:
let mutable (fName, lName, Occupation, Department, SalaryType,
HoursPerWeek, AnnualSalary, HourlyWage) = employee
But you shouldn't. This is explained further below.
Next problem: there is no dot notation (no tuple.member) for accessing members of a tuple. It's only possible through deconstruction. So you can't employee.SalaryType.
Here's what looks to be the crux of the problem, and it's a mistake I made many times when I was learning functional programming, and it's a difficult paradigm shift to adapt to. You should not be attempting to mutate data, or in this case, variables. Variables, or values as they are called in F#, shouldn't change, as a broad rule. Functions should be pure.
What this means is that any parameters you pass into a function should not change after leaving the function. The parameter employee should be the same after you return to the calling scope.
There's a few syntactical errors you've made that make it pretty much impossible for me to deduce what you're trying to do in the first place. Please include this in the question.
Also, one last nitpick. As you know, the last expression in an F# function is it's return value. Instead of using an if statement, just return the condition you're testing, like this:
let ...
...
employee.SalaryType = SalaryType <- but remember, you can't use dot notation on tuples; this is just an example
Please read more on
https://learn.microsoft.com/en-us/dotnet/fsharp/language-reference/

Boost spirit can handle Postscript/PDF like languages?

I noticed that Boost spirit offers some limits, in a question here on SO there is an user asking for help about boost spirit and the other user who gave the answer specified that boost spirit works well with statements and not with "generic text" ( I'm sorry if I don't recall it correctly ).
Now I would like to think about Postscript and PDF in terms of tokens and simplify my approach to this formats this way, the problem is that the PDF is kind of a mix between a markup language and a programming language with jumps and tables in it, and I can't think about something similar when considering the most popular file formats like XML, C++ code and others languages and formats.
There is also another fact: I can't really find people that had some kind of experience with boost::spirit wiriting a pdf parser or writer, so I'm asking, boost::spirit it's capable of parsing a PDF file and output the elements as tokens ?
Although this has nothing to do with Boost, let me assure you that the parsing of PDF (and PostScript) are about as trivial as you could want. Let's say that you have a scanner object that returns a series of tokens. The token types you will get from the scanner are:
String
Dict begin (<<)
Dict End (>>)
Name (/whatever)
Number
Hex array
Left Angle (<)
Right Angle (>)
Array begin ([)
Array end (])
Procedure begin ({)
Procedure end (})
Comment (%foo)
Word
My scanner is a finite-state automata with states for Start, Comment, String, HexArray, Token, DictEnd, and Done.
The way you parse PDF is not by parsing it, but by executing it. Given these tokens, my "parser" looks like this (in C#):
while (true) {
MLPdfToken = scanner.GetToken();
if (token == null)
return MachineExit.EndOfFile;
PdfObject obj = PdfObject.FromToken(token);
PdfProcedure proc = obj as PdfProcedure;
if (proc != null)
{
if (IsExecuting())
{
if (token.Type == PdfTokenType.RBrace)
proc.Execute(this);
else
Push(obj);
}
else {
proc.Execute(this);
}
if (proc.IsTerminal)
return Machine.ParseComplete;
}
else {
Push(obj);
}
}
I'll also add that if you give every PdfObject an Execute() method such that the base class implementation is machine.Push(this) and IsTerminal that returns false, the REPL gets easier:
while (true) {
MLPdfToken = scanner.GetToken();
if (token == null)
return MachineExit.EndOfFile;
PdfObject obj = PdfObject.FromToken(token);
if (IsExecuting())
{
if (token.Type == PdfTokenType.RBrace)
obj.Execute(this);
else
Push(obj);
}
else {
obj.Execute(this);
if (obj.IsTerminal)
return Machine.ParseComplete;
}
}
There's more support in Machine - Machine has a Stack of PdfObject and a few methods for accessing it (Push, Pop, Mark, CountToMark, Index, Dup, Swap), as well as ExecProcBegin and ExecProcEnd.
Beyond that, it's very light. The only thing that is slightly odd is that PdfObject.FromToken takes a token and if it is a primitive type (number, string, name, hex, bool) returns a corresponding PdfObject. Otherwise, it takes the given token and looks in a "proc set" dictionary of procedure names associated with PdfProcedure objects. So when you encounter the token << that gets looked up in a the proc set and comes up with this code:
void DictBegin(PdfMachine machine)
{
machine.Push(new PdfMark(PdfMarkType.Dictionary));
}
So << really means "mark the stack as the start of a dictionary. >> gets more interesting:
void DictEnd(PdfMachine machine)
{
PdfDict dict = new PdfDict();
// PopThroughMark pops the entire stack up to the first matching mark,
// throws an exception if it fails.
PdfObject[] arr = machine.PopThroughMark(PdfMarkType.Dictionary);
if ((arr.Length & 1) != 0)
throw new PdfException("dictionaries need an even number of objects.");
for (int i=0; i < arr.Length; i += 2)
{
PdfObject key = arr[i], val = arr[i + 1];
if (key.Type != PdfObjectType.Name)
throw new PdfException("dictionaries need a /name for the key.");
dict.put((PdfName)key, val);
}
machine.Push(dict);
}
So >> Pops up to the nearest dictionary mark into an array then puts each pair into the dictionary. Now, I could have done this without allocating the array. I could just pop pairs, putting them into the dictionary until I either hit the mark, fail to get a name or underflow the stack.
The important takeaway is that there really isn't any syntax in PDF, nor is there any in PostScript. At least not so much as you'd notice. The only real Syntax (and the read-eval-(push) loop shows it) is '}'.
So when you this is a PDF 14 0 obj << /Type /Annot /SubType /Square >> endobj what your really seeing is a series of procedures:
Push 14
Push 0
Execute obj (Pop two numbers and push a "definition" object).
Execute dictionary begin
Push /Type
Push /Annot
Push /SubType
Push /Square
Execute dictionary end
Execute endobj (pop the top object and then get (not pop) the next one. If the second is a definition, set its "value" to the first object, else throw).
Since "endobj" is terminal, parsing ends and the top of the stack is the result.
So when you are asked to look up object 14 in the PDF, the cross-reference table tells you where to seek to, you make a new Machine with the stream pointer at that location and run it. If the top of the stack is a "definition" object, you've succeeded.
About now you should be nodding but not trusting me, since you're thinking about PDF streams, which look like this:
<< [/key value]* >> stream ...raw data... endstream endobj
Again, there is no syntax. The proc stream looks at the top of the stack, which should be a PdfDict. If it is, it consumes characters until the next newline (scanner does this), stores the current file position in the stream as data start, reads the stream length from the dict (which may cause another Machine to get newed up), and skips past the end of stream and pushes the new stream object on the stack. endstream is a no-op. The only difference between a PdfDict and a PdfStream is that a PdfStream has a start position and a bool saying that it's a stream, otherwise I dual-purpose the object.
PostScript is almost identical except that the execution environment is a little more complex. For example, you need several stacks in your machine: a parameter stack, a dictionary stack, and an execution stack. From there, you more or less just bind your tokenizer into the set of primitive procedures as well as the word exec, and then most of your interpreter is written in PS itself.
If you're talking about boost, you're looking at C++, which means that you can't be as fast and loose with memory as I am, so you'll want to either use smart pointers or figure out where you scope is and be careful to dispose objects instead of blithely throwing them away, but that's just the normal C++ stuff.
Currently, I make PDF tools for my company in .NET, but in a former life I worked on Acrobat versions 1-4, and most of what I described is exactly what Acrobat did under the hood (well, more or less - it was C, not C++, but it's the same approach).
With respect to the xref table (or xref stream), you read that first - the spec tells you that if you jump to EOF and scan back, you find the start of the xref table. You parse that (which is a CS 101 assignment), parse the trailer, seek to the /Prev if any and repeat until no more /Prev entries. That gives you a complete xref for looking up objects.
As for writing - there are a number of approaches that you can take. The most obvious one is that when an object is meant to be referenced, you create a new reference object by assigning the newest available xref entry to it. Whenever objects refer to other objects for writing, they ask if these objects are referenced. If they are, they write the reference (ie, 14 0 R). When it comes time to write a referenced object, you get the current stream pointer and store it in the xref, then write <objnum> <generation> obj <object contents> endobj. For example, my code to write a dictionary looks like this:
public override ToStream(PdfStreamingContext context)
{
if (context.HasReference(this)) // is object referenced in xref
{
PdfUtils.WriteObjectDefinitionBegin(this, context);
}
context.Writer.Indent();
context.Writer.WriteLine("<<");
WriteContents(context);
context.Writer.Exdent();
context.Writer.Writeline(">>");
if (context.HasReference(this))
{
PdfUtils.WriteObjectDefinitionEnd(this, context);
}
}
I've chopped out some chaff so you can see the wheat underneath. The context is an object that holds a new xref table as well as an object for writing to streams that automagically handles appropriate newline discipline, indentation, line wrapping, and so on.
What you should see is that the basics here are straight forward, if not trivial. And now's when you should be asking yourself the question, "if it's trivial, how come there isn't more (serious) competition for Acrobat in the market? The answer is that even though it's trivial, it's still easy to write PDFs that aren't spec compliant and Acrobat handles most of those. The real challenge is to be able to honor the spec and make sure that you include all required values in a dictionary and that they are in range and semantically correct. Hell, even the date time format--which is pretty well-specified--is a mound of special case code in my library to manage where other people have screwed it up royally. Being able to generate consistently correct PDF is hard and consuming the garbage in the sea of PDFs in the world is harder.
I could (and probably should) write a book about how to do this. While a lot of the fringe code is grubby, the overall structure can be very pretty.
tl;dr - If you're thinking of a recursive descent parser for PDF, you're thinking too hard. All you need is a tokenizer and a simple REPL.

Is it necessary to invent IDs for wxErlang widgets?

I've been poking around with Erlang's wx module and this tutorial. I haven't used wxwidgets before, so maybe this is just how it's done, but this code seems really terrible to me:
%% create widgets
T1001 = wxTextCtrl:new(Panel, 1001,[]),
ST2001 = wxStaticText:new(Panel, 2001,"Output Area", []),
B101 = wxButton:new(Panel, 101, [{label, "&Countdown"}]),
B102 = wxButton:new(Panel, ?wxID_EXIT, [{label, "E&xit"}]),
wxFrame:show(Frame),
Do people really have to assign widget IDs to widgets when creating them? Is it normal to name the variable that points to the widget after the widget's ID?
I don't know about Erlang but in C++ (and in the other bindings I know about) it's often preferable to use wxID_ANY as the widget id, meaning that you don't care about its specific value, and then use Connect() to handle events from the widget. Explicit ids may be convenient if you need to find the widget by its id later (although you can also use widget names for this) or if you need a consecutive range for ids (e.g. it would make sense to use ids 100, 101, ..., 109 for the buttons of a calculator as you could then easily deduce each buttons value from its id) but there is no need to always use them.
As for the naming, there is, of course, no need to use this strange convention (and a quick look at the tutorial shows that it's a personal preference of the author -- which, needless to say, I don't share).
Like VZ mentioned above, you can use wxID_ANY if you won't need to lookup a widget by its id later.
However, I believe that not only it's not normal to name the variables after the ids, but rather it's a very bad idea to do so. Name your variables according to their meaning and not using some obscure id.
Also, you'd better define the ids you need and give them proper (semantic) names, so that they are defined in only one place and you can later easily change the ids without affecting your program at all, like this:
-define(ID_TEXT_CTRL, 1001).
-define(ID_OUTPUT_AREA, 2001).
-define(ID_COUNTDOWN_BUTTON, 101).
-define(ID_EXIT_BUTTON, ?wxID_EXIT).
TextCtrl = wxTextCtrl:new(Panel, ?ID_TEXT_CTRL,[]),
OutputArea = wxStaticText:new(Panel, ?ID_OUTPUT_AREA,"Output Area", []),
CountdownButton = wxButton:new(Panel, ?ID_COUNTDOWN_BUTTON, [{label, "&Countdown"}]),
ExitButton = wxButton:new(Panel, ?ID_EXIT_BUTTON, [{label, "E&xit"}])
You may put the definitions either in your .erl file, if it's only one, or in an .hrl file, which you'll have to include in all your GUI-related .erl files.
No. The cases in which you would want to look something up by ID are roughly the same cases in which you would want to look something up by ID in C++. This applies across any widget library I can think of -- every time you react to a signal coded some_standard_button_name and matches a label like ?wxID_OK you are waiting for a numeric ID that is represented by a label hidden by a macro. Most GUI libraries do a lot of preprocessing to wash this out, so often you don't notice it (in the case of sooper-dooper libraries like Qt its still going on, just in the background, and all of your code is run through a precompiler before it becomes "real" C++...).
So how do you get a hold of a wx thingy that's been created? By using its returned reference.
Almost every wx*:new() call returns an object reference[note1]. This is an abstract reference (internally a tuple, but don't count on that) with enough information for the Erlang bindings and the Wx system processes to talk unambiguously about specific Wx objects that have been created. Passing these references around is the typical way of accessing Wx objects later on:
GridSz = wxFlexGridSizer:new(2, 2, 4, 4),
ok = wxFlexGridSizer:setFlexibleDirection(GridSz, ?wxHORIZONTAL),
ok = wxFlexGridSizer:addGrowableCol(GridSz, 1),
A less obvious case, though, is when you want something like a grid of input fields that you can cycle through or pull by key value:
Scripts = [katakana, hiragana, kanji, latin],
Ilks = [family, given, middle, maiden],
Rows = [{Tag, j(J, Tag)} || Tag <- Scripts],
Cols = [{Tag, j(J, Tag)} || Tag <- Ilks],
{GridSz, Fields} = zxw:text_input_grid(Dialog, Rows, Cols),
% Later on, extracting only present values as
Keys = [{S, I} || S <- Scripts, I <- Ilks],
Extract =
fun(Key, Acc) ->
case wxTextCtrl:getValue(proplists:get_value(Key, Fields)) of
"" -> Acc;
Val -> [{Key, Val} | Acc]
end
end,
NewParts = lists:foldl(Extract, [], Keys),
And so on. (zxw:text_input_grid/3 definition, and docs)
The one time you really do want to reference an object by its ID and not its object reference is the same as in C++: when you are listening for a specific click event:
{AddressPicker, _, _, AddressSz} =
zxw:list_picker(Frame,
?widgetADDRESS, ?addADDRESS, ?delADDRESS,
AddressHeader, Addresses, j(J, address)),
And then later in the message handling loop of the generic wx_object:
handle_event(Wx = #wx{id = Id,
event = #wxCommand{type = command_button_clicked}},
State) ->
case Id of
?editNAME -> {noreply, edit_name(State)};
?editDOB -> {noreply, edit_dob(State)};
?editPORTRAIT -> {noreply, edit_portrait(State)};
?addCONTACT -> {noreply, add_contact_info(State)};
?delCONTACT -> {noreply, del_contact_info(State)};
?addADDRESS -> {noreply, add_address_info(State)};
?delADDRESS -> {noreply, del_address_info(State)};
_ ->
ok = unexpected(Wx),
{noreply, State}
end;
handle_event(Wx = #wx{id = Id,
event = #wxList{type = command_list_item_selected,
itemIndex = Index}},
State) ->
case Id of
?widgetCONTACT -> {noreply, update_selection(contact, Index, State)};
?widgetADDRESS -> {noreply, update_selection(address, Index, State)};
_ ->
ok = unexpected(Wx),
{noreply, State}
end;
The first clause deals specifically with clicks on non-standard buttons, and the second with list-control widget selection events to do some arbitrary things in the interface. While unwrapping the #wx{} event record isn't very visually appealing, using matching in clause formation makes this GUI code much easier to understand during maintenance than the gigantic cascade of checks, exception catches and follow-on if elif elif elif elif..., switch or case..break, etc. type code necessary in languages that lack matching.
In the above case all the specific IDs are labeled as macros, exactly the same way this is done in Wx in C++ and other C++ widget kits. Most of the time your needs are served simply by using the standard pre-defined Wx button types and reacting to them accordingly; the example above is from code that is forced to dive a bit below that because of some specific interface requirements (and the equivalent C++ code winds up essentially the same, but dramatically more verbose to accomplish the same task).
Some platforms in slightly higher level languages have different ways of dealing with the identity issue. iOS and Android widget kits (and QtQuick, for that matter) hide this detail behind something like a slightly more universally useful object reference rather than depending on IDs. That is to say those widget kits essentially store all widgets created in a hash of {ID => ObjReference}, pick the ID out of every signal, retrieve the object reference before passing control to your handling callbacks, and return the reference stored in the hash instead of just passing the ID through directly.
That's slick, but its not the way older widget kits bound to C-style enums-as-labels code work. When its all said and done computers still have only one real type: integers -- we invent all sorts of other stuff on top of this and enjoy the illusion of types and other fun.
We could do this ID-to-reference thing in Erlang also, but the way WxErlang code is typically written is to follow the C++ tradition of using object IDs behind macro labels for events you can't avoid identifying uniquely, and object references and standard labels for everything else.
The zx_widgets library employed above is a set of pre-defined meta-widgets that cover some of the most common cases of boilerplate field construction and return data structures that are easy to handle functionally. The OOP style of Wx isn't a very natural fit for Erlang in some ways (the looooooongest functions you'll ever write in Erlang are likely to be GUI code for this reason), so an extra layer is sometimes necessary to make the logic-bearing code jibe with the rest of Erlang. GUI code is pretty universally annoying, though, in any language and in any environment.
[note1: There are some weird, uncomfortable cases where a few C++-style mysteries leak through the bindings into your Erlang code, such as the magical environment creation procedure involved in using 2D graphics DC canvasses and whatnot.]

splitting space delimited entries into new columns in R

I am coding a survey that outputs a .csv file. Within this csv I have some entries that are space delimited, which represent multi-select questions (e.g. questions with more than one response). In the end I want to parse these space delimited entries into their own columns and create headers for them so i know where they came from.
For example I may start with this (note that the multiselect columns have an _M after them):
Q1, Q2_M, Q3, Q4_M
6, 1 2 88, 3, 3 5 99
6, , 3, 1 2
and I want to go to this:
Q1, Q2_M_1, Q2_M_2, Q2_M_88, Q3, Q4_M_1, Q4_M_2, Q4_M_3, Q4_M_5, Q4_M_99
6, 1, 1, 1, 3, 0, 0, 1, 1, 1
6,,,,3,1,1,0,0,0
I imagine this is a relatively common issue to deal with but I have not been able to find it in the R section. Any ideas how to do this in R after importing the .csv ? My general thoughts (which often lead to inefficient programs) are that I can:
(1) pull column numbers that have the special suffix with grep()
(2) loop through (or use an apply) each of the entries in these columns and determine the levels of responses and then create columns accordingly
(3) loop through (or use an apply) and place indicators in appropriate columns to indicate presence of selection
I appreciate any help and please let me know if this is not clear.
I agree with ran2 and aL3Xa that you probably want to change the format of your data to have a different column for each possible reponse. However, if you munging your dataset to a better format proves problematic, it is possible to do what you asked.
process_multichoice <- function(x) lapply(strsplit(x, " "), as.numeric)
q2 <- c("1 2 3 NA 4", "2 5")
processed_q2 <- process_multichoice(q2)
[[1]]
[1] 1 2 3 NA 4
[[2]]
[1] 2 5
The reason different columns for different responses are suggested is because it is still quite unpleasant trying to retrieve any statistics from the data in this form. Although you can do things like
# Number of reponses given
sapply(processed_q2, length)
#Frequency of each response
table(unlist(processed_q2), useNA = "ifany")
EDIT: One more piece of advice. Keep the code that processes your data separate from the code that analyses it. If you create any graphs, keep the code for creating them separate again. I've been down the road of mixing things together, and it isn't pretty. (Especially when you come back to the code six months later.)
I am not entirely sure what you trying to do respectively what your reasons are for coding like this. Thus my advice is more general – so just feel to clarify and I will try to give a more concrete response.
1) I say that you are coding the survey on your own, which is great because it means you have influence on your .csv file. I would NEVER use different kinds of separation in the same .csv file. Just do the naming from the very beginning, just like you suggested in the second block.
Otherwise you might geht into trouble with checkboxes for example. Let's say someone checks 3 out of 5 possible answers, the next only checks 1 (i.e. "don't know") . Now it will be much harder to create a spreadsheet (data.frame) type of results view as opposed to having an empty field (which turns out to be an NA in R) that only needs to be recoded.
2) Another important question is whether you intend to do a panel survey(i.e longitudinal study asking the same participants over and over again) . That (among many others) would be a good reason to think about saving your data to a MySQL database instead of .csv . RMySQL can connect directly to the database and access its tables and more important its VIEWS.
Views really help with survey data since you can rearrange the data in different views, conditional on many different needs.
3) Besides all the personal / opinion and experience, here's some (less biased) literature to get started:
Complex Surveys: A Guide to Analysis Using R (Wiley Series in Survey Methodology
The book is comparatively simple and leaves out panel surveys but gives a lot of R Code and examples which should be a practical start.
To prevent re-inventing the wheel you might want to check LimeSurvey, a pretty decent (not speaking of the templates :) ) tool for survey conductors. Besides I TYPO3 CMS extensions pbsurvey and ke_questionnaire (should) work well too (only tested pbsurvey).
Multiple choice items should always be coded as separate variables. That is, if you have 5 alternatives and multiple choice, you should code them as i1, i2, i3, i4, i5, i.e. each one is a binary variable (0-1). I see that you have values 3 5 99 for Q4_M variable in the first example. Does that mean that you have 99 alternatives in an item? Ouch...
First you should go on and create separate variables for each alternative in a multiple choice item. That is, do:
# note that I follow your example with Q4_M variable
dtf_ins <- as.data.frame(matrix(0, nrow = nrow(<initial dataframe>), ncol = 99))
# name vars appropriately
names(dtf_ins) <- paste("Q4_M_", 1:99, sep = "")
now you have a data.frame with 0s, so what you need to do is to get 1s in an appropriate position (this is a bit cumbersome), a function will do the job...
# first you gotta change spaces to commas and convert character variable to a numeric one
y <- paste("c(", gsub(" ", ", ", x), ")", sep = "")
z <- eval(parse(text = y))
# now you assing 1 according to indexes in z variable
dtf_ins[1, z] <- 1
And that's pretty much it... basically, you would like to reconsider creating a data.frame with _M variables, so you can write a function that does this insertion automatically. Avoid for loops!
Or, even better, create a matrix with logicals, and just do dtf[m] <- 1, where dtf is your multiple-choice data.frame, and m is matrix with logicals.
I would like to help you more on this one, but I'm recuperating after a looong night! =) Hope that I've helped a bit! =)
Thanks for all the responses. I agree with most of you that this format is kind of silly but it is what I have to work with (survey is coded and going into use next week). This is what I came up with from all the responses. I am sure this is not the most elegant or efficient way to do it but I think it should work.
colnums <- grep("_M",colnames(dat))
responses <- nrow(dat)
for (i in colnums) {
vec <- as.vector(dat[,i]) #turn into vector
b <- lapply(strsplit(vec," "),as.numeric) #split up and turn into numeric
c <- sort(unique(unlist(b))) #which values were used
newcolnames <- paste(colnames(dat[i]),"_",c,sep="") #column names
e <- matrix(nrow=responses,ncol=length(c)) #create new matrix for indicators
colnames(e) <- newcolnames
#next loop looks for responses and puts indicators in the correct places
for (i in 1:responses) {
e[i,] <- ifelse(c %in% b[[i]],1,0)
}
dat <- cbind(dat,e)
}
Suggestions for improvement are welcome.

Erlang: What is most-wrong with this trie implementation?

Over the holidays, my family loves to play Boggle. Problem is, I'm terrible at Boggle. So I did what any good programmer would do: wrote a program to play for me.
At the core of the algorithm is a simple prefix trie, where each node is a dict of references to the next letters.
This is the trie:add implementation:
add([], Trie) ->
dict:store(stop, true, Trie);
add([Ch|Rest], Trie) ->
% setdefault(Key, Default, Dict) ->
% case dict:find(Key, Dict) of
% { ok, Val } -> { Dict, Val }
% error -> { dict:new(), Default }
% end.
{ NewTrie, SubTrie } = setdefault(Ch, dict:new(), Trie),
NewSubTrie = add(Rest, SubTrie),
dict:store(Ch, NewSubTrie, NewTrie).
And you can see the rest, along with an example of how it's used (at the bottom), here:
http://gist.github.com/263513
Now, this being my first serious program in Erlang, I know there are probably a bunch of things wrong with it… But my immediate concern is that it uses 800 megabytes of RAM.
So, what am I doing most-wrong? And how might I make it a bit less-wrong?
You could implement this functionality by simply storing the words in an ets table:
% create table; add words
> ets:new(words, [named_table, set]).
> ets:insert(words, [{"zed"}]).
> ets:insert(words, [{"zebra"}]).
% check if word exists
> ets:lookup(words, "zed").
[{"zed"}]
% check if "ze" has a continuation among the words
78> ets:match(words, {"ze" ++ '$1'}).
[["d"],["bra"]]
If trie is a must, but you can live with a non-functional approach, then you can try digraphs, as Paul already suggested.
If you want to stay functional, you might save some bytes of memory by using structures using less memory, for example proplists, or records, such as -record(node, {a,b,....,x,y,z}).
I don't remember how much memory a dict takes, but let's estimate. You have 2.5e6 characters and 2e5 words. If your trie had no sharing at all, that would take 2.7e6 associations in the dicts (one for each character and each 'stop' symbol). A simple purely-functional dict representation would maybe 4 words per association -- it could be less, but I'm trying to get an upper bound. On a 64-bit machine, that'd take 8*4*2.7 million bytes, or 86 megabytes. That's only a tenth of your 800M, so something's surely wrong here.
Update: dict.erl represents dicts with a hashtable; this implies lots of overhead when you have a lot of very small dicts, as you do. I'd try changing your code to use the proplists module, which ought to match my calculations above.
An alternative way to solve the problem is going through the word list and see if the word can be constructed from the dice. That way you need very little RAM, and it might be more fun to code. (optimizing and concurrency)
Look into DAWGs. They're much more compact than tries.
I don't know about your algorithm, but if you're storing that much data, maybe you should look into using Erlang's built-in digraph library to represent your trie, instead of so many dicts.
http://www.erlang.org/doc/man/digraph.html
If all words are in English, and the case doesn't matter, all characters can be encoded by numbers from 1 to 26 (and in fact, in Erlang they are numbers from 97 to 122), reserving 0 for stop. So you can use the array module as well.

Resources