I have an application which models a data domain using some deeply nested record structures. A contrived but analogous example would be something like:
Book
- Genre
- Author
  - Hometown
    - Country
I've found that when writing queries using Hasql (or Hasql-TH, to be more precise), I end up with an enormous function which takes a huge tuple and constructs my record by effectively consuming the tuple tail-first, building the nested record types, and finally putting it all together in one big type (including transforming some of the raw values, etc.). It ends up looking something like this:
bookDetailStatement :: Statement BookID (Maybe Book)
bookDetailStatement = dimap
  (\ (BookID a) -> a) -- extract the actual ID from the container
  (fmap mkBook)       -- process the record if it exists
  [maybeStatement|
    select
      (some stuff)
    from books
    join genres on (...)
    join authors on (...)
    join towns on (...)
    join countries on (...)
    where books.id = $1 :: int4
    limit 1
  |]

mkBook (
    -- Book
    book_id, book_title, ...
    -- Genre
    genre_name, ...
    -- Author
    author_id, author_name, ...
    -- Town
    town_name, town_coords, ...
    -- Country
    country_name, ...
  ) = let {- some data processing -} in Book {..}
This has been a bit annoying to write and to maintain / refactor, and I was thinking about trying to remodel it using Control.Applicative. That got me thinking that this is essentially a type of parser (a bit like Megaparsec) where we are consuming an input stream and then want to compose parsing functions which take some "tokens" from that stream and return results wrapped in the Parsing Functor (which really should be a Monad I think). The only difference is that, since these results are nested, they also need to consume the outputs of previous parsers (although actually you can do this with Megaparsec too, and with Control.Applicative). This would allow smaller functions mkCountry, mkTown, mkAuthor, etc. which could be composed with <*> and <$>.
So, my question is basically twofold:
(1) is this a reasonable (or even common) approach to real-world applications of this kind, or am I missing some sort of obvious optimisation which would allow this code to be more composable;
(2) if I were to implement this, is a good route to adapt Megaparsec to the job (basically writing a tokeniser for the query result I think), or would it be simpler to write a data type to contain the query result and output value and then add the Monad and Applicative instance definition?
If I understand you correctly your question is about constructing the mkBook mapping function by composing from smaller pieces.
What does that function do? It maps data from denormalised form (a tuple of all the produced fields) to your domain-specific structure consisting of other structures. It is a very basic pure function, where you just move data around based on your domain logic. So the problem sounds like a domain problem. As such it is not general but specific to the domain of your application, and trying to abstract over it will likely result in neither a reusable abstraction nor a simpler codebase.
If you discover patterns inside such functions, those are likely to be domain-specific as well. I can advise nothing better than to just wrap them in other pure functions and to compose by simply calling them. No need for applicatives or monads.
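For example, staying with the book schema (a sketch only: the record fields and column order below are invented, since the question doesn't show the real ones), mkBook can remain a plain pure function that just slices the row tuple and delegates to one small pure helper per nested record:
{-# LANGUAGE OverloadedStrings #-}
-- Sketch only: the types, fields and column order are made up for illustration.
import Data.Int (Int32)
import Data.Text (Text)

data Country = Country { countryName :: Text } deriving Show
data Town    = Town { townName :: Text, townCoords :: (Double, Double), townCountry :: Country } deriving Show
data Author  = Author { authorId :: Int32, authorName :: Text, authorHometown :: Town } deriving Show
data Genre   = Genre { genreName :: Text } deriving Show
data Book    = Book { bookId :: Int32, bookTitle :: Text, bookGenre :: Genre, bookAuthor :: Author } deriving Show

-- One small pure helper per nested record...
mkCountry :: Text -> Country
mkCountry = Country

mkTown :: (Text, (Double, Double), Text) -> Town
mkTown (name, coords, country) = Town name coords (mkCountry country)

mkAuthor :: (Int32, Text, (Text, (Double, Double), Text)) -> Author
mkAuthor (aid, aname, town) = Author aid aname (mkTown town)

-- ...and mkBook just slices the row tuple and calls them.
mkBook :: (Int32, Text, Text, Int32, Text, Text, (Double, Double), Text) -> Book
mkBook (bid, title, genre, aid, aname, tname, tcoords, cname) =
  Book bid title (Genre genre) (mkAuthor (aid, aname, (tname, tcoords, cname)))

main :: IO ()
main = print (mkBook (1, "Dune", "Sci-Fi", 7, "Frank Herbert", "Tacoma", (47.25, -122.44), "USA"))
Refactoring then stays local: adding a column to the authors join only touches mkAuthor and the slice of the tuple handed to it.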
Concerning parsing libraries and tokenisation, I really don't see how they have anything to do with the problem being discussed, but I may be missing your point. Also, I don't recommend bringing lenses in to solve such a trivial problem; you'll likely end up with a more complicated and less maintainable solution.
I've been trying to work out how to use the language-bash package to parse some simple bash scripts, and I've come across the following structure
Right (List [Statement (Last (Pipeline {timed = False, timedPosix = False, inverted = False, commands = [Command (SimpleCommand [Assign (Parameter "x" Nothing) Equals (RValue [Char '3'])] []) []]})) Sequential])
as a result of running
import Language.Bash.Parse
parse "" "x=3"
I could theoretically just pattern match the whole thing away, though I was wondering if there was a cleaner way of accessing the values of the Assign datatype ('x', Char '3').
Is there any way to cleanly access those values (or, more generally, values in a complex data structure) without obsessive pattern matching?
Not really.
Here's the problem. You probably want to either handle an extremely limited set of possible Bash statements, in which case just writing out the patterns for specific List values will be faster than anything else you could possibly do.
Or, you want to handle a wide variety of Bash statements, in which case you can't really avoid the functional infrastructure to handle general List values. The same way you'd write an interpreter or compiler for any complex abstract syntax tree, you'll end up more or less writing a function for every (major) type and a case for every constructor.
The main Haskell tools for dealing with big, complex data structures are:
The "functional infrastructure" described above. That is, plain old functions defined using pattern matching, that process recursive data structures in a manner that mirrors the structures themselves. Don't underestimate this approach! It may seem like a lot of work, but it's likely to lead you to a correct program that handles all well-formed inputs, in a way that ad hoc approaches won't. Start with:
{-# OPTIONS_GHC -Wall #-}
data M = ... some monad ...
data Result = ... representation of what you want to extract from the script ...
processList :: List -> M Result
...
processStatement :: Statement -> M Result
...
and go from there. The -Wall is important to get the -Wincomplete-patterns warning so you don't miss any constructors.
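Specialised to the exact question asked, that skeleton might look something like the following sketch. The constructors used are the ones visible in the parse output shown in the question (List, Statement, Last, Pipeline and its commands field, Command, SimpleCommand, Assign, Parameter); double-check them against the language-bash version you actually have installed.
{-# OPTIONS_GHC -Wall #-}
import Language.Bash.Parse (parse)
import Language.Bash.Syntax
import Language.Bash.Word (Parameter (..))

-- Collect (name, right-hand side) for every plain assignment in a script.
assignments :: List -> [(String, RValue)]
assignments (List stmts) = concatMap fromStatement stmts
  where
    fromStatement (Statement andor _term) = fromAndOr andor
    fromAndOr (Last p)     = fromPipeline p
    fromAndOr (And p rest) = fromPipeline p ++ fromAndOr rest
    fromAndOr (Or p rest)  = fromPipeline p ++ fromAndOr rest
    fromPipeline p = concatMap fromCommand (commands p)
    fromCommand (Command (SimpleCommand assigns _words) _redirs) =
      [ (name, rhs) | Assign (Parameter name _sub) _op rhs <- assigns ]
    fromCommand _ = []

main :: IO ()
main = either print (print . assignments) (parse "" "x=3")
It is exactly the "a function for every type, a case for every constructor" pattern, just cut down to the constructors this particular task cares about.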
Lenses, which provide a more ergonomic hierarchical syntax for referring to parts of deeply nested data structures. Since language-bash doesn't provide lenses for these structures, you'd need to write them yourself. They might allow you to write something along the lines of:
lst ^. _Right.statements._head.andOr.pipeline.commands.
_head._SimpleCommand.assignments._head.parameter.base
to extract the "x" from "x=3". Obviously, that doesn't help much, but lenses complement the "functional infrastructure" approach. The code to actually process all those types is often more easily expressed with lenses than pattern matching.
Generics, which allow you to generically access certain patterns within recursive data structures, while ignoring the "rest" of the data structure that you don't care about. The language-bash library includes deriving clauses for both Data and Generic generics. If it didn't, you could use StandaloneDeriving clauses to derive them. As an example, you can use Data generics to extract all Parameters from a List, regardless of where those Parameters appear, with something like:
import Language.Bash.Parse
import Language.Bash.Word
import Data.Data
import Data.Generics.Schemes
import Data.Generics.Aliases
parameters :: (Data a) => a -> [Parameter]
parameters = everything (++) (mkQ [] (\p -> [p]))
main = do
  let Right lst = parse "" "x=3; y=4; LANG=C echo $x $y"
  print $ parameters lst
Here, this prints out a list of all parameters appearing in this shell "script", whether for purposes of assignment or substitution, so it includes: "x", "y", "LANG", and "x" and "y" again.
This is a powerful tool, but it's likely to be applicable to only a few specific use-cases.
Ultimately, you'll probably have to take the view that you are writing a Bash interpreter (even if your interpreter does something besides "executing" the Bash script). Someone's been nice enough to supply a Bash parser to get the input source code into an AST, but the other half of the interpreter -- the actual interpretation itself -- still needs to be written by you.
I was given a fragment of code (a function called bubbleSort(), written in Java, for example). How can I, or rather my program, tell if a given source code implements a particular sorting algorithm the correct way (using bubble method, for instance)?
I can force a user to provide a legitimate function by analyzing the function signature: making sure the argument and return value is an array of integers. But I have no idea how to determine that the algorithm logic is done the right way. The input code could sort values correctly, but not with the aforementioned bubble method. How can my program discern that? I do realize a lot of code parsing would be involved, but maybe there's something else that I should know.
I hope I was somewhat clear.
I'd appreciate if someone could point me in the right direction or give suggestions on how to tackle such a problem. Perhaps there are tested ways that ease the evaluation of program logic.
In general, you can't do this because of the Halting problem. You can't even decide if the function will halt ("return").
As a practical matter, there's a bit more hope. If you are looking for a bubble sort, you can decide that it has a number of parts:
a to-be-sorted datatype S with a partial order,
a container data type C with a single instance variable ("the array") that holds the to-be-sorted data,
a key type K ("array index"), itself with a partial order, used to access the container such that container[K] has type S,
a comparison of two members of the container, using keys A and B such that A < B according to the key partial order, that determines whether container[B] > container[A],
a swap operation on container[A], container[B] and some variable T of type S, that is conditionally dependent on the comparison,
a loop wrapped around the container that enumerates keys according to the partial order on K.
You can build bits of code that find each of these bits of evidence in your source code, and if you find them all, claim you have evidence of a bubble sort.
To do this concretely, you need standard program analysis machinery:
to parse the source code and build an abstract syntax tree
build symbol tables (ST) that know the type of each identifier where it is used
construct a control flow graph (CFG) so that you can check that the various recognized bits occur in the appropriate order
construct a data flow graph (DFG), so that you can determine that values recognized in one part of the algorithm flow properly to another part
[That's a lot of machinery just to get started]
From here, you can write ad hoc procedural code to climb over the AST, ST, CFG and DFG to "recognize" each of the individual parts. This is likely to be pretty messy, as each recognizer will be checking these structures for evidence of its bit. But you can do it.
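For instance, here is a toy sketch (Haskell, with a made-up statement AST) of one such evidence-finder, the swap-through-a-temporary; a real tool would of course work over the AST, symbol table and flow graphs of the target language rather than over a toy type:
-- Toy AST, invented for illustration only.
data Expr = Var String | Index String Expr | IntLit Int
  deriving (Eq, Show)

data Stmt = Assign Expr Expr | If Expr [Stmt] | While Expr [Stmt]
  deriving (Eq, Show)

-- Recognize   t := a;  a := b;  b := t;   at the head of a statement list,
-- returning (temporary, first slot, second slot) on success.
matchSwap :: [Stmt] -> Maybe (Expr, Expr, Expr)
matchSwap (Assign t a : Assign a' b : Assign b' t' : _)
  | a == a' && b == b' && t == t' = Just (t, a, b)
matchSwap _ = Nothing

main :: IO ()
main = print (matchSwap
  [ Assign (Var "t") (Index "xs" (Var "i"))
  , Assign (Index "xs" (Var "i")) (Index "xs" (Var "j"))
  , Assign (Index "xs" (Var "j")) (Var "t")
  ])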
This is messy enough, and interesting enough, that there are tools which can do much of this for you.
Our DMS Software Reengineering Toolkit is one. DMS already contains all the machinery to do standard program analysis for several languages. DMS also has a dataflow pattern matching language, inspired by Rich and Waters' 1980s "Programmer's Apprentice" ideas.
With DMS, you can express this particular problem roughly like this (untested):
dataflow pattern domain C;
dataflow pattern swap(in out v1:S, in out v2:S, T:S):statements =
" \T = \v1;
\v1 = \v2;
\v2 = \T;";
dataflow pattern conditional_swap(in out v1:S, in out v2:S,T:S):statements=
" if (\v1 > \v2)
\swap(\v1,\v2,\T);"
dataflow pattern container_access(inout container C, in key: K):expression
= " \container.body[\K] ";
dataflow pattern size(in container:C, out: integer):expression
= " \container . size "
dataflow pattern bubble_sort(in out container:C, k1: K, k2: K):function
" \k1 = \smallestK\(\);
while (\k1<\size\(container\)) {
\k2 = \next\(k1);
while (\k2 <= \size\(\container\)) {
\conditional_swap\(\container_access\(\container\,\k1\),
\container_access\(\container\,\k2\) \)
}
}
";
Within each pattern, you can write what amounts to the concrete syntax of the chosen programming language ("pattern domain"), referencing dataflows named in the pattern signature line. A subpattern can be mentioned inside another; one has to pass the dataflows to and from the subpattern by naming them. Unlike "plain old C", you have to pass the container explicitly rather than by implicit reference; that's because we are interested in the actual values that flow from one place in the pattern to another. (Just because two places in the code use the same variable, doesn't mean they see the same value).
Given these definitions and a request to "match bubble_sort", DMS will visit the DFG (tied to the CFG/AST/ST) to try to match the pattern; where it matches, it will bind the pattern variables to the DFG entries. If it can't find a match for everything, the match fails.
To accomplish the match, each of the patterns above is converted essentially into its own DFG, and then each pattern is matched against the DFG for the code using what is called a subgraph isomorphism test. Constructing the DFG for the pattern takes a lot of machinery: parsing, name resolution, control and data flow analysis, applied to fragments of code in the original language, intermixed with various pattern meta-escapes. The subgraph isomorphism is "sort of easy" to code, but can be very expensive to run. What saves the DMS pattern matchers is that most patterns have many, many constraints [tech point: and they don't have knots] and each attempted match tends to fail pretty fast, or succeed completely.
Not shown, but by defining the various bits separately, one can provide alternative implementations, enabling the recognition of variations.
We have used this to implement quite complete factory control model extraction tools from real industrial plant controllers for Dow Chemical, on their peculiar Dowtran language (which meant building parsers, etc., as above, for Dowtran). We have a version of this prototyped for C; the data flow analysis is harder.
Given a line such as
1 pound of Beef
I want to extract the ingredient. Initially I'm only interested in the ingredient name.
I've looked at Ruby's famous time parser Chronic and like its use of regexes.
def self.scan_for_month_names(token)
scanner = {/^jan\.?(uary)?$/ => :january,
/^feb\.?(ruary)?$/ => :february,
/^mar\.?(ch)?$/ => :march,
/^apr\.?(il)?$/ => :april,
/^may$/ => :may,
/^jun\.?e?$/ => :june,
/^jul\.?y?$/ => :july,
/^aug\.?(ust)?$/ => :august,
/^sep\.?(tember)?$/ => :september,
/^oct\.?(ober)?$/ => :october,
/^nov\.?(ember)?$/ => :november,
/^dec\.?(ember)?$/ => :december}
scanner.keys.each do |scanner_item|
return Chronic::RepeaterMonthName.new(scanner[scanner_item]) if scanner_item =~ token.word
end
return nil
end
However, in my case I'd probably have to create over 300 regexes, one for each individual ingredient.
I'd also have to take into account synonyms such as Cilantro and Coriander Leaf.
I've never done parsing before, but is the use of regexes here still the best way to go? I can't think of any other reasonable alternative.
Firstly, I'm assuming that the ingredients don't always take the form of QUANTITY UNIT of INGREDIENT - otherwise, this would be a very trivial task (just copy the substring after "of").
This is a difficult problem - the solution will not be simple.
I think using regex may not be the best approach here:
As you mention, you'll have to write a lot of expressions, one for each ingredient.
Your list of possible ingredients will always be limited by the regex list, and you can't detect new ingredients without compiling more.
It will be very difficult to parse some ingredients (cheese, 1 pound (parmesan)).
I think that natural language processing is the way to go here. You have unstructured input, but in a very restricted context.
Perhaps counter-intuitively, I think the best way to find the ingredient may very well be to not look for it - look for everything else instead. If you assume that a line will always have
a numeral (quantity)
a unit (pounds, teaspoons, etc)
an ingredient
and that it's pretty easy to detect numerals and units, it should be straightforward to recognize those first and then extract the ingredient (a rough sketch follows below).
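Here is that "recognize everything else first" idea sketched out (written in Haskell purely for concreteness; the Line record, the unit list and the helper names are all made up): strip a leading quantity, then a known unit, then an optional "of", and treat whatever is left as the ingredient.
import Data.Char (toLower)

data Line = Line
  { quantity   :: Maybe Double
  , unit       :: Maybe String
  , ingredient :: String
  } deriving Show

-- Obviously incomplete; a real list would be much longer.
knownUnits :: [String]
knownUnits = [ "pound", "pounds", "cup", "cups", "teaspoon", "teaspoons"
             , "tablespoon", "tablespoons", "gram", "grams", "kg", "ml" ]

parseLine :: String -> Line
parseLine s = Line qty u (unwords rest3)
  where
    ws = words s
    -- leading numeral, if any
    (qty, rest1) = case ws of
      (w : rest) | [(n, "")] <- (reads w :: [(Double, String)]) -> (Just n, rest)
      _                                                         -> (Nothing, ws)
    -- known unit, if any
    (u, rest2) = case rest1 of
      (w : rest) | map toLower w `elem` knownUnits -> (Just w, rest)
      _                                            -> (Nothing, rest1)
    -- optional "of"; the remainder is taken to be the ingredient
    rest3 = case rest2 of
      (w : rest) | map toLower w == "of" -> rest
      _                                  -> rest2

main :: IO ()
main = mapM_ (print . parseLine) ["1 pound of Beef", "2 cups flour", "Cilantro"]
Anything this simple will misparse plenty of real lines, which is where tagging and classification come in.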
If you use a part-of-speech tagger, it's easy to identify relevant words:
[('1', 'LS'), ('pound', 'NN'), ('of', 'IN'), ('Beef', 'NNP')]
From there, you may want to use a classifier. For that, you'll need to label the ingredients manually on a good quantity of lines (say, hundreds). Some possibly good features to use:
position of the word in the line
presence in a precomputed ingredient dictionary (possibly using some partial string matching metric like Levenshtein distance)
output of the part-of-speech tagger
words immediately before and after (if you have an 'of' before the word, there's a high probability it's an ingredient)
I'm sure you'll be able to find countless others after working on a few lines.
Finally, I expect that some lines will be very difficult to work on. 1 pound of parmesan cheese, 1 pound of emmentaler: you'd have to infer that the second ingredient is a cheese, too.
As to software, if you can choose the language to use, Python has the fantastic Natural Language Toolkit. I can't vouch for toolkits in other languages, but maybe someone else will.
I think I would start by running a series of regex checks against each line, and adjust the parsed text as you go. For example (pseudocode):
First, check for instruction:
/^(add|fold in|stir in|etc...)/
If you found an instruction, remove it from the line, set a flag, and continue:
instruction = $1
this_line = this_line.substring(instruction.length())
If an instruction was found, check to see if there was a subsequent instruction (like "and cover" or "and set aside")
/\b(and\s)(.*)$/
If found, strip that and insert it before the next line of the recipe
instruction = instruction.substring(0, instruction.length - $1.length - $2.length)
splice $2 into the array of lines immediately following this one
Next, maybe you'll check for a preposition:
/((?:in)?to)\s+(.+)/
If found, you might use that to check for equipment names, bowls, measuring cups, etc.
Even if you don't use it, you can probably remove it from the string you're parsing, to improve your matching.
Finally, the real work is done with the text that's left:
Check against /^(\d+\s+(?:a\s)?\w+)\s*(?:of\s*)?(.+)$/
Which should give you $1 containing the quantity and unit of measure, and $2 containing the ingredient.
Lather. Rinse. Repeat.
After that, do whatever magic your app does with this information.
First of all, I suggest doing some searching to see if someone else has already created a solution to this problem which is good enough for you to use, rather than reinventing the wheel.
For instance, you may find this project to be interesting. It uses machine learning to attempt to parse ingredient phrases, including type of ingredients and amounts.
Other interesting projects also come up when googling for "ingredient parser".
If you are really determined to write this yourself, then I suggest that you do some research into the category of software tools known as "parser generators". A parser generator lets you describe the language you want to recognize in an abstract form (a "grammar") and then generates code in your language of choice which will parse text according to that grammar and identify specific subconstructs within it very efficiently (much more efficiently than could be done by hundreds of regular expression matches).
For instance, a grammar used as input to a parser generator might look something like this:
// I am making up the following syntax for demonstration purposes, but it illustrates the
// sort of things that one could specify in a grammar, and is not terribly different from
// the grammar languages that real parser generators use.
//
// Note that everything in the curly braces is code to be inserted into the generated parser.
// Each such code block will be invoked when the preceding parsing rule is matched.
%declare { bool organic=false; bool dried=false; bool smoked=false; }
INGREDIENT ::= "organic" INGREDIENT { organic=true; }
| INGREDIENT "(" "organic" ")" { organic=true; }
| "dried" INGREDIENT { dried=true; }
| "smoked" INGREDIENT { smoked=true; }
| AMOUNT "of" INGREDIENT
| INGREDIENT "(" AMOUNT ")"
| BASE_INGREDIENT
BASE_INGREDIENT ::= ( WORD )* {
doSomethingWithBaseIngredient(organic, dried, smoked, $BASE_INGREDIENT);
}
AMOUNT ::= NUMBER ( VOLUME_UNIT | WEIGHT_UNIT )
VOLUME_UNIT ::= "cup" | "liter"
WEIGHT_UNIT ::= "mg" | "kg" | "pound"
NUMBER ::= [0-9]+
WORD ::= [a-zA-Z]+
... and so forth.
The parser generator, when run, would take this grammar as input, and would generate code in your desired programming language as output. This code would parse input text according to the grammar and would also set variables and/or call functions of yours as desired when certain parsing rules are matched. The parsers generated by such tools often use special parsing techniques (often involving large tables, state machines, and so forth) to parse very efficiently in a single pass without having to do any more work than necessary, and avoiding backtracking when possible.
Some common examples of parser generators are lex/yacc, bison, and Antlr. Many others exist. (Personally, I have gotten good results with Antlr in the past, and am particularly fond of the fact that it can generate parsers in many different programming languages.) Many of these parser generators are mostly intended for use by compiler writers, but that does not mean they can't be used for other purposes, such as recognizing the various forms that ingredients in recipes take.
This article provides an overview of parser generators, and this article contains a table of various parser generators and their attributes (output languages, etc.) as well as links on where to find more.
I'm writing a program where I need to parse a JavaScript source file, extract some facts, and insert/replace portions of the code. A simplified description of the sorts of things I'd need to do is, given this code:
foo(['a', 'b', 'c']);
Extract 'a', 'b', and 'c' and rewrite the code as:
foo('bar', [0, 1, 2]);
I am using ANTLR for my parsing needs, producing C# 3 code. Somebody else had already contributed a JavaScript grammar. The parsing of the source code is working.
The problem I'm encountering is figuring out how to actually properly analyze and modify the source file. Each approach that I try to take in actually solving the problem leads me to a dead end. I can't help but think that I'm not using the tool as it's intended or am just too much of a novice when it comes to dealing with ASTs.
My first approach was to parse using a TokenRewriteStream and implement the EnterRule_* partial methods for the rules I'm interested in. While this seems to make modifying the token stream pretty easy, there is not enough contextual information for my analysis. It seems that all I have access to is a flat stream of tokens, which doesn't tell me enough about the entire structure of code. For example, to detect whether the foo function is being called, simply looking at the first token wouldn't work because that would also falsely match:
a.b.foo();
To allow me to do more sophisticated code analysis, my second approach was to modify the grammar with rewrite rules to produce more of a tree. Now, the first sample code block produces this:
Program
  CallExpression
    Identifier('foo')
    ArgumentList
      ArrayLiteral
        StringLiteral('a')
        StringLiteral('b')
        StringLiteral('c')
This is working great for analyzing the code. However, now I am unable to easily rewrite the code. Sure, I could modify the tree structure to represent the code I want, but I can't use this to output source code. I had hoped that the token associated with each node would at least give me enough information to know where in the original text I would need to make the modifications, but all I get are token indexes or line/column numbers. To use the line and column numbers, I would have to make an awkward second pass through the source code.
I suspect I'm missing something in understanding how to properly use ANTLR to do what I need. Is there a more proper way for me to solve this problem?
What you are trying to do is called program transformation: the automated generation of one program from another. What you are doing "wrong" is assuming that a parser is all you need, and discovering that it isn't and that you have to fill in the gap.
Tools that do this well have parsers (to build ASTs), means to modify the ASTs (both procedural and pattern-directed), and prettyprinters which convert the (modified) AST back into legal source code. You seem to be struggling with the fact that ANTLR doesn't come with prettyprinters; that's not part of its philosophy; ANTLR is a (fine) parser-generator. Other answers have suggested using ANTLR's "string templates", which are not by themselves prettyprinters but can be used to implement one, at the price of implementing one. This is harder to do than it looks; see my SO answer on compiling an AST back to source code.
The real issue here is the widely made but false assumption that "if I have a parser, I'm well on my way to building complex program analysis and transformation tools." See my essay on Life After Parsing for a long discussion of this; basically, you need a lot more tooling than "just" a parser to do this, unless you want to rebuild a significant fraction of the infrastructure yourself instead of getting on with your task. Other useful features of practical program transformation systems typically include source-to-source transformations, which considerably simplify the problem of finding and replacing complex patterns in trees.
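To make "finding and replacing complex patterns in trees" concrete, here is the question's rewrite written as a plain tree-to-tree function over a toy AST (a Haskell sketch with made-up constructors; it has nothing to do with ANTLR's or DMS's actual APIs):
-- Toy expression AST, invented for illustration.
data Expr
  = Call String [Expr]   -- foo(arg1, arg2, ...)
  | Array [Expr]         -- ['a', 'b', 'c']
  | Str String           -- 'a'
  | Num Int              -- 0
  deriving Show

-- foo(['a','b','c'])  ==>  foo('bar', [0, 1, 2])
rewrite :: Expr -> Expr
rewrite (Call "foo" [Array elems]) =
  Call "foo" [Str "bar", Array [Num i | (i, _) <- zip [0 ..] elems]]
rewrite (Call f args) = Call f (map rewrite args)   -- recurse everywhere else
rewrite (Array es)    = Array (map rewrite es)
rewrite e             = e

main :: IO ()
main = print (rewrite (Call "foo" [Array [Str "a", Str "b", Str "c"]]))
A transformation system lets you state the same intent as declarative patterns over the real language's trees instead of hand-writing the traversal and the prettyprinting.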
For instance, if you had source-to-source transformation capabilities (as in our tool, the DMS Software Reengineering Toolkit), you'd be able to write parts of your example code changes using these DMS transforms:
domain ECMAScript.
tag replace; -- says this is a special kind of temporary tree
rule barize(function_name:IDENTIFIER,list:expression_list,b:body):
expression->expression
= " \function_name ( '[' \list ']' ) "
-> "\function_name( \firstarg\(\function_name\), \replace\(\list\))";
rule replace_unit_list(s:character_literal):
expression_list -> expression_list
replace(s) -> compute_index_for(s);
rule replace_long_list(s:character_list, list:expression_list):
expression_list -> expression_list
"\replace\(\s\,\list)-> "compute_index_for\(\s\),\list";
with rule-external "meta" procedures "first_arg" (which knows how to compute "bar" given the identifier "foo"; I'm guessing that's what you want to do) and "compute_index_for", which, given a string literal, knows what integer to replace it with.
Individual rewrite rules have parameter lists "(....)" in which slots representing subtrees are named, a left-hand side acting as a pattern to match, and a right-hand side acting as a replacement, both usually quoted in metaquotes ", which separate rewrite-rule-language text from target-language (e.g. JavaScript) text. There are lots of meta-escapes (the backslash-prefixed items) inside the metaquotes which indicate a special rewrite-rule-language item. Typically these are parameter names, which stand for whatever type of tree the parameter represents, or an external meta-procedure call (such as first_arg; you'll note that its argument list ( , ) is metaquoted!), or finally a "tag" such as "replace", which is a peculiar kind of tree that represents a future intent to do more transformations.
This particular set of rules works by replacing a candidate function call by the barized version, with the additional intent "replace" to transform the list. The other two transformations realize the intent by transforming "replace" away by processing elements of the list one at a time, and pushing the replace further down the list until it finally falls off the end and the replacement is done. (This is the transformational equivalent of a loop).
Your specific example may vary somewhat since you really weren't precise about the details.
Having applied these rules to modify the parsed tree, DMS can then trivially prettyprint the result (the default behavior in some configurations is "parse to AST, apply rules until exhaustion, prettyprint AST" because this is handy).
You can see a complete process of "define language", "define rewrite rules", "apply rules and prettyprint" at (High School) Algebra as a DMS domain.
Other program transformation systems include TXL and Stratego. We imagine DMS as the industrial strength version of these, in which we have built all that infrastructure including many standard language parsers and prettyprinters.
So it's turning out that I can actually use a rewriting tree grammar and insert/replace tokens using a TokenRewriteStream. Plus, it's actually really easy to do. My code resembles the following:
var charStream = new ANTLRInputStream(stream);
var lexer = new JavaScriptLexer(charStream);
var tokenStream = new TokenRewriteStream(lexer);
var parser = new JavaScriptParser(tokenStream);
var program = parser.program().Tree as Program;
var dependencies = new List<IModule>();
var functionCall = (
    from callExpression in program.Children.OfType<CallExpression>()
    where callExpression.Children[0].Text == "foo"
    select callExpression
).Single();
var argList = functionCall.Children[1] as ArgumentList;
var array = argList.Children[0] as ArrayLiteral;
tokenStream.InsertAfter(argList.Token.TokenIndex, "'bar', ");
for (var i = 0; i < array.Children.Count(); i++)
{
    tokenStream.Replace(
        (array.Children[i] as StringLiteral).Token.TokenIndex,
        i.ToString());
}
var rewrittenCode = tokenStream.ToString();
Have you looked at the StringTemplate library? It is by the same person who wrote ANTLR, and they are intended to work together. It sounds like it would do what you're looking for, i.e. output matched grammar rules as formatted text.
Here is an article on translation via ANTLR
I have been trying to explain the difference between switch statements and pattern matching (F#) to a couple of people, but I haven't really been able to explain it well. Most of the time they just look at me and say "so why don't you just use if..then..else".
How would you explain it to them?
EDIT! Thanks everyone for the great answers, I really wish I could mark multiple right answers.
Having formerly been one of "those people", I don't know that there's a succinct way to sum up why pattern-matching is such tasty goodness. It's experiential.
Back when I had just glanced at pattern-matching and thought it was a glorified switch statement, I think that I didn't have experience programming with algebraic data types (tuples and discriminated unions) and didn't quite see that pattern matching was both a control construct and a binding construct. Now that I've been programming with F#, I finally "get it". Pattern-matching's coolness is due to a confluence of features found in functional programming languages, and so it's non-trivial for the outsider-looking-in to appreciate.
I tried to sum up one aspect of why pattern-matching is useful in the second of a short two-part blog series on language and API design; check out part one and part two.
Patterns give you a small language to describe the structure of the values you want to match. The structure can be arbitrarily deep and you can bind variables to parts of the structured value.
This allows you to write things extremely succinctly. You can illustrate this with a small example, such as a derivative function for a simple type of mathematical expressions:
type expr =
    | Int of int
    | Var of string
    | Add of expr * expr
    | Mul of expr * expr;;

let rec d(f, x) =
    match f with
    | Var y when x=y -> Int 1
    | Int _ | Var _ -> Int 0
    | Add(f, g) -> Add(d(f, x), d(g, x))
    | Mul(f, g) -> Add(Mul(f, d(g, x)), Mul(g, d(f, x)));;
Additionally, because pattern matching is a static construct for static types, the compiler can (i) verify that you covered all cases (ii) detect redundant branches that can never match any value (iii) provide a very efficient implementation (with jumps etc.).
Excerpt from this blog article:
Pattern matching has several advantages over switch statements and method dispatch:
Pattern matches can act upon ints, floats, strings and other types as well as objects.
Pattern matches can act upon several different values simultaneously: parallel pattern matching. Method dispatch and switch are limited to a single value, e.g. "this".
Patterns can be nested, allowing dispatch over trees of arbitrary depth. Method dispatch and switch are limited to the non-nested case.
Or-patterns allow subpatterns to be shared. Method dispatch only allows sharing when methods are from classes that happen to share a base class. Otherwise you must manually factor out the commonality into a separate function (giving it a name) and then manually insert calls from all appropriate places to your unnecessary function.
Pattern matching provides redundancy checking which catches errors.
Nested and/or parallel pattern matches are optimized for you by the F# compiler. The OO equivalent must be written by hand and constantly reoptimized by hand during development, which is prohibitively tedious and error prone, so production-quality OO code tends to be extremely slow in comparison.
Active patterns allow you to inject custom dispatch semantics.
Off the top of my head:
The compiler can tell if you haven't covered all possibilities in your matches
You can use a match as an assignment
If you have a discriminated union, each match can have a different 'type'
Tuples have "," and Variants have Ctor args .. these are constructors, they create things.
Patterns are destructors, they rip them apart.
They're dual concepts.
To put this more forcefully: the notion of a tuple or variant cannot be described merely by its constructor: the destructor is required or the value you made is useless. It is these dual descriptions which define a value.
Generally we think of constructors as data, and destructors as control flow. Variant destructors are alternate branches (one of many), tuple destructors are parallel threads (all of many).
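A tiny illustration of that duality (a Haskell sketch; the Shape type is made up): the same shapes that build values in expressions take them apart in patterns, with variants choosing one branch and tuples binding all components at once.
-- Constructors build values; patterns (their duals) take them apart.
data Shape
  = Circle Double        -- radius
  | Rect Double Double   -- width, height

-- Matching on a variant chooses one branch of many and binds its parts.
describe :: Shape -> String
describe (Circle r) = "circle of radius " ++ show r
describe (Rect w h) = "rectangle " ++ show w ++ " by " ++ show h

-- A tuple pattern binds all components at once ("all of many").
swap :: (a, b) -> (b, a)
swap (x, y) = (y, x)

main :: IO ()
main = do
  mapM_ (putStrLn . describe) [Circle 1, Rect 3 4]
  print (swap ("answer", 42 :: Int))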
The parallelism is evident in operations like
(f * g) . (h * k) = (f . h) * (g . k)
If you think of control flowing through a function, tuples provide a way to split up a calculation into parallel threads of control.
Looked at this way, expressions are ways to compose tuples and variants to make complicated data structures (think of an AST).
And pattern matches are ways to compose the destructors (again, think of an AST).
Switch is the two front wheels.
Pattern-matching is the entire car.
Pattern matches in OCaml, in addition to being more expressive in the several ways described above, also give some very important static guarantees. The compiler will prove for you that the case analysis embodied by your pattern-match statement is:
exhaustive (no cases are missed)
non-redundant (no cases that can never be hit because they are pre-empted by a previous case)
sound (no patterns that are impossible given the datatype in question)
This is a really big deal. It's helpful when you're writing the program for the first time, and enormously useful when your program is evolving. Used properly, match-statements make it easier to change the types in your code reliably, because the type system points you at the broken match statements, which are a decent indicator of where you have code that needs to be fixed.
If-Else (or switch) statements are about choosing different ways to process a value (input) depending on properties of the value at hand.
Pattern matching is about defining how to process a value given its structure (also note that single-case pattern matches make sense).
Thus pattern matching is more about deconstructing values than making choices, this makes them a very convenient mechanism for defining (recursive) functions on inductive structures (recursive union types), which explains why they are so abundantly used in languages like Ocaml etc.
PS: You might know the pattern-match and If-Else "patterns" from their ad-hoc use in math;
"if x has property A then y else z" (If-Else)
"some term in p1..pn where .... is the prime decomposition of x.." ((single case) pattern match)
Perhaps you could draw an analogy with strings and regular expressions? You describe what you are looking for, and let the compiler figure out how for itself. It makes your code much simpler and clearer.
As an aside: I find that the most useful thing about pattern matching is that it encourages good habits. I deal with the corner cases first, and it's easy to check that I've covered every case.