I am learning Common Lisp at the moment and I have hit a huge roadblock.
I have an assignment in which we are to learn how to create a parser in Common Lisp. I have managed to implement everything from the grammar rules to the lexer, with a lot of help from different sources online. However, I can't seem to figure out how to implement a symbol table.
This is what I have so far for the symbol table.
(defun symtab-add (state id)
  ;; *** add symbols to symbol table ***
  )

(defun symtab-member (state id)
  ;; *** look up symbols in symbol table ***
  )

(defun symtab-display (state)
  (format t "------------------------------------------------------~%")
  (format t "Symbol Table is: ~S ~%" (pstate-symtab state))
  (format t "------------------------------------------------------~%"))
As you can see, I've only managed the display part. If someone could link me to a tutorial, give me a code example, or just help me with this, I would be super thankful.
All source code for my assignment: http://www.cs.kau.se/cs/education/courses/dvgc01/LISP/newstart.lsp
There are multiple ways of implementing a symbol table, with varying levels of "suitable for purpose" depending on your exact needs. At the end of the day, a symbol table is, effectively, just a mapping from "symbol name" to something.
So any data structure that allows you to add things to it as well as look things up should work. Fairly common implementations would be "use a hash table" or "use an alist" (the latter is essentially a list of pairs of the form (<symbol> . <data>)).
First you will have to create the symbol table, say:
(setq my-symbol-table nil)
To keep it simple, we will ignore packages; you can learn about that later.
Next, you have to decide how you are going to store the symbols in the table. Again, keeping it simple, you can store them as an association list, with pairs of symbol name and symbol value. For example, if you store the symbols a and b with the values 3 and 5, you would have the following symbol table:
> my-symbol-table
((a . 3) (b . 5))
To use this association list, you can use the functions assoc, push, rplacd.
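Add a new symbol:
;; CONS builds a fresh pair; a quoted literal such as '(c . 0) must not be modified later with RPLACD
(push (cons 'c 0) my-symbol-table)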
Look up a symbol:
(assoc 'c my-symbol-table)
Change the value of an existing symbol:
(rplacd (assoc 'c my-symbol-table) 18)
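For readers who want to see the same three operations outside Lisp, here is a minimal sketch in Python, assuming the table is kept as a list of (name, value) pairs like the alist above. The function names echo the stubs in the question; they are illustrative only and are not taken from the assignment's source file.

```python
# Hypothetical Python analogue of the alist operations above.
# A table is a list of (name, value) pairs, newest first.

def symtab_add(symtab, name, value=None):
    """Return a new table with (name, value) on the front, like PUSH."""
    return [(name, value)] + symtab


def symtab_member(symtab, name):
    """Return the first pair whose name matches, or None, like ASSOC."""
    for pair in symtab:
        if pair[0] == name:
            return pair
    return None


def symtab_display(symtab):
    """Render the table as a single line of text."""
    return "Symbol Table is: " + repr(symtab)


table = []
table = symtab_add(table, "a", 3)
table = symtab_add(table, "b", 5)
found = symtab_member(table, "b")
missing = symtab_member(table, "z")
```

Because new entries go on the front and lookup returns the first match, adding a name twice shadows the older entry, which is the same behaviour you get from push and assoc.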
I hope this is enough to get you going.
I am writing a basic parser for a Scheme interpreter and here are the definitions I have set up to define the various type of tokens:
# 1. Parens
# 2. Operators (<=, =, +, ...)
# 3. Types (2.5, "Hello", #f, etc.)
# 4. Procedures, builtins, and such
... // probably need a new class for this
Does the above seem like it's a good starting place? Are there some obvious things I'm missing here, or does this give me a "good-enough" foundation?
Your approach makes distinctions which really don't exist in the syntax of the language, and also makes decisions far too early. For example consider this program:
(let ((x 1))
  (with-assignment-notes
    (set! x 2)
    (set! x 3)))
When I run this:
> (let ((x 1))
    (with-assignment-notes
      (set! x 2)
      (set! x 3)))
setting x to 2
setting x to 3
In order for this to work with-assignment-notes has to somehow redefine what (set! ...) means in its body. Here's a hacky and probably incorrect (Racket) implementation of that:
(define-syntax with-assignment-notes
  (syntax-rules (set!)
    [(_ form ...)
     (let-syntax ([rewrite/maybe
                   (syntax-rules (set!)
                     [(_ (set! var val))
                      (let ([r val])
                        (printf "setting ~A to ~A~%" 'var r)
                        (set! var r))]
                     [(_ thing)
                      thing])])
       (rewrite/maybe form) ...)]))
So the critical features of any parser for a Lisp-family language are:
it should not make any decision about the semantics of the language that it can avoid making;
the structure it constructs must be available to the language itself as first-class objects;
(and optionally) the parser should be modifiable from the language itself.
As examples:
it is probably inevitable that the parser needs to make decisions about what is and is not a number and what sort of number it is;
it would be nice if it had default handling for strings, but this should ideally be controllable by the user;
it should make no decision at all about what, say (< x y) means but rather should return a structure representing it for interpretation by the language.
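To make the three examples concrete, here is a minimal sketch in Python of a reader that decides what is a number but makes no decision about what any symbol means. The Symbol class and the function names are hypothetical, and strings are left out to keep it short.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Symbol:
    """An uninterpreted name: the reader attaches no meaning to it."""
    name: str


def tokenize(text):
    """Split on whitespace, treating each paren as its own token."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def atom(token):
    """Decide only whether the token is a number; anything else is a Symbol."""
    # Caveat of this sketch: Python's float() also accepts words like "inf".
    for convert in (int, float):
        try:
            return convert(token)
        except ValueError:
            pass
    return Symbol(token)


def read(tokens):
    """Build nested lists from a token list, consuming tokens as it goes."""
    token = tokens.pop(0)
    if token == "(":
        items = []
        while tokens[0] != ")":
            items.append(read(tokens))
        tokens.pop(0)  # drop the closing paren
        return items
    if token == ")":
        raise SyntaxError("unbalanced )")
    return atom(token)


tree = read(tokenize("(< x 2.5)"))
```

Note that < comes back as an ordinary Symbol, exactly like x: whether it is an operator, a procedure or a macro is left to whatever walks the structure later.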
The reason for the last, optional, requirement is that Lisp-family languages are used by people who are interested in using them for implementing languages. Allowing the reader to be altered from within the language makes that hugely easier, since you don't have to start from scratch each time you want to make a language which is a bit like the one you started with but not completely.
Parsing Lisp
The usual approach to parsing Lisp-family languages is to have machinery which will turn a sequence of characters into a sequence of s-expressions consisting of objects which are defined by the language itself, notably symbols and conses (but also numbers, strings &c). Once you have this structure you then walk over it to interpret it as a program: either evaluating it on the fly or compiling it. Critically, you can also write programs which manipulate this structure itself: macros.
In 'traditional' Lisps such as CL this process is explicit: there is a 'reader' which turns a sequence of characters into a sequence of s-expressions, and macros explicitly manipulate the list structure of these s-expressions, after which the evaluator/compiler processes them. So in a traditional Lisp (< x y) would be parsed as (a cons of a symbol < and (a cons of a symbol x and (a cons of a symbol y and the empty list object))), or (< . (x . (y . ()))), and this structure gets handed to the macro expander and hence to the evaluator or compiler.
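The cons structure described above can be sketched in Python as follows. The Cons class, NIL and the helper names are assumptions made for illustration, with plain strings standing in for symbols.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Cons:
    """A pair of two arbitrary objects."""
    car: Any
    cdr: Any


NIL = None  # stand-in for the empty list object


def from_items(items, tail=NIL):
    """Chain items into conses, ending in tail."""
    result = tail
    for item in reversed(items):
        result = Cons(item, result)
    return result


def show(obj):
    """Print fully dotted notation, one pair per cons."""
    if obj is NIL:
        return "()"
    if isinstance(obj, Cons):
        return "(%s . %s)" % (show(obj.car), show(obj.cdr))
    return str(obj)


form = from_items(["<", "x", "y"])
dotted = show(form)
```

Here dotted is the string "(< . (x . (y . ())))", the same dotted form given in the text.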
In Scheme it is a little more subtle: macros are specified (portably, anyway) in terms of rules which turn a bit of syntax into another bit of syntax, and it's not (I think) explicit whether such objects are made of conses & symbols or not. But the structure which is available to syntax rules needs to be as rich as something made of conses and symbols, because syntax rules get to poke around inside it. If you want to write something like the following macro:
(define-syntax with-silly-escape
(syntax-rules ()
[(_ (escape) form ...)
(call/cc (λ (c)
(define (escape) (c 'escaped))
form ...))]
[(_ (escape val ...) form ...)
(call/cc (λ (c)
(define (escape) (c val ...))
form ...))]))
then you need to be able to look into the structure of what came from the reader, and that structure needs to be as rich as something made of lists and conses.
A toy reader: reeder
Reeder is a little Lisp reader written in Common Lisp that I wrote a little while ago for reasons I forget (but perhaps to help me learn CL-PPCRE, which it uses). It is emphatically a toy, but it is also small enough and simple enough to understand: certainly it is much smaller and simpler than the standard CL reader, and it demonstrates one approach to solving this problem. It is driven by a table known as a reedtable which defines how parsing proceeds.
So, for instance:
> (with-input-from-string (in "(defun foo (x) x)")
(reed :from in))
(defun foo (x) x)
To read (reed) something using a reedtable:
look for the next interesting character, which is the next character not defined as whitespace in the table (reedtables have a configurable list of whitespace characters);
if that character is defined as a macro character in the table, call its function to read something;
otherwise call the table's token reader to read and interpret a token.
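The three steps above can be sketched in Python like this. It is not reeder's code: the Table class, the position-based interface, and the choice to end a token at a macro character are all assumptions of the sketch.

```python
def read_token(text, pos, table):
    """Accumulate characters up to whitespace or a macro character."""
    start = pos
    while (pos < len(text)
           and text[pos] not in table.whitespace
           and text[pos] not in table.macro_characters):
        pos += 1
    return text[start:pos], pos


class Table:
    """Hypothetical stand-in for a reedtable."""

    def __init__(self):
        self.whitespace = set(" \t\n")   # configurable whitespace
        self.macro_characters = {}       # char -> reader function
        self.token_reader = read_token


def skip_whitespace(text, pos, table):
    while pos < len(text) and text[pos] in table.whitespace:
        pos += 1
    return pos


def read(text, pos, table):
    """Return (object, next position)."""
    pos = skip_whitespace(text, pos, table)            # step 1
    if pos >= len(text):
        raise EOFError("nothing left to read")
    handler = table.macro_characters.get(text[pos])    # step 2
    if handler is not None:
        return handler(text, pos, table)
    return table.token_reader(text, pos, table)        # step 3


def read_list(text, pos, table):
    """Macro-character function for '(': reads items until ')'."""
    items = []
    pos = skip_whitespace(text, pos + 1, table)
    while pos < len(text) and text[pos] != ")":
        item, pos = read(text, pos, table)
        items.append(item)
        pos = skip_whitespace(text, pos, table)
    if pos >= len(text):
        raise SyntaxError("missing )")
    return items, pos + 1


def unbalanced(text, pos, table):
    raise SyntaxError("unbalanced )")


table = Table()
table.macro_characters["("] = read_list
table.macro_characters[")"] = unbalanced

source = "  (defun foo (x) x)"
value, end = read(source, 0, table)
```

Everything language-specific lives in the table, so changing what gets read means changing table entries rather than the read function.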
Reeding tokens
The token reader lives in the reedtable and is responsible for accumulating and interpreting a token:
it accumulates a token in ways known to itself (but the default one does this by just trundling along the string handling single (\) and multiple (|) escapes defined in the reedtable until it gets to something that is whitespace in the table);
at this point it has a string and it asks the reedtable to turn this string into something, which it does by means of token parsers.
There is a small kludge in the second step: as the token reader accumulates a token it keeps track of whether it is 'denatured' which means that there were escaped characters in it. It hands this information to the token parsers, which allows them, for instance, to interpret |1|, which is denatured, differently to 1, which is not.
Token parsers are also defined in the reedtable: there is a define-token-parser form to define them. They have priorities, so that the highest priority one gets to be tried first and they get to say whether they should be tried for denatured tokens. Some token parser should always apply: it's an error if none do.
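As a rough illustration of prioritised token parsers and the 'denatured' flag, here is a sketch in Python. The TokenParsers class and its define and parse methods are hypothetical, and ordinary string regexps stand in for CL-PPCRE.

```python
import re
from fractions import Fraction


class TokenParsers:
    """Hypothetical registry: parsers are tried from highest priority down."""

    def __init__(self):
        self.parsers = []  # (priority, name, regexp, accepts_denatured, action)

    def define(self, name, priority, pattern, action, denatured=False):
        entry = (priority, name, re.compile(pattern, re.DOTALL), denatured, action)
        self.parsers.append(entry)
        self.parsers.sort(key=lambda e: -e[0])  # keep highest priority first

    def parse(self, token, denatured=False):
        for _priority, _name, regexp, accepts_denatured, action in self.parsers:
            if denatured and not accepts_denatured:
                continue  # this parser opted out of escaped tokens
            match = regexp.fullmatch(token)
            if match:
                return action(match)
        raise ValueError("no token parser applies to %r" % token)


parsers = TokenParsers()
parsers.define("integer", 2, r"-?\d+", lambda m: int(m.group(0)))
parsers.define("rational", 1, r"(-?\d+)/(\d+)",
               lambda m: Fraction(int(m.group(1)), int(m.group(2))))
# Fallback: matches anything, including denatured tokens.
parsers.define("symbol", 0, r".*",
               lambda m: ("symbol", m.group(0)), denatured=True)

plain = parsers.parse("1")                       # an integer
escaped = parsers.parse("1", denatured=True)     # as if written |1|
```

With this setup the token 1 parses as a number, but the same characters marked as denatured skip the numeric parsers and fall through to the symbol fallback.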
The default reedtable has token parsers which can parse integers and rational numbers, and a fallback one which parses a symbol. Here is an example of how you would replace this fallback parser so that instead of returning symbols it returns objects called 'cymbals' which might be the representation of symbols in some embedded language:
Firstly we want a copy of the reedtable, and we need to remove the symbol parser from that copy (having previously checked its name using reedtable-token-parser-names).
(defvar *cymbal-reedtable* (copy-reedtable nil))
(remove-token-parser 'symbol *cymbal-reedtable*)
Now here's an implementation of cymbals:
(defvar *namespace* (make-hash-table :test #'equal))
(defstruct cymbal
  name)

(defgeneric ensure-cymbal (thing))

(defmethod ensure-cymbal ((thing string))
  (or (gethash thing *namespace*)
      (setf (gethash thing *namespace*)
            (make-cymbal :name thing))))

(defmethod ensure-cymbal ((thing cymbal))
  thing)
And finally here is the cymbal token parser:
(define-token-parser (cymbal 0 :denatured t :reedtable *cymbal-reedtable*)
(:register (:greedy-repetition 0 nil :everything))
(ensure-cymbal name))
An example of this. Before modifying the reedtable:
> (with-input-from-string (in "(x y . z)")
(reed :from in :reedtable *cymbal-reedtable*))
(x y . z)
After:
> (with-input-from-string (in "(x y . z)")
(reed :from in :reedtable *cymbal-reedtable*))
(#S(cymbal :name "x") #S(cymbal :name "y") . #S(cymbal :name "z"))
Macro characters
If something isn't the start of a token then it's a macro character. Macro characters have associated functions and these functions get called to read one object, however they choose to do that. The default reedtable has two-and-a-half macro characters:
" reads a string, using the reedtable's single & multiple escape characters;
( reads a list or a cons.
) is defined to raise an exception, as it can only occur if there are unbalanced parens.
The string reader is pretty straightforward (it has a lot in common with the token reader although it's not the same code).
The list/cons reader is mildly fiddly: most of the fiddliness is dealing with consing dots which it does by a slightly disgusting trick: it installs a secret token parser which will parse a consing dot as a special object if a dynamic variable is true, but otherwise will raise an exception. The cons reader then binds this variable appropriately to make sure that consing dots are parsed only where they are allowed. Obviously the list/cons reader invokes the whole reader recursively in many places.
And that's all the macro characters. So, for instance in the default setup, ' would read as a symbol (or a cymbal). But you can just install a macro character:
(defvar *qr-reedtable* (copy-reedtable nil))
(setf (reedtable-macro-character #\' *qr-reedtable*)
(lambda (from quote table)
(declare (ignore quote))
(values `(quote ,(reed :from from :reedtable table))
(inch from nil))))
And now 'x will read as (quote x) in *qr-reedtable*.
Similarly you could add a more complicated macro character on # to read objects depending on their next character in the way CL does.
An example of the quote reader. Before:
> (with-input-from-string (in "'(x y . z)")
(reed :from in :reedtable *qr-reedtable*))
The object it has returned is a symbol whose name is "'", and it didn't read beyond that of course. After:
> (with-input-from-string (in "'(x y . z)")
(reed :from in :reedtable *qr-reedtable*))
'(x y . z)
Other notes
Everything works one-character-ahead, so all of the various functions get the stream being read, the first character they should be interested in and the reedtable, and return both their value and the next character. This avoids endlessly unreading characters (and probably tells you what grammar class it can handle natively); obviously macro character parsers can do whatever they like so long as things are sane when they return.
It probably doesn't use anything which isn't moderately implementable in non-Lisp languages. Some
Macros will cause pain in the usual way, but the only one is define-token-parser. I think the solution to that is the usual expand-the-macro-by-hand-and-write-that-code, but you could probably help a bit by having an install-or-replace-token-parser function which dealt with the bookkeeping of keeping the list sorted etc.
You'll need a language with dynamic variables to implement something like the cons reeder.
It uses CL-PPCRE's s-expression representation of regexps. I'm sure other languages have something like this (Perl does) because no-one wants to write stringy regexps: they must have died out decades ago.
It's a toy: it may be interesting to read but it's not suitable for any serious use. I found at least one bug while writing this: there will be many more.
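The one-character-ahead convention from the notes above can be sketched in Python. The two reader functions are hypothetical: each receives the stream plus the first character it should look at, and returns its value together with the next unconsumed character.

```python
import io


def read_integer(stream, first):
    """Read digits starting at first; return (value, next_char)."""
    digits = first
    char = stream.read(1)
    while char.isdigit():
        digits += char
        char = stream.read(1)
    return int(digits), char  # char is "" at end of input


def read_sum(stream, first):
    """Read integers separated by '+'; return (total, next_char)."""
    total, char = read_integer(stream, first)
    while char == "+":
        value, char = read_integer(stream, stream.read(1))
        total += value
    return total, char


stream = io.StringIO("12+30+4;rest")
total, lookahead = read_sum(stream, stream.read(1))
remaining = stream.read()
```

Because each function hands the lookahead character back to its caller, nothing ever has to be pushed back onto the stream.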
I'm using the SMT-LIB 2 interface of Z3 and trying to define the following:
(declare-const rem (set sl$REQ))
And get this error:
(error "line 36 column 31: invalid declaration, builtin symbol rem")
Is there a way to get a complete list of all the predefined symbols so that I can do an automatic renaming?
Yes, but it's not quite that trivial. Depending on options and logic definitions, the list of pre-defined symbols may change. But, you can get a list of all potentially predefined symbols by grepping for builtin_name in src/ast/*_decl_plugin.cpp. For example, the rem symbol is defined at arith_decl_plugin.cpp:540.
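Once you have such a list, the automatic renaming itself is simple. Here is a minimal sketch in Python; the RESERVED set is a small illustrative sample rather than the real list, and the underscore-suffix scheme is just one possible convention.

```python
# Illustrative sample only: the real set would come from the grep described above.
RESERVED = {"rem", "div", "mod", "abs"}


def safe_name(name, reserved=RESERVED, taken=()):
    """Return name unchanged unless it clashes with a reserved symbol.

    taken holds names already used in the script, so the replacement
    does not collide with one of the user's own identifiers.
    """
    if name not in reserved:
        return name
    candidate = name + "_"
    while candidate in reserved or candidate in taken:
        candidate += "_"
    return candidate


renamed = safe_name("rem")
untouched = safe_name("x")
```

The same mapping has to be applied consistently to every occurrence of the identifier, including when translating models back.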
I'm having trouble working out how to use any of the functions in the Text.Parsec.Indent module provided by the indents package for Haskell, which is a sort of add-on for Parsec.
What do all these functions do? How are they to be used?
I can understand the brief Haddock description of withBlock, and I've found examples of how to use withBlock, runIndent and the IndentParser type here, here and here. I can also understand the documentation for the four parsers indentBrackets and friends. But many things are still confusing me.
In particular:
What is the difference between withBlock f a p and
do aa <- a
   pp <- block p
   return (f aa pp)
Likewise, what's the difference between withBlock' a p and do {a; block p}
In the family of functions indented and friends, what is ‘the level of the reference’? That is, what is ‘the reference’?
Again, with the functions indented and friends, how are they to be used? With the exception of withPos, it looks like they take no arguments and are all of type IParser () (IParser defined like this or this) so I'm guessing that all they can do is to produce an error or not and that they should appear in a do block, but I can't figure out the details.
I did at least find some examples on the usage of withPos in the source code, so I can probably figure that out if I stare at it for long enough.
<+/> comes with the helpful description “<+/> is to indentation sensitive parsers what ap is to monads” which is great if you want to spend several sessions trying to wrap your head around ap and then work out how that's analogous to a parser. The other three combinators are then defined with reference to <+/>, making the whole group unapproachable to a newcomer.
Do I need to use these? Can I just ignore them and use do instead?
The ordinary lexeme combinator and whiteSpace parser from Parsec will happily consume newlines in the middle of a multi-token construct without complaining. But in an indentation-style language, sometimes you want to stop parsing a lexical construct or throw an error if a line is broken and the next line is indented less than it should be. How do I go about doing this in Parsec?
In the language I am trying to parse, ideally the rules for when a lexical structure is allowed to continue on to the next line should depend on what tokens appear at the end of the first line or the beginning of the subsequent line. Is there an easy way to achieve this in Parsec? (If it is difficult then it is not something which I need to concern myself with at this time.)
So, the first hint is to take a look at IndentParser
type IndentParser s u a = ParsecT s u (State SourcePos) a
I.e. it's a ParsecT keeping an extra close watch on SourcePos, an abstract container which can be used to access, among other things, the current column number. So, it's probably storing the current "level of indentation" in SourcePos. That'd be my initial guess as to what "level of reference" means.
In short, indents gives you a new kind of Parsec which is context sensitive—in particular, sensitive to the current indentation. I'll answer your questions out of order.
(2) The "level of reference" is the position, held in the parser's state, where the current indentation level is believed to start. To make this clearer, let me give some test cases under (3).
(3) In order to start experimenting with these functions, we'll build a little test runner. It'll run the parser with a string that we give it and then unwrap the inner State part using an initialPos which we get to modify. In code:
import Text.Parsec
import Text.Parsec.Pos
import Text.Parsec.Indent
import Control.Monad.State
testParse :: (SourcePos -> SourcePos)
          -> IndentParser String () a
          -> String -> Either ParseError a
testParse f p src = fst $ flip runState (f $ initialPos "") $ runParserT p () "" src
(Note that this is almost runIndent, except I gave a backdoor to modify the initialPos.)
Now we can take a look at indented. By examining the source, I can tell it does two things. First, it'll fail if the current SourcePos column number is less-than-or-equal-to the "level of reference" stored in the SourcePos stored in the State. Second, it somewhat mysteriously updates the State SourcePos's line counter (not column counter) to be current.
Only the first behavior is important, to my understanding. We can see the difference here.
>>> testParse id indented ""
Left (line 1, column 1): not indented
>>> testParse id (spaces >> indented) " "
Right ()
>>> testParse id (many (char 'x') >> indented) "xxxx"
Right ()
So, in order to have indented succeed, we need to have consumed enough whitespace (or anything else!) to push our column position out past the "reference" column position. Otherwise, it'll fail saying "not indented". Similar behavior exists for the next three functions: same fails unless the current position and reference position are on the same line, sameOrIndented fails if the current column is strictly less than the reference column, unless they are on the same line, and checkIndent fails unless the current and reference columns match.
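Stripped of the parsing machinery, the four checks are just comparisons between the current position and the reference position. Here is a sketch in Python that follows the prose descriptions above literally; it has not been checked against the library source, so treat the edge cases (especially for same_or_indented) as assumptions.

```python
from collections import namedtuple

# Lines and columns are 1-based, as in Parsec's SourcePos.
Pos = namedtuple("Pos", ["line", "column"])


def indented(current, reference):
    """Succeeds only if we are strictly to the right of the reference column."""
    return current.column > reference.column


def same(current, reference):
    """Succeeds only if we are still on the reference line."""
    return current.line == reference.line


def same_or_indented(current, reference):
    """As described above: fails if strictly left of the reference column,
    unless we are on the same line."""
    return same(current, reference) or current.column >= reference.column


def check_indent(current, reference):
    """Succeeds only if the columns match exactly."""
    return current.column == reference.column


reference = Pos(line=1, column=1)
at_start = indented(Pos(1, 1), reference)      # nothing consumed yet
after_space = indented(Pos(1, 2), reference)   # one character consumed
```

This mirrors the test runs above: at_start is False (the "not indented" failure) and after_space is True.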
withPos is slightly different. It's not just an IndentParser, it's an IndentParser combinator: it transforms the input IndentParser into one that thinks the "reference column" (the SourcePos in the State) is exactly where it was when we called withPos.
This gives us another hint, btw. It lets us know we have the power to change the reference column.
(1) So now let's take a look at how block and withBlock work using our new, lower level reference column operators. withBlock is implemented in terms of block, so we'll start with block.
-- simplified from the actual source
block p = withPos $ many1 (checkIndent >> p)
So, block resets the "reference column" to whatever the current column is and then runs p one or more times, as long as each parse starts at exactly this newly set "reference column". Now we can take a look at withBlock:
withBlock f a p = withPos $ do
    r1 <- a
    r2 <- option [] (indented >> block p)
    return (f r1 r2)
So, it resets the "reference column" to the current column, parses a single a parse, tries to parse an indented block of ps, then combines the results using f. Your implementation is almost correct, except that you need to use withPos to choose the correct "reference column".
Then, once you have withBlock, withBlock' = withBlock (\_ bs -> bs).
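To see the same idea without Parsec, here is a line-based analogue in Python. It is only a sketch: it works on whole lines instead of characters, the function names are made up, and each item here is itself a header with an optional nested block, which is a simplification of what block and withBlock allow.

```python
def indent_of(line):
    """Number of leading spaces."""
    return len(line) - len(line.lstrip(" "))


def parse_block(lines, index, column):
    """Like block: consecutive items that all start exactly at column."""
    items = []
    while index < len(lines) and indent_of(lines[index]) == column:
        item, index = parse_with_block(lines, index)
        items.append(item)
    return items, index


def parse_with_block(lines, index):
    """Like withBlock: a header, then an optional deeper-indented block."""
    reference = indent_of(lines[index])   # like withPos: fix the reference column
    header = lines[index].strip()
    index += 1
    children = []
    # like indented: the next line must be strictly to the right
    if index < len(lines) and indent_of(lines[index]) > reference:
        children, index = parse_block(lines, index, indent_of(lines[index]))
    return (header, children), index


source = ["a", "  b", "  c", "    d", "e"]
tree, rest = parse_block(source, 0, 0)
```

The first child after a header sets the column for its whole block, which is the role withPos plays inside block.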
(5) So, indented and friends are exactly the tools for doing this: they'll cause a parse to immediately fail if it's indented incorrectly with respect to the "reference position" chosen by withPos.
(4) Yes, don't worry about these guys until you learn how to use Applicative style parsing in base Parsec. It's often a much cleaner, faster, simpler way of specifying parses. Sometimes they're even more powerful, but if you understand Monads then they're almost always completely equivalent.
(6) And this is the crux. The tools mentioned so far can only do indentation failure if you can describe your intended indentation using withPos. Quickly, I don't think it's possible to specify withPos based on the success or failure of other parses... so you'll have to go another level deeper. Fortunately, the mechanism that makes IndentParsers work is obvious—it's just an inner State monad containing SourcePos. You can use lift :: MonadTrans t => m a -> t m a to manipulate this inner state and set the "reference column" however you like.
Is there a good reason why the type of Prelude.read is
read :: Read a => String -> a
rather than returning a Maybe value?
read :: Read a => String -> Maybe a
Since the string might fail to be parseable Haskell, wouldn't the latter be more natural?
Or even an Either String a, where Left would contain the original string if it didn't parse, and Right the result if it did?
I'm not trying to get others to write a corresponding wrapper for me. Just seeking reassurance that it's safe to do so.
Edit: As of GHC 7.6, readMaybe is available in the Text.Read module in the base package, along with readEither: http://hackage.haskell.org/packages/archive/base/latest/doc/html/Text-Read.html#v:readMaybe
Great question! The type of read itself isn't changing anytime soon because that would break lots of things. However, there should be a maybeRead function.
Why isn't there? The answer is "inertia". There was a discussion in '08 which got derailed by a discussion over "fail."
The good news is that folks were sufficiently convinced to start moving away from fail in the libraries. The bad news is that the proposal got lost in the shuffle. There should be such a function, although one is easy to write (and there are zillions of very similar versions floating around many codebases).
See also this discussion.
Personally, I use the version from the safe package.
Yeah, it would be handy to have a read function that returns Maybe. You can make one yourself:
readMaybe :: (Read a) => String -> Maybe a
readMaybe s = case reads s of
    [(x, "")] -> Just x
    _         -> Nothing
Apart from inertia and/or changing insights, another reason might be that it's aesthetically pleasing to have a function that can act as a kind of inverse of show. That is, you want that read . show is the identity (for types which are an instance of Show and Read) and that show . read is the identity on the range of show (i.e. show . read . show == show)
Having a Maybe in the type of read breaks the symmetry with show :: a -> String.
As @augustss pointed out, you can make your own safe read function. However, his readMaybe isn't completely consistent with read, as it doesn't ignore whitespace at the end of a string. (I made this mistake once; I don't quite remember the context.)
Looking at the definition of read in the Haskell 98 report, we can modify it to implement a readMaybe that is perfectly consistent with read, and this is not too inconvenient because all the functions it depends on are defined in the Prelude:
readMaybe :: (Read a) => String -> Maybe a
readMaybe s = case [x | (x,t) <- reads s, ("","") <- lex t] of
    [x] -> Just x
    _   -> Nothing
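The difference between the two versions is easier to see side by side. Here is a sketch in Python of the same two acceptance rules, using a hypothetical reads_int that, like reads, returns a list of (value, remaining input) pairs; None plays the role of Nothing.

```python
import re


def reads_int(text):
    """Hypothetical analogue of reads for integers: skips leading
    whitespace and returns [(value, rest)] or [] if nothing parses."""
    match = re.match(r"\s*(-?\d+)", text)
    if match is None:
        return []
    return [(int(match.group(1)), text[match.end():])]


def read_maybe_strict(text):
    """First version: the remaining input must be exactly empty."""
    parses = reads_int(text)
    if len(parses) == 1 and parses[0][1] == "":
        return parses[0][0]
    return None


def read_maybe_lenient(text):
    """Second version: trailing whitespace is allowed, as with read."""
    complete = [value for value, rest in reads_int(text) if rest.strip() == ""]
    return complete[0] if len(complete) == 1 else None


strict_result = read_maybe_strict("42 ")
lenient_result = read_maybe_lenient("42 ")
```

Here strict_result is None while lenient_result is 42; both versions still reject input with trailing junk such as "42x".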
This function (called readMaybe) is now in base! It lives in the Text.Read module rather than the Prelude (as of the current base, 4.6).
I'm trying to understand what identifiers represent and what they don't represent.
As I understand it, an identifier is a name for a method, a constant, a variable, a class, a package/module. It covers a lot. But what can you not use it for?
Every language differs in terms of what entities/abstractions can or cannot be named and reused in that language.
In most languages, you can't use an identifier for infix arithmetic operations.
For example, plus is an identifier and you can make a function named plus. But while you can write a = b + c;, there's no way to define an operator named plus to make a = b plus c; work, because the language grammar simply does not allow an identifier there.
An identifier allows you to assign a name to some data, so that you can reference it later. That is the limit of what identifiers do; you cannot "use" it for anything other than a reference to some data.
That said, there are a lot of implications that come from this, some subtle. For example, in most languages functions are, to some degree or another, considered to be data, and so a function name is an identifier. In languages where functions are values, but not "first-class" values, you can't use an identifier for a function in every place where you could use an identifier for something else. In some languages, there will even be separate namespaces for functions and other data, and so what is textually the same identifier might refer to two different things, and they would be distinguished by the context in which they are used.
An example of what you usually (i.e., in most languages) cannot use an identifier for is as a reference to a language keyword. For example, this sort of thing generally can't be done:
let during = while;
during (true) { print("Hello, world."); }
You could say it's used for everything that you'll want to refer to multiple times, or maybe even once (but use it to clarify the referent's purpose).
What can/can't be named differs per language, it's often quite intuitive, IMHO.
An "Anonymous" entity is something which is not named, although referred to somehow.
$subroutine = sub { return "Anonymous subroutine returning this text"; }
In Perl-speak, this is anonymous - the subroutine is not named, but it is referred to by the reference variable $subroutine.
PS: In Perl, the subroutine would be named like this:
# some code...
Say, in Java you cannot write something like:
Object myIf = if;
myIf (a == b) {
So, you cannot name some code statement, giving it an alias. While in REBOL it is perfectly possible:
myIf: if
myIf a = b [print "True!"]
What can and what can't be named depends on language, as you see.
As its name implies, an identifier is used to identify something. So for everything that can be identified uniquely, you can use an identifier. But, for example, a literal (e.g. a string literal) is not unique, so you can't use an identifier for it. However, you can create a variable and assign a string literal to it.
Making soup out of them is rather foul.
In languages such as Lisp, an identifier exists in its own right as a symbol, whereas in languages which are not introspective, identifiers don't exist at runtime.
You write a literal identifier/symbol by putting a single quote in front of it:
[1]> 'a
You can create a variable and assign a symbol literal to it:
[2]> (setf a 'Hello)
[3]> a
[4]> (print a)
You can set two variables to the same symbol
[10]> (setf b a)
[11]> b
[12]> a
[13]> (eq b a)
[14]> (eq b 'Hello)
Note that the values bound to b and a are the same, and the value is the literal symbol 'Hello
You can bind a function to the symbol
[15]> (defun hello () (print 'hello))
and call it:
[16]> (hello)
In Common Lisp, the variable binding and the function binding are distinct:
[19]> (setf hello 'goodbye)
[20]> hello
[21]> (hello)
but in Scheme or JavaScript the bindings are in the same namespace.
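The difference between the two arrangements can be modelled in a few lines of Python. Both classes are hypothetical toy environments, not how any real implementation stores bindings.

```python
class SeparateNamespaces:
    """Common Lisp style: one table for variables, another for functions."""

    def __init__(self):
        self.variables = {}
        self.functions = {}

    def set_variable(self, name, value):
        self.variables[name] = value

    def define_function(self, name, fn):
        self.functions[name] = fn

    def variable(self, name):
        return self.variables[name]

    def call(self, name, *args):
        return self.functions[name](*args)


class SingleNamespace:
    """Scheme or JavaScript style: functions and variables share one table."""

    def __init__(self):
        self.bindings = {}

    def set_variable(self, name, value):
        self.bindings[name] = value

    def define_function(self, name, fn):
        self.bindings[name] = fn

    def variable(self, name):
        return self.bindings[name]

    def call(self, name, *args):
        return self.bindings[name](*args)


cl_like = SeparateNamespaces()
cl_like.define_function("hello", lambda: "hello")
cl_like.set_variable("hello", "goodbye")

scheme_like = SingleNamespace()
scheme_like.define_function("hello", lambda: "hello")
scheme_like.set_variable("hello", "goodbye")  # overwrites the function
```

In cl_like the name hello still works both as a variable and as a function, while in scheme_like the assignment has replaced the function, so calling it fails.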
There are many other things you can do with identifiers if they are reified as symbols. I suspect that someone more knowledgeable than me in Lisp will be able to demonstrate that any of the things you 'can't do with identifiers' can in fact be done.
But even Lisp can not make identifier soup.
Sort of a left-field thought, but JSON has all those quotations in it to eliminate the danger of a JavaScript keyword messing up the parsing.