I have a situation where my language allows quoted strings, but sometimes I want to interpret the contents of a quoted string as language constructs. Think of it as something like an eval function.
So to support quoted strings I need a lexer rule, and it overrides my attempts to write a grammar rule that evaluates the contents of the quotes when prefixed with 'eval'. Is there any way to deal with this in the grammar?
IMO you should not try to handle this case directly through the lexer.
I think I would leave the string as it is in the lexer and add some code to the eval rule of the parser that calls a sub-parser on the string's content.
If you want to implement an eval function, you're really looking for a runtime interpreter.
The only time you need an "eval" function is when you want to build up the content to compile at runtime. If you have the content available at compile-time, you can parse it without it being a string...
So... keep it as a string, and then use the same parser at runtime to parse/compile its contents.
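To make that concrete, here is a minimal, self-contained sketch (plain Python, not tied to any particular parser generator; all names are illustrative): a tiny language with numbers, '+', quoted strings, and an eval form. The lexer keeps quoted text as an opaque STRING token; only the interpreter, on seeing eval, feeds the string back through the very same tokenize/parse front end at runtime.

import re

TOKEN = re.compile(r'(\d+)|(\+)|"([^"]*)"|(eval)')

def tokenize(src):
    tokens, pos = [], 0
    while pos < len(src):
        if src[pos].isspace():
            pos += 1
            continue
        m = TOKEN.match(src, pos)
        if not m:
            raise SyntaxError("bad input at position %d" % pos)
        num, plus, string, kw = m.groups()
        if num:
            tokens.append(("NUM", int(num)))
        elif plus:
            tokens.append(("PLUS", "+"))
        elif string is not None:
            # Quoted text stays an opaque STRING token at lex time.
            tokens.append(("STRING", string))
        else:
            tokens.append(("EVAL", kw))
        pos = m.end()
    return tokens

def parse(tokens):
    # expr := term ('+' term)* ; term := NUM | STRING | 'eval' term
    def term(i):
        kind, value = tokens[i]
        if kind == "NUM":
            return ("num", value), i + 1
        if kind == "STRING":
            return ("str", value), i + 1
        if kind == "EVAL":
            node, j = term(i + 1)
            return ("eval", node), j
        raise SyntaxError("unexpected " + kind)
    node, i = term(0)
    while i < len(tokens) and tokens[i][0] == "PLUS":
        rhs, i = term(i + 1)
        node = ("+", node, rhs)
    return node

def interpret(node):
    tag = node[0]
    if tag == "num" or tag == "str":
        return node[1]
    if tag == "+":
        return interpret(node[1]) + interpret(node[2])
    # The key move: the operand was lexed as a plain string; parse it
    # with the same front end, at runtime.
    return interpret(parse(tokenize(interpret(node[1]))))

print(interpret(parse(tokenize('eval "1 + 2" + 3'))))  # prints 6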
I'm trying to implement Python-like whitespace indentation (read: emit indent/dedent tokens where needed) with FsLexYacc.
It seems that FsLexYacc is not able to use unput, which is what the C/C++ examples for lexing whitespace-based indentation use. I tried to use an additional argument as an "indent stack" during lexing, but not being able to return more than one token per lex rule made it impossible to return all pending dedents at the end of the file, or multiple dedents where needed in between.
Is there a way to implement whitespace-based indentation in FsLexYacc without tokenizing the complete string first and then applying a separate pass over all tokens to replace the whitespace with indents/dedents where appropriate? (Even that possible solution seems hard to get working with a (LexBuffer<char> -> token) signature, so that it can be passed into the generated parser.)
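One common workaround, offered here only as a sketch (shown in Python for brevity, not as a documented FsLexYacc feature; the same wrapper shape fits the (LexBuffer<char> -> token) contract with the state held in a closure or mutable record): wrap the raw lexer in a function that owns the indent stack and a queue of pending tokens. Because the parser only ever calls the wrapper, the wrapper is free to emit several INDENT/DEDENT tokens for one physical line and to drain the remaining dedents at end of file.

class IndentingLexer:
    def __init__(self, lines):
        self.lines = iter(lines)
        self.indents = [0]   # indent stack, as in CPython's own tokenizer
        self.pending = []    # tokens queued up for the parser

    def next_token(self):
        # The parser calls only this method; queued tokens go out first.
        if self.pending:
            return self.pending.pop(0)
        line = next(self.lines, None)
        if line is None:
            # EOF: emit one DEDENT per still-open indentation level.
            while len(self.indents) > 1:
                self.indents.pop()
                self.pending.append(("DEDENT", None))
            self.pending.append(("EOF", None))
            return self.next_token()
        # Blank-line and mixed-tab handling omitted in this sketch.
        width = len(line) - len(line.lstrip(" "))
        if width > self.indents[-1]:
            self.indents.append(width)
            self.pending.append(("INDENT", None))
        while width < self.indents[-1]:
            self.indents.pop()
            self.pending.append(("DEDENT", None))
        self.pending.append(("LINE", line.strip()))
        return self.next_token()

lx = IndentingLexer(["if x:", "  y", "z"])
toks = []
while True:
    t = lx.next_token()
    toks.append(t[0])
    if t[0] == "EOF":
        break
print(toks)  # ['LINE', 'INDENT', 'LINE', 'DEDENT', 'LINE', 'EOF']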
Neither of the two most commonly referenced lexer generators, cl-lex and lispbuilder-lexer, allows state variables in the "action blocks", making it impossible to recognize a C-style multi-line comment, for example.
Is there a lexer generator for Common Lisp that can recognize a C-style multi-line comment as a token?
Correction: this lexer actually needs to recognize nested, balanced multi-line comments (not exactly C-style), so I can't do away with state variables.
You can recognize a C-style multiline comment with the following regular expression:
[/][*][^*]*[*]+([^*/][^*]*[*]+)*[/]
It should work with any library which uses Posix-compatible extended regex syntax; although a bit hard to read because * is extensively used both as an operator and as a literal character, it uses no non-regular features. It does rely on inverted character classes ([^*], for example) matching the newline character, but afaik that is pretty well universal, even for regex engines in which a wildcard does not match newline.
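For example, with Python's re module (any POSIX-compatible engine should behave the same way), the pattern picks out complete comments and nothing more:

import re

C_COMMENT = re.compile(r"[/][*][^*]*[*]+([^*/][^*]*[*]+)*[/]")

src = "int x; /* multi\n   line * comment */ int y; /* two */"
print([m.group(0) for m in C_COMMENT.finditer(src)])
# ['/* multi\n   line * comment */', '/* two */']

The nested, balanced comments from the correction above are a different story: nesting is not a regular language, so no pure pattern rule can match it, but a single depth counter in a hand-written scanner function is all the state you need. A minimal sketch, again in Python for illustration, using Common Lisp's nestable #| ... |# block-comment delimiters:

def skip_nested_comment(text, pos):
    """Given pos just past an opening '#|', return the index just past
    the matching '|#', honoring nesting."""
    depth = 1
    while pos < len(text) - 1:
        two = text[pos:pos + 2]
        if two == "#|":
            depth += 1
            pos += 2
        elif two == "|#":
            depth -= 1
            pos += 2
            if depth == 0:
                return pos
        else:
            pos += 1
    raise SyntaxError("unterminated comment")

print(skip_nested_comment("#| a #| b |# c |#x", 2))  # 17, the index of 'x'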
Let's say I was lexing a ruby method definition:
def print_greeting(greeting = "hi")
end
Is it the lexer's job to maintain state and emit relevant tokens, or should it be relatively dumb? Notice in the above example the greeting param has a default value of "hi". In a different context, greeting = "hi" is variable assignment which sets greeting to "hi". Should the lexer emit generic tokens such as IDENTIFIER EQUALS STRING, or should it be context-aware and emit something like PARAM_NAME EQUALS STRING?
I tend to make the lexer as stupid as I possibly can and would thus have it emit the IDENTIFIER EQUALS STRING tokens. At lexical analysis time there is (most of the time) no information available about what the tokens should represent. Having grammar rules like this in the lexer only pollutes it with (very) complex syntax rules, and that is the parser's job.
I think that lexer should be "dumb" and in your case should return something like this: DEF IDENTIFIER OPEN_PARENTHESIS IDENTIFIER EQUALS STRING CLOSE_PARENTHESIS END.
The parser should do the validation; that is why the responsibilities are split.
I don't work with Ruby, but I do work with compiler and programming-language design.
Both approaches work, but in real life, using generic identifiers for variables, parameters, and reserved words is easier (a "dumb lexer" or "dumb scanner").
Later, you can "cast" those generic identifiers into other tokens, sometimes in your parser.
Sometimes lexers/scanners have a code section (not in the parser) that allows several "semantic" operations, including casting a generic identifier into a keyword, variable, type identifier, or whatever. Your lexer rule detects a generic identifier token but returns another token to the parser, as the sketch below illustrates.
Another similar, common case is an expression or language that uses "+" and "-" both as binary operators and as unary sign operators.
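As a hedged illustration of that "cast in the lexer's code section" idea (Python for concreteness; the names KEYWORDS and identifier_action are mine, not from any particular lexer generator): one lexer rule matches any identifier, and the action decides which token actually reaches the parser.

KEYWORDS = {"def", "end", "if", "while"}

def identifier_action(lexeme):
    # One rule, several possible tokens: keywords are recognized by a
    # table lookup rather than by separate lexer rules.
    if lexeme in KEYWORDS:
        return (lexeme.upper(), lexeme)   # e.g. ('DEF', 'def')
    return ("IDENTIFIER", lexeme)

print(identifier_action("def"))       # ('DEF', 'def')
print(identifier_action("greeting"))  # ('IDENTIFIER', 'greeting')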
The distinction between lexical analysis and parsing is an arbitrary one. In many cases you wouldn't want a separate step at all. That said, since performance is usually the most important issue (otherwise parsing would be a mostly trivial task), you need to decide, and probably measure, whether additional processing during lexical analysis is justified or not. There is no general answer.
I've been playing with this for an hour or two and have found myself at a roadblock with the Lua pattern-matching utilities. I am attempting to match all quoted text in a string and replace it if needed.
The pattern I have come up with so far is: (\?[\"\'])(.-)%1
This works in some cases, but not in all cases:
Working: "This \"is a\" string of \"text to\" test with"
Not Working: "T\\\"his \"is\' a\" string\" of\' text\" to \"test\" wit\\\"h"
In the not-working example, I would like it to match the following (I made a function that gets the matches I desire; I'm just looking for a pattern to use with gsub, and I'm curious whether a Lua pattern can do this):
string
a" string" of
is' a" string" of' text
test
his "is' a" string" of' text" to "test" wit
I'm going to continue to use my function for the time being, but I am curious whether there is a pattern I could or should be using, and whether I'm just missing something with patterns.
(Edited to use a non-HTML example, since the original was leading to assumptions that I was attempting to parse HTML.)
Trying to match escaped, quoted text using regular expressions is like trying to remove the daisies (and only the daisies) from a field using a lawnmower.
I made a function that gets the matches I desire
This is the correct move.
I'm curious if a Lua pattern can do this
From a practical point of view, even if a pattern can do this, you don't want to. From a theoretical point of view, you are trying to find a double quote that is preceded by an even number of backslashes. This is definitely a regular language, and the regular expression you want would be something like the following (using Lua quoting conventions):
[[[^\](\\)*"(.-[^\](\\)*)"]]
And the quoted string would be result #2. But Lua patterns are not full regular expressions; in particular, you cannot put a * after a parenthesized pattern.
So my guess is that this problem cannot be solved using Lua patterns, but since Lua patterns are not a standard thing in automata theory, I'm not aware of any body of proof techniques that you could use to prove it.
The issue with escaped quotes is that, in general, if there's an odd number of backslashes before the quote, then it's escaped, and if there's an even number, it's not. I do not believe that Lua pattern-matching is powerful enough to represent this condition, so if you need to parse text like this, then you should seek another way. Perhaps you can iterate through the string and parse it, or you could find each quote in turn and read backwards, counting the backslashes until you find a non-backslash character (or the beginning of the string).
If you absolutely must use patterns for some reason, you could try doing this in a multi-step process. First, gsub for all occurrences of two backslashes in a row, and replace them with some sentinel value. This must be a value that does not already occur in the string. You could try something like "\001" if you know this string doesn't contain non-printable characters. Anyway, once you've replaced all sequences of two backslashes in a row, any backslashes left are escaping the following character. Now you can apply your original pattern, and then finally you can replace all instances of your sentinel value with two backslashes again.
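Here is the same multi-step sentinel idea as a sketch, written in Python for illustration (the original discussion is about Lua's gsub, but the steps — hide backslash pairs, match, restore — are identical; the \001 sentinel is an assumption about the input):

import re

SENTINEL = "\001"  # assumes the input never contains a \001 byte

def unescaped_quotes(s):
    hidden = s.replace("\\\\", SENTINEL)   # 1: hide escaped backslashes
    # 2: after step 1, a quote is escaped iff the one character before it
    #    is a backslash, so a simple lookbehind suffices on both ends.
    parts = re.findall(r'(?<!\\)"(.*?)(?<!\\)"', hidden, re.S)
    return [p.replace(SENTINEL, "\\\\") for p in parts]  # 3: restore

print(unescaped_quotes(r'say \"hi\" "ok" and \\"real quote"'))
# ['ok', 'real quote']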
Lua's pattern language is adequate for many simple cases, and it has at least one trick you don't find in a typical regular expression package: a way to match balanced parentheses. But it has its limits as well.
When those limits are exceeded, I reach for LPeg. LPeg is an implementation of a Parsing Expression Grammar for Lua, written by one of Lua's original authors, so the adaptation to Lua is done quite well. A PEG lets you specify anything from simple patterns to complete language grammars. LPeg compiles the grammar to a bytecode and executes it extremely efficiently.
You should not be trying to parse HTML with regular expressions. HTML and XML are not regular languages and cannot be successfully manipulated with regular expressions; you should use a dedicated HTML parser. There are lots of explanations of why elsewhere.
I'm trying to create a simple Bash-like grammar in ANTLR v3 but haven't been able to parse (and check) input inside subshell commands.
Further explanation:
I want to parse the following input:
$(command parameters*)
`command parameters`
"some text $(command parameters*)"
And I want to be able to check its contents as I would with simple input such as: command parameters.
i.e.:
Parsing it would generate a tree like (SUBSHELL (CMD command (PARAM parameters*))) (tokens are in upper-case)
I'm able to ignore '$('s and '`'s, but that won't cover the cases where the subshells are used inside double-quoted strings, like:
$ echo "String test $(ls -l) end"
So... any tips on how do I achieve this?
I'm not very familiar with the details of ANTLR v3, but I can tell you that you can't handle Bash-style command substitution inside double-quoted strings in a traditional-style lexer, as the nesting cannot be expressed using a regular grammar. Most traditional compiler-compilers restrict lexers to regular grammars so that efficient DFAs can be constructed for them. (Lexers, which irreducibly have to scan every single character of the source, have historically been one of the slowest parts of a compiler.)
You must either parse " as a token and (ideally) use a different lexer or lexer mode for the internals of strings, so that most shell metacharacters, e.g. '{', aren't parsed as tokens but as text; or alternatively, do away with the lexer-parser division and use a scannerless approach, so that the "lexer" rule for double-quoted strings can call into the "parser" rule for command substitutions.
I would favour the scannerless approach, and would investigate how well ANTLR v3 supports writing grammars that work directly over a character stream rather than over a token stream.
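To make the scannerless idea concrete, here is an illustrative sketch in plain Python rather than ANTLR (node shapes simplified, error handling omitted, all names mine): the double-quoted-string "rule" works directly on characters and recurses into the command-substitution rule when it sees $( , so the nesting falls out of ordinary recursion.

def parse_dquote(src, i):
    """Parse from src[i], just past an opening '"'; return (node, next index)."""
    parts = []
    text = ""
    while src[i] != '"':
        if src.startswith("$(", i):
            if text:
                parts.append(("TEXT", text))
                text = ""
            node, i = parse_subshell(src, i + 2)
            parts.append(node)
        else:
            text += src[i]
            i += 1
    if text:
        parts.append(("TEXT", text))
    return ("DQUOTE", parts), i + 1

def parse_subshell(src, i):
    """Parse from just past '$(' up to the matching ')'."""
    words = []
    word = ""
    while src[i] != ")":
        if src[i] == '"':
            node, i = parse_dquote(src, i + 1)  # strings nest inside subshells too
            words.append(node)
        elif src[i] == " ":
            if word:
                words.append(("WORD", word))
            word = ""
            i += 1
        else:
            word += src[i]
            i += 1
    if word:
        words.append(("WORD", word))
    return ("SUBSHELL", words), i + 1

print(parse_dquote('"String test $(ls -l) end"', 1)[0])
# ('DQUOTE', [('TEXT', 'String test '),
#             ('SUBSHELL', [('WORD', 'ls'), ('WORD', '-l')]),
#             ('TEXT', ' end')])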