Let's say I allow strings of the following form:
"hello"
And, within a string, I also allow hex escape sequences of the form \xAB, and so an example string might be:
"hello \x74\x6f\x6d"
# hello tom
Of course the following is an invalid string:
"hello \x7z"
Is this something that should be handled at the lexing stage, that is, should the lexer raise an error that the string is not in the correct format? Or should validity not matter at the lexing phase, whose only job is to "grab the string token" and pass it on to the next phase, which can do some basic checks that the string contents are in fact valid (and some other checks: maybe it doesn't contain 0x00, or it's under a certain length, etc.)?
There's no "valid" way to interpret that input other than as a string with invalid content, so you're better off accepting it as a string and doing the escape-sequence validation in a listener or visitor.
Trying to do things like this in the parser adds no real value, substantially complicates the parser rules, and likely results in really obtuse error messages for the user of your language.
The purpose of your grammar is to recognize that the user was attempting to create a string and to create the parse tree indicating that. It's really not much different from a user trying to assign to an undefined variable in a language that requires declarations. Your parser identifies the only possible intent, and then you can validate it and, hopefully, write good error messages describing the problem.
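For example, here is a minimal sketch of such a listener-side check in Python; the token text format and the accepted escape set are assumptions, so adjust them to your language:

import re

# Accept \xAB (two hex digits) plus a few conventional escapes. The
# accepted set here is an assumption; extend it for your language.
VALID_ESCAPE = re.compile(r'\\(?:x[0-9A-Fa-f]{2}|[nrt\\])')

def check_escapes(token_text):
    """Return error messages for bad escapes in a quoted string literal."""
    errors = []
    body = token_text[1:-1]          # drop the surrounding quotes
    i = 0
    while i < len(body):
        if body[i] != '\\':
            i += 1
            continue
        m = VALID_ESCAPE.match(body, i)
        if m:
            i = m.end()
        else:
            errors.append('invalid escape at offset %d' % i)
            i += 1                   # keep scanning to report every error
    return errors

# check_escapes('"hello \\x7z"') reports the bad \x7z escape at offset 6

Collecting all errors rather than stopping at the first one is what lets you write the good error messages mentioned above.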
Let's say I have the following statement:
SELECT "hi\n
there";
Notice there is a literal newline in there, and the escape \n. The string that antlr4 picks up for me is:
String_Literal: "hi\n\nthere"
In other words, not differentiating between the literal newline and the \n one. Is there a way to differentiate the two, or what's the usual process to do that?
My guess is that the output you pasted into your question comes from a call to the Antlr4 runtime method tree.toStringTree(parser) (or equivalent in whatever target language you've chosen).
That function calls escapeWhitespace in the utilities class/module/file, and that function does what its name suggests: it converts (some) whitespace characters to C-like backslash escape sequences. (Specifically, it handles newline, carriage return, and tab characters.) It does not escape backslash characters, which makes its output ambiguous; there's no way to distinguish between the two-character escape sequence \n and the escaped conversion of a newline character in the message.
They are different in the actual character string, because the Antlr4 lexer does not transform the string value of the matched token in any way. That's your responsibility.
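Handled explicitly, that transformation might look like the following Python sketch (the escape table is an assumption about your language, not something ANTLR provides):

import re

# Turn the raw token text, exactly as the lexer matched it, into the
# runtime string value. Validation is assumed to have happened already.
_ESCAPES = {'n': '\n', 'r': '\r', 't': '\t', '\\': '\\', '"': '"'}

def unescape(token_text):
    body = token_text[1:-1]                      # strip the quotes
    def repl(m):
        esc = m.group(1)
        if len(esc) == 3 and esc.startswith('x'):
            return chr(int(esc[1:], 16))         # \xAB -> one character
        return _ESCAPES.get(esc, esc)
    return re.sub(r'\\(x[0-9A-Fa-f]{2}|.)', repl, body)

After this pass, the literal newline and the two-character \n sequence from the question really do become the same character, but that is now a deliberate decision in your code rather than an artifact of a debug printout.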
In computing, it is very often the case that what you see is not what you got. What you see is just what you see, and a lot of computational power has gone into creating that vision for you. By the same token, nothing guarantees that the vision is an unambiguous, or even useful, representation of the actual values. The best you can say for it is that it's probably more useful than trying to read the data as individual bits. (And, indeed, the individual bits are not physical objects either; despite the common refrain, you could completely disassemble a computer and examine it with an arbitrarily powerful microscope, and you would not see a single 1 or 0.)
That might seem like irrelevant philosophizing, but it has a real consequence: when you're debugging and you see something that makes you think, "that looks wrong", you need to consider two possibilities: maybe the underlying data is incorrect, but maybe the process that rendered the representation is at fault. In this case, I'd say that the failure of escapeWhitespace to convert backslash characters into pairs of backslashes is a bug, but that's a value judgement on my part. Anyway, the function is not critical to the operation of Antlr4, and you could easily replace it.
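For instance, a minimal unambiguous replacement, sketched in Python (transliterating it into your runtime's language is straightforward):

# Like escapeWhitespace, but unambiguous: the backslash itself is
# escaped first, so a literal newline and the two characters \ n can
# no longer produce the same output.
def escape_whitespace_unambiguous(s):
    return (s.replace('\\', '\\\\')
             .replace('\n', '\\n')
             .replace('\r', '\\r')
             .replace('\t', '\\t'))

# print(escape_whitespace_unambiguous('hi\\n\nthere')) -> hi\\n\nthere
# (the source \n now prints as \\n, the literal newline as \n)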
Within the scripting language I am implementing, valid IDs can consist of a sequence of digits, which means I have an ambiguous situation: "345" could be an integer or could be an ID, and that's not known until runtime. Up until now, I've been handling every case as an ID and planning to check at runtime whether a variable has been declared under that name. But while improving my implementation of a particular bit of code, I found a situation where an integer is valid but any other sort of ID would not be. It seems like it would make sense to handle this particular case as a parsing error, so that, e.g., the following bit of code, which activates all picks with a spell level tag greater than 5, would be considered valid:
foreach pick in hero where spell.level? > 5
pick.activate[]
nexteach
but the following which instead compares against an ID that can't be mistaken for an integer constant would be flagged as an error during parsing:
foreach pick in hero where spell.level? > threshold
pick.activate[]
nexteach
I've considered separate tokens, ID and ID_OR_INTEGER, but that means having to handle that ambiguity everywhere I'm currently using an ID, which is a lot of places, including variable declarations, expressions, looping structures, and procedure calls.
Is there a better way to indicate a parsing error than to just print to the error log, and maybe set a flag?
I would think about it differently. If an ID is "just a number" and plain numbers are also needed, I would say any string of digits is a number, and a number might designate an ID in some circumstances.
For bare integer literals (like 345), I would have the tokenizer return a NUMBER token, indicating it found an integer. In the parser, wherever you currently accept ID, change it to NUMBER and call a lookup function to verify the NUMBER is a valid ID.
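A hypothetical sketch of that lookup, with an illustrative symbol table:

def number_as_id(token_text, symbols):
    # In a context where the grammar needs an ID, check the NUMBER
    # token's text against the symbol table; where an integer literal
    # is valid, the same text just goes through int().
    if token_text not in symbols:
        raise NameError("'%s' is not a declared ID" % token_text)
    return symbols[token_text]

symbols = {'345': 'the pick declared as 345'}
number_as_id('345', symbols)   # resolves to the declared variable
int('99')                      # fine where a plain integer is expected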
I might have misunderstood your question. You start by talking about "345", but your second example has no integer strings.
I'm trying to figure out how I can best parse just a subset of a given language with ANTLR. For example, say I'm looking to parse U-SQL. Really, I'm only interested in parsing certain parts of the language, such as query statements. I couldn't be bothered with parsing the many other features of the language. My current approach has been to design my lexer / parser grammar as follows:
// ...
statement
: queryStatement
| undefinedStatement
;
// ...
undefinedStatement
: (.)+?
;
// ...
UndefinedToken
: (.)+?
;
The gist is, I add a fall-back parser rule and lexer rule for undefined structures and tokens. I imagine later, when I go to walk the parse tree, I can simply ignore the undefined statements in the tree, and focus on the statements I'm interested in.
This seems like it would work, but is this an optimal strategy? Are there more elegant options available? Thanks in advance!
Parsing a subpart of a grammar is super easy. Usually you have a top level rule which you call to parse the full input with the entire grammar.
For the subpart use the function that parses only a subrule like:
const expression = parser.statement();
I use this approach frequently when I want to parse stored procedures or data types only.
Keep in mind, however, that subrules usually are not terminated with the EOF token (as the top-level rule should be). This means you get no syntax error if there is more than the subelement in the token stream (the parser simply stops once the subrule has matched completely). If that's a problem for you, add a copy of the subrule you want to parse, give it a dedicated name, and end it with EOF, like this:
dataTypeDefinition: // For external use only. Don't reference this in the normal grammar.
dataType EOF
;
dataType: // type in sql_yacc.yy
type = (
...
Check the MySQL grammar for more details.
This general idea -- to parse the interesting bits of an input and ignore the sea of surrounding tokens -- is usually called "island parsing". There's an example of an island parser in the ANTLR reference book, although I don't know if it is directly applicable.
The tricky part of island parsing is getting the island boundaries right. If you miss a boundary, or recognise as a boundary something which isn't, then your parse will fail disastrously. So you need to understand the input at least well enough to be able to detect where the islands are. In your example, that might mean recognising a SELECT statement, for example. However, you cannot blindly recognise the string of letters SELECT because that string might appear inside a string constant or a comment or some other context in which it was never intended to be recognised as a token at all.
I suspect that if you are going to parse queries, you'll basically need to be able to recognise any token. So it's not going to be a sea of uninspected input characters; you can view it as a sea of recognised but unparsed tokens. In that case, it should be reasonably safe to parse a non-query statement as a keyword followed by arbitrary tokens other than ; and ending with a ;. (But you might need to recognise nested blocks; I don't really know what the possibilities are.)
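A rough sketch of that token-level view, in Python over a generic token list (the type names are assumptions, and strings and comments are presumed already handled correctly by the lexer):

from collections import namedtuple

Token = namedtuple('Token', 'type text')

def interesting_statements(tokens):
    # Statements that do not start with SELECT are consumed, unparsed,
    # up to and including the next ';'; SELECT statements are kept for
    # the real parser.
    kept, pos = [], 0
    while pos < len(tokens):
        start = pos
        while pos < len(tokens) and tokens[pos].type != 'SEMI':
            pos += 1
        stmt = tokens[start:pos + 1]
        if stmt and stmt[0].type == 'SELECT':
            kept.append(stmt)
        pos += 1                      # step past the ';'
    return kept

# toks = [Token('SELECT', 'SELECT'), Token('ID', 'a'), Token('SEMI', ';'),
#         Token('CREATE', 'CREATE'), Token('ID', 't'), Token('SEMI', ';')]
# interesting_statements(toks) keeps only the SELECT statement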
Assuming I define a variable like this in Lua
local input = "..."
Where the ... comes from a user-provided string. Would that user be able to perform code injection just from a variable definition? Do I need to sanitize the string?
As a general rule, if you ever need to ask yourself if you need to sanitize your inputs, the correct answer is "yes".
As to this particular case, if you just copy/paste the user's string directly into the Lua source file, even in quotes like that, they will be able to execute arbitrary code. It's not even particularly difficult; they can provide something like some text"; my_code = 20; last = "end of string, which closes your quoted string, runs the injected statements, and opens a new string to absorb the trailing quote.
The best way to sanitize this is by using a long-form literal string with [[...]] syntax. But even that can be broken out of, so you need to search through the given string for repeated sequences of the = character. Each time you find a sequence, note how many = characters are in it. After searching, use a number of = characters in your literal string's brackets that isn't one of the lengths found in the user string.
Of course, the internal implementation of Lua may have some limit on the length of the = sequence in a long-form literal string. In that case, an external user could break your code by forcing you to use a longer sequence than the implementation supports. But they won't be able to cause arbitrary code execution; you'll just get a compile error.
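Here is a sketch of that counting scheme (in Python for brevity; the emitted literal is Lua):

import re

def lua_long_quote(user_string):
    # Collect the length of every run of '=' in the user string, then
    # pick a positive bracket level that is not used, so no ]=...=]
    # inside the string can close the literal early.
    used = {len(run) for run in re.findall(r'=+', user_string)}
    level = next(n for n in range(1, len(used) + 2) if n not in used)
    eq = '=' * level
    return '[%s[%s]%s]' % (eq, user_string, eq)

# lua_long_quote('some text"]] x = 20') -> '[==[some text"]] x = 20]==]'

Starting the search at level 1 also guards against a bare ]] in the input, which would close a plain [[...]] literal.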
I'm new to lex (or flex) and I have a probably simple question. I want to recognize when a user types in "show " followed by a name, retrieve that name, and store it in a variable. Can I do this with some lex keywords or something? Or would just passing it to a method and splitting at the space be easiest?
Side note: the name could include spaces.
Flex is a tool that is used to create a lexical analyzer. The role of the lexical analyzer, be it generated by Flex or otherwise, is to split the input into tokens. That is, it takes the input stream of characters, s-h-o-w-space, and recognizes that it starts with the token show.
Other tasks, such as storing variable names and values, are better done elsewhere, typically in the parser or in the code that consumes the tokens.
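As a rough illustration, in Python rather than lex and with made-up token names: because the name may contain spaces (per the side note), everything after the keyword is captured as a single NAME token here.

import re

SHOW_LINE = re.compile(r'show\s+(.+)')

def tokenize(line):
    # The lexer's whole job: recognize the keyword and hand back tokens.
    # What the caller does with the name afterwards is not its concern.
    m = SHOW_LINE.fullmatch(line.strip())
    if m:
        return [('SHOW', 'show'), ('NAME', m.group(1))]
    return [('WORD', word) for word in line.split()]

# tokenize('show my counter') -> [('SHOW', 'show'), ('NAME', 'my counter')]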