I'm trying to use ANTLR4 to build a sort of autocomplete system using the getExpectedTokens() function that can be called when the parser experiences an error. getExpectedTokens() returns an IntervalSet containing the token type numbers of all the tokens that would be acceptable at that point in the parse. Is there some mapping from the token numbers back to the actual Tokens themselves? (So, for example, if one of the expected tokens is a keyword, that keyword can be displayed to the user in some way.)
These token names are accessible through the parser's vocabulary.
parser.getVocabulary().getLiteralName(tokenType) will return the literal text for a token type that has one (keywords, operators and other fixed spellings), and null for token types that don't.
Using getSymbolicName() worked for me.
So you could do parser.getVocabulary().getSymbolicName(tokenType) where tokenType is an int.
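A rough sketch of how that lookup can be wired up, using ANTLR's Python target for brevity (there the generated parser carries the vocabulary as the parallel literalNames and symbolicNames lists; MyLangParser and the error-handling context are placeholders, and the exact way the IntervalSet is obtained and iterated may vary with the runtime version):

# Hypothetical helper for ANTLR's Python target; MyLangParser stands in for
# your generated parser class, which defines literalNames and symbolicNames.
def display_name(parser_class, token_type):
    """Map a token type number to something printable for the user."""
    if token_type < 0:
        return 'EOF'
    literal = None
    if token_type < len(parser_class.literalNames):
        literal = parser_class.literalNames[token_type]
    if literal is not None and literal != "<INVALID>":
        return literal.strip("'")                  # keywords, operators, punctuation
    return parser_class.symbolicNames[token_type]  # e.g. IDENT, NUMBER

# Typical use, e.g. in an error listener or error strategy, where the
# IntervalSet of expected token types comes from the recognition exception:
#   expected = e.getExpectedTokens()
#   suggestions = [display_name(MyLangParser, t) for t in expected]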
I'm attempting to implement an existing scripting language using Ply. Everything has been fine until I hit a section with dot notation being used on objects. For most operations, whitespace doesn't matter, so I put it in the ignore list: "3+5" works the same as "3 + 5", and so on. However, in the existing program that uses this scripting language (which I would like to stay as faithful to as I can), there are situations where spaces cannot be inserted: for example, "this.field.array[5]" can't have any spaces between the identifier and the dot or bracket. Is there a way to indicate this in the parser rules without having to treat whitespace as significant everywhere else? Or am I better off handling these constructs in the lexer?
Unless you do something in the lexical scanner to pass whitespace through to the parser, there's not a lot the parser can do.
It would be useful to know why this.field.array[5] must be written without spaces. (Or, maybe, mostly without spaces: perhaps this.field.array[ 5 ] is acceptable.) Is there some other interpretation if there are spaces? Or is it just some misguided aesthetic judgement on the part of the scripting language's designer?
The second case is a lot simpler. If the only possibilities are a correct parse without spaces or a syntax error, it's only necessary to validate the expression after it's been recognised by the parser. A validation function would simply check that the starting position of each token (available as p.lexpos(i), where p is the action function's parameter and i is the index of the token in the production's RHS) is precisely the starting position of the previous token plus the length of the previous token.
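A minimal sketch of that check in Ply (the grammar, token names and error handling are invented purely for illustration):

import ply.lex as lex
import ply.yacc as yacc

tokens = ('IDENT', 'DOT')

t_IDENT  = r'[A-Za-z_][A-Za-z0-9_]*'
t_DOT    = r'\.'
t_ignore = ' \t'

def t_error(t):
    t.lexer.skip(1)

def p_member(p):
    'member : IDENT DOT IDENT'
    # Each token must begin exactly where the previous one ended:
    # previous start + previous length == current start.
    for i in range(2, 4):
        if p.lexpos(i) != p.lexpos(i - 1) + len(p[i - 1]):
            print('spaces are not allowed in member access')
            raise SyntaxError
    p[0] = ('member', p[1], p[3])

def p_error(p):
    pass

lexer = lex.lex()
parser = yacc.yacc()
print(parser.parse('this.field'))    # ('member', 'this', 'field')
print(parser.parse('this . field'))  # fails the adjacency check, returns None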
One possible reason to require the name of the indexed field to immediately follow the . is to simplify the lexical scanner, in the event that it is desired that otherwise reserved words be usable as member names. In theory, there is no reason why any arbitrary identifier, including language keywords, cannot be used as a member selector in an expression like object.field. The . is an unambiguous signal that the following token is a member name, and not a different syntactic entity. JavaScript, for example, allows arbitrary identifiers as member names; although it might confuse readers, nothing stops you from writing obj.if = true.
That's a bit of a challenge for the lexical scanner, though. In order to correctly analyse the input stream, it needs to be aware of the context of each identifier; if the identifier immediately follows a . used as a member selector, the keyword recognition rules must be suppressed. This can be done using lexical states, available in most lexer generators, but it's definitely a complication. Alternatively, one can adopt the rule that the member selector is a single token, including the .. In that case, obj.if consists of two tokens (obj, an IDENTIFIER, and .if, a SELECTOR). The easiest implementation is to recognise SELECTOR using a pattern like \.[a-zA-Z_][a-zA-Z0-9_]*. (That's not what JavaScript does. In JavaScript, it's not only possible to insert arbitrary whitespace between the . and the selector, but even comments.)
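For example, a minimal Ply sketch of that single-token approach (the reserved-word handling is the usual Ply idiom; names are invented):

import ply.lex as lex

tokens = ('IDENT', 'SELECTOR', 'IF')
reserved = {'if': 'IF'}

# The '.' and the following name are lexed as one SELECTOR token, so the
# reserved-word lookup below is never applied to member names such as .if
def t_SELECTOR(t):
    r'\.[A-Za-z_][A-Za-z0-9_]*'
    t.value = t.value[1:]      # keep only the member name
    return t

def t_IDENT(t):
    r'[A-Za-z_][A-Za-z0-9_]*'
    t.type = reserved.get(t.value, 'IDENT')
    return t

t_ignore = ' \t'

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('obj.if')
print([(tok.type, tok.value) for tok in lexer])   # [('IDENT', 'obj'), ('SELECTOR', 'if')]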
Based on a comment by the OP, it seems plausible that this is part of the reasoning for the design of the original scripting language, although it doesn't explain the prohibition of whitespace before the . or before a [ operator.
There are languages which resolve grammatical ambiguities based on the presence or absence of surrounding whitespace: for example, disambiguating operators which can be either unary or binary (Swift); or distinguishing the use of | as a boolean operator from its use as an absolute value expression (uncommon, but see https://cs.stackexchange.com/questions/28408/lexing-and-parsing-a-language-with-juxtaposition-as-an-operator); or even distinguishing the use of (...) for grouping expressions from its use in a function call (Awk, for example). So it's certainly possible to imagine a language in which the . and/or [ tokens have different interpretations depending on the presence or absence of surrounding whitespace.
If you need to distinguish the cases of tokens with and without surrounding whitespace so that the grammar can recognise them in different ways, then you'll need to either pass whitespace through as a token, which contaminates the entire grammar, or provide two (or more) different versions of the tokens whose syntax varies depending on whitespace. You could do that with regular expressions, but it's probably easier to do it in the lexical action itself, again making use of the lexer state. Note that the lexer state includes lexdata, the input string itself, and lexpos, the index of the next input character; the index of the first character in the current token is in the token's lexpos attribute. So, for example, a token was preceded by whitespace if t.lexpos == 0 or t.lexer.lexdata[t.lexpos-1].isspace(), and it is followed by whitespace if t.lexer.lexpos == len(t.lexer.lexdata) or t.lexer.lexdata[t.lexer.lexpos].isspace().
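For instance, a Ply sketch of that reclassification in the lexical action (the token name DOT_TIGHT is invented, and the same check could classify [ as well):

import ply.lex as lex

# DOT_TIGHT is produced when the '.' has no whitespace on either side;
# otherwise the ordinary DOT token is returned.
tokens = ('IDENT', 'NUMBER', 'PLUS', 'LBRACKET', 'RBRACKET', 'DOT', 'DOT_TIGHT')

t_IDENT    = r'[A-Za-z_][A-Za-z0-9_]*'
t_NUMBER   = r'\d+'
t_PLUS     = r'\+'
t_LBRACKET = r'\['
t_RBRACKET = r'\]'
t_ignore   = ' \t'

def t_DOT(t):
    r'\.'
    data = t.lexer.lexdata
    glued_left  = t.lexpos > 0 and not data[t.lexpos - 1].isspace()
    glued_right = t.lexer.lexpos < len(data) and not data[t.lexer.lexpos].isspace()
    if glued_left and glued_right:
        t.type = 'DOT_TIGHT'
    return t

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('this.array[5] + x . y')
print([(tok.type, tok.value) for tok in lexer])
# The first dot comes out as DOT_TIGHT, the spaced one as a plain DOT.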
Once you've divided tokens into two or more token types, you'll find that you really don't need the division in most productions. So you'll usually find it useful to define a new non-terminal for each token type representing all of the whitespace-context variants of that token; then, you only need to use the specific variants in productions where it matters.
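Continuing the sketch above on the parser side (productions only; the grammar shape is invented), a generic nonterminal covers both variants wherever spacing is irrelevant, and the tight variant is named explicitly only where it matters:

# Generic wrapper: use 'dot' in any production that doesn't care about spacing.
def p_dot(p):
    '''dot : DOT
           | DOT_TIGHT'''
    p[0] = p[1]

# Member access is the one place that insists on the tight variant.
def p_postfix_member(p):
    'postfix : postfix DOT_TIGHT IDENT'
    p[0] = ('member', p[1], p[3])

# A hypothetical context where either spelling of '.' is acceptable.
def p_path(p):
    'path : IDENT dot IDENT'
    p[0] = ('path', p[1], p[3])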
I've seen a few language implementations that read a token, start parsing with it, and then, whenever the parser needs to look at the next token, request it from the lexer.
So for if (x == 3) you lex once and check what you got (an if, in this case), lex again and make sure it's a (, then parse an expression (which requests the x, == and 3 tokens as it needs them), and finally lex again and expect the closing parenthesis.
The other alternative is to lex the whole input stream first, into keyword, symbol, identifier, equality operator, number, symbol, and then hand that token list to the parser, which parses it into an AST.
What are the pros/cons of these two techniques?
For most grammars, it doesn't really matter whether you lex the entire input into a token list as a first pass and then take tokens from that list while parsing, or lex on demand. The second method avoids the need for an in-memory token list; the first means that if you need to parse the same input several times, as an interpreter might, the later passes are a bit faster because the tokens are already there.
However, if the grammar requires more than one token of lookahead, or can't be parsed in a single strict left-to-right pass, you may need to lex further ahead. While natural languages have some odd parse rules ("time flies like an arrow, fruit flies like bananas"), computer languages are usually designed to be parseable by a simple recursive-descent parser with one token of lookahead.
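As a sketch of the two front ends side by side (a toy regex lexer, invented for illustration; a real lexer would also classify keywords):

import re

TOKEN = re.compile(r'\s*(?:(\d+)|([A-Za-z_]\w*)|(==|\S))')

def classify(m):
    if m.group(1): return ('NUMBER', m.group(1))
    if m.group(2): return ('NAME', m.group(2))
    return ('SYMBOL', m.group(3))

def tokenize_all(text):
    # Token-list approach: lex everything in one pass and keep it in memory.
    return [classify(m) for m in TOKEN.finditer(text)] + [('EOF', '')]

def lex_on_demand(text):
    # On-demand approach: the parser pulls the next token only when it needs it.
    for m in TOKEN.finditer(text):
        yield classify(m)
    yield ('EOF', '')

print(tokenize_all('if (x == 3)'))
print(list(lex_on_demand('if (x == 3)')))   # same tokens, produced lazily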
I'm new to lex (or flex) and I have a probably simple question. I want to recognize when a user types in "show " followed by a name, retrieve that name, and store it as a variable. Can I do this with some lex keywords or something? Or would it be easiest to just pass the whole line to a method and split it at the space?
Side note: the name could include spaces.
Flex is a tool that is used to create a lexical analyzer. The role of the lexical analyzer, be it generated by Flex or otherwise, is to split the input into tokens. That is, it takes the input stream of characters, s-h-o-w-space, and recognizes that it starts with the token show.
Doing other things, such as storing variable names and values, is better done elsewhere.
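Purely to illustrate that split (this sketch uses Ply rather than flex, since that's what the rest of this page uses, and the token names are invented; a flex spec would have the same two rules with C actions):

import ply.lex as lex

tokens = ('SHOW', 'NAME')

# The keyword is its own token...
def t_SHOW(t):
    r'show\b'
    return t

# ...and everything after it up to the end of the line is the name,
# which is how a name containing spaces can still be a single token.
t_NAME   = r'[^\n]+'
t_ignore = ' \t'

def t_error(t):
    t.lexer.skip(1)

lexer = lex.lex()
lexer.input('show my variable')
print([(tok.type, tok.value) for tok in lexer])   # [('SHOW', 'show'), ('NAME', 'my variable')]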
I am writing an interpreter for a mathematical language in Rust which is intended to be used to solve mathematical expressions.
When lexing, the program needs to know, based on the characters used in a token, what type of token it is (for example, whether it is a function or an operator).
Currently I use an enumeration to represent a type of token:
pub enum IdentifierType {
    Function,
    Variable,
    Operator,
    Integer,
}
To check the type of a token I use a function which takes an IdentifierType as input, matches on it, and returns a bool. The data structures needed in this case are relatively simple, as tokens only have a single property: their allowed characters.
When parsing to an Abstract Syntax Tree (AST), I would like to know what specific operator or function is being used based on a token and to be able to add a reference to that operator and its associated functions to the AST.
When interpreting, I would like to be able to call execute on a node and have it know how to perform its own function.
I have tried to come up with a solution to store all of these related items, but none that I have come up with has felt satisfactory.
For example, I stored all of the operators in a TOML file (a type of configuration file that maps to a hash table), but storing enumerations (values that are constrained) is difficult, and there is no way to store an operator's function. I also want to be able to search by multiple keys, such as operator associativity (e.g. get all operators that are right-associative), which makes storing everything within the source code feel unsatisfactory as well.
Another idea I have had is some kind of SQL hybrid system, but that seems tough to implement.
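For concreteness, the kind of table and secondary lookup I'm describing looks roughly like this (sketched in Python rather than Rust just to keep it short; in Rust it would be a struct plus a HashMap of fn pointers, and all the names here are invented):

import operator

# Hypothetical in-source operator table, keyed by symbol.
OPERATORS = {
    '+': {'precedence': 1, 'associativity': 'left',  'arity': 2, 'apply': operator.add},
    '-': {'precedence': 1, 'associativity': 'left',  'arity': 2, 'apply': operator.sub},
    '*': {'precedence': 2, 'associativity': 'left',  'arity': 2, 'apply': operator.mul},
    '^': {'precedence': 3, 'associativity': 'right', 'arity': 2, 'apply': operator.pow},
}

def operators_with_associativity(assoc):
    """Secondary lookup: e.g. all right-associative operators."""
    return [sym for sym, info in OPERATORS.items() if info['associativity'] == assoc]

print(operators_with_associativity('right'))   # ['^']
print(OPERATORS['^']['apply'](2, 3))           # 8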
Hi, I am currently implementing a lexer that breaks XML files up into tokens, and I'm considering ways of passing those tokens on to a parser to create a more useful data structure out of them. My current plan is to store them in an ArrayList and pass this to the parser. Would a linked list, where each token points to the next, be better suited? Or does being able to access tokens by index make the parser easier to write? Or is this all a terrible strategy?
Also, for anyone who has used ANTLR: I know it uses a token stream to pass tokenized input to the parser, but how can the parser decide whether the input is valid, or build a data structure, if it does not yet have all the tokens from the input?
Any feedback / opinion welcome, thanks!
The most common architecture for this type of parser is to run the lexer from inside your parser: every time the parser needs a token, it calls a function (from the lexer) that retrieves the next one.
I don't know ANTLR, but I think they all use much the same approach. What I'm describing is how yacc and lex work together.
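A minimal sketch of that pull model (the toy XML-ish lexer and the class names are invented; the point is the interface, not the grammar). The parser keeps only one token of lookahead and asks for the next token as it goes, so it never needs the whole token list up front:

import re

TOKEN = re.compile(r'\s*(?:(</?[A-Za-z][\w-]*>?)|([^<\s][^<]*))')

def lex(text):
    for m in TOKEN.finditer(text):
        if m.group(1):
            yield ('TAG', m.group(1))
        else:
            yield ('TEXT', m.group(2).strip())
    yield ('EOF', '')

class TokenStream:
    """Roughly what a parser needs from a token stream: next() plus one-token lookahead."""
    def __init__(self, text):
        self._gen = lex(text)
        self._lookahead = next(self._gen)

    def peek(self):
        return self._lookahead          # decide what to do from the upcoming token

    def next(self):
        tok, self._lookahead = self._lookahead, next(self._gen)
        return tok

stream = TokenStream('<greeting>hello</greeting>')
while stream.peek()[0] != 'EOF':
    print(stream.next())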