I am writing an interpreter in Rust for a mathematical language intended to evaluate mathematical expressions.
When lexing, the program needs to determine, based on the characters in a token, what type of token it is (for example, whether it is a function or an operator).
Currently I use an enumeration to represent a type of token:
pub enum IdentifierType {
    Function,
    Variable,
    Operator,
    Integer,
}
To check the type of a token I use a function that takes an IdentifierType as input and matches on it to return a bool. The data structures needed here are relatively simple, as tokens only have a single property: their allowed characters.
When parsing to an Abstract Syntax Tree (AST), I would like to know what specific operator or function is being used based on a token and to be able to add a reference to that operator and its associated functions to the AST.
When interpreting, I would like to be able to call execute on a node and have it know how to perform its own function.
I have tried to come up with a solution for storing all of these related items, but none that I have encountered has felt satisfactory.
For example, I stored all of the operators in a TOML file (a type of configuration file that maps to a hash table), but storing enumerations (values that are constrained) is difficult, and there is no way to store an operator's function. I also want to be able to search by multiple keys, such as operator associativity (e.g. get all operators that are right-associative), which means storing them within source code is not very satisfactory.
Another idea I have had is using some kind of SQL hybrid system; however, that seems tough to implement.
Related
I'm attempting to implement an existing scripting language using Ply. Everything has been alright until I hit a section with dot notation being used on objects. For most operations, whitespace doesn't matter, so I put it in the ignore list. "3+5" works the same as "3 + 5", etc. However, in the existing program that uses this scripting language (which I would like to stay as faithful to as I can), there are situations where spaces cannot be inserted; for example, "this.field.array[5]" can't have any spaces between the identifier and the dot or bracket. Is there a way to indicate this in the parser rule without having to handle whitespace not being important everywhere else? Or am I better off building these items in the lexer?
Unless you do something in the lexical scanner to pass whitespace through to the parser, there's not a lot the parser can do.
It would be useful to know why this.field.array[5] must be written without spaces. (Or, maybe, mostly without spaces: perhaps this.field.array[ 5 ] is acceptable.) Is there some other interpretation if there are spaces? Or is it just some misguided aesthetic judgement on the part of the scripting language's designer?
The second case is a lot simpler. If the only possibilities are a correct parse without space or a syntax error, it's only necessary to validate the expression after it's been recognised by the parser. A simple validation function would check that the starting position of each token (available as p.lexpos(i), where p is the action function's parameter and i is the index of the token in the production's RHS) is precisely the starting position of the previous token plus the length of the previous token.
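Something like this untested sketch, which assumes the RHS symbols being checked are terminals whose semantic values are the matched strings (the helper name, token names and production are all hypothetical):

def tokens_adjacent(p, first, last):
    # True if RHS symbols first..last were written with no intervening
    # whitespace: each token starts exactly where the previous one ended.
    for i in range(first + 1, last + 1):
        prev_end = p.lexpos(i - 1) + len(str(p[i - 1]))
        if p.lexpos(i) != prev_end:
            return False
    return True

def p_selector(p):
    'selector : IDENTIFIER DOT IDENTIFIER'
    if not tokens_adjacent(p, 1, 3):
        raise SyntaxError("whitespace is not allowed around '.'")
    p[0] = ('member', p[1], p[3])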
One possible reason to require the name of the indexed field to immediately follow the . is to simplify the lexical scanner, in the event that it is desired that otherwise reserved words be usable as member names. In theory, there is no reason why any arbitrary identifier, including language keywords, cannot be used as a member selector in an expression like object.field. The . is an unambiguous signal that the following token is a member name, and not a different syntactic entity. JavaScript, for example, allows arbitrary identifiers as member names; although it might confuse readers, nothing stops you from writing obj.if = true.
That's a bit of a challenge for the lexical scanner, though. In order to correctly analyse the input stream, it needs to be aware of the context of each identifier; if the identifier immediately follows a . used as a member selector, the keyword recognition rules must be suppressed. This can be done using lexical states, available in most lexer generators, but it's definitely a complication. Alternatively, one can adopt the rule that the member selector is a single token, including the leading dot. In that case, obj.if consists of two tokens (obj, an IDENTIFIER, and .if, a SELECTOR). The easiest implementation is to recognise SELECTOR using a pattern like \.[a-zA-Z_][a-zA-Z0-9_]*. (That's not what JavaScript does. In JavaScript, it's not only possible to insert arbitrary whitespace between the . and the selector, but even comments.)
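In Ply, the single-token SELECTOR approach could look roughly like this (a sketch; the token name and the choice to strip the dot from the value are assumptions):

def t_SELECTOR(t):
    r'\.[a-zA-Z_][a-zA-Z0-9_]*'
    t.value = t.value[1:]   # keep only the member name, dropping the leading '.'
    return t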
Based on a comment by the OP, it seems plausible that this is part of the reasoning for the design of the original scripting language, although it doesn't explain the prohibition of whitespace before the . or before a [ operator.
There are languages which resolve grammatical ambiguities based on the presence or absence of surrounding whitespace, for example in disambiguating operators which can be either unary or binary (Swift); or distinguishing between the use of | as a boolean operator and its use as an absolute value expression (uncommon, but see https://cs.stackexchange.com/questions/28408/lexing-and-parsing-a-language-with-juxtaposition-as-an-operator); or even distinguishing the use of (...) in grouping expressions from its use in a function call (Awk, for example). So it's certainly possible to imagine a language in which the . and/or [ tokens have different interpretations depending on the presence or absence of surrounding whitespace.
If you need to distinguish the cases of tokens with and without surrounding whitespace so that the grammar can recognise them in different ways, then you'll need to either pass whitespace through as a token, which contaminates the entire grammar, or provide two (or more) different versions of the tokens whose syntax varies depending on whitespace. You could do that with regular expressions, but it's probably easier to do it in the lexical action itself, again making use of the lexer state. Note that the lexer state includes lexdata, the input string itself, and lexpos, the index of the next input character; the index of the first character in the current token is in the token's lexpos attribute. So, for example, a token was preceded by whitespace if t.lexpos == 0 or t.lexer.lexdata[t.lexpos-1].isspace(), and it is followed by whitespace if t.lexer.lexpos == len(t.lexer.lexdata) or t.lexer.lexdata[t.lexer.lexpos].isspace().
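Put together, a lexical action for the dot token might look like the following sketch, which uses exactly the checks above to emit one of two token types (the names DOT and TIGHT_DOT are assumptions, and both would need to be declared as tokens):

def t_DOT(t):
    r'\.'
    data = t.lexer.lexdata
    space_before = t.lexpos == 0 or data[t.lexpos - 1].isspace()
    space_after = (t.lexer.lexpos == len(data)
                   or data[t.lexer.lexpos].isspace())
    # TIGHT_DOT only when the '.' touches both of its neighbours
    t.type = 'TIGHT_DOT' if not (space_before or space_after) else 'DOT'
    return t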
Once you've divided tokens into two or more token types, you'll find that you really don't need the division in most productions. So you'll usually find it useful to define a new non-terminal for each token type representing all of the whitespace-context variants of that token; then, you only need to use the specific variants in productions where it matters.
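For example, in the productions where the distinction doesn't matter, a wrapper non-terminal covers both variants (again a sketch with the assumed token names from above):

def p_any_dot(p):
    '''any_dot : DOT
               | TIGHT_DOT'''
    p[0] = p[1]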
I'm using the Z3_parse_smtlib2_string function from the Z3 C API (via Haskell's Z3 lib) to parse an SMTLIB file and apply some tactics to simplify its content, however I notice that any push, pop and check-sat commands appear to be swallowed by this function and do not appear in the resulting AST.
Is there any way I can parse this without losing these commands (and then apply the required tactics, once again without losing them)?
I don't think it's possible to do this with Z3_parse_smtlib2_string. As you can see in the documentation "It returns a formula comprising of the conjunction of assertions in the scope (up to push/pop) at the end of the string." See: https://z3prover.github.io/api/html/group__capi.html#ga7905ebec9289b9fe5debcad965f6267e
Note that the reason for this is not mere "not-implemented" or "buggy" behaviour. Look at the return type of the function you're using. It returns a Z3_ast_vector, and Z3_ast only captures "expressions" in the SMTLib language. But push/pop etc. are not considered expressions by Z3, but rather commands; i.e., they are represented differently internally. (Whether this was a conscious choice or historical is something I'm not sure about.)
I don't think there's a function that does what you're asking, i.e., one that returns both expressions and commands. You can ask at https://github.com/Z3Prover/z3/discussions to see if the developers can provide an alternative API, or if they already have something exposed to users that achieves this.
I am making a compiler with Jflex and Bison. Jflex does the lexical analysis. Bison does the parsing.
The lexical analysis (in a .l file) is perfect. It tokenizes the input and passes the tokens to the .y file for Bison to parse.
I need the parser to print an error for redeclared/undeclared variables. My thought is that it would need some sort of memory to remember all the variables declared so far, so that it can produce an error when it sees a variable being redeclared or an undeclared variable being used. For example, given "bool", "test", "=", "true", ";" and, on a new line, "test2", "=", "false", ";", the parser would need some sort of memory to remember "test", and when it parses the second line it could consult that memory and report that "test2" is undeclared, hence printing an error.
What I'm confused about is how to create that kind of memory with Bison using Java in the .y file. With C, you would use the -d flag and it would generate two files with enum types and a header file which could keep track of the declared variables, but in Java I'm not sure I can do the same, as I can't structure the grammar in any way that will remember variable names.
I could make a symbol table in Java code to check for redeclared variables, but in the main() in the .y file I have
public static void main(String args[]) throws IOException {
    EXAMPLELexer lexer = new EXAMPLELexer(System.in);
    EXAMPLE parser = new EXAMPLE(lexer);
    if (parser.parse()) {
        System.out.println("VALID FROM PARSER");
    } else {
        System.out.println("ERROR FROM PARSER");
    }
    return;
}
There is no way to get the tokens individually and pass them into another Java instance or whatever. %union{} doesn't work with Java, so I don't know how this is even possible.
I can't find a single piece of documentation explaining this, so I would love some answers!
It's actually a lot simpler to add your own data to a Bison-generated Java parser than it is to a C parser (or even a C++ parser).
Note that Bison's Java API does not have unions, mostly because Java doesn't have unions. All semantic values are non-primitive types, so they derive from Object. If you need to, you can cast them to a more precise type, or even a primitive type.
(There is an option to define a more precise base class for semantic value types, but Object is probably a good place to start.)
The %code { ... } blocks are just copied into the parser class. So you can add your own members, as well as methods to manipulate them. If you want a symbol table, just add it as a HashMap to the parser class, and then you can add whatever you like to it in your actions.
Since all the parser actions are within the parser class, they have direct access to whatever members and member functions you add to the parser. All of Bison's internal members and member functions have names starting with yy, except for the member functions documented in the manual, so you can use almost any names you want without fear of name collision.
You can also use %parse-param to add arguments to the constructor; each argument corresponds to a class member. But that's probably not necessary for this particular exercise.
Of course, you'll have to figure out what an appropriate value type for the symbol is; that depends completely on what you're trying to do with the symbols. If you only want to validate that the symbols are defined when they are used, I suppose you could get away with a HashSet, but I'm sure eventually you'll want to store some more useful information.
Context
I've recently run into an issue that I couldn't solve by myself in a parser I'm writing.
This parser is a component in a compiler I'm building, and the question concerns the expression parsing needed when parsing a programming language.
My parser uses recursive descent to parse expressions.
The problem
I parse expressions using ordinary grammar rules, and I've eliminated left recursion in all my rules, but there is one syntactic "ambiguity" that my parser simply can't handle, and it involves generics.
comparison → addition ( ( ">" | ">=" | "<" | "<=" ) addition )* ;
is the rule I use for parsing comparison nodes in the expression
On the other hand I decided to parse generic expressions this way:
generic → primary ( "<" arguments ">" ) ;
where
arguments → expression ( "," expression )* ;
Now, because generic expressions have higher precedence (they are language constructs rather than mathematical expressions), this creates a scenario where the generic parser attempts to parse expressions when it shouldn't.
For example, in a<2 it will parse "a" as a primary element of the identifier type, immediately afterwards find the syntax for a generic type, try to parse that, and fail because it can't find the closing ">".
What is the solution to such a scenario, especially in languages like C++ where generics can also contain expressions? If I'm not mistaken, arr<1<2> might be legal syntax.
Is this a special edge case, or does it require a modification to the syntax definition that I'm not aware of?
Thank you
for example in a<2 it will parse "a" as a primary element of the identifier type, immediately afterwards find the syntax for a generic type, parse that and fail as it can't find the closing tag
This particular case could be solved with backtracking or unbounded lookahead. As you said, the parser will eventually fail when interpreting this as a generic, so when that happens, you can go back and parse it as a relational operator instead. The lookahead variant would be to look ahead when seeing a < to check whether the < is followed by comma-separated type names and a > and only go into the generic rule if that is the case.
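Here is a minimal sketch of the backtracking idea over a pre-tokenised input, restricting the speculative generic-argument parse to plain type names; all names are hypothetical and it is not a complete expression parser:

class ParseError(Exception):
    pass

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens   # a pre-split list of token strings
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def parse_type_name(self):
        tok = self.peek()
        if tok is None or not tok.isidentifier():
            raise ParseError("expected a type name, got %r" % (tok,))
        return self.take()

    def parse_primary(self):
        tok = self.take()
        if tok is not None and tok.isidentifier() and self.peek() == '<':
            mark = self.pos            # remember where the speculation started
            try:
                self.take()            # consume '<'
                args = [self.parse_type_name()]
                while self.peek() == ',':
                    self.take()
                    args.append(self.parse_type_name())
                if self.take() != '>':
                    raise ParseError("missing '>'")
                return ('generic', tok, args)
            except ParseError:
                self.pos = mark        # backtrack: treat '<' as a comparison
        return tok

    def parse_comparison(self):
        node = self.parse_primary()
        while self.peek() in ('<', '<=', '>', '>='):
            op = self.take()
            node = (op, node, self.parse_primary())
        return node

print(Parser(['a', '<', '2']).parse_comparison())                 # ('<', 'a', '2')
print(Parser(['a', '<', 'b', ',', 'c', '>']).parse_comparison())  # ('generic', 'a', ['b', 'c'])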
However, that approach no longer works if both interpretations are syntactically valid (meaning the syntax actually is ambiguous). One example of that would be x<y>z, which could either be a declaration of a variable z of type x<y> or two comparisons. This example is somewhat unproblematic since the latter meaning is almost never the intended one, so it's okay to always interpret it as the former (this happens in C#, for example).
Now if we allow expressions, it becomes more complicated. For x<y>z it's easy enough to say that this should never be interpreted as two comparisons, as it makes no sense to compare the result of a comparison with something else (in many languages using relational operators on Booleans is a type error anyway). But for something like a<b<c>() there are two interpretations that might both be valid: either a is a generic function called with the generic argument b<c, or b is a generic function with the generic argument c (and a is compared to the result of calling that function). At this point it is no longer possible to resolve that ambiguity with syntactic rules alone:
In order to support this, you'll need to either check whether the given primary refers to a generic function and make different parsing decisions based on that or have your parser generate multiple trees in case of ambiguities and then select the correct one in a later phase. The former option means that your parser needs to keep track of which generic functions are currently defined (and in scope) and then only go into the generic rule if the given primary is the name of one of those functions. Note that this becomes a lot more complicated if you allow functions to be defined after they are used.
So, in summary, supporting expressions as generic arguments requires you either to keep track of which functions are in scope while parsing and use that information to make your parsing decisions (meaning your parser is context-sensitive), or to generate multiple possible ASTs. Without expressions you can keep it context-free and unambiguous, but it will require backtracking or arbitrary lookahead (meaning it's LL(*)).
Since neither of those are ideal, some languages change the syntax for calling generic functions with explicit type parameters to make it LL(1). For example:
Java puts the generic argument list of a method before the method name, i.e. obj.<T>foo() instead of obj.foo<T>().
Rust requires :: before the generic argument list: foo::<T>() instead of foo<T>().
Scala uses square brackets for generics and for nothing else (array subscripts use parentheses): foo[T]() instead of foo<T>().
I want to use flex to handle patterns. In this case, both constants and function names are alphabetical strings that begin with an uppercase letter.
For example, in Mother(Liz, Bob), how can I differentiate Mother and Liz?
I want ( to be a single token, so I cannot treat Mother( as a single pattern.
Normally, it would be unnecessary to generate different token types for different kinds of identifier. The parser shouldn't need that distinction if the different uses can be distinguished syntactically. (If you need semantic information to differentiate, and a sentence could be ambiguous without that information, then you might need semantic feedback but that does not appear to be the case here.)
If you don't have a parser, you would need to do some syntactic analysis. Say, for example, that function names are always followed by a ( -- which means that your language doesn't allow higher order functions. Then you could write a wrapper around yylex which reads one token in advance and emits a FUNCTION_NAME or CONSTANT_NAME, depending on the following token.
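Not flex code, but the same one-token-lookahead idea sketched in Python over a generic stream of (type, text) pairs, just to illustrate the retagging (all names here are hypothetical):

def retag_names(tokens):
    # One token of lookahead: relabel each NAME based on what follows it.
    tokens = list(tokens)
    for i, (ttype, text) in enumerate(tokens):
        if ttype == 'NAME':
            nxt = tokens[i + 1][0] if i + 1 < len(tokens) else None
            ttype = 'FUNCTION_NAME' if nxt == 'LPAREN' else 'CONSTANT_NAME'
        yield ttype, text

stream = [('NAME', 'Mother'), ('LPAREN', '('), ('NAME', 'Liz'), ('COMMA', ','),
          ('NAME', 'Bob'), ('RPAREN', ')')]
print(list(retag_names(stream)))
# Mother comes out as FUNCTION_NAME; Liz and Bob as CONSTANT_NAME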