FParsec alternatives: getting the parser which matched the input - F#

When using alternative parsers, is there a way to find out which parser matched the input?
My input string has the following format:
{firstPart_number} {secondPart_operator_symbol} {thirdPart}
Here firstPart is always a number, the second part is an alternative parser for the operator, and the thirdPart is also an alternative (a number or a list of strings).
Sample input:
1 + 2
5 * 3
1 in {2,45,6}
Since my discriminated union cases are of different types, I want to know which parser matched the input so that, based on that parser, I can create an instance of the right discriminated union case.
How do I handle this situation in FParsec, where the first part is the same across parsers but the second and third parts differ, and instantiate the type accordingly using |>>?

My present problem was solved by using the attempt parser with alternatives: attempt backtracks if its parser doesn't match, so the next alternative parser gets to parse the input again and match.
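As a sketch of that approach (the union type and parser names below are hypothetical, modelled on the sample input), each alternative maps its result to a different union case via |>>, so the case itself records which parser matched:

open FParsec

type Expr =
    | Arithmetic of int * char * int   // e.g. "1 + 2"
    | Membership of int * int list     // e.g. "1 in {2,45,6}"

let number = pint32 .>> spaces

// number, operator symbol, number
let arithmetic =
    pipe3 number (anyOf "+-*/" .>> spaces) number
          (fun a op b -> Arithmetic (a, op, b))

// number, 'in', braced comma-separated list of numbers
let membership =
    pipe2 (number .>> pstring "in" .>> spaces)
          (between (pchar '{') (pchar '}') (sepBy pint32 (pchar ',')))
          (fun n xs -> Membership (n, xs))

// Both alternatives start by consuming a number, so the first is wrapped
// in `attempt`: on failure FParsec backtracks and tries the next one.
let expr = attempt arithmetic <|> membership

run expr "5 * 3"          // Success: Arithmetic (5, '*', 3)
run expr "1 in {2,45,6}"  // Success: Membership (1, [2; 45; 6])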

Related

Handling unambiguous yet breaking syntax in expression parsing

Context
I've recently come up with an issue that I couldn't solve by myself in a parser I'm writing.
This parser is a component in a compiler I'm building and the question is in regards to the expression parsing necessary in programming language parsing.
My parser uses recursive descent to parse expressions.
The problem
I parse expressions using normal regular language parsing rules; I've eliminated left recursion in all my rules, but there is one syntactic "ambiguity" which my parser simply can't handle, and it involves generics.
comparison → addition ( ( ">" | ">=" | "<" | "<=" ) addition )* ;
is the rule I use for parsing comparison nodes in the expression
On the other hand I decided to parse generic expressions this way:
generic → primary ( "<" arguments ">" ) ;
where
arguments → expression ( "," expression )* ;
Now because generic expressions have higher precedence, as they are language constructs and not mathematical expressions, this causes a scenario where the generic parser will attempt to parse expressions when it shouldn't.
For example in a<2 it will parse "a" as a primary element of the identifier type, immediately afterwards find the syntax for a generic type, parse that and fail as it can't find the closing tag.
What is the solution to such a scenario? Especially in languages like C++, where generics can also contain expressions: if I'm not mistaken, arr<1<2> might be legal syntax.
Is this a special edge case, or does it require a modification to the syntax definition that I'm not aware of?
Thank you
This particular case could be solved with backtracking or unbounded lookahead. As you said, the parser will eventually fail when interpreting this as a generic, so when that happens, you can go back and parse it as a relational operator instead. The lookahead variant would be to look ahead when seeing a < to check whether the < is followed by comma-separated type names and a > and only go into the generic rule if that is the case.
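To make the backtracking variant concrete, here is a minimal FParsec-style sketch (the node type and parser names are illustrative, not taken from the question): the generic interpretation is tried first, and when it fails, e.g. on a missing closing >, the parser backtracks and re-reads the < as a relational operator.

open FParsec

type Node =
    | Ident of string
    | Num of int
    | Generic of string * Node list   // a<b, c>
    | Less of Node * Node             // a < b

let ident = many1Satisfy isLetter .>> spaces
let number = pint32 .>> spaces |>> Num

// forward declaration so generic arguments can be full expressions
let expr, exprRef = createParserForwardedToRef<Node, unit> ()

let genericCall =
    pipe2 ident
          (between (pchar '<' .>> spaces) (pchar '>' .>> spaces)
                   (sepBy1 expr (pchar ',' .>> spaces)))
          (fun name args -> Generic (name, args))

// `attempt` rewinds to the start of the generic parse on failure,
// so the same '<' can be re-read as a relational operator below.
let primary = attempt genericCall <|> (ident |>> Ident) <|> number

let comparison =
    pipe2 primary (opt (pchar '<' .>> spaces >>. primary))
          (fun l -> function Some r -> Less (l, r) | None -> l)

exprRef.Value <- comparison

run comparison "a<2"      // Less (Ident "a", Num 2)
run comparison "a<b, c>"  // Generic ("a", [Ident "b"; Ident "c"])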
However that approach no longer works if both interpretations are syntactically valid (meaning the syntax actually is ambiguous). One example of that would be x<y>z, which could either be a declaration of a variable z of type x<y> or two comparisons. This example is somewhat unproblematic since the latter meaning is almost never the intended one, so it's okay to always interpret it as the former (this happens in C# for example).
Now if we allow expressions, it becomes more complicated. For x<y>z it's easy enough to say that this should never be interpreted as two comparisons, as it makes no sense to compare the result of a comparison with something else (in many languages using relational operators on Booleans is a type error anyway). But for something like a<b<c>() there are two interpretations that might both be valid: Either a is a generic function called with the generic argument b<c or b is a generic function with the generic argument c (and a is compared to the result of calling that function). At this point it is no longer possible to resolve that ambiguity with syntactic rules alone:
In order to support this, you'll need to either check whether the given primary refers to a generic function and make different parsing decisions based on that or have your parser generate multiple trees in case of ambiguities and then select the correct one in a later phase. The former option means that your parser needs to keep track of which generic functions are currently defined (and in scope) and then only go into the generic rule if the given primary is the name of one of those functions. Note that this becomes a lot more complicated if you allow functions to be defined after they are used.
So in summary, supporting expressions as generic arguments requires you to either keep track of which functions are in scope while parsing and use that information to make your parsing decisions (meaning your parser is context sensitive), or to generate multiple possible ASTs. Without expressions you can keep it context free and unambiguous, but it will require backtracking or arbitrary lookahead (meaning it's LL(*)).
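A minimal sketch of the context-sensitive option in the same style (again with hypothetical names): the parser's user state holds the set of generic functions in scope, and a < after an identifier only starts a generic argument list when the identifier names one of them.

open FParsec

type Node =
    | Name of string
    | GenericCall of string * string list
    | LessThan of Node * Node

let ident = many1Satisfy isLetter .>> spaces

let genericArgs name =
    between (pchar '<' .>> spaces) (pchar '>' .>> spaces)
            (sepBy1 ident (pchar ',' .>> spaces))
    |>> fun args -> GenericCall (name, args)

// After an identifier, consult the user state (the generic functions
// currently in scope) before committing to the generic rule.
let primary : Parser<Node, Set<string>> =
    ident >>= fun name ->
    getUserState >>= fun inScope ->
        if Set.contains name inScope
        then genericArgs name
        else preturn (Name name)

let comparison =
    pipe2 primary (opt (pchar '<' .>> spaces >>. primary))
          (fun l -> function Some r -> LessThan (l, r) | None -> l)

runParserOnString comparison (set ["a"]) "" "a<b, c>"  // GenericCall ("a", ["b"; "c"])
runParserOnString comparison (set ["a"]) "" "x<y"      // LessThan (Name "x", Name "y")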
Since neither of those are ideal, some languages change the syntax for calling generic functions with explicit type parameters to make it LL(1). For example:
Java puts the generic argument list of a method before the method name, i.e. obj.<T>foo() instead of obj.foo<T>().
Rust requires :: before the generic argument list: foo::<T>() instead of foo<T>().
Scala uses square brackets for generics and for nothing else (array subscripts use parentheses): foo[T]() instead of foo<T>().

Predictive editor for Rascal grammar

I'm trying to write a predictive editor for a grammar written in Rascal. The heart of this would be a function taking as input a list of symbols and returning as output a list of symbol types, such that an instance of any of those types would be a syntactically legal continuation of the input symbols under the grammar. So if the input list was [4,+] the output might be [integer]. Is there a clever way to do this in Rascal? I can think of imperative programming ways of doing it, but I suspect they don't take proper advantage of Rascal's power.
That's a pretty big question. Here are some leads to an answer, but the full answer would be implementing it for you completely :-)
Reify the original grammar for the language you are interested in as a value using the # operator, so that you have a concise representation of the grammar which can be queried easily. The representation is defined in the modules Type, ParseTree (which extends Type) and Grammar.
Construct the same representation for the input query. This could be done in many ways. A kick-ass, language-parametric way would be to extend Rascal's parser algorithm to return partial trees for partial input, but I believe this would be too much hassle for now. An easier solution would entail writing a grammar for a set of partial inputs, i.e. the language grammar with shorter rules at specific points. The grammar will be ambiguous, but that is not a problem in this case.
Use tags to tag the "short" rules so that you can find them easily later: syntax E = #short E "+";
Parse with the extended and now ambiguous grammar;
The resulting parse trees will contain the same representation as in the ParseTree module that you used to reify the original grammar, except that there the rules are longer, as in prod(E, [E,+,E],...).
Then select the trees which serve you best for the goal of completion (those which use the #short tag) and extract their productions prod, which look like this: prod(E,[E,+],...). For example, using the / operator: [candidate | /candidate:prod(_,_,/"short") := trees]; you could use a cursor position to find candidates which are close by instead of all short trees in there.
Use list matching to find prefixes in the original grammar, like if (/match:prod(_,[*prefix, predicted, *postfix],_) := grammar) ..., where prefix is your query as extracted from the #short rules, predicted is your answer, and postfix is whatever would come after.
Yield the predicted symbol back as a type for the user to read: "<type(predicted, ())>" (this will pretty print it nicely even if it's some complex regexp type, and does the quoting right, etc.).

Recursive Descent vs Lex/Parse?

I think I understand (roughly) how recursive descent parsers (e.g. Scala's Parser Combinators) work: you parse the input string with one parser, and that parser calls other, smaller parsers for each "part" of the whole input, and so on, until you reach the low-level parsers which directly generate the AST from fragments of the input string.
I also think I understand how Lexing/Parsing works: you first run a lexer to break the whole input into a flat list of tokens, and you then run a parser to take the token list and generate an AST.
However, what I do not understand is how the Lex/Parse strategy deals with cases where exactly how you tokenize something depends on the tokens that were tokenized earlier. For example, if I take a chunk of XML:
"<tag attr='moo' omg='wtf'>attr='moo' omg='wtf'</tag>"
A recursive descent parser may take this and break it down (each subsequent indent represents the decomposition of the parent string)
"<tag attr='moo' omg='wtf'>attr='moo' omg='wtf'</tag>"
-> "<tag attr='moo' omg='wtf'>"
-> "<tag"
-> "attr='moo'"
-> "attr"
-> "="
-> "moo"
-> "omg='wtf'"
-> "omg"
-> "="
-> "wtf"
-> ">"
-> "attr='moo' omg='wtf'"
-> "</tag>"
And the small parsers which individually parse <tag, attr='moo', etc. would then construct a representation of an XML tag and add attributes to it.
However, how does a single-step Lex/Parse work? How does the Lexer know that the string after <tag and before > must be tokenized into separate attributes, while the string between > and </tag> does not need to be? Wouldn't it need the Parser to tell it that the first string is within a tag body, and the second case is outside a tag body?
EDIT: Changed the example to make it clearer
Typically the lexer will have a "mode" or "state" setting, which changes according to the input. For example, on seeing a < character, the mode would change to "tag" mode, and the lexer would tokenize appropriately until it sees a >. Then it would enter "contents" mode, and the lexer would return all of attr='moo' omg='wtf' as a single string. Programming language lexers, for example, handle string literals this way:
string s1 = "y = x+5";
The y = x+5 would never be handled as a mathematical expression and then turned back into a string. It's recognized as a string literal, because the " changes the lexer mode.
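As a sketch of the mode idea applied to the XML fragment from the question (the token type and helper functions are hypothetical): on < the lexer switches into tag mode and splits out attributes; on > it switches back to contents mode and returns everything up to the next < as a single token.

type Token =
    | OpenTag of string
    | CloseTag of string
    | Attr of string * string
    | Text of string

let tokenize (input: string) =
    // Contents mode: everything up to the next '<' is one Text token.
    let rec content i acc =
        if i >= input.Length then List.rev acc
        elif input.[i] = '<' then tag (i + 1) acc
        else
            let j = input.IndexOf('<', i)
            let j = if j < 0 then input.Length else j
            content j (Text input.[i .. j - 1] :: acc)
    // Tag mode, entered on '<': read the tag name, then its attributes.
    and tag i acc =
        let closing = input.[i] = '/'
        let i = if closing then i + 1 else i
        let j = input.IndexOfAny([| ' '; '>' |], i)
        let name = input.[i .. j - 1]
        attrs j ((if closing then CloseTag name else OpenTag name) :: acc)
    // Still in tag mode: split name='value' pairs until '>' switches back.
    and attrs i acc =
        if input.[i] = '>' then content (i + 1) acc
        elif input.[i] = ' ' then attrs (i + 1) acc
        else
            let eq = input.IndexOf('=', i)
            let q = input.IndexOf('\'', eq + 2)   // end of the 'quoted' value
            attrs (q + 1) (Attr (input.[i .. eq - 1], input.[eq + 2 .. q - 1]) :: acc)
    content 0 []

tokenize "<tag attr='moo' omg='wtf'>attr='moo' omg='wtf'</tag>"
// [OpenTag "tag"; Attr ("attr", "moo"); Attr ("omg", "wtf");
//  Text "attr='moo' omg='wtf'"; CloseTag "tag"]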
For languages like XML and HTML, it's probably easier to build a custom parser than to use one of the parser generators like yacc, bison, or ANTLR. They have a different structure than programming languages, which are a better fit for the automatic tools.
If your parser needs to turn a list of tokens back into the string it came from, that's a sign that something is wrong in the design. You need to parse it a different way.
How does the Lexer know that the string after <tag and before > must be tokenized into separate attributes, while the string between > and </tag> does not need to be?
It doesn't.
Wouldn't it need the Parser to tell it that the first string is within a tag body, and the second case is outside a tag body?
Yes.
Generally, the lexer turns the input stream into a sequence of tokens. A token has no context - that is, a token has the same meaning no matter where it occurs in the input stream. Once the lexing process has completed, each token is treated as a single unit.
For XML, a generated lexer would typically identify integers, identifiers, string literals and so on, as well as the control characters like '<' and '>', but not a whole tag. The work of understanding what is an open tag, close tag, attribute, element, etc. is left to the parser proper.

What type of parser is needed for this grammar?

I have a grammar, and I do not know what type of parser I need in order to parse it, other than I do not believe the grammar is LL(1). I am thinking I need a parser with backtracking or LL(*) of some sort. The grammar I have come up with (which may need some rewriting) is:
S: Rules
Rules: Rule | Rule Rules
Rule: id '=' Ids
Ids: id | Ids id
The language I am trying to generate looks something like this:
abc = def g hi jk lm
xy = aaa bbb ccc ddd eee fff jjj kkk
foo = bar ha ha
Zero or more Rules, each containing a left identifier followed by an equals sign followed by one or more identifiers. The part that I think will be a problem to write a parser for is that the grammar allows any number of ids in a Rule, and the only way to tell when a new Rule starts is on finding id =, which would require backtracking.
Does anyone know the classification of this grammar and the best method of parsing, for a hand written parser?
The grammar that generates an identifier followed by an equals sign followed by a finite sequence of identifiers is regular. This means that strings in the language can be parsed using a DFA or regular expression. No need for fancy nondeterministic or LL(*) parsers.
To see that the language is regular, let Id = ∪{a : a ∈ Γ}, where Γ ⊂ Σ is the set of symbols that can occur in identifiers. The language you are trying to generate is denoted by the regular expression
Id+ =( Id+)* Id+
Setting Γ = {a, b, ..., z}, examples of strings in the language of the regular expression are:
look = i am in a regular language
hey = that means i can be recognized by a dfa
cool = or even a regular expression
There is no need to parse your language using powerful parsing techniques. This is one case where parsing using regular expressions or DFA is both appropriate and optimal.
edit:
Call the above regular expression R. To parse R*, generate a DFA recognizing the language of R*. To do this, generate an NFA recognizing the language of R* using the algorithm obtainable from Kleene's theorem. Then convert the NFA into a DFA using the subset construction. The resultant DFA will recognize all strings in R*. Given a representation of the constructed DFA in your implementation language, the required actions - for instance,
Add the last identifier parsed to the right-hand side of the current declaration statement being parsed
Add the last declaration statement parsed to a list of parsed declarations, and use the last identifier parsed to begin parsing a new declaration statement
can be encoded into the states of the DFA. In reality, using Kleene's theorem and the subset construction is probably unnecessary for such a simple language. That is, you can probably just write a parser with the above two actions without implementing an automaton. Given a more complicated regular language (for instance, the lexical structure of a programming language), the conversion would be the best option.
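For illustration, here is a hand-written F# parser over a token list (names hypothetical) that encodes exactly those two actions; a new declaration starts precisely when an identifier is followed by "=":

let parseRules (tokens: string list) =
    let rec go acc current tokens =
        match tokens with
        | lhs :: "=" :: rest ->
            // action 2: close the current declaration, begin a new one
            let acc = match current with Some r -> r :: acc | None -> acc
            go acc (Some (lhs, [])) rest
        | id :: rest ->
            // action 1: add the identifier to the current right-hand side
            match current with
            | Some (lhs, rhs) -> go acc (Some (lhs, id :: rhs)) rest
            | None -> failwith "input must start with an 'id =' rule"
        | [] ->
            let acc = match current with Some r -> r :: acc | None -> acc
            acc |> List.rev |> List.map (fun (l, r) -> l, List.rev r)
    go [] None tokens

"abc = def g hi jk lm xy = aaa bbb".Split(' ')
|> Array.toList
|> parseRules
// [("abc", ["def"; "g"; "hi"; "jk"; "lm"]); ("xy", ["aaa"; "bbb"])]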

Is the word "lexer" a synonym for the word "parser"?

The title is the question: Are the words "lexer" and "parser" synonyms, or are they different? It seems that Wikipedia uses the words interchangeably, but English is not my native language so I can't be sure.
A lexer is used to split the input up into tokens, whereas a parser is used to construct an abstract syntax tree from that sequence of tokens.
Now, you could just say that the tokens are simply characters and use a parser directly, but it is often convenient to have a parser which only needs to look ahead one token to determine what it's going to do next. Therefore, a lexer is usually used to divide up the input into tokens before the parser sees it.
A lexer is usually described using simple regular expression rules which are tested in order. There exist tools such as lex which can generate lexers automatically from such a description.
[0-9]+ Number
[A-Z]+ Identifier
+ Plus
A parser, on the other hand, is typically described by specifying a grammar. Again, there exist tools such as yacc which can generate parsers from such a description.
expr ::= expr Plus expr
| Number
| Identifier
No. A lexer breaks up the input stream into "words"; a parser discovers the syntactic structure between such "words". For instance, given the input:
velocity = path / time;
lexer output is:
velocity (identifier)
= (assignment operator)
path (identifier)
/ (binary operator)
time (identifier)
; (statement separator)
and then the parser can establish the following structure:
= (assign)
  lvalue: velocity
  rvalue: result of
    / (division)
      dividend: contents of variable "path"
      divisor: contents of variable "time"
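Putting the two stages together, here is a minimal end-to-end F# sketch (all type and function names are illustrative) for exactly this statement: the lexer maps characters to tokens, the parser maps tokens to the tree above.

type Token = Ident of string | Assign | Slash | Semi

type Expr = Var of string | Div of Expr * Expr   // dividend, divisor
type Stmt = AssignStmt of string * Expr          // lvalue, rvalue

let lex (input: string) =
    input.Split([| ' ' |], System.StringSplitOptions.RemoveEmptyEntries)
    |> Array.collect (fun w ->
        if w.EndsWith(";") then [| w.TrimEnd(';'); ";" |] else [| w |])
    |> Array.map (function
        | "=" -> Assign
        | "/" -> Slash
        | ";" -> Semi
        | id  -> Ident id)
    |> Array.toList

let parse tokens =
    match tokens with
    | [Ident lhs; Assign; Ident a; Slash; Ident b; Semi] ->
        AssignStmt (lhs, Div (Var a, Var b))
    | _ -> failwith "unsupported statement shape"

lex "velocity = path / time;" |> parse
// AssignStmt ("velocity", Div (Var "path", Var "time"))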
No. A lexer breaks down the source text into tokens, whereas a parser interprets the sequence of tokens appropriately.
They're different.
A lexer takes a stream of input characters as input, and produces tokens (aka "lexemes") as output.
A parser takes tokens (lexemes) as input, and produces (for example) an abstract syntax tree representing statements.
The two are enough alike, however, that quite a few people (especially those who've never written anything like a compiler or interpreter) treat them as the same, or (more often) use "parser" when what they really mean is "lexer".
As far as I know, "lexer" and "parser" are allied in meaning but are not exact synonyms, though many sources do use them interchangeably. A lexer (an abbreviation of "lexical analyser") identifies the tokens relevant to the language in the input, while a parser determines whether a stream of tokens meets the grammar of the language under consideration.
