A context-dependent Fortran scanner - flex-lexer

I am trying to generate a Fortran lexer and parser automatically with flex & bison; however, I came across an error when scanning the following Fortran code:
"if(i.le.20.and.j.le.10)"
The reason I've found is that ".and." is a logical operator in the Fortran grammar, while a floating-point number can be written as "20.". So my lexer recognizes "20." as a floating-point number, based on the rule of matching the longest possible string, and the remaining string "and.j.le.10" then matches no rule.
So how can I resolve this issue?

You can't solve that problem with lex and yacc. While there have been occasional backtracking yacc implementations, the problem here is at the level of lex. The lexer would have to try successively longer matches for the current token, stopping at the longest one for which the tokens that follow can still be matched. lex doesn't do that: it advances through the input stream, backing up only to resolve ambiguities within the current token. Whether lex or flex, the answer is the same.
Others have solved the problem with a specially written lexical analyzer. For instance, you could read a whole line and recursively split it into tokens. After each proposed token, the lexer recurses from that point, looking for the next token. If none fits (as happens at and.), that branch fails and a shorter match for the previous token is tried instead. The recursion succeeds only if it reaches the end of the line.
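Here is a rough sketch of that recursive splitting, in Python rather than as a hand-written C scanner, with a made-up token set that covers only the fragment from the question (no whitespace handling, no statement keywords):

import re

# Hypothetical token rules; a real Fortran lexer needs far more of them.
RULES = [
    ("LOGICAL_OP", re.compile(r"\.(and|or|not|le|lt|ge|gt|eq|ne)\.", re.I)),
    ("REAL",       re.compile(r"\d+\.\d*|\.\d+")),
    ("INTEGER",    re.compile(r"\d+")),
    ("IDENT",      re.compile(r"[a-z][a-z0-9_]*", re.I)),
    ("LPAREN",     re.compile(r"\(")),
    ("RPAREN",     re.compile(r"\)")),
]

def tokenize(line, pos=0):
    """Return a token list for line[pos:], or None if no split works."""
    if pos == len(line):
        return []
    for name, rule in RULES:
        match = rule.match(line, pos)
        if match is None:
            continue
        rest = tokenize(line, match.end())  # recur after the proposed token
        if rest is not None:                # succeed only if the tail also lexes
            return [(name, match.group())] + rest
    return None                             # dead end: caller tries another rule

print(tokenize("if(i.le.20.and.j.le.10)"))
# REAL swallows "20." at first, the recursion fails on "and.j...", and the
# lexer then backs off to INTEGER "20" so that ".and." can match.

Backtracking like this is exponential in the worst case, which is one reason to keep it to a single line at a time.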
This is fairly straightforward with Fortran 90's free source form. Earlier versions are messier, because whitespace was largely insignificant (except inside an I/O format).
Further reading:
Parser generators with backtrack or extended lookahead capability
Need FORTRAN grammar for lex & yacc (1994 usenet posting)
bfor (description of a Fortran preprocessor)

You may have to preprocess the file to modify such lines to
if((i.le.20) .and. (j .le. 10)) then
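If you go that route, the rewrite itself can be mechanical. A minimal sketch (illustrative only; it just inserts a blank before the dotted operator rather than adding the parentheses shown above, and it assumes free-form source where a blank really does separate tokens):

import re

DOTTED_OPS = r"and|or|not|le|lt|ge|gt|eq|ne"

def split_dotted_ops(line):
    # "20.and." becomes "20 .and." so the scanner no longer sees the real "20."
    return re.sub(r"(\d)\.(%s)\." % DOTTED_OPS, r"\1 .\2.", line, flags=re.I)

print(split_dotted_ops("if(i.le.20.and.j.le.10)"))  # if(i.le.20 .and.j.le.10)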

Related

Is it possible to remove the parser's control over the lexer when parsing heredocs in shell?

To deal with heredocs in a shell (e.g., bash), the grammar rule changes the variable need_here_doc via push_heredoc().
| LESS_LESS WORD
  {
    source.dest = 0;
    redir.filename = $2;
    $$ = make_redirection (source, r_reading_until, redir, 0);
    push_heredoc ($$);
  }
http://git.savannah.gnu.org/cgit/bash.git/tree/parse.y#n539
static void
push_heredoc (r)
     REDIRECT *r;
{
  if (need_here_doc >= HEREDOC_MAX)
    {
      last_command_exit_value = EX_BADUSAGE;
      need_here_doc = 0;
      report_syntax_error (_("maximum here-document count exceeded"));
      reset_parser ();
      exit_shell (last_command_exit_value);
    }
  redir_stack[need_here_doc++] = r;
}
http://git.savannah.gnu.org/cgit/bash.git/tree/parse.y#n2794
need_here_doc is used in read_token(), which is called by yylex(). This makes the behavior of yylex() non-autonomous.
Is it normal to design a parser that can change how yylex() behaves?
Is it because the shell language is not LALR(1), so there is no way to avoid changing the behavior of yylex() by the grammar actions?
if (need_here_doc)
  gather_here_documents ();
http://git.savannah.gnu.org/cgit/bash.git/tree/parse.y#n3285
current_token = read_token (READ);
http://git.savannah.gnu.org/cgit/bash.git/tree/parse.y#n2761
Is it normal to design a parser that can change how yylex() behaves?
Sure. It might not be ideal, but it's extremely common.
The Posix shell syntax is far from the ideal candidate for a flex/bison parser, and about the only thing you can say for the bash implementation using flex and bison is that it demonstrates how flexible those tools can be if pushed to their respective limits. Here-docs are not the only place where "lexical feedback" is necessary.
But even in more disciplined languages, lexical feedback can be useful. Or its alternative: writing partial parsing logic into the lexical scanner in order for it to know when the parse would require a different set of lexical rules.
Possibly the most well-known (or most frequently commented) example of lexical feedback is the parsing of C-style cast expressions, which requires the lexer to know whether the foo in (foo) is a typename or not. (This is usually implemented with a symbol table shared between the parser and the lexer, but the precise implementation details are tricky.)
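As a toy illustration of that shared-table arrangement (a Python sketch, not how any real C compiler is written): the parser records typedef names, and the lexer consults the table to classify each identifier.

# Symbol table shared by parser and lexer.
typedef_names = set()

def lex_identifier(text):
    # The token kind depends on what the parser has declared so far.
    return ("TYPE_NAME" if text in typedef_names else "IDENTIFIER", text)

def parser_reduce_typedef(declared_name):
    # Parser-side action for something like `typedef int foo;`
    typedef_names.add(declared_name)

print(lex_identifier("foo"))      # ('IDENTIFIER', 'foo') -- before the typedef
parser_reduce_typedef("foo")
print(lex_identifier("foo"))      # ('TYPE_NAME', 'foo')  -- so (foo)x can be a cast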
Here are a few other examples, which might be considered relatively benign uses of lexical feedback, although they certainly increase the coupling between lexer and parser.
Python (and Haskell) require the lexical scanner to reformulate leading whitespace into INDENT or DEDENT tokens. But if the line break occurs within parentheses, the whitespace handling is suppressed (including the NEWLINE token itself).
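A rough sketch of that reformulation (this is not CPython's tokenizer; it ignores tabs, strings and comments, and counts brackets naively):

def layout_tokens(lines):
    indents = [0]   # stack of currently open indentation widths
    depth = 0       # open (, [, { brackets suppress INDENT/DEDENT/NEWLINE
    for line in lines:
        stripped = line.lstrip(" ")
        if depth == 0 and stripped:
            width = len(line) - len(stripped)
            if width > indents[-1]:
                indents.append(width)
                yield "INDENT"
            while width < indents[-1]:
                indents.pop()
                yield "DEDENT"
        yield ("LINE", stripped)
        depth += sum(stripped.count(c) for c in "([{")
        depth -= sum(stripped.count(c) for c in ")]}")
        if depth == 0:
            yield "NEWLINE"          # suppressed while inside brackets
    while len(indents) > 1:
        indents.pop()
        yield "DEDENT"

print(list(layout_tokens(["x = f(1,", "      2)", "if x:", "    y = 1"])))
# the continuation line "      2)" produces no INDENT and no NEWLINE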
Ecmascript (Javascript) and other languages allow regular expression literals to be written surrounded by /s. But the / could also be a division operator or the first character in a /= mutation operator. The lexical decision depends on the parse context. (This could be guessed by the lexical scanner from the recent token history, which would count as reproducing part of the parsing logic in the lexical scanner.)
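That token-history guess might look like this (an illustrative sketch; the set is incomplete, and real engines consult the parser state instead):

# Token kinds after which a '/' can only be division (or '/='); after anything
# else it is taken to open a regular-expression literal.
DIVISION_CAN_FOLLOW = {"IDENTIFIER", "NUMBER", "STRING", "REGEX",
                       "RPAREN", "RBRACKET", "PLUSPLUS", "MINUSMINUS"}

def classify_slash(previous_token_kind):
    if previous_token_kind in DIVISION_CAN_FOLLOW:
        return "DIVIDE"          # as in  total / count  or  x /= 2
    return "REGEX_START"         # as in  f(/ab+c/)  or  return /ab+c/

print(classify_slash("NUMBER"))  # DIVIDE
print(classify_slash("LPAREN"))  # REGEX_START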
Similar to the above, many languages overload < in ways which complicate the logic in the lexical scanner. Its use as a template bracket rather than a comparison operator might be dealt with in the scanner -- in C++, for example, it will depend on features like whether the preceding identifier was a template or not -- but that doesn't actually change the lexical context. However, the use of an angle bracket to indicate the start of an X/HTML literal (or template) definitely changes the lexical context. As with the regex example above, it is necessary to know whether a comparison operator would be syntactically valid at that point.
Is it because the shell language is not LALR(1), so there is no way to avoid changing the behavior of yylex() by the grammar actions?
The Posix shell syntax is most certainly not LALR(1), or even context-free. But most languages could not be parsed scannerlessly with an LALR(1) parser, and many languages turn out not to have context-free grammars if you take all syntactic considerations into account. (Cf. C-style cast expressions, above.) Perhaps shell is further from the platonic ideal than most. But then, it grew over the years from a kernel intended to be simple to type, rather than formally analysable. (No comment from me about whether this excuse can be extended to Perl, which I don't plan to discuss here.)
What I'd say in general is that languages which embed other languages (regular expressions, HTML fragments, Flex/Bison semantic actions, shell arithmetic expansions, etc., etc.) present challenges for a simplistic parser/scanner model. Despite lots of interesting work and solid experimentation, my sense is that language embedding still lacks a good implementable formal structure. And since most languages do have embedded sublanguages, there is and will continue to be a certain adhockery in their parser implementations. In part, that's what makes this field of study so much fun.

How can a lexer extract a token in ambiguous languages?

I wish to understand how a parser works. I have learnt about LL, LR(0), and LR(1) parsing, and how to build NFAs, DFAs, parse tables, etc.
Now the problem is: I know that in some situations a lexer should extract tokens only on the parser's demand, when it is not possible to extract all the tokens in one separate pass. I don't exactly understand what kind of situation this is, so I'm open to any explanation about it.
The question now is: how should a lexer do its job? Should it base its recognition on the current "context", the current non-terminals the parser expects? Or is it something totally different?
What about GLR parsing: is it another case where the lexer could try different terminals, or is that handled purely at the syntactic level?
I would also like to understand what this depends on: for example, is it related to the kind of parsing technique (LL, LR, etc.), or only to the grammar?
Thanks a lot
The simple answer is that lexeme extraction has to be done in context. What one might consider to be lexemes in the language may vary considerably in different parts of the language. For example, in COBOL, the data declaration section has 'PIC' strings and location-sensitive level numbers 01-99 that do not appear in the procedure section.
The lexer thus has to somehow know what part of the language is being processed, to know which lexemes to collect. This is often handled by having lexing states, each of which processes some subset of the entire language's set of lexemes (often with considerable overlap between the subsets; e.g., identifiers tend to be pretty similar in my experience). These states form a high-level finite state machine, with transitions between them when phase-changing lexemes are encountered, e.g., the keywords that indicate entry into the data declaration or procedure section of the COBOL program. Modern languages like Java and C# minimize the need for this, but most other languages I've encountered really need this kind of help in the lexer.
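A minimal sketch of that state-driven arrangement, loosely modelled on the COBOL remark above (the rule tables and the word-at-a-time input are simplifications for illustration):

import re

# One rule table per lexical state; identifiers overlap, but level numbers and
# PIC clauses exist only in the data-declaration state.
RULES = {
    "DATA": [("LEVEL", re.compile(r"\d\d$")),
             ("PIC",   re.compile(r"PIC\b", re.I)),
             ("IDENT", re.compile(r"[A-Z][A-Z0-9-]*$", re.I))],
    "PROC": [("VERB",  re.compile(r"(MOVE|ADD|DISPLAY)$", re.I)),
             ("IDENT", re.compile(r"[A-Z][A-Z0-9-]*$", re.I))],
}

# Phase-changing lexemes drive the high-level state machine.
TRANSITIONS = {"DATA": "DATA", "PROCEDURE": "PROC"}

def lex(words, state="DATA"):
    for word in words:
        if word.upper() in TRANSITIONS:
            state = TRANSITIONS[word.upper()]
            yield ("SECTION", word)
            continue
        for name, rule in RULES[state]:   # only the current state's rules apply
            if rule.match(word):
                yield (name, word)
                break

print(list(lex(["01", "TOTAL", "PIC 9(4)", "PROCEDURE", "DISPLAY", "TOTAL"])))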
So-called "scannerless" parsers (you are thinking "GLR") work by getting rid of the lexer entirely; now there's no need for the lexer to produce lexemes, and no need to track lexical states :-} Such parsers work by simply writing the grammar down the level of individual characters; typically you find grammar rules that are the exact equivalent of what you'd write for a lexeme description. The question is then, why doesn't such a parser get confused as to which "lexeme" to produce? This is where the GLR part is useful. GLR parsers are happy to process many possible interpretations of the input ("locally ambiguous parses") as long as the choice gets eventually resolved. So what really happens in the case of "ambiguous tokens" is the the grammar rules for both "tokens" produce nonterminals for their respectives "lexemes", and the GLR parser continues to parse until one of the parsing paths dies out or the parser terminates with an ambiguous parse.
My company builds lots of parsers for languages. We use GLR parsers because they are very nice for handling complex languages; write the context-free grammar and you have a parser. We use lexical-state based lexeme extractors with the usual regular-expression specification of lexemes and lexical-state-transitions triggered by certain lexemes. We could arguably build scannerless GLR parsers (by making our lexers produce single characters as tokens :) but we find the efficiency of the state-based lexers to be worth the extra trouble.
As practical extensions, our lexers actually use push-down-stack automata for the high-level state machine rather than mere finite state machines. This helps when one has a high-level FSA whose substates are identical, and when it is helpful for the lexer to track nested structures (e.g., matching parentheses) in order to manage a mode switch (e.g., when all the parentheses have been matched).
A unique feature of our lexers: we also do a tiny bit of what scannerless parsers do: sometimes, when a keyword is recognized, our lexers will inject both a keyword and an identifier into the parser (this simulates a scannerless parser with a grammar rule for each). The parser will of course only accept what it wants "in context" and simply throw away the wrong alternative. This gives us an easy way to handle "keywords in context, otherwise interpreted as identifiers", which occurs in many, many languages.
Ideally, the tokens themselves should be unambiguous; you should always be able to tokenise an input stream without the parser doing any additional work.
This isn't always so simple, so you have some tools to help you out:
Start conditions
A lexer action can change the scanner's start condition, meaning it can activate different sets of rules.
A typical example of this is string literal lexing; when you lex a string literal, the tokenising rules usually become completely different from those of the language containing it. This is an example of an exclusive start condition.
You can separate ambiguous lexings if you can identify two separate start conditions for them and ensure the lexer enters them appropriately, given some preceding context.
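The same idea in a Python sketch instead of a scanner generator's start-condition notation (only enough rules to show the mode switch for a string literal):

import re

def lex(text):
    pos, mode, buf = 0, "INITIAL", []
    while pos < len(text):
        if mode == "INITIAL":
            if text[pos] == '"':                     # enter the exclusive condition
                mode, buf = "STRING", []
                pos += 1
            elif m := re.match(r"[A-Za-z_]\w*|\d+|\S", text[pos:]):
                yield ("WORD", m.group())
                pos += m.end()
            else:
                pos += 1                             # skip whitespace
        else:                                        # mode == "STRING": different rules
            if text[pos] == "\\":                    # escape sequence
                buf.append(text[pos + 1])
                pos += 2
            elif text[pos] == '"':                   # leave the condition
                yield ("STRING", "".join(buf))
                mode = "INITIAL"
                pos += 1
            else:
                buf.append(text[pos])
                pos += 1

print(list(lex('print "a \\"b\\" c" end')))
# [('WORD', 'print'), ('STRING', 'a "b" c'), ('WORD', 'end')]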
Lexical tie-ins
This is a fancy name for carrying state in the lexer and modifying it in the parser. If a certain action in your parser gets executed, it modifies some state in the lexer, which results in the lexer actions returning different tokens. This should be avoided unless necessary, because it makes both your lexer and your parser more difficult to reason about, and makes some things (like GLR parsers) impossible.
The upside is that you can do things that would otherwise require significant grammar changes with relatively minor impact on the code; you can use information from the parse to influence the behaviour of the lexer, which in turn can go some way towards solving what you see as an "ambiguous" grammar.
Logic, reasoning
It's probable that your input can be tokenised in a single pass; the above tools should come second to thinking about how you should be tokenising the input and trying to express that in the language of lexical analysis. :)
The fact is, your input is comprised of tokens—whether you like it or not!—and all you need to do is find a way to make a program understand the rules you already know.

Does the recognition of numbers belong in the scanner or in the parser?

When you look at the EBNF description of a language, you often see a definition for integers and real numbers:
integer ::= digit digit* // Accepts numbers with a 0 prefix
real ::= integer "." integer (('e'|'E') integer)?
(Definitions were made on the fly, I have probably made a mistake in them).
Although they appear in the context-free grammar, numbers are often recognized in the lexical analysis phase. Are they included in the language definition to make it more complete, and is it up to the implementer to realize that they should actually be in the scanner?
Many common parser generator tools -- such as ANTLR, Lex/YACC -- separate parsing into two phases: first, the input string is tokenized. Second, the tokens are combined into productions to create a concrete syntax tree.
However, there are alternative techniques that do not require tokenization: look into backtracking recursive-descent parsers. For such a parser, tokens are defined in much the same way as non-terminals. pyparsing is a parser generator for such parsers.
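For example, the real-number rule from the question can be written directly as a pyparsing expression, with no separate tokenisation step (a quick sketch; exact results may vary between pyparsing versions):

from pyparsing import Combine, Optional, Word, nums, oneOf

integer = Word(nums)                        # digit digit*
real = Combine(integer + "." + integer      # integer "." integer (('e'|'E') integer)?
               + Optional(oneOf("e E") + integer))

print(real.parseString("31.4e1"))           # ['31.4e1']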
The advantage of the two-step technique is that it usually produces more efficient parsers -- with tokens, there's a lot less string manipulation, string searching, and backtracking.
According to "The Definitive ANTLR Reference" (Terence Parr),
The only difference between [lexers and parsers] is that the parser recognizes grammatical structure in a stream of tokens while the lexer recognizes structure in a stream of characters.
The grammar syntax needs to be complete to be precise, so of course it includes details as to the precise format of identifiers and the spelling of operators.
Yes, the compiler engineer decides but generally it is pretty obvious. You want the lexer to handle all the character-level detail efficiently.
There's a longer answer at Is it a Lexer's Job to Parse Numbers and Strings?

Two level grammar

I am trying to determine whether suggested changes to the EcmaScript grammar introduce ambiguities.
The grammar is odd in a few ways
There is no regular or context-free lexical grammar, meaning there is no way to break the input into a series of tokens which can be fed to a tree builder, though at a given parser state there is a context-free grammar which can be used to fetch the next token.
Some tokens are implicit. Specifically, semicolons are inserted in some places when not present in the source text. This only requires one non-ignorable token of lookahead, but since ignorable tokens can be of arbitrary length, the lookahead is not finitely bounded.
There is no translation simpler than a full parse that allows removal or collapsing of ignorable tokens.
Line terminator tokens (and multiline comments that are equivalent to line terminators) are ignorable in most contexts but are significant in some.
I know that proving no ambiguity is not doable in general, but I'd like to be able to achieve a simpler goal:
A test that is true if and only if there is no string such that two different paths through the candidate grammar can produce two different trees, where each path involves breaking the string into fewer than k tokens.
I would be very happy if I could prove such a thing for a candidate grammar with k up to 50.
Is there any literature on detecting ambiguity within such limits?

Parsing Context Sensitive Language

I am reading The Definitive ANTLR Reference by Terence Parr, where he says:
Semantic predicates are a powerful means of recognizing context-sensitive language structures by allowing runtime information to drive recognition
But the examples in the book are very simple. What I need to know is: can ANTLR parse context-sensitive rules like:
xAy --> xBy
If ANTLR can't parse these rules, is there is another tool that deals with context-sensitive grammars?
ANTLR parses only grammars which are LL(*). It can't parse using grammars for full context-sensitive languages such as the example you provided. I think what Parr meant was that ANTLR can parse some languages that require some (left) context constraints.
In particular, one can use semantic predicates on "reduction actions" (we do this for the GLR parsers used by our DMS Software Reengineering Toolkit, but the idea is similar for ANTLR, I think) to inspect any data collected by the parser so far, either as ad hoc side effects of other semantic actions or in a partially built parse tree.
For our DMS-based Fortran front end, there's a context-sensitive check to ensure that DO loops are properly lined up. Consider:
DO 20, I= ...
DO 10, J = ...
...
20 CONTINUE
10 CONTINUE
From the point of view of the parser, the lexical stream looks like this:
DO <number> , <variable> = ...
DO <number> , <variable> = ...
...
<number> CONTINUE
<number> CONTINUE
How can the parser then know which DO statement goes with which CONTINUE statement? (Saying that each DO matches its closest CONTINUE won't work, because FORTRAN can share a CONTINUE statement among multiple DO-heads.)
We use a semantic predicate "CheckMatchingNumbers" on the reduction for the following rule:
block = 'DO' <number> rest_of_do_head newline
            block_of_statements
        <number> 'CONTINUE' newline ; CheckMatchingNumbers
to check that the number following the DO keyword and the number following the CONTINUE keyword match. If the semantic predicate says they match, then the reduction for this rule succeeds and we've aligned the DO head with the correct CONTINUE. If the predicate fails, then no reduction is proposed (and this rule is removed from the candidates for parsing the local context); some other set of rules has to parse the text.
The actual rules and semantic predicates needed to handle FORTRAN nesting with shared CONTINUEs are more complex than this, but I think this makes the point.
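To make the predicate's job concrete, here is a stand-alone sketch (not DMS code) of the label bookkeeping it relies on, including the shared-CONTINUE case:

def match_do_labels(statements):
    """Pair each DO statement with the labelled statement that closes it.
    `statements` is a list of (kind, label) pairs; a single labelled statement
    may close several nested DO loops, as FORTRAN allows."""
    open_dos = []                  # stack of (index, label) for open DO heads
    pairs = []
    for i, (kind, label) in enumerate(statements):
        if kind == "DO":
            open_dos.append((i, label))
        else:
            # the reduction succeeds only while the labels actually match
            while open_dos and open_dos[-1][1] == label:
                pairs.append((open_dos.pop()[0], i))
    return pairs

stream = [("DO", 10), ("DO", 10), ("STMT", None), ("CONTINUE", 10)]
print(match_do_labels(stream))     # [(1, 3), (0, 3)] -- one CONTINUE, two DOs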
What you want is a full context-sensitive parsing engine. I know people have built them, but I don't know of any complete implementations, and I don't expect them to be fast.
I did follow Quinn Taylor Jackson's MetaS grammar system for a while; it sounded like a practical attempt to come close.
It is comparatively easy to write a context-sensitive parser in Prolog. This program parses the string [a,is,less,than,b,and,b,is,less,than,c], converting it into [a,<,b,<,c]:
:- initialization(main).
:- set_prolog_flag('double_quotes','chars').

main :-
    rewrite_system([a,is,less,than,b,and,b,is,less,than,c],X),
    writeln('\nFinal output:'),
    writeln(X).

rewrite_rule([[A,<,B],and,[B,<,C]],[A,<,B,<,C]).
rewrite_rule([A,is,less,than,B],[A,<,B]).
rewrite_rule([[A,<,B],and,C,than,D],[[A,<,B],and,A,is,C,than,D]).
rewrite_rule([A,<,B],[[A,<,B]]).

rewritten(A) :- atom(A);bool(A).

bool(A) :- atom(A).
bool([A,<,B,<,C]) :- atom(A),atom(B),atom(C).
bool([A,and,B]) :- bool(A),bool(B).

% this predicate is from https://stackoverflow.com/a/8312742/975097
replace(ToReplace, ToInsert, List, Result) :-
    once(append([Left, ToReplace, Right], List)),
    append([Left, ToInsert, Right], Result).

rewrite_system(Input,Output) :-
    rewritten(Input),Input=Output;
    rewrite_rule(A,B),
    replace(A,B,Input,Input1),
    writeln(Input1),
    rewrite_system(Input1,Output).
Using the same algorithm, I also wrote an adaptive parser that "learns" new rewrite rules from its input.
