Is a wrong keyword detected during lexical analysis or parsing?

I'm writing a small compiler out of interest, and I need to know at what stage a wrong keyword (a keyword that is not in the language) is detected: during lexical analysis or during parsing?

You'll detect this during parsing. It's not that you have a "wrong keyword"; it's that you have an identifier (i.e. a variable name) appearing in a place where you don't expect one. So, if your source code looks like:
reeeturn 3;
From the compiler's perspective, you're just using some variable named reeeturn. That could be an error because a variable with that name isn't defined. Or, in this case, it's probably a syntax error to have a number follow an identifier.
But there's no lexical error here. It's a totally valid sequence of tokens: identifier, number, semicolon.
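To make that concrete, here is a minimal flex-style sketch under which reeeturn 3; lexes without any complaint. The token names (KW_RETURN and friends) are illustrative, not from any particular compiler:

return                  { return KW_RETURN; }    /* the real keyword */
[a-zA-Z_][a-zA-Z0-9_]*  { return IDENTIFIER; }   /* "reeeturn" lands here */
[0-9]+                  { return NUMBER; }
";"                     { return SEMICOLON; }
[ \t\n]+                { /* skip whitespace */ }

The misspelled keyword simply falls through to the identifier rule; it is the parser that later rejects the sequence IDENTIFIER NUMBER SEMICOLON if the grammar doesn't allow it.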

This probably happens during lexical analysis. Lexical analysis is the compiler phase where the input file is cut apart into pieces and each piece is tagged with what it means, whereas parsing takes those existing pieces and uses them to assemble an AST. Without seeing the code I can't be certain, but based on this reasoning I'd suspect the error is caught in the scanner and not the parser.
Hope this helps!

It depends on the language.
The lexing phase is responsible for creating a token stream from the source file. If the "wrong keyword" is still a valid token in the language, it will be tokenized properly - for example, in C, the "wrong keyword" will be tokenized as an identifier. It is only later, during parsing, that the mistake is revealed.
On the other hand, in a language in which the "wrong keyword" cannot be any other valid token (e.g. a language using sigils for variables), the lexer itself will complain.
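As a sketch of that second case, consider rules for a hypothetical sigil language, where a bare word that is not a keyword cannot be any other valid token (again, all names are illustrative):

if|else|while       { return KEYWORD; }
\$[a-zA-Z_]+        { return VARIABLE; }    /* variables must carry a $ sigil */
[0-9]+              { return NUMBER; }
[ \t\n]+            { /* skip whitespace */ }
.                   { fprintf(stderr, "lexical error: unexpected '%s'\n", yytext); }

Given whiile, none of the token rules match, so the catch-all rule fires during lexical analysis rather than during parsing.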

Related

Context dependent lexer

I see the following in bash's parse.y. This means that the lexical analysis will be context-dependent. How can I use flex to do this kind of context-dependent analysis? Will this kind of context-dependent requirement make the flex code too messy? Thanks.
http://git.savannah.gnu.org/cgit/bash.git/tree/parse.y#n3006
/* Handle special cases of token recognition:
IN is recognized if the last token was WORD and the token
before that was FOR or CASE or SELECT.
DO is recognized if the last token was WORD and the token
before that was FOR or SELECT.
ESAC is recognized if the last token caused `esacs_needed_count'
to be set
`{' is recognized if the last token was WORD and the token
before that was FUNCTION, or if we just parsed an arithmetic
`for' command.
`}' is recognized if there is an unclosed `{' present.
`-p' is returned as TIMEOPT if the last read token was TIME.
`--' is returned as TIMEIGN if the last read token was TIMEOPT.
']]' is returned as COND_END if the parser is currently parsing
a conditional expression ((parser_state & PST_CONDEXPR) != 0)
`time' is returned as TIME if and only if it is immediately
preceded by one of `;', `\n', `||', `&&', or `&'.
*/
(F)lex provides start conditions to allow for context-dependent lexical analysis.
If you avoid the temptation to reproduce the parsing logic as a hand-written state machine in the lexical scanner, then start conditions can certainly simplify the implementation of context-dependent scanners.
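For instance, here is a sketch of how a start condition could handle the `]]' case from the comment above: the scanner recognizes `]]' as a closing delimiter only while it is inside a conditional expression. This is an illustrative standalone fragment, not bash's actual scanner:

%option noyywrap
%x COND
%%
"[["                { BEGIN(COND); printf("COND_START\n"); }
<COND>"]]"          { BEGIN(INITIAL); printf("COND_END\n"); }
<COND>[ \t\n]+      { /* skip whitespace inside the conditional */ }
<COND>[^] \t\n]+    { printf("WORD(%s)\n", yytext); }
[ \t\n]+            { /* skip whitespace outside */ }
[^[ \t\n]+          { printf("TOKEN(%s)\n", yytext); }
%%
int main(void) { return yylex(); }

Outside the COND state, `]]' is just ordinary text; inside it, it is recognized as the closing delimiter.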
For the particular application of conditionally-recognised keywords -- often called "semi-reserved words" -- context-dependent lexical analysis is often not the best solution. Instead, consider writing the scanner to always recognise the keywords and then add rules in the grammar to treat the words as identifiers in contexts in which the keyword is not possible. See this answer for an example.
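The grammar-side alternative looks something like this bison-style sketch: the scanner always returns IN as a keyword, and the grammar re-admits it wherever an ordinary word is allowed. The token and rule names are illustrative:

word: WORD
    | IN          /* "in" used as an ordinary word outside for/case/select */
    ;
word_list: word
    | word_list word
    ;
for_command: FOR WORD IN word_list DO compound_list DONE ;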

Designing a Language Lexer

I'm currently in the process of creating a programming language. I've laid out my entire design and am in progress of creating the Lexer for it. I have created numerous lexers and lexer generators in the past, but have never come to adopt the "standard", if one exists.
Is there a specific way a lexer should be created to maximise capability to use it with as many parsers as possible?
The way I design mine, they look like the following:
Code:
int main() {
    printf("Hello, World!");
}
Lexer:
[
KEYWORD:INT, IDENTIFIER:"main", LEFT_ROUND_BRACKET, RIGHT_ROUND_BRACKET, LEFT_CURLY_BRACKET,
IDENTIFIER:"printf", LEFT_ROUND_BRACKET, STRING:"Hello, World!", RIGHT_ROUND_BRACKET, SEMICOLON,
RIGHT_CURLY_BRACKET
]
Is this the way lexers should be made? Also, as a side note, what should my next step be after creating a lexer? I don't really want to use something such as ANTLR, Lex+Yacc, Flex+Bison, etc. I'm doing it from scratch.
If you don't want to use a parser generator [Note 1], then it is absolutely up to you how your lexer provides information to your parser.
Even if you do use a parser generator, there are many details which are going to be project-dependent. Sometimes it is convenient for the lexer to call the parser with each token; other times it is easier if the parser calls the lexer; in some cases, you'll want a driver which interacts separately with each component. And clearly, the precise datatype(s) of your tokens will vary from project to project, which can have an impact on how you communicate as well.
Personally, I would avoid use of global variables (as in the original yacc/lex protocol), but that's a general style issue.
Most lexers work in streaming mode, rather than tokenizing the entire input and then handing the vector of tokens to some higher power. Tokenizing one token at a time has a number of advantages, particularly if the tokenization is context-dependent, and, let's face it, almost all languages have some impurity somewhere in their syntax. But, again, that's entirely up to you.
Good luck with your project.
Notes:
1. Do you also forgo the use of compilers and write all your code from scratch in assembler or even binary?
Is there a specific way a lexer should be created to maximise capability to use it with as many parsers as possible?
In the lexers I've looked at, the canonical API is pretty minimal. It's basically:
Token readNextToken();
The lexer maintains a reference to the source text and its internal pointers into where it is currently looking. Then, every time you call that, it scans and returns the next token.
The Token type usually has:
A "type" enum for which kind of token it is: string, operator, identifier, etc. There are usually special kinds for "EOF", meaning a special terminator token that is produced after the end of the input, and "ERROR" for the rare cases where a syntax error comes from the lexical grammar. This is mainly just unterminated string literals or totally unknown characters in the source.
The source text of the token.
Sometimes literals are converted to their proper value representation during lexing in which case you'll have that value too. So a number token would have "123" as text but also have the numeric value 123. Or you can do that during parsing/compilation.
Location within the source file of the token. This is for error reporting. Usually 1-based line and column, but can also just be start and end byte offsets. The latter is a little faster to produce and can be converted to line and column lazily if needed.
Depending on your grammar, you may need to be able to rewind the lexer too.
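Put together in C, that minimal API might look like the following sketch. All names here are illustrative, not a standard interface:

#include <stddef.h>

/* Kinds of token; EOF and ERROR are the special kinds noted above. */
typedef enum {
    TOK_IDENTIFIER, TOK_NUMBER, TOK_STRING, TOK_OPERATOR,
    TOK_EOF, TOK_ERROR
} TokenType;

typedef struct {
    TokenType type;
    const char *start;   /* points into the source text */
    size_t length;       /* length of the lexeme */
    long value;          /* converted value when type == TOK_NUMBER */
    size_t offset;       /* start byte offset, for lazy line/column lookup */
} Token;

typedef struct {
    const char *source;  /* the whole source text */
    size_t pos;          /* internal pointer: where the lexer is looking */
} Lexer;

Token read_next_token(Lexer *lexer);   /* returns a TOK_EOF token at end of input */

A parser then just pulls tokens in a loop until it sees TOK_EOF; rewinding, if your grammar needs it, amounts to saving and restoring pos.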

How to write parser which handles import statements?

I'm using lex & yacc to write a VHDL parser. VHDL has some language features which make it context-sensitive in a manner similar to C. For example, typedef-like constructs which impact whether something should be tokenized as an IDENTIFIER vs. a TYPEDEF_NAME.
The difficulty comes in when you need to build a symbol table based on another file which is referenced by "use" statements (similar to "import" in Java or Python).
library ieee;
use ieee.std_logic_1164.all;
-- code which uses something defined in ieee.std_logic_1164 package
In C, this is fairly straightforward because the preprocessor has already combined all of the header files into a single translation unit which can be scanned from top to bottom. But 'use' statements in VHDL are not preprocessor commands.
So, somehow, as I'm parsing the file, I have to recognize when I see a use statement and then go off and parse the relevant file, and then continue parsing the original file with that symbol table.
Is there an elegant way to do this with lex/yacc? I know there is yyrestart but I'm not sure if that's going down the right track.
If you are using flex, then it is pretty easy.
The basic mechanism (including two functioning code samples) is described in the "Multiple Input Buffers" chapter of the flex manual. You can also take a glance at this question on SO.
The parser (yacc/bison) reduction which recognizes the use construction can include the code which calls yy_push_buffer. In the example code, the end of the included file is recognized by the scanner (lex/flex), which simply pops the buffer stack.
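In outline, and closely following the flex manual's example, the mechanism looks like this sketch, where open_used_file() is a hypothetical helper that locates the referenced package file:

/* In the user-code section of the .l file; called from the parser
   action that reduces the "use" construction. */
void push_use_file(const char *name) {
    FILE *f = open_used_file(name);    /* hypothetical lookup */
    if (f)
        yypush_buffer_state(yy_create_buffer(f, YY_BUF_SIZE));
}

/* In the rules section: pop the buffer stack at end of the included file. */
<<EOF>> {
    yypop_buffer_state();
    if (!YY_CURRENT_BUFFER)
        yyterminate();    /* no saved buffer left: real end of input */
}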
Depending on the formal rules of file inclusion, you might want the parser to know that the included file has finished, in order to avoid having syntactic constructs which start in the included file and continue in the includer. (C allows this, even though it is almost always an error; I don't know anything about VHDL, but there are definitely languages which do not allow it.) One possibility is to recursively call the parser in order to parse the included file, which will require a re-entrant ("pure") parser. In that case, the scanner should return an end-of-included-file token when it hits the end of the included file, because your included file grammar production will need to be terminated with such a token.
You may need to worry about the possibility that the parser has already requested the next input token. Most LALR(1) grammars do not depend on the lookahead token for semi-colon terminated statements, and bison usually doesn't request a lookahead token in a context in which it doesn't need it. But that behaviour is not guaranteed by all Posix-compatible yacc implementations and you might be using one which doesn't.
In that case, you would have to preserve the lookahead token so that you can reread it after the included file has been parsed. That would most conveniently be done by stashing the lookahead token somewhere the scanner can see it, and having the scanner return that token (if set) when it sees the end of the included file. In a bison action, you can find the lookahead token in yychar and its semantic value and location (if locations are enabled) are in yylval and yylloc. If bison has not read the lookahead token, the value of yychar will be YYEMPTY, and the simplest possible bison implementation would assert(yychar == YYEMPTY) when it is about to push the input buffer. If the assert fails, you'll need to implement a more sophisticated strategy.
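As a sketch, the corresponding bison action (using push_use_file() from the sketch above, assuming qualified_name's semantic value is the package name string, and with <assert.h> included in the prologue) might be:

use_stmt: USE qualified_name ';' {
        assert(yychar == YYEMPTY);   /* parser has not fetched a lookahead */
        push_use_file($2);           /* switch scanner input to the used file */
    }
    ;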

Compiler Design : Is "variable not declared" a syntactic error or semantic error?

Is this type of error detected during type checking, or while the input is being parsed?
Under which category should the error be classified?
The way I see it, it is a semantic error, because your program parses just fine even though you are using an identifier which you haven't previously bound, i.e. syntactic analysis only checks the program for well-formedness. Semantic analysis actually checks that your program has a valid meaning, e.g. bindings, scoping or typing. As @pst said, you can do scope checking during parsing, but this is an implementation detail. AFAIK old compilers used to do this to save time and space, but I think today such an approach is questionable unless you have hard performance/memory constraints.
The program conforms to the language grammar, so it is syntactically correct. A language grammar doesn't contain any statements like 'the identifier must be declared', and indeed doesn't have any way of expressing that. An attempt to build a two-level grammar along these lines failed spectacularly in the Algol-68 project, and to my knowledge it has not been attempted since.
The meaning, if any, of such a program is a semantic issue. Frank deRemer called issues like this 'static semantics'.
In my opinion, this is strictly neither a syntax error nor a semantic one. If I were to implement this check for a statically typed, compiled language (like C or C++), I would not put it into the parser (because the parser is practically incapable of checking for this mistake), but rather into the code generator (the part of the compiler that walks the abstract syntax tree and turns it into assembly code). So in my opinion, it lies between syntax and semantic errors: it's a syntax-related error that can only be checked by performing semantic analysis on the code.
If we consider a primitive scripting language, however, where the AST is directly executed (without compilation to bytecode and without JIT), then it's the evaluator/executor function itself, walking the AST, that finds the undeclared variable; in this case, it will be a runtime error. The difference lies in the "AST_walk()" routine running in different parts of the program lifecycle (compile time vs. runtime), depending on whether the language is a compiled or a scripting one.
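To illustrate the scripting case, here is a sketch of such an AST walker in C; all the types and helpers (Node, Env, env_lookup, runtime_error) are illustrative, not from any particular interpreter:

typedef enum { NODE_NUM, NODE_VAR, NODE_ADD } NodeKind;

typedef struct Node {
    NodeKind kind;
    double num;                 /* NODE_NUM */
    const char *name;           /* NODE_VAR */
    struct Node *lhs, *rhs;     /* NODE_ADD */
} Node;

typedef struct Env Env;                            /* hypothetical environment */
double *env_lookup(Env *env, const char *name);    /* NULL if not bound */
void runtime_error(const char *fmt, ...);          /* aborts evaluation */

double eval(Node *n, Env *env) {
    switch (n->kind) {
    case NODE_NUM: return n->num;
    case NODE_VAR: {
        double *slot = env_lookup(env, n->name);
        if (!slot)   /* the undeclared variable surfaces only here, at runtime */
            runtime_error("undeclared variable '%s'", n->name);
        return *slot;
    }
    case NODE_ADD: return eval(n->lhs, env) + eval(n->rhs, env);
    }
    return 0;
}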
In the case of languages -- and there are many -- which require identifiers to be declared, a program with undeclared identifiers is ill-formed and thus a missing declaration is clearly a syntax error.
The usual way to deal with this is to incorporate information about symbols in a symbol table, so that the parser can use this information.
Here are a few examples of how identifier type affects parsing:
C / C++
A classic case:
(a)-b;
Depending on a, that's either a cast or a subtraction:
#include <stdio.h>

#if TYPEDEF
typedef double a;
#else
double a = 3.0;
#endif

int main() {
    int b = 3;
    printf("%g\n", (a)-b);
    return 0;
}
Consequently, if a hadn't been declared at all, the compiler must reject the program as syntactically ill-formed (and that is precisely the word the standard uses).
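One common way to make that work in a lex/yacc setting is the classic "lexer hack": after matching an identifier, the scanner asks the symbol table (which the parser fills in when it reduces a typedef declaration) whether the name denotes a type. A sketch, where symtab_is_typedef_name() is a hypothetical query:

[a-zA-Z_][a-zA-Z0-9_]*  {
    if (symtab_is_typedef_name(yytext))  /* hypothetical symbol-table query */
        return TYPE_NAME;                /* (a)-b then parses as a cast */
    return IDENTIFIER;                   /* (a)-b then parses as a subtraction */
}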
XML
This one is simple:
<block>Hello, world</blob>
That's ill-formed XML, but it cannot be detected with a CFG. (Nonetheless, all XML parsers will correctly reject it as ill-formed.) In the case of HTML/SGML, where end-tags may be omitted under some well-defined circumstances, parsing is trickier but nonetheless deterministic; again, the precise declaration of a tag will determine the parse of a valid input, and it's easy to come up with inputs which parse differently depending on declaration.
English
OK, not a programming language. I have lots of other programming language examples, but I thought this one might trigger some other intuitions.
Consider the two grammatically correct sentences:
The sheep is in the meadow.
The sheep are in the meadow.
Now, what about:
The cow is in the meadow.
(*) The cow are in the meadow.
The second sentence is intelligible, albeit ambiguous (is the noun or the verb wrong?) but it is certainly not grammatically correct. But in order to know that (and other similar examples), we have to know that sheep has an unmarked plural. Indeed, many animals have unmarked plurals, so I recognize all the following as grammatical:
The caribou are in the meadow.
The antelope are in the meadow.
The buffalo are in the meadow.
But definitely not:
(*) The mouse are in the meadow.
(*) The bird are in the meadow.
etc.
It seems that there is a common misconception that because the syntactic analyzer uses a context-free grammar parser, syntactic analysis is restricted to parsing a context-free grammar. This is simply not true.
In the case of C (and family), the syntax analyzer uses a symbol table to help it parse. In the case of XML, it uses the tag stack, and in the case of generalized SGML (including HTML) it also uses tag declarations. Consequently, the syntax analyzer considered as a whole is more powerful than the CFG, which is just one part of the analysis.
The fact that a given program passes the syntax analysis does not mean that it is semantically correct. For example, the syntax analyzer needs to know whether a is a type or not in order to correctly parse (a)-b, but it does not need to know whether the cast is in fact possible, in the case that a is a type, or whether a and b can meaningfully be subtracted, in the case that a is a variable. These verifications can happen during type analysis after the parse tree is built, but they are still compile-time errors.

Is this the job of the lexer?

Let's say I was lexing a ruby method definition:
def print_greeting(greeting = "hi")
end
Is it the lexer's job to maintain state and emit relevant tokens, or should it be relatively dumb? Notice in the above example the greeting param has a default value of "hi". In a different context, greeting = "hi" is variable assignment which sets greeting to "hi". Should the lexer emit generic tokens such as IDENTIFIER EQUALS STRING, or should it be context-aware and emit something like PARAM_NAME EQUALS STRING?
I tend to make the lexer as stupid as I possibly can, and would thus have it emit the IDENTIFIER EQUALS STRING tokens. At lexical analysis time there is (most of the time) no information available about what the tokens should represent. Having grammar rules like this in the lexer only pollutes it with (very) complex syntax rules, and that's the parser's job.
I think the lexer should be "dumb", and in your case it should return something like this: DEF IDENTIFIER OPEN_PARENTHESIS IDENTIFIER EQUALS STRING CLOSE_PARENTHESIS END.
The parser should do the validation; why else split the responsibilities?
I don't work with Ruby, but I do work with compilers and programming-language design.
Both approaches work, but in real life, using generic identifiers for variables, parameters, and reserved words is easier (a "dumb lexer" or "dumb scanner").
Later, you can "cast" those generic identifiers into other tokens, sometimes in your parser.
Sometimes lexers/scanners have a code section, separate from the parser, that allows several "semantic" operations, including casting a generic identifier into a keyword, variable, or type identifier. Your lexer rules detect a generic identifier token but return a different token to the parser.
Another similar, common case is a language that uses "+" and "-" both as binary operators and as unary sign operators.
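For that case, the usual arrangement is for the scanner to always return a plain '-' token and let the grammar decide, as in this yacc/bison-style sketch (NUMBER and UMINUS are illustrative names):

%left '+' '-'
%right UMINUS
%%
expr: expr '+' expr
    | expr '-' expr
    | '-' expr %prec UMINUS   /* same '-' token, resolved as unary by context */
    | NUMBER
    ;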
The distinction between lexical analysis and parsing is an arbitrary one. In many cases you wouldn't want a separate step at all. That said, since performance is usually the most important issue (otherwise parsing would be a mostly trivial task), you need to decide, and probably measure, whether additional processing during lexical analysis is justified or not. There is no general answer.