Why would I use a lexer and not directly parse code?

Why would I use a lexer and not directly parse code? - parsing

I am trying to create a simple programming language from scratch (interpreter) but I wonder why I should use a lexer.
For me, it looks like it would be easier to create a parser that directly parses the code. what am I overlooking?

I think you'll agree that most languages (likely including the one you are implementing) have conceptual tokens:
operators, e.g * (usually multiply), '(', ')', ;
keywords, e.g., "IF", "GOTO"
identifiers, e.g. FOO, count, ...
numbers, e.g. 0, -527.23E-41
comments, e.g., /* this text is ignored in your file */
whitespace, e.g., sequences of blanks, tabs and newlines, that are ignored
As a practical matter, it takes a specific chunk of code to scan for/collect the characters that make each individual token. You'll need such a code chunk for each type of token your language has.
If you write a parser without a lexer, at each point where your parser is trying to decide what comes next, you'll have to have ALL the code that recognize the tokens that might occur at that point in the parse. At the next parser point, you'll need all the code to recognize the tokens that are possible there. This gives you an immense amount of code duplication; how many times do you want the code for blanks to occur in your parser?
If you think that's not a good way, the obvious cure to is remove all the duplication: place the code for each token in a subroutine for that token, and at each parser place, call the subroutines for the tokens. At this point, in some sense, you already have a lexer: an isolated collection of code to recognize tokens. You can code perfectly fine recursive descent parsers this way.
The next thing you'll discover is that you call the token subroutines for many of the tokens at each parser point. Even that seems like a lot of work and duplication. So, replace all the calls with a single "GetNextToken" call, that itself invokes the token recognizing code for all tokens, and returns a enum that identifies the specific token encountered. Now your parser starts to look reasonable: at each parser point, it makes one call on GetNextToken, and then branches on enum returned. This is basically the interface that people have standardized on as a "lexer".
One thing you will discover is the token-lexers sometimes have trouble with overlaps; keywords and identifiers usually have this trouble. It is actually easier to merge all the token recognizers into a single finite state machine, which can then distinguish the tokens more easily. This also turns out to be spectacularly fast when processing the programming language source text. Your toy language may never parse more than 100 lines, but real compilers process millions of lines of code a day, and most of that time is spent doing token recognition ("lexing") esp. white space suppression.
You can code this state machine by hand. This isn't hard, but it is rather tedious. Or, you can use a tool like FLEX to do it for you, that's just a matter of convenience. As the number of different kinds of tokens in your language grows, the FLEX solution gets more and more attractive.
TLDR: Your parser is easier to write, and less bulky, if you use a lexer. In addition, if you compile the individual lexemes into a state machine (by hand or using a "lexer generator"), it will run faster and that's important.

Well, for intelligently simplified programing language you can get away without either lexer or parser :-) Not kidding. Look up Forth. You can start with tags here on SO (gforth is GNU's) and then go to the Standard's site which has pointers to a few interpreters, sites and its Glossary.
Then you can check out Win32Forth and that should keep you busy for quite a while :-)
Interpreter also compiles (when you invoke words that switch system to compilation context). All without a distinct parser. Lookahead is actually lookbehind :-) - not kidding. It rarely absorbs one following word (== lookahead is max 1). The "words" (aka tokens) are at the same time keywords and variable names and they all live in a Dictionary. There's a whole online book at that site (plus pdf).
Control structures are also just words (they compile a few addresses and jumps on the fly).
You can find old Journals there as well, covering a wide spectrum from machine code generation to object oriented extensions. Yes still without parser - believe it or not.
There used to be more sophisticated (commercial) Forth systems which were reducing words to machine call instructions with immediate addressing (makes the engine run 2-4 times faster) but even plain interpreters were always considered to be fast. One is apparently still active - SwiftForth, but don't expect any freebies there.
There's one Forth on GitHub CiForth which is quite spartanic but has builds and releases for Win, Linux and Mac, 32 and 64 so you can just download and run. Claims to have a 16-bit build as well :-) For embedded systems I suppose.

Related

Writing a lexer for a context sensitive markup language, that has recursive structures such as nested lists

I'm working on a reStructuredText transpiler in Rust, and am in need of some advice concerning how lexing should be structured in languages that have recursive structures. For example lists within lists are possible in rST:
* This is a list item
* This is a sub list item
* And here we are at the preceding indentation level again.
The default docutils.parsers.rst took the approach of scanning the input one line at a time:
The reStructuredText parser is implemented as a state machine, examining its
input one line at a time.
The state machine mentioned basically operates on a set of states of the form (regex, match_method, next_state). It tries to match the current line to the regex based on the current state and runs match_method while transitioning to the next_state if a match succeeds, doing this until it runs out of lines to scan.
My question then is, is this the best approach to scanning a language such as rST? My approach thus far has been to create a Chars iterator of the source and eat away at the source while trying to match against structures at the current Unicode scalar. This works to some extent when all I'm doing is scanning inline content, but I've now run into the realization that handling recursive body level structures like nested lists is going to be a pain in the butt. It feels like I'm going to need a whole bunch of states with duplicate regexes and related methods in many states for matching against indentations before new lines and such.
Would it be better to simply have and iterator of the lines of the source and match on a per-line basis, and if a line such as
* this is an indented list item
is encountered in State::Body, simply transition to a state such as State::BulletList and start lexing lines based on the rules specified there? The above line could be lexed for example as a sequence
TokenType::Indent, TokenType::Bullet, TokenType::BodyText
Any thoughts on this?

I don't know much about rST. But you say it has "recursive" structures. If that's that case, you can't fully lex it as a recursive structure using just state machines or regexes or even lexer generators.
But this the wrong way to think about it. The lexer's job is to identify the atoms of the language. A parser's job is to recognize structure, especially if it is recursive (yes, parsers often build trees recording the recursive structures they found).
So build the lexer ignoring context if you can, and use a parser to pick up the recursive structures if you need them. You can read more about the distinction in my SO answer about Parsers Vs. Lexers https://stackoverflow.com/a/2852716/120163
If you insist on doing all of this in the lexer, you'll need to augment it with a pushdown stack to track the recursive structures. Then what are you building is a sloppy parser disguised as lexer. (You will probably still want a real parser to process the output of this "lexer").
Having a pushdown stack actually useful if the language has different atoms in different contexts especially if the contexts nest; in this case what you want is mode stack that you change as the lexer encounters tokens that indicate a switch from one mode to another. A really useful extension of this idea is to have mode changes select what amounts to different lexers, each of which produces lexemes unique to that mode.
As an example you might do this to lex a language that contains embedded SQL. We build parsers for JavaScript; our lexer uses a pushdown stack to process the content of regexp literals and track nesting of { ... } [...] and (... ). (This has arguably a downside: it rejects versions of JQuery.js that contain malformed regexes [yes, they exist]. Javascript doesn't care if you define a bad regex literal and never use it, but that seems pretty pointless.)
A special case of the stack occurs if you only have track single "(" ... ")" pairs or the equivalent. In this case you can use a counter to record how many "pushes" or "pop" you might have done on a real stack. If you have two or more pairs of tokens like this, counters don't work.

Designing a Language Lexer

I'm currently in the process of creating a programming language. I've laid out my entire design and am in progress of creating the Lexer for it. I have created numerous lexers and lexer generators in the past, but have never come to adopt the "standard", if one exists.
Is there a specific way a lexer should be created to maximise capability to use it with as many parsers as possible?
Because the way I design mine, they look like the following:
Code:
int main() {
printf("Hello, World!");
}
Lexer:
[
KEYWORD:INT, IDENTIFIER:"main", LEFT_ROUND_BRACKET, RIGHT_ROUNDBRACKET, LEFT_CURLY_BRACKET,
IDENTIFIER:"printf", LEFT_ROUND_BRACKET, STRING:"Hello, World!", RIGHT_ROUND_BRACKET, COLON,
RIGHT_CURLY_BRACKET
]
Is this the way Lexer's should be made? Also as a side-note, what should my next step be after creating a Lexer? I don't really want to use something such as ANTLR or Lex+Yacc or Flex+Bison, etc. I'm doing it from scratch.

If you don't want to use a parser generator [Note 1], then it is absolutely up to you how your lexer provides information to your parser.
Even if you do use a parser generator, there are many details which are going to be project-dependent. Sometimes it is convenient for the lexer to call the parser with each token; other times is is easier if the parser calls the lexer; in some cases, you'll want to have a driver which interacts separately with each component. And clearly, the precise datatype(s) of your tokens will vary from project to project, which can have an impact on how you communicate as well.
Personally, I would avoid use of global variables (as in the original yacc/lex protocol), but that's a general style issue.
Most lexers work in streaming mode, rather than tokenizing the entire input and then handing the vector of tokens to some higher power. Tokenizing one token at a time has a number of advantages, particularly if the tokenization is context-dependent, and, let's face it, almost all languages have some impurity somewhere in their syntax. But, again, that's entirely up to you.
Good luck with your project.
Notes:
Do you also forgo the use of compilers and write all your code from scratch in assembler or even binary?

Is there a specific way a lexer should be created to maximise capability to use it with as many parsers as possible?
In the lexers I've looked at, the canonical API is pretty minimal. It's basically:
Token readNextToken();
The lexer maintains a reference to the source text and its internal pointers into where it is currently looking. Then, every time you call that, it scans and returns the next token.
The Token type usually has:
A "type" enum for which kind of token it is: string, operator, identifier, etc. There are usually special kinds for "EOF", meaning a special terminator token that is produced after the end of the input, and "ERROR" for the rare cases where a syntax error comes from the lexical grammar. This is mainly just unterminated string literals or totally unknown characters in the source.
The source text of the token.
Sometimes literals are converted to their proper value representation during lexing in which case you'll have that value too. So a number token would have "123" as text but also have the numeric value 123. Or you can do that during parsing/compilation.
Location within the source file of the token. This is for error reporting. Usually 1-based line and column, but can also just be start and end byte offsets. The latter is a little faster to produce and can be converted to line and column lazily if needed.
Depending on your grammar, you may need to be able to rewind the lexer too.

Lexical Analysis of a Scripting Language

I am trying to create a simple script for a resource API. I have a resource API mainly creates game resources in a structured manner. What I want is dealing with this API without creating c++ programs each time I want a resource. So we (me and my instructor from uni) decided to create a simple script to create/edit resource files without compiling every time. There are also some other irrelevant factors that I need a command line interface rather than a GUI program.
Anyway, here is script sample:
<path>.<command> -<options>
/Graphics[3].add "blabla.png"
I didn't design this script language, the owner of API did. The part before '.' as you can guess is the path and part after '.' is actual command and some options, flags etc. As a first step, I tried to create grammar of left part because I thought I could use it while searching info about lexical analyzers and parser. The problem is I am inexperienced when it comes to parsing and programming languages and I am not sure if it's correct or not. Here is some more examples and grammar of left side.
dir -> '/' | '/' path
path -> object '/' path | object
object -> number | string '[' number ']'
Notation if this grammar can be a mess, I don't know. There is 5 different possibilities, they are:
String
"String"
Number
String[Number]
"String"[Number]
It has to start with '/' symbol and if it's the only symbol, I will accept it as Root.
Now my problem is how can I lexically analyze this script? Is there a special method? What should my lexical analyzer do and do not(I read some lexical analysers also do syntactic analysis up to a point). Do you think grammar, etc. is technically appropriate? What kind of parsing method I should use(Recursive Descent, LL etc.)? I am trying to make it technically appropriate piece of work. It's not commercial so I have time thus I can learn lexical analysis and parsing better. I don't want to use a parser library.

What should my lexical analyzer do and not do?
It should:
recognize tokens
ignore ignorable whitespace and comments (if there are such things)
optionally, keep track of source location in order to produce meaningful error messages.
It should not attempt to parse the input, although that will be very tempting with such a simple language.
From what I can see, you have the following tokens:
punctuation: /, ., linear-white-space, new-line
numbers
unquoted strings (often called "atoms" or "ids")
quoted strings (possibly the same token type as unquoted strings)
I'm not sure what the syntax for -options is, but that might include more possibilities.
Choosing to return linear-white-space (that is, a sequence consisting only of tabs and spaces) as a token is somewhat questionable; it complicates the grammar considerably, particularly since there are probably places where white-space is ignorable, such as the beginning and end of a line. But I have the intuition that you do not want to allow whitespace inside of a path and that you plan to require it between the command name and its arguments. That is, you want to prohibit:
/left /right[3] .whimper "hello, world"
/left/right[3].whimper"hello, world"
But maybe I'm wrong. Maybe you're happy to accept both. That would be simpler, because if you accept both, then you can just ignore linear-whitespace altogether.
By the way, experience has shown that using new-line to separate commands can be awkward; sooner or later you will need to break a command into two lines in order to avoid having to buy an extra monitor to see the entire line. The convention (used by bash and the C preprocessor, amongst others) of putting a \ as the last character on a line to be continued is possible, but can lead to annoying bugs (like having an invisible space following the \ and thus preventing it from really continuing the line).
From here down is 100% personal opinion, offered for free. So take it for what its worth.
I am trying to make it technically appropriate piece of work. It's not commercial so I have time thus I can learn lexical analysis and parsing better. I don't want to use a parser library.
There is a contradiction here, in my opinion. Or perhaps two contradictions.
A technically appropriate piece of work would use standard tools; at least a lexical generator and probably a parser generator. It would do that because, properly used, the lexical and grammatical descriptions provided to the tools document precisely the actual language, and the tools guarantee that the desired language is what is actually recognized. Writing ad hoc code, even simple lexical recognizers and recursive descent parsers, for all that it can be elegant, is less self-documenting, less maintainable, and provides fewer guarantees of correctness. Consequently, best practice is "use standard tools".
Secondly, I disagree with your instructor (if I understand their proposal correctly, based on your comments) that writing ad hoc lexers and parsers aids in understanding lexical and parsing theory. In fact, it may be counterproductive. Bottom-up parsing, which is incredibly elegant both theoretically and practically, is almost impossible to write by hand and totally impossible to read. Consequently, many programmers prefer to use recursive-descent or Pratt parsers, because they understand the code. However, such parsers are not as powerful as a bottom-up parser (particularly GLR or Earley parsers, which are fully general), and their use leads to unnecessary grammatical compromises.
You don't need to write a regular expression library to understand regular expressions. The libraries abstract away the awkward implementation details (and there are lots of them, and they really are awkward) and let you concentrate on the essence of creating and using regular expressions.
In the same way, you do not need to write a compiler in order to understand how to program in C. After you have a good basis in C, you can improve your understanding (maybe) by understanding how it translates into machine code, but unless you plan a career in compiler writing, knowing the details of obscure optimization algorithms are not going to make you a better programmer. Or, at least, they're not first on your agenda.
Similarly, once you really understand regular expressions, you might find writing a library interesting. Or not -- you might find it incredibly frustrating and give up after a couple of months of hard work. Either way, you will appreciate existing libraries more. But learn to use the existing libraries first.
And the same with parser generators. If you want to learn how to translate an idea for a programming language into something precise and implementable, learn how to use a parser generator. Only after you have mastered the theory of parsing should you even think of focusing on low-level implementations.

Is it possible to call one yacc parser from another to parse specific token substream?

Suppose I already have a complete YACC grammar. Let that be C grammar for example. Now I want to create a separate parser for domain-specific language, with simple grammar, except that it still needs to parse complete C type declarations. I wouldn't like to duplicate long rules from the original grammar with associated handling code, but instead would like to call out to the original parser to handle exactly one rule (let's call it "declarator").
If it was a recursive descent parser, there would be a function for each rule, easy to call in. But what about YACC with its implicit stack automaton?

Basically, no. Composing LR grammars is not easy, and bison doesn't offer much help.
But all is not lost. Nothing stops you from including the entire grammar (except the %start declaration), and just using part of it, except for one little detail: bison will complain about useless productions.
If that's a show-stopper for you, then you can use a trick to make it possible to create a grammar with multiple start rules. In fact, you can create a grammar which lets you specify the start symbol every time you call the parser; it doesn't even have to be baked in. Then you can tuck that into a library and use whichever parser you want.
Of course, this also comes at a cost: the cost is that the parser is bigger than it would otherwise need to be. However, it shouldn't be any slower, or at least not much -- there might be some cache effects -- and the extra size is probably insignificant compared to the rest of your compiler.
The hack is described in the bison FAQ in quite a lot of detail, so I'll just do an outline here: for each start production you want to support, you create one extra production which starts with a pseudo-token (that is, a lexical code which will never be generated by the lexer). For example, you might do the following:
%start meta_start
%token START_C START_DSL
meta_start: START_C c_start | START_DSL dsl_start;
Now you just have to arrange for the lexer to produce the appropriate START token when it first starts up. There are various ways to do that; the FAQ suggests using a global variable, but if you use a re-entrant flex scanner, you can just put the desired start token in the scanner state (along with a flag which is set when the start token has been sent).

Partial parsing with flex/antlr

I encountered a problem while doing my student research project. I'm an electrical engineering student, but my project has somewhat to do with theoretical computer science: I need to parse a lot of pascal sourcecode-files for typedefinitions and constants and visualize all occurrences. The typedefinitions are spread recursively over various files, i.e. there is type a = byte in file x, in file y, there is a record (struct) b, that contains type a and then there is even a type c in file z that is an array of type b.
My idea so far was to learn about compiler construction, since the compiler has to resolve all typedefinitions and break them down to the elemental types.
So, I've read about compiler construction in two books (one of which is even written by the pascal inventor), but I'm lacking so many basics of theoretical computer science that it took me one week alone to work my way halfway through. What I've learned so far is that for achieving my goal, lexer and parser should be sufficient. Since this software is only a really smart part of the whole project, I can't spend so much time with it, so I started experimenting with flex and later with antlr.
My hope was, that parsing for typedefinitions only was such an easy task, that I could manage to do it with only using a scanner and let it do some parser's work: The pascal-files consist of 5 main-parts, each one being optional: A header with comments, a const-section, a type-section, a var-section and (in least cases) a code-section. Each section has a start-identifier but no clear end-identifier. So I started searching for the start of the type- and const-section (TYPE, CONST), discarding everything else. In flex, this is fairly easy, because it allows "start conditions". They can be used as various states like "INITIAL", "TYPE-SECTION", "CONST-SECTION" and "COMMENT" with different rules for each state. I wanted to get back a string from the scanner with following syntax " = ". There was one thing that made this task difficult: Some type contain comments like in this example: AuEingangsBool_t {PCMON} = MAX_AuEingangsFeld;. The scanner can not extract such type-definition with a regular expression.
My next step was to do it properly with scanner AND parser, so I searched for a parsergenerator and found antlr. Since I write the tool in C# anyway, I decided to use its scannergenerator, too, so that I do not have to communicate between different programs. Now I encountered following Problem: AFAIK, antlr does not support "start conditions" as flex do. That means, I have to scan the whole file (okay, comments still get discarded) and get a lot of unneccessary (and wrong) tokens. Because I don't use rules for the whole pascal grammar, the scanner would identify most keywords of the pascal syntax as user-identifiers and the parser would nag about all those series of tokens, that do not fit to type- and constant-defintions
Now, finally my question(s): Can anyone of you tell me, which approach leads anywhere for my project? Is there a possibility to scan only parts of the source-files with antlr? Or do I have to connect flex with antlr for that purpose? Can I tell antlr's parser to ignore every token that is not in the const- or type-section? Are those tools too powerful for my task and should I write own routines instead?

You'd be better off to find a compiler for Pascal, and simply modify to report the information you want. Presumably there is such a compiler for your Pascal, and often the source code for such compilers is available.
Otherwise you essentially need to build a parser. Building lexer, and then hacking around with the resulting lexemes, is essentially building a bad parser by ad hoc methods. ANTLR is a good way to go; you can define the lexemes (including means to pick up and ignore comments) pretty easily, especially for older dialects of Pascal. You'll need good BNF rules for the type information that you want, and translate those rules to the parser generator. What you can do to minimize work, is to cheat on rules for the parts of the language you don't care about. For instance, you could write an accurate subgrammar for assignment statements. Since you don't care about them, you can write a sloppy subgrammar that treats assignment statements as anything that begins with an identifier, is followed by arbitrary other tokens, and ends with semicolon. This kind of a grammar is called an "island grammar"; it is only accurate where it needs to be accurate.
I don't know about the recursive bit. Is there a reason you can't just process each file separately? The answer may depend on what information you want to know about each type declaration, and if you go deep enough, you may need a symbol table as well as an island parser. Parser generators offer you no help for this.

First, there can be type and const blocks within other blocks (procedures, in later Delphi versions also classes).
Moreover, I'm not entirely sure that you can actually simply scan for a const token, and then start parsing. Const is also used for other purposes in most common (Borland) Pascal dialects. Some keywords can be reused in a different context, and if you don't parse the global blockstructure, and only look for const and type in specific places you will erroneously start parsing there.
A base problem of course is the comments. Scanners cut out comments as early as possible, and don't regard them further. You probably have to setup the scanner so that comments are attached to the adjacent tokens as field (associate with token before or save them up till a certain token follows).
As far antlr vs flex, no clue. The only parsergenerator I have some minor experience in parsing Pascal with is Coco/R (a parsergenerator popular by Wirthians), but in general I (and many pascalians) prefer handcoded.

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart