Defining lex include files for (f)lexers - flex-lexer

When having multiple lexers one sees that, especially, in the pattern part some definitions are repeated in each lexer (e.g. whiteSpace [ \t]+ ), this is not nice that one has to do define it each time and especially with more complex patterns a bit error prone.
So far I have not been able to fine anything, but is there a way to have a file with (e.g.) patterns included in a lexer?

You are free to write your own preprocessor, and I suspect many people have done so. But as far as I know, no popular lex derivative includes such a feature. Certainly, neither flex nor the original AT&T lex have one.

Related

Writing a Parser for mixed Languages

I am trying to write a Parser that can analyze mixed Languages and generate an AST of it. I first tried to build it from scratch on my own in Java and failed, because this is quite a hard topic for a Parser-beginner. Then i googled and found http://www2.cs.tum.edu/projects/cup/examples.php and JFlex.
The question now is: What is the best way to do it?
For example i have a Codefile, that contains several Tags, JS Code, and some $CMS_SET(x,y)$ Code. Is the best way to solve this to define a grammar for all those things in CUP and let CUP generate a Parser based on my grammer that can analyze those mixed Language files and generate and AST Tree of it?
Thanks for all helpful answers. :)
EDIT: I need to do it in Java...
This topic is quite hard even for an expert in this area, which I consider myself to be; check my bio.
The first issue is to build individual parsers for each sublanguage. The first thing you will discover is that defining parsers for specific languages is actually hard; you can read the endless list of SO requests for "can I get a parser for X" or "how do I fix my parser for X". Mostly I think these requests end up not going anywhere; the parsing engines aren't very good, you have to twist the grammars and the parsers to make them work on real languages, there isn't any such thing as "pure HTML", the standards documents disagree and your customer always has some twist in his code that you're not prepared for. Finally there's the glitches related to character set encodings, variations in newline endings, and preprocessors to complicate the parsing parsing problem. The C++ preprocessor is a lot more complex than you might think, and you have to have this right. The easiest way to defeat this problem is to find some parser generator with the languages already predefined. ANTLR has a bunch for the now deprecated ANTLR3; there's no assurance these parsers are robust let alone compatible for your purposes.
CUP isn't a particularly helpful parser generator; none of the LL(x) or LALR(x) parser generators are really helpful, because no real langauge matches the categories of things they can parse. The consequence: an endless stream of requests (on SO!) for help "resolving my shift-reduce conflict", or "eliminating right recursion". The only parser generator IMHO that has stood the test of time is a GLR parser generator (I hear good things about GLL but that's pretty recent). We've done 40+ languages with one GLR parser generator, including production IBM COBOL, full C++14 and Java8.
You second problem will be building ASTs. You can hand code the AST building process, but that gets old fast when you have to change the grammars often and/or you have many grammars as you are effectively contemplating. This you can beat your way throught with sweat. (We chose to push the problem of building ASTs into the parser so we didn't have to put any energy this in building a grammar; to do this, your parser engine has to offer you this help and none of the mainstream ones do.).
Now you need to compose parsers. You need to have one invoke the other as the need comes up; of course your chosen parser isn't designed to do this so you'll have to twist it. The first hard part is provide a parser with clues that a sublanguage is coming up in the input stream, and for it to hand off that parsing to the sublangauge parser, and get it to pass a tree back to be incorporated in the parent parser's tree, presumably with some kind of marker so you can tell where the transitions between different sublangauges are in the tree. You can often do this by hacking one language's lexer, when it sees the clue, to invoke the other; but then what you do with the tree it returns? There's no way to give that tree to the current parser and say "integrate this". You get around this by modifying the parsing machinery in arcane ways.
But all of the above isn't where the problem is.
Parsing is inconvenient, but only a small part of what you need to analyze your programs in any interesting way; you need symbol tables, control and data flow analysis, maybe points-to analyses, and the engineering on these will swamp the work listed above. See my essay on "Life After Parsing" (google or via my bio) for a long discussion of what else you need.
In short, I think you are biting off an enormous task in just "parsing", and you haven't even told us what you intend to do with the result. You are welcome to start down this path, but very few people have succeeded; my team spent over a 50 man years of PhD level engineering to get where we are, and we are hardly done.
Java won't make the solution any easier or harder; the langauge in which you solve all of the above is irrelevant.

Are there any known parser combinator library's in F# that can parse binary (not text) files?

I am familiar with some of the basics of fparsec but it seems to be geared towards text files or streams.
Are there any other F# library's that can efficiently parse binary files? Or can fparsec be easily modified to work efficiently with binary streams?
You may be interested in pickler combinators. These are a bit like parser combinators, but are more focused at simpler binary formats (picklers allow you to produce binary data and unpicklers parse them). There is a quite readable article about the idea (PDF) by Andrew Kennedy (the author of units of measure).
I don't have much experience with these myself, but I just realized it may be relevant for you. The idea is used in the F# compiler for generating some binary resources (like quotations stored in resources). Although, I'm not sure if the F# compiler implementation is any good (it is one of those things from early days of the F# compiler).
The problem with working with binary streams is not a parser problem per se, it's a lexing problem. The lexer is what turns the raw data in to elements that the parse can handle.
Most any parsing system has few problems letting you supply your own lexer, and if that's the case you could, ideally, readily write a compliant lexer that works on your binary stream.
The problem, however, is that most parsing and lexing systems today are themselves created from a higher level tool. And THAT tool most likely is not designed to work with binary streams. That is, it's not practical for you specify the tokens and grammar of the binary stream that can be used to create the subsequent parsers and lexer. Also, there is likely no support whatsoever for the higher level concepts of multi byte binary numbers (shorts, longs, floats, etc.) that you are likely to encounter in a binary stream, nor for the generated parser to possibly work well upon them if you actually need to work on their actual value, again because the systems are mostly designed for text based tokens, and the underlying runtime handles the details of converting that text it something the machine can use (such as sequences of ascii numerals in to actual binary integers).
All that said, you can probably actually use the parsing section of the tool, since parsers work more on abstract tokens that are fed them by the lexer. Once you create your grammar, at a symbolic level, you would need to redo the lexer to create the problem tokens from the binary stream to feed in to the parser.
This is actually good, because the parser tends to be far more complicated than the basic lexer, so the toolkit would handle much of the "hard part" for you. But you would still need to deal with creating your own lexer and interfacing it properly to the generated parser. Not an insurmountable task, and if the grammar is of any real complexity, likely worth your effort in the long run.
If it's all mostly simple, then you're likely just better off doing it your self by hand. Of the top of my head, it's hard to imagine a difficult binary grammar, since the major selling point of a binary format is that it's much closer to the machine, which is in contradiction to the text that most parsers are designed to work with. But I don't know your use case.
But consider the case of a disassembler. That's a simple lexer that may be able to under stand at a high level the different instruction types (such as those operands that have no arguments, those that take a single byte as an argument, or a word), and feed that to a parser can then be used to convert the instructions in to their mnemonics and operands in the normal assembler syntax, as well as handle the label references and such.
It's a contrived case, as a disassembler typically doesn't separate the lexing and parsing phases, it's usually not complicated enough to bother, but it's one way to look at the problem.
Addenda:
If you have enough information to convert the binary stream in to text to feed to the engine, then you you have enough information to instead of creating text, you could create the actual tokens that the parser would want to see from the lexer.
That said, what you could do is take your text format, use that as the basis for your parsing tool and grammar, and have it create the lexer and parser machines for you, and then, by hand, you can test your parser and its processing using "text tests".
But when you get around to reading the binary, rather than creating text to then be lexed and parsed, simply create the tokens that the lexer would create (these should be simple objects), and pump the parser directly. This will save you the lex step and save you some processing time.

Will rewriting a multipurpose log file parser to use formal grammars improve maintainability?

TLDR: if I built a multipurpose parser by hand with different code for each format, will it work better in the long run using one chunk of parser code and an ANTLR, PyParsing or similar grammar to specify each format?
Context:
My job involves lots of benchmark log files from ~50 different benchmarks. There are a few in XML, a few HTML, a few CSV and lots of proprietary stuff with no documented spec. To save me and my coworkers the time of entering this data by hand, I wrote a parsing tool that handles all of the formats we deal with regularly with a uniform interface. The design, though, is not so clean.
I wrote this thing in Python and created a Parser class. Each file format is handled as an implementation that provides its own code for the Parser's read() method. I like the idea of having only one definition of Parser that uses grammars to understand each format, but I've never done it before.
Is it worth my time, and will it be easier for other newbies to work with in the future once I finish refactoring?
I can't answer your question with 100% certainty, but I can give you an opinion.
I find the choice to use a proper grammar vs hand rolled regex "parsers" often comes down to how uniform the input is.
If the input is very uniform and you already know a language that deals with strings well, like Python or Perl, then I'd keep your existing code.
On the other hand I find parser generators, like Antlr, really shine when the input can have errors and inconsistencies in it. The reason is that the formal grammar allows you to focus on what should be matched in a certain context without having to worry about walking the input stream manually.
Furthermore if the input stream has an error then I find it's often easier to deal with them using Antlr vs regexs. The reason being is that if a couple of options are available Antlr has built in functionality for hosing the correct path, including rollback via predicates.
Having said all that, there is alot to be said for working code. I find if I want to rewrite something then I try to make a good use case for how the rewrite will benefit the user of the product.

In the programming languages specifications, why is it that lexical analysis not translatable?

In all of the standard specifications for programming languages, why is it that you cannot directly translate the lexical analysis/layout to a grammar that is ready to be plugged into and working?
I can understand that it would be impossible to adapt it for the likes of Flex/Bison, Lex/Yacc, Antlr and so on, and furthermore to make it readable for humans to understand.
But surely, if it is a standard specification, it should be a simple copy/paste the grammar layout and instead end up with loads of shift/reduce errors as a result which can back fire and hence, produce an inaccurate grammar.
In other words, why did they not make it readable for use by a grammar/parser tool straight-away?
Maybe it is a debatable thing I don't know...
Thanks,
Best regards,
Tom.
In other words, why did they not make
it readable for use by a
grammar/parser tool straight-away?
Standards documents are intended to be readable by humans, not parser generators.
It is easy for humans to look at a grammar and know what the author intended, however, a computer needs to have a lot more hand holding along the way.
Specifically, these specifications are generally not LL(1) or LR(1). As such, lookaheads are needed, conflicts need to be resolved. True, this could be done in the language specification, but then it is source code for a lexical analyzer, not a language specification.
I agree with your sentiment, but the guys writing standards can't win on this.
To make the lexer/grammar work for a parser generator directly-out-of-standard, the standard writers would have to choose a specific one. (What choice would the COBOL standard folks have made in 1958?)
The popular ones (LEX, YACC, etc.)
are often not capable of handling reference grammars, written for succinctness and clarity, and so would be a poor (e.g. non-)choice.
More exotic ones (Earley, GLR) might be more effective because they allow infinite lookahead and ambiguity, but are harder to find. So if a specific tool like this
was chosen you would not get what you wanted, which is a grammar that works with the parser generator you have.
Having said that, the DMS Software Reengineering Toolkit uses a GLR parser generator. We don't have to massage reference grammars a lot to get them to work, and DMS now handles a lot of languages, including ones that are famously hard such as C++. IMHO, this is as close to your ideal as you are likely to get.

Most effective way to parse C-like definition strings?

I've got a set of function definitions written in a C-like language with some additional keywords that can be put before some arguments(the same way as "unsigned" or "register", for example) and I need to analyze these lines as well as some function stubs and generate actual C code from them.
Is that correct that Flex/Yacc are the most proper way to do it?
Will it be slower than writing a Shell or Python script using regexps(which may become big pain, as I suppose, if the number of additional keywords becomes bigger and their effects would be rather different) provided that I have zero experience with analysers/parsers(though I know how LALR does its job)?
Are there any good materials on Lex/Yacc that cover similar problems? All papers I could find use the same primitive example of a "toy" calculator.
Any help will be appreciated.
ANTLR is commonly used (as are Lex\Yacc).
ANTLR, ANother Tool for Language
Recognition, is a language tool that
provides a framework for constructing
recognizers, interpreters, compilers,
and translators from grammatical
descriptions containing actions in a
variety of target languages.
There is also the Lemon Parser, which features a less restrictive grammar. The down side is you're married to lemon, re-writing a parser's grammar to something else when you discover some limitation sucks. The up side is its really easy to use .. and self contained. You can drop it in tree and not worry about checking for the presence of others.
SQLite3 uses it, as do several other popular projects. I'm not saying use it because SQLite does, but perhaps give it a try if time permits.
That entirely depends on your definition of "effective". If you have all the time of the world, the fastest parser would be a hand-written pull parser. They take a long time to debug and develop but today, no parser generator beats hand-written code in terms of runtime performance.
If you want something that can parse valid C within a week or so, use a parser generator. The code will be fast enough and most parser generators come with a grammar for C already which you can use as a starting point (avoiding 90% of the common mistakes).
Note that regexps are not suitable for parsing recursive structures. This approach would both be slower than using a generator and more error prone than a hand-written pull parser.
actually, it depends how complex is your language and whether it's really close to C or not...
Still, you could use lex as a first step even for regular expression ....
I would go for lex + menhir and o'caml....
but any flex/yacc combination would be fine..
The main problem with regular bison (the gnu implementation of yacc) stems from the C typing.. you have to describe your whole tree (and all the manipulation functions)... Using o'caml would be really easier ...
For what you want to do, our DMS Software Reengineering Toolkit is likely a very effective solution.
DMS is designed specifically to support customer analyzers/code generators of the type you are discussing. It provides very strong facilities for defining arbitrary language parsers/analyzers (tested on 30+ real languages including several complete dialects of C, C++, Java, C#, and COBOL).
DMS automates the construction of ASTs (so you don't have to do anything but get the grammar right to have a usable AST), enables the construction of custom analyses of exactly the pattern-directed inspection you indicated, can construct new C-specific ASTs representing the code you want to generate, and spit them out as compilable C source text. The pre-existing definitions of C for DMS can likely be bent to cover your C-like language.

Resources