I am writing an analyzer which needs an abstract syntax tree (AST) or control flow graph (CFG) of Rust code. It seems impossible for me to do this without implementing a parser by myself.
I've noticed some crates such as syn and quote, but they don't work without using procedural macros and creating a totally unnecessary project structure. I've found that there's a crate called syntex_syntax which meets my requirements, but it is no longer maintained and panics when some code with newer syntax is given.
Is there any way of parsing Rust code directly: read from an arbitrary external *.rs file and parse it using syn or quote just like syntex_syntax did?
syn is a Rust parser and is not only for procedural macros. Take a look at the "functions" section of the documentation. There you will find these functions (as of syn 0.15):
fn parse<T: Parse>(tokens: proc_macro::TokenStream) -> Result<T>: this is what you would use in a procedural macro.
fn parse2<T: Parse>(tokens: proc_macro2::TokenStream) -> Result<T>: the same, but with the TokenStream from the proc_macro2 crate.
fn parse_str<T: Parse>(s: &str) -> Result<T>: parsing from a simple string. No TokenStreams required.
fn parse_file(content: &str) -> Result<File>: very similar to parse_str, but intended for the content of a whole source file: it discards a leading byte order mark and preserves a shebang line. See the docs for more information.
You can use parse_str or parse_file to parse Rust code from outside of procedural macros.
A few additional notes:
quote is not needed in your case. This crate is just used to easily create TokenStreams; it's not required for parsing.
In case you are just interested in parsing the tokens, you can use proc_macro2 outside of a procedural macro, too!
syntex_syntax is indeed deprecated and shouldn't be used anymore. Just thinking about how it was used makes me shudder :P
I am new to Haskell, and I have been trying to write a JSON parser using Parsec as an exercise. This has mostly been going well, I am able to parse lists and objects with relatively little code which is also readable (great!). However, for JSON I also need to parse primitives like
Integers (possibly signed)
Floats (possibly using scientific notation such as "3.4e-8")
Strings with e.g. escaped quotes
I was hoping to find ready-to-use parsers for things like these as part of Parsec. The closest I get is the Parsec.Token module (which defines integer and friends), but those parsers require a "language definition" that seems way beyond what I should have to write to parse something as simple as JSON -- it appears to be designed for programming languages.
So my questions are:
Are the functions in Parsec.Token the right way to go here? If so, how to make a suitable language definition?
Are "primitive" parsers for integers etc defined somewhere else? Maybe in another package?
Am I supposed to write these kinds of low-level parsers myself? I can see myself reusing them frequently... (obscure scientific data formats etc.)
I have noticed that a question on this site says Megaparsec has these primitives included [1], but I suppose these cannot be used with parsec.
Related questions:
How do I get Parsec to let me call `read` :: Int?
How to parse an Integer with parsec
Are the functions in Parsec.Token the right way to go here?
Yes, they are. If you don't care about the minutiae specified by a language definition (i.e. you don't plan to use the parsers which depend on them, such as identifier or reserved), just use emptyDef as a default:
import Text.Parsec
import qualified Text.Parsec.Token as P
import Text.Parsec.Language (emptyDef)
lexer = P.makeTokenParser emptyDef
integer = P.integer lexer
As you noted, this feels unnecessarily clunky for your use case. It is worth mentioning that megaparsec (cf. Alec's suggestion) provides a corresponding integer parser without the ceremony. (The flip side is that megaparsec doesn't try to bake in support for e.g. reserved words, but that isn't difficult to implement in the cases you actually need it.)
(Background: inspired by "Is C++ context-free or context-sensitive?", while I am writing a simple compiler using jflex/cup myself.)
If they are written using a lexer/parser generator, how do we specify the grammar?
Since code like
a b(c);
could be interpreted as either a function declaration or a local variable definition, how could we handle it in the grammar definition file?
Another example could be the token ">>" in the following code:
std::vector<std::vector<int>> foo;
int a = 1000 >> 4;
Thanks
Are the compilers of C++ written using a lexer/parser generator?
It depends. Some are, some aren't.
GCC originally used GNU Bison, but its C++ parser was later replaced with a hand-written recursive-descent parser (as of GCC 3.4). If I have understood that correctly, the main reason was that writing the parser by hand gives you more control over the parser state, and specifically over how much "extraneous" data to keep in there, so that you can generate better error messages.
If they are written using a lexer/parser generator, how do we specify the grammar?
This depends on which parser generator you are using.
Since code like
a b(c);
could be interpreted as either a function declaration or a local variable definition, how could we handle it in the grammar definition file?
Some parser generators may be powerful enough to handle this directly.
Some aren't. Some parser generators which aren't powerful enough have a concept of semantic actions, which allow you to attach code written in an arbitrarily powerful language to parser rules. E.g. yacc allows you to attach C code to rules.
Otherwise, you will have to handle it during semantic analysis.
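To make the last two options concrete, here is a minimal sketch in Rust of the decision a semantic action or a later analysis pass has to make for a b(c);. It assumes the parser has already matched the shape "name name ( name ) ;" and that a symbol table of known type names is available. The names Reading and classify are hypothetical and not taken from any real compiler.

```rust
use std::collections::HashSet;

/// The two readings of a statement shaped like `a b(c);`.
#[derive(Debug, PartialEq)]
enum Reading {
    /// `b` is a function that takes a `c` and returns an `a`.
    FunctionDeclaration,
    /// `b` is a variable of type `a`, initialized from the value `c`.
    VariableDefinition,
}

/// Hypothetical helper: the grammar alone cannot choose, so ask the symbol
/// table whether the name inside the parentheses denotes a type.
fn classify(known_types: &HashSet<&str>, arg: &str) -> Reading {
    if known_types.contains(arg) {
        Reading::FunctionDeclaration
    } else {
        Reading::VariableDefinition
    }
}

fn main() {
    let mut known_types: HashSet<&str> = HashSet::new();
    known_types.insert("a");
    // While `c` is not a type, `a b(c);` defines a variable.
    println!("{:?}", classify(&known_types, "c"));
    // After something like `struct c {};`, the same tokens declare a function.
    known_types.insert("c");
    println!("{:?}", classify(&known_types, "c"));
}
```

The point of the sketch is that the answer depends on declarations seen earlier, which is exactly the information a purely context-free grammar does not carry.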
I want to analyze OCaml files (.ml) using OCaml. I want to break the files into Abstract Syntax Trees for analysis. I have attempted to use camlp4 but have had no luck. Has anyone else successfully done this before? Is this the best way to parse an OCaml file?
(I assume you know basic parts of OCaml already: how to write OCaml code, how to link modules and libraries, how to write build scripts and so on. If you do not, learn them first.)
The best way is to use the genuine OCaml parser used in the OCaml compiler itself, since it is 100% compatible by definition.
CamlP4 also implements an OCaml parser, but it is slightly incompatible with the genuine parser, and its parse tree is somewhat specialized for writing syntax extensions: not very good for any other kind of analysis.
You may want to parse .ml files that use syntax extensions with P4. Even in this case, you should stick to the genuine parser: desugar the source code with P4, then feed the result to your analyzer through the genuine parser.
To use the OCaml compiler's parser, the easiest approach is the compiler-libs.common OCamlFind package. It contains the parser and type checker of the OCaml compiler.
Start by modifying driver/compile.ml in the OCaml compiler source. It implements the major compilation phases: calling the preprocessor, parsing, typing, then code generation. To parse .ml files you should modify (or simplify) Compile.implementation; for .mli files, Compile.interface.
Good luck.
Couldn't you use the -dparsetree option to the ocaml compiler?
hello.ml:
let _ = print_endline "Hello AST"
Now compile it:
$ ocamlc -dparsetree hello.ml
Which results in:
[
  structure_item (hello.ml[1,0+0]..[1,0+33])
    Pstr_eval
    expression (hello.ml[1,0+8]..[1,0+33])
      Pexp_apply
      expression (hello.ml[1,0+8]..[1,0+21])
        Pexp_ident "print_endline" (hello.ml[1,0+8]..[1,0+21])
      [
        <label> ""
          expression (hello.ml[1,0+22]..[1,0+33])
            Pexp_constant Const_string("Hello AST",None)
      ]
]
See also this blog post on -ppx extensions which has some info on extension point syntax extensions (the new way of writing syntax extensions in OCaml 4.02). There is info there on various AST manipulation modules.
To have a general-purpose documentation system that can extract inline documentation of multiple languages, a parser for each language is needed. A parser generator (which actually doesn't have to be that complete or efficient) is thus needed.
http://antlr.org/ is a nice parser generator that already has a number of grammars for popular languages. Are there better alternatives i.e. simpler ones that support generating parsers for even more languages out-of-the-box?
If you're only looking for "partial parsing", then you could use ANTLR's option to partially "lex" a token stream and ignore the rest of the input. You can do that by setting filter=true in a lexer grammar. The lexer then tries to match any token you defined in your grammar; when it can't match one, it advances a single character (ignoring it) and tries again to match one of your tokens at the next character:
lexer grammar Foo;

options {filter=true;}

StringLiteral
  : ...
  ;

CharLiteral
  : ...
  ;

SingleLineComment
  : ...
  ;

MultiLineComment
  : ...
  ;
When implemented properly, you can get the MultiLineComments (/* ... */) from a Java file quite easily without being afraid of single line comments and String- or char literals messing things up.
Obviously, your source files need to be valid to be able to properly tokenize a file, otherwise you get strange results!
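For readers who want to see the idea without ANTLR, below is a hand-written sketch in Rust of the same filtering strategy: try to match one of four token kinds at the current position, and otherwise drop one character and try again. It is not what ANTLR generates; the function name block_comments is hypothetical, and the literal and comment rules are simplified versions of Java's.

```rust
/// Returns the text of every `/* ... */` comment in `src`. String literals,
/// char literals and `//` comments are matched and skipped so that their
/// contents cannot be mistaken for the start of a comment.
fn block_comments(src: &str) -> Vec<String> {
    let chars: Vec<char> = src.chars().collect();
    let mut found = Vec::new();
    let mut i = 0;
    while i < chars.len() {
        let c = chars[i];
        let next = chars.get(i + 1).copied();
        if c == '"' || c == '\'' {
            // String or char literal: skip to the matching unescaped quote.
            i += 1;
            while i < chars.len() && chars[i] != c {
                if chars[i] == '\\' {
                    i += 1; // skip the escaped character
                }
                i += 1;
            }
            i += 1; // step over the closing quote
        } else if c == '/' && next == Some('/') {
            // Single-line comment: skip to the end of the line.
            while i < chars.len() && chars[i] != '\n' {
                i += 1;
            }
        } else if c == '/' && next == Some('*') {
            // Multi-line comment: keep everything up to the closing `*/`.
            let start = i;
            i += 2;
            while i + 1 < chars.len() && !(chars[i] == '*' && chars[i + 1] == '/') {
                i += 1;
            }
            i = (i + 2).min(chars.len());
            let text: String = chars[start..i].iter().collect();
            found.push(text);
        } else {
            // No token starts here: ignore one character and try again.
            i += 1;
        }
    }
    found
}

fn main() {
    let src = "String s = \"/* not a comment */\"; // nor /* this */\n/* real */ int x;";
    for comment in block_comments(src) {
        println!("{}", comment);
    }
}
```

As with the ANTLR version, an unterminated literal or comment in the input will swallow whatever follows it, which is why the source has to be valid.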
My compiler uses Dypgen. This is a user-extensible GLR parser generator with lots of enrichments, so it can parse many languages. The bootstrap grammar is EBNF-like (it supports *, + and ? directly in your productions). It is powerful enough to dynamically load extensions, a fact my compiler leverages: the bulk of my programming language has its syntax dynamically loaded at compiler startup.
Dypgen is written in OCaml and generates OCaml code.
There is a C++ GLR parser called Elkhound which is powerful enough to parse most of C++.
However, for your actual requirements, you do not really need to do any serious parsing: a regular expression matching engine is probably good enough. Google's RE2 may be suitable (it provides most PCRE functionality, is a lot faster, and has a C++ interface).
Although this is less accurate, it is good enough because you can demand that inline documentation adhere to some simple formats. Most existing inline docs already do so for just this reason.
Where I work we used to use GOLD Parser. This is a lot simpler than ANTLR and supports multiple languages. We have since moved to ANTLR, however, as we needed to do more complex parsing, which we found ANTLR handled better than GOLD.
By concept/function/implementation, what are the differences between compilers and parsers?
A compiler is often made up of several components, one of which is a parser.
A common set of components in a compiler is:
Lexer - break the program up into words.
Parser - check that the syntax of the sentences is correct.
Semantic Analysis - check that the sentences make sense.
Optimizer - edit the sentences for brevity.
Code generator - output something with equivalent semantic meaning using another vocabulary.
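A toy version of these five components, sketched in Rust for a made-up language of whitespace-separated sums such as "2 + 3 + x". Everything here is an assumption chosen for illustration: the language, the function names, and the output, which is text for an imaginary stack machine.

```rust
#[derive(Debug, PartialEq)]
enum Token { Num(i64), Ident(String), Plus }

#[derive(Debug, PartialEq)]
enum Expr { Num(i64), Var(String), Add(Box<Expr>, Box<Expr>) }

// Lexer: break the program up into words (whitespace-separated in this sketch).
fn lex(src: &str) -> Result<Vec<Token>, String> {
    src.split_whitespace()
        .map(|w| match w {
            "+" => Ok(Token::Plus),
            _ if w.chars().all(|c| c.is_ascii_digit()) => {
                w.parse::<i64>().map(Token::Num).map_err(|e| e.to_string())
            }
            _ if w.chars().all(|c| c.is_ascii_alphabetic()) => Ok(Token::Ident(w.to_string())),
            _ => Err(format!("unknown word: {}", w)),
        })
        .collect()
}

// Parser: check that the words form `operand (+ operand)*` and build a tree.
fn parse(t: &[Token]) -> Result<Expr, String> {
    let mut left = operand(t, 0)?;
    let mut pos = 1;
    while pos < t.len() {
        if t[pos] != Token::Plus {
            return Err(format!("expected +, found {:?}", t[pos]));
        }
        left = Expr::Add(Box::new(left), Box::new(operand(t, pos + 1)?));
        pos += 2;
    }
    Ok(left)
}

fn operand(t: &[Token], pos: usize) -> Result<Expr, String> {
    match t.get(pos) {
        Some(Token::Num(n)) => Ok(Expr::Num(*n)),
        Some(Token::Ident(name)) => Ok(Expr::Var(name.clone())),
        other => Err(format!("expected an operand, found {:?}", other)),
    }
}

// Semantic analysis: check that every variable used has been declared.
fn check(e: &Expr, declared: &[&str]) -> Result<(), String> {
    match e {
        Expr::Num(_) => Ok(()),
        Expr::Var(v) if declared.contains(&v.as_str()) => Ok(()),
        Expr::Var(v) => Err(format!("undeclared variable: {}", v)),
        Expr::Add(l, r) => check(l, declared).and_then(|_| check(r, declared)),
    }
}

// Optimizer: replace an addition of two constants by its result.
fn fold(e: Expr) -> Expr {
    match e {
        Expr::Add(l, r) => match (fold(*l), fold(*r)) {
            (Expr::Num(a), Expr::Num(b)) => Expr::Num(a + b),
            (l, r) => Expr::Add(Box::new(l), Box::new(r)),
        },
        leaf => leaf,
    }
}

// Code generator: say the same thing in the stack machine's vocabulary.
fn generate(e: &Expr, out: &mut Vec<String>) {
    match e {
        Expr::Num(n) => out.push(format!("push {}", n)),
        Expr::Var(v) => out.push(format!("load {}", v)),
        Expr::Add(l, r) => {
            generate(l, out);
            generate(r, out);
            out.push("add".to_string());
        }
    }
}

fn compile(src: &str, declared: &[&str]) -> Result<Vec<String>, String> {
    let tree = parse(&lex(src)?)?;
    check(&tree, declared)?;
    let mut code = Vec::new();
    generate(&fold(tree), &mut code);
    Ok(code)
}

fn main() {
    println!("{:?}", compile("2 + 3 + x", &["x"])); // `2 + 3` is folded first
    println!("{:?}", compile("y + 1", &["x"])); // rejected by the semantic check
}
```

Only parse deals with syntax; the other four functions are what turn a parser into a (very small) compiler.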
To add a little bit:
As mentioned elsewhere, Small-C is a recursive descent compiler that generated code as it parsed: basically syntactic analysis, semantic analysis, and code generation in one pass. As I recall, it also lexed in the parser.
A long time ago, I wrote a C compiler (actually several: the Introl-C family for microcontrollers) that used recursive descent and did syntax and semantic checking during the parse and produced a tree representation of the program from which code was generated.
Today, I'm working on a compiler that does source -> tokens -> AST -> IR -> code, pretty much as I described above.
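To illustrate the one-pass style mentioned for Small-C, here is a sketch in Rust of a recursive descent translator that reads characters directly and emits code while it parses, with no token list and no tree in between. It is not derived from Small-C itself; the tiny language (single digits combined with + and *) and the stack-machine output are assumptions made for the example.

```rust
use std::iter::Peekable;
use std::str::Chars;

/// One-pass translator for inputs such as "1+2*3".
struct OnePass<'a> {
    input: Peekable<Chars<'a>>,
    code: Vec<String>,
}

impl<'a> OnePass<'a> {
    // sum := product ('+' product)*
    fn sum(&mut self) -> Result<(), String> {
        self.product()?;
        while self.input.next_if_eq(&'+').is_some() {
            self.product()?;
            // Both operands have already been emitted, so emit the operator.
            self.code.push("add".to_string());
        }
        Ok(())
    }

    // product := digit ('*' digit)*
    fn product(&mut self) -> Result<(), String> {
        self.digit()?;
        while self.input.next_if_eq(&'*').is_some() {
            self.digit()?;
            self.code.push("mul".to_string());
        }
        Ok(())
    }

    // Lexing happens right here in the parser: one character is one token.
    fn digit(&mut self) -> Result<(), String> {
        match self.input.next() {
            Some(c) if c.is_ascii_digit() => {
                self.code.push(format!("push {}", c));
                Ok(())
            }
            other => Err(format!("expected a digit, found {:?}", other)),
        }
    }
}

fn translate(src: &str) -> Result<Vec<String>, String> {
    let mut p = OnePass { input: src.chars().peekable(), code: Vec::new() };
    p.sum()?;
    match p.input.next() {
        None => Ok(p.code),
        Some(c) => Err(format!("unexpected character {:?}", c)),
    }
}

fn main() {
    println!("{:?}", translate("1+2*3"));
}
```

The trade-off is the one the pipeline design avoids: with no tree to inspect, there is nowhere to run a separate analysis or optimization pass before the code is already emitted.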
A parser just reads a text into an internal, more abstract representation, often a tree or graph of some sort.
A compiler translates such an internal representation into another format. Most often this means converting source code into executable programs. But the target doesn't have to be machine code. It can be another programming language as well; the compiler would still be a compiler. Obviously a compiler needs a parser to actually read its input.
A compiler always has a parser inside. The parser just processes the language and returns a tree representation of it; the compiler generates something from that tree, either actual machine code or another language.
A parser is one element of a compiler.
Are you looking for the differences between an interpreter and a compiler?
A parser takes in raw data and parses it into a tree structure. This syntax tree is then passed on to a generator, which turns it into whatever it is supposed to generate.
So, a parser is a part of a compiler.
In general, a parser is a part of a compiler, while the compiler as a whole is designed to convert the received source into machine-readable code or sometimes into another language.
A compiler is a special type of computer program that translates a human readable text file into a form that the computer can more easily understand. At its most basic level, a computer can only understand two things, a 1 and a 0. At this level, a human will operate very slowly and find the information contained in the long string of 1s and 0s incomprehensible. A compiler is a computer program that bridges this gap.
A parser is a piece of software that evaluates the syntax of a script when it is executed on a web server. For scripting languages used on the web, the parser works like a compiler might work in other types of application development environments. Parsers are commonly used in script development because they can evaluate code when the script is executed and do not require that the code be compiled first.