I am starting a class project that involves adding some functionality to Go.
However, I am thoroughly confused about the structure of Go. I was under the impression that Go used flex and bison, but I can't find anything familiar in the Go source code.
On the other hand, the directory go/src/pkg/go has folders with familiar names (ast, token, parser, etc.) but all they contain are .go files. I'm confused!
Can anyone familiar with Go give me an overview of how Go is lexed, parsed, etc., and tell me where to find the files to edit the grammar and whatnot?
The directory structure:
src/cmd/5*   ARM
src/cmd/6*   amd64 (x86-64)
src/cmd/8*   i386 (x86-32)
src/cmd/cc   C compiler (common part)
src/cmd/gc   Go compiler (common part)
src/cmd/ld   Linker (common part)
src/cmd/6c   C compiler (amd64-specific part)
src/cmd/6g   Go compiler (amd64-specific part)
src/cmd/6l   Linker (amd64-specific part)
The lexer is written in pure C (no flex). The grammar is written in Bison:
src/cmd/gc/lex.c
src/cmd/gc/go.y
Many directories under src/cmd contain a doc.go file with a short description of the directory's contents.
If you are planning to modify the grammar, it should be noted that the Bison grammar sometimes does not distinguish between expressions and types.
lex.c
go.y
The Go compilers are written in C, which is why the build needs Bison (the lexer, lex.c, is hand-written C rather than flex output). The Go package for parsing is not used. If you wanted to write a self-hosting compiler in Go, you could use the Go parsing package.
Related
Does anyone know of any tool(s) that can convert ANTLR v4 grammar files (.g4 extension) to tree-sitter grammar files (.js extension)? It would also be fine if I had to chain a couple conversion tools together. For example, going from foo.g4 (antlr4) to foo.ebnf (intermediary format) to foo.js (tree-sitter). Thank you!
I tried using this tool to go from g4 to ebnf, and then this tool to go from ebnf to tree-sitter js, but to no avail. The first tool seemed to create some junk at the bottom of the file which gave the second tool trouble. Additionally, the second tool seems to expect each definition to be completely on one line (and the first tool breaks each definition up into multiple lines for readability).
I want to analyze OCaml files (.ml) using OCaml. I want to break the files into Abstract Syntax Trees for analysis. I have attempted to use camlp4 but have had no luck. Has anyone else successfully done this before? Is this the best way to parse an OCaml file?
(I assume you know basic parts of OCaml already: how to write OCaml code, how to link modules and libraries, how to write build scripts and so on. If you do not, learn them first.)
The best way is to use the genuine OCaml parser used in the OCaml compiler itself, since it is 100% compatible by definition.
CamlP4 also implements an OCaml parser, but it is slightly incompatible with the genuine parser, and its parse tree is somewhat specialized for writing syntax extensions: not very good for any other kind of analysis.
You may want to parse .ml files with syntax extensions using P4. Even in this case, you should stick to the genuine parser: you can desugar the source code with P4 and then send the result to your analyzer, which uses the genuine parser.
To use the OCaml compiler's parser, the easiest approach is to use the compiler-libs.common OCamlFind package. It contains the parser and type checker of the OCaml compiler.
Start by modifying driver/compile.ml in the OCaml compiler source. It implements the major compilation phases: calling the preprocessor, parsing, typing, then code generation. To parse .ml files you should modify (or simplify) Compile.implementation; for .mli files, Compile.interface.
Good luck.
Couldn't you use the -dparsetree option to the ocaml compiler?
hello.ml:
let _ = print_endline "Hello AST"
Now compile it:
$ ocamlc -dparsetree hello.ml
Which results in:
[
  structure_item (hello.ml[1,0+0]..[1,0+33])
    Pstr_eval
    expression (hello.ml[1,0+8]..[1,0+33])
      Pexp_apply
      expression (hello.ml[1,0+8]..[1,0+21])
        Pexp_ident "print_endline" (hello.ml[1,0+8]..[1,0+21])
      [
        <label> ""
          expression (hello.ml[1,0+22]..[1,0+33])
            Pexp_constant Const_string("Hello AST",None)
      ]
]
See also this blog post on -ppx extensions which has some info on extension point syntax extensions (the new way of writing syntax extensions in OCaml 4.02). There is info there on various AST manipulation modules.
Is there an existing POSIX sh grammar available or do I have to figure it out from the specification directly?
Note I'm not so much interested in a pure sh; an extended but conformant sh is also more than fine for my purposes.
The POSIX standard defines the grammar for the POSIX shell. The definition includes an annotated Yacc grammar. As such, it can be converted to EBNF more or less mechanically.
If you want a 'real' grammar, then you have to look harder. Choose your 'real shell' and find the source and work out what the grammar is from that.
Note that EBNF is not used widely. It is of limited practical value, not least because there are essentially no tools that support it. Therefore, you are unlikely to find an EBNF grammar (of almost anything) off-the-shelf.
I have done some more digging and found these resources:
An sh tutorial located here
A Bash book containing Bash 2.0's BNF grammar (gone from here) with the relevant appendix still here
I have looked through the sources of bash, pdksh, and posh but haven't found anything remotely at the level of abstraction I need.
Over the past year I've made several attempts at writing my own full-blown Bash interpreter, and at some point I also came across the book appendix referenced in the marked answer (#2). It is not completely correct or up to date, though: for example, it defines no production rules using the 'coproc' reserved word and has a duplicate production rule for a redirection using '<&'. There may be more problems, but those are the ones I noticed.
The best way I've found is to go to http://ftp.gnu.org/gnu/bash/
Download the sources of the current bash version
Open the parse.y file (the Yacc file that contains essentially all of the parsing logic bash uses) and copy the lines between the '%%' markers into your favorite text editor; those define the grammar's production rules
Then, using a little bit of regex (which I'm terrible at, btw), delete the extra code between '{...}' to make the grammar look more BNF-like.
The regex I used was:
(\{(\s+.*?)+\})\s+([;|])
It non-greedily matches any text (.*?), including the spaces and newlines (\s+), between curly braces, up to the last closing brace before a ; or | character. I then replaced each match with \3 (i.e. the contents of the third capturing group, which is either ; or |).
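For illustration, here is a minimal Python sketch of the same substitution. SAMPLE is a made-up, simplified rule in the style of parse.y, and strip_actions is a hypothetical helper name, not something that ships with bash. The pattern only removes actions that sit directly before a ; or |, so anything else (for example a mid-rule action) is left untouched and would need manual cleanup.

```python
import re

# Pattern quoted from the answer above: an action block "{ ... }" followed
# by whitespace and then ';' or '|' (captured as group 3).
ACTION = re.compile(r"(\{(\s+.*?)+\})\s+([;|])")


def strip_actions(rules: str) -> str:
    """Replace each matched action block with the ';' or '|' that follows it."""
    return ACTION.sub(r"\3", rules)


# Hypothetical, simplified two-alternative rule (an assumption for the demo).
SAMPLE = """\
simple_list: simple_list1
        {
          $$ = $1;
        }
    | simple_list1 '&'
        {
          $$ = command_connect ($1, 0, '&');
        }
    ;
"""

if __name__ == "__main__":
    print(strip_actions(SAMPLE))
```

On a full parse.y it is worth diffing the result against the original, since a backtracking pattern with nested quantifiers like this one can get slow or match more than intended on unusual input.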
Here's the grammar definition that I managed to extract at the time of posting https://pastebin.com/qpsK4TF6
I'd expect that sh, csh, ash, bash, would contain parsers. GNU versions of these are open source; you might just go check there.
By concept/function/implementation, what are the differences between compilers and parsers?
A compiler is often made up of several components, one of which is a parser.
A common set of components in a compiler is:
Lexer - break the program up into words.
Parser - check that the syntax of the sentences is correct.
Semantic Analysis - check that the sentences make sense.
Optimizer - edit the sentences for brevity.
Code generator - output something with equivalent semantic meaning using another vocabulary.
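As a rough illustration of how these stages chain together, here is a toy Python sketch for expressions built from numbers, lowercase variable names, + and *. Everything in it is made up for the example: the function names (lex, parse, check, fold, generate, compile_expr), the tuple-based tree, and the PUSH/LOAD/ADD/MUL instructions of an imaginary stack machine. It does not correspond to any real compiler.

```python
import re


def lex(src):
    """Lexer: break the program text into tokens."""
    tokens = re.findall(r"\d+|[a-z]+|[+*()]", src)
    if "".join(tokens) != "".join(src.split()):
        raise SyntaxError("unexpected character in input")
    return tokens


def parse(tokens):
    """Parser: check the syntax and build a tree of nested tuples."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1
        return tokens[pos - 1]

    def expr():    # expr := term ('+' term)*
        node = term()
        while peek() == "+":
            advance()
            node = ("+", node, term())
        return node

    def term():    # term := factor ('*' factor)*
        node = factor()
        while peek() == "*":
            advance()
            node = ("*", node, factor())
        return node

    def factor():  # factor := NUMBER | NAME | '(' expr ')'
        tok = peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        advance()
        if tok.isdigit():
            return ("num", int(tok))
        if tok.isalpha():
            return ("var", tok)
        if tok == "(":
            node = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        raise SyntaxError(f"unexpected token {tok!r}")

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


def check(tree, known):
    """Semantic analysis: every variable used must be a known name."""
    if tree[0] == "var" and tree[1] not in known:
        raise NameError(f"undefined variable {tree[1]!r}")
    if tree[0] in ("+", "*"):
        check(tree[1], known)
        check(tree[2], known)


def fold(tree):
    """Optimizer: evaluate constant subtrees at compile time."""
    if tree[0] not in ("+", "*"):
        return tree
    left, right = fold(tree[1]), fold(tree[2])
    if left[0] == "num" and right[0] == "num":
        value = left[1] + right[1] if tree[0] == "+" else left[1] * right[1]
        return ("num", value)
    return (tree[0], left, right)


def generate(tree):
    """Code generator: emit instructions for an imaginary stack machine."""
    if tree[0] == "num":
        return [f"PUSH {tree[1]}"]
    if tree[0] == "var":
        return [f"LOAD {tree[1]}"]
    op = "ADD" if tree[0] == "+" else "MUL"
    return generate(tree[1]) + generate(tree[2]) + [op]


def compile_expr(src, known=("x",)):
    """Run all stages in order: lex, parse, check, fold, generate."""
    tree = parse(lex(src))
    check(tree, known)
    return generate(fold(tree))


if __name__ == "__main__":
    print("\n".join(compile_expr("x * (2 + 3)")))
```

Calling compile_expr("x * (2 + 3)") returns ["LOAD x", "PUSH 5", "MUL"]: the optimizer has already folded 2 + 3 into 5 before any code is generated. The parser on its own stops at the tree; it is the later stages that make the whole thing a compiler.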
To add a little bit:
As mentioned elsewhere, Small C is a recursive descent compiler that generated code as it parsed: basically syntactic analysis, semantic analysis, and code generation in one pass. As I recall, it also lexed in the parser.
A long time ago, I wrote a C compiler (actually several: the Introl-C family for microcontrollers) that used recursive descent and did syntax and semantic checking during the parse and produced a tree representation of the program from which code was generated.
Today, I'm working on a compiler that does source -> tokens -> AST -> IR -> code, pretty much as I described above.
A parser just reads a text into an internal, more abstract representation, often a tree or graph of some sort.
A compiler translates such an internal representation into another format. Most often this means converting source code into executable programs. But the target doesn't have to be machine code. It can be another programming language as well; the compiler would still be a compiler. Obviously a compiler needs a parser to actually read its input.
A compiler always has a parser inside. The parser just processes the language and returns a tree representation of it; the compiler generates something from that tree, either actual machine code or another language.
A parser is one element of a compiler.
Are you looking for the differences between an interpreter and a compiler?
A parser takes in raw data and parses it into a tree structure. This syntax tree is then passed on to a generator, which turns it into whatever it is supposed to generate.
So, a parser is a part of a compiler.
In general, a parser is a part of a compiler, whereas the compiler as a whole is designed to convert the received script into machine-readable code or sometimes into another language.
A compiler is a special type of computer program that translates a human readable text file into a form that the computer can more easily understand. At its most basic level, a computer can only understand two things, a 1 and a 0. At this level, a human will operate very slowly and find the information contained in the long string of 1s and 0s incomprehensible. A compiler is a computer program that bridges this gap.
A parser is a piece of software that evaluates the syntax of a script when it is executed on a web server. For scripting languages used on the web, the parser works like a compiler might work in other types of application development environments. Parsers are commonly used in script development because they can evaluate code when the script is executed and do not require that the code be compiled first.
I have stumbled upon the following F77 yacc grammar: http://yaxx.cvs.sourceforge.net/viewvc/yaxx/yaxx/fortran/fortran.y?revision=1.3&view=markup.
How can I make a Fortran 77 parser out of this file using Happy?
Why is there some C?/C++? code in that .y file?
UPDATE: Thank you for your replies!
I've been playing with two fresh approaches for a while now:
extracting and modifying the parser from the source code package bundled with a paper titled Parametric Fortran,
writing a grammar from scratch with the help of BNFC.
I've got both to parse simple code excerpts already. I'll keep people in the know should something usable come into existence within this century ^__^" hehe.
P/S: Want to see whether I could gather enough momentum on my own to initiate a project for an automatic differentiation engine to replace a binary-only one we depend on for the time being. For entertainment at the initial stages: I'm watching Love Shuffle! It's a very enjoyable J-Drama! Highly recommendable ...
The C is the semantic action for reducing the stack when the syntax is read in. These actions are in C because the definition is intended for Bison/Yacc which produces a C source file.
If you want to use Happy, port the BNF to the Happy definition syntax and write your semantics in Haskell.
That is just the tip of the iceberg for getting anything useful, however.
If you don't have a copy already, invest in the Dragon Book (Compilers: Principles, Techniques, and Tools by Aho, Lam, Sethi, and Ullman; Pearson).
While the other answers are true in the general sense, in that you'll need to write your own actions to do anything meaningful, the Yacc definition that you linked to doesn't actually have any actions associated with the grammar rules. What it does is define the yyerror function and some code for extracting values from yylval based on the token type.
If you have no clue what yyerror/yylval are about you should read a bison/flex tutorial. The Dragon book is also a good resource if you're more serious about this. There are also some excellent handouts from a Stanford course on compilers floating around the Net, which are based on the book.
You'll need to build an AST, which can be constructed in a way equivalent to what the C fragments in the Yacc file do.
Use BNFC and write your own grammar from scratch! BNFC works wonders and you could do your parsing exactly as you desire.