Given LLVM IR, can we generate a Clang AST?

This question is purely from a research point of view; right now I am not looking at any practical aspect of it.
Just as we have decompilers that take binary code and generate LLVM IR, such as
https://github.com/repzret/dagger or https://github.com/avast/retdec
and many others,
do we have a code generator that can convert LLVM IR into a Clang AST?
Thank you in advance.
I found one abandoned project:
https://www.phoronix.com/scan.php?page=news_item&px=MTE2OTg
I am looking for more.

Going from the AST to the LLVM IR is a one way street.
Consider the typical Clang/LLVM compilation pipeline: source code is parsed into an AST, the AST is lowered to LLVM IR, and the IR is compiled down to machine code.
A source code file in a high-level programming language (C, C++, or Rust, say) is first converted into an AST; for C and C++ that is the Clang AST. This is a data structure that encodes knowledge of the source-code constructs of the programming language itself. An AST is specific to a programming language: it is a description of the parsed source file, in the same way that the DOM tree is a description of an HTML document. This means the AST contains information specific to that programming language. If the language is Rust, the Rust AST might, for example, contain functional coding constructs.
LLVM IR, however, is sometimes described as a portable assembly language, because its constructs map closely to what system hardware can execute.
A frontend module converts a high-level programming language into LLVM IR. It does this by generating a language-specific AST and then recursively traversing that AST, emitting LLVM IR constructs for each node. A backend module then converts the LLVM IR into architecture-specific assembly code.
There are multiple frontend modules, one for each high level language that you want to convert into LLVM IR. Once this conversion is complete, the generated LLVM IR has no way of knowing what programming language it came from. You could take C++ code and the same code written in Rust, and after generating the LLVM IR you won't be able to tell them apart.
Once the LLVM IR has been generated, any language-specific information is gone. That includes the information needed to reconstruct an AST, because an AST requires knowledge of coding constructs specific to that programming language.
Going from a high level (more abstract) source code representation into a medium level, such as LLVM IR, and even into a lower level, such as assembly code is relatively easy.
Going the other way, from a very low level machine specific code, to a more abstract source code of a high level programming language is much harder. This is because in high level programming languages you can solve the same problem many different ways, while the representation of code in assembly language is more limited, so you have no way of knowing which specific high level coding construct the low level code originally came from.
This is why, in principle, you cannot go from LLVM IR back to an AST. If someone did attempt such a thing, the result would not be an exact representation of the original high-level source, and it would not be very readable.
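The same kind of information loss can be seen one level up, in CPython: two different source-level spellings can compile to identical bytecode, at which point there is no way to recover which one the programmer wrote. A minimal illustration using only the standard library (the analogy to LLVM IR is mine, not the answer's):

```python
# Two different source constructs: an explicit computation vs. a literal.
code_a = compile("a = 2 * 3", "<src>", "exec")
code_b = compile("a = 6", "<src>", "exec")

# CPython's compiler constant-folds 2 * 3, so the two compiled artifacts
# are byte-for-byte identical -- the original "2 * 3" is unrecoverable.
print(code_a.co_code == code_b.co_code)
print(code_a.co_consts, code_b.co_consts)
```

Given only the compiled code object, a "decompiler" can emit *some* source with the same behaviour, but not necessarily the source that was written; exactly the situation described above for LLVM IR.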

Related

What makes libadalang special?

I have been reading about libadalang and I am very impressed by it. However, I was wondering whether this technique has already been used, and whether another language supports a library for syntactically and semantically analyzing its code. Is this a unique approach?
C and C++: libclang "The C Interface to Clang provides a relatively small API that exposes facilities for parsing source code into an abstract syntax tree (AST), loading already-parsed ASTs, traversing the AST, associating physical source locations with elements within the AST, and other facilities that support Clang-based development tools." (See libtooling for a C++ API)
Python: See the ast module in the Python Language Services section of the Python Library manual. (The other modules can be useful, as well.)
JavaScript: The ongoing ESTree effort is attempting to standardize parsing services across different JavaScript engines.
C# and Visual Basic: See the .NET Compiler Platform ("Roslyn").
I'm sure there are lots more; those ones just came off the top of my head.
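As a quick taste of the Python entry above, the stdlib ast module parses source into a tree you can dump and walk programmatically (the snippet and its variable names are a minimal sketch of my own):

```python
import ast

# Parse a one-line program into its AST.
tree = ast.parse("total = price * quantity")

# Pretty-print the tree structure (Module -> Assign -> BinOp -> Name ...).
print(ast.dump(tree, indent=2))

# Walk the tree and collect every identifier that appears in it.
names = sorted(n.id for n in ast.walk(tree) if isinstance(n, ast.Name))
print(names)  # ['price', 'quantity', 'total']
```

The same module underpins many Python linters and refactoring tools, which is precisely the kind of analysis library the question asks about.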
For a practical and theoretical grounding, you should definitely (re)visit the classic textbook Structure and Interpretation of Computer Programs by Abelson & Sussman (1st edition 1985, 2nd edition 1996), which helped popularise the idea of metacircular interpretation -- that is, treating a computer program as a formal data structure which can be interpreted (or otherwise analysed) programmatically.
You can see "libadalang" as ASIS Mark II. AdaCore seems to be attempting to rethink ASIS in a way that will support both what ASIS already can do, and more lightweight operations, where you don't require the source to compile, to provide an analysis of it.
Hopefully the final API will be nicer than that of ASIS.
So no, it is not a unique approach. It has already been done for Ada. (But I'm not aware of similar libraries for other languages.)

What are common properties in an Abstract Syntax Tree (AST)?

I'm new to compiler design and have been watching a series of youtube videos by Ravindrababu Ravula.
I am creating my own language for fun and parsing it into an Abstract Syntax Tree (AST). My understanding is that these trees can be portable, given that they follow the same structure as those of other languages.
How can I create an AST that will be portable?
Side notes:
My parser is currently written in JavaScript, but I might move it to C#.
I've been looking at SpiderMonkey's specs for guidance. Is that a good approach?
Portability (however defined) is not likely to be your primary goal in building an AST. Few (if any) compiler frameworks provide a clear interface which allows the use of an external AST, and particular AST structures tend to be badly-documented and subject to change without notice. (Even if they are well-documented, the complexity of a typical AST implementation is challenging.)
An AST is very tied to the syntactic details of a language, as well as to the particular parsing strategy being used. While it is useful to be able to repurpose ASTs for multiple tasks -- compiling, linting, pretty-printing, interactive editing, static analysis, etc. -- the conflicting demands of these different use cases tends to increase complexity. Particularly at the beginning stages of language development, you'll want to give yourself a lot of scope for rapid prototyping.
The most tempting reason for portable ASTs would be to use some other language as a target, thereby saving the cost of writing code-generation, etc. However, in practice it is usually easier to generate the textual representation of the other language from your own AST than to force your parser to use a foreign AST. Even better is to target a well-documented virtual machine (LLVM, .Net IL, JVM, etc.), which is often not much more work than generating, say, C code.
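To make the "generate the textual representation of the other language from your own AST" suggestion concrete, here is a minimal sketch in Python; the node classes and the emit_c function are invented for illustration, not taken from any real compiler:

```python
from dataclasses import dataclass

# A hypothetical toy AST for a tiny expression language.
@dataclass
class Num:
    value: int

@dataclass
class BinOp:
    op: str
    left: object
    right: object

def emit_c(node) -> str:
    """Recursively emit C-like source text from our own AST nodes."""
    if isinstance(node, Num):
        return str(node.value)
    if isinstance(node, BinOp):
        # Fully parenthesize so we never have to reason about precedence.
        return f"({emit_c(node.left)} {node.op} {emit_c(node.right)})"
    raise TypeError(f"unknown node: {node!r}")

expr = BinOp("+", Num(1), BinOp("*", Num(2), Num(3)))
print(emit_c(expr))  # (1 + (2 * 3))
```

The point is that your parser keeps its own AST, shaped for your language, and the target language only ever sees flat text; no foreign AST has to be adopted.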
You might want to take a look at the LLVM Kaleidoscope tutorial (the second section covers ASTs, although implemented in C++). Also, you might find this question on a sister site interesting reading. And finally, if you are going to do your implementation in Javascript, you should at least take a look at the jison parser generator, which takes a lot of the grunt-work out of maintaining a parser and scanner (and thus allows for easier experimentation.)

abstract syntax tree for imperative languages

I am looking for an abstract syntax tree representation that can be used for common imperative languages (Java, C, python, ruby, etc). I would like this to be as close to source as possible (as opposed to something like LLVM). I found Rose online but it is only able to handle C and Fortran. Does this exist?
You won't find "one" universal AST that can represent many languages. People have been searching for 50 years.
The essential reason is that an AST node implicitly represents the precise language semantics of the operator it encodes, and different languages have different semantics for what are apparently the same operators.
For example, the "+" operator in modern Fortran will add integers, reals, complex values, and slices of arrays of such things. Java "+" will add integers, reals, and glue strings together. If I wrote "a+b" in "universal AST", how would you know which semantic effect the corresponding AST encoded?
What you can do is build a system in which the ASTs for different languages are represented uniformly, so that you can share tool infrastructure across many languages. This is done by many Program Transformation Systems (PTS), where you provide the grammar (or pick one from an available library), and the PTS parses and builds an AST using its uniform representation. Most PTS provide additional support to analyze and transform the code.
So, is all you need a PTS and some sweat to define a grammar? Not really: getting a grammar right for a real language is actually pretty hard. Worse, there's a lot to Life After Parsing, because you need the meaning of symbols and additional inferences such as control and data flow analysis. So you need full front ends (e.g., parsing, name/type resolution, flow analysis, ...), or as much of them as you can get, if you don't want to be distracted for months before beginning your real work.
What this means in practice is you want to find a tool that handles the languages of interest to you, with mature front ends already available:
Rose (which you already found) handles C, C++ and Fortran. It has no built-in parsing capability of its own; its front ends are custom built, so it is apparently hard to extend to other languages. But it has good flow-analysis capabilities and provides means to transform the code via hand-written AST walks/smashes.
Clang handles C and C++. Clang also uses hand-built front ends. It can also transform code, again by hand-written AST walks/smashes, with a small amount of pattern matching support. As I understand it, you have to use the LLVM part of Clang to do flow analysis.
Our DMS Software Reengineering Toolkit has full front ends for C, C++, Java and COBOL, and full parsers for many more languages such as Python. DMS provides pattern-based analysis and source-to-source transformation. It operates directly from a grammar (see one for Oberon, Niklaus Wirth's latest language). (I don't know of any tool that handles Ruby, which is famously hard to parse; I understand its grammar is ambiguous, and DMS is good at handling ambiguous grammars.)

How to use Clang to obtain ASTs on the current versions?

I have been trying to obtain ASTs from Clang, but I have not been successful so far. I found a one-year-old question here on Stack Overflow that mentions two ways to obtain the AST using Clang:
./llvmc -cc1 -ast-dump file.c
./llvmc -cc1 -ast-print file.c
In that question doxygen is mentioned, along with a representation in which an AST is given, but I am mostly looking for one in some textual form, such as XML, so that further analysis can be performed.
Lastly, there was another question here on Stack Overflow about exactly this XML import, but it was discontinued for several reasons, also mentioned there.
My question thus is: which version should I use, and how can I use it from the console to obtain AST-related information for a given piece of C code? I believe this should be a painless one-line command like those above, but the documentation index does not mention the AST as far as I have read, and the only llvmc reference I found was about writing an AST by hand, which is not really what I am looking for.
I tried all of the commands above, but they all fail on version 2.9, and I have already found that LLVM changes a great deal between versions.
Thank you.
OP says "open to other suggestions as well".
He might consider our DMS Software Reengineering Toolkit with its C Front End.
While it would be pretty easy to exhibit a C AST produced by this, it is easier to show an Objective-C AST [already at SO] produced by DMS using the same C front end (Objective-C is a dialect of C).
See https://stackoverflow.com/a/10749970/120163. DMS can produce an XML equivalent of this, too.
We don't recommend exporting trees as text, because real trees for real code are simply enormous and, in our experience, are poorly manipulated as text or XML objects. One usually needs machinery beyond parsing. See my discussion about Life After Parsing.

ANTLR vs. Happy vs. other parser generators

I want to write a translator between two languages, and after some reading on the Internet I've decided to go with ANTLR. I had to learn it from scratch, but besides some trouble with eliminating left recursion everything went fine until now.
However, today some guy told me to check out Happy, a Haskell based parser generator. I have no Haskell knowledge, so I could use some advice, if Happy is indeed better than ANTLR and if it's worth learning it.
Specifically what concerns me is that my translator needs to support macro substitution, which I have no idea yet how to do in ANTLR. Maybe in Happy this is easier to do?
Or if think other parser generators are even better, I'd be glad to hear about them.
People keep believing that if they just get a parser, they've got it made when building language tools. That's just wrong: parsers get you to the foothills of the Himalayas, and then you need to start climbing seriously.
If you want industrial-strength support for building language translators, see our DMS Software Reengineering Toolkit. DMS provides:
Unicode-based lexers
full context-free parsers (left recursion? No problem! Arbitrary lookahead? No problem. Ambiguous grammars? No problem)
full front ends for C, C#, COBOL, Java, C++, JavaScript, ...
(including full preprocessors for C and C++)
automatic construction of ASTs
support for building symbol tables with arbitrary scoping rules
attribute grammar evaluation, to build analyzers that leverage the tree structure
support for control and data flow analysis (as well as realizations of this for full C, Java and COBOL),
source-to-source transformations using the syntax of the source AND the target language
AST to source code prettyprinting, to reproduce target language text
Regarding the OP's request to handle macros: our C, COBOL and C++ front ends handle their respective language preprocessing by a) the traditional method of full expansion or b) non-expansion (where practical) to enable post-parsing transformation of the macros themselves. While DMS as a foundation doesn't specifically implement macro processing, it can support the construction and transformation of same.
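As a toy illustration of approach (a), full textual expansion of object-like macros can be sketched in a few lines of Python. The macro table and function name below are invented for illustration; a real preprocessor must also handle function-like macros, argument substitution, rescanning, and conditionals:

```python
import re

# Hypothetical macro table, as a preprocessor would build it from #define lines.
MACROS = {"MAX_LEN": "256", "VERSION": '"1.0"'}

def expand(source: str) -> str:
    """Naively substitute every object-like macro, whole-word matches only."""
    for name, body in MACROS.items():
        # \b guards keep MAX_LEN from matching inside MAX_LENGTH.
        source = re.sub(rf"\b{re.escape(name)}\b", body, source)
    return source

print(expand("char buf[MAX_LEN]; /* MAX_LENGTH untouched */"))
```

Full expansion like this is simple but loses the macros themselves; the non-expansion option (b) exists precisely so that the macros survive into the AST and can be transformed as first-class entities.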
As an example of a translator built with DMS, see the discussion of converting JOVIAL to C for the B-2 bomber. This is a 100% translation of more than 1 MSLOC of hard real-time code. [It may amuse you to know that we were never allowed to see the actual program being translated (top secret).] And yes, JOVIAL has a preprocessor, and yes, we translated most JOVIAL macros into equivalent C versions.
[Haskell is a cool programming language, but it doesn't do anything like this by itself. This isn't about what's expressible in the language. It's about figuring out what machinery is required to support the task of manipulating programs, and spending 100 man-years building it.]
