How to use Clang to obtain ASTs on current versions?

I have been trying to obtain ASTs from Clang, but I have not been successful so far. I found a one-year-old question here on Stack Overflow that mentions two other ways to obtain the AST using Clang:
./llvmc -cc1 -ast-dump file.c
./llvmc -cc1 -ast-print file.c
In that question Doxygen is mentioned, along with a rendering of an AST, but I am mostly looking for something in a textual form such as XML so that further analysis can be performed.
Lastly, there was another question here on Stack Overflow specifically about XML output, but that was discontinued for several reasons, which are also mentioned there.
My question, then, is: which version should I use, and how can I use it from the console to obtain AST-related information for a given piece of C code? I believe this should be a very painless one-line command like those above, but the documentation index does not mention anything about the AST as far as I have read, and the only thing I found about llvmc was about writing an AST by hand, which is not really what I am looking for.
I tried all of the commands above, but they already fail on version 2.9, and I have found that LLVM changes a great deal between versions.
Thank you.
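On current Clang releases the AST dump is available straight from the compiler driver. A minimal sketch, with file.c standing in for any C source file (flag spellings are taken from recent versions and can differ between releases):
clang -Xclang -ast-dump -fsyntax-only file.c
clang -Xclang -ast-dump=json -fsyntax-only file.c
The second form, available in newer releases, prints the AST as JSON, which is close to the machine-readable textual form the question asks for. clang -cc1 -ast-dump file.c also works, but -cc1 invokes the frontend directly and skips the driver's header-search setup, so the -Xclang spelling is usually more convenient.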

OP says "open to other suggestions as well".
He might consider our DMS Software Reengineering Toolkit with its C Front End.
While it would be pretty easy to exhibit a C AST produced by this, it is easier to show an Objective-C AST (already posted at SO) produced by DMS using the same C front end (Objective-C is a dialect of C).
See https://stackoverflow.com/a/10749970/120163. DMS can produce an XML equivalent of this, too.
We don't recommend exporting trees as text, because real trees for real code are simply enormous and, in our experience, are poorly manipulated as text or XML objects. One usually needs machinery beyond parsing; see my discussion of Life After Parsing.

Related

Given an LLVM IR, can we generate Clang AST?

This question is purely from a research point of view; right now I am not looking at any practical aspect of it.
Just as we have decompilers that can take in binary code and generate LLVM IR, such as
https://github.com/repzret/dagger or https://github.com/avast/retdec
and many others,
do we have a code generator that can convert LLVM IR into a Clang AST?
Thank you in advance.
I found one abandoned project:
https://www.phoronix.com/scan.php?page=news_item&px=MTE2OTg
I am looking for more.
Going from the AST to the LLVM IR is a one-way street.
Consider the typical compilation pipeline: source code → AST → LLVM IR → machine-specific assembly.
A source code file in a high-level programming language (which may be C, C++, or Rust) is first converted into an AST (for C and C++ that is the Clang AST). This is a data structure that has knowledge of the source-code constructs of the programming language itself. An AST is specific to a programming language: it is a description of the parsed source file, in the same way that the JavaScript DOM tree is a description of an HTML document. This means that the AST contains information specific to that programming language; if the language is Rust, the Rust AST might, for example, contain functional coding constructs.
The LLVM IR, however, is sometimes described as a portable, high-level assembly language, because it has constructs that map closely to system hardware.
A frontend module converts a high-level programming language into LLVM IR. It does this by generating a language-specific AST and then recursively traversing that AST, emitting LLVM constructs for each node. At that point we have LLVM IR code, and the backend module then converts the LLVM IR into architecture-specific assembly code.
There are multiple frontend modules, one for each high-level language that you want to convert into LLVM IR. Once this conversion is complete, the generated LLVM IR has no way of knowing what programming language it came from: you could take C++ code and the same code written in Rust, and after generating the LLVM IR you won't be able to tell them apart, as the small sketch below illustrates.
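As a small illustration (a sketch, not the exact compiler output), a C function such as
int add(int a, int b) { return a + b; }
compiled with clang -S -emit-llvm is lowered to LLVM IR roughly like this:
define i32 @add(i32 %a, i32 %b) {
entry:
  %sum = add nsw i32 %a, %b
  ret i32 %sum
}
Nothing in that IR records whether the source was C, C++, or the equivalent Rust function; only the types and the arithmetic survive.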
Once the LLVM IR has been generated, any language-specific information is gone. That includes the information needed to generate an AST, because an AST requires knowledge of coding constructs specific to the programming language.
Going from a high-level (more abstract) source-code representation to a medium-level one such as LLVM IR, and even to a lower-level one such as assembly code, is relatively easy.
Going the other way, from very low-level, machine-specific code to the more abstract source code of a high-level programming language, is much harder. In a high-level programming language you can solve the same problem in many different ways, while the representation of code in assembly language is far more limited, so there is no way of knowing which specific high-level coding construct the low-level code originally came from.
This is why, in principle, you cannot go from LLVM IR back to an AST. If someone did attempt such a thing, the result would not be an exact representation of the original high-level source code, and it would not be very readable.

Finding Java Grammars

I'm making my way into Rascal - nice work!
I'm halfway through the Tutor and might be getting ahead of myself, but one of my interests is refactoring Java. I've watched Tijs van der Storm's Curry On 2016 talk where he presents an example of this (the "TrafoFields" module around minute 16 in the video).
I was eager to get into this kind of work, so I searched the documentation for the Java grammar and a similar example using it, to no avail. What's more, the library documentation lists only m3 and jdt under lang::java. I reloaded Tijs' video and found that he uses lang::java::\syntax::Java15. I blindly tried importing that in the REPL and it worked (albeit with lots of warnings)! I opened the Rascal .jar file and found that there is even more in this package.
So my questions in this context are:
Why is this not in the documentation? I would have expected the library documentation to be exhaustive. Couldn't you at least add "TrafoFields" to the recipes?
Is there an alternative way of finding out about such modules besides the online documentation (and apart from searching the .jar file)?
What is this weird backslash in the module name before "syntax", ::\syntax?
All good questions, which are also implied suggestions for improvement. In the meantime, if you get stuck, please ask more questions.
Short answer: we've prioritized documenting what is used most and what is used in courses at universities and then what we've received questions about. So thanks for asking.
In reverse order:
The backslash is a generic escape for using identifiers in Rascal that are also keywords in the language. So any variable, package, or module could be named "if", but you would have to escape it. This helps a lot when defining abstract syntax trees for programming languages, where, as you can imagine, "if" would be a great name for an AST node type. "syntax" happens to be a keyword in Rascal too.
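So the import that worked in the REPL reads, with the keyword escaped by the backslash:
import lang::java::\syntax::Java15;
Without the escape, the parser would read syntax as the keyword that introduces grammar definitions.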
The Rascal explorer view in Eclipse can be used to browse through the library interactively. You can also crawl the library using util::FileSystem and get a first-class representation of everything that is in a library such as |std:///|. Agreed, these are poor man's solutions; a good feature would be a fully indexed, searchable library. We're halfway there (see the analysis::text:: Lucene pre-work we've done towards supporting this).
That is a rhetorical question, which I accept as a suggestion for improvement 😉😊 The answer is yes.

What makes libadalang special?

I have been reading about libadalang 1 2 and I am very impressed by it. However, I was wondering whether this technique has already been used before, i.e., whether another language has a library for syntactically and semantically analyzing its code. Is this a unique approach?
C and C++: libclang "The C Interface to Clang provides a relatively small API that exposes facilities for parsing source code into an abstract syntax tree (AST), loading already-parsed ASTs, traversing the AST, associating physical source locations with elements within the AST, and other facilities that support Clang-based development tools." (See libtooling for a C++ API)
Python: See the ast module in the Python Language Services section of the Python Library manual; a short sketch of its use follows this list. (The other modules can be useful, as well.)
JavaScript: The ongoing ESTree effort is attempting to standardize parsing services across different JavaScript engines.
C# and Visual Basic: See the .NET Compiler Platform ("Roslyn").
I'm sure there are lots more; those ones just came off the top of my head.
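To make the Python entry above concrete, here is a minimal sketch using the standard-library ast module (the indent argument to ast.dump needs Python 3.9 or later):
import ast

# Parse a small module into an abstract syntax tree.
source = "def add(a, b):\n    return a + b\n"
tree = ast.parse(source)

# Print the tree in a readable, indented textual form.
print(ast.dump(tree, indent=4))

# Walk the tree and report every function definition it contains.
for node in ast.walk(tree):
    if isinstance(node, ast.FunctionDef):
        print("found function:", node.name)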
For a practical and theoretical grounding, you should definitely (re)visit the classic textbook Structure and Interpretation of Computer Programs by Abelson & Sussman (1st edition 1985, 2nd edition 1996), which helped popularise the idea of metacircular interpretation -- that is, treating a computer program as a formal data structure which can be interpreted (or otherwise analysed) programmatically.
You can see libadalang as ASIS Mark II. AdaCore seems to be attempting to rethink ASIS in a way that supports both what ASIS can already do and more lightweight operations, where the source is not required to compile in order to be analysed.
Hopefully the final API will be nicer than that of ASIS.
So no, it is not a unique approach. It has already been done for Ada. (But I'm not aware of similar libraries for other languages.)

Verilog gate level parser

I want to parse Verilog gate-level code and store the data in a data structure (e.g., a graph).
Then I want to do something with the gates in C/C++ and output a corresponding Verilog file.
(I would like to build one program whose input and output are Verilog gate-level code)
(input.v => myProgram => output.v)
Is there any library or open-source code for doing this?
I found that it can be done with Flex and Bison, but I have no idea how to use them.
There was a similar question a few days ago about doing this in Ruby, in which I pointed to my Verilog parser gem. I'm not sure whether it is robust enough for you, though; I would love feedback, bug reports, and feature requests.
There are Perl Verilog parsers out there, but I have not used any of them directly and I avoid Perl; hopefully others can add information about other parsers.
I have used Verilog-Perl successfully to parse Verilog code. It is well-maintained: it even supports the recent SystemVerilog extensions.
Yosys (https://github.com/cliffordwolf/yosys) is a framework for Verilog synthesis written in C++. Yosys is still under construction, but if you only want to read and write gate-level netlists it can do what you need.
PS: A reference manual (that also covers the C++ APIs) is on the way. I've written ~100 pages already, but can't publish it before I've finished my BSc. thesis (another month or so).
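For the input.v => myProgram => output.v flow described in the question above, a minimal Yosys round trip might look like this (command names from current Yosys; the file names are placeholders):
yosys -p "read_verilog input.v; write_verilog output.v"
Whatever you want to do to the gates would go between the read and the write, either as additional Yosys passes or through the C++ API that the upcoming manual covers.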

Why are programmers "strongly advised not to engage in parse transformations"?

According to the erl_id_trans documentation:
Programmers are strongly advised not to engage in parse transformations and no support is offered for problems encountered.
Why are programmers strongly advised not to use parse_transform/2? Will this not be supported in the future? Other than parse_transform/2, is there a mechanism to inject code (runtime bytecode modification) or modify the source code before it gets compiled?
One reason I can imagine is that they do not want to commit to a fixed syntax tree format.
So if you use parse transforms and they break because of a new version of Erlang, you can't complain.
Addendum: in the comments, the question arose of other ways to manipulate Erlang source code or bytecode:
For semi-automatic code refactoring there is Wrangler.
You have access to the Erlang preprocessor, tokenizer, and parser, giving you e.g. syntax trees of your program.
For easy and portable manipulation of abstract forms (what you get out of the parser, or even out of BEAM files) there is syntax_tools.
For manipulating BEAM files there is beam_lib.
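For completeness, a parse transform is just a module exporting parse_transform/2, which receives the list of abstract forms from the parser together with the compiler options and returns the (possibly modified) forms. A minimal identity transform in the spirit of erl_id_trans might look like this (the module name my_id_trans is only an example):
-module(my_id_trans).
-export([parse_transform/2]).

%% Receive the abstract forms produced by the parser plus the compiler
%% options, and return the forms unchanged.
parse_transform(Forms, _Options) ->
    Forms.
A module opts in with -compile({parse_transform, my_id_trans}). Whatever the transform returns is what actually gets compiled, which is exactly why any breakage caused by a changed form format is your own problem.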

Resources