AST of whole program - clang

I would like to do transformations on the AST of a C program, but I need access to all the ASTs created for the program in order to make the right changes. Clang processes one translation unit at a time, so I do not have access to the ASTs of all the translation units at the same time. Do you have any suggestions on how I can access all the ASTs created for a program, do analysis on them, and modify them?
As a summary:
I need access to the ASTs of the whole program at the same time.
Do analysis on those ASTs.
Modify the ASTs based on my analysis and generate LLVM IR from the modified ASTs.

You can try using llvm-link on all of your generated .ll files (from clang with -S -emit-llvm) to create one large LLVM module.
You have access to everything at that point.
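For example, assuming two translation units foo.c and bar.c (the file names are only placeholders), something along these lines produces a single module you can then analyze and transform as a whole:
clang -S -emit-llvm foo.c -o foo.ll
clang -S -emit-llvm bar.c -o bar.ll
llvm-link -S foo.ll bar.ll -o whole_program.ll
You can then run opt passes, or your own passes, over whole_program.ll.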

Related

Get clang/llvm parser from yacc parser

I'm trying to build a parser for Promela in llvm. I have the parser SPIN uses, which is built using yacc, including the input that goes to yacc. Is there a way to use the yacc parser to quickly and painlessly generate a clang/llvm parser? I will be using it to generate call graphs and perform static analysis.
What I need to know now is whether I can use the existing Promela compiler, which was built with yacc, to quickly build a parser (and later, IR generator) using the llvm framework.
Yes, you can reuse the existing YACC grammar (and, if you want, even the existing AST) for your project. "Building a parser using the llvm framework" is a bit misleading, though, because LLVM won't have anything to do with parsing or the AST. LLVM won't enter the picture until you generate the LLVM IR and start working with it.
So you either take the existing YACC grammar together with the existing AST, or you take only the grammar and replace the actions with ones that build an AST you've defined yourself. Either way, that part won't involve LLVM.
Then you'd write a separate phase that walks the AST and generates LLVM IR using the LLVM API, on which you can then run all the transformations and analyses supported by LLVM.
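As a rough sketch of that last phase from the build side (codegen.cpp and the output name are just placeholders), llvm-config supplies the compiler and linker flags needed to compile a tool against the LLVM API:
clang++ codegen.cpp $(llvm-config --cxxflags --ldflags --libs core) -o promela-codegen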

Generate Bytecode from AST

This link describes how bytecode can be generated from an AST. Basically, it shows how the parsing phase of compilation can be bypassed and the AST picked up by the Java compiler to produce bytecode.
This works well, but I would like to be able to generate the AST using javac as it is, without changing its source code and without any framework. Is this possible, and has anything like this been done before?
Thanks in advance for your reply.
So it turns out you cannot compile a tree created by the user from arbitrary implementations of com.sun.source.tree.*. What can be done, though, is to print the AST to a string and compile the string in memory using the Java 6 Compiler API.

Is there any efficiency difference between using Dialyzer on Erlang beam and source code?

I collect all beam files of a project under a path like ~/erl_beam
dialyzer ~/erl_beam/*.beam --get_warnings -o static_analysis.log
It works well.
If I do it on Erlang source code:
dialyzer --get_warnings -I <Path1> --src <Path2> -o static_analysis.log
It works, too.
So why do we have two ways to run static analysis on Erlang code?
Does either approach have strengths or weaknesses compared to the other?
The difference is very small.
Dialyzer analysis is performed on Core Erlang. This representation can be extracted either directly from a +debug_info compiled .beam file, or by compiling a .erl file. Compilation takes time, but it is of course not the most time-consuming part of the analysis.
If you have already compiled your .erl with +debug_info it is also more convenient to analyze the resulting .beam file, as you won't have to pass any compilation-related command-line options to Dialyzer.
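For example (the paths are only placeholders), compiling with debug info and then pointing Dialyzer at the resulting BEAM files looks like:
erlc +debug_info -o ebin src/*.erl
dialyzer ebin/*.beam --get_warnings -o static_analysis.log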
Dialyzer starts its analysis from either debug-compiled BEAM bytecode or from Erlang source code. However, several options work only for BEAM files (e.g., --build_plt).
Using BEAM files may be necessary if, for example, you don't have access to source files. If you have access to both BEAM and source files, you'll probably want to use the BEAM files as this will speed up the analysis slightly: Dialyzer will take much less time to parse its input. On the other hand, parsing takes significantly less time than the rest of the analysis, so don't expect to see much of a difference (I'd be surprised if it was more than 10%).
Apart from that, AFAIK, there's no difference in the type of analysis that Dialyzer performs between these two cases.

What is the difference between Dart's snapshots and Java bytecode?

I've been reading up on Dart snapshots, and they're frequently compared to Smalltalk images. But to me, they sound a lot like Java bytecode.
For example:
"A Dart snapshot is just a binary serialization of the token stream, generated from parsing the code. A snapshot is not a "snapshot of a running program", it's generated before the tokens are turned into machine code. So, no program state is captured in a snapshot."
Plus they're cross-platform:
"The snapshot format itself is cross-platform meaning that it works between 32-bit, 64-bit machines and so forth. The format has been made so that it's quick to read into memory with a emphasis on minimizing extra work like pointer fixups."
Am I getting it wrong somewhere?
Sources:
What is the snapshot concept in dart?
http://www.infoq.com/articles/google-dart
Snapshots contain the VM data structures representing the loaded script in a serialized form similar to Smalltalk images. To get a better understanding of what is contained in the snapshot, we should take a look at what the Dart VM creates as it reads the script:
Library objects, referring to all top-level structures such as classes or top-level methods and variables.
Class objects, containing all objects describing all methods and fields.
Script and Tokenstream objects representing all loaded source code.
String objects for all used identifiers and string constants in the source code.
This object graph is serialized into a file when generating a snapshot using a format that is architecture agnostic. This allows the Dart VM to deserialize this snapshot file on 32-bit or 64-bit machines and recreate all of the necessary internal VM data structures much quicker than reading the original scripts from a set of files (see John's answer).
To clarify John's answer a bit: the Dart VM does not parse ALL of the source code when generating the snapshot. It only needs to parse the top level of the sources to extract class, method, and field definitions, as these are represented in the serialized graph. In particular, method bodies are not parsed, and, as is customary for a scripting language, errors are reported only once control reaches the particular method.
The purpose of Java bytecodes is entirely different as Ladicek points out. You could create a snapshot of the VM data structures in a JVM once the bytecodes are loaded to get a similar effect.
In short: The snapshot contains an efficient representation of all the data structures allocated on the Dart VM heap which are needed to start executing the script.
A Dart snapshot is just a roll-up of all source files that have been parsed ahead of time. A Dart snapshot is not similar to a Java bytecode file. A Java bytecode file consists of JVM machine code and is the product of a compile, link, and assembly phase.
A Dart snapshot is a binary file of a Dart program and its import/part source file dependencies that have been parsed into an abstract syntax tree and rolled into a single file. Executing a Dart snapshot allows for faster startup times because:
Only one file must be loaded from disk or over the network. In contrast, a non-snapshot Dart program must be fetched, then any imported files must be fetched, and so on. Before each subsequent source file request can be made, the previously fetched source file must be parsed to find out whether it references more source files. Imagine if your Dart program imported 10 libraries consisting of 10 source files each. That means 110 I/O requests and parses, performed one after another.
The parsing has been done ahead of time. It's already known to be syntactically correct and ready to be compiled by the Dart VM.
I will just point out that as of Dart 2+, there are several distinct kinds of snapshot:
Kernel Snapshot
JIT Snapshot
AOT Snapshot
You can read more here.
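For reference, recent Dart SDKs expose these through the dart compile subcommands (the file name is only a placeholder):
dart compile kernel bin/main.dart
dart compile jit-snapshot bin/main.dart
dart compile aot-snapshot bin/main.dart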

Does gcov give code coverage analysis for assembly language code

I have an application which I build using GCC on a Linux host for an ARM target processor. I execute the generated ARM executable on an ARM development board I have.
I want to do some code coverage analysis:
Will gcov show code coverage if I have ARM assembly source files in my build environment?
If my build environment has some x86 assembly source files, will gcov show code coverage data?
Thank you.
-AD.
AFAIK, gcov works by instrumenting your C or C++ source code. If you have pure assembly language files, I don't think gcov ever sees them. If it does, I'd be surprised if it understood how to safely insert code into arbitrary-target assembly code, though ARM is common enough that there's a faint chance.
The problem with instrumenting assembly code is that the test-coverage probe code itself may require registers, and there isn't a safe way to know, for an arbitrary piece of assembly, a) which registers are available, and b) whether, if an instruction is inserted, some other instruction will break because of the extra space (e.g., a hard-wired relative jump across the inserted instruction).
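For the C parts of the build, the usual flow (file names are only placeholders) is to compile with coverage instrumentation, run the program, and then invoke gcov:
gcc -fprofile-arcs -ftest-coverage -o app main.c
./app
gcov main.c
When cross-compiling for an ARM board, the .gcda files written at run time also have to be copied back to the host build tree before gcov can read them.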
