Understanding how do the coverage modules work - code-coverage

I came across with a new dynamic language. I would like to create a coverage tool for that language. I started reading the source code of Perl 5 and Python coverage modules but it got complicated. It's a dynamic scripting language so I guess that source code of static languages (like Java & C++) won't help me here. Also, as I understand, each language was built in a different way and the same ideas won't work. But, the big concepts could be similar.
My question is as follows: how do I "attack" this task? What is the proper workflow I need to follow? What I need to investigate? Are there any books or blogs I can read about those kind of stuff?

There are two kinds of coverage collection mechanisms:
1) Real-time sampling of the program counter, typically by a clock running at 1-10ms. Difficulties: a) mapping an actual PC value back to a source line, b) sampling means you might not see execution of a rarely used bit of code, so your coverage reporting is inaccurate. Because of these issues, this approach isn't used very often.
2) Instrumenting the program so that it collects coverage as it runs. This is hard to do with object code... a) you have to decode the instructions to see where to put probes, and this can be very hard to do right, b) you have patch the source code to include the probes (this can be really awkward; a "probe" might consist of a 5 byte subroutine call but the probe has replace a single-byte instruction). c) you still have to figure out how to map a probe location back to a source code line. A more effective way is to instrument the source code, which requires pretty sophisticated machinery to read source, make probe patches, and regenerate the instrumented code for execution/compilation.
My technical paper Branch Coverage for Arbitrary Languages Made Easy provides explicit detail for how to do this in a general way. My company has built commercial test coverage tools for a wide variety of languages (C, Python, PHP, COBOL, Java, C++, C#, ProC,....) using this approach. This covers most static and dynamic languages. Some dynamic mechanisms are extremely difficult to instrument, e.g., eval() but that is true of every approach.

In addition to Ira's answer, there is a third coverage collection mechanism: the language implementation provides a callback that can inform you about program events. For example, Python has sys.settrace: you provide it a function, and Python calls your function for every function called or returned, and every line executed.

Related

What makes libadalang special?

I have been reading about libadalang 1 2 and I am very impressed by it. However, I was wondering if this technique has already been used and another language supports a library for syntactically and semantically analyzing its code. Is this a unique approach?
C and C++: libclang "The C Interface to Clang provides a relatively small API that exposes facilities for parsing source code into an abstract syntax tree (AST), loading already-parsed ASTs, traversing the AST, associating physical source locations with elements within the AST, and other facilities that support Clang-based development tools." (See libtooling for a C++ API)
Python: See the ast module in the Python Language Services section of the Python Library manual. (The other modules can be useful, as well.)
Javascript: The ongoing ESTree effort is attempting to standardize parsing services over different Javascript engines.
C# and Visual Basic: See the .NET Compiler Platform ("Roslyn").
I'm sure there are lots more; those ones just came off the top of my head.
For a practical and theoretical grounding, you should definitely (re)visit the classical textbook Structure and Interpretation of Computer Programs by Abelson & Sussman (1st edition 1985, 2nd edition 1996), which helped popularise the idea of Metacircular Interpretation -- that is, interpreting a computer program as a formal datastructure which can be interpreted (or otherwise analysed) programmatically.
You can see "libadalang" as ASIS Mark II. AdaCore seems to be attempting to rethink ASIS in a way that will support both what ASIS already can do, and more lightweight operations, where you don't require the source to compile, to provide an analysis of it.
Hopefully the final API will be nicer than that of ASIS.
So no, it is not a unique approach. It has already been done for Ada. (But I'm not aware of similar libraries for other languages.)

Programmatic access to fslex and fsyacc

The fslex and fsyacc tools currently require 2-stage compilation, generating files that are then compiled by fsc. It seems to me that these tools would be much easier to use if the source files were embedded resources, fed to fslex and fsyacc programmatically and the generated code compiled on-the-fly using the CodeDom.
Is this feasible and, if so, what would be required to implement this?
Jon, this is a great question; in fact, one of the design goals I have for fsharp-tools (new lexer- and parser-generator implementations for F#) is for them to be embeddable, specifically to enable scenarios like this.
As of now, I haven't implemented (yet) the functionality which would let you do this easily in fsharplex, but don't let that deter you; I've written fsharplex (and the other tools in fsharp-tools) in a more-or-less purely-functional style, so there shouldn't be any issues with global state or anything like that. It should be relatively straightforward to hack up the compiler code so you can build a regex AST using some combinators, run the compiler to get a compiled DFA, then emit IL for your state machine into a dynamic assembly (which you could then "bake" and execute).
fsharpyacc currently uses an approach where I've put the bulk of the compilation logic into a purely-functional library, Graham; the idea there is that the grammar analysis/manipulation and parser DFA compilation algorithms should be generic, reusable, and easy to test, so anyone else wanting to build language tools with F# will have a common framework on which to build them. Likewise, contributions/improvements to Graham can easily flow back to fsharpyacc. Eventually, I will modify fsharplex to use this same approach, which will allow you to embed the regex compiler in your own code simply by referencing the NuGet package (you'd just need to write the code to generate IL from the DFA).
fsharplex and fsharpyacc use MEF to allow various backends to be plugged in; for now, they're only targetting fslex and fsyacc for compatibility reasons, but I'd like to implement code-based backends (as opposed to the current table-based backends) to get better performance in the future.
Update -- I just re-read your question and noticed you want to embed the *.fsl and *.fsy files themselves and invoke the respective compilers at run-time. You could accomplish this by compiling the tools and referencing the assemblies from your own projects. IIRC, I exposed an entry point in both compilers so they could be called from outside code; the main entry points (e.g., what gets executed when you invoke the tools from a console) simply parse the command-line arguments then pass them into this "external" entry point.
There is one problem with directly embedding the *.fsl and *.fsy files though; if you embed them, then run them through fsharplex and fsharpyacc at run-time, your user-defined actions (e.g., the code executed when a lexer or parser rule is matched) will still be specified as F# source code -- you'd need to decide how you want to compile them into executable code.
It should be feasible to provide a parser combinator-like interface with a backend that uses expression trees (the LISP "eval" of F#) or something similar, for full integration with the language. Or else a TypeProvider. There are many options. If table generation is an expensive computation, it could be cached by providing a Cache, for example a disk cache.
I think nothing except lack of time, dedication and expertise, prevents us from having tools with (non-monadic) parser combinator-like interface, yet efficient compiled implementation.
Sometimes I get back to this pet project of mine, playing with an algebraic approach to optimizing regular expressions (and lexers) specified in source using combinators and then compiled to a state machine. It still lacks a few key pieces for efficiency, but there it is:
https://github.com/toyvo/ocaml-regex-algebraic

Learning incremental compilation design

There are a lot of books and articles about creating compilers which do all the compilation job at a time. And what about design of incremental compilers/parsers, which are used by IDEs? I'm familiar with first class of compilers, but I have never work with the second one.
I tried to read some articles about Eclipse Java Development Tools, but they describe how to use complete infrastructure(i.e. APIs) instead of describing internal design(i.e. how it works internally).
My goal is to implement incremental compiler for my own programming language. Which books or articles would you recommend me?
This book is worth a look: Builing a Flexible Incremental Compiler Back-End.
Quote from Ch. 10 "Conclusions":
This paper has explored the design of
the back-end of an incremental
compilation system. Rather than
building a single fixed incremental
compiler, this paper has presented a
flexible framework for constructing such
systems in accordance with user needs.
I think this is what you are looking for...
Edit:
So you plan to create something that is known as a "cross compiler"?!
I started a new attempt. Until now, I can't provide the ultimate reference. If you plan such a big project, I'm sure you are an experienced programmer. Therefore it is possible, that you already know these link(s).
Compilers.net
List of certain compilers, even cross compilers (Translators). Unfortunately with some broken links, but 'Toba' is still working and has a link to its source code. May be that this can inspire you.
clang: a C language family frontend for LLVM
Ok, it's for LVVM but source is available in a SVN repository and it seems to be a front end for a compiler (translator). May be that this can inspire you as well.
I'm going to disagree with conventional wisdom on this one because most conventional wisdom makes unwritten assumptions about your goals, such as complete language designs and the need for extreme efficiency. From your question, I am assuming these goals:
learn about writing your own language
play around with your language until it looks elegant
try to emit code into another language or byte code for actual execution.
You want to build a hacking harness and a recursive descent parser.
Here is what you might want to build for a harness, using just a text based processor.
Change the code fragment (now "AT 0700 SET HALLWAY LIGHTS ON FULL")
Compile the fragment
Change the code file (now "tests.l")
Compile from file
Toggle Lexer output (now ON)
Toggle Emitter output (now ON)
Toggle Run on home hardware (now OFF)
Your command, sire?
You will probably want to write your code in Python or some other scripting language. You are optimizing your speed of play, not execution. A recursive descent parser might look like:
def cmd_at():
if next_token.type == cTIME:
num = next_num()
emit("events.setAlarm(events.DAILY, converttime(" + time[0:1] + ", "
+ time[2:] + ", func_" + num + ");")
match_token(cTIME)
match_token(LOCATION)
...
So you need to write:
A little menu for hacking.
Some lexing routines, to return different tokens for numbers, reserved words, and the like.
A bunch of logic for what your language
This approach is aimed at speeding up the cycle for hacking together the language. When you have finished this approach, then you reach for BISON, test harnesses, etc.
Making your own language can be a wonderful journey! Expect to learn. Do not expect to get rich.
I see that there is an accepted answer, but I think that some additional material could be usefully included on this page.
I read the Wikipedia article on this topic and it linked to a DDJ article from 1997:
http://www.drdobbs.com/cpp/codestore-and-incremental-c/184410345?pgno=1
The meat of the article is the first page. It explains that the code in the editor is divided into pieces that are "incorporated" into a "CodeStore" (database). The pieces are incorporated via a work queue which contains unincorporated pieces. A piece of code may be parsed and returned to the work queue multiple times, with some failure on each attempt, until it goes through successfully. The database includes dependencies between the pieces so that when the source code is edited the effects on the edited piece and other pieces can be seen and these pieces can be reprocessed.
I believe other systems approach the problem differently. Java presents different problems than C/C++ but has advantages as well, so Eclipse perhaps has a different design.

Is it possible that F# will be optimized more than other .Net languages in the future?

Is it possible that Microsoft will be able to make F# programs, either at VM execution time, or more likely at compile time, detect that a program was built with a functional language and automatically parallelize it better?
Right now I believe there is no such effort to try and execute a program that was built as single threaded program as a multi threaded program automatically.
That is to say, the developer would code a single threaded program. And the compiler would spit out a compiled program that is multi-threaded complete with mutexes and synchronization where needed.
Would these optimizations be visible in task manager in the process thread count, or would it be lower level than that?
I think this is unlikely in the near future. And if it does happen, I think it would be more likely at the IL level (assembly rewriting) rather than language level (e.g. something specific to F#/compiler). It's an interesting question, and I expect that some fine minds have been looking at this and will continue to look at this for a while, but in the near-term, I think the focus will be on making it easier for humans to direct the threading/parallelization of programs, rather than just having it all happen as if by magic.
(Language features like F# async workflows, and libraries like the task-parallel library and others, are good examples of near-term progress here; they can do most of the heavy lifting for you, especially when your program is more declarative than imperative, but they still require the programmer to opt-in, do analysis for correctness/meaningfulness, and probably make slight alterations to the structure of the code to make it all work.)
Anyway, that's all speculation; who can say what the future will bring? I look forward to finding out (and hopefully making some of it happen). :)
Being that F# is derived from Ocaml and Ocaml compilers can optimize your programs far better than other compilers, it probably could be done.
I don't believe it is possible to autovectorize code in a generally-useful way and the functional programming facet of F# is essentially irrelevant in this context.
The hardest problem is not detecting when you can perform subcomputations in parallel, it is determining when that will not degrade performance, i.e. when the subtasks will take sufficiently long to compute that it is worth taking the performance hit of a parallel spawn.
We have researched this in detail in the context of scientific computing and we have adopted a hybrid approach in our F# for Numerics library. Our parallel algorithms, built upon Microsoft's Task Parallel Library, require an additional parameter that is a function giving the estimated computational complexity of a subtask. This allows our implementation to avoid excessive subdivision and ensure optimal performance. Moreover, this solution is ideal for the F# programming language because the function parameter describing the complexity is typically an anonymous first-class function.
Cheers,
Jon Harrop.
I think the question misses the point of the .NET architecture-- F#, C# and VB (etc.) all get compiled to IL, which then gets compiled to machine code via the JIT compiler. The fact that a program was written in a functional language isn't relevant-- if there are optimizations (like tail recursion, etc.) available to the JIT compiler from the IL, the compiler should take advantage of it.
Naturally, this doesn't mean that writing functional code is irrelevant-- obviously, there are ways to write IL which will parallelize better-- but many of these techniques could be used in any .NET language.
So, there's no need to flag the IL as coming from F# in order to examine it for potential parallelism, nor would such a thing be desirable.
There's active research for autoparallelization and auto vectorization for a variety of languages. And one could hope (since I really like F#) that they would concive a way to determine if a "pure" side-effect free subset was used and then parallelize that.
Also since Simon Peyton-Jones the father of Haskell is working at Microsoft I have a hard time not beliving there's some fantastic stuff comming.
It's possible but unlikely. Microsoft spends most of it's time supporting and implementing features requested by their biggest clients. That usually means C#, VB.Net, and C++ (not necessarily in that order). F# doesn't seem like it's high on the list of priorities.
Microsoft is currently developing 2 avenues for parallelisation of code: PLINQ (Pararllel Linq, which owes much to functional languages) and the Task Parallel Library (TPL) which was originally part of Robotics Studio. A beta of PLINQ is available here.
I would put my money on PLINQ becoming the norm for auto-parallelisation of .NET code.

Game Engine Scripting Languages

I am trying to build out a useful 3d game engine out of the Ogre3d rendering engine for mocking up some of the ideas i have come up with and have come to a bit of a crossroads. There are a number of scripting languages that are available and i was wondering if there were one or two that were vetted and had a proper following.
LUA and Squirrel seem to be the more vetted, but im open to any and all.
Optimally it would be best if there were a compiled form for the language for distribution and ease of loading.
One interesting option is stackless-python. This was used in the Eve-Online game.
The syntax is a matter of taste, Lua is like Javascript but with curly braces replaced with Pascal-like keywords. It has the nice syntactic feature that semicolons are never required but whitespace is still not significant, so you can even remove all line breaks and have it still work. As someone who started with C I'd say Python is the one with esoteric syntax compared to all the other languages.
LuaJIT is also around 10 times as fast as Python and the Lua interpreter is much much smaller (150kb or around 15k lines of C which you can actually read through and understand). You can let the user script your game without having to embed a massive language. On the other hand if you rip the parser part out of Lua it becomes even smaller.
The Python/C API manual is longer than the whole Lua manual (including the Lua/C API).
Another reason for Lua is the built-in support for coroutines (co-operative multitasking within the one OS thread). It allows one to have like 1000's of seemingly individual scripts running very fast alongside each other. Like one script per monster/weapon or so.
( Why do people write Lua in upper case so much on SO? It's "Lua" (see here). )
One more vote for Lua. Small, fast, easy to integrate, what's important for modern consoles - you can easily control its memory operations.
I'd go with Lua since writing bindings is extremely easy, the license is very friendly (MIT) and existing libraries also tend to be under said license. Scheme is also nice and easy to bind which is why it was chosen for the Gimp image editor for example. But Lua is simply great. World of Warcraft uses it, as a very high profile example. LuaJIT gives you native-compiled performance. It's less than an order of magnitude from pure C.
I wouldn't recommend LUA, it has a peculiar syntax so takes some time to get used to. Depending on who will be doing the scripting, this may not be a problem, but I would try to use something fairly accessible.
I would probably choose python. It normally compiles to bytecode, so you would need to embed the interpreter. However, if you must you can use PyPy to for example translate the code to C, and then compile it.
Embedding the interpreter is no issue. I am more interested in features and performance at this point in time. LUA and Squirrel are both interpreted, which is nice because one of the games i am building out is to include modifiable code, which has an editor in game.
I would love to hear about python, as i have seen its use within the battlefield series i believe.
python is also nice because it has actual OGRE bindings, just in case you need to modify something lower-level on the fly. I don't know of any equivalent bindings for Lua.
Since it's a C++ library, I would suggest either JavaScript or Squirrel, the latter being my personal favorite of the two for being even closer to C++, in particular to how it handles tables/structs and classes. It would be the easiest to get used to for a C++ coder because of all the similarities.
However, if you go with JavaScript and find an HTML5 version of Ogre3D, you should be able to port your game code directly into the web version with minimal (if any) changes necessary.
Both of these are a good choice, and both have their pros and cons, but both would definitely be the easiest to learn since you're likely already working in C++. If you're working with Java, the same may hold true, and if it's Game Maker, you wouldn't need either one unless you're trying to make an executable that people wouldn't need Game Maker itself to run, in which case, good luck finding an extension to run either of these.

Resources