llvm based code mutation for genetic programming?

llvm based code mutation for genetic programming? - clang

for a study on genetic programming, I would like to implement an evolutionary system on basis of llvm and apply code-mutations (possibly on IR level).
I found llvm-mutate which is quite useful executing point mutations.
As far as I have understood, the instructions get count/numbered, one can then e.g. delete a numbered instruction.
However, introduction of new instructions seems to be possible as one of the availeable statements in the code.
Real mutation however would allow to insert any of the allowed IR instructions, irrespective of it beeing used in the code to be mutated.
In addition, it should be possible to insert library function calls of linked libraries (not used in the current code, but possibly available, because the lib has been linked in clang).
Did I overlook this in the llvm-mutate or is it really not possible so far?
Are there any projects trying to /already have implement(ed) such mutations for llvm?
llvm has lots of code analysis tools which should allow the implementation of the afore mentioned approach. llvm is huge, so I'm a bit disoriented. Any hints which tools could be helpful (e.g. getting a list of available library functions etc.)?
Thanks
Alex

Very interesting question. I have been intrigued by the possibility of doing binary-level genetic programming for a while. With respect to what you ask:
It is apparent from their documentation that LLVM-mutate can't do what you are asking. However, I think it is wise for it not to. My reasoning is that any machine-language genetic program would inevitably face the "Halting Problem", e.g. it would be impossible to know if a randomly generated instruction would completely crash the whole computer (for example, by assigning a value to a OS-reserved pointer), or it might run forever and take all of your CPU cycles. Turing's theorem tells us that it is impossible to know in advance if a given program would do that. Mind you, LLVM-mutate can cause for a perfectly harmless program to still crash or run forever, but I think their approach makes it less likely by only taking existing instructions.
However, such a thing as "impossibility" only deters scientists, not engineers :-)...
What I have been thinking is this: In nature, real mutations work a lot more like LLVM-mutate that like what we do in normal Genetic Programming. In other words, they simply swap letters out of a very limited set (A,T,C,G) and every possible variation comes out of this. We could have a program or set of programs with an initial set of instructions, plus a set of "possible functions" either linked or defined in the program. Most of these functions would not be actually used, but they will be there to provide "raw DNA" for mutations, just like in our DNA. This set of functions would have the complete (or semi-complete) set of possible functions for a problem space. Then, we simply use basic operations like the ones in LLVM-mutate.
Some possible problems though:
Given the amount of possible variability, the only way to have
acceptable execution times would be to have massive amounts of
computing power. Possibly achievable in the Cloud or with GPUs.
You would still have to contend with Mr. Turing's Halting Problem.
However I think this could be resolved by running the solutions in a
"Sandbox" that doesn't take you down if the solution blows up:
Something like a single-use virtual machine or a Docker-like
container, with a time limitation (to get out of infinite loops). A
solution that crashes or times out would get the worst possible
fitness, so that the programs would tend to diverge away from those
paths.
As to why do this at all, I can see a number of interesting applications: Self-healing programs, programs that self-optimize for an specific environment, program "vaccination" against vulnerabilities, mutating viruses, quality assurance, etc.
I think there's a potential open source project here. It would be insane, dangerous and a time-sucking vortex: Just my kind of project. Count me in if someone doing it.

Related

Erlang Hot Code Loading not widely used?

I just saw this 2012 video from LinuxConf.au about Erlang in production.
There's a part on the video where Bernard says no big Erlang projects use Hot Code Loading apart from Ericsson, because it's really hard to guarantee things will work. It's around minute 29.
Is that still true? Are there tools to help test a hot code load or or make it easier nowadays?

This is not true. Every Erlang user uses hot code loading to his advantage in one way or another -- whether it is for development, testing, troubleshooting, one-off fixes, or full scale deployments. This is one of the major Erlang advantages. Rather unique too.
For example, WhatsApp, one of the biggest Erlang users, relies on hot code loading for almost all code pushes.
I have personally worked with hot code loading in scenarios where each change was well understood and often performed by the same person who made the change. It works extremely well and good engineers don't have any problems doing this. Speaking of tools, loading modules one by one from Erlang shell using l(...). or all at once using l(). (see here) works just fine. Some prefer release-based tools like relx.
Others, like Ericsson, use enterprise-style deployments with hot code loading after rigorous testing of clear-cut releases and patches. The goal here is to upgrade without using spare capacity and special procedures for draining and shifting load. Operationally this may be simpler and more efficient than restarts, but testing can be more expensive.

It is difficult to know whether it is a feature widely or scarcely used. Nowadays there are plenty of Erlang systems out there. I can however think of reasons why and why not to use it, since I have been working with bot options for quite some time.
In favour of using it:
It is obviously quite useful during development to ensure a fast feedback cycle. I always develop with an open shell, and with functions to load code automatically as son as compile.
In the rare case you need to implement a monolithic application with high availability requirements, it is basically the only option
The main reason not to use it, as the presentation states: it is hard. Even if you manage to understand exactly how it works (which is not the hardest part).
It is not, in my opinion just a problem of tooling, but rather that you are getting a lot of intrinsic complications just by the fact that now your code is part of the mutable running state of the system. You basically end up having a long running system that changes behaviour, so you add those to the problems you already had:
You are no longer sure that restarting the system will not change behaviour on any fundamental way. So you will probably need to put extra care on making sure that whatever code you load, it is also written to disk.
Changing the way your modules work (i.e. loading new code) is very tricky unless a) you never break compatibility, b) you somehow figure out the order in which the modules should change or c) you assume the worst that can happen is a few crashes due to undefined functions, function or case clauses, etc, and hope for the best (the actual worst is when the new and old modules interact in unexpected ways while you haven't finished loading all of the new ones and the actually run some impossible logic).
You will almost certainly will end up killing some process running old code when loading new code at some point. Maybe your supervisors will help you, maybe not. In any case that could be very confusing and difficult to debug.
As the presentation also states, is very hard to test (if not impossible).
Etc.
Adding to all those, you are running a long living server with long living state, which is far from ideal.
So my advise is always that, if you could get away with a distributed application and rolling upgrades, you should do it. That option is much easier to handle, and in my experience, performs better overall.

For distributed applications, which to use, ASIO vs. MPI?

I am a bit confused about this. If you're building a distributed application, which in some cases may perform parallel operations (although not necessarily mathematical), should you use ASIO or something like MPI? I take it MPI is a higher level than ASIO, but it's not clear where in the stack one would begin.

I know nothing about ASIO but from a quick Google it looks to me to be a lot lower level than MPI. For me the whole point of MPI is so that I can program against a higher level of abstraction from the messaging than, it seems, ASIO provides. Where you begin depends on your needs. For mine, parallelising scientific codes for high-performance, the obvious answer is MPI. I'm not sure I'd use it, or at least not sure it would be my default choice, if I were writing more general-purpose distributed, as opposed to parallel, applications. Well, actually, it probably would be my default choice to avoid learning another approach (most of which are less portable and less long-lived than MPI) but I'll admit it might not be the best choice if starting from an equal footing.

As far as I know MPI is currently incapable of handling the situation, when the new distributed nodes want to join the already started group. The problems also may occur if one of the nodes goes offline.
MPI does not reveal any network related machinery that is underneath. Thus if you would ever need something on the lower level -- you're in trouble. If you on the other hand do not aticipate such a need, then you'll save yourself a lot of time using MPI.

What's the quickest way to parallelize code?

I have an image processing routine that I believe could be made very parallel very quickly. Each pixel needs to have roughly 2k operations done on it in a way that doesn't depend on the operations done on neighbors, so splitting the work up into different units is fairly straightforward.
My question is, what's the best way to approach this change such that I get the quickest speedup bang-for-the-buck?
Ideally, the library/approach I'm looking for should meet these criteria:
Still be around in 5 years. Something like CUDA or ATI's variant may get replaced with a less hardware-specific solution in the not-too-distant future, so I'd like something a bit more robust to time. If my impression of CUDA is wrong, I welcome the correction.
Be fast to implement. I've already written this code and it works in a serial mode, albeit very slowly. Ideally, I'd just take my code and recompile it to be parallel, but I think that that might be a fantasy. If I just rewrite it using a different paradigm (ie, as shaders or something), then that would be fine too.
Not require too much knowledge of the hardware. I'd like to be able to not have to specify the number of threads or operational units, but rather to have something automatically figure all of that out for me based on the machine being used.
Be runnable on cheap hardware. That may mean a $150 graphics card, or whatever.
Be runnable on Windows. Something like GCD might be the right call, but the customer base I'm targeting won't switch to Mac or Linux any time soon. Note that this does make the response to the question a bit different than to this other question.
What libraries/approaches/languages should I be looking at? I've looked at things like OpenMP, CUDA, GCD, and so forth, but I'm wondering if there are other things I'm missing.
I'm leaning right now to something like shaders and opengl 2.0, but that may not be the right call, since I'm not sure how many memory accesses I can get that way-- those 2k operations require accessing all the neighboring pixels in a lot of ways.

Easiest way is probably to divide your picture into the number of parts that you can process in parallel (4, 8, 16, depending on cores). Then just run a different process for each part.
In terms of doing this specifically, take a look at OpenCL. It will hopefully be around for longer since it's not vendor specific and both NVidia and ATI want to support it.
In general, since you don't need to share too much data, the process if really pretty straightforward.

I would also recommend Threading Building Blocks. We use this with the Intel® Integrated Performance Primitives for the image analysis at the company I work for.
Threading Building Blocks(TBB) is similar to both OpenMP and Cilk. And it uses OpenMP to do the multithreading, it is just wrapped in a simpler interface. With it you don't have to worry about how many threads to make, you just define tasks. It will split the tasks, if it can, to keep everything busy and it does the load balancing for you.
Intel Integrated Performance Primitives(Ipp) has optimized libraries for vision. Most of which are multithreaded. For the functions we need that aren't in the IPP we thread them using TBB.
Using these, we obtain the best result when we use the IPP method for creating the images. What it does is it pads each row so that any given cache line is entirely contained in one row. Then we don't divvy up a row in the image across threads. That way we don't have false sharing from two threads trying to write to the same cache line.

Have you seen Intel's (Open Source) Threading Building Blocks?

I haven't used it, but take a look at Cilk. One of the big wigs on their team is Charles E. Leiserson; he is the "L" in CLRS, the most widely/respected used Algorithms book on the planet.
I think it caters well to your requirements.
From my brief readings, all you have to do is "tag" your existing code and then run it thru their compiler which will automatically/seamlessly parallelize the code. This is their big selling point, so you dont need to start from scratch with parallelism in mind, unlike other options (like OpenMP).

If you already have a working serial code in one of C, C++ or Fortran, you should give serious consideration to OpenMP. One of its big advantages over a lot of other parallelisation libraries / languages / systems / whatever, is that you can parallelise a loop at a time which means that you can get useful speed-up without having to re-write or, worse, re-design, your program.
In terms of your requirements:
OpenMP is much used in high-performance computing, there's a lot of 'weight' behind it and an active development community -- www.openmp.org.
Fast enough to implement if you're lucky enough to have chosen C, C++ or Fortran.
OpenMP implements a shared-memory approach to parallel computing, so a big plus in the 'don't need to understand hardware' argument. You can leave the program to figure out how many processors it has at run time, then distribute the computation across whatever is available, another plus.
Runs on the hardware you already have, no need for expensive, or cheap, additional graphics cards.
Yep, there are implementations for Windows systems.
Of course, if you were unwise enough to have not chosen C, C++ or Fortran in the beginning a lot of this advice will only apply after you have re-written it into one of those languages !
Regards
Mark

Is automated source translation seen as beneficial and/or necessary?

I have recently spent several years translating legacy FORTRAN into Java. Prior to that, I found myself translating FORTRAN into C (for which I wrote a simple translation tool). After all this work, I find myself wondering how many others are doing similar language-to-language translations and whether an automated way of doing so would be beneficial.
I know about F2C, For_C, F2J and others, as well as some of the translation sites, but none seem to be all that successful. Having seen output from For_C, I can see why it just hasn't taken off. While it is technically correct, it is very difficult to maintain.
So, I guess what I am wondering is if there were are tool that produced more maintainable, more grok-able code than the code I have seen, would developers use it? Or are developers as jaded as many posts seem to indicate and unwilling to use generated code as it could never be as good as their manually translated code?

In short, no. Obviously time restraints necessitate it sometimes, but...
Rarely is code written in one language going to translate well to another - every language has certain ways of doing things that are more suited to the constructs available / common libraries / etc.
Consider for example a program written in C as compared to something written in Python - certainly you can write for loops and iterate through things in Python just as easily as you can in C, but it is much simpler to use list comprehensions and take advantage of the features the language provides.
I'd be surprised to see an example of a reasonably sized program written in any language that could be translated into 'correct', well-maintainable code in any other.

This was already covered to some extent in Conversion of Fortran 77 code to C++, but I'll take a stab at it here.
I think there's a lot of time wasted translating legacy code to new languages. It takes a phenomenal amount of time and energy to do, and you introduce new bugs when you do it.
Joel mentioned why rewriting from scratch is a horrible idea in Things you Should Never do Part I, and though I realize that translating something to a new language isn't quite the same as rewriting from scratch, I claim it's close enough:
Automated translation tools aren't wonderful because you don't get anything maintainable out of them. You pretty much have to know the old code to understand the new code, and then what have you gained?
To port something manually, you have to know how the code works to do it well. Rewriting code is seldom done by the original developers, so you seldom get people who understand everything that's going on to do the rewrite. I worked at a company where an outsource team was hired to translate an entire website backend from ColdFusion to JSP. That project kept getting delayed and delayed because the port team didn't know the code at all. Our guys never quite liked their design, and they never quite got it right, so there was constant iteration as everyone worked out all the issues that were solved in the original code. Then, the porting itself took forever.
You also need to be familiar with really technical inconsistencies between languages. People who are very familiar with two languages are rare.
For Fortran specifically, I now work at a place where there are millions of lines of legacy Fortran code, and no one here is about to rewrite it. There's just too much risk. Old bugs would have to be re-fixed, and there are hundreds of man-years that went into working out the math. Nobody wants to introduce those kinds of bugs, and it's probably downright unsafe to do it.
Instead of porting, we have hybrid codes. After all, you can link Fortran and C/C++, and if you make a C interface around your Fortran code, you can call it from Java. Modern codes here have C/C++ components that make calls into old Fortran routines, and if you do it this way you get the added benefit that Fortran compilers are screaming fast, so the old code continues to run as fast as it ever did.
I think the best way to handle this is to do any porting you need to do incrementally. Make a lightweight interface around your old fortran code and call the pieces you need, but only port things as you need them in the new part. There are also component frameworks for integrating multi-language applications that can make this easier, but you can check out Conversion of Fortran 77 code to C++ for more on that.

Since programming is hard, no such tool can really exist.
If it was trivial to change one language into another, the idea of "compiler" would be moot. You'd just map the language you liked into the language of the hardware, press the button and be done.
However, it's never that simple. Each VM, each language, each API library adds nuances that are just impossible to automate.
" I can see why it just hasn't taken off. While it is technically correct, it is very difficult to maintain."
Correct for F2C as well as Fortran to machine language. The object code generated from most compilers can't easily be read by people. Either it's cruddy or it's highly optimized. Either way, it doesn't look a thing like an expert human would write in the assembler language for that hardware.
If only compiling could be reduced to some XSLT-like transformations that preserved the clarity of the old language in the new language. If there was only some universal Lingua Franca of computing that would be the Rosetta Stone of programming.
Until someone invents that Lingua Franca of computing, every language translation job will be hard and will lead to code that's "difficult to maintain" in the new language.

I've used f2c, and I agree with whoever wanted to name it cc2fc instead. It isn't a way of transforming Fortran into anything vaguely usable as C. It's a way of taking a C compiler and making a Fortran compiler out of it.
It did work just fine at taking that Fortran code and turning it (through C) to a Macintosh library I could call from Macintosh Common Lisp. Those were the days.

How to approach learning a new SDK/API/library?

Let's say that you have to implement some functionality that is not trivial (it will take at least 1 work week). You have a SDK/API/library that contains (numerous) code samples demonstrating the usage of the part of the SDK for implementing that functionality.
How do you approach learning all the samples, extract the necessary information, techniques, etc. in order to use them to implement the 'real thing'. The key questions are:
Do you use some tool for diagramming of the control flow, the interactions between the functions from the SDK, and the sample itself? Which kind of diagrams do you find useful? (I was thinking that the UML sequence diagram can be quite useful together with the debugger in this case).
How do you keep the relevant and often interrelated information about SDK/API function calls, the general structure and calls order in the sample programs that have to be used as a reference - mind maps, some plain text notes, added comments in the samples code, some refactoring of the sample code to suit your personal coding style in order to make the learning easier?

Personally I use the prototyping approach. Keep development to manageable iterations. In the beginning, those iterations are really small. As part of this, don't be afraid to throw code away and start again (everytime I say that somewhere a project manager has a heart attack).
If your particular task can't easily or reasonably be divided into really small starting tasks then start with some substitute until you get going.
You want to keep it as simple as you can (the proverbial "Hello world") just to familiarize yourself with building, deploying, debugging, what error messages look like, the simple things that can and do go wrong in the beginning, etc.
I don't go as far as using a diagramming tool sorry (I barely see the point in that for my job).
As soon as you start trying things you'll get the hang of it, even if in the beginning you have no idea of what's going on and why what you're doing works (or doesn't).

I usually compile and modify the examples, making them fit something that I need to do myself. I tend to do this while using and annotating the corresponding documents. Being a bit old school, the tool I usually use for diagramming is a pencil, or for the really complex stuff two or more colored pens.

I am by no means a seasoned programmer. In fact, I am learning C++ and I've been studying the language primarily from books. When I try to stray from the books (which happens a lot because I want to start contributing to programs like LibreOffice), for example, I find myself being lost. Furthermore, when I'm using functionality of the library, my implementations are wrong because I don't really understand how the library was created and/or why things need to be done that way. When I look at sample source code, I see how something is done, but I don't understand why it's done that way which leads to poor design of my programs. And as a result, I'm constantly guessing at how to do something and dealing with errors as I encounter them. Very unproductive and frustrating.
Going back to my book comment, two books which I have ready from cover to cover more than once are Ivor Horton's Beginning Visual C++ 2010 and Starting Out with C++: Early Objects (7th Edition). What I really loved about Ivor Horton's book is that it contained thorough explanation of why something needs to be done a certain way. For example, before any Windows programming began, lots of explanation about how Windows works was given first. Understanding how and why things work a certain way really helps in how I develop software.
So to contribute my two pennies towards answering your question. I think the best approach is to pick up well written books and sit down and begin learning about that library, API, SDK, whatever in a structured approach that offers real-world examples along with explanations as to how and why things are implemented as they are.
I don't know if I totally missed your question, but I don't think I did.
Cheers!
This was my first post on this site. Don't rip me too hard. (:

Develop Reference

ios ruby-on-rails asp.net-mvc docker delphi jenkins grails google-sheets machine-learning dart