Compiling C-Type Grammar to Custom Assembly - parsing

The title pretty much sums it up. I have a homemade CPU with my assembly language called scratchy that I'd like to write code for more effectively, but I imagine there MUST be a smart place to start.

As well as LLVM, as suggested in a comment by #SK-logic, you might want to look at the portable C compiler (pcc), which is possibly simpler to write a backend for.
Good luck!

Related

Would it be safe to rely on DeHL for new projects?

I've been browsing the DeHL repository on GoogleCode, and it looks really good to me.
Many interesting features that make basic programming tasks easier; Some neat things that are in the DotNet FCL, but are missing from the Delphi RTL can be found in this library;
Coded in a modern way, making good use of new language features;
Each class, record type, member function and parameter is documented in such a way that it'll show in the code completion of the Delphi IDE;
Well-organized and clean code;
Plenty of unit tests;
Open source and Free;
Basically, it looks like this library should've been included with Delphi, as part of the RTL.
One major drawback: The project has been discontinued. :-(
Now my question is:
Would it be safe to rely on this library for future projects, and use it as a base framework to build upon?
Basically I'd like to hear from somebody who's actually used this library whether or not it's worth it to invest time in getting to know this library, and why.
IIRC the project was discontinued because it was an over-engineered first attempt and a lot of its features turned out really messy and bloated. You should look at Alex Ciobanu's second attempt, which is simply called Collections. It contains most of the interesting features from DeHL, but leaner.
Be careful, though. It still makes heavy use of generics, which will make your binary size really big if you use it a lot, because the compiler team hasn't implemented a way to collapse duplicate code yet.

Custom programming language: how?

Hopefully this question won't be too convoluted or vague. I know what I want in my head, so fingers crossed I can get this across in text.
I'm looking for a language with a syntax of my own specification, so I assume I will need to create one myself. I've spent the last few days reading about compilers, lexers, parsers, assembly language, virtual machines, etc, and I'm struggling to sort everything out in terms of what I need to accomplish my goals (file attached at the bottom with some specifications). Essentially, I'm deathly confused as to what tools specifically I will need to use to go forward.
A little background: the language made would hopefully be used to implement a multiplayer, text-based MUD server. Therefore, it needs easy inbuilt functionality for creating/maintaining client TCP/IP connections, non-blocking IO, database access via SQL or similar. I'm also interested in security insofar as I don't want code that is written for this language to be able to be stolen and used by the general public without specialist software. This probably means that it should compile to object code
So, what are my best options to create a language that fits these specifications
My conclusions are below. This is just my best educated guess, so please contest me if you think I'm heading in the wrong direction. I'm mostly only including this to see how very confused I am when the experts come to make comments.
For code security, I should want a language that compiles and is run in a virtual machine. If I do this, I'll have a hell of a lot of work to do, won't I? Write a virtual machine, assembler language on the lower-level, and then on the higher-level, code libraries to deal with IO, sockets, etc myself, rather than using existing modules?
I'm just plain confused.
I'm not sure if I'm making sense.
If anyone could settle my brain even a little bit, I'd sincerely appreciate it! Alternatively, if I'm way off course and there's a much easier way to do this, please let me know!
Designing a custom domain-specific programming language is the right approach to a problem. Actually, almost all the problems are better approached with DSLs. Terms you'd probably like to google are: domain specific languages and language-oriented programming.
Some would say that designing and implementing a compiler is a complicated task. It is not true at all. Implementing compilers is a trivial thing. There are hordes of high-quality compilers available, and all you need to do is to define a simple transform from your very own language into another, or into a combination of the other languages. You'd need a parser - it is not a big deal nowdays, with Antlr and tons of homebrew PEG-based parser generators around. You'd need something to define semantics of your language - modern functional programming langauges shines in this area, all you need is something with a support for ADTs and pattern matching. You'd need a target platform. There is a lot of possibilities: JVM and .NET, C, C++, LLVM, Common Lisp, Scheme, Python, and whatever else is made of text strings.
There are ready to use frameworks for building your own languages. Literally, any Common Lisp or Scheme implementation can be used as such a framework. LLVM has all the stuff you'd need too. .NET toolbox is ok - there is a lot of code generation options available. There are specialised frameworks like this one for building languages with complex semantics.
Choose any way you like. It is easy. Much easier than you can imagine.
Writing your own language and tool chain to solve what seems to be a standard problem sounds like the wrong way to go. You'll end up developing yet another language, not writing your MUD.
Many game developers take an approach of using scripting languages to describe their own game world, for example see: http://www.gamasutra.com/view/feature/1570/reflections_on_building_three_.php
Also see: https://stackoverflow.com/questions/356160/which-game-scripting-language-is-better-to-use-lua-or-python for using existing languages (Pythong and LUA) in this case for in-game scripting.
Since you don't know a lot about compilers and creating computer languages: Don't. There are about five people in the world who are good at it.
If you still want to try: Creating a good general purpose language takes at least 3 years. Full time. It's a huge undertaking.
So instead, you should try one of the existing languages which solves almost all of your problems already except maybe the "custom" part. But maybe the language does things better than you ever imagined and you don't need the "custom" part at all.
Here are two options:
Python, a beautiful scripting language. The VM will compile the language into byte code for you, no need to waste time with a compiler. The syntax is very flexible but since there is a good reason for everything in Python, it's not too flexible.
Java. With the new Xtext framework, you can create your own languages in a couple of minutes. That doesn't mean you can create a good language in a few minutes. Just a language.
Python comes with a lot of libraries but if you need anything else, the air gets thin, quickly. On a positive side, you can write a lot of good and solid code in a short time. One line of python is usually equal to 10 lines of Java.
Java doesn't come with a lot of frills but there a literally millions of frameworks out there which do everything you can image ... and a lot of things you can't.
That said: Why limit yourself to one language? With Jython, you can run Python source in the Java VM. So you can write the core (web server, SQL, etc) in Java and the flexible UI parts, the adventures and stuff, in Python.
If you really want to create your own little language, a simpler and often quicker solution is to look at tools like lex and yacc and similar systems (ANTLR is a popular alternative), and then you can generate code either to an existing virtual machine or make a simple one yourself.
Making it all yourself is a great learning-experience, and will help you understand what goes on behind the scenes in other virtual machines.
An excellent source for understanding programming language design and implementation concepts is Structure and Interpretation of Computer Programs from MIT Press. It's a great read for anyone wanting to design and implement a language, or anyone looking to generally become a better programmer.
From what I can understand from this, you want to know how to develop your own programming language.
If so, you can accomplish this by different methods. I just finished up my own a few minutes ago and I used HTML and Javascript (And DOM) to develop my very own. I used a lot of x.split and x.indexOf("code here")!=-1 to do so... I don't have much time to give an example, but if you use W3schools and search "indexOf" and "split" I am sure that you will find what you might need.
I would really like to show you what I did and past the code below, but I can't due to possible theft and claim of my work.
I am pretty much just here to say that you can make your own programming language using HTML and Javascript, so that you and other might not get their hopes too low.
I hope this helps with most things....

How to make VS2008 auto insert indent for F#?

Just as my title . I want my VS to auto indent for me like in VBNET . Please help.
As far as I know, the F# language integration doesn't support this feature.
Also, automatic formatting is not as useful in F# as it is in Visual Basic. In VB, the formatting is not really important (so you can write code with crazy indentation and the formatter can fix it). In F#, the indenation (partly) determines what code means, so you need to write correctly indented code (although I agree that the automatic formatting could make the code more consistent).
In principle it should be possible to implement this feature as a Visual Studio plugin using the open source release of F#. There is a similar plugin that adds colors for nested expressions by Brian, so that could be used as an inspiration, but it's definitely not something I could write in the answer box :-).
Sadly, the indentation-sensitive syntax that F# inherited from languages like Haskell makes it impossible to auto-indent. This is actually my only major gripe with the F# language because, in addition to making it impossible to implement professional tools like auto-indenters, it renders programs fragile in the absence of correct indentation which means an accidental change in whitespace can silently corrupt a program and cut-and-paste (e.g. from blogs) is prone to breaking or corrupting programs. F# almost always screws up if you feed it OCaml code, partly because it cannot handle tabs.
The damn crying shame is that OCaml already got this right by providing a concise unambiguous syntax and powerful tools. For example, you can autoindent any definition by pressing ALT+Q in Emacs. This makes it much easier to manipulate OCaml code and can be an enormous time saver. I often find myself trawling through F# code trying to reindent it by hand, having to read the code in detail and understand the algorithm just to indent it is seriously frustrating. Having done this many times, I can also state quite confidently that the verbosity savings of the #light syntax are insignificant. In fact, F# is almost always more verbose than OCaml in practice.
I prefer to pour cold water all over this question. In principle, it's impossible to provide an auto-formatter for a whitespace-significant language.
(Pragmatically, you could add a few small niceties to the editor, e.g. if you type a line of code that starts with if and ends with the matching then and press enter, the editor could get smart and also insert the next indent so you don't also have to press tab. But this is a far cry from auto-formatting, which I think would be wrong-headed to even attempt.)

Learning More About Parsing

I have been programming since 1999 for work and fun. I want to learn new things, and lately I've been focused on parsing, as a large part of my job is reading, integrating and analyzing data. I also have a large number of repetitive tasks that I think I could express in very simple domain-specific languages if the overhead was low enough. I have a few questions about the subject.
Most of my current parsing code don't define a formal grammar. I usually hack something together in my language of choice because that's easy, I know how to do it and I can write that code very fast. It's also easy for other people I work with to maintain. What are the advantages and disadvantages of defining a grammar and generating a real parser (as one would do with ANTLR or YACC) to parse things compared with the hacks that most programmers used to write parsers?
What are the best parser generation tools for writing grammar-based parsers in C++, Perl and Ruby? I've looked at ANTLR and haven't found much about using ANTLRv3 with a C++ target, but otherwise that looks interesting. What are the other tools that are similar to ANTLR that I should be reading about?
What are the canonical books and articles that someone interested in learning more about parsing? A course in compilers unfortunately wasn't part of my education, so basic material is very welcome. I've heard great things about the Dragon Book, but what else is out there?
On 1., I would say the main advantage is maintainability -- making a little change to the language just means making a correspondingly-small change to the grammar, rather than minutely hacking through the various spots in the code that may have something to do with what you want changed... orders of magnitude better productivity and smaller risk of bugs.
On 2. and 3., I can't suggest much beyond what you already found (I mostly use Python and pyparsing, and could comment from experience on many Python-centered parse frameworks, but for C++ I mostly use good old yacc or bison anyway, and my old gnarled copy of the Dragon Book -- not the latest edition, actually -- is all I keep at my side for the purpose...).
Here's my take on your (very good) questions:
I think a parser benefits most from non-trivial situations where a grammar actually exists. You have to know about how parsers and grammars work to think of that technique, and not every developer does.
lex/yacc are older Unix tools that might be usable for you as a C++ developer. Maybe Bison as well.
ANTRL and its attendant book are very good. "Writing Compilers and Interpreters" has C++ examples which you might like.
The GoF Interpreter pattern is another technique for writing "little languages". Take a look at that.
Let's Build A Compiler is a step-by-step tutorial on how to write a simple compiler. The code is written in Delphi (Pascal), but it's basic enough to easily translate into most other languages.
I would have a serious look at monadic combinator-based parsing (which often also deals with lexical analysis) in Haskell. I found it quite an eye opener; it's amazing how easily you can build a parser from scratch using this method. It's so easy, in fact, that it's often faster to write your own parser than it is to try to use existing libraries.
The most famous example is probably Parsec which has a good user guide that explains how to use it. There is a list of ports of this library to other languages (including C++ and Ruby) listed on the Parsec page of the Haskell wiki, though I'm not familiar with them and so I can't say how close they are to using Parsec in Haskell.
If you want to learn how these work internally and how to write your own, I recommend starting with Chapter 8 ("Functional Parsers") from Graham Hutton's Programming in Haskell. Once you understand that chapter well (which will probably take several readings), you'll be set.
In perl, the Parse::RecDescent modules is the first place to start. Add tutorial to the module name and Google should be able to find plenty of tutorials to get you started.
Defining a grammar using BNF, EBNF or something similar, is easier and later on you will have a better time maintaining it. Also, you can find a lot of examples of grammar definitions. Last but not least, if you are going to talk about your grammar to someone else on the field, it is better if you are both speaking the same language (BNF, EBNF etc.).
Writing your own parsing code is like reinventing the wheel and is prone to errors. It is also less maintainable. Of course, it can be more flexible, and for small projects it might also be a good choice, but using an existing parser generator that takes a grammar and spits out the code should cover most of our needs.
For C++ I would also suggest lex/yacc. For Ruby this looks like a decent choice: Coco/R(uby)
Funny timing: I spent lots of this morning wondering about state machines and parsers, and trying to figure out how I could learn more about them.
For 2, you might take a look at Ragel (it's good for C++ and Ruby).
Here's a tutorial on a self-contained (10 pages!), completely portable compiler-compiler
which can be used to design and implement "low overhead" DSLs very quickly:
http://www.bayfronttechnologies.com/mc_tutorial.html
This site walks you through Val Schorre's 1964 paper on MetaII.
Yes, 1964. And it is amazing. This is how I learned about compilers
back in 1970.

Is automated source translation seen as beneficial and/or necessary?

I have recently spent several years translating legacy FORTRAN into Java. Prior to that, I found myself translating FORTRAN into C (for which I wrote a simple translation tool). After all this work, I find myself wondering how many others are doing similar language-to-language translations and whether an automated way of doing so would be beneficial.
I know about F2C, For_C, F2J and others, as well as some of the translation sites, but none seem to be all that successful. Having seen output from For_C, I can see why it just hasn't taken off. While it is technically correct, it is very difficult to maintain.
So, I guess what I am wondering is if there were are tool that produced more maintainable, more grok-able code than the code I have seen, would developers use it? Or are developers as jaded as many posts seem to indicate and unwilling to use generated code as it could never be as good as their manually translated code?
In short, no. Obviously time restraints necessitate it sometimes, but...
Rarely is code written in one language going to translate well to another - every language has certain ways of doing things that are more suited to the constructs available / common libraries / etc.
Consider for example a program written in C as compared to something written in Python - certainly you can write for loops and iterate through things in Python just as easily as you can in C, but it is much simpler to use list comprehensions and take advantage of the features the language provides.
I'd be surprised to see an example of a reasonably sized program written in any language that could be translated into 'correct', well-maintainable code in any other.
This was already covered to some extent in Conversion of Fortran 77 code to C++, but I'll take a stab at it here.
I think there's a lot of time wasted translating legacy code to new languages. It takes a phenomenal amount of time and energy to do, and you introduce new bugs when you do it.
Joel mentioned why rewriting from scratch is a horrible idea in Things you Should Never do Part I, and though I realize that translating something to a new language isn't quite the same as rewriting from scratch, I claim it's close enough:
Automated translation tools aren't wonderful because you don't get anything maintainable out of them. You pretty much have to know the old code to understand the new code, and then what have you gained?
To port something manually, you have to know how the code works to do it well. Rewriting code is seldom done by the original developers, so you seldom get people who understand everything that's going on to do the rewrite. I worked at a company where an outsource team was hired to translate an entire website backend from ColdFusion to JSP. That project kept getting delayed and delayed because the port team didn't know the code at all. Our guys never quite liked their design, and they never quite got it right, so there was constant iteration as everyone worked out all the issues that were solved in the original code. Then, the porting itself took forever.
You also need to be familiar with really technical inconsistencies between languages. People who are very familiar with two languages are rare.
For Fortran specifically, I now work at a place where there are millions of lines of legacy Fortran code, and no one here is about to rewrite it. There's just too much risk. Old bugs would have to be re-fixed, and there are hundreds of man-years that went into working out the math. Nobody wants to introduce those kinds of bugs, and it's probably downright unsafe to do it.
Instead of porting, we have hybrid codes. After all, you can link Fortran and C/C++, and if you make a C interface around your Fortran code, you can call it from Java. Modern codes here have C/C++ components that make calls into old Fortran routines, and if you do it this way you get the added benefit that Fortran compilers are screaming fast, so the old code continues to run as fast as it ever did.
I think the best way to handle this is to do any porting you need to do incrementally. Make a lightweight interface around your old fortran code and call the pieces you need, but only port things as you need them in the new part. There are also component frameworks for integrating multi-language applications that can make this easier, but you can check out Conversion of Fortran 77 code to C++ for more on that.
Since programming is hard, no such tool can really exist.
If it was trivial to change one language into another, the idea of "compiler" would be moot. You'd just map the language you liked into the language of the hardware, press the button and be done.
However, it's never that simple. Each VM, each language, each API library adds nuances that are just impossible to automate.
" I can see why it just hasn't taken off. While it is technically correct, it is very difficult to maintain."
Correct for F2C as well as Fortran to machine language. The object code generated from most compilers can't easily be read by people. Either it's cruddy or it's highly optimized. Either way, it doesn't look a thing like an expert human would write in the assembler language for that hardware.
If only compiling could be reduced to some XSLT-like transformations that preserved the clarity of the old language in the new language. If there was only some universal Lingua Franca of computing that would be the Rosetta Stone of programming.
Until someone invents that Lingua Franca of computing, every language translation job will be hard and will lead to code that's "difficult to maintain" in the new language.
I've used f2c, and I agree with whoever wanted to name it cc2fc instead. It isn't a way of transforming Fortran into anything vaguely usable as C. It's a way of taking a C compiler and making a Fortran compiler out of it.
It did work just fine at taking that Fortran code and turning it (through C) to a Macintosh library I could call from Macintosh Common Lisp. Those were the days.

Resources