For some time now I've been thinking about designing a small toy language from scratch, nothing that will "Rule The World", but mostly as an exercise. I realize there is a lot to learn in order to accomplish this.
This question is about three different concepts (parsing, code highlighting and completion) that strike me as extremely similar. Of course, parsing and AST generation are part of compilation, while code highlighting and completion are more features of the IDE, yet I wonder what the similarities and differences are.
I need some hints from someone more experienced in this topic. What code can be shared between these concepts and what are the architecture considerations that could help in this sense?
What you want is a syntax-directed structure editor. This is one that combines parsing with AST building and either uses the parser to predict what you can type next (syntax completion), or has a tie to the compiler's last run, so that it can interpret the edit point and see what valid identifiers might come next by inspecting the symbol table that was last relevant at that point in the code.
The most difficult part is offering the user a seamless experience; she pretty much has to believe she is editing text or (experience with structure editors shows) she will reject it as awkward.
This is a lot of machinery to coordinate and quite a big effort. The good news is that you need a parser anyway for the compiler; if editing also parses, the AST needed by the compiler is essentially available. (Of course you have to worry about batch compiling, too). The compiler has to build a symbol table; so you can use that in the editing completion process. The more difficult news is that the parsers are a lot harder to build; they can't just declare a user-visible syntax error and quit; rather they have to be tolerant of a number of errors extant at the same moment, hold partial ASTs for the pieces, and stitch them together as the errors are removed by the user.
The Berkeley Harmonia people are doing good work in this area. It is well worth your trouble to read some of their papers to get a detailed sense of the problems and one approach to handling them.
The other major approach people (notably Intentional Programming and XText) seem to be trying is object-oriented editors, where you attach editing actions to each AST node and associate every point on the screen with an AST node. Editing actions then invoke AST-node-specific actions (insert-character, go right, go up, ...), and the node can decide how to act and how to modify the screen. Arguably you can make these editors do anything; it's a little harder in practice. I've used these editors; they don't feel like text editors. There are some enthusiastic users, but YMMV.
I think you probably ought to choose between trying to build such an editor and trying to define a new language. Doing both at once is likely to overwhelm you with troubles.
I'm doing a project which involves parsing the histories of common lisp repos. I need to parse them into list-of-lists or something like that. Ideally, I'd like to preserve as much of the original source file syntax as possible, in some way. For example, in the case of the text #+sbcl <something>, which I think means "If our current lisp is sbcl, read <something>, otherwise skip it", I'd like to get something like (#+ 'sbcl <something>).
I originally wrote a LALR parser in Python, which sort of worked, but it's not ideal for many reasons. I'm having a lot of difficulty getting correct output, and I have tons of special cases to add.
I figured that what I should really do is use lisp itself, since it already has a lisp parser built in. If I could just read a file into sexps, I could dump it into something (cl-json would do) for further processing down the line.
Unfortunately, when I attempt to read https://github.com/fukamachi/woo/blob/master/src/woo.lisp, I get the error
There is no package with the name WOO.EV.TCP
which is of course coming from line 80 of that file, since that package is defined in src/ev/tcp.lisp, and we haven't read it.
Basically, is it possible to just read the file into sexps without caring whether the packages are defined or if they contain the relevant symbols? If so, how? I've tried looking at the hyperspec reader documentation, but I don't see anything that sounds relevant.
I'm out of practice with actually writing common lisp, but it seems potentially possible to hack around this by handling the undefined package condition by creating a blank package with that name, and handling the no-symbol-of-that-name-in-package condition by just interning a given symbol. I think. I don't know how to actually do this, I don't know if it would work, I don't know how many special cases would be involved. Offhand, the first condition is called no-such-package, but the second one (at least in sbcl) is called simple-error, so I don't even know how to determine whether this particular simple-error is the no-such-symbol-in-that-package error, let alone how to extract the relevant names from the condition, fix it, and restart. I'd really like to hear from a common lisp expert that this is the right thing to do here before I go down the road of trying to do it this way, because it will involve a lot of learning.
It also occurs to me that I could fix this by just sed-ing the file before reading it. E.g. turning woo.ev.tcp:start-listening-socket into, say, woo.ev.tcp===start-listening-socket. I don't particularly like this solution, and it's not clear that I wouldn't run into tons more ugly special cases, but it might work if there's no better answer.
I am almost sure there is no easy portable way to do this for a number of reasons.
(Just limiting things to the non-existent-package problem for now.)
First of all there is no portable access into the bit of the reader which decides that tokens are going to be symbols and then looks for package markers &c: that just happens according to the rules in section 2.3 of the standard. So you can't easily intervene in this.
Secondly, no condition the reader might signal portably carries enough information to let you handle the problem.
There are several possible ways out of this bit of the problem.
If you felt sufficiently heroic you might be able to teach the reader that all of the token-starting characters are in fact things you control and then write a token-reader that somehow deals with the whole package thing by returning some object which isn't a symbol. But to do that you need to deal with numbers, and if you think that's simple, well, it's not.
If you felt less heroic you could write a more primitive token-reader which just doesn't even try to deal with anything except grabbing all the characters needed and returns some kind of object which wraps a string. This would avoid the whole number problem at the cost of losing a lot of information.
If you don't care about portability, find an implementation, understand how its reader does it, and muck around with it. There are more open source or source-available implementations than I can easily count (perhaps I am not very good at counting) so this is a pretty good approach. It's certainly what I'd do.
But this is only the start of the problems. The CL reader is hairy and, in its standard configuration (the configuration which is used for things like compile-file unless people have arranged otherwise) can run completely arbitrary code at read time, including code which modifies the reader itself, some of which may do so in an implementation-dependent way. And people use this: there's a reason Lisp is called the 'programmable programming language' and it's that people program it.
I've decided to solve this using sed (actually Python's re.sub, but who's counting?) because it'll work for my actual use case, and was easy.
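A minimal sketch of that kind of substitution in Python; the regular expression and the === replacement here are illustrative and deliberately rough (they will also rewrite matches inside strings and comments), not a vetted pattern:

import re

# Rough heuristic: rewrite pkg:sym and pkg::sym into pkg===sym so the reader
# never sees a package marker.  It ignores |escaped| symbols and keywords
# like :foo, which is good enough for a quick hack.
PACKAGE_REF = re.compile(
    r'([A-Za-z0-9.+*/<>=!?$%&^_-]+)::?([A-Za-z0-9+*/<>=!?$%&^_-]+)')

def neutralize_package_prefixes(source: str) -> str:
    return PACKAGE_REF.sub(r'\1===\2', source)

print(neutralize_package_prefixes("(woo.ev.tcp:start-listening-socket socket)"))
# => (woo.ev.tcp===start-listening-socket socket)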
For future readers: The various people saying this is impossible in general are probably right. The other questions posted by @Svante look like good easy ways to solve part of the problem. Other parts of the problem might be solved more elegantly by replacing the reader macros for #., #+, #-, etc. with ones which just make a list, which sounds less heroic than the suggestions from @tfb, but I don't have time for that shit.
I'm looking for steps/libraries/approaches to solve this problem statement.
Given a source file of a programming language, I need to parse it and subdivide it into components.
Example:
Given a Java File, I need to find the following in it.
list of Imports
Classes present in it
Attributes in the Class
Methods in it, along with the parameters if any.
etc.
I need to extract these and store them separately.
Why do I want to do this?
I want to build an inverted index on top of these components.
Example queries to Inverted index
1. Find the list of files with Class name: Sample
2. Find the positions where variable XXX is used within the class AAA.
I need to support queries like the above.
So my plan is: given a file, if I build these components from it, it would be easy to build an inverted index on top of them.
Example: Sample -- Class -- Sample.java (keyword - component - file name)
I want to build an inverted index like the above.
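As a rough sketch of that index shape (plain Python, with made-up component data standing in for whatever a parser would actually extract):

from collections import defaultdict

# (keyword, component kind) -> set of (file name, line) occurrences
index = defaultdict(set)

def add_component(keyword, kind, filename, line):
    """Record one extracted component, e.g. class Sample defined in Sample.java."""
    index[(keyword, kind)].add((filename, line))

# Hypothetical extraction results:
add_component("Sample", "class", "Sample.java", 3)
add_component("xxx", "variable", "AAA.java", 42)

# Query 1: files that declare a class named Sample
print(sorted(f for f, _ in index[("Sample", "class")]))

# Query 2 (positions of variable xxx inside class AAA) would additionally
# need the extractor to record the enclosing class for each occurrence.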
I see this implemented in many IDEs, like IntelliJ. What I'm interested in is how much effort it would take to build something like this. And I want to try implementing the same for at least one language.
Thanks in advance.
You can try to do this "just" a parser; for your specific example, that might be enough.
But you'll need a parser for each language. If you stick to just Java, you can find Java parsers pretty easily; just reuse one, there is little point in you reinventing one more set of grammar rules to describe Java.
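For the Java-only case, the extraction step might look roughly like this sketch. I'm assuming the third-party javalang package for Python purely as one example of "reuse a parser"; the attribute names reflect my understanding of its API and should be checked against its documentation:

import javalang  # third-party Java parser for Python: pip install javalang

def extract_components(source: str) -> dict:
    """Pull imports, classes, attributes and methods out of one Java source file."""
    tree = javalang.parse.parse(source)
    result = {"imports": [imp.path for imp in tree.imports], "classes": []}
    for _, cls in tree.filter(javalang.tree.ClassDeclaration):
        result["classes"].append({
            "name": cls.name,
            "attributes": [d.name for f in cls.fields for d in f.declarators],
            "methods": [{"name": m.name,
                         "parameters": [p.name for p in m.parameters]}
                        for m in cls.methods],
        })
    return result

Each returned dictionary can then be fed straight into an inverted index like the one sketched in the question.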
For more than one language, this starts to get tricky. You can:
try to find a separate parser for each language. This may be sort of successful for mainstream languages. As you get to less well known languages, these get a lot harder to find. If you succeed, you'll have the problem that the parsers are likely built on incompatible technology; gluing them together to collect your index information is going to be a mess.
pick one parsing technology and get grammars for all the languages you care about. You have only two realistic choices: YACC/Bison, and ANTLR.
As a practical matter, YACC and Bison have been used to implement LOTS of languages... but the grammar files are not collected in one place, so they are hard to find. ANTLR at least has a single repository you can find at their web site. So that might kind of work.
It's going to be quite the effort to assemble all these into an integrated whole.
A complication is that you may want more than just raw syntax; you might want to know the meaning of the symbols, and for each symbol, precisely where it is defined in which file. After all, you want your index to be accurate at scale, and this will require differentiating foo the variable name from foo the function name. Arguably you need symbol tables.
As a general rule, this is where pure-parsing of languages breaks down;
there is serious Life After Parsing.
In that case, you want an integrated set of tools for extracting information from the different languages.
Our DMS Software Reengineering Toolkit is such a framework, and has some 40 languages predefined for it. We use something like OP's suggested process to build indexes of a code base for search tools based on DMS. Building something like DMS is an enormous effort.
(I'm using the word "workflow" - not in the sense of async workflows - but rather in the "git workflow" sense, that is, how you use it as part of your development)
Having played around with F# for a while, I've started developing my first F# app. I'm coming from C#/VB. Having watched various demos/talks - rightly or wrongly - I've started off using fsi as the main development "engine" and work up stuff within that area. If I hit a problem which I need to debug, I tend to break the problematic function out into smaller bits and check that those work, to try and track down the problem.
However, in order to keep the amount of code manageable in fsi, once I am happy with what I have done, I then move it into a .fs file and #load the .fs back into fsi. As the app gets bigger, this can begin to feel a bit clunky: when I need to refactor, I end up having to bring content back in from the .fs file, change it, and run things to get something working again, before pushing the code back out into the .fs file. Further, this style isn't really a test-first approach, so I am not getting the benefit of building up a set of tests. (I also miss the ability to set breakpoints and step through the code, which, it seems to me, in certain situations, e.g. recursion, can be quicker for diagnosing errors than breaking out parts of a function - though maybe this is available in VS11 and I'm just not set up right.) So I think I'm perhaps not doing things optimally, or else not thinking about things in the right way.
I was wondering if others could offer how they develop apps. Do you primarily use fsi, or do you start with TDD? Should the TDD approach be the primary dev vehicle, with fsi used more selectively to aid in, say, the implementation of more complex algorithms, data exploration, etc.?
I have looked at this question, and obviously it helpfully points to various TDD frameworks for F#, but I'd still be interested to find out the workflow of seasoned F# developers.
Many thx
S
I think you're on the right track.
Development process is a matter of taste. I'll share my approach anyway.
Start with a few .fs files. Each file represents a module, which consists of a group of functions closely related to each other. It doesn't have to be precise from the beginning; you often move stuff between modules.
Create a few .fsx files for quick testing once the skeleton of the modules is ready.
Create a test project and set up NuGet packages. I often use NUnit and FsUnit together.
Whenever fsx scripting gives correct results, move them to test cases. Do this repeatedly.
Include a Program.fs in the main project and compile to an executable in order to debug if needed.
In general, the F# REPL is the main development engine. It gives me instant feedback and allows incremental changes, which are very helpful in prototyping. In F#, TDD is less critical since the bug rate is much lower than in other languages. And I don't test everything; I just focus on the main functionality and ensure high test coverage. Using the TestDriven.Net add-in or Visual Studio 2012 Premium and Ultimate can give you useful statistics on test coverage.
Using the F# REPL and TDD, I almost never have to use the debugger. Whenever there is wrong behaviour, I stop and think. Since your code doesn't have side effects, you can reason about it easily. Often, reasoning and a few print statements give me the right answer.
You can use TDD in the F# REPL with Unquote and FsCheck. The former offers testing via quotations, which is quite impressive. The latter uses a random-testing approach, which is attractive for handling corner cases in your code. I find it really useful when your programs have to satisfy certain properties. However, it may take time to learn to use these frameworks properly.
pad gave a great answer that is very practical and useful for a person new to F#. I will give a different means so that others don't think there is only one way F#'ers do it.
Note: If you are very new to programming, then stick with pad's answer, it is much better for a new programmer.
In the Object Oriented world one thinks with objects and in such languages I would start with writing objects down on paper and working with various diagrams such as use-case, state transition, sequence diagram, etc., until I felt I had enough to start creating objects in C# cs files, fleshing out the objects with methods, properties, events, etc.
In the functional world I typically start with abstract concepts and convert them into discriminated unions (DU) in an F# .fs file, skipping the use of a REPL, i.e. F# Interactive, and then start adding a few functions. After I have a few functions I set up a test project using NUnit and FsUnit via NuGet. Since the DUs are abstract, the test cases are typically harder to write, so I create printers for the DUs, capture the printer output from the NUnit tool, and paste it back into the test cases, making changes as necessary. See these for examples of why I don't write them from scratch by hand.
Once I have the abstract DUs done, I can then move on to the code that converts the human/concrete form into the abstract DU and the abstract DU back into human/concrete form. In some cases these would be parsers and pretty printers.
The main point I am trying to make is that I don't focus on the tools I use but on the abstract concept of the problem and bring the tools in when needed.
I will note that I also program in PROLOG and there I do start with the REPL and move the code to a store once the logic works. So I am not opposed to using a REPL, it's just a different way of approaching the problem.
EDIT
Per a request by Ken for an example.
See: Discriminated Unions (F#) and look for the section
Using Discriminated Unions Instead of Object Hierarchies
So instead of a base type Shape with inherited types Circle, EquilateralTriangle, Square and Rectangle, one would create a discriminated union as noted:
type Shape =
| Circle of float
| EquilateralTriangle of double
| Square of double
| Rectangle of double * double
As your question would make for a much better independent question and get answers with much better detail than I can give, I would suggest you ask it.
Also, if you search for info on this, search with the following substitutions for discriminated union (DU):
Algebraic data type
Generalized algebraic data type (GADT)
Tagged union
Variant
variant record
disjoint union
sum type
I'm still new to OOP, and the way I initially perceived it was to throw a lot of procedural-looking code inside of objects, and think I'd done my job. But as I've spent the last few weeks doing a lot of thinking, reading, and coding (and looking at good code, which is a hugely under-rated resource), I believe I'm starting to grasp the different outlook. It's really just a matter of clarity, simplicity, and organization once you get down to it.
But now I'm starting to look at things as objects that are not as black-and-white a slam-dunk case for being an object. For example, I have a parser, and usually the parser returns some strings that I have to deal with. But it has one specialized case where it has to return an array, and what goes in that array and how it's formatted has specialized rules. This only amounts to two lines plus one method of code, but this code sticks out to me as not fitting cleanly in the Parser class, and I want to turn it into its own "ActionArray" object.
But is it going too far? Has OOP become a hammer that is making me look at everything as a nail? Is it possible to go too far with turning things into objects?
It's your call, but you should think of objects as real life objects.
Take for example a car. You could describe a car with different objects:
Engine
Wheels
Chassis
Or you could describe a car with just one object:
Car
You can keep it simple and stupid or you can spread the dependency to different objects.
As a general guideline, I think Sesame Street says it best: you need a new object when "one of these things is not like the others".
Listen to your code. If it is telling you that your objects are becoming polluted with non-essential state and behavior (and thus violating the "Single Responsibility Principle"), or that one part of your object has a rate of change that is different from the rest, and so on, it is telling you that you are missing an object.
Do the simplest thing that could possibly work. When that no longer works, do the next simplest thing. And so on. In general, this means that a system tends to move from fewer, larger objects to more, smaller objects; but not always.
There are a number of great resources for OO design. In addition to the ones already mentioned, I highly recommend Smalltalk Best Practice Patterns and Implementation Patterns by Kent Beck. They use Smalltalk and Java examples, respectively, but I find the principles translate quite well to other OO languages.
Design patterns are your friend. A class rarely exists in a vacuum. It interacts with other classes, and the mechanisms by which your classes are coupled together are going to directly affect your ability to modify your code in the future. With poor class design, a change that you make in one class may ripple down and force changes in other classes, which cause you to have to change other classes, and so on.
Design patterns force you to think about how classes relate to each other. For example, your Parser class might choose to implement the Strategy design pattern to abstract out the mechanism for parsing. You might decide to create your Parser with the Template Method design pattern, and then have each actual instance of the Parser complete the template.
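As a toy illustration of the Strategy idea applied to the question's Parser, here is a minimal sketch (Python, with class and method names invented for the example):

from abc import ABC, abstractmethod

class ParseStrategy(ABC):
    """The interchangeable 'how to parse this piece' part."""
    @abstractmethod
    def parse(self, text):
        ...

class PlainStringStrategy(ParseStrategy):
    def parse(self, text):
        return text.strip()

class ActionArrayStrategy(ParseStrategy):
    def parse(self, text):
        # the specialized formatting rules from the question would live here
        return [token.upper() for token in text.split()]

class Parser:
    """Delegates the variable part to a strategy instead of branching inline."""
    def __init__(self, strategy: ParseStrategy):
        self.strategy = strategy

    def parse(self, text):
        return self.strategy.parse(text)

print(Parser(ActionArrayStrategy()).parse("go north"))  # ['GO', 'NORTH']

The point is not these particular classes, but that the coupling between the Parser and each parsing rule becomes explicit and swappable.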
The original book on design patterns (Design Patterns: Elements of Reusable Object-Oriented Software) is excellent, but can be dense and intimidating reading if you are new to OOP. A more accessible book (and specific to Ruby) might be Design Patterns in Ruby, which has a nice introduction to design patterns and talks about the Ruby way of implementing those patterns.
Object oriented programming is a pretty tricky tool. Many people today are getting into the same conflict, by forgetting the fundamental OOP purpose, which is improving code maintainability.
You can always brainstorm about your future OO code's reusability and maintainability, and decide for yourself if it's the best way to go. Take a look at this interesting study:
Potok, Thomas; Mladen Vouk, Andy Rindos (1999). "Productivity Analysis of Object-Oriented Software Developed in a Commercial Environment"
I wonder what sort of things you look for when you start working on an existing, but new to you, system? Let's say that the system is quite big (whatever it means to you).
Some of the things that were identified are:
Where is a particular subroutine or procedure invoked?
What are the arguments, results and predicates of a particular function?
How does the flow of control reach a particular location?
Where is a particular variable set, used or queried?
Where is a particular variable declared?
Where is a particular data object accessed, i.e. created, read, updated or deleted?
What are the inputs and outputs of a particular module?
But if you are looking for something more specific, or any of the above questions is particularly important to you, please share it with us :)
I'm particularly interested in something that could be extracted in dynamic analysis/execution.
I like to use a "use case" approach:
First, I ask myself "what's this software's purpose?": I try to identify how users are going to interact with the application;
Once I have some "use case", I try to understand what are the objects that are more involved and how they interact with other objects.
Once I did this, I draw a UML-type diagram that describe what I've just learned for further reference. What happens after depends on the task I've been assigned, i.e. modify the code, document the code etc.
There is the question of what motivation I have for learning the new system:
Bug fix/minor enhancement - In this case, I may focus solely on the portion of the system that performs the specific function that needs to be altered. This is a way to break down a huge system, but it is also a way to identify whether the issue is something I can fix or something I have to hand off to the off-the-shelf vendor whose software we are using; e.g. a CRM, CMS, or ERP system can be a customized off-the-shelf system, so there are many pieces to it.
Project work - This would be the other case, and is where I'd probably try to build myself a view from 30,000 feet or so, to know what the high-level components are and which areas of the system the project impacts. An example of this is where I'd join a company and work off of an existing code base, but don't have the luxury of the narrow focus of the previous case. Part of building that view is looking for patterns in the code in terms of naming conventions, project structure, etc., as these may be useful once I start changing code in the system. I'd probably do some tracing through the system and try to see where the uglier parts of the code are. By uglier I mean those parts that are kludge-like and may have some spaghetti code, because they were rushed when first written and are now being reworked heavily.
To my mind, another way to view this is to ask whether I'm going to be spending days or weeks wrapping my head around a system, as in the second case, or whether this is a case where it hopefully takes only a few hours, optimistically, to get my footing and make the necessary changes.