I know that syntax analysis is needed to determine whether the given series of tokens is valid in a language (by parsing these tokens to produce a syntax tree), and to detect errors that occur during the parsing of the input code, caused by grammatically incorrect statements.
I also know that semantic analysis is then performed on the syntax tree to produce an annotated tree, checking aspects that are not related to the syntactic form (like type correctness of expressions and declaration prior to use), and detecting errors in code that is grammatically correct but semantically invalid.
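For instance, here is a minimal Java illustration (hedged; any statically typed language would do): the first commented-out statement would be rejected during syntax analysis, while the second parses fine and would only be rejected during semantic analysis:

    public class PhaseDemo {
        public static void main(String[] args) {
            // Syntax error: the token sequence matches no grammar rule,
            // so the parser itself rejects it.
            // int x = ;        // fails in syntax analysis

            // Semantic error: this is a well-formed assignment, so it
            // parses, but type checking rejects assigning a String to an int.
            // int y = "hello"; // fails in semantic analysis
        }
    }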
However, the following issue is not clear to me:
If the syntax analyzer detects a syntax error, does that mean there should be no semantic analysis? Or should error recovery (in syntax analysis) make it possible to carry out semantic analysis anyway?
When you compile an incorrect program, you generally want the compiler to inform you about as many problems as possible, so that you can fix them all before attempting to compile the program again. However, you don't want the compiler to report the same error many times, or to report things which are not really errors but rather the result of the compiler getting confused by previous errors.
Or am I projecting my expectations on you? Perhaps I should have written that whole paragraph in first person, since it is really about what I expect from a compiler. Perhaps you have different expectations. Or perhaps your expectations are similar to mine. Whatever they are, you should probably write your compiler to satisfy them. That's basically the point of writing your own compiler.
So, if you share my expectations, you probably want to do as much semantic analysis as you can feel reasonably confident about. You might, for example, be able to do type checking inside some functions, because there are no syntax errors within those functions. On the other hand, that's a lot of work and there's always the chance that the resulting error messages will not be helpful.
That's not very precise, because there really is no definitive answer. But you should at least be able to answer your own question on your own behalf. If your compiler does a lousy job of error reporting and you find that frustrating when you try to use your compiler, then you should work on making the reports better. (But, of course, time is limited and you might well feel that your compiler will be better if it optimises better, even though the error reports are not so great.)
Related
I want to build myself some parsers for different computer languages. I thought about using ANTLR, but what I really want is to explore this myself, because I dislike the idea of generated code (yeah, silly, I know).
The question is how compile errors (missing identifiers, the wrong token for a certain rule, etc.) are handled and represented within ASTs.
What I know from compiler lectures is that a parser usually tries to throw away tokens, or to find the next matching code element (e.g. when a ';' is missing, synchronizing on the ';' of the next statement).
But how is this expressed within the AST? Is there some malformed-expression object/type? I am a bit puzzled.
I do not want to just reject invalid input; I want to handle it.
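One common approach (hedged; the class names below are hypothetical, and this is a sketch rather than a standard) is to give the AST an explicit error-node type: when recovery skips tokens, the parser inserts a node recording what was expected and what was skipped, so later phases can either skip such subtrees or report on them:

    import java.util.List;

    interface AstNode {}

    // An ordinary, well-formed node, for contrast.
    final class BinaryExpr implements AstNode {
        final AstNode left, right;
        final String operator;
        BinaryExpr(AstNode left, String operator, AstNode right) {
            this.left = left; this.operator = operator; this.right = right;
        }
    }

    // Stands in wherever a well-formed node could not be built. It keeps
    // the tokens consumed during recovery so diagnostics can point at
    // them; semantic analysis simply skips subtrees rooted here.
    final class ErrorNode implements AstNode {
        final String expected;       // e.g. "expression" or ";"
        final List<String> skipped;  // tokens discarded during recovery
        ErrorNode(String expected, List<String> skipped) {
            this.expected = expected;
            this.skipped = skipped;
        }
    }

This way the parser always produces a complete tree for any input, and "handling" an error amounts to deciding what to do with ErrorNode subtrees downstream.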
I have recently started using the built-in helper classes for basic data types, which make them look like C#. The IDE has a really unusual behavior associated only with the NativeInt and NativeUInt helpers: it reports the Size property as undefined.
It's a nuisance to see a line of errors which are not actually there, and then to sift through them for the real errors. Other mistakes made by the IDE's error parser can almost always be cleared by a successful compile, but this one never goes away.
Does somebody know a solution to this, aside from not using the property and switching back to SizeOf()? A hack solution is also welcome.
Disable "Error Insight" in the IDE settings. Seriously. It never works right, reports false errors that are not real errors, etc. It gets its info from a separate source then the actual code, and easily gets out of sync. Best to just not use it at all.
I'm writing an LALR parser generator as a pet project.
I'm using the purple dragon book to help me with the design, and what I gather from it is that there are four methods of error recovery in a parser:
Panic mode: Start dumping input symbols until a symbol pre-selected by the compiler designer is found (a sketch of this method follows the list)
Phrase-level recovery: Modify the input string into something that allows the current production to reduce
Error productions: Anticipate errors by incorporating them into the grammar
Global correction: Way more complicated version of phrase-level recovery (as I understand it)
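To make the first method concrete, here is a minimal panic-mode sketch in Java (hedged: tokens are plain strings and all names are hypothetical); on an error, the parser discards symbols until it reaches one of the designer-chosen synchronizing tokens:

    import java.util.List;
    import java.util.Set;

    final class PanicModeParser {
        // Synchronizing tokens pre-selected by the compiler designer,
        // e.g. statement terminators and block closers.
        private static final Set<String> SYNC_TOKENS = Set.of(";", "}");

        private final List<String> tokens;
        private int pos = 0;

        PanicModeParser(List<String> tokens) { this.tokens = tokens; }

        // Called when a parse error is detected: dump input symbols
        // until a synchronizing token (or end of input) is found, then
        // resume parsing just after it.
        void synchronize() {
            while (pos < tokens.size() && !SYNC_TOKENS.contains(tokens.get(pos))) {
                pos++;  // discard the offending symbol
            }
            if (pos < tokens.size()) {
                pos++;  // consume the synchronizing token itself
            }
        }
    }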
Two of these require modifying the input string (which I'd like to avoid), and the other two require the compiler designer to anticipate errors and design the error recovery based on their knowledge of the language. But the parser generator also has knowledge about the language, so I'm curious if there's a better way to recover from parsing errors without pre-selecting synchronizing tokens or filling up the grammar with error productions.
Instead of picking synchronizing tokens, can't the parser just treat the symbols in the follow sets of all the nonterminals the current production can reduce to as synchronizing tokens? I haven't really worked out how well that would work - I visualize the parser being partway down a chain of in-progress productions, but of course that's not how bottom-up parsers work. Would it produce too many irrelevant errors while trying to find a workable state? Would it attempt to resume the parser in an invalid state? Is there a good way to pre-fill the parser table with valid error actions, so the actual parsing program doesn't have to reason about where to go next when an error is encountered?
It's way too easy to get lost in a dead-end when you try to blindly follow all available productions. There are things that you know about your language which it would be very difficult for the parser generator to figure out. (Like, for example, that skipping to the next statement delimiter is very likely to allow the parse to recover.)
That's not to say that automated procedures haven't been tried. There is a long section about it in Parsing Theory (Sippu & Soisalon-Soininen). (Unfortunately, the text is paywalled, but if you have an ACM membership or access to a good library, you can probably find it.)
On the whole, the yacc strategy has proven to be "not awful", and even "good enough". There is one well-known way of making it better, which is to collect really bad syntax error messages (or failed error recovery), trace them to the state which is active when they occur (which is easy to do), and attach an error recovery procedure to that precise state and lookahead token. See, for example, Russ Cox's approach.
So, I've been working on a new project at work, and today a coworker brought up the idea that my exceptions, and even returned error messages, should be completely localized. I thought maybe that was a good idea, but he said that I should only return error codes. I personally don't like the error code idea much, as it tends to make other programmers either:
Reuse error codes where they don't fit, because they don't want to add another one, or
Use the wrong error codes, as there can get to be so many defined.
So my question is: what does everyone else do to handle this situation? I'm open to all sorts of suggestions, including those from people who think error codes are the way to go.
There may be cultural differences depending on your programming language.
In Java, for example, numeric error codes are not used much.
Concerning exceptions, I believe they are just a technical tool.
What is important is whether your message is targeted at a user or a developer.
For a user, localizing messages is important if several languages are involved, or if you need to be able to change the messages without recompiling (to customize between clients, or to adapt to changing user needs).
In my projects, our culture is to use (Java) enums to handle all collections of fixed values.
Errors are no different.
Enums for errors can provide (a sketch follows this list):
strong typing (you can't pass something else to a method that expects an error code)
simple localization (a utility method can automatically find the message corresponding to each value, using for example a "SimpleClassName.INSTANCE_NAME" key pattern; you could also expose a getMessage() method on each enum that delegates to your utility method)
verification of your localization files (your unit tests can loop over each language's codes and files and find all unmatched values)
error-level functionality (we use the same levels as for logging: fatal, error, warn; logging decisions are then very easy to implement!)
To make it easy for other developers to find the appropriate error, we use several enums (possibly in the same package), classifying the errors according to their technical or functional domain.
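Here is a minimal sketch of that pattern (all names hypothetical; the messages would live in per-locale files such as errors_en.properties, keyed by "SimpleClassName.INSTANCE_NAME"):

    import java.util.Locale;
    import java.util.ResourceBundle;

    // One enum per technical or functional domain; the level mirrors
    // the logging levels (fatal, error, warn) mentioned above.
    enum OrderError {
        ORDER_NOT_FOUND(Level.ERROR),
        PAYMENT_DECLINED(Level.WARN),
        DATABASE_UNAVAILABLE(Level.FATAL);

        enum Level { FATAL, ERROR, WARN }

        private final Level level;

        OrderError(Level level) { this.level = level; }

        Level level() { return level; }

        // Looks up the localized message under the key
        // "OrderError.ORDER_NOT_FOUND", etc.
        String getMessage(Locale locale) {
            ResourceBundle bundle = ResourceBundle.getBundle("errors", locale);
            return bundle.getString(getDeclaringClass().getSimpleName() + "." + name());
        }
    }

A unit test can then loop over OrderError.values() for each supported Locale and fail on any missing key, which is the verification point above.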
To address your two concerns:
Adding an error only requires adding an instance to an enum and a message in the localization file (and the tests can catch the latter if forgotten).
With the classification into several enums, and possibly the Javadoc, developers are guided to the correct error.
I wouldn't be using error codes for localization. There may be good reasons to use error codes (e.g. to be able to test which specific kind of error occurred), but localization is not one of those reasons. Instead, use the same framework that you use for the rest of the message localization also for exceptions. E.g. if you use gettext everywhere else, also use it in exceptions. That will make life easier for the translator.
You can include an error code in an exception, thereby getting the best of both.
One frequent cause of error with old-style function-return error codes was failure to check the error code before continuing with subsequent code. An exception cannot be implicitly ignored. Eliminating a source of error is a good thing.
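A minimal sketch of that combination (hypothetical names; this is one way to do it, not the only one): the exception carries a stable code plus the raw data needed for the message, and localization happens only when the UI renders it:

    // An exception that carries an error code and message arguments,
    // leaving message construction and localization to the UI layer.
    class AppException extends Exception {
        private final String errorCode;  // stable, machine-readable
        private final Object[] args;     // data for the message template

        AppException(String errorCode, Object... args) {
            super(errorCode);            // non-localized fallback text
            this.errorCode = errorCode;
            this.args = args;
        }

        String errorCode() { return errorCode; }
        Object[] args() { return args; }
    }

Calling code can still branch on e.errorCode() to distinguish kinds of errors, and it cannot silently ignore the failure the way a forgotten return-code check can.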
An error code allows:
Calling code to distinguish between different kinds of errors.
Messages to be constructed by UI components when errors occur in non-UI, non-localized code.
A list of errors in the code that may be useful when writing user documentation or troubleshooting guides.
A few guidelines I have found useful:
Only UI architectural layers construct and localize messages to the user.
When non-UI layers communicate errors via exceptions, those may optionally carry error codes and additional fields useful to the UI when constructing a message.
In Java, when error codes are defined in a UI-layer enum, the error-message format strings can be accessed through the enum constants themselves. They may contain placeholders for additional data carried in the error.
In Java, include the argument index in the format specifiers of the original language's format strings, so that translators can move dynamic data around in localized messages (see the sketch below).
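For instance (a hedged sketch; the message text is illustrative), indexed specifiers such as %1$s bind arguments by position rather than by order of appearance, so a translation can mention them in a different order:

    import java.util.Locale;

    public class FormatDemo {
        public static void main(String[] args) {
            // English template, as it might appear in a resource file:
            String en = "File %1$s could not be opened by user %2$s.";
            // A translation may mention the user first; the indices,
            // not the order, bind arguments to placeholders:
            String de = "Der Benutzer %2$s konnte die Datei %1$s nicht öffnen.";

            System.out.println(String.format(Locale.ENGLISH, en, "a.txt", "alice"));
            System.out.println(String.format(Locale.GERMAN, de, "a.txt", "alice"));
        }
    }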
When is it advisable to let the compiler do its thing and when should I be explicit when declaring variable types?
Easy, in F#, always prefer to let the compiler "do its thing". The folks who wrote its powerful type inference system would be saddened otherwise.
In all seriousness, in C# I know there is (or was?) a debate about when or when not to use var for inferred variable types. But I believe the concerns there about the lack of clarity stemmed from a community which was unfamiliar with terse, strongly typed languages and feared var was some kind of dynamic voodoo not to be trusted. But what we have now in C#, and many times over in F#, is the best of all worlds. Strong, automatic typing. And "variables" are only the tip of the iceberg. The real amazement comes with F#'s inference of function type signatures. There was a while there where I believed this was over-the-top, and that writing out the full signature would be clearer. Man, you get over that fast.
I agree with @Stephen, let the compiler "do its thing".
When you are first starting with a type-inferred language, this will feel unnatural and you'll write extra type annotations, perhaps thinking you need them for readability. You'll get over that soon enough; code with decent variable names has little need for spelling out types everywhere, and type annotations are often just excess cruft cluttering your algorithms.
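The thread is about F#, but the same trade-off is easy to show with Java's local type inference (var, available since JDK 10); this is an analogy, not F# inference itself:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class InferenceDemo {
        public static void main(String[] args) {
            // Fully annotated: the type is spelled out twice.
            Map<String, List<Integer>> scoresExplicit = new HashMap<>();
            scoresExplicit.put("bob", List.of(7));

            // Inferred: with a decent variable name, nothing is lost.
            var scoresByPlayer = new HashMap<String, List<Integer>>();
            scoresByPlayer.put("alice", new ArrayList<>(List.of(10, 12)));

            System.out.println(scoresExplicit + " " + scoresByPlayer);
        }
    }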
I can only think of a couple of general detriments to not spelling out the types. First, if you are viewing the source code as a mere text file, it may not be obvious to the reader what the types are. However, this is largely mitigated by the fact that tools like Visual Studio provide hover tooltips that show the types (e.g. hover your mouse over foo and a tooltip pops up showing the type of foo). The logic for doing this inference is exposed by the compiler source code, and has been easily integrated into other tools, like F# web snippets, MonoDevelop, etc. (Someone, please leverage the new Apache license and write the plugins for github, emacs, and gvim :), thanks!) As a result, much of the time you're looking at the code, you'll be doing it in an environment/tool where the types are available to be shown on demand anyway.
Second, occasionally it can be hard to debug type errors when there is a lack of type annotations. When you get a weird type inference error you can't figure out, it can be useful to add some explicit annotations to localize the type error. Oftentimes you'll then see some dumb bug in your code, fix it, and then you can remove the needless annotations.
Of course, there are a number of places where type annotations are required because type inference can't solve everything. See the links below for some details there. You'll get accustomed to these after a while, and get adept at predicting when you will or won't need an annotation.
http://lorgonblog.wordpress.com/2009/10/25/overview-of-type-inference-in-f/
Why can't F#'s type inference handle this?