ECMA CLI spec: stack transition diagram notation for CIL instructions

I have been reading the ECMA CLI spec:
http://www.ecma-international.org/publications/standards/Ecma-335.htm
and I'm puzzled by the use of commas within the stack transition diagrams for some of the instructions. For example, here is the documented stack transition for ldloc (load local variable onto the stack):
… => …, value
And here is the stack transition for ldsfld (load static field of a class):
…, => …, value
My question has to do with the extra comma before the =>: does it have any significance? Another example is jmp (jump to method):
… => …
and br.<length> (unconditional branch):
…, => …
There are also examples of trailing commas such as for nop and starg.<length>. Is this just an inconsistency or is there a nuance to this notation that I don't understand?

The commas are simply there to show you that the rest of the evaluation stack doesn't change. The stack may already hold values before the instruction executes (it's very common to push things on the stack and do other operations while they're there).
If you see … => … - the instruction doesn't change the stack at all.
If you see … => …, value - the instruction pushes one value onto the stack.
If you see …, value => … - the instruction pops one value off the stack.
A stray comma with nothing after it (as in …, => …) adds no meaning; it appears to be a typographical inconsistency in the spec.
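To make the notation concrete, here is a small sketch of my own (not from the spec) that models the evaluation stack as a Haskell list, with the head of the list playing the role of the topmost (rightmost) slot and the tail standing in for the untouched "…":

type Stack a = [a]

-- "… => …, value"  (e.g. ldloc): one value is pushed, the rest is unchanged
push :: a -> Stack a -> Stack a
push value rest = value : rest

-- "…, value => …"  (e.g. starg): one value is popped, the rest is unchanged
pop :: Stack a -> Stack a
pop (_value : rest) = rest
pop []              = error "stack underflow"

In both cases rest passes through untouched, which is all the leading "…," is meant to convey.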

Related

Behavior of STRING verb

I am reading a COBOL program file and I am struggling to understand the way the STRING command works in the following example
STRING WK-NO-EMP-SGE
','
WK-DT-DEB-PER-FEU-TEM
','
WK-DT-FIN-PER-FEU-TEM
DELIMITED BY SIZE
INTO UUUUUU-CO-CLE-ERR-DB2
I have three possible understandings of what it does:
Either the code concatenates each variable into UUUUUU-CO-CLE-ERR-DB2 and separates the values with ',', and only the last variable is delimited by size;
Or the code concatenates each variable into UUUUUU-CO-CLE-ERR-DB2 and separates the values with ',', but all the values are delimited by size (meaning that the DELIMITED BY SIZE in this case applies to all the values passed to the STRING statement);
Or each variable is delimited by a specific character: for example, WK-NO-EMP-SGE would be delimited by ',', WK-DT-DEB-PER-FEU-TEM by ',' and WK-DT-FIN-PER-FEU-TEM would then be DELIMITED BY SIZE.
Which of my readings is actually the correct one?
Here's the syntax diagram for STRING (from the Enterprise COBOL Language Reference):
Now you need to know how to read it.
Fortunately, the same document tells you how:
How to read the syntax diagrams
Use the following description to read the syntax diagrams in this
document:
• Read the syntax diagrams from left to right, from top to bottom, following the path of the line.
The >>--- symbol indicates the beginning of a syntax diagram.
The ---> symbol indicates that the syntax diagram is continued on the next line.
The >--- symbol indicates that the syntax diagram is continued from the previous line.
The --->< symbol indicates the end of a syntax diagram. Diagrams of syntactical units other than complete statements start with the >--- symbol and end with the ---> symbol.
• Required items appear on the horizontal line (the main path).
• Optional items appear below the main path.
• When you can choose from two or more items, they appear vertically, in a stack.
If you must choose one of the items, one item of the stack appears on the main path.
If choosing one of the items is optional, the entire stack appears below the main path.
• An arrow returning to the left above the main line indicates an item that can be repeated.
A repeat arrow above a stack indicates that you can make more than one choice from the stacked items, or repeat a single choice.
• Variables appear in italic lowercase letters (for example, parmx). They represent user-supplied names or values.
• If punctuation marks, parentheses, arithmetic operators, or other such symbols are shown, they must be entered as part of the syntax.
All that means, if you follow it through, that your number 2 is correct.
You can use a delimiter (when you don't have fixed-length data) or just use the size. Any item that is not explicit about how it is delimited is delimited by the next DELIMITED BY phrase.
One thing to watch for with STRING, which doesn't matter in your case, is that the target field does not get space-padded if the data is shorter than the target. With variable-length data, you need to clear the field to space before the STRING executes.
There is a nuance one must grasp in order to understand the results. DELIMITED BY SIZE can be misleading if one has experience in other programming languages.
Each of the three variables has a size that is defined in WORKING-STORAGE. Let's presume it looks something like this.
05 WK-NO-EMP-SGE PIC X(04).
05 WK-DT-DEB-PER-FEU-TEM PIC X(10).
05 WK-DT-FIN-PER-FEU-TEM PIC X(10).
If the value of the variables were set like this:
MOVE 'BOB' TO WK-NO-EMP-SGE.
MOVE 'Q' TO WK-DT-DEB-PER-FEU-TEM.
MOVE 'D19EIEIO2B' TO WK-DT-FIN-PER-FEU-TEM.
Then one might expect the value of UUUUUU-CO-CLE-ERR-DB2 to be:
BOB,Q,D19EIEIO2B
But it would actually be:
BOB ,Q         ,D19EIEIO2B
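To see where those blanks come from, here is a small sketch of the padding behaviour (in Haskell rather than COBOL, purely as an illustration; padTo and result are my names): each sending field occupies its full declared length, blank-padded on the right, and DELIMITED BY SIZE copies all of it, trailing blanks included.

padTo :: Int -> String -> String
padTo n s = take n (s ++ repeat ' ')

result :: String
result = padTo 4 "BOB" ++ "," ++ padTo 10 "Q" ++ "," ++ padTo 10 "D19EIEIO2B"
-- "BOB ,Q         ,D19EIEIO2B"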

Haskell/Parsec: How do you use the functions in Text.Parsec.Indent?

I'm having trouble working out how to use any of the functions in the Text.Parsec.Indent module provided by the indents package for Haskell, which is a sort of add-on for Parsec.
What do all these functions do? How are they to be used?
I can understand the brief Haddock description of withBlock, and I've found examples of how to use withBlock, runIndent and the IndentParser type here, here and here. I can also understand the documentation for the four parsers indentBrackets and friends. But many things are still confusing me.
In particular:
What is the difference between withBlock f a p and
do aa <- a
   pp <- block p
   return (f aa pp)
Likewise, what's the difference between withBlock' a p and do {a; block p}
In the family of functions indented and friends, what is ‘the level of the reference’? That is, what is ‘the reference’?
Again, with the functions indented and friends, how are they to be used? With the exception of withPos, it looks like they take no arguments and are all of type IParser () (IParser defined like this or this) so I'm guessing that all they can do is to produce an error or not and that they should appear in a do block, but I can't figure out the details.
I did at least find some examples on the usage of withPos in the source code, so I can probably figure that out if I stare at it for long enough.
<+/> comes with the helpful description “<+/> is to indentation sensitive parsers what ap is to monads” which is great if you want to spend several sessions trying to wrap your head around ap and then work out how that's analogous to a parser. The other three combinators are then defined with reference to <+/>, making the whole group unapproachable to a newcomer.
Do I need to use these? Can I just ignore them and use do instead?
The ordinary lexeme combinator and whiteSpace parser from Parsec will happily consume newlines in the middle of a multi-token construct without complaining. But in an indentation-style language, sometimes you want to stop parsing a lexical construct or throw an error if a line is broken and the next line is indented less than it should be. How do I go about doing this in Parsec?
In the language I am trying to parse, ideally the rules for when a lexical structure is allowed to continue on to the next line should depend on what tokens appear at the end of the first line or the beginning of the subsequent line. Is there an easy way to achieve this in Parsec? (If it is difficult then it is not something which I need to concern myself with at this time.)
So, the first hint is to take a look at IndentParser
type IndentParser s u a = ParsecT s u (State SourcePos) a
I.e. it's a ParsecT keeping an extra close watch on SourcePos, an abstract container which can be used to access, among other things, the current column number. So, it's probably storing the current "level of indentation" in SourcePos. That'd be my initial guess as to what "level of reference" means.
In short, indents gives you a new kind of Parsec which is context sensitive—in particular, sensitive to the current indentation. I'll answer your questions out of order.
(2) The "level of reference" is the "belief" referred in the current parser context state of where this indentation level starts. To be more clear, let me give some test cases on (3).
(3) In order to start experimenting with these functions, we'll build a little test runner. It'll run the parser with a string that we give it and then unwrap the inner State part using an initialPos which we get to modify. In code
import Text.Parsec
import Text.Parsec.Pos
import Text.Parsec.Indent
import Control.Monad.State
testParse :: (SourcePos -> SourcePos)
          -> IndentParser String () a
          -> String
          -> Either ParseError a
testParse f p src = fst $ flip runState (f $ initialPos "") $ runParserT p () "" src
(Note that this is almost runIndent, except I gave a backdoor to modify the initialPos.)
Now we can take a look at indented. By examining the source, I can tell it does two things. First, it'll fail if the current SourcePos column number is less-than-or-equal-to the "level of reference" stored in the SourcePos stored in the State. Second, it somewhat mysteriously updates the State SourcePos's line counter (not column counter) to be current.
Only the first behavior is important, to my understanding. We can see the difference here.
>>> testParse id indented ""
Left (line 1, column 1): not indented
>>> testParse id (spaces >> indented) " "
Right ()
>>> testParse id (many (char 'x') >> indented) "xxxx"
Right ()
So, in order to have indented succeed, we need to have consumed enough whitespace (or anything else!) to push our column position out past the "reference" column position. Otherwise, it'll fail saying "not indented". Similar behavior exists for the next three functions: same fails unless the current position and reference position are on the same line, sameOrIndented fails if the current column is strictly less than the reference column, unless they are on the same line, and checkIndent fails unless the current and reference columns match.
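Here are a few more experiments with the same runner, in the same spirit (these test cases are mine, and the exact error wording may differ between versions of indents):

>>> testParse id checkIndent ""
Right ()
>>> testParse id (spaces >> checkIndent) "   "
Left ...   -- fails: the current column no longer matches the reference column
>>> testParse id (newline >> same) "\n"
Left ...   -- fails: we are no longer on the reference line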
withPos is slightly different. It's not just an IndentParser, it's an IndentParser combinator: it transforms the input IndentParser into one that thinks the "reference column" (the SourcePos in the State) is exactly where it was when we called withPos.
This gives us another hint, btw. It lets us know we have the power to change the reference column.
(1) So now let's take a look at how block and withBlock work using our new, lower level reference column operators. withBlock is implemented in terms of block, so we'll start with block.
-- simplified from the actual source
block p = withPos $ many1 (checkIndent >> p)
So, block resets the "reference column" to be whatever the current column is and then consumes at least one parse of p, so long as each one is indented identically to this newly set "reference column". Now we can take a look at withBlock
withBlock f a p = withPos $ do
  r1 <- a
  r2 <- option [] (indented >> block p)
  return (f r1 r2)
So, it resets the "reference column" to the current column, runs the parser a once, optionally parses an indented block of ps, then combines the results using f. Your implementation is almost correct, except that you need to use withPos to choose the correct "reference column".
Then, once you have withBlock, withBlock' = withBlock (\_ bs -> bs).
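As a usage sketch (the names word and nameAndItems are mine), here is a header followed by an indented block, run through the testParse helper from above. Note that it is the trailing spaces in word that consume the newlines, so indented and checkIndent get to look at the next line's column:

word :: IndentParser String () String
word = do { w <- many1 letter; spaces; return w }

nameAndItems :: IndentParser String () (String, [String])
nameAndItems = withBlock (,) word word

>>> testParse id nameAndItems "fruits\n  apple\n  banana\n"
Right ("fruits",["apple","banana"])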
(5) So, indented and friends are exactly the tools for doing this: they'll cause a parse to fail immediately if it's indented incorrectly with respect to the "reference position" chosen by withPos.
(4) Yes, don't worry about these guys until you learn how to use Applicative style parsing in base Parsec. It's often a much cleaner, faster, simpler way of specifying parses. Sometimes they're even more powerful, but if you understand Monads then they're almost always completely equivalent.
(6) And this is the crux. The tools mentioned so far can only express indentation failure if you can describe your intended indentation using withPos. Offhand, I don't think it's possible to choose withPos based on the success or failure of other parses... so you'll have to go one level deeper. Fortunately, the mechanism that makes IndentParsers work is plain to see: it's just an inner State monad containing SourcePos. You can use lift :: MonadTrans t => m a -> t m a to manipulate this inner state and set the "reference column" however you like, as in the sketch below.
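As a hedged sketch of that last idea (setReference is my name, not part of the package), you can reach into the inner State and move the reference position to wherever the parser currently is, which is essentially what withPos does internally:

-- Control.Monad.State (already imported in the testParse snippet) provides
-- both put and lift here.
setReference :: IndentParser String () ()
setReference = do
  pos <- getPosition
  lift (put pos)   -- overwrite the SourcePos held in the inner State monad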
Cheers!

Parsec debugging

I've been working with parsec and I have trouble debugging my code. For example, I can set a breakpoint in ghci, but I'm not sure how to see how much of the input has been consumed, or things like that.
Are there tools / guidelines to help with debugging parsec code?
This page might help.
Debug.trace is your friend; it essentially lets you do printf-style debugging. It evaluates and prints its first argument and then returns its second. So if you have something like
foo :: Show a => a -> a
foo = bar . quux
You can debug the 'value' of foo's parameter by changing foo to the following:
import Debug.Trace(trace)
foo :: Show a => a -> a
foo x = bar $ quux $ trace ("x is: " ++ show x) x
foo will now work the same way as it did before, but when you call foo 1 it will now print x is: 1 to stderr when evaluated.
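Inside a Parsec parser specifically, a convenient variant of this trick (my own sketch; seeRemaining is a made-up name) is to combine trace with getInput, which returns the unconsumed input without consuming anything, answering the "how much input has been consumed" question directly:

import Debug.Trace (trace)
import Text.Parsec

-- Print a label and the next 40 characters of unconsumed input to stderr.
seeRemaining :: Monad m => String -> ParsecT String u m ()
seeRemaining label = do
  rest <- getInput
  trace (label ++ ": " ++ show (take 40 rest)) (return ())

Drop a seeRemaining "after header" between two sub-parsers to see where the parser has got to at that point.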
For more in-depth debugging, you'll want to use GHCI's debugging commands. Specifically, it sounds like you're looking for the :force command, which forces the evaluation of a variable and prints it out. (The alternative is the :print command, which prints as much of the variable as has been evaluated, without evaluating any more.)
Note that :force is more helpful in figuring out the contents of a variable, but may also change the semantics of your program (if your program depends upon laziness).
A general GHCi debugging workflow looks something like this:
1. Use :break to set breakpoints
2. Use :list and :show context to check where you are in the code
3. Use :show bindings to check the variable bindings
4. Try using :print to see what's currently bound
5. Use :force if necessary to check your bindings
If you're trying to debug an infinite loop, it also helps to use
:set -fbreak-on-error
:trace myLoopingFunc x y
Then you can hit Ctrl-C during the loop and use :history to see what's looping.
You might be able to use the <?> operator in Text.Parsec.Prim to make better error messages for you and your users. There are some examples in Real World Haskell. If your parser has good sub-parts then you could setup a few simple tests (or use HUnit) to ensure they work separately as expected.
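For example (a sketch; number is a made-up parser), labelling a sub-parser with <?> replaces the low-level expectations in the error message with something a user can act on:

number :: Monad m => ParsecT String u m Int
number = (read <$> many1 digit) <?> "number"
-- on failure the error says "expecting number" rather than "expecting digit"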
Another useful trick:
_ <- many anyChar >>= fail
This will generate an error (Left) of "unexpected end of input" followed by the remaining 'string'.
I think the parserTrace and parserTraced functions mentioned here http://hackage.haskell.org/package/parsec-3.1.13.0/docs/Text-Parsec-Combinator.html#g:1 do something similar to the above.
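They are straightforward to use; here is a hedged sketch (keyValue is my own example, and this needs parsec >= 3.1): parserTrace prints a label plus the remaining input to stderr, and parserTraced additionally reports when the wrapped parser backtracks.

keyValue :: Monad m => ParsecT String u m (String, String)
keyValue = do
  parserTrace "at start of keyValue"   -- shows the full remaining input
  k <- parserTraced "key" (many1 letter)
  _ <- char '='
  v <- parserTraced "value" (many1 alphaNum)
  return (k, v)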

High-precedence application expressions as arguments

A high precedence application expression is one in which an identifier is immediately followed by a left paren without intervening whitespace, e.g., f(g). Parentheses are required when passing these as function arguments: func (f(g)).
Section 15.2 of the spec states the grammar and precedence rules allow the unparenthesized form -- func f(g) -- but an additional check prevents this.
Why is this intentionally prohibited? It would obviate the need for excessive parentheses and piping, and generally make the code much cleaner.
A common example is
raise <| IndexOutOfRangeException()
or
raise (IndexOutOfRangeException())
could become simply
raise IndexOutOfRangeException()
I agree that the need to write the additional parentheses is a bit annoying. I think the main reason omitting them is not allowed is that adding whitespace would then change the meaning of your code in quite a significant way:
// Call 'foo' with the result of 'bar()' as an argument
foo bar()
// Call 'foo' with 'bar' as the first argument and '()' as the second
foo bar ()
There are still some rough edges where adding parens changes the evaluation (see this forum post), but that "just" changes the evaluation order. This would change the meaning of your code!

How to convert method calls to postfix notation?

I'm writing a compiler for a JavaScript-like language for fun; in other words, I'm reinventing the wheel so that I can learn how everything works, but now I'm stuck.
I know that the shunting-yard algorithm is a nice one for parsing simple infix expressions. I was able to figure out how to extend it for prefix and postfix operators too, and I'm also able to parse simple function calls.
For example: 2+3*a(3,5)+b(3,5) turns into 2 3 <G> 3 5 a () * + <G> 3 5 b () +
(<G> is a guard token that is pushed on the stack; it will store the return address, etc. () is the call command that calls the function on the top of the stack; it pops the necessary number of arguments and pushes the result back on return.)
If the function name is just one token, I can simply mark it as a function symbol if it is directly followed by a parenthesis. During the conversion, if I encounter a function symbol I push it on the operator stack and pop it off when I have finished converting the parameters.
This is working so far.
But if I add the option to have member functions (the . operator), things get trickier. For example, I want to convert a.b.c(12)+d.e.f(34). I can't mark c and f as functions because a.b.c and d.e.f are the functions. If I run my parser on an expression like this, the result will be a b . <G> 12 c () . d e . <G> 34 f () . which is obviously wrong. I want it to be <G> 12 a b . c . () <G> 34 d e . f . () which appears correct.
But of course I can make things more complicated if I add some parentheses: (a.b.c)(). Or I can make a function that returns a function, which I call again: f(a,b)(c,d).
Is there an easy way handle these tricky situations?
A problem with your approach is that you treat an object and its member as two separate tokens separated by '.'. The classical shunting-yard algorithm knows nothing about OOP and relies on a single token for a function call. So the first way to resolve your problem is to use one token for a call of an object member -- i.e. the entire a.b.c must be a single token.
You may also look at automatic parser generators for another solution to your problem. They allow you to define the complete grammar of your target language (JavaScript) as a set of formal rules and generate a parser automatically. Popular tools that generate parsers in different programming languages include ANTLR, Bison + Lex, and Lemon + Ragel.
--artem
(I saw this question is still alive. I found the solution for it myself.)
First I treat the (...) and [...] expressions as one token and expand them (recursively) when needed. Then I detect the function calls and array subscripts: if there isn't an infix operator before a parenthesized token, then it's a function call or an array subscript, so I insert a special call or access operator there. With this modification it works like a charm, as sketched below.
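Here is a hedged sketch of that token-stream pass in Haskell (all names are mine, not from any particular implementation): before running shunting yard, insert an explicit Call token whenever a parenthesized group follows something that can yield a value (an identifier or a closing parenthesis) rather than an infix operator. How Call then interacts with the guard token and operator precedence is up to the rest of the pipeline.

data Tok
  = Ident String   -- identifiers
  | Num Int
  | Op Char        -- infix operators such as '+', '*', '.'
  | LParen
  | RParen
  | Call           -- the explicit "call the value on the stack" operator
  deriving (Show, Eq)

insertCalls :: [Tok] -> [Tok]
insertCalls (t : LParen : rest)
  | endsValue t = t : Call : LParen : insertCalls rest
  where
    endsValue (Ident _) = True   -- f(...)
    endsValue RParen    = True   -- (a.b.c)(...) and f(a,b)(c,d)
    endsValue _         = False
insertCalls (t : rest) = t : insertCalls rest
insertCalls []         = []

For f(g) this turns [Ident "f", LParen, Ident "g", RParen] into [Ident "f", Call, LParen, Ident "g", RParen], and the RParen case is what makes (a.b.c)() and f(a,b)(c,d) come out right as well.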
