Implement SWAP in Forth

I saw that in an interview with Chuck Moore, he says:
The words that manipulate that stack are DUP, DROP and OVER period.
There's no, well SWAP is very convenient and you want it, but it isn't
a machine instruction.
So I tried to implement SWAP in terms of only DUP, DROP and OVER, but couldn't figure out how to do it, at least not without growing the stack.
How is that done, really?

You are right; it seems hard or impossible with just dup, drop, and over.
I would guess the i21 probably also has some kind of return stack manipulation, so this would work:
: swap over 2>r drop 2r> ;
Edit: On the GA144, which also doesn't have a native swap, it's implemented as:
over push over or or pop
Push and pop refer to the return stack; or is actually xor. See http://www.colorforth.com/inst.htm
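To see why the return-stack trick works, here is a toy C sketch (not taken from any Forth implementation) that models the data and return stacks and runs over 2>r drop 2r> on them:
#include <stdio.h>

static int ds[16], rs[16];     /* data stack and return stack */
static int dsp = 0, rsp = 0;   /* stack pointers */

static void push(int x) { ds[dsp++] = x; }
static int  pop(void)   { return ds[--dsp]; }
static void over(void)  { push(ds[dsp - 2]); }  /* ( a b -- a b a ) */
static void drop(void)  { (void)pop(); }        /* ( a -- ) */
static void two_to_r(void)                      /* 2>R ( a b -- ) ( R: -- a b ) */
{
    int b = pop(), a = pop();
    rs[rsp++] = a; rs[rsp++] = b;
}
static void two_r_from(void)                    /* 2R> ( R: a b -- ) ( -- a b ) */
{
    int b = rs[--rsp], a = rs[--rsp];
    push(a); push(b);
}

int main(void)
{
    push(1); push(2);                           /* data stack: 1 2, with 2 on top */
    over(); two_to_r(); drop(); two_r_from();   /* the proposed SWAP */
    int top = pop(), next = pop();
    printf("top=%d next=%d\n", top, next);      /* prints top=1 next=2: swapped */
    return 0;
}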

In Standard Forth it is
: swap ( a b -- b a ) >r >r 2r> ;
or
: swap ( a b -- b a ) 0 rot nip ;
or
: swap ( a b -- b a ) 0 rot + ;
or
: swap ( a b -- b a ) 0 rot or ;
The first definition routes both cells through the return stack and brings them back in swapped order with 2r>. In the other three, 0 rot turns a b into b 0 a, and the final word disposes of the zero: nip drops it, while + and or fold it into a, since a+0 = a and a or 0 = a.

This remark of Charles Moore can easily be misunderstood, because it is made in the context of his Forth processors. SWAP is not a machine instruction for a hardware Forth processor. In general, Forth definitions are built from other definitions, but this ends with certain so-called primitives. In a Forth processor those are implemented in hardware, but in all Forth implementations on e.g. host systems or single-board computers they are implemented by a sequence of machine instructions, e.g. for Intel:
CODE SWAP pop, ax pop, bx push, ax push, bx END-CODE
He also uses the term "convenient", because SWAP is often avoidable. It arises when you must handle two data items that are not in the order you want. SWAP imposes a mental burden, because you must track the changed stack order in your head. One can often keep the stack straight by using an auxiliary stack to temporarily hold an item you don't need right now. Or, if you need an item twice, OVER is preferable. Or a word can be defined differently, with its parameters in a different order.
Going out of your way to implement SWAP in terms of 4 FORTH words instead of 4 machine instructions is clearly counterproductive, because each of those FORTH words must itself have been implemented by a couple of machine instructions.

Related

x87 FPU and integer arithmetic?

I'm trying to understand using the FPU for 64-bit integer arithmetic. I write this (ATT syntax):
fildq A
fildq B
faddp
fistpq C
The result in C is A + B + 1. If I start with an "finit" instruction, it gives me the correct value A + B. I thought that the unwanted +1 was maybe because it was adding in a carry bit, but using gdb I see no difference at all in the FPU control registers when I use finit from when I don't -- in both cases the control register starts off as 0x27F, the tag register is 0xFFFF (= stack empty), and all the others (including the status register, where all the condition bits are located) are zero.
Using finit seems a bit of a blunt instrument here, and I'm also wondering where the extra +1 is coming from if I don't use it, given that all the FPU registers seem to have the same values in both cases. Can anyone shed any light on this for me?
[…] I see no difference at all in the FPU control registers when I use finit from when I don't -- in both cases the control register starts off as 0x27F […]
Are you sure?
finit is supposed to load 0x37F, one additional bit set in comparison to 0x27F.
The difference is in the precision control field.
The default 0x37F selects 64-bit precision (the full significand of the 80-bit extended format), whilst your observed 0x27F selects 53-bit double precision.
The result in C is A + B + 1. […]
Using finit seems a bit of a blunt instrument here, and I'm also wondering where the extra +1 is coming from if I don't use it, […]
With sufficiently large A and B you're likely seeing a loss of precision in fadd: at 53-bit precision, a sum that needs more significand bits gets rounded, and rounding can bump the integer result, here by one.
Unmasking the precision exception will confirm this.
I think you were using the inline assembly capabilities of your favorite compiler. This is certainly convenient if you don't want to bother with menial tasks, yet apparently your compiler's run-time system loads 0x27F at startup for compatibility reasons. Study its manual (and possibly source code) for details.
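If finit feels too blunt, a narrower fix is to set only the precision-control field of the control word with fldcw. A sketch, assuming GCC-style inline assembly on x86 (the function and variable names here are mine):
#include <stdint.h>
#include <stdio.h>

int64_t add_i64_x87(int64_t a, int64_t b)
{
    uint16_t saved, extended;
    int64_t c;

    __asm__ volatile ("fstcw %0" : "=m"(saved));      /* read the current control word */
    extended = saved | 0x0300;                        /* PC = 11b: 64-bit significand */
    __asm__ volatile ("fldcw %0" : : "m"(extended));

    __asm__ volatile ("fildq %1\n\t"                  /* the original sequence */
                      "fildq %2\n\t"
                      "faddp\n\t"
                      "fistpq %0"
                      : "=m"(c)
                      : "m"(a), "m"(b));

    __asm__ volatile ("fldcw %0" : : "m"(saved));     /* restore the caller's setting */
    return c;
}

int main(void)
{
    int64_t a = (int64_t)1 << 62, b = 1;   /* the exact sum needs 63 significand bits */
    printf("%lld\n", (long long)add_i64_x87(a, b));
    return 0;
}
At 53-bit precision the sum above would be rounded (here down to 2^62); with the full 64-bit significand it comes out exact.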

What instruction set would be easiest to implement on a homemade ALU?

I'm designing a basic 8- or 16-bit computer (haven't really decided yet) using EEPROM chips, SRAM, and an ALU made (mostly) out of individual transistors on a PCB using CMOS logic, which I already have partially designed and tested. And I thought it would be cool to use an already existing instruction set so I can compile C++ code for it instead of writing everything in machine code.
I looked at the AVR gcc compiler on Compiler Explorer and the machine code it produces; it looks very simple, and I think it is only 8-bit. Or should I go for 32 bits and try to use x86? That would make the ALU a lot bigger. Are there compilers that let you use a limited set of instructions so I don't have to implement every single one? Or would it even be easier to just write an interpreter for a custom instruction set? Any advice is welcome, thank you.
After a bit of research it has become apparent that trying to recreate modern ALUs and instruction sets would be very complicated and time-consuming, so I should definitely make my own simplistic architecture; if I really want to compile C code for it, I could probably just interpret x86 or AVR assembly from gcc.
I would also love some feedback on my design: I came up with a really weird ISA last night that is focused mainly on making the hardware easy to engineer.
There are two registers in the ALU; all the other registers compute functions of those two numbers, all at the same time. For instance, there is a register that holds the sum of A and B, one that holds the result of A shifted right B times, a "jump if A > B" branch, and so on.
And so an addition would take 3 clock cycles: you would move two values from RAM into A and B, then copy the result back to RAM afterwards. It would look like this:
setA addressInRam1 (6-bit opcode, 18-bit address/value)
setB addressInRam2
copyAddedResult addressInRam1
And program code is executed directly from EEPROM memory. I don't know if I should think of it as having two general-purpose registers or as having 2^18 registers. Either way, it makes the machine much easier and simpler to build when you're executing instructions one at a time like that. Again, any advice is welcome; I am somewhat of a noob in this field, thank you!
Oh, and there is an additional C register that holds a value to be stored into RAM on the next clock cycle, at the address given by the set instruction. This is what the Fibonacci sequence would look like:
1: setC 1; // setting C reg to 1
2: set 0; // setting address 0 in ram to the C register
3: setA 0; // copying value in address 0 of ram into A reg
// repeat for B reg
4: set 1; // setting this to the same as the other
5: setB 1;
6: jumpIf> 9; // jump to line 9 if A > B
7: getSum 0; // put sum of A and B into address 0 of ram
8: setA 0; // set the A register to address 0 of ram
9: getSum 1; // "else" put the sum into the second variable
10: setB 1;
11: jump 6; // loop back to line 6 forever
I made a C++ equivalent and put it through Compiler Explorer, and despite the many drawbacks of this architecture it uses the same number of clock cycles in the loop as x64, and only two more in total. But I think this function in particular works pretty well with it, as I don't have to reassign A and B often.
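For reference, here is a rough C rendering of the listing above (a guess at such an equivalent, not the original poster's code; ram0 and ram1 stand in for RAM addresses 0 and 1, and the listed fall-through behaviour is assumed):
#include <stdio.h>

int main(void)
{
    unsigned ram0 = 1, ram1 = 1;        /* lines 1-5: both cells seeded via C */
    unsigned A = ram0, B = ram1;
    for (;;) {                          /* lines 6-11 */
        if (!(A > B)) {                 /* line 6: jumpIf> 9 skips when A > B */
            ram0 = A + B;               /* line 7: getSum 0 */
            A = ram0;                   /* line 8: setA 0 */
        }
        ram1 = A + B;                   /* line 9: getSum 1 */
        B = ram1;                       /* line 10: setB 1 */
        printf("%u %u\n", ram0, ram1);  /* the two cells walk the Fibonacci sequence */
    }
}
Tracing a few iterations prints 2 3, 5 8, 13 21, ...: taken together, the two cells produce consecutive Fibonacci numbers.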

Memory usage in assembler 8086

I made a program in assembler 8086 for my class and everything is working just fine.
But besides making a working program, we have to make it use as little memory as possible. Could you give me some tips in that aspect? What should I write and what should I avoid?
The program is supposed to first print the letter A on the screen, then on every new line print two more letters of the next letter of the alphabet, stopping at Z, and end after any key is pressed. For waiting until a key is pressed I'm using:
mov ah,00h
int 16h
Is it a good way to do it?
Most of what you want can be done in zero memory (counting only data, not the code itself). In general:
use registers rather than variables in memory
do not use push/pop
do not use subroutines
But to interact with the OS, you need to make BIOS calls and/or OS system calls; these require some memory (typically a small amount of stack space). In your case, you have to:
output characters to screen
wait for keypress
exit back to the OS
However, if you are serious about doing this in minimal memory, then there are a few hacks you can use.
Output characters to screen
On a PC, in traditional text mode, you can write characters straight to video RAM (address B800:0000 and further). This requires zero memory.
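As an illustration in C, here is the same trick, assuming a 16-bit DOS compiler with far pointers and MK_FP in <dos.h> (e.g. Turbo C); in pure assembly you would simply point ES at B800h and use STOSW:
#include <dos.h>

/* Write one character directly into 80x25 colour text-mode video RAM. */
void put_cell(int row, int col, char ch, unsigned char attr)
{
    unsigned char far *vram = (unsigned char far *) MK_FP(0xB800, 0);
    unsigned offset = (row * 80 + col) * 2;   /* 2 bytes per cell: char, attribute */
    vram[offset]     = ch;
    vram[offset + 1] = attr;                  /* e.g. 0x07 = light grey on black */
}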
Wait for keypress
The cheapest way is to wait for a change of the BIOS keyboard buffer head (a change of the 16-bit content at address 041A hex). This requires zero memory.
See also: http://support.microsoft.com/kb/60140
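Again as a C sketch under the same assumptions, polling the BIOS keyboard buffer head at 0040:001A:
#include <dos.h>

/* Spin until the BIOS keyboard buffer head pointer changes,
   i.e. until a key arrives; the assembly version of this
   needs no stack or data memory at all. */
void wait_for_key(void)
{
    volatile unsigned far *head = (volatile unsigned far *) MK_FP(0x0040, 0x001A);
    unsigned old = *head;
    while (*head == old)
        ;
}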
Exit back to the OS
Try a simple ret; it is not recommended, but in a .COM program it can work because DOS pushes a zero word on the stack, so ret jumps to offset 0 of the PSP, which holds an INT 20h instruction. An even uglier escape is to jump to F000:FFF0, which will reboot the machine. That's guaranteed to work in zero memory.
Use these instructions:
INC (Register*) instead of ADD (Register*), 1
DEC (Register*) instead of SUB (Register*), 1
XOR (Register)(same register) instead of MOV (Register), 0 (Doesn't work with variables)
SHR (Register*), 1 instead of DIV (Register*), 2
SHR (Register*), 2 instead of DIV (Register*), 4
..
SHL (Register*), 1 instead of MUL (Register*), 2
..
*Register or variable
These substitutions make the program both faster and smaller: for example, INC AX is one byte where ADD AX,1 is three, and XOR AX,AX is two bytes where MOV AX,0 is three.

x86 memory ordering: Loads Reordered with Earlier Stores vs. Intra-Processor Forwarding

I am trying to understand section 8.2 of Intel's System Programming Guide (that's Vol 3 in the PDF).
In particular, I see two different reordering scenarios:
8.2.3.4 Loads May Be Reordered with Earlier Stores to Different Locations
and
8.2.3.5 Intra-Processor Forwarding Is Allowed
However, I do not understand the difference between these scenarios from the observable-effects point of view. The examples provided in those sections seem interchangeable to me: the 8.2.3.4 example can be explained by the 8.2.3.5 rule just as well as by its own rule, and the converse seems true as well, although I am less sure in that case.
So here is my question: are there better examples or explanations how the observable effects of 8.2.3.4 are different from observable effects of 8.2.3.5?
The example at 8.2.3.5 should be "surprising" if you expect memory ordering to be all strict and clean, even if you acknowledge that 8.2.3.4 allows loads to reorder with stores to different addresses.
Processor 0 | Processor 1
--------------------------------------
mov [x],1 | mov [y],1
mov R1, [x] | mov R3,[y]
mov R2, [y] | mov R4,[x]
Note that the key part is that the newly added loads in the middle both return 1 (store-to-load forwarding makes that possible in the uarch without stalling). So in theory, you would expect that both stores have been "observed" globally by the time both these loads completed (that would have been the case with sequential consistency, where there is a unique ordering between stores and all cores see it).
However, having later R2 = R4 = 0 as a valid outcome proves this is not the case - the stores are in fact observed locally first. In other words, allowing this outcome means that processor 0 sees the stores as time(x) < time(y), while processor 1 sees the opposite.
This is a very important observation about the consistency of this memory model, which the previous example doesn't prove. This nuance is the biggest difference between Sequential Consistency and Total Store Ordering - the second example breaks SC, the first one doesn't.
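If you want to watch this happen, here is a C11 litmus-test sketch of the 8.2.3.5 example using pthreads (the names and iteration count are mine, not from Intel's manual). Release/acquire operations compile to plain MOVs on x86, so the r2 == r4 == 0 outcome shows up on real hardware; making everything memory_order_seq_cst inserts the fences that forbid it:
/* build with: cc -O2 -pthread sb_litmus.c */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x, y;
static int r1, r2, r3, r4;

static void *proc0(void *arg)
{
    atomic_store_explicit(&x, 1, memory_order_release);
    r1 = atomic_load_explicit(&x, memory_order_acquire);  /* forwarded: always 1 */
    r2 = atomic_load_explicit(&y, memory_order_acquire);
    return NULL;
}

static void *proc1(void *arg)
{
    atomic_store_explicit(&y, 1, memory_order_release);
    r3 = atomic_load_explicit(&y, memory_order_acquire);  /* forwarded: always 1 */
    r4 = atomic_load_explicit(&x, memory_order_acquire);
    return NULL;
}

int main(void)
{
    for (long i = 0; i < 100000; i++) {
        atomic_store(&x, 0);
        atomic_store(&y, 0);
        pthread_t t0, t1;
        pthread_create(&t0, NULL, proc0, NULL);
        pthread_create(&t1, NULL, proc1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        if (r2 == 0 && r4 == 0)   /* each store observed locally before globally */
            printf("iteration %ld: r2 == r4 == 0\n", i);
    }
    return 0;
}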

How does a stackless language work?

I've heard of stackless languages. However I don't have any idea how such a language would be implemented. Can someone explain?
The modern operating systems we have (Windows, Linux) operate with what I call the "big stack model". And that model is wrong, sometimes, and motivates the need for "stackless" languages.
The "big stack model" assumes that a compiled program will allocate "stack frames" for function calls in a contiguous region of memory, using machine instructions to adjust registers containing the stack pointer (and optional stack frame pointer) very rapidly. This leads to fast function call/return, at the price of having a large, contiguous region for the stack. Because 99.99% of all programs run under these modern OSes work well with the big stack model, the compilers, loaders, and even the OS "know" about this stack area.
One common problem all such applications have is "how big should my stack be?". With memory being dirt cheap, mostly what happens is that a large chunk is set aside for the stack (MS defaults to 1MB), and typical application call structure never gets anywhere near using it up. But if an application does use it all up, it dies with an illegal memory reference ("I'm sorry, Dave, I can't do that") by virtue of reaching off the end of its stack.
Most so-called "stackless" languages aren't really stackless. They just don't use the contiguous stack provided by these systems. What they do instead is allocate a stack frame from the heap on each function call. The cost per function call goes up somewhat; if functions are typically complex, or the language is interpretive, this additional cost is insignificant. (One can also determine call DAGs in the program call graph and allocate a heap segment to cover an entire DAG; that way you get both heap allocation and the speed of classic big-stack function calls for all calls inside the DAG.)
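As a toy illustration of the idea (a sketch of the general technique, not any particular language's runtime), here is fib with its activation records malloc'd from the heap and linked through a caller pointer, so the hardware stack stays flat however deep the "recursion" goes:
#include <stdlib.h>
#include <stdio.h>

struct frame {
    int n;                 /* the argument */
    int state;             /* resume point: 0 = entry, 1/2 = after each call */
    int partial;           /* holds fib(n-1) while fib(n-2) is computed */
    struct frame *caller;  /* dynamic link, replacing the pushed return address */
};

static struct frame *call(int n, struct frame *caller)
{
    struct frame *f = malloc(sizeof *f);   /* "push" a frame, on the heap */
    f->n = n; f->state = 0; f->caller = caller;
    return f;
}

static int fib(int n)      /* fib(n) = n for n <= 2, as in the example below */
{
    struct frame *top = call(n, NULL);
    int retval = 0;
    while (top) {
        struct frame *f = top;
        switch (f->state) {
        case 0:
            if (f->n <= 2) {                  /* base case: "return n" */
                retval = f->n;
                top = f->caller; free(f);     /* pop this frame */
            } else {
                f->state = 1;                 /* remember where to resume */
                top = call(f->n - 1, f);      /* "call" fib(n-1) */
            }
            break;
        case 1:                               /* back from fib(n-1) */
            f->partial = retval;
            f->state = 2;
            top = call(f->n - 2, f);          /* "call" fib(n-2) */
            break;
        case 2:                               /* back from fib(n-2) */
            retval += f->partial;
            top = f->caller; free(f);         /* pop and return the sum */
            break;
        }
    }
    return retval;
}

int main(void)
{
    printf("%d\n", fib(10));  /* prints 89 with this base case */
    return 0;
}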
There are several reasons for using heap allocation for stack frames:
1) The program does deep recursion that depends on the specific problem it is solving. It is very hard to preallocate a "big stack" area in advance, because the needed size isn't known. One can awkwardly arrange for function calls to check whether there is enough stack left and, if not, reallocate a bigger chunk, copy the old stack, and readjust all the pointers into it; that's so awkward that I don't know of any implementations. Allocating stack frames means the application never has to say it's sorry until there is literally no allocatable memory left.
2) The program forks subtasks. Each subtask requires its own stack, and therefore can't use the one "big stack" provided. So one needs to allocate stacks for each subtask. If you have thousands of possible subtasks, you might now need thousands of "big stacks", and the memory demand suddenly gets ridiculous. Allocating stack frames solves this problem. Often the subtask "stacks" refer back to their parent tasks to implement lexical scoping; as subtasks fork, a tree of "substacks" is created, called a "cactus stack".
3) Your language has continuations. These require that the data in lexical scope visible to the current function somehow be preserved for later reuse. This can be implemented by copying parent stack frames, climbing up the cactus stack, and proceeding.
The PARLANSE programming language I implemented does 1) and 2). I'm working on 3). It is amusing to note that PARLANSE allocates stack frames from a very fast-access per-thread heap; it costs typically 4 machine instructions. The current implementation is x86 based, and the allocated frame is placed in the x86 EBP/ESP registers much like other conventional x86 based language implementations. So it does use the hardware "contiguous stack" (including pushing and popping), just in chunks. It also generates "frame local" subroutine calls that don't switch stacks for lots of generated utility code where the stack demand is known in advance.
Stackless Python still has a Python stack (though it may have tail call optimization and other call frame merging tricks), but it is completely divorced from the C stack of the interpreter.
Haskell (as commonly implemented) does not have a call stack; evaluation is based on graph reduction.
There is a nice article about the language framework Parrot. Parrot does not use the stack for calling and this article explains the technique a bit.
In the stackless environments I'm more or less familiar with (Turing machine, assembly, and Brainfuck), it's common to implement your own stack. There is nothing fundamental about having a stack built into the language.
In the most practical of these, assembly, you just choose a region of memory available to you, set the stack register to point to the bottom, then increment or decrement to implement your pushes and pops.
EDIT: I know some architectures have dedicated stacks, but they aren't necessary.
Call me ancient, but I can remember when the FORTRAN standards and COBOL did not support recursive calls, and therefore didn't require a stack. Indeed, I recall the implementations for CDC 6000 series machines where there wasn't a stack, and FORTRAN would do strange things if you tried to call a subroutine recursively.
For the record, instead of a call stack, the CDC 6000 series instruction set used the RJ instruction to call a subroutine. This saved the current PC value at the call-target location and then branched to the location following it. At the end, the subroutine would perform an indirect jump to the call-target location, which reloaded the saved PC, effectively returning to the caller.
Obviously, that does not work with recursive calls. (And my recollection is that the CDC FORTRAN IV compiler would generate broken code if you did attempt recursion ...)
There is an easy to understand description of continuations on this article: http://www.defmacro.org/ramblings/fp.html
Continuations are something you can pass into a function in a stack-based language, but which can also be used by a language's own semantics to make it "stackless". Of course the stack is still there, but as Ira Baxter described, it's not one big contiguous segment.
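A minimal taste of the idea in C (a toy example of continuation-passing style, not from the article): instead of returning, a function passes its result forward to an explicit continuation. In a language with proper tail calls, such calls never need the stack to grow:
#include <stdio.h>

typedef void (*cont)(int result);

/* "Return" a + b by invoking the continuation k instead of using return. */
static void add_cps(int a, int b, cont k) { k(a + b); }

static void print_sum(int r) { printf("sum = %d\n", r); }

int main(void)
{
    add_cps(2, 3, print_sum);   /* prints "sum = 5" */
    return 0;
}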
Say you wanted to implement stackless C. The first thing to realize is that this doesn't need a stack:
a == b
But, does this?
isequal(a, b) { return a == b; }
No. Because a smart compiler will inline calls to isequal, turning them into a == b. So, why not just inline everything? Sure, you will generate more code but if getting rid of the stack is worth it to you then this is easy with a small tradeoff.
What about recursion? No problem. A recursive function like:
bang(x) { return x == 1 ? 1 : x * bang(x-1); }
can still be inlined, because really it's just a loop in disguise:
bang(x) {
    for (int i = x - 1; i >= 1; i--) x *= i;
    return x;
}
In theory a really smart compiler could figure that out for you. But a less-smart one could still flatten it as a goto:
ax = x;
NOTDONE:
if (ax > 1) {
    x = x * (--ax);
    goto NOTDONE;
}
There is one case where you have to make a small trade off. This can't be inlined:
fib(n) { return n <= 2 ? n : fib(n-1) + fib(n-2); }
Stackless C simply cannot do this. Are you giving up a lot? Not really. This is something normal C can't do very well either. If you don't believe me, just call fib(1000) and see what happens to your precious computer.
Please feel free to correct me if I'm wrong, but I would think that allocating memory on the heap for each function-call frame would cause extreme memory thrashing. The operating system does, after all, have to manage this memory. I would think that the way to avoid this thrashing would be a cache for call frames. So if you need a cache anyway, we might as well make it contiguous in memory and call it a stack.
