Examples of very concise Forth applications? [closed]

In this talk, Chuck Moore (the creator of Forth) makes some very bold, sweeping claims, such as:
"Every application that I have seen that I didn't code has ten times as much code in it as it needs"
"About a thousand instructions seems about right to me to do about anything"
"If you are writing code that needs [local variables] you are writing non-optimal code. Don't use local variables."
I'm trying to figure out whether Mr. Moore is a) an absolutely brilliant genius or b) a crackpot. But that's a subjective question, and I'm not looking for the answer to that question here. What I am looking for are examples of complex, real-world problems which can be solved in "1000 instructions or less" using Forth, and source code demonstrating how to do so. An example showing just one non-trivial piece of a real-world system would be fine, but no "toy" code samples which could be duplicated in 5 or 10 lines of another high-level language, please.
If you have written real-world systems in Forth, using just a small amount of source code, but aren't at liberty to show the source (because it is proprietary), I'd still like to hear about it.

You need to understand that Chuck Moore is a little different from you and me. He was trained in an era when a mainframe might have only 16 KB of core memory (or its equivalent), and he was able to do quite a lot with the computers of the time. Perhaps the biggest success for Forth, outside of his OKAD-II chip design package (that's not a typo), was a multi-user, multi-tasking Forth system at NRAO responsible for concurrently controlling data-acquisition instruments and data analysis/visualization software, on a fairly modestly sized computer barely able to compile Fortran source code on its own.
What he calls an "application", we might consider a "component" of a larger, more nebulous thing called an application. More generally, it's good to keep in mind that one Moore "application" is more or less equivalent to one "view" in an MVC triad today. To keep memory consumption small, he relies heavily on overlays and just-in-time compilation techniques. Switching from one program interface to another typically involves recompiling the entire application/view from source. This happens so fast you don't know it's happening. It's rather like how Android recompiles Dalvik code to native ARM code every time you activate an application today.
At any given time, OKAD-II has no more than about 2.5 KB of its code loaded into memory and running. However, the on-disk source for OKAD-II is considerably larger than 2.5 KB. Though, it is still significantly more compact than its nearest competitor, SPICE.
I'm often curious about Chuck Moore's views and find his never-ending striving for simplicity fascinating. So, in MythBusters fashion, I put his claims to the test by trying to design my own system as minimally as I could make it. I'm happy to report he's very nearly spot-on in his claims, on both hardware and software issues. Case in point: during last September's Silicon Valley Forth Interest Group (SVFIG) meeting, I used my Kestrel-2 itself to generate video for the slide deck. This required writing a slide presentation program for it, which took up 4 KB of memory for the code and 4 KB for the slide deck data structures. At an average of six bytes per Forth word (for reasons I won't go into here), 4 KB of code works out to roughly 700 definitions, which lands right around Chuck Moore's estimate of "about 1000 (Forth) instructions" for his own "applications".
If you're interested in speaking to real-world Forth coders (or those who coded Forth in the past, as increasingly seems to be the case), and you happen to be in the Bay Area, the Silicon Valley Forth Interest Group still meets every fourth Saturday of the month, except for November and December, when it's the third Saturday. If you're interested in attending a meeting, even if only to interview Forth coders and get a taste of what "real-world" Forth is like, check us out on meetup.com and tag along. We also now stream our meetings on YouTube, but we're not very good at it; we're pressing inappropriate hardware and software into service, since we have a budget of zero for this sort of thing. :)

Forth is indeed amazingly compact! Words without formal parameters (and zero-operand instructions at the hardware level - e.g. the GA144) save a lot. The other main contributor to its compactness is the absolutely relentless factoring of redundant code that the calling convention and concatenative nature afford.
I don't know if it qualifies as a non-toy example, but the Turtle Graphics implementation for the FIGnition (in FigForth) is just 307 bytes compiled and fits in a single source block! This includes the fixed-point trig and all the normal turtle commands. It isn't the best example of readable Forth, because it was squeezed into a single source block, with single-character names and such:
\ 8.8 fixed point sine table lookup
-2 var n F9F2 , E9DD , CEBD , AA95 , 7F67 , 4E34 , 1A c,
: s abs 3C mod dup 1D > if 3C swap - then dup E > if
-1 1E rot - else 1 swap then n + c# 1+ * ;
0 var x 0 var y 0 var a
0 var q 0 var w
: c 9380 C80 0 fill ; \ clear screen
: k >r 50 + 8 << r> ! ;
: m dup q # * x +! w # * y +! ; \ move n-pixels (without drawing)
: g y k x k ; \ go to x,y coord
: h dup a ! dup s w ! 2D + s q ! ; \ heading
: f >r q # x # y # w # r 0 do >r r + >r over + \ forward n-pixels
dup 8 >> r 8 >> plot r> r> loop o y ! x ! o r> o ;
: e key 0 vmode cls ; \ end
: b 1 vmode 1 pen c 0 0 g 0 h ; \ begin
: t a # + h ; \ turn n-degrees
Using it is extremely concise as well.
: sin 160 0 do i i s 4 / 80 + plot loop ;
: burst 60 0 do 0 0 g i h 110 f loop ;
: squiral -50 50 g 20 0 do 100 f 21 t loop ;
: circle 60 0 do 4 f 1 t loop ;
: spiral 15 0 do circle 4 t loop ;
: star 5 0 do 80 f 24 t loop ;
: stars 3 0 do star 20 t loop ;
: rose 0 50 0 do 2 + dup f 14 t loop ;
: hp 15 0 do 5 f 1 t loop 15 0 do 2 f -1 t loop ;
: petal hp 30 t hp 30 t ;
: flower 15 0 do petal 4 t loop ;
(shameless blog plug: http://blogs.msdn.com/b/ashleyf/archive/2012/02/18/turtle-graphics-on-the-fignition.aspx)

What is not well understood today is the way Forth anticipated an approach to coding that became popular early in the 21st century in association with agile methods. Specifically:
Forth introduced the notion of tiny method coding -- the use of small objects with small methods. You could make a case for Smalltalk and Lisp here too, but in the late 1980s both Smalltalk and Lisp practice tended toward larger and more complex methods. Forth always embraced very small methods, if only because it encouraged doing so much on the stack.
Forth, even more than Lisp, popularized the notion that the interpreter was just a little software pattern, not a dissertation-sized brick. Got a problem that's hard to code? The Forth solution had to be, "write a little language", because that's what Forth programming was.
Forth was very much a product of memory and time constraints, of an era where computers were incredibly tiny and terribly slow. It was a beautiful design that lets you build an operating system and a compiler in a matchbox.

An example of just how compact Forth can be, is Samuel Falvo's screencast Over the Shoulder 1 - Text Preprocessing in Forth (1h 06 min 25 secs, 101 MB, MPEG-1 format - at least VLC can play it). Alternative source ("Links and Resources" -> "Videos").

Forth Inc's polyFORTH/32 VAX/VMS assembler definitions took some 8 blocks of source. A VAX assembler, in 8K of source. Commented. I'm still amazed, 30 years later.
I can't verify it at the moment, but I'd guess the instruction count to parse those CODE definitions is in the low hundreds. And when I said it 'took some 8 blocks', it still takes them: the application using that nucleus is live and in production, 30 years later.

Related

Floyd's Algorithm to detect cycle in linked list proof

I've seen several proofs for Floyd's algorithm in several posts inside and outside Stack Overflow. All of them prove the second part of the algorithm, that is, why the method for finding the start of the cycle works. But none of the proofs I've seen addresses the first part, that is, why the slow pointer and the fast pointer are guaranteed to meet inside the loop at all. Why wouldn't the slow and fast pointers chase each other forever and never land on the same node? All the proofs I've seen so far either don't address this or just say "it's obvious" that the pointers will meet. I'm sorry, but I don't get why the pointers can never circle forever without meeting; to me this feels like the proof of Fermat's last theorem. Can someone prove why they'll always meet, for a loop of any length?
Let's first take an example with a cycle of odd length.
Take the loop 1->2->3->4->5, with 5 connected back to 1 (so that it forms a loop). Start a pointer at node 1 and advance it two nodes at a time. Visits happen in the following order - 1st pass: 1->3->5; the next pass starts from 2 - 2nd pass: 2->4; the next pass starts from 1 again... and so on.
So we can observe that every node is visited. This remains true even if you start the iteration from the 2nd node.
Second example, of even length: 1->2->3->4->5->6 with 6->1 (cycle). Starting node is 1; Slow => 1, Fast => 2
Slow : 1 ; Fast : 2
Slow : 2 ; Fast : 4
Slow : 3 ; Fast : 6
Slow : 4 ; Fast : 2
Slow : 5 ; Fast : 4
Slow : 6 ; Fast : 6 => Met!
So, in the odd-length case the fast pointer meets the slow one because it eventually touches every node, and in the even-length case slow and fast still meet at some node after a few iterations. The underlying reason is the same in both cases: once both pointers are inside the cycle, the fast pointer gains exactly one node on the slow pointer per step, so the gap between them (measured around the cycle) shrinks by 1 each step and must reach 0 within at most one cycle length.
I didn't find anything like a formal proof; I've just posted my observations.
Any suggestions/corrections are appreciated. :)
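To back the gap-shrinking argument with something executable, here is a small Python simulation (my own sketch, not from the post): it models both pointers by their index within a cycle of length n - ignoring the tail before the cycle - and checks that they meet for every n, odd or even.
def meet_within_cycle(cycle_len):
    # Advance slow by 1 and fast by 2 around the cycle; return steps until they meet.
    slow = fast = 0                      # both pointers already inside the cycle
    for step in range(2 * cycle_len):    # 2*cycle_len is a generous upper bound
        slow = (slow + 1) % cycle_len
        fast = (fast + 2) % cycle_len
        if slow == fast:
            return step + 1
    return None                          # never reached: the gap shrinks by 1 per step
# They meet for every cycle length, and (starting from the same node) after
# exactly cycle_len steps - the time for the one-node-per-step gap to wrap around:
assert all(meet_within_cycle(n) == n for n in range(1, 1000))
print(meet_within_cycle(5), meet_within_cycle(6))  # -> 5 6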

What instruction set would be easiest to implement on a homemade ALU?

I'm designing a basic 8- or 16-bit computer (haven't really decided yet) using EEPROM chips, SRAM, and an ALU made (mostly) out of individual transistors on a PCB using CMOS logic, which I already have partially designed and tested. I thought it would be cool to use an already existing instruction set so I can compile C++ code for it instead of writing everything in machine code.
I looked at the AVR gcc compiler on Compiler Explorer and the machine code it produces looks very simple, and I think it is only 8-bit. Or should I go for 32 bits and try to use x86? That would make the ALU a lot bigger. Are there compilers that let you restrict the instruction set so I don't have to implement every single instruction? Or would it be easier to just write an interpreter for a custom instruction set? Any advice is welcome, thank you.
After a bit of research it has become apparent that trying to recreate modern ALUs and instruction sets would be very complicated and time-consuming, so I should definitely make my own simplistic architecture; if I really want to compile C code for it, I could probably just interpret x86 or AVR assembly from gcc.
I would also love some feedback on my design. I came up with a really weird ISA last night that is focused mainly on being easy to engineer in hardware.
There are two registers feeding the ALU; all the other "registers" compute functions of those two values, all at the same time. For instance, there is a register that holds the added result of A and B, one that holds the result of A shifted right B times, a "jump if A > B" branch, and so on.
So to add two numbers it would take 3 clock cycles: you move two values from RAM into A and B, then copy the result back to RAM afterwards. It would look like this:
setA addressInRam1 (6-bit opcode, 18-bit address/value)
setB addressInRam2
copyAddedResult addressInRam1
And program code is executed directly from EEPROM. I don't know whether I should think of it as having two general-purpose registers or as having 2^18 registers. Either way, executing instructions one at a time like that makes it much easier and simpler to build. Again, any advice is welcome; I am somewhat of a noob in this field, thank you!
Oh, and there is an additional C register that holds a value to be stored into RAM on the next clock cycle, at the address specified by the set instruction. This is what the Fibonacci sequence would look like:
1: setC 1; // setting C reg to 1
2: set 0; // setting address 0 in ram to the C register
3: setA 0; // copying value in address 0 of ram into A reg
// repeat for B reg
4: set 1; // setting this to the same as the other
5: setB 1;
6: jumpIf> 9; // jump to line 9 if A > B
7: getSum 0; // put sum of A and B into address 0 of ram
8: setA 0; // set the A register to address 0 of ram
9: getSum 1; // "else" put the sum into the second variable
10: setB 1;
11: jump 6; // loop back to line 6 forever
I made a C++ equivalent and put it through Compiler Explorer, and despite the many drawbacks of this architecture it uses the same number of clock cycles as x64 in the loop, and only two more in total. But I think this function in particular works pretty well with it, as I don't have to reassign A and B often.
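To sanity-check the semantics, here is a minimal Python sketch of an emulator for this hypothetical ISA (the mnemonics are taken from the listings above; the 18-bit wraparound is my assumption from the stated opcode format, everything else is guesswork):
def run(program, max_steps=40):
    A = B = C = 0
    mem = {}
    sums = []                              # values produced by getSum, for inspection
    pc = min(program)                      # program is keyed by line number
    for _ in range(max_steps):
        op, arg = program[pc]
        pc += 1
        if op == 'setC':    C = arg                  # load literal into C
        elif op == 'set':   mem[arg] = C             # store C at RAM address
        elif op == 'setA':  A = mem[arg]             # load RAM into A
        elif op == 'setB':  B = mem[arg]
        elif op == 'getSum':                         # the always-ready A+B "result register"
            mem[arg] = (A + B) & 0x3FFFF             # assumed 18-bit wraparound
            sums.append(mem[arg])
        elif op == 'jumpIf>':
            if A > B: pc = arg
        elif op == 'jump':  pc = arg
    return sums
fib_program = {
    1: ('setC', 1),    2: ('set', 0),     3: ('setA', 0),
    4: ('set', 1),     5: ('setB', 1),
    6: ('jumpIf>', 9), 7: ('getSum', 0),  8: ('setA', 0),
    9: ('getSum', 1), 10: ('setB', 1),   11: ('jump', 6),
}
print(run(fib_program))  # -> [2, 3, 5, 8, 13, 21, ...]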

How to analyse z3 performance issues?

I have 37 similar SMT2 problems, each in two equisatisfiable versions that I call compact and unrolled. The problems use incremental SMT solving and every (check-sat) returns unsat. The compact versions use the QF_AUFBV logic, the unrolled versions QF_ABV. I ran them through z3, yices, and boolector (though boolector only supports the unrolled versions). The results of this performance evaluation can be found here:
http://scratch.clifford.at/compact_smt2_enc_r1102.html
The SMT2 files for these examples can be downloaded from here (~10 MB):
http://scratch.clifford.at/compact_smt2_enc_r1102.zip
I ran each solver 5 times with different values for the :random-seed option. (Except boolector, which does not support :random-seed, so I simply ran boolector 5 times on the same input.) The variation I get when running the solvers with different :random-seed values is relatively small (see the +/- values in the table for the max outlier).
There is a wide spread between solvers. Boolector and Yices are consistently faster than z3, in some cases up to two orders of magnitude.
However, my question is about "z3 vs z3" performance. Consider for example the following data points:
| Test Case | Z3 Median Runtime | Max Outlier |
|-------------------|-------------------|-------------|
| insn_add unrolled | 873.35 seconds | +/-  0% |
| insn_add compact | 1837.59 seconds | +/- 1% |
| insn_sub unrolled | 4395.67 seconds | +/- 16% |
| insn_sub compact | 2199.21 seconds | +/- 5% |
The problems insn_add and insn_sub are almost identical. Both are generated from Verilog using Yosys and the only difference is that insn_add is using this Verilog module and insn_sub is using this one in its place. Here is the diff between those two source files:
--- insn_add.v 2017-01-31 15:20:47.395354732 +0100
+++ insn_sub.v 2017-01-31 15:20:47.395354732 +0100
@@ -1,6 +1,6 @@
// DO NOT EDIT -- auto-generated from generate.py
-module rvfi_insn_add (
+module rvfi_insn_sub (
input rvfi_valid,
input [ 32 - 1 : 0] rvfi_insn,
input [`RISCV_FORMAL_XLEN - 1 : 0] rvfi_pc_rdata,
@@ -29,9 +29,9 @@
wire [4:0] insn_rd = rvfi_insn[11: 7];
wire [6:0] insn_opcode = rvfi_insn[ 6: 0];
- // ADD instruction
- wire [`RISCV_FORMAL_XLEN-1:0] result = rvfi_rs1_rdata + rvfi_rs2_rdata;
- assign spec_valid = rvfi_valid && insn_funct7 == 7'b 0000000 && insn_funct3 == 3'b 000 && insn_opcode == 7'b 0110011;
+ // SUB instruction
+ wire [`RISCV_FORMAL_XLEN-1:0] result = rvfi_rs1_rdata - rvfi_rs2_rdata;
+ assign spec_valid = rvfi_valid && insn_funct7 == 7'b 0100000 && insn_funct3 == 3'b 000 && insn_opcode == 7'b 0110011;
assign spec_rs1_addr = insn_rs1;
assign spec_rs2_addr = insn_rs2;
assign spec_rd_addr = insn_rd;
But their behavior in this benchmark is very different: Overall the performance for insn_sub is much worse than the performance for insn_add. Furthermore, in the case of insn_add the unrolled version runs about twice as fast as the compact version, but in the case of insn_sub the compact version runs about twice as fast as the unrolled version.
Here are the individual times from which the medians were taken. The value of :random-seed obviously does not make much of a difference:
insn_add unrolled: 868.15 873.34 873.35 873.36 874.88
insn_add compact: 1828.70 1829.32 1837.59 1843.74 1867.13
insn_sub unrolled: 3204.06 4195.10 4395.67 4539.30 4596.05
insn_sub compact: 2003.26 2187.52 2199.21 2206.04 2209.87
Since the value of :random-seed does not seem to have much of an effect, I would assume there is something intrinsic to those .smt2 files that makes them fast or slow on z3. How would I investigate this? How would I find out what makes the fast cases fast and the slow cases slow, so I can avoid whatever makes the slow cases slow? (Yes, I know that this is a very broad question. Sorry. :)
<edit>
Here are some more concrete questions along the lines of my primary question. These questions are directly inspired by the obvious differences I can see between the (compact) insn_add and insn_sub benchmarks.
Can the order of (declare-..) and (define-..) statements in my SMT input influence performance?
Can changing the names of declared or defined functions influence performance?
If I split a BV into smaller BVs, and then concatenate them back again, can this influence performance?
If I either compare two BVs for equality, or split the BVs into single-bit variables and compare each of the bits individually, can this influence performance? (A sketch of both encodings follows after this list.)
Also: which operations in z3 actually change when I choose a different value for :random-seed?
</edit>
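As a concrete illustration of the last two encoding questions, here is my own sketch using z3's Python API (the toy 32-bit variables stand in for the real benchmark terms). Both encodings in each pair are logically interchangeable, which is exactly why only the solver's internal behaviour can explain a runtime difference between them:
from z3 import *
x, y = BitVecs('x y', 32)
# Splitting a BV and concatenating it back is semantically a no-op:
reassembled = Concat(Extract(31, 16, x), Extract(15, 0, x))
prove(reassembled == x)
# Whole-vector equality vs. bit-by-bit equality - equivalent encodings:
whole = (x == y)
bitwise = And([Extract(i, i, x) == Extract(i, i, y) for i in range(32)])
prove(whole == bitwise)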
Making small changes to the .smt2 files without changing the semantics can be very difficult for large test cases generated by complex tools. I'm hoping there are other things I can try first, or maybe there is some existing expert knowledge about the kind of changes that might be worth investigating. Or alternatively: What kind of changes would effectively be equivalent to changing :random-seed and thus are not worth investigating.
(Tests performed with git rev c67cf16, i.e. current git head of z3 on an AWS EC2 c4.8xlarge instance with make -j40. The runtimes are CPU seconds, not wall-clock seconds.)
Edit 2:
I now have three test cases (test1.smt2, test2.smt2, and test3.smt2) that are identical except that I've renamed some of the functions I declare/define. The test cases can be found at http://svn.clifford.at/handicraft/2017/z3perf/.
This is a variation of the original problem that takes ~2 minutes to solve instead of ~1 hour. As before, changing the value of :random-seed has only a marginal effect, but renaming some of the functions without changing anything else changes the runtime by more than 2x.
I've now opened an issue on GitHub, arguing that :random-seed should be tied into whatever internal source of randomness is effectively being perturbed when I rename the functions in my SMT2 code.
As you say, many things could be creating that perf difference between add and sub.
A good start is to check if the formulas after preprocessing are equal modulo add/sub (btw, Z3 converts 'a - b' into 'a + (-1) * b'). If not, then trace down which preprocessing step is at fault. Then trace down the problem and send us a patch :)
Alternatively, the problem could be down the line, e.g., in the bitblaster. You can also dump the bit-blasted formulas of both of your files and check if there is a significant difference in terms of number of variables and/or clauses.
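For that bit-blasting comparison, z3's tactic framework can produce the bit-blasted goal directly. A minimal sketch via the Python API (the formulas below are toy stand-ins; parse_smt2_file can load the real benchmarks instead):
from z3 import *
a, b = BitVecs('a b', 32)
for name, expr in [('add', a + b), ('sub', a - b)]:
    g = Goal()
    g.add(expr == BitVecVal(12345, 32))
    # simplify, bit-blast, then convert to CNF so the constraints can be counted
    blasted = Then('simplify', 'bit-blast', 'tseitin-cnf')(g)
    print(name, 'constraints after bit-blasting:', len(blasted[0]))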
Anyway, you'll need to be prepared to invest a day or two (maybe more) to track down these issues. If you find something, let us know and/or send us a patch! :)

32 bit multiplication on 24 bit ALU

I want to port a 32 × 32-bit unsigned multiplication to a 24-bit DSP (it's a linear congruential generator, so I'm not allowed to truncate; I also don't want to replace the current LCG with a 24-bit one yet). The available data types are 24- and 48-bit ints.
Only the 32 least significant bits of the result are needed. Do you know any hacks to implement this in fewer multiplies, masks and shifts than the usual way?
The line looks like this:
//val is an int(32 bit)
val = (1664525 * val) + 1013904223;
An outline would be (in my current compiler style):
static uint48_t val = SEED;
...
val = 0xFFFFFFFFUL & ((1664525UL * val) + 1013904223UL);
and hopefully the compiler will recognise:
it can use a multiply and accumulate command
it only needs a reduced multiply algorithm due to the "high word" of the constant being zero
the AND could be effected by resetting the upper bits or multiplying a constant and restoring
...other stuff depends on your {mystery dsp} target
Note: if you scale the coefficients up by 2^16, you can get the truncation for free, but for lack of further info you will have to explore/decide whether that is better overall.
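A quick plain-Python check (my sketch; unbounded Python ints stand in for uint48_t, and the constants are the ones from the question) that the masked wide update reproduces the 32-bit LCG stream exactly:
M32, M48 = 0xFFFFFFFF, 0xFFFFFFFFFFFF
val32 = val48 = 42                                  # arbitrary seed
for _ in range(100000):
    val32 = (1664525 * val32 + 1013904223) & M32    # reference 32-bit LCG
    # the outline's form: the multiply-accumulate wraps mod 2^48, then mask to 32 bits
    val48 = M32 & ((1664525 * val48 + 1013904223) & M48)
    assert val48 == val32
print('streams agree for 100k steps')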
(This is more an elaboration of why two multiplications 24×24→n, with 31 < n, are enough for 32×32→min(n, 40).)
The question discloses amazingly little about the capabilities available for building a 32×21→32 multiply in fewer [24×24] multiplies, masks and shifts than the usual way: 24- and 48-bit ints & a DSP (which I read as high-throughput, not-high-latency 24×24→48).
As long as there indeed is a 24×24→48 multiply (or even a 24×24+56→56 MAC) and one factor is less than 24 bits wide, the question is all but answered: a second multiply is the compelling solution.
The usual composition of a (24<n<48) × (24<m<48) → (24<p) multiply from 24×24→48 uses three of the latter; a compiler should know as well as a coder that "the fourth multiply" would only yield bits whose significance/position exceeds the combined lengths of the lower parts of the factors.
So, is it possible to generate "the long product" using just a second 24×24→48?
Let the (bytes of the) factors be w_xyz and W_XYZ, respectively - the underscores suggesting that the w and W bytes sit as the low-significance bits of the higher-significance words when each 32-bit factor is kept as a pair of 24-bit ints. The first 24×24→48 multiply, xyz × XYZ, gives the sum of the partial products, grouped here by the significance of each product's low byte:
weight 2^32: xX
weight 2^24: xY + yX
weight 2^16: xZ + yY + zX
weight 2^8:  yZ + zY
weight 2^0:  zZ
What is still needed for the low 32 bits of the full product is
weight 2^24: wZ + zW.
This can be computed using one combined multiplication of
((w<<16)|(z & 0xff)) × ((W<<16)|(Z & 0xff)), which yields wW·2^32 + (wZ+zW)·2^16 + zZ in a 48-bit register. (Never mind the 17th bit of wZ+zW "running" into wW: everything landing at or above weight 2^32 drops out of the low 32 bits anyway.)
(In the first revision of this answer, I foolishly produced wZ and zW separately - it's their sum that is wanted in the end, anyway.)
(Annoyingly, this is about all you can do for 24×24→24 as a base operation too - beyond this "combining multiplication", you need four instead of one.)
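To make the byte bookkeeping concrete, here is a hypothetical Python model (the function name and the randomized check are mine) that computes the low 32 bits of a 32×32 product using exactly two multiplications that each fit 24×24→48:
import random
def mul32lo_two_24x24(a, b):
    # factors as w_xyz / W_XYZ: high byte (w, W) plus low 24 bits (xyz, XYZ)
    a_lo, b_lo = a & 0xFFFFFF, b & 0xFFFFFF
    p1 = a_lo * b_lo                          # 1st 24x24->48: xyz * XYZ
    c1 = ((a >> 24) << 16) | (a & 0xFF)       # (w<<16)|z, a 24-bit value
    c2 = ((b >> 24) << 16) | (b & 0xFF)       # (W<<16)|Z
    p2 = c1 * c2                              # 2nd 24x24->48: wW<<32 | (wZ+zW)<<16 | zZ
    cross = p2 >> 16                          # wZ+zW in the low bits; wW sits harmlessly higher
    return (p1 + (cross << 24)) & 0xFFFFFFFF  # only the weight-2^24 cross terms survive
for _ in range(100000):
    a, b = random.getrandbits(32), random.getrandbits(32)
    assert mul32lo_two_24x24(a, b) == (a * b) & 0xFFFFFFFF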
Another angle to explore is choosing a different PRNG.
It may have to be >24 bits (tell!).
On a 24 bit machine, XorShift* (or even XorShift+) 48/32 seems worth a look.

Finding the largest prime factor of 600851475143 [duplicate]

This question already has answers here:
Project Euler #3 in Ruby solution times out
(2 answers)
Closed 9 years ago.
I'm trying to use a program to find the largest prime factor of 600851475143. This is for Project Euler here: http://projecteuler.net/problem=3
I first attempted this with this code:
#Ruby solution for http://projecteuler.net/problem=3
#Prepared by Richard Wilson (Senjai)
#We'll keep to our functional style of approaching these problems.
def gen_prime_factors(num) # generate the prime factors of num and return them in an array
result = []
2.upto(num-1) do |i| #ASSUMPTION: num > 3
#test if num is evenly divisible by i, if so add it to the result.
result.push i if num % i == 0
puts "Prime factor found: #{i}" # get some status updates so we know there wasn't a crash
end
result #Implicit return
end
#Print the largest prime factor of 600851475143. This will always be the last value in the array so:
puts gen_prime_factors(600851475143).last #this might take a while
This is great for small numbers, but for large numbers it would take a VERY long time (and a lot of memory).
Now I took university calculus a while ago, but I'm pretty rusty and haven't kept up on my math since.
I don't want a straight up answer, but I'd like to be pointed toward resources or told what I need to learn to implement some of the algorithms I've seen around in my program.
There are a couple of problems with your solution. First of all, you never test that i is prime, so you're only finding the largest factor of the big number, not the largest prime factor. There's a Ruby library you can use: just require 'prime', and you can add an && i.prime? to your condition.
That'll fix the inaccuracy in your program, but it'll still be slow and expensive (in fact, it'll now be even more expensive). One obvious improvement is to set result = i rather than doing result.push i: since you ultimately only care about the last viable i you find, there's no reason to maintain a list of all the prime factors.
Even then, however, it's still very slow. The correct program should complete almost instantly. The key is to shrink the number you're testing up to, each time you find a prime factor. If you've found a prime factor p of your big number, then you don't need to test all the way up to the big number anymore. Your "new" big number that you want to test up to is what's left after dividing p out from the big number as many times as possible:
big_number = big_number/p**n
where n is the largest integer such that the right hand side is still a whole number. In practice, you don't need to explicitly find this n, just keep dividing by p until you stop getting a whole number.
Finally, as a SPOILER I'm including a solution below, but you can choose to ignore it if you still want to figure it out yourself.
require 'prime'
max = 600851475143; test = 3
while (max >= test) do
if (test.prime? && (max % test == 0))
best = test
max = max / test
else
test = test + 2
end
end
puts "Here's your number: #{best}"
Exercise: Prove that test.prime? can be eliminated from the if condition. [Hint: what can you say about the smallest (non-1) divisor of any number?]
Exercise: This algorithm is slow if we instead use max = 600851475145. How can it be improved to be fast for either value of max? [Hint: Find the prime factorization of 600851475145 by hand; it's easy to do and it'll make it clear why the current algorithm is slow for this number]
