What is forth.
This is not a forth tutorial but for those
who haven't heard of it before I'll try to give you some idea. If
you're interested, there are web sites devoted to forth where you can
learn the history of it and why it has no “U”. There is
probably enough material online to learn it. I think forth currently
rates about 37'th place in language popularity while Java holds first
place.
Forth is more than a language, it is (or can be) an IDE,
OS, complier, interpreter, assembler, runtime kernel and
monitor/debugger.
On the right processor this can be quite small.
The 6511 kernel was 4KB, my 32 bit ARM kernel with compiler is
currently 11KB – note this is under 3K, 32 bit words.
Stacks
of trouble.
Forth is stack based and uses RPN (reverse
polish notation) like the HP scientific calculators did. Every forth
has at least two stacks – a data stack and a return stack –
some forths (like mine) have a lot more than two.
Stacks are often
visualized as building up like a stacks of plates. The last item
placed there is on top.
So to add 1 and 2 you don't type “1+2=”
you type “1 2 + .” . Every word or number must be
separated with a space or tab. The “1 2” pushes the
numbers onto the data stack so the 2 will be on top and the 1 will be
below it. The “+” will add the top two stack items
together leaving the result on the stack and the “.” is
the forth word for printing the top of the stack as a number (and
removing it from the stack).
The numbers on the stack are
often memory addresses. For example if DEMOV is a forth variable “1
DEMOV !” would store a 1 into the variable.
The “1”
pushes 1 onto the stack. “ DEMOV” pushes the address of
the storage location onto the stack and “!” is the word
which store the second stack item to the address on top (and removes
the two items from the stack). Forth has a rich set of words for
manipulating the stack in various ways. Common things you do include
duplicating the stack items (DUP), swapping top and second (SWAP),
dropping unwanted items (DROP) etc etc.
I would not say writing
forth code in forth is easy – but I think forth has enough
redeeming features to make up for it.
Words
are enough.
Every string entered into forth is either
a word in the forth dictionary, a number or a string forth is
expecting as input for some reason.
The standard forth does not
split words into parts unless it is trying to interpret it as a
number.
In the first example “1 2 + .”
- 1 and 2 are numbers, “+” and “.” are words
which exist in the forth dictionary.
If we enter a word forth
doesn't know for instance “hello” - we will get an error
message.
But we could say ' .” hello” ' - where ' .”
' is a predefined forth word which expects a string (which it later
prints).
Or “VARIABLE hello” where “VARIABLE”
is the forth word for defining new variables.
The defined word exist in one or more
dictionaries AKA vocabularies. Usually they are simple linked
lists.
There are a number of standard forth words which create new
dictionary entries. You can also define your own word creators AKA
defining words.
Examples of defining words are CREATE CODE CONSTANT VARIABLE and : .
One of the simplest words in the dictionary is “@” - this loads a value from the address on top of data stack and replaces the top with this value. It is defined in assembler and is only two instructions long.
CODE @
RTOP RTOP LDR,
LR PC
MOV,
END-CODE
The word “CODE” creates a new
dictionary entry for the word “@”, the two lines of
assembler generate the machine code that the processor will execute
when the word “@” is called. “END-CODE”
finishes the definition.
Let's see what the word looks like in
memory (as a hex dump). You don't have to understand this low level
stuff to write forth but it helps.
200A3C
28 A 20 0 0
1 40 58 0
60 16 E4 E F0 A0 E1
200A4C 3C
A 20 0 0 2
43 40 0 60
56 E4 E F0 A0 E1
I've
coloured it to show the different parts.
The word uses 16 bytes of
memory, the word header is at address 200A3C.
The 28 A 20 0 is a
32 link to the previous (older) word header. This is lowest
byte first so it points to 200A28. The next zero is a flag
field. The 1 is the number of characters in the word and 40 is
the ascii code for “@” in other words
a counted string.
The actual machine code is 0 60 16 E4 and E F0 A0 E1, ARM instructions must be on word (as in 32 bits) boundaries so the 58 is padding and does nothing.
On the
next line of the hex dump is the header for the word “C@”
starting with a link to 200A3C (the header for “@”).
That
is the stuff forth is made of – one or more linked lists of
word headers and code.
Compiler
and assembler words are are much the same as regular words –
new ones can be defined or existing ones redefined to change the way
the compiler or assembler works. Forth can be written in forth (mine
is) so forth can build a new version of itself.
A current emforth
for arm vocabulary is here this may be
incomplete because I haven't finished it yet.
The
interpretation.
Forth can be an interpreter. The
interpreter is very simple. All it has to do is input a string from
the input stream and scan the dictionaries to see if it exists.
If
the word exists the code part of the word is executed by the CPU. If
it doesn't exist it is assumed to be a number. If the string can't be
converted into a number we have an error condition. The source for
the interpret loop is here.
Compiling.
Not
all forth compilers actually compile into machine code. Many compile
pointers to the code the be executed and this set of pointers is
processed by a sort of interpreter. This can have advantages at the
expense of speed. My arm forth compiles to machine code.
The
compiler loop is not much more complex than the interpreter. One
difference is that it checks we are in compiler mode and exits if
not. The other is that when it finds a word in the dictionary it
check to see if it is a compiler word or not. Compiler (immediate)
words are executed other words are compiled. To compile a word it
simply inserts a sub-routine call to the target code into the word
being compiled. The source for the compiler loop is
here.
A trivial example. If we wanted to square the number 12 we could type “ 12 DUP * .” into the interpreter and it would print the answer for us.
Or we
could define a new word which does squares. To do this we could type
“: SQR DUP * ;”
What happens in “:” is a
word which creates a new header for us, changes the mode to compile,
adds code the push the link-register (the return address) and runs
the compiler loop. The compiler compiles calls (branch and link
instructions) to DUP and * and the “;” inserts the return
instruction ( a stack pop into the program counter) and changes the
mode back to interpret.
The hex dump looks like this.
202D60
30 2D 20 0 0
3 53 51 52 B2 CD 43 0
40 2D E9
202D70 8F F6 FF EB 5 F7 FF
EB 0 80 BD E8
The
header is exactly the same format as the example for “@”
with the machine code starting at address 202D6C
This is four 32
bit ARM instructions, a stack push, the two calls and the stack pop.
We can
now type “12 SQR .” and get the same answer as before. We
can also use SQR in other definitions.
For example we could define
a word which outputs every square for a range of numbers.
: DEMO
DO I SQR . LOOP ;
Typing
“10 0 DEMO”
results in the output “0 1 4 9 16 25
36 49 64 81”
“DO-LOOP” is one of the standard
forth control structures and “I” copies the loop counter
onto the data stack.
Forth uses a single pass compiler (and
assembler) you generally only call older words – forward
references are difficult.
Because emforth compiles to machine code it is very easy to mix high level forth with assembler. It is also easy to add some code optimization.
Optimization.
A
simple example of an optimization possibility is using the “@”
code described above. If we add “@” to one of our
definitions our simple compiler simply compiles a call to “@”.
All the work “@” does is performed by one instruction –
so instead of compiling a 32 instruction to call “@” we
could compile “RTOP RTOP LDR,” instead and remove the
call overhead. There are lots of tweaks like that which improve
execution speed or code size at the expense of have a larger, slower,
more complex compiler.
Assembly.
Forth
usually include an assembler but this could be omitted in some cases.
This assembler is quite different from the traditional two pass ones.
The syntax looks reversed with the registers and options coming
before the instruction.
For example where a line of normal
assembler might look like this
LDR R1, =PMC_MOR_Val
Forth might look like this.
PMC_MOR_Val
# R1 MOV,
The other big difference is forth doesn't use labels
to define control structures.
For a comparison of the two styles
a fragment of the assembler code used it initialize the SAM7
oscillator is here with the regular syntax in
the comment field.
Extensions.
Forth
can be extended to include new data types, syntax, control structures
etc. My main extension is the addition of objects.
Emforth (when
completed) can define objects with data encapsulation, method
inheritance, late or early binding and polymorphism.