\documentclass[a4paper,12pt]{report}
\usepackage{color}
\usepackage[cm]{fullpage}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{listings}
\usepackage{pdfpages}

\title{YULE: Yet Unnamed Lisp Evaluator}
\author{Marius Gavrilescu}
\date{May 2018}

\begin{document}
% title page does not need anchors
\hypersetup{pageanchor=false}
\maketitle
\hypersetup{pageanchor=true}
\lstset{language=Perl, frame=leftline, framexleftmargin=5pt,
  showstringspaces=false, showspaces=false, breaklines=true,
  postbreak=\mbox{\textcolor{red}{$\hookrightarrow$}\space}}

\chapter*{Abstract}
Lisp machines are computers designed to run Lisp programs efficiently.
They were first developed in the 1970s, and they have special hardware
support for Lisp. This usually comprises a tagged architecture (each
pointer has a ``type'' field), garbage collection implemented in
hardware, and an instruction set that lends itself well to evaluating
Lisp expressions.

Between 1975 and 1980, Guy L. Steele and Gerald Jay Sussman wrote a
series of memos on lambda calculus, the Lisp language and its
implementation known collectively as the ``Lambda Papers''. One of
these papers, ``Design of LISP-Based Processors (1979)'' describes how
to implement a Lisp machine processor. If this processor is connected
to a RAM that contains a program, it executes that program and places
the result back into the RAM.

The aim of YULE is to take the design from the 1979, implement it
today using modern technology and techniques, and bundle it up in a
form that is both easy to use and extend. By going through this
modernisation process, future developers can easily integrate the Lisp
processor into their own design, modifying it as they please.

The end result of YULE is a small device that can be plugged into a
USB port, and an associated computer program. Running the program
presents a simplified Lisp listener. The user can type a Lisp
expression into the listener, and the listener will respond with the
value of the expression. Under the hood, the listener evaluates Lisp
by sending the expression over USB to the small device and getting
back the result.

YULE comes with the design files for the hardware, and source code for
the software. All of these are released under a free software license
and published online, which makes it very easy (from both a practical
and a legal standpoint) for future developers to use in their own
projects and modify to fit their purposes better.
\tableofcontents

\chapter{Introduction}
\section{Motivation}
Lisp machine processors are quite unlike modern processors. Modern
instruction sets are designed for imperative programming. Each
instruction makes the processor execute some operation, such as
loading a register or storing a value in memory. In contrast, a Lisp
machine processor reads an expression from memory, determines its
type, and recursively evaluates it. This operation lends itself to
declarative programming, where the programmer declares an arbitrarily
complicated expression and asks the computer to evaluate it.

Another common feature of Lisp machine processors is the use of tagged
pointers. Each memory location is a tagged union of two elements: the
value itself, and a tag representing the type of that data. Tagged
architectures are not common today, although some programs still
use tagged pointers by exploiting pointer alignment (if every pointer
is a multiple of 8, then the bottom 3 bits can be used to store a
tag). These tags can be used to store an object's reference count (as
in Objective-C on certain versions of iOS), or information about a
data structure (as in certain implementations of kd-trees).

The ``Design of LISP-Based Processors (1979)'' paper by Guy L. Steele
and Gerald Jay Sussman \cite{lambda} presents the design of a
relatively simple Lisp machine processor, which is made from two
read-only memories, two buses, and several registers. If this
processor is connected to a RAM containing a Lisp expression in a
certain binary format, the processor will evaluate the expression and
place the result back in memory.

This processor features both a tagged architecture (the top 3 bits of
each word in memory represent the value's type) and the declarative
nature mentioned above. It is therefore interesting to study and to
bring this processor into today's world by using technology such as
hardware description languages (that allow us to describe a circuit
formally) and FPGAs (that allow us to synthesise any circuit quickly
and at low cost). By building a HDL implementation of the processor
not only can it be used easily on its own (by synthesising it to a
FPGA and connecting it to RAM), but a future circuit designer can take
this processor and include it in their more complicated design.

The processor presented in the original paper is a somewhat barebones
implementation of Lisp. Arithmetic primitives are not available, nor
are there any input or output routines. Due to time constraints the
paper's writers have not included a equality primitive. Additionally
the ATOM primitive has a bug that makes it consider NIL not atomic.
All these deficiencies can be rectified given time and effort, but it
is difficult for one to fix them given only a paper describing the
processor. Having a HDL implementation of the processor with a proper
test suite allows a future circuit designer to more easily make
changes to the processor while using the test suite to prevent
regressions and using a FPGA to quickly obtain a hardware version of
the changed processor that can be manually tested and used in any
desired application.

\section{Challenges}
While \cite{lambda} is part of the well-known Lambda Papers series and
has been read by many over the years, there are no published
implementations of the processor described in the paper. Therefore
\cite{lambda} was the only source used in the building of this
project, and several of the limitations of the processor were not easy
to see before the implementation phase was well underway. Furthermore,
the original paper has several shortcomings.

For one, while the microcode (the values of the two read-only
memories) is provided in both source and compiled form, the microcode
compiler is not provided. This makes it difficult to alter the
microcode to fix bugs or introduce new instructions. One needs to
either work on the already-compiled code (hard and error-prone) or
reimplement the microcode compiler, which takes a reasonable amount of
effort.

Moreover, the paper does not concern itself at all with how the user
gets a program into memory or how the user gets the result back. This
makes the processor as-described very difficult to use.

To implement the processor in Verilog, no source other than the paper
was used. There were no previous implementations of the processor that
could be consulted to resolve unclarities in the original paper.

In addition, there was no software available for interacting with the
processor. A compiler and assembler had to be written from scratch,
and so the processor could not be easily debugged until after the
software was functional.

\section{Contribution}
YULE is a modern-day implementation of a Lisp machine processor using
the ideas and techniques presented in \cite{lambda}.

The main contribution of YULE is a Verilog implementation of the
processor which allows future circuit designers to use it in their own
design and modify it as they please.

As the processor as described has no way to input a program or extract
a result from memory, YULE includes more circuitry around the
processor that lets it communicate with the external word. This
consists of a serial UART and two modules, a reader and a writer. When
the processor is idle, the reader waits for a program to be sent over
the UART. Once a program has been received, it is placed into memory
and the processor starts to evaluate it. Once evaluation is done, the
writer transmits the new contents of the memory over the UART.

Besides the hardware chip, YULE also includes three pieces of
software:

\begin{itemize}
\item a compiler, which takes Lisp source code and produces an
  intermediate representation of the program.
\item an assembler, which takes a program in the intermediate
  representation and produces the binary representation of the program
  that the processor can understand.
\item a REPL, in which users can type expressions. When an expression
  is submitted, the REPL compiles it, assembles it, and sends it to
  the processor. The processor evaluates the expression and sends back
  a memory dump. The REPL then reads the memory dump, interprets it,
  and prints the value of the expression in a human-readable format.
\end{itemize}

The REPL is therefore the entry point of YULE as a whole. An end-user
who does not intend to extend YULE can plug in a USB device containing
the YULE processor, and run the REPL. Then the user can write
expressions in the REPL and get their values back. Besides being the
entry point, the REPL serves as a convenient way to test that the
entire system is working. An expression typed in the REPL goes through
the compiler and assembler, and then is sent to the processor. It goes
through the three stages of the processor (reading the expression into
memory, evaluating the contents of the memory, writing a memory dump).

The hardware and software components of YULE have been tested to
ensure they function correctly. There is an automated test suite
containing 48 unit tests for the compiler and assembler, and 21
functional tests that engage all parts of the system (both the
software and the hardware). The unit tests ensure that the various
small parts of the compiler and assembler work properly, while the
functional tests consist of pairs of short Lisp expressions and their
correct values. The expressions are written to the REPL, and the
REPL's output is compared to the correct values.

To make it easier for future circuit designers and programmers to use
this project, YULE has been released under an open source license.
Prospective users can find all the source files (for both hardware and
software) online, and are free to use, modify, and redistribute them
under the same terms as Perl itself. The files can be accessed at this
URL: \url{https://git.ieval.ro/?p=yule.git}.

\section{Layout of this report}
This report begins by describing the background of YULE, the
technologies that made this project possible and a more in-depth
description of the 1979 paper. Then the components that YULE comprises
are examined in turn, followed by a chapter on how everything was
tested, which includes a worked example of the system in action. The
report ends with a list of ideas for future work based on YULE, and a
section on reflections.

\chapter{Background}

\section{Field-programmable gate array}
An FPGA is a configurable integrated circuit. A designer can define a
circuit using a hardware description language (explained in the next
section) and compile it to some binary format called a configuration.
This configuration describes how the different components of the FPGA
should be connected to each other. The configuration can then be
uploaded to the FPGA, and the FPGA will connect its internal
components as requested.

An FPGA can therefore replace any digital custom circuit (ASIC). While
an FPGA consumes more energy, uses more space, and runs at slower
clock rates than an ASIC, the FPGA permits rapid prototyping: the time
between finishing to write a design and having a working part is mere
seconds (compilation time + uploading the configuration to the FPGA),
and if a bug is discovered the buggy part need not be thrown away. The
design can be modified and the fixed configuration can be uploaded to
the FPGA.

The hardware aspect of YULE is designed to run on the Lattice
iCE40HX-1k FPGA, but with small modifications it can run on other
FPGAs as well.

\subsection*{iCEstick}
\begin{figure}
  \centering
  \includegraphics[height=4cm]{icestick.png}
  \caption{Lattice iCEstick Evaluation Kit}
\end{figure}

The Lattice iCEstick Evaluation Kit is a development board for the
iCE40HX-1k FPGA. It contains the FPGA, a USB connector, a FTDI chip
and 5 LEDs. The form factor is very convenient. It is similar to a USB
memory stick (albeit bigger) that can be plugged into a computer,
avoiding the need for jumper wires and GPIO that traditional
development boards require.

The FTDI chip implements the USB protocol, allowing the FPGA to be
programmed by the computer and letting the FPGA talk to the computer
using a very simple serial UART interface on the FPGA side, and a
standard USB serial port on the computer side. This makes
communication very easy. The FPGA can use any off-the-shelf UART
implementation (this project uses \texttt{osdvu} from
Opencores \cite{osdvu}), and the computer can use standard serial
console tools such as \texttt{minicom}.

\section{Verilog}
Verilog is a hardware description language. It is used to model
digital circuits at register-transfer level of abstraction. In Verilog
one defines several modules, with one of them (the ``toplevel''
module) representing the entire design. Each module has several inputs
and outputs, can contain internal wires, can contain instances of
other modules, and can contain combinational and sequential logic
blocks that drive internal wires and outputs.

Verilog has many more complicated features that are not used in YULE.
These include tristate buffers, delays, functions, tasks, loops, and
strength declarations.

A subset of Verilog can be \textit{synthesised}; that is converted to
a netlist (a list of transistors or FPGA components used and
connections between them) that can then be used to build an ASIC
(custom integrated circuit), or can be further converted to the
internal representation of the FPGA and uploaded to the FPGA. The FPGA
workflow and the programs that are part of it are described in the
next section.

A verilog project can also be run through a simulator, which simulates
the design clock edge by clock edge and records the values on each
wire in the design. This record can later be read in and visualised
with a wave viewer, which allows a circuit designer to better
understand why the circuit is not working correctly. One such program,
GTKWave \cite{gtkwave}, was used to debug YULE.

\section{icestorm}
Project IceStorm \cite{icestorm} is a set of open source reverse
engineered tools for programming Lattice iCE40 FPGAs. There exists a
fully open source Verilog-to-Bitstream flow which comprises:

\begin{enumerate}
\item Yosys, a Verilog synthetizer.
\item Arachne-pnr, a tool that implements placing and routing for
  iCE40 FPGAs.
\item icepack, which converts Arachne-pnr's output to iCE40 bitstream.
\item iceprog, which programs the iCEstick with the output of icepack.
\end{enumerate}

\section{Design of LISP-Based Processors (1979)}
Design of LISP-Based Processors (1979) \cite{lambda} is one of the
``Lambda papers''; a series of memos by Sussman and Steele about the
Lisp language and its implementation. The paper describes SIMPLE, a
microprocessor designed to run Lisp.

This processor is made of two read-only memories called EVAL and GC.
Each read-only memory is paired with state registers to form a state
machine. The contents of the state registers are fed into the
read-only memory, and the output of the read-only memory determines
the next state and other signals.

Each state machine also has an internal bus (called the E and G buses
respectively) and several registers connected to that bus. Each of
these registers stores a tagged pointer, which is an 11-bit value (the
top 3 bits represent the type, and the bottom 8 bytes represent the
value).

The signals coming out of each state machine can cause a register to
be read onto the machine's internal bus, a register to be loaded from
the internal bus, or write a constant value onto the internal bus.

The GC state machine represents the processor's memory system. It is
connected to an external RAM, which it controls through several
signals. It knows how to execute certain ``GC operations'', which
represent common Lisp memory management actions. Examples of these are
\texttt{CAR} and \texttt{CDR} (``given a pointer to a cons cell,
return either the CAR or the CDR of that cons cell'') or \texttt{CONS}
(``given two pointers, allocate a new cons cell, copy them into it,
and return a pointer to the new cons cell''). The existence of the GC
isolates the rest of the processor from the particularities of the
memory used.

The EVAL state machine evaluates Lisp expressions. Signals coming out
of it control the GC by selecting what operation the GC should perform
and connecting the E and G buses to supply arguments to the GC and
read results back. The EVAL reads expressions from memory via the GC
and evaluates them, putting results back into memory.

The overall operation of the processor is to read an expression from
the magic location ``5'' in memory, evaluate it, and place the result
into the magic location ``6'' in memory. Other memory locations are
used to store subexpressions of the input expression, and to store any
intermediate results. Despite its name, GC does not perform any
garbage collection, so at the end of the processor's operation the
memory will still contain all intermediate results produced during
evaluation.

The processor is intended as a proof of concept and not a
production-ready hardware implementation of Lisp. Therefore it has
several important limitations, for example there is no primitive
operation that implements equality of objects (making it impossible to
distinguish between symbols), there are no primitive arithmetic
operations (so one needs to use Church numerals if numbers are
needed), and the processor includes a known bug (``the ATOM bug'').
The bug is that the ATOM primitive considers NIL to be non-atomic,
which is incorrect.

The Lisp expressions mentioned here are in textual form (Lisp source
code), but instead in a special tagged pointer format. Each memory
value is a union of two values: a 8-bit pointer and a 3-bit type. We
can use pairs of consecutive memory locations to represent cons cells
and build linked lists out of them. If a list is at location X, that
means that (X, X+1) is the first cons cell of the list, with X+1 being
the CAR (the value of the first element) and X being the CDR (a
pointer to the next cons cell). A discussion of the different ways to
represent Lisp expressions (source code, this tagged pointer format,
and an intermediate representation) can be found in \ref{sec:formats}.

The next page shows a diagram of how the processor in the paper is
structured. The two read-only memories can be seen, with their
registers, state registers, and the interconnection between the E and
G buses.

\begin{figure}
\centering\includegraphics[width=18.5cm]{AIM-514-design.pdf}
\caption{Design diagram of processor from \cite{lambda}}
\end{figure}

\chapter{A Lisp Machine processor on a USB drive}
The project comprises a hardware part written in Verilog and a
software part written in Perl.

The hardware part is a Lisp machine processor with associated
circuitry that can read a program from the iCEstick UART and that can
write the result of running a program to the UART. The normal
operation of this part is to:

\begin{enumerate}
\item Wait for a program to be supplied via UART
\item Run the program on the processor
\item Write a memory dump to the UART
\item Go back to step 1
\end{enumerate}

The software part includes:

\begin{itemize}
\item A compiler, which takes Lisp source code and puts it in the
  linked list format understood by the SIMPLE processor
\item An assembler, which takes the result of the compiler and
  converts it to the SIMPLE in-memory binary format.
\item A REPL, which repeatedly reads a program from the user, compiles
  it, assembles it, sends it to YULE, reads the memory dump from YULE,
  and prints the result for the user to see.
\end{itemize}

\section{Hardware}
The hardware part is built from 8 modules:

\begin{enumerate}
\item The \textbf{READER}, which implements step 1 of the normal operation described above.
\item The \textbf{WRITER}, which implements step 3 of the normal operation described above.
\item The \textbf{GCRAM}, a single-port RAM.
\item The \textbf{GC}, corresponding to the GC PLA in the original implementation.
\item The \textbf{EVAL}, corresponding to the EVAL PLA in the original implementation.
\item The \textbf{CTRL}, which advances the operation from one step to the next one.
\item The \textbf{UART}, which implements transmission and reception of bytes via UART.
\item The \textbf{CPU}, which instantiates every module above and wires them together.
\end{enumerate}

\subsection{Clocks}
To avoid timing problems, a single 48 MHz clock is used. The iCEstick
has a 12 MHz oscillator, and the phase-locked loop circuitry on the
FPGA multiplies this signal by 4.

Every flip-flop updates on the positive edge of the clock, except for
the RAM which updates on the negative edge. This way the rest of the
circuit perceives the RAM to be combinational (it ``updates
immediately''): if on a positive edge the RAM's address input changes,
the RAM will update its output on the following negative edge and the
changed output will be visible on the next positive edge. If the RAM
updated on a positive edge, then if the input changed on positive edge
$i$ the changed output will only be visible on positive edge $i+2$.

The design is intended to be run on a FPGA, which has a fixed clock
tree. This typically consists of a small number of special nets called
global buffers that can distribute signals with minimal skew to every
logic cell. The FPGA on the iCEstick has 8 such global buffers.


There are 4 components in the design (GC, EVAL, READER, WRITER) that
need to be enabled/disabled at different times. One way to achieve
this is to AND the master 48 MHz clock with several signals, thus
creating a separate clock for each component. This is known as
\textit{clock gating}. However, on a FPGA clock gating involves
routing the clock from a global buffer to a logic cell and then back
to another global buffer. This can introduce glitches in the clock and
would delay each of the clocks by some amount (dependent on the
signals it is ANDed with). It is therefore possible that the 4 output
clocks would not be aligned, causing problems when data is
communicated by one component to another. As such, clock gating is not
recommended on FPGAs.

\begin{figure}
  \centering\includegraphics[height=3cm]{gating.eps}
  \caption{Gated clock on the left, clock enable on the right}
\end{figure}

To avoid this problem, the design used here does not use clock gating
and instead uses 4 clock enable signals supplied by the controller to
each of the GC, EVAL, READER and WRITER. At any point the clock is
enabled for exactly one of the READER, GC, and WRITER. The EVAL clock
is enabled if and only if the GC clock is enabled and the GC's
\texttt{step\_eval} output is high.

\subsection{GC}
\begin{figure}
  \centering\includegraphics[height=4.88cm]{gc.eps}
  \caption{GC module}
\end{figure}

The GC module is connected to the GCRAM (via the \texttt{ram\_do,
  ram\_we, ram\_addr, ram\_di} signals), to the EVAL (via the
\texttt{Ein, Eout, gcop, step\_eval, conn\_et, conn\_ea} signals) and
to the controller. It also provides the value of its P register (the
free storage pointer) to the WRITER, so the writer can know when to
stop dumping the memory.

It is implemented as a combinational ROM, some combinational logic to
drive the internal G bus, sequential logic to advance to the next
state respecting dispatch rules, and sequential logic to update
registers. The clock enable signal acts as an active low synchronous
reset: when the clock is disabled, the state register is set to the
initial state.

This module represents one half of the original design presented in
\cite{lambda}, and the ROM's contents are those from the paper with a
bug (the original design sets the ADR line low in two states when it
should be high).

\subsection{EVAL}
\begin{figure}
  \centering\includegraphics[height=3.29cm]{eval.eps}
  \caption{EVAL module}
\end{figure}

The EVAL module is connected to the GC (via the \texttt{conn\_et,
  conn\_ea, Ein, Eout, gcop} signals) and to the controller.

The implementation of the EVAL is very similar to that of the GC.
There is a combinational ROM, combinational logic to drive the E bus,
sequential logic to advance to the next state respecting dispatch
rules, and sequential logic to update registers. A synchronous reset
is available. When the reset line is held high for one clock cycle,
the state register is set to the initial state. It is necessary to
separate the reset from the clock enable because the GC often pauses
the EVAL to do work and the EVAL should not be reset every time this
happens.

This module represents the other half of the original design presented
in \cite{lambda}, and the ROM's contents are the same as those in the
paper except that the ROM is widened by one bit to indicate whether
the EVAL is finished evaluating the expression.

\subsection{Writer}
\begin{figure}
\centering\includegraphics[height=2.26cm]{writer.eps}
\caption{WRITER module}
\end{figure}

The WRITER module is connected to the UART (via the \texttt{tx\_busy,
  tx\_signal, tx\_byte} signals), the GC (via the \texttt{P} signal),
the RAM (via the \texttt{ram\_addr, ram\_do} signals), and to the
controller.

Its purpose is to dump the memory to the UART starting with position 4
(since positions 0, 1, 2, 3 are constant) and finishing with position
$P - 1$, the last used cell in memory. It is implemented as a state
machine with 7 states and an extra register (\texttt{current\_index})
that stores the current index in the memory.

The memory is 16-bit wide, but the UART sends and receives bytes. The
writer dumps memory word in network byte order (big-endian), meaning
the significant byte is sent before the least significant byte.

The states are, in order:
\begin{description}
\item[START] which initializes \texttt{current\_index} to 4
\item[WRITE1\_WAIT] which waits for \texttt{tx\_busy} to become low
\item[WRITE1] in which transmission of the upper byte starts
\item[WRITE2\_WAIT] which waits for \texttt{tx\_busy} to become low
\item[WRITE2] in which transmission of the lower byte starts
\item[INCREMENT] in which \texttt{current\_index} is incremented,
  jumping back to WRITE1\_WAIT if it is still lower than
  \texttt{freeptr}.
\item[FINISHED] in which \texttt{finished} is driven high
\end{description}

This module is not part of the original design in \cite{lambda}, as
that paper did not concern itself with input and output.

\subsection{Reader}
\begin{figure}
\centering\includegraphics[height=2.22cm]{reader.eps}
\caption{READER module}
\end{figure}

The READER module is connected to the UART (via the
\texttt{rx\_signal, rx\_byte} signals), the RAM (via the
\texttt{ram\_we, ram\_addr, ram\_di} signals), and to the controller.

Its purpose is to read a program from the UART and overwrite the GCRAM
with this program. It is implemented as a state machine with two extra
registers (\texttt{total\_words} and \texttt{current\_index}) to store
the size of the incoming program and the current position within it.

The reader expects to receive a 13-bit \texttt{length} value, followed
by \texttt{length} 16-bit values that represent the desired memory
contents. Each of these values is sent in network byte order
(big-endian), meaning the most significant byte is sent before the
least significant byte.

The states are, in order:
\begin{description}
\item[IDLE] which waits for \texttt{rx\_signal}, puts the read byte at the top of \texttt{total\_words}.
\item[LENGTH] which waits for \texttt{rx\_signal}, puts the read byte at the bottom of \texttt{total\_words}
\item[READ1] which waits for the \texttt{rx\_signal}, puts the read byte at the top of \texttt{ram\_di} and increments \texttt{current\_index}
\item[READ2] which waits for the \texttt{rx\_signal}, puts the read byte at the bottom of \texttt{ram\_di}
\item[WRITE] in which \texttt{ram\_we} is high, jumping back to READ1 if \texttt{current\_index + 1 < total\_words}
\item[FINISHED] in which \texttt{finished} is high
\end{description}

This module is not part of the original design in \cite{lambda}, as
that paper did not concern itself with input and output.

\subsection{Putting the hardware together}
The components described above are managed by the controller, which
provides them with clock enable signals. The controller is a state
machine with three states:
\begin{description}
\item[STATE\_READ] in which only the READER clock is enabled
\item[STATE\_RUN] in which the GC and EVAL clocks are enabled
\item[STATE\_WRITE] in which only the WRITER clock is enabled
\end{description}

In STATE\_READ and STATE\_WRITE, the EVAL's reset line is high.

Whenever the component corresponding to the current state indicates it
is finished, the controller advances to the next state.

The controller also includes three multiplexers that control which
lines are connected to the GCRAM. If the READER is active, all three
GCRAM lines are connected to the READER. If the WRITER is active, the
address line is connected to the writer (and the other lines are
connected to the GC). If the GC is active, all lines are connected to
the GC.

The controller also includes logic to drive the 5 LEDs on the
iCEstick. Three LEDs indicate the current state, while the other two
indicate if the UART is currently transmitting or receiving.

\section{Representing a Lisp program}
\label{sec:formats}
The processor described above evaluates a Lisp program received via
UART. This section talks about three representations of such a
program: the Lisp source code (a S-expression), the intermediate form
(also a S-expression), and the in-memory form (a binary string).

The first representation is obtained by parsing the user input, the
compiler converts this into the second, and the assembler converts the
second into the third.

\subsection{Lisp source code}
Recall that a S-expression is either an atom or a cons cell.

An atom is a sequence of letters, digits, and certain symbols. There
are two special atoms named \texttt{T} and \texttt{NIL} representing
the boolean true and false respectively.

A cons cell is a pair of two S-expressions called the ``car'' and the
``cdr''. We write them as \texttt{(car . cdr)}, for example \texttt{(5
  . 7)} is a cons cell with car \texttt{5} and cdr \texttt{7}.

We can use cons cells to represent linked lists. A linked list is a
sequence of cons cells where the cdr of one cell is the next cell, and
the cdr of the last cell is \texttt{NIL}. For example the list
\texttt{1,2,3} can be written as \texttt{(1 . (2 . (3 . NIL)))}.
We can also write this list more concisely as \texttt{(1 2 3)}, and we
write \texttt{()} to mean the atom \texttt{NIL}.

Using these syntax elements we can encode several types of logical
expressions:

\begin{itemize}
\item Numbers are encoded as numeric atoms, e.g. \texttt{5}
\item Variables are encoded as non-numeric atoms, e.g. \texttt{my-var}
\item Conditional statements are encoded as 4-element lists, e.g.
  \texttt{(if is-odd 5 6)} which has value 5 if the variable
  \texttt{is-odd} is not \texttt{NIL} and value 6 otherwise
\item Symbols are encoded as non-numeric atoms preceded by a single
  quote, e.g. \texttt{'my-symbol}. This can also be written
  \texttt{(quote my-symbol)}
\item Constant lists are encoded as lists preceded by a single quote,
  e.g. \texttt{'(1 2 (3 4 5))}. This can also be written
  \texttt{(quote (1 2 (3 4 5)))}
\item Functions (lambda abstractions) are encoded as 4-element lists,
  e.g. \texttt{(lambda first (a b) a)} in which the second element is
  the function name, the third element is a list of function
  arguments, and the fourth element is the function body.
\item Function calls are encoded as lists, with the function as the
  first element and the arguments as the other elements, e.g.
  \texttt{(atom 5)} calls the function \texttt{atom} with argument
  \texttt{5}.
\end{itemize}

\subsection{In-memory representation}
The memory is an array of 16-bit integer values. We interpret these as
tagged pointers: the top 3 bits indicate the ``type'' of this pointer,
while the other 13 bits represent another memory location. A pointer
therefore has a type and a value.

We represent each Lisp expression as a tagged pointer. To evaluate it,
the processor looks at the type of the pointer. Here is what a pointer
of value $v$ means based on its type. We use the syntax $mem[a]$ to
mean ``the contents of the memory location $a$'' (that is, a tagged
pointer) and $memval[a]$ to mean ``the value at the memory location
$a$'' (that is, the value of the tagged pointer $mem[a]$).

\begin{enumerate}
\setcounter{enumi}{-1}
\item is a constant list, whose cdr is $mem[v]$ and car is $mem[v+1]$.
\item is a constant symbol.
\item is a variable reference, where $v$ indicates the position in the
  environment of the referenced variable.
\item is a constant closure, that is the result of evaluating a lambda
  abstraction. Values of this type do not normally appear in the input
  code, but instead are created during evaluation of other forms.
\item is a procedure (lambda abstraction), whose body is $mem[v]$.
\item is a conditional, where $mem[v+1]$ is the condition,
  $mem[memval[v]+1]$ is the ``then'' branch, and $mem[memval[v]]$ is the
  ``else'' branch.
\item is a procedure call, which points to a special kind of linked
  list. The CDR of each cell (other than the last one) points to the
  next cell and has type 0, while the CDR of the last cell has value 0
  and its type indicates the primitive to be called. The CARs of these
  cells indicate the arguments for the procedure, in reverse order.

  To call a user-defined procedure, we use the primitive
  \texttt{funcall} and put the user-defined procedure as the CAR of
  the last cell (in other words, the user-defined procedure is the
  first argument to \texttt{funcall}).
\item is a quoted constant, representing the object at $mem[v]$.
\end{enumerate}

The booleans \texttt{NIL} and \texttt{T} are compiled to a list
pointing to 0, and a symbol with value 2 respectively.

The memory has a peculiar fixed layout. Locations 0 and 1 are the CDR
and CAR of NIL, locations 2 and 3 are the CDR and CAR of T, location 4
holds the first unused location in memory (this tells the processor
where it can start to place intermediate expressions), location 5
holds the expression to be evaluated, and location 6 is where the
processor will place the value of the expression.

\subsection{The intermediate form}
To facilitate conversion from the former format to the latter, an
intermediate form is used. Here we represent tagged pointers as lists
of one of three types:

\begin{itemize}
\item \texttt{(type value)}, where ``type'' is the pointer's type and
  ``value'' is the pointer's (numeric) value.
\item \texttt{(type ptr)}, where the pointer's type is ``type'' and
  the pointer's value is $x$, where $x$ is the memory location where
  ``ptr'' (another tagged pointer) lies.
\item \texttt{(type ptr1 ptr2)}, where the pointer's type is ``type''
  and the pointer's value is $x$, where $x$ is the memory location
  where ``ptr1'' lies and $x+1$ is the memory location where ``ptr2''
  lies.
\end{itemize}

These three representations map nicely to the tagged pointers
described in the previous section. The first representation is used
for constant symbols and variable references, the second for
procedures, and the third for constant lists, conditionals, and
procedure calls.

A visual representation of the intermediate representation (from the
AIM-514 paper) can be seen on the next page. The expression in
question is

\begin{lstlisting}
((lambda append (x y) (if (atom x) y (cons (car x) (append (cdr x) y)))) '(a b c) '(d e f))
\end{lstlisting}

which we would represent as this sexp:

\begin{lstlisting}
(CALL (MORE (MORE (FUNCALL 0) (PROC (IF (LIST (CALL (MORE (CONS 0) (CALL (MORE (MORE (FUNCALL 0) (VAR -1)) (VAR -2)) (CALL (CDR 0) (VAR -3)))) (CALL (CAR 0) (VAR -3))) (VAR -2)) (CALL (ATOM 0) (VAR -3))))) (LIST (LIST (LIST (LIST 0) (SYMBOL 6)) (SYMBOL 7)) (SYMBOL 8))) (LIST (LIST (LIST (LIST 0) (SYMBOL 3)) (SYMBOL 4)) (SYMBOL 5)))
\end{lstlisting}

As mentioned in the Background chapter, the processor has a bug (the
ATOM bug) that causes \texttt{(ATOM NIL)} to be true. As a result,
\texttt{(atom x)} in this expression is always true, even when
\texttt{x} is the empty list, and the expression loops forever (or at
least until memory is exhausted). We can fix this by changing the
condition from \texttt{(atom x)} to \texttt{x}, which is false when
\texttt{x} is NIL and true otherwise, and swapping the two branches of
the conditional. The fixed program is:

\begin{lstlisting}
((lambda append (x y) (if x (cons (car x) (append (cdr x) y)) y)) '(a b c) '(d e f))
\end{lstlisting}

which is compiled to the slightly shorter sexp:

\begin{lstlisting}
(CALL (MORE (MORE (FUNCALL 0) (PROC (IF (LIST (VAR -2) (CALL (MORE (CONS 0) (CALL (MORE (MORE (FUNCALL 0) (VAR -1)) (VAR -2)) (CALL (CDR 0) (VAR -3)))) (CALL (CAR 0) (VAR -3)))) (VAR -3)))) (LIST (LIST (LIST (LIST 0) (SYMBOL 6)) (SYMBOL 7)) (SYMBOL 8))) (LIST (LIST (LIST (LIST 0) (SYMBOL 3)) (SYMBOL 4)) (SYMBOL 5)))
\end{lstlisting}


\begin{figure}
\centering\includegraphics[width=18.5cm]{AIM-514-ir.pdf}
\caption{Example of intermediate representation}
\end{figure}

\section{Software}
\subsection{Compiler}
The compiler takes in Lisp code and translates it to a sexp-based
format understood by the assembler.

The compiler has several functions for compiling certain kinds of
expressions, and a function (\texttt{process\_toplevel}) that uses the
former to compile arbitrary expressions.

The output from any of this functions is a sexp in the intermediate
form described in the previous section.

Symbols are identified by sequences of letters on the source code side
and by numbers on the processor side. The compiler therefore keeps a
mapping between symbol names and their numbers. Whenever a new symbol
is encountered, a new entry is added to this map.

The \texttt{process\_quoted} function handles constants. For example
if the source code contains \texttt{'(1 2 3)} then
\texttt{process\_quoted} is called with \texttt{(1 2 3)}. This
function handles the symbol interning described in the previous
paragraph, and converts its input to intermediate form.

The \texttt{process\_proc} function handles lambda abstractions. For
example if the source code contains \texttt{(lambda first (a b) a)}
then \texttt{process\_proc} is called with \texttt{first},
\texttt{(a b)}, and \texttt{a}. This function appends the function
name and arguments to the environment, then processes the body (using
\texttt{process\_toplevel}) with the new environment.

The \texttt{process\_funcall} function handles function calls. For
example if the source code contains \texttt{(reverse-list 2 1)} then
\texttt{process\_funcall} is called with \texttt{reverse-list}, and
\texttt{(2 1)}. The arguments are first processed (using
\texttt{process\_toplevel}), then (if the function is not a primitive)
the function is processed as well.

Finally the \texttt{process\_toplevel} handles any expression by
figuring out what sort of expression it is and then processing it
appropriately. Empty lists and \texttt{NIL} become \texttt{(list 0)},
numbers and \texttt{T} are processed with \texttt{process\_quoted}

Examples of compiler operation can be found in the compiler tests.
Here are a few of them:

\begin{itemize}
\item The Lisp expression \texttt{'()} is compiled to \texttt{(LIST 0)}
\item The Lisp expression \texttt{'(5 foo)} is compiled to
  \texttt{(LIST (LIST (LIST 0) (SYMBOL 3)) (SYMBOL 4))}. Observe that
  the CDR comes before the CAR, and symbols are interned:
  \texttt{'foo} is compiled to \texttt{(SYMBOL 3)} and \texttt{'5} is
  compiled to \texttt{(SYMBOL 4)}
\item The Lisp expression \texttt{(if t '(2 3) 'x)} is compiled to
  \texttt{(IF (LIST (SYMBOL 5) (LIST (LIST (LIST 0) (SYMBOL 3))
    (SYMBOL 4))) (SYMBOL 2))}. Observe that \texttt{t} is always
  compiled to \texttt{(SYMBOL 2)}, as mentioned in the section about
  the memory format.
\end{itemize}

\subsection{Assembler}
The assembler takes in an S-expression in the intermediate form, lays
out the words in memory, and produces a memory dump that can be
supplied to the processor.

The core of the assembler is a function called \texttt{process} that
takes an expression in intermediate form and an optional location in
memory. This function lays out the expression's subexpressions in
memory and then lays out the expression itself at the given location
(or at the end of the memory if there is no location given).

The assembler begins by initializing the first 7 elements of the
memory with the values needed by the processor. After initialization,
the \texttt{process} function is called on the input expression with
no location.

Here is how the \texttt{process} function lays out the expression
$expr$ at location $l$ (where $l$ can be a given memory location or
simply ``the first free location in memory''):

\begin{itemize}
\item If $expr$ is \texttt{(type value)}, then it is directly laid out
  at $l$
\item If $expr$ is \texttt{(type ptr)}, then \texttt{process} is
  called recursively to lay \texttt{ptr} out at the first free
  location in memory (say this is location $loc$). Then \texttt{(type
    $loc$)} is laid out at $l$.
\item If $expr$ is \texttt{(type ptr1 ptr2)}, then two consecutive
  free locations in memory $loc$ and $loc+1$ are reserved,
  \texttt{process} is recursively called twice to lay \texttt{ptr1} at
  $loc$ and \texttt{ptr2} at $loc+1$, and finally
  \texttt{(type $loc$)} is laid out at $l$.
\end{itemize}

Once again, examples of assembler operations can be found in the
assembler tests. Two of them:

\begin{itemize}
\item \texttt{(quoted (symbol 5))} is assembled to \texttt{00 08 00 00
    00 00 20 00 20 00 00 08 E0 07 00 00 20 05}. Observe that each word
  is made of two bytes, the first word (\texttt{00 08}) indicating the
  length (in words) of the program. The first 4 words are always
  fixed: they represent in order the CDR of NIL, the CAR of NIL, the
  CDR of T, and the CAR of T. The 7th word is also a constant, 0,
  because that is where the processor puts the result. The 5th word is
  the first free word in memory (which is equal to the length in words
  of the program), and the 6th word (\texttt{E0 07}) is the expression
  to be evaluated. In this case, it has type quote and points to
  memory location 7 (the 8th word, because pointers are 0-indexed). At
  memory location 7 we have the value \texttt{20 05} which has type
  symbol and value 5.
\item \texttt{(call (more (funcall 0) (proc (var -2))) (number 5))} is
  assembled to \texttt{00 0C 00 00 00 00 20 00 20 00 00 0C C0 07 00 00
    00 09 20 05 E0 00 80 0B 5F FE}. Here we want to evaulate the
  expression \texttt{C0 07} (type ``function call'', address 7). This
  points to a cons-cell, whose car (at position 8) is \texttt{20 05}
  (that is, \texttt{(number 5)}) and whose cdr (at position 7) is
  \texttt{00 09} (type ``more'' and address 9). The ``more'' points to
  another cons-cell, whose cdr (at position 9) is \texttt{E0 00} (that
  is, \texttt{(funcall 0)}), and whose car is \texttt{80 0B}: a
  procedure whose body is at position 11, namely \texttt{5F FE} (that
  is, \texttt{(var -2)}).

  So the expression is a call to the primitive \texttt{funcall} with
  arguments \texttt{(lambda l (x) x)} and \texttt{5}. The original
  program was probably \texttt{((lambda l (x) x) 5)}. Note that this
  program was compiled with an old version of the compiler that did
  not intern symbols, which explains why \texttt{5} was compiled to
  \texttt{(number 5)} and not \texttt{(symbol 3)}.
\end{itemize}

\subsection{REPL}
The REPL ties everything together: it reads a program from the user,
compiles it, assembles it, sends the assembled program to the
processor, reads the memory dump from the processor, and finally
prints the result to the screen.

The REPL includes an S-expression pretty-printer that operates on
memory dumps. As symbols are interned by the compiler, when the
pretty-printer hits a symbol it only knows its index, and not the
original name. To print the original name the pretty-printer searches
the symbol map built by the compiler and pulls the name from there.

The pretty-printer takes a chunk of memory and index, and prints the
expression at that index. For example, if the memory chunk is
\texttt{00 00 00 00 20 00 20 00 00 09 00 07 00 07 20 02 00 00 60 00}
and the index is 6, then the pretty printer will try to print
\texttt{00 07}, which is a list with car \texttt{00 00} and cdr
\texttt{20 02}, and the pretty printer prints \texttt{(nil . t)}.

An example REPL session is:

\begin{lstlisting}
YULE REPL

* (cons 1 2)
(1 . 2)
* (car '(4 5 6))
4
* (cdr '(t t t))
(T T)
* ((lambda id (x) x) 5)
5
* (lambda first (a b) a)
<CLOSURE>
* ((lambda first (a b) a) 7 8)
7
* (atom 6)
T
\end{lstlisting}

\chapter{Evaluation and Testing}
To make development easier, ensure everything works correctly, and
prevent regressions there are unit tests for the compiler and
assembler, and functional tests for the entire project.

Besides these automated tests, manual evaluation (in the beginning,
precompiling simple expressions and supplying them to the processor as
the initial value of the memory; later on, manually typing expressions
in the REPL and looking at the output). When the processor was not
operating as expected, the GTKWave wave simulator was used to assist
with debugging. This piece of software takes in a recorded simulation
of the circuits and displays the values of all signals inside the
circuit against time. This allows the designer to look at every clock
cycle in turn and discover where the error occurs.

The process of debugging the processor had the side effect of finding
inefficiencies in the original design. While \cite{lambda} mentions
some limitations of the processors, one important limitation that was
only discovered during debugging is that for every operation that the
EVAL performs, the result of that operation will be written to new
locations in memory. This includes ``read-only'' operations such as
CAR and CDR. If the programmer requests the CAR of a list, a copy of
the first element will be taken and a pointer to it will be returned.
Since YULE only has 4096 words of memory, this severely restricts the
complexity of programs that can be evaluated on it.

Another avenue to ensure the processor works as it should is formal
verification using model checking tools. Formal verification is
commonly employed as part of circuit design evaluation because
mistakes are more difficult to identify, and if found after the chip
has been built they are very costly to fix (often the chips that were
built with the bug need to be thrown away). However, building models
is a very involved process, and devoting effort to formal verification
of this project would not have allowed the implementation of several
features. Due to time constraints and knowing that no physical YULE
chips are built (the project is so far only synthesised to a FPGA),
testing and manual evaluation was considered sufficient for YULE.

\section{Unit testing}
The software was tested with 48 unit tests written in Perl using the
\texttt{Test::More} framework.

The assembler takes in compiled Lisp code and outputs a memory dump in
one of two formats (binary or Verilog code). Its tests take a
configuration for the assembler, an input, two expected outputs, and
(for documentation purposes) the source Lisp code. An example test:

\begin{lstlisting}
$expbin =
    '00 0C 00 00 00 00 20 00 20 00 00 0C C0 07'
  . ' 00 00 00 09 20 05 E0 00 80 0B 5F FE';
run_test {addr_bits => 13},
    '(call (more (funcall 0) (proc (var -2))) (number 5))',
    <<'EOF', $expbin, '((LAMBDA ID (X) X) 5)';
mem[ 0] <= 0;                     // (cdr part of NIL)
mem[ 1] <= 0;                     // (car part of NIL)
mem[ 2] <= 16'b0010000000000000;  // (cdr part of T)
mem[ 3] <= 16'b0010000000000000;  // (car part of T)
mem[ 4] <= 16'd12;                // (free storage pointer)
mem[ 5] <= 16'b1100000000000111;  // CALL 7
mem[ 6] <= 0;                     // (result of computation)
mem[ 7] <= 16'b0000000000001001;  // MORE 9
mem[ 8] <= 16'b0010000000000101;  // NUMBER 5
mem[ 9] <= 16'b1110000000000000;  // FUNCALL 0
mem[10] <= 16'b1000000000001011;  // PROC 11
mem[11] <= 16'b0101111111111110;  // VAR -2
EOF
\end{lstlisting}

Here the compiled code \texttt{(call (more (funcall 0) (proc (var -2))) (number 5))}
is given to the assembler, and its outputs are compared to the binary
string on the first line and the block of Verilog code.

The tests do not all contain valid assembler inputs. There are a
series of tests that give invalid inputs to the assembler and expect
the assembler to raise an exception matching a pattern. Example tests:

\begin{lstlisting}
expect_error_like { run_test {}, '((type is a list) 5)'}
    'Type of toplevel is not atom';
expect_error_like { run_test {}, '(badtype 5)'}
    'No such type';
expect_error_like { run_test {}, '(5 700000)'}
    'Addr too large';
\end{lstlisting}

On the compiler side, there are 4 tests for the quoting function,
but the brunt of the tests are pairs of a compiler input (i.e. Lisp
code) and a compiler output (i.e. a sexp that the assembler
understands). For each pair the compiler is run on the first element
and its output is compared with the second. Example such tests:

\begin{lstlisting}
is_toplevel '(quote 5)', '(SYMBOL 3)';
is_toplevel '(reverse-list \'a \'a \'b)',
    '(CALL (MORE (MORE (REVERSE-LIST 0) (SYMBOL 4)) (SYMBOL 3)) (SYMBOL 3))';
is_toplevel '(if t \'(2 3) \'x)',
    '(IF (LIST (SYMBOL 5) (LIST (LIST (LIST 0)'
  . ' (SYMBOL 3)) (SYMBOL 4))) (SYMBOL 2))';
\end{lstlisting}

As the compiler code includes a function to pretty-print
S-expressions, there are a few round-trip tests on this function.
Given a string, it is parsed using the \texttt{Data::SExpression}
parsing library then it is pretty-printed. The result should match the
original string.

There are also tests for handling of invalid input, like in the assembler. These include:

\begin{lstlisting}
expect_error_like { is_toplevel 'x' } 'Variable x not in environment';
expect_error_like { is_toplevel '(car)' } 'Cannot call primitive car with no arguments';
\end{lstlisting}

A test coverage tool, \texttt{Devel::Cover}, was ran on the compiler
and assembler to find statements and branches that were not covered by
existing unit tests, and new tests were written to cover them. Right
now \texttt{Devel::Cover} reports 100\% test coverage of the compiler
and assembler.

\begin{figure}
  \centering\includegraphics[height=8cm]{devel-cover.png}
  \caption{Screenshot of \texttt{Devel::Cover} results, showing 100\%
    test coverage}
\end{figure}

\section{Functional testing}
Besides unit tests, the entire system (hardware and software) has
undergone functional testing. This means that a set of 21 Lisp
expressions has been prepared, alongside their correct values computed
using a software Lisp implementation. These values are given to the
YULE REPL and the test succeeds if the output equals the known correct
value of the expression.

Here are some example tests:
\begin{lstlisting}
'(1 2 3)
5
((lambda id (x) x) 5)
((lambda id (x) x) '(1 2 3))
(car '(2 3))
(car (cdr '(4 5 6)))
((lambda snd (x y) y) 'first 'second)

(cons '(1 2) '(7))
(car (cons '(1 2) '(7)))
(cdr (cons '(1 2) '(7)))

((lambda rev (xs) ((lambda reverse-acc (xs acc) (if xs (reverse-acc (cdr xs) (cons (car xs) acc)) acc)) xs '())) '(4 5 6 7))
\end{lstlisting}

This test suite tries to cover most kinds of valid expressions. It
includes constants of all kinds, conditionals, lambda abstractions,
calls to all primitives, calls to user-defined functions, recursive
calls.

\section{Timing and performance}
There are three parts to the performance of YULE:

\begin{enumerate}
\item The efficiency of the software. This refers to the time taken to
  compile and assemble input programs, which is negligible.

\item The I/O performance, which refers to the time needed to transfer
  the program from the computer to the processor's memory, plus the
  time needed to transfer the memory dump from the memory back to the
  computer.

\item The speed of the processor itself, that is the time/number of
  cycles that the processor needs to evaluate a given expression.
\end{enumerate}

The GC runs at 48MHz, while the EVAL only runs when the GC allows it
to which is at most half of the time.

The UART runs at 4000000 baud, so a bit is sent every 12 cycles of the
48MHz clock. A byte is made of 11 bits: a start bit, 8 data bits, and
one stop bit. A word is two bytes, so 22 bits. Therefore 264 cycles
are needed to transmit one word.

Since the WRITER just sends a full memory dump, the I/O performance is
proportional not to \texttt{program\_len + result\_len} but to
\texttt{program\_len + result\_len + intermediate\_results\_len}. As
evaluation of any expression involves the creation of at least one
value (the result), for every expression or subexpression evaluated at
least 264 cycles will be needed to output that result. The EVAL takes
significantly less than 264 cycles to evaluate a expression (not
counting its subexpressions), in one test each expression took between
30 and 90 cycles to evaluate. The processor performance is therefore
dominated by I/O.

As an example, the program \texttt{((lambda id (x) x) 5)} is 12 bytes
long, and takes 258 cycles to evaluate. The memory dump is 54 bytes
long. Therefore of the roughly 17700 cycles needed from beginning to
end, about 98.5\% are used to do I/O.

\section{A worked example}
\label{sec:example}
Here is an example of what happens when a expression is typed into the
REPL:

\begin{enumerate}
\item A user types into the REPL the string

\begin{lstlisting}
(car '(2 3 4))
\end{lstlisting}

  This is a call to the CAR primitive, with the list \texttt{(2 3 4)}
  as an argument. We expect the result to be \texttt{2}, the first
  element of the list.

\item This string is given to the compiler, producing the sexp

\begin{lstlisting}
(CALL (CAR 0) (LIST (LIST (LIST (LIST 0) (SYMBOL 3)) (SYMBOL 4)) (SYMBOL 5)))
\end{lstlisting}

  Observe that in this representation the CDR comes before the CAR. So
  for example ``(LIST (LIST 0) (SYMBOL 3))'' represents a list whose
  CAR is ``(SYMBOL 3)'' and CDR is ``(LIST 0)'' (that is, nil). This
  is a list with one element, ``(SYMBOL 3)''. The sexp shown here
  represents a call to the CAR primitive with one argument. The list
  has three elements, which are in order \texttt{(SYMBOL 5), (SYMBOL
    4), (SYMBOL 3)}.

  In this stage the compiler maps each number to a symbol, in this
  case \texttt{(SYMBOL 5)} is 2, \texttt{(SYMBOL 4)} is 3, and
  \texttt{(SYMBOL 3)} is 2. This mapping is remembered so it can be
  used later by the pretty printer.

\item This sexp is given to the assembler, producing the 16 words long
  binary string (here displayed in hex, with a space after each 16-bit
  word)

\begin{lstlisting}
000F 0000 0000 2000 2000 000F C007 0000 2000 0009 000B 2005 000D 2004 0000 2003
\end{lstlisting}

  The first word, \texttt{000F}, indicates the length of the program
  in
  words (15 words, this does not include this first word).\\
  The next four words, \texttt{0000 0000 2000 2000}, represent the CDR
  and CAR of NIL and T respectively. These words are always
  constant.\\
  The next word, \texttt{000F}, represents the first unused position
  in memory. This is always equal to the first word transmitted.\\
  The next word, \texttt{C007}, represents the expression to be
  evaluated.\\
  The next word, \texttt{0000}, is ignored as this is where the
  processor will place the result of the evaluation.\\
  The rest of the words are subexpressions or parts of subexpressions
  of the expression being evaluated.\\

  The expression that is evaluated is \texttt{C007}, which is a CALL
  whose CDR is at position 7 (\texttt{2000}, meaning \texttt{(CAR 0)})
  and whose CAR is at
  position 8 (\texttt{0009}).\\
  \texttt{0009} is a list with CAR at 10 (\texttt{2005}, meaning
  \texttt{(SYMBOL 5)}) and CDR at 9 (\texttt{000B}).\\
  \texttt{000B} is a list with CAR at 12 (\texttt{2004}, meaning
  \texttt{(SYMBOL 4)}) and CDR at 11 (\texttt{000D}).\\
  \texttt{000D} is a list with CAR at 14 (\texttt{2003}, meaning
  \texttt{(SYMBOL 3)}) and CDR at 13 (\texttt{0000}, meaning NIL).

\item This binary string is sent via UART to the FPGA's READER. The
  READER writes the string (except the first word) into memory,
  overwriting the previous contents of the memory. It then reports to
  the controller that it is done, and the controller advances to
  \texttt{STATE\_RUN}.

\item The EVAL and GC evaluate the program in memory, and then place
  the result back into memory. The EVAL reports to the controller that
  it is done, and the controller advances to \texttt{STATE\_WRITE}.

\item The WRITER sends this memory dump via UART back to the REPL

\begin{lstlisting}
000F C007 2005 2000 0009 000B 2005 000D 2004 0000 2003 0000 4004 2000 0010 0000 0012 0000 0000 0009
\end{lstlisting}

  Then the WRITER reports to the controller that it is done, and the
  controller advances to \texttt{STATE\_READ}.

  Observe that this memory dump is similar to the assembler output.
  The first two words are the 6th and 7th words in the assembler
  output, the next word is the result of the evaluation, then the next
  8 words are words 9th through 16th in the assembler output. The
  following words are intermediate results computed by the processor.

\item The REPL's pretty printer interprets this string starting with
  \texttt{2005}, which means \texttt{(SYMBOL 5)}. The pretty printer
  looks in the compiler's mapping and finds that the 5th symbol was
  2, and so the REPL prints

\begin{lstlisting}
2
*
\end{lstlisting}

  The evaluation is complete, and a new prompt is printed so that the
  user can input another expression.
\end{enumerate}

\begin{figure}
  \centering\includegraphics[height=7cm]{repl.png}
  \caption{What the user types and sees in this example}
\end{figure}

\chapter{Conclusions}
A Lisp machine processor has been built using the original design and
techniques presented in \cite{lambda}. It has been extended with input
and output capabilities, and using a modern FPGA it has been
synthesised on a physical device slightly larger than a USB flash
drive. This device can be used as a standalone Lisp evaluator, or the
processor inside it can be included in a more complicated circuit.

Software has been written to help interact with the processor. This
includes a compiler, an assembler, and a REPL that together make it
possible for an end-user unfamiliar with hardware design or the
processor's internals to submit Lisp expressions to the processor and
get back their values. The software also allows programmers to convert
Lisp source code to the binary format understood by the processor, so
that they may later send the binaries to the processor for evaluation.

The processor and software have been tested to ensure they work
correctly. This has uncovered bugs and inefficiencies in the original
paper, some of which could not be gleaned from simply reading the
paper.

The source files for the hardware and software, and also the automated
tests have been released under an open-source license to make it easy
for future circuit designers and programmers to use and extend this
project. They can be found at \url{https://git.ieval.ro/?p=yule.git}.

The original paper \cite{lambda} is one of the influential Lambda
Papers that defined the Scheme programming language and is an
important historical artifact for understanding how Lisp machine
processors can be implemented and more generally how tagged
architectures work. But all details of the processor cannot be
understood from simply reading the paper, and most readers would not
follow by attempting to implement what is described there. This
projects brings a tangible implementation into the open, which shows
the limitations of the design to people interested in this design and
makes it possible to solve this limitations and extend the processor
beyond its original scope.

\section{Reflections}
Now that the project is finished, it is worth revisiting ideas I had
in the beginning that were not ultimately useful.

The original concept had the processor run on a typical development
board, with the software running on a Raspberry PI. The board and the
Raspberry PI would be connected through jumper wires, and the UART
circuitry on the Raspberry PI would be used by the software to talk to
the processor.

This idea would have produced a much harder to use YULE. Not only
would the user have to ssh into a Raspberry PI to do anything useful,
but also the equipment would take more space (both boards are larger
than the iCEstick). Once I tried the iCEstick and found that
everything can be done just as well with it, it replaced the two
boards and made the project easier to use (albeit slightly harder to
debug, as the iCEstick only has 5 LEDs).

Some code had to be rewritten, sometimes multiple times. The compiler
was originally written in Common Lisp, as it is the best suited
language for manipulation of S-expressions, especially Lisp code. This
meant a working project was obtained more quickly, but then the
compiler had to be rewritten in another language; if the user has a
software Lisp implementation already, why would they need YULE? I
chose Perl as the reimplementation language, as the assembler was
already written in Perl and it is widely-available, most systems
coming with a Perl interpreter out of the box.

Throughout development of the hardware part, timing problems were a
mainstay. The first ones related to the interactions between the
processor and the memory (the memory was clocked, but the processor
expected a combinational RAM). What fixed this problem was to run the
memory at twice the speed of the rest of the processor. At some point,
a decision was taken to run the GC and EVAL on different clocks (both
ran at one quarter the speed of the master clock, but one of the
clocks was delayed by half a cycle. The positive edge on one clock
would happen midway between a positive edge and a negative edge on the
other one). This proved to be unwise, as sometimes values sent by one
module to the other would be cleared before the destination module
managed to read them.

GTKWave proved to be very useful in debugging these timing issues, as
the exact point where the issue began was easy to identify and I could
see why the issue was happening. A workaround for this was to add a
latch on each side of the common E bus that remembered the value sent
long enough for the receiving module to see it.

All timing problems were fixed more cleanly in the end when an attempt
to simplify the clocks was made. First GC and EVAL were changed to use
the same clock, which removed the need for the latches, and then the
memory was changed to use the negative edge of the clock instead of
the positive edge, which meant no more clock division code had to be
used.

\section{Future work}
There are several ways to extend this project.

One idea is to fix the noted limitations: the ATOM primitive considers
NIL to not be an atom, and there is no EQ primitive that compares
symbols/other objects. The latter problem means that all symbols are
indistinguishable (note: NIL can still be distinguished from T and
other symbols, but NIL is not a symbol). To introduce equality a new
signal needs to be added to the EVAL that compares (for example) the
value on the E bus with the value of a fixed register, possibly a
newly added one. Then (EQ A B) would store A in the register, load B
on the E bus and compare them.

Once equality is implemented, it becomes possible to implement the
EVAL function that evaluates an arbitrary Lisp expression.

Two more involved instruction set improvements are adding arithmetic
operations, and input/output. Without arithmetic primitives one has to
rely on Church numerals, which encodes numbers as lambda abstractions
and is very inefficient. Input/output primitives would allow programs
to receive data at runtime over the UART and send data back before
evaluation is finished. This enables long-running useful programs, for
example with input/output and equality a REPL can be written (which
relies on the EVAl function mentioned above).

On the software side, the assembler can support optimizations and then
the REPL can be extended with a core image. The main optimization
needed in the assembler is common subexpression elimination. The
processor never overwrites a value in memory with another value (other
than overwriting \texttt{mem[6]} with the result of the evaluation),
so it is safe to share any identical subexpressions. For example, if
the user inputs the program \texttt{(cons (car '(1 2 3 4)) (cdr '(0 2
  3 4)))}, then when the assembler should share the tails of the two
lists, meaning the second list should compile to \texttt{(LIST (LIST
  X) (SYMBOL Y))} where X is a pointer to where the second element of
the first list is stored. This makes long programs significantly
shorter. Note that no compiler changes are required to implement this,
as the assembler can just remember all expressions and subexpressions
it has assembler so far including their locations, and whenever a new
expression needs to be assembler it can be replaced with a pointer to
where the same expression has been placed before.

Once this optimization is implemented, the REPL can gain core image
support. Suppose the user enters the expression \texttt{expr}. Instead
of sending \texttt{assemble(compile(expr))} to the processor, the REPL
can remember the last expression sent (or the last memory dump
received), then append \texttt{assemble(compile(expr))} to that
previous expression. This way \texttt{expr} can refer to previously
defined expressions, as long as the REPL or compiler has a way to name
previously supplied instructions, for example via a \texttt{define}
special form, via the standard REPL variables (\texttt{+, ++, +++}),
or by sequentially numbering all expressions submitted by the user.

Returning to hardware, a potential future project is to add actual
hardware garbage collection to the GC component. At the moment the
design does no garbage collection, the processor only has 4096 words
of RAM available, and evaluation is quite memory-inefficient. Adding
garbage collection would make more complicated programs possible
despite the limited RAM. One way to implement this is to add a small
microprocessor to the design which is preprogrammed with a garbage
collection algorithm (simple choices are a stop-and-copy collector or
Cheney's algorithm). When the free space pointer gets too high, the
EVAL and GC are stopped, and the new microprocessor is started up so
it can perform garbage collection, compact the memory, and update the
free space pointer. The EVAL/GC registers would have to be updated as
well. Then the microprocessor can be stopped, EVAL and GC be started
up again and evaluation can continue. Note that a non-moving collector
is not sufficient: the GC allocates new words only at the very end of
the memory, so it cannot benefit from free ``gaps'' in the middle of
the memory.

\begin{thebibliography}{99}
\bibitem{lambda} Design of LISP-Based Processors (1979), available at
  \url{http://repository.readscheme.org/ftp/papers/ai-lab-pubs/AIM-514.pdf}
\bibitem{osdvu} \url{https://opencores.org/project/osdvu}
\bibitem{gtkwave} \url{http://gtkwave.sourceforge.net/}
\bibitem{icestorm} \url{http://www.clifford.at/icestorm/}
\end{thebibliography}
\end{document}