From: Marius Gavrilescu Date: Thu, 17 May 2018 11:09:26 +0000 (+0300) Subject: Add report and artifacts used in it X-Git-Url: http://git.ieval.ro/?a=commitdiff_plain;h=e7b86bf0143c9bef1c26248fcda4fb2f2558b519;p=yule.git Add report and artifacts used in it --- diff --git a/AIM-514-design.pdf b/AIM-514-design.pdf new file mode 100644 index 0000000..c348c01 Binary files /dev/null and b/AIM-514-design.pdf differ diff --git a/AIM-514-ir.pdf b/AIM-514-ir.pdf new file mode 100644 index 0000000..30ce9fa Binary files /dev/null and b/AIM-514-ir.pdf differ diff --git a/devel-cover.png b/devel-cover.png new file mode 100644 index 0000000..09da4d2 Binary files /dev/null and b/devel-cover.png differ diff --git a/eval.svg b/eval.svg new file mode 100644 index 0000000..d0b2d90 --- /dev/null +++ b/eval.svg @@ -0,0 +1,286 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + clk + cen + rst + Ein + conn_ea + conn_et + EVAL + ostate + gcop + Eout + + + + + + + + + + + + + + diff --git a/gating.svg b/gating.svg new file mode 100644 index 0000000..ed373ad --- /dev/null +++ b/gating.svg @@ -0,0 +1,288 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + + + + + + + + + clk + data + + + + + + + + + + clk + en + data + out + out + + + diff --git a/gc.svg b/gc.svg new file mode 100644 index 0000000..d76af77 --- /dev/null +++ b/gc.svg @@ -0,0 +1,404 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + clk + cen + Ein + gcop + ram_do + ostate + conn_ea + conn_et + Eout + ram_di + ram_addr + ram_we + step_eval + GC + + Pout + + + + + + + + + + + + + + + + + + diff --git a/gcram.svg b/gcram.svg new file mode 100644 index 0000000..51b7c1c --- /dev/null +++ b/gcram.svg @@ -0,0 +1,225 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + clk + addr + din + we + GCRAM + do + + + + + + + + + + + + + diff --git a/icestick.png b/icestick.png new file mode 100644 index 0000000..bc9f1c1 Binary files /dev/null and b/icestick.png differ diff --git a/module.svg b/module.svg new file mode 100644 index 0000000..aa321f9 --- /dev/null +++ b/module.svg @@ -0,0 +1,1007 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + clk + addr + din + we + GCRAM + do + + + + + + + + + + + + + clk + cen + Ein + gcop + ram_do + ostate + conn_ea + conn_et + Eout + ram_di + ram_addr + ram_we + step_eval + GC + + Pout + + + + + + + + + + + + + + + + + + clk + cen + rx_byte + rx_signal + READER + active + ram_di + ram_addr + ram_we + + + + + + + + + + + + + clk + cen + tx_busy + P + ram_do + WRITER + tx_byte + tx_signal + finished + ram_addr + + + + + + + + + + + + + + clk + cen + rst + Ein + conn_ea + conn_et + EVAL + ostate + gcop + Eout + + + + + + + + + + + + + + diff --git a/reader.svg b/reader.svg new file mode 100644 index 0000000..8af3983 --- /dev/null +++ b/reader.svg @@ -0,0 +1,270 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + clk + cen + rx_byte + rx_signal + READER + active + ram_di + ram_addr + ram_we + + + + + + + + + + + + + diff --git a/report.tex b/report.tex new file mode 100644 index 0000000..03a43ea --- /dev/null +++ b/report.tex @@ -0,0 +1,1427 @@ +\documentclass[a4paper,12pt]{report} +\usepackage{color} +\usepackage[cm]{fullpage} +\usepackage{graphicx} +\usepackage{hyperref} +\usepackage{listings} +\usepackage{pdfpages} + +\title{YULE: Yet Unnamed Lisp Evaluator} +\author{Marius Gavrilescu} +\date{May 2018} + +\begin{document} +% title page does not need anchors +\hypersetup{pageanchor=false} +\maketitle +\hypersetup{pageanchor=true} +\lstset{language=Perl, frame=leftline, framexleftmargin=5pt, + showstringspaces=false, showspaces=false, breaklines=true, + postbreak=\mbox{\textcolor{red}{$\hookrightarrow$}\space}} + +\chapter*{Abstract} +Lisp machines are computers designed to run Lisp programs efficiently. +They were first developed in the 1970s, and they have special hardware +support for Lisp. This usually comprises a tagged architecture (each +pointer has a ``type'' field), garbage collection implemented in +hardware, and an instruction set that lends itself well to evaluating +Lisp expressions. + +Between 1975 and 1980, Guy L. Steele and Gerald Jay Sussman wrote a +series of memos on lambda calculus, the Lisp language and its +implementation known collectively as the ``Lambda Papers''. One of +these papers, ``Design of LISP-Based Processors (1979)'' describes how +to implement a Lisp machine processor. If this processor is connected +to a RAM that contains a program, it executes that program and places +the result back into the RAM. + +The aim of YULE is to take the design from the 1979, implement it +today using modern technology and techniques, and bundle it up in a +form that is both easy to use and extend. By going through this +modernisation process, future developers can easily integrate the Lisp +processor into their own design, modifying it as they please. + +The end result of YULE is a small device that can be plugged into a +USB port, and an associated computer program. Running the program +presents a simplified Lisp listener. The user can type a Lisp +expression into the listener, and the listener will respond with the +value of the expression. Under the hood, the listener evaluates Lisp +by sending the expression over USB to the small device and getting +back the result. + +YULE comes with the design files for the hardware, and source code for +the software. All of these are released under a free software license +and published online, which makes it very easy (from both a practical +and a legal standpoint) for future developers to use in their own +projects and modify to fit their purposes better. +\tableofcontents + +\chapter{Introduction} +\section{Motivation} +Lisp machine processors are quite unlike modern processors. Modern +instruction sets are designed for imperative programming. Each +instruction makes the processor execute some operation, such as +loading a register or storing a value in memory. In contrast, a Lisp +machine processor reads an expression from memory, determines its +type, and recursively evaluates it. This operation lends itself to +declarative programming, where the programmer declares an arbitrarly +complicated expression and asks the computer to evaluate it. + +Another common feature of Lisp machine processors is the use of tagged +pointers. Each memory location is a tagged union of two elements: the +value itself, and a tag representing the type of that data. Tagged +architectures are not common today, although some programs still +use tagged pointers by exploiting pointer alignment (if every pointer +is a multiple of 8, then the bottom 3 bits can be used to store a +tag). These tags can be used to store an object's reference count (as +in Objective-C on certain versions of iOS), or information about a +data structure (as in certain implementations of kd-trees). + +The ``Design of LISP-Based Processors (1979)'' paper by Guy L. Steele +and Gerald Jay Sussman \cite{lambda} presents the design of a +relatively simple Lisp machine processor, which is made from two +read-only memories, two buses, and several registers. If this +processor is connected to a RAM containing a Lisp expression in a +certain binary format, the processor will evaluate the expression and +place the result back in memory. + +This processor features both a tagged architecture (the top 3 bits of +each word in memory represent the value's type) and the declarative +nature mentioned above. It is therefore interesting to study and to +bring this processor into today's world by using technology such as +hardware description languages (that allow us to describe a circuit +formally) and FPGAs (that allow us to synthesise any circuit quickly +and at low cost). By building a HDL implementation of the processor +not only can it be used easily on its own (by synthesising it to a +FPGA and connecting it to RAM), but a future circuit designer can take +this processor and include it in their more complicated design. + +The processor presented in the original paper is a somewhat barebones +implementation of Lisp. Arithmetic primitives are not available, nor +are there any input or output routines. Due to time constraints the +paper's writers have not included a equality primitive. Additionally +the ATOM primitive has a bug that makes it consider NIL not atomic. +All these deficiencies can be rectified given time and effort, but it +is difficult for one to fix them given only a paper describing the +processor. Having a HDL implementation of the processor with a proper +test suite allows a future circuit designer to more easily make +changes to the processor while using the test suite to prevent +regressions and using a FPGA to quickly obtain a hardware version of +the changed processor that can be manually tested and used in any +desired application. + +\section{Challenges} +While \cite{lambda} is part of the well-known Lambda Papers series and +has been read by many over the years, there are no published +implementations of the processor described in the paper. Therefore +\cite{lambda} was the only source used in the building of this +project, and several of the limitations of the processor were not easy +to see before the implementation phase was well underway. Furthermore, +the original paper has several shortcomings. + +For one, while the microcode (the values of the two read-only +memories) is provided in both source and compiled form, the microcode +compiler is not provided. This makes it difficult to alter the +microcode to fix bugs or introduce new instructions. One needs to +either work on the already-compiled code (hard and error-prone) or +reimplement the microcode compiler, which takes a reasonable amount of +effort. + +Moreover, the paper does not concern itself at all with how the user +gets a program into memory or how the user gets the result back. This +makes the processor as-described very difficult to use. + +To implement the processor in Verilog, no source other than the paper +was used. There were no previous implementations of the processor that +could be consulted to resolve unclarities in the original paper. + +In addition, there was no software available for interacting with the +processor. A compiler and assembler had to be written from scratch, +and so the processor could not be easily debugged until after the +software was functional. + +\section{Contribution} +YULE is a modern-day implementation of a Lisp machine processor using +the ideas and techniques presented in \cite{lambda}. + +The main contribution of YULE is a Verilog implementation of the +processor which allows future circuit designers to use it in their own +design and modify it as they please. + +As the processor as described has no way to input a program or extract +a result from memory, YULE includes more circuitry around the +processor that lets it communicate with the external word. This +consists of a serial UART and two modules, a reader and a writer. When +the processor is idle, the reader waits for a program to be sent over +the UART. Once a program has been received, it is placed into memory +and the processor starts to evaluate it. Once evaluation is done, the +writer transmits the new contents of the memory over the UART. + +Besides the hardware chip, YULE also includes three pieces of +software: + +\begin{itemize} +\item a compiler, which takes Lisp source code and produces an + intermediate representation of the program. +\item an assembler, which takes a program in the intermediate + representation and produces the binary representation of the program + that the processor can understand. +\item a REPL, in which users can type expressions. When an expression + is submitted, the REPL compiles it, assembles it, and sends it to + the processor. The processor evaluates the expression and sends back + a memory dump. The REPL then reads the memory dump, interprets it, + and prints the value of the expression in a human-readable format. +\end{itemize} + +The REPL is therefore the entry point of YULE as a whole. An end-user +who does not intend to extend YULE can plug in an iCEstick with the +processor, and run the REPL. Then the user can write expressions in +the REPL and get their values back. Besides being the entry point, the +REPL serves as a convenient way to test that the entire system is +working. An expression typed in the REPL goes through the compiler and +assembler, and then is sent to the processor. It goes through the +three stages of the processor (reading the expression into memory, +evaluating the contents of the memory, writing a memory dump). + +The hardware and software components of YULE have been tested to +ensure they function correctly. There is an automated test suite +containing 48 unit tests for the compiler and assembler, and 21 +functional tests that engage all parts of the system (both the +software and the hardware). The unit tests ensure that the various +small parts of the compiler and assembler work properly, while the +functional tests consist of pairs of short Lisp expressions and their +correct values. The expressions are written to the REPL, and the +REPL's output is compared to the correct values. + +To make it easier for future circuit designers and programmers to use +this project, YULE has been released under an open source license. +Prospective users can find all the source files (for both hardware and +software) online, and are free to use, modify, and redistribute them +under the same terms as Perl itself. The files can be accessed at this +URL: \url{https://git.ieval.ro/?p=yule.git}. + +\section{Layout of this report} +This report begins by describing the background of YULE, the +technologies that made this project possible and a more in-depth +description of the 1979 paper. Then the components that YULE comprises +are examined in turn, followed by a chapter on how everything was +tested, which includes a worked example of the system in action. The +report ends with a list of ideas for future work based on YULE, and a +section on reflections. + +\chapter{Background} + +\section{Field-programmable gate array} +An FPGA is a configurable integrated circuit. A designer can define a +circuit using a hardware description language (explained in the next +section) and compile it to some binary format called a configuration. +This configuration describes how the different components of the FPGA +should be connected to each other. The configuration can then be +uploaded to the FPGA, and the FPGA will connect its internal +components as requested. + +An FPGA can therefore replace any digital custom circuit (ASIC). While +an FPGA consumes more energy, uses more space, and runs at slower +clock rates than an ASIC, the FPGA permits rapid prototyping: the time +between finishing to write a design and having a working part is mere +seconds (compilation time + uploading the configuration to the FPGA), +and if a bug is discovered the buggy part need not be thrown away. The +design can be modified and the fixed configuration can be uploaded to +the FPGA. + +The hardware aspect of YULE is designed to run on the Lattice +iCE40HX-1k FPGA, but with small modifications it can run on other +FPGAs as well. + +\subsection*{iCEstick} +\begin{figure} + \centering + \includegraphics[height=4cm]{icestick.png} + \caption{Lattice iCEstick Evaluation Kit} +\end{figure} + +The Lattice iCEstick Evaluation Kit is a development board for the +iCE40HX-1k FPGA. It contains the FPGA, a USB connector, a FTDI chip +and 5 LEDs. The form factor is very convenient. It is similar to a USB +memory stick (albeit bigger) that can be plugged into a computer, +avoiding the need for jumper wires and GPIO that traditional +development boards require. + +The FTDI chip implements the USB protocol, allowing the FPGA to be +programmed by the computer and letting the FPGA talk to the computer +using a very simple serial UART interface on the FPGA side, and a +standard USB serial port on the computer side. This makes +communication very easy. The FPGA can use any off-the-shelf UART +implementation (this project uses \texttt{osdvu} from +Opencores \cite{osdvu}), and the computer can use standard serial +console tools such as \texttt{minicom}. + +\section{Verilog} +Verilog is a hardware description language. It is used to model +digital circuits at register-transfer level of abstraction. In Verilog +one defines several modules, with one of them (the ``toplevel'' +module) representing the entire design. Each module has several inputs +and outputs, can contain internal wires, can contain instances of +other modules, and can contain combinational and sequential logic +blocks that drive internal wires and outputs. + +Verilog has many more complicated features that are not used in YULE. +These include tristate buffers, delays, functions, tasks, loops, and +strength declarations. + +A subset of Verilog can be \textit{synthesised}; that is converted to +a netlist (a list of transistors or FPGA components used and +connections between them) that can then be used to build an ASIC +(custom integrated circuit), or can be further converted to the +internal representation of the FPGA and uploaded to the FPGA. The FPGA +workflow and the programs that are part of it are described in the +next section. + +A verilog project can also be run through a simulator, which simulates +the design clock edge by clock edge and records the values on each +wire in the design. This record can later be read in and visualised +with a wave viewer, which allows a circuit designer to better +understand why the circuit is not working correctly. One such program, +GTKWave \cite{gtkwave}, was used to debug YULE. + +\section{icestorm} +Project IceStorm \cite{icestorm} is a set of open source reverse +engineered tools for programming Lattice iCE40 FPGAs. There exists a +fully open source Verilog-to-Bitstream flow which comprises: + +\begin{enumerate} +\item Yosys, a Verilog synthetizer. +\item Arachne-pnr, a tool that implements placing and routing for + iCE40 FPGAs. +\item icepack, which converts Arachne-pnr's output to iCE40 bitstream. +\item iceprog, which programs the iCEstick with the output of icepack. +\end{enumerate} + +\section{Design of LISP-Based Processors (1979)} +Design of LISP-Based Processors (1979) \cite{lambda} is one of the +``Lambda papers''; a series of memos by Sussman and Steele about the +Lisp language and its implementation. The paper describes SIMPLE, a +microprocessor designed to run Lisp. + +This processor is made of two read-only memories called EVAL and GC. +Each read-only memory is paired with state registers to form a state +machine. The contents of the state registers are fed into the +read-only memory, and the output of the read-only memory determines +the next state and other signals. + +Each state machine also has an internal bus (called the E and G buses +respectively) and several registers connected to that bus. Each of +these registers stores a tagged pointer, which is an 11-bit value (the +top 3 bits represent the type, and the bottom 8 bytes represent the +value). + +The signals coming out of each state machine can cause a register to +be read onto the machine's internal bus, a register to be loaded from +the internal bus, or write a constant value onto the internal bus. + +The GC state machine represents the processor's memory system. It is +connected to an external RAM, which it controls through several +signals. It knows how to execute certain ``GC operations'', which +represent common Lisp memory management actions. Examples of these are +\texttt{CAR} and \texttt{CDR} (``given a pointer to a cons cell, +return either the CAR or the CDR of that cons cell'') or \texttt{CONS} +(``given two pointers, allocate a new cons cell, copy them into it, +and return a pointer to the new cons cell''). The existence of the GC +isolates the rest of the processor from the particularities of the +memory used. + +The EVAL state machine evaluates Lisp expressions. Signals coming out +of it control the GC by selecting what operation the GC should perform +and connecting the E and G buses to supply arguments to the GC and +read results back. The EVAL reads expressions from memory via the GC +and evaluates them, putting results back into memory. + +The overall operation of the processor is to read an expression from +the magic location ``5'' in memory, evaluate it, and place the result +into the magic location ``6'' in memory. Other memory locations are +used to store subexpressions of the input expression, and to store any +intermediate results. Despite its name, GC does not perform any +garbage collection, so at the end of the processor's operation the +memory will still contain all intermediate results produced during +evaluation. + +The processor is intended as a proof of concept and not a +production-ready hardware implementation of Lisp. Therefore it has +several important limitations, for example there is no primitive +operation that implements equality of objects (making it impossible to +distinguish between symbols), there are no primitive arithmetic +operations (so one needs to use Church numerals if numbers are +needed), and the processor includes a known bug (``the ATOM bug''). +The bug is that the ATOM primitive considers NIL to be non-atomic, +which is incorrect. + +The Lisp expressions mentioned here are in textual form (Lisp source +code), but instead in a special tagged pointer format. Each memory +value is a union of two values: a 8-bit pointer and a 3-bit type. We +can use pairs of consecutive memory locations to represent cons cells +and build linked lists out of them. If a list is at location X, that +means that (X, X+1) is the first cons cell of the list, with X+1 being +the CAR (the value of the first element) and X being the CDR (a +pointer to the next cons cell). A discussion of the different ways to +represent Lisp expressions (source code, this tagged pointer format, +and an intermediate representation) can be found in \ref{sec:formats}. + +The next page shows a diagram of how the processor in the paper is +structured. The two read-only memories can be seen, with their +registers, state registers, and the interconnection between the E and +G buses. + +\begin{figure} +\centering\includegraphics[width=18.5cm]{AIM-514-design.pdf} +\caption{Design diagram of processor from \cite{lambda}} +\end{figure} + +\chapter{A Lisp Machine processor on a USB drive} +The project comprises a hardware part written in Verilog and a +software part written in Perl. + +The hardware part is a Lisp machine processor with associated +circuitry that can read a program from the iCEstick UART and that can +write the result of running a program to the UART. The normal +operation of this part is to: + +\begin{enumerate} +\item Wait for a program to be supplied via UART +\item Run the program on the processor +\item Write a memory dump to the UART +\item Go back to step 1 +\end{enumerate} + +The software part includes: + +\begin{itemize} +\item A compiler, which takes Lisp source code and puts it in the + linked list format understood by the SIMPLE processor +\item An assembler, which takes the result of the compiler and + converts it to the SIMPLE in-memory binary format. +\item A REPL, which repeatedly reads a program from the user, compiles + it, assembles it, sends it to YULE, reads the memory dump from YULE, + and prints the result for the user to see. +\end{itemize} + +\clearpage +\section{Hardware} +The hardware part is built from 8 modules: + +\begin{enumerate} +\item The \textbf{READER}, which implements step 1 of the normal operation described above. +\item The \textbf{WRITER}, which implements step 3 of the normal operation described above. +\item The \textbf{GCRAM}, a single-port RAM. +\item The \textbf{GC}, corresponding to the GC PLA in the original implementation. +\item The \textbf{EVAL}, corresponding to the EVAL PLA in the original implementation. +\item The \textbf{CTRL}, which advances the operation from one step to the next one. +\item The \textbf{UART}, which implements transmission and reception of bytes via UART. +\item The \textbf{CPU}, which instantiates every module above and wires them together. +\end{enumerate} + +\subsection{Clocks} +To avoid timing problems, a single 48 MHz clock is used. The iCEstick +has a 12 MHz oscillator, and the phase-locked loop circuitry on the +FPGA multiplies this signal by 4. + +Every flip-flop updates on the positive edge of the clock, except for +the RAM which updates on the negative edge. This way the rest of the +circuit perceives the RAM to be combinational (it ``updates +immediately''): if on a positive edge the RAM's address input changes, +the RAM will update its output on the following negative edge and the +changed output will be visible on the next positive edge. If the RAM +updated on a positive edge, then if the input changed on positive edge +$i$ the changed output will only be visible on positive edge $i+2$. + +The design is intended to be run on a FPGA, which has a fixed clock +tree. This typically consists of a small number of special nets called +global buffers that can distribute signals with minimal skew to every +logic cell. The FPGA on the iCEstick has 8 such global buffers. + + +There are 4 components in the design (GC, EVAL, READER, WRITER) that +need to be enabled/disabled at different times. One way to achieve +this is to AND the master 48 MHz clock with several signals, thus +creating a separate clock for each component. This is known as +\textit{clock gating}. However, on a FPGA clock gating involves +routing the clock from a global buffer to a logic cell and then back +to another global buffer. This can introduce glitches in the clock and +would delay each of the clocks by some amount (dependent on the +signals it is ANDed with). It is therefore possible that the 4 output +clocks would not be aligned, causing problems when data is +communicated by one component to another. As such, clock gating is not +recommended on FPGAs. + +\begin{figure} + \centering\includegraphics[height=3cm]{gating.eps} + \caption{Gated clock on the left, clock enable on the right} +\end{figure} + +To avoid this problem, the design used here does not use clock gating +and instead uses 4 clock enable signals supplied by the controller to +each of the GC, EVAL, READER and WRITER. At any point the clock is +enabled for exactly one of the READER, GC, and WRITER. The EVAL clock +is enabled if and only if the GC clock is enabled and the GC's +\texttt{step\_eval} output is high. + +\clearpage +\subsection{GC} +\begin{figure} + \centering\includegraphics[height=4.88cm]{gc.eps} + \caption{GC module} +\end{figure} + +The GC module is connected to the GCRAM (via the \texttt{ram\_do, + ram\_we, ram\_addr, ram\_di} signals), to the EVAL (via the +\texttt{Ein, Eout, gcop, step\_eval, conn\_et, conn\_ea} signals) and +to the controller. It also provides the value of its P register (the +free storage pointer) to the WRITER, so the writer can know when to +stop dumping the memory. + +It is implemented as a combinational ROM, some combinational logic to +drive the internal G bus, sequential logic to advance to the next +state respecting dispatch rules, and sequential logic to update +registers. The clock enable signal acts as an active low synchronous +reset: when the clock is disabled, the state register is set to the +initial state. + +This module represents one half of the original design presented in +\cite{lambda}, and the ROM's contents are those from the paper with a +bugfi (the original design sets the ADR line low in two states when it +should be high). + +\subsection{EVAL} +\begin{figure} + \centering\includegraphics[height=3.29cm]{eval.eps} + \caption{EVAL module} +\end{figure} + +The EVAL module is connected to the GC (via the \texttt{conn\_et, + conn\_ea, Ein, Eout, gcop} signals) and to the controller. + +The implementation of the EVAL is very similar to that of the GC. +There is a combinational ROM, combinational logic to drive the E bus, +sequential logic to advance to the next state respecting dispatch +rules, and sequential logic to update registers. A synchronous reset +is available. When the reset line is held high for one clock cycle, +the state register is set to the initial state. It is necessary to +separate the reset from the clock enable because the GC often pauses +the EVAL to do work and the EVAL should not be reset every time this +happens. + +This module represents the other half of the original design presented +in \cite{lambda}, and the ROM's contents are the same as those in the +paper except that the ROM is widened by one bit to indicate whether +the EVAL is finished evaluating the expression. + +\clearpage +\subsection{Writer} +\begin{figure} +\centering\includegraphics[height=2.26cm]{writer.eps} +\caption{WRITER module} +\end{figure} + +The WRITER module is connected to the UART (via the \texttt{tx\_busy, + tx\_signal, tx\_byte} signals), the GC (via the \texttt{P} signal), +the RAM (via the \texttt{ram\_addr, ram\_do} signals), and to the +controller. + +Its purpose is to dump the memory to the UART starting with position 4 +(since positions 0, 1, 2, 3 are constant) and finishing with position +$P - 1$, the last used cell in memory. It is implemented as a state +machine with 7 states and an extra register (\texttt{current\_index}) +that stores the current index in the memory. + +The memory is 16-bit wide, but the UART sends and receives bytes. The +writer dumps memory word in network byte order (big-endian), meaning +the significant byte is sent before the least significant byte. + +The states are, in order: +\begin{description} +\item[START] which initializes \texttt{current\_index} to 4 +\item[WRITE1\_WAIT] which waits for \texttt{tx\_busy} to become low +\item[WRITE1] in which transmission of the upper byte starts +\item[WRITE2\_WAIT] which waits for \texttt{tx\_busy} to become low +\item[WRITE2] in which transmission of the lower byte starts +\item[INCREMENT] in which \texttt{current\_index} is incremented, + jumping back to WRITE1\_WAIT if it is still lower than + \texttt{freeptr}. +\item[FINISHED] in which \texttt{finished} is driven high +\end{description} + +This module is not part of the original design in \cite{lambda}, as +that paper did not concern itself with input and output. + +\clearpage +\subsection{Reader} +\begin{figure} +\centering\includegraphics[height=2.22cm]{reader.eps} +\caption{READER module} +\end{figure} + +The READER module is connected to the UART (via the +\texttt{rx\_signal, rx\_byte} signals), the RAM (via the +\texttt{ram\_we, ram\_addr, ram\_di} signals), and to the controller. + +Its purpose is to read a program from the UART and overwrite the GCRAM +with this program. It is implemented as a state machine with two extra +registers (\texttt{total\_words} and \texttt{current\_index}) to store +the size of the incoming program and the current position within it. + +The reader expects to receive a 13-bit \texttt{length} value, followed +by \texttt{length} 16-bit values that represent the desired memory +contents. Each of these values is sent in network byte order +(big-endian), meaning the most significant byte is sent before the +least significant byte. + +The states are, in order: +\begin{description} +\item[IDLE] which waits for \texttt{rx\_signal}, puts the read byte at the top of \texttt{total\_words}. +\item[LENGTH] which waits for \texttt{rx\_signal}, puts the read byte at the bottom of \texttt{total\_words} +\item[READ1] which waits for the \texttt{rx\_signal}, puts the read byte at the top of \texttt{ram\_di} and increments \texttt{current\_index} +\item[READ2] which waits for the \texttt{rx\_signal}, puts the read byte at the bottom of \texttt{ram\_di} +\item[WRITE] in which \texttt{ram\_we} is high, jumping back to READ1 if \texttt{current\_index + 1 < total\_words} +\item[FINISHED] in which \texttt{finished} is high +\end{description} + +This module is not part of the original design in \cite{lambda}, as +that paper did not concern itself with input and output. + +\clearpage +\subsection{Putting the hardware together} +The components described above are managed by the controller, which +provides them with clock enable signals. The controller is a state +machine with three states: +\begin{description} +\item[STATE\_READ] in which only the READER clock is enabled +\item[STATE\_RUN] in which the GC and EVAL clocks are enabled +\item[STATE\_WRITE] in which only the WRITER clock is enabled +\end{description} + +In STATE\_READ and STATE\_WRITE, the EVAL's reset line is high. + +Whenver the component corresponding to the current state indicates it +is finished, the controller advances to the next state. + +The controller also includes three multiplexers that control which +lines are connected to the GCRAM. If the READER is active, all three +GCRAM lines are connected to the READER. If the WRITER is active, the +address line is connected to the writer (and the other lines are +connected to the GC). If the GC is active, all lines are connected to +the GC. + +The controller also includes logic to drive the 5 LEDs on the +iCEstick. Three LEDs indicate the current state, while the other two +indicate if the UART is currently transmitting or receiving. + +\section{Representing a Lisp program} +\label{sec:formats} +The processor described above evaluates a Lisp program received via +UART. This section talks about three representations of such a +program: the Lisp source code (a S-expression), the intermediate form +(also a S-expression), and the in-memory form (a binary string). + +The first representation is obtained by parsing the user input, the +compiler converts this into the second, and the assembler converts the +second into the third. + +\subsection{Lisp source code} +Recall that a S-expression is either an atom or a cons cell. + +An atom is a sequence of letters, digits, and certain symbols. There +are two special atoms named \texttt{T} and \texttt{NIL} representing +the boolean true and false respectively. + +A cons cell is a pair of two S-expressions called the ``car'' and the +``cdr''. We write them as \texttt{(car . cdr)}, for example \texttt{(5 + . 7)} is a cons cell with car \texttt{5} and cdr \texttt{7}. + +We can use cons cells to represent linked lists. A linked list is a +sequence of cons cells where the cdr of one cell is the next cell, and +the cdr of the last cell is \texttt{NIL}. For example the list +\texttt{1,2,3} can be written as \texttt{(1 . (2 . (3 . NIL)))}. +We can also write this list more concisely as \texttt{(1 2 3)}, and we +write \texttt{()} to mean the atom \texttt{NIL}. + +Using these syntax elements we can encode several types of logical +expressions: + +\begin{itemize} +\item Numbers are encoded as numeric atoms, e.g. \texttt{5} +\item Variables are encoded as non-numeric atoms, e.g. \texttt{my-var} +\item Conditional statements are encoded as 4-element lists, e.g. + \texttt{(if is-odd 5 6)} which has value 5 if the variable + \texttt{is-odd} is not \texttt{NIL} and value 6 otherwise +\item Symbols are encoded as non-numeric atoms preceded by a single + quote, e.g. \texttt{'my-symbol}. This can also be written + \texttt{(quote my-symbol)} +\item Constant lists are encoded as lists preceded by a single quote, + e.g. \texttt{'(1 2 (3 4 5))}. This can also be written + \texttt{(quote (1 2 (3 4 5)))} +\item Functions (lambda abstractions) are encoded as 4-element lists, + e.g. \texttt{(lambda first (a b) a)} in which the second element is + the function name, the third element is a list of function + arguments, and the fourth element is the function body. +\item Function calls are encoded as lists, with the function as the + first element and the arguments as the other elements, e.g. + \texttt{(atom 5)} calls the function \texttt{atom} with argument + \texttt{5}. +\end{itemize} + +\subsection{In-memory representation} +The memory is an array of 16-bit integer values. We interpret these as +tagged pointers: the top 3 bits indicate the ``type'' of this pointer, +while the other 13 bits represent another memory location. A pointer +therefore has a type and a value. + +We represent each Lisp expression as a tagged pointer. To evaluate it, +the processor looks at the type of the pointer. Here is what a pointer +of value $v$ means based on its type. We use the syntax $mem[a]$ to +mean ``the contents of the memory location $a$'' (that is, a tagged +pointer) and $memval[a]$ to mean ``the value at the memory location +$a$'' (that is, the value of the tagged pointer $mem[a]$). + +\begin{enumerate} +\setcounter{enumi}{-1} +\item is a constant list, whose cdr is $mem[v]$ and car is $mem[v+1]$. +\item is a constant symbol. +\item is a variable reference, where $v$ indicates the position in the + environment of the referenced variable. +\item is a constant closure, that is the result of evaluating a lambda + abstraction. Values of this type do not normally appear in the input + code, but instead are created during evaluation of other forms. +\item is a procedure (lambda abstraction), whose body is $mem[v]$. +\item is a conditional, where $mem[v+1]$ is the condition, + $mem[memval[v]+1]$ is the ``then'' branch, and $mem[memval[v]]$ is the + ``else'' branch. +\item is a procedure call, which points to a special kind of linked + list. The CDR of each cell (other than the last one) points to the + next cell and has type 0, while the CDR of the last cell has value 0 + and its type indicates the primitive to be called. The CARs of these + cells indicate the arguments for the procedure, in reverse order. + + To call a user-defined procedure, we use the primitive + \texttt{funcall} and put the user-defined procedure as the CAR of + the last cell (in other words, the user-defined procedure is the + first argument to \texttt{funcall}). +\item is a quoted constant, representing the object at $mem[v]$. +\end{enumerate} + +The booleans \texttt{NIL} and \texttt{T} are compiled to a list +pointing to 0, and a symbol with value 2 respectively. + +The memory has a peculiar fixed layout. Locations 0 and 1 are the CDR +and CAR of NIL, locations 2 and 3 are the CDR and CAR of T, location 4 +holds the first unused location in memory (this tells the processor +where it can start to place intermediate expressions), location 5 +holds the expression to be evaluated, and location 6 is where the +processor will place the value of the expression. + +\subsection{The intermediate form} +To facilitate conversion from the former format to the latter, an +intermediate form is used. Here we represent tagged pointers as lists +of one of three types: + +\begin{itemize} +\item \texttt{(type value)}, where ``type'' is the pointer's type and + ``value'' is the pointer's (numeric) value. +\item \texttt{(type ptr)}, where the pointer's type is ``type'' and + the pointer's value is $x$, where $x$ is the memory location where + ``ptr'' (another tagged pointer) lies. +\item \texttt{(type ptr1 ptr2)}, where the pointer's type is ``type'' + and the pointer's value is $x$, where $x$ is the memory location + where ``ptr1'' lies and $x+1$ is the memory location where ``ptr2'' + lies. +\end{itemize} + +These three representations map nicely to the tagged pointers +described in the previous section. The first representation is used +for constant symbols and variable references, the second for +procedures, and the third for constant lists, conditionals, and +procedure calls. + +A visual representation of the intermediate representation (from the +AIM-514 paper) can be seen on the next page. The expression in +question is + +\begin{lstlisting} +((lambda append (x y) (if (atom x) y (cons (car x) (append (cdr x) y)))) '(a b c) '(d e f)) +\end{lstlisting} + +which we would represent as this sexp: + +\begin{lstlisting} +(CALL (MORE (MORE (FUNCALL 0) (PROC (IF (LIST (CALL (MORE (CONS 0) (CALL (MORE (MORE (FUNCALL 0) (VAR -1)) (VAR -2)) (CALL (CDR 0) (VAR -3)))) (CALL (CAR 0) (VAR -3))) (VAR -2)) (CALL (ATOM 0) (VAR -3))))) (LIST (LIST (LIST (LIST 0) (SYMBOL 6)) (SYMBOL 7)) (SYMBOL 8))) (LIST (LIST (LIST (LIST 0) (SYMBOL 3)) (SYMBOL 4)) (SYMBOL 5))) +\end{lstlisting} + +As mentioned in the Background chapter, the processor has a bug (the +ATOM bug) that causes \texttt{(ATOM NIL)} to be true. As a result, +\texttt{(atom x)} in this expression is always true, even when +\texttt{x} is the empty list, and the expression loops forever (or at +least until memory is exhausted). We can fix this by changing the +condition from \texttt{(atom x)} to \texttt{x}, which is false when +\texttt{x} is NIL and true otherwise, and swapping the two branches of +the conditional. The fixed program is: + +\begin{lstlisting} +((lambda append (x y) (if x (cons (car x) (append (cdr x) y)) y)) '(a b c) '(d e f)) +\end{lstlisting} + +which is compiled to the sligthly shorter sexp: + +\begin{lstlisting} +(CALL (MORE (MORE (FUNCALL 0) (PROC (IF (LIST (VAR -2) (CALL (MORE (CONS 0) (CALL (MORE (MORE (FUNCALL 0) (VAR -1)) (VAR -2)) (CALL (CDR 0) (VAR -3)))) (CALL (CAR 0) (VAR -3)))) (VAR -3)))) (LIST (LIST (LIST (LIST 0) (SYMBOL 6)) (SYMBOL 7)) (SYMBOL 8))) (LIST (LIST (LIST (LIST 0) (SYMBOL 3)) (SYMBOL 4)) (SYMBOL 5))) +\end{lstlisting} + + +\begin{figure} +\centering\includegraphics[width=18.5cm]{AIM-514-ir.pdf} +\caption{Example of intermediate representation} +\end{figure} + +\section{Software} +\subsection{Compiler} +The compiler takes in Lisp code and translates it to a sexp-based +format understood by the assembler. + +The compiler has several functions for compiling certain kinds of +expressions, and a function (\texttt{process\_toplevel}) that uses the +former to compile arbitrary expressions. + +The output from any of this functions is a sexp in the intermediate +form described in the previous section. + +Symbols are identified by sequences of letters on the source code side +and by numbers on the processor side. The compiler therefore keeps a +mapping between symbol names and their numbers. Whenever a new symbol +is encountered, a new entry is added to this map. + +The \texttt{process\_quoted} function handles constants. For example +if the source code contains \texttt{'(1 2 3)} then +\texttt{process\_quoted} is called with \texttt{(1 2 3)}. This +function handles the symbol interning described in the previous +paragraph, and converts its input to intermediate form. + +The \texttt{process\_proc} function handles lambda abstractions. For +example if the source code contains \texttt{(lambda first (a b) a)} +then \texttt{process\_proc} is called with \texttt{first}, +\texttt{(a b)}, and \texttt{a}. This function appends the function +name and arguments to the environment, then processes the body (using +\texttt{process\_toplevel}) with the new environment. + +The \texttt{process\_funcall} function handles function calls. For +example if the source code contains \texttt{(reverse-list 2 1)} then +\texttt{process\_funcall} is called with \texttt{reverse-list}, and +\texttt{(2 1)}. The arguments are first processed (using +\texttt{process\_toplevel}), then (if the function is not a primitive) +the function is processed as well. + +Finally the \texttt{process\_toplevel} handles any expression by +figuring out what sort of expression it is and then processing it +appropriately. Empty lists and \texttt{NIL} become \texttt{(list 0)}, +numbers and \texttt{T} are processed with \texttt{process\_quoted} + +Examples of compiler operation can be found in the compiler tests. +Here are a few of them: + +\begin{itemize} +\item The Lisp expression \texttt{'()} is compiled to \texttt{(LIST 0)} +\item The Lisp expression \texttt{'(5 foo)} is compiled to + \texttt{(LIST (LIST (LIST 0) (SYMBOL 3)) (SYMBOL 4))}. Observe that + the CDR comes before the CAR, and symbols are interned: + \texttt{'foo} is compiled to \texttt{(SYMBOL 3)} and \texttt{'5} is + compiled to \texttt{(SYMBOL 4)} +\item The Lisp expression \texttt{(if t '(2 3) 'x)} is compiled to + \texttt{(IF (LIST (SYMBOL 5) (LIST (LIST (LIST 0) (SYMBOL 3)) + (SYMBOL 4))) (SYMBOL 2))}. Observe that \texttt{t} is always + compiled to \texttt{(SYMBOL 2)}, as mentioned in the section about + the memory format. +\end{itemize} + +\subsection{Assembler} +The assembler takes in an S-expression in the intermediate form, lays +out the words in memory, and produces a memory dump that can be +supplied to the processor. + +The core of the assembler is a function called \texttt{process} that +takes an expression in intermediate form and an optional location in +memory. This function lays out the expression's subexpressions in +memory and then lays out the expression itself at the given location +(or at the end of the memory if there is no location given). + +The assembler begins by initializing the first 7 elements of the +memory with the values needed by the processor. After initialization, +the \texttt{process} function is called on the input expression with +no location. + +Here is how the \texttt{process} function lays out the expression +$expr$ at location $l$ (where $l$ can be a given memory location or +simply ``the first free location in memory''): + +\begin{itemize} +\item If $expr$ is \texttt{(type value)}, then it is directly laid out + at $l$ +\item If $expr$ is \texttt{(type ptr)}, then \texttt{process} is + called recursively to lay \texttt{ptr} out at the first free + location in memory (say this is location $loc$). Then \texttt{(type + $loc$)} is laid out at $l$. +\item If $expr$ is \texttt{(type ptr1 ptr2)}, then two consecutive + free locations in memory $loc$ and $loc+1$ are reserved, + \texttt{process} is recursively called twice to lay \texttt{ptr1} at + $loc$ and \texttt{ptr2} at $loc+1$, and finally + \texttt{(type $loc$)} is laid out at $l$. +\end{itemize} + +Once again, examples of assembler operations can be found in the +assembler tests. Two of them: + +\begin{itemize} +\item \texttt{(quoted (symbol 5))} is assembled to \texttt{00 08 00 00 + 00 00 20 00 20 00 00 08 E0 07 00 00 20 05}. Observe that each word + is made of two bytes, the first word (\texttt{00 08}) indicating the + length (in words) of the program. The first 4 words are always + fixed: they represent in order the CDR of NIL, the CAR of NIL, the + CDR of T, and the CAR of T. The 7th word is also a constant, 0, + because that is where the processor puts the result. The 5th word is + the first free word in memory (which is equal to the length in words + of the program), and the 6th word (\texttt{E0 07}) is the expression + to be evaluated. In this case, it has type quote and points to + memory location 7 (the 8th word, because pointers are 0-indexed). At + memory location 7 we have the value \texttt{20 05} which has type + symbol and value 5. +\item \texttt{(call (more (funcall 0) (proc (var -2))) (number 5))} is + assembled to \texttt{00 0C 00 00 00 00 20 00 20 00 00 0C C0 07 00 00 + 00 09 20 05 E0 00 80 0B 5F FE}. Here we want to evaulate the + expression \texttt{C0 07} (type ``function call'', address 7). This + points to a cons-cell, whose car (at position 8) is \texttt{20 05} + (that is, \texttt{(number 5)}) and whose cdr (at position 7) is + \texttt{00 09} (type ``more'' and address 9). The ``more'' points to + another cons-cell, whose cdr (at position 9) is \texttt{E0 00} (that + is, \texttt{(funcall 0)}), and whose car is \texttt{80 0B}: a + procedure whose body is at position 11, namely \texttt{5F FE} (that + is, \texttt{(var -2)}). + + So the expression is a call to the primitive \texttt{funcall} with + arguments \texttt{(lambda l (x) x)} and \texttt{5}. The original + program was probably \texttt{((lambda l (x) x) 5)}. Note that this + program was compiled with an old version of the compiler that did + not intern symbols, which explains why \texttt{5} was compiled to + \texttt{(number 5)} and not \texttt{(symbol 3)}. +\end{itemize} + +\subsection{REPL} +The REPL ties everything together: it reads a program from the user, +compiles it, assembles it, sends the assembled program to the +processor, reads the memory dump from the processor, and finally +prints the result to the screen. + +The REPL includes an S-expression pretty-printer that operates on +memory dumps. As symbols are interned by the compiler, when the +pretty-printer hits a symbol it only knows its index, and not the +original name. To print the original name the pretty-printer searches +the symbol map built by the compiler and pulls the name from there. + +The pretty-printer takes a chunk of memory and index, and prints the +expression at that index. For example, if the memory chunk is +\texttt{00 00 00 00 20 00 20 00 00 09 00 07 00 07 20 02 00 00 60 00} +and the index is 6, then the pretty printer will try to print +\texttt{00 07}, which is a list with car \texttt{00 00} and cdr +\texttt{20 02}, and the pretty printer prints \texttt{(nil . t)}. + +\clearpage +An example REPL session is: + +\begin{lstlisting} +YULE REPL + +* (cons 1 2) +(1 . 2) +* (car '(4 5 6)) +4 +* (cdr '(t t t)) +(T T) +* ((lambda id (x) x) 5) +5 +* (lambda first (a b) a) + +* ((lambda first (a b) a) 7 8) +7 +* (atom 6) +T +\end{lstlisting} + +\chapter{Evaluation and Testing} +To make development easier, ensure everything works correctly, and +prevent regressions there are unit tests for the compiler and +assembler, and functional tests for the entire project. + +Besides these automated tests, manual evaluation (in the beginning, +precompiling simple expressions and supplying them to the processor as +the initial value of the memory; later on, manually typing expressions +in the REPL and looking at the output). When the processor was not +operating as expected, the GTKWave wave simulator was used to assist +with debugging. This piece of software takes in a recorded simulation +of the circuits and displays the values of all signals inside the +circuit against time. This allows the designer to look at every clock +cycle in turn and discover where the error occurs. + +The process of debugging the processor had the side effect of finding +inefficiencies in the original design. While \cite{lambda} mentions +some limitations of the processors, one important limitation that was +only discovered during debugging is that for every operation that the +EVAL performs, the result of that operation will be written to new +locations in memory. This includes ``read-only'' operations such as +CAR and CDR. If the programmer requests the CAR of a list, a copy of +the first element will be taken and a pointer to it will be returned. +Since YULE only has 4096 words of memory, this severely restricts the +complexity of programs that can be evaluated on it. + +Another avenue to ensure the processor works as it should is formal +verification using model checking tools. Formal verification is +commonly employed as part of circuit design evaluation because +mistakes are more difficult to identify, and if found after the chip +has been built they are very costly to fix (often the chips that were +built with the bug need to be thrown away). However, building models +is a very involved process, and devoting effort to formal verification +of this project would not have allowed the implementation of several +features. Due to time constraints and knowing that no physical YULE +chips are built (the project is so far only synthesised to a FPGA), +testing and manual evaluation was considered sufficient for YULE. + +\section{Unit testing} +The software was tested with 48 unit tests written in Perl using the +\texttt{Test::More} framework. + +The assembler takes in compiled Lisp code and outputs a memory dump in +one of two formats (binary or Verilog code). Its tests take a +configuration for the assembler, an input, two expected outputs, and +(for documentation purposes) the source Lisp code. An example test: + +\begin{lstlisting} +$expbin = + '00 0C 00 00 00 00 20 00 20 00 00 0C C0 07' + . ' 00 00 00 09 20 05 E0 00 80 0B 5F FE'; +run_test {addr_bits => 13}, + '(call (more (funcall 0) (proc (var -2))) (number 5))', + <<'EOF', $expbin, '((LAMBDA ID (X) X) 5)'; +mem[ 0] <= 0; // (cdr part of NIL) +mem[ 1] <= 0; // (car part of NIL) +mem[ 2] <= 16'b0010000000000000; // (cdr part of T) +mem[ 3] <= 16'b0010000000000000; // (car part of T) +mem[ 4] <= 16'd12; // (free storage pointer) +mem[ 5] <= 16'b1100000000000111; // CALL 7 +mem[ 6] <= 0; // (result of computation) +mem[ 7] <= 16'b0000000000001001; // MORE 9 +mem[ 8] <= 16'b0010000000000101; // NUMBER 5 +mem[ 9] <= 16'b1110000000000000; // FUNCALL 0 +mem[10] <= 16'b1000000000001011; // PROC 11 +mem[11] <= 16'b0101111111111110; // VAR -2 +EOF +\end{lstlisting} + +Here the compiled code \texttt{(call (more (funcall 0) (proc (var -2))) (number 5))} +is given to the assembler, and its outputs are compared to the binary +string on the first line and the block of Verilog code. + +The tests do not all contain valid assembler inputs. There are a +series of tests that give invalid inputs to the assembler and expect +the assembler to raise an exception matching a pattern. Example tests: + +\begin{lstlisting} +expect_error_like { run_test {}, '((type is a list) 5)'} + 'Type of toplevel is not atom'; +expect_error_like { run_test {}, '(badtype 5)'} + 'No such type'; +expect_error_like { run_test {}, '(5 700000)'} + 'Addr too large'; +\end{lstlisting} + +On the compiler side, there are 4 tests for the quoting function, +but the brunt of the tests are pairs of a compiler input (i.e. Lisp +code) and a compiler output (i.e. a sexp that the assembler +understands). For each pair the compiler is run on the first element +and its output is compared with the second. Example such tests: + +\begin{lstlisting} +is_toplevel '(quote 5)', '(SYMBOL 3)'; +is_toplevel '(reverse-list \'a \'a \'b)', + '(CALL (MORE (MORE (REVERSE-LIST 0) (SYMBOL 4)) (SYMBOL 3)) (SYMBOL 3))'; +is_toplevel '(if t \'(2 3) \'x)', + '(IF (LIST (SYMBOL 5) (LIST (LIST (LIST 0)' + . ' (SYMBOL 3)) (SYMBOL 4))) (SYMBOL 2))'; +\end{lstlisting} + +As the compiler code includes a function to pretty-print +S-expressions, there are a few round-trip tests on this function. +Given a string, it is parsed using the \texttt{Data::SExpression} +parsing library then it is pretty-printed. The result should match the +original string. + +There are also tests for handling of invalid input, like in the assembler. These include: + +\begin{lstlisting} +expect_error_like { is_toplevel 'x' } 'Variable x not in environment'; +expect_error_like { is_toplevel '(car)' } 'Cannot call primitive car with no arguments'; +\end{lstlisting} + +A test coverage tool, \texttt{Devel::Cover}, was ran on the compiler +and assembler to find statements and branches that were not covered by +existing unit tests, and new tests were written to cover them. Right +now \texttt{Devel::Cover} reports 100\% test coverage of the compiler +and assembler. + +\begin{figure} + \centering\includegraphics[height=8cm]{devel-cover.png} + \caption{Screenshot of \texttt{Devel::Cover} results, showing 100\% + test coverage} +\end{figure} + +\section{Functional testing} +Besides unit tests, the entire system (hardware and software) has +undergone functional testing. This means that a set of 21 Lisp +expressions has been prepared, alongside their correct values computed +using a software Lisp implementation. These values are given to the +YULE REPL and the test succeeds if the output equals the known correct +value of the expression. + +Here are some example tests: +\begin{lstlisting} +'(1 2 3) +5 +((lambda id (x) x) 5) +((lambda id (x) x) '(1 2 3)) +(car '(2 3)) +(car (cdr '(4 5 6))) +((lambda snd (x y) y) 'first 'second) + +(cons '(1 2) '(7)) +(car (cons '(1 2) '(7))) +(cdr (cons '(1 2) '(7))) + +((lambda rev (xs) ((lambda reverse-acc (xs acc) (if xs (reverse-acc (cdr xs) (cons (car xs) acc)) acc)) xs '())) '(4 5 6 7)) +\end{lstlisting} + +This test suite tries to cover most kinds of valid expressions. It +includes constants of all kinds, conditionals, lambda abstractions, +calls to all primitives, calls to user-defined functions, recursive +calls. + +\section{Timing and performance} +There are three parts to the performance of YULE: + +\begin{enumerate} +\item The efficiency of the software. This refers to the time taken to + compile and assemble input programs, which is negligible. + +\item The I/O performance, which refers to the time needed to transfer + the program from the computer to the processor's memory, plus the + time needed to transfer the memory dump from the memory back to the + computer. + +\item The speed of the processor itself, that is the time/number of + cycles that the processor needs to evaluate a given expression. +\end{enumerate} + +The GC runs at 48MHz, while the EVAL only runs when the GC allows it +to which is at most half of the time. + +The UART runs at 4000000 baud, so a bit is sent every 12 cycles of the +48MHz clock. A byte is made of 11 bits: a start bit, 8 data bits, and +one stop bit. A word is two bytes, so 22 bits. Therefore 264 cycles +are needed to transmit one word. + +Since the WRITER just sends a full memory dump, the I/O performance is +proportional not to \texttt{program\_len + result\_len} but to +\texttt{program\_len + result\_len + intermediate\_results\_len}. As +evaluation of any expression involves the creation of at least one +value (the result), for every expression or subexpression evaluated at +least 264 cycles will be needed to output that result. The EVAL takes +significantly less than 264 cycles to evaluate a expression (not +counting its subexpressions), in one test each expression took between +30 and 90 cycles to evaluate. The processor performance is therefore +dominated by I/O. + +As an example, the program \texttt{((lambda id (x) x) 5)} is 12 bytes +long, and takes 258 cycles to evaluate. The memory dump is 54 bytes +long. Therefore of the roughly 17700 cycles needed from beginning to +end, about 98.5\% are used to do I/O. + +\section{A worked example} +\label{sec:example} +Here is an example of what happens when a expression is typed into the +REPL: + +\begin{enumerate} +\item A user types into the REPL the string + +\begin{lstlisting} +(car '(2 3 4)) +\end{lstlisting} + + This is a call to the CAR primitive, with the list \texttt{(2 3 4)} + as an argument. We expect the result to be \texttt{2}, the first + element of the list. + +\item This string is given to the compiler, producing the sexp + +\begin{lstlisting} +(CALL (CAR 0) (LIST (LIST (LIST (LIST 0) (SYMBOL 3)) (SYMBOL 4)) (SYMBOL 5))) +\end{lstlisting} + + Observe that in this representation the CDR comes before the CAR. So + for example ``(LIST (LIST 0) (SYMBOL 3))'' represents a list whose + CAR is ``(SYMBOL 3)'' and CDR is ``(LIST 0)'' (that is, nil). This + is a list with one element, ``(SYMBOL 3)''. The sexp shown here + represents a call to the CAR primitive with one argument. The list + has three elements, which are in order \texttt{(SYMBOL 5), (SYMBOL + 4), (SYMBOL 3)}. + + In this stage the compiler maps each number to a symbol, in this + case \texttt{(SYMBOL 5)} is 2, \texttt{(SYMBOL 4)} is 3, and + \texttt{(SYMBOL 3)} is 2. This mapping is remembered so it can be + used later by the pretty printer. + +\item This sexp is given to the assembler, producing the 16 words long + binary string (here displayed in hex, with a space after each 16-bit + word) + +\begin{lstlisting} +000F 0000 0000 2000 2000 000F C007 0000 2000 0009 000B 2005 000D 2004 0000 2003 +\end{lstlisting} + + The first word, \texttt{000F}, indicates the length of the program + in + words (15 words, this does not include this first word).\\ + The next four words, \texttt{0000 0000 2000 2000}, represent the CDR + and CAR of NIL and T respectively. These words are always + constant.\\ + The next word, \texttt{000F}, represents the first unused position + in memory. This is always equal to the first word transmitted.\\ + The next word, \texttt{C007}, represents the expression to be + evaluated.\\ + The next word, \texttt{0000}, is ignored as this is where the + processor will place the result of the evaluation.\\ + The rest of the words are subexpressions or parts of subexpressions + of the expression being evaluated.\\ + + The expression that is evaluated is \texttt{C007}, which is a CALL + whose CDR is at position 7 (\texttt{2000}, meaning \texttt{(CAR 0)}) + and whose CAR is at + position 8 (\texttt{0009}).\\ + \texttt{0009} is a list with CAR at 10 (\texttt{2005}, meaning + \texttt{(SYMBOL 5)}) and CDR at 9 (\texttt{000B}).\\ + \texttt{000B} is a list with CAR at 12 (\texttt{2004}, meaning + \texttt{(SYMBOL 4)}) and CDR at 11 (\texttt{000D}).\\ + \texttt{000D} is a list with CAR at 14 (\texttt{2003}, meaning + \texttt{(SYMBOL 3)}) and CDR at 13 (\texttt{0000}, meaning NIL). + +\item This binary string is sent via UART to the FPGA's READER. The + READER writes the string (except the first word) into memory, + overwriting the previous contents of the memory. It then reports to + the controller that it is done, and the controller advances to + \texttt{STATE\_RUN}. + +\item The EVAL and GC evaluate the program in memory, and then place + the result back into memory. The EVAL reports to the controller that + it is done, and the controller advances to \texttt{STATE\_WRITE}. + +\item The WRITER sends this memory dump via UART back to the REPL + +\begin{lstlisting} +000F C007 2005 2000 0009 000B 2005 000D 2004 0000 2003 0000 4004 2000 0010 0000 0012 0000 0000 0009 +\end{lstlisting} + + Then the WRITER reports to the controller that it is done, and the + controller advances to \texttt{STATE\_READ}. + + Observe that this memory dump is similar to the assembler output. + The first two words are the 6th and 7th words in the assembler + output, the next word is the result of the evaluation, then the next + 8 words are words 9th through 16th in the assembler output. The + following words are intermediate results computed by the processor. + +\item The REPL's pretty printer interpets this string starting with + \texttt{2005}, which means \texttt{(SYMBOL 5)}. The pretty printer + looks in the compiler's mapping and finds that the 5th symbol was + 3, and so the REPL prints + +\begin{lstlisting} +3 +* +\end{lstlisting} + + The evaluation is complete, and a new prompt is printed so that the + user can input another expression. +\end{enumerate} + +\chapter{Conclusions} +A Lisp machine processor has been built using the original design and +techniques presented in \cite{lambda}. It has been extended with input +and output capabilities, and using a modern FPGA it has been +synthesised on a physical device slightly larger than a USB flash +drive. This device can be used as a standalone Lisp evaluator, or the +processor inside it can be included in a more complicated circuit. + +Software has been written to help interact with the processor. This +includes a compiler, an assembler, and a REPL that together make it +possible for an end-user unfamiliar with hardware design or the +processor's internals to submit Lisp expressions to the processor and +get back their values. The software also allows programmers to convert +Lisp source code to the binary format understood by the processor, so +that they may later send the binaries to the processor for evaluation. + +The processor and software have been tested to ensure they work +correctly. This has uncovered bugs and inefficiencies in the original +paper, some of which could not be gleaned from simply reading the +paper. + +The source files for the hardware and software, and also the automated +tests have been released under an open-source license to make it easy +for future circuit designers and programmers to use and extend this +project. They can be found at \url{https://git.ieval.ro/?p=yule.git}. + +The original paper \cite{lambda} is one of the influential Lambda +Papers that defined the Scheme programming language and is an +important historical artifact for understading how Lisp machine +processors can be implemented and more generally how tagged +architectures work. But all details of the processor cannot be +understood from simply reading the paper, and most readers would not +follow by attempting to implement what is described there. This +projects brings a tangible implementation into the open, which shows +the limitations of the design to people interested in this design and +makes it possible to solve this limitations and extend the processor +beyond its original scope. + +\section{Reflections} +Now that the project is finished, it is worth revisiting ideas I had +in the beginning that were not ultimately useful. + +The original concept had the processor run on a typical development +board, with the software running on a Raspberry PI. The board and the +Raspberry PI would be connected through jumper wires, and the UART +circuitry on the Raspberry PI would be used by the software to talk to +the processor. + +This idea would have produced a much harder to use YULE. Not only +would the user have to ssh into a Raspberry PI to do anything useful, +but also the equipment would take more space (both boards are larger +than the iCEstick). Once I tried the iCEstick and found that +everything can be done just as well with it, it replaced the two +boards and made the project easier to use (albeit slightly harder to +debug, as the iCEstick only has 5 LEDs). + +Some code had to be rewritten, sometimes multiple times. The compiler +was originally written in Common Lisp, as it is the best suited +language for manipulation of S-expressions, especially Lisp code. This +meant a working project was obtained more quickly, but then the +compiler had to be rewritten in another language; if the user has a +software Lisp implementation already, why would they need YULE? I +chose Perl as the reimplementation language, as the assembler was +already written in Perl and it is widely-available, most systems +coming with a Perl interpreter out of the box. + +Throughout development of the hardware part, timing problems were a +mainstay. The first ones related to the interactions between the +processor and the memory (the memory was clocked, but the processor +expected a combinational RAM). What fixed this problem was to run the +memory at twice the speed of the rest of the processor. At some point, +a decision was taken to run the GC and EVAL on different clocks (both +ran at one quarter the speed of the master clock, but one of the +clocks was delayed by half a cycle. The positive edge on one clock +would happen midway between a positive edge and a negative edge on the +other one). This proved to be unwise, as sometimes values sent by one +module to the other would be cleared before the destination module +managed to read them. + +GTKWave proved to be very useful in debugging these timing issues, as +the exact point where the issue began was easy to identify and I could +see why the issue was happening. A workaround for this was to add a +latch on each side of the common E bus that remembered the value sent +long enough for the receiving module to see it. + +All timing problems were fixed more cleanly in the end when an attempt +to simplify the clocks was made. First GC and EVAL were changed to use +the same clock, which removed the need for the latches, and then the +memory was changed to use the negative edge of the clock instead of +the positive edge, which meant no more clock division code had to be +used. + +\section{Future work} +There are several ways to extend this project. + +One idea is to fix the noted limitations: the ATOM primitive considers +NIL to not be an atom, and there is no EQ primitive that compares +symbols/other objects. The latter problem means that all symbols are +indistinguishable (note: NIL can still be distinguished from T and +other symbols, but NIL is not a symbol). To introduce equality a new +signal needs to be added to the EVAL that compares (for example) the +value on the E bus with the value of a fixed register, possibly a +newly added one. Then (EQ A B) would store A in the register, load B +on the E bus and compare them. + +Once equality is implemented, it becomes possible to implement the +EVAL function that evaluates an arbitrary Lisp expression. + +Two more involved instruction set improvements are adding arithmetic +operations, and input/output. Without arithmetic primitives one has to +rely on Church numerals, which encodes numbers as lambda abstractions +and is very inefficient. Input/output primitives would allow programs +to receive data at runtime over the UART and send data back before +evaluation is finished. This enables long-running useful programs, for +example with input/output and equality a REPL can be written (which +relies on the EVAl function mentioned above). + +On the software side, the assembler can support optimizations and then +the REPL can be extended with a core image. The main optimization +needed in the assembler is common subexpression elimination. The +processor never overwrites a value in memory with another value (other +than overwriting \texttt{mem[6]} with the result of the evaluation), +so it is safe to share any identical subexpressions. For example, if +the user inputs the program \texttt{(cons (car '(1 2 3 4)) (cdr '(0 2 + 3 4)))}, then when the assembler should share the tails of the two +lists, meaning the second list should compile to \texttt{(LIST (LIST + X) (SYMBOL Y))} where X is a pointer to where the second element of +the first list is stored. This makes long programs significantly +shorter. Note that no compiler changes are required to implement this, +as the assembler can just remember all expressions and subexpressions +it has assembler so far including their locations, and whenever a new +expression needs to be assembler it can be replaced with a pointer to +where the same expression has been placed before. + +Once this optimization is implemented, the REPL can gain core image +support. Suppose the user enters the expression \texttt{expr}. Instead +of sending \texttt{assemble(compile(expr))} to the processor, the REPL +can remember the last expression sent (or the last memory dump +received), then append \texttt{assemble(compile(expr))} to that +previous expression. This way \texttt{expr} can refer to previously +defined expressions, as long as the REPL or compiler has a way to name +previously supplied instructions, for example via a \texttt{define} +special form, via the standard REPL variables (\texttt{+, ++, +++}), +or by sequentially numbering all expressions submitted by the user. + +Returning to hardware, a potential future project is to add actual +hardware garbage collection to the GC component. At the moment the +design does no garbage collection, the processor only has 4096 words +of RAM available, and evaluation is quite memory-inefficient. Adding +garbage collection would make more complicated programs possible +despite the limited RAM. One way to implement this is to add a small +microprocessor to the design which is preprogrammed with a garbage +collection algorithm (simple choices are a stop-and-copy collector or +Cheney's algorithm). When the free space pointer gets too high, the +EVAL and GC are stopped, and the new microprocessor is started up so +it can perform garbage collection, compact the memory, and update the +free space pointer. The EVAL/GC registers would have to be updated as +well. Then the microprocessor can be stopped, EVAL and GC be started +up again and evaluation can continue. Note that a non-moving collector +is not sufficient: the GC allocates new words only at the very end of +the memory, so it cannot benefit from free ``gaps'' in the middle of +the memory. + +\begin{thebibliography}{99} +\bibitem{lambda} Design of LISP-Based Processors (1979), available at + \url{http://repository.readscheme.org/ftp/papers/ai-lab-pubs/AIM-514.pdf} +\bibitem{osdvu} \url{https://opencores.org/project/osdvu} +\bibitem{gtkwave} \url{http://gtkwave.sourceforge.net/} +\bibitem{icestorm} \url{http://www.clifford.at/icestorm/} +\end{thebibliography} +\end{document} diff --git a/writer.svg b/writer.svg new file mode 100644 index 0000000..5862d94 --- /dev/null +++ b/writer.svg @@ -0,0 +1,286 @@ + + + + + + + + + + + + image/svg+xml + + + + + + + + + + + clk + cen + tx_busy + P + ram_do + WRITER + tx_byte + tx_signal + finished + ram_addr + + + + + + + + + + + + + +