report.tex

   1 \documentclass[a4paper,12pt]{report}
   2 \usepackage{color}
   3 \usepackage[cm]{fullpage}
   4 \usepackage{graphicx}
   5 \usepackage{hyperref}
   6 \usepackage{listings}
   7 \usepackage{pdfpages}
   8
   9 \title{YULE: Yet Unnamed Lisp Evaluator}
  10 \author{Marius Gavrilescu}
  11 \date{May 2018}
  12
  13 \begin{document}
  14 % title page does not need anchors
  15 \hypersetup{pageanchor=false}
  16 \maketitle
  17 \hypersetup{pageanchor=true}
  18 \lstset{language=Perl, frame=leftline, framexleftmargin=5pt,
  19   showstringspaces=false, showspaces=false, breaklines=true,
  20   postbreak=\mbox{\textcolor{red}{$\hookrightarrow$}\space}}
  21
  22 \chapter*{Abstract}
  23 Lisp machines are computers designed to run Lisp programs efficiently.
  24 They were first developed in the 1970s, and they have special hardware
  25 support for Lisp. This usually comprises a tagged architecture (each
  26 pointer has a ``type'' field), garbage collection implemented in
  27 hardware, and an instruction set that lends itself well to evaluating
  28 Lisp expressions.
  29
  30 Between 1975 and 1980, Guy L. Steele and Gerald Jay Sussman wrote a
  31 series of memos on lambda calculus, the Lisp language and its
  32 implementation known collectively as the ``Lambda Papers''. One of
  33 these papers, ``Design of LISP-Based Processors (1979)'' describes how
  34 to implement a Lisp machine processor. If this processor is connected
  35 to a RAM that contains a program, it executes that program and places
  36 the result back into the RAM.
  37
  38 The aim of YULE is to take the design from the 1979, implement it
  39 today using modern technology and techniques, and bundle it up in a
  40 form that is both easy to use and extend. By going through this
  41 modernisation process, future developers can easily integrate the Lisp
  42 processor into their own design, modifying it as they please.
  43
  44 The end result of YULE is a small device that can be plugged into a
  45 USB port, and an associated computer program. Running the program
  46 presents a simplified Lisp listener. The user can type a Lisp
  47 expression into the listener, and the listener will respond with the
  48 value of the expression. Under the hood, the listener evaluates Lisp
  49 by sending the expression over USB to the small device and getting
  50 back the result.
  51
  52 YULE comes with the design files for the hardware, and source code for
  53 the software. All of these are released under a free software license
  54 and published online, which makes it very easy (from both a practical
  55 and a legal standpoint) for future developers to use in their own
  56 projects and modify to fit their purposes better.
  57 \tableofcontents
  58
  59 \chapter{Introduction}
  60 \section{Motivation}
  61 Lisp machine processors are quite unlike modern processors. Modern
  62 instruction sets are designed for imperative programming. Each
  63 instruction makes the processor execute some operation, such as
  64 loading a register or storing a value in memory. In contrast, a Lisp
  65 machine processor reads an expression from memory, determines its
  66 type, and recursively evaluates it. This operation lends itself to
  67 declarative programming, where the programmer declares an arbitrarily
  68 complicated expression and asks the computer to evaluate it.
  69
  70 Another common feature of Lisp machine processors is the use of tagged
  71 pointers. Each memory location is a tagged union of two elements: the
  72 value itself, and a tag representing the type of that data. Tagged
  73 architectures are not common today, although some programs still
  74 use tagged pointers by exploiting pointer alignment (if every pointer
  75 is a multiple of 8, then the bottom 3 bits can be used to store a
  76 tag). These tags can be used to store an object's reference count (as
  77 in Objective-C on certain versions of iOS), or information about a
  78 data structure (as in certain implementations of kd-trees).
  79
  80 The ``Design of LISP-Based Processors (1979)'' paper by Guy L. Steele
  81 and Gerald Jay Sussman \cite{lambda} presents the design of a
  82 relatively simple Lisp machine processor, which is made from two
  83 read-only memories, two buses, and several registers. If this
  84 processor is connected to a RAM containing a Lisp expression in a
  85 certain binary format, the processor will evaluate the expression and
  86 place the result back in memory.
  87
  88 This processor features both a tagged architecture (the top 3 bits of
  89 each word in memory represent the value's type) and the declarative
  90 nature mentioned above. It is therefore interesting to study and to
  91 bring this processor into today's world by using technology such as
  92 hardware description languages (that allow us to describe a circuit
  93 formally) and FPGAs (that allow us to synthesise any circuit quickly
  94 and at low cost). By building a HDL implementation of the processor
  95 not only can it be used easily on its own (by synthesising it to a
  96 FPGA and connecting it to RAM), but a future circuit designer can take
  97 this processor and include it in their more complicated design.
  98
  99 The processor presented in the original paper is a somewhat barebones
 100 implementation of Lisp. Arithmetic primitives are not available, nor
 101 are there any input or output routines. Due to time constraints the
 102 paper's writers have not included a equality primitive. Additionally
 103 the ATOM primitive has a bug that makes it consider NIL not atomic.
 104 All these deficiencies can be rectified given time and effort, but it
 105 is difficult for one to fix them given only a paper describing the
 106 processor. Having a HDL implementation of the processor with a proper
 107 test suite allows a future circuit designer to more easily make
 108 changes to the processor while using the test suite to prevent
 109 regressions and using a FPGA to quickly obtain a hardware version of
 110 the changed processor that can be manually tested and used in any
 111 desired application.
 112
 113 \section{Challenges}
 114 While \cite{lambda} is part of the well-known Lambda Papers series and
 115 has been read by many over the years, there are no published
 116 implementations of the processor described in the paper. Therefore
 117 \cite{lambda} was the only source used in the building of this
 118 project, and several of the limitations of the processor were not easy
 119 to see before the implementation phase was well underway. Furthermore,
 120 the original paper has several shortcomings.
 121
 122 For one, while the microcode (the values of the two read-only
 123 memories) is provided in both source and compiled form, the microcode
 124 compiler is not provided. This makes it difficult to alter the
 125 microcode to fix bugs or introduce new instructions. One needs to
 126 either work on the already-compiled code (hard and error-prone) or
 127 reimplement the microcode compiler, which takes a reasonable amount of
 128 effort.
 129
 130 Moreover, the paper does not concern itself at all with how the user
 131 gets a program into memory or how the user gets the result back. This
 132 makes the processor as-described very difficult to use.
 133
 134 To implement the processor in Verilog, no source other than the paper
 135 was used. There were no previous implementations of the processor that
 136 could be consulted to resolve unclarities in the original paper.
 137
 138 In addition, there was no software available for interacting with the
 139 processor. A compiler and assembler had to be written from scratch,
 140 and so the processor could not be easily debugged until after the
 141 software was functional.
 142
 143 \section{Contribution}
 144 YULE is a modern-day implementation of a Lisp machine processor using
 145 the ideas and techniques presented in \cite{lambda}.
 146
 147 The main contribution of YULE is a Verilog implementation of the
 148 processor which allows future circuit designers to use it in their own
 149 design and modify it as they please.
 150
 151 As the processor as described has no way to input a program or extract
 152 a result from memory, YULE includes more circuitry around the
 153 processor that lets it communicate with the external word. This
 154 consists of a serial UART and two modules, a reader and a writer. When
 155 the processor is idle, the reader waits for a program to be sent over
 156 the UART. Once a program has been received, it is placed into memory
 157 and the processor starts to evaluate it. Once evaluation is done, the
 158 writer transmits the new contents of the memory over the UART.
 159
 160 Besides the hardware chip, YULE also includes three pieces of
 161 software:
 162
 163 \begin{itemize}
 164 \item a compiler, which takes Lisp source code and produces an
 165   intermediate representation of the program.
 166 \item an assembler, which takes a program in the intermediate
 167   representation and produces the binary representation of the program
 168   that the processor can understand.
 169 \item a REPL, in which users can type expressions. When an expression
 170   is submitted, the REPL compiles it, assembles it, and sends it to
 171   the processor. The processor evaluates the expression and sends back
 172   a memory dump. The REPL then reads the memory dump, interprets it,
 173   and prints the value of the expression in a human-readable format.
 174 \end{itemize}
 175
 176 The REPL is therefore the entry point of YULE as a whole. An end-user
 177 who does not intend to extend YULE can plug in a USB device containing
 178 the YULE processor, and run the REPL. Then the user can write
 179 expressions in the REPL and get their values back. Besides being the
 180 entry point, the REPL serves as a convenient way to test that the
 181 entire system is working. An expression typed in the REPL goes through
 182 the compiler and assembler, and then is sent to the processor. It goes
 183 through the three stages of the processor (reading the expression into
 184 memory, evaluating the contents of the memory, writing a memory dump).
 185
 186 The hardware and software components of YULE have been tested to
 187 ensure they function correctly. There is an automated test suite
 188 containing 48 unit tests for the compiler and assembler, and 21
 189 functional tests that engage all parts of the system (both the
 190 software and the hardware). The unit tests ensure that the various
 191 small parts of the compiler and assembler work properly, while the
 192 functional tests consist of pairs of short Lisp expressions and their
 193 correct values. The expressions are written to the REPL, and the
 194 REPL's output is compared to the correct values.
 195
 196 To make it easier for future circuit designers and programmers to use
 197 this project, YULE has been released under an open source license.
 198 Prospective users can find all the source files (for both hardware and
 199 software) online, and are free to use, modify, and redistribute them
 200 under the same terms as Perl itself. The files can be accessed at this
 201 URL: \url{https://git.ieval.ro/?p=yule.git}.
 202
 203 \section{Layout of this report}
 204 This report begins by describing the background of YULE, the
 205 technologies that made this project possible and a more in-depth
 206 description of the 1979 paper. Then the components that YULE comprises
 207 are examined in turn, followed by a chapter on how everything was
 208 tested, which includes a worked example of the system in action. The
 209 report ends with a list of ideas for future work based on YULE, and a
 210 section on reflections.
 211
 212 \chapter{Background}
 213
 214 \section{Field-programmable gate array}
 215 An FPGA is a configurable integrated circuit. A designer can define a
 216 circuit using a hardware description language (explained in the next
 217 section) and compile it to some binary format called a configuration.
 218 This configuration describes how the different components of the FPGA
 219 should be connected to each other. The configuration can then be
 220 uploaded to the FPGA, and the FPGA will connect its internal
 221 components as requested.
 222
 223 An FPGA can therefore replace any digital custom circuit (ASIC). While
 224 an FPGA consumes more energy, uses more space, and runs at slower
 225 clock rates than an ASIC, the FPGA permits rapid prototyping: the time
 226 between finishing to write a design and having a working part is mere
 227 seconds (compilation time + uploading the configuration to the FPGA),
 228 and if a bug is discovered the buggy part need not be thrown away. The
 229 design can be modified and the fixed configuration can be uploaded to
 230 the FPGA.
 231
 232 The hardware aspect of YULE is designed to run on the Lattice
 233 iCE40HX-1k FPGA, but with small modifications it can run on other
 234 FPGAs as well.
 235
 236 \subsection*{iCEstick}
 237 \begin{figure}
 238   \centering
 239   \includegraphics[height=4cm]{icestick.png}
 240   \caption{Lattice iCEstick Evaluation Kit}
 241 \end{figure}
 242
 243 The Lattice iCEstick Evaluation Kit is a development board for the
 244 iCE40HX-1k FPGA. It contains the FPGA, a USB connector, a FTDI chip
 245 and 5 LEDs. The form factor is very convenient. It is similar to a USB
 246 memory stick (albeit bigger) that can be plugged into a computer,
 247 avoiding the need for jumper wires and GPIO that traditional
 248 development boards require.
 249
 250 The FTDI chip implements the USB protocol, allowing the FPGA to be
 251 programmed by the computer and letting the FPGA talk to the computer
 252 using a very simple serial UART interface on the FPGA side, and a
 253 standard USB serial port on the computer side. This makes
 254 communication very easy. The FPGA can use any off-the-shelf UART
 255 implementation (this project uses \texttt{osdvu} from
 256 Opencores \cite{osdvu}), and the computer can use standard serial
 257 console tools such as \texttt{minicom}.
 258
 259 \section{Verilog}
 260 Verilog is a hardware description language. It is used to model
 261 digital circuits at register-transfer level of abstraction. In Verilog
 262 one defines several modules, with one of them (the ``toplevel''
 263 module) representing the entire design. Each module has several inputs
 264 and outputs, can contain internal wires, can contain instances of
 265 other modules, and can contain combinational and sequential logic
 266 blocks that drive internal wires and outputs.
 267
 268 Verilog has many more complicated features that are not used in YULE.
 269 These include tristate buffers, delays, functions, tasks, loops, and
 270 strength declarations.
 271
 272 A subset of Verilog can be \textit{synthesised}; that is converted to
 273 a netlist (a list of transistors or FPGA components used and
 274 connections between them) that can then be used to build an ASIC
 275 (custom integrated circuit), or can be further converted to the
 276 internal representation of the FPGA and uploaded to the FPGA. The FPGA
 277 workflow and the programs that are part of it are described in the
 278 next section.
 279
 280 A verilog project can also be run through a simulator, which simulates
 281 the design clock edge by clock edge and records the values on each
 282 wire in the design. This record can later be read in and visualised
 283 with a wave viewer, which allows a circuit designer to better
 284 understand why the circuit is not working correctly. One such program,
 285 GTKWave \cite{gtkwave}, was used to debug YULE.
 286
 287 \section{icestorm}
 288 Project IceStorm \cite{icestorm} is a set of open source reverse
 289 engineered tools for programming Lattice iCE40 FPGAs. There exists a
 290 fully open source Verilog-to-Bitstream flow which comprises:
 291
 292 \begin{enumerate}
 293 \item Yosys, a Verilog synthetizer.
 294 \item Arachne-pnr, a tool that implements placing and routing for
 295   iCE40 FPGAs.
 296 \item icepack, which converts Arachne-pnr's output to iCE40 bitstream.
 297 \item iceprog, which programs the iCEstick with the output of icepack.
 298 \end{enumerate}
 299
 300 \section{Design of LISP-Based Processors (1979)}
 301 Design of LISP-Based Processors (1979) \cite{lambda} is one of the
 302 ``Lambda papers''; a series of memos by Sussman and Steele about the
 303 Lisp language and its implementation. The paper describes SIMPLE, a
 304 microprocessor designed to run Lisp.
 305
 306 This processor is made of two read-only memories called EVAL and GC.
 307 Each read-only memory is paired with state registers to form a state
 308 machine. The contents of the state registers are fed into the
 309 read-only memory, and the output of the read-only memory determines
 310 the next state and other signals.
 311
 312 Each state machine also has an internal bus (called the E and G buses
 313 respectively) and several registers connected to that bus. Each of
 314 these registers stores a tagged pointer, which is an 11-bit value (the
 315 top 3 bits represent the type, and the bottom 8 bytes represent the
 316 value).
 317
 318 The signals coming out of each state machine can cause a register to
 319 be read onto the machine's internal bus, a register to be loaded from
 320 the internal bus, or write a constant value onto the internal bus.
 321
 322 The GC state machine represents the processor's memory system. It is
 323 connected to an external RAM, which it controls through several
 324 signals. It knows how to execute certain ``GC operations'', which
 325 represent common Lisp memory management actions. Examples of these are
 326 \texttt{CAR} and \texttt{CDR} (``given a pointer to a cons cell,
 327 return either the CAR or the CDR of that cons cell'') or \texttt{CONS}
 328 (``given two pointers, allocate a new cons cell, copy them into it,
 329 and return a pointer to the new cons cell''). The existence of the GC
 330 isolates the rest of the processor from the particularities of the
 331 memory used.
 332
 333 The EVAL state machine evaluates Lisp expressions. Signals coming out
 334 of it control the GC by selecting what operation the GC should perform
 335 and connecting the E and G buses to supply arguments to the GC and
 336 read results back. The EVAL reads expressions from memory via the GC
 337 and evaluates them, putting results back into memory.
 338
 339 The overall operation of the processor is to read an expression from
 340 the magic location ``5'' in memory, evaluate it, and place the result
 341 into the magic location ``6'' in memory. Other memory locations are
 342 used to store subexpressions of the input expression, and to store any
 343 intermediate results. Despite its name, GC does not perform any
 344 garbage collection, so at the end of the processor's operation the
 345 memory will still contain all intermediate results produced during
 346 evaluation.
 347
 348 The processor is intended as a proof of concept and not a
 349 production-ready hardware implementation of Lisp. Therefore it has
 350 several important limitations, for example there is no primitive
 351 operation that implements equality of objects (making it impossible to
 352 distinguish between symbols), there are no primitive arithmetic
 353 operations (so one needs to use Church numerals if numbers are
 354 needed), and the processor includes a known bug (``the ATOM bug'').
 355 The bug is that the ATOM primitive considers NIL to be non-atomic,
 356 which is incorrect.
 357
 358 The Lisp expressions mentioned here are in textual form (Lisp source
 359 code), but instead in a special tagged pointer format. Each memory
 360 value is a union of two values: a 8-bit pointer and a 3-bit type. We
 361 can use pairs of consecutive memory locations to represent cons cells
 362 and build linked lists out of them. If a list is at location X, that
 363 means that (X, X+1) is the first cons cell of the list, with X+1 being
 364 the CAR (the value of the first element) and X being the CDR (a
 365 pointer to the next cons cell). A discussion of the different ways to
 366 represent Lisp expressions (source code, this tagged pointer format,
 367 and an intermediate representation) can be found in \ref{sec:formats}.
 368
 369 The next page shows a diagram of how the processor in the paper is
 370 structured. The two read-only memories can be seen, with their
 371 registers, state registers, and the interconnection between the E and
 372 G buses.
 373
 374 \begin{figure}
 375 \centering\includegraphics[width=18.5cm]{AIM-514-design.pdf}
 376 \caption{Design diagram of processor from \cite{lambda}}
 377 \end{figure}
 378
 379 \chapter{A Lisp Machine processor on a USB drive}
 380 The project comprises a hardware part written in Verilog and a
 381 software part written in Perl.
 382
 383 The hardware part is a Lisp machine processor with associated
 384 circuitry that can read a program from the iCEstick UART and that can
 385 write the result of running a program to the UART. The normal
 386 operation of this part is to:
 387
 388 \begin{enumerate}
 389 \item Wait for a program to be supplied via UART
 390 \item Run the program on the processor
 391 \item Write a memory dump to the UART
 392 \item Go back to step 1
 393 \end{enumerate}
 394
 395 The software part includes:
 396
 397 \begin{itemize}
 398 \item A compiler, which takes Lisp source code and puts it in the
 399   linked list format understood by the SIMPLE processor
 400 \item An assembler, which takes the result of the compiler and
 401   converts it to the SIMPLE in-memory binary format.
 402 \item A REPL, which repeatedly reads a program from the user, compiles
 403   it, assembles it, sends it to YULE, reads the memory dump from YULE,
 404   and prints the result for the user to see.
 405 \end{itemize}
 406
 407 \section{Hardware}
 408 The hardware part is built from 8 modules:
 409
 410 \begin{enumerate}
 411 \item The \textbf{READER}, which implements step 1 of the normal operation described above.
 412 \item The \textbf{WRITER}, which implements step 3 of the normal operation described above.
 413 \item The \textbf{GCRAM}, a single-port RAM.
 414 \item The \textbf{GC}, corresponding to the GC PLA in the original implementation.
 415 \item The \textbf{EVAL}, corresponding to the EVAL PLA in the original implementation.
 416 \item The \textbf{CTRL}, which advances the operation from one step to the next one.
 417 \item The \textbf{UART}, which implements transmission and reception of bytes via UART.
 418 \item The \textbf{CPU}, which instantiates every module above and wires them together.
 419 \end{enumerate}
 420
 421 \subsection{Clocks}
 422 To avoid timing problems, a single 48 MHz clock is used. The iCEstick
 423 has a 12 MHz oscillator, and the phase-locked loop circuitry on the
 424 FPGA multiplies this signal by 4.
 425
 426 Every flip-flop updates on the positive edge of the clock, except for
 427 the RAM which updates on the negative edge. This way the rest of the
 428 circuit perceives the RAM to be combinational (it ``updates
 429 immediately''): if on a positive edge the RAM's address input changes,
 430 the RAM will update its output on the following negative edge and the
 431 changed output will be visible on the next positive edge. If the RAM
 432 updated on a positive edge, then if the input changed on positive edge
 433 $i$ the changed output will only be visible on positive edge $i+2$.
 434
 435 The design is intended to be run on a FPGA, which has a fixed clock
 436 tree. This typically consists of a small number of special nets called
 437 global buffers that can distribute signals with minimal skew to every
 438 logic cell. The FPGA on the iCEstick has 8 such global buffers.
 439
 440
 441 There are 4 components in the design (GC, EVAL, READER, WRITER) that
 442 need to be enabled/disabled at different times. One way to achieve
 443 this is to AND the master 48 MHz clock with several signals, thus
 444 creating a separate clock for each component. This is known as
 445 \textit{clock gating}. However, on a FPGA clock gating involves
 446 routing the clock from a global buffer to a logic cell and then back
 447 to another global buffer. This can introduce glitches in the clock and
 448 would delay each of the clocks by some amount (dependent on the
 449 signals it is ANDed with). It is therefore possible that the 4 output
 450 clocks would not be aligned, causing problems when data is
 451 communicated by one component to another. As such, clock gating is not
 452 recommended on FPGAs.
 453
 454 \begin{figure}
 455   \centering\includegraphics[height=3cm]{gating.eps}
 456   \caption{Gated clock on the left, clock enable on the right}
 457 \end{figure}
 458
 459 To avoid this problem, the design used here does not use clock gating
 460 and instead uses 4 clock enable signals supplied by the controller to
 461 each of the GC, EVAL, READER and WRITER. At any point the clock is
 462 enabled for exactly one of the READER, GC, and WRITER. The EVAL clock
 463 is enabled if and only if the GC clock is enabled and the GC's
 464 \texttt{step\_eval} output is high.
 465
 466 \subsection{GC}
 467 \begin{figure}
 468   \centering\includegraphics[height=4.88cm]{gc.eps}
 469   \caption{GC module}
 470 \end{figure}
 471
 472 The GC module is connected to the GCRAM (via the \texttt{ram\_do,
 473   ram\_we, ram\_addr, ram\_di} signals), to the EVAL (via the
 474 \texttt{Ein, Eout, gcop, step\_eval, conn\_et, conn\_ea} signals) and
 475 to the controller. It also provides the value of its P register (the
 476 free storage pointer) to the WRITER, so the writer can know when to
 477 stop dumping the memory.
 478
 479 It is implemented as a combinational ROM, some combinational logic to
 480 drive the internal G bus, sequential logic to advance to the next
 481 state respecting dispatch rules, and sequential logic to update
 482 registers. The clock enable signal acts as an active low synchronous
 483 reset: when the clock is disabled, the state register is set to the
 484 initial state.
 485
 486 This module represents one half of the original design presented in
 487 \cite{lambda}, and the ROM's contents are those from the paper with a
 488 bug (the original design sets the ADR line low in two states when it
 489 should be high).
 490
 491 \subsection{EVAL}
 492 \begin{figure}
 493   \centering\includegraphics[height=3.29cm]{eval.eps}
 494   \caption{EVAL module}
 495 \end{figure}
 496
 497 The EVAL module is connected to the GC (via the \texttt{conn\_et,
 498   conn\_ea, Ein, Eout, gcop} signals) and to the controller.
 499
 500 The implementation of the EVAL is very similar to that of the GC.
 501 There is a combinational ROM, combinational logic to drive the E bus,
 502 sequential logic to advance to the next state respecting dispatch
 503 rules, and sequential logic to update registers. A synchronous reset
 504 is available. When the reset line is held high for one clock cycle,
 505 the state register is set to the initial state. It is necessary to
 506 separate the reset from the clock enable because the GC often pauses
 507 the EVAL to do work and the EVAL should not be reset every time this
 508 happens.
 509
 510 This module represents the other half of the original design presented
 511 in \cite{lambda}, and the ROM's contents are the same as those in the
 512 paper except that the ROM is widened by one bit to indicate whether
 513 the EVAL is finished evaluating the expression.
 514
 515 \subsection{Writer}
 516 \begin{figure}
 517 \centering\includegraphics[height=2.26cm]{writer.eps}
 518 \caption{WRITER module}
 519 \end{figure}
 520
 521 The WRITER module is connected to the UART (via the \texttt{tx\_busy,
 522   tx\_signal, tx\_byte} signals), the GC (via the \texttt{P} signal),
 523 the RAM (via the \texttt{ram\_addr, ram\_do} signals), and to the
 524 controller.
 525
 526 Its purpose is to dump the memory to the UART starting with position 4
 527 (since positions 0, 1, 2, 3 are constant) and finishing with position
 528 $P - 1$, the last used cell in memory. It is implemented as a state
 529 machine with 7 states and an extra register (\texttt{current\_index})
 530 that stores the current index in the memory.
 531
 532 The memory is 16-bit wide, but the UART sends and receives bytes. The
 533 writer dumps memory word in network byte order (big-endian), meaning
 534 the significant byte is sent before the least significant byte.
 535
 536 The states are, in order:
 537 \begin{description}
 538 \item[START] which initializes \texttt{current\_index} to 4
 539 \item[WRITE1\_WAIT] which waits for \texttt{tx\_busy} to become low
 540 \item[WRITE1] in which transmission of the upper byte starts
 541 \item[WRITE2\_WAIT] which waits for \texttt{tx\_busy} to become low
 542 \item[WRITE2] in which transmission of the lower byte starts
 543 \item[INCREMENT] in which \texttt{current\_index} is incremented,
 544   jumping back to WRITE1\_WAIT if it is still lower than
 545   \texttt{freeptr}.
 546 \item[FINISHED] in which \texttt{finished} is driven high
 547 \end{description}
 548
 549 This module is not part of the original design in \cite{lambda}, as
 550 that paper did not concern itself with input and output.
 551
 552 \subsection{Reader}
 553 \begin{figure}
 554 \centering\includegraphics[height=2.22cm]{reader.eps}
 555 \caption{READER module}
 556 \end{figure}
 557
 558 The READER module is connected to the UART (via the
 559 \texttt{rx\_signal, rx\_byte} signals), the RAM (via the
 560 \texttt{ram\_we, ram\_addr, ram\_di} signals), and to the controller.
 561
 562 Its purpose is to read a program from the UART and overwrite the GCRAM
 563 with this program. It is implemented as a state machine with two extra
 564 registers (\texttt{total\_words} and \texttt{current\_index}) to store
 565 the size of the incoming program and the current position within it.
 566
 567 The reader expects to receive a 13-bit \texttt{length} value, followed
 568 by \texttt{length} 16-bit values that represent the desired memory
 569 contents. Each of these values is sent in network byte order
 570 (big-endian), meaning the most significant byte is sent before the
 571 least significant byte.
 572
 573 The states are, in order:
 574 \begin{description}
 575 \item[IDLE] which waits for \texttt{rx\_signal}, puts the read byte at the top of \texttt{total\_words}.
 576 \item[LENGTH] which waits for \texttt{rx\_signal}, puts the read byte at the bottom of \texttt{total\_words}
 577 \item[READ1] which waits for the \texttt{rx\_signal}, puts the read byte at the top of \texttt{ram\_di} and increments \texttt{current\_index}
 578 \item[READ2] which waits for the \texttt{rx\_signal}, puts the read byte at the bottom of \texttt{ram\_di}
 579 \item[WRITE] in which \texttt{ram\_we} is high, jumping back to READ1 if \texttt{current\_index + 1 < total\_words}
 580 \item[FINISHED] in which \texttt{finished} is high
 581 \end{description}
 582
 583 This module is not part of the original design in \cite{lambda}, as
 584 that paper did not concern itself with input and output.
 585
 586 \subsection{Putting the hardware together}
 587 The components described above are managed by the controller, which
 588 provides them with clock enable signals. The controller is a state
 589 machine with three states:
 590 \begin{description}
 591 \item[STATE\_READ] in which only the READER clock is enabled
 592 \item[STATE\_RUN] in which the GC and EVAL clocks are enabled
 593 \item[STATE\_WRITE] in which only the WRITER clock is enabled
 594 \end{description}
 595
 596 In STATE\_READ and STATE\_WRITE, the EVAL's reset line is high.
 597
 598 Whenever the component corresponding to the current state indicates it
 599 is finished, the controller advances to the next state.
 600
 601 The controller also includes three multiplexers that control which
 602 lines are connected to the GCRAM. If the READER is active, all three
 603 GCRAM lines are connected to the READER. If the WRITER is active, the
 604 address line is connected to the writer (and the other lines are
 605 connected to the GC). If the GC is active, all lines are connected to
 606 the GC.
 607
 608 The controller also includes logic to drive the 5 LEDs on the
 609 iCEstick. Three LEDs indicate the current state, while the other two
 610 indicate if the UART is currently transmitting or receiving.
 611
 612 \section{Representing a Lisp program}
 613 \label{sec:formats}
 614 The processor described above evaluates a Lisp program received via
 615 UART. This section talks about three representations of such a
 616 program: the Lisp source code (a S-expression), the intermediate form
 617 (also a S-expression), and the in-memory form (a binary string).
 618
 619 The first representation is obtained by parsing the user input, the
 620 compiler converts this into the second, and the assembler converts the
 621 second into the third.
 622
 623 \subsection{Lisp source code}
 624 Recall that a S-expression is either an atom or a cons cell.
 625
 626 An atom is a sequence of letters, digits, and certain symbols. There
 627 are two special atoms named \texttt{T} and \texttt{NIL} representing
 628 the boolean true and false respectively.
 629
 630 A cons cell is a pair of two S-expressions called the ``car'' and the
 631 ``cdr''. We write them as \texttt{(car . cdr)}, for example \texttt{(5
 632   . 7)} is a cons cell with car \texttt{5} and cdr \texttt{7}.
 633
 634 We can use cons cells to represent linked lists. A linked list is a
 635 sequence of cons cells where the cdr of one cell is the next cell, and
 636 the cdr of the last cell is \texttt{NIL}. For example the list
 637 \texttt{1,2,3} can be written as \texttt{(1 . (2 . (3 . NIL)))}.
 638 We can also write this list more concisely as \texttt{(1 2 3)}, and we
 639 write \texttt{()} to mean the atom \texttt{NIL}.
 640
 641 Using these syntax elements we can encode several types of logical
 642 expressions:
 643
 644 \begin{itemize}
 645 \item Numbers are encoded as numeric atoms, e.g. \texttt{5}
 646 \item Variables are encoded as non-numeric atoms, e.g. \texttt{my-var}
 647 \item Conditional statements are encoded as 4-element lists, e.g.
 648   \texttt{(if is-odd 5 6)} which has value 5 if the variable
 649   \texttt{is-odd} is not \texttt{NIL} and value 6 otherwise
 650 \item Symbols are encoded as non-numeric atoms preceded by a single
 651   quote, e.g. \texttt{'my-symbol}. This can also be written
 652   \texttt{(quote my-symbol)}
 653 \item Constant lists are encoded as lists preceded by a single quote,
 654   e.g. \texttt{'(1 2 (3 4 5))}. This can also be written
 655   \texttt{(quote (1 2 (3 4 5)))}
 656 \item Functions (lambda abstractions) are encoded as 4-element lists,
 657   e.g. \texttt{(lambda first (a b) a)} in which the second element is
 658   the function name, the third element is a list of function
 659   arguments, and the fourth element is the function body.
 660 \item Function calls are encoded as lists, with the function as the
 661   first element and the arguments as the other elements, e.g.
 662   \texttt{(atom 5)} calls the function \texttt{atom} with argument
 663   \texttt{5}.
 664 \end{itemize}
 665
 666 \subsection{In-memory representation}
 667 The memory is an array of 16-bit integer values. We interpret these as
 668 tagged pointers: the top 3 bits indicate the ``type'' of this pointer,
 669 while the other 13 bits represent another memory location. A pointer
 670 therefore has a type and a value.
 671
 672 We represent each Lisp expression as a tagged pointer. To evaluate it,
 673 the processor looks at the type of the pointer. Here is what a pointer
 674 of value $v$ means based on its type. We use the syntax $mem[a]$ to
 675 mean ``the contents of the memory location $a$'' (that is, a tagged
 676 pointer) and $memval[a]$ to mean ``the value at the memory location
 677 $a$'' (that is, the value of the tagged pointer $mem[a]$).
 678
 679 \begin{enumerate}
 680 \setcounter{enumi}{-1}
 681 \item is a constant list, whose cdr is $mem[v]$ and car is $mem[v+1]$.
 682 \item is a constant symbol.
 683 \item is a variable reference, where $v$ indicates the position in the
 684   environment of the referenced variable.
 685 \item is a constant closure, that is the result of evaluating a lambda
 686   abstraction. Values of this type do not normally appear in the input
 687   code, but instead are created during evaluation of other forms.
 688 \item is a procedure (lambda abstraction), whose body is $mem[v]$.
 689 \item is a conditional, where $mem[v+1]$ is the condition,
 690   $mem[memval[v]+1]$ is the ``then'' branch, and $mem[memval[v]]$ is the
 691   ``else'' branch.
 692 \item is a procedure call, which points to a special kind of linked
 693   list. The CDR of each cell (other than the last one) points to the
 694   next cell and has type 0, while the CDR of the last cell has value 0
 695   and its type indicates the primitive to be called. The CARs of these
 696   cells indicate the arguments for the procedure, in reverse order.
 697
 698   To call a user-defined procedure, we use the primitive
 699   \texttt{funcall} and put the user-defined procedure as the CAR of
 700   the last cell (in other words, the user-defined procedure is the
 701   first argument to \texttt{funcall}).
 702 \item is a quoted constant, representing the object at $mem[v]$.
 703 \end{enumerate}
 704
 705 The booleans \texttt{NIL} and \texttt{T} are compiled to a list
 706 pointing to 0, and a symbol with value 2 respectively.
 707
 708 The memory has a peculiar fixed layout. Locations 0 and 1 are the CDR
 709 and CAR of NIL, locations 2 and 3 are the CDR and CAR of T, location 4
 710 holds the first unused location in memory (this tells the processor
 711 where it can start to place intermediate expressions), location 5
 712 holds the expression to be evaluated, and location 6 is where the
 713 processor will place the value of the expression.
 714
 715 \subsection{The intermediate form}
 716 To facilitate conversion from the former format to the latter, an
 717 intermediate form is used. Here we represent tagged pointers as lists
 718 of one of three types:
 719
 720 \begin{itemize}
 721 \item \texttt{(type value)}, where ``type'' is the pointer's type and
 722   ``value'' is the pointer's (numeric) value.
 723 \item \texttt{(type ptr)}, where the pointer's type is ``type'' and
 724   the pointer's value is $x$, where $x$ is the memory location where
 725   ``ptr'' (another tagged pointer) lies.
 726 \item \texttt{(type ptr1 ptr2)}, where the pointer's type is ``type''
 727   and the pointer's value is $x$, where $x$ is the memory location
 728   where ``ptr1'' lies and $x+1$ is the memory location where ``ptr2''
 729   lies.
 730 \end{itemize}
 731
 732 These three representations map nicely to the tagged pointers
 733 described in the previous section. The first representation is used
 734 for constant symbols and variable references, the second for
 735 procedures, and the third for constant lists, conditionals, and
 736 procedure calls.
 737
 738 A visual representation of the intermediate representation (from the
 739 AIM-514 paper) can be seen on the next page. The expression in
 740 question is
 741
 742 \begin{lstlisting}
 743 ((lambda append (x y) (if (atom x) y (cons (car x) (append (cdr x) y)))) '(a b c) '(d e f))
 744 \end{lstlisting}
 745
 746 which we would represent as this sexp:
 747
 748 \begin{lstlisting}
 749 (CALL (MORE (MORE (FUNCALL 0) (PROC (IF (LIST (CALL (MORE (CONS 0) (CALL (MORE (MORE (FUNCALL 0) (VAR -1)) (VAR -2)) (CALL (CDR 0) (VAR -3)))) (CALL (CAR 0) (VAR -3))) (VAR -2)) (CALL (ATOM 0) (VAR -3))))) (LIST (LIST (LIST (LIST 0) (SYMBOL 6)) (SYMBOL 7)) (SYMBOL 8))) (LIST (LIST (LIST (LIST 0) (SYMBOL 3)) (SYMBOL 4)) (SYMBOL 5)))
 750 \end{lstlisting}
 751
 752 As mentioned in the Background chapter, the processor has a bug (the
 753 ATOM bug) that causes \texttt{(ATOM NIL)} to be true. As a result,
 754 \texttt{(atom x)} in this expression is always true, even when
 755 \texttt{x} is the empty list, and the expression loops forever (or at
 756 least until memory is exhausted). We can fix this by changing the
 757 condition from \texttt{(atom x)} to \texttt{x}, which is false when
 758 \texttt{x} is NIL and true otherwise, and swapping the two branches of
 759 the conditional. The fixed program is:
 760
 761 \begin{lstlisting}
 762 ((lambda append (x y) (if x (cons (car x) (append (cdr x) y)) y)) '(a b c) '(d e f))
 763 \end{lstlisting}
 764
 765 which is compiled to the slightly shorter sexp:
 766
 767 \begin{lstlisting}
 768 (CALL (MORE (MORE (FUNCALL 0) (PROC (IF (LIST (VAR -2) (CALL (MORE (CONS 0) (CALL (MORE (MORE (FUNCALL 0) (VAR -1)) (VAR -2)) (CALL (CDR 0) (VAR -3)))) (CALL (CAR 0) (VAR -3)))) (VAR -3)))) (LIST (LIST (LIST (LIST 0) (SYMBOL 6)) (SYMBOL 7)) (SYMBOL 8))) (LIST (LIST (LIST (LIST 0) (SYMBOL 3)) (SYMBOL 4)) (SYMBOL 5)))
 769 \end{lstlisting}
 770
 771
 772 \begin{figure}
 773 \centering\includegraphics[width=18.5cm]{AIM-514-ir.pdf}
 774 \caption{Example of intermediate representation}
 775 \end{figure}
 776
 777 \section{Software}
 778 \subsection{Compiler}
 779 The compiler takes in Lisp code and translates it to a sexp-based
 780 format understood by the assembler.
 781
 782 The compiler has several functions for compiling certain kinds of
 783 expressions, and a function (\texttt{process\_toplevel}) that uses the
 784 former to compile arbitrary expressions.
 785
 786 The output from any of this functions is a sexp in the intermediate
 787 form described in the previous section.
 788
 789 Symbols are identified by sequences of letters on the source code side
 790 and by numbers on the processor side. The compiler therefore keeps a
 791 mapping between symbol names and their numbers. Whenever a new symbol
 792 is encountered, a new entry is added to this map.
 793
 794 The \texttt{process\_quoted} function handles constants. For example
 795 if the source code contains \texttt{'(1 2 3)} then
 796 \texttt{process\_quoted} is called with \texttt{(1 2 3)}. This
 797 function handles the symbol interning described in the previous
 798 paragraph, and converts its input to intermediate form.
 799
 800 The \texttt{process\_proc} function handles lambda abstractions. For
 801 example if the source code contains \texttt{(lambda first (a b) a)}
 802 then \texttt{process\_proc} is called with \texttt{first},
 803 \texttt{(a b)}, and \texttt{a}. This function appends the function
 804 name and arguments to the environment, then processes the body (using
 805 \texttt{process\_toplevel}) with the new environment.
 806
 807 The \texttt{process\_funcall} function handles function calls. For
 808 example if the source code contains \texttt{(reverse-list 2 1)} then
 809 \texttt{process\_funcall} is called with \texttt{reverse-list}, and
 810 \texttt{(2 1)}. The arguments are first processed (using
 811 \texttt{process\_toplevel}), then (if the function is not a primitive)
 812 the function is processed as well.
 813
 814 Finally the \texttt{process\_toplevel} handles any expression by
 815 figuring out what sort of expression it is and then processing it
 816 appropriately. Empty lists and \texttt{NIL} become \texttt{(list 0)},
 817 numbers and \texttt{T} are processed with \texttt{process\_quoted}
 818
 819 Examples of compiler operation can be found in the compiler tests.
 820 Here are a few of them:
 821
 822 \begin{itemize}
 823 \item The Lisp expression \texttt{'()} is compiled to \texttt{(LIST 0)}
 824 \item The Lisp expression \texttt{'(5 foo)} is compiled to
 825   \texttt{(LIST (LIST (LIST 0) (SYMBOL 3)) (SYMBOL 4))}. Observe that
 826   the CDR comes before the CAR, and symbols are interned:
 827   \texttt{'foo} is compiled to \texttt{(SYMBOL 3)} and \texttt{'5} is
 828   compiled to \texttt{(SYMBOL 4)}
 829 \item The Lisp expression \texttt{(if t '(2 3) 'x)} is compiled to
 830   \texttt{(IF (LIST (SYMBOL 5) (LIST (LIST (LIST 0) (SYMBOL 3))
 831     (SYMBOL 4))) (SYMBOL 2))}. Observe that \texttt{t} is always
 832   compiled to \texttt{(SYMBOL 2)}, as mentioned in the section about
 833   the memory format.
 834 \end{itemize}
 835
 836 \subsection{Assembler}
 837 The assembler takes in an S-expression in the intermediate form, lays
 838 out the words in memory, and produces a memory dump that can be
 839 supplied to the processor.
 840
 841 The core of the assembler is a function called \texttt{process} that
 842 takes an expression in intermediate form and an optional location in
 843 memory. This function lays out the expression's subexpressions in
 844 memory and then lays out the expression itself at the given location
 845 (or at the end of the memory if there is no location given).
 846
 847 The assembler begins by initializing the first 7 elements of the
 848 memory with the values needed by the processor. After initialization,
 849 the \texttt{process} function is called on the input expression with
 850 no location.
 851
 852 Here is how the \texttt{process} function lays out the expression
 853 $expr$ at location $l$ (where $l$ can be a given memory location or
 854 simply ``the first free location in memory''):
 855
 856 \begin{itemize}
 857 \item If $expr$ is \texttt{(type value)}, then it is directly laid out
 858   at $l$
 859 \item If $expr$ is \texttt{(type ptr)}, then \texttt{process} is
 860   called recursively to lay \texttt{ptr} out at the first free
 861   location in memory (say this is location $loc$). Then \texttt{(type
 862     $loc$)} is laid out at $l$.
 863 \item If $expr$ is \texttt{(type ptr1 ptr2)}, then two consecutive
 864   free locations in memory $loc$ and $loc+1$ are reserved,
 865   \texttt{process} is recursively called twice to lay \texttt{ptr1} at
 866   $loc$ and \texttt{ptr2} at $loc+1$, and finally
 867   \texttt{(type $loc$)} is laid out at $l$.
 868 \end{itemize}
 869
 870 Once again, examples of assembler operations can be found in the
 871 assembler tests. Two of them:
 872
 873 \begin{itemize}
 874 \item \texttt{(quoted (symbol 5))} is assembled to \texttt{00 08 00 00
 875     00 00 20 00 20 00 00 08 E0 07 00 00 20 05}. Observe that each word
 876   is made of two bytes, the first word (\texttt{00 08}) indicating the
 877   length (in words) of the program. The first 4 words are always
 878   fixed: they represent in order the CDR of NIL, the CAR of NIL, the
 879   CDR of T, and the CAR of T. The 7th word is also a constant, 0,
 880   because that is where the processor puts the result. The 5th word is
 881   the first free word in memory (which is equal to the length in words
 882   of the program), and the 6th word (\texttt{E0 07}) is the expression
 883   to be evaluated. In this case, it has type quote and points to
 884   memory location 7 (the 8th word, because pointers are 0-indexed). At
 885   memory location 7 we have the value \texttt{20 05} which has type
 886   symbol and value 5.
 887 \item \texttt{(call (more (funcall 0) (proc (var -2))) (number 5))} is
 888   assembled to \texttt{00 0C 00 00 00 00 20 00 20 00 00 0C C0 07 00 00
 889     00 09 20 05 E0 00 80 0B 5F FE}. Here we want to evaulate the
 890   expression \texttt{C0 07} (type ``function call'', address 7). This
 891   points to a cons-cell, whose car (at position 8) is \texttt{20 05}
 892   (that is, \texttt{(number 5)}) and whose cdr (at position 7) is
 893   \texttt{00 09} (type ``more'' and address 9). The ``more'' points to
 894   another cons-cell, whose cdr (at position 9) is \texttt{E0 00} (that
 895   is, \texttt{(funcall 0)}), and whose car is \texttt{80 0B}: a
 896   procedure whose body is at position 11, namely \texttt{5F FE} (that
 897   is, \texttt{(var -2)}).
 898
 899   So the expression is a call to the primitive \texttt{funcall} with
 900   arguments \texttt{(lambda l (x) x)} and \texttt{5}. The original
 901   program was probably \texttt{((lambda l (x) x) 5)}. Note that this
 902   program was compiled with an old version of the compiler that did
 903   not intern symbols, which explains why \texttt{5} was compiled to
 904   \texttt{(number 5)} and not \texttt{(symbol 3)}.
 905 \end{itemize}
 906
 907 \subsection{REPL}
 908 The REPL ties everything together: it reads a program from the user,
 909 compiles it, assembles it, sends the assembled program to the
 910 processor, reads the memory dump from the processor, and finally
 911 prints the result to the screen.
 912
 913 The REPL includes an S-expression pretty-printer that operates on
 914 memory dumps. As symbols are interned by the compiler, when the
 915 pretty-printer hits a symbol it only knows its index, and not the
 916 original name. To print the original name the pretty-printer searches
 917 the symbol map built by the compiler and pulls the name from there.
 918
 919 The pretty-printer takes a chunk of memory and index, and prints the
 920 expression at that index. For example, if the memory chunk is
 921 \texttt{00 00 00 00 20 00 20 00 00 09 00 07 00 07 20 02 00 00 60 00}
 922 and the index is 6, then the pretty printer will try to print
 923 \texttt{00 07}, which is a list with car \texttt{00 00} and cdr
 924 \texttt{20 02}, and the pretty printer prints \texttt{(nil . t)}.
 925
 926 An example REPL session is:
 927
 928 \begin{lstlisting}
 929 YULE REPL
 930
 931 * (cons 1 2)
 932 (1 . 2)
 933 * (car '(4 5 6))
 934 4
 935 * (cdr '(t t t))
 936 (T T)
 937 * ((lambda id (x) x) 5)
 938 5
 939 * (lambda first (a b) a)
 940 <CLOSURE>
 941 * ((lambda first (a b) a) 7 8)
 942 7
 943 * (atom 6)
 944 T
 945 \end{lstlisting}
 946
 947 \chapter{Evaluation and Testing}
 948 To make development easier, ensure everything works correctly, and
 949 prevent regressions there are unit tests for the compiler and
 950 assembler, and functional tests for the entire project.
 951
 952 Besides these automated tests, manual evaluation (in the beginning,
 953 precompiling simple expressions and supplying them to the processor as
 954 the initial value of the memory; later on, manually typing expressions
 955 in the REPL and looking at the output). When the processor was not
 956 operating as expected, the GTKWave wave simulator was used to assist
 957 with debugging. This piece of software takes in a recorded simulation
 958 of the circuits and displays the values of all signals inside the
 959 circuit against time. This allows the designer to look at every clock
 960 cycle in turn and discover where the error occurs.
 961
 962 The process of debugging the processor had the side effect of finding
 963 inefficiencies in the original design. While \cite{lambda} mentions
 964 some limitations of the processors, one important limitation that was
 965 only discovered during debugging is that for every operation that the
 966 EVAL performs, the result of that operation will be written to new
 967 locations in memory. This includes ``read-only'' operations such as
 968 CAR and CDR. If the programmer requests the CAR of a list, a copy of
 969 the first element will be taken and a pointer to it will be returned.
 970 Since YULE only has 4096 words of memory, this severely restricts the
 971 complexity of programs that can be evaluated on it.
 972
 973 Another avenue to ensure the processor works as it should is formal
 974 verification using model checking tools. Formal verification is
 975 commonly employed as part of circuit design evaluation because
 976 mistakes are more difficult to identify, and if found after the chip
 977 has been built they are very costly to fix (often the chips that were
 978 built with the bug need to be thrown away). However, building models
 979 is a very involved process, and devoting effort to formal verification
 980 of this project would not have allowed the implementation of several
 981 features. Due to time constraints and knowing that no physical YULE
 982 chips are built (the project is so far only synthesised to a FPGA),
 983 testing and manual evaluation was considered sufficient for YULE.
 984
 985 \section{Unit testing}
 986 The software was tested with 48 unit tests written in Perl using the
 987 \texttt{Test::More} framework.
 988
 989 The assembler takes in compiled Lisp code and outputs a memory dump in
 990 one of two formats (binary or Verilog code). Its tests take a
 991 configuration for the assembler, an input, two expected outputs, and
 992 (for documentation purposes) the source Lisp code. An example test:
 993
 994 \begin{lstlisting}
 995 $expbin =
 996     '00 0C 00 00 00 00 20 00 20 00 00 0C C0 07'
 997   . ' 00 00 00 09 20 05 E0 00 80 0B 5F FE';
 998 run_test {addr_bits => 13},
 999     '(call (more (funcall 0) (proc (var -2))) (number 5))',
1000     <<'EOF', $expbin, '((LAMBDA ID (X) X) 5)';
1001 mem[ 0] <= 0;                     // (cdr part of NIL)
1002 mem[ 1] <= 0;                     // (car part of NIL)
1003 mem[ 2] <= 16'b0010000000000000;  // (cdr part of T)
1004 mem[ 3] <= 16'b0010000000000000;  // (car part of T)
1005 mem[ 4] <= 16'd12;                // (free storage pointer)
1006 mem[ 5] <= 16'b1100000000000111;  // CALL 7
1007 mem[ 6] <= 0;                     // (result of computation)
1008 mem[ 7] <= 16'b0000000000001001;  // MORE 9
1009 mem[ 8] <= 16'b0010000000000101;  // NUMBER 5
1010 mem[ 9] <= 16'b1110000000000000;  // FUNCALL 0
1011 mem[10] <= 16'b1000000000001011;  // PROC 11
1012 mem[11] <= 16'b0101111111111110;  // VAR -2
1013 EOF
1014 \end{lstlisting}
1015
1016 Here the compiled code \texttt{(call (more (funcall 0) (proc (var -2))) (number 5))}
1017 is given to the assembler, and its outputs are compared to the binary
1018 string on the first line and the block of Verilog code.
1019
1020 The tests do not all contain valid assembler inputs. There are a
1021 series of tests that give invalid inputs to the assembler and expect
1022 the assembler to raise an exception matching a pattern. Example tests:
1023
1024 \begin{lstlisting}
1025 expect_error_like { run_test {}, '((type is a list) 5)'}
1026     'Type of toplevel is not atom';
1027 expect_error_like { run_test {}, '(badtype 5)'}
1028     'No such type';
1029 expect_error_like { run_test {}, '(5 700000)'}
1030     'Addr too large';
1031 \end{lstlisting}
1032
1033 On the compiler side, there are 4 tests for the quoting function,
1034 but the brunt of the tests are pairs of a compiler input (i.e. Lisp
1035 code) and a compiler output (i.e. a sexp that the assembler
1036 understands). For each pair the compiler is run on the first element
1037 and its output is compared with the second. Example such tests:
1038
1039 \begin{lstlisting}
1040 is_toplevel '(quote 5)', '(SYMBOL 3)';
1041 is_toplevel '(reverse-list \'a \'a \'b)',
1042     '(CALL (MORE (MORE (REVERSE-LIST 0) (SYMBOL 4)) (SYMBOL 3)) (SYMBOL 3))';
1043 is_toplevel '(if t \'(2 3) \'x)',
1044     '(IF (LIST (SYMBOL 5) (LIST (LIST (LIST 0)'
1045   . ' (SYMBOL 3)) (SYMBOL 4))) (SYMBOL 2))';
1046 \end{lstlisting}
1047
1048 As the compiler code includes a function to pretty-print
1049 S-expressions, there are a few round-trip tests on this function.
1050 Given a string, it is parsed using the \texttt{Data::SExpression}
1051 parsing library then it is pretty-printed. The result should match the
1052 original string.
1053
1054 There are also tests for handling of invalid input, like in the assembler. These include:
1055
1056 \begin{lstlisting}
1057 expect_error_like { is_toplevel 'x' } 'Variable x not in environment';
1058 expect_error_like { is_toplevel '(car)' } 'Cannot call primitive car with no arguments';
1059 \end{lstlisting}
1060
1061 A test coverage tool, \texttt{Devel::Cover}, was ran on the compiler
1062 and assembler to find statements and branches that were not covered by
1063 existing unit tests, and new tests were written to cover them. Right
1064 now \texttt{Devel::Cover} reports 100\% test coverage of the compiler
1065 and assembler.
1066
1067 \begin{figure}
1068   \centering\includegraphics[height=8cm]{devel-cover.png}
1069   \caption{Screenshot of \texttt{Devel::Cover} results, showing 100\%
1070     test coverage}
1071 \end{figure}
1072
1073 \section{Functional testing}
1074 Besides unit tests, the entire system (hardware and software) has
1075 undergone functional testing. This means that a set of 21 Lisp
1076 expressions has been prepared, alongside their correct values computed
1077 using a software Lisp implementation. These values are given to the
1078 YULE REPL and the test succeeds if the output equals the known correct
1079 value of the expression.
1080
1081 Here are some example tests:
1082 \begin{lstlisting}
1083 '(1 2 3)
1084 5
1085 ((lambda id (x) x) 5)
1086 ((lambda id (x) x) '(1 2 3))
1087 (car '(2 3))
1088 (car (cdr '(4 5 6)))
1089 ((lambda snd (x y) y) 'first 'second)
1090
1091 (cons '(1 2) '(7))
1092 (car (cons '(1 2) '(7)))
1093 (cdr (cons '(1 2) '(7)))
1094
1095 ((lambda rev (xs) ((lambda reverse-acc (xs acc) (if xs (reverse-acc (cdr xs) (cons (car xs) acc)) acc)) xs '())) '(4 5 6 7))
1096 \end{lstlisting}
1097
1098 This test suite tries to cover most kinds of valid expressions. It
1099 includes constants of all kinds, conditionals, lambda abstractions,
1100 calls to all primitives, calls to user-defined functions, recursive
1101 calls.
1102
1103 \section{Timing and performance}
1104 There are three parts to the performance of YULE:
1105
1106 \begin{enumerate}
1107 \item The efficiency of the software. This refers to the time taken to
1108   compile and assemble input programs, which is negligible.
1109
1110 \item The I/O performance, which refers to the time needed to transfer
1111   the program from the computer to the processor's memory, plus the
1112   time needed to transfer the memory dump from the memory back to the
1113   computer.
1114
1115 \item The speed of the processor itself, that is the time/number of
1116   cycles that the processor needs to evaluate a given expression.
1117 \end{enumerate}
1118
1119 The GC runs at 48MHz, while the EVAL only runs when the GC allows it
1120 to which is at most half of the time.
1121
1122 The UART runs at 4000000 baud, so a bit is sent every 12 cycles of the
1123 48MHz clock. A byte is made of 11 bits: a start bit, 8 data bits, and
1124 one stop bit. A word is two bytes, so 22 bits. Therefore 264 cycles
1125 are needed to transmit one word.
1126
1127 Since the WRITER just sends a full memory dump, the I/O performance is
1128 proportional not to \texttt{program\_len + result\_len} but to
1129 \texttt{program\_len + result\_len + intermediate\_results\_len}. As
1130 evaluation of any expression involves the creation of at least one
1131 value (the result), for every expression or subexpression evaluated at
1132 least 264 cycles will be needed to output that result. The EVAL takes
1133 significantly less than 264 cycles to evaluate a expression (not
1134 counting its subexpressions), in one test each expression took between
1135 30 and 90 cycles to evaluate. The processor performance is therefore
1136 dominated by I/O.
1137
1138 As an example, the program \texttt{((lambda id (x) x) 5)} is 12 bytes
1139 long, and takes 258 cycles to evaluate. The memory dump is 54 bytes
1140 long. Therefore of the roughly 17700 cycles needed from beginning to
1141 end, about 98.5\% are used to do I/O.
1142
1143 \section{A worked example}
1144 \label{sec:example}
1145 Here is an example of what happens when a expression is typed into the
1146 REPL:
1147
1148 \begin{enumerate}
1149 \item A user types into the REPL the string
1150
1151 \begin{lstlisting}
1152 (car '(2 3 4))
1153 \end{lstlisting}
1154
1155   This is a call to the CAR primitive, with the list \texttt{(2 3 4)}
1156   as an argument. We expect the result to be \texttt{2}, the first
1157   element of the list.
1158
1159 \item This string is given to the compiler, producing the sexp
1160
1161 \begin{lstlisting}
1162 (CALL (CAR 0) (LIST (LIST (LIST (LIST 0) (SYMBOL 3)) (SYMBOL 4)) (SYMBOL 5)))
1163 \end{lstlisting}
1164
1165   Observe that in this representation the CDR comes before the CAR. So
1166   for example ``(LIST (LIST 0) (SYMBOL 3))'' represents a list whose
1167   CAR is ``(SYMBOL 3)'' and CDR is ``(LIST 0)'' (that is, nil). This
1168   is a list with one element, ``(SYMBOL 3)''. The sexp shown here
1169   represents a call to the CAR primitive with one argument. The list
1170   has three elements, which are in order \texttt{(SYMBOL 5), (SYMBOL
1171     4), (SYMBOL 3)}.
1172
1173   In this stage the compiler maps each number to a symbol, in this
1174   case \texttt{(SYMBOL 5)} is 2, \texttt{(SYMBOL 4)} is 3, and
1175   \texttt{(SYMBOL 3)} is 2. This mapping is remembered so it can be
1176   used later by the pretty printer.
1177
1178 \item This sexp is given to the assembler, producing the 16 words long
1179   binary string (here displayed in hex, with a space after each 16-bit
1180   word)
1181
1182 \begin{lstlisting}
1183 000F 0000 0000 2000 2000 000F C007 0000 2000 0009 000B 2005 000D 2004 0000 2003
1184 \end{lstlisting}
1185
1186   The first word, \texttt{000F}, indicates the length of the program
1187   in
1188   words (15 words, this does not include this first word).\\
1189   The next four words, \texttt{0000 0000 2000 2000}, represent the CDR
1190   and CAR of NIL and T respectively. These words are always
1191   constant.\\
1192   The next word, \texttt{000F}, represents the first unused position
1193   in memory. This is always equal to the first word transmitted.\\
1194   The next word, \texttt{C007}, represents the expression to be
1195   evaluated.\\
1196   The next word, \texttt{0000}, is ignored as this is where the
1197   processor will place the result of the evaluation.\\
1198   The rest of the words are subexpressions or parts of subexpressions
1199   of the expression being evaluated.\\
1200
1201   The expression that is evaluated is \texttt{C007}, which is a CALL
1202   whose CDR is at position 7 (\texttt{2000}, meaning \texttt{(CAR 0)})
1203   and whose CAR is at
1204   position 8 (\texttt{0009}).\\
1205   \texttt{0009} is a list with CAR at 10 (\texttt{2005}, meaning
1206   \texttt{(SYMBOL 5)}) and CDR at 9 (\texttt{000B}).\\
1207   \texttt{000B} is a list with CAR at 12 (\texttt{2004}, meaning
1208   \texttt{(SYMBOL 4)}) and CDR at 11 (\texttt{000D}).\\
1209   \texttt{000D} is a list with CAR at 14 (\texttt{2003}, meaning
1210   \texttt{(SYMBOL 3)}) and CDR at 13 (\texttt{0000}, meaning NIL).
1211
1212 \item This binary string is sent via UART to the FPGA's READER. The
1213   READER writes the string (except the first word) into memory,
1214   overwriting the previous contents of the memory. It then reports to
1215   the controller that it is done, and the controller advances to
1216   \texttt{STATE\_RUN}.
1217
1218 \item The EVAL and GC evaluate the program in memory, and then place
1219   the result back into memory. The EVAL reports to the controller that
1220   it is done, and the controller advances to \texttt{STATE\_WRITE}.
1221
1222 \item The WRITER sends this memory dump via UART back to the REPL
1223
1224 \begin{lstlisting}
1225 000F C007 2005 2000 0009 000B 2005 000D 2004 0000 2003 0000 4004 2000 0010 0000 0012 0000 0000 0009
1226 \end{lstlisting}
1227
1228   Then the WRITER reports to the controller that it is done, and the
1229   controller advances to \texttt{STATE\_READ}.
1230
1231   Observe that this memory dump is similar to the assembler output.
1232   The first two words are the 6th and 7th words in the assembler
1233   output, the next word is the result of the evaluation, then the next
1234   8 words are words 9th through 16th in the assembler output. The
1235   following words are intermediate results computed by the processor.
1236
1237 \item The REPL's pretty printer interprets this string starting with
1238   \texttt{2005}, which means \texttt{(SYMBOL 5)}. The pretty printer
1239   looks in the compiler's mapping and finds that the 5th symbol was
1240   2, and so the REPL prints
1241
1242 \begin{lstlisting}
1243 2
1244 *
1245 \end{lstlisting}
1246
1247   The evaluation is complete, and a new prompt is printed so that the
1248   user can input another expression.
1249 \end{enumerate}
1250
1251 \begin{figure}
1252   \centering\includegraphics[height=7cm]{repl.png}
1253   \caption{What the user types and sees in this example}
1254 \end{figure}
1255
1256 \chapter{Conclusions}
1257 A Lisp machine processor has been built using the original design and
1258 techniques presented in \cite{lambda}. It has been extended with input
1259 and output capabilities, and using a modern FPGA it has been
1260 synthesised on a physical device slightly larger than a USB flash
1261 drive. This device can be used as a standalone Lisp evaluator, or the
1262 processor inside it can be included in a more complicated circuit.
1263
1264 Software has been written to help interact with the processor. This
1265 includes a compiler, an assembler, and a REPL that together make it
1266 possible for an end-user unfamiliar with hardware design or the
1267 processor's internals to submit Lisp expressions to the processor and
1268 get back their values. The software also allows programmers to convert
1269 Lisp source code to the binary format understood by the processor, so
1270 that they may later send the binaries to the processor for evaluation.
1271
1272 The processor and software have been tested to ensure they work
1273 correctly. This has uncovered bugs and inefficiencies in the original
1274 paper, some of which could not be gleaned from simply reading the
1275 paper.
1276
1277 The source files for the hardware and software, and also the automated
1278 tests have been released under an open-source license to make it easy
1279 for future circuit designers and programmers to use and extend this
1280 project. They can be found at \url{https://git.ieval.ro/?p=yule.git}.
1281
1282 The original paper \cite{lambda} is one of the influential Lambda
1283 Papers that defined the Scheme programming language and is an
1284 important historical artifact for understanding how Lisp machine
1285 processors can be implemented and more generally how tagged
1286 architectures work. But all details of the processor cannot be
1287 understood from simply reading the paper, and most readers would not
1288 follow by attempting to implement what is described there. This
1289 projects brings a tangible implementation into the open, which shows
1290 the limitations of the design to people interested in this design and
1291 makes it possible to solve this limitations and extend the processor
1292 beyond its original scope.
1293
1294 \section{Reflections}
1295 Now that the project is finished, it is worth revisiting ideas I had
1296 in the beginning that were not ultimately useful.
1297
1298 The original concept had the processor run on a typical development
1299 board, with the software running on a Raspberry PI. The board and the
1300 Raspberry PI would be connected through jumper wires, and the UART
1301 circuitry on the Raspberry PI would be used by the software to talk to
1302 the processor.
1303
1304 This idea would have produced a much harder to use YULE. Not only
1305 would the user have to ssh into a Raspberry PI to do anything useful,
1306 but also the equipment would take more space (both boards are larger
1307 than the iCEstick). Once I tried the iCEstick and found that
1308 everything can be done just as well with it, it replaced the two
1309 boards and made the project easier to use (albeit slightly harder to
1310 debug, as the iCEstick only has 5 LEDs).
1311
1312 Some code had to be rewritten, sometimes multiple times. The compiler
1313 was originally written in Common Lisp, as it is the best suited
1314 language for manipulation of S-expressions, especially Lisp code. This
1315 meant a working project was obtained more quickly, but then the
1316 compiler had to be rewritten in another language; if the user has a
1317 software Lisp implementation already, why would they need YULE? I
1318 chose Perl as the reimplementation language, as the assembler was
1319 already written in Perl and it is widely-available, most systems
1320 coming with a Perl interpreter out of the box.
1321
1322 Throughout development of the hardware part, timing problems were a
1323 mainstay. The first ones related to the interactions between the
1324 processor and the memory (the memory was clocked, but the processor
1325 expected a combinational RAM). What fixed this problem was to run the
1326 memory at twice the speed of the rest of the processor. At some point,
1327 a decision was taken to run the GC and EVAL on different clocks (both
1328 ran at one quarter the speed of the master clock, but one of the
1329 clocks was delayed by half a cycle. The positive edge on one clock
1330 would happen midway between a positive edge and a negative edge on the
1331 other one). This proved to be unwise, as sometimes values sent by one
1332 module to the other would be cleared before the destination module
1333 managed to read them.
1334
1335 GTKWave proved to be very useful in debugging these timing issues, as
1336 the exact point where the issue began was easy to identify and I could
1337 see why the issue was happening. A workaround for this was to add a
1338 latch on each side of the common E bus that remembered the value sent
1339 long enough for the receiving module to see it.
1340
1341 All timing problems were fixed more cleanly in the end when an attempt
1342 to simplify the clocks was made. First GC and EVAL were changed to use
1343 the same clock, which removed the need for the latches, and then the
1344 memory was changed to use the negative edge of the clock instead of
1345 the positive edge, which meant no more clock division code had to be
1346 used.
1347
1348 \section{Future work}
1349 There are several ways to extend this project.
1350
1351 One idea is to fix the noted limitations: the ATOM primitive considers
1352 NIL to not be an atom, and there is no EQ primitive that compares
1353 symbols/other objects. The latter problem means that all symbols are
1354 indistinguishable (note: NIL can still be distinguished from T and
1355 other symbols, but NIL is not a symbol). To introduce equality a new
1356 signal needs to be added to the EVAL that compares (for example) the
1357 value on the E bus with the value of a fixed register, possibly a
1358 newly added one. Then (EQ A B) would store A in the register, load B
1359 on the E bus and compare them.
1360
1361 Once equality is implemented, it becomes possible to implement the
1362 EVAL function that evaluates an arbitrary Lisp expression.
1363
1364 Two more involved instruction set improvements are adding arithmetic
1365 operations, and input/output. Without arithmetic primitives one has to
1366 rely on Church numerals, which encodes numbers as lambda abstractions
1367 and is very inefficient. Input/output primitives would allow programs
1368 to receive data at runtime over the UART and send data back before
1369 evaluation is finished. This enables long-running useful programs, for
1370 example with input/output and equality a REPL can be written (which
1371 relies on the EVAl function mentioned above).
1372
1373 On the software side, the assembler can support optimizations and then
1374 the REPL can be extended with a core image. The main optimization
1375 needed in the assembler is common subexpression elimination. The
1376 processor never overwrites a value in memory with another value (other
1377 than overwriting \texttt{mem[6]} with the result of the evaluation),
1378 so it is safe to share any identical subexpressions. For example, if
1379 the user inputs the program \texttt{(cons (car '(1 2 3 4)) (cdr '(0 2
1380   3 4)))}, then when the assembler should share the tails of the two
1381 lists, meaning the second list should compile to \texttt{(LIST (LIST
1382   X) (SYMBOL Y))} where X is a pointer to where the second element of
1383 the first list is stored. This makes long programs significantly
1384 shorter. Note that no compiler changes are required to implement this,
1385 as the assembler can just remember all expressions and subexpressions
1386 it has assembler so far including their locations, and whenever a new
1387 expression needs to be assembler it can be replaced with a pointer to
1388 where the same expression has been placed before.
1389
1390 Once this optimization is implemented, the REPL can gain core image
1391 support. Suppose the user enters the expression \texttt{expr}. Instead
1392 of sending \texttt{assemble(compile(expr))} to the processor, the REPL
1393 can remember the last expression sent (or the last memory dump
1394 received), then append \texttt{assemble(compile(expr))} to that
1395 previous expression. This way \texttt{expr} can refer to previously
1396 defined expressions, as long as the REPL or compiler has a way to name
1397 previously supplied instructions, for example via a \texttt{define}
1398 special form, via the standard REPL variables (\texttt{+, ++, +++}),
1399 or by sequentially numbering all expressions submitted by the user.
1400
1401 Returning to hardware, a potential future project is to add actual
1402 hardware garbage collection to the GC component. At the moment the
1403 design does no garbage collection, the processor only has 4096 words
1404 of RAM available, and evaluation is quite memory-inefficient. Adding
1405 garbage collection would make more complicated programs possible
1406 despite the limited RAM. One way to implement this is to add a small
1407 microprocessor to the design which is preprogrammed with a garbage
1408 collection algorithm (simple choices are a stop-and-copy collector or
1409 Cheney's algorithm). When the free space pointer gets too high, the
1410 EVAL and GC are stopped, and the new microprocessor is started up so
1411 it can perform garbage collection, compact the memory, and update the
1412 free space pointer. The EVAL/GC registers would have to be updated as
1413 well. Then the microprocessor can be stopped, EVAL and GC be started
1414 up again and evaluation can continue. Note that a non-moving collector
1415 is not sufficient: the GC allocates new words only at the very end of
1416 the memory, so it cannot benefit from free ``gaps'' in the middle of
1417 the memory.
1418
1419 \begin{thebibliography}{99}
1420 \bibitem{lambda} Design of LISP-Based Processors (1979), available at
1421   \url{http://repository.readscheme.org/ftp/papers/ai-lab-pubs/AIM-514.pdf}
1422 \bibitem{osdvu} \url{https://opencores.org/project/osdvu}
1423 \bibitem{gtkwave} \url{http://gtkwave.sourceforge.net/}
1424 \bibitem{icestorm} \url{http://www.clifford.at/icestorm/}
1425 \end{thebibliography}
1426 \end{document}