Let me introduce an experimental SoC and a compiler toolchain built for exploring the design space of compute acceleration. Of course, the words "compute acceleration" do not sit well with an FPGA as small and simple as the ICE40, but it still provides an opportunity to explore some simple techniques before graduating to more complex FPGAs, while enjoying the most pleasant-to-use, fully open source FPGA flow: yosys/arachne-pnr/icestorm.
The BlackIce board is very suitable for such experiments, with its fast SRAM and convenient PMODs.
The demo I'm presenting here consists of the following:
- A simple, reasonably fast 6-stage RISC CPU (around 2500 LCs in total) that retires 1 instruction per clock cycle unless stalled by an extended instruction (no memory stalls, no interrupts, etc.). This CPU core is designed to be a minion CPU in an SoC controlled by another, more general-purpose CPU, but obviously on the ICE40 8k we only have space for one CPU anyway.
- A monochrome 640x480 VGA
- An infrastructure for adding extended instructions to the RISC CPU
- An optional UART (not used in the demo)
- A small 2-port RAM implemented on ICE40 block RAMs, used for both code and data
- A very simple, extensible C-like language compiler, with an SSA-based optimisation middle layer and multiple CPU backends. This language allows inlining Verilog into C the same way one would inline assembly.
There is a demo program displaying a monochrome Mandelbrot set (computed in fixed point).
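To make "computed in fixed point" concrete, here is a minimal sketch in plain C. It is not the demo's source: the Q16.16 format and the names fix_t, fix_mul and mandel_iter are assumptions made for illustration only.

```c
#include <stdint.h>

/* Assumed Q16.16 format: 16 integer bits, 16 fraction bits.
   The format used by the actual demo is not stated in this post. */
#define FRAC_BITS 16
#define FIX_ONE   (1 << FRAC_BITS)

typedef int32_t fix_t;

/* Fixed-point multiply: widen to 64 bits, then drop the extra fraction bits.
   Relies on arithmetic right shift of negative values, which common
   compilers provide. */
static fix_t fix_mul(fix_t a, fix_t b)
{
    return (fix_t)(((int64_t)a * (int64_t)b) >> FRAC_BITS);
}

/* Iterate z = z*z + c from z = 0. Returns the number of iterations
   completed before |z|^2 exceeds 4, or max_iter if that never happens. */
static int mandel_iter(fix_t cr, fix_t ci, int max_iter)
{
    fix_t zr = 0, zi = 0;
    int i;
    for (i = 0; i < max_iter; i++) {
        fix_t zr2 = fix_mul(zr, zr);
        fix_t zi2 = fix_mul(zi, zi);
        if (zr2 + zi2 > 4 * FIX_ONE)
            break;
        fix_t zri = fix_mul(zr, zi);
        zr = zr2 - zi2 + cr;   /* Re(z^2 + c) */
        zi = 2 * zri + ci;     /* Im(z^2 + c) */
    }
    return i;
}
```

For a monochrome picture, a pixel would be set when mandel_iter returns max_iter and cleared otherwise. Note that the three products in the loop body do not depend on each other.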
One version of this program runs entirely on the CPU, including multiplication and division implemented in software. The second version is nearly the same, but it adds an "__hls" attribute to some functions (multiplication and division), immediately turning them into hardware instructions. The third version implements the entire Mandelbrot kernel in hardware, using three 32-bit multipliers in parallel.
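For readers who have not written software multiplication and division before, this is roughly what such routines look like in plain C. It is a sketch, not the demo's code: the names soft_mul and soft_divu are hypothetical, and the exact syntax for attaching the "__hls" attribute in the C-like language is deliberately not shown, since I am only illustrating the kind of function it would be applied to.

```c
#include <stdint.h>

/* Shift-and-add multiplication: adds a shifted copy of 'a' for every
   set bit of 'b'. The result wraps modulo 2^32. */
uint32_t soft_mul(uint32_t a, uint32_t b)
{
    uint32_t acc = 0;
    while (b != 0) {
        if (b & 1u)
            acc += a;
        a <<= 1;
        b >>= 1;
    }
    return acc;
}

/* Restoring division, one quotient bit per step, most significant first.
   Returns the quotient and stores the remainder in *rem.
   The caller must ensure d != 0. */
uint32_t soft_divu(uint32_t n, uint32_t d, uint32_t *rem)
{
    uint32_t q = 0;
    uint64_t r = 0;   /* 64 bits so that (r << 1) cannot lose the top bit */
    for (int i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);
        if (r >= d) {
            r -= d;
            q |= (1u << i);
        }
    }
    *rem = (uint32_t)r;
    return q;
}
```

Both loops are short and have fixed or bounded trip counts, which is the sort of code that maps naturally onto a multi-cycle hardware instruction.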
This toolchain allows exploiting HLS compute acceleration even further, by utilising pipeline-level parallelism: the Mandelbrot kernel inner loop is a 9-stage pipeline, meaning one core can compute 9 threads in parallel instead of one. Unfortunately, this version is already a bit too big for the ICE40 8k (just a couple of hundred LCs over the limit, so there is hope that I can cram it in later, with a smaller host CPU). If you want to see this aspect of the toolchain in action, there is a 4-core demo running on the Digilent NEXYS4DDR board.
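The idea behind filling a 9-stage pipeline with 9 threads can be modelled in software. The sketch below is only a sequential C model under my own assumptions (Q16.16 fixed point, hypothetical names thread_t, step and run_interleaved); it does not show how the toolchain generates hardware. Each iteration of a pixel depends on the previous one, so a single thread could only issue once every 9 cycles; rotating through 9 independent pixels gives the pipeline a new input on every cycle.

```c
#include <stdint.h>

#define NTHREADS  9     /* one thread per pipeline stage, as in the text */
#define FRAC_BITS 16    /* assumed Q16.16 fixed point */
#define FIX_ONE   (1 << FRAC_BITS)

typedef int32_t fix_t;

typedef struct {
    fix_t cr, ci;   /* the pixel's constant c (set by the caller) */
    fix_t zr, zi;   /* current value of z */
    int   iter;     /* iterations completed so far */
    int   done;     /* nonzero once escaped or out of iterations */
} thread_t;

static fix_t fix_mul(fix_t a, fix_t b)
{
    return (fix_t)(((int64_t)a * (int64_t)b) >> FRAC_BITS);
}

/* One trip of a thread through the pipeline: one Mandelbrot iteration. */
static void step(thread_t *t, int max_iter)
{
    fix_t zr2 = fix_mul(t->zr, t->zr);
    fix_t zi2 = fix_mul(t->zi, t->zi);
    if (zr2 + zi2 > 4 * FIX_ONE || t->iter >= max_iter) {
        t->done = 1;
        return;
    }
    fix_t zri = fix_mul(t->zr, t->zi);
    t->zr = zr2 - zi2 + t->cr;
    t->zi = 2 * zri + t->ci;
    t->iter++;
}

/* Round-robin issue: on cycle n, thread (n mod NTHREADS) enters the
   pipeline. Returns the number of issue slots used until all finish. */
long run_interleaved(thread_t th[NTHREADS], int max_iter)
{
    long cycles = 0;
    int remaining = NTHREADS;

    for (int i = 0; i < NTHREADS; i++) {
        th[i].zr = th[i].zi = 0;
        th[i].iter = 0;
        th[i].done = 0;
    }
    while (remaining > 0) {
        thread_t *t = &th[cycles % NTHREADS];
        if (!t->done) {
            step(t, max_iter);
            if (t->done)
                remaining--;
        }
        cycles++;
    }
    return cycles;
}
```

To use it, fill in cr and ci for nine neighbouring pixels, call run_interleaved, and read each thread's iter field. In this model a finished thread leaves its slot idle; a real design might hand the slot to the next pixel instead.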
How to build
Install dependencies: get swi-prolog and mono (and, of course, yosys, arachne and icestorm). E.g., on Ubuntu 16.04:
$ apt-get install swi-prolog mono-complete yosys arachne-pnr fpga-icestorm fpga-icestorm-chipdb arachne-pnr-chipdb iverilog verilator
$ git clone https://github.com/combinatorylogic/mbase
$ cd mbase/bin; make; sudo ./install.sh
Then, clone the git repository:
$ git clone https://github.com/combinatorylogic/soc
$ cd soc && git submodule update --init --recursive
$ cd soc && make clikecc.exe
Now, build a BlackIce .bin file for the demo:
$ cd soc/backends/c2
$ make blackicemandhls
The resulting bitfile is in
The VGA output is on PMODs 7/8/9/10 (I'm using the Digilent PmodVGA module).
EDIT: added a screenshot of the HLS demo