The aim of this project is to create a machine learning co-processor with an architecture similar to Google's Tensor Processing Unit. The implementation is resource-friendly and can be instantiated in different sizes to fit any type of FPGA. This allows the co-processor to be deployed in embedded systems and IoT devices, but it can also be scaled up for use in data centers and high-performance machines. The AXI interface allows the co-processor to be integrated into a wide variety of system configurations. Evaluations were made on the Xilinx Zynq 7020 SoC.
Unlike the original TPU, this version supports only fixed-point arithmetic. Weights and inputs have to lie in the range -1 to 127/128 (signed 8-bit) or 0 to 255/256 (unsigned 8-bit).
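These ranges correspond to 8-bit fixed-point formats with a step size of 1/128 (signed) or 1/256 (unsigned). As a minimal sketch of how floating-point values could be mapped to these formats on the host side (the function names are illustrative and not part of this project):

```python
import numpy as np

def quantize_signed(x):
    """Map floats in [-1, 127/128] to signed 8-bit values (step 1/128)."""
    return np.clip(np.round(np.asarray(x) * 128), -128, 127).astype(np.int8)

def quantize_unsigned(x):
    """Map floats in [0, 255/256] to unsigned 8-bit values (step 1/256)."""
    return np.clip(np.round(np.asarray(x) * 256), 0, 255).astype(np.uint8)

# Example: a weight of 0.5 maps to 64 (0.5 * 128); an input of 0.5 maps to 128.
print(quantize_signed([-1.0, 0.5, 0.9921875]))    # [-128   64  127]
print(quantize_unsigned([0.0, 0.5, 0.99609375]))  # [  0 128 255]
```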
The arithmetic is implemented by six main components.
The sizes of the components (e.g. the size of the MXU, the buffers, etc.) can be configured separately.
The control units allow the system to execute 10-byte-wide instructions (more information in doc/TPU_ISA.md). Instructions can be transmitted over AXI and are stored in a small FIFO buffer.
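The exact field layout is defined in doc/TPU_ISA.md. Purely as an illustration of assembling a 10-byte instruction word on the host, a sketch could look like this (the opcode/length/address split shown here is an assumption for the example, not the documented encoding):

```python
import struct

def encode_instruction(opcode, length, address):
    """Pack a 10-byte instruction word.

    Hypothetical layout for illustration only (see doc/TPU_ISA.md for the
    real encoding): 1-byte opcode, 4-byte calculation length, 5-byte
    address, little-endian.
    """
    assert 0 <= address < 2**40
    return struct.pack("<BI", opcode, length) + address.to_bytes(5, "little")

word = encode_instruction(opcode=0x01, length=16, address=0x0000)
assert len(word) == 10  # instructions are 10 bytes wide
```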
A sample model, trained on the MNIST dataset, was evaluated on differently sized MXUs at 177.77 MHz, with a theoretical peak performance of up to 72.18 GOPS (see the sketch after the tables below for how this figure can be derived). Real timing measurements were then compared with traditional processors:
| Matrix Width N | 6 | 8 | 10 | 12 | 14 |
|---|---|---|---|---|---|
| Duration in µs (N input vectors) | 383 | 289 | 234 | 194 | 165 |
| Duration per input vector in µs | 63 | 36 | 23 | 16 | 11 |
| Processor | Intel Core i5-5287U at 2.9 GHz | BCM2837 (4x ARM Cortex-A53) at 1.2 GHz |
|---|---|---|
| Duration per input vector in µs | 62 | 763 |
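The 72.18 GOPS figure can be approximately reproduced if one assumes that an N×N MXU retires 2·N² multiply-accumulate operations plus N activation operations per cycle; this counting convention is an assumption made for the sketch, not a statement taken from the thesis:

```python
# Peak throughput sketch: assumes 2*N^2 MAC ops plus N activation ops
# retire every cycle in an N x N MXU (counting convention assumed here).
f_clk = 177.77e6  # clock frequency in Hz
N = 14            # largest evaluated MXU width

ops_per_cycle = 2 * N**2 + N           # 406 ops/cycle
peak_gops = ops_per_cycle * f_clk / 1e9
print(f"{peak_gops:.2f} GOPS")         # ~72.17 GOPS, close to the quoted 72.18 GOPS
```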
To get started with tinyTPU, please have a look at getting_started.pdf, where detailed instructions for Xilinx Zynq SoCs and Vivado can be found.
This project was developed as part of a bachelor's thesis in technical computer science at HAW Hamburg. If you want to know more about the co-processor, you can have a look at the thesis here (German).