

# **eFPGAs – Revolutionizing the High Performance Computing Landscape**

Eric Law, Director, Asia-Pacific Sales, September 14<sup>th</sup>, 2017

# Agenda

- The Compute Challenge and the Accelerator Spectrum
- eFPGA vs. FPGA for Hardware Acceleration
- Compute Acceleration with Speedcore eFPGA
- Standard Atomic Building Blocks
- ACE Design Tool
- Achronix Hardware Accelerator FPGA-Based Product Lines



### The Compute Challenge







# **Accelerator Spectrum**

[Figure: accelerator spectrum running from flexible (purpose, ease-of-use) to efficient (latency, power), with CPU, GPU, and FPGA marked along it.]

#### CPU

- Lots of flexibility, easy to program
- Slow compared to the other options

#### GPU

- Less flexible than a CPU: can't run general-purpose programs.
- Lots and lots of processors.
- Still fairly flexible, but very power hungry.

#### ASIC

- Burn algorithms into silicon.
- No flexibility, but no wasted time or energy on anything superfluous.

#### FPGA

- Flexibility of a CPU with ASIC-like efficiency and performance.
- Very low power per operation vs. CPU or GPU.
- Can be reprogrammed to implement different algorithms in "hardware".
- Made of look-up tables (LUTs) and registers, SRAMs, DSPs, and other special-purpose blocks.



### eFPGA vs. FPGA for Hardware Acceleration

### **FPGA Based Accelerator**



### **eFPGA Based Accelerator**







### **Achronix Speedcore eFPGA IP**

- Speedcore is an eFPGA IP for integration in ASICs and SoCs
- In production on TSMC 16nm
  - In development on TSMC 7nm
- Supported by Achronix ACE design tools
- Resource sizing defined by IP customer
  - Up to 1M LUTs
  - Speeds up to 500 MHz





### **Compute Acceleration with Speedcore**

#### High-throughput Programmable Acceleration

- 2x 128b interfaces @ 600 MHz: <u>153 Gb/s</u>
- Any combination of masters and slaves.
- As many interfaces as you need.
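
The 153 Gb/s figure follows directly from interface count, width, and clock rate. A quick check, assuming one full-width transfer per interface per cycle and no protocol overhead (the function name is illustrative and not part of any Achronix tool):

```python
def aggregate_bandwidth_gbps(num_interfaces: int, width_bits: int, clock_mhz: float) -> float:
    """Raw aggregate bandwidth in Gb/s.

    Assumption: every interface moves one full-width word per clock cycle;
    AXI handshaking and protocol overhead are not modeled.
    """
    bits_per_second = num_interfaces * width_bits * clock_mhz * 1e6
    return bits_per_second / 1e9


# Figures from the slide: 2 interfaces, 128 bits wide, 600 MHz.
print(aggregate_bandwidth_gbps(2, 128, 600))
```

The result is about 153.6 Gb/s, which the slide rounds down to 153 Gb/s. Adding interfaces scales the figure linearly under the same assumption.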

#### Low-latency Programmable Acceleration

Latency (round-trip) @ 600 MHz: ~10 ns.
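
To put the round-trip figure in terms of clock cycles, a small conversion helps. This is only a unit conversion of the numbers quoted above; the function name is illustrative:

```python
def latency_in_cycles(latency_ns: float, clock_mhz: float) -> float:
    """Convert a latency in nanoseconds to clock cycles at the given frequency."""
    return latency_ns * clock_mhz / 1000.0


# ~10 ns round trip at 600 MHz, as quoted on the slide.
print(latency_in_cycles(10, 600))
```

At 600 MHz a clock period is roughly 1.67 ns, so ~10 ns corresponds to about six cycles.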

#### <u>Coherent</u> Programmable Acceleration

- Simplified and safe software programming
- Cache Stashing allows the FPGA to work with any layer of cache, from any CPU.

#### AXI/ACE-Lite Interfaces

Integrates like any other SoC IP core.





### **Compute Acceleration with Speedcore**

#### Accelerator Coherency Port (ACP)

- The Speedcore Accelerator can access the system memory via L2 & L3 Cache.
- Extremely low-latency:
  - Get inputs directly from CPU cache.
  - Returns results directly to CPU cache.

#### Configuration

- Speedcore internal half-DMA: pulls configuration from memory with minimal CPU involvement.
- Extremely fast: ~2 ms per 100k LUTs.
- On-die configuration is inherently more secure than inter-chip configuration.
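
A rough estimate of configuration time for different core sizes, using the ~2 ms per 100k LUTs figure. The linear scaling is an assumption made here for illustration; the slide gives only the single ratio:

```python
MS_PER_100K_LUTS = 2.0  # approximate figure quoted on the slide


def estimated_config_time_ms(num_luts: int) -> float:
    """Estimated configuration time in milliseconds.

    Assumption: time scales linearly with LUT count.
    """
    return num_luts / 100_000 * MS_PER_100K_LUTS


for luts in (100_000, 500_000, 1_000_000):
    print(f"{luts:>9,} LUTs: ~{estimated_config_time_ms(luts):.0f} ms")
```

Under that assumption, a core at the 1M LUT upper bound would configure in roughly 20 ms.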

#### TrustZone

- FPGA accesses memory (or other peripherals) via the TrustZone controller.
  - An untrusted configuration cannot see trusted data.



# 100 Gbps Programmable Packet Processing

- Speedcore allows you to segment the design into hard vs. programmable blocks at a fine-grain level.
  - Increase performance while reducing area.

#### Example: 100 Gbps programmable packet processing

- FPGA Stage 1 tells the extraction controller what to extract
  - E.g. L2 header
- FPGA Stage 1 receives the requested data via a 128b interface
- FPGA Stage 1 processes the packet header
  - E.g. MAC address TCAM lookup
- FPGA Stage 1 inserts sideband or inline data via a 128b interface
- FPGA Stage 1 provides an "insertion instruction"
  - Where to insert, drop, etc.
- FPGA Stage 1 can forward control information directly to FPGA Stage 2
  - Within the FPGA
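
The Stage 1 flow above can be sketched as a small software model. This is a behavioral illustration only, not RTL and not Achronix code: every name here is hypothetical, a Python dict stands in for the MAC address TCAM, and an untagged 14-byte Ethernet header is assumed.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a MAC address TCAM: exact-match table
# from destination MAC to egress port.
MAC_TABLE = {bytes.fromhex("001122334455"): 3}

L2_HEADER_LEN = 14  # untagged Ethernet: dst MAC (6) + src MAC (6) + EtherType (2)


@dataclass
class InsertionInstruction:
    action: str   # "insert" or "drop"
    offset: int   # byte offset at which sideband data would be inserted
    data: bytes   # sideband data (empty when dropping)


def extraction_request() -> tuple[int, int]:
    """Stage 1 tells the extraction controller what to extract: (offset, length)."""
    return (0, L2_HEADER_LEN)


def extraction_controller(packet: bytes, request: tuple[int, int]) -> bytes:
    """Hard-block model: return only the requested bytes to the FPGA stage."""
    offset, length = request
    return packet[offset:offset + length]


def stage1(packet: bytes) -> InsertionInstruction:
    """Request the L2 header, look up the destination MAC, emit an instruction."""
    header = extraction_controller(packet, extraction_request())
    port = MAC_TABLE.get(header[0:6])
    if port is None:
        return InsertionInstruction("drop", 0, b"")
    # Prepend the egress port as one byte of sideband data.
    return InsertionInstruction("insert", 0, bytes([port]))


known = bytes.fromhex("001122334455" "66778899aabb" "0800") + b"payload"
unknown = bytes.fromhex("ffffffffffff" "66778899aabb" "0800") + b"payload"
print(stage1(known))
print(stage1(unknown))
```

The point of the split is that only the requested header bytes cross into the programmable fabric, while the full packet stays in the hard datapath.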





## **Standard Atomic Building Blocks**

- Standard FPGA blocks
  - Logic
  - DSP
  - Memory
- Structured in column-based format
- Complete permutability for column locations



- Logic
  - 4-input LUT
  - Dedicated wide muxes
  - 4-bit ALU (per 4 LUTs)
- DSP
  - 27x18 multiplier
  - 64-bit accumulator (ACC)
- Memory
  - 20 Kbit, true dual-port
  - 4 Kbit, simple dual-port
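
A 64-bit accumulator behind a 27x18 multiply leaves considerable headroom for long sums of products. The sketch below estimates how many worst-case products can be accumulated before overflow. It reads 27x18 as the multiplier operand widths and assumes signed two's-complement operands and accumulator; both are assumptions, since the slide does not say.

```python
def worst_case_safe_accumulations(a_bits: int, b_bits: int, acc_bits: int) -> int:
    """Number of worst-case products a signed accumulator can sum without overflow.

    Assumption: signed two's-complement operands and accumulator.
    """
    # Largest-magnitude product: (-2^(a-1)) * (-2^(b-1)) = 2^(a+b-2)
    max_product = (1 << (a_bits - 1)) * (1 << (b_bits - 1))
    acc_max = (1 << (acc_bits - 1)) - 1
    return acc_max // max_product


print(worst_case_safe_accumulations(27, 18, 64))
```

Under these assumptions the accumulator absorbs just over a million worst-case products, enough for long dot products and FIR filters without intermediate rounding.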



### **Achronix Speedcore eFPGA Compiler:** Fast Development of Customer Specific Speedcore IP

**Inputs**

- Node
- Process variants
- Metal stack

**Deliverables**

- GDSII
- Simulation files
- SI and PI models
- Test models
- Timing characterization
- Documentation
- Full support in ACE design tools

Example configuration shown on the slide: TSMC 16FF+, 13-layer metal.



### **ACE Design Tools and Supported Features**

ACE design tools come loaded with features:

- Full RTL support for Verilog and SystemVerilog
- Full-fledged STA supporting CRPR, OCV derating, multiple corners
- Tcl command language
- GUI and command-line control
- Highly efficient integrated data model
- Supports all current simulators: Synopsys VCS, Mentor QuestaSim, Aldec Riviera, Cadence Incisive
- Linux and Windows support
- Node-locked and network licensing models
- Protected/encrypted IP support

Production Version 6.0.4

Available Now





## **ACE Design Suite**



[Screenshot: ACE design suite GUI, including the FPGA Program & Snapshot Debug view, a waveform file preview, and the Tcl console.]



### **Achronix Hardware Accelerator FPGA-Based Product Lines**

### **Speedster**



- Standalone FPGAs
- High-performance
- High-density
- For targeted applications

### **Speedcore**



- Embedded FPGA Cores
- High-performance
- Customized per application

### **Speedchip**



- FPGA Chiplets
- 2.5D or MCM integration
- Customized per application

### **ACE Design Tools**

Synthesis – Verification – Timing – Programming – Debug



## **Summary**

- Smaller, faster and programmable
  - FPGAs are critical to meeting tomorrow's compute requirements.
- Embedding FPGA functionality is no longer a distant dream!
  - Start exploiting the full functionality and benefits of an FPGA inside an SoC as a hard IP.
- eFPGA IP has to be silicon-proven!
  - Achronix Speedcore eFPGA is based on the same high-performance architecture as Achronix's Speedster 22i FPGAs, which have been shipping in production since 2013.
  - Speedcore eFPGA products are fully supported by Achronix's robust and proven ACE design tools.

