

Next-generation endpoint AI for IoT with the new Arm Cortex-M55 and Ethos-U55 processors

Mark Quartermain, Cortex-M55 Product Manager

### Arm Enables AI Everywhere, On Any Device

Arm's AI platform delivers comprehensive hardware IP, software frameworks, and ecosystem



AI-enabled IoT device shipments are forecast to increase by almost 20% per year through 2024\*



# Best-in-class Solution Optimized for Endpoint AI

Cortex-M55

Most AI-capable Cortex-M processor

Ethos-U55

First microNPU for Cortex-M

Performance

Versatile ML performance: Up to 15x ML uplift\*



Dedicated ML performance: Additional 32x ML uplift\*\*

Up to 480x ML performance uplift\*

Optimization

Arm Custom Instructions\*\*\* and configuration options

Configurable 32-256 MACs

Accelerated Design and Development

Corstone-300 reference design for faster and more secure system-on-chip development



<sup>\*</sup>Compared to previous Cortex-M generations

<sup>\*\*</sup>Compared to the Cortex-M55

<sup>\*\*\*</sup>Available in 2021
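The three uplift figures on this slide appear to compose multiplicatively: 15x for Cortex-M55 over previous Cortex-M generations, then a further 32x for Ethos-U55 over Cortex-M55. The sketch below only checks that arithmetic. The multiplicative reading is an assumption inferred from the numbers, and the function name is illustrative.

```python
# Figures quoted on the slide. Treating them as multiplicative is an
# assumption inferred from the numbers, not a statement from the deck.
CORTEX_M55_UPLIFT = 15  # "up to 15x" vs. previous Cortex-M generations
ETHOS_U55_UPLIFT = 32   # "additional 32x" vs. Cortex-M55 alone


def combined_uplift(cpu_uplift: int, npu_uplift: int) -> int:
    """Compose two 'up to' speed-ups that are applied one after the other."""
    return cpu_uplift * npu_uplift


print(combined_uplift(CORTEX_M55_UPLIFT, ETHOS_U55_UPLIFT))  # 480
```

All three values are "up to" figures, so the product is a best case rather than a typical result.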

# Unified Software Development: Fastest Path to Endpoint AI



- Multiple software development flows
- Harder to program and debug
- More complex, longer time to market



- Unified software development flow
- Works with common ML frameworks and existing tools
- More productivity, faster time to market





# Arm Cortex-M55 Processor

Arm's most AI-capable Cortex-M processor and the first to feature Arm Helium vector processing technology

### Cortex-M55: The Most AI-capable Cortex-M Processor

Cortex-M processor with enhanced DSP/ML compute capabilities

### Helium vector processing

- More than 150 new scalar and vector instructions
- Support for complex maths
- Low overhead loops

#### Vector processing support

- 2 x 32-bit MAC/cycle
- 4 x 16-bit MAC/cycle
- 8 x 8-bit MAC/cycle
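Each of the three rates above works out to the same 64 bits of operand data per input vector per cycle, which is why halving the element width doubles the MAC rate. The sketch below checks that and estimates an idealized cycle count for a dot product. It assumes peak throughput with no load, store or loop overhead, and the function names are illustrative.

```python
import math

# MACs per cycle by operand width, as listed on the slide.
MACS_PER_CYCLE = {32: 2, 16: 4, 8: 8}


def operand_bits_per_cycle(width_bits: int) -> int:
    """Bits consumed per cycle from one input vector at the quoted MAC rate."""
    return MACS_PER_CYCLE[width_bits] * width_bits


def ideal_mac_cycles(num_macs: int, width_bits: int) -> int:
    """Lower bound on cycles for num_macs multiply-accumulates.

    Assumes every cycle issues a full set of MACs, which real kernels
    will not reach because of memory accesses and loop control.
    """
    return math.ceil(num_macs / MACS_PER_CYCLE[width_bits])


for width in (32, 16, 8):
    print(width, operand_bits_per_cycle(width), ideal_mac_cycles(1024, width))
```

For a 1024-element dot product this lower bound is 512 cycles at 32-bit, 256 at 16-bit and 128 at 8-bit.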

### Extended datatype support

- Half-precision float
- 8-bit integer
- Floating-point (half, full and double precision)



#### Cortex-M ease of use

- Unified instruction set
- Single toolchain, simplified debug
- No need for separate DSP engine
- Cortex-M ecosystem

### High performance system

- Memory system designed for DSP and ML applications
- Optional I/D caches
- Optional Tightly Coupled Memory

### Security

- TrustZone for system wide security
- Protect software investments



### Cortex-M55: Accelerating Embedded DSP/ML Performance



Highest performance and efficiency for ML and signal processing across Cortex-M portfolio



# Cortex-M55 Energy Efficiency by Datatype

Compared to the Cortex-M4 processor



- Measured average energy consumption based on selected DSP kernels from CMSIS-DSP
- Energy efficiency measured as: $\left(\frac{\text{Cortex-M4 cycles to complete kernel}}{\text{Cortex-M55 cycles to complete kernel}}\right) \times \left(\frac{\text{Cortex-M4 power}}{\text{Cortex-M55 power}}\right)$
- A value greater than 1 indicates that Cortex-M55 is more energy-efficient than Cortex-M4
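The metric above can be sketched as a small function. The cycle counts and power figures in the example call are made-up placeholders chosen for easy arithmetic, not measurements from the presentation.

```python
def energy_efficiency_ratio(
    m4_cycles: float,
    m55_cycles: float,
    m4_power: float,
    m55_power: float,
) -> float:
    """(M4 cycles / M55 cycles) x (M4 power / M55 power), as defined on the slide.

    Both power values must use the same unit; the unit cancels out.
    """
    if min(m4_cycles, m55_cycles, m4_power, m55_power) <= 0:
        raise ValueError("cycle counts and power values must be positive")
    return (m4_cycles / m55_cycles) * (m4_power / m55_power)


# Hypothetical kernel: 4x fewer cycles on Cortex-M55 at 1.6x the power.
ratio = energy_efficiency_ratio(
    m4_cycles=4000, m55_cycles=1000, m4_power=10, m55_power=16
)
print(ratio)  # 2.5
```

In this made-up case the higher power draw is more than offset by the shorter run time, so the ratio stays above 1.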





# Arm Ethos-U55 microNPU

The first Arm microNPU for Cortex-M based systems

### Ethos-U55: The First microNPU for Cortex-M

- Configurations 32/64/128/256 MACs
- Compute-intensive operators are accelerated in hardware; other operators run on the microcontroller
- Works alongside Cortex-M55, Cortex-M7, Cortex-M33 and Cortex-M4 processors
- Connected to the memory bus
- Uses existing system SRAM and flash storage
- Comes with a high-performance, high-efficiency MAC engine
- Weight decoder and DMA for on-the-fly weight decompression
- 8-bit input x 8-bit weights
- 16-bit input x 8-bit weights
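To get a feel for what the four MAC configurations mean for a single layer, the sketch below counts the multiply-accumulates in a standard 2D convolution and divides by each configuration's MACs per cycle. The layer dimensions are hypothetical, and the result is an idealized lower bound that ignores memory bandwidth, weight decompression and MAC utilization.

```python
import math

# MAC configurations listed on the slide.
ETHOS_U55_MAC_CONFIGS = (32, 64, 128, 256)


def conv2d_macs(
    out_h: int, out_w: int, out_ch: int, kernel_h: int, kernel_w: int, in_ch: int
) -> int:
    """Multiply-accumulates in a standard (non-depthwise) 2D convolution."""
    return out_h * out_w * out_ch * kernel_h * kernel_w * in_ch


def ideal_cycles(total_macs: int, macs_per_cycle: int) -> int:
    """Lower bound on cycles if every MAC unit is busy on every cycle."""
    return math.ceil(total_macs / macs_per_cycle)


# Hypothetical layer: 56x56 output, 16 -> 32 channels, 3x3 kernel.
layer_macs = conv2d_macs(out_h=56, out_w=56, out_ch=32, kernel_h=3, kernel_w=3, in_ch=16)
for macs_per_cycle in ETHOS_U55_MAC_CONFIGS:
    print(macs_per_cycle, ideal_cycles(layer_macs, macs_per_cycle))
```

Doubling the MAC count halves this bound, but real speed-ups depend on how well the memory system keeps the MAC engine fed.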



### Ethos-U55 Performance Results

Using the 128 MACs/cycle configuration of Ethos-U55



### MobileNet V2





### CMSIS-NN and TensorFlow Lite Micro for Cortex-M

Optimized low-level kernels for the embedded market



- The optimized kernel library for Cortex-M
  - Called from TF Lite Micro or bare metal implementations
  - Offline flow creates a binary for Cortex-M based platforms
- Targets all Cortex-M architectures
  - Armv6-M/Armv7-M/Armv8-M/Armv8.1-M with Helium support
  - Runs on earlier versions of the architecture
- Key operators accelerated by CMSIS-NN
  - Fallback to TF Lite Micro reference kernels
- Open-source, via Apache 2.0 license
   https://github.com/ARM-software/CMSIS_5
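The fallback behaviour described above can be pictured as a two-level lookup: use an optimized kernel when one is registered for the operator, otherwise use the reference kernel. The sketch below is a conceptual model only. The tables, kernel functions and `resolve_kernel` are hypothetical names and do not reflect the actual TF Lite Micro or CMSIS-NN APIs.

```python
from typing import Callable, Dict, List, Tuple

Kernel = Callable[[List[int]], List[int]]


def relu_reference(xs: List[int]) -> List[int]:
    """Plain, portable ReLU standing in for a reference kernel."""
    return [x if x > 0 else 0 for x in xs]


def relu_optimized(xs: List[int]) -> List[int]:
    """Stand-in for an optimized kernel; must give the reference result."""
    return [max(0, x) for x in xs]


def negate_reference(xs: List[int]) -> List[int]:
    """Operator that has no optimized version in this toy example."""
    return [-x for x in xs]


# Hypothetical registries: only some operators have an optimized kernel.
OPTIMIZED_KERNELS: Dict[str, Kernel] = {"relu": relu_optimized}
REFERENCE_KERNELS: Dict[str, Kernel] = {
    "relu": relu_reference,
    "negate": negate_reference,
}


def resolve_kernel(op_name: str) -> Tuple[str, Kernel]:
    """Return (backend, kernel), preferring optimized over reference."""
    for backend, table in (
        ("optimized", OPTIMIZED_KERNELS),
        ("reference", REFERENCE_KERNELS),
    ):
        if op_name in table:
            return backend, table[op_name]
    raise KeyError(f"no kernel registered for operator {op_name!r}")


for op in ("relu", "negate"):
    backend, kernel = resolve_kernel(op)
    print(op, backend, kernel([-2, 3]))
```

Because every operator still has a reference implementation, a model keeps running even when only part of it is covered by optimized kernels.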



# Accelerating Embedded Machine Learning

Add Ethos-U55 under the same stack



- Boosts ML performance beyond Cortex-M alone
  - Small memory footprint
  - Low-power for 'always-on' applications
- Operators are accelerated by the microNPU
  - Fallback to CMSIS-NN, then reference kernels
- Tooling for offline optimization
  - Quantize and tune data structures for the microNPU
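The slides do not say which quantization scheme the offline tooling uses. As one common possibility, the sketch below shows affine 8-bit quantization, where a float range is mapped onto int8 with a scale and a zero point. The scheme, the function names and the sample values are all assumptions for illustration.

```python
from typing import List, Tuple

INT8_MIN, INT8_MAX = -128, 127


def choose_params(values: List[float]) -> Tuple[float, int]:
    """Pick a scale and zero point so the value range (including 0) fits int8."""
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    if hi == lo:
        return 1.0, 0
    scale = (hi - lo) / (INT8_MAX - INT8_MIN)
    zero_point = int(round(INT8_MIN - lo / scale))
    return scale, zero_point


def quantize(values: List[float], scale: float, zero_point: int) -> List[int]:
    """Map floats to int8, clamping anything that falls outside the range."""
    return [
        max(INT8_MIN, min(INT8_MAX, int(round(v / scale)) + zero_point))
        for v in values
    ]


def dequantize(q_values: List[int], scale: float, zero_point: int) -> List[float]:
    """Map int8 values back to approximate floats."""
    return [(q - zero_point) * scale for q in q_values]


samples = [0.0, 0.5, 1.0, 2.0]  # hypothetical activation values
scale, zero_point = choose_params(samples)
q = quantize(samples, scale, zero_point)
print(q, dequantize(q, scale, zero_point))
```

The round trip is lossy: each recovered value can differ from the original by up to about half a quantization step.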




# Summary

### Summary: Bringing the Benefits of AI to Billions More Devices



- ✓ Significant uplift in DSP/ML performance
- ✓ Meets efficiency needs of IoT endpoint
- ✓ Built-in system-wide security
- ✓ Unified toolchain enabling ease-of-use
- ✓ Simplified SoC and software development
- ✓ Industry-leading ecosystem



# Industry-wide Effort: The Most Extensive AI Ecosystem

Significant silicon partner collaboration

Algorithm, software, tools and RTOS partners






Thank You

Danke

Merci

谢谢

ありがとう

Gracias

Kiitos

감사합니다

धन्यवाद

شكرًا

תודה

Find Out More:

Cortex-M55: developer.arm.com/cortex-m55

Ethos-U55: developer.arm.com/ethos-U55

Corstone-300: developer.arm.com/corstone-300

Alternatively, get in touch with an expert to learn more: pages.arm.com/cortex-M55-consultation



<sup>+</sup>The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.

www.arm.com/company/policies/trademarks