# tinyML Talks

Enabling Ultra-low Power Machine Learning at the Edge

"Demoing the world's fastest inference engine for Arm Cortex-M"

Cedric Nugteren - Plumerai

January 4, 2022







# tinyML Talks Strategic Partners















































Additional Sponsorships available – contact Olga@tinyML.org for info

# Arm: The Software and Hardware Foundation for tinyML



Resources: developer.arm.com/solutions/machine-learning-on-arm





# WE USE AI TO MAKE OTHER AI FASTER, SMALLER AND MORE POWER EFFICIENT



**Automatically compress** SOTA models like MobileNet to <200KB with **little to no drop in accuracy** for inference on resource-limited MCUs



**Reduce** model optimization trial & error from weeks to days using Deeplite's **design space exploration** 



**Deploy more** models to your device without sacrificing performance or battery life with our **easy-to-use software** 

BECOME BETA USER bit.ly/testdeeplite



# **EDGE IMPULSE The leading edge ML platform**



www.edgeimpulse.com





# The Eye in IoT

**Edge AI Visual Sensors** 

info@emza-vs.com





- Machine Learning algorithm
- <1MB memory footprint
- Microcontrollers computing power
- Trained algorithm
- Processing of low-res images
- Human detection and other classifiers

- Machine Learning edge computing silicon
- <1mW always-on power consumption
- Computer Vision hardware accelerators

# **Enabling the next generation of Sensor and Hearable products**

### to process rich data with energy efficiency

Visible Image



Sound



IR Image



Radar



Bio-sensor



Gyro/Accel











Battery-powered consumer electronics







IoT Sensors







# Grovety Inc.

#### SOFTWARE DEVELOPMENT SERVICES FOR TINYML SOLUTIONS

Development tools

SDK, IDE, compilers, leveraging TVM, uTVM & LLVM

Firmware

Drivers, BSP, protocols, etc.



## Distributed infrastructure for TinyML apps









**Develop at warp speed** 

**Automate deployments** 

**Device orchestration** 

HOTG is building the distributed infrastructure to pave the way for AI enabled edge applications

# LatentAI

Adaptive AI for the Intelligent Edge



## **Maxim Integrated: Enabling Edge Intelligence**

#### **Advanced AI Acceleration IC**







The new MAX78000 implements AI inferences at low energy levels, enabling complex audio and video inferencing to run on small batteries. Now the edge can see and hear like never before.

**Low Power Cortex M4 Micros** 



Large (3MB flash + 1MB SRAM) and small (256KB flash + 96KB SRAM, 1.6mm x 1.6mm) Cortex M4 microcontrollers enable algorithms and neural networks to run at wearable power levels.

www.maximintegrated.com/microcontrollers

**Sensors and Signal Conditioning** 



Health sensors measure PPG and ECG signals critical to understanding vital signs. Signal chain products enable measuring even the most sensitive signals.

www.maximintegrated.com/sensors



## **Qeexo AutoML**





Automated Machine Learning Platform that builds tinyML solutions for the Edge using sensor data

#### **Key Features**

- Supports 17 ML methods:
  - Multi-class algorithms: GBM, XGBoost, Random Forest, Logistic Regression, Gaussian Naive Bayes, Decision Tree, Polynomial SVM, RBF SVM, SVM, CNN, RNN, CRNN, ANN
  - Single-class algorithms: Local Outlier Factor, One Class SVM, One Class Random Forest, Isolation Forest
- Labels, records, validates, and visualizes time-series sensor data
- On-device inference optimized for low latency, low power consumption, and small memory footprint applications
- Supports Arm® Cortex™-M0 to M4 class MCUs

#### **End-to-End Machine Learning Platform**



For more information, visit: www.qeexo.com

#### **Target Markets/Applications**

- Industrial Predictive Maintenance
- Smart Home
- Wearables
- Automotive
- Mobile
- IoT

#### Qualcomm AI Research

# Advancing AI research to make efficient AI ubiquitous

#### Power efficiency

Model design, compression, quantization, algorithms, efficient hardware, software tool

#### Personalization

Continuous learning, contextual, always-on, privacy-preserved, distributed learning

#### Efficient learning

Robust learning through minimal data, unsupervised learning, on-device learning

A platform to scale AI across the industry



#### Perception

Object detection, speech recognition, contextual fusion



Edge cloud



#### Reasoning

Scene understanding, language understanding, behavior prediction



#### Action

Reinforcement learning for decision making



Cloud







Mobile



# **Add Advanced Sensing** to your Product with Edge AI / TinyML

https://reality.ai



info@reality.ai





## Pre-built Edge AI sensing modules, plus tools to build your own

#### Reality AI solutions

Prebuilt sound recognition models for indoor and outdoor use cases

Solution for industrial anomaly detection

Pre-built automotive solution that lets cars "see with sound"

#### Reality AI Tools® software

Build prototypes, then turn them into real products

Explain ML models and relate the function to the physics

Optimize the hardware, including sensor selection and placement

### BROAD AND SCALABLE EDGE COMPUTING PORTFOLIO

#### Microcontrollers & Microprocessors

#### Arm® Core



Arm® Cortex®-M 32-bit MCUs Arm ecosystem, Advanced security, Intelligent IoT



Arm®-based High-end 32 & 64-bit MPUs High-resolution HMI, Industrial network & real-time control



Arm® Cortex®-M0+ Ultra-low Power 32-bit MCUs Innovative process tech (SOTB), Energy harvesting

Renesas Synergy™ Arm®-based 32-bit MCUs for Qualified Platform Qualified software and tools

#### Renesas Core



Ultra-low Energy 8 & 16-bit MCUs Bluetooth® Low Energy, SubGHz, LoRa®-based Solutions



High Power Efficiency 32-bit MCUs Motor control, Capacitive touch, Functional safety, GUI



40nm/28nm process Automotive 32-bit MCUs Rich functional safety and embedded security features

#### Core technologies

#### AI

A broad set of high-power and energy-efficient embedded processors

#### Security & Safety

Comprehensive technology and support that meet the industry's stringent standards



#### Digital & Analog & Power Solution

Winning Combinations that combine our complementary product portfolios

#### Cloud Native

Cross-platforms working with partners in different verticals and organizations





# **Build Smart IoT Sensor Devices From Data**

SensiML pioneered TinyML software tools that auto generate AI code for the intelligent edge.

- End-to-end AI workflow
- Multi-user auto-labeling of time-series data
- Code transparency and customization at each step in the pipeline

We enable the creation of production-grade smart sensor devices.



sensiml.com



**SynSense** builds **sensing and inference** hardware for **ultra-low-power** (sub-mW) **embedded, mobile and edge** devices. We design systems for **real-time always-on smart sensing**, for audio, vision, IMUs, bio-signals and more.

https://SynSense.ai



# SYNTIANT

Silicon

Neural Decision Processors

- At-Memory Compute
- Sustained High MAC Utilization
- Native Neural Network Processing

ML Training Pipeline

Enables Production Quality Deep Learning Deployments

End-to-End Deep Learning Solutions for TinyML & Edge AI

#### **Data Platform**

- Reduces Data Collection Time and Cost
- Increases Model Performance

partners@syntiant.com



www.syntiant.com





# Next tinyML Talks

| Date | Presenter | Topic / Title |
|------|-----------|---------------|
|      | Tim Callahan<br>Staff Software Engineer, Google | CFU Playground: Customize Your ML Processor for Your Specific TinyML Model |

Webcast start time is 8:00 am Pacific time

Please contact talks@tinyml.org if you are interested in presenting



# tinyML Summit 2022

Miniature dreams can come true...

March 28-30, 2022

Hyatt Regency San Francisco Airport

https://www.tinyml.org/event/summit-2022/

Registration will be open on **December 15**, 2021.

Deadline for poster submission is **December 17**.

The Best Product of the Year and the Best Innovation of the Year awards are open for nominations between **November 15** and **February 28**.

# tinyML Research Symposium 2022

March 28, 2022

https://www.tinyml.org/event/research-symposium-2022

Call for papers – Submission deadline is **December 17**, 2021.

More sponsorships are available: sponsorships@tinyML.org





# Reminders

Slides & Videos will be posted tomorrow





tinyml.org/forums

youtube.com/tinyml



Please use the Q&A window for your questions







# **Cedric Nugteren**



Cedric Nugteren is a software engineer focussed on writing efficient code for deep learning applications. After he received his MSc and PhD from Eindhoven University of Technology he optimized GPU and CPU code for various companies using C++, OpenCL and CUDA. Then, he worked for 4 years on deep learning for autonomous driving at TomTom, after which he joined Plumerai where he is now writing fast code for the smallest microcontrollers.



# Demoing the world's fastest inference engine for Arm Cortex-M

# You might know us from: BNNs?





- On the forward pass, weights are binarized
- On the backward pass, a Straight-Through Estimator approximates the gradient

Helwegen et al., 2019, NeurIPS - Latent Weights Do Not Exist: Rethinking Binarized Neural Network Optimization
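The two bullets above can be illustrated with a minimal sketch. The function names `binarize` and `ste_gradient` are hypothetical, and clipping the gradient outside `[-clip, clip]` is an assumed, commonly used variant of the Straight-Through Estimator rather than something stated on the slide.

```cpp
#include <cassert>
#include <cmath>

// Forward pass: map a real-valued (latent) weight to -1 or +1.
inline float binarize(float w) {
    return w >= 0.0f ? 1.0f : -1.0f;
}

// Backward pass: the sign function has zero gradient almost everywhere,
// so the Straight-Through Estimator passes the upstream gradient through
// unchanged. Zeroing it outside [-clip, clip] is an assumed, common variant.
inline float ste_gradient(float w, float upstream_grad, float clip = 1.0f) {
    assert(clip > 0.0f);
    return std::fabs(w) <= clip ? upstream_grad : 0.0f;
}
```

For example, `binarize(0.3f)` yields `+1`, while `ste_gradient(1.5f, 0.7f)` yields `0` because the latent weight lies outside the assumed clipping range.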







# You might know us from: Person detection?



Person Presence Detection

Example: Smart doorbell





# You might know us from: Our own IP core?







# Or from: the world's fastest Cortex-M inference?

0. How did we get here?
1. What is an inference engine?
2. Are we really that efficient?
3. Live demo of public benchmarking service
4. What did we do to become so efficient?

Monday, October 4, 2021

### The world's fastest deep learning inference software for Arm Cortex-M

New: Try out our inference engine with your own model!

At Plumerai we enable our customers to […] on tiny embedded hardware. We're proud to announce that our inference […] microcontrollers is the fastest and most memory-efficient in the world, for both B[…] deep learning models. Our inference software is an essential component of our solution, since it directs resource management akin to an operating system. It has 40% lower latency and requires 49% less RAM than TensorFlow Lite for Microcontrollers with Arm's CMSIS-NN kernels while retaining the same accuracy. It also outperforms any other deep learning inference software for Arm Cortex-M:

|                                                          | Inference time | RAM usage |
|----------------------------------------------------------|----------------|-----------|
| TensorFlow Lite for Microcontrollers 2.5 (with CMSIS-NN) | 129 ms         | 155 KiB   |
| Edge Impulse's EON                                       | 120 ms         | 153 KiB   |
| MIT's TinyEngine                                         | 124 ms         | 98 KiB    |
| STMicroelectronics' X-CUBE-AI                            | 103 ms         | 109 KiB   |
| Plumerai's inference software                            | 77 ms          | 80 KiB    |
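As a quick sanity check, the headline percentages can be recomputed from the TensorFlow Lite for Microcontrollers and Plumerai rows of the table above. The helper name `percent_saved` is hypothetical. With the rounded table values this gives roughly 40% for latency and roughly 48% for RAM; the quoted 49% presumably comes from unrounded measurements, which is an assumption.

```cpp
#include <cassert>

// Relative saving of `ours` versus `baseline`, in percent.
inline double percent_saved(double baseline, double ours) {
    assert(baseline > 0.0);
    return 100.0 * (1.0 - ours / baseline);
}
```

Here `percent_saved(129.0, 77.0)` covers inference time in ms and `percent_saved(155.0, 80.0)` covers RAM in KiB.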

# How did we get here?







# 1. What is an inference engine?

# The machine learning flow





#### Pick a model

Pick a new model or retrain an existing one.



#### Convert

Convert a TensorFlow model into a compressed flat buffer with the TensorFlow Lite Converter.



#### Deploy

Take the compressed .tflite file and load it into a mobile or embedded device.

Deploy INT8 quantized model on device
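To make "INT8 quantized" concrete, here is a minimal sketch of the affine scheme used by TensorFlow Lite's 8-bit quantization, where a real value is approximated as `scale * (q - zero_point)`. The struct and function names are hypothetical, and per-tensor parameters are assumed for simplicity.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

struct QuantParams {
    float scale;         // real-valued step between adjacent int8 levels
    int32_t zero_point;  // int8 value that represents real 0.0
};

// Real value -> int8, rounding to nearest and saturating to [-128, 127].
inline int8_t quantize(float real, QuantParams q) {
    assert(q.scale > 0.0f);
    const long rounded = std::lround(real / q.scale) + q.zero_point;
    return static_cast<int8_t>(std::min(127L, std::max(-128L, rounded)));
}

// int8 -> approximate real value.
inline float dequantize(int8_t value, QuantParams q) {
    return q.scale * static_cast<float>(value - q.zero_point);
}
```

With `scale = 0.5` and `zero_point = -3`, the real value `1.0` maps to the int8 value `-1`, and out-of-range inputs saturate at `127` or `-128`.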



Run optimized code



# The tasks of an inference engine



1. Execute the layers of the model in the correct order
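The first task can be sketched as a simple loop. This is an illustration only: it assumes a linear chain of layers with one input and one output each, and it uses heap-allocated `std::vector` tensors, which a real microcontroller engine would typically avoid. The names `Tensor`, `Layer` and `run_model` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

using Tensor = std::vector<int8_t>;
// A layer reads one input tensor and produces one output tensor.
using Layer = std::function<Tensor(const Tensor&)>;

// Run the layers one after another, feeding each output into the next layer.
inline Tensor run_model(const std::vector<Layer>& layers, Tensor input) {
    assert(!layers.empty());
    Tensor activation = std::move(input);
    for (const Layer& layer : layers) {
        activation = layer(activation);
    }
    return activation;
}
```

Swapping two entries in `layers` generally changes the result, which is why the execution order is part of the engine's job.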



# An inference engine example: TFLM









# 2. Are we really that efficient?

# A closer look at the results





Model: MobileNetV2 (alpha=0.30, resolution=80x80, classes=1000)

Board: STM32F746G-Discovery at 216 MHz with 320 KiB RAM and 1 MiB flash

Also tested: microTVM, but ran out of memory

**No tricks:** no binarization or pruning, accuracy remains the same in this table

### Just good on MobileNetV2?





### More off-the-shelf models










### A closer look at the MLPerf Tiny models









# 3. Live demo of public benchmarking service

# Public benchmarking service: try it yourself!





Visit https://plumerai.com/benchmark to try it with your own model



# 4. What did we do to become so efficient?

### How to beat the competition?





### 2. Optimized and model-specific INT8 code for Cortex-M

# Memory planning: a (rotated) game of Tetris 💾





Time (layer execution)
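The Tetris picture, with memory on one axis and layer execution time on the other, can be sketched as a placement problem: tensors whose lifetimes do not overlap may share the same bytes. The code below is a generic greedy-by-size heuristic given purely as an illustration; it is not Plumerai's planner, and the names `Buffer` and `plan_memory` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Buffer {
    std::size_t size;    // bytes needed by the tensor
    int first_use;       // index of the first layer that touches it
    int last_use;        // index of the last layer that touches it
    std::size_t offset;  // byte offset in the arena, filled in by the planner
};

inline bool lifetimes_overlap(const Buffer& a, const Buffer& b) {
    return a.first_use <= b.last_use && b.first_use <= a.last_use;
}

// Greedy-by-size placement; returns the required arena size in bytes.
inline std::size_t plan_memory(std::vector<Buffer>& buffers) {
    std::vector<Buffer*> order;
    for (Buffer& b : buffers) {
        assert(b.first_use <= b.last_use);
        order.push_back(&b);
    }
    // Place the largest buffers first.
    std::sort(order.begin(), order.end(),
              [](const Buffer* a, const Buffer* b) { return a->size > b->size; });

    std::vector<const Buffer*> placed;
    std::size_t arena_size = 0;
    for (Buffer* current : order) {
        // Already placed buffers that are alive at the same time, by offset.
        std::vector<const Buffer*> conflicts;
        for (const Buffer* p : placed) {
            if (lifetimes_overlap(*p, *current)) conflicts.push_back(p);
        }
        std::sort(conflicts.begin(), conflicts.end(),
                  [](const Buffer* a, const Buffer* b) { return a->offset < b->offset; });

        // Slide upward until the buffer fits in a gap or above all conflicts.
        std::size_t offset = 0;
        for (const Buffer* c : conflicts) {
            if (offset + current->size <= c->offset) break;
            offset = std::max(offset, c->offset + c->size);
        }
        current->offset = offset;
        placed.push_back(current);
        arena_size = std::max(arena_size, offset + current->size);
    }
    return arena_size;
}
```

For three buffers of 100, 60 and 80 bytes with lifetimes `[0,1]`, `[1,2]` and `[2,3]`, the first and third can share offset 0, so the arena needs 160 bytes instead of the 240 bytes a one-slot-per-tensor layout would use.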



# Memory planning for an example model







### A much better memory plan







### Even better: lower granularity planning







### Memory planning at Plumerai: summary



#### Lower RAM usage:

1. Smarter tensor placement
2. Lower granularity planning

RAM savings highly model dependent!

[Figure: bar chart of RAM usage in KB, Plumerai vs. TFLM, for visual wake words, anomaly detection, keyword spotting and image classification models]

### Optimized INT8 code for speed





Example code optimizations:

- Hand-written assembly (if needed)
- Specialization for Cortex-M4 or M7 capabilities
- Register-count aware optimizations
- Template-based loop unrolling
- Weight memory layout pre-processing
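One item from the list above, template-based loop unrolling, can be sketched as follows. The function name `scale_by_16` and the set of pre-instantiated channel counts are assumptions chosen for illustration. Making the channel count a template parameter turns the trip count into a compile-time constant, which allows (but does not oblige) the compiler to unroll the loop fully.

```cpp
#include <cassert>
#include <cstdint>

// The channel count is a template parameter, so the loop bound is known
// at compile time.
template <int NumChannels>
inline void scale_by_16(const int8_t* src, int8_t* dst) {
    static_assert(NumChannels > 0, "need at least one channel");
    for (int i = 0; i < NumChannels; ++i) {
        dst[i] = static_cast<int8_t>(src[i] * 16);
    }
}

// Runtime dispatch to pre-instantiated variants, with a generic fallback.
inline void scale_by_16(const int8_t* src, int8_t* dst, int num_channels) {
    assert(num_channels >= 0);
    switch (num_channels) {
        case 3: scale_by_16<3>(src, dst); return;
        case 4: scale_by_16<4>(src, dst); return;
        default:
            for (int i = 0; i < num_channels; ++i) {
                dst[i] = static_cast<int8_t>(src[i] * 16);
            }
    }
}
```

A model-specific build could call `scale_by_16<3>` directly and skip the dispatch entirely.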

Optimized code for special cases, e.g. 1x1 Conv2D
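A 1x1 Conv2D is a natural special case because, with stride 1, it reduces to an independent matrix-vector product per pixel, with no sliding window. The sketch below is a plain reference version, not optimized code: it assumes an NHWC layout with batch size 1 and weights stored as `[out_channels][in_channels]`, keeps the accumulators in int32, and omits requantization back to int8. The name `conv1x1` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference 1x1 convolution: one dot product per pixel and output channel.
inline std::vector<int32_t> conv1x1(const std::vector<int8_t>& input,
                                    const std::vector<int8_t>& weights,
                                    int num_pixels, int in_channels,
                                    int out_channels) {
    assert(num_pixels > 0 && in_channels > 0 && out_channels > 0);
    assert(input.size() == static_cast<std::size_t>(num_pixels) * in_channels);
    assert(weights.size() == static_cast<std::size_t>(out_channels) * in_channels);

    std::vector<int32_t> output(static_cast<std::size_t>(num_pixels) * out_channels, 0);
    for (int p = 0; p < num_pixels; ++p) {
        const int8_t* in = &input[static_cast<std::size_t>(p) * in_channels];
        for (int oc = 0; oc < out_channels; ++oc) {
            const int8_t* w = &weights[static_cast<std::size_t>(oc) * in_channels];
            int32_t acc = 0;
            for (int ic = 0; ic < in_channels; ++ic) {
                acc += in[ic] * w[ic];
            }
            output[static_cast<std::size_t>(p) * out_channels + oc] = acc;
        }
    }
    return output;
}
```

Because every pixel reads a contiguous run of `in_channels` values, a specialized kernel can treat the whole layer as one matrix multiplication.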

### Model-agnostic - vs - model-specific













```
#include <cstdint>

// Some function with unknown num-channels
void some_loop(int8_t* src, int8_t* dst,
               int num_channels) {
    for (int i = 0; i < num_channels; ++i) {
        dst[i] = src[i] * 16;
    }
}

// Same function but with 3 channels hard-coded
void some_loop(int8_t* src, int8_t* dst) {
    for (int i = 0; i < 3; ++i) {
        dst[i] = src[i] * 16;
    }
}
```

```
@ Generic code with compare and branch instructions
some_loop(signed char*, signed char*, int):
        cmp     r2, #0
        ble     .L1
        subs    r0, r0, #1
        subs    r1, r1, #1
        add     r2, r2, r0
.L3:
        ldrb    r3, [r0, #1]!   @ zero_extendqisi2
        lsls    r3, r3, #4
        cmp     r0, r2
        strb    r3, [r1, #1]!
        bne     .L3
.L1:
        bx      lr

@ Unrolled code with only load, shift and store instructions
some_loop(signed char*, signed char*):
        ldrb    r3, [r0]        @ zero_extendqisi2
        lsls    r3, r3, #4
        strb    r3, [r1]
        ldrb    r3, [r0, #1]    @ zero_extendqisi2
        lsls    r3, r3, #4
        strb    r3, [r1, #1]
        ldrb    r3, [r0, #2]    @ zero_extendqisi2
        lsls    r3, r3, #4
        strb    r3, [r1, #2]
        bx      lr
```

### Better speed at Plumerai: summary



#### Lower latency, better speed:

1. Optimized code for Cortex-M
2. Model-specific code generation

Latency savings dependent on the layers and layer configurations



You've got mail!



# 5. Conclusion

### The world's fastest Cortex-M inference





### What can Plumerai mean for you?




#### **Binarized Convolution**

Plumerai Data Pipeline

Plumerai BNN Models

Plumerai Inference Stack

Plumerai Hardware



Person detection

### Public benchmarking service: try it yourself!







Visit plumerai.com/benchmark to try it with your own model

Contact hello@plumerai.com for help or other questions





# Copyright Notice

This presentation in this publication was presented as a tinyML® Talks webcast. The content reflects the opinion of the author(s) and their respective companies. The inclusion of presentations in this publication does not constitute an endorsement by tinyML Foundation or the sponsors.

There is no copyright protection claimed by this publication. However, each presentation is the work of the authors and their respective companies and may contain copyrighted material. As such, it is strongly encouraged that any use reflect proper acknowledgement to the appropriate source. Any questions regarding the use of any materials presented should be directed to the author(s) or their companies.

tinyML is a registered trademark of the tinyML Foundation.

www.tinyML.org