Most useful machine learning algorithms demand substantial computational resources: CPU cycles and RAM. However, TensorFlow Lite recently released an experimental version that runs on a range of microcontrollers. If we can build a model that fits a device with such limited resources, we can start turning embedded systems into tiny machine learning (TinyML) devices.

TensorFlow Lite Micro (TFLM) is an open-source ML inference framework for running deep learning models on embedded systems. TFLM tackles the efficiency requirements imposed by embedded-system resource constraints, as well as the hardware fragmentation that makes cross-platform interoperability nearly impossible. The framework takes an interpreter-based approach that addresses these challenges while remaining flexible.

The first step in developing a TFLM application is to create a live neural network model object in memory. The application developer creates an "operator resolver" object through the client API. The OpResolver API controls which operators are linked into the final binary, minimising its size.
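As a minimal sketch (assuming a small two-layer fully connected network like the one in Figure 1, which needs only a handful of kernels), the resolver might be set up like this:

#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"

// Register only the operators the model actually uses, so the linker can
// strip every other kernel from the final binary. The template parameter
// is the number of registrations and must match the Add...() calls below.
static tflite::MicroMutableOpResolver<3> op_resolver;

void RegisterOps() {
    op_resolver.AddFullyConnected();
    op_resolver.AddRelu();
    op_resolver.AddSoftmax();
}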

Figure 1: A simple deep learning network with two layers

The second step is to provide an "arena" of contiguous memory that holds the intermediate results and other variables the interpreter needs. This is necessary because TFLM assumes that dynamic memory allocation is unavailable on the target.
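In practice the arena is a static, aligned byte array; the 8 KB figure below is just an assumption for a small model, and the right size is found empirically:

#include <cstdint>

// The arena holds the input, output, and intermediate tensors plus the
// interpreter's bookkeeping structures. If it is too small, allocation
// will fail at startup, so start generously and then shrink.
constexpr int kTensorArenaSize = 8 * 1024;
alignas(16) static uint8_t tensor_arena[kTensorArenaSize];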

The third step is to create an interpreter instance, passing it the model, the operator resolver, and the arena as arguments. The interpreter allocates all the memory it needs from the arena during the initialization phase; no allocations happen afterwards, so heap fragmentation cannot cause failures in long-running applications. Operator implementations may need memory for use during evaluation, so their preparation functions are called at this point, allowing them to claim portions of the arena from the interpreter. The application-supplied OpResolver maps the operator types listed in the serialized model to concrete implementation functions. All communication between the interpreter and the operators happens through C API calls, which keeps operator implementations modular and independent of the interpreter's internals. This approach makes it easy to swap an operator implementation for an optimised version, and to reuse the operator libraries in other systems (for example, as part of a code-generation project).
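A sketch of this step with the TFLM API (recent versions of the library drop the older error-reporter argument; digits_model is assumed to be the FlatBuffer byte array produced by the converter, shown later):

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/schema/schema_generated.h"

// Map the serialized FlatBuffer into a live model object (no copy is made).
const tflite::Model* model = tflite::GetModel(digits_model);

// Hand the model, the resolver, and the arena to the interpreter.
static tflite::MicroInterpreter interpreter(model, op_resolver,
                                            tensor_arena, kTensorArenaSize);

void SetupInterpreter() {
    // Carve all tensor memory out of the arena up front; operator Prepare()
    // functions run now and may claim scratch buffers from the interpreter.
    if (interpreter.AllocateTensors() != kTfLiteOk) {
        // the arena is too small for this model
    }
}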

The fourth step is execution. The application retrieves pointers to memory areas that represent model inputs and fills them with values (often derived from sensors or other user-supplied data). Once the inputs are available, the application calls the interpreter to perform the model calculations. This process involves iterating over topologically ordered operations, using the offsets computed during memory planning to locate inputs and outputs, and calling the evaluator function for each operation.
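Continuing the sketch above, filling the input and invoking the interpreter might look like this (the 64-element input matches the 8x8 digit images used later):

void RunInference(const float* x_test) {
    // Fetch a pointer into the arena where the model expects its input.
    TfLiteTensor* input = interpreter.input(0);

    // Copy the sample (e.g. sensor readings) into the input tensor.
    for (int i = 0; i < 64; i++) {
        input->data.f[i] = x_test[i];
    }

    // Walk the topologically ordered operations, running each Eval function.
    if (interpreter.Invoke() != kTfLiteOk) {
        // inference failed; report and skip this sample
        return;
    }
}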

Finally, after all operations have been evaluated, the interpreter returns control to the application. Most MCUs are single-threaded and use interrupts for urgent tasks, so it is acceptable for the interpreter to run on a single thread; platform-specific operator implementations can still split their work across processors. When the call completes, the application can query the interpreter for the location of the array holding the output of the model calculation, and then make use of that output.
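Querying the output is then a simple pointer lookup; a small helper (assuming the ten-class digit model used below) could pick the predicted class like this:

int PredictedClass() {
    // After Invoke() returns, output(0) points at the result tensor,
    // which also lives inside the arena.
    TfLiteTensor* output = interpreter.output(0);
    int best = 0;
    for (int i = 1; i < 10; i++) {
        if (output->data.f[i] > output->data.f[best]) {
            best = i;
        }
    }
    return best;  // index of the highest-scoring digit class
}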

Figure 2: Implementation-module overview

The model created with Keras or TensorFlow needs to be converted to TensorFlow Lite format and exported before it can be deployed on a microcontroller. We use the TensorFlow Lite Converter’s Python API to do this. It takes our Keras model and writes it to disk as a FlatBuffer, a special file format designed to be space-efficient. Because we’re deploying to devices with limited memory, this comes in handy.
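Once converted, the FlatBuffer is usually embedded straight into the sketch as a C byte array; tools such as tinymlgen or xxd -i emit this shape (the bytes below are placeholders, not real model data):

// digits_model.h -- generated from the converted .tflite file
const unsigned char digits_model[] = {
    0x1c, 0x00, 0x00, 0x00, /* ... thousands more bytes ... */
};
const unsigned int digits_model_len = sizeof(digits_model);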

To deploy the model to an STM32 microcontroller or an Arduino, we will use the EloquentTinyML library, which runs TinyML models on your microcontroller without the pain of complex compilation procedures and esoteric errors.

First, install the latest version of the library (0.0.5, or 0.0.4 if it is not available), either via the Arduino Library Manager or directly from GitHub.

Below is the code to deploy and run the digit-recognition TinyML model on STM32 and Arduino microcontrollers.

#include <EloquentTinyML.h>

// copy the printed code from tinymlgen into this file
#include "digits_model.h"

#define NUMBER_OF_INPUTS 64
#define NUMBER_OF_OUTPUTS 10
#define TENSOR_ARENA_SIZE 8*1024

Eloquent::TinyML::TfLite<NUMBER_OF_INPUTS, NUMBER_OF_OUTPUTS, TENSOR_ARENA_SIZE> ml;

void setup() {
    Serial.begin(115200);
    ml.begin(digits_model);
}

void loop() {
    // a random sample from the MNIST dataset (precisely the last one)
    float x_test[64] = { 0.    , 0.    , 0.625 , 0.875 , 0.5   , 0.0625, 0.    , 0.    ,
                         0.    , 0.125 , 1.    , 0.875 , 0.375 , 0.0625, 0.    , 0.    ,
                         0.    , 0.    , 0.9375, 0.9375, 0.5   , 0.9375, 0.    , 0.    ,
                         0.    , 0.    , 0.3125, 1.    , 1.    , 0.625 , 0.    , 0.    ,
                         0.    , 0.    , 0.75  , 0.9375, 0.9375, 0.75  , 0.    , 0.    ,
                         0.    , 0.25  , 1.    , 0.375 , 0.25  , 1.    , 0.375 , 0.    ,
                         0.    , 0.5   , 1.    , 0.625 , 0.5   , 1.    , 0.5   , 0.    ,
                         0.    , 0.0625, 0.5   , 0.75  , 0.875 , 0.75  , 0.0625, 0.     };
    // the output vector for the model predictions
    float y_pred[10] = {0};
    // the actual class of the sample
    int y_test = 8;

    // let's see how long it takes to classify the sample
    uint32_t start = micros();

    ml.predict(x_test, y_pred);

    uint32_t timeit = micros() - start;

    Serial.print("It took ");
    Serial.print(timeit);
    Serial.println(" micros to run inference");

    // let's print the raw predictions for all the classes
    // these values are not directly interpretable as probabilities!
    Serial.print("Test output is: ");
    Serial.println(y_test);
    Serial.print("Predicted proba are: ");

    for (int i = 0; i < 10; i++) {
        Serial.print(y_pred[i]);
        Serial.print(i == 9 ? '\n' : ',');
    }

    // let's print the "most probable" class:
    // either use probaToClass() if you also want to inspect all the probabilities
    Serial.print("Predicted class is: ");
    Serial.println(ml.probaToClass(y_pred));
    // or skip the predict() method and call predictClass() directly
    Serial.print("Sanity check: ");
    Serial.println(ml.predictClass(x_test));

    delay(1000);
}
