The modern challenges of image and facial recognition – and how AI is facilitating tiny, competitive edge-based solutions

While image recognition can be a powerful tool for improving security and productivity, system designers are constantly challenged to deliver faster, more nuanced classification from smaller, lower-power devices. The trend is to respond with AI-based recognition algorithms running on tiny microcontrollers located at the IoT edge.

This article looks at how image processing is becoming more sophisticated, the enabling technologies available and some practical implementation possibilities based on various semiconductor manufacturers’ hardware and ecosystems.

Image recognition, and its major subset – facial recognition – have been widely used in industrial and security applications for many years. However, although users adopted cameras for image recognition as soon as the technology allowed, the results were often inadequate. Attempts to classify images can be beset by problems such as variation in scale or perspective, background clutter, or poor illumination.

Therefore, there is always pressure to improve the performance of these systems, so that they can provide more nuanced recognition and classification capabilities, while delivering more robust and accurate results. And, as better technology becomes available, it creates further opportunities for improving productivity or security.

One excellent example is 3D facial recognition. 2D systems were once sufficient in applications like access control, until people learned to fool them with spoofing techniques such as photographs of faces, so 3D recognition became necessary. It also solves problems like recognizing people after they have grown a beard, or when they are wearing glasses or a COVID mask.

Sophisticated image recognition technology is making a difference in areas beyond security. In industry, it can be used to improve product quality in terms of shape, size and coloring, while in automotive applications it is applied to roadside and lane detection, and to detecting animals, people or objects in live lanes. It can also map human presence, for example on public transport.

More powerful hardware plus increasingly sophisticated AI software is also enabling image recognition systems with mood detection capabilities. For example, automotive vendors can use facial emotion detection technology in smart cars to alert the driver when they are feeling drowsy.

However, systems builders seeking to deliver more powerful, low-latency solutions must do so while consuming less energy, space, and cost. They must remain technically competitive while going green.

Increasingly, the response is to move systems that once ran on big servers in the Cloud out to the edge. This means that AI algorithms are now running on tiny microcontrollers, which must map incoming images very quickly and with great accuracy. Although it’s not so important in industry, where robots have more space and power available, in other applications this technology can put powerful image recognition solutions onto users’ phones and wristwatches.

Running facial recognition systems locally at the edge, without sending data to the cloud, also addresses concerns about privacy.

Figure 1: Screenshots from a facial recognition application using the Analog Devices MAX78000 microcontroller

Technology concepts and practical approaches to building edge image recognition systems

From a systems developer’s perspective, an AI image recognition system, like any other electronic product, comprises a number of hardware and software building blocks that must be integrated into a basic platform, which can then be developed further into an application-specific solution. These include:

Camera or other input device: Cameras come in different technologies; the choice of camera technology will fundamentally affect the entire system design.

Output devices: These could include a security gate, which allows a facial recognition system to control access to a secure area; there could also be a display providing the results of AI analysis. Additionally, there will be a network connection if the image recognition system is part of a larger infrastructure.

Microcomputing hardware: This may comprise just a core processor, but it will more likely also have an AI engine accelerator to improve performance.

AI algorithm: Many image recognition applications could use the same hardware, but different AI algorithms can be run to fulfil different applications.

To integrate these components into an application-specific image recognition system, we need to

  • Choose a technology such as 3D facial recognition or 3D Time of Flight for collecting high quality image data.
  • Choose an AI algorithm such as Convolutional Neural Networks (ConvNet/CNN) to extract meaningful and actionable information from the raw image data.
  • Find a semiconductor manufacturer that offers the hardware and development environment best suited to the image collection and processing approach you are seeking to adopt.

Collecting high quality image data

3D facial recognition and 3D Time of Flight are popular approaches:

3D facial recognition

The 3D facial recognition method involves using sensors to capture the shape of the face with more precision. Unlike traditional facial recognition methods, the accuracy of 3D facial recognition is not affected by lighting, and scans can even be done in the dark. Another advantage of 3D facial recognition is that it can recognize a target from multiple angles, rather than just a straight-on view. Unlike 2D facial recognition, it cannot be fooled by photographs used by people seeking unauthorised entry into a secure area.

The iPhone X (and later versions) come with Face ID technology, which relies on 3D facial recognition to identify its owner.

The 3D facial recognition process has six main steps: Detection, Alignment, Measurement, Representation, Matching, and Verification or Identification

3D time of flight

3D time of flight (ToF) is a type of scanner-less LIDAR (light detection and ranging) that uses high-power optical pulses, with durations of just nanoseconds, to capture depth information (typically over short distances) from a scene of interest.

A ToF camera measures distance by actively illuminating an object with a modulated light source, such as a laser, and capturing the reflected light with a sensor sensitive to the laser’s wavelength. The sensor measures the time delay ∆T between when the light is emitted and when the reflected light is received by the camera. Because the light travels out to the object and back, the delay corresponds to twice the camera-to-object distance, so the distance can be estimated as depth = c∆T/2, where c is the speed of light.
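
The Python sketch below simply restates the depth = c∆T/2 relation numerically; the 6.67 ns delay used in the example is purely illustrative.

```python
# Pulsed time-of-flight: depth = c * delta_t / 2 (the round trip is halved).
SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def tof_depth(delta_t_s: float) -> float:
    """Estimate the camera-to-object distance from the measured round-trip delay."""
    return SPEED_OF_LIGHT * delta_t_s / 2.0

# An illustrative round-trip delay of ~6.67 ns corresponds to an object about 1 m away.
print(f"{tof_depth(6.67e-9):.3f} m")  # ~1.000 m
```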

There are different methods for measuring ∆T, of which two have become the most prevalent: the continuous-wave (CW) method and the pulse-based method. It should be noted that the vast majority of CW ToF systems on the market use CMOS sensors, while pulsed ToF systems use non-CMOS sensors (notably CCDs).
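
In the continuous-wave case, distance is usually recovered from the phase shift between the emitted and received modulated signal rather than from a directly timed pulse. The sketch below uses the standard CW relation depth = c·∆φ/(4π·f_mod); the formula and the 20 MHz modulation frequency are textbook values rather than figures taken from this article.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # metres per second

def cw_tof_depth(phase_shift_rad: float, mod_freq_hz: float) -> float:
    """Continuous-wave ToF: depth = c * phase_shift / (4 * pi * f_mod)."""
    return SPEED_OF_LIGHT * phase_shift_rad / (4.0 * math.pi * mod_freq_hz)

# A phase shift of pi/2 at a 20 MHz modulation frequency corresponds to ~1.87 m;
# the unambiguous range at this frequency is c / (2 * f_mod), roughly 7.5 m.
print(f"{cw_tof_depth(math.pi / 2, 20e6):.2f} m")
```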

Figure 2: Simple diagram of time of flight measurement

Extracting meaningful and actionable information from the raw image data

After using either of the above technologies to capture image data, we need an AI algorithm to run on the chosen hardware to analyse the data and provide meaningful and actionable results.

One approach is to use Convolutional Neural Networks (ConvNet/CNN): deep learning algorithms that take in an input image, assign importance (learnable weights and biases) to various aspects or objects in the image, and differentiate one from another.

The pre-processing required by a CNN is much lower than for other classification algorithms. Whereas in primitive methods filters are hand-engineered, with enough training CNNs can learn these filters and characteristics themselves.

The architecture of a CNN is analogous to that of the connectivity pattern of neurons in the human brain and was inspired by the organization of the visual cortex.

A CNN is able to capture the spatial and temporal dependencies in an image through the application of relevant filters. The architecture fits the image dataset better because of the reduced number of parameters involved and the reusability of weights. In other words, the network can be trained to capture the complexity of the image more effectively.
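
To make the idea concrete, the following PyTorch sketch defines a deliberately small CNN classifier: two convolution/pooling stages learn the filters discussed above, and a fully connected layer maps the extracted features to class scores. It is a generic illustration, not the network used by any of the vendors described later.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """A deliberately small CNN classifier: stacked convolutions learn spatial
    filters, pooling reduces resolution, and a fully connected layer maps the
    extracted features to class scores."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # learnable 3x3 filters
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
        )
        self.classifier = nn.Linear(32 * 16 * 16, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

# One 64x64 RGB image in, a vector of class scores out.
scores = TinyCNN()(torch.randn(1, 3, 64, 64))
print(scores.shape)  # torch.Size([1, 10])
```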

However, other deep learning algorithms are also evolving rapidly, with lower-precision data types such as INT8, binary, ternary, and custom formats coming into use.

Semiconductor manufacturers’ hardware and ecosystems

Whichever AI algorithm is chosen, to be effective it must run on suitable hardware, capable of providing the necessary processing power without making excessive demands on electrical power, space, weight, or cost.

When it comes to practical hardware implementations, each semiconductor manufacturer tends to offer its own ecosystem, based on the underlying hardware it has developed, together with suitable software and development tools. In deciding which semiconductor manufacturer to work with, developers must be aware that they are committing to the manufacturer’s development ecosystem as well as its image processing hardware.

Below, we look at image recognition solutions from three leaders in the field of AI hardware – Analog Devices, Xilinx, and NXP Semiconductors.

Analog Devices’ solution is based on its MAX78000 family, including the MAX78002, a microcontroller with an ultra-low-power convolutional neural network (CNN) inference engine. The MAX78002’s advanced system-on-chip architecture features an Arm® Cortex®-M4 CPU with FPU and an ultra-low-power deep neural network accelerator. (See textbox: ‘The role of neural network accelerators’.)

The integrated RISC-V core can execute application and control code as well as drive the CNN accelerator.

The role of neural network accelerators

Deep learning is currently one of the most prominent machine learning approaches for solving complex tasks that could previously only be solved by humans. In applications such as computer vision or speech recognition, deep neural networks (DNNs) achieve high accuracy compared to non-learning algorithms and in some cases even higher than human experts. The greater accuracy of DNNs compared to non-learning algorithms comes from the ability to extract high-level features from the input data after using statistical learning over a large number of training data.

Statistical learning leads to an efficient representation of the input space and good generalization. However, this capability requires high computational effort, and because accuracy can be increased further by adding parameters, the clear trend is for DNN size to grow exponentially. This leads to exponentially increasing computational effort and memory requirements.

Therefore, central processing units (CPUs) alone are inadequate to handle the computational load. Accordingly, structurally optimized hardware accelerators are used to increase the inference performance of neural networks. For inference of a neural network running on edge devices, energy efficiency is an important factor that has to be considered in addition to throughput.

As a follow-on product to the MAX78000, the MAX78002 has additional computing power and memory, and is part of the new generation of artificial intelligence (AI) microcontrollers built to execute neural networks at ultra-low power, living at the edge of the internet of things (IoT).

This product combines the most energy-efficient AI processing with Analog Devices’ proven ultra-low-power microcontrollers. The hardware-based convolutional neural network (CNN) accelerator enables battery-powered applications to execute AI inferences while spending only microjoules of energy.

Figure 3: Architecture of the Analog Devices MAX78002 microcontroller

Developers can get started with the microcontroller using the MAX78002 evaluation kit (EV kit), which provides a platform for leveraging the device’s capabilities to build new generations of AI products. The kit features onboard hardware such as a digital microphone, serial ports, support for digital video port (DVP) and camera serial interface (CSI) camera modules, and a 3.5-inch touch-enabled colour thin-film transistor (TFT) display.

The kit also includes the circuitry to monitor and display the power level on the secondary TFT display. The MAX34417 monitors the voltage and current of the MAX78002 and reports the accumulated power to the MAX32625, which is used as the power data processor that also controls the power display.

Developing a face identification model: Designers can build face identification models using Analog Devices’ development flow on PyTorch, train them with different open datasets, and deploy them on the MAX78000 evaluation board. Fig. 4 shows the development flow.

Figure 4: Development flow on the MAX78000

The development process solves the face identification problem in three main steps (sketched in code after the list):

  • Face extraction: Detection of the faces in the image to extract a rectangular subimage that contains only one face.
  • Face alignment: Determination of the rotation angles (in 3D) of the face in the subimage, so that their effect can be compensated for by an affine transformation.
  • Face identification: Identification of the person using the extracted and aligned subimage.
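
A minimal Python sketch of that three-stage structure is shown below. The functions are hypothetical placeholders standing in for the detection, alignment and identification networks that would actually run on the MAX78000’s CNN accelerator; they are not part of Analog Devices’ SDK.

```python
import numpy as np

# Hypothetical placeholder stages; the real detection, alignment and
# identification networks run on the MAX78000's CNN accelerator.

def extract_faces(image: np.ndarray) -> list[np.ndarray]:
    """Stage 1: detect faces and return one rectangular sub-image per face.
    Placeholder: treats the whole frame as a single face crop."""
    return [image]

def align_face(face: np.ndarray) -> np.ndarray:
    """Stage 2: estimate the face's 3D rotation and compensate with an affine
    transform. Placeholder: returns the crop unchanged."""
    return face

def identify_face(aligned: np.ndarray) -> str:
    """Stage 3: match an embedding of the aligned face against known identities.
    Placeholder: always reports 'unknown'."""
    return "unknown"

def face_id_pipeline(image: np.ndarray) -> list[str]:
    return [identify_face(align_face(face)) for face in extract_faces(image)]

print(face_id_pipeline(np.zeros((160, 120, 3), dtype=np.uint8)))  # ['unknown']
```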

Xilinx uses a different hardware approach, based on its Kria K26 SOM (System on Module). The SOM is built to let developers work in their preferred design environment and deploy smart vision applications faster, with an out-of-the-box-ready, low-cost development kit to get started.

The K26 SOM is well suited to edge applications, as its underlying Zynq MPSoC architecture provides high performance per watt and a low cost of ownership. Kria SOMs are hardware-configurable, making them scalable and future-proof.

The device’s design offers further performance advantages:

Raw computing power: The K26 can be configured with various deep learning processing unit (DPU) configurations, and the most applicable configuration can be integrated into the design based on performance requirements. As an example, the DPU B3136 at 300 MHz has a peak performance of 0.94 TOPS.
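
That headline figure can be reproduced with simple arithmetic, assuming the “B3136” designation denotes roughly 3,136 operations per clock cycle (an assumption about Xilinx’s DPU naming convention, not a figure stated in this article):

```python
# Peak throughput = operations per clock cycle * clock frequency.
ops_per_cycle = 3136   # assumed meaning of the "B3136" DPU designation
clock_hz = 300e6       # 300 MHz, as quoted above

peak_tops = ops_per_cycle * clock_hz / 1e12
print(f"{peak_tops:.2f} TOPS")  # ~0.94 TOPS
```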

Lower precision data type support: As deep learning algorithms evolve rapidly, lower-precision data types such as INT8, binary, ternary, and custom formats are coming into use. It is difficult for GPU vendors to meet these needs because they must modify their architecture to accommodate custom or lower-precision data types. The Kria K26 SOM supports a full range of data type precisions, such as FP32, INT8, binary, and other custom data types, and operations on lower-precision data types have been shown to consume much less power.
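
As a generic illustration of what INT8 support involves, the sketch below applies simple affine (scale and zero-point) quantization to a block of floating-point weights. It illustrates the principle only; the Kria toolchain and frameworks such as PyTorch provide their own quantization flows.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Affine (scale + zero-point) quantization of float weights to INT8."""
    w_min, w_max = weights.min(), weights.max()
    scale = (w_max - w_min) / 255.0              # map the float range onto 256 levels
    zero_point = np.round(-128 - w_min / scale)  # integer code that represents w_min at -128
    q = np.clip(np.round(weights / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q: np.ndarray, scale: float, zero_point: float) -> np.ndarray:
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, s, zp = quantize_int8(w)
print(np.abs(w - dequantize(q, s, zp)).max())  # small quantization error (at most scale/2)
```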

Low latency and power: The Zynq MPSoC architecture’s reconfigurability allows developers to design their application with reduced or no external memory accesses, which not only helps to reduce the overall power consumption of the application, but also increases responsiveness with lower end-to-end latencies.

Flexibility: Unlike GPUs, where the data flow is fixed, Xilinx hardware offers the flexibility to reconfigure the datapath to achieve maximum throughput and lower latencies. The programmable datapath also reduces the need for batching, a major drawback of GPUs that forces a trade-off between lower latency and higher throughput.

For evaluation and development, Xilinx offers their KV260 starter kit that includes a Kria K26 SOM mated to a vision-centric carrier card. The combination of this pre-defined vision hardware platform, and a robust and comprehensive software stack built on Yocto or Ubuntu, together with pre-built vision-enabled accelerated applications, provides an unprecedented path for developers to leverage Xilinx technologies to build systems.

After development is completed, customization for production deployment is simple: the Kria SOM is mated with a simple end-user-designed carrier card that incorporates the connectivity and additional components specific to the target system.

Figure 5: Xilinx KV260 Vision AI starter kit

Application example: Xilinx has partnered with Uncanny Vision, an industry leader in video analytics solutions for smart cities, with the goal of bringing a world-class automatic number plate (license plate) recognition (ANPR) solution to market. The application is being widely adopted in cities around the world as part of smart city build-outs.

The ANPR application is an AI-based pipeline comprising video decode, image preprocessing, machine learning (detection), and optical character recognition (OCR). Fig. 6 shows the application’s building blocks.

Figure 6: ANPR application building blocks
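
A highly simplified Python sketch of that pipeline is shown below. Each stage is a hypothetical placeholder mirroring the building blocks named above; it is not Uncanny Vision’s implementation.

```python
import numpy as np

# Hypothetical stage functions mirroring the ANPR building blocks above.

def decode_frame(raw: bytes) -> np.ndarray:
    """Video decode: turn a compressed frame into a pixel array (placeholder)."""
    return np.zeros((720, 1280, 3), dtype=np.uint8)

def preprocess(frame: np.ndarray) -> np.ndarray:
    """Image preprocessing: normalize pixels for the detection network (placeholder)."""
    return frame.astype(np.float32) / 255.0

def detect_plates(frame: np.ndarray) -> list[tuple[int, int, int, int]]:
    """ML detection: return bounding boxes of candidate plates (placeholder)."""
    return [(100, 200, 160, 40)]

def read_plate(frame: np.ndarray, box: tuple[int, int, int, int]) -> str:
    """OCR: recognize the characters inside a plate crop (placeholder)."""
    x, y, w, h = box
    _crop = frame[y:y + h, x:x + w]
    return "ABC123"

def anpr(raw: bytes) -> list[str]:
    frame = preprocess(decode_frame(raw))
    return [read_plate(frame, box) for box in detect_plates(frame)]

print(anpr(b""))  # ['ABC123']
```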

NXP Semiconductors has expanded its NXP EdgeReady portfolio with a solution for secure face recognition that combines a high-performance 3D structured light module (SLM) camera with the i.MX RT117F crossover MCU. This is the first solution to pair a 3D SLM camera with an MCU, delivering the performance and security of 3D face recognition at the edge and removing the need for the expensive, power-hungry Linux implementation on an MPU that high-performance 3D cameras traditionally require.

The newest EdgeReady solution enables developers of smart locks and other access control systems to add machine learning-based secure face recognition quickly and easily to smart home and smart building products. The solution delivers reliable 3D face recognition indoors and outdoors, across varied lighting conditions, including bright sunlight, dim night light and other conditions that are challenging for traditional face recognition systems.

The use of a 3D SLM camera enables advanced liveness detection, helping distinguish a real person from spoofing techniques, such as a photograph, imitator mask or a 3D model, to prevent unauthorized access.

The i.MX RT117F runs an advanced machine learning model, part of NXP’s eIQ machine learning software, on its high-performance CPU core, enabling faster and more accurate face recognition to improve both the user experience and power efficiency.

Similar to the i.MX RT106F MCU-based NXP EdgeReady solution for secure face recognition, advanced liveness detection and face recognition are all done locally at the edge, making it possible for personal biometric data to remain on the device. This helps address consumer privacy concerns, while also eliminating the latency associated with cloud-based solutions.

Conclusion

This article has discussed the technologies available for developing improved image recognition systems, and presented examples of different semiconductor manufacturers’ hardware platforms and development ecosystems for implementing them.

From this, it becomes apparent that each manufacturer’s approach is very different, in terms of hardware implementations and components already available. Other manufacturers, beyond the scope of this article, are also offering their own solutions.

It therefore makes sense to consult with a supplier like Newark, which has access to a wide range of manufacturers and solutions. We have experts who can discuss the factors to consider when choosing the right hardware architecture and development environment, and then when moving into production.
