The advances of deep-learning-based artificial intelligence (AI) and embedded processors make camera-based analytics a crucial technology in many industrial applications, where data from multiple cameras must be processed with high performance at low power and low latency. The AM68A processor provides various ways to optimize the performance of AI applications at the network edge through heterogeneous processing cores and integrated hardware accelerators. The processor is designed for edge AI with as many as eight cameras. The Edge AI SDK and accompanying tools make developing AI applications on the AM68A simpler and faster while taking full advantage of the hardware accelerators for vision and AI processing.
Arm® and Cortex® are registered trademarks of Arm Limited.
All trademarks are the property of their respective owners.
Just as vision is a primary sense for human beings, machines also use vision to perceive and comprehend the environment around them. Camera sensors provide rich information about the surroundings, and advances in deep-learning-based AI make it possible to analyze enormous and complex visual data with high accuracy. Therefore, in applications such as machine vision, robotics, surveillance, and home and factory automation, camera-based analytics has become an increasingly powerful and important tool.
Embedded processors with AI capability, that is, edge AI processors, are accelerating this trend. These processors can turn visual data from multiple cameras into actionable insight by mimicking the eyes and brain of a human. In contrast to cloud-based AI, where deep neural network (DNN) inference runs on central computing devices, edge AI processes and analyzes visual data on systems directly connected to the sensors, for example, edge AI processors. Edge AI technology not only makes existing applications smarter but also opens up new applications that require intelligent processing of large amounts of visual data for 2D and 3D perception.
Edge AI is particularly well suited to time-sensitive applications. However, processing multiple vision sensors and executing multiple DNN inferences simultaneously at the edge requires a low-power processor, which presents challenges in size, power consumption, and heat dissipation. The sensors and processor must fit in a small form factor and operate efficiently in the harsh environments of factories, farms, and construction sites, as well as inside vehicles or cameras installed on the road. Moreover, certain equipment, such as mobile machines and robots, requires functionally safe 3D perception. The global market for such edge AI processors was valued at $2.1 billion in 2021 and is expected to reach $5.5 billion by 2028(1).
This paper focuses on the highly-integrated AM68A processor and several edge AI use cases, including AI Box, machine vision, and multi-camera AI. Optimizing edge AI systems using the heterogeneous architecture of the AM68A, together with optimized AI models and the easy-to-use software architecture, is also discussed.
The AM68A is a dual-core Arm® Cortex®-A72 microprocessor designed as a high-performance, highly-integrated device that provides significant processing power together with image, video, and graphics processing capability. Compared with the AM62A(2), which is designed for applications with one or two cameras, the AM68A enables real-time processing of four to eight 2MP cameras with improved AI performance. Figure 2-1 shows the multiple sub-systems that make up the heterogeneous architecture of the AM68A.
Deep-learning inference efficiency is crucial to the performance of an edge AI system. As the Performance and efficiency benchmarking with TDA4 Edge AI processors application note shows, MMA-based deep-learning inference is 60% more efficient than a GPU-based approach in terms of FPS or TOPS. Optimized network models for the C7xMMA are provided by the TI Model Zoo(3), a large collection of DNN models optimized for the C7xMMA across various computer vision tasks, including popular image classification, 2D and 3D object detection, semantic segmentation, and 6D pose estimation models. Table 2-1 shows the 8-bit fixed-point inference performance on the AM68A for several models from the TI Model Zoo.
Task | Model | Image Resolution | Frame Rate (fps) | Accuracy (%) |
---|---|---|---|---|
Classification | mobileNetV2-tv | 224 × 224 | 500 | 70.27(1) |
Object detection | ssdLite-mobDet-DSP-coco | 320 × 320 | 218 | 34.64(2) |
Object detection | yolox-nano-lite-mmdet-coco | 416 × 416 | 268 | 18.96(2) |
Semantic segmentation | deeplabv3lite-mobv2-cocoseq21 | 512 × 512 | 120 | 55.47(3) |
Semantic segmentation | deeplabv3lite-regnetx800mf-cocoseq21 | 512 × 512 | 58 | 60.62(3) |
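The frame rates in Table 2-1 translate directly into a per-frame inference time, which is the more convenient number when budgeting a multi-model pipeline. The sketch below is simple arithmetic over the table's values; the model names and frame rates are copied from Table 2-1, and nothing here is measured independently.

```python
# Per-frame inference latency implied by the Table 2-1 frame rates.
# Values are taken from the table above; the conversion is 1000 ms / fps.

table_2_1_fps = {
    "mobileNetV2-tv": 500,
    "ssdLite-mobDet-DSP-coco": 218,
    "yolox-nano-lite-mmdet-coco": 268,
    "deeplabv3lite-mobv2-cocoseq21": 120,
    "deeplabv3lite-regnetx800mf-cocoseq21": 58,
}

def latency_ms(fps: float) -> float:
    """Per-frame inference time in milliseconds for a given frame rate."""
    return 1000.0 / fps

for model, fps in table_2_1_fps.items():
    print(f"{model}: {latency_ms(fps):.1f} ms/frame")
```

For example, mobileNetV2-tv at 500 fps implies 2.0 ms per frame, which is the kind of figure used later when deciding how many channels a single accelerator can serve.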
The multicore heterogeneous architecture of the AM68A provides the flexibility to optimize edge AI system performance for various applications by assigning each task to a suitable programmable core or HWA. For example, computationally intense deep learning (DL) inference can run on the MMA with optimized DL models, while vision processing and video encoding and decoding can be offloaded to the VPAC3 and the hardware-accelerated video codec for best performance. Other functional blocks can be programmed on the A72 or C7x. Section 3 describes in detail how edge AI systems can be built on the AM68A for various industrial (non-automotive) use cases.
The popularity of edge AI technology is increasing in many existing and new use cases. The AM6xA scalable processor family is well suited to edge AI owing to its multicore heterogeneous architecture. This section introduces popular edge AI use cases with varying input requirements (for example, resolution and frame rate) and varying task and computation requirements. The distribution of each task among the multiple cores and HWAs of the AM68A is described to maximize performance.
AI Box is a cost-effective way of adding intelligence to existing non-analytics cameras in retail stores, on roads, and in factories and buildings. An AI Box is preferred over AI cameras because reusing legacy cameras is more cost effective than replacing them with smart cameras that have built-in AI capabilities. Such a system receives live video streams from multiple cameras, decodes them, and performs intelligent video analytics at the edge, relieving the burden of transferring large video streams to the cloud for analysis. Applications of AI Box include security surveillance systems with anomaly or event detection, and workplace safety systems that verify workers are wearing personal protective equipment (PPE) such as goggles, safety vests, and hard hats before entering a hazardous zone. In traffic management, AI Box is used for vehicle counting, vehicle type classification, and moving-direction prediction for traffic flow measurement and vehicle tracking.
Figure 3-1 shows the data flow for AI Box on the AM68A, where six channels of 2MP bitstreams arrive over Ethernet at 30 fps. The hardware-accelerated H.264 or H.265 decoder decodes the bitstreams, and the decoded frames are scaled to a smaller resolution by the MSC. DL networks, accelerated by the MMA, are applied to these smaller-resolution frames at a lower frame rate, for example, 12 fps. In DL preprocessing, the smaller-resolution frames are converted from YUV to RGB, the input format of the DL network. In DL post-processing, the outputs (for example, detections) are overlaid on the input frame. Next, the output frames from the six channels are stitched together into a single 2MP frame, and seven channels, that is, six channels plus one composite channel, are encoded by the hardware-accelerated H.264 or H.265 encoder at lower frame rates and then streamed out or saved to storage.

Table 3-1 summarizes the resource utilization and estimated power consumption with six and four channels of 2MP bitstreams, assuming that each channel needs 1 TOPS for inference. The second C7x core remains available for additional vision processing and JPEG image encoding to create snapshots. While both DL pre- and post-processing run on the A72 cores in this example, both processes can instead run on the second C7x, in which case the power estimates can be slightly higher. The AM68A can also enable an AI Box with eight channels of 2MP bitstreams; however, due to the maximum throughput of the video codec, the input and output frame rates need to be reduced to 24 fps and 4 fps, respectively.
Main IP | Utilization (6 × 2MP at 30 fps) | Utilization (4 × 2MP at 30 fps) |
---|---|---|
Decoder | 6 × 2MP at 30 fps = 360 MP/s (75%) | 4 × 2MP at 30 fps = 240 MP/s (50%) |
Encoder | 6 × 2MP at 6 fps + 1 composite × 2MP at 6 fps = 84 MP/s (18%) | 4 × 2MP at 6 fps + 1 composite × 2MP at 6 fps = 60 MP/s (12.5%) |
Decoder + Encoder | 360 MP/s + 84 MP/s = 444 MP/s (93%) | 240 MP/s + 60 MP/s = 300 MP/s (62.5%) |
GPU | 20% | 20% |
VPAC (MSC) | 6 × 2MP at 30 fps = 360 MP/s (60%) | 4 × 2MP at 30 fps = 240 MP/s (40%) |
MMA | 6 × 1 TOPS per ch = 6 TOPS (75%) | 4 × 1 TOPS per ch = 4 TOPS (50%) |
2 × A72 | DL pre- and post-processing, depacketization, and so forth (50%) | DL pre- and post-processing, depacketization, and so forth (40%) |
DDR Bandwidth | 5.19 GBps (15%) | 3.54 GBps (10%) |
Power Consumption (85°C) | 6.9 W | 6.3 W |
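The codec and MMA utilization figures in Table 3-1 follow from simple throughput arithmetic over the data flow described above. The sketch below reproduces them; note that the 480 MP/s combined codec capacity and the 8 TOPS MMA capacity are assumptions inferred from the table's percentages, not datasheet values.

```python
# Throughput budget for the AI Box data flow (Figure 3-1 / Table 3-1).
# Assumptions: 2MP streams, decode at 30 fps, encode at 6 fps (per-channel
# plus one stitched composite channel), 1 TOPS of inference per channel.
# Capacities below are inferred from the table percentages, not a datasheet.

CODEC_CAPACITY_MPS = 480.0   # assumed combined H.264/H.265 throughput
MMA_CAPACITY_TOPS = 8.0      # assumed deep-learning accelerator capacity

def ai_box_budget(channels, mp=2, dec_fps=30, enc_fps=6, tops_per_ch=1.0):
    decode = channels * mp * dec_fps        # MP/s into the decoder
    encode = (channels + 1) * mp * enc_fps  # +1 for the composite channel
    return {
        "decode_mps": decode,
        "encode_mps": encode,
        "codec_util": (decode + encode) / CODEC_CAPACITY_MPS,
        "mma_util": channels * tops_per_ch / MMA_CAPACITY_TOPS,
    }

six = ai_box_budget(6)
# decode = 360 MP/s, encode = 84 MP/s, codec ~93%, MMA 75%, as in Table 3-1
```

The same function with `channels=4` yields 240 MP/s decode, 60 MP/s encode, and 50% MMA utilization, matching the four-channel column.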
Industry 4.0 targets increased automation of production processes within the manufacturing industry, including smart factories and smart manufacturing. Industry 5.0 emphasizes human-centric collaboration between humans and AI-enabled robots, that is, collaborative robots (cobots), to optimize the manufacturing process with improved automation. Machine vision is one of the key technologies in Industry 4.0 and 5.0, and real-time processing of visual data at the edge is crucial for machine vision. The main use case of machine vision is visual quality inspection, where 2D or 3D vision-based DL is used for various purposes: verifying the presence or absence of parts or ingredients in packaging systems; detecting defects or identifying characters on printed circuit boards (PCB); gauging the dimensions of parts; verifying proper assembly of parts and the wrapping of labels around containers; detecting tool wear as preventive maintenance; and UAV- or drone-based fault detection for solar panels, turbines, and pipelines. Robot arms for pick-and-place of parts and for assembly are another machine vision use case, improving collaboration between humans and cobots.
Figure 3-2 illustrates the data flow for a machine vision use case example on the AM68A, which involves capturing an image sequence at 30 fps using an 8MP camera through a MIPI CSI-2 RX port. The captured raw Bayer image is processed and demosaiced to YUV by the VPAC3 VISS, and the VPAC3 LDC corrects any lens distortion that may be present. In this machine vision use case, the DL networks are applied to regions of interest (ROI), which are extracted on the A72 cores. The number of ROIs and their sizes vary based on the specific use case, as does the frame rate at which the DL networks are applied. The output obtained through DL preprocessing, the DL network on the MMA, and DL post-processing is displayed via the DSS. In the event of an unexpected detection, an alarm can be activated for human attention. The resource utilization and estimated power consumption of the AM68A for this machine vision use case with a single 8MP input are shown in Table 3-2. The MMA is assumed to be fully utilized, even though actual MMA utilization depends on the application. There is still enough headroom in CSI-2, VPAC, A72, and DDR bandwidth to process higher-resolution input, for example, 1 × 16MP at 30 fps, or an additional 8MP input, for example, 2 × 8MP at 30 fps. Therefore, the AM68A can enable the machine vision use case for these camera configurations as long as the MMA can handle the necessary DL inferencing, but at the cost of increased power.
Main IP | Utilization (1 × 8MP at 30 fps) |
---|---|
1 × CSI-2 RX | 1 × 8MP at 30 fps = 3.84 Gbps (38%) |
VPAC (VISS, LDC) | 1 × 8MP at 30 fps = 240 MP/s (40%) |
MMA | 8 TOPS (100%) |
2 × A72 | ROI extraction, DL pre- and post-processing, and so forth (50%) |
DDR Bandwidth | 5.13 GBps (15%) |
Power Consumption (85°C) | 6.6 W |
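The CSI-2 RX figure in Table 3-2 can be checked with the same style of throughput arithmetic: 8MP at 30 fps is 240 MP/s, and the 3.84 Gbps value implies 16 bits per pixel on the wire (for example, a RAW format carried in a 16-bit container). That bits-per-pixel value is an assumption inferred from the table, not a datasheet figure.

```python
# CSI-2 RX bandwidth for the machine vision input (Table 3-2).
# The 16 bits/pixel default is inferred from the table's 3.84 Gbps figure
# for 1 x 8MP at 30 fps; it is an assumption, not a verified sensor format.

def csi2_gbps(megapixels: float, fps: float, bits_per_pixel: int = 16) -> float:
    """Raw CSI-2 link bandwidth in Gbps for the given capture configuration."""
    return megapixels * 1e6 * fps * bits_per_pixel / 1e9

bw = csi2_gbps(8, 30)   # 3.84 Gbps, matching the Table 3-2 entry
```

Doubling the resolution to 16MP at 30 fps gives 7.68 Gbps, which is the kind of check behind the statement that headroom remains for higher-resolution input.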