SPRAD45 Application Note | Texas Instruments TI.com.cn

SPRAD45 July 2022 AM623, AM625

3.1.1 LMBench

LMBench is a suite of microbenchmarks for processor cores and operating system primitives. The memory bandwidth and latency tests are the most relevant for modern embedded processors. Results vary by less than 10% from run to run.

The LMBench benchmark bw_mem measures achieved memory copy performance. With the cp parameter it copies an array in an inline loop; with the bcopy parameter it uses the runtime glibc version of the standard memcpy() function. glibc provides a highly optimized implementation that uses, for example, SIMD instructions, which results in higher performance. A size parameter equal to or smaller than the cache size at a given level measures the bandwidth that software achieves at that cache level with a typical for loop or memcpy() type operation. A size larger than the caches, such as the 8M used in Table 3-1, measures external memory bandwidth, which is the typical use. The bandwidth is calculated so that each byte that is both read and written counts once, which makes the result roughly half of the STREAM copy result. Table 3-1 shows the measured bandwidth and the efficiency compared to the theoretical wire rate. The wire rate used is the DDR MT/s rate times the bus width in bytes, divided by two, because the read and the write that make up a copy both consume the bus. For DDR4-1600 with a 16-bit bus this is 1600 × 2 / 2 = 1600 MB/s. The benchmark can also run parallel threads with the -P parameter. To get the maximum multicore memory bandwidth, create the same number of threads as there are cores available to the operating system, which is 4 for AM62x Linux (-P 4).

Table 3-1 LMBench Results

Command	Description	Arm Cortex-A53, DDR4-1600 MT/s, 16-bit	DDR4 Efficiency
bw_mem -P 4 8M bcopy	quad core, glibc memcpy	1222MB/s	76%
bw_mem 8M bcopy	single core, glibc memcpy	887MB/s	55%
bw_mem -P 4 8M cp	quad core, inline copy loop	731MB/s	46%
bw_mem 8M cp	single core, inline copy loop	590MB/s	37%
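
The efficiency column can be reproduced from the wire-rate rule described above. The following is a minimal Python sketch, not part of LMBench; the function names wire_rate_mb_s and efficiency_percent are hypothetical, and the inputs are the measured values from Table 3-1.

```python
def wire_rate_mb_s(mt_per_s, bus_width_bits):
    """Theoretical copy rate in MB/s.

    Transfers per second times bytes per transfer, divided by two
    because a copy both reads and writes over the same bus.
    """
    return mt_per_s * (bus_width_bits // 8) / 2


def efficiency_percent(measured_mb_s, mt_per_s=1600, bus_width_bits=16):
    """Measured bandwidth as a percentage of the theoretical copy rate."""
    return 100.0 * measured_mb_s / wire_rate_mb_s(mt_per_s, bus_width_bits)


# Measured values taken from Table 3-1 (MB/s), keyed by the Description column.
measured = {
    "quad core, glibc memcpy": 1222,
    "single core, glibc memcpy": 887,
    "quad core, inline copy loop": 731,
    "single core, inline copy loop": 590,
}

for name, mb_s in measured.items():
    print(f"{name}: {mb_s} MB/s -> {efficiency_percent(mb_s):.0f}%")
```

Rounded to whole percent, the results match the DDR4 Efficiency column (76%, 55%, 46%, 37%).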

The LMBench benchmark lat_mem_rd measures the observed memory access latency for cache hits and for external memory (DDR4 on AM62x). The two arguments are the maximum size of the memory region to test in megabytes (64 in the run below) and the stride of the reads in bytes (512). These two values are selected to measure the latency to the caches and external memory, not the effect of the processor data prefetchers or other speculative execution. For many access patterns prefetching does work, but this benchmark is most useful for measuring the case when it does not. In the output, the left column is the size of the memory region accessed in megabytes and the right column is the round-trip read latency in nanoseconds. In summary, the Arm Cortex-A53 read latency to:

L1D is 2.5 ns
L2 is 12 ns
DDR4-1600 is 209 ns
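
The reason a strided read chain exposes raw latency can be illustrated with a small conceptual model. This Python sketch is an assumption-laden illustration, not LMBench's implementation: build_chain and walk are hypothetical names, the 8-byte element size is assumed, and timing this in Python would not reflect hardware latency. It only shows that each load address comes from the previous load's result, so the loads cannot overlap.

```python
def build_chain(region_bytes, stride_bytes, elem_bytes=8):
    """Build a ring of indices in which each visited slot stores the
    index of the slot one stride further on (assumed 8-byte elements)."""
    n = region_bytes // elem_bytes
    step = stride_bytes // elem_bytes
    chain = [0] * n
    slots = list(range(0, n, step))
    # Link every slot to the next one; the last slot wraps to the first.
    for cur, nxt in zip(slots, slots[1:] + slots[:1]):
        chain[cur] = nxt
    return chain


def walk(chain, loads):
    """Follow the chain for a number of dependent loads."""
    pos = 0
    for _ in range(loads):
        pos = chain[pos]  # the next address depends on this load's result
    return pos


# A 4 KB region with a 512-byte stride has 8 slots in the ring.
chain = build_chain(region_bytes=4096, stride_bytes=512)
print(walk(chain, 8))  # one full lap returns to slot 0
```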

The following is a run with DDR4:

root@am62xx-evm:~# lat_mem_rd 64 512
"stride=512
0.00049 2.503
0.00098 2.504
0.00195 2.503
0.00293 2.503
0.00391 2.503
0.00586 2.503
0.00781 2.504
0.01172 2.503
0.01562 2.503
0.02344 2.520
0.03125 2.562
0.04688 7.673
0.06250 8.980
0.09375 10.190
0.12500 10.772
0.18750 11.374
0.25000 11.675
0.37500 11.969
0.50000 12.784
0.75000 140.541
1.00000 179.407
1.50000 192.142
2.00000 197.091
3.00000 202.542
4.00000 205.342
6.00000 207.528
8.00000 208.155
12.00000 209.024
16.00000 209.193
24.00000 209.510
32.00000 209.754
48.00000 209.919
64.00000 209.947
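
The steps in this output can be grouped into cache levels with a short script. The following Python sketch is hypothetical post-processing, not an LMBench tool. The function names and the 32 KB (0.03125 MB) and 512 KB (0.5 MB) thresholds are assumptions chosen to match where the latency steps up in the run above, and SAMPLE is an excerpt of that run.

```python
SAMPLE = '''"stride=512
0.00049 2.503
0.01562 2.503
0.03125 2.562
0.04688 7.673
0.25000 11.675
0.50000 12.784
0.75000 140.541
8.00000 208.155
64.00000 209.947
'''


def parse_lat_mem_rd(text):
    """Return (size_mb, latency_ns) pairs, skipping non-data lines
    such as the stride header."""
    points = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2:
            continue
        try:
            points.append((float(fields[0]), float(fields[1])))
        except ValueError:
            continue
    return points


def summarize(points, l1d_mb=0.03125, l2_mb=0.5):
    """Report the highest latency seen within each assumed level."""
    levels = {"L1D": [], "L2": [], "DDR": []}
    for size_mb, ns in points:
        if size_mb <= l1d_mb:
            levels["L1D"].append(ns)
        elif size_mb <= l2_mb:
            levels["L2"].append(ns)
        else:
            levels["DDR"].append(ns)
    return {name: max(values) for name, values in levels.items() if values}


for level, ns in summarize(parse_lat_mem_rd(SAMPLE)).items():
    print(f"{level}: {ns:.1f} ns")
```

Taking the maximum per level reports the latency at the largest footprint that still fits in that level. A median would understate the L2 figure, because the smaller footprints in that range still hit partly in L1D.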