SPRAC21A June 2016 – June 2019 OMAP-L132, OMAP-L138, TDA2E, TDA2EG-17, TDA2HF, TDA2HG, TDA2HV, TDA2LF, TDA2P-ABZ, TDA2P-ACD, TDA2SA, TDA2SG, TDA2SX, TDA3LA, TDA3LX, TDA3MA, TDA3MD, TDA3MV

 

TDA2xx and TDA2ex Performance

QSPI XIP Code Execution Performance

In order to understand the impact of executing code from QSPI flash in XIP mode, the following test setup was used on the TDA2xx device:

  • QSPI configured to Mode 0 and 64 MHz operating frequency
  • Vision SDK (version 2.9) IPU application modified to run out of QSPI. The TI Vision SDK is a multi-processor software development platform for TI’s family of ADAS SoCs. Its software framework allows users to create different ADAS application data flows involving video capture, video pre-processing, video analytics algorithms, and video display. For more information, see the TI Vision SDK, Optimized Vision Libraries for ADAS Systems (SPRY260).
  • Cortex-M4 Unicache enabled
  • Cortex-M4 operating at 212 MHz

Table 63 compares the M4_0 CPU load of the Cortex-M4-based Capture Display Vision SDK use case when code executes from QSPI in XIP mode versus from DDR. The Cortex-M4 frequency is 212 MHz. The frame rate was found to match between DDR and QSPI XIP code execution (30 FPS).

Table 63. M4_0 CPU Execution Time in QSPI XIP Mode

| Scenario | M4_0 CPU Task Load (%) (load can vary by 2 % in different runs) |
|---|---|
| M4_0 code execution from DDR3 532 MHz | Approximately 6.2 % |
| QSPI XIP (64 MHz clock frequency, Mode 0) | Approximately 10.91 % |
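As a rough worked example, the XIP overhead implied by Table 63 can be expressed both as an absolute increase and as a ratio. The sketch below uses only the two approximate values from the table; the function name `xip_overhead` is hypothetical, and because the measured load can vary by about 2 % between runs, the results should be read as indicative only.

```python
# Values copied from Table 63 (approximate; the load can vary by about 2 % between runs).
DDR3_LOAD_PCT = 6.2
QSPI_XIP_LOAD_PCT = 10.91


def xip_overhead(ddr_pct: float, xip_pct: float) -> tuple[float, float]:
    """Return (increase in percentage points, XIP load as a multiple of the DDR load)."""
    return xip_pct - ddr_pct, xip_pct / ddr_pct


delta_points, ratio = xip_overhead(DDR3_LOAD_PCT, QSPI_XIP_LOAD_PCT)
print(f"QSPI XIP adds {delta_points:.2f} percentage points of M4_0 load")
print(f"QSPI XIP load is {ratio:.2f}x the DDR3 load")
```

In other words, the M4_0 task load for this use case is roughly 1.76 times higher in XIP mode, while remaining a small fraction of the available CPU time.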

To understand the impact of QSPI code execution on a fully loaded M4 CPU, the networking use case was also run alongside the capture display use case. To run the networking threads on the M4_0 core, the Network Development Kit (NDK) (http://www.ti.com/lit/ug/spru524j/spru524j.pdf) Windows application was run as shown below:

ndk_2_24_02_31\packages\ti\ndk\winapps>send <IPAddress> 2000

This tool prints the number of megabytes of data that it sent and that were serviced by the TDA2xx device (M4_0) running the network stack. The M4_0 is 100 % loaded in the following experiments. Table 64 compares the network throughput achieved under different device conditions.

Table 64. M4_0 CPU Networking Bandwidth Performance

Networking Bandwidth Achieved (all numbers in megabytes per second)

| | QSPI4 (64 MHz) | DDR3 532 MHz |
|---|---|---|
| M4 (212.8 MHz) | 3.05 | 5.26 |
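For a fully loaded M4_0, the cost of XIP shows up as lost throughput rather than extra CPU load. A minimal sketch of that comparison, using only the two values from Table 64 (the variable names are illustrative):

```python
# Values copied from Table 64, in megabytes per second.
QSPI_XIP_MBPS = 3.05
DDR3_MBPS = 5.26

# Fraction of the DDR3 networking throughput that is kept when running from QSPI XIP.
retained = QSPI_XIP_MBPS / DDR3_MBPS
reduction_pct = (1.0 - retained) * 100.0

print(f"QSPI XIP retains {retained * 100:.1f} % of the DDR3 networking throughput")
print(f"Reduction: {reduction_pct:.1f} %")
```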

When there is a concurrent EDMA transfer from QSPI to DDR (for example, an application image copy from QSPI) while the M4_0 is executing code out of QSPI, there is a significant impact on the M4_0 code execution time. Table 65 shows the impact on M4 code execution for varying values of the EDMA ACNT parameter for A_SYNC and AB_SYNC transfers.

Table 65. M4_0 CPU Networking Bandwidth Performance for Different EDMA ACNT Values

Networking Bandwidth Achieved With Concurrent EDMA Traffic (M4 @ 212.8 MHz, QSPI @ 64 MHz)

| | A_SYNC (Bandwidth in MBps) | AB_SYNC With BCNT = 512 (Bandwidth in MBps) |
|---|---|---|
| ACNT = 65535 | 0.0118 | 0.0046 |
| ACNT = 16384 | 0.0379 | 0.0046 |
| ACNT = 4096 | 0.233 | 0.0050 |
| ACNT = 512 | 1.638 | 0.0088 |
| ACNT = 256 | 2.048 | 0.0121 |
| ACNT = 128 | 2.559 | 0.0122 |
| ACNT = 64 | 2.730 | 0.0186 |
| Without EDMA | 2.935 | 2.935 |
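One way to read Table 65 is to relate each row to the amount of data the EDMA moves per synchronization event. In the EDMA3 programming model, an A-synchronized transfer moves ACNT bytes per event, while an AB-synchronized transfer moves ACNT × BCNT bytes per event. The sketch below tabulates that size next to the share of the "Without EDMA" networking bandwidth that is retained. The helper names are hypothetical, and the link between bytes per event and M4 starvation is an interpretation of the measurements, not a statement from the test report.

```python
# Measured networking bandwidth from Table 65, in MBps, keyed by ACNT.
BASELINE_MBPS = 2.935  # "Without EDMA" row
BCNT = 512             # BCNT used for the AB_SYNC column
TABLE_65 = {
    # ACNT: (A_SYNC MBps, AB_SYNC MBps)
    65535: (0.0118, 0.0046),
    16384: (0.0379, 0.0046),
    4096: (0.233, 0.0050),
    512: (1.638, 0.0088),
    256: (2.048, 0.0121),
    128: (2.559, 0.0122),
    64: (2.730, 0.0186),
}


def bytes_per_sync_event(acnt: int, bcnt: int, ab_sync: bool) -> int:
    """Bytes moved per EDMA sync event: ACNT for A_SYNC, ACNT * BCNT for AB_SYNC."""
    return acnt * bcnt if ab_sync else acnt


def retained_pct(mbps: float) -> float:
    """Networking bandwidth as a percentage of the no-EDMA baseline."""
    return 100.0 * mbps / BASELINE_MBPS


for acnt, (a_mbps, ab_mbps) in TABLE_65.items():
    print(
        f"ACNT={acnt:>5}: "
        f"A_SYNC {bytes_per_sync_event(acnt, BCNT, False):>8} B/event -> {retained_pct(a_mbps):6.2f} %, "
        f"AB_SYNC {bytes_per_sync_event(acnt, BCNT, True):>8} B/event -> {retained_pct(ab_mbps):6.2f} %"
    )
```

With ACNT = 64, the A_SYNC case keeps about 93 % of the baseline bandwidth, whereas every AB_SYNC row stays below 1 %, which is consistent with the much larger amount of data requested per event.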

The impact on M4 traffic can be controlled by using a bandwidth limiter on the EDMA read from QSPI. Table 66 shows the resulting performance of the M4_0 code in two independent runs: (1) the network use case plus capture and display, and (2) the capture-display use case only.

Table 66. M4_0 CPU Networking Bandwidth Performance for Concurrent EDMA Traffic at Different EDMA Throughputs

| BW Limited EDMA TPUT (ACNT = 65535, A_SYNC) | M4 Networking Performance (MBps), QSPI4 (64 MHz) | M4 Capture Display Total CPU Load (%), QSPI4 (64 MHz) |
|---|---|---|
| 22.46 MBps | 0.0118 MBps | 99.9 % |
| 17.86 MBps | 0.621 MBps | 40.1 % |
| 8.98 MBps | 1.772 MBps | 19.2 % |
| Without EDMA | 2.935 MBps | 13.2 % |
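The measurements in Table 66 can be used to pick an EDMA bandwidth limit for a given application budget. The sketch below is only an illustration: it selects among the three measured operating points, does not interpolate between them, and the function name and the example budgets are hypothetical. Because the networking and capture-display numbers come from independent runs, the two constraints are checked separately.

```python
from typing import Optional

# Rows of Table 66:
# (EDMA throughput in MBps, M4 networking MBps, M4 capture-display CPU load in %)
TABLE_66 = [
    (22.46, 0.0118, 99.9),
    (17.86, 0.621, 40.1),
    (8.98, 1.772, 19.2),
]


def highest_edma_setting(min_network_mbps: float, max_cpu_load_pct: float) -> Optional[float]:
    """Return the highest measured EDMA throughput that satisfies both budgets.

    Returns None when no measured operating point meets the constraints.
    """
    candidates = [
        edma_mbps
        for edma_mbps, network_mbps, cpu_load_pct in TABLE_66
        if network_mbps >= min_network_mbps and cpu_load_pct <= max_cpu_load_pct
    ]
    return max(candidates) if candidates else None


# Example budgets (assumed values, chosen only to exercise the lookup).
print(highest_edma_setting(min_network_mbps=0.5, max_cpu_load_pct=50.0))
print(highest_edma_setting(min_network_mbps=1.5, max_cpu_load_pct=25.0))
print(highest_edma_setting(min_network_mbps=2.0, max_cpu_load_pct=100.0))
```

A budget that none of the measured points can meet returns `None`, which signals that the EDMA copy would have to be limited further than the settings that were measured, or scheduled when the M4_0 is not executing from QSPI.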

The impact on the performance of the IPU code can also be understood by looking at the traffic profile captured with the L3 statistic collectors. Figure 41 through Figure 45 show how the IPU traffic is impacted by concurrent EDMA traffic and how the performance of the IPU can be recovered using BW limiters on the EDMA read from QSPI. QSPI is operating at 64 MHz in all of the BW plots below.

Figure 41. IPU (QSPI XIP) Vision SDK + Networking Bandwidth Profile
Figure 42. EDMA ASYNC Transfer QSPI to DDR (ACNT = 65535)
Figure 43. IPU (QSPI XIP) Vision SDK + Networking Bandwidth Profile With Concurrent EDMA Traffic
Figure 44. IPU (QSPI XIP) Vision SDK + Networking Bandwidth Profile EDMA BW Limited to Approximately 18 MBps
Figure 45. IPU (QSPI XIP) Vision SDK + Networking Bandwidth Profile EDMA BW Limited to Approximately 9 MBps