SPRUIG3C January 2018 – August 2019 TDA4VM, TDA4VM-Q1

 

Streaming Engines

The streaming engines (SEs) provide the most efficient addressing, but they are restricted as follows:

  • There are only two of them.
  • They can only be used for loads (not stores).
  • The base address is pre-initialized by the SE_OPEN step; therefore a given SE cannot be shared between multiple loads with different base addresses.

The migration tool allocates the two SE resources to what it considers to be the two highest-priority loads. Loads in the innermost loop are considered to have the highest priority. The heuristic simply picks the first two loads in the innermost loop; if that loop has fewer than two loads, it moves on to the next outer loop.
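The selection heuristic can be sketched on the host as follows. This is a minimal illustration, not the migration tool's implementation: the `Load` structure, the `pick_se_loads()` function, and the assumption that loads are listed in program order within a single loop nest are all hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical model of a load in a loop nest; not a migration-tool type.
struct Load {
    std::string name;  // label for the load, for illustration only
    int loop_depth;    // 0 = outermost loop; larger = more deeply nested
};

// Choose up to two loads for the two SEs. Loops are visited from innermost
// to outermost; within a loop, loads are taken in program order.
// Assumes 'loads' is already in program order (an assumption of this sketch).
std::vector<std::string> pick_se_loads(const std::vector<Load>& loads) {
    const std::size_t kNumSEs = 2;  // there are only two streaming engines

    // Find the innermost loop depth that appears in the list.
    int max_depth = -1;
    for (const Load& l : loads) {
        if (l.loop_depth > max_depth) max_depth = l.loop_depth;
    }

    std::vector<std::string> chosen;
    for (int depth = max_depth; depth >= 0 && chosen.size() < kNumSEs; --depth) {
        for (const Load& l : loads) {
            if (l.loop_depth == depth && chosen.size() < kNumSEs) {
                chosen.push_back(l.name);
            }
        }
    }
    return chosen;
}
```

With one load in the inner loop and two in the outer loop, the sketch picks the inner load first and then the first outer load; any remaining loads would have to use a different addressing method.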

For SE-based loads, the migration tool generates the following sequence of steps:

  1. In the init() function, the migration tool generates a call to the SE_init() template function in the virtual machine, which returns an SE setup vector. (The ISA spec refers to the setup vector as an SE template; this document uses the term setup vector to avoid confusion with the virtual machine’s C++ templates.) The setup vector is saved in the tvals structure for later access by the vloops() function. The setup vector consists of static (compile-time) and dynamic (run-time) values. The static values correspond to flags in the SE setup vector and are determined from the distribution mode and data type; they are passed as template parameters to SE_init(). The dynamic values correspond to stride and trip count values, which are determined from the terms in the Agen expression and the loop trip counts; they are passed as run-time arguments to SE_init().
  2. Also in the init() function, the migration tool generates the expression that represents the base address and saves that in another field of the tvals structure.
  3. In the vloops() function, outside the outermost loop, the migration tool generates a call to the SE_OPEN() intrinsic, passing it both the setup vector and the base address from the tvals structure.
  4. The load instruction in the loop simply uses an __se_ac_<type> intrinsic for the access, which turns into a quasi-register operand SEn++ containing the loaded value.
  5. As an optimization, the compiler may copy-propagate the SEn++ operand into the instruction where the value is used, thereby eliminating the load instruction altogether.
  6. The migration tool generates a call to the SE_CLOSE() intrinsic after the loop nest.
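The sequence above can be modeled on the host with a small mock. This is only a sketch of the control flow under stated assumptions: everything in the `sketch` namespace (`SetupVector`, `se_init`, `Tvals`, `StreamingEngine`, `init`, `vloops`) is a hypothetical stand-in, not the virtual machine's actual API, and the mock streams one scalar element per access rather than a vector.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

namespace sketch {  // all names here are hypothetical stand-ins

// Stand-in for the SE setup vector: static flags plus dynamic values.
struct SetupVector {
    std::size_t elem_bytes;       // static: derived from the data type
    int inner_count;              // dynamic: inner-loop trip count
    int outer_count;              // dynamic: outer-loop trip count
    std::ptrdiff_t outer_stride;  // dynamic: elements between outer iterations
};

// Step 1: static values are template parameters, dynamic values are arguments.
template <typename T>
SetupVector se_init(int inner_count, int outer_count, std::ptrdiff_t outer_stride) {
    return SetupVector{sizeof(T), inner_count, outer_count, outer_stride};
}

// Steps 1 and 2: what init() saves for later use by vloops().
struct Tvals {
    SetupVector se0_setup;
    const void* se0_base;
};

// Mock of one streaming engine; next() plays the role of the SEn++ operand.
template <typename T>
class StreamingEngine {
public:
    void open(const SetupVector& sv, const void* base) {
        assert(sv.elem_bytes == sizeof(T));
        sv_ = sv;
        base_ = static_cast<const T*>(base);
        inner_ = 0;
        outer_ = 0;
        is_open_ = true;
    }
    T next() {
        assert(is_open_ && outer_ < sv_.outer_count);
        T value = base_[outer_ * sv_.outer_stride + inner_];
        if (++inner_ == sv_.inner_count) {  // advance like a nested loop
            inner_ = 0;
            ++outer_;
        }
        return value;
    }
    void close() { is_open_ = false; }

private:
    SetupVector sv_{};
    const T* base_ = nullptr;
    int inner_ = 0;
    int outer_ = 0;
    bool is_open_ = false;
};

// init(): build the setup vector (step 1) and record the base address (step 2).
inline Tvals init(const std::int16_t* in, int width, int height, std::ptrdiff_t pitch) {
    return Tvals{se_init<std::int16_t>(width, height, pitch), in};
}

// vloops(): open before the loop nest, access inside it, close after it.
inline long vloops(const Tvals& t) {
    StreamingEngine<std::int16_t> se0;
    se0.open(t.se0_setup, t.se0_base);  // step 3
    long sum = 0;
    for (int y = 0; y < t.se0_setup.outer_count; ++y) {
        for (int x = 0; x < t.se0_setup.inner_count; ++x) {
            sum += se0.next();          // steps 4 and 5: value used directly
        }
    }
    se0.close();                        // step 6
    return sum;
}

}  // namespace sketch
```

In the mock, the loop body never computes an address: the base address and strides are fixed when the engine is opened. This is also why a single SE cannot serve two loads that have different base addresses.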