SPRUIG3C January 2018 – August 2019 TDA4VM, TDA4VM-Q1

 


SIMD Width

VCOP has 8 lanes of 40 bits each. When mapped to 32-bit lanes on C7x, there are 16 lanes available, potentially doubling the throughput of a kernel.

Many kernels are written to be independent of the SIMD width, using the macro VCOP_SIMD_WIDTH to abstract the number of lanes. Some of these kernels can be successfully built in host emulation mode for wider (or narrower) machines simply by changing the value of the macro. In host emulation mode, VCOP_SIMD_WIDTH must now be defined on the command line or before inclusion of vcop_host_emulation.h.
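As an illustration of width-independent code, the following is a minimal scalar sketch in plain C, not actual VCOP kernel code. It assumes a default of 8 lanes when VCOP_SIMD_WIDTH is not supplied on the command line, and it only marks in a comment where vcop_host_emulation.h would be included. The names LANES, vector_iterations, and add_const_kernel are hypothetical and are not part of any TI header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Define the width before the emulation header would be included.
   Assumption: fall back to 8 lanes if it was not given with -D. */
#ifndef VCOP_SIMD_WIDTH
#define VCOP_SIMD_WIDTH 8
#endif
/* A real host emulation build would include vcop_host_emulation.h here. */

/* Hypothetical convenience macro: lane count as an unsigned size. */
#define LANES ((size_t)VCOP_SIMD_WIDTH)

/* Hypothetical helper: vector iterations needed to cover n_elems. */
static size_t vector_iterations(size_t n_elems)
{
    return (n_elems + LANES - 1) / LANES;
}

/* Hypothetical scalar model of a kernel that never hard-codes 8 or 16.
   The outer loop walks vectors, the inner loop walks lanes. */
static void add_const_kernel(const int32_t *in, int32_t *out,
                             size_t n_elems, int32_t c)
{
    /* This sketch requires whole vectors only. */
    assert(n_elems % LANES == 0);

    for (size_t v = 0; v < n_elems / LANES; v++) {
        for (size_t lane = 0; lane < LANES; lane++) {
            size_t idx = v * LANES + lane;
            out[idx] = in[idx] + c;
        }
    }
}
```

Building the same file with -DVCOP_SIMD_WIDTH=16 changes only the trip counts, which is the property that makes such kernels candidates for a wider machine.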

The SIMD width used by VCC is controlled by the --vcop_simd option. Kernels that qualify for 16-way SIMD are NOT detected or transformed automatically; to request a SIMD width of 16, use --vcop_simd=16. This option controls the sequence of calls to the VM that the translation emits. To support this option, the generated C source file also defines VCOP_SIMD_WIDTH.

Some kernels depend on a specific SIMD width and will not work correctly if extended to 16-way SIMD. Furthermore, increasing the SIMD factor may depend on certain properties of the data layout in memory. For example, image widths may be required to be multiples of 16 instead of 8. It is not possible for the migration tool to automatically detect these cases.
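Because the tool cannot detect layout dependencies, a caller may want to check them explicitly before switching a kernel to a wider SIMD factor. The sketch below shows one way to test the image-width example above in plain C. The function names width_fits_simd and padded_width are hypothetical, and the assumption that padding the row width is acceptable depends on the kernel in question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical check: returns 1 if every image row splits into whole
   vectors of simd_width lanes, 0 otherwise. */
static int width_fits_simd(size_t image_width, size_t simd_width)
{
    assert(simd_width != 0);
    return image_width % simd_width == 0;
}

/* Hypothetical helper: round image_width up to the next multiple of
   simd_width. Assumes the buffer may be padded to that width. */
static size_t padded_width(size_t image_width, size_t simd_width)
{
    assert(simd_width != 0);
    return (image_width + simd_width - 1) / simd_width * simd_width;
}
```

For instance, a 40-pixel row passes the check for 8 lanes but fails it for 16 lanes, where it would need padding to 48 pixels.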

The following are examples of VCOP operations that cannot be trivially extended to 16-way SIMD.

  • VBITPK – if kernel assumes results are 8-bit values
  • VBITTR – assumes 8x8 transpose
  • VBITUNPK – if bit mask (src1[0]) is assumed to be 8 bits
  • Interleave/De-interleave, including de-interleaving loads, interleaving stores, and vector operations – if kernel assumes interleaving on 8-lane boundaries. (If kernel avoids making assumptions about vector sizes or layouts, for example simply using de-interleave-on-read and interleave-on-write to improve throughput without regard to layout, then interleaving can be extended to wider SIMD widths.)
  • Load with custom distribution – kernel C source format is tied to 8 lanes
  • Load with expand – if kernel assumes 8-bit predicate
  • Load with nbits – if kernel assumes 8-bit type for packed bit vector in memory
  • Lookup table and histogram – the table layout in memory is tied to VCOP’s 8-bank memory architecture, but certain cases of 16-way lookup and histogram are supported. See Section 5.4.8.
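Several of the cases above come down to a value that holds one bit per lane being stored in an 8-bit type. The following is a simplified scalar model in plain C and does not reproduce the actual semantics of VBITPK or any other VCOP operation. The function pack_lane_flags is hypothetical, and placing lane 0 in bit 0 is an assumption made only for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: pack one flag per lane into a bit mask.
   Assumption: lane 0 maps to bit 0, lane 1 to bit 1, and so on. */
static uint32_t pack_lane_flags(const uint8_t *flags, size_t lanes)
{
    /* The mask type in this sketch holds at most 32 lanes. */
    assert(lanes <= 32);

    uint32_t mask = 0;
    for (size_t lane = 0; lane < lanes; lane++) {
        if (flags[lane]) {
            mask |= (uint32_t)1 << lane;
        }
    }
    return mask;
}
```

With 8 lanes the result always fits in a uint8_t. With 16 lanes, code that still stores the result in a uint8_t silently drops the flags for lanes 8 through 15, which is the kind of hidden width assumption that prevents a trivial extension to 16-way SIMD.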