SPRUIG3C January 2018 – August 2019 TDA4VM, TDA4VM-Q1

 


40-bit Incompatibilities

In general, two kinds of errors result from narrowing vector lanes from 40 bits on VCOP to 32 bits in the translated C7x code.

The first occurs when the upper bits indicate signedness. For example, on VCOP the value 0x00.FFFF.FFFF represents a large unsigned number (4294967295), whereas the value 0xFF.FFFF.FFFF represents a negative number (-1). When translated, both values narrow to the same 32-bit result, 0xFFFF.FFFF, which could be either value depending on whether it is treated as signed or unsigned.

By default, the migration tool treats all 32-bit values as signed. This covers the majority of cases, since the 40-bit values in VCOP are always treated as signed. In some cases, however, this can lead to incorrect results. For example, when the value 0x00.FFFF.FFFF is right-shifted or compared, VCOP treats it as positive while the C7x translation treats it as negative.
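The mismatch described above can be sketched in plain C. This is only an illustrative model, not Kernel-C and not code produced by the migration tool: the helper names lane40, lane32, and asr are hypothetical, a 40-bit VCOP lane is modeled as a sign-extended int64_t, and a translated lane as an int32_t.

```c
#include <stdint.h>

/* Hypothetical model of a VCOP lane: keep the low 40 bits of x and
 * sign-extend from bit 39. */
static int64_t lane40(uint64_t x)
{
    x &= 0xFFFFFFFFFFULL;
    if (x & 0x8000000000ULL)
        return (int64_t)x - (1LL << 40);
    return (int64_t)x;
}

/* Hypothetical model of the default translation: keep the low 32 bits
 * and interpret them as a signed value. */
static int32_t lane32(int64_t x)
{
    uint32_t u = (uint32_t)x;
    if (u & 0x80000000u)
        return (int32_t)((int64_t)u - 4294967296LL);
    return (int32_t)u;
}

/* Arithmetic right shift written without relying on the
 * implementation-defined behavior of >> on negative values. */
static int64_t asr(int64_t x, unsigned n)
{
    return (x < 0) ? ~(~x >> n) : (x >> n);
}
```

In this model, lane40(0x00FFFFFFFF) is 4294967295, so shifting it right by 4 gives 268435455 and comparing it against zero reports a positive value. After narrowing with lane32, the same bits become -1, so the shift gives -1 and the comparison reports a negative value.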

To help address this issue, the migration tool treats 32-bit values as unsigned in some cases. This can happen in two ways. First, if the vector is loaded from an unsigned base pointer (__vptr_uchar, __vptr_ushort, or __vptr_uint), its element type becomes unsigned. Second, you can force a vector to be unsigned by declaring it using the __vector_uint32 keyword (rather than the normal __vector).

The following operations are affected by the signedness of vector elements.

  • Right shifts (including shift-or) are unsigned if the left-hand-side operand is unsigned.
  • Compare operations (including sort2, min, and max) use unsigned compares if both operands are unsigned; if only one operand is unsigned, these operations use signed compare.
  • Saturation operations use unsigned compare for saturation bounds if the source register is unsigned.

Compatibility Warning: Unsigned vector elements
If a kernel relies on vector elements being treated as unsigned when bit 31 is set, the translated code may not work properly. Most such issues can be fixed by declaring the vector as __vector_uint32.
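The shift and compare rules listed above can be modeled in plain C as a rough sketch. The lane_t type and the helpers below are hypothetical and exist only for illustration; they are not part of Kernel-C, the C7x intrinsics, or the migration tool. Each lane carries a flag that plays the role of the element type the tool infers from the base pointer or from a __vector_uint32 declaration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: a 32-bit lane plus its inferred signedness. */
typedef struct {
    uint32_t bits;
    bool     is_unsigned;
} lane_t;

static lane_t signed_lane(uint32_t bits)
{
    lane_t l = { bits, false };
    return l;
}

static lane_t unsigned_lane(uint32_t bits)
{
    lane_t l = { bits, true };
    return l;
}

/* Reinterpret 32 bits as a signed value without relying on
 * implementation-defined conversions. */
static int32_t as_signed(uint32_t u)
{
    if (u & 0x80000000u)
        return (int32_t)((int64_t)u - 4294967296LL);
    return (int32_t)u;
}

/* Right shift (n in 0..31): logical if the left-hand operand is
 * unsigned, arithmetic otherwise. */
static uint32_t lane_shr(lane_t a, unsigned n)
{
    if (a.is_unsigned)
        return a.bits >> n;
    int32_t s = as_signed(a.bits);
    int32_t r = (s < 0) ? ~(~s >> n) : (s >> n);
    return (uint32_t)r;
}

/* Greater-than: unsigned compare only if both operands are unsigned;
 * a mixed pair falls back to a signed compare. */
static bool lane_gt(lane_t a, lane_t b)
{
    if (a.is_unsigned && b.is_unsigned)
        return a.bits > b.bits;
    return as_signed(a.bits) > as_signed(b.bits);
}
```

With the bit pattern 0xFFFFFFFF, the unsigned lane shifts right by 4 to 0x0FFFFFFF and compares greater than 1, matching the VCOP behavior for 0x00.FFFF.FFFF. The signed lane shifts to 0xFFFFFFFF and compares less than 1. In this model a mixed compare (one unsigned operand, one signed) also takes the signed path, which is why both operands need to be unsigned for the fix to apply to compares.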

The second kind of error results from the reduced lane width when values have significance in the upper 8 bits. On VCOP these bits are typically used as overflow (guard) bits for accumulation loops, or to hold the upper bits of extended multiply operations. Here is a partial list of VCOP operations that use the guard bits:

  • Multiplies (VMPY, VMADD, and VMSUB)
    Inputs are 17x17 (bit 16 is a sign bit) producing 40 bits of sign-extended output.
  • VLMBD
    Searches bits 39-32 for leftmost bit.
  • VADDH (Kernel-C: (Vdst1,Vdst2) = Vsrc1 + hi(Vsrc2))
    Shifts guard bits right by 32 and adds them.
  • VSHF16 (Kernel-C: Vdst = jus16(src))
    Sign extends bit 32 into upper guard bits.

The migration tool does not attempt to account for or detect these incompatibilities. The resultant code will likely fail at run time.

Compatibility Warning: Reliance on 40-bit elements
If a kernel depends on more than 32 bits of precision in vector elements, the translated code may not work properly.
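As an illustration of how an accumulation loop can come to depend on the guard bits, the following plain C sketch sums identical 16-bit products into a lane of configurable width. It is a simplified model under stated assumptions, not Kernel-C and not a description of how VMADD is implemented: wrap_to_bits and accumulate are hypothetical helpers, and overflow is modeled as two's-complement wraparound at the lane width.

```c
#include <stdint.h>

/* Hypothetical helper: wrap x to a signed value of the given width
 * (1 <= bits <= 62), modeling overflow in a lane of that width. */
static int64_t wrap_to_bits(int64_t x, unsigned bits)
{
    uint64_t mask = (1ULL << bits) - 1;
    uint64_t u = (uint64_t)x & mask;
    if (u & (1ULL << (bits - 1)))
        return (int64_t)u - (int64_t)(1ULL << bits);
    return (int64_t)u;
}

/* Accumulate n copies of the product a*b in a lane of lane_bits bits. */
static int64_t accumulate(int16_t a, int16_t b, int n, unsigned lane_bits)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = wrap_to_bits(acc + (int64_t)a * b, lane_bits);
    return acc;
}
```

Four products of 32767 x 32767 sum to 4294705156, which fits in a 40-bit lane but not in a signed 32-bit lane: accumulate(32767, 32767, 4, 40) returns 4294705156, while accumulate(32767, 32767, 4, 32) returns -262140. Small inputs such as accumulate(100, 100, 4, 32) give the same result at either width, which is why such kernels can appear to work on some data and fail on other data.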