SPRUIE9D May 2017 – May 2024 DRA74P , DRA75P , DRA76P , DRA77P
Software cannot issue VLDH_DINTRLV on odd halfwords, so the loop uses two VLDH_NPT loads instead.
| Cycle | LD stage | OP stage | ST stage |
|---|---|---|---|
| 0 | iter 0: LD0 need 0x10000..0x1000F, read IBUFLA, LD0_buf = {0x10000..0x1001F}; LD1 need 0x10080..0x1008F, stalled; LD2 need 0..1, read WBUF, LD2_buf = {0x0..0x1F} | | |
| 1 | iter 0: LD1 read IBUFLA, LD1_buf = {0x10080..0x1009F} | | |
| 2 | iter 1: LD0 need 0x10002..0x10011, from LD0_buf; LD1 need 0x10082..0x10091, from LD1_buf; LD2 need 2..3, from LD2_buf | iter 0: MADD \|\| MADD | |
| 3 | iter 2: LD0 need 0x10004..0x10013, from LD0_buf; LD1 need 0x10084..0x10093, from LD1_buf; LD2 need 4..5, from LD2_buf | iter 1: MADD \|\| MADD | iter 0 |
| 4 | iter 3: LD0 need 0x10010..0x1001F, from LD0_buf; LD1 need 0x10090..0x1009F, from LD1_buf; LD2 need 0..1, from LD2_buf | iter 2: MADD \|\| MADD | iter 1 |
| 5 | iter 4: LD0 need 0x10012..0x10021, read IBUFLA, LD0_buf = {0x10010..0x1002F}; LD1 need 0x10092..0x100A1, stalled; LD2 need 2..3, from LD2_buf | iter 3: MADD \|\| MADD | iter 2: ST0 store 0x10400..0x1040F, ST0 queued; ST1 store 0x10480..0x1048F, ST1 queued |
| 6 | iter 4: LD1 read IBUFLA, LD1_buf = {0x10090..0x100AF} | stalled | stalled |
| 7 | iter 5: LD0 need 0x10014..0x10023, from LD0_buf; LD1 need 0x10094..0x100A3, from LD1_buf; LD2 need 4..5, from LD2_buf | iter 4: MADD \|\| MADD | iter 3: ST0 write IBUFLA |
| 8 | iter 6: LD0 need 0x10020..0x1002F, from LD0_buf; LD1 need 0x100A0..0x100AF, from LD1_buf; LD2 need 0..1, from LD2_buf | iter 5: MADD \|\| MADD | iter 4: ST1 write IBUFLA |
| 9 | iter 7: LD0 need 0x10022..0x10031, read IBUFLA, LD0_buf = {0x10020..0x1003F}; LD1 need 0x100A2..0x100B1, stalled; LD2 need 2..3, from LD2_buf | iter 6: MADD \|\| MADD | iter 5: ST0 store 0x10410..0x1041F, ST0 queued; ST1 store 0x10490..0x1049F, ST1 queued |
| 10 | iter 7: LD1 read IBUFLA, LD1_buf = {0x100A0..0x100BF} | stalled | stalled |
In this loop, 2 VLDH_NPT loads sustain 16 multiply-accumulates per iteration. The load stage stalls whenever both loads must read IBUFL in the same iteration, because the 2 reads contend for the memory. In steady state, each read-buffer fill also supplies the data for the 2 subsequent iterations, leaving 2 memory-read-free cycles in every 3 iterations. Because the output is written back to IBUFL, the store buffer delays the memory writes so that they use these free memory slots. Each i4 loop of 3 iterations thus takes 4 cycles to complete.
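
The load-stage timing in the table can be reproduced with a small model. The following Python sketch is illustrative only and is not part of the device software. It assumes, as inferred from the table, a 32-byte read buffer per load that is refilled from a 16-byte-aligned address, 16 bytes consumed per load, and at most one IBUFLA read per cycle. The names `ReadBuffer`, `ld0_address`, and `load_stage_schedule` are hypothetical, and the LD2 access to WBUF is left out because it does not contend with the IBUFLA reads.

```python
BUF_BYTES = 32   # read-buffer size, inferred from the LD0_buf/LD1_buf ranges in the table
LOAD_BYTES = 16  # bytes each load needs per iteration (also the assumed refill alignment)


def ld0_address(it, base=0x10000, inner=3, inner_step=2, outer_step=16):
    """LD0 start address for iteration `it`, following the table's address pattern."""
    outer, i4 = divmod(it, inner)
    return base + outer * outer_step + i4 * inner_step


class ReadBuffer:
    """Holds one 32-byte window of memory (assumed behaviour, not a documented API)."""

    def __init__(self):
        self.start = None  # no valid data before the first fill

    def access(self, addr):
        """Return True if the request misses and memory must be read."""
        if self.start is not None and (
            self.start <= addr and addr + LOAD_BYTES <= self.start + BUF_BYTES
        ):
            return False  # served from the buffer, no memory read
        self.start = addr & ~(LOAD_BYTES - 1)  # refill from the aligned address
        return True


def load_stage_schedule(n_iters, ld1_offset=0x80):
    """Return (iteration, start_cycle, cycles_in_load_stage) for each iteration."""
    bufs = (ReadBuffer(), ReadBuffer())  # one buffer each for LD0 and LD1
    cycle = 0
    schedule = []
    for it in range(n_iters):
        a0 = ld0_address(it)
        misses = sum(buf.access(a) for buf, a in zip(bufs, (a0, a0 + ld1_offset)))
        cycles = max(1, misses)  # one IBUFLA read per cycle; buffer hits take 1 cycle total
        schedule.append((it, cycle, cycles))
        cycle += cycles
    return schedule


if __name__ == "__main__":
    for it, start, cycles in load_stage_schedule(8):
        print(f"iter {it}: load stage starts at cycle {start}, occupies {cycles} cycle(s)")
```

Under these assumptions, the model places iterations 0, 4, and 7 in two load cycles each and all other iterations in one, which agrees with the LD stage column above and with the steady-state rate of 3 iterations per 4 cycles.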