CPI (Cycles per Instruction)

CPI by very definition is nothing but the average time taken per instruction by the program. So of-course the lesser the CPI for same program means you have spend lesser time doing same/more instruction.

\[ CPI = \frac{Total\_CPU\_Clock\_Cycles} {Total\_Instuctions\_Executed} \]

The CPI Stack is understanding why a CPI is what it is, it is breaking it down/decomposing it into its constituent parts.

CPI as a overall metric is used to compare software performance and analyze the bottlenecks at the architectural level but the deep underlying cause of the latency and the CPI can only be done using the CPI stack which talks about the underlying stalls and latency.

Constructing the CPI Stack

From the above section we have understood that the average CPI of a computer program executing on a given microprocessor can be divided into a base CPI plus a number of CPI components that reflect ‘lost’ cycle opportunities due to miss events.

So what is a CPI stack again ? it is nothing but the breakdown of CPI into its respective components. It is referred to as a stack because the CPI data is typically displayed as a stacked histogram bars where the base CPI is shown at the bottom and the rest above. the below diagram shows a CPI stack.

cpi_stack_fig1.png

So from a quick first look we see that L2 D-cache miss is the most significant contributing factor and if in some way if L2 D-cache is tuned or if the application’s locality is optimized by the compiler then we can reduce the CPI and boost the performance of the application on the target hardware.