Defining Performance

反應時間(response time)和生產量(throughput)是常見 作為評估計算效能的標準。
反應時間是指工作從開始到完成所花的時間,又稱執行時間 (execution time)。
生產量是指在單位時間內所完成的工作量。
\[ \text{Performance} = \frac{1}{\text{Execution Time}} \]

\[ \text{Performance }\uparrow \text{, Execution Time}\downarrow \]

n times faster

Two computer x and y.
The compouter x is n times faster than computer y. (performance ratio)
\[ (\rightarrow ) \frac{\text{Performance}_{\color{blue}x}}{\text{Performance}_{\color{red}y}} = n \] \[ (\leftarrow ) \frac{\text{Execution Time}_{\color{red}y}}{\text{Execution Time}_{\color{blue}x}} = n \]

Execution Time

Focus user cpu execution time.
時間定義: wall-clock time + response time + elapsed time
也就是完成工作的總時間,包括磁碟存取時間、記憶體存取時間、 輸入/輸出動作時間、作業系統時間 (operating system overhead)等。
CPU execution time without waiting I/O and other processes.


\( \text{執行一個程式所需時間(elapsed time)} \begin{cases} \text{程式執行及等待I/O時間}\\ \text{程式使用CPU時間} \begin{cases} \text{CPU執行作業系統時間(System CPU time)} \rightarrow \text{For servering user process}\\ \text{CPU執行使用者程式時間(User CPU time)}\\ \end{cases} \end{cases} \)

CPU Performance and Its Factors

\( \text{CPU execution time for a program} = \text{CPU clock cycles for a program} \times \text{Clock cycle time} \)
Alternatively, because clock rate and clock cycle time are inverses,

\( \text{CPU execution time for a program} = \frac{ \text{CPU clock cycles for a program} } { \text{Clock rate} } \)

Clock Cycle = Instruction count(IC) \( \times \) Cycles Per Instruction(CPI)
\( \Rightarrow \) Execution Time = IC \( \times \) CPI \( \times \) CCT(clock cycle time)
= (IC \( \times \) CPI) \( / \) CR(clock rate)

不同指令所需的Cycles Per Instruction 也會不同。因此,
Clock Cycles 也可以經由加總各類指令CPI與其指令使用個數 的乘積而獲得。
\( \text{CPU clock cycles}= \sum_{i=1}^{n}(\text{CPI}_i\times C_i) \)
\( C_i \)為指令類別i所使用的指令個數
\( \text{CPI}_i \)為指令類別i的CPI
n: 指令類別數。

Notice

改變指令集\( \rightarrow \)降低指令數, 有可能造成clock cycle增加。
由於CPI與指令的種類相關,指令數少的程式碼 也未必就是最快的。

\( V: \text{電壓} \)
功率(\( W \))的消耗正比於(clock rate\( \times \)\( V^2 \))

Performance example

A certain machine with a 10ns \( (10\times 10^{-9}\text{s}) \) clock can perform
jumps (1 cycle),
branches (3 cycles),
arithmetic instructions (2 cycles),
multiply instructions (5 cycles),
and memory instructions (4 cycles).

A certain program has
10% jumps,
10 branches,
50% arithmetic,
10% multiply,
and 20% memory instructions.

Answer the following question.
Show your derivation in sufficient detail.
  1. What is the CPI of this program on this machine?
  2. If the program executes \( 10^9 \) instructions, what is its execution time?
  3. A 5-cycle multiply-add instruction is implemented that combines an arithmetic and a multiply instruction.
    50% of the multiples can be tuned into multiply-adds.
    What is the new CPI?
  4. Following (3) above, if the clock period remains the same,
    what is the program's new execution time.

Ans

  1. Type Cycle Instruction Count
    jump 1 10%
    branch 3 10%
    ari 2 50%
    mul 5 10%
    mem 4 20%

    \( \text{CPI}_\text{avg} = 1\times 0.1 + 3\times 0.1 + {\color{blue}2\times 0.5} + 5\times 0.1 + 4\times 0.2 = 2.7 \)
  2. Execution Time = \( \text{IC}\times \text{CPI}\times \text{Clock Cycle Time} \) \( = 10^9 \times 2.7 \times (10\times 10^{-9}) =27 \)
  3. Combines 50% of mul into ari.
    \( \Rightarrow \text{halve of }10\% \text{ is } 5\% \Rightarrow \text{all instructions reduce } 5\% \)
    The ari cycle per instruction is 2.
    The mul cycle per instruction is 5.
    \( \therefore \) mul dominate clock cycle. \( \Rightarrow \text{ari part} = 50\% - 5\% = 45\% \)
    \( \Rightarrow \text{CPI}_\text{new} =( 1\times 0.1 + 3\times 0.1 + {\color{blue} 2\times 0.45 + 5\times 0.05} + 5\times 0.05 + 4\times 0.2 ) {\color{red} / 0.95 } = 2.74 \)
  4. Execution Time =IC\( \times \) CPI\( \times \) Clock Cycle Time
    \( \Rightarrow (10^9\times 0.95)\times 2.74\times (10\times 10^{-9}) = 26.03 \)s

使用MIPS做為效能評估標準的謬誤

非時間的效能評估標準叫做MIPS (million instructions per second 又稱 native MIPS)

\[ \text{MIPS}= \frac{ \text{Instruction Count} } { \text{Execution Time} \times10^6 } = \frac{ \text{Instruction Count} } { \frac{\text{IC}\times\text{CPI}}{\text{Clock rate}} \times 10^6} = \frac{\color{red}{\text{Clock rate}}}{\text{CPI}\times 10^6} \]

因為MIPS是指令的執行率,所以MIPS與執行時間成反比。

所以MIPS較高,效能就較好??

*但較快的電腦代表他有較大的MIPS。

不同電腦或裝置不能用MIPS來比較效能。

MIPS用於鑑別程式

MIPS表示指令的執行率(instruction execution rate)
並沒有考慮每個指令的能力(每類指令的CPI不同)考慮進來。
所以同一台電腦不同程式,其MIPS可能不同。
MIPS可能會與效能相反。ex IC增加, CPI減少。

Example

某一電腦有三類指令A、B 和 C其CPI分別為1,2,3。
假設有兩個編譯器對同一程式進行編譯而產生程式碼,我們 得到以下的資料:
Code from A B C
Compiler 1 5 1 1
Compiler 2 10 1 1
假設clock cycle rate是4GHz。
哪一個編譯器產生的程式碼會執行的必較快?
若用執行時間當標準,哪個比較快?

Solution

Instruction count
compiler 1 : 5+1+1=7
compiler 2 : 10+1+1=12

Clock cycle
compiler 1 : \( 1\cdot 5 + 2\cdot 1 + 3\cdot 1= 10 \)
compiler 2 : \( 1\cdot 10 + 2\cdot 1 + 3\cdot 1= 15 \)

Execution Time
compiler 1 : \( \frac{10}{4\cdot 10^9}\)
compiler 2 : \( \frac{15}{4\cdot 10^9}\)

\(\Rightarrow \) 就時間來說compiler 2較長 \( \Rightarrow \)compiler 1執行速度較快。

MIPS
MIPS\( _1 = \frac{\text{IC}_1}{\text{ET}_1\cdot 10^6} = \frac{7}{\frac{10}{4\cdot 10^9}\cdot 10^6} = \frac{7\cdot 4\cdot 10^9}{10\cdot 10^6} \)
MIPS\( _2 = \frac{\text{IC}_2}{\text{ET}_2\cdot 10^6} = \frac{12}{\frac{15}{4\cdot 10^9}\cdot 10^6} = \frac{12\cdot 4\cdot 10^9}{15\cdot 10^6} \)

\( \Rightarrow \frac{\text{MIPS}_1}{\text{MIPS}_2} = \frac{7\cdot 15}{10\cdot 12}=\frac{105}{120} \)

\( \Rightarrow \) compiler 2 有較高的MIPS。

Amdahl's law


P = Performance
ET = Execution Time
Parallel portion is the part that can be improved.
\[ \text{Overall Speedup} = \frac{\text{After }P}{\text{Before }P} = \frac{\text{Before }ET}{\text{After }ET} = \frac{\text{Before } ET}{\frac{\text{ET of Parallel portion}}{\text{speedup of Parallel portion}}+\text{unimproved time}} \]
For percentage \[ \text{speedup}_\text{total}= \frac{\cancelto{1}{\frac{\text{all time}}{\text{all time}}}}{\frac{\text{improved time/all time}}{\text{speedup}_\text{improved part}}+\cancelto{\frac{\text{unimproved time}}{\text{all time}}}{(1-\frac{\text{improved time}}{\text{all time}}})} \]

example

目前一個正考慮中的效能評估程式,在舊的浮點數運算
硬體執行需100秒,為達到整體speedup為3,改善的部分為原來的 5倍快。
請問原花在執行浮點數運算指令的時間應為多少?

solution

\( 3 = \frac{100}{(100-x)+\frac{x}{5}} \)
\( \rightarrow x=\frac{250}{3} \)

效能總評

SPECratio的方法正規化
\( \text{SPECratio} = \frac{\text{Referenced(execution time)}}{\text{execution time}} \Rightarrow \text{Execution Time Ratio} \)
\( \Rightarrow \) 代表值越大,效率越好。

幾何平均(Geometric mean)

\( \sqrt[n]{\prod_{i=1}^{n}\text{Execution time ratio}_i} \)
\( \Rightarrow \) 不同的reference會有相同的值。

如果用算數平均來做正規化,不同的reference會有不同的值, 可能有誤差。

效能評估程式

常用 SPEC(System Performance Evaluation Coporation) 評估效能。
評估兩個計算機系統時,只要簡單比較兩者的工作量的 執行時間即可。