Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the primary cache, and a clock rate of 4 GHz. Assume a main memory access time of 100 ns, including all the miss handling. Suppose the miss rate per instruction at the primary cache is 2%. How much faster will the processor be if we add a secondary cache that has a 5 ns access time for either a hit or a miss and is large enough to reduce the miss rate to main memory to 0.5%?
Answer

At 4 GHz one clock cycle takes 0.25 ns, so the miss penalty to main memory is

100 ns / 0.25 ns per clock cycle = 400 clock cycles

With only the primary cache, the effective CPI is

Total CPI = Base CPI + Memory-stall cycles per instruction
          = 1.0 + 2% × 400 = 9.0
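With the secondary cache, the miss penalty for an access to the second-level cache is

5 ns / 0.25 ns per clock cycle = 20 clock cycles

The total CPI is the base CPI plus the stall cycles from both levels of cache:

Total CPI = 1 + Primary stalls per instruction + Secondary stalls per instruction
          = 1 + 2% × 20 + 0.5% × 400
          = 1 + 0.4 + 2.0 = 3.4

The processor with the secondary cache is therefore faster by a factor of

9.0 / 3.4 = 2.6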
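Alternatively, the stall cycles can be computed by summing those of the references that hit in the secondary cache, (2% - 0.5%) × 20 = 0.3, and those of the references that go to main memory, which pay for both the secondary cache access and the main memory access, 0.5% × (20 + 400) = 2.1. The sum, 1.0 + 0.3 + 2.1, is again 3.4.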
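The arithmetic of this example can be checked with a short script. This is only a sketch: the function names (miss_penalty_cycles, cpi_one_level, cpi_two_level) are made up for illustration, and the numbers are the ones given in the problem statement.

```python
def miss_penalty_cycles(access_time_ns: float, clock_ghz: float) -> float:
    """Convert an access time in ns to clock cycles (1 GHz = 1 cycle per ns)."""
    return access_time_ns * clock_ghz


def cpi_one_level(base_cpi: float, l1_miss_rate: float, mem_penalty: float) -> float:
    """CPI when every primary-cache miss goes straight to main memory."""
    return base_cpi + l1_miss_rate * mem_penalty


def cpi_two_level(base_cpi: float, l1_miss_rate: float, l2_penalty: float,
                  global_miss_rate: float, mem_penalty: float) -> float:
    """CPI when every L1 miss pays the L2 access and only global misses pay for memory."""
    return base_cpi + l1_miss_rate * l2_penalty + global_miss_rate * mem_penalty


mem = miss_penalty_cycles(100, 4.0)  # main memory: 400 cycles
l2 = miss_penalty_cycles(5, 4.0)     # secondary cache: 20 cycles
one = cpi_one_level(1.0, 0.02, mem)
two = cpi_two_level(1.0, 0.02, l2, 0.005, mem)

print(round(one, 2), round(two, 2), round(one / two, 1))  # 9.0 3.4 2.6
```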
How is the miss rate per instruction obtained for a combined cache?
Example: 1.2 memory accesses per instruction; L1 miss penalty 20 clocks; L2 miss penalty 50 clocks.

CPU <-> L1 Cache <-> L2 Cache <-> Memory
100 accesses    30 misses    10 misses

Total stall cycles = L1 stall cycles + L2 stall cycles
                   = L1 misses × L1 miss penalty + L2 misses × L2 miss penalty
                   = 30 × 20 + 10 × 50

Stall cycles per access = Total stall cycles / number of CPU accesses
                        = (30/100) × 20 + (10/100) × 50

where 30/100 is the L1 miss rate and 10/100 is the L2 miss rate.

Stall cycles per instruction = Memory accesses per instruction × Stall cycles per access
                             = (1.2 × 30/100) × 20 + (1.2 × 10/100) × 50

where 1.2 × 30/100 is the L1 miss rate per instruction and 1.2 × 10/100 is the L2 miss rate per instruction.
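As a rough check of the three formulas above, the following sketch recomputes them from the miss counts in the figure. The function name stall_cycles and its parameter names are hypothetical, chosen only for this illustration.

```python
def stall_cycles(accesses: int, l1_misses: int, l2_misses: int,
                 l1_penalty: float, l2_penalty: float,
                 accesses_per_instr: float):
    """Return (total stall cycles, stalls per access, stalls per instruction)."""
    total = l1_misses * l1_penalty + l2_misses * l2_penalty
    per_access = total / accesses
    per_instruction = accesses_per_instr * per_access
    return total, per_access, per_instruction


total, per_access, per_instr = stall_cycles(
    accesses=100, l1_misses=30, l2_misses=10,
    l1_penalty=20, l2_penalty=50, accesses_per_instr=1.2,
)
print(total, per_access, round(per_instr, 2))  # 1100 11.0 13.2
```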
Consider a processor with the following parameters:

Parameter                                                              a.          b.
Base CPI, no memory stalls                                             2.0         2.0
Processor speed                                                        3 GHz       1 GHz
Main memory access time                                                125 ns      100 ns
First-level cache miss rate per instruction                            5%          4%
Second-level cache, direct-mapped: speed                               15 cycles   10 cycles
Global miss rate with second-level cache, direct-mapped                3.0%        4.0%
Second-level cache, eight-way set-associative: speed                   25 cycles   20 cycles
Global miss rate with second-level cache, eight-way set-associative    1.8%        1.6%
(1) Calculate the CPI for the processor in the table using: ① only a first-level cache, ② a second-level direct-mapped cache, and ③ a second-level eight-way set-associative cache.
(2) It is possible to have a cache hierarchy with more than two levels. Given the processor above with a second-level, direct-mapped cache, a designer wants to add a third-level cache that takes 50 cycles to access and will reduce the global miss rate to 1.3%. Would this provide better performance? In general, what are the advantages and disadvantages of adding a third-level cache?
(3) In older processors such as the Intel Pentium or Alpha 21264, the second level of cache was external (located on a different chip) from the main processor and the first-level cache. While this allowed for large second-level caches, the latency to access the cache was much higher, and the bandwidth was typically lower because the second-level cache ran at a lower frequency. Assume a 512 KB off-chip second-level cache has a global miss rate of 4%. If each additional 512 KB of cache lowered global miss rates by 0.7%, and the cache had a total access time of 50 cycles, how big would the cache have to be to match the performance of the second-level direct-mapped cache listed in the table? Of the eight-way set-associative cache?
Answer

(1)
a. Memory miss cycles: 125 ns × 3 GHz = 375 clock cycles
   ① Total CPI: 2.0 + 375 × 5% = 20.75
   ② Total CPI: 2.0 + 15 × 5% + 375 × 3% = 14
   ③ Total CPI: 2.0 + 25 × 5% + 375 × 1.8% = 10

b. Memory miss cycles: 100 ns × 1 GHz = 100 clock cycles
   ① Total CPI: 2.0 + 100 × 0.04 = 6.0
   ② Total CPI: 2.0 + 100 × 0.04 + 10 × 0.04 = 6.4
   ③ Total CPI: 2.0 + 100 × 0.016 + 20 × 0.04 = 4.4
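The six CPI values in part (1) all follow the same pattern, which the sketch below recomputes for both processor configurations. The Config class, its field names, and the cpis function are assumptions made for this illustration; the values are copied from the parameter table.

```python
from dataclasses import dataclass


@dataclass
class Config:
    base_cpi: float
    clock_ghz: float
    mem_ns: float
    l1_miss: float            # first-level miss rate per instruction
    dm_cycles: float          # direct-mapped L2 access time
    dm_global_miss: float     # global miss rate with direct-mapped L2
    assoc_cycles: float       # eight-way L2 access time
    assoc_global_miss: float  # global miss rate with eight-way L2


def cpis(c: Config):
    """Return CPI with (L1 only, direct-mapped L2, eight-way L2)."""
    mem = c.mem_ns * c.clock_ghz  # main memory penalty in cycles
    l1_only = c.base_cpi + c.l1_miss * mem
    direct = c.base_cpi + c.l1_miss * c.dm_cycles + c.dm_global_miss * mem
    assoc = c.base_cpi + c.l1_miss * c.assoc_cycles + c.assoc_global_miss * mem
    return l1_only, direct, assoc


a = Config(2.0, 3.0, 125, 0.05, 15, 0.030, 25, 0.018)
b = Config(2.0, 1.0, 100, 0.04, 10, 0.040, 20, 0.016)

print([round(x, 3) for x in cpis(a)])  # [20.75, 14.0, 10.0]
print([round(x, 3) for x in cpis(b)])  # [6.0, 6.4, 4.4]
```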
(2)
a. Total CPI: 2.0 + 15 × 5% + 50 × 3% + 375 × 1.3% = 9.125
   This is lower than the CPI of 14 with only the second-level direct-mapped cache, so the third-level cache would provide better performance. However, it may complicate the design of the processor. This could lead to more complex cache coherency, increased cycle time, and larger and more expensive chips.

b. Total CPI: 2.0 + 100 × 0.013 + 10 × 0.04 + 50 × 0.04 = 5.7
   This is lower than the CPI of 6.4 with only the second-level direct-mapped cache, so the third-level cache would provide better performance. However, it may complicate the design of the processor. This could lead to more complex cache coherency, increased cycle time, and larger and more expensive chips.
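The two- and three-level results in parts (1) and (2) can be compared with one general formula: the base CPI plus, for each level below L1, the rate of references reaching that level times its access penalty. The sketch below assumes a hypothetical helper total_cpi that takes (miss rate per instruction, penalty in cycles) pairs.

```python
def total_cpi(base_cpi: float, levels) -> float:
    """levels: (rate, penalty) pairs, where rate is the per-instruction rate
    of references that must pay this penalty."""
    return base_cpi + sum(rate * penalty for rate, penalty in levels)


# Configuration a: L2 direct-mapped (15 cycles), optional L3 (50 cycles), memory 375 cycles.
a_two = total_cpi(2.0, [(0.05, 15), (0.03, 375)])
a_three = total_cpi(2.0, [(0.05, 15), (0.03, 50), (0.013, 375)])

# Configuration b: L2 direct-mapped (10 cycles), optional L3 (50 cycles), memory 100 cycles.
b_two = total_cpi(2.0, [(0.04, 10), (0.04, 100)])
b_three = total_cpi(2.0, [(0.04, 10), (0.04, 50), (0.013, 100)])

print(round(a_two, 3), round(a_three, 3))  # 14.0 9.125
print(round(b_two, 3), round(b_three, 3))  # 6.4 5.7
```

In both configurations the three-level CPI is lower, which is what the answer to part (2) relies on.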
(3)
a. Total CPI: 2.0 + 50 × 5% + 375 × (4% - 0.7% × n)
   n = 2 -> 1.5 MB L2 cache to match direct-mapped
   n = 4 -> 2.5 MB L2 cache to match 8-way

b. Total CPI: 2.0 + 50 × 0.04 + 100 × (0.04 - 0.007 × n)
   n = 2 -> 1.5 MB L2 cache to match direct-mapped
   n = 5 -> 3 MB L2 cache to match 8-way
Note on (3) a.:
Let 2.0 + 50 × 5% + 375 × (4% - 0.7% × n) = 14  ->  n = 2.1
Let 2.0 + 50 × 5% + 375 × (4% - 0.7% × n) = 10  ->  n = 3.6
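The values of n in part (3) come from solving the linear CPI equation for n. The sketch below does this for both configurations. The function names are hypothetical, and rounding n to the nearest integer is an assumption made here only because it reproduces the printed answers; a strict "match or beat" requirement would round up instead.

```python
def extra_blocks(base_cpi: float, l1_miss: float, l2_cycles: float,
                 mem_cycles: float, target_cpi: float,
                 start_gmr: float = 0.04, gmr_step: float = 0.007) -> float:
    """Solve base + l1_miss*l2_cycles + mem_cycles*(start_gmr - gmr_step*n) = target for n."""
    fixed = base_cpi + l1_miss * l2_cycles
    return (fixed + mem_cycles * start_gmr - target_cpi) / (mem_cycles * gmr_step)


def cache_size_mb(n: int) -> float:
    """512 KB base cache plus n additional 512 KB blocks."""
    return 0.5 * (1 + n)


# Targets are the CPIs from part (1): direct-mapped and eight-way L2.
n_a_dm = extra_blocks(2.0, 0.05, 50, 375, 14.0)   # about 2.1
n_a_8w = extra_blocks(2.0, 0.05, 50, 375, 10.0)   # about 3.6
n_b_dm = extra_blocks(2.0, 0.04, 50, 100, 6.4)    # about 2.3
n_b_8w = extra_blocks(2.0, 0.04, 50, 100, 4.4)    # about 5.1

for n in (n_a_dm, n_a_8w, n_b_dm, n_b_8w):
    print(round(n, 1), cache_size_mb(round(n)), "MB")
```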
Global miss rate (GMR): The fraction of references that miss in all levels of a multilevel cache.

Local miss rate (LMR): The fraction of references to one level of a cache that miss; used in multilevel hierarchies.
Example of global and local miss rates:

CPU <-> L1 Cache <-> L2 Cache <-> L3 Cache <-> Memory
1000 accesses    50 misses    20 misses    5 misses

        L1         L2         L3
GMR     50/1000    20/1000    5/1000
LMR     50/1000    20/50      5/20

L1 GMR = L1 LMR
L2 GMR = L1 LMR × L2 LMR
L3 GMR = L1 LMR × L2 LMR × L3 LMR
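The relation between the two miss rates can be verified numerically from the miss counts in the example. This is a minimal sketch; the function name miss_rates and its arguments are assumptions for illustration only.

```python
import math


def miss_rates(accesses: int, misses_per_level):
    """Return (global miss rates, local miss rates), one entry per cache level."""
    # Global: misses at a level divided by all CPU accesses.
    gmr = [m / accesses for m in misses_per_level]
    # Local: misses at a level divided by the references that reach that level,
    # i.e. the misses of the level above (or all accesses for L1).
    reaching = [accesses] + list(misses_per_level[:-1])
    lmr = [m / r for m, r in zip(misses_per_level, reaching)]
    return gmr, lmr


gmr, lmr = miss_rates(1000, [50, 20, 5])
print(gmr)  # [0.05, 0.02, 0.005]
print(lmr)  # [0.05, 0.4, 0.25]

# Each global miss rate is the product of the local miss rates down to that level.
for level in range(len(gmr)):
    assert math.isclose(gmr[level], math.prod(lmr[: level + 1]))
```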