
Suppose we have a processor with a base CPI of 1.0, assuming all references hit in the primary cache, and a clock rate of 4 GHz. Assume a main memory access time of 100 ns, including all the miss handling. Suppose the miss rate per instruction at the primary cache is 2%. How much faster will the processor be if we add a secondary cache that has a 5 ns access time for either a hit or a miss and is large enough to reduce the miss rate to main memory to 0.5%?

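Answer

At 4 GHz one clock cycle is 0.25 ns, so the miss penalty to main memory is

100 / 0.25 = 400 clock cycles

With only the primary cache:

Total CPI = Base CPI + Memory-stall cycles per instruction = 1.0 + 2% × 400 = 9.0

With the secondary cache, the miss penalty for an access to the secondary cache is

5 / 0.25 = 20 clock cycles

Total CPI = 1 + Primary stalls per instruction + Secondary stalls per instruction
          = 1 + 2% × 20 + 0.5% × 400
          = 1 + 0.4 + 2.0 = 3.4

The processor with the secondary cache is therefore faster by 9.0 / 3.4 = 2.6.

Alternatively, the stall cycles can be computed by summing the stalls of the references that hit in the secondary cache, (2% − 0.5%) × 20 = 0.3, and the stalls of the references that go to main memory, which must include the secondary cache access as well as the main memory access, 0.5% × (20 + 400) = 2.1. The sum, 1.0 + 0.3 + 2.1, is again 3.4.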
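The arithmetic above can be checked with a short script. This is only a sketch: the function and variable names are illustrative and not from the text, and it assumes miss rates are expressed per instruction, as in the problem statement.

```python
def cycles(time_ns: float, clock_ghz: float) -> float:
    """Convert a latency in nanoseconds to clock cycles (ns * cycles per ns)."""
    return time_ns * clock_ghz


def cpi_one_level(base_cpi: float, l1_miss_rate: float, mem_penalty: float) -> float:
    """CPI when every primary-cache miss goes straight to main memory."""
    return base_cpi + l1_miss_rate * mem_penalty


def cpi_two_level(base_cpi: float, l1_miss_rate: float, l2_penalty: float,
                  global_miss_rate: float, mem_penalty: float) -> float:
    """CPI with a secondary cache: every L1 miss pays the L2 access,
    and the misses that also miss in L2 additionally pay main memory."""
    return base_cpi + l1_miss_rate * l2_penalty + global_miss_rate * mem_penalty


CLOCK_GHZ = 4.0
mem_penalty = cycles(100, CLOCK_GHZ)  # 400 clock cycles
l2_penalty = cycles(5, CLOCK_GHZ)     # 20 clock cycles

cpi_l1_only = cpi_one_level(1.0, 0.02, mem_penalty)
cpi_with_l2 = cpi_two_level(1.0, 0.02, l2_penalty, 0.005, mem_penalty)

# Alternative accounting: L2 hits pay 20 cycles, L2 misses pay 20 + 400.
cpi_alt = 1.0 + (0.02 - 0.005) * l2_penalty + 0.005 * (l2_penalty + mem_penalty)

speedup = cpi_l1_only / cpi_with_l2
print(round(cpi_l1_only, 2), round(cpi_with_l2, 2), round(cpi_alt, 2), round(speedup, 1))
```

Both ways of counting the stalls give the same CPI because each splits the same set of misses into two groups; only the grouping differs.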

Note: the miss rates used above are per instruction. When miss rates are given per memory access (for example, for a combined cache), they become miss rates per instruction after multiplying by the number of memory accesses per instruction. Example with 1.2 memory accesses per instruction:

CPU <-> L1 cache <-> L2 cache <-> Memory

100 accesses reach L1; 30 miss in L1 (L1 miss penalty 20 clocks); 10 miss in L2 (L2 miss penalty 50 clocks).

Total stall cycles = L1 stall cycles + L2 stall cycles
                   = L1 misses × L1 miss penalty + L2 misses × L2 miss penalty
                   = 30 × 20 + 10 × 50

Stall cycles per access = Total stall cycles / number of CPU accesses
                        = (30/100) × 20 + (10/100) × 50

where 30/100 is the L1 miss rate and 10/100 is the L2 miss rate.

Stall cycles per instruction = Memory accesses per instruction × Stall cycles per access
                             = 1.2 × ((30/100) × 20 + (10/100) × 50)
                             = (1.2 × 30/100) × 20 + (1.2 × 10/100) × 50

where 1.2 × 30/100 is the L1 miss rate per instruction and 1.2 × 10/100 is the L2 miss rate per instruction.
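The conversion from per-access to per-instruction stalls can be sketched as follows. The function name and argument names are illustrative assumptions; the numbers are the ones from the diagram (100 accesses, 30 L1 misses, 10 L2 misses, penalties of 20 and 50 clocks, 1.2 memory accesses per instruction).

```python
def stall_cycles(accesses: int, l1_misses: int, l2_misses: int,
                 l1_penalty: int, l2_penalty: int,
                 accesses_per_instr: float) -> tuple[float, float]:
    """Return (stall cycles per access, stall cycles per instruction)."""
    # Total stalls: every L1 miss pays the L1 miss penalty,
    # every L2 miss additionally pays the L2 miss penalty.
    total = l1_misses * l1_penalty + l2_misses * l2_penalty
    per_access = total / accesses
    # Scale by memory accesses per instruction to get a per-instruction figure.
    per_instr = accesses_per_instr * per_access
    return per_access, per_instr


per_access, per_instr = stall_cycles(100, 30, 10, 20, 50, 1.2)
print(round(per_access, 2), round(per_instr, 2))
```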

Consider a processor with the following parameters:

|    | Base CPI, no memory stalls | Processor speed | Main memory access time | First-level cache miss rate per instruction | Second-level cache, direct-mapped speed | Global miss rate with second-level cache, direct-mapped | Second-level cache, eight-way set-associative speed | Global miss rate with second-level cache, eight-way set-associative |
|----|-----|-------|--------|----|-----------|------|-----------|------|
| a. | 2.0 | 3 GHz | 125 ns | 5% | 15 cycles | 3.0% | 25 cycles | 1.8% |
| b. | 2.0 | 1 GHz | 100 ns | 4% | 10 cycles | 4.0% | 20 cycles | 1.6% |

(1) Calculate the CPI for the processor in the table using: ① only a first-level cache, ② a second-level direct-mapped cache, and ③ a second-level eight-way set-associative cache.

(2) It is possible to have even more levels in the cache hierarchy. Given the processor above with a second-level, direct-mapped cache, a designer wants to add a third-level cache that takes 50 cycles to access and will reduce the global miss rate to 1.3%. Would this provide better performance? In general, what are the advantages and disadvantages of adding a third-level cache?

(3) In older processors such as the Intel Pentium or Alpha 21264, the second level of cache was external (located on a different chip) from the main processor and the first-level cache. While this allowed for large second-level caches, the latency to access the cache was much higher, and the bandwidth was typically lower because the second-level cache ran at a lower frequency. Assume a 512 KB off-chip second-level cache has a global miss rate of 4%. If each additional 512 KB of cache lowered global miss rates by 0.7%, and the cache had a total access time of 50 cycles, how big would the cache have to be to match the performance of the second-level direct-mapped cache listed in the table? Of the eight-way set-associative cache?

Answer

(1)
a. Memory miss cycles: 125 ns × 3 GHz = 375 clock cycles
   ① Total CPI: 2.0 + 375 × 5% = 20.75
   ② Total CPI: 2.0 + 15 × 5% + 375 × 3% = 14
   ③ Total CPI: 2.0 + 25 × 5% + 375 × 1.8% = 10
b. Memory miss cycles: 100 ns × 1 GHz = 100 clock cycles
   ① Total CPI: 2.0 + 100 × 0.04 = 6.0
   ② Total CPI: 2.0 + 100 × 0.04 + 10 × 0.04 = 6.4
   ③ Total CPI: 2.0 + 100 × 0.016 + 20 × 0.04 = 4.4

(2)
a. Total CPI: 2.0 + 15 × 5% + 50 × 3% + 375 × 1.3% = 9.125
   This would provide better performance, but may complicate the design of the processor. It could lead to more complex cache coherency, increased cycle time, and larger and more expensive chips.
b. Total CPI: 2.0 + 100 × 0.013 + 10 × 0.04 + 50 × 0.04 = 5.7
   This would provide better performance, but may complicate the design of the processor. It could lead to more complex cache coherency, increased cycle time, and larger and more expensive chips.

(3)
a. Total CPI: 2.0 + 50 × 5% + 375 × (4% − 0.7% × n)
   n = 2 -> 1.5 MB L2 cache to match direct-mapped
   n = 4 -> 2.5 MB L2 cache to match 8-way
b. Total CPI: 2.0 + 50 × 0.04 + 100 × (0.04 − 0.007 × n)
   n = 2 -> 1.5 MB L2 cache to match direct-mapped
   n = 5 -> 3 MB L2 cache to match 8-way
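The CPI values in parts (1) and (2) can be reproduced with a small script. This is a sketch under stated assumptions: the `Config` class and its field names are hypothetical, the table values are entered as fractions rather than percentages, and the third-level cache parameters (50 cycles, 1.3% global miss rate) are taken from question (2).

```python
from dataclasses import dataclass


@dataclass
class Config:
    base_cpi: float
    clock_ghz: float
    mem_ns: float
    l1_miss: float         # first-level miss rate per instruction
    dm_cycles: int         # direct-mapped L2 access time in cycles
    dm_global_miss: float  # global miss rate with direct-mapped L2
    sa_cycles: int         # eight-way set-associative L2 access time in cycles
    sa_global_miss: float  # global miss rate with eight-way L2


def cpis(c: Config, l3_cycles: int = 50, l3_global_miss: float = 0.013) -> dict:
    """Return the CPI for each cache arrangement asked about in (1) and (2)."""
    mem = c.mem_ns * c.clock_ghz  # main memory miss penalty in cycles
    return {
        "l1_only": c.base_cpi + c.l1_miss * mem,
        "l2_direct": c.base_cpi + c.l1_miss * c.dm_cycles + c.dm_global_miss * mem,
        "l2_8way": c.base_cpi + c.l1_miss * c.sa_cycles + c.sa_global_miss * mem,
        # Direct-mapped L2 plus a third level: L2 misses pay the L3 access,
        # and only the remaining global misses go to main memory.
        "l3_added": (c.base_cpi + c.l1_miss * c.dm_cycles
                     + c.dm_global_miss * l3_cycles + l3_global_miss * mem),
    }


row_a = cpis(Config(2.0, 3.0, 125, 0.05, 15, 0.030, 25, 0.018))
row_b = cpis(Config(2.0, 1.0, 100, 0.04, 10, 0.040, 20, 0.016))
for name, row in (("a", row_a), ("b", row_b)):
    print(name, {k: round(v, 3) for k, v in row.items()})
```

In both rows the `l3_added` figure is lower than `l2_direct`, which is the basis for the "better performance" conclusion in (2).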

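Note on (3)a:
Let 2.0 + 50 × 5% + 375 × (4% − 0.7% × n) = 14, then n = 2.1.
Let 2.0 + 50 × 5% + 375 × (4% − 0.7% × n) = 10, then n = 3.6.
So n = 2 only approximately matches the direct-mapped cache (CPI 14.25 versus 14), while n = 4 is the smallest integer that matches the eight-way cache.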
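The break-even values of n can be computed directly by solving the CPI equation of part (3) for n. The sketch below uses hypothetical function names and assumes the same model as the answer: a 50-cycle off-chip cache whose global miss rate starts at 4% and drops by 0.7% for each additional 512 KB.

```python
def offchip_cpi(n: float, base_cpi: float, l1_miss: float, mem_cycles: float,
                l2_cycles: float = 50, start_miss: float = 0.04,
                step: float = 0.007) -> float:
    """CPI with an off-chip L2 that has n additional 512 KB blocks."""
    return base_cpi + l1_miss * l2_cycles + mem_cycles * (start_miss - step * n)


def break_even_n(target_cpi: float, base_cpi: float, l1_miss: float,
                 mem_cycles: float, l2_cycles: float = 50,
                 start_miss: float = 0.04, step: float = 0.007) -> float:
    """Solve offchip_cpi(n) == target_cpi for n (not rounded)."""
    needed_miss = (target_cpi - base_cpi - l1_miss * l2_cycles) / mem_cycles
    return (start_miss - needed_miss) / step


# Row a: base CPI 2.0, L1 miss rate 5%, 375-cycle memory penalty.
n_a_dm = break_even_n(14.0, 2.0, 0.05, 375)
n_a_8way = break_even_n(10.0, 2.0, 0.05, 375)
# Row b: base CPI 2.0, L1 miss rate 4%, 100-cycle memory penalty.
n_b_dm = break_even_n(6.4, 2.0, 0.04, 100)
n_b_8way = break_even_n(4.4, 2.0, 0.04, 100)

print(round(n_a_dm, 1), round(n_a_8way, 1), round(n_b_dm, 1), round(n_b_8way, 1))
print(round(offchip_cpi(2, 2.0, 0.05, 375), 2))  # CPI of row a at n = 2
```

The unrounded values make it easy to see how close each integer n in the answer comes to the target CPI; the total cache size is 512 KB × (1 + n).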

Global miss rate (GMR): The fraction of references that miss in all levels of a multilevel cache.

Local miss rate (LMR): The fraction of references to one level of a cache that miss; used in multilevel hierarchies.

Example of global and local miss rates:

CPU <-> L1 <-> L2 <-> L3 <-> Memory

1000 accesses reach L1; 50 miss in L1, 20 miss in L2, and 5 miss in L3.

Cache   L1        L2        L3
GMR     50/1000   20/1000   5/1000
LMR     50/1000   20/50     5/20

L1 GMR = L1 LMR
L2 GMR = L1 LMR × L2 LMR
L3 GMR = L1 LMR × L2 LMR × L3 LMR
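The relationship between global and local miss rates can be sketched in a few lines. The function name is an illustrative assumption; the miss counts are the ones from the example (1000 accesses, then 50, 20, and 5 misses at L1, L2, and L3). Exact fractions are used so the products can be compared without rounding.

```python
from fractions import Fraction


def miss_rates(accesses: int, misses_per_level: list[int]):
    """Return (global miss rates, local miss rates), one entry per cache level."""
    gmr, lmr = [], []
    refs_in = accesses  # references that reach the current level
    for misses in misses_per_level:
        gmr.append(Fraction(misses, accesses))  # relative to all CPU accesses
        lmr.append(Fraction(misses, refs_in))   # relative to accesses at this level
        refs_in = misses                        # only the misses reach the next level
    return gmr, lmr


gmr, lmr = miss_rates(1000, [50, 20, 5])

# The GMR at level k equals the product of the LMRs of levels 1..k.
running = Fraction(1)
products = []
for rate in lmr:
    running *= rate
    products.append(running)

print([str(r) for r in gmr])
print([str(r) for r in lmr])
print(products == gmr)
```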