
Average Memory Access Time (AMAT)

For a single level of cache:

$$\text{AMAT} = \text{Time for a hit} + \text{Miss rate} \times \text{Miss penalty}$$

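For a multilevel cache hierarchy, with the levels arranged as

CPU → L1 cache → L2 cache → … → Ln cache → Memory

each level i is described by its hit time, miss rate, and miss penalty:

Level   Hit time   Miss rate   Miss penalty
L1      T1         M1          P1
L2      T2         M2          P2
…       …          …           …
Ln      Tn         Mn          Pn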

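$$\text{AMAT} = T_1 + M_1 \times P_1 + M_2 \times P_2 + \cdots + M_n \times P_n = T_1 + \sum_{i=1}^{n} M_i P_i$$

For this sum to be correct, $M_i$ must be the global miss rate of level i (misses in Li per memory reference issued by the CPU), and $P_i$ the time to service a miss in Li from the next level down: $P_i = T_{i+1}$ for $i < n$, and $P_n$ is the memory access time.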
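As a check on the formula, here is a minimal Python sketch comparing this flat sum over global miss rates with the equivalent nested form $T_1 + m_1(T_2 + m_2(\cdots + m_n \cdot \text{memory}))$ over local miss rates $m_i$. The function names and the two-level numbers are hypothetical. It assumes that $P_i = T_{i+1}$ for $i < n$ and that $P_n$ is the memory access time, which is how the worked example below applies the formula.

```python
from math import isclose, prod


def global_rates(local_miss_rates):
    """Global miss rate of level i = product of the local miss rates of levels 1..i."""
    return [prod(local_miss_rates[: i + 1]) for i in range(len(local_miss_rates))]


def amat_flat(hit_times, global_miss_rates, memory_time):
    """AMAT = T1 + sum(M_i * P_i), with M_i global and P_i = T_(i+1) (P_n = memory time)."""
    penalties = list(hit_times[1:]) + [memory_time]
    return hit_times[0] + sum(m * p for m, p in zip(global_miss_rates, penalties))


def amat_nested(hit_times, local_miss_rates, memory_time):
    """AMAT = T1 + m1*(T2 + m2*(... + mn*memory)), evaluated from the last level inward."""
    time = memory_time
    for t, m in zip(reversed(hit_times), reversed(local_miss_rates)):
        time = t + m * time
    return time


# Hypothetical two-level hierarchy: hit times (1, 10) cycles,
# local miss rates (0.1, 0.2), memory access time 100 cycles.
T, local, mem = [1, 10], [0.1, 0.2], 100
flat = amat_flat(T, global_rates(local), mem)
nested = amat_nested(T, local, mem)
assert isclose(flat, nested)
print(round(flat, 6), round(nested, 6))  # 4.0 4.0
```

The two forms agree because each global miss rate is the running product of the local miss rates, so expanding the nested form yields exactly the terms of the flat sum.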


Suppose that in 1000 memory references there are 60 misses in the first-level cache, 30 misses in the second-level cache, and 5 misses in the third-level cache. Assume the miss penalty from the L3 cache to memory is 100 clock cycles, the hit time of the L3 cache is 10 clocks, the hit time of the L2 cache is 5 clocks, the hit time of L1 is 1 clock cycle, and there are 1.5 memory references per instruction.

(a) What is the global miss rate of each cache level?
(b) What is the local miss rate of each cache level?
(c) What is the average memory access time?
(d) What is the average number of stall cycles per instruction?


Answer

(a) Global miss rates: L1 = 60/1000 = 0.06, L2 = 30/1000 = 0.03, L3 = 5/1000 = 0.005
(b) Local miss rates: L1 = 60/1000 = 0.06, L2 = 30/60 = 0.5, L3 = 5/30 = 0.167

(c) AMAT = 1 + 0.06 × 5 + 0.03 × 10 + 0.005 × 100 = 2.1 clock cycles
(d) Stall cycles per instruction = (AMAT − L1 hit time) × memory references per instruction = (2.1 − 1) × 1.5 = 1.65 clock cycles
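The answers to (a) through (d) can be recomputed with a short Python sketch. The variable names are illustrative; following answer (c), it treats the L2 hit time, the L3 hit time, and the 100-cycle memory penalty as the miss penalties of L1, L2, and L3 respectively.

```python
refs = 1000
misses = [60, 30, 5]      # misses in L1, L2, L3 per 1000 references
hit_times = [1, 5, 10]    # L1, L2, L3 hit times in clock cycles
memory_penalty = 100      # L3 -> memory miss penalty in clock cycles
refs_per_instr = 1.5

# (a) global miss rate: misses at this level / all CPU memory references
global_rates = [m / refs for m in misses]

# (b) local miss rate: misses at this level / accesses that reach this level
accesses = [refs] + misses[:-1]
local_rates = [m / a for m, a in zip(misses, accesses)]

# (c) AMAT = T1 + sum(global miss rate of level i * time to service it from below)
penalties = hit_times[1:] + [memory_penalty]
amat = hit_times[0] + sum(g * p for g, p in zip(global_rates, penalties))

# (d) stall cycles per instruction: the part of AMAT beyond the L1 hit time,
#     multiplied by the number of memory references per instruction
stalls_per_instr = (amat - hit_times[0]) * refs_per_instr

print(global_rates)                        # [0.06, 0.03, 0.005]
print([round(r, 3) for r in local_rates])  # [0.06, 0.5, 0.167]
print(round(amat, 2), round(stalls_per_instr, 2))  # 2.1 1.65
```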

The average memory access time (AMAT) equation has three components: hit time, miss rate, and miss penalty. For each of the following cache optimizations, indicate which component of the AMAT equation is improved.

(1) Using a second-level cache
(2) Using a direct-mapped cache
(3) Using a 4-way set-associative cache
(4) Using a virtually-addressed cache
(5) Performing hardware pre-fetching using stream buffers
(6) Using a non-blocking cache
(7) Using larger blocks

Answer

(1) miss penalty
(2) hit time
(3) miss rate
(4) hit time
(5) miss rate
(6) miss penalty
(7) miss rate


Assume that main memory accesses take 70 ns and that memory accesses are 36% of all instructions. The following table shows data for L1 caches attached to each of two processors, P1 and P2.

Processor   L1 size   L1 miss rate   L1 hit time
P1          1 KB      11.4%          0.62 ns
P2          2 KB      8.0%           0.66 ns

(1) Assuming that the L1 hit time determines the cycle times for P1 and P2, what are their respective clock rates?
(2) What is the AMAT for each of P1 and P2?
(3) Assuming a base CPI of 1.0, what is the total CPI for each of P1 and P2? Which processor is faster?

For the next three problems, we will consider the addition of an L2 cache to P1, presumably to make up for its limited L1 cache capacity. Use the L1 cache capacities and hit times from the previous table when solving the following problems. The L2 miss rate indicated is its local miss rate.

L2 size   L2 miss rate   L2 hit time
512 KB    98%            3.22 ns

(4) What is the AMAT for P1 with the addition of an L2 cache? Is the AMAT better or worse with the L2 cache?
(5) Assuming a base CPI of 1.0, what is the total CPI for P1 with the addition of an L2 cache?
(6) Which processor is faster, now that P1 has an L2 cache? If P1 is faster, what miss rate would P2 need in its L1 cache to match P1's performance? If P2 is faster, what miss rate would P1 need in its L1 cache to match P2's performance?

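Answer

            (1) Clock rate   (2) AMAT                 (3) CPI
P1          1.61 GHz         8.60 ns = 13.87 cycles   18.5
P2          1.52 GHz         6.26 ns = 9.48 cycles    12.54

(3) P2 is faster: its time per instruction is 12.54 × 0.66 ≈ 8.28 ns, versus 18.5 × 0.62 ≈ 11.47 ns for P1.

            (4) AMAT                         (5) CPI
P1 with L2  8.81 ns = 14.21 cycles (worse)   18.96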

(6) P1 with L2 cache: CPI = 18.96. P2: CPI = 12.54. P2 is still faster than P1 even with an L2 cache (18.96 × 0.62 ≈ 11.76 ns per instruction versus 12.54 × 0.66 ≈ 8.28 ns). To match P2's performance, P1 would need an L1 miss rate of about 7.8%.

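Work for (3): each instruction makes 1 + 0.36 = 1.36 memory references (one instruction fetch plus 0.36 data accesses), and the 70 ns memory latency is converted to clock cycles by dividing by the cycle time.

CPI_P1 = 1 + (1.36 × 0.114 × 70/0.62) = 18.5
CPI_P2 = 1 + (1.36 × 0.08 × 70/0.66) = 12.54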
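A small Python sketch that reproduces answers (1) to (3) from the table. It assumes, as the worked solution does, 1.36 memory references per instruction (one instruction fetch plus 0.36 data accesses) and a 70 ns miss penalty converted to each processor's own cycles without rounding; the dictionary layout and names are illustrative.

```python
MEM_NS = 70.0                # main-memory access time
REFS_PER_INSTR = 1.0 + 0.36  # instruction fetch + data accesses per instruction

procs = {
    "P1": {"hit_ns": 0.62, "miss_rate": 0.114},
    "P2": {"hit_ns": 0.66, "miss_rate": 0.080},
}

results = {}
for name, p in procs.items():
    cycle_ns = p["hit_ns"]                            # (1) L1 hit time sets the cycle time
    clock_ghz = 1.0 / cycle_ns                        # 1 / ns = GHz
    amat_ns = p["hit_ns"] + p["miss_rate"] * MEM_NS   # (2) AMAT in ns
    amat_cycles = amat_ns / cycle_ns
    # (3) base CPI 1.0 plus memory stall cycles per instruction
    cpi = 1.0 + REFS_PER_INSTR * p["miss_rate"] * MEM_NS / cycle_ns
    time_per_instr_ns = cpi * cycle_ns                # what actually decides "faster"
    results[name] = (clock_ghz, amat_ns, amat_cycles, cpi, time_per_instr_ns)
    print(f"{name}: {clock_ghz:.2f} GHz, AMAT {amat_ns:.2f} ns = {amat_cycles:.2f} cycles, "
          f"CPI {cpi:.2f}, {time_per_instr_ns:.2f} ns/instr")

faster = min(results, key=lambda n: results[n][4])
print("faster:", faster)
```

Comparing time per instruction rather than CPI matters here because the two processors run at different clock rates.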

Work for (4):
L2 global miss rate = 0.114 × 0.98 = 0.11172
AMAT_P1 = 0.62 + 0.114 × 3.22 + 0.11172 × 70 = 8.81 ns

Work for (5):
CPI_P1 = 1 + 1.36 × (0.114 × 3.22/0.62 + 0.11172 × 70/0.62) = 18.96

Work for (6): let M be P1's L1 cache miss rate. Then

CPI_P1 (with L2 cache) = 1 + 1.36 × (M × 3.22/0.62 + M × 0.98 × 70/0.62) = 1 + 157.54 M

For P1's performance to match P2's, both must have the same time per instruction (CPI × cycle time):

(1 + 157.54 M) × 0.62 = 12.54 × 0.66  →  M ≈ 0.0784, i.e. about 7.8%
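The L2 extension and the break-even miss rate in (6) can be sketched the same way in Python. The sketch assumes, as the problem states, that the 98% L2 miss rate is local, so the L2 global miss rate is the L1 miss rate times 0.98; the helper name `p1_with_l2` is hypothetical, and the break-even equation is solved in closed form because CPI is linear in the L1 miss rate.

```python
MEM_NS = 70.0
REFS_PER_INSTR = 1.36
P1_CYCLE_NS, P1_L1_MISS = 0.62, 0.114
L2_HIT_NS, L2_LOCAL_MISS = 3.22, 0.98
P2_CYCLE_NS = 0.66
P2_CPI = 1.0 + REFS_PER_INSTR * 0.08 * MEM_NS / P2_CYCLE_NS  # from part (3)


def p1_with_l2(l1_miss):
    """Return (AMAT in ns, CPI) for P1 with the L2 cache, given P1's L1 miss rate."""
    l2_global_miss = l1_miss * L2_LOCAL_MISS
    stall_ns_per_ref = l1_miss * L2_HIT_NS + l2_global_miss * MEM_NS
    amat_ns = P1_CYCLE_NS + stall_ns_per_ref
    cpi = 1.0 + REFS_PER_INSTR * stall_ns_per_ref / P1_CYCLE_NS
    return amat_ns, cpi


amat_ns, cpi = p1_with_l2(P1_L1_MISS)  # parts (4) and (5)
print(f"AMAT {amat_ns:.2f} ns = {amat_ns / P1_CYCLE_NS:.2f} cycles, CPI {cpi:.2f}")

# Part (6): CPI_P1 = 1 + k*M, so solve (1 + k*M) * 0.62 = CPI_P2 * 0.66 for M.
k = p1_with_l2(1.0)[1] - 1.0  # slope of CPI in M
target_cpi = P2_CPI * P2_CYCLE_NS / P1_CYCLE_NS
break_even = (target_cpi - 1.0) / k
print(f"k = {k:.2f}, break-even L1 miss rate = {break_even:.4f}")
```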


To capture the fact that the time to access data for both hits and misses affects performance, designers often use average memory access time (AMAT) as a way to examine alternative cache designs. Average memory access time is the average time to access memory considering both hits and misses and the frequency of different accesses; it is equal to the following:

$$\text{AMAT} = \text{Time for a hit} + \text{Miss rate} \times \text{Miss penalty}$$

AMAT is useful as a figure of merit for different cache systems.

(1) Find the AMAT for a processor with a 2 ns clock, a miss penalty of 20 clock cycles, a miss rate of 0.05 misses per reference, and a cache access time (including hit detection) of 1 clock cycle. Assume that the read and write miss penalties are the same and ignore other write stalls.
(2) Suppose we can improve the miss rate to 0.03 misses per reference by doubling the cache size. This causes the cache access time to increase to 1.2 clock cycles. Using the AMAT as a metric, determine if this is a good trade-off.
(3) If the cache access time determines the processor's clock cycle time, which is often the case, AMAT may not correctly indicate whether one cache organization is better than another. If the processor's clock cycle time must be changed to match that of a cache, is this a good trade-off? Assume the processors are identical except for the clock rate and the number of cache miss cycles; assume 1.5 references per instruction and a CPI without cache misses of 2. The miss penalty is 20 cycles for both processors.


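Answer

(1) AMAT = 2 ns + 0.05 × (20 × 2 ns) = 4 ns
(2) AMAT = (1.2 × 2 ns) + (20 × 2 ns × 0.03) = 2.4 ns + 1.2 ns = 3.6 ns
Yes, it is a good trade-off: AMAT drops from 4 ns to 3.6 ns.
(3) With the clock cycle stretched to match the cache (1.2 × 2 ns = 2.4 ns):
Execution time_old = 2 ns × IC × (2 + 1.5 × 20 × 0.05) = 7 IC ns
Execution time_new = 2.4 ns × IC × (2 + 1.5 × 20 × 0.03) = 6.96 IC ns
So it is still a good choice, although the improvement is now very small.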
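A quick Python sketch of parts (1) to (3) under the problem's assumptions (20-cycle miss penalty for both designs, 1.5 references per instruction, base CPI of 2); the function names are illustrative. Part (3) differs from part (2) in that every cycle, including the base CPI of 2, is charged at the stretched 2.4 ns clock.

```python
CLOCK_NS = 2.0
MISS_PENALTY_CYCLES = 20
REFS_PER_INSTR = 1.5
BASE_CPI = 2.0


def amat_ns(hit_cycles, miss_rate, clock_ns=CLOCK_NS):
    """AMAT = hit time + miss rate * miss penalty, converted to ns at the given clock."""
    return hit_cycles * clock_ns + miss_rate * MISS_PENALTY_CYCLES * clock_ns


def time_per_instr_ns(clock_ns, miss_rate):
    """Execution time / IC = cycle time * (base CPI + memory stall cycles per instruction)."""
    return clock_ns * (BASE_CPI + REFS_PER_INSTR * MISS_PENALTY_CYCLES * miss_rate)


print(f"(1) AMAT = {amat_ns(1.0, 0.05):.2f} ns")                       # 4.00
print(f"(2) AMAT = {amat_ns(1.2, 0.03):.2f} ns")                       # 3.60
print(f"(3) old: {time_per_instr_ns(2.0, 0.05):.2f} ns per instruction")  # 7.00
print(f"(3) new: {time_per_instr_ns(2.4, 0.03):.2f} ns per instruction")  # 6.96
```

The gap in (3), 7 versus 6.96 ns per instruction, is much narrower than the AMAT comparison in (2) suggests, which is the point of the question.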

For a data cache with a 92% hit rate and a 2-cycle hit latency, calculate the average memory access latency. Assume that the latency to memory and the cache miss penalty together is 124 cycles. Note: The cache must be accessed after memory returns the data.

Answer: AMAT = 2 + (1 − 0.92) × 124 = 2 + 0.08 × 124 = 11.92 cycles

Which of the following is generally true about a design with multiple levels of caches?

1. First-level caches are more concerned about hit time, and second-level caches are more concerned about miss rate.
2. First-level caches are more concerned about miss rate, and second-level caches are more concerned about hit time.

Answer: 1