
CS425 – Computer System Design
Lecture 11 – Control Hazards + ILP and Software
Shankar Balachandran
Dept. of Computer Science and Engineering
IIT-Madras
shankar@cse.iitm.ernet.in

Recap
Three Hazards
• Structural
• Data
• Control
[Figure: pipeline diagrams, instruction order against the stages IF, Reg, ALU, Dm, Reg. Structural hazard: Load, Inst 1, Inst 2, Inst 3. Data hazard on r1: add r1, r2, r3; sub r4, r1, r3; and r6, r1, r7; or r8, r1, r9; xor r10, r1, r11.]
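Data dependences between instructions I, J, K (in program order):
• Read after write (RAW): I: add r1,r2,r3   J: sub r4,r1,r3
• Write after read (WAR): I: sub r4,r1,r3   J: add r1,r2,r3   K: mul r6,r1,r7
• Write after write (WAW): I: sub r1,r4,r3   J: add r1,r2,r3   K: mul r6,r1,r7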
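
The I/J pairs above can be checked mechanically. The sketch below is only an illustration: the function name `classify` and the (destination, sources) tuple encoding are assumptions made here, not part of the lecture.

```python
def classify(first, second):
    """Return the hazard classes between two instructions in program order.

    Each instruction is a (dest, sources) pair of register names.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    kinds = set()
    if dest1 in srcs2:      # second reads what first writes
        kinds.add("RAW")
    if dest2 in srcs1:      # second writes what first reads
        kinds.add("WAR")
    if dest1 == dest2:      # both write the same register
        kinds.add("WAW")
    return kinds


# The three I/J pairs from the recap slide.
raw_pair = (("r1", ("r2", "r3")), ("r4", ("r1", "r3")))  # add ; sub
war_pair = (("r4", ("r1", "r3")), ("r1", ("r2", "r3")))  # sub ; add
waw_pair = (("r1", ("r4", "r3")), ("r1", ("r2", "r3")))  # sub ; add

for name, pair in (("I1", raw_pair), ("I2", war_pair), ("I3", waw_pair)):
    print(name, sorted(classify(*pair)))
```

Only register names are compared; memory dependences would need address information as well.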
9/3/2006
Recap
[Figure: pipeline diagrams (IF, Reg, ALU, Dm, Reg) for two sequences that depend on r1. First: add r1, r2, r3; sub r4, r1, r3; and r6, r1, r7; or r8, r1, r9; xor r10, r1, r11. Second: lw r1, 0(r2); sub r4, r1, r6; and r6, r1, r7; or r8, r1, r9.]

Reference
• Hennessy and Patterson

Software Scheduling
Try producing fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f in memory.

Slow code:
    LW   Rb,b
    LW   Rc,c
    ADD  Ra,Rb,Rc
    SW   a,Ra
    LW   Re,e
    LW   Rf,f
    SUB  Rd,Re,Rf
    SW   d,Rd

Fast code:
    LW   Rb,b
    LW   Rc,c
    LW   Re,e
    ADD  Ra,Rb,Rc
    LW   Rf,f
    SW   a,Ra
    SUB  Rd,Re,Rf
    SW   d,Rd

Instruction Set Connection
• What is exposed about this organizational hazard in the instruction set?
• k cycle delay?
  – bad, CPI is not part of ISA
• k instruction slot delay
  – load should not be followed by use of the value in the next k instructions
• Nothing, but code can reduce run-time delays
• MIPS did the transformation in the assembler
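
To see why the reordered sequence is faster, the sketch below counts load-use stalls in both versions. It assumes a one-slot load delay (k = 1) and full forwarding between ALU instructions; the tuple encoding and the name `load_use_stalls` are illustrative choices, not something the slides define.

```python
def load_use_stalls(program):
    """Count stalls caused by using a loaded value in the very next instruction.

    Each instruction is (opcode, dest_register_or_None, source_registers).
    Assumes a one-instruction load delay and forwarding everywhere else.
    """
    stalls = 0
    previous = None
    for op, dest, srcs in program:
        if previous is not None and previous[0] == "LW" and previous[1] in srcs:
            stalls += 1
        previous = (op, dest, srcs)
    return stalls


SLOW = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()), ("SUB", "Rd", ("Re", "Rf")),
    ("SW", None, ("Rd",)),
]
FAST = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()), ("SW", None, ("Ra",)),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

print("slow code stalls:", load_use_stalls(SLOW))  # 2 (ADD after LW Rc, SUB after LW Rf)
print("fast code stalls:", load_use_stalls(FAST))  # 0
```

Both sequences have eight instructions, so under these assumptions the fast version saves the two stall cycles without adding any work.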
Historical Perspective: Microprogramming
[Figure: Main Memory holds the user program plus data (ADD, SUB, AND, ..., DATA), labelled "this can change!". The CPU contains the control memory and the execution unit. Caption: "one of these is mapped into one of these".]
• Supported complex instructions as a sequence of simple micro-inst (RTs)
• Pipelined micro-instruction processing, but very limited view
• Could not reorganize macroinstructions to enable pipelining

Control Hazard on Branches => Three Stage Stall
[Figure: pipeline diagram (IF, Reg, ALU, Dm, Reg) for the sequence below.]
    10: beq r1,r3,36
    14: and r2,r3,r5
    18: or  r6,r1,r7
    22: add r8,r1,r9
    36: xor r10,r1,r11

Example: Branch Stall Impact
• If 30% branch, Stall 3 cycles significant
• Two part solution:
  – Determine branch taken or not sooner, AND
  – Compute taken branch address earlier
• MIPS branch tests if register = 0 or ≠ 0
• MIPS Solution:
  – Move Zero test to ID/RF stage
  – Adder to calculate new PC in ID/RF stage
  – 1 clock cycle penalty for branch versus 3

Compiler Techniques for ILP
• Goals
  – Static issue
    • Pipeline and Multiple processor
  – Dynamic techniques
    • Dynamic issue but static scheduling
  – Branch prediction
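
The branch stall figures quoted above (30% branches, a 3-cycle versus a 1-cycle penalty) can be turned into an effective CPI. This is a rough sketch that assumes an otherwise ideal base CPI of 1 and that every branch pays the full penalty; neither assumption is stated on the slide.

```python
def effective_cpi(base_cpi, branch_fraction, branch_penalty):
    """Average cycles per instruction when every branch stalls for branch_penalty cycles."""
    return base_cpi + branch_fraction * branch_penalty


BASE_CPI = 1.0          # assumption: ideal pipeline, one instruction per cycle
BRANCH_FRACTION = 0.30  # from the slide: 30% of instructions are branches

cpi_three_cycle = effective_cpi(BASE_CPI, BRANCH_FRACTION, 3)  # branch resolved late
cpi_one_cycle = effective_cpi(BASE_CPI, BRANCH_FRACTION, 1)    # zero test moved to ID/RF

print(f"3-cycle penalty: CPI = {cpi_three_cycle:.2f}")
print(f"1-cycle penalty: CPI = {cpi_one_cycle:.2f}")
print(f"speedup from the MIPS solution: {cpi_three_cycle / cpi_one_cycle:.2f}x")
```

Under these assumptions the three-cycle stall nearly doubles the CPI, which is why resolving the branch in ID/RF matters.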
Pipeline Scheduling
• Find a sequence of unrelated instructions to fill the pipeline
  – Dependent instruction must be separated by latency of the source instruction
• Some assumptions
  – Five stage pipeline, one cycle delay for branch
  – Functional units nicely pipelined. Matches Inst Line
  – Integer ALU – 0 cycle latency; Integer load – 1 cycle latency

Example
    for (i = 1000; i > 0; i--)
        x[i] = x[i] + s;
• Why is this loop style preferred?
• Straightforward MIPS code

Without Any Schedule
[Figure: MIPS code for the loop without scheduling; the only surviving annotation is "10".]

Schedule The Loop
• Swap DADDUI with S.D
  – Change the address to 8(R1)
• Not trivial
  – S.D depends on DADDUI
  – Compilers will not easily interchange
  – Smarter compilers can perform symbolic computation
• Uses delayed branches
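
The offset change is the "symbolic computation" a compiler has to do when it moves S.D below DADDUI. A small sketch, assuming the pointer is decremented by 8 bytes per iteration and the original store used offset 0 (both are assumptions here, since the code figure did not survive):

```python
DECREMENT = 8  # assumption: DADDUI R1,R1,#-8, one double per iteration


def store_address_before(r1):
    """Address written by S.D F4,0(R1) when it runs before DADDUI."""
    return r1 + 0


def store_address_after(r1):
    """Address written by S.D F4,8(R1) when it runs after DADDUI."""
    r1 = r1 - DECREMENT   # DADDUI has already updated the pointer
    return r1 + 8         # the adjusted displacement compensates for it


# The two orderings touch the same memory location for every pointer value tried.
same = all(store_address_before(r1) == store_address_after(r1)
           for r1 in range(8, 8008, 8))
print("addresses match:", same)
```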
Analysis
• 6 cycles per iteration
  – 3 on load, store and add
  – 3 on DADDUI, BNE and Stall
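
The cycle counts can be reproduced with a small in-order issue model. Everything concrete in this sketch is an assumption: the instruction sequences follow the usual textbook form of this loop (the slide's code did not survive), the stall table is inferred from the per-instruction stall counts used elsewhere in these slides, and an empty branch delay slot is charged one cycle.

```python
# Stall cycles needed between a producer and a dependent consumer.
# Assumption: L.D -> ADD.D 1, ADD.D -> S.D 2, DADDUI -> BNE 1, all others 0.
STALLS = {("L.D", "ADD.D"): 1, ("ADD.D", "S.D"): 2, ("DADDUI", "BNE"): 1}


def cycles(program):
    """Issue instructions in order and return the cycles one iteration takes.

    Each instruction is (opcode, dest_register_or_None, source_registers).
    """
    last_writer = {}  # register -> (opcode, issue cycle) of its latest writer
    cycle = 0
    for op, dest, srcs in program:
        earliest = cycle + 1
        for reg in srcs:
            if reg in last_writer:
                prod_op, prod_cycle = last_writer[reg]
                gap = STALLS.get((prod_op, op), 0)
                earliest = max(earliest, prod_cycle + 1 + gap)
        cycle = earliest
        if dest is not None:
            last_writer[dest] = (op, cycle)
    # One-cycle branch delay: an unfilled slot after a final BNE costs a cycle.
    if program[-1][0] == "BNE":
        cycle += 1
    return cycle


UNSCHEDULED = [
    ("L.D", "F0", ("R1",)),
    ("ADD.D", "F4", ("F0", "F2")),
    ("S.D", None, ("F4", "R1")),
    ("DADDUI", "R1", ("R1",)),
    ("BNE", None, ("R1", "R2")),
]
SCHEDULED = [
    ("L.D", "F0", ("R1",)),
    ("DADDUI", "R1", ("R1",)),
    ("ADD.D", "F4", ("F0", "F2")),
    ("BNE", None, ("R1", "R2")),
    ("S.D", None, ("F4", "R1")),  # fills the branch delay slot, offset 8(R1)
]

print("unscheduled:", cycles(UNSCHEDULED))  # 10
print("scheduled:  ", cycles(SCHEDULED))    # 6
```

The model tracks only register read-after-write dependences, which is enough for this loop but not a general pipeline simulator.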
Unrolled Loop
• Need to get more operations done per loop
  – Cut down on per iteration overhead
• Loop unrolling
  – Replicate loop body many times
  – Adjust loop termination code
  – Can also be used to improve scheduling
• One iteration independent of another
  – Example next slide

[Figure: loop code unrolled four times]
• 4 times unrolled
• No registers reused
• R1 = 32q (R1 is a multiple of 32, so the iteration count is a multiple of 4)
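
At source level, unrolling four times looks like the sketch below. The function names are made up for illustration, and the unrolled version assumes the element count is a multiple of 4, mirroring the R1 = 32q condition on the slide.

```python
def add_scalar_rolled(x, s):
    """Original loop: for (i = n; i > 0; i--) x[i] = x[i] + s, with n = len(x) - 1."""
    i = len(x) - 1
    while i > 0:
        x[i] = x[i] + s
        i -= 1
    return x


def add_scalar_unrolled4(x, s):
    """Same computation with the body replicated four times per loop test."""
    n = len(x) - 1
    if n % 4 != 0:
        raise ValueError("element count must be a multiple of 4")
    i = n
    while i > 0:
        x[i] = x[i] + s          # the four bodies are independent of one another
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i -= 4                   # one decrement and one test per four elements
    return x


data = [float(v) for v in range(1001)]  # x[1..1000] as in the lecture's loop
rolled = add_scalar_rolled(list(data), 2.5)
unrolled = add_scalar_unrolled4(list(data), 2.5)
print("results identical:", rolled == unrolled)
```

The loop overhead (decrement and branch) is paid 250 times instead of 1000, which is the "cut down on per iteration overhead" point.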

Analysis
• Loop is unrolled
  – Not scheduled yet
• Many dependent instructions follow source ones
  – Stalls
• 28 cycles for 4 operations
  – Stalls: #L.D (4) * 1 + #ADD.D (4) * 2 + #DADDUI (1) * 1 + #BNE (1) * 1
  – 14 Instructions
  – Slower than scheduled version of original loop

Unroll and Schedule
• Needs clever analysis
  – Symbolic computation
[Figure: unrolled and scheduled loop code]
14 cycles for 4 elements
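
The numbers on these two slides can be checked with a few lines of arithmetic. The sketch only restates the slide's own counts; the variable names are illustrative.

```python
# Stall terms from the slide: 4 L.D x 1, 4 ADD.D x 2, 1 DADDUI x 1, 1 BNE x 1.
stall_cycles = 4 * 1 + 4 * 2 + 1 * 1 + 1 * 1
instruction_count = 14
unrolled_unscheduled = stall_cycles + instruction_count  # cycles per 4 elements

cycles_per_element = {
    "scheduled, not unrolled": 6 / 1,
    "unrolled, not scheduled": unrolled_unscheduled / 4,
    "unrolled and scheduled": 14 / 4,
}

for version, cost in cycles_per_element.items():
    print(f"{version}: {cost} cycles per element")
```

At 7 cycles per element the unrolled but unscheduled loop is worse than the 6 cycles of the scheduled original, and scheduling the unrolled body brings it down to 3.5.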
What Did We Do?
• Find out that S.D can be moved after DADDUI and BNE
  – Also adjust the S.D offset
• Determine that loop can be unrolled
• Use different registers
• Eliminate extra test and branch
  – Change termination code
• Reorder loads and stores
  – Check for memory aliasing
• Schedule the code
  – Preserve dependencies
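
The aliasing check for the unrolled loop is simple because every access uses the same base register. A minimal sketch, assuming 8-byte doubles and displacements 0, -8, -16, -24 for the four copies of the body (the displacements are an assumption, since the unrolled code figure is missing):

```python
from itertools import combinations


def may_overlap(offset_a, offset_b, size=8):
    """True if two size-byte accesses off the same base register share a byte."""
    return abs(offset_a - offset_b) < size


OFFSETS = [0, -8, -16, -24]  # assumption: one displacement per unrolled body

# Loads and stores from different bodies can be reordered only if none overlap.
safe_to_reorder = not any(may_overlap(a, b) for a, b in combinations(OFFSETS, 2))
print("safe to reorder across bodies:", safe_to_reorder)
```

Accesses through different base registers cannot be compared this way, which is where real alias analysis becomes hard.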