ENEE 350H - Fall 2004

Homework 2

(Due in Class on Thursday, September 23th)

Q. 1 Match the features in the left column with the abstraction level from the right column whose
language may contain that feature.

FOR loops Instruction set architecture
Machine instruction names written in text Digital logic level
SPARC machine code High­level language level
Boolean expressions Assembly language level
Direct execution Micro­architecture

Q. 2 Answer true or false:
(a) The hardware for a superscalar architecture is more complex than for a VLIW.
(b) Interpretation results in faster execution than translation.
(c) Concurrency between processes is a feature provided at the ISA level.
(d) Moore's law states that the speed of semiconductor circuits roughly doubles every eighteen
(e) Microprogramming is a more flexible and modular way of deriving control hardware than direct execution.
(f) Arrays processors are useful in the same application domains as vector processors.
(g) Message passing multiprocessors are almost always built on top of distributed memory hardware.
(h) Bypassing is a method to reduce the stalls due to control hazards in pipelined processors.


Q. 3 A pipelined superscalar computer that issues up to 2 instructions per cycle has a 800Mhz
clock speed. Dependences and resource conflicts, however, prevent maximum­rate issue: on average
only 1.2 instructions are issued per cycle. How long will it take a program with 3 billion instructions
to execute if the operating system allocates 30% of the CPU's time to the program?

Q. 4 A 5­stage single pipeline computer resolves the direction and target of a conditional branch
instruction at the end of its third pipeline stage. How many instruction issue opportunities are lost
on a branch mis-predict? 


Q. 5.  Suppose you have a non-pipelined DLX computer where every instruction takes five cycles to execute.  You run the following program on the machine where the initial values of R2 and R3 are 0 and 40 respectively.

_loop:  LW            R1, 0(R3)             
            SUBI        R3, R3, 4              
            ADD         R2, R2, R3           
            BNEZ       R3, _loop        

a.  What does this program fragment do? (1-2 sentences)

b.  How many iterations does this loop execute before it exits?  An iteration is defined as executing all the instructions in the loop once.

c.  How many cycles will it take to execute the code fragment on this non-pipelined machine?

For parts d. to f. suppose this machine is pipelined into five stages (Fetch, Decode, Execute, Memory, and Writeback).  Each stage takes one cycle to complete.  The machine has no bypass paths and hence all data hazards are resolved by stalling.  In addition, the machine has no branch prediction and hence all control hazards are also resolved by stalling.  Branch conditions are evaluated when the branch instruction reaches the execute stage.

d.  Show how many cycles each instruction must stall because it has a hazard with some previous instruction by completing the following table.  If an instruction Y is stalled because some earlier instruction X is also stalled then do not count that as a stall for Y.  In the last column write either "data" or "control" for each instruction that stalls, depending on the type of the hazard which caused it to stall.

Instruction Number of stall cycles (max across iterations) Type of hazard (if stall is non-zero)
LW           R1, 0(R3)     
SUBI        R3, R3, 4    
ADD         R2, R2, R3    
BNEZ       R3, _loop    

e. How many cycles does the program fragment take to execute on this pipelined processor?  Take care that not all iterations require an equal number cycles to execute.  The number of cycles for a code fragment on a pipelined processor is defined as the difference between the fetch cycle number of the first instruction after the fragment and the the fetch cycle number of the first instruction in the fragment.

f. What would have been the speedup on a hypothetical ideal pipeline that had no stalls vs. a non-pipelined machine?  What is the actual speedup for the real pipeline with stalls vs a non-pipelined machine?

For parts g. to h.  assume that the machine has resolved all data hazards using data forwarding along bypass paths.  Further suppose a branch predictor that predicts "take the branch" on every branch instruction is added. When the prediction is correct, no stall is needed; but when it is incorrect, instructions in the pipeline along the mis-predicted path are quashed and then the correct subsequent instructions are fetched into the pipeline.

g.  How many cycles does the program take to execute on this improved pipeline?  As before, take care that not all iterations may take an equal number cycles to execute.

h. What is the speedup for this improved real pipeline vs a non-pipelined machine?