ENEE 446 (sec 201) - Spring 2004

Project (Due on Monday, April 26, 2004)

You are to modify the DLXSIM source code to incorporate new functionality into the simulator and then test it by collecting some statistics. Currently DLXSIM models branches by assuming a single-cycle branch delay slot with branches resolved in the ID stage, which implies that no stalls are needed.  Your project is to modify the code to simulate a machine without a branch delay slot; and instead handle branches by each of the following techniques: flushing, predicted-not-taken and dynamic branch prediction using a branch target buffer, depending on the command-line option used.  Assume that branches continue to be resolved in the ID stage in your new branching models.

The details on the DLX architecture and simulator are available elsewhere on the class website. 

Command-line options:

You should add the following new command-line options to DLXSIM:

Command-line option Behavior
-flushing No delay slot; handle branches by flushing the pipeline.
-pred-not-taken No delay slot; handle branches by always predicting branches to be not taken.
-dyn-branch-pred No delay slot; handle branches by using dynamic branch prediction using a BTB with a 2-bit predictor.

You should check that no more than one of the above three flags is used by the user at one time.  If more than one is used, report an appropriate error to the user and terminate DLXSIM.  If none of the above three flags is used, then continue to model a 1-cycle branch delay slot as is the current behavior.

Step 1:
Modify the DLX simulator to simulate a machine without a branch delay slot, and instead using flushing to handle branches.

Step 2:
Modify the DLX simulator to handle branches as predicted-not-taken, instead of flushing..

Step 3:
Modify the DLX simulator to handle branches using dynamic branch prediction using a branch target buffer (BTB) having a 2-bit history counter, instead of a static prediction in step 2. The BTB is like the one described in the first part of section 3.5 in the textbook, and the 2-bit history counter is maintained using the 2-bit branch prediction scheme in section 3.4. 

Implementation method:
You should modify the dlxsim source to implement the new functionality.  You can implement the BTB in software as an array of structures, with one array entry (structure) per BTB entry.  Each BTB entry (structure) will have four fields: valid bit, tag, predicted target, and 2-bit counter.  At start time when you load the application file into memory, you have to initialize all the entries in the BTB as invalid. Also initialize any extra runtime counters (see hint below) to zero.

Hint: It is important to realize that the DLX simulator is a timing simulator for a DLX 5-stage pipeline, and not a stage-by-stage pipeline simulator.  Thus for most of your modifications you only need to change the timing computed by the simulator and not its cycle-by-cycle execution.  (The only exception is in removing the branch-delay slot.)  You need not modify the values of the timing counters already in DLX.  It is cleanest instead to maintain a separate counter which stores the difference in the number of cycles versus the existing timing.  (This could be positive or negative).  At the end of DLXSIM, modify the final printout of runtime to report the time as the original time + difference in cycles.

Flow-chart for 2-bit scheme:
Below is a pseudo-code for the workings of the BTB with a 2-bit predictor.  It is meant as a modification of figure 3.19 in the textbook for a 2-bit scheme, presented as pseudo-code instead of a flow-chart.  Recall that a 2-bit scheme stores both taken and not-taken branches in the BTB, unlike a 1-bit scheme which stores only taken branches.  The code below is not meant to be an exact description of the workings of the hardware.  It is only meant to aid in your intuitive understanding of a BTB with 2-bit counter.

Send PC to memory and BTB;
if
(entry found in BTB) {
    Send out predicted PC; /* start fetching at predicted PC */
    /* No need to check if instr is a branch. It must be. */

    if (predicted target != resolved target) { /* Incorrect prediction */
        kill subsequent fetched instructions;
        restart fetch at resolved target;
    }

    if (branch taken) {
        new-2-bit-state = saturating increment of 2-bit state in BTB entry;
    }
    else { /* Branch not taken */
        new-2-bit-state = saturating decrement of 2-bit state in BTB entry;
    }

    update BTB entry with new-2-bit-state and the resolved target;
}
else {/* Entry not in BTB */
    if (instr is branch) {
        if (branch is taken) {
            kill subsequent fetched instructions;
            restart fetch at resolved target;
            prediction bits = 10;
        }
        else {
            prediction bits = 01;
        }
        insert BTB entry with branch addr, resolved target
            and prediction bits;
    }
}
 

Configuration Header File:

Define in a new C header file the following using #define macros:

 

- NUM_BITS_IN_BTB_INDEX: 8
 

The above macro is a constant storing the number of low-order bits used to index into the branch-target buffer. Assume that the instruction addresses are 4-byte-aligned, and thus the lowest two bits of the instruction address are not used to index the BTB.  The next higher NUM_BITS_IN_BTB_INDEX bits are used as an index.  Thus the highest (32 - NUM_BITS_IN_BTB_INDEX - 2) bits are stored in the tag.  The predicted PC field in the BTB is a full byte address.

Example program:
You are provided with a benchmark program called laplace.s to run it on dlxsim and produce the statistics. You will find this file under the /dlx/text/ directory. Report all your runtime numbers on this program.

The example DLX assembly program given to you assumes a 1-cycle delay slot, which is usually filled by a NOP by the (unoptimized) DLX compiler I used.  To test the un-modified simulator, you should use the example program without modification.  However, to test your modified simulator, you should modify the example program as well to manually remove all delay-slot NOPs and move all non-NOP delay slot instructions (if any) to before the branch.

 

Modularity:
It is important to add in your changes in a modular manner. In other words, make minimal changes to existing files, and add in your new code in separate files to the extent possible.  (**CREDIT WILL BE DEDUCTED FOR NOT FOLLOWING THIS RULE**).  

 

Statistics:
1.) Report the total runtime for four cases:

2) Vary the size of the BTB table by varying NUM_BITS_IN_BTB_INDEX to 4,5,6,7,8,and 9.  Recompile and collect the runtime each time for the dynamic branch prediction case, and report it on a 2-D graph plotting NUM_BITS_IN_BTB_INDEX versus runtime.  Also report the cycle count numbers in a table.


Submissions:
Please submit the following:
- Statistics
- List of files you added; and those you modified. 
- Printout of the above code files.  In the modified files, highlight your modifications.

 

You will need to give a demonstration of the software to the grader (Steve Haga), who is a senior graduate student.  He will contact you via the mailing list with the scheduling information.