ENEE 446 (sec 201) - Spring 2004
Project (Due on Monday, April 26, 2004)
You are to modify the DLXSIM source code to incorporate new functionality into the simulator and then test it by collecting some statistics. Currently DLXSIM models branches by assuming a single-cycle branch delay slot with branches resolved in the ID stage, which implies that no stalls are needed. Your project is to modify the code to simulate a machine without a branch delay slot; and instead handle branches by each of the following techniques: flushing, predicted-not-taken and dynamic branch prediction using a branch target buffer, depending on the command-line option used. Assume that branches continue to be resolved in the ID stage in your new branching models.
The details on the DLX architecture and simulator are available elsewhere on the class website.
Command-line options:
You should add the following new command-line options to DLXSIM:
| Command-line option | Behavior |
| -flushing | No delay slot; handle branches by flushing the pipeline. |
| -pred-not-taken | No delay slot; handle branches by always predicting branches to be not taken. |
| -dyn-branch-pred | No delay slot; handle branches by using dynamic branch prediction using a BTB with a 2-bit predictor. |
You should check that no more than one of the above three flags is used by the user at one time. If more than one is used, report an appropriate error to the user and terminate DLXSIM. If none of the above three flags is used, then continue to model a 1-cycle branch delay slot as is the current behavior.
Step 1:
Modify the DLX simulator to simulate a machine without a branch delay
slot, and instead using flushing to handle branches.
Step 2:
Modify the DLX simulator to handle branches as predicted-not-taken,
instead of flushing..
Step 3:
Modify the DLX simulator to handle branches using dynamic branch
prediction using a branch target buffer (BTB) having a 2-bit history counter,
instead of a static prediction in step 2. The BTB is like the one described in
the first part of section 3.5 in the textbook, and the 2-bit history counter is
maintained using the 2-bit branch prediction scheme in section 3.4.
Implementation method:
You should modify the dlxsim
source to implement the new functionality. You can implement the BTB in
software as an array of structures, with one array entry (structure) per BTB
entry. Each BTB entry (structure) will have four fields: valid bit, tag,
predicted target, and 2-bit counter. At start time when you load the
application file into memory, you have to initialize all the entries in the BTB
as invalid. Also initialize any extra runtime counters (see hint below) to zero.
Hint: It is important to realize that the DLX simulator is a timing simulator for a DLX 5-stage pipeline, and not a stage-by-stage pipeline simulator. Thus for most of your modifications you only need to change the timing computed by the simulator and not its cycle-by-cycle execution. (The only exception is in removing the branch-delay slot.) You need not modify the values of the timing counters already in DLX. It is cleanest instead to maintain a separate counter which stores the difference in the number of cycles versus the existing timing. (This could be positive or negative). At the end of DLXSIM, modify the final printout of runtime to report the time as the original time + difference in cycles.
Flow-chart for 2-bit
scheme:
Below is a pseudo-code for the workings of the BTB with a 2-bit predictor.
It is meant as a modification of figure 3.19 in the textbook for a 2-bit scheme,
presented as pseudo-code instead of a flow-chart. Recall that a 2-bit
scheme stores both taken and not-taken branches in the BTB, unlike a 1-bit
scheme which stores only taken branches. The code below is not meant to be
an exact description of the workings of the hardware. It is only meant
to aid in your intuitive understanding of a BTB with 2-bit counter.
Send PC to memory and BTB;
if (entry found in BTB) {
Send out predicted PC; /* start fetching at predicted PC */
/* No need to check if instr is a branch. It must be. */
if (predicted target != resolved target) { /*
Incorrect prediction */
kill subsequent fetched instructions;
restart fetch at resolved target;
}
if (branch taken) {
new-2-bit-state = saturating
increment of 2-bit state in BTB entry;
}
else { /* Branch not taken */
new-2-bit-state = saturating
decrement of 2-bit state in BTB entry;
}
update BTB entry with new-2-bit-state and the resolved
target;
}
else {/* Entry not in BTB */
if (instr is branch) {
if (branch is taken) {
kill
subsequent fetched instructions;
restart fetch
at resolved target;
prediction
bits = 10;
}
else {
prediction
bits = 01;
}
insert BTB entry with branch addr,
resolved target
and
prediction bits;
}
}
Configuration Header File:
Define in a new C header file the following using #define macros:
- NUM_BITS_IN_BTB_INDEX: 8
The above macro is a constant storing the number of low-order bits used to index into the branch-target buffer. Assume that the instruction addresses are 4-byte-aligned, and thus the lowest two bits of the instruction address are not used to index the BTB. The next higher NUM_BITS_IN_BTB_INDEX bits are used as an index. Thus the highest (32 - NUM_BITS_IN_BTB_INDEX - 2) bits are stored in the tag. The predicted PC field in the BTB is a full byte address.
Example program:
You are provided with a benchmark program called laplace.s to run it on dlxsim
and produce the statistics. You will find this file under the /dlx/text/
directory. Report all your runtime numbers on this program.
The example DLX assembly program given to you assumes a 1-cycle delay slot, which is usually filled by a NOP by the (unoptimized) DLX compiler I used. To test the un-modified simulator, you should use the example program without modification. However, to test your modified simulator, you should modify the example program as well to manually remove all delay-slot NOPs and move all non-NOP delay slot instructions (if any) to before the branch.
Modularity:
It is important to add in your changes in a modular manner. In other words, make
minimal changes to existing files, and add in your new code in separate
files to the extent possible. (**CREDIT WILL BE DEDUCTED FOR NOT
FOLLOWING THIS RULE**).
Statistics:
1.) Report the total runtime for four cases:
Base case (delay slot)
With flushing
With predicted-not-taken
With dynamic branch prediction.
2) Vary the size of the BTB table by varying NUM_BITS_IN_BTB_INDEX to 4,5,6,7,8,and 9. Recompile and collect the runtime each time for the dynamic branch prediction case, and report it on a 2-D graph plotting NUM_BITS_IN_BTB_INDEX versus runtime. Also report the cycle count numbers in a table.
Submissions:
Please submit the following:
- Statistics
- List of files you added; and those you modified.
- Printout of the above code files. In the modified files, highlight your
modifications.
You will need to give a demonstration of the software to the grader (Steve Haga), who is a senior graduate student. He will contact you via the mailing list with the scheduling information.