Course Project Ideas

CHIP MULTIPROCESSORS (CMPs).

As silicon technology improves, our ability to put more transistors on a single chip increases. Process engineers forecast that by the middle of the next decade, we will be able to place a billion transistors on a single die. The question is: what will architects do with all these transistors? One active area of current research is building multiple processors on a single chip; such designs have recently been called "chip multiprocessors," or "CMPs." Here are several projects related to CMPs:

a). Cache Hierarchies for CMPs. What is the right cache-memory organization for a CMP? Should we build a single cache shared by the entire chip? Should we build a banked cache? Should we distribute the cache memory, giving each processing node its own cache and relying on a cache coherence protocol to keep the copies coherent? Given that a CMP will offer much higher bandwidth and much lower latency communication between nodes, which cache organization is most desirable? Compare this choice to the right choice for a conventional MP.

b). Cache Coherence Protocols for CMPs. If, in part a), we choose to build distributed caches, what is the right cache coherence protocol? Should we use directory-based schemes, or does the enormous bandwidth available on a single chip warrant broadcast protocols? Compare these schemes to limited and chained directory schemes; the sketch below illustrates the storage tradeoff.
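
To make the storage tradeoff concrete, here is a minimal sketch in C++ of the per-line state kept by a full-map directory versus a limited-pointer directory. The node count and pointer count are assumptions chosen purely for illustration; a chained directory would instead link the sharers' cache tags into a list.

    // Sketch of per-cache-line directory state; NUM_NODES and NUM_POINTERS are
    // illustrative assumptions, not values prescribed by the project.
    #include <bitset>
    #include <cstdint>
    #include <iostream>

    constexpr int NUM_NODES    = 64;   // assumed number of processing nodes
    constexpr int NUM_POINTERS = 4;    // sharer pointers kept by the limited directory

    // Full-map directory: one presence bit per node plus a dirty bit.
    // Per-line storage grows linearly with the number of nodes.
    struct FullMapEntry {
        std::bitset<NUM_NODES> sharers;  // which nodes hold a copy
        bool dirty = false;              // true if exactly one node holds it modified
    };

    // Limited directory: a small, fixed number of sharer pointers. When more than
    // NUM_POINTERS nodes share the line, the protocol must fall back to broadcast
    // or to software handling; a chained directory avoids this by linking sharers.
    struct LimitedEntry {
        uint8_t sharer[NUM_POINTERS] = {};  // node IDs of the recorded sharers
        uint8_t count    = 0;               // how many pointers are valid
        bool    overflow = false;           // set when sharers exceed NUM_POINTERS
        bool    dirty    = false;
    };

    int main() {
        std::cout << "full-map entry bytes: " << sizeof(FullMapEntry) << "\n"
                  << "limited entry bytes:  " << sizeof(LimitedEntry)  << "\n";
        // A broadcast (snooping) protocol keeps no sharer state at all and instead
        // spends on-chip bandwidth by sending every coherence request to every node.
        return 0;
    }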

c). Memory Bottleneck in CMPs. While today's uniprocessors are already limited by memory latency and bandwidth, CMPs will only aggravate the memory bottleneck further. Build a simulator to characterize the severity of this bottleneck, and examine how it grows as you scale the number of nodes on the CMP (a back-of-envelope model appears below). To what extent can this bottleneck be relieved if we allocate a significant portion of the chip area budget to memory (perhaps even building DRAM on the same die as the CMP)? How should this memory be managed? Should it be managed as a cache? Should it be managed using virtual memory?
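
Before building the full simulator, a back-of-envelope model can show how quickly off-chip bandwidth demand grows with node count. The sketch below is such a model in C++; every parameter (clock rate, IPC, memory references per instruction, miss rate, line size) is an assumed placeholder to be replaced by measured values.

    // Back-of-envelope model of off-chip bandwidth demand as a CMP scales.
    // All parameters below are assumptions for illustration only.
    #include <iostream>

    int main() {
        const double clock_ghz          = 1.0;   // per-node clock rate
        const double ipc                = 1.0;   // instructions per cycle per node
        const double mem_refs_per_instr = 0.3;   // loads + stores per instruction
        const double miss_rate          = 0.02;  // fraction of references missing on chip
        const double line_bytes         = 64.0;  // bytes fetched per miss

        for (int nodes = 1; nodes <= 16; nodes *= 2) {
            double bytes_per_sec = nodes * clock_ghz * 1e9 * ipc *
                                   mem_refs_per_instr * miss_rate * line_bytes;
            std::cout << nodes << " nodes -> "
                      << bytes_per_sec / 1e9 << " GB/s off-chip demand\n";
        }
        return 0;
    }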

d). Computational Grain. The ability to place multiple processors on the same chip will significantly increase the communication bandwidth and decrease the communication latency seen by threads executing on different processing nodes. This will enable the exploitation of finer-grained parallelism in a CMP than in a conventional MP. Take an application (perhaps one studied in class) and parallelize it for each multiprocessor architecture. How does the decomposition change when communication becomes much cheaper (see the sketch below)? Compare the performance of the conventional-MP and CMP versions.
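
As a concrete starting point, the sketch below (C++ with std::thread; the array size, thread count, and GRAIN parameter are all made up) parallelizes a toy reduction using a shared work counter. Each grab of the counter is a communication event, so shrinking GRAIN mimics a finer-grained decomposition; setting GRAIN to N/nthreads recovers the coarse-grained, one-block-per-thread decomposition that a conventional MP would favor.

    // Toy illustration of decomposition grain for a shared-memory machine.
    // GRAIN stands in for how much work a thread does between synchronization
    // points; all sizes are illustrative assumptions.
    #include <algorithm>
    #include <atomic>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    int main() {
        const int N        = 1 << 20;
        const int nthreads = 4;
        const int GRAIN    = 1024;        // small blocks => fine grain, more communication
        std::vector<double> a(N, 1.0);
        std::vector<double> partial(nthreads, 0.0);
        std::atomic<int> next(0);         // shared counter: one atomic op per block

        auto worker = [&](int id) {
            for (;;) {
                int start = next.fetch_add(GRAIN);
                if (start >= N) break;
                int end = std::min(start + GRAIN, N);
                partial[id] += std::accumulate(a.begin() + start, a.begin() + end, 0.0);
            }
        };

        std::vector<std::thread> threads;
        for (int i = 0; i < nthreads; ++i) threads.emplace_back(worker, i);
        for (auto &t : threads) t.join();

        double sum = std::accumulate(partial.begin(), partial.end(), 0.0);
        std::cout << "sum = " << sum << " (expected " << N << ")\n";
        return 0;
    }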

COGNITIVE PROCESSING REQUIREMENTS.

The Defense Advanced Research Projects Agency (DARPA) is leading a program called "ACIP" (Architectures for Cognitive Information Processing) that seeks to build a next-generation chip multiprocessor architecture for supporting cognitive processing (i.e., artificial intelligence) applications. Examples of algorithms in the cognitive processing domain include search, probabilistic inference, knowledge-based reasoning, planning, and machine learning. In addition to attracting the interest of DARPA, cognitive processing workloads are also expected to have a profound impact on commercial applications. Specifically, they have wide applicability in such areas as medical diagnosis, bioinformatics, multimedia, data mining, and stock market forecasting. Unfortunately, very little is known about the parallel behavior of these applications, particularly in the context of CMP architectures. Here are projects related to understanding the parallel characteristics of cognitive processing workloads:

a). Quantifying thread-level parallelism in cognitive processing. Pick an important cognitive processing application (the instructor has access to several). Parallelize the application for a shared memory multiprocessor (e.g., a chip multiprocessor). Measure the available performance gain due to thread-level parallelism on a CMP simulator.

b). Comparing different forms of parallelism in cognitive processing. Perform a study to quantify the degrees of data-level parallelism, instruction-level parallelism, and thread-level parallelism in cognitive processing applications. Compare these different sources of parallelism to determine where the greatest performance gains can be achieved.

General applications study. The above studies can also be performed on other applications. Pick an application you've worked on in the past, perhaps for sequential machines, or something you're doing in your current research that you would like to speed up. Parallelize the application for different machine models, e.g., shared memory and message passing. Understand the application's communication and computation requirements, as well as its synchronization requirements. Which communication model is best suited to the application?

MULTI-THREADED ARCHITECTURES.

Multithreading is an effective way to hide long-latency operations such as cache misses and failed synchronization operations. The basic idea behind multithreading is that each processor contains multiple resident threads and switches context among them rapidly each time a thread blocks on a long-latency operation. Recently, Simultaneous Multithreading (SMT) has been proposed as an effective way to exploit multiple threads. Like traditional multi-threaded processors, an SMT processor contains support for multiple hardware contexts (i.e., replicated register files, program counters, etc.). What is unique about SMT is that it executes the threads simultaneously on a single out-of-order processor core; hence, it does not require significant modifications to existing uniprocessor hardware. Here are some projects that explore the potential uses of SMT processors:
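
The sketch below is a minimal, assumption-laden model of this idea in C++: each hardware context replicates only architectural state (a PC and a register file), while the issue slots of a single cycle are filled from any context that is not blocked. The context count, issue width, and injected stall are made up purely for illustration.

    // Minimal sketch of simultaneous multithreading: replicated per-thread state,
    // shared issue bandwidth. All widths and counts are illustrative assumptions.
    #include <array>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    constexpr int NUM_CONTEXTS = 4;   // hardware thread contexts
    constexpr int ISSUE_WIDTH  = 4;   // instructions issued per cycle, all threads combined

    struct Context {
        uint64_t pc = 0;                      // replicated per thread
        std::array<uint64_t, 32> regs{};      // replicated architectural register file
        int stall_cycles = 0;                 // >0 while blocked on a long-latency operation
    };

    int main() {
        std::vector<Context> ctx(NUM_CONTEXTS);
        ctx[1].stall_cycles = 3;              // pretend thread 1 just missed in the cache

        for (int cycle = 0; cycle < 5; ++cycle) {
            int issued = 0;
            // Unlike a coarse-grained multithreaded processor, SMT fills the issue
            // slots of one cycle from *any* ready context, so a single stalled
            // thread does not idle the whole machine.
            for (int t = 0; t < NUM_CONTEXTS && issued < ISSUE_WIDTH; ++t) {
                if (ctx[t].stall_cycles > 0) { ctx[t].stall_cycles--; continue; }
                ctx[t].pc += 4;               // "issue" one instruction from thread t
                ++issued;
            }
            std::cout << "cycle " << cycle << ": issued " << issued << " instructions\n";
        }
        return 0;
    }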

a). Cache Architecture for SMT. Computer architects currently studying SMT processors have proposed using a shared cache organization for each SMT processor. One advantage of a shared cache is that threads with overlapping working sets running on the same SMT processor will tend to prefetch data for each other. One disadvantage of a shared cache is that increased conflict and/or capacity misses can occur between threads. Evaluate the advantages and disadvantages of a shared cache organization for SMT processors. Alternatively, compare the performance of a shared cache against a distributed cache organization.

b). Subordinate Threads. While multithreading was originally meant to run multiple processes or parallel workloads, another use of threading is for "subordinate threads." A subordinate thread is one that does not directly participate in the computation of the program, but rather performs some functionality that assists the main threads in the program. Examples of tasks that subordinate threads can perform include profiling, data and/or instruction prefetching, and pre-execution of branches to assist branch predictors. Evaluate the potential of subordinate threads on an SMT processor, or a chip multiprocessor.
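
To make the idea concrete, here is a minimal sketch of a prefetching subordinate thread in C++, using std::thread and the GCC/Clang __builtin_prefetch intrinsic; the run-ahead distance and the main computation are made-up placeholders. The helper contributes no results of its own; it simply tries to warm the cache ahead of the main thread.

    // Hedged sketch of a subordinate prefetching thread. The run-ahead distance,
    // data set, and update interval are illustrative assumptions.
    #include <algorithm>
    #include <atomic>
    #include <iostream>
    #include <thread>
    #include <vector>

    int main() {
        const size_t N = 1 << 22;
        std::vector<int> data(N, 1);
        std::atomic<size_t> main_pos(0);      // how far the main thread has gotten
        std::atomic<bool> done(false);
        const size_t RUN_AHEAD = 4096;        // how far ahead the helper prefetches

        // Subordinate thread: performs no useful computation itself; it only
        // touches the region the main thread is about to read.
        std::thread helper([&] {
            while (!done.load(std::memory_order_relaxed)) {
                size_t p    = main_pos.load(std::memory_order_relaxed);
                size_t stop = std::min(p + RUN_AHEAD, N);
                for (size_t i = p; i < stop; i += 16)     // roughly one touch per cache line
                    __builtin_prefetch(&data[i], 0 /*read*/, 1 /*low temporal locality*/);
            }
        });

        // Main thread: the "real" computation (here, a trivial sum).
        long long sum = 0;
        for (size_t i = 0; i < N; ++i) {
            sum += data[i];
            if ((i & 1023) == 0) main_pos.store(i, std::memory_order_relaxed);
        }
        done.store(true);
        helper.join();
        std::cout << "sum = " << sum << "\n";
        return 0;
    }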

c). Using SMT to Run Explicitly Parallel Programs. SMT was originally proposed to exploit parallelism across multiple *processes*. In this case, there is no communication or interaction between the multiple threads. How would a single parallel program with multiple threads perform on an SMT processor? Build a simulator to model the behavior of an SMT, and measure the performance of explicitly parallel programs on your simulator. Compare the performance of the SMT architecture with a traditional multiprocessor architecture, or with chip multiprocessors.

SPECULATIVE CACHE COHERENCE PROTOCOLS.

The current trend in modern processor design is to leverage speculation as much as possible. Recently, researchers have begun applying speculation techniques to shared memory machines as well. A promising idea is to allow the memory controllers in a DSM to speculate on when a message for a shared memory transaction will arrive, before the message actually arrives.

A prerequisite for the success of speculation is the predictability of events. In shared memory machines, the events which must be predictable for speculation to succeed are cache misses. A potential project would be to study the predictability of cache misses on shared memory machines for a wide class of applications. In this project, several applications would be instrumented, perhaps using NWO to perform the study. The instrumented applications would be executed and their cache miss behavior traced. Such cache miss traces can be post-processed to investigate the predictability of the cache miss events. This study can be performed for both uniprocessor and multiprocessor traces, and their respective predictabilities compared.
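
One possible post-processing step is sketched below in C++: it assumes a simple text trace with one miss address per line, in hex, and asks how often the most recent stride between misses predicts the next miss address. Both the trace format and the stride predictor are illustrative choices, not requirements of the project.

    // Hedged sketch of post-processing a cache-miss trace for predictability.
    // Assumed trace format: one hexadecimal miss address per line.
    #include <cstdint>
    #include <fstream>
    #include <iostream>
    #include <string>

    int main(int argc, char **argv) {
        if (argc < 2) { std::cerr << "usage: predict <miss_trace_file>\n"; return 1; }
        std::ifstream in(argv[1]);

        uint64_t prev = 0, stride = 0;
        bool have_prev = false, have_stride = false;
        uint64_t total = 0, correct = 0;

        std::string line;
        while (std::getline(in, line)) {
            if (line.empty()) continue;
            uint64_t addr = std::stoull(line, nullptr, 16);
            if (have_stride) {
                ++total;
                if (prev + stride == addr) ++correct;   // did the last stride repeat?
            }
            if (have_prev) { stride = addr - prev; have_stride = true; }
            prev = addr;
            have_prev = true;
        }
        if (total > 0)
            std::cout << "misses predicted by the last stride: "
                      << 100.0 * correct / total << "%\n";
        return 0;
    }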

There is strong reason to believe that cache misses on different processors will be correlated due to invalidation traffic on write-shared data. How predictable are such correlated cache misses? Can an architecture make use of this predictability to speculate on which cache blocks will miss, and issue prefetches for these?

THREAD-LEVEL SPECULATION.

In the past, multiprocessors have required the programmer or compiler to partition loop iterations into multiple threads for execution on multiple processors. Traditionally, this meant that only fully parallel loops could be parallelized and run on a multiprocessor. There are many loops that are "pseudo-parallel": most of the iterations are independent, but occasionally a few iterations are data dependent (typically due to memory dependences), and hence must be run sequentially. For such pseudo-parallel loops, the programmer or compiler must be conservative and execute the entire loop sequentially, because it cannot be determined before runtime which iterations are parallel and which are sequential (see the example below).
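
The loop below is a made-up but representative example of the problem: most iterations update distinct histogram entries, but a few collide, and because the update targets are computed at runtime the compiler cannot prove that they never do.

    // Illustrative pseudo-parallel loop; the arrays and the collision pattern
    // are assumptions chosen only to show the structure of the problem.
    #include <iostream>
    #include <vector>

    int main() {
        const int N = 1000;
        std::vector<double> hist(N, 0.0);
        std::vector<int> bucket(N);
        std::vector<double> val(N, 1.0);
        for (int i = 0; i < N; ++i)
            bucket[i] = (i % 97 == 0) ? 0 : i;   // a few iterations collide at entry 0

        // Iterations i and j conflict only when bucket[i] == bucket[j]. Since that
        // cannot be ruled out at compile time, a conservative compiler must run the
        // entire loop sequentially, even though most iterations are independent.
        for (int i = 0; i < N; ++i)
            hist[bucket[i]] += val[i];

        std::cout << "hist[0] = " << hist[0] << "\n";
        return 0;
    }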

Recently, there has been a proposal to build multiprocessors that allow pseudo-parallel loops to execute speculatively. The idea is for the cache coherence protocol to track the order in which memory accesses are performed and to detect data dependence violations. If a dependence violation is detected, the hardware automatically squashes the thread(s) involved in the violation and restarts their execution. If most of the loop iterations are parallel, then squashing happens infrequently and parallel performance is high (the sketch below illustrates the dependence check). Here are some projects related to thread-level speculation:
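
The sketch below captures just the dependence-checking step of this idea in software, under heavy simplifying assumptions: each speculative iteration (epoch) records the addresses it reads and writes, and a later epoch is squashed if it read a location that an earlier epoch writes. Real protocols piggyback this check on the cache coherence messages rather than keeping explicit sets.

    // Minimal software sketch of thread-level speculation's violation check.
    // The data structures and squash policy are illustrative assumptions.
    #include <cstdint>
    #include <iostream>
    #include <set>
    #include <vector>

    struct Epoch {
        std::set<uint64_t> reads, writes;   // addresses touched speculatively
        bool squashed = false;
    };

    // A later epoch violates a dependence if it read a location that an earlier
    // (program-order) epoch writes: it consumed the value too early.
    void detect_violations(std::vector<Epoch> &epochs) {
        for (size_t earlier = 0; earlier < epochs.size(); ++earlier)
            for (size_t later = earlier + 1; later < epochs.size(); ++later)
                for (uint64_t w : epochs[earlier].writes)
                    if (epochs[later].reads.count(w))
                        epochs[later].squashed = true;   // squash and re-execute
    }

    int main() {
        std::vector<Epoch> e(3);
        e[0].writes = {0x1000};   // iteration 0 writes location A
        e[1].reads  = {0x2000};   // iteration 1 touches unrelated data
        e[2].reads  = {0x1000};   // iteration 2 read A before the earlier write performed
        detect_violations(e);
        for (size_t i = 0; i < e.size(); ++i)
            std::cout << "epoch " << i << (e[i].squashed ? ": squash\n" : ": commit\n");
        return 0;
    }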

a). Quantifying the importance of pseudo-parallel loops. Perform an application study to determine to what degree pseudo-parallel loops occur in important applications. Based on this study, determine the amount of thread-level parallelism that exists in these applications, and the potential performance gains possible on a multiprocessor with support for thread-level speculation.

b). Thread-level speculation simulation study. Build a speculation protocol into an existing MP simulator that already models a cache coherence protocol. Then, pick an application with several pseudo-parallel loops, and study the performance gains achieved through thread-level speculation.