a). Cache Hierarchies for CMPs. What is the right cache-memory organization for a CMP? Should we build a single shared cache for the entire chip? Should we build a banked cache? Should we distribute the cache memory among the processing nodes, one cache per node, and rely on a cache coherence protocol to maintain coherence? Given that a CMP will offer much higher bandwidth and much lower latency communication between nodes, which cache organization is most desirable? Compare this choice to the right choice for a conventional MP.
b). Cache Coherence Protocols for CMPs. If in part a). we choose to build distributed caches, what is the right cache coherence protocol? Should we use directory-based schemes, or does the enormous bandwidth available on a single chip warrant broadcast protocols? Compare these schemes to limited and chained directory schemes.
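One concrete axis for the comparison in part b). is directory storage overhead. The sketch below estimates the per-block directory cost of full-map, limited, and chained directories as the node count scales; the formulas are the standard back-of-the-envelope estimates, and the extra state bit per entry is an illustrative assumption.

```python
# Sketch: directory storage cost per cache block for N nodes, assuming one
# extra state/dirty bit per entry (a simplifying assumption, not a real design).
import math

def fullmap_bits(n_nodes):
    # Full-map directory: one presence bit per node, plus a state bit.
    return n_nodes + 1

def limited_bits(n_nodes, n_pointers):
    # Limited directory (Dir_i): i pointers of ceil(log2 N) bits each.
    return n_pointers * math.ceil(math.log2(n_nodes)) + 1

def chained_bits(n_nodes):
    # Chained directory: only a head pointer lives in the directory;
    # the rest of the sharing list is linked through the caches themselves.
    return math.ceil(math.log2(n_nodes)) + 1

for n in (16, 64, 256):
    print(n, fullmap_bits(n), limited_bits(n, 4), chained_bits(n))
```

The full-map cost grows linearly with node count while the other two grow logarithmically, which is why limited and chained schemes become interesting for larger machines; whether a single-chip machine is ever large enough for this to matter is part of the question.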
c). Memory Bottleneck in CMPs. While today's uniprocessors are already limited by memory latency and bandwidth, CMPs will only aggravate the memory bottleneck further. Build a simulator to characterize the severity of this bottleneck, and examine how it grows as you scale the number of nodes on the CMP. To what extent can this bottleneck be relieved if we allocate a significant portion of the chip area budget to memory (perhaps even considering building DRAM on the same die as the CMP)? How should this memory be managed? Should it be managed as a cache? Should it be managed using virtual memory?
d). Computational Grain. The ability to place multiple processors on the same chip will significantly increase the communication bandwidth and decrease the communication latency seen by threads executing on different processing nodes. This will enable the exploitation of finer-grained parallelism in a CMP as compared to a conventional MP. Take an application (perhaps one studied in class) and parallelize it for each multiprocessor architecture. How does the decomposition change when communication becomes much cheaper? Compare the performance between conventional and chip MPs.
a). Quantifying thread-level parallelism in cognitive processing. Pick an important cognitive processing application (the instructor has access to several). Parallelize the application for a shared memory multiprocessor (e.g., a chip multiprocessor). Measure the available performance gain due to thread-level parallelism on a CMP simulator.
b). Comparing different forms of parallelism in cognitive processing. Perform a study to quantify the degrees of data-level parallelism, instruction-level parallelism, and thread-level parallelism in cognitive processing applications. Compare these different sources of parallelism to determine where the greatest performance gains can be achieved.
General applications study. The above studies can be performed on other applications as well. Pick an application you've worked on in the past, perhaps for sequential machines, or something that you're doing in your current research which you would like to speed up. Parallelize the application for different machine models, e.g., shared memory and message passing. Understand the communication and computation requirements in the application, and its synchronization requirements. Which communication model is best suited to the application?
a). Cache Architecture for SMT. Computer architects currently studying SMT processors have proposed using a shared cache organization for each SMT processor. One advantage of a shared cache is that threads with overlapping working sets running on the same SMT processor will tend to prefetch data for each other. One disadvantage of a shared cache is that increased conflict and/or capacity misses can occur between threads. Evaluate the advantages and disadvantages of a shared cache organization for SMT processors. Or, compare the performance of a shared cache against a distributed cache organization.
b). Subordinate Threads. While multithreading was originally meant to run multiple processes or parallel workloads, another use of threading is for "subordinate threads." A subordinate thread is one that does not directly participate in the computation of the program, but rather performs some functionality that assists the main threads in the program. Examples of tasks that subordinate threads can perform include profiling, data and/or instruction prefetching, and pre-execution of branches to assist branch predictors. Evaluate the potential of subordinate threads on an SMT processor, or a chip multiprocessor.
c). Using SMT to Run Explicitly Parallel Programs. SMT was originally proposed to exploit parallelism across multiple *processes*. In this case, there is no communication or interaction between the multiple threads. How would a single parallel program with multiple threads perform on an SMT processor? Build a simulator to model the behavior of an SMT, and measure the performance of explicitly parallel programs on your simulator. Compare the performance of the SMT architecture with a traditional multiprocessor architecture, or with chip multiprocessors.
A prerequisite for the success of speculation is the predictability of events. In shared memory machines, the events which must be predictable for speculation to succeed are cache misses. A potential project would be to study the predictability of cache misses on shared memory machines for a wide class of applications. In this project, several applications would be instrumented, perhaps using NWO to perform the study. The instrumented applications would be executed and their cache miss behavior traced. Such cache miss traces can be post-processed to investigate the predictability of the cache miss events. This study can be performed for both uniprocessor and multiprocessor traces, and their respective predictabilities compared.
There is strong reason to believe that cache misses on different processors will be correlated due to invalidation traffic on write-shared data. How predictable are such correlated cache misses? Can an architecture make use of this predictability to speculate on which cache blocks will miss, and issue prefetches for these?
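The trace post-processing step described above can start very simply. The sketch below runs a per-processor stride predictor (last miss address plus last stride) over a miss trace and reports the fraction of misses it would have predicted; the trace format of (processor id, miss block address) pairs is a hypothetical example, not the NWO output format.

```python
# Sketch: estimating cache-miss predictability from a miss-address trace,
# assuming a hypothetical (cpu, block_address) trace format.
from collections import defaultdict

def stride_predictability(trace):
    """Fraction of misses whose block address is correctly predicted by a
    simple per-processor stride predictor (last address + last stride)."""
    last_addr = {}
    last_stride = defaultdict(int)
    hits = total = 0
    for cpu, addr in trace:
        if cpu in last_addr:
            total += 1
            if last_addr[cpu] + last_stride[cpu] == addr:
                hits += 1
            last_stride[cpu] = addr - last_addr[cpu]
        last_addr[cpu] = addr
    return hits / total if total else 0.0

# Example: a single processor missing on a regular 64-byte stride is highly
# predictable after the predictor warms up.
trace = [(0, a) for a in range(0, 640, 64)]
print(stride_predictability(trace))
```

More sophisticated predictors (context-based, or keyed on invalidation events to capture the cross-processor correlation discussed above) slot into the same post-processing framework.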
Recently, there has been a proposal to build multiprocessors that allow pseudo-parallel loops to execute speculatively. The idea is for the cache coherence protocol to track the order in which memory accesses are performed and to detect data dependence violations. If a dependence violation is detected, the hardware automatically squashes the thread(s) involved in the violation and restarts execution of those threads. If most of the loop iterations are parallel, then squashing happens infrequently and parallel performance remains high. Here are some projects related to thread-level speculation:
a). Quantifying the importance of pseudo-parallel loops. Perform an application study to determine to what degree pseudo-parallel loops occur in important applications. Based on this study, determine the amount of thread-level parallelism that exists in these applications, and the potential performance gains possible on a multiprocessor with support for thread-level speculation.
b). Thread-level speculation simulation study. Build a speculation protocol into an existing MP simulator that already models a cache coherence protocol. Then, pick an application with several pseudo-parallel loops, and study the performance gains achieved through thread-level speculation.
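As a starting point for part b)., the dependence-tracking idea can be modeled at trace level before it is built into a coherence protocol. The sketch below is a simplified model, not a real protocol: it scans an interleaved access trace and flags a violation whenever a logically earlier iteration writes a location that a logically later iteration has already read speculatively.

```python
# Sketch: trace-level model of dependence-violation detection for thread-level
# speculation. A full simulator would attach this state to cache blocks and
# coherence messages; here it is a flat scan over an access trace.

def find_squashes(accesses):
    """accesses: list of (iteration, op, addr) in the interleaved order the
    memory system observes them, with op in {'R', 'W'}.
    Returns the set of iterations that must be squashed and restarted."""
    readers = {}        # addr -> iterations that have read it speculatively
    squashed = set()
    for it, op, addr in accesses:
        if op == 'R':
            readers.setdefault(addr, set()).add(it)
        else:
            # A write by iteration `it` violates any logically later
            # iteration that already read this address too early.
            for r in readers.get(addr, ()):
                if r > it:
                    squashed.add(r)
    return squashed

# Iteration 2 reads A before iteration 1's write to A reaches it: a violation.
trace = [(2, 'R', 'A'), (1, 'W', 'A'), (1, 'R', 'B'), (3, 'R', 'C')]
print(find_squashes(trace))  # → {2}
```

Running this model over traces from candidate pseudo-parallel loops gives a quick estimate of how often squashing would occur before committing to the full protocol implementation.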