Ph.D. Dissertation Defense: Abdel-Hameed A. Badawy

Tuesday, August 20, 2013
11:00 a.m.
Room 2460, AVW Building
Maria Hoo
301 405 3681
mch@umd.edu

ANNOUNCEMENT: Ph.D. Dissertation Defense

Name: Abdel-Hameed A. Badawy

Committee:

Professor Donald Yeung, Chair

Professor Manoj Franklin

Professor Gang Qu

Professor Eyad Abed

Professor Amr Baz, Dean's Representative

Date/Time: Tuesday, August 20, 2013 at 11:00 a.m.

Location: AVW 2460

Title: Locality Transformations and Prediction Techniques for Optimizing Multicore Memory Performance

Abstract

Chip Multiprocessors (CMPs) are here to stay for the foreseeable future. In terms of the programmability of these processors, what distinguishes them from legacy multiprocessors is that sharing among the different cores (processors) is no longer so expensive. Previous research has suggested that sharing is now starting to be a good feature to have for new codes. For some programs, more cache leads to more beneficial sharing, since the sharing starts to kick in once the on-chip cache is large enough. This work tries to answer the question of whether we can (and should) write code differently when a Chip Multiprocessor powers the underlying chip microarchitecture.

We use a set of three graph benchmarks. Each benchmark has three different input problems varying in size and connectivity. We characterize the importance of the parallelization strategy, in combination with the proper locality optimization, in achieving better performance in multicore memory systems. Better performance means good utilization of the caches at the lowest level and increased sharing of data among the cores in a constructive way, which can effectively be seen as cores prefetching for each other.

The thesis has two thrusts. The first is exploring the design space represented by different parallelization schemes (we devise some tweaks on top of existing techniques) and different graph partitioning sizes (a locality optimization technique suited to graph problems). The combination of parallelization strategy and graph partitioning yields a large and complex space, which we characterize using detailed simulation results to see how much gain we can obtain over a baseline parallelization technique with a partition sized to fit in the L1 cache. We show that the legacy parallelization is not the way to go in most cases, and that other parallelization techniques perform better. In addition, we show that determining the partitioning size is a search problem and that, in most cases, the best partitioning size is smaller than the baseline partition size; a sketch of this search follows below.
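As a rough illustration of the shape of this design-space scan, the C++ sketch below exhaustively crosses parallelization strategies with candidate partition sizes and keeps the fastest combination. The strategy names, partition sizes, and the simulate() cost function are all hypothetical stand-ins; in the actual study each point would be evaluated by a detailed simulation run.

#include <cstddef>
#include <cstdio>
#include <limits>
#include <string>
#include <vector>

// Hypothetical stand-in for one detailed-simulation run (e.g., an M5
// experiment): a toy execution-time estimate that penalizes the legacy
// strategy and partitions that overflow a 32 KB L1 cache.
double simulate(const std::string& strategy, std::size_t partitionBytes) {
    double base = (strategy == "legacy") ? 1.2 : 1.0;
    double overflow = (partitionBytes > 32 * 1024) ? 1.5 : 1.0;
    return base * overflow;
}

struct Config { std::string strategy; std::size_t partitionBytes; };

// Exhaustive scan: every parallelization strategy crossed with every
// candidate partition size; keep the combination with the lowest time.
Config bestConfig(const std::vector<std::string>& strategies,
                  const std::vector<std::size_t>& sizes) {
    Config best{};
    double bestTime = std::numeric_limits<double>::infinity();
    for (const auto& s : strategies)
        for (std::size_t p : sizes) {
            double t = simulate(s, p);
            if (t < bestTime) { bestTime = t; best = {s, p}; }
        }
    return best;
}

int main() {
    // Strategy names and sizes are illustrative only.
    std::vector<std::string> strategies = {"legacy", "tweaked-a", "tweaked-b"};
    std::vector<std::size_t> sizes = {8 * 1024, 16 * 1024, 32 * 1024, 64 * 1024};
    Config c = bestConfig(strategies, sizes);
    std::printf("best: %s, %zu-byte partitions\n", c.strategy.c_str(), c.partitionBytes);
}

The point of the sketch is the structure of the search, not the cost model: the space grows multiplicatively with each added strategy or partition size, which is what motivates the prediction scheme of the second thrust.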

The second thrust of the thesis is exploring how we can predict the combination of parallelization and partitioning that performs best for a given benchmark under a given input data set. We use a PIN tool that computes reuse distance profiles to build an execution time prediction model that can rank-order the different combinations of parallelization strategies and partitioning sizes relative to each other. We report the amount of gain we can capture using the PIN prediction relative to what detailed simulation results deem best for a given benchmark and input size. In some cases the prediction is 100% accurate, while in others it projects worse performance than the baseline case. We report the difference between the best-performing combination found by simulation and the one predicted by PIN, as well as other statistics, to evaluate how good the predictions are. We show that the PIN prediction method performs much better at predicting the partition size than at predicting the parallelization strategy. In this case, the accuracy of the overall scheme can be greatly improved if we use only the partition size predicted by the PIN scheme and then apply a search strategy to find the best parallelization strategy on top of that partition size.
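For concreteness, the sketch below computes a reuse distance profile from a single-threaded address trace using the naive LRU-stack method: each access's distance is the number of distinct addresses touched since the previous access to the same address. This is a deliberate simplification of what an actual PIN tool does (real tools instrument a running binary and use faster tree-based algorithms), but it shows the histogram that feeds a prediction model of this kind.

#include <cstdint>
#include <cstdio>
#include <iterator>
#include <list>
#include <map>
#include <unordered_map>
#include <vector>

// Naive LRU stack-distance computation. For each access, the reuse
// distance is the number of distinct addresses touched since the last
// access to the same address (infinite on a first touch). The returned
// histogram (distance -> access count) is the reuse distance profile.
std::map<std::uint64_t, std::uint64_t>
reuseProfile(const std::vector<std::uint64_t>& trace) {
    std::list<std::uint64_t> lru;   // most recently used address at front
    std::unordered_map<std::uint64_t, std::list<std::uint64_t>::iterator> pos;
    std::map<std::uint64_t, std::uint64_t> hist;

    for (std::uint64_t addr : trace) {
        auto it = pos.find(addr);
        if (it == pos.end()) {
            hist[UINT64_MAX]++;     // cold miss: infinite reuse distance
        } else {
            // Position in the LRU stack = number of distinct addresses
            // accessed since this address was last touched.
            hist[std::distance(lru.begin(), it->second)]++;
            lru.erase(it->second);
        }
        lru.push_front(addr);
        pos[addr] = lru.begin();
    }
    return hist;
}

int main() {
    // Toy trace: address 1 is re-touched after two distinct addresses,
    // so its second access has reuse distance 2.
    std::vector<std::uint64_t> trace = {1, 2, 3, 1};
    for (const auto& [dist, count] : reuseProfile(trace)) {
        if (dist == UINT64_MAX)
            std::printf("distance inf: %llu accesses\n", (unsigned long long)count);
        else
            std::printf("distance %llu: %llu accesses\n",
                        (unsigned long long)dist, (unsigned long long)count);
    }
}

From such a histogram one can estimate misses for any cache capacity (accesses whose distance exceeds the cache's capacity in blocks miss in an LRU cache), which is the hook that lets reuse distance profiles rank candidate configurations without detailed simulation.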

In this thesis, we use a detailed performance model to scan a large solution space for the best locality optimization parameters for a set of graph problems. Using the M5 performance simulator, we show gains of up to 20% over a naively picked baseline case. Our prediction scheme can achieve up to 100% of the best performance gains obtained by a search method on real hardware or in performance simulation, without running on the target hardware at all, and 48% on average across all of our benchmarks and input sizes.

There are several interesting aspects to this work. We are the first to devise such a performance model and verify it against detailed simulation results. We suggest, and quantify, that locality optimization and problem partitioning can increase sharing synergistically to achieve better overall performance. We show a new use for coherent reuse distance profiles as a tool to help program developers and compilers optimize a program's performance.

Audience: Graduate, Faculty
