Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency

Morgan & Claypool Publishers, 2007 - Architecture - 145 pages
Chip multiprocessors, also called multi-core microprocessors or CMPs for short, are now the only way to build high-performance microprocessors, for a variety of reasons. Large uniprocessors are no longer scaling in performance, because it is only possible to extract a limited amount of parallelism from a typical instruction stream using conventional superscalar instruction issue techniques. In addition, one cannot simply ratchet up the clock speed on today's processors, or the power dissipation will become prohibitive in all but water-cooled systems. Compounding these problems is the simple fact that, with the immense numbers of transistors available on today's microprocessor chips, it is too costly to design and debug ever-larger processors every year or two.

CMPs avoid these problems by filling up a processor die with multiple, relatively simpler processor cores instead of just one huge core. The exact size of a CMP's cores can vary from very simple pipelines to moderately complex superscalar processors, but once a core has been selected, the CMP's performance can easily scale across silicon process generations simply by stamping down more copies of the hard-to-design, high-speed processor core in each successive chip generation. In addition, parallel code execution, obtained by spreading multiple threads of execution across the various cores, can achieve significantly higher performance than would be possible using only a single core. While parallel threads are already common in many useful workloads, there are still important workloads that are hard to divide into parallel threads.
The low inter-processor communication latency between the cores in a CMP helps make a much wider range of applications viable candidates for parallel execution than was possible with conventional, multi-chip multiprocessors; nevertheless, limited parallelism in key applications is the main factor limiting acceptance of CMPs in some types of systems.

After a discussion of the basic pros and cons of CMPs when they are compared with conventional uniprocessors, this book examines how CMPs can best be designed to handle two radically different kinds of workloads that are likely to be used with a CMP: highly parallel, throughput-sensitive applications at one end of the spectrum, and less parallel, latency-sensitive applications at the other. Throughput-sensitive applications, such as server workloads that handle many independent transactions at once, require careful balancing of all parts of a CMP that can limit throughput, such as the individual cores, on-chip cache memory, and off-chip memory interfaces. Several studies and example systems, such as the Sun Niagara, that examine the necessary tradeoffs are presented here. In contrast, latency-sensitive applications (many desktop applications fall into this category) require a focus on reducing inter-core communication latency and applying techniques to help programmers divide their programs into multiple threads as easily as possible. This book discusses many techniques that can be used in CMPs to simplify parallel programming, with an emphasis on research directions proposed at Stanford University. To illustrate the advantages possible with a CMP using a couple of solid examples, extra focus is given to thread-level speculation (TLS), a way to automatically break up nominally sequential applications into parallel threads on a CMP, and transactional memory.
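The thread-level speculation idea described above can be illustrated with a toy software model (the book's systems, such as Hydra, implement this in hardware; the `tls_execute`/`body` names and the dictionary-as-memory scheme here are illustrative assumptions, not the book's design). Each loop iteration runs speculatively as if all iterations started in parallel from the pre-loop state; at commit time, any iteration that read a location written by a logically earlier iteration is squashed and replayed in order.

```python
def tls_execute(body, n, memory):
    """Run iterations 0..n-1 of `body(i, load, store)` as if all started in
    parallel from the pre-loop state, then commit in logical order, squashing
    and replaying any iteration whose reads conflict with an earlier write."""
    logs = {}
    for i in range(n):                      # speculative phase: no iteration
        snapshot = dict(memory)             # sees any other iteration's writes
        reads, writes = set(), set()
        def load(addr):
            reads.add(addr)
            return snapshot[addr]
        def store(addr, val):
            writes.add(addr)
            snapshot[addr] = val
        body(i, load, store)
        logs[i] = (reads, writes, snapshot)

    committed_writes = set()
    for i in range(n):                      # commit phase, in logical order
        reads, writes, snapshot = logs[i]
        if reads & committed_writes:        # true dependence violated: squash
            writes = set()                  # ...and replay against real memory
            def load(addr):
                return memory[addr]
            def store(addr, val):
                writes.add(addr)
                memory[addr] = val
            body(i, load, store)
        else:                               # speculation was safe: commit
            for addr in writes:
                memory[addr] = snapshot[addr]
        committed_writes |= writes
    return memory

memory = {"sum": 0, "a0": 1, "a1": 2, "a2": 3}
def body(i, load, store):                   # sum += a[i]: loop-carried dependence
    store("sum", load("sum") + load("a%d" % i))
result = tls_execute(body, 3, memory)       # iterations 1 and 2 misspeculate
# result["sum"] == 6, the same answer as the sequential loop
```

The point of the sketch is that the programmer (or compiler) writes a plain sequential loop, and the speculation machinery discovers at run time which iterations were actually independent; only the iterations with a real dependence pay the cost of re-execution.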
This model can greatly simplify manual parallel programming by using hardware, instead of conventional software locks, to enforce atomic execution of blocks of instructions, a technique that makes parallel coding much less error-prone.

Contents: The Case for CMPs / Improving Throughput / Improving Latency Automatically / Improving Latency Using Manual Parallel Programming / A Multicore World: The Future of CMPs
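The atomic-block idea can be sketched in a few lines of software (the book's TCC system does this in hardware, and with per-address conflict detection rather than the coarse global version counter assumed here; the `Memory`/`atomic` names are illustrative, not the book's API). Writes inside the block are buffered and become visible all at once at commit, or not at all:

```python
class Memory:
    """Shared state plus a version counter bumped on every commit."""
    def __init__(self, data):
        self.data = dict(data)
        self.version = 0

def atomic(mem, txn):
    """Run txn(read, write) as an all-or-nothing block: writes are buffered,
    and the block retries if another commit slipped in between its first
    read and its commit point."""
    while True:
        start_version = mem.version
        write_buf = {}
        def read(addr):                     # reads see this block's own writes
            return write_buf.get(addr, mem.data[addr])
        def write(addr, val):               # buffered: invisible until commit
            write_buf[addr] = val
        txn(read, write)
        if mem.version == start_version:    # no intervening commit: publish
            mem.data.update(write_buf)
            mem.version += 1
            return
        # conflict: discard the buffer and re-run the whole block

mem = Memory({"a": 100, "b": 0})
def transfer(read, write):                  # move 30 from a to b, atomically
    write("a", read("a") - 30)
    write("b", read("b") + 30)
atomic(mem, transfer)
# mem.data == {"a": 70, "b": 30}; no partially transferred state is visible
```

Compared with lock-based code, the programmer only marks which block must be atomic and never names a lock, so lock-ordering bugs and forgotten acquisitions disappear; in this single-threaded demo the retry loop never fires, but under contention a conflicting block simply re-executes.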
 

What people are saying

User Review

For non-practitioners of computer system design or processor design, this concise book definitely helps. Anyone with enough EE background (curiosity can make up for this too!) wanting to get up to speed with recent trends in processor design will find it useful. This is important because, sooner or later, you will deal with computer systems using these innovations.
Workload characterization has a huge influence on computer system design, as well as on processor design.
This book focuses on workload characterization issues (throughput-sensitive and latency-sensitive workloads) and how they affect processor design. You might have to look up the various benchmarks referred to in this book.
Chapter 1 makes the case for CMT (chip multi-threading) architecture. Chapter 2 covers how application workloads may already come with a good degree of threading, and how CMT architecture exploits this property. Chapter 3 covers other considerations, where legacy code with little or no direct threading can still exploit the CMT benefits via automatic thread-level parallelism extracted from sequential code. It also covers recent techniques for completely automated parallelization of Java code. Chapter 4 covers manual programming techniques for exploiting CMT.
 

Contents

The Case for CMPs  1
  THE CHIP MULTIPROCESSOR (CMP)  5
  1.2 THE APPLICATION PARALLELISM LANDSCAPE  6
  SUPERSCALAR VS CMP  8
    1.3.1 Simulation Results  12
  BEYOND BASIC CMPS  17
Improving Throughput  21
  2.1 SIMPLE CORES AND SERVER APPLICATIONS  24
    2.1.2 Maximizing the Number of Cores on the Die  25
    2.1.3 Providing Sufficient Cache and Memory Bandwidth  26
  The Niagara Server CMP  34
  The Niagara 2 Server CMP  44
    2.2.4 Simple Core Limitations  47
  2.3 GENERAL SERVER CMP ANALYSIS  48
    2.3.2 Choosing Design Datapoints  51
    2.3.3 Results  53
    2.3.4 Discussion  54
Improving Latency Automatically  61
  HELPER THREADS  62
  3.2 AUTOMATED PARALLELIZATION USING THREAD-LEVEL SPECULATION (TLS)  63
  HYDRA  70
    3.3.2 Adding TLS to Hydra  71
    3.3.3 Using Feedback from Violation Statistics  80
    3.3.4 Performance Analysis  84
  The JRPM System  88
  3.4 CONCLUDING THOUGHTS ON AUTOMATED PARALLELIZATION  99
Improving Latency Using Manual Parallel Programming  103
  4.1 USING TLS SUPPORT AS TRANSACTIONAL MEMORY  104
    Parallelizing Heapsort Using TLS  105
    4.1.2 Parallelizing SPEC2000 with TLS  114
  MORE GENERALIZED TRANSACTIONAL MEMORY  116
    4.2.1 TCC Hardware  118
    4.2.2 TCC Software  121
    4.2.3 TCC Performance  127
  4.3 MIXING TRANSACTIONAL MEMORY AND CONVENTIONAL SHARED MEMORY  136
A Multicore World: The Future of CMPs  141
Author Biography  145
Copyright



Popular passages

Page 138 - S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations.
Page 10 - Alpha microprocessors [4], using a 0.25 μm process technology. 4.2 4 x 2-way Superscalar Multiprocessor Architecture: The MP architecture is made up of four 2-way superscalar processors interconnected by a crossbar that allows the processors to share the L2 cache. On the die, the four processors are arranged in a grid with the L2 cache at one end, as shown in Figure 3. Internally, each of the processors has a register renaming buffer that is much more limited than the one in the 6-way architecture,...

About the author (2007)

Kunle Olukotun is a Professor of Electrical Engineering and Computer Science at Stanford University.
