Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency

Chip multiprocessors (also called multi-core microprocessors, or CMPs for short) are now the only practical way to build high-performance microprocessors, for a variety of reasons. Large uniprocessors no longer scale in performance, because only a limited amount of parallelism can be extracted from a typical instruction stream using conventional superscalar instruction issue techniques. In addition, one cannot simply ratchet up the clock speed of today's processors, or the power dissipation becomes prohibitive in all but water-cooled systems.

After a discussion of the basic pros and cons of CMPs compared with conventional uniprocessors, this book examines how CMPs can best be designed to handle two radically different kinds of workloads: highly parallel, throughput-sensitive applications at one end of the spectrum, and less parallel, latency-sensitive applications at the other. Throughput-sensitive applications, such as server workloads that handle many independent transactions at once, require careful balancing of all parts of a CMP that can limit throughput: the individual cores, the on-chip cache memory, and the off-chip memory interfaces. Several studies and example systems, such as Sun's Niagara, that examine the necessary tradeoffs are presented here. In contrast, latency-sensitive applications (many desktop applications fall into this category) require a focus on reducing inter-core communication latency and on techniques that help programmers divide their programs into multiple threads as easily as possible. This book discusses many techniques that can be used in CMPs to simplify parallel programming, with an emphasis on research directions proposed at Stanford University.
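The throughput-oriented style described above can be sketched in software: a pool of simple workers, each handling one independent transaction, mirrors a CMP such as Niagara that trades single-thread speed for many hardware contexts. This is a minimal illustrative sketch, not from the book; `handle_transaction` and its doubling "work" are hypothetical stand-ins for real request processing.

```python
from concurrent.futures import ThreadPoolExecutor

def handle_transaction(request_id: int) -> int:
    # Hypothetical per-transaction work: each request is independent
    # of every other, so no inter-thread communication is needed.
    return request_id * 2

def serve(requests, workers=8):
    # Many simple workers, one independent transaction each: total
    # throughput scales with the worker count, provided shared
    # resources (caches, memory bandwidth) can keep up.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle_transaction, requests))
```

Because the transactions never share data, adding workers raises throughput without any synchronization; the balancing problem the book describes arises in the shared cache and memory system, not in the code.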
To illustrate the advantages of a CMP with a couple of concrete examples, extra focus is given to thread-level speculation (TLS), a way to automatically break nominally sequential applications into parallel threads on a CMP, and to transactional memory, a model that can greatly simplify manual parallel programming by using hardware, instead of conventional software locks, to enforce atomic execution of blocks of instructions, a technique that makes parallel coding much less error-prone.
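The programming-model difference can be sketched as follows. This toy `atomic()` block is built on a single global lock purely for illustration; real transactional memory executes blocks optimistically in hardware and rolls back on conflict. What the sketch captures is the interface: the programmer marks a region atomic rather than deciding which lock protects which data.

```python
import threading
from contextlib import contextmanager

# Toy stand-in for a hardware transaction: one global lock serializes
# all atomic blocks. (Hardware TM would run them concurrently and
# abort only on actual data conflicts.)
_global_tm_lock = threading.Lock()

@contextmanager
def atomic():
    with _global_tm_lock:
        yield

counter = 0

def deposit(amount: int) -> None:
    global counter
    with atomic():              # the whole block executes atomically;
        tmp = counter           # no per-variable lock to choose, and
        counter = tmp + amount  # so no lock to choose incorrectly

def run(n_threads=4, per_thread=1000):
    threads = [
        threading.Thread(target=lambda: [deposit(1) for _ in range(per_thread)])
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

With locks, the read-modify-write in `deposit` is only safe if every caller agrees on the same lock; with an atomic block, atomicity is a property of the region itself, which is the error-proneness argument the book makes.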
Contents
The Case for CMPs  1
Improving Throughput  21
Improving Latency Automatically  61
Improving Latency Using Manual Parallel Programming  103
The Future of CMPs  141
Popular passages
Page 10 - Alpha microprocessors [4], using a 0.25 µm process technology 4.2 4 x 2-way Superscalar Multiprocessor Architecture The MP architecture is made up of four 2-way superscalar processors interconnected by a crossbar that allows the processors to share the L2 cache. On the die, the four processors are arranged in a grid with the L2 cache at one end, as shown in Figure 3. Internally, each of the processors has a register renaming buffer that is much more limited than the one in the 6-way architecture,...