Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency

Chip multiprocessors (also called multi-core microprocessors, or CMPs for short) are now the only practical way to build high-performance microprocessors, for a variety of reasons. Large uniprocessors no longer scale in performance, because only a limited amount of parallelism can be extracted from a typical instruction stream using conventional superscalar instruction issue techniques. In addition, one cannot simply ratchet up the clock speed of today's processors, or the power dissipation becomes prohibitive in all but water-cooled systems.

After a discussion of the basic pros and cons of CMPs compared with conventional uniprocessors, this book examines how CMPs can best be designed to handle two radically different kinds of workloads: highly parallel, throughput-sensitive applications at one end of the spectrum, and less parallel, latency-sensitive applications at the other. Throughput-sensitive applications, such as server workloads that handle many independent transactions at once, require careful balancing of all parts of a CMP that can limit throughput: the individual cores, the on-chip cache memory, and the off-chip memory interfaces. Several studies and example systems, such as Sun's Niagara, that examine the necessary tradeoffs are presented here. In contrast, latency-sensitive applications (many desktop applications fall into this category) require a focus on reducing inter-core communication latency and on techniques that help programmers divide their programs into multiple threads as easily as possible. This book discusses many techniques that can be used in CMPs to simplify parallel programming, with an emphasis on research directions proposed at Stanford University.
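The throughput-oriented style described above can be sketched in software: a pool of simple workers, each handling one independent transaction, mirrors a CMP such as Niagara that trades single-thread speed for many hardware contexts. This is a minimal illustrative sketch, not from the book; `handle_transaction` and its doubling "work" are hypothetical stand-ins for real request processing.

```python
from concurrent.futures import ThreadPoolExecutor

def handle_transaction(request_id: int) -> int:
    # Hypothetical per-transaction work: each request is independent
    # of every other, so no inter-thread communication is needed.
    return request_id * 2

def serve(requests, workers=8):
    # Many simple workers, one independent transaction each: total
    # throughput scales with the worker count, provided shared
    # resources (caches, memory bandwidth) can keep up.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle_transaction, requests))
```

Because the transactions never share data, adding workers raises throughput without any synchronization; the balancing problem the book describes arises in the shared cache and memory system, not in the code.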
To illustrate the advantages of a CMP with a couple of concrete examples, extra focus is given to thread-level speculation (TLS), a way to automatically break nominally sequential applications into parallel threads on a CMP, and to transactional memory, a model that can greatly simplify manual parallel programming by using hardware, instead of conventional software locks, to enforce atomic execution of blocks of instructions, a technique that makes parallel coding much less error-prone.
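The programming-model difference can be sketched as follows. This toy `atomic()` block is built on a single global lock purely for illustration; real transactional memory executes blocks optimistically in hardware and rolls back on conflict. What the sketch captures is the interface: the programmer marks a region atomic rather than deciding which lock protects which data.

```python
import threading
from contextlib import contextmanager

# Toy stand-in for a hardware transaction: one global lock serializes
# all atomic blocks. (Hardware TM would run them concurrently and
# abort only on actual data conflicts.)
_global_tm_lock = threading.Lock()

@contextmanager
def atomic():
    with _global_tm_lock:
        yield

counter = 0

def deposit(amount: int) -> None:
    global counter
    with atomic():              # the whole block executes atomically;
        tmp = counter           # no per-variable lock to choose, and
        counter = tmp + amount  # so no lock to choose incorrectly

def run(n_threads=4, per_thread=1000):
    threads = [
        threading.Thread(target=lambda: [deposit(1) for _ in range(per_thread)])
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

With locks, the read-modify-write in `deposit` is only safe if every caller agrees on the same lock; with an atomic block, atomicity is a property of the region itself, which is the error-proneness argument the book makes.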
Contents
The Case for CMPs  1
Improving Throughput  21
Improving Latency Automatically  61
Improving Latency Using Manual Parallel Programming  103
The Future of CMPs  141
Popular passages
Page 10 - Alpha microprocessors [4], using a 0.25 µm process technology 4.2 4 x 2-way Superscalar Multiprocessor Architecture The MP architecture is made up of four 2-way superscalar processors interconnected by a crossbar that allows the processors to share the L2 cache. On the die, the four processors are arranged in a grid with the L2 cache at one end, as shown in Figure 3. Internally, each of the processors has a register renaming buffer that is much more limited than the one in the 6-way architecture,...