Hardware and Software Mechanisms for Reducing Load Latency
Abstract: "As processor demands quickly outpace memory, the performance of load instructions becomes an increasingly critical component to good system performance. This thesis contributes four novel load latency reduction techniques, each targeting a different component of load latency: address calculation, data cache access, address translation, and data cache misses. The contributed techniques are as follows: Fast Address Calculation employs a stateless set index predictor to allow address calculation to overlap with data cache access. The design eliminates the latency of address calculation for many loads. Zero-Cycle Loads combine fast address calculation with an early-issue mechanism to produce pipeline designs capable of hiding the latency of many loads that hit in the data cache. High-Bandwidth Address Translation develops address translation mechanisms with better latency and area characteristics than a multi-ported TLB. The new designs provide multiple-issue processors with effective alternatives for keeping address translation off the critical path of data cache access. Cache-conscious Data Placement is a profile-guided data placement optimization for reducing the frequency of data cache misses. The approach employs heuristic algorithms to find variable placement solutions that decrease inter-variable conflict, and increase cache line utilization and block prefetch. Detailed design descriptions and experimental evaluations are provided for each approach, confirming the designs as cost-effective and practical solutions for reducing load latency."
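The abstract does not say how the stateless set index predictor works. As a rough illustration only, the sketch below assumes one plausible scheme: the set index is guessed by OR-ing the index fields of the base register and the offset (a carry-free addition), and the guess is checked against the full sum. The cache geometry (`BLOCK_BITS`, `INDEX_BITS`) and all function names are hypothetical, chosen for the example and not taken from the thesis.

```python
# Illustrative sketch of a stateless set index predictor.
# ASSUMPTIONS (not from the abstract): 32-byte blocks, 512 sets,
# 32-bit addresses, and carry-free (bitwise OR) prediction.
BLOCK_BITS = 5    # assumed: log2 of the cache block size in bytes
INDEX_BITS = 9    # assumed: log2 of the number of cache sets
ADDR_MASK = 0xFFFFFFFF


def set_index(addr):
    """Extract the set index field from an address."""
    return (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)


def predict_set_index(base, offset):
    """Guess the set index without waiting for a full add.

    OR-ing the two index fields ignores every carry, so the guess
    can be ready early enough to start the cache access.
    """
    return set_index(base) | set_index(offset)


def fast_load_index(base, offset):
    """Return (predicted, actual, correct) for one load address."""
    predicted = predict_set_index(base, offset)
    actual = set_index((base + offset) & ADDR_MASK)
    return predicted, actual, predicted == actual


# A small offset from an aligned base needs no carry: prediction holds.
print(fast_load_index(0x10000000, 8))
# A carry out of the block offset field changes the index: prediction
# fails, and the load would have to be replayed with the real address.
print(fast_load_index(0x1000001C, 8))
```

Under these assumptions the predictor keeps no history, which is what makes it stateless: the guess depends only on the two operands of the current load. A wrong guess costs a replay, so the scheme pays off only when most loads combine a well-aligned base with a small offset.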