Hardware and Software Mechanisms for Reducing Load Latency

Abstract: "As processor demands quickly outpace memory, the performance of load instructions becomes an increasingly critical component of good system performance. This thesis contributes four novel load-latency reduction techniques, each targeting a different component of load latency: address calculation, data cache access, address translation, and data cache misses. The contributed techniques are as follows: Fast Address Calculation employs a stateless set-index predictor to allow address calculation to overlap with data cache access. The design eliminates the latency of address calculation for many loads. Zero-Cycle Loads combine fast address calculation with an early-issue mechanism to produce pipeline designs capable of hiding the latency of many loads that hit in the data cache. High-Bandwidth Address Translation develops address translation mechanisms with better latency and area characteristics than a multi-ported TLB. The new designs provide multiple-issue processors with effective alternatives for keeping address translation off the critical path of data cache access. Cache-Conscious Data Placement is a profile-guided data placement optimization for reducing the frequency of data cache misses. The approach employs heuristic algorithms to find variable placement solutions that decrease inter-variable conflict and increase cache line utilization and block prefetch. Detailed design descriptions and experimental evaluations are provided for each approach, confirming the designs as cost-effective and practical solutions for reducing load latency."
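The core idea behind the fast address calculation described above can be illustrated with a small sketch. The abstract states only that a stateless predictor lets the set index be determined before the full effective-address addition completes; the sketch below assumes one plausible realization of that idea, in which the index bits of the base register and the offset are combined with a carry-free OR, which matches the full addition whenever no carry propagates into the index field. The cache geometry (32-byte blocks, 128 sets) and all function names are illustrative assumptions, not details from the thesis.

```python
# Illustrative sketch of a stateless set-index predictor (assumed design,
# not taken verbatim from the thesis). Cache geometry is also assumed:
OFFSET_BITS = 5                                  # 32-byte cache blocks
INDEX_BITS = 7                                   # 128 cache sets
INDEX_MASK = ((1 << INDEX_BITS) - 1) << OFFSET_BITS

def predicted_set_index(base: int, offset: int) -> int:
    """Carry-free prediction: OR the index fields instead of adding.

    This is available as soon as the base register is read, without
    waiting for the full effective-address addition.
    """
    return ((base | offset) & INDEX_MASK) >> OFFSET_BITS

def actual_set_index(base: int, offset: int) -> int:
    """True set index, computed from the full base + offset addition."""
    return ((base + offset) & INDEX_MASK) >> OFFSET_BITS

# The prediction matches whenever the addition generates no carry into
# the index field -- common for small positive offsets against aligned
# base pointers. A mismatch would force the access to replay with the
# correct index.
base, offset = 0x10008000, 0x24                  # aligned base, small offset
assert predicted_set_index(base, offset) == actual_set_index(base, offset)
```

Because the predictor needs no table or history, it is stateless in the sense the abstract uses: correctness is checked after the fact by comparing against the fully computed address, and a misprediction simply costs a replayed cache access.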
Contents
High-Bandwidth Address Translation
Experimental Framework
Fast Address Calculation