Abstract
Conventional dynamically scheduled processors often use a fully associative structure, the load/store queue (LSQ), to communicate values from older in-flight stores to loads and to detect store-load ordering violations. However, this in-flight forwarding accounts for only about 15% of all store-load communication, which makes the CAM-based micro-architecture the major bottleneck to scaling store-load communication further. This paper presents a new micro-architecture named ASW (short for active store window). It provides a new structure, the speculative active store window, that implements more aggressively speculative store-load forwarding than a conventional LSQ. This structure can forward the data of committed stores to executing loads without accessing the L1 data cache, which is referred to as far forwarding in this paper. At the back end of the pipeline, it uses in-order load re-execution filtered by the tagged SSBF (short for store sequence Bloom filter) to verify the correctness of the store-load forwarding. The speculative active store window and the tagged store sequence Bloom filter are both set-associative structures, which are more efficient and scalable than fully associative structures. Experiments show that this simpler and faster design outperforms a conventional load/store queue based design and the NoSQ design on most benchmarks, by 10.22% and 8.71% respectively.
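As a rough illustration of far forwarding through a set-associative table, the following toy model may help. It is a minimal sketch under stated assumptions, not the design evaluated in the paper: the class name `ActiveStoreWindow`, the modulo set index, the least-recently-written replacement policy, and the helpers `commit_store` and `execute_load` are all hypothetical, and a plain dictionary stands in for the L1 data cache. Speculation, partial-word accesses, and the tagged SSBF verification step are left out.

```python
from collections import OrderedDict


class ActiveStoreWindow:
    """Toy set-associative table of recent stores, keyed by address.

    Hypothetical model for illustration only; indexing and replacement
    are assumptions, not details taken from the paper.
    """

    def __init__(self, num_sets=4, ways=2):
        self.num_sets = num_sets
        self.ways = ways
        # One OrderedDict per set: address -> (store sequence number, data),
        # ordered from least to most recently written.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def _set_for(self, addr):
        # Assumed index function: low-order bits of the address.
        return self.sets[addr % self.num_sets]

    def record_store(self, addr, ssn, data):
        entries = self._set_for(addr)
        if addr in entries:
            del entries[addr]  # re-insert so this address becomes the newest
        elif len(entries) >= self.ways:
            entries.popitem(last=False)  # evict the least recently written
        entries[addr] = (ssn, data)

    def lookup(self, addr):
        """Return (ssn, data) on a hit, or None on a miss."""
        return self._set_for(addr).get(addr)


def commit_store(window, cache, addr, ssn, data):
    """Write the store to the stand-in cache and keep it in the window."""
    cache[addr] = data
    window.record_store(addr, ssn, data)


def execute_load(window, cache, addr):
    """Forward from the window on a hit; otherwise read the stand-in cache."""
    hit = window.lookup(addr)
    if hit is not None:
        return hit[1], "forwarded"
    return cache.get(addr, 0), "cache"
```

In this sketch a load that hits in the window obtains the data of an already committed store without touching the cache dictionary, which is the behaviour the abstract calls far forwarding. A load that misses, for example because its entry was evicted from a full set, simply falls back to the cache.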
Additional information
This work was supported by the National High Technology Research and Development 863 Program of China under Grant No. 2009ZX01029-001-002 and the Postdoctoral Science Foundation of China under Grant No. 20110490208.
Cite this article
Zhang, ZH., Wang, XY., Tong, D. et al. Active Store Window: Enabling Far Store-Load Forwarding with Scalability and Complexity-Efficiency. J. Comput. Sci. Technol. 27, 769–780 (2012). https://doi.org/10.1007/s11390-012-1263-7
DOI: https://doi.org/10.1007/s11390-012-1263-7