  • Chen C, Yang W, Wang F, Zhao D, Liu Y, Deng L and Yang C. Reverse Offload Programming on Heterogeneous Systems. IEEE Access. 10.1109/ACCESS.2019.2891740. 7. (10787-10797).

    https://ieeexplore.ieee.org/document/8606083/

  • Chen C, Yang F, Wang F, Deng L and Zhao D. (2018). Review of Programming and Performance Optimization on CPU-MIC Heterogeneous System. 2018 IEEE 3rd International Conference on Image, Vision and Computing (ICIVC). 10.1109/ICIVC.2018.8492841. 978-1-5386-4991-6. (894-900).

    https://ieeexplore.ieee.org/document/8492841/

  • Tanasic I, Gelado I, Jorda M, Ayguade E and Navarro N. Efficient exception handling support for GPUs. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. (109-122).

    https://doi.org/10.1145/3123939.3123950

  • Mittal S and Vetter J. (2015). A Survey of CPU-GPU Heterogeneous Computing Techniques. ACM Computing Surveys. 47:4. (1-35). Online publication date: 21-Jul-2015.

    https://doi.org/10.1145/2788396

  • Li Z, Goswami N and Li T. (2015). iConn. ACM Journal on Emerging Technologies in Computing Systems. 11:4. (1-23). Online publication date: 27-Apr-2015.

    https://doi.org/10.1145/2700238

  • Zhu E, Ma R, Hou Y, Yang Y, Liu F and Guan H. (2014). Two-phase execution of binary applications on CPU/GPU machines. Computers and Electrical Engineering. 40:5. (1567-1579). Online publication date: 1-Jul-2014.

    https://doi.org/10.1016/j.compeleceng.2014.02.002

  • Newburn C, Deodhar R, Dmitriev S, Murty R, Narayanaswamy R, Wiegert J, Chinchilla F and McGuire R. (2013). Offload Compiler Runtime for the Intel® Xeon Phi™ Coprocessor. Supercomputing. 10.1007/978-3-642-38750-0_18. (239-254).

    http://link.springer.com/10.1007/978-3-642-38750-0_18

  • Kambadur M, Tang K and Kim M. (2012). Harmony. ACM SIGARCH Computer Architecture News. 40:3. (452-463). Online publication date: 5-Sep-2012.

    https://doi.org/10.1145/2366231.2337211

  • Kambadur M, Tang K and Kim M. Harmony. Proceedings of the 39th Annual International Symposium on Computer Architecture. (452-463).

    /doi/10.5555/2337159.2337211

  • Ravi V, Ma W, Chiu D and Agrawal G. (2012). Compiler and runtime support for enabling reduction computations on heterogeneous systems. Concurrency and Computation: Practice & Experience. 24:5. (463-480). Online publication date: 1-Apr-2012.

    https://doi.org/10.1002/cpe.1848

  • Ravi V, Ma W, Chiu D and Agrawal G. Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations. Proceedings of the 24th ACM International Conference on Supercomputing. (137-146).

    https://doi.org/10.1145/1810085.1810106

  • Kim J, Lee Y, Park J and Lee J. Translating OpenMP device constructs to OpenCL using unnecessary data transfer elimination. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. (1-12).

    /doi/10.5555/3014904.3014973

  • Kim J, Lee Y, Park J and Lee J. (2016). Translating OpenMP Device Constructs to OpenCL Using Unnecessary Data Transfer Elimination. SC16: International Conference for High Performance Computing, Networking, Storage and Analysis. 10.1109/SC.2016.50. 978-1-4673-8815-3. (597-608).

    http://ieeexplore.ieee.org/document/7877129/

  • Lustig D, Trippel C, Pellauer M and Martonosi M. (2015). ArMOR. ACM SIGARCH Computer Architecture News. 43:3S. (388-400). Online publication date: 4-Jan-2016.

    https://doi.org/10.1145/2872887.2750378

  • Ren B, Ravi N, Yang Y, Feng M, Agrawal G and Chakradhar S. Automatic and Efficient Data Host-Device Communication for Many-Core Coprocessors. Revised Selected Papers of the 28th International Workshop on Languages and Compilers for Parallel Computing - Volume 9519. (173-190).

    https://doi.org/10.1007/978-3-319-29778-1_11

  • Lustig D, Trippel C, Pellauer M and Martonosi M. ArMOR. Proceedings of the 42nd Annual International Symposium on Computer Architecture. (388-400).

    https://doi.org/10.1145/2749469.2750378

  • Lee J, Nigania N, Kim H, Patel K and Kim H. (2016). OpenCL performance evaluation on modern multicore CPUs. Scientific Programming. 2015. (4-4). Online publication date: 1-Jan-2015.

    https://doi.org/10.1155/2015/859491

  • Song L, Feng M, Ravi N, Yang Y and Chakradhar S. COMP. Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. (659-671).

    https://doi.org/10.1109/MICRO.2014.30

  • Lim J and Kim H. Design Space Exploration of Memory Model for Heterogeneous Computing. Proceedings of the 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing. (160-167).

    https://doi.org/10.1109/SBAC-PAD.2014.9

  • Chandramohan K and O'Boyle M. A compiler framework for automatically mapping data parallel programs to heterogeneous MPSoCs. Proceedings of the 2014 International Conference on Compilers, Architecture and Synthesis for Embedded Systems. (1-10).

    https://doi.org/10.1145/2656106.2656107

  • Sato M, Fukazawa G, Shimada A, Hori A, Ishikawa Y and Namiki M. Design of Multiple PVAS on InfiniBand Cluster System Consisting of Many-core and Multi-core. Proceedings of the 21st European MPI Users' Group Meeting. (133-138).

    https://doi.org/10.1145/2642769.2642795

  • Gerofi B, Shimada A, Hori A, Masamichi T and Ishikawa Y. CMCP. Proceedings of the 23rd international symposium on High-performance parallel and distributed computing. (73-84).

    https://doi.org/10.1145/2600212.2600231

  • Chen L, Huo X and Agrawal G. Scheduling Methods for Accelerating Applications on Architectures with Heterogeneous Cores. Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops. (48-57).

    https://doi.org/10.1109/IPDPSW.2014.11

  • Lee C, Ro W and Gaudiot J. (2014). Boosting CUDA Applications with CPU-GPU Hybrid Computing. International Journal of Parallel Programming. 42:2. (384-404). Online publication date: 1-Apr-2014.

    https://doi.org/10.1007/s10766-013-0252-y

  • Nürnberger S, Drescher G, Rotta R, Nolte J and Schröder-Preikschat W. (2014). Shared Memory in the Many-Core Age. Euro-Par 2014: Parallel Processing Workshops. 10.1007/978-3-319-14313-2_30. (351-362).

    http://link.springer.com/10.1007/978-3-319-14313-2_30

  • Gerofi B, Takagi M and Ishikawa Y. (2014). Exploiting Hidden Non-uniformity of Uniform Memory Access on Manycore CPUs. Euro-Par 2014: Parallel Processing Workshops. 10.1007/978-3-319-14313-2_21. (242-253).

    http://link.springer.com/10.1007/978-3-319-14313-2_21

  • Sampaio D, Souza R, Collange C and Pereira F. (2014). Divergence analysis. ACM Transactions on Programming Languages and Systems. 35:4. (1-36). Online publication date: 1-Dec-2013.

    https://doi.org/10.1145/2523815

  • Ji F, Lin H and Ma X. RSVM. Proceedings of the 22nd international conference on Parallel architectures and compilation techniques. (269-278).

    /doi/10.5555/2523721.2523758

  • Feng Ji, Heshan Lin and Xiaosong Ma. (2013). Breaking SIMD shackles with an exposed flexible microarchitecture and the access execute PDG. 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT). 10.1109/PACT.2013.6618823. 978-1-4799-1018-2. (341-352).

    http://ieeexplore.ieee.org/document/6618823/

  • Zhao H, Shriraman A, Kumar S and Dwarkadas S. (2013). Protozoa. ACM SIGARCH Computer Architecture News. 41:3. (547-558). Online publication date: 26-Jun-2013.

    https://doi.org/10.1145/2508148.2485969

  • Zhao H, Shriraman A, Kumar S and Dwarkadas S. Protozoa. Proceedings of the 40th Annual International Symposium on Computer Architecture. (547-558).

    https://doi.org/10.1145/2485922.2485969

  • Newburn C, Dmitriev S, Narayanaswamy R, Wiegert J, Murty R, Chinchilla F, Deodhar R and McGuire R. Offload Compiler Runtime for the Intel® Xeon Phi™ Coprocessor. Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum. (1213-1225).

    https://doi.org/10.1109/IPDPSW.2013.251

  • Karlin I, Bhatele A, Keasler J, Chamberlain B, Cohen J, Devito Z, Haque R, Laney D, Luke E, Wang F, Richards D, Schulz M and Still C. Exploring Traditional and Emerging Parallel Programming Models Using a Proxy Application. Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing. (919-932).

    https://doi.org/10.1109/IPDPS.2013.115

  • Gerofi B, Shimada A, Hori A and Ishikawa Y. Partially separated page tables for efficient operating system assisted hierarchical memory management on heterogeneous architectures. Proceedings of the 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing. (360-368).

    https://doi.org/10.1109/CCGrid.2013.59

  • Coutinho B, Sampaio D, Pereira F and Meira W. (2012). Profiling divergences in GPU applications. Concurrency and Computation: Practice and Experience. 10.1002/cpe.2853. 25:6. (775-789). Online publication date: 25-Apr-2013.

    https://onlinelibrary.wiley.com/doi/10.1002/cpe.2853

  • Pienaar J, Chakradhar S and Raghunathan A. Automatic generation of software pipelines for heterogeneous parallel systems. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. (1-12).

    /doi/10.5555/2388996.2389029

  • Gerofi B, Shimada A, Hori A and Ishikawa Y. Abstract. Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis. (1350-1351).

    https://doi.org/10.1109/SC.Companion.2012.181

  • Pienaar J, Chakradhar S and Raghunathan A. Automatic generation of software pipelines for heterogeneous parallel systems. Proceedings of the 2012 International Conference for High Performance Computing, Networking, Storage and Analysis. (1-12).

    https://doi.org/10.1109/SC.2012.22

  • Gerofi B, Hori A and Ishikawa Y. clone_n(). Proceedings of the 2012 IEEE International Conference on Cluster Computing. (592-596).

    https://doi.org/10.1109/CLUSTER.2012.85

  • Pai S, Govindarajan R and Thazhuthaveetil M. Fast and efficient automatic memory management for GPUs using compiler-assisted runtime coherence scheme. Proceedings of the 21st international conference on Parallel architectures and compilation techniques. (33-42).

    https://doi.org/10.1145/2370816.2370824

  • Matsuo Y, Shimosawa T and Ishikawa Y. A file I/O system for many-core based clusters. Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers. (1-8).

    https://doi.org/10.1145/2318916.2318920

  • Becchi M, Sajjapongse K, Graves I, Procter A, Ravi V and Chakradhar S. A virtual memory based runtime to support multi-tenancy in clusters with GPUs. Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing. (97-108).

    https://doi.org/10.1145/2287076.2287090

  • Lim J and Kim H. Design space exploration of memory model for heterogeneous computing. Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness. (74-75).

    https://doi.org/10.1145/2247684.2247700

  • Lin F, Wang Z, LiKamWa R and Zhong L. (2012). Reflex. ACM SIGPLAN Notices. 47:4. (13-24). Online publication date: 1-Jun-2012.

    https://doi.org/10.1145/2248487.2150979

  • Kambadur M, Tang K and Kim M. (2012). Harmony: Collection and analysis of parallel block vectors. 2012 ACM/IEEE 39th International Symposium on Computer Architecture (ISCA). 10.1109/ISCA.2012.6237039. 978-1-4673-0476-4. (452-463).

    http://ieeexplore.ieee.org/document/6237039/

  • Lin F, Wang Z, LiKamWa R and Zhong L. (2012). Reflex. ACM SIGARCH Computer Architecture News. 40:1. (13-24). Online publication date: 18-Apr-2012.

    https://doi.org/10.1145/2189750.2150979

  • Lin F, Wang Z, LiKamWa R and Zhong L. Reflex. Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems. (13-24).

    https://doi.org/10.1145/2150976.2150979

  • Lee J, Kim J, Kim J, Seo S and Lee J. An OpenCL Framework for Homogeneous Manycores with No Hardware Cache Coherence. Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques. (56-67).

    https://doi.org/10.1109/PACT.2011.12

  • Yan S, Zhou X, Gao Y, Chen H, Wu G, Luo S and Saha B. (2011). Optimizing a shared virtual memory system for a heterogeneous CPU-accelerator platform. ACM SIGOPS Operating Systems Review. 45:1. (92-100). Online publication date: 18-Feb-2011.

    https://doi.org/10.1145/1945023.1945035

  • Chinya G, Collins J, Wang P, Jiang H, Lueh G, Piazza T and Wang H. (2011). Bothnia. ACM SIGOPS Operating Systems Review. 45:1. (11-20). Online publication date: 18-Feb-2011.

    https://doi.org/10.1145/1945023.1945027

  • Kelm J, Johnson D, Tuohy W, Lumetta S and Patel S. (2011). Cohesion. IEEE Micro. 31:1. (42-55). Online publication date: 1-Jan-2011.

    https://doi.org/10.1109/MM.2011.8

  • Coutinho B, Sampaio D, Pereira F and Meira Jr. W. Performance Debugging of GPGPU Applications with the Divergence Map. Proceedings of the 2010 22nd International Symposium on Computer Architecture and High Performance Computing. (33-40).

    https://doi.org/10.1109/SBAC-PAD.2010.38

  • Gummaraju J, Morichetti L, Houston M, Sander B, Gaster B and Zheng B. Twin peaks. Proceedings of the 19th international conference on Parallel architectures and compilation techniques. (205-216).

    https://doi.org/10.1145/1854273.1854302

  • Lee J, Kim J, Seo S, Kim S, Park J, Kim H, Dao T, Cho Y, Seo S, Lee S, Cho S, Song H, Suh S and Choi J. An OpenCL framework for heterogeneous multicores with local memory. Proceedings of the 19th international conference on Parallel architectures and compilation techniques. (193-204).

    https://doi.org/10.1145/1854273.1854301

  • Kelm J, Johnson D, Tuohy W, Lumetta S and Patel S. (2010). Cohesion. ACM SIGARCH Computer Architecture News. 38:3. (429-440). Online publication date: 19-Jun-2010.

    https://doi.org/10.1145/1816038.1816019

  • Kelm J, Johnson D, Tuohy W, Lumetta S and Patel S. Cohesion. Proceedings of the 37th annual international symposium on Computer architecture. (429-440).

    https://doi.org/10.1145/1815961.1816019

  • Becchi M, Byna S, Cadambi S and Chakradhar S. Data-aware scheduling of legacy kernels on heterogeneous platforms with distributed memory. Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures. (82-91).

    https://doi.org/10.1145/1810479.1810498

  • Garg R and Amaral J. Compiling Python to a hybrid execution environment. Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units. (19-30).

    https://doi.org/10.1145/1735688.1735695