Chen C, Yang W, Wang F, Zhao D, Liu Y, Deng L and Yang C. Reverse Offload Programming on Heterogeneous Systems. IEEE Access. 10.1109/ACCESS.2019.2891740. 7. (10787-10797).

https://ieeexplore.ieee.org/document/8606083/

Chen C, Yang F, Wang F, Deng L and Zhao D. (2018). Review of Programming and Performance Optimization on CPU-MIC Heterogeneous System. 2018 IEEE 3rd International Conference on Image, Vision and Computing (ICIVC). 10.1109/ICIVC.2018.8492841. 978-1-5386-4991-6. (894-900).

https://ieeexplore.ieee.org/document/8492841/

Tanasic I, Gelado I, Jorda M, Ayguade E and Navarro N. Efficient exception handling support for GPUs. Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. (109-122).

https://doi.org/10.1145/3123939.3123950

Mittal S and Vetter J. (2015). A Survey of CPU-GPU Heterogeneous Computing Techniques. ACM Computing Surveys. 47:4. (1-35). Online publication date: 21-Jul-2015.

https://doi.org/10.1145/2788396

Li Z, Goswami N and Li T. (2015). iConn. ACM Journal on Emerging Technologies in Computing Systems. 11:4. (1-23). Online publication date: 27-Apr-2015.

https://doi.org/10.1145/2700238

Zhu E, Ma R, Hou Y, Yang Y, Liu F and Guan H. (2014). Two-phase execution of binary applications on CPU/GPU machines. Computers and Electrical Engineering. 40:5. (1567-1579). Online publication date: 1-Jul-2014.

https://doi.org/10.1016/j.compeleceng.2014.02.002

Newburn C, Deodhar R, Dmitriev S, Murty R, Narayanaswamy R, Wiegert J, Chinchilla F and McGuire R. (2013). Offload Compiler Runtime for the Intel® Xeon Phi™ Coprocessor. Supercomputing. 10.1007/978-3-642-38750-0_18. (239-254).

http://link.springer.com/10.1007/978-3-642-38750-0_18

Kambadur M, Tang K and Kim M. (2012). Harmony. ACM SIGARCH Computer Architecture News. 40:3. (452-463). Online publication date: 5-Sep-2012.

https://doi.org/10.1145/2366231.2337211

Kambadur M, Tang K and Kim M. Harmony. Proceedings of the 39th Annual International Symposium on Computer Architecture. (452-463).

/doi/10.5555/2337159.2337211

Ravi V, Ma W, Chiu D and Agrawal G. (2012). Compiler and runtime support for enabling reduction computations on heterogeneous systems. Concurrency and Computation: Practice & Experience. 24:5. (463-480). Online publication date: 1-Apr-2012.

https://doi.org/10.1002/cpe.1848

Ravi V, Ma W, Chiu D and Agrawal G. Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations. Proceedings of the 24th ACM International Conference on Supercomputing. (137-146).

https://doi.org/10.1145/1810085.1810106

Kim J, Lee Y, Park J and Lee J. Translating OpenMP device constructs to OpenCL using unnecessary data transfer elimination. Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. (1-12).

/doi/10.5555/3014904.3014973

Kim J, Lee Y, Park J and Lee J. (2016). Translating OpenMP Device Constructs to OpenCL Using Unnecessary Data Transfer Elimination. SC16: International Conference for High Performance Computing, Networking, Storage and Analysis. 10.1109/SC.2016.50. 978-1-4673-8815-3. (597-608).

http://ieeexplore.ieee.org/document/7877129/

Lustig D, Trippel C, Pellauer M and Martonosi M. (2015). ArMOR. ACM SIGARCH Computer Architecture News. 43:3S. (388-400). Online publication date: 4-Jan-2016.

https://doi.org/10.1145/2872887.2750378

Ren B, Ravi N, Yang Y, Feng M, Agrawal G and Chakradhar S. Automatic and Efficient Data Host-Device Communication for Many-Core Coprocessors. Revised Selected Papers of the 28th International Workshop on Languages and Compilers for Parallel Computing - Volume 9519. (173-190).

https://doi.org/10.1007/978-3-319-29778-1_11

Lustig D, Trippel C, Pellauer M and Martonosi M. ArMOR. Proceedings of the 42nd Annual International Symposium on Computer Architecture. (388-400).

https://doi.org/10.1145/2749469.2750378

Lee J, Nigania N, Kim H, Patel K and Kim H. (2016). OpenCL performance evaluation on modern multicore CPUs. Scientific Programming. 2015. (4-4). Online publication date: 1-Jan-2015.

https://doi.org/10.1155/2015/859491

Song L, Feng M, Ravi N, Yang Y and Chakradhar S. COMP. Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture. (659-671).

https://doi.org/10.1109/MICRO.2014.30

Lim J and Kim H. Design Space Exploration of Memory Model for Heterogeneous Computing. Proceedings of the 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing. (160-167).

https://doi.org/10.1109/SBAC-PAD.2014.9

Chandramohan K and O'Boyle M. A compiler framework for automatically mapping data parallel programs to heterogeneous MPSoCs. Proceedings of the 2014 International Conference on Compilers, Architecture and Synthesis for Embedded Systems. (1-10).

https://doi.org/10.1145/2656106.2656107

Sato M, Fukazawa G, Shimada A, Hori A, Ishikawa Y and Namiki M. Design of Multiple PVAS on InfiniBand Cluster System Consisting of Many-core and Multi-core. Proceedings of the 21st European MPI Users' Group Meeting. (133-138).

https://doi.org/10.1145/2642769.2642795

Gerofi B, Shimada A, Hori A, Masamichi T and Ishikawa Y. CMCP. Proceedings of the 23rd international symposium on High-performance parallel and distributed computing. (73-84).

https://doi.org/10.1145/2600212.2600231

Chen L, Huo X and Agrawal G. Scheduling Methods for Accelerating Applications on Architectures with Heterogeneous Cores. Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops. (48-57).

https://doi.org/10.1109/IPDPSW.2014.11

Lee C, Ro W and Gaudiot J. (2014). Boosting CUDA Applications with CPU-GPU Hybrid Computing. International Journal of Parallel Programming. 42:2. (384-404). Online publication date: 1-Apr-2014.

https://doi.org/10.1007/s10766-013-0252-y

Nürnberger S, Drescher G, Rotta R, Nolte J and Schröder-Preikschat W. (2014). Shared Memory in the Many-Core Age. Euro-Par 2014: Parallel Processing Workshops. 10.1007/978-3-319-14313-2_30. (351-362).

http://link.springer.com/10.1007/978-3-319-14313-2_30

Gerofi B, Takagi M and Ishikawa Y. (2014). Exploiting Hidden Non-uniformity of Uniform Memory Access on Manycore CPUs. Euro-Par 2014: Parallel Processing Workshops. 10.1007/978-3-319-14313-2_21. (242-253).

http://link.springer.com/10.1007/978-3-319-14313-2_21

Sampaio D, Souza R, Collange C and Pereira F. (2014). Divergence analysis. ACM Transactions on Programming Languages and Systems. 35:4. (1-36). Online publication date: 1-Dec-2013.

https://doi.org/10.1145/2523815

Ji F, Lin H and Ma X. RSVM. Proceedings of the 22nd international conference on Parallel architectures and compilation techniques. (269-278).

/doi/10.5555/2523721.2523758

Ji F, Lin H and Ma X. (2013). Breaking SIMD shackles with an exposed flexible microarchitecture and the access execute PDG. 22nd International Conference on Parallel Architectures and Compilation Techniques (PACT). 10.1109/PACT.2013.6618823. 978-1-4799-1018-2. (341-352).

http://ieeexplore.ieee.org/document/6618823/

Zhao H, Shriraman A, Kumar S and Dwarkadas S. (2013). Protozoa. ACM SIGARCH Computer Architecture News. 41:3. (547-558). Online publication date: 26-Jun-2013.

https://doi.org/10.1145/2508148.2485969

Zhao H, Shriraman A, Kumar S and Dwarkadas S. Protozoa. Proceedings of the 40th Annual International Symposium on Computer Architecture. (547-558).

https://doi.org/10.1145/2485922.2485969

Newburn C, Dmitriev S, Narayanaswamy R, Wiegert J, Murty R, Chinchilla F, Deodhar R and McGuire R. Offload Compiler Runtime for the Intel® Xeon Phi™ Coprocessor. Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing Workshops and PhD Forum. (1213-1225).

https://doi.org/10.1109/IPDPSW.2013.251

Karlin I, Bhatele A, Keasler J, Chamberlain B, Cohen J, Devito Z, Haque R, Laney D, Luke E, Wang F, Richards D, Schulz M and Still C. Exploring Traditional and Emerging Parallel Programming Models Using a Proxy Application. Proceedings of the 2013 IEEE 27th International Symposium on Parallel and Distributed Processing. (919-932).

https://doi.org/10.1109/IPDPS.2013.115

Gerofi B, Shimada A, Hori A and Ishikawa Y. Partially separated page tables for efficient operating system assisted hierarchical memory management on heterogeneous architectures. Proceedings of the 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing. (360-368).

https://doi.org/10.1109/CCGrid.2013.59

Coutinho B, Sampaio D, Pereira F and Meira W. (2012). Profiling divergences in GPU applications. Concurrency and Computation: Practice and Experience. 10.1002/cpe.2853. 25:6. (775-789). Online publication date: 25-Apr-2013.

https://onlinelibrary.wiley.com/doi/10.1002/cpe.2853

Pienaar J, Chakradhar S and Raghunathan A. Automatic generation of software pipelines for heterogeneous parallel systems. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. (1-12).

/doi/10.5555/2388996.2389029

Gerofi B, Shimada A, Hori A and Ishikawa Y. Abstract. Proceedings of the 2012 SC Companion: High Performance Computing, Networking Storage and Analysis. (1350-1351).

https://doi.org/10.1109/SC.Companion.2012.181

Pienaar J, Chakradhar S and Raghunathan A. Automatic generation of software pipelines for heterogeneous parallel systems. Proceedings of the 2012 International Conference for High Performance Computing, Networking, Storage and Analysis. (1-12).

https://doi.org/10.1109/SC.2012.22

Gerofi B, Hori A and Ishikawa Y. clone_n(). Proceedings of the 2012 IEEE International Conference on Cluster Computing. (592-596).

https://doi.org/10.1109/CLUSTER.2012.85

Pai S, Govindarajan R and Thazhuthaveetil M. Fast and efficient automatic memory management for GPUs using compiler-assisted runtime coherence scheme. Proceedings of the 21st international conference on Parallel architectures and compilation techniques. (33-42).

https://doi.org/10.1145/2370816.2370824

Matsuo Y, Shimosawa T and Ishikawa Y. A file I/O system for many-core based clusters. Proceedings of the 2nd International Workshop on Runtime and Operating Systems for Supercomputers. (1-8).

https://doi.org/10.1145/2318916.2318920

Becchi M, Sajjapongse K, Graves I, Procter A, Ravi V and Chakradhar S. A virtual memory based runtime to support multi-tenancy in clusters with GPUs. Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing. (97-108).

https://doi.org/10.1145/2287076.2287090

Lim J and Kim H. Design space exploration of memory model for heterogeneous computing. Proceedings of the 2012 ACM SIGPLAN Workshop on Memory Systems Performance and Correctness. (74-75).

https://doi.org/10.1145/2247684.2247700

Lin F, Wang Z, LiKamWa R and Zhong L. (2012). Reflex. ACM SIGPLAN Notices. 47:4. (13-24). Online publication date: 1-Jun-2012.

https://doi.org/10.1145/2248487.2150979

Kambadur M, Tang K and Kim M. (2012). Harmony: Collection and analysis of parallel block vectors. 2012 ACM/IEEE 39th International Symposium on Computer Architecture (ISCA). 10.1109/ISCA.2012.6237039. 978-1-4673-0476-4. (452-463).

http://ieeexplore.ieee.org/document/6237039/

Lin F, Wang Z, LiKamWa R and Zhong L. (2012). Reflex. ACM SIGARCH Computer Architecture News. 40:1. (13-24). Online publication date: 18-Apr-2012.

https://doi.org/10.1145/2189750.2150979

Lin F, Wang Z, LiKamWa R and Zhong L. Reflex. Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems. (13-24).

https://doi.org/10.1145/2150976.2150979

Lee J, Kim J, Kim J, Seo S and Lee J. An OpenCL Framework for Homogeneous Manycores with No Hardware Cache Coherence. Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques. (56-67).

https://doi.org/10.1109/PACT.2011.12

Yan S, Zhou X, Gao Y, Chen H, Wu G, Luo S and Saha B. (2011). Optimizing a shared virtual memory system for a heterogeneous CPU-accelerator platform. ACM SIGOPS Operating Systems Review. 45:1. (92-100). Online publication date: 18-Feb-2011.

https://doi.org/10.1145/1945023.1945035

Chinya G, Collins J, Wang P, Jiang H, Lueh G, Piazza T and Wang H. (2011). Bothnia. ACM SIGOPS Operating Systems Review. 45:1. (11-20). Online publication date: 18-Feb-2011.

https://doi.org/10.1145/1945023.1945027

Kelm J, Johnson D, Tuohy W, Lumetta S and Patel S. (2011). Cohesion. IEEE Micro. 31:1. (42-55). Online publication date: 1-Jan-2011.

https://doi.org/10.1109/MM.2011.8

Coutinho B, Sampaio D, Pereira F and Meira Jr. W. Performance Debugging of GPGPU Applications with the Divergence Map. Proceedings of the 2010 22nd International Symposium on Computer Architecture and High Performance Computing. (33-40).

https://doi.org/10.1109/SBAC-PAD.2010.38

Gummaraju J, Morichetti L, Houston M, Sander B, Gaster B and Zheng B. Twin peaks. Proceedings of the 19th international conference on Parallel architectures and compilation techniques. (205-216).

https://doi.org/10.1145/1854273.1854302

Lee J, Kim J, Seo S, Kim S, Park J, Kim H, Dao T, Cho Y, Seo S, Lee S, Cho S, Song H, Suh S and Choi J. An OpenCL framework for heterogeneous multicores with local memory. Proceedings of the 19th international conference on Parallel architectures and compilation techniques. (193-204).

https://doi.org/10.1145/1854273.1854301

Kelm J, Johnson D, Tuohy W, Lumetta S and Patel S. (2010). Cohesion. ACM SIGARCH Computer Architecture News. 38:3. (429-440). Online publication date: 19-Jun-2010.

https://doi.org/10.1145/1816038.1816019

Kelm J, Johnson D, Tuohy W, Lumetta S and Patel S. Cohesion. Proceedings of the 37th annual international symposium on Computer architecture. (429-440).

https://doi.org/10.1145/1815961.1816019

Becchi M, Byna S, Cadambi S and Chakradhar S. Data-aware scheduling of legacy kernels on heterogeneous platforms with distributed memory. Proceedings of the twenty-second annual ACM symposium on Parallelism in algorithms and architectures. (82-91).

https://doi.org/10.1145/1810479.1810498

Garg R and Amaral J. Compiling Python to a hybrid execution environment. Proceedings of the 3rd Workshop on General-Purpose Computation on Graphics Processing Units. (19-30).

https://doi.org/10.1145/1735688.1735695