Issue 5 - Volume 513 - Journal of Physics: Conference Series

052001

The following article is Open access

The Telescope Array Middle Drum fluorescence detector simulation on GPUs

Tareq Abu-Zayyad and (Telescope-Array Collaboration)

View article, The Telescope Array Middle Drum fluorescence detector simulation on GPUs PDF, The Telescope Array Middle Drum fluorescence detector simulation on GPUs

In recent years, the Graphics Processing Unit (GPU) has been recognized and widely used as an accelerator for many scientific calculations. In general, problems amenable to parallelization are ones that benefit most from the use of GPUs. The Monte Carlo simulation of fluorescence detector response to air showers presents many opportunities for parallelization. In this paper we report on a Monte Carlo program used for the simulation of the Telescope Array Fluorescence Detector located at the Middle Drum site which uses GPU acceleration. All of the physics simulation from shower development, light production and atmospheric attenuation, as well as, the realistic detector optics and electronics simulations are done on the GPU. A detailed description of the code implementation is given, and results on the accuracy and performance of the simulation are presented as well. Improvements in computational throughput in excess of 50× are reported and the accuracy of the results is on par with the CPU implementation of the simulation.

https://doi.org/10.1088/1742-6596/513/5/052001

052002

The following article is Open access

Architectural improvements and 28 nm FPGA implementation of the APEnet+ 3D Torus network for hybrid HPC systems

Roberto Ammendola, Andrea Biagioni, Ottorino Frezza, Francesca Lo Cicero, Pier Stanislao Paolucci, Alessandro Lonardo, Davide Rossetti, Francesco Simula, Laura Tosoratto and Piero Vicini

View article, Architectural improvements and 28 nm FPGA implementation of the APEnet+ 3D Torus network for hybrid HPC systems PDF, Architectural improvements and 28 nm FPGA implementation of the APEnet+ 3D Torus network for hybrid HPC systems

Modern Graphics Processing Units (GPUs) are now considered accelerators for general purpose computation. A tight interaction between the GPU and the interconnection network is the strategy to express the full potential on capability computing of a multi-GPU system on large HPC clusters; that is the reason why an efficient and scalable interconnect is a key technology to finally deliver GPUs for scientific HPC. In this paper we show the latest architectural and performance improvement of the APEnet+ network fabric, a FPGA-based PCIe board with 6 fully bidirectional off-board links with 34 Gbps of raw bandwidth per direction, and X8 Gen2 bandwidth towards the host PC. The board implements a Remote Direct Memory Access (RDMA) protocol that leverages upon peer-to-peer (P2P) capabilities of Fermi- and Kepler-class NVIDIA GPUs to obtain real zero-copy, low-latency GPU-to-GPU transfers. Finally, we report on the development activities for 2013 focusing on the adoption of the latest generation 28 nm FPGAs and the preliminary tests performed on this new platform.

https://doi.org/10.1088/1742-6596/513/5/052002

052003

The following article is Open access

GooFit: A library for massively parallelising maximum-likelihood fits

R Andreassen, B T Meadows, M de Silva, M D Sokoloff and K Tomko

View article, GooFit: A library for massively parallelising maximum-likelihood fits PDF, GooFit: A library for massively parallelising maximum-likelihood fits

Fitting complicated models to large datasets is a bottleneck of many analyses. We present GooFit, a library and tool for constructing arbitrarily-complex probability density functions (PDFs) to be evaluated on nVidia GPUs or on multicore CPUs using OpenMP. The massive parallelisation of dividing up event calculations between hundreds of processors can achieve speedups of factors 200-300 in real-world problems.

https://doi.org/10.1088/1742-6596/513/5/052003

052004

The following article is Open access

A GPU offloading mechanism for LHCb

Alexey Badalov, Daniel Hugo Campora Perez, Alexander Zvyagin, Niko Neufeld and Xavier Vilasis Cardona

View article, A GPU offloading mechanism for LHCb PDF, A GPU offloading mechanism for LHCb

The current computational infrastructure at LHCb is designed for sequential execution. It is possible to make use of modern multi-core machines by using multi-threaded algorithms and running multiple instances in parallel, but there is no way to make efficient use of specialized massively parallel hardware, such as graphical processing units and Intel Xeon/Phi. We extend the current infrastructure with an out-of-process computational server able to gather data from multiple instances and process them in large batches.

https://doi.org/10.1088/1742-6596/513/5/052004

052005

The following article is Open access

ROOT I/O in JavaScript

Bertrand Bellenot and Sergey Linev

View article, ROOT I/O in JavaScript PDF, ROOT I/O in JavaScript

In order to be able to browse (inspect) ROOT files in a platform independent way, a JavaScript version of the ROOT I/O subsystem has been developed. This allows the content of ROOT files to be displayed in most available web browsers, without having to install ROOT or any other software on the server or on the client. This gives a direct access to ROOT files from any new device in a lightweight way. It is possible to display simple graphical objects such as histograms and graphs (TH1, TH2, TH3, TProfile, and TGraph). The rendering of 1D/2D histograms and graphs is done with an external JavaScript library (D3.js), and another library (Three.js) is used for 2D and 3D histograms. We will describe the techniques used to display the content of a ROOT file, with a rendering being now very close to the one provided by ROOT.

https://doi.org/10.1088/1742-6596/513/5/052005

052006

The following article is Open access

The path toward HEP High Performance Computing

John Apostolakis, René Brun, Federico Carminati, Andrei Gheata and Sandro Wenzel

View article, The path toward HEP High Performance Computing PDF, The path toward HEP High Performance Computing

High Energy Physics code has been known for making poor use of high performance computing architectures. Efforts in optimising HEP code on vector and RISC architectures have yield limited results and recent studies have shown that, on modern architectures, it achieves a performance between 10% and 50% of the peak one. Although several successful attempts have been made to port selected codes on GPUs, no major HEP code suite has a "High Performance" implementation. With LHC undergoing a major upgrade and a number of challenging experiments on the drawing board, HEP cannot any longer neglect the less-than-optimal performance of its code and it has to try making the best usage of the hardware. This activity is one of the foci of the SFT group at CERN, which hosts, among others, the Root and Geant4 project. The activity of the experiments is shared and coordinated via a Concurrency Forum, where the experience in optimising HEP code is presented and discussed. Another activity is the Geant-V project, centred on the development of a highperformance prototype for particle transport. Achieving a good concurrency level on the emerging parallel architectures without a complete redesign of the framework can only be done by parallelizing at event level, or with a much larger effort at track level. Apart the shareable data structures, this typically implies a multiplication factor in terms of memory consumption compared to the single threaded version, together with sub-optimal handling of event processing tails. Besides this, the low level instruction pipelining of modern processors cannot be used efficiently to speedup the program. We have implemented a framework that allows scheduling vectors of particles to an arbitrary number of computing resources in a fine grain parallel approach. The talk will review the current optimisation activities within the SFT group with a particular emphasis on the development perspectives towards a simulation framework able to profit best from the recent technology evolution in computing.

https://doi.org/10.1088/1742-6596/513/5/052006

052007

The following article is Open access

A New Nightly Build System for LHCb

M Clemencic and B Couturier

View article, A New Nightly Build System for LHCb PDF, A New Nightly Build System for LHCb

The nightly build system used so far by LHCb has been implemented as an extension of the system developed by CERN PH/SFT group (as presented at CHEP2010). Although this version has been working for many years, it has several limitations in terms of extensibility, management and ease of use, so that it was decided to develop a new version based on a continuous integration system.

In this paper we describe a new implementation of the LHCb Nightly Build System based on the open source continuous integration system Jenkins and report on the experience of configuring a complex build workflow in Jenkins.

https://doi.org/10.1088/1742-6596/513/5/052007

052008

The following article is Open access

Explorations of the viability of ARM and Xeon Phi for physics processing

David Abdurachmanov, Kapil Arya, Josh Bendavid, Tommaso Boccali, Gene Cooperman, Andrea Dotti, Peter Elmer, Giulio Eulisse, Francesco Giacomini, Christopher D Jones et al

View article, Explorations of the viability of ARM and Xeon Phi for physics processing PDF, Explorations of the viability of ARM and Xeon Phi for physics processing

We report on our investigations into the viability of the ARM processor and the Intel Xeon Phi co-processor for scientific computing. We describe our experience porting software to these processors and running benchmarks using real physics applications to explore the potential of these processors for production physics processing.

https://doi.org/10.1088/1742-6596/513/5/052008

052009

The following article is Open access

Rise of the Build Infrastructure

Giulio Eulisse, Shahzad Muzaffar, David Abdurachmanov and David Mendez

View article, Rise of the Build Infrastructure PDF, Rise of the Build Infrastructure

CMS Offline Software, CMSSW, is an extremely large software project, with roughly 3 millions lines of code, two hundreds of active developers and two to three active development branches. Given the scale of the problem, both from a technical and a human point of view, being able to keep on track such a large project, bug free, and to deliver builds for different architectures is a challenge in itself. Moreover the challenges posed by the future migration of CMSSW to multithreading also require adapting and improving our QA tools. We present the work done in the last two years in our build and integration infrastructure, particularly in the form of improvements to our build tools, in the simplification and extensibility of our build infrastructure and the new features added to our QA and profiling tools. Finally we present our plans for the future directions for code management and how this reflects on our workflows and the underlying software infrastructure.

https://doi.org/10.1088/1742-6596/513/5/052009

052010

The following article is Open access

Parallel track reconstruction in CMS using the cellular automaton approach

D Funke, T Hauth, V Innocente, G Quast, P Sanders and D Schieferdecker

View article, Parallel track reconstruction in CMS using the cellular automaton approach PDF, Parallel track reconstruction in CMS using the cellular automaton approach

The Compact Muon Solenoid (CMS) experiment at the Large Hadron Collider (LHC) is a general-purpose particle detector and comprises the largest silicon-based tracking system built to date with 75 million individual readout channels. The precise reconstruction of particle tracks from this tremendous amount of input channels is a compute-intensive task. The foreseen LHC beam parameters for the next data taking period, starting in 2015, will result in an increase in the number of simultaneous proton-proton interactions and hence the number of particle tracks per event. Due to the stagnating clock frequencies of individual CPU cores, new approaches to particle track reconstruction need to be evaluated in order to cope with this computational challenge. Track finding methods that are based on cellular automata (CA) offer a fast and parallelizable alternative to the well-established Kalman filter-based algorithms. We present a new cellular automaton based track reconstruction, which copes with the complex detector geometry of CMS. We detail the specific design choices made to allow for a high-performance computation on GPU and CPU devices using the OpenCL framework. We conclude by evaluating the physics performance, as well as the computational properties of our implementation on various hardware platforms and show that a significant speedup can be attained by using GPU architectures while achieving a reasonable physics performance at the same time.

https://doi.org/10.1088/1742-6596/513/5/052010

052011

The following article is Open access

A well-separated pairs decomposition algorithm for k-d trees implemented on multi-core architectures

Raul H C Lopes, Ivan D Reid and Peter R Hobson

View article, A well-separated pairs decomposition algorithm for k-d trees implemented on multi-core architectures PDF, A well-separated pairs decomposition algorithm for k-d trees implemented on multi-core architectures

Variations of k-d trees represent a fundamental data structure used in Computational Geometry with numerous applications in science. For example particle track fitting in the software of the LHC experiments, and in simulations of N-body systems in the study of dynamics of interacting galaxies, particle beam physics, and molecular dynamics in biochemistry. The many-body tree methods devised by Barnes and Hutt in the 1980s and the Fast Multipole Method introduced in 1987 by Greengard and Rokhlin use variants of k-d trees to reduce the computation time upper bounds to O(n log n) and even O(n) from O(n²). We present an algorithm that uses the principle of well-separated pairs decomposition to always produce compressed trees in O(n log n) work. We present and evaluate parallel implementations for the algorithm that can take advantage of multi-core architectures.

https://doi.org/10.1088/1742-6596/513/5/052011

052012

The following article is Open access

Self-service for software development projects and HPC activities

M Husejko, N Høimyr, A Gonzalez, G Koloventzos, D Asbury, A Trzcinska, I Agtzidis, G Botrel and J Otto

View article, Self-service for software development projects and HPC activities PDF, Self-service for software development projects and HPC activities

This contribution describes how CERN has implemented several essential tools for agile software development processes, ranging from version control (Git) to issue tracking (Jira) and documentation (Wikis). Running such services in a large organisation like CERN requires many administrative actions both by users and service providers, such as creating software projects, managing access rights, users and groups, and performing tool-specific customisation. Dealing with these requests manually would be a time-consuming task. Another area of our CERN computing services that has required dedicated manual support has been clusters for specific user communities with special needs. Our aim is to move all our services to a layered approach, with server infrastructure running on the internal cloud computing infrastructure at CERN. This contribution illustrates how we plan to optimise the management of our of services by means of an end-user facing platform acting as a portal into all the related services for software projects, inspired by popular portals for open-source developments such as Sourceforge, GitHub and others. Furthermore, the contribution will discuss recent activities with tests and evaluations of High Performance Computing (HPC) applications on different hardware and software stacks, and plans to offer a dynamically scalable HPC service at CERN, based on affordable hardware.

https://doi.org/10.1088/1742-6596/513/5/052012

052013

The following article is Open access

High energy electromagnetic particle transportation on the GPU

P Canal, D Elvira, S Y Jun, J Kowalkowski, M Paterno and J Apostolakis

View article, High energy electromagnetic particle transportation on the GPU PDF, High energy electromagnetic particle transportation on the GPU

We present massively parallel high energy electromagnetic particle transportation through a finely segmented detector on a Graphics Processing Unit (GPU). Simulating events of energetic particle decay in a general-purpose high energy physics (HEP) detector requires intensive computing resources, due to the complexity of the geometry as well as physics processes applied to particles copiously produced by primary collisions and secondary interactions. The recent advent of hardware architectures of many-core or accelerated processors provides the variety of concurrent programming models applicable not only for the high performance parallel computing, but also for the conventional computing intensive application such as the HEP detector simulation. The components of our prototype are a transportation process under a non-uniform magnetic field, geometry navigation with a set of solid shapes and materials, electromagnetic physics processes for electrons and photons, and an interface to a framework that dispatches bundles of tracks in a highly vectorized manner optimizing for spatial locality and throughput. Core algorithms and methods are excerpted from the Geant4 toolkit, and are modified and optimized for the GPU application. Program kernels written in C/C++ are designed to be compatible with CUDA and OpenCL and with the aim to be generic enough for easy porting to future programming models and hardware architectures. To improve throughput by overlapping data transfers with kernel execution, multiple CUDA streams are used. Issues with floating point accuracy, random numbers generation, data structure, kernel divergences and register spills are also considered. Performance evaluation for the relative speedup compared to the corresponding sequential execution on CPU is presented as well.

https://doi.org/10.1088/1742-6596/513/5/052013

052014

The following article is Open access

Measurements of the LHCb software stack on the ARM architecture

S Vijay Kartik, Ben Couturier, Marco Clemencic and Niko Neufeld

View article, Measurements of the LHCb software stack on the ARM architecture PDF, Measurements of the LHCb software stack on the ARM architecture

The ARM architecture is a power-efficient design that is used in most processors in mobile devices all around the world today since they provide reasonable compute performance per watt. The current LHCb software stack is designed (and thus expected) to build and run on machines with the x86/x86_64 architecture. This paper outlines the process of measuring the performance of the LHCb software stack on the ARM architecture – specifically, the ARMv7 architecture on Cortex-A9 processors from NVIDIA and on full-fledged ARM servers with chipsets from Calxeda – and makes comparisons with the performance on x86_64 architectures on the Intel Xeon L5520/X5650 and AMD Opteron 6272. The paper emphasises the aspects of performance per core with respect to the power drawn by the compute nodes for the given performance – this ensures a fair real-world comparison with much more 'powerful' Intel/AMD processors. The comparisons of these real workloads in the context of LHCb are also complemented with the standard synthetic benchmarks HEPSPEC and Coremark.

The pitfalls and solutions for the non-trivial task of porting the source code to build for the ARMv7 instruction set are presented. The specific changes in the build process needed for ARM-specific portions of the software stack are described, to serve as pointers for further attempts taken up by other groups in this direction. Cases where architecture-specific tweaks at the assembler lever (both in ROOT and the LHCb software stack) were needed for a successful compile are detailed – these cases are good indicators of where/how the software stack as well as the build system can be made more portable and multi-arch friendly. The experience gained from the tasks described in this paper are intended to i) assist in making an informed choice about ARM-based server solutions as a feasible low-power alternative to the current compute nodes, and ii) revisit the software design and build system for portability and generic improvements.

https://doi.org/10.1088/1742-6596/513/5/052014

052015

The following article is Open access

Data processing in the wake of massive multi-core processors

J Kowalkowski

View article, Data processing in the wake of massive multi-core processors PDF, Data processing in the wake of massive multi-core processors

Developments in concurrency (massive multi-core, GPU, and architectures such as ARM) are changing the physics computing landscape. This paper will describe the use of GPU and massive multi-core, and the changes that result from massive parallelization and the impact on data processing.

Major HEP event-processing framework software runs within the changing computing environment. These frameworks have been evolving to accommodate the changes. The framework changes need to go quite a bit further, to better handle coprocessors with alternative architectures.

https://doi.org/10.1088/1742-6596/513/5/052015

052016

The following article is Open access

Compute farm software for ATLAS IBL calibration

M Bindi, T Flick, J Grosse-Knetter, T Heim, S-C Hsu, M Kretz, A Kugel, M Marx, P Morettini, K Potamianos et al

View article, Compute farm software for ATLAS IBL calibration PDF, Compute farm software for ATLAS IBL calibration

In 2014 the Insertable B-Layer (IBL) will extend the existing Pixel Detector of the ATLAS experiment at CERN by over 12 million additional pixels. For calibration and monitoring purposes, occupancy and time-over-threshold data are being histogrammed in the read-out hardware. Further processing of the histograms happens on commodity hardware, which not only requires the fast transfer of histogram data from the read-out hardware to the computing farm via Ethernet, but also the integration of the software and hardware into the already existing data-acquisition and calibration framework (TDAQ and PixelDAQ) of the ATLAS experiment and the current Pixel Detector.

We implement the software running on the compute cluster with an emphasis on modularity, allowing for flexible adjustment of the infrastructure and a good scalability with respect to the number of network interfaces, available CPU cores, and deployed machines. By using a modular design we are able to not only employ CPU-based fitting algorithms, but also have the possibility to take advantage of the performance offered by a GPU-based approach to fitting.

https://doi.org/10.1088/1742-6596/513/5/052016

052017

The following article is Open access

The ATLAS data management software engineering process

M Lassnig, V Garonne, G A Stewart, M Barisits, T Beermann, R Vigne, C Serfon, L Goossens, A Nairz, A Molfetas et al

View article, The ATLAS data management software engineering process PDF, The ATLAS data management software engineering process

Rucio is the next-generation data management system of the ATLAS experiment. The software engineering process to develop Rucio is fundamentally different to existing software development approaches in the ATLAS distributed computing community. Based on a conceptual design document, development takes place using peer-reviewed code in a test-driven environment. The main objectives are to ensure that every engineer understands the details of the full project, even components usually not touched by them, that the design and architecture are coherent, that temporary contributors can be productive without delay, that programming mistakes are prevented before being committed to the source code, and that the source is always in a fully functioning state. This contribution will illustrate the workflows and products used, and demonstrate the typical development cycle of a component from inception to deployment within this software engineering process. Next to the technological advantages, this contribution will also highlight the social aspects of an environment where every action is subject to detailed scrutiny.

https://doi.org/10.1088/1742-6596/513/5/052017

052018

The following article is Open access

Experience with Intel's Many Integrated Core architecture in ATLAS software

S Fleischmann, S Kama, W Lavrijsen, M Neumann, R Vitillo and the ATLAS Collaboration

View article, Experience with Intel's Many Integrated Core architecture in ATLAS software PDF, Experience with Intel's Many Integrated Core architecture in ATLAS software

Intel recently released the first commercial boards of its Many Integrated Core (MIC) Architecture. MIC is Intel's solution for the domain of throughput computing, currently dominated by general purpose programming on graphics processors (GPGPU). MIC allows the use of the more familiar x86 programming model and supports standard technologies such as OpenMP, MPI, and Intel's Threading Building Blocks (TBB). This should make it possible to develop for both throughput and latency devices using a single code base. In ATLAS Software, track reconstruction has been shown to be a good candidate for throughput computing on GPGPU devices. In addition, the newly proposed offline parallel event-processing framework, GaudiHive, uses TBB for task scheduling. The MIC is thus, in principle, a good fit for this domain. In this paper, we report our experiences of porting to and optimizing ATLAS tracking algorithms for the MIC, comparing the programmability and relative cost/performance of the MIC against those of current GPGPUs and latency-optimized CPUs.

https://doi.org/10.1088/1742-6596/513/5/052018

052019

The following article is Open access

ATLAS Experience with HEP Software at the Argonne Leadership Computing Facility

Thomas D Uram, Thomas J LeCompte and D Benjamin

View article, ATLAS Experience with HEP Software at the Argonne Leadership Computing Facility PDF, ATLAS Experience with HEP Software at the Argonne Leadership Computing Facility

A number of HEP software packages used by the ATLAS experiment, including GEANT4, ROOT and ALPGEN, have been adapted to run on the IBM Blue Gene supercomputers at the Argonne Leadership Computing Facility. These computers use a non-x86 architecture and have a considerably less rich operating environment than in common use in HEP, but also represent a computing capacity an order of magnitude beyond what ATLAS is presently using via the LCG. The status and potential for making use of leadership-class computing, including the status of integration with the ATLAS production system, is discussed.

https://doi.org/10.1088/1742-6596/513/5/052019

052020

The following article is Open access

Systematic profiling to monitor and specify the software refactoring process of the LHCb experiment

Ben Couturier, E Kiagias and Stefan B Lohn

View article, Systematic profiling to monitor and specify the software refactoring process of the LHCb experiment PDF, Systematic profiling to monitor and specify the software refactoring process of the LHCb experiment

The LHCb upgrade program implies a significant increase in data processing that will not be matched by additional computing resources. Furthermore, new architectures such as many-core platforms can currently not be fully exploited due to memory and I/O bandwidth limitations. A considerable refactoring effort will therefore be needed to vectorize and parallelize the LHCb software, to minimize hotspots and to reduce the impact of bottlenecks. It is crucial to guide refactoring with a profiling system that gives hints to regions in source-code for possible and necessary re-engineering and which kind of optimization could lead to final success.

Software optimization is a sophisticated process where all parts, compiler, operating system, external libraries and chosen hardware play a role. Intended improvements can have different effects on different platforms. To obtain precise information of the general performance, to make profiles comparable, reproducible and to verify the progress of performance in the framework, it is crucial to produce profiles more systematically in terms of regular profiling based on representative use cases and to perform regression tests. Once a general execution, monitoring and analysis platform is available, software metrics can be derived from the collected profiling results to trace changes in performance back and to create summary reports on a regular basis with an alert system if modifications led to significant performance degradations.

https://doi.org/10.1088/1742-6596/513/5/052020

052021

The following article is Open access

Synergia CUDA: GPU-accelerated accelerator modeling package

Q Lu and J Amundson

View article, Synergia CUDA: GPU-accelerated accelerator modeling package PDF, Synergia CUDA: GPU-accelerated accelerator modeling package

Synergia is a parallel, 3-dimensional space-charge particle-in-cell accelerator modeling code. We present our work porting the purely MPI-based version of the code to a hybrid of CPU and GPU computing kernels. The hybrid code uses the CUDA platform in the same framework as the pure MPI solution. We have implemented a lock-free collaborative charge-deposition algorithm for the GPU, as well as other optimizations, including local communication avoidance for GPUs, a customized FFT, and fine-tuned memory access patterns. On a small GPU cluster (up to 4 Tesla C1070 GPUs), our benchmarks exhibit both superior peak performance and better scaling than a CPU cluster with 16 nodes and 128 cores. We also compare the code performance on different GPU architectures, including C1070 Tesla and K20 Kepler.

https://doi.org/10.1088/1742-6596/513/5/052021

052022

The following article is Open access

ATLAS offline software performance monitoring and optimization

N Chauhan, G Kabra, T Kittelmann, R Langenberg, R Mandrysch, A Salzburger, R Seuster, E Ritsch, G Stewart, N van Eldik et al

View article, ATLAS offline software performance monitoring and optimization PDF, ATLAS offline software performance monitoring and optimization

In a complex multi-developer, multi-package software environment, such as the ATLAS offline framework Athena, tracking the performance of the code can be a non-trivial task in itself. In this paper we describe improvements in the instrumentation of ATLAS offline software that have given considerable insight into the performance of the code and helped to guide the optimization work.

The first tool we used to instrument the code is PAPI, which is a programing interface for accessing hardware performance counters. PAPI events can count floating point operations, cycles, instructions and cache accesses. Triggering PAPI to start/stop counting for each algorithm and processed event results in a good understanding of the algorithm level performance of ATLAS code.

Further data can be obtained using Pin, a dynamic binary instrumentation tool. Pin tools can be used to obtain similar statistics as PAPI, but advantageously without requiring recompilation of the code. Fine grained routine and instruction level instrumentation is also possible. Pin tools can additionally interrogate the arguments to functions, like those in linear algebra libraries, so that a detailed usage profile can be obtained.

These tools have characterized the extensive use of vector and matrix operations in ATLAS tracking. Currently, CLHEP is used here, which is not an optimal choice. To help evaluate replacement libraries a testbed has been setup allowing comparison of the performance of different linear algebra libraries (including CLHEP, Eigen and SMatrix/SVector). Results are then presented via the ATLAS Performance Management Board framework, which runs daily with the current development branch of the code and monitors reconstruction and Monte-Carlo jobs. This framework analyses the CPU and memory performance of algorithms and an overview of results are presented on a web page.

These tools have provided the insight necessary to plan and implement performance enhancements in ATLAS code by identifying the most common operations, with the call parameters well understood, and allowing improvements to be quantified in detail.

https://doi.org/10.1088/1742-6596/513/5/052022

052023

The following article is Open access

C++ evolves!

Axel Naumann

View article, C++ evolves! PDF, C++ evolves!

C++ is used throughout High Energy Physics. CERN participates in the development of its standard. There has been a major shift in standardization procedures that will be visible starting 2014 with an increase rate of new standardized features. Already the current C++11 has major improvements, also for coding novices, related to simplicity, expressiveness, performance and robustness. Other major improvements are in the area of concurrency, where C++ is now on par with most other high level languages. To benefit from these language improvements and from the massive improvements in compiler technology for instance in usability, access to current compilers is crucial. Use of current C++ compiled with current compilers can considerably improve C++ for the HEP physicist community.

https://doi.org/10.1088/1742-6596/513/5/052023

052024

The following article is Open access

Does the Intel Xeon Phi processor fit HEP workloads?

A Nowak, G Bitzes, A Dotti, A Lazzaro, S Jarp, P Szostek, L Valsan, M Botezatu and J Leduc

View article, Does the Intel Xeon Phi processor fit HEP workloads? PDF, Does the Intel Xeon Phi processor fit HEP workloads?

This paper summarizes the five years of CERN openlab's efforts focused on the Intel Xeon Phi co-processor, from the time of its inception to public release. We consider the architecture of the device vis a vis the characteristics of HEP software and identify key opportunities for HEP processing, as well as scaling limitations. We report on improvements and speedups linked to parallelization and vectorization on benchmarks involving software frameworks such as Geant4 and ROOT. Finally, we extrapolate current software and hardware trends and project them onto accelerators of the future, with the specifics of offline and online HEP processing in mind.

https://doi.org/10.1088/1742-6596/513/5/052024

052025

The following article is Open access

Parallelization of particle transport using Intel® TBB

J Apostolakis, S Belogurov, R Brun, F Carminati, A Gheata, E Ovcharenko and S Wenzel

View article, Parallelization of particle transport using Intel® TBB PDF, Parallelization of particle transport using Intel® TBB

One of the current challenges in HEP computing is the development of particle propagation algorithms capable of efficiently use all performance aspects of modern computing devices. The Geant-Vector project at CERN has recently introduced an approach in this direction. This paper describes the implementation of a similar workflow using the Intel(r) Threading Building Blocks (Intel(r) TBB) library. This approach is intended to overcome the potential bottleneck of having a single dispatcher on many-core architectures and to result in better scalability compared to the initial pthreads-based version.

https://doi.org/10.1088/1742-6596/513/5/052025

052026

The following article is Open access

Improving robustness and computational efficiency using modern C++

M Paterno, J Kowalkowski and C Green

View article, Improving robustness and computational efficiency using modern C++ PDF, Improving robustness and computational efficiency using modern C++

For nearly two decades, the C++ programming language has been the dominant programming language for experimental HEP. The publication of ISO/IEC 14882:2011, the current version of the international standard for the C++ programming language, makes available a variety of language and library facilities for improving the robustness, expressiveness, and computational efficiency of C++ code. However, much of the C++ written by the experimental HEP community does not take advantage of the features of the language to obtain these benefits, either due to lack of familiarity with these features or concern that these features must somehow be computationally inefficient.

In this paper, we address some of the features of modern C+-+, and show how they can be used to make programs that are both robust and computationally efficient. We compare and contrast simple yet realistic examples of some common implementation patterns in C, currently-typical C++, and modern C++, and show (when necessary, down to the level of generated assembly language code) the quality of the executable code produced by recent C++ compilers, with the aim of allowing the HEP community to make informed decisions on the costs and benefits of the use of modern C++.

https://doi.org/10.1088/1742-6596/513/5/052026

052027

The following article is Open access

Speeding up HEP experiment software with a library of fast and auto-vectorisable mathematical functions

Danilo Piparo, Vincenzo Innocente and Thomas Hauth

View article, Speeding up HEP experiment software with a library of fast and auto-vectorisable mathematical functions PDF, Speeding up HEP experiment software with a library of fast and auto-vectorisable mathematical functions

During the first years of data taking at the Large Hadron Collider (LHC), the simulation and reconstruction programs of the experiments proved to be extremely resource consuming. In particular, for complex event simulation and reconstruction applications, the impact of evaluating elementary functions on the runtime is sizeable (up to one fourth of the total), with an obvious effect on the power consumption of the hardware dedicated to their execution. This situation clearly needs improvement, especially considering the even more demanding data taking scenarios after the first LHC long shut down. A possible solution to this issue is the VDT (VectorisD maTh) mathematical library. VDT provides the most common mathematical functions used in HEP in an open source product. The function implementations are fast, can be inlined, provide an approximate accuracy and are usable in vectorised loops. Their implementation is portable across platforms: x86 and ARM processors, Xeon Phi coprocessors and GPGPUs. In this contribution, we describe the features of the VDT mathematical library, showing significant speedups with respect to the LibM library and comparable accuracies. Moreover, taking as examples simulation and reconstruction workflows in production by the LHC experiments, we show the benefits of the usage of VDT in terms of runtime reduction and stability of physics output.

https://doi.org/10.1088/1742-6596/513/5/052027

052028

The following article is Open access

Preparing HEP software for concurrency

M Clemencic, B Hegner, P Mato and D Piparo

View article, Preparing HEP software for concurrency PDF, Preparing HEP software for concurrency

The necessity for thread-safe experiment software has recently become very evident, largely driven by the evolution of CPU architectures towards exploiting increasing levels of parallelism. For high-energy physics this represents a real paradigm shift, as concurrent programming was previously only limited to special, well-defined domains like control software or software framework internals. This paradigm shift, however, falls into the middle of the successful LHC programme and many million lines of code have already been written without the need for parallel execution in mind. In this paper we have a closer look at the offline processing applications of the LHC experiments and their readiness for the many-core era. We review how previous design choices impact the move to concurrent programming. We present our findings on transforming parts of the LHC experiment reconstruction software to thread-safe code, and the main design patterns that have emerged during the process. A plethora of parallel-programming patterns are well known outside the HEP community, but only a few have turned out to be straightforward enough to be suited for non-expert physics programmers. Finally, we propose a potential strategy for the migration of existing HEP experiment software to the many-core era.

https://doi.org/10.1088/1742-6596/513/5/052028

052029

The following article is Open access

Preparing the Gaudi framework and the DIRAC WMS for multicore job submission

N Rauschmayr and A Streit

View article, Preparing the Gaudi framework and the DIRAC WMS for multicore job submission PDF, Preparing the Gaudi framework and the DIRAC WMS for multicore job submission

HEP applications need to adapt to the continuously increasing number of cores on modern CPUs. This must be done at different levels: the software must support parallelization, and the scheduling has to differ between multicore and singlecore jobs. The LHCb software framework (GAUDI) provides a parallel prototype (GaudiMP), based on the multiprocessing approach. It allows a reduction of the overall memory footprint and a coordinated access to data via separated reader and writer processes. A comparison between the parallel prototype and multiple independent Gaudi jobs in respect of CPU time and memory consumption will be shown. Furthermore, speedup must be predicted in order to find the limit beyond which the parallel prototype (GaudiMP) does not bring further scaling. This number must be known as it indicates the point, where new technologies must be introduced into the software framework. In order to reach further improvements in the overall throughput, scheduling strategies for mixing parallel jobs can be applied. It allows overcoming limitations in the speedup of the parallel prototype. Those changes require modifications at the level of the Workload Management System (DIRAC).

https://doi.org/10.1088/1742-6596/513/5/052029

052030

The following article is Open access

Evaluating Predictive Models of Software Quality

V Ciaschini, M Canaparo, E Ronchieri and D Salomoni

View article, Evaluating Predictive Models of Software Quality PDF, Evaluating Predictive Models of Software Quality

Applications from High Energy Physics scientific community are constantly growing and implemented by a large number of developers. This implies a strong churn on the code and an associated risk of faults, which is unavoidable as long as the software undergoes active evolution. However, the necessities of production systems run counter to this. Stability and predictability are of paramount importance; in addition, a short turn-around time for the defect discovery-correction-deployment cycle is required. A way to reconcile these opposite foci is to use a software quality model to obtain an approximation of the risk before releasing a program to only deliver software with a risk lower than an agreed threshold.

In this article we evaluated two quality predictive models to identify the operational risk and the quality of some software products. We applied these models to the development history of several EMI packages with intent to discover the risk factor of each product and compare it with its real history. We attempted to determine if the models reasonably maps reality for the applications under evaluation, and finally we concluded suggesting directions for further studies.

https://doi.org/10.1088/1742-6596/513/5/052030

052031

The following article is Open access

ATLAS software configuration and build tool optimisation

Grigory Rybkin and the ATLAS Collaboration

View article, ATLAS software configuration and build tool optimisation PDF, ATLAS software configuration and build tool optimisation

ATLAS software code base is over 6 million lines organised in about 2000 packages. It makes use of some 100 external software packages, is developed by more than 400 developers and used by more than 2500 physicists from over 200 universities and laboratories in 6 continents. To meet the challenge of configuration and building of this software, the Configuration Management Tool (CMT) is used. CMT expects each package to describe its build targets, build and environment setup parameters, dependencies on other packages in a text file called requirements, and each project (group of packages) to describe its policies and dependencies on other projects in a text project file. Based on the effective set of configuration parameters read from the requirements files of dependent packages and project files, CMT commands build the packages, generate the environment for their use, or query the packages. The main focus was on build time performance that was optimised within several approaches: reduction of the number of reads of requirements files that are now read once per package by a CMT build command that generates cached requirements files for subsequent CMT build commands; introduction of more fine-grained build parallelism at package task level, i.e., dependent applications and libraries are compiled in parallel; code optimisation of CMT commands used for build; introduction of package level build parallelism, i. e., parallelise the build of independent packages. By default, CMT launches NUMBER-OF-PROCESSORS build commands in parallel. The other focus was on CMT commands optimisation in general that made them approximately 2 times faster. CMT can generate a cached requirements file for the environment setup command, which is especially useful for deployment on distributed file systems like AFS or CERN VMFS. The use of parallelism, caching and code optimisation significantly-by several times-reduced software build time, environment setup time, increased the efficiency of multi-core computing resources utilisation, and considerably improved software developer and user experience.

https://doi.org/10.1088/1742-6596/513/5/052031

052032

The following article is Open access

Computing on Knights and Kepler Architectures

G Bortolotti, M Caberletti, G Crimi, A Ferraro, F Giacomini, M Manzali, G Maron, M Pivanti, D Salomoni, S F Schifano et al

View article, Computing on Knights and Kepler Architectures PDF, Computing on Knights and Kepler Architectures

A recent trend in scientific computing is the increasingly important role of co-processors, originally built to accelerate graphics rendering, and now used for general high-performance computing. The INFN Computing On Knights and Kepler Architectures (COKA) project focuses on assessing the suitability of co-processor boards for scientific computing in a wide range of physics applications, and on studying the best programming methodologies for these systems. Here we present in a comparative way our results in porting a Lattice Boltzmann code on two state-of-the-art accelerators: the NVIDIA K20X, and the Intel Xeon-Phi. We describe our implementations, analyze results and compare with a baseline architecture adopting Intel Sandy Bridge CPUs.

https://doi.org/10.1088/1742-6596/513/5/052032

052033

The following article is Open access

The end of HEP-specific computing as we know it?

O Smirnova, R Brun, R Cailliau, F Carminati and P Elmer

View article, The end of HEP-specific computing as we know it? PDF, The end of HEP-specific computing as we know it?

CHEP Conferences are dedicated to a quite specific scientific computing domain, and deal with rather specialised software, developed for the needs of the High Energy Physics community. The sheer size of this community created an environment which until recently has been to a large extent isolated from the mainstream computing. There is however an emerging trend for the computing solutions to spill outside the traditional laboratory boundaries, benefiting from becoming less domain-specific. This paper summarises the panel discussion held at the CHEP'13 conference with the goal to answer the questions, why the mainstream software approaches are not always suitable for the High Energy Physics community, and why its own solutions so far enjoyed little popularity in other domains?

https://doi.org/10.1088/1742-6596/513/5/052033

052034

The following article is Open access

ATLAS Nightly Build System Upgrade

G Dimitrov, E Obreshkov, B Simmons, A Undrus and the ATLAS Collaboration

View article, ATLAS Nightly Build System Upgrade PDF, ATLAS Nightly Build System Upgrade

The ATLAS Nightly Build System is a facility for automatic production of software releases. Being the major component of ATLAS software infrastructure, it supports more than 50 multi-platform branches of nightly releases and provides ample opportunities for testing new packages, for verifying patches to existing software, and for migrating to new platforms and compilers. The Nightly System testing framework runs several hundred integration tests of different granularity and purpose. The nightly releases are distributed and validated, and some are transformed into stable releases used for data processing worldwide. The first LHC long shutdown (2013-2015) activities will elicit increased load on the Nightly System as additional releases and builds are needed to exploit new programming techniques, languages, and profiling tools. This paper describes the plan of the ATLAS Nightly Build System Long Shutdown upgrade. It brings modern database and web technologies into the Nightly System, improves monitoring of nightly build results, and provides new tools for offline release shifters. We will also outline our long-term plans for distributed nightly releases builds and testing.

https://doi.org/10.1088/1742-6596/513/5/052034

052035

The following article is Open access

Experiences with moving to open source standards for building and packaging

D H van Dok, M Sallé and O A Koeroo

View article, Experiences with moving to open source standards for building and packaging PDF, Experiences with moving to open source standards for building and packaging

The LCMAPS family of grid security middleware was developed during a series of European grid projects from 2001 until 2013. Since 2009 we actively started to move away from ETICS, the project-specific build system, to common open-source tools for building and packaging, such as the GNU Autotools and the Fedora and Debian tool set. By following the guidelines of these mainstream distributions, and improving the source code to fit in with the commonly available open source tools, we have established low-cost, long term sustainability of the code base.

https://doi.org/10.1088/1742-6596/513/5/052035

052036

The following article is Open access

Next-Generation Navigational Infrastructure and the ATLAS Event Store

P van Gemmeren, D Malon, M Nowak and the ATLAS Collaboration

View article, Next-Generation Navigational Infrastructure and the ATLAS Event Store PDF, Next-Generation Navigational Infrastructure and the ATLAS Event Store

The ATLAS event store employs a persistence framework with extensive navigational capabilities. These include real-time back navigation to upstream processing stages, externalizable data object references, navigation from any data object to any other both within a single file and across files, and more. The 2013-2014 shutdown of the Large Hadron Collider provides an opportunity to enhance this infrastructure in several ways that both extend these capabilities and allow the collaboration to better exploit emerging computing platforms. Enhancements include redesign with efficient file merging in mind, content-based indices in optimized reference types, and support for forward references. The latter provide the potential to construct valid references to data before those data are written, a capability that is useful in a variety of multithreading, multiprocessing, distributed processing, and deferred processing scenarios. This paper describes the architecture and design of the next generation of ATLAS navigational infrastructure.

https://doi.org/10.1088/1742-6596/513/5/052036

052037

The following article is Open access

Many-core on the Grid: From exploration to production

T Doherty, M Doidge, T Ferrari, L Salvador, J Walsh and A Washbrook

View article, Many-core on the Grid: From exploration to production PDF, Many-core on the Grid: From exploration to production

High Energy Physics experiments have successfully demonstrated that many-core devices such as GPUs can be used to accelerate critical algorithms in their software. There is now increasing community interest for many-core devices to be made available on the LHC Computing Grid infrastructure. Despite anticipated usage there is no standard method available to run many-core applications in distributed computing environments and before many-core resources are made available on the Grid a number of operational issues such as job scheduling and resource discovery will need to be addressed. The key challenges for Grid-enabling many-core devices will be discussed.

https://doi.org/10.1088/1742-6596/513/5/052037

052038

The following article is Open access

Vectorising the detector geometry to optimise particle transport

John Apostolakis, René Brun, Federico Carminati, Andrei Gheata and Sandro Wenzel

View article, Vectorising the detector geometry to optimise particle transport PDF, Vectorising the detector geometry to optimise particle transport

Among the components contributing to particle transport, geometry navigation is an important consumer of CPU cycles. The tasks performed to get answers to "basic" queries such as locating a point within a geometry hierarchy or computing accurately the distance to the next boundary can become very computing intensive for complex detector setups. So far, the existing geometry algorithms employ mainly scalar optimisation strategies (voxelisation, caching) to reduce their CPU consumption. In this paper, we would like to take a different approach and investigate how geometry navigation can benefit from the vector instruction set extensions that are one of the primary source of performance enhancements on current and future hardware. While on paper, this form of microparallelism promises increasing performance opportunities, applying this technology to the highly hierarchical and multiply branched geometry code is a difficult challenge. We refer to the current work done to vectorise an important part of the critical navigation algorithms in the ROOT geometry library. Starting from a short critical discussion about the programming model, we present the current status and first benchmark results of the vectorisation of some elementary geometry shape algorithms. On the path towards a full vector-based geometry navigator, we also investigate the performance benefits in connecting these elementary functions together to develop algorithms which are entirely based on the flow of vector-data. To this end, we discuss core components of a simple vector navigator that is tested and evaluated on a toy detector setup.

https://doi.org/10.1088/1742-6596/513/5/052038

Table of contents

Volume 513

Journal links