GB2536273A - A porting verification apparatus, method and computer program

Info

Publication number: GB2536273A
Application number: GB1504183.3A
Authority: GB
Inventors: Bhaskaran Balakrishnan
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2015-03-12
Filing date: 2015-03-12
Publication date: 2016-09-14
Also published as: GB201504183D0

Abstract

Metric-based verification tool for large-scale models, e.g. weather simulations, ported between High Performance Computing (HPC) systems, comprising: data module 20 generating a host and a target ensemble of solutions after executing a numerical model program by repeatedly varying execution parameters in host and target environments, respectively; analytics engine 30 computing a mathematical distance between the host and target ensemble; and decision module 40 using the distance to decide whether porting has created an error. The host ensemble is a perturbed architecture ensemble (PAE) obtained by varying software/hardware parameters such as task/data parallelism, programming/runtime environment settings, operating system release and middleware version. The target ensemble is an initial condition ensemble (ICE) obtained by varying the initial conditions of the model. Varying the execution parameters randomly spreads the solutions to form distributed PAE and ICE whose Bhattacharyya distances are used for decision making. Presence of systematic porting errors affecting the solutions is identified.

Description

A PORTING VERIFICATION APPARATUS, METHOD AND COMPUTER PROGRAM

The present invention relates to porting of numerical computer programs between two computing environments. It has many applications in today's world of pre-manufacture testing and technical use of other numerical models. For example in the technical areas of design and manufacture of electronic components, porting of heat modelling computer programs to a more suitable environment can allow a faster and/or lower energy modelling process to be used in design or re-design of a circuit board (such as a motherboard), processor, server, or server facility, depending on the physical scale of the model.

Porting is the process of adapting software (such as application software or libraries) so that an executable program or other user-ready software product can be created for a computing context that is different from the one for which it was originally designed. Thus software may be ported to a different hardware architecture. Porting may alternatively or additionally include a change of software stack, such as operating system, compilers, library and other settings in the computing environment. Hence the term computing environment encompasses both the hardware architecture and associated software stack used at runtime.

The invention is primarily targeted at High Performance Computing (HPC) applications, libraries and tools where it is common to port the code to a wide range of supercomputing platforms which have differing hardware architectures and software stacks forming the tool chain of programming tools (compiler, parallel libraries etc.) used to create the executable program or other product. However, it is also more widely applicable, for example in porting applications from other fields and to other platforms.

An application, library or other piece of software may be ported to a new computing environment (or system), possibly with a different set of compilers, parallel programming libraries (such as the MPI library) or other system-specific tools, as follows.

Firstly the source code is obtained, unpacked and prepared for compilation. Then the code is compiled (and if there are compile errors these are fixed and the code is compiled again). Otherwise, if there are no compilation errors, quality assurance (QA) testing can take place.

If the tests are passed the code can be packaged, installed and documented.

The arrival of multi-core, many-core and general purpose graphical processing units (GPUs) has changed the landscape of high performance computing in recent years. CPU-GPU heterogeneous computing systems are particularly gaining popularity among scientists involved in component design and modelling, bioinformatics and climate modelling research. To gain improvement in model performance (for example faster results and/or lower energy costs) on these new High Performance Computing (HPC) architectures, models are regularly ported, profiled, optimized, tested and validated.

Embodiments of the invention can provide a metric based objective verification tool for large-scale models solving initial value problems on new HPC systems.

According to an embodiment of one aspect of the invention there is provided a porting verification apparatus arranged to check for errors created during porting from a host computing environment to a target computing environment of a numerical model computer program which executes iteratively to solve an initial value problem, the apparatus comprising: a data module arranged to: obtain from the host computing environment, a host ensemble of solutions for the numerical model by executing the numerical model computer program repeatedly varying execution parameters; and to obtain from the target computing environment, a target ensemble of solutions for the numerical model by executing the numerical model computer program repeatedly varying execution parameters; an analytics engine arranged to compute a mathematical distance between the host ensemble and the target ensemble; and a decision module arranged to use the mathematical host-target ensemble distance to conclude whether the porting has created an error.

Invention embodiments aim to verify a possible porting process to establish its correctness. This verification can test if the model installed/ported on a new computer system is free of errors which may occur during the porting process (and hence is a different concept from "validation" which can refer to the quality/veracity of the model itself).

An ensemble may be viewed as a collection of model simulations in the form of solutions/results of the simulations that describe a set of possible values for a given parameter (e.g., temperature) at a given time point T. The collection of model simulation results (i.e. ensemble) can be defined in several ways. For example an initial condition ensemble ICE is defined by running repeat simulations of the model, each repeat using a different set of execution parameters in the form of different initial conditions. One initial condition is perturbed randomly to produce the next initial condition in a way that is known in the art. A set of randomly perturbed initial conditions will then be used to run the simulations. Thus this type of ensemble is called an ensemble of initial condition simulations (IC ensemble).
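The construction of an IC ensemble described above can be sketched as follows. This is a minimal illustration only: the logistic map stands in for a real numerical model, and the names toy_model and ic_ensemble, the perturbation amplitude and the seed are all assumptions introduced here rather than part of the described embodiments.

```python
import random


def toy_model(x0, steps=50, r=3.9):
    """Hypothetical stand-in for a numerical model: iterate the logistic map."""
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
    return x


def ic_ensemble(x0, n, amplitude=1e-10, seed=0):
    """Run n simulations, each from a randomly perturbed initial condition.

    Each initial condition is obtained by randomly perturbing the previous
    one; the first member uses the unperturbed value x0.
    """
    rng = random.Random(seed)
    initial_conditions = [x0]
    for _ in range(n - 1):
        previous = initial_conditions[-1]
        initial_conditions.append(previous + rng.uniform(-amplitude, amplitude))
    return [toy_model(ic) for ic in initial_conditions]


if __name__ == "__main__":
    for member in ic_ensemble(0.3, n=10):
        print(member)
```

Fixing the seed makes the ensemble reproducible, which may be convenient when the same perturbed initial conditions have to be reused on a second system.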

Implementation of the same numerical model on two different computing environments can result in "differences" in model simulations that go beyond the usual random variation found in an ensemble. These differences may be due to 1) mistakes made by the programmer or by an automatic porting algorithm (or a combination of the two) in implementing the computer program in the new environment, 2) inherent differences in environments (numerical drift), or 3) a combination of both. The tool will establish in the decision module whether there is an error which can cause systematic differences or only normal random numerical drift.

Invention embodiments can effectively create a collection of results forming a mathematical distribution (such as a normal distribution) of solutions from the host computing environment and from the target computing environment. Comparison of the two distributions can be made by calculating a mathematical distance between the two.

In the code performance improvement cycle, porting a model, compiling and running it on a new system is perhaps a relatively easy task to do. However, on the new system, due to differences in processor architecture and/or software runtime environment, the model will produce solutions that are different from the solutions produced by the same model on the host system. Differences in solution could also occur when there are changes in the order of operations (for example, allocation of computing tasks between CPUs and GPUs). In models solving initial value problems, these differences, however small initially, can grow rapidly. The inventor has come to the realisation that it is important to establish that these differences are random in characteristics and that they do not mask differences which are essentially errors in porting the model, which can develop systematically in the time domain to influence the model prediction.
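The two effects described in this paragraph can be illustrated with a short sketch: a change in the order of floating-point operations alters the result at round-off level, and in an iterated initial value problem a round-off sized difference grows rapidly. The logistic map and the function name logistic_run are assumptions used purely for illustration, not a model from the described embodiments.

```python
# Floating-point addition is not associative, so reordering a sum
# (as a different allocation of tasks might) changes the last bits.
left_to_right = (0.1 + 0.2) + 0.3
right_to_left = 0.1 + (0.2 + 0.3)
print(left_to_right == right_to_left)  # False on IEEE 754 doubles


def logistic_run(x0, steps, r=3.9):
    """Iterate the logistic map, a toy chaotic initial value problem."""
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
    return x


# Two starting states that differ only at round-off level.
a = logistic_run(0.3, 100)
b = logistic_run(0.3 + 1e-15, 100)
print(abs(a - b))  # typically far larger than the initial 1e-15 difference
```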

Hence it is necessary to ensure that these differences are not due to coding or porting errors, as such errors will systematically influence the solution in such a way that the model will finally produce a completely different future state.

In porting, for example, complex coupled climate models, prior art methods for verification of their correct implementation are a highly resource consuming process.

Models need to be integrated for several decades (in simulation time) on the new computing system, giving a full climate simulation. This can take 2 or 3 months in real time. Climate statistics are then derived from this simulation and compared with those derived from the reference simulation carried out on the host system. This requires long computing and analysis time in addition to subject matter expertise, hindering the code development cycle.

Embodiments of the invention provide a method and apparatus to verify the ported model for its correct implementation. Embodiments can use a metric-based tool which enables an objective verification of the ported model on the new HPC system, removing the need for subject matter expertise. Such a tool can also eliminate the need for running full climate simulations, by providing a shorter test phase, for example giving a simulation length of 10 days or up to a few months, and certainly less than a year. This shorter test phase can reduce the consumption of HPC and time resources, and thus accelerate the code performance improvement cycle.

As mentioned above, varying execution parameters introduces random errors in the numerical calculations, providing a distribution of outcomes. The spread of these outcomes can be represented by a distribution of intra-ensemble mathematical distance measures. Hence, the analytics module may be arranged to compute a distribution of intra-ensemble mathematical distance measures of the host ensemble, and the decision module may be arranged to compare the distribution of host intra-ensemble mathematical distance measures with the mathematical host-target ensemble distance to conclude whether the porting has created an error.

Any suitable method may be used to compare the distribution of host intra-ensemble mathematical distance measures with the mathematical host-target ensemble distance.

For example, the decision module may be arranged to use the distribution of intra-ensemble mathematical distance measures within the host ensemble to compute an acceptance interval for the mathematical host-target ensemble distance of (μ − aσ, μ + aσ), where a is a positive integer value, μ is the mean of the distribution of host intra-ensemble mathematical distance measures and σ is the standard deviation of that distribution.
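A minimal sketch of this acceptance test is given below. The function names acceptance_interval and porting_ok are hypothetical, and the use of the sample standard deviation (rather than the population standard deviation) is an assumption, since the text does not specify which is intended.

```python
import statistics


def acceptance_interval(intra_distances, a=2):
    """Return (mean - a*std, mean + a*std) of the intra-ensemble distances.

    Assumption: the sample standard deviation is used.
    """
    mu = statistics.mean(intra_distances)
    sigma = statistics.stdev(intra_distances)
    return (mu - a * sigma, mu + a * sigma)


def porting_ok(host_target_distance, intra_distances, a=2):
    """True if the host-target distance lies inside the acceptance interval."""
    low, high = acceptance_interval(intra_distances, a)
    return low < host_target_distance < high
```

A larger value of a widens the interval and so tolerates a larger host-target distance before an error is reported.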

In one specific example, either the host intra-ensemble mathematical distance measures may be Bhattacharyya distance measures or the mathematical host-target ensemble distance may be a Bhattacharyya distance measure, or both. The distribution of host intra-ensemble mathematical distance measures may be derived from a plurality of Bhattacharyya distance measures of random sub-samples to provide the mean and standard deviation mentioned above.
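As an illustration, the Bhattacharyya distance between two ensembles, each approximated by a univariate normal distribution, can be computed as sketched below, together with a distribution of intra-ensemble distances obtained from random sub-samples. The closed form used is the standard one for two normal distributions. The sub-sampling scheme (repeated random splits of the ensemble into two halves) and the function names are assumptions, since the text does not fix how the sub-samples are drawn.

```python
import math
import random
import statistics


def bhattacharyya_normal(sample_p, sample_q):
    """Bhattacharyya distance between normal fits to two samples."""
    mu_p, mu_q = statistics.mean(sample_p), statistics.mean(sample_q)
    var_p, var_q = statistics.variance(sample_p), statistics.variance(sample_q)
    mean_term = 0.25 * (mu_p - mu_q) ** 2 / (var_p + var_q)
    var_term = 0.5 * math.log((var_p + var_q) / (2.0 * math.sqrt(var_p * var_q)))
    return mean_term + var_term


def intra_ensemble_distances(ensemble, n_splits=200, seed=0):
    """Distances between random halves of one ensemble (assumed scheme)."""
    if len(ensemble) < 4:
        raise ValueError("need at least 4 members to form two sub-samples")
    rng = random.Random(seed)
    half = len(ensemble) // 2
    distances = []
    for _ in range(n_splits):
        shuffled = rng.sample(ensemble, len(ensemble))
        distances.append(bhattacharyya_normal(shuffled[:half], shuffled[half:]))
    return distances
```

The mean and standard deviation of the list returned by intra_ensemble_distances would then supply the two statistics mentioned above. Each sub-sample must contain members that are not all identical, otherwise its variance is zero and the distance is undefined.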

A Mahalanobis distance measure could be used, for example to compare one ensemble member from a host ensemble with all ensemble members from a target ensemble.
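By way of a sketch, the distance of one host ensemble member from a target ensemble can be computed as below for solutions consisting of two variables (for instance temperature and precipitation). The restriction to two variables, which allows the covariance matrix to be inverted by hand, and the function name mahalanobis_2d are assumptions made to keep the example self-contained.

```python
import math
import statistics


def mahalanobis_2d(point, ensemble):
    """Mahalanobis distance of a 2-variable point from a 2-variable ensemble."""
    xs = [member[0] for member in ensemble]
    ys = [member[1] for member in ensemble]
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    n = len(ensemble)
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = statistics.variance(xs)
    syy = statistics.variance(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in ensemble) / (n - 1)
    det = sxx * syy - sxy * sxy
    dx, dy = point[0] - mean_x, point[1] - mean_y
    # Quadratic form d^T S^-1 d, using the closed-form 2x2 inverse.
    d_squared = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d_squared)
```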

As another alternative, the comparison methodology could use a distribution of target intra-ensemble distance measures in place of the distribution of host intra-ensemble distance measures, this approach being more suitable for use by a vendor of the target system. Hence, in one embodiment, varying execution parameters introduces random errors in the numerical calculations, providing a range of outcomes whose spread can be represented by a distribution of intra-ensemble distance measures; and the analytics module is arranged to compute a distribution of target intra-ensemble distance measures, and the decision module is arranged to compare the target intra-ensemble distance measure distribution with the mathematical host-target ensemble distance to conclude whether the porting has created an error. Effectively, the host could take the target's role and the target could take the host's role and the whole process will stand as previously defined.

The execution parameters may be varied in any preferred way. One known way of varying execution parameters is to perturb the initial conditions. Hence one or both of the host and target ensembles may be of the IC ensemble type.

The host and/or the target ensemble may be a new type of ensemble referred to herein as a perturbed architecture ensemble of solutions. The inventor has realised that an ensemble which is equivalent to an IC ensemble in terms of variation may be obtained by executing the numerical model computer program repeatedly using the same initial conditions but varying software and/or hardware (implementation) parameters (which are set when the computer program is executed in the environment). It may be that the programmer has control over the host system and so can use a PA ensemble, but the target machine is new HPC infrastructure to which the programmer may not have full access. In that case the target ensemble may be an IC ensemble.

A perturbed architecture (or PA) ensemble may be provided by solving the numerical problem using one combination of software and hardware parameters, and varying one or more software or hardware parameter for each of n-1 other solutions, to give an ensemble of n solutions.

Any hardware/software parameter settings that introduce small random differences in numerical calculations can be varied. For example, truncating the floating point representation can introduce such differences. Parameters varied for the perturbed architecture ensemble can include one or more of: task and/or data parallelism; programming and/or runtime environment settings; operating system release; and middleware version.
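The idea can be sketched with a toy model in which the only "architecture" setting varied is the number of decimal digits retained after each iteration, emulating a reduced floating point representation. This is an assumption made for illustration: a real PA ensemble would vary settings such as those listed above, and Python's round() rounds rather than truncates. The names toy_model and pa_ensemble are hypothetical.

```python
def toy_model(x0, steps, digits):
    """Logistic map with the state rounded to `digits` decimals each step."""
    x = x0
    for _ in range(steps):
        x = round(3.9 * x * (1.0 - x), digits)
    return x


def pa_ensemble(x0, steps=50, digit_settings=(16, 15, 14, 13, 12, 11, 10, 9, 8, 7)):
    """Same initial condition for every member; only the precision varies."""
    return [toy_model(x0, steps, digits) for digits in digit_settings]


if __name__ == "__main__":
    for member in pa_ensemble(0.3):
        print(member)
```

Every member starts from the same initial condition, so any spread in the printed values comes solely from the varied precision setting.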

If the host and/or the target ensemble is an initial condition ensemble of solutions, the ensemble may be obtained by executing the numerical model computer program repeatedly, varying the initial conditions each time.

In some embodiments, the host ensemble is a perturbed architecture ensemble of solutions obtained by executing the numerical model computer program repeatedly, using the same initial conditions, but varying software and/or hardware parameters and the target ensemble is an initial condition ensemble of solutions for the numerical model obtained by executing the numerical model computer program repeatedly using the same initial conditions as the perturbed architecture ensemble for one solution but varying initial conditions for other solutions. These embodiments have the advantage that constructing the PA ensemble may inform the programmer as to suitable implementation parameters for the target computing environment, or even the host computing environment. The variation of the software and/or hardware parameters may be selected to try to encompass possible variations in the results which may be caused by the new computing environment.

The initial condition ensemble may be provided by solving the problem using the same combination of software and hardware parameters for each solution, the same initial conditions as the perturbed architecture ensemble for one solution, and varying the initial conditions randomly for each of n-1 other solutions, to give an ensemble of n solutions.

Viewed in concrete terms, the numerical problem embodied in the computer program can be solved using an iterative process to solve a problem at time points representing a time progression over a period of time. That is, there are many iterations required to reach a (stable) solution for each time point. The mathematical host-target distance is taken at one point in the period of time (preferably at the end of the progression/last iteration) from a value/set of spatially separated values for a variable to be determined from the host ensemble and target ensemble. The variable may be a single value such as temperature, or a set of values such as a temperature distribution. The distance may be calculated using a single value and/or mean of a set of values and/or variance of a set of values.

Thus in some embodiments, the numerical model to solve an initial value problem is a model of a physical system, and the host and target computing environments run simulations using an iterative process to solve a problem at time points representing a time progression to produce a solution in the form of one or more physical variables, such as temperature.

The solution may be in the form of a plurality of different physical variables, and the mathematical distance measure may hence take a multinomial form (e.g. temperature and stress, or temperature and precipitation).

The numerical model computer program which executes to solve the initial value problem must be ported to the target computing environment before verification (before the target ensemble is provided and in order to enable provision of the target ensemble) and may be verified as correctly ported if the mathematical host-target distance is under a threshold. The threshold may be user defined. The user may, for example, decide on a higher threshold (implying higher tolerance to differences between the results and acceptance of a higher level of risk that the target computing environment will not give equivalent results to the host computing environment), or a lower threshold (implying lower tolerance to differences between the results and a lower level of risk that the target computing environment results will not be equivalent to those of the host computing environment).

The user may, for example, define the value of a in the range (μ − aσ, μ + aσ).

If verification fails (or even if it is successful), the results in a PA ensemble may be automatically or manually processed to identify advantageous/disadvantageous implementation parameters (e.g. for possible use on the target system). Then, the advantageous parameters may be implemented and/or disadvantageous parameters omitted (and the verification carried out again, if required). This allows the verification process to additionally improve the porting settings.

After successful verification of the porting, the computer program may be subsequently executed on the target computing environment for a (much) longer time progression than used in the perturbed architecture ensemble and initial condition ensemble. For example, the subsequent length of simulation may be at least 10 times as long as in the test phase used for the ensembles. The resultant solution may then be used to adjust an industrial process or design. For example, if a cooling process of processors is modelled when designing a mother board, then a solution with high temperatures unacceptable to the designer may indicate that a different positioning of the processors or heat sinks or a different arrangement of cooling fluid is required.

A number of repeats used for the host (or perturbed architecture) ensemble and target (or initial condition) ensemble may be 10 or more. Hence, the number of ensemble members may be 10 or more (depending on the amount of computing power available).

According to an embodiment of a further aspect of the invention, there is provided a porting verification method arranged to check for errors created during porting from a host computing environment to a target computing environment of a numerical model computer program which executes iteratively to solve an initial value problem, the method comprising: obtaining from the host computing environment a host ensemble of solutions for the numerical model by executing the numerical model computer program repeatedly varying execution parameters; and obtaining from the target computing environment a target ensemble of solutions for the numerical model by executing the numerical model computer program repeatedly varying execution parameters; computing a mathematical distance between the host ensemble, and the target ensemble; and using the mathematical host-target ensemble distance to conclude whether the porting has created an error.

Optionally, the porting verification method comprises running a simulation on the host computing environment n times varying software and/or hardware parameters to create a perturbed architecture (PA) ensemble as the host ensemble, and running a simulation on the target computing environment n times varying initial conditions to create an initial condition (IC) ensemble as the target ensemble.

A method may be provided including porting to the target computing environment (system) before the porting verification method and/or improvement of the implementation parameters on the target system using PA ensemble results (from the host and/or target systems).

According to an embodiment of a further aspect of the invention, there is provided a computer program which when executed carries out a porting verification method arranged to check for errors created during porting from a host computing environment to a target computing environment of a numerical model computer program which executes iteratively to solve an initial value problem, the method comprising: obtaining from the host computing environment a host ensemble of solutions for the numerical model by executing the numerical model computer program repeatedly varying execution parameters; and obtaining from the target computing environment a target ensemble of solutions for the numerical model by executing the numerical model computer program repeatedly varying execution parameters; computing a mathematical distance between the host ensemble, and the target ensemble; and using the mathematical host-target ensemble distance to conclude whether the porting has created an error.

The invention can be implemented in digital electronic circuitry, or in computer hardware, firmware, software, or in combinations of them. The invention can be implemented as a computer program or computer program product, i.e., a computer program tangibly embodied in an information carrier, e.g., in a machine-readable storage device or in a propagated signal, for execution by, or to control the operation of, one or more hardware modules.

A computer program can be in the form of a stand-alone program, a computer program portion or more than one computer program compiled into a single package and can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a data processing environment. A software tool implementing an invention embodiment may be written in R/Python/PostgreSQL. A computer program can be deployed to be executed on one module or on multiple modules at one site or distributed across multiple sites and interconnected by a communication network.

Method steps of the invention can be performed by one or more programmable processors executing (a) computer program(s) to perform functions of the invention by operating on input data and generating output. Apparatus of the invention can be implemented as programmed hardware or as special purpose logic circuitry, including e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer apparatus are a processor for executing instructions coupled to one or more memory devices for storing instructions and data.

Test scripts and script objects can be created in a variety of computer languages. Representing test scripts and script objects in a platform independent language, e.g., Extensible Markup Language (XML), allows one to provide test scripts that can be used on different types of computer platforms.

The invention is described in terms of particular embodiments. Other embodiments are within the scope of the following claims. For example, method steps of the invention can be performed in a different order and still achieve desirable results. Multiple test script versions can be edited and invoked as a unit without using object-oriented programming technology; for example, the elements of a script object can be organized in a structured database or a file system, and the operations described as being performed by the script object can be performed by a test control program.

A method according to preferred embodiments of the present invention can comprise any combination of the previous apparatus aspects. Methods according to any embodiments can be described as computer-implemented in that they require processing and memory capability.

The apparatus and functional modules therein including the analytics engine according to preferred embodiments are described as configured or arranged to carry out certain functions. This configuration or arrangement could be by use of hardware or middleware or any other suitable methodology. In preferred embodiments, the configuration or arrangement is by software, and the modules are software modules.

According to a further aspect there is provided a program which when loaded onto a computer apparatus or system configures the apparatus to carry out the method steps according to any of the preceding method definitions or any combination thereof.

In general the hardware mentioned may comprise the modules listed as being configured or arranged to provide the functions defined. For example this hardware may include memory, processing, and communications circuitry to connect the apparatus (local hardware apparatus) offline or online to the host and target environments running the numerical model computer program, possibly via the Internet/cloud. The simulation software produces data which is stored, for example, in a database such as PostgreSQL. R/Python scripts are an example of a way of reading the data to apply the techniques described in the invention embodiments.

Elements of the invention may be described using the terms "module" or "engine". The skilled person will appreciate that this term and its equivalents may refer to parts of the apparatus that are spatially separate but combine to serve the function defined. Equally, the same physical parts of the system may provide two or more of the functions defined. For example, separately defined modules may be implemented using the same memory and/or processor as appropriate.

Non-limiting and exemplary embodiments of the invention will now be described with reference to the appended figures, in which:
Figure 1 is a flowchart illustrating an overview embodiment of the present invention;
Figure 2 is a schematic hardware diagram of the functional parts of an apparatus according to invention embodiments;
Figure 3 is a hardware diagram showing more detailed functionality of the functional parts;
Figure 4 is a schematic representation of implementation parameters that can be varied to create a PAE; and
Figure 5 is an overview diagram showing the apparatus according to the invention and connections to a host environment and a target environment.

Figure 1 illustrates the basic process underlying invention embodiments. In step S10, a host ensemble is obtained from a host computing environment (for example a computing environment or system which is currently used but is outdated or not optimised for the kind of initial value problem being executed). The ensemble represents a set of executions, with solutions (values produced during the execution of the numerical model). Each execution has at least one execution parameter which differs from all the other executions.

In step S20 a target ensemble is obtained from a target computing environment. Hence the simulation has already been ported to the target computing environment, necessitating selection of software/hardware implementation parameters. The target computing environment may have a heterogeneous CPU/GPU architecture, and/or may be at least theoretically advantageous in comparison to the host computing environment in terms of speed or energy saving. However it can be untried in terms of the particular numerical model in consideration, and porting the model to the target computing environment may have given rise to one or more systematic errors, thus rendering it unsuitable (at least without adjustment).

The target ensemble also has varying execution parameters. As explained previously, the host ensemble may be a PA ensemble, and the target ensemble may be an IC ensemble. Each ensemble can represent a normal distribution.

In step S30, a mathematical distance between the host ensemble and target ensemble is calculated.

In step S40, this distance is used to conclude whether the porting has created an error. Such an error leads to more than simple random variation of the model outcome and a very different distribution between the host ensemble and the target ensemble.
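Steps S10 to S40 can be tied together in a short sketch such as the one below. It assumes that each ensemble has already been reduced to one scalar solution per member, that the distance is a Bhattacharyya distance between normal fits, and that the decision uses an interval built from random half-splits of the host ensemble. The function names and the numerical values are made-up illustrations, not results from any real system.

```python
import math
import random
import statistics


def bhattacharyya_normal(p, q):
    """Bhattacharyya distance between normal fits to samples p and q."""
    var_p, var_q = statistics.variance(p), statistics.variance(q)
    mean_term = 0.25 * (statistics.mean(p) - statistics.mean(q)) ** 2 / (var_p + var_q)
    var_term = 0.5 * math.log((var_p + var_q) / (2.0 * math.sqrt(var_p * var_q)))
    return mean_term + var_term


def verify_port(host, target, a=2, n_splits=100, seed=0):
    """S30: compute the host-target distance; S40: decide on the porting."""
    distance = bhattacharyya_normal(host, target)
    # Spread of the host ensemble itself (assumed scheme: random half-splits).
    rng = random.Random(seed)
    half = len(host) // 2
    intra = []
    for _ in range(n_splits):
        shuffled = rng.sample(host, len(host))
        intra.append(bhattacharyya_normal(shuffled[:half], shuffled[half:]))
    mu, sigma = statistics.mean(intra), statistics.stdev(intra)
    low, high = mu - a * sigma, mu + a * sigma
    return {"distance": distance, "interval": (low, high),
            "verified": low < distance < high}


if __name__ == "__main__":
    # S10 and S20: illustrative, made-up temperatures (one per ensemble member).
    host_ensemble = [24.8, 25.0, 25.1, 24.9, 25.2, 25.0, 24.7, 25.3, 24.9, 25.1]
    target_ensemble = [24.9, 25.1, 25.0, 24.8, 25.2, 25.1, 24.8, 25.2, 25.0, 24.9]
    print(verify_port(host_ensemble, target_ensemble))
```

A target ensemble that is shifted systematically, for example by several degrees in every member, yields a distance far above the upper bound of the interval and is therefore reported as not verified.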

The primary objective of the tool is to isolate errors (inaccuracies or random variations) that may develop due to the epsilon size of the machine (that is, round off errors due to HPC processor architecture and other features as described in the background section) from the errors that may be introduced to the model during the porting/coding process. The verification tool contains three components as described below.

Figure 2 shows the functional parts of the porting verification apparatus 100. These are data module 20, analytics engine/module 30, and decision module 40. The basic function of the data module 20 is to obtain and store the ensembles, for example in the form of parameter values varying across space and over time. Since only a few values may be necessary to verify the porting process (at least in comparison to the data in a full simulation) no special database is necessary. The analytics engine 30 calculates one or more distance measures (also referred to as metrics) to check for the similarity between the host ensemble and the target ensemble. It can also calculate a confidence interval within which the distance between the host and target ensembles is acceptably small. The decision module 40 holds a threshold for the distance measure or metric and checks if the distance metric is within the confidence interval.

Figure 3 represents the same elements in a more detailed embodiment, with a summary of their functioning, which is set out in more detail below. The embodiment described provides a host PA ensemble and a target IC ensemble modelling a climate, but the skilled reader will understand the variations required to produce other ensembles for other models.

Data Module

The model on the host system provides a reference simulation using standard hardware and software parameter settings. The hardware/software parameters of the host system will then be varied by the user (or programmer) manually one at a time for N-1 times and a new set of simulations will be carried out. All the simulations are of the same length of time.

For example, the user can vary the number of task and data parallelisms, programming/runtime environment settings, operating system releases and middleware versions one at a time to produce an ensemble of N simulations in total. The key here is to identify and perturb the host computer parameters that would influence the machine rounding-off errors to produce the ensemble, and the skilled person with knowledge of the environment will be aware of which parameters to vary, as explained in more detail below.

Figure 4 shows a schematic parameter diagram of a database schema in the form of a "star schema". There are five arms attaching tables with different titles and classes (dimensions) of properties of the runtime environment and application. The table in the middle captures results (for example, values of Temp etc). The lowest table refers to the application providing the simulation or other numerical problem. This could be the same application with a range of complexities described by the datasize and/or functionality.

The tables "HardwareKey" and "KernelKey" can define hardware settings. The kernel could be seen as a "middleware", in between software and hardware. It is possible to change kernel parameters to access or configure hardware resources, but this is uncommon.

The other tables refer to software parameters. Some or all of the "HardwareKey" parameters may in fact be fixed for a given high performance computer, or may be variable (for example, number of cores used in a parallel configuration of the model defined by the computational domain decomposition). However, parameters described in other tables can be varied, either one at a time or multiple parameters at the same time. Hence changing some of the hardware settings (those described in HardwareKey) may be difficult, but changing other parameters can still change the computing environment.

For instance, it would be possible to change "BuildLibraryVersion" and run a temperature model keeping all other settings unchanged. As an example, the model will then produce a value of Temp (say 25C) at time T=10days. Now we can change the value of another parameter, say "CompilerVersion" in addition to "BuildLibraryVersion" or on its own. This will produce a value for Temp (say 24.8) at time T=10days. This has generated 2 ensemble members.

This methodology allows generation of many ensemble members, especially if all possible combinations are used.
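The enumeration of perturbed configurations described above can be sketched as follows. This is a minimal illustration only: the parameter names "BuildLibraryVersion" and "CompilerVersion" come from the example above, while "MPITasks", all the values, and the function names are hypothetical and are not taken from the schema of Figure 4.

```python
from itertools import product

# Hypothetical reference settings of the host system (values are invented).
BASELINE = {
    "BuildLibraryVersion": "1.0",
    "CompilerVersion": "14.0",
    "MPITasks": 64,
}

# Hypothetical alternative values each parameter may take.
ALTERNATIVES = {
    "BuildLibraryVersion": ["1.1"],
    "CompilerVersion": ["15.0"],
    "MPITasks": [32, 128],
}


def one_at_a_time(baseline, alternatives):
    """Reference configuration plus every single-parameter perturbation."""
    configs = [dict(baseline)]
    for name, values in alternatives.items():
        for value in values:
            configs.append({**baseline, name: value})
    return configs


def all_combinations(baseline, alternatives):
    """Every combination of reference and alternative values (reference first)."""
    names = list(alternatives)
    choices = [[baseline[name]] + list(alternatives[name]) for name in names]
    return [{**baseline, **dict(zip(names, combo))} for combo in product(*choices)]


if __name__ == "__main__":
    print(len(one_at_a_time(BASELINE, ALTERNATIVES)), "one-at-a-time members")
    print(len(all_combinations(BASELINE, ALTERNATIVES)), "members from all combinations")
```

Each returned dictionary would correspond to one simulation run, and hence to one member of the perturbed architecture ensemble. How a configuration is actually applied to the host system is outside the scope of this sketch.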

The length of the test-phase simulation depends on the complexity of the model that may be coupled with several other models. For example, for an atmospheric general circulation model in which the ocean and land-surface are considered as external boundaries, the length of the test-phase simulation could be anywhere between 10 and 30 days. On the other hand, for an earth-system model in which the ocean, land-surface and atmospheric chemistry are modelled individually and coupled dynamically, the length of the reference simulation could be between 90 and 120 days or more, to account for the exchange of the information between all sub-models. In all these members the same initial conditions will be used. Since the parameters of the HPC computing environment are perturbed to generate this ensemble of simulations, it is referred to as a perturbed architecture ensemble (PAE).

Another ensemble of N simulations each also of the same length is generated on the target system using the same model. This time each ensemble member is made to differ from the rest by perturbing their corresponding initial conditions randomly, as is known in the art. For example, the target system produces a single simulation using the same initial condition used in the PA ensemble. This initial condition is then randomly perturbed to create a set of N-1 initial conditions. The model is then run using these initial conditions for N-1 repeats, to produce a total of N ensemble members. Since this ensemble is produced by perturbing the initial conditions, it is called an initial condition ensemble (ICE).
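The preparation of the N initial conditions for the IC ensemble might be sketched as below. The text only states that the initial condition is perturbed randomly; the use of additive Gaussian noise, the amplitude value, the fixed seed and the function name are assumptions made purely for illustration.

```python
import random


def initial_condition_ensemble(reference_state, n_members, amplitude=1e-3, seed=0):
    """Return n_members initial states: the reference plus n_members - 1 perturbed copies.

    reference_state: list of floats (for example, temperatures at grid points).
    amplitude: standard deviation of the additive Gaussian noise (an assumption).
    """
    rng = random.Random(seed)
    ensemble = [list(reference_state)]  # member 1 uses the unperturbed initial condition
    for _ in range(n_members - 1):
        perturbed = [x + rng.gauss(0.0, amplitude) for x in reference_state]
        ensemble.append(perturbed)
    return ensemble


if __name__ == "__main__":
    members = initial_condition_ensemble([288.0, 289.5, 290.1], n_members=10)
    print(len(members), "initial conditions prepared")
```

The model would then be run once per returned state on the target system, all with identical software and hardware settings.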

Analytics Engine Rationale The objective is to establish that the differences in model solutions arising from running the model on a new HPC infrastructure (target environment/system) are solely due to rounding off errors. These rounding off errors are random in nature but since the climate system is chaotic, small changes in initial state of the climate system such as changes in temperature and precipitation can lead to a very different future climate state. On the other hand, errors in porting the model to the target machine would also lead to differences in model solutions, resulting in a different future climate state. These coding and porting errors are expected to significantly affect the model systematically. That is, the model would consistently underestimate or overestimate, as opposed to random errors which will randomly affect the future simulated state.

The only prior art way to establish if the errors are random or systematic is to run the climate model for several decades in both host and target environments and make a detailed comparison. This requires subject matter expertise, in addition to large computing and storage resources, resulting in a slow model performance improvement process.

Metrics The PA ensemble helps to quantify the influence of the rounding-off errors on the simulated future state probabilistically. Similarly the IC ensemble helps to quantify the influence of the natural (chaotic) variability on the simulated future state. The larger the ensemble size (N), the better the accuracy of the estimated probabilistic distributions. N is usually fixed based on a certain methodology. Here it is postulated that the machine rounding-off errors will have an influence on the simulated climate similar to that of the natural variability of the system, as they both are expected to influence the initial state randomly. This requires us to test the equality of two distributions P and I generated by the PA and IC ensembles.

Here we use the Bhattacharyya distance measure B. Assuming that the PA and IC distributions are normally distributed with means (μ_P, μ_I) and variances (σ_P², σ_I²), the Bhattacharyya distance between them can be obtained as follows:

$$B(P, I) = \frac{1}{4}\ln\left(\frac{1}{4}\left(\frac{\sigma_P^2}{\sigma_I^2} + \frac{\sigma_I^2}{\sigma_P^2} + 2\right)\right) + \frac{1}{4}\left(\frac{(\mu_P - \mu_I)^2}{\sigma_P^2 + \sigma_I^2}\right) \qquad (1)$$

For example, (μ_P, μ_I) and (σ_P², σ_I²) are the means and variances of the temperature field simulated by the PA and IC ensemble members respectively at a time point T during the simulation (for example 6 January 2015), preferably at the end of the test simulation.

If B (P, I) = 0 then P = I. That is the distributions P and I are identical with the same mean and variance.

If B (P, I) = ∞ then the distributions P and I are orthogonal to each other.
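As an illustration, Equation (1) can be evaluated directly from the ensemble values at time T. The sketch below assumes the standard closed form of the Bhattacharyya distance between two univariate normal distributions and estimates the variances with the sample variance; the function names and the temperature values are hypothetical.

```python
import math
from statistics import mean, variance


def bhattacharyya_normal(mean_p, var_p, mean_i, var_i):
    """Bhattacharyya distance between two univariate normal distributions."""
    spread_term = 0.25 * math.log(0.25 * (var_p / var_i + var_i / var_p + 2.0))
    shift_term = 0.25 * (mean_p - mean_i) ** 2 / (var_p + var_i)
    return spread_term + shift_term


def ensemble_distance(pa_values, ic_values):
    """B(P, I) from the raw ensemble values at one time point (sample statistics)."""
    return bhattacharyya_normal(
        mean(pa_values), variance(pa_values),
        mean(ic_values), variance(ic_values),
    )


if __name__ == "__main__":
    # Invented temperatures (degrees C) at time T for two small ensembles.
    pa = [25.0, 24.8, 25.1, 24.9, 25.2]
    ic = [25.1, 24.7, 25.0, 25.3, 24.9]
    print(ensemble_distance(pa, ic))
```

Identical means and variances give a distance of zero, and the distance grows as the two distributions separate.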

Now let us calculate similar Bhattacharyya distance measures between two random sub-samples drawn from the PA ensemble (for example using random numbering and selection of results from each simulation at time T), following the well-known bootstrap method. Repeat this calculation n times using n pairs of samples drawn with replacement from the PA ensemble, resulting in n distance measures. Let us denote these B (P_1^i, P_2^i), where i = 1, 2, 3, ..., n. The value of n could be as high as 1000.

Following the central limit theorem, for large values of n, the Bhattacharyya distance measures B (P_1^i, P_2^i) are expected to be normally distributed with mean μ and variance σ².

This normal property allows us to calculate an acceptance interval (μ - a*σ, μ + a*σ), where a = 1 or 2 or 3, depending on what degree of closeness between PAE and ICE is sought.
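The bootstrap step and the resulting acceptance interval could be sketched as follows. The size of each sub-sample is not specified above; here each sub-sample is assumed to have the same size as the PA ensemble. Degenerate resamples with zero variance are redrawn, which is also an assumption of this sketch, as are all function names.

```python
import math
import random
from statistics import mean, stdev, variance


def bhattacharyya_normal(mean_p, var_p, mean_i, var_i):
    """Bhattacharyya distance between two univariate normal distributions."""
    spread_term = 0.25 * math.log(0.25 * (var_p / var_i + var_i / var_p + 2.0))
    shift_term = 0.25 * (mean_p - mean_i) ** 2 / (var_p + var_i)
    return spread_term + shift_term


def resample(values, rng):
    """Draw len(values) items with replacement, redrawing zero-variance samples."""
    while True:
        sample = rng.choices(values, k=len(values))
        if variance(sample) > 0.0:
            return sample


def bootstrap_distances(pa_values, n_pairs=1000, seed=0):
    """n_pairs intra-ensemble distances between pairs of PA sub-samples."""
    if len(set(pa_values)) < 2:
        raise ValueError("the PA ensemble needs at least two distinct values")
    rng = random.Random(seed)
    distances = []
    for _ in range(n_pairs):
        first, second = resample(pa_values, rng), resample(pa_values, rng)
        distances.append(bhattacharyya_normal(
            mean(first), variance(first), mean(second), variance(second)))
    return distances


def acceptance_interval(distances, a=3):
    """Interval (mu - a*sigma, mu + a*sigma) of the bootstrap distances."""
    mu, sigma = mean(distances), stdev(distances)
    return mu - a * sigma, mu + a * sigma


if __name__ == "__main__":
    pa = [25.0 + 0.1 * k for k in range(12)]  # invented PA ensemble values
    print(acceptance_interval(bootstrap_distances(pa), a=3))
```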

Decision Module If the calculated Bhattacharyya distance between PAE and ICE ensembles (B (P, I)) falls within the interval (μ - a*σ, μ + a*σ) with a = 3, then one could argue that the distance in the Euclidean space between PAE and ICE ensembles can be considered as one of the distances calculated using a large number of sub-samples drawn from the PA ensemble alone, because about 99.7% of such distances will fall within this interval. That is, there are no significant differences in model simulations due to the porting.

If B (P, I) falls outside this interval, one can then conclude that the ensembles PAE and ICE are significantly different from each other and that there is a systematic error in the porting/coding of the model on the target system. That is, there is a porting error in addition to the rounding-off errors of the new HPC (target system).
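The decision rule can be expressed compactly. The following is a minimal sketch, assuming the host-target distance and the intra-host bootstrap distances have already been computed; the function name and the numerical values are hypothetical.

```python
from statistics import mean, stdev


def porting_verdict(host_target_distance, intra_host_distances, a=3):
    """Return (accepted, (lower, upper)) for the acceptance-interval test.

    accepted is True when the host-target distance lies strictly inside
    (mu - a*sigma, mu + a*sigma), i.e. no systematic porting error is indicated.
    """
    mu = mean(intra_host_distances)
    sigma = stdev(intra_host_distances)
    lower, upper = mu - a * sigma, mu + a * sigma
    accepted = lower < host_target_distance < upper
    return accepted, (lower, upper)


if __name__ == "__main__":
    intra = [0.10, 0.12, 0.11, 0.09, 0.13]  # invented bootstrap distances
    print(porting_verdict(0.12, intra))  # inside the interval
    print(porting_verdict(0.50, intra))  # outside the interval
```

A smaller value of a narrows the interval and so makes the test harder to pass.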

In theory the value of the constant a can be anything between 0 and ∞, where a = 0 would enforce the most stringent test and any value of a above 3 would be a liberal test.

It is noted that all the steps in this apparatus/tool are innovative, except the IC ensemble preparation, which is one of the two experiment steps providing the ensembles. Although the Bhattacharyya distance, as described in Equation (1), is not new, its application to this particular problem is novel. The PAE has not been designed or applied before. The metrics for the distance measure B(P,I) and the Acceptance Region are innovative, as is the method of making the decision whether the porting has been successful.

Generalization This method could very well be used if the decision maker wants to compare more than one parameter from each ensemble simultaneously. For example, it is reasonable to argue that the climate state is not fully defined by the temperature state alone but also by other climate parameters such as precipitation, evaporation, soil moisture etc. In this case, the Bhattacharyya distance measure described in Equation (1) will take a multinomial form. The rest of the method will hold.
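For reference, one common multi-parameter form is the Bhattacharyya distance between two multivariate normal distributions. It is given here as an assumption about what the generalised measure could look like, not as the form prescribed by the embodiment. The symbols are introduced for illustration: the mean vectors of the chosen climate parameters in the PA and IC ensembles, and their covariance matrices.

```latex
% Assumed multivariate normal form; \boldsymbol{\mu}_P, \boldsymbol{\mu}_I are mean vectors,
% \Sigma_P, \Sigma_I are covariance matrices of the PA and IC ensembles.
B(P, I) = \frac{1}{8}\,(\boldsymbol{\mu}_P - \boldsymbol{\mu}_I)^{\mathsf{T}}\,
          \Sigma^{-1}\,(\boldsymbol{\mu}_P - \boldsymbol{\mu}_I)
        + \frac{1}{2}\,\ln\!\left(\frac{\det \Sigma}{\sqrt{\det \Sigma_P \,\det \Sigma_I}}\right),
\qquad \Sigma = \frac{\Sigma_P + \Sigma_I}{2}
```

With a single parameter, the covariance matrices reduce to the two variances and this expression reduces to Equation (1).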

Hardware Figure 5 is a schematic diagram illustrating components of hardware that can be used with invention embodiments. In one scenario, the apparatus 100 of invention embodiments can be brought into effect on a simple stand-alone PC or terminal shown in Figure 5, which may provide graphs and plots on demand. The terminal comprises a monitor 101, shown displaying a GUI 102, a keyboard 103, a mouse 104 and a tower 105 housing a CPU, RAM, one or more drives for removable media as well as other standard PC components which will be well known to the skilled person. Other hardware arrangements, such as laptops, iPads and tablet PCs in general could alternatively be provided. The software for carrying out the method of invention embodiments as well as simulation data 301 from a file system and any other file required 302 may be downloaded from one or more databases, for example over a network such as the Internet, or using removable media. Any modified file can be written onto removable media or downloaded over a network.

As mentioned above, the PC may act as a terminal and use one or more servers 200 to assist in carrying out the methods of invention embodiments. In this case, any data files and/or software for carrying out the method of invention embodiments may be accessed from database 300 over a network and via server 200. The server 200 and/or database 300 may be provided as part of a cloud 400 of computing functionality accessed over a network to provide this functionality as a service. In this case, the PC may act as a dumb terminal for display, and user input and output only. Alternatively, some or all of the necessary software may be downloaded onto the local platform provided by tower 105 from the cloud for at least partial local execution of the method of invention embodiments.

Benefits Embodiments of the invention enable a decision maker to check quickly if a large-scale model solving initial value problems (e.g., climate model or heat transfer in component modelling) ported on a new HPC infrastructure is free from porting and coding errors that may have occurred during the porting process.

Invention embodiments can provide a metric based tool which enables an objective verification of the ported model on a new HPC system, removing the need for subject matter expertise.

Since this tool also eliminates the need for running full climate simulations, it reduces the consumption of HPC and time resources, and thus accelerates the code performance improvement cycle.

Although the tool has been illustrated in detail for weather and climate models, it can very well be applied to any large-scale numerical model that solves an initial value problem. Alternative applications include manufacturing design (for example, the cooling process of processors when designing a motherboard), transport and prediction of atmospheric air pollution, and modelling heat transfer problems.

Claims

CLAIMS

1. A porting verification apparatus arranged to check for errors created during porting from a host computing environment to a target computing environment of a numerical model computer program which executes iteratively to solve an initial value problem, the apparatus comprising: a data module arranged to: obtain from the host computing environment, a host ensemble of solutions for the numerical model by executing the numerical model computer program repeatedly varying execution parameters; and to obtain from the target computing environment, a target ensemble of solutions for the numerical model by executing the numerical model computer program repeatedly varying execution parameters; an analytics engine arranged to compute a mathematical distance between the host ensemble and the target ensemble, and a decision module arranged to use the mathematical host-target ensemble distance to conclude whether the porting has created an error.
2. An apparatus according to claim 1, wherein varying execution parameters introduces random errors in the numerical calculations, providing a distribution of outcomes whose spread can be represented by a distribution of intra-ensemble mathematical distance measures; and the analytics module is arranged to compute a distribution of intra-ensemble mathematical distance measures of the host ensemble, and the decision module is arranged to compare the distribution of host intra-ensemble mathematical distance measures with the mathematical host-target ensemble distance to conclude whether the porting has created an error.
3. An apparatus according to claim 2, wherein the decision module is arranged to use the distribution of host intra-ensemble mathematical distance measures within the host ensemble to compute an acceptance interval for the mathematical host-target ensemble distance of (μ - a*σ, μ + a*σ), where a is a positive integer value, μ is a mean of the distribution of host intra-ensemble mathematical distance measures and σ is a standard deviation of the distribution of host intra-ensemble mathematical distance measures.
4. An apparatus according to claim 2 or 3, wherein the host intra-ensemble mathematical distance measures are Bhattacharyya distance measures.
5. An apparatus according to any of the preceding claims, wherein the mathematical host-target ensemble distance is a Bhattacharyya distance measure.
6. An apparatus according to claim 1, wherein varying execution parameters introduces random errors in the numerical calculations, providing a range of outcomes whose spread can be represented by a distribution of intra-ensemble distance measures; and the analytics module is arranged to compute a distribution of target intra-ensemble distance measures, and the decision module is arranged to compare the target intra-ensemble distance measure distribution with the mathematical host-target ensemble distance to conclude whether the porting has created an error.
7. An apparatus according to any of the preceding claims wherein the host and/or the target ensemble is a perturbed architecture ensemble of solutions obtained by executing the numerical model computer program repeatedly using the same initial conditions but varying software and/or hardware parameters.
8. An apparatus according to any of the preceding claims, wherein the perturbed architecture ensemble is provided by solving the numerical problem using one combination of software and hardware parameters, and varying one or more software or hardware parameter for each of n-1 other solutions, to give an ensemble of n solutions.
9. An apparatus according to claim 7 or 8, wherein the parameters varied for the perturbed architecture ensemble include one or more of: task and/or data parallelism; programming and/or runtime environment settings; operating system release; middleware version.
10. An apparatus according to any of the preceding claims, wherein the host and/or the target ensemble is an initial condition ensemble of solutions obtained by executing the numerical model computer program repeatedly, varying the initial conditions each time.
11. An apparatus according to any of the preceding claims wherein the host ensemble is a perturbed architecture ensemble of solutions obtained by executing the numerical model computer program repeatedly, using the same initial conditions, but varying software and/or hardware parameters and the target ensemble is an initial condition ensemble of solutions for the numerical model obtained by executing the numerical model computer program repeatedly using the same initial conditions as the perturbed architecture ensemble for one solution but varying initial conditions for other solutions.
12. An apparatus according to claim 11, wherein the initial condition ensemble is provided by solving the problem using the same combination of software and hardware parameters for each solution, the same initial conditions as the perturbed architecture ensemble for one solution, and varying the initial conditions randomly for each of n-1 other solutions, to give an ensemble of n solutions.
13. An apparatus according to any of the preceding claims, wherein the numerical problem is solved using an iterative process to solve a problem at time points representing a time progression over a period of time, and the mathematical host-target distance is taken at one point in the period of time from a value/set of spatially separated values for a variable to be determined from the host ensemble and target ensemble.
14. An apparatus according to any of the preceding claims, wherein the numerical model to solve an initial value problem is a model of a physical system, and the host and target computing environments run simulations using an iterative process to solve a problem at time points representing a time progression to produce a solution in the form of one or more physical values, such as temperature.
15. An apparatus according to any of the preceding claims, wherein the solution is in the form of a plurality of different physical values, and the mathematical distance measure takes a multinomial form.
16. An apparatus according to any of the preceding claims, wherein the numerical model computer program which executes to solve the initial value problem is ported to the target computing environment before verification and verified as correctly ported if the mathematical host-target distance is under a threshold.
17. An apparatus according to claim 16, wherein the problem is subsequently executed on the target computing environment for a longer length of time than used for the perturbed architecture ensemble and initial condition ensemble, and the resultant solution is preferably used to adjust an industrial process or design.
18. An apparatus according to any of the preceding claims, wherein a number of repeats used for the host ensemble and target ensemble is 10 or more and hence a number of ensemble members for the host and target ensembles is 10 or more.
19. A porting verification method arranged to check for errors created during porting from a host computing environment to a target computing environment of a numerical model computer program which executes iteratively to solve an initial value problem, the method comprising: obtaining from the host computing environment a host ensemble of solutions for the numerical model by executing the numerical model computer program repeatedly varying execution parameters; and obtaining from the target computing environment a target ensemble of solutions for the numerical model by executing the numerical model computer program repeatedly varying execution parameters; computing a mathematical distance between the host ensemble, and the target ensemble; and using the mathematical host-target ensemble distance to conclude whether the porting has created an error.
20. A porting verification method according to claim 19, further comprising running a simulation on the host computing environment n times varying software and/or hardware parameters to create a perturbed architecture ensemble, and running a simulation on the target computing environment n times varying initial conditions to create an initial condition ensemble.
21. A computer program which when executed carries out a porting verification method arranged to check for errors created during porting from a host computing environment to a target computing environment of a numerical model computer program which executes iteratively to solve an initial value problem, the method comprising: obtaining from the host computing environment a host ensemble of solutions for the numerical model by executing the numerical model computer program repeatedly varying execution parameters; and obtaining from the target computing environment a target ensemble of solutions for the numerical model by executing the numerical model computer program repeatedly varying execution parameters; computing a mathematical distance between the host ensemble, and the target ensemble; and using the mathematical host-target ensemble distance to conclude whether the porting has created an error.