Article

ssc-cdi: A Memory-Efficient, Multi-GPU Package for Ptychography with Extreme Data

by Yuri Rossi Tonin *, Alan Zanoni Peixinho, Mauro Luiz Brandao-Junior, Paola Ferraz and Eduardo Xavier Miqueles *
Brazilian Synchrotron Light Laboratory, Brazilian Center for Research in Energy and Materials (CNPEM), Campinas 13085-970, Brazil
* Authors to whom correspondence should be addressed.
J. Imaging 2024, 10(11), 286; https://doi.org/10.3390/jimaging10110286
Submission received: 31 August 2024 / Revised: 28 October 2024 / Accepted: 4 November 2024 / Published: 7 November 2024
(This article belongs to the Special Issue Recent Advances in X-ray Imaging)
Figure 1. Diagram illustrating the batch distribution of measurements for three GPUs. Each colored block represents data inside the respective GPU. A batch of $B \le N$ measurements is distributed to the GPU memory, so that the wavefronts are updated in parallel. Once a GPU finishes processing and becomes available, a remaining batch of unprocessed data is loaded from RAM. After all batches have been loaded and all the wavefronts updated, the new object and probe matrices are calculated by GPU$_0$ and then broadcast to the other GPUs, so that each of them has faster access to $O$ and $P$ in the subsequent iteration.
Figure 2. Ptychography reconstruction of a Siemens Star measured at the CARNAÚBA beamline. The finest features of the innermost circles are spaced 15 nm from each other. The complex probe is shown in an HSV colormap, with saturation encoding magnitude and hue encoding phase.
Figure 3. Comparison of the simulated sample against the reconstruction using the DM algorithm from different packages. The insets show a zoomed region from the red square in the object and phase reconstructions. The reconstruction for ssc-cdi used the RAAR algorithm with parameter $\beta = 1$, such that the update function equals that of DM. For PyNX and PtyPy, we used the DM engine directly. In all cases, the same initial guesses were used: random magnitude and constant phase for the object array, and an inverse Fourier transform of the averaged measurements for the probe.
Figure 4. Single-GPU performance of DM and PIE algorithms across different packages. The inset shows the same data without log scale on the vertical axis. Missing points on some curves indicate dimensions that were not supported by a specific engine. Note that PyNX does not provide an engine for an algorithm of the PIE family for comparison.
Figure 5. Multi-GPU performance of the DM algorithm for ssc-cdi and PtyPy using batch sizes of (a) 128 and (b) 16. Missing points on some curves indicate dimensions that were not supported by an engine. The inset plots the same data without log scale on the vertical axis.
Figure 6. Single-GPU performance of ssc-cdi for DM and PIE engines on a conventional machine. RAAR was run with batch size $B = 1$ and managed to run up to a data size of $2048^2$.

Abstract

We introduce ssc-cdi, an open-source software package from the Sirius Scientific Computing family, designed for memory-efficient, single-node multi-GPU ptychography reconstruction. ssc-cdi offers a range of reconstruction engines in Python version 3.9.2 and C++/CUDA. It aims at developing local expertise and customized solutions to meet the specific needs of beamlines and user community of the Brazilian Synchrotron Light Laboratory (LNLS). We demonstrate ptychographic reconstruction of beamline data and present benchmarks for the package. Results show that ssc-cdi effectively handles extreme datasets typical of modern X-ray facilities without significantly compromising performance, offering a complementary approach to well-established packages of the community and serving as a robust tool for high-resolution imaging applications.

1. Introduction

Coherent Diffractive Imaging (CDI) encompasses a family of lensless imaging techniques where the final resolution is no longer limited by the quality of optical components but by the wavelength of radiation and the detector size [1,2]. CDI techniques also go beyond traditional absorption imaging, providing both absorption and phase contrast information. Since the phase signal can be orders of magnitude larger than the absorption signal, phase contrast is a powerful tool for exploring materials with low absorption, such as soft matter samples [3]. Nevertheless, these advances come at a cost in experimental and algorithmic complexity. As the name suggests, CDI requires the use of coherent illumination, which is not always trivial to obtain, and measurements suffer from the phase problem [4], since X-ray detectors only measure the intensity of a signal and therefore lose the phase information.
Ptychography is a variation of CDI that offers a robust solution to the phase problem [5], and has been extremely successful in the X-ray and electron communities [6,7,8]. The increase in robustness comes from the redundancy in measurements obtained through scanning the sample with the coherent beam and making sure that each illuminated region overlaps with neighboring ones. The data redundancy not only makes the phase problem mathematically better posed, but also allows the deconvolution of the sample (referred to as the object) and the incoming wavefront (referred to as the probe) [9]. This approach is particularly effective because it eliminates the requirement for a fully coherent and clean wavefront, as was originally necessary in plane wave CDI (PWCDI) [10]. This relaxation in wavefront requirements eases wavefront engineering, allowing for enhanced beamline flux through focusing, and it places ptychography as a promising method for beam characterization and diagnostics [11]. The need for scanning naturally increases experimental complexity, as the probe positions at the object need to be precisely determined and the beam is assumed to remain stable throughout the experiment. The research community has been successful in circumventing issues that arise when these assumptions are no longer true. Examples are the introduction of position-correction algorithms [12,13,14,15], mixed-states or probe modes [16,17] and non-constant probe during scan [18]. Several ptychography algorithms have been proposed for solving the phase problem, namely the Ptychographic Iterative Engine (PIE) family [19,20,21], Difference Map (DM) [9], Relaxed Averaged Alternating Reflection (RAAR) [22,23] and Maximum Likelihood (ML) [24,25,26,27], among others [12,28,29,30]. Currently, the X-ray community offers many software packages implementing these algorithms, each with a different software architecture and optimization strategy [31,32,33,34,35,36,37,38,39,40].
Ptychography has a relatively recent history at the Brazilian Synchrotron Light Laboratory. Its previous light source, UVX [41], ceased operation in 2019. It was a second-generation machine and therefore did not offer enough coherence for CDI. With Sirius, the new fourth-generation light source of LNLS [42], two beamlines already offer ptychography to users: CARNAÚBA [43] and CATERETÊ [44]. The technique is also of interest for the EMA [45] and MOGNO [46] beamlines as a method for beam characterization. Ptychography will be used in future imaging experiments at LNLS, including in upcoming beamlines for the Orion laboratory [47].
The biggest issue with ptychography arguably lies in its cost. The scanning measurements place a heavy burden on experimental time, data storage and computational processing. This problem becomes even more evident in fourth-generation light sources, as the increased photon flux permits much shorter acquisition times, and the use of large-area detectors [48] generates vast amounts of data. Even with powerful computing nodes equipped with modern general-purpose Graphics Processing Unit (GPU) accelerators, reconstructing a single high-resolution image can take several minutes, and processing an entire tomography dataset may extend to several hours. Moreover, regardless of processing speed, the large volumes of data can pose significant challenges for data processing. Specifically, when a large field-of-view is required in ptychography, the number of measurements for current detector sizes may exceed memory capacity. Consequently, the implementation of High Performance Computing (HPC) strategies is necessary for improving both processing speed and data management.
In this work, we present ssc-cdi, a software package from the Sirius Scientific Computing (ssc) family [49,50,51] that encompasses reconstruction algorithms for ptychography. But why another package? The scientific computing group at LNLS supports beamlines with mathematical modeling and software packages for data processing, particularly those focused on imaging. Since LNLS/SIRIUS serves a relatively new user community for X-ray imaging in South America, a thorough understanding and control over techniques are crucial to consolidate scientific and technical knowledge locally, ensuring a successful beamtime delivery to the user community. Furthermore, the development of local expertise maximizes scientific throughput, enabling the conception of new scientific questions. While it is possible to provide various scientific processing pipelines for specific imaging beamlines by leveraging external tools like PtyPy [31] or PyNX [33], a deep understanding of the computational challenges in conjunction with the beamline allows for the development of customized strategies tailored to the facility’s needs. In this context, the package presented here addresses performance issues, optimizing the use of the computational resources available at LNLS [52]. Furthermore, the fast development of the data-transfer bandwidth between CPU and GPU makes a strong case for exploring CPU memory when dealing with extreme data. Although the amount of Video Random Access Memory (VRAM) in modern GPUs has also greatly increased [53], it is still far from the available Random Access Memory (RAM). Due to the limited availability of HPC nodes for each beamline at LNLS, for instance, we focus on efficiency using a single computing node. We also demonstrate that this can be an advantage for running ptychography outside of an HPC environment.
ssc-cdi is written in Python and offers multiple reconstruction engines for ptychography, namely rPIE, mPIE, Alternating Projections (AP), Relaxed Averaged Alternating Reflection (RAAR) and Maximum Likelihood (ML). All engines are accelerated on a single GPU with cupy [54] and most offer a C++/CUDA back-end with single- and multi-GPU functionality. Implementing engines directly in C++/CUDA, despite requiring substantial effort, provides enhanced control over memory management and enables significant advantages in terms of speed and efficiency. The engines also allow for multi-mode probe decomposition [16] and a position-correction approach in CUDA using the Annealing method [13]. In this work, we present an overview of both ptychography and the ssc-cdi package, which has been used for reconstructions at SIRIUS since 2021 [43,44,55]. We detail its HPC strategy and benchmark it against other well-established Python packages, PtyPy and PyNX, to provide an overview of strengths and limitations relative to the state-of-the-art. We show that ssc-cdi is a good alternative for processing large datasets without considerably compromising performance, offering a complementary approach to other packages of the community and serving as a robust tool for high-resolution imaging applications.

2. Materials and Methods

Ptychography aims at recovering the object $O(\mathbf{r})$ and probe $P(\mathbf{r})$ from a series of $N$ intensity measurements $I_i$, $1 \le i \le N$, at recorded positions $\mathbf{r}_i = (x_i, y_i, z)$. In what follows, we provide an overview of the main algorithms for solving the phase problem, referred to here as engines.
For X-rays of wavelength $\lambda$, the complex refractive index is given by $n(\mathbf{r}) = 1 - \delta(\mathbf{r}) + i\beta(\mathbf{r})$. Within the projection approximation [56], the object is related to the real and imaginary refractive indexes, $\delta(\mathbf{r})$ and $\beta(\mathbf{r})$, by the X-Ray Transform $\mathcal{R}$ [57]:
$$O(\mathbf{r}) = e^{-k\mathcal{R}\{\beta(\mathbf{r})\}}\, e^{-ik\mathcal{R}\{\delta(\mathbf{r})\}}, \tag{1}$$
where $k = 2\pi/\lambda$ and $\mathcal{R}$ is performed over the beam direction ($z$ axis). For a successful reconstruction, ptychography assumes known scan positions, constant $O(\mathbf{r})$ and $P(\mathbf{r})$ in time, and the multiplicative approximation for the wavefront in the output plane of the sample:
$$\psi_i(\mathbf{r}) = P(\mathbf{r})\, O(\mathbf{r} - \mathbf{r}_i). \tag{2}$$
Each scan point yields an intensity measurement, given by the absolute squared value of the propagated wavefront:
$$I_i(u, v) = \left| \mathcal{D}_d\{\psi_i(\mathbf{r})\} \right|^2, \tag{3}$$
where $u, v$ are the transversal coordinates in the detector plane. The operator $\mathcal{D}_d$ indicates a free-space propagator by a distance $d$. The form of the propagator depends on the experimental geometry: in the far-field, $\mathcal{D}_d$ is simply the Fourier transform $\mathcal{F}$; in the near-field, the propagator may assume different forms which depend on the sampling conditions [58,59], the most common being the Angular Spectrum method. Near-field ptychography usually requires fewer measurements at the cost of a structure-rich probe [60], or the use of multiple scanning planes for successful convergence [61]. Ptychography engines can be broadly divided into two categories: projection algorithms and cost–function optimization. We describe both in the following sections.
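The far-field forward model of Equations (2) and (3) can be sketched in a few lines of NumPy. This is a minimal illustration rather than ssc-cdi code: the function names `exit_wave` and `far_field_intensity`, the integer-pixel scan positions, the toy phase object and the flat probe are all assumptions made for the example.

```python
import numpy as np

def exit_wave(probe, obj, pos):
    """Multiplicative approximation psi_i = P(r) * O(r - r_i), cf. Equation (2).

    `pos` is the (row, column) of the top-left corner of the illuminated
    window, in whole pixels (an assumption of this sketch).
    """
    y, x = pos
    h, w = probe.shape
    return probe * obj[y:y + h, x:x + w]

def far_field_intensity(psi):
    """Far-field intensity |F{psi}|^2, cf. Equation (3) with D_d = F."""
    return np.abs(np.fft.fft2(psi, norm="ortho")) ** 2

rng = np.random.default_rng(0)
obj = np.exp(1j * rng.uniform(-0.5, 0.5, (64, 64)))  # toy pure-phase object
probe = np.ones((16, 16), dtype=complex)             # toy flat probe
positions = [(0, 0), (0, 8), (8, 0), (8, 8)]         # neighbouring windows overlap by half
intensities = np.stack(
    [far_field_intensity(exit_wave(probe, obj, p)) for p in positions]
)
```

With the orthonormal FFT convention used here, each simulated pattern carries the same total energy as its exit wave, which makes for a convenient sanity check on synthetic data.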

2.1. Projection Algorithms

Projection-based algorithms iteratively enforce the measured intensity to be the squared wavefront amplitude at the detector plane. Let $M$ be the set of all wavefronts containing the measured magnitudes $\sqrt{I_i}$. Since multiple wavefronts $\psi_i$ may satisfy the forward model (Equation (3)), projecting the wavefront estimates to $M$ brings them closer to the solution. It can be shown that a projection operation amounts to [62]
$$\Pi_M\{\psi\} = \sqrt{I_i}\,\frac{\psi}{|\psi|}. \tag{4}$$
Hence, the updated wavefront at the object exit plane is given by
$$\psi_i'(\mathbf{r}) = \mathcal{D}_d^{-1}\, \Pi_M\, \mathcal{D}_d\{\psi_i(\mathbf{r})\}. \tag{5}$$
The above wavefront update is the one used in the Alternating Projections (AP) and PIE engines. DM and RAAR propose alternative approaches with different convergence characteristics [9,22]:
$$\psi_i' = \psi_i + \Pi_M\{2\Pi_O\{\psi_i\} - \psi_i\} - \Pi_O\{\psi_i\}, \tag{6}$$
$$\psi_i' = \beta\left(\psi_i + \Pi_M\{2\Pi_O\{\psi_i\} - \psi_i\}\right) + (1 - 2\beta)\,\Pi_O\{\psi_i\}, \tag{7}$$
respectively, where $0 \le \beta \le 1$ is a relaxation parameter. $\Pi_O$ is the object consistency projector, which amounts to the multiplicative approximation of Equation (2). Note that the DM and RAAR update functions coincide for $\beta = 1$. The error of each algorithm iteration can be calculated as a normalized mean squared error between the measured intensities and the updated wavefront magnitude at the detector plane:
$$\epsilon = \frac{\sum_i \left( I_i - \left| \mathcal{D}_d\{\psi_i(\mathbf{r})\} \right|^2 \right)^2}{\sum_i I_i}. \tag{8}$$
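The wavefront updates and the error metric above can be written compactly for the far-field case. The sketch below is illustrative only and is not taken from ssc-cdi: the function names are hypothetical, the projectors are passed in as plain callables, and the small `eps` guarding the division is an assumption added for numerical safety.

```python
import numpy as np

def project_modulus(psi, intensity, eps=1e-12):
    """Far-field modulus projection: keep the phase, impose sqrt(I) as magnitude."""
    Psi = np.fft.fft2(psi, norm="ortho")
    Psi = np.sqrt(intensity) * Psi / (np.abs(Psi) + eps)  # eps avoids 0/0 (assumption)
    return np.fft.ifft2(Psi, norm="ortho")

def dm_update(psi, proj_M, proj_O):
    """Difference Map wavefront update, following Equation (6)."""
    po = proj_O(psi)
    return psi + proj_M(2 * po - psi) - po

def raar_update(psi, proj_M, proj_O, beta):
    """RAAR wavefront update, following Equation (7)."""
    po = proj_O(psi)
    return beta * (psi + proj_M(2 * po - psi)) + (1 - 2 * beta) * po

def error(intensities, psis):
    """Error metric following Equation (8), summed over all patterns and pixels."""
    model = np.abs(np.fft.fft2(psis, norm="ortho")) ** 2
    return np.sum((intensities - model) ** 2) / np.sum(intensities)
```

Calling `raar_update` with `beta=1` returns the same array as `dm_update`, which mirrors the remark that the two update functions coincide in that limit.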
After the wavefront update step, one uses the updated $\psi_i'$ to separate the wavefronts into an object and a probe component. The PIE approach is based on a gradient descent-like step [16] and requires a sequential update of $O(\mathbf{r})$ and $P(\mathbf{r})$ after each wavefront update:
$$O'(\mathbf{r} - \mathbf{r}_i) = O(\mathbf{r} - \mathbf{r}_i) + w_i^o(\mathbf{r})\, P^*(\mathbf{r}) \left( \psi_i'(\mathbf{r}) - \psi_i(\mathbf{r}) \right), \tag{9}$$
$$P'(\mathbf{r}) = P(\mathbf{r}) + w_i^p(\mathbf{r})\, O^*(\mathbf{r} - \mathbf{r}_i) \left( \psi_i'(\mathbf{r}) - \psi_i(\mathbf{r}) \right), \tag{10}$$
where $^*$ indicates the complex conjugate and the weights $w_i^o$ and $w_i^p$ are space-dependent factors that differ for each variation of the engine (PIE, ePIE and rPIE) [21]. The update for PIE algorithms is usually performed in random order for the indices $i$, making it stochastic in nature. ssc-cdi implements the mPIE algorithm, which consists of using the rPIE update functions
$$O'(\mathbf{r}) = O(\mathbf{r}) + s_o \frac{P^*(\mathbf{r}) \left( \psi'(\mathbf{r}) - \psi(\mathbf{r}) \right)}{(1 - r_o)\,|P(\mathbf{r})|^2 + r_o\,|P(\mathbf{r})|^2_{\max}}, \tag{11}$$
$$P'(\mathbf{r}) = P(\mathbf{r}) + s_p \frac{O^*(\mathbf{r} - \mathbf{r}_i) \left( \psi'(\mathbf{r}) - \psi(\mathbf{r}) \right)}{(1 - r_p)\,|O(\mathbf{r} - \mathbf{r}_i)|^2 + r_p\,|O(\mathbf{r} - \mathbf{r}_i)|^2_{\max}}, \tag{12}$$
followed by optional momentum acceleration, as proposed by [21]. The tunable constants $s_o$, $s_p$ and $r_o$, $r_p$ control the step size and the regularization of the update functions, respectively.
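A single rPIE step for one scan position might look as follows. This is a hedged NumPy sketch, not the package's CUDA kernel: `rpie_step` is a hypothetical name, the default values of the step and regularization constants are arbitrary choices for illustration, positions are whole pixels, and the momentum stage of mPIE is left out.

```python
import numpy as np

def rpie_step(obj, probe, psi, psi_new, pos,
              s_o=1.0, s_p=1.0, r_o=0.1, r_p=0.1):
    """One rPIE object/probe update for a single position (rPIE update functions).

    `obj` is modified in place inside the illuminated window; the updated
    probe is returned as a new array. Default constants are illustrative only.
    """
    y, x = pos
    h, w = probe.shape
    patch = obj[y:y + h, x:x + w].copy()  # O(r - r_i) before the update
    diff = psi_new - psi                  # psi' - psi
    den_o = (1 - r_o) * np.abs(probe) ** 2 + r_o * np.max(np.abs(probe)) ** 2
    den_p = (1 - r_p) * np.abs(patch) ** 2 + r_p * np.max(np.abs(patch)) ** 2
    obj[y:y + h, x:x + w] = patch + s_o * np.conj(probe) * diff / den_o
    probe_new = probe + s_p * np.conj(patch) * diff / den_p
    return obj, probe_new
```

Because each call reads the object and probe left behind by the previous call, the positions have to be visited one after another, which is the sequential dependency that makes this family harder to parallelize.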
The sequential nature of the PIE update function means it is not easily parallelizable. A more efficient approach proposed by [9] is used in the AP, DM and RAAR algorithms, which allows for updating all wavefronts in parallel, prior to the object and probe updates:
$$O(\mathbf{r}) = \frac{\sum_{i}^{N} P^*(\mathbf{r} - \mathbf{r}_i)\, \psi_i'(\mathbf{r})}{\sum_{i}^{N} |P(\mathbf{r} - \mathbf{r}_i)|^2}, \tag{13}$$
$$P(\mathbf{r}) = \frac{\sum_{i}^{N} O^*(\mathbf{r} + \mathbf{r}_i)\, \psi_i'(\mathbf{r} + \mathbf{r}_i)}{\sum_{i}^{N} |O(\mathbf{r} + \mathbf{r}_i)|^2}. \tag{14}$$
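The batch object and probe updates reduce to accumulating a numerator and a denominator over all scan positions and dividing once at the end. The NumPy sketch below illustrates this under stated assumptions: the names `update_object` and `update_probe` are hypothetical, scan positions are whole-pixel window corners, each wavefront is stored in the probe frame, and `eps` is added to avoid dividing by zero in regions that were never illuminated.

```python
import numpy as np

def update_object(obj_shape, probe, psis, positions, eps=1e-12):
    """Object update in the spirit of Equation (13)."""
    num = np.zeros(obj_shape, dtype=complex)
    den = np.zeros(obj_shape, dtype=float)
    h, w = probe.shape
    for psi, (y, x) in zip(psis, positions):
        num[y:y + h, x:x + w] += np.conj(probe) * psi  # numerator accumulation
        den[y:y + h, x:x + w] += np.abs(probe) ** 2    # denominator accumulation
    return num / (den + eps)

def update_probe(obj, psis, positions, probe_shape, eps=1e-12):
    """Probe update in the spirit of Equation (14)."""
    num = np.zeros(probe_shape, dtype=complex)
    den = np.zeros(probe_shape, dtype=float)
    h, w = probe_shape
    for psi, (y, x) in zip(psis, positions):
        patch = obj[y:y + h, x:x + w]
        num += np.conj(patch) * psi
        den += np.abs(patch) ** 2
    return num / (den + eps)
```

Since the two accumulators are plain sums over positions, partial sums computed on different batches (or devices) can simply be added together before the final division.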

2.2. Cost–Function Optimization

Solving ptychography directly from the minimization of a function was proposed in the early days of ptychography with a general cost–function that minimizes the difference between observed and expected intensities [12]:
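$$C = \sum_{i=1}^{N} \sum_{u,v} M_i(u,v) \left( \left| \mathcal{D}_d\{\psi_i(\mathbf{r})\} \right|^{2\gamma} - I_i(u,v)^{\gamma} \right)^2, \tag{15}$$
where $M_i$ consists of a binary mask to ignore dubious pixels or missing data. When the exponent $\gamma$ is set to $1/2$ or $1$, the cost function approximates the maximum likelihood estimators for Poisson and Gaussian counting statistics, respectively, apart from additive and multiplicative constants [24,27]:
$$L_{\mathrm{Poisson}} = \sum_{i}^{N} \sum_{u,v} M_i(u,v) \left( \sqrt{\left| \mathcal{D}_d\{\psi_i(\mathbf{r})\} \right|^2} - \sqrt{I_i} \right)^2, \tag{16}$$
$$L_{\mathrm{Gaussian}} = \sum_{i}^{N} \sum_{u,v} M_i(u,v) \left( \left| \mathcal{D}_d\{\psi_i(\mathbf{r})\} \right|^2 - I_i \right)^2. \tag{17}$$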
The incorporation of such noise models contributes to the refinement of reconstructions. Different optimization methods have been proposed for recovery of object and probe in these cases [24,26]. Notwithstanding their success, such methods suffer from a high computational cost compared to projection-based algorithms. In recent years, cost–function minimization has resurged as a promising tool in the community [27,38,39,63] with the advancement in computing power of GPUs and the establishment of libraries such as Theano [64], PyTorch [65] and TensorFlow [66] that allow easy computation of derivatives via Automatic Differentiation and offer an accessible interface to GPUs.
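The general cost function and its two special cases can be evaluated directly for far-field data. The snippet below is a minimal NumPy sketch under assumptions: `cost` is a hypothetical name, the propagator is taken to be an orthonormal FFT, and a single mask is shared by all patterns.

```python
import numpy as np

def cost(psis, intensities, mask, gamma):
    """Masked cost with exponent gamma, far-field case.

    gamma = 0.5 gives the amplitude-based (Poisson-like) form and
    gamma = 1.0 the intensity-based (Gaussian-like) form.
    """
    amp = np.abs(np.fft.fft2(psis, norm="ortho"))
    return np.sum(mask * (amp ** (2 * gamma) - intensities ** gamma) ** 2)
```

Setting mask entries to zero removes the corresponding detector pixels from the sum, which is how dead or saturated pixels would be excluded in this formulation.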

2.3. HPC Strategy

Ptychographic reconstruction poses a significant computational challenge due to the iterative nature of the algorithms and especially to the large datasets required for achieving high resolution and reconstructing large samples. The use of parallel strategies, especially through GPU acceleration, is already commonplace for many of the available packages. For instance, PtyPy [31], SHARP [32], PyNX [33] and PtyGer [35] provide GPU-accelerated engines. However, many implementations incur a significant memory footprint in order to achieve such a speedup, and despite the significant increase in available VRAM of commercial GPUs in recent years, processing large datasets remains challenging.
In this context, ssc-cdi primarily focuses on developing fast yet memory-efficient GPU software for ptychography reconstruction, enabling rapid processing without excessive computational overhead. The approach of loading batches from CPU on demand (lazy loading) has the potential to lower the cost of entry-level hardware, making the technique more accessible. Additionally, we prioritize simplicity in code structure and processing pipelines. Our approach involves processing all scanning points of a large field-of-view dataset and synchronizing the data within VRAM, rather than performing parallel ptychography reconstruction of subsets of the whole dataset and subsequently stitching the results, as adopted by PyNX. We avoid the stitching approach for a few reasons. It requires precise knowledge of the scan coordinates, removal of the phase ramp (see Appendix A), and only then the combination of the resulting images. From our experience, phase-ramp corrections can be challenging to obtain for samples that span the entire scan field-of-view or that present low contrast. Overcoming these problems may require manually processing data later on, demanding valuable time from users before they obtain satisfactory results. Even though our approach results in a decrease in processing speed, we demonstrate in Section 3 that the on-demand transfer from RAM to VRAM justifies the trade-off.
To achieve speed with memory efficiency, C++/CUDA is the language of choice for implementing our engines, as it allows for tight control of all memory and low-level access to GPU. Furthermore, using CUDA in association with high-level Python wrappers can significantly mitigate the development cost while maintaining the achieved benefits.
Most of the data involved in the reconstruction process can be efficiently managed within VRAM. For instance, both object O and probe P are manageable in size even with large-area detectors when using 32-bit float precision for the real and imaginary parts (typically dozens of MBs). However, one significant exception is the list of measurements I, which can easily reach hundreds of gigabytes for experiments with high-resolution or large field-of-view. To address this, our engines use smaller, manageable batches of measurements with size B, which are transferred to the GPU on demand. We note that performance decreases from this approach can be partly mitigated by employing contiguous page-locked memory (pinned memory).
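To make the memory argument concrete, the following pure-Python sketch estimates the size of a measurement stack and enumerates the batches that would be streamed to the GPU. It is only an illustration: `batch_slices` and `stack_bytes` are hypothetical helpers, and storing intensities as 4-byte floats is an assumption of the example.

```python
def batch_slices(n_measurements, batch_size):
    """Yield (start, stop) index pairs covering all measurements in order."""
    for start in range(0, n_measurements, batch_size):
        yield start, min(start + batch_size, n_measurements)

def stack_bytes(n_measurements, n_pixels, bytes_per_value=4):
    """Bytes needed for n_measurements square patterns of n_pixels x n_pixels."""
    return n_measurements * n_pixels * n_pixels * bytes_per_value

full = stack_bytes(400, 3072)       # 15_099_494_400 bytes, about 15.1 GB
one_batch = stack_bytes(128, 3072)  # 4_831_838_208 bytes, about 4.8 GB

for start, stop in batch_slices(400, 128):
    # In a real engine, measurements[start:stop] would be copied to the GPU here.
    pass
```

Even in this modest case, the full stack plus the associated complex wavefronts would take up a large share of a 40 GB device, whereas a single batch leaves ample room for the object, the probe and FFT workspace.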
The projection $\Pi_M$ is a critical component in the implementation of all algorithms within the package. To compute the Fourier transforms associated with it, we rely on NVIDIA’s cuFFT [67], an optimized CUDA library that implements the Fast Fourier Transform (FFT) for several use cases in GPU-accelerated systems. It provides a ready-to-use, fast and relatively memory-efficient implementation, and accounts for roughly 50% of the GPU kernel time. For the majority of our custom CUDA kernels, straightforward implementations were sufficient to guarantee good performance, with no apparent bottlenecks observed. Two exceptions, nonetheless, required a more elaborate approach.
The first exception is computing the error $\epsilon$ (Equation (8)), which involves aggregating the element-wise difference between measured and computed diffraction patterns. This requires atomic summation of the squared residuals, which we address with a parallel reduction strategy in shared memory, using a block size of 64 for efficiency.
Second, for those algorithms that implement the update step (Equations (13) and (14)) after computation over all diffraction patterns (AP, DM and RAAR), the projector $\Pi_M$ can be concurrently applied to each wavefront $\psi_i$, and the updated $\psi_i$ batch can then be used for computing the new object and probe matrices. We further accelerate this update [55] by separately accumulating the numerator and the denominator in Equations (13) and (14), which improves performance.

Multi-GPU Implementation

In the multi-GPU scenario, measurements are partitioned into distinct batches, which are then distributed across all available GPUs within a single node (Figure 1). Each GPU processes a unique batch independently. If the number of batches for a given batch size is larger than the number of available GPUs, the remaining batches wait in RAM and are only transferred to VRAM when a GPU becomes available. This approach simplifies the efficient parallelization of AP and RAAR by allowing each GPU to handle a subset of the data.
Synchronization is primarily required during the probe and object update step. In this case, the batch sums of the numerator and denominator are computed on each GPU and later aggregated on the first GPU using a binary reduction tree. This approach combines partial sums recursively, with a number of steps that grows logarithmically with the number of GPUs. After aggregation and subsequent application of Equations (13) and (14), the updated probe and object are broadcast to all GPUs to ensure that each has access to the new matrices, maintaining consistency for subsequent iterations.
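The aggregation pattern can be mimicked on the host with a list standing in for the per-GPU partial sums. This is a schematic sketch only: `tree_reduce` is a hypothetical name, no GPU communication is performed, and the pairing order is one common choice rather than necessarily the one used in the package.

```python
def tree_reduce(partials):
    """Pairwise (binary-tree) sum of per-device partial results.

    Returns the total, which ends up in slot 0 (the "first GPU"),
    together with the number of tree levels that were needed.
    """
    vals = list(partials)
    stride, levels = 1, 0
    while stride < len(vals):
        for i in range(0, len(vals) - stride, 2 * stride):
            vals[i] = vals[i] + vals[i + stride]  # slot i absorbs slot i + stride
        stride *= 2
        levels += 1
    return vals[0], levels

total, levels = tree_reduce([1, 2, 3, 4, 5, 6, 7, 8])  # eight partial sums
```

For eight partial results the sum is complete after three levels, compared with seven sequential additions if the first device collected everything one by one. The same function works unchanged if the list holds NumPy arrays for the numerator or denominator.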
The adopted strategy also means that our current implementation of stochastic projections (rPIE and mPIE) is restricted to a single GPU with one measurement per batch, due to the intrinsically sequential nature of these algorithms.

3. Results

We first validate ssc-cdi by presenting the reconstruction of a real dataset acquired at the CARNAÚBA beamline [68]. A Siemens Star (manufacturer Applied Nanotools Inc., Edmonton, AB, Canada) was measured at 12 keV with the detector placed 1.1 m from the sample. The diffraction patterns were acquired in fly-scan mode, with a total of 10,100 measurements. The detector pixel size is 55 × 55 μm² and an area of 256 × 256 pixels was used for reconstruction, resulting in an effective pixel of 8 nm. Figure 2 shows the reconstructed magnitude and phase of the sample, as well as the probe. The fly-scan acquisition required using at least five probe modes for a satisfactory reconstruction due to the integrating nature of the intensity measurements, which can be equivalently interpreted as an incoherence effect [69]. The object was initialized with random amplitude and constant phase, whereas the first probe mode was set to the inverse Fourier Transform of the averaged diffraction data. We used 100 iterations of rPIE followed by 300 iterations of AP with a loose circular probe support of 1.5 μm diameter.
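The quoted effective pixel size can be cross-checked with the usual far-field sampling relation $\Delta x = \lambda z / (N p)$, where $z$ is the sample-to-detector distance, $N$ the number of detector pixels used along one side and $p$ the detector pixel size. This relation is standard for far-field ptychography but is not spelled out in the text above, so the sketch below should be read as a plausibility check; the function name is hypothetical.

```python
H_C_KEV_NM = 1.23984198  # Planck constant times speed of light, in keV * nm

def object_pixel_size_nm(energy_kev, distance_m, n_pixels, detector_pixel_um):
    """Reconstructed pixel size from the far-field relation dx = lambda * z / (N * p)."""
    wavelength_nm = H_C_KEV_NM / energy_kev
    detector_width_m = n_pixels * detector_pixel_um * 1e-6
    return wavelength_nm * distance_m / detector_width_m

pixel_nm = object_pixel_size_nm(12.0, 1.1, 256, 55.0)
```

For the parameters of this experiment the result is slightly above 8 nm, in line with the value reported above.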

Benchmarks

To evaluate the performance and effectiveness of ssc-cdi, we conducted a series of benchmarks against well-established packages in the community, namely PtyPy 0.8.0 [31] and PyNX 2023.1.1 [33]. These benchmarks provide a comprehensive understanding of the strengths and limitations of our approach relative to current state-of-the-art methods. All benchmarks were conducted on the LNLS cluster [52], utilizing an NVIDIA DGX A100 system [70] with an AMD EPYC 7742 64-Core Processor, 1 TB of RAM and eight A100 GPUs, each featuring 40 GB of VRAM and connected through NVLink (300 GB/s bandwidth) and PCIe Gen4 (32 GB/s bandwidth). The system employs the NVIDIA driver version 460.27.04, with CUDA 11.2 on an Ubuntu 20.04 operating system. We detail the parameters used in the algorithm calls for each package in Supplementary File S1.
The initial goal was to ensure that all packages could successfully reconstruct data from a numerical experiment. Subsequently, we compared the speed and scalability of various ptychographic engines across the different packages using the same dataset and reconstruction parameters. The selected dataset for object amplitude and phase was the Camera Man and Gravel test images from [71], with a circular probe of constant phase, as shown in the top row of Figure 3. The object consisted of 400 × 400 pixels and probe of 180 × 180 pixels, with a raster grid scan recorded at 400 positions, with normally distributed offsets added to the grid points to avoid the raster grid pathology [9]. For evaluating convergence, we ran 200 iterations of the DM algorithm with each package using the same initial guesses. DM was the algorithm of choice as it is available across all packages, providing a consistent basis for comparison. Figure 3 shows the reconstructed object and probe by the different packages. Convergence was successful in all cases. A loose support for the probe was used, causing the reconstructions to be shifted by different amounts in each case. The position of the probe magnitude center of mass was used to correct for this shift. One can notice a slight difference in the phase ramp for the phase reconstruction in each case, which could arguably be avoided by a more meticulous tuning of engine parameters and initial guesses, or combining different algorithms for refinement.
For the speed and scalability benchmarks, we used the same dataset. In this case, the object and probe matrices were binned or interpolated as necessary to reach the desired matrix dimensions. The reported times do not include the time required for loading data from the disk.
Figure 4 shows the benchmark results as a function of measurement size $N$, where the dataset has dimensions $(400, N, N)$; here $N$ denotes the side length of each diffraction pattern in pixels, not the number of measurements as in Section 2. We compared both the DM and PIE family algorithms. All engines ran 200 iterations from $N = 128$ to $N = 3072$ (the largest area detector dimension at LNLS [48]) in steps of 128. The missing data points for some of the curves indicate that the engine could not cope with the desired dimensions due to lack of memory. All packages expose a variable to control the batch size $B$ with a similar purpose, which we set to $B = 128$, apart from ssc-cdi’s rPIE engine, which is hard-coded for $B = 1$. In terms of speed, ssc-cdi lies in between the performance of PyNX and PtyPy. For the PIE algorithms, both engines ran successfully for all data sizes. This is expected, as the sequential nature of PIE implies a small memory footprint and only a single wavefront array needs to be stored in memory at a time. For DM, our approach performs faster only for small dimensions, lying in between PtyPy and PyNX for dimensions above $768^2$. We note that the performance of our DM engine eventually converges to the performance of rPIE as the cost associated with data transfer greatly increases with the data size, causing data transfer and Fourier transforms (operations common to both engines) to dominate the overall running time. We show an extra curve for ssc-cdi using DM with batch size $B = 1$, which gets closer to the rPIE curve because a larger number of transfers happens in this case. The clear strength of ssc-cdi comes from the batching strategy from RAM to VRAM, which is made evident here. Among the DM engines, ssc-cdi’s was the only one capable of reconstructing large datasets, despite the algorithm’s substantial memory footprint.
We also analyzed multi-GPU performance of the DM engines. A few differences must be noted here. ssc-cdi provides multi-GPU support directly through CUDA, whereas PtyPy offers multi-GPU capability via OpenMPI. We excluded PyNX from the comparison because its multi-GPU approach utilizes the previously mentioned stitching strategy, which is not equivalent to PtyPy and ssc-cdi. We ran 200 iterations of DM using the same data dimensions and input parameters, repeating it for 2, 4 and 8 GPUs. Figure 5a shows results for a batch size B = 128 . In terms of speed, ssc-cdi is able to reconstruct faster than its PtyPy counterpart for almost all data points, especially for large N. In addition, regarding memory management and scalability, ssc-cdi was able to handle all tested data sizes, due to the strategy of lazy loading batches into GPU, which is essential for processing the data volumes currently generated by LNLS beamlines and is further supported by the increasing bandwidth of NVIDIA’s latest architectures.
Since the batch size B is crucial for determining how the engine manages large data, we ran a second benchmark for B = 16 (Figure 5b). We can see that the scalability of PtyPy indeed improves at the cost of higher execution time. ssc-cdi exhibits equivalent performance when compared to the larger batch case.
Additionally, we observe diminishing performance gains as more GPUs are used, with 4 GPUs providing a reasonable compromise. For B = 128, using 8 GPUs presents no benefit, as expected, since the 400 scan points are divided into only four batches of sizes 128, 128, 128 and 16. Nonetheless, the benefit becomes visible again for B = 16, where the scan points are divided into 400/16 = 25 batches. We note that the point-to-point performance fluctuations in some cases may be partly attributed to the machines not being used exclusively for our benchmarks during the tests, as machine RAM could be shared with other cluster users.
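The batch arithmetic above can be checked with a small scheduling sketch. It splits the 400 scan points into batches of at most B and assigns each batch to whichever GPU becomes free first, in the spirit of Figure 1. The cost model (time proportional to batch size, no transfer or object/probe update cost) and the function names are assumptions for illustration only, so the numbers indicate a trend rather than measured timings.

```python
import heapq


def batch_sizes(n_points: int, batch: int) -> list[int]:
    """Split n_points scan positions into consecutive batches of at most `batch`."""
    return [min(batch, n_points - start) for start in range(0, n_points, batch)]


def makespan(n_points: int, batch: int, n_gpus: int) -> int:
    """Greedy schedule: each batch goes to the GPU that becomes free first.

    Assumed cost model: processing time equals the batch size (arbitrary units).
    Returns the time at which the busiest GPU finishes one iteration.
    """
    free_at = [0] * n_gpus  # a list of equal values is already a valid heap
    for size in batch_sizes(n_points, batch):
        start = heapq.heappop(free_at)          # earliest available GPU
        heapq.heappush(free_at, start + size)   # busy until start + size
    return max(free_at)


if __name__ == "__main__":
    for batch in (128, 16):
        print(f"B = {batch}: {len(batch_sizes(400, batch))} batches")
        for n_gpus in (2, 4, 8):
            print(f"  {n_gpus} GPUs -> {makespan(400, batch, n_gpus)} units")
```

In this idealized model, B = 128 takes 256, 128 and 128 units on 2, 4 and 8 GPUs, whereas B = 16 takes 208, 112 and 64 units, which mirrors the qualitative behavior described above: with only four batches, GPUs beyond the fourth stay idle.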
As a final experiment, we evaluated the performance of the rPIE and DM algorithms across all previously discussed data sizes on a simpler setup, outside the HPC environment. This machine was equipped with an Intel Xeon E5-2630 v3 CPU, 64 GB of RAM, and a single NVIDIA Quadro M4000 GPU with 8 GB of VRAM. The system employs the NVIDIA driver version 460.27.04, with CUDA 12.4 in an Ubuntu 22.04 operating system. Execution times shown in Figure 6 demonstrate that, despite the modest machine specifications compared to HPC environments, ssc-cdi was still able to run rPIE for all data sizes and RAAR up to N = 2048 in less than an hour. This finding reinforces the well-balanced trade-offs of our approach, demonstrating the potential use of advanced ptychography algorithms on more accessible hardware and advocating its relevance to the community.

4. Discussion

The development of ssc-cdi has been an ongoing effort since Sirius began operations in 2019, with contributions from physicists, mathematicians, and computer scientists. From the outset, the ability to reconstruct the large datasets generated at Sirius has been a primary objective guiding its development. We relied on the principle that being able to handle a dataset, albeit at a slower pace, is preferable to being unable to process it at all. The HPC strategy and benchmarks presented in this work demonstrate that ssc-cdi provides a memory-efficient, multi-GPU approach to ptychographic reconstruction, effectively handling measurements from detectors with dimensions up to 3072 × 3072 pixels without significantly compromising performance. The on-demand memory-loading approach also benefits from the rapid improvement in CPU–GPU bandwidth currently under way, and it favors the use of ssc-cdi even outside of HPC environments, such as on more modest single-node laboratory machines with limited VRAM.
ssc-cdi is publicly available at https://doi.org/10.5281/zenodo.13693178 (accessed on 27 October 2024). It is important to acknowledge that the current release still has limitations compared to some of the established packages within the X-ray imaging community. For example, the package currently does not include a CUDA-based maximum likelihood engine, fast algorithms for position correction [14,15], orthogonal probe relaxation [18], multi-slice approaches [72,73,74] or modern algorithms such as ADMM [29,75] and WASP [30]. These are functionalities left for future releases of ssc-cdi.
Finally, it should be noted that the development of C++/CUDA software was essential to achieving the demonstrated performance. Although the time invested in writing C++/CUDA software is substantial, the effort is justified. Our results show that ssc-cdi enables memory-efficient reconstructions, achieving performance that is comparable to, or even exceeds, that of well-established tools in the community. Moreover, the development of our own approach has allowed us to create custom solutions tailored to the specific needs of the beamlines at LNLS, with the flexibility to adapt as these beamlines have transitioned from a commissioning phase to operation over the past few years. As implied by its name, ssc-cdi is designed to be more than just a ptychography toolbox. Future releases will include the implementation of multi-GPU distribution for three-dimensional PWCDI reconstructions [76], which will be crucial for CDI reconstruction of large volumes, as well as the addition of new engines and improvements for ptychography.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/jimaging10110286/s1, Supplementary File S1: Benchmark details.

Author Contributions

Conceptualization, Y.R.T., A.Z.P., P.F. and E.X.M.; methodology, Y.R.T., A.Z.P.; software, Y.R.T., A.Z.P., P.F. and M.L.B.-J.; validation, Y.R.T., A.Z.P., P.F. and M.L.B.-J.; formal analysis, Y.R.T., M.L.B.-J.; investigation, Y.R.T., A.Z.P., P.F. and M.L.B.-J.; resources, E.X.M.; data curation, Y.R.T., P.F. and M.L.B.-J.; writing—original draft preparation, Y.R.T., A.Z.P. and M.L.B.-J.; writing—review and editing, Y.R.T., A.Z.P., P.F., M.L.B.-J. and E.X.M.; visualization, Y.R.T., A.Z.P. and M.L.B.-J.; supervision, E.X.M.; project administration, Y.R.T. and E.X.M.; funding acquisition, E.X.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Brazilian Ministry of Science, Technology, and Innovation (MCTI) through the Brazilian Center for Research in Energy and Materials (CNPEM).

Data Availability Statement

The ssc-cdi package is publicly available in Zenodo (https://doi.org/10.5281/zenodo.13693178, accessed on 27 October 2024) or upon request from one of the corresponding authors.

Acknowledgments

We would like to express our gratitude to Julia Carvalho and Camila Lages for their valuable contributions to the processing pipelines for the CATERETÊ, CARNAÚBA, and EMA beamlines. A special thanks to Giovani Baraldi for initiating the computational work on ptychography at LNLS and for developing the foundation of what would eventually become the ssc-cdi package. We also thank the staff of the CATERETÊ and CARNAÚBA beamlines for their ongoing feedback and collaboration over the years, especially Florian Meneau, Carla Polo, Tiago Kalile, Helio Tolentino, and Francisco da Silva. We are also grateful to Ana Diaz and Harry Westfahl for the insightful discussions and feedback on ptychography. Finally, we appreciate the thorough reviews of this manuscript provided by Izabela Zamboti and João Oliveira, and the support of Daniel Tavares during the release process of the package.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

CDI: Coherent Diffractive Imaging
PWCDI: Plane-wave Coherent Diffractive Imaging
PIE: Ptychographic Iterative Engine
AP: Alternating Projections
DM: Difference Map
RAAR: Relaxed Averaged Alternating Reflection
ML: Maximum Likelihood
GPU: Graphics Processing Unit
HPC: High Performance Computing
SSC: Sirius Scientific Computing
RAM: Random Access Memory
VRAM: Video Random Access Memory
FFT: Fast Fourier Transform
BP: Backprojection
FBP: Filtered Backprojection

Appendix A

In ptycho-tomography experiments, one rotates the sample around the vertical axis, performing a ptychographic scan between consecutive rotations. After all 2D images (referred to here as sinogram projections) are obtained, one may use them in combination with tomography algorithms to reconstruct a 3D volume of the sample. Nonetheless, careful treatment of the data is often required prior to using a tomography algorithm [77], especially when working with phase-contrast images. ssc-cdi offers useful Python functions for users to process data prior to tomography.
First, the extracted phase signal is periodic with period 2π. For strongly scattering samples, this can easily cause phase wrapping in the images. ssc-cdi accelerates the phase-unwrapping algorithm [78] by parallelizing the process across multiple CPU cores for each projection, which reduces the unwrapping time severalfold.
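The wrapping effect itself can be illustrated with a deliberately simple one-dimensional sketch. It uses an Itoh-style scheme (integrating wrapped phase differences), not the two-dimensional reliability-sorting algorithm of [78], and the function names are hypothetical rather than part of ssc-cdi.

```python
import math


def wrap(phase: float) -> float:
    """Wrap a phase value into the interval [-pi, pi)."""
    return (phase + math.pi) % (2 * math.pi) - math.pi


def unwrap_1d(wrapped: list[float]) -> list[float]:
    """Itoh-style 1D unwrapping: integrate the wrapped phase differences.

    Valid only while the true phase changes by less than pi between samples.
    """
    if not wrapped:
        return []
    out = [wrapped[0]]
    for prev, cur in zip(wrapped, wrapped[1:]):
        out.append(out[-1] + wrap(cur - prev))
    return out


if __name__ == "__main__":
    true_phase = [0.5 * i for i in range(30)]       # exceeds 2*pi several times
    measured = [wrap(p) for p in true_phase]        # what a reconstruction yields
    recovered = unwrap_1d(measured)
    print(max(abs(a - b) for a, b in zip(true_phase, recovered)))
```

The sketch recovers the original ramp because neighbouring samples differ by less than π; noisy or undersampled projections violate this condition, which is one reason more robust two-dimensional algorithms are used in practice.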
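One important aspect of ptychography is the presence of ambiguities in its solution. Given the multiplicative approximation, we have the equivalence
$$\psi(\mathbf{r}) = P(\mathbf{r})\, O(\mathbf{r} - \mathbf{r}_i) = A\, e^{i \mathbf{R} \cdot \mathbf{r}}\, P(\mathbf{r} + \mathbf{t})\; A^{-1} e^{-i \mathbf{R} \cdot \mathbf{r}}\, O(\mathbf{r} - \mathbf{r}_i + \mathbf{t}), \tag{A1}$$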
that is, the retrieved object and probe may present an amplitude offset A and a first-order phase ramp exp(iR·r). Additionally, the shift-invariance property of the Fourier transform, combined with the fact that only the amplitude of the signal is measured, makes the reconstruction invariant to a translation t in real space. Consequently, one needs to equalize the projections by removing the phase ramp from each of them [77] prior to tomography. ssc-cdi speeds up this method through an implementation across multiple CPU cores. We observe that successful removal of phase ramps usually requires a linear fit of the background both to the left and to the right of the sample. For that, a binary mask is used to select multiple regions of air around the sample. Samples with a single interface to air in the measured field of view usually prove difficult to equalize.
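The masked linear fit described above can be sketched for a single image row as follows. This is a minimal one-dimensional illustration assuming an already unwrapped phase; the helper names are hypothetical, and an actual implementation would fit a two-dimensional plane over the masked air regions.

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit y = slope * x + offset."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def remove_ramp(row: list[float], air_mask: list[bool]) -> list[float]:
    """Subtract the linear background fitted on the air pixels of one row."""
    xs = [float(i) for i, is_air in enumerate(air_mask) if is_air]
    ys = [v for v, is_air in zip(row, air_mask) if is_air]
    slope, offset = fit_line(xs, ys)
    return [v - (slope * i + offset) for i, v in enumerate(row)]


if __name__ == "__main__":
    # Synthetic row: sample phase of -1 rad in pixels 8..11 on top of a ramp.
    row = [0.05 * i + 0.3 + (-1.0 if 8 <= i < 12 else 0.0) for i in range(20)]
    air = [i < 5 or i >= 15 for i in range(20)]  # air on both sides of the sample
    print([round(v, 3) for v in remove_ramp(row, air)])
```

With air pixels selected on both sides, the fit interpolates across the sample. If air is available on one side only, the same fit has to be extrapolated across the whole sample, which is consistent with the difficulty noted above for samples with a single interface to air.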
A potential workaround to significantly reduce processing time is to skip the equalization process and perform tomography directly from the wrapped phase [77]. This method requires differentiation of the phase signal, which may be problematic for low-contrast samples with a poor signal-to-noise ratio, since noise is amplified by the differentiation. ssc-cdi also provides a function for preparing the phase-wrapped sinogram using the Hilbert transform. Filtered Backprojection (FBP) allows the recovery of the 3D refractive index δ(z, y, x) from a set of phase projections ϕ [79]:
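$$\frac{2\pi}{\lambda}\, \delta(z, y, x) = \int_0^{\pi} \mathcal{F}^{-1}\left\{ |u|\, \mathcal{F}\{\phi\} \right\}(\theta, y, x\cos\theta + z\sin\theta)\, \mathrm{d}\theta, \quad \forall\, y. \tag{A2}$$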
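Letting $\mathcal{H}$ be the Hilbert transform and sgn the sign function, we can use the following properties (in the second one, u denotes a generic function and w its frequency variable)
$$\mathcal{F}\left\{ \frac{\partial \phi}{\partial x} \right\} = i\, 2\pi u\, \mathcal{F}\{\phi\}, \tag{A3}$$
$$\mathcal{F}\{\mathcal{H}(u)\} = -i\, \mathrm{sgn}(w)\, \mathcal{F}\{u\}, \tag{A4}$$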
to further simplify Equation (A2) and show that δ ( z , y , x ) is proportional to the backprojection (BP) of the Hilbert transform of the phase gradient:
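$$\delta(z, y, x) = \frac{\lambda}{4\pi^2} \int_0^{\pi} \mathcal{H}\left( \frac{\partial \phi}{\partial x} \right)(\theta, y, x\cos\theta + z\sin\theta)\, \mathrm{d}\theta, \quad \forall\, y. \tag{A5}$$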
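For completeness, the intermediate step can be sketched in one line. It assumes the common Hilbert-transform convention with Fourier multiplier −i sgn, and that u is the frequency variable conjugate to the detector coordinate in the FBP filter; both are our reading of the notation rather than statements made explicitly above.

```latex
\mathcal{F}\left\{ \mathcal{H}\left( \frac{\partial \phi}{\partial x} \right) \right\}
  = \bigl(-i\,\mathrm{sgn}(u)\bigr)\,\bigl(i\,2\pi u\bigr)\,\mathcal{F}\{\phi\}
  = 2\pi\,|u|\,\mathcal{F}\{\phi\}
\quad\Longrightarrow\quad
\mathcal{F}^{-1}\bigl\{ |u|\,\mathcal{F}\{\phi\} \bigr\}
  = \frac{1}{2\pi}\,\mathcal{H}\left( \frac{\partial \phi}{\partial x} \right)
```

Substituting this identity into the FBP integral and multiplying both sides by λ/2π yields the prefactor λ/4π² of the backprojection formula.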
Finally, the translation ambiguities, together with instabilities of the mechanical stages during rotation of the sample, make alignment of the sinogram projections necessary. We do not discuss this process here, as it has been thoroughly presented elsewhere [77,80]. In our pipelines, alignment functions are offered as part of the ssc-raft package, publicly available in [49].

References

  1. Veen, F.v.; Pfeiffer, F. Coherent X-Ray scattering. J. Phys. Condens. Matter 2004, 16, 5003–5030.
  2. Miao, J.; Ishikawa, T.; Robinson, I.K.; Murnane, M.M. Beyond crystallography: Diffractive imaging using coherent X-Ray light sources. Science 2015, 348, 530–535.
  3. Jacobsen, C. X-Ray Microscopy, 1st ed.; Cambridge University Press: Cambridge, UK, 2019.
  4. Als-Nielsen, J.; McMorrow, D. Elements of Modern X-Ray Physics; John Wiley & Sons, Ltd.: Hoboken, NJ, USA, 2011.
  5. Guizar-Sicairos, M.; Thibault, P. Ptychography: A solution to the phase problem. Phys. Today 2021, 74, 42–48.
  6. Jiang, Y.; Chen, Z.; Han, Y.; Deb, P.; Gao, H.; Xie, S.; Purohit, P.; Tate, M.W.; Park, J.; Gruner, S.M.; et al. Electron ptychography of 2D materials to deep sub-ångström resolution. Nature 2018, 559, 343–349.
  7. Holler, M.; Odstrcil, M.; Guizar-Sicairos, M.; Lebugle, M.; Müller, E.; Finizio, S.; Tinti, G.; David, C.; Zusman, J.; Unglaub, W.; et al. Three-dimensional imaging of integrated circuits with macro- to nanoscale zoom. Nat. Electron. 2019, 2, 464–470.
  8. Aidukas, T.; Phillips, N.W.; Diaz, A.; Poghosyan, E.; Müller, E.; Levi, A.F.J.; Aeppli, G.; Guizar-Sicairos, M.; Holler, M. High-performance 4-nm-resolution X-Ray tomography using burst ptychography. Nature 2024, 632, 81–88.
  9. Thibault, P.; Dierolf, M.; Bunk, O.; Menzel, A.; Pfeiffer, F. Probe retrieval in ptychographic coherent diffractive imaging. Ultramicroscopy 2009, 109, 338–343.
  10. Miao, J.; Sayre, D.; Chapman, H.N. Phase retrieval from the magnitude of the Fourier transforms of nonperiodic objects. J. Opt. Soc. Am. A 1998, 15, 1662.
  11. Kahnt, M.; Klementiev, K.; Haghighat, V.; Weninger, C.; Plivelic, T.S.; Terry, A.E.; Björling, A. Measurement of the coherent beam properties at the CoSAXS beamline. J. Synchrotron Radiat. 2021, 28, 1948–1953.
  12. Guizar-Sicairos, M.; Fienup, J.R. Phase retrieval with transverse translation diversity: A nonlinear optimization approach. Opt. Express 2008, 16, 7264.
  13. Maiden, A.; Humphry, M.; Sarahan, M.; Kraus, B.; Rodenburg, J. An annealing algorithm to correct positioning errors in ptychography. Ultramicroscopy 2012, 120, 64–72.
  14. Zhang, F.; Peterson, I.; Vila-Comamala, J.; Diaz, A.; Berenguer, F.; Bean, R.; Chen, B.; Menzel, A.; Robinson, I.K.; Rodenburg, J.M. Translation position determination in ptychographic coherent diffraction imaging. Opt. Express 2013, 21, 13592.
  15. Dwivedi, P.; Konijnenberg, A.; Pereira, S.; Urbach, H. Lateral position correction in ptychography using the gradient of intensity patterns. Ultramicroscopy 2018, 192, 29–36.
  16. Thibault, P.; Menzel, A. Reconstructing state mixtures from diffraction measurements. Nature 2013, 494, 68–71.
  17. Li, P.; Edo, T.; Batey, D.; Rodenburg, J.; Maiden, A. Breaking ambiguities in mixed state ptychography. Opt. Express 2016, 24, 9038.
  18. Odstrcil, M.; Baksh, P.; Boden, S.A.; Card, R.; Chad, J.E.; Frey, J.G.; Brocklesby, W.S. Ptychographic coherent diffractive imaging with orthogonal probe relaxation. Opt. Express 2016, 24, 8360.
  19. Rodenburg, J.M.; Faulkner, H.M.L. A phase retrieval algorithm for shifting illumination. Appl. Phys. Lett. 2004, 85, 4795–4797.
  20. Maiden, A.M.; Rodenburg, J.M. An improved ptychographical phase retrieval algorithm for diffractive imaging. Ultramicroscopy 2009, 109, 1256–1262.
  21. Maiden, A.; Johnson, J.; Li, P. Further improvements to the ptychographical iterative engine. Optica 2017, 4, 736–745.
  22. Luke, D. Relaxed averaged alternating reflections for diffraction imaging. Inverse Probl. 2004, 21, 37–50.
  23. Marchesini, S.; Schirotzek, A.; Yang, C.; Wu, H.t.; Maia, F. Augmented projections for ptychographic imaging. Inverse Probl. 2013, 29, 115009.
  24. Thibault, P.; Guizar-Sicairos, M. Maximum-likelihood refinement for coherent diffractive imaging. New J. Phys. 2012, 14, 063004.
  25. Godard, P.; Allain, M.; Chamard, V.; Rodenburg, J. Noise models for low counting rate coherent diffraction imaging. Opt. Express 2012, 20, 25914.
  26. Odstrčil, M.; Menzel, A.; Guizar-Sicairos, M. Iterative least-squares solver for generalized maximum-likelihood ptychography. Opt. Express 2018, 26, 3108.
  27. Seifert, J.; Shao, Y.; Van Dam, R.; Bouchet, D.; Van Leeuwen, T.; Mosk, A.P. Maximum-likelihood estimation in ptychography in the presence of Poisson–Gaussian noise statistics. Opt. Lett. 2023, 48, 6027.
  28. Enfedaque, P.; Chang, H.; Krishnan, H.; Marchesini, S. GPU-Based Implementation of Ptycho-ADMM for High Performance X-Ray Imaging. In Computational Science—ICCS 2018; Shi, Y., Fu, H., Tian, Y., Krzhizhanovskaya, V.V., Lees, M.H., Dongarra, J., Sloot, P.M.A., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2018; Volume 10860, pp. 540–553.
  29. Aslan, S.; Nikitin, V.; Ching, D.J.; Bicer, T.; Leyffer, S.; Gürsoy, D. Joint ptycho-tomography reconstruction through alternating direction method of multipliers. Opt. Express 2019, 27, 9128.
  30. Maiden, A.M.; Mei, W.; Li, P. WASP: Weighted average of sequential projections for ptychographic phase retrieval. Opt. Express 2024, 32, 21327.
  31. Enders, B.; Thibault, P. A computational framework for ptychographic reconstructions. Proc. R. Soc. A Math. Phys. Eng. Sci. 2016, 472, 20160640.
  32. Marchesini, S.; Krishnan, H.; Daurer, B.J.; Shapiro, D.A.; Perciano, T.; Sethian, J.A.; Maia, F.R.N.C. SHARP: A distributed, GPU-based ptychographic solver. J. Appl. Crystallogr. 2016, 49, 1245–1252.
  33. Favre-Nicolin, V.; Girard, G.; Leake, S.; Carnis, J.; Chushkin, Y.; Kieffer, J.; Paléo, P.; Richard, M.I. PyNX: High performance computing toolkit for coherent X-Ray imaging based on operators. J. Appl. Crystallogr. 2020, 53, 1404–1413.
  34. Wakonig, K.; Stadler, H.C.; Odstrčil, M.; Tsai, E.H.R.; Diaz, A.; Holler, M.; Usov, I.; Raabe, J.; Menzel, A.; Guizar-Sicairos, M. PtychoShelves, A Versatile High-Level Framework for High-Performance Analysis of Ptychographic Data. J. Appl. Crystallogr. 2020, 53, 574–586.
  35. Yu, X.; Nikitin, V.; Ching, D.J.; Aslan, S.; Gürsoy, D.; Biçer, T. Scalable and accurate multi-GPU-based image reconstruction of large-scale ptychography data. Sci. Rep. 2022, 12, 5334.
  36. Guzzi, F.; Kourousias, G.; Billè, F.; Pugliese, R.; Gianoncelli, A.; Carrato, S. A modular software framework for the design and implementation of ptychography algorithms. PeerJ Comput. Sci. 2022, 8, e1036.
  37. Cherukara, M.J.; Zhou, T.; Nashed, Y.; Enfedaque, P.; Hexemer, A.; Harder, R.J.; Holt, M.V. AI-enabled high-resolution scanning coherent diffraction imaging. Appl. Phys. Lett. 2020, 117, 044103.
  38. Kandel, S.; Maddali, S.; Allain, M.; Hruszkewycz, S.O.; Jacobsen, C.; Nashed, Y.S.G. Using automatic differentiation as a general framework for ptychographic reconstruction. Opt. Express 2019, 27, 18653.
  39. Du, M.; Kandel, S.; Deng, J.; Huang, X.; Demortiere, A.; Nguyen, T.T.; Tucoulou, R.; De Andrade, V.; Jin, Q.; Jacobsen, C. Adorym: A multi-platform generic X-Ray image reconstruction framework based on automatic differentiation. Opt. Express 2021, 29, 10000.
  40. Loetgering, L.; Du, M.; Flaes, D.B.; Aidukas, T.; Wechsler, F.; Molina, D.S.P.; Rose, M.; Pelekanidis, A.; Eschen, W.; Hess, J.; et al. PtyLab.m/py/jl: A cross-platform, open-source inverse modeling toolbox for conventional and Fourier ptychography. Opt. Express 2023, 31, 13763.
  41. Craievich, A.F. Synchrotron radiation in Brazil. Past, present and future. Radiat. Phys. Chem. 2020, 167, 108253.
  42. Liu, L.; Westfahl, H., Jr. Towards Diffraction Limited Storage Ring Based Light Sources. In Proceedings of the IPAC2017, Copenhagen, Denmark, 14–19 May 2017.
  43. Tolentino, H.C.; Geraldes, R.R.; Da Silva, F.M.; Guaita, M.G.D.; Camarda, C.M.; Szostak, R.; Neckel, I.T.; Teixeira, V.C.; Hesterberg, D.; Pérez, C.A.; et al. The CARNAÚBA X-Ray nanospectroscopy beamline at the Sirius-LNLS synchrotron light source: Developments, commissioning, and first science at the TARUMÃ station. J. Electron Spectrosc. Relat. Phenom. 2023, 266, 147340.
  44. Górecki, R.; Polo, C.C.; Kalile, T.A.; Miqueles, E.X.S.; Tonin, Y.R.; Upadhyaya, L.; Meneau, F.; Nunes, S.P. Ptychographic X-Ray computed tomography of porous membranes with nanoscale resolution. Commun. Mater. 2023, 4, 68.
  45. Dos Reis, R.D.; Kaneko, U.F.; Francisco, B.A.; Fonseca, J., Jr.; Eleoterio, M.A.S.; Souza-Neto, N.M. Preliminary Overview of the Extreme Condition Beamline (EMA) at the new Brazilian Synchrotron Source (Sirius). J. Phys. Conf. Ser. 2020, 1609, 012015.
  46. Archilha, N.L.; Costa, G.R.; Ferreira, G.R.B.; Moreno, G.B.Z.L.; Rocha, A.S.; Meyer, B.C.; Pinto, A.C.; Miqueles, E.X.S.; Cardoso, M.B.; Westfahl, H., Jr. MOGNO, the nano and microtomography beamline at Sirius, the Brazilian synchrotron light source. J. Phys. Conf. Ser. 2022, 2380, 012123.
  47. Rodrigues, M. First biolab in South America for studying world’s deadliest viruses is set to open. Nature 2024, 632, 959–960.
  48. Campanelli, R.; Gomes, G.; Fernandes, M.; Mendes, L.; Rosa, L.; Reis, R.; Antonio, E.; Polli, J. Large area hybrid detectors based on Medipix3RX: Commissioning and characterization at Sirius beamlines. J. Instrum. 2023, 18, C02008.
  49. Brazilian Synchrotron Light Laboratory; Miqueles, E.X.; Ferraz, P. ssc-raft: Reconstruction Algorithms for Tomography; Zenodo: Geneve, Switzerland, 2024.
  50. Brazilian Synchrotron Light Laboratory; Miqueles, E. ssc-Resolution; Zenodo: Geneve, Switzerland, 2024.
  51. Brazilian Synchrotron Light Laboratory; Macul Moreno, L.; Miqueles, E. ssc-Rings; Zenodo: Geneve, Switzerland, 2024.
  52. Furusato, F.; Sarmento, M.; Aranha, G.; Zago, L.; Miqueles, E. TEPUI: High-Performance Computing Infrastructure for Beamlines at LNLS/Sirius. In High Performance Computing; Gitler, I., Barrios Hernández, C.J., Meneses, E., Eds.; Springer: Cham, Switzerland, 2022; pp. 3–18.
  53. NVIDIA Grace Hopper Superchip: A CPU and GPU Integrated for Accelerated Computing. 2024. Available online: https://resources.nvidia.com/en-us-grace-cpu/nvidia-grace-hopper (accessed on 10 October 2024).
  54. Okuta, R.; Unno, Y.; Nishino, D.; Hido, S.; Loomis, C. CuPy: A NumPy-Compatible Library for NVIDIA GPU Calculations. In Proceedings of the Workshop on Machine Learning Systems (LearningSys) in The Thirty-First Annual Conference on Neural Information Processing Systems (NIPS), Long Beach, CA, USA, 4–9 December 2017.
  55. Baraldi, G.; Dias, C.; Silva, F.; Tolentino, H.; Miqueles, E. Fast reconstruction tools for ptychography at Sirius, the fourth-generation Brazilian synchrotron. J. Appl. Crystallogr. 2020, 53, 1550–1558.
  56. Paganin, D. Coherent X-Ray Optics; Oxford University Press: Oxford, UK, 2006.
  57. Natterer, F.; Wübbeling, F. Mathematical Methods in Image Reconstruction; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2001.
  58. Voelz, D.G. Computational Fourier Optics: A MATLAB Tutorial; SPIE Press: Bellingham, WA, USA, 2011.
  59. Schmidt, J.D. Numerical Simulation of Optical Wave Propagation with Examples in MATLAB; SPIE: St Bellingham, WA, USA, 2010.
  60. Stockmar, M.; Cloetens, P.; Zanette, I.; Enders, B.; Dierolf, M.; Pfeiffer, F.; Thibault, P. Near-field ptychography: Phase retrieval for inline holography using a structured illumination. Sci. Rep. 2013, 3, 1927.
  61. Robisch, A.L.; Kröger, K.; Rack, A.; Salditt, T. Near-field ptychography using lateral and longitudinal shifts. New J. Phys. 2015, 17, 073033.
  62. Zhang, Z. Analysis and Development of Phase Retrieval Algorithms for Ptychography. Ph.D. Thesis, University of Sheffield, Sheffield, UK, 2021.
  63. Guzzi, F.; Gianoncelli, A.; Billè, F.; Carrato, S.; Kourousias, G. Automatic Differentiation for Inverse Problems in X-Ray Imaging and Microscopy. Life 2023, 13, 629.
  64. Bergstra, J.; Breuleux, O.; Bastien, F.; Lamblin, P.; Pascanu, R.; Desjardins, G.; Turian, J.; Warde-Farley, D.; Bengio, Y. Theano: A CPU and GPU math compiler in Python. In Proceedings of the 9th Python in Science Conference, Austin, TX, USA, 28 June–3 July 2010; Volume 4, pp. 1–7.
  65. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Proceedings of the Advances in Neural Information Processing Systems 32; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: San Francisco, CA, USA, 2019; Volume 32, pp. 8024–8035.
  66. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. Available online: https://www.tensorflow.org/ (accessed on 27 October 2024).
  67. NVIDIA Corporation. cuFFT Library Version 10.4.0. 2024. Available online: https://developer.nvidia.com/cufft (accessed on 27 October 2024).
  68. Tolentino, H.; Geraldes, R.; Moreno, G.; Pinto, A.; Bueno, C.; Kofukuda, L.; Sotero, A.; Neto, A.; Lena, F.; Wilendorf, W.; et al. X-Ray microscopy developments at Sirius-LNLS: First commissioning experiments at the Carnauba Beamline. In X-Ray Nanoimaging: Instruments and Methods V; SPIE: St Bellingham, WA, USA, 2021.
  69. Odstrčil, M.; Holler, M.; Guizar-Sicairos, M. Arbitrary-path fly-scan ptychography. Opt. Express 2018, 26, 12585.
  70. NVIDIA Corporation. NVIDIA DGX A100 System. 2024. Available online: https://resources.nvidia.com/en-us-dgx-systems/dgxa100-system?xs=489761 (accessed on 27 October 2024).
  71. Van der Walt, S.; Schönberger, J.L.; Nunez-Iglesias, J.; Boulogne, F.; Warner, J.D.; Yager, N.; Gouillart, E.; Yu, T.; The Scikit-Image Contributors. scikit-image: Image processing in Python. PeerJ 2014, 2, e453.
  72. Maiden, A.M.; Humphry, M.J.; Rodenburg, J.M. Ptychographic transmission microscopy in three dimensions using a multi-slice approach. J. Opt. Soc. Am. A 2012, 29, 1606.
  73. Tsai, E.H.R.; Usov, I.; Diaz, A.; Menzel, A.; Guizar-Sicairos, M. X-Ray ptychography with extended depth of field. Opt. Express 2016, 24, 29089.
  74. Gilles, M.A.; Nashed, Y.S.G.; Du, M.; Jacobsen, C.; Wild, S.M. 3D X-Ray imaging of continuous objects beyond the depth of focus limit. Optica 2018, 5, 1078.
  75. Chang, H.; Enfedaque, P.; Marchesini, S. Blind Ptychographic Phase Retrieval via Convergent Alternating Direction Method of Multipliers. SIAM J. Imaging Sci. 2019, 12, 153–185.
  76. Chapman, H.N.; Barty, A.; Marchesini, S.; Noy, A.; Hau-Riege, S.P.; Cui, C.; Howells, M.R.; Rosen, R.; He, H.; Spence, J.C.H.; et al. High-resolution ab initio three-dimensional X-Ray diffraction microscopy. J. Opt. Soc. Am. A 2006, 23, 1179.
  77. Guizar-Sicairos, M.; Diaz, A.; Holler, M.; Lucas, M.S.; Menzel, A.; Wepf, R.A.; Bunk, O. Phase tomography from X-Ray coherent diffractive imaging projections. Opt. Express 2011, 19, 21345.
  78. Herráez, M.A.; Burton, D.R.; Lalor, M.J.; Gdeisat, M.A. Fast two-dimensional phase-unwrapping algorithm based on sorting by reliability following a noncontinuous path. Appl. Opt. 2002, 41, 7437.
  79. Feeman, T.G. The Mathematics of Medical Imaging: A Beginner’s Guide; Springer Undergraduate Texts in Mathematics and Technology; Springer: New York, NY, USA, 2010.
  80. Odstrčil, M.; Holler, M.; Raabe, J.; Guizar-Sicairos, M. Alignment methods for nanotomography with deep subpixel accuracy. Opt. Express 2019, 27, 36637.
Figure 1. Diagram illustrating the batch distributions of measurements for three GPUs. Each colored block represents data inside of the respective GPU. A batch of B ≤ N measurements is distributed to the GPU memory, so that the wavefronts are updated in parallel. Once a GPU finishes processing and is made available, a remaining batch of unprocessed data is loaded from RAM. After all batches have been loaded and all the wavefronts updated, the new object and probe matrices are calculated by GPU0 and then broadcasted to the other GPUs, so that each of them has faster access to O and P in the subsequent iteration.
Figure 2. Ptychography reconstruction of a Siemens Star measured at CARNAÚBA beamline. The finest features of the innermost circles are spaced 15 nm from each other. The complex probe is shown in an hsv colormap, saturation encoding magnitude and hue encoding the phase.
Figure 3. Comparison of the simulated sample against the reconstruction using the DM algorithm from different packages. The insets show a zoomed region from the red square in the object and phase reconstructions. The reconstruction for ssc-cdi used the RAAR algorithm with parameter β = 1, such that the update function equals that of DM. For PyNX and PtyPy, we used the DM engine directly. In all cases, the same initial guesses were used: random magnitude and constant phase for the object array, and an inverse Fourier transform of the averaged measurements for the probe.
Figure 4. Single GPU performance of DM and PIE algorithms across different packages. The inset shows the same data without log scale on the vertical axis. Missing points on some curves indicate dimensions that were not supported by a specific engine. Note that PyNX does not provide an engine for an algorithm of the PIE family for comparison.
Figure 5. Multi-GPU performance of DM algorithm for ssc-cdi and PtyPy using batch sizes of (a) 128 and (b) 16. Dimensions that were not supported by an engine are the reason for missing points for some of the curves. The inset plots the same data without log scale on the vertical axis.
Figure 6. Single GPU performance of ssc-cdi for DM and PIE engines on a conventional machine. RAAR was run with batch size B = 1 and managed to run up to a data size of 2048².
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Tonin, Y.R.; Peixinho, A.Z.; Brandao-Junior, M.L.; Ferraz, P.; Miqueles, E.X. ssc-cdi: A Memory-Efficient, Multi-GPU Package for Ptychography with Extreme Data. J. Imaging 2024, 10, 286. https://doi.org/10.3390/jimaging10110286
