Introduction

Three-dimensional (3D) tomography refers to a collection of methods used to produce 3D representations of the internal structure of a solid object. The practice is commonly used to study specimens whose material properties are strongly related to their internal structure. The morphology of a sample can be defined over many different length scales. A good example in materials science is that of networks of solution-processed nanomaterials, widely employed in electronics, energy, and sensing1,2. For these, the charge transport is determined by the morphology of the contacts between nanosheets, which is defined at a length scale of a few tens of nanometers. At the opposite end of the length-scale spectrum one finds medical imaging techniques3, such as magnetic resonance imaging (MRI) and X-ray computed tomography (CT), where the relevant information is typically available at millimeter resolution.

There are several limitations to 3D tomography, common to many experimental techniques and length scales. Firstly, the resolution achieved must be sufficient to extract the desired information, but it is often limited by the measuring technique and by the necessity of keeping the acquisition time short. For instance, in a CT scan one wants enough detail to inform a medical decision, while limiting the radiation dose the patient is exposed to4. Furthermore, in several cases the measurement is destructive, meaning that the specimen being imaged is destroyed during the measurement process5,6. In this situation the 3D resolution is often anisotropic, meaning that a cubic-voxel definition in the three dimensions is not achieved. In addition, in a destructive experiment one cannot go back and take a second measurement, should the first not have achieved enough resolution. The crucial question is then whether an image-augmentation method may help enhance the image quality, in terms of resolution, thus enabling the extraction of more precise information. Should this be possible, one can also invert the question and establish which experimental conditions an image-augmentation method will enable while retaining the same information content as the uncorrected measurement.

To better understand these issues, let us consider the specific case of FIB-SEM nanotomography (FIB-SEM-NT), where the term nano refers to the length scale involved5. This is a destructive imaging technique, in which a focused-ion beam (FIB) mills away slices of a specimen, often a composite, while a scanning electron microscope (SEM) images the exposed planes. The typical outcome is a stack of hundreds of 2D images, used to produce a high-fidelity 3D reconstruction. The resolution of the resulting 3D volume is often anisotropic, especially when working at high resolution. In fact, while the cross-section (also referred to as the xy-plane) is imaged at the SEM resolution, about 5 nm in our case, the resolution along the milling direction (the z-direction) corresponds to the slice thickness, usually around 10–20 nm. As a consequence, the reconstructed 3D volume will not be characterized by cubic voxels. Note that cutting thinner slices is hindered by the instrumentation, the nature of the specimen, and economic constraints. Moreover, a reduction of the slice thickness degrades the resolution in the xy-plane, since the damage produced during one cut can propagate to the following one. This limitation is linked to another problem of FIB-SEM instruments, namely the slow imaging speed7. One would then desire a method to interpolate images that preserves and possibly enhances the information quality, ideally reducing the number of milling steps to perform.

The simplest solution is linear interpolation8. However, this is reliable only when one can safely assume that the structural variations across consecutive cross-sections are smooth. Unfortunately, when this condition is only approximately met, linear interpolation tends to blur feature edges. This can be partially improved9 with interpolation strategies that account for feature changes between consecutive images by using optical flow10, but the performance remains poor at the image borders. As a consequence, such portions of the frame must be discarded, with a consequent loss of valuable information. Alternative solutions involve deep-learning algorithms. For instance, Hagita et al.11 proposed a deep-learning-based method for the super-resolution of 3D images with asymmetric sampling. The model was trained on images obtained from the cross-section and applied to frames co-planar to the milling direction, obtained from the 3D reconstruction. Unfortunately, this strategy works only when one can assume that the three directions share the same morphology, and it does not provide an unbiased reconstruction. In a different effort, Dahari et al.12 developed a generative adversarial network (GAN), trained on pairs of high-resolution 2D images and low-resolution 3D data, aiming at generating a super-resolved 3D volume. The scheme showed success on a variety of datasets. However, generative models are problematic in this context, since the nature of this deep-learning architecture does not allow one to find a unique solution.

Here, we propose an alternative solution that relies on a deep-learning model trained for video-frame interpolation. This is a process in which the frame rate of a video is increased by generating additional frames between the existing ones, thus creating a visually more fluid motion13. Common applications include the generation of slow-motion videos14 and video prediction15. Video-frame interpolation can be considered a combination of enhancement and reconstruction technology. In fact, it enhances the perceptual quality of a video by providing smoother and more natural motion. At the same time, it can be regarded as a form of video reconstruction, since it aims at reconstructing the temporal evolution of a scene. Several deep-learning frameworks have been proposed for this task. The state-of-the-art algorithm is the Real-Time Intermediate Flow Estimation (rife)16, which will be extensively used here. The selected video-frame interpolation method, like any neural network developed for this purpose, is inherently bound by certain limitations. In particular, such networks are constrained by the information contained in the frames provided. The limitations associated with the chosen method have been studied in the original paper16 and are expected to transfer to any dataset under investigation.

We begin by considering a dataset made of printed graphene-nanosheet images, obtained with FIB-SEM, where the milling direction is taken as corresponding to the video time direction. The resolution of this dataset is then improved by the application of rife and quantitatively validated using several approaches. In particular, together with standard computer-vision metrics, we evaluate physical quantities that can be extracted from the final 3D reconstructions, after appropriate image binarization with standard software such as Fiji17 or Dragonfly18. These are the porosity, tortuosity, and effective diffusivity, and their precise evaluation allows us to understand what information content is preserved during the interpolation. Then, the same scheme is applied, at a completely different length scale, to both MRI and CT scans. In the first case the 3D mapping is already isotropic, so that our reconstructed images can be compared to an available ground truth, as in the case of FIB-SEM. Instead, for CT scans, we show a significant enhancement of the picture quality, a result that may enable a reduction of the scanning rate and, therefore, of the radiation dose for the patient.

Importantly, these are only some illustrative applications of the proposed strategy, which could be useful in many other fields. Some examples include in-situ electron microscopy19, such as transmission electron microscopy (TEM) and scanning TEM. In this context, a reduction of the recorded frames is crucial to minimize the beam dose and, therefore, the possible beam-induced damage to the specimen, and it may make it possible to reduce the number of tilt images needed to obtain a tomographic TEM reconstruction. In this work, we demonstrate the applicability of a neural network developed for video-frame interpolation in contexts different from its original purpose. In particular, we employ this neural network to ameliorate the acquisition conditions of 3D tomography across different length scales. Throughout this work, we compare our algorithm to alternative interpolation strategies, demonstrating its quantitative advantage.

Results

The first test was conducted on a dataset of printed nanostructured graphene networks, described in the Methods section. In order to prove the method's efficacy, some frames are removed from the dataset and used as ground truth for the assessment of the results. We consider different scenarios, in which one, three, and seven consecutive frames are removed from the image sequence, although the seven-frame error suggests that it is not advisable to reduce the image density along the milling direction so drastically. A simple visual comparison offers a qualitative overview of the efficacy of the various interpolation methods. This is shown in Fig. 1 for the case where three consecutive frames are removed from our FIB-SEM sequence and then reconstructed by the different models. The first column shows the ground-truth image, namely the one removed from the original dataset, while the remaining ones contain the pictures reconstructed with the various methods. In order to better appreciate the quality of the reconstructions, we also provide the difference between the ground truth and the reconstructed images (second row), the magnification of a 120 × 120-pixel portion of each picture (third row), and again the difference from the ground truth (fourth row). The differences are obtained by simply subtracting the greyscale bitmaps pixel by pixel. Blue (red) regions mean that the reconstructed image appears lighter (darker) than the original one.

Fig. 1: Visual comparison of image interpolation according to different methods.
figure 1

Images from the FIB-SEM dataset are augmented using different approaches. In particular, three additional frames are generated between every two consecutive frames in the sequence. From left to right we show: the ground-truth image (original FIB-SEM) and those reconstructed by rife hd, the fine-tuned rifem, dain, IsoFlow, and linear interpolation. The second row displays the difference between the ground truth and the reconstructions. A 120 × 120-pixel portion of each image (see the green box in the upper left panel) is magnified and shown in the third row, while the differences from the original image are in the fourth row.

The inspection of the figure leads to some qualitative considerations on the different methods, and the comparison is particularly clear for the magnified images. The most notable feature is the loss of sharpness brought by the linear interpolation, which is not motion-aware. In fact, instead of tracing the motion of the border between a graphene nanosheet and a pore, namely the border region between dark and bright pixels, linear interpolation simply fills the space with an average grayscale. As a result, the image differences (e.g., see the rightmost lower panel) present some dipolar distribution, which, as we will show below, causes information loss. A similar, although less pronounced, drawback is found for images reconstructed by dain, which also tends to over-smooth the graphene borders. In contrast, IsoFlow appears to generate generally good-quality pictures, in particular in the middle of the frame. However, one can clearly notice a significant error appearing at the image border, which is not well reproduced and whose information thus needs to be discarded. Finally, the two rife models are clearly the best-performing ones. Of similar quality, they are able to maintain the original image sharpness across the entire field of view and do not seem to show any systematic failure. Although instructive, visual inspection provides only a qualitative understanding; more quantitative metrics need to be evaluated in order to determine what image content is preserved by the various reconstructions.

Quantitative assessment of the extracted information content

A full quantitative analysis is better performed on segmented images, where the pore and nanosheet components are well separated5. This can be obtained by using the trainable weka segmentation tool20 available in Fiji17. The procedure to produce binarised data is demonstrated in Fig. 2. A set of images from the original dataset is used to train a model, whose goal is to classify each pixel of the image either as pore or as nanosheet. The training set is automatically built by weka following the manual identification of pore and nanosheet areas by the user. This is shown in Fig. 2a, where the red circles identify pixels labeled as pores, and the green circles pixels labeled as nanosheets. Panel (b) of the same figure displays the outcome of the application of the trained model, where the grayscale expresses the probability of each pixel being labeled as a pore or nanosheet. This is called a probability map. Finally, a threshold is applied to obtain a binary classification, as shown in Fig. 2c. In particular, here, the Isodata algorithm21, available in Fiji, is used to select an appropriate threshold.
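Although our segmentation is performed entirely within Fiji, the final thresholding step can equivalently be scripted. The snippet below is a minimal sketch using scikit-image, whose threshold_isodata implements the same Isodata algorithm21; the file names and the pore/nanosheet label convention are illustrative.

```python
# Minimal sketch of the final thresholding step, assuming the weka
# probability map has been exported from Fiji as a grayscale TIFF.
# File names and the label convention are illustrative.
import numpy as np
from skimage import io
from skimage.filters import threshold_isodata

prob_map = io.imread("probability_map.tif").astype(float)

# Isodata selects the threshold that equals the mean of the two class
# means it induces (the same algorithm as Fiji's "IsoData").
t = threshold_isodata(prob_map)
binary = prob_map > t  # True = pore, False = nanosheet (convention is arbitrary)

io.imsave("binarised.tif", binary.astype(np.uint8) * 255)
```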

Fig. 2: WEKA image binarisation procedure.
figure 2

Different phases of the procedure used to segment each frame into pore and nanosheet components. This generates binarised data by using the trainable weka segmentation plugin of Fiji5,20. In panel (a) the user first manually assigns the label of pore (red circles) and nanosheet (green circles) to some areas of the original image. This information is used to build the dataset for training a classifier, whose output is the probability map displayed in panel (b). Here the probability of each pixel being either pore or nanosheet is displayed using pixel intensity values. Panel (c) shows the final output of the procedure, namely the binarised image, obtained by applying a threshold on the probability map.

Once the classifier is trained and the threshold is established, all the datasets obtained from the different interpolation strategies can be binarised. It should be noted that the performance of the classifier depends on the manual selection performed by the user. However, using the same classifier for all the analyzed datasets guarantees consistency in the segmentation process and, consequently, in the quantitative assessment of the various reconstruction methods.

The Mean Square Error (MSE) and the Structural Similarity Index Method (SSIM) are two of the standard metrics used in computer vision to evaluate results22. Both are full-reference metrics, meaning that the ground truth is required to assess their value. The MSE focuses on the pixel-by-pixel comparison and not on the structure of the image, while the SSIM performs better at discriminating the structural information of the frames. Here the MSE is calculated between each of the generated frames and the corresponding image removed from the original dataset. The average of these values is then computed for each test case (one, three, and seven replaced frames) and for each technique (rife hd, rifem, dain, IsoFlow, and linear interpolation), over a set of 100 images. The same procedure is followed for the evaluation of the SSIM, and our results are shown in panels a and b of Fig. 3. As expected, all models perform better when the number of removed frames remains limited, and in general there is a significant loss in performance for the case of seven replaced frames. In more detail, rife-type schemes are always the top performers, while linear interpolation, and also dain, remain the most problematic. Interestingly, IsoFlow appears quite accurate according to these computer-vision metrics, which clearly do not emphasize the loss of resolution at the frame boundary. However, we will now see that this does not necessarily translate into the ability to preserve information.
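As an illustration, the per-frame evaluation can be scripted with scikit-image, whose metrics module implements both quantities. The snippet below is a minimal sketch, with file handling omitted and array names assumed.

```python
# Sketch of the per-frame metric evaluation; ground_truth and generated
# are lists of 2D uint8 arrays of equal shape (file handling omitted).
import numpy as np
from skimage.metrics import mean_squared_error, structural_similarity

def average_metrics(ground_truth, generated):
    mse_vals, ssim_vals = [], []
    for gt, gen in zip(ground_truth, generated):
        mse_vals.append(mean_squared_error(gt, gen))
        ssim_vals.append(structural_similarity(gt, gen, data_range=255))
    # report the average over the whole test set (100 frames in our case)
    return np.mean(mse_vals), np.mean(ssim_vals)
```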

Fig. 3: Evaluation of MSE, SSIM and porosity on binarised data.
figure 3

Mean Squared Error (MSE - panel a), Structural Similarity Index Method (SSIM - panel b), and Porosity (P - panel c) were evaluated for each test case (one, three, and seven replaced frames), and each interpolation method, for 100 images. MSE and SSIM are evaluated for each frame against the ground truth and they are expressed as an average over the 100 frames. P is also evaluated for each frame against the ground truth, and it is expressed in terms of ΔP [see Eq. (1)], namely as a percentage deviation from the ground-truth value. For each panel, the average value is shown with the associated variance.

Ultimately, the quality of a reconstruction procedure must be measured by the quality of the information it is able to transfer/retrieve. In the case of printed graphene-nanosheet ensembles, the morphological properties can be measured and compared. The so-called network porosity, P, defined as the percentage of the total volume occupied by the pores, is one of the most important features measured in 2D networks, and it affects the material's electrical properties23. In our image-segmented 3D reconstruction this translates into the fraction of the frame area occupied by pores, a quantity that can be evaluated from the binarised images with the conventional image-processing software Fiji17. Such an analysis is performed here for each test case and technique. In particular, from the binarised data the porosity of each frame is computed as P = (pore pixels)/(total pixels).

The results are then expressed in terms of delta porosity, ΔP, which is the fractional difference between the porosity computed from images reconstructed with a particular method m, Pm, and that of the ground truth, PGT, namely

$$\Delta P = 100 \times \frac{\left| P_{m} - P_{\mathrm{GT}} \right|}{P_{\mathrm{GT}}}.$$
(1)

In our case, the ΔP of 100 images is computed, and the average of these values is presented in panel c of Fig. 3. From the figure, the advantage of using rife is quite clear. In fact, although all methods, except for linear interpolation, give a faithful approximation of P when one frame is removed from the sequence, differences start to emerge already at three replaced frames, where rife significantly outperforms all other schemes. The difference becomes even more evident for seven replaced frames, for which the rife error remains, surprisingly, below 2%. It is also interesting to note that, in contrast to what is suggested by the computer-vision metrics, IsoFlow is not capable of returning a precise porosity, mainly because of the poor description of the image borders. In contrast, the weak performance of linear interpolation has to be attributed to its inability to describe the sharp borders between nanosheets and pores.
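A minimal sketch of this evaluation, following the definition of P above and Eq. (1), is given below; the frame and stack names are illustrative, and the frames are assumed to be boolean arrays with True marking pore pixels.

```python
# Sketch of the porosity evaluation on binarised frames [Eq. (1)].
# Frames are boolean 2D arrays with True marking pore pixels.
import numpy as np

def porosity(frame):
    return frame.sum() / frame.size  # pore pixels / total pixels

def delta_porosity(frame_method, frame_gt):
    p_m, p_gt = porosity(frame_method), porosity(frame_gt)
    return 100.0 * abs(p_m - p_gt) / p_gt

# Average over a stack of reconstructed/ground-truth frame pairs:
# mean_dp = np.mean([delta_porosity(m, g) for m, g in zip(stack_m, stack_gt)])
```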

A second important structural feature that can be retrieved from nanostructured networks is the network tortuosity, τ, which can be evaluated on binarised data using the TauFactor software24. This quantity describes the effect that convolutions in the geometry of heterogeneous media have on diffusive transport, and it can be measured for both the nanosheet and the pore volumes. The nanosheet-network tortuosity factor influences charge transport through the film. Pore tortuosity affects the performance of nanosheet-based battery electrodes, while in gas-sensing applications it is directly linked to gas diffusion. The tortuosity, τ, and the volume fraction, ε, of a phase quantify the reduction in the diffusive flux through that phase by comparing its effective diffusivity, Deff, to the intrinsic diffusivity, D:

$$D^{\mathrm{eff}} = D\,\frac{\varepsilon}{\tau}.$$
(2)

The TauFactor software calculates the tortuosity from stacks of segmented images. Specifically, it employs an over-relaxed iterative method to solve the steady-state diffusion equation governing the movement of species in an infinitely dilute solution within a porous medium confined by fixed-value boundaries. It has been proved that, for the evaluation of the tortuosity and diffusivity, the sample volume needs to be adequately large, in order to be representative of the bulk and to reduce the effect of microscopic heterogeneities25. For this reason, we consider ten randomly selected volumes, ranging from 55% to 60% of the original one. This fraction has been chosen after performing a scaling analysis, namely by computing the tortuosity factor for different volumes and noticing that the bulk limit was reached at around 55% (see Supplementary Information). As the size of the input sample strongly affects the computation speed and memory requirements, not all methods are considered for this comparison. In particular, only the rife hd and dain results are used as input for the tortuosity and diffusivity study, these two methods being the ones providing the best evaluation of the porosity. The Python version of TauFactor26 is then run on Quadro RTX 8000 GPUs.
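For reference, a minimal sketch of how such a computation can be scripted with the Python version of TauFactor26 is given below. The Solver entry point follows the taufactor documentation, while the attribute names used to read out τ and Deff (s.tau, s.D_eff) reflect our reading of that documentation and should be checked against the installed release.

```python
# Minimal sketch of the tortuosity/effective-diffusivity evaluation with
# the Python version of TauFactor. Attribute names (s.tau, s.D_eff) are
# assumptions based on the taufactor documentation.
import tifffile
import taufactor as tau

vol = tifffile.imread("binarised_subvolume.tif")  # segmented 3D stack,
                                                  # 1 = conductive phase, 0 otherwise
s = tau.Solver(vol)   # set up the over-relaxed iterative diffusion solver
s.solve()             # solve the steady-state diffusion problem
print(s.tau, s.D_eff) # tortuosity factor and effective diffusivity
```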

Also in this case, we compute the fractional change of any given quantity from the ground truth, and the results are displayed in Fig. 4. Confirming the results obtained for ΔP, also in this case rife is the best-performing method, with errors remaining below 2% at three replaced frames for both τ and Deff. In contrast, dain displays significant errors, exceeding 10% already for a single replaced frame, an error large enough that the analysis at other replaced-frame rates was deemed unnecessary.

Fig. 4: Evaluation of tortuosity and effective diffusivity on binarised data.
figure 4

Delta tortuosity, Δτ (panel a), and delta effective diffusivity, ΔDeff (panel b), were evaluated for each test case (one, three, and seven replaced frames) for rife hd, and for one replaced frame for the dain interpolation. The metrics are evaluated for each frame against the ground truth and averaged over the sequence. The variance is also displayed, and in the case of rife, it is smaller than the symbol size.

Analyzing the properties outlined in this section is essential for all applications involving geometry-influenced transport in heterogeneous media. For instance, in the field of catalysis, the mentioned properties are used to determine the extent of accessibility of the nanosheet surface area.

Dependence on the features size

As demonstrated in the previous section, our method performs well at reconstructing the frames of the FIB-SEM sequence, with errors on physically relevant quantities remaining below 10% even for seven replaced frames. This is equivalent to having a milling thickness of about 100 nm, indeed a very favorable experimental condition. In this section, we discuss the limitations of our approach in relation to the type of sample under investigation. A relevant problem with image-interpolation techniques concerns the level of continuity between consecutive frames. In fact, it is well understood that rapid changes between the images in a sequence can reduce the quality of the interpolated frames13. The same issue may arise when considering FIB-SEM measurements of graphene nanosheets of different lengths: shorter nanosheets will result in FIB-SEM images with more abrupt changes between consecutive frames. For instance, if the average nanosheet length is L and the milling distance is \(L^{\prime}\), for \(L \sim L^{\prime}\) one will often encounter a situation where a nanosheet present in one image does not appear in the next one.

In order to explore how the proposed model works with increasingly challenging datasets, we investigate the case of networks made of shorter graphene nanosheets, namely 80 nm and 298 nm in average length. Examples of such networks, together with the original one of 695 nm, can be found in Fig. 5. In this case, we use ΔP as the evaluation criterion and consider three replaced frames, together with the original rife hd model. Our results can be found in Fig. 6, where we show ΔP against the average nanosheet length. For this comparison, the data are binarised following the procedure described in the previous section, and ΔP is computed as an average over 100 images. It is evident from the figure that, as expected, the performance of our model indeed deteriorates when reducing the nanosheet length. However, even for the smallest sample, 80 nm, the error remains below a very acceptable 2%. Note that in these conditions (nanosheet length of 80 nm and three replaced frames, equivalent to a ~50 nm milling distance) the milling distance is about half of the average feature size of the sample. Since networks made of small flakes are certainly structurally more fragile than those made with larger ones, the fact that the milling frequency can be reduced significantly without an appreciable loss in the accuracy of the morphology determination establishes a possible new experimental condition, in which the milling effects on the final morphology are strongly minimized.

Fig. 5: Illustration of graphene nanosheets of different lengths.
figure 5

Cross-sections of printed graphene nanosheets of different lengths (L). The nanosheet length decreases going from the top panel to the bottom one. In each case, the image width shown is 6510 nm.

Fig. 6: Investigation of the feature size dependence.
figure 6

Porosity was evaluated, for the case of three replaced frames, for different nanosheet lengths (80 nm, 298 nm, and 695 nm). The porosity is evaluated for each frame generated by rife hd against the ground truth, and it is expressed in terms of ΔP [see Eq. (1)]. Here we show the average ΔP over 100 images and the associated variance.

Application to medical datasets

The strategy proposed in this work is not limited to FIB-SEM-generated data but can be employed to increase the through-plane resolution of datasets at different scales, obtained using different imaging instruments. The purpose of this section is exactly to show such transfer across scales, and for this reason we did not perform any additional training or fine-tuning of the original model. As such, we now consider rife hd only. The first example is an application of rife to human-brain MRI scans. For this dataset, the voxels of the reconstructed volume are already cubic, with a 1 mm³ resolution. However, this is a useful case study, since it is possible to remove frames from the scan sequence and use them as ground truth for the validation, as in the case of FIB-SEM. In particular, we remove every other frame from the original dataset, which has been downloaded from the Brainstorm repository27,28. This is well documented and freely available online under the GNU General Public License. For this study, we consider brain scans in the sagittal view as input for rife hd.

A visual comparison of one original and the corresponding rife-generated slice in the sagittal view is shown in the first row of Fig. 7 (panels a, b, c), where the scan was defaced to fulfill privacy requirements. The third column of the figure (panels c and f) shows the difference between the original and the generated image. Clearly, the visual inspection of the reconstructed image is very positive, with a difference from the ground truth (see top right panel) that presents an error similar to that of the FIB-SEM data, despite the rather different length scale, and little structure in its distribution.

Fig. 7: Application of rife to a MRI dataset.
figure 7

An example of the original and the corresponding rife-generated frame in the sagittal view is shown in panels (a) and (b), respectively. Panel (c) displays the intensity difference between the ground truth and the rife-generated images. The two datasets are segmented by using the Anatomical pipeline of the BrainSuite software, as displayed in the second row. The original and corresponding rife-generated data are shown in panels (d) and (e), respectively, while panel (f) presents the difference between them. Although the results are presented here in the axial view only, the segmentation is performed on the full volume.

The BrainSuite software29 is then used to quantitatively compare the original and the rife-generated volumes, again trying to understand whether the information content is preserved. More specifically, the Anatomical pipeline is run on both stacks to obtain the full brain segmentation, whose output is shown in the axial view in panels d and e of Fig. 7. Also in this case, the visual inspection leads to conclusions similar to those drawn for the original images, with similar error characteristics. These segmentations are then used to evaluate the gray-matter volume variation (GMV), a widely used metric for the investigation of brain disorders such as Alzheimer's disease30,31. This is defined as

$$\mathrm{GMV} = \frac{p_{\mathrm{gray}}}{p_{\mathrm{gray}} + p_{\mathrm{white}}},$$
(3)

where pgray and pwhite indicate the number of gray- and white-matter pixels, respectively. The percentage error between the GMV of the two datasets is computed to be 0.5%, again very low.
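A minimal sketch of this computation is given below; the numerical label values assigned to gray and white matter are placeholders, since they depend on the BrainSuite output convention.

```python
# Sketch of the GMV evaluation from a labelled brain volume [Eq. (3)].
# gray_label and white_label are placeholders: the actual values depend
# on the BrainSuite segmentation output convention.
import numpy as np

def gmv(labels, gray_label=1, white_label=2):
    p_gray = np.count_nonzero(labels == gray_label)
    p_white = np.count_nonzero(labels == white_label)
    return p_gray / (p_gray + p_white)

# Percentage error between the original and the rife-generated volumes:
# err = 100 * abs(gmv(lab_rife) - gmv(lab_orig)) / gmv(lab_orig)
```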

The final medical application investigated here refers to CT scans. Note that this measurement technique is not limited to the medical space; it is also widely used in industrial settings and, in general, as a research tool across materials science32. The use of interpolation methods for medical CT could be truly transformative, since a reduction of the number of collected frames translates into a reduction of the radiation dose delivered to the patient. As a consequence, the potential risk of radiation-induced cancer diminishes33. Alternatively, one may have the possibility to perform more frequent scans for the close monitoring of particular diseases. The dataset used for this investigation is downloaded from the Cancer Imaging Archive34,35 and is provided as a set of 152 frames in the axial view, with a voxel size of 0.74 × 0.74 × 2.49 mm. For this example, the voxels in the reconstructed volume are not cubic, and no ground truth is available. rife is then used to generate three additional frames between every two existing ones.

The results can be seen in Figs. 8 and 9. Figure 8 shows the data in the axial view. Panel a shows one of the original frames of the CT dataset, while panel b presents a frame generated with rife, using the former image as an input for the interpolation model. It is important to note that in this case the two panels do not represent the same section of the body; therefore, they can only be compared from a qualitative point of view. From this figure we can only observe that rife is able to generate realistic representations of CT scans in the axial view. Figure 9 displays the original and rife-augmented datasets in the coronal view in panels a and b, respectively. The original-sized data are shown in the first row, while the second row (panels c and d) presents 150 × 150-pixel magnified portions of the data (indicated by the green boxes in the first row). The red and blue boxes mark the uniform regions used for the noise-power-spectrum analysis described shortly. For the coronal view, the comparative figure helps assess the improvement introduced by rife. This visual comparison appears quite favorable, with the rife reconstruction being, in general, smoother than the original image. This is, of course, the result of having more frames added to the sequence.

Fig. 8: Application of rife to an X-ray Computed Tomography dataset, axial view.
figure 8

Here the results are presented in the axial view, namely the view in which the original sequence, used as input for rife, is acquired. The left-hand-side panel shows one of the original frames from the CT dataset, while the right-hand-side panel displays a frame created using rife, with the former image serving as an input for the interpolation model. It is essential to emphasize that these two panels depict different sections of the body, allowing only for a qualitative comparison. This figure shows that rife effectively produces realistic representations of CT scans in the axial view.

Fig. 9: Application of rife to an X-ray Computed Tomography dataset, coronal view.
figure 9

Here the results are presented in the coronal view. Panel (a) displays the original dataset, while panel (b) shows the rife-augmented dataset, where three additional frames have been added between every two consecutive frames. The green boxes in the first row of the figure indicate a 150 × 150-pixel portion of the original-sized data that is magnified in the second row. The red (Case 1) and blue (Case 2) boxes represent the uniform regions used for the noise power spectrum analysis, whose results are presented in Fig. 10.

Since no ground truth is available in this case, we need to resort to computer-vision metrics to provide a quantitative assessment of the reconstruction procedure. Therefore, the two image stacks, the original and the rife-reconstructed one, are compared by estimating the noise power spectrum in the coronal plane, computed over uniform regions of a sequence of ten images. This metric is generally used for the assessment of the image quality of CT scanners, and it is evaluated over uniform regions of interest in water-filled phantoms36. As such, in order to adapt this metric to clinical datasets, it is necessary to select uniform areas of the image. For instance, the areas in Fig. 9 marked with the red and blue squares represent two possible regions of interest, here called 'case 1' and 'case 2', respectively. The noise power spectrum of these areas, evaluated over a stack of ten frames, is shown in Fig. 10. From the analysis of both cases it is evident that, this time, augmenting the number of frames using rife hd is associated with a reduction of the noise in the data, visible across the entire frequency range. This quantifies the original observation that rife-augmented images look smoother. The same consideration holds for the other uniform regions that have been investigated.
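For readers wishing to reproduce this analysis, a minimal sketch of the noise-power-spectrum estimate is given below. The mean subtraction, the normalisation convention, and the array names are our assumptions; in practice the convention should match the one used by the CT quality-assessment literature36.

```python
# Sketch of a noise-power-spectrum estimate over a uniform region of
# interest (ROI); roi_stack is a (10, H, W) array cut from the coronal
# frames, and px is the pixel pitch in mm. Normalisation conventions vary.
import numpy as np

def nps_1d(roi_stack, px):
    n, h, w = roi_stack.shape
    nps2d = np.zeros((h, w))
    for roi in roi_stack:
        noise = roi - roi.mean()             # remove the DC/background term
        f = np.fft.fftshift(np.fft.fft2(noise))
        nps2d += (np.abs(f) ** 2) * px * px / (h * w)
    nps2d /= n                               # ensemble average over the frames
    # radial average to obtain a 1D spectrum
    y, x = np.indices((h, w))
    r = np.hypot(x - w // 2, y - h // 2).astype(int)
    radial = np.bincount(r.ravel(), nps2d.ravel()) / np.bincount(r.ravel())
    freqs = np.arange(radial.size) / (h * px)  # spatial frequency in cycles/mm
    return freqs, radial                       # (assumes a square ROI, h == w)
```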

Fig. 10: Noise power spectrum analysis.
figure 10

Analysis of the noise power spectrum conducted over two uniform regions in the coronal view (see Fig. 9 for definition). The comparison is performed between the original and the rife-augmented dataset. The use of rife results in noise reduction for both the selected regions and across the entire frequency range. The first case is displayed in panel (a), and the second case is displayed in panel (b).

It is worth mentioning that medical images undergo significant processing, which impacts how noise and resolution appear in the final results. This is influenced by several factors, such as image acquisition techniques, reconstruction methodologies, and any additional post-processing steps. Further improvements, achievable with different approaches, were not investigated in this work.

We wish to conclude this section with two considerations concerning the augmentation of image sequences in the medical domain. In general, there are two main issues. Firstly, one has to ensure that the quality of the reconstructed images in a sequence is very close to that of the physically acquired images, so that the information content of the sequence is preserved. Such a test has been performed here for the FIB-SEM case, for which a ground truth was available, but only partially for the medical examples. In fact, for the medical examples we were able to evaluate only computer-vision metrics, and not the information content, since no ground truth was accessible to us. Such a test may certainly become possible in the future, initially, most likely, through real images taken with phantoms. The second issue is significantly more complex, as one has to evaluate whether the enhanced tomography helps practitioners make better decisions. This is an issue that must be resolved through a tight collaboration with medical experts, who will provide their human judgment on the usefulness of the augmentation. The format of such an evaluation has still to be developed, but it is certainly desirable to formulate a protocol for both the research and the regulatory environments. In the absence of this, the deployment of AI technology in the medical field will certainly remain limited.

Discussion

We have demonstrated that a state-of-the-art neural network developed for video-frame interpolation can be used to increase the resolution of image sequences in 3D tomography. This can be applied, without further training, across different length scales, going from a few nanometers to millimeters, and to the most diverse types of samples. As the main benchmark, we have considered a dataset of images of printed graphene nanostructured networks, obtained with the destructive FIB-SEM technique. For this, we have carefully evaluated computer-vision metrics, but most importantly, the quality of the information content that can be extracted from the 3D reconstruction. In particular, we have computed the porosity, tortuosity, and effective diffusivity of the original dataset. This was then compared to datasets where an increasing number of images were removed and replaced with computer-generated ones.

In general, we have found that motion-aware video-frame interpolation outperforms the other interpolation strategies we tested. In particular, we have shown that it is not prone to the image blurring typical of simple linear interpolation, or to the resolution loss at the image boundaries displayed by some hybrid optical-flow algorithms. This is due to the coarse-to-fine approach implemented within the rife model, which allows one to make accurate predictions both in terms of intermediate-flow estimation and of level of detail, as demonstrated in the original paper for several datasets16. Most importantly, the error in the determination of morphological observables remains below 2% as long as the milling thickness is less than approximately half of the nanosheet length. This suggests a very favorable experimental condition, in which the effects of the milling on the measured morphology are significantly mitigated.

Then, we have moved our analysis to datasets taken from the medical field. These include a 3D tomography of the human brain volume, acquired with magnetic resonance imaging, and an X-ray computed tomography of the human torso. In the first case a ground truth was available, and we have been able to show that the estimate of the gray-matter volume variation is affected by only 0.5% when half of the images in a scan are replaced with video-frame-interpolated ones. This suggests that the scan rate can actually be increased, saving acquisition time. In contrast, for the CT scan no ground truth was available, so we limited ourselves to estimating computer-vision metrics. In particular, we have shown that the augmentation of the data with computer-interpolated images, in order to reach cubic-voxel resolution, improves the noise power spectrum of the tomographic reconstruction, confirming the visual impression of smoother images. This result may be potentially transformative, since it can pave the way for reduced scan rates, with the consequent reduction in the radiation dose administered to the patient.

In summary, we have shown that video-frame interpolation techniques can be successfully applied to 3D tomography, regardless of the experimental acquisition technique and the nature of the specimen to be imaged. This can improve practices whenever radiation-dose damage or the acquisition time limit the applicability of the method.

Methods

Printed graphene network dataset

The main dataset used in this work contains 801 images, generated with FIB-SEM, of printed nanostructured graphene networks with a nanosheet length of approximately 700 nm. Each image, made of 4041 × 510 pixels, has a 5 nm resolution in the cross-section, while the slice thickness is 15 nm. Therefore, the voxel size in the resulting reconstructed volume is 5 × 5 × 15 = 375 nm³. Note that the voxel size achievable with conventional micro-CT scanners is 10–1000 times larger37,38. Therefore, FIB-SEM nanotomography is more suitable than CT for the quantitative characterization of the graphene-network morphology. Further details on sample preparation and data acquisition can be found in ref. 5.

A fraction of the original dataset is considered for the majority of the analysis, so as to reduce the computational effort and to inspect the images in more detail. Specifically, for the computer-vision metrics and the porosity analysis, 100 images of the original dataset are considered, each cropped to 510 × 510 pixels. In contrast, for the tortuosity and effective-diffusivity study, ten randomly selected volumes are considered, ranging from 55% to 60% of the original volume. It should be noted that in all cases the resolution is not altered.

Neural network

Five main methodologies are usually employed for video-frame interpolation, namely flow-based methods, convolutional neural networks (CNNs), phase-based approaches, GANs, and hybrid schemes. These typically differ from each other in the network architecture and in their mathematical foundation39. rife16 belongs to the flow-based category, whose focus is the determination of the flow between corresponding elements in consecutive frames. When compared to other popular algorithms40,41,42, rife performs better both in terms of accuracy and of computational speed. Models belonging to this class usually involve a two-phase process: the warping of the input frames in accordance with the approximated optical flow, and the use of CNNs to combine and refine the warped frames. The intermediate-flow estimation often requires additional components, such as depth-estimation40 and flow-refinement41 models, so as to mitigate potential inaccuracies. Unlike other methods, rife does not require supplementary networks, a feature that significantly improves the model speed. In fact, the intermediate flow is learned end-to-end by a CNN. As demonstrated in the original rife paper16, learning the intermediate flow end-to-end can reduce motion-modeling-related inaccuracies.

rife adopts a neural-network architecture, IFNet, which directly estimates the intermediate flow with a coarse-to-fine approach at progressively increasing resolution. In particular, the first step is an approximate prediction of the intermediate flow at low resolution, which allows one to capture large motions between consecutive frames. Subsequently, the prediction is refined iteratively by considering frames of gradually increasing resolution, a procedure that allows the model to retain fine details in the interpolated frame. This approach guarantees accurate results, both in terms of flow estimation and of level of detail in the generated frame. Moreover, a privileged distillation scheme is introduced to train the model. Specifically, a teacher model with access to the ground-truth intermediate frame refines the student-model performance.

The input of the rife model can either be a video or a sequence of two images. For this project, the most straightforward solution is to use images. Therefore, we have adapted the model to accept a series of any number of images, instead of only two at a time. The new version of the inference file of the code can be found at the link provided in this manuscript. Although the model is trained on RGB data, it can seamlessly handle grayscale images, as this functionality is inherently embedded in the original code. Specifically, the images are loaded using an OpenCV43 flag, which ensures that they are loaded as grayscale data. As with any large machine-learning model, rife is updated regularly. At the time of writing, the best version available is the HD model v4.6, referred to as rife hd44. This is trained on the Vimeo90K dataset, which covers a large variety of scenes and actions, involving people, animals, and objects45. As the training set is remarkably different from the application cases of this work, we perform a fine-tuning of the available pre-trained model on Quadro RTX 8000 GPUs, provided by Nvidia. Since the fine-tuning of rife hd is currently not possible, the second-best model is considered here, namely rifem46.
The fine-tuning is then performed on a subset of the graphene dataset, made of 1000 portions of the original images, cropped to a 510 × 510-pixel size and not used for testing. The instructions provided by the rife code developers are followed to perform the fine-tuning, with some small code modifications. The modified training file can be found in the provided repository. The results of both the original and the fine-tuned model are presented in this work and compared throughout. Furthermore, we benchmark our scheme against another flow-based deep-learning algorithm, dain40, and against non-deep-learning methods. In particular, we consider the simple but widely used linear interpolation and the IsoFlow algorithm9, an interpolation technique that takes into account the variation between slices by using optical flow10. A hypothetical driver loop of the kind used for the sequence-based inference is sketched after this paragraph.
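The sketch below illustrates how an image sequence can be fed to the interpolation model. Loading grayscale data via an OpenCV flag follows the original code, while model.inference and its timestep argument are simplified placeholders standing in for the actual rife entry points; all names and signatures here should be treated as assumptions, not as the released API.

```python
# Hypothetical driver loop for sequence-based interpolation. The model
# object and its inference(img0, img1, t) signature are simplified
# placeholders for the rife inference code, not the released API.
import cv2

def interpolate_sequence(paths, n_new, model):
    """Insert n_new frames between every pair of consecutive images."""
    # the OpenCV flag ensures the data are loaded as grayscale, as in the
    # original code
    frames = [cv2.imread(p, cv2.IMREAD_GRAYSCALE) for p in paths]
    out = [frames[0]]
    for img0, img1 in zip(frames[:-1], frames[1:]):
        for k in range(1, n_new + 1):
            t = k / (n_new + 1)          # intermediate "time" in (0, 1)
            out.append(model.inference(img0, img1, t))
        out.append(img1)                 # keep the original frame as well
    return out
```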

In closing, we note that here we have explored the current state of the art in video-frame interpolation. One can certainly expect that our results will further improve as newer and more accurate machine-learning architectures for video-frame interpolation become available.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.