Open AccessArticle

Optimizing Data Parallelism for FM-Based Short-Read Alignment on the Heterogeneous Non-Uniform Memory Access Architectures

School of Software, Jiangxi Normal University, Nanchang 330022, China

Jiangxi Provincial Engineering Research Center of Blockchain Data Security and Governance, Nanchang 330022, China

Author to whom correspondence should be addressed.

Future Internet 2024, 16(6), 217; https://doi.org/10.3390/fi16060217

Submission received: 20 April 2024 / Revised: 12 June 2024 / Accepted: 17 June 2024 / Published: 19 June 2024

(This article belongs to the Special Issue Cloud Computing and High Performance Computing (HPC) Advances for Next Generation Internet)

Download

Browse Figures

Versions Notes

Abstract

Sequence alignment is a critical factor in the variant analysis of genomic research. Since the FM (Ferrainas–Manzini) index was developed, it has proven to be a model in a compact format with efficient pattern matching and high-speed query searching, which has attracted much research interest in the field of sequence alignment. Such characteristics make it a convenient tool for handling large-scale sequence alignment projects executed with a small memory. In bioinformatics, the massive success of next-generation sequencing technology has led to an exponential growth in genomic data, presenting a computational challenge for sequence alignment. In addition, the use of a heterogeneous computing system, composed of various types of nodes, is prevalent in the field of HPC (high-performance computing), which presents a promising solution for sequence alignment. However, conventional methodologies in short-read alignment are limited in performance on current heterogeneous computing infrastructures. Therefore, we developed a parallel sequence alignment to investigate the applicability of this approach in NUMA-based (Non-Uniform Memory Access) heterogeneous architectures against traditional alignment algorithms. This proposed work combines the LF (longest-first) distribution policy with the EP (enhanced partitioning) strategy for effective load balancing and efficient parallelization among heterogeneous architectures. The newly proposed LF-EP-based FM aligner shows excellent efficiency and a significant improvement over NUMA-based heterogeneous computing platforms. We provide significantly improved performance over several popular FM aligners in many dimensions such as read length, sequence number, sequence distance, alignment speedup, and result quality. These resultant evaluation metrics cover the quality assessment, complexity analysis, and speedup evaluation of our approach. Utilizing the capabilities of NUMA-based heterogeneous computing architectures, our approach effectively provides a convenient solution for large-scale short-read alignment in the heterogeneous system.

Keywords:

HPC system; heterogeneous architectures; short-read alignment; FM mechanism

1. Introduction

Advances in next-generation sequencing can be seen in the rapid decrease in sequencing costs attributed to genome sequencing, which dropped from USD 1 million in 2007 to USD 600 in 2021 [1]. This drop has led to the exponential growth of sequencing data, which ranges from gigabytes to terabytes and even petabytes. The fact that the capacity of sequence data doubles within 12 months while the processor processing capacity doubles within 18 months results in a significant challenge. Consequently, it requires a more powerful sequence processing program to manage the data deluge [2].

Sequence alignment is the most fundamental process in the analysis of genomic data. Across most variation detection and annotation processes, it is the starting point, leading to biologically significant consequences. It is a highly computing-intensive and resource-heavy process. In genomics, efficient sequence alignment certainly has practical implications, primarily for evaluating data correctly and aligning millions to billions of short-read sequences accurately. The greatest computational challenge, however, is handling the ever-growing data in genomic sequences. Among the operations that the alignment of sequences provides is variant analysis, which is crucial in the process of the interpretation of genetic information or the discovery of cryptic features in DNA, RNA, and protein sequences. It also enables researchers and scientists to analyze and interpret genomic data more compactly and precisely, advancing personalized medicine, genetic disease diagnosis, and drug discovery.

The FM index is renowned in the large bioinformatics community for its enormous ability to handle large-scale short-read sequence alignment projects [3]. Its small size and fast query capabilities in the next-generation sequencing platform pose an optimal condition to accommodate the vast amounts of short-read data. Paolo Ferragina and Giovanni Manzini [4] designed the succinct model of the FM index. This approach was adopted in 1994 when Michael Burrows and David Wheeler suggested compressing text using the Burrows–Wheeler Transform (BWT) before building the FM index. This allows characters in the text sequence to be permuted with the BWT, making the patterns of the transformed data similar to those of the original characters, generally allowing for the better compressibility of the modified data. In addition, the BWT was combined with the sequence alignment properties to form the FM index, further consolidating a wavelet tree data structure. This combination highly augmented efficiency during the pattern matching and searching performed within the compressed datasets, primarily for sequence data. Nowadays, the FM index is widely employed in many short-read sequence aligners to facilitate the efficient and accurate mapping of short reads to genome references [5], including BOWTIE, BWA, and SOAP2.

Heterogeneous architecture nodes could result in a reliable and efficient computational framework within an HPC system [6]. The heterogeneous system utilizes different computing infrastructures that come with various processor architectures to support parallel processing and maximize the extent of computation that can be executed simultaneously [7]. Thus, the heterogeneous system proves to be more reliable for sequence alignment and requires extensive computation. In addition, the heterogeneous system consists of NUMA architectures, which are prevalent in HPC systems. The effective utilization of NUMA can enhance the performance of crucial applications such as short-read sequence alignments. This NUMA architecture enables processing optimization, allowing processors to have much faster access to memory segments and significantly decreasing the bottleneck in data processing. Therefore, the enhancement in such heterogeneous systems can result from the implementation of NUMA-aware optimizations in the computational effectiveness of HPC environments [8].

In this paper, it was observed that the common FM-based alignment algorithms showed limitations in terms of computational efficiency when dealing with longer and more complex datasets. We proposed a parallel approach for short-read sequence alignment using the longest-first alignment method combined with enhanced partitioning to improve the performance of FM aligners for the evaluation of large-scale genomic short-read sequence data. The proposed algorithm employs LF and EP to achieve efficient load balancing and parallelization on heterogeneous NUMA processors of the HPC system. We conducted numerous experiments using sequence datasets from the 1000 Genomes Project and various benchmark datasets to assess the alignment performance of various FM-based algorithms. The alignment performance of the LF-EP-based FM aligner was shown to be particularly outstanding in comparison to established well-known FM aligners, like SOAP2, BOWTIE2, BWA, and GEM3, according to the popular benchmark BALIBASE, OXBENCH, SABRE, and PREFAB. These results in the experiments suggest that the LF-EP-based FM aligner results in better speedup and more accurate alignment results in better quality than other state-of-the-art FM aligners, thus making it a reliable, precise approach for large-scale genomic short-read sequence data. Furthermore, the results also provide valuable insights into future study directions to improve the efficiency of FM aligners in heterogeneous architecture systems.

The following is a summary of the contributions of this investigation:

(1): We developed a convenient parallelism technique for NUMA systems that couples the longest-first distribution strategy with enhanced partitioning. This approach attains effective load balancing and optimal parallelization across heterogeneous NUMA processors on HPC systems. This contribution focuses primarily on the effective and rapid deployment of bioinformatics algorithms on modern heterogeneous architectures.
(2): We introduced the LF-EP-based FM aligner, which utilizes the FM index for short-read sequence alignment with a novel implementation tailored for popular heterogeneous computing platforms that constitute NUMA nodes. This proposed aligner shows significant performance improvements in scalability, speedup, and quality, which is convenient for genomic research where massive data volumes are processed rapidly.
(3): We conducted a comprehensive evaluation of the popular NUMA infrastructures. Our paper demonstrates improvements in parallelism through a thorough review of both main processor platforms. We tested our approach on a wide variety of NUMA architectures, including Intel Xeon and AMD Opteron.
(4): We compared it to prevailing FM aligners and sequence benchmarks. These FM aligners include BWA, SOAP2, BOWTIE2, and GEM and various sequence benchmarks—BALIBASE, OXBENCH, SABRE, and PREFAB. These comprehensive aligners and benchmarks not only highlight the practical advantages but also establish a baseline for future studies to be compared with.

The rest of the paper is organized as follows: a critical survey of related studies in the literature is performed in Section 2. Basic concepts of sequence alignment, an in-depth review of the FM aligner, and details of heterogeneous architectures are presented in Section 3. Our proposed algorithm, longest-first combined with enhanced partitioning, is provided in Section 4. The experimental results and performance evaluation of the proposed algorithm are presented in Section 5. Experimental findings are concluded in Section 6, with prospective research directions detailed in brief.

2. Literature Study

The FM index finds a wide range of applications related to genomic studies, particularly in sequence alignment. It is a suitable approach for aligning short reads to large genome references with a small memory footprint [9]. Most efficient sequence aligners, including BOWTIE [10], BWA [11], HISAT [12], and SOAP2 [13], exploit the FM index-based mechanism. The FM index has two types of indices: a global FM index for the entire genome and many small FM indices for genome fragments that span the whole genome together [3].

Additionally, another area where the FM index is used in genomics studies is whole-genome alignment [14]. Moreover, hardware-accelerated exact and inexact alignment using the FM index has been studied for FPGA-based alignment architectures [15]. An evaluation study, in the context of RNA alignment, showed a significant reduction in memory consumption and runtime of FM index-based aligners compared to hash table-based aligners at nearly identical precision and recall rates [16]. This test also showed the high tolerance of BOWTIE to indels and sequencing errors, so it was not the proper candidate for using high-tolerant but inexact hash-based aligners. While FM-based aligners have been widely adopted as the data structure for several years for the fast alignment of sequences, some performance limitations have been observed in memory-constrained systems when using FM-based aligners compared to hash-based aligners [17].

Sequencing alignment is a collaborative research type in which heterogeneous architectures have been widely used. NUMA has emerged as one of the most widely utilized computing infrastructures [18]. Research focusing on sequence alignments as a trade-off between data locality and data dispersion in the systems based on NUMA highlights an interesting research topic on NUMA-based heterogeneous architecture [16]. Several studies [19,20,21] have quantitatively investigated the performance of various memory allocation policies and data partitioning strategies in NUMA architectures. Many algorithms and tools have been developed, but they do not consider the impact of the computing structure on the overall performance of sequence alignment. For instance, BarraCUDA is one of the short-read sequence aligners that rapidly align sequences using GPUs [22]. It uses a suffix-tree method to locate exact matches in the reference sequence. GSWABE provides all-to-all pairwise global, semi-global, and local alignment through a tile-based method that leverages the computational capability of CUDA-enabled GPUs [23]. However, these new hardware techniques are not aimed at improving the efficiency of the utilization of the existing infrastructures.

Based on the above discussion, using NUMA-based heterogeneous architectures in HPC systems can generally improve sequence alignment algorithms [24]. Processors with NUMA architectures and parallelization techniques enhance the speed of the alignment process. It is very important to note that the size of biological databases is continuously increasing, which results in data being more complex in the alignment process and also needing more time for computation [25]. Additionally, there is a requirement for the development of alignment algorithms that can work with datasets where sequences have varying lengths, which is a must to achieve more precise results for applications requiring high precision [26]. However, much more research is needed to be conducted regarding the investigation of the interplay between stages of alignment and the processor architecture to enhance the performance of the sequence alignment algorithms for NUMA-based heterogeneous architectures. The research aims to address these challenges and contribute to better and more effective alignment algorithms suitable for datasets exhibiting heterogeneity in computing architectures within heterogeneous processors of HPC systems.

3. Background

3.1. FM-Based Mechanism in Sequence Alignment

In the field of bioinformatics, deciphering genetic information is one of the most important tasks for sequence aligners and exposes hidden patterns within DNA, RNA, and protein sequences. The criteria for the selection of the sequence aligners depend on the properties of the data, the level of accuracy, and the computational performance. These aligners are implemented using computational methods, such as heuristics and machine learning algorithms, combined with statistical approaches to increase precision and effectiveness. Notable alignment tools in bioinformatics include MAQ [27], BOWTIE [10], BWA [11], and HISAT [12].

FM index-based sequence alignment is a widely employed tool in genomics research. It generates a compact index for the reference genome, enabling the alignment to be conducted in fixed-size segments using a round-robin pattern [28]. This aligner processes data individually, thereby avoiding interdependence among data segments. FM aligners permit predefined scores for mismatches and gaps, thus increasing the probability of aligning short reads to long reference sequences. It uses a compressed genome reference, which simplifies the search of sequencing data. The BWT data structure is applied to optimize pattern matching [29]. Therefore, the FM-based sequence aligner generally comprises two processes: index building and sequence searching. Index building is the process through which the genome reference is preprocessed. The alignment of short sequences, such as DNA or RNA segments, with the reference genome is called sequence searching. Due to these techniques, FM-based sequence aligners take less time and use fewer processing resources.

The FM alignment approach is a technique of facilitating the features of suffix array (SA) and BWT that helps optimize the alignment of sequences through the use of their structural efficiencies. This combination allows for the rapid alignment of all instances of a given pattern P within a large genome reference (GR). Unlike other traditional alignment methods, FM-based sequence aligner is essentially faster than other popular algorithms, and its computational complexity depends mainly on the length of P. Thus, it is especially useful for large genomes. The detailed computation process is as follows.

(1): SA construction

The creation of suffixes is initially conducted. For a given genome reference GR, all possible suffixes of GR are created by adding a unique character $ at the end of GR to define the ending of the string with a minimal lexicographical value. This is followed by sorting the suffixes in the lexicographical order. The SA is an array with elements indicating the initial positions where a sorted suffix begins in the original string.

(2): BWT computation

Rotations and sorting are performed in the procedure. Each rotation of the string GR is formed through a circular shift in the input string. The set of rotations is sorted lexically, followed by BWT formation. The BWT is constructed by taking the last character from each of the sorted rotations. The reasoning behind using these characters is that, when concatenated, they reveal the repetitive nature of the original string, aiding in effective compression and transformation.

Table 1 illustrates the calculation of SA and BWT of the string GR = GTGTAC. Each row indicates a rotation of the sequences in sorted order. The SA index column contains the original indices of the suffixes before sorting. The rotations column displays every possible rotation of GR. The Sorted-BWT column is the sequence of rotations in sorted order, which directly affects the construction of the BWT. The sorted index represents the SA values for the starting positions of the suffixes in the original string. Thus, the BWT represented as CTAT$GG is derived from the last characters of the sorted rotations in the Sorted-BWT. This BWT sequence is essential for the reconstruction of the original sequence and facilitates effective pattern searching.

3.2. Heterogeneity in the Nodes of HPC Field

Heterogeneity within the HPC field means that the field incorporates various computing infrastructures and architectures, making core configurations much more complex. However, this allows the end user to conveniently utilize diverse resources and enhance the overall computing power of the system. Heterogeneous architectures are significant to the scientific research community as they offer various computing resources that enhance the efficiency, scalability, and resilience of the system [30]. In addition, heterogeneous architectures comprise different kinds of computing nodes with various processor architectures. NUMA processors, such as Intel Xeon and AMD Opteron, can be observed widely used in HPC platforms [31]. These NUMA processors have dedicated local memory banks and local cores, allowing for significant performance improvements in applications like sequence alignment. The architectural features of NUMA processors, when employed in an HPC system, can lead to distinct performance advantages and various bottlenecks.

The architecture of processors is illustrated in Figure 1, showcasing two commonly utilized NUMA nodes as examples: the Intel Xeon E5-4620 and the AMD Opteron-6376. These NUMA processors possess the capacity to accommodate up to four processors, totaling 32 cores, 64 threads, and a memory capacity of 128 GB. Although both NUMA processors have four NUMA nodes in their systems, they differ in terms of the interconnections between the processors. This is because, while the AMD Opteron is a network structure with complete linkages, the Intel Xeon uses a ring structure with a circular connection.

Figure 2 demonstrates the critical principle of NUMA architecture: access to local memory is fast. If a processor requests memory that is not local, then ultimately, it is provided with access to the memory. However, latency is introduced compared to local memory. In NUMA architecture, local access generally refers to a processor accessing memory that is directly attached to that processor or is co-located in the same node. Remote access means that the cores on Processor 1 access memory, which is physically on another processor, Processor 2. In practice, this difference in memory access in NUMA architecture is essential for many research programs and hardware configuration optimizations in performance, but it is often ignored in HPC.

4. The Proposed Algorithm

This research takes into consideration various NUMA computing architectures, which have been used in most of the latest developments in the HPC field. Heterogeneous clusters have become the first preference of most of the research community today because they can form a versatile architecture for computing systems [32]. It is usually much easier to optimize load balance in homogeneous nodes as their similar designs allow for more coherent interaction. Therefore, our proposed scheme primarily focuses on addressing load balancing in heterogeneous systems comprising different types of NUMA architectures. Furthermore, we focus on homogeneous architecture to solve the potential parallelism within the processors. Our system comprises two phases, discussed as follows.

Heterogeneous phase: Load balancing across all the nodes of the cluster is the primary purpose behind our efforts in this phase. We use a longest-first (LF) approach to achieve load balancing among the heterogeneous architectures. This enables us to distribute the workload across many nodes
Homogeneous phase: The fundamental aim of this phase is to maximize the data parallelism on each homogeneous processor within the heterogeneous architectures. To achieve this objective, we have developed an enhanced partitioning (EP) technique by exploiting the available computational resources according to the maximum potential benefits on each node.

4.1. Longest-First Algorithm in the Heterogeneous Phase

The algorithm introduced in this research paper proposes a parallel methodology for partitioning a set of sequence reads, incorporating both the k-mer distance and the maximum length of the sequence read datasets. The k-mer distance is a measure of dissimilarity between genomic sequences with respect to the frequencies of subsequences of length k. The realization of multiple jobs executed simultaneously with an optimal balance is intended to be carried out in a NUMA node-based heterogeneous computing environment.

Algorithm 1. LF algorithm

Input:

(1): Sequence reads set $S = \{S_{i}| 1 \leq i \leq s\}$ $, s = |S|$ $, i \in N +$ .
(2): k-mer distance set of $S$ is $K = \{K_{i}| 1 \leq i \leq k\}$ $, k = |K|$ $, i \in N +$ .
(3): The longest length of a sequence read set $S$ is $l (S)$ .
(4): j-part partitioned sequence reads set $S^{'} = {{S^{'}}_{i} | 1 \leq i \leq j}$ $, j = |S^{'}|$ $, i \in N +$ .

Output:

(1): d-part partitioned sequence reads set $S^{″} = {{S^{″}}_{i} | 1 \leq i \leq d}$ $, d = |S^{″}|$ $, i \in N +$ .

Steps:

1.: Calculate the k-mer distance $K_{i} \in K$ for each element $S_{i} \in S$ in parallel.
2.: Rank sequence reads set $S$ in the decreasing order $K_{i} \geq K_{i + 1}$ based on the values of k-mer distance set K.
3.: Divide $S$ into $j$ subread sets $S^{'} = {{S^{'}}_{i} | 1 \leq i \leq j}$ , where $S^{'} = \{{S^{'}}_{i}| 1 \leq i \leq j - 1, |{S^{'}}_{i}| = |s / j| + 1\} + \{{S^{'}}_{i}| i = j, |{S^{'}}_{i}| = s - (j - 1) (|s / j| + 1)\}$ .
4.: Sort $S^{'}$ in decreasing order ${S^{'}}_{i} \geq {S^{'}}_{i + 1}$ based on the longest length of subsequence reads set $S^{'}$ , make $S^{'} = {S^{'}}_{i} | 1 \leq i \leq j, l ({S^{'}}_{1}) \geq l ({S^{'}}_{2}) \geq \dots \geq l ({S^{'}}_{j - 1}) \geq l ({S^{'}}_{j})$ .
5.: Set ${S^{″}}_{i} \neq \emptyset$ $, {S^{″}}_{i} \subset S^{'}$ $, L_{i}$ is the total sum of complementation time in ${S^{″}}_{i}$ $, 1 \leq i \leq d$ .
6.: For k = 1 to d.
7.: Sort ${S^{″}}_{1}$ $, {S^{″}}_{2}$ $, \dots {S^{″}}_{d - 1}$ $, {S^{″}}_{d}$ in increasing order ${l (S}^{″}_{k - 1}) \leq {l (S^{″}}_{k})$ , such that $L_{1} \leq L_{2} \leq \dots \leq L_{d - 1} \leq L_{d}$ .
8.: Set ${S^{″}}_{k} = {S^{″}}_{k} \cup {S^{'}}_{k}$ $, L_{k} = L_{k} + l ({S^{'}}_{k})$ .
9.: End for.

The following sequence depicts the parallel workflow of the longest-first algorithm in the heterogeneous phase.

Calculating the k-mer distance: For all the sequence reads in the input set, the first operation calculates the k-mer distance in parallel. The k-mer distance is calculated for any two sequence reads and measures the dissimilarity between them.
Ranking sequence reads: The sequence reads are ordered from the highest to the lowest based on k-mer distances. Therefore, the sequence reads with the highest k-mer distance are ranked at the top, the second highest k-mer distance follows, and so on.
Partitioning sequence reads: The ranked sequence reads are partitioned into several sets. There can optimally be different numbers of sets based on the value of j, where j is the number of partitioned sequence read sets. All the sets could be of almost equal size except for the last set, which can have any size based on the total number of sequence reads and the value of j.
Arranging subread sets: The subread sets are arranged in decreasing order from the longest size of the sequence reads in a subset. That means the subread with the longest sequence read is the one of the highest priority, and the subread with the second longest sequence read is the one of the second highest priority, and so on.
Merging partitions: The algorithm enters a loop that runs d times, where d is the final number of partitions. For each of those d iterations, the partitioned sequences are sorted in ascending order according to the sum of total complementation time. At each iteration, the sequence with the least complementation time is merged into the partitioned set, thereby updating the sum of complementation time for the partitioned set.

According to Algorithm 1, some crucial steps we conducted, including steps 3–8, are necessary to make an explanation. Step 3 describes how sequence reads are partitioned into subsets. It ensures that the sequence reads are divided as evenly as possible among the subsets.

S^{'} = {S'_{i} ∣ 1 \leq i \leq j - 1, |S'_{i}| = ⌊\frac{s}{j}⌋ + 1} + {S'_{i} ∣ i = j, |S'_{i}| = s - (j - 1) (⌊\frac{s}{j}⌋ + 1)}

There are components that exist in Step 3, including

S

, the set of all sequence reads;

S^{'}

, the set of partitioned subread sets;

j

, the number of divided sequence reads;

s

, the total number of sequence reads; and

S'_{i}

, the

i

-th subread set.

One part of the formula

{S'_{i} ∣ 1 \leq i \leq j - 1, |S'_{i}| = ⌊\frac{s}{j}⌋ + 1}

describes the first

j - 1

subread sets. Each of these sets

S'_{i}

(for

1 \leq i \leq j - 1

) contains

⌊\frac{s}{j}⌋ + 1

sequence reads. The floor function effectively balances the distribution by adding 1 to each subset. The other part of the formula

{S'_{i} ∣ i = j, |S'_{i}| = s - (j - 1) (⌊\frac{s}{j}⌋ + 1)}

describes the last subread set

S'_{j}

. The size of this last set is adjusted to ensure that the total number of sequence reads is accounted for. The size of

S'_{j}

is calculated as s minus the total number of reads allocated to the first

j - 1

sets, which is

(j - 1) (⌊\frac{s}{j}⌋ + 1)

Step 3 has two main characteristics. One is balancing subsets. The formula ensures that the first

j - 1

subsets are filled with a nearly equal number of sequence reads. Each of these subsets receives

⌊\frac{s}{j}⌋ +

1 reads, where

⌊\frac{s}{j}⌋

is the integer division of

s

j

. The other characteristic is adjusting the last subset. The last subset

S'_{j}

contains the remaining reads, which may not be divided evenly among the other subsets. This ensures that all reads are included in the partitioning without any being left out.

Step 4 involves sorting the partitioned sets of sequence reads based on the longest length of the subsequence reads within each set. The details are discussed in the following explanation. We sort

S^{'}

in decreasing order. The subsets

{S^{'}}_{i}

are sorted such that

{S^{'}}_{i} \geq {S^{'}}_{i} + 1

based on the length of the longest sequence read in each subset. This means that the subset with the longest sequence read is first, the subset with the next longest sequence read is second, and so on. Let

l ({S^{'}}_{i})

denotes the length of the longest sequence read in subset

{S^{'}}_{i}

. The sorted subsets

S^{'}

should satisfy the following condition. The length of the longest sequence read in the first subset is

l ({S^{'}}_{1})

, in the second subset is

l ({S^{'}}_{2})

, and so on.

S^{'} = {{S^{'}}_{i} ∣ 1 \leq i \leq j, l ({S^{'}}_{1}) \geq l ({S^{″}}_{2}) \geq \dots \geq l ({S^{'}}_{j - 1}) \geq l ({S^{'}}_{j})}

Steps 5–8 describe the final steps of the algorithm for partitioning and balancing the sequence reads based on their complementation times. These steps involve iterating through the final sets, sorting them, and merging them to ensure balanced load distribution. These steps first make an initialization. We ensure that each final set

{{S^{″}}_{i}}

is not empty and is a subset of the sorted sets

S^{'}

. This is followed by the iterative merging and sorting of the final sets. In the iteration process, we sort the final sets

{{S^{″}}_{1}, {S^{″}}_{2}, \dots, {S^{″}}_{d - 1}, {S^{″}}_{d}}

in increasing order based on the longest sequence read length

l ({S^{″}}_{k})

. We ensure that the total complementation times are in increasing order:

L_{1} \leq L_{2} \leq \dots \leq L_{d - 1} \leq L_{d}

. Then, we merge sequence reads from

{{S^{'}}_{k}}

into

{{S^{″}}_{k}}

{{S^{″}}_{k} = {S^{″}}_{k} \cup {S^{'}}_{k}}

. Finally, we update the total complementation time by

L_{k} = L_{k} + l ({S^{'}}_{k})

. These steps ensure that the sequence reads are efficiently and effectively balanced across the final sets, minimizing the total complementation time and achieving an optimal distribution of the computational load. This approach maximizes the performance and efficiency of sequence alignment tasks in high-performance computing environments.

According to the discussions above, the proposed LF method is a parallel approach for sequence read partitioning based on the average k-mer distance and the longest length of an obtained sequence read set. This implementation works in heterogeneous computing environments where numerous tasks can be performed in parallel. Algorithm 1 starts by calculating the k-mer distance between sequences in parallel and then ranking the sequences in decreasing order. These are later partitioned into subread sets by considering the value of

j

with the decreasingly ranked sequence reads. Finally, the subread sets are sorted in descending order by the sequence read length. The final partitioning loop is then run

d

times to ensure that the tasks are equally distributed among the processing units. These configurations ensure the efficiency and synchronization of the implementation process.

4.2. Enhanced Partitioning Algorithm in Homogeneous Phase

As shown in Figure 3, there are only two genome datasets involved in the sequence alignment procedure.

S

represents a short-read dataset and

G

represents a long genome reference.

S_{x}

is the individual dataset of short reads at node

x

S_{x y}

corresponds to a particular block in short read

S_{x}

and

s

is the total amount of blocks of

S_{x}

G_{x}

is the responding genome reference at node

x

G_{x y}

corresponds to a particular replica of the genome reference of

G_{x}

and

g

is the number of replicas of

G_{x}

. Therefore, short-read subsets in node

x

can be referred to as

S_{x 1}, S_{x 2}, \dots, S_{x y}, \dots, S_{x s}

and genome reference replicas in node

x

are referred to as

G_{x 1}, G_{x 2}, \dots, G_{x y}, \dots, G_{x g}

in Figure 3. Within the sequence alignment procedure, three pivotal steps can be identified: establishing, mapping, and gathering. These steps are summarized as follows.

(1): Establishing stage: The aligner builds the index file for the genome reference $G_{x y}$ ; the wall time of this stage is denoted as ${T (G}_{x y})$ .
(2): Mapping stage: The aligner achieves an alignment sub-result $R_{x y} = m a p (S_{x y}, G_{x y})$ by mapping short read $S_{x y}$ against genome reference $G_{x y}$ ; the wall time of this stage is denoted as ${T (R}_{x y})$ .
(3): Gathering stage: The aligner collects sub-results from stage 2 and takes some necessary configurations to merge the final output $R = \sum R_{x y}$ ; the wall time of this stage is denoted as $T (R)$ .

The majority of the total time in the sequence alignment is typically allocated to the establishing and mapping stages. It is noteworthy that the mapping stage necessitates a substantial allocation of computational resources and entails the most significant computing resource. The generation and dissemination of the index file can be performed in advance and shared among multiple instances during the experiments. Thus, the overall execution time of the three stages, denoted as

T = {T (G}_{x y}) + \sum T (R_{x y}) + T (R)

, can be simplified to

T = \sum T (R_{x y})

in general scenarios.

A range of s and g values is taken into consideration for a comprehensive view, which is detailed further in the following section.

(1): Case 1: s = g = 1. This case discusses the native implementation of the original structure of a sequence aligner built for mapping complete reads to a single genome reference.
(2): Case 2: s = 1, g > 1. This case is utilized for the purpose of conducting the alignment of complete reads that are shared, against a set of g replicates of the reference genome. The use of many replicas of the genetic reference inside a NUMA architecture system offers benefits in terms of system architecture, as this design incorporates a large number of memory blocks.
(3): Case 3: s = 1, 0 < g < 1. This case involves mapping the complete reads to a subsequence of the genome reference, specifically 1/g of the reference sequence. The genome reference is partitioned into 1/g segments, which are separately aligned with the whole set of read data.
(4): Case 4: s > 1, g = 1. This case is implemented by dividing the reads into s subsets, which are mapped against a complete replica of the genome reference. The dataset is partitioned into s segments, and a complete genome reference is distributed across the threads residing in the memory.
(5): Case 5: s > 1, g > 1. This case is implemented with s subsets of reads being mapped to g replicates of the genome reference. The short read and genome reference are divided or duplicated into s segments and g duplicates, respectively. A partitioned subread can be regarded as a genome reference that has been repeated, resulting in the equation s = g. Thus, this case is suitable for large memory systems.
(6): Case 6: s > 1, 0 < g < 1. This case is implemented by mapping s subsets of the reads to 1/g subsequence of the genome reference. Both the short read and the genome reference are divided into s and 1/g segments, respectively. A partitioned subread can be viewed as a genome reference that has been replicated, resulting in a value of s = 1/g. Therefore, this case is suitable for memory-sensitive systems.

Table 2 presents a taxonomy of the alignment approaches within the context of the EP algorithm, categorized by the number of read subsets and genome replicates involved. Additionally, Table 2 displays operational descriptions and illustrative applications in genomics. The examples in cases 1–4 provide a comprehensive overview of the steps in the EP method evaluation and critically illustrate its effectiveness in performance. They also establish a baseline for a more detailed comparison and analysis in the experiments. Cases 5 and 6 are discussed to illustrate the best-case scenarios for the EP approach in terms of its implementation quality in a heterogeneous system with distinctive architectural features.

The pseudocode of the EP algorithm is provided in Algorithm 2. Such an algorithmic framework is typical in bioinformatics because it facilitates the efficient processing of large-scale sets of sequences. It is a scalable and parallelizable methodology applicable to the complex datasets commonly associated with genomic research.

Algorithm 2. EP algorithm

Input:

1.: reads: The set of reads to be aligned.
2.: genome_reference: The genome reference used for alignment.
3.: num_read_partitions s: The number of subsets/partitions for reads.
4.: num_genome_partitions g: The number of replicas/partitions for the genome reference.

Output:

5.: alignment_results: The alignment results for the reads against the genome reference.

Steps:

6.: Function partitionReads(read reads, int num_partitions) {return array of read partitions}.
7.: Function replicateGenome(genome original_genome, int num_replicas) {return array of replicated genome partitions}.
8.: Function partitionGenome(genome original_genome, float partition_ratio) {return array of genome partitions}.
9.: Function alignPartition(readPartition reads, genomePartition genome) {return alignment results for that partition}.
10.: Function enhancedAlignment(read reads, genome genome, int num_read_partitions, int num_genome_partitions) {

Steps:

a.: Initialize an array alignment_results for storing results.
b.: If no partitioning is required (both num_read_partitions == 1 and num_genome_partitions are == 1).
c.: Directly align reads to the genome using a standard alignment function.
d.: Else:
e.: Partition reads using partitionReads().
f.: Depending on the value of num_genome_partitions, either replicate the genome using replicateGenome() or partition it using partitionGenome().
g.: If (num_genome_partitions == 1) {genome_partitions = genome;}
h.: Else if (num_genome_partitions > 1) { genome_partitions = replicateGenome(genome, num_genome_partitions);}
i.: Else {genome_partitions = partitionGenome(genome, 1/num_genome_partitions);}
j.: Loop through each combination of reads and genome_partitions:
k.: Align each read partition to each genome_partition using alignPartition().
l.: Store the results in alignment_results.
m.: Outputs: alignment_results.

11.: }

5. Results and Evaluations

5.1. Experimental Environment

In general, the data utilized in our research are derived from the international 1000 Genomes Project, which commenced in January 2008. We specifically selected the human genome (ID: HS37D5) as our reference sequence, which can reach a size of up to three gigabytes. For one of our natural short-read datasets, we employed the SRR766060 paired-end genome, with each segment approximately measuring 8.7 GB in size. Furthermore, we incorporated four widely recognized benchmark datasets in their latest version for alignment quality evaluation purposes—BALIBASE, OXBENCH, SABRE, and PREFAB. Additionally, for speedup evaluation and sequence complexity experiments, we utilized specified sequence datasets obtained from several widely implemented read simulation tools in bioinformatics, such as randDNA and wgsim. The utilization of these compelling datasets has significantly contributed to obtaining reliable outcomes throughout the experiment we conducted.

The overview information on the experimental environment and configurations is shown in Table 3. Several additional widely used FM algorithms—SOAP2, BOWTIE2, BWA, and GEM3—were used in this research to comprehensively compare several significant issues. The experiments were also conducted on two different NUMA nodes, Intel Xeon and AMD Opteron, which are also commonly used in heterogeneous systems. These NUMA nodes have the capacity to support four processors with 32 cores, 64 threads, and 128 GB of memory. Although each processor has four NUMA nodes, they are interconnected differently among the cores. AMD Opteron 6376 utilizes a core network with full ties, while Intel Xeon E5 4620 employs a ring structure with circular connections among its cores, as shown in Figure 1.

5.2. Quality Assessment of the Alignment Results

In this section, we compare the LF-EP FM aligner’s performance with the original FM aligner and other state-of-the-art FM algorithms such as SOAP2, BOWTIE2, BWA, and GEM3. We use BALIBASE, OXBENCH, SABRE, and PREFAB benchmarks to evaluate alignment quality. These benchmarks are highly respected in sequence alignment because they effectively evaluate aligner performance with respect to different features of alignment accuracy. This aims to assess aligner effectiveness across multiple benchmarks under various scenarios and sequence attributes. The benchmarks have been converted to the universally accepted FASTA format (as short-read sequences), which facilitates the following analytical processes.

In our evaluation, we use two accuracy metrics: Quality Score (QS) and Total Column Score (TCS). The QS measures the fraction of correct residue pairs aligned relative to the total number of residue pairs in the alignment. On the other hand, the TCS measures the correct alignment of columns relative to the total number of columns in the alignment. The TCS shows that the aligners are capable of reflecting the structural features of the sequences. The selection of the algorithms and the benchmarks is made in such a way that they represent a broad spectrum of methodologies for sequence alignment, allowing for the creation of a global assessment of the alignment performances of the aligners. The same FM-based algorithms are used in aligners tested in the benchmarks. Therefore, it might be expected that these various aligners would show slight differences in result quality. The absolute quantities QS and TCS are dimensionless; only their relative ratio is of interest.

Table 4 illustrates the QS and TCS values computed for different benchmarks employed in the present study. Since the QS and TCS values are dataset-dependent, we assess their relative values to provide a comparative assessment of the effectiveness of different sequence aligners. From the analysis in Table 4, the proposed LF-EP-based FM aligner surpassed the rest, with the highest scores in both QS and TCS. Evaluation results reveal that the LF-EP-based FM aligner is among the most highly reliable and accurate FM aligners compared with previous aligners, thus signifying the ability to integrate the LF-EP algorithm into the framework of FM aligner.

Furthermore, SOAP2 was observed to perform at almost the same level as the LF-EP FM aligner, albeit a little lower. It can be observed that QS and TCS scores of SOAP2 are the second highest across most benchmarks, indicating that this tool is quite promising and could improve its alignment performance with further development. Additionally, BOWTIE2 shows a small decrease in performance quality compared to SOAP2, suggesting it might be suboptimal for strict quality standards. BWA also shows a similar drop in performance to BOWTIE2 but maintains impressive QS and TCS scores consistently during the evaluations. This finding positions BWA as a practical choice based on particular task needs and circumstances. Conversely, GEM3 displays the least impressive performance among the specified FM aligners. Although adequate for some tasks, its effectiveness is restricted in situations where top-tier accuracy is essential.

Therefore, the LF-EP-based FM aligner introduced in this research outperforms the other popular FM aligners across the four benchmarks, demonstrating its excellent reliability and precision. This finding substantiates the efficacy of implementing the LF-EP algorithm. Although SOAP2 demonstrates robustness and dependability, there is still room for improvement. Furthermore, the evaluation indicates that the performance quality of BOWTIE2, BWA, and GEM3 decreases progressively. Therefore, their suitability might fluctuate with the changing needs of specific tasks, ranging from high to lower levels.

5.3. Sequence Complexity Analysis

In biological sequences, which include DNA, RNA, and protein sequences, etc., average pairwise distance refers to the average genetic variation among all possible pairs within a particular set of sequences. It is used to show similarities or differences among a group of sequences. This complexity is examined by testing the datasets with 1000 sequences generated using the randDNA program. The average pairwise distance between sequences is varied from 50 to 500. The experiments are implemented on the selected datasets using the proposed algorithm along with other popular FM aligners such as BWA, SOAP2, BOWTIE2, and GEM3. The speedup factor is determined by measuring the performance efficiency of the BWA aligner in processing sequences with an average pairwise distance of 50.

The runtimes of different FM aligners on sequence datasets in the heterogeneous system of Intel Xeon and Intel Xeon Phi are depicted in Figure 4. As the sequence distance increases, the speedup factor decreases for all aligners. This implies that the acceleration effect weakens in proportion to the sequence distance increase. The LF-EP-based FM aligner exhibits the largest speedup factor, reaching 4.4× at a sequence distance of 50, which demonstrates that the LF-EP-based FM aligner has a substantial accelerating effect on short-distance sequences. Additionally, the LF-EP-based FM aligner performs proficiently even at longer distances such as 300, 400, and 500. The LF-EP-based FM aligner consistently displays the highest speedup factor across all sequence distances, hence confirming its suitability for managing high-complexity data. Overall, GEM3 and BOWTIE2 demonstrate lower speedup factors than the LF-EP-based FM aligner. Most of their speedup factors are below 3, indicating less significant speed improvements. Therefore, GEM3 and BOWTIE2 may require further optimization to improve their speedup factors on high-complexity data. At shorter sequence distances (50 and 100), SOAP2 and BWA have speedup factors below 1, indicating that their speed is comparable to the original speed. Nonetheless, their speedup factors decrease as the sequence distance increases. Thus, SOAP2 and BWA are appropriate for shorter sequence distances, but their performance may have speedup limitations as the distance increases.

In conclusion, the research findings indicate that different aligners respond differently to variations in sequence distance, which affects their performance capabilities. The LF-EP-based FM aligner has been proven to be more efficient than SOAP2 in aligning sequences of both short and long distances. SOAP2, however, consistently demonstrates the least effective performance across all sequence distances. The BOWTIE2 algorithm exhibits a pronounced decline in its speedup as the sequence distance increases. The LF-EP-based FM aligner is the optimal tool for conducting complex sequence alignments on the Xeon–Xeon Phi heterogeneous system. It consistently demonstrates superior acceleration across various read lengths and is particularly well suited for large-scale datasets encompassing a mix of short and long sequences within a heterogeneous computing environment. The LF-EP-based FM aligner is an optimal solution for the management of a diverse range of sequence data due to its remarkable scalability.

5.4. Speedup Evaluation of FM Aligners

To accurately assess the speedup provided by the LF-EP-based FM aligner, we performed the following task of analyzing several sequence datasets. Some sequence mimic techniques, such as randDNA and wgsim, were used to establish these artificial datasets. In order to provide a complete picture of the speedup evaluation in these aligners, we included sequences whose average lengths ranged from 50 to 500, as well as a varied number of sequences ranging from 100 to 10,000.

In Figure 5, we provide a detailed comparison of the speedup achieved with the LF-EP-based FM aligner versus the original FM aligner, which was implemented in the native mode. We calculate the speedup based on the original FM aligner divided by the LF-EP-based FM aligner in each instance. Our performance analysis of the proposed algorithm shows an excellent speedup of up to 24.2×. It is seen that the speedup decreases as the average sequence length increases. This is because the computational complexity of the FM aligner is sequence length-based. As the analyzed sequences become more extensive, the demand for physical memory and computing power becomes high, resulting in a performance bottleneck for FM aligner implementation in such a heterogeneous architecture.

Furthermore, our study investigates those performance differences that really exist between the LF-EP-based FM aligner and the multi-threading FM aligner. This further analysis enables us to understand the relative strengths and limitations of each approach. Figure 6 gives a valuable overview of the speedup achieved by the LF-EP-based FM aligner concerning the variant of the multi-threading FM aligner. Speedup is obtained concerning the performance of the original FM aligner.

Three sequence datasets were generated using mimic sequence tools in the experiments, each varying in average sequence length and number of sequences. The first dataset had an average length of 50, the second an average length of 200, and the third an average length of 500. Each dataset contained sequences ranging from 100 to 10,000. The results shown in Figure 6 indicate that both the LF-EP-based FM aligner and the multi-threading FM aligner demonstrate remarkable speedup for the dataset with an average sequence length of 50. Additionally, a 24.2× speedup is obtained for the LF-EP-based aligner in a heterogeneous system, while this speedup is 16.3× for the multi-threading FM aligner. These results indicate substantial benefits of the LF-EP-based aligner, leading to a largely reduced execution time compared to the multi-threading alignment.

Thus, performance is significantly more different when analyzing other datasets whose average sequence lengths are 200 and 500. In these cases, although still impressively showing the speedup, the FM aligner based on the LF-EP method attains a lower value of 8.9×. The drop in speedup may be attributed to various factors, particularly the complexities of longer sequences and the corresponding increase in computational demands.

The efficacy of our parallel sequence alignment approach is apparent in its enhanced performance on sequences of moderate length, such as 50 or 200 lengths. However, it unveils intrinsic constraints when examining lengthier and more intricate data. The approach under consideration demonstrates significant improvements in computational efficiency when tested on artificially generated datasets, with average sequence lengths varying from 50 to 200 and average sequence quantities ranging from 100 to 1000. However, it is noticed that when the size of the datasets became greater than 200 in length or greater than 1000 in number, the expected improvement in computational performance became less significant. It also reveals some potential when larger datasets were used with our approach. In summary, our parallel approach demonstrates significant potential under specific circumstances but requires more refinement to be applicable to a wide range of sequence lengths and varied amounts of sequences.

6. Conclusions and Future Works

In this research, we develop and assess an optimized approach to the FM-based approach of short-read alignment on heterogeneous NUMA architectures. The novel integration of the longest-first algorithm and enhanced partitioning technique significantly enhances load balancing and data parallelism, addressing the computational challenges exhibited due to the exponential growth in sequencing data.

The proposed LF-EP-based FM aligner was benchmarked against popular tools such as BALIBASE, OXBENCH, SABRE, and PREFAB. These experiments demonstrate that our method, in terms of alignment quality and computational efficiency, outperforms some popular existing FM-based aligners. Our process has shown remarkable speedup and reduction in execution time, particularly in the handling of complex and large-scale genomic datasets within HPC systems. This approach we conducted is scalable and highly efficient, and our alignment management is robust in handling data deluge, thus assuring huge potential in personalized medicine for genetic disease diagnostics and drug discovery. Furthermore, our comprehensive quantification and correlation of sequence complexity underline the robustness of the LF-EP-based FM aligner in the management of high-complexity data from artificial data and the 1000 Genome Project.

Therefore, the LF-EP-based FM aligner is an impressive improvement to the sequence alignment technique. It is a very reliable and effective tool for genomics research. This research contributes to and opens new projects on the optimization of computational frameworks in heterogeneous environments within the field of bioinformatics, thereby enhancing the processing and interpretation ability of vast amounts of genomic data.

In the future, it is worth exploring and further developing the FM-based mechanism in areas of genomic studies [33]. Consequently, it is highly recommended that integration with other robust algorithms and data structures be conducted so that it becomes more enhanced, improved, and effective towards sequence alignment in the heterogeneous NUMA architectures across various HPC platforms such as CPU plus GPU, FPGA, etc. [34]. Furthermore, evaluations are recommended to be conducted on other reference sequences from different types of organisms. Therefore, this could ensure we validate our algorithm in detail concerning its efficiency and robustness.

Author Contributions

Conceptualization, S.C. and Y.D.; Methodology, S.C.; Validation, Y.D., L.L. and X.Y.; Investigation, Y.D.; Resources, X.Y.; Data Curation, L.L.; Writing—Original Draft Preparation, Y.D.; Writing—Review & Editing, S.C.; Visualization, L.L. and X.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Natural Science Foundation of Jiangxi Province, with grant numbers 20212BAB212007 and 20212BAB202016.

Data Availability Statement

The data presented in this study are concluded in Table 3.

Acknowledgments

The authors would like to acknowledge all the support received from Jiangxi Normal University and Jiangxi Provincial Engineering Research Center of Blockchain Data Security and Governance.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Czarnul, P. Parallelization of Large Vector Similarity Computations in a Hybrid CPU+GPU Environment. J. Supercomput./J. Supercomput. 2017, 74, 768–786. [Google Scholar] [CrossRef]
Zeng, G.; Zhu, J.; Zhang, Y.; Chen, G.; Yuan, Z.; Wei, S.; Liu, L. A High-Performance Genomic Accelerator for Accurate Sequence-to-Graph Alignment Using Dynamic Programming Algorithm. IEEE Trans. Parallel Distrib. Syst. 2024, 35, 237–249. [Google Scholar] [CrossRef]
Kim, D.; Paggi, J.M.; Park, C.; Bennett, C.; Salzberg, S.L. Graph-Based Genome Alignment and Genotyping with HISAT2 and HISAT-Genotype. Nat. Biotechnol. 2019, 37, 907–915. [Google Scholar] [CrossRef]
Ferragina, P.; Manzini, G.; Mäkinen, V.; Navarro, G. Compressed Representations of Sequences and Full-Text Indexes. ACM Trans. Algorithms 2007, 3, 20. [Google Scholar] [CrossRef]
Anderson, T.; Wheeler, T.J. An Optimized FM-Index Library for Nucleotide and Amino Acid Search. Algorithms Mol. Biol. 2021, 16, 25. [Google Scholar] [CrossRef]
Michelogiannakis, G.; Klenk, B.; Cook, B.; Teh, M.Y.; Glick, M.; Dennison, L.; Bergman, K.; Shalf, J. A Case for Intra-Rack Resource Disaggregation in HPC. ACM Trans. Archit. Code Optim. 2022, 19, 1–26. [Google Scholar] [CrossRef]
Gai, K.; Qin, X.; Zhu, L. An Energy-Aware High Performance Task Allocation Strategy in Heterogeneous Fog Computing Environments. IEEE Trans. Comput. 2021, 70, 626–639. [Google Scholar] [CrossRef]
Jia, M.; Zhang, Y.; Gan, X.; Li, D.; Xu, E.; Wang, R.; Lu, K. vGraph: Memory-Efficient Multicore Graph Processing for Traversal-Centric Algorithms. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC ‘22), Dallas, TX, USA, 13–18 November 2022. [Google Scholar] [CrossRef]
Hong, A.; Oliva, M.; Köppl, D.; Bannai, H.; Boucher, C.; Gagie, T. Pfp-Fm: An Accelerated FM-Index. Algorithms Mol. Biol. 2024, 19, 15. [Google Scholar] [CrossRef]
Langmead, B.; Salzberg, S.L. Fast Gapped-Read Alignment with Bowtie 2. Nat. Methods 2012, 9, 357–359. [Google Scholar] [CrossRef]
Li, H.; Durbin, R. Fast and Accurate Long-Read Alignment with Burrows–Wheeler Transform. Bioinformatics 2010, 26, 589–595. [Google Scholar] [CrossRef]
Kim, D.; Langmead, B.; Salzberg, S.L. HISAT: A Fast Spliced Aligner with Low Memory Requirements. Nat. Methods 2015, 12, 357–360. [Google Scholar] [CrossRef]
Li, R.; Yu, C.; Li, Y.; Lam, T.-W.; Yiu, S.-M.; Kristiansen, K.; Wang, J. SOAP2: An Improved Ultrafast Tool for Short Read Alignment. Bioinformatics 2009, 25, 1966–1967. [Google Scholar] [CrossRef]
Marchet, C.; Boucher, C.; Puglisi, S.J.; Medvedev, P.; Salson, M.; Chikhi, R. Data Structures Based on K-Mers for Querying Large Collections of Sequencing Data Sets. Genome Res. 2020, 31, 1–12. [Google Scholar] [CrossRef]
Yang, C.-H.; Wu, Y.-C.; Chen, Y.-L.; Lee, C.-H.; Hung, J.-H.; Yang, C.-H. An FM-Index Based High-Throughput Memory-Efficient FPGA Accelerator for Paired-End Short-Read Mapping. IEEE Trans. Biomed. Circuits Syst. 2023, 17, 1331–1341. [Google Scholar] [CrossRef]
Ruffalo, M.; LaFramboise, T.; Koyutürk, M. Comparative Analysis of Algorithms for Next-Generation Sequencing Read Alignment. Bioinformatics 2011, 27, 2790–2796. [Google Scholar] [CrossRef]
Chen, J.; Chao, J.; Liu, H.; Yang, F.; Zou, Q.; Tang, F. WMSA 2: A Multiple DNA/RNA Sequence Alignment Tool Implemented with Accurate Progressive Mode and a Fast Win-Win Mode Combining the Center Star and Progressive Strategies. Brief. Bioinform. 2023, 24, bbad190. [Google Scholar] [CrossRef]
Yang, Z.; Zhang, A.; Mo, Z. PsmArena: Partitioned Shared Memory for NUMA-Awareness in Multithreaded Scientific Applications. Tsinghua Sci. Technol. /Tsinghua Sci. Technol. 2021, 26, 287–295. [Google Scholar] [CrossRef]
Sadiq, M.U.; Yousaf, M.M. A Scalable Parallel Algorithm for Global Sequence Alignment with Customizable Scoring Scheme. Concurr. Comput. 2023, 35, e7888. [Google Scholar] [CrossRef]
Papadakis, O.; Zakkak, F.S.; Foutris, N.; Kotselidis, C. You Can’t Hide You Can’t Run: A Performance Assessment of Managed Applications on a NUMA Machine. In Proceedings of the 17th International Conference on Managed Programming Languages and Runtimes (MPLR ’20), Manchester, UK, 4–6 November 2020. [Google Scholar] [CrossRef]
Gao, X.; Liu, R.; Kaushik, A. Hierarchical Multi-Agent Optimization for Resource Allocation in Cloud Computing. IEEE Trans. Parallel Distrib. Syst. 2021, 32, 692–707. [Google Scholar] [CrossRef]
Klus, P.; Lam, S.; Lyberg, D.; Cheung, M.S.; Pullan, G.; McFarlane, I.; Yeo, G.S.; Lam, B.Y. BarraCUDA—A Fast Short Read Sequence Aligner Using Graphics Processing Units. BMC Res. Notes 2012, 5, 27. [Google Scholar] [CrossRef]
Doblas, M.; Lostes-Cazorla, O.; Aguado-Puig, Q.; Cebry, N.; Fontova-Musté, P.; Batten, C.F.; Marco-Sola, S.; Moretó, M. GMX: Instruction Set Extensions for Fast, Scalable, and Efficient Genome Sequence Alignment. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO ‘23), Toronto, ON, Canada, 28 October–1 November 2023. [Google Scholar] [CrossRef]
Polyanovsky, V.O.; Roytberg, M.A.; Tumanyan, V.G. Comparative Analysis of the Quality of a Global Algorithm and a Local Algorithm for Alignment of Two Sequences. Algorithms Mol. Biol. 2011, 6, 25. [Google Scholar] [CrossRef]
Aguado-Puig, Q.; Doblas, M.; Matzoros, C.; Espinosa, A.; Moure, J.C.; Marco-Sola, S.; Moreto, M. WFA-GPU: Gap-Affine Pairwise Read-Alignment Using GPUs. Bioinformatics 2023, 39, 701. [Google Scholar] [CrossRef]
Liu, B.; Warnow, T. WITCH-NG: Efficient and Accurate Alignment of Datasets with Sequence Length Heterogeneity. Bioinform. Adv. 2023, 3, vbad024. [Google Scholar] [CrossRef]
Li, H.; Ruan, J.; Durbin, R. Mapping Short DNA Sequencing Reads and Calling Variants Using Mapping Quality Scores. Genome Res. 2008, 18, 1851–1858. [Google Scholar] [CrossRef]
Simpson, J.T.; Durbin, R. Efficient Construction of an Assembly String Graph Using the FM-Index. Bioinformatics 2010, 26, i367–i373. [Google Scholar] [CrossRef]
Jiang, L.; Zokaee, F. EXMA: A Genomics Accelerator for Exact-Matching. In Proceedings of the 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Seoul, Republic of Korea, 27 February–3 March 2021. [Google Scholar] [CrossRef]
Kylasa, S.B.; Aktulga, H.M.; Grama, A.Y. Reactive Molecular Dynamics on Massively Parallel Heterogeneous Architectures. IEEE Trans. Parallel Distrib. Syst. 2017, 28, 202–214. [Google Scholar] [CrossRef]
Szustak, L.; Wyrzykowski, R.; Kuczynski, L.; Olas, T. Architectural Adaptation and Performance-Energy Optimization for CFD Application on AMD EPYC Rome. IEEE Trans. Parallel Distrib. Syst. 2021, 32, 2852–2866. [Google Scholar] [CrossRef]
Sun, Y. Construction of Artistic Design Patterns Based on Improved Distributed Data Parallel Computing of Heterogeneous Tasks. Math. Probl. Eng. 2022, 2022, 3890255. [Google Scholar] [CrossRef]
Liu, H.; Zou, Q.; Xu, Y. A Novel Fast Multiple Nucleotide Sequence Alignment Method Based on FM-Index. Brief. Bioinform. 2021, 23, bbab519. [Google Scholar] [CrossRef]
Wilton, R.; Szalay, A.S. Performance Optimization in DNA Short-Read Alignment. Bioinformatics 2022, 38, 2081–2087. [Google Scholar] [CrossRef]

Figure 1. Two common NUMA architectures employed by Intel Xeon and AMD Opteron processors.

Figure 2. Two memory access in NUMA architectures.

Figure 3. The brief workflow of sequence alignment.

Figure 4. Sequence complexity comparison among various FM aligners.

Figure 5. Performance comparison between original FM aligner and LF-EP-based FM aligner.

Figure 6. Performance comparison between multi-threading FM aligner and LF-EP-based FM aligner, t = threads, L = length.

Table 1. An illustration of SA and BWT computations based on single GR sequence.

GR = GTGTAC
SA Index	Rotations	Sorted-BWT	BWT	Sorted Index
0	$GTGTAC	$GTGTAC	C	0
1	GTGTAC$	AC$GTGT	T	5
2	TGTAC$G	C$GTGTA	A	6
3	GTAC$GT	GTAC$GT	T	3
4	TAC$GTG	GTGTAC$	$	1
5	AC$GTGT	TAC$GTG	G	4
6	C$GTGTA	TGTAC$G	G	2

Table 2. Summary of alignment cases based on different configurations of

s

and

g

Table 2. Summary of alignment cases based on different configurations of

s

and

g

Case Number	$s$ (Subsets of Reads)	$g$ (Genome Replicates)	Description	Application
1	1	1	Aligning complete reads to a single genome reference.	Standard single-genome alignments.
2	1	$g$ > 1	Aligning against multiple genome replicas.	Enhancing memory utilization in NUMA systems.
3	1	$0 < g$ < 1	Mapping to a fraction ( $1 / g$ ) of the genome reference.	Focusing on specific genome segments.
4	$s$ > 1	1	Dividing reads into subsets for a single genome replica.	Distributing workload across multiple threads.
5	$s$ > 1	$g$ > 1	Mapping subsets to multiple genome replicas, $s = g$ .	Suitable for large memory systems for parallel processing.
6	$s$ > 1	$0 < g$ < 1	Mapping subsets to a fractional genome segment, $s = 1 / g$ .	Minimizing memory footprint for memory-sensitive systems.

Table 3. Details of the configurations in the experiments.

Components	Details
Genome Reference	Human genome (id: HS37D5) from the international 1000 Genomes Project. Sizes are up to 3 GB.
Short-Read Data	SRR766060 paired-end genome. Each segment is 8.7 GB in size.
Benchmark Datasets	BALIBASE, OXBENCH, SABRE, and PREFAB.
Sequence Simulation Tools	randDNA and wgsim.
Alignment Algorithms	SOAP2, BOWTIE2, BWA, and GEM3.
Experimental Setup	Two NUMA nodes: Intel Xeon and AMD Opteron.
NUMA Node Specifications	Processors: 4 processors with 32 cores and 64 threads. Memory: 128 GB; NUMA Nodes: 4.
Processor Details	AMD Opteron-6376: core network with full ties. Intel Xeon E5-4620: ring with circular connections.

Table 4. Quality of various FM aligners.

	BALIBASE		OXBENCH		SABRE		PREFAB
Aligners	QS	TCS	QS	TCS	QS	TCS	QS	TCS
LF-EP-based FM aligner	0.870	0.590	0.892	0.460	0.720	0.680	0.880	0.820
SOAP2	0.860	0.580	0.880	0.450	0.710	0.670	0.870	0.810
BOWTIE2	0.850	0.570	0.870	0.440	0.700	0.660	0.860	0.800
BWA	0.840	0.560	0.860	0.430	0.690	0.650	0.850	0.790
GEM3	0.830	0.550	0.850	0.420	0.680	0.640	0.840	0.780

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Chen, S.; Dai, Y.; Liu, L.; Yu, X. Optimizing Data Parallelism for FM-Based Short-Read Alignment on the Heterogeneous Non-Uniform Memory Access Architectures. Future Internet 2024, 16, 217. https://doi.org/10.3390/fi16060217

AMA Style

Chen S, Dai Y, Liu L, Yu X. Optimizing Data Parallelism for FM-Based Short-Read Alignment on the Heterogeneous Non-Uniform Memory Access Architectures. Future Internet. 2024; 16(6):217. https://doi.org/10.3390/fi16060217

Chicago/Turabian Style

Chen, Shaolong, Yunzi Dai, Liwei Liu, and Xinting Yu. 2024. "Optimizing Data Parallelism for FM-Based Short-Read Alignment on the Heterogeneous Non-Uniform Memory Access Architectures" Future Internet 16, no. 6: 217. https://doi.org/10.3390/fi16060217

APA Style

Chen, S., Dai, Y., Liu, L., & Yu, X. (2024). Optimizing Data Parallelism for FM-Based Short-Read Alignment on the Heterogeneous Non-Uniform Memory Access Architectures. Future Internet, 16(6), 217. https://doi.org/10.3390/fi16060217

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu