Implementation of the Principal Component Analysis onto High-Performance Computer Facilities for Hyperspectral Dimensionality Reduction: Results and Comparisons
Figure 1. CPU vs Graphics Processing Unit (GPU) architecture.
Figure 2. General hardware architecture of a NVIDIA GPU.
Figure 3. The Kalray Massively Parallel Processor Array (MPPA)-256-N architecture: (a) chip and (b) cluster.
Figure 4. Processing kernels in the GPU: grids of blocks with computing threads.
Figure 5. Division of the transmissions workload for the image preprocessing step.
Figure 6. Division of the transmissions workload for the covariance computation step.
Figure 7. Pseudocolor version of the 100 × 100 × 50 (a), 200 × 200 × 50 (b), 300 × 300 × 50 (c), 400 × 400 × 50 (d), and 500 × 500 × 50 (e) synthetically generated images.
Figure 8. Pseudocolor version of the Cuprite large (a), Jasper Ridge (b) and World Trade Center (WTC) (c) Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) images.
Figure 9. First principal component obtained with the GPU (top row) and with the MPPA (bottom row) for Cuprite (a,d), Jasper (b,e), and WTC (c,f) images.
Figure 10. Comparison between platforms and implementations in terms of frequency, Cubes Per Second (CPS), power and programmability complexity.
Abstract
1. Introduction
2. Dimensionality Reduction by Means of the PCA Algorithm
| Step | Algorithm 1: Principal Component Analysis | Number of FLOPS |
|---|---|---|
| 1 | Input: Hyperspectral image Y (N × M matrix), N pixels, M bands | |
| 2 | X = BandAverageRemoval(Y) | 2(N × M) + M |
| 3 | Covariance matrix, C = Xᵀ·X | 2 × N² × M² |
| 4 | E = EigenvectorDecomposition(C), e eigenvectors computed | 4 × M³ |
| 5 | Projection matrix, Q = Y·E | e × N(2 × M − 1) |
| 6 | Q′ = MatrixColumnRemoval(Q, p), p principal components | |
| 7 | Output: Reduced hyperspectral image Q′ (N × p matrix) | |
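For reference, the numerical steps listed in Algorithm 1 can be written compactly in standard PCA notation (this is only a restatement of the table above, with the covariance normalization by N − 1 that is used later in the implementations):

```latex
X = Y - \mathbf{1}\,\boldsymbol{\mu}^{\mathsf{T}}, \qquad
C = \frac{X^{\mathsf{T}} X}{N-1}, \qquad
C\,\mathbf{e}_k = \lambda_k\,\mathbf{e}_k, \qquad
Q = Y\,E,
```

where μ is the M × 1 vector of band averages, 1 is the N × 1 ones vector, E = [e₁ … e_M] gathers the eigenvectors sorted by decreasing eigenvalue, and Q′ keeps the first p columns of Q.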
3. High Performance Computing in Graphics Processing Units and Manycores
4. GPU Implementation of PCA Jacobi
4.1. Preliminary Issues
4.2. CUDA Implementation of PCA Jacobi
- Data Initialization. The CUDA implementation of the PCA Jacobi algorithm requires transferring the hyperspectral image from the CPU (host) memory to the global memory of the GPU (device). Because some operations of the PCA Jacobi algorithm are implemented by means of the cuBLAS library, which uses column-major storage, the storage order of the hyperspectral image transferred from the host to the GPU memory must be taken into account. Moreover, in order to simplify the transfer operation, Thrust vector containers have been used on both the host and the device. With this approach, the memory on the host and the device does not need to be managed manually, since the data are stored in Thrust vectors. In Code 4.1, the hyperspectral image is loaded into a Thrust host vector, h_image_in, by means of an iterator (lines 1.1, 1.2, 1.3, 1.4): in line 1.1, the filename of the hyperspectral image is passed as a command-line argument; in lines 1.2, 1.3, and 1.4, the original image is loaded into a Thrust host vector container with BANDS × PIXELS elements, h_image_in. Finally, this image is transferred to the device memory, d_image_in, by means of thrust::copy (lines 1.5, 1.6). The result is that the hyperspectral image is loaded into the device as a sequence of bands; that is, the image matrix is row-major ordered. This fact must be considered whenever a cuBLAS function is called, because cuBLAS uses column-major storage; it is handled by setting the transpose parameter whenever a cuBLAS function is called with the hyperspectral image.
Code 4.1. Data initialization
1.1 ifstream ifile(argv[1]);
1.2 thrust::host_vector<float> h_image_in(BANDS * PIXELS);
1.3 istream_iterator<float> beg(ifile), end;
1.4 thrust::copy(beg, end, h_image_in.begin());
1.5 thrust::device_vector<float> d_image_in(BANDS * PIXELS);
1.6 thrust::copy(h_image_in.begin(), h_image_in.end(), d_image_in.begin());
- Image preprocessing. In this stage, the original hyperspectral image is mean-centered. In order to parallelize this step, it has been divided into two sequential calculations: the average of every band of the hyperspectral image and the subtraction of this average from the original hyperspectral image.
- 2.1.
- Average per band calculation. For each band of the original image (BANDS × PIXELS matrix), its average is calculated and stored into a BANDS array. In order to calculate the average per band in parallel with the cuBLAS library, the BANDS × PIXELS matrix is multiplied by a ones vector (a vector with PIXELS elements) by means of the cublasSgemv function. According to the cuBLAS documentation, cublasSgemv performs the matrix-vector multiplication y = α·op(A)·x + β·y, where:
- A is the BANDS × PIXELS matrix which contains the original hyperspectral image. According to Code 4.1, this matrix is already on the device as d_image_in (see lines 1.5 and 1.6 in Code 4.1). Moreover, because this matrix is stored in row-major order (whereas cuBLAS expects column-major storage), it is transposed in the cuBLAS call.
- x is a vector with PIXELS elements, all set to one. This vector is called d_ones and it is filled with ones by means of Thrust (see line 2.1.2 of Code 4.2.1).
- α is set to 1/PIXELS, and β is set to 0 (line 2.1.3 of Code 4.2.1).
- y is the array with BANDS elements that stores the average per band. In Code 4.2.1 this vector is called d_means and it is filled with the result of the cublasSgemv function (see line 2.1.1 and the next-to-last parameter of the cuBLAS call in line 2.1.4).
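With these parameter choices (α = 1/PIXELS, x = 1, β = 0), the cublasSgemv call in Code 4.2.1 reduces to the per-band mean; writing X for the PIXELS × BANDS column-major view of d_image_in:

```latex
\mathbf{d\_means} = \frac{1}{\mathrm{PIXELS}}\, X^{\mathsf{T}} \mathbf{1},
\qquad
\mathrm{d\_means}_b = \frac{1}{\mathrm{PIXELS}} \sum_{p=1}^{\mathrm{PIXELS}} X_{p,b},
\quad b = 1, \dots, \mathrm{BANDS}.
```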
Code 4.2.1. Average per band calculation
2.1.1 thrust::device_vector<float> d_means(BANDS); // to store the average per band
2.1.2 thrust::device_vector<float> d_ones(PIXELS, 1.f); // ones vector
2.1.3 float alpha = 1.f / (float) PIXELS; float beta = 0.f;
2.1.4 cublasSgemv(handle, // cuBLAS handle
          CUBLAS_OP_T, // d_image_in is transposed
          PIXELS, BANDS, // rows and columns of d_image_in
          &alpha, // α scalar
          thrust::raw_pointer_cast(d_image_in.data()), // (1)
          PIXELS, // leading dimension of d_image_in
          thrust::raw_pointer_cast(d_ones.data()),
          1, // stride of d_ones
          &beta, // β scalar
          thrust::raw_pointer_cast(d_means.data()),
          1); // stride of d_means
1 thrust::raw_pointer_cast converts the device vector container to a raw pointer.
- 2.2.
- Subtraction of the average per band from the original hyperspectral image. For each band of the original image, the average of that band must be removed. The subtraction of the average of a band from every pixel in such band is carried out by means of a Thrust transformation. Transformations are algorithms that apply an operation to each element in a set of (zero or more) input ranges and then store the result in a destination range. The transformation thrust::transform in line 2.2.2 of Code 4.2.2 uses the function object thrust::minus to subtract the band averages stored in the device vector d_means from the device vector d_image_in. As d_means is a vector with BANDS elements and d_image_in contains PIXELS columns with BANDS elements, the d_means array must be subtracted PIXELS times, once for every pixel of the original hyperspectral image, d_image_in. This purpose is achieved by combining three Thrust fancy iterators: permutation_iterator, transform_iterator and counting_iterator. Each of these iterators is created by a function called make_XXX, where XXX denotes the name of the Thrust fancy iterator. The permutation_iterator permutes the range of values in d_means according to the index scheme produced by the transform_iterator; in general, permutation_iterator allows operating on a specific set of values in a sequence instead of the entire sequence. The transform_iterator converts the linear index generated by the counting_iterator (starting at 0) into its associated row (band) index by calling the linear_index_to_row_index functor (a C++ functor is an object that acts just like a function, but with the advantage of being stateful, meaning that it can keep data reflecting its state between calls); see line 2.2.1 in Code 4.2.2. A small index-mapping example is given after Code 4.2.2.
Code 4.2.2. Subtraction of the average per band from the original hyperspectral image
2.2.1 template<typename T>
      struct linear_index_to_row_index : public thrust::unary_function<T, T> {
          T Ncols; // number of columns
          __host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}
          __host__ __device__ T operator()(T i) { return i / Ncols; }
      };
2.2.2 thrust::transform(d_image_in.begin(), d_image_in.end(),
          thrust::make_permutation_iterator(
              d_means.begin(),
              thrust::make_transform_iterator(
                  thrust::make_counting_iterator(0),
                  linear_index_to_row_index<int>(PIXELS))),
          d_image_in.begin(), thrust::minus<float>());
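To illustrate the iterator composition used in Code 4.2.2, the following minimal, self-contained sketch (written for this description; the toy value of PIXELS is only an example) prints the band index that the counting/transform iterator pair produces for the first few linear indices, which is exactly the index used to pick the proper element of d_means:

```cpp
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/functional.h>
#include <cstdio>

// Same functor as in Code 4.2.2: maps a linear element index to its band (row) index.
template <typename T>
struct linear_index_to_row_index : public thrust::unary_function<T, T> {
    T Ncols;
    __host__ __device__ linear_index_to_row_index(T Ncols) : Ncols(Ncols) {}
    __host__ __device__ T operator()(T i) { return i / Ncols; }
};

int main() {
    const int PIXELS = 4; // toy number of pixels per band, for illustration only
    // counting_iterator yields 0, 1, 2, ...; the transform_iterator divides by PIXELS,
    // so elements 0..3 map to band 0, elements 4..7 map to band 1, and so on.
    auto band_of = thrust::make_transform_iterator(
        thrust::make_counting_iterator(0), linear_index_to_row_index<int>(PIXELS));
    for (int i = 0; i < 8; ++i)
        std::printf("linear index %d -> band %d\n", i, *(band_of + i));
    return 0;
}
```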
- Covariance computation. This stage computes the covariance matrix associated with the original image in parallel on the GPU (device) by means of Thrust and the cuBLAS library. The covariance matrix is calculated by the cublasSgemm function, which multiplies the preprocessed matrix by its transpose. According to the cuBLAS documentation, cublasSgemm performs the matrix-matrix multiplication C = α·op(A)·op(B) + β·C, where:
- A and B are BANDS × PIXELS matrices that contain the preprocessed hyperspectral image. According to the Data Initialization code, both correspond to the matrix that is already on the device as d_image_in (see lines 1.5 and 1.6 of Code 4.1). Moreover, because d_image_in is stored in row-major order, it needs to be transposed in the cuBLAS function; this is the reason why op(A) must be CUBLAS_OP_T (transposed d_image_in) and op(B) must be CUBLAS_OP_N (original d_image_in, not transposed); see the call to the cublasSgemm function in line 3.3.
- m is equivalent to BANDS, n to BANDS and k to PIXELS.
- α is set to 1, and β is set to 0 (line 3.2 of Code 4.3).
- C is the result matrix, which corresponds to the covariance matrix d_cov, a BANDS × BANDS matrix (see line 3.1 and the next-to-last parameter of the cuBLAS call in line 3.3).
The last step of the covariance computation stage is to normalize the result of cublasSgemm using a Thrust transformation together with a Thrust iterator, as shown in line 3.4 of Code 4.3. In this normalization, the elements of the d_cov device vector are divided by PIXELS − 1.
Code 4.3. Covariance computation
3.1 thrust::device_vector<float> d_cov(BANDS * BANDS);
3.2 alpha = 1.f; beta = 0.f;
3.3 cublasSgemm(handle,
        CUBLAS_OP_T, // op(A): d_image_in is transposed to get the column-major ordered format
        CUBLAS_OP_N, // op(B)
        BANDS, // rows of op(A) and of d_cov (C)
        BANDS, // columns of d_image_in (B) and of d_cov (C)
        PIXELS, // columns of op(A) and rows of d_image_in (B)
        &alpha, // scalar set to 1
        thrust::raw_pointer_cast(d_image_in.data()), // matrix d_image_in (A)
        PIXELS, // leading dimension of d_image_in (A)
        thrust::raw_pointer_cast(d_image_in.data()), // matrix d_image_in (B)
        PIXELS, // leading dimension of d_image_in (B)
        &beta, // scalar set to 0
        thrust::raw_pointer_cast(d_cov.data()), // matrix d_cov (C)
        BANDS); // leading dimension of d_cov (C)
3.4 thrust::transform (d_cov.begin(), d_cov.end(), thrust::make_constant_iterator((float) (PIXELS - 1)), d_cov.begin(), thrust::divides<float>());
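As a sanity check of the preprocessing and covariance stages, the same computation can be reproduced on the CPU with a few loops; the following self-contained sketch (illustrative toy data, not part of the original code) mean-centers a tiny band-sequential image and builds the covariance matrix normalized by PIXELS − 1, so its output can be compared with the contents of d_cov:

```cpp
#include <vector>
#include <cstdio>

// CPU reference of the mean-centering + covariance steps (Codes 4.2.x and 4.3):
// cov[b1][b2] = sum_p (X[b1][p] - mean[b1]) * (X[b2][p] - mean[b2]) / (PIXELS - 1)
int main() {
    const int BANDS = 3, PIXELS = 4;                 // toy sizes, for illustration only
    std::vector<float> img = { 1, 2, 3, 4,           // band 0 (band-sequential layout,
                               2, 4, 6, 8,           //  as in d_image_in)
                               1, 1, 2, 2 };         // band 2
    std::vector<float> mean(BANDS, 0.f), cov(BANDS * BANDS, 0.f);
    for (int b = 0; b < BANDS; ++b) {
        for (int p = 0; p < PIXELS; ++p) mean[b] += img[b * PIXELS + p];
        mean[b] /= PIXELS;
        for (int p = 0; p < PIXELS; ++p) img[b * PIXELS + p] -= mean[b];   // mean-centering
    }
    for (int b1 = 0; b1 < BANDS; ++b1)
        for (int b2 = 0; b2 < BANDS; ++b2) {
            for (int p = 0; p < PIXELS; ++p)
                cov[b1 * BANDS + b2] += img[b1 * PIXELS + p] * img[b2 * PIXELS + p];
            cov[b1 * BANDS + b2] /= (PIXELS - 1);
        }
    for (int b1 = 0; b1 < BANDS; ++b1, std::printf("\n"))
        for (int b2 = 0; b2 < BANDS; ++b2) std::printf("%8.3f ", cov[b1 * BANDS + b2]);
    return 0;
}
```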
- Eigenvector decomposition. This stage obtains the eigenvectors associated with the covariance matrix computed in the previous stage by means of the cyclic Jacobi method, which is divided into three parts:
- 4.1.
- The first part of this stage deals with a sequential code on the host side, which is in charge of finding the maximum off-diagonal element of the covariance matrix and checking whether it is still greater than a provided stop factor (ε). The calculation of the maximum value and its position (i, j) is carried out over the upper triangular part of the matrix. After obtaining the position of the maximum value, Equations (2)–(5) are calculated, where Equation (2) in this implementation takes the form of Equation (7), and i and j correspond to the position of the maximum element in the covariance matrix. Moreover, the set of operations in (13) is computed over the identity matrix.
- 4.2.
- The second part of this stage is executed on the GPU (device). A CUDA kernel is launched with the maximum number of threads per block and with as many blocks as the ratio between the number of BANDS and that maximum (1024 threads). This kernel computes in parallel the Jacobi rotations given by the equations in (15). As the goal of this stage is zeroing the off-diagonal elements, both parts of this stage are repeated until a stop condition is fulfilled: because each iteration may undo the zeroes obtained in previous iterations, the iterations continue until the maximum off-diagonal value of the covariance matrix becomes smaller than the provided stop factor (ε). Over the iterations, the covariance matrix converges to a diagonal matrix whose main diagonal contains the eigenvalues. Moreover, the eigenvector matrix is updated in parallel, as shown in Equation (16), using the cuBLAS library called from the host (an illustrative sketch of a rotation kernel of this kind is given at the end of this stage, after Code 4.4.3).
- 4.3.
- The last part of this stage consists of extracting and sorting both the eigenvalues and the eigenvectors. The eigenvalues are located in the main diagonal of the covariance matrix obtained in parallel in the previous part. A kernel is in charge of extracting these eigenvalues in parallel from the covariance matrix (see line 4.3.2 of Code 4.4.3). This kernel is launched with one block of BANDS threads, and each thread extracts one value of the main diagonal of the covariance matrix and stores it in a Thrust device vector, d_diagonal (see line 4.3.1).
Code 4.4.3. Extraction and sort of eigenvalues and eigenvectors
4.3.1 thrust::device_vector<float> d_diagonal(BANDS);
4.3.2 diagonalMain<<<1, BANDS>>>(…, …); // parameters: covariance matrix and d_diagonal
4.3.3 thrust::device_vector<int> d_pos_data(BANDS * BANDS);
4.3.4 thrust::sequence(thrust::device, d_pos_data.begin(), d_pos_data.end(), 0);
4.3.5 thrust::sort_by_key(d_diagonal.begin(), d_diagonal.end(), d_pos_data.begin(), thrust::greater<float>());
4.3.6 sortEigenvectors<<<1, BANDS>>>(…, …, …); // parameters: eigenvector matrix, the order of the eigenvalues and the result of the sort
The main diagonal is sorted in descending order by means of two Thrust algorithms: thrust::sequence and thrust::sort_by_key. The first one, thrust::sequence, fills a range with a sequence of numbers using the thrust::device execution policy for parallelization; the second one, thrust::sort_by_key, performs a key-value sort in which the keys are the eigenvalues in d_diagonal and the values are the positions generated by thrust::sequence (d_pos_data). The result is the device vector d_diagonal sorted in descending order, using the thrust::greater comparison operator. Moreover, the eigenvectors must be sorted according to the order followed by the eigenvalues. As each eigenvector contains BANDS elements, the sort operation is carried out by changing the order of each eigenvector in matrix P according to the order of the eigenvalues, which is kept in the d_pos_data device vector after sorting them. This sort is computed by a kernel, sortEigenvectors, which is launched with one block of BANDS threads; each thread is in charge of one eigenvector.
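The rotation kernel itself is not listed in the text; the following simplified CUDA sketch (written for this description, with illustrative names such as rotateRows, and not taken from the original code) shows one possible way of updating rows i and j of the covariance matrix with the rotation coefficients of P, mirroring the row update later shown in Code 5.3.2. An analogous kernel updates the columns and the eigenvector matrix, and the launch configuration follows the block/thread scheme described in part 4.2:

```cuda
// Illustrative sketch: one thread per column k updates rows i and j of the
// covariance matrix C (BANDS x BANDS, row-major) with the rotation coefficients of P.
__global__ void rotateRows(float* C, int bands, int i, int j,
                           float pii, float pij, float pji, float pjj)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // column handled by this thread
    if (k >= bands) return;
    float aik = C[i * bands + k];
    float ajk = C[j * bands + k];
    C[i * bands + k] = pii * aik + pji * ajk;       // new a_ik
    C[j * bands + k] = pij * aik + pjj * ajk;       // new a_jk
}

// Possible launch, with 1024 threads per block as mentioned above:
// int threads = 1024, blocks = (BANDS + threads - 1) / threads;
// rotateRows<<<blocks, threads>>>(d_C, BANDS, i, j, pii, pij, pji, pjj);
```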
- Projection and reduction. In this stage, the original hyperspectral image is projected onto the first principal component, that is, onto the first eigenvector. Although the original hyperspectral image must be projected onto the set of eigenvectors, for simplicity this projection is performed only onto the first eigenvector. The projection is computed by multiplying the original hyperspectral image by a vector, the first eigenvector stored in matrix P. This operation is performed by a call to the cublasSgemm function (see Code 4.5).
Code 4.5. Projection and reduction
thrust::device_vector<float> d_projection(PIXELS, 0);
thrust::device_vector<float> d_eigenvector(BANDS);
// copy the first eigenvector into d_eigenvector:
thrust::copy_n(thrust::device, d_G_ordered.begin(), BANDS, d_eigenvector.begin());
alpha = 1.0f; beta = 0.0f; // scalars for cublasSgemm
// d_projection = d_image_in x d_eigenvector (the first eigenvector):
cublasSgemm(handle, // handle for use of cuBLAS
    CUBLAS_OP_N, CUBLAS_OP_N,
    PIXELS, 1, BANDS, // m, n, k
    &alpha, // α scalar
    thrust::raw_pointer_cast(d_image_in.data()),
    PIXELS, // leading dimension of d_image_in
    thrust::raw_pointer_cast(d_eigenvector.data()),
    BANDS, // leading dimension of d_eigenvector
    &beta, // β scalar
    thrust::raw_pointer_cast(d_projection.data()),
    PIXELS); // leading dimension of d_projection
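Since the projection onto a single eigenvector is in fact a matrix-vector product, the same result could also be obtained with cublasSgemv; a possible call, sketched here under the assumption that the buffers and the handle defined in Code 4.5 are in scope (this fragment is not part of the original code), would be:

```cpp
// Sketch: d_projection = d_image_in (PIXELS x BANDS column-major view) * d_eigenvector
float alpha = 1.0f, beta = 0.0f;
cublasSgemv(handle,
            CUBLAS_OP_N,                                          // no transpose: pixels as rows
            PIXELS, BANDS,                                        // rows and columns of the matrix
            &alpha,
            thrust::raw_pointer_cast(d_image_in.data()), PIXELS,  // matrix and leading dimension
            thrust::raw_pointer_cast(d_eigenvector.data()), 1,    // input vector and stride
            &beta,
            thrust::raw_pointer_cast(d_projection.data()), 1);    // output vector and stride
```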
5. MPPA Implementation of PCA Jacobi
- Image preprocessing. This stage centers the image by computing and removing the average of each band of the image. Consequently, this computation is divided into two steps: firstly, the calculation of the average of each band of the hyperspectral image, and, secondly, the subtraction of this average from the original hyperspectral image.
- 1.1
- Average per band calculation. For each band of the original hyperspectral image (BANDS × PIXELS matrix), its average is calculated individually. This means that this step is band-wise parallelizable, and several averages can be computed simultaneously. As described before, the main limitation of the MPPA-256-N is the cluster memory; consequently, to avoid additional iterations, each cluster needs enough memory to store at least one band of the image. For the images under study, this means a minimum of 0.2 MB and a maximum of 1.2 MB per band. Therefore, 16 bands can be mean-centered at the same time when exploiting all of the computational resources of the platform. As demonstrated in previous works [26], the communication between I/Os and clusters becomes the main bottleneck of this algorithm when processing very large datasets, such as hyperspectral images. As a result, this implementation takes advantage of all the available I/O resources with the objective of reducing this bottleneck. As described before, there are two available I/Os, and each of them contains four DMAs to communicate with the clusters; hence, a total of eight communication links are available to send information to the clusters. The transmission of the bands to each cluster can then be parallelized, as depicted in Figure 5. As this implementation exploits the two available I/Os of the platform, in order to implement efficient communications, each I/O handles the communications with half of the active clusters. As a result, I/O 0 communicates with clusters 0 to 7, while I/O 1 does the same with clusters 8 to 15, and both of them exploit the four available DMAs. In addition, as each I/O has its own memory, there are two options to return information to the I/Os: (1) each cluster returns a value to its corresponding I/O and then I/O 0 and I/O 1 exchange their partial information, or (2) each cluster sends the information to both I/Os simultaneously. The first option is computationally more expensive, so throughout the implementation the second methodology has been selected. Hence, in each iteration, the eight communication links are used in parallel to send one band to each cluster. Once a band has been received, the 16 internal cores (handled as threads) compute its average, each of them calculating its own share, as shown in lines 1.1–1.2 of Code 5.1 (a sketch of this per-core partitioning is given after Code 5.1). Afterwards, the master thread (i.e., the one supervising the whole process) adds the partial sums up and orders the threads to start the average subtraction, as described in step 1.2.
- 1.2
- Subtraction of the average for each band. Once the average has been computed, the internal threads subtract it from their share of the band, as depicted in lines 1.4–1.6 of Code 5.1. After that, the master thread returns the band, now mean-centered. As mentioned before, each cluster returns the information twice, once for each active I/O. Consequently, after this operation, each I/O contains a copy of the mean-centered image, which is necessary to start the covariance matrix computation.
Code 5.1. Average calculation and removal
1.1 for (i = thread_start; i < thread_stop; i++)
1.2     sum = sum + image_block[i];
1.3 Threads synchronization and average computation
1.4 average = sum / PIXELS;
1.5 for (i = thread_start; i < thread_stop; i++)
1.6     image_block[i] = image_block[i] - average;
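The per-core bounds thread_start and thread_stop used in Code 5.1 are not detailed in the listing; a minimal sketch of one possible static partition of a band of PIXELS samples among the 16 cores of a cluster (illustrative names and a sample image size, not the original code) is the following:

```cpp
#include <cstdio>

#define CORES_PER_CLUSTER 16

// Illustrative static partition of one band (PIXELS samples) among the cluster cores.
// Each core (thread) gets a contiguous chunk; the last core absorbs the remainder.
static void thread_share(long pixels, int thread_id, long *start, long *stop)
{
    long chunk = pixels / CORES_PER_CLUSTER;
    *start = thread_id * chunk;
    *stop  = (thread_id == CORES_PER_CLUSTER - 1) ? pixels : *start + chunk;
}

int main()
{
    const long PIXELS = 122500;          // e.g., the Cuprite (large) image
    long start, stop;
    for (int id = 0; id < CORES_PER_CLUSTER; id++) {
        thread_share(PIXELS, id, &start, &stop);
        std::printf("core %2d processes pixels [%ld, %ld)\n", id, start, stop);
    }
    return 0;
}
```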
- Covariance computation. This stage calculates the covariance matrix associated with the original image after it has been mean-centered. As exposed in previous works, this stage becomes the main bottleneck of the algorithm in memory-bounded platforms such as the MPPA-256-N [23]. Consequently, it is crucial to find an efficient implementation to reduce this bottleneck. In this case, the workload is divided so that each I/O deals with half of the active clusters. Likewise, each cluster returns its partial result twice, once per active I/O, hence eliminating the communication between I/Os. To compute the covariance matrix, the matrix obtained at stage 1 needs to be multiplied by its transpose. As noted in the previous step, depending on the input images, there are cases in which even one band of the image completely fills the cluster memory; therefore, neither of these matrices fits into a cluster, so the computation needs to be deeply iterated. Since each cluster can only store one band at a time, even computing just one element of the resulting matrix, i.e., multiplying two bands, must be an iterative process. As a result, the process of computing each element of the resulting matrix, as depicted in Figure 6, is divided following Equation (17), which splits the product into two halves, E = A × C + B × D. As shown in Figure 6, E represents the covariance matrix, A and B are the two halves of the input matrix, and C and D contain the two halves of its transpose, so each of them gathers N bands with half the number of pixels, where N represents the number of spectral bands of the original image. Consequently, A and B are N × M/2 matrices, while C and D are M/2 × N matrices. The process followed to compute the covariance matrix has been to calculate A × C in the clusters associated with I/O 0, while computing B × D in the clusters associated with I/O 1, and then to add both partial results up in each I/O. Furthermore, as the covariance matrix is symmetric, only half of the computations are performed. It should be noticed that, for simplification, the process explained divides the computation into two halves. Nevertheless, when the size of the image grows, this implementation can be adapted to divide it into multiples of 2. Finally, Code 5.2 provides the pseudocode for the operations performed in each one of the 256 cores. As expected, the operations performed within the code are quite simple, which highlights the communication-intensive nature of this step.
Code 5.2. Covariance computation
2.1 for (i = thread_start; i < thread_stop; i++)
2.2     partial_mult[thread_id] = partial_mult[thread_id] + A_block[i] * B_block[i];
- Eigenvector decomposition. This stage computes the eigenvectors associated with the covariance matrix obtained in the previous stage. The algorithm selected to perform this calculation is the cyclic Jacobi described in Algorithm 1. This algorithm is mainly divided into two different processes: firstly, finding the elements of the matrix that have not been zeroed yet (see Code 5.3.1); secondly, zeroing the selected elements and updating the matrix containing the eigenvectors (see Codes 5.3.2 and 5.3.3). Regarding the architecture under study, the covariance matrix is rather small, so it fits into a cluster. As a result, it has been decided to implement the Jacobi algorithm using only one cluster and its 16 internal cores, since using all of the clusters would introduce a communication overhead that would outweigh any speedup provided by the exploitation of the rest of the clusters. Hence, the 16 internal cores are the ones responsible for parallelizing this algorithm. Specifically, the cyclic Jacobi method has been implemented as follows:
Code 5.3.1. Search elements to zero
3.1.1 cond1 = |element| > ε;
3.1.2 cond2 = element not zeroed yet; (row-wise order)
3.1.3 cond3 = row, column not already used; (parallel Jacobi condition)
3.1.4 while (!End of Jacobi)
3.1.5     for (i = 0; i < BANDS-1; i++)
3.1.6         for (j = i+1; j < BANDS; j++)
3.1.7             if (cond1 && cond2 && cond3)
3.1.8                 Selection of aij, aii, ajj
3.1.9                 Computation of α, cos α, sin α (for building matrix P)
3.1.10                Storage of i, j for the selected element
3.1.11                values_iter++;
3.1.12    if (values_iter == 0)
3.1.13        if (all elements selectable)
3.1.14            End of Jacobi
3.1.15        else
3.1.16            Reset cond2
3.1.17    else
3.1.18        Reset cond3
3.1.19    Update rows (all threads)
3.1.20    Update columns (all threads)
- The master thread (namely, the core executing the main function) handles the search for the next elements to be zeroed. In each iteration, this thread verifies that the stop condition has not been fulfilled yet and then selects the maximum number of elements that can be simultaneously zeroed in the next iteration. As described in Algorithm 1, two elements can be zeroed in parallel as long as they do not share an index, i.e., as long as they are not located in the same row or column. Once all of these elements have been selected, this thread builds the rotation matrix and sends it to the processing cores. This process is iterated until the stop condition is fulfilled by all the off-diagonal elements, at which point the master thread sorts the eigenvalues in descending order, together with their associated eigenvectors.
- Similarly, the cores perform the operation shown in (8), where k represents the current iteration, Pk is the rotation matrix associated with the k-th iteration, Ck−1 contains the covariance matrix modified in the (k−1)-th iteration and Ck is the resulting matrix, i.e., the covariance matrix with more elements already zeroed. To simplify this process, these products are not computed entirely, since the P matrix is mainly composed of zeros. As only the products related to the non-zero positions of P are computed, first the contribution of the products to the rows of C is calculated and afterwards the same process is carried out for the columns, as depicted in Codes 5.3.2 and 5.3.3, respectively. Additionally, when updating the columns (Code 5.3.3), matrix P is also updated following Equation (10). The methodology used to compute these products is very similar to the one described in (8), as each element to be zeroed only modifies four elements of its associated P matrix.
Code 5.3.2. Update rows
3.2.1 for each element processed by each thread
3.2.2     for (k = 0; k < BANDS; k++)
3.2.3         new_aik = pii * aik + pji * ajk;
3.2.4         new_ajk = pij * aik + pjj * ajk;
3.2.5         aik = new_aik;
3.2.6         ajk = new_ajk;
Code 5.3.3. Update columns
3.3.1 for each element processed by each thread
3.3.2     for (k = 0; k < BANDS; k++)
3.3.3         new_aki = aki * pii + akj * pji;
3.3.4         new_akj = aki * pij + akj * pjj;
3.3.5         aki = new_aki;
3.3.6         akj = new_akj;
3.3.7         new_pki = pki * pii + pkj * pji;
3.3.8         new_pkj = pki * pij + pkj * pjj;
3.3.9         pki = new_pki;
3.3.10        pkj = new_pkj;
- Projection and reduction. This stage projects the original hyperspectral image onto the set of eigenvectors computed in stage 3, obtaining the principal components in the columns of the resulting matrix. However, for the images under study, the volume of spectral information retained in the first principal component is higher than 80% on average, so this step has been simplified to projecting the image only onto the first eigenvector. In that way, the complexity of this stage is considerably reduced, since instead of multiplying two matrices, it only needs to multiply the original image by the first eigenvector. Regarding the parallelization process, the method applied is similar to that of stage 2, but much simpler. As one eigenvector fits into a cluster, it is broadcast to all of them, while the original image is split in a pixel-wise order and sent iteratively; each core computes its share of the projection, as shown in Code 5.4 (a sketch of this two-level split is given after Code 5.4). As happened in stages 1 and 2, each I/O manages one half of the image and sends it to its corresponding clusters. Afterwards, each cluster returns its partial results to each I/O.
Code 5.4. Projection and reduction
4.1 for (i = thread_start; i < thread_stop; i++)
4.2     for (iter = 0; iter < BANDS_PCA; iter++)
4.3         proj = 0;
4.4         for (j = 0; j < BANDS; j++)
4.5             proj = proj + image[i*BANDS+j] * eigenvecs[iter*BANDS+j];
4.6         pca_out[i*BANDS_PCA+iter] = proj;
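The distribution of the image among I/Os, clusters and cores described above is not shown explicitly in the listings; the following sketch (the actual DMA/NoC transfers are omitted, and the constants and names are illustrative) only computes the pixel ranges of the two-level, pixel-wise split: half of the image per I/O, one slice per cluster and one share per core.

```cpp
#include <cstdio>

#define IOS      2    // I/O subsystems
#define CLUSTERS 16   // compute clusters (8 per I/O)
#define CORES    16   // cores (threads) per cluster

// Illustrative two-level pixel-wise split; data movement is intentionally omitted.
int main()
{
    const long PIXELS = 314368;                       // e.g., the Jasper / WTC images
    long per_io      = PIXELS / IOS;
    long per_cluster = per_io / (CLUSTERS / IOS);
    long per_core    = per_cluster / CORES;

    for (int io = 0; io < IOS; io++)
        for (int cl = 0; cl < CLUSTERS / IOS; cl++) {
            long cluster_start = io * per_io + cl * per_cluster;
            for (int core = 0; core < CORES; core++) {
                long thread_start = cluster_start + core * per_core;
                long thread_stop  = thread_start + per_core;
                if (io == 0 && cl == 0)               // print one cluster as a sample
                    std::printf("core %2d: pixels [%ld, %ld)\n",
                                core, thread_start, thread_stop);
            }
        }
    return 0;
}
```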
6. Materials
7. Experimental Results
7.1. Performance of the Sequential PCA Jacobi Algorithm
7.2. Performance of the CUDA PCA Jacobi Algorithm
7.3. Performance of the MPPA PCA Jacobi Algorithm
7.4. Comparisons
8. Conclusions
Author Contributions
Acknowledgments
Conflicts of Interest
References
- Chang, C.-I. Hyperspectral Imaging: Techniques for Spectral Detection and Classification; Kluwer Academic: Dordrecht, The Netherlands, 2003.
- Chang, C.-I. Hyperspectral Data Processing: Algorithm Design and Analysis; Wiley: Hoboken, NJ, USA, 2013.
- Chang, C.-I. Hyperspectral Data Exploitation: Theory and Applications; Wiley: Hoboken, NJ, USA, 2007.
- Ghamisi, P.; Yokoya, N.; Li, J.; Liao, W.; Liu, S.; Plaza, J.; Rasti, B.; Plaza, A. Advances in Hyperspectral Image and Signal Processing: A Comprehensive Overview of the State of the Art. IEEE Geosci. Remote Sens. Mag. 2017, 5, 37–78.
- Bioucas-Dias, J.M.; Plaza, A.; Camps-Valls, G.; Scheunders, P.; Nasrabadi, N.; Chanussot, J. Hyperspectral Remote Sensing Data Analysis and Future Challenges. IEEE Geosci. Remote Sens. Mag. 2013, 1, 6–36.
- Hotelling, H. Analysis of a complex of statistical variables into principal components. J. Educ. Psychol. 1933, 24, 498–520.
- Plaza, A.; Chang, C.-I. High Performance Computing in Remote Sensing; Chapman & Hall/CRC Press: Boca Raton, FL, USA, 2007.
- Lee, C.A.; Gasster, S.D.; Plaza, A.; Chang, C.-I.; Huang, B. Recent Developments in High Performance Computing for Remote Sensing: A Review. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2011, 4, 508–527.
- Plaza, A.; Du, Q.; Chang, Y.-L.; King, R.L. High Performance Computing for Hyperspectral Remote Sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2011, 4, 528–544.
- Fernandez, D.; Gonzalez, C.; Mozos, D.; Lopez, S. FPGA implementation of the Principal Component Analysis algorithm for dimensionality reduction of hyperspectral images. J. Real Time Image Process. 2016, 1–12.
- Sanchez, S.; Ramalho, R.; Sousa, L.; Plaza, A. Real-Time Implementation of Remotely Sensed Hyperspectral Image Unmixing on GPUs. J. Real Time Image Process. 2015, 10, 469–483.
- Wu, Z.; Li, Y.; Plaza, A.; Li, J.; Xiao, F.; Wei, Z. Parallel and Distributed Dimensionality Reduction of Hyperspectral Data on Cloud Computing Architectures. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2016, 9, 2270–2278.
- Sun, W.; Halevy, A.; Benedetto, J.J.; Czaja, W.; Liu, C.; Wu, H.; Shi, B.; Li, W. UL-Isomap based nonlinear dimensionality reduction for hyperspectral imagery classification. ISPRS J. Photogramm. Remote Sens. 2014, 89, 25–36.
- Sun, W.; Halevy, A.; Benedetto, J.J.; Czaja, W.; Li, W.; Liu, C.; Shi, B.; Wang, R. Nonlinear Dimensionality Reduction via the ENH-LTSA Method for Hyperspectral Image Classification. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2014, 7, 375–388.
- Sun, W.; Du, Q. Graph-Regularized Fast and Robust Principal Component Analysis for Hyperspectral Band Selection. IEEE Trans. Geosci. Remote Sens. 2018.
- Sun, W.; Yang, G.; Du, B.; Zhang, L.; Zhang, L. A Sparse and Low-Rank Near-Isometric Linear Embedding Method for Feature Extraction in Hyperspectral Imagery Classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 4032–4046.
- Panju, M. Iterative methods for computing eigenvalues and eigenvectors. Waterloo Math. Rev. 2011, 1, 9–18.
- Golub, G.H.; van Loan, C.F. Matrix Computations, 4th ed.; Johns Hopkins University Press: Baltimore, MD, USA, 2013.
- Quarteroni, A.; Sacco, R.; Saleri, F. Numerical Mathematics, 2nd ed.; Springer: New York, NY, USA, 2007; pp. 183–238.
- Forsythe, G.E.; Henrici, P. The cyclic Jacobi method for computing the principal values of a complex matrix. Trans. Am. Math. Soc. 1960, 94, 1–23.
- de Dinechin, B.D.; Ayrignac, R.; Beaucamps, P.; Couvert, P.; Ganne, B.; de Massas, P.G.; Jacquet, F.; Jones, S.; Chaisemartin, N.M.; Riss, F.; et al. A Clustered Manycore Processor Architecture for Embedded and Accelerated Applications. In Proceedings of the IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, 10–12 September 2013; pp. 1–6.
- Roux, B.; Gautier, M.; Sentieys, O.; Derrien, S. Communication-Based Power Modelling for Heterogeneous Multiprocessor Architectures. In Proceedings of the IEEE 10th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC), Lyon, France, 21–23 September 2016; pp. 209–216.
- Lazcano, R.; Madroñal, D.; Salvador, R.; Desnos, K.; Pelcat, M.; Guerra, R.; Fabelo, H.; Ortega, S.; Lopez, S.; Callico, G.M.; et al. Porting a PCA-based hyperspectral image dimensionality reduction algorithm for brain cancer detection on a manycore architecture. J. Syst. Archit. 2017, 77, 101–111.
- CUDA Toolkit Documentation. NVIDIA Developer Documentation. Available online: http://docs.nvidia.com/cuda/ (accessed on 6 April 2018).
- THRUST: CUDA Toolkit Documentation. Available online: http://docs.nvidia.com/cuda/thrust/index.html (accessed on 6 April 2018).
- Lazcano, R.; Madroñal, D.; Desnos, K.; Pelcat, M.; Guerra, R.; López, S.; Juarez, E.; Sanz, C. Parallelism Exploitation of a Dimensionality Reduction Algorithm Applied to Hyperspectral Images. In Proceedings of the 2016 Conference on Design and Architectures for Signal and Image Processing (DASIP), Rennes, France, 12–14 October 2016.
- Clark, R.N.; Swayze, G.A.; Wise, R.; Livo, E.; Hoefen, T.; Kokaly, R.; Sutley, S.J. USGS Digital Spectral Library splib06a: U.S. Geological Survey, Digital Data Series 231. 2007. Available online: http://speclab.cr.usgs.gov/spectral.lib06 (accessed on 24 May 2018).
- Saidi, S.; Ernst, R.; Uhrig, S.; Theiling, H.; de Dinechin, B.D. The Shift to Multicores in Real-Time and Safety-Critical Systems. In Proceedings of the 10th International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS 2015, Amsterdam, The Netherlands, 4–9 October 2015; pp. 220–229.
- de Dinechin, B.D.; Graillat, A. Network-on-Chip Service Guarantees on the Kalray MPPA-256 Bostan Processor. In Proceedings of the 2nd International Workshop on Advanced Interconnect Solutions and Technologies for Emerging Computing Systems, Stockholm, Sweden, 25 January 2017; pp. 35–40.
- de Dinechin, B.D. Kalray MPPA: Massively Parallel Processor Array: Revisiting DSP Acceleration with the Kalray MPPA Manycore Processor. In Proceedings of the 2015 IEEE Hot Chips 27 Symposium (HCS), Cupertino, CA, USA, 22–25 August 2015; pp. 1–27.
- Xilinx Boards and Kits—Power Supply Information. Available online: https://www.xilinx.com/support/answers/67507.html (accessed on 9 April 2018).
Image | Pixel Number | Bands
---|---|---
Cuprite (small) | 47,750 | 188 |
Cuprite (large) | 122,500 | 188 |
Jasper | 314,368 | 224 |
World Trade Center (WTC) | 314,368 | 224 |
Image | Total Time (ms)
---|---
Cuprite (small) | 3672.6 |
Cuprite (large) | 9410.2 |
Jasper | 33,858.2 |
WTC | 33,858.0 |
Feature | GeForce GTX 680
---|---
GPU cores | 1536 |
Total amount of global memory (MB) | 2047 |
GPU maximum clock rate (MHz) | 1058 |
Maximum Graphics Card Power (W) | 195 |
Minimum System Power Requirement (W) | 550 |
Memory clock rate (MHz) | 3004 |
| Image | Stage 1: Image Prep. (ms) | Stage 2: Cov. Matrix (ms) | Stage 3: Eigenvector Decomp. (ms) | #PCs | Stage 4: Projection and Reduction (ms) | TOTAL (ms) |
|---|---|---|---|---|---|---|
| 100 × 100 × 50 | 0.317 | 0.2177 | 68.21 | 1 | 0.453 | 69.19 |
| | | | | 3 | 0.541 | 69.28 |
| | | | | 5 | 0.559 | 69.30 |
| 200 × 200 × 50 | 0.886 | 0.234 | 70.47 | 1 | 1.557 | 73.14 |
| | | | | 3 | 1.566 | 73.15 |
| | | | | 5 | 1.589 | 73.17 |
| 300 × 300 × 50 | 1.560 | 0.286 | 70.91 | 1 | 2.981 | 75.73 |
| | | | | 3 | 2.987 | 75.74 |
| | | | | 5 | 2.993 | 75.74 |
| 400 × 400 × 50 | 2.296 | 0.236 | 73.34 | 1 | 4.858 | 80.73 |
| | | | | 3 | 4.846 | 80.71 |
| | | | | 5 | 4.853 | 80.72 |
| 500 × 500 × 50 | 3.467 | 0.983 | 77.53 | 1 | 7.277 | 89.25 |
| | | | | 3 | 7.290 | 89.27 |
| | | | | 5 | 7.406 | 89.38 |
| 300 × 300 × 20 | 1.465 | 0.223 | 67.82 | 1 | 1.357 | 70.86 |
| | | | | 3 | 1.363 | 70.87 |
| | | | | 5 | 1.466 | 70.97 |
| 300 × 300 × 30 | 1.461 | 0.197 | 66.66 | 1 | 1.888 | 70.20 |
| | | | | 3 | 1.905 | 70.22 |
| | | | | 5 | 1.909 | 70.22 |
| 300 × 300 × 100 | 1.482 | 0.244 | 80.37 | 1 | 5.585 | 87.68 |
| | | | | 3 | 5.593 | 87.68 |
| | | | | 5 | 5.612 | 87.65 |
| 300 × 300 × 200 | 1.605 | 1.332 | 146.32 | 1 | 11.23 | 160.48 |
| | | | | 3 | 11.12 | 160.37 |
| | | | | 5 | 11.02 | 160.35 |
| Image | Stage 1: Image Prep. (ms) | Stage 2: Cov. Matrix (ms) | Stage 3: Eigenvector Decomp. (ms) | #PCs | Stage 4: Projection and Reduction (ms) | TOTAL (ms) |
|---|---|---|---|---|---|---|
| Cuprite (small) | 1.220 | 0.982 | 140.65 | 1 | 5.834 | 148.68 (24.70) |
| | | | | 3 | 5.834 | 148.68 |
| | | | | 5 | 5.852 | 148.70 |
| Cuprite (large) | 2.289 | 1.374 | 148.90 | 1 | 13.92 | 166.48 (56.52) |
| | | | | 3 | 14.00 | 166.56 |
| | | | | 5 | 14.00 | 166.56 |
| Jasper | 4.638 | 1.043 | 170.57 | 1 | 40.53 | 216.78 (156.62) |
| | | | | 3 | 41.93 | 218.18 |
| | | | | 5 | 41.92 | 218.17 |
| WTC | 4.690 | 1.263 | 169.47 | 1 | 40.54 | 215.96 (156.77) |
| | | | | 3 | 40.78 | 216.20 |
| | | | | 5 | 40.63 | 216.05 |

Values in parentheses indicate the speedup with respect to the sequential implementation.
Feature | MPPA-256-N
---|---
Processing cores | 256 |
Total amount of global memory (MB) | 32 |
MPPA clock rate (MHz) | 400 |
Typical power consumption (W) [28] | 10–20 |
| Image | Stage 1: Image Prep. (ms) | Stage 2: Cov. Matrix (ms) | Stage 3: Eigenvector Decomp. (ms) | #PCs | Stage 4: Projection and Reduction (ms) | TOTAL (ms) |
|---|---|---|---|---|---|---|
| 100 × 100 × 50 | 8.5 | 112.6 | 25.0 | 1 | 1.6 | 140.1 |
| | | | | 3 | 1.7 | 140.4 |
| | | | | 5 | 1.8 | 140.5 |
| 200 × 200 × 50 | 12.5 | 113.9 | 32.8 | 1 | 4.8 | 152.2 |
| | | | | 3 | 5.0 | 152.7 |
| | | | | 5 | 5.3 | 153.2 |
| 300 × 300 × 50 | 25.4 | 129.9 | 28.9 | 1 | 10.0 | 169.7 |
| | | | | 3 | 10.5 | 170.7 |
| | | | | 5 | 11.1 | 171.7 |
| 400 × 400 × 50 | 38.3 | 162.2 | 28.7 | 1 | 17.9 | 209.9 |
| | | | | 3 | 18.9 | 211.7 |
| | | | | 5 | 19.8 | 213.9 |
| 500 × 500 × 50 | 55.8 | 229.1 | 25.0 | 1 | 27.5 | 282.5 |
| | | | | 3 | 29.0 | 285.2 |
| | | | | 5 | 30.6 | 288.2 |
| 300 × 300 × 20 | 11.6 | 28.3 | 7.9 | 1 | 9.3 | 46.4 |
| | | | | 3 | 9.6 | 47.2 |
| | | | | 5 | 9.9 | 47.8 |
| 300 × 300 × 30 | 15.0 | 54.0 | 14.4 | 1 | 9.5 | 78.8 |
| | | | | 3 | 10.1 | 80.1 |
| | | | | 5 | 10.3 | 80.5 |
| 300 × 300 × 100 | 43.1 | 461.0 | 122.5 | 1 | 11.0 | 595.4 |
| | | | | 3 | 12.0 | 596.5 |
| | | | | 5 | 12.9 | 597.8 |
| 300 × 300 × 200 | 76.0 | 1730.3 | 841.5 | 1 | 13.9 | 2586.6 |
| | | | | 3 | 16.0 | 2587.5 |
| | | | | 5 | 19.9 | 2591.2 |
| Image | Stage 1: Image Prep. (ms) | Stage 2: Cov. Matrix (ms) | Stage 3: Eigenvector Decomp. (ms) | #PCs | Stage 4: Projection and Reduction (ms) | TOTAL (ms) |
|---|---|---|---|---|---|---|
| Cuprite (small) | 42.8 | 1465.7 | 515.1 | 1 | 7.7 | 1989.5 (1.84) |
| | | | | 3 | 8.8 | 1990.5 |
| | | | | 5 | 10.6 | 1992.4 |
| Cuprite (large) | 103.7 | 1696.6 | 479.1 | 1 | 18.0 | 2194.6 (4.28) |
| | | | | 3 | 20.4 | 2199.1 |
| | | | | 5 | 25.3 | 2204.1 |
| Jasper | 293.7 | 4141.6 | 749.1 | 1 | 52.2 | 4944.2 (6.84) |
| | | | | 3 | 60.2 | 4959.2 |
| | | | | 5 | 84.8 | 4982.9 |
| WTC | 284.1 | 4144.2 | 770.9 | 1 | 52.7 | 4969.1 (6.81) |
| | | | | 3 | 59.4 | 4980.3 |
| | | | | 5 | 84.5 | 5004.4 |

Values in parentheses indicate the speedup with respect to the sequential implementation.
| Platform | Image | Time (ms) | Cores | Frequency (MHz) | CPS | CPS/(Core × MHz) |
|---|---|---|---|---|---|---|
| GPU | 100 × 100 × 50 | 69.19 | 1536 | 1058 | 14.49 | 8.916 × 10⁻⁶ |
| | 200 × 200 × 50 | 73.14 | | | 13.67 | 8.411 × 10⁻⁶ |
| | 300 × 300 × 50 | 75.73 | | | 13.21 | 8.128 × 10⁻⁶ |
| | 400 × 400 × 50 | 80.73 | | | 12.39 | 7.624 × 10⁻⁶ |
| | 500 × 500 × 50 | 89.25 | | | 11.20 | 6.891 × 10⁻⁶ |
| | 300 × 300 × 20 | 70.86 | | | 14.11 | 8.682 × 10⁻⁶ |
| | 300 × 300 × 30 | 70.20 | | | 14.24 | 8.762 × 10⁻⁶ |
| | 300 × 300 × 100 | 87.68 | | | 11.38 | 7.002 × 10⁻⁶ |
| | 300 × 300 × 200 | 160.48 | | | 6.23 | 3.833 × 10⁻⁶ |
| MPPA | 100 × 100 × 50 | 140.1 | 256 | 400 | 7.14 | 69.705 × 10⁻⁶ |
| | 200 × 200 × 50 | 152.2 | | | 6.57 | 64.163 × 10⁻⁶ |
| | 300 × 300 × 50 | 169.7 | | | 5.89 | 57.546 × 10⁻⁶ |
| | 400 × 400 × 50 | 209.9 | | | 4.76 | 46.525 × 10⁻⁶ |
| | 500 × 500 × 50 | 282.5 | | | 3.54 | 34.569 × 10⁻⁶ |
| | 300 × 300 × 20 | 46.4 | | | 21.55 | 210.466 × 10⁻⁶ |
| | 300 × 300 × 30 | 78.8 | | | 12.69 | 123.929 × 10⁻⁶ |
| | 300 × 300 × 100 | 595.4 | | | 1.68 | 16.402 × 10⁻⁶ |
| | 300 × 300 × 200 | 2586.6 | | | 0.39 | 3.775 × 10⁻⁶ |
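Assuming CPS denotes the number of hyperspectral cubes processed per second (so that CPS = 1000/t, with t the total processing time in ms), the two rightmost columns follow directly from the measured times; for instance, for the MPPA and the 100 × 100 × 50 image:

```latex
\mathrm{CPS} = \frac{1000}{140.1\ \mathrm{ms}} \approx 7.14,
\qquad
\frac{\mathrm{CPS}}{\mathrm{cores}\times\mathrm{MHz}} = \frac{7.14}{256 \times 400} \approx 69.7 \times 10^{-6}.
```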
| Platform | Image | Time (ms) | Cores | Frequency (MHz) | CPS | CPS/(Core × MHz) |
|---|---|---|---|---|---|---|
| GPU | Cuprite (small) | 148.68 | 1536 | 1058 | 6.72 | 4.135 × 10⁻⁶ |
| | Cuprite (large) | 166.48 | | | 6.01 | 3.698 × 10⁻⁶ |
| | Jasper | 216.78 | | | 4.61 | 2.836 × 10⁻⁶ |
| | WTC | 215.96 | | | 4.63 | 2.849 × 10⁻⁶ |
| MPPA | Cuprite (small) | 1989.5 | 256 | 400 | 0.50 | 4.909 × 10⁻⁶ |
| | Cuprite (large) | 2194.6 | | | 0.46 | 4.450 × 10⁻⁶ |
| | Jasper | 4944.2 | | | 0.20 | 1.975 × 10⁻⁶ |
| | WTC | 4969.1 | | | 0.20 | 1.965 × 10⁻⁶ |
Feature | XC7VX690T
---|---
Logic Cells | 693,120
DSP slices | 3600
Internal memory (Kb) | 52,920
Typical power consumption (W) [31] | Up to 60
| Platform | Image | Time (ms) | Frequency (MHz) | CPS | CPS/MHz |
|---|---|---|---|---|---|
| GPU | Cuprite (large) | 166.48 | 1058 | 6.01 | 5.680 × 10⁻³ |
| | Jasper | 216.78 | | 4.61 | 4.357 × 10⁻³ |
| MPPA | Cuprite (large) | 2526.7 | 400 | 0.40 | 0.989 × 10⁻³ |
| | Jasper | 5488.8 | | 0.18 | 0.455 × 10⁻³ |
| FPGA | Cuprite (large) | 1490.0 | 76 | 0.67 | 8.831 × 10⁻³ |
| | Jasper | 4170.0 | | 0.24 | 3.155 × 10⁻³ |
| Work | Image | Time (s) |
|---|---|---|
| GPU GTX 580 [11] | Cuprite (large) | 0.239 |
| | WTC | 0.633 |
| Cloud computing * [12] | Cuprite (614 × 512 pixels × 224 bands) | 6.20 |
© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Martel, E.; Lazcano, R.; López, J.; Madroñal, D.; Salvador, R.; López, S.; Juarez, E.; Guerra, R.; Sanz, C.; Sarmiento, R. Implementation of the Principal Component Analysis onto High-Performance Computer Facilities for Hyperspectral Dimensionality Reduction: Results and Comparisons. Remote Sens. 2018, 10, 864. https://doi.org/10.3390/rs10060864