Biplot and Singular Value Decomposition Macros For Excel
Biplot and Singular Value Decomposition Macros For Excel
Biplot and Singular Value Decomposition Macros For Excel
Abstract
The biplot display is a graph of row and column markers obtained from data that forms a twoway table. The markers are calculated from the singular value decomposition of the data matrix.
The biplot display may be used with many multivariate methods to display relationships between
variables and objects. It is commonly used in ecological applications to plot relationships
between species and sites. This paper describes a set of Excel macros that may be used to draw
a biplot display based on results from principal components analysis, correspondence analysis,
canonical discriminant analysis, metric multidimensional scaling, redundancy analysis, canonical
correlation analysis or canonical correspondence analysis. The macros allow for a variety of
transformations of the data prior to the singular value decomposition and scaling of the markers
following the decomposition.
1.
Introduction
The biplot display is a commonly used multivariate method for graphing row and column
elements using a single display (Gabriel, 1971). The method has been used to display objects and
variables on the same graph in principal components analysis, row and column factors in
correspondence analysis of two-way contingency tables and to detect interaction in two-way
analysis of variance tables (Gower and Hand, 1996). Biplot displays are commonly used in the
analysis of data from ecological and environmental studies. Data are often collected on the
abundance of species at various sites. Interest is in describing the data and the biplot display
provides a method for reducing the dimensionality of the data and displaying the species and sites
jointly on the same plot. Similarities between species or sites may be gleaned from these types of
plots. Also it is common to interpret the axes in the biplot and treat the coordinates as scores on
these axes. For example, in an ecological analysis the first axis might represent a moisture
gradient while the second a temperature gradient. Species or sites may then be ranked in terms of
tolerance to moisture or temperature. Examples of the display and interpretation guidelines are
given in Legendre and Legendre (1998) with a focus on ecological data or Gower and Hand
(1996) for more general applications. A detailed numerical illustration is presented in Digby and
Kempton (1987).
The basis of the display is the singular value decomposition. If we have an n by p matrix
Y of rank r with rpn, then the matrix may be decomposed as (Seber, 1984, pg. 504)
Y = UV '
G = U ( k ) ( k )
H = V( k ) 1-( k)
The value of k determines the dimension of the approximation (typically k=2) and the user
specified parameter (0 1) determines whether emphasis is placed on the rows or columns
correspondence analysis to study relationships between the abundance of different organisms and
environmental conditions. However, users in other fields may find the program useful for
producing biplot displays for methods such as principal components analysis (PCA) and
correspondence analysis (CA).
The BIPLOT add-in for Excel is implemented in the Visual Basic for Applications
macro language. The add-in requires Excel 97 or a more recent version of Excel to function
properly. The program will calculate singular value decompositions of the data matrix (or
transformed data matrix) and produce a standard biplot display as in principal components
analysis or correspondence analysis. In addition, the program also produces displays for the other
analyses mentioned above. The steps in adding the macros to Excel are fairly simple and are
described below along with some computational details and descriptions of options.
The macro is stored in an add-in file BIPLOT01.XLA that can be added to the Excel
environment. To add the macro to Excel, open Excel, then navigate to TOOLS>ADD-INS.
When the add-in window appears, click BROWSE and find the program named BIPLOT01.XLA.
It should appear now in the add-in window. Click the box to add it to your set of Excel macros.
If successful, the item Biplot will appear on your Excel menu at the top of the spreadsheet.
Selecting Biplot will open a pull down menu that has two main pieces, one to do
calculations and the other for plotting. It also has a help menu and an information window. The
main calculations are done after selecting options using a singular value decomposition (SVD)
dialog. A graphics selection dialog is then used to choose the type of biplot to display and to
specify graphics options.
2.
Data layout
The data is entered as a two-way array in Excel. For canonical correlation analysis and
canonical correspondence analysis there will be two arrays. It is recommended that data be in the
form of a two-way table with row names and column names. In a typical analysis, the rows will
correspond to the objects and columns to the variables. Here is a simple example using data
collected at 10 sites for four species:
Site
S1
S2
S3
S4
S5
S6
S7
S8
S9
S10
Sp1
1
0
0
11
11
9
9
7
7
5
Sp2
0
0
1
4
5
6
7
8
9
10
Sp3
0
0
0
0
17
0
13
0
10
0
Sp4
0
0
0
0
7
0
10
0
13
0
Species names are in row 1 and site names are in column1 of the data table. Names are
not required but are useful in the graphical displays as identifiers. You can have this table
anywhere in the spreadsheet. For example, you can have a title above the table and other data
below the table. There should be no other columns between the row names and the numerical
values that will be analyzed. The data analyzed forms a table and the table is selected so remove
any columns or rows that are not to be included in the analysis. There is an option to remove
some columns based on certain criteria but there is not option for selecting individual columns or
rows to be analyzed.
Sometimes the data will contain the squared symmetric matrix of inner products based on
some underlying n by p matrix. In fact, for MDS (multidimensional scaling) this may be the most
typical format since the data often are dissimilarities (or proximities) among stimuli. The data
below gives an example of a so-called confusion matrix; here it is percentage of times that the
pairs of Morse code signals for two numbers were declared to be the same by 598 subjects
(Rothkopf, 1957)
Signal
.----
..---
--
.-
..
-.
--
---..
---.
.----
84
62
16
12
12
20
37
57
----52
..---
62
89
59
23
14
25
25
28
18
--
16
59
86
38
27
33
17
16
.-
23
38
89
56
34
24
13
..
12
27
56
90
30
18
10
-.
12
14
33
34
30
86
65
22
18
--
20
25
17
24
18
65
85
65
31
15
---..
37
25
16
13
10
22
65
88
58
39
---.
57
28
31
58
91
79
-----
52
18
18
15
39
79
94
3. Find the first absolute maximum eigenvalue of R(j) using the power method: start with a
random vector xo, then iteratively compute wi = R(j) xi-1, xi = wi /|| wi || (where || w|| is the
squared Euclidean norm of vector w) until the relative norm L = || xi - xi-1||/ || xi-1|| < 10-10 or
abs(L-2)<10-10. The latter condition corresponds to the case of a negative eigenvalue. Then
the direction of xi switches from iteration to iteration and L tends to 2. The maximum number
of iterations is set to 500; if the algorithm does not converge, the program terminates and
prints out a report that would contain all elements of matrices U and V that were determined
at the previous steps.
4. Set vj = xi from the ith iteration of the previous step, form the j-th maximum absolute
eigenvalue of R, dj = vj'R(j) vj, compute the maximum singular value of Y as
j = abs(d j ) , set uj = Yvj / j . Assign the appropriate sign to the singular value.
(Actually it can be always taken as positive. Indeed, if it is negative, then the sign of u will
also change. Therefore there is indeterminacy in the signs of singular values and singular
vectors, and the former can always be positive.)
5. Compute the residual matrix R(j+1)= R(j) djujvj and repeat steps 2, 3, 4 until all r= rank(R)
eigenvalues of matrix R are found or until the algorithm fails to converge. The rank of R is
the number of nonzero eigenvalues; once the absolute value of an eigenvalues is smaller than
the tolerance = 10-10, the algorithm terminates. If the user requests that only a certain number
of eigenvalues be extracted, the algorithm terminates after this condition has been met.
6. Print out the report that contains (1) the transformed n by p data matrix Y, if a data
transformation was selected, (2) The p by r matrix V for the SVD, (3) the n by r matrix U for
the SVD, (4) the singular values and eigenvalues associated with vectors u and v are printed
in descending order, based on the eigenvalues.
The SVD is calculated after options are selected in the SVD dialog box. The box is used to
specify locations for data and output, choices for data transformation and method of analysis, and
details about the data array. An example of the set of selections chosen for a classical PCA is
given below. Here the data set is located in the Excel Sheet1 with row and column labels as the
first row and column, respectively. The data are standardized before the principal components
are extracted. The solution will contain results for the first two principal axes. A description of
the possible selections is given below.
Dialog Options:
Data Range for Ys:
The area on the sheet that contains the data matrix. No missing data are
allowed. Simply click on
at the end of the text box to toggle to the data sheet. Select
(highlight) the data table to enter elements. Note that the data range includes the row
/column labels, if provided.
Auxillary Data range: Use this to set the X variables in canonical correlation/canonical
correspondence analysis/redundancy analysis. Note that the first row of the X data set
should contain labels if the first row of the Y set contains the column labels, however the
row labels can be only in the first column of the Y set.
Output Range:
Click on the box at the end to toggle to the data file. Then click on a
cell in the sheet to define the place of the upper left corner of the output report.
Use first column for row labels: Generally select this it allows use of row labels (objects or
sites) in the graphical displays.
Use first row for column labels: Generally select this it allows use of column labels (variables
or species) in the graphical displays.
Data matrix contains inner-products: When this option is checked, the macro will apply the
SVD to the original data matrix. This is useful when the data are already in X'X form or
when the data can be interpreted as dissimilarities/ similarities, then the SVD can be used
to plot the principal coordinates (MDS).
Output the transformed data matrix: Generally this is not used. It allows you to print out the
transformed data values.
Chart output: Generally select this as it facilitates the charting of the results; it is needed for
automatic selection of columns and rows to plot in the biplot display.
Method
Principal Components Analysis:
First the data are transformed using the
transformation selected from the data transformation section. Then the SVD algorithm
described in the previous section is applied to the transformed data. Depending on the type of
transformation the user may obtain a
PCA of the covariance matrix, when the columns centered option is selected
PCA of the correlation matrix, when the columns centered and standardized option
is selected
PCA of two way interactions, when the rows and columns centered option is
selected
Principal Coordinates Analysis of Similarities, when the rows and columns
centered option is selected along with transformed data contains cross-products
option.
y y i y j / y
Correspondence Analysis: First the data are transformed into z ij = ij
,
y i y j
where y i , y j , y are totals for the i-th row, j-th column and the entire data, respectively.
Note that this transformation requires that all column/row totals are greater than zero, or the
program will give you an error message. Therefore you should delete all columns and rows
with zero totals before invoking the macro, if you have any. This transformation is useful
when the data matrix is a contingency table whose entries are frequencies. The transformed zij
is proportional to the square root of the cells contribution to the Pearsons 2 statistic for the
independence of row and column classifications. Using the SVD of this data, a classical
Correspondence Analysis (CA) can be performed by plotting
(ai(1) , ai(2) ) = (ui(1) 1 / yi , ui(2) 2 / yi ) for the row markers and
(2)
(1)
(b(1)
y j , v(2)
y j ) for the column markers. This scaling does not
j , b j ) = (v j 1 /
j 2 /
preserve inner products (as the singular values appear in both the row and column markers).
In this program we use the standard JK, GH, and SYM biplots (preserving the respective inner
product approximations; see Gower and Hand, 1996, p 180). The classical plot may be
obtained by first computing the coordinates in Excel then using the plotting feature.
Multidimensional Scaling:
2
First, the data, yij, are transformed to zij = 0.5 yij then the
*
double centering transformation is applied to produce zij = zij zi z j + z . When the
original data represent distances dij for an underlying matrix X, the transformed data will be
*
the centered inner product matrix zij = (xi x )(x j x ) and the SVD of this matrix will
produce a Principal Coordinate Analysis. The plotted values are the principal coordinates,
( v11 , v 2 2 ) , where are the square roots of the eigenvalues of Z* and vs are the eigenvectors
of Z*.
Canonical DA: This option will allow you to do canonical discriminant analysis (CDA) and
produce the associated biplot. Note that the grouping variable should be stored in the column
after any labels, i.e. in the first data column (that is the second or first column in the data range
depending on whether the row labels in the first column are provided or not). The group
identifiers may be in either numeric or textual format.
y ij* =
y ij y i y j / y
and
y i y j
xij* =
xij x j
swj
where y i , y j , y are totals for the i-th row, j-th column and the entire abundance matrix,
respectively and columns of X* are centered and standardized using column mean and
standard deviation (swj), weighted by the row totals divided by the total of the abundance
matrix Y, i.e.
r
x j = xij y i / y , swj =
i =1
(x
i =1
ij
Note that this transformation requires that all column/row totals be greater than zero, or the
program will give you an error message. Therefore you have to delete all columns and rows
with zero totals before invoking the macro, if you have any.
Denote Xw=R1/2X*, (that is rows of X* are weighted by the row totals of Y, the elements of the
= X B , where B is the matrix of regression
diagonal matrix R). Then we obtain Y
w
coefficients obtained by applying the multivariate OLS regression to Y* and Xw, and
Notice that the described procedure differs from the traditional weighted least squares in that
only the xs are weighted in weighted least squares, and the fitted values are obtained by
applying the regression coefficients to the weighted Xw.
= H Y* , where
Another way of looking at this procedure is by writing Y
w
H w = X w ( X'w X w ) 1 X'w is the idempotent projection matrix based on the standardized and
to produce Y
=UV'.
weighted Xs. The SVD is applied to the matrix of fitted values Y
The output contains the rescaled coordinates for rows of Y (sites), columns of Y
(species), and columns of X (environmental variables), calculated as follows:
Site coordinates (in the space of Y): U* = Y* V 1
1
2
Species coordinates, V * = C1/ 2 V , where C is the diagonal matrix whose elements are
the column totals of Y.
Either U* or Uf* fitted can be paired with V* to form a biplot using any of the scaling options
(JK, GH, or SYMM). The default is the GH biplot which would post multiply V* by the
diagonal matrix .
Note that the X variables can be plotted simultaneously on same plot, resulting in a
Triplot. Coordinates for the X variables are formed from the weighted correlations with
the fitted site scores. For additional information see Ter Braak (1986) or Legendre and
Legendre (1998).
Redundancy analysis (RDA) :
When this option is selected a redundancy analysis using Y and X is performed. RDA is
based on a SVD of the fits of the centered Y* regressed on columns of the centered X*
matrix. A typical application is when Y contains logs of abundance data (sites in rows
and species in columns), and X contains some site level environmental data. The data are
first transformed into Y* and X* according to one of the available options (centered or
= X* B
centered and standardized). Then the fitted values for multivariate regression Y
are computed, where B is the matrix of regression coefficients of Y* on X* obtained by
'
'
to produce Y
= UV ' .
then applied to the matrix of fitted values Y
The output contains the rescaled coordinates for rows of Y (sites), columns of Y
(species), and columns of X (environmental variables).
Site coordinates (in the space of Y) are given by U* = Y* V 1
Fitted site coordinates (in the space of X) are given by U and species coordinates by V.
Either U* or U can be paired with V to form a biplot. The default is the JK biplot which
would post multiply U* or U by the diagonal matrix of singular values
Note that the X variables can be plotted simultaneously on same display thus making it a
Triplot. Coordinates for X variables are formed from their correlations with the fitted site
scores U. Additional details are given in Ter Braak (1994).
Canonical correlation analysis:
Selecting this produces a canonical correlation analysis. The data consists of two sets of
variables Y and X. Variables in both sets are centered and standardized. Classical
canonical correlation analysis is based on eigenvalues and eigenvectors of the product of
1
1
the correlation matrices R YY
R YX R XX
R XY . The canonical correlation biplot, as
developed by ter Braak (1990) represents the singular value decomposition of the
correlation matrix between sets Y and X as
R YX = BC
where B is formed of the interset correlations between Y and the canonical variates
(structural correlations) of the X set, and C is formed of standardized canonical
coefficients (canonical weights) of the X set. Details concerning the biplot display for
this analysis are presented in Ter Braak (1990).
Data Transformation
The data transformation options allow for centering and scaling the data prior to analysis.
The options are:
*
Corrected by the grand mean: yij = yij y
Columns centered:
biplot based on a PCA (principal components analysis) of the covariance matrix. Note that
this analysis will differ from a standard PCA as by a factor of N-1, as the matrix that is
decomposed is the covariance matrix times (N-1).
Columns centered and standardized:
useful when we want a biplot based on the PCA of the correlation matrix. Here
sj =
N
i =1
( yij y j ) 2
n 1
useful for biplots that evaluate multiplicative interaction in 2-way tables. Also if the data
represent a symmetric matrix of similarities between objects (that should satisfy certain
*
conditions), the transformation will produce the inner product matrix y ij = (x i x)(x j x)
for some underlying matrix X (see for example Mardia et al, 1979) and the SVD of matrix
{zij} will allow the user to perform a Principal Coordinate Analysis (a Multidimensional
Scaling technique).
10
Drop all columns that have a fraction of zero counts larger than a pre-specified
value. For example you may want to remove all columns (species) where more than
10% of cells are zero.
Drop all columns whose totals are less than a pre-specified amount. For example
you may want to discard all columns (species) whose totals are below some absolute
value
Drop all columns whose total is less than a pre-specified percent of the grand total.
This is the same as the previous rule, but now you set the cut-off value as a
percentage to the total.
Drop all columns that are in a certain lower percentile of the column totals. For
example you may want to analyze only species with the largest counts and therefore
you want to drop the most rare species whose cumulative counts do not exceed say
20% of the grand total. To do this, check the associated check box and enter 20.
Following computation of the singular value decomposition, the resulting values are used
to obtain the biplot scores for plotting. The BIPLOT MACRO display box is selected and used to
choose the type of scaling of the singular vectors, plot options and location of the results of the
SVD. In a typical analysis, the chart output box is selected on the singular value decomposition
box and the location of the results of the SVD are automatically entered in the box. The user then
need not worry about the location of the data to be plotted. This macro uses the built-in Excel
scatter plot chart facilities to produce a biplot with various options. The options are described
below.
11
3.
Location Options
The Biplot Macro plots the row and column markers in a 2-dimensional display. The
markers associated with the largest singular value are plotted on the horizontal axis and are
denoted as the X values. If the chart option is selected, values associated with the second largest
singular value are plotted on the vertical axis and denoted as the Y values. If chart output is
selected, the X, Y and singular value information is automatically given. To plot other axes, the
user must select the information to be plotted. Note that the macro may be used independently of
the singular value macro for example to plot data from other packages by importing markers into
Excel. Then select the first column to plot for rows. You can select the range for the second
column and labels or you can click on the
box to fill in the details automatically, if
needed. The Auto-fill assumes that the Y-range is in the next rightmost column to the X range,
and that the label range is in the next leftmost column. Also select the input range for the singular
values if chart output is not selected.
Columns: Input X range
Columns: Input Y range
Columns: Input labels range
12
Specify the location with the horizontal coordinates for the row
markers (sites).
Specify the location with the vertical coordinates for the row
markers.
Specify the location with the labels for the row markers.
Specify the location with the grouping variable for the row
markers. The grouping variable represents the categories
identifiers and may be either numbers or names.
Specify the location with the first two singular values, those that
will be used when some of the non-trivial Scaling options are
selected (JK, GH, or SYM biplots). Note that only the default (no
scaling option) should be used with CDA and Canonical
Correlation Analysis biplots.
Note: to plot the column rays select the show column rays check box. Click the box for
show labels for data points to have the row and column names displayed on the chart.
Note: All ranges specified in the above options must be vertically oriented in the
Sheet.
Chart Options
Show labels for data points:
13
Symmetric scaling
Note: if a nontrivial scaling option or adjustment factor is selected, the macro will compute the
adjusted scores and place them in a separate sheet named Tmp<number> where the number is
adjusted based on the number of temporary sheets. Therefore the chart will have references to
that sheet. After the chart is drawn you can of course make any adjustments necessary using
standard Excel options.
4.
Examples
A variety of examples are given in the help files. Users are encouraged to copy data from
the examples in help and paste it into their spreadsheet to be able to reproduce charts themselves.
5.
Disclaimer
This software has been extensively validated with many data sets and all the problems
that have been known to us as of May 1, 2002 were fixed. However, the user should understand
that there may be other undetected bugs and problems and we will be grateful for any feedback
with relevant comments and suggestions for improvements.
6.
Acknowledgements
This research has been supported through grants from the U.S. Environmental Protection
Agencys Science to Achieve Results (STAR) (Grant No. R82795301) and National Center for
Environmental Assessment (NCEA) (Grant No. CR827820-01-0) programs. Although the
research described in the article has been funded wholly or in part by the U.S. Environmental
Protection Agencys STAR and NCEA programs, it has not been subjected to any EPA review
and therefore does not necessarily reflect the views of the Agency, and no official endorsement
should be inferred. Comments from an anonymous referee greatly enhanced the presentation of
the paper and program. Ilya Lipkovich programmed the biplot macro. The macros are provided
at no charge. Please cite the paper if you use the macro in publications.
14
7.
References
Digby, P.G.N. and Kempton, R.A. (1987). Multivariate Analysis of Ecological Communities.
Chapman and Hall, London.
Gabriel, K.R. (1971). The biplot-graphic display of matrices with application to principal
component analysis. Biometrika 58, 453-467.
Gower, C. and Hand D.J. (1996). Biplots. Chapman & Hill, London.
Greenacre, M. (1984). Theory and Applications of Correspondence Analysis. Academic Press,
London.
Heath, M.T. (1997). Scientific Computing. McGraw-Hill, New York.
Legendre, P., and Legendre, L. (1998). Numerical Ecology, Elsevier, Amsterdam.
Mardia, K.V., Kent, J.T. and Bibby, J.M. (1979). Multivariate Analysis, London: Academic
Press.
Rothkopf, E.Z. (1957). A measure of stimulus similarity and errors in some paired-associate
learning tasks. J. Exp. Psychol., 53, 94-101
Seber, G.A.F. (1984). Multivariate Observations, John Wiley and Sons, New York.
Ter Braak, C.J.F. (1986). Canonical correspondence analysis: A new eigenvector technique for
multivariate direct gradient analysis, Ecology 67:5, 1167-1179.
Ter Braak, C.J.F. (1990). Interpreting canonical correlation analysis through biplots of structural
correlations and weights. Psychometrika 55, 519-531.
Ter Braak, C.J.F. and Looman, C.W. (1994). Biplots in reduced-rank regression. Biometrical
Journal. 36:8, 983-1003.
15