Analytica Chimica Acta 376 (1998) 169±181
A numerical procedure for curve ®tting of noisy infrared spectra
Carlos E. Alciaturia,*, Marcos E. Escobara,b, Carlos De La Cruzb
a
Instituto Zuliano de Investigaciones TecnoloÂgicas (INZIT±CICASI), Apartado Postal 1114, Maracaibo, Venezuela
Laboratorio de Espectroscopia Molecular y AtoÂmica, Departamento de QuõÂmica, Facultad Experimental de Ciencias,
La Universidad del Zulia (LUZ), Maracaibo, Venezuela
b
Received 30 September 1997; received in revised form 29 June 1998; accepted 6 July 1998
Abstract
A method for the detection of over®tting of noisy spectra is presented. When ®tting data that contains random noise, the
autocorrelation function (RL) of residuals at lag 1 (R1) approaches zero and then shows a tendency toward more negative
values, while the Wald±Wolfowitz test tends to give more positive values, as the data is over®tted. Models that give residuals
with a non-random ordering may be rejected. The use of these functions for the determination of the ``best'' ®tting of several
approximations (Savitzky±Golay smoothing, segmented osculating polynomials (SOPA), Fourier series, and Lorentzian
bands) to a noisy synthetic spectrum, and the application to infrared spectra of coal, is shown. # 1998 Elsevier Science B.V.
All rights reserved.
Keywords: Smoothing; Over®tting; Autocorrelation; Infrared spectra; Coal
1. Introduction
When the exercise of ®tting functions such as
polynomials or Fourier series to noisy spectra is
performed, it is possible to ``®t'' exactly the model
to the results if one introduces enough parameters. A
common trouble arises because the model would then
reproduce the noise as well as the relevant information
in the data. Many authors consider indispensable to
``smooth'' the data, because it reduces the noise as
well. Smoothing with polynomials is usually accomplished using a polynomial of degree m<nÿ1 where n
is the number of data points. The equivalent procedure
with Fourier series consists mainly in the elimination
*Corresponding author. Tel.: +61-91-0952; fax: +61-91-3769;
e-mail: alciaturi@hotmail.com
0003-2670/98/$19.00 # 1998 Elsevier Science B.V. All rights reserved.
PII S0003-2670(98)00504-2
of the highest-frequency terms. However, there
seem to be no generally applicable procedures for
determining the ``correct'' amount of smoothing, or
the ``right'' number of parameters to use in a particular
model.
The goodness-of-®t of a model is very often measured by spectroscopists with the ``discrepancy index''
(DIS) [1] or the sample standard deviation (SD) [2]:
X
1=2
;
(1)
DIS
2 =n
SD
X
2 = n ÿ k
1=2
;
(2)
where is the difference (often referred to as residual) between the experimental and calculated values,
n the number of points and k the number of parameters
in the model. When ®tting polynomials to data with
170
C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181
normally distributed errors, Ralston [3] suggests to
increase the number of terms until SD does not
decrease in a signi®cant amount.
Another test routinely used to select models is the
cross-validation with the PRESS (prediction residual
sum of squares, a bootstrap technique), most often
with a leave-one-out procedure, where the model is
constructed excluding one of the points at a time, with
the value for that point predicted from the model [4±
8]. The procedure is repeated for every point. The
approximation with the lowest PRESS is preferred.
From the PRESS, a root-mean-squares deviation may
be calculated. In this paper, we call this deviation
``DIS from PRESS'':
X
1=2
;
(3)
DIS from PRESS
2 =n
where are the deviations from the models and n the
number of points.
A low DIS (or standard deviation) is not enough to
prove the soundness of a model. It is useful to plot the
residuals vs. the independent variable [1,2]. Applying
this procedure, discrepancies from the model can be
easily detected. Often, residuals will form ``clumps''
on one side of the abscissa, indicating a non-random
ordering. Another case of non-random ordering occurs
when residuals alternate systematically. These two
cases may be illustrated by the sequences:
ÿÿÿÿÿ and ÿÿÿÿÿ, respectively. To verify if the residuals are randomly distributed, the autocorrelation function (RL) may be applied:
RL
n ÿ L
P
P P
Yi YiL ÿ Yi YiL = n ÿ L n ÿ L ÿ 1
;
P 2
P 2
Yi =n n ÿ 1
n Yi ÿ
(4)
where n is the total number of points in the series, i the
position in the series, and L the lag [2]. The lag is the
offset between the points being correlated. This function measures the correlation of a series of numerical
data (often a time series) with itself. When the points
``cluster'' in groups, the sequence is not random, and
high positive values of RL will appear. If the points
alternate between positive and negative values, RL
takes high negative values. Thus, this function is
useful for the detection of systematic deviations with
respect to the model utilized. For a random series of
numbers, the expected autocorrelation value is zero,
and its population standard deviation is given by the
formula [2]:
L 1= n ÿ L1=2 ;
(5)
where n and L are de®ned as in Eq. (2). Also, Z-values
may be calculated, and a probability associated with
each observation
Z RL ÿ 0=L :
(6)
RL may be then used to test whether a series of
numbers is random, provided n is ``large'' and L is
``small''. A ``rule of thumb'' states that n should be
larger than 50 and L smaller than n/4 [2]. The autocorrelation function does not seem to have been
widely used by spectroscopists. Smit, in his reviews
[9,10] about speci®cation and estimation of noisy
analytical signals, explains the use of the autocorrelation function for the description of random and deterministic signals. No reference, however, is made in
these papers of applications to residuals.
Also, the theory of runs may be employed. By
considering the positive deviations to be from one
population (A) and the negative deviations from
another population (B), the Wald±Wolfowitz test
may be applied. If the number of positive deviations
is na and the number of negative deviations is nb, the
expected number of runs is [11,12]
2na nb = na nb 1
(7)
with a variance
2 2na nb 2na nb ÿ na ÿ nb = na nb 2 na nb ÿ 1:
(8)
In this way, if r is the observed number of runs, Zvalues may be calculated, together with a probability
associated with each observation, as in Eq. (4):
Z r ÿ =:
(9)
This test, however, does not take into account the
quantitative values of the residuals.
Other statistical parameters are commonly used in
model selection. The F-test and the Durbin±Watson
test for autocorrelation are mentioned in textbooks
[4±13] and are included in commercial statistics
programs.
Maddams, in his extensive review of curve ®tting
[1] cites several statistical criteria that may be useful to
select between alternative sets of ®tting parameters.
Hamilton's R factor has been used in signi®cance tests
C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181
in crystallography and in band ®tting of Raman
spectra [1,14±16].
When comparing different models, the ``parsimony
principle'' is usually applied, i.e. when two models
represent the data equally well, the one with the
smallest number of parameters should be chosen
[17,18]. It has been pointed out [1] that no model
should be chosen on statistical evidence alone, physical signi®cance being also of importance.
In this paper, it is shown the application of residuals
of the autocorrelation function with lag 1 (R1) and the
Wald±Wolfowitz test, to determine whether or not
there are ``too many'' parameters in models that
represent infrared spectra of coal, i.e. when there is
``over®tting''. The results are compared with the use
of the standard deviation (SD) and the application of
prediction residuals sum of squares (PRESS). Those
functions are also employed here to determine the
right amount of smoothing when applying the
Savitzky±Golay procedure [19±23] to the same
data.
1710 Fourier-transform, i.e. interferometric, spectrometer with triglycine sulfate (TGS) pyroelectric detector. The sample was a neat, pulverized (ÿ200 mesh)
Venezuelan coal. This sample is the number 35 of a set
of Venezuelan coals studied by us [24] and was
classi®ed as a high-volatile bituminous coal. This
spectrum was exported to JCAMP format using the
Perkin Elmer Infrared Data Manager (IRDM) software and further processed with programs written by
one of us (CEA) in the Microsoft Corporation QuickBasic language. The data points in the digital JCAMP
format were spaced as in the nominal resolution, i.e.
4 cmÿ1.
3. Results and discussion
3.1. Fitting data points generated from Lorentzian
bands and random noise
Fig. 1 shows a set of three synthetic Lorentzian
bands. The bands were calculated with the function
f X A= 1 4 X ÿ B=C2 ;
2. Experimental
The experimental infrared spectrum was obtained
averaging 100 scans at 4 cmÿ1 resolution using a
Diffuse Re¯ectance accessory and a Perkin Elmer
171
(10)
where A is the maximum amplitude, B the position of
the maxima, and C the width at the half-height. The
parameters (A, B, C) for the three bands are, respectively, (1, 500, 40), (1, 520, 40) and (1, 400, 60). The
Fig. 1. Synthetic Lorentzian bands with noise added (see Section 3.1).
172
C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181
Fig. 2. Parameters from the application of the Fourier approximation to the synthetic spectrum of Fig. 1 (see Section 3.1.1).
points displayed result from the addition of the three
bands and Gaussian noise with a population standard
deviation of 0.03.
3.1.1. Comparison of Savitzky±Golay smoothing,
SOPA, and Fourier series
Figs. 2±4 display the results for the application of
these methods to the synthetic data of Fig. 1. The
tendencies for the Z-values are easily seen from those
®gures.
Notice that the value of the expected standard
deviation of R1, for a random series of numbers,
with n257, calculated from Eq. (5), is
1 1= 257 ÿ 11=2 0:0625. This is the value
used in Eq. (6) for the calculation of Z-values in this
example. Segmented lines are included at Z1.96
for a 95% con®dence level. Thus, models giving
Z-values outside these limits may be rejected.
The results for the Fourier series are shown in
Fig. 2. From the Z(R1) values, the models with less
than 37 (excessive smoothing) or more than 41 coef®cients (over®tting) were rejected. The Wald±Wolfowitz test gave a greater spread of ``acceptable''
models, while both the SD and PRESS calculations
show a minimum for the model with 43 coef®cients.
The Savitzky±Golay procedure which is a smoothing
C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181
173
Fig. 3. Parameters from the application of the Savitzky±Golay smoothing to the synthetic spectrum of Fig. 1 (see Section 3.1.1).
procedure, makes it dif®cult to calculate the degrees of
freedom or the PRESS; however, the autocorrelation
function and the Wald±Wolfowitz test can be easily
calculated from the residuals. The results are shown in
Fig. 3. Considering Z(R1), smoothed values from
window widths greater than 43 and smaller than 37
were rejected.
The results for SOPA are shown in Fig. 4. The
segmented osculating polynomial approximation
(SOPA) and its application for correlation of coal
properties was explained by us in a previous paper
[25]. SOPA uses the central-point approximation to
calculate parameters for the nodes. A segment length
of m corresponds to a window width of 2m1 in the
method of Savitzky±Golay. Values between nodes are
found by interpolation [25]. From the Z(R1) values in
Fig. 4, the models with segment lengths greater than
20 (excessive smoothing) or less than 17 (over®tting)
were rejected. As in the procedure of Savitzky±Golay,
the PRESS is dif®cult to perform. SD values, however,
were calculated from the number of variables in the
model, considering that there are ®ve variables at each
node. For example, considering the segment length of
16 (the minimum of SD in Fig. 4), the number of
nodes is 17 and the number of variables 85. It is seen
that the changes in Z and SD values for this approxi-
174
C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181
Fig. 4. Parameters from the application of SOPA to the synthetic spectrum of Fig. 1 (see Section 3.1.1).
mation are somewhat irregular. However, the tendencies are clearly the same as in the previous ®gures.
The ``best'' approximations may be chosen as the
ones which have Z(R1) closest to zero (see Figs. 2±4).
These are: the model with 39 coef®cients for Fourier,
smoothed values with a window width of 41 for
Savitzky±Golay, and a segment length of 20 for SOPA.
The last one has 14 nodes, and 70 parameters. All of
them have similar values of DIS. However, the
parsimony principle indicates that the Fourier
approximation should be preferred over SOPA in
this case, because it needs a smaller number of
parameters.
3.1.2. Fitting with Lorentzian bands
Lorentzian bands were ®tted to the synthetic data of
Fig. 1 by linear least-squares procedures, adjusting
only the values of the amplitudes (``A'' parameters).
The bands used were (B, C values (Eq. (10)) shown
between parenthesis) ± 1: (500, 40); 2: (520, 40); 3:
(400, 60); 4: (510, 60); 5: (410, 40); 6: (390, 40). When
®tting all of the six bands, the intensities are ± 1:
0.948; 2: 0.975; 3: 0.894; 4: 0.080; 5: 0.052; 6: 0.075.
The results of ®tting are shown in Fig. 5. It is seen that
bands 4±6 have a minor in¯uence on the values of the
calculated parameters. Z-values favor the models with
3, 4, 5, or 6 bands; however, given the very small
C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181
175
Fig. 5. Parameters from the fitting of Lorentzian bands to the synthetic spectrum of Fig. 1 (see Section 3.1.2).
change in Z-values caused by bands 4±6, the parsimony principle indicates the model with 3 bands as the
best. Also, there is a minimum for DIS and DIS from
PRESS for the model with the ®rst three bands.
between consecutive points. When processing spectra
from Fourier transform spectrometers, it should be
veri®ed that the separation between the data points is
equal to the nominal resolution. Otherwise, the points
``in-between'' will not show random deviations.
3.2. Fitting data points from DRIFT spectra of coal
Fig. 6 shows the infrared spectrum of a coal sample
between 3100 and 2796 cmÿ1. This is the region
where the C±H stretching bands appear, and has been
widely studied in the literature [26±30]. This example
contains 77 data points for a separation of 4 cmÿ1
3.2.1. Comparison of Savitzky±Golay smoothing,
SOPA, and Fourier series
The results of the approximations are shown in
Figs. 7±9. The value of 1 calculated for n77
(Eq. (5)) is 0.115. As before, Z-values were calculated
from Eqs. (6) and (9).
176
C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181
Fig. 6. DRIFT spectrum of a coal sample (see Section 3.2).
Fig. 7. Parameters from the application of the Fourier approximation to the coal spectrum of Fig. 6 (see Section 3.2.1).
C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181
177
Fig. 8. Parameters from the application of the Savitzky±Golay smoothing to the coal spectrum of Fig. 6 (see Section 3.2.1).
The results for the Fourier series are displayed in
Fig. 7. From the Z(R1) values, the models with less
than 19 or more than 29 coef®cients were rejected.
Also in this case, the Wald±Wolfowitz test gave a
greater spread of ``acceptable'' models, while DIS
from PRESS values show a minimum for the model
with 27 coef®cients. SD values do not show a clear
minimum in the range considered. The results for the
Savitzky±Golay procedure are displayed in Fig. 8.
Considering the Z(R1) values in this ®gure, smoothed
values from window widths greater than 27 and
smaller than 25 were rejected.
The results for SOPA are shown in Fig. 9. From the
Z(R1) values in this ®gure, the models with segment
lengths greater than 13 or less than 9 were discarded.
A minimum for the SD values is reached at a segment
length of 10. For this segment length, the number of
nodes is 9 and the number of variables is 45.
The ®tted curves for SOPA and the second-derivatives are given in Fig. 10(a). The downward-pointing
minima in the second-derivative [1] suggest the presence of nine bands in this region. The derivatives
from the Savitzky±Golay procedure or Fourier series
(not shown) indicate the presence of bands at approxi-
178
C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181
Fig. 9. Parameters from the application of the SOPA to the coal spectrum of Fig. 6 (see Section 3.2.1).
mately the same positions. Thus, the data points were
®tted with Lorentzian bands, as it is explained in the
following section.
3.2.2. Fitting of Lorentzian bands
The ®tting of Lorentzian bands was done by the
steepest descent procedure [31], optimizing amplitudes, positions and width at half-height (A, B, and
C in Eq. (10)). A second-degree polynomial was used
for the base line. Six of the bands were found to be
signi®cant (see Fig. 10 (b)). These bands are (A, B ,
and C given): 5.56710ÿ2, 3065, 17; 4.14410ÿ2,
3041, 31; 0.2261, 2960, 33; 0.2398, 2926, 40; 0.1653,
2896, 61; 0.2544, 2853, 30. Fig. 11 demonstrates that
the model with these six bands yields satisfactory
Z-values and has a minimum for SD and DIS from
PRESS.
Thus, these bands represent a good model for the
data. The Lorentzian bands also give a more ``compact'' representation (18 parameters only) of the spectrum than SOPA or Fourier series. However, when
®tting the data from the other coals in [24] with these
six bands, although most spectra gave satisfactory
values of DIS, most Z-values were outside the accept-
C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181
179
Fig. 10. Fitting of the SOPA approximation (a) and Lorentzian bands (b) to the coal spectrum of Fig. 6 (see Sections 3.2.1 and 3.2.2).
able range (1.96) indicating an imperfect match for
this model.
4. Conclusions
It was shown that the determination of the randomness of residuals is a useful test for selection of
models. In certain cases, the autocorrelation function
with lag 1 (R1) or the Wald±Wolfowitz test provides an
objective way of avoiding excessive smoothing (or the
opposite, over®tting), making it possible to reject
approximations on statistical grounds. In the examples
displayed, R1 was more selective than the Wald±
Wolfowitz test. Thus, when performing curve-®tting
or smoothing procedures, the determination of the
autocorrelation of residuals should be done together
with the discrepancy index (DIS), standard deviation
(SD), or bootstrap procedures (such as PRESS). The
autocorrelation of residuals has the advantage of being
easily applicable for testing smoothing procedures,
while SD or PRESS are dif®cult to determine for
these. It was found that the comparison between the
Fourier transform, SOPA, and Lorentzian functions,
180
C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181
Fig. 11. Parameters from the fitting of Lorentzian bands to the coal spectrum of Fig. 6 (see Section 3.2.2).
results in that the last one gave the most compact
representation of the infrared spectra of coal in the
3100±2796 cmÿ1 region.
Acknowledgements
We acknowledge with thanks the ®nancial assistance through Project S1-2278 of the Consejo Nacional de Investigaciones Cientõ®cas y TecnoloÂgicas
(CONICIT, Venezuela). We also thank the reviewers
for useful suggestions, and specially Reviewer II for
calling our attention to the Wald±Wolfowitz test.
References
[1] W.F. Maddams, Appl. Spectrosc. 34 (1980) 245.
[2] J.C. Davis, Statistics and Data Analysis in Geology, Wiley,
New York, 1973.
[3] A. Ralston, IntroduccioÂn al anaÂlisis numeÂrico, Limusa-Wiley,
New York, 1970.
[4] R.E. Walpole, R.H. Myers, Probability and Statistics for
Engineers and Scientists, 4th ed., MacMillan, New York,
1990.
[5] P. Geladi, B.R. Kowalski, Anal. Chim. Acta 185 (1986) 1.
[6] P. Geladi, B.R. Kowalski, Anal. Chim. Acta 185 (1986) 19.
[7] E.V. Thomas, Anal. Chem. 66 (1994) 795A.
[8] S. Sekulic, M.B. Seasholtz, Z. Wang, B.R. Kowalski, S.E.
Lee, B.R. Holt, Anal. Chem. 65 (1993) 835A.
C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181
[9]
[10]
[11]
[12]
[13]
[14]
[15]
[16]
[17]
[18]
[19]
[20]
H.C. Smit, Chemom. Int. Lab. Systems 8 (1990) 15.
H.C. Smit, Chemom. Int. Lab. Systems 8 (1990) 29.
S. Wilks, Mathematical Statistics, Wiley, New York, 1962.
M.R. Spiegel, Statistics (Schaum Series), 2nd ed., McGrawHill, New York, 1988.
D.K. Hildebrand, R.L. Ott, Statistical Thinking for Managers,
PWS Kent, 1991.
W.C. Hamilton, Acta Cryst. 18 (1965) 502.
A.T. Lemley, J.H. Roberts, K.R. Plowman, J.J. Lagowski,
J. Phys. Chem. 77 (1973) 2185.
P. Gans, B. Gill, Appl. Spectrosc. 31 (1977) 451.
H.G. Gauch, American Scientist 81 (1993) 468.
W.H. Jefferys, J.O. Berger, American Scientist 80 (1992) 64.
A. Savitzky, M.J.E. Golay, Anal. Chem. 36 (1964) 1627.
J. Steiner, Y. Termonia, J. Deltour, Anal. Chem. 44 (1972)
1906.
[21]
[22]
[23]
[24]
[25]
[26]
[27]
[28]
[29]
[30]
[31]
181
H.H. Madden, Anal. Chem. 50 (1978) 1383.
S.E. Bialkowski, Anal. Chem. 61 (1989) 1308.
P.A. Gorry, Anal. Chem. 62 (1990) 570.
C.E. Alciaturi, M.E. Escobar, R. Vallejo, Fuel 4 (1996)
491.
C.E. Alciaturi, T. Montero, C. De La Cruz, M.E. Escobar,
Anal. Chim. Acta 340 (1997) 233.
W.I. Friesen, K.H. Michaelian, Appl. Spectrosc. 45 (1991) 50.
K.H. Michaelian, W.I. Friesen, Fuel 69 (1990) 1271.
S.-H. Wang, P.R. Griffiths, Fuel 64 (1985) 229.
D.W. Kuehn, R.W. Snyder, A. Davis, P.C. Painter, Fuel 61
(1982) 682.
P.C. Painter, R.W. Snyder, M. Starsinic, M.C. Coleman, D.W.
Kuehn, A. Davis, Appl. Spectrosc. 35 (1981) 475.
D. Kincaid, W. Cheney, Numerical Analysis: Mathematics of
Scientific Computing, Brooks/Cole, 1991.