Nothing Special   »   [go: up one dir, main page]

Academia.eduAcademia.edu

A numerical procedure for curve fitting of noisy infrared spectra

1998, Analytica Chimica Acta

Analytica Chimica Acta 376 (1998) 169±181 A numerical procedure for curve ®tting of noisy infrared spectra Carlos E. Alciaturia,*, Marcos E. Escobara,b, Carlos De La Cruzb a Instituto Zuliano de Investigaciones TecnoloÂgicas (INZIT±CICASI), Apartado Postal 1114, Maracaibo, Venezuela Laboratorio de Espectroscopia Molecular y AtoÂmica, Departamento de QuõÂmica, Facultad Experimental de Ciencias, La Universidad del Zulia (LUZ), Maracaibo, Venezuela b Received 30 September 1997; received in revised form 29 June 1998; accepted 6 July 1998 Abstract A method for the detection of over®tting of noisy spectra is presented. When ®tting data that contains random noise, the autocorrelation function (RL) of residuals at lag 1 (R1) approaches zero and then shows a tendency toward more negative values, while the Wald±Wolfowitz test tends to give more positive values, as the data is over®tted. Models that give residuals with a non-random ordering may be rejected. The use of these functions for the determination of the ``best'' ®tting of several approximations (Savitzky±Golay smoothing, segmented osculating polynomials (SOPA), Fourier series, and Lorentzian bands) to a noisy synthetic spectrum, and the application to infrared spectra of coal, is shown. # 1998 Elsevier Science B.V. All rights reserved. Keywords: Smoothing; Over®tting; Autocorrelation; Infrared spectra; Coal 1. Introduction When the exercise of ®tting functions such as polynomials or Fourier series to noisy spectra is performed, it is possible to ``®t'' exactly the model to the results if one introduces enough parameters. A common trouble arises because the model would then reproduce the noise as well as the relevant information in the data. Many authors consider indispensable to ``smooth'' the data, because it reduces the noise as well. Smoothing with polynomials is usually accomplished using a polynomial of degree m<nÿ1 where n is the number of data points. The equivalent procedure with Fourier series consists mainly in the elimination *Corresponding author. Tel.: +61-91-0952; fax: +61-91-3769; e-mail: alciaturi@hotmail.com 0003-2670/98/$19.00 # 1998 Elsevier Science B.V. All rights reserved. PII S0003-2670(98)00504-2 of the highest-frequency terms. However, there seem to be no generally applicable procedures for determining the ``correct'' amount of smoothing, or the ``right'' number of parameters to use in a particular model. The goodness-of-®t of a model is very often measured by spectroscopists with the ``discrepancy index'' (DIS) [1] or the sample standard deviation (SD) [2]: X 1=2 ; (1) DIS ˆ 2 =n SD ˆ X 2 = n ÿ k† 1=2 ; (2) where  is the difference (often referred to as residual) between the experimental and calculated values, n the number of points and k the number of parameters in the model. When ®tting polynomials to data with 170 C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181 normally distributed errors, Ralston [3] suggests to increase the number of terms until SD does not decrease in a signi®cant amount. Another test routinely used to select models is the cross-validation with the PRESS (prediction residual sum of squares, a bootstrap technique), most often with a leave-one-out procedure, where the model is constructed excluding one of the points at a time, with the value for that point predicted from the model [4± 8]. The procedure is repeated for every point. The approximation with the lowest PRESS is preferred. From the PRESS, a root-mean-squares deviation may be calculated. In this paper, we call this deviation ``DIS from PRESS'': X 1=2 ; (3) DIS from PRESS ˆ 2 =n where  are the deviations from the models and n the number of points. A low DIS (or standard deviation) is not enough to prove the soundness of a model. It is useful to plot the residuals vs. the independent variable [1,2]. Applying this procedure, discrepancies from the model can be easily detected. Often, residuals will form ``clumps'' on one side of the abscissa, indicating a non-random ordering. Another case of non-random ordering occurs when residuals alternate systematically. These two cases may be illustrated by the sequences: ‡‡‡‡‡ÿÿÿÿÿ and ‡ÿ‡ÿ‡ÿ‡ÿ‡ÿ, respectively. To verify if the residuals are randomly distributed, the autocorrelation function (RL) may be applied: RL ˆ ‰ n ÿ L† P P P Yi Yi‡L † ÿ Yi Yi‡L Š= n ÿ L† n ÿ L ÿ 1† ; P 2 P 2 Yi † Š=n n ÿ 1† ‰n Yi ÿ (4) where n is the total number of points in the series, i the position in the series, and L the lag [2]. The lag is the offset between the points being correlated. This function measures the correlation of a series of numerical data (often a time series) with itself. When the points ``cluster'' in groups, the sequence is not random, and high positive values of RL will appear. If the points alternate between positive and negative values, RL takes high negative values. Thus, this function is useful for the detection of systematic deviations with respect to the model utilized. For a random series of numbers, the expected autocorrelation value is zero, and its population standard deviation is given by the formula [2]: L ˆ 1= n ÿ L††1=2 ; (5) where n and L are de®ned as in Eq. (2). Also, Z-values may be calculated, and a probability associated with each observation Z ˆ RL ÿ 0†=L : (6) RL may be then used to test whether a series of numbers is random, provided n is ``large'' and L is ``small''. A ``rule of thumb'' states that n should be larger than 50 and L smaller than n/4 [2]. The autocorrelation function does not seem to have been widely used by spectroscopists. Smit, in his reviews [9,10] about speci®cation and estimation of noisy analytical signals, explains the use of the autocorrelation function for the description of random and deterministic signals. No reference, however, is made in these papers of applications to residuals. Also, the theory of runs may be employed. By considering the positive deviations to be from one population (A) and the negative deviations from another population (B), the Wald±Wolfowitz test may be applied. If the number of positive deviations is na and the number of negative deviations is nb, the expected number of runs is [11,12]  ˆ 2na nb = na ‡ nb † ‡ 1 (7) with a variance 2 ˆ 2na nb 2na nb ÿ na ÿ nb †= na ‡ nb †2 na ‡ nb ÿ 1†: (8) In this way, if r is the observed number of runs, Zvalues may be calculated, together with a probability associated with each observation, as in Eq. (4): Z ˆ r ÿ †=: (9) This test, however, does not take into account the quantitative values of the residuals. Other statistical parameters are commonly used in model selection. The F-test and the Durbin±Watson test for autocorrelation are mentioned in textbooks [4±13] and are included in commercial statistics programs. Maddams, in his extensive review of curve ®tting [1] cites several statistical criteria that may be useful to select between alternative sets of ®tting parameters. Hamilton's R factor has been used in signi®cance tests C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181 in crystallography and in band ®tting of Raman spectra [1,14±16]. When comparing different models, the ``parsimony principle'' is usually applied, i.e. when two models represent the data equally well, the one with the smallest number of parameters should be chosen [17,18]. It has been pointed out [1] that no model should be chosen on statistical evidence alone, physical signi®cance being also of importance. In this paper, it is shown the application of residuals of the autocorrelation function with lag 1 (R1) and the Wald±Wolfowitz test, to determine whether or not there are ``too many'' parameters in models that represent infrared spectra of coal, i.e. when there is ``over®tting''. The results are compared with the use of the standard deviation (SD) and the application of prediction residuals sum of squares (PRESS). Those functions are also employed here to determine the right amount of smoothing when applying the Savitzky±Golay procedure [19±23] to the same data. 1710 Fourier-transform, i.e. interferometric, spectrometer with triglycine sulfate (TGS) pyroelectric detector. The sample was a neat, pulverized (ÿ200 mesh) Venezuelan coal. This sample is the number 35 of a set of Venezuelan coals studied by us [24] and was classi®ed as a high-volatile bituminous coal. This spectrum was exported to JCAMP format using the Perkin Elmer Infrared Data Manager (IRDM) software and further processed with programs written by one of us (CEA) in the Microsoft Corporation QuickBasic language. The data points in the digital JCAMP format were spaced as in the nominal resolution, i.e. 4 cmÿ1. 3. Results and discussion 3.1. Fitting data points generated from Lorentzian bands and random noise Fig. 1 shows a set of three synthetic Lorentzian bands. The bands were calculated with the function f X† ˆ A= 1 ‡ 4  X ÿ B†=C†2 †; 2. Experimental The experimental infrared spectrum was obtained averaging 100 scans at 4 cmÿ1 resolution using a Diffuse Re¯ectance accessory and a Perkin Elmer 171 (10) where A is the maximum amplitude, B the position of the maxima, and C the width at the half-height. The parameters (A, B, C) for the three bands are, respectively, (1, 500, 40), (1, 520, 40) and (1, 400, 60). The Fig. 1. Synthetic Lorentzian bands with noise added (see Section 3.1). 172 C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181 Fig. 2. Parameters from the application of the Fourier approximation to the synthetic spectrum of Fig. 1 (see Section 3.1.1). points displayed result from the addition of the three bands and Gaussian noise with a population standard deviation of 0.03. 3.1.1. Comparison of Savitzky±Golay smoothing, SOPA, and Fourier series Figs. 2±4 display the results for the application of these methods to the synthetic data of Fig. 1. The tendencies for the Z-values are easily seen from those ®gures. Notice that the value of the expected standard deviation of R1, for a random series of numbers, with nˆ257, calculated from Eq. (5), is 1 ˆ 1= 257 ÿ 1††1=2 ˆ 0:0625. This is the value used in Eq. (6) for the calculation of Z-values in this example. Segmented lines are included at Zˆ1.96 for a 95% con®dence level. Thus, models giving Z-values outside these limits may be rejected. The results for the Fourier series are shown in Fig. 2. From the Z(R1) values, the models with less than 37 (excessive smoothing) or more than 41 coef®cients (over®tting) were rejected. The Wald±Wolfowitz test gave a greater spread of ``acceptable'' models, while both the SD and PRESS calculations show a minimum for the model with 43 coef®cients. The Savitzky±Golay procedure which is a smoothing C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181 173 Fig. 3. Parameters from the application of the Savitzky±Golay smoothing to the synthetic spectrum of Fig. 1 (see Section 3.1.1). procedure, makes it dif®cult to calculate the degrees of freedom or the PRESS; however, the autocorrelation function and the Wald±Wolfowitz test can be easily calculated from the residuals. The results are shown in Fig. 3. Considering Z(R1), smoothed values from window widths greater than 43 and smaller than 37 were rejected. The results for SOPA are shown in Fig. 4. The segmented osculating polynomial approximation (SOPA) and its application for correlation of coal properties was explained by us in a previous paper [25]. SOPA uses the central-point approximation to calculate parameters for the nodes. A segment length of m corresponds to a window width of 2m‡1 in the method of Savitzky±Golay. Values between nodes are found by interpolation [25]. From the Z(R1) values in Fig. 4, the models with segment lengths greater than 20 (excessive smoothing) or less than 17 (over®tting) were rejected. As in the procedure of Savitzky±Golay, the PRESS is dif®cult to perform. SD values, however, were calculated from the number of variables in the model, considering that there are ®ve variables at each node. For example, considering the segment length of 16 (the minimum of SD in Fig. 4), the number of nodes is 17 and the number of variables 85. It is seen that the changes in Z and SD values for this approxi- 174 C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181 Fig. 4. Parameters from the application of SOPA to the synthetic spectrum of Fig. 1 (see Section 3.1.1). mation are somewhat irregular. However, the tendencies are clearly the same as in the previous ®gures. The ``best'' approximations may be chosen as the ones which have Z(R1) closest to zero (see Figs. 2±4). These are: the model with 39 coef®cients for Fourier, smoothed values with a window width of 41 for Savitzky±Golay, and a segment length of 20 for SOPA. The last one has 14 nodes, and 70 parameters. All of them have similar values of DIS. However, the parsimony principle indicates that the Fourier approximation should be preferred over SOPA in this case, because it needs a smaller number of parameters. 3.1.2. Fitting with Lorentzian bands Lorentzian bands were ®tted to the synthetic data of Fig. 1 by linear least-squares procedures, adjusting only the values of the amplitudes (``A'' parameters). The bands used were (B, C values (Eq. (10)) shown between parenthesis) ± 1: (500, 40); 2: (520, 40); 3: (400, 60); 4: (510, 60); 5: (410, 40); 6: (390, 40). When ®tting all of the six bands, the intensities are ± 1: 0.948; 2: 0.975; 3: 0.894; 4: 0.080; 5: 0.052; 6: 0.075. The results of ®tting are shown in Fig. 5. It is seen that bands 4±6 have a minor in¯uence on the values of the calculated parameters. Z-values favor the models with 3, 4, 5, or 6 bands; however, given the very small C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181 175 Fig. 5. Parameters from the fitting of Lorentzian bands to the synthetic spectrum of Fig. 1 (see Section 3.1.2). change in Z-values caused by bands 4±6, the parsimony principle indicates the model with 3 bands as the best. Also, there is a minimum for DIS and DIS from PRESS for the model with the ®rst three bands. between consecutive points. When processing spectra from Fourier transform spectrometers, it should be veri®ed that the separation between the data points is equal to the nominal resolution. Otherwise, the points ``in-between'' will not show random deviations. 3.2. Fitting data points from DRIFT spectra of coal Fig. 6 shows the infrared spectrum of a coal sample between 3100 and 2796 cmÿ1. This is the region where the C±H stretching bands appear, and has been widely studied in the literature [26±30]. This example contains 77 data points for a separation of 4 cmÿ1 3.2.1. Comparison of Savitzky±Golay smoothing, SOPA, and Fourier series The results of the approximations are shown in Figs. 7±9. The value of 1 calculated for nˆ77 (Eq. (5)) is 0.115. As before, Z-values were calculated from Eqs. (6) and (9). 176 C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181 Fig. 6. DRIFT spectrum of a coal sample (see Section 3.2). Fig. 7. Parameters from the application of the Fourier approximation to the coal spectrum of Fig. 6 (see Section 3.2.1). C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181 177 Fig. 8. Parameters from the application of the Savitzky±Golay smoothing to the coal spectrum of Fig. 6 (see Section 3.2.1). The results for the Fourier series are displayed in Fig. 7. From the Z(R1) values, the models with less than 19 or more than 29 coef®cients were rejected. Also in this case, the Wald±Wolfowitz test gave a greater spread of ``acceptable'' models, while DIS from PRESS values show a minimum for the model with 27 coef®cients. SD values do not show a clear minimum in the range considered. The results for the Savitzky±Golay procedure are displayed in Fig. 8. Considering the Z(R1) values in this ®gure, smoothed values from window widths greater than 27 and smaller than 25 were rejected. The results for SOPA are shown in Fig. 9. From the Z(R1) values in this ®gure, the models with segment lengths greater than 13 or less than 9 were discarded. A minimum for the SD values is reached at a segment length of 10. For this segment length, the number of nodes is 9 and the number of variables is 45. The ®tted curves for SOPA and the second-derivatives are given in Fig. 10(a). The downward-pointing minima in the second-derivative [1] suggest the presence of nine bands in this region. The derivatives from the Savitzky±Golay procedure or Fourier series (not shown) indicate the presence of bands at approxi- 178 C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181 Fig. 9. Parameters from the application of the SOPA to the coal spectrum of Fig. 6 (see Section 3.2.1). mately the same positions. Thus, the data points were ®tted with Lorentzian bands, as it is explained in the following section. 3.2.2. Fitting of Lorentzian bands The ®tting of Lorentzian bands was done by the steepest descent procedure [31], optimizing amplitudes, positions and width at half-height (A, B, and C in Eq. (10)). A second-degree polynomial was used for the base line. Six of the bands were found to be signi®cant (see Fig. 10 (b)). These bands are (A, B , and C given): 5.56710ÿ2, 3065, 17; 4.14410ÿ2, 3041, 31; 0.2261, 2960, 33; 0.2398, 2926, 40; 0.1653, 2896, 61; 0.2544, 2853, 30. Fig. 11 demonstrates that the model with these six bands yields satisfactory Z-values and has a minimum for SD and DIS from PRESS. Thus, these bands represent a good model for the data. The Lorentzian bands also give a more ``compact'' representation (18 parameters only) of the spectrum than SOPA or Fourier series. However, when ®tting the data from the other coals in [24] with these six bands, although most spectra gave satisfactory values of DIS, most Z-values were outside the accept- C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181 179 Fig. 10. Fitting of the SOPA approximation (a) and Lorentzian bands (b) to the coal spectrum of Fig. 6 (see Sections 3.2.1 and 3.2.2). able range (1.96) indicating an imperfect match for this model. 4. Conclusions It was shown that the determination of the randomness of residuals is a useful test for selection of models. In certain cases, the autocorrelation function with lag 1 (R1) or the Wald±Wolfowitz test provides an objective way of avoiding excessive smoothing (or the opposite, over®tting), making it possible to reject approximations on statistical grounds. In the examples displayed, R1 was more selective than the Wald± Wolfowitz test. Thus, when performing curve-®tting or smoothing procedures, the determination of the autocorrelation of residuals should be done together with the discrepancy index (DIS), standard deviation (SD), or bootstrap procedures (such as PRESS). The autocorrelation of residuals has the advantage of being easily applicable for testing smoothing procedures, while SD or PRESS are dif®cult to determine for these. It was found that the comparison between the Fourier transform, SOPA, and Lorentzian functions, 180 C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181 Fig. 11. Parameters from the fitting of Lorentzian bands to the coal spectrum of Fig. 6 (see Section 3.2.2). results in that the last one gave the most compact representation of the infrared spectra of coal in the 3100±2796 cmÿ1 region. Acknowledgements We acknowledge with thanks the ®nancial assistance through Project S1-2278 of the Consejo Nacional de Investigaciones Cientõ®cas y TecnoloÂgicas (CONICIT, Venezuela). We also thank the reviewers for useful suggestions, and specially Reviewer II for calling our attention to the Wald±Wolfowitz test. References [1] W.F. Maddams, Appl. Spectrosc. 34 (1980) 245. [2] J.C. Davis, Statistics and Data Analysis in Geology, Wiley, New York, 1973. [3] A. Ralston, IntroduccioÂn al anaÂlisis numeÂrico, Limusa-Wiley, New York, 1970. [4] R.E. Walpole, R.H. Myers, Probability and Statistics for Engineers and Scientists, 4th ed., MacMillan, New York, 1990. [5] P. Geladi, B.R. Kowalski, Anal. Chim. Acta 185 (1986) 1. [6] P. Geladi, B.R. Kowalski, Anal. Chim. Acta 185 (1986) 19. [7] E.V. Thomas, Anal. Chem. 66 (1994) 795A. [8] S. Sekulic, M.B. Seasholtz, Z. Wang, B.R. Kowalski, S.E. Lee, B.R. Holt, Anal. Chem. 65 (1993) 835A. C.E. Alciaturi et al. / Analytica Chimica Acta 376 (1998) 169±181 [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] H.C. Smit, Chemom. Int. Lab. Systems 8 (1990) 15. H.C. Smit, Chemom. Int. Lab. Systems 8 (1990) 29. S. Wilks, Mathematical Statistics, Wiley, New York, 1962. M.R. Spiegel, Statistics (Schaum Series), 2nd ed., McGrawHill, New York, 1988. D.K. Hildebrand, R.L. Ott, Statistical Thinking for Managers, PWS Kent, 1991. W.C. Hamilton, Acta Cryst. 18 (1965) 502. A.T. Lemley, J.H. Roberts, K.R. Plowman, J.J. Lagowski, J. Phys. Chem. 77 (1973) 2185. P. Gans, B. Gill, Appl. Spectrosc. 31 (1977) 451. H.G. Gauch, American Scientist 81 (1993) 468. W.H. Jefferys, J.O. Berger, American Scientist 80 (1992) 64. A. Savitzky, M.J.E. Golay, Anal. Chem. 36 (1964) 1627. J. Steiner, Y. Termonia, J. Deltour, Anal. Chem. 44 (1972) 1906. [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] 181 H.H. Madden, Anal. Chem. 50 (1978) 1383. S.E. Bialkowski, Anal. Chem. 61 (1989) 1308. P.A. Gorry, Anal. Chem. 62 (1990) 570. C.E. Alciaturi, M.E. Escobar, R. Vallejo, Fuel 4 (1996) 491. C.E. Alciaturi, T. Montero, C. De La Cruz, M.E. Escobar, Anal. Chim. Acta 340 (1997) 233. W.I. Friesen, K.H. Michaelian, Appl. Spectrosc. 45 (1991) 50. K.H. Michaelian, W.I. Friesen, Fuel 69 (1990) 1271. S.-H. Wang, P.R. Griffiths, Fuel 64 (1985) 229. D.W. Kuehn, R.W. Snyder, A. Davis, P.C. Painter, Fuel 61 (1982) 682. P.C. Painter, R.W. Snyder, M. Starsinic, M.C. Coleman, D.W. Kuehn, A. Davis, Appl. Spectrosc. 35 (1981) 475. D. Kincaid, W. Cheney, Numerical Analysis: Mathematics of Scientific Computing, Brooks/Cole, 1991.