CN101430276B

CN101430276B - Wavelength variable optimization method in spectrum analysis

Info

Publication number: CN101430276B
Application number: CN2008102395880A
Authority: CN
Inventors: 张广军; 李丽娜; 李庆波
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2008-12-15
Filing date: 2008-12-15
Publication date: 2012-01-04
Anticipated expiration: 2028-12-15
Also published as: CN101430276A

Abstract

The invention discloses a method for optimizing wavelength variables in spectral analysis. The method comprises the following steps: the acquired original spectrum is preprocessed to obtain a spectral matrix from which useless information has been eliminated; the purity value of each wavelength variable in the spectral matrix is calculated, and the wavelength variable with the maximum purity value is selected as the 1st wavelength variable; the correlation weight function between the jth wavelength variable and the (j-1) wavelength variables already selected is calculated, the purity value of each wavelength variable with the correlation weight function included is calculated, and the wavelength variable with the maximum purity value is selected as the jth wavelength variable, where j is an integer greater than or equal to 2; partial least squares regression models are built with different numbers of the selected wavelength variables, and the root mean square error of prediction is calculated; the wavelength variable combination used for modeling when the root mean square error of prediction is minimal is the optimal wavelength variable combination. The method selects a small number of wavelength variables, minimizes redundant information, and markedly improves modeling speed and efficiency.

Description

Method for optimizing wavelength variable in spectral analysis

Technical Field

The invention relates to a spectral analysis technology, in particular to a method for optimizing wavelength variables in spectral analysis.

Background

Spectral analysis combined with a multivariate calibration model is a new technology for rapidly and nondestructively determining the content or properties of the components in a sample. Because the absorption spectrum changes when the content or properties of a component of the sample change, a multivariate calibration model is established by correlating the spectra of samples with their concentrations or properties, and the unknown component concentration or property of a sample to be measured is then predicted from the model and the spectrum of that sample. However, because various interferences make spectral information complicated and prone to overlap, useful information must be extracted by removing redundant information from the spectra in order to build an efficient and accurate multivariate calibration model.

When a multivariate calibration model is established, the spectral data over the whole wavelength range may be used so that no spectral information is lost. However, modeling with all the spectral data involves a large amount of computation and poor spectral selectivity, and at some wavelengths the noise is large and the useful information is scarce. Therefore, how to select wavelength variables so as to obtain the most effective information in the spectrum, simplify the computation, and give the multivariate calibration model the best prediction capability is an important problem in establishing the model.

There are many wavelength variable optimization methods. For example, wavelength variables can be selected by the forward selection, backward elimination or stepwise regression methods of multiple regression analysis; in theory, however, these methods are all intended for uncorrelated data, and when multicollinearity is severe the reliability of their conclusions suffers to some extent. Other examples include the correlation analysis method, the display variation analysis method, and methods that select wavelength variables according to the ratio of the regression coefficients of the multivariate calibration model of the measured component to the spectral residuals; these methods, however, are only suitable for spectral applications whose measurement conditions are not particularly complex, and they do little to improve the quality of the multivariate calibration model.

At present, some global optimization methods, such as simulated annealing and genetic algorithms, are also applied to wavelength variable optimization. Simulated annealing is a random search method inspired by the annealing of metals; a genetic algorithm searches for an optimal solution by simulating the natural evolution of organisms on a computer. Although these search algorithms have strong search capability, their parameter settings are complex and affect their ability to find the global and local optimal solutions; the settings also depend on the researcher's experience and grasp of the problem, so the methods are to some degree subjective and random. In addition, when a genetic algorithm is used for wavelength variable optimization, the multivariate calibration model may predict well in a single experiment, but the randomness of the algorithm makes the model poorly adaptable, so wavelengths selected by a genetic algorithm do not improve the robustness or adaptability of the model.

In view of the above, the main objective of the present invention is to provide a method for optimizing wavelength variables in spectral analysis, which can improve modeling efficiency and prediction accuracy.

In order to achieve the above object, the present invention provides a method for optimizing wavelength variables in spectral analysis, comprising:

acquiring near infrared spectral data of a sample with a near infrared spectrometer, and preprocessing the acquired data to obtain a near infrared spectrum free of useless information; calculating the purity value of each wavelength variable from the preprocessed spectrum, selecting the wavelength variable with the maximum purity value as the 1st wavelength variable, and using a MATLAB program to automatically select the first j wavelength variables in sequence; calculating the correlation weight function between the jth wavelength variable and the (j-1) wavelength variables already selected, calculating the purity value of each wavelength variable with the correlation weight function included, and selecting the wavelength variable with the maximum purity value as the jth wavelength variable, where j is an integer greater than or equal to 2; performing partial least squares regression with different numbers of the selected wavelength variables to establish multivariate calibration models, and calculating the root mean square error of prediction; when the root mean square error of prediction is minimal, the wavelength variable combination used for modeling is the optimal wavelength variable combination; and predicting the pre-prepared prediction-set samples with the multivariate calibration model. The preprocessing processes the acquired near infrared spectral data by a correlation analysis method, a useless information variable elimination method or a wavelet transform method. The purity value is the ratio, expressed as a percentage, of the standard deviation of the wavelength variable to its mean after a compensation factor is added.

The different numbers of wavelength variables used for modeling may be the first j wavelength variables selected in sequence.

The method for optimizing wavelength variables in spectral analysis provided by the invention preprocesses the spectral data of the samples in the calibration sample set to remove noise, background interference and information irrelevant to the analyte, calculates the purity value of each wavelength variable in the preprocessed spectral matrix, and selects the wavelength variable with the maximum purity value as the 1st wavelength variable; when the purity value for the jth (j ≥ 2) wavelength variable is calculated, the correlation weight function between the jth wavelength variable and the (j-1) wavelength variables already selected is included in the purity value. This process is repeatable and involves no randomness, so it is a deterministic algorithm. Only one parameter, the compensation factor used when calculating the purity value, has to be set, which overcomes the complex parameter setting of existing wavelength variable optimization methods. The wavelength variable optimization method of the present invention is therefore simple and easy to implement.

In addition, the method of the present invention is a self-modeling wavelength variable selection method, i.e. analysis is performed on the data of the spectrum itself, unlike some prior art wavelength variable selection methods that relate to concentration information. Moreover, since the selected wavelength variable is the original wavelength, it can be used as a reference in analyzing the prediction result to evaluate the information of the molecules or groups in the substance to be measured.

In addition, when the method of the invention is used to build models from the selected wavelength variables, different numbers of wavelength variables can be taken in sequence for partial least squares (PLS) regression modeling and the root mean square error of prediction (RMSEP) calculated; the wavelength variable combination used for modeling when the RMSEP is minimal is the optimal wavelength variable combination, so the prediction accuracy of the resulting multivariate calibration model can be markedly improved.

It can be seen that the wavelength variable optimization method of the present invention minimizes redundant information. Because the correlation weight function is introduced, the method resolves the collinearity among wavelength variables, so only a small number of wavelength variables are selected; a quantitative spectral calibration model with higher prediction accuracy can be built from these few variables, which markedly improves modeling speed and efficiency.

Drawings

FIG. 1 is a schematic flow diagram of the wavelength variable optimization method in spectral analysis according to the present invention;

FIG. 2(a) is a graph of the original near infrared spectrum before pretreatment according to an embodiment of the method of the present invention;

FIG. 2(b) is a diagram of a near infrared spectrum after being preprocessed by a wavelet transform method according to an embodiment of the method of the present invention;

FIG. 3(a) is a graph of the purity value curve at each wavelength for an embodiment of the method of the present invention when selecting a second wavelength variable;

FIG. 3(b) is a graph of the distribution of the standard deviation curves at each wavelength for an embodiment of the method of the present invention when selecting a second wavelength variable;

FIG. 4 is a distribution diagram of RMSEP values obtained when different numbers of wavelength variables are selected in sequence for modeling according to an embodiment of the method of the present invention;

FIG. 5 is a distribution diagram of the selected wavelength variables for an embodiment of the method of the present invention;

FIG. 6 is a diagram illustrating the predicted results of a PLS multivariate calibration model established by optimal wavelength variable combinations according to an embodiment of the method of the present invention.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

The basic idea of the invention is as follows: first, the spectral data of the samples in the calibration sample set are preprocessed; self-modeling wavelength variable optimization is then performed on the preprocessed spectral data, and the wavelength variable with the maximum purity value is selected as the 1st wavelength variable; when the purity value for the jth wavelength variable (j ≥ 2) is calculated, the correlation weight function between the jth wavelength variable and the (j-1) wavelength variables already selected is included in the purity value; different numbers of wavelength variables are then taken in sequence for PLS regression modeling, and the root mean square error of prediction is calculated; when the root mean square error of prediction is minimal, the wavelength variable combination used for modeling is the optimal wavelength variable combination.

It should be noted that, before the wavelength variable optimization is performed, the spectrum samples obtained in advance through experiments can be divided into a training set and a prediction set, wherein the training set samples are used for performing the wavelength variable optimization and the multivariate correction model training, and the prediction set samples are used for evaluating the wavelength variable optimization and the prediction accuracy of the multivariate correction model.

Generally, the spectral measurement data of the sample may contain useless information such as high-frequency noise caused by instrument noise, measurement condition changes and the like, and background interference generated by light absorption of other chemical components, so the preprocessing is mainly to remove the information in the spectrum which is irrelevant to the component concentration or property of the sample to be measured, to ensure that the selected variables are related to the component concentration or property parameter of the sample to be measured as much as possible, and further improve the spectral quality.

There are several methods for preprocessing the raw spectra, including correlation analysis, useless information variable elimination and wavelet transform. The implementation of these preprocessing methods is described in detail below by way of example.

A first method of spectral preprocessing, correlation analysis, comprising the steps of:

Step A101, correlating the measured component concentration or component property data Y (n × 1) with the spectral data X (n × m) of the samples, and obtaining the correlation coefficient C at each wavelength according to formula (1):

$$C = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}} \tag{1}$$

where n is the number of samples, m is the number of variables, x_i is an element of the spectral data X, y_i is an element of the data Y, and x̄ and ȳ are the column means of x_i and y_i, respectively.

Step A102, setting a threshold C₀ for the correlation coefficient; the kth wavelength is selected when its correlation coefficient exceeds the threshold, i.e. C_k > C₀.

Step A103, forming a new matrix X_NEW from the selected wavelength variables for further spectral processing.
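Steps A101 to A103 can be sketched in a few lines of Python. This is only an illustrative sketch using the standard library: the function names, the toy data and the threshold value 0.9 are assumptions, and treating a constant column as having zero correlation is a convenience choice that the text does not specify.

```python
import math

def correlation_per_wavelength(X, y):
    """Formula (1): correlation between each column of X (n x m) and y (n values)."""
    n, m = len(X), len(X[0])
    y_mean = sum(y) / n
    sy = math.sqrt(sum((v - y_mean) ** 2 for v in y))
    coeffs = []
    for k in range(m):
        col = [row[k] for row in X]
        x_mean = sum(col) / n
        sx = math.sqrt(sum((v - x_mean) ** 2 for v in col))
        cov = sum((a - x_mean) * (b - y_mean) for a, b in zip(col, y))
        # Assumption: a constant column is given correlation 0 instead of dividing by zero.
        coeffs.append(cov / (sx * sy) if sx > 0 and sy > 0 else 0.0)
    return coeffs

def select_by_correlation(X, y, c0):
    """Steps A102/A103: keep the wavelengths with C_k > c0 and build X_NEW."""
    coeffs = correlation_per_wavelength(X, y)
    keep = [k for k, c in enumerate(coeffs) if c > c0]
    x_new = [[row[k] for k in keep] for row in X]
    return keep, x_new, coeffs

# Toy data: 4 samples, 3 wavelength variables (hypothetical values).
X = [[1.0, 5.0, 2.0],
     [2.0, 5.0, 1.0],
     [3.0, 5.0, 4.0],
     [4.0, 5.0, 3.0]]
y = [1.0, 2.0, 3.0, 4.0]
keep, X_new, coeffs = select_by_correlation(X, y, c0=0.9)
```

In this toy case only the first column tracks y closely enough to pass the threshold, so X_new has a single column.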

A second spectrum preprocessing method, a useless information variable elimination method, which comprises the following steps:

Step A201, performing PLS regression analysis on the sample spectral matrix X (n × m) and the concentration matrix Y (n × 1), and selecting the number of principal components f, where f is a positive integer;

where n represents the number of samples and m represents the number of wavelength variables.

Step A202, generating a random noise matrix R (n × m), and combining X and R into a matrix XR (n × 2m);

here, the first m columns of the combined matrix XR are X and the last m columns are R.

Step A203, performing PLS regression on the matrices XR and Y with leave-one-out cross-validation, removing one sample each time to obtain a regression coefficient vector b; the n PLS regression coefficient vectors form a matrix B (n × 2m).

Step A204, calculating the standard deviation std(b_i) and the mean mean(b_i) of the matrix B (n × 2m) by columns, and then calculating C_i = mean(b_i)/std(b_i), where i = 1, 2, ... indexes the columns of B.

Step A205, taking the maximum absolute value of C over the interval [m+1, 2m]: C_max = max(abs(C)).

Step A206, over the interval [1, m], removing from the matrix X the variables for which C_i < C_max; the remaining variables form the new matrix X_NEW selected by the useless information variable elimination method, ready for subsequent spectral processing.
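Steps A204 to A206 can be illustrated as follows, taking the coefficient matrix B from step A203 as given. This is a minimal sketch, not the patented implementation: the function name and the toy matrix are hypothetical, the PLS fitting itself is omitted, and the use of the sample standard deviation is an assumption.

```python
import statistics

def uve_keep_indices(B, m):
    """Return the real-variable columns kept by steps A204 to A206.

    B is an n x 2m matrix of leave-one-out PLS regression coefficients:
    the first m columns belong to X, the last m columns to the noise matrix R.
    """
    if any(len(row) != 2 * m for row in B):
        raise ValueError("B must have 2*m columns")
    C = []
    for i in range(2 * m):
        col = [row[i] for row in B]
        sd = statistics.stdev(col)  # assumption: sample standard deviation
        # Assumption: a zero-variance column is given C_i = 0.
        C.append(statistics.mean(col) / sd if sd > 0 else 0.0)
    c_max = max(abs(c) for c in C[m:])                 # step A205: noise columns only
    keep = [i for i in range(m) if not C[i] < c_max]   # step A206: drop C_i < C_max
    return keep, C, c_max

# Toy coefficient matrix: n = 3 leave-one-out runs, m = 2 real variables.
B = [[2.0, 0.1, 0.5, -0.2],
     [2.1, -0.1, 0.4, 0.1],
     [1.9, 0.0, 0.6, 0.1]]
keep, C, c_max = uve_keep_indices(B, m=2)
```

Here the first real variable has a stable, large coefficient and is kept, while the second is no more stable than the noise columns and is removed.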

A third spectral preprocessing method is the wavelet transform method. The wavelet transform provides multi-resolution analysis. Because the noise in a spectral signal and the background produced by the light absorption of other components are mostly represented by the low-scale detail coefficients and the high-scale approximation components, the wavelet transform can remove noise, background interference and other useless information at the same time. The process of preprocessing the original spectrum by wavelet transform comprises the following steps:

Step A301, generating a random noise and background matrix R (n × m), and combining the sample spectral matrix X (n × m) and R into a matrix XR (n × 2m);

where n is the number of samples and m is the number of variables; the first m columns of the combined matrix XR are X and the last m columns are R.

Step A302, selecting a wavelet basis and a number of decomposition levels k, and performing wavelet decomposition on each signal of the matrix XR to obtain a wavelet detail coefficient matrix D (k × 2m/(i × 2)) and an approximation component matrix A (k × 2m), where i = 1, 2, ...

Step A303, calculating the standard deviation std(d_i) and the mean mean(d_i) of the matrix D (k × 2m/(i × 2)) by columns, and then calculating C_di = mean(d_i)/std(d_i).

Step A304, calculating the standard deviation std(a_i) and the mean mean(a_i) of the matrix A (k × 2m) by columns, and then calculating C_ai = mean(a_i)/std(a_i).

Step A306, reconstructing the signals from the denoised and background-removed low-frequency and high-frequency coefficients of the kth level, establishing a calibration model with the reconstructed spectral signals, and selecting the optimal wavelet basis according to the root mean square error of prediction; the reconstructed spectral signals form a new spectral matrix X_NEW for subsequent spectral processing such as wavelength variable optimization.

It should be noted that the preprocessing used with the wavelength variable optimization method of the present invention is not limited to the methods described above; any other preprocessing method that removes useless information such as noise and background also falls within the scope of the present invention.

Based on the preprocessing methods described above, the flow of the wavelength variable optimization method in spectral analysis of the present invention, as shown in FIG. 1, comprises the following steps:

Step S101, preprocessing the original spectral data of all samples in the experiment to obtain a spectral matrix with useless information eliminated;

Eliminating useless information by preprocessing improves the quality of the spectrum and tightens the relationship between the spectrum and the concentration or properties of the analyte component. Among the preprocessing methods, the correlation analysis method and the useless information variable elimination method are suited to spectra that are not complex and are generally used only to remove noise; the wavelet transform method, thanks to its multi-resolution analysis, can remove noise, background and other useless information at the same time. The spectral preprocessing method may be chosen as appropriate.

Step S102, calculating the purity value of each wavelength variable in the preprocessed spectral matrix X_NEW, and selecting the wavelength variable with the maximum purity value as the 1st wavelength variable;

The purity value characterizes the contribution of each variable to the multivariate calibration model and can be expressed as the ratio, as a percentage, of the dispersion of the wavelength variable to its central tendency after the compensation factor is added; the dispersion is the standard deviation of the wavelength variable and the central tendency is its mean. When the signal is weak and comparable to the noise, the purity value can be adjusted through the compensation factor. Generally, the compensation factor may be set to 1% to 5% of the mean value.

For the spectral matrix X_NEW, the purity value of each wavelength variable i is calculated as shown in formula (2):

$$p_{i,1} = \frac{\sigma_i}{\mu_i + \alpha} \tag{2}$$

where σ_i is the standard deviation, μ_i is the mean, and α is the compensation factor. After the purity value p_{i,1} of each wavelength variable i is obtained from formula (2), the p_{i,1} values are compared, and the wavelength variable i with the maximum p_{i,1} value is the selected 1st wavelength variable.
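As a minimal sketch of formula (2), the purity values and the first selected variable could be computed as below. The function names and toy spectra are hypothetical; the use of the population standard deviation and the choice of α as 5% of the largest column mean are assumptions, since the text only says "1% to 5% of the mean value".

```python
import statistics

def purity_values(X, alpha):
    """Formula (2): p_i = sigma_i / (mu_i + alpha) for every column i of X (n x m)."""
    purities = []
    for i in range(len(X[0])):
        col = [row[i] for row in X]
        mu = statistics.mean(col)
        sigma = statistics.pstdev(col)  # assumption: population standard deviation
        purities.append(sigma / (mu + alpha))
    return purities

def first_variable(X, alpha):
    """Index of the wavelength variable with the largest purity value."""
    p = purity_values(X, alpha)
    best = max(range(len(p)), key=p.__getitem__)
    return best, p

# Toy spectra: 3 samples, 3 wavelength variables (hypothetical values).
X = [[1.0, 10.0, 0.10],
     [3.0, 10.0, 0.12],
     [2.0, 10.0, 0.08]]
alpha = 0.05 * 10.0  # assumption: 5% of the largest column mean
best, p = first_variable(X, alpha)
```

The third column shows why the compensation factor matters: without α its ratio σ/μ would be about 0.16, but with α = 0.5 its purity drops to about 0.03, so a weak signal with a small mean is not favoured.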

Step S103, calculating the correlation weight function for the jth wavelength variable, calculating the purity value of each wavelength variable with the correlation weight function included, and selecting the wavelength variable with the maximum purity value as the jth wavelength variable;

where j is an integer greater than or equal to 2; the correlation weight function characterizes the strength of the relationship between the jth wavelength variable and the (j-1) wavelength variables already selected.

The general procedure for selecting the jth (j ≧ 2) wavelength variable is as follows:

Calculate the length l_i of each wavelength variable i in the spectral matrix X_NEW, as shown in formula (3):

where d_{i,j} is the element in the ith row and jth column of the spectral matrix X_NEW, as follows:

Obtain the relation matrix C from D(l) and its transpose D(l)^T, where D(l) is the matrix composed of the elements d(l)_{i,j}, and calculate the correlation weight function ρ_{i,j}, as shown in formula (4):

where j denotes the index of the jth wavelength variable to be determined, p_{j-1} denotes the index in the relation matrix C of the (j-1)th wavelength variable already selected, and p_1 denotes the index in the relation matrix C of the selected 1st wavelength variable. The purity value p_{i,j} for the jth wavelength variable is:

$$p_{i,j} = \rho_{i,j}\,\frac{\sigma_i}{\mu_i + \alpha} \tag{5}$$

The wavelength variable with the maximum p_{i,j} value is the selected jth wavelength variable; its corresponding standard deviation value s_{i,j} is given by formula (6):

$$s_{i,j} = \rho_{i,j}\,\sigma_i \tag{6}$$

Generally, the standard deviation s_{i,j} of the wavelength variable with the maximum p_{i,j} value is also relatively high, so s_{i,j} can be used as a reference value to check the selected wavelength variable.

Step S104, taking different numbers of wavelength variables in sequence for PLS regression modeling, and calculating the root mean square error of prediction (RMSEP);

in general, the formula for RMSEP is:

$$\mathrm{RMSEP} = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}} \tag{7}$$

where ŷ_i is the predicted value and y_i is the reference value.

The predicted root mean square error RMSEP reflects the degree to which the measured data deviates from the true value, and in general, a smaller value of RMSEP indicates a higher measurement accuracy, and therefore RMSEP can be used as a criterion for assessing the accuracy of this measurement process. When the value of RMSEP is minimal, the selected modeled wavelength variable combination is the optimal wavelength variable combination. In step S104, PLS regression analysis is performed on the selected wavelength variables sequentially and iteratively, and the number of the selected modeling wavelength variables is determined.
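Formula (7) translates directly into code. The sketch below uses hypothetical predicted and reference values purely to show the arithmetic.

```python
import math

def rmsep(y_pred, y_ref):
    """Formula (7): root mean square error between predicted and reference values."""
    if len(y_pred) != len(y_ref) or not y_ref:
        raise ValueError("y_pred and y_ref must be non-empty and of equal length")
    squared = sum((p - r) ** 2 for p, r in zip(y_pred, y_ref))
    return math.sqrt(squared / len(y_ref))

# Hypothetical values: the errors are -2, 2 and -1, so RMSEP = sqrt(9 / 3).
value = rmsep([10.0, 22.0, 29.0], [12.0, 20.0, 30.0])
```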

Step S105, judging whether the value of RMSEP reaches the minimum value, if so, executing step S106, otherwise, returning to step S103;

generally, when the value of RMSEP obtained by the current modeling is greater than the result obtained by the previous modeling, the RMSEP of the previous time is considered as the minimum value; if the value of RMSEP does not reach the minimum value, step S103 is repeated, and the iteration is performed in sequence until the optimal wavelength variable combination is selected.

Alternatively, a minimum RMSEP value may be preset as a condition for ending the loop at the start of the algorithm.

Step S106, when the RMSEP reaches the minimum value, the wavelength variable optimization is finished and modeling ends.
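The stopping rule of steps S105 and S106 can be sketched as a small loop. Here `rmsep_for` is a hypothetical callback that returns the RMSEP of a model built from the first k selected variables; the PLS fitting is outside the sketch. The optional `target` argument stands for the preset RMSEP value mentioned above, and treating an equal RMSEP as "not yet increased" is an assumption.

```python
def choose_number_of_variables(rmsep_for, max_vars, target=None):
    """Add variables one at a time until the RMSEP rises or a preset target is met."""
    best_k, best_err = 1, rmsep_for(1)
    for k in range(2, max_vars + 1):
        if target is not None and best_err <= target:
            break  # preset RMSEP value reached
        err = rmsep_for(k)
        if err > best_err:
            break  # current RMSEP exceeds the previous one: previous was the minimum
        best_k, best_err = k, err
    return best_k, best_err

# Hypothetical RMSEP values for models built with 1 to 5 variables.
errors = {1: 5.0, 2: 3.2, 3: 2.1, 4: 2.4, 5: 1.0}
result = choose_number_of_variables(errors.__getitem__, max_vars=5)
result_with_target = choose_number_of_variables(errors.__getitem__, max_vars=5, target=3.5)
```

With these toy numbers the loop stops at three variables because the fourth model is worse than the third, even though a fifth model would be better still. The first-increase rule finds a local minimum, which is one reason to compare RMSEP over a fixed range of variable counts as the embodiment does.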

The wavelength variable optimization method can be implemented as a MATLAB program so that the wavelength variables are selected automatically. Multivariate calibration models are established from the wavelength variables selected iteratively in sequence and judged by the cross-validation RMSEP; the variable combination used for modeling when the RMSEP is minimal is the selected optimal wavelength variable combination.

It should be emphasized that the wavelength variable optimization method of the present invention may also first select a certain number of wavelength variables and then perform PLS regression modeling on them. Preferably, the different numbers of wavelength variables used for modeling are the wavelength variables taken in the order of selection; alternatively, other variable selection methods (such as exhaustive search, a genetic algorithm or the Monte Carlo method) can be combined to choose a subset of the wavelength variables for modeling. The RMSEP is then calculated and checked: if the minimum RMSEP has appeared, the selection of wavelength variables can stop; otherwise the selection continues.

The wavelength variable optimization method in spectral analysis of the present invention is described in detail below with reference to a specific embodiment.

Taking a human plasma near infrared spectrum blood sugar detection experiment as an example, the glucose concentration in a sample is subjected to prediction analysis. In the embodiment, a Fourier transform infrared spectrometer is adopted in the human plasma near infrared spectrum experiment, the spectrum acquisition range is 900-3600 nm, the adopted detector is an InSb detector cooled by liquid nitrogen, and in addition, instruments such as a 1mm quartz sample cell, a peristaltic pump automatic sample feeding system, a full-automatic biochemical analyzer and the like are selected in the experiment.

The plasma samples were prepared as follows: heparin anticoagulant was added to whole blood, the blood was separated in a centrifuge at 1500 rpm for 10 min, glucose was added to the separated plasma, and the blood glucose value was calibrated with a fully automatic biochemical analyzer by the glucose oxidase method. The plasma experiment yielded 33 samples, of which 22 were used as the training set for wavelength variable optimization and for training the multivariate calibration model, and 11 were used as the prediction set to evaluate the wavelength variable optimization and the prediction accuracy of the multivariate calibration model. The glucose concentrations range from 10.4 to 44.4 mg/dL, are randomly distributed, and have a standard deviation of 8.5 mg/dL.

The implementation process of the wavelength variable optimization method of the embodiment comprises the following steps:

Step S201, preprocessing the spectral data of all samples in the experiment to remove useless information such as noise and background.

The spectral analysis range is 1000-1890.36 nm, and each spectrum has 4711 wavelength variables in total. Useless information is removed from each spectrum by the wavelet transform method: the wavelet basis db3 is selected, the Mallat decomposition is applied with a decomposition scale of 12, the components corresponding to scales 1, 2, 3 and 10 are removed, and the spectral information is then reconstructed.
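The decompose, remove and reconstruct idea can be illustrated with a short pure-Python sketch. To stay self-contained it uses the Haar wavelet rather than the db3 basis of the embodiment, a signal of only 8 points and 3 decomposition levels, so it shows the mechanism only and does not reproduce the embodiment's processing. All names and values are hypothetical.

```python
import math

SQRT2 = math.sqrt(2.0)

def haar_decompose(signal, levels):
    """Multi-level Haar decomposition; details[0] is scale 1 (the finest)."""
    approx, details = list(signal), []
    for _ in range(levels):
        if len(approx) % 2:
            raise ValueError("signal length must be divisible by 2**levels")
        pairs = [(approx[i], approx[i + 1]) for i in range(0, len(approx), 2)]
        details.append([(a - b) / SQRT2 for a, b in pairs])
        approx = [(a + b) / SQRT2 for a, b in pairs]
    return approx, details

def haar_reconstruct(approx, details):
    """Invert haar_decompose, going from the coarsest scale back to the finest."""
    current = list(approx)
    for d in reversed(details):
        nxt = []
        for a, di in zip(current, d):
            nxt.append((a + di) / SQRT2)
            nxt.append((a - di) / SQRT2)
        current = nxt
    return current

def remove_scales(signal, levels, drop):
    """Zero the detail coefficients at the scales listed in drop, then reconstruct."""
    approx, details = haar_decompose(signal, levels)
    for scale in drop:
        details[scale - 1] = [0.0] * len(details[scale - 1])
    return haar_reconstruct(approx, details)

# Toy "spectrum" of 8 points (hypothetical values).
signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
unchanged = remove_scales(signal, levels=3, drop=[])
smoothed = remove_scales(signal, levels=3, drop=[1])
```

Dropping scale 1 removes the finest point-to-point fluctuation, so each neighbouring pair is replaced by its average. Removing a coarse scale would instead suppress a slowly varying component such as a background drift.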

FIG. 2(a) shows the original near infrared spectra before preprocessing in this embodiment, and FIG. 2(b) shows the near infrared spectra after preprocessing by the wavelet transform method, i.e. the new preprocessed spectral matrix X_NEW. Because the amount of spectral data in X_NEW is large (a 33 × 4711 matrix), it is presented graphically in FIG. 2(b).

Step S202, performing wavelength variable optimization on the preprocessed spectral data X_NEW: calculating the purity value of each wavelength variable i in the training-set sample spectral matrix X_NEW to select the 1st wavelength variable;

The present embodiment sets the compensation factor α to 5% of the mean value. By comparing the purity values at the respective wavelength variables calculated in step B1 with the standard deviation values, the maximum purity value p_{1,1} = 0.0431 is obtained, so the selected 1st wavelength variable is the wavelength variable with index 1 (wavelength 1000 nm).

Step S203, selecting a second wavelength variable;

calculating the purity value and the standard deviation value after adding the correlation weight function according to the formula (5) and the formula (6), wherein the obtained corresponding result between the purity value and each wavelength variable is shown in fig. 3(a), which is a distribution diagram of a purity value curve at each wavelength when selecting the second wavelength variable in the present embodiment; the obtained correspondence between the standard deviation values and the respective wavelength variables is shown in fig. 3(b), which is a distribution graph of the standard deviation value curve at the respective wavelengths when the second wavelength variable is selected in the present embodiment. As can be seen from fig. 3, the second wavelength variable selected is the 4711 th wavelength variable (wavelength 1890.36nm) with a maximum.

Step S203 is repeated to obtain the 3rd to 16th wavelength variables, as follows: the 4223rd variable (wavelength 1730.7 nm), the 1944th variable (wavelength 1241.16 nm), the 2655th variable (wavelength 1361.29 nm), the 4700th variable (wavelength 1886.44 nm), the 3281st variable (wavelength 1488.1 nm), the 4684th variable (wavelength 1880.76 nm), the 2973rd variable (wavelength 1422.88 nm), the 3857th variable (wavelength 1627.6 nm), the 2814th variable (wavelength 1391.4 nm), the 1232nd variable (wavelength 1140.38 nm), the 2558th variable (wavelength 1343.54 nm), and the 4078th variable (wavelength 1688.33 nm).

Step S204, determining the number of selected wavelength variables.

A PLS regression multivariate calibration model is established using the sequentially selected 1st to 16th wavelength variables, and the RMSEP obtained when modeling with different numbers of wavelength variables is determined by cross-validation.
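As a rough illustration of this step, the sketch below computes a cross-validated error for models built on the first 1, 2, ... ranked wavelength variables. The text does not state the cross-validation scheme or the number of PLS latent variables, so leave-one-out validation and a single latent variable (one-component PLS1) are assumptions made here to keep the example short. All function names and data are hypothetical.

```python
from math import sqrt

def pls1_one_component(X, y):
    """Fit a PLS1 model with one latent variable; return a predict(x) function."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(p)] for row in X]
    yc = [v - y_mean for v in y]
    # Weight vector: direction of maximum covariance between X and y.
    w = [sum(Xc[i][j] * yc[i] for i in range(n)) for j in range(p)]
    norm = sqrt(sum(v * v for v in w)) or 1.0
    w = [v / norm for v in w]
    # Scores of the training samples on the latent variable.
    t = [sum(Xc[i][j] * w[j] for j in range(p)) for i in range(n)]
    tt = sum(v * v for v in t) or 1.0
    q = sum(t[i] * yc[i] for i in range(n)) / tt   # regress y on the scores

    def predict(x):
        score = sum((x[j] - x_mean[j]) * w[j] for j in range(p))
        return y_mean + q * score

    return predict

def loo_rmse(X, y):
    """Leave-one-out cross-validated root mean square error."""
    squared = 0.0
    for i in range(len(X)):
        predict = pls1_one_component(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        squared += (predict(X[i]) - y[i]) ** 2
    return sqrt(squared / len(X))

def rmse_by_count(X, y, ranked):
    """Error for models built on the first 1, 2, ... ranked variables."""
    curve = []
    for k in range(1, len(ranked) + 1):
        Xk = [[row[j] for j in ranked[:k]] for row in X]
        curve.append(loo_rmse(Xk, y))
    return curve

# Hypothetical data: y depends linearly on column 0 only; column 1 is unrelated.
X = [[1.0, 0.5], [2.0, 0.1], [3.0, 0.9], [4.0, 0.3], [5.0, 0.7]]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
curve = rmse_by_count(X, y, ranked=[0, 1])   # 0-based ranked variable indices
print([round(v, 4) for v in curve])
```

A practical implementation would normally choose the number of latent variables by validation as well; fixing it at one is purely a simplification for this sketch.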

FIG. 4 shows the distribution of RMSEP values obtained in the present embodiment when different numbers of wavelength variables are selected in sequence for modeling; the inverted triangles in FIG. 4 represent the RMSEP values, and the curve represents the trend of the RMSEP value as the number of wavelength variables changes. It can be seen that the RMSEP value is smallest when the first 14 wavelength variables are selected for modeling, so the optimal number of wavelength variables is 14, and these first 14 wavelength variables form the optimal wavelength variable combination. FIG. 5 shows the distribution of the optimized wavelength variables of the present embodiment and reflects the range they cover; the circles in FIG. 5 mark the selected wavelength variables, and the curve is the spectral curve.
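Choosing the number of wavelength variables from the RMSEP curve can be sketched in two ways: an exhaustive search for the smallest RMSEP, and the stopping rule of claim 4, which treats the previous RMSEP as the minimum as soon as the current one is larger. The variable indices below are the first five selected in the embodiment; the RMSEP numbers are hypothetical and are not read from FIG. 4. Both function names are illustrative.

```python
def optimal_count(rmsep_values):
    """Exhaustive search: return (count, rmsep) at the smallest RMSEP."""
    best = min(range(len(rmsep_values)), key=rmsep_values.__getitem__)
    return best + 1, rmsep_values[best]      # counts start at 1

def stop_at_first_increase(rmsep_values):
    """Stopping rule: stop as soon as the current RMSEP exceeds the previous one."""
    previous, count = None, 0
    for current in rmsep_values:
        if previous is not None and current > previous:
            break                            # previous value is taken as the minimum
        previous, count = current, count + 1
    return count, previous

# First five selected variable indices from the embodiment (1-based).
ranked_variables = [1, 4711, 4223, 1944, 2655]
# Hypothetical RMSEP values for models using the first 1..5 variables.
rmsep_curve = [3.2, 2.9, 2.4, 2.6, 2.8]

count, value = optimal_count(rmsep_curve)
best_combination = ranked_variables[:count]
early_count, early_value = stop_at_first_increase(rmsep_curve)
print(count, value, best_combination, early_count, early_value)
```

The two rules agree whenever the RMSEP curve falls to a single minimum and then rises. If the curve has several local minima, the stopping rule returns the first one, which avoids building further models at the cost of possibly missing a lower value later on.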

By the wavelength variable optimization method of the present invention, a PLS regression multivariate calibration model is established and used to predict the prediction-set samples, giving an RMSEP value of 1.9 mg/dL and a correlation coefficient of 0.94 for the prediction results of the multivariate calibration model. The correlation is shown in FIG. 6, which is a schematic diagram of the prediction results of the PLS multivariate calibration model established with the optimal wavelength variable combination in the present embodiment. In FIG. 6, each black dot pairs a reference value with its predicted value, and the straight line is the reference line for the correlation. The closer the black dots lie to the straight line, the higher the correlation between the predicted values and the reference values.
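The two figures of merit quoted above can be computed as in the following sketch, which uses the usual definitions of the root mean square error of prediction and the Pearson correlation coefficient. The reference and predicted values are hypothetical and serve only to show the arithmetic; they are not the prediction-set data of the embodiment.

```python
from math import sqrt

def rmsep(reference, predicted):
    """Root mean square error of prediction over the prediction set."""
    n = len(reference)
    return sqrt(sum((p - r) ** 2 for r, p in zip(reference, predicted)) / n)

def correlation(reference, predicted):
    """Pearson correlation coefficient between reference and predicted values."""
    n = len(reference)
    mean_r = sum(reference) / n
    mean_p = sum(predicted) / n
    cov = sum((r - mean_r) * (p - mean_p) for r, p in zip(reference, predicted))
    var_r = sum((r - mean_r) ** 2 for r in reference)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / sqrt(var_r * var_p)

# Hypothetical reference and predicted concentrations in mg/dL.
reference = [80.0, 90.0, 100.0, 110.0]
predicted = [82.0, 89.0, 101.0, 108.0]

error = rmsep(reference, predicted)
r = correlation(reference, predicted)
print(round(error, 3), round(r, 3))
```

RMSEP carries the unit of the predicted quantity (mg/dL here), whereas the correlation coefficient is dimensionless, so the two values describe different aspects of the agreement plotted in a reference-versus-prediction diagram.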

Table 1 lists the prediction parameters obtained when modeling with different wavelength variable ranges. In the present embodiment, 14 wavelength variables are selected for modeling by the self-modeling wavelength variable optimization method. Compared with full-spectrum modeling using all 4711 wavelength variables, the self-modeling wavelength variable optimization method provided by the invention is simple and easy to implement, has high modeling efficiency, and significantly improves the prediction accuracy of the established multivariate calibration model.

TABLE 1

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

The above description is only for the purpose of illustrating the present invention and is not intended to limit the scope of the present invention. For simplicity of explanation, the foregoing embodiments are described as a series of acts or combinations, but it will be appreciated by those skilled in the art that the invention is not limited by the order of acts, as some steps may occur in other orders or concurrently with other steps in accordance with the invention. In addition, any modification and variation of the present invention within the spirit of the present invention and the scope of the claims fall within the scope of the present invention.

Claims

1. A method for wavelength variable optimization in spectroscopic analysis, the method comprising:

acquiring near infrared spectrum data of a sample through a near infrared spectrometer, and preprocessing the currently acquired near infrared spectrum data to obtain a near infrared spectrum without useless information;

calculating purity values of all wavelength variables according to the preprocessed near infrared spectrum, selecting the wavelength variable with the maximum purity value as the 1st wavelength variable, and applying a MATLAB program to automatically select the first j wavelength variables in sequence;

calculating a correlation weight function of the jth wavelength variable and the selected first (j-1) wavelength variables, calculating purity values of the wavelength variables after the correlation weight function is added, and selecting the wavelength variable with the maximum purity value as the jth wavelength variable, wherein j is an integer greater than or equal to 2;

performing partial least squares regression with different numbers of the optimized wavelength variables to establish a multivariate calibration model, and calculating the predicted root mean square error; when the predicted root mean square error is minimum, the wavelength variable combination selected for modeling is the optimal wavelength variable combination; predicting a pre-configured sample serving as a prediction set by using the multivariate calibration model;

wherein, the preprocessing is to process the collected near infrared spectrum data by adopting a correlation analysis method, a useless information variable elimination method or a wavelet transformation method;

wherein the purity value is a percentage of a standard deviation of the wavelength variable to a mean value after adding a compensation factor.

2. The method of claim 1, wherein the compensation factor is 1% to 5% of the mean value.

3. The method of claim 1, further comprising: presetting a minimum value of the predicted root mean square error.

4. The method according to claim 1 or 3, wherein when the predicted root mean square error value obtained by the current modeling is larger than the predicted root mean square error value obtained by the previous modeling, the previous predicted root mean square error value is the minimum value.