DOCTORAL THESIS
Cassiano Ranzan
Porto Alegre
2014
CASSIANO RANZAN
Advisor:
Prof. Dr. Jorge Otávio Trierweiler
Co-advisor:
Prof. Dr. Luciane Ferreira Trierweiler
Foreign co-advisor:
Prof. Dr. Bernd Hitzmann
Examining Committee:
Resumo (abstract in Portuguese)
Abstract
Acknowledgments
TABLE OF CONTENTS
LIST OF FIGURES
Figure 5.4: Significant fluorescence spectral regions associated with glucose, ethanol and biomass concentration, obtained by ACO pheromone trail evolution.
Figure 5.5: PCR and PLS modeling using full fluorescence spectral data and reduced fluorescence spectral data based on ACO analysis.
Figure 6.1: Average NIR measurements from 34 flour samples. (a) NIR raw data and (b) NIR SNV-normalized data.
Figure 6.2: Location of the 34 flour samples in the PC planes, using PCA results for qualitative data set evaluation (T – Wheat Flour; C – Rye Flour).
Figure 6.3: Schematic representation of MWPLSR and CSMWPLS.
Figure 6.4: RMSEP and explained variance for PLSR modeling applied to full spectral data of wheat flour and the respective protein content.
Figure 6.5: Optimized regions of NIR spectra for protein prediction as a function of the number of LVs in PLS models, applying CSMWPLS.
Figure 6.6: RMSEP values for protein prediction obtained using CSMWPLS for NIR spectral region selection, for seven distinct window sizes: 10, 50, 100, 150, 200, 250 and 300 spectral elements.
Figure 6.7: Stored RMSEP values for modCSMWPLS, with window size from one up to fifteen, applied to NIR data of wheat flour samples and using PLS models with a maximum of 8 LVs.
Figure 6.8: Relative pheromone amount deposited on spectral elements, obtained using the PSCM methodology with models varying from one up to nine spectral elements in the prediction groups.
Figure 6.9: RMSEP results for chemometric modeling of protein content prediction using NIR data from wheat flour samples, using full spectrum data, spectrum data filtered using modified CSMWPLS and spectrum data filtered using PSCM/ACO. Results are presented as a function of the independent variables used in the chemometric models. Modeling process divided into (a) standard PLS regression, (b) CSMWPLS and (c) PSCM/ACO.
Figure 6.10: Percentage difference between RMSEP results for protein prediction of PLS chemometric models using full NIR spectral data and: PLS models using NIR data filtered with modified CSMWPLS (PLS(CSMWPLS)) and PSCM/ACO (PLS(PSCM)), CSMWPLS models using NIR data filtered with modified CSMWPLS (CSMWPLS(CSMWPLS)) and PSCM/ACO (CSMWPLS(PSCM)), and PSCM/ACO models using NIR data filtered with modified CSMWPLS (PSCM(CSMWPLS)) and PSCM/ACO (PSCM(PSCM)).
LIST OF TABLES
Table 2.1: Comparison of the qualitative characteristics of MIR and NIR. Source: Adapted from Pasquini (2002).
Table 3.1: Off-line experimental data set for fermentation run 1. Source: Solle et al. (2003).
Table 3.2: Off-line experimental data set for fermentation run 2. Source: Solle et al. (2003).
Table 3.3: Parameters of the dynamic model fitted with interpolated data from fermentation run 1.
Table 3.4: Set of flour samples.
Table 3.5: Laboratory analyses performed on the flour samples.
Table 3.6: Off-line data for the set of flour samples.
Table 6.1: CSMWPLS RMSEP results for the best PLS models obtained.
ABBREVIATIONS
Chapter 1 – Introduction
1.1 Motivation
With the growing demand for industrialized products, not to mention economic
competition within the various sectors, companies are required to operate ever
closer to the maximum permissible level of performance, whether to secure
financial and logistical advantages over their competitors or to make better
use of the available resources.
Process control is defined as providing the ideal reaction environment for the
reagents (or microorganisms, in the case of bioprocesses) so as to favor the
formation of a desired product. This role includes supplying the right amount
of reagents or nutrients to the culture (e.g., carbon, nitrogen, oxygen,
phosphorus, sulfur and other minerals), removing any unwanted by-product and
manipulating important reaction parameters (for example, temperature and pH).
Given the characteristics sought in an ideal sensor, interest in developing and
applying optical sensors for process characterization has been growing. The
rapid evolution of this type of sensor, which appears very promising for a wide
range of chemical and biochemical processes, together with its characteristics
and functionality, has drawn the attention of researchers. The appeal of
optical sensors over conventional ones lies in the advantages of their use,
chiefly their immunity to electromagnetic interference, which allows the
construction of non-invasive, remotely monitored, continuous and compact
devices capable of measuring several quantities of interest simultaneously
(Skibsted, Lindemann et al. 2001, Wong, Chan et al. 2014).
Figure 1.2: In-line fiber-optic NIR sensor applied to monitoring the paper
fiber production process. (a) Stage of the manufacturing process, (b) NIR
spectroscopy sensor and (c) sensor applied directly to the process. Source:
Adapted from Kessler and Kessler (2014).
It is important to stress that not all of the information collected by
spectrum-based sensors is relevant for process characterization (Boehl, Solle
et al. 2003), so virtual analyzers or state estimators are essential to
translate what the spectrometer reads on-line into information useful for
operating the system. Spectroscopic measurements can be generated at intervals
significantly shorter than the standard time unit of most processes, depending
basically on the resolution of the collected spectra.
Searching for the regions of the spectrum in which the prediction indicators
are the best possible is a valid approach to variable selection. Selecting
specific spectral regions means choosing the spectral windows that carry the
most information about the analyte of interest. One method that works this way
is Interval Partial Least Squares regression (i-PLS) (Jiang, Berry et al.
2002). This method builds a multivariate model in each window of the spectrum,
with the windows selected by moving a fixed-size window along the spectrum. The
best spectral region is then taken to be the one with the lowest prediction
error among the selected windows (Filgueiras, Alves et al. 2014). An evolution
of this technique uses windows of varying size, with the error indicator
depending on both the prediction of the window and its size. This variation
allows windows larger than the chosen minimum size to be selected; however, it
cannot locate regions formed by combining separate sub-regions (Olivieri,
Goicoechea et al. 2004).
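The fixed-size moving-window search can be sketched as follows. This is only a minimal illustration under stated assumptions, not the i-PLS implementation of the cited works: each window is fitted with ordinary least squares as a stand-in for PLS, and the function name `best_window`, the calibration/validation split and the synthetic data are hypothetical.

```python
import numpy as np

def best_window(X_cal, y_cal, X_val, y_val, width):
    """Slide a fixed-width window over the spectral axis and return
    (start_index, rmse) of the window with the lowest validation error.
    Each window is fitted by ordinary least squares (a stand-in for PLS)."""
    n_vars = X_cal.shape[1]
    best_start, best_rmse = None, np.inf
    for start in range(n_vars - width + 1):
        cols = slice(start, start + width)
        # Local model: intercept plus the variables inside the window.
        A_cal = np.column_stack([np.ones(len(y_cal)), X_cal[:, cols]])
        coef, *_ = np.linalg.lstsq(A_cal, y_cal, rcond=None)
        A_val = np.column_stack([np.ones(len(y_val)), X_val[:, cols]])
        rmse = float(np.sqrt(np.mean((y_val - A_val @ coef) ** 2)))
        if rmse < best_rmse:
            best_start, best_rmse = start, rmse
    return best_start, best_rmse

# Synthetic example (assumption): only variables 5 and 6 carry information.
rng = np.random.default_rng(0)
X = rng.normal(size=(45, 12))
y = X[:, 5] + 2.0 * X[:, 6]
start, rmse = best_window(X[:30], y[:30], X[30:], y[30:], width=2)
```

Because a single contiguous window is returned, the sketch shares the limitation noted in the text: it cannot combine separate sub-regions.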
Objectives:
Since this thesis is structured as a collection of scientific articles, Chapter
3 gives a detailed overview of the two experimental data sets used throughout
the work, which is necessary for a full understanding of the problems addressed
in the following chapters. It also details the proposed chemometric modeling
tool, Pure Spectral Elements Chemometric Modeling (PSCM), which combines MISO
(Multiple Input, Single Output) models with the selection of groups of spectral
variables using ACO (Ant Colony Optimization).
1.5 Contributions
The main contributions of this thesis can be listed as follows:
Chapter 2 –
Chemometrics and Spectroscopy
2.1 Chemometrics
Chemometrics deals with the application of mathematical and statistical tools
to the planning, optimization and extraction of information from a set of
multivariate physicochemical data. Introduced in the late 1960s by research
groups in analytical chemistry and physical organic chemistry, it developed
thanks to the availability of analytical instruments with multivariate
responses and of microprocessors with high computational capacity, which
enabled methods capable of handling large amounts of information
simultaneously (Geladi 2003).
In exploratory analysis, the aim is to find which variables most affect a given
process, as well as the interactions among them, so as to determine the best
analysis conditions. In general, these methods can be classified as supervised,
which include (1) Linear Discriminant Analysis (LDA), (2) the K-Nearest
Neighbors method (KNN) and (3) Partial Least Squares Discriminant Analysis
(PLS-DA), or as unsupervised, such as (1) Principal Component Analysis (PCA)
and (2) Hierarchical Cluster Analysis (HCA) (Wehrens 2011).
Finally, multivariate calibration aims to establish models that relate a large
number of chemical, physicochemical or spectral measurements (independent
variables) of a given sample in order to infer the values of certain properties
of interest of that sample (Kowalski 1983). Methods in this group include (1)
Multivariate Regression, (2) Principal Component Regression (PCR), (3) Partial
Least Squares Regression (PLSR) and (4) Ridge Regression (RR), among others
(Wehrens 2011).
Some problems are easy to recognize, such as measurement noise, sensor spikes
or outliers. In these cases, choosing the appropriate action is not a problem.
The difficulty arises when it is not known which features of the data contain
real information and which do not. Some standard data pre-treatment methods
are (1) noise reduction, (2) baseline correction, (3) peak alignment, (4) peak
selection and (5) scaling (Wehrens 2011).
(2.1)
The model in Equation 2.1 can be expanded into its matrix form according to
Equation 2.2 by considering k explanatory (independent) variables and n
observations (samples).
(2.2)
Equation 2.2 can be written in compact form according to Equation 2.3.
(2.3)
(2.4)
(2.5)
(2.6)
The individual standard deviation of the coefficients, given by the square root
of the diagonal elements of the variance-covariance matrix (Equation 2.5), can
be used for statistical testing. Variables whose coefficients are not
significantly different from zero are usually removed from the model.
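The calculation of coefficient standard deviations can be illustrated with a short sketch. The equations themselves are not reproduced in this text, so the usual textbook OLS forms are assumed here: b = (XᵀX)⁻¹Xᵀy and cov(b) = s²(XᵀX)⁻¹, with s² the residual variance. The function name and the data are hypothetical.

```python
import numpy as np

def ols_with_std_errors(X, y):
    """Return OLS coefficients and their standard deviations.
    X must already contain a column of ones if an intercept is wanted."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                  # coefficient estimates
    residuals = y - X @ b
    s2 = residuals @ residuals / (n - k)   # residual variance estimate
    cov_b = s2 * XtX_inv                   # variance-covariance matrix of b
    return b, np.sqrt(np.diag(cov_b))

# Illustrative data (assumption): intercept, a trend and an alternating term.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
X = np.column_stack([np.ones_like(x), x, [1.0, -1.0, 1.0, -1.0, 1.0]])
y = 1.0 + 2.0 * x + np.array([0.1, -0.1, 0.0, 0.1, -0.1])
b, se = ols_with_std_errors(X, y)
t_values = b / se   # a small |t| suggests a coefficient close to zero
```

The ratio `b / se` is the statistic typically compared against a t distribution when deciding whether a variable should stay in the model.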
OLS has some drawbacks, the most significant for the natural sciences being its
sensitivity to collinearity, which is the existence of at least one linear
dependence among the independent variables (Mardia, Kent et al. 1979). The
confidence intervals for the regression coefficients are based on the
assumption of independence of the variables and, consequently, of the
parameters. Correlations between variables violate this independence, making it
unfeasible to compute the [...]
The central idea behind PCA is that high-dimensional data usually contain many
superfluous variables. A closer look at high-resolution spectra shows that
neighboring wavelengths are highly correlated and carry similar information.
Such data could be filtered by choosing the wavelengths that carry the most
information, or those that differ from the rest. This process can be based on
clustering the variables and selecting one representative from each group.
However, this approach is rather elaborate and leads to different results
depending on the clustering and discarding criteria used.
The PCs are orthogonal combinations of the variables, defined such that the
variance of the scores is maximal, the sum of the Euclidean distances between
the scores is maximal, and the reconstruction of the original data matrix
($\hat{X}$) is as close as possible to the original matrix $X$
($\lVert X - \hat{X} \rVert$ is minimal) (Jackson 1991).
The PCA technique has many advantages: it is simple, has a unique analytical
solution and generally leads to a representation of the data that is easier to
interpret. Its disadvantage is that it does not yield a small group of
wavelengths that carry the information, but rather a small group of PCs in
which all wavelengths are represented.
Once PCA has defined the latent variables, all samples can be plotted, ignoring
the higher-order PCs. Usually, few PCs are needed to capture most of the
variance in the data set (although this depends heavily on the nature of the
data analyzed) (Geladi 2003).
Standard PCA can be implemented even in software with limited numerical
capability, because the algorithm used to compute the PCs is the Singular Value
Decomposition (SVD) of the data matrix. An alternative to SVD is the
decomposition of the covariance or correlation matrix of the data into its
eigenvalues and eigenvectors; however, SVD is numerically more stable and is
preferable in the vast majority of cases.
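A minimal sketch of PCA computed through the SVD of the mean-centered data matrix is given below. It assumes samples in rows and variables in columns, and it computes the fraction of variance per PC as the squared singular values normalized by their sum; the function name `pca_svd` and the random data are illustrative only.

```python
import numpy as np

def pca_svd(X):
    """PCA of X (samples in rows) via SVD of the mean-centered data."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    scores = U * s                       # T: sample coordinates on the PCs
    loadings = Vt.T                      # P: one column per PC
    explained = s**2 / np.sum(s**2)      # fraction of variance per PC
    return mean, scores, loadings, explained

# Random data used only to exercise the function (assumption).
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 6))
mean, T, P, fv = pca_svd(X)
```

Keeping only the first few columns of `T` and `P` gives the compressed representation discussed in the text.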
(2.7)
(2.8)
The explained variance (FV) of each PC can be obtained according to Equation
2.9 (Camacho, Picó et al. 2010).
(2.9)
[...] statistical [...] to define which PCs should be used. However, even this
choice is left to the discretion of the researcher.
The plots in Figure 2.1 clearly show that, for this example, PCs 1 and 2 hold
more variance than the others. Together they account for 55% of the total
variance. Even so, the scree plots do not show a clear cut-off point (something
real data rarely provide). Depending on the aim of the investigation, between
three and five PCs could be chosen. Choosing four PCs would make no sense in
this case, because the fifth PC adds as much variance as the fourth, so if the
fourth PC is included, the fifth should be included as well.
Figure 2.1: Scree plots for assessing the total variance explained by each PC.
(a) Variances of each PC, (b) logarithm of the variances, (c) fraction of the
total accumulated variance and (d) cumulative percentage of the total variance.
Source: Adapted from Wehrens (2011).
Once a data set has been reduced to a smaller number of dimensions by PCA, it
is possible to determine how other data will be positioned in this new space.
These new data may correspond to new samples, measurements or instruments. The
aim is for the lower-dimensional representation to allow a broader and simpler
assessment of the data for pattern identification, which is made possible by
obtaining the scores of the new data set. Starting from a new data matrix X,
its projection onto the space defined by the loadings (P) can be obtained
according to Equation 2.10.
(2.10)
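The projection of new data onto existing loadings can be sketched as below. The equation body is not reproduced in this text, so the sketch assumes the usual form T_new = X_new P, with the new data centered using the mean of the original data set (a common convention, assumed here). Function names and data are hypothetical.

```python
import numpy as np

def fit_loadings(X, n_pc):
    """Return the column means and the first n_pc loading vectors of X."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_pc].T

def project(X_new, mean, P):
    """Scores of X_new in the space spanned by the loadings P."""
    return (X_new - mean) @ P

# Random matrices stand in for the original and the new samples (assumption).
rng = np.random.default_rng(2)
X_train = rng.normal(size=(15, 5))
X_new = rng.normal(size=(4, 5))
mean, P = fit_loadings(X_train, n_pc=2)
T_new = project(X_new, mean, P)
```

Plotting `T_new` on top of the original scores shows where the new samples fall relative to the patterns already identified.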
The price paid for the advantages of PCR lies in the possible loss of vital
process information due to data compression, the need to determine the degree
of compression (selecting the number of principal components used) and,
finally, the inability to obtain analytical expressions for the prediction
error and for the variances of the individual regression coefficients.
Conclusions about the optimal number of principal components and the expected
prediction error can only be drawn using cross-validation or similar
techniques.
The PCR methodology can be divided into two distinct steps: obtaining the
principal components from the original data matrix (applying PCA), and
regressing on these components to obtain the prediction model. Once the scores
have been obtained and the most important PCs to be used in the regression have
been selected, the problem reduces to solving the regression model summarized
in Equation 2.11, where T is an n × a matrix, n is the number of observations,
a is the number of principal components selected, Y contains the observations
of the variable of interest and A is the vector of model parameters.
(2.11)
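The two PCR steps can be sketched as follows. Since the equation body is not reproduced in this text, the sketch assumes the least-squares solution A = (TᵀT)⁻¹TᵀY on mean-centered data; the function names `pcr_fit` and `pcr_predict` and the synthetic data are illustrative assumptions.

```python
import numpy as np

def pcr_fit(X, y, n_pc):
    """Step 1: PCA of X. Step 2: least-squares regression of y on the scores."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    _, _, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
    P = Vt[:n_pc].T                       # loadings of the selected PCs
    T = (X - x_mean) @ P                  # scores, n x a
    A = np.linalg.solve(T.T @ T, T.T @ (y - y_mean))
    return x_mean, y_mean, P, A

def pcr_predict(X_new, model):
    x_mean, y_mean, P, A = model
    return y_mean + (X_new - x_mean) @ P @ A

# Synthetic, noise-free linear data (assumption) so the fit can be checked.
rng = np.random.default_rng(3)
X = rng.normal(size=(25, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.7
model = pcr_fit(X, y, n_pc=4)
y_hat = pcr_predict(X, model)
```

With fewer components than variables the fit is no longer exact, which is the compression trade-off described in the text.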
PLSR, like PCR, defines orthogonal latent variables to compress the information
and discard irrelevant data. However, PLS obtains the latent variables so as to
capture the greatest variance contained in X and Y while maximizing the
correlation between these matrices (Dayal and MacGregor 1997).
With the score vectors obtained by the PLS method, solving PLSR reaches a step
similar to PCR: instead of modeling Y directly from X, the score vectors (T)
are used to calculate the regression coefficients (A) of the prediction model
for Y. The vector A is calculated according to Equation 2.11, as in the PCR
model, the only difference being the method used to obtain T, which in PLSR
takes the information in Y into account (Kettaneh, Berglund et al. 2005).
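One common way of obtaining PLS scores for a single response is the NIPALS algorithm (PLS1), sketched below. The text does not state which PLS algorithm is used, so this choice is an assumption; the names `pls1_fit`, `q` and `B` are hypothetical. In the sketch, `q` plays the role of the vector A (regression of the centered response on the scores T).

```python
import numpy as np

def pls1_fit(X, y, n_lv):
    """PLS1 by NIPALS: scores are built from the X-y covariance direction."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean
    W, P, T, q = [], [], [], []
    for _ in range(n_lv):
        w = E.T @ f
        w /= np.linalg.norm(w)       # weight vector uses information from y
        t = E @ w                    # score vector
        tt = t @ t
        p = E.T @ t / tt             # X loading
        qa = f @ t / tt              # regression coefficient of y on t
        E = E - np.outer(t, p)       # deflate X
        f = f - qa * t               # deflate y
        W.append(w); P.append(p); T.append(t); q.append(qa)
    W, P, T, q = np.array(W).T, np.array(P).T, np.array(T).T, np.array(q)
    B = W @ np.linalg.solve(P.T @ W, q)   # coefficients in original variables
    return x_mean, y_mean, T, q, B

# Synthetic data (assumption) to exercise the function.
rng = np.random.default_rng(4)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=30)
x_mean, y_mean, T, q, B = pls1_fit(X, y, n_lv=3)
```

When all latent variables are kept, as in this example, the PLS coefficients coincide with the OLS solution; the benefit of PLS appears when fewer latent variables are retained.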
PLS has some advantages that motivate its wide application in chemometric
process modeling, among them the fact that the method is considered resistant
to overfitting, the term used for situations in which statistical models fit
the calibration sample set too closely, adjusting to measurement errors and
random factors present in that set but not representative of the process as a
whole (Land Jr, H. et al. 2011).
Tabu Search, in turn, dates back to the 1970s. It is an iterative,
non-monotonic method for finding global optima whose main feature is the
ability to exploit the history of the search process, organized in structures
that make up the so-called "adaptive memory". The method keeps a list of
forbidden moves, known as the Tabu List, which reduces the risk of the
algorithm cycling (running in an infinite loop) (Pereira 2007).
Pheromone, being a biologically active substance that evaporates over time,
acts as a lure for the other ants. In the real experiment, the ants that took
the shortest path returned to the nest sooner, so that the trail followed by
these individuals shows a higher [...]
The first version of the ACO algorithm was developed by Dorigo and Gambardella
(1997) to solve the Traveling Salesman Problem, a combinatorial optimization
problem of searching a space of permutations. The algorithm is based on
distributing an army of ants that must visit a set of cities, each exactly
once, covering the shortest possible distance (Shamsipur, Zare-Shahabadi et al.
2006).
Provided the available group of samples can be considered representative of the
data that will be collected and predicted, a way of estimating the quality of
the predictions must be adopted, using the collected data to simulate the data
that will be predicted.
(2.12)
The second criterion for model evaluation is R², which indicates how well
statistical models fit real data. This indicator tells how closely the observed
data are reproduced by the model, quantifying the proportion of the variation
in the response that is explained by the regressors in a model. For MLR models,
this index is the square of the correlation coefficient between the observed
values of the variable and their predicted values.
(2.13)
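Both evaluation criteria can be illustrated with a short sketch. The equation bodies are not reproduced in this text, so the sketch assumes that the first criterion is the root mean square error of prediction (RMSEP, the metric reported in the Chapter 6 figures) and that R² takes its usual form 1 − SS_res/SS_tot. Function names and numbers are illustrative.

```python
import math

def rmsep(y_obs, y_pred):
    """Root mean square error of prediction."""
    n = len(y_obs)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(y_obs, y_pred)) / n)

def r_squared(y_obs, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(y_obs) / len(y_obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(y_obs, y_pred))
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    return 1.0 - ss_res / ss_tot

# Toy values (assumption): one of three predictions is off by one unit.
y_obs = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 4.0]
error = rmsep(y_obs, y_pred)
fit = r_squared(y_obs, y_pred)
```

RMSEP is expressed in the units of the predicted property, whereas R² is dimensionless, which is why the two are usually reported together.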
Several vibrational energy levels are associated with each of the four
electronic states, as suggested by the thinner horizontal lines. As illustrated
in Figure 2.6, absorption transitions can occur from the singlet ground
electronic state (S0) to various vibrational levels of the excited singlet
electronic states (S1 and S2). Note that direct excitation from the singlet
ground state to the excited triplet state is not shown: because this transition
involves a change in multiplicity, its probability is very low. A
low-probability transition of this type is called a forbidden transition
(Skoog, Holler et al. 2007).
The second beam from the radiant energy source, produced after the signal from
the excitation selector passes through a beam splitter, usually a
semi-transparent mirror, is directed through an attenuator that reduces its
power to a value close to that of the fluorescence radiation, lowering its
intensity by about a factor of one hundred. This attenuated beam reaches a
second transducer and is converted into an electrical signal. The electronics,
together with a data analysis system, process the two signals to calculate the
ratio of the fluorescence emission intensity to the excitation source
intensity; in this way it is possible [...]
Fluorometers are instruments that use only filters to select the excitation and
emission wavelengths, whereas spectrofluorometers use two monochromators to
isolate the wavelengths (Omary and Patterson 1999).
Table 2.1: Comparison of the qualitative characteristics of MIR and NIR.
Source: Adapted from Pasquini (2002).

              MIR                     NIR
Vibrations    Fundamental             Overtones and combinations
Qualitative   Excellent (structure)   Poor (identity)
Quantitative  Excellent               Excellent
Two types of detectors are used to cover the whole useful range of the
spectrum: (a) silicon detectors for the region between 0.8 and 1.1 μm, and (b)
lead sulfide detectors for the region between 1.1 and 2.5 μm.
Figure 3.1: Interpolation of off-line data for the fermentation states. (a)
Fermentation run 1 and (b) fermentation run 2. (×) Ethanol, (●) glucose and (♦)
biomass.
When S. cerevisiae cells are exposed to glucose in a medium suitable for their
growth, they produce ethanol and biomass even under aerobic conditions
(Crabtree effect), and the diauxic growth pattern of the population can be
observed (Zang, Scharer et al. 1997, Pratap R 2003, Pratap 2003). When operated
in batch mode, this process can be described by the system of differential
equations 3.1 to 3.3, following Zang et al. (1997).
(3.1)
(3.2)
(3.3)
In the dynamic model, diauxic growth implies that µG is greater than zero only
while glucose is present in the medium and, consequently, µE is equal to zero,
with no growth on ethanol (glucose repression of growth). When the glucose
concentration reaches zero, that is, when all the glucose added at the start of
the process has been consumed and ethanol becomes the only available carbon
source, µE becomes greater than zero until all the ethanol is consumed, and µG
becomes zero.
(3.4)
(3.5)
(3.6)
(3.7)
(3.8)
Once the dynamic model structure proposed for the system has been validated,
the model parameters are fitted using the two fermentation data sets
simultaneously. The resulting fitted model thus correlates equally with both
fermentations and is equally representative of the two experimental data sets.
The parameter fit obtained for the modified model shows that this transition
between growth rates occurs over glucose concentrations between approximately
0.5 g/L and 1.5 g/L. If the transition were truly instantaneous, the fitted
parameter a would take high values, since the higher this value, the more
abrupt the change in cell metabolism. In the processes studied, the range of
glucose concentration over which the metabolic change occurs corresponds to a
period of about 3 hours, roughly 15% of the total time of each batch, which
shows the importance of considering a gradual transition between growth rates.
Figure 3.5: Simulation of off-line data for the fermentation states of (a)
fermentation run 1 and (b) fermentation run 2, using the dynamic model fitted
simultaneously to both runs. (×) Ethanol, (●) glucose and (◊) biomass. The
lines represent the states simulated by the model.
Analysis of the parameters a and b allows a more detailed evaluation of the
fermentation runs, providing indices that quantitatively characterize the cell
metabolism of this microorganism. Models capable of describing transitions in
cell behavior make it possible to study optimized operating strategies for the
system, increasing its profitability.
According to Kara et al. (Sep 2010), the difference between two processes can
be seen as the difference between the respective initial spectra of each
fermentation run. Given the high complexity of the compounds present in the
reaction medium of fermentation processes and the sensitivity of fluorescence
spectroscopy, small variations in the reaction medium can cause significant
differences in the fluorescence spectra of the runs.
A clearer assessment of the real difference between the initial spectra can be
obtained by comparing them after normalization. The spectra, normalized with
the SNV method, are compared again, and the absolute differences, together with
the normalized spectra, are shown in Figure 3.9.
Figure 3.9: Fluorescence spectra at t = 0, normalized with the SNV method, for
(a) fermentation run 1 and (b) fermentation run 2. (c) Absolute pairwise
differences in fluorescence intensity between the normalized spectra.
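SNV (Standard Normal Variate) normalization and the pairwise comparison of two normalized spectra can be sketched as below. The sketch assumes the usual SNV definition, in which each spectrum is centered on its own mean and divided by its own standard deviation; the two short "spectra" are made-up numbers, not data from the fermentation runs.

```python
import numpy as np

def snv(spectra):
    """Standard Normal Variate: normalize each spectrum (row) individually."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mean) / std

# Two hypothetical spectra with the same shape but different intensity scale.
raw = np.array([[1.0, 2.0, 3.0, 4.0],
                [10.0, 20.0, 30.0, 40.0]])
normalized = snv(raw)
abs_diff = np.abs(normalized[0] - normalized[1])  # pairwise absolute difference
```

Because the two rows differ only by a scale factor, their normalized versions coincide and the absolute difference vanishes; differences that remain after SNV reflect changes in spectral shape rather than in overall intensity.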
Despite the equivalence between the reaction media of the runs, the feasibility
of comparing the processes over the whole fermentation time must be assessed.
Bioprocesses are highly susceptible to disturbances, and even small variations
in process temperature or in other unmonitored variables (color of the medium,
concentrations of secondary compounds, etc.) can cause a significant variation
in the quality of the fluorescence spectra (Lakowicz 2006).
Since the points corresponding to the two runs take similar values and evolve
equivalently over the course of the runs, it can be concluded that the runs
were conducted in the same way, so the spectral data are comparable and come
from the same biochemical process. Had there been a significant difference
between the PCA results for the two processes, different pre-treatments of the
spectral data would have been needed to reduce the influence of uncontrolled
external variables (Kara, Anton et al. Sep 2010). This is not the case for the
data set in question, which can be used for chemometric purposes with SNV
normalization alone.
The types of wheat flour usually found are 405, 550, 812, 1050, 1600 and
wholemeal. Rye flour is usually found in types 997, 1150 and wholemeal. The
sample set was therefore chosen to be representative of the range of flours on
the market (Strohm 2012).
Figure 3.11: Multi Purpose NIR Analyzer, Bruker Optics GmbH, Ettlingen,
Germany. Inset: sample container set up for reflectance NIR spectroscopy
measurements.
The qualitative evaluation of the NIR spectroscopy data was carried out using
PCA. In this step, the averages of the triplicate measurements of each sample
are used, which makes it possible to analyze the equivalence between the groups
of samples through the characteristics of the spectra. Figure 3.12 shows (a)
the averaged NIR measurements for the 34 samples and (b) the spectroscopy data
normalized with SNV.
Plots such as those in Figure 3.13 are commonly used to assess whether the
samples form clusters, that is, whether they separate into groups with similar
spectral qualities. Here, the samples tend to group according to the origin of
the flour, forming groups of statistically correlated samples. Although these
groups can be separated, the distinction is not sharp, showing that there are
similarities between the samples.
This work proposes the PSCM tool as an alternative to conventional methods for
the qualitative and quantitative analysis of spectroscopic data in the study
and characterization of chemical or biochemical processes.
2nd • Off-line characterization
6th • Model calibration
7th • Model testing
The third step concerns choosing the most suitable method for splitting the
sample set into data used for calibration and for validation of the chemometric
models. Here, the nature of the sample data set is the deciding factor in which
cross-validation strategy to use. For large data sets the Y-rank methodology is
the most suitable, whereas for data sets with few samples the One Out
(leave-one-out) strategy gives more representative results.
Since both data sets contain a large number of samples, the Y-rank strategy was
applied in both cases, with 2/3 of the sample data used for calibration and
selection of spectral elements and 1/3 used to test the calibrated models. In
this way, model calibration, as well as the selection of the spectral
components used in the models, is carried out independently of the samples used
for validation, which lends credibility to the conclusions about the
methodology and the applicability of the technique.
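A 2/3 to 1/3 split based on the ranking of the response can be sketched as follows. The text does not define the Y-rank procedure in detail, so the sketch assumes one plausible reading: samples are sorted by the reference value y and every third sample in that order goes to the test set, which keeps both sets spread over the full range of y. The function name and the values are hypothetical.

```python
def y_rank_split(y, test_every=3):
    """Sort samples by y and send every `test_every`-th one to the test set.
    Returns (calibration_indices, test_indices) into the original list."""
    order = sorted(range(len(y)), key=lambda i: y[i])
    test = [idx for rank, idx in enumerate(order) if rank % test_every == 1]
    calib = [idx for rank, idx in enumerate(order) if rank % test_every != 1]
    return calib, test

# Hypothetical reference values for nine samples.
y = [0.9, 0.1, 0.5, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
calib_idx, test_idx = y_rank_split(y)
```

Taking the middle sample of each consecutive triple keeps the lowest and highest y values in the calibration set, so the test samples are always predicted by interpolation.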
The fourth step, in turn, deals with the prior selection of the model structure to be
used in the modeling. The structure comprises model characteristics such as the number
of input variables and the types of effects considered. In these steps of the procedure,
multilinear regression (MLR) offers the advantage that the model parameters are fitted
by solving an algebraic problem (Section 2.1.2); otherwise, parameter fitting through
optimization methods is required, increasing the computational time
Since the total number of possible groups of spectral elements equals the number of
combinations of the available elements taken n at a time, where n is the size of the
selected group, the use of tools specific to this purpose is mandatory to make the
procedure feasible.
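The growth in the number of candidate groups can be checked with a short calculation. The sketch below assumes, for illustration only, a universe of 1150 spectral elements (the size of the NIR data set used later in this work); `n_groups` is a hypothetical helper name.

```python
from math import comb


def n_groups(n_elements, group_size):
    """Number of distinct groups of `group_size` elements drawn from `n_elements`."""
    return comb(n_elements, group_size)


n_elements = 1150  # illustrative universe size
for group_size in (1, 2, 3, 5, 10):
    print(group_size, n_groups(n_elements, group_size))
```

Groups of only three elements already give more than 250 million candidates, which is why an exhaustive evaluation quickly becomes impractical.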
The cities are selected so that each ant, being in a given city, chooses the next city
while excluding from its selection all cities already visited. This selection takes two
factors into account: a random one and one based on the pheromone distribution on the
trails between the cities.
time, and (b) reinforcement of the marking intensity due to the use of the trails by the
ants.
In the initial phase of the algorithm, the variables needed to start solving the
optimization problem are initialized. In this step the following are defined or loaded
into the program: (a) spectral data, arranged as a two-dimensional matrix in which each
row corresponds to a distinct sample and each column to a given spectral component;
(b) column vector of observed variables, in which each row corresponds to the respective
sample of the loaded spectroscopy data; (c) number of cycles over which the algorithm
searches for the optimum; (d) size of the ant army used in the search; (e) model size,
indicating the size of the group of spectral elements to be searched for; (f) type of
model used to evaluate the group of spectral elements, which by default is linear
without interactions but may also be pure quadratic, quadratic with interactions or
linear with interactions; (g) initial value of the pheromone trail, that is, the equal
value assigned to each spectral element before the optimization starts, so that all
elements begin with the same amount of marker (default value 10^-6); and (h) marker
evaporation rate, the multiplier applied to the pheromone vector between cycles
(default 0.5, meaning that in each cycle all markers are reduced to half of their
previous value).
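The settings listed above can be grouped as in the following Python sketch. Only the two defaults stated in the text (initial pheromone 10^-6 and evaporation rate 0.5) and the default linear model type come from the description; the class name `AcoSettings`, the field names, the helper `initial_trail` and the example values for cycles, army size, model size and number of elements are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class AcoSettings:
    n_cycles: int                     # (c) number of search cycles
    n_ants: int                       # (d) size of the ant army
    model_size: int                   # (e) number of spectral elements per group
    model_type: str = "linear"        # (f) linear without interactions by default
    initial_pheromone: float = 1e-6   # (g) equal starting value for every element
    evaporation_rate: float = 0.5     # (h) multiplier applied between cycles


def initial_trail(n_elements, settings):
    """Pheromone vector with the same starting value for every spectral element."""
    return [settings.initial_pheromone] * n_elements


# Illustrative values only; the text states no defaults for these three settings
settings = AcoSettings(n_cycles=100, n_ants=50, model_size=5)
trail = initial_trail(1150, settings)
print(len(trail), trail[0], settings.evaporation_rate)
```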
In this step, during each cycle of the algorithm, the ant army sweeps the universe of
possible solutions in search of the group of spectral elements that gives the lowest
value of the objective function. The objective function, in this case, is the sum over
all samples of the squared difference between the observed values of the state variable
and the values predicted by the selected group of elements. Each individual of the ant
army chooses a group of elements and submits it to the objective function test. If the
result is lower than the one previously stored, the latter is replaced and the new best
result takes its place in the solution vector.
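A minimal sketch of this objective function, assuming the default linear model without interactions and a fit by ordinary least squares with NumPy. The intercept term, the function name `sse_objective` and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def sse_objective(X, y, selected):
    """Sum of squared errors of a linear model fitted on the selected columns of X."""
    # Design matrix: intercept column (assumed) plus the selected spectral elements
    A = np.column_stack([np.ones(len(y)), X[:, selected]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # algebraic least squares solution
    residuals = y - A @ coef
    return float(residuals @ residuals)


# Synthetic data: y depends only on columns 1 and 4
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 8))
y = 2.0 * X[:, 1] - 1.5 * X[:, 4] + 0.5

good = sse_objective(X, y, [1, 4])  # group holding the informative elements
bad = sse_objective(X, y, [0, 2])   # group holding unrelated elements
print(good, bad)
```

An ant that proposes the informative group obtains a much lower objective value, so that group would replace the stored best result.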
The random factor ensures that the search algorithm does not get trapped in possible
local minima, so that the whole search region is evaluated. In practice, this factor is
implemented through a function that generates random values between 0 and 1.
Each time a new element must be selected and added to a group that is about to be
started or is being formed, the algorithm fires the “random trigger” and uses its result
as the decision factor for selecting the next element of the group.
(3.9)
(3.10)
Because the random trigger is used to select elements associated with the values of
the CF curve, and this curve in turn varies significantly only for elements with high
values of F, the probability that the algorithm selects elements with higher associated
pheromone concentrations is high, thus prioritizing the selection of the spectral
elements best rated by the ant army.
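The selection mechanism can be sketched as a roulette wheel. Since the expressions for F and CF are not reproduced here, the sketch assumes that F is the pheromone amount normalized over the elements still available and that CF is its running sum; `select_element` and the example trail are hypothetical.

```python
import random
from itertools import accumulate


def select_element(pheromone, excluded, rng=random):
    """Pick one element not in `excluded`, with probability proportional to pheromone."""
    candidates = [i for i in range(len(pheromone)) if i not in excluded]
    total = sum(pheromone[i] for i in candidates)
    # Assumed CF: cumulative sum of normalized pheromone, ending at 1
    cf = list(accumulate(pheromone[i] / total for i in candidates))
    trigger = rng.random()  # the "random trigger", uniform in [0, 1)
    for index, level in zip(candidates, cf):
        if trigger <= level:
            return index
    return candidates[-1]  # guard against floating-point round-off


# Build one group of three distinct elements from a five-element trail
rng = random.Random(42)
pheromone = [1e-6, 1e-6, 0.8, 1e-6, 0.4]
group = []
while len(group) < 3:
    group.append(select_element(pheromone, excluded=group, rng=rng))
print(group)
```

Elements with large pheromone values occupy wide intervals of CF, so the trigger falls on them far more often, while elements already in the group are never drawn again.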
Tests of this new approach were carried out with data sets for the travelling salesman
problem and gave results equivalent to those of the original implementation, with a
reduction of about 10% in computational time.
(3.11)
(3.12)
(3.13)
cumulative differ only slightly, for the example data it can be confirmed that the
representativeness of the elements with low amounts of pheromone was reduced,
increasing the probability that the best-rated elements are selected to compose the
response groups.
This strategy means that elements more strongly correlated with the variable of
interest, when added to search groups, produce better objective function results,
increasing the increment of pheromone on the elements of the group. As the optimization
proceeds, this strategy allows the spectral elements to be qualitatively characterized
according to their ability to predict the variable of interest.
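One cycle of trail update can be sketched as follows. The evaporation multiplier of 0.5 follows the default stated earlier; the deposit rule, taken here as the inverse of the objective function value, is an assumption, since the exact increment is not reproduced in this passage. `update_pheromone` is a hypothetical name.

```python
def update_pheromone(pheromone, groups, sse_values, evaporation=0.5, eps=1e-12):
    """Evaporate the whole trail, then reinforce the elements of each evaluated group."""
    updated = [evaporation * p for p in pheromone]  # all markers shrink by the same factor
    for group, sse in zip(groups, sse_values):
        deposit = 1.0 / (sse + eps)  # assumed rule: lower error gives a larger deposit
        for index in group:
            updated[index] += deposit
    return updated


# Five elements starting at the default trail value; two ants report their groups
pheromone = [1e-6] * 5
groups = [[0, 1], [1, 2]]
sse_values = [4.0, 2.0]
updated = update_pheromone(pheromone, groups, sse_values)
print(updated)
```

Element 1, present in both groups, ends with the largest amount of pheromone, while the unused elements 3 and 4 only evaporate.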
Abstract: The key objective of process optimization is to obtain higher productivity
and profit in chemical or bio-chemical processes. To achieve this, we must apply control
techniques that closely correlate with our ability to characterize the process. Within
this context, optical sensors associated with chemometrical modeling are considered a
natural choice due to their low response time as well as their non-intrusive and high
sensitivity characteristics. Usually, chemometrical modeling is based on PCR (Principal
Component Regression) and PLS (Partial Least Squares). However, since optical techniques
are highly sensitive and bio-chemical media are highly complex, these methodologies can
be replaced by chemometrical modeling based on Pure Spectra Components (PSCM).
Our study applies PCR, PLS and PSCM for protein prediction in flour samples measured
with Near Infrared Reflectance (NIR), comparing the three methodologies for on-line
sensor design. We also outline the development of a spectral filter based on PSCM
associated with Ant Colony Optimization. The results lead to our conclusion that the use
of optical techniques works best when PSCM analysis is applied, as it allows the
development of a spectral sensor for protein quantification in flour samples with fewer
than twenty NIR wavelengths evaluated, selected from a total of 1150. The filtering tool
showed favorable results in condensing relevant information from NIR spectral data,
increasing the R² of sample prediction by almost 60% for PCR models and 40% for PLS
models, using 10% and 20% of full spectral data, confirming the viability of filtering
methods.
Published at Chemometrics and Intelligent Laboratory Systems
4.1 Introduction
The ability to develop advanced control and optimization tools is intimately correlated
with the ability to measure the state variables (Scheper, Hitzmann et al. 1999, Whitford
and Julien 2007). Optical sensors are noninvasive, continuous and present low response
time and cost with high sensitivity and resolution. More specifically, spectroscopy
measurements - such as fluorescence spectroscopy, near infrared (NIR), multivariate FT-IR
spectroscopy, Raman spectroscopy, and others (Clementschitsch, Jürgen et al. 2005, Rhee
and Kang 2007, Whitford and Julien 2007, Roy and Pratim Roy 2009, Bosque-Sendra,
Cuadros-Rodríguez et al. 2012) - allow us to detect several analytes simultaneously. All
these features make optical sensors one of the most promising tools to be applied in
chemical and biochemical processes (Ge, Kostov et al. 2005, Whitford and Julien 2007).
Spectral methods provide a very large amount of data that must be pre-processed to
provide practical information for the user (Solle, Geissler et al. 2003, Kara, Anton et al.
2010, Warth, Rajkai et al. 2010). Therefore, the use of mathematical modeling is required
in order to effectively measure analyte concentrations and/or material properties. As
defined by Varmuza and Filzmoser (2008), “chemometrics concerns the extraction of
relevant information from chemical data with mathematical and statistical tools”.
Successful methods to handle such data have been developed in the field of
chemometrics: linear multivariate statistics such as multiple linear regression with factor
analysis (FA-MLR), Stepwise Multi Linear Regression (Stepwise MLR), Partial Least Squares
(PLS), Genetic Function Algorithm (GFA), Genetic PLS (G/PLS), Principal Component
Analysis (PCA) or Principal Component Regression (PCR), as well as non-linear tools, such
as Artificial Neural Network (ANN) (Clementschitsch, Jürgen et al. 2005, Rhee and Kang
2007, Roy and Pratim Roy 2009, Krishnan, Williams et al. 2011, Bosque-Sendra, Cuadros-
Rodríguez et al. 2012, Farrés, Villagrasa et al. 2012). The most applicable methods are
PCA, PCR and PLS, useful for quantitative analysis of spectroscopy data (Bro, van den Berg
et al. 2002, Geladi, Sethson et al. 2004). These techniques are meant to provide a
synthetic description of large data sets, allowing evaluations across the spectrum (Jolliffe
1986).
PCA is a powerful tool for data analysis, able to identify patterns in the data set and
express data in a manner that highlights similarities and differences. Once patterns are
found, the data set can be compressed without losing the main information. Several kinds
of analyses use it to extract information related to physical and chemical properties from
fluorescence matrices or for dimensionality reduction of fluorescence spectra in several
systems (Tartakovsky, Lishman et al. 1996, Boehl, Solle et al. 2003, Guimet, Ferré et al.
2004, Rhee and Kang 2007, Kara, Anton et al. 2010).
PCR and PLS are commonly used with spectral data. After identifying the Principal
Components, which account for most of the variance, these components can be used in
regression. This method can transform highly correlated independent variables into
uncorrelated Principal Components (PCs) (Rhee and Kang 2007). PCR has applications in
Raman spectroscopy analysis and a few studies apply this method to analyze 2D
fluorescence spectral data (Cooper 1999, Boehl, Solle et al. 2003, Otsuka 2004,
Sorouraddin, Rashidi et al. 2005).
PLS, considered one of the most widely used multivariate calibration methods, is
extensively applied to chemometric modeling of spectroscopic data to characterize
biological processes, for instance in infrared and 2D fluorescence spectroscopy. PLS
In their research, they analyzed two aspects not usually explored in literature: (i) the
use of pure spectra for chemometric modeling without the need for statistical
pretreatment of data sets and (ii) extension of this methodology for creation of a more
robust procedure for on-line sensor development. In this sense, Shamsipur et al. (2009),
Hemmateenejad et al. (2011) and Allegrini and Olivieri (2011) focused their attention on
the use of Ant Colony Optimization (ACO) as a tool for wavelength selection, comparing
the different possible implementations of ACO with other heuristic optimization
methodologies, such as genetic algorithms. However, a comparison between the
well-established methods for spectral analysis - PCR and PLS - and the methodology using
ACO, highlighting the advantages and disadvantages of each method, is still missing.
Thus, the main objective of our study is to compare well-known chemometric PCR and
PLS methods with a non-trivial approach, which uses models directly based on spectrum
components for state variable prediction. This method improvement is called
Chemometric Modeling Based on Pure Spectra Components (PSCM) and uses the ACO as
a tool for spectral component selection and chemometrical modeling.
We use a set of flour samples characterized using NIR spectroscopy to compare the
chemometrical methods used. Given that the evaluation of NIR spectroscopy with PLS is
considered a standard methodology for characterizing flour (Cocchi, Corbellini et al. 2005,
Ait Kaddour and Cuq 2009, Li Vigni, Durante et al. 2009, Vigni, Baschieri et al. 2011), this
specific data set is ideal for comparing the results of chemometric modeling using PSCM
with standard methodologies.
4.2 Methodology
4.2.1 Experimental Data Set
The experimental data used in this work includes 34 samples of different brands and
kinds of flour measured in triplicate to determine several important variables. In our
study, however, we will only take the protein content values into consideration. All the
samples were characterized off-line through a farinograph analysis (Brabender GmbH
& Co. KG, Duisburg, Germany, model FD0234H). We evaluated the protein content with a
Digestion Apparatus (Digesdahl® Hach - Düsseldorf, Germany). Parallel to the farinograph
analysis, we characterized the samples with NIR spectroscopy measurements. We
performed NIR measurements in a Multi-Purpose NIR Analyzer (Bruker Optik GmbH -
Ettlingen, Germany), varying wavelength from 800 nm to 2800 nm.
Figure 4.1 shows the average of protein concentration for data set samples. The off-
line protein data set presents the segmentation of flour samples into calibration and test
groups, which is necessary for chemometrical analysis.
Figure 4.1: Protein content in sample set and segmentation in calibration and test set.
We chose this data set based on specific characteristics, such as protein content
variability and spectral range. The spectral data of each sample are composed of a large
number of wavelengths that fall between 800 nm and 2700 nm, with a variable interval,
totaling 1150 NIR spectral elements.
Regardless of which chemometric method is applied, we calibrate and test them all in
different sample groups. Calibration and test groups are determined by selecting samples
with protein contents that represent the full range inside each sample group. Figure 4.1
shows the segmentation used.
Principal Component Analysis – PCA: To apply this method, the data set should be
structured as a single matrix, with samples listed in rows, and individual spectrum
intensity values listed in their respective columns (Jolliffe 1986).
that PCs are mutually orthogonal, it is possible to avoid the typical problem of collinearity
and high correlation which arises in many regression techniques. In this methodology, the
PCs are combined in a way to predict the output data matrix, using multivariable linear
regression (Liu, Kuang et al. 2003).
The procedure for PCR is divided into three steps: evaluation of PCA and determination
of the most significant PCs, ordinary least squares regression on the selected
components, and computation of model parameters for the selected explanatory variables
(Upton, Cook et al. 2008).
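The three steps can be sketched with NumPy as below. This is a minimal illustration, assuming mean-centered data and a PCA computed by singular value decomposition; the function names `pcr_fit` and `pcr_predict` and the synthetic data are hypothetical and do not reproduce the software used in this study.

```python
import numpy as np


def pcr_fit(X, y, n_components):
    """Principal Component Regression in three steps (sketch)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # Step 1: PCA by SVD; rows of Vt are loadings ordered by explained variance
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T
    scores = Xc @ loadings
    # Step 2: ordinary least squares of the centered response on the retained scores
    gamma, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)
    # Step 3: express the model in terms of the original explanatory variables
    beta = loadings @ gamma
    intercept = y_mean - x_mean @ beta
    return beta, intercept


def pcr_predict(X, beta, intercept):
    return X @ beta + intercept


# Synthetic check: with all components retained, PCR reproduces an exact linear relation
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 4.0
beta, intercept = pcr_fit(X, y, n_components=5)
print(beta, intercept)
```

Keeping fewer components than columns trades some calibration fit for a regression on mutually orthogonal inputs.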
Partial Least Squares – PLS: This is a multivariate statistical technique with the goal of
correlating two data sets and making a prediction of one set from another. It is meant to
identify the factors (latent variables, LVs) that not only capture the largest amount of
variance in data, but also provide a linear correlation between the spectral data and
process variables.
The inner relation in PLS can be expressed as a sum of the outer product of score
matrix and identity matrix plus a residual matrix. The regression equation of output data
matrix can be written as the sum of the product of spectral data matrix and regression
coefficients related to the weights, which can be considered additional loading matrices
that express the correlation between spectral data matrix and output data matrix (Geladi
and Kowalski 1986).
Pure Spectra Chemometric Modeling – PSCM: This chemometric analysis has two
main pillars: the selection of pure spectral elements and model adjustment for state
variable prediction.
Figure 4.2 illustrates the steps for chemometric modeling using the PSCM
methodology. The full methodology is divided into blocks in order to present some
characteristics of each step.
The selection of the spectral group aims to choose spectrum components that present
direct correlation with state variables, discarding possible noise and errors caused by
spectral regions not significantly related to the interest variable (Skoog, Holler et al. 2007).
The selected group of spectral variables is used as input data in multilinear models with
one output (MISO models - Multiple Input Single Output). Those models are linear in
relation to their parameters, allowing calibration by analytical least squares solution.
The use of multilinear models is not mandatory, although their parameter calibration
allows for an algebraic resolution, significantly decreasing the time needed for
adjustment of models as well as for spectral group selection. Any model structure can be
applied, depending solely on the kind of spectral data available and process in the study
case.
The search for independent variables in PSCM is crucial for method efficiency, and can
be done through many different approaches. The simplest way applies an exhaustive
search, where all the possible combinations of spectral variables are tested and the best
one is selected. Although this technique ensures that the optimal one is found, depending
on the number of spectral variables and model size, the number of possible combinations
can be so high that computation time makes it impracticable.
To solve this problem, we apply optimization techniques to select the best spectral
component set to describe a particular process variable. The benefits obtained from this
methodology are not only the stability of the model in terms of collinearity in multivariate
spectra, but also the interpretability of the relationship between the model and the
sample compositions, as initially presented by Allegrini and Olivieri (2011). Previous
studies in our group tested several methods for spectral variable selection (Ranzan,
Trierweiler et al. 2012, Masiero, Trierweiler et al. 2013, Ranzan, Ranzan et al. 2013), and
although PSCM allows the use of any variable selection method, features of ACO (e.g.
easy implementation, computation time, qualitative spectrum information, etc.) led us to
choose this method.
Mullen, Monekosso et al. 2009). Figure 4.3 presents a schematic algorithm for the
discrete version of ACO implementation. The main idea behind this is that, in real ants,
the convergence of ant trails toward the shortest route between the food source and the
nest is a result of the tendency of ants to follow a trail that contains a higher
concentration of pheromone deposit (Deneubourg, Aron et al. 1986).
Our implementation of ACO in this study is a discrete version, where the selection of
spectrum components is made using a random factor associated with a pheromone
density function. Pheromone is added to spectrum components by ants as a function of
residual error between state variable prediction and measurement. More details about
Ant Colony Optimization algorithms can be obtained in Mullen et al. (2009).
Spectral data Filter: PSCM analysis using the ACO algorithm presents the advantage
of providing qualitative information about spectral elements as a function of interest
variables.
Allegrini and Olivieri (2011) previously described this characteristic of ACO, analyzing
the advantages of ant colony optimization use combined with Monte Carlo repeated
calculations to discard irrelevant spectral regions when PLS regression analysis is
performed on NIR spectroscopy data of sugar cane, corn, octane and synthetic samples.
Filtered spectral data is tested using PCR and PLS chemometrical models, comparing
protein prediction results using full and filtered spectral data.
The filtering method viability is tested using the sample set previously presented, as
well as the triplicate measures without mean spectra, allowing larger PCR/PLS models,
given that the intention is to test the filtering tool, not the technique’s prediction
capability.
4.3 Results
column, subtracting the column mean from each value and dividing by the standard
deviation of the column.
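This column-wise normalization can be sketched with NumPy as follows. The use of the sample standard deviation (`ddof=1`) is an assumption, since the text does not state which estimator was used; `autoscale` is a hypothetical name.

```python
import numpy as np


def autoscale(X, ddof=1):
    """Center each column on its mean and scale it by its standard deviation."""
    mean = X.mean(axis=0)
    std = X.std(axis=0, ddof=ddof)  # ddof=1 (sample estimate) is an assumption
    return (X - mean) / std, mean, std


# Three samples (rows) measured at two spectral elements (columns)
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 60.0]])
Z, mean, std = autoscale(X)
print(Z)
```

The returned mean and standard deviation should be reused to scale new samples, so that calibration and test data share the same transformation.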
Figure 4.4 presents the normalized variance of principal components obtained in PCA
and the scores from the first and second principal components. The variance plot
(Figure 4.4(a)) shows that the summed variance of the first three PCs represents around
82% of all data variance. We used 102 NIR measurements (triplicate measurements for 34
flour samples) for this first analysis.
Figure 4.4: (a) PCA results with variance (only 20 more significant principal
components). (b) Scores plotting from NIR spectroscopy of triplicate flour samples.
The score plot (Figure 4.4(b)) shows that the data can be separated into two large groups
according to sample nature (rye flour and white flour), indicating the high correlation
and similarity between the samples. Figure 4.4(b) also shows the high reproducibility of
NIR measurements grouped by sample, since triplicate measurements presented equivalent
scores. Despite the general qualitative similarity of the flour samples, Sample 5
(hatched border) differed significantly from the two distinguished groups. We
did not, however, remove it from the data set.
All the models have linear structures, not including interaction or quadratic effects of
independent variables in the prediction of dependent variables. Many different models
are fitted for the three methodologies with different sizes, using 1 to 20 independent
variables. We chose this wide range to evaluate how the prediction capability of the
models behaves and to determine which methodology is better suited to chemometric analysis.
In PSCM the selection of model size is done by selecting the number of spectral
components that will be used as independent variables in the model. It involves a
significant broadening of search universe for the best group of elements to be used.
Given that PSCM has no pre-treatment of spectral data of any kind and input variables
are the pure spectral components, the issue becomes choosing a subset of spectral
channels that result in small prediction errors.
Figure 4.5 presents the results obtained in stages of calibration and prediction test of
chemometric models using PCR, PLS and PSCM. Analyzing the values of RMSE and R² for
PCR and PLS, we can conclude that the prediction of protein content in the prediction
stage is more accurate in PCR for models with only one input variable; as for PLS, the best
result is obtained for models with seven input variables. As for model size, PLS presents
better results for models with less than seven input variables; however, there is an
inversion for bigger models and PCR presents better results.
PSCM results presented better values of R² and RMSEP, when compared with
standard chemometric models. For almost all sizes of multi-linear models analyzed, PSCM
presented the best results as well as the best models (twelve and twenty input variables)
for prediction of protein content using NIR data.
Comparing the three techniques, we can conclude that models obtained by PSCM present
significant differences in prediction capability using a reduced number of
independent variables. Analyzing the gain in accuracy as the model size increases,
models obtained through PSCM with twelve and twenty independent variables (R²
for validation equal to 0.9) presented the best results for protein prediction. This shows
that models based exclusively on pure spectral components could predict protein content
in flour samples using less information than standard PCR or PLS models for the same
number of input variables.
Figure 4.5: Effect of independent variables in R² and RMSE values for PCR, PLS and
PSCM (calibration and validation) for samples of flour characterized by NIR spectroscopy.
The objective in variable selection is to reach the expected state variable prediction
without the use of intuition or complementary selection methods. As exemplified with
the experimental data set, modeling with the spectrum components results in better
models with fewer independent variables because those components are directly
correlated with the interest variable and, for the most part, provide useful information.
Figure 4.6: (a) Pheromone trail concentration (dimensionless) evolution in search for
NIR spectra region to protein prediction and (b) pheromone mean values during
optimization with indication of significant spectral regions for protein content in flour.
Figure 4.7 presents an NIR reflectance spectrum of typical bread wheat flour with the
indication of functional groups associated with some characteristic flour overtones (Sun
2008).
Figure 4.7: NIR reflectance spectrum of typical bread flour. Selected vibrational bands
assigned to: (1) O–H and N–H stretch, (2) C–H stretch, (3) O–H combinations and N–H
combinations, (4) amide and (5) C–H combinations.
Given that the interest variable analyzed was protein, the regions highlighted in the NIR
spectrum are expected to show some correlation with the vibrational bands of functional
groups present in protein molecules. The match between the experimental results and
theoretical knowledge is considered an indicator of this methodology’s soundness.
PCR and PLS methodologies follow the same procedure previously applied, but using
triplicate NIR measures as single measurements. In this case, the experimental data set is
composed of 102 samples (34 flour samples measured in triplicate).
Considering that PCR and PLS are data selectors concentrating significant information
from spectral data in just a few elements, the significant information is distributed in
singular spectral regions. The selection of these specific spectral components allows the
exclusion of possible noise sources and errors, in addition to increasing measurement
speed, robustness and data storage efficiency.
The similar results obtained from PCR and PLS prediction using the full and filtered NIR
data sets indicate that the methodology applied for the selection of spectral components
is efficient and mainly selected the spectral elements that carry the most significant
information correlated to the state variable.
Figure 4.8 presents the mean of pheromone trail obtained for protein characterization
with NIR using PSCM and ACO. The horizontal lines represent a threshold, indicating the
minimal pheromone concentration presented by the elements selected for 8.7% (line 1),
17.4% (line 2), and 43.5% (line 3) spectral components. The lines indicate the filtered data
set obtained.
Figure 4.8: Mean of pheromone trail concentration (dimensionless) after search for
groups composed of one to ten NIR spectral elements to predict protein in wet flour. The
horizontal lines indicate the minimum pheromone concentration considered for filtered
data: 8.7% (1), 17.4% (2) and 43.5% (3).
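The thresholding shown in Figure 4.8 can be sketched as keeping the fraction of spectral elements with the highest mean pheromone. With 1150 elements, the three rates correspond to roughly 100, 200 and 500 elements. The function name `filter_by_pheromone` and the synthetic trail below are hypothetical; the real trail comes from the ACO runs.

```python
def filter_by_pheromone(mean_pheromone, fraction):
    """Return the indices of the top `fraction` of elements and the implied threshold."""
    n_keep = max(1, round(fraction * len(mean_pheromone)))
    ranked = sorted(range(len(mean_pheromone)),
                    key=lambda i: mean_pheromone[i], reverse=True)
    kept = sorted(ranked[:n_keep])                    # back to spectral order
    threshold = min(mean_pheromone[i] for i in kept)  # height of the horizontal line
    return kept, threshold


# Synthetic mean trail with 1150 distinct values, for illustration only
mean_pheromone = [((i * 37) % 1150) / 1150 for i in range(1150)]
for fraction in (0.087, 0.174, 0.435):
    kept, threshold = filter_by_pheromone(mean_pheromone, fraction)
    print(fraction, len(kept), threshold)
```

The kept indices select the columns of the NIR matrix passed on to PCR or PLS as the filtered data set.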
The results presented in Figure 4.8 are directly correlated with the ones presented on
Figure 4.6(b) (Figure 4.6(b) presents one of the ten groups of data used to generate the
results presented in Figure 4.8). The similarities between those two figures highlight the
convergence and tendency of ACO to attribute more importance (pheromone
concentration) to specific spectral regions.
Comparing Figures 4.8 and 4.6(b), we can see that the use of mean pheromone trails
highlights the main spectral regions instead of only a few components, as occurs in
Figure 4.6(b).
Using the importance ordering based on Figure 4.8, PCR and PLS models were
generated and compared with full spectral models. Figures 4.9 and 4.10 show results
obtained with all spectral data ranges, where only prediction results are presented. Figure
4.9 presents the results for PCR modeling while Figure 4.10 is related to PLS modeling.
Figure 4.9: Coefficient of Determination (R²) and RMSEP for validation phase of PCR
modeling using full data set and filtered data set of NIR for protein prediction.
The results presented in Figures 4.9 and 4.10 show that PCR and PLS models
obtained using filtered NIR data are more efficient in condensing useful information
for protein prediction in flour samples. This is confirmed by the fact that the sum of
residual errors and the determination coefficient achieve better values than with full
spectral data. For instance, PCR modeling using filtered data works best for models
with fewer than fifteen input variables. For models with more than fifteen input
variables, the results presented by filtered and full spectral models are
equivalent.
Figure 4.10: Coefficient of Determination (R²) and RMSEP for validation phase of PLS
modeling using full data set and filtered data set of NIR for protein prediction.
The results presented in Figures 4.9 and 4.10 are similar, demonstrating the ability of
PSCM to concentrate spectral information, given that prediction quality either increased
or stayed the same for PLS and PCR models with the use of filtered data, especially
for models with a smaller number of inputs.
However, PLS filtered models with more than ten input variables presented an
increase in RMSEP, while standard models remained constant, although the values of R²
remained equal to those of standard models. This indicates that even with the increase in
RMSEP for PLS, the filtered methodology allowed more accurate models to be achieved
using a smaller number of input variables.
PCR and PLS models obtained using up to 17.4% of NIR data demonstrated better
prediction capability than other models with fewer input variables, confirming the
efficiency in selecting significant spectral variables and discarding noise or other
factors that interfere in protein prediction.
Figure 4.11 presents the differences in model prediction obtained through filtering
based on the ACO pheromone trail. The coefficient of determination and RMSEP
of the full spectral models are used as reference values, against which the models
built on filtered data sets are evaluated.
Figure 4.11: Percentage difference of RMSEP and R² between models obtained using
full spectral data (reference value) and compressed data in PCR (a) and PLS (b) modeling.
The results presented in Figure 4.11 show that the use of filtered data sets
significantly improves the prediction capability of PCR and PLS models, as measured by
R² or RMSEP. The gain in prediction precision is significant in models with up
to ten input variables; after that, the models begin to show an increase in RMSEP, more
pronounced in PLS models, but nevertheless producing acceptably low absolute
values.
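The quantities compared in Figure 4.11 can be sketched as follows. RMSEP and R² follow their usual definitions; the sign convention of the percentage difference (filtered minus full, relative to full) is an assumption, and the prediction values are illustrative rather than taken from the experiments.

```python
from math import sqrt


def rmsep(y_true, y_pred):
    """Root mean square error of prediction."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def percent_difference(filtered, full):
    """Relative change of a metric, with the full-spectrum model as reference."""
    return 100.0 * (filtered - full) / full


# Illustrative test-set values only
y_true = [10.0, 12.0, 14.0, 16.0]
y_full = [11.0, 11.0, 15.0, 15.0]      # full-spectrum model
y_filtered = [10.5, 11.5, 14.5, 15.5]  # model on filtered data

d_rmsep = percent_difference(rmsep(y_true, y_filtered), rmsep(y_true, y_full))
d_r2 = percent_difference(r_squared(y_true, y_filtered), r_squared(y_true, y_full))
print(d_rmsep, d_r2)
```

Under this convention, a negative RMSEP difference and a positive R² difference both indicate that filtering improved the prediction.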
The increase in prediction capability presented by PCR and PLS models provides
evidence of ACO filtering capability in the selection of significant spectral components
and rejection of spectral noise, usually responsible for the decrease in PLS and PCR
variable prediction.
In relation to filtering rate, models that used 17.4% and 8.7% of the full spectral data
presented better results, although equivalent for models with up to six input variables.
Models generated using 8.7% of the spectral data presented better results than the other
models. This is a good indication that spectral noise rejection is effective and that we
are only inserting significant information into PCR and PLS models.
4.4 Conclusions
The huge amount of spectral information provided by optical measurement
techniques and the high sensitivity and correlation between spectrum components
hinder our task of transforming information into process knowledge. Nonetheless, the
application of multivariable modeling tools such as PCR and PLS shows good results and
applicability in this field.
When comparing PCR and PLS as tools for translating NIR spectral measurements of
flour samples into protein content information, both techniques achieve equivalent
results, predicting the interest variable with the same degree of confidence. PLS,
however, achieved better results using fewer independent variables than PCR. Therefore
we conclude that PLS is better suited to filter NIR data for protein content predictions
in flour samples.
The spectral region map obtained using PSCM and ACO allowed the spectral data to be
filtered as a function of the variable of interest. We applied this data treatment before
PCR/PLS to discard non-significant spectral regions, improving the precision of the
chemometric models, since the models are then not affected by irrelevant information.
PCR and PLS are standard chemometric methodologies that present good results in
variable prediction. However, the advantages presented by PSCM in chemometric
modeling and spectral analysis show how viable it is for on-line sensor development, not
only to characterize individual samples but also for process variable measurements, as well as
its usefulness as an analytical source of knowledge on processes.
The filtering rates tested (reducing the total number of spectral
components by 80%, 70% and 50%) significantly improved the prediction of PCR and PLS
models. For some model sizes, the improvement in efficiency surpassed 60% in relation to
the same model structures using full spectral data. Those improvements in results
are due to the selection of the right spectral region correlated with the variable of interest, and the
Chapter 5 – Characterization of
Saccharomyces cerevisiae fermentation
using 2D Fluorescence Spectroscopy
Submitted to Applied Spectroscopy
5.1 Introduction
Biotechnological processes are taking on a prominent position in the production
matrix and have grown proportionately more than standard chemical processes
(Mussatto, Dragone et al. 2010). As in chemical processes, an increase in production
can be achieved by applying control strategies designed to improve
process operation. Success, however, is closely related to knowledge of the process states,
and the lack of concentration sensors makes this a serious problem for
bioprocesses (Ranzan, Trierweiler et al. 2011).
Historically, the most effective way to improve production in a bioprocess plant has
been associated with the evolution of strains used in fermentation processes (Aynsley,
Hofland et al. 1993). However, recent developments show that improvements in the
bioprocess can be obtained by using supervision and control tools, reducing production
costs while maintaining the quality of the desired products (Yamuna and Ramachandra
1999, Whitford and Julien 2007, Kabbaj, Nakkabi et al. 2010, Menezes 2011).
The critical problem associated with the conversion of spectral data into state
variables lies in the huge amount of information contained in the spectra. This
problem requires the application of virtual analyzers or state estimators to translate the
fluorescence information into process variable knowledge, that is, to determine the
relationship between fluorescence data and process variables (Boehl, Solle et al. 2003).
Among the chemometric modeling techniques applied in data processing, Partial Least
Squares (PLS), Principal Component Regression (PCR) and Neural Networks are standard
methodologies (Christensen, Norgaard et al. 1995, Wolf, Almeida et al. 2001, Hagerdon,
Legge et al. 2003, Solle, Geissler et al. 2003).
PCR is not applied directly to the original spectral data; instead, the regression is built on the PCs
(Principal Components) obtained by PCA (Principal Component Analysis). Since the PCs
are mutually orthogonal, the typical problem of collinearity and high correlation, which
arises in many regression techniques, can be avoided. In this methodology, the PCs are
combined to predict the output data matrix using linear multivariable regression
(Liu, Kuang et al. 2003).
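The two PCR steps described above (PCA followed by linear regression on the PCs) can be illustrated with a minimal NumPy sketch. The function names `pcr_fit` and `pcr_predict` and the synthetic data are assumptions made for illustration only; they do not reproduce the MATLAB routines used in this work.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Fit a principal component regression model (illustrative helper)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # PCA via SVD: the rows of Vt are orthonormal loading vectors.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:n_components].T              # loadings, shape (n_vars, n_components)
    T = Xc @ P                           # scores: mutually orthogonal columns
    # Regress the centered response on the scores by ordinary least squares.
    b, *_ = np.linalg.lstsq(T, y - y_mean, rcond=None)
    return x_mean, y_mean, P, b

def pcr_predict(model, X):
    x_mean, y_mean, P, b = model
    return (X - x_mean) @ P @ b + y_mean

# Synthetic, highly collinear "spectra": 30 channels driven by 2 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(40, 2))
X = latent @ rng.normal(size=(2, 30)) + 0.01 * rng.normal(size=(40, 30))
y = 2.0 * latent[:, 0] - latent[:, 1]

model = pcr_fit(X, y, n_components=2)
rmse = float(np.sqrt(np.mean((pcr_predict(model, X) - y) ** 2)))
print(f"calibration RMSE with 2 PCs: {rmse:.4f}")
```

Because the score columns are orthogonal, the regression step stays well conditioned even though the 30 original channels are almost perfectly collinear.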
PLS is a multivariate statistical technique with the goal of correlating two data sets
and predicting one set based on the other. It attempts to identify not only the factors
(latent variables, LVs) that capture the largest amount of variance in the data matrix, but also
those that allow a linear correlation to be obtained between the spectral data and the
process variables (Geladi and Kowalski 1986).
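For a single response variable, the extraction of latent variables described above can be sketched with the usual NIPALS-style PLS1 recursion. This is a minimal illustration under assumed names (`pls1_fit`, `pls1_predict`) and synthetic data, not the implementation used in this work.

```python
import numpy as np

def pls1_fit(X, y, n_lv):
    """PLS1 (single response) with sequential deflation of X and y."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xk.T @ yk                    # direction of maximum covariance with y
        w = w / np.linalg.norm(w)
        t = Xk @ w                       # scores of this latent variable
        tt = t @ t
        p = Xk.T @ t / tt                # X loadings
        c = (yk @ t) / tt                # regression of y on the scores
        Xk = Xk - np.outer(t, p)         # deflate X
        yk = yk - c * t                  # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)   # coefficients in the original X space
    return x_mean, y_mean, beta

def pls1_predict(model, X):
    x_mean, y_mean, beta = model
    return (X - x_mean) @ beta + y_mean

def rmse_of(model, X, y):
    return float(np.sqrt(np.mean((pls1_predict(model, X) - y) ** 2)))

# Synthetic collinear data: 30 channels driven by 2 hidden factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(40, 2))
X = latent @ rng.normal(size=(2, 30)) + 0.01 * rng.normal(size=(40, 30))
y = 2.0 * latent[:, 0] - latent[:, 1]

rmse_1lv = rmse_of(pls1_fit(X, y, 1), X, y)
rmse_2lv = rmse_of(pls1_fit(X, y, 2), X, y)
print(f"calibration RMSE: 1 LV = {rmse_1lv:.4f}, 2 LVs = {rmse_2lv:.4f}")
```

Unlike PCR, each weight vector here is chosen using the response, which is why PLS often needs fewer latent variables for the same calibration error.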
The fluorescence experimental data used in this work come from two cultivations of
Saccharomyces cerevisiae H620 on glucose, grown in a 1.5 L bioreactor at constant
temperature and pH (30 °C and 5.5, respectively) with Schatzmann medium
supplementation. During cultivation, we collected fluorescence spectra every 6 minutes,
using a BioView fluorometer (Delta Light & Optics, Denmark), as described by Stärk et al.
(2002). Each spectrum contained 150 fluorescence pairs with excitation/emission
wavelengths: 15 filters in the region of 270 to 550 nm for excitation and 15 filters in the
region of 310 to 590 nm for emission, both with a bandwidth of 20 nm, collected
equidistantly.
1 A detailed description of this item can be found in Chapter 3.
We collected 190 spectra from each cultivation. The data obtained by BioView
Spectrum Fluorometer was processed with MATLAB software (Ver. 5.3.0.10183 R11, The
Mathworks, Inc., Natick, USA).
When S. cerevisiae cells are exposed to glucose in a medium propitious for their
growth, they produce ethanol, carbon dioxide and biomass, by which one can observe a
diauxic growth pattern. This phenomenon is characterized by cellular growth in two
phases during the batch fermentations (Zang, Scharer et al. 1997).
(5.1)
(5.2)
(5.3)
The growth rate values (µG and µE) are considered a function of glucose concentration
(µG(G) and µE(G)). To this end, one could use expressions commonly found in the literature
to describe this dependence (Monod, Aiba, Levenspiel, amongst others) (Mulchandani
and Luong 1989, Habibi, Vahabzadeh et al. 2013); however, in this study, we propose a
new form of dependence between growth rates and glucose concentration, in which those
variables are correlated by a hyperbolic tangent factor, as described by eqs. (5.4) and
(5.5). Growth rates depend on glucose concentration and on four constant
parameters (a, b, µGm and µEm).
(5.4)
(5.5)
In a dynamic model, diauxic growth results in a switch between µG and µE. When
glucose is available, µG is greater than zero. On the other hand, when all the
glucose is consumed and the biomass changes its metabolism, µG is equal to zero
and µE (which is equal to zero while glucose is available) becomes greater than
zero.
To solve the dynamic system formed by eqs. (5.1) to (5.3), we applied the
Runge-Kutta method. In all simulations, the specific yield coefficients are the ones
reported by Solle et al. (2003) (YGX = 0.167 gcell/ggluc, YGE = 0.5 getha/ggluc, YEX = 0.333
gcell/getha). We estimated a, b, µGm and µEm using the SIMPLEX methodology and the
experimental data of both fermentations, resulting in µGm = 0.3792 h⁻¹, µEm = 0.0587 h⁻¹,
a = 1.3968 L·g⁻¹ and b = 2.2136 g·L⁻¹. Figure 5.1 presents the original and the respective
simulated off-line data for the fermentations.
Figure 5.1: Fermentation 1 (a) and Fermentation 2 (b). (o) Off-line data. (―) Simulated
data using dynamic model.
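As an illustration of the numerical integration step, the sketch below applies a classical fourth-order Runge-Kutta scheme to a diauxic batch model. The balance equations and the two hyperbolic tangent rate expressions in `rates` are assumed placeholder forms, chosen only to be consistent with the verbal description of the model; they are not the equations (5.1) to (5.5) of this chapter. The initial state, the step size and the guard that stops ethanol uptake when no ethanol is left are likewise hypothetical. The parameter values are those reported in the text.

```python
import math

# Parameter values reported in the text.
MU_GM, MU_EM = 0.3792, 0.0587          # maximum growth rates, 1/h
A, B = 1.3968, 2.2136                  # L/g and g/L
Y_GX, Y_GE, Y_EX = 0.167, 0.5, 0.333   # yield coefficients

def rates(state):
    """Right-hand side of an ASSUMED diauxic model (not eqs. 5.1 to 5.5)."""
    x, g, e = state                    # biomass, glucose, ethanol in g/L
    mu_g = MU_GM * math.tanh(A * g)    # assumed form: active while glucose remains
    # Assumed form: switched on as glucose vanishes, off when ethanol is gone.
    mu_e = MU_EM * (1.0 - math.tanh(g / B)) if e > 0.0 else 0.0
    dx = (mu_g + mu_e) * x
    dg = -mu_g * x / Y_GX
    de = Y_GE * mu_g * x / Y_GX - mu_e * x / Y_EX
    return (dx, dg, de)

def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step for a tuple-valued state."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
                 for s, a, b, c, d in zip(state, k1, k2, k3, k4))

def simulate(state0, t_end, dt=0.01):
    state = state0
    for _ in range(int(round(t_end / dt))):
        state = rk4_step(rates, state, dt)
    return state

# Hypothetical initial state: 0.5 g/L biomass, 10 g/L glucose, no ethanol.
x_end, g_end, e_end = simulate((0.5, 10.0, 0.0), t_end=10.0)
print(f"after 10 h: X = {x_end:.2f}, G = {g_end:.4f}, E = {e_end:.2f} g/L")
```

In this sketch, glucose is exhausted first while ethanol accumulates, and growth then continues at the lower rate on ethanol, which is the two-phase pattern that the parameter estimation has to reproduce.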
The version of ACO implemented in this study is based on pheromone trail evolution
during spectral group scanning. Initially, all spectral components are marked with the
same pheromone concentration (Ranzan, Strohm et al. 2014). The ACO routine randomly
selects spectral components to compose a test group, which is evaluated using the
objective function for process variable prediction. Based on the objective function error, the
pheromone concentration associated with each spectral component in the evaluated
group is updated. For each subsequent group, the selection combines the same random
trigger with the cumulative pheromone density over the full range of spectral elements.
The combination of random selection and pheromone density brings out the
significant elements inside the spectral range and, after a few iterative runs, a
pheromone profile is established in which a dense pheromone trail highlights the
spectral components that are significant for process variable prediction. See Ranzan,
Strohm et al. (2014) for more details about this algorithm.
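The pheromone mechanism described above can be sketched as follows. This is a simplified illustration and not the algorithm of Ranzan, Strohm et al. (2014): the reward rule, the evaporation rate `rho`, the floor `tau_min`, the group size and the synthetic data are all assumptions, and the objective function is taken to be the calibration RMSE of an ordinary least squares model on the selected components.

```python
import numpy as np

def calibration_rmse(X, y, cols):
    """RMSE of a linear model (with intercept) using only the selected columns."""
    A = np.column_stack([np.ones(len(y)), X[:, cols]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.sqrt(np.mean((A @ beta - y) ** 2)))

def aco_pheromone_trail(X, y, group_size=2, n_iter=400, rho=0.05,
                        tau_min=0.1, seed=0):
    """Evolve a pheromone vector over the spectral components (assumed rules)."""
    rng = np.random.default_rng(seed)
    n = X.shape[1]
    tau = np.ones(n)                       # same initial pheromone everywhere
    baseline = float(np.std(y))            # RMSE of always predicting the mean
    for _ in range(n_iter):
        # Random selection weighted by the current pheromone density.
        cols = rng.choice(n, size=group_size, replace=False, p=tau / tau.sum())
        err = calibration_rmse(X, y, cols)
        reward = max(0.0, 1.0 - err / baseline)   # assumed reward: relative gain
        tau *= 1.0 - rho                   # evaporation on all components
        tau[cols] += reward                # deposit on the evaluated group
        np.maximum(tau, tau_min, out=tau)  # floor keeps exploration alive
    return tau / tau.max()                 # normalized trail, maximum equal to 1

# Synthetic data: 20 components, only components 3 and 12 carry information.
rng = np.random.default_rng(42)
X = rng.normal(size=(60, 20))
y = X[:, 3] + X[:, 12] + 0.05 * rng.normal(size=60)

trail = aco_pheromone_trail(X, y)
print(np.round(trail, 2))
```

With these assumed rules, pheromone tends to concentrate on the informative components, because groups that contain them receive larger deposits than groups made of noise alone.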
Given that the available experimental data comprise two batch
fermentations, we segmented the data so that the first fermentation batch is used
only in the model calibration phase and the second batch is used only in the
model test phase. In this way, we guarantee that calibration and testing of the
chemometric models are performed with distinct experimental data, so that the results are
directly associated with the robustness of the predictive models in real process applications.
Given that the efficiency of the PCR and PLS methodologies is highly associated with
spectral data quality, it is useful to normalize the spectral signals prior to data analysis.
This process helps eliminate arbitrary offsets and multiplication factors. We achieved
this by applying Standard Normal Variate (SNV) scaling to the spectral data. This method
essentially autoscales the samples, yielding zero mean and unit standard deviation for
each spectrum (Gemperline 2006, Wehrens 2011).
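A minimal sketch of SNV scaling is given below, assuming spectra stored row-wise (one spectrum per row) and the sample standard deviation (`ddof=1`); the function name `snv` and the example values are illustrative.

```python
import numpy as np

def snv(spectra):
    """Standard Normal Variate: center and scale each spectrum (row) separately."""
    spectra = np.asarray(spectra, dtype=float)
    mean = spectra.mean(axis=1, keepdims=True)
    std = spectra.std(axis=1, ddof=1, keepdims=True)   # assumed: sample std
    return (spectra - mean) / std

raw = np.array([[1.0, 2.0, 4.0, 3.0],
                [10.0, 30.0, 20.0, 50.0]])
scaled = snv(raw)

# An arbitrary offset and multiplication factor leave the SNV result unchanged.
same = np.allclose(snv(2.5 * raw + 7.0), scaled)
print(scaled.round(3), same)
```

The last check illustrates why SNV removes arbitrary offsets and multiplicative factors: any spectrum of the form a·s + b with a > 0 maps to the same normalized signal as s.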
3) Validation Test (using simulated and spectroscopic data from fermentation 2).
Since the PCR, PLS and PSCM model structures are polynomials that are linear in their
parameters, fitting is an Ordinary Least Squares (OLS) problem. This problem
has an algebraic solution, and the model parameters are calculated with equation 5.6, where
X is the independent variables matrix, Y is the dependent variable vector and β is the
vector of model parameters.
\[ \beta = (X^{T}X)^{-1}X^{T}Y \tag{5.6} \]
(5.7)
(5.8)
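The algebraic OLS solution and the two statistical indices used to compare models (R² and RMSEP) can be sketched as below. The formulas for R² and RMSEP are the commonly used definitions and are assumed here; the function names and the small data set are illustrative only.

```python
import numpy as np

def ols_fit(X, y):
    """Solve the normal equations (X^T X) beta = X^T y for beta."""
    return np.linalg.solve(X.T @ X, X.T @ y)

def rmsep(y_meas, y_pred):
    """Root mean square error of prediction (assumed standard definition)."""
    return float(np.sqrt(np.mean((y_pred - y_meas) ** 2)))

def r_squared(y_meas, y_pred):
    """Coefficient of determination (assumed standard definition)."""
    ss_res = np.sum((y_meas - y_pred) ** 2)
    ss_tot = np.sum((y_meas - y_meas.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Illustrative data: intercept column plus two input variables.
X = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [1.0, 2.0, 2.0],
              [1.0, 3.0, 1.0],
              [1.0, 4.0, 3.0]])
true_beta = np.array([1.0, 2.0, -0.5])
y = X @ true_beta                      # noise-free response

beta = ols_fit(X, y)
y_hat = X @ beta
print(beta, rmsep(y, y_hat), r_squared(y, y_hat))
```

Solving the normal equations with `np.linalg.solve` avoids forming the explicit inverse; for strongly collinear inputs a least squares solver such as `np.linalg.lstsq` would be the more robust choice.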
Figure 5.3 shows the statistical indices of the chemometric models at step 3. This figure
presents the results of models calibrated using fermentation 1 data and tested on
fermentation 2 data, grouped by index (R² and RMSEP) and state variable (Glucose,
Ethanol and Biomass concentration). The factor selected for comparing models
was the number of input variables. Models obtained with PCR can be compared with
models obtained by PLS or PSCM that use the same number of independent variables
(PCR – Principal components, PLS – Load Vectors, PSCM – Pairs of Fluorescence) and,
consequently, the same number of estimated model parameters for the prediction of the
dependent variable, given that the model structures applied in all methodologies are
similar (multi-linear models).
For the process data analyzed, all tested chemometric methodologies achieved
accurate results in the prediction of the set of simulated variables of interest for fermentation
2 using spectral fluorescence data from this same fermentation. The fact that all
chemometric methodologies came satisfactorily close to the process data indicates
that 2D fluorescence spectroscopy is a reliable way to monitor state variables in glucose
fermentation.
Figure 5.3: RMSEP and R² versus number of Input variables in PCR (principal
components), PLS (load vectors) and PSCM (pairs of fluorescence) models for Ethanol,
Glucose and Biomass prediction of fermentation 2, using data from fermentation 1 for
model calibration.
Biomass prediction was more accurate using standard methodologies. However, all
methodologies were able to accurately predict biomass content, achieving R² higher than
0.97, and RMSEP less than 1.5, for all models with more than one input variable.
optimization to overcome this issue. Another possible explanation for that behavior is
overfitting of the model parameters to the calibration data group. Despite the oscillatory
behavior in validation, results are satisfactory for the prediction of all variables of interest,
confirming PSCM modeling with fluorescence data as an option for fermentation characterization.
Results suggest that any of the tested methodologies can be applied to predict
Glucose or Biomass concentration; however, considering the size of the models and the quality of
prediction, PSCM gave the superior inference models for Glucose and
Ethanol.
This superior result of PSCM was in line with our expectation, given that only
information directly associated with the studied variables is selected, whereas, in PCR
and PLS, the principal components and load vectors are loaded with all significant
information in the fluorescence spectra. This makes PSCM models specific to the variables of interest.
Regarding ethanol prediction, the PSCM models presented the best results in terms of R², with the
exception of the models with 14 and 16 parameters. However, the
RMSEP values for PSCM were worse than those for PCR and PLS, indicating that the
standard error of the PSCM models was larger than that of the other
methodologies. Nonetheless, this difference is not enough to impact R², and PSCM is still
considered the best methodology for ethanol characterization.
Since PSCM presented results as good as or, in some cases, better than standard
chemometric methods, sensors based on a small number of fluorescence pairs appear to be
viable and more reliable than sensors based on the full-range spectrum. This is a significant
result for biochemical and chemical characterization, allowing the development of
customized sensors that can help to improve process monitoring and control.
The pheromone trail is a vector associated with all spectrum components. Initially, it
has the same value for all components. Then, the values change during the optimization
routine in a discrete manner (element by element), in such a way that elements with
greater accuracy in interest variable prediction receive a greater pheromone amount. In
this way, pheromone trail acts as a qualitative filter based on trial and error during
optimization runs. At the end of the routine, global or local minima are achieved and
pheromone trail presents its signature associated with interest variable, emphasizing
significant data components.
chemometric methodology is not efficiently treating spectral noise created by medium
components unrelated to the variable of interest, problems may arise. PCR and PLS use
the variance of principal components or load vectors, respectively, as indicators to
translate spectral data and predict state variables. They are highly dependent on spectral
data pre-treatment to block and filter spectral noise, but are unable to distinguish
spectral changes caused by state variables from those caused by medium noise.
The pheromone trail can be explored as a filter, condensing the fluorescent data into
a subset that contains only the significant information for the interest process variable.
This filtering process reduces the amount of non-significant information within the model,
reducing the influence of noise in the models.
Figure 5.4 shows the pheromone trail information, normalized between zero and one,
for an excitation range from 270 nm to 430 nm and an emission range from 310 nm to 430 nm. The
highlighted regions in each plot correspond to the fluorescence pairs most significantly
associated with the respective state variable.
In this study, we propose to use the ACO pheromone trail as a tool to search for and
select the spectral regions that are directly correlated with each state variable,
enabling the characterization of process variables based on spectral changes only in
specific regions. This feature acts like a search engine for state variable signatures
inside the fluorescence spectra.
Figure 5.4 indicates that all variables of interest in glucose
fermentation can be characterized through the analysis of a relatively small region of the fluorescence
spectrum. This is a valuable property from the sensor development point of view, because
the spectral range required of light sources and detectors is reduced. On the other hand, it can
lead to model development issues, because the models could lose precision when the same
fluorescence pairs are used simultaneously to predict distinct state variables, since
variations in one variable can mask fluorescence changes caused by the others.
The problem of variables’ cross interference on model prediction is not specific to this
fermentative process, since PCR, PLS and PSCM models presented acceptable results in
the prediction of all tested fermentative variables.
As shown in Figure 5.4, glucose presents a larger significant region (usually associated
with protein fluorescence measurements) than ethanol (region associated with NADH
fluorescence) and biomass (regions associated with tyrosine fluorescence). Furthermore,
this region comprises parts of the important spectral regions correlated with ethanol
and biomass. Nonetheless, both variables also present other specific regions that are
important for their prediction. Therefore, by associating the common region with
the specific region for each variable, one could obtain distinct models for each variable.
Figure 5.4: Significant fluorescence spectral regions associated with Glucose, Ethanol
and Biomass concentration, obtained by ACO pheromone trail evolution.
The efficiency of the ACO classification of spectral regions is tested using the region
between 270 nm and 390 nm of excitation and 310 nm and 430 nm of emission (the region
containing most of the significant information for all state variables) for calibrating and
testing PCR and PLS chemometric models. Since PCR and PLS condense all the significant
information into principal components and load vectors, the comparison between models
based on full-range spectral data and reduced spectral data can provide evidence that
the selected region contains the main information for state variable characterization
inside the full fluorescence spectra.
The results presented in Figure 5.5 show that the primary information associated with
each state variable is effectively found inside the region selected by ACO. In the test
phase of PCR and PLS modeling, the R² of models built on reduced spectral data is greater
than, or at least similar to, that of models built on full spectral data. This
result confirms the applicability of ACO for filtering and selecting significant spectral
regions prior to applying PCR and PLS.
Figure 5.5: PCR and PLS modeling using Full Fluorescent Spectral data and Reduced
Fluorescent Spectral data based on ACO analysis.
5.4 Conclusions
In fermentation processes with Saccharomyces cerevisiae operating in batch
configuration, simple dynamic models are usually applied for simulation, although these
models cannot always represent modifications in process dynamics caused by metabolism
changes.
The dynamic simulation of off-line data performed in this study enabled the efficient
modeling of diauxic growth. The modified dynamic model included terms in the growth
rate equations dependent on glucose concentration, taking into account changes in
metabolism when glucose is completely consumed.
The simulation of off-line variables is an important tool that allows the evaluation of
fluorescence spectral data for the development of reliable on-line chemometric models
for state variable predictions, providing off-line measurements (by simulation) at the
same sampling rate as the spectral data.
The results of PCR and PLS chemometric models for predicting Glucose, Ethanol and
Biomass content using 2D Fluorescence Spectroscopy confirmed that this analytical
method can be applied in the development of inference sensors for on-line fermentation
characterization. The results of models during the test phase presented high values of R²
and low values of RMSEP.
Analysis of spectral data using PSCM has shown better results for chemometric
modeling when compared with PCR and PLS models, showing that the use of pure
spectral signals for the prediction of state variables is more accurate. The better results
presented by PSCM are due to the fact that, when groups of spectral components are
selected, cross interference caused by noisy regions or secondary process variables is
excluded.
Unlike PCR and PLS, PSCM models do not require pre-treatment of spectral data (e.g.
normalization or standardization). This feature of PSCM models, combined with the fact
that a small number of spectral components presented good results in variable
prediction, shows that the development of small sensors based on just a few fluorescence
wavelength measurements is viable and can be directly applied to a real process,
possibly in parallel with standard methodologies already applied within the industry.
From a theoretical and practical point of view, the results provided by the PSCM methodology
associated with ACO (segmentation of spectral data into significant regions as a function of
state variables) are relevant not only for the development of specialized sensors, but also for
further research in chemical and biochemical processes.
Abstract: The aim of process optimization is to obtain higher productivity and profit
in chemical or biochemical processes. To that end, one must apply control techniques, whose
success is closely tied to our ability to characterize a process. Optical sensors associated with
chemometric modeling are considered a natural choice for non-intrusive and highly
sensitive measurements. This study focuses on wheat flour characterization (a routine and
mandatory task, widespread in the food industry) using Near-Infrared (NIR) spectroscopy, comparing
two approaches for spectral region selection: modified CSMWPLS and PSCM/ACO.
Spectroscopic data are analyzed using a combination of CSMWPLS and a variable selection
algorithm based on Ant Colony Optimization. Protein prediction results are compared with
standard PLS, CSMWPLS and PSCM/ACO models. Prediction capability improved by 46%
using modified CSMWPLS and PSCM/ACO modeling, confirming the efficiency of the
proposed characterization methods and chemometric modeling strategy.
For submission
6.1 Introduction
Industrial needs for online monitoring and control of key process variables encourage
the research and development of new measurement methods (Whitford and Julien
2007). The limitations of sensors, such as their high cost or unreliable
measurements, have led to the development of data-driven soft sensors,
such as those based on canonical variate analysis (CVA), partial least squares (PLS), artificial
neural networks (ANN), neuro-fuzzy systems and Gaussian process regression (GPR) (Ni,
Brown et al.).
Nowadays, many soft sensors are widely accepted and applied as viable and useful
methods for online qualitative prediction of process variables. Despite the wide range of
available options, PLS regression is the most widely adopted because of its advantages for noisy
and correlated data, which are common in industrial processes (Du, Liang et al. 2004, Brown
2013, Cariou, Verdun et al. 2014, Chi, Fei et al. 2014).
Despite the well-known capacity of PLS to deal with the full-spectrum calibration problem,
the selection or filtering of spectral regions is still a very important issue, since its impact
on model prediction capability is directly associated with the sensitivity of spectral data
to changes in the process medium and their influence on specific spectral regions (Xu and
Schechter 1996, Sato, Kiguchi et al. 2004, Sratthaphut and Ruangwises 2012).
Jiang et al. (2002) presented an extended and detailed discussion of the methods for
spectral interval selection, proposing a method called moving window partial least
squares regression (MWPLSR). That method searches for informative spectral regions for
multi-component spectral analysis using a series of PLS models for spectral mapping,
which scan the whole spectral range with a moving window of fixed size. A new model is
obtained for each window displacement and tested using a cross-validation strategy,
optimizing the spectral region for PLS inference of the variable of interest.
Based on the work of Jiang et al. (2002), Du et al. (Du, Liang et al. 2004) proposed an evolution
of MWPLSR, introducing two new methods for spectral selection: the changeable size
moving window partial least squares (CSMWPLS) and the searching combination moving
window partial least squares (SCMWPLS).
constituents from selected regions to obtain the best group of spectral elements for PLS
model prediction of the state variable of interest. The main idea in CSMWPLS and SCMWPLS is
to perform an exhaustive combinatorial search within the regions pre-selected by MWPLSR. Since
the whole spectrum is significantly reduced, these strategies allow the variable modeling
problem to be solved in an easy and intuitive way.
Despite their advantages, such as yielding better PLS models than
whole-spectrum models, those strategies are restricted to combining regions that are well
correlated with the variable of interest. As a result, possible interaction effects
between well correlated regions and less significant regions can be masked and
neglected. One possible solution is to associate the moving window PLS strategy
with a global optimization technique, such as a Genetic Algorithm (GA), programmed for
combinatorial analysis and search of groups of spectral elements. In this way, the effects of
interacting regions can be evaluated, ensuring that regions that are individually less
significant, and possibly neglected, have their combined information considered.
This work presents the merger of two chemometric strategies, the changeable size
moving window PLS (and its subsequent modifications) and pure spectral
chemometric modeling (PSCM), and compares both approaches for spectral data
pre-selection and variable inference. Since both approaches allow pre-selection of spectral
data and variable inference using multiple-input models, all arrangements of pre-selection
and inference methods using those strategies are tested, searching for the best combination
for protein quantification in flour samples using NIR measurements.
These features allow the combination of both strategies and create an algorithm for
chemometric modeling of spectral data for process variable prediction, where spectral
data are pre-selected using CSMWPLS and combined using PSCM.
6.2 Methodology
column refers to a specific wavelength (spectral component), for all samples.
NIR data were collected in triplicate, totaling 102 spectra. Segmentation
was made randomly, taking two thirds of the samples for the calibration phase, with the
remaining third used for the prediction and test phase.
NIR spectral data were collected with a Multi-Purpose NIR Analyzer (Bruker Optik GmbH -
Ettlingen, Germany), with a wavelength range from 800 nm to 2800 nm and a
non-continuous increment, leading to spectral information composed of 1050 independent
wavelengths.
Real spectral data are full of noise and other nonidealities that mask the information
in the data. Before any chemometric analysis, the spectral data must be normalized
so that the real data can be analyzed correctly. NIR data were normalized
using Standard Normal Variate (SNV), which scales the samples instead of the spectral
variables (Beebe, Pell et al. 1998).
Figure 6.1 presents NIR spectroscopy measurements from the 34 flour samples, where
each spectral line corresponds to the mean of the triplicate NIR measurements for each
sample. Figure 6.1(a) shows the raw NIR data, while Figure 6.1(b) shows the
SNV normalized NIR data.
Figure 6.1: NIR measurements average from 34 flour samples. (a) NIR raw data and (b)
NIR SNV normalized data.
Figure 6.2: 34 flour samples localizations in PC’s plans, using PCA results for qualitative
data set evaluation (T – Wheat Flour; C – Rye Flour).
The proposed models were compared according to the statistical parameter Root Mean
Square Error (RMSE), for accuracy ranking. This parameter is named differently when
calculated in the calibration or prediction phase: RMSEC for calibration and RMSEP for
prediction. In equation 6.1, the subscript p refers to the vector of predicted variable
values, resulting from model evaluation, and m to the measured variable values; N is the
number of measurements and y is the vector of the variable of interest.
\[ \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_{p,i}-y_{m,i}\right)^{2}} \tag{6.1} \]
Initially presented by Jiang et al. (2002), MWPLSR searches for the significant spectral
regions by scanning the spectral data with a partial least squares regression procedure over
a continuous sweep of spectral intervals. In MWPLSR, a sequence of PLS models
is built for a window moving through the whole spectrum. A qualitative mapping of the spectral
data is obtained in terms of the decrease in prediction error, which indicates an improvement in
PLS model predictions (Du, Liang et al. 2004).
The core of MWPLSR is a moving window. This window consists of a user-defined
number of spectral elements, called the window size (h). It starts at the
ith spectral element and ends at the (i + h − 1)th element, comprising all the elements
inside this interval. For each window size, there are n − h + 1 windows over the whole
spectrum, where n is the number of spectral elements available in the full spectral data.
Each window is considered a subset of the original data, and PLSR models are generated for
each window. The number of load vectors (LV) used for modeling is determined by the
user, although it should not exceed the smallest
dimension of the subset data matrix (number of calibration samples x h).
The prediction is evaluated by the sum of squared residuals (SSR) obtained at
each window position. The SSR is plotted as a function of window position, indexed by the
first element of the subset, and its behavior along the spectral region
is evaluated relative to the range of SSR values. Regions that depend significantly on
the variable of interest show the smallest SSR values. Thus, valleys in the curve of SSR
as a function of window position indicate regions of significant information content (Du,
Liang et al. 2004).
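The moving window scan can be sketched as follows. This is a minimal illustration, not the implementation of Jiang et al. (2002) or Du, Liang et al. (2004): the compact PLS1 routine, the use of calibration residuals for the SSR, and the synthetic spectra (60 channels, with the signal placed in channels 20 to 29) are all assumptions.

```python
import numpy as np

def pls1_coefficients(X, y, n_lv):
    """PLS1 regression coefficients for mean-centered X and y."""
    Xk, yk = X.copy(), y.copy()
    W, P, q = [], [], []
    for _ in range(n_lv):
        w = Xk.T @ yk
        w = w / np.linalg.norm(w)
        t = Xk @ w
        tt = t @ t
        p = Xk.T @ t / tt
        c = (yk @ t) / tt
        Xk = Xk - np.outer(t, p)
        yk = yk - c * t
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

def mwplsr_scan(X, y, h, n_lv):
    """SSR of a PLS model for every window of h adjacent spectral elements."""
    n = X.shape[1]
    yc = y - y.mean()
    ssr = np.empty(n - h + 1)              # one value per window position
    for i in range(n - h + 1):             # window covers elements i .. i+h-1
        Xw = X[:, i:i + h]
        Xw = Xw - Xw.mean(axis=0)
        beta = pls1_coefficients(Xw, yc, n_lv)
        ssr[i] = np.sum((yc - Xw @ beta) ** 2)
    return ssr

# Synthetic spectra: only channels 20..29 respond to the variable of interest.
rng = np.random.default_rng(0)
n_samples, n_channels, h = 40, 60, 10
conc = rng.normal(size=n_samples)
spectra = 0.3 * rng.normal(size=(n_samples, n_channels))
spectra[:, 20:30] += conc[:, None]

ssr = mwplsr_scan(spectra, conc, h=h, n_lv=2)
best_start = int(np.argmin(ssr))
print(f"{len(ssr)} windows scanned, lowest SSR for the window starting at {best_start}")
```

Plotting `ssr` against the window start index gives the kind of curve described above, where the valley marks the informative region.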
The basic idea of CSMWPLS is to move many windows inside a defined spectral
region, varying the window size from one up to the size of the defined spectral region.
During this process, all the sub-windows with the same window size (h) are obtained. For
every window, PLS models are obtained using the number of LVs previously determined
as the most suitable, and RMSEC (root mean square error of calibration) values are used
for spectral region refining. Figure 6.3 presents a graphical representation of the MWPLSR
and CSMWPLS methodologies, showing the sliding window inside the spectral dimension.
For each position of the window, a new PLSR model is generated using the calibration data
set and tested with the prediction data set. The RMSEP obtained for each window is
associated with the first element covered by the window.
Since CSMWPLS is an extension of MWPLSR, their results are compared for qualitative and
quantitative analysis. The main idea of spectral region selection using CSMWPLS is to search
for the combination of window size and position inside the spectral range that yields the smallest
RMSEP.
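To give a sense of the size of this search, the count below follows from the window definition used for MWPLSR, assuming that every window size and every position inside a region of m spectral elements is evaluated once. The symbol m and the numerical example are illustrative and are not taken from the data set.

```latex
% Number of sub-windows of size h inside a region of m spectral elements
N_h = m - h + 1
% Total over all window sizes h = 1, ..., m
N_{\mathrm{total}} = \sum_{h=1}^{m} (m - h + 1) = \frac{m(m+1)}{2}
% Example (illustrative): m = 100 gives N_total = 5050 PLS models per number of LVs
```

The quadratic growth with m is one reason why the changeable size search is applied to regions already narrowed down by MWPLSR rather than to the full spectrum.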
This chemometric analysis was presented by our group in a previous work (Ranzan,
Strohm et al. 2014), which describes in detail the two main pillars of the method: the
selection of pure spectral elements and model adjustment for state variable prediction.
The steps for chemometric modeling using the PSCM methodology can be divided into
three phases: selection of the group of spectral elements, model calibration, and model
validation. The selection of the spectral group aims to choose spectrum components that,
when combined using MISO (Multiple Input Single Output) models, allow direct
correlation with state variables (Skoog, Holler et al. 2007). Those models are
linear in their parameters, which allows the parameters to be
estimated by ordinary least squares with the analytical solution (Joe Qin 1998). Since
PSCM is a supervised learning method, the quality, representativeness and number of
measured points have a strong impact on model quality. Thus, we seek the best,
largest and most representative data set to build the models.
ACO is based on the behavior of real ants, more specifically on the indirect communication
between them within the colony through chemical pheromone secretion (Dorigo and Blum
2005, Dorigo, Birattari et al. 2006, Mullen, Monekosso et al. 2009). The main idea behind
this algorithm is that, in real ants, the convergence of ant trails toward the shortest route
between the food source and the nest results from the tendency of ants to follow the trail
that contains a higher concentration of deposited pheromone (Deneubourg, Aron et al.
1986). Details about Ant Colony Optimization algorithms and their implementation can be
found in Mullen et al. (2009) and Ranzan et al. (2014).
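The pheromone mechanism can be illustrated with the generic evaporation and deposit rule
found in the general ACO literature. The sketch below is not the specific update used in the
cited works: the function name, the `evaporation` rate and the `quality` value (for example,
some decreasing function of a model's RMSEP) are assumptions introduced for illustration.

```python
from typing import List, Sequence


def update_pheromone(
    pheromone: Sequence[float],
    selected: Sequence[int],
    quality: float,
    evaporation: float = 0.1,
) -> List[float]:
    """Evaporate all trails, then reinforce the elements used by one ant."""
    # Evaporation: every spectral element loses a fixed fraction of its trail.
    updated = [(1.0 - evaporation) * tau for tau in pheromone]
    # Deposit: elements chosen by the ant receive an amount tied to model quality.
    for idx in selected:
        updated[idx] += quality
    return updated


trail = update_pheromone([1.0, 1.0, 1.0, 1.0], selected=[1, 3], quality=0.5)
print(trail)
```

Repeating this update over many ants and iterations concentrates pheromone on the spectral
elements that repeatedly appear in good models.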
This work evaluates all possible combinations of these two chemometric
methodologies, applied to the conversion of NIR spectral data into protein content in flour
samples. CSMWPLS and PSCM/ACO were alternated in the pre-selection of spectral regions
and in variable modeling. Results are compared with standard PLS, CSMWPLS and
PSCM/ACO strategies applied to full spectral data. In this way, the best combination
of the two methods is obtained for flour sample characterization, together with an
indication of the best features associated with each method.
The results of the standard PLSR analysis indicate that the best number of LVs to use in a
PLS model, for this specific data set, is around 11 to 13, since these models achieved the
minimum RMSEP, with explained variance above 95%.
After the initial PLSR modeling, CSMWPLS was applied for spectral data classification. In
the first run of CSMWPLS, the window size varied from one to four hundred, with
simultaneous variation of the number of LVs used in the models from two to twenty. In this
way, we searched for the optimal combination of a single window size (single spectral
region) and number of PLS components (LVs). The best RMSEP results as a function of the
number of LVs are presented in Table 6.1, indicating the optimized spectral region used in
each prediction model. The spectral regions of Table 6.1 are plotted in Figure 6.5, allowing
visualization of the spectral regions selected by CSMWPLS as a function of PLS model size.
Figure 6.4: Results of RMSEP and Explained Variance for PLSR modeling applied to full
spectral data of wheat flour and respective protein content.
Table 6.1: CSMWPLS results of RMSEP for the best obtained PLS models
Figure 6.5: Optimized regions of NIR spectra for prediction of protein in function of
LV’s number on PLS models, applying CSMWPLS.
The results reported in Table 6.1 and Figure 6.5 show an improvement in
prediction using PLS models. The best CSMWPLS result using a single spectral region
was obtained with a model composed of eight LVs and the NIR spectral region
between 1109 nm and 1403 nm. PLS models with nine LVs, with a spectral
region similar to the previous one, and ten LVs, with a spectral region around 40% larger
than the previously indicated regions, have similar RMSEP values. The best results
obtained with CSMWPLS gave better protein prediction than standard PLS models.
The smallest RMSEP achieved using the whole spectral range was
0.41 (see Figure 6.4, for models with 10, 11 and 12 LVs), while CSMWPLS obtained a PLS
model able to achieve an RMSEP of 0.388 using fewer LVs and only 21% of the total
number of spectral elements available.
Results presented by Du et al. (2004) and Jiang et al. (2002)
indicated that the classification of spectral regions depends on the PLS model size,
whose variation affects the resolution with which regions are differentiated, but does not
change the regions themselves. In other words, for a constant window size, varying the PLS
model size accentuates the RMSEP valleys and consequently improves the segmentation
of significant spectral regions.
Since the best PLS model obtained in the previous CSMWPLS run had a
PLS model size of 8 LVs, the subsequent spectral filtering was evaluated with
this model size, kept constant.
The selection of the spectral region using the usual CSMWPLS could not be made by
comparing the RMSEP results from different window sizes. As shown in Figure 6.6,
increasing the window size while keeping the PLS model size constant displaces the RMSEP
peaks and valleys toward shorter wavelengths, which is consistent with each RMSEP value
being assigned to the first element of its window. This shift of the highlighted spectral
region becomes a problem when results from different window sizes are compared.
Figure 6.6: RMSEP values for protein prediction obtained using CSMWPLS for NIR
spectral region selection, for seven distinct windows size: 10, 50, 100, 150, 200, 250 and
300 spectral elements window.
6.3.2 modCSMWPLS
To solve this region segmentation problem, we propose a modification of the
original CSMWPLS (modCSMWPLS). The method that CSMWPLS uses for spectral region
segmentation analyzes the RMSEP individually for each window size. It does not
evaluate the effect of each spectral element on its own, but only as part of a group. The
modification proposed in this work is intended to evaluate the contribution of each
spectral component to the efficiency of process variable prediction.
Inspired by the pheromone trail evolution during the ACO routine (Ranzan, Strohm et al.
2014), the CSMWPLS algorithm was modified by inserting a vector for RMSEP storage,
indexed by the spectral elements used in the PLS models for state variable prediction. This
vector is intended to evaluate, in a generalized way, how the inclusion of each spectral
element affects the prediction capability of the models, without depending on the window
size for region highlighting.
The CSMWPLS routine can be partitioned into successive MWPLSR runs. For each window
movement during a single MWPLSR search, a PLS model is obtained and tested for
variable prediction, using the set of spectral elements contained in the window. The
vector added to the CSMWPLS routine stores, at the positions of the spectral elements
under analysis, the RMSEP of the tested PLS model, adding it to the
previously stored values. This procedure is repeated for all window sizes
selected by the user and, between window transitions, each element of the storage
vector is divided by the number of times its corresponding spectral element was
used in a model group. The vector of summed RMSEP results is initialized with zeros at
the beginning of the modCSMWPLS routine.
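The storage-vector bookkeeping described above can be sketched as follows. This is one
possible reading of the description, not the original implementation: it assumes that, for
each window size, the summed RMSEP of every element is divided by the number of windows of
that size containing the element, and that the resulting per-size averages are added across
window sizes. `score_window` is a hypothetical callable standing in for fitting and testing a
PLS model on the elements inside the window.

```python
from typing import Callable, Iterable, List


def mod_csmwpls_storage(
    n_elements: int,
    window_sizes: Iterable[int],
    score_window: Callable[[int, int], float],
) -> List[float]:
    """Accumulate per-element RMSEP over all windows of all requested sizes."""
    storage = [0.0] * n_elements  # initialized with zeros, as in the text
    for size in window_sizes:
        summed = [0.0] * n_elements
        counts = [0] * n_elements
        for start in range(n_elements - size + 1):
            rmsep = score_window(start, size)
            # Credit the RMSEP to every element contained in the window.
            for k in range(start, start + size):
                summed[k] += rmsep
                counts[k] += 1
        # Between window-size transitions, normalize by usage count.
        for k in range(n_elements):
            if counts[k]:
                storage[k] += summed[k] / counts[k]
    return storage


def toy_score(start: int, size: int) -> float:
    # Hypothetical stand-in: windows containing element 2 predict well.
    return 0.1 if start <= 2 < start + size else 1.0


storage = mod_csmwpls_storage(5, [1, 2], toy_score)
print(storage)
```

Because every RMSEP is credited to all elements in the window rather than to the first
element only, the valley stays at the informative element regardless of window size.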
Figure 6.7 presents the RMSEP storage results during modCSMWPLS, with window size
varying from one to fifteen. The positions of valleys and peaks are constant, which
improves the process of spectral region selection. Since the variation of the storage vector
decreases as new window search steps are added, it is crucial to find the right
balance between the number of window searches evaluated and the desired
degree of region segmentation. The larger the number of windows searched, the smaller the
differentiation between reliable and unreliable regions for variable prediction.
Figure 6.7: RMSEP storage values for modCSMWPLS, with window size from one up to
fifteen, applied for NIR data of wheat flour samples and using PLS models with maximum
of 8 LV’s.
Results presented in Figure 6.8 relate to the amount of pheromone deposited by
ants on the spectral elements. It is a qualitative criterion for selecting elements for PSCM
models, and relatively higher amounts are associated with better elements for process
variable prediction.
Some spectral regions highlighted by PSCM/ACO are similar to the ones presented by the
modified CSMWPLS, such as the regions 1140 nm to 1270 nm, 1650 nm to
1780 nm and 1950 nm to 2200 nm. Besides those regions, PSCM/ACO also indicates the
region between 2550 nm and 2690 nm as significant for protein prediction.
This region was not selected by CSMWPLS: there was a valley around 2690 nm,
but its RMSEP was not significant enough for it to be included in the selected regions.
Both approaches for spectral data selection emphasized NIR spectral regions
associated with known overtones. According to Sun (2008) and
Champe and Harvey (2005), the regions indicated by the pre-selection
methods as significant for protein inference are highly correlated with C-H stretch bands
(overtones at 1200 nm and 1800 nm) and the amide band (2100 nm). In our previous work
(Ranzan, Strohm et al. 2014) we correlated the spectral regions highlighted by
PSCM/ACO with the features expected in NIR spectral overtones
associated with protein samples. The results were equivalent to the ones
obtained by the modified CSMWPLS (Figure 6.7) and PSCM/ACO (Figure 6.8) spectral region
selection.
The qualitative differences between Figures 6.7 and 6.8 result mainly from the
approach used in each spectral search. While CSMWPLS (Figure 6.7) bases its spectral
analysis on PLS modeling using sequential spectral elements, PSCM/ACO searches for
non-sequential spectral groups. The first method is limited by the fact that, if single
wavelengths are highly sensitive to the compound of interest but are located in a
spectral region where the other elements are not, CSMWPLS will show only a small valley in
that region, which can mask the important information contained in the single elements.
This disadvantage is not present in PSCM/ACO, since this method searches for spectral
groups in an individualized approach, and the importance of each spectral element for
variable prediction is evaluated separately, even though model proposals and tests are made
in groups.
The spectral regions selected by modCSMWPLS and PSCM/ACO are
compared using three chemometric methodologies: PLS; standard CSMWPLS, where the best
results of PLS models with variable window sizes are tested; and the PSCM/ACO
methodology, where the filtered spectral data are combined individually using multilinear
models. In this phase, the selected regions are used to infer protein content.
The results of protein prediction using both pre-selected data sets are shown in
Figure 6.9. The two new spectral data sets are composed of the same number of spectral
elements, corresponding to 20% (230 elements) of the total (1150 elements). In
both filtering methodologies, the selection of spectral elements is divided into two phases:
selection of significant spectral regions, as described previously for each approach, and
sorting of the spectral elements inside each region, based on the qualitative information of
each approach. For CSMWPLS, the sorting is based on the lowest summed RMSEP values
inside each region separately, while in PSCM/ACO the spectral elements are selected
considering the highest pheromone values over the whole spectral range at once.
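The two sorting rules can be contrasted with a short sketch. The function names and the
toy values are hypothetical, and keeping an equal number of elements per region is an
assumption made only for illustration; the text does not state how the 230 retained
elements were distributed among regions. Both functions return indices in wavelength order.

```python
from typing import List, Sequence, Tuple


def top_by_pheromone(pheromone: Sequence[float], n_keep: int) -> List[int]:
    """Indices of the n_keep highest pheromone values over the whole range."""
    ranked = sorted(range(len(pheromone)), key=lambda i: pheromone[i], reverse=True)
    return sorted(ranked[:n_keep])


def lowest_rmsep_per_region(
    storage: Sequence[float],
    regions: Sequence[Tuple[int, int]],
    n_per_region: int,
) -> List[int]:
    """Within each (start, stop) region, keep the lowest stored RMSEP values."""
    kept: List[int] = []
    for start, stop in regions:
        # Rank only the elements of this region; stop is exclusive.
        ranked = sorted(range(start, stop), key=lambda i: storage[i])
        kept.extend(ranked[:n_per_region])
    return sorted(kept)


aco_selection = top_by_pheromone([0.1, 0.9, 0.3, 0.8, 0.2], n_keep=2)
csmw_selection = lowest_rmsep_per_region(
    [5.0, 1.0, 4.0, 2.0, 9.0, 3.0, 0.5, 7.0], regions=[(0, 4), (4, 8)], n_per_region=2
)
print(aco_selection, csmw_selection)
```

The global ranking can concentrate the selection in a single strongly marked region, whereas
the per-region ranking guarantees that every selected region contributes elements.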
Figure 6.9: RMSEP results for chemometric modeling of protein content prediction
using NIR data from wheat flour samples, using full spectrum data, spectrum data
filtered using modified CSMWPLS and spectrum data filtered using PSCM/ACO. Results are
presented as a function of the number of independent variables used in the chemometric
models. Modeling process divided into (a) standard PLS regression, (b) CSMWPLS and
(c) PSCM/ACO.
Results presented in Figure 6.9(b) and Figure 6.9(c) relate to protein
prediction achieved using the same methodologies applied for spectral filtering. Both
results are used to evaluate the prediction capability of the filtered spectral data using
approaches other than standard PLS models. Those results confirm the condensing
capability of the discussed methods and show that spectral data selected using
PSCM/ACO produced better results for smaller model sizes than standard CSMWPLS
modeling, indicating that the filtering procedure applied by PSCM/ACO is more efficient at
summarizing spectral information.
An interesting point is that both spectral filters organize the filtered spectral data in
wavelength order. The position of the selected regions could influence CSMWPLS results, as
discussed by Du et al. (2004). However, evaluating the optimum position of the
elements can be as hard as implementing the spectral filter, since
redistributing the spectral data to improve CSMWPLS modeling becomes another
optimization and spectral selection problem.
For better comparison of the prediction results shown in Figure 6.9, Figure 6.10
presents the percentage difference between the RMSEP values of PLS
models using full spectral data and those of the models obtained using the PLS, CSMWPLS
and PSCM chemometric methodologies with the different selected spectral data. In
Figure 6.10, the curve for PLS using full spectral data corresponds to the zero line. The
curve of each modeling procedure and its respective spectral data set is therefore read
directly against the zero line: the higher the values assumed by a curve, the
better the prediction of that model in comparison to standard PLS models.
These results are obtained by subtracting from the RMSEP values of standard PLS the
values presented by each chemometric methodology and dividing by the value of
standard PLS.
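The percentage difference defined above amounts to a one-line calculation. The sketch below
uses a hypothetical function name; the numerical example reuses the two RMSEP values
quoted earlier for the single-region comparison (0.41 for full-spectrum PLS and 0.388 for
CSMWPLS) purely as an illustration, and these are not the values plotted in Figure 6.10.

```python
def rmsep_improvement(rmsep_reference: float, rmsep_model: float) -> float:
    """Percentage difference relative to the reference; positive means better."""
    return 100.0 * (rmsep_reference - rmsep_model) / rmsep_reference


# Full-spectrum PLS (reference) versus the best single-region CSMWPLS model.
gain = rmsep_improvement(0.41, 0.388)
print(round(gain, 2))
```

A model identical to the reference gives zero, and a model with a larger RMSEP than the
reference gives a negative value, which corresponds to a curve below the zero line.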
Figure 6.10: Percentage difference between RMSEP results for protein prediction of
PLS chemometrical models using full NIR spectral data and: PLS models using NIR filtered
data with modified CSMWPLS (PLS(CSMWPLS)) and PSCM/ACO (PLS(PSCM)), CSMWPLS
models using NIR filtered data with modified CSMWPLS (CSMWPLS(CSMWPLS)) and
PSCM/ACO (CSMWPLS(PSCM)) and PSCM/ACO models using NIR filtered data with
modified CSMWPLS (PSCM(CSMWPLS)) and PSCM/ACO (PSCM(PSCM))
Results presented in Figure 6.10 corroborate the conclusions from Figure 6.9:
models obtained using filtered data result in smaller RMSEP values for models with fewer
than 8 input variables. Filtered spectral data obtained using CSMWPLS give the best
results in comparison with standard PLS and full spectral data, especially for models with
3 input variables, achieving an improvement in RMSEP of the order of 46%.
Chemometric models with more than eight input variables show a decrease in
prediction capability when using filtered data. However, as shown in Figure 6.9, models
with more than 8 input variables presented higher RMSEP values, indicating that the
maximum model size for filtered spectral data for wheat flour characterization is
eight.
The best improvement of PLS models was obtained using the PSCM/ACO strategy for
NIR data selection. As a function of model size, the use of pre-selected data obtained by
this approach leads to an increase of 30% in model accuracy, using 3 input variables.
6.4 Conclusions
The huge amount of spectral information provided by optical measurement
techniques, together with the high sensitivity of and correlation between spectrum
components, hinders the task of transforming information into process knowledge. In
addition, the significant amount of noise and the complex characteristics of the medium
make the application of multivariable modeling tools a challenging problem.
Spectral pre-selection methodologies are an important part of spectral data
pre-treatment, blocking spectrum components with low correlation to the inferred variables
and improving model robustness and quality. They also allow measurements over less than
the full spectral range, significantly reducing the number of points collected and the
amount of information to be stored and/or processed.
This paper compares different approaches for pre-selection of NIR spectral data. The
protein content predictions in wheat and rye flour samples obtained with these techniques
are compared with the standard PLS chemometric method using the full range spectral data.
The modified method provided good results for condensing NIR spectral data from
wheat flour samples. It was tested by reducing the total spectral data to 21% of its
spectral components and was efficient in retaining the main elements, that is, the
ones responsible for protein content prediction in the samples.
Comparing the spectral filter based on the modified CSMWPLS with the PSCM/ACO
approach, we can conclude that both methods give similar results in the selection of
spectral regions for the experimental data set, indicating the viability of applying
these techniques to spectral data treatment. Despite the good results presented by both
approaches in comparison to full spectral data, PSCM/ACO achieved lower
RMSEP values, indicating its better data condensing capability.
For most of the chemometric models tested (i.e., PLS, CSMWPLS and PSCM), pre-selected
data gave smaller RMSEP values than models of the same size and type using full
spectral data. Usually, for chemometric model prediction, the model order has an
optimum size, since prediction becomes less accurate for larger models. Filtering the
spectral data reduces its size and improves its quality, which can improve the final model
quality.
For all tested chemometric approaches, PSCM models showed better results as a
function of model size, leading to the conclusion that this method is more appropriate
for improving model quality than PLS or even CSMWPLS. The proposed modified
CSMWPLS has also shown a very good spectral condensing capability, allowing more
accurate model adjustment, with more than 45% accuracy improvement using 3 input
variables in comparison with standard PLS modeling and full spectral data.
load), leading to the conclusion that state estimators proposed for this system
using the PLS methodology are the most suitable.
CSMWPLS. Despite this positive result, PLS models using the full NIR data matrix
gave a better overall result, reaching the lowest specific RMSEP
value. Even so, for models with fewer than 8 input variables, filtered data,
regardless of the filtering or model fitting method,
gave lower RMSEP values.
In general, it can be concluded from this work that the methodology based on
pure spectral components, PSCM, is viable for the study and
characterization of industrial process data, especially when used
together with the ACO optimization tool. Applying this strategy of
segmenting spectral data into their constituent elements allows the construction
of sensors fitted to specific analytes and measured through distinct information
from the spectral regions. The fact that we combine more than one spectral region in
models with a low number of input variables, less susceptible to noise and
variations in the reaction medium.
Besides these, new structures can be proposed to convert the PSCM/ACO process study
method from the supervised form to the assisted form. As the strategy is currently
conceived, reference values are needed for the state variables of interest of the
samples under study. This characteristic implies that a set of samples must be
characterized beforehand and only then analyzed. The proposed extension of the technique
to the assisted version would combine the PSCM/ACO characterization methodology with
simulation and fitting of dynamic process models, as in the work of Oliveira et al. (2008).
Assisted methods have advantages over supervised methods, such as
not needing to characterize all samples with respect to the
variables of interest.
References
Ait Kaddour, A. and B. Cuq (2009). "In line monitoring of wet agglomeration of wheat
flour using near infrared spectroscopy." Powder Technology 190(1–2): 10-18.
Alford, J. S. (2006). "Bioprocess control: Advances and challenges." Computers & Chemical
Engineering 30(10-12): 1464-1475.
Allegrini, F. and A. C. Olivieri (2011). "A new and efficient variable selection algorithm
based on ant colony optimization. Applications to near infrared spectroscopy/partial
least-squares analysis." Analytica Chimica Acta 699(1): 18-25.
Alves, J. C. L. and R. J. Poppi (2013). "Biodiesel content determination in diesel fuel blends
using near infrared (NIR) spectroscopy and support vector machines (SVM)." Talanta
104(0): 155-161.
Ammari, F., R. Bendoula, D. Jouan-Rimbaud Bouveresse, D. N. Rutledge and J.-M. Roger
(2014). "3D front face solid-phase fluorescence spectroscopy combined with Independent
Components Analysis to characterize organic matter in model soils." Talanta 125(0): 146-
152.
Andries, J. P. M., Y. V. Heyden and L. M. C. Buydens (2013). "Predictive-property-ranked
variable reduction in partial least squares modelling with final complexity adapted
models: Comparison of properties for ranking." Analytica Chimica Acta 760(0): 34-45.
Aynsley, M., A. Hofland, A. J. Morris, G. A. Montague and C. Di Massimo (1993). "Artificial
intelligence and the supervision of bioprocesses (real-time knowledge-based systems and
neural networks)." Bioprocess Design and Control.
Bag, N., D. H. X. Yap and T. Wohland (2014). "Temperature dependence of diffusion in
model and live cell membranes characterized by imaging fluorescence correlation
spectroscopy." Biochimica et Biophysica Acta (BBA) - Biomembranes 1838(3): 802-813.
Banwell, C. N. (1983). Fundamentals of Molecular Spectroscopy. New York, McGraw-Hill.
Beebe, K. R., R. J. Pell and M. B. Seasholtz (1998). Chemometrics: A practical guide. New
York, Wiley & Sons.
Bequette, B. W. (2003). Process control : modeling, design, and simulation. Upper Saddle
River, NJ, Prentice Hall PTR.
Boehl, D., D. Solle, B. Hitzmann and T. Scheper (2003). "Chemometric modelling with two-
dimensional fluorescence data for Claviceps purpurea bioprocess characterization."
Journal of Biotechnology 105(1-2): 179-188.
Bosque-Sendra, J. M., L. Cuadros-Rodríguez, C. Ruiz-Samblás and A. P. de la Mata (2012).
"Combining chromatography and chemometrics for the characterization and
authentication of fats and oils from triacylglycerol compositional data—A review."
Analytica Chimica Acta 724(0): 1-11.
Brereton, R. (2007). Chemometrics for Pattern Recognition. Chichester, John Wiley &
Sons.
Krishnan, A., L. J. Williams, A. R. McIntosh and H. Abdi (2011). "Partial Least Squares (PLS)
methods for neuroimaging: A tutorial and review." NeuroImage 56(2): 455-475.
Kumar, N., A. Bansal, G. S. Sarma and R. K. Rawal (2014). "Chemometrics tools used in
analytical chemistry: An overview." Talanta(0).
Lacerda, E. (2007). Otimização Nuvem de Partículas (Particle Swarm). Universidade
Federal do Rio Grande do Norte.
Lakowicz, J. R. (2006). Principles of fluorescence spectroscopy. New York, Springer.
Land Jr, W. H., F. W., P. W., M. R., H. N., H. J., E. S., Q. X. and Y. T (2011). "Partial Least
Squares Applied to Medical Bioinformatics." Procedia Computer Science 6(0): 273-278.
Leardi, R., M. B. Seasholtz and R. J. Pell (2002). "Variable selection for multivariate
calibration using a genetic algorithm: prediction of additive concentrations in polymer
films from Fourier transform-infrared spectral data." Analytica Chimica Acta 461(2): 189-
200.
Li Vigni, M., C. Durante, G. Foca, A. Marchetti, A. Ulrici and M. Cocchi (2009). "Near
Infrared Spectroscopy and multivariate analysis methods for monitoring flour
performance in an industrial bread-making process." Analytica Chimica Acta 642(1–2): 69-
76.
Lindemann, C., S. Marose, H. O. Nielsen and T. Scheper (1998). "2-Dimensional
fluorescence spectroscopy for on-line bioprocess monitoring." Sensors and Actuators B:
Chemical 51(1-3): 273-277.
Liu, R. X., J. Kuang, Q. Gong and X. L. Hou (2003). "Principal component regression analysis
with spss." Computer Methods and Programs in Biomedicine 71(2): 141-147.
Mardia, K., J. Kent and J. Bibby (1979). Multivariate Analysis, Academic Press.
Marose, S., C. Lindemann and T. Scheper (1998). "Two-dimensional fluorescence
spectroscopy: A new tool for on-line bioprocess monitoring." Biotechnology Progress
14(1): 63-74.
Masiero, S. S., J. O. Trierweiler, M. Farenzena, M. Escobar, L. F. Trierweiler and C. Ranzan
(2013). "Evaluation of wavelength selection methods for 2D fluorescence spectra applied
to bioprocesses characterization." Brazilian Journal of Chemical Engineering 30.
McLeod, G., K. Clelland, H. Tapp, E. K. Kemsley, R. H. Wilson, G. Poulter, D. Coombs and C.
J. Hewitt (2009). "A comparison of variate pre-selection methods for use in partial least
squares regression: A case study on NIR spectroscopy applied to monitoring beer
fermentation." Journal of Food Engineering 90(2): 300-307.
Mello, P. A. and J. C. C. S. Pinto (2008). Introdução à Modelagem Matemática e Dinâmica
Não-Linear de Processos Químicos. Rio de Janeiro, Escola Piloto Virtual Giuliano
Massaran.
Menezes, J. C. (2011). Process Analytical Technology in Bioprocess Development and
Manufacturing. Comprehensive Biotechnology (Second Edition). M. Moo-Young
(Editor-in-Chief). Burlington, Academic Press: 501-509.
Mulchandani, A. and A. S. Bassi (1995). "Principles and applications of biosensors for
bioprocess monitoring and control." Crit. Rev. Biotechnol 1: 105-124.
Mulchandani, A. and J. H. T. Luong (1989). "Microbial inhibition kinetics revisited."
Enzyme and Microbial Technology 11(2): 66-73.
Mullen, R. J., D. Monekosso, S. Barman and P. Remagnino (2009). "A review of ant
algorithms." Expert Systems with Applications 36(6): 9608-9617.
Mussatto, S. I., G. Dragone, P. M. R. Guimarães, J. P. A. Silva, L. M. Carneiro, I. C. Roberto,
A. A. Vicente, L. Domingues and J. A. Teixeira (2010). "Technological trends, global
market, and challenges of bio-ethanol production." Biotechnology Advances 28: 817–830.
Nascimento, R. S., R. E. S. Froes, N. O. C. e Silva, R. L. P. Naveira, D. B. C. Mendes, W. B.
Neto and J. B. B. Silva (2010). "Comparison between ordinary least squares regression and
weighted least squares regression in the calibration of metals present in human milk
determined by ICP-OES." Talanta 80(3): 1102-1109.
Ni, W., S. D. Brown and R. Man "A localized adaptive soft sensor for dynamic system
modeling." Chemical Engineering Science(0).
Oliveira, F. R. P., K. Goldberg, A. Liese and B. Hitzmann (2008). "Chemometric modelling
for process analyzers using just a single calibration sample." Chemometrics and Intelligent
Laboratory Systems 94(2): 118-122.
Olivieri, A. C., H. C. Goicoechea and F. A. Iñón (2004). "MVC1: an integrated MatLab
toolbox for first-order multivariate calibration." Chemometrics and Intelligent Laboratory
Systems 73(2): 189-197.
Omary, M. A. and H. H. Patterson (1999). Luminescence, Theory. Encyclopedia of
Spectroscopy and Spectrometry (Second Edition). J. Lindon (Editor-in-Chief). Oxford,
Academic Press: 1372-1391.
Omrani, H., A. E. Dudelzak, B. P. Hollebone and H.-P. Loock (2014). "Assessment of the
oxidative stability of lubricant oil using fiber-coupled fluorescence excitation–emission
matrix spectroscopy." Analytica Chimica Acta 811(0): 1-12.
Otsuka, M. (2004). "Comparative particle size determination of phenacetin bulk powder
by using Kubelka–Munk theory and principal component regression analysis based on
near-infrared spectroscopy." Powder Technology 141(3): 244-250.
Pasquini, C. (2002). Espectroscopia no Infravermelho Proximo (NIR). Salvador, UFBA.
Pattison, R. N., J. Swamy, B. Mendenhall, C. Hwang and B. T. Frohlich (2000).
"Measurement and control of dissolved carbon dioxide in mammalian cell culture
processes using an in situ fiber optic chemical sensor." Biotechnology Progress 16(5):
769-774.
Pereira, J. P. G. (2007). Heurísticas computacionais aplicadas à otimização estrutural de
treliças bidimensionais Centro Federal de Educação Tecnológica de Minas Gerais.
Pratap, R. P. (2003). "Oscillatory metabolism of Saccharomyces cerevisiae: an overview of
mechanisms and models." Biotechnology Advances 21(3): 183-192.
ABIQUIM - Associação Brasileira da Indústria Química (2013). A Indústria Química Brasileira.
Ramos, L. S., K. R. Beebe, W. P. Carey, E. M. Sanchez, B. C. Erickson, B. E. Wilson, L. E.
Wangen and B. R. Kowalski (1986). "Chemometrics." Anal. Chem. 58: 294 - 315.
Ranzan, C. (2010). Fermentação Contínua de Zymomonas mobilis: Modelagem, Ajuste de
Parâmetros e Inferências a Partir do Consumo de Hidróxido de Sódio. Master,
Universidade Federal do Rio Grande do Sul.
Wong, W. C., C. C. Chan, P. Hu, J. R. Chan, Y. T. Low, X. Dong and K. C. Leong (2014).
"Miniature pH optical fiber sensor based on waist-enlarged bitaper and mode excitation."
Sensors and Actuators B: Chemical 191(0): 579-585.
Xu, L. and I. Schechter (1996). "Wavelength selection for simultaneous spectroscopic
analysis. Experimental and theoretical study." Anal Chem 68: 2392–2400.
Yamuna, R. K. and R. V. S. Ramachandra (1999). "Control of fermenters – a review."
Bioprocess Engineering 21: 77-88.
Zang, Z., J. M. Scharer and M. Moo-Young (1997). "Mathematical model for aerobic
culture of a recombinant yeast." Bioprocess Engineering 17: 235–240.
Zhang, Y., A. M. Zamamiri, M. A. Henson and M. A. Hjortsø (2002). "Cell population
models for bifurcation analysis and nonlinear control of continuous yeast bioreactors."
Journal of Process Control 12(6): 721-734.
Župerl, Š., S. Fornasaro, M. Novič and S. Passamonti (2011). "Experimental determination
and prediction of bilitranslocase transport activity." Analytica Chimica Acta 705(1–2): 322-
333.