Transformer For Modeling Physical Systems

Neural Networks 146 (2022) 272–289
Contents lists available at ScienceDirect
Neural Networks
journal homepage: www.elsevier.com/locate/neunet
Transformers for modeling physical systems

∗
Nicholas Geneva, Nicholas Zabaras
Scientific Computing and Artificial Intelligence (SCAI) Laboratory, University of Notre Dame, 311 Cushing Hall, Notre Dame, IN 46556, USA
article info a b s t r a c t
Article history: Transformers are widely used in natural language processing due to their ability to model longer-term
Received 18 January 2021 dependencies in text. Although these models achieve state-of-the-art performance for many language
Received in revised form 8 November 2021 related tasks, their applicability outside of the natural language processing field has been minimal.
Accepted 22 November 2021
In this work, we propose the use of transformer models for the prediction of dynamical systems
Available online 1 December 2021
representative of physical phenomena. The use of Koopman based embeddings provides a unique and
Keywords: powerful method for projecting any dynamical system into a vector representation which can then
Transformers be predicted by a transformer. The proposed model is able to accurately predict various dynamical
Deep learning systems and outperform classical methods that are commonly used in the scientific machine learning
Self-attention literature.1
Physics
© 2021 Elsevier Ltd. All rights reserved.
Koopman
Surrogate modeling
1. Introduction & Lin, 2013), polynomial chaos expansions (Xiu & Karniadakis,
2002), reduced-order models (Chakraborty & Zabaras, 2018; Gao,
The transformer model (Vaswani, et al., 2017), built on self- Wang, & Zahr, 2020), reservoir computing (Tanaka, et al., 2019)
attention, has largely become the state-of-the-art approach for and deep neural networks (Geneva & Zabaras, 2020a; Tripathy
a large set of natural language processing (NLP) tasks includ- & Bilionis, 2018; Zhu & Zabaras, 2018). A surrogate model is
ing language modeling, text classification, question answering, defined as a computationally inexpensive approximate model of
etc. Although more recent transformer work is focused on un- a physical phenomenon that is designed to replace an expensive
supervised pre-training of extremely large models (Dai, Yang, computational solver that would otherwise be needed to resolve
Yang, Carbonell, Le, & Salakhutdinov, 2019; Devlin, Chang, Lee, the system of interest. An important characteristic of surrogate
& Toutanova, 2019; Liu, et al., 2019; Radford, Wu, Child, Luan, models is their ability to model a distribution of initial or bound-
Amodei, & Sutskever, 2019), the original transformer model gar- ary conditions rather than learning just one solution. This is
nered attention due to its ability to out-perform other state-of- arguably essential for the justification of training a deep learning
the-art methods by learning longer-term dependencies without model versus using a standard numerical solver, particularly in
recurrent connections. Given that the transformer model was the context of utilizing deep learning methods which tend to have
originally developed for NLP, nearly all related work has been expensive training procedures. The most tangible applications
rightfully confined within this field with only a few exceptions. of surrogates are for optimization, design and inverse problems
Here, we focus on the development of transformers to model dy- where many repeated simulations are typically needed.
namical systems that can replace otherwise expensive numerical Standard deep neural network architectures such as auto-
solvers. In other words, we are interested in using transformers regressive (Geneva & Zabaras, 2020a; Mo, Zhu, Zabaras, Shi,
to learn the language of physics. & Wu, 2019), residual/Euler (González-García, Rico-Martínez, &
The surrogate modeling of physical systems is a research field Kevrekidis, 1998; Sanchez-Gonzalez, Godwin, Pfaff, Ying, Leskovec,
that has existed for several decades and is a large ongoing effort in & Battaglia, 2020), recurrent and LSTM based models (Geneva
scientific machine learning. Past literature has explored multiple & Zabaras, 2020b; Maulik, Egele, Lusch, & Balaprakash, 2020;
surrogate approaches including Gaussian processes (Atkinson & Mo et al., 2019; Tang, Liu, & Durlofsky, 2020) have been largely
Zabaras, 2019; Bilionis & Zabaras, 2016; Bilionis, Zabaras, Konomi, demonstrated to be effective at modeling various physical dy-
namics. Such models generally rely on the most recent time-steps
∗ Corresponding author. to provide complete information on the current and past state
E-mail addresses: ngeneva@nd.edu (N. Geneva), nzabaras@gmail.com
of the system’s evolution. Approaches that meld numerical time-
(N. Zabaras). integration methods with neural networks have also proven to
URL: https://zabaras.com/ (N. Zabaras). be fairly successful, e.g. Wang and Lin (1998), Wessels, Weißen-
1 Code available at: https://github.com/zabaras/transformer-physx. fels, and Wriggers (2020) and Zhu, Chang, and Fu (2018), but
https://doi.org/10.1016/j.neunet.2021.11.022
0893-6080/© 2021 Elsevier Ltd. All rights reserved.
N. Geneva and N. Zabaras Neural Networks 146 (2022) 272–289
have a fixed temporal window from which information is pro- During testing, the embedding decoder is used to reconstruct the
vided. Present machine learning models lack generalizable time physical states from the transformer’s predictions.
cognizant capabilities to predict multi-time-scale phenomena
present in systems including turbulent fluid flow, multi-scale 2.1. Transformer
materials modeling, molecular dynamics, chemical processes, etc.
Much work is needed to scale such deep learning models to The transformer model was originally designed with NLP as
complex physical systems that are of scientific and industrial the sole application with word vector embeddings of a passage
interest. This work deviates from this pre-existing literature by of text being the primary input (Vaswani, et al., 2017). How-
investigating the use of transformers for the prediction of physical ever, recent works have explored using attention mechanisms
systems, relying entirely on self-attention to surrogate model for different machine learning tasks (Fu, et al., 2019; Veliçković,
dynamics. In the recent work of Shalova and Oseledets (2020), Cucurull, Casanova, Romero, Liò, & Bengio, 2018; Zhang, Good-
such self-attention models were tested to learn single solutions fellow, Metaxas, & Odena, 2019) and a few investigate the use
of several low-dimensional ordinary differential equations. of transformers for applications outside of the NLP field (Chen,
The novel contributions of this paper are as follows: (a) The et al., 2020). This suggests that self-attention and in particular
application of self-attention transformer models for modeling transformer models may work well for any problem that can be
physical dynamics; (b) The use of Koopman dynamics for devel- posed as a sequence of vectors.
oping physics inspired embeddings of high-dimensional systems
with connections to embedding methods seen in NLP; (c) Dis- 2.1.1. Transformer decoder
cussion of the relations between self-attention with traditional In this work, the primary input to (the transformer) will be an
numerical time-integration; (d) Demonstration of our model on embedded dynamical system, Ξ = ξ 0 , ξ 1 , . . . .ξ T , where the
high-dimensional partial differential equation problems that in- embedded state at time-step i is denoted as ξ i ∈ Re . Given that we
clude chaotic dynamics, fluid flows and reaction–diffusion sys- are interested in the prediction of a physical time series, this mo-
tems. To the authors best knowledge, this is the first work to tivates the usage of a language modeling architecture that is de-
explore transformer NLP architectures for the surrogate mod- signed for the sequential prediction of words in a body of text. We
eling of physical systems. The remainder of this paper is as select the transformer decoder architecture used in the Genera-
follows: In Section 2, the machine learning methodology is dis- tive Pre-trained Transformer (GPT) models (Radford, Narasimhan,
cussed including the transformer decoder model in Section 2.1 Salimans, & Sutskever, 2018; Radford et al., 2019). Our model
and the Koopman embedding model in Section 2.2. Following follows the GPT-2 architecture based on the implementation in
in Section 3, the proposed model is implemented for a set of the Hugging Face transformer repository (Wolf, et al., 2019), but is
numerical examples of different dynamical nature. This includes significantly smaller in size than these modern NLP transformers.
classical chaotic dynamics in Section 3.1, periodic fluid dynamics This model consists of a stack of transformer decoder layers
in Section 3.2 and three-dimensional reaction–diffusion dynamics that use masked attention, as depicted in Fig. 2. The input to
in Section 3.3. Lastly, concluding discussion and future directions the transformer is the embedded representation of the physical
are given in Section 4. system from the embedding model and a sinusoidal positional
encoding proposed in the original transformer (Vaswani, et al.,
2. Methods 2017). Since the transformer does not contain any recurrent or
convolutional operation, information regarding the relative po-
We are interested in systems that can be described through a sition of the embedded input sequence must be provided. The
dynamical ordinary or partial differential equation of the form: positional embedding is defined by the following sine and cosine
functions:
φt = F x, φ(t , x, η), ∇x φ, ∇x2 φ, φ · ∇x φ, . . . ,
( )
(1) PEpos,2j = sin pos/100002j/e , PEpos,2j+1 = cos pos/100002j/e ,
( ) ( )
t ∈ T ⊂ R+ , x ∈ Ω ⊂ Rm ,
in which φ ∈ Rn is the solution of this differential equation of (2)
n state variables with parameters η, in the time interval T and for the 2j and 2j + 1 elements in the embedded vector, ξ i . pos
spatial domain Ω with a boundary Γ ⊂ Ω . This general form can is the embedded vector’s relative or global position in the input
embody a vast spectrum of physical phenomena including fluid time series. To train the
{ model,
}D consider a data set of D embedded
flow and transport processes, mechanics and materials physics, i.i.d. time-series D = Ξ i i=1 for which we can use the standard
and molecular dynamics. In this work, we are interested in learn- time-series Markov model (language modeling) log-likelihood:
ing the set of solutions for a distribution of initial conditions φ0 ∼
D T
p(φ0 ), boundary conditions B(φ) ∼ p (B) ∀x ∈ Γ or equation ∑ ∑
− log p ξ ij |ξ ij−k , . . . , ξ ij−1 , θ ,
( )
parameters η ∼ p (η). This accounts for modeling initial value, LD = (3)
i j
boundary value and stochastic problems.
To make the modeling of such dynamical systems applicable where θ are the model’s parameters and k is the transformer’s
to the use of modern machine learning architectures, the con- context window. Contrary to the standard NLP approach which
tinuous solution is discretized in both the spatial and temporal poses the likelihood as a softmax over a dictionary of tokens,
domains ) the solution of the differential equation is Φ =
such that the likelihood here is taken as a standard Gaussian between the
φ0 , φ1 , . . . φT ; φi ∈ Rn×d , for which φi has been discretized by
(
transformer’s prediction and the target embedded value resulting
d points in Ω . We assume an initial state φ0 and that the time in a L2 loss. This is due to the fact that the solution to most
interval T is discretized by T time-steps with a time-step size physical systems cannot be condensed to a discrete finite set.
∆t. Hence, we pose the modeling a dynamical system as a time- Thus tokenization into a finite dictionary is not possible and
series problem. The proposed machine learning methodology has a softmax approach is not applicable. Training is the standard
two core components: the transformer for modeling dynamics auto-regressive method used in GPT (Radford et al., 2018), as
and the embedding network for projecting physical states into opposed to the word masking (Devlin et al., 2019), constrained to
a vector representation. Similar to NLP, the embedding model is the embedded space. The physical states, φi , have the potential
trained prior to the transformer. This embedding model is then to be very high-dimensional and training the transformer in
frozen and the entire data-set is converted to the embedded space the lower-dimensional embedded space can significantly lower
in which the transformer is then trained as illustrated in Fig. 1. training costs.
273
Fig. 1. The two training stages for modeling physical dynamics using transformers. (Left to right) The embedding model is first trained using Koopman based
dynamics. The embedding model is then frozen (fixed), all training data is embedded and the transformer is trained in the embedded space.
Fig. 2. The transformer decoder model used for the prediction of physical dynamics.
2.1.2. Self-attention Kaiser, & Levskaya, 2020; Sukhbaatar, Grave, Bojanowski, & Joulin,
Self-attention is the main mechanism that allows the trans- 2019; Sukhbaatar, Grave, Lample, Jegou, & Joulin, 2019).
former to learn temporal dependencies. Prior to the seminal While designed for natural language processing, in the context
transformer paper, several works had already proposed a handful of dynamical systems, self-attention bears a very similar form to
of different attention models typically integrated into recur- numerical time-integration methods. The use of time-integration
rent models for language processing (Bahdanau, Cho, & Ben- methods such as Runge–Kutta schemes or Euler methods can
gio, 2015; Graves, Wayne, & Danihelka, 2014; Luong, Pham, be found in Neural ODEs (Chen, Rubanova, Bettencourt, & Du-
& Manning, 2015). With the popularization of the transformer venaud, 2018; Dupont, Doucet, & Teh, 2019) and Res-Net based
model (Vaswani, et al., 2017), the most commonly used attention models (González-García et al., 1998; Sanchez-Gonzalez et al.,
model is the scaled-dot product attention: 2020). Self-attention offers a much larger learning capacity than
k i = Fk (xi ), qi = Fq (xi ), vi = Fv (xi ) (4) using these traditional numerical methods for learning a unique
k √ latent time-integration method. The function space of a self-
∑ exp(qTn k i / dk ) attention layer with a residual connection contains the space
cn = αn,i vi , αn,i = ∑k √ , (5)
j=1 exp(qn k j / dk )
T of explicit linear time-integration methods within an arbitrarily
i=1
small non-zero amount of error.
in which we use xi ∈ Rd and c i ∈ Rdv to denote an arbitrary input
and context output, respectively. k ∈ Rdk , q ∈ Rdk and v ∈ Rdv are Theorem 1. Consider a dynamical system of the form, dφ/dt =
referred to as the key, query, and value vectors, respectively, cal- f (t , φ), φ(t) ∈ Rd , where f : R × Rd → Rd is Lipschitz
) respect to φ and continuous in t. Let the function
culated using neural networks Fk , Fq and Fv . The attention score, continuous with
αn,i , is calculated by the soft-max of the dot product between the Aθ t , φt −k∆t :t = φ̂t +∆t be a self-attention layer with a residual
(
query and key vectors scaled by the dimension dk . Due to the soft- connection, of context length
max
{ k and containing learnable parameters
∑k calculation note that the attention scores always sum to one, θ ∈ Rdθ . Suppose A := Aθ (t , φ) | ∀θ ∈( Rdθ be the) set of all
}
i=1 αn,i = 1, for every input. possible self-attention calculations. Let Mi t , φt −(i−1)∆t :t = φt +∆t
The attention calculation can be condensed for the entire
be the ith order explicit Adams method time-integrator. The set of
context (length√of )the model by a matrix representation C =
functions up to kth order M := {Mi | 1 ≤ i ≤ k} is a subset of A
softmax Q K T / dk V , where Q ∈ Rk×dk , K ∈ Rk×dk and V ∈
such that ∃ Aθi ∈ A s.t. ∥Mi − Aθi ∥∞ < O(ϵ ) for any Mi ∈ M and
Rk×dv . As illustrated in Fig. 2, self-attention is typically imple-
ϵ > 0.
mented with a residual connection with multiple independent at-
tention calculations referred to as attention heads (Vaswani, et al.,
Proof. The state variables are discretized w.r.t. time such that
2017). While computationally inexpensive to evaluate, the mem-
φi = φ(i∆t); ti = i∆t. The general definition for linear multi-step
ory requirement of scaled-dot product attention can become in-
methods follows:
creasingly cumbersome as the context length, k, increases. Hence,
several methods for approximating this calculation have been φn+s + as−1 · φn+s−1 + · · · + a0 · φn
proposed which include the use of kernels and random pro-
= ∆t · bs · f tn+s , φn+s + bs−1 · f tn+s−1 , φn+s−1
( ( ) ( )
jections in an attempt to lower the dimensionality of the self-
+ · · · + b0 · f tn , φn ,
( ))
attention calculation without loss of predictive accuracy (Kitaev, (6)
274
in which ai and bi are coefficients determined by the integration A single transformer layer is significantly more expressive
method. For explicit Adams methods, the φn+s is directly com- than any numerical time integration scheme, enabling it to learn
puted with bs = 0, as−1 = −1 and a0:s−2 = 0. The generalized more complex temporal dependencies. However, the linear com-
formula for s-step explicit Adams methods can be represented as bination of weighted features from past time-steps is a com-
the following linear combination: monality between self-attention and numerical time-integration
s−1
∑ methods. This mathematical similarity indicates self-attention
Ms = φn+s = φn+s−1 + ∆t bj f tn+j , φn+j , models may be better suited for learning dynamical systems than
( )
(7)
j=0
alternative deep learning approaches. In particular traditional
Euler time integration present in various models for modeling
which encapsulates up to and including sth order time integra- dynamics (González-García et al., 1998; Lu, Zhong, Li, & Dong,
tion (Stoer & Bulirsch, 2013). Consider a residual scaled dot-
2018; Sanchez-Gonzalez et al., 2020) is a specific case of this
product self-attention calculation with context length, s, for out-
single layer attention calculation.
put prediction φ̂n+s , input states φn:n+s−1 and time t:
s−1
∑ Remark. The inclusion of time, t, as an input is required for
Aθi = φ̂n+s = φn+s−1 + αn+s−1,i vi , non-autonomous systems with explicit time-dependent terms.
i=0 This has surprising connections to the implementation of the
√ transformer which uses a positional embedding. Although the
exp(qTn+s−1 k i / dk )
αn+s−1,i = ∑s−1 √ , (8) positional embedding is included for the sake of denoting the
j=0 exp(qTn+s−1 k j / dk ) order of the input to the transformer, using current time of the
for which k i = Fk (tn+i , φn+i ), qi = Fq (tn+i , φn+i ) and vi = given input state can accomplish the same goal.
Fv (tn+i , φn+i ) vectors are outputs of differentiable functions pa-
rameterized by neural networks: 2.2. Embedding model
Fk : R × Rd → Rdk , Fq : R × Rd → Rdk , Fv : R × Rd → Rdv . The second major component of the machine learning method-
(9) ology is the embedding model responsible for projecting the
discretized physical state space into a 1D vector representation.
It suffices to show that there exists a form of Eq. (8) that is equal
In NLP, the standard approach is to tokenize then embed a finite
to Eq. (7) within an error order O(ϵ ) such that
  vocabulary of words, syllables or characters using methods such
φn+s − φ̂n+s 
 
< O(ϵ ). (10) as n-gram models, Byte Pair Encoding (Gage, 1994), Word2Vec
∞ (Mikolov, Chen, Corrado, & Dean, 2013; Mikolov, Sutskever, Chen,
To this end, although Fq and Fk are fully-trainable neural net- Corrado, & Dean, 2013), GloVe (Pennington, Socher, & Manning,
works, herein we consider a fixed single-entry query and key 2014), etc. These methods allow language to be represented by
vectors resulting in the following simplified self-attention scores: a series of 1D vectors that serve as the input to the transformer.
Clearly a finite tokenization and such NLP embeddings are not di-
√ rectly applicable to physics, thus we propose our own embedding
qi = Fq (tn+i , φn+i ) = dk em , k i = Fk (tn+i , φn+i ) = log(bi ) em , method designed specifically for dynamical systems. Consider
exp(log(bi )) bi learning the generalized mapping between the system’s state
αn+s−1,i = ∑s−1 = ∑s−1 ,
exp(log(bj )) bj space and embedded space: F : Rn×d → Re and G : Re →
j=0 j=0
Rn×d . Naturally, multiple approaches can be used especially if the
(11) dimensionality of the embedded space is less than that of the
dk
where em ∈ R is a unit vector with all elements set to zero state–space but this is not always the case.
except the mth element that is set to 1. By the universal approx- The primary approach that we will propose is a Koopman
imation theorem (Cybenko, 1989; Hornik, 1991; Hornik, Stinch- observable embedding which is a technique that can be applied
combe, & White, 1989), the neural network, Fv , is assumed to be universally to all dynamical systems. Considering the discrete
of sufficient capacity such that it can approximate the r.h.s. of the time form of the dynamical system in Eq. (1), the evolution
of the state variables can be abstracted by φi+1 = F φi for
( )
governing equation:
which F is the dynamic map from one time-step to the next. The
vi = Fv (tn+i , φn+i ) = c f (tn+i , φn+i ) + O(ϵ ), ϵ > 0, (12)
foundation of Koopman theory states any dynamical system can
∑s−1
where the constant c is taken as c = ∆t j=0 bj . Eqs. (11) and be represented in terms of an infinite dimensional linear( operator
acting on an infinite set of state observable functions, g φi , such
)
(12) can then be combined leading to the following:
that:
s−1 s−1
bi
Kg φ i ≜ g ◦ F φ i ,
∑ ∑ ( ) ( )
φn+s−1 + αn+s−1,i vi = φn+s−1 + ∑s−1 (14)
i=0 i=0 j=0 bj
where K is the infinite-dimensional linear operator referred to
c f (tn+i , φn+i ) + O(ϵ ), as the Koopman operator (Koopman, 1931). This implies that the
s−1
∑ system of observables can be evolved in time through repeated
= φn+s−1 + ∆t bi f (tn+i , φn+i ) + O(ϵ ). application of the Koopman operator:
i=0
g φi+1 = Kg φi , g φi+2 = K2 g φi ,
( ) ( ) ( ) ( )
(13)
g φi+3 = K3 g φi , . . .
( ) ( )
(15)
For a large capacity neural network, we can assume that the error
can be made arbitrarily small (ϵ → 0) and Eq. (10) is proved.
n 3
in which K denotes a n-fold composition, e.g. K (g) = K(K(K(g))).
Therefore, the form of the explicit Adams multi-step methods of Modeling the dynamics of a system through the linear Koopman
order ≤ s can be captured by a residual self-attention transformer space can be attractive due to its simplification of the dynamics
layer. □ but also the potential physical insights it brings along with it.
275
Fig. 3. Example of a Koopman embedding for a two-dimensional system using a convolution encoder–decoder model. The encoder model, F , projects the physical
states into the approximate Koopman observable embedding. The decoder model, G recovers the physical states from the embedding.
Spectral analysis of the Koopman operator can reveal funda- This loss function consists of three components: the first is a
mental dynamical modes that drive the system’s evolution in reconstruction loss which ensures a consistent mapping to and
time. from the embedded representation. The second is the Koopman
Koopman theory can be viewed as a trade off between lifting dynamics loss which pushes ξ j to follow linear dynamics. The last
the state space into observable space with more complex states term decays the Koopman operator’s parameters to help force
but simpler dynamics. In practice, the Koopman operator must the model to discover meaningful dynamical modes and further
be finitely approximated. This finite approximation requires the prevent overfitting.
identification of the essential measurement functions that govern In reference to traditional NLP embeddings, we believe our
the system’s dynamics and the respective approximate Koop- Koopman observable embedding has a motivation similar to
man operator. Data-driven machine learning has proven to be Word2Vec (Mikolov, Chen, et al., 2013) as well as more recent
an effective approach for learning key Koopman observables for embedding methods such as context2vec (Melamud, Goldberger,
modeling, control and dynamical mode analysis of many physical & Dagan, 2016), ELMo (Peters, Neumann, Iyyer, Gardner, Clark,
systems (Korda & Mezić, 2018; Korda, Putinar, & Mezić, 2020; Lee, & Zettlemoyer, 2018), etc. These methods are based on word
Li, Dietrich, Bollt, & Kevrekidis, 2017; Mezic, 2020). In recent context and association to develop a map where words that are
years, the use of deep neural networks for learning Koopman related or synonymous to each other have similar embedded
dynamics has proven to be successful (Brunton, Budišić, Kaiser, vectors. The Koopman embedding model has a similar objective
& Kutz, 2021; Lusch, Kutz, & Brunton, 2018; Otto & Rowley, encouraging physical realizations containing similar dynamical
2019; Takeishi, Kawahara, & Yairi, 2017). While deep learning modes to also have similar embeddings. This is because the Koop-
methods have enabled greater success with discovering Koop- man operator is time-invariant which means the embedded states
man observables and operators such approaches have yet to be must share the same basis functions that govern their evolution.
demonstrated for the long time prediction of high-dimensional As a result, the loss function of the embedding model rewards
systems. This is likely due to the approximation of the finite- time-steps that are near each other in time or have the same
dimensional Koopman observables, limited data and complete underlying dynamics to have similar embeddings. Hence, our
dependence on the discovered Koopman operator K to model the goal with the embedding model is to not find the true Koopman
dynamics. Suggesting the prediction of a system through a single observables or operator, but rather leverage Koopman to enforce
linear transform clearly has significant limitations and is funda- physical context and association using the learned dynamical
mentally a naive approach from a machine learning perspective. modes.
In this work, we propose using approximate Koopman dynam- 3. Experiments and results
ics as a methodology to develop embeddings for the transformer
model such that F (φi ) ≜ g(φi ). As seen in Fig. 3, the embed- The proposed transformer model is implemented for three dy-
ding model follows a standard encoder–decoder model with the namical systems. The classical Lorenz system is used as a baseline
middle latent variables being the Koopman observables. In this test case in Section 3.1 to compare the proposed model to alter-
model, the Koopman operator assumes the form of a learnable native machine learning approaches due to the Lorenz system’s
banded matrix that is optimized with the auto-encoder. Imposing numerical sensitivity. Following in Section 3.2, we consider the
some level of inductive bias on the form of the Koopman matrix modeling of two-dimensional Navier–Stokes fluid flow and com-
is fairly common in similar deep Koopman works (Li, He, Wu, pare different embedding methods for the transformer. Lastly,
Katabi, & Torralba, 2020; Lusch et al., 2018; Otto & Rowley, to demonstrate the models scalability, we demonstrate in Sec-
2019). We found that this reduction of learnable parameters tion 3.3 the transformer for the prediction of a three-dimensional
helped encourage the model to discover better dynamical modes, reaction–diffusion system. These examples encompass surrogate
preventing the model from overfitting to high-frequency fluctua- modeling physical systems with stochastic initial conditions and
tions. Additionally, this form requires significantly less memory parameters.
to store allowing models with embedding vectors of higher-
dimensionality to be trained. This learned Koopman operator is 3.1. Chaotic dynamics
disposed of once training of the embedding model is {complete.
}D
Given the data set of physical state time-series, DΦ = Φ i i , the As a foundational numerical example to rigorously compare
Koopman embedding model is trained with the following loss: the proposed model to other classical machine learning tech-
D T
niques, we will first look at surrogate modeling of the Lorenz
∑ ∑ system governed by:
λ0 MSE φij , G ◦ F φij
( ( ))
LDΦ =
dx dy dz
i=1 j=0
= σ (y − x) , = x (ρ − z ) − y, = xy − β z .
  
Reconstruction (17)
dt dt dt
+ λ1 MSE φij , G ◦ Kj F φi0 +λ2 ∥K∥22 .
( ( ))
(16) We use the classical parameters of ρ = 28, σ = 10, β = 8/3. For
     
Dynamics Decay this numerical example, we wish to develop a surrogate model
276
Fig. 4. Fully-connected embedding network with ReLU activation functions for the Lorenz system.
Fig. 5. Four test case predictions using the transformer model for 320 time-steps.
for predicting the Lorenz system given a random initial state same structure, which qualitatively indicates that the transformer
x0 ∼ U (−20, 20), y0 ∼ U (−20, 20) and z0 ∼ U (10, 40). In other indeed maintains physical dynamics.
words, we wish to surrogate model various initial value problems The proposed transformer and alternative models’ relative
for this system of ODEs. The Lorenz system is used because of its mean squared errors for the test set are plotted in Fig. 7(a) as
well known chaotic dynamics which make it extremely sensitive a function of time and listed in Table 1 segmented into several
to numerical perturbations and thus an excellent benchmark for intervals based on the transformer’s context length. In general,
assessing a machine learning model’s accuracy. we can see all deep learning models perform well in the time-
A total of four alternative machine learning models are imple- series length for which they were trained with the deep Koopman
model performing the best followed by the transformer. As we
mented: a fully-connected auto-regressive model, a fully-
extrapolate our predictions past the trained context range, the
connected LSTM model, a deep neural network Koopman model
benefits of the transformer become apparent with it achieving
and lastly an echo-state model. All of these types of models
the best accuracy for later times. To quantify accuracy of the
have been proposed in past literature for predicting various
chaotic dynamics of each model, the Lorenz map is plotted in
physical systems (e.g. auto-regressive Geneva & Zabaras, 2020a, Fig. 7(b) which is a well-defined relation between successive z
fully-connected LSTMs Zhao, Deng, Cai, & Chen, 2019, deep Koop- local maxima despite the Lorenz’s chaotic nature. Calculated us-
man Lusch et al., 2018; Otto & Rowley, 2019 and echo-state mod- ing 25k time-step predictions from each model, again we can see
els Chattopadhyay, Hassanzadeh, Subramanian, & Palem, 2020; that the transformer model agrees the best with the numerical
Lukoševičius, 2012). Each is provided the same training, valida- solver indicating that it has learned the best physical dynamics
tion and testing data sets containing 2048, 64 and 256 time-series of all the tested models.
at a time-step size of ∆t = 0.01 solved using a Runge–Kutta Additionally, we test the transformer’s sensitivity to contam-
numerical solver, respectively. The training data set contains inated data by adding white noise to the training observations
time-series of 256 time-steps while the validation and testing scaled by the magnitude of the state variables. Each model tested
data sets have 1024 time-steps. Each model is allowed to train for with clean data is retrained with data perturbed by 1% and 5%
500 epochs if applicable. The proposed transformer and embed- noise. The effects of these two different noise levels are qualita-
ding model are trained for 200 and 300 epochs, respectively, with tively illustrated in Appendix A. The errors are listed in Table 2. In
an embedding dimension of 32. The embedding model is a simple general, we can indeed see that the transformer can still perform
fully-connected encoder–decoder model, F : R3 → R32 ; G : adequately with noisy data by being the best performing model
R32 → R3 , illustrated in Fig. 4. The transformer is trained with a for 1% noise and still being competitive, particularly at later
time-steps, with 5% noise.
context length of 64 with 4 transformer decoder layers. Both the
The self-attention vectors, αi ∈ R64 , for a single time-series
training and validation data sets were chunked into a set of 64
prediction of 512 steps are plotted in Fig. 8. For time-steps over
steps for the alternative models, when applicable, to train on the
i ≥ 64, the full context length of the transformer is used with
same context length.
the attention weight αi,64 corresponding to the most recent time-
We plot four separate test cases in Fig. 5 for which only step. Note that the transformer has learned multi-scale tem-
the initial state is provided and the transformer model predicts poral dependencies not achievable with other machine learn-
320 time-steps. Several test predictions for alternative models ing architectures. Additionally, this verifies that each attention
are provided in Appendix A. In general, we can see that the head is learning different dependencies suggesting that each head
transformer model is able to yield extremely accurate predictions may be a particular component of the transformer’s latent dy-
even beyond its context length. Additionally, we plot the Lorenz namics. This aligns with what is observed in NLP when using
solution for 25k time-steps from a numerical solver and predicted multi-head attention which increases the transformer’s predictive
from the transformer model in Fig. 6. Note that both have the accuracy (Vaswani, et al., 2017).
277
Fig. 6. Lorenz solution of 25k time-steps with ∆t = 0.01.
Fig. 7. (a) The test relative mean-squared-error (MSE) with respect to time. (b) The Lorenz map produced by each model.
Table 1
Test set relative mean-squared-error (MSE) for surrogate modeling the Lorenz system at several
time-step intervals.
Model Parameters Relative MSE
[0 − 64) [64 − 128) [128 − 192)
Transformer 36k/54ka 0.0003 0.0060 0.0221
LSTM 103k 0.0041 0.0175 0.0369
Autoregressive 92k 0.0057 0.0253 0.0485
Echo state 7.5k/6.3 mb 0.1026 0.1917 0.2209
Koopman 108k 0.0001 0.0962 2.0315
a
Learnable parameters for the embedding/transformer model.
b
Learnable output parameters/fixed input and reservoir parameters.
Table 2
Test set relative mean-squared-error (MSE) for surrogate modeling the Lorenz system at several time-step intervals with noisy data.
Model Relative MSE 1% Noise Relative MSE 5% Noise
[0 − 64) [64 − 128) [128 − 192) [0 − 64) [64 − 128) [128 − 192)
Transformer 0.0021 0.0216 0.0429 0.0210 0.0759 0.1292
LSTM 0.0045 0.0218 0.0437 0.0212 0.0758 0.1324
Autoregressive 0.0114 0.0417 0.0901 0.0760 0.2060 0.2065
Echo state 0.0859 0.1686 0.2102 0.1000 0.1581 0.2051
Koopman 0.0047 0.1192 0.1597 0.0200 0.0787 0.1906
3.2. 2D fluid dynamics Reynolds number between Re ∼ U (100, 750). This problem is a
fairly classical flow to investigate in scientific machine learning
The next dynamical system we will test is transient 2D fluid with various levels of difficulty (Geneva & Zabaras, 2020b; Han,
flow governed by the Navier–Stokes equations: Wang, Zhang, & Chen, 2019; Lee & You, 2017; Lusch et al., 2018;
∂ ui ∂ ui 1 ∂p ∂ 2 ui Morton, Jameson, Kochenderfer, & Witherden, 2018; Xu & Du-
+ uj =− +ν , (18) raisamy, 2020). Here we choose one of the more difficult forms:
∂t ∂ xj ρ ∂ xi ∂ xj ∂ xj model the flow starting at a steady state flow field at t = 0.
in which ui and p are the velocity and pressure, respectively. ν Meaning the model is provided zero information on the structure
is the viscosity of the fluid. We consider modeling the classical of the cylinder wake during testing other than the viscosity.
problem of flow around a cylinder at various Reynolds number Training, validation and testing data is obtained using the
defined by Re = uin d/ν in which uin = 1 and d = 2 are the OpenFOAM simulator (Jasak, Jemcov, Tukovic, et al., 2007), from
inlet velocity and cylinder diameter, respectively. In this work, we which a rectangular structured sub-domain is sampled centered
choose to develop a surrogate model to predict the solution at any around the cylinder’s wake. Given that the flow is two-
278
Fig. 8. Attention vectors of the transformer model for the prediction of a single test case of the Lorenz system. The top contours illustrate the attention weights at
each time-step; the bottom shows the respective state variables.
Fig. 9. 2D convolutional embedding network with ReLU activation functions for the flow around a cylinder system consisting of 5 convolutional encoding/decoding
layers. Each convolutional operator has a kernel size of (3, 3). In the decoder, the feature maps are up-sampled before applying a standard convolution. Additionally,
two auxiliary fully-connected networks are used to predict the diagonal and off-diagonal elements of the Koopman operator for each viscosity ν.
dimensional, our( model will) predict the x-velocity, y-velocity and R128 ; G : R128 → R3×64×128 . The second embedding method is us-
pressure fields, ux , uy , p ∈ R3×64×128 . The training, validation ing the same convolutional auto-encoder model but without the
and test data sets consist of 27, 6 and 7 fluid flow simulations, re- enforcement of Koopman dynamics on the embedded variables.
spectively with 400 time-steps each at a physical time-step size of The third embedding method tested was principal component
∆t = 0.5. As a base line model, we train a convolutional encoder– analysis (PCA) as a classical baseline. For each embedding method
decoder model with a stack of three convolutional LSTMs (Shi, an identical transformer model with a context length of 128 and
Chen, Wang, Yeung, Wong, & WOO, 2015) in the center. The 4 transformer decoder layers is trained. Similar to the previous
input for this model consists of the velocity fields, pressure field example, the embedding models are trained for 300 epochs when
and viscosity of the fluid. We note that convolutional LSTMs applicable and the transformer is trained for 200. To further
isolate the impact of the transformer model from the embedding
have been used extensively in recent scientific machine learn-
model, we also train a fully-connected model with 4 LSTM cells
ing literature for modeling various physical systems including
using the Koopman embedding as an input for 200 epochs.
fluid dynamics (Geneva & Zabaras, 2020b; Han et al., 2019;
Each trained model is tested on the test set by providing the
Maulik, Lusch, & Balaprakash, 2021; Tang et al., 2020; Wiewel, initial laminar state at t = 0 with the fluid viscosity and allowing
Becher, & Thuerey, 2019), thus can be considered a state-of-the- the model to predict 400 time-steps into the future. Two test
art approach. The convolutional LSTM model is trained for 500 predictions using the proposed transformer model with Koopman
epochs. embeddings are plotted in Fig. 10 in which the predicted vorticity
Additionally, three different embedding methods are imple- fields are in good agreement with the true solution. The test set
mented: the first is the proposed Koopman observable embed- relative mean square error for each output field for each model
ding using a convolutional auto-encoder illustrated in Fig. 9. This is plotted in Fig. 11. The errors of each field over the entire
model encodes the fluid x-velocity, y-velocity, pressure and vis- time-series are listed in Table 3. Additional results are provided
cosity fields to an embedded dimension of 128, F : R4×64×128 → in Appendix B.
279
Fig. 10. Vorticity, ω = ∇x uy − ∇y ux , of two test case predictions using the proposed transformer with Koopman embeddings at Reynolds numbers 233 (top) and
633 (bottom).
Fig. 11. Test set relative mean-squared-error (MSE) of the transformer with Koopman (KM), auto-encoder (AE) and PCA embedding methods, the convolutional LSTM
model (ConvLSTM) and fully-connected LSTM with Koopman embeddings (LSTM-KM).
Table 3
Test set relative mean-squared-error (MSE) of each output field for surrogate modeling 2D fluid
flow past a cylinder. Models listed include the transformer with Koopman (KM), auto-encoder (AE)
and PCA embedding methods, the convolutional LSTM model (ConvLSTM) and fully-connected LSTM
with Koopman embeddings (LSTM-KM).
Model Parameters Relative MSE [0 − 400]
ux uy p
Transformer-KM 224k/628ka 0.0042 0.0198 0.0021
Transformer-AE 224k/628ka 0.0288 0.1068 0.0161
Transformer-PCA 3.1m/628ka 0.0247 0.0984 0.0117
ConvLSTM 934k 0.0240 0.0938 0.0103
LSTM-KM 224k/759kb 0.0266 0.1155 0.0094
a
Learnable parameters for the embedding/ transformer model.
b
Learnable parameters for the embedding/ LSTM model.
For all alternative models, a rapid error increase can be seen note that the Koopman principle subspace has a more consistent
between t = [0, 100] which is due to the transition from the structure for later time-steps indicating that the Koopman loss
laminar flow into vortex shedding. This error then plateaus since term does indeed encourage the discovery of common dynamical
each model is able to produce stable vortex shedding, as illus- modes. Yet for early time-steps, the Koopman model has a unique
trated in Figs. B.18 & B.19 in Appendix B. The proposed trans- trajectory between Reynolds numbers. This lack of uniqueness
former with Koopman embeddings is the only model that can with the PCA embedding at early time-steps results in the trans-
accurately match the instantaneous states from the numerical former not being able to differentiate between Reynolds num-
solution. These results indicate that the performance of the trans- bers, lowering predictive accuracy with this embedding method.
former is highly dependent on the embedding method used, The Koopman approach strikes a balance between structure but
which should be expected. However, the transformer with self- uniqueness between flows for the transformer to learn with.
attention is equally important to the model’s success as indicated
by the performance of the Koopman embedding LSTM model. 3.3. 3D Reaction–diffusion dynamics
Compared to the widely used ConvLSTM model, the proposed
transformer offers more reliable predictions for this fluid flow The final numerical example to demonstrate the proposed
with less learnable parameters. transformer model is a 3D reaction–diffusion system governed by
To gain a greater understanding of why the Koopman em- the Gray–Scott model:
bedding performs better than the alternatives, we perform linear
dimensionality reduction of the embedded states using PCA into ∂u ∂ 2u ∂v ∂ 2v
two principle components, ξ̃ 1 , ξ̃ 2 , in Fig. 12. For all embedding = ru 2 − uv 2 + f (1 − u) , = rv 2 + v u2 − (f + k) v,
∂t ∂ xi ∂t ∂ xi
methods, a circular structure is present reflecting the periodic dy-
namics of the vortex shedding. Compared to the auto-encoder, we (19)
280
Fig. 12. The two-dimensional principle subspace of the embedded vectors, ξ i , from each tested embedding model for two different Reynolds numbers.
Fig. 13. 3D convolutional embedding network with leaky ReLU activation functions for the Gray–Scott system. Batch-normalization used between each of the
convolutional layers. In the decoder, the feature maps are up-sampled before applying a standard 3D convolution.
in which u and v are the concentration of two species, ru and rv Table 4

are their respective diffusion rates, k is the kill rate and f is the Test set relative mean-squared-error (MSE) for surrogate modeling 3D
Gray–Scott system.
feed rate. This is a classical system of particular application to
Model Layers Parameters Relative MSE [0 − 200]
chemical processes as it models the following reaction: U + 2V →
3V ; V → P. For a set region of feed and kill rates, this seemingly u v
simple system can yield a wide range of complex dynamics (Lee, Transformer 2 6.2m/6.6ma 0.0159 0.0120
Transformer 4 6.2m/12.9ma 0.0154 0.0130
McCormick, Ouyang, & Swinney, 1993; Pearson, 1993). Hence,
Transformer 8 6.2m/25.5ma 0.0125 0.0101
under the right settings, this system is an excellent case study
a
to push the proposed methodology to its predictive limits. In Learnable parameters for the embedding/ transformer model.
this work, we will use the parameters: ru = 0.2, rv = 0.1,
k = 0.055 and f = 0.025 which results in a complex dynamical
reaction. Akin to the first numerical example, the initial condition
of this system is stochastic such that the system is seeded with 3 convolutional encoder–decoder is used to embed the two species
randomly placed perturbations within the periodic domain. into a 512 embedding dimension, F : R2×32×32×32 → R512 ; G :
Training, validation and testing data are obtained from a R512 → R2×32×32×32 , illustrated in Fig. 13.
Runge–Kutta finite difference simulation on a structured grid, Transformer models with varying depth are trained all with a
(u, v) ∈ R2×32×32×32 . The training, validation and test data sets context length of 128. All other model and training parameters
consist of 512, 16 and 56 time-series, respectively, with 200 for the transformer models are consistent. A test prediction using
time-steps each at a physical time-step size of ∆t = 5. A 3D the transformer model is shown in Fig. 14 and the errors for
281
Fig. 14. Test case volume plots for the Gray–Scott system. Isosurfaces displayed span the range u, v = [0.3, 0.5] to show the inner structure.
each trained transformer are listed in Table 4. Despite this system Declaration of competing interest
having complex dynamics in 3D space, the transformer is able
to produce acceptable predictions with very similar structures as The authors declare that they have no known competing finan-
the numerical solver. To increase the model’s predictive accuracy cial interests or personal relationships that could have appeared
we believe the limitation here is not in the transformer, but to influence the work reported in this paper.
rather the number of training data and the inaccuracies of the
embedding model due to the dimensionality reduction needed.
This is supported by the fact that increasing the transformer’s Acknowledgments
depth does not yield considerable improvements for the test
errors in Table 4. Additional results are provided in Appendix C. The anonymous reviewers are thanked for their significant
effort in improving and clarifying this manuscript. The work
4. Conclusion reported here was initiated from the Defense Advanced Research
Projects Agency (DARPA) under the Physics of Artificial Intel-
While transformers and self-attention models have been es- ligence (PAI) program (contract HR00111890034). The authors
tablished as a powerful framework for NLP tasks, the adoption acknowledge computing resources provided by the AFOSR Office
of such methodologies has yet to fully permute other fields. of Scientific Research through the DURIP program and by the
In this work, we have demonstrated the potential transform- University of Notre Dame’s Center for Research Computing (CRC).
ers have for modeling dynamics of physical phenomena. The The work of NG was supported by the National Science Founda-
transformer architecture allows the model to learn longer and tion (NSF), USA Graduate Research Fellowship Program grant No.
more complex temporal dependencies than alternative machine DGE-1313583.
learning methods. Such models can be particularly beneficial for
physical systems that exhibit dynamics that evolve on multi-
Appendix A. Lorenz supplementary results
ple time scales or consist of multiple phases such as turbulent
fluid flow, chemical reactions or molecular dynamics. This can
be attributed to the transformer’s ability to draw information Predictions of the tested models for three test cases are plotted
from many past time-steps directly with self-attention, learn- in Fig. A.15. All machine learning methods are accurate for the
ing accurate time integration without computationally expensive first several time-steps, but begin to quickly deviate from the
recurrent connections. numerical solution in subsequent time-steps. The transformer
The key challenge of using the transformer model is identi- model is consistently accurate within the plotted time frame,
fying appropriate embeddings to represent the physical state of typically outperforming alternative methods at later time-steps.
the system, for which we propose the use of Koopman dynamics The predictions of the transformer model are also compared to
to enforce dynamical context. Using the proposed methods, our the solution from a less accurate Euler time-integration method
transformer surrogate can outperform alternative models widely in Fig. A.16. Due to the numerical differences between the Runge–
seen in recent scientific machine learning literature. The inves- Kutta ground truth and the Euler methods, the solutions are quick
tigation of unsupervised pre-training of such transformer models to deviate and the transformer clearly outperforms the Euler
as well as gaining a better understanding of what attention mech- numerical simulator. Lastly, in Fig. A.17 we illustrate the two
anisms imply for physical systems will be the subject of works in different noise levels used to contaminate the training data for
the future. the results listed in Table 2.
282
Fig. A.15. Three Lorenz test case predictions using each tested model for 128 time-steps.
Fig. A.16. Three Lorenz test solutions solved using Runge–Kutta and Euler time-integration methods with the prediction of the transformer for 256 time-steps.
Fig. A.17. Comparison between the clean and contaminated (noisy) training data.
Appendix B. Cylinder supplementary results region, t < 100, before the system enters the periodic vortex
shedding. Once in the periodic region, we can see that the higher
Prediction fields of all the tested models for a single test Reynolds number has a higher frequency which reflects the in-
case are plotted in Figs. B.18 and B.19. While all models are creased vortex shedding speed.
able to perform adequately with qualitatively good results, the
transformer model with Koopman embeddings (Transformer-KM) Appendix C. Gray–Scott supplementary results
outperforms alternatives. Additionally, the training and validation
errors during training are plotted in Fig. B.20 where all models Volume plots of two test cases for both species are provided
have similar training and validation error indicating minimal in Figs. C.22 & C.23. In addition to better visualize the accuracy of
overfitting. the trained transformer model, contour plots for three test cases
The evolution of the flow field projected onto the dominant are also provided in Fig. C.24. The transformer model is able to
eigenvectors of the learned Koopman operator for two Reynolds reliably predict the earlier time-steps with reasonable accuracy.
numbers is plotted in Fig. B.21. This reflects the dynamical modes As the system continues to react, the reaction fronts begin to
that were learned by the embedding model to impose physical interact resulting in the well-known complex structures of the
‘‘context’’ on the embeddings. Given that the eigenvectors are Gray–Scott system (Lee et al., 1993; Pearson, 1993). During these
complex, we plot both the magnitude, |ψ|, and angle, ̸ ψ . For later times, the transformer begins to degrade in accuracy yet
both Reynolds numbers, it is easy to see the initial transition does maintain notable consistency with the true solution.
283
Fig. B.18. Velocity magnitude predictions of a test case at Re = 633 using the transformer with Koopman (KM), auto-encoder (AE) and PCA embedding methods,
the convolutional LSTM model (ConvLSTM) and fully-connected LSTM with Koopman embeddings (LSTM-KM).
Fig. B.19. Pressure predictions of a test case at Re = 633 using the transformer with Koopman (KM), auto-encoder (AE) and PCA embedding methods, the convolutional
LSTM model (ConvLSTM) and fully-connected LSTM with Koopman embeddings (LSTM-KM).
Fig. B.20. The training and validation mean square error (MSE) of the transformer with Koopman (KM), auto-encoder (AE) and PCA embedding methods, the
convolutional LSTM model (ConvLSTM) and fully-connected LSTM with Koopman embeddings (LSTM-KM) during training.
284
Fig. B.21. The dynamics of the fluid flow around a cylinder projected onto the 8 most dominant eigenvectors of the learned Koopman operator, K, in the embedding
model.
Fig. C.22. Test case volume plots for the Gray–Scott system. Isosurfaces displayed span the range u, v = [0.3, 0.5] to show the inner structure.
285
Fig. C.23. Test case volume plots for the Gray–Scott system. Isosurfaces displayed span the range u, v = [0.3, 0.5] to show the inner structure.
286
Fig. C.24. x–y plane contour plots of three Gray–Scott test cases sliced at z = 16.
References Bilionis, I., Zabaras, N., Konomi, B. A., & Lin, G. (2013). Multi-output separable
Gaussian process: Towards an efficient, fully Bayesian paradigm for uncer-
tainty quantification. Journal of Computational Physics, 241, 212–239. http:
Atkinson, S., & Zabaras, N. (2019). Structured Bayesian Gaussian process //dx.doi.org/10.1016/j.jcp.2013.01.011, URL http://www.sciencedirect.com/
latent variable model: Applications to data-driven dimensionality reduc- science/article/pii/S0021999113000417.
tion and high-dimensional inversion. Journal of Computational Physics, Brunton, S. L., Budišić, M., Kaiser, E., & Kutz, J. N. (2021). Modern Koopman
383, 166–195. http://dx.doi.org/10.1016/j.jcp.2018.12.037, URL http://www. theory for dynamical systems. arXiv preprint arXiv:2102.12086.
sciencedirect.com/science/article/pii/S0021999119300397.
Chakraborty, S., & Zabaras, N. (2018). Efficient data-driven reduced-order mod-
Bahdanau, D., Cho, K., & Bengio, Y. (2015). Neural machine translation by jointly els for high-dimensional multiscale dynamical systems. Computer Physics
learning to align and translate. In 3rd International conference on learning Communications, 230, 70–88. http://dx.doi.org/10.1016/j.cpc.2018.04.007, URL
representations, arXiv:1409.0473. http://www.sciencedirect.com/science/article/pii/S0010465518301176.
Bilionis, I., & Zabaras, N. (2016). BayesIan uncertainty propagation using Gaussian Chattopadhyay, A., Hassanzadeh, P., Subramanian, D., & Palem, K. (2020). Data-
processes. In R. Ghanem, D. Higdon, & H. Owhadi (Eds.), Handbook of driven prediction of a multi-scale Lorenz 96 chaotic system using a hierarchy
uncertainty quantification (pp. 1–45). Springer International Publishing, http: of deep learning methods: Reservoir computing, ANN, and RNN-LSTM.
//dx.doi.org/10.1007/978-3-319-11259-6_16-1, URL https://doi.org/10.1007/ Nonlinear Processes in Geophysics, 27, 373–389. http://dx.doi.org/10.5194/npg-
978-3-319-11259-6_16-1. 27-373-2020, URL https://doi.org/10.5194/npg-27-373-2020.
287
Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., et al. (2020). Generative Korda, M., & Mezić, I. (2018). Linear predictors for nonlinear dynamical
pretraining from pixels. In H. D. III, A. Singh (Eds.), Proceedings of machine systems: Koopman operator meets model predictive control. Automatica,
learning research: 119, Proceedings of the 37th international conference on 93, 149–160. http://dx.doi.org/10.1016/j.automatica.2018.03.046, URL http:
machine learning (pp. 1691–1703). PMLR, URL http://proceedings.mlr.press/ //www.sciencedirect.com/science/article/pii/S000510981830133X.
v119/chen20s.html. Korda, M., Putinar, M., & Mezić, I. (2020). Data-driven spectral analysis
Chen, R. T. Q., Rubanova, Y., Bettencourt, J., & Duvenaud, D. K. (2018). of the Koopman operator. Applied and Computational Harmonic Analy-
Neural ordinary differential equations. In S. Bengio, H. Wallach, sis, 48(2), 599–629. http://dx.doi.org/10.1016/j.acha.2018.08.002, URL http:
H. Larochelle, K. Grauman, N. Cesa-Bianchi, & R. Garnett (Eds.), 31, //www.sciencedirect.com/science/article/pii/S1063520318300988.
Advances in neural information processing systems (pp. 6571–6583). Lee, K. J., McCormick, W. D., Ouyang, Q., & Swinney, H. L. (1993). Pattern
Curran Associates, Inc., URL https://proceedings.neurips.cc/paper/2018/ formation by interacting chemical fronts. Science, 261(5118), 192–194. http://
file/69386f6bb1dfed68692a24c8686939b9-Paper.pdf. dx.doi.org/10.1126/science.261.5118.192, URL https://science.sciencemag.org/
Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. content/261/5118/192.
Mathematics of Control, Signals, and Systems, 2(4), 303–314. http://dx.doi.org/ Lee, S., & You, D. (2017). Prediction of laminar vortex shedding over a cylinder
10.1007/BF02551274. using deep learning. arXiv preprint arXiv:1712.07854.
Dai, Z., Yang, Z., Yang, Y., Carbonell, J. G., Le, Q., & Salakhutdinov, R. (2019). Li, Q., Dietrich, F., Bollt, E. M., & Kevrekidis, I. G. (2017). Extended dynamic mode
Transformer-XL: Attentive language models beyond a fixed-length context. decomposition with dictionary learning: A data-driven adaptive spectral
In Proceedings of the 57th Annual Meeting of the Association for Computa- decomposition of the Koopman operator. Chaos. An Interdisciplinary Jour-
tional Linguistics (pp. 2978–2988). http://dx.doi.org/10.18653/v1/P19-1285, nal of Nonlinear Science, 27(10), Article 103111. http://dx.doi.org/10.1063/1.
URL https://aclanthology.org/P19-1285. 4993854.
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). Bert: Pre-training of Li, Y., He, H., Wu, J., Katabi, D., & Torralba, A. (2020). Learning compositional
deep bidirectional transformers for language understanding. In NAACL-HLT Koopman operators for model-based control. In International Conference on
(pp. 4171–4186). Association for Computational Linguistics, http://dx.doi.org/ Learning Representations. URL https://openreview.net/forum?id=H1ldzA4tPr.
10.18653/v1/N19-1423, URL https://aclanthology.org/N19-1423. Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., et al. (2019). Roberta:
Dupont, E., Doucet, A., & Teh, Y. W. (2019). Augmented neural ODEs. In A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.
H. Wallach, H. Larochelle, A. Beygelzimer, F. d’ Alché-Buc, E. Fox, & R. Garnett 11692.
(Eds.), 32, Advances in neural information processing systems (pp. 3140–3150). Lu, Y., Zhong, A., Li, Q., & Dong, B. (2018). Beyond finite layer neural net-
Curran Associates, Inc., URL https://proceedings.neurips.cc/paper/2019/file/ works: Bridging deep architectures and numerical differential equations. In
21be9a4bd4f81549a9d1d241981cec3c-Paper.pdf. International conference on machine learning (pp. 3276–3285). PMLR.
Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., et al. (2019). Dual Lukoševičius, M. (2012). A practical guide to applying echo state networks. In
attention network for scene segmentation. In Proceedings of the IEEE Neural Networks: Tricks of the Trade (pp. 659–686). Springer, http://dx.doi.
Conference on Computer Vision and Pattern Recognition (pp. 3146–3154). org/10.1007/978-3-642-35289-8_36.
URL https://openaccess.thecvf.com/content_CVPR_2019/papers/Fu_Dual_ Luong, T., Pham, H., & Manning, C. D. (2015). Effective approaches to attention-
Attention_Network_for_Scene_Segmentation_CVPR_2019_paper.pdf. based neural machine translation. In EMNLP (pp. 1412–1421). arXiv:508.
Gage, P. (1994). A new algorithm for data compression. 12, (2), (pp. 23– 04025, URL http://aclweb.org/anthology/D/D15/D15-1166.pdf.
38). McPherson, KS: R & D Publications, c1987-1994., URL https://www. Lusch, B., Kutz, J. N., & Brunton, S. L. (2018). Deep learning for universal
derczynski.com/papers/archive/BPE_Gage.pdf, linear embeddings of nonlinear dynamics. Nature Communications, 9(1), 1–10.
http://dx.doi.org/10.1038/s41467-018-07210-0.
Gao, H., Wang, J.-X., & Zahr, M. J. (2020). Non-intrusive model reduction
of large-scale, nonlinear dynamical systems using deep learning. Physica Maulik, R., Egele, R., Lusch, B., & Balaprakash, P. (2020). Recurrent neural
D: Nonlinear Phenomena, 412, Article 132614. http://dx.doi.org/10.1016/ network architecture search for geophysical emulation. In SC20: International
j.physd.2020.132614, URL http://www.sciencedirect.com/science/article/pii/ conference for high performance computing, networking, storage and analysis
S0167278919305573. (pp. 1–14). http://dx.doi.org/10.1109/SC41405.2020.00012, URL https://doi.
org/10.1109/SC41405.2020.00012.
Geneva, N., & Zabaras, N. (2020a). Modeling the dynamics of PDE systems with
Maulik, R., Lusch, B., & Balaprakash, P. (2021). Reduced-order modeling of
physics–constrained deep auto–regressive networks. Journal of Computational
advection-dominated systems with recurrent neural networks and convo-
Physics, 403, Article 109056. http://dx.doi.org/10.1016/j.jcp.2019.109056, URL
lutional autoencoders. Physics of Fluids, 33(3), Article 037106. http://dx.doi.
https://www.sciencedirect.com/science/article/pii/S0021999119307612.
org/10.1063/5.0039986.
Geneva, N., & Zabaras, N. (2020b). Multi-fidelity generative deep learning
Melamud, O., Goldberger, J., & Dagan, I. (2016). Context2vec: Learning generic
turbulent flows. Foundations of Data Science, 2, 391. http://dx.doi.org/
context embedding with bidirectional LSTM. In Proceedings of the 20th SIGNLL
10.3934/fods.2020019, URL http://aimsciences.org/article/id/3a9f3d14-3421-
Conference on Computational Natural Language Learning (pp. 51–61). URL
4947-a45f-a9cc74edd097.
https://aclanthology.org/K16-1006.pdf.
González-García, R., Rico-Martínez, R., & Kevrekidis, I. (1998). Identifica-
Mezic, I. (2020). On numerical approximations of the koopman operator. arXiv
tion of distributed parameter systems: A neural net based approach.
preprint arXiv:2009.05883.
Computers & Chemical Engineering, 22, S965 – S968. http://dx.doi.org/10.
Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of
1016/S0098-1354(98)00191-4, European Symposium on Computer Aided
word representations in vector space. In Workshop proceedings international
Process Engineering-8, URL http://www.sciencedirect.com/science/article/pii/
conference on learning representations.
S0098135498001914.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013).
Graves, A., Wayne, G., & Danihelka, I. (2014). Neural turing machines. arXiv
Distributed representations of words and phrases and their
preprint arXiv:1410.5401.
compositionality. In Advances in neural information processing systems
Han, R., Wang, Y., Zhang, Y., & Chen, G. (2019). A novel spatial-temporal (pp. 3111–3119). URL https://proceedings.neurips.cc/paper/2013/file/
prediction method for unsteady wake flows based on hybrid deep neural 9aa42b31882ec039965f3c4923ce901b-Paper.pdf.
network. Physics of Fluids, 31(12), Article 127101. http://dx.doi.org/10.1063/
Mo, S., Zhu, Y., Zabaras, N., Shi, X., & Wu, J. (2019). Deep convolutional encoder-
1.5127247.
decoder networks for uncertainty quantification of dynamic multiphase flow
Hornik, K. (1991). Approximation capabilities of multilayer feedforward in heterogeneous media. Water Resources Research, 55(1), 703–728. http:
networks. Neural Networks, 4(2), 251–257. http://dx.doi.org/10.1016/0893- //dx.doi.org/10.1029/2018WR023528.
6080(91)90009-T, URL https://www.sciencedirect.com/science/article/pii/ Morton, J., Jameson, A., Kochenderfer, M. J., & Witherden, F. (2018).
089360809190009T. Deep dynamical modeling and control of unsteady fluid flows. In
Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward net- S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, &
works are universal approximators. Neural Networks, 2(5), 359–366. http: R. Garnett (Eds.), 31, Advances in neural information processing systems.
//dx.doi.org/10.1016/0893-6080(89)90020-8, URL https://www.sciencedirect. Curran Associates, Inc., URL https://proceedings.neurips.cc/paper/2018/file/
com/science/article/pii/0893608089900208. 2b0aa0d9e30ea3a55fc271ced8364536-Paper.pdf.
Jasak, H., Jemcov, A., Tukovic, Z., et al. (2007). OpenFOAM: A C++ Library for Otto, S. E., & Rowley, C. W. (2019). Linearly recurrent autoencoder networks
complex physics simulations. 1000, In International workshop on coupled for learning dynamics. SIAM Journal on Applied Dynamical Systems, 18(1),
methods in numerical dynamics (pp. 1–20). IUC Dubrovnik Croatia. 558–593. http://dx.doi.org/10.1137/18M1177846.
Kitaev, N., Kaiser, L., & Levskaya, A. (2020). Reformer: The efficient trans- Pearson, J. E. (1993). Complex patterns in a simple system. Science,
former. In International Conference on Learning Representations. URL https: 261(5118), 189–192. http://dx.doi.org/10.1126/science.261.5118.189, URL
//openreview.net/forum?id=rkgNKkHtvB. https://science.sciencemag.org/content/261/5118/189.
Koopman, B. O. (1931). Hamiltonian systems and transformation in Hilbert space. Pennington, J., Socher, R., & Manning, C. D. (2014). Glove: Global vectors for word
Proceedings of the National Academy of Sciences of the United States of America, representation. In Proceedings of the 2014 conference on empirical methods in
17(5), 315. http://dx.doi.org/10.1073/pnas.17.5.315. natural language processing (EMNLP) (pp. 1532–1543).
288
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., et al. Tripathy, R. K., & Bilionis, I. (2018). Deep UQ: Learning deep neural
(2018). Deep contextualized word representations. In Proceedings of the 2018 network surrogate models for high dimensional uncertainty quantifica-
conference of the North American chapter of the association for computational tion. Journal of Computational Physics, 375, 565–588. http://dx.doi.org/
linguistics: Human language technologies, volume 1 (long papers) (pp. 2227– 10.1016/j.jcp.2018.08.036, URL http://www.sciencedirect.com/science/article/
2237). Association for Computational Linguistics, http://dx.doi.org/10.18653/ pii/S0021999118305655.
v1/N18-1202, URL https://aclanthology.org/N18-1202. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N.,
Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving et al. (2017). Attention is all you need. In I. Guyon, U. V. Luxburg,
language understanding by generative pre-training. OpenAI Blog, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, R. Garnett
URL https://openai-assets.s3.amazonaws.com/research-covers/language- (Eds.), 30, Advances in neural information processing systems. Curran
unsupervised/language_understanding_paper.pdf. Associates, Inc., URL https://proceedings.neurips.cc/paper/2017/file/
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. 3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
(2019). Language models are unsupervised multitask learners. OpenAI Veliçković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., & Bengio, Y.
Blog, URL https://cdn.openai.com/better-language-models/language_models_ (2018). Graph attention networks. In International conference on learning
are_unsupervised_multitask_learners.pdf. representations, URL https://openreview.net/forum?id=rJXMpikCZ.
Sanchez-Gonzalez, A., Godwin, J., Pfaff, T., Ying, R., Leskovec, J., & Battaglia, P. Wang, Y.-J., & Lin, C.-T. (1998). Runge-kutta neural network for identification of
(2020). Learning to simulate complex physics with graph networks. In H. dynamical systems in high accuracy. IEEE Transactions on Neural Networks,
D. III, & A. Singh (Eds.), Proceedings of machine learning research: 119, Proceed- 9(2), 294–307. http://dx.doi.org/10.1109/72.661124.
ings of the 37th international conference on machine learning (pp. 8459–8468). Wessels, H., Weißenfels, C., & Wriggers, P. (2020). The neural particle method –
PMLR, URL http://proceedings.mlr.press/v119/sanchez-gonzalez20a.html. an updated Lagrangian physics informed neural network for computational
Shalova, A., & Oseledets, I. (2020). Tensorized transformer for dynamical systems fluid dynamics. Computer Methods in Applied Mechanics and Engineering,
modeling. arXiv preprint arXiv:2006.03445. 368, Article 113127. http://dx.doi.org/10.1016/j.cma.2020.113127, URL https:
Shi, X., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-k., & WOO, W.-c. (2015). //www.sciencedirect.com/science/article/pii/S0045782520303121.
Convolutional LSTM network: A machine learning approach for precip- Wiewel, S., Becher, M., & Thuerey, N. (2019). Latent space physics: Towards
itation nowcasting. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, & learning the temporal evolution of fluid flow. Computer Graphics Forum,
R. Garnett (Eds.), 28, Advances in neural information processing systems. 38(2), 71–82. http://dx.doi.org/10.1111/cgf.13620, URL https://onlinelibrary.
Curran Associates, Inc., URL https://proceedings.neurips.cc/paper/2015/file/ wiley.com/doi/abs/10.1111/cgf.13620.
07563a3fe3bbe7e3ba84431ad9d055af-Paper.pdf. Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., et al. (2019).
Stoer, J., & Bulirsch, R. (2013). vol. 12, Introduction to numerical analysis. Springer, HuggingFace’s Transformers: State-of-the-art natural language processing.
http://dx.doi.org/10.1007/978-1-4757-2272-7, URL https://doi.org/10.1007/ arXiv preprint arXiv:1910.03771.
978-1-4757-2272-7. Xiu, D., & Karniadakis, G. E. (2002). The Wiener-Askey polynomial chaos
Sukhbaatar, S., Grave, E., Bojanowski, P., & Joulin, A. (2019). Adaptive attention for stochastic differential equations. SIAM Journal on Scientific Computing,
span in transformers. In Proceedings of the 57th annual meeting of the associ- 24(2), 619–644. http://dx.doi.org/10.1137/S1064827501387826, URL https:
ation for computational linguistics (pp. 331–335). Florence, Italy: Association //doi.org/10.1137/S1064827501387826.
for Computational Linguistics, http://dx.doi.org/10.18653/v1/P19-1032, URL Xu, J., & Duraisamy, K. (2020). Multi-level convolutional autoencoder networks
https://aclanthology.org/P19-1032. for parametric prediction of spatio-temporal dynamics. Computer Methods in
Sukhbaatar, S., Grave, E., Lample, G., Jegou, H., & Joulin, A. (2019). Augmenting Applied Mechanics and Engineering, 372, Article 113379. http://dx.doi.org/10.
self-attention with persistent memory. arXiv preprint arXiv:1907.01470. 1016/j.cma.2020.113379, URL https://www.sciencedirect.com/science/article/
Takeishi, N., Kawahara, Y., & Yairi, T. (2017). Learning Koopman in- pii/S0045782520305648.
variant subspaces for dynamic mode decomposition. In I. Guyon, U. Zhang, H., Goodfellow, I., Metaxas, D., & Odena, A. (2019). Self-attention gen-
V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, & erative adversarial networks. In K. Chaudhuri, & R. Salakhutdinov (Eds.),
R. Garnett (Eds.), 30, Advances in neural information processing systems. Proceedings of machine learning research: 97, Proceedings of the 36th inter-
Curran Associates, Inc., URL https://proceedings.neurips.cc/paper/2017/file/ national conference on machine learning (pp. 7354–7363). PMLR, URL http:
3a835d3215755c435ef4fe9965a3f2a0-Paper.pdf. //proceedings.mlr.press/v97/zhang19d.html.
Tanaka, G., Yamane, T., Héroux, J. B., Nakane, R., Kanazawa, N., Takeda, S., et al. Zhao, J., Deng, F., Cai, Y., & Chen, J. (2019). Long short-term memory -
(2019). Recent advances in physical reservoir computing: A review. Neural fully connected (LSTM-FC) neural network for PM2.5 concentration
Networks, 115, 100–123. http://dx.doi.org/10.1016/j.neunet.2019.03.005, URL prediction. Chemosphere, 220, 486–492. http://dx.doi.org/10.1016/j.
http://www.sciencedirect.com/science/article/pii/S0893608019300784. chemosphere.2018.12.128, URL https://www.sciencedirect.com/science/
Tang, M., Liu, Y., & Durlofsky, L. J. (2020). A deep-learning-based surro- article/pii/S0045653518324639.
gate model for data assimilation in dynamic subsurface flow problems. Zhu, M., Chang, B., & Fu, C. (2018). Convolutional neural networks combined
Journal of Computational Physics, 413, Article 109456. http://dx.doi.org/10. with Runge-Kutta methods. arXiv preprint arXiv:1802.08831.
1016/j.jcp.2020.109456, URL https://www.sciencedirect.com/science/article/ Zhu, Y., & Zabaras, N. (2018). BayesIan deep convolutional encoder–decoder net-
pii/S0021999120302308. works for surrogate modeling and uncertainty quantification. Journal of Com-
putational Physics, 366, 415–447. http://dx.doi.org/10.1016/j.jcp.2018.04.018,
URL http://www.sciencedirect.com/science/article/pii/S0021999118302341.
289

Transformer For Modeling Physical Systems

Uploaded by

Copyright:

Available Formats

Transformer For Modeling Physical Systems

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Transformer For Modeling Physical Systems

Uploaded by

Copyright:

Available Formats

Neural Networks 146 (2022) 272–289

Contents lists available at ScienceDirect

Transformers for modeling physical systems

Fig. 6. Lorenz solution of 25k time-steps with ∆t = 0.01.

in which u and v are the concentration of two species, ru and rv Table 4

You might also like