A Primer on Motion Capture with Deep Learning: Principles, Pitfalls and Perspectives


Alexander Mathis1,2,3,5,* , Steffen Schneider3,4 , Jessy Lauer1,2,3 , and Mackenzie W. Mathis1,2,3,*
1 Center for Neuroprosthetics, Center for Intelligent Systems, Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland
2 Brain Mind Institute, School of Life Sciences, Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland
3 The Rowland Institute at Harvard, Harvard University, Cambridge, MA USA
4 University of Tübingen & International Max Planck Research School for Intelligent Systems, Germany
5 Lead Contact

* Corresponding authors: alexander.mathis@epfl.ch, mackenzie.mathis@epfl.ch

Extracting behavioral measurements non-invasively from video is stymied by the fact that it is a hard computational problem. Recent advances in deep learning have tremendously advanced predicting posture from videos directly, which quickly impacted neuroscience and biology more broadly. In this primer we review the budding field of motion capture with deep learning. In particular, we will discuss the principles of those novel algorithms, highlight their potential as well as pitfalls for experimentalists, and provide a glimpse into the future.

Introduction

The pursuit of methods to robustly and accurately measure animal behavior is at least as old as the scientific study of behavior itself (1). Trails of hominid footprints, "motion" captured by Pliocene deposits at Laetoli that date to 3.66 million years ago, firmly established that early hominoids achieved an upright, bipedal and free-striding gait (2). Beyond fossilized locomotion, behavior can now be measured in a myriad of ways: from GPS trackers, videography, to microphones, to tailored electronic sensors (3-5). Videography is perhaps the most general and widely-used method as it allows noninvasive, high-resolution observations of behavior (6-8). Extracting behavioral measures from video poses a challenging computational problem. Recent advances in deep learning have tremendously simplified this process (9, 10), which quickly impacted neuroscience (10, 11).

In this primer we review markerless (animal) motion capture with deep learning. In particular, we review principles of algorithms, highlight their potential, as well as discuss pitfalls for experimentalists, and compare them to alternative methods (inertial sensors, markers, etc.). Throughout, we also provide glossaries of relevant terms from deep learning and hardware. Furthermore, we will discuss how to use them, what pitfalls to avoid, and provide perspectives on what we believe will and should happen next.

What do we mean by "markerless motion capture?" While biological movement can also be captured by dense, or surface models (10, 12, 13), here we will almost exclusively focus on "keypoint-based pose estimation." Human and many other animals' motion is determined by the geometric structures formed by several pendulum-like motions of the extremities relative to a joint (6). Seminal psychophysics studies by Johansson showed that just a few coherently moving keypoints are sufficient to be perceived as human motion (6). This empirically highlights why pose estimation is a great summary of such video data. Which keypoints should be extracted, of course, dramatically depends on the model organism and the goal of the study; e.g., many are required for dense, 3D models (12-14), while a single point can suffice for analyzing some behaviors (10). One of the great advantages of deep learning based methods is that they are very flexible, and the user can define what should be tracked.

Principles of deep learning methods for markerless motion capture

In raw video we acquire a collection of pixels that are static in their location and have varying value over time. For analyzing behavior, this representation is sub-optimal: Instead, we are interested in properties of objects in the images, such as location, scale and orientation. Objects are collections of pixels in the video moving or being changed in conjunction. By decomposing objects into keypoints with semantic meaning—such as body parts in videos of human or animal subjects—a high-dimensional video signal can be converted into a collection of time series describing the movement of each keypoint (Figure 1). Compared to raw video, this representation is easy to analyze, and semantically meaningful for investigating behavior and addressing the original research question for which the data has been recorded.

Motion capture systems aim to infer keypoints from videos: In marker-based systems, this can be achieved by manually enhancing parts of interest (by colors, LEDs, reflective markers), which greatly simplifies the computer vision challenge, and then using classical computer vision tools to extract these keypoints. Markerless pose estimation algorithms directly map raw video input to these coordinates. The conceptual difference between marker-based and marker-less approaches is that the former requires special preparation or equipment, while the latter can even be applied post-hoc, but typically requires ground truth annotations of example images (i.e., a training set). Notably, markerless methods allow for extracting additional keypoints at a later stage, something that is not possible with markers (Figure 2).



Figure 1. Schematic overview of markerless motion capture or pose estimation. The pixel representation of an image (left) or sequence of images (video) is processed and converted into a list of keypoints (right). Semantic information about object identity and keypoint type is associated with the predictions. For instance, each keypoint is a structure with a name (e.g., ear), the x and y coordinates, as well as a confidence readout of the network (often included, but not in all pose estimation packages), and keypoints are then grouped according to individuals (subjects).

Fundamentally, a pose estimation algorithm can be viewed as a function that maps frames from a video into the coordinates of body parts. The algorithms are highly flexible with regard to what body parts are tracked. Typically the identity of the body parts (or objects) has semantically defined meaning (e.g., different finger knuckles, the head), and the algorithms can group them accordingly (namely, to assemble an individual) so that the posture of multiple individuals can be extracted simultaneously (Figure 1). For instance, for an image of one human the algorithm would return a list of pixel coordinates (these can have subpixel resolution) per body part and frame (and sometimes an uncertainty prediction; 18-20). Which body parts the algorithm returns depends on both the application and the training data provided—this is an important aspect with respect to how the algorithms can be customized for applications.
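To make this input-output relationship concrete, the sketch below shows the kind of interface such an algorithm exposes and the structure of what it returns. The function and field names are hypothetical (no specific package is implied); it merely illustrates that an image goes in and named keypoints with coordinates and a confidence come out, which can then be stacked into time series.

from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Keypoint:
    name: str          # semantic identity, e.g. "left_ear"
    x: float           # horizontal pixel coordinate (subpixel resolution is possible)
    y: float           # vertical pixel coordinate
    confidence: float  # network readout, when the package provides one
    individual: int    # index used to group keypoints into subjects

def estimate_pose(frame: np.ndarray) -> List[Keypoint]:
    """Stand-in for a trained network: maps an image (H x W x 3) to keypoints."""
    # A real implementation would run the encoder/decoder here; placeholder
    # values are returned only to illustrate the output structure.
    return [Keypoint("snout", 120.4, 88.1, 0.97, individual=0),
            Keypoint("tailbase", 301.7, 140.2, 0.91, individual=0)]

# Applying the function frame-by-frame converts a video into time series,
# e.g. an array of shape (n_frames, n_bodyparts, 3) holding x, y, confidence.
video = np.zeros((100, 480, 640, 3), dtype=np.uint8)   # toy video
trajectories = np.array([[(k.x, k.y, k.confidence) for k in estimate_pose(f)]
                         for f in video])
print(trajectories.shape)   # (100, 2, 3)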

Figure 2. Comparison of marker-based (traditional) and markerless tracking approaches. (A) In marker-based tracking, prior to performing an
experiment, special measures have to be taken regarding hardware and preparation of the subject (images adapted from 15, 16; IMU stands for inertial
measurement unit). (B) For markerless pose estimation, raw video is acquired and processed post-hoc: Using labels from human annotators, machine
learning models are trained to infer keypoint representations directly from video (on-line inference without markers is also possible (17)). Typically, the
architectures underlying pose estimation can be divided into a feature extractor and a decoder: The former maps the image representation into a feature
space, the latter infers keypoint locations given this feature representation. In modern deep learning systems, both parts of the system are trained
end-to-end.



Overview of algorithms.

While many pose estimation algorithms (21, 22) have been proposed, algorithms based on deep learning (23) are the most powerful as measured by performance on human pose estimation benchmarks (18, 24-29). More generally, pose estimation algorithms fall under "object detection", a field that has seen tremendous advances with deep learning (aptly reviewed in Wu et al., 9). In brief, pose estimation can often intuitively be understood as a system of an encoder that extracts important (visual) features from the frame, which are then used by the decoder to predict the body parts of interest along with their location in the image frame.

In classical algorithms (see 9, 21, 22), handcrafted feature representations are used that extract invariant statistical descriptions from images. These features were then used together with a classifier (decoder) for detecting complex objects like humans (21, 30). Handcrafted feature representations are (loosely) inspired by neurons in the visual pathway and are designed to be robust to changes in illumination and translation; typical feature representations are Scale Invariant Feature Transform (SIFT; 31), Histogram of Gradients (HOG; 30) or Speeded Up Robust Features (SURF; 32).

In more recent approaches, both the encoder and decoders (alternatively called the backbone and output heads, respectively) are deep neural networks (DNN) that are directly optimized on the pose estimation task. An optimal strategy for pose estimation is jointly learning representations of the raw image or video data (encoder) and a predictive model for posture (decoder). In practice, this is achieved by concatenating multiple layers of differentiable, non-linear transformations and by training such a model as a whole using the backpropagation algorithm (9, 23, 33). In contrast to classical approaches, DNN based approaches directly optimize the feature representation in a way most suitable for the task at hand (for a glossary of deep learning terms see Box 1).

Machine learning systems are composed of a dataset, model, loss function (criterion) and optimization algorithm (33). The dataset defines the input-output relationships that the model should learn: in pose estimation, a particular pose (output) should be predicted for a particular image (input); see Figures 1 & 2B. The model's parameters (weights) are iteratively updated by the optimizer to minimize the loss function. Thereby the loss function measures the quality of a predicted pose (in comparison to the ground truth data). Choices about these four parts influence the final performance and behavior of the pose-estimation system, and we discuss possible design choices in the next sections.

Datasets & Data Augmentation.

Two kinds of datasets are relevant for training pose estimation systems: First, one or multiple datasets used for related tasks—such as image recognition—can be used for pre-training computer vision models on this task (also known as transfer learning; see Box 1). This dataset is typically considerably larger than the one used for pose estimation. For example, ImageNet (34), sometimes denoted as ImageNet-21K, is a highly influential dataset, and a subset was used for the ImageNet Large Scale Visual Recognition Challenge in 2012 (ILSVRC-2012; 35) for object recognition. Full ImageNet contains 14.2 million images from 21K classes; the ILSVRC-2012 subset contains 1.2 million images of 1,000 different classes (such as car, chair, etc.; 35). Groups working towards state-of-the-art performance on this benchmark also helped push the field to build better DNNs and openly share code. This dataset has been extensively used for pre-training networks, which we will discuss in the model and optimization section below.

Figure 3. Example augmentation images with labeled body parts in red. (A) Two example frames of Alpine choughs (Pyrrhocorax graculus) near Mont Blanc with human-applied labels in red (original). The images to the right illustrate three augmentations (as labeled). (B) Two example frames of a trail-tracking mouse (Mus musculus) from (20) with four labeled bodyparts, as well as augmented variants.
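The augmentations shown in Figure 3 are straightforward to generate in code. Below is a minimal sketch assuming the imgaug library (65), one of the augmentation tools mentioned in Box 2; the keypoint coordinates are invented for illustration. Note that the spatial transform is applied to the image and its keypoint labels together, as cautioned later in the text.

import numpy as np
import imgaug.augmenters as iaa
from imgaug.augmentables.kps import Keypoint, KeypointsOnImage

image = np.random.randint(0, 255, (256, 256, 3), dtype=np.uint8)  # placeholder frame
keypoints = KeypointsOnImage([Keypoint(x=90, y=120),    # e.g. snout
                              Keypoint(x=170, y=140)],  # e.g. tailbase
                             shape=image.shape)

# Geometric corruptions (rotation, scaling) plus image-level noise (motion blur).
augmenter = iaa.Sequential([
    iaa.Affine(rotate=(-25, 25), scale=(0.8, 1.2)),
    iaa.MotionBlur(k=7),
])

# Crucially, the same spatial transform is applied to the image AND the labels.
image_aug, keypoints_aug = augmenter(image=image, keypoints=keypoints)
print(keypoints_aug.keypoints[0])  # rotated/scaled coordinates of the first label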



The second highly relevant dataset is the one curated for the task of interest. Mathis et al. (20) empirically demonstrated that the size of this dataset can be comparably small for typical pose estimation cases in the laboratory. Typically, this dataset contains 10-500 images, vs. the standard human pose estimation benchmark datasets, such as MS COCO (36) or MPII pose (37), which have around 40K annotated images (of 26K individuals). This implies that the dataset that is curated is highly influential on the final performance, and great care should be taken to select diverse postures, individuals, and background statistics and to label the data accurately (discussed below in "pitfalls").

In practice, several factors matter: the performance of a fine-tuned model on the task of interest, the amount of images that need to be annotated for fine-tuning the network, and the convergence rate of the optimization algorithm—i.e., how many steps of gradient descent are needed to obtain a certain performance. Using a pre-trained network can help in several regards: He et al. (38) show that in the case of large training datasets, pre-training typically aids with convergence rates, but not necessarily the final performance. There is evidence that under the right circumstances (i.e., given enough task-relevant data) and with longer training, randomly initialized models can match the performance of fine-tuned ones for keypoint detection on COCO (38) and horses (39); however, the networks are less robust (39). Beyond robustness, using a pre-trained model is generally advisable when the amount of labeled data for the target task is small, which is true for many applications in neuroscience, as it leads to shorter training times and better performance with less data (20, 38-40). Thus, pre-trained pose estimation algorithms save training time, increase robustness, and require substantially less training data. Indeed, most packages in neuroscience now use pre-trained models (20, 40-44), although some do not (45-47), which can give acceptable performance for simplified situations with aligned individuals.

More recently, larger datasets like the 3.5 billion image Instagram dataset (48), JFT, which has 300M images (49, 50), and OpenImages (51) became popular, further improving performance and robustness of the considered models (50). What task is used for pre-training also matters. Corroborating this insight, Li et al. showed that pre-training on a large-scale object detection task can improve performance for tasks that require fine, spatial information like segmentation (52).

Besides large datasets for pre-training, a curated dataset with pose annotations is needed for optimizing the algorithm on the pose estimation task. The process is discussed in more detail below, and it typically suffices to label a few (diverse) frames. Data augmentation is the process of expanding the training set by applying specified manipulations (like rotating or scaling image size). Based on the chosen corruptions, models become more invariant to rotations, scale changes or translations and thus more accurate (with less training data). Augmentation can also help with improving robustness to noise, like jpeg compression artefacts and motion blur (Figure 3). To note, data augmentation schemes should not affect the semantic information in the image: for instance, if color conveys important information about the identity of an animal, augmentations involving changes in color are not advisable. Likewise, augmentations which change the spatial position of objects or subjects should always be applied to both the input image and the labels (Box 2).

Model architectures.

Systems for markerless pose estimation are typically composed of a backbone network (encoder), which takes the role of the feature extractor, and one or multiple heads (decoders). Understanding the model architectures and design choices common in deep learning based pose estimation systems requires a basic understanding of convolutional neural networks. We summarize the key terms in Box 1, and expand on what encoders and decoders are below.

Instead of using handcrafted features as in classical systems, deep learning based systems employ "generic" encoder architectures which are often based on models for object recognition. In a typical system, the encoder design affects the most important properties of the algorithm such as its inference speed, training-data requirements and memory demands. For the pose estimation algorithms so far used in neuroscience, the encoders are either stacked hourglass networks (26), MobileNetV2s (53), ResNets (54), DenseNets (55) or EfficientNets (56). These encoder networks are typically pre-trained on one or multiple of the larger-scale datasets introduced previously (such as ImageNet), as this has been shown to be an advantage for pose estimation on small lab-scale sized datasets (20, 39, 40). For common architectures this pre-training step does not need to be carried out explicitly; trained weights for popular architectures are already available in common deep learning frameworks.

The impact of the encoder on DNN performance is a highly active research area. The encoders are continuously improved in regards to speed and object recognition performance (9, 53, 55-57). Naturally, due to the importance of the ImageNet benchmark, the accuracy of network architectures continuously increases (on that dataset). For example, we were able to show that this performance increase is not merely reserved for ImageNet, or (importantly) other object recognition tasks (57), but in fact that better architectures on ImageNet are also better for pose estimation (44). However, being better on ImageNet also comes at the cost of decreasing inference speed and increased memory demands. DeepLabCut (an open source toolbox for markerless pose estimation popular in neuroscience) thus incorporates backbones from MobileNetV2s (faster) to EfficientNets (best performance on ImageNet; 39, 44).
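As a rough illustration of this speed/accuracy/memory trade-off, one can compare off-the-shelf backbones directly. The sketch below, assuming PyTorch and torchvision, counts parameters and times a single forward pass for two of the encoders mentioned above; absolute timings will depend on hardware, and the choice of the two models is only an example.

import time
import torch
import torchvision.models as models

# Two encoders discussed in the text; ImageNet weights could be loaded for
# transfer learning, but random weights suffice for comparing size and speed.
backbones = {
    "MobileNetV2": models.mobilenet_v2(),
    "ResNet-50": models.resnet50(),
}

x = torch.randn(1, 3, 256, 256)   # one dummy frame
for name, net in backbones.items():
    net.eval()
    n_params = sum(p.numel() for p in net.parameters()) / 1e6
    with torch.no_grad():
        start = time.perf_counter()
        net(x)
        elapsed_ms = (time.perf_counter() - start) * 1e3
    print(f"{name}: {n_params:.1f}M parameters, {elapsed_ms:.1f} ms per frame (CPU)")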



Box 1: Glossary of Deep Learning terms
An excellent textbook for Deep Learning is provided by Goodfellow et al. (33). See Dumoulin & Visin (58) for an in-depth
technical overview of convolution arithmetic.

Artificial neural network (ANN): An ANN can be represented by a collection of computational units ("neurons") arranged in a directed graph. The output of each unit is computed as a weighted sum of its inputs, followed by a nonlinear function.

Convolutional neural network (CNN): A CNN is an ANN composed of one or multiple convolutional layers. Influential early CNNs are the LeNet, AlexNet and VGG16 (9, 23, 33).
Residual Networks (ResNets): Increasing network depth makes deep neural networks (DNNs) more
expressive compared to adding units to a shallow architecture. However, optimization becomes hard
for standard CNNs beyond 20 layers, at which point depth in fact decreases the performance (54). In
residual networks, instead of learning a mapping y = f (x), the layer is re-parametrized to learn the
mapping y = x + f (x), which improves optimization and regularizes the loss landscape (59). These
networks can have much larger depth without diminishing returns (54) and are the basis for other pop-
ular architectures such as MobileNets (53) and EfficientNets (56).
Convolution: A convolution is a special type of linear filter. Compared with a full linear transforma-
tion, convolutional layers increase computational efficiency by weight sharing (23, 33). By applying the
convolution, the same set of weights is used across all locations in the image. Deconvolution: Deconvolutional layers upsample a feature representation. Typically, the kernel used for upsampling
is optimized during training, similar to a standard convolutional layer. Sometimes, fixed operations
such as bilinear upsampling filters are used.

Stride, Downsampling and Dilated (atrous) Convolutions: In DNNs for computer vision, images are
presented as real-valued pixel data to the network and are then transformed to symbolic representations
and image annotations, such as bounding boxes, segmentation masks, class labels or keypoints. During
processing, inputs are consecutively abstracted by aggregating information from growing “receptive
fields”. Increasing the receptive field of the unit is possible by different means: Increasing the stride
of a layer computes outputs only for every n-th input and effectively downsamples the input with a
learnable filter. Downsampling layers perform the same operation, but with a fixed kernel (e.g. taking
the maximum or mean activation across the receptive field). In contrast, atrous or dilated convolutions
increase the filter size by adding intermittent zero entries between the learnable filter weights—e.g.,
for a dilation of 2, a filter with entries (1, 2, 3) would be converted into (1, 0, 0, 2, 0, 0, 3). This allows
increases in the receptive field without losing resolution in the next layers, and is often applied in
semantic segmentation algorithms (60), and pose estimation (18, 20).

Transfer learning: The ability to use parameters from a network that has been trained on one task—
e.g. classification, see (i)—as part of a network to perform another task—e.g. pose estimation, see
(ii). The approach was popularized with DeCAF (61), which used AlexNet (62) to extract features
to achieve excellent results for several computer vision tasks. Transfer learning generally improves
the convergence speed of model training (38–40, 63) and model robustness compared to training from
scratch (39).
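Building on the transfer learning entry above, here is a minimal sketch (assuming PyTorch/torchvision) of how an ImageNet-pretrained ResNet-50 can be reused as an encoder for pose estimation: the classification head is discarded and a small decoder predicting one heatmap per bodypart is attached. The layer choices and sizes are illustrative and not those of any particular package.

import torch
import torch.nn as nn
import torchvision.models as models

class PoseNet(nn.Module):
    def __init__(self, n_bodyparts: int, freeze_backbone: bool = False):
        super().__init__()
        # Transfer learning: start from ImageNet weights (classic torchvision flag;
        # newer versions use the weights= argument instead).
        resnet = models.resnet50(pretrained=True)
        self.encoder = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool + fc
        if freeze_backbone:                      # optionally fine-tune only the decoder
            for p in self.encoder.parameters():
                p.requires_grad = False
        # Decoder: upsample the coarse feature map and predict per-bodypart heatmaps.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(2048, 256, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, n_bodyparts, kernel_size=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

net = PoseNet(n_bodyparts=4)
heatmaps = net(torch.randn(1, 3, 256, 256))
print(heatmaps.shape)  # torch.Size([1, 4, 16, 16]) — a grid of location scores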



Box 2: Key parameters and choices.

The key design choices of pose estimation systems are dataset curation, data augmentation, model architecture selection, the optimization process, and the optimization criteria.

• Data Augmentation: the technique of increasing the training set by converting images and annotations into new, altered images via geometric transformations (e.g. rotation, scaling), image manipulations (e.g. contrast, brightness), etc. (Figure 3). Depending on the annotation data, various augmentations, e.g. rotational symmetry, are ideal. Packages such as Tensorpack (64) and imgaug (65), as well as tools native to PyTorch (66) and TensorFlow (67), provide common augmentation methods and are used in many packages.

• Model Architecture: Users should select an architecture that is accurate and fast (enough) for their goal. Top performing networks (in terms of accuracy) include Stacked Hourglass (26), ResNets (54) and EfficientNets (56) with appropriate decoders (18, 19, 28, 44), as well as recent high-resolution nets (29, 68). Performance gains in speed at the expense of slightly worse accuracy are possible with (optimized) lightweight models such as MobileNetV2 (53) in DeepLabCut (39) and stacked hourglass networks with DenseNets (55) as proposed in DeepPoseKit (41); and often this performance gap can be rescued with good data augmentation.

In (standard) convolutional encoders, the high-resolution input images get gradually downsampled while the number of learned features increases. Regression based approaches which directly predict keypoint locations from the feature representation can potentially deal with this downsampled representation. When the learning problem is instead cast as identifying the keypoint locations on a grid of pixels, the output resolution needs to be increased first, often by deconvolutional layers (18, 28). We denote this part of the network as the decoder, which takes downsampled features, possibly from multiple layers in the encoder hierarchy, and gradually upsamples them again to arrive at the desired resolution. The first models of this class were Fully Convolutional Networks (69), and later DeepLab (60). Many popular architectures today follow similar principles. Design choices include the use of skip connections between decoder layers, but also regarding skip connections between the encoder and decoder layers. Example encoder-decoder setups are illustrated in Figure 4. The aforementioned building blocks—encoders and decoders—can be used to form a variety of different approaches, which can be trained end-to-end directly on the target task (i.e., pose estimation).

Pre-trained models can also be adapted to a particular application. For instance, DeeperCut (18), which was adapted by the animal pose estimation toolbox DeepLabCut (20), was built with a ResNet (54) backbone network, but adapted the stride by atrous convolutions (60) to retain a higher spatial resolution (Box 1). This allowed larger receptive fields for predictions, while retaining a relatively high speed (i.e., for video analysis); most importantly, because ResNets can be pre-trained on ImageNet, those initialized weights could be used. Other architectures, like the stacked hourglass networks (26) used in DeepFly3D (70) and DeepPoseKit (41), retain feature representations at multiple scales and pass those to the decoder (Figure 4A, B).

Figure 4. Schematic overview of possible design choices for model architectures and training process. (A) A simple, but powerful variant (18) is a ResNet-50 (54) architecture adapted to replace the final down-sampling operations by atrous convolutions (60) to keep a stride of 16, and then a single deconvolution layer to upsample to output maps with stride 8. It also forms the basis of other architectures (e.g. 28). The encoder can also be exchanged for different backbones to improve speed or accuracy (see Box 2). (B) Other approaches, like stacked hourglass networks (26), are not pre-trained and employ skip connections between encoder and decoder layers to aid the up-sampling process. (C) For training the network, the training data comprising input images and target heatmaps is used. The target heatmap is compared with the forward prediction. Thereby, the parameters of the network are optimized to minimize the loss that measures the difference between the predicted heatmap and the target heatmap (ground truth).

Loss functions: training architectures on datasets.

Keypoints (i.e., bodyparts) are simply coordinates in image space. There are two fundamentally different ways for estimating keypoints (i.e., how to define the loss function). The problem can be treated as a regression problem with the coordinates as targets (24, 71). Alternatively, and more popularly, the problem can be cast as a classification problem, where the coordinates are mapped onto a grid (e.g. of the same size as the image) and the model predicts a heatmap (scoremap) of location probabilities for each bodypart (Figure 4C). In contrast to the regression approach (24), this is fully convolutional, allows modeling of multi-modal distributions, and aids the training process (18, 26, 27, 72). Moreover, the heatmaps have the advantage that one can naturally predict multiple locations of the "same" bodypart in the same image (i.e., 2 elbows) without mode collapse (Figure 5A).
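A common way to construct such targets is a small Gaussian centered on the labeled coordinate, scored with a pixel-wise loss against the predicted scoremap. The sketch below makes this concrete; the grid size, the Gaussian width and the use of a mean-squared-error loss are illustrative choices, not a prescription from any specific package.

import numpy as np

def gaussian_heatmap(x, y, height, width, sigma=2.0):
    """Target scoremap with a 2D Gaussian bump at the labeled keypoint (x, y)."""
    xs, ys = np.meshgrid(np.arange(width), np.arange(height))
    return np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))

# Ground truth: the "snout" was labeled at pixel (22.0, 14.0) on a 64 x 48 grid.
target = gaussian_heatmap(22.0, 14.0, height=48, width=64)

# A (here random) predicted scoremap from the decoder; during training, the
# loss below is minimized by gradient descent over the network parameters.
prediction = np.random.rand(48, 64)
mse_loss = np.mean((prediction - target) ** 2)

# At inference, the keypoint estimate is simply the scoremap's argmax.
iy, ix = np.unravel_index(prediction.argmax(), prediction.shape)
print(f"loss={mse_loss:.4f}, predicted location=({ix}, {iy})")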



Loss functions can also reflect additional priors or inductive biases about the data. For instance, DeepLabCut uses location refinement layers (locref) that counteract the downsampling inherent in encoders, by training outputs to predict corrective shifts in image coordinates relative to the downsampled output maps (Figure 5A). In pose estimation, it is possible to define a skeleton or graph connecting keypoints belonging to subjects with the same identity (see below) (18, 27). When estimating keypoints over time, it is also possible to employ temporal information and encourage the model to only smoothly vary its estimate among consecutive frames (73-76). Based on the problem, these priors can be directly encoded and be used to regularize the model.

How can pose estimation algorithms accommodate multiple individuals? Fundamentally, there are two different approaches: bottom-up and top-down methods (Figure 5). In top-down methods, individuals are first localized (often with another neural network trained on object localization), then pose estimation is performed per localized individual (26, 28, 68). In bottom-up methods all bodyparts are localized, and networks are also trained to predict connections of bodyparts within individuals (i.e., limbs). These connections are then used to link candidate bodyparts to form individuals (19, 27, 29, 73). To note, these techniques can be used on single individuals for increased performance, but often are not needed and usually imply reduced inference speed.

Optimization.

For pre-training, stochastic gradient descent (SGD; 77) with momentum (78) is an established method. Different variants of SGD are now common (such as Adam; 79) and used for fine-tuning the resulting representations. As mentioned above, pose estimation algorithms are typically trained in a multi-stage setup where the backbone is trained first on a large (labeled) dataset of a potentially unrelated task (like image classification). Users can also download these pre-trained weights. Afterwards, the model is fine-tuned on the pose-estimation task. Once trained, the quality of the prediction can be judged in terms of the root mean square error (RMSE), which measures the distance between the ground truth keypoints and predictions (20, 45), or by measuring the percentage of correct keypoints (PCK; 37, 39), i.e., the fraction of detected keypoints that fall within a defined distance of the ground truth.
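Both evaluation measures are simple to compute once predictions and ground truth labels are matched; a minimal sketch with made-up coordinates follows (the 5-pixel PCK threshold is an arbitrary placeholder; benchmarks often tie it to a fraction of body size).

import numpy as np

# (n_frames, n_bodyparts, 2) arrays of ground-truth and predicted coordinates.
ground_truth = np.array([[[100.0, 50.0], [140.0, 52.0]],
                         [[102.0, 55.0], [141.0, 58.0]]])
predictions  = np.array([[[101.5, 51.0], [150.0, 60.0]],
                         [[103.0, 54.0], [142.0, 57.5]]])

errors = np.linalg.norm(predictions - ground_truth, axis=-1)  # per-keypoint distance

rmse = np.sqrt(np.mean(errors ** 2))
pck_at_5px = np.mean(errors < 5.0)   # fraction of keypoints within 5 pixels

print(f"RMSE = {rmse:.2f} px, PCK@5px = {pck_at_5px:.2f}")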

Figure 5. Multi-animal pose estimation approaches. A: Bottom-up approaches detect all the body parts (e.g. elbow and shoulder in the example) as well as "limbs" (part confidence maps). These limbs are then used to associate the bodyparts within individuals correctly (Figure from OpenPose, 27). For both OpenPose and DeepLabCut, the bodypart confidence maps and the part affinity fields (PAFs) are predicted by different decoders (aka output heads) from the encoder. B: Top-down approaches localize individuals with bounding-box detectors and then directly predict the posture within each bounding box. This does not require part confidence maps, but is subject to errors when bounding boxes are wrongly predicted (see the black bounding box encompassing two players in (c)). The displayed figures, adapted from Xiao et al. (28), mitigate this disadvantage by predicting bounding boxes per frame and forward predicting them across time via visual flow.
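As a toy illustration of the top-down strategy, assume a detector and a single-animal pose network are already available (both are stand-ins below): each bounding box is cropped, analyzed separately, and the resulting keypoints are mapped back into full-image coordinates.

import numpy as np

def detect_individuals(frame):
    """Stand-in detector: returns bounding boxes as (x0, y0, x1, y1)."""
    return [(40, 30, 160, 190), (200, 60, 320, 220)]

def estimate_pose_single(crop):
    """Stand-in single-animal pose network: keypoints in crop coordinates."""
    h, w = crop.shape[:2]
    return np.array([[0.4 * w, 0.3 * h], [0.6 * w, 0.7 * h]])  # two dummy bodyparts

frame = np.zeros((480, 640, 3), dtype=np.uint8)
poses = []
for (x0, y0, x1, y1) in detect_individuals(frame):
    crop = frame[y0:y1, x0:x1]
    keypoints = estimate_pose_single(crop)
    keypoints += np.array([x0, y0])      # back to full-image coordinates
    poses.append(keypoints)

print(len(poses), "individuals,", poses[0].shape[0], "keypoints each")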



To properly estimate model performance in an application setting, it is advisable to split the labeled dataset at least into train and test subsets. If systematic deviations can be expected in the application setting (e.g., because the subjects used for training the model differ in appearance from subjects encountered at model deployment; 39), this should be reflected when choosing a way to split the data. For instance, if data from multiple individuals is available, distinct individuals should form distinct subsets of the data. On the contrary, strategies like splitting data by selecting every n-th frame in a video likely overestimate the true model performance.

The model is then optimized on the training dataset, while performance is monitored on the validation (test) split. If needed, hyperparameters—like parameter settings of the optimizer, or also choices about the model architecture—can be adapted based on an additional validation set.
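One concrete way to implement such a split, assuming scikit-learn is available, is to group the labeled frames by individual (or by source video) so that no subject appears in both subsets, rather than taking every n-th frame:

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

n_frames = 200
frame_ids = np.arange(n_frames)
# Which individual (or source video) each labeled frame came from.
individual = np.repeat([0, 1, 2, 3], n_frames // 4)

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(frame_ids, groups=individual))

# No individual is shared between the two subsets:
assert set(individual[train_idx]).isdisjoint(individual[test_idx])
print(f"{len(train_idx)} training frames, {len(test_idx)} test frames")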
All of the aforementioned choices influence the final outcome and performance of the algorithm. While some parts of the training pipeline are well-established and robust—like pre-training a model on ImageNet—choices about the dataset, architecture, augmentation, fine-tuning procedure, etc. will inevitably influence the quality of the pose estimation algorithm (Box 2). See Figure 3 for a qualitative impression of augmentation effects of some of these decisions (see also Figure 8). We will discuss this in more detail in the Pitfalls section.

So far, we considered algorithms able to infer 2D keypoints from videos, by training deep neural networks on previously labeled data. Naturally, there is also much work in computer vision and machine learning towards the estimation of 3D keypoints from 2D labels, or to directly infer 3D keypoints. In the interest of space, we had to omit those, but refer the interested reader to (74, 81-84), as well as, specifically for neuroscience, (42, 47, 70, 74, 80, 85).

Lastly, it is not understood how CNNs make decisions, and they often find "shortcuts" (86). While this active research area is certainly beyond the scope of this primer, from practical experience we know that at least within-domain—i.e., for data that is similar to the training set—DNNs work very well for pose estimation, which is the typical setting relevant for downstream applications in neuroscience. It is worth noting that in order to optimize performance, there is no one-size-fits-all solution. Thus, we hope that by building intuition in users of such systems, we provide the necessary tools to make these decisions with more confidence (Figure 6).

Scope and applications

Markerless motion capture can excel in complicated scenes, with diverse animals, and with any camera available (monochrome, RGB, depth cameras, etc.). The only real requirement is the ability of the human to be able to reliably label keypoints (manually or via alternative sources). Simply, you need to be able to see what you want to track. Historically, due to limitations in computer vision algorithms, experimentalists would go to great lengths to simplify the environment, even in the laboratory (i.e., no bedding, white or black walls, high contrast), and this is no longer required with deep learning-based pose estimation. Now, the aesthetics one might want for photographs or videos taken in daily life are the best option.

Indeed, the field has been able to rapidly adopt these tools for neuroscience. Deep learning-based markerless pose estimation applications in the laboratory have already been published for flies (20, 41, 43, 45, 70, 85), rodents (20, 40, 41, 43, 45, 47, 70, 87), horses (39), dogs (74), rhesus macaques (42, 74, 88, 89) and marmosets (90); the original architectures were developed for humans (18, 26, 27). Outside of the laboratory, DeepPoseKit was used for zebras (41) and DeepLabCut for 3D tracking of cheetahs (80), for squirrels (91) and macaques (89), highlighting the great "in-the-wild" utility of this new technology (10). As outlined in the principles section, and illustrated by these applications, these deep learning architectures are general-purpose and can be broadly applied to any animal as well as condition.

Recent research highlights the prevalent representations of action across the brain (92), which emphasizes the importance of quantifying behavior even in non-motor tasks. For instance, pose estimation tools have recently been used to elucidate the neural variability across cortex in humans during thousands of spontaneous reach movements (93). Pupil tracking is of great importance for visual neuroscience. One recent study by Meyer et al. used head-fixed cameras and DeepLabCut to reveal two distinct types of coupling between eye and head movements (94). In order to accurately correlate neural activity to visual input, tracking the gaze is crucial. The recent large, open dataset from the Allen Institute includes imaging data of six cortical and two thalamic regions in response to various stimulus classes, as well as pupil tracking with DeepLabCut (95). The International Brain Lab has integrated DeepLabCut into their workflow to track multiple bodyparts of decision-making mice, including their pupils (96).

Measuring relational interactions is another major direction that has been explored less in the literature so far, but is feasible. Since the feature detectors for pose estimation are of a general nature, one can easily not only track the posture of individuals but also the tools and objects one interacts with (e.g. for analyzing golf or tennis). Furthermore, social behaviors and parenting interactions (for example in mice) can now be studied noninvasively.

Due to these general capabilities, these tools have several applications for creating biomarkers by extracting high fidelity animal traits, for instance in the pain field (97) and for monitoring motor function in healthy and diseased conditions (98). DeepLabCut was also integrated with tools for x-ray analysis (99). For measuring joint center locations in mammals, arguably, x-ray is the gold standard. Of course, x-ray data also poses challenges for extracting body part locations. A recent paper shared methodology to integrate DeepLabCut with XROMM, a popular analysis suite, to advance the speed and accuracy of x-ray based analysis (99).



Figure 6. An overview of the workflow for deep learning based pose estimation, which highlights several critical decision points.

How do the (current) packages work?

Here we will focus on packages that have been used in behavioral neuroscience, but the general workflow for pose estimation in computer vision research is highly similar. What has made experimentalist-focused toolboxes different is that they provide essential code to generate and train on one's own datasets. Typically, what is available in computer vision focused pose estimation repositories is code to run inference (video analysis) and/or run training of an architecture for specific datasets around which competitions happen (e.g., MS COCO; 36 and MPII pose; 37). While these are two crucial steps, they are not sufficient to develop tailored neural networks for an individual lab or experimentalist. Thus, the "barrier to entry" is often quite high to use these tools. It requires knowledge of deep learning languages to build appropriate data loaders, data augmentation pipelines, and training regimes. Therefore, in recent years several packages have not only focused on animal pose estimation networks, but on providing users a full pipeline that allows for (1) labeling a customized dataset (frame selection and labeling tools), (2) generating test/train datasets, (3) data augmentation and loaders, (4) neural architectures, (5) code to evaluate performance, (6) running video inference, and (7) post-processing tools for simple readouts of the acquired machine-labeled data. Thus far, around 10 packages have become available in the past 2 years (20, 40-43, 45, 47, 70). Each has focused on providing slightly different user experiences, modularity, available networks, and balances to the speed/accuracy trade-off for video inference. Several include their (adapted) implementations of the original DeepLabCut or LEAP networks as well (41, 43). But the ones we highlight have the full pipeline delineated above as a principle and are open source, i.e., at minimum inference code is available (see Table 1). The progress gained and the challenges they set out to address (and some that remain) are reviewed elsewhere (10, 100). Here, we discuss collective aims of these packages (see also Figure 6).

Current packages for animal pose estimation have focused primarily on providing tools to train tailored neural networks to user-defined features. Because experimentalists need flexibility and are tracking very different animals and features, the most successful packages (in terms of user base as measured by citations and GitHub engagement) are species agnostic. However, given they are all based on advances from prior art in human pose estimation, the accuracy of any one package, given the breadth of options that could be deployed (i.e., data augmentation, training schedules, and architectures), will remain largely comparable, if such tools are provided to the user. What will determine performance the most is the input training data provided, and how much capacity the architectures have.

It is notable that using transfer learning has proven to be advantageous for better robustness (i.e., the ability to generalize, see 20, 39, 40), which was first deployed by DeepLabCut (see Table 1). Now, training on large animal-specific datasets has recently been made available in DeepLabCut as well (such as a horse pose dataset with >8,000 annotated images of 30 horses; 39). This allows the user to bypass the only manual part of curating and labeling ground truth data, and these models can directly be used for inference on novel videos. For DeepLabCut, this is an emerging community-driven effort, with external labs already contributing models and data (modelzoo.deeplabcut.org).
Table 1. Overview of popular deep learning tools for animal motion capture (or newly presented packages that, minimally, include code). Here, we denote whether a package can be used to create tailored networks or whether only specific animal tools are provided, i.e., it only works "as-is" on a fly or rat. We also highlight whether pre-trained neural networks (PT-NNs) beyond human ones are available. We also provide the release date and current citations for the noted references, including those to related preprints (indexed from Google Scholar). *Note, this code is deprecated and supplanted by SLEAP.

Package              Any species   3D    >1 animal   Training Code   Full GUI   Ex. Data   PT-NNs   Released   Citations
DeepLabCut (20, 80)  yes           yes   yes         yes             yes        yes        many     4/2018     491
LEAP (45)            yes           no    yes         yes             yes        yes        no       6/2018*    98
DeepBehavior (40)    no            yes   yes         no              no         no         no       5/2019     15
DeepPoseKit (41)     yes           no    no          yes             partial    yes        no       8/2019     48
DeepFly3D (70)       no            yes   no          2D only         partial    yes        fly      5/2019     21
FreiPose (47)        no            yes   no          partial         no         yes        no       2/2020     1
Optiflex (43)        yes           no    no          yes             partial    yes        no       5/2020     0



In the future, having the ability to skip labeling and training and run video inference with robust models will lead to more reproducible and scalable research. For example, as we show in other sections of the primer, if the labeling accuracy is not of a high quality, and the data is not diverse enough, then the networks are not able to generalize to so-called "out-of-domain" data. If as a community we collectively build stable and robust models that leverage the breadth of behaviors being carried out in laboratories worldwide, we can work towards models that would work in a plug-and-play fashion. We anticipate new datasets and models to become available in the next months to years.

Box 3: Computing hardware

• CPU: The central processing unit (CPU) is the core of a computer and executes computer programs. CPUs work well on sequential or lightly parallelized routines due to the limited number of cores.

• GPU: A graphical processing unit (GPU) is a specialized computing device designed to rapidly process and alter memory. GPUs are ideal for computer graphics and are often located in graphics cards. Their highly parallel architecture enables them to be more efficient(a) than CPUs for algorithms with many small subroutines which can be launched in parallel. They can be applied to run DNNs at higher speed (62), and pose estimation in particular (17, 39, 87).

• Affordability of GPUs: Modern GPUs are affordable (around 300-800 USD for cards that can be used for the pose estimation tools mentioned here, and up to 10,000 USD for high end cards) and ideally suited to run video processing within a single lab in a decentralized way. They can be placed into standard desktop computers, or even "gaming" laptops are options. However, to get started it might be easier to test software in cloud computing services first for ease-of-use (i.e. no driver installation).

• Cloud computing: The ability to use resources online rapidly (minimal installation), often in a pay-per-use scheme. Two relevant examples are Google Colaboratory and My Binder. Google Colaboratory is an online platform for hosted free GPU use with run times of up to 6 hours. My Binder allows turning a Git repository into a collection of interactive notebooks by running them in an executable environment, making your code immediately reproducible by anyone, anywhere (mybinder.org).

(a) Link to NVIDIA Data Center Deep Learning Product Performance: developer.nvidia.com/deep-learning-performance-training-inference

All packages, just like all applications of deep learning to video, prefer access to GPU computing resources (see Box 3). On GPUs one experiences faster training and inference times, but the code can also be deployed on standard CPUs or laptops. With cloud computing services, such as Google Colaboratory and JupyterLab, many pose estimation packages can simply be deployed on remote GPU resources. This still requires (1) knowledge about these resources, and (2) toolboxes providing so-called "notebooks" that can be easily deployed. But, given these platforms have utility beyond just pose estimation, they are worthwhile to learn about.

For the non-GPU aspects, only a few packages have provided easy-to-use graphical user interfaces that allow users with no programming experience to use the tool (see Table 1). Lastly, the available packages vary in their access to 3D tools, multi-animal support, and types of architectures available to the user, which is often a concern for speed and accuracy. Additionally, some packages have limitations in only allowing the same sized videos for training and inference, while others are more flexible. These are all key considerations when deciding which eco-system to invest in learning (as every package has taken a different approach to the API).

Perhaps the largest barrier to entry for using deep learning-based pose estimation methods is managing the computing resources (see Box 3, Box 4). From our experience, installing GPU drivers and the deep learning packages (TensorFlow, PyTorch) that all the packages rely on is the biggest challenge. To this end, in addition to documentation that is "user-focused" (i.e., not just an API for programmers), resources like webinars, video tutorials, workshops, Gitter and community forums (like StackOverflow and Image Forum SC) have become invaluable resources for the modern neuroscientist. Here, users can ask questions and get assistance from developers and users alike. We believe this has also been a crucial step for the success of DeepLabCut.

While some packages provide full GUI-based control over the packages, to utilize more advanced features at least minimal programming knowledge is ideal. Thus, better training for the increasingly computational nature of neuroscience will be crucial. Making programming skills a requirement of graduate training, building better community resources, and leveraging the fast-moving world of technology to harness those computing and user resources will be crucial. In animal pose estimation, while there is certainly an attempt to make many of the packages user-friendly, i.e., to onboard users and have a scalable discussion around common problems, we found user forums to be very valuable (101).



Specifically, DeepLabCut is a member of the Scientific Community Image Forum (forum.image.sc), alongside other packages that are widely used for image analysis in the life sciences such as Fiji (102), napari, CellProfiler (103), Ilastik (104) and scikit-image (105).

Box 4: Reproducible Software

Often installation of deep learning languages like TensorFlow/Keras (67) and PyTorch (66) is the biggest hurdle for getting started.

• Python virtual environments: Software often has many dependencies, and they can conflict if multiple versions are required for different needs. Thus, placing dependencies within a contained environment can minimize issues. Common environments include Anaconda (conda) and virtualenv, both for Python code bases.

• Docker delivers software in packages called containers, which can be run locally or on servers. Containers are isolated from one another and bundle their own software, libraries and configuration files (docker.com; 106).

• GitHub: github.com is a platform for developing and hosting software, which uses Git version control. Version control is excellent for keeping history-dependent versions and discrete workspaces for code development and deployment. GitLab (gitlab.com/explore) also hosts code repositories.

Practical considerations for pose estimation (with deep learning)

As a recent field gaining traction, it is instructive to regard the operability of deep learning-powered pose estimation in light of well-established, often gold standard, techniques.

General considerations and pitfalls.

As discussed in Scope and applications and as evidenced by the strong adoption of the tools, deep learning-based pose estimation works well in standard setups with visible animals. The most striking advantage over traditional motion capture systems is the absence of any need for body instrumentation. Although seemingly obvious, the previous statement hides the belated recognition that marker-based motion capture suffers greatly from the wobble of markers placed on the skin surface. That behavior, referred to as "soft tissue artifact" among movement scientists and attributable to the deformation of tissues underneath the skin such as contracting muscles or fat, is now known to be the major obstacle to obtaining accurate skeletal kinematics³ (109). To make matters worse, contaminated marker trajectories may be harmful in clinical contexts, potentially invalidating injury risk assessment (e.g. 110). Although a multitude of numerical approaches exists to tackle this issue, the most common, yet incomplete, solution is multi-body kinematics optimization (or "inverse kinematics" in computer graphics and robotics; 111). This procedure uses a kinematic model and searches for the body pose that minimizes, in the least-squares sense, the distance between the measured marker locations and the virtual ones from the model, while satisfying the constraints imposed by the various joints (112). Its accuracy is, however, decisively determined by the choice of the underlying model and its fidelity to an individual's functional anatomy (111). In contrast, motion capture with deep learning elegantly circumvents the problem by learning a geometry-aware representation of the body from the data to associate keypoints to limbs (10, 18, 27), which, of course, presupposes that one can avoid the "soft tissue artifact" when labeling.
tation of the body from the data to associate keypoints to
placing dependencies within a contained environ-
limbs (10, 18, 27), which, of course, presupposes that one
ment can minimize issues. Common environments
can avoid the “soft tissue artifact” when labeling.
include Anaconda (conda) and virtualenv, both for
Python code bases. At present, deep learning-powered pose estimation can be
poorly suited to evaluate rotation about a bone’s longitudi-
• Docker delivers software in packages called con- nal axis. From early markerless techniques based on visual
tainers, which can be run locally or on servers. hull extraction this is a known problem (113). In marker-
Containers are isolated from one another and bun- based settings, the problem has long been addressed by rather
dle their own software, libraries and configuration tracking clusters of at least three non-aligned markers to fully
files (docker.com; 106). reconstruct a rigid segment’s six degrees of freedom (114).
• GitHub: github.com is a platform for developing Performing the equivalent feat in a markerless case is dif-
and hosting software, which uses Git version con- ficult, but it is possible by labeling multiple points (for in-
trol. Version control is excellent to have history- stance on either side of the wrist to get the lower-limb ori-
dependent versions and discrete workspaces for entation). Still, recent hybrid, state-of-the-art approaches
code development and deployment. GitLab git- jointly training under both position and orientation supervi-
lab.com/explore also hosts code repositories. sion augur very well for video-based 3D joint angle compu-
tation (75, 76).
With the notable exception of approaches leveraging radio
wave signals to predict body poses through walls (115), deep
Practical considerations for pose estimation learning-powered motion capture requires the individuals be
(with deep learning) visible; this is impractical for kinematic measurements over
wide areas. A powerful alternative is offered by Inertial Mea-
As a recent field gaining traction, it is instructive to regard
surement Units (IMUs)—low-cost and lightweight devices
the operability of deep learning-powered pose estimation in
typically recording linear accelerations, angular velocities
light of well-established, often gold standard, techniques.
and the local magnetic field. Raw inertial data can be used for
coarse behavior classification across species (3, 116). They
General considerations and pitfalls. can also be integrated to track displacement with lower power
As discussed in Scope and applications and as evidenced by consumption and higher temporal resolution than GPS (117),
the strong adaptation of the tools, deep learning-based pose thereby providing a compact and portable way to investi-
estimation work well in standard setups with visible animals. gate whole body dynamics (e.g. 118) or, indirectly, energet-
The most striking advantage over traditional motion capture ics (119). Recent advances in miniaturization of electronical
systems is the absence of any need for body instrumentation. components now also allow precise quantification of posture
Although seemingly obvious, the previous statement hides in small animals (120), and open new avenues for kinematic
the belated recognition that marker-based motion capture suf- recordings in multiple animals at once at fine motor scales.
fers greatly from the wobble of markers placed on the skin Nonetheless, IMU-based full body pose reconstruction ne-
surface. That behavior, referred to as “soft tissue artifact” cessitates multiple sensors over the body parts of interest;
among movement scientists and attributable to the deforma- 3 Intra-cortical pins and biplane fluoroscopy give direct, uncontaminated
tion of tissues underneath the skin such as contracting mus- access to joint kinematics. The first, however, is invasive (and entails careful
cles or fat, is now known to be the major obstacle to obtaining surgical procedures; 107) whereas the second is only operated in very con-
strained and complex laboratory settings (108). Both are local to a specific
2 forum.image.sc joint, and as such do not strictly address the task of pose estimation.



Figure 7. Labeling Pitfalls: How corruptions affect performance (A) Illustration of two types of labeling errors. Top is ground truth, middle is missing
a label at the tailbase, and bottom is if the labeler swapped the ear identity (left to right, etc.). (B) Using a small dataset of 106 frames, how do the
corruptions in A affect the percent of correct keypoints (PCK) as the distance to ground truth increases from 0 pixel (perfect prediction) to 20 pixels
(larger error)? The X-axis denotes the difference between the ground truth and the predicted location (RMSE in pixels), whereas the Y-axis is the fraction of frames
considered accurate (e.g., ≈80% of frames fall within 9 pixels, even on this small training dataset, for points that are not corrupted, whereas for corrupted
points this falls to ≈65%). The fraction of the dataset that is corrupted affects this value. Shown is when missing the tailbase label (top) or swapping
the ears in 1, 5, 10 and 20% of frames (of 106 labeled training images). Swapping vs. missing labels has a more notable adverse effect on network
performance.

commercial solutions require up to 17 of them (121). That burden was recently eased by utilizing a statistical body model that incorporates anatomical constraints, together with optimizing poses over multiple frames to enforce coherence between the model orientation and IMU recordings—reducing the system down to six sensors while achieving stunning motion tracking (122). Yet, two additional difficulties remain. The first arises when fusing inertial data in order to estimate a sensor’s orientation (for a comprehensive description of mathematical formalism and implementation of common fusion algorithms, see 123). The process is susceptible to magnetic disturbances that distort sensor readings and, consequently, orientation estimates (124). The second stems from the necessity to align a sensor’s local coordinate system to anatomically meaningful axes, a step crucial (among others) to calculating joint angles (e.g., 125). The calibration is ordinarily carried out by having the subject perform a set of predefined movements in sequence, whose execution determines the quality of the procedure. Yet, in some pathological populations (let alone in animals), calibration may be challenging to say the least, deteriorating pose reconstruction accuracy (126).
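To give a flavor of what such a fusion algorithm does, below is a minimal complementary-filter sketch in Python. It only illustrates the general idea—gyroscope integration blended with the gravity direction estimated from the accelerometer—and is not one of the estimators reviewed in (123); the axis conventions and the blending constant are assumptions.

```python
import numpy as np

def complementary_filter(gyro, acc, dt, alpha=0.98):
    """Fuse gyroscope and accelerometer readings into pitch/roll estimates (radians).
    gyro: (T, 3) angular velocities in rad/s; acc: (T, 3) accelerations in m/s^2."""
    n = len(gyro)
    pitch, roll = np.zeros(n), np.zeros(n)
    for t in range(1, n):
        # integrating angular velocity tracks fast motion but drifts over time
        pitch_gyro = pitch[t - 1] + gyro[t, 1] * dt
        roll_gyro = roll[t - 1] + gyro[t, 0] * dt
        # the accelerometer gives a noisy but drift-free gravity direction
        pitch_acc = np.arctan2(-acc[t, 0], np.hypot(acc[t, 1], acc[t, 2]))
        roll_acc = np.arctan2(acc[t, 1], acc[t, 2])
        # blend the two estimates (alpha close to 1 trusts the gyroscope more)
        pitch[t] = alpha * pitch_gyro + (1 - alpha) * pitch_acc
        roll[t] = alpha * roll_gyro + (1 - alpha) * roll_acc
    return pitch, roll
```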
A compromise to making the task less arduous is to combine videos and body-worn inertial sensors. Thanks to their complementary nature, incorporating both cues mitigates the limitations of each individual system; i.e., both modalities reinforce one another in that IMUs help disambiguate occlusions, whereas videos provide disturbance-free spatial information (127). The idea also applies particularly well to the tracking of multiple individuals—even without the use of appearance features, advantageously—by exploiting unique movement signatures contained within inertial signals to track identities over time (128).

Pitfalls of using deep learning-based motion capture.

Despite being trained on large scale datasets of thousands of individuals, even the best architectures fail to generalize to “atypical” postures (with respect to the training set). This is wonderfully illustrated by the errors committed by OpenPose on yoga poses (129).

These domain shifts are major challenges (also illustrated below), and while this is an active area of research with much progress, the easiest way to make sure that the algorithm generalizes well is to label data that is similar to the videos at inference time. However, thanks to the active learning implemented in many packages, users can manually refine the labels on “outlier” frames.
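As an illustration of this active learning loop, here is a minimal sketch assuming the DeepLabCut API (the function names follow its documented workflow; exact arguments vary across versions, and other packages offer analogous steps):

```python
import deeplabcut

config = "/path/to/project/config.yaml"     # hypothetical project
videos = ["/path/to/new_experiment.mp4"]    # hypothetical video from a new setup

# run the current model on the new video
deeplabcut.analyze_videos(config, videos)

# pull out frames where the predictions look suspicious ("outlier" frames)
deeplabcut.extract_outlier_frames(config, videos)

# manually correct those frames, then merge them into the training set and retrain
deeplabcut.refine_labels(config)
deeplabcut.merge_datasets(config)
deeplabcut.create_training_dataset(config)
deeplabcut.train_network(config)
```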



Another major caveat of deep learning-powered pose estimation is arguably its intrinsic reliance on high-quality labeled images. This suggests that a labeled dataset that reflects the variability of the behavior should be used. If one – due to the quality of the video – cannot reliably identify body parts in still images (i.e., due to massive motion blur, uncertainty about body part (left/right leg crossing) or animal identity), then the video quality should be fixed, or sub-optimal results should be expected.

To give readers a concrete idea about label errors, augmentation methods, and active learning, we also provide some simple experiments with shared code and data. Code for reproducing these analyses is available at github.com/DeepLabCut/Primer-MotionCapture.

To illustrate the importance of error-free labeling, we artificially corrupted labels from the trail-tracking dataset from Mathis et al. (20). The corruptions respectively simulate inattentive labeling (e.g., with left–right bodyparts being occasionally confounded) and missing annotations or uncertainty as to whether to label an occluded bodypart. We corrupted 1, 5, 10 and 20% of the dataset (N=1,066 images) either by swapping two labels or removing one, and trained on 5% of the data. The effect of missing labels is barely noticeable (Figure 7A). Swapping labels, on the other hand, causes a substantial drop in performance, with an approximate 10% loss in percentage of correct keypoints (PCK) (Figure 7B). We therefore reason that careful labeling, more so than labeling a very large number of images, is the safest guard against poor ground truth annotations. We believe that explicitly modeling labeling errors, as done in Johnson and Everingham (130), will be an active area of research and integrated in some packages.
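A minimal sketch of such a corruption experiment and of the PCK metric is shown below (plain NumPy, independent of any particular package; the bodypart names are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def swap_labels(labels, fraction, part_a="ear_left", part_b="ear_right"):
    """Simulate inattentive labeling: swap two bodyparts in a random fraction of frames.
    labels: dict mapping bodypart name -> (n_frames, 2) float array of x, y coordinates."""
    labels = {k: v.astype(float).copy() for k, v in labels.items()}
    frames = rng.choice(len(labels[part_a]), int(fraction * len(labels[part_a])), replace=False)
    labels[part_a][frames], labels[part_b][frames] = (
        labels[part_b][frames].copy(), labels[part_a][frames].copy())
    return labels

def drop_labels(labels, fraction, part="tailbase"):
    """Simulate missing annotations: remove one bodypart in a random fraction of frames."""
    labels = {k: v.astype(float).copy() for k, v in labels.items()}
    frames = rng.choice(len(labels[part]), int(fraction * len(labels[part])), replace=False)
    labels[part][frames] = np.nan
    return labels

def pck(predictions, ground_truth, threshold_px=9):
    """Percentage of correct keypoints: fraction of predictions within threshold_px pixels
    of the ground truth (points with NaN ground truth count as incorrect here)."""
    distances = np.linalg.norm(predictions - ground_truth, axis=-1)
    return np.mean(distances <= threshold_px)
```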
Even if labeled well, augmentation greatly improves results and should be used. For instance, when training on the example dataset of (highly) correlated frames from one short video of one individual, the loss nicely plateaus and shows comparable train/test errors for three different augmentation methods (Figure 8A, B). The three models also give good performance and generalize to a test video of a different mouse. However, closer inspection reveals that the "scalecrop" augmentation method, which only performs cropping and scaling during training (80), leads to swaps in bodyparts when this small training set from one mouse is applied to a different mouse (Figure 8C, D). The other two methods, which were configured to perform rotations of the training data, could robustly track the posture of the mouse. This discrepancy becomes striking when observing the PCK plots: imgaug and tensorpack outperform scalecrop by a margin of up to ≈ 30% (Figure 8E). One simple way to generalize to this additional case is by active learning (80), which is also available for some packages. Thereby one annotates additional frames with poor performance (outlier frames) and then trains the network from the final configuration, which thus only requires a few thousand iterations. Adding 28 annotated frames from the higher resolution camera, we get good generalization for test frames from both scenarios (Figure 8F). Generally, this illustrates how the lack of diversity in training data leads to worse performance, but can be fixed by adding frames with poor performance (active learning).
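As an example of building such rotational symmetry and motion blur into training, here is a small sketch with the imgaug library (65); the image and keypoints are placeholders, and in practice pose estimation packages wrap pipelines like this for you:

```python
import numpy as np
import imgaug.augmenters as iaa
from imgaug.augmentables.kps import Keypoint, KeypointsOnImage

# rotation over the full ±180° range plus occasional motion blur
augmenter = iaa.Sequential([
    iaa.Affine(rotate=(-180, 180)),
    iaa.Sometimes(0.5, iaa.MotionBlur(k=7)),
])

image = np.zeros((480, 640, 3), dtype=np.uint8)            # placeholder frame
keypoints = KeypointsOnImage(
    [Keypoint(x=320, y=240), Keypoint(x=300, y=260)],      # e.g., snout and tailbase
    shape=image.shape)

# image and keypoints are transformed together, so the labels stay consistent
image_aug, keypoints_aug = augmenter(image=image, keypoints=keypoints)
```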
Coping with pitfalls.

Fortunately, dealing with the most common pitfalls is relatively straightforward, and mostly demands caution and common sense. Rules of thumb and practical guidelines are given in Box 5. Video quality should be envisaged as a trade-off between storage limitations, labeling precision, and training speed; e.g., the lower the resolution of a video, the smaller the occupied disk space and the faster the training speed, but the harder it gets to consistently identify bodyparts. In practice, DeepLabCut was shown to be very robust to downsizing and video compression, with pose reconstruction degrading only after scaling videos down to a third of their original size or compression by a factor of 1000 (87).

Body parts should be labeled reliably and consistently across frames that preferably capture a variety of behaviors. Note that some packages provide the user means to automatically extract frames differing in visual content based on unsupervised clustering, which simplifies the selection of relevant images in sparse behaviors.

Utilize symmetries for training with augmentation and try to include image augmentations that are helpful. Use the strongest model (given the speed requirements). Check performance and actively grow the training set if errors are found.

Box 5: Avoiding pitfalls.

• Video quality: While deep learning based methods are more robust than other methods, and can even learn from blurry, low-resolution images, you will make your life easier by recording quality videos.

• Labeling: Label accurately and use enough data from different videos. 10 videos with 20 frames each is better than 1 video with 200 frames. Check labeling quality. If multiple people label, agree on conventions, i.e. be sure that for a larger body part (like the back of a mouse) the same location is labeled.

• Dataset curation: Collect annotation data from the full repertoire of behavior (different individuals, backgrounds, postures). Automatic methods of frame extraction exist, but the videos need to be manually selected.

• Data Augmentation: Are there specific features you know happen in your videos, like motion blur or contrast changes? Can rotational symmetry, or mirroring be exploited? Then use an augmentation scheme that can build this into training.

• Optimization: Train until the loss plateaus, and do not over-train. Check that it worked by looking at performance on training images (both quantitatively and visually), ideally across "snapshots" (i.e. training iterations of the network). If that works, look at test images. Does the network generalize well? Note that, even if everything is proper, train and test performance can be different due to over-fitting on idiosyncrasies of the training set. Bear in mind that the latest iterations may not be the ones yielding the smallest errors on the test set. It is therefore recommended to store and evaluate multiple snapshots (see the sketch below).

• Cross-validation: You can compare different parameters (networks, augmentation, optimization) to get the best performance (see Figure 7).
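To make the snapshot evaluation recommended in Box 5 concrete, here is a minimal sketch assuming the DeepLabCut API (arguments and file locations are version-dependent and only meant as an example; other packages provide similar evaluation utilities):

```python
import deeplabcut

config = "/path/to/project/config.yaml"   # hypothetical project

# Setting snapshotindex to 'all' in config.yaml evaluates every stored snapshot,
# not only the most recent one.
deeplabcut.evaluate_network(config, Shuffles=[1], plotting=True)

# The evaluation results (train/test pixel errors per snapshot) are written to the
# project's evaluation folder and can be compared to pick the best snapshot.
```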



Pose estimation algorithms can make different types of errors: jitter, inversion (e.g. left/right), swap (e.g. associating a body part to another individual) and miss (131). Depending on the type of errors, different causes need to be addressed (i.e., check the data quality for any human-applied mistakes (20), use suitable augmentation methods). In some cases, post processing filters can be useful (such as Kalman filters), as can graphical models or other methods that learn the geometry of the bodyparts. We also believe that future work will explicitly model labeling errors during training.
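As a simple example of such post processing, low-confidence detections can be interpolated and the trajectory median-filtered to suppress jitter and isolated swaps; a minimal sketch (NumPy/SciPy, one keypoint at a time) is given below:

```python
import numpy as np
from scipy.signal import medfilt

def smooth_keypoint(xy, likelihood, min_confidence=0.6, window=5):
    """Post-process one keypoint trajectory.
    xy: (T, 2) array of predicted x, y positions; likelihood: (T,) network confidence.
    Low-confidence frames are treated as missing and linearly interpolated,
    then a median filter removes residual jitter and single-frame outliers."""
    xy = xy.astype(float).copy()
    bad = likelihood < min_confidence
    t = np.arange(len(xy))
    for dim in range(2):
        xy[bad, dim] = np.interp(t[bad], t[~bad], xy[~bad, dim])
        xy[:, dim] = medfilt(xy[:, dim], kernel_size=window)  # window must be odd
    return xy
```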
What to do with motion capture data?

Pose estimation with deep learning relieves the user of the painfully slow digitization of keypoints: with markerless tracking, one only needs to annotate a much smaller dataset, and the trained model can then be applied to new videos. Pose estimation also serves as a springboard to a plethora of other techniques. Indeed, many new tools are specifically being developed to aid users of pose estimation packages to analyze movement and behavioral outputs in a high-throughput manner. Plus, many such packages existed pre-deep learning and can now be leveraged with this new technology as well. While the general topic of what to do with the data is beyond this primer, we will provide a number of pointers. These tools fall into three classes: time series analysis, supervised, and unsupervised learning tools.

A natural step ahead is the quantitative analysis of the keypoint trajectories. The computation of linear and angular displacements, as well as their time derivatives, lays the ground for detailed motor performance evaluation—a great introduction to elementary kinematics can be found in (132), and a thorough description of 151 common metrics is given in (133). These have a broad range of applications, of which we highlight a system for assessing >30 behaviors in groups of mice in an automated way (134), or an investigation of the evolution of gait invariants across animals (135).
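For example, speed and a joint angle can be obtained from keypoint trajectories in a few lines (plain NumPy; the bodypart names and the frame rate are illustrative, and coordinates should first be converted to calibrated units):

```python
import numpy as np

def basic_kinematics(snout, shoulder, elbow, wrist, fps=100.0):
    """Each argument is a (T, 2) array of x, y positions (e.g., in mm)."""
    dt = 1.0 / fps

    # linear displacement and speed (first time derivative)
    step = np.diff(snout, axis=0)
    speed = np.linalg.norm(step, axis=1) / dt

    # elbow joint angle between the elbow->shoulder and elbow->wrist segments
    u, v = shoulder - elbow, wrist - elbow
    cos_angle = np.sum(u * v, axis=1) / (np.linalg.norm(u, axis=1) * np.linalg.norm(v, axis=1))
    elbow_angle = np.degrees(np.arccos(np.clip(cos_angle, -1.0, 1.0)))

    # angular velocity of the elbow joint
    elbow_velocity = np.diff(elbow_angle) / dt
    return speed, elbow_angle, elbow_velocity
```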

Figure 8. Data Augmentation Improves Performance Performance of three different augmentation methods on the same dataset of around 100 training
images from one short video of one mouse (thus correlated). Scalecrop is configured to only change the scale, and randomly crop images; Imgaug
also performs motion blur and rotation (±180◦ ) augmentation. Tensorpack performs Gaussian noise and rotation (±180◦ ) augmentation. (A) Loss
over training iterations has plateaued, and (B) test errors in pixels appear comparable for all methods. (C) Tail base aligned skeletons across time for a
video of a different mouse (displayed as a cross connecting snout to tail and left ear to right ear). Note the swap of the “T” in the shaded gray zone (and
overlaid on the image to the right in (D)). Imgaug and tensorpack, which also included full 180◦ rotations, work perfectly. This example highlights that
utilizing the rotational symmetry of the data during training can give excellent performance (without additional labeling). (E) Performance of the networks
on different mice recorded with the same camera (top) and a different camera (≈ 2.5x magnification; bottom). Networks trained with tensorpack and
imgaug augmentation generalize much better, and in particular generalize very well to different mice. The generalization to the other camera is difficult,
but also works better for tensorpack and imgaug augmentation. (F) Performance of networks on same data as in (E), but after an active learning step,
adding 28 training frames from the higher resolution camera and training for a few thousand iterations. Afterwards, the network generalizes well to both
scenarios.



Furthermore, kinematic metrics are the basis from which to deconstruct complex whole-body movements into interpretable motor primitives, non-invasively probing neuromuscular control (136). Unsupervised methods such as clustering methods (137), MotionMapper (138), MoSeq (139), or variational autoencoders (140) allow the extraction of common “kinematic behaviors” such as turning, running, or rearing. Supervised methods allow the prediction of human-defined labels such as “attack” or “freezing.” For this, general purpose tools such as scikit-learn (137) can be ideal, or tailored solutions with integrated GUIs such as JAABA can be used (141). Sturman et al. have developed an open source package to utilize motion capture outputs together with classifiers to automate human annotations for various behavioral tests (open field, elevated plus maze, forced swim test). They showed that these open source methods outperform commercially available platforms (142).
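A minimal sketch of such a supervised analysis with scikit-learn (137) is shown below; the pose-derived features and behavior labels are placeholders that would come from your own annotated videos:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 12))       # per-frame features from keypoints (speeds, angles, ...)
y = rng.integers(0, 3, size=5000)     # human labels, e.g., 0="other", 1="freezing", 2="attack"

# note: for real data, split by video/session rather than by frame to avoid leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```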
Kinematic analysis, together with simple principles derived from physics, also allows the calculation of the energy required to move about, a methodology relevant to understanding the mechanical determinants of the metabolic cost of locomotion (e.g. 143) or informing the design of bio-inspired robots (e.g. 144, 145).

Modeling and motion understanding.

Looking forward, we also expect that the motion capture data will be used to learn task-driven and data-driven models of the sensorimotor as well as the motor pathway. We have recently provided a blueprint combining human movement data, inverse kinematics, biomechanical modeling and deep learning (146). Given the complexity of movement, as well as the highly nonlinear nature of the sensorimotor processing (145, 147), we believe that such approaches will be fruitful for leveraging motion capture data to gain insight into brain function.

Perspectives

As we highlighted thus far in this primer, markerless motion capture has reached a mature state in only a few years due to the many advances in machine learning and computer vision. While there are still some challenges left (10), this is an active area of research, and advances in training schemes (such as semi-supervised and self-supervised learning) and model architectures will provide further improvements and require even less manual labour. Essentially, now every lab can train appropriate algorithms for their application and turn videos into accurate measurements of posture. If setups are sufficiently standardized, these algorithms already broadly generalize, even across multiple laboratories as in the case of the International Brain Lab (96). But how do we get there, and how do we make sure the needs of animal pose estimation for neuroscience applications are met?

Recent developments in deep learning.

Innovations in the field of object recognition and detection affect all aforementioned parts of the algorithm, as we discussed already in the context of using pre-trained representations. An emerging relevant research direction in machine learning is large scale semi-supervised and self-supervised representation learning (SSL). In SSL, the problem of pre-training representations is no longer dependent on large labeled datasets, as introduced above. Instead, even larger databases comprised of unlabeled examples—often multiple orders of magnitude larger than the counterparts used in supervised learning—can be leveraged. A variety of SSL algorithms are becoming increasingly popular in all areas of machine learning. Recently, representations obtained by large-scale self-supervised pre-training began to approach or even surpass the performance of the best supervised methods. Various SSL methods (148–156) have made strides in image recognition (156), speech processing (157–160) and NLP (161, 162), already starting to outperform models obtained by supervised pre-training on large datasets. Considering that recent SSL models for computer vision continue to be shared openly (e.g. 50, 156), they can be expected to impact and improve new model development in pose estimation, especially if merely replacing the backend model is required. On top of this, SSL methods can be leveraged in end-to-end models for estimating keypoints and poses directly from raw, unlabeled video (163–165). Approaches based on graph neural networks (166) can encode priors about the observed structure and model correlations between individual keypoints and across time (167). For some applications (like modeling soft tissue or volume) full surface reconstructions are needed, and this area has seen tremendous progress in recent years (12, 14, 168). Such advances can be closely watched and incorporated in neuroscience, but we also believe our field (neuroscience) is ready to innovate in this domain too.

Pose estimation specifically for neuroscience.

The goals of human pose estimation—aside from the purely scientific advances for object detection—range from person localization in videos, self-driving cars and pedestrian safety, to socially aware AI; these are related to, but do differ from, the applied goals of animal pose estimation in neuroscience. Here, we want tools that give us the highest precision, with the most rapid feedback options possible, and we want to train on small datasets but have them generalize well. This is a tall order, but so far we have seen that the glass is (arguably more than) half full. How do we meet these goals going forward? While much research is still required, there are essentially two ways forward: datasets and associated benchmarks, and algorithms.

Neuroscience needs (more) benchmarks.

In order to push the field towards innovations in areas the community finds important, setting up benchmark datasets and tasks will be crucial (i.e., the Animal version of ImageNet). The community can work towards sharing and collecting data of relevant tasks and curating it into benchmarks. This also has the opportunity of shifting the focus in computer vision research: instead of “only” doing human pose estimation, researchers will probably start evaluating on datasets directly relevant to the neuroscience community.



Indeed, there has been a recent interest in more animal-related work at top machine learning conferences (14, 169), and providing proper benchmarks for such approaches would be ideal.

For animals, such efforts are developing: Khan et al. recently shared a dataset comprising 22.4K annotated faces from 350 diverse species (169) and Labuguen announced a dataset of 13K annotated macaque images (89). We recently released two benchmark datasets that can be evaluated for state-of-the-art performance (see paperswithcode.com) on within-domain and out-of-domain data (horse10.deeplabcut.org). The motivation is to train on a limited number of individuals and test on held-out animals (the so-called “out-of-domain” issue) (39, 44). We picked horses due to the variation in coat colors (and provide >8K labeled frames). Secondly, to directly study the inherent shift in domain between individuals, we set up a benchmark for common image corruptions, as introduced by Hendrycks et al. (170), that uses the image corruptions library proposed by Michaelis et al. (171).
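A sketch of how such corrupted test sets can be generated with the imagecorruptions library of Michaelis et al. (171) is given below (the input frame is a placeholder; corruption names and severities follow that library):

```python
import numpy as np
from imagecorruptions import corrupt, get_corruption_names

frame = np.zeros((480, 640, 3), dtype=np.uint8)   # placeholder test image

# apply every common corruption at increasing severity to build a robustness benchmark
corrupted = {}
for name in get_corruption_names():               # e.g., 'gaussian_noise', 'motion_blur', ...
    for severity in range(1, 6):
        corrupted[(name, severity)] = corrupt(frame, corruption_name=name, severity=severity)
```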
Of course these aforementioned benchmarks are not sufficient to cover all the needs of the community, so we encourage consortium-style efforts to also curate data and provide additional benchmarks. Plus, making robust networks is still a major challenge, even when trained with large amounts of data (86, 172). In order to make this a possibility it will be important to develop and share common keypoint estimation benchmarks for animals as well as to expand the human ones to applications of interest, such as sports (129).

Sharing Pre-trained Models.

We believe another major step forward will be sharing pre-trained pose estimation networks. If as a field we were to annotate sufficiently diverse data, we could train more robust networks that broadly generalize. This success is promised by other large scale datasets such as MS COCO (36) and MPII pose (37). In the computer vision community, sharing model weights such that models do not need to be retrained has been critical for progress. For example, the ability to download pre-trained ImageNet weights is invaluable—training ImageNet from scratch on a standard GPU can take more than a week. Now, they are downloaded within a few seconds and fine-tuned in packages like DeepLabCut. However, even for custom training setups, sharing of code and easy access to cloud computing resources enables smaller labs to train and deploy models without investment in additional lab resources. Pre-training a typical object recognition model on the ILSVRC is now possible on the order of minutes for less than 100 USD (173) thanks to high-end cloud computing, which is also feasible for labs lacking the necessary on-site infrastructure (Box 3).
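For illustration, downloading ImageNet-pretrained weights and reusing them as a backbone for pose estimation takes only a few lines, e.g. with torchvision; this is a generic sketch with a hypothetical heatmap head, not the loading mechanism of any particular package:

```python
import torch
import torch.nn as nn
from torchvision import models

# ImageNet-pretrained ResNet-50; the weights are downloaded once and cached locally
backbone = models.resnet50(pretrained=True)
feature_extractor = nn.Sequential(*list(backbone.children())[:-2])  # drop pooling + classifier

# hypothetical deconvolution head predicting one heatmap per bodypart
n_bodyparts = 4
head = nn.Sequential(
    nn.ConvTranspose2d(2048, 256, kernel_size=4, stride=2, padding=1),
    nn.ReLU(),
    nn.Conv2d(256, n_bodyparts, kernel_size=1),
)

frames = torch.randn(1, 3, 256, 256)              # placeholder input batch
heatmaps = head(feature_extractor(frames))        # shape: (1, n_bodyparts, 16, 16)
```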
In neuroscience, we should aim to fine-tune even those models; namely, sharing of mouse-specific or primate-specific weights will drive interest and momentum from researchers without access to such data, and further drive innovations.

Currently, only DeepLabCut provides model weights (albeit not at the time of the original publication) as part of the recently launched Model Zoo (modelzoo.deeplabcut.org). It currently contains models trained on MPII pose (18), dog and cat models, as well as contributed models for primate facial recognition, primate full body recognition (89) and mouse pupil detection (Figure 6). Researchers can also contribute in a citizen-science fashion by labeling data on the web (contrib.deeplabcut.org) or by submitting models.

Both datasets and models will benefit from common formatting to ease sharing and testing. Candidate formats are HDF5 (also chosen by NeuroData Without Borders (174) and DeepLabCut), TensorFlow data (tensorflow.org/api_docs/python/tf/data), and/or PyTorch data (pytorch.org/docs/stable/torchvision/datasets.html). Specifically, for models, protocol buffer formats for weights are useful and easy to share (17, 175) for deployment to other systems. Platforms such as OSF and Zenodo allow banking of weights, and some papers (e.g. 91, 142) have also shared their trained models. We envision that having easy-to-use interfaces to such models will be possible in the future.
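As an example of such a common format, the prediction files written by DeepLabCut are pandas DataFrames stored as HDF5 and can be loaded in a couple of lines (the file and bodypart names are hypothetical; the scorer/bodyparts/coords column layout follows the DeepLabCut output convention):

```python
import pandas as pd

# predictions for one video (e.g., written by deeplabcut.analyze_videos)
df = pd.read_hdf("mouse_sessionDLC_resnet50_demo.h5")

# columns form a MultiIndex: (scorer, bodypart, coordinate)
scorer = df.columns.get_level_values(0)[0]
snout_xy = df[scorer]["snout"][["x", "y"]].to_numpy()
snout_confidence = df[scorer]["snout"]["likelihood"].to_numpy()
```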
These pre-trained pose estimation networks hold several promises: they save time and energy (as different labs do not need to annotate and train networks), and they contribute to reproducibility in science. Like many other forms of biological data, such as genome sequences and functional imaging data, behavioral data is notoriously hard to analyze in standardized ways. Lack of agreement can lead to different results, as pointed out by a recent landmark study comparing the results achieved by 70 independent researchers analyzing nine hypotheses in shared imaging data (176). To increase reproducibility in behavioral science, video is a great tool (177). Analyzing behavioral data is complex, owing to its unstructured, large-scale nature, which highlights the importance of shared analysis pipelines. Thus, building robust architectures that extract the same behavioral measurements in different laboratories would be a major step forward.

Conclusions

Deep learning based markerless pose estimation has been broadly and rapidly adopted in the past two years. This impact was, in part, fueled by open-source code: by developing and sharing packages in public repositories on GitHub, they could be easily accessed for free and at scale. These packages are built on advances (and code) in computer vision and AI, which has a strong open science culture. Neuroscience also has a strong and growing open science culture (178), which greatly impacts the field, as evidenced by tools from the Allen Institute, the UCLA Miniscope (179), OpenEphys (180), and Bonsai (175) (just to name a few).

Moreover, Neuroscience and AI have a long history of influencing each other (181), and research in Neuroscience will likely contribute to making AI more robust (181, 182).



The analysis of animal motion is a highly interdisciplinary field at the intersection of biomechanics, computer vision, medicine and robotics with a long tradition (1). The recent advances in deep learning have greatly simplified the measurement of animal behavior, which, as we and others believe (183), in turn will greatly advance our understanding of the brain.

Acknowledgments:

We thank Yash Sharma for discussions around future directions in self-supervised learning, and Erin Diel, Maxime Vidal, Claudio Michaelis, and Thomas Biasi for comments on the manuscript. Funding was provided by the Rowland Institute at Harvard University (MWM, AM), the Chan Zuckerberg Initiative (MWM, AM, JL) and the German Federal Ministry of Education and Research (BMBF) through the Tübingen AI Center (StS; FKZ: 01IS18039A). StS thanks the International Max Planck Research School for Intelligent Systems (IMPRS-IS) and acknowledges his membership in the European Laboratory for Learning & Intelligent Systems (ELLIS) PhD program. The authors declare no conflicts of interest. M.W.M. dedicates this work to Adam E. Max.

References

16. Pablo Maceira-Elvira, Traian Popa, Anne-Christine Schmid, and Friedhelm C Hummel. Wearable technology in stroke rehabilitation: towards improved diagnosis and treatment of upper-limb motor impairment. Journal of neuroengineering and rehabilitation, 16(1):142, 2019.
17. Gary Kane, Gonçalo Lopes, Jonny L. Saunders, Alexander Mathis, and Mackenzie Mathis. Real-time deeplabcut for closed-loop feedback based on posture. bioRxiv, 2020.
18. Eldar Insafutdinov, Leonid Pishchulin, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In European Conference on Computer Vision, pages 34–50. Springer, 2016.
19. Sven Kreiss, Lorenzo Bertoni, and Alexandre Alahi. Pifpaf: Composite fields for human pose estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 11977–11986, 2019.
20. Alexander Mathis, Pranav Mamidanna, Kevin M Cury, Taiga Abe, Venkatesh N Murthy, Mackenzie Weygandt Mathis, and Matthias Bethge. Deeplabcut: markerless pose estimation of user-defined body parts with deep learning. Nature neuroscience, 21:1281–1289, 2018.
21. Thomas B Moeslund, Adrian Hilton, and Volker Krüger. A survey of advances in vision-based human motion capture and analysis. Computer vision and image understanding, 104(2-3):90–126, 2006.
22. Ronald Poppe. Vision-based human motion analysis: An overview. Computer Vision and Image Understanding, 108(1):4–18, 2007. ISSN 1077-3142. doi: https://doi.org/10.1016/j.cviu.2006.10.016. Special Issue on Vision for Human-Computer Interaction.
23. Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning.
1. Reinhard Klette and Garry Tee. Understanding human motion: A Nature, 521(7553):436–444, 2015.
historic review. In Human motion, pages 1–22. Springer, 2008. 24. Alexander Toshev and Christian Szegedy. Deeppose: Human
2. Mary D Leakey and Richard L Hay. Pliocene footprints in the laetolil pose estimation via deep neural networks. CoRR, abs/1312.4659,
beds at laetoli, northern tanzania. Nature, 278(5702):317–323, 2013.
1979. 25. Arjun Jain, Jonathan Tompson, Yann LeCun, and Christoph Bre-
3. Roland Kays, Margaret C Crofoot, Walter Jetz, and Martin Wikel- gler. Modeep: A deep learning framework using motion features
ski. Terrestrial animal tracking as an eye on life and planet. Sci- for human pose estimation. CoRR, abs/1409.7963, 2014.
ence, 348(6240):aaa2478, 2015. 26. Alejandro Newell, Kaiyu Yang, and Jia Deng. Stacked hourglass
4. Danielle D Brown, Roland Kays, Martin Wikelski, Rory Wilson, and networks for human pose estimation. In European Conference on
A Peter Klimley. Observing the unwatchable through acceleration Computer Vision, pages 483–499. Springer, 2016.
logging of animal behavior. Animal Biotelemetry, 1(1):20, 2013. 27. Zhe Cao, Gines Hidalgo, Tomas Simon, Shih-En Wei, and Yaser
5. Valentina Camomilla, Elena Bergamini, Silvia Fantozzi, and Sheikh. OpenPose: realtime multi-person 2D pose estimation us-
Giuseppe Vannozzi. Trends supporting the in-field use of wearable ing Part Affinity Fields. In arXiv preprint arXiv:1812.08008, 2018.
inertial sensors for sport performance evaluation: A systematic re- 28. Bin Xiao, Haiping Wu, and Yichen Wei. Simple baselines for hu-
view. Sensors, 18(3):873, 2018. man pose estimation and tracking. In Proceedings of the European
6. Gunnar Johansson. Visual perception of biological motion and a conference on computer vision (ECCV), pages 466–481, 2018.
model for its analysis. Perception & psychophysics, 14(2):201– 29. Bowen Cheng, Bin Xiao, Jingdong Wang, Honghui Shi, Thomas S
211, 1973. Huang, and Lei Zhang. Higherhrnet: Scale-aware representa-
7. Allan F O’Connell, James D Nichols, and K Ullas Karanth. Camera tion learning for bottom-up human pose estimation. In Proceed-
traps in animal ecology: methods and analyses. Springer Science ings of the IEEE/CVF Conference on Computer Vision and Pattern
& Business Media, 2010. Recognition, pages 5386–5395, 2020.
8. Ben G Weinstein. A computer vision for animal ecology. Journal 30. Navneet Dalal and Bill Triggs. Histograms of oriented gradients
of Animal Ecology, 87(3):533–545, 2018. for human detection. In 2005 IEEE computer society conference
9. Xiongwei Wu, Doyen Sahoo, and Steven CH Hoi. Recent ad- on computer vision and pattern recognition (CVPR’05), volume 1,
vances in deep learning for object detection. Neurocomputing, pages 886–893. IEEE, 2005.
2020. 31. David G Lowe. Distinctive image features from scale-invariant
10. Mackenzie Weygandt Mathis and Alexander Mathis. Deep learn- keypoints. International journal of computer vision, 60(2):91–110,
ing tools for the measurement of animal behavior in neuroscience. 2004.
Current Opinion in Neurobiology, 60:1–11, 2020. 32. Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool.
11. Sandeep Robert Datta, David J Anderson, Kristin Branson, Pietro Speeded-up robust features (surf). Computer vision and image
Perona, and Andrew Leifer. Computational neuroethology: a call understanding, 110(3):346–359, 2008.
to action. Neuron, 104(1):11–24, 2019. 33. Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep learn-
12. Rıza Alp Güler, Natalia Neverova, and Iasonas Kokkinos. Dense- ing. MIT press, 2016.
pose: Dense human pose estimation in the wild. In Proceedings 34. Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-
of the IEEE Conference on Computer Vision and Pattern Recogni- Fei. Imagenet: A large-scale hierarchical image database. In
tion, pages 7297–7306, 2018. 2009 IEEE conference on computer vision and pattern recognition,
13. Silvia Zuffi, Angjoo Kanazawa, David Jacobs, and Michael Black. pages 248–255. Ieee, 2009.
3d menagerie: Modeling the 3d shape and pose of animals. 2017 35. Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, San-
IEEE Conference on Computer Vision and Pattern Recognition jeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya
(CVPR), 11 2016. Khosla, Michael Bernstein, et al. Imagenet large scale visual
14. Artsiom Sanakoyeu, Vasil Khalidov, Maureen S McCarthy, Andrea recognition challenge. International journal of computer vision, 115
Vedaldi, and Natalia Neverova. Transferring dense pose to proxi- (3):211–252, 2015.
mal animal classes. arXiv preprint arXiv:2003.00080, 2020. 36. Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro
15. Samsoon Inayat, Surjeet Singh, Arashk Ghasroddashti, Qandeel Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Mi-
Qandeel, Pramuka Egodage, Ian Q Whishaw, and Majid H Mo- crosoft coco: Common objects in context. In European conference
hajerani. A matlab-based toolbox for characterizing behavior of on computer vision, pages 740–755. Springer, 2014.
rodents engaged in string-pulling. eLife, 9:e54540, 2020.



37. Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt ceedings of the IEEE conference on computer vision and pattern
Schiele. 2d human pose estimation: New benchmark and state of recognition, pages 4700–4708, 2017.
the art analysis. In Proceedings of the IEEE Conference on com- 56. Mingxing Tan and Quoc V Le. Efficientnet: Rethinking
puter Vision and Pattern Recognition, pages 3686–3693, 2014. model scaling for convolutional neural networks. arXiv preprint
38. Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking imagenet arXiv:1905.11946, 2019.
pre-training. arXiv preprint arXiv:1811.08883, 2018. 57. Simon Kornblith, Jonathon Shlens, and Quoc V Le. Do better ima-
39. Alexander Mathis, Mert Yüksekgönül, Byron Rogers, Matthias genet models transfer better? In Proceedings of the IEEE Confer-
Bethge, and Mackenzie W Mathis. Pretraining boosts out- ence on Computer Vision and Pattern Recognition, pages 2661–
of-domain robustness for pose estimation. arXiv preprint 2671, 2019.
arXiv:1909.11229, 2019. 58. Vincent Dumoulin and Francesco Visin. A guide to convolution
40. Ahmet Arac, Pingping Zhao, Bruce H Dobkin, S Thomas arithmetic for deep learning. arXiv preprint arXiv:1603.07285,
Carmichael, and Peyman Golshani. Deepbehavior: A deep learn- 2016.
ing toolbox for automated analysis of animal and human behavior 59. Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Gold-
imaging data. Frontiers in systems neuroscience, 13:20, 2019. stein. Visualizing the loss landscape of neural nets. In Advances in
41. Jacob M Graving, Daniel Chae, Hemal Naik, Liang Li, Benjamin Neural Information Processing Systems, pages 6389–6399, 2018.
Koger, Blair R Costelloe, and Iain D Couzin. Deepposekit, a 60. Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin
software toolkit for fast and robust animal pose estimation using Murphy, and Alan L Yuille. Deeplab: Semantic image segmen-
deep learning. eLife, 8:e47994, oct 2019. ISSN 2050-084X. doi: tation with deep convolutional nets, atrous convolution, and fully
10.7554/eLife.47994. connected crfs. IEEE transactions on pattern analysis and ma-
42. Praneet C. Bala, Benjamin R. Eisenreich, Seng Bum Michael chine intelligence, 40(4):834–848, 2017.
Yoo, Benjamin Y. Hayden, Hyun Soo Park, and Jan Zimmermann. 61. Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning
Openmonkeystudio: Automated markerless pose estimation in Zhang, Eric Tzeng, and Trevor Darrell. Decaf: A deep convolu-
freely moving macaques. bioRxiv, 2020. doi: 10.1101/2020.01. tional activation feature for generic visual recognition. In Interna-
31.928861. tional conference on machine learning, pages 647–655, 2014.
43. XiaoLe Liu, Si-yang Yu, Nico Flierman, Sebastian Loyola, Maarten 62. Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Ima-
Kamermans, Tycho M. Hoogland, and Chris I. De Zeeuw. Opti- genet classification with deep convolutional neural networks. In
flex: video-based animal pose estimation using deep learning en- Advances in neural information processing systems, pages 1097–
hanced by optical flow. bioRxiv, 2020. doi: 10.1101/2020.04.04. 1105, 2012.
025494. 63. Amir R Zamir, Alexander Sax, William Shen, Leonidas J Guibas,
44. Alexander Mathis, Thomas Biasi, Y Mert, Byron Rogers, Matthias Jitendra Malik, and Silvio Savarese. Taskonomy: Disentangling
Bethge, and Mackenzie Weygandt Mathis. Imagenet performance task transfer learning. In Proceedings of the IEEE Conference
correlates with pose estimation robustness and generalization on on Computer Vision and Pattern Recognition, pages 3712–3722,
out-of-domain data. International Conference on Machine Learn- 2018.
ing 2020 Workshop on Uncertainty and Robustness in Deep 64. Yuxin Wu et al. Tensorpack. https://github.com/
Learning, 2020. tensorpack/, 2016.
45. Talmo D Pereira, Diego E Aldarondo, Lindsay Willmore, Mikhail 65. Alexander B. Jung, Kentaro Wada, Jon Crall, Satoshi Tanaka,
Kislin, Samuel S-H Wang, Mala Murthy, and Joshua W Shaevitz. Jake Graving, Christoph Reinders, Sarthak Yadav, Joy Baner-
Fast animal pose estimation using deep neural networks. Nature jee, Gábor Vecsei, Adam Kraft, Zheng Rui, Jirka Borovec, Chris-
methods, 16(1):117, 2019. tian Vallentin, Semen Zhydenko, Kilian Pfeiffer, Ben Cook, Ismael
46. Semih Günel, Helge Rhodin, Sergei Manzhos, Joao Luiz Cam- Fernández, François-Michel De Rainville, Chi-Hung Weng, Ab-
pagnolo, Ramdya, and Pascal Fua. Deepfly 3 d : A deep learning- ner Ayala-Acevedo, Raphael Meudec, Matias Laporte, et al. im-
based 1 approach for 3 d limb and appendage 2 tracking in teth- gaug. https://github.com/aleju/imgaug, 2020. Online;
ered , adult drosophila. 2019. accessed 01-Feb-2020.
47. Christian Zimmermann, Artur Schneider, Mansour Alyahyay, 66. Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James
Thomas Brox, and Ilka Diester. Freipose: A deep learning frame- Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia
work for precise animal motion capture in 3d spaces. bioRxiv, Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-
2020. doi: 10.1101/2020.02.27.967620. performance deep learning library. In Advances in neural informa-
48. Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming tion processing systems, pages 8026–8037, 2019.
He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens 67. Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy
van der Maaten. Exploring the limits of weakly supervised pre- Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey
training. In Proceedings of the European Conference on Computer Irving, Michael Isard, et al. Tensorflow: A system for large-scale
Vision (ECCV), pages 181–196, 2018. machine learning. In 12th {USENIX} symposium on operating
49. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowl- systems design and implementation ({OSDI} 16), pages 265–283,
edge in a neural network. arXiv preprint arXiv:1503.02531, 2015. 2016.
50. Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. Self- 68. Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-
training with noisy student improves imagenet classification. In resolution representation learning for human pose estimation. In
Proceedings of the IEEE/CVF Conference on Computer Vision and Proceedings of the IEEE conference on computer vision and pat-
Pattern Recognition, pages 10687–10698, 2020. tern recognition, pages 5693–5703, 2019.
51. Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan 69. Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully con-
Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo volutional networks for semantic segmentation. In Proceedings of
Malloci, Tom Duerig, et al. The open images dataset v4: Unified the IEEE conference on computer vision and pattern recognition,
image classification, object detection, and visual relationship de- pages 3431–3440, 2015.
tection at scale. arXiv preprint arXiv:1811.00982, 2018. 70. Semih Günel, Helge Rhodin, Daniel Morales, João H Campag-
52. Hengduo Li, Bharat Singh, Mahyar Najibi, Zuxuan Wu, and Larry S nolo, Pavan Ramdya, and Pascal Fua. Deepfly3d, a deep learning-
Davis. An analysis of pre-training on object detection. arXiv based approach for 3d limb and appendage tracking in tethered,
preprint arXiv:1904.05871, 2019. adult Drosophila. eLife, 8:e48571, oct 2019. ISSN 2050-084X. doi:
53. Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmogi- 10.7554/eLife.48571.
nov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and 71. Joao Carreira, Pulkit Agrawal, Katerina Fragkiadaki, and Jitendra
linear bottlenecks. In Proceedings of the IEEE conference on com- Malik. Human pose estimation with iterative error feedback. In Pro-
puter vision and pattern recognition, pages 4510–4520, 2018. ceedings of the IEEE conference on computer vision and pattern
54. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep recognition, pages 4733–4742, 2016.
residual learning for image recognition. In Proceedings of the IEEE 72. Jonathan J Tompson, Arjun Jain, Yann LeCun, and Christoph Bre-
conference on computer vision and pattern recognition, pages gler. Joint training of a convolutional network and a graphical
770–778, 2016. model for human pose estimation. In Advances in neural infor-
55. Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q mation processing systems, pages 1799–1807, 2014.
Weinberger. Densely connected convolutional networks. In Pro- 73. Eldar Insafutdinov, Mykhaylo Andriluka, Leonid Pishchulin, Siyu



Tang, Evgeny Levinkov, Bjoern Andres, and Bernt Schiele. Art- herd. Manual dexterity of mice during food-handling involves the
track: Articulated multi-person tracking in the wild. In Proceedings thumb and a set of fast basic movements. PloS one, 15(1):
of the IEEE conference on computer vision and pattern recogni- e0226774, 2020.
tion, 2017. 92. Harris S Kaplan and Manuel Zimmer. Brain-wide representations
74. Yuan Yao, Yasamin Jafarian, and Hyun Soo Park. Monet: Multi- of ongoing behavior: a universal principle? Current opinion in
view semi-supervised keypoint detection via epipolar divergence. neurobiology, 64:60–69, 2020.
In Proceedings of the IEEE International Conference on Computer 93. Steven M Peterson, Satpreet H Singh, Nancy XR Wang, Ra-
Vision, pages 753–762, 2019. jesh PN Rao, and Bingni W Brunton. Behavioral and neural vari-
75. Lan Xu, Weipeng Xu, Vladislav Golyanik, Marc Habermann, ability of naturalistic arm movements. BioRxiv, 2020.
Lu Fang, and Christian Theobalt. Eventcap: Monocular 3d capture 94. Arne F Meyer, John O’Keefe, and Jasper Poort. Two distinct types
of high-speed human motions using an event camera. In Proceed- of eye-head coupling in freely moving mice. bioRxiv, 2020.
ings of the IEEE/CVF Conference on Computer Vision and Pattern 95. Joshua H Siegle, Xiaoxuan Jia, Séverine Durand, Sam Gale, Cor-
Recognition, pages 4968–4978, 2020. bett Bennett, Nile Graddis, Greggory Heller, Tamina K Ramirez,
76. Yuxiao Zhou, Marc Habermann, Weipeng Xu, Ikhsanul Habibie, Hannah Choi, Jennifer A Luviano, et al. A survey of spiking activ-
Christian Theobalt, and Feng Xu. Monocular real-time hand shape ity reveals a functional hierarchy of mouse corticothalamic visual
and motion capture using multi-modal data. In Proceedings of the areas. bioRxiv, page 805010, 2019.
IEEE/CVF Conference on Computer Vision and Pattern Recogni- 96. Kenneth D Harris, Max Hunter, Cyrille Rossant, Maho Sasaki,
tion, pages 5346–5355, 2020. Shan Shen, Nicholas A Steinmetz, Edgar Y Walker, Olivier Winter,
77. Léon Bottou. Large-scale machine learning with stochastic gradi- and Miles Wells. Data architecture for a large-scale neuroscience
ent descent. In Proceedings of COMPSTAT’2010, pages 177–186. collaboration. BioRxiv, page 827873, 2019.
Springer, 2010. 97. Irene Tracey, Clifford J Woolf, and Nick A Andrews. Composite
78. Ilya Sutskever, James Martens, George Dahl, and Geoffrey Hin- pain biomarker signatures for objective assessment and effective
ton. On the importance of initialization and momentum in deep treatment. Neuron, 101(5):783–800, 2019.
learning. In International conference on machine learning, pages 98. Silvestro Micera, Matteo Caleo, Carmelo Chisari, Friedhelm C
1139–1147, 2013. Hummel, and Alessandra Pedrocchi. Advanced neurotechnolo-
79. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic gies for the restoration of motor function. Neuron, 105(4):604–620,
optimization. In Yoshua Bengio and Yann LeCun, editors, 3rd In- 2020.
ternational Conference on Learning Representations, ICLR 2015, 99. J. D. Laurence-Chasen, A. R. Manafzadeh, N. G. Hatsopoulos,
San Diego, CA, USA, May 7-9, 2015, Conference Track Proceed- C. F. Ross, and F. I. Arce-McShane. Integrating xmalab and
ings, 2015. deeplabcut for high-throughput xromm. Journal of Experimental
80. Tanmay Nath, Alexander Mathis, An Chi Chen, Amir Patel, Biology, 2020. ISSN 0022-0949. doi: 10.1242/jeb.226720.
Matthias Bethge, and Mackenzie W Mathis. Using deeplabcut for 100. Nidhi Seethapathi, Shaofei Wang, Rachit Saluja, Gunnar Blohm,
3d markerless pose estimation across species and behaviors. Na- and Konrad P Kording. Movement science needs different pose
ture protocols, 14:2152–2176, 2019. tracking algorithms. arXiv preprint arXiv:1907.10226, 2019.
81. Julieta Martinez, Rayat Hossain, Javier Romero, and James J Lit- 101. Curtis T Rueden, Jeanelle Ackerman, Ellen T Arena, Jan Eglinger,
tle. A simple yet effective baseline for 3d human pose estimation. Beth A Cimini, Allen Goodman, Anne E Carpenter, and Kevin W
In Proceedings of the IEEE International Conference on Computer Eliceiri. Scientific community image forum: A discussion forum for
Vision, pages 2640–2649, 2017. scientific image software. PLoS biology, 17(6):e3000340, 2019.
82. Dushyant Mehta, Helge Rhodin, Dan Casas, Oleksandr Sotny- 102. Johannes Schindelin, Ignacio Arganda-Carreras, Erwin Frise, Ver-
chenko, Weipeng Xu, and Christian Theobalt. Monocular 3d hu- ena Kaynig, Mark Longair, Tobias Pietzsch, Stephan Preibisch,
man pose estimation using transfer learning and improved CNN Curtis Rueden, Stephan Saalfeld, Benjamin Schmid, et al. Fiji: an
supervision. CoRR, abs/1611.09813, 2016. open-source platform for biological-image analysis. Nature meth-
83. Denis Tomè, Chris Russell, and Lourdes Agapito. Lifting from ods, 9(7):676–682, 2012.
the deep: Convolutional 3d pose estimation from a single image. 103. Claire McQuin, Allen Goodman, Vasiliy S Chernyshev, Lee Ka-
CoRR, abs/1701.00295, 2017. mentsky, Beth A Cimini, Kyle W. Karhohs, Minh Doan, Liya Ding,
84. Ching-Hang Chen and Deva Ramanan. 3d human pose estima- Susanne M. Rafelski, Derek J. Thirstrup, Winfried Wiegraebe,
tion= 2d pose estimation+ matching. In Proceedings of the IEEE Shantanu Singh, Tim Becker, Juan C. Caicedo, and Anne E. Car-
Conference on Computer Vision and Pattern Recognition, pages penter. Cellprofiler 3.0: Next-generation image processing for bi-
7035–7043, 2017. ology. PLoS Biology, 16, 2018.
85. Pierre Karashchuk, Katie L Rupp, Evyn S Dickinson, Elischa 104. Christoph Sommer, Christoph Straehle, Ullrich Koethe, and Fred A
Sanders, Eiman Azim, Bingni W Brunton, and John C Tuthill. Ani- Hamprecht. Ilastik: Interactive learning and segmentation toolkit.
pose: a toolkit for robust markerless 3d pose estimation. bioRxiv, In 2011 IEEE international symposium on biomedical imaging:
2020. From nano to macro, pages 230–233. IEEE, 2011.
86. Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, 105. Stefan Van der Walt, Johannes L Schönberger, Juan Nunez-
Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Iglesias, François Boulogne, Joshua D Warner, Neil Yager, Em-
Wichmann. Shortcut learning in deep neural networks. arXiv manuelle Gouillart, and Tony Yu. scikit-image: image processing
preprint arXiv:2004.07780, 2020. in python. PeerJ, 2:e453, 2014.
87. Alexander Mathis and Richard A. Warren. On the inference speed 106. Dirk Merkel. Docker: lightweight linux containers for consistent
and video-compression robustness of deeplabcut. bioRxiv, 2018. development and deployment. Linux journal, 2014(239):2, 2014.
doi: 10.1101/457242. 107. DK Ramsey, PF Wretenberg, DL Benoit, M Lamontagne, and
88. Michael Berger, Naubahar Shahryar Agha, and Alexander Gail. G Nemeth. Methodological concerns using intra-cortical pins to
Wireless recording from unrestrained monkeys reveals motor goal measure tibiofemoral kinematics. Knee Surgery, Sports Trauma-
encoding beyond immediate reach in frontoparietal cortex. Elife, tology, Arthroscopy, 11(5):344–349, 2003.
9:e51322, 2020. 108. Renate List, Barbara Postolka, Pascal Schütz, Marco Hitz, Peter
89. Rollyn Labuguen, Jumpei Matsumoto, Salvador Negrete, Hiroshi Schwilch, Hans Gerber, Stephen J Ferguson, and William R Tay-
Nishimaru, Hisao Nishijo, Masahiko Takada, Yasuhiro Go, Ken- lor. A moving fluoroscope to capture tibiofemoral kinematics during
ichi Inoue, and Tomohiro Shibata. Macaquepose: A novel ‘in the complete cycles of free level and downhill walking as well as stair
wild’macaque monkey pose dataset for markerless motion cap- descent. PloS one, 12(10):e0185952, 2017.
ture. bioRxiv, 2020. 109. Valentina Camomilla, Raphaël Dumas, and Aurelio Cappozzo. Hu-
90. Teppei Ebina, Keitaro Obara, Akiya Watakabe, Yoshito Masamizu, man movement analysis: The soft tissue artefact issue. Jour-
Shin-Ichiro Terada, Ryota Matoba, Masafumi Takaji, Nobuhiko nal of Biomechanics, 62:1 – 4, 2017. ISSN 0021-9290. doi:
Hatanaka, Atsushi Nambu, Hiroaki Mizukami, et al. Arm move- https://doi.org/10.1016/j.jbiomech.2017.09.001. Human Movement
ments induced by noninvasive optogenetic stimulation of the mo- Analysis: The Soft Tissue Artefact Issue.
tor cortex in the common marmoset. Proceedings of the National 110. Kenneth B. Smale, Brigitte M. Potvin, Mohammad S. Shourijeh,
Academy of Sciences, 116(45):22844–22850, 2019. and Daniel L. Benoit. Knee joint kinematics and kinetics during the
91. John M Barrett, Martinna G Raineri Tapies, and Gordon MG Shep- hop and cut after soft tissue artifact suppression: Time to recon-



sider acl injury mechanisms? Journal of Biomechanics, 62:132 – human keypoint recognition. In Chinese Conference on Pat-
139, 2017. tern Recognition and Computer Vision (PRCV), pages 110–121.
111. Mickaël Begon, Michael Skipper Andersen, and Raphaël Dumas. Springer, 2019.
Multibody Kinematics Optimization for the Estimation of Upper and 130. Sam Johnson and Mark Everingham. Learning effective human
Lower Limb Human Joint Kinematics: A Systematized Method- pose estimation from inaccurate annotation. In CVPR 2011, pages
ological Review. Journal of Biomechanical Engineering, 140(3), 1465–1472. IEEE, 2011.
2018. 131. Matteo Ruggero Ronchi and Pietro Perona. Benchmarking and
112. T-W Lu and JJ O’connor. Bone position estimation from skin error diagnosis in multi-instance pose estimation. In Proceedings
marker co-ordinates using global optimisation with joint con- of the IEEE international conference on computer vision, pages
straints. Journal of biomechanics, 32(2):129–134, 1999. 369–378, 2017.
113. Elena Ceseracciu, Zimi Sawacha, and Claudio Cobelli. Compari- 132. D.A. Winter. Biomechanics and motor control of human movement.
son of markerless and marker-based motion capture technologies John Wiley & Sons, 2009.
through simultaneous data collection during gait: proof of concept. 133. Anne Schwarz, Christoph M Kanzler, Olivier Lambercy, Andreas R
PloS one, 9(3):e87640, 2014. Luft, and Janne M Veerbeek. Systematic review on kinematic as-
114. CW Spoor and FE Veldpaus. Rigid body motion calculated from sessments of upper limb movements after stroke. Stroke, 50(3):
spatial co-ordinates of markers. Journal of biomechanics, 13(4): 718–727, 2019.
391–393, 1980. 134. Fabrice de Chaumont, Elodie Ey, Nicolas Torquet, Thibault La-
115. Mingmin Zhao, Tianhong Li, Mohammad Abu Alsheikh, Yonglong gache, Stéphane Dallongeville, Albane Imbert, Thierry Legou,
Tian, Hang Zhao, Antonio Torralba, and Dina Katabi. Through-wall Anne-Marie Le Sourd, Philippe Faure, Thomas Bourgeron, et al.
human pose estimation using radio signals. In Proceedings of the Real-time analysis of the behaviour of groups of mice via a depth-
IEEE Conference on Computer Vision and Pattern Recognition, sensing camera and machine learning. Nature biomedical engi-
pages 7356–7365, 2018. neering, 3(11):930–942, 2019.
116. Pritish Chakravarty, Gabriele Cozzi, Arpat Ozgul, and Kamiar 135. Giovanna Catavitello, Yury Ivanenko, and Francesco Lacquaniti. A
Aminian. A novel biomechanical approach for animal behaviour kinematic synergy for terrestrial locomotion shared by mammals
recognition using accelerometers. Methods in Ecology and Evolu- and birds. Elife, 7:e38190, 2018.
tion, 10(6):802–814, 2019. 136. Alessia Longo, Thomas Haid, Ruud Meulenbroek, and Peter
117. OR Bidder, JS Walker, MW Jones, MD Holton, P Urge, DM Scant- Federolf. Biomechanics in posture space: Properties and rele-
lebury, NJ Marks, EA Magowan, IE Maguire, and RP Wilson. Step vance of principal accelerations for characterizing movement con-
by step: reconstruction of terrestrial animal movement paths by trol. Journal of Biomechanics, 82:397–403, 2019.
dead-reckoning. Movement ecology, 3(1):1–16, 2015. 137. Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent
118. Alan M Wilson, Tatjana Y Hubel, Simon D Wilshin, John C Lowe, Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter
Maja Lorenc, Oliver P Dewhirst, Hattie LA Bartlam-Brooks, Re- Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Ma-
becca Diack, Emily Bennitt, Krystyna A Golabek, et al. Biome- chine learning in python. Journal of machine learning research, 12
chanics of predator–prey arms race in lion, zebra, cheetah and (Oct):2825–2830, 2011.
impala. Nature, 554(7691):183–188, 2018. 138. Gordon J. Berman, Daniel M. Choi, William Bialek, and Joshua W.
119. Adrian C Gleiss, Rory P Wilson, and Emily LC Shepard. Making Shaevitz. Mapping the stereotyped behaviour of freely moving fruit
overall dynamic body acceleration work: on the theory of acceler- flies. Journal of The Royal Society Interface, 11(99), 2014. ISSN
ation as a proxy for energy expenditure. Methods in Ecology and 1742-5689. doi: 10.1098/rsif.2014.0672.
Evolution, 2(1):23–33, 2011. 139. Alexander B Wiltschko, Matthew J Johnson, Giuliano Iurilli,
120. Matthieu O Pasquet, Matthieu Tihy, Aurélie Gourgeon, Marco N Ralph E Peterson, Jesse M Katon, Stan L Pashkovski, Victoria E
Pompili, Bill P Godsil, Clément Léna, and Guillaume P Dugué. Abraira, Ryan P Adams, and Sandeep Robert Datta. Mapping sub-
Wireless inertial measurement of head kinematics in freely-moving second structure in mouse behavior. Neuron, 88(6):1121–1135,
rats. Scientific reports, 6:35689, 2016. 2015.
121. Daniel Roetenberg, Henk Luinge, and Per Slycke. Xsens mvn: 140. Kevin Luxem, Falko Fuhrmann, Johannes Kürsch, Stefan Remy,
Full 6dof human motion tracking using miniature inertial sensors. and Pavol Bauer. Identifying behavioral structure from deep varia-
Xsens Motion Technologies BV, Tech. Rep, 1, 2009. tional embeddings of animal motion. bioRxiv, 2020.
122. Timo von Marcard, Bodo Rosenhahn, Michael J Black, and Ger- 141. Mayank Kabra, Alice A Robie, Marta Rivera-Alba, Steven Branson,
ard Pons-Moll. Sparse inertial poser: Automatic 3d human pose and Kristin Branson. Jaaba: interactive machine learning for au-
estimation from sparse imus. In Computer Graphics Forum, vol- tomatic annotation of animal behavior. Nature methods, 10(1):64,
ume 36, pages 349–360. Wiley Online Library, 2017. 2013.
123. Angelo Maria Sabatini. Estimating three-dimensional orientation 142. Oliver Sturman, Lukas von Ziegler, Christa Schläppi, Furkan
of human body parts by inertial/magnetic sensing. Sensors, 11(2): Akyol, Mattia Privitera, Daria Slominski, Christina Grimm, Laeti-
1489–1525, 2011. tia Thieren, Valerio Zerbi, Benjamin Grewe, et al. Deep learning-
124. Bingfei Fan, Qingguo Li, and Tao Liu. How magnetic distur- based behavioral analysis reaches human accuracy and is capa-
bance influences the attitude and heading in magnetic and inertial ble of outperforming commercial solutions. Neuropsychopharma-
sensor-based orientation estimation. Sensors, 18(1):76, 2018. cology, 2020.
125. Julien Lebleu, Thierry Gosseye, Christine Detrembleur, Philippe 143. Franco Saibene and Alberto E Minetti. Biomechanical and physio-
Mahaudens, Olivier Cartiaux, and Massimo Penta. Lower limb logical aspects of legged locomotion in humans. European journal
kinematics using inertial sensors during locomotion: Accuracy and of applied physiology, 88(4-5):297–316, 2003.
reproducibility of joint angle calculations with different sensor-to- 144. Chen Li, Chad C Kessens, Ronald S Fearing, and Robert J
segment calibrations. Sensors, 20(3):715, 2020. Full. Mechanical principles of dynamic terrestrial self-righting using
126. Laura Susana Vargas-Valencia, Arlindo Elias, Eduardo Rocon, wings. Advanced Robotics, 31(17):881–900, 2017.
Teodiano Bastos-Filho, and Anselmo Frizera. An imu-to-body 145. John A Nyakatura, Kamilo Melo, Tomislav Horvat, Kostas
alignment method applied to human gait analysis. Sensors, 16 Karakasiliotis, Vivian R Allen, Amir Andikfar, Emanuel Andrada,
(12):2090, 2016. Patrick Arnold, Jonas Lauströer, John R Hutchinson, et al.
127. Andrew Gilbert, Matthew Trumble, Charles Malleson, Adrian Reverse-engineering the locomotion of a stem amniote. Nature,
Hilton, and John Collomosse. Fusing visual and inertial sensors 565(7739):351, 2019.
with semantics for 3d human pose estimation. International Jour- 146. Kai J Sandbrink, Pranav Mamidanna, Claudio Michaelis, Macken-
nal of Computer Vision, 127(4):381–397, 2019. zie Weygandt Mathis, Matthias Bethge, and Alexander Mathis.
128. Roberto Henschel, Timo von Marcard, and Bodo Rosenhahn. Si- Task-driven hierarchical deep neural networkmodels of the propri-
multaneous identification and tracking of multiple people using oceptive pathway. bioRxiv, 2020.
video and imus. In Proceedings of the IEEE Conference on 147. Manu S Madhav and Noah J Cowan. The synergy between neuro-
Computer Vision and Pattern Recognition Workshops, pages 0– science and control theory: the nervous system as inspiration for
0, 2019. hard control challenges. Annual Review of Control, Robotics, and
129. Ying Huang, Bin Sun, Haipeng Kan, Jiankai Zhuang, and Autonomous Systems, 3:243–267, 2020.
Zengchang Qin. Followmeup sports: New benchmark for 2d 148. Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Represen-