WO2021142532A1 - Activity recognition with deep embeddings - Google Patents
Activity recognition with deep embeddings
- Publication number
- WO2021142532A1 (PCT/CA2021/050017)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- subject
- embedding
- activity
- unsupervised
- sensor data
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/088—Non-supervised learning, e.g. competitive learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Definitions
- the present disclosure relates to the field of machine learning for human activity recognition.
- Inertial sensors embedded in mobile phones and wearable devices are commonly employed to classify and characterize human behaviors in a number of applications including tracking fitness, elder safety, sleep, physiotherapy, and others. Improving the accuracy and robustness of the algorithms underlying inertial human activity recognition (HAR) systems is an active field of research.
- HAR inertial human activity recognition
- a significant challenge for a supervised learning approach to inertial human activity recognition is the heterogeneity of data between individual users. This heterogeneity arises from differences in the hardware on which the inertial data is collected, differences in the inherent capabilities or attributes of the users themselves, and differences in the environment in which the data is collected.
- a user-specific model can be trained de novo with user-specific data or a combination of generic and user-specific data. This is generally not feasible for neural network approaches that require vast data sets and computational resources for training, but works well for non-neural approaches with engineered features.
- model updating e.g. online learning, transfer learning
- a generic convolution neural network architecture can be trained and adapted to specific users by retraining the classification layer while fixing the weights of the convolutional layers.
- a third scheme involves using user-specific classifier ensembles, for example, pre-training non-neural classifiers on subpopulations within a training set, and selecting user-specific classifier ensembles based on testing the pre-trained classifiers on user-specific data.
- a learned deep embedding is employed to generate subject-specific reference embeddings that are generated based on labeled sensor data that is specific to a subject and collected in a presence of supervision of subject movements.
- the labeled reference embeddings may then be employed to infer an activity class of a movement performed by the subject in an absence of supervision, based on the processing of the subject-specific labeled reference embeddings and an embedding generated from unlabeled sensor data collected when the unsupervised movements are performed.
- the deep embedding may be learned according to a metric loss function, such as a triplet loss function, and the metric loss function may include at least one comparative measure that is based on two or more samples associated with a common subject.
- a method of assessing movements of a subject comprising: collecting first sensor data associated with first supervised movements performed by a plurality of subjects in a presence of supervision, and obtaining a first set of activity class labels characterizing the first supervised movements, each activity class label of the first set of activity class labels associating a given first supervised movement with a respective activity class selected from a first set of activity classes; generating a training dataset comprising the first sensor data and the first set of activity class labels; employing the training dataset to learn a deep embedding using a neural network; collecting second sensor data associated with second supervised movements performed by a selected subject in a presence of supervision, and obtaining a second set of activity class labels characterizing the second supervised movements, each activity
- the neural network is trained according to a metric learning loss function.
- the metric learning loss function may comprise at least one subject-specific comparative measure based on two or more samples associated with a common subject.
- the neural network is trained according to a triplet loss function based on a set of sample triplets, wherein at least one sample triplet includes an anchor data sample and a positive data sample that are associated with a common subject. At least one sample triplet may further include a negative data sample associated with the common subject. Each sample triplet may include an anchor data sample, a positive data sample and a negative sample that are respectively associated with a respective common subject.
- the neural network is trained according to a contrastive loss function based on a set of sample doublets, wherein at least one sample doublet includes two samples that are associated with a common subject.
- the deep embedding is obtained from a feature vector that forms an output of a convolutional portion of a convolutional neural network.
- the deep embedding is learned by training a convolutional neural network comprising a fully connected portion, and wherein the deep embedding is obtained from the feature vector that forms an input to the fully connected portion of the convolutional neural network.
- the activity class associated with the unsupervised movement is determined from the unsupervised embedding and the set of reference embeddings using a classifier that employs a distance metric.
- the classifier may be a K-nearest neighbour classifier.
- the neural network comprises a convolutional neural network.
- the neural network may comprise a convolutional recurrent neural network.
- the neural network comprises a recurrent neural network.
- the method further comprises processing the third sensor data to determine a performance attribute associated with the unsupervised movement.
- the activity class of the unsupervised movement may be employed when processing the second sensor data to determine the performance attribute.
- the plurality of subjects comprises the selected subject.
- the first supervised movements include correctly performed movements and incorrectly performed movements.
- the second supervised movements include correctly performed movements and incorrectly performed movements.
- the second set of activity classes is the same as the first set of activity classes.
- the second set of activity classes is different than the first set of activity classes.
- the second sensor data and the third sensor data are obtained from one or more wearable sensors worn by the selected subject.
- the second sensor data and the third sensor data are obtained from one or more sensors configured to detect fiducial markers worn by the selected subject.
- the first sensor data is only collected when a given supervised movement satisfies performance criteria.
- each activity class of the first set of activity classes corresponds to a unique physiotherapy exercise.
- each activity class of the first set of activity classes corresponds to a unique dance movement.
- each activity class of the first set of activity classes corresponds to a unique movement associated with a medical procedure.
- the deep embedding is learned via a convolutional neural network trained with categorical cross entropy loss.
- the deep embedding is learned via a recurrent neural network trained with categorical cross entropy loss.
- the deep embedding is learned via one of an auto-encoder and a variational auto-encoder.
- the deep embedding comprises engineered features and learned features.
- At least one reference embedding is validated in relation to the training dataset before being employed to determine the activity class associated with the unsupervised movement.
- the activity class associated with the unsupervised movement is identified as not belonging to the first set of activity classes.
- a local outlier factor algorithm may be employed to compare the unsupervised embedding with the set of reference embeddings.
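The comparison above can be sketched with a minimal local outlier factor computed directly in NumPy. This is an illustrative sketch, not the disclosed implementation: the function names, the choice of k, and the use of a raw LOF score (approximately 1 for inliers, much larger for outliers) are assumptions.

```python
import numpy as np

def _k_dist_and_nn(refs, k):
    # pairwise Euclidean distances among the reference embeddings
    d = np.linalg.norm(refs[:, None, :] - refs[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]            # indices of k nearest neighbours
    k_dist = d[np.arange(len(refs)), nn[:, -1]]  # distance to the k-th neighbour
    return d, nn, k_dist

def _lrd(d_row, nn_row, k_dist):
    # local reachability density: inverse mean reachability distance
    reach = np.maximum(k_dist[nn_row], d_row[nn_row])
    return 1.0 / reach.mean()

def lof_score(query, refs, k=3):
    """LOF of `query` relative to the reference embeddings."""
    d, nn, k_dist = _k_dist_and_nn(refs, k)
    lrd_refs = np.array([_lrd(d[i], nn[i], k_dist) for i in range(len(refs))])
    dq = np.linalg.norm(refs - query, axis=1)
    nq = np.argsort(dq)[:k]                      # query's k nearest references
    reach_q = np.maximum(k_dist[nq], dq[nq])
    lrd_q = 1.0 / reach_q.mean()
    return lrd_refs[nq].mean() / lrd_q
```

An unsupervised embedding whose score is far above 1 would be flagged as not belonging to any known activity class.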
- a method of learning a subject-specific deep embedding for use in assessing movements of a subject via inference comprising: collecting first sensor data associated with first supervised movements performed by a plurality of subjects in a presence of supervision, and obtaining a first set of activity class labels characterizing the first supervised movements, each activity class label of the first set of activity class labels associating a given first supervised movement with a respective activity class selected from a first set of activity classes; generating a training dataset comprising the first sensor data and the first set of activity class labels; and employing the training dataset to learn a deep embedding using a neural network, wherein the neural network is trained according to a metric learning loss function; wherein the metric learning loss function comprises at least one subject-specific comparative measure based on two or more samples associated with a common subject.
- the neural network is trained according to a triplet loss function based on a set of sample triplets, wherein at least one sample triplet includes an anchor data sample and a positive data sample that are associated with a common subject. At least one sample triplet may further include a negative data sample associated with the common subject. Each sample triplet may include an anchor data sample, a positive data sample and a negative sample that are respectively associated with a respective common subject.
- the neural network may be trained according to a contrastive loss function based on a set of sample doublets, wherein at least one sample doublet includes two samples that are associated with a common subject.
- the deep embedding is obtained from a feature vector that forms an output of a convolutional portion of a convolutional neural network.
- a method of assessing movements of a subject comprising: obtaining a set of reference embeddings associated with the subject, each reference embedding having an associated activity class, each reference embedding having been generated by processing, with a learned deep embedding, first sensor data associated with supervised movements performed by the subject in a presence of supervision; collecting second sensor data while the subject performs an unsupervised movement in an absence of supervision; employing the learned deep embedding to process the second sensor data to generate an unsupervised embedding associated with the unsupervised movement; and processing the set of reference embeddings, the activity classes associated with the set of reference embeddings, and the unsupervised embedding to determine an activity class associated with the unsupervised movement.
- a system for assessing movements comprising: one or more sensors; control and processing circuitry operably coupled to said one or more sensors, said control and processing circuitry comprising at least one processor and memory operably coupled to said at least one processor, said memory comprising instructions executable by said at least one processor for performing operations comprising: obtaining a set of reference embeddings associated with a subject, each reference embedding having an associated activity class, each reference embedding having been generated by processing, with a learned deep embedding, first sensor data associated with supervised movements performed by the subject in a presence of supervision; collecting second sensor data while the subject performs an unsupervised movement in an absence of supervision; employing the learned deep embedding to process the second sensor data to generate an unsupervised embedding associated with the unsupervised movement; and processing the set of reference embeddings, the activity classes associated with the set of reference embeddings, and the unsupervised embedding to determine an activity class associated with the unsupervised movement.
- FIG. 1A shows an example method for learning a subject-specific deep embedding and employing the deep embedding for performing human activity recognition.
- FIG. 1B shows an example method for performing human activity recognition.
- FIG. 1C shows an example method for learning a subject-specific deep embedding for use in human activity recognition.
- FIG. 2 shows an example system for performing human activity recognition.
- FIG. 3 is a table presenting the data set characteristics employed in the Examples section.
- the MHEALTH data was collected with three inertial sensors on the subjects’ right wrist, left leg, and chest.
- the WISDM data was collected from an Android™ smart watch worn by the subjects, and a mobile phone in the subjects’ pocket.
- the SPAR data was collected from 20 subjects (40 shoulders) using an Apple® smart watch.
- IMU Inertial Measurement Unit
- ADL Activity of Daily Living.
- FIG. 4 shows an example fully convolutional network (FCN) model core.
- 1D convolutional layers are defined by ⁗filters, kernel size, stride⁗.
- a dropout ratio of 0.3 was used at each layer.
- the embedding size is equal to the number of filters in the last convolutional layer (128).
- the total number of parameters for this model is 270,848 (an implementation is available for keras at https://github.com/dmbee/fcn-core).
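The stated parameter count can be checked arithmetically. The sketch below assumes the common fully convolutional configuration for time-series data (three 1D convolution blocks of {128, 8}, {256, 5}, {128, 3} filters/kernels, each followed by batch normalization) and six input channels; these specifics are inferred from the stated total, not taken from the disclosure.

```python
def conv1d_params(c_in, filters, kernel):
    # weights (kernel * c_in per filter) plus one bias per filter
    return filters * kernel * c_in + filters

def batchnorm_params(channels):
    # gamma, beta, moving mean, moving variance
    return 4 * channels

# assumed FCN core: {128, 8}, {256, 5}, {128, 3} blocks on a 6-channel input
total = (conv1d_params(6, 128, 8) + batchnorm_params(128)
         + conv1d_params(128, 256, 5) + batchnorm_params(256)
         + conv1d_params(256, 128, 3) + batchnorm_params(128))
print(total)  # 270848, matching the stated model size
```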
- FIG. 5 schematically shows data splitting employed for evaluation of personalized feature classifiers. Folds are split by subject (S), with reference data derived from test subjects prior to sliding window segmentation ensuring no temporal overlap between train, reference, and test data.
- FIG. 6 is a table showing cross-validated classification accuracy (mean ± standard deviation) aggregated by subject.
- a PTN variant was trained with conventional triplet loss, and PTN was trained with 50% subject triplets.
- FCN Fully convolutional network (impersonal)
- PEF personalized engineered features
- PDF personalized deep features
- PTN personalized triplet network.
- FIG. 7 provides box and whisker plots showing distribution of classifier performance by subject using 5-fold cross validation.
- the personalized classifiers have better performance and less inter-subject performance variation than the impersonal FCN (fully convolutional network) model, with the PTN (personalized triplet network) model performing best overall.
- FIG. 8 plots training and validation curves for FCN model (above), PTN model with conventional triplet loss (middle), and PTN model with subject-specific triplet loss (bottom) over 150 epochs.
- FIG. 9 presents box and whisker plots showing the distribution of OOD detection performance across subjects, with 30% of activity classes held back from the training set.
- the FCN, PTN, and PDF classifiers had the highest mean OOD scores for the MHEALTH, WISDM, and SPAR data sets respectively.
- FIG. 10 presents box and whisker plots showing the distribution of activity classification performance when generalizing an embedding to novel activity classes, with 30% of activity classes held back from the training set.
- the PTN model had the best mean performance across all three data sets.
- FIG. 11 presents box and whisker plots showing the effect of reference data size (number of reference segments per activity class) on personalized feature classifier accuracy, evaluated on the SPAR data set. Increasing reference data size improves performance for all personalized algorithms.
- FIG. 12 presents box and whisker plots showing the effect of embedding size (number of features) on personalized feature classifier accuracy, evaluated on the SPAR data set. This demonstrates how both the PDF and PTN models can maintain high performance with a parsimonious embedding, whereas the PEF model performance is significantly degraded.
- FIG. 13 is a table presenting the computational and storage expense for SPAR data set on a single fold (test size 0.2).
- Reference size for personalized feature classifiers is based on single precision 64 feature embeddings, with 16 samples for each of the 7 activity classes.
- the terms “comprises” and “comprising” are to be construed as being inclusive and open ended, and not exclusive. Specifically, when used in the specification and claims, the terms “comprises” and “comprising” and variations thereof mean the specified features, steps or components are included. These terms are not to be interpreted to exclude the presence of other features, steps or components.
- the term “exemplary” means “serving as an example, instance, or illustration,” and should not be construed as preferred or advantageous over other configurations disclosed herein.
- the terms “about” and “approximately” are meant to cover variations that may exist in the upper and lower limits of the ranges of values, such as variations in properties, parameters, and dimensions. Unless otherwise specified, the terms “about” and “approximately” mean plus or minus 25 percent or less.
- any specified range or group is as a shorthand way of referring to each and every member of a range or group individually, as well as each and every possible sub-range or sub-group encompassed therein and similarly with respect to any sub-ranges or sub-groups therein. Unless otherwise specified, the present disclosure relates to and explicitly incorporates each and every specific member and combination of sub-ranges or subgroups.
- the term “on the order of”, when used in conjunction with a quantity or parameter, refers to a range spanning approximately one tenth to ten times the stated quantity or parameter.
- supervised learning methods can be hindered by many challenges when applied to human activity recognition, including the heterogeneity of data among users, limited amounts of training data, and changes in device and sensor design evolution, which can lead to poor performance of impersonal algorithms for some subjects.
- known methods of generating user-specific supervised learning models suffer from complexities in implementation, validation, and maintenance.
- the present inventors have aimed to address these shortcomings in the field of subject-specific activity recognition by developing a more robust approach to personalized activity recognition. It was discovered that improved performance and robustness could be achieved using a subject-specific approach involving a learned deep embedding that is subsequently employed to generate and store subject-specific reference embeddings based on labeled subject-specific sensor data. New unlabeled sensor data associated with activity of the subject can then be classified and/or characterized using the labeled subject-specific reference embeddings for inference. As explained in detail below, some benefits of this approach may include the ability to incorporate novel activity classes without model re-training, and the ability to identify out-of-distribution (OOD) activity classes, thus supporting an open-set activity recognition framework.
- OOD out-of-distribution
- a flow chart is provided of an example method for performing human activity recognition based on a learned deep embedding.
- steps 100-120 a deep embedding is learned based on labeled sensor data that is recorded while a set of subjects perform movements.
- sensor data is collected based on movements performed by a set of subjects while the subjects wear one or more activity sensors.
- the movements are performed under supervision.
- a set of activity class labels is also collected when the sensor data is recorded.
- the activity class labels characterize the supervised movements, such that each activity class label associates a given supervised movement with a respective activity class.
- the activity classes may be selected from a set of activity classes.
- the set of activity classes may include a number of different types of rehabilitation exercises. Sensor data that is recorded while performing a specific exercise therefore has a corresponding activity label correlating the sensor data with the specific exercise.
- the sensor data recorded during the supervised movements performed by the set of subjects in step 100, and the activity class labels respectively associated with the sensor data, are employed as a training dataset for learning a deep embedding using a neural network, as shown at steps 110 and 120.
- the learned deep embedding facilitates the generation of an embedding, e.g. a set of features, based on sensor data from a selected subject.
- a set of labeled sensor data associated with supervised movements performed by the selected subject may be recorded, where the sensor data for each supervised movement performed by the selected subject is associated with a respective activity class label.
- the learned deep embedding may then be employed to generate a set of reference embeddings for the selected subject based on the set of labeled sensor data and the associated activity class labels, as shown at 140.
- the set of subject-specific reference embeddings (that is, the embeddings generated from the labeled sensor data associated with the selected subject) may be employed with their respective activity class labels to infer an activity class for unsupervised movements performed by the subject.
- sensor data may be collected while the selected subject performs an unsupervised movement in an absence of supervision, and the learned deep embedding may be employed to process the sensor data to generate an unsupervised embedding associated with the unsupervised movement.
- the set of reference embeddings (which are specific to the subject), along with their respective activity classes, may be employed to process the unsupervised embedding to infer the activity class associated with the unsupervised movement, as shown at 170.
- Various example methods for determining an activity class based on the reference and unsupervised embeddings associated with the selected subject are described in further detail below.
- the supervised movements in steps 100 and 130 of FIG. 1 may be performed according to a wide variety of supervision methods.
- the movements may be supervised with guidance from an instructor, facilitator, teacher, care provider, or other authority.
- the supervision provides at least some guidance with regard to the movements to facilitate or encourage the correct performance of the movements.
- the supervised movements may be performed based on feedback from a physiotherapist, where the physiotherapist provides supervision and guidance during the performing of the movements.
- the physiotherapist need not be physically present during the performing of the movements, and may, for example, provide guidance based on a video session in which the physiotherapist views the movements over a remote video connection and provides feedback to the subject.
- the movements may be supervised by an automated means.
- the subject may wear one or more sensors or fiducial markers, and signals recorded based on the one or more sensors or fiducial markers may be processed, and feedback may be provided to the subject based on whether or not the movements meet predetermined criteria.
- step 170 includes the step of processing the reference and unsupervised embeddings associated with the selected subject to determine an activity class for the unsupervised movement. This may be performed, for example, using a distance-based search method. For example, the k-nearest neighbour algorithm, or another algorithm based on distance (e.g. Euclidean distance) may be employed to classify the unsupervised embedding with a suitable activity class label.
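A minimal sketch of such a distance-based classification, assuming the embeddings are NumPy arrays; the function name and the default value of k are illustrative, not part of the disclosure.

```python
import numpy as np

def classify_movement(unsup_emb, ref_embs, ref_labels, k=3):
    """Label an unsupervised embedding by majority vote of its k nearest
    reference embeddings under the Euclidean distance metric."""
    dists = np.linalg.norm(ref_embs - unsup_emb, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = [ref_labels[i] for i in nearest]
    return max(set(votes), key=votes.count)
```

Because the reference embeddings are specific to the selected subject, this lookup personalizes the classifier without any retraining of the embedding network.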
- the example embodiment illustrated in the flow chart shown in FIG. 1 B provides a subject-specific (subject-customized) method for inferring an activity class for one or more unsupervised movements performed by a selected subject.
- This is achieved, as illustrated in step 180, by first obtaining a set of reference embeddings associated with the subject, each reference embedding having an associated activity class, with each reference embedding having been generated by processing, with a learned deep embedding, sensor data associated with supervised movements performed by the subject. Additional sensor data is then collected while the subject performs an unsupervised movement, as shown at step 185.
- the learned deep embedding is then employed in step 190 to process the additional (unsupervised) sensor data to generate an unsupervised embedding associated with the unsupervised movement.
- the set of reference embeddings are processed, with the activity classes associated with the set of reference embeddings, and with the unsupervised embedding, to determine an activity class associated with the unsupervised movement.
- the deep embedding can be learned according to a wide variety of methods.
- the deep embedding may be learned by training a neural network having a classification output layer, where the deep embedding is obtained according to an internal feature layer (e.g. a final feature layer) of the neural network, as opposed to a final classification output layer.
- the neural network may be a convolutional neural network having a convolutional portion and a fully connected (e.g. dense) portion, with a final output layer that is processed (e.g. via a softmax function) to generate predictions of the probability that a given movement belongs to each activity class.
- a convolutional neural network may be trained according to categorical cross entropy loss.
- a final feature layer associated with the convolutional portion of the neural network may be employed as the deep embedding for use in the present example methods involving activity class inference based on the use of labeled subject-specific reference embeddings.
- the deep embedding may be learned directly according to training that employs a metric learning loss function that operates on an embedding, thereby directly optimizing (training) the deep embedding according to a desired task.
- a Siamese neural network trained according to metric loss function can generate such a deep embedding.
- metric learning loss functions include contrastive loss, triplet loss, N-pair loss, proxy-loss, and prototypical loss.
- the metric learning loss function may be subject-specific, in which the loss function includes at least one comparative measure that involves at least two samples associated with a common subject (as shown in step 125 of the metric learning method for learning a deep embedding as shown in FIG. 1C).
- a triplet loss function e.g. a triplet neural network
- a conventional triplet neural network learns an embedding f(x), for data x, into a feature space R^d such that the Euclidean distance between data of the same target class (y) is small and the distance between data of different target classes is large.
- the triplet loss L_T is defined as:

  L_T = Σ_{i=1..T} [ ‖f(x_i^a) − f(x_i^p)‖² − ‖f(x_i^a) − f(x_i^n)‖² + α ]_+

- x_i^a is a sample from a given class (anchor)
- x_i^p is a different sample of the same class (positive)
- x_i^n is a sample of a different class (negative)
- α is the margin, which is a hyperparameter of the model defining the distance between class clusters.
- the same embedding f(x) is applied to each sample in the triplet, and the objective is optimized over a training set of triplets with cardinality T.
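The triplet objective above can be evaluated directly for a batch of embedded triplets. A minimal NumPy sketch, with the margin value chosen arbitrarily:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Sum over triplets of [||f(a)-f(p)||^2 - ||f(a)-f(n)||^2 + margin]_+ ,
    where each argument holds embeddings of shape (num_triplets, dim)."""
    pos_d2 = np.sum((anchor - positive) ** 2, axis=1)
    neg_d2 = np.sum((anchor - negative) ** 2, axis=1)
    return np.sum(np.maximum(pos_d2 - neg_d2 + margin, 0.0))
```

A triplet contributes zero loss once its negative is at least the margin farther from the anchor (in squared distance) than its positive.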
- the number of possible triplets (T) that can be generated from a data set with cardinality N is O(N³).
- a triplet neural network will converge well before a single pass over the full set of triplets, and therefore a subset of triplets should be specifically selected from the full set.
- a subject-specific triplet selection strategy may be employed in which each triplet derives its respective samples (anchor, positive and negative samples) from a single respective subject, which yields a modified, subject-specific triplet loss function, as follows:

  L_TS = Σ_{s∈S} Σ_{i=1..T_s} [ ‖f(x_{s,i}^a) − f(x_{s,i}^p)‖² − ‖f(x_{s,i}^a) − f(x_{s,i}^n)‖² + α ]_+

  where x_s^a is a segment of a particular activity class for subject s (anchor), x_s^p is a segment of the same activity class and subject as the anchor (positive), and x_s^n is a segment of a different activity class but from the same subject as the anchor (negative).
- T_s denotes the full set of triplets that may be drawn from a single subject, and S is the full set of subjects. This approach reduces the number of possible triplets to O(N).
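The subject-specific selection strategy can be sketched as follows, assuming samples are stored as (subject, activity_class, segment) records; the data layout, sampling counts, and function names are illustrative assumptions.

```python
import random
from collections import defaultdict

def subject_triplets(samples, n_per_subject=10, seed=0):
    """samples: iterable of (subject_id, activity_class, segment).
    Returns (anchor, positive, negative) triplets drawn within one subject."""
    rng = random.Random(seed)
    by_subject = defaultdict(lambda: defaultdict(list))
    for subject, cls, seg in samples:
        by_subject[subject][cls].append(seg)
    triplets = []
    for subject, classes in by_subject.items():
        # need two segments of one class plus a second class for the negative
        eligible = [c for c, segs in classes.items() if len(segs) >= 2]
        if not eligible or len(classes) < 2:
            continue
        for _ in range(n_per_subject):
            pos_cls = rng.choice(eligible)
            anchor, positive = rng.sample(classes[pos_cls], 2)
            neg_cls = rng.choice([c for c in classes if c != pos_cls])
            negative = rng.choice(classes[neg_cls])
            triplets.append((anchor, positive, negative))
    return triplets
```

Sampling a fixed number of triplets per subject keeps the training set linear in the number of segments, consistent with the O(N) bound noted above.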
- the triplet loss function is based on a sum of comparative measures, where each comparative measure involves a triplet of embeddings that are generated from an anchor sample, a positive sample, and a negative sample, and where minimization of the loss function results in a minimization of the distance between the anchor and positive embeddings within each triplet and the maximization of the distance between the anchor embedding and the negative embedding.
- this selection strategy constrains each triplet to employ embeddings based on sensor data obtained from a common subject.
- one triplet of embeddings may be generated based on sensor data from a first subject, another triplet of embeddings may be generated based on sensor data from a second subject, and yet another triplet of embeddings may be generated based on sensor data from a third subject.
- sensor data from a given subject may be employed to generate more than one triplet, provided that sufficient sensor data (e.g. corresponding to a sufficient number of labeled movements of a given activity class) exists.
- while in the preceding example each triplet of embeddings was generated based on sensor data from a common subject, other example embodiments may not require each and every triplet to be so generated.
- at least one triplet of embeddings is generated based on sensor data from a common subject, or at least a given fraction, such as at least 25%, 50% or 75%, of the triplets are generated based on sensor data from respective common subjects.
- two of the three embeddings generated with a given subject-specific triplet may be generated based on sensor data obtained from a common subject.
- one triplet may include an anchor embedding and a positive embedding that are generated based on sensor data from a first subject, and a negative embedding that is generated based on sensor data from a second subject.
- a given triplet may include an anchor embedding and a negative embedding that are generated based on sensor data from a first subject, and a positive embedding that is generated based on sensor data from a second subject.
- a given triplet may include a positive embedding and a negative embedding that are generated based on sensor data from a first subject, and an anchor embedding that is generated based on sensor data from a second subject.
- a Siamese network may be employed to learn a deep embedding based on a subject-specific contrastive loss function, in which each positive pair and each negative pair of embeddings are associated with respective common subjects.
- a first pair of positive embeddings employed in a contrastive loss doublet may be generated based on sensor data obtained from a first subject
- a second pair of positive embeddings employed in a contrastive loss doublet may be generated based on sensor data obtained from a second subject (and likewise for negative doublets).
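As a sketch of the contrastive (doublet) formulation, the following shows one common form of the contrastive loss; the margin value and function signature are illustrative assumptions, and in the subject-specific variant described above both members of each pair would come from a common subject:

```python
import numpy as np

def contrastive_loss(e1, e2, positive_pair, margin=1.0):
    """One common form of the contrastive loss: positive pairs are
    penalized by squared distance, negative pairs are pushed apart until
    their distance exceeds the margin."""
    d = np.linalg.norm(np.asarray(e1, dtype=float) - np.asarray(e2, dtype=float))
    return d ** 2 if positive_pair else max(0.0, margin - d) ** 2
```

Minimizing this loss over many doublets pulls same-class embeddings together and separates different-class embeddings by at least the margin.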
- a subject-specific multi-class N-pair loss function may be employed to learn a deep embedding, in which at least one component of the loss function includes positive and anchor samples from a common subject and common class, and also N-1 negative samples from the common subject but multiple different activity classes. It will be understood that other metric loss functions, such as proxy-loss and prototypical loss, may also be adapted to generate subject-specific loss functions.
- the subject-specific metric loss function that is employed in the preceding embodiment may or may not include sensor data from the subject for whom the reference embedding and the unsupervised embedding are generated. Accordingly, the present subject-specific metric loss example embodiments involve two different subject-specific aspects, one associated with training and one associated with inference of activity class for unsupervised movements, namely: (i) a subject-specific learned deep embedding generated based on a subject-specific metric loss function based on sensor data collected from a plurality of training subjects, and (ii) subject-specific reference and unsupervised embeddings generated for a subject for whom activity class inference is to be performed.
- aspect (i) involves subject-specific learning and loss minimization for a metric learning deep embedding
- aspect (ii) involves subject-specific inference based on a comparison of a supervised reference embedding generated from labeled sensor data that is collected during the performance of supervised movements, and an unsupervised embedding that is generated from unlabeled sensor data that is collected during the performance of unsupervised movement, where both the reference and unsupervised embeddings are generated based on respective sensor data associated with the same subject.
- the neural network may take on a wide variety of forms.
- the neural network may be or include a convolutional core of a convolutional neural network (e.g. a neural network including a series of convolutional layers, and associated activation, dropout/pooling and/or other layers/functions (e.g. batch normalization)), with the deep embedding taken from the output of the convolutional core rather than from an output classification layer.
- Other non-limiting examples of suitable neural networks that can be employed to generate a deep embedding trained via metric learning include convolutional recurrent neural networks and recurrent neural networks.
- suitable neural networks include auto-encoder and variational auto-encoder neural networks.
- the deep embedding may include a combination of learned features and engineered features.
- the deep embedding may include a first set of features learned via metric learning and an additional set of features obtained by performing feature engineering on the sensor data.
- the full set of features (learned and engineered) may be employed to generate the reference embedding and the unsupervised embedding, and thus subsequently employed for inferring the activity class associated with the unsupervised movement.
- the sensor data that is employed to learn the deep embedding, and/or to generate the reference embedding, may include sensor data from only correctly performed movements (e.g. with sensor data only included when pre-selected performance criteria associated with the sensor data are satisfied), or from both correctly and incorrectly performed movements.
- the set of activity classes that are associated with the supervised movements performed when obtaining the labeled sensor data for generating the reference embeddings may be the same as, or different from (e.g. a subset of), the set of activity classes associated with the supervised movements employed to learn the deep embedding.
- at least one reference embedding may be validated in relation to the training dataset before being employed to determine the activity class associated with the unsupervised movement. For example, this could be accomplished by comparing the new reference embedding to known valid reference embeddings in terms of distance in the embedding space; a valid reference embedding would be expected to lie near previously validated reference embeddings of the same class.
- an activity class associated with the unsupervised movement may be identified as not belonging to the set of activity classes that were employed when collecting the sensor data for learning the deep embedding.
- a local outlier factor algorithm may be employed to compare the unsupervised embedding with the set of reference embeddings and identify movements as not belonging to the initial set of activity classes that were employed to learn the deep embedding.
- the unsupervised embedding could be considered an outlier in so far as it represents a different activity class than exists in the reference embeddings.
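As an illustrative sketch of this out-of-distribution check, the snippet below applies scikit-learn's LocalOutlierFactor in novelty mode to synthetic stand-in embeddings; the embedding dimension, cluster parameters, and n_neighbors value are all assumptions for demonstration:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical 8-D reference embeddings for two known activity classes.
rng = np.random.default_rng(0)
refs = np.concatenate([rng.normal(0.0, 0.05, (30, 8)),
                       rng.normal(1.0, 0.05, (30, 8))])

# novelty=True lets the fitted model score previously unseen embeddings.
lof = LocalOutlierFactor(n_neighbors=10, novelty=True).fit(refs)

known = rng.normal(0.0, 0.05, (1, 8))   # resembles a known class
novel = np.full((1, 8), 5.0)            # far from every reference
# lof.predict(...) returns +1 for in-distribution, -1 for outliers
```

An unsupervised embedding flagged as an outlier would be treated as not belonging to any of the reference activity classes.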
- the present example methods involving the use of labeled reference embeddings may permit the use of the learned deep embedding to classify new activity classes that were not present in the initial sensor data employed to learn the deep embedding.
- the embedding can generalize to new activity classes when samples of an activity novel with respect to the training distribution are obtained under supervision and embedded as reference embeddings for that subject. Future unsupervised instances of this activity may then be identified when the unsupervised embeddings are near to the reference embeddings of the novel activity class in the embedding space.
- the unsupervised sensor data collected when the subject performs unsupervised movements may be processed to determine one or more performance attributes associated with the movement, with the assessment optionally being based on the inferred activity class of the unsupervised movement. Examples of such performance attributes include range of motion, number of exercise repetitions, frequency of exercise repetitions, and qualitative exercise performance.
- the example system includes control and processing circuitry 200 that is programmed to compute an inferred activity class associated with unsupervised movement of a subject, according to one of the example embodiments described herein, or variations thereof.
- the example system includes one or more sensors 305 for sensing signals associated with movement of a subject.
- sensors include accelerometers, gyroscopes, pressure sensors, electrical contact sensors, force sensors, velocity sensors, and myoelectric sensors.
- one or more sensors may be wearable sensors configured to be worn by, held by, or otherwise supported by or secured to a subject.
- one or more sensors may be remote sensors configured to track fiducial markers (e.g. active or passive fiducial markers) worn by, held by, or otherwise supported by or secured to a subject.
- the sensor(s) 305 may be integrated within or a sub-component of the control and processing circuitry 200, as shown by the dashed perimeter 300.
- the control and processing circuitry may be a mobile computing device that includes sensors, such as a mobile phone (smartphone), smartwatch, or mobile computing fitness device.
- one or more sensors may be external sensors to which the control and processing circuitry 200 is interfaced for recording sensor signals associated with subject movement (e.g. via wireless signals, such as Bluetooth™).
- control and processing circuitry 200 may include a processor 210, a memory 215, a system bus 205, a data acquisition and control interface 220 for acquiring sensor data and user input and for sending control commands to the machine 400, a power source 225, and a plurality of optional additional devices or components such as storage device 230, communications interface 235, display 240, and one or more input/output devices 245.
- the example methods described herein can be implemented, at least in part, via hardware logic in processor 210 and partially using the instructions stored in memory 215. Some example embodiments may be implemented using processor 210 without additional instructions stored in memory 215. Some embodiments may be implemented using the instructions stored in memory 215 for execution by one or more microprocessors.
- the example methods described herein for inferring an activity class based on a learned deep embedding in which subject-specific labeled reference embeddings, generated based on labeled sensor data collected from a subject in the presence of supervision, are compared to an unlabeled embedding, generated based on unlabeled sensor data collected from the same subject without supervision, may be implemented via processor 210 and/or memory 215.
- Such a method is represented as activity classification module 320, which operates on the stored reference embeddings and associated activity class labels 330, as well as the unsupervised embedding.
- the learned deep embedding which is generated based on sensor data obtained from a plurality of subjects, optionally using a subject- specific metric learning method, as per FIG. 1C, may be generated via the processing and control circuitry 200, or, for example, via a remote computing system that is connectable to the processing and control circuitry 200.
- the deep embedding may be learned using a remote server 350 that is operably connected or connectable to the control and processing circuitry 200 via a network 360, with the deep embedding parameters (e.g. filters/kernels of a convolutional neural network) transmitted to the processing and control circuitry 200 for local generation of the reference and unsupervised embeddings.
- a remote client device 370 (e.g. a smartphone)
- an external user (e.g. a healthcare provider)
- the processing and computing circuitry 200 may include a mobile computing device, such as a tablet or smartphone, that is connected to local processing hardware supported by a wearable computing device via one or more wired or wireless connections.
- a portion of the control and processing circuitry 200 may be implemented, at least in part, on a remote computing system that connects to a local processing hardware via a remote network, such that some aspects of the processing are performed remotely (e.g. in the cloud), as noted above.
- although the bus 205 is depicted as a single connection between all of the components, it will be appreciated that the bus may represent one or more circuits, devices or communication channels which link two or more of the components. For example, in many computers, the bus is included on or implemented by a motherboard.
- a computer readable storage medium can be used to store software and data which when executed by a data processing system causes the system to perform various methods.
- the executable software and data may be stored in various places including for example ROM, volatile RAM, nonvolatile memory and/or cache. Portions of this software and/or data may be stored in any one of these storage devices.
- the terms "computer readable material" and "computer readable storage medium" refer to all computer-readable media, except for a transitory propagating signal per se.
- the present example embodiments may be employed in applications including, but not limited to, rehabilitation of subjects suffering from loss or deficit of motor control, such as stroke patients or patients recovering from a surgical procedure or injury.
- the supervised sensor data that is collected and labelled to generate the subject-specific reference embedding may be collected when a care provider is present with the subject, with the provider observing the movements and providing the labeling of the movements.
- the subject may then subsequently perform the unsupervised movements in an unsupervised environment, such as at home setting.
- the inferred activity classes of the unsupervised movements may be communicated directly to the subject, and/or remotely communicated to the provider for assessment.
- the present example methods can be employed for activity recognition in applications including, but not limited to, identification of player movements, training of player movements (e.g. providing feedback to a player based on an inferred movement class), confirmation/verification of player activity recognition via other algorithms that process sensor input, and logging/archiving of player movements.
- the present example methods can be employed in applications including, but not limited to, training of medical procedures (e.g. surgical and/or diagnostic procedures) and logging of movements performed during medical procedures.
- the present example embodiments may be employed in applications including identification and recording of movements performed during training or during a competitive event.
- the present examples describe experiments in which deep embeddings are employed for personalized activity recognition, with reference to example methods involving 1) extracted features from a neural network classifier and 2) an optimized embedding learned using triplet neural networks (TNNs). These example methods are compared to a baseline impersonal neural network classifier and a personalized engineered feature representation, and are evaluated on three publicly available inertial human activity recognition data sets (MHEALTH, WISDM, and SPAR), comparing classification accuracy, out-of-distribution activity detection, and embedding generalization to new activities.
- Example 1 Methods
- the WISDM and MHEALTH data was resampled to 50 Hz, using cubic interpolation, to provide a consistent basis for evaluating model architecture.
- the time series data were then pre-processed with sliding window segmentation to produce fixed length segments of uniform activity class.
- a four second sliding window was utilized for the MHEALTH and SPAR data sets, and a ten second window was utilized for WISDM for consistency with previous evaluations.
- An overlap ratio of 0.8 was used in the sliding window segmentation as a data augmentation strategy.
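The windowing procedure above can be sketched as follows; the function name, array shapes, and uniform-class handling are illustrative assumptions consistent with the description:

```python
import numpy as np

def segment(ts, labels, win_len, overlap=0.8):
    """Sliding-window segmentation into fixed-length windows of uniform
    activity class; windows spanning a class boundary are discarded.
    With overlap 0.8, consecutive windows share 80% of their samples."""
    step = max(1, int(round(win_len * (1.0 - overlap))))
    wins, win_labels = [], []
    for start in range(0, len(ts) - win_len + 1, step):
        lab = labels[start:start + win_len]
        if (lab == lab[0]).all():               # uniform class only
            wins.append(ts[start:start + win_len])
            win_labels.append(lab[0])
    return np.stack(wins), np.array(win_labels)

# e.g. a 4 s window at 50 Hz is 200 samples; step = 40 samples at 0.8 overlap
ts = np.zeros((1000, 6))                        # 20 s of 6-channel data
labels = np.repeat([0, 1], 500)
X, y = segment(ts, labels, win_len=200)
```

The high overlap ratio multiplies the number of training segments, which is the data augmentation effect described above.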
- the FCN architecture is considered a strong baseline for time series classification even in comparison to deep learning models with modern architectural features used in computer vision such as skip connections.
- the FCN core consists of 1D convolutional layers, with rectified linear unit (ReLU) activation, and batch normalization. Regularization of the model is achieved using dropout applied at each layer. Global average pooling is used after the last convolutional layer to reduce the model sensitivity to translations along the temporal axis, as this ensures the receptive field of the features in the penultimate feature layer includes the entirety of the window segment.
- the receptive field of filters in the last convolutional layer prior to global average pooling was 13 samples, which is equivalent to 260 ms at a sampling rate of 50 Hz.
- An L2 normalization is applied after global average pooling to constrain the embedding to the surface of a unit hypersphere, which improves training stability.
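The data flow of the FCN core described above can be sketched with untrained random filters in plain NumPy; batch normalization and dropout are omitted from this sketch, and the filter counts and kernel size are illustrative assumptions:

```python
import numpy as np

def conv1d_same(x, w, b):
    """'same'-padded 1-D convolution; x: (T, C_in), w: (k, C_in, C_out)."""
    k = w.shape[0]
    xp = np.pad(x, ((k // 2, k - 1 - k // 2), (0, 0)))
    return np.stack([np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1])) + b
                     for t in range(x.shape[0])])

def fcn_core(x, layers):
    """Conv1D -> ReLU per layer, then global average pooling over time
    and L2 normalization onto the unit hypersphere."""
    for w, b in layers:
        x = np.maximum(conv1d_same(x, w, b), 0.0)   # ReLU activation
    emb = x.mean(axis=0)                            # global average pooling
    return emb / np.linalg.norm(emb)                # L2 normalization

rng = np.random.default_rng(0)
# untrained random filters; input is a 4 s window at 50 Hz with 6 channels
layers = [(0.1 * rng.standard_normal((5, ci, co)), np.zeros(co))
          for ci, co in [(6, 32), (32, 64), (64, 64)]]
embedding = fcn_core(rng.standard_normal((200, 6)), layers)
```

Global average pooling collapses the temporal axis, which is why the embedding is insensitive to translations of the activity within the window.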
- the impersonal FCN classifier model consists of an FCN core with a final dense layer with softmax activation.
- the FCN classifier was trained for 150 epochs using the Adam optimizer, categorical cross entropy loss, and a learning rate of 0.001.
- Gradient norm clipping to 1.0 was used to mitigate exploding gradients.
- the FCN core architecture was also used as the basis for both the personalized deep features (PDF) and personalized triplet network (PTN) models (both of which are described below) to provide a consistent comparison of these approaches.
- three personalized models were evaluated: a personalized engineered features (PEF) classifier, a personalized deep features (PDF) classifier, and a personalized triplet network (PTN) classifier. Inference was achieved in each of these models by comparing a subject's embedded test segments to labeled reference embeddings specific to the subject (i.e. previously measured and corresponding to the same subject). For the test subjects, the time series data for each activity was split along the temporal axis, reserving the first part of the split time series for use as labeled reference data (employed for inference) and the latter part of the split time series for use as unlabeled test data. This split was performed prior to sliding window segmentation to ensure there is no temporal overlap of reference and test samples. This partitioning of the data is depicted in FIG. 5. The effect of reference data size on model performance was evaluated using 50% of the test data as the baseline evaluation.
- the reference embeddings were searched using a k-nearest neighbors (k-NN) classifier with k = 3, a Euclidean distance metric, and a uniform weight decision function.
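A minimal NumPy sketch of this reference-embedding search (k = 3, Euclidean distance, uniform majority vote); the toy embeddings and activity labels are illustrative:

```python
import numpy as np
from collections import Counter

def knn_infer(query, ref_embs, ref_labels, k=3):
    """Classify an unsupervised embedding against a subject's labeled
    reference embeddings using the k nearest neighbors under Euclidean
    distance, with uniform (majority-vote) weighting."""
    dists = np.linalg.norm(ref_embs - query, axis=1)
    nearest = np.argsort(dists)[:k]
    return Counter(ref_labels[i] for i in nearest).most_common(1)[0][0]

# toy reference set: two activity classes in a 2-D embedding space
refs = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                 [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]])
labs = ["squat", "squat", "squat", "lunge", "lunge", "lunge"]
pred = knn_infer(np.array([0.05, 0.05]), refs, labs)   # -> "squat"
```

Because the search is restricted to one subject's reference embeddings, its cost stays small even though k-NN scales with the number of stored samples.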
- An engineered feature representation was employed to serve as a baseline personalized classifier model.
- the representation consisted of statistical and heuristic features used for inertial activity recognition including mean, median, absolute energy, standard deviation, variance, minimum, maximum, skewness, kurtosis, mean spectral energy, and mean crossings.
- the features were individually computed for each of the data channels in the data set. This resulted in 66 features for the WISDM and SPAR data sets, and 174 features for the MHEALTH data set. All features were individually scaled to unit norm and zero mean across the training data set.
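The per-channel feature list above can be sketched as follows. The exact definitions used in the experiments are not reproduced here, so the moment-based skewness/kurtosis and the mean-crossing count below are assumptions; note that 11 features per channel yields the 66 features cited for the 6-channel data sets:

```python
import numpy as np

def channel_features(x):
    """The 11 statistical/heuristic features listed above, computed for a
    single 1-D sensor channel."""
    xc = x - x.mean()
    sd = x.std()
    power = np.abs(np.fft.rfft(x)) ** 2               # spectral energy
    return np.array([
        x.mean(), np.median(x), np.sum(x ** 2),       # mean, median, absolute energy
        sd, x.var(), x.min(), x.max(),
        np.mean(xc ** 3) / (sd ** 3 + 1e-12),         # skewness
        np.mean(xc ** 4) / (sd ** 4 + 1e-12),         # kurtosis
        power.mean(),                                 # mean spectral energy
        float(np.sum(np.diff(np.sign(xc)) != 0)),     # mean crossings
    ])

feats = channel_features(np.sin(np.linspace(0, 8 * np.pi, 200)))
```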
- the personalized deep feature model was identical to the FCN model in architecture and training. However, inference was achieved by embedding reference samples with the FCN penultimate feature layer and using k-NN to search the reference data nearest to the embedded test samples.
- a PTN embedding f(x) was derived by training the FCN core with triplet loss.
- conventional triplet loss was evaluated using random triplets (as per Eq. 1), and subject triplet loss (as per Eq. 2) was evaluated with a portion of the triplets being subject triplets and the remainder randomly selected.
- the same optimizer and hyperparameters as described previously were employed, with the exception that the learning rate was reduced to 0.0002 when training the FCN core with triplet loss.
- a value of 0.3 was employed for the margin parameter a.
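With this margin, the triplet loss can be sketched as below; treating the loss as the standard squared-Euclidean formulation is an assumption, since Eq. 1 itself is not reproduced in this excerpt:

```python
import numpy as np

def triplet_loss(a, p, n, margin=0.3):
    """One standard (squared-Euclidean) formulation of the triplet loss,
    with the margin alpha = 0.3 employed in these experiments:
        L = max(0, ||a - p||^2 - ||a - n||^2 + alpha)."""
    a, p, n = (np.asarray(v, dtype=float) for v in (a, p, n))
    return max(0.0, np.sum((a - p) ** 2) - np.sum((a - n) ** 2) + margin)
```

The loss is zero once the anchor is closer to the positive than to the negative by at least the margin, so only "unsolved" triplets contribute gradient.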
- an epoch was consistently defined as having N samples.
- Model performance was assessed for distinguishing activity classes present in the training distribution from unknown (out-of-distribution) activity classes. This evaluation was performed by training the models on a subset (70%) of the activity classes, and testing with the full set of activity classes in a subject group 5-fold cross validation scheme. In each fold, the classes considered out-of-distribution were randomly selected but were consistent across the algorithms evaluated. Out-of-distribution performance was assessed using the area under the receiver operating characteristic curve (AUROC) for the binary classification task of in- vs out-of-distribution.
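The AUROC used here can be computed directly from the two score populations via the Mann-Whitney U identity; the scoring convention (higher score = more anomalous) is an assumption of this sketch:

```python
def auroc(in_scores, out_scores):
    """AUROC for the binary in- vs out-of-distribution task: the
    probability that a randomly chosen out-of-distribution sample scores
    higher than a randomly chosen in-distribution sample, counting ties
    as 0.5."""
    wins = sum(1.0 if o > i else 0.5 if o == i else 0.0
               for o in out_scores for i in in_scores)
    return wins / (len(in_scores) * len(out_scores))
```

A value of 1.0 means the detector ranks every unknown-class sample above every known-class sample; 0.5 corresponds to chance.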
- OOD detection performance is plotted in FIG. 9.
- a parsimonious representation of activity is desirable to minimize the storage and computational cost of personalized feature inference.
- the effect of embedding size on model performance was assessed using 5-fold cross validation on the SPAR data set.
- the embedding size was adjusted at the final dense layer of the FCN core.
- for the engineered features, the embedding size was reduced by selecting the most important features as ranked using Gini importance.
- the Gini importance was calculated for the engineered features using an Extremely Randomized Trees classifier with an ensemble of 250 trees.
- Classifier performance as a function of embedding size is plotted in FIG. 12. It was found that accurate results were achievable down to an embedding size of 8, and that the PDF and PTN models are superior to PEF for learning a parsimonious embedding.
- the personalized classifiers were able to achieve performance approaching the training set performance of the impersonal FCN classifier. As predicted, it was found that the FCN classifier, while producing excellent results overall, performed poorly for some individuals (as low as 50% accuracy) as shown in FIG. 7. The three personalized feature classifiers evaluated all significantly mitigated this issue, and had more consistent results across individual subjects in each data set, thereby providing a technical solution to the previously noted problems in the field of supervised learning for human activity recognition.
- the PTN model had the best performance overall, exceeding that of the personalized deep features learned serendipitously by the PDF model in a conventional supervised learning framework.
- a significant disadvantage of using a triplet neural network to learn the embedding function is the increased computational cost during training.
- the PTN approach increased the training time fivefold and tripled the GPU memory requirements in comparison to training an identical core with categorical cross entropy loss. This is because there is the further cost of triplet selection, and each triplet includes three distinct samples that must each be embedded to compute the triplet loss. Fortunately, once the embedding has been trained, there is little difference in the computational requirements to compute the embedding or classify an unknown sample.
- a significant advantage of using personalized, subject-specific features for activity classification is the built-in capability to use them for out-of-distribution activity detection and classification of novel activities.
- out-of-distribution detection is particularly important as there exists an infinite number of possible human actions, and therefore it may be impractical to include all possible actions, or even all reasonably likely actions, in the training set. It is therefore desirable for a HAR system to be capable of being trained to recognize a select number of pertinent activities and to have the ability to reject anomalous activities that do not reside in the training distribution.
- the personalized k-NN model employed to search reference embeddings for classification of test samples in the PEF, PDF, and PTN models was found to be effective, but approximately doubles the inference time in comparison to the FCN model that used a softmax layer for classification.
- a disadvantage of k-NN search is that both the computational time complexity and the data storage requirement scale as O(N) with the number of reference samples. This property of k-NN limits its utility as an impersonal classifier, as performing inference requires searching the entire training data set.
- the k-NN search model is limited only to the subject's reference samples, which have been presently demonstrated to need only tens of samples per activity class. It is noted that other search strategies could be implemented to search the reference data. For example, the nearest centroid method could be used, which has computational complexity O(C), scaling linearly with the number of reference classes rather than the number of reference samples.
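The nearest-centroid alternative mentioned here can be sketched as follows; the function names and toy data are illustrative:

```python
import numpy as np

def fit_centroids(ref_embs, ref_labels):
    """Nearest-centroid alternative to k-NN search: store one centroid
    per activity class, so inference cost scales with the number of
    classes instead of the number of reference samples."""
    labels = np.asarray(ref_labels)
    classes = sorted(set(ref_labels))
    cents = np.stack([ref_embs[labels == c].mean(axis=0) for c in classes])
    def predict(query):
        # label of the centroid nearest to the query embedding
        return classes[int(np.argmin(np.linalg.norm(cents - query, axis=1)))]
    return predict

predict = fit_centroids(
    np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.0, 0.8]]),
    ["squat", "squat", "lunge", "lunge"])
```

Storage also drops from all reference embeddings to one vector per class, which suits on-device deployment.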
- the FCN core architecture described in this work is a relatively compact model, particularly in comparison to computer vision or language models that can exceed tens or hundreds of millions of parameters. Given the small size of the model and reference embeddings, implementing a personalized feature classifier based on the FCN core may be feasible within an edge computing system where the computations for activity recognition are performed locally on the user's hardware (e.g. a mobile device).
- the subject-specific triplet loss function (Eq. 2) and the triplet selection strategy described in this disclosure significantly improved the performance of the PTN model in comparison to conventional triplet loss.
- the subject triplets can be considered “hard” triplets in the context of other strategies for specifically selecting hard triplets to improve TNN training.
- the present models may be beneficial in that they are straightforward to implement and computationally inexpensive in comparison to strategies that require embeddings to be computed prior to triplet selection.
- a loss function employing subject-specific triplets may be particularly advantageous for data sets collected with heterogenous hardware.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Measurement Of The Respiration, Hearing Ability, Form, And Blood Characteristics Of Living Organisms (AREA)
- Image Analysis (AREA)
Abstract
Systems and methods are disclosed for performing human activity recognition using deep embeddings. In some example embodiments, a learned deep embedding is employed to generate subject-specific reference embeddings that are generated based on labeled sensor data that is specific to a subject and collected in a presence of supervision of subject movements. The labeled reference embeddings may then be employed to infer an activity class of a movement performed by the subject in an absence of supervision, based on the processing of the subject-specific labeled reference embedding and an embedding generated from unlabeled sensor data collected when the unsupervised movements are performed. In some example implementations, the deep embedding may be learned according to a metric loss function, such as a triplet loss function, and the metric loss function may include at least one comparative measure that is based on two or more samples associated with a common subject.
Description
ACTIVITY RECOGNITION WITH DEEP EMBEDDINGS
CROSS-REFERENCE TO RELATED APPLICATION
This application claims priority to U.S. Provisional Patent Application No. 62/960,876, titled “ACTIVITY RECOGNITION WITH DEEP EMBEDDINGS” and filed on January 14, 2020, the entire contents of which is incorporated herein by reference.
BACKGROUND The present disclosure relates to the field of machine learning for human activity recognition.
Inertial sensors embedded in mobile phones and wearable devices are commonly employed to classify and characterize human behaviors in a number of applications including tracking fitness, elder safety, sleep, physiotherapy, and others. Improving the accuracy and robustness of the algorithms underlying inertial human activity recognition (HAR) systems is an active field of research.
A significant challenge for a supervised learning approach to inertial human activity recognition is the heterogeneity of data between individual users. This heterogeneity occurs in relation to heterogeneity in the hardware on which the inertial data is collected, different inherent capabilities or attributes relating to the users themselves, and differences in the environment in which the data is collected.
Large data sets incorporating the full spectrum of user, device, and environment heterogeneity may represent a potential approach to addressing these challenges; however, this approach presents significant logistical and financial challenges. Further, the devices and sensors on which inertial data is collected continuously evolve over time. It may not be feasible to train generic supervised algorithms that perform equally well for all users and devices.
One approach to addressing such challenges involves the use of labeled user-specific data for a personalized approach to activity recognition. Human activity recognition from inertial time series data has conventionally been conducted using a supervised learning approach with non-neural classifiers, after transformation of the data using an engineered feature representation consisting of statistical, time-domain, and/or frequency-domain transforms.
User-specific supervised learning models have been trained through one of three general schemes. In a first scheme, a user-specific model can be trained de novo with user-specific data or a combination of generic and user-specific data. This is generally not feasible for neural network approaches that require vast data sets
and computational resources for training, but works well for non-neural approaches with engineered features. According to a second scheme, model updating (e.g. online learning, transfer learning) with user-specific data may be employed for both non-neural and neural network supervised learning algorithms. For example, a generic convolutional neural network architecture can be trained and adapted to specific users by retraining the classification layer while fixing the weights of the convolutional layers. A third scheme involves using user-specific classifier ensembles, for example, pre-training non-neural classifiers on subpopulations within a training set, and selecting user-specific classifier ensembles based on testing the pre-trained classifiers on user-specific data.
The aforementioned personalized methods have all produced favorable results in comparison to generic models. However, generating, validating, and maintaining user-specific supervised learning models presents its own logistical challenges in a commercial deployment.
SUMMARY
Systems and methods are disclosed for performing human activity recognition using deep embeddings. In some example embodiments, a learned deep embedding is employed to generate subject-specific reference embeddings that are generated based on labeled sensor data that is specific to a subject and collected in a presence of supervision of subject movements. The labeled reference embeddings may then be employed to infer an activity class of a movement performed by the subject in an absence of supervision, based on the processing of the subject-specific labeled reference embedding and an embedding generated from unlabeled sensor data collected when the unsupervised movements are performed. In some example implementations, the deep embedding may be learned according to a metric loss function, such as a triplet loss function, and the metric loss function may include at least one comparative measure that is based on two or more samples associated with a common subject. Accordingly, in one aspect, there is provided a method of assessing movements of a subject, the method comprising: collecting first sensor data associated with first supervised movements performed by a plurality of subjects in a presence of supervision, and obtaining a first set of activity class labels characterizing the first supervised movements, each activity class label of the first set of activity class labels associating a given first supervised movement with a respective activity class selected from a first set of
activity classes; generating a training dataset comprising the first sensor data and the first set of activity class labels; employing the training dataset to learn a deep embedding using a neural network; collecting second sensor data associated with second supervised movements performed by a selected subject in a presence of supervision, and obtaining a second set of activity class labels characterizing the second supervised movements, each activity class label of the second set of activity class labels associating a given second supervised movement with a respective activity class selected from a second set of activity classes; employing the learned deep embedding to process the second sensor data to generate a set of reference embeddings, each reference embedding having an associated activity class; collecting third sensor data while the selected subject performs an unsupervised movement in an absence of supervision; employing the learned deep embedding to process the third sensor data to generate an unsupervised embedding associated with the unsupervised movement; and processing the set of reference embeddings, the activity classes associated with the set of reference embeddings, and the unsupervised embedding to determine an activity class associated with the unsupervised movement.
In some implementations of the method, the neural network is trained according to a metric learning loss function. The metric learning loss function may comprise at least one subject-specific comparative measure based on two or more samples associated with a common subject.
In some implementations of the method, the neural network is trained according to a triplet loss function based on a set of sample triplets, wherein at least one sample triplet includes an anchor data sample and a positive data sample that are associated with a common subject. At least one sample triplet may further include a negative data sample associated with the common subject. Each sample triplet may include an anchor data sample, a positive data sample and a negative data sample that are respectively associated with a respective common subject.
In some implementations of the method, the neural network is trained according to a contrastive loss function based on a set of sample doublets, wherein at least one sample doublet includes two samples that are associated with a common subject.
In some implementations of the method, the deep embedding is obtained from a feature vector that forms an output of a convolutional portion of a convolutional neural network.
In some implementations of the method, the deep embedding is learned by training a convolutional neural network comprising a fully connected portion, and wherein the deep embedding is obtained from the feature vector that forms an input to the fully connected portion of the convolutional neural network.
In some implementations of the method, the activity class associated with the unsupervised movement is determined from the unsupervised embedding and the set of reference embeddings using a classifier that employs a distance metric. The classifier may be a K-nearest neighbour classifier.
In some implementations of the method, the neural network comprises a convolutional neural network. The neural network may comprise a convolutional recurrent neural network.
In some implementations of the method, the neural network comprises a recurrent neural network.
In some implementations, the method further comprises processing the third sensor data to determine a performance attribute associated with the unsupervised movement. The activity class of the unsupervised movement may be employed when processing the third sensor data to determine the performance attribute.
In some implementations of the method, the plurality of subjects comprises the selected subject.
In some implementations of the method, the first supervised movements include correctly performed movements and incorrectly performed movements.
In some implementations of the method, the second supervised movements include correctly performed movements and incorrectly performed movements.
In some implementations of the method, the second set of activity classes is the same as the first set of activity classes.
In some implementations of the method, the second set of activity classes is different than the first set of activity classes.
In some implementations of the method, the second sensor data and the third sensor data are obtained from one or more wearable sensors worn by the selected subject.
In some implementations of the method, the second sensor data and the third sensor data are obtained from one or more sensors configured to detect fiducial markers worn by the selected subject.
In some implementations of the method, the first sensor data is only collected
when a given supervised movement satisfies performance criteria.
In some implementations of the method, each activity class of the first set of activity classes corresponds to a unique physiotherapy exercise.
In some implementations of the method, each activity class of the first set of activity classes corresponds to a unique dance movement.
In some implementations of the method, each activity class of the first set of activity classes corresponds to a unique movement associated with a medical procedure.
In some implementations of the method, the deep embedding is learned via a convolutional neural network trained with categorical cross entropy loss.
In some implementations of the method, the deep embedding is learned via a recurrent neural network trained with categorical cross entropy loss.
In some implementations of the method, the deep embedding is learned via one of an auto-encoder and a variational auto-encoder.
In some implementations of the method, the deep embedding comprises engineered features and learned features.
In some implementations of the method, at least one reference embedding is validated in relation to the training dataset before being employed to determine the activity class associated with the unsupervised movement.
In some implementations of the method, the activity class associated with the unsupervised movement is identified as not belonging to the first set of activity classes. A local outlier factor algorithm may be employed to compare the unsupervised embedding with the set of reference embeddings.
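As a rough illustration of such open-set detection, the following numpy sketch flags an unsupervised embedding as out-of-distribution when its mean distance to its k nearest reference embeddings exceeds a threshold. This is a simplified distance-based stand-in for a full local outlier factor computation; the function name, k, and the threshold value are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def is_out_of_distribution(reference_embeddings, unsupervised_embedding, k=3, threshold=1.0):
    """Flag an unsupervised embedding as not belonging to any known activity class
    when its mean distance to the k nearest reference embeddings is large.
    (A simplified distance-based stand-in for a local outlier factor comparison.)"""
    distances = np.sort(np.linalg.norm(reference_embeddings - unsupervised_embedding, axis=1))
    return distances[:k].mean() > threshold

refs = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [1.0, 1.0]])
print(is_out_of_distribution(refs, np.array([0.05, 0.05])))  # False: near known classes
print(is_out_of_distribution(refs, np.array([5.0, 5.0])))    # True: novel activity
```

A production system would instead fit a novelty detector (such as a local outlier factor model) on the reference embeddings and score each new embedding against it.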
In another aspect, there is provided a method of learning a subject-specific deep embedding for use in assessing movements of a subject via inference, the method comprising: collecting first sensor data associated with first supervised movements performed by a plurality of subjects in a presence of supervision, and obtaining a first set of activity class labels characterizing the first supervised movements, each activity class label of the first set of activity class labels associating a given first supervised movement with a respective activity class selected from a first set of activity classes; generating a training dataset comprising the first sensor data and the first set of activity class labels; and employing the training dataset to learn a deep embedding using a neural network, wherein the neural network is trained according to a metric learning loss function;
wherein the metric learning loss function comprises at least one subject-specific comparative measure based on two or more samples associated with a common subject.
In some implementations of the method, the neural network is trained according to a triplet loss function based on a set of sample triplets, wherein at least one sample triplet includes an anchor data sample and a positive data sample that are associated with a common subject. At least one sample triplet may further include a negative data sample associated with the common subject. Each sample triplet may include an anchor data sample, a positive data sample and a negative data sample that are respectively associated with a respective common subject. The neural network may be trained according to a contrastive loss function based on a set of sample doublets, wherein at least one sample doublet includes two samples that are associated with a common subject.
In some implementations of the method, the deep embedding is obtained from a feature vector that forms an output of a convolutional portion of a convolutional neural network.
In another aspect, there is provided a method of assessing movements of a subject, the method comprising: obtaining a set of reference embeddings associated with the subject, each reference embedding having an associated activity class, each reference embedding having been generated by processing, with a learned deep embedding, first sensor data associated with supervised movements performed by the subject in a presence of supervision; collecting second sensor data while the subject performs an unsupervised movement in an absence of supervision; employing the learned deep embedding to process the second sensor data to generate an unsupervised embedding associated with the unsupervised movement; and processing the set of reference embeddings, the activity classes associated with the set of reference embeddings, and the unsupervised embedding to determine an activity class associated with the unsupervised movement.
In another aspect, there is provided a system for assessing movements, the system comprising: one or more sensors; control and processing circuitry operably coupled to said one or more sensors, said control and processing circuitry comprising at least one processor and memory operably coupled to said at least one processor, said memory comprising
instructions executable by said at least one processor for performing operations comprising: obtaining a set of reference embeddings associated with a subject, each reference embedding having an associated activity class, each reference embedding having been generated by processing, with a learned deep embedding, first sensor data associated with supervised movements performed by the subject in a presence of supervision; collecting second sensor data while the subject performs an unsupervised movement in an absence of supervision; employing the learned deep embedding to process the second sensor data to generate an unsupervised embedding associated with the unsupervised movement; and processing the set of reference embeddings, the activity classes associated with the set of reference embeddings, and the unsupervised embedding to determine an activity class associated with the unsupervised movement.
A further understanding of the functional and advantageous aspects of the disclosure can be realized by reference to the following detailed description and drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments will now be described, by way of example only, with reference to the drawings, in which:
FIG. 1A shows an example method for learning a subject-specific deep embedding and employing the deep embedding for performing human activity recognition.
FIG. 1B shows an example method for performing human activity recognition.
FIG. 1C shows an example method for learning a subject-specific deep embedding for use in human activity recognition.
FIG. 2 shows an example system for performing human activity recognition.
FIG. 3 is a table presenting the data set characteristics employed in the Examples section. The MHEALTH data was collected with three inertial sensors on the subjects’ right wrist, left leg, and chest. The WISDM data was collected from an Android™ smart watch worn by the subjects, and a mobile phone in the subjects’ pocket. The SPAR data was collected from 20 subjects (40 shoulders) using an Apple® smart watch. IMU: Inertial Measurement Unit, ADL: Activity of Daily Living.
FIG. 4 shows an example fully convolutional network (FCN) model core. 1D convolutional layers are defined by {filters, kernel size, stride}. A dropout ratio of 0.3 was used at each layer. The embedding size is equal to the number of filters in the last convolutional layer (128). The total number of parameters for this model is 270,848 (an implementation is available for Keras at https://github.com/dmbee/fcn-core).
FIG. 5 schematically shows data splitting employed for evaluation of personalized feature classifiers. Folds are split by subject (S), with reference data derived from test subjects prior to sliding window segmentation ensuring no temporal overlap between train, reference, and test data.
FIG. 6 is a table showing cross-validated classification accuracy (mean ± standard deviation) aggregated by subject. PTN† was trained with conventional triplet loss, and PTN was trained with 50% subject triplets. FCN: Fully convolutional network (impersonal), PEF: personalized engineered features, PDF: personalized deep features, PTN: personalized triplet network.
FIG. 7 provides box and whisker plots showing distribution of classifier performance by subject using 5-fold cross validation. The personalized classifiers have better performance and less inter-subject performance variation than the impersonal FCN (fully convolutional network) model, with the PTN (personalized triplet network) model performing best overall.
FIG. 8 plots training and validation curves for the FCN model (top), the PTN model with conventional triplet loss (middle), and the PTN model with subject-specific triplet loss (bottom) over 150 epochs.
FIG. 9 presents box and whisker plots showing the distribution of OOD detection performance across subjects, with 30% of activity classes held back from the training set. The FCN, PTN, and PDF classifiers had the highest mean OOD scores for the MHEALTH, WISDM, and SPAR data sets respectively.
FIG. 10 presents box and whisker plots showing the distribution of activity classification performance when generalizing an embedding to novel activity classes, with 30% of activity classes held back from the training set. The PTN model had the best mean performance across all three data sets.
FIG. 11 presents box and whisker plots showing the effect of reference data size (number of reference segments per activity class) on personalized feature classifier accuracy, evaluated on the SPAR data set. Increasing reference data size improves performance for all personalized algorithms.
FIG. 12 presents box and whisker plots showing the effect of embedding size (number of features) on personalized feature classifier accuracy, evaluated on the SPAR data set. This demonstrates how both the PDF and PTN models can maintain high performance with a parsimonious embedding, whereas the PEF model
performance is significantly degraded.
FIG. 13 is a table presenting the computational and storage expense for SPAR data set on a single fold (test size 0.2). Reference size for personalized feature classifiers is based on single precision 64 feature embeddings, with 16 samples for each of the 7 activity classes.
DETAILED DESCRIPTION
Various embodiments and aspects of the disclosure will be described with reference to details discussed below. The following description and drawings are illustrative of the disclosure and are not to be construed as limiting the disclosure. Numerous specific details are described to provide a thorough understanding of various embodiments of the present disclosure. However, in certain instances, well- known or conventional details are not described in order to provide a concise discussion of embodiments of the present disclosure.
As used herein, the terms “comprises” and “comprising” are to be construed as being inclusive and open ended, and not exclusive. Specifically, when used in the specification and claims, the terms “comprises” and “comprising” and variations thereof mean the specified features, steps or components are included. These terms are not to be interpreted to exclude the presence of other features, steps or components.
As used herein, the term “exemplary” means “serving as an example, instance, or illustration,” and should not be construed as preferred or advantageous over other configurations disclosed herein.
As used herein, the terms “about” and “approximately” are meant to cover variations that may exist in the upper and lower limits of the ranges of values, such as variations in properties, parameters, and dimensions. Unless otherwise specified, the terms “about” and “approximately” mean plus or minus 25 percent or less.
It is to be understood that unless otherwise specified, any specified range or group is as a shorthand way of referring to each and every member of a range or group individually, as well as each and every possible sub-range or sub-group encompassed therein and similarly with respect to any sub-ranges or sub-groups therein. Unless otherwise specified, the present disclosure relates to and explicitly incorporates each and every specific member and combination of sub-ranges or subgroups.
As used herein, the term "on the order of", when used in conjunction with a quantity or parameter, refers to a range spanning approximately one tenth to ten times the stated quantity or parameter.
As noted above, supervised learning methods can be hindered by many challenges when applied to human activity recognition, including the heterogeneity of data among users, limited amounts of training data, and the ongoing evolution of device and sensor design, all of which can lead to poor performance of impersonal algorithms for some subjects. Furthermore, known methods for generating user-specific supervised learning models suffer from complexities in implementation, validation and maintenance.
The present inventors have aimed to address these shortcomings in the field of subject-specific activity recognition by developing a more robust approach to personalized activity recognition. It was discovered that improved performance and robustness could be achieved using a subject-specific approach involving a learned deep embedding that is subsequently employed to generate and store subject-specific reference embeddings based on labeled subject-specific sensor data. New unlabeled sensor data associated with activity of the subject can then be classified and/or characterized using the labeled subject-specific reference embeddings for inference. As explained in detail below, some benefits of this approach may include the ability to incorporate novel activity classes without model re-training, and the ability to identify out-of-distribution (OOD) activity classes, thus supporting an open-set activity recognition framework.
Referring now to FIG. 1A, a flow chart is provided of an example method for performing human activity recognition based on a learned deep embedding. In steps 100-120, a deep embedding is learned based on labeled sensor data that is recorded while a set of subjects perform movements. As shown in step 100, sensor data is collected based on movements performed by a set of subjects while the subjects wear one or more activity sensors. In the present example embodiment, the movements are performed under supervision.
A set of activity class labels is also collected when the sensor data is recorded. The activity class labels characterize the supervised movements, such that each activity class label associates a given supervised movement with a respective activity class. The activity classes may be selected from a set of activity classes. For example, in an example implementation involving a physiotherapy application, the set of activity classes may include a number of different types of rehabilitation exercises. Sensor data that is recorded while performing a specific exercise therefore has a corresponding activity label correlating the sensor data with the specific exercise.
The sensor data recorded during the supervised movements performed by the set of subjects in step 100, and the activity class labels respectively associated with the sensor data, are employed as a training dataset for learning a deep
embedding using a neural network, as shown at steps 110 and 120. Various example methods for learning the deep embedding are described in detail below. The learned deep embedding facilitates the generation of an embedding, e.g. a set of features, based on sensor data from a selected subject.
As shown at step 130, a set of labeled sensor data associated with supervised movements performed by the selected subject may be recorded, where the sensor data for each supervised movement performed by the selected subject is associated with a respective activity class label. The learned deep embedding may then be employed to generate a set of reference embeddings for the selected subject based on the set of labeled sensor data and the associated activity class labels, as shown at 140.
The set of subject-specific reference embeddings (that is, the embeddings generated from the labeled sensor data associated with the selected subject) may be employed with their respective activity class labels to infer an activity class for unsupervised movements performed by the subject. As shown at steps 150 and 160, sensor data may be collected while the selected subject performs an unsupervised movement in an absence of supervision, and the learned deep embedding may be employed to process the sensor data to generate an unsupervised embedding associated with the unsupervised movement. The set of reference embeddings (which are specific to the subject), along with their respective activity classes, may be employed to process the unsupervised embedding to infer the activity class associated with the unsupervised movement, as shown at 170. Various example methods for determining an activity class based on the reference and unsupervised embeddings associated with the selected subject are described in further detail below.
It will be understood that the supervised movements in steps 100 and 130 of FIG. 1 may be performed according to a wide variety of supervision methods. For example, the movements may be supervised with guidance from an instructor, facilitator, teacher, care provider, or other authority. The supervision provides at least some guidance with regard to the movements to facilitate or encourage the correct performance of the movements. For example, in a physiotherapy example implementation, the supervised movements may be performed based on feedback from a physiotherapist, where the physiotherapist provides supervision and guidance during the performing of the movements. The physiotherapist need not be physically present during the performing of the movements, and may, for example, provide guidance based on a video session in which the physiotherapist views the movements over a remote video connection and provides feedback to the subject. In other example implementations, the movements may be supervised by an automated
means. For example, the subject may wear one or more sensors or fiducial markers, and signals recorded based on the one or more sensors or fiducial markers may be processed, and feedback may be provided to the subject based on whether or not the movements meet predetermined criteria.
Referring again to FIG. 1A, step 170 includes the step of processing the reference and unsupervised embeddings associated with the selected subject to determine an activity class for the unsupervised movement. This may be performed, for example, using a distance-based search method. For example, the k-nearest neighbour algorithm, or another algorithm based on distance (e.g. Euclidean distance) may be employed to classify the unsupervised embedding with a suitable activity class label.
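The distance-based classification step described above can be sketched in a few lines of numpy. The helper name classify_movement, the array shapes, and the toy activity labels are illustrative assumptions, not part of the disclosure.

```python
import numpy as np

def classify_movement(reference_embeddings, reference_labels, unsupervised_embedding, k=3):
    """Infer an activity class for an unsupervised embedding by k-nearest-neighbour
    voting over the subject's labeled reference embeddings (Euclidean distance)."""
    distances = np.linalg.norm(reference_embeddings - unsupervised_embedding, axis=1)
    nearest = np.argsort(distances)[:k]
    labels, counts = np.unique(np.asarray(reference_labels)[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy example: two reference clusters for hypothetical activity classes.
refs = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                 [1.0, 1.0], [1.1, 1.0], [1.0, 1.1]])
labels = ["squat", "squat", "squat", "lunge", "lunge", "lunge"]
print(classify_movement(refs, labels, np.array([0.05, 0.05])))  # → squat
```

In practice the reference embeddings would be the stored outputs of the learned deep embedding for the subject's labeled supervised movements, and k would be tuned on held-out data.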
The example embodiment illustrated in the flow chart shown in FIG. 1B provides a subject-specific (subject-customized) method for inferring an activity class for one or more unsupervised movements performed by a selected subject. This is achieved, as illustrated in step 180, by first obtaining a set of reference embeddings associated with the subject, each reference embedding having an associated activity class, with each reference embedding having been generated by processing, with a learned deep embedding, sensor data associated with supervised movements performed by the subject. Additional sensor data is then collected while the subject performs an unsupervised movement, as shown at step 185. The learned deep embedding is then employed in step 190 to process the additional (unsupervised) sensor data to generate an unsupervised embedding associated with the unsupervised movement. Finally, as shown at step 195, the set of reference embeddings are processed, with the activity classes associated with the set of reference embeddings, and with the unsupervised embedding, to determine an activity class associated with the unsupervised movement.
The deep embedding can be learned according to a wide variety of methods. In one example embodiment, the deep embedding may be learned by training a neural network having a classification output layer, where the deep embedding is obtained according to an internal feature layer (e.g. a final feature layer) of the neural network, as opposed to a final classification output layer. For example, in an example implementation, the neural network may be a convolutional neural network having a convolutional portion and a fully connected (e.g. dense) portion, with a final output layer that is processed (e.g. via a softmax function) to generate predictions of the probability that a given movement belongs to each activity class. For example, a convolutional neural network may be trained according to categorical cross entropy loss. In such a case, while the neural network is trained based on predicted activity
classes, a final feature layer associated with the convolutional portion of the neural network (e.g. a feature vector generated based on the output of the convolutional portion of the neural network) may be employed as the deep embedding for use in the present example methods involving activity class inference based on the use of labeled subject-specific reference embeddings.
In other example embodiments, the deep embedding may be learned directly according to training that employs a metric learning loss function that operates on an embedding, thereby directly optimizing (training) the deep embedding according to a desired task. For example, a Siamese neural network trained according to a metric loss function can generate such a deep embedding. Non-limiting examples of metric learning loss functions include contrastive loss, triplet loss, N-pair loss, proxy loss, and prototypical loss.
In some example embodiments in which the deep embedding is learned via training a neural network with a metric learning loss function, the metric learning loss function may be subject-specific, in which the loss function includes at least one comparative measure that involves at least two samples associated with a common subject (as shown in step 125 of the metric learning method for learning a deep embedding as shown in FIG. 1C). An example of such an embodiment is illustrated below with reference to a deep embedding learned with a triplet loss function (e.g. a triplet neural network).
A conventional triplet neural network learns an embedding $f(x)$ for data $x$ into a feature space $\mathbb{R}^d$ such that the Euclidean distance between data of the same target class ($y$) is small and the distance between data of different target classes is large. With a squared Euclidean distance metric, the triplet loss ($L_T$) is defined as:

$$L_T = \sum_{T} \max\left\{ \left\| f(x^a) - f(x^p) \right\|_2^2 - \left\| f(x^a) - f(x^n) \right\|_2^2 + \alpha,\; 0 \right\} \tag{1}$$

where $x^a$ is a sample from a given class (anchor), $x^p$ is a different sample of the same class (positive), and $x^n$ is a sample of a different class (negative). $\alpha$ is the margin, a hyperparameter of the model defining the distance between class clusters. The same embedding $f(x)$ is applied to each sample in the triplet, and the objective is optimized over a training set of triplets with cardinality $T$. The number of possible triplets that can be generated from a data set with cardinality $N$ is $O(N^3)$.
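For illustration, the triplet loss of Eq. (1) can be evaluated for a batch of already-embedded samples as follows. This is a minimal numpy sketch; the function and variable names are assumptions, not from the disclosure.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss with squared Euclidean distances: encourages each anchor to sit
    closer to its positive than to its negative by at least the margin alpha."""
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # ||f(x^a) - f(x^p)||^2
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # ||f(x^a) - f(x^n)||^2
    return np.sum(np.maximum(d_pos - d_neg + margin, 0.0))

a = np.array([[0.0, 0.0]])
p = np.array([[0.1, 0.0]])   # same class, close to anchor
n = np.array([[1.0, 0.0]])   # different class, far from anchor
print(triplet_loss(a, p, n))  # 0.0: this triplet already satisfies the margin
```

During training, the same network computes all three embeddings and gradients of this loss are backpropagated through the shared weights.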
In practice, a triplet neural network will converge well before a single pass over the full set of triplets, and therefore a subset of triplets should be specifically selected from the full set. According to the present example embodiment, a subject-specific triplet selection strategy may be employed in which each triplet derives its respective samples (anchor, positive and negative samples) from a single respective subject, which yields a modified, subject-specific triplet loss function, as follows:

$$L_{T_s} = \sum_{s \in S} \sum_{T_s} \max\left\{ \left\| f(x_s^a) - f(x_s^p) \right\|_2^2 - \left\| f(x_s^a) - f(x_s^n) \right\|_2^2 + \alpha,\; 0 \right\} \tag{2}$$

where $x_s^a$ is a segment of a particular activity class for subject $s$ (anchor), $x_s^p$ is a segment of the same activity class and subject as the anchor (positive), and $x_s^n$ is a segment of a different activity class but from the same subject as the anchor (negative). $T_s$ denotes the full set of triplets that may be drawn from a single subject, and $S$ is the full set of subjects. This approach reduces the number of possible triplets to $O(N)$.
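One way to realize such subject-specific triplet selection is sketched below. The data layout (a list of (subject, activity class, segment) tuples) and the function name are illustrative assumptions; a real pipeline would draw windowed sensor segments rather than opaque tokens.

```python
import random
from collections import defaultdict

def subject_triplets(samples, n_triplets, seed=0):
    """Draw triplets (anchor, positive, negative) in which all three segments come
    from the same subject; `samples` is a list of (subject, activity_class, segment)."""
    rng = random.Random(seed)
    by_subject = defaultdict(lambda: defaultdict(list))
    for subject, activity, segment in samples:
        by_subject[subject][activity].append(segment)
    triplets = []
    while len(triplets) < n_triplets:
        subject = rng.choice(list(by_subject))
        classes = list(by_subject[subject])
        # need one class with >= 2 segments (anchor + positive) and a second class (negative)
        anchor_classes = [c for c in classes if len(by_subject[subject][c]) >= 2]
        if not anchor_classes or len(classes) < 2:
            continue
        a_cls = rng.choice(anchor_classes)
        n_cls = rng.choice([c for c in classes if c != a_cls])
        anchor, positive = rng.sample(by_subject[subject][a_cls], 2)
        negative = rng.choice(by_subject[subject][n_cls])
        triplets.append((anchor, positive, negative))
    return triplets
```

Mixing these subject-specific triplets with conventionally drawn triplets (e.g. the 50% subject-triplet mix reported for the PTN model in FIG. 6) is a straightforward extension.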
Accordingly, in the present example embodiment, the triplet loss function is based on a sum of comparative measures, where each comparative measure involves a triplet of embeddings that are generated from an anchor sample, a positive sample, and a negative sample, and where minimization of the loss function results in a minimization of the distance between the anchor and positive embeddings within each triplet and the maximization of the distance between the anchor embedding and the negative embedding. However, unlike the conventional approach to generating a triplet loss, the present example embodiment constrains each triplet to employ embeddings based on sensor data obtained from a common subject. In other words, one triplet of embeddings may be generated based on sensor data from a first subject, another triplet of embeddings may be generated based on sensor data from a second subject, and yet another triplet of embeddings may be generated based on sensor data from a third subject. It will be understood that sensor data from a given subject may be employed to generate more than one triplet, provided that sufficient sensor data (e.g. corresponding to a sufficient number of labeled movements of a given activity class) exists.
While the preceding example embodiment involved a triplet loss function in which each triplet of embeddings was generated based on sensor data from a common subject, other example embodiments may not require each and every triplet to be so generated. For example, in one example implementation, at least one triplet of embeddings is generated based on sensor data from a common subject, or at least a given fraction, such as at least 25%, 50% or 75%, of the triplets are generated based on sensor data from respective common subjects. In other example implementations, two of the three embeddings within a given subject-specific triplet may be generated based on sensor data obtained from a common subject. For example, one triplet may include an anchor embedding and a positive embedding that are generated based on sensor data from a first subject, and a negative embedding that is generated based on sensor data from a second subject. Alternatively, a given triplet may include an anchor embedding and a negative embedding that are generated based on sensor data from a first subject, and a positive embedding that is generated based on sensor data from a second subject. In another alternative implementation, a given triplet may include a positive embedding and a negative embedding that are generated based on sensor data from a first subject, and an anchor embedding that is generated based on sensor data from a second subject.
While the preceding example embodiment employed a subject-specific triplet loss function for training a neural network, other types of metric loss functions may also be rendered subject-specific in a similar manner. For example, a Siamese network may be employed to learn a deep embedding based on a subject-specific contrastive loss function, in which each positive pair and each negative pair of embeddings are associated with respective common subjects. For example, a first pair of positive embeddings employed in a contrastive loss doublet may be generated based on sensor data obtained from a first subject, and a second pair of positive embeddings employed in a contrastive loss doublet may be generated based on sensor data obtained from a second subject (and likewise for negative doublets). In another example implementation, a subject-specific multi-class N-pair loss function may be employed to learn a deep embedding, in which at least one component of the loss function includes positive and anchor samples from a common subject and common class, and also N-1 negative samples from the common subject but multiple different activity classes. It will be understood that other metric loss functions, such as proxy loss and prototypical loss, may also be adapted to generate subject-specific loss functions.
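A batch contrastive loss over doublets can be sketched as follows; under the subject-specific variant described above, each doublet would be drawn from a single subject's data. The function and variable names are illustrative assumptions.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Contrastive loss for a batch of doublets: same-class pairs (same_class=1)
    are pulled together; different-class pairs are pushed beyond the margin.
    In the subject-specific variant, each doublet comes from one subject."""
    d = np.linalg.norm(emb_a - emb_b, axis=1)
    loss = same_class * d ** 2 + (1 - same_class) * np.maximum(margin - d, 0.0) ** 2
    return np.sum(loss)

# One positive doublet and one negative doublet (each notionally from a single subject).
a = np.array([[0.0, 0.0], [0.0, 0.0]])
b = np.array([[0.0, 0.0], [2.0, 0.0]])
y = np.array([1.0, 0.0])  # 1 = same activity class, 0 = different
print(contrastive_loss(a, b, y))  # 0.0: positives coincide, negatives exceed the margin
```
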
It is further noted that the subject-specific metric loss function that is employed in the preceding embodiment may or may not include sensor data from the subject for whom the reference embedding and the unsupervised embedding are generated. Accordingly, the present subject-specific metric loss example embodiments involve two different subject-specific aspects, one associated with training and one associated with inference of activity class for unsupervised movements, namely: (i) a subject-specific learned deep embedding generated based on a subject-specific metric loss function based on sensor data collected from a plurality of training subjects, and (ii) subject-specific reference and unsupervised
embeddings generated for a subject for whom activity class inference is to be performed. To be clear, aspect (i) involves subject-specific learning and loss minimization for a metric learning deep embedding, while aspect (ii) involves subject-specific inference based on a comparison of a supervised reference embedding generated from labeled sensor data that is collected during the performance of supervised movements, and an unsupervised embedding that is generated from unlabeled sensor data that is collected during the performance of unsupervised movement, where both the reference and unsupervised embeddings are generated based on respective sensor data associated with the same subject.
In embodiments involving the use of a metric loss function to learn a deep embedding from a neural network, the neural network may take on a wide variety of forms. In one example implementation, the neural network may be or include a convolutional core of a convolutional neural network (e.g. a neural network including a series of convolutional layers, and associated activation and dropout/pooling and/or other layers/functions (e.g. batch normalization)), resulting in a deep embedding as opposed to an output classification layer. Other non-limiting examples of suitable neural networks that can be employed to generate a deep embedding trained via metric learning include convolutional recurrent neural networks and recurrent neural networks. Other examples of suitable neural networks include auto-encoder and variational auto-encoder neural networks.
In some example embodiments, the deep embedding may include a combination of learned features and engineered features. For example, the deep embedding may include a first set of features learned via metric learning and an additional set of features obtained by performing feature engineering on the sensor data. The full set of features (learned and engineered) may be employed to generate the reference embedding and the unsupervised embedding, and thus subsequently employed for inferring the activity class associated with the unsupervised movement.
In some example embodiments, the sensor data that is employed to learn the deep embedding, and/or to generate the reference embedding may include sensor data from only correctly performed movements (e.g. with sensor data only included when pre-selected performance criteria associated with the sensor data is satisfied), or from both correctly and incorrectly performed movements.
In some example embodiments, the set of activity classes that are associated with the supervised movements performed when obtaining the labeled sensor data for generating the reference embeddings may be the same as, or different from (e.g. a subset of), the set of activity classes associated with the supervised movements employed to learn the deep embedding.
In some example implementations, at least one reference embedding may be validated in relation to the training dataset before being employed to determine the activity class associated with the unsupervised movement. For example, this could be accomplished by comparing this new reference embedding to known valid reference embeddings in terms of distance in the embedding space. It would be expected that a valid reference embedding would be near to previously valid reference embeddings of the same class.
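One way such a validation step might be sketched is to compare the new reference embedding against the centroid of known-valid reference embeddings of the same class; the centroid comparison and the distance threshold below are illustrative assumptions, not a prescribed method.

```python
import numpy as np

def validate_reference(new_emb, valid_embs, valid_labels, activity_class,
                       max_dist=0.5):
    # Accept a new reference embedding only if it lies within `max_dist`
    # of the centroid of known-valid reference embeddings of the same
    # activity class (distances measured in the embedding space).
    same_class = valid_embs[valid_labels == activity_class]
    if len(same_class) == 0:
        return False  # no valid references of this class to compare against
    centroid = same_class.mean(axis=0)
    return bool(np.linalg.norm(new_emb - centroid) <= max_dist)
```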
In some example embodiments, an activity class associated with the unsupervised movement may be identified as not belonging to the set of activity classes that were employed when collecting the sensor data for learning the deep embedding. For example, a local outlier factor algorithm may be employed to compare the unsupervised embedding with the set of reference embeddings and identify movements as not belonging to the initial set of activity classes that were employed to learn the deep embedding. In another embodiment, if the distance between the unsupervised embedding and the reference embeddings is sufficiently large, the unsupervised embedding could be considered an outlier in so far as it represents a different activity class than exists in the reference embeddings.
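The simpler distance-based criterion mentioned above may be sketched as follows. This is a minimal NumPy illustration; the threshold value is an assumption, and the local outlier factor model is a more robust alternative.

```python
import numpy as np

def is_out_of_distribution(unsup_emb, ref_embs, threshold=0.8):
    # Flag the unsupervised embedding as an outlier when its distance to
    # the NEAREST reference embedding exceeds a threshold. (A local
    # outlier factor model, as described above, is a more robust
    # variant; the threshold here is purely illustrative.)
    nearest = np.min(np.linalg.norm(ref_embs - unsup_emb, axis=1))
    return bool(nearest > threshold)
```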
Unlike conventional approaches that employ a neural network classifier for human activity classification, in which the identification of the type of movement is limited to the set of activity classes employed during training, the present example methods involving the use of labeled reference embeddings may permit the use of the learned deep embedding to classify new activity classes that were not present in the initial sensor data employed to learn the deep embedding. For example, the embedding can generalize to new activity classes when samples of an activity novel with respect to the training distribution are obtained under supervision and embedded as reference embeddings for that subject. Future unsupervised instances of this activity may then be identified when the unsupervised embeddings are near to the reference embeddings of the novel activity class in the embedding space.
In some example embodiments, the unsupervised sensor data collected when the subject performs unsupervised movements may be processed to determine one or more performance attributes associated with the movement, with the assessment optionally being based on the inferred activity class of the unsupervised movement. Examples of such performance attributes include range of motion, number of exercise repetitions, frequency of exercise repetitions, and qualitative exercise performance.
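For instance, a repetition count might be estimated from a single sensor channel by counting prominent local maxima. The peak-counting sketch below is illustrative only and assumes a pre-smoothed, roughly normalized signal.

```python
import numpy as np

def count_repetitions(signal, min_height=0.5):
    # Count repetitions as the number of local maxima exceeding
    # `min_height` in one (ideally pre-smoothed) sensor channel.
    s = np.asarray(signal, dtype=float)
    interior = s[1:-1]
    peaks = (interior > s[:-2]) & (interior > s[2:]) & (interior > min_height)
    return int(np.sum(peaks))
```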
Referring now to FIG. 2, an example system is shown that includes control and processing circuitry 200 that is programmed to compute an inferred activity class associated with unsupervised movement of a subject, according to one of the
example embodiments described herein, or variations thereof. The example system includes one or more sensors 305 for sensing signals associated with movement of a subject. Non-limiting examples of such sensors include accelerometers, gyroscopes, pressure sensors, electrical contact sensors, force sensors, velocity sensors, and myoelectric sensors. In some example implementations, one or more sensors may be wearable sensors configured to be worn by, held by, or otherwise supported by or secured to a subject. In some example implementations, one or more sensors may be remote sensors configured to track fiducial markers (e.g. active or passive fiducial markers) worn by, held by, or otherwise supported by or secured to a subject.
The sensor(s) 305 may be integrated within or a sub-component of the control and processing circuitry 200, as shown by the dashed perimeter 300. For example, the control and processing circuitry may be a mobile computing device that includes sensors, such as a mobile phone (smartphone), smartwatch, or mobile computing fitness device. In other example implementations, one or more sensors may be external sensors to which the control and processing circuitry 200 is interfaced for recording sensor signals associated with subject movement (e.g. via wireless signals, such as Bluetooth™).
As shown in the example embodiment illustrated in FIG. 2, control and processing circuitry 200 may include a processor 210, a memory 215, a system bus 205, a data acquisition and control interface 220 for acquiring sensor data and user input and for sending control commands to the machine 400, a power source 225, and a plurality of optional additional devices or components such as storage device 230, communications interface 235, display 240, and one or more input/output devices 245.
The example methods described herein can be implemented, at least in part, via hardware logic in processor 210 and partially using the instructions stored in memory 215. Some example embodiments may be implemented using processor 210 without additional instructions stored in memory 215. Some embodiments may be implemented using the instructions stored in memory 215 for execution by one or more microprocessors.
For example, the example methods described herein for inferring an activity class based on a learned deep embedding, in which subject-specific labeled reference embeddings, generated based on labeled sensor data collected from a subject in the presence of supervision, are compared to an unlabeled embedding, generated based on unlabeled sensor data collected from the same subject without supervision, may be implemented via processor 210 and/or memory 215. Such a method is represented as activity classification module 320, which operates on the
stored reference embeddings and associated activity class labels 330, as well as the unsupervised embedding. The learned deep embedding, which is generated based on sensor data obtained from a plurality of subjects, optionally using a subject-specific metric learning method, as per FIG. 1C, may be generated via the processing and control circuitry 200, or, for example, via a remote computing system that is connectable to the processing and control circuitry 200.
For example, the deep embedding may be learned using a remote server 350 that is operably connected or connectable to the control and processing circuitry 200 via a network 360, with the deep embedding parameters (e.g. filters/kernels of a convolutional neural network) transmitted to the processing and control circuitry 200 for local generation of the reference and unsupervised embeddings. As shown in the example embodiment illustrated in FIG. 2, a remote client device 370 (e.g. a smartphone) may be employed to permit an external user (e.g. a healthcare provider) to access the inferred activity classes, for example, to monitor the performance and/or progress of a subject in an unsupervised environment.
It is to be understood that the example system shown in the figure is not intended to be limited to the components that may be employed in a given implementation. For example, the control and processing circuitry 200 may include a mobile computing device, such as a tablet or smartphone, that is connected to local processing hardware supported by a wearable computing device via one or more wired or wireless connections. In another example implementation, a portion of the control and processing circuitry 200 may be implemented, at least in part, on a remote computing system that connects to local processing hardware via a remote network, such that some aspects of the processing are performed remotely (e.g. in the cloud), as noted above.
Although only one of each component is illustrated in FIG. 2, any number of each component can be included. For example, a computer typically contains a number of different data storage media. Furthermore, although the bus 205 is depicted as a single connection between all of the components, it will be appreciated that the bus 205 may represent one or more circuits, devices or communication channels which link two or more of the components. For example, in many computers, bus 205 often includes or is a motherboard.
Although some example embodiments of the present disclosure can be implemented in fully functioning computers and computer systems, various embodiments are capable of being distributed as a computing product in a variety of forms and are capable of being applied regardless of the particular type of machine or computer readable media used to actually effect the distribution.
A computer readable storage medium can be used to store software and data which when executed by a data processing system causes the system to perform various methods. The executable software and data may be stored in various places including for example ROM, volatile RAM, nonvolatile memory and/or cache. Portions of this software and/or data may be stored in any one of these storage devices. As used herein, the phrases “computer readable material” and “computer readable storage medium” refer to all computer-readable media, except for a transitory propagating signal per se.
While some of the preceding example embodiments have been illustrated in the context of implementations in the domain of physiotherapy, it will be understood that the human activity recognition methods disclosed herein are broadly applicable to a wide range of uses, including, but not limited to, activity recognition in video games, medical procedures, performing arts, martial arts, military applications, and athletics.
In the broader medical rehabilitation field, including, but not limited to, physiotherapy, the present example embodiments may be employed in applications including, but not limited to, rehabilitation of subjects suffering from loss or deficit of motor control, such as stroke patients or patients recovering from a surgical procedure or injury. In some example embodiments, the supervised sensor data that is collected and labeled to generate the subject-specific reference embedding may be collected when a care provider is present with the subject, with the provider observing the movements and providing the labeling of the movements. The subject may then subsequently perform the unsupervised movements in an unsupervised environment, such as a home setting. The inferred activity classes of the unsupervised movements may be communicated directly to the subject, and/or remotely communicated to the provider for assessment.
In the field of video games, the present example methods can be employed for activity recognition in applications including, but not limited to, identification of player movements, training of player movements (e.g. providing feedback to a player based on an inferred movement class), confirmation/verification of player activity recognition via other algorithms that process sensor input, and logging/archiving of player movements. In the medical field, the present example methods can be employed in applications including, but not limited to, training of medical procedures (e.g. surgical and/or diagnostic procedures) and logging of movements performed during medical procedures. In the field of athletics and martial arts, the present example embodiments may be employed in applications including identification and recording of movements performed during training or during a competitive event.
EXAMPLES
The following examples are presented to enable those skilled in the art to understand and to practice embodiments of the present disclosure. They should not be considered as a limitation on the scope of the disclosure, but merely as being illustrative and representative thereof.
The present examples describe experiments in which deep embeddings are employed for personalized activity recognition, with reference to example methods involving 1) extracted features from a neural network classifier and 2) an optimized embedding learned using triplet neural networks (TNNs). These example methods are compared to a baseline impersonal neural network classifier and a personalized engineered feature representation, and are evaluated on three publicly available inertial human activity recognition data sets (MHEALTH, WISDM, and SPAR), comparing classification accuracy, out-of-distribution activity detection, and embedding generalization to new activities. It is demonstrated that an example deep embedding trained using subject-specific triplet loss provides the best performance overall, with all personalized deep embeddings out-performing a baseline personalized engineered feature embedding and an impersonal fully convolutional neural network classifier.
Example 1: Methods
1.1 Data Sets and Hardware
The algorithms were evaluated on three publicly available inertial activity recognition data sets: MHEALTH, WISDM, and SPAR. These data sets encompass a combination of activities of daily living, exercise activity, and physiotherapy activities. Class balance is approximately equal within each and there is minimal missing data. The specific attributes of these data sets are summarized in FIG. 3.
1.2 Preprocessing
The WISDM and MHEALTH data was resampled to 50 Hz, using cubic interpolation, to provide a consistent basis for evaluating model architecture. The time series data were then pre-processed with sliding window segmentation to produce fixed length segments of uniform activity class. A four second sliding window was utilized for the MHEALTH and SPAR data sets, and a ten second window was utilized for WISDM for consistency with previous evaluations. An overlap ratio of 0.8 was used in the sliding window segmentation as a data augmentation strategy.
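The sliding window segmentation described above may be sketched as follows; this is an illustrative NumPy implementation in which the window length and overlap ratio are parameters, not the exact code used in the evaluation.

```python
import numpy as np

def sliding_windows(series, window_len, overlap=0.8):
    # Segment a (T, C) time series into fixed-length windows; with an
    # overlap ratio of 0.8 the step between windows is 20% of the
    # window length.
    step = max(1, int(round(window_len * (1.0 - overlap))))
    starts = range(0, len(series) - window_len + 1, step)
    return np.stack([series[s:s + window_len] for s in starts])
```

For example, a 4-second window at 50 Hz corresponds to `window_len=200`, yielding a step of 40 samples between consecutive windows.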
Only the smart watch data was included from the WISDM data set, because the smart watch and mobile phone data were not synchronized during data
collection. Four WISDM subjects were also excluded from the evaluation due to errors in the data collection that resulted in absent or duplicated sensor readings (subjects 1637, 1638, 1639, and 1640).
Example 2 - Impersonal Fully Convolutional Network (FCN)
A fully convolutional neural network (FCN) was utilized as the baseline impersonal supervised learning model. The FCN architecture is considered a strong baseline for time series classification even in comparison to deep learning models with modern architectural features used in computer vision such as skip connections. As shown in FIG. 4, the FCN core consists of 1D convolutional layers, with rectified linear unit (ReLU) activation, and batch normalization. Regularization of the model is achieved using dropout applied at each layer. Global average pooling is used after the last convolutional layer to reduce the model sensitivity to translations along the temporal axis, as this ensures the receptive field of the features in the penultimate feature layer includes the entirety of the window segment. The receptive field of filters in the last convolutional layer prior to global average pooling was 13 samples, which is equivalent to 260 ms at a sampling rate of 50 Hz. An L2 normalization is applied after global pooling to constrain the embedding to the surface of a unit hypersphere, which improves training stability.
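The forward pass of such an FCN core may be sketched at a shape level as follows. This minimal NumPy illustration omits batch normalization, dropout, and training, and the layer shapes are illustrative assumptions rather than the evaluated architecture.

```python
import numpy as np

def conv1d_valid(x, w):
    # "Valid" 1-D convolution: x is (T, C_in), w is (K, C_in, C_out).
    K, _, C_out = w.shape
    T_out = x.shape[0] - K + 1
    out = np.empty((T_out, C_out))
    for t in range(T_out):
        out[t] = np.tensordot(x[t:t + K], w, axes=([0, 1], [0, 1]))
    return out

def fcn_core_forward(x, layer_weights):
    # Stacked 1-D convolutions with ReLU activation, then global average
    # pooling over the temporal axis and L2 normalization, producing a
    # unit-norm embedding on the surface of a unit hypersphere.
    for w in layer_weights:
        x = np.maximum(conv1d_valid(x, w), 0.0)  # conv + ReLU
    embedding = x.mean(axis=0)                   # global average pooling
    return embedding / (np.linalg.norm(embedding) + 1e-12)  # L2 normalize
```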
The impersonal FCN classifier model consists of an FCN core with a final dense layer with softmax activation. The FCN classifier was trained for 150 epochs using the Adam optimizer, categorical cross entropy loss, and a learning rate of 0.001. Gradient norm clipping to 1.0 was used to mitigate exploding gradients.
The FCN core architecture was also used as the basis for both the personalized deep features (PDF) and personalized triplet network (PTN) models (both of which are described below) to provide a consistent comparison of these approaches.
Example 3 - Personalized Feature Classifiers
Three personalized feature classifiers were implemented and compared for activity classification performance, namely a personalized engineered features (PEF) classifier, a personalized deep features (PDF) classifier, and a personalized triplet network (PTN) classifier. Inference was achieved in each of these models by comparing a subject’s embedded test segments to labeled reference embeddings specific to the subject (i.e. previously measured and corresponding to the same subject). For the test subjects, the time series data for each activity was split along the temporal axis, reserving the first part of the split time series for use as labeled
reference data (employed for inference) and the latter part of the split time series for use as unlabeled test data. This split was performed prior to sliding window segmentation to ensure there is no temporal overlap of reference and test samples. This partitioning of the data is depicted in FIG. 5. The effect of reference data size on model performance was evaluated using 50% of the test data as the baseline evaluation.
To determine the activity class of a test segment, the 3 nearest reference embeddings were searched using a k-nearest neighbors (k-NN) model with a Euclidean distance metric and a uniform weight decision function.
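This k-NN decision rule may be sketched as follows (illustrative NumPy code; in practice a library implementation such as scikit-learn's `KNeighborsClassifier` could equally be used):

```python
import numpy as np

def knn_classify(test_emb, ref_embs, ref_labels, k=3):
    # Majority vote over the k nearest reference embeddings, using a
    # Euclidean distance metric and uniform weights.
    d = np.linalg.norm(ref_embs - test_emb, axis=1)
    votes = ref_labels[np.argsort(d)[:k]]
    values, counts = np.unique(votes, return_counts=True)
    return values[np.argmax(counts)]
```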
2.1 Personalized Engineered Features (PEF)
An engineered feature representation was employed to serve as a baseline personalized classifier model. The representation consisted of statistical and heuristic features used for inertial activity recognition including mean, median, absolute energy, standard deviation, variance, minimum, maximum, skewness, kurtosis, mean spectral energy, and mean crossings. The features were individually computed for each of the data channels in the data set. This resulted in 66 features for the WISDM and SPAR data sets, and 174 features for the MHEALTH data set. All features were individually scaled to unit norm and zero mean across the training data set.
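A subset of these engineered features may be computed per channel as sketched below. This is an illustrative NumPy implementation: skewness and kurtosis are omitted for brevity, and the exact feature definitions used in the evaluation may differ.

```python
import numpy as np

def engineered_features(channel):
    # Compute a subset of the statistical/heuristic features listed
    # above for one sensor channel of a window segment.
    x = np.asarray(channel, dtype=float)
    spectrum = np.abs(np.fft.rfft(x)) ** 2
    mean = x.mean()
    return {
        "mean": float(mean),
        "median": float(np.median(x)),
        "abs_energy": float(np.sum(x ** 2)),
        "std": float(x.std()),
        "variance": float(x.var()),
        "min": float(x.min()),
        "max": float(x.max()),
        "mean_spectral_energy": float(spectrum.mean()),
        # Mean crossings: sign changes of the mean-centered signal.
        "mean_crossings": int(np.sum(np.diff(np.sign(x - mean)) != 0)),
    }
```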
2.2 Personalized Deep Features (PDF)
The personalized deep feature model was identical to the FCN model in architecture and training. However, inference was achieved by embedding reference samples with the FCN penultimate feature layer and using k-NN to search the reference data nearest to the embedded test samples.
2.3 Personalized Triplet Network (PTN)
A PTN embedding f(x) was derived by training the FCN core with triplet loss. In the present experiments, conventional triplet loss was evaluated with random triplets (PTN† as per Eq. 1), and subject triplet loss (PTN as per Eq. 2) with a portion of the triplets being subject triplets and the remainder randomly selected. The same optimizer and hyperparameters as described previously were employed, with the exception that the learning rate was reduced to 0.0002 when training the FCN core with triplet loss. A value of 0.3 was employed for the margin parameter α. Despite the greater cardinality of the triplet set, an epoch was consistently defined as having N samples.
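The mixed triplet selection strategy may be sketched as follows; the subject-triplet fraction and sampling scheme shown are illustrative assumptions, not the exact procedure used in the experiments.

```python
import numpy as np

def sample_triplets(labels, subjects, rng, n, subject_fraction=0.5):
    # Sample n (anchor, positive, negative) index triplets: a
    # `subject_fraction` portion are subject triplets (all three members
    # drawn from one subject), and the remainder are sampled across all
    # subjects, per the mixed selection strategy described above.
    triplets = []
    for t in range(n):
        if t < int(n * subject_fraction):
            subj = rng.choice(np.unique(subjects))
            pool = np.where(subjects == subj)[0]   # subject triplet
        else:
            pool = np.arange(len(labels))          # random triplet
        a = rng.choice(pool)
        pos = pool[(labels[pool] == labels[a]) & (pool != a)]
        neg = pool[labels[pool] != labels[a]]
        if len(pos) and len(neg):
            triplets.append((int(a), int(rng.choice(pos)),
                             int(rng.choice(neg))))
    return triplets
```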
Example 3 - Experiments
3.1 Activity Classification
Classification accuracy was evaluated using 5-fold cross-validation grouping folds by subject. Subject distribution across folds was randomized but consistent for each algorithm in keeping with best practices for the evaluation of human activity recognition algorithms. Cross-validated test set performance is summarized for each algorithm on the three data sets in FIG. 6. Accuracy statistics (mean and standard deviation) are aggregated by subject, not by fold. Box and whisker plots demonstrating the variation in performance between individuals are provided in FIG. 7. Training curves for the FCN and PTN models are depicted in FIG. 8.
It was found that all of the personalized feature classifiers out-performed the impersonal FCN classifier and reduced the incidence and degree of negative outlier subjects who individually experience poor performance in the impersonal model, as observed in FIG. 7. Both the personalized deep feature models (PDF and PTN) outperformed the personalized engineered features (PEF). Specifically, the PTN model utilizing subject-specific triplet loss had the best classification performance.
3.2 Out-of-Distribution Detection
Model performance was assessed for distinguishing activity classes present in the training distribution from unknown (out-of-distribution) activity classes. This evaluation was performed by training the models on a subset (70%) of the activity classes, and testing with the full set of activity classes in a subject group 5-fold cross validation scheme. In each fold, the classes considered out-of-distribution were randomly selected but were consistent across the algorithms evaluated. Out-of-distribution performance was assessed using the area under the receiver operating characteristic curve (AUROC) for the binary classification task of in- vs. out-of-distribution.
Out-of-distribution (OOD) classification was implemented for the personalized feature classifiers using a local outlier factor model trained on the in-distribution embeddings on a per-subject basis, and again using a 3-neighbor search with a Euclidean distance metric. For the FCN model, the maximum softmax layer output was used as the decision function. OOD detection performance is plotted in FIG. 9.
In contrast to the classification task, the best performing OOD detector appeared to depend on the data set tested. The FCN, PTN, and PDF classifiers had the highest mean OOD scores for the MHEALTH, WISDM, and SPAR data sets respectively. Considering all data sets, the PTN model performed best overall for OOD detection.
3.3 Generalization to New Activity Classes
Generalization of the personalized features to new activity classes was assessed in a manner similar to the out-of-distribution detection. Instead of a binary in- vs. out-of-distribution classification target, accuracy for classification was assessed across the full distribution of activities in the context of training the embedding on the reduced (in-distribution) subset of activities. The FCN model was not assessed for this task as generalization to new target classes is not a feature of the softmax classification layer. The results are plotted in FIG. 10, depicting excellent performance of all three feature classifiers with the PTN algorithm performing best overall across all three data sets.
3.4 Reference Data Size
The effect of reference sample quantity on personalized feature classifier accuracy was evaluated using 5-fold cross validation on the SPAR data set. The results are plotted in FIG. 11. It was found that approximately 16 reference segments (equal to approximately 16 seconds of data) were required per activity class to achieve optimal results.
3.5 Embedding Size
A parsimonious representation of activity is desirable to minimize the storage and computational cost of personalized feature inference. The effect of embedding size on model performance was assessed using 5-fold cross validation on the SPAR data set. For the PDF and PTN models, the embedding size was adjusted at the final dense layer of the FCN core. For the engineered features, we reduced the embedding size by selecting the most important features as ranked using Gini importance. The Gini importance was calculated for the engineered features using an Extremely Randomized Trees classifier with an ensemble of 250 trees.
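Given importance scores from a fitted tree ensemble (e.g. the `feature_importances_` attribute of scikit-learn's `ExtraTreesClassifier`), the top-k selection may be sketched as follows (illustrative NumPy code; the importance values shown are assumed inputs):

```python
import numpy as np

def select_top_features(X, importances, k):
    # Keep the k features with the highest (precomputed) importance
    # scores; column order of the original matrix is preserved.
    top = np.sort(np.argsort(importances)[::-1][:k])
    return X[:, top], top
```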
Classifier performance as a function of embedding size is plotted in FIG. 12. It was found that accurate results were achievable down to an embedding size of 8, and that the PDF and PTN models are superior to PEF for learning a parsimonious embedding.
3.6 Computational Expense
The computational cost for each model on the SPAR data set is reported in FIG. 13, detailing training and inference time on our hardware, and storage size for model and reference data. In the present implementation, the inference time for the
PDF and PTN classifiers was split nearly equally between embedding computation and nearest embedding search. Training the FCN core with triplet loss in the PTN model increased the fit time by approximately 5-fold in comparison to training with categorical cross entropy loss as done for the PDF and FCN models.
Example 4 - Discussion and Analysis
In the preceding examples, two methods for personalized, subject-specific inertial human activity recognition were described and evaluated using personalized deep features (PDF) and a personalized triplet network (PTN), comparing these to a baseline impersonal fully convolutional network (FCN) and personalized engineered features (PEF) for inertial human activity recognition. The PTN and PDF models significantly outperformed PEF for personalized activity recognition across all metrics evaluated. The three personalized feature classifiers also significantly outperformed the impersonal FCN classifier, which represents the current state of the art for the field.
In fact, the personalized classifiers were able to achieve performance approaching training set performance of the impersonal FCN classifier. As predicted, it was found that the FCN classifier, while producing excellent results overall, performed poorly for some individuals (as low as 50% accuracy) as shown in FIG. 7. The three personalized feature classifiers evaluated all significantly mitigated this issue, and had more consistent results across individual subjects in each data set, thereby providing a technical solution to the previously noted problems in the field of supervised learning for human activity recognition.
The PTN model had the best performance overall, exceeding that of the personalized deep features learned serendipitously by the PDF model in a conventional supervised learning framework. However, a significant disadvantage of using a triplet neural network to learn the embedding function is the increased computational cost during training. On the hardware employed in the present experiments, the PTN approach increased the training time fivefold and tripled the GPU memory requirements in comparison to training an identical core with categorical cross entropy loss. This is because there is the further cost of triplet selection, and each triplet includes three distinct samples that must each be embedded to compute the triplet loss. Fortunately, once the embedding has been trained, there is little difference in the computational requirements to compute the embedding or classify an unknown sample.
A significant advantage of using personalized, subject-specific features for activity classification is the built-in capability to use them for out-of-distribution activity detection and classification of novel activities. In the human activity recognition field,
out-of-distribution detection is particularly important as there exists an infinite number of possible human actions, and therefore it may be impractical to include all possible actions, or even all reasonably likely actions, in the training set. It is therefore desirable for a HAR system to be capable of being trained to recognize a select number of pertinent activities and to have the ability to reject anomalous activities that do not reside in the training distribution.
Typically, deep learning classification algorithms implementing a softmax output layer perform poorly at out-of-distribution activity detection due to overconfidence. This well-known result was replicated in the present evaluations of the FCN classifier on the WISDM and SPAR data sets. Interestingly, the FCN classifier had good OOD performance on MHEALTH. However, various approaches to improving OOD performance for neural networks have been investigated in the computer vision field with mixed results, and this remains an active area of research. Nonetheless, there has been relatively little work on this problem in the context of time series data and particularly inertial activity recognition. The present examples demonstrate that personalized features using a local outlier factor model have good performance for the present synthetic OOD evaluation. Further experimentation may be performed to evaluate alternative approaches and build true out-of-distribution data sets incorporating real-world variation in subject daily activities.
The personalized k-NN model employed to search reference embeddings for classification of test samples in the PEF, PDF, and PTN models was found to be effective, but approximately doubles the inference time in comparison to the FCN model that used a softmax layer for classification. A disadvantage of k-NN search is that its computational time complexity and data storage requirements scale with the number of reference samples, O(N). This property of k-NN limits its utility as an impersonal classifier, as performing inference requires searching the entire training data set. In the context of a personalized algorithm, however, the k-NN search model is limited only to the subject’s reference samples, which have been presently demonstrated to only need to include tens of samples per activity class. It is noted that other search strategies could be implemented to search the reference data. For example, the nearest centroid method could be used, which has computational complexity O(C), scaling linearly with the number of reference classes.
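The nearest centroid alternative may be sketched as follows (illustrative NumPy code; class centroids are precomputed once from the subject's reference embeddings, so each inference compares against one centroid per class rather than every reference sample):

```python
import numpy as np

def fit_centroids(ref_embs, ref_labels):
    # Precompute one centroid per activity class from the subject's
    # reference embeddings.
    classes = np.unique(ref_labels)
    centroids = np.stack([ref_embs[ref_labels == c].mean(axis=0)
                          for c in classes])
    return classes, centroids

def nearest_centroid(test_emb, classes, centroids):
    # Classify by the nearest class centroid; the search cost scales
    # with the number of classes, not the number of reference samples.
    return classes[np.argmin(np.linalg.norm(centroids - test_emb, axis=1))]
```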
The FCN core architecture described in this work, with just 278,848 parameters (~1 MB), is a relatively compact model, particularly in comparison to computer vision or language models that can exceed tens or hundreds of millions of parameters. Given the small size of the model and reference embeddings, implementing a personalized feature classifier based on the FCN core may be
feasible within an edge computing system where the computations for activity recognition are performed locally on the user’s hardware (e.g. a mobile device).
There are various advantages of an edge computing approach, including improved classification latency, reliability, and network bandwidth usage.
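The ~1 MB figure quoted above follows directly from the parameter count, assuming 32-bit floating-point weights (the precision is an assumption; quantized deployments would be smaller):

```python
# Sanity check on the stated model size: 278,848 parameters at float32.
n_params = 278_848
bytes_per_param = 4  # float32; an assumed precision
size_mib = n_params * bytes_per_param / (1024 ** 2)
print(f"{size_mib:.2f} MiB")  # ~1.06 MiB, consistent with the ~1 MB estimate
```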
The subject-specific triplet loss function (Eq. 2) and triplet selection strategy described in the present disclosure significantly improved the performance of the PTN model in comparison to conventional triplet loss. The subject triplets can be considered "hard" triplets in the context of other strategies for specifically selecting hard triplets to improve TNN training. The present models may be beneficial in that they are straightforward to implement and computationally inexpensive in comparison to strategies that require embeddings to be computed prior to triplet selection. A loss function employing subject-specific triplets may be particularly advantageous for data sets collected with heterogeneous hardware.
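The idea of subject-specific triplet selection can be sketched as follows. This is a simplified illustration, not a reproduction of Eq. 2: it applies a standard triplet margin loss, and the only subject-specific constraint shown is that the anchor and positive share both a subject and an activity class. How negatives are drawn in the actual disclosure may differ.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet margin loss on embedding vectors (Euclidean distance)."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(d_ap - d_an + margin, 0.0)

def select_subject_triplet(embeddings, labels, subjects, rng):
    """Pick a triplet whose anchor and positive share one subject and class.

    Hypothetical selection sketch; unlike hard-mining strategies, it needs no
    embeddings computed in advance -- only the subject and class labels.
    """
    i = rng.integers(len(labels))
    same = np.flatnonzero((subjects == subjects[i]) & (labels == labels[i]))
    same = same[same != i]                       # positives: same subject+class
    diff = np.flatnonzero(labels != labels[i])   # negatives: any other class
    j, k = rng.choice(same), rng.choice(diff)
    return embeddings[i], embeddings[j], embeddings[k]

rng = np.random.default_rng(0)
emb = rng.normal(size=(12, 8))           # toy embeddings
labels = np.repeat([0, 1], 6)            # two activity classes
subjects = np.tile([0, 0, 0, 1, 1, 1], 2)  # two subjects, both classes each
a, p, n = select_subject_triplet(emb, labels, subjects, rng)
loss = triplet_margin_loss(a, p, n)
```

Because selection only consults labels, it can run before any forward pass, which is the computational advantage noted above.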
The present examples have demonstrated that deep embeddings derived from fully convolutional neural networks trained with categorical cross entropy or triplet loss significantly outperform engineered features for personalized activity recognition and outperform impersonal supervised learning approaches. Subject-specific triplet selection was found to improve the training and performance of triplet neural networks used for personalized activity recognition.
The specific embodiments described above have been shown by way of example, and it should be understood that these embodiments may be susceptible to various modifications and alternative forms. It should be further understood that the claims are not intended to be limited to the particular forms disclosed, but rather to cover all modifications, equivalents, and alternatives falling within the spirit and scope of this disclosure.
Claims
1. A method of assessing movements of a subject, the method comprising: collecting first sensor data associated with first supervised movements performed by a plurality of subjects in a presence of supervision, and obtaining a first set of activity class labels characterizing the first supervised movements, each activity class label of the first set of activity class labels associating a given first supervised movement with a respective activity class selected from a first set of activity classes; generating a training dataset comprising the first sensor data and the first set of activity class labels; employing the training dataset to learn a deep embedding using a neural network; collecting second sensor data associated with second supervised movements performed by a selected subject in a presence of supervision, and obtaining a second set of activity class labels characterizing the second supervised movements, each activity class label of the second set of activity class labels associating a given second supervised movement with a respective activity class selected from a second set of activity classes; employing the learned deep embedding to process the second sensor data to generate a set of reference embeddings, each reference embedding having an associated activity class; collecting third sensor data while the selected subject performs an unsupervised movement in an absence of supervision; employing the learned deep embedding to process the third sensor data to generate an unsupervised embedding associated with the unsupervised movement; and processing the set of reference embeddings, the activity classes associated with the set of reference embeddings, and the unsupervised embedding to determine an activity class associated with the unsupervised movement.
2. The method according to claim 1 wherein the neural network is trained according to a metric learning loss function.
3. The method according to claim 2 wherein the metric learning loss function comprises at least one subject-specific comparative measure based on two or more samples associated with a common subject.
4. The method according to claim 2 wherein the neural network is trained according to a triplet loss function based on a set of sample triplets, wherein at least one sample triplet includes an anchor data sample and a positive data sample that are associated with a common subject.
5. The method according to claim 4 wherein the at least one sample triplet further includes a negative data sample associated with the common subject.
6. The method according to claim 4 wherein each sample triplet includes an anchor data sample, a positive data sample and a negative sample that are respectively associated with a respective common subject.
7. The method according to claim 2 wherein the neural network is trained according to a contrastive loss function based on a set of sample doublets, wherein at least one sample doublet includes two samples that are associated with a common subject.
8. The method according to any one of claims 2 to 7 wherein the deep embedding is obtained from a feature vector that forms an output of a convolutional portion of a convolutional neural network.
9. The method according to claim 1 wherein the deep embedding is learned by training a convolutional neural network comprising a fully connected portion, and wherein the deep embedding is obtained from the feature vector that forms an input to the fully connected portion of the convolutional neural network.
10. The method according to claim 1 wherein the activity class associated with the unsupervised movement is determined from the unsupervised embedding and the set of reference embeddings using a classifier that employs a distance metric.
11. The method according to claim 10 wherein the classifier is a K-nearest neighbour classifier.
12. The method according to claim 1 wherein the neural network comprises a convolutional neural network.
13. The method according to claim 12 wherein the neural network comprises a convolutional recurrent neural network.
14. The method according to claim 1 wherein the neural network comprises a recurrent neural network.
15. The method according to any one of claims 1 to 14 further comprising processing the third sensor data to determine a performance attribute associated with the unsupervised movement.
16. The method according to claim 15 wherein the activity class of the unsupervised movement is employed when processing the second sensor data to determine the performance attribute.
17. The method according to any one of claims 1 to 16 wherein the plurality of subjects comprises the selected subject.
18. The method according to any one of claims 1 to 17 wherein the first supervised movements include correctly performed movements and incorrectly performed movements.
19. The method according to any one of claims 1 to 17 wherein the second supervised movements include correctly performed movements and incorrectly performed movements.
20. The method according to any one of claims 1 to 17 wherein the second set of activity classes is the same as the first set of activity classes.
21. The method according to any one of claims 1 to 17 wherein the second set of activity classes is different than the first set of activity classes.
22. The method according to any one of claims 1 to 21 wherein the second sensor data and the third sensor data are obtained from one or more wearable sensors worn by the selected subject.
23. The method according to any one of claims 1 to 21 wherein the second sensor data and the third sensor data are obtained from one or more sensors configured to detect fiducial markers worn by the selected subject.
24. The method according to any one of claims 1 to 23 wherein the first sensor data is only collected when a given supervised movement satisfies performance criteria.
25. The method according to any one of claims 1 to 24 wherein each activity class of the first set of activity classes corresponds to a unique physiotherapy exercise.
26. The method according to any one of claims 1 to 24 wherein each activity class of the first set of activity classes corresponds to a unique dance movement.
27. The method according to any one of claims 1 to 24 wherein each activity class of the first set of activity classes corresponds to a unique movement associated with a medical procedure.
28. The method according to claim 1 wherein the deep embedding is learned via a convolutional neural network trained with categorical cross entropy loss.
29. The method according to claim 1 wherein the deep embedding is learned via a recurrent neural network trained with categorical cross entropy loss.
30. The method according to claim 1 wherein the deep embedding is learned via one of an auto-encoder and a variational auto-encoder.
31. The method according to claim 1 wherein the deep embedding comprises engineered features and learned features.
32. The method according to claim 1 wherein at least one reference embedding is validated in relation to the training dataset before being employed to determine the activity class associated with the unsupervised movement.
33. The method according to claim 1 wherein the activity class associated with the unsupervised movement is identified as not belonging to the first set of activity classes.
34. The method according to claim 33 wherein a local outlier factor algorithm is employed to compare the unsupervised embedding with the set of reference embeddings.
35. A method of learning a subject-specific deep embedding for use in assessing movements of a subject via inference, the method comprising: collecting first sensor data associated with first supervised movements performed by a plurality of subjects in a presence of supervision, and obtaining a first set of activity class labels characterizing the first supervised movements, each activity class label of the first set of activity class labels associating a given first supervised movement with a respective activity class selected from a first set of activity classes; generating a training dataset comprising the first sensor data and the first set of activity class labels; and employing the training dataset to learn a deep embedding using a neural network, wherein the neural network is trained according to a metric learning loss function; wherein the metric learning loss function comprises at least one subject-specific comparative measure based on two or more samples associated with a common subject.
36. The method according to claim 35 wherein the neural network is trained according to a triplet loss function based on a set of sample triplets, wherein at least one sample triplet includes an anchor data sample and a positive data sample that are associated with a common subject.
37. The method according to claim 36 wherein the at least one sample triplet further includes a negative data sample associated with the common subject.
38. The method according to claim 36 wherein each sample triplet includes an anchor data sample, a positive data sample and a negative sample that are respectively associated with a respective common subject.
39. The method according to claim 35 wherein the neural network is trained according to a contrastive loss function based on a set of sample doublets, wherein at least one sample doublet includes two samples that are associated with a common subject.
40. The method according to any one of claims 35 to 39 wherein the deep embedding is obtained from a feature vector that forms an output of a convolutional portion of a convolutional neural network.
41. A method of assessing movements of a subject, the method comprising: obtaining a set of reference embeddings associated with the subject, each reference embedding having an associated activity class, each reference embedding having been generated by processing, with a learned deep embedding, first sensor data associated with supervised movements performed by the subject in a presence of supervision; collecting second sensor data while the subject performs an unsupervised movement in an absence of supervision; employing the learned deep embedding to process the second sensor data to generate an unsupervised embedding associated with the unsupervised movement; and processing the set of reference embeddings, the activity classes associated with the set of reference embeddings, and the unsupervised embedding to determine an activity class associated with the unsupervised movement.
42. A system for assessing movements, the system comprising: one or more sensors; control and processing circuitry operably coupled to said one or more sensors, said control and processing circuitry comprising at least one processor and memory operably coupled to said at least one processor, said memory comprising instructions executable by said at least one processor for performing operations comprising: obtaining a set of reference embeddings associated with a subject, each reference embedding having an associated activity class, each reference embedding having been generated by processing, with a learned deep embedding, first sensor data associated with supervised movements performed by the subject in a presence of supervision; collecting second sensor data while the subject performs an unsupervised movement in an absence of supervision; employing the learned deep embedding to process the second sensor data to generate an unsupervised embedding associated with the unsupervised movement; and processing the set of reference embeddings, the activity classes associated with the set of reference embeddings, and the unsupervised embedding to determine an activity class associated with the unsupervised movement.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US202062960876P | 2020-01-14 | 2020-01-14 | |
US62/960,876 | 2020-01-14 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021142532A1 true WO2021142532A1 (en) | 2021-07-22 |
Family
ID=76863318
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CA2021/050017 WO2021142532A1 (en) | 2020-01-14 | 2021-01-11 | Activity recognition with deep embeddings |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2021142532A1 (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113673631A (en) * | 2021-10-22 | 2021-11-19 | 广东众聚人工智能科技有限公司 | Abnormal image detection method and device |
CN114756677A (en) * | 2022-03-21 | 2022-07-15 | 马上消费金融股份有限公司 | Sample generation method, training method of text classification model and text classification method |
CN116127298A (en) * | 2023-02-22 | 2023-05-16 | 北京邮电大学 | Small sample radio frequency fingerprint identification method based on triplet loss |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170182362A1 (en) * | 2015-12-28 | 2017-06-29 | The Mitre Corporation | Systems and methods for rehabilitative motion sensing |
US20170357896A1 (en) * | 2016-06-09 | 2017-12-14 | Sentient Technologies (Barbados) Limited | Content embedding using deep metric learning algorithms |
US20180315329A1 (en) * | 2017-04-19 | 2018-11-01 | Vidoni, Inc. | Augmented reality learning system and method using motion captured virtual hands |
US20190080225A1 (en) * | 2017-09-11 | 2019-03-14 | Tata Consultancy Services Limited | Bilstm-siamese network based classifier for identifying target class of queries and providing responses thereof |
WO2019095055A1 (en) * | 2017-11-15 | 2019-05-23 | Uti Limited Partnership | Method and system utilizing pattern recognition for detecting atypical movements during physical activity |
US20190197396A1 (en) * | 2017-12-27 | 2019-06-27 | X Development Llc | Sharing learned information among robots |
US20190362233A1 (en) * | 2017-02-09 | 2019-11-28 | Painted Dog, Inc. | Methods and apparatus for detecting, filtering, and identifying objects in streaming video |
Non-Patent Citations (1)
Title |
---|
BURNS DAVID, RAZMJOU HELEN, SHAW JAMES, RICHARDS ROBIN, MCLACHLIN STEWART, HARDISTY MICHAEL, HENRY PATRICK, WHYNE CARI: "International Registered Report Identifier (IRRID)", JMIR RESEARCH PROTOCOLS, vol. 9, no. 7, pages e17841, XP055839884, DOI: 10.2196/17841 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21740907 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21740907 Country of ref document: EP Kind code of ref document: A1 |