
Towards Recognizing Food Types for Unseen Subjects

Published: 08 January 2025

Abstract

Recognizing food types through sensor signals for unseen users remains remarkably challenging despite extensive recent studies. The efficacy of prior machine learning techniques is undermined by the large variations in data collected from multiple participants, partly because users have varied chewing habits and wear sensor devices in different manners. This work treats the problem as an instance of the domain adaptation problem, where each user represents a domain. We develop the first multi-source domain adaptation (MSDA) method for food-type recognition, which consists of three major components: stratified normalization, a multi-source domain adaptor, and adaptive ensemble learning. New techniques are developed for each component. Using a real-world dataset comprising 15 participants, we demonstrate that our method achieves \(1.33\times\) to \(2.13\times\) improvement in accuracy compared with nine state-of-the-art MSDA baselines. Additionally, we perform an in-depth ablation study to examine the behavior of each component and confirm its efficacy.

1 Introduction

The escalating global prevalence of obesity poses a significant public health risk, contributing to an alarming number of premature deaths each year in both developed and rapidly developing countries. In the United States alone, approximately 670,000 deaths annually are attributed to nutrition- and obesity-related diseases, including heart disease, cancer, and diabetes. Moreover, maintaining a well-structured dietary pattern is crucial for the health status of every individual. To address this health crisis, there has been a push for the development of innovative sensor-based technologies and machine learning models aimed at monitoring food intake and analyzing collected data. Among the various solutions, machine learning models using motion and acoustic sensors, often integrated into wearable devices like smartwatches, earphones, and glasses, have shown promise [47]. Notable advancements include the application of random forests [36] and neural networks [55] to model food typing.
However, a critical challenge emerges when models are trained on a collective dataset from various users—these models struggle to generalize effectively across different sets of users. Table 1 illustrates the diminished performance when predicting food types for unseen users, highlighting the limitation of existing machine learning models. This generalization issue becomes a significant barrier to large-scale deployment, as users expect technologies to work seamlessly across diverse individuals. The variations in sensor locations and chewing habits among different users hinder machine learning models from extracting robust user-oblivious signals, a problem commonly referred to as the domain adaptation (DA) problem [5].
Table 1. Prior Works’ Low Generalization Performance

| Work | # Users | # Food types | Model type | Acc. on seen users | Acc. on unseen users |
|---|---|---|---|---|---|
| Mirtchouk et al. [36] | 6 | 40 | Random forest | 82.7% | 29% |
| Wang et al. [55] | 15 | 11 | Two-layer Perceptron | 82.3% | 23% |

We referenced the original papers and re-implemented their methods to obtain the seen- and unseen-user accuracy scores. The original implementations include data from the same user in both the training and testing sets and achieve high accuracy. The accuracy scores degrade substantially when the setting is changed to predicting an unseen user, i.e., when all data from one user is held out exclusively as the testing set.
In this work, we specifically focus on the multi-source domain adaptation (MSDA) problem [34, 51, 61], where labeled training data belongs to multiple domains, exacerbating the challenges associated with domain divergence. Existing DA methods fall short when applied to the task of predicting food types for unseen users. To address this, we propose a comprehensive pipeline that integrates a diverse set of techniques to combat multiple interrelated subproblems (see Figure 4). Our contributions can be categorized into three main techniques:
Domain-Invariant Features: These features refer to data representations that remain consistent across different domains. While end-to-end neural networks aim to automatically extract features unaffected by distribution variations, such approaches prove ineffective in our setting. CoDATs [57], for example, works on long-range muscle motions (e.g., sitting down, walking), and it is unclear how an end-to-end neural network can efficiently extract food-type signals from fine-grained muscle movements (e.g., chewing different types of foods) across diverse domains. To mitigate domain shifts, we apply an approach that runs against conventional wisdom: leveraging hand-crafted features to trade domain expertise and manual labor for a simpler target function to fit. Additionally, we introduce stratified normalization, inspired by stratified sampling in statistics, to control feature variations across domains. These techniques, relying on different principles, work synergistically to enhance signal extraction.
Source-Source DA: This technique aims to align models trained on different source domains to control their dissimilarity, facilitating effective generalization across diverse sources. While existing works [45, 56, 57, 60] primarily concentrate on adapting multiple source domains to the target domain, they often overlook the inherent divergence among the source domains themselves. It is essential to recognize that the multiple source domains not only differ from the unseen target domain but also differ from each other to varying extents. Therefore, the complexity of domain divergence grows with the number of source domains. Consequently, mitigating source-source domain divergence becomes crucial when training a reliable classifier for the target domain; otherwise, the model struggles to learn from multiple domains with different distributions. In our work, we renovate a multi-branch neural network where each branch independently adapts one source domain to the target. This adaptation process incorporates a consensus regularizer [33] to guide all branches, encouraging them to learn common features and effectively reduce source-source domain divergence.
Adaptive Ensemble Weight: This technique addresses the challenge of static ensemble weights, which can be suboptimal, leading to accuracy degradation when incorporating data or models from irrelevant domains. Inspired by theoretical work [34], we introduce a two-stage adaptive ensemble method that dynamically adjusts ensemble weights for different users. Notably, our solution employs source-source similarities to filter out useless or harmful models prone to mispredicting food types for unseen users.
While DA challenges are prevalent in many machine learning applications, existing techniques are often tailored to specific domains and lack generalizability. This limitation becomes more pronounced in our context, where adapting to multiple domains becomes increasingly challenging as the number of users grows. Previous approaches are frequently tested on a limited number of domains (e.g., up to four domains [7, 16, 18, 26, 30, 40, 44, 54]) or synthetic data [39], rendering them less suitable for our multi-domain scenario. In contrast, our solution is a “cocktail” of techniques: a pipeline in which each stage is robust and effective across varying numbers of domains. The first stage employs hand-crafted features and stratified normalization for each domain independently, ensuring effectiveness regardless of the number of domains. The second stage employs a multi-branch structured neural network with consensus regularization (CR) to control domain similarities. In the final stage, our adaptive ensemble scheme further enhances robustness across different domain scales.
In summary, our contributions are as follows:
An MSDA pipeline with renovated algorithm components is proposed to address the generalization issue in the food type recognition task.
A new set of techniques and principles is introduced, incorporating consensus regularizer, stratified normalization, and domain knowledge-guided feature extraction to address the more severe domain divergence problem caused by the growing number of domains.
A two-stage adaptive ensemble method is designed to automatically assign weights to relevant domains and prune off irrelevant ones. This method is robust to parameter settings and further improves accuracy.
Extensive empirical evaluation is conducted. We experimentally verified the importance of hand-crafted features in the food typing task with multiple domains. Besides, the evaluation shows that our method achieved \(1.33\times\) to \(2.13\times\) higher accuracy than other baselines.
The rest of the paper is organized as follows: Section 2 discusses research efforts closely related to this work. Section 3 presents the challenges of performing DA on food typing. Section 4 describes our overall solution. Section 5 presents experiments that evaluate our methods. Finally, concluding remarks are given in Section 6.

2 Related Work

DA. The main challenge of DA is to reduce the discrepancy between different domains, which has been approached from multiple perspectives: (1) Data manipulation and feature engineering. Several existing works selected a subset of training samples or assigned weights to them based on the distance of each training sample to the test set [21, 31, 41]. Similarly, Nikolaidis et al. [37] iteratively selected subsets of training samples with high confidence scores and fine-tuned the classifier with the selected data and predicted labels. An et al. [4] used labeled target samples to fine-tune specific layers of a neural net (NN) that produced user-specific features. In contrast, our approach prunes and re-weights trained sub-models that have already learned the information in each dataset, so labeled data are not wasted. TCA [38] and CORAL [49] learned matrix mappings to align the features of different domains. Instance Normalization [15, 28] and AdaBN [29] designed domain-adapted normalization layers to transform intermediate feature maps in an NN. These methods were not straightforward to integrate into our framework and are more suitable for other tasks or training methodologies. (2) Neural network innovation. Maximum mean discrepancy (MMD) [48] measures the discrepancy between two domains and was applied to train various NNs to reduce distribution shift [17, 32, 53, 64, 65]. Inspired by MMD-based solutions, various NNs coupled with domain discrepancy measurement functions were proposed, including Deep-CORAL [50] and GAN-based solutions [13, 52, 59]. These approaches could be seamlessly extended to MSDA [51, 61] by combining multiple source domains into one; however, they were susceptible to accuracy degradation [14, 29, 57] because the learning procedure is disrupted by quadratically increasing domain divergences [42]. Our method instead learns from different domains through multi-branch model training [9, 43]. Finally, Luo et al. [33] proved that the disagreement between multiple sources is an upper bound on the classification error, so optimizing the consensus regularizer leads to better prediction performance [27, 39, 64]. (3) Ensemble learning. It was proven that the target distribution can be represented as a weighted combination of source distributions [34]. Accordingly, many existing efforts trained one model or multiple sub-models and late-fused the prediction confidence scores with uniform weights [46, 64] or fine-tuned weights based on various metrics. Peng et al. [39] assigned source-only accuracy weights to sub-models. Xu et al. [58] calculated a perplexity score during the adversarial training procedure as a weight. Guo et al. [19] designed a point-to-set metric based on Mahalanobis distance to re-weight domain experts. Zhao et al. [62] re-weighted trained distilled source classifiers using the Wasserstein distance. Our method updates the mask weights (\(\omega_{m}\)) and the similarity weights (\(\omega_{s}\)) based on the entropy of \(\ell_{1}\) distances between domain-specific models and on MMD metrics. Another category of ensemble schemes updates weights during training; in contrast, our method updates weights after training without interfering with the learning procedure.
DA and Food Type Recognition. Recognizing food types through sensor signals has achieved promising results in recent years. Oliver Amft’s team achieved \(80\sim 100\%\) accuracy in classifying four food types using earbud-embedded microphone sensors [2]. Later, they produced two prototypes that achieved \(80\%\) and \(86.6\%\) accuracy, respectively, in classifying 19 food types [1, 3]. Yin et al. [6] proposed a prototype for recognizing seven types of food using two microphones embedded in a neckband. The microphone could also be placed near the mouth to classify six types of food [20]. Besides microphones, a smart utensil containing an array of LEDs could recognize twenty food types [22]. An intraoral sensor placed in the mouth while eating classified nine food types based on temperature and jawbone movement [8]. However, the current state-of-the-art is the work combining microphones with other sensor types. Samantha’s team identified 40 different types of food with an accuracy of \(82.7\%\) [36], combining a microphone-embedded earbud, Google Glass, and two smartwatches. Although these food type recognition methods achieved acceptable accuracy, none considered the DA problem. Therefore, their recognition accuracy could decrease significantly when the application environment or scenario changes (see Table 1). Although prior works have applied DA to sensor signals in other tasks, they are not suitable for the food-type recognition task for various reasons. For example, Zheng et al. [63] generated fake labels for the target domain based on MMD [48] to recognize daily behaviors utilizing sensors scattered in an apartment. Mathur et al. [35] studied the DA problem caused by different sensor deployment locations. Jiang et al. [23] adopted an adversarial training approach to recognize human activities for a single subject using WiFi signals. These methods are not effective in recognizing food types without incorporating domain knowledge.

3 Problem Setup, Motivation, and Challenges

This section describes the problem definition, reviews standard techniques, and performs preliminary experiments to motivate our solutions.
Problem Setup. Figure 1 illustrates the chewing habits of different users for two types of foods: gum candy and nuts. Users generally chew nuts faster and with less force than gum candy due to the properties of the foods—gum is chewy, while nuts are crispy. However, the distributions among different users vary significantly, leading to potential misclassification of food types. For instance, user 6 and user 8 exhibit completely different chewing forces, with the minimum chewing force of user 6 being greater than the maximum chewing force of user 8. This indicates distinct marginal distributions. Consequently, using a model trained on data from user 8 to predict data from user 6 could result in all instances being misclassified as gum candy, which requires a stronger chewing force. Similarly, user 2 and user 12 chew gum candy and nuts at a similar frequency, leading to similar conditional distributions of chewing speed. However, this differs from other users, who chew nuts at a higher frequency. These distribution divergences contribute to poor generalizability to unseen users, as shown in Table 1. Additionally, determining which labeled users are similar to a new, unlabeled user is challenging. To address the domain divergence issues in recognizing food types for unseen users, we first formalize this challenge as an MSDA problem. We then analyze the performance of prior solutions to motivate our designs.
Fig. 1. Per-user feature distribution. The left figure illustrates the feature distribution of chewing frequency, while the right figure depicts chewing force. Two classes are visualized with different markers and colors. Generally, users chew gum more slowly and with more force than nuts.
In an MSDA scenario, there exist \(n\) source domains and a target domain \(T\), corresponding to different individuals. We observe features and labels (\(\mathbf{x}\)’s and \(y\)’s) from source domains and only features (\(\mathbf{x}\)’s) from the target. Our goal is to build a classifier for predicting labels in the target domain using the available labeled data from \(n\) users and unlabeled data from the target user. Let \(s_{i}\) be the number of observations from domain \(i\) and \(S_{i}=\{(\mathbf{x}^{j}_{i},y^{j}_{i})\}_{j\in[s_{i}]}\) be the set of observations, each of which is independent and identically distributed (i.i.d.) sampled from the distribution \(\mathcal{D}_{i}\). Similarly, we assume that the data \((\mathbf{x}_{T},y_{T})\)’s are sampled from distribution \(\mathcal{D}_{T}\) (note that the \(y_{T}\)’s are the ground truth and not observed). Let also \(X_{i}=\{\mathbf{x}^{j}_{i}\}_{j\in[s_{i}]}\) (\(i\in[n]\)) be the set of features, and \(X_{T}=\{\mathbf{x}^{j}_{T}\}_{j\in[t]}\) be the features of the target, where \(t\) is the total number of (unlabeled) observations from the target domain. Finally, let \(\mathcal{D}_{i}(X)\) and \(\mathcal{D}_{T}(X)\) be the feature distributions in domains \(i\) and \(T\), respectively.
When \(y_{i}=y_{T}\), meaning each domain has the same set of labels, the problem is defined as the closed set MSDA. If this condition does not hold, but for at least one \(y_{i}\), \(y_{i}\cap y_{T}\subset y_{T}\), the problem is defined as open set MSDA [61]. We describe and evaluate our method primarily using the closed set MSDA setting, similar to prior approaches [27, 39, 45, 56, 57, 60]. To test whether our method can adapt to the more challenging scenario where the target domain can only learn partial labels from each source domain, we also evaluate our method under the open set MSDA configuration in Section 5.4.
Prior Solutions and Performance. DA appears widely in many areas. Different downstream applications possess different distribution structures, so generic DA building blocks may not always be effective. To motivate our work, we review standard techniques and perform preliminary experiments to highlight their inefficacy. As aforementioned in Section 2, existing works develop or use techniques from one or more of the following categories: A1. Data manipulation and feature engineering. A2. Neural network innovation. A3. Ensemble learning. For example, CORAL [49] and TCA [38] use A1, DANN [14] uses A2, Schweikert et al. [46] use A1 and A3, CoDATS [57] uses A2 and A3. Note that A1 and A2 usually do not appear together because there is a strong belief that properly designed neural network models can automatically learn representations from raw data and do not need heavy feature engineering. Therefore, no work simultaneously uses A1-A3.
Figure 2 showcases the results of our preliminary experiments on these techniques. The Upper bound refers to a setting in which labels in the target domain are accessible. Colored bars depict the performance of existing MSDA algorithms. Rigorous tuning efforts were applied to these algorithms, exploring two variants: Type I, utilizing data from all sources to train a single model, and Type II, training a model for each source and generating forecasts for the target through a linear combination of source models. The upper bound achieves over 80% accuracy, while all MSDA techniques fall below 35%. This performance gap underscores the need for substantial advancements in techniques. Intriguingly, Type II variants, which employ ensemble learning, tend to outperform their Type I counterparts, an observation we will revisit and leverage in our algorithm design.
Fig. 2. Prior solutions’ accuracy. Comparing existing DA algorithms to the upper bound (where labels from the target domain are available). We consider two variants of each existing solution: Type I variants use data from all domains to train one single model, whereas Type II variants train a model from each domain and run ensemble learning algorithms over multiple model predictions.
Reasoning About the Performance. A salient challenge here is that the interactions between sources and the target, and among the sources themselves, grow quadratically with the number of users, while the source-source and source-target divergences are uniformly high. Specifically, a model tuned for a specific target requires us to control the divergence between each source and the target, and between sources; the latter is quadratic in the number of users. When divergences between sources are weak, source-source interaction can be suppressed in a model. All existing techniques in Figure 2 focus on the source-target interaction and ignore the source-source interaction, so they have reasonable performance only when the divergences between each pair of source domains are moderate and quickly deteriorate when divergences are significant.
The combination of a large number of sources and large divergences further amplifies the weaknesses of existing techniques/solutions: (i) Use A1 (or A1 & A3) without A2. TCA [38] and CORAL [49] focus on designing specialized procedures to transform features \(\mathbf{x}\)’s from different domains and pipe the transformed \(\mathbf{x}\) into a standard NN, which is usually sub-optimal because the neural network architecture/loss functions are not optimized toward the structure of the MSDA problem. (ii) Use A2 (or A2 & A3) without A1. While deep architectures (e.g., CoDATS [57]) are more flexible in learning feature representations, they cannot be fully automated to perform representation learning from the raw data effectively. We observe that deep architectures’ inability to use raw features directly is a generic problem for wearable ML problems and is not tied to a specific downstream prediction task (in our case, food type prediction). Section 4.6 elaborates further on these observations. (iii) Problems of A3. Ensemble learning assumes that each weak learner delivers sufficiently “orthogonal” and useful predictions. This assumption also breaks down here. Specifically, we notice that having more ensembles can in fact hurt performance (see Figure 3). This result shows that increasing the number of ensembles first improves the prediction capability and then degrades it [14, 29, 57]. This highlights a delicate interaction among ensembles and the challenges in weighting (and pruning) them.
Fig. 3. Ablation study of the multi-source domain adaptor: a simple experiment illustrating the breakdown of “more data is better.” We use the standard ensemble learning method to build a forecast based on a linear combination of source models’ forecasts. The \(x\)-axis is the number of source domains used, and the \(y\)-axis is the achieved accuracy. Each point represents the performance of the best-aggregated models subject to the source number constraint. One can see that the performance of the consolidated model first improves, then degrades as the number of sources increases.

4 Our Approach

This section explains our solution. Our key observation is that we need to innovate a broad set of techniques across all A1-A3 (feature engineering/normalization, model architecture, and ensemble weighting) and integrate them to collectively tackle the source-source and source-target divergence problems. Changes in one component (e.g., feature engineering) can result in complex interactions with other components, so it is important to design a “pipelined” system consisting of loosely interacting components. Each component in the pipeline addresses a specific ML subproblem and can be implemented using one or more techniques. This pipeline articulates and restricts the search space, defining the possible ways to integrate different techniques. By doing so, we can allocate most of our computational resources to explore combinations of more promising techniques and limit the resources spent on tuning less effective ones. We first provide an overview, and then describe each component in detail.

4.1 Overview

Figure 4 provides a visual representation of our pipeline. Initially, raw time-series instances undergo processing in the feature extractor, generating 65-dimensional hand-crafted features. This process yields target features \(\tilde{\mathbf{x}}_{T}^{j}=h(\mathbf{x}_{T}^{j})\) and source features \(\tilde{\mathbf{x}}_{i}^{j}=h(\mathbf{x}_{i}^{j})\), with \(i\in[n]\). Our DA algorithm is structured around three key components (C1–C3):
C1. Stratified Normalization: This component, vital for MSDA, normalizes features from diverse domains to a consistent scale. This step is particularly crucial for datasets exhibiting significant shifts in marginal distributions across domains.
C2. Multi-Source Domain Adaptor: Comprising a shared layer \(g(\cdot)\) and a set of \(n\) classifiers \(\{d_{1}(\cdot),\dots,d_{n}(\cdot)\}\), this component manages the delicate balance between model robustness and diversity. The shared layer, \(g(\cdot)\), extracts robust features across domains, minimizing divergence. Each classifier, \(d_{i}(\cdot)\), is individually optimized to learn labels from its respective source domain. Specifically, each branch \(d_{i}(\cdot)\) outputs \(p_{i}\) for the source domain, feeding it into the cross-entropy loss \(\mathcal{L}_{cls}\) to train the classifier. Simultaneously, \(q_{i}\) is produced from the target domain, and the pairs \((q_{i},p_{i})\) are utilized in the MMD loss \(\mathcal{L}_{mmd}\) to reduce source-target divergence. Independence among the branches ensures diverse predictions, crucial for effective downstream ensembling. Furthermore, the \(q_{i}\) outputs contribute to the CR \(\mathcal{L}_{con}\) to mitigate source-source divergence.
C3. Adaptive Ensemble Learner: Treating each output from \(d_{i}\) as an ensemble, this component determines suitable ensemble weights by leveraging \(\mathcal{L}_{mmd}\) and \(\mathcal{L}_{con}\). These weights are dynamically adjusted, assigning greater significance to sources more akin to the target.
Fig. 4. Overview of our solution.

4.2 Hand-Crafted Feature Extraction

We follow the approach of [55] to construct a collection of \(65\) features optimized for building food-type recognition models; see Table 2. We let \(h(\cdot)\) be the feature engineering procedure so that \(h(\mathbf{x})\in\mathbf{R}^{65}\). Recall that \(\tilde{\mathbf{x}}=h(\mathbf{x})\), and we also let \(\tilde{X}_{i}=\{h(\mathbf{x}^{j}_{i})\}_{j\in[s_{i}]}\) for \(i\in[n]\cup\{T\}\) (with \(s_{T}=t\)). Our pipeline critically relies on hand-crafted features, which are more robust for food-typing tasks; this departs from a recent “fashionable” trend that aims to use a neural network to learn features automatically [57]. See also Section 4.6.
| Target features | 1–7 | 8 | 9 | 10–23 | 24–37 | 38–51 | 52–65 | 1–65 |
|---|---|---|---|---|---|---|---|---|
| Sensors | LG, RG | LA, RA | LG, RG | LA | RA | LG | RG | All |
| LSTM Avg. \(R^{2}\) | 0.197 | 0.003 | 0.002 | 0.261 | 0.374 | 0.418 | 0.349 | 0.372 |
| CoDATs Avg. \(R^{2}\) | 0.263 | \(-\)0.038 | \(-\)0.004 | 0.6231 | 0.6138 | 0.5549 | 0.5324 | 0.482 |

Table 2. We Examine the Ability of DL Models to Learn All 65 Features
Left/right gyroscope and accelerometer sensors are abbreviated to LG, RG, LA, and RA. Features 1–7 are statistics of chewing speed and duration; 8–9 are the magnitudes of translation and rotation; 10–23 are the number of mean-crossings, the entropy/energy of the frequency spectrum, the maximum frequency component, and statistics of the spectrum components; 24–37, 38–51, and 52–65 extract the same features as 10–23 but on different sensors.
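To make the feature families in Table 2 concrete, the sketch below computes a few representative per-axis spectral statistics with NumPy (mean-crossings, spectral energy/entropy, dominant frequency, and basic spectrum statistics). This is only an illustrative subset of the 65 features, whose exact definitions follow [55]; the function name, sampling rate, and segment format are our assumptions.

```python
import numpy as np

def spectral_features(signal, fs=100.0):
    """Illustrative per-axis features in the spirit of features 10-23 in Table 2:
    mean-crossings, spectral energy/entropy, dominant frequency, spectrum statistics."""
    centered = signal - signal.mean()
    mean_crossings = int(np.sum(centered[:-1] * centered[1:] < 0))

    spectrum = np.abs(np.fft.rfft(centered)) ** 2            # power spectrum
    freqs = np.fft.rfftfreq(len(centered), d=1.0 / fs)
    energy = float(spectrum.sum())
    p = spectrum / (energy + 1e-12)                          # normalized spectrum
    entropy = float(-(p * np.log(p + 1e-12)).sum())
    max_freq = float(freqs[int(np.argmax(spectrum))])        # maximum frequency component

    return np.array([mean_crossings, energy, entropy, max_freq,
                     spectrum.mean(), spectrum.std()])

# Example: one chewing segment from a single gyroscope axis (sampling rate assumed).
segment = np.random.randn(1024)
print(spectral_features(segment))
```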

4.3 Stratified Normalization

Machine learning algorithms often assume that the data in training and test sets are from the same distribution, which is severely violated in our setting. First, each user could wear devices in slightly different ways. Second, people have different chewing habits. For example, when user \(i\) eats faster than user \(j\), \(i\)’s chewing time will be shorter, but his or her chewing force will be stronger. Therefore, \(\mathcal{D}_{i}\) and \(\mathcal{D}_{j}\) could be drastically different.
Traditional normalization re-scales the input features across sources to have uniform standard deviations and means, which is ineffective in our setting. Figure 5(a) shows data collected from two domains, \(i\) and \(j\) (users). After normalizing the training data, the data still shift between \(\mathcal{D}_{i}\) and \(\mathcal{D}_{j}\). The problem will become more pronounced when the number of sources grows.
Fig. 5. Visualization (t-SNE) of normalized data. Points in \(\bullet\) and \(\bigtriangleup\) are observations from two different users. Different colors represent different classes. Centroids of blue points from different users are highlighted (by bolder marks). (a) uses vanilla normalization, whereas (b) uses stratified normalization (S-Norm). Both \(\Pr[\tilde{X}_{i}]\) (points in the same shape) and \(\Pr[\tilde{X}_{i}\mid Y_{i}]\) (i.e., the cloud of points in the same color from the same user) get much closer for two users using S-Norm.
To address this challenge, we introduce a simple yet effective DA technique named stratified normalization (S-Norm). S-Norm draws inspiration from stratified sampling, a method developed for sampling from multiple subpopulations. It performs normalization independently for each domain’s feature set in \(\{\tilde{X}_{1},\dots,\tilde{X}_{n},\tilde{X}_{T}\}\). S-Norm serves two primary purposes: (i) aligning features from different domains to the same scale, ensuring \(\Pr(\tilde{X})\) is well-aligned (see Figure 5(b)); (ii) enhancing the alignment and learnability of the conditional distributions \(\Pr[\tilde{\mathbf{x}}_{i}\mid y_{i}]\) for \(i\in[n]\cup\{T\}\) across different domains. For instance, users often chew nuts faster than ice cream, making it simpler to extract this signal under stratified normalization.
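The following is a minimal NumPy sketch of S-Norm under our reading of the method: each domain (user) is z-scored with its own per-feature statistics rather than with statistics pooled over all domains. The function name and the epsilon term are our choices.

```python
import numpy as np

def stratified_normalize(domains):
    """Z-score each domain's feature matrix with its *own* mean/std.
    `domains` is a list of (n_samples, 65) arrays, one per user (sources and target)."""
    normalized = []
    for X in domains:
        mu = X.mean(axis=0, keepdims=True)
        sigma = X.std(axis=0, keepdims=True) + 1e-8     # guard against zero variance
        normalized.append((X - mu) / sigma)
    return normalized

# Two synthetic users whose features live on very different scales:
users = [np.random.randn(180, 65), 5.0 + 3.0 * np.random.randn(150, 65)]
aligned = stratified_normalize(users)
# Vanilla normalization would instead pool statistics over np.vstack(users),
# leaving the per-user shift visible after re-scaling (cf. Figure 5(a)).
```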

4.4 MSDA

The MSDA takes re-scaled features as input and outputs a total of \(n\) predictions (ensembles). It possesses a branching structure, consisting of a “root” component \(g(\cdot)\) and a total of \(n\) branches \(d_{i}(\cdot)\) (\(i\in[n]\)). Each \(d_{i}(\cdot)\) and \(g(\cdot)\) has a linear-ReLU-linear structure. All the training data from different domains first flow into \(g(\cdot)\) simultaneously, and afterward, they are branched out to different \(d_{i}(\cdot)\)’s. A \(d_{i}(\cdot)\) consumes labeled data from source domain \(i\) and unlabeled data from the target domain. Intuitively, \(g(\cdot)\) aims to extract features that are robust across all domains, whereas \(d_{i}(\cdot)\) aims to train an augmented model for \(\Pr[y_{i}\mid\tilde{\mathbf{x}}_{i}]\) that approximates \(\Pr[y_{T}\mid\tilde{\mathbf{x}}_{T}]\), i.e., learning the link function for the target based on the link function for source \(i\) as well as unlabeled target data.
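A minimal PyTorch sketch of this branching structure follows: a shared root \(g(\cdot)\) feeding \(n\) independent branches \(d_{i}(\cdot)\), each with the linear-ReLU-linear shape described above. The hidden width is a placeholder, and the class names are ours, not the authors'.

```python
import torch
import torch.nn as nn

class MultiSourceAdaptor(nn.Module):
    """Shared root g(.) plus one branch d_i(.) per source domain."""
    def __init__(self, n_sources, in_dim=65, hidden=128, n_classes=11):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                               nn.Linear(hidden, hidden))
        self.branches = nn.ModuleList([
            nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(),
                          nn.Linear(hidden, n_classes))
            for _ in range(n_sources)
        ])

    def forward(self, x, branch_idx):
        # All data pass through the shared root, then through one domain-specific branch.
        return self.branches[branch_idx](self.g(x))

model = MultiSourceAdaptor(n_sources=14)   # 14 sources when one of 15 users is held out
logits = model(torch.randn(32, 65), branch_idx=0)
```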
Next, we explain how we implement this idea. We view the domain adaptor as sending \(\tilde{X}_{i}\) and \(\tilde{X}_{T}\) to an embedded space. We apply two techniques to learn \(g(\cdot)\) and \(d_{i}(\cdot)\)’s.
Technique 1. Properly construct the embedded space. Intuitively, we aim to ensure that after we apply \(d_{i}(g(\cdot))\) to the \(\tilde{\mathbf{x}}_{i}\)’s and the \(\tilde{\mathbf{x}}_{T}\)’s, these two “clouds” have similar distributions in the embedded space.
Technique 2. Shrinking towards the mean. For a fixed target \(T\), we hope to make the \(d_{i}(g(\tilde{\mathbf{x}}_{T}^{j}))\)’s for different \(i\) “similar” so that the total “function complexity” across all the models we learn becomes smaller, which improves the bias-variance tradeoff.
We also note that both techniques provide methods for measuring similarities (or distances) between pairs of models trained from different source domains, as well as between a source domain and the target domain. The distance measures derived from these techniques will be used in the ensemble learning component, which operates outside the deep learning loop (see Section 4.5).
Construction of embedded space. Our goal is to bring the following two sets of points closer in the embedded space:
\begin{align*}\{d_{i}(g(\tilde{\mathbf{x}}_{i})):\tilde{\mathbf{x}}_{i}\sim\mathcal{D}_{i}(X)\}\quad\mbox{ and }\quad\{d_{i}(g(\tilde{\mathbf{x}}_{T})):\tilde{\mathbf{x}}_{T}\sim\mathcal{D}_{T}(X)\}.\end{align*}
We use the MMD to measure the statistical distance.
Definition 4.1.
Let \(P=\{p_{1},\dots,p_{s}\}\) and \(Q=\{q_{1},\dots,q_{t}\}\). The MMD between \(P\) and \(Q\) is
\begin{align*}\mathcal{L}_{mmd}(P,Q)=\left\|\frac{1}{s}\sum_{p\in P}\phi(p)- \frac{1}{t}\sum_{q\in Q}\phi(q)\right\|^{2}_{\mathcal{H}},\end{align*}
where \(\mathcal{H}\) is the reproducing kernel Hilbert space (RKHS), and \(\phi(\cdot)\) denotes a feature map that maps the inputs into \(\mathcal{H}\); the map is induced by a kernel \(k(u,v)=\langle\phi(u),\phi(v)\rangle\).
Our algorithm uses the Gaussian Radial Basis Function (RBF) kernel \(k(u,v)=\exp(-\lambda\|u-v\|^{2})\) for \(\mathcal{H}\). Also, let \(p_{i}=\{d_{i}(g(\tilde{\mathbf{x}}^{j}_{i}))\}_{j\in[s_{i}]}\) be the feature representations of domain \(i\) at branch \(i\), and \(q_{i}=\{d_{i}(g(\tilde{\mathbf{x}}^{j}_{T}))\}_{j\in[t]}\) be the feature representations of the target at branch \(i\). We let
\begin{align*}\mathcal{L}_{mmd}=\frac{1}{n}\sum_{i\in[n]}\mathcal{L}_{mmd}(p_{i},q_{i}).\end{align*}
\(\mathcal{L}_{mmd}\) aggregates the distance of each source-target domain pair. By minimizing \(\mathcal{L}_{mmd}\), each domain-specific adaptor \(d_{i}(\cdot)\) is able to map \(\tilde{\mathbf{x}}_{i}\) and \(\tilde{\mathbf{x}}_{T}\) into similar representations.
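A sketch of the (biased) MMD estimate with the Gaussian RBF kernel between a branch's source representations \(p_{i}\) and target representations \(q_{i}\), and its average over the \(n\) branches. The bandwidth \(\lambda\) and the function names are placeholders.

```python
import torch

def rbf_kernel(a, b, lam=1.0):
    # k(u, v) = exp(-lam * ||u - v||^2), evaluated pairwise between rows of a and b
    return torch.exp(-lam * torch.cdist(a, b, p=2) ** 2)

def mmd_loss(p, q, lam=1.0):
    """Biased MMD^2 estimate between source representations p and target representations q."""
    return (rbf_kernel(p, p, lam).mean()
            + rbf_kernel(q, q, lam).mean()
            - 2.0 * rbf_kernel(p, q, lam).mean())

def total_mmd(p_list, q_list, lam=1.0):
    # L_mmd = (1/n) * sum_i MMD^2(p_i, q_i) over the n source branches
    return sum(mmd_loss(p, q, lam) for p, q in zip(p_list, q_list)) / len(p_list)
```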
Shrink Towards the Mean. We impose global constraints over the \(d_{i}(\cdot)\)’s. We want the \(d_{i}(\cdot)\)’s to shrink towards the same function to achieve an improved bias-variance tradeoff. Specifically, if all models shrink completely towards a single shared function, the data would be sufficient to train that model, but the model would not be expressive enough to have reasonable forecasting power (high bias). If we do not shrink at all, there are far too many parameters to learn; this reduces bias but increases variance.
We leverage the CR [33] based on L1 distance to achieve this target.
Definition 4.2.
The \(\mathcal{L}_{con}\) measures the L1 distance between the outputs of each pair of domain-specific adaptors, which can be formulated as:
\begin{align*}\mathcal{L}_{con}=\frac{1}{n\times(n-1)}\sum_{j=1}^{n-1}\sum_{i=j +1}^{n}\sum_{\tilde{\mathbf{x}}_{T}\in\mathcal{D}_{T}(X)}|d_{i}(g(\tilde{ \mathbf{x}}_{T}))-d_{j}(g(\tilde{\mathbf{x}}_{T}))|.\end{align*}
The effectiveness of CR in alleviating over-fitting is analyzed in Section 5.3.2. In an alternative view, the CR reduces the domain divergence between each pair of source domains, and the MMD reduces the domain divergence between each source domain and the target.
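A sketch of the consensus regularizer as defined above: the L1 distance between branch outputs on the same target samples, summed over all branch pairs and normalized by \(n(n-1)\). Whether raw branch outputs or their softmax probabilities are compared is an implementation detail we leave open.

```python
import torch

def consensus_loss(target_outputs):
    """target_outputs: list of n tensors d_i(g(x_T)), each of shape (t, n_classes)."""
    n = len(target_outputs)
    total = 0.0
    for j in range(n - 1):
        for i in range(j + 1, n):
            # L1 distance between branch i and branch j on the target samples
            total = total + (target_outputs[i] - target_outputs[j]).abs().sum()
    return total / (n * (n - 1))
```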
We use \(\mathcal{L}_{mmd}\) and \(\mathcal{L}_{con}\) developed above together with the standard cross-entropy loss \(\mathcal{L}_{cls}\) to construct the final loss function. Recall that
\begin{align*}\mathcal{L}_{cls}=\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{s_{i}}J(C(d_{i}(g(\tilde{\mathbf{x}}_{i}^{j}))),y_{i}^{j}),\end{align*}
where \(C(\cdot)\) is the Softmax function, and \(J(\cdot,\cdot)\) is the cross-entropy loss function. Our final goal is to minimize the following loss function:
\begin{align*}\mathcal{L}=\lambda\mathcal{L}_{mmd}+(1-\lambda)\mathcal{L}_{con} +\mathcal{L}_{cls},\end{align*}
where \(\lambda\) is a constant that balances the two DA loss terms; we set it to \(0.5\). We remark that specific details of our cost function can be tweaked. For example, \(\mathcal{L}_{mmd}\) can be replaced by CORAL [49, 50] or a GAN-based loss, whereas an \(\ell_{2}\)-loss can replace \(\mathcal{L}_{con}\). Our experiments find that making these minor changes does not result in additional performance gain.
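Putting the pieces together, below is a hedged sketch of one training step that combines the classification, MMD, and consensus terms with \(\lambda=0.5\). It reuses `MultiSourceAdaptor`, `total_mmd`, and `consensus_loss` from the earlier sketches; batch handling and feeding softmax outputs to the CR are our simplifications, not necessarily the authors' exact choices.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, source_batches, target_x, lam=0.5):
    """source_batches: list of (x_i, y_i), one labeled batch per source domain;
    target_x: one unlabeled batch from the target domain."""
    p_list, q_list, cls_terms = [], [], []
    for i, (x_i, y_i) in enumerate(source_batches):
        p_i = model(x_i, branch_idx=i)          # branch i on its own source batch
        q_i = model(target_x, branch_idx=i)     # branch i on the target batch
        p_list.append(p_i)
        q_list.append(q_i)
        cls_terms.append(F.cross_entropy(p_i, y_i))

    loss_cls = torch.stack(cls_terms).mean()
    loss_mmd = total_mmd(p_list, q_list)
    loss_con = consensus_loss([q.softmax(dim=1) for q in q_list])
    loss = lam * loss_mmd + (1.0 - lam) * loss_con + loss_cls

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```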

4.5 Adaptive Ensemble Learning

4.5.1 Ensemble Learning.

This section explains our ensemble learning procedure. Our consolidated forecast is a linear combination of \(n\) ensembles
\begin{align}\tilde{g}(\tilde{\mathbf{x}}_{T})=\sum_{i=1}^{n}\omega_{i}d_{i}(g(\tilde{\mathbf{x}}_{T})),\tag{1}\end{align}
where the \(\omega_{i}\)’s are either uniform weights, i.e., \(\omega_{i}=\frac{1}{n}\) for \(i\in[n]\), or adaptive weights. The uniformly weighted consolidated prediction is the default option for ensemble learning. We develop an adaptive weight assignment technique to achieve better prediction consolidation. Note also that we use a standard convention to represent the output of classifiers, i.e., \(d_{i}(\mathbf{x})\) outputs a probability measure in \(\mathbf{R}^{m}\), where \(m\) is the number of categories and the \(j\)th coordinate/component in the output represents the probability that the output is in category \(j\) (estimated by \(d_{i}\)). Therefore, \(\tilde{g}(\cdot)\in\mathbf{R}^{m}\). In evaluation, we assume that \(\tilde{g}(\cdot)\) picks the category with the highest probability estimate.
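A minimal sketch of the consolidated forecast in Equation (1): a weighted sum of the per-branch class-probability outputs, defaulting to uniform weights. It assumes the `MultiSourceAdaptor` sketch above.

```python
import torch

def ensemble_predict(model, target_x, weights=None):
    """Consolidated forecast: sum_i w_i * d_i(g(x_T)), returning the argmax category."""
    n = len(model.branches)
    if weights is None:
        weights = torch.full((n,), 1.0 / n)          # uniform weights by default
    probs = torch.stack([model(target_x, branch_idx=i).softmax(dim=1) for i in range(n)])
    combined = (weights.view(n, 1, 1) * probs).sum(dim=0)
    return combined.argmax(dim=1)
```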

4.5.2 Adaptive Weight Assignment.

Source domains approximate the target domain to different degrees. Assigning a higher weight \(\omega_{i}\) to a domain-specific model \(i\) that approximates the target well benefits the consolidated forecast. Therefore, our goal is to learn the \(\omega_{i}\)’s without labels from the target domain.
Design Intuition. Our algorithm for determining the \(\omega_{i}\)’s needs to (i) utilize the observation that having more ensembles is not always better, and (ii) ensure that the \(\omega_{i}\)’s dynamically change according to the target; using static target-oblivious weights is ineffective.
When labels are available, estimating the \(\omega_{i}\)’s is a simple regression problem. Here, we build our solution by unwinding key intuitions of solving a (possibly over-parametrized) linear regression and “re-implementing” these intuitions in the no-label setting using information-theoretic tools. Let us first briefly review the linear regression problem. Recall that the ordinary least squares (OLS) coefficient estimator is \((X^{\mathrm{T}}X)^{-1}X^{\mathrm{T}}y\), which consists of a feature-feature interaction (source-source interaction in our setting) component \(X^{\mathrm{T}}X\) and a feature-response (source-target) interaction component \(X^{\mathrm{T}}y\). Commonly used regularizations often “shrink” \(X^{\mathrm{T}}X\) to control the bias-variance tradeoff. For example, ridge regression shrinks \(X^{\mathrm{T}}X\) towards the identity, whereas principal component regression (PCR) shrinks \(X^{\mathrm{T}}X\) towards low-rank matrices.
We mimic the linear regression and design two subroutines to capture source-source interactions and source-target interactions. The component using source-source interactions prunes away ineffective models, resembling pruning away inconsequential subspaces in PCR, whereas the component using source-target interactions further fine-tunes model weights.
(1) Source-source interaction: entropy-driven representative election. Here, our goal is to identify a subset of orthogonal signals, resembling variable selection in PCR. Recall that PCR selects a subset of orthogonal vectors as regressors. We aim to generalize the notion of orthogonality, but the models \(d_{i}(\cdot)\) are non-linear, so standard PCR techniques do not work. Instead, we re-utilize the CR measures introduced in Section 4.4. Recall that the CR defines the distance between each pair of \(d_{i}(g(\tilde{\mathbf{x}}_{T}))\)’s, which yields an \(n\times n\) symmetric similarity matrix \(A\) (Figure 6(a)), where
\begin{align*}A_{i,j}=\sum_{\tilde{\mathbf{x}}_{T}\in\mathcal{D}_{T}(X)}|d_{i}(g(\tilde{\mathbf{x}}_{T}))-d_{j}(g(\tilde{\mathbf{x}}_{T}))|.\end{align*}
Fig. 6. The adaptive ensemble learning procedure. (a) Distance Matrix \(A\): \(A\) is symmetric. Each element \(A_{i,j}\) represents the L1 distance between \(d_{i}(g(\tilde{\mathbf{x}}_{T}))\) and \(d_{j}(g(\tilde{\mathbf{x}}_{T}))\). Note that when \(i=j\), \(A_{i,j}=0\). (b) Entropy Vector: Each element is the entropy of the corresponding row of \(A\). The displayed values are min-max standardized. (c) Mask Vector \(\omega_{m}\): Elements equal to \(1\) mark the corresponding domain adaptors as representatives; \(0\)’s represent redundancy. (d) Similarity Vector \(\omega_{s}\): \(\omega_{s}\) is transformed from the outputs of \(\mathcal{L}_{mmd}\) to measure the similarity of each source domain to the target domain. (e) Confidence Matrix: Each row is produced by \(C(d_{i}(g(\tilde{\mathbf{x}}_{T})))\) to represent the prediction confidence over the labels.
We prefer a model \(i\) whose distances to other models are uniform (i.e., generalizing orthogonality), and we use entropy to measure the orthogonality. Specifically, let the entropy associated with model \(i\) be \(\sum_{j\in[n]}-A_{i,j}\log A_{i,j}\), which is maximized when \(A_{i,j}\) are the same for different \(j\)’s. We then choose the models with the largest entropy (either using the top-\(k\) rule or through thresholding after min-max standardization). A mask vector \(\omega_{m}\in\mathbf{R}^{n}\) is generated to indicate whether each individual model is selected. See Figure 6(b) and (c). Both top-\(k\) and thresholding rules require a pre-defined hyper-parameter that controls the aggressiveness of the pruning operation. Section 5.3.3 presents a parameter sensitivity analysis to show the robustness of our method.
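A sketch of this representative election step: build the pairwise L1 distance matrix \(A\) from the branch outputs on the target, compute a row-wise entropy, and keep the top-\(k\) branches. Normalizing each row into a probability vector before taking the entropy, and the value of \(k\), are our implementation choices.

```python
import torch

def entropy_mask(target_outputs, k=5):
    """target_outputs: list of n tensors d_i(g(x_T)) of shape (t, n_classes).
    Returns a 0/1 mask omega_m selecting the k branches with the most uniform distances."""
    n = len(target_outputs)
    A = torch.zeros(n, n)
    for i in range(n):
        for j in range(n):
            if i != j:
                A[i, j] = (target_outputs[i] - target_outputs[j]).abs().sum()
    rows = A / (A.sum(dim=1, keepdim=True) + 1e-12)          # row-normalize before entropy
    entropy = -(rows * torch.log(rows + 1e-12)).sum(dim=1)
    mask = torch.zeros(n)
    mask[entropy.topk(k).indices] = 1.0                      # keep the top-k "orthogonal" branches
    return mask, A
```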
(2) Source-target interaction: adaptive ensemble prediction. We next explain how we use the source-target interaction. The MMD measures the distance between \(d_{i}(g(\tilde{\mathbf{x}}_{i}))\) and \(d_{i}(g(\tilde{\mathbf{x}}_{T}))\) in the RKHS, i.e., the distance between the \(i\)th source domain and the target domain. We leverage the MMD distances to generate the similarity vector \(\omega_{s}\), assigning more weight to the more similar domain-specific adaptors (Figure 6(d)). Combining with \(\omega_{m}\), we obtain the weight vector \(\omega=\omega_{m}\cdot\omega_{s}\). Then the confidence matrix (Figure 6(e)) is retrieved to forecast the label for \(\tilde{\mathbf{x}}_{T}\) according to Equation (1).
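A sketch of how the two stages can be combined: the per-branch MMD distances are turned into a similarity vector \(\omega_{s}\) (smaller distance, larger weight), multiplied by the mask \(\omega_{m}\), and renormalized before being fed to the ensemble. The softmax-over-negative-distance mapping is one plausible transformation, not necessarily the exact one used.

```python
import torch

def adaptive_weights(mmd_per_branch, mask):
    """mmd_per_branch: tensor of n source-target MMD distances; mask: omega_m from entropy_mask."""
    similarity = torch.softmax(-mmd_per_branch, dim=0)   # closer source -> larger weight
    w = mask * similarity                                 # omega = omega_m * omega_s
    return w / (w.sum() + 1e-12)

# Usage with the earlier sketches:
#   mask, _ = entropy_mask(q_list, k=5)
#   dists   = torch.tensor([float(mmd_loss(p, q)) for p, q in zip(p_list, q_list)])
#   preds   = ensemble_predict(model, target_x, weights=adaptive_weights(dists, mask))
```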
We remark that (i) our procedure depends on the target’s features (both \(\omega_{m}\) and \(\omega_{s}\)) so the final weights dynamically adapt to the structure of the target features. (ii) Existing adaptive ensemble methods require an additional parameter that is updated during the training phase [27], whereas ours does not interfere with the training procedure. Instead, it is a post-training method that directly adjusts the \(\omega_{i}\) based on the loss function.

4.6 Discussions

Our algorithm incorporates a function \(h(\cdot)\) crafted by domain experts to transform features, utilizing manually built features. In contrast, existing works [57, 64] often leverage deep learning to autonomously learn \(h(\cdot)\). This design decision is not arbitrary; rather, it stems from a discerned universal phenomenon prevalent in classification problems involving the analysis of fine-grained muscle movements. Our conjecture posits that deep learning models face challenges in learning semantically meaningful intermediate features crucial for accurate responses in our domain. This conjecture is substantiated by comparing our problem with vision/NLP problems. In vision/NLP tasks, deep learning models excel at identifying interpretable local patterns in lower layers, which are then utilized for making predictions. For instance, convolutional layers in vision models extract local texture patterns, while NLP models discern words/tokens with similar meanings. However, in our setting, deep learning proves less effective in learning useful ‘local’ features from raw time-series sensor data [11].
To validate our conjecture, we design additional experiments where we task neural networks with predicting a set of seemingly “simpler” targets using raw data. If our conjecture held true, deep learning models would struggle on these tasks. We define the “simple tasks” as regressing the manually built features, including simple statistics such as the number of chews or the duration of each chew. We select two deep-learning models to predict the 65-dimensional hand-crafted features. The first model comprises two LSTM layers with a dropout rate of 0.5, followed by a fully connected layer for prediction. The second model, CoDATs, mirrors the original implementation but adjusts the last fully connected layer based on the number of features. Both models employ the MSE loss and the Adam optimizer [24] with a learning rate of 5e-4. We split the dataset into 2,100 training samples and 608 test samples. We fix each time-series sample to a length of 1,024 by padding zeros or truncating and use a batch size of 128. We use the \(R^{2}\) metric to measure the regression performance.
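For concreteness, a sketch of the first regression model in this check: two LSTM layers with dropout 0.5 followed by a fully connected head that regresses all 65 hand-crafted features from a padded/truncated sequence of length 1,024, trained with MSE and Adam at a learning rate of 5e-4. The number of input channels is a placeholder for the gyroscope/accelerometer axes.

```python
import torch
import torch.nn as nn

class FeatureRegressor(nn.Module):
    """Two LSTM layers (dropout 0.5) + a linear head regressing the 65 hand-crafted features."""
    def __init__(self, in_channels=6, hidden=128, n_features=65):
        super().__init__()
        self.lstm = nn.LSTM(in_channels, hidden, num_layers=2, dropout=0.5, batch_first=True)
        self.head = nn.Linear(hidden, n_features)

    def forward(self, x):                       # x: (batch, 1024, in_channels)
        out, _ = self.lstm(x)
        return self.head(out[:, -1])            # regress from the final hidden state

model = FeatureRegressor()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
criterion = nn.MSELoss()

x = torch.randn(128, 1024, 6)                   # one batch of padded/truncated sequences
y = torch.randn(128, 65)                        # the 65 hand-crafted feature targets
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```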
Table 2 illustrates that deep learning models struggled to accurately predict hand-crafted features, even with meticulous parameter tuning. While it might be feasible to fine-tune a neural network for predicting specific features, using a single neural network architecture to predict a substantial portion of hand-crafted features appears challenging. This underscores a fundamental difference between our problem and vision/NLP problems, where a single architecture can typically extract a diverse set of “local features” such as various textures or the meanings of many words.
Conclusion: The features \(\tilde{\mathbf{x}}_{i}\)’s and \(\tilde{\mathbf{x}}_{T}\) originate from vastly different distributions, presenting a formidable challenge even for simple responses, such as the 65 extracted features. In light of this, mastering effective transfer learning in the food typing task remains a complex endeavor. Our investigation strongly suggests that, given the current landscape, relying on manually-built features proves to be a more efficacious approach. This observation aligns with recent empirical findings [11], further emphasizing the ongoing difficulty in leveraging automated methods to bridge the gap between diverse feature distributions.

5 Evaluation

5.1 Evaluation Methodology

This section evaluates our proposed DA method for the task of food typing. Specifically, we show that our algorithm outperforms state-of-the-art baselines significantly. We also perform extensive experiments, including ablation studies to analyze the roles and efficacy of different components in our pipeline. Moreover, we perform open set MSDA evaluations to test the extensibility of our methods.
In our experimentation, we employ a rigorous evaluation technique known as Leave One User Out Cross-Validation (LOOCV). LOOCV is a specialized form of cross-validation where the model is trained on all users except one, and the excluded user serves as the target domain for validation. This process is iteratively repeated until each user has been left out exactly once. LOOCV provides a robust assessment of the model’s generalization performance, especially in scenarios where user-specific characteristics play a significant role.
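A minimal sketch of the LOOCV loop: each of the 15 users is held out once as the unlabeled target while the remaining 14 serve as labeled sources. `fit_predict` stands in for any full pipeline (e.g., the sketches in Section 4) and is a placeholder, not the authors' API.

```python
import numpy as np

def leave_one_user_out(user_data, fit_predict):
    """user_data: dict user_id -> (X, y) with NumPy arrays.
    fit_predict(sources, target_X) must return predicted labels for the held-out user."""
    accuracies = {}
    for target_id, (X_t, y_t) in user_data.items():
        sources = {uid: data for uid, data in user_data.items() if uid != target_id}
        y_pred = fit_predict(sources, X_t)       # target labels y_t are never revealed
        accuracies[target_id] = float(np.mean(y_pred == y_t))
    return accuracies
```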

5.1.1 Dataset.

We use a standard benchmark human chewing dataset introduced in [55], namely food-15. Comprising data from 15 participants, each engaging with up to 20 distinct types of food, this dataset captures chewing activities via gyroscope and accelerometer sensors. For consistency and improved interpretability, we adopt the categorization scheme proposed in [55], condensing the 20 food types into \(m=11\) categories. The data summary is detailed in Table 3. This categorization proves advantageous for two primary reasons. Firstly, from a clinical perspective, predicting food categories is often more meaningful than predicting individual types. Secondly, participants’ dietary habits vary; for example, one individual consumed almonds while another opted for peanuts, and both almonds and peanuts fall under the “Nuts” category. Predicting unseen labels (food types) for the target users is out of scope for this work.
| | Nuts | Gum | Dry Fruit | Fruits | Pretzel | Corn/Fry | Cookie | Vegetable | Bread | Meat | Cream | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| User1 | 30 | 10 | 30 | 38 | 10 | 30 | 10 | 10 | 10 | 10 | 8 | 196 |
| User2 | 30 | 10 | 29 | 30 | 10 | 21 | 10 | 10 | 10 | 10 | 1 | 171 |
| User3 | 30 | 10 | 30 | 40 | 10 | 30 | 10 | 10 | 10 | 10 | 4 | 194 |
| User4 | 30 | 10 | 30 | 39 | 10 | 28 | 10 | 10 | 10 | 9 | 8 | 194 |
| User5 | 29 | 10 | 29 | 38 | 6 | 24 | 10 | 9 | 10 | 9 | 4 | 178 |
| User6 | 30 | 10 | 30 | 40 | 10 | 27 | 9 | 9 | 8 | 10 | 4 | 187 |
| User7 | 30 | 10 | 30 | 30 | 10 | 29 | 10 | 9 | 10 | 10 | 2 | 180 |
| User8 | 30 | 10 | 30 | 40 | 10 | 27 | 10 | 10 | 10 | 10 | 8 | 195 |
| User9 | 29 | 10 | 30 | 36 | 9 | 29 | 10 | 10 | 10 | 10 | 3 | 186 |
| User10 | 29 | 10 | 30 | 36 | 10 | 30 | 10 | 0 | 10 | 10 | 6 | 181 |
| User11 | 10 | 0 | 27 | 39 | 10 | 28 | 10 | 10 | 10 | 10 | 2 | 156 |
| User12 | 28 | 8 | 27 | 38 | 9 | 21 | 10 | 10 | 10 | 1 | 1 | 163 |
| User13 | 30 | 6 | 18 | 23 | 9 | 18 | 10 | 9 | 10 | 10 | 1 | 144 |
| User14 | 29 | 10 | 30 | 37 | 10 | 28 | 10 | 10 | 9 | 10 | 3 | 186 |
| User15 | 30 | 10 | 30 | 40 | 10 | 30 | 10 | 10 | 9 | 10 | 8 | 197 |
| Total | 424 | 134 | 430 | 544 | 143 | 400 | 149 | 136 | 146 | 139 | 63 | 2,708 |

Table 3. Data Distribution of food-15
The dataset includes 15 users and spans 11 food categories, comprising a total of 2,708 samples. Individual users contributed samples ranging from 144 to 197, and each food type has 63–544 samples. Notably, User10 lacks samples for vegetables, and User11 has no samples for gum.

5.1.2 Baseline Methods.

We examine a wide range of baselines, including a domain expert model without DA [55], marked as No-Ada, three single-source DA methods: CORAL [49], TCA [38], DANN [14], and six MSDA methods: DARN [56], MDMN [27], M3SDA [39], MDAN [60], MuLANN [45], CoDATs [57]. Section 2 reviews these baselines. Table 4 also compares their key characteristics against our algorithm. CoDATs [57] distinguishes itself by utilizing raw time series datasets from sensors. In contrast, the remaining baselines were originally devised for recognition problems other than food type, such as vision and natural language processing. It is not obvious how we can effectively pipe an architecture for time series with these solutions. Thus, we feed the 65-dimensional hand-crafted features to these baselines. To maintain consistency in our experiments, we employ the same set of neural network hyperparameters (e.g., number of hidden nodes, number of layers) for both baselines and our proposed method. This practice enables us to control the impact of model complexity, mitigating concerns of overfitting or underfitting. Adapting single-source DA methods to the multi-source setting introduces additional challenges. For No-Ada, CORAL, and TCA, we merge all source data to create a large joint source dataset as the training data. In the case of DANN, we adhere to its single-domain scheme by training individual models on each source-target pair and subsequently ensembling the models with uniform weights. This approach aligns with methodologies applied in prior works [39, 56, 60].
| Work | Feature extraction | Domain-invariant feature transform | Adaptation target | Ensemble scheme |
|---|---|---|---|---|
| Single-source DA | | | | |
| 1. CORAL [49] | Domain knowledge | A1. Data | Source-Target | Source-combined |
| 2. TCA [38] | Domain knowledge | A1. Data | Source-Target | Source-combined |
| 3. DANN [14] | Domain knowledge | A2. Model | Source-Target | Uniform weight |
| Multi-source DA | | | | |
| 4. DARN [56] | Domain knowledge | A2. Model | Source-Target | Dynamic weight |
| 5. MDMN [27] | Domain knowledge | A2. Model | Source-Target, Source-Source | Dynamic weight |
| 6. M3SDA [39] | Domain knowledge | A2. Model | Source-Target, Source-Source | Model accuracy weight |
| 7. MDAN [60] | Domain knowledge | A2. Model | Source-Target | Dynamic weight |
| 8. MuLANN [45] | Domain knowledge | A2. Model | Source-Target | Source-combined |
| 9. CoDATs [57] | Data-driven | A2. Model | Source-Target | Source-combined |
| Ours | Domain knowledge | A1. Data, A2. Model | Source-Target, Source-Source | Dynamic weight |

Table 4. Comparison of Our Method (Ours) and Nine State-of-the-Art Baselines
A1. Data represents manipulating data to align to different domains without innovating model architecture/loss function, and A2. Model represents a method that has innovated model architecture/loss function, see Section 3. Domain knowledge-guided feature extraction means a baseline using hand-crafted features. Data-driven feature extraction methods use NNs to learn representations on raw time series data.

5.1.3 Lower and Upper Bounds.

To put the numbers into context, we also present lower and upper bounds of DA performance for this dataset. The lower bound is defined as the performance of a “strawman” model, in which all data from different sources are used to train one single model, and the model is used to predict data from an unseen target, a.k.a., a “no adaptation” solution. This is the default approach for ML modeling. The upper bound is constructed as training a model with the knowledge of the target’s labels (i.e., a “target only” model). The upper bound represents the behavior of typical ML models in an (excessively) ideal world, whereas the lower bound represents the performance of a model one would expect from a typical practitioner. The gap between the lower and upper bounds reflects the performance surprise and is a good indicator of the “difficulties” of our DA problem.
The lower bound trains an LSTM model on source-combined data without domain knowledge. The upper bound approximation uses the MLP model from Wang et al. [55] because it outperforms other model families.

5.2 Overall Accuracy

Figure 7 compares the average accuracy of our method and the baseline methods on each target domain of the food-15 dataset. Our method outperforms the baseline methods in each target domain, with a \(1.33\times\) to \(2.13\times\) accuracy improvement. We also note that while DA in food-type prediction problems is in general remarkably difficult, performance on certain targets (e.g., participant 3) is quite close to the upper bound (a promising sign). It remains an interesting open problem to understand when a participant is easy to predict.
Fig. 7. Evaluating accuracy of our method (Ours) and nine state-of-the-art baselines. Here, the upper bound (MLP model trained with labeled target domain data) is 82.3%, and the lower bound (LSTM model learns source-combined data without domain knowledge) is 9% on average, drawn as two horizontal dashed red lines.
The two methods that do not utilize source adaptation techniques, No-Ada and LSTM, perform 22.8% and 38.5% worse than our approach, respectively, confirming the effectiveness of DA in predicting food types for unseen users. The LSTM learns features from raw time-series data, while No-Ada relies on domain expert features, highlighting the robustness of manually extracted features in handling domain shifts. Moreover, compared with CoDATS [57] (CNN on raw time series), our method extracts features with expert knowledge and is 25.2% better on average, confirming the intuition of preferring not to use raw data (Section 4.6). TCA [38] and CORAL [49] use only a standard NN and are 19.2% and 23.8% worse than us, respectively; therefore, engineering the NN is essential. MDMN [27] and M3SDA [39], originally designed for computer vision, optimize source-source and source-target divergence simultaneously and are the best baselines, also confirming the importance of reducing source-source divergences (Section 3). They are nevertheless 11.8% and 12.7% worse than us, respectively: they do not have stratified normalization or adaptive ensemble learning, and the cost functions and architectures in our NN are also different. We are 13.9% and 23.7% better than DANN [14] and MuLANN [45], which do not directly address source-source divergence. We are also 14.7% and 22% better than DARN [56] and MDAN [60], respectively, which couple ensemble-weight learning with deep learning (i.e., ensemble weights are part of the network, which could potentially impact DL training in an adversarial manner), whereas we take a two-staged approach (i.e., learning the ensemble weights after training the multi-source domain adaptor).
To delve deeper into the performance of the proposed food typing method in the face of domain divergence challenges, we analyze the confusion matrix depicted in Figure 8. Notably, certain pairs of classes are easily confused with each other. For instance, “Nuts” and “Dry Fruit” or “Meat” and “Bread” are frequently mistaken for one another, as evidenced by the relatively high numbers in the corresponding off-diagonal elements. This suggests a potential similarity or overlap in the features that the model uses to distinguish these pairs of classes. Recognizing food types through chewing behavior across different users remains a challenging task. However, the presented work represents a significant stride toward a promising solution.
Fig. 8. Sum of the confusion matrices over the 15 users. Each row in the matrix represents the true class, and each column represents the predicted class; the prediction and ground-truth labels are annotated. The diagonal elements represent the correctly classified instances for each class, while off-diagonal elements indicate misclassifications.
Our average accuracy is 47.5%, not yet close to the level required for “productization.” Note also that there are 11 classes in total, so a “null” model achieves 9% accuracy. The problem we face appears to resemble model development for ImageNet [10], which required a multi-year effort to engineer a fully effective model (the most accurate model three years after ImageNet's inception was only slightly over 60% [25]). We remark that the strawman's (lower-bound) performance is 9%, the best baseline reaches 35.7% (after substantial hyper-parameter tuning), and ours reaches 47.5%, a \(5.28\times\) multiplicative improvement over the strawman. Our “delta” over the strawman is 1.44 times that of the best baseline.
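For concreteness, the two ratios quoted above follow directly from the accuracy numbers reported in this section:

\[
\frac{47.5\%}{9\%}\approx 5.28,\qquad \frac{47.5\%-9\%}{35.7\%-9\%}=\frac{38.5}{26.7}\approx 1.44.
\]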

5.3 Ablation Study

Three components are vital to our model: stratified normalization, the MSDA architecture (including the consensus regularizer), and adaptive ensembling. This section performs ablation studies to examine each component's effect. Note that the MMD technique handles the source-target interaction and is widely deployed in modern DA algorithms, so for conciseness we do not specifically examine this component.
Overview. Figure 9 presents a bird's-eye view of model performance as each component is progressively added. Specifically, the red line represents the average performance after adding each component, whereas dots in different colors represent the performance of individual users. After adding S-Norm, the model improves by 9.66% on average. We then add uniform ensemble learning (Ensem.), which MSDA techniques assume, and the model further improves by 2.8%. Next, we add the MSD-adaptor in two steps: first all techniques except the consensus regularizer (CR), and then the CR. The gross performance improvement is 11.55%, of which the CR on its own contributes 6.85%. Finally, we add adaptive weights (Ada.weights) to replace the uniform weights in ensemble learning. While the average improvement is moderate, the adaptive weights are more powerful for some targets (and never result in worse performance). The remainder of this section examines and interprets each individual technique in detail.
Fig. 9. Ablation study of our methods.

5.3.1 Interplay between Stratified Normalization and Ensemble Learning.

Ensemble learning is the de facto design for addressing the MSDA problem [51]. To study the interplay between S-Norm and ensemble learning, we compare the following settings: (i) uniform-weight ensemble learning with and without S-Norm, and (ii) a “source-combined” model, i.e., one model trained on the combined data from all source domains, with and without S-Norm. Figure 10(a) shows the comparison results. S-Norm substantially improves model accuracy, by \(1.3\times\), even without being integrated with ensemble learning; integrating S-Norm with ensemble learning further enhances the accuracy, by \(1.41\times\). Interestingly, applying ensemble learning alone, without S-Norm, results in accuracy slightly worse than the non-ensemble version.
Fig. 10. Normalized accuracy comparison with and without S-Norm, Ensemble, MSD-Adaptor, and CR.
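To make the role of S-Norm concrete, the snippet below is a minimal, illustrative sketch of per-domain (per-user) feature standardization. The exact S-Norm formulation (Section 4) may additionally condition on class labels; the array names features and domains are ours, not the paper's.

import numpy as np

def stratified_normalize(features, domains, eps=1e-8):
    # features: (n_samples, n_features) expert features; domains: user id per sample.
    out = np.empty_like(features, dtype=float)
    for d in np.unique(domains):
        idx = domains == d
        mu = features[idx].mean(axis=0)
        sigma = features[idx].std(axis=0)
        out[idx] = (features[idx] - mu) / (sigma + eps)  # z-score within each user
    return out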

5.3.2 Multi-Source Domain Adaptor and Consensus Regularizer.

All other techniques without the consensus regularizer. A substantial body of prior work has demonstrated the efficacy of multi-branch architectures (i.e., hard parameter-sharing layers reduce the risk of overfitting [9, 43]) and of the MMD cost function [48, 53]. Thus, it may not be surprising that these techniques continue to work in our problem. The more worthwhile point is that they are additive, i.e., they can be integrated seamlessly with the other innovations in our pipeline. To study the effectiveness of the multi-branch structure, we compare it with a sequential MLP model. As Figure 10(b) shows, the multi-branch model achieves higher accuracy than the sequential model. Most importantly, a multi-branch model facilitates the integration of other DA designs, e.g., the consensus regularizer.
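The following PyTorch sketch illustrates the multi-branch idea, a hard-parameter-shared trunk followed by one classifier branch per source, together with a simple linear-kernel MMD term. The layer sizes, class names, and kernel choice are illustrative assumptions rather than the paper's exact architecture.

import torch
import torch.nn as nn

class MultiBranchAdaptor(nn.Module):
    def __init__(self, in_dim, hid_dim, n_classes, n_sources):
        super().__init__()
        # Shared layers: hard parameter sharing across all source branches.
        self.shared = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
        # One classifier branch per source domain.
        self.branches = nn.ModuleList(
            [nn.Linear(hid_dim, n_classes) for _ in range(n_sources)]
        )

    def forward(self, x):
        z = self.shared(x)
        return z, [branch(z) for branch in self.branches]

def linear_mmd(source_z, target_z):
    # Squared distance between mean embeddings (a linear-kernel MMD estimate).
    return ((source_z.mean(dim=0) - target_z.mean(dim=0)) ** 2).sum()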
Consensus Regularization. We further perform a convergence analysis on the training and test sets to inspect the impact of the consensus regularizer. Figure 11 shows the training and test accuracy of models with and without the consensus regularizer. One can see that (i) the consensus regularizer slows down training, but training accuracy eventually coincides with that of the model without the regularizer, and (ii) test performance continues to improve over time even after training accuracy stalls. These observations resemble the behavior of AdaBoost [12], and it is an interesting open problem to understand how they are connected. In addition, the reduction in the training-test gap highlights the consensus regularizer's efficacy in alleviating overfitting.
Fig. 11. The efficacy of CR in alleviating overfitting.
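For intuition, one common form of consensus regularization penalizes disagreement among the domain-specific branches' predictions on target samples. The hedged sketch below measures each branch's KL divergence from the mean prediction; it is an illustrative stand-in, and the paper's exact CR is defined in Section 4.

import torch
import torch.nn.functional as F

def consensus_regularizer(branch_logits, eps=1e-8):
    # branch_logits: list of (batch, n_classes) tensors, one per source branch.
    probs = [F.softmax(l, dim=1) for l in branch_logits]
    mean_p = torch.stack(probs).mean(dim=0)
    # Average KL divergence of each branch's prediction from the mean prediction.
    kl_terms = [
        (p * (torch.log(p + eps) - torch.log(mean_p + eps))).sum(dim=1).mean()
        for p in probs
    ]
    return torch.stack(kl_terms).mean()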

5.3.3 Adaptive Weights Assignment in Ensemble Learning.

The ensemble learning module includes an optimization that adaptively assigns a weight to each domain-specific adaptor. Each adaptive weight consists of a 0–1 mask vector \(\omega_{m}\) that decides whether a domain-specific model is included in the ensemble, and a fine-tuning vector \(\omega_{s}\) that measures the similarity between a source domain and the target domain in the embedded space. As mentioned in Section 4.5.2, our method offers two ways to determine the value of \(\omega_{m}\): a top-\(k\) strategy (AE1, where AE is short for Adaptive Ensemble) or a threshold-based strategy (AE2).
Figure 12(a) shows that our top-\(k\) strategy (i.e., AE1) consistently outperforms the default uniform weight assignment and random selection, by an average of 0.28% and 0.6% in accuracy, respectively, for each eligible parameter setting. Figure 12(b) studies the interaction between the threshold value (in AE2) and the number of representatives (i.e., selected domain-specific models). As Figure 12 shows, a higher threshold results in more aggressive pruning, i.e., fewer representatives participate in prediction. The result also shows that a middle-range threshold achieves the best improvement (1%) over the uniform weight assignment. Users can select either AE1 or AE2 for their application (and dataset) based on a similar empirical study.
Fig. 12. Parameter sensitivity analysis of adaptive weight selection in ensemble learning. (a) AE1 selects the top-\(k\) representatives with the largest entropy values, where \(k\in\{1,\ldots,n\}\). (b) AE2 selects representatives through thresholding: the domain-specific adaptors whose entropy value exceeds the threshold are elected as representatives. The left and right axes show the interaction between the threshold and the number of representatives; we sweep the threshold from 0.05 to 0.95 in steps of 0.05 to test accuracy. Both methods are robust to hyper-parameter settings and achieve higher accuracy than the default ensemble method, which assigns uniform weights to all domain-specific adaptors.
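As a hedged illustration of the two mask strategies, the sketch below builds \(\omega_{m}\) via top-\(k\) selection (AE1) or thresholding (AE2) and combines it with a similarity vector \(\omega_{s}\) to weight the branch predictions. The entropy scores and similarity vector are assumed to be computed elsewhere (Section 4.5), and the helper names are ours.

import numpy as np

def ae1_mask(entropy_scores, k):
    # AE1: keep the k adaptors with the largest entropy values.
    mask = np.zeros_like(entropy_scores, dtype=float)
    mask[np.argsort(entropy_scores)[-k:]] = 1.0
    return mask

def ae2_mask(entropy_scores, threshold):
    # AE2: keep adaptors whose entropy exceeds the threshold.
    return (entropy_scores > threshold).astype(float)

def ensemble_predict(branch_probs, mask, similarity):
    # branch_probs: (n_sources, n_samples, n_classes); weights = mask * similarity.
    w = mask * similarity
    w = w / (w.sum() + 1e-8)
    return np.tensordot(w, branch_probs, axes=(0, 0))  # (n_samples, n_classes)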

5.4 Open Set MSDA

Prior experiments assumed an ideal condition in which every domain covers the same set of food types (except for users 10 and 11; refer to Table 3). This setting is commonly used by the research community [51, 61]. However, such a condition may not always hold in the real world. In this section, we address a more challenging scenario in which each source domain provides only a subset of the food types. This is formally defined as the open set MSDA problem [61], where \(y_{i}\cap y_{T}\subset y_{T}\). This setting is more challenging because the total amount of labeled data is reduced, and the model can learn only a few food types from any specific user and must learn the remaining food types from other users. Such a situation can occur in real-world applications when extending the model to recognize more food types, as new data may need to be provided by different users.
To validate our methods in this setting, we modify the dataset so that each user provides partial labels. For example, user 1 provides labels with odd IDs, and user 2 provides labels with even IDs, meaning the model can only learn approximately 50% of the food types from each user. Figure 13 compares the average accuracy of our model with baseline methods under this setting. Our method outperforms the baseline methods with a \(1.39\times\) to \(2.40\times\) improvement, indicating that our approach is extensible to incorporate more food types. Figure 14 presents a detailed analysis by varying the percentage of food types provided by each source domain. Our method surpasses the best baseline by a range of \(1.37\times\) to \(1.82\times\). Although our method experiences accuracy losses, these are attributable to the reduced data size under the open set configuration.
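The partial-label splits described above can be generated mechanically. The sketch below shows one plausible construction, keeping only alternating class IDs per user; the helper name and sample format are hypothetical, and the paper's exact assignment may differ.

def keep_partial_labels(samples, user_id, n_classes, fraction=0.5):
    # samples: a list of (features, label) pairs for one user (hypothetical format).
    step = max(int(round(1 / fraction)), 1)
    kept_classes = {c for c in range(n_classes) if (c + user_id) % step == 0}
    return [(x, y) for x, y in samples if y in kept_classes]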
Fig. 13. Evaluating the accuracy of our method (Ours) and baselines under the open set configuration. Each user provides approximately 50% of the labels.
Fig. 14. Breakdown study of the open set problem: different proportions of labeled data are used to compare our method with the best-performing baseline method.

6 Conclusion and Future Work

This work develops the first MSDA method for food-typing recognition, consisting of a pipeline with three main components. First, stratified normalization aligns the conditional and marginal distributions of features to adapt to different domains, improving accuracy by 9.66% compared with a no-adaptation baseline. Second, a multi-source domain adaptor is trained on the domain-aligned features to learn a generalizable classifier for recognizing food types, incorporating a consensus regularizer and the MMD; this component further increases accuracy by 11.55%. Finally, the adaptive ensemble weight selection prunes irrelevant sub-models of the multi-source domain adaptor and fine-tunes the ensembling weights, contributing an additional 0.68%-1% accuracy improvement. Our evaluation empirically validates the importance of the consensus regularizer and of domain knowledge in producing generalizable predictions from sensor signals. We compare our method with nine state-of-the-art baselines in both the closed set and open set MSDA problems, demonstrating that our method achieves \(1.33\times\) to \(2.13\times\) and \(1.39\times\) to \(2.40\times\) accuracy improvements, respectively.
Based on the current study, our future work includes: (1) improving the model to achieve higher accuracy and to recognize a greater variety of food types in both closed set and open set MSDA problems, and (2) extending our method to more challenging problems, such as zero-shot MSDA, where target data are not available during training.

References

[1]
Oliver Amft. 2010. A wearable earpad sensor for chewing monitoring. In Proceedings of the IEEE International Conference on SENSORS. IEEE, 222–227.
[2]
Oliver Amft, Mathias Stäger, Paul Lukowicz, and Gerhard Tröster. 2005. Analysis of chewing sounds for dietary monitoring. In Proceedings of the International Conference on Ubiquitous Computing. Springer, 56–72.
[3]
Oliver Amft and Gerhard Troster. 2009. On-body sensing solutions for automatic dietary monitoring. IEEE Pervasive Computing 8, 2 (2009), 62–70.
[4]
Sizhe An, Ganapati Bhat, Suat Gumussoy, and Umit Ogras. 2023. Transfer learning for human activity recognition using representational analysis of neural networks. ACM Transactions on Computing for Healthcare 4, 1 (Mar 2023), Article 5, 21 pages. DOI:
[5]
Shai Ben-David, John Blitzer, Koby Crammer, Alex Kulesza, Fernando Pereira, and Jennifer Wortman Vaughan. 2010. A theory of learning from different domains. Machine Learning 79, 1 (2010), 151–175.
[6]
Yin Bi, Mingsong Lv, Chen Song, Wenyao Xu, Nan Guan, and Wang Yi. 2015. Autodietary: A wearable acoustic sensor system for food intake recognition in daily life. IEEE Sensors Journal 16, 3 (2015), 806–816.
[7]
Minmin Chen, Zhixiang Xu, Kilian Weinberger, and Fei Sha. 2012. Marginalized denoising autoencoders for domain adaptation. arXiv:1206.4683. Retrieved from https://arxiv.org/abs/1206.4683
[8]
Keum San Chun, Sarnab Bhattacharya, Caroline Dolbear, Jordon Kashanchi, and Edison Thomaz. 2020. Intraoral temperature and inertial sensing in automated dietary assessment: a feasibility study. In Proceedings of the 2020 International Symposium on Wearable Computers, 27–31.
[9]
Michael Crawshaw. 2020. Multi-task learning with deep neural networks: A survey. arXiv:2009.09796. Retrieved from https://arxiv.org/abs/2009.09796
[10]
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR ’09).
[11]
Shereen Elsayed, Daniela Thyssens, Ahmed Rashed, Hadi Samer Jomaa, and Lars Schmidt-Thieme. 2021. Do we really need deep learning models for time series forecasting? arXiv:2101.02118. Retrieved from https://arxiv.org/abs/2101.02118
[12]
Yoav Freund and Robert E. Schapire. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences 55, 1 (1997), 119–139.
[13]
Yaroslav Ganin and Victor Lempitsky. 2015. Unsupervised domain adaptation by backpropagation. In Proceedings of the International Conference on Machine Learning. PMLR, 1180–1189.
[14]
Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. 2016. Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17, 1 (2016), 2096–2030.
[15]
Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2414–2423.
[16]
Boqing Gong, Kristen Grauman, and Fei Sha. 2013. Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In Proceedings of the International Conference on Machine Learning. PMLR, 222–230.
[17]
Arthur Gretton, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu, and Bharath K. Sriperumbudur. 2012. Optimal kernel choice for large-scale two-sample tests. In Proceedings of the 25th International Conference on Neural Information Processing Systems. Citeseer, 1205–1213.
[18]
Gregory Griffin, Alex Holub, and Pietro Perona. 2007. Caltech-256 Object Category Dataset. California Institute of Technology.
[19]
Jiang Guo, Darsh J. Shah, and Regina Barzilay. 2018. Multi-source domain adaptation with mixture of experts. arXiv:1809.02256. Retrieved from https://arxiv.org/abs/1809.02256
[20]
Simone Hantke, Felix Weninger, Richard Kurle, Fabien Ringeval, Anton Batliner, Amr El-Desoky Mousa, and Björn Schuller. 2016. I hear you eat and speak: Automatic recognition of eating condition and food type, use-cases, and impact on ASR performance. PLoS One 11, 5 (2016), e0154486.
[21]
Jiayuan Huang, Arthur Gretton, Karsten Borgwardt, Bernhard Schölkopf, and Alex Smola. 2006. Correcting sample selection bias by unlabeled data. In Proceedings of the 19th International Conference on Neural Information Processing Systems, 601–608.
[22]
Qianyi Huang, Zhice Yang, and Qian Zhang. 2018. Smart-U: Smart utensils know what you eat. In Proceedings of the IEEE Conference on Computer Communications (INFOCOM ’18). IEEE, 1439–1447.
[23]
Wenjun Jiang, Chenglin Miao, Fenglong Ma, Shuochao Yao, Yaqing Wang, Ye Yuan, Hongfei Xue, Chen Song, Xin Ma, Dimitrios Koutsonikolas, Wenyao Xu, and Lu Su. 2018. Towards environment independent device free human activity recognition. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking, 289–304.
[24]
Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv:1412.6980. Retrieved from https://arxiv.org/abs/1412.6980
[25]
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. 2017. ImageNet classification with deep convolutional neural networks. Communications of the ACM 60, 6 (May 2017), 84–90. DOI:
[26]
Da Li, Yongxin Yang, Yi-Zhe Song, and Timothy M. Hospedales. 2017. Deeper, broader and artier domain generalization. In Proceedings of the IEEE International Conference on Computer Vision, 5542–5550.
[27]
Yitong Li, Michael Murias, Samantha Major, Geraldine Dawson, and David E. Carlson. 2018. Extracting relationships by multi-domain matching. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 6799–6810.
[28]
Yanghao Li, Naiyan Wang, Jiaying Liu, and Xiaodi Hou. 2017. Demystifying neural style transfer. arXiv:1701.01036. Retrieved from https://arxiv.org/abs/1701.01036
[29]
Yanghao Li, Naiyan Wang, Jianping Shi, Xiaodi Hou, and Jiaying Liu. 2018. Adaptive batch normalization for practical domain adaptation. Pattern Recognition 80 (2018), 109–117.
[30]
Chuang Lin, Sicheng Zhao, Lei Meng, and Tat-Seng Chua. 2020. Multi-source domain adaptation for visual sentiment classification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 2661–2668.
[31]
Miaofeng Liu, Yan Song, Hongbin Zou, and Tong Zhang. 2019. Reinforced training data selection for domain adaptation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 1957–1968.
[32]
Mingsheng Long, Yue Cao, Jianmin Wang, and Michael Jordan. 2015. Learning transferable features with deep adaptation networks. In Proceedings of the International Conference on Machine Learning. PMLR, 97–105.
[33]
Ping Luo, Fuzhen Zhuang, Hui Xiong, Yuhong Xiong, and Qing He. 2008. Transfer learning from multiple source domains via consensus regularization. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, 103–112.
[34]
Yishay Mansour, Mehryar Mohri, and Afshin Rostamizadeh. 2009. Domain adaptation with multiple sources. In Proceedings of the International Conference on Neural Information Processing Systems.
[35]
Akhil Mathur, Anton Isopoussu, Nadia Berthouze, Nicholas D. Lane, and Fahim Kawsar. 2019. Unsupervised domain adaptation for robust sensory systems. In Adjunct Proceedings of the 2019 ACM International Joint Conference on Pervasive and Ubiquitous Computing and Proceedings of the 2019 ACM International Symposium on Wearable Computers, 505–509.
[36]
Mark Mirtchouk, Christopher Merck, and Samantha Kleinberg. 2016. Automated estimation of food type and amount consumed from body-worn audio and motion sensors. In Proceedings of the 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing, 451–462.
[37]
Konstantinos Nikolaidis, Stein Kristiansen, Thomas Plagemann, Vera Goebel, Knut Liestøl, Mohan Kankanhalli, Gunn-Marit Traaen, Britt Øverland, Harriet Akre, Lars Aakerøy, and Sigurd Steinshamn. 2022. My health sensor, my classifier—Adapting a trained classifier to unlabeled end-user data. ACM Transactions on Computing for Healthcare 3, 4 (Nov 2022), Article 48, 24 pages. DOI:
[38]
Sinno Jialin Pan, Ivor W. Tsang, James T. Kwok, and Qiang Yang. 2010. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks 22, 2 (2010), 199–210.
[39]
Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. 2019. Moment matching for multi-source domain adaptation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1406–1415.
[40]
Slav Petrov and Ryan McDonald. 2012. Overview of the 2012 Shared Task on Parsing the Web.
[41]
Alberto Poncelas, Gideon Maillette de Buy Wenniger, and Andy Way. 2019. Transductive data-selection algorithms for fine-tuning neural machine translation. arXiv:1908.09532. Retrieved from https://arxiv.org/abs/1908.09532
[42]
Matthew Riemer, Ignacio Cases, Robert Ajemian, Miao Liu, Irina Rish, Yuhai Tu, and Gerald Tesauro. 2018. Learning to learn without forgetting by maximizing transfer and minimizing interference. arXiv:1810.11910. Retrieved from https://arxiv.org/abs/1810.11910
[43]
Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv:1706.05098. Retrieved from https://arxiv.org/abs/1706.05098
[44]
Kate Saenko, Brian Kulis, Mario Fritz, and Trevor Darrell. 2010. Adapting visual category models to new domains. In Proceedings of the European Conference on Computer Vision. Springer, 213–226.
[45]
Alice Schoenauer-Sebag, Louise Heinrich, Marc Schoenauer, Michele Sebag, Lani F. Wu, and Steve J. Altschuler. 2019. Multi-domain adversarial learning. arXiv:1903.09239. Retrieved from https://arxiv.org/abs/1903.09239
[46]
Gabriele Schweikert, Gunnar Rätsch, Christian Widmer, and Bernhard Schölkopf. 2008. An empirical analysis of domain adaptation algorithms for genomic sequence analysis. In Proceedings of the 21st International Conference on Neural Information Processing Systems.
[47]
Nur Asmiza Selamat and Sawal Hamid Md Ali. 2020. Automatic food intake monitoring based on chewing activity: A survey. IEEE Access 8 (2020), 48846–48869.
[48]
Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. 2007. A Hilbert space embedding for distributions. In Proceedings of the International Conference on Algorithmic Learning Theory. Springer, 13–31.
[49]
Baochen Sun, Jiashi Feng, and Kate Saenko. 2016. Return of frustratingly easy domain adaptation. In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI).
[50]
Baochen Sun and Kate Saenko. 2016. Deep CORAL: Correlation alignment for deep domain adaptation. In Proceedings of the European Conference on Computer Vision (ECCV ’16) Workshops.
[51]
Shiliang Sun, Honglei Shi, and Yuanbin Wu. 2015. A survey of multi-source domain adaptation. Information Fusion 24 (2015), 84–92.
[52]
Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. 2017. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 7167–7176.
[53]
Eric Tzeng, Judy Hoffman, Ning Zhang, Kate Saenko, and Trevor Darrell. 2014. Deep domain confusion: Maximizing for domain invariance. arXiv:1412.3474. Retrieved from https://arxiv.org/abs/1412.3474
[54]
Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. 2017. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 5018–5027.
[55]
Shuangquan Wang, Gang Zhou, Jiexiong Guan, Yongsen Ma, Zhenming Liu, Bin Ren, Hongyang Zhao, Amanda Watson, and Woosub Jung. 2021. Inferring food types through sensing and characterizing mastication dynamics. Smart Health 20 (2021), 100191.
[56]
Junfeng Wen, Russell Greiner, and Dale Schuurmans. 2020. Domain aggregation networks for multi-source domain adaptation. In Proceedings of the International Conference on Machine Learning. PMLR, 10214–10224.
[57]
Garrett Wilson, Janardhan Rao Doppa, and Diane J. Cook. 2020. Multi-source deep domain adaptation with weak supervision for time-series sensor data. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 1768–1778.
[58]
Ruijia Xu, Ziliang Chen, Wangmeng Zuo, Junjie Yan, and Liang Lin. 2018. Deep cocktail network: Multi-source unsupervised domain adaptation with category shift. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3964–3973.
[59]
Chaohui Yu, Jindong Wang, Yiqiang Chen, and Meiyu Huang. 2019. Transfer learning with dynamic adversarial adaptation network. In Proceedings of the IEEE International Conference on Data Mining (ICDM), 778–786. DOI:
[60]
Han Zhao, Shanghang Zhang, Guanhang Wu, José M. F. Moura, Joao P. Costeira, and Geoffrey J. Gordon. 2018. Adversarial multiple source domain adaptation. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 8559–8570.
[61]
Sicheng Zhao, Bo Li, Pengfei Xu, and Kurt Keutzer. 2020. Multi-source domain adaptation in the deep learning era: A systematic survey. arXiv:2002.12169. Retrieved from https://arxiv.org/abs/2002.12169
[62]
Sicheng Zhao, Guangzhi Wang, Shanghang Zhang, Yang Gu, Yaxian Li, Zhichao Song, Pengfei Xu, Runbo Hu, Hua Chai, and Kurt Keutzer. 2020. Multi-source distilling domain adaptation. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 12975–12983.
[63]
Vincent Wenchen Zheng, Derek Hao Hu, and Qiang Yang. 2009. Cross-domain activity recognition. In Proceedings of the 11th International Conference on Ubiquitous Computing, 61–70.
[64]
Yongchun Zhu, Fuzhen Zhuang, and Deqing Wang. 2019. Aligning domain-specific distribution and classifier for cross-domain classification from multiple sources. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, 5989–5996.
[65]
Yongchun Zhu, Fuzhen Zhuang, Jindong Wang, Guolin Ke, Jingwu Chen, Jiang Bian, Hui Xiong, and Qing He. 2020. Deep subdomain adaptation network for image classification. IEEE Transactions on Neural Networks and Learning Systems 32, 4 (2020), 1713–1722.
