Beyond Finite Data: Towards Data-Free Out-Of-Distribution Generalization Via Extrapolation
Yijiang Li¹, Sucheng Ren¹, Weipeng Deng², Yuzhi Xu³, Ying Gao⁴, Edith Ngai², and Haohan Wang⁵
¹ Johns Hopkins University  ² The University of Hong Kong  ³ New York University  ⁴ South China University of Technology  ⁵ University of Illinois Urbana-Champaign
arXiv:2403.05523v2 [cs.CV] 11 Mar 2024
1 Introduction
Deep neural networks have achieved remarkable results in various fields and applications [13, 17, 18, 21, 32], yet their effectiveness hinges on the assumption that the training and testing environments are independent and identically distributed [7, 8]. Out-of-distribution (OOD) generalization aims to learn a model from some training distribution that generalizes well to unseen testing domains, usually under distribution or label shifts [33]. A typical scenario is domain generalization (DG),
where multiple source domains are available to aid the training of generalizable models that learn invariant features and discard spurious ones. A significant challenge, however, is that the availability of these source domains often becomes a limiting factor, hindering the success of current OOD approaches in more challenging scenarios [39, 52, 54, 58]. Collecting not just data but annotated data across diverse domains is difficult and expensive, and sometimes impossible in critical areas such as healthcare or extreme conditions (e.g., the deep sea or space). Motivated by these challenges, domain augmentation is a straightforward remedy, and multiple methods have been proposed to generate novel domains
and images through mixup [59], mixing of statistics [63], uncertainty modeling [30,64]
and convex combination [1]. However, these methods generally interpolate the existing training domains, so the generated domains still fall within the convex hull of the available ones [1]. Consequently, the constrained number of source domains hampers the expressiveness of these methods, continuing to act as a performance bottleneck.
In contrast, humans can harness the innate ability of the brain to create novel domains, as illustrated in [41, 47], where a pre-defined set of novel domains and styles is utilized. However, this also requires human labor, which fails to scale to larger sizes.
Naturally, an intriguing question arises: How can neural networks extrapolate truly
“novel” domains and achieve OOD generalization?
Large language models (LLMs) [9] have been shown to encapsulate a vast wealth of
knowledge and simulate human cognitive processes. Thus, a pertinent question emerges:
Can one harness the power of LLMs to produce novel domains and relevant knowledge,
thereby replacing the human in the above training process? Stemming from this primary
query, we investigate how we can extract knowledge of a specific task and produce novel
domains from LLMs. A subsequent research question is: How can we leverage this text-
centric knowledge from LLMs to instruct an image system that processes pixel input?
State-of-the-art text-to-image generation models such as Imagen [46], Stable Diffusion [42] and GLIDE [36] exhibit a promising capability to synthesize photo-realistic images, positioning them as the optimal conduit between the textual and visual realms. Finally, we seek to answer to what extent images synthesized from this knowledge can serve as out-of-distribution learners that generalize to unseen testing domains. Following these questions, ours is the first study to design a new paradigm that leverages the knowledge of LLMs to extrapolate novel domains for training more generalizable and sample-efficient models. With the ability to extrapolate arbitrary domains for arbitrary classes, our method has the potential to learn a generalized model for any task without any existing data.
In addition, we present data-free domain generalization. Data-free generalization
endeavors to enable a model to generalize across unseen testing domains based solely on task specifications (for example, distinguishing between dog and cat classes) without the need
for gathering or utilizing any pre-existing datasets. In the era of large foundation models, data-free domain generalization is formulated as an OOD problem with an inaccessible meta-distribution and domain distributions (detailed in Section 2.1) – essentially, devoid of any real-world data. This scenario presents a significantly more complex challenge than that encountered in multi-domain or single-domain generalization efforts.
Moreover, it holds pragmatic significance in democratizing machine learning, by urg-
ing the community to develop methodologies that are viable under stringent resource
constraints. Such an initiative paves the way for wider access to, and application of,
machine learning technologies. For instance, our proposed method provides a viable solution by leveraging an LLM as a knowledge base to extrapolate domains and scenarios where specific classes may be represented. Synthetic data is then generated via a text-to-image generation model, and this synthetic data alone is used to train a model that generalizes well to the given task and fulfills its requirements. Our
method not only addresses the challenge of data scarcity in DG problems but also un-
derscores the potential of synthetic data in overcoming traditional barriers to machine
learning implementation.
Extensive experiments on single, multi-domain and data-free evaluations demon-
strate the effectiveness of our proposed method. In both single and multi-domain config-
urations, we demonstrate that synthetic data in the extrapolated novel domains markedly
outperforms baseline results across various datasets. In the more challenging data-free setting, our proposed method exhibits near-supervised performance, even surpassing the supervised baseline by approximately 1-2% on VLCS. Data synthesized
via the knowledge from LLMs excels compared to the synthetic data directly gener-
ated from text-to-image generation models. This demonstrates the ability of LLMs to
effectively extrapolate like humans and integrate this prior knowledge into the model.
We also underscore the scalability of our approach by highlighting that as the num-
ber of domains escalates, the performance correspondingly improves. Intriguingly, this
trend diverges from the outcomes observed when augmenting with synthetic data di-
rectly produced by text-to-image models reported in [5, 22]. This further demonstrates
the pivotal role of the knowledge derived from LLMs in mitigating overfitting to syn-
thetic data.
The remainder of this paper is organized as follows: In Section 2, we will first
motivate our method from the perspective of the theoretical error bound for out-of-
distribution (OOD) generalization. Then we will detail our method design and specifi-
cations. Section 3 introduces the data-free generalization and its potential usage in the
era of large foundation models. Section 4 describes our experiment design, results and
the implications of our findings. Section 5 reviews related work. Section 6 concludes the paper and discusses potential limitations of our work.
2 Method
We motivate our method from the perspective of the theoretical error bound for out-of-
distribution (OOD) generalization. We will first provide the notation for the theoretical
framework. Then we motivate our research problem from the OOD generalization error bound, i.e., a limited number of source domains leads to a larger error bound. Then we
propose a proxy method that approximates the meta-distribution with a proxy distri-
bution. We give a new error bound on this method. Lastly, we propose one realization
of our method by using LLMs to approximate the meta-distribution and text-to-image
generation models to bridge the text-centric knowledge with the input pixel space.
\mathcal{L}^{\mu}(f) = \mathbb{E}_{P_{XY} \sim \mu}\, \mathbb{E}_{(x, y) \sim P_{XY}}\,[\, l(f(x), y) \,] \qquad (1)
Since we have no access to µ or the realizations P_{XY}^{(1)}, \dots, P_{XY}^{(i)}, \dots, P_{XY}^{(n)}, but only sampled images from these realizations, we can derive an empirical error:
\hat{\mathcal{L}}^{\mu}(f) = \frac{1}{nm} \sum_{i=1}^{n} \sum_{j=1}^{m} l(f(x_{ij}), y_{ij}) \qquad (2)
where (x_{ij}, y_{ij}) \sim P_{XY}^{(i)} denotes the jth sample drawn from the ith realization P_{XY}^{(i)}. It is easy to see
that as n → ∞ and m → ∞, L̂^µ(f) converges to L^µ(f), which gives the intuition that increasing m and n yields better-approximated solutions. This motivates us to increase n and m, i.e., the number of domains and of training images per domain, which is difficult because µ and the realizations P_{XY}^{(1)}, \dots, P_{XY}^{(n)} are inaccessible. Prior arts have proposed various methods to generate novel domains, but the majority fall into interpolation of existing domains and fail to effectively increase n. How can we approach this problem? We can approximate µ by a new distribution µ′, sufficiently close to µ, that can be sampled.
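Concretely, the empirical objective of Eq. (2) is just an average of a per-sample loss over n domains with m samples each. A minimal sketch in Python (the `model` callable and the 0-1 loss are our own illustrative assumptions, not part of the paper):

```python
import numpy as np

def empirical_risk(model, domains):
    """Empirical OOD risk of Eq. (2): average a per-sample loss over
    n domains, each providing m labelled samples."""
    per_sample_losses = []
    for X, y in domains:            # one (features, labels) pair per domain
        preds = model(X)            # model maps features -> predicted labels
        per_sample_losses.extend((preds != y).astype(float))  # 0-1 loss
    return float(np.mean(per_sample_losses))

# toy usage: a constant classifier evaluated on two tiny "domains"
model = lambda X: np.zeros(len(X), dtype=int)
domains = [(np.arange(4), np.array([0, 0, 1, 1])),
           (np.arange(4), np.array([0, 1, 1, 1]))]
risk = empirical_risk(model, domains)  # 5 of 8 labels are nonzero -> 0.625
```

Increasing the number of domains (n) or images per domain (m) simply adds more terms to this average, which is exactly the quantity the bound below controls.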
Define the discrepancy between µ and µ′ as

D(\mu, \mu') = \sup_{f \in \mathcal{F}} \left| \mathcal{L}^{\mu'}(f) - \mathcal{L}^{\mu}(f) \right|.

Assuming D(\mu, \mu') \le \epsilon, with probability at least 1 − δ,

\mathcal{L}^{\mu}(f) \le \hat{\mathcal{L}}^{\mu'}(f) + 2\mathcal{R}_{mn}(\mathcal{F}) + 2\mathcal{R}_{n}(\mathcal{F}) + 3\sqrt{\frac{\ln(2/\delta)}{2mn}} + 3\sqrt{\frac{\ln(2/\delta)}{n}} + \epsilon.
Proof in Appendix A. By replacing µ with µ′, we now have control over L̂^{µ′}(f), m and n, as we can sample as many domains and images from µ′ as desired. This is obtained at the cost of ϵ, which we assume to be small.
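For intuition, the bound follows in two moves, sketched below: first swap µ for the proxy µ′ at a cost of at most ε (by the definition of the discrepancy D), then bound the proxy risk by its empirical estimate via a standard two-level uniform-convergence argument over the n domains and the m samples per domain (the full argument is in Appendix A):

```latex
% Step 1: pass from mu to the proxy mu' using the discrepancy D.
\mathcal{L}^{\mu}(f)
  = \mathcal{L}^{\mu'}(f) + \bigl(\mathcal{L}^{\mu}(f) - \mathcal{L}^{\mu'}(f)\bigr)
  \le \mathcal{L}^{\mu'}(f) + D(\mu, \mu')
  \le \mathcal{L}^{\mu'}(f) + \epsilon .
% Step 2: with probability at least 1 - delta, uniform convergence over
% the mn samples and over the n domains gives
\mathcal{L}^{\mu'}(f)
  \le \hat{\mathcal{L}}^{\mu'}(f)
    + 2\mathcal{R}_{mn}(\mathcal{F}) + 2\mathcal{R}_{n}(\mathcal{F})
    + 3\sqrt{\tfrac{\ln(2/\delta)}{2mn}} + 3\sqrt{\tfrac{\ln(2/\delta)}{n}} .
% Combining the two steps yields the stated bound.
```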
Remark 1. We also note that as n and m increase, the upper bound on the generalization error decreases, yielding better generalization guarantees. With sufficiently large n and m, this decrease will offset the cost of ϵ, leading to a lower overall generalization error.
Fig. 1: Overall pipeline of our paradigm: extrapolation of novel domains via the knowledge of LLMs, a novel learning paradigm where knowledge from LLMs assists the training of generalizable models via text-to-image models in a completely data-free fashion. Given a task description and its classes (e.g., dog, cat, car, chair and person), the LLM first generates novel domains (Step 1, e.g., airport, steampunk, minimalism) and then diffusion prompts (Step 2, e.g., "Dog – street: a dog sitting on the street with houses behind it"); a frozen text-to-image diffusion model turns the captions into training images, on which the generalized model is trained, tested and deployed.
Fig. 2: Knowledge extraction pipeline. We first employ various SOTA prompting methods, e.g., "Chain of Thought" (CoT) [55] prompting and role prompting, to extract domains from the LLM (Step 1) and to automatically generate prompts for a text-to-image model (Step 2). In both steps the system prompt is structured into [Role], [Task description], [Steps] and [Output Format] sections: Step 1 casts the LLM as a domain-knowledge expert that proposes domains for each class (e.g., home, office, park and cafe for "chair"), while Step 2 casts it as a diffusion expert that writes the corresponding text-to-image prompts (e.g., "Dog – Airport: in the airport's pet relief area, a fluffy white poodle enjoys a game of fetch with its owner, surrounded by travelers and their luggage").
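As a rough sketch, the Step-1 system prompt can be assembled from the four template sections shown in Fig. 2; the function name, section wording and example arguments below are our own illustrative assumptions, not the paper's exact prompts:

```python
def build_system_prompt(role, task, steps, output_format):
    """Assemble a system prompt from the four template sections of Fig. 2."""
    steps_text = "\n".join(f"{i}. {s}" for i, s in enumerate(steps, 1))
    return (f"[Role]\n{role}\n\n"
            f"[Task description]\n{task}\n\n"
            f"[Steps]\n{steps_text}\n\n"
            f"[Output Format]\n{output_format}")

# hypothetical Step-1 instantiation for domain extrapolation
prompt = build_system_prompt(
    role="You are an expert on the given classes and their visual attributes.",
    task="Propose diverse, plausible domains in which the classes dog, cat, "
         "car, chair and person may appear.",
    steps=["List candidate environments and styles.",
           "Remove near-duplicates.",
           "Return the most diverse candidates."],
    output_format="One domain name per line.",
)
```

The same scaffold serves Step 2 by swapping in a diffusion-expert role and asking for one text-to-image prompt per (class, domain) pair.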
Bridging text and pixel with text-to-image generation models. After obtaining
a number of the most plausible and reasonable domains of a specific class, we trans-
form the text-centric knowledge from LLMs to pixel space by text-to-image generation
models. This process is exactly the realization of sampling X from P_X^{(i)}, where P_X^{(i)} is the ith domain generated by µ′ (i.e., the LLM). Numerous strategies exist to prompt
text-to-image generation models conditioned on class and domain information.
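One of the simplest such strategies is a template that crosses every class with every extrapolated domain, yielding one caption per (class, domain) pair to feed the diffusion model; the template string here is our own assumption, a sketch rather than the paper's exact prompt format:

```python
from itertools import product

def caption_grid(classes, domains,
                 template="a photo of a {cls}. Domain: {dom}"):
    """Cross every class with every extrapolated domain, producing one
    text-to-image prompt per (class, domain) pair."""
    return {(c, d): template.format(cls=c, dom=d)
            for c, d in product(classes, domains)}

prompts = caption_grid(["dog", "cat"], ["airport", "steampunk"])
# 2 classes x 2 domains -> 4 prompts
```

Richer strategies let the LLM itself write a full scene description per pair (as in Step 2 of Fig. 2) instead of a fixed template.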
4 Experiments
The objective of our experiments is to (i) demonstrate that knowledge from LLMs successfully extrapolates novel domains and yields performance benefits grounded in the theoretical bounds; (ii) investigate the most efficient and effective approach for extracting knowledge and sampling from text-to-image models; and (iii) analyze to what extent synthetic images generated conditioned on LLMs' knowledge can serve as good out-of-distribution learners that generalize to unseen testing domains.
method also achieved a significant improvement (1.2% on average) over the CLIP fine-
tuned baseline. This improvement is remarkable, given the already high performance of
the CLIP model.
Single Domain Generalization. Single-domain generalization evaluation leverages a single domain for training and subsequently assesses the outcomes on the remaining domains. This scenario is more challenging than the leave-one-out setting because the model is exposed to only one domain during training, which accentuates the issue of restricted availability of source domains. Since our methodology imposes no assumptions on either the source domains or the model, but instead extrapolates novel domains via LLMs to augment the training set, it is particularly well suited to this context. Empirical evidence underscores its exceptional efficacy: with merely one source domain of real images, our results closely mirror, and at times even surpass, those obtained in a multi-domain configuration, as per Table 2. Specifically, we achieve the highest accuracies of 78.0%, 87.6% and 69.4% on the three datasets, outperforming ERM with multiple source domains by margins of 0.8%, 3.2% and 4.6% respectively. Compared to baselines, our method achieves a remarkable improvement of over 10% across all datasets and baselines. This evidences that our methodology substantially mitigates the challenges associated with restricted source domains.
In the data-free setting (Table 4), we achieve the highest performance of 79.9%, 86.9% and 67.4%, with less than a 1% gap to the multi-domain counterparts, largely surpassing the single-domain
counterparts. Notably, data-free ERM + EMA attains an accuracy of 79.9% on VLCS, outperforming the multi-domain supervised counterparts by more than 1%. With knowledge injected and novel domains extrapolated, this empirical result illustrates the promise of achieving generalization in a completely data-free fashion, free of laborious data collection and annotation.
Table 4: Data-free generalization on DomainBed Benchmark.
Performance scales with more synthetic data from more extrapolated novel domains. We perform scaling by prompting LLMs to extrapolate more novel domains and generating 64 images per domain. As shown in Figure 3, performance keeps increasing with more domains (larger n in Section 2.1), which is consistent with our theoretical framework.
We also compare with the class-template and class-prompt baselines, scaling both by increasing their synthetic images to the corresponding size. However, both methods suffer from performance saturation and degradation as synthetic data increases, consistent with previous studies [5, 22]. This demonstrates that our method scales better to larger amounts of synthetic data and underscores the importance of the new knowledge injected by LLMs, which benefits generalization.
Fig. 3: Scaling the training dataset by adding more novel domains. Each novel domain consists of 64 images. To facilitate fair comparison, we scale the class-template method by the same amount of images.
Fig. 4: (a) Effectiveness of CLIP filtering (accuracy with vs. without the CLIP filter). (b) Comparison between different knowledge extraction methods on PACS, VLCS and OfficeHome.
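The CLIP filtering of Fig. 4(a) can be sketched as keeping only synthetic images whose CLIP image-text similarity to their own generating caption clears a threshold. In the sketch below the similarity scores are precomputed mock values, since the real pipeline would call a CLIP model; the threshold value is likewise our own assumption:

```python
def clip_filter(scored_images, threshold=0.25):
    """Keep synthetic images whose (precomputed) CLIP image-text
    similarity with their generating caption passes the threshold."""
    return [img for img, sim in scored_images if sim >= threshold]

# hypothetical (image_id, similarity) pairs for one generated batch
batch = [("img0", 0.31), ("img1", 0.12), ("img2", 0.27)]
kept = clip_filter(batch)
```

This discards off-prompt generations before training, which is one plausible reading of why the filtered variant scores higher in Fig. 4(a).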
Variance Analysis. We aim to measure how stable our method is by decomposing the variance into three parts, i.e., LLM extrapolation, text-to-image generation and final
model training. We repeat each experiment three times and report the average and stan-
dard deviation in Table 5. For instance, to conduct variance anlysis on text-to-image
generation, we use the same set of novel domains generated by LLMs, can generate
synthetic datasets with the same text-to-image model three times. As per the table, we
can see that all three parts contribute to a relatively small variance, suggesting that our
method is stable.
Different LLMs. To verify that our method does not rely on a specific LLM, i.e., GPT-4, we conduct experiments with LLMs from different families, e.g., Llama and Mixtral, in Table 6.
LLM           A            C            P            S            Avg
GPT-4         94.4 ± 0.2   85.0 ± 0.5   98.5 ± 0.1   83.3 ± 1.7   90.3
Llama-13B     92.6 ± 0.5   83.2 ± 0.5   98.2 ± 0.1   80.9 ± 0.7   88.7
Llama-70B     93.0 ± 0.4   83.6 ± 0.4   98.5 ± 0.2   81.9 ± 0.4   89.3
Mixtral-8x7B  92.4 ± 0.0   84.6 ± 0.3   98.8 ± 0.0   81.1 ± 0.6   89.2
Table 6: Performance with different LLMs on PACS (A, C, P and S denote the Art painting, Cartoon, Photo and Sketch domains). We experiment with GPT-4, Llama-13B, Llama-70B and Mixtral-8x7B.
Fig. 5: Examples of synthetic images (rows: dog, horse, person, house) conditioned on novel domain knowledge from the LLM. The first two columns (i.e., art painting and cartoon) are selected from the PACS dataset, while the remaining four columns are images generated based on the novel domains (e.g., cityscapes) provided by LLMs.
5 Related Work
Domain Generalization. Various approaches have been proposed to solve this problem, such as domain alignment [29, 31], meta-learning [6, 28], ensemble learning [4, 10] and augmentation-based methods [1, 30, 58, 62–64]. Augmentation-based methods are closely related to this work, both intending to generate more source domains to approximate the expected generalization error. However, these methods resort to interpolation of existing domains and fail to extrapolate "truly" novel domains. For
instance, MixStyle [63] mixes the statistics of two samples by linear interpolation. More recently, with the advent of vision-language models such as CLIP [41] and Stable Diffusion [44], researchers have proposed utilizing Stable Diffusion to identify and cure shortcuts [56] or CLIP to generate novel domain augmentations [51]. However, these approaches all require some form of human labor to pre-define a set of domains or styles, which makes them laborious and not scalable. Our work aims to solve this problem and achieve genuine domain extrapolation.
Language-scaffolded vision aims to develop better and more robust vision systems with the help of language; our method also falls within this category. Clipood [47] proposes to fine-tune a CLIP model to adapt to downstream DG tasks via a text-similarity-aware loss. [34] utilize an RNN as an explanation network enforcing the model to self-explain, thereby increasing robustness. [60] utilize language models to produce a comprehensive set of bottleneck features and leverage CLIP to classify; with the help of LLMs, [60] has pushed the performance of the bottleneck network to SOTA. Despite these many works, this research is, to the best of our knowledge, the first endeavor to investigate the potential of a large language model (LLM) in facilitating the training of a robust and generalizable vision model.
Large Language Models. Recent advances in NLP, as evidenced by [9, 37], highlight the impressive capabilities of large language models like ChatGPT, GPT-4 [9] and Llama 2 [49]. These models glean diverse knowledge from vast training data sourced
from the Internet, positioning LLMs as next-generation knowledge bases for various
tasks. Motivated by studies showcasing the vast knowledge [2, 38] and the exceptional
reasoning ability [23, 40, 55] within LLMs, we aim to harness this knowledge for the
training of robust vision models.
6 Conclusion
The limited availability of domains has been a prevailing problem in Domain General-
ization. In this work, we propose the first data-free learning paradigm that leverages the
knowledge and reasoning of LLMs to extrapolate novel domains. By bridging the text-centric knowledge and the pixel input space through sampling from text-to-image generation models, we are able to train generalizable models from task information only. The synthetic images can be used to augment an existing dataset or to train a model in a data-free fashion. Extensive experiments demonstrate that our method improves over baselines and the state-of-the-art by a significant margin. We also demonstrate a promising learning paradigm in which LLMs' knowledge combined with text-to-image generation models is sufficient to train a generalizable model for any task. However, it is important to acknowledge the inherent biases present in foundation models such as LLMs and text-to-image generators, which our vision models may inherit. This includes a bias towards common object representations, exacerbating the long-tail distribution challenge by privileging common entities in generation processes.
Furthermore, our method faces limitations in domain specificity: current text-to-image models excel at generating natural images but underperform in specialized fields such as medical imaging, highlighting a critical area for future improvement and adaptation.
References
1. Albuquerque, I., Monteiro, J., Falk, T.H., Mitliagkas, I.: Adversarial target-invariant repre-
sentation learning for domain generalization. arXiv preprint arXiv:1911.00804 8 (2019) 2,
14
2. Alivanistos, D., Santamaría, S.B., Cochez, M., Kalo, J.C., van Krieken, E., Thanapalasingam,
T.: Prompting as probing: Using language models for knowledge base construction. arXiv
preprint arXiv:2208.11057 (2022) 14
3. Arjovsky, M., Bottou, L., Gulrajani, I., Lopez-Paz, D.: Invariant risk minimization. arXiv
preprint arXiv:1907.02893 (2019) 9
4. Arpit, D., Wang, H., Zhou, Y., Xiong, C.: Ensemble of averages: Improving model selec-
tion and boosting performance in domain generalization. Advances in Neural Information
Processing Systems 35, 8265–8277 (2022) 8, 13
5. Azizi, S., Kornblith, S., Saharia, C., Norouzi, M., Fleet, D.J.: Synthetic data from diffusion
models improves imagenet classification. arXiv preprint arXiv:2304.08466 (2023) 3, 11, 12
6. Balaji, Y., Sankaranarayanan, S., Chellappa, R.: Metareg: Towards domain generalization
using meta-regularization. Advances in neural information processing systems 31 (2018) 13
7. Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., Vaughan, J.W.: A theory of
learning from different domains. Machine learning 79, 151–175 (2010) 2
8. Blanchard, G., Lee, G., Scott, C.: Generalizing from several related classification tasks to a
new unlabeled sample. Advances in neural information processing systems 24 (2011) 2
9. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A.,
Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in
neural information processing systems 33, 1877–1901 (2020) 2, 14
10. Cha, J., Cho, H., Lee, K., Park, S., Lee, Y., Park, S.: Domain generalization needs stochastic
weight averaging for robustness on domain shifts. arXiv preprint arXiv:2102.08604 3, 3
(2021) 13
11. Cha, J., Chun, S., Lee, K., Cho, H.C., Park, S., Lee, Y., Park, S.: Swad: Domain gener-
alization by seeking flat minima. Advances in Neural Information Processing Systems 34,
22405–22418 (2021) 9
12. Cha, J., Lee, K., Park, S., Chun, S.: Domain generalization by mutual-information regular-
ization with pre-trained models. In: European Conference on Computer Vision. pp. 440–457.
Springer (2022) 9
13. Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou, Y.: Tran-
sunet: Transformers make strong encoders for medical image segmentation. arXiv preprint
arXiv: 2102.04306 (2021) 2
14. Choi, S., Das, D., Choi, S., Yang, S., Park, H., Yun, S.: Progressive random convolutions for
single domain generalization. In: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition. pp. 10312–10322 (2023) 10
15. Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: Autoaugment: Learning augmen-
tation policies from data. arXiv preprint arXiv:1805.09501 (2018) 10
16. Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V.: Randaugment: Practical automated data aug-
mentation with a reduced search space. In: Proceedings of the IEEE/CVF conference on
computer vision and pattern recognition workshops. pp. 702–703 (2020) 10
16 Y Li, S Ren W Deng et al.
17. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional
transformers for language understanding. arXiv preprint arXiv: 1810.04805 (2018) 2
18. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., De-
hghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is
worth 16x16 words: Transformers for image recognition at scale. ICLR (2021) 2
19. Fan, X., Wang, Q., Ke, J., Yang, F., Gong, B., Zhou, M.: Adversarially adaptive normalization
for single domain generalization. In: Proceedings of the IEEE/CVF conference on Computer
Vision and Pattern Recognition. pp. 8208–8217 (2021) 10
20. Gulrajani, I., Lopez-Paz, D.: In search of lost domain generalization. arXiv preprint
arXiv:2007.01434 (2020) 8
21. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. arXiv
preprint arXiv: 1512.03385 (2015) 2
22. He, R., Sun, S., Yu, X., Xue, C., Zhang, W., Torr, P., Bai, S., Qi, X.: Is synthetic data from
generative models ready for image recognition? arXiv preprint arXiv:2210.07574 (2022) 3,
11, 12
23. Huang, J., Chang, K.C.C.: Towards reasoning in large language models: A survey. In: Find-
ings of the Association for Computational Linguistics: ACL 2023. pp. 1049–1065. Associa-
tion for Computational Linguistics, Toronto, Canada (Jul 2023). https://doi.org/10.
18653/v1/2023.findings-acl.67, https://aclanthology.org/2023.
findings-acl.67 14
24. Huang, Z., Wang, H., Huang, D., Lee, Y.J., Xing, E.P.: The two dimensions of worst-case
training and their integrated effect for out-of-domain generalization. In: Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9631–9641 (2022)
8
25. Huang, Z., Wang, H., Xing, E.P., Huang, D.: Self-challenging improves cross-domain gen-
eralization. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK,
August 23–28, 2020, Proceedings, Part II 16. pp. 124–140. Springer (2020) 8, 9, 10
26. Krueger, D., Caballero, E., Jacobsen, J.H., Zhang, A., Binas, J., Zhang, D., Le Priol, R.,
Courville, A.: Out-of-distribution generalization via risk extrapolation (rex). In: International
Conference on Machine Learning. pp. 5815–5826. PMLR (2021) 9
27. Li, D., Gouk, H., Hospedales, T.: Finding lost dg: Explaining domain generalization via
model complexity. arXiv preprint arXiv:2202.00563 (2022) 21
28. Li, D., Yang, Y., Song, Y.Z., Hospedales, T.: Learning to generalize: Meta-learning for
domain generalization. In: Proceedings of the AAAI conference on artificial intelligence.
vol. 32 (2018) 9, 13
29. Li, H., Pan, S.J., Wang, S., Kot, A.C.: Domain generalization with adversarial feature learn-
ing. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp.
5400–5409 (2018) 9, 13
30. Li, X., Dai, Y., Ge, Y., Liu, J., Shan, Y., Duan, L.Y.: Uncertainty modeling for out-of-
distribution generalization. arXiv preprint arXiv:2202.03958 (2022) 2, 10, 14
31. Li, Y., Tian, X., Gong, M., Liu, Y., Liu, T., Zhang, K., Tao, D.: Deep domain generalization
via conditional invariant adversarial networks. In: Proceedings of the European conference
on computer vision (ECCV). pp. 624–639 (2018) 13
32. Li, Y., Cai, W., Gao, Y., Li, C., Hu, X.: More than encoder: Introducing transformer decoder
to upsample. arXiv preprint arXiv: 2106.10637 (2021) 2
33. Liu, J., Shen, Z., He, Y., Zhang, X., Xu, R., Yu, H., Cui, P.: Towards out-of-distribution
generalization: A survey. arXiv preprint arXiv:2108.13624 (2021) 2
34. Min, S., Park, N., Kim, S., Park, S., Kim, J.: Grounding visual representations with texts for
domain generalization. In: European Conference on Computer Vision. pp. 37–53. Springer
(2022) 14
Data-free Generalization 17
35. Mohri, M., Rostamizadeh, A., Talwalkar, A.: Foundations of machine learning. MIT press
(2018) 19
36. Nichol, A., Dhariwal, P., Ramesh, A., Shyam, P., Mishkin, P., McGrew, B., Sutskever, I.,
Chen, M.: Glide: Towards photorealistic image generation and editing with text-guided dif-
fusion models. arXiv preprint arXiv:2112.10741 (2021) 2
37. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal,
S., Slama, K., Ray, A., et al.: Training language models to follow instructions with human
feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022) 14
38. Petroni, F., Rocktäschel, T., Riedel, S., Lewis, P., Bakhtin, A., Wu, Y., Miller, A.: Language
models as knowledge bases? In: Proceedings of the 2019 Conference on Empirical Methods
in Natural Language Processing and the 9th International Joint Conference on Natural Lan-
guage Processing (EMNLP-IJCNLP). pp. 2463–2473. Association for Computational Lin-
guistics, Hong Kong, China (Nov 2019). https://doi.org/10.18653/v1/D19-
1250, https://aclanthology.org/D19-1250 5, 14
39. Qiao, F., Zhao, L., Peng, X.: Learning to learn single domain generalization. In: Proceed-
ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12556–
12565 (2020) 2
40. Qiao, S., Ou, Y., Zhang, N., Chen, X., Yao, Y., Deng, S., Tan, C., Huang, F., Chen, H.: Rea-
soning with language model prompting: A survey. In: Proceedings of the 61st Annual Meet-
ing of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 5368–
5393. Association for Computational Linguistics, Toronto, Canada (Jul 2023). https:
//doi.org/10.18653/v1/2023.acl-long.294, https://aclanthology.
org/2023.acl-long.294 5, 14
41. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell,
A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language
supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
2, 5, 14
42. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image syn-
thesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on com-
puter vision and pattern recognition. pp. 10684–10695 (2022) 2
43. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image syn-
thesis with latent diffusion models. In: Proceedings of the IEEE/CVF Conference on Com-
puter Vision and Pattern Recognition (CVPR). pp. 10684–10695 (June 2022) 5
44. Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image syn-
thesis with latent diffusion models (2021) 8, 14
45. Sagawa, S., Koh, P.W., Hashimoto, T.B., Liang, P.: Distributionally robust neural networks
for group shifts: On the importance of regularization for worst-case generalization. arXiv
preprint arXiv:1911.08731 (2019) 9
46. Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E.L., Ghasemipour, K., Gon-
tijo Lopes, R., Karagol Ayan, B., Salimans, T., et al.: Photorealistic text-to-image diffusion
models with deep language understanding. Advances in Neural Information Processing Sys-
tems 35, 36479–36494 (2022) 2
47. Shu, Y., Guo, X., Wu, J., Wang, X., Wang, J., Long, M.: Clipood: Generalizing clip to out-
of-distributions. arXiv preprint arXiv:2302.00864 (2023) 2, 5, 14
48. Sun, B., Saenko, K.: Deep coral: Correlation alignment for deep domain adaptation. In: Com-
puter Vision–ECCV 2016 Workshops: Amsterdam, The Netherlands, October 8-10 and 15-
16, 2016, Proceedings, Part III 14. pp. 443–450. Springer (2016) 9
49. Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N.,
Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat
models. arXiv preprint arXiv:2307.09288 (2023) 14
18 Y Li, S Ren W Deng et al.
50. Vapnik, V.: Statistical Learning Theory. John Wiley & Sons, New York (1998) 9, 10
51. Vidit, V., Engilberge, M., Salzmann, M.: Clip the gap: A single domain generalization ap-
proach for object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vi-
sion and Pattern Recognition. pp. 3219–3229 (2023) 14
52. Wang, J., Lan, C., Liu, C., Ouyang, Y., Qin, T., Lu, W., Chen, Y., Zeng, W., Yu, P.: General-
izing to unseen domains: A survey on domain generalization. IEEE Transactions on Knowl-
edge and Data Engineering (2022) 2
53. Wang, Y., Li, H., Kot, A.C.: Heterogeneous domain generalization via domain mixup. In:
ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Pro-
cessing (ICASSP). pp. 3622–3626. IEEE (2020) 9
54. Wang, Z., Luo, Y., Qiu, R., Huang, Z., Baktashmotlagh, M.: Learning to diversify for sin-
gle domain generalization. In: Proceedings of the IEEE/CVF International Conference on
Computer Vision. pp. 834–843 (2021) 2
55. Wei, J., Wang, X., Schuurmans, D., Bosma, M., ichter, b., Xia, F., Chi, E., Le, Q.V., Zhou, D.: Chain-of-thought prompting elicits reasoning in large language models. In: Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., Oh, A. (eds.) Advances in Neural Information Processing Systems. vol. 35, pp. 24824–24837. Curran Associates, Inc. (2022), https://proceedings.neurips.cc/paper_files/paper/2022/file/9d5609613524ecf4f15af0f7b31abca4-Paper-Conference.pdf 6, 14
56. Wu, S., Yuksekgonul, M., Zhang, L., Zou, J.: Discover and cure: Concept-aware mitigation
of spurious correlation. arXiv preprint arXiv:2305.00650 (2023) 14
57. Xu, M., Zhang, J., Ni, B., Li, T., Wang, C., Tian, Q., Zhang, W.: Adversarial domain adapta-
tion with domain mixup. In: Proceedings of the AAAI conference on artificial intelligence.
vol. 34, pp. 6502–6509 (2020) 9
58. Xu, Z., Liu, D., Yang, J., Raffel, C., Niethammer, M.: Robust and generalizable visual rep-
resentation learning via random convolutions. arXiv preprint arXiv:2007.13003 (2020) 2,
14
59. Yan, S., Song, H., Li, N., Zou, L., Ren, L.: Improve unsupervised domain adaptation with
mixup training. arXiv preprint arXiv:2001.00677 (2020) 2, 9
60. Yang, Y., Panagopoulou, A., Zhou, S., Jin, D., Callison-Burch, C., Yatskar, M.: Language in
a bottle: Language model guided concept bottlenecks for interpretable image classification.
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition.
pp. 19187–19197 (2023) 14
61. Zhao, D., Qi, L., Shi, X., Shi, Y., Geng, X.: A novel cross-perturbation for single domain
generalization. arXiv preprint arXiv:2308.00918 (2023) 10
62. Zhou, K., Yang, Y., Hospedales, T., Xiang, T.: Learning to generate novel domains for do-
main generalization. In: Computer Vision–ECCV 2020: 16th European Conference, Glas-
gow, UK, August 23–28, 2020, Proceedings, Part XVI 16. pp. 561–578. Springer (2020)
14
63. Zhou, K., Yang, Y., Qiao, Y., Xiang, T.: Domain generalization with mixstyle. arXiv preprint
arXiv:2104.02008 (2021) 2, 9, 10, 14
64. Zhou, T., Konukoglu, E.: Fedfa: Federated feature augmentation. arXiv preprint
arXiv:2301.12995 (2023) 2, 14
Data-free Generalization 19
A Proof of Theorem 1
Notation. Let $\mathcal{X}$ denote the observation space and $\mathcal{Y} = \{1, -1\}$ the output space. Denote by $P_{XY}$ a joint distribution over $\mathcal{X} \times \mathcal{Y}$, and assume a meta distribution $\mu$ from which the $n$ domains $P^{(1)}_{XY}, \cdots, P^{(i)}_{XY}, \cdots, P^{(n)}_{XY}$ are i.i.d. realizations. A decision function $f \in \mathcal{F}: \mathcal{X} \to \mathcal{Y}$ predicts $\hat{y}_i = f(x_i)$. We denote by $l: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$ a loss function and define the generalization error of a decision function as
\small \mathcal {L}^\mu (f) = \mathbb {E}_{P_{XY} \sim \mu }\mathbb {E}_{(x, y)\sim P_{XY}}[\mathit {l}(f(x), y)] \label {eq:objective} (3)
\small \hat {\mathcal {L}}^\mu (f) = \frac {1}{nm} \sum _{j=1}^n \sum _{i=1}^m \mathit {l}(f(x_{ij}), y_{ij}) (4)
\frac {1}{n} \sum _{j=1}^n \mathcal {L}_{P^{(j)}}(f) \le \frac {1}{n} \sum _{j=1}^n \mathcal {\hat L}_{P^{(j)}}(f) + 2\mathcal {R}_{mn}(\mathcal {F}) + 3\sqrt {\frac {\ln (2/\delta )}{2mn}}
(5)
where $\hat{\mathcal{L}}_{P^{(j)}}(f)$ is the empirical loss on the set $S_{P^{(j)}}$ drawn i.i.d. from $P^{(j)}$.
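For completeness, $\mathcal{R}_{mn}(\mathcal{F})$ above denotes the empirical Rademacher complexity over the $mn$ pooled samples; assuming the standard definition (not restated in this excerpt), it can be written as

\mathcal {R}_{mn}(\mathcal {F}) = \mathbb {E}_\sigma \left [ \sup _{f \in \mathcal {F}} \frac {1}{n} \sum _{j=1}^n \frac {1}{m} \sum _{i=1}^m \sigma _{ij} \, l(f(x_{ij}), y_{ij}) \right ]

where the $\sigma_{ij}$ are i.i.d. Rademacher variables, uniform on $\{-1, +1\}$.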
\Phi (S) = \sup _{f \in \mathcal {F}} \frac {1}{n} \sum _{j=1}^n (\mathcal {L}_{P^{(j)}}(f) - \mathcal {\hat L}_{P^{(j)}}(f)) (6)
\Phi (S) \le \mathbb {E}_{S_{P^{(1:n)}} \sim P^{(1:n)}} \left [ \Phi (S) \right ] + \sqrt {\frac {\ln (2/\delta )}{2mn}} (7)
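The concentration step above presumably follows from McDiarmid's bounded-differences inequality: assuming the loss $l$ is bounded in $[0, 1]$ (implicit here), replacing any single one of the $mn$ samples changes $\Phi(S)$ by at most $\frac{1}{mn}$, so with probability at least $1 - \delta/2$,

\mathbb {P}\left ( \Phi (S) - \mathbb {E}_{S_{P^{(1:n)}} \sim P^{(1:n)}} \left [ \Phi (S) \right ] \ge t \right ) \le \exp \left ( -2 m n t^2 \right ), \quad t = \sqrt {\frac {\ln (2/\delta )}{2mn}}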
\small & \mathbb {E}_{S_{P^{(1:n)}} \sim P^{(1:n)}} \left [ \Phi (S) \right ] \\ &= \mathbb {E}_{S_{P^{(1:n)}} \sim P^{(1:n)}} \left [ \sup _{f \in \mathcal {F}} \frac {1}{n} \sum _{j=1}^n (\mathcal {L}_{P^{(j)}}(f) - \mathcal {\hat L}_{P^{(j)}}(f)) \right ] \\ &= \mathbb {E}_{S_{P^{(1:n)}} \sim P^{(1:n)}} \left [ \sup _{f \in \mathcal {F}} \frac {1}{n} \sum _{j=1}^n \left ( \mathbb {E}_{S_{P^{(j)}}' \sim P^{(j)}} \left [ \frac {1}{m}\sum _{i=1}^m l(f(x'_{ij}), y'_{ij}) \right ] - \frac {1}{m}\sum _{i=1}^m l(f(x_{ij}), y_{ij}) \right ) \right ] \\ & \le \mathbb {E}_{S_{P^{(1:n)}} \sim P^{(1:n)}} \mathbb {E}_{S'_{P^{(1:n)}} \sim P^{(1:n)}} \left [ \sup _{f \in \mathcal {F}} \frac {1}{n} \sum _{j=1}^n \frac {1}{m}\sum _{i=1}^m \left ( l(f(x'_{ij}), y'_{ij}) - l(f(x_{ij}), y_{ij}) \right ) \right ]\\ &= \mathbb {E}_{S_{P^{(1:n)}} \sim P^{(1:n)}} \mathbb {E}_{S_{P^{(1:n)}}' \sim P^{(1:n)}} \mathbb {E}_\sigma \left [ \sup _{f \in \mathcal {F}} \frac {1}{n} \sum _{j=1}^n \frac {1}{m}\sum _{i=1}^m \sigma _{ij} \left ( l(f(x'_{ij}), y'_{ij}) - l(f(x_{ij}), y_{ij}) \right ) \right ]\\ &\le \mathbb {E}_{S_{P^{(1:n)}}' \sim P^{(1:n)}} \mathbb {E}_\sigma \left [ \sup _{f \in \mathcal {F}} \frac {1}{n} \sum _{j=1}^n \frac {1}{m}\sum _{i=1}^m \sigma _{ij} l(f(x'_{ij}), y'_{ij}) \right ] \\ &\quad \quad \quad \quad + \mathbb {E}_{S_{P^{(1:n)}} \sim P^{(1:n)}} \mathbb {E}_\sigma \left [ \sup _{f \in \mathcal {F}} \frac {1}{n} \sum _{j=1}^n \frac {1}{m}\sum _{i=1}^m -\sigma _{ij} l(f(x_{ij}), y_{ij}) \right ] \\ &= 2 \mathbb {E}_{S_{P^{(1:n)}} \sim P^{(1:n)}} \mathbb {E}_\sigma \left [ \sup _{f \in \mathcal {F}} \frac {1}{n} \sum _{j=1}^n \frac {1}{m}\sum _{i=1}^m \sigma _{ij} l(f(x_{ij}), y_{ij}) \right ] \\ &= 2 \mathbb {E}_{S_{P^{(1:n)}} \sim P^{(1:n)}} \left [ \mathcal {R}_{mn}(\mathcal {F})\right ]
(16)
2 \mathbb {E}_{S_{P^{(1:n)}} \sim P^{(1:n)}} \left [ \mathcal {R}_{mn}(\mathcal {F})\right ] \le 2 \mathcal {R}_{mn}(\mathcal {F}) + 2\sqrt {\frac {\ln (2/\delta )}{2mn}} (17)
Finally, we have
\Phi (S) &= \sup _{f \in \mathcal {F}} \frac {1}{n} \sum _{j=1}^n (\mathcal {L}_{P^{(j)}}(f) - \mathcal {\hat L}_{P^{(j)}}(f))\\ &\le \mathbb {E}_{S_{P^{(1:n)}} \sim P^{(1:n)}} \left [ \Phi (S) \right ] + \sqrt {\frac {\ln (2/\delta )}{2mn}} \\ &\le 2 \mathcal {R}_{mn}(\mathcal {F}) + 2\sqrt {\frac {\ln (2/\delta )}{2mn}} + \sqrt {\frac {\ln (2/\delta )}{2mn}} \\ &= 2 \mathcal {R}_{mn}(\mathcal {F}) + 3\sqrt {\frac {\ln (2/\delta )}{2mn}}
(21)
Thus,
\frac {1}{n} \sum _{j=1}^n \mathcal {L}_{P^{(j)}}(f) \le \frac {1}{n} \sum _{j=1}^n \mathcal {\hat L}_{P^{(j)}}(f) + 2\mathcal {R}_{mn}(\mathcal {F}) + 3\sqrt {\frac {\ln (2/\delta )}{2mn}}
(22)
\small \mathcal {L}^\mu (f) \le \hat {\mathcal {L}}^\mu (f) + 2\mathcal {R}_{mn}(\mathcal {F}) + 2\mathcal {R}_{n}(\mathcal {F}) + 3\sqrt {\frac {\ln (2/\delta )}{2mn}} + 3\sqrt {\frac {\ln (2/\delta )}{2n}}
(23)
\Phi (P) \le \mathbb {E}_{P^{(1:n)} \sim \mu } \left [ \Phi (P) \right ] + \sqrt {\frac {\ln (2/\delta )}{2n}} (24)
Following the proof technique of Theorem 2, we bound the expected value of $\Phi(P)$:
&\mathbb {E}_{P^{(1:n)} \sim \mu } \left [ \Phi (P) \right ] \\ &\quad \quad \quad = \mathbb {E}_{P^{(1:n)} \sim \mu } \left [ \sup _{f \in \mathcal {F}} \left ( \mathbb {E}_{Q \sim \mu }[\mathcal {L}_Q (f)] - \frac {1}{n} \sum _{j=1}^n \mathcal {L}_{P^{(j)}} (f) \right ) \right ] \\ &\le 2 \mathbb {E}_{P^{(1:n)} \sim \mu } \mathbb {E}_{(x_j, y_j) \sim P^{(j)}} \left [ \mathcal {R}_{n}(\mathcal {F}) \right ]
(27)
2 \mathbb {E}_{P^{(1:n)} \sim \mu } \mathbb {E}_{(x_j, y_j) \sim P^{(j)}} \left [ \mathcal {R}_{n}(\mathcal {F}) \right ] \le 2\mathcal {R}_{n}(\mathcal {F}) + 2\sqrt {\frac {\ln (2/\delta )}{2n}} (28)
Thus, we have
\Phi (P) &= \sup _{f \in \mathcal {F}} \left ( \mathcal {L}^\mu (f) - \frac {1}{n}\sum _{j=1}^n \mathcal {L}_{P^{(j)}}(f) \right ) \\ &\le \mathbb {E}_{P^{(1:n)} \sim \mu } \left [ \Phi (P) \right ] + \sqrt {\frac {\ln (2/\delta )}{2n}} \\ &\le 2\mathcal {R}_{n}(\mathcal {F}) + 3\sqrt {\frac {\ln (2/\delta )}{2n}}
(31)
Finally, we have
\mathcal {L}^\mu (f) \le \frac {1}{n}\sum _{j=1}^n \mathcal {L}_{P^{(j)}}(f) + 2\mathcal {R}_{n}(\mathcal {F}) + 3\sqrt {\frac {\ln (2/\delta )}{2n}} \\ \le \hat {\mathcal {L}}^\mu (f) + 2\mathcal {R}_{mn}(\mathcal {F}) + 3\sqrt {\frac {\ln (2/\delta )}{2mn}} + 2\mathcal {R}_{n}(\mathcal {F}) + 3\sqrt {\frac {\ln (2/\delta )}{2n}}
(34)
With the assumption that $D(\mu, \mu') = \sup_{f} |\mathcal{L}^\mu(f) - \mathcal{L}^{\mu'}(f)| \le \epsilon$, we have
\mathcal {L}^\mu (f) &- \hat {\mathcal {L}}^{\mu '} (f) \\ &\le 2\mathcal {R}_{mn}(\mathcal {F}) + 2\mathcal {R}_{n}(\mathcal {F}) + 3\sqrt {\frac {\ln (2/\delta )}{2mn}} + 3\sqrt {\frac {\ln (2/\delta )}{2n}} + \epsilon
(40)
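As an illustrative aside (not part of the original proof), the empirical Rademacher complexity that drives these bounds can be estimated by Monte Carlo for a small finite function class. The sketch below uses hypothetical helper names of our own choosing; it shows the estimate shrinking as the pooled sample size $mn$ grows, consistent with the $\mathcal{R}_{mn}(\mathcal{F})$ terms above:

```python
import random

def empirical_rademacher(outputs, trials=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity
    R_S(F) = E_sigma[ sup_{f in F} (1/N) * sum_i sigma_i * f(x_i) ],
    where `outputs` holds one vector per function f in a finite class F,
    each giving f's values on the N = m*n pooled samples."""
    n_points = len(outputs[0])
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # Draw i.i.d. Rademacher signs, one per pooled sample.
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n_points)]
        # Supremum over the finite class of the signed empirical average.
        best = max(sum(s * v for s, v in zip(sigma, f)) / n_points
                   for f in outputs)
        total += best
    return total / trials

# Two constant +/-1 "functions" evaluated on N pooled points.
small = [[1.0] * 10, [-1.0] * 10]        # N = 10
large = [[1.0] * 1000, [-1.0] * 1000]    # N = 1000
# The estimate shrinks roughly like 1/sqrt(N) as mn grows.
print(empirical_rademacher(small) > empirical_rademacher(large))  # True
```

For this two-function class the supremum reduces to $|\frac{1}{N}\sum_i \sigma_i|$, whose expectation decays as $O(1/\sqrt{N})$, mirroring the $\sqrt{\ln(2/\delta)/(2mn)}$ terms in the bound.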
B Visualization
Fig. 6: Examples of synthetic images conditioned on novel domain knowledge from LLMs. Rows show the classes bird, car, chair, and person. The first two columns (i.e., Caltech101 and VOC2007) are selected from the VLCS dataset, while the remaining three columns are generated based on the novel domains (e.g., fairytale) provided by LLMs.
[Figure: failure cases of synthetic images, grouped by type. Ambiguous classification: e.g., class horse in domain castle grounds, class house in domain street markets, class guitar in domain beach. Missing object of interest: e.g., class car in domains airport, luxury estate, and night club; class dog in domain art deco. Distorted torso. Distorted, extra, or missing limbs or fingers.]