[CVPR25] SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters
Official code of the CVPR 2025 paper SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters.
Demo highlights: various characters, comprehension of body language, execution of motion commands, and engagement in interactive tasks.
Open-Source Content and Caveats: In this repository, we provide code for raw data preprocessing, multimodal data synthesis, SOLAMI model training, model evaluation, and the VR Unity client and server, for community reference. Because we used some company-internal data to train the models in the original paper, we are not open-sourcing the raw data or the trained models. Users can train their own deployable models with their own collected data, building on advanced end-to-end multimodal models (e.g., GLM-4-Voice, Qwen2.5-Omni). In our VR engineering implementation, we rely on the company intranet for message forwarding and file read/write to connect the client and server; you can design your own front-end and back-end strategies according to your infrastructure. We would like to open-source a universal version for everyone, but given our limited time and organizational changes, the current code is still relatively rough. We ask for the community's understanding.
Usage and License Notices: This project utilizes certain datasets, 3D assets, and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses, including but not limited to the OpenAI Terms of Use for the synthetic data generation scripts, the Llama Community License for the foundation language models, the SMPL-X license for the original motion format, and the HumanML3D, Inter-X, DLP-MoCap, AnyInstruct, and CommonVoice licenses for data generation and model training. This project does not impose any additional constraints beyond those stipulated in the original licenses. Furthermore, users are reminded to ensure that their use of the datasets and checkpoints complies with all applicable laws and regulations.
Our experiments were tested on Ubuntu 18.04 LTS with CUDA 11.8.
Step 1: Install pytorch
```bash
conda create -n demo python=3.11
conda activate demo
pip install torch==2.2.2 torchvision==0.17.2 torchaudio==2.2.2 --index-url https://download.pytorch.org/whl/cu118
pip install -U openai-whisper
pip install -r requirements.txt
```
Step 2: Install TTS
Please follow the instructions from TTS to install the package. Note that installing TTS will downgrade numpy; after TTS is installed, install spacy and then restore numpy:
```bash
pip install -U spacy
python -m spacy download en_core_web_sm
pip install numpy==1.26.4
```
(Optional) Step 3: Install vllm, which is used for the model server deployment described below.
Data processing involves many details. The overall pipeline is outlined below (a sketch of one possible unified data item follows the list). For the processing code, please refer to Datasets.
- Speech Data
  - Pretrain Data Preprocessing (~300K items)
  - Character Data Preprocessing
- Motion Data
  - SMPL-X Preprocessing & Feature Extraction (~40K motion items)
  - Text Embedding Generation
  - Unified Data Item Generation
- Multimodal Generation
  - Topic Collection (~4K topics)
  - Multimodal Chat Data Synthesis (~6K items)
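For orientation, here is a hypothetical sketch of what one unified multimodal chat data item could look like. All field names, paths, and values below are illustrative assumptions; the actual schema is defined by the Datasets code.

```python
# Hypothetical layout of a single unified multimodal chat data item.
# Field names and paths are placeholders, not the repo's actual schema.
unified_item = {
    "topic": "greeting a new friend at the park",        # from topic collection
    "character": "a friendly robot companion",           # character setting for the system prompt
    "rounds": [
        {
            "role": "user",
            "speech": "data/speech/round_000.wav",        # synthesized speech audio
            "motion": "data/motion/round_000.npy",        # SMPL-X-derived motion features
            "text": "Hi there! Nice weather, isn't it?",  # transcript for TTS / text embedding
        },
        {
            "role": "character",
            "speech": "data/speech/round_001.wav",
            "motion": "data/motion/round_001.npy",
            "text": "It really is. Want to take a walk together?",
        },
    ],
}
```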
Training SOLAMI involves three stages: motion tokenizer training, multi-task pretraining, and multimodal chat SFT. For details, please refer to Models.
We use the MotionGPT codebase to train the motion tokenizer. For the hand and body tokenizers, we apply 1D convolution as the basic layer of the VQ-VAE; for the relative transform, we use MLP layers. For the speech tokenizer, we use the original pretrained tokenizer from AnyGPT. Besides tokenizer training, we also use GPT-2 as the foundation model for initial ablation studies.
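As a rough illustration of the design described above, the following is a minimal PyTorch sketch of a VQ-VAE that uses 1D convolutions over the time axis. The feature dimension, layer sizes, and codebook size are assumptions for illustration and do not reproduce the exact architecture in the MotionGPT-based training code.

```python
import torch
import torch.nn as nn

class MotionVQVAE(nn.Module):
    """Minimal VQ-VAE sketch: 1D convolutions over time for body/hand motion
    features. Hyperparameters are illustrative, not the repo's."""

    def __init__(self, feat_dim=263, hidden=512, codebook_size=512, code_dim=512):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(feat_dim, hidden, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, code_dim, kernel_size=4, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, code_dim)
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(code_dim, hidden, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(hidden, feat_dim, kernel_size=4, stride=2, padding=1),
        )

    def quantize(self, z):
        # z: (B, code_dim, T') -> nearest codebook entry per downsampled time step
        z_flat = z.permute(0, 2, 1).reshape(-1, z.shape[1])   # (B*T', code_dim)
        idx = torch.cdist(z_flat, self.codebook.weight).argmin(dim=-1)
        z_q = self.codebook(idx).view(z.shape[0], z.shape[2], -1).permute(0, 2, 1)
        # straight-through estimator so gradients reach the encoder
        return z + (z_q - z).detach(), idx.view(z.shape[0], -1)

    def forward(self, motion):
        # motion: (B, T, feat_dim) continuous motion features
        z = self.encoder(motion.permute(0, 2, 1))
        z_q, tokens = self.quantize(z)
        recon = self.decoder(z_q).permute(0, 2, 1)
        return recon, tokens

if __name__ == "__main__":
    recon, tokens = MotionVQVAE()(torch.randn(2, 64, 263))
    print(recon.shape, tokens.shape)  # torch.Size([2, 64, 263]) torch.Size([2, 16])
```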
We adopt multi-task pretraining on the LLM backbone to align motion, speech, and language. To achieve this, we train a 7B decoder-only LLM (AnyGPT) on 32 V100s with DeepSpeed ZeRO-3 for one day. During training, we fix the parameters of the motion and speech tokenizers and adopt full-parameter fine-tuning of the backbone, as sketched below.
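A minimal sketch of that parameter setup, assuming generic module handles (the real module names and training loop come from the AnyGPT-based training code):

```python
import torch.nn as nn

def configure_trainable_params(llm, motion_tokenizer, speech_tokenizer):
    """Freeze the pretrained motion/speech tokenizers; fully fine-tune the LLM backbone."""
    for tok in (motion_tokenizer, speech_tokenizer):
        for p in tok.parameters():
            p.requires_grad = False
        tok.eval()
    for p in llm.parameters():
        p.requires_grad = True
    n_trainable = sum(p.numel() for p in llm.parameters() if p.requires_grad)
    print(f"trainable LLM parameters: {n_trainable:,}")

# Stand-in modules just to show the call; real modules come from the training code.
configure_trainable_params(nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8))
```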
In this stage, we fine-tune the model with synthetic multimodal chat data to obtain a social vision-language-action model for immersive interaction with 3D characters. The SOLAMI model takes the user's motion and speech (the character's observation) as input and generates the character's motion and speech as the response (the character's action), conditioned on a system prompt describing the character setting and on the dialogue context.
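Purely as an illustration of how such a turn could be serialized for the decoder-only backbone, the snippet below interleaves placeholder speech and motion tokens with a character system prompt. The role tags, delimiters, and token names are invented for this sketch and are not the repo's actual chat template.

```python
# Illustrative serialization of one chat turn; all special tokens are placeholders.
SYSTEM_PROMPT = "[System] Character setting: a friendly robot companion in VR."

def build_turn(user_speech_tokens, user_motion_tokens,
               char_speech_tokens, char_motion_tokens):
    user = (f"[User] <speech>{' '.join(user_speech_tokens)}</speech>"
            f"<motion>{' '.join(user_motion_tokens)}</motion>")
    char = (f"[Character] <speech>{' '.join(char_speech_tokens)}</speech>"
            f"<motion>{' '.join(char_motion_tokens)}</motion>")
    return f"{SYSTEM_PROMPT}\n{user}\n{char}"

print(build_turn(["s_12", "s_87"], ["m_3", "m_44"], ["s_5", "s_9"], ["m_7"]))
```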
The VR Client is a standalone Unity project that can be compiled for Quest 2/3/Pro and above devices. It serves as the front-end interface for users to interact with the SOLAMI system in virtual reality.
Repository: SOLAMI-VRClient
The Relay acts as middleware to establish connections between the VR Client and the Model Server. The Relay communicates with the Model Server through HTTP requests and with the VR Client through Redis.
Repository: SOLAMI-VRRelay
For security reasons, the VR Relay and the SOLAMI model are deployed on separate servers. Users can modify the code according to their requirements to improve communication efficiency.
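The following is a minimal sketch of such a relay loop, assuming placeholder Redis queue names and a placeholder model-server endpoint; the actual configuration lives in the SOLAMI-VRRelay code.

```python
import json

import redis      # pip install redis
import requests   # pip install requests

# Queue names and the endpoint URL are placeholders, not the repo's actual values.
r = redis.Redis(host="localhost", port=6379)
MODEL_SERVER_URL = "http://model-server:8000/infer"

def relay_loop():
    while True:
        # 1) Block until the VR client pushes an observation (user speech + motion).
        _, raw = r.blpop("vr:client:requests")
        observation = json.loads(raw)

        # 2) Forward it to the model server over HTTP.
        resp = requests.post(MODEL_SERVER_URL, json=observation, timeout=60)
        resp.raise_for_status()

        # 3) Push the character's response (speech + motion) back to the VR client.
        r.rpush("vr:client:responses", json.dumps(resp.json()))

if __name__ == "__main__":
    relay_loop()
```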
The audio-to-face animation algorithm used in this project needs to be deployed separately by users. For reference, you can check out the UniTalker project, which provides a unified model for audio-driven 3D facial animation that can handle various audio domains including clean and noisy voices in different languages.
UniTalker can generate realistic facial motion from different audio inputs and is compatible with the SOLAMI system when properly configured.
We deploy our model server on nodes with 2 GPUs. In this repo, we provide a SOLAMI deployment based on vllm. Additionally, we offer a simplified version of the DLP method with the llama2-7B-chat model as the base LLM, serving as a comparative LLM-Agent approach.
```bash
cd models/vla/anygpt/infer

# SOLAMI model server
python solami_server_model.py

# LLM-Agent framework (simplified DLP baseline)
python llama2_server_model.py
```
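For reference, here is a minimal sketch of loading a trained checkpoint with vLLM's offline Python API on a 2-GPU node. The checkpoint path and prompt are placeholders, and solami_server_model.py may wire the server up differently.

```python
from vllm import LLM, SamplingParams

# Placeholder checkpoint path; use your own trained SOLAMI backbone (HF format).
llm = LLM(
    model="/path/to/your/solami_checkpoint",
    tensor_parallel_size=2,       # split across the node's 2 GPUs
    trust_remote_code=True,
)
params = SamplingParams(temperature=0.7, max_tokens=512)
outputs = llm.generate(["<serialized system prompt + dialogue context>"], params)
print(outputs[0].outputs[0].text)
```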
```bibtex
@inproceedings{Jiang2025SOLAMI,
  title={SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters},
  author={Jianping Jiang and Weiye Xiao and Zhengyu Lin and Huaizhong Zhang and Tianxiang Ren and Yang Gao and Zhiqian Lin and Zhongang Cai and Lei Yang and Ziwei Liu},
  booktitle={CVPR},
  year={2025}
}

@inproceedings{Cai2024DLP,
  title={Digital Life Project: Autonomous 3D Characters with Social Intelligence},
  author={Zhongang Cai and Jianping Jiang and Zhongfei Qing and Xinying Guo and Mingyuan Zhang and Zhengyu Lin and Haiyi Mei and Chen Wei and Ruisi Wang and Wanqi Yin and Xiangyu Fan and Han Du and Liang Pan and Peng Gao and Zhitao Yang and Yang Gao and Jiaqi Li and Tianxiang Ren and Yukun Wei and Xiaogang Wang and Chen Change Loy and Lei Yang and Ziwei Liu},
  booktitle={CVPR},
  year={2024}
}
```
The code of SOLAMI is built on AnyGPT, HumanTOMATO, and MotionGPT.
- Digital Life Project: First LLM-Agent framework for building 3D autonomous characters.
- ChatHuman: A multi-modal LLM for understanding humans with the assistance of tools.
- Generative Agents: An architecture for interactive simulacra of human behavior.
- SEELES: End-to-end 3D game AI engine generating 3D games with a single sentence, igniting hyper-personalized social gaming.
- MeshCapade: Foundation models that enable digital humans to see, understand, and move.
- Whispers from the Star: AI dialogue-based text adventure game.