CAST: Cross-Attention in Space and Time for Video Action Recognition [NeurIPS 2023][Project Page][Arxiv]
We conduct all the experiments with 16 NVIDIA GeForce RTX 3090 GPUs. First, install PyTorch 1.10.0+ and torchvision 0.11.0.
conda create -n vmae_1.10 python=3.8 ipykernel -y
conda activate vmae_1.10
conda install pytorch==1.10.0 torchvision==0.11.0 torchaudio==0.10.0 -c pytorch
Then, install timm, triton, DeepSpeed, and others.
pip install triton==1.0.0
git clone https://github.com/microsoft/DeepSpeed
cd DeepSpeed
git checkout 3a3dfe66bb
DS_BUILD_OPS=1 pip install . --global-option="build_ext"
pip install TensorboardX decord einops scipy pandas requests
ds_report
If DeepSpeed is installed successfully, running the ds_report command prints a report of the installed ops and the detected environment. For other DeepSpeed-related issues, please refer to the DeepSpeed GitHub page.
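
As a complement to ds_report, the pinned versions from the install commands above can be checked from Python. This is a minimal sketch: the `EXPECTED` table and the `check_versions` helper are illustrative names, and the distribution names passed to `importlib.metadata` are assumptions about how each package is registered in your environment.

```python
from importlib import metadata

# Versions pinned in the install commands above. The distribution names are
# assumptions; adjust them if your environment registers the packages differently.
EXPECTED = {"torch": "1.10.0", "torchvision": "0.11.0", "triton": "1.0.0"}


def check_versions(expected):
    """Return {package: (installed_version_or_None, matches_expected)}."""
    report = {}
    for name, wanted in expected.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Local build tags such as "1.10.0+cu113" still count as a match.
        ok = installed is not None and installed.split("+")[0] == wanted
        report[name] = (installed, ok)
    return report


if __name__ == "__main__":
    for name, (installed, ok) in check_versions(EXPECTED).items():
        print(f"{name}: installed={installed} expected={EXPECTED[name]} ok={ok}")
```

A package reported as `installed=None` was not found by `importlib.metadata`; this does not replace ds_report, which also checks the compiled DeepSpeed ops.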
- We report experimental results on three standard datasets (EPIC-KITCHENS-100, Something-Something-V2, and Kinetics400).
- We provide sample annotation files; see annotations.
- The pre-processing of EPIC-KITCHENS-100 can be summarized in 3 steps:
  - Download the dataset from the official website.
  - Preprocess the dataset by resizing the short edge of each video to 256px. You can refer to the MMAction2 Data Benchmark.
  - Generate the annotations needed for the dataloader ("<video_id>,<verb_class>,<noun_class>" in annotations). The annotations usually include train.csv and val.csv. Each *.csv file is formatted like:

        video_1,verb_1,noun_1
        video_2,verb_2,noun_2
        video_3,verb_3,noun_3
        ...
        video_N,verb_N,noun_N
  - All video files are located inside DATA_PATH.
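
The EPIC-KITCHENS-100 annotation step can be sketched as follows. This is a minimal illustration, not the authors' tooling: it assumes the source CSV has `verb_class` and `noun_class` columns (as the official EPIC-KITCHENS-100 annotation files do), and the `id_column` default of `narration_id` is an assumption that should be changed to whichever column matches the clip file names in your DATA_PATH. The function name `write_ek100_annotations` is hypothetical.

```python
import csv
import io


def write_ek100_annotations(src, dst, id_column="narration_id"):
    """Write "<video_id>,<verb_class>,<noun_class>" lines from `src` to `dst`.

    `src` and `dst` are open text file objects. No header row is written,
    matching the annotation format described above.
    """
    reader = csv.DictReader(src)
    writer = csv.writer(dst, lineterminator="\n")
    count = 0
    for row in reader:
        writer.writerow([row[id_column], row["verb_class"], row["noun_class"]])
        count += 1
    return count


if __name__ == "__main__":
    # Tiny in-memory example with made-up rows.
    source = io.StringIO(
        "narration_id,verb_class,noun_class,narration\n"
        "P01_01_0,3,10,open door\n"
        "P01_01_1,0,5,take cup\n"
    )
    target = io.StringIO()
    write_ek100_annotations(source, target)
    print(target.getvalue(), end="")
```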
- The pre-processing of Something-Something-V2 can be summarized in 3 steps:
  - Download the dataset from the official website.
  - Preprocess the dataset by changing the video extension from webm to .mp4 while keeping the original height of 240px. You can refer to the MMAction2 Data Benchmark.
  - Generate the annotations needed for the dataloader ("<video_id> <video_class>" in annotations). The annotations usually include train.csv, val.csv, and test.csv. Each *.csv file is formatted like:

        video_1.mp4 label_1
        video_2.mp4 label_2
        video_3.mp4 label_3
        ...
        video_N.mp4 label_N
  - All video files are located inside DATA_PATH.
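
The Something-Something-V2 annotation step can be sketched as below. This is a hedged illustration: it assumes each entry of the official label JSON carries an `id` and a bracketed `template` string, and that the label map is keyed by the template with brackets removed. Verify both against your download. The helper name `ssv2_annotation_lines` is hypothetical, and the class index in the example is made up.

```python
def ssv2_annotation_lines(entries, label_to_index):
    """Build "<video_id>.mp4 <video_class>" lines for the dataloader.

    `entries` is a list of dicts with "id" and "template" keys (assumed layout
    of the official train/validation JSON). `label_to_index` maps the template
    text without brackets to a class index (int or numeric string).
    """
    lines = []
    for entry in entries:
        # "Holding [something]" -> "Holding something"
        template = entry["template"].replace("[", "").replace("]", "")
        class_index = int(label_to_index[template])
        lines.append(f"{entry['id']}.mp4 {class_index}")
    return lines


if __name__ == "__main__":
    demo_entries = [{"id": "78687", "template": "Holding [something] next to [something]"}]
    demo_labels = {"Holding something next to something": "7"}  # made-up index
    print("\n".join(ssv2_annotation_lines(demo_entries, demo_labels)))
```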
- The pre-processing of Kinetics400 can be summarized in 3 steps:
  - Download the dataset from the official website or OpenDataLab.
  - Preprocess the dataset by resizing the short edge of each video to 320px. You can refer to the MMAction2 Data Benchmark.
  - Generate the annotations needed for the dataloader ("<video_id> <video_class>" in annotations). The annotations usually include train.csv, val.csv, and test.csv. Each *.csv file is formatted like:

        video_1.mp4 label_1
        video_2.mp4 label_2
        video_3.mp4 label_3
        ...
        video_N.mp4 label_N
  - All video files should be split into DATA_PATH/train and DATA_PATH/val.
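
The short-edge resize for Kinetics400 can be sketched with ffmpeg's `scale` filter. This is one possible approach under stated assumptions, not the MMAction2 procedure: it assumes ffmpeg is installed and that the video's width and height are already known. The helper names `short_edge_scale_filter` and `resize_command` are hypothetical. A value of `-2` asks ffmpeg to keep the aspect ratio and round that side to an even number.

```python
def short_edge_scale_filter(width, height, short_edge=320):
    """Return an ffmpeg scale filter that sets the shorter side to `short_edge`."""
    if width >= height:
        # Landscape or square: height is the short edge.
        return f"scale=-2:{short_edge}"
    # Portrait: width is the short edge.
    return f"scale={short_edge}:-2"


def resize_command(src, dst, width, height, short_edge=320):
    """Build the ffmpeg argument list; run it with subprocess.run(cmd, check=True)."""
    video_filter = short_edge_scale_filter(width, height, short_edge)
    return ["ffmpeg", "-y", "-i", src, "-vf", video_filter, dst]


if __name__ == "__main__":
    print(" ".join(resize_command("in.mp4", "out.mp4", 1280, 720)))
```

Passing `short_edge=256` gives the corresponding filter for the EPIC-KITCHENS-100 step.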
We use pre-trained weights for the spatial and temporal experts. The spatial expert (CLIP) uses the official weights. The temporal expert (VideoMAE) uses weights pre-trained on each of the three datasets (EK100, K400, and SSV2): for K400 and SSV2 we use the official weights, and for EK100 we use weights we pre-trained ourselves. Set VMAE_MODEL_PATH and CLIP_MODEL_PATH in the fine-tuning script to the downloaded expert weights.
We provide the off-the-shelf scripts in the scripts folder.
- For example, to fine-tune CAST on Kinetics400 with 16 GPUs (2 nodes x 8 GPUs), run the following script:
DATA_PATH=YOUR_PATH
ANNOTATION_PATH=YOUR_PATH
VMAE_MODEL_PATH=YOUR_PATH
CLIP_MODEL_PATH=YOUR_PATH
OMP_NUM_THREADS=1 python -m torch.distributed.launch \
--nproc_per_node=8 \
--master_port ${YOUR_NUMBER} --nnodes=2 \
--node_rank=${YOUR_NUMBER} --master_addr=${YOUR_NUMBER} \
YOUR_PATH/run_bidirection_compo.py \
--data_set Kinetics-400 \
--nb_classes 400 \
--vmae_model compo_bidir_vit_base_patch16_224 \
--anno_path ${ANNOTATION_PATH} \
--data_path ${DATA_PATH} \
--clip_finetune ${CLIP_MODEL_PATH} \
--vmae_finetune ${VMAE_MODEL_PATH} \
--log_dir ${YOUR_PATH} \
--output_dir ${YOUR_PATH} \
--batch_size 6 \
--input_size 224 \
--short_side_size 224 \
--save_ckpt_freq 25 \
--num_sample 1 \
--num_frames 16 \
--opt adamw \
--lr 1e-3 \
--opt_betas 0.9 0.999 \
--weight_decay 0.05 \
--epochs 70 \
--dist_eval \
--test_num_segment 5 \
--test_num_crop 3 \
--num_workers 8 \
--drop_path 0.2 \
--layer_decay 0.75 \
--mixup_switch_prob 0 \
--mixup_prob 0.5 \
--reprob 0. \
--init_scale 1. \
--update_freq 6 \
--seed 0 \
--enable_deepspeed \
--warmup_epochs 5
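
The arithmetic behind `--batch_size 6`, `--update_freq 6`, and 16 GPUs can be sketched as below. This is an illustration under two assumptions that should be checked against the training script: that `--update_freq` is the number of gradient-accumulation steps, and that the learning rate is scaled linearly by the total batch size relative to 256, as in the VideoMAE codebase this project builds on. Both function names are hypothetical.

```python
def effective_batch_size(batch_size, update_freq, num_gpus):
    """Samples contributing to one optimizer step across all GPUs."""
    return batch_size * update_freq * num_gpus


def scaled_lr(base_lr, total_batch_size, reference=256):
    """Linear scaling rule (assumed): lr grows with the total batch size."""
    return base_lr * total_batch_size / reference


if __name__ == "__main__":
    total = effective_batch_size(batch_size=6, update_freq=6, num_gpus=16)
    print("effective batch size:", total)          # 6 * 6 * 16 = 576
    print("scaled lr:", scaled_lr(1e-3, total))    # about 2.25e-3
```

If you train on fewer GPUs, raising `--update_freq` keeps the product, and therefore the effective batch size, unchanged under these assumptions.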
Evaluation command for EK100:
python ./run_bidirection_compo.py --fine_tune {YOUR_FINETUNED_WEIGHT} --composition --eval
Evaluation command for SSV2 and K400:
python ./run_bidirection.py --fine_tune {YOUR_FINETUNED_WEIGHT} --eval
Method | Spatial Expert | Temporal expert | Epoch | #Frames x Clips x Crops | Fine-tune | Top-1 |
---|---|---|---|---|---|---|
CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on EK100) | 50 | 16x2x3 | log/checkpoint | 49.3 |
Method | Spatial Expert | Temporal expert | Epoch | #Frames x Clips x Crops | Fine-tune | Top-1 |
---|---|---|---|---|---|---|
CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on SSV2) | 50 | 16x2x3 | log/checkpoint | 71.6 |
Method | Spatial Expert | Temporal expert | Epoch | #Frames x Clips x Crops | Fine-tune | Top-1 |
---|---|---|---|---|---|---|
CAST | CLIP-B/16 | VideoMAE-B/16 (pre-trained on K400) | 70 | 16x5x3 | log/checkpoint | 85.3 |
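
The "#Frames x Clips x Crops" column describes multi-view testing: for example, 16x5x3 means 5 temporal clips x 3 spatial crops = 15 views of 16 frames each per video. The sketch below shows one common way to turn per-view outputs into a video-level prediction by averaging softmax probabilities. How the evaluation script actually combines views is an assumption here, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_views(view_logits):
    """Average per-view class probabilities; return (top1_index, mean_probs)."""
    probs = [softmax(v) for v in view_logits]
    num_views = len(probs)
    num_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / num_views for c in range(num_classes)]
    top1 = max(range(num_classes), key=mean.__getitem__)
    return top1, mean


if __name__ == "__main__":
    # Three made-up views over two classes; a real run would have clips * crops views.
    top1, mean = aggregate_views([[2.0, 0.0], [0.0, 1.0], [3.0, 0.0]])
    print("predicted class:", top1)
```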
This project is built upon VideoMAE, MAE, CLIP and BEiT. Thanks to the contributors of these great codebases.
This project is under the CC-BY-NC 4.0 license. See LICENSE for details.
@inproceedings{cast,
title={CAST: Cross-Attention in Space and Time for Video Action Recognition},
author={Lee, Dongho and Lee, Jongseo and Choi, Jinwoo},
booktitle={NeurIPS},
year={2023}
}