The Python version is 3.9.20. Install the other required packages with the following command:
pip install -r requirements.txt
Create the directories that store the checkpoints. If you change the directory structure or rename directories, update the config files and model files accordingly.
mkdir -p ckpt/MultiResQFormer
mkdir -p ckpt/pretrained_ckpt
Then download the following model checkpoints:
- Main video-SALMONN model checkpoint. Put it under
ckpt/MultiResQFormer
- InstructBLIP checkpoint for the Vicuna-13B model. Put it under
ckpt/pretrained_ckpt
- EVA_VIT model checkpoint for InstructBLIP. Put it under
ckpt/pretrained_ckpt
- BEATs encoder checkpoint. Put it under
ckpt/pretrained_ckpt
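Before running inference, it can help to confirm that the checkpoint directories are populated. The following is a minimal sketch, not part of the repository: the helper name `missing_checkpoint_dirs` is hypothetical, and because the checkpoint file names are not listed here, it only checks that each directory exists and contains at least one file.

```python
from pathlib import Path

# Directories taken from the setup steps above. The checkpoint file names
# are not specified there, so only the presence of some file is checked.
CHECKPOINT_DIRS = ("ckpt/MultiResQFormer", "ckpt/pretrained_ckpt")


def missing_checkpoint_dirs(root="."):
    """Return the checkpoint directories that are absent or hold no files."""
    missing = []
    for rel in CHECKPOINT_DIRS:
        path = Path(root) / rel
        has_file = path.is_dir() and any(p.is_file() for p in path.iterdir())
        if not has_file:
            missing.append(rel)
    return missing


if __name__ == "__main__":
    for rel in missing_checkpoint_dirs():
        print(f"No checkpoint files found in {rel}")
```

Run it from the repository root. If it prints nothing, both directories contain at least one file, although this does not verify that the files are the correct checkpoints.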
Run inference with the following command:
python inference.py --cfg-path config/test.yaml
The result is saved in the following path:
./ckpt/MultiResQFormer/<DateTime>/eval_result.json
The expected result is:
[
    {
        "id": "./dummy/4405327307.mp4_Describe the video and audio in detail",
        "conversation": [
            {
                "from": "human",
                "value": "Describe the video and audio in detail"
            },
            {
                "from": "gpt",
                "value": "None"
            }
        ],
        "task": "audiovisual_video_input",
        "ref_answer": "None",
        "gen_answer": "The video shows a group of musicians performing on stage, with a man singing into a microphone and playing the piano. There is also a drum set and a saxophone on stage. The audience is not visible in the video. The music is upbeat and energetic, and the performers seem to be enjoying themselves.</s>"
    }
]
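To inspect the output without opening the file by hand, a short script along these lines may be useful. It is a sketch under stated assumptions: the helper names `latest_result` and `read_answers` are hypothetical, the format of the `<DateTime>` directory name is not documented here so the newest file is chosen by modification time, and the record fields are those shown in the expected result above.

```python
import json
from pathlib import Path

END_TOKEN = "</s>"  # end-of-sequence marker seen in the sample gen_answer


def latest_result(root="ckpt/MultiResQFormer"):
    """Return the most recently modified eval_result.json under root, or None."""
    candidates = list(Path(root).glob("*/eval_result.json"))
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


def read_answers(path):
    """Return (question, answer) pairs with a trailing END_TOKEN removed."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    pairs = []
    for rec in records:
        question = rec["conversation"][0]["value"]
        answer = rec["gen_answer"]
        if answer.endswith(END_TOKEN):
            answer = answer[: -len(END_TOKEN)]
        pairs.append((question, answer))
    return pairs


if __name__ == "__main__":
    result = latest_result()
    if result is None:
        print("No eval_result.json found; run inference first.")
    else:
        for question, answer in read_answers(result):
            print(f"Q: {question}\nA: {answer}\n")
```

Generated text can vary with the checkpoint and decoding settings, so compare the structure of the records rather than expecting an exact string match.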