Of course, what is notable is that they trained and released a public text-to-video model, not that it is for Chinese. It could likely be finetuned to process tokenized English with relative ease, like any other model built on the same framework.

[1]https://github.com/THUDM/CogVideo

# CogVideo

This is the official repo for the paper [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers]([2]http://arxiv.org/abs/2205.15868).

**News!** The [demo]([3]https://wudao.aminer.cn/cogvideo/) for CogVideo is available!

**News!** The code and model for text-to-video generation are now available! Currently we only support *simplified Chinese input*.

[4]https://user-images.githubusercontent.com/48993524/170857367-2033c514-3c9f-4297-876f-2468592a254b.mp4

* **Read** our paper [CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers]([5]https://arxiv.org/abs/2205.15868) on arXiv for a formal introduction.
* **Try** our demo at [6]https://wudao.aminer.cn/cogvideo/
* **Run** our pretrained models for text-to-video generation. Please use an A100 GPU.
* **Cite** our paper if you find our work helpful:

```
@article{hong2022cogvideo,
  title={CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers},
  author={Hong, Wenyi and Ding, Ming and Zheng, Wendi and Liu, Xinghan and Tang, Jie},
  journal={arXiv preprint arXiv:2205.15868},
  year={2022}
}
```

## Web Demo

The demo for CogVideo is at [7]https://wudao.aminer.cn/cogvideo/, where you can try text-to-video generation hands-on. *The original input is in Chinese.*

## Generated Samples

**Video samples generated by CogVideo.** The actual text inputs are in Chinese. Each sample is a 4-second clip of 32 frames, and here we sample 9 frames uniformly for display purposes.

![Intro images](assets/intro-image.png)
![More samples](assets/appendix-moresamples.png)

**CogVideo is able to generate relatively high-frame-rate videos.** A 4-second clip of 32 frames is shown below.

![High-frame-rate sample](assets/appendix-sample-highframerate.png)

## Getting Started

### Setup

* Hardware: Linux servers with Nvidia A100s are recommended, but it is also possible to run the pretrained models with a smaller `--max-inference-batch-size` and `--batch-size`, or to train smaller models on less powerful GPUs.
* Environment: install dependencies via `pip install -r requirements.txt`.
* LocalAttention: make sure you have CUDA installed, then compile the local attention kernel:

```shell
git clone https://github.com/Sleepychord/Image-Local-Attention
cd Image-Local-Attention && python setup.py install
```

### Download

Our code will automatically download or detect the models in the path defined by the environment variable `SAT_HOME`. You can also manually download [CogVideo-Stage1]([9]https://lfs.aminer.cn/misc/cogvideo/cogvideo-stage1.zip) and [CogVideo-Stage2]([10]https://lfs.aminer.cn/misc/cogvideo/cogvideo-stage2.zip) and place them under `SAT_HOME` (in folders named `cogvideo-stage1` and `cogvideo-stage2`).
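If you prefer to place the checkpoints yourself, a minimal sketch looks like this (assuming the zips unpack into the folder names the README expects; `/data/models` is just an example path):

```shell
# Example location for the pretrained checkpoints (pick any path you like)
export SAT_HOME=/data/models
mkdir -p "$SAT_HOME" && cd "$SAT_HOME"

# Fetch and unpack both stages manually instead of relying on auto-download
wget https://lfs.aminer.cn/misc/cogvideo/cogvideo-stage1.zip
wget https://lfs.aminer.cn/misc/cogvideo/cogvideo-stage2.zip
unzip cogvideo-stage1.zip   # expected to yield cogvideo-stage1/
unzip cogvideo-stage2.zip   # expected to yield cogvideo-stage2/
```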
### Text-to-Video Generation

```
./script/inference_cogvideo_pipeline.sh
```

The main arguments useful in inference are:

* `--input-source [path or "interactive"]`. The path of an input file with one query per line. A CLI is launched when using "interactive".
* `--output-path [path]`. The folder that will contain the results.
* `--batch-size [int]`. The number of samples generated per query.
* `--max-inference-batch-size [int]`. Maximum batch size per forward pass. Reduce it if you hit OOM.
* `--stage1-max-inference-batch-size [int]`. Maximum batch size per forward pass in Stage 1. Reduce it if you hit OOM.
* `--both-stages`. Run Stage 1 and Stage 2 sequentially.
* `--use-guidance-stage1`. Use classifier-free guidance in Stage 1, which is strongly suggested for better results.

A sample invocation combining these flags is sketched after the link list below.

You should set the environment variable `SAT_HOME` to specify where the downloaded models are stored. *Currently only Chinese input is supported.*

References

1. https://github.com/THUDM/CogVideo
2. http://arxiv.org/abs/2205.15868
3. https://wudao.aminer.cn/cogvideo/
4. https://user-images.githubusercontent.com/48993524/170857367-2033c514-3c9f-4297-876f-2468592a254b.mp4
5. https://arxiv.org/abs/2205.15868
6. https://wudao.aminer.cn/cogvideo/
7. https://wudao.aminer.cn/cogvideo/
8. https://github.com/Sleepychord/Image-Local-Attention
9. https://lfs.aminer.cn/misc/cogvideo/cogvideo-stage1.zip
10. https://lfs.aminer.cn/misc/cogvideo/cogvideo-stage2.zip
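As promised above, here is a rough, untested invocation combining the documented flags; the query file and output directory are hypothetical, and the batch sizes are conservative starting points rather than recommendations. Depending on how the wrapper script is written, you may need to edit these flags inside `script/inference_cogvideo_pipeline.sh` instead of passing them on the command line.

```shell
# Hypothetical batch run: queries.txt holds one (simplified Chinese) prompt
# per line; ./output is where the generated clips will be written.
export SAT_HOME=/data/models   # same checkpoint path as in the download step

./script/inference_cogvideo_pipeline.sh \
    --input-source queries.txt \
    --output-path ./output \
    --both-stages \
    --use-guidance-stage1 \
    --batch-size 4 \
    --max-inference-batch-size 1 \
    --stage1-max-inference-batch-size 1
```

Keeping `--max-inference-batch-size` at 1 trades speed for memory, which is the knob the README points to when you run out of GPU memory.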