Random Transformer Model News: Text2Human

Thu Jun 2 14:28:04 PDT 2022

I was in kind of a wonky state of mind when I posted this, although it
is of course somewhat interesting.

My energy quip may have been out of scale here. The below snippet
describes passing about 3.9 million images through a single high-end
GPU (375 epochs in total, each a pass over the 10,335 training images).

The most powerful models involve huge networks of the highest-end GPUs.

> We split the dataset into a training set and a testing set. The training
> set contains 10,335 images and the testing set contains 1,149 images.
> We downsample the images to 512 × 256 resolution. The texture
> attribute labels are the combinations of clothes colors and fabrics
> annotations. The modules in the whole pipeline are trained stage by
> stage. All of our models are trained on one NVIDIA Tesla V100 GPU.
> We adopt the Adam optimizer. The learning rate is set as 1 × 10^-4.
> For the training of Stage I (i.e., Pose to Parsing), we use the (human
> pose, clothes shape labels) pairs as inputs and the labeled human
> parsing masks as ground truths. We use the instance channel of
> densepose (three-channel IUV maps in original) as the human pose
> 𝑃. Each shape attribute 𝑎_𝑖 is represented as one-hot embeddings. We
> train the Stage I module for 50 epochs. The batch size is set as 8. For
> the training of hierarchical VQVAE in Stage II, we first train the
> top-level codebook, 𝐸_𝑡𝑜𝑝, and decoder for 110 epochs, and then train the
> bottom-level codebook, 𝐸_𝑏𝑜𝑡, and 𝐷_𝑏𝑜𝑡 for 60 epochs with top-level
> related parameters fixed. The batch size is set as 4. The sampler with
> mixture-of-experts in Stage II requires 𝑇_𝑠𝑒𝑔 and 𝑇_𝑡𝑒𝑥. 𝑇_𝑠𝑒𝑔 is obtained
> by a human parsing tokenizer, which is trained by reconstructing
> the human parsing maps for 20 epochs with batch size 4. 𝑇_𝑡𝑒𝑥 is
> obtained by directly downsampling the texture instance maps to
> the same size of codebook indices maps using nearest interpolation.
> The cross-entropy loss is employed for training. The sampler is
> trained for 90 epochs with the batch size of 4. For the feed-forward
> index prediction network, we use the top-level features and
> bottom-level codebook indices as the input and ground-truth pairs. The
> feed-forward index prediction network is optimized using the
> cross-entropy loss. The index prediction network is trained for 45 epochs
> and the batch size is set as 4.
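
For scale, here is a quick tally of the epoch counts quoted above. It is
only a back-of-the-envelope sketch: it assumes every stage sees each of
the 10,335 training images once per epoch, which the paper does not spell
out, and the stage names are my own labels rather than anything from the
paper.

```python
# Rough tally of image passes implied by the quoted training schedule.
# Assumption: each epoch of every stage is one pass over the full
# training set. The stage names below are informal labels.

TRAIN_IMAGES = 10_335

EPOCHS = {
    "stage1_pose_to_parsing": 50,
    "vqvae_top_level": 110,
    "vqvae_bottom_level": 60,
    "parsing_tokenizer": 20,
    "sampler": 90,
    "index_prediction": 45,
}


def total_image_passes(train_images: int, epochs: dict[str, int]) -> int:
    """Return images-per-epoch times the summed epoch count."""
    return train_images * sum(epochs.values())


if __name__ == "__main__":
    total_epochs = sum(EPOCHS.values())
    total = total_image_passes(TRAIN_IMAGES, EPOCHS)
    print(f"{total_epochs} epochs x {TRAIN_IMAGES:,} images = {total:,} image passes")
```

Under that assumption the schedule adds up to 375 epochs, or a little
under 3.9 million image passes on the one V100.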