I was in kind of a wonky state of mind when I posted this, although it is of course somewhat interesting. My energy quip may have been out of scale here. The snippet below describes passing a few million images through a single high-end GPU (each epoch is one pass over the entire training set; see the back-of-the-envelope below). The most powerful models involve huge networks of the highest-end GPUs.
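A quick tally from the epoch counts and dataset size in the quoted snippet (my arithmetic, not the paper's, so treat the exact total as approximate):

```python
# Rough count of image passes implied by the quoted training schedule.
# All numbers come from the snippet below; the total is my own estimate.
train_images = 10_335

epochs = {
    "stage1_pose_to_parsing": 50,
    "vqvae_top_level": 110,
    "vqvae_bottom_level": 60,
    "parsing_tokenizer": 20,
    "moe_sampler": 90,
    "index_prediction": 45,
}

total_epochs = sum(epochs.values())               # 375
total_image_passes = train_images * total_epochs  # ~3.9 million

print(f"{total_epochs} epochs -> ~{total_image_passes / 1e6:.1f}M image passes")
```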
We split the dataset into a training set and a testing set. The training set contains 10,335 images and the testing set contains 1,149 images. We downsample the images to 512 × 256 resolution. The texture attribute labels are the combinations of clothes colors and fabrics annotations. The modules in the whole pipeline are trained stage by stage. All of our models are trained on one NVIDIA Tesla V100 GPU. We adopt the Adam optimizer. The learning rate is set as 1 × 10⁻⁴.

For the training of Stage I (i.e., Pose to Parsing), we use the (human pose, clothes shape labels) pairs as inputs and the labeled human parsing masks as ground truths. We use the instance channel of DensePose (three-channel IUV maps in the original) as the human pose P. Each shape attribute a_i is represented as one-hot embeddings. We train the Stage I module for 50 epochs. The batch size is set as 8.

For the training of the hierarchical VQVAE in Stage II, we first train the top-level codebook, E_top, and decoder for 110 epochs, and then train the bottom-level codebook, E_bot, and D_bot for 60 epochs with the top-level parameters fixed. The batch size is set as 4.

The sampler with mixture-of-experts in Stage II requires X_seg and X_tex. X_seg is obtained by a human parsing tokenizer, which is trained by reconstructing the human parsing maps for 20 epochs with batch size 4. X_tex is obtained by directly downsampling the texture instance maps to the same size as the codebook index maps using nearest interpolation. The cross-entropy loss is employed for training. The sampler is trained for 90 epochs with a batch size of 4.

For the feed-forward index prediction network, we use the top-level features and bottom-level codebook indices as the input and ground-truth pairs. The feed-forward index prediction network is optimized using the cross-entropy loss. The index prediction network is trained for 45 epochs and the batch size is set as 4.
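If it helps make the recipe concrete, here is a minimal PyTorch-style sketch of how that staged optimization could be wired up. The module and variable names (top_vqvae, index_predictor, the 512 codebook size, etc.) are my placeholders, not the paper's code; only the Adam optimizer at 1 × 10⁻⁴, the frozen top level during bottom-level training, the nearest-interpolation downsampling, and the cross-entropy losses come from the quoted text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder modules standing in for the paper's components (names are mine).
top_vqvae = nn.Sequential(nn.Conv2d(3, 64, 4, stride=2, padding=1))     # E_top + top-level decoder
bottom_vqvae = nn.Sequential(nn.Conv2d(3, 64, 4, stride=2, padding=1))  # E_bot + D_bot
index_predictor = nn.Conv2d(64, 512, 1)  # 512 = assumed codebook size (placeholder)

adam = lambda params: torch.optim.Adam(params, lr=1e-4)  # lr = 1e-4 per the quoted setup

# Stage II, step 1: train the top-level codebook and decoder (110 epochs, batch size 4).
opt_top = adam(top_vqvae.parameters())

# Stage II, step 2: train the bottom level with top-level parameters frozen (60 epochs).
for p in top_vqvae.parameters():
    p.requires_grad = False
opt_bottom = adam(bottom_vqvae.parameters())

# Sampler input X_tex: texture instance map downsampled with nearest interpolation
# to the spatial size of the codebook index map.
def make_x_tex(texture_instance_map, index_map_size):
    return F.interpolate(texture_instance_map.float(), size=index_map_size, mode="nearest")

# Feed-forward index prediction: top-level features in, bottom-level indices out,
# optimized with cross-entropy (45 epochs, batch size 4).
def index_prediction_loss(top_features, gt_bottom_indices):
    logits = index_predictor(top_features)             # (B, num_codes, H, W)
    return F.cross_entropy(logits, gt_bottom_indices)  # indices: (B, H, W), long
```

The point of the sketch is just that each stage is an ordinary supervised loop; earlier stages are frozen and reused as fixed inputs, which is what keeps the whole thing trainable on a single V100.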