I was in kind of a wonky state of mind when I posted this, although it is of course somewhat interesting. My energy quip may have been out of scale here. The snippet below describes passing a few million images through a single high-end GPU (each epoch is one pass over the entire training set; see the back-of-the-envelope below). The most powerful models involve huge networks of the highest-end GPUs.
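A quick tally from the epoch counts and dataset size in the quoted snippet (my arithmetic, not the paper's, so treat the exact total as approximate):

```python
# Rough count of image passes implied by the quoted training schedule.
# All numbers come from the snippet below; the total is my own estimate.
train_images = 10_335

epochs = {
    "stage1_pose_to_parsing": 50,
    "vqvae_top_level": 110,
    "vqvae_bottom_level": 60,
    "parsing_tokenizer": 20,
    "moe_sampler": 90,
    "index_prediction": 45,
}

total_epochs = sum(epochs.values())               # 375
total_image_passes = train_images * total_epochs  # ~3.9 million

print(f"{total_epochs} epochs -> ~{total_image_passes / 1e6:.1f}M image passes")
```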
We split the dataset into a training set and a testing set. The training set contains 10,335 images and the testing set contains 1,149 images. We downsample the images to 512 × 256 resolution. The texture attribute labels are the combinations of clothes colors and fabrics annotations. The modules in the whole pipeline are trained stage by stage. All of our models are trained on one NVIDIA Tesla V100 GPU. We adopt the Adam optimizer. The learning rate is set as 1 × 10⁻⁴.

For the training of Stage I (i.e., Pose to Parsing), we use the (human pose, clothes shape labels) pairs as inputs and the labeled human parsing masks as ground truths. We use the instance channel of DensePose (three-channel IUV maps in the original) as the human pose P. Each shape attribute a_i is represented as one-hot embeddings. We train the Stage I module for 50 epochs. The batch size is set as 8.

For the training of the hierarchical VQVAE in Stage II, we first train the top-level codebook, E_top, and decoder for 110 epochs, and then train the bottom-level codebook, E_bot, and D_bot for 60 epochs with the top-level parameters fixed. The batch size is set as 4.

The sampler with mixture-of-experts in Stage II requires X_seg and X_tex. X_seg is obtained by a human parsing tokenizer, which is trained by reconstructing the human parsing maps for 20 epochs with batch size 4. X_tex is obtained by directly downsampling the texture instance maps to the same size as the codebook index maps using nearest interpolation. The cross-entropy loss is employed for training. The sampler is trained for 90 epochs with a batch size of 4.

For the feed-forward index prediction network, we use the top-level features and bottom-level codebook indices as the input and ground-truth pairs. The feed-forward index prediction network is optimized using the cross-entropy loss. The index prediction network is trained for 45 epochs and the batch size is set as 4.
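If it helps make the recipe concrete, here is a minimal PyTorch-style sketch of how that staged optimization could be wired up. The module and variable names (top_vqvae, index_predictor, the 512 codebook size, etc.) are my placeholders, not the paper's code; only the Adam optimizer at 1 × 10⁻⁴, the frozen top level during bottom-level training, the nearest-interpolation downsampling, and the cross-entropy losses come from the quoted text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder modules standing in for the paper's components (names are mine).
top_vqvae = nn.Sequential(nn.Conv2d(3, 64, 4, stride=2, padding=1))     # E_top + top-level decoder
bottom_vqvae = nn.Sequential(nn.Conv2d(3, 64, 4, stride=2, padding=1))  # E_bot + D_bot
index_predictor = nn.Conv2d(64, 512, 1)  # 512 = assumed codebook size (placeholder)

adam = lambda params: torch.optim.Adam(params, lr=1e-4)  # lr = 1e-4 per the quoted setup

# Stage II, step 1: train the top-level codebook and decoder (110 epochs, batch size 4).
opt_top = adam(top_vqvae.parameters())

# Stage II, step 2: train the bottom level with top-level parameters frozen (60 epochs).
for p in top_vqvae.parameters():
    p.requires_grad = False
opt_bottom = adam(bottom_vqvae.parameters())

# Sampler input X_tex: texture instance map downsampled with nearest interpolation
# to the spatial size of the codebook index map.
def make_x_tex(texture_instance_map, index_map_size):
    return F.interpolate(texture_instance_map.float(), size=index_map_size, mode="nearest")

# Feed-forward index prediction: top-level features in, bottom-level indices out,
# optimized with cross-entropy (45 epochs, batch size 4).
def index_prediction_loss(top_features, gt_bottom_indices):
    logits = index_predictor(top_features)             # (B, num_codes, H, W)
    return F.cross_entropy(logits, gt_bottom_indices)  # indices: (B, H, W), long
```

The point of the sketch is just that each stage is an ordinary supervised loop; earlier stages are frozen and reused as fixed inputs, which is what keeps the whole thing trainable on a single V100.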