Hierarchical text-conditional image generation with CLIP latents


Contrastive models like CLIP have been shown to learn robust representations of images that capture both semantics and style. To leverage these representations for image generation, we propose a two-stage model: a prior that generates a CLIP image embedding given a text caption, and a decoder that generates an image conditioned on the image embedding. We show that explicitly generating image representations improves image diversity with minimal loss in photorealism and caption similarity. Our decoders conditioned on image representations can also produce variations of an image that preserve both its semantics and style, while varying the non-essential details absent from the image representation. Moreover, the joint embedding space of CLIP enables language-guided image manipulations in a zero-shot fashion. We use diffusion models for the decoder and experiment with both autoregressive and diffusion models for the prior, finding that the latter are computationally more efficient and produce higher-quality samples.
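To make the two-stage design concrete, here is a minimal toy sketch of the data flow only: a prior that maps a caption to an image embedding, and a decoder that samples an image conditioned on that embedding. Everything in it is an assumption for illustration. The names `toy_text_embedding`, `toy_prior`, `toy_decoder`, `generate`, and `variations`, the 8-dimensional embedding, and the Gaussian noise are hypothetical stand-ins and not the authors' models, which use CLIP encoders and diffusion or autoregressive networks.

```python
import math
import random

EMBED_DIM = 8  # hypothetical size; real CLIP embeddings are much larger


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def toy_text_embedding(caption, dim=EMBED_DIM):
    """Stand-in for a text encoder: a deterministic pseudo-random unit vector."""
    rng = random.Random(caption)  # seeded by the caption string
    return normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])


def toy_prior(text_emb, rng, noise=0.3):
    """Stage 1 (assumed form): sample an image embedding given the text embedding."""
    return normalize([t + noise * rng.gauss(0.0, 1.0) for t in text_emb])


def toy_decoder(image_emb, rng, size=4, detail=0.1):
    """Stage 2 (assumed form): sample a size x size 'image' conditioned on the embedding.

    The embedding fixes the coarse content; the added noise plays the role of
    non-essential details that are absent from the embedding.
    """
    return [
        [image_emb[(r * size + c) % len(image_emb)] + detail * rng.gauss(0.0, 1.0)
         for c in range(size)]
        for r in range(size)
    ]


def generate(caption, seed=0):
    """Caption -> image embedding (prior) -> image (decoder)."""
    rng = random.Random(seed)
    image_emb = toy_prior(toy_text_embedding(caption), rng)
    return image_emb, toy_decoder(image_emb, rng)


def variations(image_emb, n, seed=0):
    """Decode the same embedding several times to get differing samples."""
    rng = random.Random(seed)
    return [toy_decoder(image_emb, rng) for _ in range(n)]
```

Calling `generate("a corgi playing a trumpet")` returns an embedding and one toy image, and passing that embedding to `variations` yields several images that share the embedding but differ in their sampled noise, which loosely mirrors the image-variation behavior described above.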
