Combining look (left) and expression (right) into a single image (middle)

Update: I extended this work to produce face animations, which you can find in this post.

Many of you have probably heard that existing face images can be projected into the latent space of a pre-trained Generative Adversarial Network (GAN). This is usually done by minimizing a perceptual loss with respect to a latent code: we attempt to replicate a given image with what the network has learned. Luckily, pre-trained StyleGAN models are readily available for this purpose.
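
To make the idea of projection concrete, here is a minimal toy sketch in plain Python. It is not the StyleGAN2-ada projector: `toy_generator`, `squared_error`, and `project` are hypothetical stand-ins, the "generator" is a fixed linear map, and gradients come from finite differences instead of backpropagation. It only shows the principle that the generator stays fixed while the latent code is optimized.

```python
def project(generator, loss_fn, target, w_init, lr=0.05, steps=200, eps=1e-5):
    """Gradient descent on the latent code w; the generator itself is never changed."""
    w = list(w_init)
    for _ in range(steps):
        base = loss_fn(generator(w), target)
        grad = []
        for i in range(len(w)):
            w_shifted = list(w)
            w_shifted[i] += eps
            # Forward finite difference as a stand-in for backpropagation.
            grad.append((loss_fn(generator(w_shifted), target) - base) / eps)
        w = [w_i - lr * g_i for w_i, g_i in zip(w, grad)]
    return w


def toy_generator(w):
    # Hypothetical stand-in for a pre-trained generator: a fixed linear map.
    return [w[0] + w[1], w[0] - w[1]]


def squared_error(image, target):
    # Hypothetical stand-in for a perceptual loss such as LPIPS.
    return sum((a - b) ** 2 for a, b in zip(image, target))


# An "image" that the toy generator is able to produce.
target = toy_generator([1.0, -2.0])
w_found = project(toy_generator, squared_error, target, w_init=[0.0, 0.0])
print(w_found)  # close to [1.0, -2.0]
```

With a real generator, the latent code would be a tensor, the loss would be computed on deep features, and an optimizer such as Adam would replace the hand-written update.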

But we can also play around with the loss function, which lets us change what we actually project into the latent space. Recently I did exactly that: instead of minimizing only a perceptual loss, I added a facial landmark loss. Now we can generate a face image that has the look of one image while displaying the facial expression of another.

The code is based on the original StyleGAN2-ada repo [0]. To project facial landmarks, the L2 norm of the difference between the landmark heat maps of the projection image and those of the target landmark image is minimized, alongside the original LPIPS loss [2]. The landmark heat maps are extracted with [1]. Thus, there are two target images, one for the look and one for the landmarks. The objective becomes (noise regularization omitted):

\[loss = \lambda_{lpips} LPIPS(x_{projection}, x_{target\_look}) + HL(x_{projection}, x_{target\_landmark}),\]

with HL being the heat map loss defined as

\[HL(x_1, x_2) = \sum_{i=1}^{N} \lambda_{landmark} \sqrt{(FAN(x_1)_i - FAN(x_2)_i)^2},\]

where N is the number of pixels and FAN is the landmark heat map extraction model, which outputs a three-dimensional tensor whose depth dimension holds one heat map per landmark. LPIPS is used as in [0, 2]. Note that \(\lambda_{landmark}\) is a vector containing one weight for each group of landmarks, for example eyebrows, eyes, mouth, etc. Check [1] for more info. By tweaking this vector, you can determine which facial features are projected more strongly into the generated images. See below for an example.
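
As a rough illustration of how the group weights enter the objective, here is a small sketch in plain Python. Everything in it is an assumption made for illustration: the function names are hypothetical, the group index ranges follow the common 68-point landmark layout rather than anything stated in this post, and heat maps are nested lists instead of tensors. Each group weight is copied to every landmark channel in that group, and the loss sums the weighted per-pixel term \(\sqrt{(a - b)^2} = |a - b|\).

```python
# Assumed group layout (common 68-point scheme); not taken from the post or from [1].
LANDMARK_GROUPS = {
    "jaw": range(0, 17),
    "eyebrows": range(17, 27),
    "nose": range(27, 36),
    "eyes": range(36, 48),
    "mouth": range(48, 68),
}


def channel_weights(group_weights, groups=LANDMARK_GROUPS, num_landmarks=68):
    """Expand one weight per landmark group into one weight per landmark channel."""
    weights = [0.0] * num_landmarks
    for name, indices in groups.items():
        for idx in indices:
            weights[idx] = group_weights[name]
    return weights


def heatmap_loss(heat_1, heat_2, weights):
    """Weighted sum over all heat map pixels of |a - b|.

    heat_1 and heat_2 are nested lists indexed as [landmark][row][col].
    """
    total = 0.0
    for w, chan_1, chan_2 in zip(weights, heat_1, heat_2):
        for row_1, row_2 in zip(chan_1, chan_2):
            for a, b in zip(row_1, row_2):
                total += w * abs(a - b)
    return total


def combined_objective(lpips_value, hl_value, lambda_lpips):
    """Weighted LPIPS term (look target) plus heat map term (landmark target)."""
    return lambda_lpips * lpips_value + hl_value


# Toy example: 3 landmarks with 1x2-pixel heat maps, mouth weighted half as much as eyes.
toy_groups = {"eyes": range(0, 2), "mouth": range(2, 3)}
weights = channel_weights({"eyes": 1.0, "mouth": 0.5}, toy_groups, num_landmarks=3)
heat_a = [[[0.0, 1.0]], [[0.5, 0.5]], [[1.0, 0.0]]]
heat_b = [[[0.0, 0.0]], [[0.5, 0.0]], [[0.0, 0.0]]]
hl = heatmap_loss(heat_a, heat_b, weights)
total = combined_objective(lpips_value=0.3, hl_value=hl, lambda_lpips=2.0)
print(hl, total)
```

Raising the weight of one group, say the mouth, makes mismatches in that region more expensive, so the optimizer is pushed to match that part of the expression more closely.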

It is really not perfect, and still has some bugs, but it is fun to play around with. Check it out here. I have also included a Google Colab link in the repo, where you can play around with it yourself :).

Huge shout-out to the StyleGAN team and NVIDIA for their work and pre-trained models. Images are from the FFHQ dataset.


[0]: Karras, Tero, et al. “Training generative adversarial networks with limited data.” arXiv preprint arXiv:2006.06676 (2020). Code:

[1]: Bulat, Adrian, and Georgios Tzimiropoulos. “How far are we from solving the 2D & 3D face alignment problem? (and a dataset of 230,000 3D facial landmarks).” Proceedings of the IEEE International Conference on Computer Vision. 2017. Code:

[2]: Zhang, Richard, et al. “The unreasonable effectiveness of deep features as a perceptual metric.” Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.