In the past few days, a disturbing image of a “white face” Obama has been making the rounds on Twitter. The image was created with the help of a new method for upscaling images1 called PULSE. The researchers behind the method describe PULSE as a new way to achieve super resolution on images of human faces, i.e. to produce high resolution images of human faces from low resolution images of human faces.
Super resolution, in general, is one of the major problems in computer vision that have received increased attention over the past few years, as deep learning methods have slowly revolutionized the field.2
Lately, researchers have increasingly started to employ generative adversarial networks3 to help with super resolution tasks. Importantly, using GANs, researchers have claimed to have gone beyond the “deconvolution limit,”4 i.e. beyond what is “naturally” reconstructable from a limited amount of information.5
The deconvolution limit is a “hard” limit. Hence, transcending this limit has, for decades, been a topos in science fiction. The most prominent example is of course Blade Runner:
A more "recent" one is this X-Files scene:
Since then, transcending the deconvolution limit has become an Internet meme (what a time to be alive, that this is a valid sentence), with shows like NCIS intentionally exploiting the ridiculousness of infinite super resolution.
Thus, there must be a catch. In the case of GANs, the catch is this: GANs are only able to take the super resolution process beyond the deconvolution limit because they introduce additional information about the problem domain in the form of a latent space filled to the brim with images. In other words: GANs cannot reconstruct what is un-reconstructable, but they can take best guesses based exclusively on the additional information supplied. Put differently: super resolution, with GANs, is always, to a degree, hallucination.
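To make the "un-reconstructable" part concrete, here is a minimal NumPy sketch. It is a toy illustration and not part of PULSE; the `downscale` helper and the random 8x8 "images" are assumptions made up for this example. It builds two different high-resolution images that average down to the same low-resolution image, so the low-resolution version alone cannot tell them apart.

```python
import numpy as np

def downscale(img, factor):
    """Average-pool a 2D image by an integer factor (hypothetical helper)."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

rng = np.random.default_rng(0)
high_a = rng.random((8, 8))

# Build a second, different high-res image: add noise whose mean is zero
# inside every 4x4 block, so the block averages stay unchanged.
noise = rng.normal(size=(8, 8))
noise -= np.kron(downscale(noise, 4), np.ones((4, 4)))
high_b = high_a + noise

low_a = downscale(high_a, 4)
low_b = downscale(high_b, 4)

print("high-res images identical:", np.allclose(high_a, high_b))
print("low-res images identical: ", np.allclose(low_a, low_b))
```

Any method that goes from `low_a` back to a single high-resolution image has to pick one of many equally valid candidates, and that choice can only come from outside information.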
The Doom Guy image below is a great example (and is, at this point, almost a nerdy inside joke). Coming from an 8-bit video game, an “unpixelated” version of this face simply does not exist. Nevertheless, PULSE is able to produce something “realistic” by searching the latent space of a GAN (StyleGAN trained on FFHQ) for the closest match. And that, right there, is the problem: a latent space full of white faces will always return some variation of a white face.
With default settings, I got this result. pic.twitter.com/mRkqqTwhJF— Bomze (@tg_bomze) June 20, 2020
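The effect can be sketched with a toy stand-in for the generator. Everything here is an assumption for illustration: `generate` is a made-up function, not StyleGAN, and it can only produce bright images with pixel values between 0.6 and 1.0. The search uses plain random sampling to stay short, whereas PULSE itself optimizes with gradient descent. The point is only that the closest match to a dark low-resolution input is still a bright image, because nothing else exists in this latent space.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 8))  # hypothetical fixed "generator" weights

def generate(z):
    """Toy generator: maps a latent vector to an 8x8 image with pixels in (0.6, 1.0)."""
    return (0.6 + 0.4 / (1.0 + np.exp(-W @ z))).reshape(8, 8)

def downscale(img, factor=4):
    """Average-pool a 2D image by an integer factor."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def search(low_res, n_candidates=5000):
    """Return the sampled image whose downscaled version is closest to low_res."""
    best_img, best_err = None, np.inf
    for _ in range(n_candidates):
        img = generate(rng.normal(size=8))
        err = np.sum((downscale(img) - low_res) ** 2)
        if err < best_err:
            best_img, best_err = img, err
    return best_img, best_err

dark_target = np.full((2, 2), 0.2)  # a dark low-res input the generator cannot match
result, err = search(dark_target)

print("darkest pixel in result:", result.min())
print("remaining mismatch:     ", err)
```

No amount of additional searching closes the gap: every candidate is bright by construction, so the mismatch with the dark input never goes to zero.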
Interestingly, the dataset that the underlying StyleGAN was trained on is more diverse than the results would suggest. Alexia Jolicoeur-Martineau and Tom White (@dribnet) have pointed out that the problems are likely an artifact of mode collapse, a classic open problem in GAN applications.
What is surprising is that StyleGAN is trained on FFHQ, which is supposed to be a much more diverse face dataset alternative to CelebA (which is extremely biased to good-looking white people). So either FFHQ is not diverse enough or StyleGAN/Pulse mode collapse hard.— Alexia Jolicoeur-Martineau (@jm_alexia) June 20, 2020
And indeed, Mario Klingemann has shown that you can get marginally better results by starting the optimization from a different place in latent space.
I had to try my own method for this problem. Not sure if you can call it an improvement, but by simply starting the gradient descent from different random locations in latent space you can already get more variation in the results. pic.twitter.com/dNaQ1o5l5l— Mario Klingemann (@quasimondo) June 21, 2020
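The idea of restarting from different random locations can be sketched as follows. This is not Klingemann's code; it is a minimal toy, assuming a linear "generator" `W` and an averaging downscaler `D` that are both invented for this example. Gradient descent is run from three random starting points, and each run ends at a different high-resolution output that still matches the same low-resolution target.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 4))                    # hypothetical linear generator: G(z) = W @ z
D = np.kron(np.eye(2), np.full((1, 8), 1 / 8))  # averages 16 samples down to 2
A = D @ W                                       # maps a latent vector to the low-res output
target = np.array([0.3, 0.7])                   # the low-res observation

def descend(z, steps=20000):
    """Gradient descent on the squared low-res mismatch ||A z - target||^2."""
    lr = 1.0 / (2.0 * np.linalg.norm(A, 2) ** 2)  # step size from the spectral norm of A
    for _ in range(steps):
        z = z - lr * 2.0 * A.T @ (A @ z - target)
    return z

starts = [rng.normal(size=4) for _ in range(3)]
outputs = [W @ descend(z0) for z0 in starts]

for out in outputs:
    print("downscaled:", D @ out, "| first high-res values:", np.round(out[:3], 3))
```

Because the low-resolution target leaves most of the latent vector unconstrained, where the descent starts decides where it ends. A real generator is nonlinear, so the analogy is loose, but the dependence on the starting point is the same.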
The problem with face super resolution, however, goes beyond the disturbing bias issues that this particular method has. There is simply no real world use case for face super resolution. The story might be different for video games, or the next Avengers movie, but the limitless upscaling of human faces from unrecognizably degraded images simply has no place in the real world.
The researchers behind PULSE are aware of this, of course, and were quick to emphasize that PULSE cannot be utilized to identify people. We all know, however, that law enforcement will use it at some point, no matter the disclaimers.6 COMPAS has shown this much.
In other words, the problem is this (from the Duke press release):
While the researchers focused on faces as a proof of concept, the same technique could in theory take low-res shots of almost anything and create sharp, realistic-looking pictures, with applications ranging from medicine and microscopy to astronomy and satellite imagery […].
Why faces then? Nothing good ever comes from face datasets, as Adam Harvey’s megapixels project reminds us. Deep learning has opened up a plethora of amazing possibilities in computer vision and beyond. None of them absolutely depend on (real world) face datasets. Yes, faces can be nicely aligned. Yes, faces are easy to come by. Yes, generating realistic faces is more impressive than generating realistic, I don’t know, vacuums (to just pick a random ILSVRC-2012 class). The responsibility that comes with face datasets, however, outweighs all of this. Malicious applications will always be, rightfully, presumed by default.
Update: In the meantime, the discussion continued and evolved, with many prominent AI and AI fairness researchers chiming in. Andrey Kurenkov has published a fantastic writeup of this continuing conversation in The Gradient, including lessons learned.
Sachit Menon et al., “PULSE: Self-Supervised Photo Upsampling via Latent Space Exploration of Generative Models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, 2437–45.
Zhihao Wang, Jian Chen, and Steven C. H. Hoi, “Deep Learning for Image Super-Resolution: A Survey,” arXiv preprint arXiv:1902.06068, 2019.
Ian Goodfellow et al., “Generative Adversarial Nets,” in Advances in Neural Information Processing Systems, 2014, 2672–80.
Kevin Schawinski et al., “Generative Adversarial Networks Recover Features in Astrophysical Images of Galaxies Beyond the Deconvolution Limit,” Monthly Notices of the Royal Astronomical Society: Letters 467, no. 1 (2017): L110–L114.
The original Duke press announcement had this line (emphasis mine): “The system cannot be used to identify people, the researchers say: It won’t turn an out-of-focus, unrecognizable photo from a security camera into a crystal clear image of a real person. Rather, it is capable of generating new faces that don’t exist, but look plausibly real.” In the meantime (after it went viral, as the commit history shows), the researchers behind PULSE have added a more extensive disclaimer to the project’s GitHub README.