Artificial intelligence research group OpenAI has created a new version of DALL-E, its text-to-image generation program. DALL-E 2 offers a higher-resolution, lower-latency version of the original system, which produces images matching descriptions written by users. It also includes new features, like editing an existing image. As with previous OpenAI work, the tool is not being released directly to the public. But researchers can register online to preview the system, and OpenAI hopes to later make it available for use in third-party apps.
The original DALL-E, a portmanteau of the artist “Salvador Dalí” and the robot “WALL-E”, debuted in January 2021. It was a limited but fascinating test of AI’s ability to visually represent concepts, from mundane depictions of a mannequin in a flannel shirt to “a giraffe made of turtle” or an illustration of a radish walking a dog. At the time, OpenAI said it would continue to build on the system while examining potential dangers such as bias in image generation or the production of misinformation. DALL-E 2 attempts to address these issues with technical safeguards and a new content policy, while also reducing its computing load and advancing the core capabilities of the model.
One of the new features of DALL-E 2, inpainting, applies DALL-E’s text-to-image capabilities at a more granular level. Users can start with an existing image, select an area, and have the model modify it. You can block out a painting on a living room wall and replace it with a different picture, for example, or add a vase of flowers to a coffee table. The model can fill in (or remove) objects while accounting for details such as the direction of shadows in a room. Another feature, Variations, is something like an image search tool for images that don’t exist. Users can upload a starter image and then generate a range of variations similar to it. They can also blend two images, generating pictures that contain elements of both. The generated images measure 1024 x 1024 pixels, a jump from the 256 x 256 pixels produced by the original model.
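Conceptually, inpainting works by masking a region of the image and letting a generative model fill in only the masked pixels while the rest stay untouched. The sketch below is a toy illustration of that masking step, not OpenAI’s actual implementation; the `generated` argument is a hypothetical stand-in for whatever content a real model would synthesize for the masked region.

```python
def inpaint(image, mask, generated):
    """Toy inpainting: keep original pixels where mask is 0,
    substitute model-generated content where mask is 1.

    A real system conditions a generative model on the unmasked
    context (shadows, lighting, surrounding objects); here
    'generated' simply stands in for that synthesized region.
    """
    return [g if m else p for p, m, g in zip(image, mask, generated)]


# Replace the two middle "pixels" of a 1-D toy image.
original = [1, 2, 3, 4]
mask = [0, 1, 1, 0]          # 1 marks the region the user selected
filled = inpaint(original, mask, [9, 9, 9, 9])
```

The key property is that unmasked pixels pass through unchanged, which is why an edit to one painting on a wall leaves the rest of the room intact.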
DALL-E 2 is based on CLIP, a computer vision system that OpenAI also announced last year. “DALL-E 1 just took our GPT-3 approach to language and applied it to produce an image: we compressed images into a series of words and we just learned how to predict what’s next,” says Prafulla Dhariwal, a researcher at OpenAI, referring to the GPT model used by many text-based AI applications. But the word matching didn’t necessarily capture the qualities humans found most important, and the predictive process limited the realism of the images. CLIP was designed to look at images and summarize their contents the way a human would, and OpenAI iterated on this process to create “unCLIP” – a reversed version that starts with the description and works toward an image. DALL-E 2 generates the image through a process called diffusion, which Dhariwal describes as starting with a “bag of dots” and then filling in a pattern with greater and greater detail.
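The “bag of dots” description can be made concrete with a toy loop: diffusion sampling starts from pure noise and repeatedly denoises it. This sketch is purely illustrative and assumes nothing about OpenAI’s model; a real diffusion model uses a trained neural network to predict the noise at each step, whereas here we simply blend toward a known target to show the shape of the iteration.

```python
import random


def toy_reverse_diffusion(target, steps=50, seed=0):
    """Illustrative diffusion-style sampling loop.

    Start from random noise (the 'bag of dots') and move toward a
    detailed result over many small denoising steps. In a real model,
    each step's update comes from a learned noise predictor; here the
    target itself plays that role, which is the toy simplification.
    """
    rng = random.Random(seed)
    image = [rng.gauss(0, 1) for _ in target]  # step 0: pure noise
    for step in range(steps):
        alpha = (step + 1) / steps             # simple linear schedule
        image = [(1 - alpha) * x + alpha * t   # blend noise toward target
                 for x, t in zip(image, target)]
    return image


result = toy_reverse_diffusion([0.2, 0.8, 0.5])
```

The point of the many-step structure is that each iteration only commits to a little more detail, which is what lets real diffusion models refine global layout before fine texture.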
Interestingly, a draft paper on unCLIP says it partially overcomes a very amusing weakness of CLIP: the fact that people can fool the model’s identification abilities by labeling one object (like a Granny Smith apple) with a word indicating something else (such as an iPod). The variations tool, according to the authors, still generates images of apples with high probability even when given a mislabeled image that CLIP cannot identify as a Granny Smith. Conversely, the model never produces iPod images, despite the very high predicted relative probability of that caption.
The full DALL-E model has never been released publicly, but over the past year other developers have been honing their own tools that imitate some of its functions. One of the most popular mainstream applications is Wombo’s Dream mobile app, which generates images of whatever users describe in a variety of art styles. OpenAI isn’t releasing any new models today, but developers could use its technical findings to update their own work.
OpenAI has implemented some built-in safeguards. The model was trained on data with some objectionable material weeded out, ideally limiting its ability to produce objectionable content. Generated images carry a watermark indicating their AI-generated nature, although it could theoretically be cropped out. As a preemptive anti-abuse feature, the model also can’t generate recognizable faces based on a name – even asking for something like the Mona Lisa would apparently return a variation on the actual face from the painting.
DALL-E 2 can be tested by approved partners with some caveats. Users are prohibited from uploading or generating images that are “not G-rated” and “could cause harm”, including anything involving hate symbols, nudity, obscene gestures, or “major conspiracies or events related to major ongoing geopolitical events”. They must also disclose the AI’s role in generating the images, and they can’t serve generated images to others through an app or website – so you won’t initially see a DALL-E-powered version of something like Dream. But OpenAI hopes to later add the model to its API toolset, allowing it to power third-party apps. “Our hope is to continue to follow a staged process here, so that we can keep evaluating, from the feedback we receive, how to release this technology safely,” Dhariwal says.
Additional reporting by James Vincent.