What is DALL E 2?
DALL·E 2 is a 12-billion-parameter version of GPT-3 (Generative Pre-trained Transformer 3) trained to generate images from text descriptions using a dataset of text–image pairs. We found it to have a diverse set of abilities, including creating anthropomorphized versions of animals and objects, combining unrelated concepts in believable ways, rendering text, and applying transformations to existing images.
GPT-3 showed that language can be used to instruct a large neural network to perform various text generation tasks. Image GPT showed that the same type of neural network can also be used to generate high-fidelity images. We extend these findings to show that manipulation of visual concepts through language is now within reach.
DALL E 2 Overview
Like GPT-3, DALL E 2 is a transformer language model. It accepts both the text and the image as a single data stream containing up to 1280 tokens and is trained with maximum likelihood to generate all of the tokens, one after another. This training procedure allows DALL E 2 not only to generate an image from scratch, but also to regenerate any rectangular region of an existing image that extends to the lower-right corner, in a manner consistent with the text prompt.
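To make the training objective concrete, here is a minimal sketch of autoregressive next-token training over a concatenated text-and-image token stream. Everything here is illustrative: the toy model, vocabulary sizes, and layer counts are assumptions and do not reflect the real architecture.

```python
# Illustrative sketch only: a tiny decoder-style transformer trained with
# next-token cross-entropy over one 1280-token stream (256 text + 1024 image).
import torch
import torch.nn as nn

TEXT_LEN, IMAGE_LEN = 256, 1024          # 1280 tokens per example, as described
TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192    # hypothetical vocabulary sizes

class TinyAutoregressiveModel(nn.Module):
    def __init__(self, d_model=256, vocab=TEXT_VOCAB + IMAGE_VOCAB):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab)

    def forward(self, tokens):
        seq_len = tokens.size(1)
        # Causal mask: each position may only attend to itself and earlier tokens.
        causal = torch.triu(
            torch.full((seq_len, seq_len), float("-inf")), diagonal=1
        )
        h = self.blocks(self.embed(tokens), mask=causal)
        return self.head(h)

# One training step: predict every token from the tokens before it.
model = TinyAutoregressiveModel()
text = torch.randint(0, TEXT_VOCAB, (2, TEXT_LEN))
image = torch.randint(TEXT_VOCAB, TEXT_VOCAB + IMAGE_VOCAB, (2, IMAGE_LEN))
stream = torch.cat([text, image], dim=1)             # one 1280-token stream
logits = model(stream[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), stream[:, 1:].reshape(-1)
)
loss.backward()
```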
We recognize that work involving generative models has the potential for significant, broad societal impact. In the future, we plan to analyze how models like DALL E 2 relate to societal issues, such as the economic impact on certain work processes and professions, the potential for bias in model outputs, and the long-term ethical challenges this technology poses.
DALL E 2 Abilities
We found that DALL E 2 is able to produce plausible images for a wide variety of sentences that explore the compositional structure of language. In the next section, we illustrate this with a series of interactive visuals. The samples shown for each caption in the visuals are obtained by taking the top 32 of 512 after reranking with CLIP, but we do not use any manual cherry-picking, aside from the thumbnails and standalone images that appear outside the visuals.
Drawing multiple objects with DALL E 2
Simultaneously controlling multiple objects, their attributes, and their spatial relationships presents a new challenge. For example, consider the phrase "a hedgehog in a red hat, yellow gloves, a blue shirt, and green pants". To interpret this sentence correctly, DALL E 2 must not only match each article of clothing with the animal, but also form the associations (hat, red), (gloves, yellow), (shirt, blue), and (pants, green) without mixing them up. We test DALL E 2's ability to do this for relative positioning, object stacking, and multi-attribute control.
While DALL E 2 offers some level of control over the attributes and positions of a small number of objects, success can depend on how the caption is worded. As more objects are introduced, DALL E 2 is prone to confusing the associations between objects and their colors, and the success rate drops sharply. We also note that DALL E 2 is brittle with respect to rephrasing the caption in these scenarios: alternative, semantically equivalent captions often do not yield correct interpretations.
Visualization of perspective and three-dimensionality in DALL E 2
We found that DALL E 2 also allows control over the scene’s viewpoint and the 3D style in which the scene is rendered.
To push this further, we test DALL E 2's ability to repeatedly draw the head of a well-known figure at each of a sequence of equally spaced angles, and find that we can recover a smooth animation of the rotating head.
DALL E 2 appears to be able to apply some types of optical distortion to scenes, as seen with the fisheye and spherical panorama options. This motivated us to investigate its ability to generate reflections.
Visualization of internal and external structure in DALL E 2
The “Extreme Detail View” and “X-ray” style samples led us to further explore DALL E 2’s ability to render internal structure using cross-sections and external structure using macro photos.
Deriving contextual details in DALL E 2
The task of translating text to images is underspecified: a single caption generally corresponds to an infinite number of plausible images, so the image is not uniquely determined. For example, take the caption "a painting of a capybara sitting in a field at sunrise". Depending on the capybara's orientation, it may be necessary to draw a shadow, even though this detail is never mentioned explicitly. We examine DALL E 2's ability to resolve underspecification in three cases: changing style, setting, and time; drawing the same subject in a variety of different situations; and generating an image of an object with specific text written on it.
With varying degrees of reliability, DALL E 2 provides access to a subset of the 3D rendering engine’s capabilities through natural language. It can independently control the attributes of a small number of objects and, to a limited extent, how many there are and how they are arranged in relation to each other. It can also control the location and angle from which the scene is rendered, and can generate known objects according to exact specifications of angle and lighting conditions.
Unlike a 3D rendering engine whose inputs must be specified unambiguously and in complete detail, DALL E 2 is often able to “fill in the blanks” when the caption indicates that the image must contain some detail that is not explicitly stated.
Applying the preceding abilities in DALL E 2
Next, we explore how the preceding capabilities can be used for fashion and interior design.
Combining unrelated concepts in DALL E 2
The compositional nature of language allows us to put concepts together to describe both real and imaginary things. We found that DALL E 2 also has the ability to combine disparate ideas to synthesize objects, some of which are unlikely to exist in the real world. We examine this ability in two cases: transferring qualities from various concepts to animals, and designing products by drawing inspiration from unrelated concepts.
Illustration of animals in DALL E 2
In the previous section, we explored DALL E 2’s ability to combine unrelated concepts to generate images of real-world objects. Here, we explore this ability in the context of art for three kinds of illustrations: anthropomorphized versions of animals and objects, animal chimeras, and emoticons.
Zero-Shot Visual Reasoning in DALL E 2
GPT-3 can be instructed to perform many kinds of tasks solely from a description and a cue to generate the answer supplied in its prompt, without any additional training. For example, when prompted with the phrase "here is the sentence 'a person walks his dog in the park' translated into French:", GPT-3 responds "un homme qui promène son chien dans le parc." This ability is called zero-shot reasoning. We found that DALL E 2 extends this ability to the visual domain, and is able to perform several kinds of image-to-image translation tasks when prompted in the right way.
We did not anticipate that this ability would emerge, and made no modifications to the neural network or training procedure to encourage it. Motivated by these results, we measure DALL E 2's aptitude for analogical reasoning by testing it on Raven's Progressive Matrices, a visual IQ test that saw widespread use in the 20th century.
Geographic knowledge in DALL E 2
We found that DALL E 2 has learned about geographic facts, landmarks, and neighborhoods. Its knowledge of these concepts is surprisingly precise in some ways and flawed in others.
Time knowledge in DALL E 2
In addition to exploring DALL E 2's knowledge of concepts that vary over space, we also explore its knowledge of concepts that vary over time.
Summary of approach and previous work in DALL E 2
DALL E 2 is a simple decoder-only transformer that receives both the text and the image as a single stream of 1280 tokens – 256 for the text and 1024 for the image – and models all of them autoregressively. The attention mask at each of its 64 self-attention layers allows each image token to attend to all text tokens. DALL E 2 uses a standard causal mask for the text tokens, and sparse attention for the image tokens with either a row, column, or convolutional attention pattern, depending on the layer. We provide more details about the architecture and training procedure in our paper.
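As a rough illustration of how such a mask might be assembled, the sketch below combines a causal text mask with a simplified row-sparse image pattern. The grid size, token counts, and the exact sparse pattern are assumptions chosen for readability; the real row/column/convolutional patterns are more involved.

```python
# Simplified sketch: build a boolean "allowed to attend" matrix that is causal
# for text tokens, lets image tokens see all text tokens, and restricts image
# tokens to earlier image tokens in the same row of the token grid.
import numpy as np

TEXT_LEN, GRID = 4, 3                  # tiny sizes for readability
IMAGE_LEN = GRID * GRID
TOTAL = TEXT_LEN + IMAGE_LEN

allowed = np.zeros((TOTAL, TOTAL), dtype=bool)

# Text tokens: standard causal mask (attend to themselves and earlier text).
for i in range(TEXT_LEN):
    allowed[i, : i + 1] = True

# Image tokens always attend to all text tokens...
allowed[TEXT_LEN:, :TEXT_LEN] = True

# ...and, in this simplified "row" pattern, to earlier image tokens in the
# same row (still causal overall).
for i in range(IMAGE_LEN):
    row_start = (i // GRID) * GRID
    q = TEXT_LEN + i
    allowed[q, TEXT_LEN + row_start : q + 1] = True

print(allowed.astype(int))
```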
Text-to-image synthesis has been an active area of research since the pioneering work of Reed et al., whose approach uses a GAN conditioned on text embeddings. The embeddings are produced by an encoder pretrained with a contrastive loss, not unlike CLIP. StackGAN and StackGAN++ use multi-scale GANs to scale up the image resolution and improve visual fidelity. AttnGAN incorporates attention between text and image features, and proposes a contrastive text–image feature matching loss as an auxiliary objective.
This is interesting to compare to our reranking with CLIP, which is done offline. Other work incorporates additional sources of supervision during training to improve image quality. Finally, work by Nguyen et al. and Cho et al. explores sampling-based strategies for image generation that leverage pretrained multimodal discriminative models.
Similar to the rejection sampling used in VQ-VAE-2, we use CLIP to rerank the top 32 of 512 samples for each caption in all of the interactive visuals. This procedure can also be seen as a kind of language-guided search, and can have a dramatic impact on sample quality.
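Below is a hedged sketch of this reranking step: score many candidate images against a caption with CLIP and keep the top 32. It uses the open-source `clip` package (github.com/openai/CLIP); the candidate image paths and caption are placeholders, and the real pipeline may differ in detail.

```python
# Sketch of CLIP reranking: keep the candidates whose embeddings best match
# the caption by cosine similarity.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def rerank(caption, image_paths, keep=32):
    """Return the `keep` image paths whose CLIP score best matches `caption`."""
    text = clip.tokenize([caption]).to(device)
    images = torch.stack([preprocess(Image.open(p)) for p in image_paths]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(text)
        image_feat = model.encode_image(images)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        scores = (image_feat @ text_feat.T).squeeze(-1)   # cosine similarities
    top = scores.argsort(descending=True)[:keep]
    return [image_paths[int(i)] for i in top]

# e.g. best = rerank("an armchair in the shape of an avocado", candidate_paths)
```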
DALL E 2, OpenAI’s image generation AI system, is finally available as an API, meaning developers can build the system into their apps, websites and services. In a blog post today, OpenAI announced that any developer can start harnessing the power of DALL E 2 – now used by more than three million people to create more than four million images per day – once they create an OpenAI API account as part of the public beta.
DALL E 2 API pricing varies by resolution. For 1024×1024 images, the cost is $0.02 per image; 512×512 images are $0.018 per image; and 256×256 images are $0.016 per image. Volume discounts are available to companies working with the OpenAI enterprise team.
As with the DALL E 2 beta, the API allows users to create new images from text prompts (e.g., "a fluffy bunny hopping across a field of flowers") or to edit existing images. Microsoft, a close OpenAI partner, uses it in Bing and Microsoft Edge with its Image Creator tool, which lets users create images when web results don't return what they're looking for. Fashion design app CALA uses the DALL E 2 API as a tool that lets customers refine design ideas from textual descriptions or images, while photography startup Mixtiles is bringing it into its users' artwork-creation process.
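For developers, a call to the API looks roughly like the sketch below. It assumes the `openai` Python package as it existed at the public-beta launch (the v0.x interface; later library versions renamed these methods), and the API key and prompt are placeholders.

```python
# Minimal sketch of generating an image via the DALL E 2 API (openai v0.x).
import openai

openai.api_key = "sk-..."  # placeholder: your OpenAI API key

# The size parameter determines the per-image price quoted above:
# 1024x1024 = $0.02, 512x512 = $0.018, 256x256 = $0.016.
response = openai.Image.create(
    prompt="a fluffy bunny hopping across a field of flowers",
    n=1,
    size="1024x1024",
)
print(response["data"][0]["url"])  # URL of the generated image
```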
The launch of the API doesn’t change much in terms of policy, which is likely to disappoint those who worry that generative AI systems like DALL E 2 are being released without sufficient consideration of the ethical and legal issues they pose. As before, users are bound by OpenAI’s terms of service, which prohibit using DALL E 2 to generate overtly violent, sexual, or hateful content. OpenAI also continues to block users from uploading images of people without their consent or images they don’t have rights to, using a combination of automated and human monitoring systems to enforce this.
One minor change is that images generated via the API will not need to include a watermark. OpenAI introduced watermarking during the DALL E 2 beta as a way to indicate which images came from the system, but decided to make it optional with the API launch.
OpenAI also applies prompt-level and image-level filters to DALL E 2, although some customers have complained that the filters are overzealous and imprecise. And the company has focused some of its research efforts on diversifying the kinds of images DALL E 2 generates, to combat the biases that text-to-image AI systems tend to exhibit (e.g., generating images of mostly white men when prompted with text like "examples of CEOs").
But these moves have not appeased every critic. In August, Getty Images banned the uploading and sale of illustrations created with DALL E 2 and other similar tools, following similar decisions by sites including Newgrounds, PurplePort and FurAffinity. Getty Images CEO Craig Peters told The Verge that the ban was prompted by concerns about "unaddressed rights issues," because the training datasets for systems like DALL E 2 contain copyrighted images scraped from the web.
Many critics say it’s not just copyright infringement that concerns them about DALL E 2. The system threatens the livelihoods of artists whose styles can now be replicated with a few strings of text, they say, including artists who didn’t consent to their work being used to train DALL E 2. (To be fair to OpenAI, the company has licensed some of the images in the DALL E 2 training dataset, which is more than can be said for some of its competitors.)
In an attempt to find a middle ground, Getty Images competitor Shutterstock recently announced that it will start using DALL E 2 to create content, but will also launch a "contributor fund" to compensate creators when the company sells their work to train text-to-image AI systems. It also bans third-party AI-generated artwork from being uploaded, to minimize the chance of copyrighted work entering the platform.
Technologists Mat Dryhurst and Holly Herndon are leading an effort called Source+ to allow people to ban the use of their work or likeness for AI training purposes. But it’s voluntary. OpenAI has not said whether it will participate — or indeed whether it will ever implement a self-service tool that allows rights holders to exclude their work from training or content creation.
In an interview, Miller revealed few details about the new mitigations, other than that OpenAI is improving its techniques for preventing the system from generating biased, toxic and otherwise offensive content that customers might find objectionable. He described the API’s open beta as an "iterative" process that will involve working with "users and artists" over the next few months as OpenAI scales the infrastructure powering DALL E 2.
Of course, if the DALL E 2 beta is any indication, the API program will evolve over time. OpenAI initially disabled the ability to edit people’s faces with DALL E 2, but later enabled it after improving its safety systems.
"We’ve done a lot of work on this front – both on the images you upload and the prompts you submit – to bring it in line with our content policies, and we’ve put various mitigations in place at the prompt level and at the image level to make sure it complies with our content policies. So, for example, if someone uploaded an image that contained symbols of hate or gore – like very, very, very violent content – it would be rejected," Miller said. "We’re always thinking about how we can improve the system."
But while OpenAI seems keen to avoid the controversy surrounding Stable Diffusion, the open-source alternative to DALL-E 2 that has been used to create porn, gore and celebrity deepfakes, it leaves it up to API users to choose exactly how and where to deploy its technology. Some, like Microsoft, will no doubt take a deliberate approach and roll out DALL-E 2-powered products slowly to gather feedback. Others will dive in headfirst, embracing both the technology and the ethical dilemmas that come with it.
If one thing’s for sure, it’s that there is pent-up demand for generative AI—consequences be damned. Even before the API was officially available, developers published solutions to integrate DALL-E 2 into applications, services, websites, and even video games. With the public beta launch, backed by OpenAI’s tremendous marketing potential, synthetic images are poised to truly enter the mainstream.