
Tired of image generation, Google turned to text → video generation, challenging resolution and length

Machine Heart Report

Editors: Zhang Qian, Du Wei

Tech giants such as Google and Meta have moved on to a new frontier.

After more than half a year of racing on text-to-image generation, technology giants such as Meta and Google have set their sights on a new battlefield: text-to-video.

Last week, Meta announced Make-A-Video, a tool that generates high-quality short videos, and the clips it produces are highly imaginative.

Of course, Google is not to be outdone. Its CEO Sundar Pichai has just personally announced the company's latest achievements in this field: two text-to-video tools, Imagen Video and Phenaki. The former focuses on video quality, while the latter mainly takes on video length; each has its own merits.

The teddy bear washing dishes below was generated with Imagen Video. As you can see, both the resolution and the coherence of the frames hold up well.

Imagen Video: Give text prompts and generate high-definition videos

Generative modeling has made significant progress with recent text-to-image AI systems such as DALL-E 2, Imagen, Parti, CogView, and Latent Diffusion. In particular, diffusion models have achieved great success on a variety of generative modeling tasks, including density estimation, text-to-speech, image-to-image, text-to-image, and 3D synthesis.

What Google wants to do is generate videos from text. Previous work on video generation has focused on restricted datasets with autoregressive models, latent variable models with autoregressive priors, and more recently, non-autoregressive latent variable methods. Diffusion models have also demonstrated excellent medium-resolution video generation capabilities.

Building on this, Google launched Imagen Video, a text-conditional video generation system based on a cascade of video diffusion models. Given a text prompt, Imagen Video generates high-definition video through a system consisting of a frozen T5 text encoder, a base video generation model, and cascaded spatial and temporal video super-resolution models.

Paper address: https://imagen.research.google/video/paper.pdf

In the paper, Google details how to scale the system into a high-definition text-to-video model, covering design decisions such as choosing fully convolutional spatiotemporal super-resolution models at certain resolutions and adopting the v-parameterization of the diffusion models. Google also successfully transferred previous findings from diffusion-based image generation to the video setting.
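For readers unfamiliar with the v-parameterization (introduced in the progressive distillation line of work), the idea is that the denoising network predicts a velocity-like quantity v instead of the noise. Below is a minimal sketch of the standard conversions; the helper names are illustrative and not taken from the paper.

```python
def v_target(x0, eps, alpha_t, sigma_t):
    # Noised sample: x_t = alpha_t * x0 + sigma_t * eps.
    # Under the v-parameterization, the network is trained to predict
    # v = alpha_t * eps - sigma_t * x0 rather than eps itself.
    return alpha_t * eps - sigma_t * x0

def recover_x0_eps(x_t, v_pred, alpha_t, sigma_t):
    # Invert a v prediction back into estimates of the clean sample and the noise
    # (assumes a variance-preserving schedule, i.e. alpha_t**2 + sigma_t**2 == 1).
    x0_hat = alpha_t * x_t - sigma_t * v_pred
    eps_hat = sigma_t * x_t + alpha_t * v_pred
    return x0_hat, eps_hat
```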

Google found that Imagen Video can push beyond the 24 fps, 64-frame, 128×128 videos generated by previous work, producing 128-frame 1280×768 HD video. In addition, Imagen Video offers a high degree of controllability and world knowledge, can generate videos and text animations in diverse artistic styles, and understands 3D objects.

Let's enjoy some more videos generated by Imagen Video, such as panda driving:

Wooden ships traveling in space:

For more generated videos, please see: https://imagen.research.google/video/

Methods and experiments

Overall, Google's video generation framework is a cascade of seven video diffusion sub-models that respectively perform text-conditional video generation, spatial super-resolution, and temporal super-resolution. With the full cascade, Imagen Video produces 128-frame 1280×768 HD video (approximately 126 million pixels) at 24 frames per second.

Meanwhile, with the help of progressive distillation, each sub-model of Imagen Video needs only eight diffusion steps to generate high-quality video, which speeds up generation by approximately 18x.
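The article does not spell out how progressive distillation works, so here is a rough, hedged outline of the general recipe (the function names and the starting step count are illustrative assumptions, not details from the paper): each round trains a student whose single sampling step matches two consecutive sampling steps of the teacher, halving the step count until only a handful of steps remain.

```python
def progressive_distillation(teacher, init_student, train_round,
                             start_steps=256, target_steps=8):
    """Illustrative outline of progressive distillation for a diffusion model.

    Each round: a student is initialized from the current teacher and trained so
    that one of its sampler steps reproduces two consecutive teacher steps; the
    sampling-step budget is then halved (e.g. 256 -> 128 -> ... -> 8).
    """
    steps = start_steps
    while steps > target_steps:
        student = init_student(teacher)           # typically a copy of the teacher
        train_round(student, teacher,
                    teacher_steps=steps, student_steps=steps // 2)
        teacher, steps = student, steps // 2      # the student becomes the next teacher
    return teacher, steps
```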

Figure 6 below shows the entire cascade pipeline of Imagen Video: one frozen text encoder, one base video diffusion model, three spatial super-resolution (SSR) models, and three temporal super-resolution (TSR) models. The seven video diffusion models have a combined 11.6 billion parameters.

During generation, the SSR models increase the spatial resolution of all input frames, while the TSR models increase temporal resolution by filling in intermediate frames between the input frames. All models generate an entire block of frames at once, so the SSR models do not suffer from noticeable artifacts.
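To make the cascade easier to picture, here is a hedged pseudocode sketch of the pipeline. The object names and the alternating stage order are assumptions made for illustration; only the final 128-frame, 1280×768, 24 fps output figure comes from the article.

```python
def imagen_video_pipeline(prompt, t5_encoder, base_model, tsr_models, ssr_models):
    # 1. The frozen T5 text encoder turns the prompt into conditioning embeddings.
    text_emb = t5_encoder(prompt)

    # 2. The base video diffusion model samples a short, low-resolution clip.
    video = base_model.sample(text_emb)

    # 3. Alternate temporal and spatial super-resolution stages
    #    (three TSR + three SSR models in the real system; the interleaving
    #    order here is an assumption).
    for tsr, ssr in zip(tsr_models, ssr_models):
        video = tsr.sample(video, text_emb)   # fill in intermediate frames
        video = ssr.sample(video, text_emb)   # upsample every frame spatially

    # Final output, per the article: 128 frames at 1280x768, 24 fps.
    return video
```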

Imagen Video is built on the video U-Net architecture, as shown in Figure 7 below.

In experiments, Imagen Video was trained on the publicly available LAION-400M image-text dataset, 14 million video-text pairs, and 60 million image-text pairs. As a result, as mentioned above, Imagen Video can not only generate high-definition videos but also exhibits capabilities that unstructured generative models learning purely from data do not have.

Figure 8 below shows that Imagen Video can generate videos in artistic styles learned from image data, such as videos in the style of Van Gogh paintings or watercolors.

Figure 9 below shows Imagen Video's grasp of 3D structure: it can generate videos of rotating objects while preserving the overall structure of the object.

Figure 10 below shows how Imagen Video can reliably generate text in a variety of animated styles, some of which are difficult to create using traditional tools.

Please refer to the original paper for more experimental details.

Phenaki: You tell the story and I will draw it

We know that although a video is essentially a sequence of images, generating a coherent long video is not that easy: very little high-quality data is available for this task, and the task itself is computationally demanding.

What's more troublesome is that a short text prompt of the kind used for image generation is usually not enough to fully describe a video; what a video needs is a series of prompts, or a story. Ideally, a video generation model must be able to generate videos of arbitrary length and adjust the generated frames as the prompt changes at any given time t. Only with this ability can a model's output be called "video" rather than a "moving image", opening the road to real creative applications in art, design, and content creation.

Researchers from Google and other institutions said, "To our knowledge, story-based conditional video generation has never been explored before, and this is a first, early paper working toward that goal."

  • Paper link: https://pub-bede3007802c4858abc6f742f405d4ef.r2.dev/paper.pdf
  • Project link: https://phenaki.github.io/#interactive

With no story-based datasets to learn from, the researchers could not simply rely on traditional deep learning methods (that is, simply learning from data) for this task. They therefore designed a model specifically for it.

The new text-to-video model is called Phenaki; it is trained jointly on "text-to-video" and "text-to-image" data. The model has the following capabilities:

1. Generate temporally coherent, diverse videos conditioned on an open-domain prompt, even when the prompt is a novel combination of concepts (see Figure 3 below). The resulting videos can be several minutes long, even though the model was trained on videos of only 1.4 seconds (at 8 frames per second).

2. Generate a video based on a story (i.e. a series of prompts), as shown in Figures 1 and 5 below:

From the following animation, we can see the coherence and diversity of the videos generated by Phenaki:

To achieve these capabilities, the researchers could not rely on existing video encoders, which can only decode fixed-size videos or encode frames independently. To solve this problem, they introduced a new encoder-decoder architecture, C-ViViT.

C-ViViT can:

  • Use the temporal redundancy in video to improve reconstruction quality over per-frame models, while compressing the number of video tokens by 40% or more;
  • Allow variable-length videos to be encoded and decoded thanks to its causal structure.

PHENAKI Model Architecture

Inspired by previous research on autoregressive text-to-image and text-to-video generation, Phenaki's design consists of two main parts (see Figure 2 below): an encoder-decoder model that compresses videos into discrete embeddings (i.e., tokens), and a transformer model that translates text embeddings into video tokens.

Obtaining a compressed representation of a video is one of the main challenges in generating video from text. Previous work either used per-frame image encoders, such as VQ-GAN, or fixed-length video encoders, such as VideoVQVAE. The former allows the generation of videos of arbitrary length, but in practical use, the videos must be short because the encoder cannot compress the video in time and the tokens are highly redundant in consecutive frames. The latter is more efficient in terms of number of tokens, but it does not allow generating videos of arbitrary length.

In Phenaki, the researchers' goal is to generate variable-length videos while compressing the number of video tokens as much as possible, so that a Transformer model can be used within current computational constraints. To this end, they introduce C-ViViT, a causal variant of ViViT with additional architectural changes for video generation, which compresses videos in both the temporal and spatial dimensions while remaining autoregressive in time. This property allows videos of arbitrary length to be generated autoregressively.
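To make the "causal in time" point concrete, here is a hedged sketch of the core idea: spatial attention mixes tokens within each frame, while temporal attention uses a causal mask so that a frame's tokens attend only to the current and earlier frames. This is what lets the encoder handle, and the generator extend, videos of variable length. The shapes and the helper function are illustrative, not the actual C-ViViT code.

```python
import torch

def causal_temporal_mask(num_frames: int) -> torch.Tensor:
    # Lower-triangular mask: frame t may attend to frames 0..t, never to the future.
    return torch.tril(torch.ones(num_frames, num_frames)).bool()

# Example: video tokens with shape (batch, frames, spatial_patches, dim).
tokens = torch.randn(1, 11, 64, 512)            # 11 frames, 64 spatial patches each
mask = causal_temporal_mask(tokens.shape[1])    # (11, 11) causal mask over time
# Spatial attention would mix the 64 patches within each frame; temporal attention
# mixes information across the 11 frames under `mask`, so appending new frames
# never changes the tokens already computed for earlier frames.
```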

To obtain text embeddings, Phenaki also uses a pre-trained language model, T5X.
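Putting the pieces together, story-conditioned generation conceptually looks like the hedged sketch below. The object names and the prompt-by-prompt loop are illustrative assumptions; the article provides no code, and Phenaki's actual token sampler is more involved than a single call.

```python
def phenaki_generate(prompts, t5x_encoder, token_transformer, cvivit_decoder):
    """Sketch of story-conditioned generation: each prompt extends the video,
    conditioned on the video tokens produced so far (possible because of
    C-ViViT's causal structure)."""
    video_tokens = []                                   # grows as the story unfolds
    for prompt in prompts:                              # a "story" is a list of prompts
        text_emb = t5x_encoder(prompt)                  # frozen pre-trained text embeddings
        new_tokens = token_transformer.sample(text_emb, context=video_tokens)
        video_tokens.extend(new_tokens)                 # append tokens for the new segment
    return cvivit_decoder(video_tokens)                 # decode discrete tokens into frames
```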

Please refer to the original paper for specific details.


