Text-to-Video: Top 4 Players in the New Frontier of Generative AI

With adoption growing across many fields, generative artificial intelligence has become indispensable. Best known for its "text-to-text" (ChatGPT) and "text-to-image" (Midjourney, Stable Diffusion) applications, generative AI is now moving into new territory: video generation, and specifically "text-to-video". Until recently, this domain seemed out of reach, mainly because of the lack of high-quality training data pairing text with video and because of the high computational cost. New approaches are changing the game; here are four examples.

Google Phenaki

Developed by Google Research, Phenaki illustrates this new direction well. It can generate videos from a sequence of prompts, producing long and coherent "visual stories." Its base resolution is 128x128 pixels. By pairing it with Imagen Video, another Google Research system that specializes in super-resolution, the team refined Phenaki's outputs into more convincing results. Faced with scarce training data, Phenaki stands out by showing that combining a large image-text corpus with a smaller number of text-video pairs yields generalization beyond what video datasets alone provide. https://phenaki.github.io
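To make the idea of combining the two data sources concrete, here is a minimal sketch in Python. It is not Phenaki's actual training code: the function name make_mixed_batch, the video_fraction parameter and its value of 0.2, and the choice to treat a still image as a one-frame clip are all assumptions made for illustration, and strings stand in for pixel data.

```python
import random


def make_mixed_batch(image_text_pairs, text_video_pairs, batch_size, video_fraction, rng):
    """Draw a training batch from both sources (hypothetical helper).

    Each example ends up with the same shape: a caption plus a list of frames.
    """
    batch = []
    for _ in range(batch_size):
        if rng.random() < video_fraction:
            caption, frames = rng.choice(text_video_pairs)
        else:
            caption, image = rng.choice(image_text_pairs)
            frames = [image]  # assumption: a still image is a one-frame video
        batch.append({"caption": caption, "frames": list(frames)})
    return batch


# Toy data: strings stand in for pixel arrays.
image_text_pairs = [("a red bicycle", "img0"), ("a snowy street", "img1")]
text_video_pairs = [("a dog running", ["f0", "f1", "f2"])]

rng = random.Random(0)
batch = make_mixed_batch(
    image_text_pairs, text_video_pairs, batch_size=8, video_fraction=0.2, rng=rng
)
for example in batch:
    print(example["caption"], len(example["frames"]))
```

Because every example has the same structure, a single model can in principle learn appearance from the plentiful images and motion from the scarcer videos.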

Nvidia

For its part, Nvidia has recently unveiled the results of its own "text-to-video" research. Its model can generate short 4.7-second videos at 1280x2048 pixels, or longer sequences at 512x1024. It can thus produce several minutes of lower-quality, dashcam-style video in a "temporally coherent" manner. Based on Latent Diffusion Models (LDM), Nvidia's technology offers a viable alternative that does not require astronomical computing power. By adding a temporal dimension to a "text-to-image" model (specifically, Stable Diffusion), Nvidia succeeds in animating still images realistically and enhancing them with super-resolution techniques. https://research.nvidia.com/labs/toronto-ai/VideoLDM/
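The idea of adding a temporal dimension to an image model can be sketched with a toy example. This is not Nvidia's implementation: spatial_layer and temporal_layer are hypothetical stand-ins for real neural network layers, the mix parameter is an assumed blending weight, and each "frame" is just a short list of numbers. The sketch only shows the division of labor: the image layer sees one frame at a time, while the added temporal layer shares information across frames.

```python
import random


def spatial_layer(frame):
    """Stand-in for a pretrained image layer: processes one frame in isolation."""
    mean = sum(frame) / len(frame)
    return [x - mean for x in frame]


def temporal_layer(frames, mix):
    """Stand-in for an added temporal layer: blends each frame with the time average."""
    count = len(frames)
    time_mean = [sum(f[i] for f in frames) / count for i in range(len(frames[0]))]
    return [[(1.0 - mix) * x + mix * m for x, m in zip(f, time_mean)] for f in frames]


def video_block(frames, mix):
    """Image layer applied frame by frame, then temporal mixing across frames."""
    per_frame = [spatial_layer(f) for f in frames]
    return temporal_layer(per_frame, mix)


def flicker(frames):
    """Mean absolute change between consecutive frames."""
    diffs = [abs(a - b) for f0, f1 in zip(frames, frames[1:]) for a, b in zip(f0, f1)]
    return sum(diffs) / len(diffs)


rng = random.Random(0)
latents = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(8)]  # 8 toy frames

out_image_only = video_block(latents, mix=0.0)  # behaves like the image model alone
out_video = video_block(latents, mix=0.5)       # frames now share information

print(flicker(out_image_only) > flicker(out_video))
```

With mix set to 0 the block reduces to the original per-frame image model, which suggests why an existing "text-to-image" model can be reused: only the temporal part has to be learned from video.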

Runway Gen-2

The generative AI startup Runway, known for co-creating the "text-to-image" model Stable Diffusion, has also developed the generative AI model Gen-1. Gen-1 can transform existing videos into new ones by applying any style specified by a text instruction or a reference image ("video-to-video", "image-to-video"). More recently, Runway presented Gen-2, its generative "text-to-video" model. Although the underlying technology is poorly documented, the results are among the best available. Could Runway's progress in this field match that of OpenAI in "text-to-text"? Only time will tell, but it is a topic to watch closely. https://runwayml.com/ai-magic-tools/gen-2/

Meta AI - Make-A-Video

Still in the research phase at Meta, Make-A-Video is a promising solution that uses labeled images to "learn to represent the world" and to "understand how it is frequently described." It also uses unlabeled videos to "grasp the motion of the world." This approach offers three notable advantages:

faster training of the "text-to-video" model, which no longer needs to learn visual and multimodal representations from scratch,
no need for paired text-video data,
generated videos that benefit from the broad aesthetic and representational diversity of current image generation models.

https://makeavideo.studio

The ability to convert a prompt into a video probably represents the next revolution in generative AI. The possibilities for content creation are immense, with applications in fields as diverse as cinema, advertising and education, but they also bring new ethical challenges. Of course, the current results may seem laughable, especially when compared with the capabilities of "text-to-image" models. But let's not forget the progress these same models made in a single year:

[Image: progress of "text-to-image" models over one year]

Given how quickly these innovations are evolving, the industries concerned should take a close interest in these techniques. Experimenting with prompt techniques specific to text-to-video and using less sophisticated or more mature generative solutions (such as "video-to-video") are avenues worth exploring. And why not consider, as some brands have done with "text-to-image" models, launching a campaign openly based on AI that is sure to get people talking?