Benchmarks
Veo 3.1
Veo 3.1 has achieved state-of-the-art results in head-to-head comparisons, in which human raters evaluated its outputs against those of top video generation models.
T2V Overall preference
Participants viewed 1,003 prompts and their corresponding videos from MovieGenBench, a benchmark dataset released by Meta. Veo 3.1 performs best on overall preference.
T2V Text alignment
On the same 1,003 MovieGenBench prompts, Veo 3.1 performs best at following prompts accurately.
T2V Visual quality
On the same benchmark, participants rate the visual quality of Veo 3.1’s outputs more highly than that of other models.
I2V Overall preference
When participants viewed 355 image and text pairs from the VBench I2V benchmark, Veo 3.1’s outputs were preferred overall to those of other models.
Note: We were unable to include Sora 2 Pro in the image-to-video comparisons because it does not currently support realistic human images.
I2V Text alignment
On the same 355 VBench I2V image and text pairs, Veo 3.1’s outputs were preferred to those of other models for capturing the intent of the prompt.
I2V Visual quality
On the same benchmark, Veo 3.1’s outputs were preferred to those of other models for visual quality.
T2VA Audio visual overall preference
Participants viewed 527 prompts from MovieGenBench and preferred Veo 3.1’s outputs with audio over those of other models.
T2VA Audio-video alignment
On the same 527 prompts, participants chose Veo 3.1’s outputs over those of other models for audio that is better synchronized with the video content.
T2V Visually realistic physics
Participants chose Veo 3.1’s outputs over those of other models for visually realistic physics on the physics subset of MovieGenBench prompts.
Veo capabilities
Veo’s Ingredients to Video, Scene Extension, First and Last Frame, and Object Insertion capabilities have achieved state-of-the-art results in head-to-head comparisons of outputs by human raters on internal benchmarks.
Ingredients to video
Veo’s “Ingredients to Video” capability has achieved state-of-the-art results for Overall Preference and Visual Quality in head-to-head comparisons by human raters against other leading video generation models on internal benchmarks. [1]
[1] Human raters conducted direct side-by-side comparisons across 364 diverse examples (each including a prompt and 1-3 reference images), evaluating a single generated video per example. All comparisons were done at 1280x720 resolution. Veo videos are 8 seconds long. All other videos are 10 seconds long and shown at full length to raters.
To ensure a fair visual comparison, all tests were conducted without sound. Audio was only enabled for the Overall Preference metric, and only when competing models had native sound support for the capability. Chart labels indicate when audio was an active part of the comparison.
Scene extension
Veo’s “Scene Extension” capability has achieved state-of-the-art results for Overall Preference, Prompt Alignment, and Visual Quality in head-to-head comparisons by human raters against other leading video generation models on internal benchmarks. [1]
[1] Human raters conducted direct side-by-side comparisons across 80 diverse examples (each including an initial text prompt and an extension prompt), evaluating one generated video per example. All comparisons were done at 720x1280 resolution. Veo videos are 8 seconds long. All other videos are 6 seconds long and shown at full length to raters.
The same audio protocol applied: tests were conducted without sound, with audio enabled for the Overall Preference metric only when competing models had native sound support, as indicated on the chart labels.
First and last frame
Veo’s “First and Last Frame” capability has achieved state-of-the-art results for Overall Preference, Prompt Alignment, and Visual Quality in head-to-head comparisons by human raters against other leading video generation models on internal benchmarks. [1]
[1] Human raters conducted direct side-by-side comparisons across 106 diverse examples (each including a prompt, a start image, and an end image), evaluating one generated video per example. All comparisons were done at 720x1280 resolution. Veo videos are 8 seconds long. All other videos are 10 seconds long and shown at full length to raters.
The same audio protocol applied: tests were conducted without sound, with audio enabled for the Overall Preference metric only when competing models had native sound support, as indicated on the chart labels.
Object insertion
Veo’s “Object Insertion” capability has achieved state-of-the-art results for Overall Preference and Visual Quality in head-to-head comparisons by human raters against other leading video generation models on internal benchmarks. [1]
[1] Human raters conducted direct side-by-side comparisons across 124 diverse examples (each including a video and a prompt specifying which object to insert), evaluating one generated video per example.
All comparisons were done at 1280x720 (or 720x1280) resolution. Veo videos are 6 seconds long. All competing model videos are 5 seconds long and shown at full length to raters. All videos had no sound.