Google Veo 3.1 Reference-to-Video
Generate videos with consistent characters and objects using reference images. Perfect for storytelling and multi-scene projects.
Key capabilities
- Character consistency: Maintain visual identity of characters across video generation
- Multi-reference support: Use 1-3 reference images for subject consistency
- Multi-resolution output: Generate videos in 720p, 1080p, or 4K resolution
- Native audio generation: Includes dialogue and sound effects synthesis
- Fixed 8-second duration: Optimized duration at 24 FPS for cinematic quality
- Aspect ratio control: 16:9 (landscape) or 9:16 (portrait) formats
- Negative prompts: Specify elements to avoid in generation
- Long prompts: Up to 20,000 characters for detailed scene descriptions
Use cases
- Storytelling: Create multi-scene narratives with consistent characters
- Brand mascots: Generate videos featuring consistent brand characters
- Product showcases: Maintain product appearance across different scenes
- Character animation: Bring illustrated or photographed characters to life consistently
- Social media series: Create episodic content with recurring characters
- Advertising campaigns: Produce multiple ads with a consistent spokesperson
How it differs from Image-to-Video
| Feature | Image-to-Video | Reference-to-Video |
|---|---|---|
| Input | Single image to animate | 1-3 reference images + prompt |
| Purpose | Animate a specific image | Generate new scenes with consistent subjects |
| Output | Animation of the input image | New video featuring reference subjects |
| Duration | 4, 6, or 8 seconds | Fixed 8 seconds |
| Modes | Standard and Fast | Single mode |
Generate with Reference-to-Video
Create videos with consistent characters and objects using reference images.
POST /v1/ai/reference-to-video/veo-3-1
Create a reference-to-video task
GET /v1/ai/reference-to-video/veo-3-1
List all reference-to-video tasks
GET /v1/ai/reference-to-video/veo-3-1/{task-id}
Get task status by ID
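The three endpoints above suggest an asynchronous flow: create a task, then poll its status by ID until it finishes. The following is a minimal polling sketch under stated assumptions. This section does not show the response schema, so the `status` field and the `COMPLETED` / `FAILED` values are hypothetical, and `fetch` is a caller-supplied function standing in for an authenticated GET request.

```python
import time
from typing import Callable

ENDPOINT = "/v1/ai/reference-to-video/veo-3-1"


def task_path(task_id: str) -> str:
    """Path of the 'get task status by ID' endpoint."""
    return f"{ENDPOINT}/{task_id}"


def wait_for_task(
    task_id: str,
    fetch: Callable[[str], dict],
    interval_s: float = 5.0,
    max_attempts: int = 60,
    # Assumption: terminal status names are not documented in this section.
    terminal: frozenset = frozenset({"COMPLETED", "FAILED"}),
) -> dict:
    """Call fetch(path) until the task reports a terminal status."""
    for attempt in range(max_attempts):
        task = fetch(task_path(task_id))
        # Assumption: the response is a JSON object with a "status" field.
        if task.get("status") in terminal:
            return task
        if attempt < max_attempts - 1:
            time.sleep(interval_s)
    raise TimeoutError(f"task {task_id} not finished after {max_attempts} attempts")
```

Passing `fetch` in keeps the loop independent of any HTTP client and makes it easy to exercise with a stub. For production workloads, the `webhook_url` parameter avoids polling altogether.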
Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| image_urls | array | Yes | Array of 1-3 reference image URLs (HTTPS, publicly accessible) |
| prompt | string | Yes | Text describing the video scene with reference subjects (max 20,000 chars) |
| negative_prompt | string | No | Text describing what to avoid in the video |
| resolution | string | No | Output resolution: "720p", "1080p", or "4k" (default: "720p") |
| aspect_ratio | string | No | Video format: "16:9" or "9:16" (default: "16:9") |
| generate_audio | boolean | No | Generate audio with dialogue and effects (default: true) |
| seed | integer | No | Random seed for reproducibility |
| webhook_url | string | No | URL for task completion notification |
Example request
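As an illustration, the sketch below builds a JSON request body from the parameters listed above and checks the documented limits before anything is sent. The `build_payload` helper and the example image URLs are hypothetical. The base URL and authentication header are not given in this section, so sending the request is left to the caller.

```python
import json

MAX_PROMPT_CHARS = 20_000
RESOLUTIONS = {"720p", "1080p", "4k"}
ASPECT_RATIOS = {"16:9", "9:16"}


def build_payload(
    image_urls,
    prompt,
    *,
    negative_prompt=None,
    resolution="720p",
    aspect_ratio="16:9",
    generate_audio=True,
    seed=None,
    webhook_url=None,
) -> dict:
    """Build the body for POST /v1/ai/reference-to-video/veo-3-1."""
    if not 1 <= len(image_urls) <= 3:
        raise ValueError("image_urls must contain 1-3 URLs")
    if not prompt or len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt must be 1-{MAX_PROMPT_CHARS} characters")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(RESOLUTIONS)}")
    if aspect_ratio not in ASPECT_RATIOS:
        raise ValueError(f"aspect_ratio must be one of {sorted(ASPECT_RATIOS)}")

    payload = {
        "image_urls": list(image_urls),
        "prompt": prompt,
        "resolution": resolution,
        "aspect_ratio": aspect_ratio,
        "generate_audio": generate_audio,
    }
    # Optional fields are only included when the caller sets them.
    optional = {
        "negative_prompt": negative_prompt,
        "seed": seed,
        "webhook_url": webhook_url,
    }
    payload.update({k: v for k, v in optional.items() if v is not None})
    return payload


body = build_payload(
    # Placeholder URLs: replace with publicly accessible HTTPS images.
    ["https://example.com/hero-front.png", "https://example.com/hero-profile.png"],
    "The character from the reference images walks through a rainy market at night",
    negative_prompt="blurry, distorted, inconsistent features",
    resolution="1080p",
)
print(json.dumps(body, indent=2))
```

Validating on the client side catches the most common mistakes (too many images, an unsupported resolution) before a task is created.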
Frequently Asked Questions
What is Reference-to-Video and how is it different from Image-to-Video?
Reference-to-Video uses reference images to maintain visual consistency of subjects (characters, objects) while generating entirely new video scenes. Image-to-Video animates a single input image directly. Use Reference-to-Video when you need to create multiple scenes with the same character or object looking consistent.
How many reference images should I provide?
You can provide 1-3 reference images. Using multiple images from different angles improves consistency. For characters, include front-facing and profile views. For objects, include various angles to help the model understand the complete appearance.
What makes good reference images?
Good reference images:
- Are high resolution and well lit
- Show the subject clearly without obstructions
- Include different angles when using multiple images
- Show the subject with a consistent appearance across images
- Are hosted at publicly accessible HTTPS URLs
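Some items in this checklist can be verified before a request is made. The sketch below covers only what can be checked from the URL strings themselves: the 1-3 count and the HTTPS scheme. The `check_reference_urls` helper is hypothetical. It does not download the images, so it cannot confirm that a URL is publicly reachable or that the image is sharp and well lit.

```python
from urllib.parse import urlparse


def check_reference_urls(urls: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the basic checks pass."""
    problems = []
    if not 1 <= len(urls) <= 3:
        problems.append(f"expected 1-3 reference images, got {len(urls)}")
    for url in urls:
        parsed = urlparse(url)
        # Require the https scheme and a host name.
        if parsed.scheme != "https" or not parsed.netloc:
            problems.append(f"not an HTTPS URL: {url}")
    return problems


print(check_reference_urls([
    "https://example.com/mascot-front.png",
    "http://example.com/mascot-side.png",  # flagged: plain HTTP
]))
```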
Why is the duration fixed at 8 seconds?
The 8-second duration at 24 FPS is optimized for reference-to-video generation, providing enough time for meaningful scenes while ensuring high-quality consistency of the reference subjects throughout the video.
Does Reference-to-Video have a Fast mode?
Currently, Reference-to-Video is available in a single mode optimized for quality and consistency. Unlike Text-to-Video and Image-to-Video, there is no Fast variant for Reference-to-Video.
How does audio generation work with reference subjects?
When generate_audio is enabled (default), the model generates synchronized audio including dialogue and sound effects appropriate to the scene. If your reference subject is a person and the prompt describes them speaking, the audio will include synthesized dialogue.
Best practices
- Multiple reference angles: Provide 2-3 images showing different angles of your subject for best consistency
- Clear subjects: Use reference images where the subject is clearly visible and unobstructed
- Consistent lighting: Reference images with similar lighting produce more coherent results
- Descriptive prompts: Describe how the reference subject should act in the scene
- Scene context: Include environment and action details in your prompt
- Negative prompts: Use to avoid quality issues like “blurry, distorted, inconsistent features”
- Webhook integration: Use webhooks for production workflows to handle async completion
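For the webhook practice above, the receiving side can be sketched as a small function that decodes the notification and hands it to application code. This assumes the notification arrives as an HTTP POST with a JSON object body, which this section does not confirm. The `handle_webhook` helper is hypothetical and deliberately reads no specific fields, because the payload schema is not documented here.

```python
import json
from typing import Callable


def handle_webhook(raw_body: bytes, on_event: Callable[[dict], None]) -> int:
    """Decode a webhook body, pass it to on_event, and return an HTTP status code."""
    try:
        # Assumption: the body is UTF-8 encoded JSON.
        event = json.loads(raw_body)
    except ValueError:
        # Covers both invalid JSON and undecodable bytes.
        return 400
    if not isinstance(event, dict):
        return 400
    on_event(event)
    return 200
```

A web framework route would call `handle_webhook(request_body, callback)` and reply with the returned status code. Keeping the callback fast (for example, by only enqueueing the event) lets the endpoint respond promptly.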
Related APIs
- Veo 3.1 Text-to-Video: Generate videos from text prompts without reference images
- Veo 3.1 Image-to-Video: Animate a single image into video
- Kling 2.6 Motion Control: Transfer motion from reference videos
- RunWay Act Two: Character performance with reference video