
From Nano Banana to Vision Banana: How Google DeepMind Turned an Image Generator into a Generalist Vision System

June 03, 2026 · 43 min read

In April 2026, a team of 25 researchers at Google DeepMind, including Kaiming He and Saining Xie, published a paper with a deceptively simple thesis: if a model can generate photorealistic images, it already understands the visual world well enough to segment objects, estimate depth, and predict surface normals. They tested this by taking Nano Banana Pro, Google's flagship image generator, and instruction-tuning it with a small amount of vision task data. The result, Vision Banana, beat SAM 3 on three out of four segmentation benchmarks, outperformed Depth Anything V3 on metric depth estimation across four major datasets, and surpassed Lotus-2 on surface normal prediction, all without a single architectural modification to the base model, and while retaining its ability to generate and edit images (Gabeur et al., 2026, Image Generators are Generalist Vision Learners, arXiv:2604.20329).

Why this matters: The dominant paradigm in computer vision builds separate specialist models for each task: one architecture for segmentation, another for depth, another for normals. Vision Banana collapses this into a single generative model that handles all tasks through prompt switching alone. If image generation training serves the same role as LLM pretraining, the era of training task-specific vision encoders may be ending.

TL;DR

  • Nano Banana Pro (Gemini 3 Pro Image) is Google DeepMind's image generation model, built on a sparse Mixture-of-Experts transformer architecture with native multimodal capabilities and diffusion-based image synthesis.
  • Vision Banana is Nano Banana Pro instruction-tuned on a small amount of vision task data mixed at a very low ratio into the original training distribution. No architectural changes. No new modules. No auxiliary losses.
  • All vision tasks are parameterized as RGB image generation: the model generates color-coded images that can be decoded back into segmentation masks, metric depth maps, or surface normal vectors through invertible mappings.
  • Metric depth encoding uses a two-stage bijection: Barron's power transform (\(\lambda = -3\), \(c = 10/3\)) compresses unbounded depth to \([0, 1)\), then a 3D Hilbert curve interpolation maps the scalar to an RGB triple. The entire mapping is strictly invertible.
  • Vision Banana outperforms SAM 3 on Cityscapes semantic segmentation (0.699 vs 0.652 mIoU), edges out the SAM 3 Agent on referring expression segmentation (0.738 vs 0.734 cIoU on RefCOCOg), and beats Depth Anything V3 on metric depth (0.929 vs 0.918 average \(\delta_1\)), all in zero-shot evaluation with no benchmark training data.
  • The model retains generation capability: 53.5% win rate against unmodified Nano Banana Pro on text-to-image benchmarks.
  • Training data for 3D tasks is entirely synthetic (rendering engines). No real-world depth or normal data was used. No camera intrinsics are required at inference.
  • The core claim is a paradigm parallel: image generation pretraining is to vision what LLM pretraining is to language, a universal foundation that can be steered toward understanding tasks through instruction alignment.

At a Glance

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1e40af', 'primaryTextColor': '#fff', 'primaryBorderColor': '#60a5fa', 'lineColor': '#94a3b8', 'textColor': '#e2e8f0', 'clusterBkg': '#1e293b', 'clusterBorder': '#334155', 'fontSize': '16px'}}}%%
flowchart LR
    A["Input Image"] --> B["Nano Banana Pro<br/>(Generative Backbone)"]
    C["Text Instruction<br/>(task + format)"] --> B
    B --> D["RGB Output Image"]
    D --> E["Task-Specific<br/>Decoder"]
    E --> F["Seg Mask /<br/>Depth Map /<br/>Normal Map"]

    classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
    classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
    classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
    classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
    classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff

    class A,C blue
    class B purple
    class D amber
    class E teal
    class F emerald

Before Vision Banana

The idea of repurposing generative models for visual understanding did not emerge in isolation. It grew from a decade-long trajectory where vision models first learned to discriminate, then to generate, and finally proved that generation and understanding were two sides of the same learned representation.

%%{init: {'theme': 'base', 'themeVariables': {'cScale0': '#1e40af', 'cScale1': '#6d28d9', 'cScale2': '#b45309', 'cScale3': '#be123c', 'cScale4': '#047857', 'cScale5': '#0e7490', 'cScale6': '#1e40af', 'cScaleLabel0': '#e2e8f0', 'cScaleLabel1': '#e2e8f0', 'cScaleLabel2': '#e2e8f0', 'cScaleLabel3': '#e2e8f0', 'cScaleLabel4': '#e2e8f0', 'cScaleLabel5': '#e2e8f0', 'cScaleLabel6': '#e2e8f0', 'textColor': '#e2e8f0', 'lineColor': '#94a3b8', 'fontSize': '16px'}}}%%
timeline
    title From Discriminative Vision to Generative Understanding
    2020 : DDPM (Ho et al.) revives diffusion models
         : Denoising as a training objective
    2022 : Imagen (Saharia et al.) - cascaded diffusion
         : Text-conditioned generation via T5-XXL
         : Stable Diffusion - latent space diffusion
    2023 : Marigold - repurposes Stable Diffusion for depth
         : First proof that generative features transfer
    2024 : GenPercept - deterministic one-step fine-tuning
         : Lotus - diffusion foundation for dense prediction
         : SAM 2 released by Meta
    2025 : Nano Banana Pro (Gemini 3 Pro Image)
         : Depth Anything V3 - DINOv2-based depth
         : SAM 3 - concept-based segmentation (Meta)
         : Nano Banana 2 (Gemini 3.1 Flash Image)
    2026 : Vision Banana - instruction-tuned NBP
         : Generation equals understanding paradigm

The discriminative tradition dominated for years. Models like DINOv2 and SAM learned visual features by classifying, matching, or segmenting, never by generating pixels. These models excelled at the tasks they were trained on but required new heads, new decoders, and often new training runs for each downstream application.

The generative tradition ran in parallel. Diffusion models, beginning with the foundational work of Ho et al. on denoising diffusion probabilistic models (Ho et al., 2020, Denoising Diffusion Probabilistic Models, arXiv:2006.11239), learned to generate images by iteratively removing noise. Google Brain's Imagen (Saharia et al., 2022, Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding, arXiv:2205.11487) demonstrated that conditioning diffusion on a powerful text encoder (T5-XXL) produced photorealistic images from text prompts, using a cascaded architecture: a 64x64 base model followed by two super-resolution stages up to 1024x1024.

The bridge between these traditions came in 2023 when Marigold (Ke et al., 2024, Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation, CVPR 2024) showed that Stable Diffusion, fine-tuned on synthetic depth data, produced state-of-the-art monocular depth estimates. The insight was that the denoising process had already learned rich geometric representations during image generation training. GenPercept (Xu et al., 2024, Diffusion Models Trained with Large Data Are Transferable Visual Models, ICLR 2025) pushed further, demonstrating that a single deterministic forward pass through a diffusion model's UNet could extract depth, normals, segmentation, and pose features without iterative denoising. Lotus (He et al., 2024, Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction, arXiv:2409.18124) refined this into a dedicated foundation model for dense geometry tasks.

Each of these predecessors, however, operated on a specific diffusion backbone (typically Stable Diffusion) and required architectural modifications or new training heads. Vision Banana's contribution is showing that a state-of-the-art natively multimodal image generator can be steered toward vision understanding through instruction tuning alone, the same technique that turned base language models into ChatGPT.

[IMAGE: Side-by-side comparison showing the evolution: Marigold (Stable Diffusion + depth head) vs GenPercept (one-step UNet extraction) vs Vision Banana (instruction-tuning, no modifications), with arrows showing increasing simplicity of the adaptation method]

The Foundation: Nano Banana Pro Architecture

Understanding Vision Banana requires understanding what it is built on. Nano Banana Pro is Google's codename for Gemini 3 Pro Image, the image generation component of the Gemini 3 Pro multimodal model. While Google has not published a dedicated architecture paper for this model (no whitepapers were released for Gemini 2.0, 2.5, or 3.0), the architecture can be reconstructed from official documentation, API behavior, and published analysis.

The Sparse MoE Transformer Backbone

Nano Banana Pro is built on Gemini 3 Pro's sparse Mixture-of-Experts (MoE) transformer architecture. Unlike dense transformers where every parameter participates in every forward pass, MoE models learn to dynamically route input tokens to a subset of specialized "expert" sub-networks. This decouples total model capacity from per-token computation cost: the model can have a very large total parameter count while activating only a fraction of those parameters for any given input.

The MoE architecture provides a critical advantage for multimodal generation: different experts can specialize in different aspects of the task. Some experts may handle text understanding, others geometric reasoning, others texture synthesis. The routing mechanism learns to compose these capabilities dynamically based on the input. This is distinct from pipeline architectures that chain separate text encoders and image decoders; in Gemini, text and image processing share the same transformer backbone with the same attention layers.
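To make the routing idea concrete, here is a minimal sketch of top-k expert routing in plain Python. The expert count, the value of k, and the toy "experts" (simple scaling functions) are illustrative assumptions; Google has not published Gemini's routing details, so this shows the general technique rather than the production mechanism.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)[:k]
    weights = softmax([router_logits[i] for i in order])
    return list(zip(order, weights))

def moe_layer(token, router_logits, experts, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(token)
    for idx, weight in route_top_k(router_logits, k):
        expert_out = experts[idx](token)
        out = [o + weight * y for o, y in zip(out, expert_out)]
    return out

# Toy experts (assumption): each scales the token by a different factor.
experts = [lambda t, s=s: [s * x for x in t] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]          # hypothetical router scores
selected = route_top_k(logits, k=2)      # experts 1 and 3, equal weight
output = moe_layer([1.0, -1.0], logits, experts, k=2)
```

Only two of the four experts run for this token, which is the decoupling of capacity from per-token compute described above.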

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1e40af', 'primaryTextColor': '#fff', 'primaryBorderColor': '#60a5fa', 'lineColor': '#94a3b8', 'textColor': '#e2e8f0', 'clusterBkg': '#1e293b', 'clusterBorder': '#334155', 'fontSize': '16px'}}}%%
flowchart TD
    subgraph INPUT["Input Processing"]
        T["Text Tokens"] --> E["Unified Embedding Space"]
        V["Visual Tokens<br/>(ViT patches)"] --> E
    end

    subgraph MOE["Sparse MoE Transformer Layers"]
        E --> R["Router Network"]
        R --> E1["Expert 1<br/>(Geometry)"]
        R --> E2["Expert 2<br/>(Semantics)"]
        R --> E3["Expert 3<br/>(Texture)"]
        R --> E4["Expert K<br/>(...)"]
        E1 --> AGG["Aggregate"]
        E2 --> AGG
        E3 --> AGG
        E4 --> AGG
    end

    subgraph OUTPUT["Generation"]
        AGG --> DIFF["Diffusion Head"]
        DIFF --> IMG["Generated Image"]
    end

    classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
    classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
    classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
    classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff
    classDef slate fill:#334155,stroke:#64748b,stroke-width:1px,color:#e2e8f0

    class T,V blue
    class E,R purple
    class E1,E2,E3,E4 teal
    class AGG amber
    class DIFF,IMG slate

Native Multimodal Processing

The phrase "natively multimodal" is not marketing. Gemini 3 Pro was trained from the start to process and generate both text and images within a unified architecture. Input images are converted to visual tokens through a ViT-style (Vision Transformer) patch embedding: the image is divided into fixed-size patches (typically 16x16 or 14x14 pixels), each patch is linearly projected into the model's embedding dimension, and the resulting sequence of patch embeddings is interleaved with text token embeddings in the transformer's attention layers. This approach is conceptually related to Google's PaLI and earlier Flamingo-style architectures, but trained end-to-end rather than modularly.
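The patch-embedding step can be sketched in a few lines of plain Python. The image size, patch size, and toy projection matrix below are assumptions chosen for readability; a real implementation would use a tensor library and learned weights.

```python
def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            vec = []
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    vec.extend(image[y][x])
            patches.append(vec)
    return patches

def project(vec, weight):
    """Linear projection; weight has shape (embed_dim, len(vec))."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

# A 32x32 RGB image with 16x16 patches gives 4 tokens of length 16*16*3 = 768.
image = [[[0.0, 0.5, 1.0] for _ in range(32)] for _ in range(32)]
tokens = patchify(image, 16)
weight = [[1.0 / 768] * 768 for _ in range(4)]   # toy embed_dim = 4 (assumption)
embedded = [project(t, weight) for t in tokens]
```

Each entry of `embedded` is one visual token that would sit alongside text token embeddings in the transformer's input sequence.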

The output side uses diffusion-based image synthesis. While the exact diffusion architecture has not been published, the model supports classifier-free guidance (evidenced by negative prompt effectiveness), operates at up to 4K resolution, and handles multi-image inputs (up to 14 reference images with consistency across up to 5 people). The generation process likely operates in a latent space rather than pixel space, consistent with the efficiency requirements of a production API serving model.

Model Family and Scale

Google's Nano Banana family includes three members:

| Model | Official Name | Released | Key Characteristics |
|---|---|---|---|
| Nano Banana | Gemini 2.5 Flash Image | August 2025 | Speed-optimized, first release |
| Nano Banana Pro | Gemini 3 Pro Image | November 2025 | Quality-optimized, 4K, multi-image |
| Nano Banana 2 | Gemini 3.1 Flash Image | February 2026 | 1.8B parameters, Flash-speed generation |

Nano Banana Pro sits at the quality end of this spectrum. It prioritizes compositional accuracy, text rendering (95% accuracy for strings under 10 words), and multi-image consistency over raw generation speed. The model's text encoder integrates Gemini 3 Pro's reasoning capabilities, allowing it to "plan" scenes before rendering: simulating lighting physics, object relationships, and spatial logic prior to synthesizing pixels. This reasoning capability is part of what makes Vision Banana possible; the model does not just memorize visual patterns but constructs internal representations of scene geometry and semantics.

[IMAGE: Comparison of three Nano Banana model variants showing generation quality vs speed tradeoff, with Nano Banana 2 (1.8B params, fast) on one end and Nano Banana Pro (large, quality-focused) on the other, with example outputs at identical prompts]

What Nano Banana Pro Learns During Image Generation Training

The central thesis of the Vision Banana paper rests on a specific claim about what image generators learn. Generating photorealistic images requires more than pattern matching. To place shadows correctly, the model must understand lighting direction and surface geometry. To render occluded objects consistently, it must maintain a 3D scene representation. To handle perspective and foreshortening, it must encode camera viewpoint and depth ordering. To generate textures that wrap around surfaces, it must understand surface normals.

These capabilities are latent in any sufficiently powerful image generator. The question Vision Banana answers is whether they can be extracted through instruction tuning rather than architectural surgery.

A separate study evaluated Nano Banana Pro on 14 low-level vision tasks (dehazing, super-resolution, deraining, denoising, and others) across 40 datasets (Pan et al., 2025, Is Nano Banana Pro a Low-Level Vision All-Rounder?, arXiv:2512.15110). The findings showed a telling pattern: the model produced images with strong perceptual quality (top NIMA and NIQE scores) but poor pixel-level fidelity (low PSNR/SSIM). It hallucinated plausible but incorrect details rather than faithfully restoring degraded inputs. This "perception-distortion paradox" confirms that the model prioritizes semantic plausibility over pixel alignment, a property that makes it better suited to vision understanding tasks (where semantic correctness matters) than to signal restoration tasks (where pixel fidelity matters).

How Vision Banana Actually Works

The Instruction-Tuning Recipe

Vision Banana is created by a single, lightweight instruction-tuning pass over Nano Banana Pro. The recipe has three components:

Data mixing. A small proportion of computer vision task data is mixed into NBP's original image generation training distribution at a very low ratio. The paper does not disclose the exact ratio, but emphasizes that it is small enough to preserve the base model's generative capabilities. The training data includes in-house model annotations for web-crawled 2D images (segmentation labels) and synthetic data from rendering engines for 3D tasks (depth maps, surface normals). No evaluation benchmark data was included in training.

Instruction format. Each training example consists of an input image, a natural language instruction specifying the task and output format, and a target RGB image encoding the ground truth in the specified format. The instructions use natural language augmented with color specifications. For segmentation, a prompt might read: "Segment the following categories in the image using these colors: {'car': 'red', 'road': 'gray', 'sky': 'blue'}." For depth, the instruction specifies the colormap. For normals, the instruction specifies the camera coordinate convention.
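As a rough illustration of that triple, the sketch below builds one training example in Python. The `VisionExample` class, its field names, and the file paths are hypothetical; only the instruction wording is taken from the example prompt quoted above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VisionExample:
    """One instruction-tuning triple; class and field names are illustrative."""
    input_image: str    # path to the source photograph
    instruction: str    # natural language task and format description
    target_image: str   # path to the RGB image encoding the ground truth

def segmentation_instruction(class_colors):
    """Format a class-to-color mapping in the style of the quoted prompt."""
    return ("Segment the following categories in the image "
            f"using these colors: {class_colors}.")

example = VisionExample(
    input_image="street.png",          # hypothetical file
    instruction=segmentation_instruction(
        {"car": "red", "road": "gray", "sky": "blue"}),
    target_image="street_seg.png",     # hypothetical file
)
```

Because the task lives entirely in the `instruction` string, switching to depth or normals changes only that field and the target image, not the model.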

No architectural changes. Vision Banana uses exactly the same architecture as Nano Banana Pro. No new modules, no task-specific heads, no auxiliary losses, no specialized decoders. The model learns to produce vision outputs by treating them as a particular style of image generation.

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1e40af', 'primaryTextColor': '#fff', 'primaryBorderColor': '#60a5fa', 'lineColor': '#94a3b8', 'textColor': '#e2e8f0', 'clusterBkg': '#1e293b', 'clusterBorder': '#334155', 'fontSize': '16px'}}}%%
flowchart TD
    subgraph TRAIN["Training Data Composition"]
        GEN["Original Generation<br/>Training Data<br/>(dominant)"] --> MIX["Low-Ratio<br/>Mixing"]
        VIS["Vision Task Data<br/>(small proportion)"] --> MIX
    end

    subgraph TASKS["Vision Task Data Sources"]
        SEG["2D Segmentation<br/>(in-house annotations<br/>on web images)"]
        DEPTH["Metric Depth<br/>(synthetic renders)"]
        NORM["Surface Normals<br/>(synthetic renders)"]
    end

    SEG --> VIS
    DEPTH --> VIS
    NORM --> VIS

    MIX --> FT["Instruction<br/>Fine-Tuning"]
    NBP["Nano Banana Pro<br/>(frozen arch)"] --> FT
    FT --> VB["Vision Banana"]

    classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
    classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
    classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
    classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
    classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff

    class GEN blue
    class VIS,SEG,DEPTH,NORM teal
    class MIX,FT purple
    class NBP amber
    class VB emerald

The Core Insight: Perception as Image Generation

The conceptual leap is treating every vision task as an image-to-image generation problem. Instead of regressing depth values or classifying pixels, Vision Banana generates an RGB image where each pixel's color encodes the task output. The encoding must satisfy two constraints: it must be expressive enough to represent the full range of task outputs, and it must be invertible so that the generated image can be decoded back into quantitative results.

This reframing has a crucial consequence for the model's training dynamics. The model never sees a "depth loss" or a "segmentation cross-entropy." It only sees the standard image generation loss: produce an image that matches the target. The task-specific structure is encoded entirely in the RGB mapping and the natural language instruction. This means the model's generative priors (texture coherence, edge sharpness, spatial consistency) directly benefit the vision output, because the same properties that make a generated photograph look realistic also make a generated depth map clean and well-aligned with object boundaries.

Task Parameterization: Semantic Segmentation

Semantic segmentation assigns a class label to every pixel. Vision Banana handles this by prompting the model with a class-to-color mapping and asking it to generate a color-coded segmentation image.

Example prompt: "Generate a segmentation visualization using color mapping: {'car': (255, 0, 0), 'road': (128, 128, 128), 'building': (0, 0, 255), 'sky': (135, 206, 235)}."

The model generates an image where each pixel is colored according to its predicted class. At decode time, pixels are assigned to the nearest specified color by Euclidean distance in RGB space, with a clustering threshold to handle anti-aliasing and generation artifacts at class boundaries.
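A minimal sketch of that decode step, assuming a simple per-pixel nearest-color rule with a distance cutoff. The `max_dist` value is an assumption standing in for the paper's clustering threshold, which is not specified here, and the loop is written in plain Python for clarity rather than speed.

```python
def decode_semantic(pixels, class_colors, max_dist=60.0):
    """Label each RGB pixel with the nearest class color, or None if too far from all."""
    labels = []
    for row in pixels:
        out_row = []
        for r, g, b in row:
            best, best_d2 = None, max_dist ** 2
            for name, (cr, cg, cb) in class_colors.items():
                d2 = (r - cr) ** 2 + (g - cg) ** 2 + (b - cb) ** 2
                if d2 <= best_d2:
                    best, best_d2 = name, d2
            out_row.append(best)
        labels.append(out_row)
    return labels

# Color mapping from the example prompt above.
class_colors = {
    "car": (255, 0, 0),
    "road": (128, 128, 128),
    "building": (0, 0, 255),
    "sky": (135, 206, 235),
}
# A tiny "generated" image with slightly off colors and one stray pixel.
generated = [
    [(250, 5, 3), (130, 125, 128)],
    [(0, 0, 250), (0, 255, 0)],
]
labels = decode_semantic(generated, class_colors)
```

The stray green pixel is left unlabeled rather than forced into a class, which is one way to keep boundary artifacts out of the final masks.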

The prompting format is flexible. The paper demonstrates that color assignments can be specified as RGB tuples, hex codes, color names, or JSON mappings. The model also handles partial class lists: if prompted with only a subset of classes present in the image, it correctly segments only those classes. This flexibility comes directly from the base model's natural language understanding.

Instance segmentation adds a wrinkle: multiple objects of the same class need distinct labels. Vision Banana handles this by running one inference pass per class, asking the model to assign dynamically chosen distinct colors to each instance. Pixels are then clustered by color similarity to extract individual instance masks.
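Since the instance colors are chosen by the model rather than given in the prompt, the decoder has to discover them. One simple way to do that is greedy color clustering, sketched below. The tolerance, the black background, and the greedy strategy are all assumptions; the paper's actual clustering procedure may differ.

```python
def dist2(a, b):
    """Squared Euclidean distance between two RGB triples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cluster_instances(pixels, tol=40.0, background=(0, 0, 0)):
    """Group non-background pixels by color; returns {representative_color: [(row, col), ...]}."""
    limit = tol ** 2
    instances = {}
    for y, row in enumerate(pixels):
        for x, color in enumerate(row):
            color = tuple(color)
            if dist2(color, background) <= limit:
                continue  # treat near-black pixels as background
            for rep in instances:
                if dist2(color, rep) <= limit:
                    instances[rep].append((y, x))
                    break
            else:
                # No existing cluster is close enough: start a new instance.
                instances[color] = [(y, x)]
    return instances

generated = [
    [(255, 0, 0), (250, 8, 4), (0, 0, 0)],
    [(0, 0, 0), (0, 0, 255), (6, 3, 247)],
]
instances = cluster_instances(generated)
```

Each key in `instances` corresponds to one object instance, and its pixel list is that instance's mask.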

Referring expression segmentation leverages the model's language understanding more directly. Given a prompt like "Segment the person holding a red umbrella in pure white," the model generates a binary mask (white foreground, black background) isolating the referenced entity. The language grounding required here comes free from the base model's multimodal pretraining.

[IMAGE: Three-panel figure showing the same street scene processed by Vision Banana for semantic segmentation (color-coded by class), instance segmentation (distinct colors per object instance), and referring expression segmentation (binary mask for "the bus behind the taxi"), with the text prompt displayed below each panel]

Task Parameterization: Metric Depth Estimation

Depth estimation is the task where Vision Banana's encoding design is most technically interesting. Metric depth values are continuous, unbounded (theoretically ranging from 0 to infinity), and need high precision at near-field distances where small depth differences matter most (the difference between 1m and 2m is far more significant than between 100m and 101m). RGB values, by contrast, are bounded integers in \([0, 255]^3\), giving only \(256^3 \approx 16.7\) million distinct colors.

The encoding uses a two-stage bijection:

Stage 1: Power Transform Compression. The raw metric depth \(d \in [0, \infty)\) is compressed to a scalar \(t \in [0, 1)\) using a power transform from Barron (Barron, 2019, A General and Adaptive Robust Loss Function, arXiv:1701.03077):

\[f(d, \lambda, c) = 1 - \left(1 - \frac{d}{\lambda c}\right)^{\lambda + 1}\]

with shape parameter \(\lambda = -3\) and scale parameter \(c = 10/3\). The negative \(\lambda\) produces a concave mapping that allocates more of the \([0, 1)\) range to near-field depths. Objects at 2 meters receive substantially more color precision than objects at 200 meters, matching human depth perception where nearby distance differences are perceptually dominant. The power transform is strictly monotonic and smooth, ensuring it is analytically invertible:

\[d = \lambda c \left(1 - (1 - t)^{1/(\lambda + 1)}\right)\]
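The two formulas translate directly into code. The sketch below implements them exactly as written above, with \(\lambda = -3\) and \(c = 10/3\); the function names are my own.

```python
LAMBDA = -3.0
C = 10.0 / 3.0

def depth_to_t(d, lam=LAMBDA, c=C):
    """Forward power transform: metric depth d in [0, inf) to t in [0, 1)."""
    return 1.0 - (1.0 - d / (lam * c)) ** (lam + 1.0)

def t_to_depth(t, lam=LAMBDA, c=C):
    """Inverse power transform: t in [0, 1) back to metric depth."""
    return lam * c * (1.0 - (1.0 - t) ** (1.0 / (lam + 1.0)))

# With these parameters the forward map reduces to 1 - (1 + d/10)^(-2).
t_near = depth_to_t(2.0)     # 2 m
t_mid = depth_to_t(10.0)     # 10 m
t_far = depth_to_t(200.0)    # 200 m
round_trip = t_to_depth(depth_to_t(37.5))
```

With these parameters, 10 m maps to \(t = 0.75\), and depths below roughly 4.1 m (where \(t = 0.5\)) occupy the first half of the range, which is the near-field emphasis described above.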

Stage 2: Hilbert Curve RGB Interpolation. The scalar \(t \in [0, 1)\) is mapped to an RGB triple by interpolating along a piecewise-linear path that traces the edges of the RGB cube. This path approximates the first iteration of a 3D Hilbert curve, visiting the vertices of the cube in an order that produces smooth, perceptually meaningful color transitions: nearby depth values map to nearby colors, and distant depth values map to visually distinct colors. The path connects black (0,0,0) to white (255,255,255) through intermediate colors, creating a rainbow-like depth visualization that is both visually interpretable and quantitatively invertible.

The full mapping composes both stages: \(\text{RGB}(d) = \text{HilbertInterp}(f(d, -3, 10/3))\). Since both the power transform and the piecewise-linear interpolation are strictly invertible, the entire encoding is a bijection between depth values and points on the path, exact up to 8-bit color quantization and generation noise. Given a generated RGB image, the decoder projects each pixel color onto the nearest line segment of the Hilbert path, recovers \(t\) by inverting the linear interpolation along that segment, and then applies the inverse power transform to recover the metric depth \(d\).
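Here is a hedged sketch of the second stage and its decoder. The vertex order in `PATH` is an assumption: it is one edge-following route from black to white through all eight cube corners, and the paper's exact ordering may differ. The projection logic is independent of that choice.

```python
# Assumed vertex order: black, blue, cyan, green, yellow, red, magenta, white.
PATH = [(0, 0, 0), (0, 0, 1), (0, 1, 1), (0, 1, 0),
        (1, 1, 0), (1, 0, 0), (1, 0, 1), (1, 1, 1)]
SEGMENTS = len(PATH) - 1

def t_to_rgb(t):
    """Interpolate along the path; t in [0, 1] gives an 8-bit RGB triple."""
    s = min(max(t, 0.0), 1.0) * SEGMENTS
    i = min(int(s), SEGMENTS - 1)
    frac = s - i
    a, b = PATH[i], PATH[i + 1]
    return tuple(round(255 * (a[k] + frac * (b[k] - a[k]))) for k in range(3))

def rgb_to_t(rgb):
    """Project a color onto the nearest path segment and recover t."""
    p = [v / 255.0 for v in rgb]
    best_t, best_d2 = 0.0, float("inf")
    for i in range(SEGMENTS):
        a, b = PATH[i], PATH[i + 1]
        # Segments are unit-length cube edges, so this dot product is the
        # position along the edge; clamp it to stay on the segment.
        frac = sum((p[k] - a[k]) * (b[k] - a[k]) for k in range(3))
        frac = min(max(frac, 0.0), 1.0)
        d2 = sum((p[k] - (a[k] + frac * (b[k] - a[k]))) ** 2 for k in range(3))
        if d2 < best_d2:
            best_t, best_d2 = (i + frac) / SEGMENTS, d2
    return best_t

recovered = rgb_to_t(t_to_rgb(0.75))
noisy = rgb_to_t((3, 250, 130))   # an off-path color still decodes to a valid t
```

The nearest-segment projection is what makes the decoder tolerant of generation noise: a pixel that lands slightly off the path is snapped back onto it before \(t\) is read off.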

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1e40af', 'primaryTextColor': '#fff', 'primaryBorderColor': '#60a5fa', 'lineColor': '#94a3b8', 'textColor': '#e2e8f0', 'clusterBkg': '#1e293b', 'clusterBorder': '#334155', 'fontSize': '16px'}}}%%
flowchart LR
    D["Metric Depth<br/>d in 0 to inf"] --> PT["Power Transform<br/>lambda=-3, c=10/3"]
    PT --> S["Scalar t<br/>in 0 to 1"]
    S --> HC["Hilbert Curve<br/>Interpolation"]
    HC --> RGB["RGB Triple<br/>in 0-255 cubed"]
    RGB --> GEN["Vision Banana<br/>Generates Image"]

    GEN --> DEC["Project to<br/>Nearest Segment"]
    DEC --> INV_HC["Invert Linear<br/>Interpolation"]
    INV_HC --> INV_PT["Inverse Power<br/>Transform"]
    INV_PT --> REC["Recovered<br/>Metric Depth d"]

    classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
    classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
    classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
    classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
    classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff

    class D,REC blue
    class PT,INV_PT purple
    class S,HC,INV_HC teal
    class RGB,DEC amber
    class GEN emerald

A key engineering detail: during training, the model is exposed to multiple alternative colormaps (Plasma, Inferno, Viridis, grayscale) alongside the Hilbert curve encoding. This augmentation prevents the model from memorizing a single colormap and encourages it to learn the underlying depth structure rather than a specific color-to-depth lookup table.

The most striking aspect of Vision Banana's depth estimation is that it requires no camera intrinsics or extrinsics. The model infers absolute metric scale from visual context alone. A photograph of a room yields depth values in meters. A photograph of an outdoor scene yields depth values in meters. No focal length, no sensor size, no calibration. This capability likely emerges from the base model's training on diverse web-crawled images with implicit scale cues (known object sizes, perspective geometry, horizon line position).

[IMAGE: The depth-to-RGB bijection visualized as a 3D path through the RGB cube, with the power transform curve shown alongside it, annotated with depth values at key color transitions (0m = black, 0.5m = dark blue, 2m = cyan, 10m = yellow, 50m = red, inf approaches white)]

Task Parameterization: Surface Normal Estimation

Surface normals encode the 3D orientation of a surface at each pixel. A surface normal is a unit vector \((n_x, n_y, n_z)\) in camera coordinates, where each component ranges from \(-1\) to \(+1\). Vision Banana maps these components directly to RGB channels:

\[R = \frac{n_x + 1}{2} \times 255, \quad G = \frac{n_y + 1}{2} \times 255, \quad B = \frac{n_z + 1}{2} \times 255\]

The coordinate convention follows standard camera-space orientation: \(+x\) points right, \(+y\) points up, \(+z\) points out of the image plane (toward the camera). This produces intuitive color coding:

  • Surfaces facing right appear pinkish-red (high \(R\), medium \(G\), medium \(B\))
  • Surfaces facing up appear light green (medium \(R\), high \(G\), medium \(B\))
  • Surfaces facing the camera appear light blue/purple (medium \(R\), medium \(G\), high \(B\))

The mapping is linear and trivially invertible: given an RGB pixel, recover the normal by \(n_i = 2 \times \text{channel}_i / 255 - 1\) for each component, then normalize to unit length. The simplicity of this encoding is itself a finding: unlike depth, which required a sophisticated bijection, surface normal encoding needed nothing beyond a linear rescaling.
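The encoding and its inverse fit in a few lines. This sketch follows the formulas above directly; the function names are illustrative, and rounding to 8-bit integers is the only lossy step.

```python
import math

def normal_to_rgb(n):
    """Map a unit normal with components in [-1, 1] to an 8-bit RGB triple."""
    return tuple(round((c + 1.0) / 2.0 * 255) for c in n)

def rgb_to_normal(rgb):
    """Invert the mapping and renormalize to unit length."""
    v = [2.0 * ch / 255.0 - 1.0 for ch in rgb]
    length = math.sqrt(sum(c * c for c in v))
    return tuple(c / length for c in v)

toward_camera = normal_to_rgb((0.0, 0.0, 1.0))   # mostly blue
facing_up = normal_to_rgb((0.0, 1.0, 0.0))       # mostly green
decoded = rgb_to_normal(toward_camera)
```

The final normalization matters because 8-bit rounding (and any generation noise) leaves the decoded vector slightly off unit length.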

[IMAGE: A room scene processed for surface normal estimation, with the output false-color image annotated with arrows showing which real-world surface orientations produce which colors, plus a color wheel legend mapping normal directions to RGB]

The Inference Pipeline: End to End

Putting it all together, Vision Banana's inference pipeline for a single task follows these steps:

%%{init: {'theme': 'base', 'themeVariables': {'actorBkg': '#1e40af', 'actorTextColor': '#fff', 'actorBorder': '#3b82f6', 'signalColor': '#94a3b8', 'signalTextColor': '#e2e8f0', 'labelBoxBkgColor': '#1e293b', 'labelBoxBorderColor': '#334155', 'labelTextColor': '#e2e8f0', 'loopTextColor': '#e2e8f0', 'noteBkgColor': '#1e293b', 'noteTextColor': '#e2e8f0', 'noteBorderColor': '#475569', 'activationBorderColor': '#3b82f6', 'activationBkgColor': '#1e3a5f', 'fontSize': '16px'}}}%%
sequenceDiagram
    participant U as User / Application
    participant VB as Vision Banana
    participant DEC as Task Decoder

    U->>VB: Input image + task instruction
    Note over VB: Encode image as visual tokens<br/>Encode instruction as text tokens
    Note over VB: Forward pass through MoE<br/>transformer + diffusion head
    VB->>VB: Iterative denoising (N steps)
    VB->>DEC: Generated RGB output image
    Note over DEC: Apply task-specific<br/>inverse mapping
    DEC->>U: Segmentation mask / depth map / normal map

Step 1: Input encoding. The input image is tokenized into visual patches and the text instruction is tokenized into text tokens. Both are embedded into the model's unified embedding space.

Step 2: Forward pass. The interleaved token sequence passes through the sparse MoE transformer layers. The router network selects which experts process each token. Attention across the interleaved text and image tokens allows the instruction to condition the generation.

Step 3: Diffusion-based generation. The model's diffusion head generates an output image through iterative denoising. Starting from noise, each denoising step refines the output conditioned on both the input image and the task instruction. The number of denoising steps controls the quality-speed tradeoff (the paper does not disclose the exact step count, but production diffusion models typically use 20-50 steps with advanced samplers).

Step 4: Task-specific decoding. The generated RGB image is decoded using the appropriate inverse mapping:
- Segmentation: cluster pixels to nearest specified color, extract per-class binary masks
- Depth: project each pixel onto nearest Hilbert curve segment, invert interpolation, apply inverse power transform
- Normals: linear rescaling of RGB channels to \([-1, 1]\), normalize to unit vector

Step 5: Output. The decoded task output (mask array, depth tensor, normal field) is returned in standard format compatible with downstream applications.

For instance segmentation, Steps 1-4 are repeated once per object class, with the model assigning distinct colors to different instances of each class. The per-class masks are then merged. This multi-pass approach increases inference cost linearly with the number of classes but avoids the ambiguity of distinguishing instances across classes in a single pass.

[IMAGE: Full pipeline visualization showing a photograph of a kitchen flowing through Vision Banana to produce three outputs side by side: semantic segmentation map, metric depth map with colorbar in meters, and surface normal map with directional color legend]

Seeing It in Motion: The Multi-Task Architecture

One of Vision Banana's most significant engineering properties is that all five supported tasks share the same model weights and architecture. Task switching happens entirely through the text prompt. This means a single deployment serves all vision understanding needs.

%%{init: {'theme': 'base', 'themeVariables': {'primaryColor': '#1e40af', 'primaryTextColor': '#fff', 'primaryBorderColor': '#60a5fa', 'lineColor': '#94a3b8', 'textColor': '#e2e8f0', 'clusterBkg': '#1e293b', 'clusterBorder': '#334155', 'fontSize': '16px'}}}%%
flowchart TD
    IMG["Input Image"] --> VB["Vision Banana<br/>(Single Model)"]

    P1["Prompt: Semantic Seg<br/>with class colors"] --> VB
    P2["Prompt: Instance Seg<br/>per-class distinct"] --> VB
    P3["Prompt: Referring Expr<br/>natural language"] --> VB
    P4["Prompt: Metric Depth<br/>Hilbert colormap"] --> VB
    P5["Prompt: Surface Normals<br/>camera-space RGB"] --> VB

    VB --> O1["Class Mask"]
    VB --> O2["Instance Masks"]
    VB --> O3["Binary Mask"]
    VB --> O4["Depth Map (m)"]
    VB --> O5["Normal Field"]

    classDef blue fill:#1e40af,stroke:#3b82f6,stroke-width:1px,color:#fff
    classDef purple fill:#6d28d9,stroke:#a78bfa,stroke-width:1px,color:#fff
    classDef teal fill:#0e7490,stroke:#22d3ee,stroke-width:1px,color:#fff
    classDef emerald fill:#047857,stroke:#34d399,stroke-width:1px,color:#fff
    classDef amber fill:#b45309,stroke:#fbbf24,stroke-width:1px,color:#fff

    class IMG blue
    class VB purple
    class P1,P2,P3,P4,P5 teal
    class O1,O2,O3,O4,O5 emerald

This single-model-multiple-task property mirrors the trajectory of large language models, which went from task-specific fine-tuned variants (BERT for NER, BERT for QA, BERT for sentiment) to unified instruction-following models (GPT-3.5, Claude) that handle all tasks through prompt engineering. Vision Banana is performing the same unification for dense visual perception.

By the Numbers

Segmentation Benchmarks

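| Benchmark | Task | Metric | Vision Banana | SAM 3 | SAM 3 Agent | DINO-X | X-Decoder |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Cityscapes | Semantic | mIoU | 0.699 | 0.652 | - | - | 0.520 |
| RefCOCOg | Referring Expression | cIoU | 0.738 | - | 0.734 | - | - |
| ReasonSeg | Reasoning | gIoU | 0.793 | - | 0.770 | - | - |
| SA-Co/Gold | Instance | pmF1 | 0.540 | 0.661 | - | 0.552 | - |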

Vision Banana outperforms SAM 3 by 4.7 percentage points on Cityscapes semantic segmentation, a gap that is substantial given the maturity of this benchmark. It also edges out the SAM 3 Agent (SAM 3 combined with a language model for instruction parsing) on referring expression and reasoning segmentation tasks. The one area where Vision Banana falls short is instance segmentation on SA-Co/Gold, where its 0.540 pmF1 trails both SAM 3's 0.661 and DINO-X's 0.552. This weakness likely stems from the multi-pass per-class inference approach, which can miss instances when color clustering fails at object boundaries.

For context, SAM 3 (Segment Anything Model 3) is Meta's latest segmentation foundation model with approximately 848 million parameters, a dedicated dual encoder-decoder transformer architecture, and a Meta Perception Encoder pre-trained on 5.4 billion image-text pairs (Meta AI, 2025, SAM 3: Segment Anything with Concepts). SAM 3 was purpose-built for segmentation. Vision Banana matches or beats it on three of four benchmarks while also handling depth, normals, and image generation.
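For readers less familiar with the headline metric, here is a minimal sketch of mean IoU over flattened label arrays. Benchmark protocols differ in detail (Cityscapes, for example, accumulates statistics over the whole dataset and ignores some labels), so this shows the idea rather than the official evaluation code; the other metrics in the table (cIoU, gIoU, pmF1) are not covered.

```python
from typing import List, Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean intersection-over-union across classes present in pred or target."""
    ious: List[float] = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that appear in neither array
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0
```

Because every class contributes equally regardless of its pixel count, a model that misses small classes is penalized as heavily as one that misses large ones.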

Depth Estimation Benchmarks

| Benchmark | Vision Banana (\(\delta_1\)) | Depth Anything V3 (\(\delta_1\)) | UniK3D (\(\delta_1\)) | Depth Pro (\(\delta_1\)) |
| --- | --- | --- | --- | --- |
| NYU Depth V2 | 0.948 | 0.961 | - | - |
| 4-dataset average | 0.929 | 0.918 | - | - |
| 6-dataset average | 0.882 | - | 0.823 | 0.715 |

The \(\delta_1\) metric measures the fraction of predicted depth values within a threshold of the ground truth (specifically, \(\max(d_{\text{pred}}/d_{\text{gt}}, d_{\text{gt}}/d_{\text{pred}}) < 1.25\)). Vision Banana achieves 0.929 average \(\delta_1\) across four major benchmarks, beating Depth Anything V3's 0.918. On the broader six-benchmark average, it reaches 0.882 compared to UniK3D's 0.823.
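The \(\delta_1\) definition above translates directly into code. The sketch below works on flat lists of depths in meters and skips pixels with non-positive depth; that invalid-pixel convention is an assumption, since benchmarks define their own validity masks.

```python
from typing import Sequence


def delta1(pred: Sequence[float], gt: Sequence[float], threshold: float = 1.25) -> float:
    """Fraction of valid pixels whose depth ratio max(p/g, g/p) is below threshold."""
    valid = [(p, g) for p, g in zip(pred, gt) if p > 0 and g > 0]
    if not valid:
        return 0.0
    hits = sum(1 for p, g in valid if max(p / g, g / p) < threshold)
    return hits / len(valid)
```

The ratio form makes the metric scale-relative: a 0.5 m error counts as a miss at 1 m but as a hit at 10 m.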

Depth Anything V3 (Yang et al., 2025, Depth Anything 3: Recovering the Visual Space from Any Views) uses a DINOv2-pretrained Vision Transformer backbone with a teacher-student training paradigm and specialized geometric prediction targets. It was built specifically for depth estimation. Vision Banana surpasses it using a model whose primary training objective was generating photorealistic images, not predicting geometry.

The one dataset where DA V3 wins (NYU Depth V2, 0.961 vs 0.948) is a relatively small indoor dataset where specialist models benefit from distribution-specific optimization. Across the broader, more diverse evaluation, Vision Banana's generalist capability produces more consistent results.

Critically, Vision Banana achieves these depth results with zero real-world depth training data. All depth supervision comes from synthetic rendering engines. The model generalizes from synthetic depth to real-world scenes through the visual representations learned during image generation pretraining.

Surface Normal Benchmarks

| Setting | Metric (degrees) | Vision Banana | Lotus-2 | DSINE |
| --- | --- | --- | --- | --- |
| Overall (3 datasets) | Mean Angle Error | 18.928 | 19.642 | - |
| Indoor average | Mean Angle Error | 15.549 | 16.558 | 17.017 |
| Indoor average | Median Angle Error | 9.300 | - | - |

Lower is better for angle error. Vision Banana's 18.928-degree mean error across three benchmarks represents a meaningful improvement over Lotus-2 (19.642 degrees), which is particularly impressive given that Lotus-2 is a dedicated diffusion-based model fine-tuned specifically for dense geometry prediction.
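Angle error is the angle between the predicted and ground-truth normal at each pixel, summarized by its mean or median. A minimal sketch, assuming normals are given as 3-tuples and need not be pre-normalized:

```python
import math
from statistics import mean, median
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def angle_error_deg(a: Vec3, b: Vec3) -> float:
    """Angle in degrees between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    # Clamp to guard against values just outside [-1, 1] from rounding
    cos = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cos))


def angular_errors(pred: Sequence[Vec3], gt: Sequence[Vec3]) -> List[float]:
    """Per-pixel angle errors; summarize with mean(...) or median(...)."""
    return [angle_error_deg(p, g) for p, g in zip(pred, gt)]
```

The gap between the indoor mean (15.549) and median (9.300) in the table is typical of this metric: a minority of badly wrong pixels pulls the mean up while the median stays low.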

Generation Capability Retention

| Benchmark | Type | Vision Banana win rate vs. Nano Banana Pro (NBP) |
| --- | --- | --- |
| GenAI-Bench | Text-to-Image | 53.5% |
| ImgEdit | Image Editing | 47.8% |

These numbers suggest that instruction tuning for vision tasks does not materially degrade the base model's generative capabilities. A 53.5% win rate puts Vision Banana slightly ahead of the unmodified Nano Banana Pro on the text-to-image benchmark, and a 47.8% win rate puts it slightly behind on image editing. Both results sit close to the 50% parity line, so the safest reading is near-parity on both benchmarks, within the margin of human preference noise. The idea that vision task training sharpens the model's spatial reasoning is a possible explanation for the text-to-image result, but it remains speculative.

[IMAGE: Bar chart comparing Vision Banana against specialist models across all tasks, with color-coded bars for each model (blue for Vision Banana, gray for specialists), showing the 4.7-point mIoU lead on Cityscapes and the 1.1-point delta1 lead on metric depth, with the instance segmentation gap visible as the one bar where Vision Banana is shorter]

A Concrete Example

Consider a self-driving perception stack processing a single urban intersection frame. With traditional architectures, you would deploy at least three separate models: SAM 3 for segmentation (848M parameters), Depth Anything V3 for depth (DINOv2-Large backbone), and a surface normal estimator like DSINE. Each model has its own weights, its own inference pipeline, its own latency budget.

With Vision Banana, you process the same frame three times through one model, varying only the text prompt:

Pass 1 - Semantic Segmentation:

Prompt: "Segment the following categories: {'vehicle': (255, 0, 0), 'pedestrian': (0, 255, 0), 'road': (128, 128, 128), 'sidewalk': (200, 200, 100), 'building': (0, 0, 255), 'sky': (135, 206, 235), 'vegetation': (0, 128, 0), 'traffic_sign': (255, 255, 0)}."

The model generates a color-coded image. The decoder clusters pixels: a red region near the center becomes a vehicle mask, green blobs on the sidewalk become pedestrian masks, and so on. The output is an H x W integer tensor where each value is a class index.
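Building that prompt from a class-to-color mapping is easy to automate. The sketch below reproduces the format of the example prompt above; whether the model requires this exact wording is an assumption, and `build_segmentation_prompt` is a hypothetical helper.

```python
from typing import Dict, Tuple

Color = Tuple[int, int, int]


def build_segmentation_prompt(class_colors: Dict[str, Color]) -> str:
    """Format a class -> RGB mapping as a segmentation instruction."""
    colors = [tuple(c) for c in class_colors.values()]
    if len(set(colors)) != len(colors):
        raise ValueError("each class needs a distinct color")
    for color in colors:
        if len(color) != 3 or any(not 0 <= ch <= 255 for ch in color):
            raise ValueError(f"invalid RGB color: {color}")
    body = ", ".join(
        f"'{name}': ({r}, {g}, {b})" for name, (r, g, b) in class_colors.items()
    )
    # Doubled braces emit literal '{' and '}' around the mapping
    return f"Segment the following categories: {{{body}}}."
```

Keeping the same dictionary for prompt construction and for decoding guarantees that the decoder looks for exactly the colors the model was asked to produce.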

Pass 2 - Metric Depth:

Prompt: "Generate a metric depth visualization using rainbow colormap."

The model generates a false-color depth image. For a pixel colored RGB (0, 200, 255), the decoder locates this color on the Hilbert curve path, recovers the scalar \(t \approx 0.42\), and applies the inverse power transform:

\[d = (-3)(10/3) \times \left(1 - (1 - 0.42)^{1/(-3+1)}\right) = -10 \times \left(1 - 0.58^{-0.5}\right) \approx 3.13 \text{ meters}\]

The decoder processes every pixel, producing an H x W float32 tensor of metric depths in meters.
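The per-pixel arithmetic is small enough to check by hand or in a few lines. This sketch evaluates the inverse power transform for a single scalar `t`, using the \(\lambda = -3\) and \(c = 10/3\) values from the formula above; recovering `t` from the pixel color is a separate step that is not shown here.

```python
def inverse_power_transform(t: float, lam: float = -3.0, c: float = 10.0 / 3.0) -> float:
    """Map a curve parameter t in [0, 1) back to metric depth in meters."""
    if not 0.0 <= t < 1.0:
        raise ValueError("t must lie in [0, 1); depth diverges as t approaches 1")
    return lam * c * (1.0 - (1.0 - t) ** (1.0 / (lam + 1.0)))


if __name__ == "__main__":
    for t in (0.0, 0.42, 0.75, 0.99):
        print(f"t = {t:.2f} -> depth = {inverse_power_transform(t):.3f} m")
```

Running it for a handful of `t` values shows how the transform compresses large depths into the last sliver of the `t` range.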

Pass 3 - Surface Normals:

Prompt: "Generate surface normal estimation in camera-space RGB encoding."

The model generates a normal map. A sidewalk pixel colored (128, 220, 180) decodes to:

\[n_x = 2 \times 128/255 - 1 \approx 0.004, \quad n_y = 2 \times 220/255 - 1 \approx 0.725, \quad n_z = 2 \times 180/255 - 1 \approx 0.412\]

After normalization to unit length: \((0.005, 0.870, 0.494)\), a surface tilting upward and toward the camera, consistent with a ground surface such as a sidewalk seen in perspective.
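The same decoding for a single pixel, as a minimal sketch in plain Python. It follows the linear rescaling and unit normalization described above; the axis convention (which channel points up or toward the camera) is not defined by this code.

```python
import math
from typing import Tuple


def decode_normal(rgb: Tuple[float, float, float]) -> Tuple[float, float, float]:
    """Rescale an RGB triple from [0, 255] to [-1, 1] and normalize to unit length."""
    v = [2.0 * ch / 255.0 - 1.0 for ch in rgb]
    length = math.sqrt(sum(x * x for x in v))
    if length == 0.0:
        raise ValueError("zero-length normal cannot be normalized")
    return tuple(x / length for x in v)


if __name__ == "__main__":
    print(decode_normal((128, 220, 180)))
```

The normalization step matters because a generated color rarely lands exactly on the unit sphere after rescaling.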

The three outputs are spatially aligned (same input image, same resolution) and can be fused for 3D scene reconstruction. The depth map provides per-pixel metric distances; the normals provide surface orientations; the segmentation provides semantic labels. Together, they define a labeled 3D point cloud of the intersection. The paper demonstrates that Vision Banana's metric depth is accurate enough to produce plausible 3D reconstructions when back-projected using estimated camera parameters.
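Back-projection itself is a few lines once camera intrinsics are available. The sketch below assumes a pinhole camera with focal lengths `fx`, `fy` and principal point `cx`, `cy` in pixels, and treats the decoded depth as distance along the optical axis; both are assumptions, since the paper relies on estimated camera parameters.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject(depth: Sequence[Sequence[float]],
                fx: float, fy: float, cx: float, cy: float) -> List[Point]:
    """Lift an H x W depth map (meters) to camera-space 3D points."""
    points: List[Point] = []
    for v, row in enumerate(depth):          # v: pixel row
        for u, z in enumerate(row):          # u: pixel column
            if z > 0:                        # skip invalid or missing depth
                points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points
```

Since segmentation and normals share the same pixel grid, the class label and normal at `(v, u)` can be attached to each point to form the labeled point cloud described above.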

[IMAGE: Five-panel figure of an urban intersection scene: (a) input photograph, (b) semantic segmentation output with color legend, (c) metric depth map with colorbar showing 0-50m range, (d) surface normal map with directional color wheel, and (e) 3D point cloud reconstruction from depth + normals, viewed from above]

Where It Breaks

Instance Segmentation Ceiling

Vision Banana's instance segmentation (0.540 pmF1) trails SAM 3 (0.661) by a substantial margin. The likely cause lies in the inference procedure rather than in the model weights: with per-class, multi-pass inference, distinguishing individual instances relies entirely on the model's ability to assign distinct colors consistently and on the color clustering algorithm's ability to separate them. When objects overlap extensively or share similar appearance, the color boundaries in the generated image become ambiguous, and clustering fails. SAM 3's dedicated detection + segmentation architecture handles this case with learned object proposals rather than color heuristics.

Computational Cost

Image generators are expensive. Nano Banana Pro serves production traffic at Google's scale, but each inference involves iterative denoising through a large MoE transformer. A single depth estimation pass requires the same computation as generating a full image. For three vision tasks on one frame, you run three full image-generation forward passes. Specialist models like Depth Anything V3 or lightweight SAM variants operate as single-pass feed-forward networks, often 10-100x faster per inference. For real-time applications (autonomous driving at 30fps, robotic manipulation at 100Hz), Vision Banana's latency profile is currently prohibitive.

Resolution and Precision Limits

The depth encoding's quantization is bounded by RGB precision. With 8 bits per channel and three channels, the encoding can distinguish at most \(256^3 \approx 16.7\) million distinct depth values. The power transform allocates these non-uniformly: near-field depths (0-5m) receive fine resolution, but far-field depths (beyond 50m) are coarsely quantized. This figure is an upper bound; the usable precision is lower because generated colors carry noise and the decoder must snap them back onto the curve. For applications requiring sub-millimeter precision at close range (industrial metrology, surgical navigation), the resulting precision may be too low.
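How unevenly the codes are spread can be estimated from the transform alone. The sketch below assumes codes are spaced uniformly in the curve parameter `t` and that all \(256^3\) codes are usable, which is an idealized upper bound; the forward transform is derived here by algebraically inverting the decoding formula and is not quoted from the paper.

```python
def depth_to_t(d: float, lam: float = -3.0, c: float = 10.0 / 3.0) -> float:
    """Forward power transform: metric depth (m) -> curve parameter t in [0, 1)."""
    return 1.0 - (1.0 - d / (lam * c)) ** (lam + 1.0)


def t_to_depth(t: float, lam: float = -3.0, c: float = 10.0 / 3.0) -> float:
    """Inverse power transform: curve parameter t -> metric depth (m)."""
    return lam * c * (1.0 - (1.0 - t) ** (1.0 / (lam + 1.0)))


def depth_step(d: float, n_codes: int = 256 ** 3) -> float:
    """Depth change (m) from moving one code along the curve at depth d."""
    t = depth_to_t(d)
    return t_to_depth(t + 1.0 / n_codes) - t_to_depth(t)


if __name__ == "__main__":
    for d in (1.0, 5.0, 10.0, 50.0):
        print(f"{d:5.1f} m -> step {depth_step(d):.3e} m")
```

The ratio between far-field and near-field step sizes follows from the transform alone; the absolute values depend on how many codes the curve really uses and on how much noise the generator adds.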

Hallucination in Ambiguous Regions

As a generative model, Vision Banana inherits the tendency to hallucinate plausible content. In occluded regions (behind furniture, inside shadows), it produces plausible-looking depth and normal values rather than signaling uncertainty. Marigold and Lotus can provide uncertainty estimates by ensembling multiple stochastic forward passes; the Vision Banana pipeline described here returns a single generated image per request and provides no built-in confidence measure.

No Real-Time Video Consistency

Each frame is processed independently. There is no temporal consistency mechanism, no tracking, no memory across frames. For video applications, depth estimates can flicker between frames, segmentation boundaries can shift, and normals can be inconsistent. SAM 3 and Depth Anything V3 both have video-specific extensions; Vision Banana does not.

Alternative Designs

| Approach | Architecture | Strengths | Weaknesses | Best When |
| --- | --- | --- | --- | --- |
| Vision Banana | Generative (instruction-tuned image generator) | Single model, multi-task, no camera intrinsics, strong generalization | Slow inference, weak instance seg, no uncertainty | Multi-task understanding needed, camera params unknown, deployment simplicity valued |
| SAM 3 | Discriminative (dual encoder-decoder transformer, 848M params) | Best instance seg, fast inference, video support | Single-task (segmentation only), requires separate models for depth/normals | Real-time segmentation, video tracking, instance-level precision needed |
| Depth Anything V3 | Discriminative (DINOv2 ViT backbone) | Fast, accurate depth, strong generalization | Depth only, requires camera intrinsics for some modes | Real-time depth estimation, multi-view geometry |
| Marigold | Generative (fine-tuned Stable Diffusion) | Uncertainty estimation, GPU-efficient training | Relative depth only (not metric), single-task per model | Research settings, uncertainty needed, limited compute for training |
| GenPercept | Generative (one-step diffusion UNet) | Fast (single forward pass), deterministic | Fixed architecture, requires architectural modifications | Speed-critical dense prediction |
| Lotus | Generative (diffusion foundation for dense prediction) | Strong dense geometry, foundation model approach | Specialized for geometry (not segmentation), Stable Diffusion base | Depth + normals with high quality |

The key architectural distinction is between discriminative approaches (SAM 3, DA V3) that learn task-specific representations from labeled data and generative approaches (Vision Banana, Marigold, GenPercept, Lotus) that repurpose learned generative representations. Vision Banana occupies a unique position as the only approach using a state-of-the-art, natively multimodal image generator as its base rather than an open-source diffusion model.

[IMAGE: Radar chart comparing Vision Banana, SAM 3, Depth Anything V3, and Marigold across six axes: segmentation quality, depth quality, normal quality, inference speed, multi-task capability, and uncertainty estimation]

How It Is Used in Practice

The Gemini API Integration

Vision Banana is not released as a standalone model. Its capabilities are accessed through the Gemini API, where users can prompt Nano Banana Pro with vision task instructions. This means:

  • No weight downloads. You cannot run Vision Banana locally. Inference happens on Google's infrastructure.
  • API-level abstraction. The encoding/decoding pipeline (Hilbert curve, power transform, color clustering) must be implemented client-side. The API returns a generated image; your application decodes it.
  • Cost model. Each vision task inference costs the same as generating an image through the API. Three tasks on one image cost three image generations. Pricing follows Google's Gemini API tier structure.

Engineering a Deployment Pipeline

For an engineer building a vision system on Vision Banana, the deployment pipeline would look like this:

1. Prompt engineering. Design task-specific prompts with precise color mappings and format instructions. The prompt template is critical: ambiguous or underspecified prompts produce inconsistent output formats. Maintain a prompt library with tested templates for each task and class vocabulary.

2. Client-side decoder library. Implement the inverse mappings in your inference stack:
- Segmentation: RGB-to-class lookup with nearest-neighbor matching and configurable distance threshold
- Depth: Hilbert curve segment projection, linear interpolation inversion, power transform inversion
- Normals: linear rescaling and unit normalization

These decoders are computationally trivial (per-pixel arithmetic, no neural networks) and can be implemented in NumPy, PyTorch, or even shader code for GPU-side processing.

3. Multi-pass orchestration. For multi-task inference, issue parallel API calls for each task on the same input image. The tasks are independent and can be parallelized across API requests.

4. Post-processing. Clean up generation artifacts at task boundaries: morphological operations on segmentation masks (erosion/dilation to fix jagged edges), median filtering on depth maps (to remove isolated outlier pixels), and normal smoothing (bilateral filter to reduce high-frequency noise while preserving edges).

5. Evaluation and monitoring. Track output quality using task-specific metrics (mIoU for segmentation, \(\delta_1\) for depth, mean angle error for normals) against held-out test sets from your deployment domain.
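The multi-pass orchestration in step 3 can be sketched with a thread pool, since the per-task requests are independent and I/O-bound. Here `generate` is a hypothetical stand-in for whatever client call sends an image and a prompt and returns the generated image bytes; it is not a function from any real SDK.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_tasks(image: bytes,
              prompts: Dict[str, str],
              generate: Callable[[bytes, str], bytes],
              max_workers: int = 3) -> Dict[str, bytes]:
    """Issue one request per task in parallel and collect results by task name."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            task: pool.submit(generate, image, prompt)
            for task, prompt in prompts.items()
        }
        # result() re-raises any exception from the worker, so a failed
        # request surfaces here instead of being silently dropped.
        return {task: future.result() for task, future in futures.items()}
```

With parallel requests, wall-clock latency approaches that of the slowest single task, although the cost is still one image generation per task.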

Where This Makes Practical Sense

Vision Banana's value proposition is clearest in scenarios where:

  • Multi-task understanding is needed and deploying 3+ separate models is operationally burdensome
  • Camera parameters are unknown (web-crawled images, user-uploaded photos, legacy datasets without metadata)
  • Domain generalization matters more than peak performance on a specific benchmark
  • Inference latency is not a hard constraint (batch processing, offline analysis, content moderation)
  • Development velocity is prioritized (one API integration vs. three model deployments)

For real-time applications requiring sub-50ms inference, dedicated hardware-optimized specialist models remain necessary.

[IMAGE: Architecture diagram of a production Vision Banana deployment showing: client application sending images to a load balancer, which distributes to parallel Gemini API calls (one per task), with results flowing back through client-side decoders and into a downstream fusion module that combines segmentation, depth, and normals for 3D scene reconstruction]

Building Your Own: An Engineer's Guide to the Generative Vision Paradigm

While Vision Banana itself is only available through the Gemini API, its methodology is reproducible. An engineer seeking to build a similar system from open-source components would follow this blueprint:

Step 1: Select a Base Image Generator

The base model must satisfy three criteria: (a) strong image generation quality, confirming it has learned rich visual representations; (b) native text conditioning, enabling instruction-based task switching; (c) sufficient resolution to produce detailed vision outputs. Current candidates include Stable Diffusion 3 (open-weights, 8B parameters, latent diffusion with multimodal conditioning) and FLUX (open-weights, transformer-based diffusion). The key insight from Vision Banana is that more powerful generators produce better vision understanding; the representation quality ceiling determines the performance ceiling.

Step 2: Prepare Vision Task Datasets in RGB Format

For each target task, create training pairs of (input image, text instruction, target RGB image):

Segmentation: Use existing datasets (COCO, ADE20K, Cityscapes) and render ground truth masks as color-coded images. Vary color assignments across training examples to prevent the model from memorizing fixed class-color mappings.
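Varying the color assignment per training example could be done as in the sketch below, which draws a fresh palette with a minimum separation between colors and renders a label grid with it. The separation threshold of 64 and the helper names are illustrative assumptions, not values from the paper.

```python
import math
import random
from typing import Dict, List, Sequence, Tuple

Color = Tuple[int, int, int]


def random_palette(class_names: Sequence[str], rng: random.Random,
                   min_dist: float = 64.0, max_tries: int = 10_000) -> Dict[str, Color]:
    """Draw one random, mutually well-separated RGB color per class."""
    palette: Dict[str, Color] = {}
    tries = 0
    for name in class_names:
        while True:
            tries += 1
            if tries > max_tries:
                raise RuntimeError("could not find well-separated colors")
            cand = (rng.randrange(256), rng.randrange(256), rng.randrange(256))
            # Reject candidates too close to a color already in use
            if all(math.dist(cand, used) >= min_dist for used in palette.values()):
                palette[name] = cand
                break
    return palette


def render_mask(labels: Sequence[Sequence[int]], class_names: Sequence[str],
                palette: Dict[str, Color]) -> List[List[Color]]:
    """Turn an H x W grid of class indices into an H x W grid of RGB colors."""
    return [[palette[class_names[idx]] for idx in row] for row in labels]
```

The same palette dictionary would then be written into the text instruction for that example, so the instruction and the target image always agree.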

Depth: Generate synthetic depth data using rendering engines (Blender, Unity, Unreal Engine with depth buffer export). Encode metric depth as RGB using the power transform + Hilbert curve bijection. Augment with alternative colormaps (Plasma, Inferno, Viridis) for robustness.

Surface Normals: Use the same rendering pipeline to export surface normals. Encode as RGB using the linear \([-1,1] \to [0,255]\) mapping.
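The normal encoding is the simplest of the three. A minimal sketch of the \([-1,1] \to [0,255]\) mapping for one normal, assuming the input is first normalized to unit length and the result is rounded to 8-bit integers:

```python
import math
from typing import Tuple


def encode_normal(n: Tuple[float, float, float]) -> Tuple[int, int, int]:
    """Map a non-zero normal vector to an 8-bit RGB triple."""
    length = math.sqrt(sum(x * x for x in n))
    if length == 0.0:
        raise ValueError("zero-length normal cannot be encoded")
    # Each unit component in [-1, 1] maps linearly onto [0, 255]
    return tuple(int(round((x / length + 1.0) * 127.5)) for x in n)
```

Rounding to integers introduces an error of up to about 1/255 per component, which is why the decoder renormalizes after rescaling.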

Step 3: Instruction-Tune with Low-Ratio Mixing

Mix the vision task data into the generator's original training distribution at a low ratio. The Vision Banana paper does not disclose the exact ratio, but related work on instruction tuning suggests ratios in the 1-5% range are typical. Too much vision data degrades generation quality; too little fails to elicit reliable task-following behavior. This ratio is a hyperparameter that requires tuning per base model.

Use the same training objective as the base model (denoising score matching for diffusion models). No auxiliary losses or task-specific loss functions. The model learns to generate vision outputs by treating them as a particular type of image.
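One simple way to realize low-ratio mixing is to decide per example which pool to sample from. The sketch below is a data-loading illustration only; `vision_ratio` is the hyperparameter discussed above, and the 2% used in the usage line is an arbitrary example value, not the paper's.

```python
import random
from typing import Iterator, Sequence, Tuple


def mixed_stream(base: Sequence, vision: Sequence, vision_ratio: float,
                 seed: int = 0) -> Iterator[Tuple[str, object]]:
    """Yield (source, example) pairs, drawing from `vision` with probability vision_ratio."""
    if not 0.0 <= vision_ratio <= 1.0:
        raise ValueError("vision_ratio must be in [0, 1]")
    rng = random.Random(seed)
    while True:
        if rng.random() < vision_ratio:
            yield "vision", rng.choice(vision)
        else:
            yield "base", rng.choice(base)


if __name__ == "__main__":
    from itertools import islice
    stream = mixed_stream(["gen_a", "gen_b"], ["depth_a", "seg_a"], vision_ratio=0.02)
    print(list(islice(stream, 5)))
```

Sampling per example rather than per batch keeps the ratio stable even with small batch sizes, at the cost of some batches containing no vision examples at all.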

Step 4: Implement Client-Side Decoders

Build the inverse mapping library:

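# Client-side decoders (sketch). All inputs are H x W x 3 uint8 torch tensors.
import torch

def decode_depth_rgb(rgb_image, project_to_hilbert_curve, lam=-3.0, c=10.0 / 3.0):
    # Step 1: Project each pixel onto nearest Hilbert curve segment.
    # project_to_hilbert_curve is supplied by the caller and must return an
    # H x W tensor of curve parameters t in [0, 1]; it is not defined here.
    t = project_to_hilbert_curve(rgb_image).float()
    # Keep t strictly below 1, where the inverse transform diverges
    t = t.clamp(0.0, 1.0 - 1e-6)

    # Step 2: Invert power transform
    depth = lam * c * (1.0 - (1.0 - t) ** (1.0 / (lam + 1.0)))
    return depth  # H x W float32, metric depth in meters

def decode_normals_rgb(rgb_image):
    # Linear rescaling from [0, 255] to [-1, 1]
    normals = 2.0 * rgb_image.float() / 255.0 - 1.0
    # Normalize to unit vectors, guarding against zero-length vectors
    norm = normals.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    return normals / norm  # H x W x 3

def decode_segmentation_rgb(rgb_image, class_colors, threshold=60.0, ignore_index=-1):
    # class_colors: dict mapping class_name -> (R, G, B)
    # threshold: maximum RGB distance for a match (illustrative default)
    palette = torch.tensor(list(class_colors.values()), dtype=torch.float32)  # K x 3
    # Distance from every pixel to every class color: H x W x K
    dist = (rgb_image.float().unsqueeze(-2) - palette).norm(dim=-1)
    # Nearest-neighbor assignment; pixels far from every color stay unlabeled
    min_dist, labels = dist.min(dim=-1)
    labels[min_dist > threshold] = ignore_index
    return labels  # H x W integer class indices, ignore_index where unmatched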

Step 5: Evaluate and Iterate

Evaluate on standard benchmarks using established metrics. Monitor for generation artifacts that propagate through decoding: color bleeding at object boundaries (segmentation), depth discontinuities at smooth surfaces (depth), and high-frequency noise in flat regions (normals). These artifacts are characteristic of generative approaches and can often be mitigated through post-processing (morphological operations, guided filtering) without retraining.
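As an example of the post-processing mentioned above, a 3x3 median filter removes isolated outlier pixels from a decoded depth map. This is a plain-Python sketch for clarity; a production pipeline would use a vectorized library routine instead.

```python
from statistics import median
from typing import List, Sequence


def median_filter3(depth: Sequence[Sequence[float]]) -> List[List[float]]:
    """Apply a 3x3 median filter; windows are truncated at the image border."""
    h, w = len(depth), len(depth[0])
    out = [list(row) for row in depth]
    for y in range(h):
        for x in range(w):
            window = [
                depth[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            ]
            out[y][x] = median(window)
    return out
```

A median filter also rounds off thin structures, so it should be checked against the evaluation metrics before being applied by default.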

[IMAGE: Flowchart showing the open-source replication pipeline: Base Diffusion Model + Rendering Engine (Blender/Unity) + Vision Datasets -> RGB Encoding -> Low-Ratio Mixed Training -> Instruction-Tuned Vision Model -> Client-Side Decoder -> Task Outputs]

Insights Worth Remembering

  1. Generation implies understanding. A model that can generate photorealistic images has already learned geometry, semantics, depth ordering, and surface properties. The question is not whether these capabilities exist in the model but how to access them.

  2. The encoding design matters as much as the model. Vision Banana's performance depends critically on the invertibility and precision of the RGB encoding. The power transform with \(\lambda = -3\) gives near-field depth roughly 8x more color precision than far-field, matching the distribution of depth values that matter in practice.

  3. Synthetic data is sufficient for 3D tasks. All of Vision Banana's depth and normal training data is synthetic, yet it generalizes to real-world scenes. This confirms that rendering engines produce geometry distributions that transfer to the real world, especially when the base model's generative pretraining provides appearance-level grounding.

  4. The generation quality floor determines the vision understanding ceiling. A weak image generator will make a weak vision model, no matter how good the instruction tuning recipe is. The representations must be there to be extracted.

  5. Instruction tuning for vision is remarkably data-efficient. A "very low ratio" of vision task data in the training mix is sufficient to unlock competitive performance across five distinct tasks. The base model's generative pretraining does most of the heavy lifting.

  6. Multi-task capability has a real cost. Vision Banana handles five tasks with one model but requires separate inference passes for each task and performs worst on the task (instance segmentation) that specialist architectures handle with dedicated components (object proposal networks).

  7. The no-camera-intrinsics property is the deployment story. Most practical applications cannot guarantee camera metadata. Vision Banana infers absolute scale from visual context, eliminating a hard requirement that limits deployment of traditional metric depth methods.

  8. Color quantization is the precision ceiling. With 8-bit RGB channels and a Hilbert curve encoding, the theoretical maximum depth precision is bounded by \(256^3 \approx 16.7\) million distinct values distributed non-uniformly across the depth range. For sub-millimeter applications, this ceiling, further reduced by generation noise, is likely insufficient.

  9. The paradigm parallel to LLMs is directionally right but not complete. LLM pretraining produces models that can be instruction-tuned for any language task. Image generation pretraining produces models that can be instruction-tuned for many vision tasks. But the vision case has a gap: vision tasks require structured, invertible output encodings that language tasks do not. Designing these encodings is non-trivial engineering.

  10. Latency separates research impact from production adoption. Until generative models can produce vision outputs at speeds competitive with feed-forward specialist models, the generalist advantage is limited to offline and batch processing scenarios.

Open Questions

Can the encoding be learned rather than designed? The Hilbert curve depth encoding and linear normal mapping were hand-designed. Could the model learn its own output encoding during training, discovering encodings that are naturally better suited to the diffusion process's generative biases? This would remove a source of human engineering effort and might produce encodings that are more robust to generation artifacts.

What is the optimal mixing ratio? The paper states "very low ratio" without quantification. The relationship between vision data ratio, vision task performance, and generation quality retention likely follows a Pareto frontier. Mapping this frontier would provide practical guidance for anyone building similar systems.

Can distillation close the latency gap? Distilling Vision Banana's outputs into a smaller, single-pass feed-forward model could retain the multi-task capability while achieving real-time inference. GenPercept demonstrated that single-step distillation of diffusion features is viable. Whether this extends to a model of Vision Banana's complexity and across all five tasks is an open research question.

Does this scale to video? Current vision understanding models increasingly operate on video (SAM 3 tracks objects across frames, DA V3 handles multi-view inputs). Extending Vision Banana to generate temporally consistent vision outputs across video frames would require either frame-level processing with post-hoc consistency enforcement or a video-native generative backbone.

What tasks cannot be encoded as RGB? The five tasks demonstrated all have outputs that can be naturally represented as per-pixel values. Tasks requiring graph structures (scene graphs), variable-length outputs (object detection with bounding boxes), or non-spatial outputs (image classification, VQA) do not fit the "perception as image generation" framework. The boundary of this paradigm is not yet mapped.

Will scaling laws hold? If the base generator improves (more parameters, better training data, higher resolution), does vision understanding improve proportionally? The LLM analogy suggests it should, but there may be diminishing returns or task-specific saturation points. Nano Banana 2 (1.8B parameters, Gemini 3.1 Flash Image) provides one data point: comparing its vision understanding capabilities against Nano Banana Pro would illuminate whether faster, smaller generators sacrifice understanding quality.

Can competing approaches converge? SAM 3 is adding language understanding. Depth Anything V3 is handling multi-view inputs. Vision Banana is adding vision understanding to a generator. Are these trajectories converging toward a single universal vision model, or will the specialist advantages in latency, precision, and task-specific quality persist?

Sources and Further Reading

BibTeX

@article{visionbanana2026,
  title={Image Generators are Generalist Vision Learners},
  author={Gabeur, Valentin and Long, Shangbang and Peng, Songyou and
          Voigtlaender, Paul and Sun, Shuyang and Bao, Yanan and
          Truong, Karen and Wang, Zhicheng and Zhou, Wenlei and
          Barron, Jonathan T and Genova, Kyle and Kannen, Nithish and
          Ben, Sherry and Li, Yandong and Guo, Mandy and Yogin, Suhas and
          Gu, Yiming and Chen, Huizhong and Wang, Oliver and Xie, Saining and
          Zhou, Howard and He, Kaiming and Funkhouser, Thomas and
          Alayrac, Jean-Baptiste and Soricut, Radu},
  journal={arXiv preprint arXiv:2604.20329},
  year={2026}
}

@inproceedings{ho2020denoising,
  title={Denoising Diffusion Probabilistic Models},
  author={Ho, Jonathan and Jain, Ajay and Abbeel, Pieter},
  booktitle={NeurIPS},
  year={2020}
}

@inproceedings{saharia2022photorealistic,
  title={Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding},
  author={Saharia, Chitwan and Chan, William and Saxena, Saurabh and Li, Lala and
          Whang, Jay and Denton, Emily and others},
  booktitle={NeurIPS},
  year={2022}
}

@article{barron2019general,
  title={A General and Adaptive Robust Loss Function},
  author={Barron, Jonathan T},
  journal={arXiv preprint arXiv:1701.03077},
  year={2019}
}

@inproceedings{ke2024marigold,
  title={Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation},
  author={Ke, Bingxin and Obukhov, Anton and Huang, Shengyu and Metzger, Nando and
          Daudt, Rodrigo Caye and Schindler, Konrad},
  booktitle={CVPR},
  year={2024}
}

@inproceedings{wu2024genpercept,
  title={Diffusion Models Trained with Large Data Are Transferable Visual Models},
  author={Xu, Guangkai and others},
  booktitle={ICLR},
  year={2025}
}

@article{he2024lotus,
  title={Lotus: Diffusion-based Visual Foundation Model for High-quality Dense Prediction},
  author={He, Jing and others},
  journal={arXiv preprint arXiv:2409.18124},
  year={2024}
}