

Published: December 23, 2025
Category: Model Comparison
Author: Formel Studio Research

Evaluating Frontier Image Generation Models for Commercial Jewelry Photography

A systematic comparison of 6 frontier models across 270 pairwise evaluations reveals workflow-dependent performance patterns with significant implications for production deployment.


Research Series: AI Model Comparison for Jewelry Photography

Abstract

As AI image generation matures, jewelry brands face a practical question: which models produce commercially viable product photography? Despite rapid advances in diffusion architectures, no systematic evaluation exists for this specialized domain. We present a pairwise comparison of 6 frontier models—Nano Banana Pro, Nano Banana, and Gemini 2.5 Flash (Google), and FLUX.2 Pro, FLUX.2 Max, and FLUX.2 Flex (Black Forest Labs)—across 270 head-to-head evaluations. We measure three dimensions: pairwise preference, ring accuracy relative to reference images, and photorealism. Our findings reveal that model performance is highly workflow-dependent: the top Google model achieves an 89% win rate on generation tasks, while the top Black Forest Labs model achieves a 70% win rate on replacement tasks. We introduce a “production-ready” metric combining accuracy and realism, and calculate cost-efficiency per usable image. These results have direct implications for model selection in commercial jewelry photography pipelines.


1. Introduction

1.1 The Problem

Jewelry e-commerce represents a $300+ billion global market where product photography directly impacts conversion rates. Traditional photography requires physical inventory, studio setups, and skilled photographers—creating bottlenecks for brands with large catalogs or frequent new releases.

AI image generation offers a potential solution: generate product photography from reference images alone. However, jewelry presents unique challenges that general-purpose benchmarks don’t capture:

  • Fine detail reproduction: Prongs, pavé settings, and engravings must be accurately rendered
  • Material properties: Metal reflectivity and gemstone refraction require precise handling
  • Hand realism: On-hand shots demand anatomically correct, photorealistic skin
  • Brand accuracy: The generated ring must match the reference exactly—not a similar ring

1.2 The Gap

Existing model evaluations focus on general image quality metrics (FID, CLIP scores) or broad categories (faces, landscapes, objects). No published work systematically evaluates frontier models for jewelry-specific tasks with commercially relevant success criteria.

1.3 Our Contribution

We present the first systematic evaluation of frontier image generation models for commercial jewelry photography. Our study:

  1. Compares 6 leading models across two distinct workflows (generation and replacement)
  2. Introduces domain-specific metrics: ring accuracy and production-readiness
  3. Reveals workflow-dependent performance inversions not visible in aggregate rankings
  4. Provides cost-efficiency analysis for production deployment decisions

2. Related Work

2.1 Diffusion Model Benchmarks

Standard benchmarks like DrawBench, PartiPrompts, and COCO evaluate general image generation quality. These capture broad capabilities but miss domain-specific requirements like product accuracy and fine detail reproduction.

2.2 Commercial Image Generation

Recent work has explored AI for product photography in fashion and furniture, but jewelry—with its reflective surfaces, intricate details, and strict accuracy requirements—remains understudied.

2.3 Prior Work in This Series

In Part 1 of this research, we evaluated 11 models on basic jewelry generation capability. Five models were eliminated due to fundamental failures (wrong object generation, severe artifacts). The remaining 6 models advanced to this systematic comparison.


3. Methodology

3.1 Models Evaluated

We evaluated 6 frontier models that demonstrated basic jewelry generation capability in Part 1:

| Model            | Provider          | Architecture | Cost/Image |
|------------------|-------------------|--------------|------------|
| Nano Banana Pro  | Google            | Imagen-based | $0.150     |
| Nano Banana      | Google            | Imagen-based | $0.039     |
| Gemini 2.5 Flash | Google            | Multimodal   | $0.039     |
| FLUX.2 Pro       | Black Forest Labs | FLUX         | $0.090     |
| FLUX.2 Max       | Black Forest Labs | FLUX         | $0.190     |
| FLUX.2 Flex      | Black Forest Labs | FLUX         | $0.315     |

All models were accessed via the Replicate API with default parameters. Costs reflect actual invoice data from December 2025.
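For reference, a minimal sketch of how a single generation call looks through the replicate Python client; the model slug and input field names here are illustrative placeholders, not the exact identifiers used in this study:

```python
import replicate

def generate_image(model_slug: str, prompt: str, reference_url: str):
    """Run one generation with default parameters via the Replicate API.

    The slug and input keys below are hypothetical; each model hosted on
    Replicate defines its own input schema.
    """
    output = replicate.run(
        model_slug,                  # e.g. "provider/model-name" (placeholder)
        input={
            "prompt": prompt,        # standardized task prompt
            "image": reference_url,  # ring reference image
        },
    )
    return output                    # typically a URL or file-like output
```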

3.2 Task Design

We evaluated two workflows representing common production use cases:

Workflow A: Generate

  • Input: Ring reference image only
  • Task: Generate a photorealistic on-hand shot from scratch
  • Challenge: Model must create realistic hand anatomy while accurately reproducing the ring

Workflow B: Replace

  • Input: Ring reference image + hand photograph
  • Task: Replace existing ring in hand photo with reference ring
  • Challenge: Model must preserve hand realism while accurately swapping the ring

3.3 Test Set

We selected 9 rings across 3 complexity levels:

| Complexity | Description                               | Count |
|------------|-------------------------------------------|-------|
| Simple     | Solitaire, single stone, minimal setting  | 3     |
| Medium     | Multiple stones, moderate detail          | 3     |
| Complex    | Pavé, clusters, intricate settings        | 3     |

Each ring was processed by all 6 models in both workflows, yielding 108 total images (6 models × 9 rings × 2 workflows).

3.4 Evaluation Protocol

Pairwise Comparison

We conducted round-robin pairwise comparisons: every model versus every other model for each ring-workflow combination.

  • 15 unique model pairs × 9 rings × 2 workflows = 270 comparisons
  • Images displayed side-by-side with randomized left/right position
  • Evaluator selected winner or tie for each pair
  • No model labels shown during evaluation
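As a concrete check of the comparison count, the schedule can be enumerated in a few lines of Python (the variable names and seed are ours; the protocol is as described above):

```python
import itertools
import random

MODELS = ["Nano Banana Pro", "Nano Banana", "Gemini 2.5 Flash",
          "FLUX.2 Pro", "FLUX.2 Max", "FLUX.2 Flex"]
RINGS = [f"ring_{i}" for i in range(1, 10)]   # 9 rings
WORKFLOWS = ["generate", "replace"]

def build_schedule(seed: int = 0):
    """Enumerate every unordered model pair for each ring-workflow cell,
    randomizing left/right placement so position does not bias the rater."""
    rng = random.Random(seed)
    schedule = []
    for ring in RINGS:
        for workflow in WORKFLOWS:
            for a, b in itertools.combinations(MODELS, 2):  # 15 unique pairs
                left, right = rng.sample([a, b], 2)         # random side assignment
                schedule.append((ring, workflow, left, right))
    return schedule

assert len(build_schedule()) == 270  # 15 pairs x 9 rings x 2 workflows
```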

Ring Accuracy Rating

Each image was rated independently for ring accuracy:

| Rating  | Definition                              |
|---------|-----------------------------------------|
| Exact   | Perfect match to reference              |
| Close   | Minor variations, clearly the same ring |
| Similar | Same style but noticeable differences   |
| Wrong   | Different ring entirely                 |

Photorealism Rating

Each image was rated for AI-generated appearance:

| Rating         | Definition                                | Commercial Viability |
|----------------|-------------------------------------------|----------------------|
| Photorealistic | Indistinguishable from real photo         | Viable               |
| Minor tells    | Small artifacts, trained eye might detect | Viable               |
| Noticeable     | Clearly AI-generated                      | Marginal             |
| Obviously AI   | Severe artifacts, uncanny appearance      | Not viable           |

3.5 Production-Ready Metric

We define an image as “production-ready” if it meets both criteria:

Production-Ready = (Ring Accuracy: Exact OR Close) AND (Photorealism: Photorealistic OR Minor Tells)

This captures the minimum bar for commercial use: the ring must be recognizably correct, and the image must not appear obviously artificial.
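In code, the bar reduces to a two-condition predicate over the ratings defined in §3.4:

```python
ACCURACY_OK = {"Exact", "Close"}
REALISM_OK = {"Photorealistic", "Minor tells"}

def is_production_ready(ring_accuracy: str, photorealism: str) -> bool:
    """An image is production-ready only if the ring is recognizably
    correct AND the image does not appear obviously artificial."""
    return ring_accuracy in ACCURACY_OK and photorealism in REALISM_OK
```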


4. Results

4.1 Aggregate Rankings

Across all 270 comparisons, the overall rankings were:

| Rank | Model            | Wins | Losses | Ties | Win Rate |
|------|------------------|------|--------|------|----------|
| 1    | Nano Banana Pro  | 58   | 28     | 4    | 66.7%    |
| 2    | FLUX.2 Max       | 41   | 36     | 13   | 52.8%    |
| 3    | Nano Banana      | 42   | 40     | 8    | 51.1%    |
| 4    | FLUX.2 Pro       | 40   | 43     | 7    | 48.3%    |
| 5    | FLUX.2 Flex      | 34   | 48     | 8    | 42.2%    |
| 6    | Gemini 2.5 Flash | 31   | 51     | 8    | 38.9%    |
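The reported win rates are consistent with ties counted as half a win: each model appears in 90 comparisons (5 opponents × 9 rings × 2 workflows), so for example (58 + 4/2) / 90 = 66.7% for Nano Banana Pro. A minimal sketch under that assumption:

```python
def win_rate(wins: int, losses: int, ties: int) -> float:
    """Win rate with ties counted as half a win (our reading of the
    reported figures; each model appears in 90 total comparisons)."""
    total = wins + losses + ties
    return (wins + 0.5 * ties) / total

assert round(win_rate(58, 28, 4), 3) == 0.667   # Nano Banana Pro
assert round(win_rate(41, 36, 13), 3) == 0.528  # FLUX.2 Max
```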

However, these aggregate numbers obscure a critical finding.

4.2 Workflow-Dependent Performance

When separated by workflow, the rankings invert dramatically.

Generate Workflow (ring image only):

| Rank | Model            | Win Rate |
|------|------------------|----------|
| 1    | Nano Banana Pro  | 88.9%    |
| 2    | Nano Banana      | 58.9%    |
| 3    | Gemini 2.5 Flash | 46.7%    |
| 4    | FLUX.2 Flex      | 42.2%    |
| 5    | FLUX.2 Max       | 35.6%    |
| 6    | FLUX.2 Pro       | 27.8%    |

Replace Workflow (ring + hand reference):

| Rank | Model            | Win Rate |
|------|------------------|----------|
| 1    | FLUX.2 Max       | 70.0%    |
| 2    | FLUX.2 Pro       | 68.9%    |
| 3    | Nano Banana Pro  | 44.4%    |
| 4    | Nano Banana      | 43.3%    |
| 5    | FLUX.2 Flex      | 42.2%    |
| 6    | Gemini 2.5 Flash | 31.1%    |

FLUX.2 Pro moves from last place (27.8%) in Generate to second place (68.9%) in Replace—a 41 percentage point swing.

4.3 Visual Comparison

The following figures illustrate the workflow-dependent quality differences.

Figure 1: Reference Ring (Medium Complexity)


Figure 2: Generate Workflow Results

  • Nano Banana Pro: Photorealistic, exact ring
  • Nano Banana: Minor tells, close ring
  • Gemini 2.5 Flash: Minor tells, similar ring
  • FLUX.2 Pro: Noticeable AI artifacts
  • FLUX.2 Max: Minor tells, close ring
  • FLUX.2 Flex: Minor tells, close ring

Generate workflow requires creating realistic hands from scratch. Google models produce more naturalistic skin texture and hand poses. Black Forest Labs models exhibit higher rates of visible AI artifacts.

Figure 3: Replace Workflow Results (Same Ring)

  • All six models (Nano Banana Pro, Nano Banana, Gemini 2.5 Flash, FLUX.2 Pro, FLUX.2 Max, FLUX.2 Flex): Photorealistic

Replace workflow preserves the original hand photograph. All models achieve photorealistic results when not required to generate hands from scratch.

4.4 Ring Accuracy

Ring accuracy measures how faithfully the model reproduces the reference ring.

Table 1: Ring Accuracy by Model and Workflow

| Model            | Generate (Accurate) | Replace (Accurate) | Delta |
|------------------|---------------------|--------------------|-------|
| Nano Banana Pro  | 89%                 | 78%                | -11%  |
| FLUX.2 Flex      | 78%                 | 78%                | 0%    |
| Nano Banana      | 78%                 | 56%                | -22%  |
| FLUX.2 Max       | 56%                 | 78%                | +22%  |
| Gemini 2.5 Flash | 56%                 | 67%                | +11%  |
| FLUX.2 Pro       | 44%                 | 89%                | +45%  |

FLUX.2 Pro exhibits the largest workflow-dependent accuracy shift: 44% in Generate versus 89% in Replace. The model struggles to imagine rings correctly but excels at preserving them during image editing.

4.5 Photorealism (AI Look)

We rated each image for visible AI artifacts.

Table 2: Photorealism Distribution — Generate Workflow

| Model            | Photorealistic | Minor Tells | Noticeable | Obviously AI |
|------------------|----------------|-------------|------------|--------------|
| Nano Banana Pro  | 100%           | 0%          | 0%         | 0%           |
| Nano Banana      | 0%             | 100%        | 0%         | 0%           |
| Gemini 2.5 Flash | 0%             | 100%        | 0%         | 0%           |
| FLUX.2 Flex      | 0%             | 100%        | 0%         | 0%           |
| FLUX.2 Max       | 0%             | 67%         | 22%        | 11%          |
| FLUX.2 Pro       | 0%             | 44%         | 33%        | 22%          |

Nano Banana Pro was the only model rated 100% photorealistic in Generate. FLUX.2 Pro and Max produced “obviously AI” images in 22% and 11% of Generate outputs respectively.

Table 3: Photorealism Distribution — Replace Workflow

| Model        | Photorealistic |
|--------------|----------------|
| All 6 models | 100%           |

All models achieved 100% photorealistic ratings in Replace workflow. Starting with a real hand photograph eliminates the AI-generated hand problem entirely.

4.6 Production-Ready Rates

Combining ring accuracy and photorealism yields the production-ready metric.

Table 4: Production-Ready Rate by Workflow

| Model            | Generate | Replace |
|------------------|----------|---------|
| Nano Banana Pro  | 89%      | 78%     |
| Nano Banana      | 78%      | 56%     |
| FLUX.2 Flex      | 78%      | 78%     |
| Gemini 2.5 Flash | 56%      | 67%     |
| FLUX.2 Max       | 44%      | 78%     |
| FLUX.2 Pro       | 33%      | 89%     |

Figure 4: Production-Ready Rate Inversion

                    Generate                Replace
Nano Banana Pro     ████████████████░░░░    ██████████████░░░░░░
                    89%                     78%

FLUX.2 Pro          ██████░░░░░░░░░░░░░░    ████████████████░░░░
                    33%                     89%

FLUX.2 Pro’s production-ready rate rises from 33% to 89% when switching from Generate to Replace, a 56-point swing that completely reorders the model choice.

4.7 Failure Analysis

We categorized failure modes across workflows.

Table 5: Failure Distribution by Workflow

| Failure Type                  | Generate | Replace |
|-------------------------------|----------|---------|
| Obviously AI appearance       | 30       | 1       |
| Wrong finger placement        | 28       | 4       |
| Ring does not match reference | 79       | 88      |

Generate workflow produces 30× more AI-appearance failures due to the hand generation challenge. Replace workflow produces slightly more ring accuracy failures, possibly due to editing artifacts.

Table 6: Failure Patterns by Model

| Model            | Primary Failure Mode     | Frequency              |
|------------------|--------------------------|------------------------|
| Gemini 2.5 Flash | Ring inaccuracy          | 46 instances           |
| FLUX.2 Pro       | AI appearance (Generate) | 22% of outputs         |
| FLUX.2 Max       | AI appearance (Generate) | 11% of outputs         |
| FLUX.2 Flex      | Ring inaccuracy          | 20 instances           |
| Nano Banana      | Ring inaccuracy          | 28 instances           |
| Nano Banana Pro  | Ring inaccuracy          | 19 instances (lowest)  |

Gemini 2.5 Flash exhibited the poorest ring fidelity across both workflows. Nano Banana Pro had the lowest overall failure rate.

4.8 Ring Complexity Effects

We analyzed performance by ring complexity level.

Table 7: Win Rates by Ring Complexity

| Complexity | Top Model       | Win Rate |
|------------|-----------------|----------|
| Simple     | FLUX.2 Max      | 61.7%    |
| Medium     | Nano Banana Pro | 76.7%    |
| Complex    | Nano Banana Pro | 88.3%    |

Black Forest Labs models compete effectively on simple rings, where FLUX.2 Max takes the top spot at 61.7%. Google’s advantage increases with ring complexity: Nano Banana Pro wins 76.7% of comparisons on medium rings and 88.3% on complex multi-stone settings.

Figure 5: Simple vs Complex Ring Performance

  • Nano Banana Pro: Simple ring (win rate 61.7%)
  • FLUX.2 Max: Simple ring (win rate 61.7%)
  • Nano Banana Pro: Complex ring (win rate 88.3%)

The first two panels show competitive performance across providers on simple rings; the third shows Google’s advantage in fine detail reproduction on complex rings.


5. Analysis

5.1 Why Do Rankings Invert?

The workflow-dependent performance inversion reflects fundamentally different task requirements.

Generate workflow requires the model to synthesize realistic human hands from its training distribution. Google’s models, likely trained on larger and more diverse image datasets, produce more naturalistic hand anatomy and skin texture. Black Forest Labs models exhibit higher rates of artifacts: uncanny skin rendering, anatomically incorrect finger positions, and obvious digital textures.

Replace workflow requires precise image editing: preserving the hand photograph while seamlessly integrating a new ring. Black Forest Labs’ FLUX architecture, designed for image-to-image transformation, excels at this task. The model can attend to the ring region specifically while leaving the hand largely unchanged.

5.2 The Photorealism Gap

The most striking finding is the photorealism difference between workflows. In Generate, only Nano Banana Pro achieved 100% photorealistic ratings. In Replace, all six models achieved 100%.

This suggests that hand generation—not ring rendering—is the primary source of AI artifacts in jewelry photography. When given a real hand photograph as reference, even models that struggle with generation produce commercially viable results.

5.3 Ring Accuracy Trade-offs

FLUX.2 Pro’s 45-point accuracy improvement in Replace (44% → 89%) indicates that the model’s ring rendering capability is intact—the problem in Generate is not ring reproduction but ring-in-context imagination. When the model must “decide” what ring to place on a generated hand, it often produces a plausible but incorrect ring. When editing an existing image, it can focus on accurate reproduction.

5.4 Cost-Efficiency Analysis

We calculated cost per production-ready image by dividing unit cost by production-ready rate.
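In code, using the unit costs from §3.1 and the production-ready rates from Table 4:

```python
# Unit costs per image (from §3.1, December 2025 invoice data).
COST_PER_IMAGE = {
    "Nano Banana Pro": 0.150, "Nano Banana": 0.039, "Gemini 2.5 Flash": 0.039,
    "FLUX.2 Pro": 0.090, "FLUX.2 Max": 0.190, "FLUX.2 Flex": 0.315,
}

def cost_per_usable_image(model: str, production_ready_rate: float) -> float:
    """Expected spend per production-ready image: unit cost divided by
    the fraction of outputs that clear the production-ready bar."""
    return COST_PER_IMAGE[model] / production_ready_rate

assert round(cost_per_usable_image("FLUX.2 Pro", 0.89), 2) == 0.10   # Replace
assert round(cost_per_usable_image("Nano Banana", 0.78), 2) == 0.05  # Generate
```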

Table 8: Cost per Production-Ready Image

| Model            | Generate | Replace |
|------------------|----------|---------|
| Nano Banana      | $0.05    | $0.07   |
| Gemini 2.5 Flash | $0.07    | $0.06   |
| Nano Banana Pro  | $0.17    | $0.19   |
| FLUX.2 Pro       | $0.27    | $0.10   |
| FLUX.2 Flex      | $0.40    | $0.40   |
| FLUX.2 Max       | $0.43    | $0.24   |

For Generate workflow, Nano Banana offers the best cost-efficiency at $0.05 per usable image. For Replace workflow, FLUX.2 Pro achieves $0.10 per usable image with the highest accuracy (89%).

FLUX.2 Flex is overpriced in both workflows—same production-ready rate as cheaper alternatives at 4-8× the cost.


6. Limitations

6.1 Sample Size

Our evaluation used 9 rings across 3 complexity levels. While sufficient to identify significant trends, a larger test set would provide higher confidence in the findings.

6.2 Single Evaluator

Pairwise comparisons and quality ratings were performed by a single evaluator. Inter-rater reliability studies would strengthen the methodology.

6.3 Prompt Variation

All images were generated using standardized prompts. Different prompting strategies might yield different relative performance.

6.4 Temporal Validity

Model capabilities evolve rapidly. These results reflect December 2025 model versions; future updates may change relative performance.

6.5 Product Category

This study evaluated rings only. Results may not generalize to necklaces, earrings, bracelets, or other jewelry categories with different visual characteristics.


7. Future Work

7.1 Extended Product Categories

Part 3 will evaluate the same models on additional shot types:

  • Studio hero shots (white background, product-only)
  • Flat lay compositions
  • On-model shots for necklaces and earrings

7.2 Alternative Approaches

We will compare the Replace workflow winners (FLUX.2 Pro/Max) against mask-based inpainting approaches (Ideogram) and LoRA fine-tuned models.

7.3 Scale Testing

Production deployment requires consistent performance at scale. We will test batch consistency and failure rate stability across larger generation runs.


8. Conclusion: Two Product Options

We presented the first systematic evaluation of frontier image generation models for commercial jewelry photography. Our central finding—that model performance inverts between generation and replacement workflows—has a direct implication for product design:

A single-workflow platform would underserve half of all use cases. The data supports offering customers two distinct options.

8.1 The Case for Two Options

The performance inversion we observed is not a minor variation—it’s a complete ranking reversal:

| Metric           | Generate Winner        | Replace Winner     |
|------------------|------------------------|--------------------|
| Win rate         | Nano Banana Pro (89%)  | FLUX.2 Max (70%)   |
| Production-ready | Nano Banana Pro (89%)  | FLUX.2 Pro (89%)   |
| Photorealism     | Nano Banana Pro (100%) | All models (100%)  |
| Best value       | Nano Banana ($0.05)    | FLUX.2 Pro ($0.10) |

A platform built exclusively on Google models would fail at replacement tasks. A platform built exclusively on Black Forest Labs models would produce unacceptable AI artifacts in generation tasks. Neither approach serves all customer needs.

This study provides the empirical foundation for offering both workflows as first-class product options.


8.2 Option A: Generate From Scratch

What it is: AI creates the complete image—hand, ring, background—from a ring reference photo alone.

Pros

  • No photography required — Customer needs only product images, not hand models or studio shoots
  • Unlimited variety — Each generation produces a unique hand pose and composition
  • Lower barrier to entry — Ideal for brands without existing photography assets
  • Faster onboarding — Customer can generate images immediately after uploading ring photos

Cons

  • Photorealism varies by model — Only Nano Banana Pro achieves 100% photorealistic results; other models show “minor tells” or worse
  • Higher failure rate for complex rings — Models struggle to accurately reproduce intricate multi-stone settings
  • AI appearance risk — 11-22% of Black Forest Labs outputs rated “obviously AI” in this workflow

Model Recommendations

| Priority        | Model                  | Production-Ready | Cost/Usable |
|-----------------|------------------------|------------------|-------------|
| Premium quality | Nano Banana Pro        | 89%              | $0.17       |
| Best value      | Nano Banana            | 78%              | $0.05       |
| Budget option   | Gemini 2.5 Flash       | 56%              | $0.07       |
| Avoid           | FLUX.2 Pro, FLUX.2 Max | 33-44%           | $0.27-0.43  |

Ideal Use Cases

  • Brands launching new collections without existing lifestyle photography
  • High-volume catalog generation where some variation is acceptable
  • Simple to medium complexity rings (solitaires, basic multi-stone)
  • Cost-sensitive applications where $0.05/usable is the target

8.3 Option B: Replace With Templates

What it is: AI swaps the ring in an existing hand photograph, preserving the original hand, pose, and lighting.

Pros

  • 100% photorealistic — All models achieved photorealistic ratings; the real hand photo eliminates AI appearance issues entirely
  • Consistent brand aesthetic — Same template produces visually cohesive catalog
  • Higher ring accuracy — FLUX.2 Pro achieves 89% accuracy vs 44% in Generate
  • Model flexibility — Performance differences between models are smaller; more options available

Cons

  • Requires template library — Customer needs hand photography assets or access to a template library
  • Less variety — Same template produces similar compositions; variety requires multiple templates
  • Higher onboarding friction — Customers must upload or select templates before generating

Model Recommendations

| Priority            | Model            | Production-Ready | Cost/Usable        |
|---------------------|------------------|------------------|--------------------|
| Best accuracy       | FLUX.2 Pro       | 89%              | $0.10              |
| Premium alternative | FLUX.2 Max       | 78%              | $0.24              |
| Budget option       | Gemini 2.5 Flash | 67%              | $0.06              |
| Avoid               | FLUX.2 Flex      | 78%              | $0.40 (overpriced) |

Ideal Use Cases

  • Brands with existing hand photography they want to extend
  • High-accuracy requirements where ring fidelity is critical
  • Complex rings (pavé, clusters, intricate settings)
  • Consistent catalog aesthetics across product lines

8.4 Pricing Implications

The cost structures differ significantly between options:

| Option   | Budget Tier         | Standard Tier      | Premium Tier            |
|----------|---------------------|--------------------|-------------------------|
| Generate | $0.05 (Nano Banana) | $0.07 (Gemini)     | $0.17 (Nano Banana Pro) |
| Replace  | $0.06 (Gemini)      | $0.10 (FLUX.2 Pro) | $0.24 (FLUX.2 Max)      |

Generate offers a lower entry price ($0.05), but matching Replace’s best quality costs more there: an 89% production-ready rate costs $0.17 with Nano Banana Pro in Generate versus $0.10 with FLUX.2 Pro in Replace. Replace offers a tighter cost range with better quality consistency.


8.5 Platform Recommendations

Based on these findings, we recommend:

  1. Offer both options as distinct product features — Let customers choose based on their assets and needs
  2. Build a template library for Replace — Unlocks the higher-quality workflow for customers without photography assets
  3. Implement complexity-based routing — Route simple rings to budget models, complex rings to premium models (see the sketch after this list)
  4. Add ring accuracy validation — Automated quality checks before delivery
  5. Do not include FLUX.2 Flex — Overpriced in all scenarios
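To make recommendation 3 concrete, here is a hypothetical routing rule that simply restates the workflow-specific recommendations in §8.2 and §8.3; the tiering is illustrative, not something we tested as a system:

```python
def route_model(workflow: str, complexity: str) -> str:
    """Hypothetical complexity-based router (illustrative only):
    budget models for simple rings, premium models otherwise,
    following the per-workflow rankings above."""
    if workflow == "generate":
        return "Nano Banana" if complexity == "simple" else "Nano Banana Pro"
    if workflow == "replace":
        return "Gemini 2.5 Flash" if complexity == "simple" else "FLUX.2 Pro"
    raise ValueError(f"unknown workflow: {workflow!r}")
```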

8.6 Open Questions for Future Research

This study tested on-hand ring photography specifically. The Generate vs Replace trade-off may differ for other shot types. Future research should address:

  • Studio shots: Is it better to generate a flat lay from scratch or replace products in a template composition?
  • Lifestyle scenes: Do the model rankings hold for environmental backgrounds?
  • Other jewelry categories: Do necklaces, earrings, and bracelets show the same workflow-dependent patterns?
  • Template design: What template characteristics optimize Replace workflow performance?

These questions will guide our next phase of research as we expand the platform’s capabilities.


Appendix: Head-to-Head Matrix

Complete pairwise comparison results. Each cell shows the row model’s win rate against the column model:

| vs               | F2Pro | F2Max | F2Flex | NaBan | NaBPro | Gemini |
|------------------|-------|-------|--------|-------|--------|--------|
| FLUX.2 Pro       | -     | 44%   | 61%    | 44%   | 36%    | 56%    |
| FLUX.2 Max       | 56%   | -     | 61%    | 36%   | 36%    | 75%    |
| FLUX.2 Flex      | 39%   | 39%   | -      | 50%   | 28%    | 56%    |
| Nano Banana      | 56%   | 64%   | 50%    | -     | 36%    | 50%    |
| Nano Banana Pro  | 64%   | 64%   | 72%    | 64%   | -      | 69%    |
| Gemini Flash     | 44%   | 25%   | 44%    | 50%   | 31%    | -      |

Research conducted December 2025. 270 pairwise comparisons, 108 images, 6 models.

studio formel Research — Advancing AI for commercial content generation.



About studio formel

studio formel is an AI-powered creative platform built specifically for jewelry brands. We combine systematic research on AI generation with a flexible asset management system, helping jewelry sellers create professional images, videos, and ads at scale.
