Before optimizing prompts or building complex workflows, we needed to answer a foundational question: Which AI models can even produce usable jewelry photography?
Study type: Elimination round — 132 images, 11 models, blind evaluation
Research Series: AI Model Comparison for Jewelry Photography
- Part 1: Baseline Capability Test (this article)
- Part 2: Head-to-Head Model Comparison →
- Part 3: Studio Shots Comparison →
The Question This Study Answers
This is Part 1 of our AI jewelry photography research. Before comparing the fine details between top models, we first needed to eliminate the models that simply can’t handle jewelry photography at all.
This study answers: Which AI models should you avoid entirely for jewelry product photography?
This study does NOT answer: Which of the capable models is definitively “best”? (That’s Part 2)
Why Elimination First?
Jewelry photography is demanding. It requires:
- Accurate reproduction of metal colors and finishes
- Precise stone placement and proportions
- Realistic reflections and sparkle
- Natural-looking hands and skin
- Correct ring placement on fingers
Many AI models that work well for general image generation fail completely when asked to reproduce specific jewelry from a reference image. We needed to identify these failures before investing time in detailed comparisons.
TL;DR: Elimination Results
After testing 11 models across 132 images:
ELIMINATED — Do not use for jewelry photography:
- Qwen-Image (Alibaba): 100% rejection rate — failed every test
- Runway Gen-4 Image: 100% rejection rate — failed every test
- Seedream 4 (ByteDance): 75% rejection rate — too unreliable
- GPT Image 1.5 (OpenAI): 58% rejection rate — inconsistent
- Seedream 4.5 (ByteDance): 58% rejection rate — inconsistent
ADVANCING TO PART 2 — Show promise for jewelry:
- Nano Banana Pro (Google): 0% rejection rate — never failed
- FLUX.2 Pro (Black Forest Labs): 8% rejection rate — reliable
- Nano Banana (Google): 8% rejection rate — reliable
- FLUX.2 Flex (Black Forest Labs): 17% rejection rate — usable
- FLUX.2 Max (Black Forest Labs): 23% rejection rate — borderline
- Gemini 2.5 Flash (Google): 25% rejection rate — borderline
Key workflow findings (valid regardless of model ranking):
- The “Replace” approach works 31% better than generating from scratch
- Adding text descriptions of jewelry provides virtually no improvement
- Simple rings have 50% higher success rates than complex designs
Study Design
Reference Images
We tested with three rings at varying complexity levels:

| Level | Description |
|---|---|
| Simple | Gold signet ring with oval flat face, polished, no stones |
| Medium | Delicate gold ring with pear-shaped diamond center stone and halo setting |
| Complex | 14k gold band with alternating vertical gold bars and diamond clusters |
Hand Reference for Replace Scenario

Models Tested
| Model | Lab | Cost/Image |
|---|---|---|
| FLUX.2 Pro | Black Forest Labs | ~$0.045 |
| FLUX.2 Max | Black Forest Labs | ~$0.10 |
| FLUX.2 Flex | Black Forest Labs | ~$0.12 |
| Nano Banana | Google | ~$0.039 |
| Nano Banana Pro | Google | ~$0.15 |
| Gemini 2.5 Flash | Google | ~$0.039 |
| GPT Image 1.5 | OpenAI | ~$0.05 |
| Qwen-Image | Alibaba | ~$0.025 |
| Seedream 4 | ByteDance | ~$0.03 |
| Seedream 4.5 | ByteDance | ~$0.04 |
| Runway Gen-4 Image | Runway | ~$0.05 |
Test Scenarios
Each model was tested in 4 scenarios:
| Scenario | Task | Purpose |
|---|---|---|
| A: Generate | Create ring-on-hand from ring image only | Test basic capability |
| B: Generate + Description | Add text description of the ring | Test if words help |
| C: Replace | Swap ring on existing hand photo | Test precision editing |
| D: Replace + Description | Replace with text description | Test combined approach |
Total: 11 models × 3 rings × 4 scenarios = 132 images
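To make the grid concrete, here is a minimal sketch of how the full test matrix can be enumerated in Python. The labels are ours; actual API identifiers would need to be filled in per provider.

```python
from itertools import product

MODELS = [
    "FLUX.2 Pro", "FLUX.2 Max", "FLUX.2 Flex",
    "Nano Banana", "Nano Banana Pro", "Gemini 2.5 Flash",
    "GPT Image 1.5", "Qwen-Image", "Seedream 4",
    "Seedream 4.5", "Runway Gen-4 Image",
]
RINGS = ["simple", "medium", "complex"]  # the three reference rings
SCENARIOS = ["A", "B", "C", "D"]         # Generate, +Description, Replace, Replace+Description

# One generation job per (model, ring, scenario) combination.
jobs = [
    {"model": m, "ring": r, "scenario": s}
    for m, r, s in product(MODELS, RINGS, SCENARIOS)
]
assert len(jobs) == 132  # 11 x 3 x 4
```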
Evaluation Methodology
Blind Evaluation for Elimination
Our goal was to identify which models produce usable results vs. which should be eliminated. The evaluation was designed to surface clear failures:
- Anonymization: Each of the 11 models was assigned a random ID (01-11), so no image could be traced to its source model during scoring
- Batch presentation: Images shown in randomized batches of 4
- Relative ranking: Within each batch, the evaluator picked 1st place and 2nd place, marked remaining usable images as “OK”, or marked them as “Reject”
- Blind reveal: Model identities were mapped back only after all scoring was complete
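A minimal sketch of the anonymization and batching logic described above (the helper names are ours; the essential properties are a random model-to-ID mapping that is only consulted after scoring, and randomized batches of 4):

```python
import random

def anonymize_models(models: list[str], seed: int | None = None) -> dict[str, str]:
    """Randomly map each model name to an opaque two-digit ID (01-11)."""
    rng = random.Random(seed)
    ids = [f"{i:02d}" for i in range(1, len(models) + 1)]
    rng.shuffle(ids)
    return dict(zip(models, ids))

def make_batches(image_ids: list[str], batch_size: int = 4) -> list[list[str]]:
    """Shuffle all images, then split them into fixed-size batches for rating."""
    shuffled = image_ids[:]
    random.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]

# The anonymize_models() mapping is inverted to reveal model identities
# only after every batch has been scored -- never before.
```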
Evaluation Interface
[Screenshot placeholder: Blind evaluation UI showing 4 images side by side with rating controls]
What This Methodology Measures Well
- Clear failures: A model rejected in 12/12 batches is genuinely unusable
- Consistent performers: A model never rejected across varied conditions is reliable
- Workflow comparisons: Within-model comparisons (Replace vs Generate) are valid
What This Methodology Does NOT Measure
- Precise rankings between top models: The batch composition varied, so “1st place” in one batch isn’t directly comparable to “1st place” in another
- Head-to-head winner: We cannot say “Nano Banana Pro definitively beats FLUX.2 Pro”
This is why we’re calling this Part 1: Elimination. Part 2 will conduct controlled head-to-head comparisons of the advancing models.
Scoring Criteria
Each image was rated on 5 dimensions:
| Dimension | Options | What We Measured |
|---|---|---|
| Ring Match | Exact / Close / Similar style / Wrong ring | Does it match the reference? |
| Hand Quality | Natural / Minor issues / Major issues | Does the hand look real? |
| Placement | Correct / Wrong finger / Floating | Is the ring properly worn? |
| AI Look | Photorealistic / Slight AI / Obviously AI | Would customers notice? |
| Verdict | 1st / 2nd / OK / Reject | Would you publish this? |
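For anyone replicating the scoring sheet, here is a sketch of one way to record these five dimensions per image. The field names and values are ours, taken directly from the table above.

```python
from dataclasses import dataclass
from typing import Literal

RingMatch = Literal["exact", "close", "similar_style", "wrong_ring"]
HandQuality = Literal["natural", "minor_issues", "major_issues"]
Placement = Literal["correct", "wrong_finger", "floating"]
AILook = Literal["photorealistic", "slight_ai", "obviously_ai"]
Verdict = Literal["1st", "2nd", "ok", "reject"]

@dataclass
class ImageScore:
    image_id: str  # the anonymized ID, never the model name
    ring_match: RingMatch
    hand_quality: HandQuality
    placement: Placement
    ai_look: AILook
    verdict: Verdict
```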
Results: Clear Eliminations
Models That Failed Completely
Qwen-Image and Runway Gen-4 should be avoided entirely for jewelry photography. Both achieved a 100% rejection rate — every single output was unusable.
Common failure patterns:
- Generated completely different rings, ignoring the reference image
- In some cases, didn’t generate a ring at all
- These models do not appear to be optimized for reference-image fidelity
Visual examples of failure — Qwen-Image:

Visual examples of failure — Runway Gen-4:

Models With High Failure Rates
Seedream 4 (75% reject), GPT Image 1.5 (58% reject), and Seedream 4.5 (58% reject) showed partial capability but failed too often to be reliable:
- Ring style often “inspired by” the reference rather than matching it
- Correct concept but wrong execution (proportions, stone count, metal color)
- ByteDance models (Seedream) particularly struggled with fine jewelry details
Elimination Summary
| Status | Model | Rejection Rate | Verdict |
|---|---|---|---|
| ADVANCING | Nano Banana Pro | 0% | Never failed — reliable |
| ADVANCING | FLUX.2 Pro | 8% | Rarely failed — reliable |
| ADVANCING | Nano Banana | 8% | Rarely failed — reliable |
| ADVANCING | FLUX.2 Flex | 17% | Occasional failures — usable |
| BORDERLINE | FLUX.2 Max | 23% | Frequent failures — test carefully |
| BORDERLINE | Gemini 2.5 Flash | 25% | Frequent failures — test carefully |
| ELIMINATED | Seedream 4.5 | 58% | Too unreliable |
| ELIMINATED | GPT Image 1.5 | 58% | Too unreliable |
| ELIMINATED | Seedream 4 | 75% | Mostly fails |
| ELIMINATED | Qwen-Image | 100% | Complete failure |
| ELIMINATED | Runway Gen-4 | 100% | Complete failure |
Results: Models That Show Promise
These models produced usable jewelry photography and advance to Part 2 for head-to-head comparison.
Nano Banana Pro — Most consistent performer (0% rejection rate):

FLUX.2 Pro — Strong performer at lower cost (8% rejection rate):

Important note: While Nano Banana Pro showed the lowest rejection rate, this study cannot definitively claim it’s “better” than FLUX.2 Pro. The batch composition varied, meaning these models weren’t always compared head-to-head on identical tasks. Part 2 will address this with controlled comparisons.
Workflow Findings
These findings compare performance within models across different approaches, so they’re valid regardless of cross-model ranking questions.
Replace Works Better Than Generate
Recommendation: When possible, use the Replace workflow: provide a hand reference image and ask the AI to swap in your ring. The hand photo gives the model more spatial context, and in our tests Replace produced 31% better results than generating from scratch.
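As an illustration, here is a sketch of both workflows using the Replicate Python client (the study ran via Replicate; see Reproducibility below). The model slug and input field names are placeholders, since every model exposes its own input schema, so check the model page before running.

```python
import replicate  # pip install replicate; requires REPLICATE_API_TOKEN

with open("ring_reference.jpg", "rb") as ring, open("hand_reference.jpg", "rb") as hand:
    # Scenario A: Generate -- the model must invent the hand, pose, and scene.
    generated = replicate.run(
        "<model-slug>",  # placeholder for the model's Replicate identifier
        input={
            "prompt": "photorealistic hand wearing this ring, soft studio lighting",
            "image": ring,  # field name varies by model
        },
    )

    # Scenario C: Replace -- the hand photo fixes pose, lighting, and scale,
    # so the model only has to swap in the ring.
    replaced = replicate.run(
        "<model-slug>",
        input={
            "prompt": "replace the ring on this hand with the reference ring",
            "image": hand,
            "reference_image": ring,  # placeholder field name for the second image
        },
    )
```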
Text Descriptions Don’t Help
Recommendation: Don’t waste time writing detailed jewelry descriptions. The reference image contains more information than words can convey. Focus your prompts on the scene (pose, lighting, background) instead.
Simple Rings Work Better
Recommendation: Start with your simpler pieces when adopting AI photography. Validate your workflow before attempting complex multi-stone designs.
What Goes Wrong: Failure Mode Analysis
Ring Match Accuracy (All 132 Images)
- Exact match: 29% — AI perfectly reproduced the reference
- Close: 30% — Minor differences, clearly the same ring
- Similar style: 15% — Right category, wrong details
- Wrong ring entirely: 27% — AI generated a different ring

(Percentages are rounded, so they may not sum to exactly 100.)
Common Failure Patterns
- Wrong ring entirely (27%) — AI generates a generic ring instead of the reference. Most common with eliminated models.
- Wrong finger (21%) — Ring on index/middle finger instead of ring finger. Even good models made this mistake.
- Floating/clipping (7%) — Ring merged into hand or appeared to float. More common with complex designs.
- Obviously AI (18%) — Uncanny skin, unrealistic lighting. Would reduce customer trust.
Practical Recommendations
For E-commerce Sellers Today
Based on elimination results, here’s what you can do now:
Definitely avoid:
- Qwen-Image
- Runway Gen-4 Image
- Seedream 4
- GPT Image 1.5
- Seedream 4.5
Safe to test (pending Part 2 results):
- Nano Banana Pro ($0.15/image) — most consistent
- FLUX.2 Pro ($0.045/image) — best value candidate
- Nano Banana ($0.039/image) — budget option
Workflow tips:
- Use Replace approach when possible
- Skip detailed jewelry descriptions
- Start with simple rings
- Budget for 1.5-2x the generations you need so you can select the best outputs
Cost Guidance
| Model | Cost | Reliability |
|---|---|---|
| Nano Banana Pro | $0.15/image | Never failed in this study |
| FLUX.2 Pro | $0.045/image | 92% reliable |
| Nano Banana | $0.039/image | 92% reliable |
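Sticker price and reliability interact: if a model rejects more often, you pay for more attempts per usable image. A back-of-the-envelope sketch, treating the rejection rate as the probability that a generation must be rerolled:

```python
def cost_per_usable_image(price: float, rejection_rate: float) -> float:
    """Expected spend per keeper if every rejected output is regenerated."""
    return price / (1.0 - rejection_rate)

for name, price, reject in [
    ("Nano Banana Pro", 0.15, 0.00),
    ("FLUX.2 Pro", 0.045, 0.08),
    ("Nano Banana", 0.039, 0.08),
]:
    print(f"{name}: ${cost_per_usable_image(price, reject):.3f} per usable image")

# Output:
# Nano Banana Pro: $0.150 per usable image
# FLUX.2 Pro: $0.049 per usable image
# Nano Banana: $0.042 per usable image
```

Even after accounting for rerolls, FLUX.2 Pro stays at roughly a third of Nano Banana Pro's per-keeper cost.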
Study Limitations
What This Study Measured
- Clear eliminations: Models that fail consistently across varied conditions
- Reliable performers: Models that rarely produce unusable results
- Workflow effectiveness: Replace vs Generate, Description impact
What This Study Did NOT Measure
- Precise quality ranking between top models: Due to varied batch composition, we cannot definitively rank Nano Banana Pro vs FLUX.2 Pro vs others
- Head-to-head comparisons: Models weren’t always compared on identical tasks
Why This Matters
A “1st place” in a batch with 3 weak competitors isn’t the same as “1st place” against 3 strong competitors. The elimination results are valid — if a model fails 100% of the time, it’s genuinely bad. But the relative ranking of successful models requires controlled head-to-head testing.
Next: Part 2 — Head-to-Head Comparison
This elimination study answered: Which models can do jewelry photography?
Part 2 will answer: Of the capable models, which is actually best?
Planned methodology for Part 2:
- Show all 6 advancing models’ outputs for identical ring + scenario
- Rank 1-6 within each controlled comparison
- Repeat for multiple conditions
- Produce a definitive quality ranking
Additional planned research:
- Consistency testing (10+ generations per model)
- Other jewelry types (bracelets, necklaces, earrings)
- Ideogram inpainting comparison
Reproducibility
Materials:
- Reference images: 3 rings at simple/medium/complex levels
- Prompts: Documented in methodology
- Raw scores: 132 images with full rating data
Cost to reproduce:
- 132 images × ~$0.06 average = ~$8 total
- Replicate API access required
- ~1 hour for evaluation
Conclusion
This elimination study provides clear guidance on which AI models to avoid for jewelry photography:
Do not use: Qwen-Image, Runway Gen-4, Seedream 4, GPT Image 1.5, Seedream 4.5
Safe to use: Nano Banana Pro, FLUX.2 Pro, Nano Banana (with FLUX.2 Flex, Gemini 2.5 Flash, and FLUX.2 Max as borderline options)
The workflow findings are actionable today:
- Use Replace instead of Generate (+31% improvement)
- Skip detailed jewelry descriptions (no benefit)
- Start with simple rings (50% better success)
Part 2 will determine which of the advancing models produces the best quality. Until then, Nano Banana Pro’s 0% failure rate makes it the safest choice, while FLUX.2 Pro offers strong reliability at 1/3 the cost.
Part 1 of ongoing AI jewelry photography research. December 2025.
Questions? Contact [email protected]
Related Articles
- Which AI Model Works Best for Jewelry Photography? — Practical takeaways from this research
- The Complete Guide to Jewelry Photography — Every shot type you need
- AI Image Studios & Models: 2025 Guide — Overview of all platforms tested
About studio formel
studio formel is an AI-powered creative platform built specifically for jewelry brands. We combine systematic research on AI generation with a flexible asset management system, helping jewelry sellers create professional images, videos, and ads at scale.