Benchmark Release

SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection

SciFigDetect is the first benchmark dedicated to detecting AI-generated scientific figures. It pairs real figures from licensed open-access papers with synthetic counterparts generated by Nano Banana Pro and GPT-image-1.5, while preserving paper context, structured prompts, and provenance metadata.

  • Real Figures: 72,965
  • Synthetic Figures: 150,807
  • Aligned Pairs: 50,379

Overview

Teaser and Main Contributions

Teaser figure for SciFigDetect

Why this benchmark matters

Modern multimodal generators can now produce scientific figures at near-publishable quality. Unlike generic AI-generated images, scientific figures are structured, text-dense, and tightly coupled with scholarly semantics, which makes scientific-figure forensics a distinct problem.

SciFigDetect introduces a realistic evaluation setting built from licensed papers and review-filtered synthetic figures. The benchmark shows that existing AI-generated image detectors remain far from reliable in this domain.

First dedicated benchmark

SciFigDetect is the first benchmark focused on AI-generated scientific figure detection rather than open-domain natural imagery.

Agent-based data construction

The dataset is built from licensed papers through multimodal understanding, structured prompt planning, generation, and review-driven refinement.

Strong context preservation

Each benchmark sample preserves figure-related paper context, the generation prompt, generator identity, license information, and review history.

Key Statistics

72,965

real scientific figures

150,807

synthetic figures across two generators

3

figure categories covering scientific illustration and evidence visuals

50,379

aligned real-synthetic source pairs for controlled comparison

Three Benchmark Settings

Zero-shot evaluation

Off-the-shelf detectors trained on prior AIGI benchmarks are evaluated directly on SciFigDetect without adaptation. The strongest zero-shot result reported in the paper reaches only 53.68% average accuracy.

Cross-generator evaluation

Models trained on one generator transfer poorly to the other. Averaged over detectors, training on Nano Banana yields 83.3% accuracy on Nano Banana but only 48.7% on GPT, while training on GPT yields 87.5% on GPT but only 26.1% on Nano Banana.

Degraded-image evaluation

The benchmark tests robustness under JPEG and WebP compression, Gaussian blur, and Gaussian noise, simulating re-saving, rendering, and screenshot redistribution.
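As a minimal sketch, the listed degradations can be reproduced with Pillow and NumPy. The benchmark's exact compression qualities, blur radii, and noise levels are not stated here, so the default parameters below are illustrative assumptions, not the official settings.

```python
import io

import numpy as np
from PIL import Image, ImageFilter


def jpeg_compress(img: Image.Image, quality: int = 50) -> Image.Image:
    """Re-encode the image as JPEG at the given quality, then decode it back."""
    buf = io.BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")


def gaussian_blur(img: Image.Image, radius: float = 1.5) -> Image.Image:
    """Apply a Gaussian blur with the given radius in pixels."""
    return img.filter(ImageFilter.GaussianBlur(radius))


def gaussian_noise(img: Image.Image, sigma: float = 10.0, seed: int = 0) -> Image.Image:
    """Add zero-mean Gaussian noise with standard deviation sigma (0-255 scale)."""
    arr = np.asarray(img.convert("RGB"), dtype=np.float32)
    arr += np.random.default_rng(seed).normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
```

WebP re-saving follows the same pattern as `jpeg_compress` with `format="WEBP"`.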

Dataset

Dataset Composition and Coverage

Dataset Scale by Subset

Subset        Illustration   Overview   Experimental Figure     Total
Real                 5,773      8,882                58,310    72,965
Nano Banana          4,616      6,608                39,155    50,379
GPT                  9,090     13,164                78,174   100,428

Real subset

Real figures are collected from commercially permissible open-access papers. They provide authentic scientific layouts, annotation styles, and figure-paper alignment grounded in published research artifacts.

Nano Banana subset

Nano Banana Pro figures are synthesized from structure-aware prompts derived from paper context and figure understanding. This subset also forms the full aligned-pair split with source real figures.

GPT subset

GPT-image-1.5 figures expand generator diversity and reveal large cross-generator distribution shifts. The GPT subset is notably larger and helps expose generator-specific detector overfitting.

Illustration

Conceptual diagrams, method sketches, and schematic visuals emphasizing designed structure, layout composition, legends, arrows, and symbolic elements.

Overview

High-level workflow and system overview figures that summarize pipelines, modules, interactions, or multi-stage frameworks in paper-style layouts.

Experimental Figure

Result-oriented scientific visuals such as plots, charts, tables-as-figures, and empirical evidence presentations with dense labels and publication semantics.

Aligned real-synthetic pairs

A core subset of SciFigDetect forms aligned real-synthetic pairs, where the same source figure is associated with both Nano Banana and GPT-generated counterparts. The paper reports 4,616 aligned illustrations, 6,608 aligned overviews, and 39,155 aligned experimental figures.

These aligned pairs are especially valuable for controlled comparison because the real figure and synthetic variants share the same paper context and core semantics while differing in generation source.
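As a sketch, aligned groups can be recovered from per-sample records by keying on the pair identifier. The field names (`metadata.aligned_pair_id`, `generator`) follow the example record in the Data Schema section; the release format may differ.

```python
from collections import defaultdict


def group_aligned_pairs(records):
    """Group sample records by aligned_pair_id.

    Each group should hold one real figure plus the synthetic variants
    (e.g. Nano Banana Pro and GPT-image-1.5) derived from the same source
    figure. Records without a pair id are skipped.
    """
    groups = defaultdict(list)
    for rec in records:
        pair_id = rec.get("metadata", {}).get("aligned_pair_id")
        if pair_id:
            groups[pair_id].append(rec)
    return dict(groups)
```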

What is included per sample?

Asset               Included   Purpose
Structured prompt   Yes        Captures style-oriented and content-oriented generation intent.
Paper context       Yes        Preserves figure-related scholarly semantics and provenance.
Metadata            Yes        Includes category, license, generator identity, and review history.

Access

Full Dataset Release and Agreement

Release policy

The full SciFigDetect dataset will be released after the paper is accepted. The repository currently provides the project website, a small example subset, and the documentation needed for controlled data access.

If you need access before the public release, please sign the data sharing agreement and follow the request process below.

Access process

Download the agreement here: Data sharing License Agreement.docx

1. Download and sign the agreement.
2. Email the signed file to xiaobai.li@zju.edu.cn.
3. Wait for confirmation and further instructions from the project team.

Contact: xiaobai.li@zju.edu.cn

Data Schema

Sample Format and Field Definitions

Benchmark sample definition

z = {
  "context": c,
  "real_figure": f_real,
  "synthetic_figure": f_syn,
  "metadata": a
}

Sample semantics

The paper defines each benchmark sample as a tuple containing figure-related paper context, the original real figure, the accepted synthetic figure, and auxiliary metadata. This design preserves both visual evidence and the scholarly semantics behind each figure.

Example JSON record

{
  "sample_id": "paper_001_fig_03_gpt",
  "paper_id": "paper_001",
  "figure_id": "fig_03",
  "split": "test",
  "figure_type": "overview",
  "topic_group": "Generative & Learning",
  "is_real": false,
  "generator": "gpt-image-1.5",
  "real_image_path": "images/real/paper_001_fig_03.png",
  "synthetic_image_path": "images/gpt/paper_001_fig_03.png",
  "paper_context": {
    "caption": "...",
    "section_text": "...",
    "reference_paragraphs": ["...", "..."]
  },
  "prompt": {
    "style_prompt": "...",
    "content_prompt": "...",
    "full_prompt": "..."
  },
  "review": {
    "fidelity": 0.78,
    "aesthetics": 0.74,
    "logic": 0.82,
    "overall": 0.78,
    "accepted": true
  },
  "metadata": {
    "license": "CC BY",
    "source_pdf": "paper.pdf",
    "generator_family": "GPT",
    "aligned_pair_id": "pair_001_fig_03"
  }
}
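Assuming records like the one above are stored one per line in a JSON Lines file (an assumption about the release format, not a confirmed detail), a minimal loader with optional split and generator filters might look like:

```python
import json
from pathlib import Path


def load_samples(jsonl_path, split=None, generator=None):
    """Load benchmark records from a JSON Lines file.

    Optionally keep only records matching a split ("train"/"test")
    or a generator identity ("real", "nano_banana_pro", "gpt-image-1.5").
    """
    samples = []
    for line in Path(jsonl_path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        rec = json.loads(line)
        if split is not None and rec.get("split") != split:
            continue
        if generator is not None and rec.get("generator") != generator:
            continue
        samples.append(rec)
    return samples
```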

CSV / JSON field guide

Field             Type        Description
sample_id         string      Unique sample identifier for one real or synthetic instance.
paper_id          string      Source paper identifier used for paper-level splitting.
figure_id         string      Original figure index inside the source paper.
figure_type       string      One of illustration, overview, or experimental_figure.
generator         string      real, nano_banana_pro, or gpt-image-1.5.
paper_context     json/text   Caption and figure-related context extracted from the paper body.
prompt            json/text   Structured generation prompt combining style and content signals.
review_overall    float       Overall review score used for synthetic sample acceptance.
aligned_pair_id   string      Links real and synthetic figures derived from the same source figure.
license           string      License information for compliant source-paper usage.
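Since `paper_id` drives paper-level splitting, all figures from one paper must land in the same split. A deterministic hash-based sketch (an assumption for illustration; the paper's actual split procedure is not specified here) could be:

```python
import hashlib


def paper_level_split(records, test_fraction=0.2):
    """Assign records to train/test by hashing paper_id.

    Hashing the paper identifier (rather than the sample) guarantees that
    every figure from a given paper falls into the same split, preventing
    paper-level leakage between train and test.
    """
    train, test = [], []
    for rec in records:
        digest = hashlib.sha256(rec["paper_id"].encode("utf-8")).digest()
        frac = int.from_bytes(digest[:4], "big") / 2**32  # deterministic in [0, 1)
        (test if frac < test_fraction else train).append(rec)
    return train, test
```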

Construction Pipeline

How the Dataset Is Built

SciFigDetect is constructed from licensed source papers through a compliant master-worker pipeline. The process is designed to improve trustworthiness, reproducibility, and dataset realism by preserving figure-paper alignment rather than generating synthetic images in isolation.

01

Licensed paper retrieval

Candidate papers are filtered by commercially permissible licenses such as CC BY before any benchmark sample is constructed.

02

Multimodal understanding

A Chunking Agent segments the paper, a Text Agent extracts figure-relevant semantics, and a Figure Agent analyzes layout, modules, arrows, legends, color usage, and spatial hierarchy.

03

Structured prompt planning

A Prompt Builder merges paper semantics and figure understanding into structure-aware prompts with style-oriented and content-oriented components.

04

Generation and review loop

Candidate figures are synthesized by Nano Banana Pro or GPT-image-1.5 and then scored for academic fidelity, aesthetic consistency, and logical coherence.

05

Acceptance and curation

Samples are accepted only when the overall review score is at least 0.6. Accepted records store the real figure, synthetic figure, context, prompt, category, generator, license, and review history.
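The acceptance rule can be sketched as a simple filter over review records. The 0.6 threshold is from the paper; the field names mirror the example record in the Data Schema section, and the helper names are hypothetical.

```python
def is_accepted(review, threshold=0.6):
    """Return True when the overall review score meets the acceptance threshold."""
    return review.get("overall", 0.0) >= threshold


def filter_accepted(candidates, threshold=0.6):
    """Keep only candidate synthetic figures whose overall review score
    is at least the threshold (0.6 in the paper)."""
    return [c for c in candidates if is_accepted(c.get("review", {}), threshold)]
```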

Why this improves credibility

Compliance by design

The benchmark starts from papers with permissible licenses, reducing legal ambiguity around redistribution and benchmark construction.

Reproducible generation trace

Context, prompts, generator identity, and review records make the synthetic figures traceable rather than opaque.

Better scientific realism

The synthetic figures are anchored to real paper semantics and filtered by a review loop, which makes them closer to realistic misuse scenarios.

Benchmark

Representative Findings

Setting          What is tested                                            Main finding
Zero-shot        Direct transfer of existing open-domain AIGI detectors    All methods degrade sharply; the best reported average accuracy is only 53.68%.
Cross-generator  Training on one generator, testing on the other           Detectors show strong generator-specific overfitting and poor transfer.
Degraded-image   Compression, blur, and noise on test images               Even strong clean-data models remain fragile under realistic post-processing.

Citation

Reference

@misc{hu2026scifigdetectbenchmarkaigeneratedscientific,
      title={SciFigDetect: A Benchmark for AI-Generated Scientific Figure Detection}, 
      author={You Hu and Chenzhuo Zhao and Changfa Mo and Haotian Liu and Xiaobai Li},
      year={2026},
      eprint={2604.08211},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.08211}, 
}