DiagramQG: A Dataset for Generating Concept-Focused Questions from Diagrams

Overview

DiagramQG is a comprehensive educational dataset focused on scientific diagram question generation. It contains:

Note: Due to the ongoing peer review process of our research paper, we are currently releasing a subset of the DiagramQG dataset.

Dataset Examples

Figure 1: Four examples from different subjects in the DiagramQG dataset.

Domain Distribution

Figure 2: Domain diversity in DiagramQG. Each color corresponds to one subject: Natural Science (blue), Earth Science (yellow), Applied Science (green), and Social Science (orange).

Dataset Structure

Subject Areas

The dataset covers four main subject areas: Natural Science, Earth Science, Applied Science, and Social Science.

Hierarchical Organization

Data is organized hierarchically:

  1. Subject (e.g., Natural Science)
  2. Course (e.g., Biology)
  3. Concept (e.g., Ecological interactions)
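The three-level hierarchy above can be sketched as a nested mapping. This is a minimal illustration only: the `DiagramQuestion` class, its field names, the `build_hierarchy` helper, and the sample IDs and question text are all hypothetical and do not reflect the released file format.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DiagramQuestion:
    """One question tied to a diagram (hypothetical schema, not the official one)."""
    subject: str
    course: str
    concept: str
    diagram_id: str
    question: str


def build_hierarchy(items):
    """Group items as subject -> course -> concept -> list of questions."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for item in items:
        tree[item.subject][item.course][item.concept].append(item)
    return tree


# Placeholder records; IDs and question text are invented for illustration.
items = [
    DiagramQuestion("Natural Science", "Biology", "Ecological interactions",
                    "diagram_0001", "placeholder question A"),
    DiagramQuestion("Natural Science", "Biology", "Ecological interactions",
                    "diagram_0002", "placeholder question B"),
]

tree = build_hierarchy(items)
concept_items = tree["Natural Science"]["Biology"]["Ecological interactions"]
print(len(concept_items))
```

Nesting by subject, course, and concept keeps lookups aligned with the way the data is organized, so all questions for a single concept can be retrieved with three key accesses.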

Data Collection Process

Phase 1: Initial Data Gathering

Phase 2: Organization

Phase 3: Annotation

Phase 4: Quality Assurance

Dataset Analysis

Question Distribution

Figure 3: Question distribution in DiagramQG.

Concept Distribution

Figure 4: Distribution of diagrams, questions, and questions per diagram ratios across different concepts in DiagramQG.
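The per-concept quantities shown in Figure 4 (diagram count, question count, and questions per diagram) could be computed along these lines. This is a rough sketch: the record layout `(concept, diagram_id, question)`, the `concept_stats` helper, and the sample values (including the concept name "Hypothetical concept") are assumptions made for illustration, not the dataset's actual schema or contents.

```python
from collections import defaultdict

# Hypothetical records: (concept, diagram_id, question text).
records = [
    ("Ecological interactions", "d1", "placeholder question 1"),
    ("Ecological interactions", "d1", "placeholder question 2"),
    ("Ecological interactions", "d2", "placeholder question 3"),
    ("Hypothetical concept", "d3", "placeholder question 4"),
]


def concept_stats(rows):
    """Return diagram count, question count, and their ratio for each concept."""
    diagrams = defaultdict(set)   # concept -> set of distinct diagram ids
    questions = defaultdict(int)  # concept -> number of questions
    for concept, diagram_id, _question in rows:
        diagrams[concept].add(diagram_id)
        questions[concept] += 1
    return {
        concept: {
            "diagrams": len(diagrams[concept]),
            "questions": questions[concept],
            "questions_per_diagram": questions[concept] / len(diagrams[concept]),
        }
        for concept in questions
    }


stats = concept_stats(records)
# Sorting by question count makes a long-tailed distribution easy to inspect.
for concept, s in sorted(stats.items(), key=lambda kv: -kv[1]["questions"]):
    print(concept, s)
```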

Dataset Comparison

Dataset   | Questions | Images | Objects/Image | Image Type | Constraints      | Knowledge Type
VQAv2.0   | 1.1M      | 20k    | 3.5           | natural    | answer           | N/A
FVQA      | 5,826     | 2k     | 2.9           | natural    | answer           | common-sense
VQG-COCO  | 25,000    | 5k     | 3.3           | natural    | image, caption   | common-sense
K-VQG     | 16,098    | 13k    | 2.7           | natural    | knowledge triple | common-sense
DiagramQG | 19,475    | 8,372  | 11.2          | diagram    | target, concept  | subject knowledge
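Two figures can be derived directly from the DiagramQG row of the comparison: the average number of questions per diagram, and how its object density compares with the densest natural-image dataset listed (VQAv2.0, 3.5 objects per image). The short calculation below uses only the numbers from the table; the variable names are illustrative.

```python
# Values taken from the comparison table.
diagramqg_questions = 19_475
diagramqg_images = 8_372
diagramqg_objects_per_image = 11.2
vqav2_objects_per_image = 3.5  # highest density among the natural-image datasets listed

# Average number of questions attached to each diagram.
questions_per_diagram = diagramqg_questions / diagramqg_images

# How many times denser DiagramQG diagrams are than VQAv2.0 images.
density_ratio = diagramqg_objects_per_image / vqav2_objects_per_image

print(f"questions per diagram: {questions_per_diagram:.2f}")  # 2.33
print(f"object density ratio:  {density_ratio:.1f}x")         # 3.2x
```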

Unique Challenges

  1. Domain-specific Knowledge Requirement
    • Requires understanding of specialized subject concepts
    • Goes beyond common sense reasoning
  2. Long-tail Distribution
    • Uneven concept coverage
    • Challenges in model generalization
  3. High Information Density
    • Complex diagram interpretation
    • Dense visual information processing

Download

The dataset is released under the Apache-2.0 license. You can download it from DiagramQG or check out our GitHub repository.