PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning

Benchmark

Key Components

1,200 physics problems covering multiple domains
25% knowledge-based, 75% reasoning-based questions
147 physics theorems
81% problems include diagrams

Data Collection Process

Acquisition: From global college entrance exams and competitions
Standardization: Using MinerU framework
Translation: Two-phase process with expert verification
Search Prevention: Exclude easily searchable problems
Difficulty Classification: Based on solving time and theorem complexity

Table 1: Comparison with Existing Benchmarks
Benchmark	Multi-modal	Size	Knowledge	Question		Solution
Benchmark	Multi-modal	Size	Knowledge	Type	Avg. T	Step-by-step	Avg. T	Avg. S
JEEBench	❌	123	CEE	OE,MC	169.7	-	-	-
MMLU-Pro	❌	1299	COL	MC	52.1	-	-	-
GPQA	❌	227	PH.D.	OE	111.4	❌	197.2	3.6
SciEval	❌	1657	-	OE,MC	154.5	-	-	-
SciBench	✅	295	COL	OE	80.5	❌	315.9	2.8
MMMU	✅	443	COL	OE,MC	53.8	-	-	-
ScienceQA	✅	617	K1-K12	MC	13.3	❌	63.0	2.4
OlympiadBench	✅	2334	COMP	OE	222.0	❌	199.8	3.7
EMMA	✅	156	-	MC	109.5	-	-	-
Ours-Knowledge	✅	300	CEE+COMP	OE	163.7	✅	196.5	3.3
Ours-Easy	✅	300	CEE+COMP	OE	171.2	✅	241.5	5.0
Ours-Medium	✅	300	CEE+COMP	OE	229.2	✅	391.3	8.4
Ours-Hard	✅	300	CEE+COMP	OE	340.9	✅	936.1	15.6
Ours-Full	✅	1200	CEE+COMP	OE	226.3	✅	441.3	8.1

Evaluation Framework

PSAS-A (Answer Level Evaluation)

PSAS-A evaluates based on sub-question answers. It extracts answers from the model's reasoning process using LLM, verifies semantic consistency, and calculates scores considering solution step lengths as weights for different sub-questions.

PSAS-S (Step Level Evaluation)

PSAS-S provides detailed step-by-step assessment through four phases: data extraction, scoring, first error step detection, and error analysis. It identifies where models first deviate from correct reasoning paths and classifies error types.

Figure 2: Step-level evaluation example obtained from PSAS-S framework

Experimental Results

Table 3: Model performance comparisons on the PhysReason benchmark
Model	Input	Knowledge	Easy	Medium	Hard	Avg.
Non-O-like Models
Qwen2VL-72B	Q, I	41.92/62.47	24.04/45.26	15.97/36.13	4.83/24.23	16.96/42.88
InternVL2.5-78B	Q, I	28.34/64.71	24.16/50.69	17.72/38.56	9.71/25.95	19.98/45.89
GPT-4o	Q, I	50.71/65.82	33.87/51.98	22.73/42.36	11.03/24.71	29.58/47.23
Deepseek-V3-671B	Q, IC	55.86/66.14	40.06/52.77	26.63/44.02	13.73/26.87	34.07/48.42
Claude-3.5-Sonnet	Q, I	54.14/66.45	41.35/55.85	28.14/44.86	15.11/28.51	34.69/49.88
Gemini-2.0-Flash	Q, I	65.08/75.04	54.84/68.60	39.79/55.67	21.99/38.39	45.20/60.40
Gemini-2.0-Pro	Q, I	67.99/79.01	55.43/71.47	44.29/57.74	23.81/42.66	47.88/62.74
O-like Models
o1-mini	Q, IC	53.90/65.74	35.21/52.26	22.24/40.19	10.61/26.80	30.49/47.18
QvQ-72B	Q, I	62.44/70.92	53.74/64.65	28.18/54.88	14.30/36.47	32.67/57.66
Gemini-2.0-T†	Q, I	65.35/77.20	51.89/67.49	44.43/58.95	27.14/45.48	47.20/63.07
QwQ-32B	Q, IC	62.03/76.28	54.92/71.08	43.64/62.14	22.99/42.19	45.89/63.87
GLM-Zero	Q, IC	64.95/80.36	54.11/71.54	41.32/63.67	23.04/47.46	46.52/65.76
o3-mini-high	Q, IC	70.67/83.61	67.20/81.95	45.31/64.57	30.12/47.23	53.32/69.34
Gemini-2.0-T*	Q, I	73.44/84.15	63.17/75.94	50.41/66.60	31.90/48.47	54.73/69.73
Deepseek-R1	Q, IC	75.11/85.91	65.08/79.81	54.84/72.02	31.95/51.50	56.75/73.26

Table 4: Comparison on PhysReason-mini with PSAS-A
Model	K.	E.	M.	H.	Avg.
o1-mini	54.80	30.33	15.41	7.92	27.11
QvQ-72B	51.17	37.10	29.83	22.13	35.06
QwQ-32B	64.40	50.07	38.88	27.45	45.20
Gemini-2.0-T†	71.47	49.97	36.83	22.97	45.42
GLM-Zero	72.70	50.17	43.42	24.70	47.75
o1	72.47	53.37	49.31	25.32	50.12
o3-mini-high	71.10	63.20	47.02	31.93	53.31
Gemini-2.0-T*	76.33	56.87	51.85	32.61	54.42
Deepseek-R1	85.17	60.77	47.24	33.23	56.60

Figure 3: Error statistics with PSAS-S framework

Performance analysis — Figure 4: Performance with PSAS-S framework in hard problems