📝 Publications

👁️ Multimodal Learning & Visual Reasoning

CoFFT: Chain of Foresight-Focus Thought for Visual Language Models
Xinyu Zhang, et al., NeurIPS 2025 (CCF-A)

  • Proposes a training-free method that adaptively adjusts visual focus in vision-language models (VLMs).

PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning
Xinyu Zhang, et al., ACL 2025 (CCF-A)

  • The first benchmark specifically targeting physics-based reasoning in visual contexts.

Alignment Relation is What You Need for Diagram Parsing
Xinyu Zhang, et al., IEEE-TIP 2024 (CCF-A, SCI-Q1)

  • Focuses on curriculum-level diagram parsing and question generation.

📑 Selected Conference & Journal Papers

  • Beyond Layer-wise Merging: Dynamic Chain-of-Merging for VLMs, Xinyu Zhang, et al., CVPR 2026 (Under Review)
  • Cognitive Predictive Coding Network, Xinyu Zhang, et al., ACM-MM 2025 (CCF-A)
  • Memory-Enriched Thought-by-Thought Framework (METbT), Xinyu Zhang, et al., CVIU 2025 (CCF-B)
  • RPMG-FSS: Robust Prior Mask Guided Few-Shot Semantic Segmentation, Xinyu Zhang, et al., IEEE-TCSVT 2023 (CCF-B)
  • EvoChart: A Benchmark Towards Real-World Chart Understanding, (Co-author), AAAI 2025
  • CoG-DQA: Chain-of-Guiding Learning for Diagram Question Answering, (Co-author), CVPR 2024