DSS-SQA: Decoupling Structure and Semantics for Semantic Quality Assessment

Hi, I'm Jiajun Wang, a researcher focusing on the application of cutting-edge AI in engineering. This blog is my digital garden.

Links: GitHub Repository | LoViF 2026 Challenge

As Generative AI (AIGC) fundamentally transforms low-level vision tasks, the way we evaluate image quality is undergoing a massive paradigm shift. In modern generative models, the most pressing issue is no longer just blur or noise, but AIGC semantic hallucinations.

To address this, our team (DSS-SQA: Jiajun Wang, Yipeng Sun, Kaiwei Lian) participated in the LoViF 2026 Challenge on Semantic Quality Assessment. We proposed a novel Full-Reference Image Quality Assessment (FR-IQA) framework that explicitly decouples structural fidelity from semantic alignment.

Our method achieved a Final Score of 0.8469 on the official test phase and 0.9121 on the validation phase. On the validation set, this outperforms traditional metrics, deep-feature metrics (LPIPS, DISTS), and even zero-shot large Vision-Language Models (GPT-5.4, Gemini 3.1 Pro).


🛑 The Challenge: "Shortcut Learning" in Semantic IQA

The LoViF 2026 challenge provides a timely benchmark for semantic quality assessment. However, modeling high-level semantic alignment between a reference image and a generated image presents significant obstacles.

The extreme scarcity of training data (only 510 pairs) makes deep neural networks highly prone to "shortcut learning". Instead of learning to evaluate semantic alignment, models easily degenerate into evaluating absolute image sharpness. For instance, a model might erroneously predict a high score for a visually pristine but semantically completely unrelated generated image.

Explicit gating constraints and self-supervised structural priors were vital to overcoming this bottleneck.


🧠 Our Solution: The DSS-SQA Architecture

Our proposed solution, DSS-SQA, explicitly decouples structural degradation from semantic hallucinations using a dual-vision backbone (DINOv3 and CLIP).

1. Dual-Vision Siamese Encoders

Treating structure and semantics as largely independent dimensions of quality, we employ two distinct, frozen pre-trained foundation models:

  • Structural Awareness (DINOv3): We utilize the DINOv3-Base vision transformer to extract structure-aware global representations (implemented with the DINOv3 CLS token by default, with patch-token fallback only when needed).

  • Semantic Awareness (CLIP): We employ the CLIP-ViT-L/14 vision encoder to extract global, human-aligned semantic embeddings.

2. Element-wise Multiplication Fusion

To fuse these features, we compute the element-wise absolute difference between the reference and distorted features in each branch (\(Diff_{s}\) for the DINOv3 branch and \(Diff_{c}\) for the CLIP branch). Crucially, we introduce an Element-wise Multiplication Fusion for the CLIP features.

  • By default, we apply \(L_{2}\) normalization before the Hadamard product: \(\text{mult\_clip} = L_{2}(F_{c}^{ref}) \odot L_{2}(F_{c}^{dist})\).

  • We then replace the raw distorted semantic feature with this multiplication result to force the network to rely on fine-grained semantic interaction.

  • The scalar cosine similarity \(S_{cos}\) is also explicitly injected into the concatenated feature vector.
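
The fusion steps above can be sketched in plain Python. This is a minimal illustration, not the released implementation: the function names, the use of plain lists instead of tensors, and the concatenation order `[Diff_s, Diff_c, mult_clip, S_cos]` are all assumptions made for the example. One detail worth noting is that summing the Hadamard product of two L2-normalized vectors yields exactly their cosine similarity, so \(S_{cos}\) can be read off mult_clip.

```python
import math


def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / max(norm, eps) for x in v]


def abs_diff(a, b):
    """Element-wise absolute difference between two feature vectors."""
    return [abs(x - y) for x, y in zip(a, b)]


def fuse(f_s_ref, f_s_dist, f_c_ref, f_c_dist):
    """Build the fused feature vector and return it together with S_cos.

    f_s_* are structural (DINOv3-style) features, f_c_* are semantic
    (CLIP-style) features. The concatenation layout is an assumption.
    """
    diff_s = abs_diff(f_s_ref, f_s_dist)
    diff_c = abs_diff(f_c_ref, f_c_dist)

    # Hadamard product of the L2-normalized CLIP features.
    mult_clip = [r * d for r, d in zip(l2_normalize(f_c_ref), l2_normalize(f_c_dist))]

    # The sum of that product is the cosine similarity of the two vectors.
    s_cos = sum(mult_clip)

    return diff_s + diff_c + mult_clip + [s_cos], s_cos
```

For identical reference and distorted features, both difference vectors are all zeros and `s_cos` is 1.0; for orthogonal CLIP features, `s_cos` is 0.0.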

3. Explicit Semantic Gating Mechanism

To directly combat shortcut learning, we designed an explicit semantic gate. During inference (model.eval()) with semantic gating enabled, if the cosine similarity \(S_{cos}\) falls below a threshold (0.4), the gate applies a hard veto by forcefully pushing the predicted logits toward the lowest quality class (Class 0). This ensures the final expected score approaches zero, effectively penalizing visually pleasing but semantically completely hallucinated pairs.
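
To make the gating behavior concrete, here is a minimal sketch in plain Python. Only the 0.4 threshold, the inference-only behavior, and the push toward Class 0 come from the description above; the number of classes (5), the evenly spaced class values in [0, 1], and the `veto_bias` added to the Class 0 logit are hypothetical choices for illustration, and the actual implementation may push the logits differently.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_expected_score(logits, s_cos, threshold=0.4, veto_bias=20.0, training=False):
    """Return the expected quality score in [0, 1], applying the semantic gate.

    veto_bias is a hypothetical constant: adding it to the Class 0 logit
    makes Class 0 dominate the softmax when the gate fires.
    """
    gated = list(logits)
    if not training and s_cos < threshold:
        gated[0] += veto_bias  # hard veto: push mass toward the lowest class

    probs = softmax(gated)

    # Assumed mapping: class k of K gets the value k / (K - 1).
    k = len(probs)
    class_values = [i / (k - 1) for i in range(k)]
    return sum(p * v for p, v in zip(probs, class_values))
```

With logits that strongly favor the top class, a pair whose `s_cos` is 0.9 keeps its high expected score, while the same logits with `s_cos` of 0.1 collapse to a score near zero.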

4. Robust Ensemble Strategy

To improve robustness and limit overfitting on the extremely small dataset, we trained with 5-fold stratified cross-validation. During the final testing/inference phase, the expected quality scores from all 5 trained models are averaged (Mean Ensemble) to produce the final output.
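
The two pieces of this strategy can be sketched as follows. This is an illustrative sketch under stated assumptions, not the released pipeline: it assumes stratification is done on a discrete label per pair (for example a binned quality class), assigns folds deterministically without shuffling, and uses hypothetical function names.

```python
from collections import defaultdict


def stratified_folds(labels, k=5):
    """Split sample indices into k folds with roughly equal class proportions.

    Samples of each class are dealt round-robin across the folds. A real
    pipeline would typically shuffle within each class using a fixed seed.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        for j, idx in enumerate(by_class[label]):
            folds[(offset + j) % k].append(idx)
        offset += len(by_class[label])  # continue dealing where we stopped
    return folds


def mean_ensemble(per_model_scores):
    """Average the expected scores of several models, sample by sample.

    per_model_scores[m][i] is the score model m predicts for sample i.
    """
    n_models = len(per_model_scores)
    return [sum(col) / n_models for col in zip(*per_model_scores)]
```

Each fold serves once as the held-out split while a model is trained on the other four, which yields the five models whose scores are averaged at inference time.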


📊 Experimental Results

LoViF 2026 Validation Set

Our method outperforms all baseline metrics on SROCC, PLCC, and Final Score, which supports the claim that decoupling structure and semantics is effective for AIGC evaluation.

| Method | SROCC \(\uparrow\) | PLCC \(\uparrow\) | Final Score \(\uparrow\) |
| --- | --- | --- | --- |
| LPIPS (VGG) | 0.7602 | 0.7389 | 0.7516 |
| DISTS | 0.8172 | 0.8055 | 0.8125 |
| GPT-5.4 | 0.7861 | 0.7957 | 0.7900 |
| Gemini 3.1 Pro | 0.8068 | 0.8174 | 0.8110 |
| DSS-SQA (Ours) | 0.9062 | 0.9209 | 0.9121 |

The Final Score is calculated as \(0.6 \times SROCC + 0.4 \times PLCC\). On the official blind test phase, our model achieved a Final Score of 0.8469.
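
For readers who want to reproduce the metric, the following is a small stdlib-only sketch of the Final Score formula together with PLCC and SROCC. The function names are illustrative, and the official challenge scoring script may differ in details such as tie handling or any nonlinear fitting applied before computing PLCC.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tie group
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation (SROCC): Pearson on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def final_score(srocc, plcc):
    """Challenge metric: 0.6 * SROCC + 0.4 * PLCC."""
    return 0.6 * srocc + 0.4 * plcc
```

Plugging in our validation numbers, `final_score(0.9062, 0.9209)` gives 0.91208, which rounds to the 0.9121 reported in the table.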

Zero-Shot Generalization on BAPPS

To test whether our model learns transferable quality representations rather than overfitting the small LoViF dataset, we conducted zero-shot evaluations on the BAPPS perceptual dataset.

| Method | Trad | CNN | Superres | Overall |
| --- | --- | --- | --- | --- |
| Human (Ceiling) | 80.8% | 84.4% | 73.4% | 73.9% |
| LPIPS (VGG) | 71.4% | 81.4% | 69.0% | 66.8% |
| DSS-SQA (Ours) | 73.9% | 79.6% | 65.5% | 65.0% |

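Despite being optimized for high-level semantic quality, DSS-SQA remains competitive with LPIPS, a metric explicitly trained on BAPPS: it is ahead on traditional distortions (73.9% vs. 71.4%) and trails by 1.8 to 3.5 points on the CNN, super-resolution, and overall splits. This suggests that our DINOv3 representations also capture low-level generative artifacts.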


💻 Code & Resources

We have open-sourced the complete training and inference pipeline.

If you find our work helpful for your research, feel free to drop a star on GitHub! 🌟
