DSS-SQA: Decoupling Structure and Semantics

Links: GitHub Repository | LoViF 2026 Challenge

As Generative AI (AIGC) fundamentally transforms low-level vision tasks, the way we evaluate image quality is undergoing a massive paradigm shift. In modern generative models, the most pressing issue is no longer just blur or noise, but AIGC semantic hallucinations.

To address this, our team (DSS-SQA: Jiajun Wang, Yipeng Sun, Kaiwei Lian) participated in the LoViF 2026 Challenge on Semantic Quality Assessment. We proposed a novel Full-Reference Image Quality Assessment (FR-IQA) framework that explicitly decouples structural fidelity from semantic alignment.

Our method achieved highly competitive results: a Final Score of 0.8469 on the official test phase and 0.9121 on the validation phase, significantly outperforming traditional metrics, deep-feature metrics (LPIPS, DISTS), and even zero-shot large Vision-Language Models (GPT-5.4, Gemini 3.1 Pro).

🛑 The Challenge: "Shortcut Learning" in Semantic IQA

The LoViF 2026 challenge provides a highly timely and critical benchmark. However, modeling this high-level semantic alignment presents significant obstacles.

The extreme scarcity of training data (only 510 pairs) makes deep neural networks highly prone to "shortcut learning". Instead of learning to evaluate semantic alignment, models easily degenerate into evaluating absolute image sharpness. For instance, a model might erroneously predict a high score for a visually pristine but semantically completely unrelated generated image.

Explicit gating constraints and self-supervised structural priors were vital to overcoming this bottleneck.

🧠 Our Solution: The DSS-SQA Architecture

Our proposed solution, DSS-SQA, explicitly decouples structural degradation from semantic hallucinations using a dual-vision backbone (DINOv3 and CLIP).

1. Dual-Vision Siamese Encoders

Recognizing that structure and semantics are orthogonal dimensions, we employ two distinct, frozen pre-trained foundation models:

Structural Awareness (DINOv3): We utilize the DINOv3-Base vision transformer to extract structure-aware global representations (implemented with the DINOv3 CLS token by default, with patch-token fallback only when needed).
Semantic Awareness (CLIP): We employ the CLIP-ViT-L/14 vision encoder to extract global, human-aligned semantic embeddings.

2. Element-wise Multiplication Fusion

To fuse these features, we compute the absolute differences for both branches ($Diff_{s}$ and $Diff_{c}$). Crucially, we introduce an Element-wise Multiplication Fusion for the CLIP features. $$ $$

By default, we apply $L_{2}$ normalization before the Hadamard product: mult_clip $=L_{2}(F_{c_ref}) \odot L_{2}(F_{c_dist})$.
We then replace the raw distorted semantic feature with this multiplication result to force the network to rely on fine-grained semantic interaction.
The scalar cosine similarity $S_{cos}$ is also explicitly injected into the concatenated feature vector.

3. Explicit Semantic Gating Mechanism

To directly combat shortcut learning, we designed an explicit semantic gate. During inference (model.eval()) with semantic gating enabled, if the cosine similarity $S_{cos}$ falls below a threshold (0.4), the gate applies a hard veto by forcefully pushing the predicted logits toward the lowest quality class (Class 0). This ensures the final expected score approaches zero, effectively penalizing visually pleasing but semantically completely hallucinated pairs.

4. Robust Ensemble Strategy

To maximize robustness and prevent overfitting on the extremely small dataset, we utilized a 5-fold stratified cross-validation strategy during training. During the final testing/inference phase, the expected quality scores from all 5 trained models are averaged (Mean Ensemble) to produce the highly stable final output.

📊 Experimental Results

LoViF 2026 Validation Set

Our method significantly outperforms all baseline metrics, proving that decoupling structure and semantics is highly effective in AIGC evaluation.

Method	SROCC $\uparrow$	PLCC $\uparrow$	Final Score $\uparrow$
LPIPS (VGG)	0.7602	0.7389	0.7516
DISTS	0.8172	0.8055	0.8125
GPT-5.4	0.7861	0.7957	0.7900
Gemini 3.1 Pro	0.8068	0.8174	0.8110
DSS-SQA (Ours)	0.9062	0.9209	0.9121

The Final Score is calculated as $0.6 \times SROCC + 0.4 \times PLCC$. On the official blind test phase, our model achieved a Final Score of 0.8469.

Zero-Shot Generalization on BAPPS

To demonstrate that our model learns universally applicable quality representations rather than overfitting the small LoViF dataset, we conducted zero-shot evaluations on the BAPPS perceptual dataset.

Method	Trad	CNN	Superres	Overall
Human (Ceiling)	80.8%	84.4%	73.4%	73.9%
LPIPS (VGG)	71.4%	81.4%	69.0%	66.8%
DSS-SQA (Ours)	73.9%	79.6%	65.5%	65.0%

Despite being optimized for high-level semantic quality, DSS-SQA remains highly competitive with metrics explicitly trained on BAPPS (like LPIPS), proving that our dense DINOv3 representations robustly capture low-level generative artifacts.

💻 Code & Resources

We have open-sourced the complete training and inference pipeline.

GitHub Repository: atJesse/SIQAv3

If you find our work helpful for your research, feel free to drop a star on GitHub! 🌟

DSS-SQA: Decoupling Structure and Semantics for Semantic Quality Assessment

🛑 The Challenge: "Shortcut Learning" in Semantic IQA