<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Jiajun's Cyber Garden]]></title><description><![CDATA[Hi, I'm Jiajun Wang, a researcher focusing on the application of cutting-edge AI in engineering. This blog is my digital garden.]]></description><link>https://jiajun.de</link><image><url>https://cdn.hashnode.com/uploads/logos/698c94f239413f8a70063736/8a6d3d6d-c9c1-4c64-957e-55f7170d1bad.png</url><title>Jiajun&apos;s Cyber Garden</title><link>https://jiajun.de</link></image><generator>RSS for Node</generator><lastBuildDate>Fri, 10 Apr 2026 09:09:16 GMT</lastBuildDate><atom:link href="https://jiajun.de/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[Developing a Generative-AI-Based 3D PCCT Denoising Algorithm: A 6-Month Work Plan]]></title><description><![CDATA[Month 1: Onboarding, Environment Setup, and Data Preprocessing
Goal: Complete the company and hospital onboarding process, get familiar with the R&D environment, and finish data cleaning plus an initial validation of classic deep-learning baselines.

Weeks 1-2: Onboarding and environment familiarization

Complete onboarding and safety/compliance training at Siemens Healthineers China and Peking Union Medical College Hospital.

Get familiar with the on-site working environment in the PUMCH radiology department, and obtain access to and configure the joint lab's high-performance computing cluster (8x A100).

Align the final project expectations with the hospital physicians and the Siemens/FAU mentors, and confirm]]></description><link>https://jiajun.de/3d-pcct</link><guid isPermaLink="true">https://jiajun.de/3d-pcct</guid><dc:creator><![CDATA[Jiajun Wang(Jesse)]]></dc:creator><pubDate>Tue, 31 Mar 2026 12:37:06 GMT</pubDate><content:encoded><![CDATA[<h4><strong>Month 1: Onboarding, Environment Setup, and Data Preprocessing</strong></h4>
<p><strong>Goal:</strong> Complete the company and hospital onboarding process, get familiar with the R&amp;D environment, and finish data cleaning plus an initial validation of classic deep-learning baselines.</p>
<ul>
<li><p><strong>Weeks 1-2: Onboarding and environment familiarization</strong></p>
<ul>
<li><p>Complete onboarding and safety/compliance training at Siemens Healthineers China and Peking Union Medical College Hospital (PUMCH).</p>
</li>
<li><p>Get familiar with the on-site working environment in the PUMCH radiology department, and obtain access to and configure the joint lab's high-performance computing cluster (8x A100).</p>
</li>
<li><p>Align the final project expectations with the hospital physicians and the Siemens/FAU mentors, and confirm the data-security protocols.</p>
</li>
</ul>
</li>
<li><p><strong>Weeks 3-4: Data processing and basic baselines</strong></p>
<ul>
<li><p>Clean, preprocess, and standardize the 400 3D PCCT volumes (200 paired cases), splitting them into training/validation/test sets (the later novel algorithm treats the data under an unpaired setting).</p>
</li>
<li><p>Reproduce classic 3D CT denoising networks from recent years (e.g., 3D U-Net, RED-CNN) to obtain baseline quantitative metrics, and get familiar with 3D data I/O and GPU memory management.</p>
</li>
</ul>
</li>
</ul>
<h4><strong>Month 2: 3D CycleGAN Baseline and Rigorous Evaluation</strong></h4>
<p><strong>Goal:</strong> Establish a fair yet highly competitive unpaired generative baseline and fix the evaluation standards.</p>
<ul>
<li><p><strong>Weeks 1-3: CycleGAN training under fairness controls</strong></p>
<ul>
<li><p>Build the 3D CycleGAN architecture. Strictly control variables: the CycleGAN generator and the later novel method's encoder/decoder use 3D U-Net backbones of the same depth.</p>
</li>
<li><p>Align hyperparameters: use the same 3D patch size and batch size, and train under the unpaired data split.</p>
</li>
</ul>
</li>
<li><p><strong>Week 4: Building a multi-dimensional evaluation system</strong></p>
<ul>
<li><p><strong>Pixel level:</strong> PSNR/SSIM (noting their known bias toward over-smoothed results).</p>
</li>
<li><p><strong>Perceptual level:</strong> LPIPS (checks whether structural detail is lost).</p>
</li>
<li><p><strong>Clinical level:</strong> HU bias statistics (measure whether the CT-value distributions of specific tissues such as liver and muscle match real normal-dose scans, to detect systematic shifts).</p>
</li>
<li><p><strong>Structural level:</strong> axial consistency score (measures structural continuity between adjacent slices).</p>
</li>
</ul>
</li>
</ul>
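<p>The axial consistency score above is not given a closed form in this plan; one plausible instantiation (an assumption for illustration, not the project's final definition) is the mean correlation between adjacent axial slices:</p>

```python
import numpy as np

def axial_consistency(volume):
    """Mean Pearson correlation between adjacent axial slices of a (D, H, W) volume."""
    scores = []
    for a, b in zip(volume[:-1], volume[1:]):
        a = a.ravel() - a.mean()
        b = b.ravel() - b.mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        # Perfectly flat slice pairs count as fully consistent.
        scores.append(float(a @ b / denom) if denom > 0 else 1.0)
    return float(np.mean(scores))

rng = np.random.default_rng(0)
smooth = np.cumsum(rng.normal(size=(32, 16, 16)), axis=0)  # slice k+1 ~ slice k + noise
shuffled = smooth[rng.permutation(32)]                     # destroys axial continuity
print(axial_consistency(smooth), axial_consistency(shuffled))
```

<p>A denoiser that treats each slice independently will tend to lower this score relative to the ground-truth volume, which is exactly the failure mode the metric is meant to expose.</p>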
<h4><strong>Month 3: 3D Latent Space Construction and Basic Diffusion Validation</strong></h4>
<p><strong>Goal:</strong> Overcome the GPU-memory bottleneck of 3D data and build a high-quality, anatomy-aware latent space.</p>
<ul>
<li><p><strong>Weeks 1-2: 3D latent autoencoder training</strong></p>
<ul>
<li><p>Build and train the 3D encoder \(E(\cdot)\) and decoder \(D(\cdot)\).</p>
</li>
<li><p>Train with the autoencoder reconstruction loss $$\mathcal{L}_{\mathrm{AE}} = \|D(E(u))-u\|_1 + \beta\,\mathcal{L}_{\mathrm{perc}},$$ ensuring the latent space is sufficiently compact while preserving anatomical information.</p>
</li>
</ul>
</li>
<li><p><strong>Weeks 3-4: Latent distribution definition and conditional diffusion exploration</strong></p>
<ul>
<li><p>Map the low-dose (LD) and normal-dose (ND) data into the latent space to obtain the latent distributions \(\nu_{\mathrm{LD}} = E_{\#}\mu_{\mathrm{LD}}\) and \(\nu_{\mathrm{ND}} = E_{\#}\mu_{\mathrm{ND}}\) (the pushforwards of the image distributions under \(E\)).</p>
</li>
<li><p>Run a basic latent conditional diffusion model as a control, and analyze its limitation of performing "conditional generation" rather than "faithful restoration," providing the counterargument that motivates the subsequent Schrödinger bridge (SB) approach.</p>
</li>
</ul>
</li>
</ul>
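<p>The autoencoder objective above can be sketched as follows. The encoder, decoder, and perceptual term are caller-supplied placeholders, since the actual 3D networks are designed later in the project:</p>

```python
import numpy as np

def ae_loss(u, E, D, beta=0.1, perc=None):
    """L_AE = ||D(E(u)) - u||_1 (mean-reduced) + beta * L_perc.

    E, D, and the perceptual term `perc` are callables supplied by the
    caller; this sketch only fixes the shape of the objective.
    """
    recon = D(E(u))
    l1 = np.abs(recon - u).mean()
    l_perc = perc(recon, u) if perc is not None else 0.0
    return l1 + beta * l_perc

# Sanity check with identity encoder/decoder: the loss vanishes.
u = np.random.default_rng(0).normal(size=(4, 8, 8, 8))
print(ae_loss(u, E=lambda x: x, D=lambda z: z))  # 0.0
```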
<h4><strong>Month 4: Core Innovation Implementation (I) - Observation-Centered Schrödinger Bridge</strong></h4>
<p><strong>Goal:</strong> Abandon the global unconditional bridge and implement a 3D latent-space Schrödinger bridge centered on the low-dose input.</p>
<ul>
<li><p><strong>Weeks 1-2: Reference process and bridge definition</strong></p>
<ul>
<li><p>For each low-dose input volume \(y\), encode it to obtain \(z_0 = E(y)\).</p>
</li>
<li><p>Define an input-centered Ornstein–Uhlenbeck reference process centered at \(z_0\): $$dZ_t = -\lambda(t)(Z_t-z_0)\,dt + \sigma(t,y)\,dW_t$$</p>
</li>
</ul>
</li>
<li><p><strong>Weeks 3-4: Optimal bridging and main loss implementation</strong></p>
<ul>
<li><p>Among path distributions that start at \(z_0\) and have terminal marginal \(\nu_{\mathrm{ND}}\), solve for the optimal bridge \(Q^{z_0,*} = \arg\min_{Q^{z_0}} \mathrm{KL}(Q^{z_0}\,\|\,P^{z_0})\).</p>
</li>
<li><p>Implement and debug the main Schrödinger-bridge loss \(\mathcal{L}_{\mathrm{SB}}\), ensuring the bridge correctly connects the source and target distributions while local trajectories stay close to the reference process, which acts as a natural structure-preserving constraint on the dynamics.</p>
</li>
</ul>
</li>
</ul>
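<p>The input-centered reference process can be simulated with a simple Euler–Maruyama discretization. The constant \(\lambda\) and \(\sigma\) below stand in for the schedules \(\lambda(t)\) and \(\sigma(t,y)\), which the plan leaves unspecified:</p>

```python
import numpy as np

def simulate_reference_process(z0, lam=4.0, sigma=0.3, n_steps=200, T=1.0, seed=0):
    """Euler-Maruyama discretization of dZ_t = -lam (Z_t - z0) dt + sigma dW_t."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    z = z0.astype(float).copy()
    for _ in range(n_steps):
        z += -lam * (z - z0) * dt + sigma * np.sqrt(dt) * rng.normal(size=z.shape)
    return z

z0 = np.zeros(64)
zT = simulate_reference_process(z0)
# Mean reversion keeps trajectories near z0: the stationary std is sigma / sqrt(2 * lam).
print(np.abs(zT - z0).mean())
```

<p>This mean reversion toward \(z_0\) is what makes the bridge "observation-centered": even before any learning, sampled paths stay in a tube around the encoded low-dose input.</p>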
<h4><strong>Month 5: Core Innovation Implementation (II) - Barycentric Anchor and Directional Constraints</strong></h4>
<p><strong>Goal:</strong> Incorporate the core idea of Noise2Flow, use soft couplings to extract the average restoration direction, and complete model tuning and distillation.</p>
<ul>
<li><p><strong>Weeks 1-2: Extracting the average restoration direction from distribution geometry</strong></p>
<ul>
<li><p>Within a batch, compute the cost matrix \(c_{ij} = \|z_i^{\mathrm{LD}} - z_j^{\mathrm{ND}}\|^2\) between the LD latents \(z_i^{\mathrm{LD}}\) and the ND latents \(z_j^{\mathrm{ND}}\).</p>
</li>
<li><p>Use the Sinkhorn/OT algorithm to compute the soft coupling matrix \(\pi_{ij}\) and extract the barycentric target: $$\bar{z}_i^{\mathrm{ND}} = \frac{\sum_j \pi_{ij}\, z_j^{\mathrm{ND}}}{\sum_j \pi_{ij}}$$</p>
</li>
<li><p>Compute the distribution-induced average restoration direction \(d_i = \bar{z}_i^{\mathrm{ND}} - z_i^{\mathrm{LD}}\).</p>
</li>
</ul>
</li>
<li><p><strong>Week 3: Direction-loss fusion and joint training</strong></p>
<ul>
<li><p>Introduce the direction loss $$\mathcal{L}_{\mathrm{dir}} = 1 - \frac{\langle \bar{v}_\theta(z_i^{\mathrm{LD}}),\, d_i\rangle}{\|\bar{v}_\theta(z_i^{\mathrm{LD}})\|\,\|d_i\|}$$ to constrain the early direction of the bridge drift.</p>
</li>
<li><p>Combine the total loss $$\mathcal{L} = \mathcal{L}_{\mathrm{AE}} + \lambda_{\mathrm{SB}}\mathcal{L}_{\mathrm{SB}} + \lambda_{\mathrm{dir}}\mathcal{L}_{\mathrm{dir}} + \lambda_{\mathrm{tube}}\mathcal{L}_{\mathrm{tube}}$$ for end-to-end joint training.</p>
</li>
</ul>
</li>
<li><p><strong>Week 4: Model distillation (optional / advanced)</strong></p>
<ul>
<li>Explore compressing the complex bridge evolution learned during training into a few-step or even one-step mapping (\(\hat{z} = F_\phi(z_0)\)) to improve inference efficiency at test time and in future clinical deployment.</li>
</ul>
</li>
</ul>
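<p>The barycentric-anchor computation of Weeks 1-2 can be sketched with a plain Sinkhorn iteration on a batch of latent vectors. This is a minimal illustration; a production implementation would use a log-domain solver such as the one in the POT library:</p>

```python
import numpy as np

def sinkhorn_coupling(z_ld, z_nd, eps=5.0, n_iters=300):
    """Entropic OT coupling pi_ij between two equally weighted latent batches."""
    C = ((z_ld[:, None, :] - z_nd[None, :, :]) ** 2).sum(-1)  # cost c_ij
    K = np.exp(-C / eps)
    a = np.full(len(z_ld), 1.0 / len(z_ld))   # uniform source weights
    b = np.full(len(z_nd), 1.0 / len(z_nd))   # uniform target weights
    f, g = np.ones_like(a), np.ones_like(b)
    for _ in range(n_iters):
        f = a / (K @ g)
        g = b / (K.T @ f)
    return f[:, None] * K * g[None, :]

def barycentric_directions(z_ld, z_nd, eps=5.0):
    pi = sinkhorn_coupling(z_ld, z_nd, eps)
    z_bar = (pi @ z_nd) / pi.sum(axis=1, keepdims=True)  # barycentric targets
    return z_bar - z_ld                                  # directions d_i

rng = np.random.default_rng(0)
z_ld = rng.normal(size=(8, 4))
z_nd = z_ld + 2.0   # toy "normal-dose" latents: a constant shift
d = barycentric_directions(z_ld, z_nd)
print(d.mean(axis=0))  # ~[2 2 2 2]: the average direction recovers the shift
```

<p>Because the coupling is soft, each \(d_i\) averages over many plausible ND matches instead of committing to a single nearest neighbor, which is what makes the direction robust under an unpaired setting.</p>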
<h4><strong>Month 6: Final Evaluation, Paper Writing, and Handover</strong></h4>
<p><strong>Goal:</strong> Wrap up the project, consolidate the research results, and complete the closing and handover procedures at both the company and the university.</p>
<ul>
<li><p><strong>Weeks 1-2: Final validation and multi-dimensional comparison</strong></p>
<ul>
<li><p>Compare the novel method (observation-centered SB + barycentric direction) comprehensively against the earlier classic DL baselines and the 3D CycleGAN.</p>
</li>
<li><p>Summarize the quantitative results (PSNR, LPIPS, HU bias, axial consistency) and gather qualitative feedback (visual quality, lesion preservation) from the PUMCH radiologists.</p>
</li>
</ul>
</li>
<li><p><strong>Week 3: Paper writing and defense preparation</strong></p>
<ul>
<li>Organize the experimental data, produce figures, write the internship/project summary report, and prepare the project paper for the FAU Pattern Recognition Lab.</li>
</ul>
</li>
<li><p><strong>Week 4: Handover and offboarding</strong></p>
<ul>
<li><p>Clean up the code repository and polish the comments and run documentation (README) so that successors at Siemens Healthineers and PUMCH can reproduce the results smoothly.</p>
</li>
<li><p>Return company and hospital assets, deactivate system access, and complete the formal Siemens Healthineers offboarding process and internship evaluation.</p>
</li>
</ul>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[DSS-SQA: Decoupling Structure and Semantics for Semantic Quality Assessment]]></title><description><![CDATA[Links: GitHub Repository | LoViF 2026 Challenge


As Generative AI (AIGC) fundamentally transforms low-level vision tasks, the way we evaluate image quality is undergoing a massive paradigm shift. In ]]></description><link>https://jiajun.de/dss-sqa</link><guid isPermaLink="true">https://jiajun.de/dss-sqa</guid><dc:creator><![CDATA[Jiajun Wang(Jesse)]]></dc:creator><pubDate>Sat, 28 Mar 2026 18:42:11 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/698c94f239413f8a70063736/004151f1-3602-461a-ad3a-fb135fbbecd5.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Links:</strong> <a href="https://github.com/atJesse/SIQAv3">GitHub Repository</a> | <a href="https://lovif-cvpr2026-workshop.github.io/">LoViF 2026 Challenge</a></p>
<img src="https://cdn.hashnode.com/uploads/covers/698c94f239413f8a70063736/a4438c9b-0188-4ef6-8247-8d1e0de8ad23.png" alt="" style="display:block;margin:0 auto" />

<p>As Generative AI (AIGC) fundamentally transforms low-level vision tasks, the way we evaluate image quality is undergoing a massive paradigm shift. In modern generative models, the most pressing issue is no longer just blur or noise, but <strong>AIGC semantic hallucinations</strong>.</p>
<p>To address this, our team (<strong>DSS-SQA</strong>: Jiajun Wang, Yipeng Sun, Kaiwei Lian) participated in the <strong>LoViF 2026 Challenge on Semantic Quality Assessment</strong>. We proposed a novel Full-Reference Image Quality Assessment (FR-IQA) framework that explicitly decouples structural fidelity from semantic alignment.</p>
<p>Our method achieved highly competitive results: a <strong>Final Score of 0.8469 on the official test phase</strong> and <strong>0.9121 on the validation phase</strong>, significantly outperforming traditional metrics, deep-feature metrics (LPIPS, DISTS), and even zero-shot large Vision-Language Models (GPT-5.4, Gemini 3.1 Pro).</p>
<hr />
<h2>🛑 The Challenge: "Shortcut Learning" in Semantic IQA</h2>
<p>The LoViF 2026 challenge provides a highly timely and critical benchmark. However, modeling this high-level semantic alignment presents significant obstacles.</p>
<p>The extreme scarcity of training data (only 510 pairs) makes deep neural networks highly prone to <strong>"shortcut learning"</strong>. Instead of learning to evaluate semantic alignment, models easily degenerate into evaluating absolute image sharpness. For instance, a model might erroneously predict a high score for a visually pristine but semantically completely unrelated generated image.</p>
<p>Explicit gating constraints and self-supervised structural priors were vital to overcoming this bottleneck.</p>
<hr />
<h2>🧠 Our Solution: The DSS-SQA Architecture</h2>
<p>Our proposed solution, DSS-SQA, explicitly decouples structural degradation from semantic hallucinations using a dual-vision backbone (DINOv3 and CLIP).</p>
<img src="https://cdn.hashnode.com/uploads/covers/698c94f239413f8a70063736/b1b72e2f-aabd-46e5-9e14-56b0c84e2696.png" alt="" style="display:block;margin:0 auto" />

<h3>1. Dual-Vision Siamese Encoders</h3>
<p>Recognizing that structure and semantics are orthogonal dimensions, we employ two distinct, frozen pre-trained foundation models:</p>
<ul>
<li><p><strong>Structural Awareness (DINOv3):</strong> We utilize the DINOv3-Base vision transformer to extract structure-aware global representations (implemented with the DINOv3 CLS token by default, with patch-token fallback only when needed).</p>
</li>
<li><p><strong>Semantic Awareness (CLIP):</strong> We employ the CLIP-ViT-L/14 vision encoder to extract global, human-aligned semantic embeddings.</p>
</li>
</ul>
<h3>2. Element-wise Multiplication Fusion</h3>
<p>To fuse these features, we compute the absolute differences for both branches (\(\mathrm{Diff}_{s}\) and \(\mathrm{Diff}_{c}\)). Crucially, we introduce an <strong>Element-wise Multiplication Fusion</strong> for the CLIP features.</p>
<ul>
<li><p>By default, we apply \(L_{2}\) normalization before the Hadamard product: <code>mult_clip</code> \(= L_{2}(F_{c,\mathrm{ref}}) \odot L_{2}(F_{c,\mathrm{dist}})\).</p>
</li>
<li><p>We then replace the raw distorted semantic feature with this multiplication result to force the network to rely on fine-grained semantic interaction.</p>
</li>
<li><p>The scalar cosine similarity \(S_{cos}\) is also explicitly injected into the concatenated feature vector.</p>
</li>
</ul>
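<p>A minimal numpy sketch of the fusion described above (feature dimensions and function names are illustrative, not taken from our codebase):</p>

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def fuse_clip_features(f_ref, f_dist):
    """Hadamard fusion of L2-normalized CLIP embeddings, with the scalar
    cosine similarity concatenated onto the fused vector."""
    nr, nd = l2_normalize(f_ref), l2_normalize(f_dist)
    mult_clip = nr * nd                        # element-wise product
    s_cos = mult_clip.sum(-1, keepdims=True)   # cosine similarity
    return np.concatenate([mult_clip, s_cos], axis=-1)

f = np.random.default_rng(0).normal(size=(2, 768))
fused = fuse_clip_features(f, f)  # identical ref/dist features
print(fused.shape)  # (2, 769); the last column is the cosine similarity
```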
<h3>3. Explicit Semantic Gating Mechanism</h3>
<p>To directly combat shortcut learning, we designed an explicit semantic gate. During inference (<code>model.eval()</code>) with semantic gating enabled, if the cosine similarity \(S_{cos}\) falls below a threshold (0.4), the gate applies a hard veto by forcefully pushing the predicted logits toward the lowest quality class (Class 0). This ensures the final expected score approaches zero, effectively penalizing visually pleasing but semantically completely hallucinated pairs.</p>
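<p>The gate can be sketched as follows, assuming a 5-class quality head whose softmax expectation is the predicted score (the veto magnitude is an illustrative choice):</p>

```python
import numpy as np

def apply_semantic_gate(logits, s_cos, threshold=0.4, veto=-1e4):
    """Inference-time hard veto: if the cosine similarity is below the
    threshold, push every logit except the lowest-quality class (index 0)
    toward -inf so the expected score collapses to ~0."""
    gated = logits.copy()
    if s_cos < threshold:
        gated[1:] = veto
    return gated

def expected_score(logits, class_values=np.arange(5.0)):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float(p @ class_values)

logits = np.array([0.1, 0.2, 0.5, 2.0, 3.0])  # visually pristine-looking prediction
print(expected_score(logits))                                   # high score before gating
print(expected_score(apply_semantic_gate(logits, s_cos=0.2)))   # collapses to 0
```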
<h3>4. Robust Ensemble Strategy</h3>
<p>To maximize robustness and prevent overfitting on the extremely small dataset, we utilized a <strong>5-fold stratified cross-validation strategy</strong> during training. During the final testing/inference phase, the expected quality scores from all 5 trained models are averaged (Mean Ensemble) to produce the highly stable final output.</p>
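<p>The mean ensemble itself is a one-liner; stand-in callables replace the five trained fold models here:</p>

```python
import numpy as np

def ensemble_predict(fold_models, x):
    """Average the expected quality scores of the trained fold models."""
    return float(np.mean([m(x) for m in fold_models]))

# Stand-in callables in place of the 5 cross-validation networks.
folds = [lambda x, b=b: x + b for b in (-0.02, -0.01, 0.0, 0.01, 0.02)]
print(ensemble_predict(folds, 0.8))  # ~0.8: per-fold biases cancel in the mean
```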
<hr />
<h2>📊 Experimental Results</h2>
<h3>LoViF 2026 Validation Set</h3>
<p>Our method significantly outperforms all baseline metrics, proving that decoupling structure and semantics is highly effective in AIGC evaluation.</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>SROCC \(\uparrow\)</th>
<th>PLCC \(\uparrow\)</th>
<th>Final Score \(\uparrow\)</th>
</tr>
</thead>
<tbody><tr>
<td>LPIPS (VGG)</td>
<td>0.7602</td>
<td>0.7389</td>
<td>0.7516</td>
</tr>
<tr>
<td>DISTS</td>
<td>0.8172</td>
<td>0.8055</td>
<td>0.8125</td>
</tr>
<tr>
<td>GPT-5.4</td>
<td>0.7861</td>
<td>0.7957</td>
<td>0.7900</td>
</tr>
<tr>
<td>Gemini 3.1 Pro</td>
<td>0.8068</td>
<td>0.8174</td>
<td>0.8110</td>
</tr>
<tr>
<td><strong>DSS-SQA (Ours)</strong></td>
<td><strong>0.9062</strong></td>
<td><strong>0.9209</strong></td>
<td><strong>0.9121</strong></td>
</tr>
</tbody></table>
<p><em>The Final Score is calculated as \(0.6 \times SROCC + 0.4 \times PLCC\). On the official blind test phase, our model achieved a Final Score of 0.8469.</em></p>
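<p>Plugging the validation-phase numbers from the table into the score formula reproduces the reported result:</p>

```python
srocc, plcc = 0.9062, 0.9209  # DSS-SQA validation-phase results from the table
final_score = 0.6 * srocc + 0.4 * plcc
print(round(final_score, 4))  # 0.9121, matching the reported Final Score
```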
<h3>Zero-Shot Generalization on BAPPS</h3>
<p>To demonstrate that our model learns universally applicable quality representations rather than overfitting the small LoViF dataset, we conducted zero-shot evaluations on the BAPPS perceptual dataset.</p>
<table>
<thead>
<tr>
<th>Method</th>
<th>Trad</th>
<th>CNN</th>
<th>Superres</th>
<th>Overall</th>
</tr>
</thead>
<tbody><tr>
<td>Human (Ceiling)</td>
<td>80.8%</td>
<td>84.4%</td>
<td>73.4%</td>
<td>73.9%</td>
</tr>
<tr>
<td>LPIPS (VGG)</td>
<td>71.4%</td>
<td><strong>81.4%</strong></td>
<td><strong>69.0%</strong></td>
<td><strong>66.8%</strong></td>
</tr>
<tr>
<td><strong>DSS-SQA (Ours)</strong></td>
<td><strong>73.9%</strong></td>
<td>79.6%</td>
<td>65.5%</td>
<td>65.0%</td>
</tr>
</tbody></table>
<p>Despite being optimized for high-level semantic quality, DSS-SQA remains highly competitive with metrics explicitly trained on BAPPS (like LPIPS), proving that our dense DINOv3 representations robustly capture low-level generative artifacts.</p>
<hr />
<h2>💻 Code &amp; Resources</h2>
<p>We have open-sourced the complete training and inference pipeline.</p>
<ul>
<li><strong>GitHub Repository:</strong> <a href="https://github.com/atJesse/SIQAv3">atJesse/SIQAv3</a></li>
</ul>
<p>If you find our work helpful for your research, feel free to drop a star on GitHub! 🌟</p>
]]></content:encoded></item><item><title><![CDATA[BVM2026 in Lübeck]]></title><description><![CDATA[Just wrapped up an incredibly inspiring two days at the BVM 2026 (German Conference on Medical Image Computing)!
I had the great honor of presenting our recent work, "Explainable Radiologist-Aligned V]]></description><link>https://jiajun.de/bvm2026</link><guid isPermaLink="true">https://jiajun.de/bvm2026</guid><category><![CDATA[conference]]></category><dc:creator><![CDATA[Jiajun Wang(Jesse)]]></dc:creator><pubDate>Wed, 18 Mar 2026 13:37:36 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/698c94f239413f8a70063736/f3e7bb41-c821-4d7f-92c6-90cf4c2d5171.jpg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Just wrapped up an incredibly inspiring two days at the BVM 2026 (German Conference on Medical Image Computing)!</p>
<p>I had the great honor of presenting our recent work, "Explainable Radiologist-Aligned VLM for CT Image Quality Assessment", during the poster session. It was a fantastic experience to share how we used QLoRA to fine-tune MedGemma-4B-IT, enabling it to provide radiologist-level interpretable feedback for CT images.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698c94f239413f8a70063736/b307b084-c1a8-4092-b284-e9ced35239a8.jpg" alt="" style="display:block;margin:0 auto" />

<p>It was a privilege to connect with fellow researchers and domain experts in the medical AI field. We had some truly thought-provoking discussions about the limitations of current black-box models and exciting future directions for interpretable AI and Neuro-symbolic methods.</p>
<p>I was absolutely amazed by the exceptional quality of the other posters and presentations. Having the chance to interact face-to-face with the authors behind these outstanding works was invaluable and sparked many new ideas for my own upcoming Master's thesis!</p>
<p>A huge thank you to my amazing co-authors and supervisor for their continuous support: <a href="https://yipengsun.com/">Yipeng Sun</a>, <a href="https://lme.tf.fau.de/person/sbayer/">Siming Bayer</a> and <a href="https://lme.tf.fau.de/person/maier/">Andreas Maier</a>. Looking forward to applying these fresh insights to my next research steps!</p>
<img src="https://cdn.hashnode.com/uploads/covers/698c94f239413f8a70063736/9222423d-91c9-429d-8bb9-9594e03b8c95.jpg" alt="" style="display:block;margin:0 auto" />

<p>#BVM2026 #MedicalImaging #ArtificialIntelligence #VLM #GenerativeAI #FAU #DeepLearning #Research</p>
]]></content:encoded></item><item><title><![CDATA[Fine-Tuning Qwen2.5-VL on Your Own Images using LLaMA-Factory]]></title><description><![CDATA[The world of Large Language Models (LLMs) is evolving rapidly into Vision-Language Models (VLMs). Models that can see and understand images—like Qwen2.5-VL—are game changers for tasks like OCR, medica]]></description><link>https://jiajun.de/sftllm</link><guid isPermaLink="true">https://jiajun.de/sftllm</guid><category><![CDATA[llm]]></category><dc:creator><![CDATA[Jiajun Wang(Jesse)]]></dc:creator><pubDate>Fri, 13 Feb 2026 12:04:14 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/gEpncIlZq7c/upload/58777f1aca8c76f8d83249d3f04e5dff.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The world of Large Language Models (LLMs) is evolving rapidly into <strong>Vision-Language Models (VLMs)</strong>. Models that can <em>see</em> and <em>understand</em> images—like <strong>Qwen2.5-VL</strong>—are game changers for tasks like OCR, medical imaging analysis, and visual agents.</p>
<p>However, fine-tuning these multimodal models has historically been a complex engineering nightmare.</p>
<p>Enter <strong>LLaMA-Factory</strong>.</p>
<p>This unified framework makes fine-tuning state-of-the-art models accessible to everyone. In this tutorial, I will guide you step-by-step through fine-tuning <strong>Qwen2.5-VL-7B-Instruct</strong> on a custom image dataset. Whether you are a researcher or a hobbyist, this guide will take you from an empty folder to a working custom VLM.</p>
<h2>Prerequisites</h2>
<p>Before we begin, ensure you have:</p>
<ul>
<li><strong>Hardware:</strong> An NVIDIA GPU (24GB VRAM recommended for 7B models using LoRA; A100/H100 is ideal for faster training).</li>
<li><strong>OS:</strong> Linux (Ubuntu/CentOS) or Windows via WSL2.</li>
<li><strong>Python:</strong> Version 3.10 or higher.</li>
</ul>
<hr />
<h2>Step 1: Environment Setup</h2>
<p>We need a clean environment with the specific dependencies for Qwen's visual processing capabilities.</p>
<ol>
<li><strong>Create a Conda environment:</strong></li>
</ol>
<pre><code class="language-bash">conda create -n qwen_vl_ft python=3.10
conda activate qwen_vl_ft
</code></pre>
<ol start="2">
<li><strong>Clone LLaMA-Factory:</strong></li>
</ol>
<pre><code class="language-bash">git clone https://github.com/hiyouga/LLaMA-Factory.git
cd LLaMA-Factory
</code></pre>
<ol start="3">
<li><strong>Install dependencies:</strong>
This step is crucial. Qwen2.5-VL requires <code>qwen-vl-utils</code> to handle image inputs.</li>
</ol>
<pre><code class="language-bash">pip install -e .[metrics]
pip install qwen-vl-utils
</code></pre>
<p><em>(Optional but Recommended: Install Flash Attention 2 for faster training if you have an Ampere/Ada GPU like A100/RTX3090/4090)</em>:</p>
<pre><code class="language-bash">pip install flash-attn --no-build-isolation
</code></pre>
<hr />
<h2>Step 2: Prepare Your Multimodal Dataset</h2>
<p>Data preparation for VLMs is slightly different from text-only models. You need to link your text instructions to specific image files.</p>
<h3>1. Organize your images</h3>
<p>Create a folder named <code>data/my_images</code> inside the LLaMA-Factory directory and put all your training images there (e.g., <code>.jpg</code> or <code>.png</code> files).</p>
<h3>2. Create the JSON file</h3>
<p>Create a file named <code>data/my_vl_data.json</code>. The format should include an <code>images</code> list containing the path to the image.</p>
<p><strong>Example Format:</strong></p>
<pre><code class="language-json">[
  {
    "instruction": "Analyze this image and describe the defects found.",
    "input": "",
    "output": "The image shows a crack in the metal surface located at the top left corner.",
    "images": [
      "data/my_images/defect_001.png"
    ]
  },
  {
    "instruction": "What is the text written on the sign?",
    "input": "",
    "output": "The sign says 'Do Not Enter'.",
    "images": [
      "data/my_images/sign_045.jpg"
    ]
  }
]
</code></pre>
<p><em>Note: Ensure the image paths are relative to the LLaMA-Factory root directory or absolute paths.</em></p>
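<p>Since broken image paths are the most common failure at this step, a quick sanity check before training can save a wasted run. The helper below is my own illustration, not part of LLaMA-Factory:</p>

```python
import json
import os

def find_missing_images(records, root="."):
    """Return (record_index, image_path) pairs whose files cannot be found.

    Illustrative helper: LLaMA-Factory itself will only fail later, during
    preprocessing, if a path cannot be resolved.
    """
    missing = []
    for i, rec in enumerate(records):
        for img in rec.get("images", []):
            path = img if os.path.isabs(img) else os.path.join(root, img)
            if not os.path.isfile(path):
                missing.append((i, img))
    return missing

records = json.loads("""
[
  {"instruction": "Analyze this image and describe the defects found.",
   "input": "",
   "output": "The image shows a crack in the metal surface.",
   "images": ["data/my_images/defect_001.png"]}
]
""")
print(find_missing_images(records))  # lists the pair if the file is absent
```

<p>Run it from the LLaMA-Factory root (the same directory your relative paths are resolved against) and fix anything it reports before launching training.</p>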
<h3>3. Register the dataset</h3>
<p>Open <code>data/dataset_info.json</code> and add your new dataset definition:</p>
<pre><code class="language-json">"my_vl_dataset": {
  "file_name": "my_vl_data.json",
  "formatting": "alpaca",
  "columns": {
    "prompt": "instruction",
    "query": "input",
    "response": "output",
    "images": "images"
  }
}
</code></pre>
<hr />
<h2>Step 3: Download the Base Model</h2>
<p>For stability, download the model weights manually before training.</p>
<pre><code class="language-bash">pip install huggingface_hub
# Download Qwen2.5-VL-7B-Instruct
huggingface-cli download Qwen/Qwen2.5-VL-7B-Instruct --local-dir models/Qwen2.5-VL-7B
</code></pre>
<hr />
<h2>Step 4: Configure and Run Training (LoRA)</h2>
<p>We will use <strong>LoRA (Low-Rank Adaptation)</strong>. This is efficient and perfect for VLMs. We need to create a YAML configuration file.</p>
<p>Create <code>train_qwen25_vl.yaml</code> in the root folder:</p>
<pre><code class="language-yaml">### Model Configuration
model_name_or_path: models/Qwen2.5-VL-7B
template: qwen2_vl                     # CRITICAL: Must use 'qwen2_vl' for correct tokenization
trust_remote_code: true

### Method Configuration
stage: sft                             # Supervised Fine-Tuning
do_train: true
finetuning_type: lora
lora_target: all                       # Qwen-VL benefits from training all linear layers
lora_rank: 16
lora_alpha: 16

### Dataset Configuration
dataset: my_vl_dataset                 # Your custom dataset name
cutoff_len: 2048                       # VLMs need longer context for image tokens
overwrite_cache: true
preprocessing_num_workers: 16

### Training Configuration
output_dir: saves/qwen2.5-vl/lora/sft  # Save path
logging_steps: 10
save_steps: 100
plot_loss: true
overwrite_output_dir: true

### Hyperparameters
per_device_train_batch_size: 4         # Adjust based on VRAM (Try 2 if OOM)
gradient_accumulation_steps: 4
learning_rate: 1.0e-4
num_train_epochs: 5.0
lr_scheduler_type: cosine
warmup_ratio: 0.1
bf16: true                             # Use pure bf16 for A100/3090
flash_attn: fa2                        # Use Flash Attention 2
</code></pre>
<p><strong>Start the Training:</strong>
Run the following command:</p>
<pre><code class="language-bash">llamafactory-cli train train_qwen25_vl.yaml
</code></pre>
<p>LLaMA-Factory will now handle the complex task of encoding your images into visual tokens and training the LoRA adapter to understand them.</p>
<hr />
<h2>Step 5: Inference (Testing Your Model)</h2>
<p>Once training finishes, let's see if the model learned your task. You can use the CLI or the WebUI.</p>
<p><strong>Using the WebUI (Easiest Method):</strong></p>
<pre><code class="language-bash">llamafactory-cli webui
</code></pre>
<ol>
<li>Go to the <strong>Chat</strong> tab.</li>
<li>Select the <strong>Checkpoint</strong>: <code>saves/qwen2.5-vl/lora/sft</code>.</li>
<li>Upload an image in the chat box.</li>
<li>Type your instruction and see the magic!</li>
</ol>
<p><strong>Using CLI:</strong></p>
<pre><code class="language-bash">llamafactory-cli chat \
  --model_name_or_path models/Qwen2.5-VL-7B \
  --adapter_name_or_path saves/qwen2.5-vl/lora/sft \
  --template qwen2_vl \
  --finetuning_type lora
</code></pre>
<hr />
<h2>Step 6: Merge and Export (Optional)</h2>
<p>If you want to deploy your model (e.g., using vLLM or Ollama), you need to merge the LoRA weights into the base model.</p>
<p>Create <code>merge_vl.yaml</code>:</p>
<pre><code class="language-yaml">model_name_or_path: models/Qwen2.5-VL-7B
adapter_name_or_path: saves/qwen2.5-vl/lora/sft
template: qwen2_vl
finetuning_type: lora
export_dir: models/Qwen2.5-VL-FinelyTuned
export_size: 5
export_device: cpu   # Use CPU for merging to save VRAM
</code></pre>
<p>Run the export:</p>
<pre><code class="language-bash">llamafactory-cli export merge_vl.yaml
</code></pre>
<hr />
<h2>Conclusion</h2>
<p>Fine-tuning multimodal models used to require specialized knowledge of visual encoders and projector layers. <strong>LLaMA-Factory</strong> abstracts this away, allowing you to treat images just like another data input.</p>
<p>By following this guide, you have successfully fine-tuned <strong>Qwen2.5-VL</strong>, one of the most powerful open-source VLMs available, on your own custom data.</p>
<p><strong>Key Takeaways:</strong></p>
<ol>
<li><strong>Dependencies matter:</strong> Don't forget <code>qwen-vl-utils</code>.</li>
<li><strong>Data format:</strong> Ensure your JSON correctly points to your image paths.</li>
<li><strong>Template:</strong> Always use <code>template: qwen2_vl</code> for this specific model family.</li>
</ol>
<p>Happy Fine-Tuning!</p>
<hr />
<p><em>If you found this tutorial helpful, please share it with your community!</em></p>
]]></content:encoded></item><item><title><![CDATA[The Ultimate Beginner’s Guide to FAU HPC: From Zero to A100]]></title><description><![CDATA[So, you’ve received an invitation to use the High-Performance Computing (HPC) cluster at FAU (likely a Tier3 project). You want to run Deep Learning, VLM, or RL experiments, but you are staring at a black terminal screen and don't know where to start...]]></description><link>https://jiajun.de/fauhpc</link><guid isPermaLink="true">https://jiajun.de/fauhpc</guid><category><![CDATA[hpc]]></category><category><![CDATA[Experience ]]></category><dc:creator><![CDATA[Jiajun Wang(Jesse)]]></dc:creator><pubDate>Thu, 12 Feb 2026 19:46:13 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/s0XabTAKvak/upload/d79a0234feba7f9842c211661c45ebb1.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>So, you’ve received an invitation to use the High-Performance Computing (HPC) cluster at FAU (likely a Tier3 project). You want to run Deep Learning, VLM, or RL experiments, but you are staring at a black terminal screen and don't know where to start.</p>
<p>Don't worry. I went through the exact same struggle—configuring SSH keys, getting "Permission Denied" errors, and wondering why PyTorch couldn't see the GPU.</p>
<p>This guide will take you from <strong>receiving the email</strong> to <strong>training on an NVIDIA A100</strong>, step-by-step.</p>
<hr />
<h2 id="heading-step-1-accept-the-invitation">Step 1: Accept the Invitation</h2>
<ol>
<li><p>Log in to the <a target="_blank" href="https://portal.hpc.fau.de/"><strong>FAU HPC Portal</strong></a> using your standard IdM credentials (e.g., <code>abc1234d</code>).</p>
</li>
<li><p>Go to <strong>User</strong> -&gt; <strong>Your Invitations</strong> and accept the project invitation.</p>
</li>
<li><p><strong>Wait overnight.</strong> The system runs a synchronization script every night. You usually cannot log in immediately after accepting; your home directory needs time to be created.</p>
</li>
</ol>
<hr />
<h2 id="heading-step-2-generate-and-upload-your-ssh-key">Step 2: Generate and Upload Your SSH Key</h2>
<p>HPC systems don't use passwords; they use "keys." You need to generate a lock (Public Key) and a key (Private Key).</p>
<h3 id="heading-for-windows-powershell-mac-linux">For Windows (PowerShell) / Mac / Linux:</h3>
<p>Open your terminal and run:</p>
<pre><code class="lang-bash">ssh-keygen -t rsa -b 4096
</code></pre>
<ol>
<li><p>Press <strong>Enter</strong> to save it in the default location.</p>
</li>
<li><p>Press <strong>Enter</strong> twice to skip setting a password (useful for automation).</p>
</li>
<li><p>Display your public key:</p>
<ul>
<li><p><strong>Windows:</strong> <code>type %userprofile%\.ssh\id_rsa.pub</code></p>
</li>
<li><p><strong>Mac/Linux:</strong> <code>cat ~/.ssh/id_rsa.pub</code></p>
</li>
</ul>
</li>
<li><p><strong>Copy everything</strong> (starting with <code>ssh-rsa</code> and ending with your username).</p>
</li>
</ol>
<h3 id="heading-upload-to-portal">Upload to Portal:</h3>
<ol>
<li><p>Go back to the <a target="_blank" href="https://portal.hpc.fau.de/">HPC Portal</a>.</p>
</li>
<li><p>Go to <strong>User</strong> -&gt; <strong>Your Accounts</strong>.</p>
</li>
<li><p>Find your HPC account (e.g., <code>abc1234d</code>), click on it, and paste your key into the <strong>"Add new SSH Key"</strong> section.</p>
</li>
<li><p><strong>Wait 15-20 minutes</strong> for the key to sync to the servers.</p>
</li>
</ol>
<hr />
<h2 id="heading-step-3-configure-vs-code-the-pro-setup">Step 3: Configure VS Code (The "Pro" Setup)</h2>
<p><strong>Do not</strong> try to use the raw terminal for everything. Use <strong>VS Code</strong> with the <strong>Remote - SSH</strong> extension. It allows you to edit code on the server as if it were on your laptop.</p>
<h3 id="heading-the-proxyjump-trick">The "ProxyJump" Trick</h3>
<p>Direct access to GPU nodes (like <code>tinyx</code>) is often blocked from outside the university network. We need to jump through a "gatekeeper" server called <code>csnhr</code>.</p>
<ol>
<li><p>In VS Code, install the <strong>Remote - SSH</strong> extension.</p>
</li>
<li><p>Click the blue <code>&gt;&lt;</code> icon (bottom left) -&gt; <strong>Open Configuration File</strong> -&gt; Select your <code>.ssh/config</code>.</p>
</li>
<li><p>Paste the following configuration (Replace <code>abc1234d</code> with <strong>YOUR</strong> HPC username):</p>
</li>
</ol>
<pre><code class="lang-text"># 1. The Gatekeeper (Jump Host)
Host csnhr
    HostName csnhr.nhr.fau.de
    User abc1234d
    IdentityFile ~/.ssh/id_rsa
    IdentitiesOnly yes
    PasswordAuthentication no

# 2. Woody (CPU Frontend - Good for data transfer)
Host woody
    HostName woody.nhr.fau.de
    User abc1234d
    ProxyJump csnhr
    IdentityFile ~/.ssh/id_rsa
    IdentitiesOnly yes

# 3. TinyX (Tier3 GPU Frontend - RUN YOUR EXPERIMENTS HERE)
Host tinyx
    HostName tinyx.nhr.fau.de
    User abc1234d
    ProxyJump csnhr
    IdentityFile ~/.ssh/id_rsa
    IdentitiesOnly yes
</code></pre>
<ol start="4">
<li><p>Save the file.</p>
</li>
<li><p>Click the blue <code>&gt;&lt;</code> icon -&gt; <strong>Connect to Host</strong> -&gt; Select <strong>tinyx</strong>.</p>
</li>
</ol>
<hr />
<h2 id="heading-step-4-know-your-territory-home-vs-work">Step 4: Know Your Territory ($HOME vs $WORK)</h2>
<p>Once logged in, you need to know where to put your files. This is the most common mistake beginners make.</p>
<ul>
<li><p><strong>$HOME (</strong><code>/home/hpc/...</code>):</p>
<ul>
<li><p><strong>Size:</strong> Very small (100GB).</p>
</li>
<li><p><strong>Use for:</strong> Config files, scripts, source code.</p>
</li>
<li><p><strong>NEVER put:</strong> Datasets, Conda environments, or Model checkpoints here. You will run out of space immediately.</p>
</li>
</ul>
</li>
<li><p><strong>$WORK (</strong><code>/home/woody/...</code>):</p>
<ul>
<li><p><strong>Size:</strong> Huge (1TB+?).</p>
</li>
<li><p><strong>Use for:</strong> <strong>EVERYTHING BIG.</strong> Install Miniforge here. Download datasets here.</p>
</li>
<li><p><strong>How to find it:</strong> Run <code>echo $WORK</code> in the terminal.</p>
</li>
</ul>
</li>
</ul>
<p><strong>Always switch to WORK before doing anything:</strong></p>
<pre><code class="lang-bash"><span class="hljs-built_in">cd</span> <span class="hljs-variable">$WORK</span>
</code></pre>
<hr />
<h2 id="heading-step-5-setting-up-the-environment-the-right-way">Step 5: Setting Up the Environment (The Right Way)</h2>
<p>Do not use the default Python. Do not use Anaconda (it's too bloated). Use <strong>Miniforge</strong>.</p>
<ol>
<li><p><strong>Download and Install (in the terminal on</strong> <code>tinyx</code>):</p>
<pre><code class="lang-bash"> <span class="hljs-built_in">cd</span> <span class="hljs-variable">$WORK</span>
 wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
 bash Miniforge3-Linux-x86_64.sh
</code></pre>
<ul>
<li><p><strong>Crucial:</strong> When asked for the installation path, ensure it says <code>/home/woody/...</code>. If it says <code>/home/hpc/...</code>, edit it manually!</p>
</li>
<li><p>Type <code>yes</code> to initialize.</p>
</li>
</ul>
</li>
<li><p><strong>Restart your terminal</strong> (close and reopen the terminal pane in VS Code).</p>
</li>
<li><p><strong>Create your Environment:</strong> (e.g., vlm_env)</p>
<pre><code class="lang-bash"> mamba create -n vlm_env python=3.10
 mamba activate vlm_env
</code></pre>
</li>
</ol>
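<p>Before installing anything into the new environment, double-check that it really landed on <code>$WORK</code> (a quick sanity check; <code>/home/woody</code> is the <code>$WORK</code> prefix used above):</p>
<pre><code class="lang-bash">echo "$CONDA_PREFIX"
case "$CONDA_PREFIX" in
    /home/woody/*) echo "OK: environment lives on WORK" ;;
    *)             echo "WARNING: environment is NOT under /home/woody -- reinstall!" ;;
esac
</code></pre>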
<hr />
<h2 id="heading-step-6-installing-pytorch-the-trap">Step 6: Installing PyTorch (The Trap!)</h2>
<p>Here is where many fail.</p>
<ol>
<li><p><strong>Compute nodes (GPUs) have NO INTERNET.</strong> You must install packages on the <strong>Login Node (</strong><code>tinyx</code>).</p>
</li>
<li><p><strong>Mamba sometimes defaults to CPU versions.</strong> Use <code>pip</code> to force the CUDA version.</p>
</li>
</ol>
<p><strong>The Golden Command (run this on</strong> <code>tinyx</code>):</p>
<pre><code class="lang-bash">mamba activate vlm_env
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers accelerate bitsandbytes
</code></pre>
<p><em>Note: Packages are installed and verified on the Login Node (which has internet access); the actual training then runs on the Compute Nodes, which do not.</em></p>

<hr />
<h2 id="heading-step-7-accessing-gpus">Step 7: Accessing GPUs</h2>
<p>You are currently on <code>tinyx</code> (a shared login node). <strong>DO NOT</strong> run training here. You must request a <strong>Compute Node</strong>.</p>
<h3 id="heading-option-a-interactive-mode-debuggingtesting">Option A: Interactive Mode (Debugging/Testing)</h3>
<p>Use <code>salloc</code> to get a GPU for a short time (e.g., 30 mins).</p>
<p><strong>To check what GPUs are available:</strong></p>
<pre><code class="lang-bash">sinfo -o <span class="hljs-string">"%20P %G"</span>
</code></pre>
<p><em>(You might see partitions like</em> <code>a100</code>, <code>v100</code>, <code>rtx3080</code>).</p>
<p><strong>To request an A100 (The Beast):</strong></p>
<pre><code class="lang-bash">salloc --partition=a100 --gres=gpu:a100:1 --time=00:30:00
</code></pre>
<p><strong>To request a V100 (Reliable):</strong></p>
<pre><code class="lang-bash">salloc --partition=v100 --gres=gpu:v100:1 --time=00:30:00
</code></pre>
<p>Once inside (prompt changes to <code>tgXXX</code>), <strong>reactivate your environment</strong> and test:</p>
<pre><code class="lang-bash">mamba activate vlm_env
python -c <span class="hljs-string">"import torch; print(f'CUDA: {torch.cuda.is_available()}'); print(f'Device: {torch.cuda.get_device_name(0)}')"</span>
</code></pre>
<p>If it says <code>True</code> and <code>NVIDIA A100</code>, you win!</p>
<h3 id="heading-option-b-batch-jobs-real-training">Option B: Batch Jobs (Real Training)</h3>
<p>For long training runs (e.g., 24 hours), create a script called <code>run.sh</code>:</p>
<pre><code class="lang-bash"><span class="hljs-meta">#!/bin/bash</span>
<span class="hljs-comment">#SBATCH --job-name=vlm_train</span>
<span class="hljs-comment">#SBATCH --output=logs/%j.out</span>
<span class="hljs-comment">#SBATCH --partition=a100       # or v100</span>
<span class="hljs-comment">#SBATCH --gres=gpu:a100:1      # or gpu:v100:1</span>
<span class="hljs-comment">#SBATCH --time=24:00:00</span>

<span class="hljs-built_in">source</span> <span class="hljs-variable">$WORK</span>/miniforge3/bin/activate vlm_env
python train.py
</code></pre>
<p>Submit it with: <code>sbatch run.sh</code></p>
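<p>After submitting, Slurm only prints a job ID, so you will want to keep an eye on the job. These are standard Slurm commands (replace <code>&lt;jobid&gt;</code> with the ID that <code>sbatch</code> printed):</p>
<pre><code class="lang-bash">squeue --me                 # your jobs: PD = pending, R = running
tail -f logs/&lt;jobid&gt;.out    # follow the training log live
scancel &lt;jobid&gt;             # abort the job if something went wrong
</code></pre>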
<hr />
<h2 id="heading-summary-cheat-sheet">Summary Cheat Sheet</h2>
<ol>
<li><p><strong>Connect:</strong> VS Code -&gt; <code>tinyx</code>.</p>
</li>
<li><p><strong>Workspace:</strong> <code>cd $WORK</code>.</p>
</li>
<li><p><strong>Install:</strong> Run installs on <code>tinyx</code> (Login Node).</p>
</li>
<li><p><strong>Debug:</strong> <code>salloc ...</code> to get an interactive GPU.</p>
</li>
<li><p><strong>Train:</strong> <code>sbatch run.sh</code> for long jobs.</p>
</li>
</ol>
<p>Good luck with your experiments! 🚀</p>
]]></content:encoded></item><item><title><![CDATA[Explainable Radiologist-Aligned VLM for CT Image Quality Assessment]]></title><description><![CDATA[Authors: Jiajun Wang, Yipeng Sun, Siming Bayer, Andreas MaierAffiliation: Pattern Recognition Lab, Friedrich-Alexander Universität Erlangen-Nürnberg, GermanyLinks: GitHub Repository

📖 Abstract
The a]]></description><link>https://jiajun.de/ctiqa</link><guid isPermaLink="true">https://jiajun.de/ctiqa</guid><category><![CDATA[paper]]></category><dc:creator><![CDATA[Jiajun Wang(Jesse)]]></dc:creator><pubDate>Fri, 19 Dec 2025 16:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/upload/v1770902202479/c3efe579-ee23-441a-9b8a-638e583657a7.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Authors:</strong> <strong>Jiajun Wang</strong>, Yipeng Sun, Siming Bayer, Andreas Maier<br /><strong>Affiliation:</strong> Pattern Recognition Lab, Friedrich-Alexander Universität Erlangen-Nürnberg, Germany<br /><strong>Links:</strong> <a href="https://github.com/atJesse/VLM-CT-IQA">GitHub Repository</a></p>
<hr />
<h2>📖 Abstract</h2>
<p>The assessment of computed tomography (CT) image quality has traditionally relied on manual evaluation by radiologists—a method that is both subjective and time-consuming. While Deep Learning methods exist, they often only give quantitative scores and lack explainability. To address this, we propose a <strong>parameter-efficient supervised fine-tuning (SFT) framework</strong> for the medical VLM, <strong>MedGemma-4B-IT</strong>. By employing <strong>Quantized Low-Rank Adaptation (QLoRA)</strong>, we aligned the model's visual perception with expert quantitative judgment.</p>
<p>Our results demonstrate a substantial improvement in correlation with expert scores (<strong>SRCC=0.7950</strong>, <strong>PLCC=0.7907</strong>), significantly outperforming zero-shot baselines like Gemini 2.5 Pro and Gemini 2.5 Flash. Most importantly, our model generates <strong>professional textual explanations</strong> that emulate the reasoning and explanation style of radiologists.</p>
<hr />
<h2>💡 Motivation: Why Explainable AI for CT?</h2>
<p>Diagnostic utility in CT scans depends critically on image quality. Degradation due to noise, artifacts, or insufficient contrast can lead to misdiagnoses and repeated examinations.</p>
<ul>
<li><p><strong>The Problem with Manual Scoring:</strong> It is labor-intensive, time-consuming, and prone to inter-observer variability.</p>
</li>
<li><p><strong>The Problem with Previous AI:</strong> Conventional Deep Learning methods (NR-IQA) provide a score but remain opaque, making it difficult to interpret why specific scores are assigned.</p>
</li>
<li><p><strong>The Problem with Cloud VLMs:</strong> General-purpose, closed-source VLMs are often constrained by patient privacy regulations.</p>
</li>
</ul>
<p>Our goal was to create a <strong>locally deployable, privacy-preserving, and explainable</strong> solution.</p>
<hr />
<h2>🛠️ Methodology: The Framework</h2>
<p>We formulated CT-IQA (Image Quality Assessment) as a multimodal reasoning task.</p>
<h3>1. The Model Architecture</h3>
<p>We selected <strong>MedGemma-4B-IT</strong> as our base model due to its strong medical priors. The architecture consists of:</p>
<ul>
<li><p><strong>Vision Encoder:</strong> SigLIP, to capture local anatomical and noise patterns.</p>
</li>
<li><p><strong>Multimodal Projector:</strong> Aligns visual representations with the language space.</p>
</li>
<li><p><strong>Language Model:</strong> Gemma, generating both the textual reasoning and the final quality scores.</p>
<img src="https://cdn.hashnode.com/uploads/covers/698c94f239413f8a70063736/1bb83288-e999-4180-b389-27af9790c989.png" alt="" style="display:block;margin:0 auto" /></li>
</ul>
<h3>2. Parameter-Efficient Fine-Tuning (QLoRA)</h3>
<p>To make training efficient, we used <strong>QLoRA</strong>.</p>
<ul>
<li><p><strong>4-bit Quantization:</strong> The backbone weights are quantized to 4-bit precision and frozen to reduce memory consumption.</p>
</li>
<li><p><strong>Trainable Adapters:</strong> Only the low-rank adapters (~1% of parameters) are trainable.</p>
</li>
</ul>
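<p>For orientation, a typical QLoRA configuration with Hugging Face <code>transformers</code> and <code>peft</code> looks roughly like the sketch below. The rank, alpha, and target modules are illustrative assumptions, not the values used in the paper:</p>
<pre><code class="language-python">import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig

# 4-bit NF4 quantization for the frozen backbone
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Small trainable low-rank adapters (roughly 1% of the parameters)
lora_config = LoraConfig(
    r=16,                      # adapter rank (illustrative)
    lora_alpha=32,             # scaling factor (illustrative)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
</code></pre>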
<h3>3. Data Construction with "Teacher" Distillation</h3>
<p>Since large labeled datasets with explanations are scarce, we created a novel pipeline using the <strong>LDCTIQA dataset</strong> (1,000 CT slices):</p>
<ul>
<li><p><strong>Teacher Model:</strong> We employed <strong>Gemini 2.5 Pro</strong> (the best zero-shot performer) to generate expert-level textual explanations for the training data.</p>
</li>
<li><p><strong>Fine-tuning:</strong> We trained our model to mimic these high-quality explanations, pairing them with radiologist scores.</p>
</li>
</ul>
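<p>Concretely, each SFT sample pairs a CT slice with a teacher explanation anchored to the radiologist score. A minimal sketch of one training record (the field names and file path are hypothetical, not the repository's actual schema):</p>
<pre><code class="language-python">import json

record = {
    "image": "ldctiqa/slice_0042.png",   # hypothetical file path
    "prompt": "Assess the quality of this CT slice and give a score.",
    "response": (
        "Score: 1.0. Severe streak artifacts and high noise obscure "
        "soft-tissue detail."
    ),  # teacher (Gemini 2.5 Pro) explanation paired with the expert score
}
print(json.dumps(record, ensure_ascii=False))
</code></pre>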
<hr />
<h2>📊 Results</h2>
<h3>Quantitative Performance</h3>
<p>Our fine-tuned model achieved state-of-the-art results compared to zero-shot baselines on the test set.</p>
<table>
<thead>
<tr>
<th>Model</th>
<th>SRCC (Correlation)</th>
<th>PLCC (Linearity)</th>
<th>MAE (Error)</th>
</tr>
</thead>
<tbody><tr>
<td><strong>MedGemma-4B-IT (Fine-tuned)</strong></td>
<td><strong>0.7950</strong></td>
<td><strong>0.7907</strong></td>
<td><strong>0.5780</strong></td>
</tr>
<tr>
<td>Gemini 2.5 Pro (Zero-shot)</td>
<td>0.7328</td>
<td>0.7204</td>
<td>0.6540</td>
</tr>
<tr>
<td>Gemini 2.5 Flash (Zero-shot)</td>
<td>0.7170</td>
<td>0.6946</td>
<td>0.7360</td>
</tr>
<tr>
<td>MedGemma-4B-IT (Zero-shot)</td>
<td>-0.2438</td>
<td>-0.2029</td>
<td>1.4790</td>
</tr>
</tbody></table>
<p><em>Data Source: Table 1 in Paper</em></p>
<p>Fine-tuning improved the SRCC by more than <strong>1.0</strong> over the model's own zero-shot weights (from −0.2438 to 0.7950).</p>
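<p>For reference, the three metrics in the table are straightforward to compute. A minimal pure-Python sketch (in practice one would use <code>scipy.stats.spearmanr</code> and <code>pearsonr</code>, which also handle tied ranks):</p>
<pre><code class="language-python">def pearson(a, b):
    """PLCC: Pearson linear correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def ranks(xs):
    """Rank of each value, 1-based (ties ignored for brevity)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    out = [0.0] * len(xs)
    for r, i in enumerate(order, start=1):
        out[i] = float(r)
    return out

def srcc(a, b):
    """SRCC: Spearman rank correlation = Pearson on the ranks."""
    return pearson(ranks(a), ranks(b))

def mae(pred, gt):
    """Mean absolute error between predicted and expert scores."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)
</code></pre>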
<h3>Qualitative Analysis (Case Study)</h3>
<p>In testing, when presented with a low-quality image (Ground Truth Score: 1.2), the zero-shot baseline incorrectly rated it as high quality (3.2).</p>
<img src="https://cdn.hashnode.com/res/hashnode/image/upload/v1770903965417/0e79b429-3500-4b2f-bb02-970a3d01745f.png" alt="" style="display:block;margin:0 auto" />

<p><strong>Our Fine-Tuned Model:</strong></p>
<ul>
<li><p><strong>Predicted Score:</strong> 1.0 (Very close to GT 1.2).</p>
</li>
<li><p><strong>Generated Reasoning:</strong> Correctly identified "Severe Artifacts," "Streak Artifacts," and "High Noise". It explicitly noted that streaks were radiating from the pelvic girdle, obscuring tissue texture.</p>
</li>
</ul>
<hr />
<h2>🚀 Conclusion</h2>
<p>We demonstrated that a specialized medical VLM can be fine-tuned to emulate the reasoning style of radiologists. This provides a <strong>locally deployable, privacy-preserving, and explainable</strong> tool for automated CT image quality assessment.</p>
<h3>🔗 Resources</h3>
<ul>
<li><strong>Code:</strong> <a href="https://github.com/atJesse/VLM-CT-IQA">GitHub - VLM-CT-IQA</a></li>
</ul>
<hr />
<h3>📝 Citation</h3>
<p>If you find this work helpful, please consider citing our paper:</p>
<pre><code class="language-plaintext">@inproceedings{wang2025explainable,
  title={Explainable Radiologist-Aligned VLM for CT Image Quality Assessment},
  author={Wang, Jiajun and Sun, Yipeng and Bayer, Siming and Maier, Andreas},
  booktitle={German Conference on Medical Image Computing (BVM)},
  year={2026}
}
</code></pre>
]]></content:encoded></item><item><title><![CDATA[My personal experiences and introduction]]></title><description><![CDATA[Hi, I am Jiajun Wang(Jesse as my nickname). I am currently a Master's student at FAU Erlangen-Nürnberg, combining a solid Engineering foundation with cutting-edge AI research.
Engineering Background: ]]></description><link>https://jiajun.de/exp</link><guid isPermaLink="true">https://jiajun.de/exp</guid><category><![CDATA[Personal growth  ]]></category><dc:creator><![CDATA[Jiajun Wang(Jesse)]]></dc:creator><pubDate>Mon, 08 Dec 2025 16:00:00 GMT</pubDate><enclosure url="https://cdn.hashnode.com/res/hashnode/image/stock/unsplash/ewGMqs2tmJI/upload/63da5ca6bb0bed90611728b1551ce103.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Hi, I am Jiajun Wang(Jesse as my nickname). I am currently a Master's student at FAU Erlangen-Nürnberg, combining a solid Engineering foundation with cutting-edge AI research.</p>
<p>Engineering Background: Experience in telecommunications and industrial automation, with granted patents for innovative system designs.</p>
<p>AI Research: Published author (BVM Conference) on fine-tuning Vision-Language Models (VLM). Currently researching Diffusion Models for medical image denoising.</p>
<p>Goal: I am actively seeking an internship in AI, Computer Vision, or Deep Learning. I bring a cheerful personality and a strong ability to bridge engineering challenges with advanced AI solutions.</p>
<h1>📅 Work and Education History</h1>
<h3>🔬 Project Research Student</h3>
<p><a href="https://lme.tf.fau.de/"><strong>Pattern Recognition Lab, FAU Erlangen-Nürnberg</strong></a> | <em>Oct 2025 – Present</em></p>
<ul>
<li><p><strong>Focus:</strong> Generative AI, VLM, Medical Imaging.</p>
</li>
<li><p><strong>Key Achievement:</strong> Fine-tuned <strong>MedGemma-4B</strong> using <strong>QLoRA</strong> for CT image quality assessment.</p>
</li>
<li><p><strong>Outcome:</strong> Paper accepted at the <strong>German Conference on Medical Image Computing (BVM)</strong>. Currently researching <strong>Diffusion Models</strong> for medical image denoising.</p>
</li>
</ul>
<h3>🎓 M.Sc. in Electromobility</h3>
<p><a href="https://www.fau.eu/"><strong>Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU)</strong></a> | <em>Oct 2022 – Present</em></p>
<ul>
<li><p><strong>Specialization:</strong> Artificial Intelligence, Deep Learning, and Computer Vision.</p>
</li>
<li><p>Transitioned from engineering to advanced AI research.</p>
</li>
</ul>
<h3>📡 TETRA System Test Engineer</h3>
<p><a href="https://www.hytera.com/en/about-hytera/hytera-profile.html#hy-overview-1"><strong>Hytera Communications</strong></a> | <em>Jul 2022 – Oct 2022</em></p>
<ul>
<li><p><strong>Deployment:</strong> Configured TETRA digital trunking systems (BSCU, CHU) for public safety networks.</p>
</li>
<li><p><strong>Skills:</strong> Root cause analysis using <strong>Linux, Wireshark, and Xshell</strong>.</p>
</li>
</ul>
<h3>🎓 B.E. in Automation</h3>
<p><a href="https://en.xpu.edu.cn/"><strong>Xi'an Polytechnic University</strong></a> | <em>Sep 2018 – Jun 2022</em></p>
<ul>
<li><p><strong>Major:</strong> Automation Engineering Technology.</p>
</li>
<li><p><strong>Foundation:</strong> Built a strong background in Control Systems and Hardware-Software integration.</p>
</li>
</ul>
<h3>⚙️ Engineering Intern</h3>
<p><a href="https://www.esquel.com/"><strong>Esquel Group</strong></a> | <em>Jul 2021 – Aug 2021</em></p>
<ul>
<li><p><strong>Innovation:</strong> Designed a <strong>PLC-based monitoring system</strong> using position sensors to track roller displacement.</p>
</li>
<li><p><strong>Impact:</strong> Reduced manual errors and secured <strong>1 Invention Patent</strong> &amp; <strong>1 Utility Model Patent</strong>.</p>
</li>
</ul>
<h1>🛠 Skills</h1>
<ul>
<li><p><strong>AI &amp; Deep Learning:</strong> PyTorch, LLMs/VLMs (Fine-tuning, RAG), Diffusion Models, PEFT (LoRA/QLoRA), SFT, RLHF, Computer Vision.</p>
</li>
<li><p><strong>Programming:</strong> Python, C, MATLAB.</p>
</li>
<li><p><strong>Tools &amp; Platforms:</strong> Linux, HPC, Docker, Git, Wireshark, Jira.</p>
</li>
<li><p><strong>Languages:</strong> English (Professional), German (Basic), Chinese (Native).</p>
</li>
</ul>
<h1>🏔️ Beyond Work</h1>
<h3>🏃‍♂️ Active Lifestyle</h3>
<p>I believe a healthy body fuels a creative mind.</p>
<ul>
<li><p>🏂 <strong>Snowboarding:</strong> Passionate about carving through snow in winter.</p>
</li>
<li><p>🏊‍♂️ <strong>Swimming:</strong> Building endurance and focus in the water.</p>
</li>
<li><p>🥾 <strong>Hiking &amp; Travel:</strong> Exploring nature and experiencing diverse cultures.</p>
</li>
<li><p>📷 <strong>Photography:</strong> Capturing life's moments and finding unique perspectives through both modern digital lenses and the timeless charm of classic vintage film cameras.</p>
</li>
</ul>
<h3>🥗 Healthy Habits</h3>
<p>I maintain a disciplined lifestyle to stay at peak performance.</p>
<ul>
<li><p>🚭 <strong>Substance-Free:</strong> Non-smoker and non-drinker.</p>
</li>
<li><p>🍬 <strong>Conscious Diet:</strong> Committed to a <a href="https://www.sciencedirect.com/topics/agricultural-and-biological-sciences/low-sugar-diet"><strong>low-sugar lifestyle</strong></a>, strictly minimizing refined sugar intake for better health and mental clarity.</p>
</li>
</ul>
<h3>❤️ Social Impact</h3>
<p>Giving back to the community is an essential part of my life.</p>
<ul>
<li><p>🧒 <strong>Child Development:</strong> I care deeply about the growth, education, and well-being of the next generation, and actively support children's welfare through periodic charitable donations.</p></li>
</ul>
]]></content:encoded></item></channel></rss>