
The Ultimate Beginner’s Guide to FAU HPC: From Zero to A100


So, you’ve received an invitation to use the High-Performance Computing (HPC) cluster at FAU (likely through a Tier 3 project). You want to run deep learning, vision-language model (VLM), or reinforcement learning (RL) experiments, but you are staring at a blank terminal and don't know where to start.

Don't worry. I went through the exact same struggle—configuring SSH keys, getting "Permission Denied" errors, and wondering why PyTorch couldn't see the GPU.

This guide will take you from receiving the email to training on an NVIDIA A100, step-by-step.


Step 1: Accept the Invitation

  1. Log in to the FAU HPC Portal using your standard IdM credentials (e.g., abc1234d).

  2. Go to User -> Your Invitations and accept the project invitation.

  3. Wait overnight. The system runs a synchronization script every night. You usually cannot log in immediately after accepting; your home directory needs time to be created.


Step 2: Generate and Upload Your SSH Key

HPC systems don't use passwords; they use key pairs. You need to generate a lock (the public key) and the matching key (the private key).

For Windows (PowerShell) / Mac / Linux:

Open your terminal and run:

ssh-keygen -t rsa -b 4096
  1. Press Enter to save it in the default location.

  2. Press Enter twice to skip setting a passphrase (convenient for automation, though a passphrase is more secure).

  3. Display your public key:

    • Windows (PowerShell): type $env:USERPROFILE\.ssh\id_rsa.pub

    • Mac/Linux: cat ~/.ssh/id_rsa.pub

  4. Copy the whole line (starting with ssh-rsa and ending with the username@hostname comment).
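Before uploading, it's worth sanity-checking the key. A minimal sketch — it creates a throwaway key in a temp directory so it is safe to run anywhere; on your own machine you would simply point ssh-keygen -lf at ~/.ssh/id_rsa.pub:

```shell
# Generate a throwaway 4096-bit RSA key pair (no passphrase) in a temp dir,
# then print its length and fingerprint -- the same check you would run on
# your real key with: ssh-keygen -lf ~/.ssh/id_rsa.pub
tmp=$(mktemp -d)
ssh-keygen -t rsa -b 4096 -N "" -f "$tmp/id_rsa" -q
info=$(ssh-keygen -lf "$tmp/id_rsa.pub")
echo "$info"
rm -rf "$tmp"
```

If the first number printed is not 4096, you are looking at a different key than the one you just generated.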

Upload to Portal:

  1. Go back to the HPC Portal.

  2. Go to User -> Your Accounts.

  3. Find your HPC account (e.g., abc1234d), click on it, and paste your key into the "Add new SSH Key" section.

  4. Wait 15-20 minutes for the key to sync to the servers.


Step 3: Configure VS Code (The "Pro" Setup)

Do not try to use the raw terminal for everything. Use VS Code with the Remote - SSH extension. It allows you to edit code on the server as if it were on your laptop.

The "ProxyJump" Trick

Direct access to GPU nodes (like tinyx) is often blocked from outside the university network. We need to jump through a "gatekeeper" server called csnhr.

  1. In VS Code, install the Remote - SSH extension.

  2. Click the blue >< icon (bottom left) -> Open Configuration File -> Select your .ssh/config.

  3. Paste the following configuration (Replace abc1234d with YOUR HPC username):

# 1. The Gatekeeper (Jump Host)
Host csnhr
    HostName csnhr.nhr.fau.de
    User abc1234d
    IdentityFile ~/.ssh/id_rsa
    IdentitiesOnly yes
    PasswordAuthentication no

# 2. Woody (CPU Frontend - Good for data transfer)
Host woody
    HostName woody.nhr.fau.de
    User abc1234d
    ProxyJump csnhr
    IdentityFile ~/.ssh/id_rsa
    IdentitiesOnly yes

# 3. TinyX (Tier3 GPU Frontend - RUN YOUR EXPERIMENTS HERE)
Host tinyx
    HostName tinyx.nhr.fau.de
    User abc1234d
    ProxyJump csnhr
    IdentityFile ~/.ssh/id_rsa
    IdentitiesOnly yes
  4. Save the file.

  5. Click the blue >< icon -> Connect to Host -> Select tinyx.
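Before connecting, you can check that SSH parses the config the way you expect: ssh -G prints the effective options for a host without opening any connection. A minimal sketch — it writes a stripped-down copy of the tinyx entry (placeholder username) to a temp file so it runs anywhere; against your real setup you would just run ssh -G tinyx:

```shell
# Write a minimal copy of the tinyx entry and ask ssh to resolve it.
# No connection is made; -G only prints the options it would use.
cat > /tmp/hpc_ssh_config <<'EOF'
Host tinyx
    HostName tinyx.nhr.fau.de
    User abc1234d
    ProxyJump csnhr
EOF
resolved=$(ssh -G -F /tmp/hpc_ssh_config tinyx)
echo "$resolved" | grep -E '^(hostname|user|proxyjump) '
rm -f /tmp/hpc_ssh_config
```

You should see the hostname, user, and proxyjump lines echo back exactly what you configured; if proxyjump is missing, your SSH client is too old for ProxyJump.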


Step 4: Know Your Territory ($HOME vs $WORK)

Once logged in, you need to know where to put your files. This is the most common mistake beginners make.

  • $HOME (/home/hpc/...):

    • Size: Very small (100GB).

    • Use for: Config files, scripts, source code.

    • NEVER put: Datasets, Conda environments, or Model checkpoints here. You will run out of space immediately.

  • $WORK (/home/woody/...):

    • Size: Large (on the order of 1 TB or more, depending on your project quota).

    • Use for: EVERYTHING BIG. Install Miniforge here. Download datasets here.

    • How to find it: Run echo $WORK in the terminal.

Always switch to $WORK before doing anything:

cd $WORK
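A quick orientation check right after logging in (nothing here is cluster-specific; $WORK is simply expected to be set by the cluster environment):

```shell
# Print where the two areas live. An empty WORK usually means you are not
# on the cluster, or your account has not finished syncing yet.
home_line="HOME = $HOME"
work_line="WORK = ${WORK:-(not set)}"
echo "$home_line"
echo "$work_line"
```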

Step 5: Setting Up the Environment (The Right Way)

Do not use the default Python. Do not use Anaconda (it's too bloated). Use Miniforge.

  1. Download and Install (in the terminal on tinyx):

     cd $WORK
     wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
     bash Miniforge3-Linux-x86_64.sh
    
    • Crucial: When asked for the installation path, ensure it says /home/woody/.... If it says /home/hpc/..., edit it manually!

    • Type yes to initialize.

  2. Restart your terminal (close and reopen the terminal pane in VS Code).

  3. Create your Environment: (e.g., vlm_env)

     mamba create -n vlm_env python=3.10
     mamba activate vlm_env
    

Step 6: Installing PyTorch (The Trap!)

Here is where many fail.

  1. Compute nodes (GPUs) have NO INTERNET. You must install packages on the Login Node (tinyx).

  2. Mamba sometimes defaults to CPU versions. Use pip to force the CUDA version.

The Golden Command (run this on tinyx):

mamba activate vlm_env
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers accelerate bitsandbytes

Note: you install and verify packages on the Login Node (which has internet access); the actual training runs later on a Compute Node.


Step 7: Accessing GPUs

You are currently on tinyx (a shared login node). DO NOT run training here. You must request a Compute Node.

Option A: Interactive Mode (Debugging/Testing)

Use salloc to get a GPU for a short time (e.g., 30 mins).

To check what GPUs are available:

sinfo -o "%20P %G"

(You might see partitions like a100, v100, rtx3080).

To request an A100 (The Beast):

salloc --partition=a100 --gres=gpu:a100:1 --time=00:30:00

To request a V100 (Reliable):

salloc --partition=v100 --gres=gpu:v100:1 --time=00:30:00

Once inside (the prompt changes to a compute-node name like tgXXX), reactivate your environment and test:

mamba activate vlm_env
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}'); print(f'Device: {torch.cuda.get_device_name(0)}')"

If it says True and NVIDIA A100, you win!

Option B: Batch Jobs (Real Training)

For long training runs (e.g., 24 hours), first make sure the log directory exists (mkdir -p logs — Slurm will not create it for you), then create a script called run.sh:

#!/bin/bash
#SBATCH --job-name=vlm_train
#SBATCH --output=logs/%j.out
#SBATCH --partition=a100       # or v100
#SBATCH --gres=gpu:a100:1      # or gpu:v100:1
#SBATCH --time=24:00:00

source $WORK/miniforge3/bin/activate vlm_env
python train.py

Submit it with: sbatch run.sh
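The header above is the minimum. A few more directives are often useful; the values below are illustrative assumptions, not site requirements — check the FAU documentation for partition-specific limits:

```shell
#SBATCH --cpus-per-task=8        # CPU cores for the data-loading workers
#SBATCH --mail-type=END,FAIL     # email me when the job finishes or fails
#SBATCH --export=NONE            # start from a clean environment
```

After submitting, check the queue with squeue -u $USER and follow the training output with tail -f logs/<jobid>.out.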


Summary Cheat Sheet

  1. Connect: VS Code -> tinyx.

  2. Workspace: cd $WORK.

  3. Install: Run installs on tinyx (Login Node).

  4. Debug: salloc ... to get an interactive GPU.

  5. Train: sbatch run.sh for long jobs.

Good luck with your experiments! 🚀
