
The Ultimate Beginner’s Guide to FAU HPC: From Zero to A100


So, you’ve received an invitation to use the High-Performance Computing (HPC) cluster at FAU (likely through a Tier 3 project). You want to run deep learning, vision-language model (VLM), or reinforcement learning (RL) experiments, but you are staring at a blank terminal and don't know where to start.

Don't worry. I went through the exact same struggle—configuring SSH keys, getting "Permission Denied" errors, and wondering why PyTorch couldn't see the GPU.

This guide will take you from receiving the email to training on an NVIDIA A100, step-by-step.


Step 1: Accept the Invitation

  1. Log in to the FAU HPC Portal using your standard IdM credentials (e.g., abc1234d).

  2. Go to User -> Your Invitations and accept the project invitation.

  3. Wait overnight. The system runs a synchronization script every night. You usually cannot log in immediately after accepting; your home directory needs time to be created.


Step 2: Generate and Upload Your SSH Key

HPC systems don't use passwords; they use key pairs. You need to generate a lock (the public key) and the matching key (the private key).

For Windows (PowerShell) / Mac / Linux:

Open your terminal and run:

ssh-keygen -t rsa -b 4096
  1. Press Enter to save it in the default location.

  2. Press Enter twice to skip setting a passphrase (convenient for automation, though a passphrase is more secure).

  3. Display your public key:

    • Windows (PowerShell): type $env:USERPROFILE\.ssh\id_rsa.pub

    • Mac/Linux: cat ~/.ssh/id_rsa.pub

  4. Copy the whole line (starting with ssh-rsa and ending with the username@hostname comment).
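Before uploading, it's worth sanity-checking the key. A minimal sketch — it creates a throwaway key in a temp directory so it is safe to run anywhere; on your own machine you would simply point ssh-keygen -lf at ~/.ssh/id_rsa.pub:

```shell
# Generate a throwaway 4096-bit RSA key pair (no passphrase) in a temp dir,
# then print its length and fingerprint -- the same check you would run on
# your real key with: ssh-keygen -lf ~/.ssh/id_rsa.pub
tmp=$(mktemp -d)
ssh-keygen -t rsa -b 4096 -N "" -f "$tmp/id_rsa" -q
info=$(ssh-keygen -lf "$tmp/id_rsa.pub")
echo "$info"
rm -rf "$tmp"
```

If the first number printed is not 4096, you are looking at a different key than the one you just generated.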

Upload to Portal:

  1. Go back to the HPC Portal.

  2. Go to User -> Your Accounts.

  3. Find your HPC account (e.g., abc1234d), click on it, and paste your key into the "Add new SSH Key" section.

  4. Wait 15-20 minutes for the key to sync to the servers.


Step 3: Configure VS Code (The "Pro" Setup)

Do not try to use the raw terminal for everything. Use VS Code with the Remote - SSH extension. It allows you to edit code on the server as if it were on your laptop.

The "ProxyJump" Trick

Direct access to GPU nodes (like tinyx) is often blocked from outside the university network. We need to jump through a "gatekeeper" server called csnhr.

  1. In VS Code, install the Remote - SSH extension.

  2. Click the blue >< icon (bottom left) -> Open Configuration File -> Select your .ssh/config.

  3. Paste the following configuration (Replace abc1234d with YOUR HPC username):

# 1. The Gatekeeper (Jump Host)
Host csnhr
    HostName csnhr.nhr.fau.de
    User abc1234d
    IdentityFile ~/.ssh/id_rsa
    IdentitiesOnly yes
    PasswordAuthentication no

# 2. Woody (CPU Frontend - Good for data transfer)
Host woody
    HostName woody.nhr.fau.de
    User abc1234d
    ProxyJump csnhr
    IdentityFile ~/.ssh/id_rsa
    IdentitiesOnly yes

# 3. TinyX (Tier3 GPU Frontend - RUN YOUR EXPERIMENTS HERE)
Host tinyx
    HostName tinyx.nhr.fau.de
    User abc1234d
    ProxyJump csnhr
    IdentityFile ~/.ssh/id_rsa
    IdentitiesOnly yes
  4. Save the file.

  5. Click the blue >< icon -> Connect to Host -> Select tinyx.
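Before connecting, you can check that SSH parses the config the way you expect: ssh -G prints the effective options for a host without opening any connection. A minimal sketch — it writes a stripped-down copy of the tinyx entry (placeholder username) to a temp file so it runs anywhere; against your real setup you would just run ssh -G tinyx:

```shell
# Write a minimal copy of the tinyx entry and ask ssh to resolve it.
# No connection is made; -G only prints the options it would use.
cat > /tmp/hpc_ssh_config <<'EOF'
Host tinyx
    HostName tinyx.nhr.fau.de
    User abc1234d
    ProxyJump csnhr
EOF
resolved=$(ssh -G -F /tmp/hpc_ssh_config tinyx)
echo "$resolved" | grep -E '^(hostname|user|proxyjump) '
rm -f /tmp/hpc_ssh_config
```

You should see the hostname, user, and proxyjump lines echo back exactly what you configured; if proxyjump is missing, your SSH client is too old for ProxyJump.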


Step 4: Know Your Territory ($HOME vs $WORK)

Once logged in, you need to know where to put your files. This is the most common mistake beginners make.

  • $HOME (/home/hpc/...):

    • Size: Very small (100GB).

    • Use for: Config files, scripts, source code.

    • NEVER put: Datasets, Conda environments, or Model checkpoints here. You will run out of space immediately.

  • $WORK (/home/woody/...):

    • Size: Large (on the order of 1 TB or more, depending on your project quota).

    • Use for: EVERYTHING BIG. Install Miniforge here. Download datasets here.

    • How to find it: Run echo $WORK in the terminal.

Always switch to $WORK before doing anything:

cd $WORK
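A quick orientation check right after logging in (nothing here is cluster-specific; $WORK is simply expected to be set by the cluster environment):

```shell
# Print where the two areas live. An empty WORK usually means you are not
# on the cluster, or your account has not finished syncing yet.
home_line="HOME = $HOME"
work_line="WORK = ${WORK:-(not set)}"
echo "$home_line"
echo "$work_line"
```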

Step 5: Setting Up the Environment (The Right Way)

Do not use the default Python. Do not use Anaconda (it's too bloated). Use Miniforge.

  1. Download and Install (in the terminal on tinyx):

     cd $WORK
     wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
     bash Miniforge3-Linux-x86_64.sh
    
    • Crucial: When asked for the installation path, ensure it says /home/woody/.... If it says /home/hpc/..., edit it manually!

    • Type yes to initialize.

  2. Restart your terminal (close and reopen the terminal pane in VS Code).

  3. Create your Environment: (e.g., vlm_env)

     mamba create -n vlm_env python=3.10
     mamba activate vlm_env
    

Step 6: Installing PyTorch (The Trap!)

Here is where many fail.

  1. Compute nodes (GPUs) have NO INTERNET. You must install packages on the Login Node (tinyx).

  2. Mamba sometimes defaults to CPU versions. Use pip to force the CUDA version.

The Golden Command (run this on tinyx):

mamba activate vlm_env
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install transformers accelerate bitsandbytes

Note: you install and verify packages on the Login Node (which has internet access); the actual training runs later on a Compute Node.


Step 7: Accessing GPUs

You are currently on tinyx (a shared login node). DO NOT run training here. You must request a Compute Node.

Option A: Interactive Mode (Debugging/Testing)

Use salloc to get a GPU for a short time (e.g., 30 mins).

To check what GPUs are available:

sinfo -o "%20P %G"

(You might see partitions like a100, v100, rtx3080).

To request an A100 (The Beast):

salloc --partition=a100 --gres=gpu:a100:1 --time=00:30:00

To request a V100 (Reliable):

salloc --partition=v100 --gres=gpu:v100:1 --time=00:30:00

Once inside (the prompt changes to a compute-node name like tgXXX), reactivate your environment and test:

mamba activate vlm_env
python -c "import torch; print(f'CUDA: {torch.cuda.is_available()}'); print(f'Device: {torch.cuda.get_device_name(0)}')"

If it says True and NVIDIA A100, you win!

Option B: Batch Jobs (Real Training)

For long training runs (e.g., 24 hours), first make sure the log directory exists (mkdir -p logs — Slurm will not create it for you), then create a script called run.sh:

#!/bin/bash
#SBATCH --job-name=vlm_train
#SBATCH --output=logs/%j.out
#SBATCH --partition=a100       # or v100
#SBATCH --gres=gpu:a100:1      # or gpu:v100:1
#SBATCH --time=24:00:00

source $WORK/miniforge3/bin/activate vlm_env
python train.py

Submit it with: sbatch run.sh
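The header above is the minimum. A few more directives are often useful; the values below are illustrative assumptions, not site requirements — check the FAU documentation for partition-specific limits:

```shell
#SBATCH --cpus-per-task=8        # CPU cores for the data-loading workers
#SBATCH --mail-type=END,FAIL     # email me when the job finishes or fails
#SBATCH --export=NONE            # start from a clean environment
```

After submitting, check the queue with squeue -u $USER and follow the training output with tail -f logs/<jobid>.out.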


Summary Cheat Sheet

  1. Connect: VS Code -> tinyx.

  2. Workspace: cd $WORK.

  3. Install: Run installs on tinyx (Login Node).

  4. Debug: salloc ... to get an interactive GPU.

  5. Train: sbatch run.sh for long jobs.

Good luck with your experiments! 🚀
