Multi-GPU CUDA benchmark suite for quantized compute and memory bandwidth (fp4/fp8/bf16/fp16/fp32/fp64). Follows the patterns of the CUDA samples matrixMulCUBLAS and bandwidthTest to drive SMs and memory to full load.
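The compute phase boils down to timing back-to-back GEMMs. A minimal sketch of that pattern, assuming plain cuBLAS SGEMM with CUDA event timing (matrix size, iteration count, and the FLOP accounting are illustrative; this is not the repo's actual code):

```cpp
// Illustrative GEMM stress loop in the style of the matrixMulCUBLAS sample.
// Build with: nvcc -O2 gemm_sketch.cu -lcublas
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int n = 8192, iters = 30;           // mirrors --matrix / --compute-iters
    float *a, *b, *c;
    cudaMalloc(&a, sizeof(float) * n * n);
    cudaMalloc(&b, sizeof(float) * n * n);
    cudaMalloc(&c, sizeof(float) * n * n);
    cudaMemset(a, 0, sizeof(float) * n * n);  // contents don't matter for throughput
    cudaMemset(b, 0, sizeof(float) * n * n);

    cublasHandle_t h;
    cublasCreate(&h);
    const float alpha = 1.0f, beta = 0.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cublasSgemm(h, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, a, n, b, n, &beta, c, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double tflops = 2.0 * n * n * n * iters / (ms * 1e-3) / 1e12;  // 2*n^3 FLOPs per GEMM
    printf("fp32: %.1f TFLOP/s\n", tflops);

    cublasDestroy(h);
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```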
Local benchmark results (cuBLASLt + Transformer Engine), run on my desktop with matrix=8192, compute-iters=30, seconds=120 (FP4 uses seconds=180):
| Metric | RTX 5090 D | RTX 4090 |
|---|---|---|
| Compute capability (SM) | 12.0 | 8.9 |
| fp32 (TFLOP/s) | 116.3 | 89.9 |
| bf16 (TFLOP/s) | 234.8 | 175.1 |
| fp16 (TFLOP/s) | 236.3 | 175.1 |
| fp8 with fp32 accum (TFLOP/s) | 548.5 | N/A |
| fp4 with fp32 accum (TFLOP/s) | 1040.0 | N/A |
| Memory D2D memcpy (GB/s) | 1528.8 | 918.8 |
| Memory Streaming (GB/s) | 1340.8 | 842.7 |
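The D2D memcpy row corresponds to timed device-to-device copies, as in the bandwidthTest sample. A hedged sketch of that measurement (buffer size and iteration count are illustrative; this is not the repo's implementation):

```cpp
// Illustrative D2D bandwidth measurement. Each copy reads and writes the
// buffer once, so effective traffic is 2x the buffer size per iteration.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t bytes = 1ull << 30;  // 1 GiB per buffer (illustrative size)
    const int iters = 50;             // mirrors --mem-iters
    void *src, *dst;
    cudaMalloc(&src, bytes);
    cudaMalloc(&dst, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(dst, src, bytes, cudaMemcpyDeviceToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    double gbps = 2.0 * bytes * iters / (ms * 1e-3) / 1e9;  // read + write
    printf("D2D memcpy: %.1f GB/s\n", gbps);

    cudaFree(src); cudaFree(dst);
    return 0;
}
```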
Note: FP8 tensor cores require SM90+ (Hopper/Blackwell). RTX 4090 (Ada, SM89) does not expose FP8 kernels; NVIDIA forum confirmation: https://forums.developer.nvidia.com/t/4090-doesnt-have-fp8-compute/232256
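To check up front which of your GPUs can take the FP8 path, a small capability probe is enough; a sketch (the gating rule here, compute capability major >= 9, is an assumption based on the note above, not the repo's code):

```cpp
// Probe each device's compute capability and report FP8 eligibility.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, dev);
        bool fp8_ok = p.major >= 9;  // Hopper (SM90) and Blackwell qualify; Ada (SM89) does not
        printf("GPU %d: %s (SM %d.%d) fp8=%s\n",
               dev, p.name, p.major, p.minor, fp8_ok ? "yes" : "no");
    }
    return 0;
}
```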
```
git clone https://github.com/binbinsh/gpu-bench.git
cd gpu-bench
make    # builds gpu_bench (defaults to sm_89)
```

Choose an arch with `make COMPUTE=90` (Hopper), or `make COMPUTE=100` / `make COMPUTE=120` for newer Blackwell parts when available. Clean artifacts with `make clean`.
FP4 path (CUDA >= 12.8 with Transformer Engine):

```
git submodule update --init --recursive transformer-engine
CXX=/usr/bin/g++-12 CC=/usr/bin/gcc-12 cmake -S transformer-engine/transformer_engine/common \
  -B transformer-engine/build -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-12 -DCMAKE_CUDA_ARCHITECTURES=120
cmake --build transformer-engine/build -j32
make COMPUTE=89,120 ENABLE_FP4=1 CCPATH=/usr/bin/g++-12 TE_LIBDIR=transformer-engine/build TE_INCLUDEDIR=transformer-engine/transformer_engine/common/include
```

Tip: COMPUTE accepts comma-separated arch codes for fatbin builds, e.g. `COMPUTE=89,120` to target both Ada (sm_89) and Blackwell (sm_120).
If you previously built Transformer Engine with a different compiler, clear the cache first:

```
rm -rf transformer-engine/build
```

List GPUs:

```
./gpu_bench --list
```
Quantized compute + memory sweep (all GPUs; with a 60 s compute phase and a 60 s memory phase, about 2 min of steady load):

```
./gpu_bench --matrix 8192 --compute-iters 30 --seconds 60 --parallel --report results/all.json
```

B200/Blackwell focus with tensor cores:

```
./gpu_bench --devices 0 --precisions bf16,fp16,fp8,fp32 --matrix 8192 --compute-iters 30 --no-memory --seconds 60 --report results/b200.json
```

FP4 check (Transformer Engine, 5090/B200-class):

```
./gpu_bench --devices 1 --precisions fp4 --matrix 8192 --compute-iters 80 --seconds 60 --no-memory --report results/fp4.json
```

Memory-only sweep:

```
./gpu_bench --no-compute --mem 40% --mem-iters 50 --seconds 60 --report results/mem.json
```

Key flags:
- `--precisions`: fp4,fp8,fp16,bf16,fp32,fp64
- `--compute-iters`: number of GEMM iterations per timing loop
- `--seconds`: keep each phase running for at least N seconds (see the sketch below)
- `--devices`: comma-separated CUDA device indices to run (e.g., `--devices 1`)
- `--parallel`: run all selected GPUs concurrently
- `--no-memory`: skip the memory bandwidth test (compute-only)
- `--no-compute`: skip GEMM (memory-only)
- `--mem <bytes|percent%>`: control bandwidth test size
- `--mem-iters <N>`: control bandwidth test length
- `--report <path>`: write a JSON summary per GPU
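How `--compute-iters` and `--seconds` might fit together: a sketch of a phase loop that repeats timed batches until the wall-clock budget is spent (an assumption about the structure, not the repo's code; `run_batch` is a hypothetical stand-in for one batch of `--compute-iters` GEMMs):

```cpp
#include <chrono>

// Hypothetical placeholder: in the real tool this would issue
// --compute-iters GEMM calls and record one throughput sample.
void run_batch() {}

void run_phase(double seconds) {
    using clock = std::chrono::steady_clock;
    const auto deadline = clock::now() + std::chrono::duration<double>(seconds);
    do {
        run_batch();                    // throughput is sampled once per batch
    } while (clock::now() < deadline);  // always runs at least one batch
}
```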
- Inspired by the gpu-burn project.