Performance — edf-omp¶
edf vs. edf-omp¶
Both binaries are compiled from the same source with the same flags
(gfortran -fopenmp -fbounds-check -O). The only difference is in the link step:
| | edf | edf-omp |
|---|---|---|
| Link | -static -static-libgcc + bundled .a files | Dynamic system LAPACK/BLAS |
| OpenMP | Compiled in but non-functional | Active |
| Portability | Self-contained, no runtime deps | Requires system LAPACK/BLAS + libgomp |
| Threading | Always single-core | Multi-core via OMP_NUM_THREADS |
The Makefile comment (lines 4–5) explains the reason: with -static -static-libgcc the
OpenMP runtime library (libgomp) is linked statically in a way that prevents the
user-level !$omp directives from running multi-threaded. Only edf-omp (dynamic link)
activates threading.
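A quick way to confirm which binary has a usable OpenMP runtime is to inspect its dynamic dependencies. This is a sketch that assumes both binaries sit in the current directory:

```shell
# edf-omp is dynamically linked, so libgomp should show up as a dependency:
ldd ./edf-omp | grep gomp

# the statically linked edf reports no dynamic dependencies at all:
ldd ./edf
```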
Building edf-omp¶
make clean && make edf-omp
Requires the system LAPACK and BLAS static libraries at
/usr/lib/x86_64-linux-gnu/lapack/liblapack.a and
/usr/lib/x86_64-linux-gnu/blas/libblas.a. See Installation for prerequisites.
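Before building, it may help to verify that the static archives are actually present. A sketch using the paths above; the package names in the message are the usual Debian/Ubuntu ones and are an assumption:

```shell
for lib in /usr/lib/x86_64-linux-gnu/lapack/liblapack.a \
           /usr/lib/x86_64-linux-gnu/blas/libblas.a; do
  if [ -f "$lib" ]; then
    echo "found:   $lib"
  else
    echo "missing: $lib (try installing liblapack-dev / libblas-dev)"
  fi
done
```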
Setting the thread count¶
OMP_NUM_THREADS=8 ./edf-omp < input.edfinp > output.edfout
Set OMP_NUM_THREADS to the number of physical cores (not hyperthreads) for
compute-bound workloads. Hyperthreading rarely helps for floating-point-heavy loops.
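On Linux, `nproc` reports logical CPUs; counting distinct (core, socket) pairs gives the physical-core count to use here. A sketch assuming the util-linux `lscpu` is available:

```shell
# logical CPUs (hyperthreads included):
nproc --all

# physical cores = number of distinct (core, socket) pairs:
lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l
```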
To check the thread count being used, look at the timer section at the end of the output:
the _linearsys and _splitedf timers reflect wall time for the parallel regions.
What is parallelized¶
The !$omp parallel directives in binedf.f (the 2-group binomial recurrence engine)
create three parallel regions:
| Region | binedf.f line | Schedule | Parallelizes |
|---|---|---|---|
| Alpha _linearsys | 649 | dynamic,4 | Loop over alpha spin-configs (triangular; npa iterations) |
| Beta _linearsys | 804 | dynamic,4 | Loop over beta spin-configs (triangular; npb iterations) |
| _splitedf | 938 | dynamic,1 | Outer loop over alpha config pairs (m1=1,npa) |
When does this help?¶
The binedf.f engine is used automatically when ngroup = 2 (the two-group case);
the RECUR keyword can force its use for larger systems as well. For ngroup > 2, the
other engine files (calcedfd.f, rcalcedf.f, xcalcedf.f, etc.) handle the
computation; these are not yet parallelized.
Expected speedup (2-group case):
- Near-linear for large CASSCF wavefunctions with many determinants (large `npa`, `npb`).
- Modest for small single-determinant systems, where overhead dominates.
- The `_linearsys` regions use `schedule(dynamic,4)` to absorb the triangular loop imbalance; the `_splitedf` region uses `schedule(dynamic,1)` because inner-loop costs vary with the configuration pair.
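The effect of dynamic scheduling on a triangular loop can be illustrated with a toy cost model (purely illustrative, not taken from the EDF source): if iteration m1 costs m1 units, a static split into contiguous blocks overloads the last thread, while handing out small chunks round-robin, as a dynamic schedule effectively does, stays close to balanced:

```shell
# Toy model: 1000 triangular iterations over 4 threads, iteration i costs i units.
awk 'BEGIN {
  npa = 1000; T = 4; chunk = 4
  # static: contiguous blocks of npa/T iterations per thread
  for (i = 1; i <= npa; i++) { t = int((i - 1) / (npa / T)); s[t] += i }
  # round-robin chunks of 4 approximate an ideal dynamic schedule
  for (i = 1; i <= npa; i++) { t = int((i - 1) / chunk) % T; d[t] += i }
  smax = 0; dmax = 0
  for (t = 0; t < T; t++) { if (s[t] > smax) smax = s[t]; if (d[t] > dmax) dmax = d[t] }
  printf "static worst thread: %d units\ndynamic worst thread: %d units\n", smax, dmax
}'
# -> static worst thread: 218875 units
# -> dynamic worst thread: 126630 units
```

The total work is 500500 units, so the ideal per-thread load is 125125: the static split's busiest thread carries about 75% more than ideal, while the chunked schedule lands within about 1% of it.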
Over-subscription warning¶
If the system BLAS (OpenBLAS, MKL) is compiled with internal threading, each of the N OpenMP threads may spawn M BLAS threads, giving N×M system threads on a machine with N cores. This severely degrades performance.
To prevent over-subscription when using edf-omp:
export OPENBLAS_NUM_THREADS=1 # for OpenBLAS
export MKL_NUM_THREADS=1 # for Intel MKL
export BLIS_NUM_THREADS=1 # for BLIS
Check which BLAS is installed:
ls -la /usr/lib/x86_64-linux-gnu/blas/libblas.a
ldconfig -p | grep -E "openblas|lapack"
Practical recommendations¶
- Default: set `OMP_NUM_THREADS` to the number of physical cores. Check with `nproc --all`.
- Large CASSCF runs (`ndets` > 1000, `ngroup = 2`): full speedup from parallelism.
- Small single-determinant systems: speedup may be modest; `edf-omp` is still correct.
- Many-group systems (`ngroup > 2`): no speedup from threading — use `edf` (serial is faster because there is no OpenMP overhead).
- Cluster nodes: request one full node (all cores) and set `OMP_NUM_THREADS=$SLURM_CPUS_ON_NODE`.
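On a SLURM cluster, the last recommendation could look like the batch script below. A sketch only: the time limit and file names are placeholders:

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive          # reserve the whole node
#SBATCH --time=04:00:00

export OMP_NUM_THREADS=$SLURM_CPUS_ON_NODE
export OPENBLAS_NUM_THREADS=1   # avoid BLAS over-subscription (see above)

./edf-omp < input.edfinp > output.edfout
```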
Example: timed run on 1 vs. 4 threads¶
cd test/
time OMP_NUM_THREADS=1 ../edf-omp < cas-ch4.edfinp > /dev/null
time OMP_NUM_THREADS=4 ../edf-omp < cas-ch4.edfinp > /dev/null
Look at the _linearsys and _splitedf lines in the timer section to isolate the
contribution of the parallel regions.
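To turn the two wall times into a speedup figure, divide them. The numbers below are toy values; substitute the `real` times reported by your own runs:

```shell
# e.g. the 1-thread run took 120.0 s of wall time and the 4-thread run 35.0 s:
awk -v t1=120.0 -v tn=35.0 'BEGIN { printf "speedup: %.2fx\n", t1 / tn }'
# -> speedup: 3.43x
```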