Performance — edf-omp¶
edf vs. edf-omp¶
Both binaries are compiled from the same source with the same flags
(gfortran -fopenmp -fbounds-check -O). The only difference is in the link step:
| | edf | edf-omp |
|---|---|---|
| Link | -static -static-libgcc + bundled .a files | Dynamic system LAPACK/BLAS |
| OpenMP | Compiled in but non-functional | Active |
| Portability | Self-contained, no runtime deps | Requires system LAPACK/BLAS + libgomp |
| Threading | Always single-core | Multi-core via OMP_NUM_THREADS |
The Makefile comment (lines 4–5) explains the reason: with -static -static-libgcc the
OpenMP runtime library (libgomp) is linked statically in a way that prevents the
user-level !$omp directives from running multi-threaded. Only edf-omp (dynamic link)
activates threading.
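A quick way to confirm which binary has a usable OpenMP runtime is to inspect its dynamic dependencies. This is a sketch that assumes both binaries sit in the current directory:

```shell
# edf-omp is dynamically linked, so libgomp should show up as a dependency:
ldd ./edf-omp | grep gomp

# the statically linked edf reports no dynamic dependencies at all:
ldd ./edf
```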
Building edf-omp¶
make clean && make edf-omp
Requires the system LAPACK and BLAS static libraries at
/usr/lib/x86_64-linux-gnu/lapack/liblapack.a and
/usr/lib/x86_64-linux-gnu/blas/libblas.a. See Installation for prerequisites.
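Before building, it may help to verify that the static archives are actually present. A sketch using the paths above; the package names in the message are the usual Debian/Ubuntu ones and are an assumption:

```shell
for lib in /usr/lib/x86_64-linux-gnu/lapack/liblapack.a \
           /usr/lib/x86_64-linux-gnu/blas/libblas.a; do
  if [ -f "$lib" ]; then
    echo "found:   $lib"
  else
    echo "missing: $lib (try installing liblapack-dev / libblas-dev)"
  fi
done
```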
Setting the thread count¶
OMP_NUM_THREADS=8 ./edf-omp < input.edfinp > output.edfout
Set OMP_NUM_THREADS to the number of physical cores (not hyperthreads) for
compute-bound workloads. Hyperthreading rarely helps for floating-point-heavy loops.
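On Linux, `nproc` reports logical CPUs; counting distinct (core, socket) pairs gives the physical-core count to use here. A sketch assuming the util-linux `lscpu` is available:

```shell
# logical CPUs (hyperthreads included):
nproc --all

# physical cores = number of distinct (core, socket) pairs:
lscpu -p=CORE,SOCKET | grep -v '^#' | sort -u | wc -l
```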
To check the thread count being used, look at the timer section at the end of the output:
the _linearsys and _splitedf timers reflect wall time for the parallel regions.
What is parallelized¶
The !$omp parallel directives in binedf.f (the 2-group binomial recurrence engine)
create three parallel regions:
| Region | binedf.f line | Schedule | Parallelizes |
|---|---|---|---|
| Alpha _linearsys | 649 | dynamic,4 | Loop over alpha spin-configs (triangular; npa iterations) |
| Beta _linearsys | 804 | dynamic,4 | Loop over beta spin-configs (triangular; npb iterations) |
| _splitedf | 938 | dynamic,1 | Outer loop over alpha config pairs (m1=1,npa) |
When does this help?¶
The binedf.f engine is used automatically when ngroup = 2 (the two-group case);
the RECUR keyword can force its use for larger systems as well. For ngroup > 2, the
other engine files (calcedfd.f, rcalcedf.f, xcalcedf.f, etc.) handle the
computation; these are not yet parallelized.
Expected speedup (2-group case):
- Near-linear for large CASSCF wavefunctions with many determinants (large `npa`, `npb`).
- Modest for small single-determinant systems, where overhead dominates.
- The `_linearsys` regions use `schedule(dynamic,4)` to absorb the triangular loop imbalance; the `_splitedf` region uses `schedule(dynamic,1)` because inner-loop costs vary with the configuration pair.
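The effect of dynamic scheduling on a triangular loop can be illustrated with a toy cost model (purely illustrative, not taken from the EDF source): if iteration m1 costs m1 units, a static split into contiguous blocks overloads the last thread, while handing out small chunks round-robin, as a dynamic schedule effectively does, stays close to balanced:

```shell
# Toy model: 1000 triangular iterations over 4 threads, iteration i costs i units.
awk 'BEGIN {
  npa = 1000; T = 4; chunk = 4
  # static: contiguous blocks of npa/T iterations per thread
  for (i = 1; i <= npa; i++) { t = int((i - 1) / (npa / T)); s[t] += i }
  # round-robin chunks of 4 approximate an ideal dynamic schedule
  for (i = 1; i <= npa; i++) { t = int((i - 1) / chunk) % T; d[t] += i }
  smax = 0; dmax = 0
  for (t = 0; t < T; t++) { if (s[t] > smax) smax = s[t]; if (d[t] > dmax) dmax = d[t] }
  printf "static worst thread: %d units\ndynamic worst thread: %d units\n", smax, dmax
}'
# -> static worst thread: 218875 units
# -> dynamic worst thread: 126630 units
```

The total work is 500500 units, so the ideal per-thread load is 125125: the static split's busiest thread carries about 75% more than ideal, while the chunked schedule lands within about 1% of it.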
Over-subscription warning¶
If the system BLAS (OpenBLAS, MKL) is compiled with internal threading, each of the N OpenMP threads may spawn M BLAS threads, giving N×M system threads on a machine with N cores. This severely degrades performance.
To prevent over-subscription when using edf-omp:
export OPENBLAS_NUM_THREADS=1 # for OpenBLAS
export MKL_NUM_THREADS=1 # for Intel MKL
export BLIS_NUM_THREADS=1 # for BLIS
Check which BLAS is installed:
ls -la /usr/lib/x86_64-linux-gnu/blas/libblas.a
ldconfig -p | grep -E "openblas|lapack"
Practical recommendations¶
- Default: set `OMP_NUM_THREADS` to the number of physical cores. Check with `nproc --all`.
- Large CASSCF runs (`ndets` > 1000, `ngroup = 2`): full speedup from parallelism.
- Small single-determinant systems: speedup may be modest; `edf-omp` is still correct.
- Many-group systems (`ngroup > 2`): no speedup from threading — use `edf` (serial is faster because there is no OpenMP overhead).
- Cluster nodes: request one full node (all cores) and set `OMP_NUM_THREADS=$SLURM_CPUS_ON_NODE`.
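On a SLURM cluster, the last recommendation could look like the batch script below. A sketch only: the time limit and file names are placeholders:

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --exclusive          # reserve the whole node
#SBATCH --time=04:00:00

export OMP_NUM_THREADS=$SLURM_CPUS_ON_NODE
export OPENBLAS_NUM_THREADS=1   # avoid BLAS over-subscription (see above)

./edf-omp < input.edfinp > output.edfout
```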
Example: timed run on 1 vs. 4 threads¶
cd test/
time OMP_NUM_THREADS=1 ../edf-omp < cas-ch4.edfinp > /dev/null
time OMP_NUM_THREADS=4 ../edf-omp < cas-ch4.edfinp > /dev/null
Look at the _linearsys and _splitedf lines in the timer section to isolate the
contribution of the parallel regions.
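To turn the two wall times into a speedup figure, divide them. The numbers below are toy values; substitute the `real` times reported by your own runs:

```shell
# e.g. the 1-thread run took 120.0 s of wall time and the 4-thread run 35.0 s:
awk -v t1=120.0 -v tn=35.0 'BEGIN { printf "speedup: %.2fx\n", t1 / tn }'
# -> speedup: 3.43x
```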