BabelStream benchmarks
Measure memory transfer rates to/from global device memory on GPUs. This benchmark is similar in spirit to, and based on, the STREAM benchmark [1] for CPUs. Unlike other GPU memory bandwidth benchmarks, this one does not include the PCIe transfer time. There are implementations of this benchmark in a variety of programming models. This code was previously called GPU-STREAM.
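For reference, BabelStream times five STREAM-style kernels and reports each one's bandwidth as bytes moved divided by elapsed time. A minimal Python sketch of the array operations and the bandwidth formula follows; the function names and the scalar value are illustrative assumptions, and the real benchmark runs these kernels on the device in each programming model:

```python
# Illustrative sketch of the five STREAM-style kernels BabelStream times.
# The scalar value is an assumption; the real benchmark runs on the device.
scalar = 0.4

def copy(a, b, c):   # c[i] = a[i]                 -> 2 arrays moved
    for i in range(len(a)):
        c[i] = a[i]

def mul(a, b, c):    # b[i] = scalar * c[i]        -> 2 arrays moved
    for i in range(len(c)):
        b[i] = scalar * c[i]

def add(a, b, c):    # c[i] = a[i] + b[i]          -> 3 arrays moved
    for i in range(len(a)):
        c[i] = a[i] + b[i]

def triad(a, b, c):  # a[i] = b[i] + scalar * c[i] -> 3 arrays moved
    for i in range(len(b)):
        a[i] = b[i] + scalar * c[i]

def dot(a, b):       # reduction over a[i] * b[i]  -> 2 arrays moved
    return sum(x * y for x, y in zip(a, b))

def bandwidth_mbytes_per_sec(arrays_moved, n, elem_bytes, seconds):
    """MBytes/sec as reported: bytes moved by the kernel / elapsed time."""
    return arrays_moved * n * elem_bytes / seconds / 1e6
```

For example, a Copy over 2^25 doubles (268.4 MB per array, matching the output below) moves two arrays per iteration, so the reported bandwidth is `2 * n * 8 / t / 1e6` MBytes/sec.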
Usage
From the top-level directory of the repository, you can run the benchmarks with
reframe -c benchmarks/apps/babelstream -r --tag <TAG> --system=<SYSTEM:PARTITION> -S build_locally=false -S spack_spec='babelstream +<TAG> <extra flags>'
Filtering the benchmarks
The Spack directives for babelstream can be found here
You can run individual benchmarks with the `--tag` option:
- `omp` to run the OpenMP benchmark,
- `ocl` to run the OpenCL benchmark,
- `std` to run the STD benchmark,
- `std20` to run the STD20 benchmark,
- `hip` to run the HIP benchmark,
- `cuda` to run the CUDA benchmark,
- `kokkos` to run the Kokkos benchmark,
- `sycl` to run the SYCL benchmark,
- `sycl2020` to run the SYCL2020 benchmark,
- `acc` to run the ACC benchmark,
- `raja` to run the RAJA benchmark,
- `tbb` to run the TBB benchmark,
- `thrust` to run the THRUST benchmark.
Examples:
reframe -c benchmarks/apps/babelstream -r --tag omp --system=isambard-macs:volta -S build_locally=false -S spack_spec='babelstream%gcc@9.2.0 +omp cuda_arch=70'
reframe -c benchmarks/apps/babelstream -r --tag tbb --system=isambard-macs:cascadelake -S build_locally=false -S spack_spec='babelstream@develop +tbb'
reframe -c benchmarks/apps/babelstream -r --tag cuda --system=isambard-macs:volta -S build_locally=false -S spack_spec='babelstream@develop%gcc@9.2.0 +cuda cuda_arch=70'
Setting the number of threads and MPI processes
By default, these benchmarks will use
- [`num_gpus_per_node`](https://reframe-hpc.readthedocs.io/en/stable/regression_test_api.html#reframe.core.pipeline.RegressionTest.num_gpus_per_node): 1 by default for the benchmarks requiring a GPU (e.g. CUDA, HIP).
You can override the value of this variable from the command line with the `--setvar` option, for example
reframe -c benchmarks/apps/babelstream -r --tag cuda --system=isambard-macs:volta -S build_locally=false -S spack_spec='babelstream@develop%gcc@9.2.0 +cuda cuda_arch=70' --setvar=num_gpus_per_node=2
Note: you are responsible for overriding this variable consistently, so that, for example, `num_gpus_per_node` does not exceed the number of GPUs available on each node.
Figure of merit
The figure of merit captured by these benchmarks is the bandwidth. For example, if the output of the program is
BabelStream
Version: 4.0
Implementation: OpenMP
Running kernels 100 times
Precision: double
Array size: 268.4 MB (=0.3 GB)
Total size: 805.3 MB (=0.8 GB)
Function    MBytes/sec  Min (sec)   Max         Average     
Copy        91018.241   0.00590     0.01087     0.00721     
Mul         80014.622   0.00671     0.01173     0.00837     
Add         92644.967   0.00869     0.01636     0.01121     
Triad       93484.396   0.00861     0.01416     0.01142     
Dot         114688.364  0.00468     0.01382     0.00707
the bandwidth figures in the MBytes/sec column will be captured, one per kernel (Copy, Mul, Add, Triad, Dot).
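To illustrate how such figures of merit can be extracted from the program output, here is a hedged sketch using a plain regular expression. The ReFrame tests use their own sanity and performance functions; the pattern and the helper name below are assumptions for illustration only:

```python
import re

# Hypothetical helper: extract the MBytes/sec figure for each kernel from
# BabelStream's stdout. The regex is an illustrative assumption, not the
# exact pattern used by the ReFrame benchmark.
def extract_bandwidths(output: str) -> dict:
    pattern = re.compile(
        r"^(Copy|Mul|Add|Triad|Dot)\s+([0-9]+\.[0-9]+)", re.MULTILINE
    )
    return {kernel: float(mbps) for kernel, mbps in pattern.findall(output)}

sample = """\
Function    MBytes/sec  Min (sec)   Max         Average
Copy        91018.241   0.00590     0.01087     0.00721
Dot         114688.364  0.00468     0.01382     0.00707
"""
```

Running `extract_bandwidths(sample)` yields a mapping from kernel name to bandwidth, e.g. `{"Copy": 91018.241, "Dot": 114688.364}`.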