Combinatorial BLAS Library (MPI reference implementation)
Ariful Azad , Aydın Buluç , John R. Gilbert , Adam Lugowski (in collaboration with Scott Beamer).

This material is based upon work supported by the National Science Foundation under Grant No. 0709385 and by the Department of Energy, Office of Science, ASCR Contract No. DE-AC02-05CH11231. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF) and the Department of Energy (DOE). This software is released under the MIT license described here.


The Combinatorial BLAS is an extensible distributed-memory parallel graph library offering a small but powerful set of linear algebra primitives specifically targeting graph analytics.

  • The Combinatorial BLAS development influences the Graph BLAS standardization process.
  • It achieves scalability via its two dimensional distribution and coarse-grained parallelism.
  • For an illustrative overview, check out these slides.
  • Operations among sparse matrices and vectors use arbitrary user defined semirings. Here is a semiring primer
  • Check out the Frequently asked questions about CombBLAS.


  • Read release notes.
  • The latest CMake'd tarball (version 1.6.0, Oct 2017) here. (NERSC users read this). The previous version (version 1.5.0, Jan 2016) is also available here for backwards compatibility and benchmarking.
    • To create sample applications and run simple tests, all you need to do is to execute the following three commands, in the given order, inside the main directory:
      • cmake .
      • make
      • ctest -V (you need the testinputs, see below)
    • Test inputs are separately downloadable here. Extract them inside the CombBLAS_vx.x directory with the command "tar -xzvf testdata_combblas1.2.1.tgz"
  • Alternatively (if cmake fails, or you just don't want to install it), you can just imitate the sample makefiles inside the ReleaseTests and Applications directories. Those sample makefiles have the following format: makefile-machine. (example: makefile-macair)
  • The CMake now automatically compiles for hybrid MPI+OpenMP mode because almost all expensive primitives are now multithreaded. Example makefiles are also multithreaded for many cases.

Requirements: You need a recent C++ compiler (gcc version 4.8+, Intel version 15.0+ and compatible), a compliant MPI implementation, and C++11 Standard library (libstdc++ that comes with g++ has them). The recommended tarball uses the CMake build system, but only to build the documentation and unit-tests, and to automate installation. The chances are that you're not going to use any of our sample applications "as-is", so you can just modify them or imitate their structure to write your own application by just using the header files. There are very few binary libraries to link to, and no configured header files. Like many high-performance C++ libraries, the Combinatorial BLAS is mostly templated.

Documentation: This is a reference implementation of the Combinatorial BLAS Library in C++/MPI. It is purposefully designed for distributed memory platforms though it also runs in uniprocessor and shared-memory (such as multicores) platforms. It contains efficient implementations of novel data structures/algorithms as well as reimplementations of some previously known data structures/algorithms for convenience. More details can be found in the accompanying paper [1]. One of the distinguishing features of the Combinatorial BLAS is its decoupling of parallel logic from the sequential parts of the computation, making it possible to implement new formats and plug them in without changing the rest of the library.

The implementation supports both formatted and binary I/O. The latter is faster but not human readable. Formatted I/O can read both a tuples format very similar to the Matrix Market and the Matrix Market format itself. We encourage in-memory generators for faster benchmarking. More info on I/O formats (other than the well-known Matrix Market Format) are here

The main data structure is a distributed sparse matrix ( SpParMat <IT,NT,DER> ) which HAS-A sequential sparse matrix ( SpMat <IT,NT> ) that can be implemented in various ways as long as it supports the interface of the base class (currently: SpTuples, SpCCols, SpDCCols).

For example, the standard way to declare a parallel sparse matrix A that uses 32-bit integers for indices, floats for numerical values (nonzeros), SpDCCols <int,float> for the underlying sequential matrix operations is:

  • SpParMat<int, float, SpDCCols<int,float> > A;

The repetitions of int and float types inside the SpDCCols< > is a direct consequence of the static typing of C++ and is akin to some STL constructs such as vector<int, SomeAllocator<int> >. If your compiler support "auto", then you can have the compiler infer the type.

Sparse and dense vectors are distributed along all processors. This is very space efficient and provides good load balance for SpMSV (sparse matrix-sparse vector multiplication).

New in version 1.6:

  • In-node multithreading enabled for many expensive operations.
  • Fully parallel text-file reader for vectors (FullyDistSpVec::ParallelReadMM() and FullyDistVec::ParallelReadMM())
  • Fully parallel text-file writer for vectors (FullyDistSpVec::ParallelWrite () and FullyDistVec::ParallelWrite())
  • Reverse Cuthill-McKee (RCM) ordering implementation. Please cite [12] if you use this implementation
  • Novel multithreaded SpGEMM and SpMV (with sparse vectors) algorithms are integrated with the rest of CombBLAS.
    • For benchmarking multithreaded SpMV with sparse vectors, go to Applications/SpMSpV-IPDPS2017 directory and use the code there.
    • Please cite [13] if you use the new multithreaded SpMV with sparse vectors.
  • Extended CSC support
  • Previously deprecated SpParVec and DenseParVec (that were distributed to diagonal processors only) classes are removed.
  • Lots of more bug fixes

New in version 1.5:

  • Fully parallel matrix market format reader (SpParMat::ParallelReadMM())
  • Complete multithreading support, including SpGEMM (previously it was solely SpMV), enabled by -DTHREADED during compilation
  • Experimental 3D SpGEMM (the ability to switch processor grids from 2D to 3D will have to wait for version 1.6)
    • Cite [9] if you use this implementation
    • cd 3DSpGEMM/, make test_mpipspgemm, and call the executable with correct parameters
  • Maximal and Maximum cardinality matching algorithms on bipartite graphs
    • Cite [10] for maximal cardinality and [11] for maximum cardinality matching
    • cd MaximumMatching, make bpmm, and call the executable with correct parameters
  • Automated MPI_Op creation from simple C++ function objects (simplifies semiring descriptions and Reduce() functions)
  • FullyDistSpVec::Invert() to map from/to (integer) values to/from indices
  • Many more helper functions
  • Experimental CSC support for low concurrencies
  • Lots of bug fixes

New in version 1.4:

  • Direction optimizing breadth-first search in distributed memory (in collaboration with Scott Beamer and GAP). Please cite [8] if you use this code in your research or benchmarks (DirOptBFS.cpp).

The supported operations (a growing list) are:

  • Sparse matrix-matrix multiplication on a semiring SR: PSpGEMM()
  • Elementwise multiplication of sparse matrices (A .* B and A .* not(B) in Matlab): EWiseMult()
  • Unary operations on nonzeros: SpParMat::Apply()
  • Matrix-matrix and matrix-vector scaling (the latter scales each row/column with the same scalar of the vector)
  • Reductions along row/column: SpParMat::Reduce()
  • Sparse matrix-dense vector multiplication on a semiring, SpMV()
  • Sparse matrix-sparse vector multiplication on a semiring, SpMV()
  • Generalized matrix indexing: SpParMat::operator(const FullyDistVec & ri, const FullyDistVec & ci)
  • Generalized sparse matrix assignment: SpParMat::SpAsgn (const FullyDistVec & ri, const FullyDistVec &ci, SpParMat & B)
  • Numeric type conversion through conversion operators
  • Elementwise operations between sparse and dense matrices: SpParMat::EWiseScale() and operator+=()
  • BFS specific optimizations inside BFSFriends.h

All the binary operations can be performed on matrices with different numerical value representations. The type-traits mechanism will take care of the automatic type promotion, and automatic MPI data type determination.

Some features it uses:

  • templates (for generic types, and for metaprogramming through "curiously recurring template pattern")
  • operator overloading
  • compositors (to avoid intermediate copying)
  • standard library whenever possible
  • Reference counting using shared_ptr for IITO (implemented in terms of) relationships
  • As external code, it utilizes

Important Sequential classes:

  • SpTuples : uses triples format to store matrices, mostly used for input/output and intermediate tasks (such as sorting)
  • SpDCCols : implements Alg 1B and Alg 2 [2], holds DCSC.
  • SpCCols : implements CSC

Important Parallel classes:

  • SpParMat : distributed memory MPI implementation
    Each processor locally stores its submatrix (block) as a sequential SpDCCols object
    Uses a polyalgorithm for SpGEMM: For most systems this boils down to a BSP like Sparse SUMMA [3] algorithm.
  • FullyDistVec : dense vector distributed to all processors
  • FullyDistSpVec: : sparse vector distributed to all processors

Applications implemented using Combinatorial BLAS:

Performance results of the first two applications can be found in the design paper [1]; Graph 500 results are in a recent BFS paper [4]. The most recent sparse matrix indexing, assignment, and multiplication results can be found in [5]. Performance of filtered graph algorithms (BFS and MIS) are reported in [7]. Performance of the 3D SpGEMM algorithm can be found in [9]

A subset of test programs demonstrating how to use the library (under ReleaseTests):

Citation: Please cite the design paper [1] if you end up using the Combinatorial BLAS in your research.

  • [1] Aydın Buluç and John R. Gilbert, The Combinatorial BLAS: Design, implementation, and applications . International Journal of High Performance Computing Applications (IJHPCA), 2011. Preprint , Link
  • [2] Aydın Buluç and John R. Gilbert, On the Representation and Multiplication of Hypersparse Matrices . The 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2008), Miami, FL, April 14-18, 2008
  • [3] Aydın Buluç and John R. Gilbert, Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication . The 37th International Conference on Parallel Processing (ICPP 2008), Portland, Oregon, USA, 2008
  • [4] Aydın Buluç and Kamesh Madduri, Parallel Breadth-First Search on Distributed-Memory Systems . Supercomputing (SC'11), Seattle, USA. Extended preprint PDF
  • [5] Aydın Buluç and John R. Gilbert. Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments . SIAM Journal of Scientific Computing, 2012. Preprint PDF
  • [6] Aydın Buluç. Linear Algebraic Primitives for Computation on Large Graphs . PhD thesis, University of California, Santa Barbara, 2010. PDF
  • [7] Aydın Buluç, Erika Duriakova, Armando Fox, John Gilbert, Shoaib Kamil, Adam Lugowski, Leonid Oliker, Samuel Williams. High-Productivity and High-Performance Analysis of Filtered Semantic Graphs , International Parallel and Distributed Processing Symposium (IPDPS), 2013. PDF
  • [8] Scott Beamer, Aydin Buluç, Krste Asanović, and David Patterson. Distributed memory breadth-first search revisited: Enabling bottom-up search. In Workshop on Multithreaded Architectures and Applications (MTAAP), in conjunction with IPDPS. IEEE Computer Society, 2013. PDF
  • [9] Ariful Azad, Grey Ballard, Aydin Buluç, James Demmel, Laura Grigori, Oded Schwartz, Sivan Toledo, and Samuel Williams. Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication. SIAM Journal on Scientific Computing (SISC), 38(6):C624–C651, 2016. PDF
  • [10] Ariful Azad and Aydin Buluç. Distributed-memory algorithms for maximal cardinality matching using matrix algebra. In IEEE International Conference on Cluster Computing (CLUSTER), 2015. PDF
  • [11] Ariful Azad and Aydin Buluc. Distributed-memory algorithms for maximum cardinality matching in bipartite graphs. In Proceedings of the IPDPS, 2016. PDF
  • [12] Ariful Azad, Mathias Jacquelin, Aydin Buluç, and Esmond G. Ng. The reverse Cuthill-McKee algorithm in distributed-memory. In Proceedings of the IPDPS, 2017. PDF
  • [13] Ariful Azad and Aydin Buluç. A work-efficient parallel sparse matrix-sparse vector multiplication algorithm. In Proceedings of the IPDPS, 2017. PDF