- Authors
- Ariful Azad , Aydın Buluç , John R. Gilbert , Adam Lugowski (in collaboration with Scott Beamer).
This material is based upon work supported by the National Science Foundation under Grant No. 0709385 and by the Department of Energy, Office of Science, ASCR Contract No. DE-AC02-05CH11231. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF) and the Department of Energy (DOE). This software is released under the MIT license described here.
Introduction
The Combinatorial BLAS is an extensible distributed-memory parallel graph library offering a small but powerful set of linear algebra primitives specifically targeting graph analytics.
- The Combinatorial BLAS development influences the Graph BLAS standardization process.
- It achieves scalability via its two dimensional distribution and coarse-grained parallelism.
- For an illustrative overview, check out these slides.
- Operations among sparse matrices and vectors use arbitrary user defined semirings. Here is a semiring primer
- Check out the Frequently asked questions about CombBLAS.
Download
- Read release notes.
- The latest CMake'd tarball (version 1.5.0, Jan 2016) here. (NERSC users read this). The previous version (version 1.4.0, Jan 2014) is also available here for backwards compatibility and benchmarking.
- To create sample applications and run simple tests, all you need to do is to execute the following three commands, in the given order, inside the main directory:
- cmake .
- make
- ctest -V (you need the testinputs, see below)
- Test inputs are separately downloadable here. Extract them inside the CombBLAS_vx.x directory with the command "tar -xzvf testdata_combblas1.2.1.tgz"
- Alternatively (if cmake fails, or you just don't want to install it), you can just imitate the sample makefiles inside the ReleaseTests and Applications directories. Those sample makefiles have the following format: makefile-machine. (example: makefile-macair)
Requirements: You need a recent C++ compiler (gcc version 4.8+, Intel version 15.0+ and compatible), a compliant MPI implementation, and C++11 Standard library (libstdc++ that comes with g++ has them). The recommended tarball uses the CMake build system, but only to build the documentation and unit-tests, and to automate installation. The chances are that you're not going to use any of our sample applications "as-is", so you can just modify them or imitate their structure to write your own application by just using the header files. There are very few binary libraries to link to, and no configured header files. Like many high-performance C++ libraries, the Combinatorial BLAS is mostly templated.
Documentation: This is a reference implementation of the Combinatorial BLAS Library in C++/MPI. It is purposefully designed for distributed memory platforms though it also runs in uniprocessor and shared-memory (such as multicores) platforms. It contains efficient implementations of novel data structures/algorithms as well as reimplementations of some previously known data structures/algorithms for convenience. More details can be found in the accompanying paper [1]. One of the distinguishing features of the Combinatorial BLAS is its decoupling of parallel logic from the sequential parts of the computation, making it possible to implement new formats and plug them in without changing the rest of the library.
The implementation supports both formatted and binary I/O. The latter is faster but not human readable. Formatted I/O can read both a tuples format very similar to the Matrix Market and the Matrix Market format itself. We encourage in-memory generators for faster benchmarking. More info on I/O formats (other than the well-known Matrix Market Format) are here
The main data structure is a distributed sparse matrix ( SpParMat <IT,NT,DER> ) which HAS-A sequential sparse matrix ( SpMat <IT,NT> ) that can be implemented in various ways as long as it supports the interface of the base class (currently: SpTuples, SpCCols, SpDCCols).
For example, the standard way to declare a parallel sparse matrix A that uses 32-bit integers for indices, floats for numerical values (nonzeros), SpDCCols <int,float> for the underlying sequential matrix operations is:
- SpParMat<int, float, SpDCCols<int,float> > A;
The repetitions of int and float types inside the SpDCCols< > is a direct consequence of the static typing of C++ and is akin to some STL constructs such as vector<int, SomeAllocator<int> >. If your compiler support "auto", then you can have the compiler infer the type.
Sparse and dense vectors can be distributed either along the diagonal processor or to all processor. The latter is more space efficient and provides much better load balance for SpMSV (sparse matrix-sparse vector multiplication) but the former is simpler and perhaps faster for SpMV (sparse matrix-dense vector multiplication)
New in version 1.5:
- Fully parallel matrix market format reader (SpParMat::ParallelReadMM())
- Complete multithreading support, including SpGEMM (previously it was solely SpMV), enabled by -DTHREADED during compilation
- Experimental 3D SpGEMM (the ability to switch processor grids from 2D to 3D will have to wait for version 1.6)
- Cite [9] if you use this implementation
- cd 3DSpGEMM/, make test_mpipspgemm, and call the executable with correct parameters
- Maximal and Maximum cardinality matching algorithms on bipartite graphs
- Cite [10] for maximal cardinality and [11] for maximum cardinality matching
- cd MaximumMatching, make bpmm, and call the executable with correct parameters
- Automated MPI_Op creation from simple C++ function objects (simplifies semiring descriptions and Reduce() functions)
- FullyDistSpVec::Invert() to map from/to (integer) values to/from indices
- Many more helper functions
- Experimental CSC support for low concurrencies
- Lots of bug fixes
New in version 1.4:
- Direction optimizing breadth-first search in distributed memory (in collaboration with Scott Beamer and GAP). Please cite [8] if you use this code in your research or benchmarks (DirOptBFS.cpp).
The supported operations (a growing list) are:
- Sparse matrix-matrix multiplication on a semiring SR: PSpGEMM()
- Elementwise multiplication of sparse matrices (A .* B and A .* not(B) in Matlab): EWiseMult()
- Unary operations on nonzeros: SpParMat::Apply()
- Matrix-matrix and matrix-vector scaling (the latter scales each row/column with the same scalar of the vector)
- Reductions along row/column: SpParMat::Reduce()
- Sparse matrix-dense vector multiplication on a semiring, SpMV()
- Sparse matrix-sparse vector multiplication on a semiring, SpMV()
- Generalized matrix indexing: SpParMat::operator(const FullyDistVec & ri, const FullyDistVec & ci)
- Generalized sparse matrix assignment: SpParMat::SpAsgn (const FullyDistVec & ri, const FullyDistVec &ci, SpParMat & B)
- Numeric type conversion through conversion operators
- Elementwise operations between sparse and dense matrices: SpParMat::EWiseScale() and operator+=()
- BFS specific optimizations inside BFSFriends.h
All the binary operations can be performed on matrices with different numerical value representations. The type-traits mechanism will take care of the automatic type promotion, and automatic MPI data type determination.
Some features it uses:
- templates (for generic types, and for metaprogramming through "curiously recurring template pattern")
- operator overloading
- compositors (to avoid intermediate copying)
- standard library whenever possible
- Reference counting using shared_ptr for IITO (implemented in terms of) relationships
- As external code, it utilizes
Important Sequential classes:
- SpTuples : uses triples format to store matrices, mostly used for input/output and intermediate tasks (such as sorting)
- SpDCCols : implements Alg 1B and Alg 2 [2], holds DCSC.
- SpCCols (incomplete) : implements CSC
Important Parallel classes:
- SpParMat : distributed memory MPI implementation
Each processor locally stores its submatrix (block) as a sequential SpDCCols object
Uses a polyalgorithm for SpGEMM: For most systems this boils down to a BSP like Sparse SUMMA [3] algorithm.
- FullyDistVec : dense vector distributed to all processors
- FullyDistSpVec: : sparse vector distributed to all processors
Applications implemented using Combinatorial BLAS:
Performance results of the first two applications can be found in the design paper [1]; Graph 500 results are in a recent BFS paper [4]. The most recent sparse matrix indexing, assignment, and multiplication results can be found in [5]. Performance of filtered graph algorithms (BFS and MIS) are reported in [7]. Performance of the 3D SpGEMM algorithm can be found in [9]
A subset of test programs demonstrating how to use the library (under ReleaseTests):
Citation: Please cite the design paper [1] if you end up using the Combinatorial BLAS in your research.
- [1] Aydın Buluç and John R. Gilbert, The Combinatorial BLAS: Design, implementation, and applications . International Journal of High Performance Computing Applications (IJHPCA), 2011. Preprint , Link
- [2] Aydın Buluç and John R. Gilbert, On the Representation and Multiplication of Hypersparse Matrices . The 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2008), Miami, FL, April 14-18, 2008
- [3] Aydın Buluç and John R. Gilbert, Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication . The 37th International Conference on Parallel Processing (ICPP 2008), Portland, Oregon, USA, 2008
- [4] Aydın Buluç and Kamesh Madduri, Parallel Breadth-First Search on Distributed-Memory Systems . Supercomputing (SC'11), Seattle, USA. Extended preprint PDF
- [5] Aydın Buluç and John R. Gilbert. Parallel Sparse Matrix-Matrix Multiplication and Indexing: Implementation and Experiments . SIAM Journal of Scientific Computing, 2012. Preprint PDF
- [6] Aydın Buluç. Linear Algebraic Primitives for Computation on Large Graphs . PhD thesis, University of California, Santa Barbara, 2010. PDF
- [7] Aydın Buluç, Erika Duriakova, Armando Fox, John Gilbert, Shoaib Kamil, Adam Lugowski, Leonid Oliker, Samuel Williams. High-Productivity and High-Performance Analysis of Filtered Semantic Graphs , International Parallel and Distributed Processing Symposium (IPDPS), 2013. PDF
- [8] Scott Beamer, Aydin Buluç, Krste Asanović, and David Patterson. Distributed memory breadth-first search revisited: Enabling bottom-up search. In Workshop on Multithreaded Architectures and Applications (MTAAP), in conjunction with IPDPS. IEEE Computer Society, 2013. PDF
- [9] Ariful Azad, Grey Ballard, Aydin Buluç, James Demmel, Laura Grigori, Oded Schwartz, Sivan Toledo, and Samuel Williams. Exploiting multiple levels of parallelism in sparse matrix-matrix multiplication. SIAM Journal on Scientific Computing (SISC), 38(6):C624--C651, 2016. PDF
- [10] Ariful Azad and Aydin Buluç. Distributed-memory algorithms for maximal cardinality matching using matrix algebra. In IEEE International Conference on Cluster Computing (CLUSTER), 2015. PDF
- [11] Ariful Azad and Aydin Buluç. Distributed-memory algorithms for maximum cardinality matching in bipartite graphs. In Proceedings of the IPDPS, 2016. PDF