This material is based upon work supported by the National Science
Foundation under Grant No. 0709385. Any opinions, findings and
conclusions or recommendations expressed in this material are those of
the author(s) and do not necessarily reflect the views of the National
Science Foundation (NSF)
The latest CMake'd tarball here. (NERSC users read this)
To create sample applications and run simple tests, all you need to do is to execute the following three commands, in the given order, inside the PSpGEMM-R1 directory:
cmake .
make
ctest -V (you need the testinputs, see below)
Test inputs are separately downloadable here. Extract them inside the PSpGEMM-R1 directory with the command "tar -xzvf testdata.tar.gz"
Alternatively (if cmake fails, or you just don't want to install it), you can just imitate the sample makefiles inside the ReleaseTests and Applications directories. Those sample makefiles have the following format: makefile-machine. (example: makefile-neumann)
Requirements: You need a recent C++ compiler (g++ version 4.2 or higher - and compatible), a compliant MPI-2 implementation, and a TR1 library (libstdc++ that comes with g++ has them). If not, you can use the boost library and pass the -DNOTR1 option to the compiler (cmake will automatically do it for you); it will work if you just add boost's path to $INCADD in the makefile. The recommended tarball uses the CMake build system, but only to build the documentation and unit-tests, and to automate installation. The chances are that you're not going to use any of our sample applications "as-is", so you can just modify them or imitate their structure to write your own application by just using the header files. There are very few binary libraries to link to, and no configured header files. Like many high-performance C++ libraries, the Combinatorial BLAS is mostly templated.
Documentation: This is a reference implementation of the Combinatorial BLAS Library in C++/MPI. It is purposefully designed for distributed memory platforms though it also runs in uniprocessor and shared-memory (such as multicores) platforms. It contains efficient implementations of novel data structures/algorithms as well as reimplementations of some previously known data structures/algorithms for convenience. More details can be found in Chapter 4 of my thesis [1].
The main data structure is a distributed sparse matrix ( SpParMat <IT,NT,DER> ) which HAS-A sequential sparse matrix ( SpMat <IT,NT> ) that can be implemented in various ways as long as it supports the interface of the base class (currently: SpTuples, SpCCols, SpDCCols).
For example, the standard way to declare a parallel sparse matrix A that uses 32-bit integers for indices, floats for numerical values (nonzeros), SpDCCols <int,float> for the underlying sequential matrix operations is:
The repetitions of int and float types inside the SpDCCols< > is a direct consequence of the static typing of C++ and is akin to some STL constructs such as vector<int, SomeAllocator<int> >
The supported operations (a growing list) are:
Sparse matrix-matrix multiplication on a semiring SR: Mult_AnXBn_Synch(), and other variants
Elementwise multiplication of sparse matrices (A .* B and A .* not(B) in Matlab): EWiseMult()
Numeric type conversion through conversion operators
Elementwise operations between sparse and dense matrices: SpParMat::EWiseScale() and operator+=()
All the binary operations can be performed on matrices with different numerical value representations. The type-traits mechanism will take care of the automatic type promotion. Of course, you have to declare the return value type appropriately (until C++0x is out, which has auto )
Some features it uses:
templates (for generic types, and for metaprogramming through "curiously recurring template pattern")
operator overloading
compositors (to avoid intermediate copying)
boost library or TR1 whenever necessary (mostly for shared_ptr and tuple)
standard library whenever possible
Reference counting using shared_ptr for IITO (implemented in terms of) relationships
MPI-2 one-sided operations
As external code, it utilizes sequence heaps of Peter Sanders .
Sequential classes:
SpTuples : uses triples format to store matrices, mostly used for input/output and intermediate tasks (such as sorting)
SpCCols : multiplication is similar to Matlab's, holds CSC.
SpParMat : distributed memory MPI implementation
Each processor locally stores its submatrix (block) as a sequential SpDCCols object
Uses a polyalgorithm for SpGEMM.
If robust MPI-2 support is not available, then it reverts back to a less scalable synchronous algorithm that is based on SUMMA [3]
Otherwise, it uses an asyncronous algorithm based on one sided communication. This performs best on an interconnect with RDMA support
SpThreaded (not included) : shared memory implementation. Uses Boost.Threads for multithreading.
Uses a logical 2D block decomposition of sparse matrices.
Asyncronous (to migitate the severe load balancing problem)
Lock-free (since it relies on the owner computes rule, i.e. C_{ij} is updated by only P_{ij})
Applications implemented using Combinatorial BLAS:
BetwCent.cpp : Betweenness centrality computation on directed, unweighted graphs. Download sample input here .
MCL.cpp : An implementation of the MCL graph clustering algorithm.
Performance results of both applications can be found in Chapter 5 of my thesis [1].
Test programs demonstrating how to use the library:
Citation: Please cite my thesis [1] if you end up using the Combinatorial BLAS in your research.
[1] Aydin Buluc. Linear Algebraic Primitives for Computation on Large Graphs . PhD thesis, University of California, Santa Barbara, 2010. PDF
[2] Aydin Buluc and John R. Gilbert, On the Representation and Multiplication of Hypersparse Matrices . The 22nd IEEE International Parallel and Distributed Processing Symposium (IPDPS 2008), Miami, FL, April 14-18, 2008
[3] Aydin Buluc and John R. Gilbert, Challenges and Advances in Parallel Sparse Matrix-Matrix Multiplication . The 37th International Conference on Parallel Processing (ICPP 2008), Portland, Oregon, USA, 2008