Research Narrative (2011)
I work in combinatorial scientific computing, an emerging discipline that focuses on combinatorial problems at the intersection of algorithmics and computational science.
Technological advances in hardware have enabled fields like computational biology, web analysis, and information retrieval to develop at an unprecedented rate, especially in terms of the amount of available data. Many datasets, including those arising from social, biological, chemical, financial, epidemiological, cosmological, and ecological analytics, are conveniently modeled as graphs. A graph consists of a set of vertices that represent the main entities in the dataset (such as cities, people, and proteins) and a set of edges that connect these vertices based on how they interact with each other (such as friendships, email exchanges, and chemical bonds). Domain scientists in these fields now face the challenge of performing computations on petascale graphs and datasets.
High-performance graph analysis is in its infancy, and effective mappings of graph computations to computer architectures are not well understood. This is especially true for large graphs that require a distributed-memory approach due to data size, computational costs, or both. Coincidentally, if not consequently, a scalable software stack that eases the application programmer's job does not exist for computations on graphs.
My research spans all layers of developing a software stack for graph analytics, from scalable algorithms for fundamental building blocks to interfaces both for algorithm developers and for domain scientists. My work aims to provide a solution to manycore dilemma that is facing large-scale data analysis and high-performance computing. I develop easy-to-use scalable parallel software, via concurrent research on:
- applications to identify the computational and data requirements of domain scientists,
- software to ensure horizontal and vertical composition of primitives,
- algorithms to develop primitives that scale with increasing core counts,
- high-performance computing for highly efficient execution.
A Scalable Software Stack for Graph Analysis
I envision a software stack where the bottom layer is the high-performance Combinatorial BLAS (5), which can be used both as a compilation target for the high level interface, and a stand-alone library of basic kernels for graph algorithm developers. The high level interface is the user-friendly Knowledge Discovery Toolbox (KDT) (15), to be primarily used by domain scientists. Both Combinatorial BLAS and KDT are evolving systems that will serve as tools for further developments in large-scale graph analysis.The Combinatorial BLAS: During my doctoral studies, I proposed the Combinatorial BLAS as a standard for combinatorial computational kernels. It offers a small but powerful set of linear algebraic kernels that can be used as building blocks for the most common graph-analytic algorithms. Graph abstractions can be built on top of its sparse matrices, taking advantage of its existing best practices for handling parallelism in sparse linear algebra. This is not to claim that linear algebraic primitives are the only primitives needed to perform graph analysis and data mining; they are, however, general enough to be widely useful and compact enough to be heavily optimized.
I developed a high-performance distributed-memory graph library (11) as a proof-of-concept realization of the underlying principles of the Combinatorial BLAS.
It includes generalized sparse matrix-matrix multiplication, sparse matrix indexing and assignment, sparse matrix-vector multiplications, and elementwise sparse matrix operations.
These operations support mixed mode arithmetic (e.g. multiplication of a boolean adjacency matrix by an integer vector of frontier vertices translates into a breadth-first search step).
The operations can also be performed on any user defined semiring, such as the tropical (min,+)-semiring that is prevalent in shortest path computations. The complete set of primitives to be included in the Combinatorial BLAS,
however, is intentionally left unfinalized. The list is likely to evolve as new applications become popular, making the library an application driven set of tools instead of a solution looking for problems. Combinatorial BLAS is used for rapidly implementing graph algorithms such as betweenness centrality (finds influential entities in social networks) and Markov clustering
(partitions protein-interaction graphs).
Knowledge Discovery Toolbox (KDT): KDT was born when the need for a higher level library, both in level of abstraction and in programming paradigm, became evident through interactions with potential users. The target audience of KDT has a good understanding of graph semantics, but not the high-performance implementations of graph operations. The primary KDT abstractions are different from Combinatorial BLAS abstractions. KDT exposes graph abstractions such as directed graphs, and graph operations such as ranking vertices, clustering, and finding neighbors within hops of a set of vertices; the underlying linear algebraic implementation of the Combinatorial BLAS is not readily visible. This shift in abstractions between the linear-algebra worldview and the graph worldview is one of the primary contributions of KDT. It creates usability for domain experts while retaining performance and customizability.
KDT uses high-performance kernels from the Combinatorial BLAS; and exposes its higher-level graph algorithms in Python, a dynamically typed language that many domain scientist are familiar with. I am a strong contributor to the design and development of the KDT project (13). A example KDT workflow is as follows. The user first finds the largest connected component of the graph; then divides this giant component into clusters of closely-related vertices; contracts the clusters into supervertices; and finally visualizes this graph of supervertices to perform a detailed analysis.
Sparse Matrix and Graph Primitives
I developed new parallel algorithms for commonly used sparse matrix and graph primitives. I view every one of them as both a graph primitive and a sparse matrix primitive, following the duality between sparse matrices and graphs (8).Sparse matrix-sparse matrix multiplication and applications: Generalized sparse matrix-sparse matrix multiplication (SpGEMM) is a key primitive for many high performance graph algorithms as well as some linear solvers such as algebraic multigrid. With John Gilbert, I developed the first parallel algorithm for SpGEMM that uses a two-dimensional decomposition of sparse matrices (3). As a result of the manycore revolution, supercomputers with close to a million cores will soon be available. Applications can not scale to these concurrencies unless they restrict the number of communicating processors. Our 2D SpGEMM algorithm scales beyond thousands of cores, presenting both a compelling case study and an actual plug-in primitive that can be used in various applications. I also showed that traditional data structures and algorithms are asymptotically too wasteful for storing and multiplying submatrices that arise after the 2D decomposition (4). This is because the local submatrices are hypersparse, meaning that the ratio of nonzeros to dimension is asymptotically zero.
I used SpGEMM in our algebraic implementation of the SSCA#2 betweenness centrality benchmark (1). The inner loop of betweenness centrality on undirected graphs is a breadth-first search from multiple starting vertices. If is the transpose of the graph's adjacency matrix and is an indicator matrix with a column for each start vertex, then corresponds to one step of parallel breadth-first search. The 2D block decomposition of the sparse arrays invisibly encapsulates three levels of parallelism in this algorithm. First, decomposing the columns of corresponds to doing multiple simultaneous breadth-first searches in parallel. Second, decomposing the rows of and the columns of corresponds to parallelism across the frontier vertices of each breadth-first search. Third, decomposing the rows of corresponds to parallelism across the incident edges of individual high-degree frontier vertices. The algorithm yields good scalability over a thousand cores on the older TACC Lonestar cluster, where it achieves up to 400 MTEPS (million edges traversed per second) on an RMAT power-law graph with 8 million vertices (5).
Indexing sparse arrays in parallel can be used to coarsen grids and extract subgraphs. In recent work with John Gilbert, I demonstrated that our sparse SUMMA algorithm is also flexible enough
to yield elegant parallel algorithms for indexing and assignment (6). The resulting indexing algorithm shows scaling up to thousands of processors in a variety of test scenarios. Sparse matrix indexing and assignment are crucial access and updating operations for irregular discrete data structures that were notoriously hard to scale to large numbers of processors.
Sparse matrix-dense vector multiplication: Sparse matrix-dense vector multiplication sustains a low fraction of peak performance, making it a frequent bottleneck in scientific computing. Exposing sufficient parallelism becomes a key factor in performance for multicore platforms. In collaboration with Jeremy Fineman, Matteo Frigo, John Gilbert, and Charles Leiserson (2), I targeted sparse matrix-vector and sparse matrix-transpose-vector multiplication operations on multicore chips. In our widely cited paper, we proposed a novel hierarchical data structure called compressed sparse blocks (CSB) that does not favor rows over columns or vice-versa, and attains high performance for both operations. The pattern of nonzeros in the input varies enormously among applications, which ultimately affects the performance. Ours is the first algorithm that guarantees high parallelism for every possible nonzero pattern in the input.
Later, with Samuel Williams, Leonid Oliker, and James Demmel (10), I extended this data structure to incorporate a novel technique called bitmasked register blocks.
Conventional register blocking reorganizes the matrix into small dense blocks, filling zero elements with explicit zeros. This can potentially offset any performance gains attained by
reduced index overhead. By augmenting each register block with a bitmask that notes where the nonzeros are, the new scheme achieves performance gains even when
a natural block structure does not exist. I also gave a parallel algorithm for symmetric matrices that avoids data synchronization hazards. The performance of the enhanced CSB
algorithms were comparable to or better than previous approaches that involved numerous low-level optimizations.
Breadth-first search (a.k.a. sparse matrix-sparse vector multiplication): It is challenging to attain high performance on graph algorithms using distributed memory architectures. Breadth-first search (BFS) is a key subroutine in several graph algorithms, and has recently been chosen as the first Graph500 benchmark for ranking supercomputers based on their performance on data intensive applications (12). In joint work with Kamesh Madduri (9), I explored the design space of distributed memory BFS using hybrid programming and sparse linear algebra formalism, achieving substantial speedups over the reference implementations on huge datasets of billions of vertices. We analyzed the impact on communication of the 1D graph and 2D sparse matrix representations. This research led to stand-alone software that consistently attains top spots for Berkeley Lab machines on the Graph500 list, and the inclusion of a general sparse matrix-sparse vector multiplication primitive in Combinatorial BLAS, which corresponds to expanding the frontier of the graph.
Other Research
With John Gilbert and Ceren Budak (7), I studied the all-pairs shortest paths problem on graphics processing units (GPUs). By using an unorthodox blocked recursive algorithm together with a highly optimized matrix-matrix multiplication, we achieved up to 480 times speedup over a standard code running on a single CPU. This work became one of the first to suggest that GPU programming is not merely porting CPU algorithms to the GPU but rather that application developers may need to completely rethink their algorithms. Another conclusion of our study was that carefully chosen and optimized primitives, such as the ones found in Combinatorial BLAS, would be key to achieve high performance.