Go back to the the Combinatorial BLAS home page.

**Q1: **I would like to use your Combinatorial BLAS code for
some of my experiments which involve sparse matrix
multiplication. However, it is not clear how to write an output of PSpGEMM(…) function to
a file. I've tried to use "put" function of SpParMat,
but it outputs part of the matrix that corresponds to particular process and
not the whole matrix. Is there a way to do it using your code?

**A1:** Yes, SpParMat::SaveGathered(…)
will create a single file output (albeit slow) sorted with increasing row id's
when called like A.SaveGathered("product.txt").
The caveat is that "gathered" I/O in human readable form is
quite slow due to serialization on large processor counts. It should only be
used for debugging, ideally.

----

**Q2**: Does Combinatorial BLAS support in-node multithreading?

**A2:** Some primitives (Sparse SpMV, EWiseMult, Apply, Set) are
hybrid multithreaded within a socket. Read this short tutorial. We are working to
extend multithreading support to all primitives.

---

**Q3:** Reading/writing text files is
really slow for my purposes, what can I do?

**A3:** Combinatorial BLAS supports a
binary format for fast parallel I/O. Read the manual here.

---

**Q4:** Is there a preferred way to prune elements from a SpParMat according to a
predicate?

**A4:** Yes, SpParMat::Prune(…) will do it according to
a predicate. An overloaded version of the same function, SpParMat::Prune(ri,ci) will prune
all entries whose row indices are in ri and column indices are in ci

---

**Q5:** I am trying to run CombBLAS on
Windows but the MS MPI does not seem to support MPI C++ bindings.

**A5: **Combinatorial BLAS recently (version 1.3.0) switched to C-API
for all its internal MPI calls. After that, we've also compiled CombLAS on a windows machine without problems. However, we
recommend using an open source MPI for windows too, such as MPICH-2.

---

**Q6:** I would like to use Combinatorial BLAS for some parallel
sparse matrix-matrix multiplications. This works quite well, however when I try
to assign a m x 1 sparse matrix (actually a vector) to
the first column of an existing sparse matrix with SpAsgn
I get an error saying: "Matrix is too small to be splitted".
Is this because it's not possible to use SpAsgn on
vector-like matrices?

**A6: **SpAsgn internally uses a
memory efficient Mult_AnXBn_DoubleBuff
as opposed to Mult_AnXBn_Synch).
You might probably go into SpParMat<IT,NT,DER>::SpAsgn(...) and change occuranges of Mult_AnXBn_DoubleBuff to Mult_AnXBn_Synch. However, this
will likely only solve your problem for the serial case. because
ComBBLAS can not effectively 2D decompose an m x 1
matrix: each dimension should ideally be at least sqrt(p). It
is much better to represent that vector as a FullyDistSpVec.

---

**Q7: **Starting from a general sparse matrix Z, I want to
construct the symmetric matrix M: [[X'X; X'Z];[Z'X;
Z'Z]], where X is a vector of 1's. Thus the element at position (1,1) is simply
the number of columns of Z, and the vector Z'X contains the sums per column of
Z. For now, I have a working code, but it is quite sloppy because I do not find
a function for which I can easily increase the dimension of a sparse matrix or
even set an element to a specific value. Is there any function in Combinatorial
BLAS which can do this?

**A7:** Not out of the box. You don't want an
SpAsgn or any variant of it because it can't grow the
matrix. You want some sort of matrix append. How about using Find(…)
and Sparse(…)? The Matlab care of what you want
to do is:

X = ones(size(Z,2),1)

M = [X' * X, X' * Z; Z'* X, Z' * Z]

Supporting such a general concatenation efficiently might be hard to add at this point. Instead, there is a Concatenate(…) function for vectors. Armed with Concatenate(…), find(), and the sparse-like constructor, one can solve your problem. Check out the working example in ReleaseTests/FindSparse.cpp

---

**Q8:** Does CombBLAS include the API
to perform a symmetric permutation on a matrix, as explained in your SISC
paper?

**A8:** Yes it does. Check out the ReleaseTests/IndexingTiming.cpp
for an example.

---

**Q9:** How can I use small test case to see whether the
operation on matrix is correct? In other words, how do I print all the
information of a matrix with each value in matrix?

I can use PrintInfo to print basic information, but it only gives me number of rows and columns and nnz

**A9:** Our recommendation is to use SaveGathered(…) to dump the
whole matrix into a file in triples (matrix market) format. There is also an
equivalent vector version of it: FullyDistVec::SaveGathered(…)

---

**Q10:** Does CombBLAS
code run on any graph size or there is some limitation on the dimension of the
matrix A. I mean should it be a multiple of sqrt(p) where p is total
number of processors.

**A10:** No, the matrix dimension does not have to be a multiple
of sqrt(p)
but it should be bigger than sqrt(p). In other words
you can have a 5x5 matrix on 4 processors but not on 36 processors. We don't
really see the point of using more than |V|^2 processors.

---

**Q11:** My comparison results on real graph inputs revealed
something weird. In input loc-gowalla, how can 16
processors time(called time_16) and

64 processors time(called time_64) which time_64*4<time_16 which is more than linear scale?

**A11:** The complexity of the parallel algorithm drops as
sub-matrices owned by each processor gets sparser. In particular, it is
proportional to O(flops x log(ni))
where ni is the size of the intersection of the set
of nonzero columns of Aik and nonzero rows of Bkj for A*B. What might happen as p increases is that there
is a phase transition that makes ni
drop significantly for your input (for p=64, each sub-matrix will have only
~1.2 nonzeros per row or column). More details are in the SISC paper
and the references therein. Hope this makes sense. This is why I don't suggest
people use CombBLAS for small p (< 40) because it
is not on the top of its game for small number of processors.

---

**Q12:** Should the input file have nodes numbered from 1 or it
is fine if the nodes are numbered from 0?

**A12:** If you're using the human readable matrix market format
as your input, then it should be 1-indexed.

---

**Q13: **I'm wondering for breadth-first-search, under the hood
does the matrix-vector multiplication method change based on the sparsity of the frontier vector, or does the underlying
matrix-vector multiplication assume the frontier is always sparse?

**A13:** Depending on your definition of sparseness, the frontier
is almost always sparse. We use the pragmatic definition of "sparse"
in the sense that a vector is sparse if it is worth taking advantage of the sparsity in there. I'd guess, for a dense vector assumption
to be competitive, it would have to have at least 1/3 of its potential
locations nonzero. However, I might be wrong (and you're welcome to prove me
wrong). To answer your question more directly, CombBLAS
supports both dense and sparse right hand side vectors, but the specific BFS
implementation does not adapt.

---

**Q14: **Could you briefly explain the difference in your
implementations of matrix-sparse vector and matrix-dense vector multiply? For
example, is the sparse vector case a write-based approach: Every element
updates all of its neighbors (from a graph-theoretic standpoint) locations in
the output vector; and the dense vector case a read-based approach: Every
element reads some value from each of its neighbors and updates its own entry
in the resulting vector?

**A14:** Sparse matrix-sparse vector is "right hand side
vector structure" driven. In y = A*x, for each nonzero x_i,
we scale the column A(:,i)
with that and merge the scaled sparse columns results into y. The computation
boils down into merging sparse columns into one. Combinatorial
BLAS is a matrix-vector based library,
so thinking in terms of updates on single entries is probably not the right
abstraction.

Sparse matrix-dense vector is slightly different in the sense that it is driven by the matrix structure; you basically stream the matrix. The correctness of both operations are handled by a SPA-like or heap-like data structure that merges multiple intermediate values contributing to the same output location; no atomics are used.

---

**Q15: **I would like to get your opinion
on how sparse-matrix based implementations compare with more native
implementations

**A15: **Sparse matrix abstraction, like
any abstraction, will leave some performance on the table. In particular it is
prone to performing extra passes over data or creating extra temporaries (if
you've ever programmed in Matlab; this is similar).
On the other hand, sparse matrix abstraction gives you "primitives"
to implement graph "algorithms" as opposed to the algorithms
themselves. For instance, CombBLAS has sparse matrix
x sparse vector over a semiring as opposed to BFS, because now using the same
primitive one can implement MIS (maximal independent set) too, only by changing
the semiring. Or one can perform run time filtering on edges based on the
attributes, similarly by changing the semiring functions (therefore extending
functionality to semantic graphs). Indeed this is what we've done in our
upcoming IPDPS'13 paper.

---

**Q16: **Is there an effort to incorporate the bottom-up BFS of
Scott Beamer into CombBLAS?

**A16: **Yes, it is already done. Just use the dobfs executable (made from DirOptBFS.cpp).

---

**Q17: **My serial code is faster than CombBLAS
on a single core.

**A17: **I believe that. CombBLAS
targets "scalability", not optimizing the single core performance.

Examples:

- think about the 2D BFS. CombBLAS does not use a CSR like data structure because that is not memory scalable due to problems of hypersparsity in large concurrencies. Instead CombBLAS opts to use a slower (about 2x around 1000 cores) but memory scalable format called DCSC.

- think about betweenness centrality which uses sparse matrix-matrix multiply. CombBLAS doesn't use the fastest serial algorithm as its subroutine because it doesn't scale to thousands of cores. Instead it uses a outer-product algorithm that is significantly slower for p=1, but scales indefinitely.

---

**Q18:** Looking at the output of your Graph500 application, I
noticed a large number of self-edges removed. That’s very interesting.

**A18: **The duplicate edges problem is inherent to the R-MAT generator
on large scale, unless some special kind of noise is added. Check here for a
great analysis of this phenomenon: http://arxiv.org/abs/1102.5046

---

**Q19:** How are you counting the number of edges traversed in
Graph500? Is this still using the original verify.c
file provided with the reference version of the Graph500 benchmark and passing
in the parent tree?

**A19: **It is calculated by summing the
degrees of the discovered vertices using EWiseMult(…) followed by a Reduce(…). Degrees are
pre-symmetrization (original edges), so we're not
over-counting. However, we count self-loops and duplicates as mentioned in the
benchmark specs.

---

**Q20:** My computations
finishes fine but I get an “Attempting to use an MPI routine after finalizing
MPICH” afterwards.

**A20:** To avoid the
finalization error, please imitate an example such as MultTest.cpp: http://gauss.cs.ucsb.edu/~aydin/CombBLAS/html/_mult_test_8cpp_source.html

The curly brackets around the code are intentional. Since distributed objects have MPI related pointers in them, those pointers are released once the destructors are called. In C++ (at least until C++11) there isn’t a good way to call the destructor manually, so the destructor is called immediately before the program exists, which is after the MPI_Finalize. Since the MPI related objects are destructed after MPI_Finalize, you see this error. Try the curly brackets approach.

Go back to the the Combinatorial BLAS home page.