Sorry, I haven't used that particular package. However, I've used ScaLAPACK in the past with some very good results. If you don't get anywhere with SuperLU_DIST, you might want to give it a try. < http://www.netlib.org/scalapack/ >
I am in a building with a large number of computational scientists, many of whom have tried SuperLU. Their assessment has always been that SuperLU just doesn't get the job done. It is slow, it requires a large amount of work space, etc. Perhaps this has changed recently, but that was the experience a few years ago.
These people tend to use MUMPS instead, with mixed results: MUMPS gives good performance, but it sometimes mysteriously fails to return an answer, and it uses a lot of memory.
If you happen to be working with hp adaptive FEM, then I may be able to get you in contact with a package that works very well for that community.
BTW, ScaLAPACK is no longer the state-of-the-art. Many people are now switching to a package called Elemental.
On the sparse solver side there are two key libraries, PETSc and Trilinos, that are widely used. PETSc has adapters for all kinds of direct solvers; the one I am familiar with is UMFPACK, and it also supports LUSOL and MUMPS.
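To illustrate what those PETSc adapters buy you: the direct-solver backend can be swapped at run time via command-line options, with no code changes. A hedged sketch (the program name ./app is hypothetical; the option names are from the PETSc 3.x series):

```shell
# Run a PETSc application on 4 processes, using MUMPS as the
# parallel LU backend instead of PETSc's built-in factorization.
mpiexec -n 4 ./app -ksp_type preonly -pc_type lu \
    -pc_factor_mat_solver_package mumps
# Swapping in another backend (e.g. SuperLU_DIST or UMFPACK) is just
# a matter of changing the option value, if PETSc was built with it.
# Note: newer PETSc releases renamed this option to
# -pc_factor_mat_solver_type.
```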
If you are working on more modern software then environments like MTL4 are very powerful. Parallel MTL4 is a commercial product, but I am sure that Peter Gottschling at Simunova would be happy to work with you. MTL4 also has a GPU-based solver in the making.
Finally, if you are modeling, then starting with an environment like FEniCS would give you a head start: FEniCS integrates all of these sparse solver packages, so you can pick and choose and experiment with the performance of any of them.
http://crd-legacy.lbl.gov/~xiaoye/SuperLU/SparseDirectSurvey.pdf contains a list of distributed direct solvers in Table 1. It probably has to be taken with a grain of salt, as the author is responsible for SuperLU. There are also references to each solver in there.
Depending on how large your linear system is, parallel iterative solvers might be a better choice for solving it in a scalable manner.
1. Frankly speaking, I have no experience with the SuperLU package. But, in general, I suppose that direct solvers cannot provide considerable scalability in parallel mode, because the solution depends globally on the right-hand side. So, information about the right-hand side value at a particular point must spread over all processors, which requires an increasing number of exchanges as the number of processors grows. As an example, consider a tridiagonal system: usually it can be solved by Gaussian elimination, so-called "marching", in O(N) operations, but this approach cannot be parallelized in principle... I think the iterative methods are more suitable for sparse distributed systems.
2. The method suitable for a particular system (or type of system) depends considerably on the properties of that system, such as symmetry, diagonal dominance, whether the off-diagonal elements have the same or different signs, etc. These characteristics are usually determined by the kind of problem the linear system arises from...
3. The people here suggested the MUMPS package. In addition, I can recommend Aztec (now known as AztecOO); see http://trilinos.sandia.gov. It is a C/C++ code, while MUMPS (at least two or three years ago, when we chose a package for our purposes) was a Fortran code.
This actually brings up the first questions that should always be asked: (1) are you sure you should use a sparse direct method; (2) are you sure your problem is large enough to worry about using a distributed-memory architecture for its solution; (3) how much time do you have on your hands? An out-of-core solver can solve a very large problem on a small machine.
I am often asked about parallel dense linear algebra solvers. When I ask "how big is your problem?" the answer is often "Very big". When I ask "How big is very big?" the answer is often "1000 x 1000", to which I then reply that they should try solving it on their iPhone instead of on a 1,000,000-core BG/Q.
Alexey makes a good point: iterative methods are often faster and more convenient. I don't necessarily agree with his reasoning about "marching", though, since the domain decomposition (graph partitioning) that underlies sparse direct solvers inherently overcomes that problem.
I agree with the last two replies, from Alexey Boldarev and Robert van de Geijn. I have successfully used the SuperLU library for solving linear and non-linear systems with sparse matrices. The latter arise from the application of the Element-Free Galerkin Method and Finite Element Method, e.g., for computing potentials in microchannels and fields governed by Poisson's equation, as in electric and/or magnetic devices.
In particular, our research has mostly been concerned with the accuracy achieved by a particular arrangement of numerical and computational procedures in the application of a numerical method, which can be employed to obtain more accurate results. The interest in these methods is for use in CAD/CAE software, initially for Electrical and Electronic Engineering in our research group. We analyze relatively small problems, like some well-known trial cases, which are usually enough to demonstrate the numerical performance of the numerical and computational procedures combined in the model.
Also, since 2007 I have not worked in high-performance computing (HPC) research. Moreover, computer architectures have changed enormously in the last 6 years, so I can't help much if you are interested in "very big problems". We have solved problems up to 10^6 x 10^6, or smaller ones.
In any case, the first step is to describe the relevant characteristics of your algebraic system well, and then choose the most suitable library available, always considering the characteristics of your computational architecture.
I want to thank you guys for keeping up with these responses. I have learned a lot since I first asked this question. My problem is actually banded, so I started researching distributed banded solvers... my first option is the banded solvers in ScaLAPACK, but my biggest issue is that I can't find an example, in C, of solving a banded system using ScaLAPACK.
You could also try LAMA http://www.libama.org/ and get support from us. With LAMA you can even use multiple GPUs on different nodes. There are implementations of various solvers available, including AMG.