Pflotran Performance Results
From WaterWiki
2 billion DoF problem
- The input file used for the following runs is located at http://neptune.ce.ncsu.edu/~vamsi/2b_dof/pflotran.in
- The number of time steps is set to 30.
- Command-line arguments used when launching the executable: “-file_output no -flow_mat_type aij -log_summary”
- HDF5 output is turned off for these runs.
- The PETSc log summary information is available at:
  - Cray XT5: http://neptune.ce.ncsu.edu/~vamsi/2b_dof/log_summary/xt5/
  - IBM BlueGene/P: http://neptune.ce.ncsu.edu/~vamsi/2b_dof/log_summary/bgp/
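The 8184-core runs profiled below give a sense of per-process load. The following sketch computes the average share of the problem held by each process. It assumes a nominal total of exactly 2 × 10^9 DoF and a perfectly even decomposition. Neither is stated precisely on this page, so treat the result as approximate.

```python
# Average per-process share of the nominal 2-billion-DoF problem.
# Assumptions: the total is exactly 2e9 DoF (the page only says "2 billion")
# and the domain decomposition is perfectly even.
TOTAL_DOF = 2_000_000_000  # nominal problem size (assumed exact)
CORES = 8184               # core count of the profiled runs


def dof_per_process(total_dof: int, processes: int) -> float:
    """Average number of degrees of freedom owned by each process."""
    return total_dof / processes


if __name__ == "__main__":
    share = dof_per_process(TOTAL_DOF, CORES)
    print(f"{share:,.0f} DoF per process")  # roughly 2.4e5
```

Real decompositions are rarely perfectly balanced, so individual processes may own somewhat more or fewer unknowns than this average.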
Profiling on Cray XT5
MPI_Allreduce synchronization timings on Cray XT5
Detailed timing results at 8184 processor cores of Cray XT5
- The plots below show the time spent in the dominant routines by each process for an 8184-core run.
- At 8184 cores, each process participates in 146,738 MPI_Allreduce calls. The table below shows how these calls are distributed across message-size bins. The counts in the table sum to 146,738, so no calls fall outside the three bins listed (in particular, none fall between 256B and 4KB).
| Message size bin | Calls per process | Callers |
|---|---|---|
| 0B < Message size < 16B | 113,070 | VecDot_MPI, VecNorm_MPI, etc. |
| 16B <= Message size < 256B | 32,725 | VecDotNorm2 |
| 4KB <= Message size < 64KB | 943 | MatZeroRows_MPIBAIJ, MatZeroRows_MPIAIJ, MatAssemblyBegin_MPIBAIJ, MatAssemblyBegin_MPIAIJ |
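The histogram above can be checked and expressed as percentages with a few lines of Python. The bin edges and counts come from the table. The 256B–4KB bin is included with a count of zero because the listed counts already add up to the 146,738 total. Two further points are assumptions, not taken from the profiling data. First, the example sizes presume 8-byte double-precision scalars: a single dot product or norm then reduces 8 bytes, and VecDotNorm2 reduces two values (16 bytes). Second, the label strings are made up for this sketch.

```python
# MPI_Allreduce message-size histogram for one process at 8184 cores (Cray XT5).
# Bin edges and counts are copied from the table; labels are illustrative.

BINS = [
    # (lower bound, inclusive; upper bound, exclusive; label), sizes in bytes
    (1, 16, "0B < size < 16B"),
    (16, 256, "16B <= size < 256B"),
    (256, 4 * 1024, "256B <= size < 4KB"),
    (4 * 1024, 64 * 1024, "4KB <= size < 64KB"),
]

COUNTS = {
    "0B < size < 16B": 113_070,    # VecDot_MPI, VecNorm_MPI, ...
    "16B <= size < 256B": 32_725,  # VecDotNorm2
    "256B <= size < 4KB": 0,       # absent from the table; counts already sum to total
    "4KB <= size < 64KB": 943,     # MatZeroRows_*, MatAssemblyBegin_*
}
TOTAL_CALLS = 146_738


def bin_of(size_bytes: int) -> str:
    """Return the label of the bin a message of size_bytes falls into."""
    for lo, hi, label in BINS:
        if lo <= size_bytes < hi:
            return label
    raise ValueError(f"{size_bytes} B is outside the tabulated bins")


def shares(counts: dict) -> dict:
    """Percentage of all MPI_Allreduce calls that falls into each bin."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


if __name__ == "__main__":
    assert sum(COUNTS.values()) == TOTAL_CALLS
    print(bin_of(8), "|", bin_of(16))  # one double vs. two doubles (assumed 8 B each)
    for label, pct in shares(COUNTS).items():
        print(f"{label:>20}: {pct:6.2f}%")
```

By this arithmetic, about 77% of the calls carry a single scalar and about 22% carry two. Nearly all of the reductions are therefore latency-bound rather than bandwidth-bound.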
Profiling on IBM BlueGene/P
Detailed timing results at 8184 processor cores of IBM BlueGene/P
- The plots below show the time spent in the dominant routines by each process for an 8184-core run on IBM BlueGene/P.
Comparison between Cray XT5 and IBM BlueGene/P
Comparison between IBCGS and BCGS solvers
- This set of runs was done with the following versions of PFLOTRAN and PETSc:
  - January 2010 version of PFLOTRAN (changeset: 3799e94e5e6c)
  - February 2010 version of PETSc-dev (source changeset: fc78576e289c)
- The following command-line options were used (with MAX_STEPS set to 30):
  - BCGS: -file_output no -flow_mat_type aij -log_summary
  - IBCGS: -file_output no -flow_mat_type aij -flow_ksp_type ibcgs -flow_ksp_lag_norm -tran_ksp_type ibcgs -log_summary
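The IBCGS runs differ from the BCGS runs only in their Krylov solver options. According to the PETSc manual page for KSPIBCGS, it is a reformulation of BiCGStab that needs a single global reduction per iteration instead of the usual three (or four). The `-ksp_lag_norm` option (used here with the `flow_` prefix) lags the residual-norm computation so that it can share that reduction.

The Python sketch below is plain textbook BiCGStab, not PETSc's BCGS or IBCGS code. It only counts the places where a distributed run would need an MPI_Allreduce, which is the cost IBCGS targets. The `Reducer` class, its one-call-per-fused-group accounting and the test matrix are hypothetical, introduced for illustration.

```python
import math


def matvec(A, x):
    """Dense matrix-vector product; purely local work, no reduction."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


class Reducer:
    """Hypothetical counter of global synchronization points."""

    def __init__(self):
        self.calls = 0

    def dots(self, *pairs):
        # All inner products passed together share ONE allreduce,
        # the way VecDotNorm2 fuses (s, t) and (t, t).
        self.calls += 1
        return [sum(a * b for a, b in zip(u, v)) for u, v in pairs]


def bicgstab(A, b, tol=1e-10, maxit=100):
    """Textbook (unpreconditioned) BiCGStab that counts reductions."""
    red = Reducer()
    n = len(b)
    x = [0.0] * n
    r = list(b)                  # r = b - A*0
    r0 = list(r)                 # fixed shadow residual
    p = [0.0] * n
    v = [0.0] * n
    rho_old = alpha = omega = 1.0
    (bb,) = red.dots((b, b))     # ||b||^2 for the relative stopping test
    if bb == 0.0:
        return x, 0, red.calls
    its = 0
    for its in range(1, maxit + 1):
        (rho,) = red.dots((r0, r))             # reduction 1
        beta = (rho / rho_old) * (alpha / omega)
        p = [ri + beta * (pi - omega * vi) for ri, pi, vi in zip(r, p, v)]
        v = matvec(A, p)
        (r0v,) = red.dots((r0, v))             # reduction 2
        alpha = rho / r0v
        s = [ri - alpha * vi for ri, vi in zip(r, v)]
        t = matvec(A, s)
        ts, tt = red.dots((t, s), (t, t))      # reduction 3 (fused pair)
        omega = ts / tt if tt != 0.0 else 0.0  # tt == 0 only if s == 0 (A nonsingular)
        x = [xi + alpha * pi + omega * si for xi, pi, si in zip(x, p, s)]
        r = [si - omega * ti for si, ti in zip(s, t)]
        rho_old = rho
        (rr,) = red.dots((r, r))               # reduction 4: residual norm
        if math.sqrt(rr) <= tol * math.sqrt(bb):
            break
    return x, its, red.calls


if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0], [2.0, 5.0, 1.0], [0.0, 1.0, 3.0]]
    b = [1.0, 2.0, 3.0]
    x, its, calls = bicgstab(A, b)
    print(f"{its} iterations, {calls} reductions")
```

In this sketch the count is always 1 + 4 × iterations: one reduction for ||b|| plus four per iteration. IBCGS aims to collapse those four into one. That matters at scale because every allreduce is a global synchronization across all processes.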
- The screen output files (including the PETSc log summary) for these runs are available at http://neptune.ce.ncsu.edu/~vamsi/1b_dof_bcgsVsibcgs/
- For easier access, an HTML file summarizing the time spent, linear/nonlinear iteration counts, etc. is available at http://neptune.ce.ncsu.edu/~vamsi/1b_dof_bcgsVsibcgs/BCGSVsIBCGS.html
On Cray XT5
On BG/P
Cray XT5 and IBM BG/P
270 million DoF problem
On Cray XT5 (Hexcore)
- These results are for the steady-state version of the 270 million DoF problem, with the number of time steps set to 30. An improved initialization method is used in these runs.
On Cray XT4
- These results are for the steady-state version of the 270 million DoF problem, with the number of time steps set to 30. The plots below show the percentage contribution of various groups of routines to wall-clock time, and the scaling of MPI routines, on the Cray XT4 (each node has 4 cores). The default initialization method is used in these runs.
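Both quantities plotted for the XT4 runs can be derived from per-routine timings such as those in a PETSc log summary. The helpers below are a hypothetical sketch. The group names and all timing values are invented for illustration. Strong-scaling efficiency is assumed here as one common way to express scaling at a fixed problem size; the plots do not confirm that it is the measure they use.

```python
def percent_of_wall_clock(group_times, wall_clock):
    """Percentage of total wall-clock time spent in each routine group."""
    return {group: 100.0 * t / wall_clock for group, t in group_times.items()}


def strong_scaling_efficiency(t_base, p_base, t_p, p):
    """E = (t_base * p_base) / (t_p * p) for a fixed-size problem.

    E == 1.0 means time fell in exact proportion to the added cores.
    """
    return (t_base * p_base) / (t_p * p)


if __name__ == "__main__":
    # Hypothetical timings in seconds, NOT measured PFLOTRAN results.
    groups = {"solver": 45.0, "function evaluation": 30.0, "MPI": 15.0}
    print(percent_of_wall_clock(groups, wall_clock=100.0))
    # Hypothetical: 100 s on 4096 cores vs. 60 s on 8192 cores -> E ~ 0.83.
    print(strong_scaling_efficiency(100.0, 4096, 60.0, 8192))
```

Applied to a single routine group, such as the MPI routines, the same efficiency formula shows whether that group's time shrinks as cores are added or stays roughly flat.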

