PFLOTRAN Performance Results

From WaterWiki

2 billion DoF problem

Profiling on Cray XT5

Figure-1: Scaling of wall clock time (Initialization stage + Flow stage + Transport stage). The default initialization method is used.
Figure-2: Scaling of flow stage.
Figure-3: Scaling of transport stage.
Figure-4: Scaling of flow + transport stages.
Figure-5: Percentage contribution of USER, MPI and MPI_SYNC group routines to wall clock time. The default initialization method is used.
Figure-6: Percentage contribution of user routines to wall clock time.
Figure-7: Percentage contribution of MPI routines to wall clock time. The default initialization method is used.
Figure-8: Timings for user and MPI routines. The default initialization method is used.
Figure-9: Percentage of theoretical peak performance. The default initialization method is used. Peak performance of 1 core of Cray XT5 = 2.6 GHz * 4 ops/cycle = 10.4 GFlop/s.
Figure-10: Comparison between default initialization method and improved initialization method.
Figure-11: Comparison of wall clock time with default and improved initialization methods.
Figure-12: Timings for user and MPI routines. The improved initialization method is used.
Figure-13: Comparison of percentage of theoretical peak for the total program with default and improved initialization methods. Peak performance of 1 core of Cray XT5 = 2.6 GHz * 4 ops/cycle = 10.4 GFlop/s.
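
The percentage-of-peak figures rest on the per-core peak quoted in the captions (2.6 GHz * 4 ops/cycle). A minimal sketch of that arithmetic follows; the function name and the sample inputs are hypothetical and are not taken from these runs.

```python
# Per-core peak as quoted in the captions: 2.6 GHz * 4 ops/cycle.
CLOCK_HZ = 2.6e9
OPS_PER_CYCLE = 4
PEAK_PER_CORE = CLOCK_HZ * OPS_PER_CYCLE  # 10.4e9 floating point ops per second


def percent_of_peak(total_flops, wall_seconds, cores):
    """Return the achieved flop rate as a percentage of aggregate peak."""
    if wall_seconds <= 0 or cores <= 0:
        raise ValueError("wall_seconds and cores must be positive")
    achieved_rate = total_flops / wall_seconds
    return 100.0 * achieved_rate / (cores * PEAK_PER_CORE)


if __name__ == "__main__":
    # Hypothetical inputs for illustration only, not measured values.
    print(percent_of_peak(total_flops=1.0e15, wall_seconds=100.0, cores=8184))
```

The total flop count would typically come from a hardware counter such as PAPI_FP_OPS summed over all processes.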

MPI_Allreduce synchronization timings on Cray XT5


Figure-14: Box plot of MPI_Allreduce timings (including synchronization). The bottom and top of the box are Q1 (25th percentile) and Q3 (75th percentile), respectively. The mid-point in the box represents the median of the distribution. The whiskers are marked at Q1 - 1.5*IQR and Q3 + 1.5*IQR, where IQR is the interquartile range (Q3 - Q1). Points beyond the whiskers are marked with "+" symbols and represent outliers.
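
The box-plot convention described in the caption can be sketched as follows. This is only an illustration: the quartile interpolation method (Python's "inclusive" method here) is an assumption, since plotting tools differ and the page does not say which one produced the figure. The function name and sample data are hypothetical.

```python
import statistics


def box_plot_summary(timings):
    """Return quartiles, whisker limits and outliers for a list of timings."""
    # Quartile method is an assumption; other tools interpolate differently.
    q1, median, q3 = statistics.quantiles(timings, n=4, method="inclusive")
    iqr = q3 - q1                      # interquartile range (Q3 - Q1)
    low_whisker = q1 - 1.5 * iqr       # lower whisker limit
    high_whisker = q3 + 1.5 * iqr      # upper whisker limit
    outliers = [t for t in timings if t < low_whisker or t > high_whisker]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": iqr,
        "low_whisker": low_whisker,
        "high_whisker": high_whisker,
        "outliers": outliers,
    }


if __name__ == "__main__":
    # Hypothetical per-process timings in seconds.
    sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
    print(box_plot_summary(sample))
```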
Figure-15: Time spent synchronizing for MPI_Allreduce by each process for a 4092-core run.
Figure-16: Time spent synchronizing for MPI_Allreduce by each process for an 8184-core run.
Figure-17: Time spent synchronizing for MPI_Allreduce by each process for a 16380-core run.
Figure-18: Time spent synchronizing for MPI_Allreduce by each process for a 32760-core run.

Detailed timing results at 8184 processor cores on Cray XT5


  • The plots below show the time spent by each process in the most dominant routines for an 8184-core run.
  • At 8184 cores, each process participates in 146,738 MPI_Allreduce calls. The table below shows how these calls are distributed across message-size bins.


Bin                         | Count   | Callers
0B < Message size < 16B     | 113,070 | VecDot_MPI, VecNorm_MPI, etc.
16B <= Message size < 256B  | 32,725  | VecDotNorm2
4KB <= Message size < 64KB  | 943     | MatZeroRows_MPIBAIJ, MatZeroRows_MPIAIJ, MatAssemblyBegin_MPIBAIJ, MatAssemblyBegin_MPIAIJ
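
As a quick consistency check, the bin counts in the table can be summed and turned into shares of the 146,738 calls per process. The sketch below uses only the counts listed above; the names `BINS` and `bin_shares` are illustrative.

```python
# Counts copied from the table above (8184-core run on Cray XT5).
BINS = [
    ("0B < size < 16B", 113_070),
    ("16B <= size < 256B", 32_725),
    ("4KB <= size < 64KB", 943),
]


def bin_shares(bins):
    """Return the total call count and each bin's share in percent."""
    total = sum(count for _, count in bins)
    shares = [(label, 100.0 * count / total) for label, count in bins]
    return total, shares


if __name__ == "__main__":
    total, shares = bin_shares(BINS)
    print(f"total calls: {total}")
    for label, pct in shares:
        print(f"{label:>20}: {pct:6.2f}%")
```

The three bins add up to the 146,738 calls stated in the text, and the smallest messages (the dot-product and norm reductions) account for roughly three quarters of them.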


Figure-19: Time spent in the MatLUFactorNumeric_SeqBAIJ_N routine by each process for an 8184-core run.
Figure-20: Time spent in the MatMult_SeqBAIJ_N routine by each process for an 8184-core run.
Figure-21: Time spent in the MatSolve_SeqBAIJ_N routine by each process for an 8184-core run.
Figure-22: Time spent in the reaction_module_rtotal_ routine by each process for an 8184-core run.
Figure-23: Time spent synchronizing for MPI_Allreduce by each process for an 8184-core run.
Figure-24: Floating point operations (PAPI_FP_OPS) executed by each process for an 8184-core run.
Figure-25: Total number of instructions (PAPI_TOT_INS) executed by each process for an 8184-core run.

Profiling on IBM BlueGene/P

Figure-26: Breakdown of dominant routines on IBM BlueGene/P.
Figure-27: Box plot of MPI_Allreduce timings (including synchronization). The bottom and top of the box are Q1 (25th percentile) and Q3 (75th percentile), respectively. The mid-point in the box represents the median of the distribution. The whiskers are marked at Q1 - 1.5*IQR and Q3 + 1.5*IQR, where IQR is the interquartile range (Q3 - Q1). Points beyond the whiskers are marked with "+" symbols and represent outliers.

Detailed timing results at 8184 processor cores on IBM BlueGene/P


  • The plots below show the time spent by each process in the most dominant routines for an 8184-core run on IBM BlueGene/P.
Figure-28: Time spent in the routine rtotal by each process for an 8184-core run on IBM BG/P.
Figure-29: Time spent in the routine rmultiratesorption by each process for an 8184-core run on IBM BG/P.
Figure-30: Time spent in the routine rkineticmineral by each process for an 8184-core run on IBM BG/P.
Figure-31: Time spent in the routine MatLUFactorNumeric_SeqBAIJ_N by each process for an 8184-core run on IBM BG/P.
Figure-32: Time spent in MPI_Allreduce (including synchronization) by each process for an 8184-core run on IBM BG/P.

Comparison between Cray XT5 and IBM BlueGene/P

Figure-33: Comparison of wall clock time on Cray XT5 and IBM BlueGene/P.
Figure-34: Comparison of Flow stage timing on Cray XT5 and IBM BlueGene/P.
Figure-35: Comparison of Transport stage timing on Cray XT5 and IBM BlueGene/P.
Figure-36: Comparison of Initialization stage timing on Cray XT5 and IBM BlueGene/P.
Figure-37: Comparison of Flow + Transport stage timings on Cray XT5 and IBM BlueGene/P.
Figure-38: Comparison of MPI_Allreduce time (including synchronization) on Cray XT5 and IBM BlueGene/P.
Figure-39 (same as Figure-14): Box plot of MPI_Allreduce timings (including synchronization). The bottom and top of the box are Q1 (25th percentile) and Q3 (75th percentile), respectively. The mid-point in the box represents the median of the distribution. The whiskers are marked at Q1 - 1.5*IQR and Q3 + 1.5*IQR, where IQR is the interquartile range (Q3 - Q1). Points beyond the whiskers are marked with "+" symbols and represent outliers.
Figure-40 (same as Figure-27): Box plot of MPI_Allreduce timings (including synchronization). The bottom and top of the box are Q1 (25th percentile) and Q3 (75th percentile), respectively. The mid-point in the box represents the median of the distribution. The whiskers are marked at Q1 - 1.5*IQR and Q3 + 1.5*IQR, where IQR is the interquartile range (Q3 - Q1). Points beyond the whiskers are marked with "+" symbols and represent outliers.

Comparison between IBCGS and BCGS solvers

  • This set of runs was done with the following versions of PFLOTRAN and PETSc:
    • January 2010 version of PFLOTRAN (changeset: 3799e94e5e6c)
    • February 2010 version of PETSc-dev (source changeset: fc78576e289c)
  • The following command line options are used (with MAX_STEPS set to 30):
    • BCGS: -file_output no -flow_mat_type aij -log_summary
    • IBCGS: -file_output no -flow_mat_type aij -flow_ksp_type ibcgs -flow_ksp_lag_norm -tran_ksp_type ibcgs -log_summary
  • The screen output files (including the PETSc log summary info) for these runs are available at http://neptune.ce.ncsu.edu/~vamsi/1b_dof_bcgsVsibcgs/
  • For easier access, an HTML file summarizing the time spent, linear/non-linear iteration counts, etc. is available at http://neptune.ce.ncsu.edu/~vamsi/1b_dof_bcgsVsibcgs/BCGSVsIBCGS.html
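
To make the difference between the two configurations explicit, the option strings listed above can be compared programmatically. The sketch below is a small helper, not part of PFLOTRAN or PETSc; the function names are hypothetical, and the parser assumes option values never begin with "-".

```python
import shlex

# Option strings copied from the list above.
BCGS_OPTS = "-file_output no -flow_mat_type aij -log_summary"
IBCGS_OPTS = (
    "-file_output no -flow_mat_type aij -flow_ksp_type ibcgs "
    "-flow_ksp_lag_norm -tran_ksp_type ibcgs -log_summary"
)


def parse_options(option_string):
    """Map each '-name' to its value, or to None for a bare flag."""
    tokens = shlex.split(option_string)
    options = {}
    i = 0
    while i < len(tokens):
        name = tokens[i]
        # Assumption: a value never starts with '-', so the next '-token'
        # is treated as a new option rather than as this option's value.
        has_value = i + 1 < len(tokens) and not tokens[i + 1].startswith("-")
        options[name] = tokens[i + 1] if has_value else None
        i += 2 if has_value else 1
    return options


def extra_options(base, other):
    """Return options in 'other' that are absent from or differ in 'base'."""
    base_opts = parse_options(base)
    other_opts = parse_options(other)
    return {
        name: value
        for name, value in other_opts.items()
        if name not in base_opts or base_opts[name] != value
    }


if __name__ == "__main__":
    print(extra_options(BCGS_OPTS, IBCGS_OPTS))
```

The IBCGS runs differ from the BCGS runs only in the Krylov solver selection for the flow and transport solves and in the -flow_ksp_lag_norm flag.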


On Cray XT5

Figure-41: Set of 5 trial runs at 8184 cores
Figure-42: Set of 5 trial runs at 16380 cores
Figure-43: Set of 5 trial runs at 32760 cores
Figure-44: Set of 5 trial runs at 65536 cores
Figure-45: Set of 5 trial runs at 98304 cores
Figure-46: Set of 5 trial runs at 131072 cores
Figure-47: Best, median and worst timings for the Flow stage for a set of 5 trial runs
Figure-48: Best, median and worst timings for the Transport stage for a set of 5 trial runs
Figure-49: Best, median and worst timings for the (Flow + Transport) stages for a set of 5 trial runs
Figure-50: Comparison of the number of linear iterations in the Flow stage for a set of 5 trial runs
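
The best, median and worst values reported for each set of 5 trial runs can be obtained as in the sketch below. The function name and the sample timings are hypothetical; the real values are in the linked output files.

```python
import statistics


def summarize_trials(times):
    """Return best (minimum), median and worst (maximum) of trial timings."""
    if not times:
        raise ValueError("at least one trial timing is required")
    return {
        "best": min(times),
        "median": statistics.median(times),
        "worst": max(times),
    }


if __name__ == "__main__":
    # Hypothetical stage timings in seconds for 5 trial runs.
    flow_stage_times = [12.0, 10.5, 11.2, 13.8, 10.9]
    print(summarize_trials(flow_stage_times))
```

Reporting all three values, rather than a single run, shows how much run-to-run variability there is at each core count.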

On BG/P

Figure-51: Flow and Transport stages with BCGS and IBCGS
Figure-52: Comparison of the number of linear iterations in the Flow stage

Cray XT5 and IBM BG/P

Figure-53: Flow stage (the best case from a set of 5 trial runs is taken for XT5)
Figure-54: Transport stage (the best case from a set of 5 trial runs is taken for XT5)
Figure-55: Flow + Transport (the best case from a set of 5 trial runs is taken for XT5)
Figure-56: Wall clock time (the best case from a set of 5 trial runs is taken for XT5)

270 million DoF problem

On Cray XT5 (Hexcore)

  • These results are for the steady-state version of the 270 million DoF problem with the number of time steps set to 30. The improved initialization method is used in these runs.
Figure-41: Percentage contribution of USER, MPI and MPI_SYNC groups to wall clock time.
Figure-42: Scaling of MPI routines on Cray XT5.

On Cray XT4

  • These results are for the steady-state version of the 270 million DoF problem with the number of time steps set to 30. The plots below show the percentage contribution of various groups of routines to wall clock time and the scaling of MPI routines on the Cray XT4 (each node has 4 cores). The default initialization method is used in these runs.
Figure-43: Percentage contribution of USER, MPI and MPI_SYNC groups to wall clock time.
Figure-44: Scaling of MPI routines on the Cray XT4.