Slide overview
Parallelization of Molecular Dynamics(分子動力学法の並列化)
Molecular dynamics (MD) is a very useful tool to understand various phenomena in atomistic detail. In MD, we can overcome the size- and time-scale problems by efficient parallelization. In this lecture, I’ll explain various parallelization methods of MD with some examples of GENESIS MD software optimization on Fugaku.
R-CCS Computational Science Research Promotion Office
Special Lectures in Computational Science and Technology A: Parallelization of molecular dynamics, July 13, 2023, Jaewoon Jung (RIKEN Center for Computational Science)
Overview of MD
Molecular Dynamics (MD)

1. Energy/forces are described by a classical molecular mechanics force field.
2. The state is updated according to the equations of motion:

$$\dot{\mathbf{r}}_i = \frac{\mathbf{p}_i}{m_i}, \qquad \dot{\mathbf{p}}_i = \mathbf{F}_i$$

Integration (half-step form):

$$\mathbf{p}_i\!\left(t+\tfrac{1}{2}\Delta t\right) = \mathbf{p}_i\!\left(t-\tfrac{1}{2}\Delta t\right) + \mathbf{F}_i(t)\,\Delta t$$

$$\mathbf{r}_i(t+\Delta t) = \mathbf{r}_i(t) + \frac{\mathbf{p}_i\!\left(t+\tfrac{1}{2}\Delta t\right)}{m_i}\,\Delta t$$

MD trajectory => ensemble generation. Long MD trajectories are important for obtaining thermodynamic quantities of target systems.
MD integration (velocity Verlet case)

Each step interleaves half-step momentum updates with full-step coordinate updates:

$$\mathbf{p}(t) \xrightarrow{\;\mathbf{F}(t)\;} \mathbf{p}\!\left(t+\tfrac{1}{2}\Delta t\right), \qquad \mathbf{q}(t) \rightarrow \mathbf{q}(t+\Delta t), \qquad \mathbf{p}\!\left(t+\tfrac{1}{2}\Delta t\right) \xrightarrow{\;\mathbf{F}(t+\Delta t)\;} \mathbf{p}(t+\Delta t)$$

The same pattern repeats for $t+\Delta t \rightarrow t+2\Delta t$, $t+2\Delta t \rightarrow t+3\Delta t$, and so on: the force evaluated at each new set of coordinates completes the previous half-step momentum update and begins the next one.
Difficulty of performing long-time MD simulations

1. The time step (Δt) is limited to 1-2 fs by the fastest (bond-vibration) motions.
2. On the other hand, biologically meaningful events occur on time scales of milliseconds or longer.
3. We therefore need to accelerate the energy/force calculation to reach long simulation times.

(Figure: time scales from fs to sec — vibrations, sidechain motions, mainchain motions, protein global motions, folding.)
History of biological MD simulations (1977 to current)

(Figure: simulation time, ps to ms, versus system size in number of atoms.) Representative milestones: BPTI (~6x10^2 atoms, 1977); HP36 (~10^4, 2000); AQP1 membrane protein (~10^5); BPTI and ubiquitin (~2x10^4, ANTON, Gordon Bell Prize 2009); ribosome (~2x10^6, 2013); HIV-1 capsid (~6.5x10^7); Mycoplasma genitalium (~1.0x10^8); chromatophore vesicle (~1.4x10^8); Sars-Cov-2 (~3.0x10^8); chromatin in a cellular environment (~1.0x10^9, ~2019). To extend both simulation time and system size, we need efficient parallelization.
Potential energy in MD (all-atom case)

$$
\begin{aligned}
E_{\text{potential}} ={}& \sum_{\text{bond}} k_b (b - b_0)^2 && O(N) \\
 &+ \sum_{\text{angle}} k_\theta (\theta - \theta_0)^2 && O(N) \\
 &+ \sum_{\text{dihedral}} k_\phi \left[1 + \cos(n\phi - \delta)\right] && O(N) \\
 &+ \sum_{\text{improper dihedral}} k_\omega (\omega - \omega_0)^2 && O(N) \\
 &+ \sum_{\text{nonbond}} \left\{ \varepsilon_{ij}\left[\left(\frac{R^{\min}_{ij}}{r_{ij}}\right)^{12} - 2\left(\frac{R^{\min}_{ij}}{r_{ij}}\right)^{6}\right] + \frac{q_i q_j}{r_{ij}} \right\} && O(N^2)
\end{aligned}
$$

N is the total number of particles. The bonded terms (bond, angle, dihedral, improper dihedral) scale as O(N); the non-bonded terms (van der Waals and electrostatic) scale as O(N²) and are the main bottleneck in MD.
Practical evaluation of non-bonded interaction
1. We evaluate the non-bonded interaction under periodic boundary conditions with an infinite number of periodic images:

$$
\sum_{ij} \left\{ \varepsilon_{ij}\left[\left(\frac{R^{\min}_{ij}}{r_{ij}}\right)^{12} - 2\left(\frac{R^{\min}_{ij}}{r_{ij}}\right)^{6}\right] + \frac{q_i q_j}{r_{ij}} \right\}
\;\rightarrow\;
\sum_{ij,\mathbf{n}} \left\{ \varepsilon_{ij}\left[\left(\frac{R^{\min}_{ij}}{r_{ij,\mathbf{n}}}\right)^{12} - 2\left(\frac{R^{\min}_{ij}}{r_{ij,\mathbf{n}}}\right)^{6}\right] + \frac{q_i q_j}{r_{ij,\mathbf{n}}} \right\}
$$

where $\mathbf{n}$ runs over the periodic images.
Practical evaluation of non-bonded interaction
2. Truncation of the van der Waals interaction: $V(r) = 0$ for $r > R_c$

$$
\sum_{ij,\mathbf{n}} \left\{ \varepsilon_{ij}\left[\left(\frac{R^{\min}_{ij}}{r_{ij,\mathbf{n}}}\right)^{12} - 2\left(\frac{R^{\min}_{ij}}{r_{ij,\mathbf{n}}}\right)^{6}\right] + \frac{q_i q_j}{r_{ij,\mathbf{n}}} \right\}
\;\rightarrow\;
\sum_{\substack{ij,\mathbf{n}\\ r_{ij,\mathbf{n}} \le R_c}} \varepsilon_{ij}\left[\left(\frac{R^{\min}_{ij}}{r_{ij,\mathbf{n}}}\right)^{12} - 2\left(\frac{R^{\min}_{ij}}{r_{ij,\mathbf{n}}}\right)^{6}\right] + \sum_{ij,\mathbf{n}} \frac{q_i q_j}{r_{ij,\mathbf{n}}}
$$

(Figure: only the j-th particles within the cutoff sphere of radius $R_c$ around the i-th particle contribute to the van der Waals sum.)
Practical evaluation of non-bonded interaction
3. Truncation of the real-space electrostatic interaction by decomposing it into real- and reciprocal-space parts (Ewald summation). The truncated van der Waals term from step 2 is unchanged; the electrostatic term becomes

$$
\sum_{ij,\mathbf{n}} \frac{q_i q_j}{r_{ij,\mathbf{n}}}
\;\rightarrow\;
\underbrace{\sum_{\substack{ij,\mathbf{n}\\ r_{ij,\mathbf{n}} \le R_c}} \frac{q_i q_j\,\mathrm{erfc}(\alpha r_{ij,\mathbf{n}})}{r_{ij,\mathbf{n}}}}_{\text{real-space}}
+ \underbrace{\frac{1}{2\pi V}\sum_{\mathbf{k}\ne\mathbf{0}} \frac{\exp(-\pi^2 \mathbf{k}^2/\alpha^2)}{\mathbf{k}^2}\,|S(\mathbf{k})|^2}_{\text{reciprocal-space}}
- \underbrace{\frac{\alpha}{\sqrt{\pi}}\sum_{i=1}^{N} q_i^2}_{\text{self term}}
$$
Parallelization is one of the solutions to accelerate MD simulations

(Figure: serial execution on 1 CPU core versus parallel execution on 16 cores with MPI communication — ideally 16 times faster.) Good parallelization requires:
1) a small amount of computation per process, and
2) a small communication cost.
Parallelization
Shared memory parallelization (OpenMP)

C: #pragma omp parallel / Fortran: !$omp parallel

• All threads share the data in memory.
• For efficient parallelization, threads should not write to the same memory address.
• It is only available across the processors within one physical node.
Distributed memory parallelization (MPI)

call mpi_init / call mpi_comm_rank / call mpi_comm_size

• Each process P_i has its own memory M_i; processes do not share data in memory.
• We need to send/receive data via communication.
• For efficient parallelization, the amount of communicated data should be minimized.
Hybrid parallelization (MPI + OpenMP)

  call mpi_init
  call mpi_comm_rank
  call mpi_comm_size
  ...
  !$omp parallel do
  do i = 1, N
    ...
  end do
  !$omp end parallel do

• A combination of shared-memory and distributed-memory parallelization.
• It is useful for minimizing the communication cost with very large numbers of processes.
Low-level parallelization: SIMD (single instruction, multiple data)

(Figure: a SIMD unit processing four data elements per instruction is four times faster than scalar code.)

• The same operation is applied to multiple data points simultaneously.
• It is commonly applicable to regular tasks such as adjusting a graphic image or audio volume.
• In most MD programs, SIMD is one of the most important techniques for increasing performance.
MPI Parallelization of MD (non-bonded interaction in real space)
Parallelization scheme 1: Replicated data approach

1. Each process has a copy of all particle data.
2. Each process performs only part of the whole work through a proper assignment in the do loops.

Serial:

  do i = 1, N
    do j = i+1, N
      energy(i,j)
      force(i,j)
    end do
  end do

Replicated data (my_rank: MPI rank, proc: total number of MPI processes):

  do i = my_rank+1, N, proc
    do j = i+1, N
      energy(i,j)
      force(i,j)
    end do
  end do
  MPI reduction (energy, force)
Parallelization scheme 1: Replicated data approach

(Figure: atom indices 1-8 assigned round-robin to MPI ranks 0-3; the computational cost per index decreases with the index.)

1. Perfect load balance is not guaranteed in this parallelization scheme.
2. The reduction communication limits scaling to large numbers of MPI processes.
Hybrid (MPI+OpenMP) parallelization of the replicated data approach

1. The work is distributed over MPI processes and OpenMP threads.
2. Parallel efficiency is increased by reducing the number of MPI processes involved in communication.

MPI only:

  do i = my_rank+1, N, proc
    do j = i+1, N
      energy(i,j)
      force(i,j)
    end do
  end do
  MPI reduction

MPI+OpenMP, threads over the outer loop:

  !$omp parallel
  id = omp thread id
  my_id = my_rank*nthread + id
  do i = my_id+1, N, proc*nthread
    do j = i+1, N
      energy(i,j)
      force(i,j)
    end do
  end do
  OpenMP reduction
  !$omp end parallel
  MPI reduction

or, threads over the inner loop:

  do i = my_rank+1, N, proc
    !$omp parallel do
    do j = i+1, N
      energy(i,j)
      force(i,j)
    end do
    !$omp end parallel do
  end do
  MPI reduction

my_rank: MPI rank, proc: total number of MPI processes, nthread: number of OpenMP threads per process
Pros and cons of the replicated data approach

1. Pros: easy to implement.
2. Cons:
• Parallel efficiency is not good:
  - there is no perfect load balance;
  - the communication cost is not reduced by increasing the number of processes;
  - only the energy/force calculation is parallelized (with MPI, parallelizing the integration is not very efficient).
• It needs a lot of memory, because of the use of global data.
Parallelization scheme 2: Domain decomposition

1. The simulation space is divided into subdomains, one per MPI process (different colors for different MPI ranks).
2. Each MPI process handles only the corresponding subdomain.
3. MPI communication occurs only among neighboring processes.
Parallelization scheme 2: Domain decomposition

(Figure: a subdomain surrounded by a buffer region of width $r_c$.)

1. For interactions between different subdomains, each subdomain needs the data of a buffer region around it.
2. The size of the buffer region depends on the cutoff value.
3. Interactions between particles in different subdomains must be assigned very carefully.
4. The sizes of the subdomain and buffer region decrease as the number of processes increases.
Pros and cons of the domain decomposition approach

1. Pros:
• Good parallel efficiency:
  - the computational cost per process is reduced by increasing the number of processes;
  - not only the energy/force calculation but also the integration is easily parallelized.
• Very large systems become accessible, since the data size per process decreases as the number of processes increases.
2. Cons:
• Implementation is not easy.
• The domain decomposition scheme depends heavily on the potential energy type, the cutoff, and so on.
• Good performance cannot be obtained for nonuniform particle distributions.
Special treatment for nonuniform distributions

• Tree method: the subdomain size is adjusted so that each subdomain has the same number of particles.
• Hilbert space-filling curve: a map that relates multi-dimensional space to a one-dimensional curve. (The figure is from Wikipedia: https://en.wikipedia.org/wiki/Hilbert_curve)
Comparison of the two parallelization schemes (N: system size, P: number of processes)

                       Computation   Communication    Memory
  Replicated data      O(N/P)        O(N)             O(N)
  Domain decomposition O(N/P)        O((N/P)^(2/3))   O(N/P)
MPI Parallelization of MD (reciprocal space)
Electrostatic interaction with the particle mesh Ewald (PME) method

$$
\underbrace{\sum_{\substack{ij,\mathbf{n}\\ r_{ij,\mathbf{n}} \le R_c}} \frac{q_i q_j\,\mathrm{erfc}(\alpha r_{ij,\mathbf{n}})}{r_{ij,\mathbf{n}}}}_{\text{real-space}}
+ \underbrace{\frac{1}{2\pi V}\sum_{\mathbf{k}\ne\mathbf{0}} \frac{\exp(-\pi^2 \mathbf{k}^2/\alpha^2)}{\mathbf{k}^2}\,|S(\mathbf{k})|^2}_{\text{reciprocal-space}}
- \underbrace{\frac{\alpha}{\sqrt{\pi}}\sum_{i=1}^{N} q_i^2}_{\text{self energy}}
$$

The structure factor in the reciprocal-space part is approximated as

$$
S(k_1, k_2, k_3) \approx b_1(k_1)\,b_2(k_2)\,b_3(k_3)\,F(Q)(k_1, k_2, k_3)
$$

where $F(Q)$ is the Fourier transform of the charge grid $Q$. It is important to parallelize the fast Fourier transform (FFT) efficiently in PME!
Overall procedure of the reciprocal-space calculation

1. Spread the real-space charges $Q$ onto the grid: $\tilde{Q}$.
2. FFT: compute $FFT(\tilde{Q})$.
3. Energy calculation from $FFT(\tilde{Q}) \times \tilde{\Theta} \times FFT(\tilde{Q})$.
4. Inverse FFT: $IFFT(FFT(\tilde{Q}) \times \tilde{\Theta})$.
5. Force calculation from $IFFT(\tilde{Q} \times \tilde{\Theta})$.
A simple note on MPI_Alltoall communication

Before (process i holds row i):   After (process i holds column i):
  Proc1: A1 B1 C1 D1                Proc1: A1 A2 A3 A4
  Proc2: A2 B2 C2 D2                Proc2: B1 B2 B3 B4
  Proc3: A3 B3 C3 D3                Proc3: C1 C2 C3 C4
  Proc4: A4 B4 C4 D4                Proc4: D1 D2 D3 D4

MPI_Alltoall is the same as a matrix transpose!
Parallel 3D FFT – slab (1D) decomposition

1. Each process is assigned a slab of size $N \times N \times N/P$ for computing the FFT of $N \times N \times N$ grid points on $P$ processes.
2. The scalability is limited by $N$.
3. $N$ should be divisible by $P$.
Parallel 3D FFT – slab (1D) decomposition

(Figure: the FFTs along the two local dimensions need no communication; the third dimension requires an MPI_Alltoall communication among all processes.)
Parallel 3D FFT – slab (1D) decomposition

1. Slab decomposition of a 3D FFT has three steps:
• a 2D FFT (or two 1D FFTs) along the two local dimensions,
• a global transpose (communication),
• a 1D FFT along the third dimension.
2. Pros: fast for small numbers of processes.
3. Cons: the number of processes is limited (at most N).
Parallel 3D FFT – 2D decomposition

(Figure: in each of the two transpose steps, MPI_Alltoall communications occur only among the processes of the same color, i.e. within rows or columns of the process grid.)
Parallel 3D FFT – 2D decomposition

1. 2D decomposition of a 3D FFT has five steps:
• a 1D FFT along the local dimension,
• a global transpose,
• a 1D FFT along the second dimension,
• a global transpose,
• a 1D FFT along the third dimension.
2. Each global transpose requires communication only within subgroups of the processes.
3. Cons: slower than the 1D (slab) decomposition for small numbers of processes.
4. Pros: the maximum degree of parallelization is increased.
Parallel 3D FFT – 3D decomposition

• More communication steps than the 1D/2D decompositions
• MPI_Alltoall communications only within one-dimensional subgroups
• Reduced communication cost for large numbers of processes
Ref) J. Jung et al. Comput. Phys. Comm. 200, 57-65 (2016)
Case Study: Parallelization in GENESIS MD software on Fugaku supercomputer
MD software GENESIS

• GENESIS has been developed at RIKEN (principal director: Dr. Yuji Sugita).
• It enables high-performance MD simulations on parallel supercomputers such as K, Fugaku, and Tsubame.
• It is free software under the LGPL license.

(Timeline, history of GENESIS development 2007-2021: development started at RIKEN Wako in 2007; the Kobe team joined later; versions 1.0 through 1.5 were released, followed by Ver. 2.0, optimized for Fugaku, as the latest release.)
https://www.r-ccs.riken.jp/labs/cbrt/
Domain decomposition in GENESIS SPDYN

1. The simulation space is divided into subdomains according to the number of MPI processes.
2. Each subdomain is further divided into unit domains (called cells) with side length at least $r_c/2$.
3. Particle data are grouped together in each cell.
4. Communication is considered only between neighboring processes.
5. Within each subdomain, OpenMP parallelization is used.
Cell-wise particle data in GENESIS

1. Each cell contains an array with the data of the particles that reside within it.
2. This improves the locality of the particle data for operations on individual cells or pairs of cells.

(Figure: traditional cell lists point into scattered particle data; cell-wise arrays store each cell's particles contiguously.)
Ref: P. Gonnet, JCC 33, 76 (2012); J. Jung et al. JCC, 35, 1064 (2014)
Midpoint cell method

• For every cell pair, a midpoint cell is determined.
• If the midpoint cell is not uniquely determined, we choose it considering the load balance.
• For every particle pair, the subdomain that computes the interaction is determined from the midpoint cell.
• With this scheme, each subdomain only needs to communicate the data of the cells adjacent to it.
Ref: J. Jung et al. JCC, 35, 1064 (2014)
Subdomain structure with the midpoint cell method

1. Each MPI process has a local subdomain which is surrounded by boundary cells.
2. The local subdomain has at least two cells in each direction (cell size ≥ $r_c/2$).
Hybrid parallelization of GENESIS (real space)

(Figure: 16 unit cells forming the subdomain of one MPI process; cell pairs such as (1,1), (1,2), (1,3), ... are assigned to OpenMP threads 1-4 when their midpoint cell belongs to the subdomain, and skipped otherwise.)

The parallelization scheme was changed on Fugaku for higher speed!
Previous non-bonded interaction scheme for K

  do ijcel = 1, ncell_pair
    icel = cell_pair(ijcel)            ! first cell index
    jcel = cell_pair(ijcel)            ! second cell index
    do ix = 1, natom(icel)             ! atom index in the first cell
      do k = 1, nb15(ix,ijcel)         ! atom pairs within the pairlist cutoff distance
        jx = nb15_list(k,ix,ijcel)     ! atom index in the second cell
        dij(1:3) = coord(1:3,ix,icel) - coord(1:3,jx,jcel)
        ...                            ! force calculation
      end do
    end do
  end do
Fugaku supercomputer

1. Number of cores: 7,630,848
2. Each node has 4 core memory groups (CMGs)
3. Each CMG has 12 cores with a 2.0 GHz clock speed
4. Each node also has assistant cores (2 cores on a compute node, 4 cores on an I/O-and-compute node)
5. 32 GiB of HBM memory per node
6. Tofu Interconnect D
7. SIMD width of 16 in single precision
8. Theoretical peak performance: 537.21 petaflops in double precision
How to optimize on Fugaku?

Algorithm 1 (bad):

  do i = 1, 16
    do j = 1, 16
      f(i,j)
    end do
  end do

Algorithm 2 (good):

  do ij = 1, 256
    ! obtain indices i and j from ij
    f(i,j)
  end do

1. In principle, there is no difference between Algorithm 1 and Algorithm 2 because the amount of computation is the same.
2. On Fugaku, however, Algorithm 2 performs better than Algorithm 1.
3. The main reason is the operand waiting time of each do loop: on Fugaku, there is a non-negligible waiting time before the calculations in a do loop start executing.
4. To minimize the waiting time, the innermost do loop should be long.
5. If the innermost do loop is short, it can be better not to vectorize it at compile time:

  !ocl nosimd
  do i = 1, 3
    f(i)
  end do

so that the compiler emits the scalar sequence f(1), f(2), f(3) instead of SIMD code for f(1:3).
Non-bonded interaction scheme on Fugaku: new coordinate array

(Figure: the per-cell atom arrays of the 1st, 2nd, 3rd, ... cells are packed into one contiguous array with global atom indices 1, 2, 3, ...)

coord_pbc(1:3,1:MaxAtom,1:ncell) → coord_pbc(1:3,1:MaxAtom×ncell)
Non-bonded interaction scheme on Fugaku

(Figure: with the packed coordinate array, the neighbor lists of the (1,2), (1,3), (1,4), (1,5), ... cell pairs are concatenated, so each atom's interactions form one long inner loop (length ~48 here) instead of many short per-cell-pair loops of length ~9-19.)
Non-bonded interaction scheme on Fugaku (pseudo-code)

  do icel = 1, ncell
    do ix = 1, natom(icel)
      i = (icel-1)*MaxAtom + ix        ! i is the same as (ix,icel)
      rtmp(1:3) = coord(1:3,i)
      do k = 1, nb15(ix,i)             ! longer do loop length than the K or generic kernel
        j = nb15_list(k,ix,i)
        dij(1:3) = rtmp(1:3) - coord(1:3,j)
        ...                            ! force calculation
      end do
    end do
  end do
Comparison of non-bonded algorithms (ApoA1 on one MPI process)

K:

  do ijcel = 1, cell_pair
    ! obtain icel and jcel
    do i = 1, natom(icel)
      do k = 1, neighbor(i,ijcel)
        j = list(k,ijcel)
        ! interaction
      end do
    end do
  end do

Fugaku:

  do icel = 1, ncell
    do i = 1, natom(icel)
      do k = 1, neighbor(i,icel)
        j = list(k,i,icel)
        ! interaction
      end do
    end do
  end do
Prefetch

Prefetch is a technique used by computer processors to boost execution performance by fetching instructions or data from their original storage in slower memory into faster local memory before they are actually needed (from Wikipedia).
Overall procedure of the reciprocal-space interaction

1. PME_Pre: spread the real-space charges $Q$ onto the grid: $\tilde{Q}$.
2. FFT: compute $FFT(\tilde{Q})$.
3. Energy: $E(\tilde{Q} \times \tilde{\Theta})$ from $FFT(\tilde{Q})$.
4. Inverse FFT: $IFFT(\tilde{Q} \times \tilde{\Theta})$.
5. PME_Post: force from $IFFT(\tilde{Q} \times \tilde{\Theta})$.
Charge grid data evaluation in GENESIS 1.X

1. Charge values are spread onto the neighboring grid points.
2. The number of neighboring grid points depends on the spline order.

(Figure: each real-space subdomain $D_1$-$D_{16}$ generates charge grid data, including grid points shared with its neighbors; the reciprocal-space grids $\tilde{D}_1$-$\tilde{D}_{16}$ accept the charges from the corresponding subdomains.) In this scheme, we can avoid communication in the charge grid data generation.
Problem of the charge grid data in GENESIS 1.X

• We assume that the charge grid data in reciprocal space are obtained from the real-space subdomain and its adjacent cells.
• If the spline order and grid spacing are large (e.g. spline order 8 with a grid spacing of 2 Å), the charge data of the adjacent cells are not sufficient to obtain the charge grid data of the subdomain.
New scheme of charge grid data for Fugaku (1)

• The charge grid data are obtained from the charge values in the subdomain only.
• In this way, we do not have to consider charge grid data loss even for large spline orders.
New scheme of charge grid data for Fugaku (2)

1. In the new scheme, we generate the charge grid data only from the subdomain.
2. Charge grid data generated outside the subdomain are sent to the neighboring subdomain by communication.
3. This accelerates the calculation by reducing operations, and higher-order splines can be used.
4. The communication involved is not global, so it does not significantly degrade performance.
Additional considerations when using Fugaku

Source code:

  do i = 2, N
    start(i) = start(i-1) + natom(i-1)
  end do

1. Because of the loop-carried dependency, this do loop must not be vectorized (SIMD must not be applied).
2. However, when start is a pointer component (e.g. start => domain%start_atom), the compiler may auto-vectorize it, with a high possibility of wrong results.
3. For these kinds of loops, apply the no-SIMD directive or avoid pointer usage:

  !ocl nosimd
  do i = 2, N
    start(i) = start(i-1) + natom(i-1)
  end do
Real-space non-bonded interaction (performance per node)

Force calculation:
                       Algorithm 2     Algorithm 3
  Time/step            26.99 ms        10.08 ms
  FLOPS                18.91 GFLOPS    46.16 GFLOPS
  Waiting time/step    18.43 ms        6.23 ms

Pairlist generation:
                       Algorithm 2     Algorithm 3
  Time/cycle           41.46 ms        17.56 ms
  FLOPS                14.32 GFLOPS    33.58 GFLOPS
  Waiting time/cycle   18.60 ms        8.65 ms

• Target system: a 1.58-million-atom system on 16 nodes
• 12.0 Å cutoff and 13.5 Å pairlist cutoff
Reciprocal-space non-bonded interaction

  Node #                     256          512          1024         2048        4096
  PME_new_3d (1 Å / 2 Å)     57.3 / 38.4  27.6 / 20.2  18.7 / 11.7  13.9 / 6.9  9.5 / 4.8
  PME_old_3d (1 Å)           61.7         29.5         18.8         13.5        8.7
  PME_new_2d (1 Å / 2 Å)     46.6 / 38.1  29.4 / 20.8  19.0 / 12.2  17.5 / 7.4  14.5 / N/A
  PME_old_2d (1 Å)           52.6         32.3         19.5         17.1        13.6

1 Å: grid spacing 1 Å, spline order 4, 1024×1024×1024 grid points. 2 Å: grid spacing 2 Å, spline order 8, 512×512×512 grid points.
• Target system: 101.12 million atoms
• 3d: 3D FFT decomposition; 2d: 2D FFT decomposition

1. The new scheme increases the performance for small numbers of processes.
2. The new scheme also increases the overall performance by reducing the number of grid points while increasing the spline order.
Conventional MD (weak scaling)

  Number of nodes  System size  grid spacing = 1 Å  grid spacing = 2 Å
  16               1.58 M       11.62 (1.00)        11.29 (1.00)
  32               3.16 M       11.12 (0.96)        11.11 (0.98)
  64               6.32 M       11.07 (0.95)        11.10 (0.98)
  128              12.63 M      10.82 (0.93)        10.99 (0.97)
  256              25.26 M      10.01 (0.86)        10.72 (0.95)
  512              50.53 M      10.27 (0.88)        10.82 (0.96)
  1024             101.05 M     9.26 (0.80)         10.43 (0.92)
  2048             202.11 M     8.53 (0.73)         10.26 (0.91)
  4096             404.21 M     7.54 (0.65)         9.92 (0.88)
  8192             808.43 M     7.23 (0.62)         9.13 (0.81)
  16384            1.62 B       6.19 (0.53)         8.30 (0.74)

• Performance is in ns/day; numbers in parentheses are the weak-scaling efficiency.
• By reducing the number of grid points while increasing the spline order, we can increase the performance for larger numbers of nodes.
• We obtained the best performance for the largest system (1.62 B atoms at 8.30 ns/day), better than the existing record (NAMD on Titan: a 1 B-atom system at 5 ns/day).
Conventional MD (strong scaling)

(Figure: strong scaling on Fugaku reaching 11.9 ns/day — about 4 times better performance than GENESIS 1.0 on K while using 1/8 of the nodes.)
Ref: J. Jung et al. JCC, 42, 231 (2021)
Summary

• Parallelization: MPI and OpenMP
  - MPI: distributed-memory parallelization
  - OpenMP: shared-memory parallelization
  - Hybrid: both MPI and OpenMP
• Parallelization of MD: mainly domain decomposition with parallelization of the non-bonded interaction
• Key issues in parallelization: minimizing the communication cost to maximize parallel efficiency, and hardware-aware optimization