第11回配信講義計算科学技術特論B（2024）

計算科学技術特論B 大規模地震シミュレーション2 2024/6/27 東京大学地震研究所藤田航平 1

内容・スケジュール • 6/20 1300- • 地震シミュレーションの概要 • 有限要素法の基礎 • 大規模連立方程式の求解方法(共役勾配法) • 共役勾配法における並列化 • 共役勾配法における前処理 • 大規模地震シミュレーションの例 • 6/27 1300- • 前回の復習 • 並列計算機アーキテキチュアの概説 • SIMDによる有限要素法の行列ベクトル積の高速化 • SIMDに適した有限要素法アルゴリズム・実装 • GPU向けの有限要素法の高速化 • 富岳向けの有限要素法の高速化 2

3.

前回の復習 3

4.

地震シミュレーション概要 • 断層～地殻～地盤～構造物 • 領域サイズ: 100 km • 空間分解能: 0.1 m • 複雑構造をもつ都市などを高精度で求解するため、有限要素法が使われることが多い • 都市規模の問題は数千億自由度規模の超大規模問題になるうえに、有限要素法ではランダムアクセスが主体となり計算効率が下がる傾向にあるため、高性能計算技術が必須 Example code: do i=1,nx do j=1,ny a(i,j)=b(i,j)+b(i+1,j)+…. enddo enddo Example code: do i=1,n do j=ptr(i)+1,ptr(i+1) a(i)=a(i)+b(index(j)) enddo enddo 構造格子(差分法など) 非構造格子(有限要素法など) 4

5.

解析例 • ターゲット：断層～地殻～地盤～構造物～社会活動 c) Resident evacuation a) Earthquake wave propagation 0 km -7 km Two million agents evacuating to nearest safe site Ikebukuro Ueno Earthquake Post earthquake Shinjuku Tokyo station Shinbashi Shibuya 京コンピュータ全系を使った地震シミュレーション例 b) City response simulation T. Ichimura et al., Implicit Nonlinear Wave Simulation with 1.08T DOF and 0.270T Unstructured Finite Elements to Enhance Comprehensive 5 Earthquake Simulation, SC15 Gordon Bell Prize Finalist

6.

有限要素法（1次元） • ターゲット問題(支配方程式): 𝑓𝑓 𝑢𝑢(𝑥𝑥) = 0 2 • e.g.: 𝑓𝑓 𝑢𝑢 𝑥𝑥 =1− 𝑑𝑑 𝑢𝑢(𝑥𝑥) for 0 < 𝑥𝑥 < 1 with boundary conditions 𝑢𝑢 0 𝑑𝑑𝑥𝑥 2 = 0, 𝑢𝑢 1 = 1/2 • 有限要素法では未知関数𝑢𝑢 𝑥𝑥 をオーバーラップの無い区間（要素）で分割する • 𝑢𝑢 𝑥𝑥 = ∑𝑖𝑖 𝑢𝑢𝑖𝑖 𝜑𝜑𝑖𝑖 (𝑥𝑥) • ここで𝑢𝑢𝑖𝑖 は定数(未知) 、𝜑𝜑𝑖𝑖 (𝑥𝑥)は形状関数(既知) • 𝑢𝑢𝑖𝑖 が決まれば𝑢𝑢 𝑥𝑥 が求まる→どうやって𝑢𝑢𝑖𝑖 を求めるか？ • 隣接する節点での関係が導かれる→行列にその値を足しこんでいく節点 𝑦𝑦 𝑢𝑢 𝑥𝑥 要素 𝑥𝑥 3 −3 𝑦𝑦 1 0 節点𝑖𝑖 𝜑𝜑𝑖𝑖 (𝑥𝑥) 𝑥𝑥 −3 0 0 12 − 0 −1/6 −3/8 5 • 𝐀𝐀 = , 𝐟𝐟 = 32 −1/3 0 −4 5 −1/8 −4 4 0 0 • あとは連立方程式𝐀𝐀𝐮𝐮 = 𝐟𝐟を境界条件 1 𝑢𝑢1 = 0, 𝑢𝑢4 = を満たすように解けばよ 2 6 い 27 5 12 − 5

7.

有限要素法のポイント • 有限要素法で数理問題を離散化することで、行列𝐀𝐀はスパース(疎) となる • 対称行列の行数を𝑁𝑁としたとき、 • →行列の非零成分の数は𝑂𝑂 𝑁𝑁 になる • →行列ベクトル積のコスト・メモリ使用量は𝑂𝑂 𝑁𝑁 になる • 行列ベクトル積コストが小さいため、行列ベクトル積を繰り返し使う反復法を使って解くことが多い • 反復法の前処理と組み合わせてアルゴリズムを構築 7

8.

ソルバー例１：地震時の地盤増幅解析＠京コンピュータ Solving preconditioning matrix CG loop Inner coarse loop Solving preconditioning matrix Second ordered tetrahedron Computations of outer loop Equation to be solved (double precision) Outer loop Solve system roughly using CG solver (single precision) Inner fine loop Use as initial solution Linear tetrahedron Solve system roughly using CG solver (single precision) Use for preconditioner of outer loop Second ordered tetrahedron • Solve preconditioning matrix roughly to reduce number of CG loops • Use geometric multi-grid method to reduce cost of preconditioner • Use single precision in preconditioner to reduce computation & communication T. Ichimura et al., Implicit Nonlinear Wave Simulation with 1.08T DOF and 0.270T Unstructured Finite Elements to Enhance Comprehensive 8 Earthquake Simulation, SC15 Gordon Bell Prize Finalist

9.

前回までのまとめ • 有限要素法で数理問題を離散化することで、行列𝐀𝐀はスパース(疎) となる • 行列ベクトル積コストが小さいため、行列ベクトル積を繰り返し使う反復法を使って解くことが多い • いかに速く行列ベクトル積を実行できるか？ • 計算機アーキテクチャに沿った開発が重要に 9

10.

並列計算機アーキテキチュアの概要 10

11.

計算機の性能向上 • 性能向上の要因 • クロック数の向上，同じ価格・電力・面積あたりで使える素子数が増加 • 計算機アーキテキチュア：使えるようになった多数の素子を使って，いかに処理速度を向上させるか？という観点から開発が続けられてきた • コア内並列(pipelining, superscalar, out-of-order, SIMDなど) • マルチコア並列 • 分散メモリ型並列 11

12.

コア内並列 • SIMD (single instruction multiple data) • 一つの命令で複数のデータを操作 • 命令数を減らすことができる • データが不連続であったり，データごとに異なる操作をする処理には適さない • 例：Intel AVX-512 (512 bit SIMD, 倍精度浮動小数点数8つをまとめて操作) • • • • • a512=_mm 512 _loadu_pd(&a[0]); // 連続した8要素をメモリからレジスタにロード b512=_mm512_loadu_pd(&b[0]); // 連続した8要素をメモリからレジスタにロード c512=a512+b512; // 8要素を加算 _mm512_storeu_pd(&c[0],c512); // 8要素をメモリの連続領域にストア参考： https://www.intel.co.jp/content/www/jp/ja/architecture-and-technology/avx-512-overview.html レジスタメモリ上 a: b: c: 0 0 0 1 1 1 2 2 2 3 3 3 4 4 4 … … … 7 7 7 load load store a512 0 1 2 3 4 … 7 4 … 7 4 … 7 + b512 0 1 2 3 = c512 0 1 2 3 12

https://www.intel.co.jp/content/www/jp/ja/architecture-and-technology/avx-512-overview.html

13.

共有メモリ型並列・分散メモリ型並列 • 共有メモリ計算機 • • • • 複数のコアが，コアごとに異なる命令列を実行全コアが同一のメモリ空間を共有単一のオペレーティングシステムでシステムが管理される OpenMPやOpenACCなどを利用してコア毎の命令列や，コア間の同期などを制御 • 分散メモリ計算機 • ネットワークで複数の計算機(ノード)を接続 • 各ノードごとにオペレーティングシステムが実行され，メモリ空間はノード内のコア間でのみ共有される • 各ノードの間ではネットワーク上での通信を介して情報を伝達(MPI等，通信を明示したプログラムが必要) Core 0 Core 1 Core 2 Core 3 Memory 共有メモリ型並列計算機 Network Node 0 Node 1 … 分散メモリ型並列計算機 Node N-1 13

14.

ピーク性能 • クロック x スーパースカラによる演算器数 x FMA x SIMDレーン数 x コア数 x ノード数 • FMA: 命令一個で積和演算ができる(a=b*c+d) • 例：東京大学情報基盤センターのWisteria/BDEC-01 (Aquarius)の計算ノード • （GPU搭載ノードだがCPU部だけをみると） • 72コアのSMP型の共有メモリ計算機(36コアIntel Xeon Platinum 8360Y CPU@2.4GHz x 2個) • Ice Lake Xeon シリーズのCPU： AVX-512 命令をサポート • 512ビットSIMD: 倍精度の場合は8個（単精度の場合は16個）の要素を一度に操作できる • ピーク性能：2.4GHz x 2 (浮動小数点演算器数) x 2 (FMA) x 8 (倍精度SIMD) x 72コア x 1ノード = 5530 GFLOPS 14

15.

計算機の性能向上 • 従来からの計算機は，ユーザが意識せずともに性能向上を享受できた • クロック数向上，pipelining, superscalar, out-of-order • だが，クロックは向上せず，これらの技術による上がり幅が少なくなってきた • 現在の計算機は，ユーザが計算機システム(ハードウェア＋ソフトウェア) を理解しないと性能向上できない形でも素子を使っている • SIMD, マルチコア並列，分散メモリ型並列 • 高性能なプログラムを開発するにあたってはこれらの知識が必須 • 続いて、1次元・3次元有限要素法を例にSIMDに着目した高速化例を説明する • データ転送能力(e.g. メモリバンド幅，キャッシュ特性)についても考慮する必要があるが，ここでは，Element-by-Element法を使うことでアルゴリズムレベルで回避している 15

16.

行列ベクトル積へのSIMDの適用 16

17.

一次元有限要素法 • 支配方程式 • 𝑑𝑑 (𝐸𝐸 𝑑𝑑𝑑𝑑 ) =0 𝑑𝑑𝑥𝑥 2 • 今回は0 ≤ 𝑥𝑥 ≤ 1, 全領域でヤング率𝐸𝐸 = 1 • 境界条件 • 𝑢𝑢 = 0 at 𝑥𝑥 = 0 (Dirichlet条件) • 𝐸𝐸 𝑑𝑑𝑑𝑑 = 1 at 𝑥𝑥 = 1 (Neumann条件) 𝑑𝑑𝑑𝑑 • 線形要素を使った場合の要素剛性マトリクス • 𝐾𝐾𝑒𝑒 = 𝐸𝐸 𝑑𝑑𝑑𝑑 1 −1 −1 1 17

18.

有限要素法の解析フロー • メッシュ生成 • 今回は0 ≤ 𝑥𝑥 ≤ 1を𝑛𝑛 − 1個に等分割したメッシュを使う • 境界条件設定 • Neumann条件をfに反映 • Ku=fの求解 • 共役勾配法を使い、マトリクスベクトル積においてDirichlet条件を反映 • 可視化 • 今回は省略 18

19.

行列ベクトル積 (cal_amat.c) • r ⇐ K uを，r ⇐ Σe PeT(Ke (Pe u))として計算アセンブリにコメントを埋め込む • Peは全体節点番号と要素節点番号のマッピング行列 • ここで， • ue ⇐Pe u 𝐸𝐸 1 −1 • Ke = 𝑑𝑑𝑑𝑑 −1 1 • BDBue ⇐Ke ue ue ⇐Pe u Ke BDBue ⇐Ke ue r+= PeT BDBue 19

20.

コンパイル・実行 • 東京大学情報基盤センターのWisteria/BDEC-01 (AquariusのCPU部のみ)を例に説明 • Intel Xeon Platinum 8360Y CPU@2.4GHz • コンパイル • $ gcc main.c cg.c cal_amat.c -O3 -mavx -fopenmp -fopt-info-vec-optimized -Wall -lm (またはgfortran main.F cg.F cal_amat.F -O3 -mavx -fopenmp -fopt-info-vec-optimized -Wall lm) • • • • • -mavx: AVX instructions -fopt-info-vec-optimized : output optimization information -Wall: output warning messages -lm: link math library Output assembly using option “-S” • 実行 • ./a.out 20

21.

コンパイル・実行結果 [username@wisteria01 1D]$ gcc main.c cg.c cal_amat.c -O3 -mavx -fopenmp -fopt-info-vec-optimized -Wall -lm main.c:63:3: note: loop vectorized … cg.c:11:3: note: loop vectorized [username@wisteria01 1D]$ pjsub run.sh [INFO] PJM 0000 pjsub Job 4495423 submitted. [username@wisteria01 1D]$ cat run.sh.4495423.out |tail 32766 1.160480e-06 1.000000e-06 32767 1.308662e-06 1.000000e-06 32768 1.348558e-06 1.000000e-06 32769 1.483470e-06 1.000000e-06 32770 1.365041e-06 1.000000e-06 32771 1.050433e-06 1.000000e-06 32772 1.327965e-06 1.000000e-06 32773 6.988487e-07 1.000000e-06 norm 1.000000e+00 2.731286e+03 0 took 0.197028 (3.742007 GFLOPS) 0.000000 • 関数にSIMDが適用されるとnote: loop vectorized と表示されるが、今回は cal_amat.cで表示がない • SIMDが適用されていない • 実行時間約0.19秒、約 3.74GFLOPS • この計算機(1コア)のピーク性能は153GFLOPSなので、実行効率は約2.4% 21

22.

SIMDをかけるには • データ依存性 do i=1,n a(i)=2.0*b(i) enddo SIMD演算可能 Gather load do i=1,n a(i)=2.0*b(cny(i)) enddo Scatter store do i=1,n a(cny(i))=2.0*b(i) enddo SIMD演算不可 • Branching do i=1,n if(b(i)>0.0)then a(i)=2.0*b(i) else a(i)=0.0 endif enddo SIMD演算できる場合もあるが、多くの場合効率が落ちる 22

23.

SIMDがかかる例 func1.F90 subroutine func1(n,a,b) implicit none integer i,n real*4 a(n),b(n) [username@wisteria01 test]$ gfortran func1.F -O3 mavx -fopenmp -fopt-info-vec-optimized -S func1.F:6:0: note: loop vectorized do i=1,n a(i)=2.0*b(i) enddo VectorizedはSIMDがかかったという意味 end 2 b 2 0 2 1 2 2 2 * 3 2 … 2 4 … 7 4 … 7 = a 0 1 2 3 23

24.

SIMDがかからない例 (Gather load) func2.F90 subroutine func2(n,cny,a,b) implicit none integer i,n,cny(n) real*4 a(n),b(n) [username@wisteria01 test]$ gfortran func2.F -O3 mavx -fopenmp -fopt-info-vec-optimized –S (出力なし) do i=1,n a(i)=2.0*b(cny(i)) enddo end b 0 1 2 3 4 … 7 データアクセスパターンがcnyによって変わる 2 * 2 2 2 2 2 … 2 4 … 7 = a 0 1 2 3 24

25.

SIMDがかからない例 (Scatter store) func3.F90 subroutine func3(n,cny,a,b) implicit none integer i,n,cny(n) real*4 a(n),b(n) [username@wisteria01 test]$ gfortran func3.F -O3 mavx -fopenmp -fopt-info-vec-optimized –S (出力なし) do i=1,n a(cny(i))=2.0*b(i) enddo end 2 b 2 2 2 2 2 … 2 4 … 7 * 0 1 2 3 データアクセスパターンがcnyによって変わる a 0 1 2 3 4 … 7 同一箇所への足しこみの可能性があるため並列計算できない 25

26.

SIMDがかからない例 (Branching) func4.F90 subroutine func4(n,cny,a,b) implicit none integer i,n,cny(n) real*4 a(n),b(n) do i=1,n if(b(i)>0.0)then a(i)=2.0*b(i) else a(i)=0.0 endif enddo end 2 b 0 a 2 [username@wisteria01 test]$ gfortran func4.F -O3 mavx -fopenmp -fopt-info-vec-optimized –S (出力なし) 2 2 2 2 … 2 * 0 1 2 3 4 … 7 0 0 0 0 0 … 0 or or or or or 0 1 2 3 4 or … 7 b(i)の値によって答えが変わる 26

27.

cal_amat.cのieループにSIMDがかからない理由 • ieループにSIMDがかからない理由 • u, coor, youngのgather load • rへのscatter store • これらを主要計算部から分離すればよい 27

28.

ieループのSIMD化 (cal_amat_simd.c) • ieループをgather load部・計算部・scatter store部に分離 • 一時配列の大きさを抑えるため，ループを入れ子状に変更（ループ・ブロッキング） Gather load 部主要計算部 Scatter store 部 28

29.

# 23 "cal_amat_simd.c" 1 # load right hand side vector # 0 "" 2 #NO_APP xorl %eax, %eax .p2align 4,,10 .p2align 3 .L6: movslq (%rcx,%rax,2), %rdx movslq 4(%rcx,%rax,2), %rsi vmovss 0(%r13,%rdx,4), %xmm0 vmovss %xmm0, 32(%rsp,%rax) vmovss 0(%r13,%rsi,4), %xmm0 vmovss %xmm0, 96(%rsp,%rax) vmovss (%r12,%rsi,4), %xmm0 vsubss (%r12,%rdx,4), %xmm0, %xmm0 movslq (%r14,%rax), %rdx vmovss %xmm0, 352(%rsp,%rax) … Scatter store部 • オプション“-S”を付けてコンパイルすることでアセンブリファイルを生成 s: スカラー命令，xmmは128 bit レジスタ(単精度で4要素)だがスカラー命令ではそのうちの一要素しか使わない p: SIMD命令，ymmは256 bit レジスタ(単精度で8要素) Gather load部 • • 主要計算部アセンブリ(cal_amat_simd.s) # 33 "cal_amat_simd.c" 1 # compute determinant and BDBu # 0 "" 2 #NO_APP vmovaps 32(%rsp), %ymm5 vsubps 96(%rsp), %ymm5, %ymm3 vdivps 352(%rsp), %ymm2, %ymm0 vdivps 384(%rsp), %ymm2, %ymm4 vmulps 160(%rsp), %ymm3, %ymm3 vmovaps 96(%rsp), %ymm7 vmovaps 64(%rsp), %ymm6 … #APP # 39 "cal_amat_simd.c" 1 # add BDBu into left side vector # 0 "" 2 #NO_APP xorl %eax, %eax .p2align 4,,10 .p2align 3 .L7: movslq (%rcx,%rax,2), %rsi movslq 4(%rcx,%rax,2), %rdx leaq (%r10,%rsi,4), %rsi leaq (%r10,%rdx,4), %rdx vmovss (%rsi), %xmm0 vaddss 224(%rsp,%rax), %xmm0, %xmm0 vmovss %xmm0, (%rsi) vmovss (%rdx), %xmm0 29 …

30.

実行結果 • 解析結果(norm)は同一 • この場合，解析時間は遅くなった • SIMD計算部分が少ないため，配列操作などのオーバーヘッドが隠せていないコード改変後(cal_amat_simd.c) コード改変前(cal_amat.c) [username@wisteria01 1D]$ tail run.sh.4495423.out [username@wisteria01 1D]$ tail run.simd.sh.4495513.out 32766 1.160480e-06 1.000000e-06 32766 1.160480e-06 1.000000e-06 32767 1.308662e-06 1.000000e-06 32767 1.308662e-06 1.000000e-06 32768 1.348558e-06 1.000000e-06 32768 1.348558e-06 1.000000e-06 32769 1.483470e-06 1.000000e-06 32769 1.483470e-06 1.000000e-06 32770 1.365041e-06 1.000000e-06 32770 1.365041e-06 1.000000e-06 32771 1.050433e-06 1.000000e-06 32771 1.050433e-06 1.000000e-06 32772 1.327965e-06 1.000000e-06 32772 1.327965e-06 1.000000e-06 32773 6.988487e-07 1.000000e-06 32773 6.988487e-07 1.000000e-06 norm 1.000000e+00 2.731286e+03 norm 1.000000e+00 2.731286e+03 0 took 0.235718 (3.127807 GFLOPS) 0.000000 0 took 0.197028 (3.742007 GFLOPS) 0.000000 30

31.

三次元有限要素法の場合 • 一次元有限要素法と同じ構成 • 要素当たりの節点数が4つに，各節点あたりの変数がx,y,zの3成分になる cny[4*ie] 要素ie: cny[4*ie+3] cny[4*ie+1] cny[4*ie+2] x y z x y z … coor: x y z x y z … x y z x y z … u: r: 節点0 節点1 31 行列ベクトル積コード (三次元有限要素法)

32.

実行結果 • 解析結果(norm)は同一 • 三次元の場合は行列ベクトル積内部の計算コストが大きいため、SIMD計算により高速化 • ただしデータのロード・多仕込み部がSIMD化できないため、性能向上幅は限られているコード改変前(cal_amat.c) nblock,n,ne 20 9261 48000 0 took 0.800056 (7.379483 GFLOPS) 0.208333 norm 3.565104e+03 3.797010e+06 コード改変後(cal_amat_simd.c) nblock,n,ne 20 9261 48000 0 took 0.432718 (13.643987 GFLOPS) 0.208333 norm 3.565104e+03 3.797010e+06 32

33.

SIMDに適した有限要素法アルゴリズム開発・実装の例 Kohei Fujita, Masashi Horikoshi, Tsuyoshi Ichimura, Larry Meadows, Kengo Nakajima, Muneo Hori, Lalith Maddegedara Journal of Computational Science, 2020より 33

34.

ターゲット問題 • 陰解法による動的非線形低次非構造格子有限要素法 • 都市の地震応答など、不均質な分布を持つ非線形物性・複雑形状を持つ領域の求解に適している • 多数回大規模線形連立方程式を求解する • 多数のランダムデータアクセスを含む Solve for each of few thousand time steps: Unknown vector (up to trillion degrees of freedom) Ku = f Outer force vector Unknown vector with up to 1 trillion degrees of freedom Sparse symmetric positive definite matrix (changes every time step) Known vector 34

35.

SC14 unstructured finite-element solver • Designed for CPU based K computer • Use algorithm that can obtain equal granularity on millions of cores • Use matrix-free matrix-vector product (Element-by-Element method): Good load balance when elements per core is equal • Also high-peak performance as it is on-cache computation • Combine Element-by-Element method with multi-grid, mixed precision arithmetic, and adaptive conjugate gradient method • Scalability & peak-performance good (key kernels are Element-by-Element), convergency good, thus, time-to-solution good Element-by-Element method += Element #0 += Element #1 f = Σe Pe Ke PeT u [Ke is generated on-the-fly] Ke u … … f Element #N-1 35

36.

計算機におけるSC14ソルバーの性能 • 京コンピュータ(理化学研究所) • 8コアCPU(SPARC64) x 82,944計算ノード • ピーク性能：10.6 PFLOPS、メモリバンド幅：5.3 PB/s • SC14ソルバーの性能：ピーク性能の11.1% • Oakforest-PACS (東京大学・筑波大学) • 68コアCPU(Xeon Phi Knights Landing) x 8,208計算ノード • ピーク性能：25 PFLOPS、メモリバンド幅：4 PB/s • SC14ソルバーの性能：ピーク性能の2.26% 36

37.

性能劣化の原因 • 計算機によりSIMD幅が異なる • 京コンピュータ：倍精度演算で2 (単精度演算で2) • Oakforest-PACS:倍精度演算で16 (単精度演算で8) • Element-by-Element法におけるランダムアクセスがwide SIMD計算機においてボトルネックとなる • 右辺ベクトルのランダムロード(u) • 左辺ベクトルへのランダム足しこみ(f) • これらはSIMDによる連続アクセスよりも低効率な命令で実装される Element #0 += Element #1 Ke u … … f += Element #N-1 37

38.

非構造有限要素法における、問題の「均質性」の利用 • 計算の均質性・連続性→高いデータアクセス性能に直結する • 差分法などの構造格子や、メッシュが構造化されている手法において高い計算性能が得られる理由の一つ • 非構造有限要素においても、メッシュは時間方向に不変 • この特性を使って、時間方向に並列に求解計算を実施することで、動的有限要素法の効率を改善 38

39.

時間並列アルゴリズムの概要 • 複数の時間ステップを同時に求解することで、将来ステップの解を予測 • 一反復当たりの計算コストを削減 • 通常のソルバーを使った場合と全演算数はほぼ変わらないが、より高効率なカーネルを使用することができる Kernel in previous solver Kernel in time-parallel solver: less random access Element #0 Contiguous in memory (i.e., SIMD efficient) ke … Contiguous in memory ke … … Element #1 fi Element #0 Element #1 fi ke Future time steps Current time step … ke … … ui fi+1 fi+2 fi+3 ui ui+1 ui+2 ui+3 Tsuyoshi Ichimura, Kohei Fujita, Masashi Horikoshi, Larry Meadows, Kengo Nakajima, Takuma Yamaguchi, Kentaro Koyama, Hikaru Inoue, Akira Naruse, Keisuke Katsushima, Muneo Hori, Maddegedara Lalith, A Fast Scalable Implicit Solver with Concentrated Computation for Nonlinear Time-evolution Problems on Low-order Unstructured Finite Elements, 32nd IEEE International Parallel and Distributed Processing Symposium, 2018. 39 Kohei Fujita, Keisuke Katsushima, Tsuyoshi Ichimura, Masashi Horikoshi, Kengo Nakajima, Muneo Hori, Lalith Maddegedara, Wave Propagation Simulation of Complex Multi-Material Problems with Fast Low-Order Unstructured Finite-Element Meshing and Analysis, Proceedings of HPC Asia 2018 (Best Paper Award), 2018.

40.

[beta]

Standard solver algorithm
1: set 𝑥𝑥−1 ← 0
2: for( 𝑖𝑖 = 0; 𝑖𝑖 < 𝑛𝑛; 𝑖𝑖 = 𝑖𝑖 + 1 ){
3:
guess 𝑥𝑥̅𝑖𝑖 using standard predictor
4:
set 𝑏𝑏𝑖𝑖 using 𝑥𝑥𝑖𝑖−1
5:
solve 𝑥𝑥𝑖𝑖 ← 𝐴𝐴−1 𝑏𝑏𝑖𝑖 using initial solution 𝑥𝑥̅𝑖𝑖 (Computed using iterative solver with SpMV kernel)
6: }

Developed algorithm
1: set 𝑥𝑥−1 ← 0 and 𝑥𝑥̅𝑖𝑖 ← 0 ( 𝑖𝑖 = 0, … , 𝑚𝑚 − 2)
2: for( 𝑖𝑖 = 0; 𝑖𝑖 < 𝑛𝑛; 𝑖𝑖 = 𝑖𝑖 + 1 ){
3:
set 𝑏𝑏𝑖𝑖 using 𝑥𝑥𝑖𝑖−1
4:
guess 𝑥𝑥̅𝑖𝑖+𝑚𝑚−1 using standard predictor
5:
𝑏𝑏𝑗𝑗 ← 𝑏𝑏�𝑗𝑗
6:

while (

A𝑥𝑥̅𝑖𝑖 −𝑏𝑏𝑖𝑖
𝑏𝑏𝑖𝑖

> 𝜖𝜖) do {

7:
guess 𝑏𝑏�𝑗𝑗 using 𝑥𝑥𝑗𝑗−1
̅
( 𝑗𝑗 = 𝑖𝑖 + 1, … , 𝑖𝑖 + 𝑚𝑚 − 1 )
8:
refine solution {𝑥𝑥𝑗𝑗̅ ← 𝐴𝐴−1 𝑏𝑏�𝑗𝑗 } with initial solution 𝑥𝑥𝑗𝑗̅ (𝑗𝑗 = 𝑖𝑖, … , 𝑖𝑖 + 𝑚𝑚 − 1) (Computed using iterative
solver with concentrated computation kernel)
10:
}
11:
𝑥𝑥𝑖𝑖 ← 𝑥𝑥̅𝑖𝑖
11: }

40

41.

並列時間積分アルゴリズムのnaïveな実装 (m-ステップ) • 各コアにてSIMDを用いてm本のベクトルを計算 • ただし、実際の問題ではm ≦4が一般的：SIMD幅をすべて使い切ることができない • マルチコア間のデータ競合を回避するためにテンポラリのベクトルを確保 • メニーコア計算機を使う場合は高コスト Element #0 Compute using 4-wide SIMD ke Global left hand side vector (f) Core-wise temporary vectors (ft) + + Element #1 fi fi+1 fi+2 fi+3 ke … Future time steps Current time step + 1. Initialize necessary components (gray) ui ui+1 ui+2 ui+3 Core-wise computation (in case of m=4) 2. Update components by EBE (black) = = = = = = 3. Add necessary components Many-core computation (in case of three cores) 41

42.

Naïve implementation of EBE kernel with m vectors 1 !$OMP PARALLEL DO 2 do iu=1,numberofthreads ! for each thread 3 do i=1,nnum(iu) 4 i1=nlist(i,iu) 5 do im=1,m 6 ft(im,1,i1,iu)=0.0 ! clear temporary vector 7 ft(im,2,i1,iu)=0.0 8 ft(im,3,i1,iu)=0.0 9 enddo 10 enddo 11 do ie=npl(iu)+1,npl(iu+1) 12 cny1=cny(1,ie) 13 cny2=cny(2,ie) 14 cny3=cny(3,ie) 15 cny4=cny(4,ie) 16 xe11=x(1,cny1) 17 xe21=x(2,cny1) SIMD computation ... with width m 18 xe34=x(3,cny4) 19 do im=1,m 20 ! compute BDBu using ue11~ue34 and xe11~xe34 21 ue11=u(im,1,cny1) 22 ue21=u(im,2,cny1) ... 23 ue34=u(im,3,cny4) 24 ! compute BDBu using ue11~ue34 and xe11~xe34 25 BDBu11=... 26 BDBu21=... ... 27 BDBu34=... 28 ! add to temporary vector 29 ft(im,1,cny1,iu)=BDBu11+ft(im,1,cny1,iu) 30 ft(im,2,cny1,iu)=BDBu21+ft(im,2,cny1,iu) ... 31 ft(im,3,cny4,iu)=BDBu34+ft(im,3,cny4,iu) 32 enddo ! im 33 enddo ! ie 34 enddo ! iu 35 !$OMP END PARALLEL DO 36 !$OMP PARALLEL DO 37 ! clear global vector 38 do i=1,n 39 do im=1,m 40 f(im,1,i)=0.0 41 f(im,2,i)=0.0 42 f(im,3,i)=0.0 43 enddo 44 enddo 45 !$OMP END PARALLEL DO 46 do iu=1,numberofthreads 47 !$OMP PARALLEL DO 48 ! add to global vector 49 do i=1,nnum(iu) 50 i1=nlist(i,iu) 51 do im=1,m 52 f(im,1,i1)=f(im,1,i1)+ft(im,1,i1,iu) 53 f(im,2,i1)=f(im,2,i1)+ft(im,2,i1,iu) 54 f(im,3,i1)=f(im,3,i1)+ft(im,3,i1,iu) 55 enddo 56 enddo 57 !$OMP END PARALLEL DO 58 enddo Global left hand side vector (f) Core-wise temporary vectors (ft) + = = = = = = + + 1. Initialize necessary components (gray) 2. Update components by EBE (black) 3. Add necessary components 42

43.

Wide-SIMD CPUにおける効率的な計算方法 • ベクトルをパック・アンパックすることでSIMD幅をすべて使う Example for time parallel kernel (m = 4) with 8 width FP32 SIMD architecture Element #0 Element #0 ke Pack Compute using 8width SIMD Element #1 fi fi+1 fi+2 fi+3 ke … Future time steps Current time step fi fi+1 fi+2 fi+3 Unpack and add ui ui+1 ui+2 ui+3 Naïve implementation: use only 4 out of the 8-width SIMD Element #1 Pack ui ui+1 ui+2 ui+3 With packing: can use full SIMD width 43

44.

EBE kernel with m vectors for wide-SIMD CPUs 1 !$OMP PARALLEL DO 2 do iu=1,numberofthreads ! for each thread 3 do i=1,nnum(iu) 4 i1=nlist(i,iu) 5 do im=1,m 6 ft(im,1,i1,iu)=0.0 ! clear temporary vector 7 ft(im,2,i1,iu)=0.0 8 ft(im,3,i1,iu)=0.0 9 enddo 10 enddo 11 ! block loop with blocksize NL/m 12 do ieo=npl(iu)+1,npl(iu+1),NL/m 13 ! load ue, xe 14 do ie=1,min(NL/m,npl(,iu+1)-ieo+1) 15 cny1=cny(1,ieo+ie-1) 16 cny2=cny(2,ieo+ie-1) 17 cny3=cny(3,ieo+ie-1) 18 cny4=cny(4,ieo+ie-1) 19 do im=1,m 20 ue11(im+(ie-1)*m)=u(im,1,cny1) SIMD 21 ue21(im+(ie-1)*m)=u(im,2,cny1) (width=m) ... computation 22 ue34(im+(ie-1)*m)=u(im,3,cny4) 23 xe11(im+(ie-1)*m)=x(1,cny1) 24 xe21(im+(ie-1)*m)=x(2,cny1) ... 25 xe34(im+(ie-1)*m)=x(3,cny4) 26 enddo 27 enddo 28 29 30 31 ! compute BDBu do i=1,NL BDBu11(ie)=... SIMD computation BDBu21(ie)=... ... 32 BDBu34(ie)=... 33 enddo 34 ! add to global vector 35 do ie=1,min(NL/m, npl(icolor,iu+1)-ieo+1) 36 cny1=cny(1,ieo+ie-1) SIMD 37 cny2=cny(2,ieo+ie-1) (width=m) 38 cny3=cny(3,ieo+ie-1) computation 39 cny4=cny(4,ieo+ie-1) 40 do im=1,m 41 ft(im,1,cny1,iu)=BDBu11(im+(ie-1)*m)+f(im,1,cny1,iu) 42 ft(im,2,cny1,iu)=BDBu21(im+(ie-1)*m)+f(im,2,cny1,iu) ... 43 ft(im,3,cny4,iu)=BDBu34(im+(ie-1)*m)+f(im,3,cny4,iu) 44 enddo 45 enddo 46 enddo ! ieo 47 enddo ! iu 48 !$OMP END PARALLEL DO 49 Add ft in to f (same as lines 36-58 of Fig. 2) 44

45.

メニーコア計算機向けのスレッド分割 • テンポラリ配列を使う必要がない • グラフ分割法を使ってキャッシュの再利用を促進 Decompose mesh using graph partitioning method … Overall mesh Color #1 Color #2 Color #3 All threads compute each color a) Standard coloring method Overall mesh Set #1 Set #2 Thread 1, Thread 2, Thread 3 Set #3 (Threads 2,3 idle) b) Developed thread partitioning method 45

46.

1 !$OMP PARALLEL DO 2 ! clear global vector 3 do i=1,n 4 do im=1,m 5 f(im,1,i)=0.0 6 f(im,2,i)=0.0 7 f(im,3,i)=0.0 8 enddo 9 enddo 10 !$OMP END PARALLEL DO 11 do icolor=1,ncolor ! for each color or element set 12 !$OMP PARALLEL DO 13 do iu=1, numberofthreads 14 ! block loop with blocksize NL/m 15 do ieo=npl(icolor,iu)+1,npl(icolor,iu+1),NL/m 16 ! load ue, xe 17 do ie=1,min(NL/m,npl(icolor,iu+1)-ieo+1) 18 cny1=cny(1,ieo+ie-1) 19 cny2=cny(2,ieo+ie-1) 20 cny3=cny(3,ieo+ie-1) 21 cny4=cny(4,ieo+ie-1) 22 do im=1,m SIMD 23 ue11(im+(ie-1)*m)=u(im,1,cny1) 24 ue21(im+(ie-1)*m)=u(im,2,cny1) (width=m) computation 25 ... 26 ue34(im+(ie-1)*m)=u(im,3,cny4) 27 xe11(im+(ie-1)*m)=x(1,cny1) 28 xe21(im+(ie-1)*m)=x(2,cny1) ... 29 xe34(im+(ie-1)*m)=x(3,cny4) 30 enddo 31 enddo 32 33 34 35 ! compute BDBu do i=1,NL BDBu11(ie)=... SIMD computation BDBu21(ie)=... ... 36 BDBu34(ie)=... 37 enddo 38 ! add to global vector 39 do ie=1,min(NL/m, npl(icolor,iu+1)-ieo+1) 40 cny1=cny(1,ieo+ie-1) 41 cny2=cny(2,ieo+ie-1) 42 cny3=cny(3,ieo+ie-1) 43 cny4=cny(4,ieo+ie-1) 44 do im=1,m 45 f(im,1,cny1)=BDBu11(im+(ie-1)*m)+f(im,1,cny1) SIMD 46 f(im,2,cny1)=BDBu21(im+(ie-1)*m)+f(im,2,cny1) (width=m) computation ... 47 f(im,3,cny4)=BDBu34(im+(ie-1)*m)+f(im,3,cny4) 48 enddo 49 enddo 50 enddo ! ieo 51 enddo ! iu 52 !$OMP END PARALLEL DO 53 enddo ! icolor Coloring/thread partitioning of EBE kernel with m vectors for wide-SIMD CPUs 46

47.

Mixed use of 4-wide and 16-wide SIMD • In case of problems of m = 4 computed on 16-wide SIMD architecture, 16width SIMD can be used for packing/unpacking of 4-width vectors • Enables further reduction of instructions 22 23 24 25 30 do im=1,m SIMD width=4 ue11(im+(ie-1)*m)=u(im,1,cny1) computation ! Load u(1:4,1,cny1) to xmm1 ! Store xmm1 to ue11(1+(ie-1)*m: 4+(ie-1)*m) ue21(im+(ie-1)*m)=u(im,2,cny1) ! Load u(1:4,2,cny1) to xmm1 ! Store xmm1 to ue21(1+(ie-1)*m: 4+(ie-1)*m) ue31(im+(ie-1)*m)=u(j,3,cny1) ! Load u(1:4,3,cny1) to xmm1 ! Store xmm1 to ue31(1+(ie-1)*m: 4+(ie-1)*m) ... enddo Use of only 4-width SIMD instructions 23-25 SIMD width=16 ! Load u(1:16,cny1) to zmm1 computation ! Store zmm1(1:4) to ue11(1+(i-1)*4:4+(i-1)*4) ! Store zmm1(5:8) to ue21(1+(i-1)*4:4+(i-1)*4) ! Store zmm1(9:12) to ue31(1+(i-1)*4:4+(i-1)*4) ... SIMD width=4 computation Mixed use of 4-wide and 16-wide SIMD instructions xmm indicate 128 bit FP32 registers and zmm indicate 512 bit FP32 registers 47

48.

問題設定 • 下記のような２層地盤における非線形波動伝播問題を求解 • ３種の性質の異なる計算機において性能を計測 K computer Oakforest-PACS Intel Skylake Xeon Gold based server Nodes 8 1 1 Sockets/node 1 1 2 Cores/socket 8 68 20 FP32 SIMD width 2 16 16 Clock frequency 2.0 GHz 1.4 GHz 2.4 GHz Total peak FP32 FLOPS 1024 GFLOPS 6092 GFLOPS 6144 GFLOPS 512 GB/s 80.1 GB/s 255.9 GB/s Total DDR bandwidth Total MCDRAM bandwidth 20m Layer 1 8m Layer 1 2 Vp (m/s) 700 2,100 Vs (m/s) 100 700 1,500 2,100 0.25 (hmax ) 0.05 0.007 - Density (kg/m3) Damping Strain Criteria 60m Layer 2 z - 490 GB/s - y x 64m 48

49.

0 Skylake Xeon Gold 6148 x 2 socket 2.04 1287 GFLOPS (20.9%) 2.14 2.60 1000 GFLOPS (16.3%) 6.62 380 GFLOPS (37.1%) 8.20 Percentage to FP32 peak 3.25 3.75 3.05 7.50 7.52 8.85 11.52 14.07 20.10 Oakforest-PACS (1 node) 3.98 5 7.52 10 175 GFLOPS (2.85%) 15 121 GFLOPS (1.98%) 20 9.45 267 GFLOPS (26.1%) Kernel elapsed time per vector (s) 25 K computer (8 nodes) Baseline (1 vector) Baseline (4 vector) Case #1 Case #2 Case #3 Case #4 # of vectors 1 4 4 4 4 4 SIMD packing No No Yes Yes Yes Yes Many-core algorithm Core wise vectors Core wise vectors Core wise vectors Standard coloring Developed partitioning Developed partitioning Mixed use of 16-width and 4-width SIMD No No No No No Yes 49

50.

Performance on actual urban earthquake simulation problem • Compute seismic shaking of 3 layered ground in central Tokyo 10 a) Model of 1.25 km x 1.25 km area of Tokyo with 4066 structures 40m b) Elevation of interfaces of three soil layers 113 236 [cm/s] c) Response at ground surface (merged horizontal component of SI value) 50

51.

Performance on actual urban earthquake simulation problem Elapsed time (s) 300 250 247.2 200 150 1.97 x faster 125.6 100 2.03 x faster 61.9 50 0 Solver algorithm EBE kernel algorithm 3.99 x faster SC14 solver (without With time parallelism With time parallelism time parallelism) Baseline (m=1) Baseline (m=4) Developed (m=4) 51

52.

Summary • Element-by-Element (EBE) kernel in matrix-vector products is key kernel of unstructured implicit finite-element applications • However, the EBE kernel is not straightforward to attain high performance due to random data access • We developed methods to circumvent these data races for high performance on many-core CPU architectures with wide SIMD units • Developed EBE kernel attains • 16.3% of FP32 peak on Intel Xeon Phi Knights Landing based Oakforest-PACS • 20.9% of FP32 peak on Intel Skylake Xeon Gold processor based system • Leads to 2.88-fold speedup over the baseline kernel and 2.03-fold speedup of the whole finite-element application on Oakforest-PACS 52

53.

GPU向けの有限要素法の高速化 Tsuyoshi Ichimura, Kohei Fujita, Takuma Yamaguchi, Akira Naruse, Jack C. Wells, Thomas C. Schulthess, Tjerk P. Straatsma, Christopher J. Zimmer, Maxime Martinasso, Kengo Nakajima, Muneo Hori, Lalith Maddegedara SC18 Gordon Bell Prize Finalist より（資料：山口拓真氏提供） 53

54.

Porting to Piz Daint/Summit • Communication & memory bandwidth relatively lower than K computer • Reducing data transfer required for performance • We have been using FP32-FP64 variables • Transprecision computing is available due to adaptive preconditioning K computer Piz Daint Summit CPU/node 1×SPARC64 VIIIfx 1×Intel Xeon E5-2690 v3 2×IBM POWER 9 GPU/node - 1×NVIDIA P100 GPU 6×NVIDIA V100 GPU Peak FP32 performance/node 0.128 TFLOPS 9.4 TFLOPS 93.6 TFLOPS Memory bandwidth/node 64 GB/s 720 GB/s 5400 GB/s Inter-node throughput 10.2 GB/s 25 GB/s 5 GB/s in each direction 54

55.

Introduction of FP16 variables • Half precision can be used for reduction of data transfer size Single precision (FP32, 32 bits) Half precision (FP16, 16 bits) S e x p o n e n t f r a c t i o n 1bit sign + 8bits exponent + 23bits fraction S e x p f r a c t i o n 1bit sign + 5bits exponent + 10bits fraction • Using FP16 for whole matrix or vector causes overflow/underflow or fails to converge • Smaller exponent bits → small dynamic range • Smaller fraction bits → no more than 4-digit accuracy 55

56.

FP16 for point-to-point communication • FP16 MPI buffer only for boundary part • To avoid overflow or underflow, Original vector 𝐱𝐱 is divided into one localized scaling factor 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝑡𝑡 and FP16 vector 𝐱𝐱� 16 • Data transfer size can be reduced • 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝑡𝑡 × 𝐱𝐱� 16 does not match 𝐱𝐱 exactly, but convergence characteristic is not changed for most problems PE#0 PE#1 boundary part 𝐱𝐱 … 𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶𝐶 × 𝐱𝐱� 16 × … 56

57.

Overlap of computation and communication i, i+1, i+2, i+3-th time step 1 : 𝐫𝐫 = 𝐀𝐀𝐀𝐀 2 : synchronize 𝐪𝐪 by point-to-point comm. 3 : 𝐫𝐫 = 𝐛𝐛 − 𝐫𝐫; 𝐳𝐳 = 𝐌𝐌 −1 𝐫𝐫 4 : 𝜌𝜌𝑎𝑎 = 1; 𝛼𝛼 = 1; 𝜌𝜌𝑏𝑏 = 𝐳𝐳 � 𝐫𝐫; 𝛾𝛾 = 𝐳𝐳 � 𝐪𝐪 5 : synchronize 𝜌𝜌𝑏𝑏 , 𝛾𝛾 by collective comm. 6 : while (|𝐫𝐫𝑖𝑖 |/|𝐛𝐛𝑖𝑖 | > 𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 ) do 7 : 𝛽𝛽 = −𝛾𝛾𝜌𝜌𝑎𝑎 /𝛼𝛼 8 : 𝐮𝐮 = 𝐮𝐮 + 𝛼𝛼𝐩𝐩; 𝐩𝐩 = 𝐳𝐳 + 𝛽𝛽𝐩𝐩 9 : 𝐪𝐪 = 𝐀𝐀𝐀𝐀 10: synchronize 𝐪𝐪 by point-to-point comm. 11: 𝜌𝜌𝑎𝑎 = 𝐩𝐩 � 𝐪𝐪 12: synchronize 𝜌𝜌𝑎𝑎 by collective comm. 13: 𝛼𝛼 = 𝜌𝜌𝑏𝑏 /𝜌𝜌𝑎𝑎 ; 𝜌𝜌𝑎𝑎 = 𝜌𝜌𝑏𝑏 14: 𝐫𝐫 = 𝐫𝐫 − 𝛼𝛼𝐪𝐪; 𝐳𝐳 = 𝐌𝐌 −1 𝐫𝐫; 𝜌𝜌𝑏𝑏 = 𝐳𝐳 � 𝐫𝐫; 𝛾𝛾 = 𝐳𝐳 � 𝐪𝐪 15: synchronize 𝜌𝜌𝑏𝑏 , 𝛾𝛾 by collective comm. 16: enddo • Conjugate Gradient method • Introduce time-parallel algorithm • Solve four time steps in the analysis in parallel • Compute 1 current time step and 3 future time steps • Reduce iterations in the solver • Computation becomes dense and suitable for low B/F architectures 57

58.

Overlap of computation and communication i, i+1, i+2, i+3-th time step 1’: while (e𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑖𝑖 > 𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 ) do 2’: Vector operation 1 3’: Matrix vector multiplication 4’: Point-to-point comm. 5’: Vector operation 2 6’: Collective comm. 7’: Vector operation 3 8’: Collective comm. 9’: enddo • Simplified loop • Computation part • 3 groups of vector operations • 1 sparse matrix vector multiplication • Communication part • 1 point-to-point communication • 2 collective communication • Point-to-point communication is overlapped with matrix vector multiplication PE#0 inner part boundary part: send/receive between other MPI processes 1. boundary part computation 2. inner part computation & boundary part communication • However, this communication is still bottleneck of 58 the solver

59.

Overlap of computation and communication • 4 vectors are divided into 2 vectors × 2 sets • Point-to-point communication is overlapped with other vector operations • The number of collective communication is unchanged i, i+1-th time step 1’ : while (e𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑖𝑖 > 𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 ) do 2’ : 3’ : Collective comm. 4’ : Vector operation 1 5’ : Matrix vector multiplication Point-to-point comm. 6’ : 7’ : Vector operation 2 8’ : Collective comm. 9’ : 10’: 11’: Vector operation 3 12’: enddo i+2, i+3-th time step 1’ : while (e𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑟𝑖𝑖 > 𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡𝑡 ) do 2’ : Vector operation 2 3’ : Collective comm. 4’ : 5’ : 6’ : Vector operation 3 7’ : 8’ : Collective comm. 9’ : Vector operation 1 10’: Matrix vector multiplication Point-to-point comm. 11’: 12’: enddo 59

60.

FP16 computation in Element-by-Element method • Matrix-free matrix-vector multiplication • Compute element-wise multiplication • Add into the global vector • Normalization of variables per element can be performed • Enables use of doubled width FP16 variables in element wise computation • Achieved 71.9% peak FP64 performance on V100 GPU • Similar normalization used in communication between MPI partitions for FP16 communication Element-by-Element (EBE) method Element #1 += … f Ae u FP32 FP16 FP16 … f = Σe Pe Ae PeT u [Ae is generated on-the-fly] Element #0 += Element #N-1 60

61.

Introduction of custom data type: FP21 • Most computation in CG loop is memory bound • However, exponent of FP16 is too small for use in global vectors • Use FP21 variables for memory bound computation • Only used for storing data (FP21×3 are stored into 64bit array) • Bit operations used to convert FP21 to FP32 variables for computation Single precision (FP32, 32 bits) (FP21, 21 bits) Half precision (FP16, 16 bits) S e x p o n e n t f r a c t i o n 1bit sign + 8bits exponent + 23bits fraction S e x p o n e n t f r a c t i o n 1bit sign + 8bits exponent + 12bits fraction S e x p f r a c t i o n 1bit sign + 5bits exponent + 10bits fraction 61

62.

Performance on Piz Daint/Summit • Developed solver demonstrates higher scalability compared to previous solvers 288 110.7 373.2 576 117.8 378.5 1152 121.1 399.5 2,759.3 3,065.1 3,034.6 120.8 401.0 2304 2,999.8 123.7 393.3 4608 0 1000 # of MPI processes (# GPUs) # of MPI processes (# GPUs) • Leads to 19.8% (nearly full Piz Daint) & 14.7% (nearly full Summit) peak FP64 performance 2,867.1 2000 Elapsed time (s) 3000 4000 Piz Daint ■ Developed 288 75.8 302.5 576 77.6 311.7 1152 80.4 327.3 2304 82.9 349.8 4608 84.3 374.6 6144 83.7 380.2 12288 90.0 415.1 100.4 454.2 24576 0 500 1,923.7 1,939.5 1,927.5 1,912.2 2,033.8 1,922.1 2,082.9 1000 1500 Elapsed time (s) 2000 2500 Summit ■ SC14 ■ PCGE (Standard) 62

63.

富岳向け計算手法の開発 Kohei Fujita, Kentaro Koyama, Kazuo Minami, Hikaru Inoue, Seiya Nishizawa, Miwako Tsuji, Tatsuo Nishiki, Tsuyoshi Ichimura, Muneo Hori, Lalith Maddegedara, High-fidelity nonlinear low-order unstructured implicit finite-element seismic simulation of important structures by accelerated element-by-element method, Journal of Computational Science, 2020 63

64.

富岳上でのSC14ソルバーの性能 • 京コンピュータ(8コアCPU x 82944台)においてピーク性能の11.1% • 富岳(48コアCPU x 158976台)においてピーク性能の1.5% • 有限要素法においては行列ベクトル積におけるランダムデータアクセスがボトルネックになっている • いかにしてこれらのボトルネックを回避して有限要素法を高速化するか？ • 計算用コア数の増加、システム用のアシスタントコアの追加 • これらも活用したい 64

65.

計算機機構の活用 Thread 1 Thread 2 Thread 3 • 計算の連続アクセス化 • SIMD演算器を有効活用するための計算の並び替え • 多数コアの効率的活用 Overall mesh Color #1 Color #2 • キャッシュ特性を考慮したマルチカラリング • 計算と通信のオーバーラップ MPI process #0 Color #3 (Threads 2,3 idle) Nodes on MPI boundary • アシスタントコアを活用して計算と通信を同時実行 • これらの工夫で主要計算部となる行列ベクトル積において13倍の高速化を実現 MPI process #1MPI process #2 Boundary domain Inner domain 65

66.

性能計測問題 • 2層の非線形地盤中の波動伝播解析 20m Layer 1 8m Layer 1 2 Vp (m/s) 700 2,100 Vs (m/s) 100 700 1,500 2,100 0.25 (hmax ) 0.05 0.007 - Density (kg/m3) Damping Strain Criteria 60m Layer 2 64m 66

67.

性能計測結果 • ウィークスケーリング(一台あたりの計算規模を一定として、計算機の台数を増やす)を計測：実行時間が増加しないのが理想的 • 富岳ほぼ全系に相当する147,456ノードにおいて、SC14ソルバーをそのまま実行した場合に比べて計算機機構を踏まえた手法により7倍速にソルバーの実行時間 (s) • 京全系を使った場合の59倍速に相当 50.0 39.3 37.1 37.1 36.8 37.0 38.8 ピーク性能の1.5% 36.8 36.4 36.0 35.9 40.0 開発手法により 7倍の高速化 30.0 20.0 10.0 4.13 4.20 4.19 4.17 4.52 4.65 4.57 4.73 5.02 5.54 ピーク性能の9.9% 0.0 256 2048 16384 計算ノード数(台) SC14solver_asis 131072 SC14solver_proposed 67

68.

Data-driven approachとの融合 HPC Asia 2022 Best Paper: "152K-computer-node parallel scalable implicit solver for dynamic nonlinear earthquake simulation" / Tsuyoshi Ichimura and Kohei Fujita (The University of Tokyo, RIKEN); Kentaro Koyama (Fujitsu Ltd.); Ryota Kusakabe and Yuma Kikuchi (The University of Tokyo); Takane Hori and Muneo Hori (Japan Agency for Marine-Earth Science and Technology); Lalith Maddegedara (The University of Tokyo); Noriyuki Ohi, Tatsuo Nishiki, and Hikaru Inoue (Fujitsu Ltd.); and Kazuo Minami, Seiya Nishizawa, Miwako Tsuji, and Naonori Ueda (RIKEN) 68

69.

断層から都市までの⼀体型の連成地震応答シミュレーション例都市における構造物も陽に解像最小要素サイズ: 12.5 cm 開発手法を富岳全系(152,352ノード(609,408 MPI processes × 12 OpenMP threads = 7,312,896コア)上で実行することでこれまでにない解像度で一体解析を実現

70.

ターゲット問題 • 複雑形状・非均質物性領域中での非線形波動伝播シミュレーション • 複雑形状のモデル化のため非構造格子有限要素法が適している • 複雑形状の離散化のため局所的に小さい要素が生じる：極端に短い時間ステップを避けるため陰解法による求解が適している • 陰解法による非構造格子有限要素法を活用 𝑛𝑛 𝛿𝛿𝑢𝑢 𝑛𝑛 = 𝑓𝑓 𝑛𝑛 𝐴𝐴 (1) 𝑛𝑛 を求解し、変位・速度・加速度をを解くことで変位増分𝛿𝛿𝑢𝑢 2 4 4 𝑢𝑢𝑛𝑛 ⇐ 𝑢𝑢𝑛𝑛−1 + 𝛿𝛿𝑢𝑢𝑛𝑛 , 𝑣𝑣 𝑛𝑛 ⇐ −𝑣𝑣 𝑛𝑛−1 + 𝛿𝛿𝑢𝑢𝑛𝑛 , 𝑎𝑎𝑛𝑛 ⇐ −𝑎𝑎𝑛𝑛−1 − 𝑣𝑣 𝑛𝑛−1 + 2 𝛿𝛿𝑢𝑢𝑛𝑛 𝑑𝑑𝑑𝑑 𝑑𝑑𝑑𝑑 𝑑𝑑𝑡𝑡 で更新 4 2 𝑛𝑛 • ここで𝐴𝐴𝑛𝑛 ⇐ 𝑀𝑀 + 𝐶𝐶 + 𝐾𝐾 𝑛𝑛 , 𝑓𝑓 𝑛𝑛 ⇐ 𝐹𝐹 𝑛𝑛 − 𝑄𝑄𝑛𝑛−1 + 𝐶𝐶 𝑛𝑛 𝑣𝑣 𝑛𝑛−1 + 𝑀𝑀 �𝑎𝑎𝑛𝑛−1 + 2 𝑑𝑑𝑡𝑡 𝑑𝑑𝑑𝑑 4 𝑛𝑛−1 𝑣𝑣 � • 𝑑𝑑𝑑𝑑 • 解析コストのほぼすべてが式(1)の求解にかかるため、この式を高速・スケーラブルに求解するソルバーを開発 70

71.

対象方程式の特性 • 領域サイズが105-6×105-6×104-5 mで最小要素サイズが10-1 mとなるため超大規模問題となる(1011-13自由度) • 省メモリ・スケーラブルな反復法ソルバーが必要 • 要素サイズ・物性が有限要素モデル内で大きく変化するため、分離型の地震シミュレーションに比べて収束性が悪化 • 低精度演算を使いつつロバストなソルバーを設計するには工夫が必要に • 疎行列の計算にはランダムアクセスが含まれるため、演算性能向上のためには工夫が必要となる 71

72.

ソルバー設計 • 収束特性の悪い大規模問題を幅広い計算機アーキテクチャで求解可能なアルゴリズム設計 • Equation-basedとdata-driven approachの高度な融合＋精度混合演算 • 超並列環境でのスケーラビリティのための通信削減＋ロードバランス • + アーキテクチャに適したカーネル設計による高効率計算 72

73.

Equation-based and data-driven approach • Equation-based approach (e.g., マルチグリッド法) • 低次モードを低コストで求解可能だが、高次モードの求解は高コストに • Data-driven approach • 学習データサイズに応じてコストと精度が決まる(i.e., 大規模領域全域における高次モードまでの求解は超高コストとなるが、領域が小さければコストは抑えらえれる) • これらの方法をうまく組み合わせることができないか？ • Equation-based modelingで低次モードを求解し、data-driven methodにより小領域における高次モードを求解。これらを精度保証される求解スキーム内で活用 73

74.

Data-driven initial solution estimator • 過去に求解したデータを使って高次モードを含む初期解を推定 • 𝑋𝑋 𝑛𝑛−1 = 𝐴𝐴𝑋𝑋 𝑛𝑛−2 により、時間発展行列𝐴𝐴 を推定。ここで、直前の𝑠𝑠タイムス 𝑛𝑛 により𝑋𝑋 𝑛𝑛−1 = 𝑥𝑥 𝑛𝑛−1 , 𝑥𝑥 𝑛𝑛−2 , … , 𝑥𝑥 𝑛𝑛−𝑠𝑠 と設テップのデータ𝑥𝑥 𝑛𝑛 = 𝛿𝛿𝑢𝑢𝑛𝑛 − 𝛿𝛿𝑢𝑢𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 定 𝑛𝑛 • Equation-based predictor (𝛿𝛿𝑢𝑢𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 : Adams-Bashforth法により推定)により、data-driven methodにおいて精度低下につながるtransient-modeを求解: 𝑛𝑛 𝑛𝑛 𝑛𝑛−1 𝛿𝛿𝑢𝑢𝑖𝑖𝑖𝑖𝑖𝑖 = 𝛿𝛿𝑢𝑢𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 + 𝐴𝐴 𝛿𝛿𝑢𝑢𝑛𝑛−1 − 𝛿𝛿𝑢𝑢𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 • 計算は局所化されるため、大規模システムにおいて高いスケーラビリティが期待される • 時間発展行列𝐴𝐴はMPI領域内をさらに再分割した小領域内で構築・評価可能 74

75.

マルチグリッド法による前処理 • Data-driven methodによる初期解推定により高次モードの大部分は求解される • 結果として、特に収束性が悪い領域に誤差が集中 • これら誤差が大きい領域を集中的にrefineするW-cycle multi-grid preconditionerを用いる • 富岳のメニーコア・SIMDアーキテクチャに適した行列ベクトル積カーネル(EBEカーネル)と組み合わせることで高い演算性能が期待される Solve preconditioning matrix equation 𝐴𝐴𝐴𝐴 = 𝑟𝑟 roughly (single precision) CGprehmr CGpre CGpre CG loop Second ordered tetrahedron Solving preconditioning matrix Computations of outer loop Equation to be solved (double precision) Outer loop Part of second ordered tetrahedron mesh Second ordered tetrahedron 𝐴𝐴𝐴𝐴 = 𝑟𝑟 CGprelmr Linear tetrahedron 𝐴𝐴𝑐𝑐 𝑧𝑧𝑐𝑐 = 𝑟𝑟𝑐𝑐 CGprelmr 75

76.

性能測定問題 • 実際の地殻・地盤・構造物を模したモデルを使用 • モデル下部からの地震動入力に対する非線形応答を求める • 主要動が構造物に到達する401-600ステップでのソルバー性能を測定 • 直前の16ステップでのデータを使って初期解を推定し、都市部をマルチグリッドでの選択的refinement領域として設定 • PCGEK (京コンピュータ向けにSC14論文で開発されたEBEカーネルを使った共役勾配法ソルバー)と性能を比較 Minimum element size varies from 32 m (crust) to 0.25 m (city): Challenging compared to crust- or cityonly simulations 76

77.

性能測定環境 • Fugaku@Center for Computational Science, RIKEN • Single 48-core Fujitsu A64FX/node (3.07 TFLOPS FP64 peak & 1024 GB/s memory bandwidth) • Total 158,976 compute nodes • Oakbridge-CX@Information Technology Center, The University of Tokyo • Dual 28-core Intel Cascade Lake Xeon/node (4.84 TFLOPS FP64 peak & 281 GB/s memory bandwidth) • Total 1,368 compute nodes 77

78.

Oakbridge-CXと富岳での実行時間 • 以下の計算環境で性能計測を実施 • 256 Oakbridge-CX compute nodes (1.24 PFLOPS FP64 peak, 72.1 TB/s memory bandwidth) • 578 Fugaku compute nodes (1.77 PFLOPS FP64 peak, 591 TB/s memory bandwidth) • 従来手法比でOakbridge-CX上で 18.9倍、富岳上で26.0倍の高速化を実現 Ratio to FP64 peak PCGEK Developed • AI＋マルチグリッド法によるソルバーアルゴリズム開発により浮動小数点演算数を1/8.16に削減 • 計算機アーキテクチャに即したカーネル開発により浮動小数点演算効率を2%から7-8%に改善 78

79.

高速化の要因 • 通常の初期解推定手法(相対誤差 10-3, Adams-Bashforth in PCGEK) に対し、提案手法により初期解の誤差が10-5まで低減 • 反復数がPCGEの62345回から 4080 + 8160 + 30760 = 43000回に削減 Ratio to FP64 peak PCGEK Developed • W-cycle multi-gridによる反復あたりのコスト削減 • Cgprelmrの自由度は元のメッシュの 1/8、CGprehmrの自由度は元のメッシュの1/11.5 • 適切なカーネルアルゴリズム構築により、浮動小数点演算性能も改善 • EBE@PCGEでは2.6% of FP64 peak • EBE@developed solverでは12.3% of FP64 peak Developed @Fugaku 79

80.

富岳上でのweak scaling • 大規模問題の計測においてはPCGEK は遅すぎるため、PCGEFugaku (本研究で富岳用に開発したEBEカーネルを使った PCGE)と性能を比較 • PCGEKからの高速化率(25.4倍)のうち、5倍がカーネル開発、5倍がソルバーアルゴリズムから 5x from architecture aware kernel 5x from algorithm development 93.7% weak scalability up to 73984 nodes • 富岳73984ノード(1.2兆自由度)までウィークスケーリング効率93.7%を実現 • ほぼ一定の反復数＋均質なロードバランス＋通信削減により高いスケーラビリティを実現 • 1.2兆自由度 x 16時間ステップ = 19.2兆自由度のデータを使った学習・推定により高い高速化率を実現 80

81.

まとめ 81

82.

まとめ • 有限要素法は複雑形状問題の求解にたけているが、ランダムアクセスが多く含まれるため、性能を出すには並列計算機の特徴に合わせたアルゴリズム開発・実装開発が重要 • SIMDの活用のためのカーネルレベルの実装 • SIMDをさらに効率よく活用するための求解アルゴリズムの再設計 • 通信バンド幅・メモリバンド幅が限られている計算機においては、低精度データ型の使用などの工夫により高効率な計算が可能に • これらの手法開発はランダムアクセス系の他の手法にも応用可能と期待される 82

第11回配信講義計算科学技術特論B（2024）

R-CCS　計算科学研究推進室

関連スライド

第1回配信講義　計算科学技術特論A （2023）

第1回配信講義　計算科学技術特論A（2025）

第2回配信講義　計算科学技術特論A （2023）

第6回配信講義　計算科学技術特論A （2023）

第8回配信講義　計算科学技術特論A （2023）

第４回配信講義　計算科学技術特論A （2023）

各ページのテキスト

第11回配信講義 計算科学技術特論B（2024）

R-CCS 計算科学研究推進室

関連スライド

第1回 配信講義 計算科学技術特論A （2023）

第1回 配信講義 計算科学技術特論A（2025）

第2回 配信講義 計算科学技術特論A （2023）

第6回 配信講義 計算科学技術特論A （2023）

第8回 配信講義 計算科学技術特論A （2023）

第４回 配信講義 計算科学技術特論A （2023）

各ページのテキスト

第11回配信講義計算科学技術特論B（2024）

R-CCS　計算科学研究推進室

第1回配信講義　計算科学技術特論A （2023）

第1回配信講義　計算科学技術特論A（2025）

第2回配信講義　計算科学技術特論A （2023）

第6回配信講義　計算科学技術特論A （2023）

第8回配信講義　計算科学技術特論A （2023）

第４回配信講義　計算科学技術特論A （2023）