Upgrading HDFS to 3.3.0 and deploying RBF in production #LINE_DM


September 18, 20

Slide overview

Presentation material from LINE Developer Meetup #68 - Big Data Platform. It covers the major version upgrade of HDFS and the adoption of Router-based Federation (RBF). Event page: https://line.connpass.com/event/188176/


We moved to Speaker Deck in October 2023. See there for the latest information: https://speakerdeck.com/lycorptech_jp


Text of each page
1.

LINE Developer Meetup #68 – Big Data Platform
Upgrading HDFS to 3.3.0 and deploying RBF in production
2020/09/17
Akira Ajisaka
Copyright (C) 2020 Yahoo Japan Corporation. All Rights Reserved.

2.

Self introduction
• Akira Ajisaka (鯵坂 明, Twitter: @ajis_ka)
• Apache Hadoop PMC member (2016~)
• Yahoo! JAPAN (2018~)
[Photo: Outdoor bouldering for the first time in Mitake]

3.

Agenda
• Why and how we upgraded the largest HDFS cluster to 3.3.0
  • Hadoop clusters in Yahoo! JAPAN
  • Short intro of RBF and why we chose it
  • How to upgrade
  • How to split namespace
  • What we considered and experimented
• Many troubles and lessons learned from them

4.

Why and how we upgraded the cluster?

5.

Yahoo! JAPAN's largest HDFS cluster
• 100PB actually used
• 500+ DataNodes
• 240M files + directories
• 290M blocks
• 400GB NameNode Java heap
• HDP 2.6.x + patches (as of Dec. 2019)
Reference: https://www.slideshare.net/techblogyahoo/hadoop-yjtc19-in-shibuya-b2-yjtc

6.

Major existing problems
• The namespace is too large
  • NameNode does not scale infinitely due to heavy GC
• The Hadoop version is too old
  • HDP 2.6 is based on Apache Hadoop 2.7.3
  • 2.7.3 was released 4 years ago
We upgraded to HDFS 3.3.0 and used RBF to split the namespace

7.

RBF (Router-based Federation)
[Diagram: a DFSRouter, a ZooKeeper StateStore, and three NameNodes, each with its own namespace; path labels /, top/, shp/, auc/]
Note: Kerberos authentication is supported in Hadoop 3.3.0

8.

How to enable RBF w/o clients' config changes
Before
• NameNode @ host1 (port 8020)
After
• DFSRouter @ host1 (port 8020)
• NameNode @ host1 (port 8021)
• NameNode @ host2
• NameNode @ host3
• ZooKeeper StateStore
Note: We couldn't do a rolling upgrade of the cluster because of the NN RPC port change

9.

How to split namespaces
• Calculated # of files/directories/blocks from fsimage
• Calculated # of RPCs from audit logs
  • RPCs are classified into two groups (update/read)
• We had to check audit logs to ensure that there is no rename operation between namespaces
  • RBF does not support it for now
  • Xiaomi has developed HDFS Federation Rename (HFR)
    • https://issues.apache.org/jira/browse/HDFS-15087 (work in progress)
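The audit-log checks above can be sketched in Python. This is a minimal illustration, not the tooling used by the speaker: it assumes the usual tab-separated `key=value` layout of HDFS audit log lines (`cmd=`, `src=`, `dst=`), and the mount points in `MOUNTS` and the command set in `UPDATE_CMDS` are hypothetical placeholders.

```python
import re
from collections import Counter

# Hypothetical mount points; the actual namespace split is not given in the slides.
MOUNTS = ["/top", "/shp", "/auc"]
# Assumed, incomplete set of audit "cmd" values that are treated as updates.
UPDATE_CMDS = {"create", "append", "delete", "rename", "mkdirs",
               "setPermission", "setOwner", "setReplication", "setTimes"}

# Audit log fields are assumed to be tab-separated key=value pairs.
FIELD = re.compile(r"(\w+)=([^\t\n]*)")


def namespace_of(path):
    """Return the longest matching mount point, or "/" for everything else."""
    best = "/"
    for mount in MOUNTS:
        if (path == mount or path.startswith(mount + "/")) and len(mount) > len(best):
            best = mount
    return best


def analyze(lines):
    """Count RPCs per (namespace, update/read) and collect cross-namespace renames."""
    rpc_counts = Counter()
    cross_renames = []
    for line in lines:
        fields = dict(FIELD.findall(line))
        cmd, src = fields.get("cmd"), fields.get("src")
        if not cmd or not src:
            continue  # not an audit line we understand
        base_cmd = cmd.split(" ", 1)[0]  # drop suffixes such as " (options=[...])"
        kind = "update" if base_cmd in UPDATE_CMDS else "read"
        rpc_counts[(namespace_of(src), kind)] += 1
        dst = fields.get("dst")
        if base_cmd == "rename" and dst and dst != "null":
            if namespace_of(src) != namespace_of(dst):
                cross_renames.append((src, dst))
    return rpc_counts, cross_renames
```

Feeding the audit log file line by line into `analyze` gives a per-namespace RPC load estimate, and a non-empty `cross_renames` list flags workloads that would break once the namespaces are split.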

10.

Split DataNodes or not?
• Split DataNodes for each namespace
• (no-split) DNs register with all the NameNodes
[Diagram: NNs and DNs for each option]
We chose splitting DNs because it is simple

11.

Split DataNodes – Pros and Cons
Pros
• Simple
  • Easy to troubleshoot, operate
• No limitation of the # of namespaces
• East-west traffic can be controlled easily
Cons
• Need to calculate how many DNs are required for each namespace
  • Possible unbalanced resource usage among namespaces
• HFR uses hard-link for rename and it assumes non-split DNs

12.

Check HDFS client-server compatibility
• We upgrade HDFS only
• Old (HDP 2.6) clients still exist, so we have to check the compatibility
• We read the ".proto" files and verified it
• In addition, we upgraded HDFS in the development cluster for end-users
• Wrote a blog post: https://techblog.yahoo.co.jp/entry/20191206786320/ (Japanese and English)

13.
Load-balancing DFSRouters
• If a client is configured as follows, the client always connects to host1
<property name="dfs.nameservices" value="ns"/>
<property name="dfs.ha.namenodes.ns" value="dr1,dr2"/>
<property name="dfs.namenode.rpc-address.ns.dr1" value="host1:8020"/>
<property name="dfs.namenode.rpc-address.ns.dr2" value="host2:8020"/>
• To avoid this problem, set "dfs.client.failover.random.order" to true
• This feature is available in Hadoop 2.9.0 and not available in the old clients, so we patched internally
• The default value is true in Hadoop 3.4.0+ (HDFS-15350)
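The effect of "dfs.client.failover.random.order" can be illustrated with a toy simulation in Python. This is only a model of the behavior described on the slide, not Hadoop's failover proxy provider code; the function names and client counts are made up for illustration.

```python
import random
from collections import Counter

# dr1 and dr2 from the configuration above.
ROUTERS = ["host1:8020", "host2:8020"]


def first_router(routers, random_order, rng):
    """Return the router a newly started client tries first."""
    candidates = list(routers)
    if random_order:
        # Models dfs.client.failover.random.order=true: shuffle once per client.
        rng.shuffle(candidates)
    return candidates[0]


def simulate(num_clients, random_order, seed=0):
    """Count how many clients land on each router as their first choice."""
    rng = random.Random(seed)
    return Counter(first_router(ROUTERS, random_order, rng)
                   for _ in range(num_clients))
```

With `random_order=False` every simulated client picks host1, while with `random_order=True` the clients spread roughly evenly over both DFSRouters.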

14.

Try Java 11
• Hadoop 3.3.0 supports Java 11 as runtime
• Upgrade to Java 11 to improve GC performance
• We contributed many patches to support Java 11 in Apache Hadoop community
  • https://www.slideshare.net/techblogyahoo/java11-apache-hadoop-146834504 (Japanese)

15.

Upgrade ZooKeeper to 3.5.x
• Error log w/ Hadoop 3.3.0 and ZK 3.4.x
(snip)
Caused by: org.apache.zookeeper.KeeperException$UnimplementedException: KeeperErrorCode = Unimplemented for /zkdtsm-router/ZKDTSMRoot/ZKDTSMSeqNumRoot
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1637)
    at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1180)
    at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156)
(snip)
• Hadoop 3.3.0 upgraded Curator version and it depends on ZooKeeper 3.5.x (HADOOP-16579)
• Rolling upgraded ZK cluster before upgrading HDFS
• Upgrade succeeded without any major problems

16.

Planned schedule
• 2019.9 Upgraded to trunk in the dev cluster
• 2020.3 Apache Hadoop 3.3.0 released
• 2020.3 Upgraded to 3.3.0 in the staging cluster
• 2020.5 Upgraded to 3.3.0 in production

17.

Actual schedule
• 2019.9 Upgraded to trunk in the dev cluster (with 1 retry)
• 2020.7 Apache Hadoop 3.3.0 released
• 2020.8 Upgraded to 3.3.0 in the staging cluster (with 2 retries)
• 2020.8 Upgraded to 3.3.0 in production (no retry! but faced many troubles...)
• Upgrade was completed remotely

18.

Many troubles

19.

DistCp is slower than expected
• We used DistCp to move recent data between namespaces after the upgrade, but it didn't finish by the deadline
• Directory listing of src/dst is serial
  • Increasing Map tasks does not help
• DistCp always fails if (# of Map tasks) > 200 and dynamic option is true
  • Fails by configuration error
  • To make matters worse, it fails after directory listing, which takes a very long time
• DistCp does not work well for very large directories
  • Recommend splitting the job
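One way to read "splitting the job" is to run one DistCp per child directory and spread those runs over a few workers, so that each serial listing phase covers less data. The Python sketch below only builds the command lines. The directory names, file counts, the `hdfs://src-ns` and `hdfs://dst-ns` URIs, and the choice of 100 maps are hypothetical; `hadoop distcp -m <maps> <src> <dst>` is the standard DistCp invocation.

```python
def split_into_batches(dir_file_counts, num_batches):
    """Greedily assign directories (largest first) to the currently lightest batch."""
    batches = [{"dirs": [], "total": 0} for _ in range(num_batches)]
    ordered = sorted(dir_file_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    for path, count in ordered:
        lightest = min(batches, key=lambda b: b["total"])
        lightest["dirs"].append(path)
        lightest["total"] += count
    return batches


def distcp_commands(batch, src_fs, dst_fs, maps=100):
    """Build one DistCp command per directory in a batch (run them sequentially)."""
    return [["hadoop", "distcp", "-m", str(maps), src_fs + path, dst_fs + path]
            for path in batch["dirs"]]
```

Each batch can then be handed to a separate worker, for example by passing every command list to `subprocess.run`. Keeping `maps` at or below 200 follows the failure condition reported on the slide when the dynamic option is used.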

20.

DN traffic reached the NW bandwidth limit
• We faced many job failures just after the upgrade
[Graph: 25Gbps DN out traffic in a subcluster]
• When splitting DNs, we considered only the data size but it is not sufficient
  • Read/write request must be considered as well
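The sizing lesson can be written as a small calculation: size each namespace's DataNode pool by both stored data and peak traffic, and take the larger of the two. The Python sketch below is an assumption-laden illustration; the parameter names and every number in the usage note are hypothetical and do not come from the cluster described in the slides.

```python
import math


def datanodes_needed(data_tb, peak_out_gbps, tb_per_dn, usable_gbps_per_dn):
    """Estimate the DataNodes a namespace needs from capacity and from traffic."""
    by_capacity = math.ceil(data_tb / tb_per_dn)
    by_traffic = math.ceil(peak_out_gbps / usable_gbps_per_dn)
    return {
        "by_capacity": by_capacity,
        "by_traffic": by_traffic,
        "needed": max(by_capacity, by_traffic),
    }
```

For example, `datanodes_needed(10000, 400, 200, 5)` yields 50 nodes by capacity but 80 by traffic, so a split based on data size alone would leave that namespace short of network bandwidth.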

21.

DFSRouter slowdown
• DFSRouter slowed down drastically when the active NameNode was restarted
[Graph: DFSRouter average RPC queue time, with the labels "30 sec", "Restarted active NameNode" and "Finished loading fsimage"]
• Wrote a patch and fixed in HDFS-15555

22.

HttpFS incompatibilities
• The implementation of the web server is different
  • Hadoop 2.x: Tomcat 6.x
  • Hadoop 3.x: Jetty 9.x
• The behavior is very different
  • Jetty supports HTTP/1.1 (chunked encoding)
  • Default idle timeout is different
    • Tomcat: 60 seconds
    • Jetty: Set by "hadoop.http.idle_timeout.ms" (default 1 second)
  • Response flow (what timing the server returns 401) is different
  • Response body itself is different
  • and more...
• Need to test very carefully if you are using HttpFS

23.

Lessons learned
• We changed many configurations at a time, which should be avoided as much as possible
  • For example, we changed the block placement policy to rack fault-tolerant, and under-replicated blocks became 300M+ after the upgrade
  • Troubleshooting becomes more difficult
  • HttpFS upgrades can also be separated from this upgrade, as well as ZooKeeper
• Imagine what will happen in production and test as much as possible in advance
  • Consider the difference between dev/staging and prod
  • There is a limit to what one person can imagine. Ask many colleagues!

24.

HDFS Future works
• Router-based Federation
  • Rebalance DNs/namespaces between subclusters well
  • Considering multiple subclusters, non-split DNs (or even in hybrid), HFR, and so on
• Erasure Coding in production
  • Internally backporting EC feature to the old HDFS client and the work mostly finished
• Try new low-pause-time GC algorithms
  • ZGC, Shenandoah

25.

We are hiring! https://about.yahoo.co.jp/hr/job-info/role/1247/ (Japanese)