October 06, 20
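Slide overview
These are the slides from a presentation at ApacheCon @ Home 2020. They introduce useful HDFS features that were added relatively recently, and a case study of performing a major version upgrade in production and applying Router-based Federation (RBF). Event page: https://www.apachecon.com/acna2020/
We moved to Speaker Deck in October 2023. Please see here for the latest information. https://speakerdeck.com/lycorptech_jp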
ApacheCon 2020 @ Home
HDFS Migration from 2.7 to 3.3 and enabling Router Based Federation (RBF) in production
2020/10/02
Akira Ajisaka
Copyright (C) 2020 Yahoo Japan Corporation. All Rights Reserved.

Self introduction
• Akira Ajisaka (鯵坂 明, Twitter: @ajis_ka)
• Apache Hadoop PMC member (2016~)
• Yahoo! JAPAN (2018~)
[Photo: Outdoor bouldering for the first time in Mitake, Tokyo]

Agenda
• Recent improvements in HDFS
  • RBF (Router Based Federation)
  • Observer NameNodes
  • DataNode maintenance mode
  • New Decommission Monitor
• Enabling RBF in production
  • Hadoop clusters in Yahoo! JAPAN
  • How to upgrade
  • How to split namespace
  • What we considered and experimented
  • Many troubles and lessons learned from them

RBF (Router-based Federation)
• Added in 2.9.0
[Diagram: a client runs "hdfs dfs -ls /top" against a DFSRouter. The DFSRouter uses a ZooKeeper StateStore and sits in front of several NameNodes, each serving one namespace. The directories top/, shp/, and auc/ under / map to those namespaces.]

RBF major features
• Multiple subclusters (3.1.0/3.0.3/2.10.0/2.9.1)
  • The files/dirs in the mount points are distributed
  • HASH, HASH_ALL, LOCAL, RANDOM, SPACE
• Global quotas (3.1.0/2.10.0)
• Security (3.3.0)
• DistCp-based balance tool (3.4.0)
• Hardlink-based rename (HDFS-15087)
  • No movement of actual data

Observer NameNodes
• Clients can read data from Observer to reduce the load of Active NN
• Supports read-after-write consistency from a single client using transaction ID
• Available from 3.3.0
[Diagram: ① Clients send reads and writes to the Active NN and reads to the Observers. ② The Active NN writes the EditLog to the Journal nodes. ③ The Observers and the Standby read the EditLog from the Journal nodes and update their metadata. State transitions shown: "haadmin -transitionToStandby" and "haadmin -transitionToObserver".]

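The read-after-write guarantee can be pictured with a toy model. This is a simplified sketch, not HDFS code: the classes, the `last_seen_txid` field, and the `tail_edit_log` helper are hypothetical names for illustration, and an actual Observer may delay a call until it has caught up instead of exposing a freshness check to the client.

```python
class NameNode:
    """Toy NameNode: a set of paths plus the last applied transaction ID."""

    def __init__(self):
        self.files = set()
        self.applied_txid = 0


class Client:
    """Toy client that remembers the newest transaction ID it has seen."""

    def __init__(self, active, observer):
        self.active = active
        self.observer = observer
        self.last_seen_txid = 0

    def create(self, path):
        # Writes always go to the Active NN and advance its transaction ID.
        self.active.files.add(path)
        self.active.applied_txid += 1
        self.last_seen_txid = self.active.applied_txid

    def observer_is_fresh(self):
        # The Observer may answer only if it has applied every transaction
        # this client has already seen.
        return self.observer.applied_txid >= self.last_seen_txid

    def exists(self, path):
        # Read from the Observer when it is fresh enough, else from Active.
        source = self.observer if self.observer_is_fresh() else self.active
        return path in source.files


def tail_edit_log(active, observer):
    """Hypothetical stand-in for the Observer reading the EditLog."""
    observer.files = set(active.files)
    observer.applied_txid = active.applied_txid
```

In this model a client that has just created a file never sees a stale answer: until the Observer has applied the client's last transaction, the read falls back to the Active NN.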
How the load of NN is reduced by Observer NNs
• About 90% of the requests are reads in normal use cases
  • The load of Active NN will be greatly reduced!
• However, a read request is actually a write if atime is on
  • Please set "dfs.namenode.accesstime.precision" to 0
  • Otherwise all the requests are handled by Active NameNode
• Note: Now RBF and Observer NNs cannot be enabled at the same time (HDFS-13522)
  • DFSRouters (DRs) have internal caches to manage what NNs are active/standby, and the mechanism does not support observer NNs

How to load-balance active-active NN/DR
• If a client is configured as follows, the client always connects to host1 first
  • Requests are not load-balanced in clusters with RBF or Observer NNs
    <property name="dfs.nameservices" value="ns"/>
    <property name="dfs.ha.namenodes.ns" value="dr1,dr2"/>
    <property name="dfs.namenode.rpc-address.ns.dr1" value="host1:8020"/>
    <property name="dfs.namenode.rpc-address.ns.dr2" value="host2:8020"/>
• Set "dfs.client.failover.random.order" to true for load-balancing
  • Available from Hadoop 2.9.0
  • The default value is true in Hadoop 3.4.0+

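The effect of a fixed versus a randomized failover order can be illustrated with a small simulation. This is a sketch under simple assumptions (every client connects once and succeeds on its first attempt); the function names are hypothetical and the code does not talk to Hadoop.

```python
import random
from collections import Counter


def first_target(hosts, random_order, rng):
    """Return the host a new client tries first.

    With random_order=False the configured order is used as-is, so every
    client starts with the first host. With random_order=True the order is
    shuffled per client.
    """
    order = list(hosts)
    if random_order:
        rng.shuffle(order)
    return order[0]


def simulate(hosts, clients, random_order, seed=0):
    """Count how many simulated clients pick each host first."""
    rng = random.Random(seed)
    return Counter(first_target(hosts, random_order, rng) for _ in range(clients))
```

For example, `simulate(["host1:8020", "host2:8020"], 1000, False)` sends all 1000 clients to host1, while passing `True` spreads them roughly evenly over both hosts.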
DataNode maintenance mode (New! Available from 2.9.0)
Decommission vs Maintenance
• Decommission
  • Waits until the blocks are fully replicated
  • Good for long-term maintenance
    • ex.) Replace some hardware devices
• Maintenance
  • Waits until (the replication factor of the blocks) >= "dfs.namenode.maintenance.replication.min" (set to 2 in most cases)
  • Significantly faster than decommission
  • Good for short-term maintenance
    • ex.) Kernel update + reboot

How to put DNs into maintenance mode
• Use a JSON-based hosts file
  • Set "dfs.namenode.hosts.provider.classname" to "org.apache.hadoop.hdfs.server.blockmanagement.CombinedHostFileManager"
• The following config will decommission host3 and put host4 into maintenance after running "dfsadmin -refreshNodes"
    [
      { "hostName": "host1" },
      { "hostName": "host2" },
      { "hostName": "host3", "adminState": "DECOMMISSIONED" },
      { "hostName": "host4", "adminState": "IN_MAINTENANCE" }
    ]

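A hosts file of this shape can be generated rather than edited by hand. The following is a minimal sketch: `build_hosts_file` is a hypothetical helper, and only the "hostName" and "adminState" keys shown on the slide are used.

```python
import json


def build_hosts_file(hosts, decommission=(), maintenance=()):
    """Return JSON text for a combined hosts file.

    hosts: every DataNode hostname that should be listed.
    decommission / maintenance: subsets to mark with an adminState.
    """
    entries = []
    for name in hosts:
        entry = {"hostName": name}
        if name in decommission:
            entry["adminState"] = "DECOMMISSIONED"
        elif name in maintenance:
            entry["adminState"] = "IN_MAINTENANCE"
        entries.append(entry)
    return json.dumps(entries, indent=2)
```

Calling `build_hosts_file(["host1", "host2", "host3", "host4"], decommission={"host3"}, maintenance={"host4"})` yields the same four entries as the example above. The file still has to be placed where the NameNode reads it, followed by "dfsadmin -refreshNodes".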
DataNode decommission: Existing problems
• NameNode write lock takes some seconds for a very dense DN
  • When adding a DN for decommission, NN write lock is held to process all blocks on the node at a time
  • We saw > 30 seconds lock in our staging cluster
• I/O is concentrated on a single disk of a DN
  • Decommission/maintenance is processed storage by storage
  • The replication queue processes blocks in FIFO order
• And more
  • Please check HDFS-14854 for the details

New Decommission Monitor: How it works (in short)
• Before
  • DN decommission → Replication Queue, under the write lock (released every DN)
• After
  • DN decommission → Unprocessed Blocks (HashMap), under the read lock (released every disk)
  • Unprocessed Blocks → Replication Queue, under the write lock (released every 1000 blocks by default)
  • The blocks are randomized to avoid I/O concentration
• Available from 3.3.0

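The idea of bounded lock hold times plus randomized block order can be sketched as follows. This is an illustration only, not the HDFS implementation: a plain `threading.Lock` stands in for the NameNode write lock, and `enqueue_in_batches` and its parameters are hypothetical names.

```python
import random
import threading


def enqueue_in_batches(blocks_by_disk, replication_queue, lock,
                       batch_size=1000, seed=None):
    """Move blocks to the replication queue in shuffled, bounded batches.

    Returns the number of times the lock was acquired.
    """
    # Collect the blocks of all disks, then shuffle them so consecutive
    # queue entries are unlikely to come from the same disk.
    pending = [b for blocks in blocks_by_disk.values() for b in blocks]
    random.Random(seed).shuffle(pending)

    acquisitions = 0
    for start in range(0, len(pending), batch_size):
        # Hold the lock for one batch only, then release it so that other
        # operations can proceed between batches.
        with lock:
            acquisitions += 1
            replication_queue.extend(pending[start:start + batch_size])
    return acquisitions
```

With 2500 blocks and a batch size of 1000, the lock is taken three times for short periods instead of once for the whole node.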
Yahoo! JAPAN's largest HDFS cluster
• 100PB actual used
• 500+ DataNodes
• 240M files + directories
• 290M blocks
• 400GB NameNode Java heap
• HDP 2.6.x + patches (as of Dec. 2019)
Reference: https://www.slideshare.net/techblogyahoo/hadoop-yjtc19-in-shibuya-b2-yjtc (Japanese)

Major existing problems
• The namespace is too large
  • NameNode does not scale infinitely due to heavy GC
  • "The legendary" problem with HDFS
• The Hadoop version is too old
  • HDP 2.6 is based on Apache Hadoop 2.7.3
  • 2.7.3 was released 4 years ago
We upgraded to HDFS 3.3.0 and use RBF to split the namespace

How to enable RBF w/o clients' config changes
• Before
  • NameNode @ host1 (port 8020)
• After
  • DFSRouter @ host1 (port 8020), with a ZooKeeper StateStore
  • NameNode @ host1 (port 8021)
  • NameNode @ host2
  • NameNode @ host3
Note: We could not perform a rolling upgrade of the cluster because of the NN RPC port change

How to split namespaces
• Calculated # of files/directories/blocks from fsimage
• Calculated # of RPCs from audit logs
  • RPCs are classified into two groups (write/read)
• We had to check the audit log to ensure that there is no rename operation between namespaces
  • RBF does not support it for now

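The audit log checks described above can be sketched as a small script. It assumes audit lines made of tab-separated key=value fields (allowed, ugi, ip, cmd, src, dst, ...); the `WRITE_CMDS` set is a partial, illustrative classification, and the mount points and helper names are hypothetical.

```python
from collections import Counter

# Assumption: a partial, illustrative list of commands treated as writes.
WRITE_CMDS = {"create", "append", "delete", "rename", "mkdirs",
              "setPermission", "setOwner", "setReplication"}


def parse_audit_line(line):
    """Parse the tab-separated key=value fields of one audit log line."""
    fields = {}
    for part in line.rstrip("\n").split("\t"):
        key, sep, value = part.partition("=")
        if sep and key.strip():
            # The first field carries the log prefix; keep only the key name.
            fields[key.split()[-1]] = value
    return fields


def namespace_of(path, mount_points):
    """Return the longest mount point that contains path, or None."""
    best = None
    for mp in mount_points:
        if path == mp or path.startswith(mp.rstrip("/") + "/"):
            if best is None or len(mp) > len(best):
                best = mp
    return best


def summarize(lines, mount_points):
    """Count read/write RPCs per mount point and list cross-namespace renames."""
    counts = Counter()
    cross_renames = []
    for line in lines:
        f = parse_audit_line(line)
        cmd, src = f.get("cmd"), f.get("src")
        if not cmd or not src:
            continue
        ns = namespace_of(src, mount_points)
        kind = "write" if cmd in WRITE_CMDS else "read"
        counts[(ns, kind)] += 1
        dst = f.get("dst", "")
        if cmd == "rename" and namespace_of(dst, mount_points) != ns:
            cross_renames.append((src, dst))
    return counts, cross_renames
```

A non-empty `cross_renames` list means some job renames across the planned namespace boundary, so either the split or the job has to change before enabling RBF.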
Split DataNodes or not?
• Split: DataNodes are split for each namespace
• No-split: DNs register with all the NameNodes
[Diagram: NameNodes (NN) and DataNodes (DN) in the split and no-split layouts]
We chose splitting DNs because it is simple

Split DataNodes – Pros and Cons
Pros
• Simple
  • Easy to troubleshoot, operate
• No limitation of the # of namespaces
• East-west traffic can be controlled easily
Cons
• Need to calculate how many DNs are required for each namespace
  • Possible unbalanced resource usage among namespaces
• Hard-link based rename assumes non-split DNs

Use Java 11
• Hadoop 3.3.0 supports Java 11 as runtime
• Upgrade to Java 11 to improve GC performance
• Yahoo! JAPAN contributed many patches to support Java 11
  • https://www.slideshare.net/techblogyahoo/java11-apache-hadoop-146834504 (Japanese)

Upgrade ZooKeeper to 3.5.x
• Error log w/ Hadoop 3.3.0 and ZK 3.4.x
    (snip)
    Caused by: org.apache.zookeeper.KeeperException$UnimplementedException: KeeperErrorCode = Unimplemented for /zkdtsm-router/ZKDTSMRoot/ZKDTSMSeqNumRoot
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:106)
        at org.apache.zookeeper.KeeperException.create(KeeperException.java:54)
        at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:1637)
        at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1180)
        at org.apache.curator.framework.imps.CreateBuilderImpl$17.call(CreateBuilderImpl.java:1156)
    (snip)
• Hadoop 3.3.0 upgraded Curator version and it depends on ZooKeeper 3.5.x (HADOOP-16579)
• Rolling upgraded ZK cluster before upgrading HDFS
• Upgrade succeeded without any major problems

Planned schedule
• 2019.9 Upgraded to trunk in the dev cluster
• 2020.3 Apache Hadoop 3.3.0 released
• 2020.3 Upgraded to 3.3.0 in the staging cluster
• 2020.5 Upgraded to 3.3.0 in production

Results
• 2019.9 Upgraded to trunk in the dev cluster (with 1 retry)
• 2020.7 Apache Hadoop 3.3.0 released
• 2020.8 Upgraded to 3.3.0 in the staging cluster (with 2 retries)
• 2020.8 Upgraded to 3.3.0 in production (no retry! but faced many troubles...)
• Upgrades are completed @ Home

DistCp is slower than expected
• We used DistCp to move the latest data between namespaces after the upgrade, but it didn't finish by the deadline
• Directory listing of src/dst is serial
  • Increasing Map tasks does not help
• DistCp always fails if (# of Map tasks) > 200 and the dynamic option is true
  • It fails with a configuration error
  • To make matters worse, it fails after the directory listing, which takes a very long time
• DistCp does not work well for very large directories
  • We recommend splitting the job

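Splitting the job can be as simple as generating one DistCp invocation per subdirectory. The sketch below only builds command lines and does not run them. `split_distcp_commands` and the example URIs are hypothetical, and the cap of 200 maps with the dynamic strategy follows the failure condition reported above rather than any documented limit.

```python
def split_distcp_commands(src_root, dst_root, subdirs, maps=200, dynamic=True):
    """Build one `hadoop distcp` command line per subdirectory."""
    if dynamic:
        # Failures were reported with more than 200 maps and the dynamic
        # strategy, so cap the map count in that case.
        maps = min(maps, 200)
    commands = []
    for name in subdirs:
        cmd = ["hadoop", "distcp", "-m", str(maps)]
        if dynamic:
            cmd += ["-strategy", "dynamic"]
        cmd += [f"{src_root.rstrip('/')}/{name}",
                f"{dst_root.rstrip('/')}/{name}"]
        commands.append(cmd)
    return commands
```

Each returned list can be passed to `subprocess.run`. Because every job lists only its own subdirectory, the serial listing phase is shorter per job and the jobs can run side by side.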
DN traffic reached the NW bandwidth limit
• We faced many job failures just after the upgrade
[Graph: 25Gbps DN out traffic in a subcluster]
• When splitting DNs, we considered only the data size but it is not sufficient
  • Read/write requests must be considered as well

DFSRouter slowdown
• DFSRouter slowed down drastically when restarting the active NameNode
[Graph: DFSRouter average RPC queue time, reaching 30 sec. Annotations: "Restarted active NameNode", "Finished loading fsimage"]
• Wrote a patch and fixed in HDFS-15555

HttpFS incompatibilities
• The implementation of the web server is different
  • Hadoop 2.x: Tomcat 6.x
  • Hadoop 3.x: Jetty 9.x
• The behavior is very different
  • Jetty supports HTTP/1.1 (chunked encoding)
  • Default idle timeout is different
    • Tomcat: 60 seconds
    • Jetty: Set by "hadoop.http.idle_timeout.ms" (default 1 second)
  • Response flow (what timing the server returns 401) is different
  • Response body itself is different
  • and more...
• Need to test very carefully if you are using HttpFS

Lessons learned: Overall
• We changed many configurations at once, which should be avoided as much as possible
  • For example, we changed the block placement policy to rack fault-tolerant, and under-replicated blocks became 300M+ after the upgrade
  • Troubleshooting becomes more difficult
  • The HttpFS upgrade can also be separated from this upgrade, as well as the ZooKeeper upgrade
• Imagine what will happen in production and test as much as possible in advance
  • Consider the difference between dev/staging and prod
  • There is a limit to what one person can imagine. Ask many colleagues!

HDFS Future works
• RBF improvements
  • Rebalance DNs/namespaces between subclusters well
  • Considering multiple subclusters, non-split DNs (or even in hybrid), and so on
• Erasure Coding in production
  • Internally backporting EC feature to the old HDFS client and the work mostly finished
• Try new low-pause-time GC algorithms
  • ZGC, Shenandoah

Q&A