September 26, 17
Slide overview
WalB is an open-source backup system that consists of block devices, called WalB devices, and userland utilities, called WalB tools. A WalB device records write I/Os. WalB tools extract them to create restorable snapshots incrementally.
Compared with dm-snap and dm-thin, WalB is designed to achieve small I/O latency overhead and short backup time. We conducted an experiment to take an incremental backup of a volume under random write workload. The result confirms those advantages of WalB.
Cybozu's cloud platform, which hosts 500 TB of volumes and processes 25 TB of write I/O per day, must achieve (1) stable workload performance without I/O spikes that may affect the application user experience and (2) the short backup interval specified in our service level objective. WalB satisfies both requirements, whereas dm-snap does not and dm-thin is not expected to.
I do research and development on educational operating systems, CPUs, compilers, and the like at Cybozu Labs, Inc.
WalB: A Fast and Low Latency Backup System for Block Devices Cybozu Meetup #8 SRE WalB Kota Uchida September 25, 2017 1
About me ▌Kota Uchida ▌SRE team at Cybozu, Inc. ▌A WalB developer 2
About Cybozu ▌A large cloud service vendor in Japan. ▌Largest market share in the field of collaborative software. ▌We serve web applications on our own cloud platform. kintone: a low-code business app platform and more 3
▌Customer companies: 20,000+ ▌Accesses / day: 210 million ▌Write IOs / day: 24.5 TiB 4
Service Level Objective ▌24/7 nonstop service ▌99.99% availability (4 min / month) ▌Daily backup (retention period is 14 days) ▌Disaster recovery: copy data to a remote site once a day 5
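The "4 min / month" figure follows directly from the 99.99% target. A minimal sketch of the arithmetic, assuming a 30-day month (the function name is hypothetical, chosen for illustration):

```python
def downtime_minutes(availability: float, days: float = 30.0) -> float:
    """Return the downtime budget in minutes over `days` days.

    Assumption: a 30-day month by default; the slide only states the rounded figure.
    """
    minutes_in_period = days * 24 * 60
    return (1.0 - availability) * minutes_in_period


# Budget for the 99.99% target, roughly 4.3 minutes per 30-day month.
budget = downtime_minutes(0.9999)
```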
Architecture of our platform The scope of this talk Storage Server L7LB Application Server Database Server Blob Server dm-snap Backup Server Diff Diff RAID 1 Storage Server dm-snap Remote Site Diff Diff 6
Snapshot Management with dm-snap ▌Logical structure: the snapshot image keeps A and B; after Write A’ and Write B’, the latest image holds A’ and B’. ▌Physical structure: (1) CoW copies the old blocks (A, B) to the snapshot area, tracked by the mapping info; (2) the new data (A’, B’) is written to the original volume area. 7
Backup using dm-snap ▌(1) Full-scan an old snapshot (Snapshot0: A, B) ▌(2) Full-scan a new snapshot (Snapshot1: A’, B’) ▌(3) Generate a diff image (A’, B’) by comparing the two snapshots 8
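The copy-on-write order on this slide can be illustrated with a toy in-memory model. This is a sketch of the general technique only, not the dm-snap implementation; the class and attribute names are hypothetical, and the snapshot is assumed to be taken at construction time.

```python
class CowSnapshotVolume:
    """Toy model of a copy-on-write snapshot (not dm-snap itself)."""

    def __init__(self, blocks):
        self.origin = list(blocks)  # original volume area: always the latest image
        self.snapshot_area = {}     # mapping info: block index -> preserved old data

    def write(self, index, data):
        # (1) CoW: preserve the old block once, before it is overwritten.
        if index not in self.snapshot_area:
            self.snapshot_area[index] = self.origin[index]
        # (2) Write the new data into the original volume area.
        self.origin[index] = data

    def read_snapshot(self, index):
        # The snapshot sees the preserved block if one exists, else the unchanged origin.
        return self.snapshot_area.get(index, self.origin[index])


vol = CowSnapshotVolume(["A", "B", "C"])
vol.write(0, "A'")
vol.write(1, "B'")
```

Every first write to a block costs an extra copy in this model, which is the overhead the later latency slides attribute to CoW.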
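A minimal sketch of the compare step, assuming each snapshot is modeled as a list of equally sized blocks (the function name is hypothetical). It shows why the cost grows with volume size rather than with the amount of changed data: every block of both snapshots is read.

```python
def diff_by_full_scan(old_snapshot, new_snapshot):
    """Return {block index: new data} for blocks that differ.

    Both snapshots are read in full, regardless of how few blocks changed.
    """
    if len(old_snapshot) != len(new_snapshot):
        raise ValueError("snapshots must cover the same number of blocks")
    return {
        index: new
        for index, (old, new) in enumerate(zip(old_snapshot, new_snapshot))
        if old != new
    }


diff_image = diff_by_full_scan(["A", "B", "C"], ["A'", "B'", "C"])
```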
Full-scan at night (figure: backup processing time by time of day, with daytime marked) 9
UX degradation during a full-scan Full-scanning 10
We have no more “nights” ▌Until now: a full scan was allowed only when the access rate was low, i.e., at night. ▌From now on: we have to handle accesses from multiple time zones. ▌We must be able to back up at any time without UX degradation. 11
New Solution ▌We need a new solution with no IO spikes and a short backup time ▌We compared dm-thin with WalB 12
What is dm-thin? ▌dm-thin provides thin-provisioning volume management: it shares the same data among volumes and reduces disk usage by using snapshots ▌It is in the mainline Linux kernel 13
Snapshot Management with dm-thin Logical Structure Latest Image A Physical Structure Latest Tree A
Snapshot Management with dm-thin Logical Structure Snapshot A Latest Image A Physical Structure Snapshot Tree Latest Tree A 15
Snapshot Management with dm-thin ▌Logical structure: the snapshot keeps A; Write A’ changes the latest image to A’. ▌Physical structure: (1) CoW in the latest tree and of the data block A; (2) update the latest tree and write A’, while the snapshot tree still refers to A. 16
Backup using dm-thin Logical Structure Snapshot0 A B Snapshot1 A’ B’ Physical Structure Snapshot0 Snapshot1 A B A’ B’ Generate a diff image using dm-thin metadata 17
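The idea of deriving a diff from metadata rather than from data can be sketched as follows. This is a simplified model under stated assumptions, not dm-thin's on-disk format: each snapshot is assumed to be a mapping from logical block to physical block, blocks shared between snapshots map to the same physical block, and all names are hypothetical. Discarded blocks are ignored.

```python
def changed_logical_blocks(old_map, new_map):
    """Logical blocks whose physical mapping differs between two snapshots."""
    return sorted(lb for lb, pb in new_map.items() if old_map.get(lb) != pb)


def diff_from_metadata(old_map, new_map, physical_store):
    """Read only the changed blocks; unchanged blocks are never touched."""
    return {
        lb: physical_store[new_map[lb]]
        for lb in changed_logical_blocks(old_map, new_map)
    }


# Physical blocks 10 and 11 hold A and B; CoW placed A' and B' in 12 and 13.
store = {10: "A", 11: "B", 12: "A'", 13: "B'", 14: "C"}
snapshot0 = {0: 10, 1: 11, 2: 14}
snapshot1 = {0: 12, 1: 13, 2: 14}  # block 2 is still shared
diff_image = diff_from_metadata(snapshot0, snapshot1, store)
```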
What is WalB? ▌A real-time and incremental backup system developed at Cybozu Labs ▌Can back up block devices without IO spikes (dm-snap: full scanning; WalB: no spikes) 18
Special Block Devices for WalB ▌Any application (file system, DBMS, etc.) reads from and writes to a WalB device ▌A WalB device consists of a data device (linear mapped) and a log device (ring buffer) 19
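A toy model of the two underlying devices may help here. It is a sketch under stated assumptions, not the WalB kernel driver: the class, the sequence counter, and the choice to raise an error when the log is full are all simplifications introduced for illustration.

```python
from collections import deque


class ToyWalbDevice:
    """Toy model: every write goes to a log ring buffer and to the data device."""

    def __init__(self, num_blocks, log_capacity):
        self.data = [None] * num_blocks   # data device: linear mapped
        self.log = deque()                # log device: bounded buffer of write records
        self.log_capacity = log_capacity
        self.next_seq = 0                 # hypothetical sequence number per write

    def write(self, index, data):
        if len(self.log) >= self.log_capacity:
            # Assumption for this sketch only: refuse writes when the log is full.
            raise RuntimeError("log device is full; extract logs first")
        self.log.append((self.next_seq, index, data))
        self.next_seq += 1
        self.data[index] = data

    def read(self, index):
        # Reads are served from the data device; the log is not consulted.
        return self.data[index]

    def extract_logs(self):
        """Hand the recorded writes to the backup side and free the log space."""
        records = list(self.log)
        self.log.clear()
        return records


dev = ToyWalbDevice(num_blocks=8, log_capacity=4)
dev.write(1, "A'")
dev.write(4, "B'")
```

Unlike the copy-on-write model, a write here never has to read or copy the old block first.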
Write IO Logging and Backup with WalB Time series of write I/Os Data Device 0 1 A Log Device 2 3 4 B Time 20
Write IO Logging and Backup with WalB Time series of write I/Os Data Device 0 1 A Log Device 2 3 4 B Write A’ A’ B 1 A’ Scan the log device and generate a diff image Time 21
Write IO Logging and Backup with WalB Time series of write I/Os Data Device 0 1 A Log Device 2 3 4 B Write A’ A’ B 1 A’ 1 A’ Write B’ A’ Time B’ 4 B’ Scan the log device and generate a diff image 22
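Generating a diff image from the log can be sketched as a single pass over the write records in time order. This is a minimal illustration, assuming each record is a `(block index, data)` pair; the record layout and function name are hypothetical, not WalB's log format.

```python
def diff_from_log(log_records):
    """Collapse a time-ordered write log into {block index: latest data}."""
    diff = {}
    for index, data in log_records:
        diff[index] = data  # a later write to the same block replaces the earlier one
    return diff


# The writes from the slide: A' to block 1, then B' to block 4.
diff_image = diff_from_log([(1, "A'"), (4, "B'")])
```

Only the logged writes are scanned, so the work is proportional to the amount of written data rather than to the volume size.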
Performance test ▌Compared dm-snap, dm-thin, and WalB ▌Executed a workload during a backup (the workload and the backup affect each other) ▌Measured the following metrics: latencies of the workload and backup time 23
Environment & Settings ▌Test environment: CPU: 2.40 GHz x 12 cores; MEM: 192 GiB; HDD: 4 TB HDD, RAID 6 (8D2P); NIC: 10 Gbps x 2; Kernel: 4.11 (latest upstream) ▌Test settings: 100 GiB volumes; workload: 4 KiB random writes over a 5 GiB range 24
Measuring the Backup Time (dm-snap, dm-thin) ▌Volume: 4 KiB random writes to a 5 GiB range; the other 95 GiB is unchanged ▌dm-snap: take a snapshot and scan the full image ▌dm-thin: get the structure of the snapshot trees, find the modified blocks (tree traversal), and read those blocks 25
Measuring the Backup Time (WalB) ▌Volume: 4 KiB random writes to a 5 GiB range of the WalB device; the other 95 GiB is unchanged ▌WalB: scan logs from the log device and continuously send them over the network to a backup server, which keeps them as diffs 26
Write I/O latency ▌dm-thin: IO spikes due to CoW, worse than dm-snap! ▌dm-snap: large due to CoW ▌WalB: small overhead compared with no-backup 27
Backup time ▌dm-thin: 2260 (slower than dm-snap) ▌dm-snap: 1146 ▌WalB: 1.2 (so fast!) 28
Conclusion ▌dm-snap & dm-thin: high I/O latency during a backup; long backup time ▌WalB: stable and low I/O latency (no spikes); short backup time ▌WalB satisfies our requirements for production use. 29
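The sizes in this setup bound how much data each approach has to read. A short sketch of the arithmetic, using only the figures from the test settings (100 GiB volume, 5 GiB write range, 4 KiB writes); the variable names are illustrative:

```python
KIB = 1024
GIB = 1024 ** 3

volume_bytes = 100 * GIB
written_range_bytes = 5 * GIB
io_size = 4 * KIB

# Upper bound on the number of distinct 4 KiB blocks the workload can modify.
max_changed_blocks = written_range_bytes // io_size

# A full scan reads the whole volume; a changed-block scan reads at most the
# written range, so the full scan reads at least this many times more data.
full_scan_ratio = volume_bytes / written_range_bytes
```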
Try WalB! ▌Project page https://walb-linux.github.io/ ▌Tutorial https://github.com/walb-linux/walbtools/tree/master/misc/vagrant/ Vagrantfile for Ubuntu 16.04 and CentOS 7 30
Incremental backup Remote Host Volume Backup Host Diff Diff Apply everyday … Diff Base Diff files for 14 days ▌Daily backup (retention period is 14 days) ▌Worker daemon of WalB selects diff files older than 14 days and applies them to a base image. 31
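The retention step can be sketched as follows, assuming the base image and each diff are modeled as dictionaries from block index to data and each diff carries the date it was taken. The function name and data layout are hypothetical and do not reflect the WalB worker daemon's actual interface.

```python
from datetime import date, timedelta


def consolidate(base, diffs, today, retention_days=14):
    """Apply diffs older than the retention period to `base` (in place).

    `diffs` is a list of (taken_on, {block index: data}) pairs.
    Returns the diffs that are still within the retention period.
    """
    cutoff = today - timedelta(days=retention_days)
    kept = []
    # Apply oldest first so that newer data wins for the same block.
    for taken_on, diff in sorted(diffs, key=lambda item: item[0]):
        if taken_on < cutoff:
            base.update(diff)
        else:
            kept.append((taken_on, diff))
    return kept


base = {0: "A", 1: "B"}
diffs = [
    (date(2017, 9, 1), {0: "A1"}),
    (date(2017, 9, 20), {0: "A2"}),
]
kept = consolidate(base, diffs, today=date(2017, 9, 25))
```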
Restoring a volume Remote Host Diff Diff … Diff Apply all diffs Base Base' Writable snapshot ▌To restore the latest state of a volume: take a snapshot of a base image, and apply all diff files to it. 32
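A minimal sketch of this restore procedure, again modeling images and diffs as dictionaries from block index to data (an assumption made for illustration; the function name is hypothetical). Copying the base stands in for taking a writable snapshot, so the base image itself is left untouched.

```python
def restore_latest(base, diffs):
    """Return the latest volume state: a copy of `base` with all diffs applied.

    `diffs` must be ordered oldest first.
    """
    restored = dict(base)  # stands in for the writable snapshot (Base')
    for diff in diffs:
        restored.update(diff)
    return restored


base = {0: "A", 1: "B"}
latest = restore_latest(base, [{0: "A'"}, {1: "B'"}, {0: "A''"}])
```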
Make restoration faster 1/2 Remote Host Diff … Diff 1 2 Diff Diff Base 14 dm-thin snapshots for each day ▌Fast restoration by preparing read-only snapshots for each day 33
Make restoration faster 2/2 Remote Host Diff … Diff 1 2 Diff Diff Base 14 ▌Apply some diffs to the appropriate snapshot. ▌At most 24 hours of diffs need to be applied. Faster! 34
Worldline: restoring a whole environment ▌"Worldline" means a parallel world. ▌We back up configurations in addition to user data. Configurations: definitions for each customer (ID, FQDN, Apps, …), application version definition, host definition, etc. ▌It is important to use applications whose versions are consistent with the user data backed up earlier. 35
Worldline: restoring a whole environment ▌A daily script takes a snapshot of a whole environment. ▌A weekly script restores the latest backup, so we can use it to investigate failures or to develop our services. User data Backup Diff Diff Worldline Restore Snap shot Config DB' Spare hosts Config DB Restore Backup Diff Diff 36
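Choosing the appropriate snapshot can be sketched as picking the newest snapshot at or before the target time and applying only the diffs taken after it. This is an illustrative model, assuming timestamps are plain numbers (hours here) and images and diffs are dictionaries; the function name and data layout are hypothetical.

```python
def restore_at(snapshots, diffs, target):
    """Restore the state at `target` from the nearest earlier snapshot.

    `snapshots` maps timestamp -> image; `diffs` is a list of (timestamp, diff).
    Raises ValueError if no snapshot exists at or before `target`.
    """
    start = max(t for t in snapshots if t <= target)
    image = dict(snapshots[start])
    for taken_at, diff in sorted(diffs, key=lambda item: item[0]):
        if start < taken_at <= target:
            image.update(diff)
    return image


# Daily snapshots at hour 0 and hour 24; diffs arrive in between.
snapshots = {0: {0: "A"}, 24: {0: "B"}}
diffs = [(10, {0: "A1"}), (24, {0: "B"}), (30, {1: "X"}), (40, {1: "Y"})]
state = restore_at(snapshots, diffs, target=35)
```

With one snapshot per day, the loop never applies more than a day's worth of diffs, which matches the bound stated on the slide.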
Q&A twitter: @uchan_nos 37