July 28, 17
Slide overview
https://connpass.com/event/61546/
Moved to Speaker Deck in October 2023. For the latest information, see https://speakerdeck.com/lycorptech_jp
Apache: Big Data North America 2017 Attendance Report
July 27, 2017
Yahoo Japan Corporation, Takuya Nomura
Copyright © 2017 Yahoo Japan Corporation. All Rights Reserved.
Self-introduction
Takuya Nomura, joined the company in 2010
• Support for aggregation jobs using Hadoop
• Machine learning for ad delivery
• Storm application development
Contents
• Apache Big Data overview
• The big picture of stream processing
• Sessions attended
  • Beam
  • Stateful Streaming Data Pipelines with Apache Apex
• Summary
Apache Big Data 2017 Miami
• Apache: Big Data North America 2017
• May 16 – 18 @ Miami, Florida
• Apache Projects
• Developers, operators and users working in Big Data
• http://events.linuxfoundation.org/events/apache-big-data-north-america
• About 400 attendees in total
• About 20 attendees from Japan
• A birthday cake for the ASF's 18th anniversary
Number of sessions
Session type                                    Count
Use Cases                                       16
Ops                                             11
Streaming                                       9
SQL                                             9
Hadoop                                          7
Beam/Zeppelin                                   6
Big Data                                        6
Cassandra                                       5
Deep Learning/GPU                               5
Machine Learning/Natural Language Processing    4
Spark                                           4
Mentions of stream processing engines
Number of session abstracts (out of 82) that mention each stream engine
Engine                   Count
Apex                     6
Flink                    6
Samza                    2
Spark Streaming          2
Gearpump                 1
Google Cloud Dataflow    1
Storm                    0
For reference
Spark Kafka Beam 14 abstracts
count 28 12 4
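The slides do not say how the mentions were counted. One plausible approach, sketched here under that assumption, is a case-insensitive whole-word search of each abstract. The abstracts below are made-up placeholders, not the real conference abstracts, and `count_mentions` is a hypothetical helper.

```python
import re

# Hypothetical abstracts; the real conference abstracts are not reproduced here.
abstracts = [
    "Stateful pipelines with Apache Apex and Kafka.",
    "Running Beam pipelines on Flink and Apex.",
    "Tuning Spark Streaming jobs that read from Kafka.",
]

engines = ["Apex", "Flink", "Samza", "Spark Streaming", "Storm"]


def count_mentions(texts, names):
    """Return {name: number of texts that mention name at least once}."""
    counts = {}
    for name in names:
        # Whole-word, case-insensitive match so "Apex." and "apex" both count.
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        counts[name] = sum(1 for text in texts if pattern.search(text))
    return counts


mentions = count_mentions(abstracts, engines)
```

Each abstract is counted at most once per engine, which matches the "number of abstracts" framing on the slide.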
Around this time last year
• Moriya's Hadoop Summit attendance report (2016/7/22)
  • https://www.slideshare.net/techblogyahoo/hadoop-summit-2016-san-jose-streamctjp
Trends in stream processing engines
Impression from attending Apache Big Data
→ Every stream processing engine now covers the basic features
• Beam Model support
• unified API
• event time
• late data
• exactly once
• state management
• High level API
• SQL
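To make "event time" and "late data" from the feature list concrete, here is a minimal sketch that is not tied to any of the engines. It assumes tumbling windows, a watermark simplified to "the highest event time seen so far", and a fixed allowed lateness. All names and constants are illustrative.

```python
from collections import defaultdict

WINDOW_SIZE = 10       # seconds per tumbling window (assumed value)
ALLOWED_LATENESS = 5   # seconds a window accepts data past the watermark (assumed value)


def window_start(event_time):
    """Start of the tumbling window that contains event_time."""
    return event_time - (event_time % WINDOW_SIZE)


def process(events):
    """events: (event_time, value) pairs in arrival order.

    Returns (sum per window start, list of events dropped as too late).
    """
    sums = defaultdict(int)
    dropped = []
    watermark = 0  # simplification: highest event time seen so far
    for event_time, value in events:
        watermark = max(watermark, event_time)
        start = window_start(event_time)
        window_end = start + WINDOW_SIZE
        if window_end + ALLOWED_LATENESS <= watermark:
            dropped.append((event_time, value))  # window already closed
        else:
            sums[start] += value  # grouped by event time, not arrival time
    return dict(sums), dropped


# The event at time 8 arrives late but within the allowed lateness;
# the event at time 3 arrives after its window has closed.
events = [(1, 1), (12, 1), (8, 1), (27, 1), (3, 1)]
sums, dropped = process(events)
```

Real engines derive watermarks from the sources and let the pipeline choose how late results are emitted, so this only illustrates the vocabulary.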
Every stream processing engine now covers the basic features
                    Spark 2.x         Samza          Apex    (Flink)    (Storm)
unified API         ◯                 ◯              ◯       ◯          stream only
event time          ◯                 coming soon    ◯       ◯          ◯
late data           ◯                 coming soon    ◯       ◯          ―
state management    ◯                 ◯              ◯       ◯          ◯
SQL                 ◯                 coming soon    ◯       ◯          ◯
BEAM Runner         in development    coming soon    ◯       ◯          ―
The basic features are covered → so what sets each stream processing engine apart?
Quote: Advantage of Dstreams (Spark 2.0)
• Processing with event-time, dealing with late data
• Exactly same API for batch, streaming, and interactive
• End-to-end exactly-once guarantees for the system
• Performance through SQL optimizations
• Logical plan optimizations, Tungsten, Codegen, etc.
• Faster state management for stateful stream management
“Continuous Applications with Apache Spark 2.0”
Quote: Key differentiators for Apache Samza
• Stream processing both as a multi-tenant service and a light-weight embedded library
• No micro batching (first class streaming)
• Unified processing of batch and streaming data
• Efficient remote calls I/O using build-in async mode
• World-class support for scalable local state
• Incremental check-pointing
• Instant recover with zero down time
• Low level API and Stream based high Level API (DSL)
“What It Takes to Process a Trillion Events a Day: Case-Studies in Scaling Stream Processing at LinkedIn”
Quote: Why Apex
• State management
• Processing based on event time
• Native streaming
• Scalability
• Library of connectors and transformations
• Exactly-once. Fault tolerance, windowing
• Fine grained recovery, low-latency SLA support
• Queryable state
• Correctness, repeatable/replay
• Low latency + high throughput, efficient resource utilization
• Pipelined processing (data in motion)
• Process more data by adding compute resources, no platform/architecture limit.
• Dynamic scaling and resource allocation, elasticity
• Time to value
"From Batch to Streaming ET(L) with Apache Apex"
Apache Beam
• First stable release on the day of the talk (5/17)
• Sessions attended
  • “Using Apache Beam for Batch, Streaming, and Everything in Between”
    • Daniel Halperin: Apache Beam PMC, Google
  • “Apache Beam: Integrating the Big Data Ecosystem Up, Down, and Sideways”
    • Davor Bonaci: Apache Beam PMC chair, Google
    • Jean-Baptiste Onofré: Apache Beam PMC, Talend
Photo Stream: http://events.linuxfoundation.org/events/apache-big-data-north-america
Apache Beam
Programming Model
http://events.linuxfoundation.org/sites/events/files/slides/Beam%202017%20ApacheCon.pdf
Programming Model
• SDK
  • Java / Python
• Runners
  • Apache Apex
  • Apache Flink
  • Apache Gearpump
  • Apache Spark
  • Google Cloud Dataflow
Enable dynamic adaptation
• The goal is for the Reader to report its progress to the Runner, so that the Runner can take actions based on execution time
Dynamic Work Rebalancing + Autoscaling
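The idea of a Reader reporting progress so that a Runner can rebalance work can be sketched as follows. This is a simplified illustration rather than Beam's actual API: `RangeReader`, `fraction_consumed`, and `split_remainder` are hypothetical names, and the work is assumed to be a plain integer range.

```python
class RangeReader:
    """Reads integer positions in [start, end) and can hand off its unread tail."""

    def __init__(self, start, end):
        self.start = start
        self.end = end
        self.position = start  # next position to read

    def read_next(self):
        """Return the next position, or None when the range is exhausted."""
        if self.position >= self.end:
            return None
        value = self.position
        self.position += 1
        return value

    def fraction_consumed(self):
        """Progress report that a runner could poll."""
        total = self.end - self.start
        return 1.0 if total == 0 else (self.position - self.start) / total

    def split_remainder(self, keep_fraction=0.5):
        """Keep keep_fraction of the unread range and return a reader for the
        rest, or None if there is too little left to split."""
        remaining = self.end - self.position
        split_point = self.position + int(remaining * keep_fraction)
        if remaining < 2 or split_point >= self.end:
            return None
        residual = RangeReader(split_point, self.end)
        self.end = split_point  # this reader now stops earlier
        return residual


# A slow worker has read 20 of 100 positions; the runner asks it to give up
# half of what is left so that another worker can take over that part.
reader = RangeReader(0, 100)
for _ in range(20):
    reader.read_next()
progress_before = reader.fraction_consumed()
residual = reader.split_remainder()
```

The split never touches positions that were already read, so no record is processed twice; autoscaling would then decide whether a new worker is started for the residual range.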
Extensibility points
• SDK
• Runner
• DSLs
• Libraries transformations
• IO Connector
• File System
Ecosystem
• SDK: Python is now supported in addition to Java
• Runner: Apache Gearpump was added; upcoming: Spark 2, JStorm
• DSLs: upcoming: Scala (Scio), Calcite
• Libraries transformation: TensorFlow as an example
• IO Connector: Kafka, elasticsearch, HBase, … about 20 in total; upcoming: Cassandra, Redis
• File System: HDFS, Amazon S3, and others
Miscellaneous notes
• Beam drew a great deal of attention
  • Both sessions I attended ran out of chairs and were standing room only
• Added in the stable release
  • Stateful support
  • Metrics subsystem
Stateful Streaming Data Pipelines with Apache Apex
• Chandni Singh
  • Apache Apex PMC and Committer, Simplifi.it
• Timothy Farkas
  • Apache Apex Committer, Simplifi.it
• Introduction to Apache Apex
• Problems with stateful management and how they are addressed
Apache Apex overview
• Distributed data processing engine
• Runs on Hadoop
• Real-time streaming
• Fault-tolerant
  • A checkpoint is taken every N windows
  • Committed window: the latest window that all operators have checkpointed
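The committed window definition can be read as a minimum over operators. The sketch below assumes that window ids increase monotonically and that every operator checkpoints at the same window boundaries. The operator names and the `committed_window` helper are hypothetical.

```python
def committed_window(last_checkpointed):
    """last_checkpointed: {operator name: id of the latest window it has checkpointed}.

    Returns the latest window id that every operator has checkpointed.
    """
    return min(last_checkpointed.values())


# Hypothetical pipeline with checkpoints every 10 windows.
checkpoints = {"reader": 30, "parser": 20, "writer": 10}
committed = committed_window(checkpoints)

# Once the slowest operator checkpoints window 20, the committed window advances.
checkpoints["writer"] = 20
committed_after = committed_window(checkpoints)
```

Under these assumptions, state from before the committed window is no longer needed for recovery, which is why the slowest operator determines how far it can advance.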
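Problems with stateful management
• Checkpoint time correlates with the operator's state size
• As state grows, the operator eventually crashes
  • Before that happens, YARN would probably kill it for responding too slowly
• → Managed State
  • Key/value state is updated incrementally
  • Once a threshold is reached, data is written from memory to the WAL
  • Keys are partitioned into user-defined buckets
  • State is ultimately persisted to HDFS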
Managed State
• managedState.put(1L, key, value)
• managedState.getSync(1L, key)
• managedState.getAsync(1L, key)
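The flow described for Managed State (incremental key/value updates, a spill from memory to the WAL at a threshold, user-defined buckets, final storage on HDFS) can be imitated in a few lines. This is a toy model and not the Apex implementation: it assumes the first argument of `put`/`getSync` is a bucket id, uses a list as the WAL and a dict as a stand-in for HDFS, and omits the asynchronous read.

```python
class ManagedStateSketch:
    """Toy model of bucketed key/value state with spill-to-WAL."""

    def __init__(self, spill_threshold=3):
        self.spill_threshold = spill_threshold
        self.memory = {}  # bucket id -> {key: value}, recent updates only
        self.wal = []     # append-only log of spilled (bucket id, key, value)
        self.store = {}   # stand-in for HDFS: bucket id -> {key: value}

    def put(self, bucket_id, key, value):
        self.memory.setdefault(bucket_id, {})[key] = value
        if sum(len(b) for b in self.memory.values()) >= self.spill_threshold:
            self._spill()

    def _spill(self):
        """Move only the changed entries from memory to the WAL."""
        for bucket_id, entries in self.memory.items():
            for key, value in entries.items():
                self.wal.append((bucket_id, key, value))
        self.memory = {}

    def commit(self):
        """Apply the WAL to the durable store, then truncate the WAL."""
        for bucket_id, key, value in self.wal:
            self.store.setdefault(bucket_id, {})[key] = value
        self.wal = []

    def get_sync(self, bucket_id, key):
        """Look in memory first, then the WAL (newest first), then the store."""
        bucket = self.memory.get(bucket_id, {})
        if key in bucket:
            return bucket[key]
        for b, k, v in reversed(self.wal):
            if b == bucket_id and k == key:
                return v
        return self.store.get(bucket_id, {}).get(key)


state = ManagedStateSketch(spill_threshold=2)
state.put(1, "a", 10)
state.put(1, "b", 20)  # threshold reached: both entries move to the WAL
state.put(1, "a", 11)  # newer value for "a" stays in memory
```

Because only changed entries leave memory, the cost of a spill depends on how much was updated rather than on the total state size, which is the property the slide contrasts with full-state checkpoints.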
Miscellaneous notes
• The session did not touch on performance improvements
• The blog post published at release time says the following
  • we improved greatly how objects are serialized and reduced garbage collection considerably in the Managed State layer.
Summary
• Stream processing
  • It is worrying that there was no talk about Storm
  • Where will the differentiating factors be?
• Expectations for Apache Beam
  • Unified API
  • Dynamic Work Rebalancing
  • Curious to see how the I/O Connectors and Libraries expand
• Stateful management
  • How to cope when state grows huge
References
• Using Apache Beam for Batch, Streaming, and Everything in Between
  • http://events.linuxfoundation.org/sites/events/files/slides/2017-0517%20dhalperi%20%40ApacheCon%20Big%20Data.pdf
• Apache Beam: Integrating the Big Data Ecosystem Up, Down, and Sideways
  • http://events.linuxfoundation.org/sites/events/files/slides/Beam%202017%20ApacheCon.pdf
• Stateful Streaming Data Pipelines with Apache Apex
  • http://events.linuxfoundation.org/sites/events/files/slides/Stateful%20streaming%20data%20pipelines.pdf#search=%27Spillable+Data+Structures%27