@kimutansk
Introduction to Stream Data Processing
Stream Data Processing 101
ScalaMatsuri 2017
Kimura, Sotaro(@kimutansk)
https://www.flickr.com/photos/esoastronomy/14255636846
@kimutansk
Self Introduction
• Kimura, Sotaro(@kimutansk)
– Data Engineer at DWANGO Co., Ltd.
• Maintenance of the data analytics infrastructure
• Development of ETL pipelines
• Construction of data marts
• And other work related to the data analytics infrastructure
– Favorite technology field
• Stream Processing technologies
• Distributed computing systems
– Favorite OSS products
• Apache Kafka
• Apache Beam
• Apache NiFi
I am Sotaro Kimura (@kimutansk).
I work as a data engineer at DWANGO.
@kimutansk
In the beginning
• In this presentation, I use “Stream Processing”
instead of “Stream Data Processing”.
– “Stream Processing” is the term more commonly used in related articles.
The title is “Stream Data Processing 101”, but
this material uses the term “Stream Processing”.
@kimutansk
Agenda
• What is Stream Processing?
• Data processing patterns
• Troubles of Stream Processing
• Stream Processing system structure and products
• Technical consideration points
• Real Stream Processing performance problems
• Stream Processing system misconceptions
First I explain what stream processing is,
then the products that realize it and the points to consider.
@kimutansk
What is Stream Processing?
What is stream processing?
@kimutansk
In a nutshell
• Model of data processing designed for continuously
generated unbounded data sets.
– More detail...
In a word:
“a data processing model designed to process unbounded data.”
@kimutansk
Stream Processing properties
• Unbounded data
– Ever-growing, essentially infinite data set.
– These are often referred to as “streaming data”.
• Ex) System logs, sensor data, activity logs, etc.
• Unbounded processing
– The data is unbounded, so processing is also unbounded.
– The term “unbounded” distinguishes it from batch processing.
• Low latency, approximate, speculative results
– Because of the problems inherent to Stream Processing,
systems often output approximate, speculative results.
– Batch processing is traditionally designed for high-latency,
complete results.
“Processes unbounded data”, “processing continues indefinitely”,
“low latency, often approximate or speculative output”.
@kimutansk
Usage of Stream Processing
• Billing
– Ex) Cloud service billing. Mobile communication billing.
• Live cost estimating
– Ex) Cloud service usage cost. Mobile data usage.
• Anomaly/Change event detection
– Ex) Fraudulent logins. System failures. Recommendations.
Weather data anomaly detection.
• Detection backfill
– Ex) System failure recovery (after notification).
Follow-up notification on the progress of a weather data anomaly.
Uses of stream processing: billing, live cost estimation,
anomaly/change event detection, and detection-result backfill.
@kimutansk
Data processing patterns
Data processing patterns
@kimutansk
Big data processing patterns
• The typical big data processing patterns are listed below.
Three models are commonly cited for big data processing:
batch processing, interactive query, and stream processing.
Batch Processing vs. Interactive Query vs. Stream Processing
• Execute timing: manual / periodic | manual / periodic | continuous
• Processing target: archived data | archived data | unbounded, continuously generated stream data
• Processing time: minutes ~ hours | seconds ~ minutes | permanent
• Data size: TBs ~ PBs | GBs ~ TBs | Bs ~ KBs (per event)
• Latency: minutes ~ hours | seconds ~ minutes | milliseconds ~ seconds
• Typical applications: ETL, reporting, ML model generation | business intelligence, analytics | anomaly detection, recommendation, visualization
• OSS products: MapReduce, Spark | Impala, Presto, Drill, Hive | (described later)
@kimutansk
Batch Processing
• Processes “archived data” and outputs the results to a data store.
Batch processing: a model that transforms data accumulated in a data store
in bulk and outputs the results.
Processed data destination
= data store.
@kimutansk
Interactive query
• Processes “archived data”; the results are retrieved by the client.
Interactive query: a model that transforms data accumulated in a data store
in bulk and returns the results to the client.
Processed data destination
= client.
@kimutansk
Stream Processing
Stream processing: a model for processing data that is generated endlessly.
Output destinations vary widely depending on the system.
(Diagram: unbounded data sources — mobile activity, sensor data, system logs — flow into a message bus, then into a Stream Processing engine, and on to output systems.)
• Processes continuously generated, unbounded data.
– Output destinations depend on the system.
@kimutansk
Difference between Batch vs Stream
• The difference is whether the input data is complete or not.
The difference between batch processing (and interactive query) and stream processing
is whether the input data is fully available.
Input data is complete! / Input data keeps streaming in!
@kimutansk
Batch Processing’s premise
• Batch Processing’s premises:
– When a batch job executes,
the data must be complete.
• The target data must be bounded.
– Basically, producing results that span several batches is difficult.
• Basic Batch Processing model
Premises of batch processing: “the data is complete at execution time” and
“producing results across batches is difficult”.
MapReduce
@kimutansk
Batch Processing pattern
• To produce multiple outputs, the batch is executed multiple times.
• To produce time-sliced outputs, the batch reads the input covering the whole
time range.
“Run the batch multiple times to produce multiple results”;
“to slice results by time, feed in the data covering all of those slices”.
(Diagram: one MapReduce job per day for 2/26, 2/27, 2/28, and a 2/28 job whose output is sliced into [00:00 ~ 06:00), [06:00 ~ 12:00), [12:00 ~ 18:00), [18:00 ~ 24:00).)
@kimutansk
Batch Processing problems
• For jobs that compute user sessions,
Batch Processing is not well suited.
– If a user session continues across 2/27 and 2/28,
the 2/27 result must be re-output.
– And if the session continues even further... really?
When computing user sessions, sessions that cross a day boundary are cut off.
Re-reading the past and re-outputting results is difficult.
(Diagram: MapReduce jobs for 2/27 and 2/28, with the Red, Yellow, and Green users’ sessions spanning the two days.)
@kimutansk
Batch Processing is...?
• The multiple-outputs pattern can be redrawn as shown below.
– Which means that...
The “run the batch multiple times” figure can be redrawn as on this slide.
In other words...?
(Diagram: the daily inputs 2/26, 2/27, 2/28 lined up on one timeline, each processed by its own Map/Reduce job.)
@kimutansk
Batch is subset of Stream!
• The multiple-outputs pattern is nothing but
“unbounded stream data” bounded into intervals.
In other words, this is simply the unbounded stream data
cut into fixed-length intervals.
Bounded finite stream
Bounded by interval
2/26 2/27 2/28
Unbounded infinite stream
@kimutansk
Batch is subset of Stream!
• This means that Batch Processing is
a subset pattern of Stream Processing.
In other words, batch processing is a restricted processing model
within stream processing.
Bounded finite stream
Bounded by interval
2/26 2/27 2/28
Unbounded infinite stream
Stream
Processing
Batch
Processing
@kimutansk
Batch premise does not hold.
• In Stream Processing:
• Batch premise: data completeness
– The data is generated continuously, so it never holds.
• Batch premise: results spanning batches are difficult
– Processing runs continuously, which is different from that premise.
• If data completeness is satisfied,
Stream Processing can perform the same processing as
Batch Processing.
The premises of batch processing do not hold in stream processing.
But if the data is complete, stream processing can do the same things as batch processing.
However...
@kimutansk
Troubles of Stream Processing
What troubles does stream processing have that batch processing does not?
@kimutansk
New problem in Stream Processing
• Data ingest order is different from actual order!
– This is called “out of order”.
– Ex.) Phones kept in airplane mode for an entire flight.
• Example causes
– Network disconnects / delays.
– Time gaps between the servers that constitute the system.
• So there are typically two domains of time within such systems.
– EventTime , which is the time when events actually
occurred.
– ProcessingTime, which is the time when events are
ingested by the system.
In stream processing, data does not arrive in the order it was generated.
Therefore two notions of time exist: “EventTime” and “ProcessingTime”.
@kimutansk
Why is this a problem?
• If there is no relation between ingested records,
“out of order” is not a problem.
– Except that the processing result is approximate.
• But real Stream Processing systems
need a way to group data.
– Ex) In anomaly event detection,
few cases can be detected from a single event.
“Many logins occur in a short time”, etc.
Relations between earlier and later events are needed.
Even if data does not arrive in order of occurrence, it does not matter as long as the records are unrelated.
In practice, though, relations are needed, e.g. for fraudulent-login detection.
@kimutansk
Data grouping concept
• A “window” is the concept used to group data.
Windows are the data-grouping concept.
Typical kinds are tumbling (fixed-length), sliding, and session windows.
(Diagram along a time axis: Tumbling Window, Sliding Window, Session Window.)
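To make the three window kinds concrete, here is a minimal sketch using the Flink Scala API; the Login events, the key, and the 5/10/30-minute sizes are illustrative assumptions, not taken from the slides.

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.{EventTimeSessionWindows, SlidingEventTimeWindows, TumblingEventTimeWindows}
import org.apache.flink.streaming.api.windowing.time.Time

object WindowKinds {
  case class Login(userId: String, count: Int, eventTime: Long)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

    // Tiny in-memory source for illustration; a real job would read from Kafka etc.
    val logins: DataStream[Login] = env
      .fromElements(Login("alice", 1, 1000L), Login("alice", 1, 4000L), Login("bob", 1, 2000L))
      .assignAscendingTimestamps(_.eventTime)

    // Tumbling window: fixed-length, non-overlapping 5-minute buckets
    val tumbling = logins.keyBy(_.userId)
      .window(TumblingEventTimeWindows.of(Time.minutes(5)))
      .sum("count")

    // Sliding window: 10-minute windows that advance every 1 minute (overlapping)
    val sliding = logins.keyBy(_.userId)
      .window(SlidingEventTimeWindows.of(Time.minutes(10), Time.minutes(1)))
      .sum("count")

    // Session window: a window closes after a 30-minute gap of inactivity per key
    val sessions = logins.keyBy(_.userId)
      .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
      .sum("count")

    tumbling.print(); sliding.print(); sessions.print()
    env.execute("window kinds")
  }
}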
@kimutansk
The out-of-order problem
• With windows, “out of order” becomes a real problem.
– What if, after the [00:00 ~ 06:00) result has been output,
a record stamped “05:55” arrives...?
Once windows are used, “data does not arrive in order” becomes a problem.
What do you do if data for that time range arrives after the result has been output?
(Diagram: the [00:00 ~ 06:00) window — 1. Output, then 2. Arrive...?)
@kimutansk
Countermeasure for trouble
• For this Stream Processing problem,
three concepts are proposed as countermeasures.
• Watermark
– Up to what point in EventTime the ingested data is considered
complete.
• Trigger
– When to output aggregated results.
• Accumulation
– How successive refinements of a result relate to each other.
To deal with this problem, the concepts of “Watermark”, “Trigger”,
and “Accumulation” have been proposed.
@kimutansk
What is watermark?
• A watermark is the notion of how far processing
has progressed in the EventTime domain.
– EventTime and ProcessingTime exist independently,
so there is skew between EventTime and ProcessingTime.
A concept indicating how far processing has progressed in EventTime.
It is needed because the actual processing time diverges from that progress point.
(Diagram: EventTime vs. ProcessingTime — the ideal system follows the diagonal; the real system (≒ watermark) lags behind it by a skew.)
@kimutansk
What is watermark?
• Usage of the watermark
– If the watermark time is “X”,
all events whose EventTime is before “X” have been processed.
• But...
– A watermark cannot be perfect.
– Data arrives out of order,
so the watermark is only an approximation.
• Nevertheless, the watermark is useful.
– It serves as the reference point for deciding when to process.
Note, however, that the watermark is also an approximation and not perfect.
Still, it is useful as the reference point for processing timing.
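As a concrete illustration, a watermark is typically derived from the largest EventTime seen so far, minus an allowance for out-of-orderness. A minimal Flink Scala sketch follows; the Login type and the 30-second bound are illustrative assumptions.

import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.windowing.time.Time

case class Login(userId: String, eventTime: Long)

// Watermark = (max EventTime seen so far) - 30 seconds: events older than the
// watermark are treated as late and handled by triggers / allowed lateness.
class LoginWatermarks extends BoundedOutOfOrdernessTimestampExtractor[Login](Time.seconds(30)) {
  override def extractTimestamp(login: Login): Long = login.eventTime
}

// usage, assuming a DataStream[Login] named logins:
//   val withWatermarks = logins.assignTimestampsAndWatermarks(new LoginWatermarks)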
@kimutansk
What is trigger?
• A trigger is the mechanism that decides when aggregated results
should be output.
– With triggers, output timing can be flexible, and a window can fire multiple times.
– In addition, the system can handle data that arrives later than the watermark.
• Example
– When the watermark reaches the end of the window, output the results.
A trigger defines when to output aggregated results.
Triggers make output timing flexible and allow multiple firings.
PCollection<KV<String, Integer>> wordCountResult =
wordOneCountStream.apply("Fixed Windowing",
Window.<KV<String, Integer>>into(
FixedWindows.of(Duration.standardMinutes(2)))
.triggering(Repeatedly.forever(AfterWatermark.pastEndOfWindow())))
.apply(Sum.integersPerKey());
@kimutansk
What is trigger?
• Example
– When data arrives later than the watermark, output the results again.
By introducing triggers, data that arrives later than the watermark
can also be handled.
PCollection<KV<String, Integer>> wordCountResult =
wordOneCountStream.apply("Late Firing",
Window.<KV<String, Integer>>into(
FixedWindows.of(Duration.standardMinutes(2)))
.triggering(AfterWatermark.pastEndOfWindow()
.withLateFirings(
AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardMinutes(1)))))
.apply(Sum.integersPerKey());
@kimutansk
What is trigger?
• Example
– When data arrives later than the watermark, output the results again.
– The allowed lateness is limited to 5 minutes.
Triggers also let you specify how far behind the watermark data may be
before any later data is discarded.
PCollection<KV<String, Integer>> wordCountResult =
wordOneCountStream.apply("Late Firing until 5 min late",
Window.<KV<String, Integer>>into(
FixedWindows.of(Duration.standardMinutes(2)))
.triggering(AfterWatermark.pastEndOfWindow()
.withLateFirings(
AfterProcessingTime.pastFirstElementInPane()
.plusDelayOf(Duration.standardMinutes(1))))
.withAllowedLateness(Duration.standardMinutes(5))
.accumulatingFiredPanes())
.apply(Sum.integersPerKey());
@kimutansk
What is accumulation?
• Accumulation defines how multiple refinements of a result
are treated.
– Triggers are used to produce multiple outputs for a window.
– So it must be decided how those refinements relate to each other.
– The right choice depends on the target system:
• Systems that have their own accumulation functions.
• Systems that depend on a key-value datastore.
• Systems consisting of multiple components whose
persistence methods differ.
Accumulation is the policy for how results are treated when a trigger fires
multiple times. Which policy is best depends on the system.
@kimutansk
What is accumulation?
• Three typical accumulation modes
– Discarding mode
• When an aggregated result is output, the previous result is discarded.
• The next result contains only data that arrived after the previous output.
– Accumulating mode
• When an aggregated result is output, the previous result is kept.
• The next result accumulates the data that arrived after the previous output.
– Accumulating & Retracting mode
• When an aggregated result is output, the previous result is kept.
• The next result contains the accumulated result plus a retraction of
the previous result.
Typical modes: “Discarding” (per-output), “Accumulating” (accumulate results),
and “Accumulating & Retracting” (accumulation plus retraction).
@kimutansk
What is accumulation?
• Output example for each mode
– Aggregating the [12:00~12:02) result
• Arriving data
Here is an example for each mode.
Outputting the [12:00~12:02) aggregation in each mode gives...
No Processing Time Event Time Event Value
1 12:05 12:01 7
2 12:06 12:01 9
3 12:06 12:00 3
4 12:07 12:01 4
@kimutansk
What is accumulation?
• Output example for each mode
– Aggregating the [12:00~12:02) result
• Output data
The final value and running total for each output timing are shown in the table.
Accumulating & Retracting can cope even when multiple systems are mixed.
Output Timing Discard Accumulating Accumulating & Retract
12:05 7 7 7
12:06 12 19 19,-7
12:07 4 23 23,-19
Final Output 4 23 23
Total Output 23 49 23
Final & Total output are same
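The table can be reproduced with a small plain-Scala sketch (no stream processing framework involved; the per-firing groups 7 / 9,3 / 4 come straight from the arrival table above):

object AccumulationModes {
  def main(args: Array[String]): Unit = {
    // values that arrived before each trigger firing (12:05, 12:06, 12:07)
    val panes = List(List(7), List(9, 3), List(4))

    // Discarding: each output only covers data since the previous firing
    val discarding = panes.map(_.sum)                                   // 7, 12, 4

    // Accumulating: each output covers everything seen so far
    val accumulating = panes.scanLeft(0)((acc, p) => acc + p.sum).tail  // 7, 19, 23

    // Accumulating & Retracting: new total plus a retraction of the previous one
    val retracting = accumulating.zip(0 :: accumulating).map {
      case (cur, 0)    => s"$cur"
      case (cur, prev) => s"$cur,-$prev"
    }                                                                   // 7, 19,-7, 23,-19

    println(s"discarding   = $discarding")
    println(s"accumulating = $accumulating")
    println(s"retracting   = $retracting")
  }
}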
@kimutansk
With countermeasures, perfect?
• With Watermark, Trigger, and Accumulation,
can Stream Processing handle any case?
• ...No!
• Even with these countermeasures, problems remain.
– How large a lag should there be between the watermark and
ProcessingTime?
• The longer the lag, the higher the completeness, but also the higher
the latency.
– For accumulation, how long should intermediate data be held?
• The longer the holding time, the higher the system resource
requirements.
Introducing these concepts does not solve everything.
How to set the watermark and how much lateness to allow still have to be decided.
@kimutansk
Data Processing Systems trade-off
• Data processing systems (both batch and stream) are said
to face a three-way trade-off:
– Completeness
– Low Latency
– Low Cost
• All three cannot be achieved at once.
– Every data processing system is a balance of these three elements.
– That balance determines the practical solution.
• “Cost” covers not only system resources but also data
transfer and communication paths.
• Some real examples...
Data processing systems trade off “completeness”, “low latency”, and “low cost”;
this balance determines how the problems above are settled.
@kimutansk
Trade-off example
• Billing system
– The important element is completeness!
– Some latency or cost is acceptable.
For billing, for example, completeness comes first;
some latency and cost are within the acceptable range.
(Diagram: Completeness = important; Low Latency and Low Cost = less important.)
@kimutansk
Trade-off example
• Anomaly / Change detection system
– The most important element is low latency!
– The other elements have lower priority.
For anomaly detection, for example, low latency is the top priority
and the other elements matter less.
(Diagram: Low Latency = important; Completeness and Low Cost = less important.)
@kimutansk
Stream Processing system structure
and products
What does a real stream processing system look like,
and what products are there?
@kimutansk
Typical system structure
Real stream processing systems
often take a structure like the following.
(Diagram: unbounded data sources — mobile activity, sensor data, system logs — flow into a message bus, then into a Stream Processing engine, and on to output systems.)
• Typical Stream Processing system structure.
@kimutansk
Each system element detail
The basic structure: buffer the data in a message bus
and process it with a stream processing engine.
• Message bus
– With unbounded data, the data flow rate is sometimes very high.
– When trouble occurs, data may need to be reloaded.
– So the data is temporarily buffered.
– Ex) Kafka, Kinesis, Cloud PubSub, etc.
• Stream Processing Engine
– The engine that pulls the data and processes it.
– It runs continuously, so high availability is needed.
– Products are described later.
• Output systems
– The systems that consume the Stream Processing system’s output.
– Depends on the use case.
@kimutansk
Stream Processing Engine genealogy
Stream processing engines can be classified
into several categories, as in the figure.
(Diagram: products plotted by category — Pure Stream Processing, With Dataflow Design UI, DSL, Managed Service — against time of release.)
@kimutansk
Stream Processing Engine genealogy
Categories: pure stream processing engines, products where processing is
defined in a UI, DSLs, and managed services.
• Pure Stream Processing
– Basic stream processing engines.
– Each with specific functions that set it apart from other engines.
• With Dataflow Design UI
– Products that provide a dataflow design UI.
– Users can design the stream processing easily.
• DSL
– Write the code once, run it on multiple stream processing engines.
– The DSL generates an abstract dataflow definition.
• Managed Service
– The execution environment is managed on a public cloud.
@kimutansk
Product Introduction:Storm
Practically the first widely used OSS stream processing engine.
It had many problems, but it strongly influenced later products.
• Open sourced by Twitter in 2011.
– Developed in Clojure.
• Deep dives required Clojure skills.
– Practically the first widely used OSS stream processing engine.
• Supported “at least once” semantics from the initial version.
– Being an early product, Storm had many problems:
• Latency is very low, but throughput is low.
• No back-pressure.
• The default process placement is inefficient.
• The message ack function is executed per message.
• In current versions, most of these problems are solved.
– Storm influenced many later stream processing products.
@kimutansk
Product Introduction:Spark Streaming
Runs as micro-batches on top of a batch processing engine.
The big advantage is access to the Spark ecosystem and development style.
• Open sourced by AMPLab in 2013.
– Developed in Scala.
– On top of the batch processing framework Spark, streaming is
pseudo-realized by sequentially executed mini batches.
• Called “micro-batching”.
– Throughput is high, but latency is also high.
• Compared with Storm at the time.
• Compared with Flink or Apex...?
– The big advantage is running on the Spark ecosystem.
• Spark components can be used:
Spark SQL, Spark MLlib, etc.
• The same development style, too.
@kimutansk
Product Introduction:NiFi
Dataflows are defined on screen, and stream processing can be built from them.
Data management features are strong, but configuration management is a challenge.
• Open sourced by the NSA in 2014.
– Developed in Java.
– Design the dataflow in a UI, deploy it to a NiFi cluster,
and it runs on that cluster.
• Ex) Get from Kafka > Enrichment > Put to HDFS
– Between components NiFi keeps message queues,
and priority and QoS settings can be applied per queue.
– NiFi traces each record’s source and modification history.
• Useful for data management.
– However, managing dataflows as code is difficult.
@kimutansk
Product Introduction:NiFi
@kimutansk
Product Introduction:Flink
A data processing engine that supports both batch and stream.
It provides its own snapshot mechanism and a rich set of APIs.
• Open sourced in 2014.
– Named Stratosphere in 2011.
– Developed in Scala.
– A data processing engine that provides both batch and stream
APIs.
– For fault tolerance it uses a “distributed snapshot” method.
• Lightweight Asynchronous Snapshots for Distributed Dataflows
• Flink can take snapshots asynchronously and efficiently.
– There are three kinds of APIs for developing Flink applications:
• High-level API
• Low-level API
• Table API (SQL-like)
@kimutansk
Product Introduction:Apex
A stream processing platform focused on fault tolerance,
with rich features for state management, runtime optimization, and auto-scaling.
• Open sourced by DataTorrent in 2015.
– Developed in Java.
– Originally used for financial applications.
• Focuses on fault tolerance.
• Problem traceability in production environments.
• There are message buffers between operators,
so when a failure occurs its impact is limited.
– Both state management and optimization for the runtime
environment are addressed.
• Apex uses HDFS like a KVS, achieving low latency and fault tolerance.
• A YARN-native application.
– Auto-scaling at runtime.
@kimutansk
Product Introduction:Gearpump
An Actor-based product developed with reference to Google’s MillWheel.
Excellent performance and extensibility, but aimed at experts.
• Open sourced by Intel in 2015.
– Developed in Scala.
– Developed with reference to the MillWheel design.
• Google’s stream processing paper:
• MillWheel: Fault-Tolerant Stream Processing at Internet Scale
– A lean stream processing engine with high extensibility.
• But state management code must be written by the application developer.
• Performance is high, but so is the development cost.
– Based on “Reactive Streams”, with standardized back-pressure
functions.
– With an akka-streams-like syntax, users can build dataflow graphs
intuitively.
@kimutansk
Product Introduction:Kafka Streams
A component for building stream processing in combination with Kafka.
Built on the concept of unifying streams and tables; simple but well equipped.
• Released by Confluent in 2016.
– Developed in Java.
– A component of Kafka.
– A library for implementing stream processing applications.
• Kafka Streams does not include process clustering or high-
availability features.
• Those elements are left to the user.
– Practically dedicated to Kafka.
• Input sources and output destinations are Kafka.
– The key concept is streams and tables.
• There is a close relationship between streams and tables.
– Simple, but the functionality is pretty powerful.
• Dual APIs (declarative, imperative), queryable state, windowing.
@kimutansk
Product Introduction:Beam
A DSL that provides a unified stream/batch processing model.
It can run in various environments, but product-specific features cannot be used.
• Open sourced by Google in 2016.
– Developed in Java.
– Unified stream/batch processing for big data.
• Beam provides a data processing abstraction.
– An application developed with Beam
can execute on multiple stream processing engines:
• Local executor
• Google Cloud Dataflow (Google Cloud Platform)
• Spark, Flink, Apex, Gearpump
– In exchange for the high portability,
product-specific libraries cannot be used.
• Machine learning, graph processing, etc.
• Such functions must be executed separately (e.g. TensorFlow).
@kimutansk
Product Introduction:Cloud Dataflow
A fully managed stream/batch service provided on GCP.
Being a managed service, dynamic adjustment and optimization are strong points.
• Released by Google in 2015.
– Applications developed with
Beam (the Dataflow API at the time)
run on a Google Cloud Platform managed service.
– Compared with other managed stream processing services,
it covers a wider range of applications.
• Developed as ordinary stream processing applications.
– Being a managed service,
it auto-scales and optimizes resource allocation.
@kimutansk
Product Introduction:KinesisAnalytics
A service on AWS that runs continuous queries over streams.
Functionality is limited, but it is very easy to get started with.
• Released by Amazon in 2016.
– With SQL, users can apply continuous queries
to streaming data.
• The concept of “data in motion”.
– Each Kinesis Analytics application has one input stream
and at most three output streams.
• Input sources and output targets are limited to the Kinesis family.
– Functionality is limited, but it is very easy to start with.
– It distinguishes EventTime and ProcessingTime,
• so good windowing functions are available.
– It auto-scales, but the application’s resource usage is difficult
to predict.
@kimutansk
Which product should use?
To start with, Flink or Apex are well balanced and safe; Gearpump is for advanced users.
The rest can be chosen based on your situation, execution environment, and existing systems.
• ※Just my opinion.
• Flink or Apex: for a first stream processing app.
– Good balance of functionality, performance, and ease of use.
• Gearpump: for akka experts.
– Good performance and extensibility, but hard as a first product.
• Spark Streaming: for Spark users.
– High compatibility with other Spark components.
• Beam / Cloud Dataflow: for public cloud users.
– Good portability between on-premise and public cloud.
• NiFi: for users running many small stream processing apps.
• Kafka Streams: auxiliary use alongside other products.
@kimutansk
However
That said, real selection still needs consideration.
From here I explain the major points to consider.
• However, which product to select still requires careful consideration.
• The major consideration points are explained next.
@kimutansk
Technical consideration points
Points to consider when selecting a stream processing product
and building a system.
@kimutansk
Consideration point list
The major technical consideration points,
roughly divided into problem area, reliability, system management, and development method.
• For developing stream processing systems,
there are many technical consideration points.
– Clarify whether the candidate stream processing product has
each required function or not.
• Target problem area
– Time model
– Windowing
– Out of order processing
• System reliability
– State management
– Fault tolerance
– Re-execute
– Message delivery semantics
• System management
– UI
– Logging
– Back-pressure
– Scale out / Scale in
– Data security
• Development method
– Api
– Specific library
– Environment, operation
@kimutansk
Target problem area
As explained so far, the time model, windowing, and
out-of-order handling determine what the system can support.
• Time model
– Is it necessary to handle EventTime, or is ProcessingTime
enough?
– If ProcessingTime alone is enough, the system is simpler.
• But changing this later is difficult.
• Windowing
– Which window types are needed?
• Tumbling, Sliding, Session
• Out of order processing
– Related to the time model and windowing.
• How much lateness is allowed?
• How is late data handled?
@kimutansk
System reliability(1)
Items that need consideration to ensure system reliability:
state management, fault tolerance, and re-executability when bugs occur.
• State management
– Is state saved on the local machine or in a remote datastore?
– In which format is the state serialized?
• A trade-off between reliability and performance.
• Fault tolerance
– When a failure occurs, how wide is the impact, and how much latency is added?
• Is recovery automatic or manual?
• How long is the mean time to repair (MTTR)?
• Re-execute
– When a failure or a program bug occurs,
is re-execution necessary?
• Re-execution requires messages to be stored long-term.
@kimutansk
System reliability(2)
Message delivery semantics also need consideration.
Note: an “exactly once” that covers everything on its own cannot be achieved.
• Message delivery semantics
– Which semantics are required for message processing?
• At most once
• At least once
• Exactly once
– Premise: exactly once covering every pattern is impossible!
– A stream processing system can only guarantee that
its own state is processed exactly once.
– When the system outputs to external systems,
deduplication or an “accumulation” function is needed on the
external side.
@kimutansk
Exactly once NG pattern
Because access to the outside and state persistence are not “atomic”.
Consider a system that fires an alarm when a message is notified.
• Because external access and persisting state are
not atomic.
Consider a system that sends an alarm
when a message is notified.
@kimutansk
Exactly once NG pattern
What if a failure occurs after the notification but before the state is updated?
• Because external access and persisting state are
not atomic.
What if a failure occurs after the alarm is sent
but before the state is persisted?
@kimutansk
Exactly once NG pattern
The message is notified again, and a duplicate alarm is fired!
• Because external access and persisting state are
not atomic.
The message is re-notified,
and a duplicate alarm is sent!
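A common way out is to make the external call idempotent, for example by deduplicating on a message ID on the alarm side, so that a replayed notification after a failure does not fire twice. A minimal Scala sketch with hypothetical names (DedupAlarmSink is not part of any product):

import scala.collection.mutable

class DedupAlarmSink(send: String => Unit) {
  // processed message IDs; a production version would use a persistent store with a TTL
  private val seen = mutable.Set.empty[String]

  def notifyAlarm(messageId: String, text: String): Unit = {
    if (seen.add(messageId)) send(text)   // add() returns false when the ID was already processed
    // else: duplicate delivery after crash-and-replay, silently skipped
  }
}

// usage:
//   val sink = new DedupAlarmSink(text => println(s"ALARM: $text"))
//   sink.notifyAlarm("msg-42", "many logins in a short time")
//   sink.notifyAlarm("msg-42", "many logins in a short time")  // replay: no second alarm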
@kimutansk
System management(1)
Most systems have a UI, and what it offers matters.
A mechanism to aggregate logs for analysis is essential.
• UI
– Most stream processing products have their own UI.
• What information can users see in the UI?
• What operations can users execute from the UI?
– The execution graph is important information.
• If shuffle status is displayed, diagnosis is easy.
• Logging
– For distributed systems, logging into each server to check is
unrealistic.
– So a log collection function is important for error analysis.
@kimutansk
System management(2)
Without back-pressure, upfront capacity estimation and monitoring are needed.
Since the system basically runs all the time, being able to scale while running helps.
• Back-pressure
– If performance is unbalanced between components,
a back-pressure function is needed.
– Without it, each component’s performance must be estimated in advance,
but such estimates are difficult and unreliable.
• Scale out / Scale in
– Can the system scale out and scale in at runtime?
– Or is a restart acceptable?
– What is the minimum execution unit?
• Does it depend on the external system’s partitions?
• Stream processing systems basically run continuously,
so this needs to be clarified in advance.
@kimutansk
System management(3)
If data encryption is required, where and how is it done?
And is per-application access control to the data needed?
• Data security
– Is encryption of stored data necessary?
• If so, at what point and how is the data encrypted?
– Is application-specific access control to the data necessary?
@kimutansk
Development method(1)
Which APIs can be used for development? (development speed vs. customizability)
Depending on the product, multiple APIs can coexist in one system.
• API
– Which development abstractions can be selected?
• Depends on the team members’ skills and
the system’s priorities (customizability vs. development speed).
– The three major API abstractions are below.
• Declarative, expressive API
– Development speed:○, Customizability:○
– like: map(), filter(), join(), sort(), etc...
• Imperative, lower-level API
– Development speed:△, Customizability:◎
– like: process(event)
• Streaming SQL
– Development speed:◎, Customizability:△
– like: STREAM SELECT ... FROM ... WHERE ... TO ...
@kimutansk
Development method(2)
Declarative application implementation:
transform the stream with functions, each performing its own processing.
• Declarative, expressive API Example
– By Flink
case class CarEvent(carId: Int, speed: Int, distance: Double, time: Long)
val carEventStream: DataStream[CarEvent] = ...
val topSpeed = carEventStream
  .assignAscendingTimestamps( _.time )
  .keyBy("carId")
  .window(GlobalWindows.create)
  .evictor(TimeEvictor.of(Time.of(evictionSec * 1000, TimeUnit.MILLISECONDS)))
  .trigger(DeltaTrigger.of(triggerMeters, new DeltaFunction[CarEvent] {
    def getDelta(oldSp: CarEvent, newSp: CarEvent): Double = newSp.distance - oldSp.distance
  }, carEventStream.getType().createSerializer(env.getConfig)))
  .maxBy("speed")
@kimutansk
Development method(3)
Imperative application implementation:
apply a process method to the stream, performing each piece of processing.
• Imperative, lower-level API Example
– By Flink
val stream: DataStream[(String, String)] = ...
val result: DataStream[(String, Long)] =
  stream
    .keyBy(0)
    .process(new CountWithTimeoutFunction())

case class CountWithTimestamp(key: String, count: Long, lastModified: Long)

class CountWithTimeoutFunction extends ProcessFunction[(String, String), (String, Long)] {
  lazy val state: ValueState[CountWithTimestamp] = getRuntimeContext
    .getState(new ValueStateDescriptor[CountWithTimestamp]("myState", classOf[CountWithTimestamp]))
  override def processElement(value: (String, String),
    ctx: ProcessFunction[(String, String), (String, Long)]#Context, out: Collector[(String, Long)]): Unit = ...
  override def onTimer(timestamp: Long,
    ctx: ProcessFunction[(String, String), (String, Long)]#OnTimerContext, out: Collector[(String, Long)]): Unit = ...
}
@kimutansk
Development method(4)
Application implementation with Streaming SQL:
set a schema on the stream and process it with SQL.
• Streaming SQL Example
– By Flink
val env = StreamExecutionEnvironment.getExecutionEnvironment
val tableEnv = TableEnvironment.getTableEnvironment(env)
// read a DataStream from an external source
val ds: DataStream[(Long, String, Integer)] = env.addSource(...)
// register the DataStream under the name "Orders"
tableEnv.registerDataStream("Orders", ds, 'user, 'product, 'amount)
// run a SQL query on the Table and retrieve the result as a new Table
val result = tableEnv.sql(
"SELECT product, amount FROM Orders WHERE product LIKE '%Rubber%'")
@kimutansk
Development method(5)
Does the product have the specific libraries you need?
A simple execution environment and affinity with existing processes also matter.
• Specific libraries
– Each product has its own specific libraries:
• Machine learning
• Graph processing
• Adapters for external components.
– If the product lacks the needed library, it must be developed.
• Environment, operation
– Can it run on a local machine?
– Does it need pre-installation, or only deployment via a resource manager?
– For program updates, is a restart acceptable, or is a rolling update needed?
– Can it coexist with the team’s existing development tools?
– What language is the product developed in? (Important for diagnosis.)
@kimutansk
Real Stream Processing performance
problems
Performance problems that frequently occur
in real stream processing systems.
@kimutansk
Performance problems list
Typical problems regardless of product:
file access, inter-process communication, external system limits, and GC.
• The typical performance problems of stream processing systems
are listed below.
– File access
– Communication between processes
– External system performance limits
– GC
• Of course, depending on the product, many other
performance problems exist.
– In that case, analyze case by case, referring to the earlier
“System management” chapter.
@kimutansk
File access
Synchronous access to local files is a common beginner pitfall.
Use a cache, or redesign so that asynchronous access is sufficient.
• ※This problem is often encountered by beginners.
– If a stream processing component accesses the local file
system synchronously for every event, that component becomes the
bottleneck.
– Avoid synchronous file system access with a cache.
– Or redesign the system so synchronous access is unnecessary.
• For example, update the file system asynchronously in batches,
as in the sketch below.
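A minimal plain-Scala sketch of the “asynchronous batch update” idea; BatchedFileWriter and the one-second flush interval are illustrative assumptions, not a product API.

import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths, StandardOpenOption}
import java.util.concurrent.{Executors, TimeUnit}
import scala.collection.mutable.ArrayBuffer

class BatchedFileWriter(path: String, flushIntervalMs: Long = 1000L) {
  private val buffer = ArrayBuffer.empty[String]
  private val scheduler = Executors.newSingleThreadScheduledExecutor()
  scheduler.scheduleAtFixedRate(new Runnable { def run(): Unit = flush() },
    flushIntervalMs, flushIntervalMs, TimeUnit.MILLISECONDS)

  // called per event: only an in-memory append, no disk I/O on the hot path
  def write(line: String): Unit = buffer.synchronized { buffer += line }

  private def flush(): Unit = {
    val lines = buffer.synchronized { val copy = buffer.toList; buffer.clear(); copy }
    if (lines.nonEmpty)
      Files.write(Paths.get(path), lines.mkString("", "\n", "\n").getBytes(StandardCharsets.UTF_8),
        StandardOpenOption.CREATE, StandardOpenOption.APPEND)
  }

  def close(): Unit = { scheduler.shutdown(); flush() }
}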
@kimutansk
Communication between processes
As with batch processing, inter-process communication is costly.
It mainly occurs at shuffle time, so measures to reduce it are needed.
• As in batch processing, the performance impact of inter-process
communication is high.
– Inter-process communication mainly occurs at the
“shuffle”.
– For example, reduce the communication cost by
pre-aggregating each partition’s data before the shuffle (see the sketch below).
– Or merge components to reduce communication
between processes.
– But excessive merging raises the maintenance cost of the merged
components and reduces performance-tuning flexibility.
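A plain-Scala sketch of the pre-aggregation (combiner) idea: collapse each partition to at most one record per key before anything crosses the network. The word-count data is made up for illustration.

object MapSideCombine {
  def main(args: Array[String]): Unit = {
    // hypothetical partition-local batches of (word, count) events
    val partitions = Seq(
      Seq("a" -> 1, "b" -> 1, "a" -> 1),
      Seq("a" -> 1, "c" -> 1, "c" -> 1))

    // combine locally: at most one record per key leaves each partition
    val combined = partitions.map(_.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum })

    // the shuffle then only moves the combined records; the final reduce merges them
    val finalCounts = combined.flatten.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

    println(s"records shuffled: ${combined.map(_.size).sum} instead of ${partitions.map(_.size).sum}")
    println(s"final counts: $finalCounts")
  }
}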
@kimutansk
External system performance limit
Because the stream processing system itself runs in memory and in parallel,
it often exceeds the throughput of other systems.
• Hitting the performance limit of external
systems (message bus, output systems).
– Stream processing systems generally process in memory
and run concurrently/in parallel, so their throughput tends to be
large and sometimes overwhelms external systems.
• In the Storm / Spark Streaming generation,
Kafka was generally faster than the stream processing engine.
• But in the current generation (Flink or Apex),
the engine is sometimes faster than Kafka.
• Tune the cluster size and replication settings of the
message bus and output systems.
@kimutansk
GC
GC problems are unavoidable when running on the JVM. Tune first;
if that is not enough, the only option is risky, heavy-handed workarounds...
• An unavoidable problem when executing on the JVM.
• Stream processing systems handle a huge number of
events, so a huge number of objects are generated.
• First, tune the JVM.
• If that is still not enough...
– Suppress object generation in application code as much as
possible.
– For objects contained in events, use a byte array instead of
individual objects.
• But these countermeasures hurt system maintainability and
quality.
– Better not to do it if you can avoid it.
@kimutansk
GC
A code image of holding the object as a byte array
to reduce the impact of GC.
• Use byte array instead of objects code example.
– Basic Scala Value Object(carId length = Fixed 16)
case class CarEvent(carId: String, speed: Int, distance: Double, time: Long)
@kimutansk
GC
A code image of holding the object as a byte array
to reduce the impact of GC.
• Use byte array instead of objects code example.
– Byte Array Scala Value Object(carId length = Fixed 16)
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

class CarEventByteArray() {
  // layout: carId (16 bytes ASCII) | speed (Int) | distance (Double) | time (Long)
  lazy val content = ByteBuffer.allocate(16 + 4 + 8 + 8)
  def this(carId: String, speed: Int, distance: Double, time: Long) {
    this()
    content.put(carId.getBytes(StandardCharsets.US_ASCII), 0, 16)
    content.putInt(16, speed)
    content.putDouble(20, distance)
    content.putLong(28, time)
  }
  def setCarId(carId: String): Unit = {
    val bytes = carId.getBytes(StandardCharsets.US_ASCII)
    var i = 0
    while (i < 16) { content.put(i, bytes(i)); i += 1 }
  }
  def getCarId(): String = {
    val bytes = new Array[Byte](16)
    var i = 0
    while (i < 16) { bytes(i) = content.get(i); i += 1 }
    new String(bytes, StandardCharsets.US_ASCII)
  }
  ..........
}
@kimutansk
Stream Processing system
misconceptions
Misconceptions about stream processing.
@kimutansk
Common misconceptions
Partly because early products had many problems,
there are common misconceptions about stream processing systems.
• From past history, several misconceptions
about stream processing systems persist:
– Stream processing is only applicable to approximate results.
– You must choose between latency and throughput.
– Micro-batching means better throughput.
– Exactly-once semantics is completely impossible.
– Stream processing only applies to “real-time”.
– Stream processing is too hard anyway.
@kimutansk
Answer for misconceptions(1)
“Only usable for approximations” is a misconception dating back to early Storm.
Likewise, today the trade-off is not latency vs. throughput but a different one.
• Stream processing is only applicable to approximate results.
– For early Storm this was true, so combined use with
batch processing (the Lambda Architecture) was required.
– Today this is controllable with watermarks and triggers.
• You must choose between latency and throughput.
– This also comes from the early Storm vs. Spark Streaming comparisons.
– Today the trade-off has three axes:
• Completeness
• Low Latency
• Low Cost
@kimutansk
Answer for misconceptions(2)
“Because it is micro-batching” is not in itself a reason for better throughput.
For exactly once, identify the patterns where it is achievable and handle the rest.
• Micro-batching means better throughput.
– In real systems data is buffered at the hardware layer,
so micro-batching alone is not a reason for good
throughput.
– On the contrary, managing micro-batch jobs can itself
hurt performance.
• Exactly-once semantics is completely impossible.
– A stream processing system can guarantee that
its own state is processed exactly once.
– When the system outputs to external systems,
deduplication or an “accumulation” function is needed on the
external side.
@kimutansk
Answer for misconceptions(3)
Stream processing systems can be applied beyond real-time processing.
Development cost has also come down as easier APIs have appeared.
• Stream processing only applies to “real-time”.
– As shown in an earlier chapter, “batch processing is a subset
of stream processing”.
– So stream processing can also handle batch jobs.
• But performance efficiency needs to be confirmed.
• Stream processing is too hard anyway.
– With unbounded data sources and very frequent data
updates, it is actually easier to apply than batch processing.
– Early-generation products had only
imperative, lower-level APIs, so development cost was high.
– But current products also offer declarative, expressive APIs
and Streaming SQL, so development cost has become lower.
@kimutansk
Summary
I introduced what stream processing systems are and explained
products, consideration points, and common misconceptions.
• Stream Processing is a superset of Batch Processing.
– But a new problem arises: out-of-order data.
• There are countermeasures for this new problem:
– Watermark / Trigger / Accumulation
• A typical stream processing system consists of:
– Message bus / Stream Processing engine / Output systems
• There are many stream processing products.
– Product selection needs careful consideration.
– The main consideration points were introduced.
• In addition, typical performance problems and misconceptions were covered.
@kimutansk
Reference materials
• The world beyond batch: Streaming 101
– https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
• The world beyond batch: Streaming 102
– https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
• MillWheel: Fault-Tolerant Stream Processing at Internet Scale
– https://research.google.com/pubs/pub41378.html
• The Dataflow Model: A Practical Approach to Balancing Correctness, Latency,
and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing
– https://research.google.com/pubs/pub43864.html
• The Evolution of Massive-Scale Data Processing
– https://goo.gl/jg4UAb
• Streaming Engines for Big Data
– http://www.slideshare.net/stavroskontopoulos/voxxed-days-thessaloniki-21102016-streaming-engines-for-big-data
• Introduction to Streaming Analytics
– http://www.slideshare.net/gschmutz/introduction-to-streaming-analytics-69120031
@kimutansk
Reference materials
• Stream Processing Myths Debunked:Six Common Streaming Misconceptions
– http://data-artisans.com/stream-processing-myths-debunked/
• A Practical Guide to Selecting a Stream Processing Technology
– http://www.slideshare.net/ConfluentInc/a-practical-guide-to-selecting-a-stream-processing-technology
• Apache Beam and Google Cloud Dataflow
– http://www.slideshare.net/SzabolcsFeczak/apache-beam-and-google-cloud-dataflow-idg-final-64440998
• The Beam Model
– https://goo.gl/6ApbHV
• THROUGHPUT, LATENCY, AND YAHOO! PERFORMANCE BENCHMARKS. IS THERE A WINNER?
– https://www.datatorrent.com/blog/throughput-latency-and-yahoo/
• Lightweight Asynchronous Snapshots for Distributed Dataflows
– https://arxiv.org/abs/1506.08603
Thank you for your attention!
Enjoy Stream Processing!
https://www.flickr.com/photos/neokratz/4913885458

スキーマつきストリーム データ処理基盤、 Confluent Platformとは?スキーマつきストリーム データ処理基盤、 Confluent Platformとは?
スキーマつきストリーム データ処理基盤、 Confluent Platformとは?
 
Gearpump, akka based Distributed Reactive Realtime Engine
Gearpump, akka based Distributed Reactive Realtime EngineGearpump, akka based Distributed Reactive Realtime Engine
Gearpump, akka based Distributed Reactive Realtime Engine
 
リアルタイム処理エンジン Gearpumpの紹介
リアルタイム処理エンジンGearpumpの紹介リアルタイム処理エンジンGearpumpの紹介
リアルタイム処理エンジン Gearpumpの紹介
 

Último

Forming section troubleshooting checklist for improving wire life (1).ppt
Forming section troubleshooting checklist for improving wire life (1).pptForming section troubleshooting checklist for improving wire life (1).ppt
Forming section troubleshooting checklist for improving wire life (1).pptNoman khan
 
Novel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsNovel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsResearcher Researcher
 
Artificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewArtificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewsandhya757531
 
Module-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdfModule-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdfManish Kumar
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating SystemRashmi Bhat
 
KCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosKCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosVictor Morales
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating SystemRashmi Bhat
 
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESCME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESkarthi keyan
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxsiddharthjain2303
 
Prach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism CommunityPrach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism Communityprachaibot
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...Erbil Polytechnic University
 
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSHigh Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSsandhya757531
 
Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Romil Mishra
 
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSneha Padhiar
 
Curve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptxCurve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptxRomil Mishra
 
Cost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based questionCost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based questionSneha Padhiar
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating SystemRashmi Bhat
 
11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdfHafizMudaserAhmad
 
Robotics Group 10 (Control Schemes) cse.pdf
Robotics Group 10  (Control Schemes) cse.pdfRobotics Group 10  (Control Schemes) cse.pdf
Robotics Group 10 (Control Schemes) cse.pdfsahilsajad201
 
STATE TRANSITION DIAGRAM in psoc subject
STATE TRANSITION DIAGRAM in psoc subjectSTATE TRANSITION DIAGRAM in psoc subject
STATE TRANSITION DIAGRAM in psoc subjectGayathriM270621
 

Último (20)

Forming section troubleshooting checklist for improving wire life (1).ppt
Forming section troubleshooting checklist for improving wire life (1).pptForming section troubleshooting checklist for improving wire life (1).ppt
Forming section troubleshooting checklist for improving wire life (1).ppt
 
Novel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending ActuatorsNovel 3D-Printed Soft Linear and Bending Actuators
Novel 3D-Printed Soft Linear and Bending Actuators
 
Artificial Intelligence in Power System overview
Artificial Intelligence in Power System overviewArtificial Intelligence in Power System overview
Artificial Intelligence in Power System overview
 
Module-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdfModule-1-(Building Acoustics) Noise Control (Unit-3). pdf
Module-1-(Building Acoustics) Noise Control (Unit-3). pdf
 
Virtual memory management in Operating System
Virtual memory management in Operating SystemVirtual memory management in Operating System
Virtual memory management in Operating System
 
KCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitosKCD Costa Rica 2024 - Nephio para parvulitos
KCD Costa Rica 2024 - Nephio para parvulitos
 
Main Memory Management in Operating System
Main Memory Management in Operating SystemMain Memory Management in Operating System
Main Memory Management in Operating System
 
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTESCME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
CME 397 - SURFACE ENGINEERING - UNIT 1 FULL NOTES
 
Energy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptxEnergy Awareness training ppt for manufacturing process.pptx
Energy Awareness training ppt for manufacturing process.pptx
 
Prach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism CommunityPrach: A Feature-Rich Platform Empowering the Autism Community
Prach: A Feature-Rich Platform Empowering the Autism Community
 
"Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ..."Exploring the Essential Functions and Design Considerations of Spillways in ...
"Exploring the Essential Functions and Design Considerations of Spillways in ...
 
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMSHigh Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
High Voltage Engineering- OVER VOLTAGES IN ELECTRICAL POWER SYSTEMS
 
Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________Gravity concentration_MI20612MI_________
Gravity concentration_MI20612MI_________
 
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATIONSOFTWARE ESTIMATION COCOMO AND FP CALCULATION
SOFTWARE ESTIMATION COCOMO AND FP CALCULATION
 
Curve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptxCurve setting (Basic Mine Surveying)_MI10412MI.pptx
Curve setting (Basic Mine Surveying)_MI10412MI.pptx
 
Cost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based questionCost estimation approach: FP to COCOMO scenario based question
Cost estimation approach: FP to COCOMO scenario based question
 
Input Output Management in Operating System
Input Output Management in Operating SystemInput Output Management in Operating System
Input Output Management in Operating System
 
11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf11. Properties of Liquid Fuels in Energy Engineering.pdf
11. Properties of Liquid Fuels in Energy Engineering.pdf
 
Robotics Group 10 (Control Schemes) cse.pdf
Robotics Group 10  (Control Schemes) cse.pdfRobotics Group 10  (Control Schemes) cse.pdf
Robotics Group 10 (Control Schemes) cse.pdf
 
STATE TRANSITION DIAGRAM in psoc subject
STATE TRANSITION DIAGRAM in psoc subjectSTATE TRANSITION DIAGRAM in psoc subject
STATE TRANSITION DIAGRAM in psoc subject
 

Stream Data Processing 101

  • 13. @kimutansk Stream Processing ストリーム処理:無限に生成され続けるデータを処理するモデル 出力先はシステムによって多岐に渡る。 (Diagram: unbounded data sources (mobile activity, sensor data, system logs) → Message bus → Stream Processing Engine → Output systems) • Process unbounded, continuously generated data. – Output destinations depend on the system.
  • 14. @kimutansk Difference between Batch and Stream
  Batch Processing / Interactive Query / Stream Processing:
  – Execute timing: manual or periodical / manual or periodical / continuous
  – Processing target: archived data / archived data / unbounded, continuously generated stream data
  – Processing time: minutes ~ hours / seconds ~ minutes / permanent
  – Data size: TBs~PBs / GBs~TBs / Bs~KBs (per event)
  – Latency: minutes ~ hours / seconds ~ minutes / milliseconds ~ seconds
  – Typical applications: ETL, reporting, ML model generation / business intelligence, analytics / anomaly detection, recommendation, visualization
  – OSS products: MapReduce, Spark / Impala, Presto, Drill, Hive / (described later)
  • The difference is whether the input data is complete at execution time or not: for Batch and Interactive Query the input data is complete, while for Stream Processing the input keeps streaming in. バッチ処理(対話型クエリ)と、ストリーム処理の違いは、 「入力データが完全に揃っているか否か」
  • 15. @kimutansk Batch Processing's premise • When a Batch job executes, data completeness is required: the target data must be bounded. • Basically, outputs that span multiple batches are difficult. • Basic Batch Processing model (Diagram: bounded input → MapReduce → output) バッチ処理の前提として、「実行時データが揃っていること」 「バッチを跨いだ結果出力は困難」がある。
  • 16. @kimutansk Batch Processing pattern • For multiple outputs, execute multiple jobs. • For time-sliced outputs, the batch reads input covering the whole time range. (Diagram: daily inputs 2/26–2/28 processed by separate MapReduce jobs; the 2/28 output split into [00:00 ~ 06:00), [06:00 ~ 12:00), [12:00 ~ 18:00), [18:00 ~ 24:00).) 「複数の結果を出力する場合は複数回バッチを実行」 「結果を時間で区切る場合はそれらを全て含むデータを入力」
  • 17. @kimutansk Batch Processing problems • For a user-session job, Batch Processing is not well suited. – If a user session continues across 2/27 and 2/28, the 2/27 result must be re-output. – If it continues even longer... really? (Diagram: sessions of three users (red, yellow, green) crossing the 2/27–2/28 boundary of two MapReduce jobs.) ユーザのセッションを算出したい場合、 日跨ぎセッションが区切られる。過去を読んで再出力は難しい。
  • 18. @kimutansk Batch Processing is...? • The multiple-outputs pattern can be transformed as shown below. – This means that... (Diagram: the daily inputs 2/26, 2/27, 2/28 lined up as one continuous input feeding successive Map/Reduce runs.) 「複数回バッチを実行」の図はこのスライドのように 変形することができる。すなわち・・・?
  • 19. @kimutansk Batch is subset of Stream! • The multiple-outputs pattern is nothing but unbounded stream data bounded into intervals. (Diagram: an unbounded, infinite stream cut into bounded, finite streams for 2/26, 2/27, 2/28.) これは、すなわち無限のデータであるストリームデータを 一定時間ごとに区切ったものに他ならない。
  • 20. @kimutansk Batch is subset of Stream! • This means that Batch Processing is a subset of Stream Processing. (Diagram: the same interval-bounded stream, with Batch Processing drawn inside the larger Stream Processing circle.) つまり、バッチ処理とは、ストリーム処理の中の 限定的な処理のモデルであるということ。
  • 21. @kimutansk Batch premise does not hold. • In Stream Processing: – Batch premise "data completeness": data is generated continuously. – Batch premise "outputs across batches are difficult": processing itself is continuous, so the premise is different. • If data completeness is satisfied, Stream Processing can perform the same processing as Batch Processing. However... バッチ処理の前提はストリーム処理では成り立たない。 だが、データが揃うならバッチ処理と同じことは出来る。
  • 22. @kimutansk Trouble of Stream Processing ストリーム処理の、バッチ処理にない困った点は?
  • 23. @kimutansk New problem in Stream Processing • Data does not arrive in the order it was generated! – This is called "out of order" data. – Ex.) Phones kept in airplane mode for an entire flight deliver their events only after landing. • Typical causes: network disconnects/delays, and clock gaps between the servers that make up the system. • So there are typically two time domains within such systems: – EventTime, the time when an event actually occurred. – ProcessingTime, the time when the event is ingested by the system. ストリーム処理ではデータは発生した順に到着しない。 そのため、「EventTime」「ProcessingTime」の時刻概念が存在。
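  To make the two time domains concrete, here is a minimal Scala sketch (the ActivityLog type and the timestamp values are hypothetical, not from the slides): each event carries its EventTime, the system records ProcessingTime at ingestion, and the difference between them is the skew caused by late arrival.

  // Hypothetical event carrying both time domains (milliseconds since midnight, for readability).
  case class ActivityLog(userId: String, eventTime: Long, processingTime: Long) {
    // How far behind "now" this event arrived.
    def skewMillis: Long = processingTime - eventTime
  }

  // An event produced at 12:01 on a phone in airplane mode, ingested only at 17:01:
  val late = ActivityLog("user-1", eventTime = 43260000L, processingTime = 61260000L)
  // late.skewMillis == 18000000L, i.e. 5 hours of skew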
  • 24. @kimutansk Why is this a problem? • If there is no relation between ingested records, out-of-order arrival is not a problem – except that results are approximate. • But real Stream Processing systems need a way to group data. – Ex) In anomaly detection, few cases can be detected from a single event; patterns such as "many logins occur in a short time" need the relation between earlier and later events. データが発生した順に到着しなくても、関連が無ければ困らない。 だが、実際は「不正ログイン検知」等でデータの関連が必要。
  • 25. @kimutansk Data grouping concept • Window is the concept used for grouping data. • Typical window types: tumbling (fixed-length) window, sliding window, session window. データのグルーピングの概念としてWindowがある。 「固定長」「スライディング」「セッション」が代表的。
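  As an illustration, a minimal Flink sketch of the three window types in Scala (the ActivityCount type, field names and window sizes are assumptions for illustration, not from the slides):

  import org.apache.flink.streaming.api.TimeCharacteristic
  import org.apache.flink.streaming.api.scala._
  import org.apache.flink.streaming.api.windowing.assigners.{EventTimeSessionWindows, SlidingEventTimeWindows, TumblingEventTimeWindows}
  import org.apache.flink.streaming.api.windowing.time.Time

  case class ActivityCount(userId: String, count: Int, time: Long)

  val env = StreamExecutionEnvironment.getExecutionEnvironment
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

  val events: DataStream[ActivityCount] =
    env.fromElements(ActivityCount("user-1", 1, 0L))   // stand-in source
       .assignAscendingTimestamps(_.time)              // event time taken from the record

  // Tumbling: fixed, non-overlapping 6-hour windows
  val tumbling = events.keyBy(_.userId)
    .window(TumblingEventTimeWindows.of(Time.hours(6)))
    .sum("count")

  // Sliding: 6-hour windows evaluated every hour (overlapping)
  val sliding = events.keyBy(_.userId)
    .window(SlidingEventTimeWindows.of(Time.hours(6), Time.hours(1)))
    .sum("count")

  // Session: a window closes after 30 minutes of inactivity per key
  val sessions = events.keyBy(_.userId)
    .window(EventTimeSessionWindows.withGap(Time.minutes(30)))
    .sum("count")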
  • 26. @kimutansk Out of order trouble • With windows, out-of-order arrival becomes a real problem. – If data stamped "05:55" arrives after the [00:00 ~ 06:00) result has already been output...? (Diagram: 1. the window result is output, 2. a late event arrives afterwards.) Windowを使うと「データが順番に到着しない」が問題になる。 結果出力後に該当時間帯のデータが到着したらどうする?
  • 27. @kimutansk Countermeasure for trouble • Three countermeasures are proposed for this Stream Processing problem. • Watermark – up to what EventTime the ingested data is considered complete. • Trigger – when to output aggregated results. • Accumulation – how successive refinements of a result relate to each other. この問題に対して、「Watermark」「Trigger」「Accumulation」 という対処の概念が挙げられている。
  • 28. @kimutansk What is watermark? • A watermark expresses up to what point in the EventTime domain processing is complete. – EventTime and ProcessingTime advance independently, so there is skew between them. (Diagram: EventTime vs ProcessingTime; the ideal system follows the diagonal, the real system (≒ watermark) lags behind it by the skew.) EventTimeベースでどこまで処理したかを示す概念。 実際の処理時刻と、どこまで処理したかがずれるため必要。
  • 29. @kimutansk What is watermark? • Usage of a watermark – If the watermark is at time "X", events whose EventTime is before "X" are considered all processed. • But... – a watermark cannot be perfect: data arrives out of order, so it is only an approximation. • Still, the watermark is useful – it gives an indication of when it is reasonable to process. ただ、注意すべきはWatermarkも近似であり、完全ではない。 しかし、処理をするタイミングの基準として有用。
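  For example, in Flink a watermark can be derived from the event timestamps under a bounded out-of-orderness assumption (a minimal sketch; the ActivityLog type and the 10-minute bound are assumptions, not from the slides):

  import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
  import org.apache.flink.streaming.api.scala._
  import org.apache.flink.streaming.api.windowing.time.Time

  case class ActivityLog(userId: String, eventTime: Long)

  // Watermark = (max event time seen so far) - 10 minutes; anything older is treated as late.
  def withWatermarks(stream: DataStream[ActivityLog]): DataStream[ActivityLog] =
    stream.assignTimestampsAndWatermarks(
      new BoundedOutOfOrdernessTimestampExtractor[ActivityLog](Time.minutes(10)) {
        override def extractTimestamp(log: ActivityLog): Long = log.eventTime
      })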
  • 30. @kimutansk What is trigger? • A trigger is the mechanism that decides when aggregated results should be output. – With triggers, output timing can be flexible, and a window can fire multiple times. – In addition, the system can handle data that arrives later than the watermark. • Example – output results each time the watermark reaches the end of the window. Triggerはいつ集計結果を出力するかを定義する機構。 Triggerによって、出力タイミングを柔軟に、複数回定義可能。
  // Beam (Java): fire when the watermark passes the end of each 2-minute window
  PCollection<KV<String, Integer>> wordCountResult =
      wordOneCountStream
          .apply("Fixed Windowing",
              Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
                  .triggering(Repeatedly.forever(AfterWatermark.pastEndOfWindow())))
          .apply(Sum.integersPerKey());
  • 31. @kimutansk What is trigger? • Example – also output results when data arrives later than the watermark. Triggerの導入によって、Watermarkより遅れたデータが 到着した場合でも、ハンドリングが可能となる。
  // Beam (Java): additionally fire 1 minute after the first late element of a pane
  PCollection<KV<String, Integer>> wordCountResult =
      wordOneCountStream
          .apply("Late Firing",
              Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
                  .triggering(AfterWatermark.pastEndOfWindow()
                      .withLateFirings(AfterProcessingTime.pastFirstElementInPane()
                          .plusDelayOf(Duration.standardMinutes(1)))))
          .apply(Sum.integersPerKey());
  • 32. @kimutansk What is trigger? • Example – output for late data as well, but bound the allowed lateness: data more than 5 minutes behind the watermark is discarded. Triggerの導入によって、Watermarkよりどれだけ データが遅れたら、以後のデータを破棄するかも指定可能。
  // Beam (Java): same late firing, but drop anything later than 5 minutes past the watermark
  PCollection<KV<String, Integer>> wordCountResult =
      wordOneCountStream
          .apply("Late Firing until 5 min late",
              Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
                  .triggering(AfterWatermark.pastEndOfWindow()
                      .withLateFirings(AfterProcessingTime.pastFirstElementInPane()
                          .plusDelayOf(Duration.standardMinutes(1))))
                  .withAllowedLateness(Duration.standardMinutes(5))
                  .accumulatingFiredPanes())
          .apply(Sum.integersPerKey());
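  A rough Flink counterpart of the Beam example above, as a Scala sketch added for comparison (not from the slides; the wordOnes stream of (word, 1) pairs is hypothetical): with event-time windows, allowedLateness keeps the window state around so late elements re-fire it, and anything even later is dropped.

  import org.apache.flink.streaming.api.scala._
  import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
  import org.apache.flink.streaming.api.windowing.time.Time

  def lateTolerantCounts(wordOnes: DataStream[(String, Int)]): DataStream[(String, Int)] =
    wordOnes
      .keyBy(_._1)
      .window(TumblingEventTimeWindows.of(Time.minutes(2)))
      .allowedLateness(Time.minutes(5))   // late elements re-fire the window for up to 5 minutes
      .sum(1)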
  • 33. @kimutansk What is accumulation? • Accumulation defines how the multiple results of one window are refined. – Triggers let a window produce multiple outputs, – so we need to decide how successive refinements of the result relate to each other. – The best choice depends on the target system: • a system with its own accumulation functions, • a system that relies on a key-value datastore, • a system made of multiple components with different persistence methods. Accumulationとは、Triggerで結果が複数回出力される場合に どう扱うかの方式。システムによって何がいいかは異なる。
  • 34. @kimutansk What is accumulation? • Three typical accumulation modes – Discarding mode • when a result is output, the previous contents are discarded; • the next result contains only data that arrived after the previous output. – Accumulating mode • when a result is output, the previous contents are kept; • the next result accumulates data on top of the previous output. – Accumulating & Retracting mode • when a result is output, the previous contents are kept; • the next result contains the accumulated value plus a retraction of the previous result. 代表的な方式として、「Discard(出力毎の蓄積)」 「Accumulating(結果を累算)」「Retracting(累算と差分)」
  • 35. @kimutansk What is accumulation? • Individual mode output example – Aggregation [12:00~12:02) result • Arriving data 実際のモードごとの例を示す。 [12:00~12:02)の集計結果を各モードで出力すると・・・
  No | Processing Time | Event Time | Event Value
  1 | 12:05 | 12:01 | 7
  2 | 12:06 | 12:01 | 9
  3 | 12:06 | 12:00 | 3
  4 | 12:07 | 12:01 | 4
  • 36. @kimutansk What is accumulation? • Individual mode output example – Aggregation [12:00~12:02) result • Output data 出力タイミング毎、最終値、合計値は表のようになる。 Accumulating & Retractは複数のシステムが混在しても対応可能。
  Output Timing | Discard | Accumulating | Accumulating & Retract
  12:05 | 7 | 7 | 7
  12:06 | 12 | 19 | 19, -7
  12:07 | 4 | 23 | 23, -19
  Final Output | 4 | 23 | 23
  Total Output | 23 | 49 | 23
  Final & Total output are same
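  The table above can be reproduced with a few lines of plain Scala to make the three modes concrete (a sketch over the arrival data of the previous slide; grouping values by trigger firing is an assumption of when the trigger fires):

  // Values of the [12:00, 12:02) window, grouped by the trigger firing that emits them
  // (the firings at processing time 12:05, 12:06 and 12:07).
  val firings = Seq(Seq(7), Seq(9, 3), Seq(4))

  // Discarding: each firing emits only what arrived since the previous firing.
  val discarding = firings.map(_.sum)                     // List(7, 12, 4)

  // Accumulating: each firing emits the running total so far.
  val accumulating = firings.scanLeft(0)(_ + _.sum).tail  // List(7, 19, 23)

  // Accumulating & Retracting: emit the new total plus a retraction of the previous one
  // (0 stands for "nothing to retract" on the first firing).
  val retracting = accumulating.zip(0 +: accumulating.init.map(-_))
  // List((7, 0), (19, -7), (23, -19))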
  • 37. @kimutansk With countermeasures, perfect? • With Watermark & Trigger & Accumulation, can Stream Processing handle every case? • ...No! • Even with these countermeasures, problems remain. – How far should the watermark lag behind ProcessingTime? • The longer the lag, the higher the completeness, but also the higher the latency. – For accumulation, how long should intermediate data be held? • The longer it is held, the more system resources are required. これらの概念を導入しても問題は全部解決するわけではない。 Watermarkの決め方やどれだけ遅れを許容するか等は決まらない。
  • 38. @kimutansk Data Processing Systems trade-off • Data processing systems (both Batch and Stream) are said to have a three-way trade-off: – Completeness – Low Latency – Low Cost • All three cannot be achieved at once. – Every data processing system is a balance of the three elements. – That balance decides the practical solution to the problems above. • "Cost" includes not only system resources but also data transfer and communication paths. • Some real examples... データ処理システムには「完全性」「低遅延」「低コスト」の トレードオフがあり、そこから先の問題への落とし所が決まる。
  • 39. @kimutansk Trade-off example • Billing system – The important element is completeness! – Some latency or cost is acceptable. (Chart: Completeness = important; Low Latency and Low Cost = less important.) 例えば課金処理であれば、 完全性重視で、遅延やコストが発生してもある程度は許容範囲。
  • 40. @kimutansk Trade-off example • Anomaly / change detection system – The most important element is low latency! – The other elements have lower priority. (Chart: Low Latency = important; Completeness and Low Cost = less important.) 例えば異常検知であれば、 低遅延が最優先で、他の要素の優先度は下がる。
  • 41. @kimutansk Stream Processing system structure and products 実際のストリーム処理システムの構成はどうなっていて、 どんなプロダクトがあるのか?
  • 42. @kimutansk Typical system structure 実際のストリーム処理システムは 図のような構成を取ることが多い。 (Diagram: unbounded data sources (mobile activity, sensor data, system logs) → Message bus → Stream Processing Engine → Output systems) • Typical Stream Processing system structure.
  • 43. @kimutansk Each system element detail メッセージバスでデータをバッファリングし、 ストリーム処理エンジンで処理するのが基本の構成。 • Message bus – With unbounded data the flow rate is sometimes very high, and when trouble occurs data needs to be reloaded, – so data is temporarily buffered here. – Ex) Kafka, Kinesis, Cloud Pub/Sub, etc. • Stream Processing Engine – The engine that gets the data and processes it. – It runs continuously, so high availability is needed. – Products are described later. • Output systems – The systems that use the Stream Processing output. – They depend on the use case.
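  As a concrete example of the message-bus-plus-engine pairing, a minimal sketch of a Flink job consuming from Kafka (the topic name, broker address and the connector/serialization class names are assumptions and depend on the Flink and Kafka versions in use):

  import java.util.Properties
  import org.apache.flink.streaming.api.scala._
  import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer010
  import org.apache.flink.streaming.util.serialization.SimpleStringSchema

  object KafkaToConsole {
    def main(args: Array[String]): Unit = {
      val env = StreamExecutionEnvironment.getExecutionEnvironment

      val props = new Properties()
      props.setProperty("bootstrap.servers", "localhost:9092")  // hypothetical broker
      props.setProperty("group.id", "stream-101")

      // Message bus -> Stream Processing Engine
      val logs: DataStream[String] =
        env.addSource(new FlinkKafkaConsumer010[String]("system-log", new SimpleStringSchema(), props))

      // Trivial processing step; a real job would parse, window and aggregate here.
      logs.filter(_.contains("ERROR")).print()   // "output system" = stdout for this sketch

      env.execute("kafka-to-console")
    }
  }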
  • 44. @kimutansk Stream Processing Engine genealogy ストリーム処理エンジンは図のようにいくつかの カテゴリに分類することができる。 (Diagram: Stream Processing engines plotted by time of release, grouped into the categories Pure Stream Processing / With Dataflow Design UI / DSL / Managed Service.)
  • 45. @kimutansk Stream Processing Engine genealogy 「純ストリーム処理エンジン」「UIで処理を定義可能」 「DSL」「マネージドサービス」といったカテゴリが存在。 • Pure Stream Processing – Basic Stream Processing engines, – each with functions that distinguish it from the other engines. • With Dataflow Design UI – Products that have a dataflow design UI, – so users can design Stream Processing easily. • DSL – Write the code once, run it on multiple Stream Processing engines. – The DSL generates an abstract dataflow definition. • Managed Service – The execution environment is managed on a public cloud.
  • 46. @kimutansk Product Introduction:Storm 実質的に広く使用された初のOSSストリーム処理エンジン。 問題も多かったが、以後のプロダクトに大きく影響を与えている。 • In 2011, open sourced by Twitter. – Developed in Clojure. • A deep dive requires Clojure skills. – Practically the first widely used OSS Stream Processing engine. • "At least once" semantics supported from the initial version. – As an early product, Storm had many problems. • Latency is very low, but throughput is low. • No back-pressure. • Default process placement is inefficient. • The message ack function runs per message. • In current versions, most of these problems are solved. – Storm influenced many later Stream Processing products.
  • 47. @kimutansk Product Introduction:Spark Streaming バッチ処理エンジン上でマイクロバッチとして実行。 Sparkエコシステム、開発手法を使えるのが大きい。 • In 2013, open sourced by AMPLab. – Developed in Scala. – On the Batch Processing framework Spark, streaming is pseudo-realized by sequentially executed mini batches. • Called "micro-batches". – Throughput is high, but latency is also high. • Compared with Storm at the time. • Compared with Flink or Apex....? – The big advantage is running on the Spark ecosystem. • Spark components such as Spark SQL and Spark MLlib can be used, • along with the same development methods.
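  For illustration, a minimal micro-batch word count with Spark Streaming (a sketch; the socket source, local master and 5-second batch interval are arbitrary choices, not from the slides):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  object MicroBatchWordCount {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("MicroBatchWordCount").setMaster("local[2]")
      // Each 5-second micro-batch is executed as a small Spark job.
      val ssc = new StreamingContext(conf, Seconds(5))

      val lines = ssc.socketTextStream("localhost", 9999)   // hypothetical source
      val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
      counts.print()

      ssc.start()
      ssc.awaitTermination()
    }
  }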
  • 48. @kimutansk Product Introduction:NiFi 画面上でデータフローを定義し、ストリーム処理を構築可能。 データ管理機能も優れるが、構成管理には課題がある。 • In 2014, open sourced by the NSA. – Developed in Java. – Design the dataflow in the UI, deploy it to a NiFi cluster, and it is executed on the cluster. • Ex) Get from Kafka > enrichment > put to HDFS – Between components NiFi has message queues, and priority and QoS settings can be configured per queue. – NiFi traces each piece of data's source and modification history. • Useful for data management. – But managing dataflows as code is difficult.
  • 50. @kimutansk Product Introduction:Flink バッチストリームの両方に対応したデータ処理エンジン。 独自のスナップショット方式と、多彩なAPIを提供している。 • In 2014, open sourced. – In 2011 it was named Stratosphere. – Developed in Scala. – A data processing engine that provides both Batch and Stream APIs. – For fault tolerance it uses a distributed snapshot method. • "Lightweight Asynchronous Snapshots for Distributed Dataflows" • Flink can take snapshots asynchronously and efficiently. – There are 3 kinds of APIs for developing Flink applications. • High-level API • Low-level API • Table API (SQL-like)
  • 51. @kimutansk Product Introduction:Apex 耐障害性重視のストリーム処理基盤。 状態管理、実行最適化、オートスケールなどの機能が充実。 • In 2015, open sourced by DataTorrent. – Developed in Java. – Originally used in financial applications. • Focuses on fault tolerance • and problem traceability in production environments. • There are message buffers between operators, so when a failure occurs the impact is limited. – Both state management and runtime optimization are considered. • Apex uses an HDFS-backed KV store, giving low latency and fault tolerance. • A YARN-native application. – Auto-scaling at runtime.
  • 52. @kimutansk Product Introduction:Gearpump GoogleのMillWheelを参考に開発されたActorベースプロダクト。 性能・拡張性に優れるがエキスパート向き。 • In 2015, open sourced by Intel. – Developed in Scala. – Developed with reference to the MillWheel design. • Google's Stream Processing paper: • "MillWheel: Fault-Tolerant Stream Processing at Internet Scale" – A lean Stream Processing engine with high extensibility. • But application developers must write their own state management code. • Performance is high, but so is development cost. – Based on Reactive Streams, with standardized back-pressure functions. – With an akka-streams-like syntax, users can build dataflow graphs intuitively.
  • 53. @kimutansk Product Introduction:Kafka Streams Kafkaと組み合わせてストリーム処理を構築するコンポーネント Stream/Tableを統合するコンセプトを持ち、シンプルだが機能充実 • In 2016, produced by Confluent. – Developed in Java. – A component of Kafka. – A library for implementing Stream Processing applications. • Kafka Streams does not include process clustering or high-availability features. • These elements are left to the user. – Practically designed for Kafka. • Input sources and output destinations are Kafka. – The key concept is Streams and Tables. • There is a close relationship between Streams and Tables. – Simple, but the functions are pretty powerful. • Dual APIs (declarative, imperative), queryable state, windowing.
  • 54. @kimutansk Product Introduction:Beam 統一的なストリーム/バッチ処理モデルを提供するDSL 様々な環境で実行可能だが、プロダクト固有機能は使用できない。 • In 2016, open sourced by Google. – Developed in Java. – Unified Stream / Batch processing for big data. • Beam provides a data-processing abstraction. – An application developed with Beam can execute on multiple Stream Processing engines. • Local executor • Google Cloud Dataflow (Google Cloud Platform) • Spark, Flink, Apex, Gearpump – In exchange for this portability, users cannot use each product's specific libraries. • Machine learning, graph processing, etc... • Such functions need to be executed separately (e.g. TensorFlow).
  • 55. @kimutansk Product Introduction:Cloud Dataflow GCP上で提供されるストリームバッチ対応フルマネージドサービス マネージドサービスのため、動的調整、最適化が強力。 • In 2015, produced by Google. – Applications developed with Beam (the Dataflow API at the time) are executed as a managed service on Google Cloud Platform. – Compared with other managed Stream Processing services, it can cover a wide application area, • because applications are developed as regular Stream Processing applications. – Being a managed service, it auto-scales and optimizes resource allocation.
  • 56. @kimutansk Product Introduction:KinesisAnalytics AWS上で提供される、ストリームに継続クエリを実行するサービス 機能は限られるが、非常に手軽に使うことができる。 • In 2016, produced by Amazon. – With SQL, users can apply continuous queries to streaming data. • The "data in motion" concept. – Each Kinesis Analytics application has one input stream and at most 3 output streams. • Input sources and output targets are limited to the Kinesis family. – Functions are limited, but it is very easy to start. – It distinguishes EventTime and ProcessingTime, • so good windowing functions can be used. – It auto-scales, but an application's resource usage is difficult to predict.
  • 57. @kimutansk Which product should you use? 初めはFlinkかApexがバランスがよく無難、Gearpumpは上級者向。 他は状況や、実行環境、既存システムによって使い分ければいい。 • ※Just my opinion. • Flink or Apex : for a first Stream Processing app – A good balance of functions, performance and ease of use. • Gearpump : for akka experts – Good performance and extensibility, but difficult for first use. • Spark Streaming : for Spark users – High compatibility with the other Spark components. • Beam / Cloud Dataflow : for public cloud users – Good portability between on-premise and public cloud. • NiFi : for users running many small Stream Processing apps. • Kafka Streams : auxiliary use alongside other products.
  • 58. @kimutansk However とはいっても、実際は検討が必要。 これから主要な検討ポイントを説明する。 • However, selecting a product still needs consideration. • From here, the major consideration points are explained.
  • 60. @kimutansk Consideration point list 技術的な検討ポイントの中で、代表的なもの。 問題領域、可用性、システム管理、開発方式に大きく分けられる。 • When developing Stream Processing systems, there are many technical consideration points. – For each, it must be clarified whether the Stream Processing product has the function or not. • Target problem area – Time model – Windowing – Out-of-order processing • System reliability – State management – Fault tolerance – Re-execution – Message delivery semantics • System management – UI – Logging – Back-pressure – Scale out / scale in – Data security • Development method – API – Specific libraries – Environment, operation
  • 61. @kimutansk Target problem area これまで説明してきたように、時刻モデルやWindowing、 Out of order処理でシステムが何に対応可能かが決定する。 • Time model – Is it necessary to handle EventTime, or is ProcessingTime enough? – If ProcessingTime only is OK, the system is simpler. • But changing this later is difficult. • Windowing – Which window types are needed? • Tumbling, sliding, session • Out-of-order processing – Related to "time model" and "windowing". • How much lateness is allowed? • How is late data handled?
  • 62. @kimutansk System reliability(1) システムの信頼性の担保のために、検討が必要な項目 状態管理、耐障害性、バグが発生した時の再実行可能性 • State management – Is state saved on the local machine or in a remote datastore? – In which format is the state serialized? • A trade-off between reliability and performance. • Fault tolerance – When the system fails, how wide is the impact, and how much latency is added? • Is recovery automatic or manual? • How long is the mean time to repair (MTTR)? • Re-execution – After a system failure or a program bug, is re-execution necessary? • For re-execution, messages need to be stored long term.
  • 63. @kimutansk System reliability(2) メッセージ配信のセマンティクスも検討が必要 注意:単体で「全てに対応可能なExactly once」は実現不可 • Message delivery semantics – Which semantics does message processing need? • At most once • At least once • Exactly once – Premise: exactly once that covers every pattern on its own is impossible! – A Stream Processing system can only guarantee that "its own state is processed exactly once." – When the system outputs to external systems, deduplication or an "accumulation" capability on the external side is needed.
  • 64. @kimutansk Exactly once NG pattern 外部に対するアクセスと状態保存が「Atomic」でないため。 メッセージ通知時アラーム発報するシステムを考える。 • Because external access and persisting state are not atomic. Consider a system that sends an alarm when a message is notified.
  • 65. @kimutansk Exactly once NG pattern 通知後、状態を更新する前に障害が発生すると・・・? • Because external access and persisting state are not atomic. After the alarm is sent, what if a failure occurs before the state is persisted?
  • 66. @kimutansk Exactly once NG pattern 再度通知が行われ、重複してアラームが発報される! • Because external access and persisting state are not atomic. The message is re-notified, and a duplicate alarm is sent!
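  One common mitigation, as noted above, is to let the external side deduplicate: derive a deterministic id from the triggering message so that a re-notified message maps to the same alarm. A plain Scala sketch (the Alarm type and the in-memory set standing in for the external system's unique-key check are hypothetical):

  case class Alarm(id: String, text: String)

  // Stand-in for deduplication on the receiving side (e.g. a unique key in a datastore).
  // In a real system this state must live in the external system, not in the stream job.
  val delivered = scala.collection.mutable.Set.empty[String]

  def sendOnce(alarm: Alarm)(send: Alarm => Unit): Unit =
    if (delivered.add(alarm.id)) send(alarm)   // Set.add returns false if the id was already present

  // Re-processing after a failure re-sends the same logical alarm, but only one call goes out.
  val notify: Alarm => Unit = a => println(s"ALERT ${a.id}: ${a.text}")
  sendOnce(Alarm("msg-42", "many logins in a short time"))(notify)
  sendOnce(Alarm("msg-42", "many logins in a short time"))(notify)   // suppressed as a duplicate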
  • 67. @kimutansk System management(1) 大体のシステムはUIを持っている。持つ機能は重要。 ログを集約して解析可能とする仕組みは必須。 • UI – Most Stream Processing products have their own UI. • What information can users see in the UI? • What operations can users execute from the UI? – Especially important: the execution graph. • If shuffle status is displayed, diagnosis becomes easy. • Logging – In a distributed system, logging into each server to check logs is unrealistic. – So a log collection function is important for error analysis.
  • 68. @kimutansk System management(2) バックプレッシャー機能が無いと事前見切りや監視が必要。 基本常時動作するため、動作中の拡張縮小ができるといい。 • Back-pressure – If performance is unbalanced between components, a back-pressure function is needed. – Without it, each component's capacity must be estimated in advance, and such estimates are difficult and unreliable. • Scale out / scale in – Can the system scale out / scale in at runtime? – Or does it require a restart? – What is the minimum execution unit? • Does it depend on the external system's partitioning? – Stream Processing systems basically run continuously, so this needs to be clarified in advance.
  • 69. @kimutansk System management(3) データの暗号化が必要な場合、どこでどうやっておこなうか? また、アプリケーション単位でアクセス権限管理は必要? • Data security – Is encryption of stored data necessary? • If so, at what point and how is data encrypted? – Is application-specific access control necessary for the data?
  • 70. @kimutansk Development method(1) 開発時どのAPIを使用可能か?(開発速度⇔カスタマイズ性) 複数のAPIをシステムに混在させることもプロダクト次第で可能。 • API – Which development abstraction can be selected? • Depends on the team members' skills and the system's priority (customizability? development speed?). – The 3 major development API abstractions are below. • Declarative, expressive API – Development speed:○, Customizability:○ – like: map(), filter(), join(), sort(), etc... • Imperative, lower-level API – Development speed:△, Customizability:◎ – like: process(event) • Streaming SQL – Development speed:◎, Customizability:△ – like: STREAM SELECT ... FROM ... WHERE ... TO ...
  • 71. @kimutansk Development method(2) 宣言的なアプリケーション実装。 関数でStreamを加工し、個々の処理を行う。 • Declarative, expressive API example – By Flink
  case class CarEvent(carId: Int, speed: Int, distance: Double, time: Long)

  val carEventStream: DataStream[CarEvent] = ...
  val topSpeed = carEventStream
    .assignAscendingTimestamps(_.time)
    .keyBy("carId")
    .window(GlobalWindows.create)
    .evictor(TimeEvictor.of(Time.of(evictionSec * 1000, TimeUnit.MILLISECONDS)))
    .trigger(DeltaTrigger.of(triggerMeters,
      new DeltaFunction[CarEvent] {
        def getDelta(oldSp: CarEvent, newSp: CarEvent): Double = newSp.distance - oldSp.distance
      },
      carEventStream.getType().createSerializer(env.getConfig)))
    .maxBy("speed")
  • 72. @kimutansk Development method(3) 手続き的なアプリケーション実装。 processメソッドをStreamに適用し、個々の処理を行う。 • Imperative, lower-level API example – By Flink
  val stream: DataStream[(String, String)] = ...
  val result: DataStream[(String, Long)] = stream
    .keyBy(0)
    .process(new CountWithTimeoutFunction())

  case class CountWithTimestamp(key: String, count: Long, lastModified: Long)

  class CountWithTimeoutFunction extends ProcessFunction[(String, String), (String, Long)] {
    lazy val state: ValueState[CountWithTimestamp] = getRuntimeContext
      .getState(new ValueStateDescriptor[CountWithTimestamp]("myState", classOf[CountWithTimestamp]))

    override def processElement(value: (String, String), ctx: Context,
                                out: Collector[(String, Long)]): Unit = ...

    override def onTimer(timestamp: Long, ctx: OnTimerContext,
                         out: Collector[(String, Long)]): Unit = ...
  }
  • 73. @kimutansk Development method(4) Streaming SQLによるアプリケーション実装。 Streamにスキーマを設定し、SQLで処理を行う。 • Streaming SQL Example – By Flink
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  val tableEnv = TableEnvironment.getTableEnvironment(env)

  // read a DataStream from an external source
  val ds: DataStream[(Long, String, Integer)] = env.addSource(...)
  // register the DataStream under the name "Orders"
  tableEnv.registerDataStream("Orders", ds, 'user, 'product, 'amount)
  // run a SQL query on the Table and retrieve the result as a new Table
  val result = tableEnv.sql(
    "SELECT product, amount FROM Orders WHERE product LIKE '%Rubber%'")
  • 74. @kimutansk Development method(5) プロダクトの固有ライブラリで必要なものはあるか? 簡易な実行環境や、既存のプロセスとの親和性も重要。 • Specific libraries – Each product has its own specific libraries. • Machine learning • Graph processing • Adapters for external components – If the product lacks the library you need, it has to be developed. • Environment, operation – Can it execute on a local machine? – Does it require installation in advance, or only deployment via a resource manager? – For program updates, is a restart allowed, or is a rolling update needed? – Can it coexist with the team's existing development tools? – What language is the product developed in? (important for diagnosis)
  • 75. @kimutansk Real Stream Processing performance problems 実際のストリーム処理システムで 良く発生するパフォーマンス問題
  • 76. @kimutansk Performance problems list プロダクトにかかわらず存在する代表的な問題 「ファイルアクセス」「プロセス間通信」「外部の限界」「GC」 • Typical performance problems of Stream Processing systems are below. – File access – Communication between processes – External system performance limits – GC • Of course, depending on the product, many other performance problems exist. – In such cases, analyze them case by case, using the previous chapter "System management" (UI, logging) as a reference.
  • 77. @kimutansk File access 初心者の時によくはまるのがローカルファイルへの同期アクセス キャッシュを使うか非同期で問題ない設計にする必要がある。 • ※This problem is often encountered by beginners. – If a Stream Processing component accesses the local file system synchronously for every event, that component becomes the bottleneck. – Avoid synchronous file system access with a cache, – or change the design so that synchronous access is unnecessary. • For example, update the file asynchronously in batches (see the sketch below).
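  A minimal sketch of the asynchronous batch-update approach in plain Scala (the file path, flush interval and single-thread scheduler are illustrative assumptions): events only touch an in-memory buffer on the hot path, and a background thread flushes the batch to the file periodically.

  import java.nio.charset.StandardCharsets
  import java.nio.file.{Files, Paths, StandardOpenOption}
  import java.util.concurrent.{Executors, TimeUnit}
  import scala.collection.mutable.ArrayBuffer

  class BufferedFileSink(path: String, flushIntervalSec: Long = 5) {
    private val buffer = ArrayBuffer.empty[String]
    private val scheduler = Executors.newSingleThreadScheduledExecutor()

    scheduler.scheduleAtFixedRate(new Runnable {
      override def run(): Unit = flush()
    }, flushIntervalSec, flushIntervalSec, TimeUnit.SECONDS)

    // Hot path: no file I/O, just an in-memory append.
    def write(event: String): Unit = buffer.synchronized { buffer += event }

    // Background path: drain the buffer and append the whole batch in one write.
    private def flush(): Unit = {
      val batch = buffer.synchronized { val b = buffer.toVector; buffer.clear(); b }
      if (batch.nonEmpty) {
        Files.write(Paths.get(path),
          batch.mkString("", "\n", "\n").getBytes(StandardCharsets.UTF_8),
          StandardOpenOption.CREATE, StandardOpenOption.APPEND)
      }
    }
  }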
  • 78. @kimutansk Communication between processes バッチ処理と同じく、プロセス間通信のコストは大きい。 主にShuffle時に発生するため、低減するための対処が必要。 • As in Batch Processing, communication between processes has a high performance impact. – It mainly occurs at the "shuffle". – For example, reduce the communication cost by grouping each partition's data before the shuffle (see the sketch below), – or merge components to reduce inter-process communication. – But excessive merging leads to high component maintenance cost and a lack of performance-tuning flexibility.
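  A tiny illustration of pre-aggregation before the shuffle in plain Scala (the event names and counts are made up): instead of shipping one record per event, each partition ships one partial sum per key.

  // Events seen by one partition before the shuffle.
  val partitionEvents: Seq[(String, Int)] =
    Seq(("click", 1), ("view", 1), ("click", 1), ("click", 1), ("view", 1))

  // Without pre-aggregation: 5 records cross the network.
  // With pre-aggregation: only one partial sum per key crosses the network.
  val partialSums: Map[String, Int] =
    partitionEvents.groupBy(_._1).map { case (k, vs) => k -> vs.map(_._2).sum }

  // partialSums == Map("click" -> 3, "view" -> 2); the downstream task then merges
  // partial sums from all partitions instead of raw events.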
  • 79. @kimutansk External system performance limit ストリーム処理システム自体がインメモリで並列処理をするため 他システムのスループットを超過することがしばしば発生する。 • The performance limit of external systems (message bus, output systems) can be reached. – In general, Stream Processing systems work in memory and run concurrently/in parallel. They tend to produce large throughput, so they sometimes overwhelm external systems. • In the Storm / Spark Streaming generation, Kafka was generally faster than the Stream Processing engine. • But in the current generation (Flink, Apex), the Stream Processing engine is sometimes faster than Kafka. • Tune cluster size and replication settings for the message bus and output systems.
  • 80. @kimutansk GC JVM上で動作する以上避けられないGCの問題。まずチューニング。 それで駄目ならリスクを負って強引な対処をするしか・・・ • An unavoidable problem when executing on the JVM. • Stream Processing systems handle a huge number of events, so they generate a huge number of objects. • First, tune the JVM. • If that is still not enough... – In application code, suppress object creation as much as possible. – For objects holding event contents, use byte arrays instead of individual objects. • But these countermeasures have a bad impact on system maintainability and quality. – Better not to do this if you can avoid it.
  • 81. @kimutansk GC GCの影響を抑えるために、 オブジェクトをバイト配列で用いるコードイメージ。 • Code example of using a byte array instead of objects. – Basic Scala value object (carId length = fixed 16 bytes)
  case class CarEvent(carId: String, speed: Int, distance: Double, time: Long)
  • 82. @kimutansk GC GCの影響を抑えるために、 オブジェクトをバイト配列で用いるコードイメージ。 • Code example of using a byte array instead of objects. – Byte-array Scala value object (carId length = fixed 16 bytes)
  import java.nio.ByteBuffer
  import java.nio.charset.StandardCharsets

  class CarEventByteArray() {
    // 16 bytes for carId, 4 for speed, 8 for distance, 8 for time
    lazy val content = ByteBuffer.allocate(16 + 4 + 8 + 8)

    def this(carId: String, speed: Int, distance: Double, time: Long) {
      this()
      content.put(carId.getBytes(StandardCharsets.US_ASCII), 0, 16)
      content.putInt(16, speed)
      content.putDouble(20, distance)
      content.putLong(28, time)
    }

    // carId is assumed to be exactly 16 US-ASCII characters
    def setCarId(carId: String): Unit = {
      val bytes = carId.getBytes(StandardCharsets.US_ASCII)
      for (i <- 0 until 16) content.put(i, bytes(i))
    }

    def getCarId(): String = {
      val bytes = Array.tabulate[Byte](16)(i => content.get(i))
      new String(bytes, StandardCharsets.US_ASCII)
    }

    // ... accessors for speed, distance and time follow in the same way ...
  }
  • 84. @kimutansk Common misconceptions 初期のプロダクトに問題が多かったこともあり、 ストリーム処理システムにはよくある誤解がある。 • Because of this history, some misconceptions about Stream Processing systems persist. – Stream Processing is only applicable to approximate results. – Latency and throughput: we must choose one. – Micro-batching means better throughput. – Exactly-once semantics is completely impossible. – Stream Processing only applies to "real-time". – Stream Processing is too hard anyway.
  • 85. @kimutansk Answer for misconceptions(1) 近似にのみ使用可能、は初期のStormから来ている誤解。 同様に今はレイテンシとスループットではなく違うトレードオフ • Stream Processing is only applicable to approximate results. – For the initial Storm this was true, so combined use with Batch Processing was required (the Lambda Architecture). – Today it is controllable with Watermark and Trigger. • Latency and throughput: we must choose one. – This also comes from the early Storm vs Spark Streaming comparison. – Today the trade-off has 3 axes. • Completeness • Low Latency • Low Cost
  • 86. @kimutansk Answer for misconceptions(2) マイクロバッチだからスループットに優れる理由にはならない。 Exactly onceは出来るパターンを見極めて対応が必要。 • Micro-batching means better throughput. – In real systems, data is already buffered at lower layers, so micro-batching by itself is not a reason for good throughput. – On the contrary, managing micro-batch jobs can hurt performance. • Exactly-once semantics is completely impossible. – A Stream Processing system can guarantee that "its own state is processed exactly once." – For output to external systems, deduplication or an "accumulation" capability on the external side is needed.
  • 87. @kimutansk Answer for misconceptions(3) ストリーム処理システムはリアルタイム処理以外にも適用可能 開発コストも開発しやすいAPIが増えて、下がってきている。 • Stream Processing only applies to "real-time". – As shown in a previous chapter, "Batch Processing is a subset of Stream Processing." – So Stream Processing can also handle batch jobs. • But performance efficiency needs to be confirmed. • Stream Processing is too hard anyway. – With an unbounded data source and very frequently updated data, it is easier to adapt than Batch Processing. – Early Stream Processing products only had imperative, lower-level APIs, so development cost was high. – But current products also have declarative, expressive APIs and Streaming SQL, so development cost has become lower.
  • 88. @kimutansk Summary ストリーム処理システムとは何かを紹介し、 プロダクトと検討ポイント、よくある誤解について説明しました。 • Stream Processing is a superset of Batch Processing. – But a new problem appears: "out-of-order" data. • There are countermeasures for the new problem. – Watermark / Trigger / Accumulation • A typical Stream Processing system consists of – a message bus, a Stream Processing engine, and output systems. • There are many Stream Processing products. – Product selection needs consideration. – The consideration points were introduced, • along with typical performance problems and common misconceptions.
  • 90. @kimutansk Reference materials • The world beyond batch: Streaming 101 – https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 • The world beyond batch: Streaming 102 – https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102 • MillWheel: Fault-Tolerant Stream Processing at Internet Scale – https://research.google.com/pubs/pub41378.html • The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing – https://research.google.com/pubs/pub43864.html • The Evolution of Massive-Scale Data Processing – https://goo.gl/jg4UAb • Streaming Engines for Big Data – http://www.slideshare.net/stavroskontopoulos/voxxed-days-thessaloniki-21102016-streaming-engines-for-big-data • Introduction to Streaming Analytics – http://www.slideshare.net/gschmutz/introduction-to-streaming-analytics-69120031
  • 90. @kimutansk Reference materials • Stream Processing Myths Debunked:Six Common Streaming Misconceptions – http://data-artisans.com/stream-processing-myths-debunked/ • A Practical Guide to Selecting a Stream Processing Technology – http://www.slideshare.net/ConfluentInc/a-practical-guide-to-selecting-a-stream- processing-technology – https://research.google.com/pubs/pub41378.html • Apache Beam and Google Cloud Dataflow – http://www.slideshare.net/SzabolcsFeczak/apache-beam-and-google-cloud- dataflow-idg-final-64440998 • The Beam Model – https://goo.gl/6ApbHV • THROUGHPUT, LATENCY, AND YAHOO! PERFORMANCE BENCHMARKS. IS THERE A WINNER? – https://www.datatorrent.com/blog/throughput-latency-and-yahoo/ • Lightweight Asynchronous Snapshots for Distributed Dataflows – https://arxiv.org/abs/1506.08603
  • 91. Thank you for your attention! Enjoy Stream Processing! https://www.flickr.com/photos/neokratz/4913885458