3. Better be safe than sorry
Failures will happen
EMC estimated $1.7 billion in costs due to data loss and system downtime
Recovery will save you time and costs
Switch between algorithms
Live upgrade of your system
5. Fault tolerance guarantees
At most once
• No guarantees at all
At least once
• For many applications sufficient
Exactly once
Flink provides all guarantees
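The difference between the three guarantees can be illustrated with a toy, single-threaded model. This is only a sketch under simplifying assumptions, not Flink's actual distributed mechanism: the class `DeliveryGuarantees`, the `Mode` enum and the `run` method are hypothetical names, the "job" is a running sum over the records 1..n, and a single failure is injected just before one record is applied.

```java
class DeliveryGuarantees {
    enum Mode { AT_MOST_ONCE, AT_LEAST_ONCE, EXACTLY_ONCE }

    /**
     * Sums the records 1..n, checkpointing every `interval` records and
     * failing once, just before record `failAt` would be applied.
     */
    static long run(Mode mode, int n, int interval, int failAt) {
        long sum = 0;            // operator state
        long checkpointSum = 0;  // operator state stored in the last checkpoint
        int checkpointPos = 0;   // source offset stored in the last checkpoint
        boolean failed = false;
        int pos = 0;             // number of records consumed so far

        while (pos < n) {
            int record = pos + 1;
            if (!failed && record == failAt) {
                failed = true;
                switch (mode) {
                    case AT_MOST_ONCE:
                        pos++;               // in-flight record is lost, nothing is replayed
                        break;
                    case AT_LEAST_ONCE:
                        pos = checkpointPos; // source is replayed, but state is not rolled back
                        break;
                    case EXACTLY_ONCE:
                        pos = checkpointPos; // source and state are restored together
                        sum = checkpointSum;
                        break;
                }
                continue;
            }
            sum += record;
            pos++;
            if (pos % interval == 0) {       // take a checkpoint
                checkpointPos = pos;
                checkpointSum = sum;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        // The correct sum of 1..10 is 55; the failure hits before record 7.
        for (Mode mode : Mode.values()) {
            System.out.println(mode + ": " + run(mode, 10, 4, 7));
        }
    }
}
```

In this model, at most once loses record 7 (result 48), at least once counts records 5 and 6 twice (result 66), and only exactly once yields the correct 55.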
9. Operator State
Stateless operators
ds.filter(_ != 0)
System state
ds.keyBy(0).window(TumblingTimeWindows.of(5, TimeUnit.SECONDS))
User defined state
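import org.apache.flink.api.common.functions.RichReduceFunction;
import org.apache.flink.api.common.state.OperatorState;
import org.apache.flink.configuration.Configuration;

public class CounterSum extends RichReduceFunction<Long> {
    // User-defined state: counts how often reduce() has been called
    private OperatorState<Long> counter;

    @Override
    public Long reduce(Long v1, Long v2) throws Exception {
        counter.update(counter.value() + 1);
        return v1 + v2;
    }

    @Override
    public void open(Configuration config) throws Exception {
        // Register the state under the name "counter" with default value 0
        counter = getRuntimeContext().getOperatorState("counter", 0L, false);
    }
}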
14. Advantages
Separation of app logic from recovery
• Checkpointing interval is just a config parameter
High throughput
• Controllable checkpointing overhead
Low impact on latency
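The first point above, that the checkpointing interval is only a configuration parameter, can be sketched without any Flink dependency. The following is a hypothetical toy runner, not Flink's API: `CheckpointIntervalSketch`, `Stats` and `run` are made-up names. The user function contains no recovery code; changing `interval` only trades the number of snapshots against the number of records replayed after a failure, while the result stays the same.

```java
import java.util.function.BinaryOperator;

class CheckpointIntervalSketch {
    /** Outcome of one run: the job result plus two recovery-related costs. */
    static final class Stats {
        final long result;      // final value produced by the user function
        final int checkpoints;  // snapshots taken during processing
        final int replayed;     // records reprocessed after the failure

        Stats(long result, int checkpoints, int replayed) {
            this.result = result;
            this.checkpoints = checkpoints;
            this.replayed = replayed;
        }
    }

    /**
     * Folds `input` with `userFn`, snapshotting every `interval` records.
     * One failure is injected after `failAt` records have been consumed;
     * the runner then restores the last snapshot and replays from there.
     */
    static Stats run(BinaryOperator<Long> userFn, long[] input, int interval, int failAt) {
        long state = 0, snapshot = 0;
        int snapshotPos = 0, checkpoints = 0, replayed = 0;
        boolean failed = false;
        int pos = 0;

        while (pos < input.length) {
            if (!failed && pos == failAt) {
                failed = true;
                replayed = pos - snapshotPos; // work that has to be redone
                state = snapshot;             // roll state back
                pos = snapshotPos;            // rewind the source
                continue;
            }
            state = userFn.apply(state, input[pos]);
            pos++;
            if (pos % interval == 0) {
                snapshot = state;
                snapshotPos = pos;
                checkpoints++;
            }
        }
        return new Stats(state, checkpoints, replayed);
    }

    public static void main(String[] args) {
        long[] input = new long[12];
        for (int i = 0; i < input.length; i++) input[i] = i + 1;

        // Same application logic (Long::sum), two different intervals.
        for (int interval : new int[] {2, 6}) {
            Stats s = run(Long::sum, input, interval, 11);
            System.out.println("interval=" + interval + " result=" + s.result
                    + " checkpoints=" + s.checkpoints + " replayed=" + s.replayed);
        }
    }
}
```

With an interval of 2 the run takes 6 snapshots and replays 1 record; with an interval of 6 it takes 2 snapshots and replays 5 records. Both runs produce 78.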
31. TL;DL
Job recovery mechanism with low latency and high throughput
Exactly-once processing semantics
No single point of failure
Flink will always keep processing your data
30 nodes, 4 cores, 15 GB
Flink
• 720,000 events per second per core
• 690,000 events per second per core with checkpointing activated
Storm
• With at-least-once: 2,600 events per second per core
GCE: 30 instances with 4 cores and 15 GB of memory each.
Flink master from July 24th; Storm 0.9.3.
All the code used for the evaluation can be found here.
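To put the per-core numbers of this benchmark (720,000, 690,000 and 2,600 events per second) into relation, the checkpointing overhead and the throughput ratio can be worked out directly. The snippet below only does arithmetic on those reported figures; `ThroughputComparison` and its methods are hypothetical helper names.

```java
class ThroughputComparison {
    /** Relative throughput loss in percent when checkpointing is switched on. */
    static double overheadPercent(double baseline, double withCheckpointing) {
        return 100.0 * (baseline - withCheckpointing) / baseline;
    }

    /** How many times faster `a` is than `b`. */
    static double ratio(double a, double b) {
        return a / b;
    }

    public static void main(String[] args) {
        double flink = 720_000;             // events/s/core, no checkpointing
        double flinkCheckpointed = 690_000; // events/s/core, checkpointing activated
        double stormAtLeastOnce = 2_600;    // events/s/core

        System.out.printf("Checkpointing overhead: %.1f%%%n",
                overheadPercent(flink, flinkCheckpointed));
        System.out.printf("Flink (checkpointed) vs. Storm (at-least-once): %.0fx%n",
                ratio(flinkCheckpointed, stormAtLeastOnce));
    }
}
```

Activating checkpointing costs roughly 4% of throughput in these numbers, and the checkpointed Flink job still processes about 265 times as many events per core as Storm with at-least-once.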
Flink
• 1.5 million elements per second per core
• Aggregate throughput in cluster: 182 million elements per second
Storm
• 82,000 elements per second per core
• Aggregate throughput: 0.57 million elements per second
• With acknowledgements: 4,700 elements per second per core, latency 30-120 milliseconds
Trident
• 75,000 elements per second per core
Flink
• Buffer timeout 0: median latency 0 ms, 99th percentile 20 ms
• 24,500 events per second per core