Enforcing formats, changing schemas, and introducing privacy filters have always been challenging with the classical Kafka-API. In this talk we'll cover how to extend existing applications with WebAssembly, allowing developers to change the shape of data at runtime, per application, without creating additional topics. By leveraging WebAssembly, we can extend the capabilities of the Kafka-API beyond what was initially imagined. Come and learn about the future of the Kafka-API.
Data Policies for the Kafka-API with WebAssembly | Alexander Gallego, Vectorized
1. Data Policies for the Kafka-API with WebAssembly
https://github.com/vectorizedio/redpanda
2. background
● i’ve worked on streaming systems for 12+ years
● developer, founder & CEO of Vectorized, hacking on Redpanda, a modern streaming platform for mission-critical workloads
● previously, principal engineer at Akamai; co-founder & CTO of concord.io, a high-performance stream processing engine built in C++, acquired by Akamai in 2016
alex gallego
@emaxerrno
4. keystone problem in streaming: consume what you want
[diagram: The Data → what you care about]
The Good Parts of Streaming
○ Streaming data is immutable
■ Good for understandability
■ Built-in auditing
○ Ability to replay data
○ The essence of a streaming platform is to decouple producers from consumers
■ Focuses on the data rather than who produced it, what produced it, and when
○ Proven at scale
5. keystone problem in streaming: consume what you want
[diagram: The Data → what you care about]
The *Not so* Good Parts of Streaming
○ Consume *at least* as much as you produce
■ In the same order
■ Commonly 10x more than you produce
○ Up-front architecture cost of splitting your streams, types, and data contracts for things like privacy, etc…
○ Can easily saturate your network (simply shifting the bottleneck)
■ Forces developers to create specialized clusters
● Mission Critical Prod
● Analytics/Dashboard Prod
● ML sample clusters
6. keystone problem in streaming: consume what you want
[diagram: The Data → what you care about; 🥳🤯🎉 hooray!]
7. performance improvement
● 2007: SSD at $2500/TB, typical instance with 4 cores; first open source solutions take advantage of cheap disk
● 2011: disaggregate compute and storage
● 2020: SSD at $200/TB (1000x faster, 10x cheaper), 225-core VMs (30x more cores), 100Gbps NICs (100x more throughput); modern hardware + cloud native
30x taller computers + 1000x faster disks
8. thread per core architecture
● explicit scheduling everywhere
○ IO groups
○ x-core groups (smp)
○ memory throttling
● ONLY supports async interfaces
○ requires library re-writes for the threading model to work well
9. future<>
async-only, cooperative scheduling framework; a new way to build software:
● viral primitive (like actors: Orleans, Akka, Pony, etc.): mix, map-reduce, filter, chain, fail, complete, generate, fulfill, sleep, expire futures, etc.
● fundamentally about program structure: with concurrent structure, parallelism is a free variable
● one pinned thread per core: must express parallelism and concurrency explicitly
● no locks on the hotpath: a network of SPSC queues
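A minimal sketch of this style, using JavaScript promises rather than Seastar's C++ future<> (names here are invented for illustration): the program is written as a chain of continuations over the data, and the cooperative scheduler decides when each step actually runs.

```javascript
// Yielding point: hand control back to the scheduler (the event loop).
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// A pipeline expressed as structure: map, then filter, then a
// cooperative yield before completing the future.
async function pipeline(records) {
  const mapped = records.map((r) => r.toUpperCase()); // "map" stage
  const filtered = mapped.filter((r) => r.length > 2); // "filter" stage
  await sleep(1); // yield cooperatively; the scheduler resumes us later
  return filtered;
}
```

Because the concurrency is expressed in the structure (each stage is a continuation), whether stages of independent pipelines run in parallel is left to the scheduler, which is the "parallelism is a free variable" point above.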
10. no virtual memory: buddy allocator
● preallocate 100% of memory; split across N cores for thread-local allocation/access
● create pools by dividing the memory one layer above by 2 and creating a new pool
● large allocations (above 64KB) are not pooled
● buddy allocator pools for all object sizes below 64KB
● full free-lists are recycled
● difficult to use this technique in practice; requires developer retraining and accounting for every single byte present in the system at all times
○ forces the developer to pay additional attention to all hash-maps, allocations, pooling, etc.
[diagram: core-local memory (global memory / N cores, usually 2GB+) recursively split in half into pools: Pool 0 is the large-object pool for allocations of 64KB+1 and above; buddy pools of 64KB, 16KB, 8KB, …]
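A rough sketch of the size-class idea above (not Redpanda's actual allocator, just the selection rule): round an allocation up to the next power-of-two pool, and send anything above 64KB to the unpooled large-object path.

```javascript
const MAX_POOLED = 64 * 1024; // 64KB: largest pooled size class

// Pick the pool that would serve an allocation of `size` bytes.
function poolFor(size) {
  if (size > MAX_POOLED) return "large"; // above 64KB: not pooled
  let pool = 1;
  while (pool < size) pool <<= 1; // next power of two >= size
  return pool;
}
```

For example, a 5000-byte allocation lands in the 8KB pool, while a 64KB+1 allocation goes to the large-object path.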
12. request pipelining per partition
● parallelism model == number of cores/pthreads in the system
● read full request metadata and assign each subrequest to a physical core
● for all non-overlapping cores, execute in parallel
● for all overlapping cores, pipeline per *partition* (enqueue writes in order)
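An illustrative sketch of the per-partition pipelining rule (names like `submit` are invented here): writes to the same partition chain behind one another, so they execute in order, while writes to different partitions proceed independently.

```javascript
// partition id -> tail of that partition's write chain
const tails = new Map();

// Enqueue a write behind all earlier writes to the same partition.
function submit(partition, write) {
  const prev = tails.get(partition) || Promise.resolve();
  const next = prev.then(write); // pipelined: runs only after `prev`
  tails.set(partition, next);
  return next;
}
```

Even if an earlier write is slow, a later write to the same partition waits its turn, while other partitions are free to run in parallel on other cores.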
13. core-local metadata piggybacking (...pandabacking?)
copy-on-read cache: x-shard metadata for low-latency access
● maintain a core-local metadata cache of
○ bytes written per partition (for future readers)
○ latencies from the remote core (could be highly contended and we need TCP backpressure)
○ per-TCP-connection read-ahead pointers on disk for O(1) access/assignment
14. core-local v8::isolate per topic/partition/policy
● maintain a v8::isolate *per core*
● maintain v8::context per topic/partition
○ thread_local v8::isolate
■ Low latency access
■ Preemption
● Timebound
● CPU Cycles
● Cooperative scheduling
○ No cross-core communications
○ Small memory footprint
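A toy model of the "one v8::context per topic/partition" caching above (invented names; the real implementation is C++ against the V8 embedding API): each key lazily gets its own context object and reuses it across requests, so no cross-core communication is needed to find it.

```javascript
// key (e.g. "topic/partition") -> its cached context object
const contexts = new Map();

// Return the context for `key`, creating it on first use.
function contextFor(key, makeContext) {
  let ctx = contexts.get(key);
  if (ctx === undefined) {
    ctx = makeContext(); // e.g. compile the topic's data policy
    contexts.set(key, ctx);
  }
  return ctx;
}
```

Because the map is core-local (thread_local in the real system), lookups are cheap and never contend with other cores.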
15. applying a .wasm or .js to a Kafka Topic
> bin/kafka-topics.sh
--alter --topic my_topic_name
--config x-data-policy={...}
(redpanda has a tool called `rpk`
that is similar to kafka-topics.sh)
● Must be a pure function
○ limit of global state per core is 1MB
● On TCP connection
○ Look up associated data policy for the
topic
○ Instantiate a v8::context
○ Perform reads from disk and subsystems as normal
○ *Before sending TCP bytes*
■ Call v8::isolate
■ Swap v8::context
■ Transform payload
■ Re-checksum payload
■ Return new RecordBatch
16. flow
import { InlineTransform } from "@vectorizedio/InlineTransform";

const transform = new InlineTransform();
transform.topics([{ "input": "lowercase", "output": "uppercase" }]);
...
// uppercase each ASCII byte in the record's value
const uppercase = async (record) => {
  const newRecord = {
    ...record,
    value: record.value.map((char) => {
      // 97..122 are the byte codes for 'a'..'z'
      if (char >= 97 && char <= 122) {
        return char - 32; // shift down to 'A'..'Z'
      } else {
        return char;
      }
    }),
  };
  return newRecord;
};
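The byte-wise mapping above can be checked standalone, assuming the record value arrives as an array of ASCII byte codes (the name `toUpperBytes` is invented for this sketch):

```javascript
// Map lowercase ASCII byte codes (97..122) to uppercase; leave the rest.
const toUpperBytes = (bytes) =>
  bytes.map((c) => (c >= 97 && c <= 122 ? c - 32 : c));
```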
[diagram: "lowercase" topic → Filter → "UPPERCASE" topic]
*using the Kafka-API in your favorite programming language (JS, Py, Java, C++, etc.); full compatibility with all your tools, no code changes
17. check out the code for yourself!
● https://github.com/vectorizedio/redpanda
● ask questions from the maintainers at https://vectorized.io/slack
● say hi on twitter https://twitter.com/vectorizedio
● wasm+kafka-api https://vectorized.io/blog/wasm-architecture/