An Introduction to Kafka
The amount of data in the world is growing exponentially
and, according to the World Economic Forum, the number of bytes being stored in
the world already far exceeds the number of stars in the observable universe.
When you think of this data, you might think of piles of bytes sitting in data
warehouses, in relational databases, or on distributed file systems. Systems
like these have trained us to think of data in its resting state. In other
words, data is sitting somewhere, resting, and when you need to process it, you
run some query or job against the pile of bytes. This view of the world is the
more traditional way of thinking about data. However, while data can certainly
pile up in places, more often than not, it’s moving. You see, many systems
generate continuous streams of data, including IoT sensors, medical sensors,
financial systems, user and customer analytics software, application and server
logs, and more. Even data that eventually finds a nice place to rest likely
travels across the network at some point before it finds its forever home.
If we want to process data in real time, while it moves, we
can’t simply wait for it to pile up somewhere and then run a query or job at
some interval of our choosing. That approach can handle some business use cases,
but many important use cases require us to process, enrich, transform, and
respond to data incrementally as it becomes available. Therefore, we need
something that has a very different worldview of data: a technology that gives
us access to data in its flowing state, and which allows us to work with these
continuous and unbounded data streams quickly and efficiently. This is where
Apache Kafka comes in. Apache Kafka (or simply, Kafka) is a streaming
platform for ingesting, storing,
accessing, and processing streams of data. While the entire platform is very
interesting, this book focuses on what I find to be the most compelling part of
Kafka: the stream processing layer. However, to understand Kafka Streams and
ksqlDB (both of which operate at this layer, and the latter of which also
operates at the stream ingestion layer), it is necessary to have a working
knowledge of how Kafka, as a platform, works.
The story of ksqlDB is one of simplification and evolution.
It was built with the same goal as Kafka Streams: simplify the process of
building stream processing applications. However, as ksqlDB evolved, it became
clear that its goals were even more ambitious than those of Kafka Streams. That’s
because it not only simplifies how we build stream processing applications, but
also how we integrate these applications with other systems (including those
external to Kafka). It does all of this with a SQL interface, making it easy
for beginners and experts alike to leverage the power of Kafka.
Both Kafka Streams and ksqlDB are excellent tools to have in
your stream processing toolbelt, and complement each other quite well. You can
use ksqlDB for stream processing applications that can be expressed in SQL, and
for easily setting up data sources and sinks to create end-to-end data
processing pipelines using a single tool. On the other hand, you can use Kafka
Streams for more complex applications, and your knowledge of that library will
only deepen your understanding of ksqlDB since it’s actually built on top of
Kafka Streams.
What Is ksqlDB?
ksqlDB is an open source event streaming database that was
released by Confluent in 2017 (a little more than a year after Kafka Streams
was introduced into the Kafka ecosystem). It simplifies the way stream
processing applications are built, deployed, and maintained, by integrating two
specialized components in the Kafka ecosystem (Kafka Connect and Kafka Streams)
into a single system, and by giving us a high-level SQL interface for
interacting with these components. Some of the things we can do with ksqlDB
include:
Model data as either streams or tables (each of which is
considered a collection in ksqlDB) using SQL.
Apply a wide range of SQL constructs (e.g., for joining,
aggregating, transforming, filtering, and windowing data) to create new derived
representations of data without touching a line of Java code.
Query streams and tables using push queries, which run
continuously and emit/push results to clients whenever new data is available.
Under the hood, push queries are compiled into Kafka Streams applications and
are ideal for event-driven microservices that need to observe and react to
events quickly.
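To make the capabilities above concrete, here is a minimal sketch in ksqlDB's SQL dialect. The topic name, column names, and filter condition are all hypothetical, chosen only for illustration:

```sql
-- Model a Kafka topic as a stream (a "collection" in ksqlDB).
-- The topic 'orders' and its columns are hypothetical.
CREATE STREAM orders (
  order_id VARCHAR KEY,
  customer VARCHAR,
  amount   DOUBLE
) WITH (
  KAFKA_TOPIC  = 'orders',
  VALUE_FORMAT = 'JSON'
);

-- Derive a new, filtered representation of the data
-- without touching a line of Java code.
CREATE STREAM large_orders AS
  SELECT order_id, customer, amount
  FROM orders
  WHERE amount > 100
  EMIT CHANGES;

-- A push query: runs continuously and emits results to the
-- client whenever new data becomes available.
SELECT customer, amount
FROM large_orders
EMIT CHANGES;
```

The `CREATE STREAM ... AS SELECT` statement is what ksqlDB compiles into a Kafka Streams application under the hood, while the final `SELECT ... EMIT CHANGES` is a push query that a client would leave open to observe results as they arrive.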
When to Use ksqlDB
It’s no surprise that
higher-level abstractions are often easier to work with than their lower-level
counterparts. However, if we were to just say, “SQL is easier to write than
Java,” we’d be glossing over the many benefits of using ksqlDB that stem from
its simpler interface and architecture. These benefits include:
More interactive workflows, thanks to a managed runtime that
can compose and deconstruct stream processing applications on demand using an
included CLI and REST service for submitting queries.
Less code to maintain since stream processing topologies are
expressed using SQL instead of a JVM language.
Simplified architecture, since the interfaces for managing
connectors (which integrate external data sources into Kafka) and for
transforming data are combined into a single system. There’s also an option for
running Kafka Connect from the same JVM as ksqlDB.
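As a sketch of that combined interface, a connector can be created from within ksqlDB itself, assuming the Kafka Connect integration is available (for example, Connect running embedded in the ksqlDB JVM). The connector class and connection settings below are illustrative placeholders, not a working configuration:

```sql
-- Create a source connector from inside ksqlDB.
-- The connector class, database URL, and column name are hypothetical.
CREATE SOURCE CONNECTOR jdbc_source WITH (
  'connector.class'          = 'io.confluent.connect.jdbc.JdbcSourceConnector',
  'connection.url'           = 'jdbc:postgresql://localhost:5432/mydb',
  'mode'                     = 'incrementing',
  'incrementing.column.name' = 'id',
  'topic.prefix'             = 'db-'
);
```

With the connector in place, the ingested topics can be modeled as streams or tables and transformed with further SQL, keeping the entire source-to-sink pipeline in one tool.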
Kafka Streams Integration
For the first two years of its life, ksqlDB was known as
KSQL. Early development focused on its core feature: a streaming SQL engine
that could parse and compile SQL statements into full-blown stream processing
applications. In this early evolutionary form, KSQL was conceptually a mix
between a traditional SQL database and Kafka Streams, borrowing features from
relational databases (RDBMS) while using Kafka Streams to do the heavy lifting
in the stream processing layer.
The most notable feature KSQL borrows from the RDBMS branch
of the evolutionary tree is the SQL interface. This removed a language barrier
for building stream processing applications in the Kafka ecosystem, since users
were no longer required to use a JVM language like Java or Scala in order to
use Kafka Streams.