Apache Kafka: A Platform for Real-Time Data Streams – Part 2
In my last blog post, I described how Apache Kafka came to be and the original problem my team at LinkedIn aimed to solve. Now, let’s look at some other ways Kafka is being used today.
Real-Time Processing Stream Processing
We had a second idea in mind when we designed Kafka, beyond just shipping data around. We also wanted to be able to be able to process these streams of events in real-time. This works by allowing a program spread across many machines to subscribe to the continuous stream of events from Kafka and continually publish out a new derived stream of data it computes.
This model of processing data as a flowing stream is quite old. Internally data warehouses have long used a similar data flow model to allow parallel SQL execution, though of course the input and output of the SQL query is still a table populated in batch. Computer scientists had long ago developed stream processing systems to extend this model to produce continuous output.
However, the time has come for real-time processing to be available as a first-class facility available to data engineers. The world itself is becoming fundamentally more real-time and data infrastructure needs to support this.
One thing we learned in our experience at LinkedIn and with other companies is that the feasibility of real-time stream processing is directly tied to how data is ingested. As long as data is shipped around in batches (say nightly file dumps), continuous processing of data doesn’t make sense (how does one do stream processing on a nightly CSV dump?). But once the flow of data is itself continuous and real-time, it is perfectly natural that the processing of that data would also be continuous and real-time. One pattern we have seen again and again is that as users adopt Kafka for real-time data capture and shipping, they soon find new uses for real-time processing of these very same streams.
Stream processing can be done directly by application code that connects to Kafka to tap into the data streams — Kafka supports partitioned distributed processing over a pool of application instances and handles many of the fault-tolerance problems that arise in practice.
Kafka also integrates well with processing frameworks that help with this kind of use case. Common frameworks used for this purpose include Apache Spark, Apache Storm, and Apache Samza. When used in combination with frameworks like these, you can think of Kafka as playing a role somewhat analogous to HDFS and the stream processing layer as playing a role analogous to MapReduce in the traditional Hadoop stack.
What is Kafka Good For?
One of the truly powerful things about this representation of data is that the work of capturing data is done once, and all of these uses can feed off of the same data stream. A Kafka stream captured to populate a Hive table is also available for load into a NoSQL cluster, or for indexing in a search system like Solr. Likewise, a stream processing system can take these streams and transform them in real-time as data arrives into new streams that are also available for load into the same set of systems if needed.
In the absence of any kind of central message bus, like Kafka, this plumbing problem between systems and applications becomes one where every system or application needs to directly integrate with every other system, something like this:
As you approach full integration, that is a quadratic number of integrations! By using Kafka as a central message bus, each system integrating adds to the total connectivity for all other systems:
This kind of large-scale distributed plumbing is very difficult, so just doing it once per system can dramatically simplify application architectures and help to lower the bar for high-quality data exchange across an organization.
This pattern gets applied across a variety of data sources and applications. Popular data streams include the following:
- Business Event Logging: In this use case, applications publish out a stream of events that record what happened in the business across applications. This could be things like orders, shipments, returns, stock ticks, page views, ad clicks, and so on.
- Device Data: An increasingly common source of event data is connected devices such as set-top boxes, mobile devices, vehicles, and even industrial machines.
- System Operations Data: This includes application logs and events, server metrics, and other data feeds from the datacenter. Providing global feeds of things like logs, metrics, and other data, rather than per-server log files, allows analysis of this type of data using a variety of tools from search indexes to stream processing, to batch SQL and the Hadoop tool chain.
- Database Change Capture: Databases can be instrumented to produce a feed of updates based on the changes they accept. This data is often used both as part of ETL into a warehouse environment, to populate derived stores and caches, as well as for real-time stream processing.
These data feeds get put to a variety of uses:
- Messaging: In many usage scenarios Kafka replaces legacy enterprise messaging platforms, especially in large-scale usage scenarios.
- Cross-Datacenter Transport: Increasingly we see Kafka used as a transport mechanism for data feeds that flow between many datacenters, often including fledgling public cloud deployments that must synchronized data with on-premise datacenters.
- Stream Processing: Increasingly we see Kafka used with stream processing systems to build real-time applications, ETL pipelines, and monitoring systems.
- Hadoop Data Ingestion: One of our original use cases, feeding a hungry Hadoop cluster, remains a popular Kafka application. This kind of pipeline also often gets run in reverse — processing jobs in Hadoop generate data that is published out through Kafka to be stored and served (often in remote datacenters) by appropriate business applications.
- Real-Time Data Integration: The same data integration problem in the warehouse world is equally prevalent between the various applications and systems that make up the real-time service stack in a large organization.
One of the things that has been most exciting to us as we developed this system was the incredible breadth of applications, industries, and data problems Kafka has been applied to.
The Stream Data Platform
For a long time this kind of stream-centric development pattern lacked a name. Internally we just called it Kafka stuff. Yet, as it started to see broader adoption across companies, we found there needed to be some kind of term for what companies were doing as they transitioned away from batch data capture and started putting in place real-time streams.
The term we have been using for this central point for real-time data is “Stream Data Platform.”
If you are interested in exploring the details of this type of architecture, I’ve written up a longer document that explores much of what we have learned by seeing this put this into practice in different companies.
We believe this real-time stream-centric vision represents a major step forward, so a group of us recently left LinkedIn and created a company called Confluent to help make this vision and technology a practical reality. We are excited to have participation from fantastic open source companies like Cloudera and many others to help bring Kafka to a level where it can truly serve this vision.
For more information on how to get started with Kafka and Cloudera, read “How to Deploy and Configure Apache Kafka in Cloudera Enterprise” or download Apache Kafka now.
Jay Kreps is the CEO of Confluent, a company that provides a platform for streaming data built around Apache Kafka. He is one of the original creators of Apache Kafka as well as several other popular open source projects. He was formerly the lead architect for data infrastructure at LinkedIn.
The post Apache Kafka: A Platform for Real-Time Data Streams – Part 2 appeared first on Cloudera VISION.