More and more enterprises need to process large volumes of data faster, often in real time, and stream-oriented architectures are emerging as a leading approach. Big data users like Netflix, Uber, and Twitter rely on Apache Kafka, the popular open-source distributed event streaming platform, which has reportedly been adopted by more than 80% of Fortune 100 companies. Although high-performance stream processing is achievable with Kafka’s plain producer and consumer clients, Kafka Streams simplifies the process. Let’s look closer at Kafka Streams to better understand what it is and how it can reduce operational costs.
What is Kafka Streams?
Kafka Streams made its debut in 2016 as a new feature in Kafka v0.10. It is essentially a client library for building applications and microservices that process and analyze data stored in Kafka. All the stream processing takes place inside the app, and multiple instances of the same app can be run if high-volume processing is required; the partitions of the input topics are then divided among those instances.
The addition of Streams to the Kafka ecosystem marries the basic development and deployment of standard client-side Scala and Java applications with the advantages provided by Kafka’s server-side cluster technology. Some notable benefits of Kafka Streams are:
- Applications are elastic and can be scaled out or in while they are running.
- Millisecond processing latency.
- Broad flexibility allows for deployment to client systems, containers, VMs, cloud, etc.
- Kafka Streams apps are fault-tolerant.
- Exactly-once processing semantics are supported (opt-in via configuration).
- Separate processing clusters aren’t required.
- Runs on Windows, macOS, and Linux.
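Some of these benefits, exactly-once processing in particular, are switched on through configuration rather than being automatic. As a minimal sketch, the snippet below builds the settings a Kafka Streams app might be started with. The keys are standard Kafka Streams configuration names written as plain strings so the example compiles without the Kafka libraries; the class name, application ID, and broker address are hypothetical placeholders.

```java
import java.util.Properties;

// Sketch only: assembles the settings a Kafka Streams app could be started with.
final class StreamsSettings {

    // applicationId and bootstrapServers come from the caller; the values used
    // in main() below are hypothetical placeholders.
    static Properties exactlyOnce(String applicationId, String bootstrapServers) {
        Properties props = new Properties();
        // Instances that share an application.id split the input partitions between them.
        props.setProperty("application.id", applicationId);
        props.setProperty("bootstrap.servers", bootstrapServers);
        // The default guarantee is "at_least_once"; "exactly_once_v2" needs brokers 2.5 or newer.
        props.setProperty("processing.guarantee", "exactly_once_v2");
        return props;
    }

    public static void main(String[] args) {
        Properties props = exactlyOnce("vehicle-event-filter", "localhost:9092");
        props.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

In a real application these properties would be passed to the `KafkaStreams` constructor together with the topology.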
How does Kafka Streams help to reduce operational costs?
The same benefits that make Streams attractive for stream processing also help companies reduce their operational costs. By taking advantage of Kafka’s server-side cluster technology, Streams provides operational simplicity. It’s a simple client library, so there is no separate processing cluster or additional cluster manager to run. Instead of spending time building infrastructure, engineers can focus on building applications. Streams also avoids boilerplate, allowing for cleaner, more concise code that’s easier to maintain.
Kafka Streams has a low barrier to entry, and provides a simple path from small local development to massive-scale production. The simple integration with existing applications and microservices is just one more element that makes Kafka Streams a valuable option for use cases of all sizes.
A Kafka Streams Use Case
The company behind a global connected vehicle data analytics platform enlisted Xebia's functional programming division to assist with several areas, including a streaming project. One of the client’s products allows partners to subscribe to a real-time stream of vehicle events. The client filters the events based on the partner’s requirements (e.g., selecting only events within a certain geographical area) and streams them in real time to the partner’s chosen destination, such as an S3 bucket or an HTTP API. The client’s existing implementation, built on Spark Streaming, was poorly maintained, cumbersome to deploy, and expensive to run, so they wanted to replace it.
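To make the filtering step concrete, here is a minimal sketch of selecting only the events inside a partner’s geographical area. It is not the client’s implementation: the event shape, the bounding-box representation of a partner requirement, and all names are assumptions made for illustration, and the code uses plain Java collections rather than the Kafka Streams API.

```java
import java.util.List;
import java.util.stream.Collectors;

final class GeoFilterSketch {

    // Hypothetical event shape; the real schema is not described in this article.
    record VehicleEvent(String vehicleId, double latitude, double longitude) {}

    // Hypothetical partner requirement: a latitude/longitude bounding box.
    // This simple check does not handle boxes that cross the antimeridian.
    record BoundingBox(double minLat, double maxLat, double minLon, double maxLon) {
        boolean contains(VehicleEvent e) {
            return e.latitude() >= minLat && e.latitude() <= maxLat
                && e.longitude() >= minLon && e.longitude() <= maxLon;
        }
    }

    // Keeps only the events that fall inside the partner's area, preserving order.
    static List<VehicleEvent> select(List<VehicleEvent> events, BoundingBox area) {
        return events.stream().filter(area::contains).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        BoundingBox area = new BoundingBox(50.0, 54.0, 3.0, 8.0);
        List<VehicleEvent> events = List.of(
            new VehicleEvent("car-1", 52.37, 4.90),    // inside the box
            new VehicleEvent("car-2", 40.71, -74.00)); // outside the box
        select(events, area).forEach(System.out::println);
    }
}
```

In a Kafka Streams topology, a check like `contains` would typically sit inside the predicate passed to `KStream#filter`, so that each partner’s output stream only receives matching events.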
Xebia joined the project after implementation had begun. The existing team was struggling due to a lack of experience in several areas, including Scala and stream-oriented architectures. We actively invested time pairing with the client’s engineers to up-skill them. This was the client’s first use of Kafka Streams. They had used Kafka Connect before, but never written a custom connector. We spent some time investigating and validating the best way to package, deploy, and configure Kafka Connect with our custom connectors.
When we joined the project, the unit test coverage was lower than desired and there were few integration tests. We introduced a suite of end-to-end tests that verified the whole system (Kafka Streams + Kafka Connect), as well as integration tests for each component.
Because a redeployment of the Kafka Streams app in production can cause a short spike in latency, the client wanted to minimize unnecessary deployments and restarts. For this reason, there was a requirement to allow the app’s configuration to be updated without a restart. We implemented a simple mechanism that polled an empty marker file on S3. Whenever the last-modified timestamp of that file changed, the app would reload all of its configuration files from S3. All instances of the app polled the same file, so a config change could be propagated to all instances within seconds.
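The polling mechanism can be sketched roughly as follows. This is an illustration under stated assumptions, not the production code: the class and parameter names are hypothetical, the S3 lookup is abstracted behind a `Supplier<Instant>` so the example has no AWS dependency, and the configuration is assumed to have been loaded once at startup.

```java
import java.time.Instant;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.Supplier;

final class ReloadOnChange {
    // Hypothetical hook: returns the marker file's last-modified time (e.g. read from S3).
    private final Supplier<Instant> markerLastModified;
    // Hypothetical hook: re-reads all configuration files.
    private final Runnable reloadConfig;
    private Instant lastSeen;

    ReloadOnChange(Supplier<Instant> markerLastModified, Runnable reloadConfig) {
        this.markerLastModified = markerLastModified;
        this.reloadConfig = reloadConfig;
        // Configuration is assumed to be loaded at startup, so remember the current timestamp.
        this.lastSeen = markerLastModified.get();
    }

    // One polling step; returns true if the timestamp changed and a reload ran.
    synchronized boolean pollOnce() {
        Instant current = markerLastModified.get();
        if (current.equals(lastSeen)) {
            return false;
        }
        reloadConfig.run();
        // Updated only after a successful reload, so a failed reload is retried on the next poll.
        lastSeen = current;
        return true;
    }

    public static void main(String[] args) {
        AtomicReference<Instant> marker = new AtomicReference<>(Instant.ofEpochSecond(1_000));
        ReloadOnChange watcher =
            new ReloadOnChange(marker::get, () -> System.out.println("reloading config"));
        System.out.println(watcher.pollOnce()); // marker unchanged since startup
        marker.set(Instant.ofEpochSecond(2_000)); // simulate someone touching the marker file
        System.out.println(watcher.pollOnce());
    }
}
```

In a running app, `pollOnce` could be scheduled every few seconds with `ScheduledExecutorService.scheduleWithFixedDelay`. Polling a single small marker object keeps each check cheap, and the full set of configuration files is only fetched when something has actually changed.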
We successfully released the system into production. At last check, it was serving around 100k events/second to multiple partners, and it costs about one-third as much to operate as the system it replaced.
Kafka Streams is a simple, 100% open-source client library that makes it easy for companies small and large to create mission-critical real-time applications and microservices. It provides tremendous flexibility and value for companies that are looking for a cost-effective stream processing solution.
The engineers at Xebia are experts at developing and implementing real-time data platforms. If you’re interested in getting the most out of your data while minimizing operational costs, contact us to learn more about Kafka Streams and other options.