
Building a Data Processing Pipeline with Amazon Kinesis Data Streams and Kubeless

If you’re already running Kubernetes, FaaS (Functions as a Service) platforms on Kubernetes can help you leverage your existing investment in EC2 by enabling serverless computing. The real significance of such platforms, however, lies in the number of data sources that can trigger the deployed function. The first part of this two-part series introduced one such FaaS platform, Kubeless. This second part explains how to use Amazon Kinesis Data Streams as a trigger for functions deployed in Kubeless.
Arun

In our previous post, Running FaaS on Kubernetes Cluster on AWS using Kubeless, we went through the basics of the FaaS solution Kubeless and showed you how to get started with Kubeless functions in a kops-based cluster on AWS. In this second post, we’ll show you how to build a data processing pipeline by deploying Kubeless functions that are executed when events appear in an Amazon Kinesis data stream. The example we will show pushes #kubecon Tweets to a Kinesis data stream; a Kubeless function is then triggered automatically, which sends a mobile notification via SNS.

Data Processing Pipeline with Amazon Kinesis Data Streams and Kubeless - diagram

Introduction

Streaming data is data generated continuously by thousands of data sources that simultaneously produce many small data records. There are many sources of streaming data, such as social network feeds, sensor networks, connected IoT devices, and financial transactions. This data needs to be processed sequentially and incrementally in real time, on a record-by-record basis, and can be used for a variety of purposes such as analytics. Processing streaming data has typically required a dedicated streaming solution and complex event processing systems. However, with a managed data streaming service like Kinesis and a FaaS platform like Kubeless, it can be easier and quicker to build a solution that lets you get value from streaming data without worrying about the infrastructure.

Amazon Kinesis Data Streams is a fully managed streaming data service that makes it easy to collect, process, and analyze real-time streaming data; it offers key capabilities to cost-effectively process streaming data at any scale. The unit of data stored by Kinesis Data Streams is a data record; a stream represents a group of data records. The scaling unit of a Kinesis stream is a shard, and the data records in a stream are distributed across shards. Shards are also responsible for the partitioning of the stream: every record entering the stream is assigned to a shard by a PartitionKey, which can be specified by the producer. All records with the same PartitionKey are guaranteed to land in the same shard, so those records are stored and read in the same order they were created.
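To make the PartitionKey-to-shard mapping concrete, here is a small Python sketch that mimics how Kinesis routes a record: the MD5 hash of the partition key is treated as a 128-bit integer, and each shard owns a contiguous slice of that hash-key space. The function name and the equal-split assumption are ours for illustration; real streams can have uneven hash-key ranges after resharding.

```python
import hashlib

def shard_for_key(partition_key, num_shards):
    """Mimic Kinesis partitioning: MD5 of the partition key yields a
    128-bit integer; each shard owns an equal, contiguous slice of
    that hash-key space (simplified -- assumes equal ranges)."""
    hash_int = int(hashlib.md5(partition_key.encode("utf-8")).hexdigest(), 16)
    range_size = 2 ** 128 // num_shards
    # min() guards the top of the range when num_shards doesn't divide 2**128.
    return min(hash_int // range_size, num_shards - 1)

# The same partition key always maps to the same shard,
# which is what preserves per-key ordering.
assert shard_for_key("user-42", 4) == shard_for_key("user-42", 4)
```

Because the mapping is a pure function of the key, a producer that uses, say, a user ID as the PartitionKey gets all of that user's records in one shard, in arrival order.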

As explained in our earlier post, Kubeless is a Kubernetes-native FaaS framework that lets you deploy functions without having to worry about the underlying infrastructure used to execute them. It is designed to be deployed on top of a Kubernetes cluster and take advantage of all the great Kubernetes primitives. Kubeless is built around the core concepts of functions, triggers, and runtimes. Triggers in Kubeless represent event sources and the associated functions to be executed on the occurrence of an event from a given source. Kubeless supports a wide variety of event sources; in this post we’ll focus on its support for Kinesis data streams. Kubeless lets you run functions in response to records being published to a Kinesis stream.

Triggering Kubeless Functions with Kinesis Data Streams

Before diving into the use case of processing tweets, let’s see how Kubeless triggers functions from a Kinesis data stream. The complete walk-through is available on GitHub. In short, a custom controller is deployed inside your Kubernetes cluster. This controller watches an API endpoint that defines mappings between Kinesis streams and functions. When a new mapping is created, the controller creates a Kinesis consumer and calls the function over HTTP, passing it the event.
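The controller's consume-and-forward step can be sketched in a few lines of Python. The event envelope below (a JSON body with `data`, `partition-key`, and `sequence-number` fields) is our assumption for illustration; the actual shape is defined by the Kubeless Kinesis controller.

```python
import json

def record_to_event(record):
    """Turn a raw Kinesis record (as returned by a GetRecords call)
    into the JSON event body to POST to the function's HTTP endpoint.
    The envelope shape here is a hypothetical example."""
    data = record["Data"]
    return {
        "data": data.decode("utf-8") if isinstance(data, bytes) else data,
        "partition-key": record["PartitionKey"],
        "sequence-number": record["SequenceNumber"],
    }

def forward(records, post):
    """Forward each record to the function. `post` is any callable that
    performs the HTTP POST -- e.g. requests.post bound to the in-cluster
    service URL of the Kubeless function."""
    for record in records:
        post(json.dumps(record_to_event(record)))
```

Injecting the `post` callable keeps the forwarding logic independent of any particular HTTP client, which also makes it easy to test without a cluster.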

Starting from the working Kubeless deployment shown in the previous post, you need to create a custom resource definition (CRD) that defines a new object type called KinesisTrigger, and launch a Kinesis controller that watches those new objects. You can do both with a single command:

kubectl create -f https://github.com/kubeless/kubeless/releases/download/v1.0.0-alpha.4/kinesis-v1.0.0-alpha.4.yaml

The so-called Kinesis trigger controller will watch for KinesisTriggers objects. These objects will declare the mapping between a data stream in Kinesis and a function deployed by Kubeless. Once such a trigger is created, the controller will consume events in the stream and forward them to the function. A typical trigger will look like this:

$ kubectl get kinesistriggers.kubeless.io test -o yaml
apiVersion: kubeless.io/v1beta1
kind: KinesisTrigger
metadata:
  labels:
    created-by: kubeless
  name: test
  namespace: default
spec:
  aws-region: us-west-2
  function-name: post-python
  secret: ec2
  shard: shardId-000000000000
  stream: my-kinesis-stream

The trigger manifest above consumes events on the Kinesis data stream called my-kinesis-stream in the us-west-2 region. The AWS credentials needed to connect to that data stream are stored in a Kubernetes secret called ec2, and the events are forwarded to the Kubeless function post-python. A shard is a unit of streaming capacity (up to 1 MB/sec of writes and up to 2 MB/sec of reads); records are routed to shards by partition key and ordered within a shard on a first-in basis, and shards are grouped under a logical construct called a data stream. You can easily add or remove shards to right-size the streaming capacity of your data stream. Each shard has an identifier called a shard ID; shardId-000000000000 identifies the shard to be used within the Kinesis stream my-kinesis-stream. Creating such a trigger is straightforward using the Kubeless CLI, as shown in the example below.

Use Case: Processing Tweets with Kinesis, Kubeless, and SNS

Now I’ll walk through a real-life scenario to illustrate the end-to-end picture and help you appreciate the power of Kinesis and Kubeless.

We will use a social network feed to get real-time insights: in this example, all the tweets being produced during the Kubecon conference. We will run through a scenario where we would like to receive a real-time notification whenever a topic of interest is mentioned: kubeless.
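As a preview of the kind of function we will deploy, a minimal Kubeless Python handler might look like the sketch below. The SNS topic ARN is a placeholder, and the exact shape of the tweet payload delivered in `event['data']` is an assumption for illustration.

```python
import json

TOPIC = "kubeless"

def mentions_topic(tweet_text):
    """True when the tweet mentions the topic of interest (case-insensitive)."""
    return TOPIC in tweet_text.lower()

def handler(event, context):
    """Kubeless Python handler: the Kinesis record payload arrives in
    event['data']; on a match, publish a mobile notification via SNS."""
    data = event["data"]
    tweet = json.loads(data) if isinstance(data, str) else data
    text = tweet.get("text", "")
    if mentions_topic(text):
        import boto3  # lazy import keeps the filter logic testable offline
        sns = boto3.client("sns", region_name="us-west-2")
        # Placeholder ARN -- substitute your own SNS topic.
        sns.publish(
            TopicArn="arn:aws:sns:us-west-2:123456789012:kubecon-alerts",
            Message="kubeless mentioned: " + text,
        )
    return "ok"
```

Keeping the keyword match in its own small function separates the business logic from the AWS calls, so the interesting part can be unit-tested without a Kinesis stream or SNS topic.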