What is Kinesis?
- It is a managed alternative to Apache Kafka, used to collect large amounts of data in real time.
- Kinesis is a good option for gathering data such as application logs, metrics, IoT data, and clickstreams; overall, it is great for any kind of “real-time” big data.
- It integrates well with stream processing frameworks (e.g., Spark Streaming).
- Kinesis is made up of multiple services:
- Kinesis Streams: low-latency streaming ingest
- Kinesis Analytics: real-time analytics on streams using SQL
- Kinesis Firehose: loads streams into Amazon S3, Amazon Redshift, Amazon Elasticsearch Service, and Splunk
What are the components in Kinesis?
- Amazon Kinesis Streams
- Amazon Kinesis Analytics
- Amazon Kinesis Firehose
What is a shard in Kinesis?
- Kinesis streams are divided into ordered shards; a shard is the equivalent of a partition.
- A producer writes to a stream, and that stream is made up of shards.
- Consumers then read the data from the shards.
- Producers -> Shard 1, Shard 2, Shard 3 -> Consumers
- One stream is made of many different shards (partitions), and Kinesis bills you per provisioned shard.
For how many hours does Kinesis store data by default?
- By default, Kinesis streams do not store data forever: the default data retention is 24 hours, and it can be extended up to 7 days (see the sketch below).
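As a minimal sketch with boto3, extending retention is a single API call; the stream name my-stream below is a placeholder for illustration:

```python
import boto3

kinesis = boto3.client("kinesis")

# Extend retention from the 24-hour default up to the 7-day maximum (168 hours).
# "my-stream" is a placeholder stream name.
kinesis.increase_stream_retention_period(
    StreamName="my-stream",
    RetentionPeriodHours=168,
)
```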
Can multiple applications consume the same stream in Kinesis?
- Kinesis has the ability to reprocess and replay data: once you consume the data, it is not gone from the Kinesis stream.
- Data is only removed when the retention period expires, so you can read the same data over and over as long as it is in the Kinesis stream.
- So multiple applications can consume the same stream.
- That means you can do real-time processing with a huge scale of throughput.
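To illustrate the replay behavior, here is a minimal sketch (assuming a placeholder stream named my-stream) that reads a shard from the very beginning with a TRIM_HORIZON iterator; any number of applications can do this independently, because reads do not remove data:

```python
import boto3

kinesis = boto3.client("kinesis")

# Pick the first shard of the stream ("my-stream" is a placeholder).
shard_id = kinesis.list_shards(StreamName="my-stream")["Shards"][0]["ShardId"]

# TRIM_HORIZON starts at the oldest record still within the retention period,
# so the same data can be re-read by any number of applications.
iterator = kinesis.get_shard_iterator(
    StreamName="my-stream",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]

response = kinesis.get_records(ShardIterator=iterator, Limit=100)
for record in response["Records"]:
    print(record["SequenceNumber"], record["Data"])
```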
How will you load streaming data and establish scalable private connections to on-premises data centers? Which services will you use for that?
Answer: AWS Direct Connect and Kinesis Firehose.
- AWS Direct Connect makes it easy to establish a dedicated network connection from your premises to AWS.
- Using AWS Direct Connect, you can establish private connectivity between AWS and your data center.
- Amazon Kinesis Firehose is the easiest way to load streaming data into AWS. It can capture and automatically load streaming data into Amazon S3 and Amazon Redshift.
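A minimal sketch of sending a record to Firehose with boto3; the delivery stream name web-logs is a placeholder, and the delivery stream itself (with its S3 or Redshift destination) is assumed to already exist:

```python
import boto3

firehose = boto3.client("firehose")

# Firehose buffers the record and delivers it to the configured destination
# (e.g., S3 or Redshift). "web-logs" is a placeholder delivery stream name.
firehose.put_record(
    DeliveryStreamName="web-logs",
    Record={"Data": b'{"event": "page_view", "path": "/home"}\n'},
)
```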
Is Kinesis Streams a database?
- No. Kinesis Streams is not a database. Once you insert data into Kinesis, it is immutable: it cannot be modified or deleted, and it only disappears when the retention period expires.
Which library is used by the Spark Streaming tool to consume data from Amazon Kinesis?
Answer: The Amazon Kinesis Client Library (KCL); Spark Streaming’s Kinesis integration is built on top of it.
Explain Kinesis Streams shards.
- One stream in Kinesis is made of many different shards.
- Billing is done per shard.
- The number of shards can evolve over time (reshard / merge).
- Records are ordered per shard.
- Producers -> Shard 1, Shard 2, Shard 3, Shard 4 -> Consumers
- Here we have four shards, and consumers receive the data from them.
Your application generates a 2 KB JSON payload that needs to be queued and delivered to EC2 instances. At the end of the day, the application needs to replay the data for the past 24 hours. Which service would you use for this requirement?
Answer: Kinesis. Its default 24-hour retention allows the past day’s data to be replayed.
What are the Data Blob, Record Key, and Sequence Number? How are they linked?
Records are ordered per shard, and each record is made of a data blob, a record key, and a sequence number.
- Data Blob: the data being sent, serialized as bytes; up to 1 MB; it can represent anything you want.
- Record Key: sent with a record; it helps Kinesis decide which shard to send the data to, and thus groups records in shards. Same key = same shard. Whenever we hit a “hot partition” issue, we should use a highly distributed key to avoid the problem.
- Sequence Number: a unique identifier for each record put in a shard, added by Kinesis after ingestion.
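All three pieces are visible in a single PutRecord call. This minimal sketch (placeholder stream and key names) sends a data blob with a partition key and prints the shard and sequence number that Kinesis assigns:

```python
import boto3

kinesis = boto3.client("kinesis")

response = kinesis.put_record(
    StreamName="my-stream",          # placeholder stream name
    Data=b'{"temperature": 21.5}',   # the data blob (bytes, up to 1 MB)
    PartitionKey="device-42",        # record key: same key always maps to the same shard
)

# Kinesis adds the sequence number after ingestion; it is unique per shard.
print(response["ShardId"], response["SequenceNumber"])
```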
Consider you are working at a commercial delivery IoT company where you have to track coordinates from GPS-enabled devices. You receive coordinates transmitted from each device once every 8 seconds, and you need to process these coordinates in real time from multiple sources. Which tool should you use to ingest the data?
Answer: Amazon Kinesis. Amazon Kinesis Data Streams is a scalable and durable real-time data streaming service that can continuously capture gigabytes of data per second from hundreds of thousands of sources.
What are the Kinesis Data Streams limits?
- Producer: 1 MB/s or 1,000 messages/s at write PER SHARD, otherwise a “ProvisionedThroughputExceededException” is thrown (see the sizing sketch after this list)
- Consumer Classic: 2 MB/s at read PER SHARD across all consumers
- Consumer Classic: 5 GetRecords API calls per second PER SHARD across all consumers
- Consumer Enhanced Fan-Out: 2 MB/s at read PER SHARD, PER ENHANCED CONSUMER
- Consumer Enhanced Fan-Out: no API calls needed (push model)
- Data retention: 24 hours by default, can be extended to 7 days
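These per-shard limits translate directly into a shard-count calculation. The helper below is a hypothetical sketch of that arithmetic (not an AWS API): it returns the minimum number of shards that satisfies the write and classic-consumer read limits listed above.

```python
import math

def min_shards(write_mb_s: float, write_records_s: float, read_mb_s: float) -> int:
    """Minimum shard count for a given workload (classic consumers)."""
    by_write_mb = math.ceil(write_mb_s / 1.0)               # 1 MB/s write per shard
    by_write_records = math.ceil(write_records_s / 1000.0)  # 1000 records/s write per shard
    by_read_mb = math.ceil(read_mb_s / 2.0)                 # 2 MB/s read per shard (all consumers)
    return max(by_write_mb, by_write_records, by_read_mb, 1)

# Example: 5 MB/s in, 3000 records/s in, 12 MB/s total read -> 6 shards.
print(min_shards(5, 3000, 12))
```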
We have a set of web servers hosted on EC2 instances and have to push the logs from these web servers onto a suitable storage device for subsequent analysis. How will you implement this?
Answer: First, install and configure the Kinesis Agent on the web servers. Then ensure Kinesis Firehose is set up to take the data and send it across to Redshift for further processing.
Explain the Kinesis producers.
1. Kinesis SDK: allows us to write code (or use the CLI) to send data directly into Amazon Kinesis streams.
2. Kinesis Producer Library (KPL): provides enhanced throughput into Kinesis streams.
3. Kinesis Agent: a Linux program that runs on our servers; it reads log files and sends them reliably into Amazon Kinesis streams.
4. Third-party libraries: we can also use third-party libraries, which usually build upon the SDK.
Which managed service can be used to deliver real-time streaming data to S3?
Answer: Kinesis Firehose. Amazon Kinesis Data Firehose is a fully managed service for delivering real-time streaming data to destinations such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Elasticsearch Service (Amazon ES), Splunk, and any custom HTTP endpoint or HTTP endpoints owned by supported third-party service providers.
Explain the Kinesis Producer SDK – PutRecord(s).
- The producer SDK uses the PutRecord and PutRecords APIs; whenever we say “put record”, that means the SDK.
- PutRecord sends one record; PutRecords sends many records.
- PutRecords uses batching, which increases throughput: many records are sent as part of one HTTP request, so we save on HTTP requests (as sketched below).
- If we go over the limits, we get a ProvisionedThroughputExceededException.
- The producer SDK can be used in various ways: in your applications, but also on your mobile devices.
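A minimal sketch of the batching difference (placeholder stream name): PutRecords ships many records in one HTTP request, and the response’s FailedRecordCount should be checked, because individual records can fail even when the call as a whole succeeds.

```python
import json
import boto3

kinesis = boto3.client("kinesis")

# One HTTP request carries the whole batch, which is what increases throughput.
response = kinesis.put_records(
    StreamName="my-stream",  # placeholder stream name
    Records=[
        {"Data": json.dumps({"n": i}).encode(), "PartitionKey": f"key-{i}"}
        for i in range(10)
    ],
)

# A successful call can still contain per-record failures, so check the count.
if response["FailedRecordCount"] > 0:
    print("Records to retry:", response["FailedRecordCount"])
```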
When would you choose the producer SDK?
- When we have a low-throughput use case, or we don’t mind the higher latency.
- When we want a very simple API, or we’re working straight from AWS Lambda.
- These are the kinds of places where we would use the producer SDK.
- There are also managed AWS sources for Kinesis Data Streams; behind the scenes they use the producer SDK, but we don’t see it.
- These managed sources include CloudWatch Logs, so we can send our logs directly from CloudWatch into Kinesis.
- Finally, Kinesis Data Analytics can produce back into Kinesis Data Streams.
You are working with a Kinesis Stream. What is used to group data by shard within a stream?
Answer: Partition key. A partition key is used to group data by shard within a stream. Kinesis Data Streams segregates the data records belonging to a stream into multiple shards.
Which API command can be used to write data into a Kinesis stream for synchronous processing?
Answer: PutRecord, which writes a single data record into a stream.
Which service can be used for transformation of incoming source data in Amazon Kinesis Data Firehose?
Answer: AWS Lambda. Firehose can invoke a Lambda function to transform incoming source data before delivering it to its destination.
How will you analyze a large set of data that updates from Kinesis and DynamoDB?
Answer: Elasticsearch. Elasticsearch allows you to store, search, and analyze huge volumes of data quickly and in near real time, giving back answers in milliseconds. It achieves fast search responses because, instead of searching the text directly, it searches an index.
How do we deal with exceptions from the Kinesis API?
You get a ProvisionedThroughputExceededException when you send more data than your provisioned capacity allows, for example by exceeding the number of megabytes per second or the number of records per second for a shard. When you get this exception, make sure you don’t have a hot shard.
The solution is to retry with backoff (sketched below).
In short:
- ProvisionedThroughputExceededException
- Happens when sending more data than allowed (exceeding the MB/s or TPS limit for any shard)
- Make sure you don’t have a hot shard (e.g., your partition key is bad and too much data goes to one partition)
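A minimal sketch of retries with exponential backoff around PutRecord (placeholder stream name; in production, the AWS SDK’s built-in retry configuration may already cover much of this):

```python
import time
import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

def put_with_backoff(data: bytes, partition_key: str, max_attempts: int = 5):
    for attempt in range(max_attempts):
        try:
            return kinesis.put_record(
                StreamName="my-stream",  # placeholder stream name
                Data=data,
                PartitionKey=partition_key,
            )
        except ClientError as error:
            if error.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise  # only back off on throughput errors
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s, ...
    raise RuntimeError("still throttled after retries; check for a hot shard")
```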
You are currently managing an application that uses the Kinesis Client Library to read a Kinesis stream, and you suddenly see a ProvisionedThroughputExceededException from the stream in CloudWatch. How will you rectify this error?
Answer: Add retry logic (with backoff) to the applications that use the KCL.
Explain the Kinesis Producer Library (KPL).
- It is used for high-performance, long-running producers.
- It has an automated and configurable retry mechanism.
- It offers both synchronous and asynchronous APIs.
- It submits metrics to CloudWatch for monitoring.
- In the KPL, batching is turned on by default; this increases throughput and decreases cost.
- Records are collected and written to multiple shards within the same PutRecords API call.
- KPL records must be decoded (de-aggregated) with the KCL or a special helper library.
We are writing data to a Kinesis stream using the default stream settings. Every fourth day you send the data from the stream to S3. When you analyze the data in S3, you see that only the fourth day’s data is present. What is the reason for this?
Answer: Data records are only accessible for a default of 24 hours from the time they are added to a stream, and the default stream settings are used for the Kinesis stream here.
Which AWS service will you use to load streaming data, and where can you store the data for future analysis?
- Kinesis Data Firehose is the easiest way to load streaming data into data stores and analytics tools.
- It captures, transforms, and loads streaming data, and you can deliver the data to destinations including Amazon S3 buckets for later analysis.
How do checkpointing and shard discovery work in the KCL?
- The KCL uses an Amazon DynamoDB table to checkpoint progress over time and to synchronize which worker is going to read which shard.
- DynamoDB is used for coordination and checkpointing, with one row in the table for each shard to consume from (see the sketch after this list).
- Because DynamoDB is now in the equation, you need to consider how to provision throughput for that table: make sure you provision enough write capacity units (WCU) and read capacity units (RCU), or use on-demand mode so you don’t have to provision capacity at all.
- Finally, the KCL provides a record-processor API, which makes it really convenient to process the messages one by one.
- To remember: the KCL uses DynamoDB for checkpointing (and is therefore subject to DynamoDB limits), and it is used to de-aggregate records from the KPL.
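To see this coordination concretely, you can scan the lease table the KCL creates, which is named after your KCL application. This sketch assumes a placeholder application name (my-kcl-app) and the lease-table attribute names the KCL is generally understood to use (leaseKey, leaseOwner, checkpoint):

```python
import boto3

dynamodb = boto3.resource("dynamodb")

# The KCL creates one DynamoDB table per application; "my-kcl-app" is a placeholder.
table = dynamodb.Table("my-kcl-app")

# One item per shard: which worker owns the lease and where it has checkpointed.
for item in table.scan()["Items"]:
    print(item["leaseKey"], item.get("leaseOwner"), item.get("checkpoint"))
```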
What is “shard splitting” in Kinesis?
- The Kinesis operation for adding shards is called “shard splitting” (see the sketch below).
- It can be used to increase the stream capacity (1 MB/s of data in per shard).
- It can be used to divide a “hot shard”.
- The old shard is closed and will be deleted once its data has expired.
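A minimal sketch of splitting a shard with boto3 (placeholder stream name): the new starting hash key is typically the midpoint of the parent shard’s hash key range, which divides the traffic evenly.

```python
import boto3

kinesis = boto3.client("kinesis")

# Find the hash key range of the shard we want to split ("my-stream" is a placeholder).
shard = kinesis.list_shards(StreamName="my-stream")["Shards"][0]
start = int(shard["HashKeyRange"]["StartingHashKey"])
end = int(shard["HashKeyRange"]["EndingHashKey"])

# Splitting at the midpoint divides the shard's key space (and traffic) in two.
kinesis.split_shard(
    StreamName="my-stream",
    ShardToSplit=shard["ShardId"],
    NewStartingHashKey=str((start + end) // 2),
)
```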
How can we scale Kinesis?
- The first operation is adding shards (shard splitting), which increases the stream capacity.
- When you split a shard, the old shard is closed and will be deleted once its data has expired.
- The second operation is merging shards:
- it decreases the stream capacity and saves costs
- it can be used to group two shards with low traffic (see the sketch below)
- the old shards are closed and deleted once their data has expired
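Merging is the mirror-image API call. This sketch uses placeholder stream and shard IDs; the two shards to merge must be adjacent in the hash key space:

```python
import boto3

kinesis = boto3.client("kinesis")

# Merge two adjacent low-traffic shards into one (all names are placeholders).
kinesis.merge_shards(
    StreamName="my-stream",
    ShardToMerge="shardId-000000000001",
    AdjacentShardToMerge="shardId-000000000002",
)
```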
What is Kinesis Security?
- We get encryption at rest within Kinesis streams using KMS (sketched below); if we encrypt a message client-side, we must also decrypt it client-side.
- If we want to access Kinesis within a VPC (a private network), we can use VPC endpoints.
- We control access and authorization to Kinesis using IAM policies, and we get encryption in flight using the HTTPS endpoints.
- This means that the data we send to Kinesis is encrypted in transit and cannot be intercepted.
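Server-side encryption with KMS is enabled per stream. A minimal sketch with boto3, assuming a placeholder stream name and using the AWS-managed KMS key for Kinesis:

```python
import boto3

kinesis = boto3.client("kinesis")

# Enable server-side encryption at rest with the AWS-managed KMS key for Kinesis.
kinesis.start_stream_encryption(
    StreamName="my-stream",  # placeholder stream name
    EncryptionType="KMS",
    KeyId="alias/aws/kinesis",
)
```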