AWS Interview Question-8

Your project application is deployed on an Auto Scaling group of EC2 instances behind an Application Load Balancer. The Auto Scaling group has scaled to maximum capacity, but a few customer requests are still being lost. What will you do?

The project has decided to use SQS with the Auto Scaling group to ensure all messages are saved and processed. The problem with using ApproximateNumberOfMessagesVisible for target tracking is that the number of messages in the queue might not change proportionally to the size of the Auto Scaling group that processes messages from the queue. That's because the number of messages in your SQS queue does not solely define the number of instances needed. The number of instances in your Auto Scaling group can be driven by multiple factors, including how long it takes to process a message and the acceptable amount of latency. The solution is to use a backlog-per-instance metric, with the target value being the acceptable backlog per instance to maintain. You can calculate this number as follows: start with the ApproximateNumberOfMessages queue attribute to determine the length of the SQS queue (the number of messages available for retrieval), then divide that number by the fleet's running capacity (the number of InService instances).
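
A minimal sketch of computing the backlog per instance and publishing it as a custom CloudWatch metric that a target tracking policy could use (the queue URL, Auto Scaling group name, and metric names below are placeholders, not values from the question):

    import boto3

    sqs = boto3.client("sqs")
    autoscaling = boto3.client("autoscaling")
    cloudwatch = boto3.client("cloudwatch")

    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/orders"  # placeholder
    ASG_NAME = "order-workers"                                             # placeholder

    # Length of the queue: messages available for retrieval.
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL, AttributeNames=["ApproximateNumberOfMessages"]
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

    # Number of in-service instances in the Auto Scaling group.
    asg = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])
    instances = [i for i in asg["AutoScalingGroups"][0]["Instances"]
                 if i["LifecycleState"] == "InService"]

    backlog_per_instance = backlog / max(len(instances), 1)

    # Publish the custom metric; a target tracking policy can then keep it
    # near the acceptable backlog per instance.
    cloudwatch.put_metric_data(
        Namespace="Custom/SQSScaling",
        MetricData=[{
            "MetricName": "BacklogPerInstance",
            "Value": backlog_per_instance,
            "Unit": "Count",
        }],
    )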

Your project manager is preparing for disaster recovery and upcoming DR drills of the MySQL database instances and their data. The Recovery Time Objective (RTO) is such that read replicas can be used to offload read traffic from the master database. What are the features of read replicas?

You can create read replicas within AZ, cross-AZ, or cross-Region.

You can have up to five read replicas per master, each with its own DNS endpoint.

A read replica can be manually promoted to a standalone database instance.

Amazon RDS uses the MariaDB, MySQL, Oracle, PostgreSQL, and Microsoft SQL Server DB engines’ built-in replication functionality to create a special type of DB instance called a read replica from a source DB instance. Updates made to the source DB instance are asynchronously copied to the read replica. You can reduce the load on your source DB instance by routing read queries from your applications to the read replica. Using read replicas, you can elastically scale out beyond the capacity constraints of a single DB instance for read-heavy database workloads.
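
As an illustration only (all identifiers below are made up), a cross-AZ read replica could be created and later promoted with boto3:

    import boto3

    rds = boto3.client("rds")

    # Create a read replica of an existing source instance (identifiers are placeholders).
    rds.create_db_instance_read_replica(
        DBInstanceIdentifier="mydb-replica-1",
        SourceDBInstanceIdentifier="mydb-primary",
        AvailabilityZone="us-east-1b",   # cross-AZ placement within the same Region
    )

    # Later, during a DR drill, the replica can be promoted to a standalone instance.
    rds.promote_read_replica(DBInstanceIdentifier="mydb-replica-1")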

When does the ALB stop sending traffic to an instance?

The load balancer routes requests only to the healthy instances. When the load balancer determines that an instance is unhealthy, it stops routing requests to that instance. The load balancer resumes routing requests to the instance when it has been restored to a healthy state.

Your project manager needs a storage service that provides a simple, scalable, fully managed elastic NFS file system for use with AWS Cloud services and on-premises resources. Which AWS service can meet these requirements?

Amazon Elastic File System (Amazon EFS) provides a simple, scalable, fully managed elastic NFS file system for use with AWS Cloud services and on-premises resources. It is built to scale on demand to petabytes without disrupting applications, growing and shrinking automatically as you add and remove files, eliminating the need to provision and manage capacity to accommodate growth. Amazon EFS offers two storage classes: the Standard storage class and the Infrequent Access storage class (EFS IA).

The company needs to be able to store files in several different formats, such as PDF, JPG, PNG, Word, and several others. This storage needs to be highly durable. Which storage type will best meet this requirement?

Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. This means customers of all sizes and industries can use it to store and protect any amount of data for a range of use cases, such as websites, mobile applications, backup and restore, archive, enterprise applications, IoT devices, and big data analytics. Amazon S3 provides easy-to-use management features so you can organize your data and configure finely-tuned access controls to meet your specific business, organizational, and compliance requirements.

What is hot attach in EC2?

  • Scenario: you have two EC2 instances running in the same VPC, but in different subnets.
  • You are removing the secondary ENI from one EC2 instance and attaching it to the other EC2 instance.
  • You want this to be fast, with limited disruption, so you attach the ENI to the EC2 instance while it is running (see the sketch after this list).
  • You can attach a network interface to an instance when it’s running (hot attach), when it’s stopped (warm attach), or when the instance is being launched (cold attach).
  • You can detach secondary network interfaces when the instance is running or stopped. However, you can’t detach the primary network interface.
  • You can move a network interface from one instance to another if the instances are in the same Availability Zone and VPC but in different subnets.
  • When launching an instance using the CLI, API, or an SDK, you can specify the primary network interface and additional network interfaces.
  • Launching an Amazon Linux or Windows Server instance with multiple network interfaces automatically configures interfaces, private IPv4 addresses, and route tables on the operating system of the instance.
  • A warm or hot attach of an additional network interface may require you to manually bring up the second interface, configure the private IPv4 address, and modify the route table accordingly.
  • Instances running Amazon Linux or Windows Server automatically recognize the warm or hot attach and configure themselves.
  • Attaching another network interface to an instance (for example, a NIC teaming configuration) cannot be used as a method to increase or double the network bandwidth to or from the dual-homed instance.
  • If you attach two or more network interfaces from the same subnet to an instance, you may encounter networking issues such as asymmetric routing. 
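
A rough sketch of a hot attach with boto3 (the ENI, attachment, and instance IDs are placeholders): it detaches a secondary ENI from one running instance and attaches it to another at device index 1:

    import boto3

    ec2 = boto3.client("ec2")

    ENI_ID = "eni-0123456789abcdef0"                        # placeholder secondary ENI
    SOURCE_ATTACHMENT_ID = "eni-attach-0123456789abcdef0"   # placeholder current attachment
    TARGET_INSTANCE_ID = "i-0123456789abcdef0"              # placeholder target instance

    # Detach the secondary ENI from the current (running) instance;
    # in practice, wait for the detachment to complete before reattaching.
    ec2.detach_network_interface(AttachmentId=SOURCE_ATTACHMENT_ID)

    # Hot attach: the target instance keeps running during the attach.
    ec2.attach_network_interface(
        NetworkInterfaceId=ENI_ID,
        InstanceId=TARGET_INSTANCE_ID,
        DeviceIndex=1,   # 0 is the primary interface, which cannot be detached
    )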

What are launch templates?

A launch template is similar to a launch configuration, in that it specifies instance configuration information. Defining a launch template instead of a launch configuration allows you to have multiple versions of a template. With versioning, you can create a subset of the full set of parameters and then reuse it to create other templates or template versions.
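
A minimal sketch of creating a launch template and a second version with boto3 (the template name, AMI ID, and security group are illustrative placeholders):

    import boto3

    ec2 = boto3.client("ec2")

    # Version 1 of the template.
    ec2.create_launch_template(
        LaunchTemplateName="web-app-template",         # placeholder name
        LaunchTemplateData={
            "ImageId": "ami-0123456789abcdef0",        # placeholder AMI
            "InstanceType": "t3.micro",
            "SecurityGroupIds": ["sg-0123456789abcdef0"],
        },
    )

    # Version 2 reuses version 1 and overrides only the instance type.
    ec2.create_launch_template_version(
        LaunchTemplateName="web-app-template",
        SourceVersion="1",
        LaunchTemplateData={"InstanceType": "t3.small"},
    )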

Which command in Redshift is efficient in loading large amounts of data?

Answer : COPY. A COPY command loads large amounts of data much more efficiently than using INSERT statements, and stores the data more effectively as well. We can use a single COPY command to load data for one table from multiple files. Amazon Redshift then automatically loads the data in parallel.
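
An illustrative COPY, issued here through the Redshift Data API (the cluster, database, bucket prefix, and IAM role are placeholders); a single command loads all files under the prefix in parallel:

    import boto3

    redshift_data = boto3.client("redshift-data")

    copy_sql = """
        COPY sales
        FROM 's3://my-bucket/sales/'            -- placeholder prefix holding multiple files
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS CSV;
    """

    redshift_data.execute_statement(
        ClusterIdentifier="analytics-cluster",  # placeholder
        Database="dev",
        DbUser="awsuser",
        Sql=copy_sql,
    )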

What is AWS Glue?
  • AWS Glue is a cloud service that prepares data for analysis through automated extract, transform, and load (ETL) processes.
  • It supports MySQL, Oracle, Microsoft SQL Server, and PostgreSQL databases that run on Amazon Elastic Compute Cloud instances.
  • It is a fully managed, pay-as-you-go ETL service that automates the steps of data preparation for analytics.
  • It includes a flexible scheduler that handles dependency resolution and job monitoring.
  • AWS Glue is serverless, which means there's no infrastructure to set up or manage.
  • AWS Glue consists of:
    • Central metadata repository-AWS Glue Data Catalog
    • ETL engine
    • Flexible scheduler
John is new to the AWS Glue service and he has been given a new task to process data through Glue. What would be his action items? (See the sketch after these steps.)
  • Step 1: He has to define a crawler to populate the AWS Glue Data Catalog with metadata table definitions.
  • Step 2: He has to point the crawler at a data store, and the crawler creates table definitions in the Data Catalog.
  • Step 3: He can write a script to transform the data, or he can provide the script in the AWS Glue console (for example, a Spark script).
  • Step 4: He can run the job on demand, or he can set it up to start when a specified trigger occurs.
  • Step 5: When the job runs, the script extracts data from the data source, transforms it, and loads it to the data target.
  • Step 6: The script runs in an Apache Spark environment in AWS Glue.
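
As a rough sketch only (the role ARN, bucket paths, and names are made up), steps 1-4 could look like this with boto3:

    import boto3

    glue = boto3.client("glue")

    # Steps 1-2: define a crawler pointed at a data store; it writes table
    # definitions to the Data Catalog when it runs.
    glue.create_crawler(
        Name="raw-orders-crawler",                        # placeholder
        Role="arn:aws:iam::123456789012:role/GlueRole",   # placeholder role
        DatabaseName="raw",
        Targets={"S3Targets": [{"Path": "s3://my-bucket/raw/orders/"}]},
    )
    glue.start_crawler(Name="raw-orders-crawler")

    # Step 3: register an ETL job whose script transforms the data.
    glue.create_job(
        Name="orders-etl",
        Role="arn:aws:iam::123456789012:role/GlueRole",
        Command={"Name": "glueetl",
                 "ScriptLocation": "s3://my-bucket/scripts/orders_etl.py"},
    )

    # Step 4: run the job on demand (a trigger could start it instead).
    glue.start_job_run(JobName="orders-etl")
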
What happens when a crawler runs?
  • It classifies data to determine the format, schema, and associated properties of the raw data.
  • It groups data into tables or partitions based on how the data is organized.
  • It writes metadata to the Data Catalog.
  • A crawler can crawl multiple data stores in a single run.
    After the crawl completes, the crawler creates or updates one or more tables in the Data Catalog.
What is Data Catalog in Glue?
  • It is a central repository and persistent metadata store to store structural and operational metadata.
  • It stores its table definition, job definitions, and other control information to manage your AWS Glue environment.
  • Table definitions are available for ETL and also available for querying in Athena, EMR, and Redshift Spectrum to provide a view of the data between these services.
  • Each AWS account has one AWS Glue Data Catalog per region.
You are working in an e-commerce company where you have an order processing system in AWS. There are many EC2 instances that pick up orders from the application, and these EC2 instances are in an Auto Scaling group to process the orders. What will you do to ensure that the EC2 processing instances are correctly scaled based on demand?

Answer : We can use SQS queues to decouple the architecture and can scale the processing servers based on the queue length. SQS is a queue from which services pull data; standard queues provide at-least-once delivery, while FIFO queues provide exactly-once processing. If no workers pull jobs from SQS, the messages simply stay in the queue. SNS is a publisher-subscriber system that pushes messages to subscribers; if there are no subscribers to an SNS topic, a given message is lost.

In your project, you have data in DynamoDB tables and you have to perform complex data analysis queries on that data. How will you do this?

Answer : We can copy the data to Amazon Redshift and then perform the complex queries there.

Which service will you use to collect, process, and analyze video streams in real time?

Answer : Amazon Kinesis Video Streams. It makes it easy to securely stream video from connected devices to AWS for analytics, machine learning, and other processing.

In an AWS EMR cluster, which node is responsible for running the YARN service?

Answer : Master Node

In your client's big data project, you are trying to connect to the master node of your EMR cluster. What should be checked to ensure that the connection is successful?

Answer : We can check the inbound rules of the security group for the master node. Under Security and access, choose the Security groups for Master link. Choose ElasticMapReduce-master from the list. Choose Inbound, Edit. Check for an inbound rule that allows SSH access (port 22) from your client's IP address.

Explain AWS GLUE Crawler.

  • It is a program that connects to a data store (source or target), progresses through a prioritized list of classifiers to determine the schema for your data and then creates metadata tables in the Data Catalog.
  • It scans various data stores to infer schema and partition structure and populates the Glue Data Catalog with the corresponding table definitions and statistics.
  • It can be scheduled to run periodically. Doing so, the metadata is always up-to-date and in-sync with the underlying data.
  • It automatically adds new tables, new partitions to existing tables, and new versions of table definitions.
  • It can determine the schema of complex unstructured or semi-structured data, which can save a ton of time.
When do I use a Glue Classifier in project?
  • It reads the data in a data store.
  • If it identifies the format of the data then it generates a schema.
  • It provides classifiers for common file types, such as CSV, JSON, AVRO, XML, and others.
  • AWS Glue provides a set of built-in classifiers, but you can also create custom classifiers.
  • You can set up your crawler with an ordered set of classifiers.
  • When the crawler invokes a classifier, the classifier determines whether the data is recognized or not.
What is a trigger in AWS Glue?

A trigger is a Data Catalog object that starts one or more crawlers or ETL jobs. Triggers can fire on a scheduled time, on an event (such as the completion of another job), or on demand.

John joined a new company where he is working on a migration project. His project is moving its ETL workloads to a serverless Apache Spark-based platform.
Which service is recommended for streaming ETL?

AWS Glue is recommended for streaming ETL when your use cases are primarily ETL and you want to run jobs on a serverless Apache Spark-based platform.

How will you import data from Hive Metastore to the AWS Glue Data Catalog?

Migration through Amazon S3:
Step 1: Run an ETL job that reads data from your Hive Metastore and exports the data (database, table, and partition objects) to an intermediate format in Amazon S3.

Step 2: Import that data from S3 into the AWS Glue Data Catalog through an AWS Glue ETL job.

Direct Migration:
You can set up an AWS Glue ETL job which extracts metadata from your Hive metastore and loads it into your AWS Glue Data Catalog through an AWS Glue connection.

Which AWS service will you use to perform ad-hoc analysis on log data?

Amazon Elasticsearch Service is a managed service for Elasticsearch, a popular open-source search and analytics engine, used for use cases such as log analytics, real-time application monitoring, and clickstream analysis. You can search specific error codes and reference numbers quickly.

What will you do for query optimization after data has been ingested into Redshift?

Answer : We can run the ANALYZE command so that the optimizer can generate up-to-date data statistics. Amazon Redshift monitors changes to your workload and automatically updates statistics in the background. In addition, the COPY command performs an analysis automatically when it loads data into an empty table. To explicitly analyze a table or the entire database, run the ANALYZE command.
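
For example (the cluster and table names are placeholders), ANALYZE can be run through the Redshift Data API:

    import boto3

    redshift_data = boto3.client("redshift-data")

    # Refresh optimizer statistics for one table; omit the table name to
    # analyze the whole database.
    redshift_data.execute_statement(
        ClusterIdentifier="analytics-cluster",  # placeholder
        Database="dev",
        DbUser="awsuser",
        Sql="ANALYZE sales;",
    )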

An application is currently using the Elasticsearch Service in AWS. How can you take backups of a cluster's data through Elasticsearch?

Answer : Automated snapshots. By default, the AWS Elasticsearch Service already comes with regular automated snapshots. These snapshots can be used to restore the cluster's data but not for migration to a new Elasticsearch cluster, and they can only be accessed as long as the Elasticsearch API of the cluster is available.

Which AWS Service can be used to monitor EMR Clusters and give reports of the performance of the cluster as a whole?

Answer : Amazon CloudWatch. You can view the metrics that Amazon EMR reports to CloudWatch using the Amazon EMR console or the CloudWatch console.

Sometimes when you try to terminate an EMR cluster, it does not terminate. What could be a possible reason for this?

Answer : Termination protection is set on the cluster. If you are terminating a cluster that has termination protection turned on, you must disable termination protection first; then you can terminate the cluster. Clusters can be terminated using the console, the AWS CLI, or programmatically using the TerminateJobFlows API.
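
A minimal sketch of the order of operations with boto3 (the cluster ID is a placeholder): disable termination protection first, then terminate:

    import boto3

    emr = boto3.client("emr")
    CLUSTER_ID = "j-0123456789ABC"   # placeholder cluster ID

    # Termination protection must be turned off before the cluster can be terminated.
    emr.set_termination_protection(JobFlowIds=[CLUSTER_ID], TerminationProtected=False)

    # Now the terminate call (the TerminateJobFlows API) will succeed.
    emr.terminate_job_flows(JobFlowIds=[CLUSTER_ID])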

Which node type is recommended when launching a Redshift cluster?

Answer : Dense Storage. DS2 allows a storage-intensive data warehouse with vCPU and RAM included for computation. DS2 nodes use HDDs (hard disk drives) for storage; as a rule of thumb, if the data is more than 500 GB, go for DS2 instances.

Where do the query results from Athena get stored?

Answer : In Amazon S3

How will you convert and migrate an on-premises Oracle database to AWS Aurora?

Answer : First we will convert the database schema and code using the AWS Schema Conversion Tool, then we will migrate the data from the source database to the target database using AWS Database Migration Service (AWS DMS).

You expect a large number of GET and PUT requests on an S3 bucket. You could expect around 300 PUT and 500 GET requests per second on the S3 bucket during a selling period on your web site. How will you design this to ensure optimal performance?

Answer : We have to ensure the object key names are appropriate and well distributed across prefixes. Amazon S3 scales to at least 3,500 PUT/POST/DELETE and 5,500 GET requests per second per prefix, so spreading keys across prefixes avoids hot spots.

Which AWS service can filter and transform messages (coming from sensors) and store them as time-series data in DynamoDB?

Answer : IoT Rules Engine. The Rules Engine is a component of AWS IoT Core. It evaluates inbound messages published into AWS IoT Core, then transforms and delivers them to another device or a cloud service, based on business rules you define.

Your project is currently running an EMR cluster that is used to perform a processing task every day from 5 pm to 10 pm, but the data admin has noticed that the cluster is being billed for the entire day. What cluster configuration will you use to reduce the costs?

Answer : We can use transient clusters in EMR. There are two kinds of EMR clusters: transient and long-running. If you configure your cluster to be automatically terminated, it is terminated after all the steps complete; this is a transient cluster. Transient clusters are compute clusters that automatically shut down and stop billing when processing is finished.

Which storage types can be used with Amazon EMR?

Answer : Local file system

HDFS

EMRFS

Consider you have a large volume of data. You have to store it and access it for a short period, but then it needs to be archived indefinitely. What is a cost-effective solution?

Answer : We can store the data in Amazon S3 and use lifecycle policies to archive it to Amazon Glacier.

Which component of the Amazon Machine Learning service is used to generate predictions using the patterns extracted from the input data?

Answer : Models

How can you migrate data from AWS Glue to the Hive Metastore through Amazon S3?

We can use two AWS Glue jobs here.
The first job extracts metadata from databases in the AWS Glue Data Catalog and loads it into S3. The first job is run from the AWS Glue console.
The second job loads the data from S3 into the Hive Metastore. The second job can be run either from the AWS Glue console or on a cluster with Spark installed.

How can you migrate metadata from one AWS Glue Data Catalog to another?

We can use two AWS Glue jobs here.
The first extracts metadata from specified databases in an AWS Glue Data Catalog and loads it into S3.
The second loads the data from S3 into another AWS Glue Data Catalog.

What are time-based schedules for jobs and crawlers?

We can define a time-based schedule for crawlers and jobs in AWS Glue. When the specified time is reached, the schedule activates and the associated jobs or crawlers start.
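
For illustration (the trigger and job names are placeholders), a time-based trigger uses a cron expression; the sketch below would start a job every day at 01:00 UTC:

    import boto3

    glue = boto3.client("glue")

    glue.create_trigger(
        Name="nightly-etl-trigger",             # placeholder
        Type="SCHEDULED",
        Schedule="cron(0 1 * * ? *)",           # every day at 01:00 UTC
        Actions=[{"JobName": "orders-etl"}],    # placeholder job name
        StartOnCreation=True,
    )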

Which component of a Redshift cluster, if down, renders the Redshift cluster unavailable?

Answer : Leader Node. The leader node in an Amazon Redshift cluster manages all external and internal communication. It is responsible for preparing query execution plans whenever a query is submitted to the cluster. The leader node distributes data to the slices and allocates parts of a user query or other database operation to the slices. Slices work in parallel to perform the operations.

Which SQL function statement can be used in Redshift to specify a result when there are multiple conditions?
Answer : Case expression
You have to create an Amazon Machine Learning model to predict how many inches of snow will fall in an area based on the historical snowfall data. What type of modeling will you use?

Answer : Regression

What is a shard in AWS Kinesis?

A shard is a uniquely identified group of data records in a stream; each shard provides a fixed unit of capacity.

How will you load streaming data and establish scalable private connections to on-premises data centers? Which services will you use for that?

Answer : Direct Connect and Kinesis Firehose

  • Establish a dedicated network connection from your premises to AWS.
  • AWS Direct Connect makes it easy to establish a dedicated network connection from your premises to AWS.
  • Using AWS Direct Connect, you can establish private connectivity between AWS and your datacentre.
  • Amazon Kinesis Firehose is the easiest way to load streaming data into AWS. It can capture and automatically load streaming data into Amazon S3 and Amazon Redshift.
Which database would be best for storing and analyzing the complex interpersonal relationships of people involved in organized crime?
  • Amazon Neptune is a purpose-built, high-performance graph database. It is optimized for processing graph queries.
  • After creating an instance in Amazon Elastic Compute Cloud (Amazon EC2), you can log in to that instance using SSH and connect to an Amazon Neptune DB cluster.
Amazon Elastic File System (Amazon EFS) provides simple, scalable, elastic file storage for use with AWS Cloud services and on-premises resources. You have decided to use EFS for sharing files across many EC2 instances and you want to be able to tolerate an AZ failure. What should you do?

Correct Answer: We can create an EFS mount target in each AZ and configure each EC2 instance to mount the file system through the mount target in its own AZ.
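
A sketch of creating one mount target per AZ with boto3 (the file system ID, subnets, and security group are placeholders); each instance then mounts the file system through the target in its own AZ:

    import boto3

    efs = boto3.client("efs")

    FILE_SYSTEM_ID = "fs-0123456789abcdef0"    # placeholder
    # One subnet per Availability Zone the instances run in (placeholders).
    SUBNETS_BY_AZ = ["subnet-aaa111", "subnet-bbb222", "subnet-ccc333"]
    SECURITY_GROUP = "sg-0123456789abcdef0"    # allows NFS (TCP 2049) from the instances

    for subnet_id in SUBNETS_BY_AZ:
        efs.create_mount_target(
            FileSystemId=FILE_SYSTEM_ID,
            SubnetId=subnet_id,
            SecurityGroups=[SECURITY_GROUP],
        )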

Which AWS services allow native encryption of data at rest?
  • EBS, S3 and EFS are AWS Services which allow native encryption of data, while at rest.
  • All allow the user to configure encryption at rest.
  • They can use either the AWS Key Management Service (KMS) or customer provided keys.
  • The exception is ElastiCache for Memcached, which does not offer a native encryption service, whereas ElastiCache for Redis does.
  • AWS Snowball encrypts data at rest by default as well.
Which service is used by the Spark Streaming tool to consume data from Amazon Kinesis?

Answer : Amazon Kinesis Client Library (KCL). The Spark Streaming Kinesis integration uses the KCL to consume data from an Amazon Kinesis stream.

There is a requirement to perform SQL querying, along with complex queries, on different backend data stores that include Redshift, MySQL, Hive on EMR, and PostgreSQL. How can we use Presto in this case?

Answer : Presto is a high performance, distributed SQL query engine for big data. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, Alluxio, MySQL, Cassandra, Kafka, MongoDB and Teradata. 

We need to perform ad-hoc SQL queries on structured data in the project. Data comes in constantly at a high velocity, so what services should we use?

Answer : EMR + Redshift

Consider you have to load a lot of data once a week from your on-premises data center to AWS Redshift. Which AWS-managed cloud data migration tool can be used for this data transfer in a simple, fast, and secure way?

Answer : Direct Connect

Which service is used by AWS Athena for partitioning data?
Answer : Hive
You need a cost-effective solution to store a large collection of video files and a fully managed data warehouse service that can keep track of and analyze all your data efficiently using your existing business intelligence tools. How will you fulfill the requirements?

Answer : Store the data in Amazon S3 and reference its location in Amazon Redshift. Amazon Redshift will keep track of metadata about your binary objects, but the large objects themselves would be stored in Amazon S3.

In your project, consider your EMR cluster uses ten m4.large instances and runs 24 hours per day, but it is only used for processing and reporting during working hours. How will you reduce the costs?

Answer : We can use Spot Instances for task nodes when needed, migrate the data from HDFS to S3 using S3DistCp, and turn off the cluster when not in use.

Your application generates a 2 KB JSON payload that needs to be queued and delivered to EC2 instances for processing. At the end of the day, the application needs to replay the data for the past 24 hours. Which service would you use for this requirement?

Answer : Kinesis

The Amazon DynamoDB Query action lets you retrieve data in a similar fashion. You can use Query with any table that has a composite primary key (partition key and sort key). You must specify an equality condition for the partition key, and you can optionally provide another condition for the sort key. You need to improve performance of queries to your DynamoDB table. The most common queries do not use the partition key. What should you do?

Correct Answer: Create a Global Secondary Index with the most common queried attribute as the hash key
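
A rough sketch with boto3 (the table, attribute, and index names are made up, and it assumes provisioned capacity mode) of adding a Global Secondary Index keyed on the commonly queried attribute:

    import boto3

    dynamodb = boto3.client("dynamodb")

    dynamodb.update_table(
        TableName="Orders",                                    # placeholder
        AttributeDefinitions=[
            {"AttributeName": "CustomerEmail", "AttributeType": "S"},
        ],
        GlobalSecondaryIndexUpdates=[{
            "Create": {
                "IndexName": "CustomerEmailIndex",             # placeholder
                "KeySchema": [
                    {"AttributeName": "CustomerEmail", "KeyType": "HASH"},
                ],
                "Projection": {"ProjectionType": "ALL"},
                "ProvisionedThroughput": {                     # required in provisioned mode
                    "ReadCapacityUnits": 5,
                    "WriteCapacityUnits": 5,
                },
            }
        }],
    )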

Which data formats does Amazon Athena support?

Correct Answer: Apache Parquet, Apache ORC, and JSON

Build data-intensive apps or boost the performance of your existing databases by retrieving data from high-throughput, low-latency in-memory data stores. Amazon ElastiCache is a popular choice for real-time use cases like caching, session stores, gaming, geospatial services, real-time analytics, and queuing. You are trying to decide which product to select for your in-memory cache needs, and you require support for encryption. Which service should you choose?

Correct Answer: ElastiCache for Redis

Consider you are working at a commercial delivery IoT company where you have to track coordinates from GPS-enabled devices. You receive coordinates transmitted from each device once every 8 seconds, and you need to process these coordinates in real time from multiple sources. Which tool should you use to ingest the data?

Answer : Amazon Kinesis. Amazon Kinesis Data Streams is a scalable and durable real-time data streaming service that can continuously capture gigabytes of data per second from hundreds of thousands of sources.

Amazon DynamoDB is a NoSQL database that supports key-value and document data models. Developers can use DynamoDB to build modern, serverless applications that can start small and scale globally to support petabytes of data and tens of millions of read and write requests per second. What DynamoDB features can be utilised to increase the speed of read operations?

Correct Answer: DynamoDB Accelerator (DAX) and Secondary Indexes

You are architecting a complex application landscape that values fast disk I/O for EC2 instances above everything else. Which storage option would you choose?

Correct Answer: Instance Store

You want to allow your VPC instances to resolve using on-prem DNS. Can you do this and how/why?

Correct Answer: Yes, by configuring a DHCP Option Set to issue your on-prem DNS IP to VPC clients.

Which command can be used to transfer the results of a query in Redshift to Amazon S3?

Answer : UNLOAD. To unload data from database tables to a set of files in an Amazon S3 bucket, we can use the UNLOAD command with a SELECT statement. UNLOAD connects to Amazon S3 using an HTTPS connection. Redshift splits the results of the SELECT statement across a set of files, one or more files per node slice.
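
An illustrative UNLOAD (the cluster, bucket, and IAM role are placeholders), again issued via the Redshift Data API:

    import boto3

    redshift_data = boto3.client("redshift-data")

    unload_sql = """
        UNLOAD ('SELECT * FROM sales WHERE sale_date >= ''2021-01-01''')
        TO 's3://my-bucket/exports/sales_'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftUnloadRole'
        FORMAT AS PARQUET;
    """

    redshift_data.execute_statement(
        ClusterIdentifier="analytics-cluster",  # placeholder
        Database="dev",
        DbUser="awsuser",
        Sql=unload_sql,
    )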

We have a set of web servers hosted on EC2 instances and have to push the logs from these web servers onto a suitable storage device for subsequent analysis. How will you implement this?

Answer : First we have to install and configure the Kinesis agents on the web servers. Then we have to ensure that Kinesis Firehose is set up to take the data and send it across to Redshift for further processing.

When estimating the cost of using EMR, which parameters should you consider?

Answer : The price of the underlying EC2 instances, the price of the EMR service, and the price of EBS storage, if used.

Amazon Aurora is a MySQL- and PostgreSQL-compatible relational database built for the cloud that combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open-source databases.
In your project, you have decided to migrate your on-prem legacy Informix database to Amazon Aurora. How might this be facilitated most efficiently?

Correct Answer: You can manually create the target schema on Aurora then use Data Pipeline with JDBC to move the data.


You are migrating from an Oracle on-prem database to an Oracle RDS database. Which of these describes this migration properly?

Correct Answer: Homogenous migration

Which services can be used for auditing S3 buckets?

Answer : CloudTrail and AWS Config. AWS CloudTrail is a service that enables governance, compliance, operational auditing, and risk auditing of your AWS account.
AWS Config is a service that enables us to assess, audit, and evaluate the configurations of AWS resources. Config continuously monitors and records AWS resource configurations and allows us to automate the evaluation of recorded configurations against desired configurations.

Which service would you use to check CPU utilization of your EC2 instances?

Answer : CloudWatch

Which managed service can be used to deliver real-time streaming data to S3?

Answer : Kinesis Firehose. Amazon Kinesis Data Firehose is a fully managed service for delivering real-time streaming data to destinations such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Elasticsearch Service (Amazon ES), Splunk, and any custom HTTP endpoint or HTTP endpoints owned by supported third-party service providers.

In your client project, there is a requirement for a vendor to have access to an S3 bucket in your account. The vendor already has an AWS account. How can you provide access to the vendor on this bucket?

Answer : Create an S3 bucket policy that allows the vendor to read from the bucket from their AWS account. A bucket policy is a resource-based AWS Identity and Access Management policy. We can add a bucket policy to a bucket to grant other AWS accounts or IAM users access permissions for the bucket and the objects in it.
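
A minimal sketch of such a bucket policy applied with boto3 (the bucket name and the vendor's account ID are placeholders):

    import json
    import boto3

    s3 = boto3.client("s3")

    BUCKET = "shared-reports-bucket"           # placeholder
    VENDOR_ACCOUNT_ID = "111122223333"         # placeholder vendor account

    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowVendorRead",
            "Effect": "Allow",
            "Principal": {"AWS": f"arn:aws:iam::{VENDOR_ACCOUNT_ID}:root"},
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{BUCKET}",
                         f"arn:aws:s3:::{BUCKET}/*"],
        }],
    }

    s3.put_bucket_policy(Bucket=BUCKET, Policy=json.dumps(policy))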

Which file formats are supported in Athena by default?

Answer : Apache Parquet, among others. Amazon Athena supports a wide variety of data formats like CSV, TSV, JSON, or text files, and also supports open-source columnar formats such as Apache ORC and Apache Parquet. Athena also supports compressed data in Snappy, Zlib, LZO, and GZIP formats.

You have set up a Redshift cluster in your AWS development account in us-east-1. Now your manager has decided to move the cluster to the production account in us-west-1. What will you do as the first step?

Answer : Create a manual snapshot of the Redshift cluster. A snapshot contains data from any databases that are running on your cluster. It also contains information about your cluster, including the number of nodes, node type, and master user name. If you restore your cluster from a snapshot, Amazon Redshift uses the cluster information to create a new cluster.

What are the 2 types of nodes in a Redshift cluster?

Answer : Leader Node and Compute Node.

Redshift Architecture and Its Components

  • Leader Node.
  • Compute Node.
  • Node Slices.
  • Massively Parallel Processing.
  • Columnar Data Storage.
  • Data Compression.
  • Query Optimizer.
You are trying to use a SQL client tool from an EC2 instance, but you are not able to connect to the Redshift cluster. What must you do?

Answer : Modify the VPC Security Groups.

Open the Amazon VPC console at https://console.aws.amazon.com/vpc/ .

  1. In the navigation pane, choose Security Groups.
  2. Select the security group to update.
  3. Choose Actions, Edit inbound rules or Actions, Edit outbound rules.
  4. Modify the rule entry as required.
  5. Choose Save rules.
Which method should you use for publishing and analyzing the logs (logs from the EC2 instances need to be published and analyzed for a new application feature)?

Answer : Use consumers to analyze the logs

Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/ .

  1. In the left navigation pane, choose Instances, and select the instance.
  2. Choose Actions, Monitor and troubleshoot, Get system log. 
Which chart do you use for comparing measure values over time in Amazon QuickSight?

Answer : Line charts. Use line charts to compare changes in measure values over a period of time, for the following scenarios:

  • One measure over a period of time.
  • Multiple measures over a period of time.
  • One measure for a dimension over a period of time.
Your application is writing a large number of records to a DynamoDB table in one region. There is a requirement for a secondary application to take in the changes to the DynamoDB table every 4 hours and process the updates accordingly. How will you implement this?

Answer : Use DynamoDB Streams to monitor the changes in the DynamoDB table. Once you enable DynamoDB Streams, it captures a time-ordered sequence of item-level modifications in a DynamoDB table and stores the information for up to 24 hours. Applications can access a series of stream records, which contain an item change, from a DynamoDB stream in near real time. So we can use DynamoDB Streams to monitor the changes in the DynamoDB table.
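
For illustration only (the table name is a placeholder), enabling the stream and reading a few change records could look like this; because stream records are retained for 24 hours, a job running every 4 hours can catch up on all changes:

    import boto3

    dynamodb = boto3.client("dynamodb")
    streams = boto3.client("dynamodbstreams")

    # Enable the stream on the table, capturing both old and new images.
    table = dynamodb.update_table(
        TableName="Orders",                                   # placeholder
        StreamSpecification={"StreamEnabled": True,
                             "StreamViewType": "NEW_AND_OLD_IMAGES"},
    )
    stream_arn = table["TableDescription"]["LatestStreamArn"]

    # Read changes from the first shard of the stream.
    shard = streams.describe_stream(StreamArn=stream_arn)["StreamDescription"]["Shards"][0]
    iterator = streams.get_shard_iterator(
        StreamArn=stream_arn,
        ShardId=shard["ShardId"],
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]
    records = streams.get_records(ShardIterator=iterator)["Records"]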

Your project has a Redshift cluster for petabyte-scale data warehousing, and your project manager wants to reduce the overall total cost of running the Redshift cluster. How will you meet the needs of the running cluster while still reducing total overall cost?

Answer : Disable automated and manual snapshots on the cluster. To disable automated snapshots, set the retention period to zero. If you disable automated snapshots, Amazon Redshift stops taking snapshots and deletes any existing automated snapshots for the cluster.

You are working with a Kinesis stream. What is used to group data by shard within a stream?

Answer : Partition Key. A partition key is used to group data by shard within a stream. Kinesis Data Streams segregates the data records belonging to a stream into multiple shards.
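
A tiny sketch (the stream name is a placeholder): records written with the same partition key hash to the same shard:

    import json
    import boto3

    kinesis = boto3.client("kinesis")

    # All records for device "sensor-42" land on the same shard because they
    # share a partition key.
    kinesis.put_record(
        StreamName="telemetry-stream",                 # placeholder
        Data=json.dumps({"device": "sensor-42", "temp": 21.4}).encode("utf-8"),
        PartitionKey="sensor-42",
    )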

You need to ensure in your project that each user can only access their own data in a particular DynamoDB table. Many users already have accounts with a third-party identity provider, such as Facebook, Google, or Login with Amazon. How would you implement this requirement?

Answer : Use web identity federation and register your application with a third-party identity provider such as Google, Amazon, or Facebook. Then create a DynamoDB table and call it "Test."

  1. Create Partition key and a Sort key. Complete creation of the table Test.
  2. Navigate  to “Access control” and select ‘Facebook’ as the identity provider or any other as per your requirement.
  3. Select the “Actions” that you want to allow your users to perform.
  4. Select the “Attributes” that you want your users to have access to.
  5. Select Create policy and copy the code generated in the policy panel. 
When troubleshooting slowness on an EMR cluster, which node type does not need to be investigated for issues?

Answer : Core Nodes

  • Step 1: Gather Data About the slowness Issue
  • Step 2: Check the Environment
  • Step 3: Examine the Log Files
  • Step 4: Check Cluster and Instance Health
  • Step 5: Check for Suspended Groups
  • Step 6: Review Configuration Settings
  • Step 7: Examine Input Data
What is the default input data format for Amazon EMR?

Answer : Text. The default input format for a cluster is text files with each line separated by a newline (\n) character, which is the most commonly used input format.

You are planning on loading an 800 GB file into a Redshift cluster that has 10 nodes. What is a preferable method to load the data?

Answer : We can split the file into 800 smaller files, so that the COPY command can load the files in parallel across the node slices.

You are planning on loading a huge amount of data into a Redshift cluster. You are not sure if the load will succeed or fail. Which of the options below can help if an error occurs during the load process?

Answer : Use the COPY command with the NOLOAD option. When the NOLOAD parameter is used in the COPY command, Redshift checks the data file's validity without inserting any records into the target table.

Which service can be used to run ad-hoc queries for data in S3?

Answer : Athena. Amazon Athena helps us to analyze data stored in Amazon S3. We can use Athena to run ad-hoc queries using ANSI SQL, without the need to aggregate or load the data into Athena. Amazon Athena can process unstructured, semi-structured, and structured data sets.

Consider your project has a web site hosted in AWS. There is a requirement to analyze the clickstream data for the web site, and this needs to be done in real time. How will you complete this requirement?

Answer : We can use the Amazon Kinesis service to process the data from a Kinesis agent.

Which service can be used for storing log files generated from an EMR cluster?

Answer : Amazon S3

 Which tool can be used to build real-time applications using streaming data?

Answer : Kafka

You got a requirement to migrate 4 TB of data to AWS. There is a restriction on the time to migrate the data, and there is a limitation of only a 100 Mbit line to the AWS Cloud. What is the best solution to use to migrate the data to the cloud?

Answer : AWS Import/Export. AWS Import/Export is a data transport service used to move large amounts of data into and out of the Amazon Web Services public cloud using portable storage devices for transport. The service also enables a user to perform an export job from Amazon S3, but not from Amazon EBS or Glacier. There are two versions of the service: AWS Import/Export Disk and AWS Snowball. AWS recommends an IT team use the service if there are 16 TB or less of data to import.

Which can be used to enable developers to quickly get started with deep learning in the cloud?

Answer : TensorFlow

Your team is planning on using the AWS IoT Rules service to allow IoT-enabled devices to write information to DynamoDB. What action must be done to ensure that the rules will work as intended?

Answer : Ensure that the right IAM permissions to DynamoDB are given. The IAM role or user associated with the rule must have specific permissions (for example, permissions to write items to the table in DynamoDB).

You have an application that uses DynamoDB to store JSON data (read and write capacity is provisioned for the DynamoDB table). You have now deployed your application, and you are unsure of the amount of traffic that will be received by the application during the deployment time. How can you ensure that DynamoDB is not highly throttled and does not become a bottleneck for the application?

Answer : DynamoDB's auto scaling feature will make sure that no read/write throttling happens due to heavy traffic. To configure auto scaling in DynamoDB, you set the minimum and maximum levels of read and write capacity in addition to the target utilization percentage. Auto scaling uses Amazon CloudWatch to monitor a table's read and write capacity metrics.
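
A rough sketch of configuring auto scaling for a table's read capacity with boto3 (the table name and capacity limits are illustrative); the write side is configured the same way with the Write dimensions:

    import boto3

    autoscaling = boto3.client("application-autoscaling")

    # Register the table's read capacity as a scalable target.
    autoscaling.register_scalable_target(
        ServiceNamespace="dynamodb",
        ResourceId="table/Orders",                           # placeholder table
        ScalableDimension="dynamodb:table:ReadCapacityUnits",
        MinCapacity=5,
        MaxCapacity=500,
    )

    # Target tracking keeps consumed capacity near 70% of provisioned capacity.
    autoscaling.put_scaling_policy(
        PolicyName="orders-read-scaling",
        ServiceNamespace="dynamodb",
        ResourceId="table/Orders",
        ScalableDimension="dynamodb:table:ReadCapacityUnits",
        PolicyType="TargetTrackingScaling",
        TargetTrackingScalingPolicyConfiguration={
            "TargetValue": 70.0,
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "DynamoDBReadCapacityUtilization",
            },
        },
    )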

You developed a new application that handles huge workloads on large-scale datasets that are stored in Amazon Redshift. The application needs to access the Amazon Redshift tables frequently. How will you access the tables?

Answer : We have to use roles that allow a web identity federated user to assume a role that allows access to the Redshift table (by providing temporary credentials).

Which API commands can be used to put data into a Kinesis stream for synchronous processing?

Answer : PutRecord and PutRecords

Which service can be used for transformation of incoming source data in Amazon Kinesis Data Firehose?

Answer : AWS Lambda. Kinesis Data Firehose can invoke a Lambda function to transform incoming source data before delivering it to destinations.

How will you analyze a large set of data that updates from Kinesis and DynamoDB?

Answer : Elasticsearch. Elasticsearch allows you to store, search, and analyze huge volumes of data quickly and in near real time and give back answers in milliseconds. It's able to achieve fast search responses because, instead of searching the text directly, it searches an index.

Your project application requires long-term storage for backups and other data that you need to keep readily available but at lower cost. Which S3 storage option should you use or recommend?

Answer : Amazon S3 Standard – Infrequent Access. Amazon S3 Standard-Infrequent Access (S3 Standard-IA) is for data that is accessed less frequently but requires rapid access when needed. It offers the high durability, high throughput, and low latency of S3 Standard, with a low per-GB storage price and a per-GB retrieval fee.

 
You are currently managing an application that uses the Kinesis Client Library to read a Kinesis stream. Suddenly you get a ProvisionedThroughputExceededException in CloudWatch from the stream. How will you rectify this error?

Answer : We can add retry logic to applications that use the KCL library

If a table is frequently used in join-level queries, which distribution style would you use for the table in Redshift?

Answer : KEY. A distribution key is a column that is used to determine the database partition (node slice) in which a particular row of data is stored. A distribution key is defined on a table using the CREATE TABLE statement, typically on a column used in joins. With KEY distribution, matching rows are stored on the same slice, so joins on that column avoid redistributing data across nodes.
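
An illustrative table definition (the names are made up) using KEY distribution on the join column, issued through the Redshift Data API:

    import boto3

    redshift_data = boto3.client("redshift-data")

    create_sql = """
        CREATE TABLE order_items (
            order_id   BIGINT,
            product_id BIGINT,
            quantity   INT
        )
        DISTSTYLE KEY
        DISTKEY (order_id);  -- rows with the same order_id are stored on the same slice
    """

    redshift_data.execute_statement(
        ClusterIdentifier="analytics-cluster",  # placeholder
        Database="dev",
        DbUser="awsuser",
        Sql=create_sql,
    )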

Which method can be used to disable automated snapshots in Redshift?

Answer : Set the retention period to zero

What is the default retention period for a Kinesis stream?

Answer : 24 hours (1 day)

Which service can be used as a business analytics service to build visualizations?

Answer : Amazon QuickSight

What is the default concurrency level for the number of queries that can run per queue in Redshift?

Answer : 5

What are examples of columnar databases?

Answer : Amazon Redshift and Apache HBase

Which tool can be used for transferring data between Amazon S3, Hadoop, HDFS, and RDBMS databases?

Answer : Sqoop

Which command can be used to see the execution plan and cost of a query on a Redshift table?

Answer : EXPLAIN

We are writing data to a Kinesis stream, and the default stream settings are used for the Kinesis stream. Every fourth day you send the data from the stream to S3. When you analyze the data in S3, you see that only the fourth day's data is present. What is the reason for this?

Answer : Data records are only accessible for a default of 24 hours from the time they are added to a stream. Since the default stream settings are used for the Kinesis stream here, only the most recent day's data is still in the stream when it is read every fourth day.