AWS BIG DATA INTERVIEW QUESTION-SET 1

What is DynamoDB?
  • DynamoDB is a non-relational database for applications that need performance at any scale.
  • NoSQL managed database service
  • Supports both key-value and document data model
  • Extremely fast
    • Consistent responsiveness
    • Single-digit millisecond latency
  • Unlimited throughput and storage
  • Automatic scaling up or down
  • Handles trillions of requests per day
  • ACID transaction support
  • On-demand backups and point-in-time recovery
  • Encryption at rest
  • Data is replicated across multiple Availability Zones
  • Service-level agreement (SLA) of up to 99.999% availability
What are the non-relational Databases?
  • Non-relational databases are NoSQL databases. These databases are categorized into four groups:
    • Key-value stores
    • Graph stores
    • Column stores
    • Document stores
List the Data Types supported by DynamoDB?

DynamoDB supports four scalar data types, and they are:

  • Number
  • String
  • Binary
  • Boolean

DynamoDB supports collection data types such as:

  • Number Set
  • String Set
  • Binary Set
  • Heterogeneous List
  • Heterogeneous Map

DynamoDB also supports Null values.

List the APIs provided by Amazon DynamoDB?
  • CreateTable
  • UpdateTable
  • DeleteTable
  • DescribeTable
  • ListTables
  • PutItem
  • BatchWriteItem
  • UpdateItem
  • DeleteItem
  • GetItem
  • BatchGetItem
  • Query
  • Scan
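
As a rough illustration, here is how a few of these APIs can be called from Python with boto3 (the table name, attribute names, and key schema below are hypothetical examples, not from the question set):

import boto3

dynamodb = boto3.client('dynamodb')

# CreateTable: a simple table with a partition key and a sort key
dynamodb.create_table(
    TableName='orders',
    AttributeDefinitions=[
        {'AttributeName': 'customer_id', 'AttributeType': 'S'},
        {'AttributeName': 'order_id', 'AttributeType': 'S'},
    ],
    KeySchema=[
        {'AttributeName': 'customer_id', 'KeyType': 'HASH'},   # partition key
        {'AttributeName': 'order_id', 'KeyType': 'RANGE'},     # sort key
    ],
    BillingMode='PAY_PER_REQUEST'  # on-demand capacity mode
)

# DescribeTable and ListTables
print(dynamodb.describe_table(TableName='orders')['Table']['TableStatus'])
print(dynamodb.list_tables()['TableNames'])
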
What are global secondary indexes?

An index with a partition key or a partition-and-sort key that is different from those on the base table is called a global secondary index.

List the types of secondary indexes supported by Amazon DynamoDB?
  • Global secondary index – An index with a partition key or a partition-and-sort key that is different from those on the table. It is considered “global” because queries on the index can span all the items in a table, across all the partitions.
  • Local secondary index – An index that has the same partition key as the table but a different sort key. It is considered “local” because every partition of the index is scoped to a table partition that has the same partition key.
How many global secondary indexes can you create per table?

By default, you can create 20 global secondary indexes per table (older documentation cites a limit of 5).

In your project, you have data in DynamoDB tables and you have to perform complex data analysis queries on that data (stored in the DynamoDB tables). How will you do this?

Answer : We can copy the data to Amazon Redshift and then perform the complex queries there.

Where Does DynamoDB Fit In?
  • Amazon Relational Database Service (RDS):
    • Support for Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, and SQL Server
  • Amazon DynamoDB:
    • Key-value and document database
  • Amazon ElastiCache:
    • Managed, Redis- or Memcached-compatible in-memory data store
  • Amazon Neptune
    • Graph database for applications that work with highly connected data sets
  • Amazon Redshift
    • Petabyte-scale data warehouse service
  • Amazon QLDB
    • Ledger database providing a cryptographically verifiable transaction log
  • Amazon DocumentDB:
    • MongoDB-compatible database service
Which AWS service can filter and transform messages (coming from sensors) and store them as time-series data in DynamoDB?

Answer : IoT Rules Engine. The Rules Engine is a component of AWS IoT Core. The Rules Engine evaluates inbound messages published into AWS IoT Core and transforms and delivers them to another device or a cloud service, based on business rules you define.

Explain Partitions and Data Distribution.
  • DynamoDB stores data in partitions. A partition is an allocation of storage for a table, backed by solid-state drives (SSDs) and automatically replicated across multiple Availability Zones within an AWS Region.
  • To get the most out of DynamoDB throughput, create tables where the partition key has a large number of distinct values. Applications should request values fairly uniformly and as randomly as possible.
  • Table: Collection of data. DynamoDB tables must contain a name, a primary key, and the required read and write throughput values. Unlimited size.
  • Partition Key: A simple primary key, composed of one attribute known as the partition key. This is also called the hash attribute.
  • Partition and Sort Key: Also known as a composite primary key, this type of key comprises two attributes. The first attribute is the partition key, and the second attribute is the sort key, also called the range attribute.
Your application is writing a large number of records to a DynamoDB table in one region. There is a requirement for a secondary application to take in the changes to the DynamoDB table every 4 hours and process the updates accordingly. How will you process here?

Answer : Use DynamoDB Streams to monitor the changes in the DynamoDB table. Once you enable DynamoDB Streams, it captures a time-ordered sequence of item-level modifications in a DynamoDB table and stores the information for up to 24 hours. As we know, applications can access a series of stream records, which contain an item change, from a DynamoDB stream in near real time. So we can use DynamoDB Streams to monitor the changes in the DynamoDB table.
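
As a minimal sketch of turning this on with boto3 (the table name is hypothetical):

import boto3

dynamodb = boto3.client('dynamodb')

# Enable a stream on an existing table; the stream records item-level changes
dynamodb.update_table(
    TableName='orders',
    StreamSpecification={
        'StreamEnabled': True,
        'StreamViewType': 'NEW_AND_OLD_IMAGES'  # capture both old and new item images
    }
)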

Explain DynamoDB Performance?
  • On Demand Capacity:
    • Database scales according to demand
    • Good for new tables with unknown workloads
    • Applications with unpredictable traffic
    • Prefer to pay as you go
  • Provisioned Capacity
    • Allows us to have consistent and predictable performance
    • Specify expected read and write throughput requirements
    • Read Capacity Units (RCU)
    • Write Capacity Units (WCU)
    • Price is determined by provisioned capacity
    • Cheaper per request than On-Demand mode
    • Good option for Applications with predictable traffic
    • Applications whose traffic is consistent or ramps gradually
    • Capacity requirements can be forecasted, helping to control costs
    • Both capacity modes have a default limit of 40,000 RCUs and 40,000 WCUs per table.
    • You can switch between modes only once per 24 hours.
You need to ensure in your project that each user can only access their own data in a particular DynamoDB table. Many users already have accounts with a third-party identity provider, such as Facebook, Google, or Login with Amazon. How would you implement this requirement?

Answer : Use web identity federation and register your application with a third-party identity provider such as Google, Amazon, or Facebook. Create a DynamoDB table and call it “Test.”

  1. Create Partition key and a Sort key. Complete creation of the table Test.
  2. Navigate to “Access control” and select “Facebook” as the identity provider, or any other as per your requirement.
  3. Select the “Actions” that you want to allow your users to perform.
  4. Select the “Attributes” that you want your users to have access to.
  5. Select Create policy and copy the code generated in the policy panel. 
Explain DynamoDB Items?
  • Item: A table may contain multiple items. An item is a unique group of attributes. Items are similar to rows or records in a traditional relational database. Items are limited to 400 KB.
  • Attribute: Fundamental data element. Similar to fields or columns in an RDBMS.
Explain Data Types.
  • Data Types
    • Scalar: Exactly one value — number, string, binary, boolean, and null. Applications must encode binary values in base64 format before sending them to DynamoDB.
    • Document: Complex structure with nested attributes (e.g., JSON) — list and map.
  • Document Types
    • List: Ordered collection of values
      • FavoriteThings: [“Cookies”, “Coffee”, 3.14159]
    • Map: Unordered collection of name-value pairs (similar to JSON)
        {
          Day: “Monday”,
          UnreadEmails: 42,
          ItemsOnMyDesk: [
            “Coffee Cup”,
            “Telephone”,
            {
              Pens: { Quantity: 3 },
              Pencils: { Quantity: 2 },
              Erasers: { Quantity: 1 }
            }
          ]
        }
    • Set: Multiple scalar values of the same type — string set, number set, binary set.
      • [“Black”, “Green”, “Red”]
      • [42.2, -19, 7.5, 3.14]
      • [“U3Vubnk=”, “UmFpbnk=”, “U25vd3k=”]
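
As a small illustration of these types with the boto3 resource interface, which maps Python lists, dicts, and sets onto the DynamoDB list, map, and set types (the table and attribute names below are hypothetical):

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('profiles')

table.put_item(
    Item={
        'user_id': 'u1',                              # partition key (string)
        'FavoriteThings': ['Cookies', 'Coffee'],      # list
        'ItemsOnMyDesk': {'Pens': 3, 'Pencils': 2},   # map
        'Colors': {'Black', 'Green', 'Red'},          # string set
    }
)
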
Your team is planning on using the AWS IoT Rules service to allow IoT-enabled devices to write information to DynamoDB. What action must be done to ensure that the rules will work as intended?

Answer : Ensure that the right IAM permissions to Amazon DynamoDB are given. IAM user – An IAM user is an identity within your AWS account that has specific custom permissions (for example, permissions to create a table in DynamoDB).

You have an application that uses DynamoDB to store JSON data (read and write capacity is provisioned on the DynamoDB table). Now you have deployed your application, and you are unsure of the amount of traffic the application will receive at deployment time. How can you ensure that DynamoDB is not heavily throttled and does not become a bottleneck for the application?

Answer : DynamoDB’s auto scaling feature will make sure that no read/write throttling happens due to heavy traffic. To configure auto scaling in DynamoDB, you set the minimum and maximum levels of read and write capacity in addition to the target utilization percentage. Auto scaling uses Amazon CloudWatch to monitor a table’s read and write capacity metrics.

Explain DynamoDB Table.

Creating a Table

  • Table names must be unique per AWS account and region.
  • Between 3 and 255 characters long
  • UTF-8 encoded
  • Case-sensitive
  • Contain a-z, A-Z, 0-9, _ (underscore), - (dash), and . (dot)
  • Primary key must consist of a partition key or a partition key and sort key.
  • Only string, binary, and number data types are allowed for partition or sort keys
  • Provisioned capacity mode is the default (free tier).
  • For provisioned capacity mode, read/write throughput settings are required
  • A secondary index created at table creation time is a local secondary index.
  • Must be created at the time of table creation
  • Same partition key as the table, but a different sort key
  • Provisioned capacity is set at the table level.
  • Adjust at any time or enable auto scaling to modify them automatically
  • On-demand mode has a default upper limit of 40,000 RCU/WCU — unlike auto scaling, which can be capped manually

Create DynamoDB table

DynamoDB is a schema-less database that only requires a table name and primary key. The table’s primary key is made up of one or two attributes that uniquely identify items, partition the data, and sort data within each partition.

How will you analyze a large set of data that updates from Kinesis and DynamoDB?

Answer : Elasticsearch. Elasticsearch allows you to store, search, and analyze huge volumes of data quickly and in near real time and give back answers in milliseconds. It’s able to achieve fast search responses because instead of searching the text directly, it searches an index.

Explain DynamoDB Console Menu Items.

DynamoDB Console Menu Items

  • Dashboard
  • Tables

Storage size and item count are not real time

  • Items: Manage items and perform queries and scans.
  • Metrics: Monitor CloudWatch metrics.
  • Alarms: Manage CloudWatch alarms.
  • Capacity: Modify a table’s provisioned capacity.
  • Free tier allows 25 RCU, 25 WCU, and 25 GB for 12 months
  • Cloud Sandbox within the Cloud Playground
  • Indexes: Manage global secondary indexes.
  • Global Tables: Multi-region, multi-master replicas
  • Backups: On-demand backups and point-in-time recovery
  • Triggers: Manage triggers to connect DynamoDB streams to Lambda functions.
  • Access control: Set up fine-grained access control with web identity federation.

Tags: Apply tags to your resources to help organize and identify them.

  • Backups
  • Reserved capacity
  • Preferences
  • DynamoDB Accelerator (DAX)
How can you use the AWS CLI with DynamoDB?

Installing the AWS CLI

  • Preinstalled on Amazon Linux and Amazon Linux 2
  • Cloud Sandbox within the Cloud Playground

Obtaining IAM Credentials

  • Option 1 : Create IAM access keys in your own AWS account.
  • Option 2: Use Cloud Sandbox credentials.
  • Note the access key ID and secret access key.

Configuring the AWS CLI

  • aws configure
  • aws sts get-caller-identity
  • aws dynamodb help

Using DynamoDB with the AWS CLI

  • aws dynamodb create-table
  • aws dynamodb describe-table
  • aws dynamodb put-item
  • aws dynamodb scan

Object Persistence Interface

  • Do not directly perform data plane operations
  • Map complex data types to items in a DynamoDB table
  • Create objects that represent tables and indexes
  • Define the relationships between objects in your program and the tables that store those objects
  • Call simple object methods, such as save, load, or delete
  • Available in the AWS SDKs for Java and .NET
How can we use CloudWatch with DynamoDB?
  • CloudWatch monitors your AWS resources in real time, providing visibility into resource utilization, application performance, and operational health.
    • Track metrics (data points over time)
    • Create dashboards
    • Create alarms
    • Create rules for events
    • View logs
  • DynamoDB Metrics
    • ConsumedReadCapacityUnits
    • ConsumedWriteCapacityUnits
    • ProvisionedReadCapacityUnits
    • ProvisionedWriteCapacityUnits
    • ReadThrottleEvents
    • SuccessfulRequestLatency
    • SystemErrors
    • ThrottledRequests
    • UserErrors
    • WriteThrottleEvents
  • Alarms can be created on metrics, taking an action if the alarm is triggered.
  • Alarms have three states:
    • INSUFFICIENT: Not enough data to judge the state — alarms often start in this state.
    • ALARM: The alarm threshold has been breached (e.g., > 90% CPU).
    • OK: The threshold has not been breached.
  • Alarms have a number of key components:
    • Metric: The data points over time being measured
    • Threshold: Exceeding this is bad (static or anomaly)
    • Period: How long the threshold should be bad before an alarm is generated
    • Action: What to do when an alarm triggers
    • SNS
    • Auto Scaling
    • EC2

Explain the terminology below.

  • Provisioned Throughput
    • Maximum amount of capacity an application can consume from a table or index. Requests beyond it are throttled and fail with a ProvisionedThroughputExceededException.
  • Eventually vs. Strongly Consistent Read
    • Eventually consistent reads might include stale data.
    • Strongly consistent reads are always up to date but are subject to network delays.
  • Read Capacity Units (RCUs)
    • One RCU represents one strongly consistent read request per second, or two eventually consistent read requests, for an item up to 4 KB in size.
    • Filtered query or scan results consume full read capacity.
    • For an 8 KB item size:
      • 2 RCUs for one strongly consistent read
      • 1 RCU for an eventually consistent read
      • 4 RCUs for a transactional read
  • Write vs. Transactional Write
    • Writes are eventually consistent within one second or less.
    • One WCU represents one standard write per second for an item up to 1 KB in size. Transactional write requests require 2 WCUs for items up to 1 KB.
    • For a 3 KB item size:
      • Standard write: 3 WCUs
      • Transactional write: 6 WCUs
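
A small Python helper illustrating the capacity-unit arithmetic above (a sketch for back-of-the-envelope estimates, not an official calculator):

import math

def rcus(item_size_kb, reads_per_second, strongly_consistent=False, transactional=False):
    units_per_read = math.ceil(item_size_kb / 4)   # one RCU covers up to 4 KB
    if transactional:
        return 2 * units_per_read * reads_per_second
    if strongly_consistent:
        return units_per_read * reads_per_second
    return math.ceil(units_per_read * reads_per_second / 2)  # eventually consistent is half

def wcus(item_size_kb, writes_per_second, transactional=False):
    units_per_write = math.ceil(item_size_kb / 1)  # one WCU covers up to 1 KB
    return (2 if transactional else 1) * units_per_write * writes_per_second

print(rcus(8, 1, strongly_consistent=True))  # 2 RCUs
print(rcus(8, 1))                            # 1 RCU (eventually consistent)
print(rcus(8, 1, transactional=True))        # 4 RCUs
print(wcus(3, 1))                            # 3 WCUs
print(wcus(3, 1, transactional=True))        # 6 WCUs
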
Explain Scan
  • Returns all items and attributes for a given table
  • Filters do not reduce RCU consumption; they simply discard data
  • Eventually consistent by default, but the ConsistentRead parameter can enable strongly consistent scans
  • Limit the number of items returned
  • A single scan returns results that fit within 1 MB
  • Pagination can be used to retrieve more than 1 MB
  • Parallel scans can be used to improve performance
  • Prefer query over scan when possible; occasional real-world use is okay
  • If you are repeatedly using scans to filter on the same non-PK/SK attribute, consider creating a secondary index
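
A minimal boto3 sketch of a paginated scan, assuming the 'employees' table used in the Boto3 examples later in this set:

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('employees')

# Each scan response is capped at 1 MB, so follow LastEvaluatedKey
# until the whole table has been read.
items = []
kwargs = {}
while True:
    resp = table.scan(**kwargs)
    items.extend(resp['Items'])
    if 'LastEvaluatedKey' not in resp:
        break
    kwargs['ExclusiveStartKey'] = resp['LastEvaluatedKey']

print(len(items))
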
Explain Query
  • Find items based on primary key values
  • Query is limited to the PK, PK+SK, or secondary indexes
  • Requires PK attribute
  • Returns all items with that PK value
  • Optional SK attribute and comparison operator to refine results
  • Filters do not reduce RCU consumption; they simply discard data
  • Eventually consistent by default, but the ConsistentRead parameter can enable strongly consistent queries
  • Querying a partition only scans that one partition
  • Limit the number of items returned
  • A single query returns results that fit within 1 MB
  • Pagination can be used to retrieve more than 1 MB
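
A minimal boto3 query sketch (the table, partition key, and sort key names are hypothetical):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('orders')

# All items for one partition key, refined by a sort-key condition
resp = table.query(
    KeyConditionExpression=Key('customer_id').eq('c-100') & Key('order_id').begins_with('2023-'),
    ConsistentRead=True,  # request a strongly consistent read
    Limit=25              # limit the number of items returned
)
print(resp['Items'])
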
Explain BatchGetItem.
  • Returns attributes for multiple items from multiple tables
  • Request using primary key
  • Returns up to 16 MB of data, up to 100 items
  • Get unprocessed items exceeding limits via UnprocessedKeys
  • Eventually consistent by default, but the ConsistentRead parameter can enable strongly consistent reads
  • Retrieves items in parallel to minimize latency
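
A minimal boto3 sketch of BatchGetItem (the table and key names are hypothetical):

import boto3

dynamodb = boto3.resource('dynamodb')

resp = dynamodb.batch_get_item(
    RequestItems={
        'employees': {'Keys': [{'emp_id': '1'}, {'emp_id': '2'}]},
        'orders':    {'Keys': [{'customer_id': 'c-100', 'order_id': '2023-001'}]},
    }
)
print(resp['Responses'])

# Keys not returned because of the size/count limits show up here and should be retried
print(resp.get('UnprocessedKeys'))
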
Explain BatchWriteItem
  • Puts or deletes multiple items in multiple tables
  • Writes up to 16 MB of data, up to 25 put or delete requests
  • Get unprocessed items exceeding limits via UnprocessedItems
  • Conditions are not supported for performance reasons
  • Threading may be used to write items in parallel
Explain Provisioned Capacity
  • Minimum capacity required
  • Able to set a budget (maximum capacity)
  • Subject to throttling
  • Auto scaling available
  • Risk of underprovisioning — monitor your metrics
  • Lower price per API call
  • $0.00065 per WCU-hour (us-east-1)
  • $0.00013 per RCU-hour (us-east-1)
  • $0.25 per GB-month (first 25 GB is free)
Explain On-Demand Capacity
  • No minimum capacity: pay more per request than provisioned capacity
  • Idle tables not charged for read/write, but only for storage and backups
  • No capacity planning required — just make API calls
  • Eliminates the tradeoffs of over- or under-provisioning
  • Use on-demand for new product launches
  • Switch to provisioned once a steady state is reached
  • $1.25 per million WCU (us-east-1)
  • $0.25 per million RCU (us-east-1 )
Explain Point-in-Time Recovery (PITR)

Helps protect your DynamoDB tables from accidental writes or deletes. You can restore your data to any point in time in the last 35 days.

  • DynamoDB maintains incremental backups of your data.
  • Point-in-time recovery is not enabled by default.
  • The latest restorable timestamp is typically five minutes in the past.

After restoring a table, you must manually set up the following on the restored table:

  • Auto scaling policies
  • AWS Identity and Access Management (IAM) policies
  • Amazon CloudWatch metrics and alarms
  • Tags
  • Stream settings
  • Time to Live (TTL) settings
  • Point-in-time recovery settings
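
A minimal boto3 sketch of enabling PITR on a table (the table name is hypothetical):

import boto3

dynamodb = boto3.client('dynamodb')

# PITR is not enabled by default, so turn it on explicitly
dynamodb.update_continuous_backups(
    TableName='orders',
    PointInTimeRecoverySpecification={'PointInTimeRecoveryEnabled': True}
)

# Check the earliest and latest restorable times
desc = dynamodb.describe_continuous_backups(TableName='orders')
print(desc['ContinuousBackupsDescription']['PointInTimeRecoveryDescription'])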

What are partitions?
  • They are the underlying storage and processing nodes of DynamoDB
  • Initially, one table equates to one partition
  • Initially, all the data for that table is stored by that one partition
  • We don’t directly control the number of partitions
  • A partition can store 10 GB
  • A partition can handle 3000 RCU and 1000 WCU
  • So there is a capacity and performance relationship to the number of partitions. THIS IS A KEY CONCEPT
  • Design tables and applications to avoid I/O “hot spots”/“hot keys”
  • When >10 GB, >3,000 RCU, or >1,000 WCU is required, a new partition is added and the data is spread between partitions over time.
  • Partitions will automatically increase
  • While there is an automatic split of data across partitions, there is no automatic decrease when load/performance reduces
  • Allocated WCU and RCU is split between partitions
  • Each partition key is:
    • Limited to 10 GB of data
    • Limited to 3,000 RCU and 1,000 WCU
  • Key concepts:
    • Be aware of the underlying storage infrastructure – partitions
    • Be aware of what influences the number of partitions:
      • Capacity
      • Performance (WCU/RCU)
    • Be aware that partitions increase, but they don’t decrease
Explain Indexes in DynamoDB?

Without indexes, DynamoDB offers two main data retrieval operations: Scan and Query.
Indexes allow secondary representations of the data in a table.
They allow efficient queries on those representations.
Indexes come in two forms – Global Secondary and Local Secondary.

Explain Local Secondary Indexes(LSI)

• LSIs contain the table’s partition key, the table’s sort key, and the new sort key, plus optional projected values
• Any data written to the table is copied asynchronously to any LSIs
• Shares RCU and WCU with the table
• An LSI is a sparse index: the index only contains an ITEM if the index sort key attribute is present in the table item (row)

• Storage and performance considerations with LSIs
• Any non-key values are, by default, not stored in an LSI
• If you query an attribute that is NOT projected, you are charged for the entire ITEM cost of pulling it from the main table
• Take care with planning your LSIs and item projections – it’s important

Explain Global Secondary Indexes

• It shares many of the same concepts as a Local secondary index, BUT, with a GSI we can have an alternative Partition & sort key
• Options for attribute projection
• KEYS_ONLY – New partition and sort keys, old partition key and, if applicable, old sort key
• INCLUDE – Specify custom projection values
• ALL – Projects all attributes
• Unlike LSI’s where the performance is shared with the table, RCU and WCU are defined on the GSI – in the same way as the table
• As with LSI, changes are written to the GSI asynchronously
• GSI’s ONLY support eventually consistent reads

What is a DynamoDB stream ?

• When a stream is enabled on a table, it records changes to a table and stores those values for 24 hours
• A stream can be enabled on a table from the console or API
• But can only be read or processed via the streams endpoint and API requests
• streams.dynamodb.us-west-2.amazonaws.com

• AWS guarantees that each change to a DynamoDB table occurs in the stream once and only once AND…
• that ALL changes to the table occur in the stream in near real time

• A Lambda function triggered when items are added to a DynamoDB stream, performing analytics on the data
• A Lambda function triggered when a new user signup happens on your web app and data is entered into a users table
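
A minimal sketch of such a Lambda handler wired to a DynamoDB stream event source (the processing logic is hypothetical):

def lambda_handler(event, context):
    # Each invocation receives a batch of stream records describing item-level changes
    for record in event['Records']:
        event_name = record['eventName']                 # INSERT, MODIFY, or REMOVE
        keys = record['dynamodb']['Keys']
        new_image = record['dynamodb'].get('NewImage')   # present for INSERT and MODIFY
        print(event_name, keys, new_image)
    return {'processed': len(event['Records'])}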

How will you Put Item in DynamoDB through Boto3?
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('employees')
table.put_item(
    Item={
        'emp_id': '3',
        'name': 'vikas',
        'salary': 2000
    }
)

How will you get and delete item from DynamoDB through Boto3?
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('employees')
resp = table.get_item(
#Key is dictionary
    Key={
        'emp_id': '3'
    }
)

print(resp['Item'])

table.delete_item(
    Key={
        'emp_id': '3'
    }
)
How will you insert batch records into Dynamodb through Boto3?
import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('employees')

with table.batch_writer() as batch:
    for x in range(100):
        batch.put_item(
            Item={
                'emp_id': str(x),
                'name': 'Name-{}'.format(x)
            }
        )

What is AWS Glue?
  • AWS Glue is a cloud service that prepares data for analysis through automated extract, transform and load processes.
  • It supports MySQL, Oracle, Microsoft SQL Server and PostgreSQL databases that run on Amazon Elastic Compute Cloud instances.
  • It is a fully-managed, pay-as-you-go, extract, transform, and load (ETL) service that automates the steps of data preparation for analytics.
  • It has a flexible scheduler that handles dependency resolution and job monitoring.
  • AWS Glue is serverless, meaning there’s no infrastructure to set up or manage.
  • AWS Glue consists of:
    • Central metadata repository-AWS Glue Data Catalog
    • ETL engine
    • Flexible scheduler
John is new to the AWS Glue service and he has been given a new task to process data through Glue. What would be his action items?
  • Step 1: He has to define a crawler to populate AWS Glue Data Catalog with metadata table definitions.
  • Step 2: He has to point the crawler at a data store; the crawler creates table definitions in the Data Catalog.
  • Step 3: He can write a script to transform the data, or he can provide the script in the AWS Glue console (e.g., a Spark script).
  • Step 4: He can run the job, or he can set it up to start when a specified trigger occurs.
  • Step 5: Once the job runs, a script extracts data from the data source, transforms the data, and loads it to the data target.
  • Step 6: This script runs in an Apache Spark environment in AWS Glue.
What will happen when a crawler Runs?
  • It classifies data to determine the format, schema, and associated properties of the raw data.
  • It groups data into tables or partitions based on crawler heuristics.
  • It writes metadata to the Data Catalog.
  • A crawler can crawl multiple data stores in a single run.
    After the crawl completes, the crawler creates or updates one or more tables in the Data Catalog.
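
A minimal boto3 sketch of defining and starting a crawler (the crawler name, IAM role, database, and S3 path are hypothetical):

import boto3

glue = boto3.client('glue')

# Define a crawler over an S3 path
glue.create_crawler(
    Name='sales-crawler',
    Role='arn:aws:iam::123456789012:role/GlueCrawlerRole',
    DatabaseName='sales_db',
    Targets={'S3Targets': [{'Path': 's3://my-sales-bucket/raw/'}]}
)

# Run it; when the crawl finishes it creates or updates table definitions in the Data Catalog
glue.start_crawler(Name='sales-crawler')
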
What is Data Catalog in Glue?
  • It is a central repository and persistent metadata store to store structural and operational metadata.
  • It stores table definitions, job definitions, and other control information to manage your AWS Glue environment.
  • Table definitions are available for ETL and also available for querying in Athena, EMR, and Redshift Spectrum to provide a view of the data between these services.
  • Each AWS account has one AWS Glue Data Catalog per region.

Explain AWS GLUE Crawler.

  • It is a program that connects to a data store (source or target), progresses through a prioritized list of classifiers to determine the schema for your data and then creates metadata tables in the Data Catalog.
  • It scans various data stores to infer schema and partition structure and populate the Glue Data Catalog with corresponding table definitions and statistics.
  • It can be scheduled to run periodically. Doing so keeps the metadata up to date and in sync with the underlying data.
  • It automatically adds new tables, new partitions to existing tables, and new versions of table definitions.
  • It can determine the schema of complex unstructured or semi-structured data, which can save a lot of time.
When do I use a Glue Classifier in project?
  • It reads the data in a data store.
  • If it identifies the format of the data then it generates a schema.
  • It provides classifiers for common file types, such as CSV, JSON, AVRO, XML, and others.
  • AWS Glue provides a set of built-in classifiers, but you can also create custom classifiers.
  • You can set up your crawler with an ordered set of classifiers.
  • When the crawler invokes a classifier, the classifier determines whether the data is recognized or not.
What is Trigger in AWS Glue?

A trigger starts ETL jobs and crawlers. Triggers can be defined based on a scheduled time or an event, or they can be fired on demand.

John joined a new company where he is working on a migration project. His project is moving its ETL onto a serverless Apache Spark-based platform.
Which service is recommended for streaming?

AWS Glue is recommended for Streaming when your use cases are primarily ETL and when you want to run jobs on a serverless Apache Spark-based platform.

How will you import data from Hive Metastore to the AWS Glue Data Catalog?

Migration through Amazon S3:
Step 1: Run an ETL job to read data from your Hive metastore
and export it (database, table, and partition objects) to an intermediate format in Amazon S3.

Step 2: Import that data from S3 into the AWS Glue Data Catalog through an AWS Glue ETL job.

Direct Migration:
You can set up an AWS Glue ETL job which extracts metadata from your Hive metastore and loads it into your AWS Glue Data Catalog through an AWS Glue connection.

How can you import data from AWS Glue to Hive Metastore through Direct Migration?

We can run a job on the AWS Glue console which extracts metadata from specified databases in the AWS Glue Data Catalog and loads it into a Hive metastore.
As a prerequisite, it requires an AWS Glue connection to the Hive metastore as a JDBC source.

How can you Migrate data from AWS Glue to Hive Metastore through Amazon S3?

We can use two AWS Glue jobs here.
The first job extracts metadata from databases in AWS Glue Data Catalog and loads them into S3. The first job is run on AWS Glue Console.
The second job loads data from S3 into the Hive Metastore. The second can be run either on the AWS Glue Console or on a cluster with Spark installed.

How can you Migrate data from AWS Glue to AWS Glue?

We can use two AWS Glue jobs here.
The first extracts metadata from specified databases in an AWS Glue Data Catalog and loads them into S3.
The second loads data from S3 into an AWS Glue Data Catalog.

What are Time-Based Schedules for Jobs and Crawlers?

We can define a time-based schedule for crawlers and jobs in AWS Glue. When the specified time is reached, the schedule activates and the associated jobs or crawlers execute.
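
A minimal boto3 sketch of a time-based trigger (the trigger name, job name, and cron expression are hypothetical):

import boto3

glue = boto3.client('glue')

# Start a job every day at 02:00 UTC
glue.create_trigger(
    Name='nightly-etl-trigger',
    Type='SCHEDULED',
    Schedule='cron(0 2 * * ? *)',
    Actions=[{'JobName': 'nightly-etl-job'}],
    StartOnCreation=True
)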

ETL transformations using GLUE

Example of standard imports in AWS GLUE

In AWS Glue, various PySpark and Scala methods and transforms specify the connection type using a connectionType parameter. 

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
Create spark, glue and job context in AWS GLUE
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

# Lets create the spark, glue and job context
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
Creating a Glue DynamicFrame by reading a table defined in the Data Catalog
test_glue_df = glueContext.create_dynamic_frame.from_catalog(database = "TestDB", table_name = "Test")
How Hadoop can be replaced by AWS Glue?
  • AWS Glue is simple, and whether it is expensive depends on what we compare it with.
  • Because of on-demand pricing, we only pay for what we use.
  • AWS Glue may be significantly cheaper than a fixed-size on-premises Hadoop cluster.
  • So we can replace Hadoop with the AWS Glue service.
Why AWS Lambda can not be used in BIG DATA?

Lambda functions are simple, scalable, and cost efficient. They can also be triggered by events.
For big data, Lambda functions are not suitable because of the 3 GB memory limitation and the 15-minute timeout.
That’s why AWS Glue is used to process large datasets.

What makes AWS Glue serverless?
  • In Glue, AWS provisions and allocates the resources automatically.
  • In Glue, the processing power is adjusted by the number of data processing units (DPU).
  • With AWS Glue your bill is calculated as per the following equation:
    • [ETL job cost] = [processing time in hours] * [number of DPUs] * [price per DPU-hour]
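
A small worked example of that equation (the per-DPU-hour rate varies by region; the $0.44 figure below is an assumption for illustration):

dpus = 10
processing_time_hours = 0.5
price_per_dpu_hour = 0.44  # assumed rate, check current AWS Glue pricing

job_cost = processing_time_hours * dpus * price_per_dpu_hour
print(f"Estimated ETL job cost: ${job_cost:.2f}")  # $2.20
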
Print the schema of the dataframe

test_glue_df.printSchema()

How resolve choice can be used to resolve ambiguities?

Data can contain ambiguities. Some columns can contain int and string type data. If you used a crawler to create the table the crawler only evaluates a small percentage of the data to determine the datatype of a column and it could classify a string as an int. Resolve choice can be used to resolve such ambiguities

test_glue_res = test_glue_df.resolveChoice(specs = [('test_glue','cast:long')])
test_glue_res.printSchema()

How will you run a sql query against a dataframe?

# First create a temporary view that will be queried

test_glue_df.toDF().createOrReplaceTempView("test_glue_view")

# execute your query against the temporary view

sqlDF = spark.sql("select * from test_glue_view")
sqlDF.show()

Which command in Redshift is efficient in loading large amounts of data?

Answer : COPY. A COPY command loads large amounts of data much more efficiently than using INSERT statements, and stores the data more effectively as well. We can use a single COPY command to load data for one table from multiple files. Amazon Redshift then automatically loads the data in parallel.
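
A hedged sketch of issuing such a COPY through the Redshift Data API with boto3 (the cluster, database, user, table, S3 path, and IAM role below are hypothetical):

import boto3

redshift_data = boto3.client('redshift-data')

copy_sql = """
    COPY sales
    FROM 's3://my-sales-bucket/2023/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
    FORMAT AS CSV;
"""

# The COPY runs inside the cluster and loads the files in parallel
redshift_data.execute_statement(
    ClusterIdentifier='analytics-cluster',
    Database='dev',
    DbUser='awsuser',
    Sql=copy_sql
)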

You are working in an e-commerce company where you have an order processing system in AWS. There are many EC2 instances to pick up the orders from the application, and these EC2 instances are in an Auto Scaling group to process the orders. What will you do to ensure that the EC2 processing instances are correctly scaled based on demand?

Answer : We can use SQS queues to decouple the architecture and can scale the processing servers based on the queue length. SQS is a queue from which services pull data, and it supports at-least-once delivery of messages (exactly-once processing with FIFO queues). If no workers pull jobs from SQS, the messages stay in the queue. SNS is a publisher-subscriber system that pushes messages to subscribers. If there are no subscribers to an SNS topic, a given message is lost.

Which service will you use to collect, process, and analyze video streams in real time?

Answer : Amazon Kinesis Video Streams

In an AWS EMR cluster, which node is responsible for running the YARN service?

Answer : Master Node

In your client's big data project, you are trying to connect to the master node of your EMR cluster. What should be checked to ensure that the connection is successful?

Answer : We can check the Inbound rules for the Security Group for the master node. Under Security and access choose the Security groups for Master link. Choose ElasticMapReduce-master from the list. Choose Inbound, Edit. Check for an inbound rule that allows public access with the following settings.

Which AWS service will you use to perform ad-hoc analysis on log data?

Answer : Amazon Elasticsearch Service. It is a popular open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analysis. You can search specific error codes and reference numbers quickly.

What will you do for query optimization after data has been ingested into Redshift?

Answer : We can run the ANALYZE command so that the optimizer can generate up-to-date data statistics. Amazon Redshift monitors changes to your workload and automatically updates statistics in the background. In addition, the COPY command performs an analysis automatically when it loads data into an empty table. To explicitly analyze a table or the entire database, run the ANALYZE command.

An application is currently using the Elasticsearch Service in AWS. How can you take backups of a cluster’s data through Elasticsearch?

Answer : Automated snapshots. By default, the AWS Elasticsearch Service already comes with regular automated snapshots. These snapshots can only be used for recovery of the same cluster, not for migration to a new Elasticsearch cluster, and they can only be accessed as long as the Elasticsearch API of the cluster is available.

Which AWS service can be used to monitor EMR clusters and give reports of the performance of the cluster as a whole?

Answer : Amazon CloudWatch. You can view the metrics that Amazon EMR reports to CloudWatch using the Amazon EMR console or the CloudWatch console.

Sometimes when you try to terminate an EMR cluster, it does not terminate. What could be a possible reason for this?

Answer : Termination protection is set on the cluster. If you are terminating a cluster that has termination protection turned on, you must disable termination protection first. Then you can terminate the cluster. Clusters can be terminated using the console, the AWS CLI, or programmatically using the TerminateJobFlows API.

Which node type is recommended when launching a Redshift cluster?

Answer : Dense Storage. DS2 allows a storage-intensive data warehouse with vCPU and RAM included for computation. DS2 nodes use HDDs (hard disk drives) for storage, and as a rule of thumb, if the data is more than 500 GB, go for DS2 instances.

Where do the query results from Athena get stored?

Answer : In Amazon S3

How will you convert and migrate an on-premises Oracle database to Amazon Aurora?

Answer : First we will convert the database schema and code using the AWS Schema Conversion Tool, then we will migrate data from the source database to the target database using AWS Database Migration Service (DMS).

You expect a large number of GET and PUT requests on an S3 bucket. You could expect around 300 PUT and 500 GET requests per second on the S3 bucket during a selling period on your website. How will you design for optimal performance?

Answer : We have to ensure the objects have appropriate key-name prefixes.

Your project is currently running an EMR cluster which is used to perform a processing task every day from 5 pm to 10 pm. But the data admin has noticed that the cluster is being billed for the entire day. How will you configure the cluster to reduce costs?

Answer : We can use transient clusters in EMR. There are two kinds of EMR clusters: transient and long-running. If you want to configure your cluster to be automatically terminated then it is terminated after all the steps complete.This is a transient cluster. Transient clusters are compute clusters that automatically shut down and stop billing when processing is finished.

Which storage types can be used with Amazon EMR?

Answer : Local file system

HDFS

EMRFS

Consider you have a large volume of data. You have to store it and access it for a short period, but then it needs to be archived indefinitely. What is a cost-effective solution?

Answer : We can store the data in Amazon S3 and use lifecycle policies to archive it to Amazon Glacier.

Which component of the Amazon Machine Learning service is used to generate predictions using the patterns extracted from the input data?

Answer : Models

Which component of a Redshift cluster, if down, renders the Redshift cluster unavailable?

Answer : Leader Node. The Leader Node in an Amazon Redshift cluster manages all external and internal communication. It is responsible for preparing query execution plans whenever a query is submitted to the cluster. The Leader Node distributes data to the slices and allocates parts of a user query or other database operation to the slices. Slices work in parallel to perform the operations.

Which SQL function can be used in Redshift to specify a result when there are multiple conditions?
Answer : CASE expression
You have to create an Amazon Machine Learning model to predict how many inches of snow will fall in an area based on the historical snowfall data. What type of modeling will you use?

Answer : Regression

What is Shard in AWS Kinesis?

A shard is a uniquely identified sequence of data records in a stream; each shard provides a fixed unit of capacity.

How will you load streaming data and establish scalable private connections to on-premises data centers? Which services will you use for that?

Answer : Direct Connect and Kinesis Firehose

  • Establish a dedicated network connection from your premises to AWS.
  • AWS Direct Connect makes it easy to establish a dedicated network connection from your premises to AWS.
  • Using AWS Direct Connect, you can establish private connectivity between AWS and your datacentre.
  • Amazon Kinesis Firehose is the easiest way to load streaming data into AWS. It can capture and automatically load streaming data into Amazon S3 and Amazon Redshift
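
A minimal boto3 sketch of pushing a record to a Firehose delivery stream (the stream name and payload are hypothetical):

import boto3
import json

firehose = boto3.client('firehose')

firehose.put_record(
    DeliveryStreamName='clickstream-to-s3',
    Record={'Data': (json.dumps({'user': 'u1', 'page': '/home'}) + '\n').encode('utf-8')}
)
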
Which database would be best for storing and analyzing the complex interpersonal relationships of people involved in organized crime?
  • Amazon Neptune is a purpose-built, high-performance graph database. It is optimized for processing graph queries.
  • After creating an instance in Amazon Elastic Compute Cloud (Amazon EC2), you can log into that instance using SSH and connect to an Amazon Neptune DB cluster.
Amazon Elastic File System (Amazon EFS) provides simple, scalable, elastic file storage for use with AWS Cloud services and on-premises resources. You have decided to use EFS for sharing files across many EC2 instances and you want to be able to tolerate an AZ failure. What should you do?

Correct Answer: We can create EFS mount targets in each AZ and configure each EC2 instance to mount the file system using the mount target in its AZ.

Which AWS services allow native encryption of data at rest?
  • EBS, S3 and EFS are AWS Services which allow native encryption of data, while at rest.
  • All allow the user to configure encryption at rest.
  • They can use either the AWS Key Management Service (KMS) or customer provided keys.
  • The exception is ElastiCache for Memcached, which does not offer a native encryption service, whereas ElastiCache for Redis does.
  • AWS Snowball encrypts data at rest by default as well.
Which library is used by the Spark Streaming tool to consume data from Amazon Kinesis?

Answer : Amazon Kinesis Client Library (KCL)

There is a requirement to perform SQL querying along with complex queries on different backend data stores that include Redshift, MySQL, Hive on EMR, S3, and PostgreSQL. How can we use Presto in this case?

Answer : Presto is a high performance, distributed SQL query engine for big data. Its architecture allows users to query a variety of data sources such as Hadoop, AWS S3, Alluxio, MySQL, Cassandra, Kafka, MongoDB and Teradata. 

We need to perform ad-hoc SQL queries on structured data in the project. Data comes in constantly at a high velocity, so what services should we use?

Answer : Kinesis Firehose + Redshift

Consider you have to load a lot of data once a week from your on-premises data center to AWS Redshift. Which AWS-managed cloud data migration tool can be used for this data transfer in a simple, fast, and secure way?

Answer : Direct Connect

Which service is used by AWS Athena for partitioning data?
Answer : Hive
You need a cost-effective solution to store a large collection of video files and a fully managed data warehouse service that can keep track of and analyze all your data efficiently using your existing business intelligence tools. How will you fulfill the requirements?

Answer : Store the data in Amazon S3 and reference its location in Amazon Redshift. Amazon Redshift will keep track of metadata about your binary objects, but the large objects themselves will be stored in Amazon S3.

In your project, consider your EMR cluster uses ten m4.large instances and runs 24 hours per day, but it is only used for processing and reporting during working hours. How will you reduce the costs?

Answer : We can use Spot Instances for task nodes when needed, and we can migrate the data from HDFS to S3 using S3DistCp and turn off the cluster when not in use.

Your application generates a 2 KB JSON payload that needs to be queued and delivered to EC2 instances for applications. At the end of the day, the application needs to replay the data for the past 24 hours. Which service would you use for this requirement?

Answer : Kinesis

The Amazon DynamoDB Query action lets you retrieve data in a similar fashion. You can use Query with any table that has a composite primary key (partition key and sort key). You must specify an equality condition for the partition key, and you can optionally provide another condition for the sort key. You need to improve performance of queries to your DynamoDB table. The most common queries do not use the partition key. What should you do?

Correct Answer: Create a global secondary index with the most commonly queried attribute as the hash key.

Which data formats does Amazon Athena support?

Correct Answer: Apache Parquet, Apache ORC, and JSON

Build data-intensive apps or boost the performance of your existing databases by retrieving data from high-throughput, low-latency in-memory data stores. Amazon ElastiCache is a popular choice for real-time use cases like caching, session stores, gaming, geospatial services, real-time analytics, and queuing. You are trying to decide which product you should select for your in-memory cache needs. You require support for encryption. Which service should you choose?

Correct Answer: ElastiCache Redis

Consider you are working at a commercial delivery IoT company where you have to track coordinates through GPS-enabled devices. You receive coordinates, which are transmitted from each device once every 8 seconds. Now you need to process these coordinates in real time from multiple sources. Which tool should you use to ingest the data?

Answer : Amazon Kinesis. Amazon Kinesis Data Streams is a scalable and durable real-time data streaming service that can continuously capture gigabytes of data per second from hundreds of thousands of sources.

Amazon DynamoDB is a NoSQL database that supports key-value and document data models. Developers can use DynamoDB to build modern, serverless applications that can start small and scale globally to support petabytes of data and tens of millions of read and write requests per second. What DynamoDB features can be utilised to increase the speed of read operations?

Correct Answer: DynamoDB Accelerator (DAX) and Secondary Indexes

You are architecting a complex application landscape that values fast disk I/O for EC2 instances above everything else. Which storage option would you choose?

Correct Answer: Instance Store

You want to allow your VPC instances to resolve using on-prem DNS. Can you do this and how/why?

Correct Answer: Yes, by configuring a DHCP Option Set to issue your on-prem DNS IP to VPC clients.

Which command can be used to transfer the results of a query in Redshift to Amazon S3?

Answer : UNLOAD. UNLOAD connects to Amazon S3 using an HTTPS connection. For unloading data from database tables to a set of files in an Amazon S3 bucket, we can use the UNLOAD command with a SELECT statement. Redshift splits the results of the select statement across a set of files, one or more files per node slice.

We have a set of web servers hosted on EC2 instances and have to push the logs from these web servers onto a suitable storage device for subsequent analysis. How will you implement this?

Answer : First we have to install and configure the Kinesis agents on the web servers. Then we have to ensure that Kinesis Firehose is set up to take the data and send it across to Redshift for further processing.

When estimating the cost of using EMR, which parameters should you consider?

Answer : The price of the underlying EC2 instances, the price of the EMR service, and the price of EBS storage if used.

Amazon Aurora is a MySQL and PostgreSQL-compatible relational database built for the cloud, that combines the performance and availability of traditional enterprise databases with the simplicity and cost-effectiveness of open source databases.
In your project, you have decided to migrate your on-prem legacy Informix database to Amazon Aurora. How might this be facilitated most efficiently?

Correct Answer: You can manually create the target schema on Aurora then use Data Pipeline with JDBC to move the data.


You are migrating from an Oracle on-prem database to an Oracle RDS database. Which of these describes this migration properly?

Correct Answer: Homogenous migration

Which services can be used for auditing S3 buckets?

Answer : CloudTrail and AWS Config. AWS CloudTrail is a service that enables governance, compliance, operational auditing, and risk auditing of your AWS account.
AWS Config is a service that enables us to assess, audit, and evaluate the configurations of AWS resources. Config continuously monitors and records AWS resource configurations and allows you to automate the evaluation of recorded configurations against desired configurations.

Which service would you use to check CPU utilization of your EC2 instances?

Answer : CloudWatch

Which managed service can be used to deliver real-time streaming data to S3?

Answer : Kinesis Firehose. Amazon Kinesis Data Firehose is a fully managed service for delivering real-time streaming data to destinations such as Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon Elasticsearch Service (Amazon ES), Splunk, and any custom HTTP endpoint or HTTP endpoints owned by supported third-party services.

In your client project, there is a requirement for a vendor to have access to an S3 bucket in your account. The vendor already has an AWS account. How can you provide access to the vendor on this bucket?

Answer : Create an S3 bucket policy that allows the vendor to read from the bucket from their AWS account. A bucket policy is a resource-based AWS Identity and Access Management policy. We can add a bucket policy to a bucket to grant other AWS accounts or IAM users access permissions for the bucket and the objects in it.

Which file format is supported in Athena by default?

Answer : Apache Parquet. Amazon Athena supports a wide variety of data formats like CSV, TSV, JSON, or text files and also supports open-source columnar formats such as Apache ORC and Apache Parquet. Athena also supports compressed data in Snappy, Zlib, LZO, and GZIP formats.

You have set up a Redshift cluster in your AWS development account in us-east-1. Now your manager has decided to move the cluster to the production account in us-west-1. What will you do as the first step?

Answer : Create a manual snapshot of the Redshift cluster. A snapshot contains data from any databases that are running on your cluster. It also contains information about your cluster, including the number of nodes, node type, and master user name. If you restore your cluster from a snapshot, Amazon Redshift uses the cluster information to create a new cluster.

What are the 2 types of nodes in a Redshift cluster?

Answer : Leader Node, Compute Node.

Redshift Architecture and Its Components

  • Leader Node.
  • Compute Node.
  • Node Slices.
  • Massively Parallel Processing.
  • Columnar Data Storage.
  • Data Compression.
  • Query Optimizer.
You are trying to use a SQL client tool from an EC2 instance, but you are not able to connect to the Redshift cluster. What must you do?

Answer : Modify the VPC Security Groups.

Open the Amazon VPC console at https://console.aws.amazon.com/vpc/ .

  1. In the navigation pane, choose Security Groups.
  2. Select the security group to update.
  3. Choose Actions, Edit inbound rules or Actions, Edit outbound rules.
  4. Modify the rule entry as required.
  5. Choose Save rules.
Which method should you use for publishing and analyzing the logs (logs from the EC2 instances need to be published and analyzed for a new application feature)?

Answer : Use consumers to analyze the logs

Open the Amazon EC2 console at https://console.aws.amazon.com/ec2/ .

  1. In the left navigation pane, choose Instances, and select the instance.
  2. Choose Actions, Monitor and troubleshoot, Get system log. 
Which chart do you use for comparing measure values over time in Amazon QuickSight?

Answer : Line charts. Use line charts to compare changes in measure values over a period of time, for the following scenarios:

  • One measure over a period of time.
  • Multiple measures over a period of time.
  • One measure for a dimension over a period of time.

Your project has a Redshift cluster for petabyte-scale data warehousing, and your project manager wants to reduce the overall total cost of running the Redshift cluster. How will you meet the needs of the running cluster while still reducing total overall cost?

Answer : Disable automated and manual snapshots on the cluster. To disable automated snapshots, set the retention period to zero. If you disable automated snapshots, Amazon Redshift stops taking snapshots and deletes any existing automated snapshots for the cluster.

You are working with a Kinesis stream. What is used to group data by shard within a stream?

Answer : Partition Key. A partition key is used to group data by shard within a stream. Kinesis Data Streams segregates the data records belonging to a stream into multiple shards.
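
A minimal boto3 sketch showing the partition key on a put (the stream name and payload are hypothetical); records sharing a partition key are routed to the same shard, which preserves their ordering within that shard:

import boto3
import json

kinesis = boto3.client('kinesis')

kinesis.put_record(
    StreamName='gps-coordinates',
    Data=json.dumps({'device_id': 'd-42', 'lat': 52.1, 'lon': 4.3}).encode('utf-8'),
    PartitionKey='d-42'  # same key -> same shard
)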
