When troubleshooting slowness on an EMR cluster, which of the following node types does not need to be investigated for issues?
Answer : Core Nodes
- Step 1: Gather Data About the Slowness Issue
- Step 2: Check the Environment
- Step 3: Examine the Log Files
- Step 4: Check Cluster and Instance Health
- Step 5: Check for Suspended Groups
- Step 6: Review Configuration Settings
- Step 7: Examine Input Data
What is the default input data format for Amazon EMR?
Answer : Text. The default input format for a cluster is text files with each line separated by a newline (\n) character, which is the input format most commonly used.
You are planning on loading an 800 GB file into a Redshift cluster which has 10 nodes. What is a preferable method to load the data?
Answer : We can split the file into 800 smaller files so that the COPY command can load them in parallel across the cluster's slices.
You are planning on loading a huge amount of data into a Redshift cluster. You are not sure whether the load will succeed or fail. Which of the options below can help if an error occurs during the load process?
Answer : Use the COPY command with the NOLOAD option. When the NOLOAD parameter is used in the COPY command, Redshift checks the data files' validity without inserting any records into the target table.
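A minimal sketch of running such a validation-only COPY through the Redshift Data API with boto3; the cluster identifier, database, user, S3 path, and IAM role ARN below are placeholders.
import boto3

redshift_data = boto3.client("redshift-data")

# COPY with NOLOAD validates the input files without inserting any rows.
response = redshift_data.execute_statement(
    ClusterIdentifier="my-cluster",      # placeholder cluster name
    Database="dev",
    DbUser="awsuser",
    Sql="""
        COPY sales
        FROM 's3://my-bucket/data/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftCopyRole'
        CSV
        NOLOAD;
    """,
)
print(response["Id"])  # statement id; poll its status with describe_statement()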
Which service can be used to run ad-hoc queries for data in S3?
Answer : Athena. Amazon Athena helps us to analyze data stored in Amazon S3. We can use Athena to run ad-hoc queries using ANSI SQL, without the need to aggregate or load the data into Athena. Amazon Athena can process unstructured, semi-structured, and structured data sets.
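A minimal boto3 sketch of submitting such an ad-hoc query; the database name, table, and results bucket are placeholders.
import boto3

athena = boto3.client("athena")

# Run an ad-hoc ANSI SQL query directly against data stored in S3.
response = athena.start_query_execution(
    QueryString="SELECT page, COUNT(*) AS hits FROM weblogs GROUP BY page LIMIT 10",
    QueryExecutionContext={"Database": "my_database"},                  # placeholder database
    ResultConfiguration={"OutputLocation": "s3://my-query-results/"},   # placeholder results bucket
)
print(response["QueryExecutionId"])  # poll with get_query_execution(), read rows with get_query_results()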
What is AWS Glue ?
- AWS Glue is a cloud service that prepares data for analysis through automated extract, transform, and load (ETL) processes.
- It supports MySQL, Oracle, Microsoft SQL Server, and PostgreSQL databases that run on Amazon Elastic Compute Cloud (EC2) instances.
- It is a fully managed, pay-as-you-go ETL service that automates the steps of data preparation for analytics.
- It includes a flexible scheduler that handles dependency resolution and job monitoring.
- AWS Glue is serverless, which means there is no infrastructure to set up or manage.
- AWS Glue consists of:
- A central metadata repository (the AWS Glue Data Catalog)
- An ETL engine
- A flexible scheduler
John is new to the AWS Glue service and has been given a new task to process data through Glue. What would his action items be?
- Step 1: Define a crawler to populate the AWS Glue Data Catalog with metadata table definitions.
- Step 2: Point the crawler at a data store; the crawler creates table definitions in the Data Catalog.
- Step 3: Write a script to transform the data (for example, a Spark script), or generate one in the AWS Glue console.
- Step 4: Run the job on demand, or set it up to start when a specified trigger occurs.
- Step 5: When the job runs, the script extracts data from the data source, transforms it, and loads it into the data target.
- Step 6: The script runs in an Apache Spark environment in AWS Glue. (A minimal boto3 sketch of these steps follows.)
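The crawler and job calls from the steps above could look roughly like this boto3 sketch; the crawler name, IAM role, database, S3 path, and job name are placeholders, and the ETL job itself is assumed to be already defined.
import boto3

glue = boto3.client("glue")

# Steps 1-2: define a crawler pointed at an S3 data store; it populates the Data Catalog.
glue.create_crawler(
    Name="test-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueServiceRole",
    DatabaseName="TestDB",
    Targets={"S3Targets": [{"Path": "s3://my-raw-data/"}]},
)
glue.start_crawler(Name="test-crawler")

# Steps 4-5: run the ETL job whose script extracts, transforms, and loads the data.
run = glue.start_job_run(JobName="test-etl-job")
print(run["JobRunId"])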
What happens when a crawler runs?
- It classifies data to determine the format, schema, and associated properties of the raw data.
- It groups data into tables or partitions; data is grouped based on crawler heuristics.
- It writes metadata to the Data Catalog.
- A crawler can crawl multiple data stores in a single run.
When the crawler finishes, it creates or updates one or more tables in the Data Catalog.
What is Data Catalog in Glue?
- It is a central repository and persistent metadata store to store structural and operational metadata.
- It stores table definitions, job definitions, and other control information needed to manage your AWS Glue environment.
- Table definitions are available for ETL and can also be queried in Athena, EMR, and Redshift Spectrum, providing a common view of the data across these services.
- Each AWS account has one AWS Glue Data Catalog per region. (A minimal sketch of listing its table definitions follows.)
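A minimal boto3 sketch of listing the table definitions stored in the Data Catalog; the database name TestDB follows the earlier examples.
import boto3

glue = boto3.client("glue")

# List the table definitions that crawlers and jobs have written to the Data Catalog.
tables = glue.get_tables(DatabaseName="TestDB")
for table in tables["TableList"]:
    columns = table.get("StorageDescriptor", {}).get("Columns", [])
    print(table["Name"], [col["Name"] for col in columns])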
Explain the AWS Glue crawler.
- It is a program that connects to a data store (source or target), progresses through a prioritized list of classifiers to determine the schema of your data, and then creates metadata tables in the Data Catalog.
- It scans various data stores to infer schema and partition structure and populates the Glue Data Catalog with the corresponding table definitions and statistics.
- It can be scheduled to run periodically so that the metadata stays up to date and in sync with the underlying data.
- It automatically adds new tables, new partitions to existing tables, and new versions of table definitions.
- It can determine the schema of complex unstructured or semi-structured data, which can save a great deal of time.
When do I use a Glue classifier in a project?
- A classifier reads the data in a data store.
- If it recognizes the format of the data, it generates a schema.
- Classifiers exist for common file types such as CSV, JSON, Avro, XML, and others.
- AWS Glue provides a set of built-in classifiers, but you can also create custom classifiers.
- You can set up your crawler with an ordered set of classifiers.
- When the crawler invokes a classifier, the classifier determines whether the data is recognized or not. (A minimal sketch of registering a custom classifier follows.)
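A minimal boto3 sketch of registering a custom grok classifier; the classifier name, classification label, and pattern are placeholders.
import boto3

glue = boto3.client("glue")

# A crawler configured with this custom classifier tries it before the built-in classifiers.
glue.create_classifier(
    GrokClassifier={
        "Name": "app-log-classifier",
        "Classification": "application-logs",
        "GrokPattern": "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}",
    }
)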
What is a trigger in AWS Glue?
A trigger starts one or more ETL jobs or crawlers, and we can define triggers based on a scheduled time or an event.
John joined a new company where he is working on a migration project. His project is moving its ETL workloads to a serverless Apache Spark-based platform.
Which service is recommended for streaming?
AWS Glue is recommended for streaming when your use cases are primarily ETL and you want to run jobs on a serverless Apache Spark-based platform.
Consider a project with a web site hosted in AWS. There is a requirement to analyze the clickstream data for the web site, and this needs to be done in real time. How will you complete this requirement?
Answer : We can use the Amazon Kinesis service to process the data sent by a Kinesis agent.
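A minimal boto3 consumer sketch, assuming the Kinesis agent writes clickstream events to a stream named clickstream (a placeholder) and that reading a single shard is enough for illustration.
import time
import boto3

kinesis = boto3.client("kinesis")

stream = "clickstream"  # placeholder stream name
shard_id = kinesis.describe_stream(StreamName=stream)["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName=stream, ShardId=shard_id, ShardIteratorType="LATEST"
)["ShardIterator"]

while True:
    batch = kinesis.get_records(ShardIterator=iterator, Limit=100)
    for record in batch["Records"]:
        print(record["Data"])  # process each click event in near real time
    iterator = batch["NextShardIterator"]
    time.sleep(1)  # small pause to stay under the per-shard read limits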
Which service can be used for storing log files generated from an EMR cluster?
Answer : Amazon S3
Which tool can be used to build real-time applications using streaming data?
Answer : Kafka
You have a requirement to migrate 4 TB of data to AWS. The time available for the migration is limited, and there is only a 100 Mbit line to the AWS Cloud. What is the best solution to use to migrate the data to the cloud?
Answer : AWS Import/Export. AWS Import/Export is a data transport service used to move large amounts of data into and out of the Amazon Web Services public cloud using portable storage devices for transport. The service also enables a user to perform an export job from Amazon S3, but not from Amazon EBS or Glacier. There are two versions of the service: AWS Import/Export Disk and AWS Snowball. AWS recommends using the service if there are 16 TB or less of data to import.
Which tool can be used to enable developers to quickly get started with deep learning in the cloud?
Answer : TensorFlow
Your team is planning on using the AWS IoT Rules service to allow IoT-enabled devices to write information to DynamoDB. What must be done to ensure that the rules work as intended?
Answer : Ensure that the right IAM permissions for Amazon DynamoDB are granted. An IAM user is an identity within your AWS account that has specific custom permissions (for example, permissions to create a table in DynamoDB).
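A minimal sketch of such a permission, assuming a hypothetical target table named IoTReadings; the account ID, region, and policy name are placeholders, and the resulting policy would be attached to the role that the IoT rule assumes.
import json
import boto3

iam = boto3.client("iam")

# Policy allowing the IoT rule's role to write items into the target DynamoDB table.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["dynamodb:PutItem"],
        "Resource": "arn:aws:dynamodb:us-east-1:123456789012:table/IoTReadings",
    }],
}
iam.create_policy(PolicyName="iot-dynamodb-write", PolicyDocument=json.dumps(policy_document))
# Attach the policy to the role referenced by the IoT rule, e.g. with iam.attach_role_policy().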
How will you import data from a Hive metastore into the AWS Glue Data Catalog?
Migration through Amazon S3:
Step 1: Run an ETL job that reads metadata from your Hive metastore and exports it (database, table, and partition objects) to an intermediate format in Amazon S3.
Step 2: Import that data from S3 into the AWS Glue Data Catalog through an AWS Glue ETL job.
Direct migration:
You can set up an AWS Glue ETL job which extracts metadata from your Hive metastore and loads it into your AWS Glue Data Catalog through an AWS Glue connection.
How can you migrate data from AWS Glue to a Hive metastore through direct migration?
We can run a job on the AWS Glue console which extracts metadata from specified databases in the AWS Glue Data Catalog and loads it into a Hive metastore.
As a prerequisite, it requires an AWS Glue connection to the Hive metastore as a JDBC source.
How can you migrate data from AWS Glue to a Hive metastore through Amazon S3?
We can use two AWS Glue jobs here.
The first job extracts metadata from databases in the AWS Glue Data Catalog and loads it into S3; this job is run on the AWS Glue console.
The second job loads the data from S3 into the Hive metastore; it can be run either on the AWS Glue console or on a cluster with Spark installed.
How can you migrate data from one AWS Glue Data Catalog to another?
We can use two AWS Glue jobs here.
The first extracts metadata from specified databases in an AWS Glue Data Catalog and loads it into S3.
The second loads the data from S3 into another AWS Glue Data Catalog.
What are time-based schedules for jobs and crawlers?
We can define a time-based schedule for crawlers and jobs in AWS Glue. When the specified time is reached, the schedule activates and the associated jobs or crawlers start. (A minimal scheduled-trigger sketch follows.)
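A minimal boto3 sketch of a time-based trigger; the trigger name, cron expression, and job name (reused from the earlier sketch) are placeholders.
import boto3

glue = boto3.client("glue")

# A scheduled trigger that starts the ETL job every day at 01:00 UTC.
glue.create_trigger(
    Name="nightly-etl-trigger",
    Type="SCHEDULED",
    Schedule="cron(0 1 * * ? *)",
    Actions=[{"JobName": "test-etl-job"}],
    StartOnCreation=True,
)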
ETL transformations using AWS Glue
Example of standard imports in AWS Glue
In AWS Glue, various PySpark and Scala methods and transforms specify the connection type using a connectionType parameter.
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
Create the Spark, Glue, and job contexts in AWS Glue
## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
# Let's create the Spark, Glue, and job contexts
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
Creating a dynamic frame by reading a table defined in the Data Catalog
test_glue_df = glueContext.create_dynamic_frame.from_catalog(database = "TestDB", table_name = "Test")
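A dynamic frame can also be read straight from a data store instead of the Data Catalog, which is where the connectionType options mentioned above come in. This sketch continues from the glueContext created earlier; the S3 path and format are placeholders.
# Read JSON files directly from S3 into a dynamic frame (no Data Catalog table needed).
raw_df = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-raw-data/clicks/"]},
    format="json",
)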
You have an application that uses DynamoDB to store JSON data (provisioned read and write capacity is enabled on the DynamoDB table). You have now deployed your application, and you are unsure of the amount of traffic that the application will receive at deployment time. How can you ensure that DynamoDB is not heavily throttled and does not become a bottleneck for the application?
Answer : DynamoDB's auto scaling feature will make sure that no read/write throttling happens due to heavy traffic. To configure auto scaling in DynamoDB, you set the minimum and maximum levels of read and write capacity in addition to the target utilization percentage. Auto scaling uses Amazon CloudWatch to monitor a table's read and write capacity metrics.
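A minimal boto3 sketch of enabling auto scaling for a table's read capacity through the Application Auto Scaling API; the table name, capacity bounds, and target utilization are placeholders, and write capacity would be configured the same way.
import boto3

autoscaling = boto3.client("application-autoscaling")

# Register the table's read capacity as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="dynamodb",
    ResourceId="table/MyAppTable",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    MinCapacity=5,
    MaxCapacity=500,
)
# Attach a target-tracking policy that keeps consumed capacity near 70% of provisioned capacity.
autoscaling.put_scaling_policy(
    PolicyName="read-scaling-policy",
    ServiceNamespace="dynamodb",
    ResourceId="table/MyAppTable",
    ScalableDimension="dynamodb:table:ReadCapacityUnits",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {"PredefinedMetricType": "DynamoDBReadCapacityUtilization"},
    },
)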
You developed a new application that handles heavy workloads on large-scale datasets stored in Amazon Redshift. The application needs to access the Amazon Redshift tables frequently. How will you access the tables?
Answer : We have to use roles that allow a web identity federated user to assume a role that allows access to the Redshift tables (by providing temporary credentials).
Which API command can be used to write data into a Kinesis stream for synchronous processing?
Answer : PutRecord
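A minimal boto3 sketch of a single synchronous write; the stream name and payload are placeholders.
import json
import boto3

kinesis = boto3.client("kinesis")

# PutRecord writes one record and returns synchronously with the shard and sequence number.
response = kinesis.put_record(
    StreamName="clickstream",
    Data=json.dumps({"user": "u123", "page": "/home"}).encode("utf-8"),
    PartitionKey="u123",
)
print(response["ShardId"], response["SequenceNumber"])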
Which service can be used for transformation of incoming source data in Amazon Kinesis Data Firehose?
Answer : AWS Lambda. Kinesis Data Firehose can invoke a Lambda function to transform incoming source data before delivering it to the destination.
How will you analyze a large set of data that updates from Kinesis and DynamoDB?
Answer : Elasticsearch. Elasticsearch allows you to store, search, and analyze huge volumes of data quickly and in near real time, giving back answers in milliseconds. It is able to achieve fast search responses because, instead of searching the text directly, it searches an index.
Your project requires long-term storage for backups and other data that you need to keep readily available but at lower cost. Which S3 storage option should you use or recommend?
Answer : Amazon S3 Standard – Infrequent Access (S3 Standard-IA). S3 Standard-IA is for data that is accessed less frequently but requires rapid access when needed. It offers the high durability, high throughput, and low latency of S3 Standard, with a low per-GB storage price and a per-GB retrieval fee.
You are currently managing an application that uses the Kinesis Client Library (KCL) to read a Kinesis stream. Suddenly you see a ProvisionedThroughputExceededException for the stream in CloudWatch. How will you rectify this error?
Answer : We can add retry logic to applications that use the KCL library.
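The idea behind that retry logic, shown as a hand-rolled boto3 sketch (the KCL has comparable backoff behavior built in); the shard iterator would come from a get_shard_iterator call like the one shown earlier.
import time
import boto3
from botocore.exceptions import ClientError

kinesis = boto3.client("kinesis")

def get_records_with_retry(shard_iterator, max_attempts=5):
    # Back off exponentially and retry when the stream's provisioned throughput is exceeded.
    for attempt in range(max_attempts):
        try:
            return kinesis.get_records(ShardIterator=shard_iterator, Limit=100)
        except ClientError as error:
            if error.response["Error"]["Code"] != "ProvisionedThroughputExceededException":
                raise
            time.sleep(2 ** attempt)
    raise RuntimeError("Stream still throttled after retries")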
If a table is used frequently in join-heavy queries, which distribution style would you utilize for the table in Redshift?
Answer : KEY. A distribution key is a column that is used to determine the database partition in which a particular row of data is stored. A distribution key is defined on a table using the CREATE TABLE statement, typically on the column used in joins. (A minimal DDL sketch follows.)
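A minimal DDL sketch of a KEY-distributed table; the table and column names are placeholders, and the statement could be run through the Redshift Data API call shown earlier.
# Rows with the same customer_id land on the same slice, so joins on customer_id avoid redistribution.
create_orders = """
CREATE TABLE orders (
    order_id    BIGINT,
    customer_id BIGINT,
    order_date  DATE
)
DISTSTYLE KEY
DISTKEY (customer_id)
SORTKEY (order_date);
"""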
Which method can be used to disable automated snapshots in Redshift?
Answer : Set the automated snapshot retention period to 0
What is the default retention period for a Kinesis stream?
Answer : 1 day (24 hours)
Which service can be used as a business analytics service to build visualizations?
Answer : Amazon QuickSight
What is the default concurrency level (the number of queries that can run at the same time) per queue in Redshift?
Answer : 5
How can Hadoop be replaced by AWS Glue?
- AWS Glue is simple to use, and whether it is expensive depends on what we compare it with.
- Because of on-demand pricing, we only pay for what we use.
- AWS Glue may be significantly cheaper than a fixed-size on-premises Hadoop cluster.
- So we can replace Hadoop with the AWS Glue service.
Why can AWS Lambda not be used for big data?
Lambdas are simple, scalable, and cost-efficient. They can also be triggered by events.
For big data, Lambda functions are not suitable because of the 3 GB memory limitation and the 15-minute timeout.
That is why AWS Glue is used to process large datasets.
What makes AWS Glue serverless?
- In Glue, AWS provisions and allocates the resources automatically.
- In Glue, the processing power is adjusted by the number of data processing units (DPUs).
- With AWS Glue, your bill is calculated according to the following equation (a worked example follows this list):
- [ETL job price] = [Processing time] * [Number of DPUs] * [Price per DPU-hour]
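A worked example of the formula above; the $0.44 per DPU-hour rate is an assumption for illustration and varies by region, so check the current AWS Glue pricing page.
dpu_hour_rate = 0.44        # USD per DPU-hour (assumed rate)
num_dpus = 10
processing_hours = 0.5      # a 30-minute job run

etl_job_price = processing_hours * num_dpus * dpu_hour_rate
print(f"Estimated job cost: ${etl_job_price:.2f}")  # -> $2.20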
Print the schema of the dynamic frame
test_glue_df.printSchema()
How can resolveChoice be used to resolve ambiguities?
Data can contain ambiguities: some columns can contain both int and string values. If you used a crawler to create the table, the crawler only evaluates a small percentage of the data to determine the data type of a column, and it could classify a string column as an int. resolveChoice can be used to resolve such ambiguities.
test_glue_res = test_glue_df.resolveChoice(specs = [('test_glue','cast:long')])
test_glue_res.printSchema()
How will you run a SQL query against a dataframe?
# First create a temporary view that will be queried
test_glue_df.toDF().createOrReplaceTempView("test_glue_view")
# Execute your query against the temporary view
sqlDF = spark.sql("select * from test_glue_view")
sqlDF.show()
What are the examples of Columnar databases?
Answer : Amazon Redshift and Apache HBase
Which tool can be used for transferring data between Amazon S3, Hadoop, HDFS, and RDBMS databases?
Answer : Sqoop
Which command can be used to see the impact of a query on a Redshift table?
Answer : EXPLAIN. The EXPLAIN command displays the execution plan for a query without running it.
We are writing data to a Kinesis stream and the default stream settings are used for the Kinesis stream. Every fourth day you send the data from the stream to S3. When you analyze the data in S3, you see that only the fourth day's data is present. What is the reason for this?
Answer : Data records are only accessible for a default of 24 hours from the time they are added to a stream, and since the default stream settings are used here, only the most recent day's data is still available in the stream.
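If all four days of data need to remain in the stream, the retention period can be raised above the default; a minimal boto3 sketch (the stream name is a placeholder).
import boto3

kinesis = boto3.client("kinesis")

# Raise the stream's retention from the default 24 hours to 96 hours (4 days).
kinesis.increase_stream_retention_period(
    StreamName="clickstream",
    RetentionPeriodHours=96,
)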