AWS Certified Big Data – Specialty Q & A

1) Your team has successfully migrated the corporate data warehouse to Redshift. So far, all the data coming into the ETL pipeline for the data warehouse has been from other corporate systems also running on AWS. However, after signing some new business deals with a 3rd party, they will be securely sending files directly to S3. The data in these files needs to be ingested into Redshift. Members of your team are debating the most efficient and best automated way to introduce this change into the ETL pipeline. Which of the following options would you suggest? (Choose 2)
Use Lambda (AWS Redshift Database Loader).
Procure a new 3rd party tool that integrates with S3 and Redshift that provides powerful scheduling capabilities.
Use Data Pipeline.
Run a cron job on a t2.micro instance that will execute Linux shell scripts.

Correct Answer: Use Lambda (AWS Redshift Database Loader). Use Data Pipeline.

2) Your company is launching an IoT device that will send data to AWS. All the data generated by the millions of devices your company is going to sell will be stored in DynamoDB for use by the Engineering team. Each customer’s data, however, will only be stored in DynamoDB for 30 days. A mobile application will be used to control the IoT device, and easy user sign-up and sign-in to the mobile application are requirements. The engineering team is designing the application to scale to millions of users. Their preference is to not have to worry about building, securing, and scaling authentication for the mobile application. They also want to use their own identity provider. Which option would be the best choice for their mobile application?
Use an Amazon Cognito identity pool.
Since everyone uses Facebook, Amazon, and Google, keep it simple and use all three.
Use a SAML identity provider.

Correct Answer: Use an Amazon Cognito identity pool.

3) If Kinesis Firehose experiences data delivery issues to S3, it will retry delivery to S3 for a period of __.
7 hours
7 days
24 hours

Correct Answer: 24 hours

4) Which of the following AWS IoT components transforms messages and routes them to different AWS services?
Device Gateway
Rules Engine
Device Shadow

Correct Answer: Rules Engine

5) True or False: Data Pipeline does not integrate with on-premises servers.
True
False

Correct Answer: False

6) Which service does Kinesis Firehose not load streaming data into?
DynamoDB
Redshift
Elasticsearch

Correct Answer: DynamoDB

7) For which of the following AWS services can you not create a rule action in AWS IoT? (Choose 2)
Kinesis Firehose
Redshift
CloudWatch
Aurora

Correct Answer: Redshift. Aurora.

8) For an unknown reason, data delivery from Kinesis Firehose to your Redshift cluster has failed. Kinesis Firehose retries the data delivery every 5 minutes for a maximum period of 60 minutes; however, none of the retries deliver the data to Redshift. Kinesis Firehose skips the files and moves on to the next batch of files in S3. How can you ensure that the undelivered data is eventually loaded into Redshift?
Check the STL_LOAD_ERRORS table in Redshift, find the files that failed to load, and manually load the data in those files using the COPY command.
You create a Lambda function to automatically load these files into Redshift by reading the manifest after the retries have been completed and the COPY command has been run.
Skipped files are delivered to your S3 bucket as a manifest file in an errors folder. Run the COPY command manually to load the skipped files after you have determined why they failed to load.

Correct Answer: Skipped files are delivered to your S3 bucket as a manifest file in an errors folder. Run the COPY command manually to load the skipped files after you have determined why they failed to load.

9) Regarding SQS, which of the following are true? (Choose 3)
A queue can only be created in limited regions, and you should check the SQS website to see which are supported.
Messages can be sent and read simultaneously.
A queue can be created in any region.
Messages can be retained in queues for up to 7 days.
Messages can be retained in queues for up to 14 days.

Correct Answer: Messages can be sent and read simultaneously. A queue can be created in any region. Messages can be retained in queues for up to 14 days.

10) What are the main uses of Kinesis Data Streams? (Choose 2)
They can undertake the loading of streamed data directly into data stores
They can provide long term storage of data
They can carry out real-time reporting and analysis of streamed data
They can accept data as soon as it has been produced, without the need for batching

Correct Answer: They can carry out real-time reporting and analysis of streamed data. They can accept data as soon as it has been produced, without the need for batching.

11) True or False: When data delivery from your Kinesis Firehose delivery stream to the destination is falling behind, you need to manually change the buffer size to catch up and ensure that the data is delivered to the destination.
True
False

Correct Answer: False

12) True or False: With both local secondary indexes and global secondary indexes, you can define read capacity units and write capacity units on the index itself — so that you don’t have to consume them from the base table.
False
True

Correct Answer: False

13) Which of the following must be defined when you create a DynamoDB table? (Choose 4)
The table capacity, number of GB.
The RCU (Read Capacity Units)
The WCU (Write Capacity Units)
The DCU (Delete/Update Capacity Units)
Partition Key
The Table Name

Correct Answer: The RCU (Read Capacity Units). The WCU (Write Capacity Units). Partition Key. The Table Name.

14) Which of the following attribute data types can be table or item keys? (Choose 3)
Number
Blob
Map
Binary
String

Correct Answer: Number. Binary. String.

15) Which of the following statements is true?
A shard supports up to 1000 transactions per second for reads, and 5 transactions per second for writes.
A shard supports up to 5 transactions per second for reads, and 1000 records per second for writes.
A shard supports up to 5 transactions per second for reads, and 10 records per second for writes.

Correct Answer: A shard supports up to 5 transactions per second for reads, and 1000 records per second for writes.

16) In terms of data write-rate for data input, what is the capacity of a shard in a Kinesis stream?
4 MB/s
1 MB/s
2 MB/s

Correct Answer: 1 MB/s

17) In terms of data read-rate for data output, what is the capacity of a shard in a Kinesis stream?
1 MB/s
2 MB/s
4 MB/s

Correct Answer: 2 MB/s
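
Note: the per-shard limits in questions 15 to 17 (1,000 records/s and 1 MB/s for writes, 2 MB/s for reads) can be combined into a rough shard-count estimate. The following is a minimal sketch under those stated limits; the function name and the sample workload are hypothetical, and real sizing should also account for traffic spikes and uneven partition keys.

```python
import math

# Per-shard limits as stated in questions 15 to 17.
WRITE_MB_PER_SEC = 1.0
WRITE_RECORDS_PER_SEC = 1000
READ_MB_PER_SEC = 2.0


def shards_needed(write_mb_per_sec, records_per_sec, read_mb_per_sec):
    """Return the smallest shard count that satisfies all three limits."""
    by_write_bytes = math.ceil(write_mb_per_sec / WRITE_MB_PER_SEC)
    by_write_records = math.ceil(records_per_sec / WRITE_RECORDS_PER_SEC)
    by_read_bytes = math.ceil(read_mb_per_sec / READ_MB_PER_SEC)
    # A stream always has at least one shard.
    return max(1, by_write_bytes, by_write_records, by_read_bytes)


# Hypothetical workload: 5 MB/s in, 3,500 records/s in, 12 MB/s out.
print(shards_needed(5, 3500, 12))  # 6 (the read rate is the bottleneck)
```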

18) A producer application has been designed to write thousands of events per second to Kinesis Streams by integrating the Kinesis Producer Library (KPL) into the application. The application takes data from logs on EC2 instances and ingests the data into Streams records. Which of the following solutions did the developer use to improve throughput when implementing the KPL with the application?
Aggregation
Re-Aggregation
De-Aggregation

Correct Answer: Aggregation
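
Note: aggregation means packing many small user records into fewer, larger stream records so that the per-shard records-per-second limit is not the bottleneck. The sketch below only illustrates the idea with a greedy packer; it is not the KPL's actual on-the-wire format, and `aggregate` and `max_bytes` are hypothetical names chosen for this example.

```python
def aggregate(user_records, max_bytes=1024 * 1024):
    """Greedily pack small byte strings into larger batches.

    max_bytes is an assumed size cap per stream record; adjust it to the
    limit that applies to your stream.
    """
    batches, current, current_size = [], [], 0
    for record in user_records:
        if len(record) > max_bytes:
            raise ValueError("a single record exceeds max_bytes")
        # Start a new batch when adding this record would exceed the cap.
        if current and current_size + len(record) > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(record)
        current_size += len(record)
    if current:
        batches.append(current)
    return batches


# 10,000 log lines of 100 bytes each, packed under a 25,000-byte cap.
events = [b"x" * 100 for _ in range(10_000)]
packed = aggregate(events, max_bytes=25_000)
print(len(events), "user records ->", len(packed), "stream records")
```

A consumer has to reverse this step (de-aggregation) before it can process the individual user records.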

19) True or False: You can add a local secondary index to a DynamoDB table after it has been created.
False
True

Correct Answer: False

20) What are the maximum limits of a single DynamoDB partition?
4,000 WCU, 1,000 RCU, 10 GB data volume
No maximums
1,000 WCU, 3,000 RCU, 10 GB data volume

Correct Answer: 1,000 WCU, 3,000 RCU, 10 GB data volume
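
Note: the per-partition maximums above are often turned into a rule-of-thumb estimate of how many partitions a table needs. The sketch below assumes that rule of thumb (the larger of a throughput-based and a size-based count); DynamoDB manages partitions internally, so treat the result as an approximation. The function name and sample numbers are hypothetical.

```python
import math

# Per-partition maximums from the answer above.
MAX_RCU_PER_PARTITION = 3000
MAX_WCU_PER_PARTITION = 1000
MAX_GB_PER_PARTITION = 10


def estimate_partitions(rcu, wcu, size_gb):
    """Estimate the partition count from provisioned throughput and table size."""
    by_throughput = math.ceil(
        rcu / MAX_RCU_PER_PARTITION + wcu / MAX_WCU_PER_PARTITION
    )
    by_size = math.ceil(size_gb / MAX_GB_PER_PARTITION)
    return max(1, by_throughput, by_size)


# Throughput-bound table: 6,000 RCU, 2,000 WCU, 25 GB.
print(estimate_partitions(rcu=6000, wcu=2000, size_gb=25))  # 4
# Size-bound table: 1,000 RCU, 500 WCU, 45 GB.
print(estimate_partitions(rcu=1000, wcu=500, size_gb=45))   # 5
```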

21) Your company has a number of consumer applications to get records from various Kinesis Streams for different use cases. For each consumer application there is a separate DynamoDB table that maintains application state. Out of the many consumer applications, one application is experiencing provisioned throughput exception errors with its particular DynamoDB table. Why is this happening? (Choose 2)
The application is checkpointing too frequently.
The application is not checkpointing enough.
The stream has too many shards.
The stream does not have enough shards.

Correct Answer: The application is checkpointing too frequently. The stream has too many shards.
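
Note: both answers come down to write pressure on the state table. As a rough sketch, assume each checkpoint costs one write per shard and that each write consumes one WCU (both are simplifying assumptions, and the function names are hypothetical). The write rate then grows with the shard count and with the checkpoint frequency.

```python
def checkpoint_writes_per_sec(shard_count, checkpoint_interval_sec):
    """Approximate writes per second to the state table.

    Assumes one write per shard per checkpoint.
    """
    return shard_count / checkpoint_interval_sec


def exceeds_capacity(shard_count, checkpoint_interval_sec, provisioned_wcu):
    """Return True when the estimated write rate is above the provisioned WCU."""
    rate = checkpoint_writes_per_sec(shard_count, checkpoint_interval_sec)
    return rate > provisioned_wcu


# 200 shards checkpointing every second against a 10 WCU table.
print(exceeds_capacity(200, 1, 10))   # True
# Same stream, checkpointing once a minute instead.
print(exceeds_capacity(200, 60, 10))  # False
# Fewer shards, still checkpointing every second.
print(exceeds_capacity(8, 1, 10))     # False
```

Under these assumptions, the fixes are to checkpoint less often, reduce the number of shards, or raise the write capacity of the table.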

22) The Kinesis Connector Library allows you to emit data from a stream to various AWS services. Which of the following services can receive data emitted from such a stream? (Choose 4)
RDS
S3
Redshift
Elasticsearch
Lambda
DynamoDB

Correct Answer: S3. Redshift. Elasticsearch. DynamoDB.

23) You have an application based on the Amazon Kinesis Streams API, and you are not using the Kinesis Producer Library (KPL) as part of your application. While you won’t be taking advantage of all the benefits of the KPL in your application, you still need to ensure that you add data to a stream efficiently. Which API operation allows you to do this?
PutItems
PutItem
PutRecords

Correct Answer: PutRecords
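
Note: PutRecords is efficient because it sends many records in one request. The sketch below shows only the batching logic, assuming a cap of 500 records per call; the `put_records` argument is a hypothetical stand-in for whatever client call your application uses, so that the example runs without AWS credentials.

```python
def batch_records(records, max_per_call=500):
    """Yield lists of at most max_per_call records."""
    for start in range(0, len(records), max_per_call):
        yield records[start:start + max_per_call]


def send_all(records, put_records):
    """Send records in batches through the supplied callable.

    Returns the number of calls made.
    """
    calls = 0
    for batch in batch_records(records):
        put_records(batch)
        calls += 1
    return calls


# Stand-in sender that just remembers each batch it was given.
sent = []
records = [
    {"Data": f"event-{i}".encode(), "PartitionKey": str(i)}
    for i in range(1200)
]
calls = send_all(records, sent.append)
print(calls, [len(batch) for batch in sent])  # 3 [500, 500, 200]
```

A real sender would also inspect each response for partially failed records and retry them, which this sketch leaves out.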

24) Which operation/feature or service would you use to locate all items in a table with a particular sort key value? (Choose 2)
GetItem
Scan against a table, with filters
Query
Query with a local secondary index
Query with a global secondary index

Correct Answer: Scan against a table, with filters. Query with a global secondary index.

25) You have just joined a new company as an AWS Big Data Architect, replacing an architect who left to join a different company. As a data driven company, your company has started using several of AWS’ Big Data services in the last 6 months. Your new manager is concerned that the AWS charges are too high, and she has asked you to review the monthly bills. After review, you determine that the EMR costs are unnecessarily high considering the company uses EMR to process new data within a 6 hour period that starts at midnight and ends between 5 AM and 7 AM, depending on the amount of data that needs to be processed. The data that needs to be processed is already in S3. However, it appears that the EMR cluster that processes the data is running 24 hours a day, 7 days a week. What type of cluster should your predecessor have configured in order to keep costs low and not unnecessarily waste resources?
Nothing. AWS announces frequent price reductions, and costs will balance out over time.
Your predecessor should have configured the cluster as a transient cluster.
He should have used auto-scaling to reduce the number of core nodes and task nodes running when no processing is taking place.

Correct Answer: Your predecessor should have configured the cluster as a transient cluster.
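
Note: the saving from a transient cluster is simple arithmetic. The sketch below compares a cluster that runs 24 hours a day with one that runs only for the nightly processing window (7 hours in the worst case described above). The node count and hourly rate are made-up illustration values, not real prices.

```python
def monthly_cluster_cost(nodes, hourly_rate, hours_per_day, days=30):
    """Cost of running `nodes` instances for `hours_per_day` over `days` days."""
    return nodes * hourly_rate * hours_per_day * days


# Hypothetical cluster: 10 nodes at 0.50 per node-hour.
always_on = monthly_cluster_cost(nodes=10, hourly_rate=0.50, hours_per_day=24)
transient = monthly_cluster_cost(nodes=10, hourly_rate=0.50, hours_per_day=7)

print(always_on)  # 3600.0
print(transient)  # 1050.0
print(f"saving: {1 - transient / always_on:.0%}")
```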

26) True or False: EBS volumes used with EMR persist after the cluster is terminated.
False
True

Correct Answer: False

27) Which of the following does Spark Streaming use to consume data from a Kinesis Stream?
Kinesis Client Library
Kinesis Connector Library
Kinesis Producer Library

Correct Answer: Kinesis Client Library

28) True or False: Presto is a database.
False
True

Correct Answer: False

29) Which of the following are the 4 modules (libraries) of Spark? (Choose 4)
GraphX
MLlib
Spark Streaming
YARN
SparkSQL
Apache Mesos

Correct Answer: GraphX. MLlib. Spark Streaming. SparkSQL.

30) Which open-source web interface provides you with an easy way to run scripts, manage the Hive metastore, and view HDFS?
YARN Resource Manager
Apache Zeppelin
Hue

Correct Answer: Hue

31) Your EMR cluster requires high I/O performance at a low cost. In terms of storage, which of the following is your best option?
EBS volumes with PIOPS
Instance store volumes
EMRFS with consistent view

Correct Answer: Instance store volumes

32) You have just joined a company that has a petabyte of data stored in multiple data sources. The data sources include Hive, Cassandra, Redis, and MongoDB. The company has hundreds of employees all querying the data at a high concurrency rate. These queries take between a sub-second and several minutes to run. The queries are processed in-memory, and avoid high I/O and latency. A lot of your new colleagues are also happy they did not have to learn a new language when querying the multiple data sources. Which open-source tool do you think your new colleagues are using?
Big Data Query Engine
Presto
Hive

Correct Answer: Presto

33) You plan to use EMR to process a large amount of data that will eventually be stored in S3. The data is currently on-premises, and will be migrated to AWS using the Snowball service. The file sizes range from 300 MB to 500 MB. Over the next 6 months, your company will migrate over 2 PB of data to S3, and costs are a concern. Which compression algorithm provides you with the highest compression ratio, allowing you to both maximize performance and minimize costs?
Snappy
bzip2
LZO

Correct Answer: bzip2

34) When should you not use Spark? (Choose 2)
For batch processing
For interactive analytics
For ETL workloads
In multi-user environments with high concurrency

Correct Answer: For batch processing. In multi-user environments with high concurrency.

35) How are EMR task nodes different from core nodes? (Choose 3)
Task nodes run the NodeManager daemon.
They are used for extra capacity when additional CPU and RAM are needed.
Task nodes are optional.
Task nodes do not include HDFS.
Task nodes run the Resource Manager.

Correct Answer: They are used for extra capacity when additional CPU and RAM are needed. Task nodes are optional. Task nodes do not include HDFS.

36) What is a fast way to load data into Redshift?
By restoring backup data files into Redshift.
By using the COPY command.
By using multi-line INSERTS.

Correct Answer: By using the COPY command.

37) An Area Under Curve (AUC) is shown to be 0.5. What does this signify? (Choose 2)
The model is no more accurate than flipping a coin.
The AUC provides no value.
Lower AUC numbers would increase confidence.
There is little confidence beyond a guess.

Correct Answer: The model is no more accurate than flipping a coin. There is little confidence beyond a guess.
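
Note: one way to see why 0.5 means "no better than a coin flip" is to compute AUC as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below uses that pairwise definition on tiny made-up data; the `auc` helper is written for this example and is not taken from any library.

```python
def auc(labels, scores):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0]
print(auc(labels, [0.9, 0.8, 0.2, 0.1]))  # 1.0 (perfect ranking)
print(auc(labels, [0.5, 0.5, 0.5, 0.5]))  # 0.5 (scores carry no information)
print(auc(labels, [0.1, 0.2, 0.8, 0.9]))  # 0.0 (perfectly inverted ranking)
```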

38) You are trying to predict a numeric value from inventory/retail data that your company has. Which machine learning model would you use to do this?
Numeric Prediction Model
Regression Model
Multiclass Classification Model

Correct Answer: Regression Model

39) What is the most effective way to merge data into an existing table?
UNLOAD data from Redshift into S3, use EMR to ‘merge’ new data files with the unloaded data files, and copy the data into Redshift.
Use a staging table to replace existing rows or update specific rows.
Connect the source table and the target Redshift table via a replication tool and run direct INSERTS, UPDATES into the target Redshift table.

Correct Answer: Use a staging table to replace existing rows or update specific rows.
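
Note: the staging-table pattern is "load new rows into a staging table, delete the matching rows from the target, then insert everything from staging" inside one transaction. The sketch below demonstrates that sequence with Python's built-in sqlite3 module as a stand-in, so it runs anywhere; it is not Redshift SQL, and the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE target (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE stage  (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO target VALUES (1, 'old-a'), (2, 'old-b');
    INSERT INTO stage  VALUES (2, 'new-b'), (3, 'new-c');
""")

# Run the merge as a single transaction: either all steps apply or none do.
with conn:
    # 1. Remove target rows that the staging table replaces.
    conn.execute("DELETE FROM target WHERE id IN (SELECT id FROM stage)")
    # 2. Insert both the replacement rows and the brand-new rows.
    conn.execute("INSERT INTO target SELECT id, name FROM stage")
    # 3. Empty the staging table for the next load.
    conn.execute("DELETE FROM stage")

rows = conn.execute("SELECT id, name FROM target ORDER BY id").fetchall()
print(rows)  # [(1, 'old-a'), (2, 'new-b'), (3, 'new-c')]
```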

40) True or False: Redshift is recommended for transactional processing.
True
False

Correct Answer: False

41) True or False: Defining primary keys and foreign keys is an important part of Redshift design because it helps maintain data integrity.
True
False

Correct Answer: False

42) How many concurrent queries can you run on a Redshift cluster?
500
150
50

Correct Answer: 50

43) Which two types of machine learning are routinely encountered? (Choose 2)
Hypervised Learning
Unsupervised Learning
Supervised Learning
Transcoded Learning

Correct Answer: Unsupervised Learning. Supervised Learning.

44) Your analytics team runs large, long-running queries in an automated fashion throughout the day. The results of these large queries are then used to make business decisions. However, the analytics team also runs small queries manually on an ad-hoc basis. How can you ensure that the large queries do not take up all the resources, preventing the smaller ad-hoc queries from running?
Do nothing, because Redshift handles this automatically.
Create a query user group for small queries based on the analysts’ Redshift user IDs, and create a second query group for the large, long-running queries.
Set up node affinity and assign large queries and small queries to run on specific nodes.

Correct Answer: Create a query user group for small queries based on the analysts’ Redshift user IDs, and create a second query group for the large, long-running queries.

45) What does the F1 score represent?
The quality of the model
S3 record import success
The accuracy of the input data

Correct Answer: The quality of the model
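
Note: the F1 score is the harmonic mean of precision and recall, so it summarizes model quality in a single number between 0 and 1. A minimal sketch, using made-up confusion-matrix counts:

```python
def f1_score(tp, fp, fn):
    """F1 from true positives, false positives, and false negatives."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# 80 correct positive predictions, 20 false alarms, 20 misses.
print(round(f1_score(tp=80, fp=20, fn=20), 3))  # 0.8
# A model that never predicts a true positive scores 0.
print(f1_score(tp=0, fp=5, fn=5))  # 0.0
```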

46) Which of the following AWS services directly integrate with Redshift using the COPY command? (Choose 3)
DynamoDB
Machine Learning
EMR/EC2 instances
S3
Kinesis Streams

Correct Answer: DynamoDB. EMR/EC2 instances. S3.

47) You are trying to predict whether a customer will buy your product. Which machine learning model would help you make this prediction?
Numeric Prediction Model
Multiclass Classification Model
Binary Classification Model

Correct Answer: Binary Classification Model

48) In your current data warehouse, BI analysts consistently join two tables: the customer table and the orders table. The column they JOIN on (and common to both tables) is called customer_id. Both tables are very large, over 1 billion rows. Besides being in charge of migrating the data, you are also responsible for designing the tables in Redshift. Which distribution style would you choose to achieve the best performance when the BI analysts run queries that JOIN the customer table and orders table using customer_id?
EVEN
DEFAULT
KEY

Correct Answer: KEY
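
Note: KEY distribution helps here because rows with the same customer_id land on the same slice, so the join can run without moving rows between nodes. The toy model below illustrates this, assuming 4 slices and a simple modulo in place of Redshift's internal hash function. It counts order rows that are not on the same slice as their matching customer row, with customer rows assumed to be placed by customer_id. All names and numbers are hypothetical.

```python
NUM_SLICES = 4


def key_slice(customer_id):
    """KEY style: the slice depends only on the join column value."""
    return customer_id % NUM_SLICES


def even_slice(row_number):
    """EVEN style: rows are spread round-robin regardless of their values."""
    return row_number % NUM_SLICES


# 1,000 orders spread over 100 customers.
orders = [(order_id, (order_id * 7) % 100) for order_id in range(1000)]


def rows_needing_redistribution(order_slice_fn):
    """Count orders stored on a different slice than their customer row."""
    moved = 0
    for row_number, (_order_id, customer_id) in enumerate(orders):
        if order_slice_fn(row_number, customer_id) != key_slice(customer_id):
            moved += 1
    return moved


key_moves = rows_needing_redistribution(lambda row, cid: key_slice(cid))
even_moves = rows_needing_redistribution(lambda row, cid: even_slice(row))
print("KEY :", key_moves, "rows to move")
print("EVEN:", even_moves, "rows to move")
```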

49) Which of the following is not a function of the Redshift manifest?
To load required files only
To load files that have a different prefix.
To automatically check files in S3 for data issues.

Correct Answer: To automatically check files in S3 for data issues.
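
Note: a manifest is a small JSON file that lists exactly which S3 objects a COPY should load, which is why it can pick out required files only and can mix files with different prefixes. The sketch below builds such a file with Python's json module; the bucket names and paths are hypothetical.

```python
import json


def build_manifest(urls, mandatory=True):
    """Return manifest JSON listing each S3 URL as one entry."""
    entries = [{"url": url, "mandatory": mandatory} for url in urls]
    return json.dumps({"entries": entries}, indent=2)


# Two files with different prefixes (and even different buckets).
manifest = build_manifest([
    "s3://example-bucket-a/sales/2017-01.csv",
    "s3://example-bucket-b/archive/jan-sales.csv",
])
print(manifest)
```

The manifest only names files; it does not validate their contents, which matches the answer above.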

50) Which of the following are characteristics of Supervised Learning? (Choose 2)
Labeled data
Known desired output
Data lacks categorization
Small amount of data is required to process

Correct Answer: Labeled data. Known desired output.