AWS Certified Big Data Specialty Set3

1. You have just joined a new company as an AWS Big Data Architect, replacing an architect who left to join a different company. As a data-driven company, your company has started using several of AWS' Big Data services in the last 6 months. Your new manager is concerned that the AWS charges are too high, and she has asked you to review the monthly bills. After review, you determine that the EMR costs are unnecessarily high, considering the company uses EMR to process new data within a 6 hour period that starts at midnight and ends between 5 AM and 7 AM, depending on the amount of data that needs to be processed. The data that needs to be processed is already in S3. However, it appears that the EMR cluster that processes the data is running 24 hours a day, 7 days a week. What type of cluster should your predecessor have configured in order to keep costs low and not unnecessarily waste resources?
   Nothing. AWS announces frequent price reductions, and costs will balance out over time.
   Your predecessor should have configured the cluster as a transient cluster.
   He should have used auto-scaling to reduce the number of core nodes and task nodes running when no processing is taking place.

2. True or False: EBS volumes used with EMR persist after the cluster is terminated.
   False
   True

3. Which of the following does Spark Streaming use to consume data from a Kinesis Stream?
   Kinesis Client Library
   Kinesis Connector Library
   Kinesis Producer Library

4. True or False: Presto is a database.
   False
   True

5. Which of the following are the 4 modules (libraries) of Spark? (Choose 4)
   GraphX
   MLlib
   Spark Streaming
   YARN
   SparkSQL
   Apache Mesos

6. Which open-source Web interface provides you with an easy way to run scripts, manage the Hive metastore, and view HDFS?
   YARN Resource Manager
   Apache Zeppelin
   Hue

7. Your EMR cluster requires high I/O performance at a low cost. In terms of storage, which of the following is your best option?
   EBS volumes with PIOPS
   Instance store volumes
   EMRFS with consistent view

8. You have just joined a company that has a petabyte of data stored in multiple data sources. The data sources include Hive, Cassandra, Redis, and MongoDB. The company has hundreds of employees all querying the data at a high concurrency rate. These queries take anywhere from under a second to several minutes to run. The queries are processed in memory, avoiding high I/O and latency. A lot of your new colleagues are also happy they did not have to learn a new language when querying the multiple data sources. Which open-source tool do you think your new colleagues are using?
   Big Data Query Engine
   Presto
   Hive

9. You plan to use EMR to process a large amount of data that will eventually be stored in S3. The data is currently on-premises, and will be migrated to AWS using the Snowball service. The file sizes range from 300 MB to 500 MB. Over the next 6 months, your company will migrate over 2 PB of data to S3, and costs are a concern. Which compression algorithm provides you with the highest compression ratio, allowing you to both maximize performance and minimize costs?
   Snappy
   bzip2
   LZO

10. When should you not use Spark? (Choose 2)
   For batch processing
   For interactive analytics
   For ETL workloads
   In multi-user environments with high concurrency

11. How are EMR task nodes different from core nodes? (Choose 3)
   Task nodes run the NodeManager daemon.
   They are used for extra capacity when additional CPU and RAM are needed.
   Task nodes are optional.
   Task nodes do not include HDFS.
   Task nodes run the Resource Manager.
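
The scenario in question 1 can be put into rough numbers. The sketch below compares the weekly hours of a cluster that runs around the clock with one that runs only during the nightly processing window (midnight to between 5 AM and 7 AM, as stated in the question). It is an illustration under simplifying assumptions, not a billing calculation: the function names are hypothetical, and cluster start-up time, billing granularity, and instance pricing are all ignored.

```python
# Rough cluster-hours comparison for the scenario in question 1.
# Assumption: the cluster is billed only while it is running; start-up
# time, billing granularity, and instance pricing are deliberately ignored.

HOURS_PER_DAY = 24
DAYS_PER_WEEK = 7


def weekly_cluster_hours(hours_per_day: float) -> float:
    """Hours per week a cluster runs if it is up `hours_per_day` every day."""
    return hours_per_day * DAYS_PER_WEEK


def idle_fraction(busy_hours_per_day: float) -> float:
    """Share of an always-on cluster's time during which no job is running."""
    return 1 - busy_hours_per_day / HOURS_PER_DAY


always_on = weekly_cluster_hours(HOURS_PER_DAY)  # cluster never shut down
shortest_window = weekly_cluster_hours(5)        # jobs finish at 5 AM
longest_window = weekly_cluster_hours(7)         # jobs finish at 7 AM

print(f"Always-on cluster:  {always_on} hours/week")
print(f"Nightly window:     {shortest_window} to {longest_window} hours/week")
print(f"Idle share (7 h/day of work): {idle_fraction(7):.0%}")
```

Even with the longest processing window, the always-on cluster in this scenario sits idle for most of each day.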