AWS Certified Machine Learning – Specialty Set 1
Author: CloudVikas
Published Date: 18 March 2020

Welcome to AWS Certified Machine Learning - Specialty Set 1.

1. You are an ML specialist within a large organization that helps job seekers find both technical and non-technical jobs. You've collected data from an engineering company's data warehouse to determine which skills qualify job seekers for different positions. After reviewing the data, you realise the data is biased. Why?
- The data collected has missing values for different skills for job seekers.
- The data collected only has a few attributes. Attributes like skills and job title are not included in the data.
- The data collected needs to be from the general population of job seekers, not just from a technical engineering company.

2. You are an ML specialist working with data that is stored on a distributed EMR cluster on AWS. Currently, your machine learning applications are compatible with the Apache Hive Metastore tables on EMR. You have been tasked with configuring Hive to use the AWS Glue Data Catalog as its metastore. Before you can do this, you need to transfer the Apache Hive Metastore tables into an AWS Glue Data Catalog. What steps do you need to take to achieve this with the LEAST amount of effort?
- Create a Data Pipeline job that reads from your Apache Hive Metastore, exports the data to an intermediate format in Amazon S3, and then imports that data into the AWS Glue Data Catalog.
- Set up your Apache Hive application with JDBC driver connections, then create a crawler that crawls the Apache Hive Metastore using the JDBC connection and creates an AWS Glue Data Catalog.
- Run a Hive script on EMR that reads from your Apache Hive Metastore, exports the data to an intermediate format in Amazon S3, and then imports that data into the AWS Glue Data Catalog.

3. You are an ML specialist setting up an ML pipeline. Your dataset is massive and needs to be managed on a distributed system so that processing and analytics can run efficiently. You also plan to use tools like Apache Spark to process the data and get it ready for your ML pipeline. Which setup and services can most easily help you achieve this?
- Multi-AZ RDS read replicas with Apache Spark installed.
- A self-managed cluster of EC2 instances with Apache Spark installed.
- Elastic MapReduce (EMR) with Apache Spark installed.

4. Your organization has given you several different sets of key-value pair JSON files that need to be used for a machine learning project within AWS. What type of data is this classified as, and where is the best place to load this data?
- Unstructured data, stored in S3.
- Semi-structured data, stored in S3.
- Semi-structured data, stored in DynamoDB.

5. The AWS Glue Data Catalog contains references to data that is used as sources and targets of your extract, transform, and load (ETL) jobs in AWS Glue. Suppose you are setting up a crawler within AWS Glue that crawls your input data in S3. After the crawler finishes executing, it cannot determine the schema of your data and no tables are created within your AWS Glue Data Catalog. What is the reason for these results?
- The crawler does not have the correct IAM permissions to access the input data in the S3 bucket.
- AWS Glue built-in classifiers could not recognize the input data format. You need to create a custom classifier.
- The bucket path for the input data store in S3 is specified incorrectly.
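For the crawler scenario in question 5, the sketch below shows roughly how a custom classifier could be registered and attached to a Glue crawler with boto3. This is a minimal illustration, not the only correct answer path; the bucket path, IAM role ARN, database name, grok pattern, and all resource names are placeholder assumptions.

```python
import boto3

glue = boto3.client("glue", region_name="us-east-1")

# Hypothetical custom classifier: a grok pattern for a log format that the
# built-in classifiers cannot recognize. Names and pattern are placeholders.
glue.create_classifier(
    GrokClassifier={
        "Name": "custom-app-log-classifier",
        "Classification": "app_logs",
        "GrokPattern": "%{TIMESTAMP_ISO8601:ts} %{LOGLEVEL:level} %{GREEDYDATA:message}",
    }
)

# Crawler that tries the custom classifier before the built-in ones.
glue.create_crawler(
    Name="input-data-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # placeholder role
    DatabaseName="ml_input_data",
    Targets={"S3Targets": [{"Path": "s3://example-input-bucket/raw/"}]},  # placeholder path
    Classifiers=["custom-app-log-classifier"],
)

glue.start_crawler(Name="input-data-crawler")
```

If the crawler still creates no tables after a run like this, the remaining options in question 5 (IAM permissions and the S3 path) are the usual things to check next.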
6. Which Amazon service allows you to build a high-quality labeled training dataset for your machine learning models? The labeling workforce can include human workers, vendor companies that you choose, or an internal, private workforce.
- Jupyter Notebooks
- SageMaker Ground Truth
- Lambda

7. You have been tasked with setting up crawlers in AWS Glue to crawl different data stores and populate your organization's AWS Glue Data Catalogs. Which of the following input data stores is NOT an option when creating a crawler?
- RDS
- DynamoDB
- DocumentDB

8. You are an ML specialist within a large organization who needs to run SQL queries and analytics on thousands of Apache log files stored in S3. Your organization already uses Redshift as its data warehousing solution. Which tool can help you achieve this with the LEAST amount of effort?
- Redshift Spectrum
- Athena
- Apache Hive

9. In general, what is the minimum number of observations your dataset should have compared to the number of features?
- 10 times as many observations as features.
- 1,000 times as many observations as features.
- 10,000 times as many observations as features.

10. You are an ML specialist within a large organization who needs to run SQL queries and analytics on thousands of Apache log files stored in S3. Which set of tools can help you achieve this with the LEAST amount of effort?
- AWS Glue Data Catalog and Athena
- Data Pipeline and Athena
- Redshift and Redshift Spectrum

11. A data store is used to persist data needed by microservices. Popular stores for session data are in-memory caches such as Memcached or Redis, and AWS offers both technologies as part of the managed Amazon ElastiCache service. Now consider an example: an organization needs to store a massive amount of data in AWS. The data has a key-value access pattern, developers need to run complex SQL queries and transactions, and the data has a fixed schema. Which type of data store meets all of their needs?
- DynamoDB
- RDS
- Athena

12. You have been tasked with collecting thousands of PDFs to build a large corpus dataset. The data within this dataset would be considered what type of data?
- Unstructured
- Structured
- Semi-structured

13. You have been tasked with converting multiple JSON files within an S3 bucket to Apache Parquet format. Which AWS service can you use to achieve this with the LEAST amount of effort?
- Create an AWS Glue job to convert the S3 objects from JSON to Apache Parquet, then output the newly formatted files to S3.
- Create a Data Pipeline job that reads from your S3 bucket and sends the data to EMR. Create an Apache Spark job to convert the data to Apache Parquet and output the newly formatted files to S3.
- Create a Lambda function that reads all of the objects in the S3 bucket, loops through each object, and converts it from JSON to Apache Parquet. Once the conversion is complete, output the newly formatted files to S3.

14. When you train your model in SageMaker, where does your training dataset come from?
- Redshift
- DynamoDB
- S3
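For question 14, here is a minimal sketch using the SageMaker Python SDK that shows how a training job receives its dataset as S3 channels passed to fit(). The bucket names, IAM role, algorithm choice, and hyperparameters are illustrative assumptions, not values from the quiz.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
region = session.boto_region_name

# Built-in XGBoost container; any built-in or custom training image is wired up the same way.
image_uri = sagemaker.image_uris.retrieve(framework="xgboost", region=region, version="1.5-1")

estimator = Estimator(
    image_uri=image_uri,
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder role
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-ml-bucket/model-artifacts/",  # placeholder bucket
    sagemaker_session=session,
)
estimator.set_hyperparameters(objective="binary:logistic", num_round=100)

# The training data channels point at S3 prefixes: SageMaker pulls the dataset
# from S3 onto the training instances when fit() launches the job.
estimator.fit({
    "train": TrainingInput("s3://example-ml-bucket/train/", content_type="text/csv"),
    "validation": TrainingInput("s3://example-ml-bucket/validation/", content_type="text/csv"),
})
```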