AWS Certified Machine Learning – Specialty Set 3
Author: CloudVikas
Published Date: 18 March 2020

Welcome to AWS Certified Machine Learning – Specialty Set 3.

1. You are an ML specialist preparing labeled data to help determine whether a given leaf originates from a poisonous plant. The target attribute is poisonous and is classified as 0 or 1. The data you have been analyzing has the following features: leaf height (cm), leaf length (cm), number of cells (trillions), poisonous (binary). After initial analysis you do not suspect any outliers in any of the attributes. After using the given data to train your model, you are getting extremely skewed results. What technique can you apply to possibly help solve this issue?
- Normalize the number of cells attribute.
- Apply one-hot encoding to each of the attributes, except for the poisonous attribute (since it is already encoded).
- Standardize the number of cells attribute.

2. Which programming languages does AWS Glue offer for Spark job types? (Choose 2)
- R
- Scala
- Java
- Python

3. You are working for an organization that takes different metrics about its customers and classifies them with one of the following statuses: bronze, silver, or gold. Depending on their status, customers receive larger or smaller discounts and are placed at a higher or lower priority for customer support. The algorithm you have chosen expects all numerical inputs. What can be done to handle these status values? (See the code sketch after question 5.)
- Use one-hot encoding techniques to map values for each status, dropping the original status feature.
- Apply random numbers to each status value and apply gradient descent until the values converge to the expected results.
- Experiment with mapping different values to each status and see which works best.

4. Choose the scenarios in which one-hot encoding techniques are NOT a good idea. (Choose 3)
- When our values cannot be ordered in any meaningful way, there are only a few to choose from, and our algorithm expects numeric input.
- When our algorithm accepts numeric input and we have continuous values.
- When our algorithm expects numeric input and we have thousands of nominal categorical values.
- When our algorithm expects numeric input and we have ordinal categorical values.
- When our algorithm expects numeric input and we have few nominal categorical values.

5. You are an ML specialist working within SageMaker, analyzing a dataset in a Jupyter notebook. On your local machine you have several open-source Python libraries that you downloaded from the internet using a typical package manager. You want to download and use these same libraries on your dataset in SageMaker within your Jupyter notebook. What option allows you to use these libraries?
- SSH into the Jupyter notebook instance and install the needed libraries. This is typically done using conda install or pip install.
- SageMaker offers a wide variety of built-in libraries. If the library you need is not included, contact AWS Support with details on the libraries needed for distribution.
- Use the integrated terminals in SageMaker to install libraries. This is typically done using conda install or pip install.
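Below is a minimal sketch of the one-hot encoding technique referenced in questions 3 and 4, using pandas get_dummies; the DataFrame, column names, and values are illustrative assumptions rather than part of the questions.

```python
import pandas as pd

# Hypothetical customer data; the column names and values are assumptions.
df = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "status": ["bronze", "gold", "silver"],
})

# One-hot encode the nominal status column, producing one indicator
# column per status value, and drop the original status feature.
encoded = pd.get_dummies(df, columns=["status"], prefix="status")
print(encoded.columns.tolist())
# ['customer_id', 'status_bronze', 'status_gold', 'status_silver']
```

Because there are only three nominal values, the encoding stays compact; with thousands of distinct values (as in question 4) the same approach would add thousands of mostly empty columns.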
6. You work for an organization that wants to manage all of the data it stores in S3. The organization wants to automate the transformation jobs on the S3 data and maintain a data catalog of the metadata concerning the datasets. The solution you choose should require the least amount of setup and maintenance. Which solution will allow you to achieve this and meet its goals?
- Create a cluster in EMR that uses Apache Hive. Then create a simple Hive script that runs transformation jobs on a schedule.
- Create an AWS Glue crawler to populate the AWS Glue Data Catalog. Then create an AWS Glue job and set up a schedule for data transformation jobs.
- Create a cluster in EMR that uses Apache Spark. Then create an Apache Hive metastore and a script that runs transformation jobs on a schedule.

7. We are analyzing the following text: { Hello cloud gurus! Keep being awesome! }. We apply a lowercase transformation, remove punctuation, and generate n-grams with a sliding window of 3. What are the unique trigrams produced, and what are the dimensions of the tf-idf vector/matrix? (See the code sketch after question 11.)
- ['hello cloud gurus', 'cloud gurus keep', 'keep being awesome'] and (1, 3)
- ['cloud gurus keep', 'gurus keep being', 'hello cloud gurus', 'keep being awesome'] and (1, 4)
- ['hello cloud gurus', 'cloud gurus keep', 'gurus keep being', 'keep being awesome'] and (2, 4)

8. You are an ML specialist who has a Python script that uses libraries such as Boto3, pandas, NumPy, and sklearn to help transform data that is in S3. On your local machine the data transformation works as expected. You need to find a way to schedule this job to run periodically and store the transformed data back in S3. What is the best option to achieve this?
- Create an AWS Glue job that uses Spark as the job type to create PySpark code to transform and store data in S3. Then set up this job to run on a schedule.
- Create an AWS Glue job that uses Python shell as the job type and executes the code written to transform and store data in S3. Then set up this job to run on a schedule.
- Create an AWS Glue job that uses Spark as the job type to create Scala code to transform and store data in S3. Then set up this job to run on a schedule.

9. A term frequency-inverse document frequency (tf-idf) matrix using both unigrams and bigrams is built from a text corpus consisting of the following two sentences: { Hello world } and { Hello how are you }. What are the dimensions of the tf-idf vector/matrix? (See the code sketch after question 11.)
- (2, 5)
- (5, 9)
- (2, 9)

10. You are an ML specialist who has 780 GB of files in a data lake hosted in S3. The metadata about these files is stored in the S3 bucket as well. You need to search through the data lake to get a better understanding of what the data consists of. You will most likely perform multiple searches depending on the results found throughout your research. Which solution meets the requirements with the LEAST amount of effort?
- First enable S3 analytics, then use the metastore files to analyze your data.
- Create an EMR cluster with Apache Hive to analyze and query your data.
- Use Amazon Athena to analyze and query your S3 data.

11. You are an ML specialist who has been tasked with setting up an ETL pipeline for your organization. The team already has an EMR cluster that will be used for ETL tasks and needs to be directly integrated with Amazon SageMaker without writing any specific code to connect EMR to SageMaker. Which framework allows you to achieve this?
- Apache Spark
- Apache Flink
- Apache Pig
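For reference, a minimal sketch that reproduces the n-gram and tf-idf bookkeeping in questions 7 and 9 with scikit-learn's TfidfVectorizer; the questions do not name a library, so the choice of scikit-learn (and get_feature_names_out, which assumes version 1.0 or later) is an assumption.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Question 7: one document, word trigrams. Lowercasing and stripping of
# punctuation are TfidfVectorizer defaults.
trigram_vec = TfidfVectorizer(ngram_range=(3, 3))
trigram_matrix = trigram_vec.fit_transform(["Hello cloud gurus! Keep being awesome!"])
print(trigram_vec.get_feature_names_out())  # the unique trigrams
print(trigram_matrix.shape)                 # (documents, unique trigrams)

# Question 9: two documents, unigrams and bigrams together.
uni_bi_vec = TfidfVectorizer(ngram_range=(1, 2))
uni_bi_matrix = uni_bi_vec.fit_transform(["Hello world", "Hello how are you"])
print(uni_bi_matrix.shape)                  # (documents, unique uni- and bigrams)
```

The matrix has one row per document and one column per unique n-gram, which is what the dimension tuples in the answer options describe.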
12. An ML specialist is working for a bank and trying to determine whether credit card transactions are fraudulent or non-fraudulent. The features of the collected data include the customer name, customer type, transaction amount, length of time as a customer, and transaction type. The transaction type is classified as 'normal' or 'abnormal'. What data preparation action should the ML specialist take?
- Drop the length of time as a customer and perform label encoding on the transaction type before training the model.
- Drop the customer name and perform label encoding on the transaction type before training the model.
- Drop both the customer type and the transaction type before training the model.

13. You are an ML specialist preparing a dataset for a supervised learning problem. You are using the Amazon SageMaker Linear Learner algorithm. You notice the target label attributes are highly imbalanced and multiple feature columns contain missing values. The proportion of missing values across the entire dataset is less than 5%. What should you do to minimize bias due to missing values? (See the code sketch after question 14.)
- First normalize the non-missing values, then replace the missing values with the normalized values.
- Replace the missing values with mean or median values from the other values of the same feature.
- For each feature that has missing values, use supervised learning to approximate the values based on the other features.

14. You are an ML specialist who has been tasked with setting up a transformation job for 800 TB of data. You have set up several ETL jobs written in PySpark on AWS Glue to transform your data, but the ETL jobs are taking a very long time to process and are extremely expensive. What are your other options for processing the data?
- Create a Kinesis Data Stream to stream the data to multiple EC2 instances, each performing partitioned workloads and ETL jobs. Tweak cluster size, instance types, and data partitioning until performance and cost satisfaction are met.
- Create an EMR cluster with Spark, Hive, and Flink to perform the ETL jobs. Tweak cluster size, instance types, and data partitioning until performance and cost satisfaction are met.
- Change the job type to Python shell and use built-in libraries to perform the ETL jobs. The built-in libraries perform better than Spark jobs and are a fraction of the cost.
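Below is a minimal sketch of the mean/median imputation mentioned in question 13, using scikit-learn's SimpleImputer; the feature names and values are made up for illustration and are not part of the question.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with a small fraction of missing values;
# the column names and values are assumptions for illustration only.
features = pd.DataFrame({
    "transaction_amount": [25.0, np.nan, 310.5, 42.0],
    "tenure_months": [12.0, 48.0, np.nan, 7.0],
})

# Replace each missing value with the median of its own feature column;
# strategy="mean" would use the column mean instead.
imputer = SimpleImputer(strategy="median")
imputed = pd.DataFrame(imputer.fit_transform(features), columns=features.columns)
print(imputed)
```

Because fewer than 5% of the values are missing, per-column imputation keeps the rest of each affected row usable instead of dropping it.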