AWS Machine Learning Q&A - Part 2

1) The data engineering team at a social media company ingests clickstream data into a Kinesis Data Stream using the PutRecord API in the source system. Now the team wants to ingest this data into Kinesis Data Firehose instead, and they want to use the PutRecord API for Firehose. Identify the differences between the PutRecord API call for Kinesis Data Streams versus Kinesis Data Firehose:

Kinesis Data Streams PutRecord API uses the name of the stream, a partition key and the data blob, whereas Kinesis Data Firehose PutRecord API uses the name of the delivery stream and the data record
Kinesis Data Firehose PutRecord API uses the name of the stream, a partition key and the data blob, whereas Kinesis Data Streams PutRecord API uses the name of the delivery stream and the data record
Both Kinesis Data Firehose PutRecord API and Kinesis Data Streams PutRecord API use the name of the delivery stream and the data record
Both Kinesis Data Firehose PutRecord API and Kinesis Data Streams PutRecord API use the name of the stream, a partition key and the data blob

Kinesis Data Streams PutRecord API uses the name of the stream, a partition key and the data blob, whereas Kinesis Data Firehose PutRecord API uses the name of the delivery stream and the data record

Explanation
Kinesis Data Streams PutRecord API uses the name of the stream, a partition key and the data blob, whereas Kinesis Data Firehose PutRecord API uses the name of the delivery stream and the data record. Please review more details here:
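A minimal Boto3 sketch of the two calls side by side; the stream names and the payload are placeholder assumptions, not taken from the question:

import boto3

kinesis = boto3.client("kinesis")
firehose = boto3.client("firehose")

# Kinesis Data Streams: stream name, data blob, and a partition key
kinesis.put_record(
    StreamName="clickstream",
    Data=b'{"user_id": 42, "action": "click"}',
    PartitionKey="42",
)

# Kinesis Data Firehose: delivery stream name and a data record (no partition key)
firehose.put_record(
    DeliveryStreamName="clickstream-delivery",
    Record={"Data": b'{"user_id": 42, "action": "click"}'},
)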

2) Identify the three built-in SageMaker algorithms that support incremental training (Select three):

XGBoost
Semantic Segmentation

Image Classification
Object Detection
Linear Learner

Semantic Segmentation, Image Classification, Object Detection

Explanation
Only three built-in algorithms currently support incremental training: Object Detection Algorithm, Image Classification Algorithm, and Semantic Segmentation Algorithm:

3) The data science team at an analytics company is working on a linear regression model and observes that both the training error and the test error are high, implying that the model has high bias. Which of the following L1 and L2 regularization optimizations may be done to resolve this issue (Select two):

Decrease L1 regularization
Use L2 regularization and drop L1 regularization
L1 and L2 regularization are not required, just get more training data
Increase L1 regularization

Decrease L1 regularization
Use L2 regularization and drop L1 regularization

Explanation
Getting more training data alone will not address the model bias. You can think of L1 regularization as reducing the number of features in the model altogether, whereas L2 “regulates” the feature weights instead of dropping them. Since the model is underfitting (high bias), reducing regularization helps, so “Decreasing L1 regularization” and “Using L2 regularization along with dropping L1 regularization” are the correct options. Please review the concept of L1 and L2 regularization in more detail:
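A small scikit-learn sketch of the idea, assuming a synthetic dataset and illustrative ElasticNet parameters that are not part of the question:

from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=500, n_features=20, noise=10.0, random_state=0)

# Heavy L1 regularization can zero out useful feature weights and push the
# model toward underfitting (high bias).
high_l1 = ElasticNet(alpha=1.0, l1_ratio=1.0).fit(X, y)

# Decreasing the L1 component (or moving toward L2, which only shrinks weights
# rather than dropping them) lets the model use more of the features.
mostly_l2 = ElasticNet(alpha=0.1, l1_ratio=0.1).fit(X, y)

print(high_l1.score(X, y), mostly_l2.score(X, y))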

4) The compliance department at a major Financial Services Firm wants to monitor the SageMaker services used by the Data Science team for their ML jobs. Which services can be used to achieve this objective (Select two):

Amazon CloudWatch
Amazon Inspector
AWS CloudTrail
AWS Config

Amazon CloudWatch, AWS CloudTrail

Explanation
CloudWatch and CloudTrail can be used to monitor the SageMaker services. Further details on the monitoring options for SageMaker can be found here:

5) An Analytics Consulting Firm would like to capture and analyse the real-time metrics for a cab hailing service. The Firm would like to identify “demand hotspots” in real time so that additional cabs can be dispatched to meet the sudden spurt in demand. What is the least-effort way of building a real-time analytics solution for this use case:

Ingest the source data directly into Kinesis Data Analytics so that real-time analytics can be done without any processing delay. Once processing is done, the streams are dumped into S3 using Kinesis Data Firehose.
Ingest the data into Kinesis Data Streams that writes the data into a Spark Streaming application running on an EMR cluster. Once the processing is done, the output is written on S3
Ingest the data into Kinesis Data Streams and immediately write the stream into Kinesis Data Analytics for SQL-based analysis so that appropriate alerts can be sent to the drivers. Once processing is done, the streams are dumped into S3 using Kinesis Data Firehose.
Ingest the data into Kinesis Data Firehose and write into S3, which triggers a Lambda that analyses the event data. The Lambda finally writes the output to S3.

Ingest the data into Kinesis Data Streams and immediately write the stream into Kinesis Data Analytics for SQL-based analysis so that appropriate alerts can be sent to the drivers. Once processing is done, the streams are dumped into S3 using Kinesis Data Firehose.

Explanation
Kinesis Data Analytics cannot directly ingest source data. Using a combination of Kinesis Data Firehose with Lambda would introduce a buffering delay of at least 1 minute or 1MB of data, so the solution would not be real time. Using EMR would significantly increase the development and maintenance effort, so it’s not the right choice. The correct solution is to ingest the data into Kinesis Data Streams and immediately write the stream into Kinesis Data Analytics for SQL-based analysis so that appropriate alerts can be sent to the drivers. Once processing is done, the streams are dumped into S3 using Kinesis Data Firehose.

6)

An analyst is trying to create a box plot for the following data points:
10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4
Based on these data points, we have the following characteristics:
Q1 (25th percentile) = 14.4
Q2 (50th percentile) = 14.6
Q3 (75th percentile) = 14.9

Identify the data points that would show up as outliers on the box plot (Select three):

15.9
14.1
14.4
15.1
10.2
16.4

15.9
10.2
16.4

Explanation
Interquartile Range (IQR) = Q3-Q1 = 0.5
Minimum outlier cutoff = Q1 – 1.5 * IQR = 14.4 – (1.5*0.5) = 13.65
Maximum outlier cutoff = Q3 + 1.5 * IQR = 14.9 + (1.5*0.5) = 15.65
So an outlier is anything less than 13.65 or anything greater than 15.65. Thus the outliers are 10.2, 15.9 and 16.4 for the given problem statement.
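The same arithmetic can be checked with a few lines of Python (the quartile values are taken from the question rather than recomputed, since quartile interpolation methods vary):

data = [10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7,
        14.7, 14.9, 15.1, 15.9, 16.4]

q1, q3 = 14.4, 14.9          # quartiles as stated in the question
iqr = q3 - q1                # 0.5

lower = q1 - 1.5 * iqr       # 13.65
upper = q3 + 1.5 * iqr       # 15.65

outliers = [x for x in data if x < lower or x > upper]
print(outliers)              # [10.2, 15.9, 16.4]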

7) Identify the mandatory hyperparameter for both the Word2Vec (unsupervised) and Text Classification (supervised) modes of the SageMaker BlazingText algorithm:

learning_rate
mode
buckets
epochs

mode

Explanation
mode is the mandatory hyperparameter for both the Word2Vec (unsupervised) and Text Classification (supervised) modes of the SageMaker BlazingText algorithm:

8) After training a SageMaker model over a huge training dataset, the data science team observed that it has low accuracy on the training data as well as low accuracy on the test data. What can you say about the model:

Model is overfitting
Model is underfitting
The model needs more training data
Model is neither underfitting nor overfitting

Model is underfitting

Explanation
When a model underfits, it exhibits low accuracy on both the training and test data

9) Which SageMaker algorithms support only CPU-based instance classes for both training and inference (Select two):

XGBoost
KNN
Random Cut Forest
k-means
Neural Topic Model

XGBoost, Random Cut Forest

Explanation
XGBoost and Random Cut Forest support only CPU-based instance classes for both training and inference.

10) Researchers at NASA are creating a model on Amazon SageMaker to analyze images for detecting strong gravitational lensing, a phenomenon in which an accumulation of matter in space is dense enough that it bends light waves as they travel around it. The training data contains 200K images of the negative class (images with no gravitational lensing) and only 2000 images of the positive class (images with gravitational lensing). The final model has 85% accuracy, but poor recall. How can you improve the model performance (Select two):

Collect more training data for the negative class
Over-sample from the positive class
Collect more training data for the positive class
Over-sample from the negative class

Over-sample from the positive class
Collect more training data for the positive class

Explanation
In the case of a binary classification model with strongly imbalanced classes, we need to over-sample from the minority class and collect more training data for the minority class.

11) The marketing analytics team at a financial services company is working on creating a customer loyalty program targeted at specific groups of customers. Which data analysis technique should be used for this goal:

Bivariate visualizations
Clustering
Multivariate visualizations
Dimensionality Reduction

Clustering

Explanation
Clustering is the best way to uncover similar groups. These groups can then be further analyzed to customize the customer loyalty program.

12) A Financial Services company has asked you to fine-tune its SageMaker model training process. You observe that the company runs the training jobs multiple times a day with a little tweaking of the training data for each run. Which steps would you recommend to improve the training performance so that the training jobs can complete faster (Select two):

Use pipe mode to stream data from S3
Spin up an EMR cluster to process the training job
Upgrade the training instance to the highest possible type
Change the data format to protobuf recordIO format

Use pipe mode to stream data from S3
Change the data format to protobuf recordIO format

Explanation
Using the training data in protobuf recordIO format along with pipe mode can significantly improve the training job performance. Neither using the EMR cluster nor changing the instance type guarantees improvement in the training performance.

13) Considering the following ROC curve generated for the Amazon SageMaker XGBoost algorithm for a binary classification use-case

The model represented by the black ROC curve is best at distinguishing the two classes. The model represented by the blue ROC curve is worst at distinguishing the two classes.
The model represented by the black ROC curve is best at distinguishing the two classes. The model represented by the red ROC curve is worst at distinguishing the two classes.
The model represented by the red ROC curve is best at distinguishing the two classes. The model represented by the blue ROC curve is worst at distinguishing the two classes.
The model represented by the blue ROC curve is best at distinguishing the two classes. The model represented by the black ROC curve is worst at distinguishing the two classes.

The model represented by the blue ROC curve is best at distinguishing the two classes. The model represented by the black ROC curve is worst at distinguishing the two classes.

Explanation
Please review the concept of AUC/ROC as applied to binary classification. Here is a good reference:


14) Identify SageMaker algorithms that process training data in the form of a pair of entities (Select two):

KNN
XGBoost
IP Insights
Object2Vec

IP Insights
Object2Vec

Explanation
Amazon SageMaker IP Insights ingests historical data as (entity, IPv4 Address) pairs and learns the IP usage patterns of each entity. Similarly, Object2Vec trains on pairs of objects (for example, sentence-sentence or user-item pairs) to learn embeddings for them.

15) Identify SageMaker unsupervised learning algorithms that are parallelizable (Select two):

PCA
Random Cut Forest
LDA
k-means

PCA, Random Cut Forest

Explanation
PCA and Random Cut Forest are parallelizable amongst the given options. Please review the common parameters for all SageMaker algorithms

16) A leading technology company offers a fast-track leadership program to the best performing executives at the company. At any point in time, more than a thousand executives are part of this leadership program. Which is the best visualization type to analyze the salary distribution for these executives:

Bubble Chart
Bar Chart
Histogram
Pie Chart

Histogram

Explanation
A histogram is best suited to analyse the underlying distribution of data, such as the salary distribution described in this use-case.

17) A marketing analyst wants to group current and prospective customers into 10 groups based on their attributes. He wants to send mailings to prospective customers in the group which has the highest percentage of current customers. As an ML Specialist, which Sagemaker algorithm would you recommend as a solution:

Latent Dirichlet Allocation
KNN
PCA
K-means

K-means

Explanation
As there is no historical labeled data, KNN is ruled out. PCA is used for dimensionality reduction and LDA for topic modeling. K-means is the right algorithm to uncover discrete groupings within data.

18) An analytics company is doing the sentiment analysis of tweets about a leading sports event. The company has prepared the following confusion matrix. What is the precision of the underlying model:

88%
80%
50%
20%

80%

Explanation
Precision = (True Positives / (True Positives + False Positives)) = (800/(800+200)) = 0.8 or 80%

19) A leading ecommerce company is looking to improve the user experience by recommending the related product categories for its catalog of products. As an ML Specialist, which SageMaker algorithms would you use to develop a solution for this use-case (Select two):

Latent Dirichlet Allocation (LDA)
K-means
Factorization Machines
XGBoost

Latent Dirichlet Allocation (LDA), Factorization Machines

Explanation
Use LDA to figure out the right categories for each product. Use Factorization Machines to recommend the right related categories for the given product’s categories.

20) Identify the criteria on which early stopping works in Amazon SageMaker:
If the value of the objective metric for the current training job is worse (higher when minimizing or lower when maximizing the objective metric) than the median value of running averages of the objective metric for previous training jobs up to the same epoch, Amazon SageMaker stops the current training job.
If the value of the objective metric for the current training job is better (higher when minimizing or lower when maximizing the objective metric) than the mean value of running averages of the objective metric for previous training jobs up to the same epoch, Amazon SageMaker stops the current training job.
If the value of the objective metric for the current training job is worse (higher when minimizing or lower when maximizing the objective metric) than the mean value of running averages of the objective metric for previous training jobs up to the same epoch, Amazon SageMaker stops the current training job.
If the value of the objective metric for the current training job is better (higher when minimizing or lower when maximizing the objective metric) than the median value of running averages of the objective metric for previous training jobs up to the same epoch, Amazon SageMaker stops the current training job.

If the value of the objective metric for the current training job is worse (higher when minimizing or lower when maximizing the objective metric) than the median value of running averages of the objective metric for previous training jobs up to the same epoch, Amazon SageMaker stops the current training job.

Explanation
If the value of the objective metric for the current training job is worse (higher when minimizing or lower when maximizing the objective metric) than the median value of running averages of the objective metric for previous training jobs up to the same epoch, Amazon SageMaker stops the current training job.
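A hedged sketch of where this setting lives when tuning via the Boto3 SageMaker API; the objective metric, parameter range and resource limits below are illustrative assumptions:

# Sketch of the tuning-job config that enables early stopping.
tuning_job_config = {
    "Strategy": "Bayesian",
    "HyperParameterTuningJobObjective": {
        "Type": "Minimize",
        "MetricName": "validation:rmse",
    },
    "ResourceLimits": {"MaxNumberOfTrainingJobs": 20, "MaxParallelTrainingJobs": 2},
    "ParameterRanges": {
        "ContinuousParameterRanges": [
            {"Name": "eta", "MinValue": "0.1", "MaxValue": "0.5"}
        ]
    },
    # 'Auto' applies the median-based early stopping rule described above;
    # 'Off' (the default) disables it.
    "TrainingJobEarlyStoppingType": "Auto",
}
# This dict is passed as HyperParameterTuningJobConfig to
# boto3 sagemaker client create_hyper_parameter_tuning_job(...).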

21) A financial services company wants to migrate its data architecture from a data warehouse to a data lake. It wants to use a solution that takes the least amount of time and needs no infrastructure management. What options would you recommend to transfer the data from AWS Redshift to S3 (Select two):

Apache Spark ETL script running on EMR cluster
Lambda functions orchestrated by AWS Step Functions
AWS Data Pipeline
AWS Glue ETL job

AWS Data Pipeline
AWS Glue ETL job

Explanation
An EMR cluster needs to be provisioned and managed, so that option is ruled out. Lambda is not meant to handle ETL workloads, so that is also ruled out. Both AWS Data Pipeline and an AWS Glue ETL job are correct choices for this use-case.

22) An ecommerce company wants to optimize the cost structure for its Redshift data warehouse by moving out some of the infrequently accessed data to S3. What solution would you recommend so that the company can still access this infrequently accessed data from Redshift whenever required:

Create an AWS Glue ETL job that writes the data from S3 back into Redshift. The job needs to be triggered every time the data needs to be analysed in Redshift
Use Redshift Spectrum so that the infrequently accessed data in S3 can be queried from Redshift.
Create a Glue crawler to read the S3 data into Athena so there is no need to use Redshift
Create an EMR based Spark ETL job that writes the data from S3 back into Redshift. The job needs to be triggered every time the data needs to be analysed in Redshift

Use Redshift Spectrum so that the infrequently accessed data in S3 can be queried from Redshift.

Explanation
EMR and Glue based ETL jobs are not practical as the job needs to be invoked every time data needs to be queried in Redshift. Once the query is done, data needs to be deleted again to save costs. Using Athena is not an option as the query needs to be done in Redshift. Using Redshift Spectrum is the correct choice for this use-case. Please review more details here:

23) The Training Image and Inference Image Registry Paths used for SageMaker algorithms are of which type:

Region based
Global
City based
Country based

Region based


Explanation
The Training Image and Inference Image Registry Paths are region based.

24) To get inferences for an entire dataset, you are developing a batch transform job on SageMaker using the AWS SDK for Python (Boto3). Which methods would you use to create the job and then get the status of the job's progress (Select two):

create_transform_job
describe_transform_job
create_training_job
describe_training_job

create_transform_job
describe_transform_job

Explanation
You can use create_transform_job to create the batch transform job and describe_transform_job to get the status of the job's progress.
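A minimal Boto3 sketch of the two calls; the model name, S3 paths and instance type are placeholder assumptions:

import boto3

sm = boto3.client("sagemaker")

# Create the batch transform job (names and locations are placeholders).
sm.create_transform_job(
    TransformJobName="batch-inference-demo",
    ModelName="my-trained-model",
    TransformInput={
        "DataSource": {
            "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": "s3://my-bucket/input/"}
        },
        "ContentType": "text/csv",
    },
    TransformOutput={"S3OutputPath": "s3://my-bucket/output/"},
    TransformResources={"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
)

# Poll the job status
response = sm.describe_transform_job(TransformJobName="batch-inference-demo")
print(response["TransformJobStatus"])  # e.g. InProgress, Completed, Failed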

25) Amazon Sagemaker models are stored in which format:

model.tar.gz
model.tar.gzip
model.gzip
model.zip

model.tar.gz

Explanation
Amazon SageMaker models are stored as model.tar.gz in the S3 bucket specified in the OutputDataConfig S3OutputPath parameter of the create_training_job call.

26) The data science team at an analytics company is working on a credit score model using the SageMaker Linear Learner algorithm. The training data consists of these fields: name, age, annual salary, gender, employment status and credit score. The model needs to predict the credit score label. Which data preparation steps need to be done before working on the model:

Drop the name and one-hot-encode gender and employment status
Drop the age and one-hot-encode name and credit score
Drop the age and one-hot-encode name
Drop the name and one-hot-encode annual salary and credit score

Drop the name and one-hot-encode gender and employment status

Explanation
As gender and employment status are categorical, they need to be one-hot-encoded. Name has no bearing as a useful feature for the model, so it can be discarded. You cannot one-hot encode annual salary as it’s not categorical.
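A short pandas sketch of this preparation, using a made-up toy DataFrame whose columns mirror the question's fields:

import pandas as pd

df = pd.DataFrame({
    "name": ["Alice", "Bob"],
    "age": [34, 51],
    "annual_salary": [82000, 121000],
    "gender": ["F", "M"],
    "employment_status": ["employed", "self-employed"],
    "credit_score": [710, 655],
})

# Drop the name, one-hot-encode the categorical features, keep numeric ones as-is.
prepared = pd.get_dummies(
    df.drop(columns=["name"]),
    columns=["gender", "employment_status"],
)
print(prepared.head())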

27) The research team at a University wants to do sentiment analysis of the most famous quotes from classical English literature over the last 500 years. Some of the sample quotes from the corpus are like so: “All that glitters is not gold”, “Brevity is the soul of wit”, “The lady doth protest too much, methinks.”, “Love all, trust a few, do wrong to none.” As an ML Specialist, what data pre-processing steps would you recommend before the team starts building the model (Select three):

Remove stop words
Create n-gram vector for each word
Lowercase each word
Remove archaic words such as doth and methinks
Create one-hot encoding for each word
Tokenize each word

Remove stop words, Lowercase each word, Tokenize each word

Explanation
Removing stop words, tokenizing each word and lowercasing each word are the recommended pre-processing steps for this use case. N-gram vectors and one-hot encoding are not relevant for this use-case. Archaic words should not be removed, since they play a crucial role in determining the sentiment of the sentence.
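A small sketch of the three recommended steps using NLTK; the quote is taken from the question and the exact output tokens depend on the NLTK version:

import nltk
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

quote = "Love all, trust a few, do wrong to none."        # sample quote from the question

tokens = word_tokenize(quote)                              # 1. tokenize
tokens = [t.lower() for t in tokens if t.isalpha()]        # 2. lowercase (and drop punctuation)
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]        # 3. remove stop words

print(tokens)  # e.g. ['love', 'trust', 'wrong', 'none']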

28) You would like to tune the hyperparameters for the SageMaker XGBoost algorithm. Identify the correct options for the model validation techniques (Select two):

K-fold validation
Validation using training set
Validation using a holdout set
Validation using SageMaker Ground Truth

K-fold validation, Validation using a holdout set

Explanation
Validation using the training set and validation using SageMaker Ground Truth are made-up options. You can use validation with a holdout set or K-fold validation to tune the hyperparameters for the SageMaker XGBoost algorithm.
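A quick scikit-learn sketch of both validation styles on synthetic data (a generic gradient boosting classifier stands in for XGBoost here):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split, cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Holdout validation: train on one split, evaluate on the held-out split.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
model = GradientBoostingClassifier().fit(X_train, y_train)
print("holdout accuracy:", model.score(X_val, y_val))

# K-fold cross-validation: average performance across k rotating validation folds.
scores = cross_val_score(GradientBoostingClassifier(), X, y, cv=5)
print("5-fold mean accuracy:", scores.mean())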

29) A car insurance company wants to automate the claims process. The company wants the customers to upload the video footage of the damaged car. This video footage is then pre-assessed by an Amazon SageMaker model as part of the damage evaluation process. The company has no prior training data to get started on this endeavor. As an ML Specialist, what would you recommend to the company:

Use Amazon SageMaker Ground Truth to create the labels for the training videos. The labeled videos can be used to train the downstream Amazon SageMaker model for the damage evaluation process.
Use AWS Rekognition to create the labels for the training videos. The labeled videos can be used to train the downstream Amazon SageMaker model for the damage evaluation process.
Use an unsupervised learning algorithm to label the videos which can be used in the downstream Amazon SageMaker model for the damage evaluation process
Use Kinesis Video Streams to create the labels for the training videos. The labeled videos can be used to train the downstream Amazon SageMaker model for the damage evaluation process.

Use Amazon SageMaker Ground Truth to create the labels for the training videos. The labeled videos can be used to train the downstream Amazon SageMaker model for the damage evaluation process.

Explanation
Neither Rekognition nor Kinesis Video Streams nor an unsupervised learning algorithm can be used to create labels for the training videos. The correct option is to use Amazon SageMaker Ground Truth to create the labels for the training videos. The labeled videos can then be used to train the downstream Amazon SageMaker model for the damage evaluation process.

30) A Silicon Valley startup has introduced a new email service that would completely eradicate spam from the inbox. The data scientists at the startup would like to analyze the results of the underlying model. Identify the most important evaluation metric for this task (The model’s predicted value of 1 implies that the email is predicted to be a spam):

Accuracy
Precision
Recall
F1-score

Precision

Explanation
Precision = (True Positives / (True Positives + False Positives))
The startup would like to be extra sure that an email is spam before potentially putting it in the spam folder. In the false positive scenario, the user never sees a genuine email because it was sent to the spam folder instead. This implies that they want fewer false positives, and as false positives decrease, the model's precision increases.


31) You are working on a fraud detection model based on SageMaker IP Insights algorithm with a training data set of 1TB in CSV format. Your Sagemaker Notebook instance has only 5GB of space. How would you go about building your model, given these constraints:

Shuffle the training data and create a 5GB slice of this shuffled data. Build your model on the Jupyter Notebook using this slice of training data. Once the evaluation metric looks good, create a training job on SageMaker infrastructure with the appropriate instance types and instance counts to handle the entire training data.

Create an AWS Glue job to transform the training data into recordIO-protobuf format. Read the entire transformed data in recordIO-protobuf format from S3 in your Jupyter Notebook instance while training your model.
Spin up an EMR cluster running Apache Spark to transform the CSV data into recordIO-protobuf format. Read the entire transformed data in recordIO-protobuf format from S3 in your Jupyter Notebook instance while training your model.
Create an AWS Glue job to compress the training data into parquet format using an appropriate compression codec. This should allow you to use the entire compressed training data on your notebook instance.

Shuffle the training data and create a 5GB slice of this shuffled data. Build your model on the Jupyter Notebook using this slice of training data. Once the evaluation metric looks good, create a training job on SageMaker infrastructure with the appropriate instance types and instance counts to handle the entire training data.

Explanation
The IP Insights algorithm supports only the CSV file type as training data, so the other options using parquet or recordIO-protobuf are ruled out. As an important aside, an AWS Glue job cannot write output in recordIO-protobuf format. The correct option is to shuffle the training data and create a 5GB slice of this shuffled data, build your model on the Jupyter Notebook using this slice of training data, and once the evaluation metric looks good, create a training job on SageMaker infrastructure with the appropriate instance types and instance counts to handle the entire training data.

32) Identify the mandatory hyperparameters for the SageMaker K-means algorithm (Select two):

epochs
feature_dim
k
mini_batch_size

feature_dim
k

Explanation
feature_dim and k are the required hyperparameters for the SageMaker K-means algorithm
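A brief sketch of how these hyperparameters are typically passed when launching a K-means training job via Boto3; the values shown are illustrative assumptions:

# Required and optional K-means hyperparameters (all values must be strings).
hyperparameters = {
    "k": "10",                 # number of clusters (required)
    "feature_dim": "784",      # number of features in each input record (required)
    "mini_batch_size": "500",  # optional
}
# This dict goes into the HyperParameters argument of
# boto3 sagemaker client create_training_job(...).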

33) You are building a feature for a web application such that when a user attempts to log in from an anomalous IP address, a web login server would trigger a multi-factor authentication system. Which SageMaker algorithm would you use for this feature:

XGBoost
Factorization Machines
Random Cut Forest
IP Insights

IP Insights

34) You are pre-processing a training dataset to be used on the Amazon SageMaker Linear Learner algorithm. The dataset has hundreds of features and you need to decide which features to drop. Identify the guidelines that you would follow (Select three):

Drop a feature if it has a lot of missing values
Drop a feature if it has a few missing values
Drop a feature if it has a low correlation to the target label
Drop a feature if it has a high correlation to the target label
Drop a feature if it has high variance
Drop a feature if it has low variance

Drop a feature if it has a lot of missing values, Drop a feature if it has a low correlation to the target label, Drop a feature if it has low variance

Explanation
The rule of thumb is that you drop a feature that will not help the model learn. Any feature that has low variance, a lot of missing values, or a low correlation to the target label ought to be dropped.
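A rough pandas sketch of screening features by these three criteria; the toy DataFrame and the thresholds are illustrative assumptions:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "target": rng.integers(0, 2, 100),
    "useful": rng.normal(size=100),
    "constant": np.ones(100),                             # low variance -> drop
    "sparse": [np.nan] * 80 + list(rng.normal(size=20)),  # mostly missing -> drop
})
df["useful"] += df["target"]                              # give 'useful' some signal

missing_ratio = df.isna().mean()
variances = df.var(numeric_only=True)
correlations = df.corr(numeric_only=True)["target"].abs()

to_drop = (set(missing_ratio[missing_ratio > 0.5].index)
           | set(variances[variances < 1e-6].index)
           | set(correlations[correlations < 0.05].index)) - {"target"}
print(sorted(to_drop))  # expect ['constant', 'sparse']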

35) The data science team at a SaaS CRM company wants to improve its customer support workflow. The team wants to identify duplicate support tickets or route tickets to the correct support queue based on similarity of the text found in a ticket. As an ML Specialist, which SageMaker algorithm would you recommend to help solve this problem:

Factorization Machines
XGBoost
Object2Vec
BlazingText Word2Vec mode

Object2Vec

Explanation
Object2Vec can be used to find semantically similar objects such as tickets. BlazingText Word2Vec can only find semantically similar words. Factorization Machines and XGBoost are not a fit for this use-case.

36) A retail organization ingests 100GB of data into S3 from its global storefronts on a daily basis. This data needs to be cleaned, prepared and analyzed daily so that sales reports can be sent out to the business stakeholders. Which option takes the least effort to make this data available for SQL queries:

Set up a daily Glue job to write the incremental S3 data into DynamoDB and have it available for SQL queries
Set up Glue crawlers to initially read the data into Athena tables. Since the data schema does not change, the daily data is readily available for SQL queries in Athena as soon as it arrives
Set up a daily Glue job to write the incremental S3 data into RDS and have it available for SQL queries
Set up a daily Glue job to write the incremental S3 data into Redshift and have it available for SQL queries

Set up Glue crawlers to initially read the data into Athena tables. Since the data schema does not change, the daily data is readily available for SQL queries in Athena as soon as it arrives

Explanation
Using a Glue crawler with Athena is the least-effort way to make S3 data available for SQL queries. Using a daily Glue job adds unnecessary complexity to the solution. Also, you can't use SQL queries with DynamoDB.

37) An online real estate database company provides information on the housing prices for all states in the US by capturing information such as house size, age, location etc. The company is capturing data for a city where the typical housing prices are around $200K except for some houses that are more than 100 years old with an asking price of about $1 million. These heritage houses will never be listed on the platform. What data processing step would you recommend to address this use-case?

Normalize the data for all houses in this city and then train the model
Standardize the data for all houses in this city and then train the model
Drop the heritage houses from the training data and then train the model
One-hot encode the data for all houses in this city and then train the model

Drop the heritage houses from the training data and then train the model

Explanation
One-hot encoding is used only for nominal categorical features, so this option is not correct. While normalizing and standardizing are valid strategies in general, for this use-case they would end up injecting noise into the model due to the data from the heritage houses. As the heritage houses are clear outliers in terms of price and will never be listed, it is best to drop them from the training data and then train the model.

38) You want to create an AWS Glue crawler to read the transaction data dumped into an S3 based data lake in the s3://mybucket/myfolder/ location. The transaction data is in CSV format however there are some additional metadata files with .metadata extension in the same location. The metadata needs to be ignored while reading the transaction data via Athena. How would you implement this solution:

Use exclude pattern .metadata in the crawler definition to ignore the metadata

Use exclude pattern .metadata/ in the crawler definition to ignore the metadata
Use exclude pattern **.metadata in the crawler definition to ignore the metadata

It is not possible to ignore the metadata in crawler. Create a daily ETL job to transfer only the transaction data specific CSV files into a new location and then read this cleansed transaction data into Athena.

Use exclude pattern **.metadata in the crawler definition to ignore the metadata

Explanation
The correct option is to use the exclude pattern **.metadata in the crawler definition to ignore the metadata files. AWS Glue crawlers support exclude patterns.
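A hedged Boto3 sketch of a crawler definition with an exclude pattern; the crawler name, role ARN and database name are placeholders:

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="transactions-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="transactions_db",
    Targets={
        "S3Targets": [
            {
                "Path": "s3://mybucket/myfolder/",
                # Glob-style exclude pattern so the .metadata files are skipped
                "Exclusions": ["**.metadata"],
            }
        ]
    },
)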

39) A Sports Analytics Company wants to analyse the game-plays for the coming NBA season. They would like to track the movement of each athlete for post-game analysis. Which AWS service can they use to build a solution in the least possible time:

AWS Rekognition
Kinesis Video Streams
SageMaker Image Classification
Kinesis Data Stream with Lambda based video frame processing

AWS Rekognition

40) An upcoming music streaming service wants to build a Minimum Viable Product and would like to have the underlying music recommendation engine developed at the earliest with the least development effort. As an ML Specialist, which AWS service would you suggest for the music recommendation engine:

Amazon SageMaker Factorization Machines
Amazon SageMaker Neural Topic Model

Amazon SageMaker XGBoost
Amazon Personalize

Amazon Personalize

Explanation
Amazon Personalize is a machine learning service that makes it easy for developers to create individualized recommendations for customers using their applications. Other options require significant effort to train and test the models. Please read

41) You would like to use data in the protobuf recordIO format for training. What value should be passed as “ContentType” in the input data channel specification:

text/x-recordio-protobuf
application/recordio-protobuf
application/x-recordio-protobuf
text/recordio-protobuf

application/x-recordio-protobuf

Explanation
If you want to use data in the protobuf recordIO format for training, then content type should be set to application/x-recordio-protobuf in the input data channel specification.
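A brief sketch of an input data channel carrying this content type for a Boto3 create_training_job call; the channel name and S3 location are placeholder assumptions:

# Input channel declaring protobuf recordIO content.
input_data_config = [
    {
        "ChannelName": "train",
        "ContentType": "application/x-recordio-protobuf",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/train/",
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }
]
# Passed as the InputDataConfig argument of boto3 sagemaker client create_training_job(...).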

42) Identify SageMaker supervised learning algorithms that are NOT parallelizable (Select three):

Semantic Segmentation
Seq2Seq Modeling
DeepAR Forecasting
Object2Vec
Image Classification

Semantic Segmentation
Seq2Seq Modeling
Object2Vec

Explanation
Object2Vec, Semantic Segmentation and Seq2Seq Modeling are NOT parallelizable amongst the options given above.

43) You are creating a computer vision application to recognize truck brands. Your application uses Convolutional Neural Networks (CNN) but you do not have enough data to train the model. However, there are pre-trained third-party image recognition models available for similar tasks. What steps will you take to build your solution in the shortest possible duration:

Use transfer learning with Kinesis Video Streams
Use Kinesis Video Streams to identify the truck brand by using image manipulation algorithms and then do a pixel by pixel comparison
Use transfer learning in your CNN by using the pre-trained third-party image recognition model as the convolutional base. Then remove the original classifier from the pre-trained model and add the new classifier for recognizing truck brands.
Use transfer learning by retraining the pre-trained third-party image recognition model with your own data.

Use transfer learning in your CNN by using the pre-trained third-party image recognition model as the convolutional base. Then remove the original classifier from the pre-trained model and add the new classifier for recognizing truck brands.

Explanation
You cannot use transfer learning with Kinesis Video Streams. Retraining the pretrained model with your own data is not correct because you do not have enough data to train. The correct option is to use transfer learning in your CNN by using the pre-trained third-party image recognition model as the convolutional base. Then remove the original classifier from the pre-trained model and add the new classifier for recognizing truck brands.
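A short Keras sketch of this transfer-learning pattern; the choice of MobileNetV2 as the pre-trained base and the five output classes are illustrative assumptions:

import tensorflow as tf

# Frozen pre-trained convolutional base (MobileNetV2 stands in for the
# third-party image recognition model mentioned in the question).
base = tf.keras.applications.MobileNetV2(
    include_top=False, weights="imagenet", input_shape=(224, 224, 3), pooling="avg"
)
base.trainable = False  # keep the learned convolutional features

# New classifier head for recognizing truck brands (5 brands assumed here).
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(5, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(...) would then be called on the small truck-brand dataset.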

44) Which options are valid for the Training Input Mode parameter while using SageMaker algorithms (Select two):

Pipe
Token
File
Text

Pipe, File

Explanation
File or Pipe are the possible options. Text and Token are invalid.

45) The data engineering team at an ecommerce company wants to ingest the clickstream data from the source system in a reliable way. The solution should provide built-in performance benefits and ease of use on the client side. Which solution would you implement on the source system:

Kinesis Producer Library
Kinesis API
Kinesis Client Library
Spark Streaming

Kinesis Producer Library


Explanation
The Kinesis Producer Library provides built-in performance benefits and ease-of-use advantages.

46) An Analytics Consulting Firm wants you to review a Classification Model trained on historical data and deployed about 6 months ago. At the time of deployment the model performance was up to the mark. Post deployment, the model has not been retrained on the incremental data coming in every day. Now the model performance has gone down significantly. As an ML Specialist, what is your recommended course of action:

Completely retrain the model using the historical data along with the data for the last 6 months.
Completely retrain the model using only the data for the last 6 months
Change the algorithm behind the model for better performance.
Completely retrain the model again using only the historical data

Completely retrain the model using the historical data along with the data for the last 6 months.


Explanation
This is an example of model deterioration because the training data has aged. The solution is to retrain the model using the historical data along with the data for the last 6 months

47) Which is the best evaluation metric for a binary classification model:

AUC/ROC
Accuracy
Precision
F1-Score

AUC/ROC


Explanation
AUC/ROC is the correct choice. The AUC/ROC metric does not require you to set a classification threshold, and it is also useful when there is high class imbalance.
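A tiny scikit-learn illustration that AUC is computed directly from predicted scores, with no classification threshold required (the labels and scores are made up):

from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1, 0, 1]                    # made-up ground-truth labels
y_score = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9, 0.3, 0.7]  # made-up predicted probabilities

print(roc_auc_score(y_true, y_score))  # no threshold needs to be chosen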

48) As a security policy, the data science team at an ecommerce company does not want Amazon SageMaker to provide external network access to the training or inference containers, so network isolation is enabled for all containers. Identify the Amazon SageMaker containers that do not support network isolation, so the data science team does not use them for modeling (Select three):

Amazon SageMaker Reinforcement Learning
Scikit-learn
MXNet
PyTorch
TensorFlow

Amazon SageMaker Reinforcement Learning
Scikit-learn
PyTorch

Explanation
Network isolation is not supported by the following managed Amazon SageMaker containers as they require access to Amazon S3:
Chainer
PyTorch
Scikit-learn
Amazon SageMaker Reinforcement Learning

49) The data science team at an ecommerce company is working on a training dataset for a forecasting model. The dataset represents the sales data for the last 5 years and has the following features: item description, item price, order date, quantity ordered, shipping address, order amount. The team would like to uncover any cyclical sales patterns such as hourly, daily, weekly, monthly or yearly from this data. As an ML Specialist, what solution would you recommend:

Preprocess the order date to create new features such as hour of the day, day of the week, week of the month, week of the year, date of the month, month of the year and represent these features as (x, y) coordinates on a circle using sin and cos transformations. This transformed data should then be used to train the model.
Preprocess the order date to create new features such as hour of the day, day of the week, week of the month, week of the year, date of the month, month of the year and use these features in one-hot encoded format for training the model
No need for data preprocessing as the underlying algorithm can detect the cyclical patterns on its own
Preprocess the order date to create new features such as hour of the day, day of the week, week of the month, week of the year, date of the month, month of the year and use these features in label encoded format for training the model

Preprocess the order date to create new features such as hour of the day, day of the week, week of the month, week of the year, date of the month, month of the year and represent these features as (x, y) coordinates on a circle using sin and cos transformations. This transformed data should then be used to train the model.

Explanation
The best way to engineer the cyclical features is to represent these as (x,y) coordinates on a circle using sin and cos functions.
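A small pandas/NumPy sketch of this sin/cos encoding for the hour-of-day feature; the sample order dates are illustrative:

import numpy as np
import pandas as pd

df = pd.DataFrame({"order_date": pd.to_datetime(
    ["2023-01-05 09:30", "2023-01-05 23:10", "2023-01-06 00:05"])})

hour = df["order_date"].dt.hour
df["hour_sin"] = np.sin(2 * np.pi * hour / 24)
df["hour_cos"] = np.cos(2 * np.pi * hour / 24)
# On this circle, 23:00 and 00:00 land close together, which a plain numeric,
# label-encoded or one-hot-encoded hour cannot capture. The same transformation
# applies to day of week (/7), week of year (/52), month of year (/12), and so on.
print(df)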

50) Which of the following Amazon SageMaker built-in algorithms process the training data in recordIO-protobuf float32 format (Select two):

XGBoost
Semantic Segmentation
Factorization Machines
Linear Learner

Factorization Machines
Linear Learner