AWS Machine Learning Certification Pre-Test 1

Question 1:
John is working on a machine learning requirement in which he must design a system. The requirement is as follows:
A system designed to classify financial transactions as fraudulent or non-fraudulent produces the confusion matrix below. What is the recall of this model?

                     Actual Positive   Actual Negative
Predicted Positive         90                45
Predicted Negative         10                20

50%
74%
66.67%
90%

ANS- 90%

Explanation

Recall is defined as true positives / (true positives + false negatives), which works out to 90 / (90 + 10) in this example, or 90%. 66.67% is the precision, true positives / (true positives + false positives), or 90 / (90 + 45). Recall is an important metric when classes are highly imbalanced and the positive case is rare; accuracy tends to be misleading in those cases.
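
To make the arithmetic concrete, here is a minimal Python sketch computing both metrics from the matrix above:

```python
# Values taken from the confusion matrix in the question.
true_positives = 90   # predicted positive, actually positive
false_positives = 45  # predicted positive, actually negative
false_negatives = 10  # predicted negative, actually positive

recall = true_positives / (true_positives + false_negatives)
precision = true_positives / (true_positives + false_positives)

print(f"Recall:    {recall:.2%}")     # 90.00%
print(f"Precision: {precision:.2%}")  # 66.67%
```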

Question 2:
John is working on a machine learning requirement in which he must design a system.
The requirement is as follows:
You wish to use a SageMaker notebook within a VPC. SageMaker notebook instances are Internet-enabled, creating a potential security hole in your VPC. How would you use SageMaker within a VPC without opening up Internet access?
Disable direct Internet access when specifying the VPC for your notebook instance, and use VPC interface endpoints (PrivateLink) to allow the connections needed to train and host your model. Modify your instance’s security group to allow outbound connections for training and hosting.
No action is required; the VPC will block the notebook instance from accessing the Internet.
Uncheck the option for Internet access when creating your notebook instance, and it will handle the rest automatically.
Use IAM to restrict Internet access from the notebook instance.

ANS- Disable direct Internet access when specifying the VPC for your notebook instance, and use VPC interface endpoints (PrivateLink) to allow the connections needed to train and host your model. Modify your instance’s security group to allow outbound connections for training and hosting.

Explanation
This is covered under “Infrastructure Security” in the SageMaker developer guide. You really do need to read all 1,000+ pages of it and study it in order to ace this certification.
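
A minimal boto3 sketch of the key step, assuming placeholder subnet, security group, and role IDs for your own VPC resources:

```python
import boto3

sagemaker = boto3.client("sagemaker")

# The key setting is DirectInternetAccess="Disabled": the notebook then
# reaches SageMaker APIs only through the VPC interface endpoints
# (PrivateLink) you have created. All resource IDs below are placeholders.
sagemaker.create_notebook_instance(
    NotebookInstanceName="vpc-private-notebook",
    InstanceType="ml.t3.medium",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder
    SubnetId="subnet-0123456789abcdef0",        # private subnet in your VPC
    SecurityGroupIds=["sg-0123456789abcdef0"],  # must allow outbound to the endpoints
    DirectInternetAccess="Disabled",
)
```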

Question 3:
John is working on a machine learning requirement in which he must design a system.
The requirement is as follows:
You are training an XGBoost model on SageMaker with millions of rows of training data, and you wish to use Apache Spark to pre-process this data at scale. What is the simplest architecture that achieves this?
Use Sparkmagic to pre-process your data within a SageMaker notebook, transform the resulting Spark DataFrames into RecordIO format, and then use Spark’s XGBoost algorithm to train the model.
Use Amazon EMR to pre-process your data using Spark, and use the same EMR instances to host your SageMaker notebook.
Use Amazon EMR to pre-process your data using Spark, and then use AWS Data Pipelines to transfer the processed training data to SageMaker
Use sagemaker_pyspark and XGBoostSageMakerEstimator to pre-process, train, and host your model with Spark on SageMaker.

ANS- Use sagemaker_pyspark and XGBoostSageMakerEstimator to pre-process, train, and host your model with Spark on SageMaker.
Explanation
The SageMakerEstimator classes allow tight integration between Spark and SageMaker for several models, including XGBoost, and offer the simplest solution. You can’t deploy SageMaker to an EMR cluster, and XGBoost actually requires LibSVM or CSV input, not RecordIO.
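
A minimal sketch of that integration, assuming a placeholder role ARN and S3 path, and a DataFrame already shaped into the "label" and "features" columns the estimator expects:

```python
from pyspark.sql import SparkSession
import sagemaker_pyspark
from sagemaker_pyspark import IAMRole
from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator

# Put the SageMaker-Spark JARs on the Spark classpath.
spark = (SparkSession.builder
         .config("spark.jars", ",".join(sagemaker_pyspark.classpath_jars()))
         .getOrCreate())

# Pre-process with ordinary Spark transformations; the result is assumed
# to have "label" and "features" (Vector) columns. Path is a placeholder.
df = spark.read.parquet("s3://my-bucket/transactions/")

estimator = XGBoostSageMakerEstimator(
    sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/SageMakerRole"),  # placeholder
    trainingInstanceType="ml.m5.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m5.xlarge",
    endpointInitialInstanceCount=1,
)
estimator.setNumRound(50)

model = estimator.fit(df)          # training runs on SageMaker, not the Spark cluster
predictions = model.transform(df)  # inference calls the hosted SageMaker endpoint
```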

Question 4:
John is working on a machine learning requirement in which he must design a system.
The requirement is as follows:
After 100 epochs of training, your deep neural network achieves high accuracy on the training data but lower accuracy on the test data, suggesting the model is overfitting. Which TWO techniques may help resolve this problem?
Use early stopping
Use dropout regularization
Employ gradient checking
Use more layers in the network
Use more features in the training data

ANS- Use early stopping
Use dropout regularization

Explanation

Early stopping is a simple technique for preventing a neural network from training too far and learning patterns in the training data that can’t be generalized. Dropout regularization forces the learning to be spread out amongst the artificial neurons, further preventing overfitting. Removing layers, rather than adding them, might also help prevent an overly complex model from being created, as would using fewer features rather than more.
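
A minimal Keras sketch showing both techniques together, using synthetic data purely for illustration:

```python
import numpy as np
import tensorflow as tf

# Synthetic binary-classification data for illustration only.
x = np.random.rand(1000, 20).astype("float32")
y = (x.sum(axis=1) > 10).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu", input_shape=(20,)),
    tf.keras.layers.Dropout(0.5),   # dropout regularization: drops 50% of units each step
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early stopping: halt training once validation loss stops improving,
# and keep the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True)

model.fit(x, y, validation_split=0.2, epochs=100, callbacks=[early_stop])
```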

Question 5:
John is working on a machine learning requirement in which he must design a system.
The requirement is as follows:
You are developing a machine learning model to predict house sale prices based on features of a house. 10% of the houses in your training data are missing the number of square feet in the home. Your training data set is not very large. Which technique would allow you to train your model while achieving the highest accuracy?
Impute the missing values using deep learning, based on other features such as number of bedrooms
Impute the missing square footage values using kNN
Impute the missing values using the mean square footage of all homes
Drop all rows that contain missing data

ANS- Impute the missing square footage values using kNN

Explanation
Deep learning is better suited to the imputation of categorical data. Square footage is numerical, which is better served by kNN. Simply dropping rows with missing data or imputing the mean square footage would be much easier, but neither will produce the best results.
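
A minimal sketch of kNN imputation with scikit-learn's KNNImputer, using toy housing rows for illustration:

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy data: [square_feet, bedrooms, bathrooms]; np.nan marks the
# missing square footage values.
X = np.array([
    [1500.0, 3, 2],
    [np.nan, 3, 2],   # missing square footage
    [2200.0, 4, 3],
    [np.nan, 4, 3],
    [ 900.0, 2, 1],
])

# Each missing value is filled using the mean of its 2 nearest
# neighbors, measured on the remaining (non-missing) features.
imputer = KNNImputer(n_neighbors=2)
print(imputer.fit_transform(X))
```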

Question 6:
John is working on a machine learning requirement in which he must design a system.
The requirement is as follows:
You are developing an autonomous vehicle that must classify images of street signs with extremely low latency, processing thousands of images per second. What AWS-based architecture would best meet this need?
Use Amazon Rekognition on AWS DeepLens to identify specific street signs in a self-contained manner.
Use Amazon Rekognition in edge mode.
Develop your classifier using SageMaker Object Detection, and use Elastic Inference to accelerate the model’s endpoints, which are called over the air from the vehicle.
Develop your classifier with TensorFlow, compile it for an NVIDIA Jetson edge device using SageMaker Neo, and run it on the edge with AWS IoT Greengrass.

ANS- Develop your classifier with TensorFlow, compile it for an NVIDIA Jetson edge device using SageMaker Neo, and run it on the edge with AWS IoT Greengrass.
Explanation
SageMaker Neo is designed to compile models built with TensorFlow and other frameworks for edge devices such as the NVIDIA Jetson. The low-latency requirement calls for an edge solution, where the classification is done within the vehicle itself rather than over the air. Rekognition (which doesn’t have an “edge mode,” but does integrate with DeepLens) can’t handle the very specific classification task of identifying different street signs and what they mean.
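
A hedged sketch of the Neo compilation step via boto3; the bucket, role, and input shape below are placeholder assumptions:

```python
import boto3

sagemaker = boto3.client("sagemaker")

# Compile a trained TensorFlow model for a Jetson target with SageMaker
# Neo. All S3 paths and ARNs are placeholders.
sagemaker.create_compilation_job(
    CompilationJobName="street-sign-classifier-neo",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    InputConfig={
        "S3Uri": "s3://my-bucket/models/street-signs/model.tar.gz",  # placeholder
        "DataInputConfig": '{"input": [1, 224, 224, 3]}',  # assumed input shape
        "Framework": "TENSORFLOW",
    },
    OutputConfig={
        "S3OutputLocation": "s3://my-bucket/compiled/",  # placeholder
        "TargetDevice": "jetson_xavier",  # NVIDIA Jetson edge target
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)
# The compiled artifact is then deployed to the vehicle with
# AWS IoT Greengrass for fully on-device inference.
```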

Question 7:
John is working on a machine learning requirement in which he must design a system.
The requirement is as follows:
Your company wishes to monitor social media, and perform sentiment analysis on Tweets to classify them as positive or negative sentiment. You are able to obtain a data set of past Tweets about your company to use as training data for a machine learning system, but they are not classified as positive or negative. How would you build such a system?
Use SageMaker Ground Truth to label past Tweets as positive or negative, and use those labels to train a neural network on SageMaker.
Use RANDOM_CUT_FOREST to automatically identify negative tweets as outliers.
Stream both old and new tweets into an Amazon Elasticsearch Service cluster, and use Elasticsearch machine learning to classify the tweets.
Use Amazon Machine Learning with a binary classifier to assign positive or negative sentiments to the past Tweets, and use those labels to train a neural network on an EMR cluster.

ANS- Use SageMaker Ground Truth to label past Tweets as positive or negative, and use those labels to train a neural network on SageMaker.
Explanation
A machine learning system needs labeled data to train itself with; there’s no getting around that. Only the Ground Truth answer produces the positive or negative labels we need, by using humans to create that training data initially. Another solution would be to use natural language processing through a service such as Amazon Comprehend.
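
The Comprehend alternative mentioned above can be illustrated with a short boto3 sketch; no labeled training data is required:

```python
import boto3

comprehend = boto3.client("comprehend")

# Amazon Comprehend returns a sentiment label (POSITIVE, NEGATIVE,
# NEUTRAL, or MIXED) plus per-class confidence scores.
response = comprehend.detect_sentiment(
    Text="The new release is fantastic, great job!",
    LanguageCode="en",
)
print(response["Sentiment"])        # e.g. "POSITIVE"
print(response["SentimentScore"])   # confidence per class
```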

Question 8:
John is working on a machine learning requirement in which he must design a system.
The requirement is as follows:
A large news website needs to produce personalized recommendations for articles to its readers, by training a machine learning model on a daily basis using historical click data. The influx of this data is fairly constant, except during major elections when traffic to the site spikes considerably. Which system would provide the most cost-effective and reliable solution?
Publish click data into Amazon S3 using Kinesis Firehose, and process the data nightly using Apache Spark and MLLib using spot instances in an EMR cluster. Publish the model’s results to DynamoDB for producing recommendations in real-time.
Publish click data into Amazon Elasticsearch using Kinesis Firehose, and query the Elasticsearch data to produce recommendations in real-time.
Publish click data into Amazon S3 using Kinesis Firehose, and process the data nightly using Apache Spark and MLLib using reserved instances in an EMR cluster. Publish the model’s results to DynamoDB for producing recommendations in real-time.
Publish click data into Amazon S3 using Kinesis Streams, and process the data in real time using Splunk on an EMR cluster with spot instances added as needed. Publish the model’s results to DynamoDB for producing recommendations in real-time.

ANS- Publish click data into Amazon S3 using Kinesis Firehose, and process the data nightly using Apache Spark and MLLib using spot instances in an EMR cluster. Publish the model’s results to DynamoDB for producing recommendations in real-time.
Explanation
The use of spot instances in response to anticipated surges in usage is the most cost-effective approach for scaling up an EMR cluster. Kinesis Streams would be over-engineering because there is no real-time streaming requirement. Elasticsearch doesn’t make sense because it is not a recommender engine.
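
A minimal sketch of the nightly Spark/MLlib batch job, assuming placeholder S3 paths and column names (with integer IDs) for the Firehose-delivered click data:

```python
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("nightly-recommendations").getOrCreate()

# Click data delivered to S3 by Kinesis Firehose; path and columns
# (integer user_id/article_id, numeric click_count) are assumptions.
clicks = spark.read.json("s3://my-bucket/clickstream/")

als = ALS(
    userCol="user_id", itemCol="article_id", ratingCol="click_count",
    implicitPrefs=True,        # clicks are implicit feedback, not explicit ratings
    coldStartStrategy="drop",  # skip users/items unseen during training
)
model = als.fit(clicks)

# Top 10 article recommendations per reader, ready to write to DynamoDB.
recommendations = model.recommendForAllUsers(10)
```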

Question 9:
John is working on a machine learning requirement in which he must design a system.
The requirement is as follows:
You are developing a computer vision system that can classify every pixel in an image based on its image type, such as people, buildings, roadways, signs, and vehicles. Which SageMaker algorithm would provide you with the best starting point for this problem?
Rekognition
Object Detection
Semantic Segmentation
Object2Vec

ANS- Semantic Segmentation
Explanation
Semantic Segmentation produces segmentation masks that identify classifications for each individual pixel in an image. It uses MXNet and the ResNet architecture to do this.
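
A hedged sketch of launching the built-in algorithm with the SageMaker Python SDK; the role, bucket, and hyperparameter values are placeholder assumptions:

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
image = image_uris.retrieve("semantic-segmentation", session.boto_region_name)

estimator = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,
    instance_type="ml.p3.2xlarge",          # the algorithm trains on GPU instances
    output_path="s3://my-bucket/output/",   # placeholder
    sagemaker_session=session,
)
estimator.set_hyperparameters(
    backbone="resnet-50",   # the ResNet backbone mentioned above
    algorithm="fcn",
    num_classes=5,          # people, buildings, roadways, signs, vehicles
    epochs=10,
)
# estimator.fit({"train": ..., "validation": ...})  # channel inputs omitted here
```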

Question 10:
John is working on a machine learning requirement in which he must design a system.
The requirement is as follows:
Your automatic hyperparameter tuning job in SageMaker is consuming more resources than you would like, at a high cost. Which TWO techniques might reduce this cost?
Use more concurrency while tuning
Use logarithmic scales on your parameter ranges
Use linear scales on your parameter ranges
Use less concurrency while tuning
Use inference pipelines

ANS- Use logarithmic scales on your parameter ranges
Use less concurrency while tuning
Explanation
Since the tuning process learns from each incremental step, too much concurrency can actually hinder that learning. Logarithmic ranges tend to find optimal values more quickly than linear ranges when a parameter spans several orders of magnitude. Inference pipelines are a real feature, but have nothing to do with this problem.
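
Both ideas in a short SageMaker Python SDK sketch, assuming an `estimator` has already been defined (e.g. the XGBoost estimator from Question 3):

```python
from sagemaker.tuner import (HyperparameterTuner, ContinuousParameter,
                             IntegerParameter)

# Logarithmic scaling on a range spanning three orders of magnitude,
# plus a low max_parallel_jobs so each round of the Bayesian search can
# learn from the rounds before it.
hyperparameter_ranges = {
    "eta": ContinuousParameter(0.001, 1.0, scaling_type="Logarithmic"),
    "max_depth": IntegerParameter(3, 10),
}

tuner = HyperparameterTuner(
    estimator=estimator,             # assumed to be defined already
    objective_metric_name="validation:auc",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=20,
    max_parallel_jobs=2,             # less concurrency, better learning per job
)
# tuner.fit({"train": ..., "validation": ...})  # channel inputs omitted here
```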