AWS Machine Learning Q&A

We have published 100+ questions and their answers. Click on a question to see its answer. Let's practice, and navigate to the quiz set if you want more practice. The 100+ questions are divided across separate pages, so go to the next page when you reach the end of each one. Or, if you want to take the quiz directly, please click on the link below and start.

AWS MACHINE LEARNING QUIZ

QUESTION AND ANSWER SESSION:

1) Question 1: 
John is working on a machine learning requirement in which he has to design a system. The requirement is as follows:
A system designed to classify financial transactions into fraudulent and non-fraudulent transactions results in the confusion matrix below. What is the recall of this model?

[Confusion matrix not reproduced here; based on the explanation, it contains 90 true positives, 10 false negatives, and 45 false positives.]

​50%
74%
66.67%
90%

ANS: 90%

Explanation

Recall is defined as true positives / (true positives + false negatives). This works out to 90/(90+10) in this example, or 90%. 66.67% is the precision (true positives / (true positives + false positives)). Recall is an important metric in situations where classifications are highly imbalanced and the positive case is rare; accuracy tends to be misleading in these cases.
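
To make the arithmetic concrete, here is a minimal sketch in plain Python that computes recall and precision from these counts (the false-positive count of 45 is derived from the 66.67% precision figure above):

# Confusion matrix counts taken from the explanation
tp = 90   # fraudulent transactions correctly flagged
fn = 10   # fraudulent transactions missed
fp = 45   # legitimate transactions incorrectly flagged

recall = tp / (tp + fn)       # 90 / 100 = 0.90   -> 90%
precision = tp / (tp + fp)    # 90 / 135 = 0.6667 -> 66.67%

print(f"Recall:    {recall:.2%}")
print(f"Precision: {precision:.2%}")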

2) Question 2:
John is working on a machine learning requirement in which he has to design a system.
The requirement is as follows:
You wish to use a SageMaker notebook within a VPC. SageMaker notebook instances are Internet-enabled, creating a potential security hole in your VPC. How would you use SageMaker within a VPC without opening up Internet access?
​Disable direct Internet access when specifying the VPC for your notebook instance, and use VPC interface endpoints (PrivateLink) to allow the connections needed to train and host your model. Modify your instance’s security group to allow outbound connections for training and hosting.
​No action is required, the VPC will block the notebook instances from accessing the Internet.
Uncheck the option for Internet access when creating your notebook instance, and it will handle the rest automatically.
Use IAM to restrict Internet access from the notebook instance.

ANS- Disable direct Internet access when specifying the VPC for your notebook instance, and use VPC interface endpoints (PrivateLink) to allow the connections needed to train and host your model. Modify your instance’s security group to allow outbound connections for training and hosting.

Explanation
This is covered under “Infrastructure Security” in the SageMaker developer guide. You really do need to read all 1,000+ pages of it and study it in order to ace this certification.
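
As a rough sketch of the same idea in code (not an official recipe), you could create the notebook instance with direct internet access disabled using boto3; the name, role ARN, subnet, and security group below are placeholders you would replace with your own:

import boto3

sm = boto3.client("sagemaker")

# Notebook instance inside the VPC with direct internet access disabled.
# Training/hosting traffic then flows through VPC interface endpoints (PrivateLink)
# and whatever outbound rules the security group allows.
sm.create_notebook_instance(
    NotebookInstanceName="vpc-only-notebook",                  # placeholder name
    InstanceType="ml.t3.medium",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",    # placeholder role
    SubnetId="subnet-0123456789abcdef0",                       # placeholder subnet in your VPC
    SecurityGroupIds=["sg-0123456789abcdef0"],                 # allow outbound to the endpoints
    DirectInternetAccess="Disabled",
)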

Question 3:
John is working on a machine learning requirement in which he has to design a system.
The requirement is as follows:
You are training an XGBoost model on SageMaker with millions of rows of training data, and you wish to use Apache Spark to pre-process this data at scale. What is the simplest architecture that achieves this?
​Use Sparkmagic to pre-process your data within a SageMaker notebook, transform the resulting Spark DataFrames into RecordIO format, and then use Spark’s XGBoost algorithm to train the model.
Use Amazon EMR to pre-process your data using Spark, and use the same EMR instances to host your SageMaker notebook.
Use Amazon EMR to pre-process your data using Spark, and then use AWS Data Pipelines to transfer the processed training data to SageMaker
Use sagemaker_pyspark and XGBoostSageMakerEstimator to use Spark to pre-process, train, and host your model using Spark on SageMaker.

ANS-
Use sagemaker_pyspark and XGBoostSageMakerEstimator to use Spark to pre-process, train, and host your model using Spark on SageMaker.
Explanation
The SageMakerEstimator classes allow tight integration between Spark and SageMaker for several models including XGBoost, and offer the simplest solution. You can’t deploy SageMaker to an EMR cluster, and XGBoost actually requires LibSVM or CSV input, not RecordIO.
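
As a hedged illustration of what this looks like in practice (based on the sagemaker_pyspark examples; the role ARN, S3 path, and instance types are placeholders, and training_df stands in for your pre-processed Spark DataFrame):

from pyspark.sql import SparkSession
from sagemaker_pyspark import IAMRole, classpath_jars
from sagemaker_pyspark.algorithms import XGBoostSageMakerEstimator

# Spark session with the SageMaker-Spark jars on the classpath
spark = (SparkSession.builder
         .config("spark.driver.extraClassPath", ":".join(classpath_jars()))
         .getOrCreate())

# Pre-processed training data (placeholder path); XGBoost accepts LibSVM or CSV
training_df = spark.read.format("libsvm").load("s3://my-bucket/train/")

estimator = XGBoostSageMakerEstimator(
    sagemakerRole=IAMRole("arn:aws:iam::123456789012:role/SageMakerRole"),  # placeholder
    trainingInstanceType="ml.m5.xlarge",
    trainingInstanceCount=1,
    endpointInstanceType="ml.m5.large",
    endpointInitialInstanceCount=1,
)
estimator.setNumRound(25)

# fit() runs a SageMaker training job and hosts the resulting model on an endpoint
model = estimator.fit(training_df)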

Question 4:
John is working on a machine learning requirement in which he has to design a system.
The requirement is as follows:
After training a deep neural network over 100 epochs, it achieved high accuracy on your training data, but lower accuracy on your test data, suggesting the resulting model is overfitting. What are TWO techniques that may help resolve this problem?
​Use early stopping
Use dropout regularization
Employ gradient checking
Use more layers in the network
Use more features in the training data

ANS- Use early stopping
Use dropout regularization

Explanation

Early stopping is a simple technique for preventing neural networks from training too far, and learning patterns in the training data that can’t be generalized. Dropout regularization forces the learning to be spread out amongst the artificial neurons, further preventing overfitting. Removing layers, rather than adding them, might also help prevent an overly complex model from being created – as would using fewer features, not more.
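
A minimal Keras sketch showing both techniques together (the data here is random, purely for illustration):

import numpy as np
from tensorflow.keras import layers, models, callbacks

# Toy data standing in for your real training/validation split
x_train, y_train = np.random.rand(1000, 20), np.random.randint(0, 2, 1000)
x_val, y_val = np.random.rand(200, 20), np.random.randint(0, 2, 200)

model = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(20,)),
    layers.Dropout(0.5),                    # dropout regularization
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Early stopping halts training once validation loss stops improving
early_stop = callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True)

model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=100, callbacks=[early_stop])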

Question 5:
John is working on a machine learning requirement in which he has to design a system.
The requirement is as follows:
You are developing a machine learning model to predict house sale prices based on features of a house. 10% of the houses in your training data are missing the number of square feet in the home. Your training data set is not very large. Which technique would allow you to train your model while achieving the highest accuracy?
​Impute the missing values using deep learning, based on other features such as number of bedrooms
Impute the missing square footage values using kNN
Impute the missing values using the mean square footage of all homes
Drop all rows that contain missing data

ANS- Impute the missing square footage values using kNN

Explanation
Deep learning is better suited to the imputation of categorical data. Square footage is numerical, which is better served by kNN. While simply dropping rows of missing data or using the mean values are a lot easier, they won’t produce the best results.
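
For instance, a small scikit-learn sketch of kNN imputation (the column names and values are made up for illustration):

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Toy housing data; 'square_feet' has missing values
df = pd.DataFrame({
    "square_feet": [1500, np.nan, 2200, np.nan, 1800],
    "bedrooms":    [3, 2, 4, 3, 3],
    "bathrooms":   [2, 1, 3, 2, 2],
})

# Each missing value is filled in from the k most similar rows (by the other features)
imputer = KNNImputer(n_neighbors=2)
df[df.columns] = imputer.fit_transform(df)
print(df)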

Question 6:
John is working on a machine learning requirement in which he has to design a system.
The requirement is as follows:
You are developing an autonomous vehicle that must classify images of street signs with extremely low latency, processing thousands of images per second. What AWS-based architecture would best meet this need?
Use Amazon Rekognition on AWS DeepLens to identify specific street signs in a self-contained manner.
​Use Amazon Rekognition in edge mode
​Develop your classifier using SageMaker Object Detection, and use Elastic Inference to accelerate the model’s endpoints called over the air from the vehicle.
​Develop your classifier with TensorFlow, and compile it for an NVIDIA Jetson edge device using SageMaker Neo, and run it on the edge with IoT GreenGrass.

ANS- Develop your classifier with TensorFlow, and compile it for an NVIDIA Jetson edge device using SageMaker Neo, and run it on the edge with IoT GreenGrass.
Explanation
SageMaker Neo is designed for compiling models using TensorFlow and other frameworks to edge devices such as Nvidia Jetson. The low latency requirement requires an edge solution, where the classification is being done within the vehicle itself and not over the air. Rekognition (which doesn’t have an “edge mode,” but does integrate with DeepLens) can’t handle the very specific classification task of identifying different street signs and what they mean.
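
A hedged boto3 sketch of the compilation step (the job name, role ARN, S3 paths, input shape, and the exact Jetson target are all placeholders; check the CreateCompilationJob documentation for the values that match your model):

import boto3

sm = boto3.client("sagemaker")

sm.create_compilation_job(
    CompilationJobName="street-sign-classifier-neo",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerRole",
    InputConfig={
        "S3Uri": "s3://my-bucket/models/street-signs/model.tar.gz",
        "DataInputConfig": '{"input_1": [1, 224, 224, 3]}',   # input tensor shape
        "Framework": "TENSORFLOW",
    },
    OutputConfig={
        "S3OutputLocation": "s3://my-bucket/models/street-signs/compiled/",
        "TargetDevice": "jetson_xavier",                      # NVIDIA Jetson target
    },
    StoppingCondition={"MaxRuntimeInSeconds": 900},
)

The compiled artifact is then deployed to the device with AWS IoT Greengrass so inference runs locally in the vehicle.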

Question 7:
John is working on a machine learning requirement in which he has to design a system.
The requirement is as follows:
Your company wishes to monitor social media, and perform sentiment analysis on Tweets to classify them as positive or negative sentiment. You are able to obtain a data set of past Tweets about your company to use as training data for a machine learning system, but they are not classified as positive or negative. How would you build such a system?
​Use SageMaker Ground Truth to label past Tweets as positive or negative, and use those labels to train a neural network on SageMaker.
​Use RANDOM_CUT_FOREST to automatically identify negative tweets as outliers.
​Stream both old and new tweets into an Amazon Elasticsearch Service cluster, and use Elasticsearch machine learning to classify the tweets.
​Use Amazon Machine Learning with a binary classifier to assign positive or negative sentiments to the past Tweets, and use those labels to train a neural network on an EMR cluster.

ANS- Use SageMaker Ground Truth to label past Tweets as positive or negative, and use those labels to train a neural network on SageMaker.
Explanation
A machine learning system needs labeled data to train itself with; there’s no getting around that. Only the Ground Truth answer produces the positive or negative labels we need, by using humans to create that training data initially. Another solution would be to use natural language processing through a service such as Amazon Comprehend.
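
If you went the Comprehend route mentioned above, the sentiment call itself is a one-liner (the tweet text is a placeholder):

import boto3

comprehend = boto3.client("comprehend")

tweet = "Really impressed with the latest release!"   # placeholder text

resp = comprehend.detect_sentiment(Text=tweet, LanguageCode="en")
print(resp["Sentiment"])         # POSITIVE, NEGATIVE, NEUTRAL, or MIXED
print(resp["SentimentScore"])    # confidence score for each class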

Question 8:
John is working on a machine learning requirement in which he has to design a system.
The requirement is as follows:
A large news website needs to produce personalized recommendations for articles to its readers, by training a machine learning model on a daily basis using historical click data. The influx of this data is fairly constant, except during major elections when traffic to the site spikes considerably. Which system would provide the most cost-effective and reliable solution?
​Publish click data into Amazon S3 using Kinesis Firehose, and process the data nightly using Apache Spark and MLLib using spot instances in an EMR cluster. Publish the model’s results to DynamoDB for producing recommendations in real-time.
​Publish click data into Amazon Elasticsearch using Kinesis Firehose, and query the Elasticsearch data to produce recommendations in real-time.
​Publish click data into Amazon S3 using Kinesis Firehose, and process the data nightly using Apache Spark and MLLib using reserved instances in an EMR cluster. Publish the model’s results to DynamoDB for producing recommendations in real-time.
​Publish click data into Amazon S3 using Kinesis Streams, and process the data in real time using Splunk on an EMR cluster with spot instances added as needed. Publish the model’s results to DynamoDB for producing recommendations in real-time.

ANS- Publish click data into Amazon S3 using Kinesis Firehose, and process the data nightly using Apache Spark and MLLib using spot instances in an EMR cluster. Publish the model’s results to DynamoDB for producing recommendations in real-time.
Explanation
Using spot instances in response to anticipated surges in usage is the most cost-effective approach for scaling up an EMR cluster. Kinesis Streams would be over-engineering because there is no real-time streaming requirement, and Elasticsearch doesn’t make sense because it is not a recommender engine.
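
A rough boto3 sketch of the nightly transient EMR cluster with spot capacity (the cluster name, instance counts, and step script path are placeholders; exact sizing would depend on your data volume):

import boto3

emr = boto3.client("emr")

emr.run_job_flow(
    Name="nightly-recommendation-training",
    ReleaseLabel="emr-6.10.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge",
             "InstanceCount": 1, "Market": "ON_DEMAND"},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge",
             "InstanceCount": 4, "Market": "SPOT"},      # cheap capacity for spikes
        ],
        "KeepJobFlowAliveWhenNoSteps": False,             # terminate when the job ends
    },
    Steps=[{
        "Name": "train-recommender",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "s3://my-bucket/jobs/train_recommender.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)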

Question 9:
John is working on a machine learning requirement in which he has to design a system.
The requirement is as follows:
You are developing a computer vision system that can classify every pixel in an image based on its image type, such as people, buildings, roadways, signs, and vehicles. Which SageMaker algorithm would provide you with the best starting point for this problem?
​Rekognition
​Object Detection
​Semantic Segmentation
​Object2Vec

ANS- Semantic Segmentation
Explanation
Semantic Segmentation produces segmentation masks that identify classifications for each individual pixel in an image. It uses MXNet and the ResNet architecture to do this.

Question 10:
John is working on a machine learning requirement in which he has to design a system.
The requirement is as follows:
Your automatic hyperparameter tuning job in SageMaker is consuming more resources than you would like, and coming at a high cost. What are TWO techniques that might reduce this cost?
​Use more concurrency while tuning
Use logarithmic scales on your parameter ranges
Use linear scales on your parameter ranges
Use less concurrency while tuning
Use inference pipelines

ANS- Use logarithmic scales on your parameter ranges
Use less concurrency while tuning
Explanation
Since the tuning process learns from each incremental step, too much concurrency can actually hinder that learning. Logarithmic ranges tend to find optimal values more quickly than linear ranges. Inference pipelines are a thing, but have nothing to do with this problem.
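
A short SageMaker Python SDK sketch of both ideas (the estimator, metric name, and S3 paths are assumed placeholders):

from sagemaker.tuner import HyperparameterTuner, ContinuousParameter

# xgb_estimator is assumed to be an already-configured SageMaker XGBoost estimator
hyperparameter_ranges = {
    "eta":   ContinuousParameter(0.001, 0.5, scaling_type="Logarithmic"),
    "alpha": ContinuousParameter(0.01, 10, scaling_type="Logarithmic"),
}

tuner = HyperparameterTuner(
    estimator=xgb_estimator,
    objective_metric_name="validation:auc",
    hyperparameter_ranges=hyperparameter_ranges,
    max_jobs=20,
    max_parallel_jobs=2,   # low concurrency lets each round learn from earlier results
)
tuner.fit({"train": "s3://my-bucket/train/", "validation": "s3://my-bucket/val/"})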

Question 11
You are deploying SageMaker inside a VPC, but the Internet access from SageMaker notebooks is considered a security hole. How might you address this?

Ans- Disable direct internet access and set up a NAT gateway.

Question 12
When constructing your own training container for SageMaker, where should the actual training script files be stored?

Ans- /opt/ml/code/

Question 13
Where will your deployment code for inference be stored?

Ans- /opt/ml/model/

Question 14
You are training an XGBoost model on SageMaker with millions of rows of training data, and you wish to use Apache Spark to pre-process this data at scale. What is the simplest architecture that achieves this?

Ans:- ​Use sagemaker_pyspark and XGBoostSageMakerEstimator to use Spark to pre-process, train, and host your model using Spark on SageMaker

Question 15
In a project it is important to manage storage size. Suppose we have to delete old data in S3. What is the simplest way to automate the archiving or deletion of old data in S3?

Ans : Use S3 Lifecycle Rules
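
For example, a minimal boto3 lifecycle rule (bucket name, prefix, and day counts are placeholders) that archives objects to Glacier and later deletes them:

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-training-data-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-then-expire-old-data",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],  # archive after 90 days
            "Expiration": {"Days": 365},                               # delete after a year
        }]
    },
)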

Question 16
As an AWS machine learning architect, you have been tasked with collecting thousands of PDFs for building a large corpus dataset. The data within this dataset would be considered what type of data?

Ans- Unstructured

17) Are PDFs considered unstructured data?

Ans-Yes. Since PDFs have no real structure to them, like key-value pairs or column names, they are considered unstructured data.

18)A Kinesis Data Stream’s capacity is provisioned by shards. What is the maximum throughput of a single shard?

Ans: 1 MB/s (or 1,000 messages per second) per shard.

19) Your organization cloudvikas.com has given you several different sets of key-value pair JSON files that need to be used for a machine learning project within AWS. What type of data is this classified as, and where is the best place to load this data?

Ans : Semi-structured data, stored in S3.

20) Is key-value pair JSON data considered semi-structured?

Ans : Key-value pair JSON data is considered semi-structured because it doesn’t have a defined structure but does have some structural properties. If our data is going to be used for a machine learning project in AWS, we need to find a way to get that data into S3.

21) In a real-time system, we have to analyze data quickly so we can act on it. Which AWS service is suitable for connecting video data from cameras to backend systems to analyze that data in real time?

Ans: Kinesis Video Streams

22) John is an ML specialist who is setting up an ML pipeline for a project. The amount of data he has is massive and needs to be managed on a distributed system to efficiently run processing and analytics. He plans to use tools like Apache Spark to process the data and get it ready for the ML pipeline. Which setup and services can most easily help him achieve this?

Ans : Elastic MapReduce (EMR) with Apache Spark installed.

23) What is the underlying platform for Glue ETL?

Ans: A serverless Apache Spark platform

24) Which AWS service allows you to set up a distributed Hadoop cluster to process, transform, and analyze large amounts of data?

Ans : Amazon’s EMR allows you to set up a distributed Hadoop cluster to process, transform, and analyze large amounts of data. Apache Spark is a processing framework and programming model that helps you do machine learning, stream processing, or graph analytics using Amazon EMR clusters.

25) As a Spark engineer, you have created a dataset. Within your dataset, what is the minimum number of observations you should have compared to the number of features?

Ans – 10 times as many observations as features.

26) Sometimes we need to query S3 data quickly. In that case, which AWS data store provides a highly scalable data warehouse that can query an S3 data lake directly?

Ans: Amazon Redshift

27) In a project, the client has asked you to analyze log files. You are an ML specialist within an organization and you have to run SQL queries and analytics on thousands of Apache log files stored in S3. Your organization already uses Redshift as its data warehousing solution. Which tool can help you achieve this with the LEAST amount of effort?

Ans: Redshift Spectrum

28) Which kind of graph is best suited for visualizing outliers in data?

Ans: Box & whisker plots
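
A quick matplotlib sketch; the data is synthetic with a few injected outliers, which show up as points beyond the whiskers:

import numpy as np
import matplotlib.pyplot as plt

values = np.concatenate([np.random.normal(50, 5, 200), [95, 110, 2]])

plt.boxplot(values)
plt.title("Box & whisker plot")
plt.show()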

29) Suppose a client has asked you to build machine learning models. Which Amazon service allows you to build a high-quality labeled training dataset for your machine learning models? This includes human workers, vendor companies that you choose, or an internal, private workforce.

Ans : SageMaker Ground Truth

30)What sort of data distribution would be relevant to flipping heads or tails on a coin?

Ans: Binomial distribution
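
For instance, simulating coin flips with NumPy's binomial sampler:

import numpy as np

# 10 coin flips (n=10, p=0.5), repeated 100,000 times; count the heads each time
heads = np.random.binomial(n=10, p=0.5, size=100_000)

# Empirical probability of exactly 5 heads (the exact binomial value is about 0.246)
print((heads == 5).mean())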

31)What is a serverless, fully-managed solution for querying unstructured data in S3?

Ans: Athena

32)How will you automate the labeling process?

Ans: You could use Jupyter Notebooks or Lambda to help automate the labeling process, but SageMaker Ground Truth is specifically used for building high-quality training datasets.

33) What are the features of Amazon QuickSight’s Machine Learning Insights?

Ans: Anomaly detection, forecasting, and auto-narratives

34) Suppose a client has given you data. The data has a key-value access pattern, developers need to run complex SQL queries and transactions, and the data has a fixed schema. Which type of data store meets all of these needs?

Ans: RDS

35)Which technique for missing data would produce the best results?

Ans: MICE (Multiple Imputation by Chained Equations)
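
scikit-learn's IterativeImputer is a MICE-style implementation you can use as a quick illustration (the array here is toy data):

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the class)
from sklearn.impute import IterativeImputer

X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [5.0, 6.0, 9.0],
              [np.nan, 8.0, 12.0]])

# Each feature with missing values is modelled from the other features, iteratively
imputer = IterativeImputer(max_iter=10, random_state=0)
print(imputer.fit_transform(X))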

36) You are an ML specialist within an MNC working on a client project. As a daily activity, you have to run SQL queries and analytics on thousands of Apache log files stored in S3. Which set of tools can help you achieve this with the LEAST amount of effort?

Ans: AWS Glue Data Catalog and Athena
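
Once a Glue crawler has catalogued the logs, the query can be run from Athena; a hedged boto3 sketch (database, table, query, and results bucket are placeholders):

import boto3

athena = boto3.client("athena")

resp = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) FROM apache_logs GROUP BY status",
    QueryExecutionContext={"Database": "weblogs"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)
print(resp["QueryExecutionId"])   # poll get_query_execution / get_query_results with this ID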

37)Your deep neural network seems to converge on different solutions with different accuracy each time you train it. What’s a likely explanation?

Ans- Large batch sizes

38) John was working on AWS Glue and found a problem. He was trying to set up a crawler within AWS Glue that crawls input data in S3. For some reason, after the crawler finishes executing, it cannot determine the schema from the data, and no tables are created within the AWS Glue Data Catalog. What is the reason for these results?

Ans- AWS Glue built-in classifiers could not find the input data format. We need to create a custom classifier.

39)Which SageMaker algorithm would be best suited for identifying topics in text documents in an unsupervised setting?

Ans- LDA
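
As a small non-AWS illustration of the same idea, scikit-learn's LDA can pull topic mixtures out of unlabeled documents (the documents here are toy examples):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "stocks bonds markets trading portfolio",
    "injury team season match player score",
    "election vote senate policy campaign",
]

counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=3, random_state=0)
topic_mix = lda.fit_transform(counts)   # per-document topic proportions
print(topic_mix.round(2))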

40)When you train your model in SageMaker, where does your training dataset come from?

Ans-S3

41)Generally where do we store our training data?

Ans- Generally, we store our training data in S3 to use for training our model.

42)Where does SageMaker’s automatic scaling get the data it needs to determine how many endpoints you need?

Ans-CloudWatch

43)Which AWS service is used for auditing?

Ans – CloudTrail is for auditing

44) John is working on a project where AWS Glue was proposed by the client, and he has been tasked with setting up crawlers in AWS Glue to crawl different data stores and populate the organization’s AWS Glue Data Catalog. Which of the following input data stores is NOT an option when creating a crawler?

Ans-DocumentDB

45)Which pipelines allow you to chain together different stages in your inference?

Ans – use Inference pipelines

46)You want SageMaker inference to be fast, but don’t want to pay for P2 or P3 inference nodes. What’s a good solution?

Ans-Use Elastic Inference
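
With the SageMaker Python SDK this is a single deploy call; 'model' is assumed to be an already-created SageMaker Model object, and the instance and accelerator types are placeholders:

# Attach an Elastic Inference accelerator to a cheap CPU instance instead of using P2/P3
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    accelerator_type="ml.eia2.medium",
)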

47) As per the requirement, John has to convert multiple JSON files within an S3 bucket to Apache Parquet format. Which AWS service should be used to achieve this with the LEAST amount of effort?

Ans- Create an AWS Glue Job to convert the S3 objects from JSON to Apache Parquet, then output newly formatted files into S3.
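
A skeleton of such a Glue ETL script (database, table, and output path are placeholders, and the JSON source is assumed to have been crawled into the Data Catalog already):

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the crawled JSON table and write it back out as Parquet
source = glue_context.create_dynamic_frame.from_catalog(
    database="raw_data", table_name="json_files")

glue_context.write_dynamic_frame.from_options(
    frame=source,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/parquet/"},
    format="parquet",
)
job.commit()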

48)You are developing a computer vision system that can classify every pixel in an image based on its image type, such as people, buildings, roadways, signs, and vehicles. Which SageMaker algorithm would provide you with the best starting point for this problem?

​Semantic Segmentation

49)Your company wishes to monitor social media, and perform sentiment analysis on Tweets to classify them as positive or negative sentiment. You are able to obtain a data set of past Tweets about your company to use as training data for a machine learning system, but they are not classified as positive or negative. How would you build such a system?

​Use SageMaker Ground Truth to label past Tweets as positive or negative, and use those labels to train a neural network on SageMaker.

50)Your automatic hyperparameter tuning job in SageMaker is consuming more resources than you would like, and coming at a high cost. What are TWO techniques that might reduce this cost?

​Use logarithmic scales on your parameter ranges
Use less concurrency while tuning