Data Science Objective Set 1

Which of the following is not considered a cause of confusion about the precise meaning of the data science buzzwords?

The constant evolution of the data science industry and in turn the meaning of the data science buzzwords

Ans:-The speed with which new data science terms are appearing

Which of the following is related to the meaning of the term analytics?

Analytics is about separating a dataset into easy-to-digest chunks, studying them individually, and examining how they relate to other parts

Ans:-Analytics is the application of logical and computational reasoning to the component parts obtained in an analysis

Which of the terms relates to the field of business analytics only?

Creating dashboards

Reporting with visuals

Ans:- Qualitative analytics

Which of the following is not considered a data analytics activity?

Ans:-Business case studies

Preliminary data reporting

Optimization of drilling operations

Which of the following is considered data science?

Business case studies

Qualitative analytics

Digital signal processing

Sales forecasting

Given that all activities can be done with ML and all can be done without ML, choose the best answer. Which of the following is considered Data science but not Machine learning?

Creating real-time dashboards

Ans:-Sales forecasting

Fraud prevention

Which of the following is not an example of where Machine Learning is being applied today?

Ans:-Symbolic reasoning

Client retention

Image recognition

From a data scientist’s perspective, the solution of every task begins:

by suggesting a few hypothetical and theoretical solutions to your boss

by gathering your team and deciding on what approach to follow to solve the task

Ans:-with a proper dataset

According to our infographic, which of the following is not considered data science?

Ans:-Big data

Business intelligence

Traditional data science methods

Which of the following is related to the pre-processing of a traditional data set?

Class labelling

Data cleansing

Dealing with missing values

Ans:-All of the above

Which of the following do you encounter when working with big data?

Text data

Integer

Digital image data

Ans:-All of the above

The process of representing observations as numbers is called:

Collecting observations

Ans:-Quantification

A measure that has a business meaning attached is called:

an observation

a quantification

Ans:-a metric

A KPI (Key Performance Indicator) can be best defined as:

the accumulation of observations to show some information

Ans:-a metric that is tightly aligned with your business objectives

a quantification that has a business meaning attached

an observation that can potentially be related to the business goals of a company

The job of a business intelligence analyst always involves the creation of:

reports

dashboards

KPIs

Ans:-All of the above

Which of the following columns from our infographic contain activities that are said to belong to the field of ‘predictive analytics’ and do not aim at explaining past behaviour?

Traditional data

Big data

Business intelligence

Ans:-Traditional methods

In business and statistics, which is the general term that refers to using a model for quantifying causal relationships?

Ans:-regression analysis

factor analysis

cluster analysis

time-series analysis

Which technique can be implemented if you want to reduce the dimensionality of a certain statistical problem?

Ans:-factor analysis

cluster analysis

time-series analysis

all of the above

Which technique is associated with plotting values against time, with time always shown on the horizontal axis?

regression analysis

Ans:-time-series analysis

factor analysis

cluster analysis

When the data is divided into a few groups, you should apply:

factor analysis

Ans:-cluster analysis

time-series analysis

Which of the following statements is true?

The core of machine learning is creating an algorithm, which a computer then uses to find a model that fits the data as best as possible

In machine learning, one does not give the machine instructions on how to find a model. Rather, one provides it with algorithms which give the machine directions on how to learn on its own

A machine learning algorithm is like a trial-and-error process, but the special thing about it is that each consecutive trial is at least as good as the previous one

Ans:-All of the above

Which line represents the four ingredients of any machine learning algorithm?

Model, data, reward system, objective function

Ans:-Data, model, objective function, optimization algorithm

Model, labelled data, unlabelled data, optimization algorithm
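
Note: the four ingredients named in the answer can be seen side by side in a small sketch. The toy data, the learning rate, the step count and the function names below are assumptions made for illustration only; they do not come from the course material.

```python
# Ingredient 1: data (toy points generated from y = 2x + 1; assumed for illustration)
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# Ingredient 2: model (a straight line with two parameters)
def model(x, w, b):
    return w * x + b

# Ingredient 3: objective function (mean squared error, to be minimized)
def objective(w, b):
    return sum((model(x, w, b) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Ingredient 4: optimization algorithm (plain gradient descent)
def fit(steps=5000, lr=0.02):
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = sum(2 * (model(x, w, b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (model(x, w, b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

w, b = fit()
print(round(w, 3), round(b, 3), round(objective(w, b), 6))
```

The optimization algorithm only ever sees the data through the objective function, which is why swapping any one ingredient (for example, a different model) leaves the other three unchanged.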

Choose the best answer.

In which type of machine learning is one always working with unlabelled data?

Supervised learning

Ans:-Unsupervised learning

Reinforcement learning

In reinforcement learning, a reward system is being used to improve the machine learning model at hand. The purpose of this reward system is:

to minimize the error of the model

to minimize the objective function

Ans:-to maximize the objective function

to improve the optimization algorithm

Which of the following is an example where big data techniques are being applied?

Basic customer data

Ans:-Social media

What is NOT a type of machine learning?

supervised

unsupervised

Ans:-reinforced

Which of the following is a typical real-life example where BI techniques are being applied?

Historical stock price data

Ans:-Inventory management

In geometrical terms, a scalar can be represented as:

a line

a square

Ans:-a point

Which of the following is a typical real-life example where machine learning techniques are being applied?

Ans:-Client retention

Basic customer data

Which of the following is a typical real-life example where traditional data techniques are being applied?

Ans:-Basic customer data

Inventory management

Which of the following is a TYPICAL real-life example where traditional data science techniques are being applied?

Social media

Financial trading data

Inventory management

Ans:-Sales forecasting

You have financial data for 100 countries. You feed them to the algorithm and ask it to classify them in as many groups as it sees fit. It starts with 100 groups as each country represents a separate group. You decide to tell it to spit out 5 major groups, i.e. cluster them in 5 clusters. This is an instance of:

supervised learning

Ans:-unsupervised learning
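
Note: the procedure described in the question (start with one group per country, then merge down to five) resembles bottom-up, or agglomerative, clustering. The sketch below is one possible reading of it, using single linkage on made-up one-dimensional scores for ten hypothetical countries; the function name and the data are assumptions.

```python
def agglomerative(values, k):
    """Merge the two closest clusters until only k clusters remain."""
    # Start with one cluster per observation, as in the question.
    clusters = [[v] for v in values]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical one-dimensional "financial scores" for ten countries.
scores = [1.0, 1.2, 5.0, 5.1, 9.0, 9.3, 14.0, 14.2, 20.0, 20.5]
groups = agglomerative(scores, 5)
print(groups)
```

No labels are supplied at any point; the grouping comes only from the distances between observations, which is what makes this unsupervised.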

Which software tool is frequently used when working with traditional data or when doing a BI analysis?

Ans:-Excel

Hadoop

R

Knowing which programming language is a huge advantage if you are supposed to be working with big data and/or machine learning?

SQL

Ans:-Java

Data which can be classified using a linear model is called:

clusterable

regressable

linear

Ans:-linearly separable

Econometric time-series analysis is the domain of which software tool?

Excel

Ans:-E-views

Why do we express probabilities numerically?

To determine whether they are likely or unlikely.

Ans:-To compute which event is relatively more likely.

How do we graphically express two mutually exclusive sets?

Their circles are tangent to one another.

Ans:-Their circles do not touch.

What are the intersection and union of two mutually exclusive sets?

The intersection is their sum and the union is the empty set.

The intersection is the smaller set and the union is the larger set.

The union is the empty set and their sum is the intersection.

Ans:-The union is their sum and the intersection is the empty set.
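
Note: the answer can be checked with Python's built-in sets. The two events below (even and odd outcomes of one die roll) are a hypothetical example chosen for illustration.

```python
# Two hypothetical, mutually exclusive events on one die roll.
evens = {2, 4, 6}
odds = {1, 3, 5}

intersection = evens & odds   # outcomes shared by both sets
union = evens | odds          # outcomes in either set

print(intersection == set())                  # the intersection is empty
print(len(union) == len(evens) + len(odds))   # the union is as large as both sets together
```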

If the probability of an event remains unaffected by another event, the two are….

Dependent.

Ans:-Independent.

What do we call the probability we use to distinguish dependent from independent events?

The dependent probability.

The independent probability.

Ans:The conditional probability.

What can you conclude about events A and B, given that P(A) = P(A|B)?

Ans:The two are independent.

The two are dependent.

What is the difference between P(A|B) and P(B|A)?

The former suggests the two events are dependent, while the latter suggests they are independent.

The former suggests event A is more likely than event B, while the latter suggests B is the likelier of the two.

Ans:One indicates the probability of getting A, given B has occurred, while the other indicates the likelihood of getting B, given A has occurred.

What is the value of P(A|B), knowing P(B|A) = 0.6, P(A) = 0.4 and P(B) = 0.3?

0.24

0.2

Ans:0.8
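
Note: the answer follows from Bayes' rule, P(A|B) = P(B|A) * P(A) / P(B). A minimal sketch of the arithmetic, with a helper name (bayes) that is an assumption of this example:

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

p_a = 0.4
p_a_given_b = bayes(0.6, p_a, 0.3)
print(round(p_a_given_b, 2))

# P(A|B) differs from P(A), so A and B are dependent in this example.
dependent = abs(p_a_given_b - p_a) > 1e-9
print(dependent)
```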


Which of these characteristics of relational databases are not typical of data warehouses?

Ability to store and query data

Quick retrieval of individual records

Use of structured data

Transaction processing

Answer:

Quick retrieval of individual records

Transaction processing

Question:

Which of these is NOT one of the ACID properties of transactions?

Atomicity

Coordinated

Durability

Answer:- Coordinated

Question:

Match the following use cases to the type of processing system which is best suited for the task.

OLAP

OLTP

  • A:Money transfer from one account to another
  • B:Generate monthly sales reports
  • C:Spot trends in TV viewership over the week
  • D:Update a student’s test score in a final exam

Answer:- OLAP (B,C)

OLTP(A,D)
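
Note: the difference between the two workloads can be sketched with Python's built-in sqlite3 module. SQLite is used here only because it needs no setup; it is not a data warehouse, and the table names and figures are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("CREATE TABLE sales (month TEXT, amount REAL)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100.0), (2, 50.0)])
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("Jan", 10.0), ("Jan", 15.0), ("Feb", 20.0)])
conn.commit()

# OLTP-style work: a short transaction touching individual records.
# Using the connection as a context manager commits on success and
# rolls back if an exception is raised, so the transfer is atomic.
with conn:
    conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
    conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")

# OLAP-style work: a read-only aggregate over many rows.
report = conn.execute(
    "SELECT month, SUM(amount) FROM sales GROUP BY month ORDER BY month"
).fetchall()

balances = conn.execute("SELECT balance FROM accounts ORDER BY id").fetchall()
print(balances, report)
```

The money transfer (A) and the score update (D) look like the first pattern; the monthly report (B) and the viewership trend (C) look like the second.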

Question:

Which of these is NOT a data warehouse?

Amazon Redshift

Hive

Jenkins

Teradata

Answer:- Jenkins

Question:

Which of these is NOT a feature of Amazon Redshift?

Ability to back up data using snapshots

Interfaces to query and analyze data

Fully managed data warehouse on the cloud

Highly optimized transaction processing

Answer:- Highly optimized transaction processing

Question:

Which of the following are characteristics of distributed systems?

Fault tolerance

Software to coordinate tasks

Multiple machines working together

Difficulty in scaling

ans:-

Fault tolerance

Software to coordinate tasks

Multiple machines working together

Question:

What does an AWS Policy represent?

A collection of permissions on AWS resources

A licensing agreement between the user and Amazon Inc.

A collection of users who can create resources on AWS

A list of suggested use cases for each AWS service

ANS- A collection of permissions on AWS resources

Question:

When provisioning an AWS user account that will be used to access AWS programmatically, what is generated in order to authenticate users?

An AWS Secret Access Key

An AWS Access Key ID

A Kubernetes secret password

An @aws.com email address

Ans-

An AWS Secret Access Key

An AWS Access Key ID

Question: Which statements about the Amazon Redshift Query Editor are TRUE?

It can be used to load and query data, but not create tables

It can be used to query system tables

It can be used to create tables and to load and query data

It can be used to query existing data but not load data into tables

answer:-

It can be used to query system tables

It can be used to create tables and to load and query data

Question:

What AWS CLI command will list all the Redshift clusters which a user has access to?

aws redshift describe-clusters

aws redshift list-clusters

aws redshift view-all-clusters

aws redshift list-clusters --all

ans:- aws redshift describe-clusters

Question:

Which of the following is not true of Redshift clusters?

Users can define the VPC in which they will be provisioned

The cluster can be scaled

It supports the encryption of data on the cluster

The minimum cluster size is four nodes

ans:- The minimum cluster size is four nodes

Question:

Which of these features and metrics are offered by Amazon Redshift?

Cluster snapshots

Optimized transaction processing

Saved queries

Resource usage metrics

ans:-

Cluster snapshots

Saved queries

Resource usage metrics

Question:

Which of these details need to be supplied when provisioning a Redshift cluster using the Quick launch feature?

Cluster identifier

Cluster credentials

Node type

Frequency of automated snapshots

Question:

How is a SMOTE oversampler initialized from a ModelFrame object named mf?

mf.over_sampling.SMOTE()

mf.imbalance.SMOTE()

mf.imbalance.over_sampling.SMOTE()

mf.SMOTE()

Ans:- mf.imbalance.over_sampling.SMOTE()

Question:

Which statements are true concerning disabling automated snapshots and excluding tables from snapshots?

To exclude a table from a snapshot, add the NO BACKUP clause to the table definition

To disable an automated snapshot, set the retention period to 1 day

To disable an automated snapshot, change the retention period to 0 days

To exclude a table from a snapshot, use the BACKUP NO clause following the table definition

Ans:-

To disable an automated snapshot, change the retention period to 0 days

To exclude a table from a snapshot, use the BACKUP NO clause following the table definition

Question

Given the following values in the body_style column of a Pandas dataframe, what would they look like after label encoding?

hatchback, sedan, suv

(1, 0, 0), (0, 1, 0), (0, 0, 1)

1, 2, 3

(0, 1, 1), (1, 0, 1), (1, 1, 0)

0, 1, 2

Ans:-0, 1, 2
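
Note: label encoding replaces each distinct category with an integer code. The sketch below is a plain-Python stand-in written for this answer key (the helper name label_encode is an assumption); it assigns codes in sorted order of the categories, starting at 0.

```python
def label_encode(values):
    """Map each distinct category to an integer code, in sorted order."""
    mapping = {category: code
               for code, category in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

body_style = ["hatchback", "sedan", "suv"]
codes, mapping = label_encode(body_style)
print(codes)  # [0, 1, 2]
```

The tuples in the other options are what one-hot encoding would produce, not label encoding.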


Question
Which of these statements about the application of the RandomOverSampler are correct?

Samples from the minority class are oversampled until the dataset is balanced

Samples from the majority class are dropped until the dataset is balanced

Samples for the minority class are generated to balance the data

The resultant balanced dataset contains duplicates

Ans:- Samples from the minority class are oversampled until the dataset is balanced

The resultant balanced dataset contains duplicates
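
Note: the two correct statements can be illustrated with a small plain-Python sketch of random oversampling. This is not the imbalanced-learn implementation; the function name, the seed and the toy data are assumptions made for this example.

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == cls]
        # Draw with replacement from the existing samples of this class.
        for _ in range(target - n):
            out_samples.append(rng.choice(pool))
            out_labels.append(cls)
    return out_samples, out_labels

X = ["a", "b", "c", "d", "e", "f"]
y = [0, 0, 0, 0, 1, 1]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

No new sample values are generated; every added row is a copy of an existing minority sample, which is why the balanced dataset contains duplicates.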

Question

Match the following sampling techniques to the type of sampling they perform.

Oversampling ABCD

Undersampling ABCD

  • A:Near Miss
  • B:SMOTE
  • C:Tomek Links
  • D:Neighborhood Cleaning Rule

Answer :- Oversampling B

Undersampling ACD

Question

How do you obtain the explained variance ratios for the principal components in your data once you have fitted a PCA object to it?

pca.explained_variance_ratio_

pca.get_variance_ratios()

pca.variance_ratios()

pca.explained_variance

ANS:- pca.explained_variance_ratio_

Question

How do you initialize a principal component analyzer to extract five principal components from a ModelFrame named mf_data?

decomposition.PCA(data = mf_data, n_components = 5)

mf_data.decomposition.PCA(n_components = 5)

dimensionality_reduction.PCA(data = mf_data, n_components = 5)

mf_data.dimensionality_reduction.PCA(n_components = 5)

ANS:- mf_data.decomposition.PCA(n_components = 5)

Question

What property of a ModelFrame contains the name of the target column?

target_column

target

target_header

target_name

Ans:- target_name

Question

Given you have your test data in a ModelFrame called mf_test, and have a trained estimator called trained_model, how would you get the predictions from the model on your test data?

mf_test.predict(trained_model)

trained_model.predict(mf_test)

trained_model.make_predictions(mf_test)

trained_model.get_predictions(mf_test)

Ans:- mf_test.predict(trained_model)

Question

Which of these are characteristics of the EasyEnsembleClassifier and the BalancedRandomForestClassifier?

They use oversampling to balance the data

They use undersampling to balance the data

They use point-based sampling to balance the data

They use an ensemble of learners where each individual learner is weak

The number of learners can be configured

Ans:-They use undersampling to balance the data

They use an ensemble of learners where each individual learner is weak

The number of learners can be configured

Question:

Recall the essential types of cloud migration.

Re-platforming

Shift to SaaS

Shift to PaaS

Retooling

Ans:-

Re-platforming

Shift to SaaS

Question:

Identify the essential benefits of implementing big data analytics in the cloud.

Scalability and flexibility

Improved analytical outcomes

Security and privacy

Improved analysis

Ans:-

Security and privacy

Improved analysis

Question:

Which are essential characteristics of Kubernetes?

Kubernetes servers can run within a Docker container

Kubernetes is a powerful cloud container management and orchestration tool

Kubernetes servers can run within a Docker virtual machine

Kubernetes is a powerful cloud application management and orchestration tool

Ans:-

Kubernetes servers can run within a Docker container

Kubernetes is a powerful cloud container management and orchestration tool

Question: Select the IP addresses that are used in Kubernetes.

Process EndpointIP

InternalIP

ExternalIP

DataIP

Ans:-

InternalIP

ExternalIP

Question: Name some of the prominent tools provided by GCP that are used to ingest data.

Kubernetes Engine

Cloud Pub/Sub

BigQuery

Cloud Firestore

Ans:-

Kubernetes Engine

Cloud Pub/Sub

Question: Choose the big data management services provided by AWS.

RedShift

DataScale

ENR

EMR

Ans:-

RedShift

EMR

Question: Which of the following storages can we use to back up big data in AWS?

Amazon Glacier

Amazon RedShift

Amazon DataScale

Amazon S3

Ans:-

Amazon Glacier

Amazon S3

Question: Identify which of the following services and implementations can help facilitate disaster recovery planning in the cloud?

S3 configuration

Data migration

Backup and Restore planning

Multi-region deployments

Ans:-

Backup and Restore planning

Multi-region deployments

Question:

Identify some of the critical perspectives of the Cloud Adoption Framework that can help facilitate the implementation or adoption of cloud computing.

Service perspective

People perspective

Application perspective

Business perspective

Ans:-

People perspective

Business perspective

Question: What are some of the critical layers involved in a typical blockchain-based cloud framework?

Cloud

Defining resources and parameters that are required for deployment

Cloud automation

Blockchain management

Ans:-

Cloud automation

Blockchain management

Question: What are the prominent blockchain implementations that we can use to manage distributed ledgers?

Datachain

Livecoin

Tron

Ethereum

Ans:-

Tron

Ethereum

Question: To which data lifecycle phase does importing data apply?

Archive

Share

Use

Create

Ans:- Create

Question: Which statement regarding ERDs in Visual Paradigm is true?

Table relationship diagrams cannot be established

Only Microsoft SQL Server database structures are supported

The design can be used to generate a database structure

Only MySQL database structures are supported

Ans:-The design can be used to generate a database structure

Question: Which item is used to control resource access?

Hashing

Software updates

ACL

Firmware updates

Ans:- ACL

Question: Which database statement is correct?

NoSQL commonly scales vertically

SQL commonly scales horizontally

NoSQL uses a structured schema

SQL uses a structured schema

Ans:- SQL uses a structured schema

Question: Which security standard protects cardholder data?

HIPAA

PIPEDA

PCI DSS

GDPR

Ans:-PCI DSS

Question: What is the default listening port for Microsoft SQL server?

1433

631

1389

110

Ans:- 1433

Question: Which statements regarding DynamoDB items are correct?

DynamoDB is a SQL solution

All items must store the same type of data

DynamoDB is a NoSQL solution

Each item can store different types of data

Ans:-

DynamoDB is a NoSQL solution

Each item can store different types of data

Question: Which of the following is an IT data architecture framework?

PIPEDA

GDPR

HIPAA

TOGAF

Ans:-TOGAF

Question: Which definition is correct?

Data stems from information

Information is organized data

Information stems from data

Data is organized information

Ans:-

Information is organized data

Information stems from data

Question: Which factor could undermine the legitimacy of Big Data summaries?

Pace of data creation

Value derivation

Data source accuracy

Amount of data

Ans:- Data source accuracy

Question: Which of the following best describes Apache Hadoop?

SQL database replication

NoSQL database replication

Clustered database analytical engine

Vertically scaled databases analytical engine

Ans:-Clustered database analytical engine

Question: Identify the approaches of data architecture that we can use to implement a hybrid data architecture.

MapReduce

Data warehouse

Hadoop

HDFS

Ans:-

Data warehouse

Hadoop

Question: Recall some of the essential characteristics of stream processing.

Query continuous data velocity

Stream processing facilitates faster data ingestion

Query continuous data streams

Stream processing facilitates faster reports and insights

Ans:-

Query continuous data streams

Stream processing facilitates faster reports and insights

Question: What are some of the essential benefits provided with the implementation of data partitioning?

Auto-scalability

Enhanced security

Increased performance

Ans:-

Enhanced security

Increased performance

Question: Choose the prominent data complexity contributors.

Velocity

Category

Format

Transformation

Ans:-

Velocity

Category

Transformation

Question: Identify the essential elements of the CAP theorem.

Consistency

Integration

Partition tolerance

Scalability

Ans:-

Consistency

Partition tolerance

Question: Select the prominent distributed data management models.

File format-oriented services

Record-oriented files

Relational database service

Stream-oriented files

Ans:-

Record-oriented files

Relational database service

Question: Which of the following method calls can we use to create and update data in Elasticsearch?

Get

Put

Option

Post

Ans:-

Put

Post

Question: Identify the essential Read preferences that we can configure in MongoDB for Read optimizations.

Elementary preferred

Farthest preferred

Primary preferred

Secondary preferred

Ans:-

Primary preferred

Secondary preferred

Question: Specify the prominent data modelling methodologies that we can use to model data.

Bottom-up modelling

Top-down logical data modelling

Stripping features model

Dimension extraction

Ans:-

Bottom-up modelling

Top-down logical data modelling

Question: Identify some of the valid and essential MongoDB services.

mongos

mongop

mongoq

mongod

Ans:-

mongos

mongod

Question: Which languages can we use to implement serverless architecture with Lambda?

Swift

C

Node.js

Python

Ans:-

Node.js

Python

Question: Which steps are involved in data discovery?

Sharing data

Data de-identification

Cleansing and preparing data

Converting unstructured data

Ans:-

Sharing data

Cleansing and preparing data

Question: Which of the following steps are involved in deriving a successful data POC?

Identifying data transformation requirements

Evaluating analytical requirements

Cleansing and preparing data

Connecting and blending data

Ans:-

Identifying data transformation requirements

Evaluating analytical requirements

Question: Specify some of the essential data management patterns that we can use for microservices.

Data per service

Event generation

Event sourcing

Database per service

Ans:-

Event sourcing

Database per service

Question: Which are types of data architectures that we can implement?

Non-relational data store architecture

Stream processing architecture

Real-time processing architecture

Relational data store architecture

Ans:-

Non-relational data store architecture

Real-time processing architecture

Question: Recall some of the essential benefits with the implementation of clusters.

Simplified management

Increased performance

Auto-scalability

Faster data ingestion

Ans:-

Simplified management

Increased performance

Question: What are some of the important data design principles that are recommended for data design?

Scalability

Integration

Availability

Extensibility

Ans:-

Availability

Extensibility

Question: Identify some of the important characteristics of the serverless architecture.

Serverless architecture eliminates the server management dependency

With a serverless architecture the applications are hosted using third-party services

Serverless architecture eliminates the data management dependency

With a serverless architecture the applications are hosted using in-built services

Question: Match each of the following statements with the type of data it describes.
Instruction: Match each option with its correct target. Each category may have more than one match.
Answer Options:
A:Order of information is not important
B:Its full content is known before processing begins
C:Its state is dynamic
D:It is an infinite dataset that is never-ending

Streaming Data

A

B

C

D

Batch Data

A

B

C

D

Ans:- Streaming Data C, D

Batch Data A,B
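
Note: the contrast in the answer can be sketched in a few lines. A batch is fully known before processing starts, while a stream may never end, so a result has to be kept up to date as elements arrive. The function names and the use of itertools.count as an endless source are assumptions of this example.

```python
from itertools import count, islice

def batch_mean(values):
    """Batch: the full content is known before processing begins."""
    return sum(values) / len(values)

def running_mean(stream):
    """Streaming: emit an updated result after every arriving element."""
    total, n = 0.0, 0
    for value in stream:
        total += value
        n += 1
        yield total / n

batch = [2, 4, 6, 8]
print(batch_mean(batch))

# count(1) is an endless source; islice only looks at the first five results.
first_five = list(islice(running_mean(count(1)), 5))
print(first_five)
```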

Question: Which of the following are qualities that are desirable in a stream processing system?

Fault tolerant

Idempotent

Low throughput

High Latency

Ans:-

Fault tolerant

Idempotent
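
Note: idempotence is desirable because a stream processor may redeliver events after a failure. A minimal sketch, assuming a hypothetical sink keyed by event id, shows that replaying the same events leaves the result unchanged:

```python
def apply_events(events, store=None):
    """Idempotent sink: each event id is stored at most once."""
    store = {} if store is None else store
    for event_id, amount in events:
        store[event_id] = amount   # re-writing the same id changes nothing
    return store

events = [("e1", 10), ("e2", 5)]
store = apply_events(events)

# Simulate a retry after a failure: the same events are delivered again.
store = apply_events(events, store)
total = sum(store.values())
print(total)
```

Appending every delivery to a list instead would double the total on a retry, which is the behaviour an idempotent design avoids.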

Question: The following types of data sinks are used to store the results of the transformations applied on streaming data. Which ones are used mainly for debugging purposes?

Memory sink

Kafka sink

Console sink

File sink

Ans:-

Memory sink

Console sink

Question: Which of the following statements about Structured Streaming in Spark 2.0 are true?

Any data input to Structured Streaming APIs is processed exactly once

The streaming data processed by Structured Streaming APIs are fault tolerant

The API used to process batch data is different from the API used to process streaming data

Data that arrives late cannot be processed by Structured Streaming APIs

Ans:-

Any data input to Structured Streaming APIs is processed exactly once

The streaming data processed by Structured Streaming APIs are fault tolerant

Question: Among the following technologies, which offer support for prefix integrity?

Apache Flink

Apache Spark

Apache Storm

Apache Kafka

Ans:-Apache Spark

Question: Match the following statements related to RDDs with their correct Boolean values

Answer Options:
A:Transformations on RDDs will update them
B:They contain a collection of logical groupings of fields
C:RDDs are stored across multiple nodes in the Spark cluster
D:If one of the nodes where an RDD resides crashes, the data on that node is lost

True

A

B

C

D

False

A

B

C

True:- B,C

False:- A, D

Question: Match the following statements with the corresponding output mode that it describes.

Answer Options:
A:Only the rows which were updated since the previous trigger will be written out
B:Rows which existed previously and get updated in the current trigger won’t be written out
C:The entire contents of the result table are written out to the output
D:The connector to the storage here will determine exactly how that output is written out

Ans:-

Update Mode

A

Complete Mode

C

D

Append Mode

B

Question: What two kinds of data are at issue when considering global standards?

Expensive data

Analyzed data

Impractical data

Personal data

Transactional data

Ans:-

Personal data

Transactional data

Question: Which is not a benefit of a data compliance program?

Makes data cheaper

Complying with the law

Risk reduction

Customer retention

Ans:- Makes data cheaper

Question: If you’re collecting personal data, what should you know?

How big it is

Why you need it

The budget

The cost

Ans:- Why you need it

Question: What should you define when creating reporting and response procedures?

Testing procedures

External reporting only

Doing nothing

Audit avoidance

Ans:- Testing procedures

Question: Which regulation protects credit card information?

HIPAA

GDPR

DMCA

PCI DSS

Ans:- PCI DSS

Question: When building a data compliance strategy, you should do what?

Avoid post-breach procedures

Create an internal reporting structure

Avoid business in countries where reporting is mandatory

Find ways to avoid notifying regulators

Ans:- Create an internal reporting structure

Question: Which is not a feature of Big Data?

Cost-effectiveness

Expensive to store

Behavioral analysis

Long-term storage

Ans:- Expensive to store

Question: What are the two key areas of data protection that companies should focus on?

Policies

Abuses

Regulations

Frameworks

Capturing

Ans:-

Policies

Regulations

Question: Managerial training should focus on what?

Data

Responsibility

Technology

Complexity

Ans:- Responsibility

Question: Who should attend data compliance training for users?

Upper management

Law enforcement

Your customers

Network admins

Ans:- Network admins

Question: Which is not an element of a good data compliance strategy?

State rationale

Focus on privacy

Define boundaries

Reduce costs

Ans:- Reduce costs

Question: Which is not a reason for big data’s need for a governance structure?

Misuse

Tools

Data loss

Abuse

Ans:- Tools

Question: How do big data paradigms differ from traditional data paradigms?

They offer different analytical models

Traditional data paradigms offer more sources of data

They can incorporate automation into data analysis using AI

Access is limited due to volume

Ans:-

They offer different analytical models

They can incorporate automation into data analysis using AI

Question: Data collection is only as useful as what?

Single points-of-failure

The way in which it’s used

Its ability to evolve

Regular backups

Ans:- The way in which it’s used

Question: Which is not a main need for data governance?

Data validation

Protect data

Domain

Provide reliability

Ans:-Domain

Question: Which is not a business benefit of the cloud?

60% uptime

Scalability

Analytics

Onboarding data

Ans:-60% uptime

Question: What should you align your stakeholder needs to?

Skills

Curve

Data

Skepticism

Ans:-Skills

Question: What should you do when identifying data?

Budget for vision

Start big

Isolate different types

Align big data with isolation

Ans:-Isolate different types

Question: When identifying the players in a data governance team, you should do what?

Expect technical expertise

Expect a mix of technical and business expertise

Avoid technical expertise

Avoid a mix of technical and business expertise

Ans:-Expect a mix of technical and business expertise

Question: Which is not a key to a successful governance structure?

Data

Inform

Discuss

Educate

Ans:- Data

Question: Which is not a pillar of a successful data governance strategy?

Ensure availability

Secure your data

Access your data

Facilitate data changes

ANS:- Access your data

Question: Which is not a key principle of big data governance?

Best practices

People

Behavior

Process

Ans:- Behavior

Question: What must be done prior to designing effective data access governance policies?

ACLs must be modified

Assets, users and groups must be inventoried

Regulatory compliance must be achieved

Apply software updates

Ans:- Assets, users and groups must be inventoried

Question: Which of the following are examples of PII?

Mother’s maiden name

Product documentation

Social security number

MAC address

Ans:-

Mother’s maiden name

Social security number

Question: Which data storage solution does not use a rigid schema?

NTFS

SQL

NoSQL

Microsoft BitLocker

Ans:-NoSQL

Question: Which data loss threat involves user deception resulting in the disclosure of sensitive data?

Ransomware

Collusion

Malware

Social engineering

Ans:-Social engineering

Question: Which mechanism can be used to assign file system permissions?

ACL

Encryption

DLP

Malware scanner

Ans:- ACL

Question: Which behavior accurately describes the result of applying share and NTFS permissions?

Allow permissions apply only to groups

Deny permissions apply only to groups

Deny permissions override Allow permissions

When combining share and NTFS permissions, the most permissive prevails

Ans:- Deny permissions override Allow permissions
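
Note: the rule in the answer can be expressed as a set difference. This is a deliberately simplified model written for this answer key: it ignores inheritance and the interaction between share and NTFS permissions, and the function name and entries are hypothetical.

```python
def effective_permissions(entries):
    """entries: iterable of (kind, permission), kind being 'allow' or 'deny'."""
    allowed = {p for kind, p in entries if kind == "allow"}
    denied = {p for kind, p in entries if kind == "deny"}
    # A Deny entry wins over any Allow entry for the same permission.
    return allowed - denied

# Hypothetical entries gathered from a user's own account and group memberships.
acl = [("allow", "read"), ("allow", "write"), ("deny", "write")]
print(sorted(effective_permissions(acl)))
```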

Question: Which of the following constitutes multi-factor authentication?

Smart card, keyfob

Username, password, smart card

Username, password

Username, password, PIN

Ans:-Username, password, smart card

Question: Which of the following are settings related to Amazon Web Services (AWS) IAM users?

Programmatic access

Database access

ACL access

AWS console access

Ans:-

Programmatic access

AWS console access

Question: Which action is required to assign permission to an Amazon Web Services (AWS) IAM group?

Edit the AWS ACL

Attach a policy

Deploy device authentication

Enable multi-factor authentication

Ans:- Attach a policy

Question: Which statement regarding vulnerability assessments is correct?

They are less invasive than penetration tests

They exploit weaknesses

They identify weaknesses

They are more invasive than penetration tests

Ans:-

They are less invasive than penetration tests

They identify weaknesses

Question: Which of the following is considered a detective control?

Firewall

Log files

Malware scanning

User training

Ans:- Log files

Question: For which activity does data classification most facilitate data access governance?

Cloud storage

Data backups

Permission assignment

Intrusion detection and prevention

Ans:- Permission assignment

Question: Which Linux solution is used to implement centralized logging?

iptables

rsyslog

Event viewer

SSH

Ans:- rsyslog

Question: You need to capture all network traffic for devices plugged into a network switch. Your machine is plugged into port 1 and you plan to run the Wireshark free packet capturing tool. What is wrong with this configuration?

Nothing is wrong. All traffic will be captured.

You will not see broadcast traffic.

You will not see multicast traffic.

You will only see your own unicast traffic

Ans:- You will only see your own unicast traffic

Question: Which benefit is derived from the use of Microsoft BitLocker?

Network traffic encryption

Disk volume encryption

File encryption

Folder encryption

Ans:- Disk volume encryption

Question: Which of the following can be used as criteria to filter a custom log view?

Event ID

MAC address

Default gateway

User name

Ans:-

Event ID

User name

Question: How can you ensure newly added files are automatically classified?

Enable classification through Group Policy

Create a PowerShell script

Enable compression on folders

Run classification rules on a schedule

Ans:- Run classification rules on a schedule

Question: Which SCCM item is deployed to collections to monitor compliance?

Configuration item

Software updates

Inventory

Baseline

Ans:- Baseline

Question: Which statement regarding asymmetric cryptography is the most correct?

A private key is used

Unrelated public and private keys are used

Mathematically related public and private keys are used

A public key is used

Ans:- Mathematically related public and private keys are used
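
To make "mathematically related" concrete, here is a toy RSA-style sketch in Python. The tiny primes, the exponent and the variable names are illustrative assumptions; real systems rely on vetted libraries and far larger keys.

```python
# Toy RSA key pair with tiny primes (illustration only, not secure).
p, q = 61, 53
n = p * q                    # modulus shared by both keys
phi = (p - 1) * (q - 1)      # Euler's totient of n
e = 17                       # public exponent, chosen coprime with phi
d = pow(e, -1, phi)          # private exponent: modular inverse of e (Python 3.8+)

message = 65
ciphertext = pow(message, e, n)    # encrypt with the public key (e, n)
recovered = pow(ciphertext, d, n)  # decrypt with the private key (d, n)
print(recovered == message)
```

The private exponent is derived from the public one, which is why the two keys only work as a pair.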

Question: Which type of VPN is configured by default when created on Windows Server 2016?

PPTP

L2TP

SSL

IKEv2

Ans:-PPTP

Question: You have been tracking database querying over time in the headquarters office and have concluded that performance is unacceptable. Which solution will most likely improve performance the most?

Encryption of data in transit

Encryption of data at rest

Database replicas

In-memory caching

Ans:- In-memory caching

Question: Where do Windows file system audit events get logged?

Audit log

Application log

System log

Security log

Ans:- Security log

Question: What are some of the skills required for practicing data science?

Marketing

Advertising

Domain-specific knowledge

Communication

Statistical analysis

Programming

Ans:-

Domain-specific knowledge

Communication

Statistical analysis

Programming

Question: After collecting a batch of raw data, what should we do next in our data wrangling pipeline?

Integration

Filtering

Gathering

Conversion

Exploration

Ans:- Filtering

Question: What are three major concerns when dealing with large datasets?

Storage

Human resources

Data analysis techniques

Security

Bandwidth

Ans:-

Storage

Data analysis techniques

Security

Question: Match each problem class to its appropriate machine learning method.

Answer Options:
A:Supervised learning
B:Unsupervised learning
C:Reinforcement learning

Training a model based on a reward system

A

B

C

Ans:- C

Detecting patterns in unlabeled data

A

B

C

Ans:- B

Modeling based on input and response variables

A

B

C

Ans:- A

Question: Match each data science term with its description.

Answer Options:
A:Data cleansing
B:Data integrity
C:Data anonymization

Removing personally identifying information from a dataset

A

B

C

Ans:- C

Verifying the accuracy of a dataset

A

B

C

Ans:- B

Removing invalid data from a dataset

A

B

C

Ans:- A

Question: What are some of the ways we make sense of our data science results for others?

Algorithm analysis

Charts

Significant values

Software design

Narratives

Ans:-

Charts

Significant values

Narratives

Question: What step should we perform before statistical analysis and machine learning?

Expert systems design

Data cleaning

Plotting

Graphing

Ans:- Data cleaning

Question: What are the two most common programming languages used in data science?

R

Common Lisp

Perl

Python

Ada

Ans:-

R

Python

Question: What are some strategies we can use to gather the data that accumulates continuously?

incremental synchronization

scheduled tasks

automated scripts

bayesian analysis

constantly recheck it manually

Ans:-

incremental synchronization

scheduled tasks

automated scripts

Question: What are some of the common options that curl provides for downloading web data?

modifying user-agent string

passing form data

parsing cookie information

automatic link following

obeying robots.txt

metadata inspection

Ans:-

modifying user-agent string

passing form data

parsing cookie information

metadata inspection

Question: What is the main advantage of using a command line utility to convert spreadsheets to csv format?

Exception handling

Automation through scripting

Standards compliance

Resolving broken links

Ans:- Automation through scripting

Question: What advantages do libraries like agate provide?

data anonymization

resolving editor disputes

consistent tabular data manipulation

deployment simplicity

general purpose data wrangling

resolving language disputes

Ans:-

consistent tabular data manipulation

general purpose data wrangling

Question: Legacy tabular data formats, such as dbf, are often converted to which of the following formats?

csv

sql

sbt

json

gif

jpg

Ans:-

csv

sql

json

Question: Which of the following features does the python library BeautifulSoup provide?

javascript validation

making url requests

syntax highlighting

HTML tag extraction

HTML DOM manipulation

Ans:-

HTML tag extraction

HTML DOM manipulation

Question: Select the main difference between data and metadata.

metadata is never encrypted

metadata is always in plain text format

metadata contains only the sender and receiver of a message

metadata describes context not contents

data is always proprietary

Ans:- metadata describes context not contents

Question: What are some examples of HTTP header information?

character encoding

email address

content length

user agent

spam score

Ans:-

character encoding

content length

user agent

Question: Which of the following are examples of information we can get about a client from web server logs?

Phone number

Email address

Geographic location

IP address

Document requested

Ans:-

IP address

Document requested

Question: What are some examples of standard email header information?

Content-type

X-Forwarded-For

X-Spam-Score

Remote-Server

Received

Ans:-

Content-type

Received

Question: If you do not specify a username for your ssh connection, what is the default user set to?

system

local username

root

guest

Ans:- local username

Question: Specifying the remote port number with scp is done with which command line switch?

p

P

r

R

Ans:- P

Question: When running rsync with new options, you should first test with which option?

test

u

del

dry-run

Ans:- dry-run

Question: What are some of the common reasons for filtering data?

Random sampling

Excluding old data

Removing invalid data

Removing duplicates

Finding relations

Creating simple models

Ans:-

Excluding old data

Removing invalid data

Removing duplicates

Question: Choose the date format which represents a date in ISO 8601.

YYYY-MM-DD

MM/DD/YYYY

DD/MM/YY

MM/DD/YY

DD-MM-YY

Ans:- YYYY-MM-DD
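
As a quick illustration, Python's standard library can produce and parse this format. The date used here is an arbitrary example.

```python
from datetime import date

d = date(2021, 3, 7)                    # arbitrary example date
iso_text = d.isoformat()                # formats as YYYY-MM-DD
parsed = date.fromisoformat(iso_text)   # parses the same format back (Python 3.7+)
print(iso_text)
```

Because the fields run from most to least significant, ISO 8601 date strings also sort chronologically as plain text.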

Question: What does the content-type header in an HTTP response represent?

Language

Media type

Compression type

Character encoding

Ans:- Media type

Question: Which of the following commands will skip over the first row in a .csv file?

csvcut -r 2

awk 'NR+1'

csvcut -r 1

tail -n +2

tail -n -1

awk 'NR>1'

Ans:-

tail -n +2

awk 'NR>1'
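
The same header skip can be sketched in Python for cases where the data is processed in a script rather than a shell pipeline. The in-memory sample below is a hypothetical stand-in for a real file.

```python
import csv
import io

# Hypothetical in-memory CSV standing in for a file on disk.
raw = "name,age\nAnn,31\nBob,45\n"

reader = csv.reader(io.StringIO(raw))
header = next(reader)        # consume the first row
rows = list(reader)          # everything after the header
print(rows)
```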

Question: An e-mail message header is separated from the body of the message using which delimiter?

A <br> tag

The ‘Sender’ header

A blank line

The pipe character ‘ || ‘

A null character

Ans:- A blank line

Question: What format must the data be in for uniq to function correctly?

It must be encoded in UTF-8

It must be sorted

It must be delimited by commas

It must be encoded in Latin1

Ans:- It must be sorted
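
The reason is that uniq only compares adjacent lines. A rough Python analogue, using `itertools.groupby` (which also groups only adjacent items), shows the effect. The helper name `uniq` is our own.

```python
from itertools import groupby

def uniq(lines):
    """Collapse runs of identical adjacent items, similar to the uniq utility."""
    return [key for key, _ in groupby(lines)]

unsorted_lines = ["b", "a", "b", "a"]
print(uniq(unsorted_lines))          # duplicates survive: no two equal items are adjacent
print(uniq(sorted(unsorted_lines)))  # sorting first brings duplicates together
```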

Question: Select the information that may be found in a JPEG EXIF header.

Copyright

Compression algorithm

GPS coordinates

Camera make and model

Encryption algorithm

XML tags

Ans:-

Copyright

GPS coordinates

Camera make and model

Question: How does pdfgrep handle nonsearchable pdf files?

It will perform OCR first

It will prompt the user to perform OCR

It will find text in image data

It will return nothing

Ans:- It will return nothing

Question: Choose the most appropriate method to deal with impossible combinations in our data set.

Leave the data as is

Randomly change a value so that the combination is valid

Set the invalid data to 0

Drop the data point or record

Set the invalid data to N/A

Ans:- Drop the data point or record

Question: Which one of the following does the disallow directive in robots.txt instruct a parser to do?

Do not index the directory or file

Blacklist the web site

Do nothing

Access but do not log

Ans:- Do not index the directory or file

Question: Which response is a valid reason for converting from CSV to JSON?

The CSV file is small

You get to code a parser

The JSON file is compressed

A JSON format may be required by the software tool that you use

Ans:- A JSON format may be required by the software tool that you use

Question: Which response is a valid reason for converting from XML to JSON?

JSON is less verbose

XML is proprietary

XML is out of style

JSON is used on the web

JSON is compressed by default

Ans:- JSON is less verbose

Question: What might be a problem when converting from CSV to SQL?

SQL queries do not work on data that is imported from CSV

CSV data can only be stored in SQLite databases

SQL tables are slower to query than the CSV tables

CSV file may not contain the information about the data type

Ans:- CSV file may not contain the information about the data type
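
A short sketch of the problem: Python's `csv` module hands back every field as a string, so column types have to be chosen by whoever loads the data. The sample data and the chosen types are assumptions for illustration.

```python
import csv
import io

raw = "id,price\n1,9.99\n2,12.50\n"   # hypothetical sample data
rows = list(csv.DictReader(io.StringIO(raw)))

# Every value arrives as str; the int/float choice below is our own assumption.
typed = [{"id": int(r["id"]), "price": float(r["price"])} for r in rows]
print(type(rows[0]["id"]).__name__, type(typed[0]["id"]).__name__)
```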

Question: Why should we export data from SQL to CSV?

To import it into a similar table elsewhere

To process the data

To speed up the database

To back up the database

Ans:-

To import it into a similar table elsewhere

To process the data

Question: Why should we transform a file from comma delimited to tab delimited?

To export to UTF-8 encoding

To allow for commas in the fields

To speed up the processing

To work better on the web

To work on a traditional UNIX operating system

Ans:- To allow for commas in the fields
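
A minimal sketch of such a conversion with Python's `csv` module, assuming a hypothetical row whose first field contains a comma:

```python
import csv
import io

# Hypothetical row with an embedded comma in the first field.
raw = 'city,note\n"Paris, France","capital"\n'

out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
for row in csv.reader(io.StringIO(raw)):
    writer.writerow(row)     # re-emit each parsed row with tab separators

tsv_text = out.getvalue()
print(tsv_text)
```

With tabs as the delimiter, the comma inside the field no longer needs quoting.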

Question: What is another name for the Unix time?

ISO 2281

January 1, 1970

Greenwich Mean Time

Epoch Time

ISO 8601

Ans:- Epoch Time
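
For illustration, Unix time counts seconds from the start of 1970 in UTC, which Python can show directly:

```python
from datetime import datetime, timezone

epoch_start = datetime.fromtimestamp(0, tz=timezone.utc)         # timestamp 0
one_day_later = datetime.fromtimestamp(86_400, tz=timezone.utc)  # 24 * 60 * 60 seconds later
print(epoch_start.isoformat())
```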

Question: Which method is used to find the absolute value of a number in Python?

absval

-1 * n

absolute

abs

Ans:- abs

Question: What is the difference between rounding and truncating a floating-point number?

Truncate and round perform the same function

Rounding always goes to the nearest integer value

Truncating removes all decimal points except for two

Rounding is faster

Ans:- Rounding always goes to the nearest integer value
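
A small Python sketch of the difference, using sample values of our own choosing:

```python
import math

values = (2.7, -2.7)
rounded = [round(v) for v in values]         # nearest integer
truncated = [math.trunc(v) for v in values]  # fractional part dropped, toward zero
print(rounded, truncated)
```

One caveat: Python's built-in `round` sends exact ties such as 2.5 to the nearest even integer, so other languages may differ on those cases.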

Question: Which operation should we perform before carrying out OCR on an image?

Recompress the image to 75% quality

Remove header information from the image

Clean the image for easier OCR processing

Rename the image extension to .tiff

Ans:- Clean the image for easier OCR processing

Question: In which scenarios is text more easily extracted from a PDF?

The PDF was generated from a digital process

The destination format is UTF-8

The text is scanned from a book

The LaTeX source code is used by the OCR software

The PDF contains searchable text

Ans:-

The PDF was generated from a digital process

The PDF contains searchable text

Question: What advantage does csvgrep give over regular grep?

csvgrep excludes duplicate data

csvgrep uses a faster search algorithm

csvgrep operates on a column-by-column basis

csvgrep operates line-by-line

Ans:-csvgrep operates on a column-by-column basis

Question: What are some of the common basic statistics supported by csvstat?

min

history

max

version

mean

Ans:-

min

max

mean

Question: What is the main limitation to querying CSV data with csvsql?

Queries are slow on large data sets

csvsql does not support ‘order by’

csvsql does not support ‘group by’

Not all SELECT operations are performed

Ans:- Queries are slow on large data sets

Question: Which of the following features does gnuplot support?

Saving plots in

Capture

Install

Plotting in ASCII text

Plotting from the command line

Creating interactive plots in JavaScript

Ans:- Plotting in ASCII text

Question: Which features does the wc utility provide?

word frequency count

character count

word count

paragraph count

line count

Ans:-

character count

word count

line count

Question: Select the tools used to explore subdirectories from the command line.

apt

find

xargs

scp

grep

tree

Ans:-

find

xargs

tree

Question: Put the following pseudocode steps in order to carry out a word frequency count.

count the occurrences of each word
remove duplicate words
split the text into a list of words

Ans:-remove duplicate words
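
One possible way to carry out a word frequency count in Python is sketched below. `collections.Counter` counts and de-duplicates in a single pass, so the separate pseudocode steps collapse into one call. The input text is hypothetical.

```python
from collections import Counter

text = "the cat and the hat and the bat"   # hypothetical input
words = text.lower().split()               # split the text into a list of words
counts = Counter(words)                    # count occurrences; keys are the unique words
print(counts.most_common(2))
```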

Question: What method is used to take systematic samples (instead of random)?

srand()

division

rand()

modulus operator %

Ans:- modulus operator %
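
A minimal sketch of systematic sampling with the modulus operator, assuming a hypothetical dataset of 20 records and a sampling interval of 5:

```python
records = list(range(1, 21))   # hypothetical dataset of 20 records
step = 5                       # keep every 5th record (assumed interval)

# A record is kept when its position is an exact multiple of the step.
sample = [r for i, r in enumerate(records) if i % step == 0]
print(sample)
```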

Question: What is the first step in finding the top 10 rows in a tabular data set based on a particular field?

randomly sample 10 rows from the data set

loop through the data set for each of the top 10 rows

sort the data based on the particular field

sort the data based on weighted rank

Ans:- sort the data based on the particular field

Question: Which two SQL functions are used to count unique rows in a table?

min

order by

limit

distinct

count

avg

max

Ans:-

distinct

count
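
The two functions are typically combined as `COUNT(DISTINCT column)`. A sketch using Python's built-in `sqlite3` module, with a hypothetical table and column name:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (visitor TEXT)")   # hypothetical table
conn.executemany("INSERT INTO visits VALUES (?)",
                 [("ann",), ("bob",), ("ann",), ("cy",)])

# COUNT(DISTINCT ...) counts each distinct value once.
(unique_visitors,) = conn.execute(
    "SELECT COUNT(DISTINCT visitor) FROM visits").fetchone()
print(unique_visitors)
conn.close()
```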

Question: What is a good median deviation value used to identify outliers?

1

3

2

4

Ans:- 3
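
One common reading of this rule is to flag values lying more than 3 median absolute deviations (MAD) from the median. The sketch below follows that reading. The function name, the sample data and the handling of a zero MAD are assumptions.

```python
from statistics import median

def mad_outliers(values, threshold=3):
    """Return values whose distance from the median exceeds threshold * MAD."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []            # a MAD of zero gives no scale to compare against
    return [v for v in values if abs(v - center) / mad > threshold]

data = [10, 11, 12, 11, 10, 12, 11, 50]   # hypothetical measurements
print(mad_outliers(data))
```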

Given the following table definitions, which column would you use to join the second table to the first?

ID,Name,Education,Salary_ID
1,Steve,1,1
2,Bob,2,2
3,Jim,3,3
4,Jill,4,3
5,Jane,5,5

ID,Salary
2,75000
3,100000
4,125000
5,150000
1,50000

Salary -> Salary_ID

ID -> ID

ID -> Salary_ID

ID -> Education

Ans:- ID -> Salary_ID
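
The join can be sketched in plain Python using the values from the two tables in the question. The variable names are our own.

```python
people = [  # first table: ID, Name, Education, Salary_ID
    (1, "Steve", 1, 1), (2, "Bob", 2, 2), (3, "Jim", 3, 3),
    (4, "Jill", 4, 3), (5, "Jane", 5, 5),
]
# second table: ID -> Salary
salaries = {2: 75000, 3: 100000, 4: 125000, 5: 150000, 1: 50000}

# Match the second table's ID against the first table's Salary_ID.
joined = [(name, salaries[salary_id]) for _, name, _, salary_id in people]
print(joined)
```

Jim and Jill share Salary_ID 3, so both pick up the same salary row.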

Select the answer that explains the following command.

cat *.gz > all-logs.gz

unzip all logs with .gz extension

symbolically link all .gz files with all-logs.gz

concatenate all .gz files into the file all-logs.gz

delete all .gz files after storing them in all-logs.gz

zip files using the gzip utility

Ans:- concatenate all .gz files into the file all-logs.gz
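
This works because a gzip file may contain several compressed members back to back. A small Python sketch, with in-memory byte strings standing in for the log files:

```python
import gzip

part_a = gzip.compress(b"log line 1\n")   # stands in for one .gz file
part_b = gzip.compress(b"log line 2\n")   # stands in for another

combined = part_a + part_b                # byte-wise concatenation, as cat does
restored = gzip.decompress(combined)      # multi-member data decompresses in order
print(restored.decode())
```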

What sorting algorithm does the ‘sort’ command line utility use?

Bubble sort

Heapsort

Merge sort

Quicksort

Ans:-Merge sort

Question: Given an HTML table, select the appropriate compatibility criteria for merging two tables into one.

The tables must have the same number of rows

A column cannot have missing elements

The columns in both tables must be in the same order before merging

The tables have the same number of columns

Both tables must have identical header elements

A row cannot have missing elements

Ans:-

The columns in both tables must be in the same order before merging

The tables have the same number of columns

Select the functions that provide summarized or aggregated data.

min

count

distinct

sort

limit

sum

Ans:-

min

count

sum

Select the answer that best describes normalizing tabular data.

compiling tabular data into code

creating connecting tables in normal form

summarizing or aggregating tabular data

transforming data from long format to wide format

reshaping data from column-key format to key-value format

Ans:- reshaping data from column-key format to key-value format

Given the following table, how many columns would the denormalized table have?

ID,Property,Value
1,name,Steve
2,name,Bob
3,name,Jill
1,birth_year,1919
2,birth_year,1985
3,birth_year,2000
1,salary,125000
2,salary,50000
3,salary,250000

7

9

6

8

4

5

Ans:- 4
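
A sketch of the denormalization in Python, using the key-value records from the question. Each distinct Property becomes its own column next to ID. The variable names are our own.

```python
rows = [  # (ID, Property, Value) records
    (1, "name", "Steve"), (2, "name", "Bob"), (3, "name", "Jill"),
    (1, "birth_year", 1919), (2, "birth_year", 1985), (3, "birth_year", 2000),
    (1, "salary", 125000), (2, "salary", 50000), (3, "salary", 250000),
]

wide = {}
for record_id, prop, value in rows:
    # Start a wide row on first sight of an ID, then add the property as a column.
    wide.setdefault(record_id, {"ID": record_id})[prop] = value

columns = list(wide[1])
print(columns)
```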

What type of operations or functions are typically used to create a pivot table?

summary

limit

join

distinct

group by or aggregate

Ans:-

summary

group by or aggregate

Considering the date range inclusive of 2000 to 2015 for the given data set, how many rows would be added in the homogenized table?

year,value
2004,2000
2005,3000
2006,9000
2007,10500
2008,8600
2009,9200
2010,14000
2012,15000

8

7

4

5

10

6

9

Ans:- 8
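
The count can be checked with a short Python sketch that builds one row per year in the range and fills the gaps. Using `None` as the placeholder for a missing value is an assumption.

```python
observed = {2004: 2000, 2005: 3000, 2006: 9000, 2007: 10500,
            2008: 8600, 2009: 9200, 2010: 14000, 2012: 15000}

# One row per year from 2000 to 2015 inclusive; None marks a filled-in gap.
homogenized = {year: observed.get(year) for year in range(2000, 2016)}
added = [year for year in homogenized if year not in observed]
print(len(added), added)
```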

Question: Select the best description of the geometric definition of the dot product of two vectors.

Logarithm

Difference

Magnitude

Vector sum

Tangent angle

Ans:-

Question: Select the method which describes how to add two matrices together.

The matrices are added component-wise

The matrices are multiplied component-wise

Rows in the left-hand matrix are added to columns in the right-hand matrix

Each component in the right-hand matrix is scaled by a factor of every component in the left-hand matrix

Ans:- The matrices are added component-wise
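
Component-wise addition can be sketched in plain Python, with matrices represented as lists of rows. The function name is our own.

```python
def add_matrices(a, b):
    """Add two equally sized matrices (lists of rows) component-wise."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("matrices must have the same shape")
    # Pair up rows, then pair up entries within each row.
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

total = add_matrices([[1, 2], [3, 4]], [[10, 20], [30, 40]])
print(total)
```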

Question: Given a matrix A, what values make up the diagonal matrix D in A’s factorization via Singular Value Decomposition (SVD)?

The identity matrix

The determinant of A

The diagonal entries from A

The singular values from A

The orthogonal eigenvalues from A

A unitary matrix of eigenvectors from A

Ans:- The singular values from A

Question: Select examples of discrete categorical data.

Blood type

Temperature

Precipitation level

True or false

Heads or tails

Eye color

Ans:-

Blood type

Eye color

Question: What is an event?

The set of outcomes in an experiment

The number of ways an event can occur divided by the total possible outcomes

The set of all possible outcomes

The result of a coin flip or dice roll

The random distribution of an experiment

Ans:- The set of outcomes in an experiment

Question: Given events A and B, select the answer that describes P(A and B).

A ∩ B

B ∖ A

A – B

A ∪ B

Ans:- A ∩ B

Given the addition rule for calculating probability, why do we need to subtract P(A and B) from the sum?

P(A or B) = P(A) + P(B) – P(A and B)

To account for the central limit theorem

To account for independent events

To account for Bayes rule

To account for the intersection of events

Ans:- To account for the intersection of events
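
A worked check of the rule, assuming one roll of a fair six-sided die with events of our own choosing. Outcomes in both A and B would be counted twice without the subtraction.

```python
from fractions import Fraction

outcomes = set(range(1, 7))                 # one roll of a fair six-sided die
A = {o for o in outcomes if o % 2 == 0}     # event A: the roll is even
B = {o for o in outcomes if o > 3}          # event B: the roll is greater than 3

def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event), len(outcomes))

lhs = prob(A | B)                           # P(A or B) counted directly
rhs = prob(A) + prob(B) - prob(A & B)       # addition rule
print(lhs, rhs)
```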

Select the formal name of the “Bell curve” distribution.

Normal distribution

Discrete distribution

Continuous uniform distribution

Binomial distribution

Poisson distribution

Ans:- Normal distribution

Select the description of a binomial distribution with a single trial.

Bell curve

Poisson event

Discrete value

Bernoulli trial

Ans:- Bernoulli trial

Select the answer that describes the prior probability in a Bayesian email spam algorithm.

The probability that any message is spam

The probability that a spam rule shows up in a ham message

The probability that any message is ham

The probability that a rule appears in a spam message

Ans:- The probability that any message is spam
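
To show where the prior enters the calculation, here is a small Bayes' rule sketch. All three probabilities are made-up numbers for illustration, not values from any real filter.

```python
from fractions import Fraction

p_spam = Fraction(2, 5)             # prior: probability that any message is spam (assumed)
p_rule_given_spam = Fraction(1, 2)  # assumed chance the rule fires on a spam message
p_rule_given_ham = Fraction(1, 20)  # assumed chance the rule fires on a ham message

# Total probability that the rule fires on any message.
p_rule = p_rule_given_spam * p_spam + p_rule_given_ham * (1 - p_spam)
posterior = p_rule_given_spam * p_spam / p_rule   # P(spam | rule fired)
print(posterior)
```

The prior is the belief before any rule is examined. The posterior is that belief updated once a rule has fired.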