Question: Select the bias types which describe sampling bias.
Projection bias
Negativity bias
Self-selection bias
Exclusion bias
Survivorship bias
Ans:
Self-selection bias
Exclusion bias
Question: Select the definition of the Coefficient of Variation.
Square of the standard deviation divided by the mode
Absolute value of the Z-score divided by the mean
Square root of the sum of deviations from the mean
Ratio of the standard deviation to the mean
Ans: Ratio of the standard deviation to the mean
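As a minimal sketch of the definition above, the coefficient of variation can be computed with NumPy. The helper name `coefficient_of_variation` and the sample values are hypothetical, and the sample standard deviation (`ddof=1`) is an assumption; a population version would use `ddof=0`.

```python
import numpy as np


def coefficient_of_variation(values):
    """Return the sample standard deviation divided by the mean."""
    data = np.asarray(values, dtype=float)
    return data.std(ddof=1) / data.mean()


# Hypothetical sample, for illustration only
heights_cm = [160.0, 170.0, 180.0]
cv = coefficient_of_variation(heights_cm)
print(f"CV = {cv:.4f}")
```

Because the ratio is unitless, it lets you compare spread across variables measured on different scales.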
Question: Select the answer that describes the bias in an estimator.
The estimator always tends to the mean
The difference between the true value and expected value
Only the variance can be biased
The estimator is always off by one
Ans: The difference between the true value and expected value
Question: Which distribution describes the plot of sample means from any random distribution?
Uniform
Binomial
Normal
Poisson
Ans: Normal
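The idea behind this answer (the central limit theorem) can be checked with a small simulation. This is only a sketch: the exponential population, the sample size of 50, and the seed are arbitrary assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# A clearly non-normal (right-skewed) population
population = rng.exponential(scale=2.0, size=100_000)

# Draw 5,000 samples of 50 values each and keep the mean of every sample
sample_means = rng.choice(population, size=(5_000, 50)).mean(axis=1)


def skewness(values):
    """Standardized third moment; close to 0 for a symmetric distribution."""
    centered = values - values.mean()
    return (centered ** 3).mean() / values.std() ** 3


print("population skewness:", skewness(population))
print("sample-mean skewness:", skewness(sample_means))
```

The sample means cluster around the population mean and are far more symmetric than the population itself.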
Question: A data science platform must provide an environment flexible enough to integrate a variety of tools and tool types including which key programming languages?
R
JavaScript
Python
HTML
Ans:
R
Python
Question: When deploying data science tools, software engineering best practices should be adhered to. Which type of tool would you use for centralized management of code?
Version control
RDBMS
Visual Studio Code
DVD
Ans: Version control
Question: Coverage is a key data science tool consideration. What does coverage refer to?
The types of projects covered
The platform’s core capabilities
Test code coverage
The platform’s ability to cover data
Ans: The platform’s core capabilities
Question: In which step in the data science workflow would one typically perform feature engineering?
Define objective
Explore/clean data
Evaluate/tune model
Import data
Ans: Explore/clean data
Question: When using data science tools to perform text exploration, what kind of data is typically being analysed?
Relational data
Unstructured data
CSV data
Structured data
Ans: Unstructured data
Question: Which of these are considered common uses for data science visualization tools?
Discover new features
Generate models
Restructure data
Explore information
Ans:
Discover new features
Explore information
Question: When using a data science database tool to acquire streaming data from a device, when is the data processing typically performed?
On the device
After import
Before import
In real time
Ans: In real time
Question: Which of these are valid benefits of deploying cloud-based tools?
Reliability
Availability
Scalability
Security
Compliance
Ans:
Reliability
Availability
Scalability
Question: Which of these are valid challenges of deploying cloud-based tools?
Scalability
Security
Network latency
Regulatory compliance
Performance
Ans:
Security
Network latency
Regulatory compliance
Question: Which of these are typical functionalities of DevOps?
Deployment
Evaluating
Testing
Integration
Cleaning
Ans:
Deployment
Testing
Integration
Question: Working within DevOps for data science, which types of resources would be subject to automated testing?
IoT devices
Model performance
Data quality
Containers
Ans:
Model performance
Data quality
Question: Match the following statements related to Seaborn with their correct boolean values.
Answer Options:
A:Seaborn is a data visualization library built on top of Matplotlib
B:Seaborn is part of the PyData stack which is the open data science stack available in Python
C:Seaborn allows the user to very finely control every detail of the plot and lets the user perform complex tasks with it
True
A
B
C
Ans: A
False
A
B
C
Ans: B,C
Question: Let’s say you have created a DataFrame called “data” and you call the describe() function on it. What does describe() return?
It returns the datatype of each of the columns of the dataframe
It returns the details like number of rows in the dataframe, the number of columns in the dataframe, total number of cells in the dataframe etc.
It returns a summary of all of the string columns in the dataframe
It returns the summary of all of the numeric columns of the Dataframe by default. The summary includes the count, the mean, the standard deviation, the minimum, the maximum values of the columns etc.
Ans: It returns the summary of all of the numeric columns of the Dataframe by default. The summary includes the count, the mean, the standard deviation, the minimum, the maximum values of the columns etc.
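A short sketch of this behaviour, using a hypothetical three-column frame (the column names and values are made up for illustration):

```python
import pandas as pd

data = pd.DataFrame({
    "price": [10.0, 20.0, 30.0],
    "units": [1, 2, 3],
    "label": ["a", "b", "c"],  # non-numeric, skipped by default
})

# By default only the numeric columns are summarized
summary = data.describe()
print(summary)
```

The resulting index holds the statistic names (count, mean, std, min, the quartiles, max), and the `label` column does not appear.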
Question: Match the following statements related to the distplot function in Seaborn with the correct values.
Answer Options:
A:Distplot is used for visualizing the distribution of a single column of data in a Dataframe
B:By default, the Seaborn distribution plot automatically plots a smooth representation of the distribution of data across the range of values passed into the function (a KDE curve)
C:The orientation of the distribution of the data passed into this function cannot be changed and is fixed. The range values will always be along the x-axis.
True
A
B
C
Ans: A,B
False
A
B
C
Ans: C
Question: Which of these plots is NOT rendered by the Seaborn distplot function?
Instruction: Choose the option that best answers the question.
Histogram
Kernel Density Estimation curve
2D Scatter Plot
Rug plot
Ans: 2D Scatter Plot
Question: What does the bandwidth of a KDE curve determine?
The bandwidth of the curve determines what portion of the entire range of values will be considered to plot the KDE estimate at any point
The bandwidth of a KDE curve determines the smoothness of the curve. Lower the bandwidth, smoother the curve
The bandwidth of a KDE curve determines the length of the curve
The bandwidth of a KDE curve determines the smoothness of the curve. Higher the bandwidth, smoother the curve
Ans:
The bandwidth of the curve determines what portion of the entire range of values will be considered to plot the KDE estimate at any point
The bandwidth of a KDE curve determines the smoothness of the curve. Higher the bandwidth, smoother the curve
Question: Which of these Seaborn functions here can be used to plot the distribution of bivariate data?
rugplot
kdeplot
jointplot
distplot
Ans:
kdeplot
jointplot
Question: What is the output after you have passed 4 variables as input data to the pairplot function?
It outputs a 4×4 grid with a univariate distribution for the corresponding input variables
It outputs a 2D scatter plot for any 4 pairs of variables against each other
It outputs a 4×4 grid with a 2D scatter plot for every pair of variables against each other along with the univariate distribution for the corresponding input variables
It outputs a 2D scatter plot for the first variable against the other 3 variables passed to it
Ans: It outputs a 4×4 grid with a 2D scatter plot for every pair of variables against each other along with the univariate distribution for the corresponding input variables
Question: What is the ‘hue’ argument in the Seaborn pairplot function used for?
It is used for specifying the gaps between all of the subplots that are rendered by the pairplot function
It is used for specifying what kind of plot we are going to use for all of our subplots
It is used to specify the colors for the markers in our scatter plot
It is used for specifying the x-variables and the y-variables that we are going to use in order to construct our plots
Ans: It is used to specify the colors for the markers in our scatter plot
Question: Seaborn has a built-in function called set_context which sets plot details such as labels, lines, grids, and other plot elements so that the plot suits the context in which you want to present it. What are the different context modes in Seaborn?
talk
paper
meetup
poster
Ans:
talk
paper
poster
Question: Among the following list of people/groups of people, whom would you consider as outliers in their field?
The Beatles – often considered the most influential rock band of all time
Patrick Klandt – a professional soccer player who plays for a mid-level team
Serena Williams – the record holder for number of major wins in tennis
Sezz Medi – an Italian restaurant which operated for 4 years before shutting down
Ans:
The Beatles – often considered the most influential rock band of all time
Serena Williams – the record holder for number of major wins in tennis
Question: Some of the features of boxplots are stated below. Match them with their correct boolean values
Answer Options:
A:The vertical lines of a box plot represent the range of distribution of data
B:The boxes represent the interquartile distribution i.e. the data from the 25th percentile right up to the 75th percentile
C:The max value of the data is represented by the horizontal line within the box
D:The outliers of the data are also captured by the whiskers of the box plot
False
A
B
C
D
Ans: C,D
True
A
B
C
D
Ans: A,B
Question: Match the following types of Seaborn plots with their correct description.
Instruction: Match each answer with the correct target. Each answer can only be used once.
Answer Options:
A:stripplot
B:countplot
C:pointplot
D:catplot
Similar to a histogram, with a bar for every categorical variable
A
B
C
D
Ans: B
It renders a line connecting a number of points where each point represents the mean value for every category
A
B
C
D
Ans: C
Similar to a scatter plot except in this case, one of our variables is categorical in nature
A
B
C
D
Ans: A
Question: Some of the features of the Seaborn FacetGrid function are given below. Which of these are true?
We don’t have to explicitly mention the data that has to be represented in the graph or the type of graph that we want rendered. FacetGrid detects this automatically
FacetGrid cannot be used to perform analysis on bivariate data
It allows the user to plot a different graph for each category present in a range of data
Ans: It allows the user to plot a different graph for each category present in a range of data
Question: Let’s say you are generating multiple plots using the Seaborn FacetGrid but you want only three graphs to be displayed in a single row. How do you do this?
By setting the ‘col’ argument of the FacetGrid function to 3
By setting the ‘col_wrap’ argument of the FacetGrid function to 3
By setting the ‘row_wrap’ argument of the FacetGrid function to 3
By setting the ‘row’ argument of the FacetGrid function to 3
Ans: By setting the ‘col_wrap’ argument of the FacetGrid function to 3
Question: When using colour palettes to colour your plots, how does Seaborn work when there are multiple objects to be coloured?
Seaborn cycles through each of the colours in the colour palette in a particular order and sets a colour to a given variable depending on the order in which the variables appear
It sets a colour to a given variable depending on the name of the variable
Different variables are given different colours in random order
Ans: Seaborn cycles through each of the colours in the colour palette in a particular order and sets a colour to a given variable depending on the order in which the variables appear
Question: Some of the statements related to colour palettes in Seaborn are given below. Match them with their correct boolean values.
Instruction: Match each option with its correct target. Each category has a single match.
Answer Options:
A:True
B:False
Adjacent colors in qualitative palettes vary a lot
A
B
Ans: A
Sequential palettes allow values close to each other to have similar shade and values far from each other to have completely different shades
A
B
Ans: A
Sequential palettes are better suited for categorical data
A
B
Ans: B
Question: Match the following statements related to the lmplot function in Seaborn with the correct values.
Instruction: Match each option with its correct target. Each category may have more than one match.
Answer Options:
A:lmplot will render a regression plot on a facet grid
B:lmplot is capable of rendering only a single regression plot between two variables
C:We can make our markers more evenly distributed in the lmplot by adding in jitters, but doing so will affect the regression line
False
A
B
C
Ans: B,C
True
A
B
C
Ans: A
Question: What is the Seaborn despine() function used for?
It can be used to set a gap between the axes of our visualization and the plot itself
The despine function is used to change the style and aesthetics of the grid lines of our plot
It is used to remove parts or all of the box inside which our visualizations are rendered
By default, our visualizations will not be plotted inside a box and the despine function is used to render our visualizations inside a box
Ans:
It can be used to set a gap between the axes of our visualization and the plot itself
It is used to remove parts or all of the box inside which our visualizations are rendered
Question: Let’s say you have a grouped dataframe called “gdf” which has a column of integers called “c1”.
What would gdf.c1.sum() return?
The total number of groups
For each group, we get the sum of their values in “c1”
The number of records in each group
The total number of records in the grouped dataframe
Ans: For each group, we get the sum of their values in “c1”
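A small sketch of this, assuming a hypothetical frame grouped by a column named `group`:

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "b", "a", "b"],
    "c1": [1, 2, 3, 4],
})
gdf = df.groupby("group")

# One sum per group, indexed by the group key
totals = gdf.c1.sum()
print(totals)
```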
Question: What does the “values” member variable of a pandas dataframe return?
The contents of the dataframe in the form of a python list
The contents of the dataframe in the form of a pandas series
The contents of the dataframe in the form of a numpy array
The dataframe in a dictionary form
Ans: The contents of the dataframe in the form of a numpy array
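For illustration, with a made-up two-column frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# .values exposes the underlying data as a NumPy array (one row per record)
array = df.values
print(type(array), array.shape)
```

The pandas documentation recommends `df.to_numpy()` for new code; both return a NumPy array here.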
Question: Let’s say you have a multi-indexed dataframe called “multi_index” with three columns “c1”, “c2” and “c3” where the third column “c3” contains integer values.
You want to first group this dataset by the values in “c2” and within each value in “c2”, you want to group by “c1”, in order to calculate the sum of values in “c3” for each of these subgroups.
Which of these functions can help do so?
multi_index.groupby(level=['c2']).sum()
multi_index.groupby(level=['c1', 'c2']).sum()
multi_index.groupby(level=['c1']).sum()
multi_index.groupby(level=['c2', 'c1']).sum()
Ans: multi_index.groupby(level=['c2', 'c1']).sum()
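A sketch of the grouping described above, assuming “c1” and “c2” are the two index levels and “c3” is the only data column (the values are invented):

```python
import pandas as pd

index = pd.MultiIndex.from_tuples(
    [("x", "p"), ("x", "q"), ("y", "p"), ("x", "p")],
    names=["c1", "c2"],
)
multi_index = pd.DataFrame({"c3": [1, 2, 3, 4]}, index=index)

# Group by c2 first, then by c1 within each c2 value
totals = multi_index.groupby(level=["c2", "c1"]).sum()
print(totals)
```

The order of the names in `level` sets the order of the levels in the result's index.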
Question: Match the following quantities, with the correct function that you need to pass to the aggregate function, called on a multi-indexed grouped dataframe, to return it.
Answer Options:
A:sum of all the quantities in the group
B:mean value of all the quantities in the group
C:minimum value of all the quantities in the group
D:maximum value of all the quantities in the group
np.max
A
B
C
D
Ans: D
np.sum
A
B
C
D
Ans: A
np.min
A
B
C
D
Ans: C
np.mean
A
B
C
D
Ans: B
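The four matches can be tried together in one call. This sketch uses a plain (not multi-indexed) grouped column for brevity, and the result labels `total`, `average`, `smallest`, and `largest` are hypothetical names chosen here. Recent pandas versions may warn that the string names `"sum"`, `"mean"`, `"min"`, and `"max"` are preferred over the NumPy functions.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "qty": [1, 3, 10, 20],
})

# Named aggregation: each keyword becomes a column in the result
stats = df.groupby("group")["qty"].agg(
    total=np.sum,
    average=np.mean,
    smallest=np.min,
    largest=np.max,
)
print(stats)
```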
Question: What is returned when you call the “isin” function, with one of the values in the column specified as a filter, on a dataframe?
Pandas series of string values where the value is false if that index had the same value as the filter and true otherwise
Pandas series with the index values where the values in the column do not match with the filter
Pandas series with the numeric values where the values in the column match with the filter
Pandas series of boolean values where the value is true if that index had the same value as the filter and false otherwise
Ans: Pandas series of boolean values where the value is true if that index had the same value as the filter and false otherwise
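A brief sketch with a hypothetical `fruit` column:

```python
import pandas as pd

df = pd.DataFrame({"fruit": ["apple", "pear", "apple"]})

# True where the column value is in the given list, False elsewhere
matches = df["fruit"].isin(["apple"])
print(matches)

# The boolean series can then be used to filter the rows
apples_only = df[matches]
```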
Question: What will be the values present in “new_series” by the end of this program?
import pandas as pd
pandas_series = pd.Series([1, 2, 3, 4])
new_series = pandas_series.mask(pandas_series > 1)
Result A.
0 1.0
1 NaN
2 NaN
3 NaN
Result B.
0 2.0
1 3.0
2 4.0
Result C.
0 NaN
1 2.0
2 3.0
3 4.0
Result D.
0 1.0
Result A
Result B
Result C
Result D
Ans: Result A
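The program can be run as is to confirm the result. `mask` replaces every value for which the condition is True with NaN, which is also why the surviving value is shown as a float.

```python
import pandas as pd

pandas_series = pd.Series([1, 2, 3, 4])

# Values greater than 1 are masked out (set to NaN); the rest are kept
new_series = pandas_series.mask(pandas_series > 1)
print(new_series)
```

`where` is the complement: it keeps the values for which the condition is True.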
Question: Let’s say you have a dataframe called “df” which has a column called “quantity”. What would be the output when you execute the following line of code?
df.duplicated('quantity')
# dataframe df column “quantity” values
0
4
1
1
3
0
3
Result A.
0 True
1 False
2 True
3 True
4 True
5 True
6 True
Result B.
0 False
1 True
2 False
3 False
4 False
5 False
6 False
Result C.
0 False
1 False
2 False
3 True
4 False
5 True
6 True
Result D.
0 True
1 True
2 True
3 False
4 True
5 False
6 False
Result B
Result D
Result A
Result C
Ans: Result C
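Reproducing the column from the question shows why: with the default `keep="first"`, the first occurrence of each value is not flagged, and only later repeats are.

```python
import pandas as pd

df = pd.DataFrame({"quantity": [0, 4, 1, 1, 3, 0, 3]})

# True marks rows whose "quantity" value already appeared in an earlier row
flags = df.duplicated("quantity")
print(flags)
```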
Question: When creating a dataframe using the DataFrame constructor, what is the default dtype assigned to a column consisting of string values?
object
category
integer
string
Ans: object
Question: Let’s say you’re trying to apply an inequality filter to a column in a dataframe.
For what column dtypes would the program throw an error when trying to perform this operation?
object
integer
unordered categorical
ordered categorical
Ans: unordered categorical
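A sketch contrasting the two categorical cases. The category labels are made up; the point is that an inequality only makes sense once the categories have an order.

```python
import pandas as pd

unordered = pd.Series(
    pd.Categorical(["low", "high", "medium"],
                   categories=["low", "medium", "high"])
)
ordered = unordered.cat.as_ordered()

# An inequality on an unordered categorical raises TypeError
try:
    unordered > "low"
    unordered_error = None
except TypeError as exc:
    unordered_error = exc

# The same comparison works once the categories are ordered
above_low = ordered > "low"
print(unordered_error)
print(above_low)
```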
Question: What does the ffill() function do when you call it on your dataframe?
NaN values in the beginning of all the columns in the dataframe are filled with the first observed value of that column
Removes the records with NaN values
NaN values in each column are filled with the last valid (non-NaN) value that precedes them in that column
Replaces the NaN value with 1
Ans: NaN values in each column are filled with the last valid (non-NaN) value that precedes them in that column
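A small sketch of what `ffill()` does on a column with gaps, using a hypothetical `reading` column. Forward fill copies the most recent valid value into each following NaN; it does not interpolate, and a NaN with no earlier valid value stays NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"reading": [np.nan, 1.0, np.nan, np.nan, 4.0, np.nan]})

# Each NaN takes the last valid value above it in the same column
filled = df.ffill()
print(filled)
```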
Question: Besides accuracy, what else does veracity refer to?
Meaningless
Truthfulness
Noisiness
Cleanliness
Ans: Truthfulness
Question: Which is not an example of a use case of Big Data?
Telecommunication companies reducing IT costs
Epidemic prediction
Updating menus with high-priced items
Airlines collecting engine information
Ans: Updating menus with high-priced items
Question: Which is not one of the four V’s of Big Data?
Veracity
Velocity
Volume
Value
Ans: Value
Question: What’s an example of something you can control?
The accuracy of data
The speed of user data
Better decision making
The amount of data sources
Ans: Better decision making
Question: Which interactivity feature is available by default in an Altair bar chart?
Save as SVG
Save as PNG
Zoom in
Pan
Zoom out
Ans: Save as SVG
Question: Which statement best defines a wide form dataset?
A wide form dataset has one row per independent variable, with metadata recorded in the row and column labels
A wide form data set is a data set with more than ten columns and less than a hundred rows
A wide form data set is a data set with more columns than rows in the entire data set
A wide form dataset has one row per observation, with metadata recorded within the table as values
Ans: A wide form dataset has one row per independent variable, with metadata recorded in the row and column labels
Question: Which error will be thrown if you attempt to visualize a dataset with more than 5000 rows by default?
ExceedsLimitationError
IntegrityError
No error will be thrown at all, the code will work fine
MaxRowsError
Ans: MaxRowsError
Question: Which commands can be used to install Altair, Vega, and Vega Lite from a Jupyter notebook?
!install altair, vega, vega_datasets
!pip install altair vega vega_datasets
python install altair, vega, vega_datasets
install altair, vega, vega_datasets
Ans: !pip install altair vega vega_datasets
Question: How would you parameterize a call to the alt.X() constructor to specify the number of bins in a histogram?
Using the “binning” input argument to the alt.X() function
Using the “bins” input argument to the alt.X() function
Using the “bin” input argument to the alt.X() function
Using the “num_bins” input argument to the alt.X() function
Ans: Using the “bin” input argument to the alt.X() function
Question: Which statements can be used to create a brush which selects a range on the X axis in a chart?
alt.selection_interval(encodings = [“x”])
alt.brush(axis = x)
alt.brush()
alt.selection_interval(axis = x)
Ans: alt.selection_interval(encodings = [“x”])
Question: How would you parameterize a call to the alt.Chart().mark_boxplot().encode() method to specify the color in a box plot?
Using the “palette” input argument to the alt.Chart().mark_boxplot().encode() method
Using the “col” input argument to the alt.Chart().mark_boxplot().encode() method
Using the “hue” input argument to the alt.Chart().mark_boxplot().encode() method
Using the “color” input argument to the alt.Chart().mark_boxplot().encode() method
Ans: Using the “color” input argument to the alt.Chart().mark_boxplot().encode() method
Question: Which are valid inputs to the “sort” parameter to the alt.X() constructor?
“+”
“asce”
“desc”
“y”
“-y”
Ans:
“y”
“-y”
Question: You would like to create a line chart with step interpolation. How would you parameterize your call to the alt.Chart().mark_line() method?
alt.Chart().mark_line(interpolation = ‘step’,…)
alt.Chart().mark_line(interpolation_mode = ‘step’,…)
alt.Chart().mark_line(interpolate = “step”,…)
alt.Chart().mark_line(step = True,…)
Ans: alt.Chart().mark_line(interpolate = “step”,…)
Question: You would like to create a brush which selects a range of data points in a scatter plot. How would you parameterize your call to the alt.selection() function?
alt.selection (range = “interval”,…)
alt.selection(type = “interval”,…)
alt.selection (apply = “interval”,…)
alt.selection(select = “interval”,…)
Ans: alt.selection(type = “interval”,…)
Question: Which function from the alt.Chart() class can be used to create a violin plot?
alt.Chart().mark_kernel()
alt.Chart().mark_area()
alt.Chart().mark_violin()
alt.Chart().mark_KDE()
Ans: alt.Chart().mark_area()
Question: Which function from the alt.Chart() class can be used to create a scatter plot with hollow points?
alt.Chart().mark_circle()
alt.Chart().mark_point()
alt.Chart().mark_hollow_scatter()
alt.Chart().mark_scatter()
Ans: alt.Chart().mark_point()
Question: Which Altair function can be used to add conditional formatting to a chart?
alt.condition()
alt.Condition()
alt.conditional_formatting()
alt.ConditionalFormatting()
Ans: alt.condition()
Question: You would like to add a darkgreen line to an area chart. How would you parameterize your call to the alt.Chart().mark_area() method?
alt.Chart().mark_area(line_color = “darkgreen”,…)
alt.Chart().mark_area(line_layout = {“color” : “darkgreen”},…)
alt.Chart().mark_area(outline = {“color” : “darkgreen”},…)
alt.Chart().mark_area(line = {“color” : “darkgreen”},…)
Ans: alt.Chart().mark_area(line = {“color” : “darkgreen”},…)
Question: Which statement accurately describes a trellis area chart?
A trellis area chart is an area chart which uses a gradient to color the area
A trellis area chart is an area chart with an outline for each of the areas being visualized
A trellis area chart is an area chart with multiple categories being visualized
A trellis area chart is an area chart with a separate chart for every category
Ans: A trellis area chart is an area chart with a separate chart for every category
Question: What kind of variables can be placed on the X and Y axis of a scatter plot?
The X axis has to be continuous and the Y axis can be either categorical or continuous
Both of the axes in a scatter plot can be either categorical or continuous
The X axis has to be continuous and the Y axis has to be continuous
The X axis has to be categorical and the Y axis has to be categorical
Ans: The X axis has to be continuous and the Y axis has to be continuous
Question: Assume you have two line charts stored in variables called line_01 and line_02.
Which statement can be used to return both of these lines in the same chart?
alt.combine(line_01, line_02)
line_01  line_02
line_01 & line_02
alt.layer(line_01, line_02)
Ans: alt.layer(line_01, line_02)
Question: Why does volume matter?
By 2020 it’s expected that we’ll have 55 times the data we had in 2010
We don’t have enough sensors
Data storage is plentiful
80 percent of data was created in the past two years
Ans: 80 percent of data was created in the past two years
Question: What causes the variety problem?
Reduction in complexity
No more opportunity
More structured data
Increasing ways data is received
Ans: Increasing ways data is received
Question: Which is not a principle of variety?
Data can be passive
Data is always structured
Variety is expensive
Variety means the same kind of data
Ans: Data is always structured
Question: You would like to create a green colored map. How would you parameterize your call to the alt.Chart().mark_geoshape() method?
alt.Chart().mark_geoshape(fill = “green”,…)
alt.Chart().mark_geoshape(color = “green”,…)
alt.Chart().mark_geoshape(hue = “green”,…)
alt.Chart().mark_geoshape(format = {“color” : “green”},…)
Ans: alt.Chart().mark_geoshape(fill = “green”,…)
Question: How would you parameterize your call to the alt.Chart().transform_aggregate() method to create a plot with one marker for each category of a column?
Using the “groupby” input argument to the alt.Chart().transform_aggregate() method
Using the “group” input argument to the alt.Chart().transform_aggregate() method
Using the “group_by” input argument to the alt.Chart().transform_aggregate() method
Using the “aggregate” input argument to the alt.Chart().transform_aggregate() method
Ans: Using the “groupby” input argument to the alt.Chart().transform_aggregate() method
Question: Which function from the alt.Chart() class can be used to create a heat map?
alt.Chart().mark_heatmap()
alt.Chart().mark_map()
alt.Chart().mark_heat()
alt.Chart().mark_rect()
Ans: alt.Chart().mark_rect()
Question: Which of the following best defines a ranged dot plot?
A dot plot with two or more dots representing a range of values which are connected by a line
A dot plot with dots representing a continuous range of values
A dot plot with two dots representing a range of values
A dot plot with all the dots in the plot connected by a line
Ans: A dot plot with two or more dots representing a range of values which are connected by a line
Question: You would like to create a y axis on the right side of a chart. How would you parameterize your call to the alt.Axis() constructor?
alt.Axis(orientation = “right”,…)
alt.Axis(axis = “right”,…)
alt.Axis(position = “right”,…)
alt.Axis(orient = “right”,…)
Ans: alt.Axis(orient = “right”,…)
Question: Which statement can be used to create a brush which selects a single point in a scatter chart?
alt.selection_single()
alt.selection_interval()
alt.brush_single()
alt.brush()
Ans: alt.selection_single()
Question: Which classes are required to create a candlestick chart in Altair?
alt.OHLC
alt.Y2
alt.Y
alt.Chart
alt.Candle
Ans:
alt.Y2
alt.Y
alt.Chart
Question: You want to create a color scale in a variable “color_scale” with colors for the categories “X”, “Y”, “Z” for use in a chart.
Which statement can be used to achieve this?
color_scale = alt.ColorScale(categories = [“X”, “Y”, “Z”])
color_scale = alt.ColorScale(domain = (“X”, “Y”, “Z”))
color_scale = alt.Scale(categories = [“X”, “Y”, “Z”])
color_scale = alt.Scale(domain = [“X”, “Y”, “Z”])
Ans: color_scale = alt.Scale(domain = [“X”, “Y”, “Z”])
Question: How would you parameterize your call to the alt.Chart().mark_bar() method so that the width of the bars in a bar chart also correspond to a variable?
Using the “x_width” input argument to the alt.Chart().mark_bar() function
Using the “x2” input argument to the alt.Chart().mark_bar() function
Using the “width” input argument to the alt.Chart().mark_bar() function
Using the “bar_width” input argument to the alt.Chart().mark_bar() function
Ans: Using the “x_width” input argument to the alt.Chart().mark_bar() function
Question: Which statements about the default representation of a strip plot in Altair are true?
The X and Y axes of a strip plot are both bucketed
One axis of a strip plot is always continuous
One axis of a strip plot is always categorical
A strip plot visualizes univariate data using bars
Ans:
One axis of a strip plot is always continuous
One axis of a strip plot is always categorical
Question: Which class can be used to perform a sort operation on a variable in a dash chart?
alt.Sort
alt.SortX
alt.SortField
alt.SortY
Ans: alt.SortField
Question: Which of these statements best defines clustered bar charts?
A bar chart where the bars are stacked upon one another
A bar chart where the bars slope downwards
A bar chart without lines separating each bar
A bar chart which visualizes multiple variables
Ans: A bar chart which visualizes multiple variables
Question: Which function can be used to create a second axis which shares the same x-axis as the first?
ax.commonx()
ax.sharex()
ax.twinx()
ax.X()
Ans: ax.twinx()
Question: How can multiple sheets be created when exporting data frames to Excel in R using writexl?
By chaining write_xlsx function calls
By specifying a list of data frames
Multiple sheets cannot be written using writexl
By first opening the excel file and appending the new sheet
Ans: By specifying a list of data frames
Question: By default, what is read by the html_table function from the rvest library?
The last table in the HTML document
A vector containing the tables in the HTML document
A list containing the tables in the HTML document
The first table in the HTML document
Ans: A list containing the tables in the HTML document
Question: Select the description of what action the following code performs.
Code Editor:
sink(“file.txt”)
file.txt will be printed to the console
Console output will be redirected to file.txt
file.txt will be read into an R data frame
file.txt will be deleted
Ans: Console output will be redirected to file.txt
Question: Given the result of the read_excel function from the readxl package, what function is then used to convert the resulting object into an R data frame?
as.data.frame
to_dataframe
tibble_to_dataframe
as_dataframe
Ans: as.data.frame
Question: How is data represented in Lollipop charts?
Using data points and thin vertical bars
Using thin vertical bars with whiskers
Using thin vertical bars
Using data points and curves
Ans: Using data points and thin vertical bars
Question: You would like to create a histogram without displaying the lines which separate the individual bars. How would you parameterize your call to the plt.hist() function?
plt.hist(type = ‘joint’,…)
plt.hist(hist = ‘step’,…)
plt.hist(kind = ‘step’,…)
plt.hist(histtype = ‘step’,…)
Ans: plt.hist(histtype = ‘step’,…)
Question: If you want to create a histogram which visualizes probability values of records, how would you parameterize your call to the plt.hist() function?
Using the “kernel” input argument to the plt.hist() function
Using the “kde” input argument to the plt.hist() function
Using the “density” input argument to the plt.hist() function
Using the “distribution” input argument to the plt.hist() function
Ans: Using the “density” input argument to the plt.hist() function
Question: Match the CSV read method with its type of CSV.
Instruction: Match each answer with the correct target. Each answer can only be used once.
Answer Options:
A:read.csv
B:read.csv2
C:read.delim
comma separator
A
B
C
Ans: A
semicolon separator
A
B
C
Ans: B
tab separator
A
B
C
Ans: C
Question: What is the default separator and decimal character used by write.csv?
semicolon and comma
tab and period
comma and period
tab and comma
Ans: comma and period
Question: Which functions can be performed on a figure object to create an axes object in that figure?
fig.axes()
fig.add_axes()
fig.create_axes()
fig.add_axis()
Ans: fig.add_axes()
Question: Which Matplotlib backends are interactive?
gtk3
qt4
ps
svg
inline
Ans:
gtk3
qt4
inline
Question: Which pandas function can be used to create a DataFrame with separate rows for a category?
pd.group_by()
pd.category()
pd.group()
pd.cat()
Ans: pd.group_by()
Question: What is returned from the dplyr mutate function when operating on a tibble?
The tibble is converted into a list
A new row is added based on a function of existing rows
A new column is added based on a function of existing columns
A column’s values are modified according to a function
Ans: A new column is added based on a function of existing columns
Question: Which of the following are examples of summary functions?
min
select
slice
median
left_join
max
Ans:
min
median
max
Question: When performing a left join, right join, or full join, how does dplyr handle unmatched values?
By inserting NA
By dropping the row
By inserting 0
By inserting NULL
Ans: By inserting NA
Question: Which matplotlib.pyplot functions can be used to create a tuple with the figure and axes of a chart?
plt.subplots()
plt.plot()
plt.axes()
plt.figure()
Ans: plt.subplots()
Question: If you want to customize the size of the text of the x-axis label in a chart, how would you parameterize your call to the plt.xlabel() function?
Using the “label_size” input argument to the plt.xlabel() function
Using the “text_size” input argument to the plt.xlabel() function
Using the “fontsize” input argument to the plt.xlabel() function
Using the “size” input argument to the plt.xlabel() function
Ans: Using the “fontsize” input argument to the plt.xlabel() function
Question: Assume you have data for the opening price and a closing price of a stock over a period of time.
If you want separate lines to represent the opening and closing prices of the stock, how will you parameterize your call to the matplotlib.pyplot.plot() function?
Using the “color” input argument to the matplotlib.pyplot.plot() function
Using the “hue” input arguments to the matplotlib.pyplot.plot() function
Using the “extra_cat” input argument to the matplotlib.pyplot.plot() function
Using the “cat” input argument to the matplotlib.pyplot.plot() function
Ans: Using the “color” input argument to the matplotlib.pyplot.plot() function
Question: Given the dplyr tibble object created, how many rows will be returned in the result of the filter?
pineapples <- tibble(
  country = c("Costa Rica", "Brazil", "Philippines", "Thailand", "Indonesia"),
  production = c(2.7, 2.5, 2.4, 2.2, 1.8)
)
filter(pineapples, production > 2.0)
0
2
1
3
4
Ans: 4
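Note: the row count can be checked by hand. The sketch below mirrors the filter in plain Python. The quiz itself uses R and dplyr, so the list-of-dicts layout and the name `kept` are assumptions made only for illustration.

```python
# Plain-Python mirror of filter(pineapples, production > 2.0).
# A list of dicts stands in for the tibble (illustrative assumption).
pineapples = [
    {"country": "Costa Rica", "production": 2.7},
    {"country": "Brazil", "production": 2.5},
    {"country": "Philippines", "production": 2.4},
    {"country": "Thailand", "production": 2.2},
    {"country": "Indonesia", "production": 1.8},
]

# Keep only the rows whose production is strictly greater than 2.0.
kept = [row for row in pineapples if row["production"] > 2.0]

print(len(kept))  # 4
print([row["country"] for row in kept])
```

Only Indonesia (1.8) fails the condition, which leaves four rows.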
What special operator is used by dplyr to pass a function argument to one of its methods?
>
%<%
%*%
%>%

Ans: %>%
Which R class most closely resembles the dplyr tibble?
list
raw
data.frame
vector
Ans: data.frame
Question: When does it make sense to use treemaps?
To analyze proportions of individual categories at various points in time
Show trends over time where there are many ordered data points
Show the trend of a stock’s performance based on the high, low, and close of that stock over some days
To analyze proportions of individual categories
Ans: To analyze proportions of individual categories
Question: What data does a heatmap convey?
The correlation matrix between all pairs of variables
The median of all variables
The 25th and 75th percentile for all variables
The outliers in all variables
Ans: The correlation matrix between all pairs of variables
Question: When does it make sense to use pie charts?
To represent and visualize hierarchical information
To visualize the relationship between continuous variables
To analyze proportions of individual categories
Show trends over time where there are many ordered data points
Ans: To analyze proportions of individual categories
Question: If you want to create a box plot without points representing outliers, how would you parameterize your call to the plt.boxplot() function?
Using the “showfliers” input argument to the plt.boxplot() function
Using the “outlier_markers” input argument to the plt.boxplot() function
Using the “showoutliers” input argument to the plt.boxplot() function
Using the “outliers” input argument to the plt.boxplot() function
Ans: Using the “showfliers” input argument to the plt.boxplot() function
Question: Given the dplyr tibble object created, which column will be in the output of the select statement?
pineapples <- tibble(
  country = c("Costa Rica", "Brazil", "Philippines", "Thailand", "Indonesia"),
  production = c(2.7, 2.5, 2.4, 2.2, 1.8)
)
select(pineapples, production)
country, production
country
production
NA
Ans: production
How is the dplyr group_by method typically used?
When creating subsets of columns
In conjunction with dplyr join functions
In conjunction with dplyr summary functions
When filtering or slicing rows
Ans: In conjunction with dplyr summary functions
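Note: the group-then-summarise pattern can be sketched in plain Python. This is only an analogy to dplyr's group_by followed by summary functions. The records and the names `groups` and `summary` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (group, value) records standing in for a tibble.
records = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0), ("b", 30.0)]

# Step 1: group the values by key (the role of group_by).
groups = defaultdict(list)
for key, value in records:
    groups[key].append(value)

# Step 2: reduce each group to one row of summary statistics.
summary = {
    key: {"mean": mean(values), "min": min(values), "max": max(values)}
    for key, values in groups.items()
}

print(summary)
```

Each group collapses to a single row, which is why grouping is usually paired with functions such as mean, min, and max.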
Question: Which of these statements best defines autocorrelation?
Two variables which are not related to each other
A strong positive correlation between two variables
A strong negative correlation between two variables
The correlation of a variable with itself shifted in time
Ans: The correlation of a variable with itself shifted in time
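Note: a minimal sketch of this definition, assuming the simplest reading: the Pearson correlation between a series and a copy of itself shifted by a fixed lag. Statistical packages often normalise differently. The helper names and the sample series are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def autocorrelation(series, lag):
    """Correlate the series with itself shifted by `lag` steps."""
    return pearson(series[:-lag], series[lag:])


# Hypothetical signal that repeats every 4 steps.
series = [1, 0, -1, 0] * 5

print(round(autocorrelation(series, 4), 6))  # full period: strongly positive
print(round(autocorrelation(series, 2), 6))  # half period: strongly negative
```

A shift by the full period lines the series up with itself, while a shift by half a period pairs every value with its opposite.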
Question: Which of these statements is true about scatter plots?
The distribution of a continuous variable can be visualized
Two variables can be visualized in a scatter plot
Two categorical variables can be visualized
Multiple pairs of variables can be visualized
Ans:Two variables can be visualized in a scatter plot
Question: When does it make sense to use area charts?
To analyze proportions of individual categories
Show the trend of a stock’s performance based on the high, low, and close of that stock over some days
To analyze composition of multiple categories over a period of time
Show trends over time where there are many ordered data points
Ans: To analyze composition of multiple categories over a period of time
Question: What can a single box plot convey?
The outliers
The median
The distribution
The count
The 25th and 75th percentile
Ans:
The outliers
The median
The 25th and 75th percentile
Given the following line equation, and a y variable that can take on only positive values, for which value of x is y invalid.
y = 5x + 50
x = 1
x = 10
x = 1
x = 0
x = 10
Ans: x = 10
What does a negative result in the cor function indicate?
There is a negative correlation between the variables
The correlation function has produced an error
There is no correlation between the variables
There is a positive correlation between the variables
Ans: There is a negative correlation between the variables
What elements are returned by the code snippet demonstrating the dplyr setdiff function?
pcars %>% setdiff(pcars_training)
The columns of pcars that are not in pcars_training
The rows of pcars_training that are not contained in pcars
The rows of pcars that are also contained in pcars_training
The rows of pcars that are not in pcars_training
Ans: The rows of pcars that are not in pcars_training
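Note: the same row-wise difference can be sketched in plain Python. The two tables below are hypothetical stand-ins for pcars and pcars_training, and `held_out` is an illustrative name.

```python
# Hypothetical rows; each tuple is one row of the table.
pcars = [("Mazda RX4", 21.0), ("Datsun 710", 22.8), ("Valiant", 18.1)]
pcars_training = [("Mazda RX4", 21.0), ("Valiant", 18.1)]

# Rows of the first table that do not appear in the second,
# which is the behaviour described for pcars %>% setdiff(pcars_training).
training_rows = set(pcars_training)
held_out = [row for row in pcars if row not in training_rows]

print(held_out)  # [('Datsun 710', 22.8)]
```

This is a common way to derive a test set once the training rows have been sampled.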
Question: When does it make sense to use scatter charts or correlation heatmaps?
To create multiple charts grouped by a category
To explore relationships between pairs of variables in data
Show trends over time where there are many ordered data points
To analyze data using multiple chart types in a single chart
Ans: To explore relationships between pairs of variables in data
Question: What type is internally used to store elements of a factor?
raw
character
logical
integer
complex
Ans: integer
Given the following code snippet, which variable represents the dependent variable?
Ozone ~ Solar.R + Wind + Temp
Solar.R
Wind
Ozone
Temp
Ans: Ozone
Question: Select the variables contained in the summary of a linear model.
variance-covariance matrix
r-squared values
f-statistic
confidence interval
residuals
Ans:
r-squared values
f-statistic
residuals
Question: Select the summaries computed on a data frame by the summary function.
median
mean
variance
mode
min
max
Ans:
median
mean
min
max
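Note: a rough Python counterpart of these summaries, using only the standard library. R's summary also reports the first and third quartiles for numeric columns. The sample values are borrowed from the pineapple example used elsewhere in this quiz.

```python
from statistics import mean, median

values = [2.7, 2.5, 2.4, 2.2, 1.8]

# The four summaries named in the answer above.
summary = {
    "min": min(values),
    "median": median(values),
    "mean": mean(values),
    "max": max(values),
}

print(summary)
```

Variance and mode are not part of this output and would need separate calls.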
Question: What is the default method for handling NA values in the sort function?
NA values are discarded
NA values are left where they originally occurred
NA values are sorted last
NA values must be removed before sorting
NA values are sorted first
Ans: NA values are discarded
Match the regression method with its outcome type.
Answer Options:
A:binomial logistic regression
B:multinomial logistic regression
C:linear regression
two-valued
A
B
C
Ans: A
multi-valued
A
B
C
Ans: B
continuous
A
B
C
Ans:C
Question: Given the following data frame code, select the valid methods for retrieving the production column.
pineapples <- data.frame(
  country = c("Costa Rica", "Brazil", "Philippines", "Thailand", "Indonesia"),
  production = c(2.7, 2.5, 2.4, 2.2, 1.8)
)
pineapples[1,2]
pineapples[“production”]
pineapples[production]
pineapples[2,]
pineapples[,2]
pineapples%production
pineapples$production
Ans:
pineapples[“production”]
pineapples[,2]
pineapples$production
Question: Given the following code, what is the result of the seq function?
seq(from = 2, to = 10, by = 2)
2 4 6 8
2 4 6 8 10
2 6 10
4 6 8
Ans: 2 4 6 8 10
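Note: R's seq includes the end point when the step lands on it exactly. A Python sketch of the same behaviour for positive integer steps follows. The helper name `seq` is chosen here only to mirror the R call.

```python
def seq(start, stop, step):
    """Inclusive integer sequence, similar to R's seq(from, to, by)."""
    # range() excludes its end point, so extend it by one.
    return list(range(start, stop + 1, step))


result = seq(2, 10, 2)
print(result)  # [2, 4, 6, 8, 10]
```

The end point 10 is included, which is the main difference from Python's own range(2, 10, 2).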
Question: Match the expression with the operation performed on the matrices.
Answer Options:
A:A * B
B:A %*% B
matrix multiplication
A
B
Ans: B
element-wise multiplication
A
B
Ans: A
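Note: the difference between the two operators is easy to see on a small case. The sketch below computes both products for two hypothetical 2×2 matrices in plain Python. Nested lists stand in for R matrices.

```python
A = [[1, 2],
     [3, 4]]
B = [[5, 6],
     [7, 8]]

# Element-wise product (A * B in R): multiply matching entries.
elementwise = [[a * b for a, b in zip(row_a, row_b)]
               for row_a, row_b in zip(A, B)]

# Matrix product (A %*% B in R): rows of A dotted with columns of B.
size = len(A)
matmul = [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
          for i in range(size)]

print(elementwise)  # [[5, 12], [21, 32]]
print(matmul)       # [[19, 22], [43, 50]]
```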
Question: Given a decision tree outcome of three possible values, what does the predict function return for each prediction when given type = “prob” as an argument?
A vector containing the probability for each of the three possible outcomes
A single probability for the most likely outcome
A single value containing the most likely outcome
A vector of the possible outcomes in order of probability
Ans: A vector containing the probability for each of the three possible outcomes
Question: What category of algorithm do clustering methods belong to?
supervised learning
classification
sorting
unsupervised learning
Ans: unsupervised learning
Question: Given the following code, what are the contents of the vector v?
v <- 1:5
v[v < 3] <- 2
2 2
2 2 3 4 5
2 2 2 4 5
1 2 3 4 5
1 2 2 4 5
Ans: 2 2 3 4 5
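Note: the logical-index assignment replaces only the elements that satisfy the condition. A plain-Python sketch of the same idea, assuming the R code is v <- 1:5 followed by v[v < 3] <- 2:

```python
v = list(range(1, 6))  # 1, 2, 3, 4, 5

# Replace every element below 3 with 2 and leave the rest untouched.
v = [2 if x < 3 else x for x in v]

print(v)  # [2, 2, 3, 4, 5]
```

Only the first two elements (1 and 2) meet the condition, so the vector keeps its length of five.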
Question: Select the matrix produced by the following diag function.
diag(3)
A 3×3 matrix with 0s everywhere
A 3×3 matrix with 3s everywhere
A 1×1 matrix with the value 3
A 3×3 matrix with 1s on the main diagonal and 0s everywhere else
Ans: A 3×3 matrix with 1s on the main diagonal and 0s everywhere else
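Note: called with a single number n, diag builds an n×n identity matrix. A plain-Python sketch follows. The helper name `identity` is hypothetical.

```python
def identity(n):
    """n x n matrix with 1s on the main diagonal and 0s elsewhere."""
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


for row in identity(3):
    print(row)
```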
Question: Given the following code snippet, how does the set.seed function affect the sample_frac function?
sample_frac will fail if set.seed is not called first
sample_frac performs faster
sample_frac will select random rows each time it is run
sample_frac will always select the same rows for a given seed
Ans: sample_frac will always select the same rows for a given seed
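Note: the effect of seeding can be sketched with Python's random module. This is an analogy to set.seed with sample_frac, not the R implementation. The helper `sample_fraction` and its parameters are hypothetical.

```python
import random


def sample_fraction(rows, frac, seed):
    """Pick a fraction of the rows using a generator seeded with `seed`."""
    rng = random.Random(seed)
    count = round(len(rows) * frac)
    return rng.sample(rows, count)


rows = list(range(100))

first = sample_fraction(rows, 0.1, seed=42)
second = sample_fraction(rows, 0.1, seed=42)

print(first == second)  # True: the same seed selects the same rows
```

Reusing the seed makes a train/test split reproducible across runs.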
Question: What type of diagram is used to analyse a hierarchical cluster?
Line chart
Dendrogram
Scatter plot
Histogram
Ans:Dendrogram
Question: What is the main difference between lists and vectors in R?
A list must contain elements of the same class type
A list must have named members
A list can contain elements of different classes
A list maintains a sorted order
Ans: A list can contain elements of different classes
Question: Why is velocity important?
Organizations operate on their own schedules
How quickly data’s processed is unimportant
Velocity means fast
Customers usually ‘want it now’
Ans: Customers usually ‘want it now’
Question: Why is structure so important to variety?
We never had structured data
Structured data lacks rules
Most modern data is unstructured
Unstructured data is organized
Ans: Most modern data is unstructured
Question: Validity and Volatility are linked to which V?
Velocity
Veracity
Variety
Volume
Ans: Veracity
Question: What does the k in k-means clustering refer to?
The number of columns of data
The number of different labels the data set contains
The number of clusters
The number of rows in the cluster data
Ans: The number of clusters
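Note: a minimal one-dimensional k-means sketch that shows where k enters the algorithm. It assumes a naive initialisation (the first k points) and a fixed number of iterations. Real implementations choose starting centroids more carefully. The data points are hypothetical.

```python
def kmeans_1d(points, k, iterations=10):
    """Cluster 1-D points into k groups; returns (centroids, clusters)."""
    centroids = list(points[:k])  # naive start: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]
centroids, clusters = kmeans_1d(points, k=2)

print(centroids)
print(clusters)
```

With k=2 the points split into a low group near 1 and a high group near 9.7. Changing k changes how many centroids are fitted.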
Question: Match the R plot method parameter with its aes function equivalent.
Answer Options:
A:pch
B:bg
color
A
B
Ans:B
shape
A
B
Ans: A
Question: What do the points plotted on a box-and-whisker plot indicate?
Interquartile range
Mean values
Outliers
Median values
Ans:Outliers
Question: Which is not an example of finding value in Big Data?
Understanding relationships between four V’s
Better insights into customer needs
Discarding old sensor data
Avoiding business disruption
Ans: Discarding old sensor data
Question: Within how many standard deviations of the mean do at least 98% of outcomes fall?
1
2
3
4
Ans: 3
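Note: if the question is read as asking about whole standard deviations of a normal distribution (an assumption, since the wording is ambiguous), the coverage for each count can be computed with the standard library:

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def coverage(k):
    """Probability that a normal outcome lies within k standard deviations."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(k, round(coverage(k), 4))
```

Two standard deviations cover roughly 95% of outcomes, so three are needed to reach 98%.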
Question: How many variables can be described in a two-dimensional colored bubble plot?
2
4
5
3
Ans: 4
Question: Select the JavaScript library that is used to create interactive plots through a web browser.
pip
Vue.js
d3.js
jQuery
Ans:d3.js
How many rows and columns will the following table function return in its result?
table(c(1,0,1), c(1,2,3))
1 column, 9 rows
9 rows, 1 column
3 rows, 3 columns
2 rows, 3 columns
3 rows, 2 columns
Ans:2 rows, 3 columns
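Note: the shape follows from the distinct values in each argument. The first vector has two distinct values (0 and 1) and the second has three (1, 2, 3). A plain-Python cross-tabulation sketch follows. The variable names are illustrative.

```python
from collections import Counter

x = [1, 0, 1]
y = [1, 2, 3]

# Count how often each (x, y) pair occurs.
counts = Counter(zip(x, y))

# Rows are the distinct x values, columns the distinct y values.
row_levels = sorted(set(x))
col_levels = sorted(set(y))
table = [[counts[(r, c)] for c in col_levels] for r in row_levels]

print(len(table), "rows x", len(table[0]), "columns")
for level, row in zip(row_levels, table):
    print(level, row)
```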
What data type does ggplot expect?
table
timeseries
vector
data frame
Ans:data frame
Question: Match the visualization library with its canonical programming environment.
Answer Options:
A:Gnuplot
B:Ggplot2
C:Matplotlib
Shell (command line)
A
B
C
Ans:A
Python
A
B
C
Ans: C
R
A
B
C
Ans:B
Question: Which of the following options are included in Hill’s criteria for causation?
Reproducibility
Authority
History
Education
Effect size
Temporality
Ans:
Reproducibility
Effect size
Temporality
How many bins will be defined for the following argument to a geom_histogram plot using ggplot?
breaks=seq(1.5, 5.5, by = 0.5)
7
8
10
9
Ans:8
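Note: the number of bins is one less than the number of break points. The arithmetic can be sketched as follows. The variable names are illustrative.

```python
start, stop, step = 1.5, 5.5, 0.5

# Break points: 1.5, 2.0, ..., 5.5 (both ends included, as in R's seq).
n_breaks = int(round((stop - start) / step)) + 1
breaks = [start + i * step for i in range(n_breaks)]

# Each pair of neighbouring breaks bounds one bin.
n_bins = len(breaks) - 1

print(breaks)
print(n_bins)  # 8
```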
Question: What is the main source of correlation errors explained by Simpson’s Paradox?
Dimension reduction
Training data
Confounding variables
Validation data
Ans:Confounding variables
How is a bubble plot different from a scatter plot?
A scatter plot visualizes more information
A bubble plot has different sized points
A scatter plot has different sized points
A scatter plot can use color
Ans: A bubble plot has different sized points
Question: Match the term with the scenario for its appropriate use.
Answer Options:
A:Jargon
B:Layman terms
Academic paper
A
B
Ans: A
Public lecture
A
B
Ans: B
Company meeting
A
B
Ans: A
Blog post
A
B
Ans: B
Question: What type of unclean data refers to data in varying units of measurement?
Inaccurate Data
Inconsistent Data
Erroneous Data
Missing Data
Nonstandard Data
Ans: Nonstandard Data
Question: Given the following code, what will be the dimensions of tbl_final?
library(tidyverse)
tbl_test <- tibble(
  ID = c(1, 1, 2, 2, 3, 3),
  name = c('name', 'year', 'name', 'year', 'name', 'year'),
  value = c('Steve', 1897, 'Bob', 2001, 'Jane', 1991)
)
tbl_final <- tbl_test %>% spread(name, value)
12 x 2
3 x 3
6 x 3
3 x 6
2 x 12
Ans: 3 x 3
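Note: spreading turns each distinct key into a column and leaves one row per ID. A plain-Python sketch of that long-to-wide reshaping follows. The values are strings because R's c() coerces the mixed vector to character. The dict layout is an illustrative assumption.

```python
# Long format: one (ID, key, value) triple per row.
long_rows = [
    (1, "name", "Steve"), (1, "year", "1897"),
    (2, "name", "Bob"), (2, "year", "2001"),
    (3, "name", "Jane"), (3, "year", "1991"),
]

# Wide format: one row per ID, one column per distinct key.
wide = {}
for row_id, key, value in long_rows:
    wide.setdefault(row_id, {"ID": row_id})[key] = value

tbl_final = list(wide.values())

print(len(tbl_final), "x", len(tbl_final[0]))  # 3 x 3
for row in tbl_final:
    print(row)
```

Six long rows collapse to three wide rows with the columns ID, name, and year.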
Question: Select the characteristics of informal communication in data science.
Tabular data
Personal anecdotes
Storytelling
Annotated algorithms
Scatter plots
Appropriate layman’s terms
Ans:
Personal anecdotes
Storytelling
Appropriate layman’s terms
Given the following regular expression substitution, select the description of the result.
gsub("(^')|('$)", "", var)
Any characters not enclosed in single quotes get removed
All single quotes get removed
Single quotes at the beginning and end of the string get removed
All characters between single quotes get removed
Ans: Single quotes at the beginning and end of the string get removed
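Note: a Python sketch of the same substitution, assuming the intended pattern uses alternation, (^')|('$), so that a quote is matched either at the start or at the end of the string. The helper name is hypothetical.

```python
import re


def strip_outer_quotes(text):
    """Remove a single quote at the very start and/or very end of text."""
    return re.sub(r"(^')|('$)", "", text)


print(strip_outer_quotes("'it's here'"))  # it's here
print(strip_outer_quotes("plain"))        # plain
```

The quote inside the string is untouched because the anchors ^ and $ restrict matches to the two ends.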
Which functions can be used to test a data set for missing or NA values?
is.na
testNA
anyNA
tryCatch
findNA
Ans:
is.na
anyNA
Question: What are some important data science strategies that are shared with software development?
Mathematical rigor
Statistical modeling
Continuous integration
Code annotation
Model versioning
Version control
Ans:
Continuous integration
Code annotation
Version control
By default, what sheet is loaded by read_excel?
most recently viewed
last
first
most recently created
Ans: first
Select examples of aggregate functions used in conjunction with group_by.
as.numeric
mean
select
min
mutate
max
Ans:
mean
min
max
When fetching a document over HTTP, what function can be used to check for errors?
glimpse
exit
tryCatch
anyNA
Ans:tryCatch
What argument passed to dbFetch will request that all rows be returned?
n=1
n=NULL
n=0
n=-1
Ans: n=-1
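Note: in R's DBI, dbFetch(res, n = -1) retrieves all remaining rows. Python's DB-API draws a similar line between fetching a batch and fetching everything. The sketch below uses the standard sqlite3 module with a hypothetical in-memory table. It is an analogy, not the R API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fruit (name TEXT, tonnes REAL)")
conn.executemany(
    "INSERT INTO fruit VALUES (?, ?)",
    [("pineapple", 2.7), ("mango", 1.1), ("papaya", 0.4)],
)

cursor = conn.execute("SELECT name, tonnes FROM fruit ORDER BY name")

first_batch = cursor.fetchmany(2)  # a limited batch, like a positive n
remaining = cursor.fetchall()      # everything still pending

conn.close()

print(first_batch)
print(remaining)
```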
Match the data quality element with its criteria.
Instruction: Match each answer with the correct target. Each answer can only be used once.
Answer Options:
A:descriptive statistics
B:recency
C:cross reference
D:numeric range
constraints
A
B
C
D
Ans: D
consistency checks
A
B
C
D
Ans: C
validity
A
B
C
D
Ans: B
data profiling
A
B
C
D
Ans: A
Question: How many variables can be described in a two-dimensional colored scatter plot?
3
2
5
4
Ans: 3
Question: Select the answer that best describes when to use a line plot.
When the data is random
When the data is clustered
When the data is spaced linearly
When the data is unstructured
Ans: When the data is spaced linearly
What function is used to split a column into two or more columns based on a delimiter?
separate
spread
mutate
tokenize
split
Ans: separate
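Note: the idea behind separate is to split each value of one column on a delimiter and spread the pieces across new columns. A plain-Python sketch with a hypothetical "year-month" column:

```python
# Hypothetical column holding "year-month" strings.
period = ["2021-03", "2022-11"]

# Split each value on the delimiter into two named columns.
separated = [
    dict(zip(("year", "month"), value.split("-", 1)))
    for value in period
]

print(separated)
```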
Given a left join, if a corresponding record does not exist in the right table, what happens to the joined values?
The join results in an empty table
The missing values are replaced with NA
The missing values are replaced with NULL
The missing values are replaced with 0 or “”
The record is dropped
Ans: The missing values are replaced with NA
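Note: a left join keeps every row of the left table and fills the gaps where the right table has no match. In the plain-Python sketch below, None plays the role of R's NA. The tables are hypothetical.

```python
left = [
    {"id": 1, "name": "Steve"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Jane"},
]
# Right table keyed by id; there is no entry for id 2.
right = {1: 1897, 3: 1991}

# Keep every left row; unmatched rows get None for the joined column.
joined = [{**row, "year": right.get(row["id"])} for row in left]

for row in joined:
    print(row)
```

The row for Bob is kept, with the missing year filled in rather than dropped.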
Question: Select the answer that best describes when to use a bar chart.
When the data is continuous
When the data is random
When the data is clustered
When the data is categorical
When the data is unstructured
Ans: When the data is categorical
Question: In the context of Neural Networks, which of these statements correctly describe a fully connected layer?
Every neuron in this layer takes its input from all the neurons in the previous layer
Output of every neuron in this layer is fed as input to exactly one neuron in the next layer
Output of every neuron in this layer is fed as input to a specific neuron in the next layer
Every neuron in this layer takes its input from exactly one neuron in the previous layer
Ans: Every neuron in this layer takes its input from all the neurons in the previous layer
Question: What would be the activation function of a Neural Network that is made to perform linear regression on input data?
identity
logit
ReLU
tanh
Ans: identity
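Note: with the identity activation, a single neuron computes exactly the linear-regression form w·x + b. A minimal sketch with hypothetical weights and inputs follows, with ReLU shown only for contrast.

```python
def identity(z):
    return z


def relu(z):
    return max(0.0, z)


def neuron(inputs, weights, bias, activation=identity):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(z)


inputs, weights = [2.0, 3.0], [0.5, -1.0]

print(neuron(inputs, weights, bias=4.0))                   # 2.0
print(neuron(inputs, weights, bias=0.0))                   # -2.0
print(neuron(inputs, weights, bias=0.0, activation=relu))  # 0.0
```

ReLU clips the negative output to zero, so it could not reproduce a regression target below zero, while the identity passes the linear value through unchanged.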
Question: Select the answer that best describes when to use a bar chart.
When the data is categorical
When the data is continuous
When the data is unstructured
When the data is clustered
Ans: When the data is categorical
Question: Match the following statements about neurons in the context of machine learning with their correct Boolean values.
Answer Options:
A:A neuron can output multiple different values
B:A neuron can consist of only one function, a linear one
C:Every connection between two neurons has a weight W associated with it
D:A neuron is a mathematical function that can take multiple inputs and outputs a single value
False
A
B
C
D
Ans: A,B
True
A
B
C
D
Ans: C,D
Question: Match the following Neural Networks terms with their corresponding definition
Answer Options:
A:Gradient Descent Optimization
B:Epoch
C:Learning Rate
D:Batch Size
The iterative process of adjusting the model parameters to minimize the loss
A
B
C
D
Ans: A
Question: Match the visual element of the box plot with the data that it represents.
Answer Options:
A:Box height
B:Line
C:Whiskers
D:Points
Median
A
B
C
D
Ans:B
Smallest and largest non-outliers
A
B
C
D
Ans: C
First to third quartile range
A
B
C
D
Ans: A
Outliers
A
B
C
D
Ans: D
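Note: the four visual elements can be computed directly. The sketch below assumes the common 1.5 × IQR rule for outliers and the "inclusive" quartile method of Python's statistics module. Plotting libraries may use slightly different quartile conventions. The data are hypothetical.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]

# Box: first to third quartile. Line: the median.
q1, median, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Points: values beyond 1.5 * IQR from the box edges.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

# Whiskers: smallest and largest values that are not outliers.
inside = [x for x in data if low_fence <= x <= high_fence]
whiskers = (min(inside), max(inside))

print("box:", (q1, q3))
print("line:", median)
print("whiskers:", whiskers)
print("points:", outliers)
```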
Question: Select the elements that make up a graph in a network visualization.
Edges
Nodes
Dendrites
Weights
Perceptrons
Ans:
Edges
Nodes
Weights
Question: What values determine the critical value for a hypothesis test?
Sample mean
Sample standard deviation
Confidence level
Sample size
Degrees of freedom
Ans:
Confidence level
Degrees of freedom
Question: Select the values used for a chi-square goodness-of-fit test.
Mean and confidence level
Median and standard deviation
Observed and expected frequencies
Prior and posterior probability
Ans: Observed and expected frequencies
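Note: the test statistic is the sum of (observed − expected)² / expected over all categories. A small sketch with hypothetical counts from 120 rolls of a die assumed to be fair:

```python
# Hypothetical observed counts for faces 1..6 over 120 rolls.
observed = [18, 22, 20, 25, 15, 20]
# A fair die predicts 120 / 6 = 20 rolls per face.
expected = [20] * 6

chi_square = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
degrees_of_freedom = len(observed) - 1

print(round(chi_square, 3))  # 2.9
print(degrees_of_freedom)    # 5
```

The statistic is then compared with the critical value for the chosen confidence level and these degrees of freedom.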