Theory
What is a Pandas Series?
A Series is very similar to a NumPy array (in fact, it is built on top of the NumPy array object). The difference between a NumPy array and a Series is that a Series can have axis labels, meaning it can be indexed by a label instead of just a number location. It also doesn't need to hold numeric data; it can hold any arbitrary Python object.
How will you import the libraries?
import numpy as np
import pandas as pd
How will one convert a list, NumPy array, or dictionary to a Series?
Let us first create a list, an array, and a dictionary:
labels = ['a','b','c']
my_list = [10,20,30]
my_arr = np.array([10,20,30])
my_dict = {'a':10,'b':20,'c':30}
Now, converting the list to a Series:
pd.Series(data=my_list)
Output:
0    10
1    20
2    30
dtype: int64
A Series can also be indexed by a label:
pd.Series(data=my_list,index=labels)
Output:
a    10
b    20
c    30
dtype: int64
Or, equivalently, passing the index positionally:
pd.Series(my_list,labels)
Output:
a    10
b    20
c    30
Similarly, we can convert a numpy array to a series:
pd.Series(my_arr)
Output:
0    10
1    20
2    30
dtype: int32
pd.Series(my_arr,labels)
Output:
a    10
b    20
c    30
For converting a dictionary:
pd.Series(my_dict)
Output:
a    10
b    20
c    30
Create a dataframe with random numbers.
import pandas as pd
import numpy as np
from numpy.random import randn
df = pd.DataFrame(randn(5,4),index=['A','B','C','D','E'],columns=['W','X','Y','Z'])
df
How will one grab data from a particular column or columns in a dataframe?
df['X']
Output:
A 0.018639
B -1.044210
C -1.432191
D -0.689218
E -1.306992
Name: X, dtype: float64
This returns a series
df[['X','Z']]
This returns a dataframe.
DataFrame Columns are just Series.
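This can be checked directly; a minimal sketch (the frame below is a stand-in for the df built above):

```python
import pandas as pd
from numpy.random import randn

# Stand-in for the df created earlier
df = pd.DataFrame(randn(5, 4), index=list('ABCDE'), columns=list('WXYZ'))

single = df['X']         # one bracket pair  -> a Series
double = df[['X', 'Z']]  # a list of columns -> a DataFrame

print(type(single))
print(type(double))
```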
How to add a new column to existing dataframe?
df['new'] = df['X'] + df['Z']
How to drop a column and a row in pandas?
For dropping a column, we must pass axis=1:
df.drop('new',axis=1,inplace=True)
If the inplace argument is not True, the column will not be removed permanently from df.
Dropping a row:
df.drop('E',axis=0,inplace=True)
Explain conditional selection in dataframe using pandas.
Consider a dataframe with 4 columns and 5 rows:
For getting all the entries that are greater than 0, we can use:
df>0
W | X | Y | Z | |
A | False | False | True | False |
B | True | True | True | False |
C | True | True | True | False |
D | False | True | True | True |
E | True | False | True | True |
df[df>0]
W | X | Y | Z | |
A | NaN | NaN | 0.070692 | NaN |
B | 0.433249 | 0.849024 | 0.816763 | NaN |
C | 0.464947 | 1.094842 | 0.010631 | NaN |
D | NaN | 1.819395 | 0.060382 | 0.484935 |
E | 0.580525 | NaN | 0.747503 | 0.894294 |
df[df['X']>0]
W | X | Y | Z | |
B | 0.433249 | 0.849024 | 0.816763 | -1.495481 |
C | 0.464947 | 1.094842 | 0.010631 | -0.681866 |
D | -0.259343 | 1.819395 | 0.060382 | 0.484935 |
How to reset index and set it to something different?
Consider a dataframe
W | X | Y | Z | |
A | -1.131096 | 1.292825 | -0.656096 | -0.028158 |
B | -1.559303 | 0.276486 | -0.590994 | 0.520583 |
C | 3.306751 | 1.359687 | -0.137238 | 0.521303 |
D | 0.142652 | -0.495529 | -1.417824 | 0.750177 |
E | 1.589873 | -0.282113 | 1.042828 | -0.862563 |
df.reset_index()
Index | W | X | Y | Z | |
0 | A | -1.131096 | 1.292825 | -0.656096 | -0.028158 |
1 | B | -1.559303 | 0.276486 | -0.590994 | 0.520583 |
2 | C | 3.306751 | 1.359687 | -0.137238 | 0.521303 |
3 | D | 0.142652 | -0.495529 | -1.417824 | 0.750177 |
4 | E | 1.589873 | -0.282113 | 1.042828 | -0.862563 |
newind = 'BL DL IX CC CH'.split()
df['States'] = newind
df
W | X | Y | Z | States | |
A | -1.131096 | 1.292825 | -0.656096 | -0.028158 | BL |
B | -1.559303 | 0.276486 | -0.590994 | 0.520583 | DL |
C | 3.306751 | 1.359687 | -0.137238 | 0.521303 | IX |
D | 0.142652 | -0.495529 | -1.417824 | 0.750177 | CC |
E | 1.589873 | -0.282113 | 1.042828 | -0.862563 | CH |
df.set_index('States',inplace=True)
df
States | W | X | Y | Z |
BL | -1.131096 | 1.292825 | -0.656096 | -0.028158 |
DL | -1.559303 | 0.276486 | -0.590994 | 0.520583 |
IX | 3.306751 | 1.359687 | -0.137238 | 0.521303 |
CC | 0.142652 | -0.495529 | -1.417824 | 0.750177 |
CH | 1.589873 | -0.282113 | 1.042828 | -0.862563 |
How to deal with missing data in Pandas?
Missing data is represented by NaN in pandas.
Let us create a dataframe with some missing values:
df = pd.DataFrame({'A':[1,2,np.nan],
                   'B':[3,np.nan,np.nan],
                   'C':[4,5,6]})
A | B | C | |
0 | 1.0 | 3.0 | 4 |
1 | 2.0 | NaN | 5 |
2 | NaN | NaN | 6 |
For dropping all the rows from df having a NaN value:
df.dropna()
A | B | C | |
0 | 1.0 | 3.0 | 4 |
For dropping all the columns from df having a NaN value:
df.dropna(axis=1)
C | |
0 | 4 |
1 | 5 |
2 | 6 |
For dropping rows with NaN values, while keeping rows that have at least 2 non-NaN values:
df.dropna(thresh=2)
A | B | C | |
0 | 1.0 | 3.0 | 4 |
1 | 2.0 | NaN | 5 |
Replacing NaN values with something else:
df.fillna(value='X VAL')
A | B | C | |
0 | 1 | 3 | 4 |
1 | 2 | X VAL | 5 |
2 | X VAL | X VAL | 6 |
Explain different types of operations that can be performed on a pandas dataframe.
Let us take a dataframe and perform various kinds of operation on it.
import pandas as pd
df = pd.DataFrame({'col1':[1,2,3,4],'col2':[11,22,33,11],'col3':['ab','cd','ef','gh']})
df.head()
col1 | col2 | col3 | |
0 | 1 | 11 | ab |
1 | 2 | 22 | cd |
2 | 3 | 33 | ef |
3 | 4 | 11 | gh |
unique
To get the unique values in a column:
df['col2'].unique()
Out: array([11, 22, 33], dtype=int64)
nunique
To get the number of unique values in a column:
df['col2'].nunique()
Out: 3
value_counts
To get the count of each value in a column:
df['col2'].value_counts()
Out:
11 2
22 1
33 1
Name: col2, dtype: int64
isnull
To check for null values in the dataframe:
df.isnull()
col1 | col2 | col3 | |
0 | False | False | False |
1 | False | False | False |
2 | False | False | False |
3 | False | False | False |
del
Permanently removing a column:
del df['col1']
df
col2 | col3 | |
0 | 11 | ab |
1 | 22 | cd |
2 | 33 | ef |
3 | 11 | gh |
Get column and index names:
df.columns
Out:
Index(['col2', 'col3'], dtype='object')
df.index
Out:
RangeIndex(start=0, stop=4, step=1)
sort_values
df.sort_values(by='col2')
col2 | col3 | |
0 | 11 | ab |
3 | 11 | gh |
1 | 22 | cd |
2 | 33 | ef |
How do you read from and store data in a csv, excel and html file in pandas?
Reading a csv file into a dataframe:
df = pd.read_csv('filename')
Storing a dataframe as a csv file:
df.to_csv('filename',index=False)
Reading an excel file into a dataframe:
pd.read_excel('Excel_filename.xlsx',sheet_name='Sheet1')
Storing a dataframe as an excel file:
df.to_excel('Excel_filename.xlsx',sheet_name='Sheet1')
Reading the tables of an html page into a list of dataframes:
df = pd.read_html('http://www.fdic.gov/bank/individual/failed/banklist.html')
What is concatenation? Explain with examples.
Concatenation glues together two or more DataFrames. Dimensions should match along the axis of the df you are concatenating on.
Let's create two dataframes, df1 and df2:
import pandas as pd
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'],
                    'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'],
                    'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'],
                    'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'],
                    'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])
df1
A | B | C | D | |
0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
df2
A | B | C | D | |
4 | A4 | B4 | C4 | D4 |
5 | A5 | B5 | C5 | D5 |
6 | A6 | B6 | C6 | D6 |
7 | A7 | B7 | C7 | D7 |
Concatenating two dataframes:
pd.concat([df1,df2])
A | B | C | D | |
0 | A0 | B0 | C0 | D0 |
1 | A1 | B1 | C1 | D1 |
2 | A2 | B2 | C2 | D2 |
3 | A3 | B3 | C3 | D3 |
4 | A4 | B4 | C4 | D4 |
5 | A5 | B5 | C5 | D5 |
6 | A6 | B6 | C6 | D6 |
7 | A7 | B7 | C7 | D7 |
pd.concat([df1,df2],axis=1)
A | B | C | D | A | B | C | D | |
0 | A0 | B0 | C0 | D0 | NaN | NaN | NaN | NaN |
1 | A1 | B1 | C1 | D1 | NaN | NaN | NaN | NaN |
2 | A2 | B2 | C2 | D2 | NaN | NaN | NaN | NaN |
3 | A3 | B3 | C3 | D3 | NaN | NaN | NaN | NaN |
4 | NaN | NaN | NaN | NaN | A4 | B4 | C4 | D4 |
5 | NaN | NaN | NaN | NaN | A5 | B5 | C5 | D5 |
6 | NaN | NaN | NaN | NaN | A6 | B6 | C6 | D6 |
7 | NaN | NaN | NaN | NaN | A7 | B7 | C7 | D7 |
Explain dataframes joins with example.
Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.
left = pd.DataFrame({'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']},
                    index=['K0', 'K1', 'K2'])
right = pd.DataFrame({'C': ['C0', 'C2', 'C3'],
                      'D': ['D0', 'D2', 'D3']},
                     index=['K0', 'K2', 'K3'])
left.join(right)
A | B | C | D | |
K0 | A0 | B0 | C0 | D0 |
K1 | A1 | B1 | NaN | NaN |
K2 | A2 | B2 | C2 | D2 |
left.join(right, how='outer')
A | B | C | D | |
K0 | A0 | B0 | C0 | D0 |
K1 | A1 | B1 | NaN | NaN |
K2 | A2 | B2 | C2 | D2 |
K3 | NaN | NaN | C3 | D3 |
Explain the merge operation.
The merge function merges DataFrames together using a logic similar to merging SQL tables. For example:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
Inner merge:
pd.merge(left, right, on=['key1', 'key2'])
A | B | key1 | key2 | C | D | |
0 | A0 | B0 | K0 | K0 | C0 | D0 |
1 | A2 | B2 | K1 | K0 | C1 | D1 |
2 | A2 | B2 | K1 | K0 | C2 | D2 |
Outer merge:
pd.merge(left, right, how='outer', on=['key1', 'key2'])
A | B | key1 | key2 | C | D | |
0 | A0 | B0 | K0 | K0 | C0 | D0 |
1 | A1 | B1 | K0 | K1 | NaN | NaN |
2 | A2 | B2 | K1 | K0 | C1 | D1 |
3 | A2 | B2 | K1 | K0 | C2 | D2 |
4 | A3 | B3 | K2 | K1 | NaN | NaN |
5 | NaN | NaN | K2 | K0 | C3 | D3 |
Right merge:
pd.merge(left, right, how='right', on=['key1', 'key2'])
A | B | key1 | key2 | C | D | |
0 | A0 | B0 | K0 | K0 | C0 | D0 |
1 | A2 | B2 | K1 | K0 | C1 | D1 |
2 | A2 | B2 | K1 | K0 | C2 | D2 |
3 | NaN | NaN | K2 | K0 | C3 | D3 |
Left merge:
pd.merge(left, right, how='left', on=['key1', 'key2'])
A | B | key1 | key2 | C | D | |
0 | A0 | B0 | K0 | K0 | C0 | D0 |
1 | A1 | B1 | K0 | K1 | NaN | NaN |
2 | A2 | B2 | K1 | K0 | C1 | D1 |
3 | A2 | B2 | K1 | K0 | C2 | D2 |
4 | A3 | B3 | K2 | K1 | NaN | NaN |
Explain groupby.
import pandas as pd
data = {'Company':['GOOG','GOOG','MSFT','MSFT','FB','FB'],
        'Person':['OM','JAI','SONA','MONA','RAM','SARA'],
        'Sales':[200,325,229,124,470,350]}
df = pd.DataFrame(data)
df
Company | Person | Sales | |
0 | GOOG | OM | 200 |
1 | GOOG | JAI | 325 |
2 | MSFT | SONA | 229 |
3 | MSFT | MONA | 124 |
4 | FB | RAM | 470 |
5 | FB | SARA | 350 |
Using groupby, we can aggregate the sales data by company:
by_comp = df.groupby('Company')
by_comp.mean()
Company | Sales |
FB | 410.0 |
GOOG | 262.5 |
MSFT | 176.5 |
OBJECTIVE QUIZ
In the following Python code, typing which Python command will give the user the CEO of Facebook?
import pandas as pd
companies_ceo = {
    'Amazon': 'Jeff Bezos',
    'Apple': 'Tim Cook',
    'SpaceX': 'Elon Musk',
    'Facebook': 'Mark Zuckerberg',
    'Netflix': 'Reed Hastings'
}
companies_ceo_series = pd.Series(companies_ceo)
companies_ceo_series['Facebook']
companies_ceo_series[4]
companies_ceo_series[3]
companies_ceo_series.values[4]
Ans:-
companies_ceo_series['Facebook']
companies_ceo_series[3]
In the following Python code, typing what command will create a DataFrame called “companies_ceo” whose first column has all the entries of the ‘companies’ list and whose second column has all the entries of the ‘ceo’ list, with the column names as the names of the respective variables?
import pandas as pd
companies = [
    'Amazon',
    'Apple',
    'SpaceX',
    'Facebook',
    'Netflix'
]
ceo = [
    'Jeff Bezos',
    'Tim Cook',
    'Elon Musk',
    'Mark Zuckerberg',
    'Reed Hastings'
]
companies_ceo_tuple = list(zip(companies, ceo)); companies_ceo = pd.DataFrame(companies_ceo_tuple, columns=['companies', 'ceo'])
companies_ceo = list(zip(companies, ceo))
companies_ceo = pd.DataFrame(companies, ceo)
companies_ceo_tuple = list(zip(companies, ceo)); companies_ceo = pd.DataFrame(companies_ceo_tuple)
Ans:- companies_ceo_tuple = list(zip(companies, ceo)); companies_ceo = pd.DataFrame(companies_ceo_tuple, columns=['companies', 'ceo'])
What happens when we call the stack () function on a Pandas DataFrame
It will create a new DataFrame from two DataFrames by stacking them side by side
It will create a new DataFrame from two DataFrames by stacking one on top of the other
It will create a new DataFrame such that a single row in the original DataFrame is stacked into multiple rows in the new DataFrame depending on the number of columns for each row in the original DataFrame
It will create a new DataFrame such that a single row in the new DataFrame is a combination of multiple rows in the original DataFrame depending on the number of columns for each row in the original DataFrame
Ans:- It will create a new DataFrame such that a single row in the original DataFrame is stacked into multiple rows in the new DataFrame depending on the number of columns for each row in the original DataFrame
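The stacking behavior can be seen on a toy frame (names here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'W': [1, 2], 'X': [3, 4]}, index=['A', 'B'])

# Each original row is stacked into one row per column,
# keyed by a (row label, column label) MultiIndex
stacked = df.stack()
print(stacked)
```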
Consider the following Python code. What command would you use to iterate through the “companies_ceo” DataFrame and print the list of all the CEOs in this DataFrame?
import pandas as pd
companies = {
    'Company': ['Facebook', 'Apple', 'Amazon', 'Netflix'],
    'CEO': ['Mark Zuckerberg', 'Tim Cook', 'Jeff Bezos', 'Reed Hastings'],
}
companies_ceo = pd.DataFrame(companies)
for row in companies_ceo.iterrows(): print(row[1])
for row in companies_ceo.iterfields(): print(row.CEO)
for row in companies_ceo.iterrecords(): print(row.CEO)
for row in companies_ceo.itertuples(): print(row.CEO)
Ans:-
for row in companies_ceo.iterrows(): print(row[1])
for row in companies_ceo.itertuples(): print(row.CEO)
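Both accepted answers can be sanity-checked on a self-contained copy of the frame from the question:

```python
import pandas as pd

companies_ceo = pd.DataFrame({
    'Company': ['Facebook', 'Apple', 'Amazon', 'Netflix'],
    'CEO': ['Mark Zuckerberg', 'Tim Cook', 'Jeff Bezos', 'Reed Hastings'],
})

# iterrows() yields (index, row-as-Series) pairs
via_iterrows = [row['CEO'] for _, row in companies_ceo.iterrows()]

# itertuples() yields namedtuples, so columns become attributes
via_itertuples = [row.CEO for row in companies_ceo.itertuples()]

print(via_iterrows)
print(via_itertuples)
```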
Which of the following formats does Pandas not support natively when exporting the contents of a Dataframe?
JSON
CSV
JPEG
HTML
Ans:- JPEG
Let’s say you have created a Pandas DataFrame called “unsorted” and you want to sort the contents of this DataFrame column wise in alphabetical order of the header name. Then, which function would you call on the “unsorted” DataFrame to do so?
unsorted.sort_index()
unsorted.sort_col(axis=1)
unsorted.index_sort()
unsorted.sort_index(axis=1)
Ans:- unsorted.sort_index(axis=1)
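A minimal sketch of column-wise sorting (the frame is invented for illustration):

```python
import pandas as pd

# Columns deliberately out of alphabetical order
unsorted = pd.DataFrame([[1, 2, 3]], columns=['c', 'a', 'b'])

# axis=1 sorts by column header instead of by row index
tidy = unsorted.sort_index(axis=1)
print(list(tidy.columns))
```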
Match the following functions that you can call on a Pandas DataFrame correctly with what they do
Answer Options:
A:.isnull()
B:.notnull()
C:.fillna(0)
D:.dropna()
Returns a Boolean array containing true or false values and returns the value in a cell as true if it contains NaN
A
B
C
D
Ans:- A
All the rows which contain a NaN value in any cell of that row are removed
A
B
C
D
Ans:- D
Returns a Boolean array containing true or false values and returns the value in a cell as true if it does not contain NaN
A
B
C
D
Ans:- B
Every cell in the Dataset which has a NaN value will be replaced with 0
A
B
C
D
Ans:- C
Match the following statements related to the .xs function in Pandas DataFrame with their correct Boolean values.
Answer Options:
A:The .xs function is used when our Pandas DataFrame makes use of a MultiIndex
B:By default, the .xs function only takes a look at values in the first level index
C:The .xs function cannot be used to return a cross section of columns
False
A
B
C
Ans:- C
True
A
B
C
Ans:- A, B
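The true statements (and the false one) can be demonstrated on a toy MultiIndex frame; the names here are invented for illustration:

```python
import pandas as pd

# Hypothetical two-level (MultiIndex) frame
idx = pd.MultiIndex.from_tuples(
    [('G1', 1), ('G1', 2), ('G2', 1), ('G2', 2)],
    names=['Group', 'Num'])
df = pd.DataFrame({'A': [10, 20, 30, 40]}, index=idx)

# By default .xs looks at the first index level...
print(df.xs('G1'))

# ...but it can also take a cross-section on another level
print(df.xs(1, level='Num'))
```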
Let’s say you have imported Pandas as pd and have instantiated two DataFrames called “frame_1” and “frame_2” with the exact same schema. What command will you use to combine these two DataFrames into a single DataFrame and make sure that the combined DataFrame has its own unique index?
pd.concat( [frame_2, frame_1,] )
pd.concat( [frame_2, frame_1], ignore_index = True )
pd.concat( [frame_1, frame_2], ignore_index = True )
pd.concat( [frame_1, frame_2] )
Ans:-
pd.concat( [frame_2, frame_1], ignore_index = True )
pd.concat( [frame_1, frame_2], ignore_index = True )
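The effect of ignore_index can be checked on two toy frames with the same schema:

```python
import pandas as pd

frame_1 = pd.DataFrame({'A': [1, 2]})
frame_2 = pd.DataFrame({'A': [3, 4]})

# Without ignore_index the result would repeat index values 0 and 1;
# with it, the combined frame gets a fresh unique RangeIndex
combined = pd.concat([frame_1, frame_2], ignore_index=True)
print(list(combined.index))
```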
The ‘how’ argument in the Pandas merge function allows us to specify what kind of join operation we want to perform on the given Pandas DataFrames. What are the valid values that we can give for this argument?
inner
left
outer
right
Ans:-
inner
left
outer
right
Some statements related to working with SQL Databases in Python are given below. Match them with their correct Boolean values.
Answer Options:
A:The sqlite3 library in Python allows us to create Databases on our local file system
B:Once we have created a table, we can use sqlite3’s .execute() function to recreate the same table with the same table name so that we have duplicates of a table
C:All the changes that we make to an SQL database on a Jupyter notebook by connecting with it, will be committed to the database only after we execute sqlite3’s .commit() function
False
A
B
C
Ans:- B
True
A
B
C
Ans:- A,C
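A minimal sketch using sqlite3 with an in-memory database (the table and its contents are invented):

```python
import sqlite3

# In-memory database so the sketch leaves no file behind
conn = sqlite3.connect(':memory:')
cur = conn.cursor()
cur.execute('CREATE TABLE ceos (company TEXT, ceo TEXT)')
cur.execute("INSERT INTO ceos VALUES ('Netflix', 'Reed Hastings')")
conn.commit()  # changes persist only after commit() (statement C)

rows = cur.execute('SELECT * FROM ceos').fetchall()
print(rows)

# Re-creating an existing table raises an error, so statement B is false
duplicate_failed = False
try:
    cur.execute('CREATE TABLE ceos (company TEXT, ceo TEXT)')
except sqlite3.OperationalError:
    duplicate_failed = True
print(duplicate_failed)
```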
Which of these correctly match the following libraries in the NumPy ecosystem with what that library is used for? Each option may be used once.
Answer Options:
A:Statsmodels
B:Scikit-learn
C:Bokeh
Used to perform statistical operations
A
B
C
Ans:A
Specifically meant for machine learning, data mining, and data analysis
A
B
C
Ans:- B
Data visualization tool used for large datasets
A
B
C
Ans:- C
Match the following statements related to the iloc indexer in Pandas with the correct boolean values.
Answer Options:
A:The iloc indexer is similar to the loc indexer and can be used to access records located at a particular index in a Pandas DataFrame
B:The column headers can be passed as input arguments in the form of a string to the iloc function without any errors
C:When we pass 2:6 as input argument to the iloc function, we get all details of the records located in the second index all the way up to the 5th index of the DataFrame
True
A
B
C
Ans:- A
False
A
B
C
Ans:- B,C
Let’s say you have saved a dataset in a pandas DataFrame called “dataset” which has tons of records and you only want to access the details of the records in only the 5th, 8th and 14th index. Which of these Python commands can you use to do so?
dataset.loc[5,8,14]
dataset(5,8,14,
dataset.loc[[5,8,14],:]
dataset(5,8,14)
Ans:- dataset.loc[[5,8,14],:]
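Passing a list of labels to .loc selects exactly those records; a sketch with a toy dataset standing in for the one in the question:

```python
import pandas as pd

# Toy stand-in: 20 records with the default integer index
dataset = pd.DataFrame({'value': range(20)})

# A list of index labels plus ':' for all columns
subset = dataset.loc[[5, 8, 14], :]
print(subset['value'].tolist())
```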
Question: Let’s say you have a pandas DataFrame called “panda” which has 8 rows in total and you want to remove the last row from this DataFrame. Which of these Python commands would you use to do so?
panda.drop(panda.index[8])
panda.delete(panda.column[8])
panda.drop(panda.column[7])
panda.drop(panda.index[7])
Ans:- panda.drop(panda.index[7])
Question: Which of these statements related to the pivot function in Pandas is true?
The pivot function is used in aggregating data from various columns of a DataFrame to create a new DataFrame
The Pivot function summarizes the details of each column in a DataFrame
The combination of the row index and the column header must be unique in order to generate a pivot table
Ans:- The combination of the row index and the column header must be unique in order to generate a pivot table
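A sketch of the uniqueness requirement (the data here is invented): every (index, column) pair occurs exactly once, so pivot() can place each value unambiguously.

```python
import pandas as pd

# Every (day, city) pair occurs exactly once
df = pd.DataFrame({
    'day':  ['Mon', 'Mon', 'Tue', 'Tue'],
    'city': ['BLR', 'DEL', 'BLR', 'DEL'],
    'temp': [24, 30, 25, 31],
})

wide = df.pivot(index='day', columns='city', values='temp')
print(wide)
```

With duplicate (day, city) pairs, pivot() raises an error; pivot_table() with an aggregation function would be needed instead.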
Question: When using the pyplot library to plot graphs on the data stored in a dataframe, what would you specify the “kind” of your chart to be, if you want to plot a horizontal bar graph?
bar
pie
hist
barh
Ans:- barh
Question: What information about the column of a pandas dataframe is returned when you call the describe() function on it?
Sum of all the values in the column
The number of records
Basic statistical measures
The number of cells with NaN values
Ans:-Basic statistical measures
Question: Let’s say you have a dataframe called “df” with columns “c1”, “c2” and “c3”. You want to remove the column “c2” from this dataframe. Which of these operations can help you do that?
df.drop(["c2"], axis=0)
remove df["c2"]
del df["c2"]
df.drop(["c2"], axis=1)
Ans:-
del df["c2"]
df.drop(["c2"], axis=1)
Question: Match the following statements related to Pandas DataFrames with the correct boolean values.
Answer Options:
A:All the data within a particular column in a Pandas DataFrame must be of the same data type
B:Data in different columns of a Pandas DataFrame cannot be of different data types
C:Once a Pandas DataFrame has been created, it is not possible to add a new column to this DataFrame
False
A
B
C
Ans:- B, C
True
A
B
C
Ans:- A
Question: Match the following statements related to the concept of multiIndex in Pandas with the correct Boolean values.
Answer Options:
A:MultiIndex lets the user effectively store and manipulate higher dimensional data in a 2-dimensional tabular structure
B:The MultiIndex for a row is some composite key made up of exactly one column
C:MultiIndex is useful when we have large datasets where using numeric indexes to refer to each record is unintuitive
False
A
B
C
Ans:- B
True
A
B
C
Ans:- A,C
Question: In Pandas, what does a series object represent?
A single string character
A single integer
A 2-D matrix of data
A vector of data
Ans:- A vector of data
Question: Let’s say you have a dataframe called “stock” which has the price and other details of a particular stock. “stock” has a column called “price”, which has the daily closing price of the stock over the last 5 weeks.
Which of these commands would help extract the returns of the stock price?
stock["price"].change()
stock["price"].returns()
stock["price"].pct_change()
stock["price"].diff()
Ans:- stock["price"].pct_change()
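A sketch with toy closing prices standing in for stock["price"]:

```python
import pandas as pd

price = pd.Series([100.0, 110.0, 99.0])

# pct_change() gives the period-over-period return;
# diff() would give only the absolute change
returns = price.pct_change()
print(returns)  # first value is NaN, then roughly 0.10 and -0.10
```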
What would be the output of the following program?
import pandas as pd
pd.Timestamp('01/01/2000') \
+ pd.Timedelta('2days 2hours 1minute 1second')
Timestamp('2000-01-03 02:01:01')
Timestamp('2000-03-01 02:01:01')
Timestamp('2000-01-03 01:01:02')
Timestamp('2000-03-01 01:01:02')
Ans:- Timestamp('2000-01-03 02:01:01')
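The arithmetic can be verified directly (note that it is the addition that yields the listed answer):

```python
import pandas as pd

# Jan 1, 2000 plus 2 days, 2 hours, 1 minute, 1 second
ts = pd.Timestamp('01/01/2000') + pd.Timedelta('2days 2hours 1minute 1second')
print(ts)
```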
Let’s say you have a dataframe called df which has NaN values in the first few records of the dataframe which is then followed by valid values. You want to fill those values with the first observed value of that column.
Which of these functions would you call on df to do so?
ffill()
sfill()
efill()
bfill()
Ans:- bfill()
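A sketch of backward filling on a frame with leading NaNs, as described in the question:

```python
import pandas as pd
import numpy as np

# Leading NaNs followed by valid values
df = pd.DataFrame({'A': [np.nan, np.nan, 10.0, 20.0]})

# bfill() propagates the next valid observation backwards,
# so the leading NaNs take the first observed value
filled = df.bfill()
print(filled['A'].tolist())
```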
How would you access the element 3 from the “pandas_series” variable?
import pandas as pd
pandas_series = pd.Series([1, 2, 3])
pandas_series.index = ['1', '2', '3']
pandas_series['2']
pandas_series[2]
pandas_series['3']
pandas_series[3]
Ans:-
pandas_series[2]
pandas_series['3']
Let’s say you have two pandas series called “one” and “two” and you want to print all those values in “one” that are also there in “two”.
Then which of these commands would you use to do so?
one[~one.isin(two)]
one.isin(two)
two[two.isin(one)]
one[one.isin(two)]
Ans:-
two[two.isin(one)]
one[one.isin(two)]
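The isin pattern can be checked on two toy series (values invented for illustration):

```python
import pandas as pd

one = pd.Series([1, 2, 3, 4])
two = pd.Series([3, 4, 5, 6])

# isin() gives a Boolean mask of which elements of `one` appear
# in `two`; the mask then filters `one` itself
common = one[one.isin(two)]
print(common.tolist())
```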
What would be the output of the following line of code?
import pandas as pd
pandas_series = pd.Series([1, 2, 2, 3, 3, 99])
pandas_series.unique()
array([1, 2, 3, 99])
4
6
array([2, 3])
Ans:- array([1, 2, 3, 99])
In a Pandas dataframe, what is each column represented by?
Python Dictionary
Python list
Pandas Series
Numpy Array
Ans:- Pandas Series
Let’s say you have a dataframe called df with 6 columns and with 220 records. You want a subset of this dataframe which includes the first 4 columns from the left and the records from index 100 to 110.
Which of these commands can you use to do so?
df.iloc[100:111, 0:4]
df.loc[100:111, 0:4]
df.iloc[100:110, 0:3]
df.loc[100:10, 0:3]
Ans:- df.iloc[100:111, 0:4]
What will be the values in “new_string” variable at the end of this program?
string = pd.Series(['Ace', 'Fault', 'Error'])
new_string = string.str.swapcase()
['ace', 'fault', 'error']
['aCE', 'fAULT', 'eRROR']
['Ace', 'Fault', 'Error']
['ACE', 'FAULT', 'ERROR']
Ans:- ['aCE', 'fAULT', 'eRROR']
Let’s say you have a pandas series called “series” which has the contents: ['ae', 'bf', 'cg'].
What function would you call to find the letter “e” in the series and the starting index of the matched substring?
str.findall('e')
str.match('e')
str.get('e')
str.find('e')
Ans:- str.find('e')
What will be the values in new_string at the end of this program?
string = pd.Series(['top', 'bottom', 'side'])
new_string = string.apply(lambda x: x.upper())
['TOP', 'BOTTOM', 'SIDE']
['top', 'bottom', 'side']
['Top', 'Bottom', 'Side']
['tOP', 'bOTTOM', 'sIDE']
Ans:- ['TOP', 'BOTTOM', 'SIDE']
What function would you use to convert the values in “string” to integer form?
import pandas as pd
string = pd.Series(['1', '2', '3'])
to_numeric(string)
string.to_numeric
string.pd.to_numeric
pd.to_numeric(string)
Ans:- pd.to_numeric(string)
Let’s say you have a dataframe called “df” and you want to find the minimum value in each row of this dataframe.
Which of these commands would you use to do so?
df (min, axis = 1)
df (min, axis = 0)
df.apply (min, axis = 1)
df.apply (min, axis = 0)
Ans:- df.apply (min, axis = 1)
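The axis argument can be demonstrated on a toy frame (values invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({'a': [3, 7], 'b': [5, 2]})

# axis=1 applies the function across each row;
# axis=0 would apply it down each column instead
row_mins = df.apply(min, axis=1)
print(row_mins.tolist())
```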
Question: Which of these built-in methods of a dataframe would you use to return the number of NA values in each column of your dataframe?
df.isna()
df.na()
df.isna().sum()
df.sum()
Ans:- df.isna().sum()
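The chained call can be checked on a toy frame with a few NA values:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, np.nan, np.nan], 'B': [4, 5, np.nan]})

# isna() gives a Boolean frame; sum() counts the Trues per column
na_counts = df.isna().sum()
print(na_counts.tolist())
```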
In the context of a pandas dataframe, what does the dropna( ) function do?
All the cells which have the value NA are replaced with a 0
Calculates the number of records which have even a single field with an NA
All the cells which have the value NA are deleted
All the records which have even a single field with NA are removed
Ans:-
All the records which have even a single field with NA are removed
In the context of a pandas dataframe, which of these functions returns the median of a column?
quantile(0.75)
mean()
quantile(0.5)
median()
Ans:-
quantile(0.5)
median()
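Both accepted answers agree because the median is, by definition, the 0.5 quantile; a quick check on invented data:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])

print(s.median())       # middle value, robust to the outlier 100
print(s.quantile(0.5))  # same result by definition
```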
Question: Which of these statements related to the Pandas Series object are true?
Since Numpy is built on top of Pandas, it is very straightforward to convert a Pandas Series object into a Numpy array
Pandas Series object is similar to a Python list
Once we create a Pandas Series object, an index representing the positions for each of the data points is automatically created for the list
Ans:-
Pandas Series object is similar to a Python list
Once we create a Pandas Series object, an index representing the positions for each of the data points is automatically created for the list
Question: Let’s say you have a pandas DataFrame called “frame” and you want to export this DataFrame along with its index as a CSV file called “data_frame” located in the datasets folder of our workspace.
data_frame.to_csv('datasets/frame.csv')
data_frame.csv('datasets/frame.csv')
frame.to_csv('datasets/data_frame.csv')
frame.csv('datasets/data_frame.csv')
Ans:- frame.to_csv('datasets/data_frame.csv')