# Panda

## Theory

What is Pandas Series?

A Series is very similar to a NumPy array (in fact it is built on top of the NumPy array object). The difference between a NumPy array and a Series, is that a Series can have axis labels, meaning it can be indexed by a label, instead of just a number location. It also doesn’t need to hold numeric data; it can hold any arbitrary Python Object.

How will you import library?

import numpy as np

import pandas as pd

How will one convert a list, numpy array, or dictionary to a Series?

Let us first create a list, an array and a dictionary:

labels = [‘a’,’b’,’c’]

my_list = [10,20,30]

my_arr = np.array([10,20,30])

my_dict = {‘a’:10,’b’:20,’c’:30}

now, for converting these to a series:

pd.Series(data=my_list)

output:

• 10
• 20
• 30

dtype: int64

A series can also be indexed by a label

pd.Series(data=my_list,index=labels)

output: a    10

b    20

c    30

dtype: int64

or we can also write like:

pd.Series(my_list,labels)

output: a    10

b    20

c    30

Similarly, we can convert a numpy array to a series:

pd.Series(my_arr)

output: 0    10

1    20

2    30

dtype: int32

pd.Series(my_arr,labels)

output: a    10

b    20

c    30

for converting a dictionary:

pd.Series(my_dict)

output: a    10

b    20

c    30

Create a dataframe with random numbers.

import pandas as pd

import numpy as np

from numpy.random import randn

df = pd.DataFrame(randn(5,4),index=[‘A’,’B’,’C’,’D’,’E’],columns=[‘W’,’X’,’Y’,’Z’])

df

How will one grab data from a particular column or columns in a dataframe.

df[‘X’]

Output:

A    0.018639

B   -1.044210

C   -1.432191

D   -0.689218

E   -1.306992

Name: X, dtype: float64

This returns a series

df[[‘X’,’Z’]]

This returns a dataframe.

DataFrame Columns are just Series.

How to add a new column to existing dataframe?

df[‘new’] = df[‘X’] + df[‘Z’]

How to drop a column and a row in pandas?

For dropping column, we must give axis=1

df.drop(‘new’,axis=1,inplace=True)

if inplace argument is not true then new column will not be removed permanently from df

Dropping a row

df.drop(‘E’,axis=0, inplace=True)

Explain conditional selection in dataframe using pandas.

Consider a dataframe with 4 columns and 4 rows:

For getting all the entries that are greater than 0, we can use:

df>0

df[df>0]

df[df[‘X’]>0]

How to reset index and set it to something different?

Consider a dataframe

df.reset_index()

newind = ‘BL DL IX CC CH’.split()

df[‘States’] = newind

df

df.set_index(‘States’,inplace=True)

df

How to deal with missing data in Pandas?

Missing data is represented by Nan in pandas.

Let us create a dataframe with some missing values:

df = pd.DataFrame({‘A’:[1,2,np.nan],

‘B’:[3,np.nan,np.nan],

‘C’:[4,5,6]})

For dropping all the rows from df having Nan value:

df.dropna()

For dropping all the columns from df having Nan value

df.dropna(axis=1)

For dropping all rows having Nan value except the rows with at least 2 non Nan values:

df.dropna(thresh=2)

Replacing Nan value with something else:

df.fillna(value=’X VAL’)

Explain different types of operations that can be performed on pandas.

Let us take a dataframe and perform various kinds of operation on it.

import pandas as pd

df = pd.DataFrame({‘col1′:[1,2,3,4],’col2′:[11,22,33,11],’col3’:[‘ab’,’cd’,’ef’,’gh’]})

unique

To get unique values in the dataframe:

df[‘col2’].unique()

Out: array([11, 22, 33], dtype=int64)

nunique

To get the count of unique values in the dataframe:

df[‘col2’].nunique()

Out: 3

value_counts

To get the count of each value in the dataframe:

df[‘col2’].value_counts()

Out:

11    2

22    1

33    1

Name: col2, dtype: int64

isnull

To check for null values in the dataframe:

df.isnull()

del

permanently removing a column

del df[‘col1’]

df

Get column and index names:

df.columns

Out:

Index([‘col2’, ‘col3′], dtype=’object’)

df.index

Out:

RangeIndex(start=0, stop=4, step=1)

Sort_values

df.sort_values(by=’col2′)

How do you read from and store data in a csv, excel and html file in pandas?

Reading a csv file into a dataframe:

storing a dataframe as a csv file:

df.to_csv(‘filename’,index=False)

Reading a excel file into a dataframe

storing a dataframe as excel file:

df.to_excel(‘Excel_filename.xlsx’,sheet_name=’Sheet1′)

Reading a html file into a dataframe

What is concatenation? Explain with examples.

Concatenation glues together two or more DataFrames. Dimensions should match along the axis of the df you are concatenating on.

Lets create 2 dataframes df1 and df2

import pandas as pd

df1 = pd.DataFrame({‘A’: [‘A0’, ‘A1’, ‘A2’, ‘A3’],

‘B’: [‘B0’, ‘B1’, ‘B2’, ‘B3’],

‘C’: [‘C0’, ‘C1’, ‘C2’, ‘C3’],

‘D’: [‘D0’, ‘D1’, ‘D2’, ‘D3’]},

index=[0, 1, 2, 3])

df2 = pd.DataFrame({‘A’: [‘A4’, ‘A5’, ‘A6’, ‘A7’],

‘B’: [‘B4’, ‘B5’, ‘B6’, ‘B7’],

‘C’: [‘C4’, ‘C5’, ‘C6’, ‘C7’],

‘D’: [‘D4’, ‘D5’, ‘D6’, ‘D7’]},

index=[4, 5, 6, 7])

df1

df2

Concatenating two dataframes:

pd.concat([df1,df2])

pd.concat([df1,df2],axis=1)

Explain dataframes joins with example.

Joining is a convenient method for combining the columns of two potentially differently-indexed DataFrames into a single result DataFrame.

left = pd.DataFrame({‘A’: [‘A0’, ‘A1’, ‘A2’],

‘B’: [‘B0’, ‘B1’, ‘B2’]},

index=[‘K0’, ‘K1’, ‘K2’])

right = pd.DataFrame({‘C’: [‘C0’, ‘C2’, ‘C3’],

‘D’: [‘D0’, ‘D2’, ‘D3’]},

index=[‘K0’, ‘K2’, ‘K3’])

left.join(right)

left.join(right, how=’outer’)

Explain merge operation.

The merge function merge DataFrames together using a similar logic as merging SQL Tables together. For example:

left = pd.DataFrame({‘key1’: [‘K0’, ‘K0’, ‘K1’, ‘K2’],

‘key2’: [‘K0’, ‘K1’, ‘K0’, ‘K1’],

‘A’: [‘A0’, ‘A1’, ‘A2’, ‘A3’],

‘B’: [‘B0’, ‘B1’, ‘B2’, ‘B3’]})

right = pd.DataFrame({‘key1’: [‘K0’, ‘K1’, ‘K1’, ‘K2’],

‘key2’: [‘K0’, ‘K0’, ‘K0’, ‘K0’],

‘C’: [‘C0’, ‘C1’, ‘C2’, ‘C3’],

‘D’: [‘D0’, ‘D1’, ‘D2’, ‘D3’]})

Inner merge:

pd.merge(left, right, on=[‘key1’, ‘key2’])

Outer merge:

pd.merge(left, right, how=’outer’, on=[‘key1’, ‘key2’])

Right merge:

pd.merge(left, right, how=’right’, on=[‘key1’, ‘key2’])

Left merge:

pd.merge(left, right, how=’left’, on=[‘key1’, ‘key2’])

Explain groupby.

import pandas as pd

data = {‘Company’:[‘ABC’,’DEF’,’GHI’,’JKL’,’MNO’,’XYZ’],

‘Person’:[‘OM’,’JAI’,’RAM’,’SHYAM’,’AMY’,’SARA’],

‘Sales’:[200,325,229,124,470,350]}

df = pd.DataFrame(data)

df

:

Using groupby we can get sales data of companies

by_comp = df.groupby(“Company”)

by_comp.mean()

## OBJECTIVE QUIZ

In the following Python code, typing which Python command will give the user the CEO of Facebook?

import pandas as pd

companies_ceo = {
‘Amazon’ : ‘Jeff Bezos’
‘Apple’ : ‘Tim Cook‘,
‘SpaceX’: ‘Elon Musk‘
‘Netflix’: ‘Reed Hastings’
}
companies_ceo_series= pd.Series(companies_ceo)

companies_ceo_series

companies_ceo_series

companies_ceo_series.values

Ans:-

companies_ceo_series

In the following Python code, typing what command will create a DataFrame called “companies_ceo” whose first column has all the entries of the ‘companies’ list and whose second column has all the entries of the ‘ceo’ list, with the column names as the names of the respective variables?

import pandas as pd
companies = {
‘Amazon’
‘Apple’
‘SpaceX’
‘Netflix’
}

ceo = {
‘Jeff Bezos’
‘Tim Cook‘,
‘Elon Musk‘
‘Mark Zuckerberg’
‘Reed Hastings’
}

companies_ceo_tuple = list (zip(companies, ceo)) companies_ceo = pd.dataframe(companies_ceo_tuple, columns=[‘companies’, ‘ceo’])

companies_ceo = list (zip(companies,ceo))

companies_ceo= pd.dataframe(companies, ceo)

companies_ceo_tuple = list (zip(companies, ceo)) companies_ceo = pd.dataframe(companies_ceo_tuple)

Ans:- companies_ceo_tuple = list (zip(companies, ceo)) companies_ceo = pd.dataframe(companies_ceo_tuple, columns=[‘companies’, ‘ceo’])

What happens when we call the stack () function on a Pandas DataFrame

It will create a new DataFrame from two DataFrames by stacking them side by side

It will create a new DataFrame from two DataFrames by stacking one on top of the other

It will create a new DataFrame such that a single row in the original DataFrame is stacked into multiple rows in the new DataFrame depending on the number of columns for each row in the original DataFrame

It will create a new DataFrame such that a single row in the new DataFrame is a combination of multiple rows in the original DataFrame depending on the number of columns for each row in the original DataFrame

Ans:- It will create a new DataFrame such that a single row in the original DataFrame is stacked into multiple rows in the new DataFrame depending on the number of columns for each row in the original DataFrame

Consider the following Python code. What command would you use to iterate through the “companies_ceo” DataFrame and print the list of all the CEOs in this DataFrame?

import pandas as pd

companies = {
‘Company’ : [‘Facebook’, ‘Apple’, ‘Amazon’, ‘Netflix’],

``````            'CEO' : [‘Mark Zuckerberg’, 'Jeff Bezos’, 'Tim Cook', ',’Reed Hastings’ ],

}``````

companies_ceo = pd.DataFrame(companies_ceo)

for row in companies_ceo.iterrows(): print(row)

for row in companies_ceo.iterfields(): print(row.CEO)

for row in companies_ceo.iterrecords(): print(row.CEO)

for row in companies_ceo.itertuples(): print(row.CEO)

Ans:-

for row in companies_ceo.iterrows(): print(row)

for row in companies_ceo.itertuples(): print(row.CEO)

Which of the following formats does Pandas not support natively when exporting the contents of a Dataframe?

JSON

CSV

JPEG

HTML

Ans:- JPEG

Let’s say you have created a Pandas DataFrame called “unsorted” and you want to sort the contents of this DataFrame column wise in alphabetical order of the header name. Then, which function would you call on the “unsorted” DataFrame to do so?

unsorted.sort_index()

unsorted.sort_col(axis=1)

unsorted.index_sort()

unsorted.sort_index(axis=1)

Ans:- unsorted.sort_index(axis=1)

Match the following functions that you can call on a Pandas DataFrame correctly with what they do

A:.isnull()
B:.notnull()
C:.fillna(0)
D:.dropna()

Returns a Boolean array containing true or false values and returns the value in a cell as true if it contains NaN

A

B

C

D
Ans:- A

All the rows which contain a NaN value in any cell of that row are removed

A

B

C

D
Ans:- D

Returns a Boolean array containing true or false values and returns the value in a cell as true if it does not contain NaN

A

B

C

D
Ans:- B

Every cell in the Dataset which has a NaN value will be replaced with 0

A

B

C

D
Ans:- C

Match the following statements related to the .xs function in Pandas DataFrame with their correct Boolean values.

A:The .xs function is used when our Pandas DataFrame makes use of a MultiIndex
B:By default, the .xs function only takes a look at values in the first level index
C:The .xs function cannot be used to return a cross section of columns

False

A

B

C
Ans:- C

True

A

B

C
Ans:- A, B

Let’s say you have imported Python as pd and have instantiated two DataFrames called “frame_1” and “frame_2” with the exact same schema. What command will you use to combine these two DataFrames into a single DataFrame and make sure that the combined DataFrame has its own unique index?

pd.concat( [frame_2, frame_1,] )

pd.concat( [frame_2, frame_1], ignore_index = True )

pd.concat( [frame_1, frame_2], ignore_index = True )

pd.concat( [frame_1, frame_2] )

Ans:-

pd.concat( [frame_2, frame_1], ignore_index = True )

pd.concat( [frame_1, frame_2], ignore_index = True )

The ‘how’ argument in the Pandas merge function allows us to specify what kind of join operation we want to perform on the given Pandas DataFrames. What are the valid values that we can give for this argument?

inner

left

outer

right

Ans:-

inner

left

outer

right

Some statements related to working with SQL Databases in Python are given below. Match them with their correct Boolean values.

A:The sqlite3 library in Python allows us to create Databases on our local file system
B:Once we have created a table, we can use sqlite3’s .execute() function to recreate the same table with the same table name so that we have duplicates of a table
C:All the changes that we make to an SQL database on a Jupyter notebook by connecting with it, will be committed to the database only after we execute sqlite3’s .commit() function

False

A

B

C
Ans:- B

True

A

B

C
Ans:- A,C

Which of these correctly match the following libraries in the Numpy ecosystem with what that library is used for?

be used once.
A:Statsmodel
B:Scikit-learn
C:Bokeh

Used to perform statistical operations

A

B

C
Ans:A

Specifically meant for machine learning, data mining, and data analysis

A

B

C
Ans:- B

Data visualization tool used for large datasets

A

B

C
Ans:- C

Match the following statements related to the iloc indexer in Pandas with the correct boolean values.

A:The iloc indexer is similar to the loc indexer and can be used to access records located at a particular index in a Pandas DataFrame
B:The column headers can be passed as input arguments in the form of a string to the iloc function without any errors
C:When we pass 2:6 as input argument to the iloc function, we get all details of the records located in the second index all the way up to the 5th index of the DataFrame

True

A

B

C
Ans:- A
False

A

B

C
Ans:- B,C

Let’s say you have saved a dataset in a pandas DataFrame called “dataset” which has tons of records and you only want to access the details of the records in only the 5th, 8th and 14th index. Which of these Python commands can you use to do so?

dataset.loc[5,8,14]

dataset(5,8,14,

dataset.loc[[5,8,14],:]

dataset(5,8,14)

Question: Let’s say you have a pandas DataFrame called “panda” which has 8 rows in total and you want to remove the last row from this DataFrame. Which of these Python commands would you use to do so?

panda.drop(panda.index)

panda.delete(panda.column)

panda.drop(panda.column)

panda.drop(panda.index)

Ans:- panda.drop(panda.index)

Question: Which of these statements related to the pivot function in Pandas is true?

The pivot function is used in aggregating data from various columns of a DataFrame to create a new DataFrame

The Pivot function summarizes the details of each column in a DataFrame

The combination of the row index and the column header must be unique in order to generate a pivot table

Ans:-

The Pivot function summarizes the details of each column in a DataFrame

The combination of the row index and the column header must be unique in order to generate a pivot table

Question: When using the pyplot library to plot graphs on the data stored in a dataframe, what would you specify the “kind” of your chart to be, if you want to plot a horizontal bar graph?

bar

pie

hist

barh

Ans:- barh

Question: What information about the column of a pandas dataframe is returned when you call the describe() function on it?

Sum of all the values in the column

The number of records

Basic statistical measures

The number of cells with NaN values

Ans:-Basic statistical measures

Question: Let’s say you have a dataframe called “df” with columns “c1”, “c2” and “c3”. You want to remove the column “c2” from this dataframe. Which of these operations can help you do that?

df.drop([“c2”], axis=0)

remove df [“c2”]

del df [“c2”]

df.drop([“c2”], axis = 1)

Ans:- remove df [“c2”]

Question: Match the following statements related to Pandas DataFrames with the correct boolean values.

A:All the data within a particular column in a Pandas DataFrame must be of the same data type
B:Data in different columns of a Pandas DataFrame cannot be of different data types
C:Once a Pandas DataFrame has been created, it is not possible to add a new column to this DataFrame

False

A

B

C
Ans:- B, C
True

A

B

C
Ans:- A

Question: Match the following statements related to the concept of multiIndex in Pandas with the correct Boolean values.

A:MultiIndex lets the user effectively store and manipulate higher dimensional data in a 2-dimensional tabular structure
B:The MultiIndex for a row is some composite key made up of exactly one column
C:MultiIndex is useful when we have large datasets where using numeric indexes to refer to each record is unintuitive

False

A

B

C
Ans:- B
True

A

B

C
Ans:- A,C

Question: In Pandas, what does a series object represent?

A single string character

A single integer

A 2-D matrix of data

A vector of data

Ans:- A vector of data

Question: Let’s say you have a dataframe called “stock” which has the price and other details of a particular stock. “stock” has a column called “price”, which has the daily closing price of the stock over the last 5 weeks.
Which of these commands would help extract the returns of the stock price?

stock[“price”].change()

stock[“price”].returns()

stock[“price”].pct_change()

stock[“price”].diff()

Ans:- stock[“price”].pct_change()

What would be the output of the following program?

import pandas as pd
pd.Timestamp(‘01/01/2000’) \

• pd.Timedelta(‘2days 2hours 1minute 1second’)

Timestamp (‘2000-01-03 02:01:01’)

Timestamp (‘2000-03-01 02:01:01’)

Timestamp (‘2000-01-03 01:01:02’)

Timestamp (‘2000-03-01 01:01:02’)

Ans:- Timestamp (‘2000-01-03 02:01:01’)

Let’s say you have a dataframe called df which has NaN values in the first few records of the dataframe which is then followed by valid values. You want to fill those values with the first observed value of that column.

Which of these functions would you call on df to do so?

ffill()

sfill()

efill()

bfill()

Ans:- bfill()

How would you access the element 3 from the “pandas_series” variable?

import pandas as pd
pandas_series = pd.Series([1, 2, 3])
pandas_series.index = [‘1’, ‘2’, ‘3’]

pandas_series [‘2’]

pandas_series 

pandas_series [‘3’]

pandas_series 

Ans:-

pandas_series 

pandas_series [‘3’]

Let’s say you have two pandas series called “one” and “two” and you want to print all those values in “one” that are also there in “two”.

Then which of these commands would you use to do so?

one[~one.isin(two)]

one.isin(two)

two[two.isin(one)]

one[one.isin(two)]

Ans:-

two[two.isin(one)]

one[one.isin(two)]

What would be the output of the following line of code?

import pandas as pd
pandas_series = pd.Series([1, 2, 2, 3, 3, 99])
pandas_series.unique()

array([1, 2, 3, 99])

4

6

array([2, 3])

Ans:- array([1, 2, 3, 99])

In a Pandas dataframe, what is each column represented by?

Python Dictionary

Python list

Pandas Series

Numpy Array

Ans:- Pandas Series

Let’s say you have a dataframe called df with 6 columns and with 220 records. You want a subset of this dataframe which includes the first 4 columns from the left and the records from index 100 to 110.

Which of these commands can you use to do so?

df.iloc[100:111, 0:4]

df.loc[100:111, 0:4]

df.iloc[100:110, 0:3]

df.loc[100:10, 0:3]

Ans:- df.iloc[100:111, 0:4]

What will be the values in “new_string” variable at the end of this program?

string = pd.Series([‘Ace’, ‘Fault’, ‘Error’])

new_string = string.str.swapcase()

[‘ace’, fault’, ‘error’]

[‘aCE’, ‘fAULT’, ‘eRROR’]

[‘Ace’, ‘Fault’, ‘Error’]

[‘ACE’, FAULT’, ‘ERROR’]

Ans:- [‘aCE’, ‘fAULT’, ‘eRROR’]

Let’s say you have a pandas series called “series” which has the contents: [‘ae’, ‘bf’, ‘cg’].

What function would you call to find the letter “e” in the series and the starting index of the matched substring?

str.findall(‘e’)

str.match( ‘e’)

str.get(‘e’)

str.find(‘e’)

Ans:- str.find(‘e’)

What will be the values in new_string at the end of this program?

string = pd.Series([‘top’, ‘bottom’, side])

new_string = string.apply(lambda x:x.upper())

[TOP, ‘BOTTOM’, SIDE’]

[‘top’, bottom’, ‘side’]

[‘Top’, Bottom’, ‘Side’]

[‘tOP’, ‘bOTTOM’, ‘sIDE]

Ans:- [TOP, ‘BOTTOM’, SIDE’]

What function would you use to convert the values in “string” to integer form?

import pandas as pd
string = pd.Series([‘1’, ‘2’, ‘3’])

to_numeric(string)

string.to_numeric

string.pd.to_numeric

pd.to_numeric(string)

Ans:- pd.to_numeric(string)

Let’s say you have a dataframe called “df” and you want to find the minimum value in each row of this dataframe.

Which of these commands would you use to do so?

df (min, axis = 1)

df (min, axis = 0)

df.apply (min, axis = 1)

df.apply (min, axis = 0)

Ans:- df.apply (min, axis = 1)

Question

Which of these built in methods of a dataframe would you use to return the number of NA values in each column of your dataframe?

df.isna()

df.na()

df.isna().sum()

df.sum()

Ans:- df.isna().sum()

In the context of a pandas dataframe, what does the dropna( ) function do?

All the cells which have the value NA are replaced with a 0

Calculates the number of records which have even a single field with an NA

All the cells which have the value NA are deleted

All the records which have even a single field with NA have are removed

Ans:-

All the records which have even a single field with NA have are removed

In the context of a pandas dataframe, which of these functions returns the median of a column?

quantile(0.75)

mean()

quantile(0.5)

median()

Ans:-

quantile(0.5)

median()

Question: Which of these statements related to the Pandas Series object are true?

Since Numpy is built on top of Pandas, it is very straightforward to convert a Pandas Series object into a Numpy array

Pandas Series object is similar to a Python list

Once we create a Pandas Series object, an index representing the positions for each of the data points is automatically created for the list

Ans:-

Pandas Series object is similar to a Python list

Once we create a Pandas Series object, an index representing the positions for each of the data points is automatically created for the list

Question: Let’s say you have a pandas DataFrame called “frame” and you want to export this DataFrame along with its index as a CSV file called “data_frame” located in the datasets folder of our workspace.

data_frame.to_csv(‘datasets/ frame.csv’)

data_frame.csv(‘datasets/ frame.csv’)

frame.to_csv(‘datasets/data_frame.csv’)

frame.csv(‘datasets/data_frame.csv’)

Ans:- frame.to_csv(‘datasets/data_frame.csv’)