Data Processing with Pandas

Original Source: https://www.coursera.org/specializations/data-science-python

What is Data Science?

Drew Conway

data science

David Donoho

Data Exploration and Preparation
Data Representation and Transformation
Computing with Data
Data Modeling
Data Visualization and Presentation
Science about Data Science

Data Processing with Pandas

The Series Data Structure

series

import pandas as pd
sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': 'Japan',
          'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s

Archery           Bhutan
Golf            Scotland
Sumo               Japan
Taekwondo    South Korea
dtype: object

pd.Series(['Tiger', 'Bear', 'Moose'], index=['India', 'America', 'Canada'])

India      Tiger
America     Bear
Canada     Moose
dtype: object

Querying a Series

s.iloc[3]

'South Korea'

s.loc['Golf']

'Scotland'

s = pd.Series([1, 2, 3])
s.loc['Animal'] = 'Bears'
s

0             1
1             2
2             3
Animal    Bears
dtype: object

The DataFrame Data Structure

dataframe

purchase_1 = pd.Series({'Name': 'Chris',
                        'Item Purchased': 'Dog Food',
                        'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                        'Item Purchased': 'Kitty Litter',
                        'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                        'Item Purchased': 'Bird Seed',
                        'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df.head()

	Name	Item Purchased	Cost
Store 1	Chris	Dog Food	22.5
Store 1	Kevyn	Kitty Litter	2.5
Store 2	Vinod	Bird Seed	5.0

df.loc['Store 2']

Name                  Vinod
Item Purchased    Bird Seed
Cost                      5
Name: Store 2, dtype: object

df.loc['Store 1', 'Cost']

Store 1    22.5
Store 1     2.5
Name: Cost, dtype: float64

df['Cost']

Store 1    22.5
Store 1     2.5
Store 2     5.0
Name: Cost, dtype: float64

df.drop('Store 1')

	Name	Item Purchased	Cost
Store 2	Vinod	Bird Seed	5.0

# in pandas operations, default is inplace=True
# so 'df.drop' doesn't change value of the variable.
df

	Name	Item Purchased	Cost
Store 1	Chris	Dog Food	22.5
Store 1	Kevyn	Kitty Litter	2.5
Store 2	Vinod	Bird Seed	5.0

df['Location'] = None
df

	Name	Item Purchased	Cost	Location
Store 1	Chris	Dog Food	22.5	None
Store 1	Kevyn	Kitty Litter	2.5	None
Store 2	Vinod	Bird Seed	5.0	None

Dataframe Indexing and Loading

costs = df['Cost']+2
df

	Name	Item Purchased	Cost	Location
Store 1	Chris	Dog Food	22.5	None
Store 1	Kevyn	Kitty Litter	2.5	None
Store 2	Vinod	Bird Seed	5.0	None

df = pd.read_csv('olympics.csv')
df.head()

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15
0	NaN	№ Summer	01 !	02 !	03 !	Total	№ Winter	01 !	02 !	03 !	Total	№ Games	01 !	02 !	03 !	Combined total
1	Afghanistan (AFG)	13	0	0	2	2	0	0	0	0	0	13	0	0	2	2
2	Algeria (ALG)	12	5	2	8	15	3	0	0	0	0	15	5	2	8	15
3	Argentina (ARG)	23	18	24	28	70	18	0	0	0	0	41	18	24	28	70
4	Armenia (ARM)	5	1	2	9	12	6	0	0	0	0	11	1	2	9	12

df = pd.read_csv('olympics.csv', index_col = 0, skiprows=1)
df.head()

	№ Summer	01 !	02 !	03 !	Total	№ Winter	№ Games	01 !.2	02 !.2	03 !.2	Combined total
Afghanistan (AFG)	13	0	0	2	2	0	13	0	0	2	2
Algeria (ALG)	12	5	2	8	15	3	15	5	2	8	15
Argentina (ARG)	23	18	24	28	70	18	41	18	24	28	70
Armenia (ARM)	5	1	2	9	12	6	11	1	2	9	12
Australasia (ANZ) [ANZ]	2	3	4	5	12	0	2	3	4	5	12

df.describe()

	№ Summer	01 !	02 !	03 !	Total	№ Winter	01 !.1	02 !.1	03 !.1	Total.1	№ Games	01 !.2	02 !.2	03 !.2	Combined total
count	147.000000	147.000000	147.000000	147.000000	147.000000	147.000000	147.000000	147.000000	147.000000	147.000000	147.000000	147.000000	147.000000	147.000000	147.000000
mean	13.476190	65.428571	64.965986	69.795918	200.190476	6.700680	13.047619	13.034014	12.897959	38.979592	20.176871	78.476190	78.000000	82.693878	239.170068
std	7.072359	405.549990	399.309960	427.187344	1231.306297	7.433186	80.799204	80.634421	79.588388	240.917324	13.257048	485.013378	478.860334	505.855110	1469.067883
min	1.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	1.000000	0.000000	0.000000	0.000000	1.000000
25%	8.000000	0.000000	1.000000	1.000000	2.000000	0.000000	0.000000	0.000000	0.000000	0.000000	11.000000	0.000000	1.000000	1.000000	2.500000
50%	13.000000	3.000000	4.000000	6.000000	12.000000	5.000000	0.000000	0.000000	0.000000	0.000000	15.000000	3.000000	4.000000	7.000000	12.000000
75%	18.500000	24.000000	28.000000	29.000000	86.000000	10.000000	1.000000	2.000000	1.000000	5.000000	27.000000	25.500000	29.000000	32.500000	89.000000
max	27.000000	4809.000000	4775.000000	5130.000000	14714.000000	22.000000	959.000000	958.000000	948.000000	2865.000000	49.000000	5768.000000	5733.000000	6078.000000	17579.000000

df.index

Index(['Afghanistan (AFG)', 'Algeria (ALG)', 'Argentina (ARG)',
       'Armenia (ARM)', 'Australasia (ANZ) [ANZ]', 'Australia (AUS) [AUS] [Z]',
       'Austria (AUT)', 'Azerbaijan (AZE)', 'Bahamas (BAH)', 'Bahrain (BRN)',
       ...
       'Uzbekistan (UZB)', 'Venezuela (VEN)', 'Vietnam (VIE)',
       'Virgin Islands (ISV)', 'Yugoslavia (YUG) [YUG]',
       'Independent Olympic Participants (IOP) [IOP]', 'Zambia (ZAM) [ZAM]',
       'Zimbabwe (ZIM) [ZIM]', 'Mixed team (ZZX) [ZZX]', 'Totals'],
      dtype='object', length=147)

df.columns

Index(['№ Summer', '01 !', '02 !', '03 !', 'Total', '№ Winter', '01 !.1',
       '02 !.1', '03 !.1', 'Total.1', '№ Games', '01 !.2', '02 !.2', '03 !.2',
       'Combined total'],
      dtype='object')

for col in df.columns:
    if col[:2]=='01':
        df.rename(columns={col:'Gold' + col[4:]}, inplace=True)
    if col[:2]=='02':
        df.rename(columns={col:'Silver' + col[4:]}, inplace=True)
    if col[:2]=='03':
        df.rename(columns={col:'Bronze' + col[4:]}, inplace=True)
    if col[:1]=='№':
        df.rename(columns={col:'#' + col[1:]}, inplace=True)

df.head()

	# Summer	Gold	Silver	Bronze	Total	# Winter	# Games	Gold.2	Silver.2	Bronze.2	Combined total
Afghanistan (AFG)	13	0	0	2	2	0	13	0	0	2	2
Algeria (ALG)	12	5	2	8	15	3	15	5	2	8	15
Argentina (ARG)	23	18	24	28	70	18	41	18	24	28	70
Armenia (ARM)	5	1	2	9	12	6	11	1	2	9	12
Australasia (ANZ) [ANZ]	2	3	4	5	12	0	2	3	4	5	12

Querying a DataFrame

Boolean Masking

boolean masking

(df['Gold'] > 0).head()

Afghanistan (AFG)          False
Algeria (ALG)               True
Argentina (ARG)             True
Armenia (ARM)               True
Australasia (ANZ) [ANZ]     True
Name: Gold, dtype: bool

only_gold = df.where(df['Gold'] > 0)
only_gold.head()

	# Summer	Gold	Silver	Bronze	Total	# Winter	Gold.1	Silver.1	Bronze.1	Total.1	# Games	Gold.2	Silver.2	Bronze.2	Combined total
Afghanistan (AFG)	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Algeria (ALG)	12.0	5.0	2.0	8.0	15.0	3.0	0.0	0.0	0.0	0.0	15.0	5.0	2.0	8.0	15.0
Argentina (ARG)	23.0	18.0	24.0	28.0	70.0	18.0	0.0	0.0	0.0	0.0	41.0	18.0	24.0	28.0	70.0
Armenia (ARM)	5.0	1.0	2.0	9.0	12.0	6.0	0.0	0.0	0.0	0.0	11.0	1.0	2.0	9.0	12.0
Australasia (ANZ) [ANZ]	2.0	3.0	4.0	5.0	12.0	0.0	0.0	0.0	0.0	0.0	2.0	3.0	4.0	5.0	12.0

only_gold = df[df['Gold'] > 0]
only_gold.head()

	# Summer	Gold	Silver	Bronze	Total	# Winter	Gold.1	Silver.1	Bronze.1	Total.1	# Games	Gold.2	Silver.2	Bronze.2	Combined total
Algeria (ALG)	12	5	2	8	15	3	0	0	0	0	15	5	2	8	15
Argentina (ARG)	23	18	24	28	70	18	0	0	0	0	41	18	24	28	70
Armenia (ARM)	5	1	2	9	12	6	0	0	0	0	11	1	2	9	12
Australasia (ANZ) [ANZ]	2	3	4	5	12	0	0	0	0	0	2	3	4	5	12
Australia (AUS) [AUS] [Z]	25	139	152	177	468	18	5	3	4	12	43	144	155	181	480

df[(df['Gold.1'] > 0) & (df['Gold'] == 0)]

	# Summer	Gold	Silver	Bronze	Total	# Winter	Gold.1	Silver.1	Bronze.1	Total.1	# Games	Gold.2	Silver.2	Bronze.2	Combined total
Liechtenstein (LIE)	16	0	0	0	0	18	2	2	5	9	34	2	2	5	9

Indexing Dataframes

df.head()

	# Summer	Gold	Silver	Bronze	Total	# Winter	# Games	Gold.2	Silver.2	Bronze.2	Combined total
Afghanistan (AFG)	13	0	0	2	2	0	13	0	0	2	2
Algeria (ALG)	12	5	2	8	15	3	15	5	2	8	15
Argentina (ARG)	23	18	24	28	70	18	41	18	24	28	70
Armenia (ARM)	5	1	2	9	12	6	11	1	2	9	12
Australasia (ANZ) [ANZ]	2	3	4	5	12	0	2	3	4	5	12

df['country'] = df.index
df = df.set_index('Gold')
df.head()

	# Summer	Silver	Bronze	Total	# Winter	Gold.1	Silver.1	Bronze.1	Total.1	# Games	Gold.2	Silver.2	Bronze.2	Combined total	country
Gold
0	13	0	2	2	0	0	0	0	0	13	0	0	2	2	Afghanistan (AFG)
5	12	2	8	15	3	0	0	0	0	15	5	2	8	15	Algeria (ALG)
18	23	24	28	70	18	0	0	0	0	41	18	24	28	70	Argentina (ARG)
1	5	2	9	12	6	0	0	0	0	11	1	2	9	12	Armenia (ARM)
3	2	4	5	12	0	0	0	0	0	2	3	4	5	12	Australasia (ANZ) [ANZ]

df = df.reset_index()
df.head()

	Gold	# Summer	Silver	Bronze	Total	# Winter	# Games	Gold.2	Silver.2	Bronze.2	Combined total	country
0	0	13	0	2	2	0	13	0	0	2	2	Afghanistan (AFG)
1	5	12	2	8	15	3	15	5	2	8	15	Algeria (ALG)
2	18	23	24	28	70	18	41	18	24	28	70	Argentina (ARG)
3	1	5	2	9	12	6	11	1	2	9	12	Armenia (ARM)
4	3	2	4	5	12	0	2	3	4	5	12	Australasia (ANZ) [ANZ]

df = pd.read_csv('census.csv')
df.head()

	SUMLEV	REGION	DIVISION	STATE	COUNTY	STNAME	CTYNAME	CENSUS2010POP	ESTIMATESBASE2010	POPESTIMATE2010	...	RDOMESTICMIG2011	RDOMESTICMIG2012	RDOMESTICMIG2013	RDOMESTICMIG2014	RDOMESTICMIG2015	RNETMIG2011	RNETMIG2012	RNETMIG2013	RNETMIG2014	RNETMIG2015
0	40	3	6	1	0	Alabama	Alabama	4779736	4780127	4785161	...	0.002295	-0.193196	0.381066	0.582002	-0.467369	1.030015	0.826644	1.383282	1.724718	0.712594
1	50	3	6	1	1	Alabama	Autauga County	54571	54571	54660	...	7.242091	-2.915927	-3.012349	2.265971	-2.530799	7.606016	-2.626146	-2.722002	2.592270	-2.187333
2	50	3	6	1	3	Alabama	Baldwin County	182265	182265	183193	...	14.832960	17.647293	21.845705	19.243287	17.197872	15.844176	18.559627	22.727626	20.317142	18.293499
3	50	3	6	1	5	Alabama	Barbour County	27457	27457	27341	...	-4.728132	-2.500690	-7.056824	-3.904217	-10.543299	-4.874741	-2.758113	-7.167664	-3.978583	-10.543299
4	50	3	6	1	7	Alabama	Bibb County	22915	22919	22861	...	-5.527043	-5.068871	-6.201001	-0.177537	0.177258	-5.088389	-4.363636	-5.403729	0.754533	1.107861

5 rows × 100 columns

df['SUMLEV'].unique()

array([40, 50], dtype=int64)

df=df[df['SUMLEV'] == 50]
df.head()

	SUMLEV	REGION	DIVISION	STATE	COUNTY	STNAME	CTYNAME	CENSUS2010POP	ESTIMATESBASE2010	POPESTIMATE2010	...	RDOMESTICMIG2011	RDOMESTICMIG2012	RDOMESTICMIG2013	RDOMESTICMIG2014	RDOMESTICMIG2015	RNETMIG2011	RNETMIG2012	RNETMIG2013	RNETMIG2014	RNETMIG2015
1	50	3	6	1	1	Alabama	Autauga County	54571	54571	54660	...	7.242091	-2.915927	-3.012349	2.265971	-2.530799	7.606016	-2.626146	-2.722002	2.592270	-2.187333
2	50	3	6	1	3	Alabama	Baldwin County	182265	182265	183193	...	14.832960	17.647293	21.845705	19.243287	17.197872	15.844176	18.559627	22.727626	20.317142	18.293499
3	50	3	6	1	5	Alabama	Barbour County	27457	27457	27341	...	-4.728132	-2.500690	-7.056824	-3.904217	-10.543299	-4.874741	-2.758113	-7.167664	-3.978583	-10.543299
4	50	3	6	1	7	Alabama	Bibb County	22915	22919	22861	...	-5.527043	-5.068871	-6.201001	-0.177537	0.177258	-5.088389	-4.363636	-5.403729	0.754533	1.107861
5	50	3	6	1	9	Alabama	Blount County	57322	57322	57373	...	1.807375	-1.177622	-1.748766	-2.062535	-1.369970	1.859511	-0.848580	-1.402476	-1.577232	-0.884411

5 rows × 100 columns

columns_to_keep = ['STNAME',
                   'CTYNAME',
                   'BIRTHS2010',
                   'BIRTHS2011',
                   'BIRTHS2012',
                   'BIRTHS2013',
                   'BIRTHS2014',
                   'BIRTHS2015',
                   'POPESTIMATE2010',
                   'POPESTIMATE2011',
                   'POPESTIMATE2012',
                   'POPESTIMATE2013',
                   'POPESTIMATE2014',
                   'POPESTIMATE2015']
df = df[columns_to_keep]
df.head()

	STNAME	CTYNAME	BIRTHS2010	BIRTHS2011	BIRTHS2012	BIRTHS2013	BIRTHS2014	BIRTHS2015	POPESTIMATE2010	POPESTIMATE2011	POPESTIMATE2012	POPESTIMATE2013	POPESTIMATE2014	POPESTIMATE2015
1	Alabama	Autauga County	151	636	615	574	623	600	54660	55253	55175	55038	55290	55347
2	Alabama	Baldwin County	517	2187	2092	2160	2186	2240	183193	186659	190396	195126	199713	203709
3	Alabama	Barbour County	70	335	300	283	260	269	27341	27226	27159	26973	26815	26489
4	Alabama	Bibb County	44	266	245	259	247	253	22861	22733	22642	22512	22549	22583
5	Alabama	Blount County	183	744	710	646	618	603	57373	57711	57776	57734	57658	57673

df = df.set_index(['STNAME', 'CTYNAME'])
df.head()

		BIRTHS2010	BIRTHS2011	BIRTHS2012	BIRTHS2013	BIRTHS2014	BIRTHS2015	POPESTIMATE2010	POPESTIMATE2011	POPESTIMATE2012	POPESTIMATE2013	POPESTIMATE2014	POPESTIMATE2015
STNAME	CTYNAME
Alabama	Autauga County	151	636	615	574	623	600	54660	55253	55175	55038	55290	55347
	Baldwin County	517	2187	2092	2160	2186	2240	183193	186659	190396	195126	199713	203709
	Barbour County	70	335	300	283	260	269	27341	27226	27159	26973	26815	26489
	Bibb County	44	266	245	259	247	253	22861	22733	22642	22512	22549	22583
	Blount County	183	744	710	646	618	603	57373	57711	57776	57734	57658	57673

df.loc[('Michigan', 'Washtenaw County')]

BIRTHS2010            977
BIRTHS2011           3826
BIRTHS2012           3780
BIRTHS2013           3662
BIRTHS2014           3683
BIRTHS2015           3709
POPESTIMATE2010    345563
POPESTIMATE2011    349048
POPESTIMATE2012    351213
POPESTIMATE2013    354289
POPESTIMATE2014    357029
POPESTIMATE2015    358880
Name: (Michigan, Washtenaw County), dtype: int64

Missing values

df = pd.read_csv('log.csv')
df.head()

	time	user	video	playback position	paused	volume
0	1469974424	cheryl	intro.html	5	False	10.0
1	1469974454	cheryl	intro.html	6	NaN	NaN
2	1469974544	cheryl	intro.html	9	NaN	NaN
3	1469974574	cheryl	intro.html	10	NaN	NaN
4	1469977514	bob	intro.html	1	NaN	NaN

df = df.sort_values('time')
df.head()

	time	user	video	playback position	paused	volume
0	1469974424	cheryl	intro.html	5	False	10.0
13	1469974424	sue	advanced.html	23	False	10.0
1	1469974454	cheryl	intro.html	6	NaN	NaN
11	1469974454	sue	advanced.html	24	NaN	NaN
18	1469974484	cheryl	intro.html	7	NaN	NaN

df = df.set_index(['time', 'user'])
df.head()

		video	playback position	paused	volume
time	user
1469974424	cheryl	intro.html	5	False	10.0
1469974424	sue	advanced.html	23	False	10.0
1469974454	cheryl	intro.html	6	NaN	NaN
1469974454	sue	advanced.html	24	NaN	NaN
1469974484	cheryl	intro.html	7	NaN	NaN

df = df.fillna(method='ffill')
df.head()

		video	playback position	paused	volume
time	user
1469974424	cheryl	intro.html	5	False	10.0
1469974424	sue	advanced.html	23	False	10.0
1469974454	cheryl	intro.html	6	False	10.0
1469974454	sue	advanced.html	24	False	10.0
1469974484	cheryl	intro.html	7	False	10.0

Share on

Twitter Facebook Google+ LinkedIn

YoonSoo

Data Processing with Pandas

What is Data Science?

Drew Conway

David Donoho

Data Processing with Pandas

The Series Data Structure

Querying a Series

The DataFrame Data Structure

Dataframe Indexing and Loading

Querying a DataFrame

Indexing Dataframes

Missing values

Share on

Leave a Comment

You May Also Enjoy

Generalized Linear Models (GLM)

“ALBERT: A Lite BERT for Self-supervised Learning of Language Representations” Summarized

“Generative Pretraining from Pixels” Summarized

“Language Models are Few-Shot Learners” Summarized