Data Processing with Pandas
Original Source: https://www.coursera.org/specializations/data-science-python
What is Data Science?
Drew Conway
David Donoho
- Data Exploration and Preparation
- Data Representation and Transformation
- Computing with Data
- Data Modeling
- Data Visualization and Presentation
- Science about Data Science
Data Processing with Pandas
The Series Data Structure
import pandas as pd
sports = {'Archery': 'Bhutan',
'Golf': 'Scotland',
'Sumo': 'Japan',
'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s
Archery Bhutan
Golf Scotland
Sumo Japan
Taekwondo South Korea
dtype: object
pd.Series(['Tiger', 'Bear', 'Moose'], index=['India', 'America', 'Canada'])
India Tiger
America Bear
Canada Moose
dtype: object
Querying a Series
s.iloc[3]
'South Korea'
s.loc['Golf']
'Scotland'
s = pd.Series([1, 2, 3])
s.loc['Animal'] = 'Bears'
s
0 1
1 2
2 3
Animal Bears
dtype: object
The DataFrame Data Structure
purchase_1 = pd.Series({'Name': 'Chris',
'Item Purchased': 'Dog Food',
'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
'Item Purchased': 'Kitty Litter',
'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
'Item Purchased': 'Bird Seed',
'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
df.head()
Name | Item Purchased | Cost | |
---|---|---|---|
Store 1 | Chris | Dog Food | 22.5 |
Store 1 | Kevyn | Kitty Litter | 2.5 |
Store 2 | Vinod | Bird Seed | 5.0 |
df.loc['Store 2']
Name Vinod
Item Purchased Bird Seed
Cost 5
Name: Store 2, dtype: object
df.loc['Store 1', 'Cost']
Store 1 22.5
Store 1 2.5
Name: Cost, dtype: float64
df['Cost']
Store 1 22.5
Store 1 2.5
Store 2 5.0
Name: Cost, dtype: float64
df.drop('Store 1')
Name | Item Purchased | Cost | |
---|---|---|---|
Store 2 | Vinod | Bird Seed | 5.0 |
# in pandas operations, default is inplace=True
# so 'df.drop' doesn't change value of the variable.
df
Name | Item Purchased | Cost | |
---|---|---|---|
Store 1 | Chris | Dog Food | 22.5 |
Store 1 | Kevyn | Kitty Litter | 2.5 |
Store 2 | Vinod | Bird Seed | 5.0 |
df['Location'] = None
df
Name | Item Purchased | Cost | Location | |
---|---|---|---|---|
Store 1 | Chris | Dog Food | 22.5 | None |
Store 1 | Kevyn | Kitty Litter | 2.5 | None |
Store 2 | Vinod | Bird Seed | 5.0 | None |
Dataframe Indexing and Loading
costs = df['Cost']+2
df
Name | Item Purchased | Cost | Location | |
---|---|---|---|---|
Store 1 | Chris | Dog Food | 22.5 | None |
Store 1 | Kevyn | Kitty Litter | 2.5 | None |
Store 2 | Vinod | Bird Seed | 5.0 | None |
df = pd.read_csv('olympics.csv')
df.head()
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | NaN | № Summer | 01 ! | 02 ! | 03 ! | Total | № Winter | 01 ! | 02 ! | 03 ! | Total | № Games | 01 ! | 02 ! | 03 ! | Combined total |
1 | Afghanistan (AFG) | 13 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 |
2 | Algeria (ALG) | 12 | 5 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 |
3 | Argentina (ARG) | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 |
4 | Armenia (ARM) | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 |
df = pd.read_csv('olympics.csv', index_col = 0, skiprows=1)
df.head()
№ Summer | 01 ! | 02 ! | 03 ! | Total | № Winter | 01 !.1 | 02 !.1 | 03 !.1 | Total.1 | № Games | 01 !.2 | 02 !.2 | 03 !.2 | Combined total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Afghanistan (AFG) | 13 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 |
Algeria (ALG) | 12 | 5 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 |
Argentina (ARG) | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 |
Armenia (ARM) | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 |
Australasia (ANZ) [ANZ] | 2 | 3 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 |
df.describe()
№ Summer | 01 ! | 02 ! | 03 ! | Total | № Winter | 01 !.1 | 02 !.1 | 03 !.1 | Total.1 | № Games | 01 !.2 | 02 !.2 | 03 !.2 | Combined total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 147.000000 | 147.000000 | 147.000000 | 147.000000 | 147.000000 | 147.000000 | 147.000000 | 147.000000 | 147.000000 | 147.000000 | 147.000000 | 147.000000 | 147.000000 | 147.000000 | 147.000000 |
mean | 13.476190 | 65.428571 | 64.965986 | 69.795918 | 200.190476 | 6.700680 | 13.047619 | 13.034014 | 12.897959 | 38.979592 | 20.176871 | 78.476190 | 78.000000 | 82.693878 | 239.170068 |
std | 7.072359 | 405.549990 | 399.309960 | 427.187344 | 1231.306297 | 7.433186 | 80.799204 | 80.634421 | 79.588388 | 240.917324 | 13.257048 | 485.013378 | 478.860334 | 505.855110 | 1469.067883 |
min | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 |
25% | 8.000000 | 0.000000 | 1.000000 | 1.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 11.000000 | 0.000000 | 1.000000 | 1.000000 | 2.500000 |
50% | 13.000000 | 3.000000 | 4.000000 | 6.000000 | 12.000000 | 5.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 15.000000 | 3.000000 | 4.000000 | 7.000000 | 12.000000 |
75% | 18.500000 | 24.000000 | 28.000000 | 29.000000 | 86.000000 | 10.000000 | 1.000000 | 2.000000 | 1.000000 | 5.000000 | 27.000000 | 25.500000 | 29.000000 | 32.500000 | 89.000000 |
max | 27.000000 | 4809.000000 | 4775.000000 | 5130.000000 | 14714.000000 | 22.000000 | 959.000000 | 958.000000 | 948.000000 | 2865.000000 | 49.000000 | 5768.000000 | 5733.000000 | 6078.000000 | 17579.000000 |
df.index
Index(['Afghanistan (AFG)', 'Algeria (ALG)', 'Argentina (ARG)',
'Armenia (ARM)', 'Australasia (ANZ) [ANZ]', 'Australia (AUS) [AUS] [Z]',
'Austria (AUT)', 'Azerbaijan (AZE)', 'Bahamas (BAH)', 'Bahrain (BRN)',
...
'Uzbekistan (UZB)', 'Venezuela (VEN)', 'Vietnam (VIE)',
'Virgin Islands (ISV)', 'Yugoslavia (YUG) [YUG]',
'Independent Olympic Participants (IOP) [IOP]', 'Zambia (ZAM) [ZAM]',
'Zimbabwe (ZIM) [ZIM]', 'Mixed team (ZZX) [ZZX]', 'Totals'],
dtype='object', length=147)
df.columns
Index(['№ Summer', '01 !', '02 !', '03 !', 'Total', '№ Winter', '01 !.1',
'02 !.1', '03 !.1', 'Total.1', '№ Games', '01 !.2', '02 !.2', '03 !.2',
'Combined total'],
dtype='object')
for col in df.columns:
if col[:2]=='01':
df.rename(columns={col:'Gold' + col[4:]}, inplace=True)
if col[:2]=='02':
df.rename(columns={col:'Silver' + col[4:]}, inplace=True)
if col[:2]=='03':
df.rename(columns={col:'Bronze' + col[4:]}, inplace=True)
if col[:1]=='№':
df.rename(columns={col:'#' + col[1:]}, inplace=True)
df.head()
# Summer | Gold | Silver | Bronze | Total | # Winter | Gold.1 | Silver.1 | Bronze.1 | Total.1 | # Games | Gold.2 | Silver.2 | Bronze.2 | Combined total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Afghanistan (AFG) | 13 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 |
Algeria (ALG) | 12 | 5 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 |
Argentina (ARG) | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 |
Armenia (ARM) | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 |
Australasia (ANZ) [ANZ] | 2 | 3 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 |
Querying a DataFrame
Boolean Masking
(df['Gold'] > 0).head()
Afghanistan (AFG) False
Algeria (ALG) True
Argentina (ARG) True
Armenia (ARM) True
Australasia (ANZ) [ANZ] True
Name: Gold, dtype: bool
only_gold = df.where(df['Gold'] > 0)
only_gold.head()
# Summer | Gold | Silver | Bronze | Total | # Winter | Gold.1 | Silver.1 | Bronze.1 | Total.1 | # Games | Gold.2 | Silver.2 | Bronze.2 | Combined total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Afghanistan (AFG) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
Algeria (ALG) | 12.0 | 5.0 | 2.0 | 8.0 | 15.0 | 3.0 | 0.0 | 0.0 | 0.0 | 0.0 | 15.0 | 5.0 | 2.0 | 8.0 | 15.0 |
Argentina (ARG) | 23.0 | 18.0 | 24.0 | 28.0 | 70.0 | 18.0 | 0.0 | 0.0 | 0.0 | 0.0 | 41.0 | 18.0 | 24.0 | 28.0 | 70.0 |
Armenia (ARM) | 5.0 | 1.0 | 2.0 | 9.0 | 12.0 | 6.0 | 0.0 | 0.0 | 0.0 | 0.0 | 11.0 | 1.0 | 2.0 | 9.0 | 12.0 |
Australasia (ANZ) [ANZ] | 2.0 | 3.0 | 4.0 | 5.0 | 12.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 3.0 | 4.0 | 5.0 | 12.0 |
only_gold = df[df['Gold'] > 0]
only_gold.head()
# Summer | Gold | Silver | Bronze | Total | # Winter | Gold.1 | Silver.1 | Bronze.1 | Total.1 | # Games | Gold.2 | Silver.2 | Bronze.2 | Combined total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Algeria (ALG) | 12 | 5 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 |
Argentina (ARG) | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 |
Armenia (ARM) | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 |
Australasia (ANZ) [ANZ] | 2 | 3 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 |
Australia (AUS) [AUS] [Z] | 25 | 139 | 152 | 177 | 468 | 18 | 5 | 3 | 4 | 12 | 43 | 144 | 155 | 181 | 480 |
df[(df['Gold.1'] > 0) & (df['Gold'] == 0)]
# Summer | Gold | Silver | Bronze | Total | # Winter | Gold.1 | Silver.1 | Bronze.1 | Total.1 | # Games | Gold.2 | Silver.2 | Bronze.2 | Combined total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Liechtenstein (LIE) | 16 | 0 | 0 | 0 | 0 | 18 | 2 | 2 | 5 | 9 | 34 | 2 | 2 | 5 | 9 |
Indexing Dataframes
df.head()
# Summer | Gold | Silver | Bronze | Total | # Winter | Gold.1 | Silver.1 | Bronze.1 | Total.1 | # Games | Gold.2 | Silver.2 | Bronze.2 | Combined total | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Afghanistan (AFG) | 13 | 0 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 |
Algeria (ALG) | 12 | 5 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 |
Argentina (ARG) | 23 | 18 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 |
Armenia (ARM) | 5 | 1 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 |
Australasia (ANZ) [ANZ] | 2 | 3 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 |
df['country'] = df.index
df = df.set_index('Gold')
df.head()
# Summer | Silver | Bronze | Total | # Winter | Gold.1 | Silver.1 | Bronze.1 | Total.1 | # Games | Gold.2 | Silver.2 | Bronze.2 | Combined total | country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Gold | |||||||||||||||
0 | 13 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 | Afghanistan (AFG) |
5 | 12 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 | Algeria (ALG) |
18 | 23 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 | Argentina (ARG) |
1 | 5 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 | Armenia (ARM) |
3 | 2 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 | Australasia (ANZ) [ANZ] |
df = df.reset_index()
df.head()
Gold | # Summer | Silver | Bronze | Total | # Winter | Gold.1 | Silver.1 | Bronze.1 | Total.1 | # Games | Gold.2 | Silver.2 | Bronze.2 | Combined total | country | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | 13 | 0 | 2 | 2 | 0 | 0 | 0 | 0 | 0 | 13 | 0 | 0 | 2 | 2 | Afghanistan (AFG) |
1 | 5 | 12 | 2 | 8 | 15 | 3 | 0 | 0 | 0 | 0 | 15 | 5 | 2 | 8 | 15 | Algeria (ALG) |
2 | 18 | 23 | 24 | 28 | 70 | 18 | 0 | 0 | 0 | 0 | 41 | 18 | 24 | 28 | 70 | Argentina (ARG) |
3 | 1 | 5 | 2 | 9 | 12 | 6 | 0 | 0 | 0 | 0 | 11 | 1 | 2 | 9 | 12 | Armenia (ARM) |
4 | 3 | 2 | 4 | 5 | 12 | 0 | 0 | 0 | 0 | 0 | 2 | 3 | 4 | 5 | 12 | Australasia (ANZ) [ANZ] |
df = pd.read_csv('census.csv')
df.head()
SUMLEV | REGION | DIVISION | STATE | COUNTY | STNAME | CTYNAME | CENSUS2010POP | ESTIMATESBASE2010 | POPESTIMATE2010 | ... | RDOMESTICMIG2011 | RDOMESTICMIG2012 | RDOMESTICMIG2013 | RDOMESTICMIG2014 | RDOMESTICMIG2015 | RNETMIG2011 | RNETMIG2012 | RNETMIG2013 | RNETMIG2014 | RNETMIG2015 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 40 | 3 | 6 | 1 | 0 | Alabama | Alabama | 4779736 | 4780127 | 4785161 | ... | 0.002295 | -0.193196 | 0.381066 | 0.582002 | -0.467369 | 1.030015 | 0.826644 | 1.383282 | 1.724718 | 0.712594 |
1 | 50 | 3 | 6 | 1 | 1 | Alabama | Autauga County | 54571 | 54571 | 54660 | ... | 7.242091 | -2.915927 | -3.012349 | 2.265971 | -2.530799 | 7.606016 | -2.626146 | -2.722002 | 2.592270 | -2.187333 |
2 | 50 | 3 | 6 | 1 | 3 | Alabama | Baldwin County | 182265 | 182265 | 183193 | ... | 14.832960 | 17.647293 | 21.845705 | 19.243287 | 17.197872 | 15.844176 | 18.559627 | 22.727626 | 20.317142 | 18.293499 |
3 | 50 | 3 | 6 | 1 | 5 | Alabama | Barbour County | 27457 | 27457 | 27341 | ... | -4.728132 | -2.500690 | -7.056824 | -3.904217 | -10.543299 | -4.874741 | -2.758113 | -7.167664 | -3.978583 | -10.543299 |
4 | 50 | 3 | 6 | 1 | 7 | Alabama | Bibb County | 22915 | 22919 | 22861 | ... | -5.527043 | -5.068871 | -6.201001 | -0.177537 | 0.177258 | -5.088389 | -4.363636 | -5.403729 | 0.754533 | 1.107861 |
5 rows × 100 columns
df['SUMLEV'].unique()
array([40, 50], dtype=int64)
df=df[df['SUMLEV'] == 50]
df.head()
SUMLEV | REGION | DIVISION | STATE | COUNTY | STNAME | CTYNAME | CENSUS2010POP | ESTIMATESBASE2010 | POPESTIMATE2010 | ... | RDOMESTICMIG2011 | RDOMESTICMIG2012 | RDOMESTICMIG2013 | RDOMESTICMIG2014 | RDOMESTICMIG2015 | RNETMIG2011 | RNETMIG2012 | RNETMIG2013 | RNETMIG2014 | RNETMIG2015 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 50 | 3 | 6 | 1 | 1 | Alabama | Autauga County | 54571 | 54571 | 54660 | ... | 7.242091 | -2.915927 | -3.012349 | 2.265971 | -2.530799 | 7.606016 | -2.626146 | -2.722002 | 2.592270 | -2.187333 |
2 | 50 | 3 | 6 | 1 | 3 | Alabama | Baldwin County | 182265 | 182265 | 183193 | ... | 14.832960 | 17.647293 | 21.845705 | 19.243287 | 17.197872 | 15.844176 | 18.559627 | 22.727626 | 20.317142 | 18.293499 |
3 | 50 | 3 | 6 | 1 | 5 | Alabama | Barbour County | 27457 | 27457 | 27341 | ... | -4.728132 | -2.500690 | -7.056824 | -3.904217 | -10.543299 | -4.874741 | -2.758113 | -7.167664 | -3.978583 | -10.543299 |
4 | 50 | 3 | 6 | 1 | 7 | Alabama | Bibb County | 22915 | 22919 | 22861 | ... | -5.527043 | -5.068871 | -6.201001 | -0.177537 | 0.177258 | -5.088389 | -4.363636 | -5.403729 | 0.754533 | 1.107861 |
5 | 50 | 3 | 6 | 1 | 9 | Alabama | Blount County | 57322 | 57322 | 57373 | ... | 1.807375 | -1.177622 | -1.748766 | -2.062535 | -1.369970 | 1.859511 | -0.848580 | -1.402476 | -1.577232 | -0.884411 |
5 rows × 100 columns
columns_to_keep = ['STNAME',
'CTYNAME',
'BIRTHS2010',
'BIRTHS2011',
'BIRTHS2012',
'BIRTHS2013',
'BIRTHS2014',
'BIRTHS2015',
'POPESTIMATE2010',
'POPESTIMATE2011',
'POPESTIMATE2012',
'POPESTIMATE2013',
'POPESTIMATE2014',
'POPESTIMATE2015']
df = df[columns_to_keep]
df.head()
STNAME | CTYNAME | BIRTHS2010 | BIRTHS2011 | BIRTHS2012 | BIRTHS2013 | BIRTHS2014 | BIRTHS2015 | POPESTIMATE2010 | POPESTIMATE2011 | POPESTIMATE2012 | POPESTIMATE2013 | POPESTIMATE2014 | POPESTIMATE2015 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Alabama | Autauga County | 151 | 636 | 615 | 574 | 623 | 600 | 54660 | 55253 | 55175 | 55038 | 55290 | 55347 |
2 | Alabama | Baldwin County | 517 | 2187 | 2092 | 2160 | 2186 | 2240 | 183193 | 186659 | 190396 | 195126 | 199713 | 203709 |
3 | Alabama | Barbour County | 70 | 335 | 300 | 283 | 260 | 269 | 27341 | 27226 | 27159 | 26973 | 26815 | 26489 |
4 | Alabama | Bibb County | 44 | 266 | 245 | 259 | 247 | 253 | 22861 | 22733 | 22642 | 22512 | 22549 | 22583 |
5 | Alabama | Blount County | 183 | 744 | 710 | 646 | 618 | 603 | 57373 | 57711 | 57776 | 57734 | 57658 | 57673 |
df = df.set_index(['STNAME', 'CTYNAME'])
df.head()
BIRTHS2010 | BIRTHS2011 | BIRTHS2012 | BIRTHS2013 | BIRTHS2014 | BIRTHS2015 | POPESTIMATE2010 | POPESTIMATE2011 | POPESTIMATE2012 | POPESTIMATE2013 | POPESTIMATE2014 | POPESTIMATE2015 | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
STNAME | CTYNAME | ||||||||||||
Alabama | Autauga County | 151 | 636 | 615 | 574 | 623 | 600 | 54660 | 55253 | 55175 | 55038 | 55290 | 55347 |
Baldwin County | 517 | 2187 | 2092 | 2160 | 2186 | 2240 | 183193 | 186659 | 190396 | 195126 | 199713 | 203709 | |
Barbour County | 70 | 335 | 300 | 283 | 260 | 269 | 27341 | 27226 | 27159 | 26973 | 26815 | 26489 | |
Bibb County | 44 | 266 | 245 | 259 | 247 | 253 | 22861 | 22733 | 22642 | 22512 | 22549 | 22583 | |
Blount County | 183 | 744 | 710 | 646 | 618 | 603 | 57373 | 57711 | 57776 | 57734 | 57658 | 57673 |
df.loc[('Michigan', 'Washtenaw County')]
BIRTHS2010 977
BIRTHS2011 3826
BIRTHS2012 3780
BIRTHS2013 3662
BIRTHS2014 3683
BIRTHS2015 3709
POPESTIMATE2010 345563
POPESTIMATE2011 349048
POPESTIMATE2012 351213
POPESTIMATE2013 354289
POPESTIMATE2014 357029
POPESTIMATE2015 358880
Name: (Michigan, Washtenaw County), dtype: int64
Missing values
df = pd.read_csv('log.csv')
df.head()
time | user | video | playback position | paused | volume | |
---|---|---|---|---|---|---|
0 | 1469974424 | cheryl | intro.html | 5 | False | 10.0 |
1 | 1469974454 | cheryl | intro.html | 6 | NaN | NaN |
2 | 1469974544 | cheryl | intro.html | 9 | NaN | NaN |
3 | 1469974574 | cheryl | intro.html | 10 | NaN | NaN |
4 | 1469977514 | bob | intro.html | 1 | NaN | NaN |
df = df.sort_values('time')
df.head()
time | user | video | playback position | paused | volume | |
---|---|---|---|---|---|---|
0 | 1469974424 | cheryl | intro.html | 5 | False | 10.0 |
13 | 1469974424 | sue | advanced.html | 23 | False | 10.0 |
1 | 1469974454 | cheryl | intro.html | 6 | NaN | NaN |
11 | 1469974454 | sue | advanced.html | 24 | NaN | NaN |
18 | 1469974484 | cheryl | intro.html | 7 | NaN | NaN |
df = df.set_index(['time', 'user'])
df.head()
video | playback position | paused | volume | ||
---|---|---|---|---|---|
time | user | ||||
1469974424 | cheryl | intro.html | 5 | False | 10.0 |
sue | advanced.html | 23 | False | 10.0 | |
1469974454 | cheryl | intro.html | 6 | NaN | NaN |
sue | advanced.html | 24 | NaN | NaN | |
1469974484 | cheryl | intro.html | 7 | NaN | NaN |
df = df.fillna(method='ffill')
df.head()
video | playback position | paused | volume | ||
---|---|---|---|---|---|
time | user | ||||
1469974424 | cheryl | intro.html | 5 | False | 10.0 |
sue | advanced.html | 23 | False | 10.0 | |
1469974454 | cheryl | intro.html | 6 | False | 10.0 |
sue | advanced.html | 24 | False | 10.0 | |
1469974484 | cheryl | intro.html | 7 | False | 10.0 |
Leave a Comment