Lecture 6 (4/8/22)¶
Last time we covered:

- numpy continuation
- pandas basics

Today’s agenda:

- Processing data with pandas
import pandas as pd
import numpy as np
Creating new dataframes¶
It’s worth noting: there are many different ways to do operations like dataframe creation in pandas.
The approach below is probably the most intuitive, but you may come across others!
# First let's initialize some data
# Typically, we want to think of our data as being lists of the stuff in each column
uc_schools = np.array(["Berkeley", "San Diego", "Los Angeles", "Santa Barbara", "San Francisco",
                       "Irvine", "Davis", "Riverside", "Santa Cruz", "Merced"])
uc_founded = [1868, 1960, 1919, 1909, 1864, 1965, 1905, 1954, 1965, 2005]
uc_students = pd.Series([45057, 42875, 45742, 26314, 3132, 35220, 40031, 25548, 19161, 8847])
uc_grads = np.zeros(10) # placeholder column of zeros (left commented out of the dataframe below)
# Now let's put it in a dataframe!
uc_data = pd.DataFrame({ # start by declaring a new data frame
    "Campus": uc_schools, # each column name is a dictionary key and the list of column data points is the value
    "Date_founded": uc_founded,
    "Number_of_students": uc_students,
    # "grads": uc_grads
})
uc_data
| Campus | Date_founded | Number_of_students |
---|---|---|---|
0 | Berkeley | 1868 | 45057 |
1 | San Diego | 1960 | 42875 |
2 | Los Angeles | 1919 | 45742 |
3 | Santa Barbara | 1909 | 26314 |
4 | San Francisco | 1864 | 3132 |
5 | Irvine | 1965 | 35220 |
6 | Davis | 1905 | 40031 |
7 | Riverside | 1954 | 25548 |
8 | Santa Cruz | 1965 | 19161 |
9 | Merced | 2005 | 8847 |
We will likely find ourselves reading in data more often than creating new dataframes, but it’s occasionally useful and good to know how to do it!
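Since there are other construction patterns you may come across, here’s a minimal sketch of one alternative: building the dataframe from a list of row dictionaries. (The variable names below are illustrative, not part of the lecture code.)

# Sketch: one dictionary per row; pandas infers the columns from the dict keys
uc_rows = [
    {"Campus": "Berkeley", "Date_founded": 1868, "Number_of_students": 45057},
    {"Campus": "Merced", "Date_founded": 2005, "Number_of_students": 8847},
]
uc_data_alt = pd.DataFrame(uc_rows)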
Adding data to existing dataframes¶
Adding columns (common!)¶
We may often find ourselves adding columns to a dataframe, e.g., creating columns that represent our existing data in a new way.
# One option: when we have new data to add
# Note: this Series is deliberately listed in a different order than uc_data!
# pandas aligns values by index label, not by position (see the sketch after the table below)
uc_undergrads = pd.Series([33343, 31543, 23349, 0, 30222, 31162, 22055, 17207, 8151, 31814],
                          index = ["San Diego", "Los Angeles", "Santa Barbara", "San Francisco",
                                   "Irvine", "Davis", "Riverside", "Santa Cruz", "Merced", "Berkeley"])
# uc_undergrads.index = ["Berkeley", "San Diego", "Los Angeles", "Santa Barbara", "San Francisco",
#                        "Irvine", "Davis", "Riverside", "Santa Cruz", "Merced"]
# Give uc_data a matching index of campus names so the alignment below works
uc_data.index = ["Berkeley", "San Diego", "Los Angeles", "Santa Barbara", "San Francisco",
                 "Irvine", "Davis", "Riverside", "Santa Cruz", "Merced"]
# uc_undergrads
uc_data['Undergraduates'] = uc_undergrads # Use bracket syntax to declare a new column
uc_data
| Campus | Date_founded | Number_of_students | Undergraduates |
---|---|---|---|---|
Berkeley | Berkeley | 1868 | 45057 | 31814 |
San Diego | San Diego | 1960 | 42875 | 33343 |
Los Angeles | Los Angeles | 1919 | 45742 | 31543 |
Santa Barbara | Santa Barbara | 1909 | 26314 | 23349 |
San Francisco | San Francisco | 1864 | 3132 | 0 |
Irvine | Irvine | 1965 | 35220 | 30222 |
Davis | Davis | 1905 | 40031 | 31162 |
Riverside | Riverside | 1954 | 25548 | 22055 |
Santa Cruz | Santa Cruz | 1965 | 19161 | 17207 |
Merced | Merced | 2005 | 8847 | 8151 |
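This worked even though `uc_undergrads` was listed in a different order: pandas matched each value to its row by index label. Here’s a tiny sketch of that alignment behavior (hypothetical mini-data, not the lecture’s):

# Values are matched by index label, not by position;
# labels missing from the Series come out as NaN
demo_df = pd.DataFrame({"x": [10, 20, 30]}, index=["Berkeley", "Merced", "Davis"])
demo_vals = pd.Series({"Merced": 2, "Berkeley": 1}) # different order, and Davis is missing
demo_df["y"] = demo_vals # Berkeley -> 1, Merced -> 2, Davis -> NaN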
# A second option: when we want to process existing data and form a new column
uc_data = uc_data.assign(Undergraduate_pct = uc_data['Undergraduates'] / uc_data['Number_of_students'])
# uc_data['Undergraduate_pct'] = uc_data['Undergraduates'] / uc_data['Number_of_students']
uc_data # Note what happens if we don't do the re-assignment above
uc_data = uc_data.assign(Total = uc_data['Undergraduates'] + uc_data['Number_of_students'])
uc_data
# For this sort of processing, we can also use similar syntax to the above (try it yourself!)
# but `assign` gives us some additional flexibility (see the sketch after the table below)
| Campus | Date_founded | Number_of_students | Undergraduates | Undergraduate_pct | Total |
---|---|---|---|---|---|---|
Berkeley | Berkeley | 1868 | 45057 | 31814 | 0.706083 | 76871 |
San Diego | San Diego | 1960 | 42875 | 33343 | 0.777679 | 76218 |
Los Angeles | Los Angeles | 1919 | 45742 | 31543 | 0.689585 | 77285 |
Santa Barbara | Santa Barbara | 1909 | 26314 | 23349 | 0.887322 | 49663 |
San Francisco | San Francisco | 1864 | 3132 | 0 | 0.000000 | 3132 |
Irvine | Irvine | 1965 | 35220 | 30222 | 0.858092 | 65442 |
Davis | Davis | 1905 | 40031 | 31162 | 0.778447 | 71193 |
Riverside | Riverside | 1954 | 25548 | 22055 | 0.863277 | 47603 |
Santa Cruz | Santa Cruz | 1965 | 19161 | 17207 | 0.898022 | 36368 |
Merced | Merced | 2005 | 8847 | 8151 | 0.921329 | 16998 |
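One example of that extra flexibility, as a sketch (the `_share` column names are made up for illustration): `assign` also accepts functions, which receive the intermediate dataframe, so later arguments can build on columns created earlier in the same call.

# Sketch: with lambdas, Grad_share can use the Undergrad_share column computed just before it
uc_shares = uc_data.assign(
    Undergrad_share = lambda df: df['Undergraduates'] / df['Number_of_students'],
    Grad_share = lambda df: 1 - df['Undergrad_share'] # refers to the column defined above
)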
Adding rows¶
# Let's say the UC system decides it's long overdue to build a campus in Lake Tahoe near the Nevada border.
# We want to add some projected data.
# First, we make our new row into a dataframe of its own, ideally with column names matching the original
uc_tahoe = pd.DataFrame({
    "Campus": ["Lake Tahoe"],
    "Date founded": [2022] # Note we don't need to know all the column info here
    # Careful: "Date founded" (with a space) does NOT match the existing "Date_founded" column,
    # so the concat below will quietly create a second date column (see the sketch after its output)
})
uc_tahoe
| Campus | Date founded |
---|---|---|
0 | Lake Tahoe | 2022 |
# Next, we use `concat` to add it to the previous dataframe
uc_data = pd.concat([uc_data, uc_tahoe])
uc_data
| Campus | Date_founded | Number_of_students | Undergraduates | Undergraduate_pct | Total | Date founded |
---|---|---|---|---|---|---|---|
Berkeley | Berkeley | 1868.0 | 45057.0 | 31814.0 | 0.706083 | 76871.0 | NaN |
San Diego | San Diego | 1960.0 | 42875.0 | 33343.0 | 0.777679 | 76218.0 | NaN |
Los Angeles | Los Angeles | 1919.0 | 45742.0 | 31543.0 | 0.689585 | 77285.0 | NaN |
Santa Barbara | Santa Barbara | 1909.0 | 26314.0 | 23349.0 | 0.887322 | 49663.0 | NaN |
San Francisco | San Francisco | 1864.0 | 3132.0 | 0.0 | 0.000000 | 3132.0 | NaN |
Irvine | Irvine | 1965.0 | 35220.0 | 30222.0 | 0.858092 | 65442.0 | NaN |
Davis | Davis | 1905.0 | 40031.0 | 31162.0 | 0.778447 | 71193.0 | NaN |
Riverside | Riverside | 1954.0 | 25548.0 | 22055.0 | 0.863277 | 47603.0 | NaN |
Santa Cruz | Santa Cruz | 1965.0 | 19161.0 | 17207.0 | 0.898022 | 36368.0 | NaN |
Merced | Merced | 2005.0 | 8847.0 | 8151.0 | 0.921329 | 16998.0 | NaN |
0 | Lake Tahoe | NaN | NaN | NaN | NaN | NaN | 2022.0 |
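Notice the stray `Date founded` column full of NaNs: because the new dataframe’s column name didn’t exactly match `Date_founded`, `concat` kept both columns and filled the gaps. A quick sketch of the fix, assuming we rename before concatenating (`uc_data_original` is a stand-in for the 10-row dataframe we had before this concat):

# Sketch: make the column names match first, so concat aligns them into one column
uc_tahoe_matched = uc_tahoe.rename(columns={"Date founded": "Date_founded"})
# pd.concat([uc_data_original, uc_tahoe_matched]) would then put 2022 in Date_founded
# for Lake Tahoe instead of creating a second, mostly-NaN date column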
Processing data: filtering, grouping, summarizing¶
First, note that some operations can be done on our dataframe without having to use filtering or grouping.
uc_data = uc_data.sort_values('Date_founded', ascending = True)
uc_data
# Note this doesn't change the index at far left! (see the sketch after the output below)
| Campus | Date_founded | Number_of_students | Undergraduates | Undergraduate_pct | Total | Date founded |
---|---|---|---|---|---|---|---|
San Francisco | San Francisco | 1864.0 | 3132.0 | 0.0 | 0.000000 | 3132.0 | NaN |
Berkeley | Berkeley | 1868.0 | 45057.0 | 31814.0 | 0.706083 | 76871.0 | NaN |
Davis | Davis | 1905.0 | 40031.0 | 31162.0 | 0.778447 | 71193.0 | NaN |
Santa Barbara | Santa Barbara | 1909.0 | 26314.0 | 23349.0 | 0.887322 | 49663.0 | NaN |
Los Angeles | Los Angeles | 1919.0 | 45742.0 | 31543.0 | 0.689585 | 77285.0 | NaN |
Riverside | Riverside | 1954.0 | 25548.0 | 22055.0 | 0.863277 | 47603.0 | NaN |
San Diego | San Diego | 1960.0 | 42875.0 | 33343.0 | 0.777679 | 76218.0 | NaN |
Irvine | Irvine | 1965.0 | 35220.0 | 30222.0 | 0.858092 | 65442.0 | NaN |
Santa Cruz | Santa Cruz | 1965.0 | 19161.0 | 17207.0 | 0.898022 | 36368.0 | NaN |
Merced | Merced | 2005.0 | 8847.0 | 8151.0 | 0.921329 | 16998.0 | NaN |
0 | Lake Tahoe | NaN | NaN | NaN | NaN | NaN | 2022.0 |
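If we did want the row labels renumbered after sorting, here’s a quick sketch (assuming a reasonably recent pandas, and that we’re happy to discard the campus-name index):

# Sketch: ignore_index=True renumbers the rows 0..n-1 after sorting
# (equivalent to .sort_values(...).reset_index(drop=True))
uc_resorted = uc_data.sort_values('Date_founded', ascending=True, ignore_index=True)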
print(uc_data.max())
Campus Santa Cruz
Date_founded 2005
Number_of_students 45742
Undergraduates 33343
Undergraduate_pct 0.921329
Total 77285
Date founded 2022
dtype: object
print(uc_data['Number_of_students'].min())
3132.0
Filtering: fetch rows that meet certain conditions¶
Filtering operations given by pandas¶
We can do some kinds of filtering with built-in operations like `nlargest`
top_3_students = uc_data.nlargest(3, 'Number_of_students')
top_3_students
| Campus | Date_founded | Number_of_students | Undergraduates | Undergraduate_pct | Total | Date founded |
---|---|---|---|---|---|---|---|
Los Angeles | Los Angeles | 1919.0 | 45742.0 | 31543.0 | 0.689585 | 77285.0 | NaN |
Berkeley | Berkeley | 1868.0 | 45057.0 | 31814.0 | 0.706083 | 76871.0 | NaN |
San Diego | San Diego | 1960.0 | 42875.0 | 33343.0 | 0.777679 | 76218.0 | NaN |
Logical filtering¶
Most often, we filter by setting logical criteria over certain column values.
# Get quantiles of undergraduate sizes
q = uc_data['Undergraduates'].quantile([0.25, 0.75])
q[0.75]
31447.75
# Fetch uc_data rows where uc_data['Undergraduates'] is > the 75th percentile value above
large_undergrads = uc_data[uc_data['Undergraduates'] > q[0.75]]
large_undergrads
| Campus | Date_founded | Number_of_students | Undergraduates | Undergraduate_pct | Total | Date founded |
---|---|---|---|---|---|---|---|
Berkeley | Berkeley | 1868.0 | 45057.0 | 31814.0 | 0.706083 | 76871.0 | NaN |
Los Angeles | Los Angeles | 1919.0 | 45742.0 | 31543.0 | 0.689585 | 77285.0 | NaN |
San Diego | San Diego | 1960.0 | 42875.0 | 33343.0 | 0.777679 | 76218.0 | NaN |
How does this work?
uc_data['Undergraduates'] > q[0.75]
San Francisco False
Berkeley True
Davis False
Santa Barbara False
Los Angeles True
Riverside False
San Diego True
Irvine False
Santa Cruz False
Merced False
0 False
Name: Undergraduates, dtype: bool
So in the code above, we’re basically saying “fetch the rows where this condition evaluates to `True`”.
Note this can get much more complicated…
# What's going on here?
max_undergrad = uc_data[uc_data['Undergraduates'] == uc_data['Undergraduates'].max()]
max_undergrad
| Campus | Date_founded | Number_of_students | Undergraduates | Undergraduate_pct | Total | Date founded |
---|---|---|---|---|---|---|---|
San Diego | San Diego | 1960.0 | 42875.0 | 33343.0 | 0.777679 | 76218.0 | NaN |
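These boolean masks can also be combined with `&` (and) and `|` (or). A small sketch on the same data (the variable name is mine); note that each condition needs its own parentheses, because `&` binds more tightly than comparisons:

# Sketch: combine element-wise boolean conditions with & (and) / | (or)
old_and_large = uc_data[(uc_data['Date_founded'] < 1950) & (uc_data['Undergraduates'] > 30000)]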
Why is this useful?¶
Let’s ask: what is the average number of undergraduates in the schools with the most students overall?
q = uc_data['Number_of_students'].quantile([0.2, 0.80])
# type(q)
# q.index
q[0.8]
# What's going on here?
uc_data[uc_data['Number_of_students'] >= q[0.8]]['Undergraduates'].mean()
uc_data[uc_data['Number_of_students'] >= q[0.2]]['Undergraduates'].mean()
# uc_data[uc_data['Number_of_students'] >= q[0.4]]['Undergraduates'].mean()
# Note: only the value of the *last* expression in the cell is displayed below (the q[0.2] version)
27586.875
Here, we applied some statistics to a filtered subset of our data.
Imagine we wanted the same average for each quartile of student numbers, instead of just the top 20%.
This kind of thing is pretty common: this is where grouping comes in.
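As a preview, here’s a hedged sketch of what those quartile-by-quartile averages could look like, using `pd.qcut` to bin the schools (the `Q1`–`Q4` labels are made up for illustration; we develop grouping properly on a new dataset below):

# Sketch: qcut cuts Number_of_students into 4 equal-sized quartile bins,
# and groupby then averages Undergraduates within each bin
uc_quartiles = pd.qcut(uc_data['Number_of_students'], 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
uc_data.groupby(uc_quartiles)['Undergraduates'].mean()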
Grouping and summarizing: analyze and graph your data¶
First let’s read in some more complex data.
What’s going on with this data?
# pokemon = pd.read_csv("../Datasets/Pokemon.csv")
# Use this code in class
pokemon = pd.read_csv("https://raw.githubusercontent.com/UCSD-CSS-002/ucsd-css-002.github.io/master/datasets/Pokemon.csv")
pokemon
| # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
795 | 719 | Diancie | Rock | Fairy | 600 | 50 | 100 | 150 | 100 | 150 | 50 | 6 | True |
796 | 719 | DiancieMega Diancie | Rock | Fairy | 700 | 50 | 160 | 110 | 160 | 110 | 110 | 6 | True |
797 | 720 | HoopaHoopa Confined | Psychic | Ghost | 600 | 80 | 110 | 60 | 150 | 130 | 70 | 6 | True |
798 | 720 | HoopaHoopa Unbound | Psychic | Dark | 680 | 80 | 160 | 60 | 170 | 130 | 80 | 6 | True |
799 | 721 | Volcanion | Fire | Water | 600 | 80 | 110 | 120 | 130 | 90 | 70 | 6 | True |
800 rows × 13 columns
Let’s say we want to know which `Type 1` group has the most `HP` on average. [any guesses?]
# First, what are we dealing with here?
pokemon['Type 1'].value_counts()
Water 112
Normal 98
Grass 70
Bug 69
Psychic 57
Fire 52
Rock 44
Electric 44
Ghost 32
Ground 32
Dragon 32
Dark 31
Poison 28
Steel 27
Fighting 27
Ice 24
Fairy 17
Flying 4
Name: Type 1, dtype: int64
One solution:

1. Look at each individual `Type 1` value in our data
2. Then, one by one, filter the data to match each unique `Type 1` value
3. Then, compute the average `HP` in that filtered data
4. Then save it somewhere else to keep track of
5. Then see which one is largest
This seems pretty tough…
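To see just how tough, here’s a rough sketch of that manual approach (the loop variables are my own names):

# Sketch of the manual approach: loop over each unique Type 1 value,
# filter to that type, average its HP, and collect the results by hand
hp_by_type = {}
for t in pokemon['Type 1'].unique():
    subset = pokemon[pokemon['Type 1'] == t] # filter to one type
    hp_by_type[t] = subset['HP'].mean() # average HP for that type
biggest = max(hp_by_type, key=hp_by_type.get) # type with the largest mean HP

As we’ll see, `groupby` collapses all of this into a single line.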
Let’s start with something simpler: how many of each kind of `Type 1` pokemon are there?
pokemon.groupby(['Type 1']).size().reset_index() # the size column is unnamed, so it shows up as "0" below
| Type 1 | 0 |
---|---|---|
0 | Bug | 69 |
1 | Dark | 31 |
2 | Dragon | 32 |
3 | Electric | 44 |
4 | Fairy | 17 |
5 | Fighting | 27 |
6 | Fire | 52 |
7 | Flying | 4 |
8 | Ghost | 32 |
9 | Grass | 70 |
10 | Ground | 32 |
11 | Ice | 24 |
12 | Normal | 98 |
13 | Poison | 28 |
14 | Psychic | 57 |
15 | Rock | 44 |
16 | Steel | 27 |
17 | Water | 112 |
Above, we just asked pandas to tell us the value of `size()` applied to each unique group of `Type 1` pokemon.

Can we do the same thing but for more complex operations than `size()`? You bet!
pokemon.groupby(['Type 1']).agg( # .agg is our friend here!
    mean_hp = ('HP', np.mean) # this defines a new statistic over each grouping: apply `np.mean` to the `HP` column (the string 'mean' works here too)
).reset_index()
| Type 1 | mean_hp |
---|---|---|
0 | Bug | 56.884058 |
1 | Dark | 66.806452 |
2 | Dragon | 83.312500 |
3 | Electric | 59.795455 |
4 | Fairy | 74.117647 |
5 | Fighting | 69.851852 |
6 | Fire | 69.903846 |
7 | Flying | 70.750000 |
8 | Ghost | 64.437500 |
9 | Grass | 67.271429 |
10 | Ground | 73.781250 |
11 | Ice | 72.000000 |
12 | Normal | 77.275510 |
13 | Poison | 67.250000 |
14 | Psychic | 70.631579 |
15 | Rock | 65.363636 |
16 | Steel | 65.222222 |
17 | Water | 72.062500 |
We don’t need to stop there. We can group by multiple variables and add multiple metrics!
type1_hp_summary = pokemon.groupby(
    ['Type 1', 'Legendary'] # Note we're now grouping by each combination of Type 1 and Legendary
).agg(
    mean_hp = ('HP', np.mean), # And we're compiling multiple statistics here
    min_hp = ('HP', np.min),
    max_hp = ('HP', np.max)
).reset_index()
type1_hp_summary
| Type 1 | Legendary | mean_hp | min_hp | max_hp |
---|---|---|---|---|---|
0 | Bug | False | 56.884058 | 1 | 86 |
1 | Dark | False | 64.655172 | 35 | 110 |
2 | Dark | True | 98.000000 | 70 | 126 |
3 | Dragon | False | 72.650000 | 41 | 108 |
4 | Dragon | True | 101.083333 | 80 | 125 |
5 | Electric | False | 57.325000 | 20 | 90 |
6 | Electric | True | 84.500000 | 79 | 90 |
7 | Fairy | False | 70.875000 | 35 | 101 |
8 | Fairy | True | 126.000000 | 126 | 126 |
9 | Fighting | False | 69.851852 | 30 | 144 |
10 | Fire | False | 67.085106 | 38 | 110 |
11 | Fire | True | 96.400000 | 80 | 115 |
12 | Flying | False | 62.500000 | 40 | 85 |
13 | Flying | True | 79.000000 | 79 | 79 |
14 | Ghost | False | 58.733333 | 20 | 150 |
15 | Ghost | True | 150.000000 | 150 | 150 |
16 | Grass | False | 65.940299 | 30 | 123 |
17 | Grass | True | 97.000000 | 91 | 100 |
18 | Ground | False | 70.821429 | 10 | 115 |
19 | Ground | True | 94.500000 | 89 | 100 |
20 | Ice | False | 70.818182 | 36 | 110 |
21 | Ice | True | 85.000000 | 80 | 90 |
22 | Normal | False | 76.489583 | 30 | 255 |
23 | Normal | True | 115.000000 | 110 | 120 |
24 | Poison | False | 67.250000 | 35 | 105 |
25 | Psychic | False | 67.720930 | 20 | 190 |
26 | Psychic | True | 79.571429 | 50 | 106 |
27 | Rock | False | 65.125000 | 30 | 123 |
28 | Rock | True | 67.750000 | 50 | 91 |
29 | Steel | False | 60.434783 | 40 | 80 |
30 | Steel | True | 92.750000 | 80 | 100 |
31 | Water | False | 71.120370 | 20 | 170 |
32 | Water | True | 97.500000 | 90 | 100 |
Now we can apply the filtering we discussed above. For example:
type1_hp_summary.nlargest(5, 'mean_hp')
| Type 1 | Legendary | mean_hp | min_hp | max_hp |
---|---|---|---|---|---|
15 | Ghost | True | 150.000000 | 150 | 150 |
8 | Fairy | True | 126.000000 | 126 | 126 |
23 | Normal | True | 115.000000 | 110 | 120 |
4 | Dragon | True | 101.083333 | 80 | 125 |
2 | Dark | True | 98.000000 | 70 | 126 |
Let’s practice!¶
In each `Generation`, how many different `Type 1` and `Type 2` types are there? (Did they change across generations?)

[HINT: use the `nunique` summary function]
# Write here
# SOLUTION
# pokemon.groupby('Generation').agg(
#     type1_types = ('Type 1', 'nunique'),
#     type2_types = ('Type 2', 'nunique')
# ).reset_index()
Make a new column called `Composite_force` that’s the average of each pokemon’s `Attack`, `Defense`, and `Speed` values.

Next, for each `Type 1` type, what is the maximum of this new column?
# Write here
# SOLUTION
# pokemon = pokemon.assign(
#     Composite_force = (pokemon['Attack'] + pokemon['Defense'] + pokemon['Speed'])/3
# )
# max_force = pokemon.groupby(['Type 1']).agg( # group by Type 1 (not Generation), per the prompt
#     max_force = ('Composite_force', np.max)
# ).reset_index()