Lecture 6 (4/8/22)¶
Last time we covered:
- numpy continuation
- pandas basics

Today’s agenda:
- Processing data with pandas
import pandas as pd
import numpy as np
Creating new dataframes¶
It’s worth noting: there are many different ways to do operations like dataframe creation in pandas.
The approach below is probably the most intuitive, but you may come across others (one alternative is sketched after the table below)!
# First let's initialize some data
# Typically, we want to think of our data as being lists of the stuff in each column
uc_schools = np.array(["Berkeley", "San Diego", "Los Angeles", "Santa Barbara", "San Francisco",
                       "Irvine", "Davis", "Riverside", "Santa Cruz", "Merced"])
uc_founded = np.array([1868, 1960, 1919, 1909, 1864, 1965, 1905, 1954, 1965, 2005])
uc_students = np.array([45057, 42875, 45742, 26314, 3132, 35220, 40031, 25548, 19161, 8847])
# Now let's put it in a dataframe!
uc_data = pd.DataFrame({  # start by declaring a new data frame
    "School": uc_schools,  # each column name is a dictionary key and the list of column data points is the value
    "Date founded": uc_founded,
    "Number of students": uc_students
})
uc_data
| | School | Date founded | Number of students |
|---|---|---|---|
0 | Berkeley | 1868 | 45057 |
1 | San Diego | 1960 | 42875 |
2 | Los Angeles | 1919 | 45742 |
3 | Santa Barbara | 1909 | 26314 |
4 | San Francisco | 1864 | 3132 |
5 | Irvine | 1965 | 35220 |
6 | Davis | 1905 | 40031 |
7 | Riverside | 1954 | 25548 |
8 | Santa Cruz | 1965 | 19161 |
9 | Merced | 2005 | 8847 |
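One of those other ways, for reference: building the same dataframe from a list of rows instead of a dictionary of columns. This is a minimal sketch (`uc_data_alt` is just an illustrative name, and only two rows are shown):
# Alternative: one inner list per row, with column names given separately
uc_data_alt = pd.DataFrame(
    [["Berkeley", 1868, 45057], ["San Diego", 1960, 42875]],
    columns=["School", "Date founded", "Number of students"]
)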
We will likely find ourselves reading in data more often than creating new dataframes, but it’s occasionally useful and good to know how to do it!
Adding data to existing dataframes¶
Adding columns (common!)¶
We will often find ourselves adding columns to a dataframe, e.g., creating columns that represent our existing data in a new way.
# One option: when we have new data to add
uc_undergrads = np.array([31814, 33343, 31543, 23349, 0, 30222, 31162, 22055, 17207, 8151])
uc_data['Undergraduates'] = uc_undergrads # Use bracket syntax to declare a new column
uc_data
| | School | Date founded | Number of students | Undergraduates |
|---|---|---|---|---|
0 | Berkeley | 1868 | 45057 | 31814 |
1 | San Diego | 1960 | 42875 | 33343 |
2 | Los Angeles | 1919 | 45742 | 31543 |
3 | Santa Barbara | 1909 | 26314 | 23349 |
4 | San Francisco | 1864 | 3132 | 0 |
5 | Irvine | 1965 | 35220 | 30222 |
6 | Davis | 1905 | 40031 | 31162 |
7 | Riverside | 1954 | 25548 | 22055 |
8 | Santa Cruz | 1965 | 19161 | 17207 |
9 | Merced | 2005 | 8847 | 8151 |
# A second option: when we want to process existing data and form a new column
uc_data = uc_data.assign(Undergraduate_pct = uc_data['Undergraduates'] / uc_data['Number of students'])
uc_data # Note what happens if we don't do the re-assignment above
# For this sort of processing, we can also use similar syntax to the above (try it yourself!)
# but `assign` gives us some additional flexibility
| | School | Date founded | Number of students | Undergraduates | Undergraduate_pct |
|---|---|---|---|---|---|
0 | Berkeley | 1868 | 45057 | 31814 | 0.706083 |
1 | San Diego | 1960 | 42875 | 33343 | 0.777679 |
2 | Los Angeles | 1919 | 45742 | 31543 | 0.689585 |
3 | Santa Barbara | 1909 | 26314 | 23349 | 0.887322 |
4 | San Francisco | 1864 | 3132 | 0 | 0.000000 |
5 | Irvine | 1965 | 35220 | 30222 | 0.858092 |
6 | Davis | 1905 | 40031 | 31162 | 0.778447 |
7 | Riverside | 1954 | 25548 | 22055 | 0.863277 |
8 | Santa Cruz | 1965 | 19161 | 17207 | 0.898022 |
9 | Merced | 2005 | 8847 | 8151 | 0.921329 |
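For reference, the “similar syntax” mentioned in the code comment above is plain bracket assignment, which modifies `uc_data` in place (whereas `assign` returns a new dataframe, which is why the re-assignment was needed). A minimal sketch:
# Equivalent bracket syntax: adds the column directly to uc_data, no re-assignment needed
uc_data['Undergraduate_pct'] = uc_data['Undergraduates'] / uc_data['Number of students']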
Adding rows¶
# Let's say the UC system decides it's long overdue to build a campus in Lake Tahoe near the Nevada border.
# We want to add some projected data
# First, we make our new row into a dataframe of its own, with matching columns
uc_tahoe = pd.DataFrame({
    "School": ["Lake Tahoe"],
    "Date founded": [2022]  # Note we don't need to know all the column info here
})
uc_tahoe
| | School | Date founded |
|---|---|---|
0 | Lake Tahoe | 2022 |
# Next, we use `concat` to add it to the previous dataframe
uc_data = pd.concat([uc_data, uc_tahoe])
uc_data
| | School | Date founded | Number of students | Undergraduates | Undergraduate_pct |
|---|---|---|---|---|---|
0 | Berkeley | 1868 | 45057.0 | 31814.0 | 0.706083 |
1 | San Diego | 1960 | 42875.0 | 33343.0 | 0.777679 |
2 | Los Angeles | 1919 | 45742.0 | 31543.0 | 0.689585 |
3 | Santa Barbara | 1909 | 26314.0 | 23349.0 | 0.887322 |
4 | San Francisco | 1864 | 3132.0 | 0.0 | 0.000000 |
5 | Irvine | 1965 | 35220.0 | 30222.0 | 0.858092 |
6 | Davis | 1905 | 40031.0 | 31162.0 | 0.778447 |
7 | Riverside | 1954 | 25548.0 | 22055.0 | 0.863277 |
8 | Santa Cruz | 1965 | 19161.0 | 17207.0 | 0.898022 |
9 | Merced | 2005 | 8847.0 | 8151.0 | 0.921329 |
0 | Lake Tahoe | 2022 | NaN | NaN | NaN |
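Note the repeated index 0 at the bottom: `concat` keeps each piece’s original row labels. If we wanted a fresh 0-to-10 index instead, `concat` accepts an `ignore_index` flag (not run here, so later outputs keep the old labels):
# uc_data = pd.concat([uc_data, uc_tahoe], ignore_index=True)  # renumbers rows 0..10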
Processing data: filtering, grouping, summarizing¶
First, note that some operations can be done on our dataframe without having to use filtering or grouping.
uc_data = uc_data.sort_values('Date founded', ascending = True)
uc_data
# Note this doesn't change index at far left!
| | School | Date founded | Number of students | Undergraduates | Undergraduate_pct |
|---|---|---|---|---|---|
4 | San Francisco | 1864 | 3132.0 | 0.0 | 0.000000 |
0 | Berkeley | 1868 | 45057.0 | 31814.0 | 0.706083 |
6 | Davis | 1905 | 40031.0 | 31162.0 | 0.778447 |
3 | Santa Barbara | 1909 | 26314.0 | 23349.0 | 0.887322 |
2 | Los Angeles | 1919 | 45742.0 | 31543.0 | 0.689585 |
7 | Riverside | 1954 | 25548.0 | 22055.0 | 0.863277 |
1 | San Diego | 1960 | 42875.0 | 33343.0 | 0.777679 |
5 | Irvine | 1965 | 35220.0 | 30222.0 | 0.858092 |
8 | Santa Cruz | 1965 | 19161.0 | 17207.0 | 0.898022 |
9 | Merced | 2005 | 8847.0 | 8151.0 | 0.921329 |
0 | Lake Tahoe | 2022 | NaN | NaN | NaN |
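If we did want the index to follow the new sort order, `reset_index` would renumber it (again not run here, so the outputs below keep the original labels):
# uc_data = uc_data.sort_values('Date founded').reset_index(drop=True)  # drop=True discards the old index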
print(uc_data.max())
print(uc_data['Number of students'].min())
School Santa Cruz
Date founded 2022
Number of students 45742
Undergraduates 33343
Undergraduate_pct 0.921329
dtype: object
3132.0
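Relatedly, if we want the whole row where a column hits its extreme rather than just the value, one option is `idxmax` (which returns the row label of the maximum) combined with `loc`. A sketch:
# Fetch the full row for the school with the most students
uc_data.loc[uc_data['Number of students'].idxmax()]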
Filtering: fetch rows that meet certain conditions¶
Filtering operations given by pandas¶
We can do some kinds of filtering with built-in operations like `nlargest`.
top_3_students = uc_data.nlargest(3, 'Number of students')
top_3_students
| | School | Date founded | Number of students | Undergraduates | Undergraduate_pct |
|---|---|---|---|---|---|
2 | Los Angeles | 1919 | 45742.0 | 31543.0 | 0.689585 |
0 | Berkeley | 1868 | 45057.0 | 31814.0 | 0.706083 |
1 | San Diego | 1960 | 42875.0 | 33343.0 | 0.777679 |
Logical filtering¶
Most often, we filter by setting logical criteria over certain column values.
# Get quantiles of undergraduate sizes
q = uc_data['Undergraduates'].quantile([0.25, 0.75])
q[0.75]
# Fetch uc_data rows where uc_data['Undergraduates'] is > the 75th percentile value above
large_undergrads = uc_data[uc_data['Undergraduates'] > q[0.75]]
large_undergrads
| | School | Date founded | Number of students | Undergraduates | Undergraduate_pct |
|---|---|---|---|---|---|
0 | Berkeley | 1868 | 45057.0 | 31814.0 | 0.706083 |
2 | Los Angeles | 1919 | 45742.0 | 31543.0 | 0.689585 |
1 | San Diego | 1960 | 42875.0 | 33343.0 | 0.777679 |
How does this work?
uc_data['Undergraduates'] > q[0.75]
4 False
0 True
6 False
3 False
2 True
7 False
1 True
5 False
8 False
9 False
0 False
Name: Undergraduates, dtype: bool
So in the code above, we’re basically saying “fetch the rows where this condition evaluates to `True`”.
Note this can get much more complicated…
# What's going on here?
max_undergrad = uc_data[uc_data['Undergraduates'] == uc_data['Undergraduates'].max()]
max_undergrad
| | School | Date founded | Number of students | Undergraduates | Undergraduate_pct |
|---|---|---|---|---|---|
1 | San Diego | 1960 | 42875.0 | 33343.0 | 0.777679 |
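Conditions can also be combined. pandas uses the element-wise operators `&` (and) and `|` (or), with parentheses around each comparison; a sketch:
# Schools founded after 1950 AND with more than 30,000 students
uc_data[(uc_data['Date founded'] > 1950) & (uc_data['Number of students'] > 30000)]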
Why is this useful?¶
Let’s ask: what is the average number of undergraduates in the schools with the most students overall?
q = uc_data['Number of students'].quantile([0.80])
# What's going on here?
uc_data[uc_data['Number of students'] >= q[0.8]]['Undergraduates'].mean()
31678.5
Here, we applied some statistics to a filtered subset of our data.
Imagine we wanted the same average for each quartile of student numbers, instead of just the top 20%.
This kind of thing is pretty common: this is where grouping comes in.
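As a preview, one way to do that combines `pd.qcut` (which bins a column into quantiles) with the grouping tools introduced below. A sketch:
# Bin schools into quartiles of total students, then average undergraduates within each bin
quartile = pd.qcut(uc_data['Number of students'], 4, labels=['Q1', 'Q2', 'Q3', 'Q4'])
uc_data.groupby(quartile)['Undergraduates'].mean()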
Grouping and summarizing: analyze and graph your data¶
First let’s read in some more complex data.
What’s going on with this data?
# pokemon = pd.read_csv("../Datasets/Pokemon.csv")
# Use this code in class
pokemon = pd.read_csv("https://raw.githubusercontent.com/UCSD-CSS-002/ucsd-css-002.github.io/master/datasets/Pokemon.csv")
pokemon
| | # | Name | Type 1 | Type 2 | Total | HP | Attack | Defense | Sp. Atk | Sp. Def | Speed | Generation | Legendary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | Bulbasaur | Grass | Poison | 318 | 45 | 49 | 49 | 65 | 65 | 45 | 1 | False |
1 | 2 | Ivysaur | Grass | Poison | 405 | 60 | 62 | 63 | 80 | 80 | 60 | 1 | False |
2 | 3 | Venusaur | Grass | Poison | 525 | 80 | 82 | 83 | 100 | 100 | 80 | 1 | False |
3 | 3 | VenusaurMega Venusaur | Grass | Poison | 625 | 80 | 100 | 123 | 122 | 120 | 80 | 1 | False |
4 | 4 | Charmander | Fire | NaN | 309 | 39 | 52 | 43 | 60 | 50 | 65 | 1 | False |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
795 | 719 | Diancie | Rock | Fairy | 600 | 50 | 100 | 150 | 100 | 150 | 50 | 6 | True |
796 | 719 | DiancieMega Diancie | Rock | Fairy | 700 | 50 | 160 | 110 | 160 | 110 | 110 | 6 | True |
797 | 720 | HoopaHoopa Confined | Psychic | Ghost | 600 | 80 | 110 | 60 | 150 | 130 | 70 | 6 | True |
798 | 720 | HoopaHoopa Unbound | Psychic | Dark | 680 | 80 | 160 | 60 | 170 | 130 | 80 | 6 | True |
799 | 721 | Volcanion | Fire | Water | 600 | 80 | 110 | 120 | 130 | 90 | 70 | 6 | True |
800 rows × 13 columns
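A couple of quick inspection calls help answer “what’s going on with this data?”; a sketch:
# Quick structural checks before analysis
pokemon.dtypes   # the type of each column
pokemon.head(3)  # peek at the first few rows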
Let’s say we want to know which `Type 1` group has the most `HP` on average.
[any guesses?]
One solution:
1. Look at each individual `Type 1` value in our data
2. Then, one by one, filter the data to match each unique `Type 1` value
3. Then, compute the average `HP` in that filtered data
4. Then save it somewhere else to keep track of it
5. Then see which one is largest
This seems pretty tough…
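For concreteness, that manual recipe might look something like the following sketch; the grouping tools below replace all of it.
# Manual approach: loop over each unique Type 1 value, filter, average, and track the results
mean_hp_by_type = {}
for t in pokemon['Type 1'].unique():
    subset = pokemon[pokemon['Type 1'] == t]   # filter to this type
    mean_hp_by_type[t] = subset['HP'].mean()   # compute and store the average HP
max(mean_hp_by_type, key=mean_hp_by_type.get)  # which type has the largest average?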
Let’s start with something simpler: how many of each kind of `Type 1` pokemon are there?
pokemon.groupby(['Type 1']).size().reset_index()
| | Type 1 | 0 |
|---|---|---|
0 | Bug | 69 |
1 | Dark | 31 |
2 | Dragon | 32 |
3 | Electric | 44 |
4 | Fairy | 17 |
5 | Fighting | 27 |
6 | Fire | 52 |
7 | Flying | 4 |
8 | Ghost | 32 |
9 | Grass | 70 |
10 | Ground | 32 |
11 | Ice | 24 |
12 | Normal | 98 |
13 | Poison | 28 |
14 | Psychic | 57 |
15 | Rock | 44 |
16 | Steel | 27 |
17 | Water | 112 |
Above, we just asked pandas to tell us the value of `size()` applied to each unique group of `Type 1` pokemon (the unnamed `0` column holds the counts).
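(A common shortcut for simple per-category counts, for reference:)
# Same counts as a Series, sorted largest-first rather than alphabetically
pokemon['Type 1'].value_counts()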
Can we do the same thing but for more complex operations than `size()`? You bet!
pokemon.groupby(['Type 1']).agg(  # .agg is our friend here!
    mean_hp = ('HP', np.mean)  # this defines a new statistic over each grouping. Apply `np.mean` to the `HP` column
).reset_index()
| | Type 1 | mean_hp |
|---|---|---|
0 | Bug | 56.884058 |
1 | Dark | 66.806452 |
2 | Dragon | 83.312500 |
3 | Electric | 59.795455 |
4 | Fairy | 74.117647 |
5 | Fighting | 69.851852 |
6 | Fire | 69.903846 |
7 | Flying | 70.750000 |
8 | Ghost | 64.437500 |
9 | Grass | 67.271429 |
10 | Ground | 73.781250 |
11 | Ice | 72.000000 |
12 | Normal | 77.275510 |
13 | Poison | 67.250000 |
14 | Psychic | 70.631579 |
15 | Rock | 65.363636 |
16 | Steel | 65.222222 |
17 | Water | 72.062500 |
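This table already answers our opening question; chaining `nlargest` onto the summary pulls out the winner directly (a sketch):
# Which Type 1 group has the highest average HP? (Dragon, per the table above)
pokemon.groupby(['Type 1']).agg(mean_hp = ('HP', np.mean)).reset_index().nlargest(1, 'mean_hp')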
We don’t need to stop there. We can group by multiple variables and add multiple metrics!
type1_hp_summary = pokemon.groupby(
    ['Type 1', 'Legendary']  # Note we're now grouping by each combination of Type 1 and Legendary
).agg(
    mean_hp = ('HP', np.mean),  # And we're compiling multiple statistics here
    min_hp = ('HP', np.min),
    max_hp = ('HP', np.max)
).reset_index()
type1_hp_summary
| | Type 1 | Legendary | mean_hp | min_hp | max_hp |
|---|---|---|---|---|---|
0 | Bug | False | 56.884058 | 1 | 86 |
1 | Dark | False | 64.655172 | 35 | 110 |
2 | Dark | True | 98.000000 | 70 | 126 |
3 | Dragon | False | 72.650000 | 41 | 108 |
4 | Dragon | True | 101.083333 | 80 | 125 |
5 | Electric | False | 57.325000 | 20 | 90 |
6 | Electric | True | 84.500000 | 79 | 90 |
7 | Fairy | False | 70.875000 | 35 | 101 |
8 | Fairy | True | 126.000000 | 126 | 126 |
9 | Fighting | False | 69.851852 | 30 | 144 |
10 | Fire | False | 67.085106 | 38 | 110 |
11 | Fire | True | 96.400000 | 80 | 115 |
12 | Flying | False | 62.500000 | 40 | 85 |
13 | Flying | True | 79.000000 | 79 | 79 |
14 | Ghost | False | 58.733333 | 20 | 150 |
15 | Ghost | True | 150.000000 | 150 | 150 |
16 | Grass | False | 65.940299 | 30 | 123 |
17 | Grass | True | 97.000000 | 91 | 100 |
18 | Ground | False | 70.821429 | 10 | 115 |
19 | Ground | True | 94.500000 | 89 | 100 |
20 | Ice | False | 70.818182 | 36 | 110 |
21 | Ice | True | 85.000000 | 80 | 90 |
22 | Normal | False | 76.489583 | 30 | 255 |
23 | Normal | True | 115.000000 | 110 | 120 |
24 | Poison | False | 67.250000 | 35 | 105 |
25 | Psychic | False | 67.720930 | 20 | 190 |
26 | Psychic | True | 79.571429 | 50 | 106 |
27 | Rock | False | 65.125000 | 30 | 123 |
28 | Rock | True | 67.750000 | 50 | 91 |
29 | Steel | False | 60.434783 | 40 | 80 |
30 | Steel | True | 92.750000 | 80 | 100 |
31 | Water | False | 71.120370 | 20 | 170 |
32 | Water | True | 97.500000 | 90 | 100 |
Now we can apply the filtering we discussed above. For example:
type1_hp_summary.nlargest(5, 'mean_hp')
| | Type 1 | Legendary | mean_hp | min_hp | max_hp |
|---|---|---|---|---|---|
15 | Ghost | True | 150.000000 | 150 | 150 |
8 | Fairy | True | 126.000000 | 126 | 126 |
23 | Normal | True | 115.000000 | 110 | 120 |
4 | Dragon | True | 101.083333 | 80 | 125 |
2 | Dark | True | 98.000000 | 70 | 126 |
Let’s practice!¶
In each `Generation`, how many different `Type 1` and `Type 2` types are there?

[HINT: use the `nunique` summary function]
# Write here
# SOLUTION
# pokemon.groupby('Generation').agg(
#     type1_types = ('Type 1', 'nunique'),
#     type2_types = ('Type 2', 'nunique')
# ).reset_index()
Make a new column called `Composite_force` that’s the average of each pokemon’s `Attack`, `Defense`, and `Speed` values.
Next, for each `Type 1` type, what is the maximum of this new column?
# Write here
# SOLUTION
# pokemon = pokemon.assign(
#     Composite_force = (pokemon['Attack'] + pokemon['Defense'] + pokemon['Speed'])/3
# )
# max_force = pokemon.groupby(['Type 1']).agg(  # group by Type 1, per the question above
#     max_force = ('Composite_force', np.max)
# ).reset_index()
# max_force