Lecture 5 (4/6/22)¶
Last time we covered:
Datahub
Functions
numpy basics
Today’s agenda:
numpy wrap-up
pandas basics
numpy wrap-up¶
import numpy as np
Recall from last time…
What is it?
numpy is primarily:
A class of array objects (the ndarray)
A set of high-performance functions that can be executed over those arrays
# 1. The ndarray
boring_list = [1, 2, 3, 4, 5, 6] # traditional python list
cool_array = np.array([1, 2, 3, 4, 5, 6]) # numpy ndarray
cool_array
array([1, 2, 3, 4, 5, 6])
# 2. The functions
y = np.square(cool_array)
y
# Note this is much simpler than what we would do to perform the equivalent operation on `boring_list` above
array([ 1, 4, 9, 16, 25, 36])
Why use it?
This represents an improvement over traditional python list operations for several reasons:
It streamlines our code
It’s way faster
# The code streamlining:
y = []
for value in boring_list: # traditional python often requires using `for` loops to execute operations on lists
    y.append(1/value)
y = 1/cool_array # numpy lets you apply intuitive operations to whole ndarrays at once
y
array([1. , 0.5 , 0.33333333, 0.25 , 0.2 ,
0.16666667])
# The speed:
import random
x = [random.random() for _ in range(10000)]
array_x = np.asarray(x)
%timeit y = [val**2 for val in x] # traditional python list-based approach
%timeit array_y = np.square(array_x) # numpy
954 µs ± 19.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
6.39 µs ± 60.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
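The `%timeit` magic above only works inside IPython or Jupyter. Here is a minimal sketch of the same comparison as a plain script, assuming the standard-library `timeit` module. The variable names and the repetition count are illustrative, and exact timings will differ from machine to machine.

```python
import random
import timeit

import numpy as np

x = [random.random() for _ in range(10000)]
array_x = np.asarray(x)

# Time 100 repetitions of each approach (the repetition count is arbitrary)
list_time = timeit.timeit(lambda: [val**2 for val in x], number=100)
numpy_time = timeit.timeit(lambda: np.square(array_x), number=100)

print(f"list comprehension: {list_time:.4f} s")
print(f"numpy square:       {numpy_time:.4f} s")
```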
Combining arrays
We can combine numpy arrays into multi-dimensional matrices and perform useful operations on those matrices.
a = np.random.random((3, 2)) # can initialize matrices with 0s, 1s, random numbers, and other cool stuff
print(a)
[[0.46687107 0.9904743 ]
[0.68987756 0.628968 ]
[0.94677363 0.63716925]]
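The comment above mentions initializing matrices with 0s, 1s, and random numbers. A short sketch of those three options follows; the seeded generator is an addition here so the random values repeat across runs, and the variable names are illustrative.

```python
import numpy as np

zeros = np.zeros((3, 2))          # 3 rows, 2 columns, all 0.0
ones = np.ones((3, 2))            # same shape, all 1.0
rng = np.random.default_rng(0)    # seeded generator (an assumption, for repeatability)
randoms = rng.random((3, 2))      # uniform values in [0, 1)

print(zeros)
print(ones)
print(randoms)
```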
# We can access individual rows or columns and individual elements using bracket [row, col] notation
a[0,]
a[:,1]
# a[1,1]
# note each of these rows/columns is itself a numpy array
type(a[:,0])
numpy.ndarray
# print(np.max(a)) # maximum operation over the whole matrix
# print(np.max(a, axis = 1)) # maximum operation can specify an "axis": 0 (columns) or 1 (rows)
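Because `a` above is random, the results change on every run. Here is a small sketch of the same indexing and `axis` operations on a fixed matrix (the matrix `m` is made up for illustration), so the outputs can be checked by hand.

```python
import numpy as np

m = np.array([[1, 8],
              [5, 2],
              [3, 4]])

print(m[0, :])            # first row: [1 8]
print(m[:, 1])            # second column: [8 2 4]
print(m[1, 1])            # single element: 2
print(np.max(m))          # maximum over the whole matrix: 8
print(np.max(m, axis=0))  # maximum of each column: [5 8]
print(np.max(m, axis=1))  # maximum of each row: [8 5 4]
```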
We’ll come across numpy at various points throughout the quarter but this should be enough to get us on our feet.
You can learn more about numpy and follow the beginner’s guide on their website here.
For now, it’s time to switch to a new tool in our computational social science toolkit: pandas.
Pandas!¶
First, what is pandas?¶
pandas is a python library for reading, writing, and interacting with tabular data.
This is convenient because a lot of data is tabular data. It’s kind of like Excel for python (but way cooler…).
There’s a really awesome cheat sheet here and a series of handy tutorials here.
import pandas as pd
In this class, we will use pandas as the basis for reading, processing, and understanding our data.
Let’s get started!
Reading data with pandas¶
# mcd = pd.read_csv("../Datasets/mcd.csv")
mcd = pd.read_csv("https://raw.githubusercontent.com/UCSD-CSS-002/ucsd-css-002.github.io/master/datasets/mcd-menu.csv")
Note: there are lots of other ways to read in data, including directly from hosted links online.
pd.read_csv?
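To try `read_csv` without a file or a network connection, here is a minimal sketch that reads CSV text from memory. The three rows are a hand-typed excerpt of the menu data (an assumption about how the raw file is laid out, with only three of its columns), and `sample` is an illustrative name.

```python
import io

import pandas as pd

# Hypothetical three-row excerpt with a subset of the menu columns
csv_text = """Category,Item,Calories
Breakfast,Egg McMuffin,300
Breakfast,Egg White Delight,250
Breakfast,Sausage McMuffin,370
"""

# read_csv accepts a local path, a URL, or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)            # (3, 3)
print(list(sample.columns))    # ['Category', 'Item', 'Calories']
```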
What is in this dataset?¶
mcd
| | Category | Item | Serving Size | Calories | Calories from Fat | Total Fat | Total Fat (% Daily Value) | Saturated Fat | Saturated Fat (% Daily Value) | Trans Fat | ... | Carbohydrates | Carbohydrates (% Daily Value) | Dietary Fiber | Dietary Fiber (% Daily Value) | Sugars | Protein | Vitamin A (% Daily Value) | Vitamin C (% Daily Value) | Calcium (% Daily Value) | Iron (% Daily Value) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Breakfast | Egg McMuffin | 4.8 oz (136 g) | 300 | 120 | 13.0 | 20 | 5.0 | 25 | 0.0 | ... | 31 | 10 | 4 | 17 | 3 | 17 | 10 | 0 | 25 | 15 |
1 | Breakfast | Egg White Delight | 4.8 oz (135 g) | 250 | 70 | 8.0 | 12 | 3.0 | 15 | 0.0 | ... | 30 | 10 | 4 | 17 | 3 | 18 | 6 | 0 | 25 | 8 |
2 | Breakfast | Sausage McMuffin | 3.9 oz (111 g) | 370 | 200 | 23.0 | 35 | 8.0 | 42 | 0.0 | ... | 29 | 10 | 4 | 17 | 2 | 14 | 8 | 0 | 25 | 10 |
3 | Breakfast | Sausage McMuffin with Egg | 5.7 oz (161 g) | 450 | 250 | 28.0 | 43 | 10.0 | 52 | 0.0 | ... | 30 | 10 | 4 | 17 | 2 | 21 | 15 | 0 | 30 | 15 |
4 | Breakfast | Sausage McMuffin with Egg Whites | 5.7 oz (161 g) | 400 | 210 | 23.0 | 35 | 8.0 | 42 | 0.0 | ... | 30 | 10 | 4 | 17 | 2 | 21 | 6 | 0 | 25 | 10 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
255 | Smoothies & Shakes | McFlurry with Oreo Cookies (Small) | 10.1 oz (285 g) | 510 | 150 | 17.0 | 26 | 9.0 | 44 | 0.5 | ... | 80 | 27 | 1 | 4 | 64 | 12 | 15 | 0 | 40 | 8 |
256 | Smoothies & Shakes | McFlurry with Oreo Cookies (Medium) | 13.4 oz (381 g) | 690 | 200 | 23.0 | 35 | 12.0 | 58 | 1.0 | ... | 106 | 35 | 1 | 5 | 85 | 15 | 20 | 0 | 50 | 10 |
257 | Smoothies & Shakes | McFlurry with Oreo Cookies (Snack) | 6.7 oz (190 g) | 340 | 100 | 11.0 | 17 | 6.0 | 29 | 0.0 | ... | 53 | 18 | 1 | 2 | 43 | 8 | 10 | 0 | 25 | 6 |
258 | Smoothies & Shakes | McFlurry with Reese's Peanut Butter Cups (Medium) | 14.2 oz (403 g) | 810 | 290 | 32.0 | 50 | 15.0 | 76 | 1.0 | ... | 114 | 38 | 2 | 9 | 103 | 21 | 20 | 0 | 60 | 6 |
259 | Smoothies & Shakes | McFlurry with Reese's Peanut Butter Cups (Snack) | 7.1 oz (202 g) | 410 | 150 | 16.0 | 25 | 8.0 | 38 | 0.0 | ... | 57 | 19 | 1 | 5 | 51 | 10 | 10 | 0 | 30 | 4 |
260 rows × 24 columns
Note: printing out the whole dataset is often not the best way to look at it, and sometimes totally infeasible.
pandas offers several very handy tools for peeking at data.
mcd
mcd.shape # number of rows and number of columns
mcd.head() # this is usually enough to get a sense of what's going on in our data
mcd.head(10) # sometimes useful to look at more rows than the default
mcd.tail() # if you're curious what kind of values or responses are at the "end" of your dataset
mcd.columns # helpful when the data has too many columns to preview (as in this data!)
mcd.index # we'll come back to this...
mcd.describe() # note: this isn't all our columns! only numeric ones
mcd.Category.value_counts() # this is the equivalent of `describe` for our categorical variables
Coffee & Tea 95
Breakfast 42
Smoothies & Shakes 28
Beverages 27
Chicken & Fish 27
Beef & Pork 15
Snacks & Sides 13
Desserts 7
Salads 6
Name: Category, dtype: int64
Note: all except the last of these are applied directly to the pandas data frame mcd; the last one is applied to a single column of it (more on this later).
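Here is a self-contained sketch of a few of these peeking tools on a tiny data frame. The six rows are copied from the output above with only three columns kept, and `sample` is an illustrative name rather than part of the lecture code.

```python
import pandas as pd

# Six rows copied from the output above; the full data has 260 rows
sample = pd.DataFrame({
    "Category": ["Breakfast"] * 4 + ["Smoothies & Shakes"] * 2,
    "Item": [
        "Egg McMuffin", "Egg White Delight", "Sausage McMuffin",
        "Sausage McMuffin with Egg",
        "McFlurry with Oreo Cookies (Small)",
        "McFlurry with Oreo Cookies (Medium)",
    ],
    "Calories": [300, 250, 370, 450, 510, 690],
})

print(sample.shape)                        # (6, 3)
print(sample.head(2))                      # first two rows
print(sample.describe())                   # summarizes the numeric column only
print(sample["Category"].value_counts())   # counts per category
```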
What can we do with this data?¶
Basic
Which menu items have the most protein? Calories? Largest serving size?
How many items does McDonald’s offer for each meal (breakfast, lunch, dinner)?
Intermediate
What are the healthiest and least healthy items on the menu?
What meal (breakfast, lunch, dinner, snack) is the most healthy or unhealthy overall?
Advanced
Can we identify how McDonald’s segments the healthy choice preferences of their customers by clustering the profiles of each menu item?
Why pandas?¶
Before we go any further, pause and think about how you would store this data with traditional python data structures: a list of lists? Many separate dictionaries? A menu item class with each attribute and all items in a list?
Think about how we would answer the questions above using traditional python operations over those data structures.
The ways we routinely interact with data require many different kinds of (sometimes complicated) operations and data structures that can support those operations (we’ve already seen some of this above just to look at the data in different ways).
We want the flexibility of code but the structure of tools like Excel to solve these problems.
This is where pandas comes in!
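To make the comparison concrete, here is a rough sketch that answers "which item has the most protein?" twice: once with a list of dictionaries and once with pandas. The four rows come from the table above; `idxmax` is just one of several pandas ways to do this, and the variable names are illustrative.

```python
import pandas as pd

# Four rows taken from the table above, stored as a list of dictionaries
rows = [
    {"Item": "Egg McMuffin", "Calories": 300, "Protein": 17},
    {"Item": "Egg White Delight", "Calories": 250, "Protein": 18},
    {"Item": "Sausage McMuffin", "Calories": 370, "Protein": 14},
    {"Item": "Sausage McMuffin with Egg", "Calories": 450, "Protein": 21},
]

# Traditional python: loop over the rows and track the best one by hand
best = rows[0]
for row in rows:
    if row["Protein"] > best["Protein"]:
        best = row
print(best["Item"])

# pandas: the same question as a single expression
sample = pd.DataFrame(rows)
top_item = sample.loc[sample["Protein"].idxmax(), "Item"]
print(top_item)
```

Both approaches print the same item here; the difference shows up as the questions get more complicated and the hand-written loops multiply.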
How does it work?¶
type(mcd)
pandas.core.frame.DataFrame
Tabular data is stored in pandas as a DataFrame.
A pandas data frame is essentially like a table in Excel and has close counterparts in R, Stata, etc.
It stores data organized by columns, and has some very nifty properties.
# Let's look at the 'Item' column
menu_items = mcd['Item']
menu_items
type(menu_items)
pandas.core.series.Series
Each column in a pandas dataframe is a pandas Series.
A pandas series is a lot like a numpy array, but with one additional property: the index.
menu_items.index
RangeIndex(start=0, stop=260, step=1)
The index holds a label for each row in the series (these labels are usually unique, although pandas does not strictly require it).
You can use the index to fetch individual items in the series.
By default, pandas just uses the row number as the index for the values in each column.
menu_items[2]
'Sausage McMuffin'
In this way, it’s a lot like a normal list or numpy array.
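One caveat: with the default index, the label of each row happens to equal its position, so the two are easy to confuse. Here is a small sketch with a made-up integer index that separates them, using the explicit `.loc` (label) and `.iloc` (position) accessors.

```python
import pandas as pd

# A made-up series whose integer labels do not match the row positions
s = pd.Series(
    ["Egg McMuffin", "Egg White Delight", "Sausage McMuffin"],
    index=[10, 20, 30],
)

print(s.loc[20])    # label-based lookup: 'Egg White Delight'
print(s.iloc[2])    # position-based lookup: 'Sausage McMuffin'
```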
But in pandas, index labels can be values of any hashable type, not just integers.
menu_cals = mcd['Calories'] # Let's fetch the `Calories` column
menu_cals
menu_cals.index # Here's the default index
# Instead, let's use each menu item as an index
menu_cals_item = pd.Series(list(mcd['Calories']), index = menu_items)
menu_cals_item
Item
Egg McMuffin 300
Egg White Delight 250
Sausage McMuffin 370
Sausage McMuffin with Egg 450
Sausage McMuffin with Egg Whites 400
...
McFlurry with Oreo Cookies (Small) 510
McFlurry with Oreo Cookies (Medium) 690
McFlurry with Oreo Cookies (Snack) 340
McFlurry with Reese's Peanut Butter Cups (Medium) 810
McFlurry with Reese's Peanut Butter Cups (Snack) 410
Length: 260, dtype: int64
Now, we can access items in the series using this new index!
menu_cals_item['Egg McMuffin']
300
What does it look like when we can look up array items with strings as keys?
# We can access `index` and `values` just like dictionary keys and values
menu_cals_item.index
menu_cals_item.values
# This functions just like a dictionary in traditional python
menu_cals_lookup = dict()
for i in range(len(menu_items)):
    menu_cals_lookup[menu_items[i]] = menu_cals[i]
menu_cals_lookup
menu_cals_lookup.keys()
menu_cals_lookup.values()
dict_values([300, 250, 370, 450, 400, 430, 460, 520, 410, 470, 430, 480, 510, 570, 460, 520, 410, 470, 540, 460, 400, 420, 550, 500, 620, 570, 670, 740, 800, 640, 690, 1090, 1150, 990, 1050, 350, 520, 300, 150, 460, 290, 260, 530, 520, 600, 610, 540, 750, 240, 290, 430, 720, 380, 440, 430, 430, 500, 510, 350, 670, 510, 610, 450, 750, 590, 430, 360, 480, 430, 360, 630, 480, 610, 450, 670, 520, 540, 380, 190, 280, 470, 940, 1880, 390, 140, 380, 220, 140, 450, 290, 340, 260, 330, 250, 360, 280, 230, 340, 510, 110, 20, 15, 150, 250, 160, 150, 45, 330, 340, 280, 140, 200, 280, 100, 0, 0, 0, 0, 140, 190, 270, 100, 0, 0, 0, 0, 140, 200, 280, 100, 100, 130, 80, 150, 190, 280, 0, 0, 0, 0, 0, 150, 180, 220, 110, 0, 0, 0, 170, 210, 280, 270, 340, 430, 270, 330, 430, 260, 330, 420, 210, 260, 330, 100, 130, 170, 200, 250, 310, 200, 250, 310, 190, 240, 300, 140, 170, 220, 340, 410, 500, 270, 330, 390, 320, 390, 480, 250, 310, 370, 360, 440, 540, 280, 340, 400, 140, 190, 270, 130, 180, 260, 130, 180, 250, 120, 170, 240, 80, 120, 160, 290, 350, 480, 240, 290, 390, 280, 340, 460, 230, 270, 370, 450, 550, 670, 450, 550, 670, 530, 630, 760, 220, 260, 340, 210, 250, 330, 210, 260, 340, 530, 660, 820, 550, 690, 850, 560, 700, 850, 660, 820, 650, 930, 430, 510, 690, 340, 810, 410])
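The loop above can also be skipped: a Series converts to and from a dictionary directly. Here is a minimal sketch on three items from the data, with illustrative variable names.

```python
import pandas as pd

calories = pd.Series(
    [300, 250, 370],
    index=["Egg McMuffin", "Egg White Delight", "Sausage McMuffin"],
)

# Series -> dictionary without an explicit loop
lookup = calories.to_dict()
print(lookup["Egg McMuffin"])      # 300

# Dictionary -> Series: the keys become the index
round_trip = pd.Series(lookup)
print(round_trip["Sausage McMuffin"])   # 370

# Unlike a dictionary, a Series supports arithmetic on all values at once
print(calories / 100)
```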
Take-aways¶
A pandas DataFrame stores tabular data in rows and columns
Each column is a pandas Series object
A pandas Series is similar to a numpy array (fixed dtype) but has an index that allows for rapid and flexible data access
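Here is a quick sketch of the "fixed dtype" point on a two-row data frame built from the values above. Each column reports a single dtype, and its values can be pulled out as a numpy array. The name `sample` is illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "Item": ["Egg McMuffin", "Egg White Delight"],
    "Calories": [300, 250],
    "Total Fat": [13.0, 8.0],
})

print(sample.dtypes)                    # one dtype per column
print(type(sample["Calories"]))         # each column is a Series
print(sample["Calories"].to_numpy())    # the column's values as a numpy array
```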
Accessing data in a data frame¶
In the code above, we used bracket notation dataframe['col'] to access column data.
There are a number of different ways to access columns in a data frame.
Any of these is fine; it's best to pick one and stick with it (and know that the others exist).
# Accessing individual columns
menu_items = mcd['Item']
menu_items = mcd.Item
menu_items = mcd.loc[:, 'Item']
menu_items = mcd.iloc[:,1]
menu_items
0 Egg McMuffin
1 Egg White Delight
2 Sausage McMuffin
3 Sausage McMuffin with Egg
4 Sausage McMuffin with Egg Whites
...
255 McFlurry with Oreo Cookies (Small)
256 McFlurry with Oreo Cookies (Medium)
257 McFlurry with Oreo Cookies (Snack)
258 McFlurry with Reese's Peanut Butter Cups (Medium)
259 McFlurry with Reese's Peanut Butter Cups (Snack)
Name: Item, Length: 260, dtype: object
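Here is a small sketch confirming that the four styles return the same column, plus one practical caveat: dot notation cannot express a column name that contains a space, such as 'Serving Size'. The two-row data frame reuses values from the table above, and the names are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "Category": ["Breakfast", "Breakfast"],
    "Item": ["Egg McMuffin", "Egg White Delight"],
    "Serving Size": ["4.8 oz (136 g)", "4.8 oz (135 g)"],
})

by_bracket = sample["Item"]
by_dot = sample.Item
by_loc = sample.loc[:, "Item"]
by_iloc = sample.iloc[:, 1]

# All four return the same Series
print(by_bracket.equals(by_dot), by_bracket.equals(by_loc), by_bracket.equals(by_iloc))

# A name with a space needs bracket notation (or .loc); sample.Serving Size is a syntax error
serving = sample["Serving Size"]
print(serving.tolist())
```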
Many of these let us access multiple columns at once:
menu_subset = mcd[['Item', 'Category', 'Calories']] # Access specific columns by name
menu_subset = mcd.loc[:, 'Category':'Calories'] # Access a range of columns by name
menu_subset = mcd.iloc[:,1:4] # Access a range of columns by index
menu_subset = mcd.iloc[:,[1, 2, 5]] # Access specific columns by index
menu_subset
| | Item | Serving Size | Total Fat |
---|---|---|---|
0 | Egg McMuffin | 4.8 oz (136 g) | 13.0 |
1 | Egg White Delight | 4.8 oz (135 g) | 8.0 |
2 | Sausage McMuffin | 3.9 oz (111 g) | 23.0 |
3 | Sausage McMuffin with Egg | 5.7 oz (161 g) | 28.0 |
4 | Sausage McMuffin with Egg Whites | 5.7 oz (161 g) | 23.0 |
... | ... | ... | ... |
255 | McFlurry with Oreo Cookies (Small) | 10.1 oz (285 g) | 17.0 |
256 | McFlurry with Oreo Cookies (Medium) | 13.4 oz (381 g) | 23.0 |
257 | McFlurry with Oreo Cookies (Snack) | 6.7 oz (190 g) | 11.0 |
258 | McFlurry with Reese's Peanut Butter Cups (Medium) | 14.2 oz (403 g) | 32.0 |
259 | McFlurry with Reese's Peanut Butter Cups (Snack) | 7.1 oz (202 g) | 16.0 |
260 rows × 3 columns
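One detail worth knowing about the range-based options: a `.loc` slice by name includes its end column, while an `.iloc` slice by position excludes its end position, like ordinary python slicing. Here is a minimal sketch on a one-row data frame with the first five column names from the data; the variable names are illustrative.

```python
import pandas as pd

sample = pd.DataFrame({
    "Category": ["Breakfast"],
    "Item": ["Egg McMuffin"],
    "Serving Size": ["4.8 oz (136 g)"],
    "Calories": [300],
    "Calories from Fat": [120],
})

by_label = sample.loc[:, "Category":"Calories"]   # includes the 'Calories' column
by_position = sample.iloc[:, 1:4]                 # positions 1, 2, 3; position 4 is excluded

print(list(by_label.columns))      # ['Category', 'Item', 'Serving Size', 'Calories']
print(list(by_position.columns))   # ['Item', 'Serving Size', 'Calories']
```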