Lecture 5 (4/6/22)

Last time we covered:

  • Datahub

  • Functions

  • numpy basics

Today’s agenda:

  • numpy wrap-up

  • pandas basics

numpy wrap-up

import numpy as np

Recall from last time…

What is it?

numpy is primarily:

  1. A class of array objects (the ndarray)

  2. A set of high-performance functions that can be executed over those arrays

# 1. The ndarray
boring_list = [1, 2, 3, 4, 5, 6] # traditional python list
cool_array = np.array([1, 2, 3, 4, 5, 6]) # numpy ndarray

cool_array
array([1, 2, 3, 4, 5, 6])
# 2. The functions
y = np.square(cool_array)
y
# Note this is much simpler than what we would do to perform the equivalent operation on `boring_list` above
array([ 1,  4,  9, 16, 25, 36])

Why use it?

This is an improvement over traditional python list operations for two reasons:

  • It streamlines our code

  • It’s way faster

# The code streamlining:

y = []
for value in boring_list: # traditional python often requires using `for` loops to execute operations on lists
    y.append(1/value)
    
y = 1/cool_array # numpy lets you apply intuitive operations to whole ndarrays at once
y
array([1.        , 0.5       , 0.33333333, 0.25      , 0.2       ,
       0.16666667])
# The speed:

import random
x = [random.random() for _ in range(10000)] 
array_x = np.asarray(x)

%timeit y = [val**2 for val in x] # traditional python list-based approach
%timeit array_y = np.square(array_x) # numpy
943 µs ± 4.64 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
6.51 µs ± 42.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
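The `%timeit` magic above only works inside IPython or Jupyter. As a rough sketch of the same comparison in a plain python script, the standard-library `timeit` module can be used instead. The repetition count of 100 is an arbitrary choice, and the timings will differ from machine to machine.

```python
import random
import timeit

import numpy as np

# Same setup as above: 10,000 random floats as a list and as an ndarray
x = [random.random() for _ in range(10000)]
array_x = np.asarray(x)

# Total time for 100 repetitions of each approach (100 is an arbitrary choice)
list_time = timeit.timeit(lambda: [val**2 for val in x], number=100)
numpy_time = timeit.timeit(lambda: np.square(array_x), number=100)

print(f"list comprehension: {list_time:.4f} s")
print(f"np.square:          {numpy_time:.4f} s")

# Both approaches compute the same values
list_result = [val**2 for val in x]
numpy_result = np.square(array_x)
print(np.allclose(list_result, numpy_result))
```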

Combining arrays

We can combine numpy arrays into multi-dimensional matrices and perform many useful operations on those matrices

a = np.random.random((3, 2)) # can initialize matrices with 0s, 1s, or random numbers
print(a)
[[0.36002571 0.34237762]
 [0.48141861 0.1166153 ]
 [0.48861556 0.5864437 ]]
# We can access individual rows or columns and individual elements using bracket [row, col] notation
a[0,]
a[:,1] # note each of these rows/columns is itself a numpy array
# type(a[0])
array([0.34237762, 0.1166153 , 0.5864437 ])
print(np.max(a)) # maximum operation over the whole matrix
print(np.max(a, axis = 0)) # with an "axis", the maximum is taken along it: axis 0 gives one max per column, axis 1 gives one max per row
0.5864437023681016
[0.48861556 0.5864437 ]
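Because the matrix above is random, its output changes on every run. The sketch below repeats the same indexing and `axis` operations on a small fixed matrix (the values are made up for illustration) so the results can be checked by hand.

```python
import numpy as np

# A small fixed matrix (hypothetical values) with 3 rows and 2 columns
m = np.array([[1, 8],
              [5, 2],
              [3, 4]])

first_row = m[0, :]     # array([1, 8])
second_col = m[:, 1]    # array([8, 2, 4])
one_element = m[2, 0]   # 3

overall_max = np.max(m)       # 8
col_max = np.max(m, axis=0)   # max down each column: array([5, 8])
row_max = np.max(m, axis=1)   # max across each row: array([8, 5, 4])

# Matrices can also be initialized with zeros or ones of a given shape
zeros = np.zeros((3, 2))
ones = np.ones((3, 2))
```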

We’ll come across numpy at various points throughout the quarter but this should be enough to get us on our feet.

You can learn more about numpy and follow the beginner’s guide on their website here.

For now, it’s time to switch to a new tool in our computational social science toolkit: pandas.

Pandas!

First, what is pandas?

pandas is a python library for reading, writing, and interacting with tabular data.

This is convenient because a lot of data is tabular data. It’s kind of like Excel for python (but way cooler…).

There’s a really awesome cheat sheet here and a series of handy tutorials here.

import pandas as pd

In this class, we will use pandas as the basis for reading, processing, and understanding our data.

Let’s get started!

Reading data with pandas

# mcd = pd.read_csv("../Datasets/mcd.csv")
mcd = pd.read_csv("https://raw.githubusercontent.com/UCSD-CSS-002/ucsd-css-002.github.io/master/datasets/mcd-menu.csv")

Note: there are lots of other ways to read in data, including directly from hosted links online.
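As one illustration of those other ways, `pd.read_csv` also accepts a file-like object. The sketch below reads a tiny CSV from an in-memory string, which is handy for experimenting without a file or a network connection. The three rows are copied from the dataset shown in this lecture; the variable names are made up for the example.

```python
import io

import pandas as pd

# CSV text standing in for a file on disk or a hosted link
csv_text = """Category,Item,Calories
Breakfast,Egg McMuffin,300
Breakfast,Egg White Delight,250
Breakfast,Sausage McMuffin,370
"""

# read_csv accepts a file path, a URL, or any file-like object
small_menu = pd.read_csv(io.StringIO(csv_text))
print(small_menu)
print(small_menu.shape)  # (3, 3)
```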

pd.read_csv?

What is in this dataset?

mcd
Category Item Serving Size Calories Calories from Fat Total Fat Total Fat (% Daily Value) Saturated Fat Saturated Fat (% Daily Value) Trans Fat ... Carbohydrates Carbohydrates (% Daily Value) Dietary Fiber Dietary Fiber (% Daily Value) Sugars Protein Vitamin A (% Daily Value) Vitamin C (% Daily Value) Calcium (% Daily Value) Iron (% Daily Value)
0 Breakfast Egg McMuffin 4.8 oz (136 g) 300 120 13.0 20 5.0 25 0.0 ... 31 10 4 17 3 17 10 0 25 15
1 Breakfast Egg White Delight 4.8 oz (135 g) 250 70 8.0 12 3.0 15 0.0 ... 30 10 4 17 3 18 6 0 25 8
2 Breakfast Sausage McMuffin 3.9 oz (111 g) 370 200 23.0 35 8.0 42 0.0 ... 29 10 4 17 2 14 8 0 25 10
3 Breakfast Sausage McMuffin with Egg 5.7 oz (161 g) 450 250 28.0 43 10.0 52 0.0 ... 30 10 4 17 2 21 15 0 30 15
4 Breakfast Sausage McMuffin with Egg Whites 5.7 oz (161 g) 400 210 23.0 35 8.0 42 0.0 ... 30 10 4 17 2 21 6 0 25 10
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
255 Smoothies & Shakes McFlurry with Oreo Cookies (Small) 10.1 oz (285 g) 510 150 17.0 26 9.0 44 0.5 ... 80 27 1 4 64 12 15 0 40 8
256 Smoothies & Shakes McFlurry with Oreo Cookies (Medium) 13.4 oz (381 g) 690 200 23.0 35 12.0 58 1.0 ... 106 35 1 5 85 15 20 0 50 10
257 Smoothies & Shakes McFlurry with Oreo Cookies (Snack) 6.7 oz (190 g) 340 100 11.0 17 6.0 29 0.0 ... 53 18 1 2 43 8 10 0 25 6
258 Smoothies & Shakes McFlurry with Reese's Peanut Butter Cups (Medium) 14.2 oz (403 g) 810 290 32.0 50 15.0 76 1.0 ... 114 38 2 9 103 21 20 0 60 6
259 Smoothies & Shakes McFlurry with Reese's Peanut Butter Cups (Snack) 7.1 oz (202 g) 410 150 16.0 25 8.0 38 0.0 ... 57 19 1 5 51 10 10 0 30 4

260 rows × 24 columns

Note: printing out the whole dataset is often not the best way to look at it, and sometimes totally infeasible.

pandas offers several very handy tools for peeking at data.

mcd.shape # number of rows and number of columns

mcd.head() # this is usually enough to get a sense of what's going on in our data
mcd.head(10) # sometimes useful to look at more rows than the default
mcd.tail() # if you're curious what kind of values or responses are at the "end" of your dataset

mcd.columns # helpful when the data has too many columns to preview (as in this data!)
mcd.index # we'll come back to this...

mcd.describe() # note: this isn't all our columns! only numeric ones
mcd.Category.value_counts() # this is the equivalent of `describe` for our categorical variables
Coffee & Tea          95
Breakfast             42
Smoothies & Shakes    28
Chicken & Fish        27
Beverages             27
Beef & Pork           15
Snacks & Sides        13
Desserts               7
Salads                 6
Name: Category, dtype: int64

Note that all of these except the last are attributes or methods of the pandas data frame mcd itself; the last, value_counts, is called on a single column (more on this later).
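To try these tools without downloading anything, here is a minimal sketch on a five-row toy data frame. The rows are copied from the menu data above; the name `toy` is just for this example.

```python
import pandas as pd

toy = pd.DataFrame({
    "Category": ["Breakfast", "Breakfast", "Breakfast",
                 "Smoothies & Shakes", "Smoothies & Shakes"],
    "Item": ["Egg McMuffin", "Egg White Delight", "Sausage McMuffin",
             "McFlurry with Oreo Cookies (Small)",
             "McFlurry with Oreo Cookies (Medium)"],
    "Calories": [300, 250, 370, 510, 690],
})

print(toy.shape)       # (5, 3): an attribute, so no parentheses
print(toy.head(2))     # a method: the first two rows
print(toy.describe())  # summary statistics for the numeric column only

# value_counts is called on one column (a Series), not on the whole data frame
print(toy["Category"].value_counts())
```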

What can we do with this data?

Basic

  • Which menu items have the most protein? Calories? Largest serving size?

  • How many items does McDonald’s offer for each meal (breakfast, lunch, dinner)?

Intermediate

  • What are the healthiest and least healthy items on the menu?

  • What meal (breakfast, lunch, dinner, snack) is the most healthy or unhealthy overall?

Advanced

  • Can we identify how McDonald’s segments the healthy choice preferences of their customers by clustering the profiles of each menu item?

Why pandas?

Before we go any further, pause and think about how you would store this data with traditional python data structures: a list of lists? Many separate dictionaries? A menu item class with each attribute and all items in a list?

Think about how we would answer the questions above using traditional python operations over those data structures.

Routine work with data calls for many different (sometimes complicated) operations, along with data structures that can support them. We have already seen a few of these above, just in looking at the data in different ways.

We want the flexibility of code but the structure of tools like Excel to solve these problems.

This is where pandas comes in!
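To make the comparison concrete, here is a rough sketch that answers two of the basic questions above twice: once with a list of dictionaries and once with a data frame. It uses four rows copied from the menu data, and the variable names are made up for the example. It is one possible way to write each version, not the only one.

```python
import pandas as pd

# Traditional approach: a list of dictionaries, one per menu item
menu = [
    {"Item": "Egg McMuffin", "Category": "Breakfast", "Protein": 17},
    {"Item": "Egg White Delight", "Category": "Breakfast", "Protein": 18},
    {"Item": "Sausage McMuffin", "Category": "Breakfast", "Protein": 14},
    {"Item": "McFlurry with Oreo Cookies (Small)",
     "Category": "Smoothies & Shakes", "Protein": 12},
]

# Which item has the most protein? Plain python needs max() with a key function.
top_plain = max(menu, key=lambda row: row["Protein"])["Item"]

# How many items per category? Plain python needs a manual tally.
counts_plain = {}
for row in menu:
    counts_plain[row["Category"]] = counts_plain.get(row["Category"], 0) + 1

# The same two questions with a data frame
df = pd.DataFrame(menu)
top_pandas = df.loc[df["Protein"].idxmax(), "Item"]
counts_pandas = df["Category"].value_counts().to_dict()

print(top_plain, "|", top_pandas)
print(counts_plain, "|", counts_pandas)
```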

How does it work?

type(mcd)
pandas.core.frame.DataFrame

Tabular data is stored in pandas as a DataFrame.

A pandas data frame is much like a table in Excel and has close counterparts in R, STATA, etc.

It stores data in rows organized by columns, and has some very nifty properties.

# Let's look at the 'Item' column
menu_items = mcd['Item']
menu_items
type(menu_items)
pandas.core.series.Series

Each column in a pandas dataframe is a pandas Series.

A pandas series is a lot like a numpy array, but with one additional property: the index.

menu_items.index
RangeIndex(start=0, stop=260, step=1)

The index holds a label that identifies each row in the series. (Labels are usually unique, though pandas does not require it.)

You can use the index to fetch individual items in the series.

By default, pandas just uses the row number as the index for the values in each column.

menu_items[2]
'Sausage McMuffin'

In this way, it’s a lot like a normal list or numpy array.

But in pandas, index labels can be values of any hashable type, such as strings.

menu_cals = mcd['Calories'] # Let's fetch the `Calories` column
menu_cals
menu_cals.index # Here's the default index

# Instead, let's use each menu item as an index
menu_cals_item = pd.Series(list(mcd['Calories']), index = menu_items)
menu_cals_item
Item
Egg McMuffin                                         300
Egg White Delight                                    250
Sausage McMuffin                                     370
Sausage McMuffin with Egg                            450
Sausage McMuffin with Egg Whites                     400
                                                    ... 
McFlurry with Oreo Cookies (Small)                   510
McFlurry with Oreo Cookies (Medium)                  690
McFlurry with Oreo Cookies (Snack)                   340
McFlurry with Reese's Peanut Butter Cups (Medium)    810
McFlurry with Reese's Peanut Butter Cups (Snack)     410
Length: 260, dtype: int64

Now, we can access items in the series using this new index!

menu_cals_item['Egg McMuffin']
300
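A small self-contained sketch of the same idea, using three rows copied from the data above. It also shows the difference between looking up by label (`.loc`) and by position (`.iloc`), and what happens when a label is repeated, since pandas does not require index labels to be unique. The series names here are made up for the example.

```python
import pandas as pd

# Calories for three items, labeled by item name instead of row number
cals = pd.Series([300, 250, 370],
                 index=["Egg McMuffin", "Egg White Delight", "Sausage McMuffin"])

by_label = cals.loc["Egg White Delight"]  # label-based lookup
by_position = cals.iloc[1]                # position-based lookup, same element

# A repeated label returns every matching row
dupes = pd.Series([1, 2, 3], index=["a", "b", "a"])
both_a = dupes.loc["a"]       # a Series holding the values 1 and 3
print(dupes.index.is_unique)  # False
```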

What does it look like when we can look up array items with strings as keys?

# We can access `index` and `values` just like dictionary keys and values
menu_cals_item.index
menu_cals_item.values


# This functions just like a dictionary in traditional python
menu_cals_lookup = dict()
for i in range(len(menu_items)):
    menu_cals_lookup[menu_items[i]] = menu_cals[i]

menu_cals_lookup
menu_cals_lookup.keys()
menu_cals_lookup.values()
dict_values([300, 250, 370, 450, 400, 430, 460, 520, 410, 470, 430, 480, 510, 570, 460, 520, 410, 470, 540, 460, 400, 420, 550, 500, 620, 570, 670, 740, 800, 640, 690, 1090, 1150, 990, 1050, 350, 520, 300, 150, 460, 290, 260, 530, 520, 600, 610, 540, 750, 240, 290, 430, 720, 380, 440, 430, 430, 500, 510, 350, 670, 510, 610, 450, 750, 590, 430, 360, 480, 430, 360, 630, 480, 610, 450, 670, 520, 540, 380, 190, 280, 470, 940, 1880, 390, 140, 380, 220, 140, 450, 290, 340, 260, 330, 250, 360, 280, 230, 340, 510, 110, 20, 15, 150, 250, 160, 150, 45, 330, 340, 280, 140, 200, 280, 100, 0, 0, 0, 0, 140, 190, 270, 100, 0, 0, 0, 0, 140, 200, 280, 100, 100, 130, 80, 150, 190, 280, 0, 0, 0, 0, 0, 150, 180, 220, 110, 0, 0, 0, 170, 210, 280, 270, 340, 430, 270, 330, 430, 260, 330, 420, 210, 260, 330, 100, 130, 170, 200, 250, 310, 200, 250, 310, 190, 240, 300, 140, 170, 220, 340, 410, 500, 270, 330, 390, 320, 390, 480, 250, 310, 370, 360, 440, 540, 280, 340, 400, 140, 190, 270, 130, 180, 260, 130, 180, 250, 120, 170, 240, 80, 120, 160, 290, 350, 480, 240, 290, 390, 280, 340, 460, 230, 270, 370, 450, 550, 670, 450, 550, 670, 530, 630, 760, 220, 260, 340, 210, 250, 330, 210, 260, 340, 530, 660, 820, 550, 690, 850, 560, 700, 850, 660, 820, 650, 930, 430, 510, 690, 340, 810, 410])
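The dictionary comparison only goes so far. The sketch below, on three rows copied from the data above, shows what a series shares with a dictionary and a few things a plain dictionary cannot do directly. The variable names are made up for the example.

```python
import pandas as pd

cals = pd.Series({"Egg McMuffin": 300,
                  "Egg White Delight": 250,
                  "Sausage McMuffin": 370})

# Dictionary-style behavior
as_dict = cals.to_dict()           # convert to a regular python dict
has_item = "Egg McMuffin" in cals  # membership checks the index, like dict keys

# Things a plain dict cannot do directly
over_275 = cals[cals > 275]        # filter by a condition on the values
halved = cals / 2                  # arithmetic on every value at once
label_slice = cals.loc["Egg McMuffin":"Egg White Delight"]  # label slice, both endpoints included
```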

Take-aways

  • A pandas DataFrame stores tabular data in rows and columns

  • Each column is a pandas Series object

  • A pandas Series is similar to a numpy array (fixed dtype) but has an index that allows for rapid and flexible data access
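These take-aways can be checked on a two-row toy data frame (rows copied from the data above; the names are made up for the example). The sketch also shows that numpy functions such as `np.square` accept a series and keep its index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Item": ["Egg McMuffin", "Egg White Delight"],
                   "Calories": [300, 250]})

print(df.dtypes)  # each column has a single dtype

col = df["Calories"]
print(isinstance(col, pd.Series))  # True: a column is a Series

arr = col.to_numpy()               # the underlying values as a numpy array
print(isinstance(arr, np.ndarray)) # True

squared = np.square(col)           # still a Series, with the same index as `col`
print(squared)
```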

Accessing data in a data frame

In the code above, we used bracket notation dataframe['col'] to access column data.

There are a number of different ways to access columns in a data frame.

Any of these is fine; it's best to pick one and stick with it (and know the others exist).

# Accessing individual columns
menu_items = mcd['Item']
menu_items = mcd.Item
menu_items = mcd.loc[:, 'Item']
menu_items = mcd.iloc[:,1]
menu_items
0                                           Egg McMuffin
1                                      Egg White Delight
2                                       Sausage McMuffin
3                              Sausage McMuffin with Egg
4                       Sausage McMuffin with Egg Whites
                             ...                        
255                   McFlurry with Oreo Cookies (Small)
256                  McFlurry with Oreo Cookies (Medium)
257                   McFlurry with Oreo Cookies (Snack)
258    McFlurry with Reese's Peanut Butter Cups (Medium)
259     McFlurry with Reese's Peanut Butter Cups (Snack)
Name: Item, Length: 260, dtype: object
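One caveat when choosing among these: dot notation only works when the column name is a valid python identifier and does not clash with an existing data frame attribute or method. The sketch below checks that the four spellings agree on a toy data frame (rows copied from the data above) and shows the clash case with a hypothetical column named `count`.

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["Breakfast", "Breakfast"],
    "Item": ["Egg McMuffin", "Egg White Delight"],
    "Serving Size": ["4.8 oz (136 g)", "4.8 oz (135 g)"],
})

# All four spellings return the same Series
a = df["Item"]
b = df.Item
c = df.loc[:, "Item"]
d = df.iloc[:, 1]
print(a.equals(b) and a.equals(c) and a.equals(d))  # True

# `df.Serving Size` would be a syntax error, so a name with a space needs brackets
serving = df["Serving Size"]

# A column named like a DataFrame method is shadowed by that method under dot notation
shadow = pd.DataFrame({"count": [1, 2]})
print(callable(shadow.count))  # True: this is the count() method, not the column
print(list(shadow["count"]))   # [1, 2]: brackets still reach the column
```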

Many of these let us access multiple columns at once:

menu_subset = mcd[['Item', 'Category', 'Calories']] # Access specific columns by name
menu_subset = mcd.loc[:, 'Category':'Calories'] # Access a range of columns by name (both endpoints included)
menu_subset = mcd.iloc[:,1:4] # Access a range of columns by position (end position excluded)
menu_subset = mcd.iloc[:,[1, 2, 5]] # Access specific columns by position
menu_subset
                                                  Item     Serving Size  Total Fat
  0                                       Egg McMuffin   4.8 oz (136 g)       13.0
  1                                  Egg White Delight   4.8 oz (135 g)        8.0
  2                                   Sausage McMuffin   3.9 oz (111 g)       23.0
  3                          Sausage McMuffin with Egg   5.7 oz (161 g)       28.0
  4                   Sausage McMuffin with Egg Whites   5.7 oz (161 g)       23.0
...                                                ...              ...        ...
255                 McFlurry with Oreo Cookies (Small)  10.1 oz (285 g)       17.0
256                McFlurry with Oreo Cookies (Medium)  13.4 oz (381 g)       23.0
257                 McFlurry with Oreo Cookies (Snack)   6.7 oz (190 g)       11.0
258  McFlurry with Reese's Peanut Butter Cups (Medium)  14.2 oz (403 g)       32.0
259   McFlurry with Reese's Peanut Butter Cups (Snack)   7.1 oz (202 g)       16.0

260 rows × 3 columns
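Since each assignment above overwrites `menu_subset`, only the last result is displayed. The sketch below shows which columns each kind of selection returns, using a one-row toy data frame with the first six column names from the dataset. Label slices with `.loc` include both endpoints, while position slices with `.iloc` exclude the end, as ordinary python slices do.

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["Breakfast"],
    "Item": ["Egg McMuffin"],
    "Serving Size": ["4.8 oz (136 g)"],
    "Calories": [300],
    "Calories from Fat": [120],
    "Total Fat": [13.0],
})

# Label slice: both endpoints included
by_name = list(df.loc[:, "Category":"Calories"].columns)
print(by_name)      # ['Category', 'Item', 'Serving Size', 'Calories']

# Position slice: positions 1, 2, 3 (position 4 excluded)
by_position = list(df.iloc[:, 1:4].columns)
print(by_position)  # ['Item', 'Serving Size', 'Calories']

# Specific positions
picked = list(df.iloc[:, [1, 2, 5]].columns)
print(picked)       # ['Item', 'Serving Size', 'Total Fat']
```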