22)¶

Last time we covered:

Datahub
Functions
numpy basics

Today’s agenda:

numpy wrap-up
pandas basics

numpy wrap-up¶

import numpy as np

Recall from last time…

What is it?

numpy is primarily:

A class of array objects (the ndarray)
A set of high-performance functions that can be executed over those arrays

# 1. The ndarray
boring_list = [1, 2, 3, 4, 5, 6] # traditional python list
cool_array = np.array([1, 2, 3, 4, 5, 6]) # numpy ndarray

cool_array

array([1, 2, 3, 4, 5, 6])

# 2. The functions
y = np.square(cool_array)
y
# Note this is much simpler than what we would do to perform the equivalent operation on `boring_list` above

array([ 1,  4,  9, 16, 25, 36])

Why use it?

This represents an improvement over traditional python list operations for several reasons:

It streamlines our code
It’s way faster

# The code streamlining:

y = []
for value in boring_list: # traditional python often requires using `for` loops to execute operations on lists
    y.append(1/value)
    
y = 1/cool_array # numpy lets you apply intuitive operations to whole ndarrays at once
y

array([1.        , 0.5       , 0.33333333, 0.25      , 0.2       ,
       0.16666667])

# The speed:

import random
x = [random.random() for _ in range(10000)] 
array_x = np.asarray(x)

%timeit y = [val**2 for val in x] # traditional python list-based approach
%timeit array_y = np.square(array_x) # numpy

954 µs ± 19.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

6.39 µs ± 60.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
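`%timeit` only works inside a notebook or IPython session. As a rough sketch, the same comparison can be run in a plain python script with the standard-library `timeit` module. The repeat count of 200 is an arbitrary choice for this example, and the exact timings will differ from machine to machine.

```python
import random
import timeit

import numpy as np

x = [random.random() for _ in range(10000)]
array_x = np.asarray(x)

# Time 200 runs of each approach (200 is an arbitrary choice for this sketch)
n_runs = 200
list_seconds = timeit.timeit(lambda: [val**2 for val in x], number=n_runs)
numpy_seconds = timeit.timeit(lambda: np.square(array_x), number=n_runs)

print(f"list comprehension: {list_seconds / n_runs * 1e6:.1f} µs per loop")
print(f"np.square:          {numpy_seconds / n_runs * 1e6:.1f} µs per loop")

# Both approaches compute the same values
same_values = np.allclose([val**2 for val in x], np.square(array_x))
```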

Combining arrays

We can combine numpy arrays into multi-dimensional matrices and perform useful operations on those matrices.

a = np.random.random((3, 2)) # can initialize matrices with 0s, 1s, random numbers, and other cool stuff
print(a)

[[0.46687107 0.9904743 ]
 [0.68987756 0.628968  ]
 [0.94677363 0.63716925]]
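As a small sketch of what "combining" can look like: `np.vstack` stacks arrays as rows, `np.column_stack` stacks them as columns, and `np.zeros` and `np.ones` build matrices filled with 0s and 1s. The array names below are made up for illustration.

```python
import numpy as np

first = np.array([1, 2, 3])
second = np.array([4, 5, 6])

rows = np.vstack([first, second])        # each array becomes a row: shape (2, 3)
cols = np.column_stack([first, second])  # each array becomes a column: shape (3, 2)

zeros = np.zeros((3, 2))  # 3 rows, 2 columns of 0.0
ones = np.ones((3, 2))    # 3 rows, 2 columns of 1.0

print(rows)
print(cols)
print(rows.shape, cols.shape)
```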

# We can access individual rows or columns and individual elements using bracket [row, col] notation
a[0,]
a[:,1] 
# a[1,1]
# note each of these rows/columns is itself a numpy array
type(a[:,0])

numpy.ndarray

# print(np.max(a)) # maximum operation over the whole matrix
# print(np.max(a, axis = 1)) # can specify an "axis": axis = 0 gives the max of each column, axis = 1 gives the max of each row
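Because `a` is random, it can be easier to see what the axis argument does on a small fixed matrix. A minimal sketch (the matrix `m` is made up for illustration):

```python
import numpy as np

m = np.array([[1, 8],
              [5, 2],
              [3, 4]])

overall_max = np.max(m)         # 8: largest value in the whole matrix
column_max = np.max(m, axis=0)  # array([5, 8]): one value per column
row_max = np.max(m, axis=1)     # array([8, 5, 4]): one value per row

print(overall_max, column_max, row_max)
```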

We’ll come across numpy at various points throughout the quarter but this should be enough to get us on our feet.

You can learn more about numpy by following the beginner’s guide on the numpy website.

For now, it’s time to switch to a new tool in our computational social science toolkit: pandas.

Pandas!¶


First, what is pandas?¶

pandas is a python library for reading, writing, and interacting with tabular data.

This is convenient because a lot of data is tabular data. It’s kind of like Excel for python (but way cooler…).

The pandas website has a really awesome cheat sheet and a series of handy tutorials.

import pandas as pd

In this class, we will use pandas as the basis for reading, processing, and understanding our data.

Let’s get started!

Reading data with pandas¶

# mcd = pd.read_csv("../Datasets/mcd.csv")
mcd = pd.read_csv("https://raw.githubusercontent.com/UCSD-CSS-002/ucsd-css-002.github.io/master/datasets/mcd-menu.csv")

Note: there are lots of ways to read in data, including from a local file (the commented-out line above) or directly from a hosted link online (as we do here).

pd.read_csv?
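`pd.read_csv` accepts file paths, URLs, and file-like objects. The sketch below reads from an in-memory string so that it runs without a network connection. The three rows are taken from the McDonald’s menu data, and `usecols` and `nrows` are two of the optional parameters that the help text from `pd.read_csv?` lists.

```python
import io

import pandas as pd

csv_text = """Category,Item,Calories
Breakfast,Egg McMuffin,300
Breakfast,Egg White Delight,250
Breakfast,Sausage McMuffin,370
"""

# Any file-like object works in place of a path or URL
sample = pd.read_csv(io.StringIO(csv_text))

# Read only some of the columns and only the first two rows
partial = pd.read_csv(io.StringIO(csv_text), usecols=["Item", "Calories"], nrows=2)

print(sample)
print(partial)
```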

What is in this dataset?¶

mcd

	Category	Item	Serving Size	Calories	Calories from Fat	Total Fat	Total Fat (% Daily Value)	Saturated Fat	Saturated Fat (% Daily Value)	Trans Fat	...	Carbohydrates	Carbohydrates (% Daily Value)	Dietary Fiber	Dietary Fiber (% Daily Value)	Sugars	Protein	Vitamin A (% Daily Value)	Vitamin C (% Daily Value)	Calcium (% Daily Value)	Iron (% Daily Value)
0	Breakfast	Egg McMuffin	4.8 oz (136 g)	300	120	13.0	20	5.0	25	0.0	...	31	10	4	17	3	17	10	0	25	15
1	Breakfast	Egg White Delight	4.8 oz (135 g)	250	70	8.0	12	3.0	15	0.0	...	30	10	4	17	3	18	6	0	25	8
2	Breakfast	Sausage McMuffin	3.9 oz (111 g)	370	200	23.0	35	8.0	42	0.0	...	29	10	4	17	2	14	8	0	25	10
3	Breakfast	Sausage McMuffin with Egg	5.7 oz (161 g)	450	250	28.0	43	10.0	52	0.0	...	30	10	4	17	2	21	15	0	30	15
4	Breakfast	Sausage McMuffin with Egg Whites	5.7 oz (161 g)	400	210	23.0	35	8.0	42	0.0	...	30	10	4	17	2	21	6	0	25	10
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
255	Smoothies & Shakes	McFlurry with Oreo Cookies (Small)	10.1 oz (285 g)	510	150	17.0	26	9.0	44	0.5	...	80	27	1	4	64	12	15	0	40	8
256	Smoothies & Shakes	McFlurry with Oreo Cookies (Medium)	13.4 oz (381 g)	690	200	23.0	35	12.0	58	1.0	...	106	35	1	5	85	15	20	0	50	10
257	Smoothies & Shakes	McFlurry with Oreo Cookies (Snack)	6.7 oz (190 g)	340	100	11.0	17	6.0	29	0.0	...	53	18	1	2	43	8	10	0	25	6
258	Smoothies & Shakes	McFlurry with Reese's Peanut Butter Cups (Medium)	14.2 oz (403 g)	810	290	32.0	50	15.0	76	1.0	...	114	38	2	9	103	21	20	0	60	6
259	Smoothies & Shakes	McFlurry with Reese's Peanut Butter Cups (Snack)	7.1 oz (202 g)	410	150	16.0	25	8.0	38	0.0	...	57	19	1	5	51	10	10	0	30	4

260 rows × 24 columns

Note: printing out the whole dataset is often not the best way to look at it, and sometimes totally infeasible.

pandas offers several very handy tools for peeking at data.

mcd
mcd.shape # number of rows and number of columns

mcd.head() # this is usually enough to get a sense of what's going on in our data
mcd.head(10) # sometimes useful to look at more rows than the default
mcd.tail() # if you're curious what kind of values or responses are at the "end" of your dataset

mcd.columns # helpful when the data has too many columns to preview (as in this data!)
mcd.index # we'll come back to this...

mcd.describe() # note: this isn't all our columns! only numeric ones
mcd.Category.value_counts() # this is the equivalent of `describe` for our categorical variables

Coffee & Tea          95
Breakfast             42
Smoothies & Shakes    28
Beverages             27
Chicken & Fish        27
Beef & Pork           15
Snacks & Sides        13
Desserts               7
Salads                 6
Name: Category, dtype: int64

Note that all except the last of these are operations applied directly to the pandas data frame mcd; the last one, value_counts, is applied to a single column of the data frame (more on this later).
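To try these tools without downloading anything, here is a minimal sketch on a four-row frame. The name `toy` is made up, and the values are copied from rows of the menu data.

```python
import pandas as pd

toy = pd.DataFrame({
    "Category": ["Breakfast", "Breakfast", "Breakfast", "Smoothies & Shakes"],
    "Item": ["Egg McMuffin", "Egg White Delight", "Sausage McMuffin",
             "McFlurry with Oreo Cookies (Small)"],
    "Calories": [300, 250, 370, 510],
})

print(toy.shape)       # (4, 3): 4 rows, 3 columns
print(toy.head(2))     # first 2 rows
print(toy.describe())  # summary statistics for the numeric column only

# value_counts is called on a column (a Series), not on the whole frame
counts = toy["Category"].value_counts()
print(counts)
```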

What can we do with this data?¶

Basic

Which menu items have the most protein? Calories? Largest serving size?
How many items does McDonald’s offer for each meal (breakfast, lunch, dinner)?

Intermediate

What are the healthiest and least healthy items on the menu?
What meal (breakfast, lunch, dinner, snack) is the most healthy or unhealthy overall?

Advanced

Can we identify how McDonald’s segments the healthy choice preferences of their customers by clustering the profiles of each menu item?

Why pandas?¶

Before we go any further, pause and think about how you would store this data with traditional python data structures: a list of lists? Many separate dictionaries? A menu item class with each attribute and all items in a list?

Think about how we would answer the questions above using traditional python operations over those data structures.

Routine work with data requires many different kinds of (sometimes complicated) operations, along with data structures that can support them. We’ve already seen some of this above, just in looking at the data in different ways.

We want the flexibility of code but the structure of tools like Excel to solve these problems.

This is where pandas comes in!
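To make the comparison concrete, here is a rough sketch that answers two of the basic questions above for four menu items, first with a list of dictionaries and then with a data frame. The values come from the table printed earlier; the variable names are made up for illustration.

```python
from collections import Counter

import pandas as pd

records = [
    {"Category": "Breakfast", "Item": "Egg McMuffin", "Protein": 17},
    {"Category": "Breakfast", "Item": "Egg White Delight", "Protein": 18},
    {"Category": "Breakfast", "Item": "Sausage McMuffin", "Protein": 14},
    {"Category": "Smoothies & Shakes",
     "Item": "McFlurry with Oreo Cookies (Small)", "Protein": 12},
]

# Traditional python: write the lookup and counting logic by hand
most_protein_py = max(records, key=lambda row: row["Protein"])["Item"]
items_per_category_py = Counter(row["Category"] for row in records)

# pandas: the same questions as short operations on a data frame
df = pd.DataFrame(records)
most_protein_pd = df.loc[df["Protein"].idxmax(), "Item"]
items_per_category_pd = df["Category"].value_counts()

print(most_protein_py, "|", most_protein_pd)
print(dict(items_per_category_py), "|", items_per_category_pd.to_dict())
```

With four rows the two versions are about equally short. The difference grows once the questions involve many columns, grouping, or sorting.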

How does it work?¶

type(mcd)

pandas.core.frame.DataFrame

Tabular data is stored in pandas as a DataFrame.

A pandas data frame is essentially like a table in Excel and has close analogues in R, Stata, etc.

It stores data organized by columns, and has some very nifty properties.

# Let's look at the 'Item' column
menu_items = mcd['Item']
menu_items
type(menu_items)

pandas.core.series.Series

Each column in a pandas dataframe is a pandas Series.

A pandas series is a lot like a numpy array, but with one additional property: the index.

menu_items.index

RangeIndex(start=0, stop=260, step=1)

The index is a label used to identify each row in the series. Labels are usually unique, though pandas does not strictly require it.

You can use the index to fetch individual items in the series.

By default, pandas just uses the row number as the index for the values in each column.

menu_items[2]

'Sausage McMuffin'

In this way, it’s a lot like a normal list or numpy array.

But in pandas, index labels can be values of any hashable type, not just row numbers.

menu_cals = mcd['Calories'] # Let's fetch the `Calories` column
menu_cals
menu_cals.index # Here's the default index

# Instead, let's use each menu item as an index
menu_cals_item = pd.Series(list(mcd['Calories']), index = menu_items)
menu_cals_item

Item
Egg McMuffin                                         300
Egg White Delight                                    250
Sausage McMuffin                                     370
Sausage McMuffin with Egg                            450
Sausage McMuffin with Egg Whites                     400
                                                    ... 
McFlurry with Oreo Cookies (Small)                   510
McFlurry with Oreo Cookies (Medium)                  690
McFlurry with Oreo Cookies (Snack)                   340
McFlurry with Reese's Peanut Butter Cups (Medium)    810
McFlurry with Reese's Peanut Butter Cups (Snack)     410
Length: 260, dtype: int64

Now, we can access items in the series using this new index!

menu_cals_item['Egg McMuffin']
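One consequence worth seeing early: a series can be addressed by index label or by position, and the two only coincide with the default 0, 1, 2, ... index. A small sketch with made-up series, using `.loc` for labels and `.iloc` for positions to keep the distinction explicit:

```python
import pandas as pd

# Integer labels that are deliberately not the row positions
s = pd.Series([300, 250, 370], index=[10, 20, 30])

by_label = s.loc[20]     # 250: the row whose index label is 20
by_position = s.iloc[0]  # 300: the first row, whatever its label is

# String labels work the same way
sizes = pd.Series([510, 690], index=["Small", "Medium"])
medium = sizes.loc["Medium"]  # 690
last = sizes.iloc[-1]         # 690: the last row by position

print(by_label, by_position, medium, last)
```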

What does it look like when we can look up array items with strings as keys?

# We can access `index` and `values` just like dictionary keys and values
menu_cals_item.index
menu_cals_item.values


# This functions just like a dictionary in traditional python
menu_cals_lookup = dict()
for i in range(len(menu_items)):
    menu_cals_lookup[menu_items[i]] = menu_cals[i]

menu_cals_lookup
menu_cals_lookup.keys()
menu_cals_lookup.values()

dict_values([300, 250, 370, 450, 400, 430, 460, 520, 410, 470, 430, 480, 510, 570, 460, 520, 410, 470, 540, 460, 400, 420, 550, 500, 620, 570, 670, 740, 800, 640, 690, 1090, 1150, 990, 1050, 350, 520, 300, 150, 460, 290, 260, 530, 520, 600, 610, 540, 750, 240, 290, 430, 720, 380, 440, 430, 430, 500, 510, 350, 670, 510, 610, 450, 750, 590, 430, 360, 480, 430, 360, 630, 480, 610, 450, 670, 520, 540, 380, 190, 280, 470, 940, 1880, 390, 140, 380, 220, 140, 450, 290, 340, 260, 330, 250, 360, 280, 230, 340, 510, 110, 20, 15, 150, 250, 160, 150, 45, 330, 340, 280, 140, 200, 280, 100, 0, 0, 0, 0, 140, 190, 270, 100, 0, 0, 0, 0, 140, 200, 280, 100, 100, 130, 80, 150, 190, 280, 0, 0, 0, 0, 0, 150, 180, 220, 110, 0, 0, 0, 170, 210, 280, 270, 340, 430, 270, 330, 430, 260, 330, 420, 210, 260, 330, 100, 130, 170, 200, 250, 310, 200, 250, 310, 190, 240, 300, 140, 170, 220, 340, 410, 500, 270, 330, 390, 320, 390, 480, 250, 310, 370, 360, 440, 540, 280, 340, 400, 140, 190, 270, 130, 180, 260, 130, 180, 250, 120, 170, 240, 80, 120, 160, 290, 350, 480, 240, 290, 390, 280, 340, 460, 230, 270, 370, 450, 550, 670, 450, 550, 670, 530, 630, 760, 220, 260, 340, 210, 250, 330, 210, 260, 340, 530, 660, 820, 550, 690, 850, 560, 700, 850, 660, 820, 650, 930, 430, 510, 690, 340, 810, 410])
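The analogy goes both ways, with one difference: a series keeps the numpy-style whole-array operations that a dictionary lacks. A minimal sketch on three items from the menu (the name `cals` is made up):

```python
import pandas as pd

cals = pd.Series(
    [300, 250, 370],
    index=["Egg McMuffin", "Egg White Delight", "Sausage McMuffin"],
)

as_dict = cals.to_dict()           # convert the series to a plain dictionary
has_item = "Egg McMuffin" in cals  # `in` checks the index, like dictionary keys

# Unlike a dictionary, a series supports vectorized operations and filtering
doubled = cals * 2
over_280 = cals[cals > 280]

print(as_dict)
print(over_280)
```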

Take-aways¶

A pandas DataFrame stores tabular data in rows and columns
Each column is a pandas Series object
A pandas Series is similar to a numpy array (fixed dtype) but has an index that allows for rapid and flexible data access

Accessing data in a data frame¶

In the code above, we used bracket notation dataframe['col'] to access column data.

There are a number of different ways to access columns in a data frame.

Any of these is fine; it’s best to pick one and stick with it (and know the others exist).

# Accessing individual columns
menu_items = mcd['Item']
menu_items = mcd.Item
menu_items = mcd.loc[:, 'Item']
menu_items = mcd.iloc[:,1]
menu_items

0                                           Egg McMuffin
1                                      Egg White Delight
2                                       Sausage McMuffin
3                              Sausage McMuffin with Egg
4                       Sausage McMuffin with Egg Whites
                             ...                        
255                   McFlurry with Oreo Cookies (Small)
256                  McFlurry with Oreo Cookies (Medium)
257                   McFlurry with Oreo Cookies (Snack)
258    McFlurry with Reese's Peanut Butter Cups (Medium)
259     McFlurry with Reese's Peanut Butter Cups (Snack)
Name: Item, Length: 260, dtype: object
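One caveat when choosing among these: attribute access (`mcd.Item`) cannot be written for column names that contain spaces, and it does not reach a column whose name clashes with an existing data frame method. A sketch with a made-up frame, where the `count` column is hypothetical and exists only to show the clash:

```python
import pandas as pd

df = pd.DataFrame({
    "Item": ["Egg McMuffin", "Egg White Delight"],
    "Serving Size": ["4.8 oz (136 g)", "4.8 oz (135 g)"],
    "count": [1, 2],  # hypothetical column whose name clashes with DataFrame.count
})

# All four forms return the same column
same = (
    df["Item"].equals(df.Item)
    and df["Item"].equals(df.loc[:, "Item"])
    and df["Item"].equals(df.iloc[:, 0])
)

# `df.Serving Size` is a syntax error, so use bracket (or .loc) notation
serving = df["Serving Size"]

# `df.count` is the DataFrame method, not the column; brackets still reach the column
clash_is_method = callable(df.count)
count_column = df["count"]

print(same, clash_is_method, list(count_column))
```

Bracket notation works in every case, which is one reason to prefer it as the default.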

Many of these let us access multiple columns at once:

menu_subset = mcd[['Item', 'Category', 'Calories']] # Access specific columns by name
menu_subset = mcd.loc[:, 'Category':'Calories'] # Access a range of columns by name (both endpoints included)
menu_subset = mcd.iloc[:,1:4] # Access a range of columns by position (end excluded: columns 1, 2, 3)
menu_subset = mcd.iloc[:,[1, 2, 5]] # Access specific columns by position
menu_subset

	Item	Serving Size	Total Fat
0	Egg McMuffin	4.8 oz (136 g)	13.0
1	Egg White Delight	4.8 oz (135 g)	8.0
2	Sausage McMuffin	3.9 oz (111 g)	23.0
3	Sausage McMuffin with Egg	5.7 oz (161 g)	28.0
4	Sausage McMuffin with Egg Whites	5.7 oz (161 g)	23.0
...	...	...	...
255	McFlurry with Oreo Cookies (Small)	10.1 oz (285 g)	17.0
256	McFlurry with Oreo Cookies (Medium)	13.4 oz (381 g)	23.0
257	McFlurry with Oreo Cookies (Snack)	6.7 oz (190 g)	11.0
258	McFlurry with Reese's Peanut Butter Cups (Medium)	14.2 oz (403 g)	32.0
259	McFlurry with Reese's Peanut Butter Cups (Snack)	7.1 oz (202 g)	16.0

260 rows × 3 columns
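One detail that is easy to miss in the slices above: `.loc` slices by label and includes both endpoints, while `.iloc` slices by position and excludes the end, like ordinary python slicing. A sketch on a made-up one-row frame that reuses the first four column names of `mcd`:

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["Breakfast"],
    "Item": ["Egg McMuffin"],
    "Serving Size": ["4.8 oz (136 g)"],
    "Calories": [300],
})

by_label = df.loc[:, "Category":"Calories"]  # label slice: both endpoints included
by_position = df.iloc[:, 1:4]                # position slice: columns 1, 2, 3 (4 excluded)
picked = df.iloc[:, [1, 3]]                  # specific positions

print(list(by_label.columns))     # ['Category', 'Item', 'Serving Size', 'Calories']
print(list(by_position.columns))  # ['Item', 'Serving Size', 'Calories']
print(list(picked.columns))       # ['Item', 'Calories']
```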


UCSD CSS 2