Case Study: US College Scorecard
====================================================
This study requires the national College Scorecard dataset, which can be
found here: https://collegescorecard.ed.gov/data/ (click the "download
all data" link and unzip the archive into the data directory). This
dataset is a large comparison of accredited colleges and universities in
the U.S. To quote the data.gov website:
The College Scorecard is designed to increase transparency, putting
the power in the hands of the public — from those choosing colleges
to those improving college quality — to see how well different
schools are serving their students.
We will see that there are 1844 variables for 7149 universities (in
2010), many of which are missing. This may be because the dataset spans
the school years 1996-97 through 2016-17, and new variables were added
later in the study. Throughout this study I am interested in the cost of
a 4-year degree by school, the type of degree, the future earnings, the
admissions rate, and the total enrollment of the university. We will
generally be exploring whether the common assumption about the value of
a degree, namely that the highest-value degrees come from expensive,
private schools with low admissions rates, is true.
In this study, we will be using the Plotnine package, which seems to be
well maintained as of Oct. 2018. In addition, you will need Seaborn and
Matplotlib. Plotnine is a Python ggplot implementation that very closely
emulates the ggplot2 package in R. To install Plotnine, use
``conda install -c conda-forge plotnine``.
Reading in the data
~~~~~~~~~~~~~~~~~~~
In this section, we will read one data file, pin down a data munging
pipeline, and then read in the remaining dataframes.
.. code:: ipython3
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import matplotlib as mpl
import plotnine as p9
plt.style.use('ggplot') # set the theme for plots
mpl.rcParams['figure.figsize'] = (10,8)
datadir = "../data/CollegeScorecard"
import warnings
warnings.filterwarnings('ignore')
The College Scorecard dataset consists of several large CSV files
corresponding to different academic years. You can see the filenames
listed here:
.. code:: ipython3
    !ls ../data/CollegeScorecard  # list the yearly data files
.. parsed-literal::
Crosswalks.zip MERGED2002_03_PP.csv MERGED2010_11_PP.csv
data.yaml MERGED2003_04_PP.csv MERGED2011_12_PP.csv
MERGED1996_97_PP.csv MERGED2004_05_PP.csv MERGED2012_13_PP.csv
MERGED1997_98_PP.csv MERGED2005_06_PP.csv MERGED2013_14_PP.csv
MERGED1998_99_PP.csv MERGED2006_07_PP.csv MERGED2014_15_PP.csv
MERGED1999_00_PP.csv MERGED2007_08_PP.csv MERGED2015_16_PP.csv
MERGED2000_01_PP.csv MERGED2008_09_PP.csv MERGED2016_17_PP.csv
MERGED2001_02_PP.csv MERGED2009_10_PP.csv
It is a good idea to test reading in a single file, then once this
pipeline is established, you can read the remainder of the files. Let's
start by reading the 2009-10 data.
.. code:: ipython3
# read in the 2009 data
COL = pd.read_csv(datadir + '/MERGED2009_10_PP.csv')
We can see the structure of the table below. It has 1844 columns, and
many of the values are missing. This is quite common in large
longitudinal studies, since over the years the researchers may decide to
include new variables. For example, we will see that in-state tuition
began to be recorded in 2001.
.. code:: ipython3
COL.head()
.. parsed-literal::

       UNITID    OPEID  OPEID6                               INSTNM        CITY STABBR         ZIP  ...  D150_L4_NOLOANNOPELL
    0  100654   100200    1002             Alabama A & M University      Normal     AL       35762  ...                   NaN
    1  100663   105200    1052  University of Alabama at Birmingham  Birmingham     AL  35294-0110  ...                   NaN
    2  100690  2503400   25034                   Amridge University  Montgomery     AL  36117-3553  ...                   NaN
    3  100706   105500    1055  University of Alabama in Huntsville  Huntsville     AL       35899  ...                   NaN
    4  100724   100500    1005             Alabama State University  Montgomery     AL  36104-0271  ...                   NaN

    [5 rows x 1844 columns]
.. code:: ipython3
# Let's describe the dataset
COL.info()
.. parsed-literal::
RangeIndex: 7149 entries, 0 to 7148
Columns: 1844 entries, UNITID to D150_L4_NOLOANNOPELL
dtypes: float64(634), int64(10), object(1200)
memory usage: 100.6+ MB
A very aggressive cleaning is to drop every column that has any NAs. A
few variables that I noticed in the data dictionary interest me, and
they are dropped by this procedure as well, so I will add them back when
specifying the columns to read.
.. code:: ipython3
# Which columns have no NAs
col_dna = COL.dropna(axis=1)
col_dna.info()
.. parsed-literal::
RangeIndex: 7149 entries, 0 to 7148
Data columns (total 15 columns):
UNITID 7149 non-null int64
OPEID 7149 non-null object
OPEID6 7149 non-null int64
INSTNM 7149 non-null object
CITY 7149 non-null object
STABBR 7149 non-null object
ZIP 7149 non-null object
MAIN 7149 non-null int64
NUMBRANCH 7149 non-null int64
PREDDEG 7149 non-null int64
HIGHDEG 7149 non-null int64
CONTROL 7149 non-null int64
ST_FIPS 7149 non-null int64
REGION 7149 non-null int64
ICLEVEL 7149 non-null int64
dtypes: int64(10), object(5)
memory usage: 837.9+ KB
We want to predetermine the dtypes and variable names so that, when we
read the remaining files, the DataFrames are uniformly formatted.
.. code:: ipython3
col_dtypes = dict(col_dna.dtypes.replace(np.dtype('int64'),np.dtype('float64'))) # make the dtypes floats
col_dtypes['UNITID'] = np.dtype('int64') # convert the UNITID back to int
vars_interest = ['ADM_RATE','UGDS','TUITIONFEE_IN','TUITIONFEE_OUT','MN_EARN_WNE_P10'] # Include these vars
col_dtypes.update({a: np.dtype('float64') for a in vars_interest}) # make them floats
We will read the data again, but this time select only the variables and
corresponding types in ``col_dtypes``. Specifying the types speeds up
parsing and makes the result uniform across files. We read the data with
the specific dtypes and columns using the ``dtype`` and ``usecols``
arguments below. We also pass ``na_values`` because the string
``"PrivacySuppressed"`` indicates a missing value in this dataset.
.. code:: ipython3
## Try reading it again
col_try_again = pd.read_csv(datadir + '/MERGED2009_10_PP.csv',na_values='PrivacySuppressed',
dtype=col_dtypes,usecols=col_dtypes.keys())
col_try_again.info()
.. parsed-literal::
RangeIndex: 7149 entries, 0 to 7148
Data columns (total 20 columns):
UNITID 7149 non-null int64
OPEID 7149 non-null object
OPEID6 7149 non-null float64
INSTNM 7149 non-null object
CITY 7149 non-null object
STABBR 7149 non-null object
ZIP 7149 non-null object
MAIN 7149 non-null float64
NUMBRANCH 7149 non-null float64
PREDDEG 7149 non-null float64
HIGHDEG 7149 non-null float64
CONTROL 7149 non-null float64
ST_FIPS 7149 non-null float64
REGION 7149 non-null float64
ADM_RATE 2774 non-null float64
UGDS 6596 non-null float64
TUITIONFEE_IN 4263 non-null float64
TUITIONFEE_OUT 4115 non-null float64
MN_EARN_WNE_P10 5486 non-null float64
ICLEVEL 7149 non-null float64
dtypes: float64(14), int64(1), object(5)
memory usage: 1.1+ MB
We will also want to store the school year. I will encode the school
year by its second calendar year, since Jan 1 of that year falls within
the school year; for example, 2009-10 is encoded as 2010. The yearly
Period 2010 is not strictly accurate, since it spans the calendar year
rather than the school year, but we will not be using Periods to their
full capabilities in this analysis, and this is a fine use of them here.
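As a quick illustration of what the Period label carries (standard
pandas behavior, shown on a throwaway value):

```python
import pandas as pd

# A yearly Period nominally spans Jan 1 through Dec 31 of its calendar
# year, not the academic year; we accept this because the Period only
# serves as a label here.
p = pd.Period('2010', freq='Y')
print(p.year)               # 2010
print(p.start_time.date())  # first day covered: 2010-01-01
print(p.end_time.date())    # last day covered: 2010-12-31
```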
.. code:: ipython3
col_try_again['Year'] = pd.Period('2010',freq='Y')
Now we are ready to wrap this up into a reader function.
.. code:: ipython3
def read_cs_data(year,col_dtypes,datadir):
"""read a CollegeScorecard dataframe"""
nextyr = str(int(year) + 1)[-2:]
filename = datadir + '/MERGED{}_{}_PP.csv'.format(year,nextyr)
col = pd.read_csv(filename,na_values='PrivacySuppressed',
dtype=col_dtypes,usecols=col_dtypes.keys())
col['Year'] = pd.Period(str(int(year) + 1),freq='Y')
return col
We can use the following generator expression to read the files in
sequence and concatenate all of them. This works because we enforced
uniform variable names and types.
.. code:: ipython3
col = pd.concat((read_cs_data(str(y),col_dtypes,datadir) for y in range(1996,2017)))
col = col.set_index(['UNITID','Year'])
We set the multi-index to be the unique id for the school and the year.
This data follows the tidy data design that dictates that each column
correspond to a variable and each row to a record. We could have joined
each year's data instead and made a wide DataFrame, but this would
violate the tidy data idea.
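To make the tidy-versus-wide distinction concrete, here is a toy sketch
with made-up numbers (not Scorecard values):

```python
import pandas as pd

# Long/tidy layout: one row per (school, year) observation.
tidy = pd.DataFrame({
    'UNITID': [1, 1, 2, 2],
    'Year':   [2009, 2010, 2009, 2010],
    'UGDS':   [100, 110, 200, 210],
})

# The wide alternative: one row per school, one column per year.
wide = tidy.pivot(index='UNITID', columns='Year', values='UGDS')
print(wide)
```

Adding a new year to the tidy frame only appends rows, while the wide
frame would need a new column, which is why the tidy layout is easier to
grow and to group on.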
.. code:: ipython3
col.head()
.. parsed-literal::

                    OPEID   OPEID6                                  INSTNM        CITY STABBR         ZIP  ...     UGDS  ICLEVEL
    UNITID Year
    100636 1997  01230800  12308.0      Community College of the Air Force  Montgomery     AL  36114-3011  ...  44141.0      2.0
    100654 1997  00100200   1002.0                Alabama A & M University      Normal     AL       35762  ...   3852.0      1.0
    100663 1997  00105200   1052.0     University of Alabama at Birmingham  Birmingham     AL  35294-0110  ...   9889.0      1.0
    100672 1997  00574900   5749.0  ALABAMA AVIATION AND TECHNICAL COLLEGE       OZARK     AL       36360  ...    295.0      2.0
    100690 1997  02503400  25034.0                      Amridge University  Montgomery     AL  36117-3553  ...     60.0      1.0

    [5 rows x 19 columns]
Let's select the large universities, those with more than 1000 students,
and then isolate UC Davis. For this we can use the ``query`` method.
.. code:: ipython3
col_large = col[col['UGDS'] > 1000]
davis = col_large.query('CITY=="Davis" and STABBR=="CA"')
davis = davis.reset_index(level=0)
We reset the index because Plotnine does not play nicely with plotting
on indices (as of the most recent release). Using Matplotlib we can plot
the Davis undergraduate enrollment (UGDS) over time.
.. code:: ipython3
ax = davis.plot(y='UGDS')
ax.set_title('UC Davis undergraduate pop.')
ax.set_ylabel('UG Enrollment')
plt.show()
.. image:: images/cost_of_uni_27_0.png
Plotnine also has trouble dealing with Period data, so let's convert it
to a timestamp.
.. code:: ipython3
davis['YearDT'] = davis.index.to_timestamp()
We initialize the ggplot with ``p9.ggplot`` specifying the data and the
aesthetic elements. Finally, we add a single layer of a line geom.
.. code:: ipython3
p9.ggplot(davis,p9.aes(x='YearDT',y='UGDS')) \
+ p9.geom_line() # first layer
.. image:: images/cost_of_uni_31_0.png
We want to plot similar lines for other universities, so let's apply the
same transformations to the dataset.
.. code:: ipython3
col_large = col_large.reset_index(level=1)
col_large['YearDT'] = pd.PeriodIndex(col_large['Year']).to_timestamp()
col_large.info()
.. parsed-literal::
Int64Index: 48362 entries, 100636 to 489201
Data columns (total 21 columns):
Year 48362 non-null object
OPEID 48362 non-null object
OPEID6 48362 non-null float64
INSTNM 48362 non-null object
CITY 48362 non-null object
STABBR 48362 non-null object
ZIP 48362 non-null object
MAIN 48362 non-null float64
NUMBRANCH 48362 non-null float64
PREDDEG 48362 non-null float64
HIGHDEG 48362 non-null float64
CONTROL 48362 non-null float64
ST_FIPS 48362 non-null float64
REGION 48362 non-null float64
ADM_RATE 21041 non-null float64
UGDS 48362 non-null float64
TUITIONFEE_IN 37823 non-null float64
TUITIONFEE_OUT 37825 non-null float64
MN_EARN_WNE_P10 12528 non-null float64
ICLEVEL 48362 non-null float64
YearDT 48362 non-null datetime64[ns]
dtypes: datetime64[ns](1), float64(14), object(6)
memory usage: 8.1+ MB
For example, the following line prints some basic statistics for UC
Davis in 2013:
.. code:: ipython3
    ## I looked up the following variable names
    ## in the data dictionary on data.gov
    y = '2013'
    print("""
    UC Davis Statistics {}
    Admissions rate: {}, Undergraduate enrollment: {:.0f},
    In-state tuition: {:.0f}, Out-of-state tuition: {:.0f},
    Mean earnings 10 yrs after entry: {:.0f}
    """.format(y,*tuple(davis.loc[y,['ADM_RATE','UGDS','TUITIONFEE_IN',
                                     'TUITIONFEE_OUT','MN_EARN_WNE_P10']])))

.. parsed-literal::

    UC Davis Statistics 2013
    Admissions rate: 0.4826, Undergraduate enrollment: 25588,
    In-state tuition: 13877, Out-of-state tuition: 36755,
    Mean earnings 10 yrs after entry: 66000
The rise in tuition
~~~~~~~~~~~~~~~~~~~
Tuition has been rising dramatically at US universities for decades. I
would like to investigate where UC Davis stands in this rise, and
consider the differences between states. Let's begin by examining the
amount of data and the missingness for these variables.
.. code:: ipython3
col_large.count()
.. parsed-literal::
Year 48362
OPEID 48362
OPEID6 48362
INSTNM 48362
CITY 48362
STABBR 48362
ZIP 48362
MAIN 48362
NUMBRANCH 48362
PREDDEG 48362
HIGHDEG 48362
CONTROL 48362
ST_FIPS 48362
REGION 48362
ADM_RATE 21041
UGDS 48362
TUITIONFEE_IN 37823
TUITIONFEE_OUT 37825
MN_EARN_WNE_P10 12528
ICLEVEL 48362
YearDT 48362
dtype: int64
It seems that ADM\_RATE is missing for about half of the records and the
tuition variables for about a quarter. After examining the dataset, most
of the missingness is due either to some schools having very few
non-missing entries or to certain years being completely missing. We
will select only the schools with mostly non-missing tuitions. We will
also sample the schools so that we are not plotting the full dataset,
which is too cumbersome for Matplotlib.
.. code:: ipython3
dav_id = davis.loc['1997','UNITID'] # store davis id
col_gby = col_large.groupby(level=0) # Group by the university ID
enough_dat = col_gby.count()['TUITIONFEE_IN'] > 15 # Select those with more than 15 non-missing entries
p = .1 # select a sampling probability
in_sample = pd.Series(np.random.binomial(1,p,size=enough_dat.shape[0]) > 0,
index=enough_dat.index.values) #
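To check the claim that certain years are completely missing, we can
count the non-missing tuition entries per year; on the real data this
would be ``col_large.groupby('Year')['TUITIONFEE_IN'].count()``. Here is
a self-contained sketch on a made-up toy frame:

```python
import numpy as np
import pandas as pd

# Toy stand-in for col_large (made-up values): tuition unrecorded in 1997.
demo = pd.DataFrame({
    'Year': [1997, 1997, 2001, 2001],
    'TUITIONFEE_IN': [np.nan, np.nan, 4000.0, 12000.0],
})

# count() ignores NaN, so fully missing years show up as zeros.
tuition_by_year = demo.groupby('Year')['TUITIONFEE_IN'].count()
print(tuition_by_year)
```

Years before the variable was introduced show a count of zero.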
The following function will plot either the in-state or the out-of-state
tuition. This kind of function is fine in a Jupyter notebook, where
there is an implied flow of the data, but it uses global variables and
should not be used in a module.
.. code:: ipython3
def tuitplot(tuitvar='TUITIONFEE_IN',tuittitle='',varlab='In-state tuition (USD)'):
"""plot the tuitvar"""
ax = plt.subplot() # init plot
for inst_id, df in col_gby: # iterate over unis
df = df.reset_index()
if inst_id == dav_id: # if davis
df.plot(x='Year',y=tuitvar,color='r',ax=ax,legend=False) # plot red
elif enough_dat[inst_id] and in_sample[inst_id]: # if in sample
df.plot(x='Year',y=tuitvar,alpha=.1,color='b',ax=ax,legend=False) # plot blue
ax.set_title(tuittitle)
ax.set_ylabel(varlab)
plt.show()
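As a sketch of the module-friendly alternative mentioned above, the same
logic can be written with every dependency passed explicitly (the name
``tuitplot_pure`` and its signature are my own invention):

```python
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # non-interactive backend for this sketch
from matplotlib import pyplot as plt

def tuitplot_pure(col_gby, dav_id, enough_dat, in_sample,
                  tuitvar='TUITIONFEE_IN', tuittitle='',
                  varlab='In-state tuition (USD)', ax=None):
    """Like tuitplot, but with every dependency passed as an argument."""
    if ax is None:
        ax = plt.subplot()
    for inst_id, df in col_gby:                  # iterate over unis
        df = df.reset_index()
        if inst_id == dav_id:                    # highlight one school
            df.plot(x='Year', y=tuitvar, color='r', ax=ax, legend=False)
        elif enough_dat[inst_id] and in_sample[inst_id]:
            df.plot(x='Year', y=tuitvar, alpha=.1, color='b', ax=ax, legend=False)
    ax.set_title(tuittitle)
    ax.set_ylabel(varlab)
    return ax
```

In the notebook this would be called as
``tuitplot_pure(col_gby, dav_id, enough_dat, in_sample, tuittitle="In-state tuition for large uni's")``,
and the function could live in a module and be unit tested.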
.. code:: ipython3
tuitplot(tuittitle = "In-state tuition for large uni's")
.. image:: images/cost_of_uni_42_0.png
Plotting the full dataset is not feasible since the data is so large.
Instead, let's sample the dataset randomly, selecting only those
universities that have enough tuition data to plot the tuition curve.
.. code:: ipython3
def samp_with_dav(col_large,p=.1):
col_gby = col_large.groupby(level=0)
enough_dat = col_gby.count()['TUITIONFEE_IN'] > 15
in_sample = pd.Series(np.random.binomial(1,p,size=enough_dat.shape[0]) > 0,
index=enough_dat.index.values)
in_sample.index.name = 'UNITID'
col_samp = col_large[in_sample & enough_dat]
col_dav = col_large.loc[dav_id].reset_index()
col_dav['UNITID'] = dav_id
col_dav = col_dav.set_index('UNITID')
return pd.concat([col_samp,col_dav])
Let's use this function to sample our large dataset.
.. code:: ipython3
col_samp = samp_with_dav(col_large)
Plotnine allows us to define the individual elements of a plot. This
starts with the data and the aesthetic mappings, such as the x-axis,
y-axis, and grouping variables. The following adds a line geom layer.
The grouping tells Plotnine to plot a separate line for each university.
.. code:: ipython3
p9.ggplot(col_samp.reset_index()) \
+ p9.aes('YearDT','TUITIONFEE_IN',group='UNITID') \
+ p9.geom_line() \
+ p9.labels.ggtitle("In-state tuition for large uni's") \
+ p9.labels.ylab('In-state tuition (USD)')
.. image:: images/cost_of_uni_48_0.png
In the above plot we added some annotations, such as the y-label. We can
further modify the annotations by rotating the x-axis text (we do this
with ``p9.theme()``) and setting the x limits.
.. code:: ipython3
p9.ggplot(col_samp.reset_index()) \
+ p9.aes('YearDT','TUITIONFEE_IN',group='UNITID') \
+ p9.geom_line() \
+ p9.theme(axis_text_x = p9.themes.element_text(rotation=45)) \
+ p9.labels.ggtitle("In-state tuition for large uni's") \
+ p9.labels.ylab('In-state tuition (USD)') \
+ p9.scale_x_date(limits=['2001','2016'])
.. image:: images/cost_of_uni_50_0.png
We can add color by including two more aesthetic mappings, setting the
color and alpha to expressions involving the UNITID; this makes UC Davis
show up in red. The color and alpha are given scales using
``p9.scale_alpha_identity()`` and ``p9.scale_color_cmap()``, where the
cmap is the first argument. A scale determines how a number such as
``.5`` is translated into a color. All Matplotlib colormaps are
available for use in Plotnine:
https://matplotlib.org/users/colormaps.html
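To see concretely what a colormap does with a number, we can evaluate
one directly in Matplotlib: a colormap is a function from the interval
[0, 1] to an RGBA tuple (``matplotlib.colormaps`` assumes Matplotlib
>= 3.5; older versions use ``plt.get_cmap('bwr')``):

```python
import matplotlib

# The 'bwr' colormap maps 0 -> blue, 0.5 -> (near) white, 1 -> red.
bwr = matplotlib.colormaps['bwr']
print(bwr(0.0))   # (0.0, 0.0, 1.0, 1.0), i.e. blue
print(bwr(1.0))   # (1.0, 0.0, 0.0, 1.0), i.e. red
print(bwr(0.5))   # near-white midpoint
```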
.. code:: ipython3
p9.ggplot(col_samp.reset_index()) \
+ p9.aes('YearDT','TUITIONFEE_IN',group='UNITID',
alpha='.1+.9*(UNITID=={})'.format(dav_id),
color='1.*(UNITID=={})'.format(dav_id)) \
+ p9.scale_alpha_identity() \
+ p9.scale_color_cmap('bwr',guide=False) \
+ p9.geom_line() + p9.scale_x_date(limits=['2001','2016']) \
+ p9.theme(axis_text_x = p9.themes.element_text(rotation=45)) \
+ p9.labels.ggtitle("In-state tuition for large uni's") \
+ p9.labels.ylab('In-state tuition (USD)')
.. image:: images/cost_of_uni_52_0.png
We can replicate the same plots for out-of-state tuition, which is
generally much higher than in-state tuition.
.. code:: ipython3
tuitplot('TUITIONFEE_OUT',"Out-of-state tuition for large uni's",
'Out-of-state tuition (USD)')
.. image:: images/cost_of_uni_54_0.png
.. code:: ipython3
p9.ggplot(col_samp.reset_index()) \
+ p9.aes('YearDT','TUITIONFEE_OUT',group='UNITID',
alpha='.1+.9*(UNITID=={})'.format(dav_id),
color='1.*(UNITID=={})'.format(dav_id)) \
+ p9.scale_alpha_identity() \
+ p9.scale_color_cmap('bwr',guide=False) \
+ p9.geom_line() + p9.scale_x_date(limits=['2001','2016']) \
+ p9.theme(axis_text_x = p9.themes.element_text(rotation=45)) \
+ p9.labels.ggtitle("Out-of-state tuition for large uni's") \
+ p9.labels.ylab('Out-of-state tuition')
.. image:: images/cost_of_uni_55_0.png
We can also facet on the state, which means that we will produce a plot
for every state. This will create many plots, so we adopt a minimalist
theme (called void) and shrink the size of each plot.
.. code:: ipython3
p9.ggplot(col_samp.reset_index()) \
+ p9.aes('YearDT','TUITIONFEE_OUT',group='UNITID',
alpha='.1+.9*(UNITID=={})'.format(dav_id),
color='1.*(UNITID=={})'.format(dav_id)) \
+ p9.scale_alpha_identity() \
+ p9.scale_color_cmap('bwr',guide=False) \
+ p9.geom_line() + p9.scale_x_date(limits=['2001','2016']) \
+ p9.theme_void() \
+ p9.facet_wrap('~ STABBR',ncol=8) \
+ p9.labels.ggtitle("Out-of-state tuition for large uni's") \
+ p9.labels.ylab('Out-of-state tuition')
.. image:: images/cost_of_uni_57_0.png
We can also plot only a trend line for each state. This produces an
interesting plot that lets us quickly compare the OLS fit for each
state. Because there is no issue with plotting too many geoms in this
case, we can fit on the entire dataset (we use ``col_large``).
.. code:: ipython3
p9.ggplot(col_large.reset_index()) \
+ p9.aes('YearDT','TUITIONFEE_OUT',group='STABBR') \
+ p9.scale_x_date(limits=['2001','2016']) \
+ p9.facet_wrap('~ STABBR',ncol=8) \
+ p9.stat_smooth(method='lm') \
+ p9.theme_void() \
+ p9.labels.ggtitle("Out-of-state tuition for large uni's by state") \
+ p9.labels.ylab('Out-of-state tuition')
.. image:: images/cost_of_uni_59_0.png
There are a few interesting takeaways from these plots. First, the bulk
of large universities have tuitions below 10,000 USD. The increase in
tuition seems to be steeper for more expensive universities than for
less expensive ones. The rise in tuition in the US is also more extreme
in certain states, particularly PA, NY, and MA.
Earnings trends
~~~~~~~~~~~~~~~~~~
The College Scorecard also reports the mean earnings of students who are
working and not enrolled 10 years after entry, encoded in the
MN\_EARN\_WNE\_P10 variable. We will draw a scatterplot of this variable
against the in-state tuition for each university. We will add the
undergraduate enrollment (UGDS) as color and size aesthetics. Then we
specify the scales for the size and the color. We make it a scatterplot
by adding point geoms. This is done on the data for 2013.
.. code:: ipython3
col_2013 = col_large.query('YearDT == "2013-01-01"')
.. code:: ipython3
p9.ggplot(col_2013) + p9.aes('TUITIONFEE_IN','MN_EARN_WNE_P10',size='UGDS',color='UGDS')\
+ p9.scale_size_area(breaks=[10000,20000,40000]) \
+ p9.labels.ggtitle("Mean earnings as a function of in-state tuition") \
+ p9.geom_point(alpha=.5) + p9.scale_color_cmap('plasma',guide=False)
.. image:: images/cost_of_uni_63_0.png
We can also add linear regression lines to the plot. By splitting the
universities into groups above and below 20,000 undergraduate students,
we see that larger universities tend to have a higher trend line.
.. code:: ipython3
p9.ggplot(col_2013) + p9.aes('TUITIONFEE_IN','MN_EARN_WNE_P10',size='UGDS',
color='UGDS',groups='UGDS > 20000')\
+ p9.scale_size_area(breaks=[10000,20000,40000]) \
+ p9.labels.ggtitle("Mean earnings as a function of in-state tuition") \
+ p9.geom_point(alpha=.5) + p9.stat_smooth(show_legend=False) + p9.scale_color_cmap('plasma',guide=False)
.. image:: images/cost_of_uni_65_0.png
In the above plot the scatter points are sized so that the area is
proportional to the undergraduate enrollment of the university. Just
from observation, there seem to be two populations (likely public versus
private universities): on the left, mean earnings rise more steeply as a
function of tuition, while on the right the trend is flatter but some of
the universities have significantly higher mean earnings.
Other comparisons
~~~~~~~~~~~~~~~~~~
In this section we will look at other variables, such as the admission
rate and the highest degree that the institution grants. Let's focus on
the 2013 records for which in-state tuition is not missing.
.. code:: ipython3
col_2013_nna = col_2013[~col_2013['TUITIONFEE_IN'].isna()]
col_2013_nna = col_2013_nna[col_2013_nna['HIGHDEG'] != 0]
We will begin by considering the highest degree granted (HIGHDEG) and
in-state tuition. Because HIGHDEG is categorical, it is natural to
either facet on it or to make it an aesthetic element for a univariate
plot such as a density estimate. We will have it determine the fill
color for density estimates of the tuition.
.. code:: ipython3
p9.ggplot(col_2013_nna) + p9.aes('TUITIONFEE_IN',fill='factor(HIGHDEG)') \
+ p9.geom_density(alpha=.5) + p9.labels.ggtitle('Density of in-state tuition by highest degree')
.. image:: images/cost_of_uni_70_0.png
We can also get a sense of the influence of the admission rate
(ADM\_RATE) on the mean earnings. We will use a point geom to get a
scatterplot and add a lowess smoother layer to get a sense of the trend.
Because ADM\_RATE is concentrated near 1, it makes sense to use a log
scale for the x-axis. This does not change the axis text but does change
the positions of the point geoms. We also add a color aesthetic for
in-state tuition.
.. code:: ipython3
p9.ggplot(col_2013_nna) + p9.aes('ADM_RATE','MN_EARN_WNE_P10',color='TUITIONFEE_IN') \
+ p9.geom_point() + p9.scale_x_log10() \
+ p9.scale_color_cmap() + p9.stat_smooth(method='lowess')
.. image:: images/cost_of_uni_72_0.png
This indicates that there is a close relationship between the admission
rate, tuition, and mean earnings. From the generally positive trend of
earnings as a function of tuition we might conclude that attending a
more expensive university causes someone to earn more, but we need to
consider confounding variables such as the admission rate. These
visualizations can help us understand the complex dependencies in this
data.
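One quick way to summarize these dependencies is a correlation matrix
over the three variables; on the real data this would be
``col_2013_nna[['ADM_RATE','TUITIONFEE_IN','MN_EARN_WNE_P10']].corr()``.
Below is a self-contained sketch with made-up toy numbers:

```python
import pandas as pd

# Toy data (not Scorecard values): selective (low ADM_RATE) schools are
# given higher tuition and higher later earnings, mimicking the plot.
demo = pd.DataFrame({
    'ADM_RATE':        [0.2, 0.5, 0.7, 0.9],
    'TUITIONFEE_IN':   [45000, 20000, 12000, 8000],
    'MN_EARN_WNE_P10': [90000, 55000, 48000, 40000],
})
corr = demo.corr()
print(corr.round(2))  # ADM_RATE correlates negatively with the other two
```

A correlation matrix only summarizes pairwise linear association, so it
complements rather than replaces the scatterplots above.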
We can also look at the admission rates as a function of the highest
degree granted. For this we use the Seaborn package, which has a larger
selection of named plots than Pyplot. For example, ``sns.FacetGrid``
will facet on one or two variables and make a grid of plots; in this
case we plot the density of the admission rate.
.. code:: ipython3
import seaborn as sns
g = sns.FacetGrid(col_2013_nna,row='HIGHDEG',aspect=2, height=1.5)
sfig = g.map(sns.kdeplot,'ADM_RATE')
.. image:: images/cost_of_uni_75_0.png
Another interesting named plot is the boxenplot (letter-value plot), an
enhanced box plot that displays more quantiles of the Y variable by
grouping on the X variable. The following is another look at the
in-state tuition by year.
.. code:: ipython3
g = sns.boxenplot(x='Year',y='TUITIONFEE_IN',data=col_large)
g.format_xdata = mpl.dates.DateFormatter('%Y-%m-%d')
g.figure.autofmt_xdate()
.. image:: images/cost_of_uni_77_0.png
You can find many more examples of named plots at the Seaborn website:
https://seaborn.pydata.org/
**Note:** Plotnine is a good example of a Python package. It is well
organized and has extensive documentation. If you look at the Plotnine
source code you can see the basic organization of the sub-modules.
Because it has great docstrings, you can, for example, use ``help`` to
see what the following class does.
.. code:: ipython3
help(p9.themes.element_text)
.. parsed-literal::
Help on class element_text in module plotnine.themes.elements:
class element_text(builtins.object)
| element_text(family=None, style=None, weight=None, color=None, size=None, ha=None, va=None, rotation=None, linespacing=None, backgroundcolor=None, margin=None, \*\*kwargs)
|
| Theme element: Text
|
| Parameters
| ----------
| family : str
| Font family
| style : 'normal' | 'italic' | 'oblique'
| Font style
| color : str | tuple
| Text color
| weight : str
| Should be one of *normal*, *bold*, *heavy*, *light*,
| *ultrabold* or *ultralight*.
| size : float
| text size
| ha : 'center' | 'left' | 'right'
| Horizontal Alignment.
| va : 'center' | 'top' | 'bottom' | 'baseline'
| Vertical alignment.
| rotation : float
| Rotation angle in the range [0, 360]
| linespacing : float
| Line spacing
| backgroundcolor : str | tuple
| Background color
| margin : dict
| Margin around the text. The keys are one of
| ``['t', 'b', 'l', 'r']`` and ``units``. The units are
| one of ``['pt', 'lines', 'in']``. The *units* default
| to ``pt`` and the other keys to ``0``. Not all text
| themeables support margin parameters and other than the
    | ``units``, only some of the other keys will apply.
| kwargs : dict
| Parameters recognised by :class:`matplotlib.text.Text`
|
| Note
| ----
| :class:`element_text` will accept parameters that conform to the
| **ggplot2** *element_text* API, but it is preferable the
| **Matplotlib** based API described above.
|
| Methods defined here:
|
| __init__(self, family=None, style=None, weight=None, color=None, size=None, ha=None, va=None, rotation=None, linespacing=None, backgroundcolor=None, margin=None, \*\*kwargs)
| Initialize self. See help(type(self)) for accurate signature.
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)