Case Study: US College Scorecard
====================================================

This study requires the national College Scorecard dataset, which can be
found at https://collegescorecard.ed.gov/data/ (click the "download all
data" link and unzip the archive into the data directory). This dataset is
a large comparison of accredited colleges and universities in the U.S. To
quote the data.gov website:

    The College Scorecard is designed to increase transparency, putting the
    power in the hands of the public — from those choosing colleges to those
    improving college quality — to see how well different schools are
    serving their students.

We will see that there are 1844 variables for 7149 universities (in the
2009-10 school year), many of which are missing. The missingness may be due
to the fact that this dataset spans the years 1996-2017 and new variables
were added over the course of the study.

Throughout this study I am interested in the cost of a 4-year degree by
school, the type of degree, future earnings, the admission rate, and the
total enrollment of the university. Generally, we will be exploring whether
the common assumption about the value of a degree (that the highest-value
degrees are from expensive, private schools with low admission rates) is
true.

In this study we will be using the Plotnine package, which seems to be well
maintained as of Oct. 2018; you will also need Seaborn and Matplotlib.
Plotnine is a Python ggplot implementation, and it very closely emulates
the ggplot2 package in R. To install Plotnine, use
``conda install -c conda-forge plotnine``.

Reading in the data
~~~~~~~~~~~~~~~~~~~

In this section, we will read one data file, pin down a data munging
pipeline, and then read in the remaining dataframes.

.. code:: ipython3

    import pandas as pd
    import numpy as np
    from matplotlib import pyplot as plt
    import matplotlib as mpl
    import plotnine as p9

    plt.style.use('ggplot')  # set the theme for plots
    mpl.rcParams['figure.figsize'] = (10, 8)
    datadir = "../data/CollegeScorecard"

    import warnings
    warnings.filterwarnings('ignore')

The College Scorecard dataset consists of several large CSV files
corresponding to different academic years. You can see the filenames
listed here:

.. code:: ipython3

    !ls ../data/CollegeScorecard  # one file per school year

.. parsed-literal::

    Crosswalks.zip        MERGED2002_03_PP.csv  MERGED2010_11_PP.csv
    data.yaml             MERGED2003_04_PP.csv  MERGED2011_12_PP.csv
    MERGED1996_97_PP.csv  MERGED2004_05_PP.csv  MERGED2012_13_PP.csv
    MERGED1997_98_PP.csv  MERGED2005_06_PP.csv  MERGED2013_14_PP.csv
    MERGED1998_99_PP.csv  MERGED2006_07_PP.csv  MERGED2014_15_PP.csv
    MERGED1999_00_PP.csv  MERGED2007_08_PP.csv  MERGED2015_16_PP.csv
    MERGED2000_01_PP.csv  MERGED2008_09_PP.csv  MERGED2016_17_PP.csv
    MERGED2001_02_PP.csv  MERGED2009_10_PP.csv

It is a good idea to test reading in a single file; once this pipeline is
established, you can read the remainder of the files. Let's start by
reading the 2009-10 data.

.. code:: ipython3

    # read in the 2009-10 data
    COL = pd.read_csv(datadir + '/MERGED2009_10_PP.csv')

We can see the structure of the table below. It has 1844 columns, and many
values are missing. This is quite common in large longitudinal studies,
since over the years the researchers may decide to include new variables.
For example, we will see that in-state tuition began to be recorded in
2001.

.. code:: ipython3

    COL.head()
.. parsed-literal::

       UNITID    OPEID  OPEID6                               INSTNM        CITY STABBR         ZIP ACCREDAGENCY  ...  D150_L4_NOLOANNOPELL
    0  100654   100200    1002             Alabama A & M University      Normal     AL       35762          NaN  ...                   NaN
    1  100663   105200    1052  University of Alabama at Birmingham  Birmingham     AL  35294-0110          NaN  ...                   NaN
    2  100690  2503400   25034                   Amridge University  Montgomery     AL  36117-3553          NaN  ...                   NaN
    3  100706   105500    1055  University of Alabama in Huntsville  Huntsville     AL       35899          NaN  ...                   NaN
    4  100724   100500    1005             Alabama State University  Montgomery     AL  36104-0271          NaN  ...                   NaN

    5 rows × 1844 columns

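Before deciding what to drop, it can help to quantify the missingness
column by column. The following is a minimal sketch; it only assumes the
``COL`` DataFrame read above.

.. code:: ipython3

    # fraction of missing values per column, most-missing first
    na_frac = COL.isna().mean().sort_values(ascending=False)
    print(na_frac.head())                                  # worst offenders
    print('fully observed columns:', (na_frac == 0).sum())

Sorting by the fraction missing makes it easy to spot variables that were
added late in the study.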
.. code:: ipython3

    # Let's describe the dataset
    COL.info()

.. parsed-literal::

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 7149 entries, 0 to 7148
    Columns: 1844 entries, UNITID to D150_L4_NOLOANNOPELL
    dtypes: float64(634), int64(10), object(1200)
    memory usage: 100.6+ MB

We can clean very aggressively by dropping all columns that have any NAs.
I am also interested in a few variables that I noticed in the data
dictionary, and those are dropped by this procedure; below, I add them
back to the set of fully observed columns before re-reading the data.

.. code:: ipython3

    # Which columns have no NAs?
    col_dna = COL.dropna(axis=1)
    col_dna.info()

.. parsed-literal::

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 7149 entries, 0 to 7148
    Data columns (total 15 columns):
    UNITID       7149 non-null int64
    OPEID        7149 non-null object
    OPEID6       7149 non-null int64
    INSTNM       7149 non-null object
    CITY         7149 non-null object
    STABBR       7149 non-null object
    ZIP          7149 non-null object
    MAIN         7149 non-null int64
    NUMBRANCH    7149 non-null int64
    PREDDEG      7149 non-null int64
    HIGHDEG      7149 non-null int64
    CONTROL      7149 non-null int64
    ST_FIPS      7149 non-null int64
    REGION       7149 non-null int64
    ICLEVEL      7149 non-null int64
    dtypes: int64(10), object(5)
    memory usage: 837.9+ KB

We want to predetermine the dtypes and variable names so that when we read
the remaining files, the resulting DataFrames are uniformly formatted.

.. code:: ipython3

    col_dtypes = dict(col_dna.dtypes.replace(np.dtype('int64'),
                                             np.dtype('float64')))  # make the dtypes floats
    col_dtypes['UNITID'] = np.dtype('int64')  # convert the UNITID back to int
    vars_interest = ['ADM_RATE', 'UGDS', 'TUITIONFEE_IN',
                     'TUITIONFEE_OUT', 'MN_EARN_WNE_P10']  # include these vars
    col_dtypes.update({a: np.dtype('float64') for a in vars_interest})  # make them floats

We will read the data again, but this time select only the variables and
corresponding types in ``col_dtypes``. Specifying the dtypes speeds up the
reading process and makes it uniform across files; we pass them with the
``dtype`` and ``usecols`` arguments below. We also use ``na_values``
because the string ``"PrivacySuppressed"`` indicates a missing value in
this dataset.

.. code:: ipython3

    ## Try reading it again
    col_try_again = pd.read_csv(datadir + '/MERGED2009_10_PP.csv',
                                na_values='PrivacySuppressed',
                                dtype=col_dtypes, usecols=col_dtypes.keys())
    col_try_again.info()

.. parsed-literal::

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 7149 entries, 0 to 7148
    Data columns (total 20 columns):
    UNITID             7149 non-null int64
    OPEID              7149 non-null object
    OPEID6             7149 non-null float64
    INSTNM             7149 non-null object
    CITY               7149 non-null object
    STABBR             7149 non-null object
    ZIP                7149 non-null object
    MAIN               7149 non-null float64
    NUMBRANCH          7149 non-null float64
    PREDDEG            7149 non-null float64
    HIGHDEG            7149 non-null float64
    CONTROL            7149 non-null float64
    ST_FIPS            7149 non-null float64
    REGION             7149 non-null float64
    ADM_RATE           2774 non-null float64
    UGDS               6596 non-null float64
    TUITIONFEE_IN      4263 non-null float64
    TUITIONFEE_OUT     4115 non-null float64
    MN_EARN_WNE_P10    5486 non-null float64
    ICLEVEL            7149 non-null float64
    dtypes: float64(14), int64(1), object(5)
    memory usage: 1.1+ MB

We also want to store the school year. I will encode each school year by
its second calendar year, since Jan 1 of that year falls within the school
year; for example, 2009-10 is encoded as 2010. A yearly Period such as 2010
is not strictly accurate, since it spans the calendar year rather than the
school year, but we will not be using Periods to their full capabilities
in this analysis, and this is a fine use of Periods in our case.
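Since the school year straddles two calendar years, it is worth seeing
exactly what span a yearly ``Period`` represents; a minimal sketch:

.. code:: ipython3

    yr = pd.Period('2010', freq='Y')
    print(yr.start_time, yr.end_time)  # Jan 1 through Dec 31, not the school year

With that caveat noted, we attach the year to the DataFrame: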
.. code:: ipython3

    col_try_again['Year'] = pd.Period('2010', freq='Y')

Now we are ready to wrap this up into a reader function.

.. code:: ipython3

    def read_cs_data(year, col_dtypes, datadir):
        """Read the CollegeScorecard DataFrame for the school year starting in `year`."""
        nextyr = str(int(year) + 1)[-2:]
        filename = datadir + '/MERGED{}_{}_PP.csv'.format(year, nextyr)
        col = pd.read_csv(filename, na_values='PrivacySuppressed',
                          dtype=col_dtypes, usecols=col_dtypes.keys())
        col['Year'] = pd.Period(str(int(year) + 1), freq='Y')
        return col

We can use the following generator expression to read the files in
sequence and concatenate all of them. This works because we enforce
uniform variable names and types across the files.

.. code:: ipython3

    col = pd.concat((read_cs_data(str(y), col_dtypes, datadir)
                     for y in range(1996, 2017)))
    col = col.set_index(['UNITID', 'Year'])

We set the multi-index to be the unique ID for the school and the year.
This follows the tidy data design, which dictates that each column
corresponds to a variable and each row to a record. We could instead have
joined each year's data to make a wide DataFrame, but that would violate
the tidy data idea.
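For contrast, here is a minimal sketch of the wide layout we are avoiding,
with one in-state tuition column per year (it assumes each (UNITID, Year)
pair appears once):

.. code:: ipython3

    # one row per school, one column per year -- convenient for some tasks, but not tidy
    wide = col.reset_index().pivot(index='UNITID', columns='Year',
                                   values='TUITIONFEE_IN')

Back in the tidy frame, here are the first few rows: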
.. code:: ipython3

    col.head()

.. parsed-literal::

                    OPEID   OPEID6                                  INSTNM        CITY STABBR  ...     UGDS  TUITIONFEE_IN  MN_EARN_WNE_P10  ICLEVEL
    UNITID Year                                                                                ...
    100636 1997  01230800  12308.0      Community College of the Air Force  Montgomery     AL  ...  44141.0            NaN              NaN      2.0
    100654 1997  00100200   1002.0                Alabama A & M University      Normal     AL  ...   3852.0            NaN              NaN      1.0
    100663 1997  00105200   1052.0     University of Alabama at Birmingham  Birmingham     AL  ...   9889.0            NaN              NaN      1.0
    100672 1997  00574900   5749.0  ALABAMA AVIATION AND TECHNICAL COLLEGE       OZARK     AL  ...    295.0            NaN              NaN      2.0
    100690 1997  02503400  25034.0                      Amridge University  Montgomery     AL  ...     60.0            NaN              NaN      1.0
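With the multi-index in place, a single school-year record can be pulled
directly with ``.loc``. A minimal sketch, using Alabama A & M's UNITID
from the table above:

.. code:: ipython3

    col.loc[(100654, pd.Period('1997', freq='Y'))]  # Alabama A & M in 1996-97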
Let's select the large universities, those with more than 1000 students,
and then isolate UC Davis. For this we can use the ``query`` method.

.. code:: ipython3

    col_large = col[col['UGDS'] > 1000]
    davis = col_large.query('CITY=="Davis" and STABBR=="CA"')
    davis = davis.reset_index(level=0)

We reset the index because Plotnine does not play nicely with plotting on
indices (as of the most recent release). Using Matplotlib, we can plot the
UC Davis undergraduate enrollment (UGDS) over time.

.. code:: ipython3

    ax = davis.plot(y='UGDS')
    ax.set_title('UC Davis undergraduate pop.')
    ax.set_ylabel('UG Enrollment')
    plt.show()

.. image:: images/cost_of_uni_27_0.png

Plotnine also has trouble dealing with Period data, so let's convert it to
a timestamp.

.. code:: ipython3

    davis['YearDT'] = davis.index.to_timestamp()

We initialize the ggplot with ``p9.ggplot``, specifying the data and the
aesthetic elements. Then we add a single layer with a line geom.

.. code:: ipython3

    p9.ggplot(davis, p9.aes(x='YearDT', y='UGDS')) \
        + p9.geom_line()  # first layer

.. image:: images/cost_of_uni_31_0.png

We want to plot similar lines for other universities, so let's apply the
same transformations to the full dataset.

.. code:: ipython3

    col_large = col_large.reset_index(level=1)
    col_large['YearDT'] = pd.PeriodIndex(col_large['Year']).to_timestamp()
    col_large.info()

.. parsed-literal::

    <class 'pandas.core.frame.DataFrame'>
    Int64Index: 48362 entries, 100636 to 489201
    Data columns (total 21 columns):
    Year               48362 non-null object
    OPEID              48362 non-null object
    OPEID6             48362 non-null float64
    INSTNM             48362 non-null object
    CITY               48362 non-null object
    STABBR             48362 non-null object
    ZIP                48362 non-null object
    MAIN               48362 non-null float64
    NUMBRANCH          48362 non-null float64
    PREDDEG            48362 non-null float64
    HIGHDEG            48362 non-null float64
    CONTROL            48362 non-null float64
    ST_FIPS            48362 non-null float64
    REGION             48362 non-null float64
    ADM_RATE           21041 non-null float64
    UGDS               48362 non-null float64
    TUITIONFEE_IN      37823 non-null float64
    TUITIONFEE_OUT     37825 non-null float64
    MN_EARN_WNE_P10    12528 non-null float64
    ICLEVEL            48362 non-null float64
    YearDT             48362 non-null datetime64[ns]
    dtypes: datetime64[ns](1), float64(14), object(6)
    memory usage: 8.1+ MB

We can print some basic statistics for UC Davis in 2013, for example:

.. code:: ipython3

    ## I looked up the following variable names
    ## in the data dictionary on data.gov
    y = '2013'
    print("""
    UC Davis Statistics {}
    Admission rate: {},
    Undergrad enrollment: {:.0f},
    In-state tuition: {:.0f},
    Out-of-state tuition: {:.0f},
    Mean earnings 10 yrs after entry: {:.0f}
    """.format(y, *tuple(davis.loc[y, ['ADM_RATE', 'UGDS', 'TUITIONFEE_IN',
                                       'TUITIONFEE_OUT', 'MN_EARN_WNE_P10']])))

.. parsed-literal::

    UC Davis Statistics 2013
    Admission rate: 0.4826,
    Undergrad enrollment: 25588,
    In-state tuition: 13877,
    Out-of-state tuition: 36755,
    Mean earnings 10 yrs after entry: 66000

The rise in tuition
~~~~~~~~~~~~~~~~~~~

Tuition has been rising dramatically at US universities for decades. I
would like to investigate where UC Davis stands in this rise, and to
consider the differences between states. Let's begin by examining the
amount of data and the missingness for these variables.

.. code:: ipython3

    col_large.count()
.. parsed-literal::

    Year               48362
    OPEID              48362
    OPEID6             48362
    INSTNM             48362
    CITY               48362
    STABBR             48362
    ZIP                48362
    MAIN               48362
    NUMBRANCH          48362
    PREDDEG            48362
    HIGHDEG            48362
    CONTROL            48362
    ST_FIPS            48362
    REGION             48362
    ADM_RATE           21041
    UGDS               48362
    TUITIONFEE_IN      37823
    TUITIONFEE_OUT     37825
    MN_EARN_WNE_P10    12528
    ICLEVEL            48362
    YearDT             48362
    dtype: int64

It seems that ADM_RATE is about half missing, and the tuition variables
are around a quarter missing. After examining the dataset, most of the
missingness appears to be due either to some schools having very few
non-missing entries or to certain years being completely missing. We will
select only the schools with mostly non-missing tuitions. We will also
sample the schools, since plotting the full dataset is too cumbersome for
Matplotlib.

.. code:: ipython3

    dav_id = davis.loc['1997', 'UNITID']  # store the Davis id
    col_gby = col_large.groupby(level=0)  # group by the university ID
    enough_dat = col_gby.count()['TUITIONFEE_IN'] > 15  # more than 15 non-missing entries
    p = .1                                # sampling probability
    in_sample = pd.Series(np.random.binomial(1, p, size=enough_dat.shape[0]) > 0,
                          index=enough_dat.index.values)

The following function plots either the in-state or the out-of-state
tuition. This style of function is fine in a Jupyter notebook, since there
is an implied flow of the data, but it relies on global variables and
should not be used in a module.

.. code:: ipython3

    def tuitplot(tuitvar='TUITIONFEE_IN', tuittitle='',
                 varlab='In-state tuition (USD)'):
        """Plot tuitvar over time, highlighting UC Davis in red."""
        ax = plt.subplot()                # init plot
        for inst_id, df in col_gby:       # iterate over unis
            df = df.reset_index()
            if inst_id == dav_id:         # if Davis, plot red
                df.plot(x='Year', y=tuitvar, color='r', ax=ax, legend=False)
            elif enough_dat[inst_id] and in_sample[inst_id]:  # if in sample, plot blue
                df.plot(x='Year', y=tuitvar, alpha=.1, color='b', ax=ax, legend=False)
        ax.set_title(tuittitle)
        ax.set_ylabel(varlab)
        plt.show()

.. code:: ipython3

    tuitplot(tuittitle="In-state tuition for large uni's")

.. image:: images/cost_of_uni_42_0.png

To make a similar plot in Plotnine, we again sample the dataset randomly
and select the universities that have enough tuition data to draw a
tuition curve, this time packaged as a function that always includes UC
Davis.

.. code:: ipython3

    def samp_with_dav(col_large, p=.1):
        """Sample universities with sufficient tuition data, always keeping UC Davis."""
        col_gby = col_large.groupby(level=0)
        enough_dat = col_gby.count()['TUITIONFEE_IN'] > 15
        in_sample = pd.Series(np.random.binomial(1, p, size=enough_dat.shape[0]) > 0,
                              index=enough_dat.index.values)
        in_sample.index.name = 'UNITID'
        col_samp = col_large[in_sample & enough_dat]
        col_dav = col_large.loc[dav_id].reset_index()
        col_dav['UNITID'] = dav_id
        col_dav = col_dav.set_index('UNITID')
        return pd.concat([col_samp, col_dav])

Let's use this function to sample our large dataset.

.. code:: ipython3

    col_samp = samp_with_dav(col_large)

Plotnine lets us define the individual elements of a plot. We start with
the data and the aesthetic elements, such as the X-axis, Y-axis, and
grouping variables. The following adds a line geom layer; the grouping
tells Plotnine to plot a separate line for each university.

.. code:: ipython3

    p9.ggplot(col_samp.reset_index()) \
        + p9.aes('YearDT', 'TUITIONFEE_IN', group='UNITID') \
        + p9.geom_line() \
        + p9.labels.ggtitle("In-state tuition for large uni's") \
        + p9.labels.ylab('In-state tuition (USD)')

.. image:: images/cost_of_uni_48_0.png

In the above plot we added some annotations, such as the y-label.
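Any of these ggplot objects can also be written to disk with the ``save``
method; a minimal sketch, where the filename and figure size are arbitrary
choices:

.. code:: ipython3

    fig = p9.ggplot(col_samp.reset_index()) \
        + p9.aes('YearDT', 'TUITIONFEE_IN', group='UNITID') \
        + p9.geom_line()
    fig.save('instate_tuition.png', width=10, height=8)  # hypothetical output file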
We can further modify the annotations by rotating the X-axis text (we do
this with ``p9.theme()``) and setting the X limits.

.. code:: ipython3

    p9.ggplot(col_samp.reset_index()) \
        + p9.aes('YearDT', 'TUITIONFEE_IN', group='UNITID') \
        + p9.geom_line() \
        + p9.theme(axis_text_x=p9.themes.element_text(rotation=45)) \
        + p9.labels.ggtitle("In-state tuition for large uni's") \
        + p9.labels.ylab('In-state tuition (USD)') \
        + p9.scale_x_date(limits=['2001', '2016'])

.. image:: images/cost_of_uni_50_0.png

We can add color by including two more aesthetic elements: the color and
alpha are set to expressions involving the UNITID, which makes UC Davis
show up in red. The color and alpha are given scales, using
``p9.scale_alpha_identity()`` and ``p9.scale_color_cmap()``, where the
cmap is the first argument. A scale determines how a number such as
``0.5`` is translated into a color. All Matplotlib colormaps are available
for use in Plotnine: https://matplotlib.org/users/colormaps.html

.. code:: ipython3

    p9.ggplot(col_samp.reset_index()) \
        + p9.aes('YearDT', 'TUITIONFEE_IN', group='UNITID',
                 alpha='.1+.9*(UNITID=={})'.format(dav_id),
                 color='1.*(UNITID=={})'.format(dav_id)) \
        + p9.scale_alpha_identity() \
        + p9.scale_color_cmap('bwr', guide=False) \
        + p9.geom_line() + p9.scale_x_date(limits=['2001', '2016']) \
        + p9.theme(axis_text_x=p9.themes.element_text(rotation=45)) \
        + p9.labels.ggtitle("In-state tuition for large uni's") \
        + p9.labels.ylab('In-state tuition (USD)')

.. image:: images/cost_of_uni_52_0.png

We can replicate the same plots for out-of-state tuition, which is much
larger and sits at higher quantiles than in-state tuition.

.. code:: ipython3

    tuitplot('TUITIONFEE_OUT', "Out-of-state tuition for large uni's",
             'Out-of-state tuition (USD)')

.. image:: images/cost_of_uni_54_0.png

.. code:: ipython3

    p9.ggplot(col_samp.reset_index()) \
        + p9.aes('YearDT', 'TUITIONFEE_OUT', group='UNITID',
                 alpha='.1+.9*(UNITID=={})'.format(dav_id),
                 color='1.*(UNITID=={})'.format(dav_id)) \
        + p9.scale_alpha_identity() \
        + p9.scale_color_cmap('bwr', guide=False) \
        + p9.geom_line() + p9.scale_x_date(limits=['2001', '2016']) \
        + p9.theme(axis_text_x=p9.themes.element_text(rotation=45)) \
        + p9.labels.ggtitle("Out-of-state tuition for large uni's") \
        + p9.labels.ylab('Out-of-state tuition')

.. image:: images/cost_of_uni_55_0.png

We can also facet on the state, which produces one panel per state. Since
this creates many small plots, we adopt a minimalist theme (called void)
and shrink the size of each plot.

.. code:: ipython3

    p9.ggplot(col_samp.reset_index()) \
        + p9.aes('YearDT', 'TUITIONFEE_OUT', group='UNITID',
                 alpha='.1+.9*(UNITID=={})'.format(dav_id),
                 color='1.*(UNITID=={})'.format(dav_id)) \
        + p9.scale_alpha_identity() \
        + p9.scale_color_cmap('bwr', guide=False) \
        + p9.geom_line() + p9.scale_x_date(limits=['2001', '2016']) \
        + p9.theme_void() \
        + p9.facet_wrap('~ STABBR', ncol=8) \
        + p9.labels.ggtitle("Out-of-state tuition for large uni's") \
        + p9.labels.ylab('Out-of-state tuition')

.. image:: images/cost_of_uni_57_0.png

We can also plot only a trend line for each state. This produces an
interesting plot that lets us quickly compare the OLS fit across states.
Because there is no issue with plotting too many geoms in this case, we
can fit on the entire dataset (we use ``col_large``).
.. code:: ipython3

    p9.ggplot(col_large.reset_index()) \
        + p9.aes('YearDT', 'TUITIONFEE_OUT', group='STABBR') \
        + p9.scale_x_date(limits=['2001', '2016']) \
        + p9.facet_wrap('~ STABBR', ncol=8) \
        + p9.stat_smooth(method='lm') \
        + p9.theme_void() \
        + p9.labels.ggtitle("Out-of-state tuition for large uni's by state") \
        + p9.labels.ylab('Out-of-state tuition')

.. image:: images/cost_of_uni_59_0.png

There are a few interesting takeaways from these plots. First, the bulk of
large universities have tuitions below 10,000 USD. Second, the rise in
tuition seems to be more severe for more expensive universities than for
less expensive ones. Third, the rise is more extreme in certain states,
particularly PA, NY, and MA.

Earnings trends
~~~~~~~~~~~~~~~~~~

The College Scorecard also reports the mean earnings of students who are
working and not enrolled 10 years after entry, encoded in the
MN_EARN_WNE_P10 variable. We will draw a scatterplot of this variable
against the in-state tuition for each university, adding the undergraduate
enrollment (UGDS) as color and size aesthetic elements. We then specify
the scales for the size and the color, and make it a scatterplot by adding
point geoms. This is done on the 2013 data.

.. code:: ipython3

    col_2013 = col_large.query('YearDT == "2013-01-01"')

.. code:: ipython3

    p9.ggplot(col_2013) + p9.aes('TUITIONFEE_IN', 'MN_EARN_WNE_P10',
                                 size='UGDS', color='UGDS') \
        + p9.scale_size_area(breaks=[10000, 20000, 40000]) \
        + p9.labels.ggtitle("Mean earnings as a function of in-state tuition") \
        + p9.geom_point(alpha=.5) + p9.scale_color_cmap('plasma', guide=False)

.. image:: images/cost_of_uni_63_0.png

We can also add linear regression lines to the plot. By grouping on
universities above and below 20,000 undergraduate students, we see that
larger universities tend to have a higher trend line.

.. code:: ipython3

    p9.ggplot(col_2013) + p9.aes('TUITIONFEE_IN', 'MN_EARN_WNE_P10',
                                 size='UGDS', color='UGDS',
                                 group='UGDS > 20000') \
        + p9.scale_size_area(breaks=[10000, 20000, 40000]) \
        + p9.labels.ggtitle("Mean earnings as a function of in-state tuition") \
        + p9.geom_point(alpha=.5) + p9.stat_smooth(show_legend=False) \
        + p9.scale_color_cmap('plasma', guide=False)

.. image:: images/cost_of_uni_65_0.png

In the above plot the scatter points are sized so that the area is
proportional to the university's undergraduate enrollment. Just from
observation, there seem to be two populations (likely public versus
private universities): in the left population, mean earnings rise more
steeply as a function of tuition, while the right population has a flatter
trend but contains some universities with significantly higher mean
earnings.

Other comparisons
~~~~~~~~~~~~~~~~~~

In this section we will look at other variables, such as the admission
rate and the highest degree that the institution grants. Let's focus on
the 2013 data with non-missing in-state tuition; we also drop schools
whose highest degree is coded 0, i.e. non-degree-granting institutions.

.. code:: ipython3

    col_2013_nna = col_2013[~col_2013['TUITIONFEE_IN'].isna()]
    col_2013_nna = col_2013_nna[col_2013_nna['HIGHDEG'] != 0]

We will begin by considering the highest degree granted (HIGHDEG) and
in-state tuition. Because HIGHDEG is categorical, it is natural either to
facet on it or to make it an aesthetic element of a univariate plot such
as a density estimate. We will have it determine the fill color for
density estimates of the tuition.
.. code:: ipython3

    p9.ggplot(col_2013_nna) + p9.aes('TUITIONFEE_IN', fill='factor(HIGHDEG)') \
        + p9.geom_density(alpha=.5) \
        + p9.labels.ggtitle('Density of in-state tuition by highest degree')

.. image:: images/cost_of_uni_70_0.png

We can also get a sense of the influence of the admission rate (ADM_RATE)
on mean earnings. We will use a point geom to get a scatterplot and add a
lowess smoother layer to see the trend. Because ADM_RATE is concentrated
near 1, it makes sense to use a log scale for the x-axis; this does not
change the axis text, but it does change the position of the point geoms.
We also add a color aesthetic element for in-state tuition.

.. code:: ipython3

    p9.ggplot(col_2013_nna) + p9.aes('ADM_RATE', 'MN_EARN_WNE_P10',
                                     color='TUITIONFEE_IN') \
        + p9.geom_point() + p9.scale_x_log10() \
        + p9.scale_color_cmap() + p9.stat_smooth(method='lowess')

.. image:: images/cost_of_uni_72_0.png

This indicates a close relationship between admission rate, tuition, and
mean earnings. From the generally positive trend of earnings as a function
of tuition, we might be tempted to conclude that attending a more
expensive university causes someone to earn more, but we need to consider
confounding variables such as the admission rate. These visualizations can
help us understand the complex dependencies in this data.

We can also look at the admission rates as a function of the highest
degree granted. For this we use the Seaborn package, which has a larger
selection of named plots than Pyplot. For example, ``sns.FacetGrid`` will
facet on one or two variables and make a grid of plots; in this case we
plot the density of the admission rate.

.. code:: ipython3

    import seaborn as sns
    g = sns.FacetGrid(col_2013_nna, row='HIGHDEG', aspect=2, height=1.5)
    sfig = g.map(sns.kdeplot, 'ADM_RATE')

.. image:: images/cost_of_uni_75_0.png

Another interesting named plot is the boxenplot (a letter-value plot),
which summarizes the distribution of the Y variable with nested boxes of
variable widths, grouping on the X variable. The following is another look
at in-state tuition by year.

.. code:: ipython3

    g = sns.boxenplot(x='Year', y='TUITIONFEE_IN', data=col_large)
    g.format_xdata = mpl.dates.DateFormatter('%Y-%m-%d')
    g.figure.autofmt_xdate()

.. image:: images/cost_of_uni_77_0.png

You can find many more examples of named plots at the Seaborn website:
https://seaborn.pydata.org/

**Note:** Plotnine is a good example of a Python package. It is well
organized and has extensive documentation. If you look at the Plotnine
source code, you can see the basic organization of the sub-modules.
Because the package has great docstrings, you can, for example, use
``help`` to see what the following class does.

.. code:: ipython3

    help(p9.themes.element_text)

.. parsed-literal::

    Help on class element_text in module plotnine.themes.elements:

    class element_text(builtins.object)
     |  element_text(family=None, style=None, weight=None, color=None, size=None, ha=None, va=None, rotation=None, linespacing=None, backgroundcolor=None, margin=None, \*\*kwargs)
     |
     |  Theme element: Text
     |
     |  Parameters
     |  ----------
     |  family : str
     |      Font family
     |  style : 'normal' | 'italic' | 'oblique'
     |      Font style
     |  color : str | tuple
     |      Text color
     |  weight : str
     |      Should be one of *normal*, *bold*, *heavy*, *light*,
     |      *ultrabold* or *ultralight*.
     |  size : float
     |      text size
     |  ha : 'center' | 'left' | 'right'
     |      Horizontal Alignment.
     |  va : 'center' | 'top' | 'bottom' | 'baseline'
     |      Vertical alignment.
     |  rotation : float
     |      Rotation angle in the range [0, 360]
     |  linespacing : float
     |      Line spacing
     |  backgroundcolor : str | tuple
     |      Background color
     |  margin : dict
     |      Margin around the text. The keys are one of
     |      ``['t', 'b', 'l', 'r']`` and ``units``. The units are
     |      one of ``['pt', 'lines', 'in']``. The *units* default
     |      to ``pt`` and the other keys to ``0``. Not all text
     |      themeables support margin parameters and other than the
     |      ``units``, only some of the other keys will apply.
     |  kwargs : dict
     |      Parameters recognised by :class:`matplotlib.text.Text`
     |
     |  Note
     |  ----
     |  :class:`element_text` will accept parameters that conform to the
     |  **ggplot2** *element_text* API, but it is preferable the
     |  **Matplotlib** based API described above.
     |
     |  Methods defined here:
     |
     |  __init__(self, family=None, style=None, weight=None, color=None, size=None, ha=None, va=None, rotation=None, linespacing=None, backgroundcolor=None, margin=None, \*\*kwargs)
     |      Initialize self. See help(type(self)) for accurate signature.
     |
     |  ----------------------------------------------------------------------
     |  Data descriptors defined here:
     |
     |  __dict__
     |      dictionary for instance variables (if defined)
     |
     |  __weakref__
     |      list of weak references to the object (if defined)
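As a quick application of the docstring above, here is a minimal sketch
that restyles the x-axis text of the earlier tuition plot using a few of
the documented parameters (the particular values are arbitrary):

.. code:: ipython3

    p9.ggplot(col_samp.reset_index()) \
        + p9.aes('YearDT', 'TUITIONFEE_IN', group='UNITID') \
        + p9.geom_line(alpha=.2) \
        + p9.theme(axis_text_x=p9.themes.element_text(rotation=45, size=8,
                                                      weight='bold'))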