Data and Computation
This chapter will serve two purposes:
- Introduce the broad definition of data that we will use throughout this book, and demonstrate how computer programs interact with data
- Provide a crash course in Python
Because we are assuming that you are already familiar with a sequential programming language such as C, we will not dwell on common syntax elements such as if statements and Boolean expressions.
Although there are many minor differences in how these common statements are used in Python, C, and Java, we will not go over them, and we hope that you can pick them up from the examples throughout the book.
Instead, we will devote our time to showing you what makes Python truly unique.
A good understanding of pythonic programming idioms is fundamental to reading and writing good Python code.
If you have trouble keeping up with the programming aspects of this chapter, then I recommend The Python Tutorial.
Two Tales of Data
To early statisticians, data primarily meant a sample from a population, such as the result of a survey or experiment. Sir Francis Galton (1822-1911) employed surveys to study human genetics and inheritance of ability [1]. The idea is that such data are a random draw from some larger population, and that by collecting them we can glean information about unknown quantities and properties of that population. Such surveys collected data with the express purpose of calculating statistics in order to estimate desired population-level quantities. Building on the work of Galton, statisticians such as R.A. Fisher (1890-1962) developed tools to quantify the uncertainty of statistics extracted from sampled data. Modern Statistics is an expanding field devoted to understanding the uncertainty inherent in statistical calculations and algorithms. Statistics are now extracted from every type of dataset imaginable, from text data to videos.
To the computer scientist, data is anything stored on the computer that is used as input or intermediate results for a program. Alan Turing (1912-1954) conceived of a computer, which we now call a Turing machine, that works by reading and writing data on a tape of arbitrary length. This served as an early model of a computer, and central to its usefulness as a theoretical construct is the idea that data can influence the state of the processing unit, and in turn the processor will write and alter the data. As we will see, the modern computer architecture is built up from this simple model, and understanding modern architectures is critical to being a successful data scientist.
One might be concerned that these two conceptions of data are in conflict. Where is randomness in the Computer Science (CS) notion of data? Why does the use of data as samples from a population seem to be more restrictive than data as anything stored in memory? In fact, the idea that data are anything that is stored in memory is not in conflict with the use of data in Statistics. Data are anything stored in memory, and statistics are anything that is extracted from data. Instead of thinking of a statistic as just a number, we often imbue it with a probability distribution, which enables us to quantify uncertainty. Looking forward, we should think of data both as an object in memory that has to be transferred, processed, and learned from, and as a random quantity. Data carries this randomness like baggage, which it unpacks to produce uncertainty and error, but luckily we can quantify this uncertainty with statistical reasoning.
Data in Python
In programming languages, a variable refers to a location in memory for a datum (a single number) and the name that we use to identify this location in our code.
In Python, we might initialize a variable by x = 8, which writes the integer 8 to a block in memory; x now references this location.
This differs from the idea of a random variable from probability (x doesn't necessarily have an associated probability distribution), and whenever possible we use the phrase random variable to distinguish the two.
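The idea that a name references a location in memory can be checked directly. The following is a minimal sketch; the second name y is introduced here purely for illustration.

```python
x = 8          # the name x now references an int object holding 8
y = x          # y references the same object; no copy is made
print(y is x)  # True: both names reference one object

x = 9          # rebinding x points it at a different object
print(x, y)    # 9 8: y still references the original 8
```

Assignment never changes the object that another name refers to; it only changes what the assigned name points at.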
Data is anything that we can record and process to extract information, and consequently, we will consider all variables to reference data.
This liberal understanding of data can be befuddling at first.
How can we think of name = "Herbert Hoover" as a datum?
Consider the first 15 rows of the following dataframe (a table of data). The full table contains the length, in number of words, of the State of the Union addresses for each U.S. president [Pres].
date | name | words |
---|---|---|
January 8, 1790 | George Washington | 1089 |
December 8, 1790 | George Washington | 1401 |
October 25, 1791 | George Washington | 2302 |
November 6, 1792 | George Washington | 2101 |
December 3, 1793 | George Washington | 1968 |
November 19, 1794 | George Washington | 2918 |
December 8, 1795 | George Washington | 1989 |
December 7, 1796 | George Washington | 2871 |
November 22, 1797 | John Adams | 2063 |
December 8, 1798 | John Adams | 2218 |
December 3, 1799 | John Adams | 1505 |
November 22, 1800 | John Adams | 1372 |
December 8, 1801 | Thomas Jefferson | 3224 |
December 15, 1802 | Thomas Jefferson | 2197 |
October 17, 1803 | Thomas Jefferson | 2263 |
Now it makes sense that “Herbert Hoover” is a datum, since it is an element of this dataframe. We can even think of probability distributions over the name variable since we can think of randomly sampling the rows of this table, making this a random variable in some contexts. Data can be webpages, images, earnings reports, religious scripture, etc.
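To make the idea of sampling rows concrete, here is a small sketch that treats the name column of the 15 rows shown above as a random variable. It uses only the standard library; the variable names and the seed are assumptions made for illustration, and the full table would of course give different proportions.

```python
import random
from collections import Counter

# The name column of the 15 rows shown above.
names = (["George Washington"] * 8
         + ["John Adams"] * 4
         + ["Thomas Jefferson"] * 3)

# Distribution of name when one row is drawn uniformly at random.
counts = Counter(names)
distribution = {name: count / len(names) for name, count in counts.items()}
print(distribution)

# Draw a single row; the seed is arbitrary and only makes the draw repeatable.
rng = random.Random(0)
sampled_name = rng.choice(names)
print(sampled_name)
```

Under uniform sampling of these rows, the probability of drawing "George Washington" is 8/15, since he accounts for 8 of the 15 rows.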
Python has built-in variable types that for the most part can be categorized into numeric types, sequences, mappings, classes, instances, and exceptions. Numeric types include booleans (True and False), integers, floats, and complex numbers. Sequences include strings, lists, and tuples, while mappings are dictionaries. Exceptions are how errors are handled (unless they are syntax errors). Classes and instances enable object-oriented programming in Python: classes are custom types that you can specify in code, while instances are specific objects of that class. In this chapter, we will focus only on numeric types and sequences; we will see the other variable types crop up in later chapters. We will not go into all of the details of the Python language syntax, but I will highlight some useful tools from the standard library, pythonic concepts, and how these relate to data science.
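As a quick illustration of these categories, the sketch below asks Python for the type of one value from each of the numeric, sequence, and mapping groups. The particular values are arbitrary examples.

```python
examples = [
    True,                # boolean
    1089,                # integer
    2079.875,            # float
    1 + 2j,              # complex
    "Herbert Hoover",    # string
    [1089, 1401],        # list
    (1089, 1401),        # tuple
    {"words": 1089},     # dictionary (a mapping)
]

type_names = [type(value).__name__ for value in examples]
for name, value in zip(type_names, examples):
    print(name, value)

# Booleans are grouped with the numeric types because bool is a subclass of int.
print(issubclass(bool, int))  # True
```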
The words column in the table above contains only integers. Integers are a numeric type, along with floats (approximations of real numbers) and booleans (True and False). In order to work with one datum in the words column at a time, we could of course define specific variables:
word1 = 1089
word2 = 1401
average = (word1 + word2) / 2
print(average)
If you save these lines in a file named average.py and run $ python average.py in the command line, it will output 1245.0.
(Celebrations are in order since we have computed our first statistic.)
What happened is that we saved two integers and averaged them.
It is important to note the specific output here because it is one of the most visible differences between Python 2 and Python 3.
Two of our numeric types, integers and floats, appear in the code above; word1 and word2 are integers and average will be a float.
This is because in Python 3, the division operator (/) always returns a float, even when both operands are integers, hence the decimal place in the output. In Python 2, dividing two integers with / returned an integer, so the same code would have printed 1245.
You can use the operator // for integer (floor) division, which rounds the quotient down, as in (word1 + word2) // 2.
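The difference between the two division operators can be seen side by side in the following short sketch, which reuses word1 and word2 from above; the remaining names are illustrative.

```python
word1 = 1089
word2 = 1401

true_div = (word1 + word2) / 2     # true division always gives a float
floor_div = (word1 + word2) // 2   # floor division of two ints gives an int
print(true_div, floor_div)         # 1245.0 1245

# Floor division rounds toward negative infinity, not toward zero.
odd_floor = (word1 + word2 + 1) // 2   # 2491 // 2
neg_floor = -7 // 2
print(odd_floor, neg_floor)            # 1245 -4
```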
There are many other common operators for the numeric and Boolean types; if you are not familiar with them, you can consult https://docs.python.org/3/library/stdtypes.html.
Because these operators are common to the major sequential programming languages, we will not go over them here, and will instead focus on the elements that make Python unique.
Unlike C or Java, Python is dynamically typed, which means that we do not declare in advance that average will be a float; Python has to figure this out on the fly.
Types belong to objects rather than to variable names, so the interpreter determines the type of each result, and allocates memory for it, while the program is running.
Because the code is interpreted as it runs rather than compiled ahead of time, we can use Python interactively instead of writing separate code and running it from the command line as we did for average.py.
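A small sketch of what dynamic typing means in practice: the same name can refer to objects of different types over the life of a program, and the type is only known once the code runs. The list seen_types is a hypothetical helper used to record what happens.

```python
seen_types = []

average = (1089 + 1401) // 2          # floor division of ints gives an int
seen_types.append(type(average).__name__)

average = (1089 + 1401) / 2           # true division gives a float
seen_types.append(type(average).__name__)

average = "not yet computed"          # the same name can even hold a string
seen_types.append(type(average).__name__)

print(seen_types)  # ['int', 'float', 'str']
```

This flexibility is convenient at the prompt, but it also means that type mistakes surface only when the offending line actually runs.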
We used the Python interpreter to run the code in the file average.py, but we can also use Python in interactive mode.
I could run each of these lines in IPython by typing $ ipython into the command line and entering each line at the prompt.
This is effectively using Python as a slightly sophisticated calculator.
Often I will use IPython to test out code snippets, use the magic command %save to write lines from the session to a temporary file, and then move these lines to a module that I am working on.
Then, using the %run magic command, I run the module to import the functions that I have just written.
In this way, I have my favorite editor, emacs, open with the module that I am working on and the temporary code file from IPython, alongside IPython for testing and debugging the code.
Lists and slicing
Unfortunately, the code above can only average two numbers; to overcome this limitation we need to introduce lists. At the ipython prompt we can type the following lines of Python code and get the output in real time:
In [1]: words = [1089, 1401, 2302, 2101, 1968, 2918, 1989, 2871]
In [2]: word_sum = sum(words)
In [3]: word_sum / len(words)
Out[3]: 2079.875
The first line defines the words list, which contains the State of the Union address lengths for Washington.
It is cumbersome to maintain a new variable for each datapoint, which is why we use lists and arrays in Python.
The list is an example of a sequence variable type in Python, because it contains a sequence of elements of any variable type.
Lists are good containers for repeated variables because we can modify the length, append elements, and select subsets of data (called slicing).
They can even contain mixed types of data.
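The list operations just mentioned can be sketched as follows, using a few of Washington's address lengths. The names mixed and mixed_types are introduced only for this illustration.

```python
words = [1089, 1401, 2302]

words.append(2101)           # add one element to the end, in place
words.extend([1968, 2918])   # add several elements at once
print(len(words))            # 6

# A list can hold mixed types, for example integers and a string.
mixed = words + ['NA']
mixed_types = [type(item).__name__ for item in mixed]
print(mixed_types)
```

Note that append and extend modify the existing list, while + builds a new list and leaves words unchanged.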
Consider what happens if we have a missing datum, and encode it with the string 'NA'.
To deal with this, we will make our own averaging function, which in Python is defined using def.
In [4]: words = [1089, 1401, 2302, 'NA', 2101, 1968, 2918, 1989, 2871]
In [5]: def average(word_list):
   ...:     """Average a list"""
   ...:     csum = 0
   ...:     num = 0
   ...:     for wlen in word_list:
   ...:         if type(wlen) == int:
   ...:             csum += wlen
   ...:             num += 1
   ...:     return csum / num
   ...:
In [6]: average(words)
Out[6]: 2079.875
It is common in real data to have missing data encoded as strings such as this, or as anomalous values like 999 (this is bad practice, but you may have to deal with it nonetheless).
As a solution, we decided to catch any instance of a non-integer and not include it in the sum.
We used the built-in function type, which will tell you the type of an object.
This is useful when you don’t know the type of the object that a function or method has returned.
You can check the type on the fly and then look for documentation for this type to see what to do next.
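The same idea can be extended to anomalous codes such as 999. The sketch below is one possible variant, not the function used elsewhere in this chapter: the name average_skip_missing and the sentinels parameter are hypothetical, and treating 999 as missing is an assumption that would wrongly drop a genuine value of 999.

```python
def average_skip_missing(values, sentinels=(999,)):
    """Average the numeric entries of values, skipping missing-data codes."""
    total = 0
    count = 0
    for value in values:
        # bool is a subclass of int, so exclude it explicitly.
        is_number = (isinstance(value, (int, float))
                     and not isinstance(value, bool))
        if is_number and value not in sentinels:
            total += value
            count += 1
    if count == 0:
        raise ValueError("no numeric values to average")
    return total / count


words = [1089, 1401, 2302, 'NA', 2101, 1968, 2918, 999, 1989, 2871]
result = average_skip_missing(words)
print(result)  # 2079.875
```

Here isinstance is used instead of comparing type(...) to int so that floats are also accepted; raising an error on an empty input avoids a less informative ZeroDivisionError.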
Aside: In the above code, we also saw the use of def, for, if, and return. These are standard in sequential programming, and if you want a detailed introduction to logic, control flow, and functions you can skim through the control flow tutorial in the Python documentation. We will discuss some of the ways that for loops are used in Python later in this chapter. You may notice that I added a documentation string (doc string) in the line after the def statement. You can access the doc string for a function using the built-in function help. It is best practice to add doc strings to your functions, and you can find the conventions in PEP 257. The following is shown when you enter help(average) into the ipython prompt:
Help on function average in module __main__:
average(word_list)
Average a list
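The doc string is also available programmatically. As a minimal sketch, the version of average below carries a multi-line doc string in the style suggested by PEP 257 (a one-line summary, a blank line, then details); the wording of the doc string and the use of a list comprehension are choices made for this illustration.

```python
import inspect


def average(word_list):
    """Average the integer entries of a list.

    Entries that are not of type int, such as the string 'NA', are skipped.
    """
    ints = [wlen for wlen in word_list if type(wlen) == int]
    return sum(ints) / len(ints)


# inspect.getdoc returns the doc string with its indentation cleaned up.
summary_line = inspect.getdoc(average).splitlines()[0]
print(summary_line)  # Average the integer entries of a list.
```

The raw text is stored in the attribute average.__doc__, which is what help reads when it builds its output.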
Slicing is when you subselect elements of a sequence, like a list. Consider the following slices of the words list:
In [7]: words = [1089, 1401, 2302, 'NA', 2101, 1968, 2918, 1989, 2871]
In [8]: words[3]
Out[8]: 'NA'
In [9]: words[:4]