Data and Computation

This chapter will serve two purposes:

  1. Introduce the broad definition of data that we will use throughout this book, and demonstrate how computer programs interact with data
  2. Provide a crash course in Python

Because we assume that you are already familiar with a sequential programming language such as C, we will not dwell on common syntax elements such as if statements and Boolean expressions. Minor differences abound in the precise use of these statements in Python, C, and Java, but we will not go over them and hope that you can learn from the examples throughout the book. Instead, we will devote our time to showing you what makes Python truly unique. A good understanding of pythonic programming idioms is fundamental to reading and writing good Python code. If you have trouble keeping up with the programming aspects of this chapter, then I recommend The Python Tutorial.

Two Tales of Data

To early statisticians, data primarily meant a sample from a population, such as the result of a survey or experiment. Sir Francis Galton (1822-1911) employed surveys to study human genetics and the inheritance of ability [1]. The idea is that this data is a random draw from some larger population, and that by collecting such data we can glean information about unknown quantities and properties. Such surveys collected data with the express purpose of calculating statistics in order to estimate desired population-level quantities. Building on the work of Galton, statisticians such as R.A. Fisher (1890-1962) developed tools to quantify the uncertainty of statistics extracted from sampled data. Modern statistics is an expanding field devoted to understanding the uncertainty inherent in statistical calculations and algorithms. Statistics are now extracted from every type of dataset imaginable, from text data to videos.

To the computer scientist, data is anything stored on the computer that is used as input or intermediate results for a program. Alan Turing (1912-1954) conceived of a computer, which we now call a Turing machine, that works by reading and writing data on a tape of arbitrary length. This served as an early model of a computer, and central to its usefulness as a theoretical construct is the idea that data can influence the state of the processing unit, and in turn the processor will write and alter the data. As we will see, the modern computer architecture is built up from this simple model, and understanding modern architectures is critical to being a successful data scientist.

One might be concerned that these two conceptions of data are in conflict. Where is the randomness in the Computer Science (CS) notion of data? Why does the use of data as samples from a population seem more restrictive than data as anything stored in memory? In fact, the idea that data are anything stored in memory is not in conflict with the use of data in Statistics. Data are anything stored in memory, and statistics are anything that is extracted from data. Instead of thinking of a statistic as just a number, we often imbue it with a probability distribution, which enables us to quantify uncertainty. Looking forward, we should think of data both as an object in memory that has to be transferred, processed, and learned from, and as a random quantity. Data carries this randomness like baggage that it unpacks to produce uncertainty and error, but luckily we can quantify this uncertainty with statistical reasoning.

Data in Python

In programming languages, a variable refers to a location in memory for a datum (such as a single number) and the name that we use to identify this location in our code. In Python, we might initialize a variable by x=8, which writes the integer 8 to a block in memory; x now references this location. This differs from the idea of a random variable from probability (x doesn't necessarily have an associated probability distribution), and whenever possible we use the phrase random variable to distinguish the two. Data is anything that we can record and process to extract information, and consequently, we will consider all variables to reference data. This liberal understanding of data can be befuddling at first. How can we think of name = "Herbert Hoover" as a datum?
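As a minimal sketch of the idea that a name references a location in memory, we can use the built-in function id, which returns an identifier for the underlying object:

```python
x = 8                  # write the integer 8 to memory; the name x references it
y = x                  # y now references the same object; nothing is copied
same = id(x) == id(y)  # True: both names point to the same object
x = 9                  # rebinding x to a new object does not change y
print(same, x, y)      # True 9 8
```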

Consider the first 15 rows of the following dataframe (a table of data). The full table contains the length in number of words of the State of the Union addresses for each U.S. president [Pres].

State of the Union Length
date name words
January 8, 1790 George Washington 1089
December 8, 1790 George Washington 1401
October 25, 1791 George Washington 2302
November 6, 1792 George Washington 2101
December 3, 1793 George Washington 1968
November 19, 1794 George Washington 2918
December 8, 1795 George Washington 1989
December 7, 1796 George Washington 2871
November 22, 1797 John Adams 2063
December 8, 1798 John Adams 2218
December 3, 1799 John Adams 1505
November 22, 1800 John Adams 1372
December 8, 1801 Thomas Jefferson 3224
December 15, 1802 Thomas Jefferson 2197
October 17, 1803 Thomas Jefferson 2263

Now it makes sense that “Herbert Hoover” is a datum, since it is an element of this dataframe. We can even think of probability distributions over the name variable since we can think of randomly sampling the rows of this table, making this a random variable in some contexts. Data can be webpages, images, earnings reports, religious scripture, etc.

Python has built-in variable types that for the most part can be categorized into numeric types, sequences, mappings, classes, instances, and exceptions. Numeric types include booleans (True and False), integers, floats, and complex numbers. Sequences include strings, lists, and tuples, while mappings are dictionaries. Exceptions are how errors are handled (unless they are syntax errors). Classes and instances enable object-oriented programming in Python: classes are custom types that you can specify in code, while instances are specific variables of that class. In this chapter, we will focus only on numerical types and sequences; the other variable types will crop up in later chapters. We will not go into all of the details of the Python language syntax, but I will highlight some useful tools from the standard library, pythonic concepts, and how these relate to data science.

The words column in the table above contains all integers. Integer is a numerical type, along with floats (real numbers) and booleans (True and False). To work with one datum in the words column at a time, we could of course define specific variables:

average.py
word1 = 1089
word2 = 1401
average = (word1 + word2) / 2
print(average)

Then if you run $ python average.py in the command line it will output 1245.0. (Celebrations are in order, since we have computed our first statistic.) What happened is that we saved two integers and averaged them. It is important to note the specific output here because it is one of the most visible differences between Python 2 and Python 3. Two of our numerical types, integers and floats, appear in the code above; word1 and word2 are integers and average will be a float. This is because in Python 3, division with / always outputs a float type, hence the decimal place in the output. You can use the operator // for floor division, which outputs the division rounded down, as in (word1 + word2) // 2. There are many other common operators for the numerical and Boolean types; if you are not familiar with them, you can consult https://docs.python.org/3/library/stdtypes.html. Because these operators are common to the major sequential programming languages, we will not go over them here, and will instead focus on the elements that make Python unique.
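A quick sketch of the two division operators, continuing the example above:

```python
word1, word2 = 1089, 1401

true_div = (word1 + word2) / 2    # true division always yields a float in Python 3
floor_div = (word1 + word2) // 2  # floor division rounds down and yields an int here

print(true_div, floor_div)  # 1245.0 1245
print(7 / 2, 7 // 2)        # 3.5 3
```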

Unlike C or Java, Python is dynamically typed, which means that we do not pre-specify that average will be a float; the interpreter figures this out on the fly, allocating memory for a float only when the assignment is actually executed. Because types are resolved at runtime rather than declared at compile time, we can use Python interactively instead of writing separate code and running it from the command line as we did for average.py.

We used the Python interpreter to run the code in the file average.py, but we can also use Python in interactive mode. I could run each of these lines in ipython by typing $ ipython into the command line and entering each line into the prompt. This is effectively using Python as a slightly sophisticated calculator. Often I will use ipython to test out code snippets, use the magic command %save to write lines from ipython to a temporary file, and then move these to a module that I am working on. Then, using the %run magic command, I run the module to import the functions that I have just written. In this way, I have my favorite editor, emacs, open with the module that I am working on and the temporary code file from ipython, alongside ipython for testing and debugging the code.

Lists and slicing

The code above can only be used to average two numbers, unfortunately, and to overcome this limitation we will need to introduce lists. At the ipython prompt we can type the following lines of Python code and get the output in real time:

In [1]: words = [1089, 1401, 2302, 2101, 1968, 2918, 1989, 2871]

In [2]: word_sum = sum(words)

In [3]: word_sum / len(words)
Out[3]: 2079.875

The first line defines the words list, which contains the State of the Union address lengths for Washington. It is cumbersome to maintain a new variable for each datapoint, which is why we use lists and arrays in Python. The list is an example of a sequence type in Python; it contains a sequence of elements of any variable type. Lists are good containers for repeated variables because we can modify the length, append elements, and select subsets of the data (called slicing). They can even contain mixed types of data. Consider what happens if we have a missing datum encoded with the string 'NA'. To deal with this, we will make our own averaging function, which in Python is defined using def.

In [4]: words = [1089, 1401, 2302, 'NA', 2101, 1968, 2918, 1989, 2871]

In [5]: def average(word_list):
   ...:     """Average a list"""
   ...:     csum = 0
   ...:     num  = 0
   ...:     for wlen in word_list:
   ...:         if type(wlen) == int:
   ...:             csum += wlen
   ...:             num += 1
   ...:     return csum / num
   ...: 

In [6]: average(words)
Out[6]: 2079.875

It is common in real data to have missing values encoded as strings such as this, or as anomalous values like 999. (This is bad practice, but you may have to deal with it nonetheless.) As a solution, we decided to catch any instance of a non-integer and not include it in the sum. We used the built-in function type, which returns the type of an object. This is useful when you don't know the type of the object that a function or method has returned: you can check the type on the fly and then look up the documentation for that type to see what to do next.
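The check type(wlen) == int works here, but isinstance is the more idiomatic test; a sketch of the same function using it (with one caveat worth knowing: booleans are a subclass of int in Python, so isinstance(True, int) is True and a stray boolean would be counted):

```python
def average(word_list):
    """Average the integer entries of a list, skipping non-integers."""
    csum = 0
    num = 0
    for wlen in word_list:
        if isinstance(wlen, int):  # preferred over type(wlen) == int
            csum += wlen
            num += 1
    return csum / num

print(average([1089, 1401, 'NA', 2302]))  # averages the three integers
```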

Aside: In the above code, we also saw the use of def, for, if, and return. These are standard in sequential programming, and if you want a detailed introduction to logic, control flow, and functions you can skim through the control flow section of the Python tutorial. We will discuss some of the ways that for loops are used in Python later in this chapter. You may notice that I added a documentation string in the line after the def command. You can access the docstring for a function using the built-in function help. It is best practice to add docstrings to your functions, and you can find docstring conventions in PEP 257. The following is shown when you enter help(average) into the ipython prompt:
Help on function average in module __main__:

average(word_list)
    Average a list

Slicing is when you subselect elements of a sequence, like a list. Consider the following slices of the words list:

In [7]: words = [1089, 1401, 2302, 'NA', 2101, 1968, 2918, 1989, 2871]

In [8]: words[3]
Out[8]: 'NA'

In [9]: words[:4]
Out[9]: [1089, 1401, 2302, 'NA']

In [10]: words[0:4]
Out[10]: [1089, 1401, 2302, 'NA']

In [11]: words[-1]
Out[11]: 2871

In [12]: words[:-1]
Out[12]: [1089, 1401, 2302, 'NA', 2101, 1968, 2918, 1989]

In [13]: words[-3:]
Out[13]: [2918, 1989, 2871]

In [14]: words[::2]
Out[14]: [1089, 2302, 2101, 2918, 2871]

In [15]: words[::-1]
Out[15]: [2871, 1989, 2918, 1968, 2101, 'NA', 2302, 1401, 1089]

Line [8] selects the fourth element even though we are asking for index 3; if you are not accustomed to zero-based numbering, this can cause lots of headaches. The idea is that the first element of the list corresponds to index 0, so the fourth is index 3. One advantage of this appears on lines [9] and [10], where we see our first slice of the first four elements. Because of zero-based numbering, the 4 in the slice :4 is also the number of elements returned. [9] and [10] do the same thing, because the 0 in 0:4 may be omitted. Negative indexing is allowed, as in [11], which is handy when you want the last element, and you can also combine it with slicing as in [12] and [13] (notice that we did not need to specify the -1, as in -3:-1). You can also specify a stride, the number of indices to move by in the slice, to get every other element, [14], or reverse the list, [15]. In general, the slice a:b:c starts at index a, stops before index b, and moves c indices at each step.
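The full a:b:c pattern can be seen in one place on the same words list:

```python
words = [1089, 1401, 2302, 'NA', 2101, 1968, 2918, 1989, 2871]

every_third = words[1:8:3]  # indices 1, 4, 7
backwards = words[8:2:-2]   # negative stride: indices 8, 6, 4
print(every_third)  # [1401, 2101, 1989]
print(backwards)    # [2871, 2918, 2101]
```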

Code blocks and indentation

In my code above, the def was followed by indented code, which defines a code block, a functional unit of code. The indentation here is not optional in Python, which is a perpetual source of frustration to new Python users. In C and R, blocks are defined using curly brackets {} and indentation is a recommended practice. In Python, indentation is enforced, and the colon, :, after the def, for, and if statements indicates the start of a new, more deeply indented block. You end a block by reducing the indentation. If you break these rules, the interpreter will throw an error.

In [16]: for i in range(10):
   ....:     j = i * 10
   ....:         print(i)
   ....: 
  File "<ipython-input-16-496ab58be0fb>", line 3
    print(i)
    ^
IndentationError: unexpected indent
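Keeping every line of the block at one consistent indentation level fixes the error:

```python
for i in range(3):
    j = i * 10  # both statements share the block's indentation
    print(j)    # prints 0, then 10, then 20
```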

Why does python do this? Let us consult the Zen of Python PEP-20:

“Readability counts.” —Zen of Python

The indentation makes this code much more readable. This comes at a cost: code with many nested blocks will have some extreme indentation. But again the Zen of Python steps in,

“Simple is better than complex…

Flat is better than nested.” —Zen of Python

It is better to separate complex code out into functions with clear documentation. This also keeps you from writing many nested for loops, which are typically very inefficient and slow in Python. We will prefer to write short, simple code that relies heavily on Python modules such as numpy and scipy.

Tuples and immutability

Tuples are another form of sequence; they are initialized by wrapping the elements in parentheses (), as opposed to brackets [] for lists. So, for example, ("dog","german shorthaired pointer","Buckley") initializes a tuple. The tuple is an example of an immutable type in Python, which means that the variable does not allow modification after its definition. So if we attempt to modify the second entry,

In [17]: row = ("January 8, 1790","George Washington",1089)

In [18]: row[1] = "G. Washington"
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-18-cf35dd03f579> in <module>()
----> 1 row[1] = "G. Washington"

TypeError: 'tuple' object does not support item assignment

a TypeError is thrown. Nevertheless, ("January 8, 1790","George Washington",1089) is a good use of a tuple, because these variables belong together as the record—a row—in a data table. Generally, you should think of tuples as records and lists as repeated variables.

Tuples also support an idiom called tuple unpacking. The idea is that we can assign values to multiple variables at once:

In [19]: x, y = 1, 2 #assign values

In [20]: y, x = x, y #swap values

In [21]: r, (x, y) = (x**2 + y**2)**0.5, (y, x) #nested value assignment and swapping

In [22]: print(x, y, r)
1 2 2.23606797749979

Line [19] allowed us to define the variables x and y on the same line; the only requirement is that the right-hand side is a tuple of the same length as the left. Line [20] shows that you can use this idiom to swap values (this would normally take three lines and a temporary variable). Line [21] shows that we can apply this to nested tuples, which can be useful if a function returns this type of data. Did you predict the output of [22]?
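Tuple unpacking also works in for loops, which pairs nicely with the records-as-tuples convention; a sketch using two hypothetical rows from the dataframe above:

```python
rows = [("January 8, 1790", "George Washington", 1089),
        ("December 8, 1790", "George Washington", 1401)]

total = 0
for date, name, words in rows:  # each record unpacks into three names
    total += words

print(total)  # 2490
```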

Processing Data, Latency, and Iterables

Strings and Input/Output (IO)

Strings are sequences of characters and are immutable as well. We have already seen instances such as s = "George Washington", and like other sequences we can slice them, as in s[:6]. Strings support concatenation with the operator + and repetition with the operator *. The fact that these operators are defined differently for strings than for integers and floats is called operator overloading. In this subsection, we will go over the following string methods: join, strip, split, replace, format. You can find these and other string methods, such as lower and upper, in the Python text documentation.
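A brief sketch of these string operators and a couple of the methods:

```python
s = "George Washington"

shout = s[:6] + "!" * 3             # slicing, concatenation, and repetition
abbrev = s.replace("George", "G.")  # replace swaps one substring for another
print(shout)      # George!!!
print(abbrev)     # G. Washington
print(s.lower())  # george washington
```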

As another example, we will look at a wine quality dataset, available at the UCI machine learning repository. We will use Linux to anticipate the size of this dataset and look at its first several lines. Let's use the bash command ls -l to see the size of our file.

$ ls -l ../data/winequality-red.csv

The output is as follows.

-rw-r--r-- 1 jsharpna jsharpna 84199 Oct 16  2009 ../data/winequality-red.csv

We can see from the fifth field that the file is 84,199 bytes (roughly 84 kB), so it will not be a problem to load it into memory. Let's look at the top of the file using the bash command head,

$ head ../data/winequality-red.csv

which has the following output.

"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.8;0.88;0;2.6;0.098;25;67;0.9968;3.2;0.68;9.8;5
7.8;0.76;0.04;2.3;0.092;15;54;0.997;3.26;0.65;9.8;5
11.2;0.28;0.56;1.9;0.075;17;60;0.998;3.16;0.58;9.8;6
7.4;0.7;0;1.9;0.076;11;34;0.9978;3.51;0.56;9.4;5
7.4;0.66;0;1.8;0.075;13;40;0.9978;3.51;0.56;9.4;5
7.9;0.6;0.06;1.6;0.069;15;59;0.9964;3.3;0.46;9.4;5
7.3;0.65;0;1.2;0.065;15;21;0.9946;3.39;0.47;10;7
7.8;0.58;0.02;2;0.073;9;18;0.9968;3.36;0.57;9.5;7

When reading data, the strip and split methods are very helpful: strip allows us to remove unnecessary trailing and leading characters, and split will split the string on a given character. Before we use these methods, we need to read in the data. The open function creates a variable which allows us to read the file line by line. The first argument is the file location, and the second determines whether we are reading, 'r', or writing, 'w'.

In [23]: data_folder = '../data/'

In [24]: winefile = open(data_folder + 'winequality-red.csv','r')

In [25]: header = winefile.readline()

In [26]: winefile.close()

winefile has a readline() method, which returns the next line of the file as a string. We then close the file, since we are done with it for now. Closing files is important, especially when you are writing to a file.

In [27]: header
Out[27]: '"fixed acidity";"volatile acidity";"citric acid";"residual sugar";"chlorides";"free sulfur dioxide";"total sulfur dioxide";"density";"pH";"sulphates";"alcohol";"quality"\n'

We can see that this string has delimiters of ; and a trailing newline \n.

In [28]: header = header.strip()

In [29]: names = header.split(';')

strip can take a string argument, in which case it removes any leading or trailing characters that appear in that string; by default it removes whitespace (newlines, spaces, etc.). split creates a list of strings by splitting on the delimiter ;, which is handy when reading delimited text data such as CSV files. names now holds the variable names, with extra quotation marks around them. Suppose that we would like to create a string that describes the dataset in terms of the variables.

In [30]: name_st = ', '.join(names)

In [31]: print('The variable names are {}'.format(name_st))
The variable names are "fixed acidity", "volatile acidity", "citric acid", "residual sugar", "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density", "pH", "sulphates", "alcohol", "quality"

join is the preferred way to combine a list of strings; think of it as the reverse of split. format is how you insert other strings into a larger string, replacing instances of {} with its arguments. There are alternatives to the format method, such as formatted string literals, and you can read the Python IO tutorial for a more complete picture.
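With a formatted string literal (an f-string, available since Python 3.6), the same print reads as follows; the names list here is a shortened sample of the wine header:

```python
names = ['"fixed acidity"', '"volatile acidity"', '"quality"']  # shortened sample
name_st = ', '.join(names)
message = f'The variable names are {name_st}'  # the braces are evaluated in place
print(message)
```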

Reading data

Data can be stored in different places on your computer or network, and understanding where your data is and where it needs to go is critical to successfully processing it. Typically, your data will be on a hard drive or solid state drive on your PC, a server, or a network drive. When you assign a variable in Python, that data is written to main memory (Random Access Memory, or RAM). For standard computer architectures, data must be in main memory for it to be processed by the central processing unit (CPU). Accessing main memory is much faster than accessing hard drives, but main memory is also typically much smaller. Massive datasets are often too large to be stored on a single machine and need to be distributed across network drives. To work with that data, it must be routed through a data network, resulting in significantly slower data transfer speeds. The CERN Large Hadron Collider in Geneva, Switzerland, has stored over 200 petabytes in a library of very high capacity tape drives.

Your program will have a compute time and a memory allocation. You will see the effects of poor computational efficiency when you have to wait for your process to terminate. Typically, you will not notice a poor use of memory until you run out of main memory, at which point your process will stall because the operating system starts allocating space on the hard drive (called swap memory). A statistic is anything that is extracted from your data, and if you know the statistics that you want to compute, you can often do so without loading all of your data at once. Let's work through an example illustrating this principle.

The Scrabble dictionary is a highly extensive list of words in the English language, without any definitions. (Scrabble is a game in which you place letters on a board to spell words and earn points.) Master Scrabble players memorize the Scrabble dictionary, which means they know the precise spelling of many obscure words without necessarily knowing their meaning. Let's look at the head of one such dictionary.

In [32]: !head ../data/sowpods.txt
AA
AAH
AAHED
AAHING
AAHS
AAL
AALII
AALIIS
AALS
AARDVARK

For this exercise, let’s define an ordered word as a word in which each letter is followed only by itself or letters that are further in the alphabet. So for example, HISSY is ordered because H is before I is before S, etc. On the other hand BANANA is not ordered because B comes after A. Our goal is to find the longest ordered words in the scrabble dictionary.

Let’s write one function to read the data and output the scrabble dictionary as a list of strings.

def read_dictionary(filename="../data/sowpods.txt"):
    """create a list of words in the scrabble dictionary"""
    with open(filename,'r') as scrabblefile:
        scrabble_dict = [word.strip() for word in scrabblefile]
    return scrabble_dict

There are a few syntax elements to note here, and one important idiom which will let us talk about iterables. Recall that open(filename,'r') opens a file object that reads the file at filename. (We use the with ... as ...: idiom, which in this case will automatically close the file once we are done with it.) The file object returned by open is an example of an iterable object, which means that it can be used in a for loop. So far we have seen for loops over lists, which are also iterables, and you can roughly think of iterables as objects that can be looped over. Technically, an iterable is anything that can be turned into an iterator via the iter function, which is done implicitly when the for loop is initiated. We will return to iterators in general, but for now just think of them as anything with which you can create a for loop. In the above code block we also see our first list comprehension, which is denoted by the for loop wrapped in brackets []. There is no idiom more pythonic than the list comprehension, and it will be featured heavily in our code.

List comprehension and generator expressions

We are using a list comprehension when we write [word.strip() for word in scrabblefile], which creates a list of strings of stripped words in the scrabble dictionary. To understand what is going on here, let’s consider a block that would do roughly the same thing as this one line:

scrabble_dict = []
for word in scrabblefile:
    scrabble_dict.append(word.strip())

So far we have only used lists in for loops, but you can also use file objects. Just like the for loop above, the list comprehension can take anything that is iterable; it iteratively binds each element to word and then applies strip, which removes the trailing whitespace. The list comprehension can be thought of as a mapping that takes the elements of the iterable and applies some transformation. You can also filter elements with the list comprehension, as in

scrabble_dict = [word.strip() for word in scrabblefile if len(word.strip()) > 3]

which returns the words in the scrabble dictionary that have more than 3 letters (we strip each word before measuring its length so that the trailing newline does not count). If we knew that all ordered words of interest in the scrabble dictionary were longer than 3 letters (as in HISSY), then this list would be sufficient for our purposes.
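The filtering form can be run on a small stand-in for the file object (a list of newline-terminated strings, as iterating over a file would yield):

```python
scrabblefile = ["AA\n", "AAH\n", "AAHED\n", "AAHING\n"]  # stand-in for the open file

# keep only words with more than 3 letters, stripping the newline first
long_words = [word.strip() for word in scrabblefile if len(word.strip()) > 3]
print(long_words)  # ['AAHED', 'AAHING']
```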

Many times you do not want to create a list and waste the memory; instead you just want an iterable. A generator expression will do this for you. It looks like a list comprehension, but you surround the expression with parentheses instead of brackets. The result can be passed to any function that takes an iterable. In the following, we map the numbers 0-99 by floor-dividing by 3, filter out those that are not divisible by 3, and pass the resulting generator to sum along with the starting value 100.

In [33]: sum((i//3 for i in range(100) if i % 3 == 0),100) #100+0+1+2+...+33
Out[33]: 661

There are a few iterators built into Python, such as range and enumerate. range(a,b) will yield the integers from a up to, but not including, b. enumerate(itrbl) takes an iterable itrbl and yields the tuples (i, itrbl[i]), where we interpret itrbl[i] as the i-th element as if it were a list. We have already seen the use of range to run a standard for loop, and in the following we will use it in list comprehensions to demonstrate their uses.

In [34]: sqrs = [i**2 for i in range(5)]

In [35]: sqrs
Out[35]: [0, 1, 4, 9, 16]

In [36]: [(i,j) for i,j in enumerate(sqrs)]
Out[36]: [(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)]

Iterators

Let’s write a function that tests if a word is ordered:

def is_in_order(word):
    """return a true if the word is in order"""
    witer = iter(word)  #Create the iterator
    lettera = next(witer)  #store the first letter
    for letterb in witer:  #start a loop at the second letter
        if lettera > letterb:  #compare adjacent letters
            return False
        lettera = letterb
    return True

This function iterates through the letters of the word, comparing each adjacent pair; if a letter is greater than the next letter then the function immediately returns False, and if this never happens then it returns True. The particular way that we coded this allows us to delve more deeply into the inner workings of iterables. Strings are iterable, and when used in a for loop they will yield each letter in sequence. Hence, we can call iter(word), which returns an iterator, on which we can call the built-in function next. Each time you call next, the iterator advances to the following element, so you can call next again to get the element after that, and so on.
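The same mechanics can be exercised by hand on any iterable:

```python
witer = iter("HISSY")  # strings are iterable; iter returns an iterator over letters
first = next(witer)    # 'H'
second = next(witer)   # 'I'
rest = list(witer)     # consumes the remainder: ['S', 'S', 'Y']
# one more next(witer) would now raise StopIteration
print(first, second, rest)
```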

The following code iterates through the scrabble dictionary and maintains the longest ordered word length. If this length increases then we start the list of ordered words over with the most recently found ordered word. It appends any word that is ordered and of the same length. Try to follow the code below.

def longest_word(scrabble_dict):
    """Return all of the longest ordered words in the scrabble dictionary"""
    curr_len = 0  #Initialize the longest ordered word length
    for word in scrabble_dict:
        if is_in_order(word):
            wl = len(word)
            if wl > curr_len:
                curr_len = wl
                word_list = [word]  #Start a new word list
            elif wl == curr_len:
                word_list.append(word)  #A tie: add to the current list
    return word_list

Note the elif: with a plain if, a word that sets a new longest length would be appended twice, once when starting the new list and again in the equality check. With all of these functions saved in the file scrabble.py, we can import them in the ipython shell. Importing scripts in the same folder works the same way as importing modules, with the import command. In the following shell we find that several words, including BEEFILY, are ordered words with seven letters, and that there are none with eight or more letters.

In [37]: from scrabble import *

In [38]: ordered_words = longest_word(read_dictionary())

In [39]: print(",".join(ordered_words))
ADDEEMS,BEEFILY,BILLOWY,CHIKORS,DIKKOPS,GIMMORS

We can now ask whether this was the most efficient way to code this up. One way to assess the efficiency of code in ipython is to use the %timeit and %memit magic commands. %timeit will time a line of code over several runs, while %memit will record the peak memory used. You will need to install the memory profiler extension and load it as we do below (you can install it with $ pip install memory_profiler).

In [40]: %timeit ordered_words = longest_word(read_dictionary())
10 loops, best of 3: 138 ms per loop

In [41]: %load_ext memory_profiler

In [42]: %memit ordered_words = longest_word(read_dictionary())
peak memory: 73.01 MiB, increment: 15.57 MiB

It took approximately 138 milliseconds on my machine and 15.57 MiB of additional memory (a mebibyte is roughly one megabyte, or one million bytes). Why did this take so much memory? (While this is not a lot for a modern PC, we will see that we can drastically reduce it.) When we ran read_dictionary, we saved the entire scrabble dictionary in memory and passed it to longest_word; the dictionary alone takes up roughly 3 MB on my disk. Consider instead the following function.

def read_dictionary_iter(filename="../data/sowpods.txt"):
    """create an iterable that reads words in the scrabble dictionary"""
    with open(filename,'r') as scrabblefile:
        for word in scrabblefile:
            yield word.strip()

This function defines a generator, a custom iterator, which you can see from the yield statement. When this function is called, it does not run the for loop immediately, but waits for next to be called on the resulting iterator. Each time next is called, the for loop in read_dictionary_iter advances one step and the resulting word is returned. We can pass the resulting iterator to longest_word and it acts just like a cached list of the words in the for loop of longest_word. The difference is that we never store the entire scrabble dictionary in memory.
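The laziness of a generator is easy to see on a toy example standing in for read_dictionary_iter:

```python
def count_up(n):
    """Yield 0, 1, ..., n-1 one value at a time."""
    for i in range(n):
        yield i

gen = count_up(3)  # nothing in the body runs yet
first = next(gen)  # 0: the loop advances one step per call to next
rest = list(gen)   # [1, 2]: exhausts the generator
print(first, rest)
```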

In [43]: %timeit ordered_words = longest_word(read_dictionary_iter())
10 loops, best of 3: 148 ms per loop

In [44]: %memit ordered_words = longest_word(read_dictionary_iter())
peak memory: 57.80 MiB, increment: 0.00 MiB

We see that the run time was similar, but the additional memory used was negligible (the increment is the memory used by the line being profiled). Note that if we want to do something else with the scrabble dictionary afterwards, it is no longer in memory; calling read_dictionary instead would have let us save the dictionary as a variable for future use. In general, it is much slower to read from the hard drive than from main memory, so if you have the free memory, it may be better to cache data that you plan to reuse. The time it takes to read data from various locations in your computer or network is called latency.

The differences in latency between main memory, hard drives, and network drives are measured in orders of magnitude. For example, reading 1 MB from main memory, which is where your variables are stored, takes roughly 0.25 milliseconds. Reading 1 MB from a solid state drive takes about 1 millisecond, and reading 1 MB from a spinning hard disk takes about 20 milliseconds. Reading data over a network depends on the speed of routing packets across that network, but it is typically at least 10 times slower than reading from disk. When working with big data, it is very common to have many GB of data spread over multiple drives on a network. The standard way to deal with this is to read the data from the drive sequentially and only store in memory what is needed. The following note, attributed to Jeff Dean and Peter Norvig, gives a more complete picture of latencies for a standard PC:

Latency Comparison Numbers (~2012)
----------------------------------
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns        3 us
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
Disk seek                           10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms

Notes
-----
1 ns = 10^-9 seconds
1 us = 10^-6 seconds = 1,000 ns
1 ms = 10^-3 seconds = 1,000 us = 1,000,000 ns

Credit
------
By Jeff Dean:               http://research.google.com/people/jeff/
Originally by Peter Norvig: http://norvig.com/21-days.html#answers

Contributions
-------------
'Humanized' comparison:  https://gist.github.com/hellerbarde/2843375
Visual comparison chart: http://i.imgur.com/k0t1e.png

Data Structures and Dictionaries

A dictionary is a mapping from keys to values. You can initialize a dictionary using curly brackets {} with key-value pairs, as in the following.

In [45]: pract_dict = {'AAL':'Aalborg, Denmark', 'AAR':'Aarhus, Denmark', 'ABA':'Abakan, Russia'}

In [46]: print(pract_dict['AAR'])
Aarhus, Denmark

The keys here are airport codes and the values are the cities. We can access the value for a given key using square brackets, []. You can also initialize a dictionary using a dictionary comprehension, as in the following.

In [47]: {j:i for i,j in enumerate('hotdog')}
Out[47]: {'h': 0, 'o': 4, 't': 2, 'd': 3, 'g': 5}

Typically for a dictionary you will initialize it and then use it for look-ups. To understand the real difference between dictionaries and lists let’s learn something about data structures.

A data structure refers to a collection of data along with the relationships and methods associated with it. The list has an underlying data structure that allows for fast access to its elements (via indexing and slicing), efficient extension (via the append method), and deletion of elements (using del). One simple theoretical data structure is the stack, which maintains a collection of items where you can only push items onto the top of the stack (the end of the list) or pop them off of the top, at which point they are no longer in the stack. This last-in, first-out behavior limits the stack's capabilities, but it provides a simple theoretical construct, and it is interesting to ask "what sorts of things can I do with a stack?" The Python list can function as a stack: its append and pop methods perform precisely the push and pop operations, but a list also allows deletion and modification of any item, which is not something stacks support in general.
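As a quick sketch of this last-in, first-out behavior, a Python list works as a stack out of the box:

```python
stack = []
stack.append('first')   # push onto the top of the stack
stack.append('second')  # push another item on top
top = stack.pop()       # pop removes and returns the most recent item
print(top)    # second
print(stack)  # ['first']
```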

Different data structures are efficient for different operations. The list is great if you want to look up an item based on its index, but it is inefficient for looking up the index of a given value. This is because list.index finds the index by iterating through the list until the value is found, so the time this takes scales with the length of the list (we say that it takes \(O(n)\) time, where n is the length of the list). If you want fast index lookups you have two main options: a dictionary for reverse look-ups, or a sorted list. Often there will be many different types that can accomplish the same goal, such as a tuple, list, or dictionary for storing numbers. You will need to anticipate what sorts of operations you will perform on the data in order to choose which type to use.

To illustrate this point, let’s consider the list of all of the words in the scrabble dictionary.

In [48]: print(scrabble_words[:10])
['AA', 'AAH', 'AAHED', 'AAHING', 'AAHS', 'AAL', 'AALII', 'AALIIS', 'AALS', 'AARDVARK']

In [49]: len(scrabble_words)
Out[49]: 267751

In [50]: scrabble_words[99999]
Out[50]: 'GYNAECEA'

We want to perform the reverse look-up to find which index gives the word ‘GYNAECEA’. Let’s time the use of index to accomplish this.

In [51]: %timeit gy_ind = scrabble_words.index('GYNAECEA')
1000 loops, best of 3: 1.35 ms per loop

This command takes roughly 1.35 milliseconds to perform. Consider making a dictionary where the keys are the words in the dictionary, and the values are the indices in the dictionary.

In [52]: ind_dict = {word: i for i, word in enumerate(scrabble_words)}

In [53]: %timeit gy_ind = ind_dict['GYNAECEA']
The slowest run took 38.77 times longer than the fastest. This could mean that an intermediate result is being cached.
10000000 loops, best of 3: 50.3 ns per loop

When we initialize a dictionary, Python creates what is known as a hash table to match keys to values. A hash table stores the values in blocks in memory, where the location in memory is determined by a built-in hash function applied to the key. A simple example of a hash function for integers is the mod operation, i % 99 for example. For the keys 1024 and 9638 this hash function gives us 34 and 35 respectively, so the hash table would store the corresponding values in blocks 34 and 35. Hashing requires special handling when there is a collision: two distinct keys with the same hash value.
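Here is a toy sketch of this bucketing idea. The names toy_hash and buckets are illustrative only; Python's actual hash table implementation is far more sophisticated, including its collision handling.

```python
def toy_hash(key, n_buckets=99):
    """A toy hash function for integer keys: the mod operation."""
    return key % n_buckets

# A list of buckets standing in for blocks of memory
buckets = [None] * 99
buckets[toy_hash(1024)] = 'value for 1024'  # lands in bucket 34
buckets[toy_hash(9638)] = 'value for 9638'  # lands in bucket 35

print(toy_hash(1024), toy_hash(9638))  # 34 35
```

Looking up the value for 1024 requires only computing toy_hash(1024) and reading that bucket, regardless of how many keys are stored.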

If you call hash('james') you will see the hash value of that string, which determines where the value is stored in memory. To find the value for a key, you only need to evaluate the hash function of the key; consequently, look-ups take constant time (the run time does not significantly increase with the size of the dictionary). That is why the look-up in the previous code block takes only about 50 nanoseconds, roughly 25,000 times faster than the index method. The drawbacks are that the hash table may take up an unnecessary amount of memory and that initializing the dictionary takes some time. You also need to use keys that are hashable, such as strings, ints, and floats (mutable types such as lists are not hashable).

A sorted list is just a sorted version of the original list used to construct it. Because it is sorted, you can look up values with the bisection method. Bisection looks at the mid-point of the sorted list and determines whether the query is greater than or less than that value; this rules out half of the list, and you continue bisecting the remaining half recursively until you find the value. This means a look-up takes \(O(\log n)\) time. Look at the code below from the scrabble dictionary reverse look-up example.

In [54]: from sortedcontainers import SortedList

In [55]: ind_sl = SortedList(scrabble_words)

In [56]: %timeit gy_ind = ind_sl.index('GYNAECEA')
The slowest run took 60.51 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.97 µs per loop

The look-up took considerably less time than using the list—roughly 500 times less—but it is still slower than using the dictionary. The SortedList is not built in to Python; you will have to install the sortedcontainers package. Typically, we don’t use a SortedList unless it is necessary, and will generally stick with dictionaries for reverse look-ups.
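The bisection search underlying the SortedList is available in the standard library's bisect module. Here is a minimal sketch on a few scrabble words; sorted_index is a hypothetical helper, not part of the module.

```python
from bisect import bisect_left

words = ['AA', 'AAH', 'AAHED', 'AAHING', 'AAHS']  # already sorted

def sorted_index(sorted_words, query):
    """Find the index of query via bisection, in O(log n) comparisons."""
    i = bisect_left(sorted_words, query)
    if i < len(sorted_words) and sorted_words[i] == query:
        return i
    raise ValueError(f"{query!r} not found")

print(sorted_index(words, 'AAHED'))  # 2
```

The check after bisect_left is needed because bisect_left only returns the position where the query *would* be inserted; the query may not actually be present.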

[1]Galton was the cousin of Charles Darwin and studied the implications of natural selection and variability on human populations. He was a proponent of eugenics, and much of his research centered on the inheritability of intelligence.
[Pres] Gerhard Peters, The American Presidency Project [online]. Santa Barbara, CA: University of California (hosted), Gerhard Peters (database). Available from World Wide Web: http://www.presidency.ucsb.edu/inaugurals_words.php.