STA 141C Big Data & High Performance Statistical Computing

Units: 4

Format:
Lecture: 3 hours
Discussion: 1 hour

Catalog Description:
High-performance computing in high-level data analysis languages; different computational approaches and paradigms for efficient analysis of big data; interfaces to compiled languages; R and Python programming languages; high-level parallel computing; MapReduce; parallel algorithms and reasoning.

Prerequisite: Course 141A or ECS 40

Goals:
Students learn to reason about computational efficiency in high-level languages. They will be able to use different approaches, technologies and languages to deal with large volumes of data and computationally intensive methods.

Summary of course contents:
This course explores aspects of scaling statistical computing for large data and simulations. It moves from identifying inefficiencies in code, to idioms for more efficient code, to interfacing with compiled code for speed and memory improvements. We then focus on high-level approaches to parallel and distributed computing for data analysis and machine learning, and on the general principles involved. We also explore different languages and frameworks for statistical and machine learning, the concepts underlying them, and their advantages and disadvantages. Finally, we introduce statistical methods designed specifically for large data, e.g. the bag of little bootstraps.

Restrictions:
None

Illustrative reading:

  • Advanced R, Wickham.
  • Parallel R, McCallum & Weston.
  • Python for Data Analysis, McKinney.
  • Hadoop: The Definitive Guide, White.

GE3:
None

Potential Overlap:
ECS 158 covers parallel computing, but uses different technologies and has a more technical, machine-level focus. ECS 145 covers Python, but from a computer science and software engineering perspective rather than with a focus on data analysis.

History:
First offered Spring 2017.