## STA 32 Gateway to Statistical Data Science

**Units:** 4

**Format:**

Lecture: 3 hours

Laboratory: 1 hour

**Catalog Description:**

Probability concepts; programming in R; exploratory data analysis; sampling distribution; estimation and inference; linear regression; simulations; resampling methods. Alternative to STA 13 for students with a background in calculus and programming.

**Prerequisite:** MAT 016B or MAT 021B or MAT 017B

**Goals:**

This course gives the student an overview of the structure and applications of probability, statistics, computer simulation and data analysis. It was designed to serve as an alternative to course 13 for students with a background in calculus and programming. Course 32 should be the first Statistics course taken by a student who is considering a Statistics major. After completing the course successfully, the student should:

- Be aware of the general types of applications of probability, statistics and simulation, and the roles of the twin tools of mathematical analysis and simulation
- Be able to do simple computations involving probabilities, random variables, mass or density functions, expected values and variances--both through mathematical analysis and through simulation
- Have the prerequisite background for methodology courses such as courses 104, 106, and 108
- Have insight, which should lead to better performance in courses 131ABC

**Summary of course contents:**

1. Introduction to programming with R (3 lectures)

- Vectors, matrices and data frames
- Functions in R
- Plotting and printing functions
- Data input and output
- Logical control statements, loops

2. Descriptive statistical summaries (3 lectures)

- Basic numerical statistical summaries -- univariate and bivariate
- Graphical summaries -- univariate and bivariate

3. Introduction to probability (6 lectures)

- Basic combinatorics and concepts of equally likely outcomes
- Basic rules of probability computations
- Conditional probability, Bayes' theorem
- Binomial and Normal distributions -- properties and computations

4. Sampling distributions (3 lectures)

- Concepts of sampling -- SRSWR and SRSWOR
- Generating random samples from different distributions
- Sampling distributions of common statistics -- exploration using simulation

5. Introduction to statistical inference (6 lectures)

- Concepts of bias, variance and large sample distributions
- Hypothesis tests and confidence intervals in one and two sample problems
- Inference using permutation and resampling based procedures

6. Concepts of linear regression and correlation (7 lectures)

- Correlation as a measure of association between variables
- Notions of conditional mean and variance, graphical summaries
- Introduction to linear regression
- Basic regression diagnostics and remedial measures
- Statistical inference using mathematical and resampling-based methods
- Practical linear model building using computing tools

The course project will emphasize data exploration through computing and statistical reasoning. Different data sources will be used for students of different backgrounds. The focus will be on developing independent data analysis skills.

**Restrictions:**

Only two units of credit allowed to students who have taken course 13; not open for credit to students who have taken course 100.

**Illustrative reading:**

1. Dalgaard, P. (2008). *Introductory Statistics with R.* Springer.

2. Crawley, M. J. (2014). *Statistics: An Introduction Using R, 2nd Edition.* Wiley.

3. Verzani, J. (2014). *Using R for Introductory Statistics, 2nd Edition.* Chapman & Hall.

4. Freedman, D., Pisani, R. and Purves, R. (2007). *Statistics, 4th Edition.* W.W. Norton & Company.

**GE3:**

SE, QL

**Potential Overlap:**

Since this course was designed as an alternative to course STA 013, there is considerable overlap in the topics covered in the two courses. However, course STA 032 covers the topics in more depth and with greater mathematical sophistication. It also has some overlap in content with STA 100. However, there is considerable emphasis on computing and computer-intensive statistical procedures in STA 032, which distinguishes it from both STA 013 and STA 100.

**History:**

None