Data Science 5620

Data Science Institute
Vanderbilt University



View the Project on GitHub thomasgstewart/data-science-5620-fall-2021

Course Calendar Fall 2021

Deliverables

Title                               First Submission Due Date   Resubmission Due Date
00 Student Profile                  2021-08-30                  Not available
01 Roulette                         2021-09-06                  2021-09-24
02 Monte Carlo Error                2021-09-13                  2021-10-04
02b Interview Question              2021-09-20                  2021-10-11
03 World Series                     2021-09-27                  2021-10-19
Extra credit: Birthday Problem      2021-09-27                  Not available
04 Home Field Advantage             2021-10-04                  2021-10-25
05 Log Transformation               2021-10-11                  2021-11-03
06 Order Statistics                 2021-10-25                  2021-11-17
Take a break / Get caught up        2021-11-01                  Not available
07 MLE and MM                       2021-11-08                  2021-12-09
08 Coverage Probability             2021-11-29                  2021-12-16 at 9am
09 Methods not equally good
10 Central Limit Theorem Shortcut
11 Correlation and Inference

Problem Set

Due Date Problem
2021-10-20 (a) Generate, via simulation, a plot of the distribution of the 25th percentile of a sample of size 100 when the underlying distribution is gamma with shape = 3 and scale = 12
(b) Overlay on the plot from (a) the analytic solution of the pdf
2021-10-27 (a) Read chapter 3.
(b) Complete problem 4 of Section 3.12
(c) Complete problem 8 of Section 3.12
2021-11-01 (a) Read chapter 7.
(b) Read chapter 8.
(c) Replicate figure 8.5 by following the example in section 8.4.2
You can get the pima dataset by installing and loading the faraway package and then running the command data(pima)
(d) Generate the same plot as Figure 8.5 for adult males using the NHANES dataset.
Recall you can use the command Hmisc::getHdata(nhgh) to retrieve the data.
(e) Replicate figure 8.6 by following the example in section 8.4.4
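You can use the following commands to retrieve the bike dataset from the UCI repository:
# Download the zip archive to a temporary file (mode = "wb" keeps the binary file intact on Windows)
temp <- tempfile()
download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/00275/Bike-Sharing-Dataset.zip", temp, mode = "wb")
# Read day.csv directly from inside the archive
con <- unz(temp, "day.csv")
bike <- read.table(con, sep = ",", header = TRUE)
# Delete the temporary file
unlink(temp)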
2021-11-03 (a) Do exercise 7 of section 8.10 and overlay the estimated pdf on the histogram
(b) Add to the plot in (a) a kernel density estimate of the pdf
Recall that the dataset is in the faraway package. Missing values are coded as zero.
2021-11-10 (a) Review the slides at https://biostatdata.app.vumc.org/tgs/13-bootstrap.html
(b) Replicate the figure on slide 32. (The code is provided in slides 30 and 31. Copy and paste.)
2021-11-15 (a) Read chapter 9
(b) Complete questions 1, 2, 3, 6, 7 in section 9.13
2021-11-17 (a) Read https://biostatdata.app.vumc.org/tgs/18-parallel-processing.html
(b) Read https://biostatdata.app.vumc.org/tgs/20-batch-processing-accre.html
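One possible way to approach the 2021-10-20 problem is sketched below in R. It is only a sketch under stated assumptions, not the course's reference solution: the sample 25th percentile is taken to be the 25th order statistic of the 100 draws, and the seed, the number of replicates, and the names sim_q25 and order_stat_pdf are arbitrary choices made for illustration.

```r
# Sketch: sampling distribution of the 25th percentile of n = 100 draws
# from a gamma(shape = 3, scale = 12) distribution.
# Assumption: the sample 25th percentile is the 25th order statistic.
set.seed(5620)   # arbitrary seed, for reproducibility only
n     <- 100
k     <- 25
shape <- 3
scale <- 12
reps  <- 5000    # number of simulated samples (arbitrary choice)

# (a) Simulation: draw a sample, sort it, keep the k-th smallest value.
sim_q25 <- replicate(reps, sort(rgamma(n, shape = shape, scale = scale))[k])

# (b) Analytic pdf of the k-th order statistic:
#     f_(k)(x) = n! / ((k-1)! (n-k)!) * F(x)^(k-1) * (1 - F(x))^(n-k) * f(x)
# The beta(k, n - k + 1) density evaluated at F(x) supplies every factor
# except f(x), which avoids computing large factorials directly.
order_stat_pdf <- function(x) {
  Fx <- pgamma(x, shape = shape, scale = scale)
  fx <- dgamma(x, shape = shape, scale = scale)
  dbeta(Fx, k, n - k + 1) * fx
}

# Histogram of the simulated values on the density scale, with the
# analytic pdf overlaid.
hist(sim_q25, breaks = 40, freq = FALSE,
     main = "25th percentile, n = 100, gamma(shape = 3, scale = 12)",
     xlab = "25th order statistic")
curve(order_stat_pdf(x), add = TRUE, lwd = 2)
```

If the percentile is instead computed with quantile(x, 0.25), R's default method interpolates between the 25th and 26th order statistics, so the analytic curve above would then be an approximation rather than an exact match.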

MURAL Whiteboards

Date
2021-11-10
2021-11-15
2021-11-17
2021-12-08 (practice problems)

Final Exam

The final exam will occur between 13 December 2021 and 18 December 2021. Students will sign up for oral exam slots in early December.

Topics

PLEASE NOTE: The slides are often changed before lecture (both major edits and minor tweaks).

Topic                                                   Slides                  Textbook sections   Videos
Class logistics
Definitions of Probability                              slides
Simulation & Operating Characteristics                  slides, review          2
Basic Probability Ideas                                 slides, slides part 2   1
→ Belief vs Frequency
→ Notebook / data.frame definition
→ And, Or                                                                       1.3
→ Conditional Probability                                                       1.3
→ Law of Total Probability                              slides, slides part 2
→ Bayes Rule                                            slides, slides part 2   1.9
Discrete Probability Models                                                     3, 4, 5
→ Bernoulli Random Variables                            slides
→ Binomial Random Variables                             slides
→ Negative Binomial Random Variables                    slides
→ Poisson Random Variables                              slides
→ Probability Mass Function                             slides
Continuous Probability Models                                                   6
→ Cumulative Distribution Function                      slides
→ Probability Density Function                          slides
→ Uniform Random Variables                              slides
→ Normal Random Variables                               slides
→ Exponential Random Variables                          slides
→ Gamma Random Variables                                slides
→ Beta Random Variables                                 slides
→ Mixture Distributions                                 slides
Expectation and Variance                                                        3.6, 4.1, 4.3, 6.5
→ Data Types                                            slides
→ Categorical, Ordinal, Interval, and Ratio Variables   slides
→ Covariance                                            slides
Transformations of individual observations
Transformations of samples                                                      7
→ Min and Max                                           slides
→ Quantiles                                             slides
→ Order Statistics                                      slides, application
→ Sampling distributions                                slides
Methods of Fitting Models                               Lots of pdfs            8
→ QQ-plot
→ Method of moments                                     slides
→ Maximum likelihood                                    slides
→ Bayesian
→ Kernel Density Estimation                             slides
Sampling Distributions from Fitted Models
→ Bootstrap                                             slides
→ Simulation                                            slides
→ Central Limit Theorem                                 slides
Simulation
→ Parallel Computing                                    slides
→ Batch processing on ACCRE or AWS                      slides
Inference
→ Sampling and Inference
→ Inference with CI                                     slides
→ Inference with Hypothesis testing                     slides
Multivariate Normal Distribution                                                12
→ Properties                                            slides
→ Correlation                                           slides
→ Conditional Distribution                              slides
→ Marginal Distribution                                 slides