Lecture 9
 This lecture is talking about how to find the mean of population with Standard Error(SE) and Standard Error of The Means(SEM). And how to decide the sample size to represent the population with Confidence Interval(CI).
1. Sampling and Standard Error
1.1. Stratified Sampling
 Partition population into subgroups
 Take a simple random sample from each subgroup
 When need to use:
 There are small subgroups that should be represented
 It is important that subgroups be represented proportionally to their size in the population
 Can be used to reduced the needed size of sample
For example (Predicting Temperatures in the U.S.)
Data:
 From U.S. National Centers for Environmental Information (NCEI)
 21 different US cities
 ALBUQUERQUE, BALTIMORE, BOSTON, CHARLOTTE, CHICAGO, DALLAS, DETROIT, LAS VEGAS, LOS ANGELES, MIAMI, NEW ORLEANS, NEW YORK, PHILADELPHIA, PHOENIX, PORTLAND, SAN DIEGO, SAN FRANCISCO, SAN JUAN, SEATTLE, ST LOUIS, TAMPA
 1961 – 2015
 421,848 data points (examples)
 ALBUQUERQUE, BALTIMORE, BOSTON, CHARLOTTE, CHICAGO, DALLAS, DETROIT, LAS VEGAS, LOS ANGELES, MIAMI, NEW ORLEANS, NEW YORK, PHILADELPHIA, PHOENIX, PORTLAND, SAN DIEGO, SAN FRANCISCO, SAN JUAN, SEATTLE, ST LOUIS, TAMPA
First, fetch 100 random samples, get the standard deviation and mean:
def makeHist(data, title, xlabel, ylabel, bins = 20):
pylab.hist(data, bins = bins)
pylab.title(title)
pylab.xlabel(xlabel)
pylab.ylabel(ylabel)
def getHighs():
inFile = open('temperatures.csv')
population = []
for l in inFile:
try:
tempC = float(l.split(',')[1])
population.append(tempC)
except:
continue
return population
def getMeansAndSDs(population, sample, verbose = False):
popMean = sum(population)/len(population)
sampleMean = sum(sample)/len(sample)
if verbose:
makeHist(population,
'Daily High 19612015, Population\n' +\
'(mean = ' + str(round(popMean, 2)) + ')',
'Degrees C', 'Number Days')
pylab.figure()
makeHist(sample, 'Daily High 19612015, Sample\n' +\
'(mean = ' + str(round(sampleMean, 2)) + ')',
'Degrees C', 'Number Days')
print('Population mean =', popMean)
print('Standard deviation of population =',
numpy.std(population))
print('Sample mean =', sampleMean)
print('Standard deviation of sample =',
numpy.std(sample))
return popMean, sampleMean,\
numpy.std(population), numpy.std(sample)
random.seed(0)
population = getHighs()
sample = random.sample(population, 100)
getMeansAndSDs(population, sample, True)
Result:
Population mean = 16.298769461986048 Standard deviation of population = 9.4375585448 Sample mean = 17.0685 Standard deviation of sample = 10.390314372
Try it 1000 times and plot the sample means results
random.seed(0)
population = getHighs()
sampleSize = 100
numSamples = 1000
maxMeanDiff = 0
maxSDDiff = 0
sampleMeans = []
for i in range(numSamples):
sample = random.sample(population, sampleSize)
popMean, sampleMean, popSD, sampleSD =\
getMeansAndSDs(population, sample, verbose = False)
sampleMeans.append(sampleMean)
if abs(popMean  sampleMean) > maxMeanDiff:
maxMeanDiff = abs(popMean  sampleMean)
if abs(popSD  sampleSD) > maxSDDiff:
maxSDDiff = abs(popSD  sampleSD)
print('Mean of sample Means =',
round(sum(sampleMeans)/len(sampleMeans), 3))
print('Standard deviation of sample means =',
round(numpy.std(sampleMeans), 3))
print('Maximum difference in means =',
round(maxMeanDiff, 3))
print('Maximum difference in standard deviations =',
round(maxSDDiff, 3))
makeHist(sampleMeans, 'Means of Samples', 'Mean', 'Frequency')
pylab.axvline(x = popMean, color = 'r')
Result:
Mean of sample Means = 16.294 Standard deviation of sample means = 0.943 Maximum difference in means = 3.633 Maximum difference in standard deviations = 2.457
To get a tighter bound, we tried:
 drawing 2000 samples instead of 1000,
 doesn't change too much
 or increasing sample size from 100 to 200
 Standard deviation of sample means drops from 0.94 to 0.66
 drawing 2000 samples instead of 1000,
Then use
pylab.errorbar()
function to plot different sample sizes [50, 100, 200, 300, 400, 500, 600]:pylab.errorbar(xVals, sizeMeans, \ yerr = 1.96*pylab.array(sizeSDs), \ fmt = 'o', label = '95% Confidence Interval')
Result:
 Going from a sample size of 100 to 400 reduced the confidence interval from $1.8\ ^{\circ}C$ to about $1 \ ^{\circ}C$.
Conclusion
 Bigger sample size will always be better.
1.2. Standard Error
 The Standard Error(SE) of a statistic (most commonly the mean) is the standard deviation of its sampling distribution, or sometimes an estimate of that standard deviation. Use to depict the dispersion of sample means around the population mean.
 (标准误所代表的，是在无数次抽样结果中，一次抽样的结果可能偏离无数次抽样结果这一总体的程度。标准误越小，用这一次抽样结果来代表这无数次抽样结果的可靠性就越好。)
 标准差是样本离散程度的一个度量。标准误是：给定样本大小，样本的某个统计量的抽样分布标准差。  from 知乎
 Standard Error of the Mean(SEM) is the standard deviation of the sampling distribution of the sample mean.
 当我们打算使用随机样本来计算整体样本的平均值。SEM 表达的是，我们使用的样本大小是否足够代表整体来计算整体平均值。
 比如我抽了十次样，如果这十次样本平均值的标准差都很接近的话，那么它们的标准误就会很小，这个样本的大小就适合代表整体。
 我们期望随机样本在特定 CI (比如95%=1.96个标准偏差)下的可靠度足够高。比如温度预测，我们肯定不希望95%的 CI 下差值范围超过 1 度。显然这样的预测是不准确的。
 标准误的价值在于，通常情况下，我们无法得知整体样本的数据情况，所以必须使用随机抽样+计算SEM来保证样本大小足够以及结论的可靠性。
 To formulate it, SEM is estimated by the sample estimate of the population standard deviation (sample standard deviation) divided by the square root of the sample size (assuming statistical independence of the values in the sample):
 $SE_{\bar{x}}=\frac{s}{\sqrt{n}}$
 where
s
is the sample standard deviation (i.e., the samplebased estimate of the standard deviation of the population). Because, most time, we can't get the standard deviation of the population. Later, we will prove the samplebased estimate of the standard deviation of the population is close to the standard deviation of the population.
n
is the size (number of observations) of the sample.
 Recall the formula (from CLT) of the standard deviation of the sample means:
 The variance of the sample means ($\sigma_{\bar{x}}^2$) will be close to the variance of the population ($\sigma^2$) divided by the sample size (N).
 $\sigma^2_{\bar{x}}=\frac{\sigma^2}{N}$
 where
σ
is the standard deviation of the population.
 The variance of the sample means ($\sigma_{\bar{x}}^2$) will be close to the variance of the population ($\sigma^2$) divided by the sample size (N).
Prove the SEM theorem
 First, let's use the standard deviation of the population to calculate SEM to see if it is close to the standard deviation of the sample means of the population by simulation
 Test with different sample sizes
def sem(popSD, sampleSize):
return popSD/sampleSize**0.5
sampleSizes = (25, 50, 100, 200, 300, 400, 500, 600)
numTrials = 50
population = getHighs()
popSD = numpy.std(population)
sems = []
sampleSDs = []
for size in sampleSizes:
sems.append(sem(popSD, size))
means = []
for t in range(numTrials):
sample = random.sample(population, size)
means.append(sum(sample)/len(sample))
sampleSDs.append(numpy.std(means)) # calculate the standard deviation of the means of random samples from population
pylab.plot(sampleSizes, sampleSDs,
label = 'Std of 50 means')
pylab.plot(sampleSizes, sems, 'r', label = 'SEM')
pylab.title('SEM vs. SD for 50 Means')
pylab.legend()
 So, we can say, the SEM is very close to the standard deviation of the sample means when we are using the Population Standard Deviation. But can we use the Sample Standard Deviation to instead the Population Standard Deviation?
Compare Sample Standard Deviation
 to prove that the sample estimate of the population standard deviation (sample standard deviation) is close to the population standard deviation
 compare the differences between the Standard Deviation of Single Sample (NOT Standard Error of the Means) and the Standard Deviation of the Population:
def getDiffs(population, sampleSizes):
popStd = numpy.std(population)
diffsFracs = []
for sampleSize in sampleSizes:
diffs = []
for t in range(100):
sample = random.sample(population, sampleSize)
diffs.append(abs(popStd  numpy.std(sample))) # single sample
diffMean = sum(diffs)/len(diffs)
diffsFracs.append(diffMean/popStd)
return pylab.array(diffsFracs)*100
def plotDiffs(sampleSizes, diffs, title, label):
pylab.plot(sampleSizes, diffs, label = label)
pylab.xlabel('Sample Size')
pylab.ylabel('% Difference in SD')
pylab.title(title)
pylab.legend()
sampleSizes = range(20, 600, 1)
diffs = getDiffs(getHighs(), sampleSizes)
plotDiffs(sampleSizes, diffs,
'Sample SD vs Population SD, Temperatures',
label = 'High temps')
 Once sample reaches a reasonable size, Sample Standard Deviation is a pretty good approximation to Population Standard Deviation.
Some Other Questions
Does the Distribution of Population matter?
 Try Three Different Distributions: Uniform, Gaussian and Exponential

def compareDists(): uniform, normal, exp = [], [], [] for i in range(100000): uniform.append(random.random()) normal.append(random.gauss(0, 1)) exp.append(random.expovariate(0.5)) sampleSizes = range(20, 600, 1) udiffs = getDiffs(uniform, sampleSizes) ndiffs = getDiffs(normal, sampleSizes) ediffs = getDiffs(exp, sampleSizes) plotDiffs(sampleSizes, udiffs, 'Sample SD vs Population SD', 'Uniform population') plotDiffs(sampleSizes, ndiffs, 'Sample SD vs Population SD', 'Normal population') plotDiffs(sampleSizes, ediffs, 'Sample SD vs Population SD', 'Exponential population') compareDists()
Conclusion
 It does. Different Distribution of Population has different differences.
Does Population Size Matter?
popSizes = (10000, 100000, 1000000) sampleSizes = range(20, 600, 1) for size in popSizes: population = [] for i in range(size): population.append(random.expovariate(0.5)) ediffs = getDiffs(population, sampleSizes) plotDiffs(sampleSizes, ediffs, 'Sample SD vs Population SD, Uniform', 'Population size = ' + str(size))
 Conclusion
 It doesn't. Different Population Size has almost the same differences.
 Conclusion
Conclusion (To Estimate Mean from a Single Sample)
Prerequisite: independent random samples
Choose sample size based on estimate of skew in population
skew
: A distribution is skewed if one tail extends out further than the other. A distribution has a positive skew (is skewed to the right) if the tail to the right is longer. It has a negative skew (skewed to the left) if the tail to the left is longer.
 Chose a random sample from the population
 Compute the mean and standard deviation of that sample
 Use the standard deviation of that sample to estimate the standard error
 Use the estimated SE to generate confidence intervals around the sample mean
 if the CI is small enough, then we can use this sample size to represent our population.
Test the Conclusion
Are 200 Samples Enough to Estimate the Mean of Population?
 First, we can calculate the SE of 200 samples in theorem
 Second, to compare the true mean and the sample mean, if the difference < 1.96*se, then we can say that, 200 is enough with 95% confidence.
And of course, we need more trails to confirm our assumption.
random.seed(0) temps = getHighs() popMean = sum(temps)/len(temps) sampleSize = 200 # conclusion(1.) numTrials = 10000 numBad = 0 for t in range(numTrials): sample = random.sample(temps, sampleSize) # conclusion(2.) sampleMean = sum(sample)/sampleSize # conclusion(3.) se = numpy.std(sample)/sampleSize**0.5 # conclusion(4.) if abs(popMean  sampleMean) > 1.96*se: # 1.96 is a confidence level of 95% numBad += 1 print('Fraction outside 95% confidence interval =', numBad/numTrials)
Result:
 Fraction outside 95% confidence interval = 0.0511
 Conclusion:
 It is enough for estimate the average temperatures in the U.S
What if we use continuous 200 samples ?
for t in range(numTrials): posStartingPts = range(0, len(temps)  sampleSize) start = random.choice(posStartingPts) sample = temps[start:start+sampleSize] sampleMean = sum(sample)/sampleSize se = numpy.std(sample)/sampleSize**0.5 if abs(popMean  sampleMean) > 1.96*se: numBad += 1 print('Fraction outside 95% confidence interval =', numBad/numTrials)
 Result:
 Fraction outside 95% confidence interval = 0.9367
 Conclusion:
 we have violated a key assumptions.
 we did NOT choose independent random samples
 Data organized by city
 Temperatures correlated with city
 Therefore examples in sample are not independent of each other
 Obvious here, but can be subtle
 Result:
Conclusion for the last two tests
 All theoretical results incorporate some assumptions
 These must be checked before applying the theory!
Usage of SE and SEM
 The notation for standard error can be any one of SE, SEM (for standard error of measurement or mean), or $S_{E}$.
 If its sampling distribution is normally distributed, the sample mean, its standard error, and the quantiles of the normal distribution can be used to calculate confidence intervals for the mean. Like the sample, we calculate if 200 is enough for sample size.
 https://en.wikipedia.org/wiki/Standard_error#Assumptions_and_usage