# Lecture 10 - 11

## Experimental Data

• These lectures discuss the interplay between statistics and experimental science, and how to draw valid statistical conclusions from experimental data.

### The Behavior of Springs

• Hooke’s law of elasticity: F = -kx

• In other words, the force, F, stored in a spring is linearly related to the distance, x, the spring has been compressed (or stretched).
• k is called the spring constant.
• All springs have an elastic limit, beyond which the law fails.
• For example, how much does a rider have to weigh to compress the spring on a motorbike by 1 cm? ($k \approx 35,000 N/m$)

• $F = 35,000 N/m * 0.01 m = 350 N$
• $F = mass * acc = mass * 9.81 m/s^2$
• $mass * 9.81 m/s^2 = 350 N$
• $mass \approx 35.68 kg$
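The arithmetic above can be checked with a short script (the spring constant is the $k \approx 35{,}000\,N/m$ assumed in the example):

```python
# Check of the motorbike spring example: F = kx, then m = F / g.
k = 35_000   # spring constant, N/m (value assumed in the example)
x = 0.01     # compression, m
g = 9.81     # gravitational acceleration, m/s^2

force = k * x        # force needed to compress the spring, N
mass = force / g     # rider mass producing that force, kg

print(force)             # 350.0
print(round(mass, 2))    # 35.68
```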
• Finding k

• We can’t conduct a single measurement perfectly, so a better approach is to hang a series of increasingly heavy weights on the spring, measure the stretch of the spring each time, and plot the results.

• First we should know:

• $F = -kx$
• $k = -F/x$
• Taking magnitudes, $k = 9.81 * m / x$
• Then we collect some data

• Plotting the data, we get a scatter of points

• The measurements include some error.
• The next step is to fit a curve to the data.

• When we fit a curve to a set of data, we are finding a fit that relates an independent variable (the mass) to an estimated value of a dependent variable (the distance)

• To fit curves to data, we need to define an objective function that provides a quantitative assessment of how well the curve fits the data. In this case, the curve is a straight line, i.e., a linear function.

• Finding the best fit is then an optimization problem. The most commonly used objective function is called least squares:

• $\displaystyle\sum_{i=0}^{len(observed)-1}(observed[i] - predicted[i])^2$
• The next step is to use a successive approximation algorithm to find the best least-squares fit. PyLab provides a built-in function, polyfit:

• This function finds the coefficients of a polynomial of degree n that provides the best least-squares fit for the set of points defined by the arrays observedXVals and observedYVals.
• The algorithm used by polyfit is called linear regression.
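A minimal sketch of this workflow, using made-up spring data (numpy's polyfit has the same interface as PyLab's; the measurements and "noise" below are invented for illustration):

```python
import numpy as np

def sum_squared_error(observed, predicted):
    # The least-squares objective defined above.
    return sum((o - p) ** 2 for o, p in zip(observed, predicted))

# Made-up measurements: hang masses (kg), record stretch (m) with a small
# deterministic perturbation standing in for measurement error.
masses = np.array([0.05, 0.10, 0.15, 0.20, 0.25])
forces = 9.81 * masses                          # F = 9.81 * m
noise = np.array([1.0, -2.0, 1.5, -0.5, 0.5]) * 1e-6
distances = forces / 35_000 + noise             # true k = 35,000 N/m

# Degree-1 least-squares fit of distance against force: x = (1/k) * F + b
slope, intercept = np.polyfit(forces, distances, 1)
k_estimate = 1 / slope
residual = sum_squared_error(distances, slope * forces + intercept)
```

polyfit returns coefficients highest degree first; a degree-1 fit is exactly the linear regression described above, and the recovered k lands close to the true value despite the noise.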
• Visualizing the Fit

• k = 21.53686

### Another Experiment

• To fit curves to this mystery data:

• How good are these fits?

• We can see that the quadratic model is better than the linear model, but how bad a fit is the line, and how good is the quadratic fit?

• Comparing mean squared errors

• We get:
• Ave. mean square error for linear model = 9372.73078965
• Ave. mean square error for quadratic model = 1524.02044718
• It seems the quadratic model is better than the linear model, but we still have to ask: is the quadratic fit good in an absolute sense?
• The mean squared error is useful for comparing two different models of the same data, but it’s not very useful for judging absolute goodness of fit.
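The comparison can be sketched as follows; the data here is synthetic (roughly parabolic) and merely stands in for the mystery data, so the error values differ from the ones above:

```python
import numpy as np

def mean_squared_error(observed, predicted):
    observed, predicted = np.asarray(observed), np.asarray(predicted)
    return np.mean((observed - predicted) ** 2)

# Synthetic parabolic data standing in for the mystery data.
x = np.arange(-10.0, 11.0)
y = 3 * x ** 2 + 2 * x + 5

linear_model = np.polyfit(x, y, 1)     # degree-1 fit
quad_model = np.polyfit(x, y, 2)       # degree-2 fit

mse_linear = mean_squared_error(y, np.polyval(linear_model, x))
mse_quad = mean_squared_error(y, np.polyval(quad_model, x))
# The quadratic model's error is far smaller than the line's,
# but the numbers alone say nothing about absolute goodness of fit.
```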

### Coefficient of Determination ( R^2 )

• $R^{2}=1-\frac{\sum_{i}(y_i-p_i)^2}{\sum_{i}(y_i-\mu)^2}$, where $y_i$ are the observed values, $p_i$ the predicted values, and $\mu$ the mean of the observed values

• By comparing the estimation errors (the numerator) with the variability of the original values (the denominator), R^2 is intended to capture the proportion of variability in a data set that is accounted for by the statistical model provided by the fit.
• R^2 is always between 0 and 1 when the fit is generated by a linear regression and tested on the training data.
• If R^2 = 1, the model explains all of the variability in the data. If R^2 = 0, there is no relationship between the values predicted by the model and the actual data.
• Testing Goodness of Fit

• The quadratic model gets an R^2 of about 84%, while the linear model gets almost 0%.
• Since the degree of the polynomial affects the goodness of fit, what if we use higher degrees? Can we get a tighter fit?
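The R^2 definition above translates directly into code (a sketch; the variable names are mine):

```python
import numpy as np

def r_squared(observed, predicted):
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    estimation_error = np.sum((observed - predicted) ** 2)   # numerator
    variability = np.sum((observed - observed.mean()) ** 2)  # denominator
    return 1 - estimation_error / variability

# Perfect predictions give R^2 = 1; always predicting the mean gives 0.
```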

### A Tighter Fit May Not Be What We Need, So How to Choose?

• Why We Build Models

• Help us understand the process that generated the data
• Help us make predictions about out-of-sample data
• A good model helps us do these things
• How Mystery Data Was Generated

• Let’s Look at Two Data Sets

• Hence Degree 16 Is the Tightest Fit, But

• We want model to work well on other data generated by the same process
• It needs to generalize
• Cross Validate

• Generate models using one dataset, and then test them on the other
• Use models for Dataset 1 to predict points for Dataset 2
• Use models for Dataset 2 to predict points for Dataset 1
• Expect testing error to be larger than training error
• A better indication of generalizability than training error
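A sketch of this cross-validation, using two synthetic datasets drawn from the same (parabolic) process; the generator and the degrees tried are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

def gen_noisy_parabola(n=20):
    # Same underlying process on each call, different noise.
    x = np.linspace(-10, 10, n)
    y = 3 * x ** 2 + rng.normal(0, 20, n)
    return x, y

x1, y1 = gen_noisy_parabola()
x2, y2 = gen_noisy_parabola()

test_errors = {}
for degree in (1, 2, 8):
    model = np.polyfit(x1, y1, degree)     # train on dataset 1
    preds = np.polyval(model, x2)          # test on dataset 2
    test_errors[degree] = np.mean((y2 - preds) ** 2)
# Testing error is typically larger than training error, and the
# degree-2 model generalizes far better than the line.
```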

• Increasing the Complexity

• The quadratic coefficient may be (nearly) zero when fitting a quadratic to a perfect line
• Example one:
• data: [{0, 0}, {1, 1}, {2, 2}, {3, 3}]
• Fitting y=ax^2+bx+c gives y=0x^2+1x+0, which is y=x
• Example two:
• data: [{0, 0}, {1, 1}, {2, 2}, {3, 3.1}]
• Fitting y=ax^2+bx+c gives y=0.025x^2+0.955x+0.005, which is close to y=x
• If the data is noisy, can we end up fitting the noise rather than the underlying pattern in the data?
• Example:
• data: [{0, 0}, {1, 1}, {2, 2}, {3, 3}, {4, 20}]
• Fitting y=ax^2+bx+c, we get:
• R-squared = 0.7026
• Suppose we had used a first-degree fit instead:
• model = pylab.polyfit(xVals, yVals, 1)
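The perfect-line case is easy to reproduce (np.polyfit in place of pylab.polyfit; they share an interface):

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 2.0, 3.0])    # exactly y = x

a, b, c = np.polyfit(x, y, 2)          # fit y = a*x^2 + b*x + c
# Least squares drives the quadratic coefficient to (numerically) zero,
# recovering y = x.
```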
• Conclusion

• Choosing an overly complex model can lead to overfitting the training data
• It increases the risk of a model that works poorly on data not included in the training set
• On the other hand, choosing an insufficiently complex model has problems of its own
• For example, when we fit a line to data that was basically parabolic

### Suppose We Don’t Have a Solid Theory to Validate the Model

• Use cross-validation results to guide the choice of model complexity
• If the dataset is small, use leave-one-out cross validation
• If the dataset is large enough, use k-fold cross validation or repeated-random-sampling validation

#### Leave-one-out Cross Validation

• Let D be the original data set. For each example in D: train a model on D minus that example, then test the model on the held-out example. Average the test results.
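A sketch of leave-one-out cross validation for polynomial fits (the function name and details are mine):

```python
import numpy as np

def leave_one_out_error(x, y, degree):
    """Mean squared test error, leaving out one point at a time."""
    errors = []
    for i in range(len(x)):
        train_x = np.delete(x, i)     # D minus the i-th example
        train_y = np.delete(y, i)
        model = np.polyfit(train_x, train_y, degree)
        prediction = np.polyval(model, x[i])
        errors.append((y[i] - prediction) ** 2)
    return np.mean(errors)
```

For data lying on a line, a degree-1 model yields near-zero leave-one-out error; higher degrees tend to do worse on the held-out points.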

#### Repeated Random Sampling

• Let D be the original data set

• Let n be the number of random samples
• Repeatedly: randomly split D into a training set and a test set, train on the training set, and evaluate on the held-out test set; average the results over the trials
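A sketch of repeated random-sampling validation, under the assumption that each trial splits the data in half at random (as in the temperature example that follows):

```python
import numpy as np

rng = np.random.default_rng(0)

def repeated_random_sampling_r2(x, y, degree, n_trials=10):
    # Each trial: random half for training, the other half for testing.
    r2s = []
    for _ in range(n_trials):
        order = rng.permutation(len(x))
        half = len(x) // 2
        train, test = order[:half], order[half:]
        model = np.polyfit(x[train], y[train], degree)
        preds = np.polyval(model, x[test])
        ss_res = np.sum((y[test] - preds) ** 2)
        ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
        r2s.append(1 - ss_res / ss_tot)
    return np.mean(r2s), np.std(r2s)
```

Reporting the mean and standard deviation of R^2 across trials, rather than a single number, gives a sense of how stable each model's performance is.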

#### An Example, Temperature By Year

• Task: Model how the mean daily high temperature in the U.S. varied from 1961 through 2015

• Get means for each year and plot them
• Randomly divide data in half n times
• For each dimensionality to be tried
• Train on one half of data
• Test on other half
• Record r-squared on test data
• Report mean r-squared for each dimensionality
• Code: lecture11-3.py
• The Whole Data Set:

• Results:

• Mean R-squared values for test data
• For dimensionality 1 mean = 0.7535 Std = 0.0656
• For dimensionality 2 mean = 0.7291 Std = 0.0744
• For dimensionality 3 mean = 0.7039 Std = 0.0684
• The line seems to be the winner

• Highest average r-squared
• Simplest model

### Wrapping Up Curve Fitting

• We can use linear regression to fit a curve to data
• Mapping from independent values to dependent values
• That curve is a model of the data that can be used to predict the value associated with independent values we haven’t seen (out of sample data)
• R-squared used to evaluate model
• A higher value is not always “better” because of the risk of overfitting
• Choose complexity of model based on
• Theory about structure of data
• Cross validation
• Simplicity