```python
# We will just go ahead and import all the useful packages first.
import numpy as np
import sympy as sp
import pandas as pd
import matplotlib.pyplot as plt

# Special Functions
from sklearn.metrics import r2_score, mean_squared_error
```
# Math for Data Science

## Continue Curvilinear Models

**Important Information**

- Email: joanna_bieri@redlands.edu
- Office Hours take place in Duke 209 unless otherwise noted – Office Hours Schedule

**Today's Goals:**

- Continue Empirical Modeling
- Nonlinear Models – Linearization
- Exponential and Log Models
## Last Time – Polynomial Regression

We looked at the ice cream data and did a polynomial regression! We wanted a quadratic function because of the shape of the data:

\[ y = \beta_0 + \beta_1 x + \beta_2 x^2\]

This function is linear in the \(\beta\)s. By adding \(x^2\) to our regression as if it were just another feature,

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2\]

we could "trick" linear regression into doing nonlinear regression! This is what polyfit does when we tell it we want a second-order polynomial.
```python
np.polyfit(x,y,2)
```
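Under the hood, the same fit can be done by hand: build a design matrix whose columns are \(1, x, x^2\) and solve the least-squares problem for the \(\beta\)s. Here is a minimal sketch on made-up quadratic data (the variable names and coefficients are illustrative, not from the ice cream dataset):

```python
import numpy as np

# Made-up data following a quadratic trend plus a little noise
rng = np.random.default_rng(0)
x = np.linspace(-5, 5, 50)
y = 3.0 - 1.5*x + 0.05*x**2 + rng.normal(0, 0.1, size=x.size)

# Design matrix with columns [1, x, x^2]: the model is linear in the betas
A = np.column_stack([np.ones_like(x), x, x**2])

# Solve the linear least-squares problem
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
print(beta)                  # approximately [3, -1.5, 0.05]

# np.polyfit gives the same coefficients, highest power first
print(np.polyfit(x, y, 2))
```

So `np.polyfit` is just a convenience wrapper around this kind of linear solve for the polynomial case.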
```python
file_location = 'https://joannabieri.com/mathdatascience/data/Ice_cream_data.csv'
DF = pd.read_csv(file_location)

DF.head(10)
```
| | Temperature (°C) | Ice Cream Sales (units) |
|---|---|---|
| 0 | -4.662263 | 41.842986 |
| 1 | -4.316559 | 34.661120 |
| 2 | -4.213985 | 39.383001 |
| 3 | -3.949661 | 37.539845 |
| 4 | -3.578554 | 32.284531 |
| 5 | -3.455712 | 30.001138 |
| 6 | -3.108440 | 22.635401 |
| 7 | -3.081303 | 25.365022 |
| 8 | -2.672461 | 19.226970 |
| 9 | -2.652287 | 20.279679 |
## Higher Order Polynomials

Why stop at \(x^2\)? We have a computer, we can go crazy!
```python
x = np.array(DF['Temperature (°C)'])
y = np.array(DF['Ice Cream Sales (units)'])

N = 3
betas = np.polyfit(x,y,N)
print(betas)
```
```python
yfit = 0
for i,b in enumerate(betas):
    yfit += b*x**(N-i)

plt.plot(x,y,'k.')
plt.plot(x,yfit,'b-')
plt.xlabel('x')
plt.ylabel('y')
plt.show()

print(f'R squared: {r2_score(y,yfit)}')
print(f'MSE: {mean_squared_error(y,yfit)}')
```
```
[ 0.05141274  1.8243121  -1.48090353  3.06137302]
R squared: 0.9367011897445384
MSE: 9.32724344567126
```
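As an aside, the evaluation loop used above (accumulating `b*x**(N-i)`) is exactly what `np.polyval` does: it evaluates a polynomial from its coefficients, highest power first, matching the order `np.polyfit` returns. A quick check with made-up coefficients:

```python
import numpy as np

betas = [2.0, -1.0, 3.0]          # represents 2x^2 - x + 3
x = np.array([0.0, 1.0, 2.0])

# Manual evaluation, same as the loop above
N = len(betas) - 1
yfit = 0
for i, b in enumerate(betas):
    yfit += b * x**(N - i)

# np.polyval does the same thing in one call
print(np.polyval(betas, x))   # [3. 4. 9.]
```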
## Occam's Razor

We want the \(R^2\) value as close as possible to \(1\) and the mean squared error as close as possible to zero, without making our model ridiculously complicated. Stop before you get diminishing returns!
```python
Nvals = np.arange(0,15)
Rscores = []
MSEscores = []

for N in Nvals:
    betas = np.polyfit(x,y,N)

    yfit = 0
    for i,b in enumerate(betas):
        yfit += b*x**(N-i)

    Rscores.append(r2_score(y,yfit))
    MSEscores.append(mean_squared_error(y,yfit))
```
```python
plt.plot(Nvals,Rscores,'-r')
plt.grid()
plt.title('R^2 score as a function of Polynomial Order')
plt.xlabel('N')
plt.ylabel('score')
plt.show()
```

```python
plt.plot(Nvals,MSEscores,'-b')
plt.grid()
plt.title('MSE score as a function of Polynomial Order')
plt.xlabel('N')
plt.ylabel('score')
plt.show()
```
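One way to make "stop before diminishing returns" concrete is to increase the order only while the MSE keeps improving by more than some tolerance. This is just a sketch on made-up quadratic data, and the tolerance `tol` is an arbitrary choice, not a rule:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up data with a quadratic trend plus noise
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 80)
y = 1.0 + 2.0*x - 0.5*x**2 + rng.normal(0, 0.3, size=x.size)

tol = 0.05          # stop when MSE improves by less than this (arbitrary)
prev_mse = np.inf
best_N = 0
for N in range(0, 10):
    yfit = np.polyval(np.polyfit(x, y, N), x)
    mse = mean_squared_error(y, yfit)
    if prev_mse - mse < tol:
        break       # diminishing returns: keep the previous order
    best_N = N
    prev_mse = mse

print(best_N)       # settles at 2 for this quadratic data
```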
## General Idea of Linearized Functions

We could imagine data that has all sorts of dependencies, curves, etc. For example:

\[y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 \sin(x) + \beta_4 \sqrt{x} \]

As long as we can write the function linearly in the \(\beta\)s,

\[y = \beta_0 + \beta_1 f_1(x) + \beta_2 f_2(x)+\beta_3 f_3(x)+\beta_4 f_4(x)+\dots \]

we can do a linear regression!

**BEWARE:** np.polyfit() can only deal with polynomials, not other more complicated functions!

**NOTE:** Not all functions can be linearized. We have to be able to write them in the form above.
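Since np.polyfit only handles polynomials, a mixed basis like the one above has to be fit with a general least-squares solve: put each \(f_i(x)\) in its own column and solve for the \(\beta\)s. A sketch on made-up data (the "true" coefficients here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0.1, 10, 200)

# Made-up truth: y = 1 + 2x + 0.3x^2 + 4 sin(x) - sqrt(x), plus noise
y = 1 + 2*x + 0.3*x**2 + 4*np.sin(x) - np.sqrt(x) + rng.normal(0, 0.1, x.size)

# Columns are [1, x, x^2, sin(x), sqrt(x)] -- the model is linear in the betas
A = np.column_stack([np.ones_like(x), x, x**2, np.sin(x), np.sqrt(x)])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

print(beta)            # estimates of [1, 2, 0.3, 4, -1]
yfit = A @ beta        # fitted values from the mixed basis
```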
## Delhi Climate Data
Here we will look at some climate data and restrict ourselves to just one year to see if we can model the yearly swing in temperature.
- First we will try a polynomial fit
- Then we will see how we could fit a sine function
```python
file_location = 'https://joannabieri.com/mathdatascience/data/DailyDelhiClimateTrain.csv'
DF = pd.read_csv(file_location)

DF.head(10)
```
| | date | meantemp | humidity | wind_speed | meanpressure |
|---|---|---|---|---|---|
| 0 | 2013-01-01 | 10.000000 | 84.500000 | 0.000000 | 1015.666667 |
| 1 | 2013-01-02 | 7.400000 | 92.000000 | 2.980000 | 1017.800000 |
| 2 | 2013-01-03 | 7.166667 | 87.000000 | 4.633333 | 1018.666667 |
| 3 | 2013-01-04 | 8.666667 | 71.333333 | 1.233333 | 1017.166667 |
| 4 | 2013-01-05 | 6.000000 | 86.833333 | 3.700000 | 1016.500000 |
| 5 | 2013-01-06 | 7.000000 | 82.800000 | 1.480000 | 1018.000000 |
| 6 | 2013-01-07 | 7.000000 | 78.600000 | 6.300000 | 1020.000000 |
| 7 | 2013-01-08 | 8.857143 | 63.714286 | 7.142857 | 1018.714286 |
| 8 | 2013-01-09 | 14.000000 | 51.250000 | 12.500000 | 1017.000000 |
| 9 | 2013-01-10 | 11.000000 | 62.000000 | 7.400000 | 1015.666667 |
## Plot the Data

Always look at a scatter plot of your data!

```python
stop = 365   # restricts us to just one year of data
x = np.arange(0,stop,1)
y = DF['meantemp'][:stop]

plt.plot(x,y,'.k')
plt.grid()
plt.show()
```
Looking at this data, it looks like \(y=-x^2\) (an upside-down parabola) that has been shifted and stretched to fit the year. We will assume we can fit it with a function of the form

\[ y = \beta_0 + \beta_1 x + \beta_2 x^2\]

and just use np.polyfit().
```python
betas = np.polyfit(x,y,2)
betas
```

```
array([-7.01158700e-04,  2.61972640e-01,  8.12191801e+00])
```
```python
yfit = betas[0]*x**2+betas[1]*x+betas[2]

plt.plot(x,y,'k.')
plt.plot(x,yfit,'b-')
plt.xlabel('x')
plt.ylabel('y')
plt.show()
```

```python
r2_score(y,yfit)
```

```
0.8946935431841873
```

```python
mean_squared_error(y,yfit)
```

```
5.765082904243028
```
This worked pretty well!

But a polynomial was not our only choice! What if we tried another function, like sine or cosine?

## Using sin(x) to Fit Our Data

What if we want to use \(y=\beta_0 + \beta_1 \sin(x)\) as the nonlinear function that fits our data? How do we make this work?

First we need to get a better feel for the sine function!
\[y=a\sin(bx)+c\]
```python
a = 1
b = 1
c = 0

xvals = np.arange(0,np.pi*4,0.1)
yvals = a*np.sin(b*xvals) + c

plt.plot(xvals,yvals,'-b')
plt.grid()
plt.xlabel('x')
plt.ylabel('y')
plt.show()
```
Questions:

- Take a minute and play around with the function plotted above. What do \(a\), \(b\), and \(c\) do?
- How often does this function repeat?

How could we take a sine function and shift it to fit this data? We need to make sure we get just one bump!
```python
x_sin = x
y_sin = np.sin((np.pi/365)*x)

plt.plot(x_sin,y_sin,'-b')
plt.grid()
plt.xlabel('x')
plt.ylabel('y')
plt.show()
```
Now we have to do some of the hard work (feature engineering) before we send our data to polyfit. In Python we enter

```python
np.sin(np.pi*x/365)
```

to convert our data from \(x\) to our newly engineered feature.
We could check to see if there is a correlation between our chosen function of \(x\) and the \(y\) data.
```python
xsin = np.sin(np.pi*x/365)
np.corrcoef(xsin,y)[0,1]
```

```
0.942399024268974
```
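For reference, the number np.corrcoef reports is just the covariance of the two variables normalized by their standard deviations. A quick sketch on made-up data:

```python
import numpy as np

rng = np.random.default_rng(3)
u = rng.normal(size=500)
v = 2*u + rng.normal(size=500)   # correlated with u by construction

# r = cov(u, v) / (std(u) * std(v))
r_manual = np.mean((u - u.mean()) * (v - v.mean())) / (u.std() * v.std())

print(r_manual)
print(np.corrcoef(u, v)[0, 1])   # matches the manual computation
```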
If we apply this function to our data, the result should look somewhat linear! Here is a plot:
```python
# Original Data for one year
stop = 365
x = np.arange(0,stop,1)
y = DF['meantemp'][:stop]

# New feature
xsin = np.sin(np.pi*x/365)

# Plot
plt.plot(xsin,y,'.k')
plt.grid()
plt.xlabel('x')
plt.ylabel('y')
plt.show()
```
If we do a linear regression with the new feature data we should get a reasonable result.
\[ y = \beta_0 + \beta_1 \sin\left(\frac{\pi x}{365}\right) = \beta_0 + \beta_1 x_1\]
```python
betas = np.polyfit(xsin,y,1)
betas
```

```
array([22.6562821, 10.3681457])
```

```python
yfit = betas[0]*xsin+betas[1]

plt.plot(x,y,'k.')
plt.plot(x,yfit,'b-')
plt.grid()
plt.xlabel('x')
plt.ylabel('y')
plt.show()
```

```python
r2_score(y,yfit)
```

```
0.8881159209431134
```

```python
mean_squared_error(y,yfit)
```

```
6.125179888598971
```
Both worked, so which should we use?

Occam's razor would say to use the simplest model that fits the data. Which one is simpler?

Also, this is a bit of a contrived example, because if we look at the data over more years we see that neither of our models works very well.
```python
# All of the data
x = np.arange(0,len(DF),1)
y = DF['meantemp']

betaspoly = [-7.01158700e-04, 2.61972640e-01, 8.12191801e+00]
betassin = [22.6562821, 10.3681457]

xsin = np.sin(np.pi*x/365)

ypoly = betaspoly[0]*x**2+betaspoly[1]*x+betaspoly[2]
ysin = betassin[0]*xsin+betassin[1]

plt.plot(x,y,'.k')
plt.plot(x,ypoly,'-b')
plt.plot(x,ysin,'-r')
plt.grid()
plt.ylim(-50,50)
plt.show()
```
\(R^2\) for the polyfit:

```python
r2_score(ypoly,x)
```

```
-13.02177521007198
```

\(R^2\) for the sin function:

```python
r2_score(ysin,x)
```

```
-2738.8227158184377
```
Large negative \(R^2\) values mean that we did a HORRIBLE job of fitting the data!
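Why can \(R^2\) go negative at all? Because \(R^2 = 1 - SS_{res}/SS_{tot}\): if a model's squared errors exceed those of simply predicting the mean of the data, the ratio exceeds one and \(R^2\) drops below zero. A tiny made-up check:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_bad = np.array([4.0, 3.0, 2.0, 1.0])   # worse than predicting the mean

# R^2 = 1 - SS_res / SS_tot
ss_res = np.sum((y_true - y_bad)**2)
ss_tot = np.sum((y_true - y_true.mean())**2)
print(1 - ss_res/ss_tot)             # -3.0
print(r2_score(y_true, y_bad))       # same value
```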
## You Try – Due Next Class

What function should we have tried here?

```python
# Here is your data
x = np.arange(0,len(DF),1)
y = DF['meantemp']
```
```python
# Here is a plot of your data
plt.plot(x,y,'.k')
plt.grid()
plt.show()
```
- What function do you propose to really fit the data? HINT: just apply another function to our sine function. There are multiple things that will work better than what we did above!
- Using your choice of function, calculate the correlation coefficient.
- Create a scatter plot of your function applied to the data (it should look kind of linear).
- Do a linear regression with your function applied to the data.
- Plot the regression vs. the real data.
- Calculate the \(R^2\) value.
Here is an example of my results. Your results are not expected to be identical, but you should have the same parts to the analysis.

Correlation coefficient:

```
0.9261289713860721
```

Plot of my function applied to the data. Wow, it looks pretty linear!

Plot of the data and my new regression.

R-squared score:

```
0.8341113893860616
```