{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "from datascience import *\n", "%matplotlib inline\n", "path_data = '../../../assets/data/'\n", "import matplotlib.pyplot as plots\n", "plots.style.use('fivethirtyeight')\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# The Method of Least Squares\n", "We have developed the equation of the regression line that runs through a football shaped scatter plot. But not all scatter plots are football shaped, not even linear ones. Does every scatter plot have a \"best\" line that goes through it? If so, can we still use the formulas for the slope and intercept developed in the previous section, or do we need new ones?\n", "\n", "To address these questions, we need a reasonable definition of \"best\". Recall that the purpose of the line is to *predict* or *estimate* values of $y$, given values of $x$. Estimates typically aren't perfect. Each one is off the true value by an *error*. A reasonable criterion for a line to be the \"best\" is for it to have the smallest possible overall error among all straight lines.\n", "\n", "In this section we will make this criterion precise and see if we can identify the best straight line under the criterion." 
] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "def standard_units(any_numbers):\n", "    \"Convert any array of numbers to standard units.\"\n", "    return (any_numbers - np.mean(any_numbers))/np.std(any_numbers)\n", "\n", "def correlation(t, x, y):\n", "    return np.mean(standard_units(t.column(x))*standard_units(t.column(y)))\n", "\n", "def slope(table, x, y):\n", "    r = correlation(table, x, y)\n", "    return r * np.std(table.column(y))/np.std(table.column(x))\n", "\n", "def intercept(table, x, y):\n", "    a = slope(table, x, y)\n", "    return np.mean(table.column(y)) - a * np.mean(table.column(x))\n", "\n", "def fit(table, x, y):\n", "    \"\"\"Return the height of the regression line at each x value.\"\"\"\n", "    a = slope(table, x, y)\n", "    b = intercept(table, x, y)\n", "    return a * table.column(x) + b" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our first example is a dataset that has one row for every chapter of the novel \"Little Women.\" The goal is to estimate the number of characters (that is, letters, spaces, punctuation marks, and so on) based on the number of periods. Recall that we attempted to do this in the very first lecture of this course." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
Periods | Characters
---|---
189 | 21759
188 | 22148
231 | 20558
... (44 rows omitted)
" ], "text/plain": [ "Periods | Characters | Linear Prediction | Error | \n", "
---|---|---|---|
189 | 21759 | 21183.6 | 575.403 | \n", "
188 | 22148 | 21096.6 | 1051.38 | \n", "
231 | 20558 | 24836.7 | -4278.67 | \n", "
195 | 25526 | 21705.5 | 3820.54 | \n", "
255 | 23395 | 26924.1 | -3529.13 | \n", "
140 | 14622 | 16921.7 | -2299.68 | \n", "
131 | 14431 | 16138.9 | -1707.88 | \n", "
214 | 22476 | 23358 | -882.043 | \n", "
337 | 33767 | 34056.3 | -289.317 | \n", "
185 | 18508 | 20835.7 | -2327.69 | \n", "
... (37 rows omitted)
" ], "text/plain": [ "Periods | Characters | Linear Prediction | Error\n", "189 | 21759 | 21183.6 | 575.403\n", "188 | 22148 | 21096.6 | 1051.38\n", "231 | 20558 | 24836.7 | -4278.67\n", "195 | 25526 | 21705.5 | 3820.54\n", "255 | 23395 | 26924.1 | -3529.13\n", "140 | 14622 | 16921.7 | -2299.68\n", "131 | 14431 | 16138.9 | -1707.88\n", "214 | 22476 | 23358 | -882.043\n", "337 | 33767 | 34056.3 | -289.317\n", "185 | 18508 | 20835.7 | -2327.69\n", "... (37 rows omitted)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lw_with_predictions.with_column('Error', errors)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can use `slope` and `intercept` to calculate the slope and intercept of the fitted line. The graph below shows the line (in light blue). The errors corresponding to four of the points are shown in red. There is nothing special about those four points. They were just chosen for clarity of the display. The function `lw_errors` takes a slope and an intercept (in that order) as its arguments and draws the figure. 
" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "lw_reg_slope = slope(little_women, 'Periods', 'Characters')\n", "lw_reg_intercept = intercept(little_women, 'Periods', 'Characters')" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "\n", "sample = [[131, 14431], [231, 20558], [392, 40935], [157, 23524]]\n", "def lw_errors(slope, intercept):\n", " little_women.scatter('Periods', 'Characters')\n", " xlims = np.array([50, 450])\n", " plots.plot(xlims, slope * xlims + intercept, lw=2)\n", " for x, y in sample:\n", " plots.plot([x, x], [y, slope * x + intercept], color='r', lw=2)" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Slope of Regression Line: 87.0 characters per period\n", "Intercept of Regression Line: 4745.0 characters\n" ] }, { "data": { "image/png": "\n", "text/plain": [ "