{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove-input" ] }, "outputs": [], "source": [ "from datascience import *\n", "path_data = '../../../assets/data/'\n", "import matplotlib\n", "matplotlib.use('Agg')\n", "%matplotlib inline\n", "import matplotlib.pyplot as plots\n", "plots.style.use('fivethirtyeight')\n", "import numpy as np" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# Random Sampling in Python \n", "\n", "This section summarizes the ways you have learned to sample at random using Python, and introduces a new way.\n", "\n", "## Review: Sampling from a Population in a Table \n", "If you are sampling from a population of individuals whose data are represented in the rows of a table, then you can use the Table method `sample` to [randomly select rows](https://inferentialthinking.com/chapters/10/1/Empirical_Distributions.html#id1) of the table. That is, you can use `sample` to select a random sample of individuals.\n", "\n", "By default, `sample` draws uniformly at random with replacement. This is a natural model for chance experiments such as rolling a die." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Face
1
2
3
4
5
6
" ], "text/plain": [ "Face\n", "1\n", "2\n", "3\n", "4\n", "5\n", "6" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "faces = np.arange(1, 7)\n", "die = Table().with_columns('Face', faces)\n", "die" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Run the cell below to simulate 7 rolls of a die." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Face
5
3
3
5
5
1
6
" ], "text/plain": [ "Face\n", "5\n", "3\n", "3\n", "5\n", "5\n", "1\n", "6" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "die.sample(7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Sometimes it is more natural to sample individuals at random without replacement. This is called a simple random sample. The argument `with_replacement=False` allows you to do this." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Actor Total Gross Number of Movies Average per Movie #1 Movie Gross
Harrison Ford 4871.7 41 118.8 Star Wars: The Force Awakens 936.7
Samuel L. Jackson 4772.8 69 69.2 The Avengers 623.4
Morgan Freeman 4468.3 61 73.3 The Dark Knight 534.9
Tom Hanks 4340.8 44 98.7 Toy Story 3 415
Robert Downey, Jr. 3947.3 53 74.5 The Avengers 623.4
Eddie Murphy 3810.4 38 100.3 Shrek 2 441.2
Tom Cruise 3587.2 36 99.6 War of the Worlds 234.3
Johnny Depp 3368.6 45 74.9 Dead Man's Chest 423.3
Michael Caine 3351.5 58 57.8 The Dark Knight 534.9
Scarlett Johansson 3341.2 37 90.3 The Avengers 623.4
\n", "

... (40 rows omitted)

" ], "text/plain": [ "Actor | Total Gross | Number of Movies | Average per Movie | #1 Movie | Gross\n", "Harrison Ford | 4871.7 | 41 | 118.8 | Star Wars: The Force Awakens | 936.7\n", "Samuel L. Jackson | 4772.8 | 69 | 69.2 | The Avengers | 623.4\n", "Morgan Freeman | 4468.3 | 61 | 73.3 | The Dark Knight | 534.9\n", "Tom Hanks | 4340.8 | 44 | 98.7 | Toy Story 3 | 415\n", "Robert Downey, Jr. | 3947.3 | 53 | 74.5 | The Avengers | 623.4\n", "Eddie Murphy | 3810.4 | 38 | 100.3 | Shrek 2 | 441.2\n", "Tom Cruise | 3587.2 | 36 | 99.6 | War of the Worlds | 234.3\n", "Johnny Depp | 3368.6 | 45 | 74.9 | Dead Man's Chest | 423.3\n", "Michael Caine | 3351.5 | 58 | 57.8 | The Dark Knight | 534.9\n", "Scarlett Johansson | 3341.2 | 37 | 90.3 | The Avengers | 623.4\n", "... (40 rows omitted)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "actors = Table.read_table(path_data + 'actors.csv')\n", "actors" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
Actor Total Gross Number of Movies Average per Movie #1 Movie Gross
Morgan Freeman 4468.3 61 73.3 The Dark Knight 534.9
Orlando Bloom 2815.8 17 165.6 Dead Man's Chest 423.3
Cameron Diaz 3031.7 34 89.2 Shrek 2 441.2
Michael Caine 3351.5 58 57.8 The Dark Knight 534.9
Leonardo DiCaprio 2518.3 25 100.7 Titanic 658.7
" ], "text/plain": [ "Actor | Total Gross | Number of Movies | Average per Movie | #1 Movie | Gross\n", "Morgan Freeman | 4468.3 | 61 | 73.3 | The Dark Knight | 534.9\n", "Orlando Bloom | 2815.8 | 17 | 165.6 | Dead Man's Chest | 423.3\n", "Cameron Diaz | 3031.7 | 34 | 89.2 | Shrek 2 | 441.2\n", "Michael Caine | 3351.5 | 58 | 57.8 | The Dark Knight | 534.9\n", "Leonardo DiCaprio | 2518.3 | 25 | 100.7 | Titanic | 658.7" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Simple random sample of 5 rows\n", "actors.sample(5, with_replacement=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since `sample` gives you the entire sample in the order in which the rows were selected, you can use Table methods on the sampled table to answer many questions about the sample. For example, you can find the number of times the die showed six spots, or the average number of movies in which the sampled actors appeared, or whether one two specified actors appeared in the sample. You might need multiple lines of code to get some of this information." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Review: Sampling from a Population in an Array \n", "\n", "If you are sampling from a population of individuals whose data are represented as an array, you can use the NumPy function `np.random.choice` to [randomly select elements of the array](https://inferentialthinking.com/chapters/09/3/Simulation.html#example-number-of-heads-in-100-tosses).\n", "\n", "By default, `np.random.choice` samples at random with replacement." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([1, 2, 3, 4, 5, 6])" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# The faces of a die, as an array\n", "faces" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([4, 1, 6, 3, 5, 4, 6])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# 7 rolls of the die\n", "np.random.choice(faces, 7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The argument `replace=False` allows you to get a simple random sample, that is, a sample drawn at random without replacement." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "# Array of actor names\n", "actor_names = actors.column('Actor')" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array(['Jonah Hill', 'Julia Roberts', 'Bruce Willis', 'Eddie Murphy',\n", " 'Matt Damon'], dtype='