{
"cells": [
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"tags": [
"remove-input"
]
},
"outputs": [],
"source": [
"from datascience import *\n",
"path_data = '../../../assets/data/'\n",
"import matplotlib\n",
"matplotlib.use('Agg')\n",
"%matplotlib inline\n",
"import matplotlib.pyplot as plots\n",
"plots.style.use('fivethirtyeight')\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Random Sampling in Python \n",
"\n",
"This section summarizes the ways you have learned to sample at random using Python, and introduces a new way.\n",
"\n",
"## Review: Sampling from a Population in a Table \n",
"If you are sampling from a population of individuals whose data are represented in the rows of a table, then you can use the Table method `sample` to [randomly select rows](https://inferentialthinking.com/chapters/10/1/Empirical_Distributions.html#id1) of the table. That is, you can use `sample` to select a random sample of individuals.\n",
"\n",
"By default, `sample` draws uniformly at random with replacement. This is a natural model for chance experiments such as rolling a die."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th>Face</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>1</td></tr>\n",
"<tr><td>2</td></tr>\n",
"<tr><td>3</td></tr>\n",
"<tr><td>4</td></tr>\n",
"<tr><td>5</td></tr>\n",
"<tr><td>6</td></tr>\n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"Face\n",
"1\n",
"2\n",
"3\n",
"4\n",
"5\n",
"6"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"faces = np.arange(1, 7)\n",
"die = Table().with_columns('Face', faces)\n",
"die"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Run the cell below to simulate 7 rolls of a die."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th>Face</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>5</td></tr>\n",
"<tr><td>3</td></tr>\n",
"<tr><td>3</td></tr>\n",
"<tr><td>5</td></tr>\n",
"<tr><td>5</td></tr>\n",
"<tr><td>1</td></tr>\n",
"<tr><td>6</td></tr>\n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"Face\n",
"5\n",
"3\n",
"3\n",
"5\n",
"5\n",
"1\n",
"6"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"die.sample(7)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sometimes it is more natural to sample individuals at random without replacement. A sample drawn this way is called a simple random sample. The argument `with_replacement=False` allows you to do this."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th>Actor</th> <th>Total Gross</th> <th>Number of Movies</th> <th>Average per Movie</th> <th>#1 Movie</th> <th>Gross</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>Harrison Ford</td> <td>4871.7</td> <td>41</td> <td>118.8</td> <td>Star Wars: The Force Awakens</td> <td>936.7</td></tr>\n",
"<tr><td>Samuel L. Jackson</td> <td>4772.8</td> <td>69</td> <td>69.2</td> <td>The Avengers</td> <td>623.4</td></tr>\n",
"<tr><td>Morgan Freeman</td> <td>4468.3</td> <td>61</td> <td>73.3</td> <td>The Dark Knight</td> <td>534.9</td></tr>\n",
"<tr><td>Tom Hanks</td> <td>4340.8</td> <td>44</td> <td>98.7</td> <td>Toy Story 3</td> <td>415</td></tr>\n",
"<tr><td>Robert Downey, Jr.</td> <td>3947.3</td> <td>53</td> <td>74.5</td> <td>The Avengers</td> <td>623.4</td></tr>\n",
"<tr><td>Eddie Murphy</td> <td>3810.4</td> <td>38</td> <td>100.3</td> <td>Shrek 2</td> <td>441.2</td></tr>\n",
"<tr><td>Tom Cruise</td> <td>3587.2</td> <td>36</td> <td>99.6</td> <td>War of the Worlds</td> <td>234.3</td></tr>\n",
"<tr><td>Johnny Depp</td> <td>3368.6</td> <td>45</td> <td>74.9</td> <td>Dead Man's Chest</td> <td>423.3</td></tr>\n",
"<tr><td>Michael Caine</td> <td>3351.5</td> <td>58</td> <td>57.8</td> <td>The Dark Knight</td> <td>534.9</td></tr>\n",
"<tr><td>Scarlett Johansson</td> <td>3341.2</td> <td>37</td> <td>90.3</td> <td>The Avengers</td> <td>623.4</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>... (40 rows omitted)</p>"
],
"text/plain": [
"Actor | Total Gross | Number of Movies | Average per Movie | #1 Movie | Gross\n",
"Harrison Ford | 4871.7 | 41 | 118.8 | Star Wars: The Force Awakens | 936.7\n",
"Samuel L. Jackson | 4772.8 | 69 | 69.2 | The Avengers | 623.4\n",
"Morgan Freeman | 4468.3 | 61 | 73.3 | The Dark Knight | 534.9\n",
"Tom Hanks | 4340.8 | 44 | 98.7 | Toy Story 3 | 415\n",
"Robert Downey, Jr. | 3947.3 | 53 | 74.5 | The Avengers | 623.4\n",
"Eddie Murphy | 3810.4 | 38 | 100.3 | Shrek 2 | 441.2\n",
"Tom Cruise | 3587.2 | 36 | 99.6 | War of the Worlds | 234.3\n",
"Johnny Depp | 3368.6 | 45 | 74.9 | Dead Man's Chest | 423.3\n",
"Michael Caine | 3351.5 | 58 | 57.8 | The Dark Knight | 534.9\n",
"Scarlett Johansson | 3341.2 | 37 | 90.3 | The Avengers | 623.4\n",
"... (40 rows omitted)"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"actors = Table.read_table(path_data + 'actors.csv')\n",
"actors"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr><th>Actor</th> <th>Total Gross</th> <th>Number of Movies</th> <th>Average per Movie</th> <th>#1 Movie</th> <th>Gross</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><td>Morgan Freeman</td> <td>4468.3</td> <td>61</td> <td>73.3</td> <td>The Dark Knight</td> <td>534.9</td></tr>\n",
"<tr><td>Orlando Bloom</td> <td>2815.8</td> <td>17</td> <td>165.6</td> <td>Dead Man's Chest</td> <td>423.3</td></tr>\n",
"<tr><td>Cameron Diaz</td> <td>3031.7</td> <td>34</td> <td>89.2</td> <td>Shrek 2</td> <td>441.2</td></tr>\n",
"<tr><td>Michael Caine</td> <td>3351.5</td> <td>58</td> <td>57.8</td> <td>The Dark Knight</td> <td>534.9</td></tr>\n",
"<tr><td>Leonardo DiCaprio</td> <td>2518.3</td> <td>25</td> <td>100.7</td> <td>Titanic</td> <td>658.7</td></tr>\n",
"</tbody>\n",
"</table>"
],
"text/plain": [
"Actor | Total Gross | Number of Movies | Average per Movie | #1 Movie | Gross\n",
"Morgan Freeman | 4468.3 | 61 | 73.3 | The Dark Knight | 534.9\n",
"Orlando Bloom | 2815.8 | 17 | 165.6 | Dead Man's Chest | 423.3\n",
"Cameron Diaz | 3031.7 | 34 | 89.2 | Shrek 2 | 441.2\n",
"Michael Caine | 3351.5 | 58 | 57.8 | The Dark Knight | 534.9\n",
"Leonardo DiCaprio | 2518.3 | 25 | 100.7 | Titanic | 658.7"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Simple random sample of 5 rows\n",
"actors.sample(5, with_replacement=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since `sample` gives you the entire sample in the order in which the rows were selected, you can use Table methods on the sampled table to answer many questions about the sample. For example, you can find the number of times the die showed six spots, or the average number of movies in which the sampled actors appeared, or whether either of two specified actors appeared in the sample. You might need multiple lines of code to get some of this information."
]
},
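{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a rough sketch of the kinds of questions described above, the code below counts the sixes in 7 simulated rolls, averages the number of movies in a small sample of actors, and checks whether either of two named actors was drawn. The table `few_actors` is a hypothetical stand-in built from five rows of the `actors` output shown earlier, the fixed seed is an assumption added only for repeatability, and the `datascience` library is assumed to be installed.\n",
"\n",
"```python\n",
"import numpy as np\n",
"from datascience import Table, are\n",
"\n",
"np.random.seed(0)  # assumption: fixed seed so reruns give the same sample\n",
"\n",
"# Count the sixes in 7 simulated rolls of a die.\n",
"die = Table().with_columns('Face', np.arange(1, 7))\n",
"rolls = die.sample(7)\n",
"num_sixes = rolls.where('Face', are.equal_to(6)).num_rows\n",
"\n",
"# Hypothetical stand-in for the actors table (five rows from the output above).\n",
"few_actors = Table().with_columns(\n",
"    'Actor', np.array(['Harrison Ford', 'Samuel L. Jackson', 'Morgan Freeman',\n",
"                       'Tom Hanks', 'Eddie Murphy']),\n",
"    'Number of Movies', np.array([41, 69, 61, 44, 38])\n",
")\n",
"\n",
"# Simple random sample of 3 actors, then two summaries of that sample.\n",
"actor_sample = few_actors.sample(3, with_replacement=False)\n",
"average_movies = np.mean(actor_sample.column('Number of Movies'))\n",
"sampled_names = actor_sample.column('Actor')\n",
"either_appeared = np.isin(['Tom Hanks', 'Eddie Murphy'], sampled_names).any()\n",
"\n",
"print(num_sixes, average_movies, either_appeared)\n",
"```\n",
"\n",
"Other Table methods, such as `group` or `sort`, could answer similar questions; this is one possible approach rather than the only one."
]
},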
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Review: Sampling from a Population in an Array \n",
"\n",
"If you are sampling from a population of individuals whose data are represented as an array, you can use the NumPy function `np.random.choice` to [randomly select elements of the array](https://inferentialthinking.com/chapters/09/3/Simulation.html#example-number-of-heads-in-100-tosses).\n",
"\n",
"By default, `np.random.choice` samples at random with replacement."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([1, 2, 3, 4, 5, 6])"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# The faces of a die, as an array\n",
"faces"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([4, 1, 6, 3, 5, 4, 6])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 7 rolls of the die\n",
"np.random.choice(faces, 7)"
]
},
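{
"cell_type": "markdown",
"metadata": {},
"source": [
"Because `np.random.choice` returns an array, array operations can summarize the sample directly. The sketch below is one illustrative way to do this; the names `rolls`, `num_sixes`, and `average_roll`, and the fixed seed, are assumptions added for the example.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"np.random.seed(0)  # assumption: fixed seed so reruns give the same rolls\n",
"\n",
"faces = np.arange(1, 7)\n",
"rolls = np.random.choice(faces, 7)  # 7 rolls, drawn with replacement\n",
"\n",
"num_sixes = np.count_nonzero(rolls == 6)  # number of rolls showing six spots\n",
"average_roll = np.mean(rolls)  # average number of spots per roll\n",
"\n",
"print(rolls, num_sixes, average_roll)\n",
"```\n",
"\n",
"The comparison `rolls == 6` produces an array of booleans, and `np.count_nonzero` counts how many of them are `True`."
]
},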
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The argument `replace=False` allows you to get a simple random sample, that is, a sample drawn at random without replacement."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"# Array of actor names\n",
"actor_names = actors.column('Actor')"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array(['Jonah Hill', 'Julia Roberts', 'Bruce Willis', 'Eddie Murphy',\n",
" 'Matt Damon'], dtype='