raymondEDS committed on
Commit
63a7f01
·
1 Parent(s): 1289315
.DS_Store ADDED
Binary file (6.15 kB).
 
Reference files/Week2_ref/Ch02-statlearn-lab.ipynb ADDED
@@ -0,0 +1,3229 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "id": "245f0c86",
6
+ "metadata": {},
7
+ "source": [
8
+ "\n",
9
+ "# Chapter 2\n",
10
+ "\n",
11
+ "# Lab: Introduction to Python\n",
12
+ "\n"
13
+ ]
14
+ },
15
+ {
16
+ "cell_type": "markdown",
17
+ "id": "5ab29948",
18
+ "metadata": {},
19
+ "source": [
20
+ "## Getting Started"
21
+ ]
22
+ },
23
+ {
24
+ "cell_type": "markdown",
25
+ "id": "ed622870",
26
+ "metadata": {},
27
+ "source": [
28
+ "To run the labs in this book, you will need two things:\n",
29
+ "\n",
30
+ "* An installation of `Python3`, which is the specific version of `Python` used in the labs. \n",
31
+ "* Access to `Jupyter`, a very popular `Python` interface that runs code through a file called a *notebook*. "
32
+ ]
33
+ },
34
+ {
35
+ "cell_type": "markdown",
36
+ "id": "844d37fc",
37
+ "metadata": {},
38
+ "source": [
39
+ "You can download and install `Python3` by following the instructions available at [anaconda.com](http://anaconda.com). "
40
+ ]
41
+ },
42
+ {
43
+ "cell_type": "markdown",
44
+ "id": "462ff1fe",
45
+ "metadata": {},
46
+ "source": [
47
+ " There are a number of ways to get access to `Jupyter`. Here are just a few:\n",
48
+ " \n",
49
+ " * Using Google's `Colaboratory` service: [colab.research.google.com/](https://colab.research.google.com/). \n",
50
+ " * Using `JupyterHub`, available at [jupyter.org/hub](https://jupyter.org/hub). \n",
51
+ " * Using your own `jupyter` installation. Installation instructions are available at [jupyter.org/install](https://jupyter.org/install). \n",
52
+ " \n",
53
+ "Please see the `Python` resources page on the book website [statlearning.com](https://www.statlearning.com) for up-to-date information about getting `Python` and `Jupyter` working on your computer. \n",
54
+ "\n",
55
+ "You will need to install the `ISLP` package, which provides access to the datasets and custom-built functions that we provide.\n",
56
+ "Inside a macOS or Linux terminal type `pip install ISLP`; this also installs most other packages needed in the labs. The `Python` resources page has a link to the `ISLP` documentation website.\n",
57
+ "\n",
58
+ "To run this lab, download the file `Ch2-statlearn-lab.ipynb` from the `Python` resources page. \n",
59
+ "Now run the following code at the command line: `jupyter lab Ch2-statlearn-lab.ipynb`.\n",
60
+ "\n",
61
+ "If you're using Windows, you can use the `start menu` to access `anaconda`, and follow the links. For example, to install `ISLP` and run this lab, you can run the same code above in an `anaconda` shell.\n"
62
+ ]
63
+ },
64
+ {
65
+ "cell_type": "markdown",
66
+ "id": "b46f9182",
67
+ "metadata": {},
68
+ "source": [
69
+ "## Basic Commands\n"
70
+ ]
71
+ },
72
+ {
73
+ "cell_type": "markdown",
74
+ "id": "54060fd9",
75
+ "metadata": {},
76
+ "source": [
77
+ "In this lab, we will introduce some simple `Python` commands. \n",
78
+ " For more resources about `Python` in general, readers may want to consult the tutorial at [docs.python.org/3/tutorial/](https://docs.python.org/3/tutorial/). \n",
79
+ "\n",
80
+ "\n",
81
+ " \n"
82
+ ]
83
+ },
84
+ {
85
+ "cell_type": "markdown",
86
+ "id": "d3dbd0e9",
87
+ "metadata": {},
88
+ "source": [
89
+ "Like most programming languages, `Python` uses *functions*\n",
90
+ "to perform operations. To run a\n",
91
+ "function called `fun`, we type\n",
92
+ "`fun(input1,input2)`, where the inputs (or *arguments*)\n",
93
+ "`input1` and `input2` tell\n",
94
+ "`Python` how to run the function. A function can have any number of\n",
95
+ "inputs. For example, the\n",
96
+ "`print()` function outputs a text representation of all of its arguments to the console."
97
+ ]
98
+ },
99
+ {
100
+ "cell_type": "code",
101
+ "execution_count": 1,
102
+ "id": "9e8aa21f",
103
+ "metadata": {
104
+ "execution": {}
105
+ },
106
+ "outputs": [
107
+ {
108
+ "name": "stdout",
109
+ "output_type": "stream",
110
+ "text": [
111
+ "fit a model with 11 variables\n"
112
+ ]
113
+ }
114
+ ],
115
+ "source": [
116
+ "print('fit a model with', 11, 'variables')\n"
117
+ ]
118
+ },
119
+ {
120
+ "cell_type": "markdown",
121
+ "id": "27d935f8",
122
+ "metadata": {},
123
+ "source": [
124
+ " The following command will provide information about the `print()` function."
125
+ ]
126
+ },
127
+ {
128
+ "cell_type": "code",
129
+ "execution_count": null,
130
+ "id": "d62ec119",
131
+ "metadata": {
132
+ "execution": {}
133
+ },
134
+ "outputs": [],
135
+ "source": [
136
+ "print?\n"
137
+ ]
138
+ },
139
+ {
140
+ "cell_type": "markdown",
141
+ "id": "04b3e2a3",
142
+ "metadata": {},
143
+ "source": [
144
+ "Adding two integers in `Python` is pretty intuitive."
145
+ ]
146
+ },
147
+ {
148
+ "cell_type": "code",
149
+ "execution_count": null,
150
+ "id": "c64e9f4d",
151
+ "metadata": {
152
+ "execution": {}
153
+ },
154
+ "outputs": [],
155
+ "source": [
156
+ "3 + 5\n"
157
+ ]
158
+ },
159
+ {
160
+ "cell_type": "markdown",
161
+ "id": "cd754cba",
162
+ "metadata": {},
163
+ "source": [
164
+ "In `Python`, textual data is handled using\n",
165
+ "*strings*. For instance, `\"hello\"` and\n",
166
+ "`'hello'`\n",
167
+ "are strings. \n",
168
+ "We can concatenate them using the addition `+` symbol."
169
+ ]
170
+ },
171
+ {
172
+ "cell_type": "code",
173
+ "execution_count": null,
174
+ "id": "9abccc1f",
175
+ "metadata": {
176
+ "execution": {}
177
+ },
178
+ "outputs": [],
179
+ "source": [
180
+ "\"hello\" + \" \" + \"world\"\n"
181
+ ]
182
+ },
183
+ {
184
+ "cell_type": "markdown",
185
+ "id": "c28db903",
186
+ "metadata": {},
187
+ "source": [
188
+ " A string is actually a type of *sequence*: this is a generic term for an ordered list. \n",
189
+ " The three most important types of sequences are lists, tuples, and strings. \n",
190
+ "We introduce lists now. "
191
+ ]
192
+ },
193
+ {
194
+ "cell_type": "markdown",
195
+ "id": "5fdcc5a1",
196
+ "metadata": {},
197
+ "source": [
198
+ "The following command instructs `Python` to join together\n",
199
+ "the numbers 3, 4, and 5, and to save them as a\n",
200
+ "*list* named `x`. When we\n",
201
+ "type `x`, it gives us back the list."
202
+ ]
203
+ },
204
+ {
205
+ "cell_type": "code",
206
+ "execution_count": null,
207
+ "id": "802ca33c",
208
+ "metadata": {
209
+ "execution": {}
210
+ },
211
+ "outputs": [],
212
+ "source": [
213
+ "x = [3, 4, 5]\n",
214
+ "x\n"
215
+ ]
216
+ },
217
+ {
218
+ "cell_type": "markdown",
219
+ "id": "5492ecd1",
220
+ "metadata": {},
221
+ "source": [
222
+ "Note that we used the brackets\n",
223
+ "`[]` to construct this list. \n",
224
+ "\n",
225
+ "We will often want to add two sets of numbers together. It is reasonable to try the following code,\n",
226
+ "though it will not produce the desired results."
227
+ ]
228
+ },
229
+ {
230
+ "cell_type": "code",
231
+ "execution_count": null,
232
+ "id": "a8c72744",
233
+ "metadata": {
234
+ "execution": {}
235
+ },
236
+ "outputs": [],
237
+ "source": [
238
+ "y = [4, 9, 7]\n",
239
+ "x + y\n"
240
+ ]
241
+ },
242
+ {
243
+ "cell_type": "code",
244
+ "execution_count": null,
245
+ "id": "b84f9d0e",
246
+ "metadata": {},
247
+ "outputs": [],
248
+ "source": [
249
+ "x[2]  # third (last) element; valid indices are 0, 1, 2, so x[3] would raise IndexError"
250
+ ]
251
+ },
252
+ {
253
+ "cell_type": "markdown",
254
+ "id": "8f42ea1d",
255
+ "metadata": {},
256
+ "source": [
257
+ "The result may appear slightly counterintuitive: why did `Python` not add the entries of the lists\n",
258
+ "element-by-element? \n",
259
+ " In `Python`, lists hold *arbitrary* objects, and are added using *concatenation*. \n",
260
+ " In fact, concatenation is the behavior that we saw earlier when we entered `\"hello\" + \" \" + \"world\"`. \n",
261
+ " "
262
+ ]
263
+ },
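The concatenation behavior described above can be contrasted with element-wise addition in plain `Python`. The following is a minimal sketch and not part of the original lab; the names `concatenated` and `elementwise` are illustrative assumptions.

```python
# Two plain Python lists (no numpy involved).
x = [3, 4, 5]
y = [4, 9, 7]

# The + operator joins the two lists end to end.
concatenated = x + y

# Element-wise addition has to be written out explicitly,
# here by pairing up entries with zip().
elementwise = [a + b for a, b in zip(x, y)]

print(concatenated)
print(elementwise)
```

Writing the pairing by hand works for short lists, but it is presumably one reason the lab moves on to `numpy` arrays, where `+` is element-wise.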
264
+ {
265
+ "cell_type": "markdown",
266
+ "id": "69015df5",
267
+ "metadata": {},
268
+ "source": [
269
+ "This example reflects the fact that \n",
270
+ " `Python` is a general-purpose programming language. Much of `Python`'s data-specific\n",
271
+ "functionality comes from other packages, notably `numpy`\n",
272
+ "and `pandas`. \n",
273
+ "In the next section, we will introduce the `numpy` package. \n",
274
+ "See [docs.scipy.org/doc/numpy/user/quickstart.html](https://docs.scipy.org/doc/numpy/user/quickstart.html) for more information about `numpy`.\n"
275
+ ]
276
+ },
277
+ {
278
+ "cell_type": "markdown",
279
+ "id": "16bfc4a2",
280
+ "metadata": {},
281
+ "source": [
282
+ "## Introduction to Numerical Python\n",
283
+ "\n",
284
+ "As mentioned earlier, this book makes use of functionality that is contained in the `numpy` \n",
285
+ " *library*, or *package*. A package is a collection of modules that are not necessarily included in \n",
286
+ " the base `Python` distribution. The name `numpy` is an abbreviation for *numerical Python*. "
287
+ ]
288
+ },
289
+ {
290
+ "cell_type": "markdown",
291
+ "id": "f5bed3f0",
292
+ "metadata": {},
293
+ "source": [
294
+ " To access `numpy`, we must first `import` it."
295
+ ]
296
+ },
297
+ {
298
+ "cell_type": "code",
299
+ "execution_count": null,
300
+ "id": "f1c7d1db",
301
+ "metadata": {
302
+ "execution": {},
303
+ "lines_to_next_cell": 0
304
+ },
305
+ "outputs": [],
306
+ "source": [
307
+ "import numpy as np "
308
+ ]
309
+ },
310
+ {
311
+ "cell_type": "markdown",
312
+ "id": "5c8614e7",
313
+ "metadata": {},
314
+ "source": [
315
+ "In the previous line, we named the `numpy` *module* `np`, an abbreviation for easier referencing."
316
+ ]
317
+ },
318
+ {
319
+ "cell_type": "markdown",
320
+ "id": "ba1224a6",
321
+ "metadata": {},
322
+ "source": [
323
+ "In `numpy`, an *array* is a generic term for a multidimensional\n",
324
+ "set of numbers.\n",
325
+ "We use the `np.array()` function to define `x` and `y`, which are one-dimensional arrays, i.e. vectors."
326
+ ]
327
+ },
328
+ {
329
+ "cell_type": "code",
330
+ "execution_count": null,
331
+ "id": "e2ea2bfd",
332
+ "metadata": {
333
+ "execution": {},
334
+ "lines_to_next_cell": 0
335
+ },
336
+ "outputs": [],
337
+ "source": [
338
+ "x = np.array([3, 4, 5])\n",
339
+ "y = np.array([4, 9, 7])"
340
+ ]
341
+ },
342
+ {
343
+ "cell_type": "markdown",
344
+ "id": "a977e05a",
345
+ "metadata": {},
346
+ "source": [
347
+ "Note that if you forgot to run the `import numpy as np` command earlier, then\n",
348
+ "you will encounter an error in calling the `np.array()` function in the previous line. \n",
349
+ " The syntax `np.array()` indicates that the function being called\n",
350
+ "is part of the `numpy` package, which we have abbreviated as `np`. "
351
+ ]
352
+ },
353
+ {
354
+ "cell_type": "markdown",
355
+ "id": "742431b6",
356
+ "metadata": {},
357
+ "source": [
358
+ "Since `x` and `y` have been defined using `np.array()`, we get a sensible result when we add them together. Compare this to our results in the previous section,\n",
359
+ " when we tried to add two lists without using `numpy`. "
360
+ ]
361
+ },
362
+ {
363
+ "cell_type": "code",
364
+ "execution_count": null,
365
+ "id": "59fbf9fd",
366
+ "metadata": {
367
+ "execution": {},
368
+ "lines_to_next_cell": 0
369
+ },
370
+ "outputs": [],
371
+ "source": [
372
+ "x + y"
373
+ ]
374
+ },
375
+ {
376
+ "cell_type": "markdown",
377
+ "id": "2ceccc2b",
378
+ "metadata": {},
379
+ "source": [
380
+ " \n",
381
+ " \n"
382
+ ]
383
+ },
384
+ {
385
+ "cell_type": "markdown",
386
+ "id": "74be6d74",
387
+ "metadata": {},
388
+ "source": [
389
+ "In `numpy`, matrices are typically represented as two-dimensional arrays, and vectors as one-dimensional arrays. (While it is also possible to create matrices using `np.matrix()`, we will use `np.array()` throughout the labs in this book.)\n",
390
+ "We can create a two-dimensional array as follows. "
391
+ ]
392
+ },
393
+ {
394
+ "cell_type": "code",
395
+ "execution_count": null,
396
+ "id": "2279437e",
397
+ "metadata": {
398
+ "execution": {},
399
+ "lines_to_next_cell": 0
400
+ },
401
+ "outputs": [],
402
+ "source": [
403
+ "x = np.array([[1, 2], [3, 4]])\n",
404
+ "x"
405
+ ]
406
+ },
407
+ {
408
+ "cell_type": "markdown",
409
+ "id": "f96f304d",
410
+ "metadata": {},
411
+ "source": [
412
+ " \n",
413
+ "\n"
414
+ ]
415
+ },
416
+ {
417
+ "cell_type": "markdown",
418
+ "id": "f764f7d1",
419
+ "metadata": {},
420
+ "source": [
421
+ "The object `x` has several \n",
422
+ "*attributes*, or associated objects. To access an attribute of `x`, we type `x.attribute`, where we replace `attribute`\n",
423
+ "with the name of the attribute. \n",
424
+ "For instance, we can access the `ndim` attribute of `x` as follows. "
425
+ ]
426
+ },
427
+ {
428
+ "cell_type": "code",
429
+ "execution_count": null,
430
+ "id": "75bf1b1e",
431
+ "metadata": {
432
+ "execution": {}
433
+ },
434
+ "outputs": [],
435
+ "source": [
436
+ "x.ndim"
437
+ ]
438
+ },
439
+ {
440
+ "cell_type": "markdown",
441
+ "id": "4e3b83bf",
442
+ "metadata": {},
443
+ "source": [
444
+ "The output indicates that `x` is a two-dimensional array. \n",
445
+ "Similarly, `x.dtype` is the *data type* attribute of the object `x`. This indicates that `x` is \n",
446
+ "comprised of 64-bit integers:"
447
+ ]
448
+ },
449
+ {
450
+ "cell_type": "code",
451
+ "execution_count": null,
452
+ "id": "58292240",
453
+ "metadata": {
454
+ "execution": {},
455
+ "lines_to_next_cell": 0
456
+ },
457
+ "outputs": [],
458
+ "source": [
459
+ "x.dtype"
460
+ ]
461
+ },
462
+ {
463
+ "cell_type": "markdown",
464
+ "id": "cf9cf94b",
465
+ "metadata": {},
466
+ "source": [
467
+ "Why is `x` comprised of integers? This is because we created `x` by passing in exclusively integers to the `np.array()` function.\n",
468
+ " If\n",
469
+ "we had passed in any decimals, then we would have obtained an array of\n",
470
+ "*floating point numbers* (i.e. real-valued numbers). "
471
+ ]
472
+ },
473
+ {
474
+ "cell_type": "code",
475
+ "execution_count": null,
476
+ "id": "fc5fff57",
477
+ "metadata": {
478
+ "execution": {},
479
+ "lines_to_next_cell": 2
480
+ },
481
+ "outputs": [],
482
+ "source": [
483
+ "np.array([[1, 2], [3.0, 4]]).dtype\n"
484
+ ]
485
+ },
486
+ {
487
+ "cell_type": "markdown",
488
+ "id": "41a79641",
489
+ "metadata": {},
490
+ "source": [
491
+ "Typing `fun?` in a `Jupyter` notebook will display \n",
492
+ "documentation associated with the function `fun`, if it exists.\n",
493
+ "We can try this for `np.array()`. "
494
+ ]
495
+ },
496
+ {
497
+ "cell_type": "code",
498
+ "execution_count": null,
499
+ "id": "762562a6",
500
+ "metadata": {
501
+ "execution": {},
502
+ "lines_to_next_cell": 0
503
+ },
504
+ "outputs": [],
505
+ "source": [
506
+ "np.array?\n"
507
+ ]
508
+ },
509
+ {
510
+ "cell_type": "markdown",
511
+ "id": "d4d82167",
512
+ "metadata": {},
513
+ "source": [
514
+ "This documentation indicates that we could create a floating point array by passing a `dtype` argument into `np.array()`."
515
+ ]
516
+ },
517
+ {
518
+ "cell_type": "code",
519
+ "execution_count": null,
520
+ "id": "66d2b82a",
521
+ "metadata": {
522
+ "execution": {},
523
+ "lines_to_next_cell": 2
524
+ },
525
+ "outputs": [],
526
+ "source": [
527
+ "np.array([[1, 2], [3, 4]], float).dtype\n"
528
+ ]
529
+ },
530
+ {
531
+ "cell_type": "markdown",
532
+ "id": "1e3ba5be",
533
+ "metadata": {},
534
+ "source": [
535
+ "The array `x` is two-dimensional. We can find out the number of rows and columns by looking\n",
536
+ "at its `shape` attribute."
537
+ ]
538
+ },
539
+ {
540
+ "cell_type": "code",
541
+ "execution_count": null,
542
+ "id": "89881402",
543
+ "metadata": {
544
+ "execution": {},
545
+ "lines_to_next_cell": 2
546
+ },
547
+ "outputs": [],
548
+ "source": [
549
+ "x.shape\n"
550
+ ]
551
+ },
552
+ {
553
+ "cell_type": "markdown",
554
+ "id": "2967b644",
555
+ "metadata": {},
556
+ "source": [
557
+ "A *method* is a function that is associated with an\n",
558
+ "object. \n",
559
+ "For instance, given an array `x`, the expression\n",
560
+ "`x.sum()` sums all of its elements, using the `sum()`\n",
561
+ "method for arrays. \n",
562
+ "The call `x.sum()` automatically provides `x` as the\n",
563
+ "first argument to its `sum()` method."
564
+ ]
565
+ },
566
+ {
567
+ "cell_type": "code",
568
+ "execution_count": null,
569
+ "id": "0572d3f6",
570
+ "metadata": {
571
+ "execution": {},
572
+ "lines_to_next_cell": 0
573
+ },
574
+ "outputs": [],
575
+ "source": [
576
+ "x = np.array([1, 2, 3, 4])\n",
577
+ "x.sum()"
578
+ ]
579
+ },
580
+ {
581
+ "cell_type": "markdown",
582
+ "id": "e3f49995",
583
+ "metadata": {},
584
+ "source": [
585
+ "We could also sum the elements of `x` by passing in `x` as an argument to the `np.sum()` function. "
586
+ ]
587
+ },
588
+ {
589
+ "cell_type": "code",
590
+ "execution_count": null,
591
+ "id": "33b10a6f",
592
+ "metadata": {
593
+ "execution": {},
594
+ "lines_to_next_cell": 0
595
+ },
596
+ "outputs": [],
597
+ "source": [
598
+ "x = np.array([1, 2, 3, 4])\n",
599
+ "np.sum(x)"
600
+ ]
601
+ },
602
+ {
603
+ "cell_type": "markdown",
604
+ "id": "2f3dd2c3",
605
+ "metadata": {},
606
+ "source": [
607
+ " As another example, the\n",
608
+ "`reshape()` method returns a new array with the same elements as\n",
609
+ "`x`, but a different shape.\n",
610
+ " We do this by passing in a `tuple` in our call to\n",
611
+ " `reshape()`, in this case `(2, 3)`. This tuple specifies that we would like to create a two-dimensional array with \n",
612
+ "$2$ rows and $3$ columns. (Like lists, tuples represent a sequence of objects. Why do we need more than one way to create a sequence? There are a few differences between tuples and lists, but perhaps the most important is that elements of a tuple cannot be modified, whereas elements of a list can be.)\n",
613
+ " \n",
614
+ "In what follows, the\n",
615
+ "`\\n` character creates a *new line*."
616
+ ]
617
+ },
618
+ {
619
+ "cell_type": "code",
620
+ "execution_count": null,
621
+ "id": "a32716db",
622
+ "metadata": {
623
+ "execution": {}
624
+ },
625
+ "outputs": [],
626
+ "source": [
627
+ "x = np.array([1, 2, 3, 4, 5, 6])\n",
628
+ "print('beginning x:\\n', x)\n",
629
+ "x_reshape = x.reshape((2, 3))\n",
630
+ "print('reshaped x:\\n', x_reshape)\n"
631
+ ]
632
+ },
633
+ {
634
+ "cell_type": "markdown",
635
+ "id": "2483179e",
636
+ "metadata": {},
637
+ "source": [
638
+ "The previous output reveals that `numpy` arrays are specified as a sequence\n",
639
+ "of *rows*. This is called *row-major ordering*, as opposed to *column-major ordering*. "
640
+ ]
641
+ },
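To make the row-major versus column-major distinction concrete, the sketch below reshapes the same six numbers both ways. It is an illustration only, not part of the original lab; it relies on the `order` argument of `reshape()`, and the names `row_major` and `col_major` are assumptions introduced here.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])

# Default behavior (order='C'): the values fill the first row, then the second.
row_major = x.reshape((2, 3))

# order='F' (Fortran-style): the values fill the first column, then the next.
col_major = x.reshape((2, 3), order='F')

print('row-major:\n', row_major)
print('column-major:\n', col_major)
```

In the row-major result the first row is `1, 2, 3`; in the column-major result the first column is `1, 2`.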
642
+ {
643
+ "cell_type": "markdown",
644
+ "id": "e256575f",
645
+ "metadata": {},
646
+ "source": [
647
+ "`Python` (and hence `numpy`) uses 0-based\n",
648
+ "indexing. This means that to access the top left element of `x_reshape`, \n",
649
+ "we type in `x_reshape[0,0]`."
650
+ ]
651
+ },
652
+ {
653
+ "cell_type": "code",
654
+ "execution_count": null,
655
+ "id": "3db6e1cf",
656
+ "metadata": {
657
+ "execution": {},
658
+ "lines_to_next_cell": 0
659
+ },
660
+ "outputs": [],
661
+ "source": [
662
+ "x_reshape[0, 0] "
663
+ ]
664
+ },
665
+ {
666
+ "cell_type": "markdown",
667
+ "id": "0e10119e",
668
+ "metadata": {},
669
+ "source": [
670
+ "Similarly, `x_reshape[1,2]` yields the element in the second row and the third column \n",
671
+ "of `x_reshape`. "
672
+ ]
673
+ },
674
+ {
675
+ "cell_type": "code",
676
+ "execution_count": null,
677
+ "id": "e15c753f",
678
+ "metadata": {
679
+ "execution": {},
680
+ "lines_to_next_cell": 0
681
+ },
682
+ "outputs": [],
683
+ "source": [
684
+ "x_reshape[1, 2] "
685
+ ]
686
+ },
687
+ {
688
+ "cell_type": "markdown",
689
+ "id": "f9c55622",
690
+ "metadata": {},
691
+ "source": [
692
+ "Similarly, `x[2]` yields the\n",
693
+ "third entry of `x`. \n",
694
+ "\n",
695
+ "Now, let's modify the top left element of `x_reshape`. To our surprise, we discover that the first element of `x` has been modified as well!\n",
696
+ "\n"
697
+ ]
698
+ },
699
+ {
700
+ "cell_type": "code",
701
+ "execution_count": null,
702
+ "id": "91c6e7d8",
703
+ "metadata": {
704
+ "execution": {}
705
+ },
706
+ "outputs": [],
707
+ "source": [
708
+ "print('x before we modify x_reshape:\\n', x)\n",
709
+ "print('x_reshape before we modify x_reshape:\\n', x_reshape)\n",
710
+ "x_reshape[0, 0] = 5\n",
711
+ "print('x_reshape after we modify its top left element:\\n', x_reshape)\n",
712
+ "print('x after we modify top left element of x_reshape:\\n', x)\n"
713
+ ]
714
+ },
715
+ {
716
+ "cell_type": "markdown",
717
+ "id": "8a840507",
718
+ "metadata": {},
719
+ "source": [
720
+ "Modifying `x_reshape` also modified `x` because `x_reshape` is a *view* of `x`: the two arrays share the same underlying data in memory.\n",
721
+ " \n",
722
+ "\n",
723
+ " "
724
+ ]
725
+ },
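The shared-memory behavior noted above can be checked directly. The following sketch is not part of the original lab; it assumes `np.shares_memory()` and the array `copy()` method, and the names `view` and `independent` are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6])

# For a contiguous array like x, reshape() returns a view onto the same data.
view = x.reshape((2, 3))

# copy() allocates new memory, so later edits do not propagate back to x.
independent = x.reshape((2, 3)).copy()

view[0, 0] = 5          # also changes x[0]
independent[0, 1] = 99  # leaves x[1] untouched

shares_view = np.shares_memory(x, view)
shares_copy = np.shares_memory(x, independent)
print(shares_view, shares_copy)
print(x)
```

Calling `copy()` is one way to avoid the surprise described in the text when an independent array is wanted.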
726
+ {
727
+ "cell_type": "markdown",
728
+ "id": "ec551f3e",
729
+ "metadata": {},
730
+ "source": [
731
+ "We just saw that we can modify an element of an array. Can we also modify a tuple? It turns out that we cannot, and trying to do so raises\n",
732
+ "an *exception*, or error."
733
+ ]
734
+ },
735
+ {
736
+ "cell_type": "code",
737
+ "execution_count": null,
738
+ "id": "59d95dce",
739
+ "metadata": {
740
+ "execution": {},
741
+ "lines_to_next_cell": 2
742
+ },
743
+ "outputs": [],
744
+ "source": [
745
+ "my_tuple = (3, 4, 5)\n",
746
+ "my_tuple[0] = 2\n"
747
+ ]
748
+ },
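The exception raised by the cell above is a `TypeError`. As a hedged sketch (not part of the original lab), the code below catches that exception so execution can continue, and then builds a new tuple instead of modifying the old one; the names `message` and `new_tuple` are assumptions introduced here.

```python
my_tuple = (3, 4, 5)

try:
    # Item assignment is not supported for tuples.
    my_tuple[0] = 2
except TypeError as err:
    message = str(err)
    print('TypeError:', message)

# A tuple cannot be changed in place, but a new one can be built from its parts.
new_tuple = (2,) + my_tuple[1:]
print(new_tuple)
```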
749
+ {
750
+ "cell_type": "markdown",
751
+ "id": "d594f1af",
752
+ "metadata": {},
753
+ "source": [
754
+ "We now briefly mention some attributes of arrays that will come in handy. An array's `shape` attribute contains the size of the array along each dimension; this is always a tuple.\n",
755
+ "The `ndim` attribute yields the number of dimensions, and `T` provides its transpose. "
756
+ ]
757
+ },
758
+ {
759
+ "cell_type": "code",
760
+ "execution_count": null,
761
+ "id": "a6fde9af",
762
+ "metadata": {
763
+ "execution": {}
764
+ },
765
+ "outputs": [],
766
+ "source": [
767
+ "x_reshape.shape, x_reshape.ndim, x_reshape.T\n"
768
+ ]
769
+ },
770
+ {
771
+ "cell_type": "markdown",
772
+ "id": "76d20b98",
773
+ "metadata": {},
774
+ "source": [
775
+ "Notice that the three individual outputs `(2, 3)`, `2`, and `array([[5, 4], [2, 5], [3, 6]])` are themselves output as a tuple. \n",
776
+ " \n",
777
+ "We will often want to apply functions to arrays. \n",
778
+ "For instance, we can compute the\n",
779
+ "square root of the entries using the `np.sqrt()` function: "
780
+ ]
781
+ },
782
+ {
783
+ "cell_type": "code",
784
+ "execution_count": null,
785
+ "id": "fadb6b45",
786
+ "metadata": {
787
+ "execution": {}
788
+ },
789
+ "outputs": [],
790
+ "source": [
791
+ "np.sqrt(x)\n"
792
+ ]
793
+ },
794
+ {
795
+ "cell_type": "markdown",
796
+ "id": "22fab2ce",
797
+ "metadata": {},
798
+ "source": [
799
+ "We can also square the elements:"
800
+ ]
801
+ },
802
+ {
803
+ "cell_type": "code",
804
+ "execution_count": null,
805
+ "id": "fda3134b",
806
+ "metadata": {
807
+ "execution": {}
808
+ },
809
+ "outputs": [],
810
+ "source": [
811
+ "x**2\n"
812
+ ]
813
+ },
814
+ {
815
+ "cell_type": "markdown",
816
+ "id": "1278f26b",
817
+ "metadata": {},
818
+ "source": [
819
+ "We can compute the square roots using the same notation, raising to the power of $1/2$ instead of 2."
820
+ ]
821
+ },
822
+ {
823
+ "cell_type": "code",
824
+ "execution_count": null,
825
+ "id": "52eb335b",
826
+ "metadata": {
827
+ "execution": {},
828
+ "lines_to_next_cell": 2
829
+ },
830
+ "outputs": [],
831
+ "source": [
832
+ "x**0.5\n"
833
+ ]
834
+ },
835
+ {
836
+ "cell_type": "markdown",
837
+ "id": "299a5a85",
838
+ "metadata": {},
839
+ "source": [
840
+ "Throughout this book, we will often want to generate random data. \n",
841
+ "The `np.random.normal()` function generates a vector of random\n",
842
+ "normal variables. We can learn more about this function by looking at the help page, via a call to `np.random.normal?`.\n",
843
+ "The first line of the help page reads `normal(loc=0.0, scale=1.0, size=None)`. \n",
844
+ " This *signature* line tells us that the function's arguments are `loc`, `scale`, and `size`. These are *keyword* arguments, which means that when they are passed into\n",
845
+ " the function, they can be referred to by name (in any order). (`Python` also uses *positional* arguments. Positional arguments do not need to use a keyword. To see an example, type in `np.sum?`. We see that `a` is a positional argument, i.e. this function assumes that the first unnamed argument that it receives is the array to be summed. By contrast, `axis` and `dtype` are keyword arguments: the position in which these arguments are entered into `np.sum()` does not matter.)\n",
846
+ " By default, this function will generate random normal variable(s) with mean (`loc`) $0$ and standard deviation (`scale`) $1$; furthermore, \n",
847
+ " a single random variable will be generated unless the argument to `size` is changed. \n",
848
+ "\n",
849
+ "We now generate 50 independent random variables from a $N(0,1)$ distribution. "
850
+ ]
851
+ },
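The distinction between keyword and positional arguments discussed above can be sketched as follows. This example is illustrative and not part of the original lab; the variable names are assumptions, and the drawn values are random, so only the shapes are predictable.

```python
import numpy as np

# Keyword arguments may be given in any order.
a = np.random.normal(size=3, scale=2.0, loc=10.0)

# The same arguments can also be passed by position,
# in the order given by the signature: loc, scale, size.
b = np.random.normal(10.0, 2.0, 3)

# np.sum() takes the array as its first positional argument;
# axis is supplied here by keyword (axis=0 sums down each column).
total = np.sum(np.array([[1, 2], [3, 4]]), axis=0)

print(a.shape, b.shape)
print(total)
```

Passing arguments by keyword is usually easier to read, since the call itself documents what each value means.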
852
+ {
853
+ "cell_type": "code",
854
+ "execution_count": null,
855
+ "id": "ac5e9d29",
856
+ "metadata": {
857
+ "execution": {}
858
+ },
859
+ "outputs": [],
860
+ "source": [
861
+ "x = np.random.normal(size=50)\n",
862
+ "x\n"
863
+ ]
864
+ },
865
+ {
866
+ "cell_type": "markdown",
867
+ "id": "d77cf45a",
868
+ "metadata": {},
869
+ "source": [
870
+ "We create an array `y` by adding an independent $N(50,1)$ random variable to each element of `x`."
871
+ ]
872
+ },
873
+ {
874
+ "cell_type": "code",
875
+ "execution_count": null,
876
+ "id": "55fa905e",
877
+ "metadata": {
878
+ "execution": {},
879
+ "lines_to_next_cell": 0
880
+ },
881
+ "outputs": [],
882
+ "source": [
883
+ "y = x + np.random.normal(loc=50, scale=1, size=50)"
884
+ ]
885
+ },
886
+ {
887
+ "cell_type": "markdown",
888
+ "id": "eacfecc9",
889
+ "metadata": {},
890
+ "source": [
891
+ "The `np.corrcoef()` function computes the correlation matrix between `x` and `y`. The off-diagonal elements give the \n",
892
+ "correlation between `x` and `y`. "
893
+ ]
894
+ },
895
+ {
896
+ "cell_type": "code",
897
+ "execution_count": null,
898
+ "id": "fde0dc19",
899
+ "metadata": {
900
+ "execution": {}
901
+ },
902
+ "outputs": [],
903
+ "source": [
904
+ "np.corrcoef(x, y)"
905
+ ]
906
+ },
907
+ {
908
+ "cell_type": "markdown",
909
+ "id": "8a594218",
910
+ "metadata": {},
911
+ "source": [
912
+ "If you're following along in your own `Jupyter` notebook, then you probably noticed that you got a different set of results when you ran the past few \n",
913
+ "commands. In particular, \n",
914
+ " each\n",
915
+ "time we call `np.random.normal()`, we will get a different answer, as shown in the following example."
916
+ ]
917
+ },
918
+ {
919
+ "cell_type": "code",
920
+ "execution_count": null,
921
+ "id": "5099cf54",
922
+ "metadata": {
923
+ "execution": {},
924
+ "lines_to_next_cell": 0
925
+ },
926
+ "outputs": [],
927
+ "source": [
928
+ "print(np.random.normal(scale=5, size=2))\n",
929
+ "print(np.random.normal(scale=5, size=2)) \n"
930
+ ]
931
+ },
940
+ {
941
+ "cell_type": "markdown",
942
+ "id": "ed7697a4",
943
+ "metadata": {},
944
+ "source": [
945
+ "In order to ensure that our code provides exactly the same results\n",
946
+ "each time it is run, we can set a *random seed* \n",
947
+ "using the \n",
948
+ "`np.random.default_rng()` function.\n",
949
+ "This function takes an arbitrary, user-specified integer argument. If we set a random seed before \n",
950
+ "generating random data, then re-running our code will yield the same results. The\n",
951
+ "object `rng` has essentially all the random number generating methods found in `np.random`. Hence, to\n",
952
+ "generate normal data we use `rng.normal()`."
953
+ ]
954
+ },
955
+ {
956
+ "cell_type": "code",
957
+ "execution_count": null,
958
+ "id": "9d8074e5",
959
+ "metadata": {
960
+ "execution": {}
961
+ },
962
+ "outputs": [],
963
+ "source": [
964
+ "rng = np.random.default_rng(1303)\n",
965
+ "print(rng.normal(scale=5, size=2))\n",
966
+ "rng2 = np.random.default_rng(1303)\n",
967
+ "print(rng2.normal(scale=5, size=2)) "
968
+ ]
969
+ },
970
+ {
971
+ "cell_type": "markdown",
972
+ "id": "93f826ef",
973
+ "metadata": {},
974
+ "source": [
975
+ "Throughout the labs in this book, we use `np.random.default_rng()` whenever we\n",
976
+ "perform calculations involving random quantities within `numpy`. In principle, this\n",
977
+ "should enable the reader to exactly reproduce the stated results. However, as new versions of `numpy` become available, it is possible\n",
978
+ "that some small discrepancies may occur between the output\n",
979
+ "in the labs and the output\n",
980
+ "from `numpy`.\n",
981
+ "\n",
982
+ "The `np.mean()`, `np.var()`, and `np.std()` functions can be used\n",
983
+ "to compute the mean, variance, and standard deviation of arrays. These functions are also\n",
984
+ "available as methods on the arrays."
985
+ ]
986
+ },
987
+ {
988
+ "cell_type": "code",
989
+ "execution_count": null,
990
+ "id": "e98472df",
991
+ "metadata": {
992
+ "execution": {},
993
+ "lines_to_next_cell": 0
994
+ },
995
+ "outputs": [],
996
+ "source": [
997
+ "rng = np.random.default_rng(3)\n",
998
+ "y = rng.standard_normal(10)\n",
999
+ "np.mean(y), y.mean()"
1000
+ ]
1001
+ },
1010
+ {
1011
+ "cell_type": "code",
1012
+ "execution_count": null,
1013
+ "id": "8c2784fd",
1014
+ "metadata": {
1015
+ "execution": {},
1016
+ "lines_to_next_cell": 2
1017
+ },
1018
+ "outputs": [],
1019
+ "source": [
1020
+ "np.var(y), y.var(), np.mean((y - y.mean())**2)"
1021
+ ]
1022
+ },
1023
+ {
1024
+ "cell_type": "markdown",
1025
+ "id": "86261a69",
1026
+ "metadata": {},
1027
+ "source": [
1028
+ "Notice that by default `np.var()` divides by the sample size $n$ rather\n",
1029
+ "than $n-1$; see the `ddof` argument in `np.var?`.\n"
1030
+ ]
1031
+ },
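As a small illustration of the `ddof` argument mentioned above, the sketch below compares the default variance (divisor $n$) with the `ddof=1` version (divisor $n-1$). The seed and sample size simply mirror the earlier cell; the variable names `var_n` and `var_n1` are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.standard_normal(10)
n = y.size

# Default (ddof=0): the sum of squared deviations is divided by n.
var_n = np.var(y)

# ddof=1: the divisor becomes n - 1, giving the usual sample variance.
var_n1 = np.var(y, ddof=1)

# The two estimates differ only by the factor n / (n - 1).
print(var_n, var_n1, var_n * n / (n - 1))
```

The same `ddof` argument is accepted by `np.std()`, so the two conventions carry over to the standard deviation.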
1032
+ {
1033
+ "cell_type": "code",
1034
+ "execution_count": null,
1035
+ "id": "7e7205f2",
1036
+ "metadata": {
1037
+ "execution": {}
1038
+ },
1039
+ "outputs": [],
1040
+ "source": [
1041
+ "np.sqrt(np.var(y)), np.std(y)"
1042
+ ]
1043
+ },
1044
+ {
1045
+ "cell_type": "markdown",
1046
+ "id": "d4faf901",
1047
+ "metadata": {},
1048
+ "source": [
1049
+ "The `np.mean()`, `np.var()`, and `np.std()` functions can also be applied to the rows and columns of a matrix. \n",
1050
+ "To see this, we construct a $10 \\times 3$ matrix of $N(0,1)$ random variables, and compute the mean of each of its columns. "
1051
+ ]
1052
+ },
1053
+ {
1054
+ "cell_type": "code",
1055
+ "execution_count": null,
1056
+ "id": "fce06849",
1057
+ "metadata": {
1058
+ "execution": {}
1059
+ },
1060
+ "outputs": [],
1061
+ "source": [
1062
+ "X = rng.standard_normal((10, 3))\n",
1063
+ "X"
1064
+ ]
1065
+ },
1066
+ {
1067
+ "cell_type": "markdown",
1068
+ "id": "6cc355d2",
1069
+ "metadata": {},
1070
+ "source": [
1071
+ "Since arrays are row-major ordered, the first axis, i.e. `axis=0`, refers to its rows. Passing `axis=0` to the `mean()` method of `X` averages over the rows, returning one mean for each of the three columns. "
1072
+ ]
1073
+ },
1074
+ {
1075
+ "cell_type": "code",
1076
+ "execution_count": null,
1077
+ "id": "1403ff7a",
1078
+ "metadata": {
1079
+ "execution": {}
1080
+ },
1081
+ "outputs": [],
1082
+ "source": [
1083
+ "X.mean(axis=0)"
1084
+ ]
1085
+ },
1086
+ {
1087
+ "cell_type": "markdown",
1088
+ "id": "6785c0ec",
1089
+ "metadata": {},
1090
+ "source": [
1091
+ "The following yields the same result."
1092
+ ]
1093
+ },
1094
+ {
1095
+ "cell_type": "code",
1096
+ "execution_count": null,
1097
+ "id": "7e9255ba",
1098
+ "metadata": {
1099
+ "execution": {},
1100
+ "lines_to_next_cell": 0
1101
+ },
1102
+ "outputs": [],
1103
+ "source": [
1104
+ "X.mean(0)"
1105
+ ]
1106
+ },
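To make the role of `axis` concrete, here is a minimal sketch contrasting `axis=0` with `axis=1` on a $10 \times 3$ matrix. The seed is arbitrary and the names `col_means` and `row_means` are introduced only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this sketch
X = rng.standard_normal((10, 3))

# axis=0 collapses the 10 rows, leaving one mean per column.
col_means = X.mean(axis=0)

# axis=1 collapses the 3 columns, leaving one mean per row.
row_means = X.mean(axis=1)

print(col_means.shape)  # (3,)
print(row_means.shape)  # (10,)
```

A useful rule of thumb is that the axis passed to the method is the one that disappears from the shape of the result.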
1115
+ {
1116
+ "cell_type": "markdown",
1117
+ "id": "30b002fa",
1118
+ "metadata": {},
1119
+ "source": [
1120
+ "## Graphics\n",
1121
+ "In `Python`, common practice is to use the library\n",
1122
+ "`matplotlib` for graphics.\n",
1123
+ "However, since `Python` was not written with data analysis in mind,\n",
1124
+ " the notion of plotting is not intrinsic to the language. \n",
1125
+ "We will use the `subplots()` function\n",
1126
+ "from `matplotlib.pyplot` to create a figure and the\n",
1127
+ "axes onto which we plot our data.\n",
1128
+ "For many more examples of how to make plots in `Python`,\n",
1129
+ "readers are encouraged to visit [matplotlib.org/stable/gallery/](https://matplotlib.org/stable/gallery/index.html).\n",
1130
+ "\n",
1131
+ "In `matplotlib`, a plot consists of a *figure* and one or more *axes*. You can think of the figure as the blank canvas upon which \n",
1132
+ "one or more plots will be displayed: it is the entire plotting window. \n",
1133
+ "The *axes* contain important information about each plot, such as its $x$- and $y$-axis labels,\n",
1134
+ "title, and more. (Note that in `matplotlib`, the word *axes* is not the plural of *axis*: a plot's *axes* contains much more information \n",
1135
+ "than just the $x$-axis and the $y$-axis.)\n",
1136
+ "\n",
1137
+ "We begin by importing the `subplots()` function\n",
1138
+ "from `matplotlib`. We use this function\n",
1139
+ "throughout when creating figures.\n",
1140
+ "The function returns a tuple of length two: a figure\n",
1141
+ "object as well as the relevant axes object. We will typically\n",
1142
+ "pass `figsize` as a keyword argument.\n",
1143
+ "Having created our axes, we attempt our first plot using its `plot()` method.\n",
1144
+ "To learn more about it, \n",
1145
+ "type `ax.plot?`."
1146
+ ]
1147
+ },
1148
+ {
1149
+ "cell_type": "code",
1150
+ "execution_count": null,
1151
+ "id": "8236e5f7",
1152
+ "metadata": {
1153
+ "execution": {}
1154
+ },
1155
+ "outputs": [],
1156
+ "source": [
1157
+ "from matplotlib.pyplot import subplots\n",
1158
+ "fig, ax = subplots(figsize=(8, 8))\n",
1159
+ "x = rng.standard_normal(100)\n",
1160
+ "y = rng.standard_normal(100)\n",
1161
+ "ax.plot(x, y);\n"
1162
+ ]
1163
+ },
1164
+ {
1165
+ "cell_type": "markdown",
1166
+ "id": "bbef67e6",
1167
+ "metadata": {},
1168
+ "source": [
1169
+ "We pause here to note that we have *unpacked* the tuple of length two returned by `subplots()` into the two distinct\n",
1170
+ "variables `fig` and `ax`. Unpacking\n",
1171
+ "is typically preferred to the following equivalent but slightly more verbose code:"
1172
+ ]
1173
+ },
1174
+ {
1175
+ "cell_type": "code",
1176
+ "execution_count": null,
1177
+ "id": "ddc9ed4f",
1178
+ "metadata": {
1179
+ "execution": {}
1180
+ },
1181
+ "outputs": [],
1182
+ "source": [
1183
+ "output = subplots(figsize=(8, 8))\n",
1184
+ "fig = output[0]\n",
1185
+ "ax = output[1]"
1186
+ ]
1187
+ },
1188
+ {
1189
+ "cell_type": "markdown",
1190
+ "id": "104d6b8f",
1191
+ "metadata": {},
1192
+ "source": [
1193
+ "We see that our earlier cell produced a line plot, which is the default. To create a scatterplot, we provide an additional argument to `ax.plot()`, indicating that circles should be displayed."
1194
+ ]
1195
+ },
1196
+ {
1197
+ "cell_type": "code",
1198
+ "execution_count": null,
1199
+ "id": "c64ed600",
1200
+ "metadata": {
1201
+ "execution": {},
1202
+ "lines_to_next_cell": 0
1203
+ },
1204
+ "outputs": [],
1205
+ "source": [
1206
+ "fig, ax = subplots(figsize=(8, 8))\n",
1207
+ "ax.plot(x, y, 'o');"
1208
+ ]
1209
+ },
1210
+ {
1211
+ "cell_type": "markdown",
1212
+ "id": "840be2a9",
1213
+ "metadata": {},
1214
+ "source": [
1215
+ "Different values\n",
1216
+ "of this additional argument can be used to produce different colored lines\n",
1217
+ "as well as different linestyles. \n"
1218
+ ]
1219
+ },
1220
+ {
1221
+ "cell_type": "markdown",
1222
+ "id": "971b98bd",
1223
+ "metadata": {},
1224
+ "source": [
1225
+ "As an alternative, we could use the `ax.scatter()` function to create a scatterplot."
1226
+ ]
1227
+ },
1228
+ {
1229
+ "cell_type": "code",
1230
+ "execution_count": null,
1231
+ "id": "bc6245e2",
1232
+ "metadata": {
1233
+ "execution": {}
1234
+ },
1235
+ "outputs": [],
1236
+ "source": [
1237
+ "fig, ax = subplots(figsize=(8, 8))\n",
1238
+ "ax.scatter(x, y, marker='o');"
1239
+ ]
1240
+ },
1241
+ {
1242
+ "cell_type": "markdown",
1243
+ "id": "97f36df0",
1244
+ "metadata": {},
1245
+ "source": [
1246
+ "Notice that in the code blocks above, we have ended\n",
1247
+ "the last line with a semicolon. This prevents `ax.plot(x, y)` from printing\n",
1248
+ "text to the notebook. However, it does not prevent a plot from being produced. \n",
1249
+ " If we omit the trailing semicolon, then we obtain the following output: "
1250
+ ]
1251
+ },
1252
+ {
1253
+ "cell_type": "code",
1254
+ "execution_count": null,
1255
+ "id": "2454807b",
1256
+ "metadata": {
1257
+ "execution": {},
1258
+ "lines_to_next_cell": 0
1259
+ },
1260
+ "outputs": [],
1261
+ "source": [
1262
+ "fig, ax = subplots(figsize=(8, 8))\n",
1263
+ "ax.scatter(x, y, marker='o')\n"
1264
+ ]
1265
+ },
1266
+ {
1267
+ "cell_type": "markdown",
1268
+ "id": "1230c0a6",
1269
+ "metadata": {},
1270
+ "source": [
1271
+ "In what follows, we will use\n",
1272
+ " trailing semicolons whenever the text that would be output is not\n",
1273
+ "germane to the discussion at hand."
1277
+ ]
1278
+ },
1279
+ {
1280
+ "cell_type": "markdown",
1281
+ "id": "0ccb9964",
1282
+ "metadata": {},
1283
+ "source": [
1284
+ "To label our plot, we make use of the `set_xlabel()`, `set_ylabel()`, and `set_title()` methods\n",
1285
+ "of `ax`.\n",
1286
+ " "
1287
+ ]
1288
+ },
1289
+ {
1290
+ "cell_type": "code",
1291
+ "execution_count": null,
1292
+ "id": "1e18a793",
1293
+ "metadata": {
1294
+ "execution": {}
1295
+ },
1296
+ "outputs": [],
1297
+ "source": [
1298
+ "fig, ax = subplots(figsize=(8, 8))\n",
1299
+ "ax.scatter(x, y, marker='o')\n",
1300
+ "ax.set_xlabel(\"this is the x-axis\")\n",
1301
+ "ax.set_ylabel(\"this is the y-axis\")\n",
1302
+ "ax.set_title(\"Plot of X vs Y\");"
1303
+ ]
1304
+ },
1305
+ {
1306
+ "cell_type": "markdown",
1307
+ "id": "f2d818ee",
1308
+ "metadata": {},
1309
+ "source": [
1310
+ " Having access to the figure object `fig` itself means that we can go in and change some aspects and then redisplay it. Here, we change\n",
1311
+ " the size from `(8, 8)` to `(12, 3)`.\n"
1312
+ ]
1313
+ },
1314
+ {
1315
+ "cell_type": "code",
1316
+ "execution_count": null,
1317
+ "id": "aec3f009",
1318
+ "metadata": {
1319
+ "execution": {},
1320
+ "lines_to_next_cell": 0
1321
+ },
1322
+ "outputs": [],
1323
+ "source": [
1324
+ "fig.set_size_inches(12,3)\n",
1325
+ "fig"
1326
+ ]
1327
+ },
1336
+ {
1337
+ "cell_type": "markdown",
1338
+ "id": "011bf802",
1339
+ "metadata": {},
1340
+ "source": [
1341
+ "Occasionally we will want to create several plots within a figure. This can be\n",
1342
+ "achieved by passing additional arguments to `subplots()`. \n",
1343
+ "Below, we create a $2 \\times 3$ grid of plots\n",
1344
+ "in a figure of size determined by the `figsize` argument. In such\n",
1345
+ "situations, there is often a relationship between the axes in the plots. For example,\n",
1346
+ "all plots may have a common $x$-axis. The `subplots()` function can automatically handle\n",
1347
+ "this situation when passed the keyword argument `sharex=True`.\n",
1348
+ "The `axes` object below is an array pointing to different plots in the figure. "
1349
+ ]
1350
+ },
1351
+ {
1352
+ "cell_type": "code",
1353
+ "execution_count": null,
1354
+ "id": "2cbc7fd4",
1355
+ "metadata": {
1356
+ "execution": {},
1357
+ "lines_to_next_cell": 0
1358
+ },
1359
+ "outputs": [],
1360
+ "source": [
1361
+ "fig, axes = subplots(nrows=2,\n",
1362
+ " ncols=3,\n",
1363
+ " figsize=(15, 5))"
1364
+ ]
1365
+ },
1366
+ {
1367
+ "cell_type": "markdown",
1368
+ "id": "b8ff2e6d",
1369
+ "metadata": {},
1370
+ "source": [
1371
+ "We now produce a scatter plot with `'o'` in the second column of the first row and\n",
1372
+ "a scatter plot with `'+'` in the third column of the second row."
1373
+ ]
1374
+ },
1375
+ {
1376
+ "cell_type": "code",
1377
+ "execution_count": null,
1378
+ "id": "702f80d9",
1379
+ "metadata": {
1380
+ "execution": {},
1381
+ "lines_to_next_cell": 0
1382
+ },
1383
+ "outputs": [],
1384
+ "source": [
1385
+ "axes[0,1].plot(x, y, 'o')\n",
1386
+ "axes[1,2].scatter(x, y, marker='+')\n",
1387
+ "fig"
1388
+ ]
1389
+ },
1390
+ {
1391
+ "cell_type": "markdown",
1392
+ "id": "5b265f8b",
1393
+ "metadata": {},
1394
+ "source": [
1395
+ "Type `subplots?` to learn more about \n",
1396
+ "`subplots()`."
1399
+ ]
1400
+ },
1401
+ {
1402
+ "cell_type": "markdown",
1403
+ "id": "1bd7e707",
1404
+ "metadata": {},
1405
+ "source": [
1406
+ "To save the output of `fig`, we call its `savefig()`\n",
1407
+ "method. The argument `dpi` is the dots per inch, used\n",
1408
+ "to determine how large the figure will be in pixels."
1409
+ ]
1410
+ },
1411
+ {
1412
+ "cell_type": "code",
1413
+ "execution_count": null,
1414
+ "id": "5493d229",
1415
+ "metadata": {
1416
+ "execution": {},
1417
+ "lines_to_next_cell": 2
1418
+ },
1419
+ "outputs": [],
1420
+ "source": [
1421
+ "fig.savefig(\"Figure.png\", dpi=400)\n",
1422
+ "fig.savefig(\"Figure.pdf\", dpi=200);\n"
1423
+ ]
1424
+ },
1425
+ {
1426
+ "cell_type": "markdown",
1427
+ "id": "7152d0c7",
1428
+ "metadata": {},
1429
+ "source": [
1430
+ "We can continue to modify `fig` using step-by-step updates; for example, we can modify the range of the $x$-axis, re-save the figure, and even re-display it. "
1431
+ ]
1432
+ },
1433
+ {
1434
+ "cell_type": "code",
1435
+ "execution_count": null,
1436
+ "id": "bd07af12",
1437
+ "metadata": {
1438
+ "execution": {}
1439
+ },
1440
+ "outputs": [],
1441
+ "source": [
1442
+ "axes[0,1].set_xlim([-1,1])\n",
1443
+ "fig.savefig(\"Figure_updated.jpg\")\n",
1444
+ "fig"
1445
+ ]
1446
+ },
1447
+ {
1448
+ "cell_type": "markdown",
1449
+ "id": "b5278857",
1450
+ "metadata": {},
1451
+ "source": [
1452
+ "We now create some more sophisticated plots. The \n",
1453
+ "`ax.contour()` method produces a *contour plot* \n",
1454
+ "in order to represent three-dimensional data, similar to a\n",
1455
+ "topographical map. It takes three arguments:\n",
1456
+ "\n",
1457
+ "* A vector of `x` values (the first dimension),\n",
1458
+ "* A vector of `y` values (the second dimension), and\n",
1459
+ "* A matrix whose elements correspond to the `z` value (the third\n",
1460
+ "dimension) for each pair of `(x,y)` coordinates.\n",
1461
+ "\n",
1462
+ "To create `x` and `y`, we’ll use the command `np.linspace(a, b, n)`, \n",
1463
+ "which returns a vector of `n` numbers starting at `a` and ending at `b`."
1464
+ ]
1465
+ },
1466
+ {
1467
+ "cell_type": "code",
1468
+ "execution_count": null,
1469
+ "id": "01019508",
1470
+ "metadata": {
1471
+ "execution": {},
1472
+ "lines_to_next_cell": 0
1473
+ },
1474
+ "outputs": [],
1475
+ "source": [
1476
+ "fig, ax = subplots(figsize=(8, 8))\n",
1477
+ "x = np.linspace(-np.pi, np.pi, 50)\n",
1478
+ "y = x\n",
1479
+ "f = np.multiply.outer(np.cos(y), 1 / (1 + x**2))\n",
1480
+ "ax.contour(x, y, f);\n"
1481
+ ]
1482
+ },
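The matrix `f` passed to `ax.contour()` above is built with `np.multiply.outer()`. The sketch below, which is only an illustration, checks that entry `f[i, j]` equals $\cos(y_i) / (1 + x_j^2)$ and that the same matrix can be obtained by broadcasting. The name `f_broadcast` and the indices `i`, `j` are chosen here for the example.

```python
import numpy as np

x = np.linspace(-np.pi, np.pi, 50)
y = x
f = np.multiply.outer(np.cos(y), 1 / (1 + x**2))

# Equivalent construction: a (50, 1) column multiplied by a (50,) row.
f_broadcast = np.cos(y)[:, None] * (1 / (1 + x**2))

i, j = 7, 31  # arbitrary row (y) and column (x) indices
print(f.shape)
print(np.isclose(f[i, j], np.cos(y[i]) / (1 + x[j]**2)))
print(np.allclose(f, f_broadcast))
```

Rows of `f` therefore correspond to values of `y` and columns to values of `x`, which is the layout `ax.contour(x, y, f)` expects.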
1483
+ {
1484
+ "cell_type": "markdown",
1485
+ "id": "9ef3c475",
1486
+ "metadata": {},
1487
+ "source": [
1488
+ "We can increase the resolution by adding more levels to the image."
1489
+ ]
1490
+ },
1491
+ {
1492
+ "cell_type": "code",
1493
+ "execution_count": null,
1494
+ "id": "7d08992f",
1495
+ "metadata": {
1496
+ "execution": {},
1497
+ "lines_to_next_cell": 0
1498
+ },
1499
+ "outputs": [],
1500
+ "source": [
1501
+ "fig, ax = subplots(figsize=(8, 8))\n",
1502
+ "ax.contour(x, y, f, levels=45);"
1503
+ ]
1504
+ },
1505
+ {
1506
+ "cell_type": "markdown",
1507
+ "id": "8e1d37a2",
1508
+ "metadata": {},
1509
+ "source": [
1510
+ "To fine-tune the output of the\n",
1511
+ "`ax.contour()` method, take a\n",
1512
+ "look at its help page by typing `ax.contour?`.\n",
1513
+ " \n",
1514
+ "The `ax.imshow()` method is similar to \n",
1515
+ "`ax.contour()`, except that it produces a color-coded plot\n",
1516
+ "whose colors depend on the `z` value. This is known as a\n",
1517
+ "*heatmap*, and is sometimes used to plot temperature in\n",
1518
+ "weather forecasts."
1519
+ ]
1520
+ },
1521
+ {
1522
+ "cell_type": "code",
1523
+ "execution_count": null,
1524
+ "id": "1f89d704",
1525
+ "metadata": {
1526
+ "execution": {},
1527
+ "lines_to_next_cell": 2
1528
+ },
1529
+ "outputs": [],
1530
+ "source": [
1531
+ "fig, ax = subplots(figsize=(8, 8))\n",
1532
+ "ax.imshow(f);\n"
1533
+ ]
1534
+ },
1535
+ {
1536
+ "cell_type": "markdown",
1537
+ "id": "2500a6ec",
1538
+ "metadata": {},
1539
+ "source": [
1540
+ "## Sequences and Slice Notation"
1541
+ ]
1542
+ },
1543
+ {
1544
+ "cell_type": "markdown",
1545
+ "id": "07001b88",
1546
+ "metadata": {},
1547
+ "source": [
1548
+ "As seen above, the\n",
1549
+ "function `np.linspace()` can be used to create a sequence\n",
1550
+ "of numbers."
1551
+ ]
1552
+ },
1553
+ {
1554
+ "cell_type": "code",
1555
+ "execution_count": null,
1556
+ "id": "cd971131",
1557
+ "metadata": {
1558
+ "execution": {},
1559
+ "lines_to_next_cell": 2
1560
+ },
1561
+ "outputs": [],
1562
+ "source": [
1563
+ "seq1 = np.linspace(0, 10, 11)\n",
1564
+ "seq1\n"
1565
+ ]
1566
+ },
1567
+ {
1568
+ "cell_type": "markdown",
1569
+ "id": "926f96fc",
1570
+ "metadata": {},
1571
+ "source": [
1572
+ "The function `np.arange()`\n",
1573
+ " returns a sequence of numbers spaced out by `step`. If `step` is not specified, then a default value of $1$ is used. Let's create a sequence\n",
1574
+ " that starts at $0$ and ends at $10$."
1575
+ ]
1576
+ },
1577
+ {
1578
+ "cell_type": "code",
1579
+ "execution_count": null,
1580
+ "id": "aa630d16",
1581
+ "metadata": {
1582
+ "execution": {}
1583
+ },
1584
+ "outputs": [],
1585
+ "source": [
1586
+ "seq2 = np.arange(0, 10)\n",
1587
+ "seq2\n"
1588
+ ]
1589
+ },
1590
+ {
1591
+ "cell_type": "markdown",
1592
+ "id": "6908bad7",
1593
+ "metadata": {},
1594
+ "source": [
1595
+ "Why isn't $10$ output above? This has to do with *slice* notation in `Python`. \n",
1596
+ "Slice notation \n",
1597
+ "is used to index sequences such as lists, tuples and arrays.\n",
1598
+ "Suppose we want to retrieve the fourth through sixth (inclusive) entries\n",
1599
+ "of a string. We obtain a slice of the string using the indexing notation `[3:6]`."
1600
+ ]
1601
+ },
1602
+ {
1603
+ "cell_type": "code",
1604
+ "execution_count": null,
1605
+ "id": "89955ee2",
1606
+ "metadata": {
1607
+ "execution": {},
1608
+ "lines_to_next_cell": 0
1609
+ },
1610
+ "outputs": [],
1611
+ "source": [
1612
+ "\"hello world\"[3:6]"
1613
+ ]
1614
+ },
1615
+ {
1616
+ "cell_type": "markdown",
1617
+ "id": "17d73e4d",
1618
+ "metadata": {},
1619
+ "source": [
1620
+ "In the code block above, the notation `3:6` is shorthand for `slice(3,6)` when used inside\n",
1621
+ "`[]`. "
1622
+ ]
1623
+ },
1624
+ {
1625
+ "cell_type": "code",
1626
+ "execution_count": null,
1627
+ "id": "517f592d",
1628
+ "metadata": {
1629
+ "execution": {}
1630
+ },
1631
+ "outputs": [],
1632
+ "source": [
1633
+ "\"hello world\"[slice(3,6)]\n"
1634
+ ]
1635
+ },
1636
+ {
1637
+ "cell_type": "markdown",
1638
+ "id": "680fe656",
1639
+ "metadata": {},
1640
+ "source": [
1641
+ "You might have expected `slice(3,6)` to output the fourth through seventh characters in the text string (recalling that `Python` begins its indexing at zero), but instead it output the fourth through sixth, because the end point of a slice is excluded. \n",
1642
+ " This also explains why the earlier `np.arange(0, 10)` command output only the integers from $0$ to $9$. \n",
1643
+ "See the documentation `slice?` for useful options in creating slices."
1664
+ ]
1665
+ },
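The half-open convention discussed above can be checked directly. The following sketch is illustrative only; the names `s`, `start`, `stop`, and `piece` are introduced here. It shows that a slice `start:stop` has length `stop - start`, and contrasts `np.arange()` with `np.linspace()`, which includes both endpoints by default.

```python
import numpy as np

s = "hello world"
start, stop = 3, 6

# The slice covers positions 3, 4 and 5; position 6 is excluded.
piece = s[start:stop]
print(repr(piece), len(piece))

# np.arange follows the same convention: the stop value is left out.
seq_arange = np.arange(0, 10)
print(seq_arange)

# np.linspace, by contrast, includes both endpoints by default.
seq_linspace = np.linspace(0, 10, 11)
print(seq_linspace)
```

One consequence of this convention is that `s[:k] + s[k:]` always reassembles `s`, whatever the value of `k`.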
1666
+ {
1667
+ "cell_type": "markdown",
1668
+ "id": "522a2761",
1669
+ "metadata": {},
1670
+ "source": [
1671
+ "## Indexing Data\n",
1672
+ "To begin, we create a two-dimensional `numpy` array."
1673
+ ]
1674
+ },
1675
+ {
1676
+ "cell_type": "code",
1677
+ "execution_count": null,
1678
+ "id": "35927abd",
1679
+ "metadata": {
1680
+ "execution": {}
1681
+ },
1682
+ "outputs": [],
1683
+ "source": [
1684
+ "A = np.array(np.arange(16)).reshape((4, 4))\n",
1685
+ "A\n"
1686
+ ]
1687
+ },
1688
+ {
1689
+ "cell_type": "markdown",
1690
+ "id": "27c88984",
1691
+ "metadata": {},
1692
+ "source": [
1693
+ "Typing `A[1,2]` retrieves the element corresponding to the second row and third\n",
1694
+ "column. (As usual, `Python` indexes from $0.$)"
1695
+ ]
1696
+ },
1697
+ {
1698
+ "cell_type": "code",
1699
+ "execution_count": null,
1700
+ "id": "78ee7f5b",
1701
+ "metadata": {
1702
+ "execution": {}
1703
+ },
1704
+ "outputs": [],
1705
+ "source": [
1706
+ "A[1,2]\n"
1707
+ ]
1708
+ },
1709
+ {
1710
+ "cell_type": "markdown",
1711
+ "id": "dd65ec1c",
1712
+ "metadata": {},
1713
+ "source": [
1714
+ "The first number after the open-bracket symbol `[`\n",
1715
+ " refers to the row, and the second number refers to the column. \n",
1716
+ "\n",
1717
+ "### Indexing Rows, Columns, and Submatrices\n",
1718
+ " To select multiple rows at a time, we can pass in a list\n",
1719
+ " specifying our selection. For instance, `[1,3]` will retrieve the second and fourth rows:"
1720
+ ]
1721
+ },
1722
+ {
1723
+ "cell_type": "code",
1724
+ "execution_count": null,
1725
+ "id": "16212696",
1726
+ "metadata": {
1727
+ "execution": {}
1728
+ },
1729
+ "outputs": [],
1730
+ "source": [
1731
+ "A[[1,3]]\n"
1732
+ ]
1733
+ },
1734
+ {
1735
+ "cell_type": "markdown",
1736
+ "id": "0b8b3ce3",
1737
+ "metadata": {},
1738
+ "source": [
1739
+ "To select the first and third columns, we pass in `[0,2]` as the second argument in the square brackets.\n",
1740
+ "In this case we need to supply the first argument `:` \n",
1741
+ "which selects all rows."
1742
+ ]
1743
+ },
1744
+ {
1745
+ "cell_type": "code",
1746
+ "execution_count": null,
1747
+ "id": "d5f473d2",
1748
+ "metadata": {
1749
+ "execution": {}
1750
+ },
1751
+ "outputs": [],
1752
+ "source": [
1753
+ "A[:,[0,2]]\n"
1754
+ ]
1755
+ },
1756
+ {
1757
+ "cell_type": "markdown",
1758
+ "id": "471ed1b4",
1759
+ "metadata": {},
1760
+ "source": [
1761
+ "Now, suppose that we want to select the submatrix made up of the second and fourth \n",
1762
+ "rows as well as the first and third columns. This is where\n",
1763
+ "indexing gets slightly tricky. It is natural to try to use lists to retrieve the rows and columns:"
1764
+ ]
1765
+ },
1766
+ {
1767
+ "cell_type": "code",
1768
+ "execution_count": null,
1769
+ "id": "c89646d6",
1770
+ "metadata": {
1771
+ "execution": {}
1772
+ },
1773
+ "outputs": [],
1774
+ "source": [
1775
+ "A[[1,3],[0,2]]\n"
1776
+ ]
1777
+ },
1778
+ {
1779
+ "cell_type": "markdown",
1780
+ "id": "9cbf1ff9",
1781
+ "metadata": {},
1782
+ "source": [
1783
+ " Oops --- what happened? We got a one-dimensional array of length two identical to"
1784
+ ]
1785
+ },
1786
+ {
1787
+ "cell_type": "code",
1788
+ "execution_count": null,
1789
+ "id": "87f6b4f2",
1790
+ "metadata": {
1791
+ "execution": {}
1792
+ },
1793
+ "outputs": [],
1794
+ "source": [
1795
+ "np.array([A[1,0],A[3,2]])\n"
1796
+ ]
1797
+ },
1798
+ {
1799
+ "cell_type": "markdown",
1800
+ "id": "9a93dc96",
1801
+ "metadata": {},
1802
+ "source": [
1803
+ " Similarly, the following code fails to extract the submatrix consisting of the second and fourth rows and the first, third, and fourth columns; it raises an `IndexError` because the two lists have different lengths:"
1804
+ ]
1805
+ },
1806
+ {
1807
+ "cell_type": "code",
1808
+ "execution_count": null,
1809
+ "id": "5da5bda8",
1810
+ "metadata": {
1811
+ "execution": {}
1812
+ },
1813
+ "outputs": [],
1814
+ "source": [
1815
+ "A[[1,3],[0,2,3]]\n"
1816
+ ]
1817
+ },
1818
+ {
1819
+ "cell_type": "markdown",
1820
+ "id": "f4fd2f83",
1821
+ "metadata": {},
1822
+ "source": [
1823
+ "We can see what has gone wrong here. When supplied with two indexing lists, the `numpy` interpretation is that these provide pairs of $i,j$ indices for a series of entries. That is why the pair of lists must have the same length. However, that was not our intent, since we are looking for a submatrix.\n",
1824
+ "\n",
1825
+ "One easy way to do this is as follows. We first create a submatrix by subsetting the rows of `A`, and then on the fly we make a further submatrix by subsetting its columns.\n"
1826
+ ]
1827
+ },
1828
+ {
1829
+ "cell_type": "code",
1830
+ "execution_count": null,
1831
+ "id": "ac48a95b",
1832
+ "metadata": {
1833
+ "execution": {},
1834
+ "lines_to_next_cell": 0
1835
+ },
1836
+ "outputs": [],
1837
+ "source": [
1838
+ "A[[1,3]][:,[0,2]]\n"
1839
+ ]
1840
+ },
1849
+ {
1850
+ "cell_type": "markdown",
1851
+ "id": "a09467cd",
1852
+ "metadata": {},
1853
+ "source": [
1854
+ "There are more efficient ways of achieving the same result.\n",
1855
+ "\n",
1856
+ "The *convenience function* `np.ix_()` allows us to extract a submatrix\n",
1857
+ "using lists, by creating an intermediate *mesh* object."
1858
+ ]
1859
+ },
1860
+ {
1861
+ "cell_type": "code",
1862
+ "execution_count": null,
1863
+ "id": "ee195cc4",
1864
+ "metadata": {
1865
+ "execution": {},
1866
+ "lines_to_next_cell": 2
1867
+ },
1868
+ "outputs": [],
1869
+ "source": [
1870
+ "idx = np.ix_([1,3],[0,2,3])\n",
1871
+ "A[idx]\n"
1872
+ ]
1873
+ },
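To see what `np.ix_()` produces, the sketch below unpacks its result and inspects the shapes. This is an illustration only, and the names `rows`, `cols`, `sub`, and `paired` are introduced here. The two index arrays have shapes `(2, 1)` and `(1, 3)`, so they broadcast against each other to select a full $2 \times 3$ block, whereas two plain lists of equal length are read as coordinate pairs.

```python
import numpy as np

A = np.arange(16).reshape((4, 4))

# np.ix_ returns one index array per axis, shaped so that they broadcast.
rows, cols = np.ix_([1, 3], [0, 2, 3])
print(rows.shape, cols.shape)  # (2, 1) (1, 3)

# Broadcasting the two index arrays selects every (row, column) combination.
sub = A[rows, cols]
print(sub)

# Two plain lists are paired up instead: elements A[1, 0] and A[3, 2].
paired = A[[1, 3], [0, 2]]
print(paired)
```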
1874
+ {
1875
+ "cell_type": "markdown",
1876
+ "id": "b7177cb9",
1877
+ "metadata": {},
1878
+ "source": [
1879
+ "Alternatively, we can subset matrices efficiently using slices.\n",
1880
+ " \n",
1881
+ "The slice\n",
1882
+ "`1:4:2` captures the second and fourth items of a sequence, while the slice `0:3:2` captures\n",
1883
+ "the first and third items (the third element in a slice sequence is the step size)."
1884
+ ]
1885
+ },
1886
+ {
1887
+ "cell_type": "code",
1888
+ "execution_count": null,
1889
+ "id": "48917bb5",
1890
+ "metadata": {
1891
+ "execution": {},
1892
+ "lines_to_next_cell": 0
1893
+ },
1894
+ "outputs": [],
1895
+ "source": [
1896
+ "A[1:4:2,0:3:2]\n"
1897
+ ]
1898
+ },
1899
+ {
1900
+ "cell_type": "markdown",
1901
+ "id": "697c5ab0",
1902
+ "metadata": {},
1903
+ "source": [
1904
+ " "
1905
+ ]
1906
+ },
1907
+ {
1908
+ "cell_type": "markdown",
1909
+ "id": "c647dbf0",
1910
+ "metadata": {},
1911
+ "source": [
1912
+ "Why are we able to retrieve a submatrix directly using slices but not using lists?\n",
1913
+ "It's because they are different `Python` types, and\n",
1914
+ "are treated differently by `numpy`.\n",
1915
+ "Slices can be used to extract objects from arbitrary sequences, such as strings, lists, and tuples, while the use of lists for indexing is more limited."
1927
+ ]
1928
+ },
1929
+ {
1930
+ "cell_type": "markdown",
1931
+ "id": "2dce8961",
1932
+ "metadata": {},
1933
+ "source": [
1934
+ "### Boolean Indexing\n",
1935
+ "In `numpy`, a *Boolean* is a type that equals either `True` or `False` (also represented as $1$ and $0$, respectively).\n",
1936
+ "The next line creates a vector of $0$'s, represented as Booleans, of length equal to the first dimension of `A`. "
1937
+ ]
1938
+ },
1939
+ {
1940
+ "cell_type": "code",
1941
+ "execution_count": null,
1942
+ "id": "5d4caf22",
1943
+ "metadata": {
1944
+ "execution": {},
1945
+ "lines_to_next_cell": 0
1946
+ },
1947
+ "outputs": [],
1948
+ "source": [
1949
+ "keep_rows = np.zeros(A.shape[0], bool)\n",
1950
+ "keep_rows"
1951
+ ]
1952
+ },
1953
+ {
1954
+ "cell_type": "markdown",
1955
+ "id": "d83fadb5",
1956
+ "metadata": {},
1957
+ "source": [
1958
+ "We now set two of the elements to `True`. "
1959
+ ]
1960
+ },
1961
+ {
1962
+ "cell_type": "code",
1963
+ "execution_count": null,
1964
+ "id": "348820e3",
1965
+ "metadata": {
1966
+ "execution": {}
1967
+ },
1968
+ "outputs": [],
1969
+ "source": [
1970
+ "keep_rows[[1,3]] = True\n",
1971
+ "keep_rows\n"
1972
+ ]
1973
+ },
1974
+ {
1975
+ "cell_type": "markdown",
1976
+ "id": "a0fb487d",
1977
+ "metadata": {},
1978
+ "source": [
1979
+ "Note that the elements of `keep_rows`, when viewed as integers, are the same as the\n",
1980
+ "values of `np.array([0,1,0,1])`. Below, we use `==` to verify their equality. When\n",
1981
+ "applied to two arrays, the `==` operation is applied elementwise."
1982
+ ]
1983
+ },
1984
+ {
1985
+ "cell_type": "code",
1986
+ "execution_count": null,
1987
+ "id": "4aafe45b",
1988
+ "metadata": {
1989
+ "execution": {}
1990
+ },
1991
+ "outputs": [],
1992
+ "source": [
1993
+ "np.all(keep_rows == np.array([0,1,0,1]))\n"
1994
+ ]
1995
+ },
1996
+ {
1997
+ "cell_type": "markdown",
1998
+ "id": "603c0c53",
1999
+ "metadata": {},
2000
+ "source": [
2001
+ "(Here, the function `np.all()` has checked whether\n",
2002
+ "all entries of an array are `True`. A similar function, `np.any()`, can be used to check whether any entries of an array are `True`.)"
2003
+ ]
2004
+ },
2005
+ {
2006
+ "cell_type": "markdown",
2007
+ "id": "b0a449d1",
2008
+ "metadata": {},
2009
+ "source": [
2010
+ " However, even though `np.array([0,1,0,1])` and `keep_rows` are equal according to `==`, they index different sets of rows!\n",
2011
+ "The former retrieves the first, second, first, and second rows of `A`. "
2012
+ ]
2013
+ },
2014
+ {
2015
+ "cell_type": "code",
2016
+ "execution_count": null,
2017
+ "id": "1be6a588",
2018
+ "metadata": {
2019
+ "execution": {}
2020
+ },
2021
+ "outputs": [],
2022
+ "source": [
2023
+ "A[np.array([0,1,0,1])]\n"
2024
+ ]
2025
+ },
2026
+ {
2027
+ "cell_type": "markdown",
2028
+ "id": "e45bbebe",
2029
+ "metadata": {},
2030
+ "source": [
2031
+ " By contrast, `keep_rows` retrieves only the second and fourth rows of `A` --- i.e. the rows for which the Boolean equals `True`. "
2032
+ ]
2033
+ },
2034
+ {
2035
+ "cell_type": "code",
2036
+ "execution_count": null,
2037
+ "id": "e83da57b",
2038
+ "metadata": {
2039
+ "execution": {}
2040
+ },
2041
+ "outputs": [],
2042
+ "source": [
2043
+ "A[keep_rows]\n"
2044
+ ]
2045
+ },
2046
+ {
2047
+ "cell_type": "markdown",
2048
+ "id": "374d34a7",
2049
+ "metadata": {},
2050
+ "source": [
2051
+ "This example shows that Booleans and integers are treated differently by `numpy`."
2052
+ ]
2053
+ },
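The contrast between Boolean and integer indexing can be reproduced on a small stand-in array. The sketch below is illustrative only: the 4x4 array `A_demo` and the names `int_idx`, `by_bool`, `by_int`, and `via_nonzero` are assumptions introduced here and are not part of the lab.

```python
import numpy as np

# Hypothetical stand-in for the lab's array `A` (any array with 4 rows works).
A_demo = np.arange(16).reshape((4, 4))

# Boolean mask with True in positions 1 and 3, built as in the lab.
keep_rows = np.zeros(A_demo.shape[0], bool)
keep_rows[[1, 3]] = True

# The same values viewed as integers: 0, 1, 0, 1.
int_idx = keep_rows.astype(int)

by_bool = A_demo[keep_rows]   # mask: keeps rows 1 and 3 (2 rows)
by_int = A_demo[int_idx]      # positions: rows 0, 1, 0, 1 (4 rows)

# np.nonzero() converts a mask into the positions it selects.
via_nonzero = A_demo[np.nonzero(keep_rows)[0]]

print(by_bool.shape, by_int.shape)
```

If an integer 0/1 array is meant as a mask, converting it with `.astype(bool)` before indexing avoids the surprise shown above.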
2054
+ {
2055
+ "cell_type": "markdown",
2056
+ "id": "25db74bf",
2057
+ "metadata": {},
2058
+ "source": [
2059
+ "We again make use of the `np.ix_()` function\n",
2060
+ " to create a mesh containing the second and fourth rows, and the first, third, and fourth columns. This time, we apply the function to Booleans,\n",
2061
+ " rather than lists."
2062
+ ]
2063
+ },
2064
+ {
2065
+ "cell_type": "code",
2066
+ "execution_count": null,
2067
+ "id": "09675294",
2068
+ "metadata": {
2069
+ "execution": {}
2070
+ },
2071
+ "outputs": [],
2072
+ "source": [
2073
+ "keep_cols = np.zeros(A.shape[1], bool)\n",
2074
+ "keep_cols[[0, 2, 3]] = True\n",
2075
+ "idx_bool = np.ix_(keep_rows, keep_cols)\n",
2076
+ "A[idx_bool]\n"
2077
+ ]
2078
+ },
2079
+ {
2080
+ "cell_type": "markdown",
2081
+ "id": "0166c179",
2082
+ "metadata": {},
2083
+ "source": [
2084
+ "We can also mix a list with an array of Booleans in the arguments to `np.ix_()`:"
2085
+ ]
2086
+ },
2087
+ {
2088
+ "cell_type": "code",
2089
+ "execution_count": null,
2090
+ "id": "a85614e4",
2091
+ "metadata": {
2092
+ "execution": {},
2093
+ "lines_to_next_cell": 0
2094
+ },
2095
+ "outputs": [],
2096
+ "source": [
2097
+ "idx_mixed = np.ix_([1,3], keep_cols)\n",
2098
+ "A[idx_mixed]\n"
2099
+ ]
2100
+ },
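To see why lists and Booleans can be mixed, it may help to inspect what `np.ix_()` returns. The following is a minimal sketch on a hypothetical 4x4 array (`A_demo` and the `from_*` names are assumptions made for illustration): each Boolean mask is converted to integer positions, and the results are shaped so that they broadcast into a mesh.

```python
import numpy as np

A_demo = np.arange(16).reshape((4, 4))   # hypothetical stand-in for `A`
keep_rows = np.array([False, True, False, True])
keep_cols = np.array([True, False, True, True])

# np.ix_() turns each mask into integer positions and reshapes them
# so that the row and column selections broadcast against each other.
row_idx, col_idx = np.ix_(keep_rows, keep_cols)
print(row_idx.shape, col_idx.shape)

from_bool = A_demo[row_idx, col_idx]
from_lists = A_demo[np.ix_([1, 3], [0, 2, 3])]
from_mixed = A_demo[np.ix_([1, 3], keep_cols)]
```

All three selections pick out the same 2x3 block, which is why the lab's Boolean, list, and mixed calls agree.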
2101
+ {
2102
+ "cell_type": "markdown",
2103
+ "id": "f6a338f1",
2104
+ "metadata": {},
2105
+ "source": [
2106
+ " "
2107
+ ]
2108
+ },
2109
+ {
2110
+ "cell_type": "markdown",
2111
+ "id": "b3541e0c",
2112
+ "metadata": {},
2113
+ "source": [
2114
+ "For more details on indexing in `numpy`, readers are referred\n",
2115
+ "to the `numpy` tutorial mentioned earlier.\n"
2116
+ ]
2117
+ },
2118
+ {
2119
+ "cell_type": "markdown",
2120
+ "id": "ab75f168",
2121
+ "metadata": {},
2122
+ "source": [
2123
+ "## Loading Data\n",
2124
+ "\n",
2125
+ "Data sets often contain different types of data, and may have names associated with the rows or columns. \n",
2126
+ "For these reasons, they typically are best accommodated using a\n",
2127
+ " *data frame*. \n",
2128
+ " We can think of a data frame as a sequence\n",
2129
+ "of arrays of identical length; these are the columns. Entries in the\n",
2130
+ "different arrays can be combined to form a row.\n",
2131
+ " The `pandas`\n",
2132
+ "library can be used to create and work with data frame objects."
2133
+ ]
2134
+ },
2135
+ {
2136
+ "cell_type": "markdown",
2137
+ "id": "ca018d13",
2138
+ "metadata": {},
2139
+ "source": [
2140
+ "### Reading in a Data Set\n",
2141
+ "\n",
2142
+ "The first step of most analyses involves importing a data set into\n",
2143
+ "`Python`. \n",
2144
+ " Before attempting to load\n",
2145
+ "a data set, we must make sure that `Python` knows where to find the file containing it. \n",
2146
+ "If the\n",
2147
+ "file is in the same location\n",
2148
+ "as this notebook file, then we are all set. \n",
2149
+ "Otherwise, \n",
2150
+ "the command\n",
2151
+ "`os.chdir()` can be used to *change directory*. (You will need to call `import os` before calling `os.chdir()`.) "
2152
+ ]
2153
+ },
2154
+ {
2155
+ "cell_type": "markdown",
2156
+ "id": "b76342df",
2157
+ "metadata": {},
2158
+ "source": [
2159
+ "We will begin by reading in `Auto.csv`, available on the book website. This is a comma-separated file, and can be read in using `pd.read_csv()`: "
2160
+ ]
2161
+ },
2162
+ {
2163
+ "cell_type": "code",
2164
+ "execution_count": null,
2165
+ "id": "ff81e644",
2166
+ "metadata": {
2167
+ "execution": {}
2168
+ },
2169
+ "outputs": [],
2170
+ "source": [
2171
+ "import pandas as pd\n",
2172
+ "Auto = pd.read_csv('Auto.csv')\n",
2173
+ "Auto\n"
2174
+ ]
2175
+ },
2176
+ {
2177
+ "cell_type": "markdown",
2178
+ "id": "42d6a799",
2179
+ "metadata": {},
2180
+ "source": [
2181
+ "The book website also has a whitespace-delimited version of this data, called `Auto.data`. This can be read in as follows:"
2182
+ ]
2183
+ },
2184
+ {
2185
+ "cell_type": "code",
2186
+ "execution_count": null,
2187
+ "id": "5b45aa7f",
2188
+ "metadata": {
2189
+ "execution": {},
2190
+ "lines_to_next_cell": 0
2191
+ },
2192
+ "outputs": [],
2193
+ "source": [
2194
+ "Auto = pd.read_csv('Auto.data', sep=r'\\s+')\n"
2195
+ ]
2196
+ },
2197
+ {
2198
+ "cell_type": "markdown",
2199
+ "id": "f942c457",
2200
+ "metadata": {},
2201
+ "source": [
2202
+ " Both `Auto.csv` and `Auto.data` are simply text\n",
2203
+ "files. Before loading data into `Python`, it is a good idea to view it using\n",
2204
+ "a text editor or other software, such as Microsoft Excel.\n",
2205
+ "\n"
2206
+ ]
2207
+ },
2208
+ {
2209
+ "cell_type": "markdown",
2210
+ "id": "1aceff38",
2211
+ "metadata": {},
2212
+ "source": [
2213
+ "We now take a look at the column of `Auto` corresponding to the variable `horsepower`: "
2214
+ ]
2215
+ },
2216
+ {
2217
+ "cell_type": "code",
2218
+ "execution_count": null,
2219
+ "id": "413f626a",
2220
+ "metadata": {
2221
+ "execution": {},
2222
+ "lines_to_next_cell": 0
2223
+ },
2224
+ "outputs": [],
2225
+ "source": [
2226
+ "Auto['horsepower']\n"
2227
+ ]
2228
+ },
2229
+ {
2230
+ "cell_type": "markdown",
2231
+ "id": "fd11e757",
2232
+ "metadata": {},
2233
+ "source": [
2234
+ "We see that the `dtype` of this column is `object`. \n",
2235
+ "It turns out that all values of the `horsepower` column were interpreted as strings when reading\n",
2236
+ "in the data. \n",
2237
+ "We can find out why by looking at the unique values."
2238
+ ]
2239
+ },
2240
+ {
2241
+ "cell_type": "code",
2242
+ "execution_count": null,
2243
+ "id": "57b86346",
2244
+ "metadata": {
2245
+ "execution": {},
2246
+ "lines_to_next_cell": 0
2247
+ },
2248
+ "outputs": [],
2249
+ "source": [
2250
+ "np.unique(Auto['horsepower'])\n"
2251
+ ]
2252
+ },
2253
+ {
2254
+ "cell_type": "markdown",
2255
+ "id": "f0aee233",
2256
+ "metadata": {},
2257
+ "source": [
2258
+ "We see the culprit is the value `?`, which is being used to encode missing values.\n",
2259
+ "\n"
2260
+ ]
2261
+ },
2262
+ {
2263
+ "cell_type": "markdown",
2264
+ "id": "b7b032d4",
2265
+ "metadata": {},
2266
+ "source": [
2267
+ "To fix the problem, we must provide `pd.read_csv()` with an argument called `na_values`.\n",
2268
+ "Now, each instance of `?` in the file is replaced with the\n",
2269
+ "value `np.nan`, which means *not a number*:"
2270
+ ]
2271
+ },
2272
+ {
2273
+ "cell_type": "code",
2274
+ "execution_count": null,
2275
+ "id": "a9698b26",
2276
+ "metadata": {
2277
+ "execution": {},
2278
+ "lines_to_next_cell": 2
2279
+ },
2280
+ "outputs": [],
2281
+ "source": [
2282
+ "Auto = pd.read_csv('Auto.data',\n",
2283
+ "                   na_values=['?'],\n",
2283
+ "                   sep=r'\\s+')\n",
2285
+ "Auto['horsepower'].sum()\n"
2286
+ ]
2287
+ },
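The effect of `na_values` can be checked without the `Auto.data` file. The sketch below reads a tiny, made-up whitespace-delimited table from memory; the three data rows and the names `text`, `raw`, and `clean` are assumptions for illustration, not values taken from the book's data.

```python
import io
import pandas as pd

# Hypothetical whitespace-delimited text; '?' marks a missing horsepower.
text = """mpg horsepower year
18.0 130 70
25.0 ? 74
24.0 95 70
"""

raw = pd.read_csv(io.StringIO(text), sep=r'\s+')
clean = pd.read_csv(io.StringIO(text), sep=r'\s+', na_values=['?'])

# Without na_values the column is read as text rather than numbers.
print(pd.api.types.is_numeric_dtype(raw['horsepower']))
# With na_values the column is numeric and '?' has become NaN.
print(clean['horsepower'].dtype, clean['horsepower'].isna().sum())
# sum() skips NaN by default; dropna() removes the incomplete row.
print(clean['horsepower'].sum(), clean.dropna().shape)
```

A single non-numeric token is enough to turn a whole column into text, which is why inspecting the unique values, as the lab does, is a useful first check.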
2288
+ {
2289
+ "cell_type": "markdown",
2290
+ "id": "13cb364e",
2291
+ "metadata": {},
2292
+ "source": [
2293
+ "The `Auto.shape` attribute tells us that the data has 397\n",
2294
+ "observations, or rows, and nine variables, or columns."
2295
+ ]
2296
+ },
2297
+ {
2298
+ "cell_type": "code",
2299
+ "execution_count": null,
2300
+ "id": "4877cb2c",
2301
+ "metadata": {
2302
+ "execution": {}
2303
+ },
2304
+ "outputs": [],
2305
+ "source": [
2306
+ "Auto.shape\n"
2307
+ ]
2308
+ },
2309
+ {
2310
+ "cell_type": "markdown",
2311
+ "id": "3fdc6f47",
2312
+ "metadata": {},
2313
+ "source": [
2314
+ "There are\n",
2315
+ "various ways to deal with missing data. \n",
2316
+ "In this case, since only five of the rows contain missing\n",
2317
+ "observations, we choose to use the `Auto.dropna()` method to simply remove these rows."
2318
+ ]
2319
+ },
2320
+ {
2321
+ "cell_type": "code",
2322
+ "execution_count": null,
2323
+ "id": "2ba1d33d",
2324
+ "metadata": {
2325
+ "execution": {},
2326
+ "lines_to_next_cell": 2
2327
+ },
2328
+ "outputs": [],
2329
+ "source": [
2330
+ "Auto_new = Auto.dropna()\n",
2331
+ "Auto_new.shape\n"
2332
+ ]
2333
+ },
2334
+ {
2335
+ "cell_type": "markdown",
2336
+ "id": "ac9748d9",
2337
+ "metadata": {},
2338
+ "source": [
2339
+ "### Basics of Selecting Rows and Columns\n",
2340
+ " \n",
2341
+ "We can use `Auto.columns` to check the variable names."
2342
+ ]
2343
+ },
2344
+ {
2345
+ "cell_type": "code",
2346
+ "execution_count": null,
2347
+ "id": "3d03baab",
2348
+ "metadata": {
2349
+ "execution": {},
2350
+ "lines_to_next_cell": 2
2351
+ },
2352
+ "outputs": [],
2353
+ "source": [
2354
+ "Auto = Auto_new # overwrite the previous value\n",
2355
+ "Auto.columns\n"
2356
+ ]
2357
+ },
2358
+ {
2359
+ "cell_type": "markdown",
2360
+ "id": "d24d4d42",
2361
+ "metadata": {},
2362
+ "source": [
2363
+ "Accessing the rows and columns of a data frame is similar, but not identical, to accessing the rows and columns of an array. \n",
2364
+ "Recall that the first argument to the `[]` method\n",
2365
+ "is always applied to the rows of the array. \n",
2366
+ "Similarly, \n",
2367
+ "passing in a slice to the `[]` method creates a data frame whose *rows* are determined by the slice:"
2368
+ ]
2369
+ },
2370
+ {
2371
+ "cell_type": "code",
2372
+ "execution_count": null,
2373
+ "id": "410b4dd7",
2374
+ "metadata": {
2375
+ "execution": {},
2376
+ "lines_to_next_cell": 0
2377
+ },
2378
+ "outputs": [],
2379
+ "source": [
2380
+ "Auto[:3]\n"
2381
+ ]
2382
+ },
2383
+ {
2384
+ "cell_type": "markdown",
2385
+ "id": "4ea0be7b",
2386
+ "metadata": {},
2387
+ "source": [
2388
+ "Similarly, an array of Booleans can be used to subset the rows:"
2389
+ ]
2390
+ },
2391
+ {
2392
+ "cell_type": "code",
2393
+ "execution_count": null,
2394
+ "id": "3540804d",
2395
+ "metadata": {
2396
+ "execution": {},
2397
+ "lines_to_next_cell": 0
2398
+ },
2399
+ "outputs": [],
2400
+ "source": [
2401
+ "idx_80 = Auto['year'] > 80\n",
2402
+ "Auto[idx_80]\n"
2403
+ ]
2404
+ },
2405
+ {
2406
+ "cell_type": "markdown",
2407
+ "id": "a02221a2",
2408
+ "metadata": {},
2409
+ "source": [
2410
+ "However, if we pass in a list of strings to the `[]` method, then we obtain a data frame containing the corresponding set of *columns*. "
2411
+ ]
2412
+ },
2413
+ {
2414
+ "cell_type": "code",
2415
+ "execution_count": null,
2416
+ "id": "66d174f1",
2417
+ "metadata": {
2418
+ "execution": {},
2419
+ "lines_to_next_cell": 0
2420
+ },
2421
+ "outputs": [],
2422
+ "source": [
2423
+ "Auto[['mpg', 'horsepower']]\n"
2424
+ ]
2425
+ },
2426
+ {
2427
+ "cell_type": "markdown",
2428
+ "id": "54bef6a3",
2429
+ "metadata": {},
2430
+ "source": [
2431
+ "Since we did not specify an *index* column when we loaded our data frame, the rows are labeled using integers\n",
2431
+ "between 0 and 396; the labels of the five rows removed by `dropna()` no longer appear."
2433
+ ]
2434
+ },
2435
+ {
2436
+ "cell_type": "code",
2437
+ "execution_count": null,
2438
+ "id": "52789c77",
2439
+ "metadata": {
2440
+ "execution": {},
2441
+ "lines_to_next_cell": 0
2442
+ },
2443
+ "outputs": [],
2444
+ "source": [
2445
+ "Auto.index\n"
2446
+ ]
2447
+ },
2448
+ {
2449
+ "cell_type": "markdown",
2450
+ "id": "3f5fcb26",
2451
+ "metadata": {},
2452
+ "source": [
2453
+ "We can use the\n",
2454
+ "`set_index()` method to re-name the rows using the contents of `Auto['name']`. "
2455
+ ]
2456
+ },
2457
+ {
2458
+ "cell_type": "code",
2459
+ "execution_count": null,
2460
+ "id": "d83650bf",
2461
+ "metadata": {
2462
+ "execution": {}
2463
+ },
2464
+ "outputs": [],
2465
+ "source": [
2466
+ "Auto_re = Auto.set_index('name')\n",
2467
+ "Auto_re\n"
2468
+ ]
2469
+ },
2470
+ {
2471
+ "cell_type": "code",
2472
+ "execution_count": null,
2473
+ "id": "880d79d9",
2474
+ "metadata": {
2475
+ "execution": {},
2476
+ "lines_to_next_cell": 0
2477
+ },
2478
+ "outputs": [],
2479
+ "source": [
2480
+ "Auto_re.columns\n"
2481
+ ]
2482
+ },
2483
+ {
2484
+ "cell_type": "markdown",
2485
+ "id": "dbee53b8",
2486
+ "metadata": {},
2487
+ "source": [
2488
+ "We see that the column `'name'` is no longer there.\n",
2489
+ " \n",
2490
+ "Now that the index has been set to `name`, we can access rows of the data \n",
2491
+ "frame by `name` using the `loc[]` method of\n",
2491
+ "`Auto_re`:"
2493
+ ]
2494
+ },
2495
+ {
2496
+ "cell_type": "code",
2497
+ "execution_count": null,
2498
+ "id": "c01f4095",
2499
+ "metadata": {
2500
+ "execution": {},
2501
+ "lines_to_next_cell": 0
2502
+ },
2503
+ "outputs": [],
2504
+ "source": [
2505
+ "rows = ['amc rebel sst', 'ford torino']\n",
2506
+ "Auto_re.loc[rows]\n"
2507
+ ]
2508
+ },
2509
+ {
2510
+ "cell_type": "markdown",
2511
+ "id": "29688cab",
2512
+ "metadata": {},
2513
+ "source": [
2514
+ "As an alternative to using the index name, we could retrieve the 4th and 5th rows of `Auto_re` using the `iloc[]` method:"
2515
+ ]
2516
+ },
2517
+ {
2518
+ "cell_type": "code",
2519
+ "execution_count": null,
2520
+ "id": "a4202eb8",
2521
+ "metadata": {
2522
+ "execution": {},
2523
+ "lines_to_next_cell": 0
2524
+ },
2525
+ "outputs": [],
2526
+ "source": [
2527
+ "Auto_re.iloc[[3,4]]\n"
2528
+ ]
2529
+ },
2530
+ {
2531
+ "cell_type": "markdown",
2532
+ "id": "5427ede0",
2533
+ "metadata": {},
2534
+ "source": [
2535
+ "We can also use it to retrieve the 1st, 3rd and 4th columns of `Auto_re`:"
2536
+ ]
2537
+ },
2538
+ {
2539
+ "cell_type": "code",
2540
+ "execution_count": null,
2541
+ "id": "948b2d07",
2542
+ "metadata": {
2543
+ "execution": {},
2544
+ "lines_to_next_cell": 0
2545
+ },
2546
+ "outputs": [],
2547
+ "source": [
2548
+ "Auto_re.iloc[:,[0,2,3]]\n"
2549
+ ]
2550
+ },
2551
+ {
2552
+ "cell_type": "markdown",
2553
+ "id": "b83d56eb",
2554
+ "metadata": {},
2555
+ "source": [
2556
+ "We can extract the 4th and 5th rows, as well as the 1st, 3rd and 4th columns, using\n",
2557
+ "a single call to `iloc[]`:"
2558
+ ]
2559
+ },
2560
+ {
2561
+ "cell_type": "code",
2562
+ "execution_count": null,
2563
+ "id": "1cfdcc5c",
2564
+ "metadata": {
2565
+ "execution": {},
2566
+ "lines_to_next_cell": 0
2567
+ },
2568
+ "outputs": [],
2569
+ "source": [
2570
+ "Auto_re.iloc[[3,4],[0,2,3]]\n"
2571
+ ]
2572
+ },
2573
+ {
2574
+ "cell_type": "markdown",
2575
+ "id": "2bde6514",
2576
+ "metadata": {},
2577
+ "source": [
2578
+ "Index entries need not be unique: there are several cars in the data frame named `ford galaxie 500`."
2579
+ ]
2580
+ },
2581
+ {
2582
+ "cell_type": "code",
2583
+ "execution_count": null,
2584
+ "id": "fd9c5cda",
2585
+ "metadata": {
2586
+ "execution": {},
2587
+ "lines_to_next_cell": 0
2588
+ },
2589
+ "outputs": [],
2590
+ "source": [
2591
+ "Auto_re.loc['ford galaxie 500', ['mpg', 'origin']]\n"
2592
+ ]
2593
+ },
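The difference between label-based and position-based selection is easiest to see on a few rows. This is a sketch on a made-up data frame (the `cars` data and all names below are hypothetical, chosen only to mimic the structure of `Auto`), including a repeated index label.

```python
import pandas as pd

# Hypothetical miniature data frame; note the repeated name.
cars = pd.DataFrame({'name': ['ford a', 'datsun b', 'ford a', 'toyota c'],
                     'mpg': [18.0, 27.0, 15.0, 24.0],
                     'year': [70, 82, 71, 81]})
cars_re = cars.set_index('name')   # 'name' moves from the columns to the index

by_label = cars_re.loc[['datsun b', 'toyota c']]   # select by index label
by_position = cars_re.iloc[[1, 3]]                 # select by integer position

# A repeated label returns every matching row.
repeated = cars_re.loc['ford a', ['mpg']]
print(repeated)
```

Here `by_label` and `by_position` hold the same rows, while `repeated` has two rows because the label `'ford a'` occurs twice.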
2594
+ {
2595
+ "cell_type": "markdown",
2596
+ "id": "4d097282",
2597
+ "metadata": {},
2598
+ "source": [
2599
+ "### More on Selecting Rows and Columns\n",
2600
+ "Suppose now that we want to create a data frame consisting of the `weight` and `origin` of the subset of cars with \n",
2601
+ "`year` greater than 80 --- i.e. those built after 1980.\n",
2602
+ "To do this, we first create a Boolean array that indexes the rows.\n",
2603
+ "The `loc[]` method allows for Boolean entries as well as strings:"
2604
+ ]
2605
+ },
2606
+ {
2607
+ "cell_type": "code",
2608
+ "execution_count": null,
2609
+ "id": "6d431cb5",
2610
+ "metadata": {
2611
+ "execution": {},
2612
+ "lines_to_next_cell": 2
2613
+ },
2614
+ "outputs": [],
2615
+ "source": [
2616
+ "idx_80 = Auto_re['year'] > 80\n",
2617
+ "Auto_re.loc[idx_80, ['weight', 'origin']]\n"
2618
+ ]
2619
+ },
2620
+ {
2621
+ "cell_type": "markdown",
2622
+ "id": "838a03e0",
2623
+ "metadata": {},
2624
+ "source": [
2625
+ "To do this more concisely, we can use an anonymous function called a `lambda`: "
2626
+ ]
2627
+ },
2628
+ {
2629
+ "cell_type": "code",
2630
+ "execution_count": null,
2631
+ "id": "fac41ce1",
2632
+ "metadata": {
2633
+ "execution": {},
2634
+ "lines_to_next_cell": 0
2635
+ },
2636
+ "outputs": [],
2637
+ "source": [
2638
+ "Auto_re.loc[lambda df: df['year'] > 80, ['weight', 'origin']]\n"
2639
+ ]
2640
+ },
2641
+ {
2642
+ "cell_type": "markdown",
2643
+ "id": "08e61254",
2644
+ "metadata": {},
2645
+ "source": [
2646
+ "The `lambda` call creates a function that takes a single\n",
2647
+ "argument, here `df`, and returns `df['year']>80`.\n",
2648
+ "Since it is created inside the `loc[]` method for the\n",
2649
+ "dataframe `Auto_re`, that dataframe will be the argument supplied.\n",
2650
+ "As another example of using a `lambda`, suppose that\n",
2651
+ "we want all cars built after 1980 that achieve greater than 30 miles per gallon:"
2652
+ ]
2653
+ },
2654
+ {
2655
+ "cell_type": "code",
2656
+ "execution_count": null,
2657
+ "id": "b0885654",
2658
+ "metadata": {
2659
+ "execution": {},
2660
+ "lines_to_next_cell": 0
2661
+ },
2662
+ "outputs": [],
2663
+ "source": [
2664
+ "Auto_re.loc[lambda df: (df['year'] > 80) & (df['mpg'] > 30),\n",
2665
+ " ['weight', 'origin']\n",
2666
+ " ]\n"
2667
+ ]
2668
+ },
2669
+ {
2670
+ "cell_type": "markdown",
2671
+ "id": "d87fc459",
2672
+ "metadata": {},
2673
+ "source": [
2674
+ "The symbol `&` computes an element-wise *and* operation.\n",
2675
+ "As another example, suppose that we want to retrieve all `Ford` and `Datsun`\n",
2676
+ "cars with `displacement` less than 300. We check whether each `name` entry contains either the string `ford` or `datsun` using the `str.contains()` method of the `index` attribute\n",
2676
+ "of the dataframe:"
2678
+ ]
2679
+ },
2680
+ {
2681
+ "cell_type": "code",
2682
+ "execution_count": null,
2683
+ "id": "213945a6",
2684
+ "metadata": {
2685
+ "execution": {},
2686
+ "lines_to_next_cell": 0
2687
+ },
2688
+ "outputs": [],
2689
+ "source": [
2690
+ "Auto_re.loc[lambda df: (df['displacement'] < 300)\n",
2691
+ " & (df.index.str.contains('ford')\n",
2692
+ " | df.index.str.contains('datsun')),\n",
2693
+ " ['weight', 'origin']\n",
2694
+ " ]\n"
2695
+ ]
2696
+ },
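A `lambda` passed to `loc[]` is shorthand for building the Boolean mask first and then indexing with it. The sketch below checks that equivalence on a hypothetical four-row data frame (the values and the names `mask`, `explicit`, and `with_lambda` are assumptions for illustration). It also uses the regular-expression pattern `'ford|datsun'`, one possible alternative to combining two `str.contains()` calls with `|`.

```python
import pandas as pd

# Hypothetical rows indexed by car name, mimicking `Auto_re`.
cars_re = pd.DataFrame(
    {'displacement': [302.0, 97.0, 98.0, 113.0],
     'weight': [3449, 2130, 2046, 2372]},
    index=pd.Index(['ford torino', 'datsun pl510',
                    'ford pinto', 'toyota corona'], name='name'))

# Explicit Boolean mask, built in a separate step.
mask = ((cars_re['displacement'] < 300)
        & cars_re.index.str.contains('ford|datsun'))
explicit = cars_re.loc[mask, ['weight']]

# The same selection with the mask computed inside a lambda.
with_lambda = cars_re.loc[
    lambda df: (df['displacement'] < 300)
               & df.index.str.contains('ford|datsun'),
    ['weight']]

print(list(with_lambda.index))
```

The parentheses around each comparison matter: `&` and `|` bind more tightly than `<` and `>` in Python, so omitting them changes the meaning of the expression.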
2697
+ {
2698
+ "cell_type": "markdown",
2699
+ "id": "8a940fd1",
2700
+ "metadata": {},
2701
+ "source": [
2702
+ "Here, the symbol `|` computes an element-wise *or* operation.\n",
2703
+ " \n",
2704
+ "In summary, a powerful set of operations is available to index the rows and columns of data frames. For integer-based queries, use the `iloc[]` method. For string and Boolean\n",
2705
+ "selections, use the `loc[]` method. For functional queries that filter rows, use the `loc[]` method\n",
2706
+ "with a function (typically a `lambda`) in the rows argument.\n",
2707
+ "\n",
2708
+ "## For Loops\n",
2709
+ "A `for` loop is a standard tool in many languages that\n",
2710
+ "repeatedly evaluates a chunk of code while\n",
2710
+ "a loop variable takes a new value on each pass.\n",
2712
+ "For example, suppose we loop over elements of a list and compute their sum."
2713
+ ]
2714
+ },
2715
+ {
2716
+ "cell_type": "code",
2717
+ "execution_count": null,
2718
+ "id": "a3c4060a",
2719
+ "metadata": {
2720
+ "execution": {},
2721
+ "lines_to_next_cell": 0
2722
+ },
2723
+ "outputs": [],
2724
+ "source": [
2725
+ "total = 0\n",
2726
+ "for value in [3,2,19]:\n",
2727
+ "    total += value\n",
2728
+ "print('Total is: {0}'.format(total))\n"
2729
+ ]
2730
+ },
2731
+ {
2732
+ "cell_type": "markdown",
2733
+ "id": "9117e3a1",
2734
+ "metadata": {},
2735
+ "source": [
2736
+ "The indented code beneath the line with the `for` statement is run\n",
2737
+ "for each value in the sequence\n",
2738
+ "specified in the `for` statement. The loop ends either\n",
2739
+ "when the cell ends or when code is indented at the same level\n",
2740
+ "as the original `for` statement.\n",
2741
+ "We see that the final line above, which prints the total, is executed\n",
2741
+ "only once, after the `for` loop has terminated. Loops\n",
2743
+ "can be nested by additional indentation."
2744
+ ]
2745
+ },
2746
+ {
2747
+ "cell_type": "code",
2748
+ "execution_count": null,
2749
+ "id": "f2bffb69",
2750
+ "metadata": {
2751
+ "execution": {},
2752
+ "lines_to_next_cell": 0
2753
+ },
2754
+ "outputs": [],
2755
+ "source": [
2756
+ "total = 0\n",
2757
+ "for value in [2,3,19]:\n",
2758
+ "    for weight in [3, 2, 1]:\n",
2758
+ "        total += value * weight\n",
2760
+ "print('Total is: {0}'.format(total))"
2761
+ ]
2762
+ },
2763
+ {
2764
+ "cell_type": "markdown",
2765
+ "id": "9f99e85b",
2766
+ "metadata": {},
2767
+ "source": [
2768
+ "Above, we summed over each combination of `value` and `weight`.\n",
2769
+ "We also took advantage of the *increment* notation\n",
2770
+ "in `Python`: the expression `a += b` is equivalent\n",
2771
+ "to `a = a + b`. Besides\n",
2772
+ "being a convenient notation, this can save time in computationally\n",
2773
+ "heavy tasks in which the intermediate value of `a+b` need not\n",
2774
+ "be explicitly created.\n",
2775
+ "\n",
2776
+ "Perhaps a more\n",
2777
+ "common task would be to sum over `(value, weight)` pairs. For instance,\n",
2778
+ "to compute the average value of a random variable that takes on\n",
2779
+ "possible values 2, 3 or 19 with probability 0.2, 0.3, 0.5 respectively\n",
2780
+ "we would compute the weighted sum. Tasks such as this\n",
2781
+ "can often be accomplished using the `zip()` function that\n",
2782
+ "loops over a sequence of tuples."
2783
+ ]
2784
+ },
2785
+ {
2786
+ "cell_type": "code",
2787
+ "execution_count": null,
2788
+ "id": "ee827a53",
2789
+ "metadata": {
2790
+ "execution": {}
2791
+ },
2792
+ "outputs": [],
2793
+ "source": [
2794
+ "total = 0\n",
2795
+ "for value, weight in zip([2,3,19],\n",
2796
+ "                         [0.2,0.3,0.5]):\n",
2796
+ "    total += weight * value\n",
2798
+ "print('Weighted average is: {0}'.format(total))\n"
2799
+ ]
2800
+ },
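The weighted sum computed with `zip()` can be cross-checked in two other ways. The sketch below repeats the loop and compares it with a generator expression and with `np.dot()`; the names `values`, `probs`, `compact`, and `vectorized` are introduced here for illustration.

```python
import numpy as np

values = [2, 3, 19]
probs = [0.2, 0.3, 0.5]

# Loop form, as in the lab: pair each value with its probability.
total = 0
for value, weight in zip(values, probs):
    total += weight * value

# The same sum written as a generator expression.
compact = sum(w * v for v, w in zip(values, probs))

# Vectorized form: the dot product of the two sequences.
vectorized = np.dot(values, probs)

print(total, compact, vectorized)
```

All three agree up to floating-point rounding. For long sequences the vectorized form is usually the fastest, although for three numbers the difference is immaterial.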
2801
+ {
2802
+ "cell_type": "markdown",
2803
+ "id": "dec18466",
2804
+ "metadata": {},
2805
+ "source": [
2806
+ "### String Formatting\n",
2807
+ "In the code chunk above we also printed a string\n",
2808
+ "displaying the total. However, the object `total`\n",
2809
+ "is a number, not a string.\n",
2810
+ "Inserting the value of something into\n",
2811
+ "a string is a common task, made\n",
2812
+ "simple using\n",
2813
+ "some of the powerful string formatting\n",
2814
+ "tools in `Python`.\n",
2815
+ "Many data cleaning tasks involve\n",
2816
+ "manipulating and programmatically\n",
2817
+ "producing strings.\n",
2818
+ "\n",
2819
+ "For example, we may want to loop over the columns of a data frame and\n",
2819
+ "print the percent missing in each column.\n",
2820
+ "Let’s create a data frame `D` with columns in which roughly 20% of the entries are missing, i.e. set\n",
2822
+ "to `np.nan`. We’ll create the\n",
2823
+ "values in `D` from a normal distribution with mean 0 and variance 1 using `rng.standard_normal()`\n",
2824
+ "and then overwrite some random entries using `rng.choice()`."
2825
+ ]
2826
+ },
2827
+ {
2828
+ "cell_type": "code",
2829
+ "execution_count": null,
2830
+ "id": "3a097fbc",
2831
+ "metadata": {
2832
+ "execution": {},
2833
+ "lines_to_next_cell": 2
2834
+ },
2835
+ "outputs": [],
2836
+ "source": [
2837
+ "rng = np.random.default_rng(1)\n",
2838
+ "A = rng.standard_normal((127, 5))\n",
2839
+ "M = rng.choice([0, np.nan], p=[0.8,0.2], size=A.shape)\n",
2840
+ "A += M\n",
2841
+ "D = pd.DataFrame(A, columns=['food',\n",
2842
+ " 'bar',\n",
2843
+ " 'pickle',\n",
2844
+ " 'snack',\n",
2845
+ " 'popcorn'])\n",
2846
+ "D[:3]\n"
2847
+ ]
2848
+ },
2849
+ {
2850
+ "cell_type": "code",
2851
+ "execution_count": null,
2852
+ "id": "e064e170",
2853
+ "metadata": {
2854
+ "execution": {},
2855
+ "lines_to_next_cell": 0
2856
+ },
2857
+ "outputs": [],
2858
+ "source": [
2859
+ "for col in D.columns:\n",
2860
+ "    template = 'Column \"{0}\" has {1:.2%} missing values'\n",
2860
+ "    print(template.format(col,\n",
2861
+ "          np.isnan(D[col]).mean()))\n"
2863
+ ]
2864
+ },
2865
+ {
2866
+ "cell_type": "markdown",
2867
+ "id": "7a3e4dd8",
2868
+ "metadata": {},
2869
+ "source": [
2870
+ "We see that the `template.format()` method expects two arguments `{0}`\n",
2871
+ "and `{1:.2%}`, and the latter includes some formatting\n",
2872
+ "information. In particular, it specifies that the second argument should be expressed as a percent with two decimal digits.\n",
2873
+ "\n",
2874
+ "The reference\n",
2875
+ "[docs.python.org/3/library/string.html](https://docs.python.org/3/library/string.html)\n",
2876
+ "includes many helpful and more complex examples."
2877
+ ]
2878
+ },
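The same format specification also works inside an f-string, and the per-column missing fraction can be computed in one step with `isna().mean()`. The sketch below uses a small hypothetical data frame with a known pattern of missing entries (the name `D_demo` and its values are assumptions for illustration), so the printed percentages are easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1 of 4 entries missing in 'food', 2 of 4 in 'bar'.
D_demo = pd.DataFrame({'food': [1.0, np.nan, 3.0, 4.0],
                       'bar': [np.nan, np.nan, 1.0, 2.0]})

# Fraction of missing entries in every column at once.
missing = D_demo.isna().mean()

# The '.2%' specification behaves the same in an f-string as in format().
lines = [f'Column "{col}" has {frac:.2%} missing values'
         for col, frac in missing.items()]
for line in lines:
    print(line)
```

The `%` presentation type multiplies the value by 100 and appends a percent sign, so the fraction 0.25 is displayed as `25.00%`.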
2879
+ {
2880
+ "cell_type": "markdown",
2881
+ "id": "d8fd496a",
2882
+ "metadata": {},
2883
+ "source": [
2884
+ "## Additional Graphical and Numerical Summaries\n",
2885
+ "We can use the `ax.plot()` or `ax.scatter()` functions to display the quantitative variables. However, simply typing the variable names will produce an error message,\n",
2886
+ "because `Python` does not know to look in the `Auto` data set for those variables."
2887
+ ]
2888
+ },
2889
+ {
2890
+ "cell_type": "code",
2891
+ "execution_count": null,
2892
+ "id": "c915ca52",
2893
+ "metadata": {
2894
+ "execution": {},
2895
+ "lines_to_next_cell": 0
2896
+ },
2897
+ "outputs": [],
2898
+ "source": [
2899
+ "fig, ax = subplots(figsize=(8, 8))\n",
2900
+ "ax.plot(horsepower, mpg, 'o');"
2901
+ ]
2902
+ },
2903
+ {
2904
+ "cell_type": "markdown",
2905
+ "id": "63d47021",
2906
+ "metadata": {},
2907
+ "source": [
2908
+ "We can address this by accessing the columns directly:"
2909
+ ]
2910
+ },
2911
+ {
2912
+ "cell_type": "code",
2913
+ "execution_count": null,
2914
+ "id": "65cd6d02",
2915
+ "metadata": {
2916
+ "execution": {},
2917
+ "lines_to_next_cell": 0
2918
+ },
2919
+ "outputs": [],
2920
+ "source": [
2921
+ "fig, ax = subplots(figsize=(8, 8))\n",
2922
+ "ax.plot(Auto['horsepower'], Auto['mpg'], 'o');\n"
2923
+ ]
2924
+ },
2925
+ {
2926
+ "cell_type": "markdown",
2927
+ "id": "726836f0",
2928
+ "metadata": {},
2929
+ "source": [
2930
+ "Alternatively, we can use the `plot()` method with the call `Auto.plot()`.\n",
2931
+ "Using this method,\n",
2932
+ "the variables can be accessed by name.\n",
2933
+ "The plot methods of a data frame return a familiar object:\n",
2934
+ "an axes. We can use it to update the plot as we did previously: "
2935
+ ]
2936
+ },
2937
+ {
2938
+ "cell_type": "code",
2939
+ "execution_count": null,
2940
+ "id": "76b5c0b1",
2941
+ "metadata": {
2942
+ "execution": {},
2943
+ "lines_to_next_cell": 0
2944
+ },
2945
+ "outputs": [],
2946
+ "source": [
2947
+ "ax = Auto.plot.scatter('horsepower', 'mpg')\n",
2948
+ "ax.set_title('Horsepower vs. MPG');"
2949
+ ]
2950
+ },
2951
+ {
2952
+ "cell_type": "markdown",
2953
+ "id": "69c46251",
2954
+ "metadata": {},
2955
+ "source": [
2956
+ "If we want to save\n",
2957
+ "the figure that contains a given axes, we can find the relevant figure\n",
2958
+ "by accessing the `figure` attribute:"
2959
+ ]
2960
+ },
2961
+ {
2962
+ "cell_type": "code",
2963
+ "execution_count": null,
2964
+ "id": "183a2c2b",
2965
+ "metadata": {
2966
+ "execution": {}
2967
+ },
2968
+ "outputs": [],
2969
+ "source": [
2970
+ "fig = ax.figure\n",
2971
+ "fig.savefig('horsepower_mpg.png');"
2972
+ ]
2973
+ },
2974
+ {
2975
+ "cell_type": "markdown",
2976
+ "id": "6f10cb46",
2977
+ "metadata": {},
2978
+ "source": [
2979
+ "We can further instruct the data frame to plot to a particular axes object. In this\n",
2980
+ "case the corresponding `plot()` method will return the\n",
2981
+ "modified axes we passed in as an argument. Note that\n",
2982
+ "when we request a one-dimensional grid of plots, the object `axes` is similarly\n",
2983
+ "one-dimensional. We place our scatter plot in the middle plot of a row of three plots\n",
2984
+ "within a figure."
2985
+ ]
2986
+ },
2987
+ {
2988
+ "cell_type": "code",
2989
+ "execution_count": null,
2990
+ "id": "75fbb981",
2991
+ "metadata": {
2992
+ "execution": {}
2993
+ },
2994
+ "outputs": [],
2995
+ "source": [
2996
+ "fig, axes = subplots(ncols=3, figsize=(15, 5))\n",
2997
+ "Auto.plot.scatter('horsepower', 'mpg', ax=axes[1]);\n"
2998
+ ]
2999
+ },
3000
+ {
3001
+ "cell_type": "markdown",
3002
+ "id": "53ffc0da",
3003
+ "metadata": {},
3004
+ "source": [
3005
+ "Note also that the columns of a data frame can be accessed as attributes: try typing in `Auto.horsepower`. "
3006
+ ]
3007
+ },
3008
+ {
3009
+ "cell_type": "markdown",
3010
+ "id": "1c4705e0",
3011
+ "metadata": {},
3012
+ "source": [
3013
+ "We now consider the `cylinders` variable. Typing in `Auto.cylinders.dtype` reveals that it is being treated as a quantitative variable. \n",
3014
+ "However, since there is only a small number of possible values for this variable, we may wish to treat it as \n",
3015
+ " qualitative. Below, we replace\n",
3016
+ "the `cylinders` column with a categorical version of `Auto.cylinders`. The function `pd.Series()` owes its name to the fact that `pandas` is often used in time series applications."
3017
+ ]
3018
+ },
3019
+ {
3020
+ "cell_type": "code",
3021
+ "execution_count": null,
3022
+ "id": "55b3a1cc",
3023
+ "metadata": {
3024
+ "execution": {},
3025
+ "lines_to_next_cell": 0
3026
+ },
3027
+ "outputs": [],
3028
+ "source": [
3029
+ "Auto.cylinders = pd.Series(Auto.cylinders, dtype='category')\n",
3030
+ "Auto.cylinders.dtype\n"
3031
+ ]
3032
+ },
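The effect of the categorical conversion can be inspected on a few made-up rows. The sketch below is illustrative only: the data frame `df` is hypothetical, and `astype('category')` is used as an alternative to the `pd.Series(..., dtype='category')` call in the lab, which is assumed here to yield the same kind of column.

```python
import pandas as pd

# Hypothetical rows mimicking two columns of `Auto`.
df = pd.DataFrame({'cylinders': [4, 8, 6, 4, 8],
                   'mpg': [30.0, 15.0, 20.0, 28.0, 14.0]})
df['cylinders'] = df['cylinders'].astype('category')

print(df['cylinders'].dtype)                  # category
print(list(df['cylinders'].cat.categories))   # the distinct levels

# describe() on a categorical column reports counts, not quantiles.
summary = df['cylinders'].describe()

# Group means of mpg, one per level that appears in the data.
mean_mpg = df.groupby('cylinders', observed=True)['mpg'].mean()
print(mean_mpg)
```

Once a column is categorical, summaries such as `describe()` switch from numerical statistics (mean, quartiles) to counts of levels, which matches the qualitative reading of the variable.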
3033
+ {
3034
+ "cell_type": "markdown",
3035
+ "id": "adc75408",
3036
+ "metadata": {},
3037
+ "source": [
3038
+ " Now that `cylinders` is qualitative, we can display it using\n",
3039
+ " the `boxplot()` method."
3040
+ ]
3041
+ },
3042
+ {
3043
+ "cell_type": "code",
3044
+ "execution_count": null,
3045
+ "id": "f3d88794",
3046
+ "metadata": {
3047
+ "execution": {}
3048
+ },
3049
+ "outputs": [],
3050
+ "source": [
3051
+ "fig, ax = subplots(figsize=(8, 8))\n",
3052
+ "Auto.boxplot('mpg', by='cylinders', ax=ax);\n"
3053
+ ]
3054
+ },
3055
+ {
3056
+ "cell_type": "markdown",
3057
+ "id": "62d6582f",
3058
+ "metadata": {},
3059
+ "source": [
3060
+ "The `hist()` method can be used to plot a *histogram*."
3061
+ ]
3062
+ },
3063
+ {
3064
+ "cell_type": "code",
3065
+ "execution_count": null,
3066
+ "id": "eea49f5b",
3067
+ "metadata": {
3068
+ "execution": {},
3069
+ "lines_to_next_cell": 0
3070
+ },
3071
+ "outputs": [],
3072
+ "source": [
3073
+ "fig, ax = subplots(figsize=(8, 8))\n",
3074
+ "Auto.hist('mpg', ax=ax);\n"
3075
+ ]
3076
+ },
3077
+ {
3078
+ "cell_type": "markdown",
3079
+ "id": "c5a5933c",
3080
+ "metadata": {},
3081
+ "source": [
3082
+ "The color of the bars and the number of bins can be changed:"
3083
+ ]
3084
+ },
3085
+ {
3086
+ "cell_type": "code",
3087
+ "execution_count": null,
3088
+ "id": "d5bcfff8",
3089
+ "metadata": {
3090
+ "execution": {},
3091
+ "lines_to_next_cell": 0
3092
+ },
3093
+ "outputs": [],
3094
+ "source": [
3095
+ "fig, ax = subplots(figsize=(8, 8))\n",
3096
+ "Auto.hist('mpg', color='red', bins=12, ax=ax);\n"
3097
+ ]
3098
+ },
3099
+ {
3100
+ "cell_type": "markdown",
3101
+ "id": "60c36b6c",
3102
+ "metadata": {},
3103
+ "source": [
3104
+ " See `Auto.hist?` for more plotting\n",
3105
+ "options.\n",
3106
+ " \n",
3107
+ "We can use the `pd.plotting.scatter_matrix()` function to create a *scatterplot matrix* to visualize all of the pairwise relationships between the columns in\n",
3108
+ "a data frame."
3109
+ ]
3110
+ },
3111
+ {
3112
+ "cell_type": "code",
3113
+ "execution_count": null,
3114
+ "id": "edb66cae",
3115
+ "metadata": {
3116
+ "execution": {},
3117
+ "lines_to_next_cell": 0
3118
+ },
3119
+ "outputs": [],
3120
+ "source": [
3121
+ "pd.plotting.scatter_matrix(Auto);\n"
3122
+ ]
3123
+ },
3124
+ {
3125
+ "cell_type": "markdown",
3126
+ "id": "0b162bd9",
3127
+ "metadata": {},
3128
+ "source": [
3129
+ " We can also produce scatterplots\n",
3130
+ "for a subset of the variables."
3131
+ ]
3132
+ },
3133
+ {
3134
+ "cell_type": "code",
3135
+ "execution_count": null,
3136
+ "id": "4f5d25d9",
3137
+ "metadata": {
3138
+ "execution": {},
3139
+ "lines_to_next_cell": 0
3140
+ },
3141
+ "outputs": [],
3142
+ "source": [
3143
+ "pd.plotting.scatter_matrix(Auto[['mpg',\n",
3144
+ " 'displacement',\n",
3145
+ " 'weight']]);\n"
3146
+ ]
3147
+ },
3148
+ {
3149
+ "cell_type": "markdown",
3150
+ "id": "8cae5dfc",
3151
+ "metadata": {},
3152
+ "source": [
3153
+ "The `describe()` method produces a numerical summary of each column in a data frame."
3154
+ ]
3155
+ },
3156
+ {
3157
+ "cell_type": "code",
3158
+ "execution_count": null,
3159
+ "id": "ce7b23e2",
3160
+ "metadata": {
3161
+ "execution": {},
3162
+ "lines_to_next_cell": 0
3163
+ },
3164
+ "outputs": [],
3165
+ "source": [
3166
+ "Auto[['mpg', 'weight']].describe()\n"
3167
+ ]
3168
+ },
3169
+ {
3170
+ "cell_type": "markdown",
3171
+ "id": "d5042294",
3172
+ "metadata": {},
3173
+ "source": [
3174
+ "We can also produce a summary of just a single column."
3175
+ ]
3176
+ },
3177
+ {
3178
+ "cell_type": "code",
3179
+ "execution_count": null,
3180
+ "id": "a6545d2f",
3181
+ "metadata": {
3182
+ "execution": {},
3183
+ "lines_to_next_cell": 0
3184
+ },
3185
+ "outputs": [],
3186
+ "source": [
3187
+ "Auto['cylinders'].describe()\n",
3188
+ "Auto['mpg'].describe()\n"
3189
+ ]
3190
+ },
3191
+ {
3192
+ "cell_type": "markdown",
3193
+ "id": "c2ea7f81",
3194
+ "metadata": {},
3195
+ "source": [
3196
+ "To exit `Jupyter`, select `File / Close and Halt`.\n",
3197
+ "\n",
3198
+ " \n",
3199
+ "\n"
3200
+ ]
3201
+ }
3202
+ ],
3203
+ "metadata": {
3204
+ "jupytext": {
3205
+ "cell_metadata_filter": "-all",
3206
+ "formats": "Rmd,ipynb",
3207
+ "main_language": "python"
3208
+ },
3209
+ "kernelspec": {
3210
+ "display_name": "Python 3 (ipykernel)",
3211
+ "language": "python",
3212
+ "name": "python3"
3213
+ },
3214
+ "language_info": {
3215
+ "codemirror_mode": {
3216
+ "name": "ipython",
3217
+ "version": 3
3218
+ },
3219
+ "file_extension": ".py",
3220
+ "mimetype": "text/x-python",
3221
+ "name": "python",
3222
+ "nbconvert_exporter": "python",
3223
+ "pygments_lexer": "ipython3",
3224
+ "version": "3.10.4"
3225
+ }
3226
+ },
3227
+ "nbformat": 4,
3228
+ "nbformat_minor": 5
3229
+ }
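The `describe()` cells at the end of the notebook can be tried without the `Auto` data set. The following is a minimal sketch that assumes a small made-up frame named `cars` (the values are illustrative, not taken from `Auto`). It shows that `describe()` reports different statistics for a numeric column and a categorical one. In a notebook cell only the last expression is echoed, so the sketch prints both summaries explicitly.

```python
import pandas as pd

# Synthetic stand-in for the Auto data; the values are made up for illustration.
cars = pd.DataFrame({
    "mpg": [18.0, 15.0, 36.0, 24.0, 30.0, 22.0],
    "cylinders": [8, 8, 4, 4, 4, 6],
})

# Treat cylinders as a qualitative (categorical) variable.
cars["cylinders"] = cars["cylinders"].astype("category")

# Numeric column: count, mean, std, min, quartiles, max.
numeric_summary = cars["mpg"].describe()

# Categorical column: count, unique, top (most frequent level), freq.
categorical_summary = cars["cylinders"].describe()

print(numeric_summary)
print(categorical_summary)
```

Assigning each summary to a name, rather than leaving two bare expressions in one cell, keeps both results available for later use.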
Reference files/Week2_ref/Lecture_1_basics.ipynb ADDED
The diff for this file is too large to render. See raw diff
 
app/.DS_Store ADDED
Binary file (6.15 kB). View file
 
app/__pycache__/main.cpython-311.pyc CHANGED
Binary files a/app/__pycache__/main.cpython-311.pyc and b/app/__pycache__/main.cpython-311.pyc differ
 
app/components/__pycache__/login.cpython-311.pyc CHANGED
Binary files a/app/components/__pycache__/login.cpython-311.pyc and b/app/components/__pycache__/login.cpython-311.pyc differ
 
app/components/login.py CHANGED
@@ -5,7 +5,11 @@ def login():
5
  Display a login form and return True if login is successful, False otherwise.
6
  """
7
  st.title("Login to Data Science Course App")
8
-
 
 
 
 
9
  # Create a form for login
10
  with st.form("login_form"):
11
  username = st.text_input("Username")
@@ -14,7 +18,7 @@ def login():
14
 
15
  if submit_button:
16
  # Check credentials (test account)
17
- if username == "student" and password == "123":
18
  # Store login state in session
19
  st.session_state.logged_in = True
20
  st.session_state.username = username
 
5
  Display a login form and return True if login is successful, False otherwise.
6
  """
7
  st.title("Login to Data Science Course App")
8
+
9
+ #usernames
10
+ usernames = ["admin", "student", "manxiii"]
11
+ passwords = ["admin", "123", "manxi123"]
12
+
13
  # Create a form for login
14
  with st.form("login_form"):
15
  username = st.text_input("Username")
 
18
 
19
  if submit_button:
20
  # Check credentials (test account)
21
+ if username in usernames and passwords[usernames.index(username)] == password:
22
  # Store login state in session
23
  st.session_state.logged_in = True
24
  st.session_state.username = username
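A credential check that pairs each username with its own password can be sketched as follows. This is an illustration under stated assumptions, not the app's actual code: `CREDENTIALS` and `check_credentials` are hypothetical names, and the plain-text passwords only mirror the test accounts in the hunk above.

```python
import hmac

# Hypothetical credential store: each username maps to its own password.
# Plain-text values are used here only to mirror the test accounts.
CREDENTIALS = {
    "admin": "admin",
    "student": "123",
    "manxiii": "manxi123",
}


def check_credentials(username: str, password: str) -> bool:
    """Return True only if the password belongs to this specific username."""
    expected = CREDENTIALS.get(username)
    if expected is None:
        return False
    # compare_digest compares in a way that does not leak timing information.
    return hmac.compare_digest(expected.encode(), password.encode())
```

A dictionary lookup ties the password to the username, so a valid password from one account cannot be used with another.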
app/main.py CHANGED
@@ -12,6 +12,10 @@ sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
12
  # Import the login component
13
  from app.components.login import login
14
 
 
 
 
 
15
  # Page configuration
16
  st.set_page_config(
17
  page_title="Data Science Course App",
@@ -101,6 +105,11 @@ def sidebar_navigation():
101
  if st.session_state.logged_in:
102
  st.write(f"Welcome, {st.session_state.username}!")
103
 
 
 
 
 
 
104
  # Logout button
105
  if st.button("Logout"):
106
  st.session_state.logged_in = False
@@ -120,156 +129,15 @@ def sidebar_navigation():
120
  st.rerun()
121
 
122
  def show_week_content():
123
- st.markdown("""
124
- ## Week 1: Research Topic Selection and Literature Review
125
-
126
- This week, you'll learn how to:
127
- - Select a suitable research topic
128
- - Conduct a literature review
129
- - Define your research objectives
130
- - Create a research proposal
131
- """)
132
-
133
- # Topic Selection Section
134
- st.header("1. Topic Selection")
135
- st.markdown("""
136
- ### Guidelines for Selecting Your Research Topic:
137
- - Choose a topic that interests you
138
- - Ensure sufficient data availability
139
- - Consider the scope and complexity
140
- - Check for existing research gaps
141
- """)
142
-
143
- # Interactive Topic Selection
144
- st.subheader("Topic Selection Form")
145
- with st.form("topic_form"):
146
- research_area = st.selectbox(
147
- "Select your research area",
148
- ["Computer Vision", "NLP", "Time Series", "Recommendation Systems", "Other"]
149
- )
150
-
151
- topic = st.text_input("Proposed Research Topic")
152
- problem_statement = st.text_area("Brief Problem Statement")
153
- motivation = st.text_area("Why is this research important?")
154
-
155
- submitted = st.form_submit_button("Submit Topic")
156
-
157
- if submitted:
158
- st.success("Topic submitted successfully! We'll review and provide feedback.")
159
-
160
- # Linear Regression Visualization
161
- st.header("2. Linear Regression Demo")
162
- st.markdown("""
163
- ### Understanding Linear Regression
164
-
165
- Linear regression is a fundamental machine learning algorithm that models the relationship between a dependent variable and one or more independent variables.
166
- Below is an interactive demonstration of simple linear regression.
167
- """)
168
-
169
- # Create interactive controls
170
- col1, col2 = st.columns(2)
171
- with col1:
172
- n_points = st.slider("Number of data points", 10, 100, 50)
173
- noise = st.slider("Noise level", 0.1, 2.0, 0.5)
174
- with col2:
175
- slope = st.slider("True slope", -2.0, 2.0, 1.0)
176
- intercept = st.slider("True intercept", -5.0, 5.0, 0.0)
177
-
178
- # Generate synthetic data
179
- np.random.seed(42)
180
- X = np.random.rand(n_points) * 10
181
- y = slope * X + intercept + np.random.normal(0, noise, n_points)
182
-
183
- # Fit linear regression
184
- X_reshaped = X.reshape(-1, 1)
185
- model = LinearRegression()
186
- model.fit(X_reshaped, y)
187
- y_pred = model.predict(X_reshaped)
188
-
189
- # Create the plot
190
- fig = go.Figure()
191
-
192
- # Add scatter plot of actual data
193
- fig.add_trace(go.Scatter(
194
- x=X,
195
- y=y,
196
- mode='markers',
197
- name='Actual Data',
198
- marker=dict(color='blue')
199
- ))
200
 
201
- # Add regression line
202
- fig.add_trace(go.Scatter(
203
- x=X,
204
- y=y_pred,
205
- mode='lines',
206
- name='Regression Line',
207
- line=dict(color='red')
208
- ))
209
-
210
- # Update layout
211
- fig.update_layout(
212
- title='Linear Regression Visualization',
213
- xaxis_title='X',
214
- yaxis_title='Y',
215
- showlegend=True,
216
- height=500
217
- )
218
-
219
- # Display the plot
220
- st.plotly_chart(fig, use_container_width=True)
221
-
222
- # Display regression coefficients
223
- st.markdown(f"""
224
- ### Regression Results
225
- - Estimated slope: {model.coef_[0]:.2f}
226
- - Estimated intercept: {model.intercept_:.2f}
227
- - R² score: {model.score(X_reshaped, y):.2f}
228
- """)
229
-
230
- # Literature Review Section
231
- st.header("3. Literature Review")
232
- st.markdown("""
233
- ### Steps for Conducting Literature Review:
234
- 1. Search for relevant papers
235
- 2. Read and analyze key papers
236
- 3. Identify research gaps
237
- 4. Document your findings
238
- """)
239
-
240
- # Literature Review Template
241
- st.subheader("Literature Review Template")
242
- with st.expander("Download Template"):
243
- st.download_button(
244
- label="Download Literature Review Template",
245
- data="Literature Review Template\n\n1. Introduction\n2. Related Work\n3. Methodology\n4. Results\n5. Discussion\n6. Conclusion",
246
- file_name="literature_review_template.txt",
247
- mime="text/plain"
248
- )
249
-
250
- # Weekly Assignment
251
- st.header("Weekly Assignment")
252
- st.markdown("""
253
- ### Assignment 1: Research Proposal
254
- 1. Select your research topic
255
- 2. Write a brief problem statement
256
- 3. Conduct initial literature review
257
- 4. Submit your research proposal
258
-
259
- **Due Date:** End of Week 1
260
- """)
261
-
262
- # Assignment Submission
263
- st.subheader("Submit Your Assignment")
264
- with st.form("assignment_form"):
265
- proposal_file = st.file_uploader("Upload your research proposal (PDF or DOC)")
266
- comments = st.text_area("Additional comments or questions")
267
-
268
- if st.form_submit_button("Submit Assignment"):
269
- if proposal_file is not None:
270
- st.success("Assignment submitted successfully!")
271
- else:
272
- st.error("Please upload your research proposal.")
273
 
274
  # Main content
275
  def main():
@@ -280,33 +148,14 @@ def main():
280
  return
281
 
282
  # User is logged in, show course content
283
- if st.session_state.current_week == 1:
284
  show_week_content()
285
  else:
286
  st.title("Data Science Research Paper Course")
287
  st.markdown("""
288
  ## Welcome to the Data Science Research Paper Course! 📚
289
 
290
- This 10-week course will guide you through the process of creating a machine learning research paper.
291
- Each week, you'll learn new concepts and complete tasks that build upon each other.
292
-
293
- ### Getting Started
294
- 1. Use the sidebar to navigate between weeks
295
- 2. Complete the weekly tasks and assignments
296
- 3. Track your progress using the progress bar
297
- 4. Submit your work for feedback
298
-
299
- ### Course Overview
300
- - Week 1: Research Topic Selection and Literature Review
301
- - Week 2: Data Collection and Preprocessing
302
- - Week 3: Exploratory Data Analysis
303
- - Week 4: Feature Engineering
304
- - Week 5: Model Selection and Baseline
305
- - Week 6: Model Training and Optimization
306
- - Week 7: Model Evaluation
307
- - Week 8: Results Analysis
308
- - Week 9: Paper Writing
309
- - Week 10: Final Review and Submission
310
  """)
311
 
312
  if __name__ == "__main__":
 
12
  # Import the login component
13
  from app.components.login import login
14
 
15
+ # Import week pages
16
+ from app.pages import week_1
17
+ from app.pages import week_2
18
+
19
  # Page configuration
20
  st.set_page_config(
21
  page_title="Data Science Course App",
 
105
  if st.session_state.logged_in:
106
  st.write(f"Welcome, {st.session_state.username}!")
107
 
108
+ # Debug button to show current week
109
+ if st.session_state.username == "admin":
110
+ if st.button("Debug: Show Current Week"):
111
+ st.write(f"Current week: {st.session_state.current_week}")
112
+
113
  # Logout button
114
  if st.button("Logout"):
115
  st.session_state.logged_in = False
 
129
  st.rerun()
130
 
131
  def show_week_content():
132
+ # Debug print to show current week
133
+ st.write(f"Debug: Current week is {st.session_state.current_week}")
134
 
135
+ if st.session_state.current_week == 1:
136
+ week_1.show()
137
+ elif st.session_state.current_week == 2:
138
+ week_2.show()
139
+ else:
140
+ st.warning("Content for this week is not yet available.")
141
 
142
  # Main content
143
  def main():
 
148
  return
149
 
150
  # User is logged in, show course content
151
+ if st.session_state.current_week in [1, 2]:
152
  show_week_content()
153
  else:
154
  st.title("Data Science Research Paper Course")
155
  st.markdown("""
156
  ## Welcome to the Data Science Research Paper Course! 📚
157
 
158
+ This section has not been released yet.
159
  """)
160
 
161
  if __name__ == "__main__":
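The `if`/`elif` chain in `show_week_content()` and the `in [1, 2]` test in `main()` both list the available weeks. One way to keep a single source of truth is a lookup table, sketched below. The names `WEEK_PAGES` and `render_week` are hypothetical, and the stand-in functions return strings instead of calling `week_1.show()` or `week_2.show()`.

```python
from typing import Callable, Dict


def show_week_1() -> str:
    # Stand-in for week_1.show(); returns text instead of rendering a page.
    return "Week 1 content"


def show_week_2() -> str:
    # Stand-in for week_2.show().
    return "Week 2 content"


# Hypothetical registry: adding a week means adding one entry here.
WEEK_PAGES: Dict[int, Callable[[], str]] = {
    1: show_week_1,
    2: show_week_2,
}


def render_week(current_week: int) -> str:
    """Dispatch to the page for current_week, or report that it is missing."""
    page = WEEK_PAGES.get(current_week)
    if page is None:
        return "Content for this week is not yet available."
    return page()
```

With such a table, a check like `current_week in WEEK_PAGES` could replace the hard-coded list of released weeks.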
app/pages/.DS_Store ADDED
Binary file (6.15 kB). View file
 
app/pages/1_Week_1.py DELETED
@@ -1,168 +0,0 @@
1
- import streamlit as st
2
- import numpy as np
3
- import plotly.graph_objects as go
4
- from sklearn.linear_model import LinearRegression
5
-
6
- # Page configuration
7
- st.set_page_config(
8
- page_title="Week 1 - Research Topic Selection",
9
- page_icon="📚",
10
- layout="wide"
11
- )
12
-
13
- # Check if user is logged in
14
- if not st.session_state.get("logged_in", False):
15
- st.warning("Please log in to access this page.")
16
- st.stop()
17
-
18
- # Main content
19
- st.markdown("""
20
- ## Week 1: Research Topic Selection and Literature Review
21
-
22
- This week, you'll learn how to:
23
- - Select a suitable research topic
24
- - Conduct a literature review
25
- - Define your research objectives
26
- - Create a research proposal
27
- """)
28
-
29
- # Topic Selection Section
30
- st.header("1. Topic Selection")
31
- st.markdown("""
32
- ### Guidelines for Selecting Your Research Topic:
33
- - Choose a topic that interests you
34
- - Ensure sufficient data availability
35
- - Consider the scope and complexity
36
- - Check for existing research gaps
37
- """)
38
-
39
- # Interactive Topic Selection
40
- st.subheader("Topic Selection Form")
41
- with st.form("topic_form"):
42
- research_area = st.selectbox(
43
- "Select your research area",
44
- ["Computer Vision", "NLP", "Time Series", "Recommendation Systems", "Other"]
45
- )
46
-
47
- topic = st.text_input("Proposed Research Topic")
48
- problem_statement = st.text_area("Brief Problem Statement")
49
- motivation = st.text_area("Why is this research important?")
50
-
51
- submitted = st.form_submit_button("Submit Topic")
52
-
53
- if submitted:
54
- st.success("Topic submitted successfully! We'll review and provide feedback.")
55
-
56
- # Linear Regression Visualization
57
- st.header("2. Linear Regression Demo")
58
- st.markdown("""
59
- ### Understanding Linear Regression
60
-
61
- Linear regression is a fundamental machine learning algorithm that models the relationship between a dependent variable and one or more independent variables.
62
- Below is an interactive demonstration of simple linear regression.
63
- """)
64
-
65
- # Create interactive controls
66
- col1, col2 = st.columns(2)
67
- with col1:
68
- n_points = st.slider("Number of data points", 10, 100, 50)
69
- noise = st.slider("Noise level", 0.1, 2.0, 0.5)
70
- with col2:
71
- slope = st.slider("True slope", -2.0, 2.0, 1.0)
72
- intercept = st.slider("True intercept", -5.0, 5.0, 0.0)
73
-
74
- # Generate synthetic data
75
- np.random.seed(42)
76
- X = np.random.rand(n_points) * 10
77
- y = slope * X + intercept + np.random.normal(0, noise, n_points)
78
-
79
- # Fit linear regression
80
- X_reshaped = X.reshape(-1, 1)
81
- model = LinearRegression()
82
- model.fit(X_reshaped, y)
83
- y_pred = model.predict(X_reshaped)
84
-
85
- # Create the plot
86
- fig = go.Figure()
87
-
88
- # Add scatter plot of actual data
89
- fig.add_trace(go.Scatter(
90
- x=X,
91
- y=y,
92
- mode='markers',
93
- name='Actual Data',
94
- marker=dict(color='blue')
95
- ))
96
-
97
- # Add regression line
98
- fig.add_trace(go.Scatter(
99
- x=X,
100
- y=y_pred,
101
- mode='lines',
102
- name='Regression Line',
103
- line=dict(color='red')
104
- ))
105
-
106
- # Update layout
107
- fig.update_layout(
108
- title='Linear Regression Visualization',
109
- xaxis_title='X',
110
- yaxis_title='Y',
111
- showlegend=True,
112
- height=500
113
- )
114
-
115
- # Display the plot
116
- st.plotly_chart(fig, use_container_width=True)
117
-
118
- # Display regression coefficients
119
- st.markdown(f"""
120
- ### Regression Results
121
- - Estimated slope: {model.coef_[0]:.2f}
122
- - Estimated intercept: {model.intercept_:.2f}
123
- - R² score: {model.score(X_reshaped, y):.2f}
124
- """)
125
-
126
- # Literature Review Section
127
- st.header("3. Literature Review")
128
- st.markdown("""
129
- ### Steps for Conducting Literature Review:
130
- 1. Search for relevant papers
131
- 2. Read and analyze key papers
132
- 3. Identify research gaps
133
- 4. Document your findings
134
- """)
135
-
136
- # Literature Review Template
137
- st.subheader("Literature Review Template")
138
- with st.expander("Download Template"):
139
- st.download_button(
140
- label="Download Literature Review Template",
141
- data="Literature Review Template\n\n1. Introduction\n2. Related Work\n3. Methodology\n4. Results\n5. Discussion\n6. Conclusion",
142
- file_name="literature_review_template.txt",
143
- mime="text/plain"
144
- )
145
-
146
- # Weekly Assignment
147
- st.header("Weekly Assignment")
148
- st.markdown("""
149
- ### Assignment 1: Research Proposal
150
- 1. Select your research topic
151
- 2. Write a brief problem statement
152
- 3. Conduct initial literature review
153
- 4. Submit your research proposal
154
-
155
- **Due Date:** End of Week 1
156
- """)
157
-
158
- # Assignment Submission
159
- st.subheader("Submit Your Assignment")
160
- with st.form("assignment_form"):
161
- proposal_file = st.file_uploader("Upload your research proposal (PDF or DOC)")
162
- comments = st.text_area("Additional comments or questions")
163
-
164
- if st.form_submit_button("Submit Assignment"):
165
- if proposal_file is not None:
166
- st.success("Assignment submitted successfully!")
167
- else:
168
- st.error("Please upload your research proposal.")
app/pages/__pycache__/week_1.cpython-311.pyc ADDED
Binary file (891 Bytes). View file
 
app/pages/__pycache__/week_2.cpython-311.pyc ADDED
Binary file (10.5 kB). View file
 
app/pages/week_1.py CHANGED
@@ -3,157 +3,16 @@ import numpy as np
3
  import plotly.graph_objects as go
4
  from sklearn.linear_model import LinearRegression
5
 
6
- def show_week_content():
 
7
  st.markdown("""
8
- ## Week 1: Research Topic Selection and Literature Review
9
-
10
- This week, you'll learn how to:
11
- - Select a suitable research topic
12
- - Conduct a literature review
13
- - Define your research objectives
14
- - Create a research proposal
15
  """)
16
-
17
- # Topic Selection Section
18
- st.header("1. Topic Selection")
19
- st.markdown("""
20
- ### Guidelines for Selecting Your Research Topic:
21
- - Choose a topic that interests you
22
- - Ensure sufficient data availability
23
- - Consider the scope and complexity
24
- - Check for existing research gaps
25
- """)
26
-
27
- # Interactive Topic Selection
28
- st.subheader("Topic Selection Form")
29
- with st.form("topic_form"):
30
- research_area = st.selectbox(
31
- "Select your research area",
32
- ["Computer Vision", "NLP", "Time Series", "Recommendation Systems", "Other"]
33
- )
34
-
35
- topic = st.text_input("Proposed Research Topic")
36
- problem_statement = st.text_area("Brief Problem Statement")
37
- motivation = st.text_area("Why is this research important?")
38
-
39
- submitted = st.form_submit_button("Submit Topic")
40
-
41
- if submitted:
42
- st.success("Topic submitted successfully! We'll review and provide feedback.")
43
-
44
- # Linear Regression Visualization
45
- st.header("2. Linear Regression Demo")
46
- st.markdown("""
47
- ### Understanding Linear Regression
48
-
49
- Linear regression is a fundamental machine learning algorithm that models the relationship between a dependent variable and one or more independent variables.
50
- Below is an interactive demonstration of simple linear regression.
51
- """)
52
-
53
- # Create interactive controls
54
- col1, col2 = st.columns(2)
55
- with col1:
56
- n_points = st.slider("Number of data points", 10, 100, 50)
57
- noise = st.slider("Noise level", 0.1, 2.0, 0.5)
58
- with col2:
59
- slope = st.slider("True slope", -2.0, 2.0, 1.0)
60
- intercept = st.slider("True intercept", -5.0, 5.0, 0.0)
61
-
62
- # Generate synthetic data
63
- np.random.seed(42)
64
- X = np.random.rand(n_points) * 10
65
- y = slope * X + intercept + np.random.normal(0, noise, n_points)
66
-
67
- # Fit linear regression
68
- X_reshaped = X.reshape(-1, 1)
69
- model = LinearRegression()
70
- model.fit(X_reshaped, y)
71
- y_pred = model.predict(X_reshaped)
72
-
73
- # Create the plot
74
- fig = go.Figure()
75
-
76
- # Add scatter plot of actual data
77
- fig.add_trace(go.Scatter(
78
- x=X,
79
- y=y,
80
- mode='markers',
81
- name='Actual Data',
82
- marker=dict(color='blue')
83
- ))
84
-
85
- # Add regression line
86
- fig.add_trace(go.Scatter(
87
- x=X,
88
- y=y_pred,
89
- mode='lines',
90
- name='Regression Line',
91
- line=dict(color='red')
92
- ))
93
-
94
- # Update layout
95
- fig.update_layout(
96
- title='Linear Regression Visualization',
97
- xaxis_title='X',
98
- yaxis_title='Y',
99
- showlegend=True,
100
- height=500
101
- )
102
-
103
- # Display the plot
104
- st.plotly_chart(fig, use_container_width=True)
105
-
106
- # Display regression coefficients
107
- st.markdown(f"""
108
- ### Regression Results
109
- - Estimated slope: {model.coef_[0]:.2f}
110
- - Estimated intercept: {model.intercept_:.2f}
111
- - R² score: {model.score(X_reshaped, y):.2f}
112
- """)
113
-
114
- # Literature Review Section
115
- st.header("3. Literature Review")
116
- st.markdown("""
117
- ### Steps for Conducting Literature Review:
118
- 1. Search for relevant papers
119
- 2. Read and analyze key papers
120
- 3. Identify research gaps
121
- 4. Document your findings
122
- """)
123
-
124
- # Literature Review Template
125
- st.subheader("Literature Review Template")
126
- with st.expander("Download Template"):
127
- st.download_button(
128
- label="Download Literature Review Template",
129
- data="Literature Review Template\n\n1. Introduction\n2. Related Work\n3. Methodology\n4. Results\n5. Discussion\n6. Conclusion",
130
- file_name="literature_review_template.txt",
131
- mime="text/plain"
132
- )
133
-
134
- # Weekly Assignment
135
- st.header("Weekly Assignment")
136
  st.markdown("""
137
- ### Assignment 1: Research Proposal
138
- 1. Select your research topic
139
- 2. Write a brief problem statement
140
- 3. Conduct initial literature review
141
- 4. Submit your research proposal
142
-
143
- **Due Date:** End of Week 1
144
  """)
145
-
146
- # Assignment Submission
147
- st.subheader("Submit Your Assignment")
148
- with st.form("assignment_form"):
149
- proposal_file = st.file_uploader("Upload your research proposal (PDF or DOC)")
150
- comments = st.text_area("Additional comments or questions")
151
-
152
- if st.form_submit_button("Submit Assignment"):
153
- if proposal_file is not None:
154
- st.success("Assignment submitted successfully!")
155
- else:
156
- st.error("Please upload your research proposal.")
157
-
158
  if __name__ == "__main__":
159
- show_week_content()
 
3
  import plotly.graph_objects as go
4
  from sklearn.linear_model import LinearRegression
5
 
6
+ # Week 1 content in person
7
+ def show():
8
  st.markdown("""
9
+ ## Week 1 content in person
 
 
 
 
 
 
10
  """)
11
+
12
+ # Week 1 content online
13
+ def show():
14
  st.markdown("""
15
+ ## Week 1 content not online yet
 
 
 
 
 
 
16
  """)
 
 
 
 
 
 
 
 
 
 
 
 
 
17
  if __name__ == "__main__":
18
+ show()
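Python binds a `def` to its name when the statement runs, so when a module defines two functions with the same name, the later definition replaces the earlier one. The sketch below demonstrates that behaviour and one possible way to keep both variants reachable. The names `show_in_person`, `show_online` and the `mode` parameter are hypothetical, and the functions return strings rather than calling `st.markdown`.

```python
def show():
    return "Week 1 content in person"


def show():  # Rebinding the name: this definition replaces the one above.
    return "Week 1 content not online yet"


# Only the second definition is reachable through the name `show`.
result_of_duplicate = show()


# Distinct names keep both variants available (hypothetical names).
def show_in_person():
    return "Week 1 content in person"


def show_online():
    return "Week 1 content not online yet"


def show_variant(mode="online"):
    """Pick a variant explicitly instead of relying on definition order."""
    if mode == "in_person":
        return show_in_person()
    return show_online()
```

Selecting the variant through an argument makes the choice visible at the call site.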
app/pages/week_1_WIP.py ADDED
@@ -0,0 +1,159 @@
1
+ import streamlit as st
2
+ import numpy as np
3
+ import plotly.graph_objects as go
4
+ from sklearn.linear_model import LinearRegression
5
+
6
+ def show():
7
+ st.markdown("""
8
+ ## Week 1: Research Topic Selection and Literature Review
9
+
10
+ This week, you'll learn how to:
11
+ - Select a suitable research topic
12
+ - Conduct a literature review
13
+ - Define your research objectives
14
+ - Create a research proposal
15
+ """)
16
+
17
+ # Topic Selection Section
18
+ st.header("1. Topic Selection")
19
+ st.markdown("""
20
+ ### Guidelines for Selecting Your Research Topic:
21
+ - Choose a topic that interests you
22
+ - Ensure sufficient data availability
23
+ - Consider the scope and complexity
24
+ - Check for existing research gaps
25
+ """)
26
+
27
+ # Interactive Topic Selection
28
+ st.subheader("Topic Selection Form")
29
+ with st.form("topic_form"):
30
+ research_area = st.selectbox(
31
+ "Select your research area",
32
+ ["Computer Vision", "NLP", "Time Series", "Recommendation Systems", "Other"]
33
+ )
34
+
35
+ topic = st.text_input("Proposed Research Topic")
36
+ problem_statement = st.text_area("Brief Problem Statement")
37
+ motivation = st.text_area("Why is this research important?")
38
+
39
+ submitted = st.form_submit_button("Submit Topic")
40
+
41
+ if submitted:
42
+ st.success("Topic submitted successfully! We'll review and provide feedback.")
43
+
44
+ # Linear Regression Visualization
45
+ st.header("2. Linear Regression Demo")
46
+ st.markdown("""
47
+ ### Understanding Linear Regression
48
+
49
+ Linear regression is a fundamental machine learning algorithm that models the relationship between a dependent variable and one or more independent variables.
50
+ Below is an interactive demonstration of simple linear regression.
51
+ """)
52
+
53
+ # Create interactive controls
54
+ col1, col2 = st.columns(2)
55
+ with col1:
56
+ n_points = st.slider("Number of data points", 10, 100, 50)
57
+ noise = st.slider("Noise level", 0.1, 2.0, 0.5)
58
+ with col2:
59
+ slope = st.slider("True slope", -2.0, 2.0, 1.0)
60
+ intercept = st.slider("True intercept", -5.0, 5.0, 0.0)
61
+
62
+ # Generate synthetic data
63
+ np.random.seed(42)
64
+ X = np.random.rand(n_points) * 10
65
+ y = slope * X + intercept + np.random.normal(0, noise, n_points)
66
+
67
+ # Fit linear regression
68
+ X_reshaped = X.reshape(-1, 1)
69
+ model = LinearRegression()
70
+ model.fit(X_reshaped, y)
71
+ y_pred = model.predict(X_reshaped)
72
+
73
+ # Create the plot
74
+ fig = go.Figure()
75
+
76
+ # Add scatter plot of actual data
77
+ fig.add_trace(go.Scatter(
78
+ x=X,
79
+ y=y,
80
+ mode='markers',
81
+ name='Actual Data',
82
+ marker=dict(color='blue')
83
+ ))
84
+
85
+ # Add regression line
86
+ fig.add_trace(go.Scatter(
87
+ x=X,
88
+ y=y_pred,
89
+ mode='lines',
90
+ name='Regression Line',
91
+ line=dict(color='red')
92
+ ))
93
+
94
+ # Update layout
95
+ fig.update_layout(
96
+ title='Linear Regression Visualization',
97
+ xaxis_title='X',
98
+ yaxis_title='Y',
99
+ showlegend=True,
100
+ height=500
101
+ )
102
+
103
+ # Display the plot
104
+ st.plotly_chart(fig, use_container_width=True)
105
+
106
+ # Display regression coefficients
107
+ st.markdown(f"""
108
+ ### Regression Results
109
+ - Estimated slope: {model.coef_[0]:.2f}
110
+ - Estimated intercept: {model.intercept_:.2f}
111
+ - R² score: {model.score(X_reshaped, y):.2f}
112
+ """)
113
+
114
+ # Literature Review Section
115
+ st.header("3. Literature Review")
116
+ st.markdown("""
117
+ ### Steps for Conducting Literature Review:
118
+ 1. Search for relevant papers
119
+ 2. Read and analyze key papers
120
+ 3. Identify research gaps
121
+ 4. Document your findings
122
+ """)
123
+
124
+ # Literature Review Template
125
+ st.subheader("Literature Review Template")
126
+ with st.expander("Download Template"):
127
+ st.download_button(
128
+ label="Download Literature Review Template",
129
+ data="Literature Review Template\n\n1. Introduction\n2. Related Work\n3. Methodology\n4. Results\n5. Discussion\n6. Conclusion",
130
+ file_name="literature_review_template.txt",
131
+ mime="text/plain"
132
+ )
133
+
134
+ # Weekly Assignment
135
+ st.header("Weekly Assignment")
136
+ st.markdown("""
137
+ ### Assignment 1: Research Proposal
138
+ 1. Select your research topic
139
+ 2. Write a brief problem statement
140
+ 3. Conduct initial literature review
141
+ 4. Submit your research proposal
142
+
143
+ **Due Date:** End of Week 1
144
+ """)
145
+
146
+ # Assignment Submission
147
+ st.subheader("Submit Your Assignment")
148
+ with st.form("assignment_form"):
149
+ proposal_file = st.file_uploader("Upload your research proposal (PDF or DOC)")
150
+ comments = st.text_area("Additional comments or questions")
151
+
152
+ if st.form_submit_button("Submit Assignment"):
153
+ if proposal_file is not None:
154
+ st.success("Assignment submitted successfully!")
155
+ else:
156
+ st.error("Please upload your research proposal.")
157
+
158
+ if __name__ == "__main__":
159
+ show()
app/pages/week_2.py ADDED
@@ -0,0 +1,228 @@
1
+ import streamlit as st
2
+ import numpy as np
3
+ import plotly.graph_objects as go
4
+ import io
5
+ import sys
6
+ import pandas as pd
7
+ from contextlib import redirect_stdout
8
+ import matplotlib.pyplot as plt
9
+ import seaborn as sns
10
+
11
+ # Initialize session state for notebook-like cells
12
+ if 'cells' not in st.session_state:
13
+ st.session_state.cells = []
14
+ if 'df' not in st.session_state:
15
+ st.session_state.df = None
16
+
17
+ def capture_output(code, df=None):
18
+ """Helper function to capture print output"""
19
+ f = io.StringIO()
20
+ with redirect_stdout(f):
21
+ try:
22
+ # Create a dictionary of variables to use in exec
23
+ variables = {'pd': pd, 'np': np, 'plt': plt, 'sns': sns}
24
+ if df is not None:
25
+ variables['df'] = df
26
+ exec(code, variables)
27
+ except Exception as e:
28
+ return f"Error: {str(e)}"
29
+ return f.getvalue()
30
+
+ def show():
+     st.markdown("""
+     ## Week 2: Python Basics - Part 1: Coding Exercises
+
+     In this first part, we'll learn some fundamental Python concepts through hands-on exercises:
+     - Importing libraries
+     - Using print statements
+     - Basic arithmetic operations
+     - Working with lists
+     """)
+
+     # Importing Libraries Section
+     st.header("1. Importing Libraries")
+     st.markdown("""
+     Python has a rich ecosystem of libraries. To use them, we need to import them first.
+     """)
+
+     with st.expander("Import Example"):
+         st.code("""
+         # Importing a library
+         import math
+
+         # Using a function from the library
+         print(math.sqrt(16))  # This will print 4.0
+         """, line_numbers=True)
+
+     # Interactive Import Exercise
+     st.subheader("Try it yourself!")
+     import_code = st.text_area("Try importing and using the math library:",
+                                "import math\nprint(math.sqrt(25))",
+                                height=100)
+     if st.button("Run Import Code"):
+         output = capture_output(import_code)
+         st.code(output, line_numbers=True)
+
+     # Print Statements Section
+     st.header("2. Print Statements")
+     st.markdown("""
+     The print() function is used to display output to the console.
+     """)
+
+     with st.expander("Print Examples"):
+         st.code("""
+         # Basic print
+         print("Hello, World!")
+
+         # Print with variables
+         name = "Alice"
+         print(f"Hello, {name}!")
+
+         # Print multiple items
+         print("The answer is:", 42)
+         """, line_numbers=True)
+
+     # Interactive Print Exercise
+     st.subheader("Try it yourself!")
+     print_code = st.text_area("Try some print statements:",
+                               'print("Hello, World!")\nname = "Python"\nprint(f"Hello, {name}!")',
+                               height=100)
+     if st.button("Run Print Code"):
+         output = capture_output(print_code)
+         st.code(output, line_numbers=True)
+
+     # Basic Arithmetic Section
+     st.header("3. Basic Arithmetic")
+     st.markdown("""
+     Python can perform basic mathematical operations.
+     """)
+
+     with st.expander("Arithmetic Examples"):
+         st.code("""
+         # Addition
+         result = 5 + 3
+         print(result)  # Prints 8
+
+         # Subtraction
+         result = 10 - 4
+         print(result)  # Prints 6
+
+         # Multiplication
+         result = 6 * 7
+         print(result)  # Prints 42
+
+         # Division
+         result = 15 / 3
+         print(result)  # Prints 5.0
+         """, line_numbers=True)
+
+     # Interactive Arithmetic Exercise
+     st.subheader("Try it yourself!")
+     arithmetic_code = st.text_area("Try some arithmetic operations:",
+                                    'print(5 + 3)\nprint(10 - 4)\nprint(6 * 7)\nprint(15 / 3)',
+                                    height=100)
+     if st.button("Run Arithmetic Code"):
+         output = capture_output(arithmetic_code)
+         st.code(output, line_numbers=True)
+
+     # Lists Section
+     st.header("4. Lists")
+     st.markdown("""
+     Lists are used to store multiple items in a single variable.
+     """)
+
+     with st.expander("List Examples"):
+         st.code("""
+         # Creating a list
+         fruits = ["apple", "banana", "cherry"]
+
+         # Accessing list items
+         print(fruits[0])  # Prints apple
+
+         # Adding to a list
+         fruits.append("orange")
+         print(fruits)  # Prints ['apple', 'banana', 'cherry', 'orange']
+
+         # List length
+         print(len(fruits))  # Prints 4
+         """, line_numbers=True)
+
+     # Interactive List Exercise
+     st.subheader("Try it yourself!")
+     list_code = st.text_area("Try working with lists:",
+                              'fruits = ["apple", "banana", "cherry"]\nprint(fruits[0])\nfruits.append("orange")\nprint(fruits)\nprint(len(fruits))',
+                              height=100)
+     if st.button("Run List Code"):
+         output = capture_output(list_code)
+         st.code(output, line_numbers=True)
+
+     # Practice Exercise
+     st.header("Practice Exercise")
+     st.markdown("""
+     ### Try this exercise:
+     Create a program that:
+     1. Imports the math library
+     2. Creates a list of numbers
+     3. Uses a loop to print each number and its square root
+     """)
+
+     # Interactive Practice Exercise
+     st.subheader("Try your solution!")
+     practice_code = st.text_area("Write your solution here:",
+                                  'import math\n\nnumbers = [4, 9, 16, 25]\n\nfor num in numbers:\n    print(f"Number: {num}, Square root: {math.sqrt(num)}")',
+                                  height=150)
+     if st.button("Run Practice Code"):
+         output = capture_output(practice_code)
+         st.code(output, line_numbers=True)
+
+     st.markdown("""
+     ## Part 2: Data Cleaning Lab
+
+     In this lab, we'll learn how to clean and prepare data using pandas. We'll work with the Advertising dataset and practice common data cleaning techniques.
+
+     This lab is hosted in a Jupyter notebook environment. We will create a new notebook for this lab.
+     """)
+
+
+     st.markdown("""
+     ## Week 2: Reference Material
+
+     Please refer to the following links:
+     - [Pandas Documentation](https://pandas.pydata.org/docs/)
+     - [NumPy Documentation](https://numpy.org/doc/)
+     - [Matplotlib Documentation](https://matplotlib.org/stable/users/index.html)
+     - [Seaborn Documentation](https://seaborn.pydata.org/index.html)
+     To learn more about Python, see the following links:
+     - [Introduction to Statistical Learning](https://www.statlearning.com/resources-python)
+     - [Learning Python notebook](https://github.com/intro-stat-learning/ISLP_labs/blob/stable/Ch02-statlearn-lab.ipynb)
+     For the dataset used in class today:
+     - [Advertising Dataset](https://www.statlearning.com/s/Advertising.csv)
+     """)
+
+     # Weekly Assignment
+     st.header("Weekly Assignment")
+     st.markdown("""
+     ### Assignment 2: Python Basics
+     1. Import the dataset that you studied last week: https://github.com/hollandstam1/thesis/blob/main/_book/Quantifying- Art-Historical-Narratives.pdf
+     2. Create a new notebook and load the dataset
+     3. Explore the dataset by answering the following questions:
+        - How many rows and columns are there in the dataset?
+        - What are the variables in the dataset?
+        - What is the data type of each variable?
+        - What is the range of each variable?
+        - What is the mean of each variable?
+
+     **Due Date:** End of Week 2
+     """)
+     '''
+     # Assignment Submission
+     st.subheader("Submit Your Assignment")
+     with st.form("assignment_form"):
+         script_file = st.file_uploader("Upload your Python script (.py)")
+         comments = st.text_area("Additional comments or questions")
+
+         if st.form_submit_button("Submit Assignment"):
+             if script_file is not None:
+                 st.success("Assignment submitted successfully!")
+             else:
+                 st.error("Please upload your Python script.")'''
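
Note on `capture_output` in `app/pages/week_2.py`: the helper runs learner-typed code with `exec()` while `redirect_stdout` collects anything it prints. The sketch below isolates that pattern outside Streamlit so it can be tried in a plain interpreter. The name `run_snippet` and the `extra_vars` parameter are hypothetical and not part of this commit, and the sketch makes one deliberate change: it keeps output printed before a failure instead of discarding it. As in the committed helper, `exec()` runs arbitrary code, so this is only reasonable for trusted input.

```python
import io
from contextlib import redirect_stdout


def run_snippet(code, extra_vars=None):
    """Execute `code` and return the text it printed (hypothetical helper)."""
    buffer = io.StringIO()
    # A fresh namespace per run, so one snippet cannot leak names into the next.
    namespace = dict(extra_vars or {})
    try:
        with redirect_stdout(buffer):
            exec(code, namespace)
    except Exception as exc:
        # Keep whatever was printed before the failure, then append the error.
        return buffer.getvalue() + f"Error: {exc}"
    return buffer.getvalue()


if __name__ == "__main__":
    print(run_snippet("import math\nprint(math.sqrt(25))"))
    print(run_snippet("print('before')\nprint(1 / 0)"))
```

Passing `extra_vars={"df": some_dataframe}` would play the role of the `df` argument in the committed helper.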