Commercial banks receive *a lot* of applications for credit cards. Many of them get rejected for many reasons, like high loan balances, low income levels, or too many inquiries on an individual's credit report, for example. Manually analyzing these applications is mundane, error-prone, and time-consuming (and time is money!). Luckily, this task can be automated with the power of machine learning and pretty much every commercial bank does so nowadays. In this notebook, we will build an automatic credit card approval predictor using machine learning techniques, just like the real banks do.

We'll use the Credit Card Approval dataset from the UCI Machine Learning Repository. The structure of this notebook is as follows:

- First, we will start off by loading and viewing the dataset.
- We will see that the dataset has a mixture of both numerical and non-numerical features, that it contains values from different ranges, plus that it contains a number of missing entries.
- We will have to preprocess the dataset to ensure the machine learning model we choose can make good predictions.
- After our data is in good shape, we will do some exploratory data analysis to build our intuitions.
- Finally, we will build a machine learning model that can predict if an individual's application for a credit card will be accepted.

First, loading and viewing the dataset. We find that since this data is confidential, the contributor of the dataset has anonymized the feature names.

In [193]:

```
# Import pandas
import pandas as pd
# Load dataset
cc_apps = pd.read_csv('datasets/cc_approvals.data', header = None)
# Inspect data
cc_apps.head()
```

Out[193]:

The output may look confusing at first sight, but let's try to work out the most important features of a credit card application. The features of this dataset have been anonymized to protect privacy, but this blog gives us a pretty good overview of the probable features. The probable features in a typical credit card application are `Gender`, `Age`, `Debt`, `Married`, `BankCustomer`, `EducationLevel`, `Ethnicity`, `YearsEmployed`, `PriorDefault`, `Employed`, `CreditScore`, `DriversLicense`, `Citizen`, `ZipCode`, `Income` and finally the `ApprovalStatus`. This gives us a pretty good starting point, and we can map these features to the columns in the output.
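As a rough sketch of that mapping, the probable names can be attached to the anonymized column positions. The ordering below simply follows the list above and is an assumption rather than something the dataset documents; `demo` is a made-up stand-in frame so the snippet runs without the CSV file.

```python
import numpy as np
import pandas as pd

# Probable feature names in the order listed above (an assumption:
# the dataset itself only exposes integer column positions 0-15).
probable_names = [
    "Gender", "Age", "Debt", "Married", "BankCustomer", "EducationLevel",
    "Ethnicity", "YearsEmployed", "PriorDefault", "Employed", "CreditScore",
    "DriversLicense", "Citizen", "ZipCode", "Income", "ApprovalStatus",
]

# Map each integer column position to its probable name.
name_map = dict(enumerate(probable_names))

# Hypothetical stand-in frame with 16 unnamed columns.
demo = pd.DataFrame(np.arange(32).reshape(2, 16))
named_demo = demo.rename(columns=name_map)
print(named_demo.columns.tolist())
```

Under this assumed ordering, columns 11 and 13 correspond to `DriversLicense` and `ZipCode`.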

As we can see from our first glance at the data, the dataset has a mixture of numerical and non-numerical features. This can be fixed with some preprocessing, but before we do that, let's learn about the dataset a bit more to see if there are other dataset issues that need to be fixed.

In [195]:

```
# Print summary statistics
cc_apps_description = cc_apps.describe()
print(cc_apps_description)
print("\n")
# Print DataFrame information (info() prints directly and returns None)
cc_apps.info()
print("\n")
# Inspect missing values in the dataset
print(cc_apps.tail(17))
```

We've uncovered some issues that will affect the performance of our machine learning model(s) if they go unchanged:

- Our dataset contains both numeric and non-numeric data (specifically data of `float64`, `int64` and `object` types). Specifically, the features 2, 7, 10 and 14 contain numeric values (of types float64, float64, int64 and int64 respectively) and all the other features contain non-numeric values.
- The dataset also contains values from several ranges. Some features have a value range of 0 - 28, some have a range of 2 - 67, and some have a range of 1017 - 100000. Apart from these, we can get useful statistical information (like `mean`, `max`, and `min`) about the features that have numerical values.
- Finally, the dataset has missing values, which we'll take care of in this task. The missing values in the dataset are labeled with '?', which can be seen in the last cell's output.

Now, let's temporarily replace these missing value question marks with NaN.

In [197]:

```
# Import numpy
import numpy as np
# Inspect missing values in the dataset
print(cc_apps.tail(17))
# Replace the '?'s with NaN
cc_apps = cc_apps.replace('?', np.nan)
# Inspect the missing values again
print(cc_apps.tail(17))
```

We replaced all the question marks with NaNs. This is going to help us in the next missing value treatment that we are going to perform.

An important question that gets raised here is *why are we giving so much importance to missing values*? Can't they just be ignored? Ignoring missing values can heavily affect the performance of a machine learning model: the model may miss out on information in the dataset that would be useful for its training. In addition, many models, such as LDA, cannot handle missing values on their own.
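As a small illustration of that last point, the sketch below passes a made-up array containing one NaN to scikit-learn's `LinearDiscriminantAnalysis`. The toy values are hypothetical; the only claim is that the estimator refuses non-finite input instead of silently skipping it.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Toy data (made up): four samples, two features, one missing entry.
X_toy = np.array([[1.0, 2.0], [2.0, np.nan], [3.0, 1.0], [4.0, 3.0]])
y_toy = np.array([0, 0, 1, 1])

try:
    LinearDiscriminantAnalysis().fit(X_toy, y_toy)
    fit_failed = False
except ValueError as err:
    # scikit-learn validates its input and rejects NaN values here
    fit_failed = True
    print("LDA refused the data:", err)
```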

So, to avoid this problem, we are going to impute the missing values with a strategy called mean imputation.
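To make the idea concrete, here is a minimal sketch of mean imputation on a made-up numeric column (the values are hypothetical): each missing entry is replaced by the mean of the entries that are present.

```python
import numpy as np
import pandas as pd

# Made-up numeric column with one missing value.
toy = pd.Series([1.0, np.nan, 3.0])

# mean() skips NaN by default, so the mean of the observed values is 2.0.
toy_filled = toy.fillna(toy.mean())
print(toy_filled.tolist())
```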

In [199]:

```
# Impute the missing values in the numeric columns with mean imputation
numeric_cols = [2, 7, 10, 14]
cc_apps[numeric_cols] = cc_apps[numeric_cols].fillna(cc_apps[numeric_cols].mean())
# Count the number of NaNs in the dataset to verify
print(cc_apps.isna().sum())
```

We have successfully taken care of the missing values present in the numeric columns. There are still some missing values to be imputed for columns 0, 1, 3, 4, 5, 6 and 13. All of these columns contain non-numeric data, and this is why the mean imputation strategy would not work here. They need a different treatment.

We are going to impute these missing values with the most frequent values as present in the respective columns. This is good practice when it comes to imputing missing values for categorical data in general.
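A minimal sketch of this most-frequent-value strategy on a made-up categorical column (the values are hypothetical):

```python
import numpy as np
import pandas as pd

# Made-up categorical column with two missing entries.
toy_cat = pd.Series(["a", "b", np.nan, "b", np.nan], dtype="object")

# mode() ignores NaN and returns the most frequent value(s); take the first.
most_frequent = toy_cat.mode()[0]
toy_cat_filled = toy_cat.fillna(most_frequent)
print(toy_cat_filled.tolist())
```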

In [201]:

```
# Iterate over each column of cc_apps
for col in cc_apps.columns:
    # Check if the column is of object type
    if cc_apps[col].dtypes == 'object':
        # Impute this column with its own most frequent value
        cc_apps[col] = cc_apps[col].fillna(cc_apps[col].value_counts().index[0])
# Count the number of NaNs in the dataset and print the counts to verify
print(cc_apps.isna().sum())
```

The missing values are now successfully handled.

There is still some minor but essential data preprocessing needed before we proceed towards building our machine learning model. We are going to divide these remaining preprocessing steps into three main tasks:

- Convert the non-numeric data into numeric.
- Split the data into train and test sets.
- Scale the feature values to a uniform range.

First, we will be converting all the non-numeric values into numeric ones. We do this because it not only results in faster computation, but many machine learning models (like XGBoost, and especially the ones developed using scikit-learn) also require the data to be in a strictly numeric format. We will do this by using a technique called label encoding.
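As a quick sketch of what label encoding does, the snippet below encodes a made-up list of category strings (hypothetical values). `LabelEncoder` sorts the distinct values and assigns each one an integer code.

```python
from sklearn.preprocessing import LabelEncoder

# Made-up categorical values.
toy_values = ["b", "a", "b", "c"]

encoder = LabelEncoder()
codes = encoder.fit_transform(toy_values)

# classes_ holds the sorted distinct values; a value's code is its position there.
print(list(encoder.classes_))
print(codes.tolist())
```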

In [203]:

```
# Import LabelEncoder
from sklearn.preprocessing import LabelEncoder
# Instantiate LabelEncoder
le = LabelEncoder()
# Iterate over the columns and check each column's dtype
# .to_numpy() is advised over .values in the pandas 1.3.4 documentation
for col in cc_apps.columns.to_numpy():
    # Check whether the dtype is object
    if cc_apps[col].dtypes == 'object':
        # Use LabelEncoder to do the numeric transformation
        cc_apps[col] = le.fit_transform(cc_apps[col])
```

We have successfully converted all the non-numeric values to numeric ones.

Now, we will split our data into train set and test set to prepare our data for two different phases of machine learning modeling: training and testing. Ideally, no information from the test data should be used to scale the training data or should be used to direct the training process of a machine learning model. Hence, we first split the data and then apply the scaling.

Also, features like `DriversLicense` and `ZipCode` are not as important as the other features in the dataset for predicting credit card approvals. We should drop them to design our machine learning model with the best set of features. In Data Science literature, this is often referred to as *feature selection*.

In [205]:

```
# Import train_test_split
from sklearn.model_selection import train_test_split
# Drop the features 11 and 13 and convert the DataFrame to a NumPy array
cc_apps = cc_apps.drop([11,13], axis=1)
cc_apps = cc_apps.to_numpy()
# Segregate features and labels into separate variables:
# after the drop, columns 0-12 are the features and column 13 is the label
X, y = cc_apps[:, 0:13], cc_apps[:, 13]
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.33,
                                                    random_state=42)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```

The data is now split into two separate sets - train and test sets respectively. We are only left with one final preprocessing step of scaling before we can fit a machine learning model to the data.

Now, let's try to understand what these scaled values mean in the real world. Let's use `CreditScore` as an example. The credit score of a person is their creditworthiness based on their credit history. The higher this number, the more financially trustworthy a person is considered to be. So, a `CreditScore` of 1 is the highest since we're rescaling all the values to the range of 0-1.
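The rescaling itself can be sketched on made-up numbers (the arrays below are hypothetical). Min-max scaling maps each value to `(x - min) / (max - min)`, where `min` and `max` come from the training data only, in line with the earlier remark that no information from the test data should be used for scaling.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Made-up single-feature train and test sets.
train = np.array([[0.0], [5.0], [20.0]])
test = np.array([[10.0], [40.0]])

scaler_demo = MinMaxScaler(feature_range=(0, 1))
train_scaled = scaler_demo.fit_transform(train)  # learns min=0, max=20
test_scaled = scaler_demo.transform(test)        # reuses the training min/max

# The same computation written out by hand for the training data.
manual = (train - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

print(train_scaled.ravel())
print(test_scaled.ravel())
```

One consequence of this design is that a test value larger than the training maximum is mapped above 1, as the second test value is here.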

In [207]:

```
# Import MinMaxScaler
from sklearn.preprocessing import MinMaxScaler
# Instantiate MinMaxScaler, fit it on X_train only, and reuse it for X_test
scaler = MinMaxScaler(feature_range=(0, 1))
rescaledX_train = scaler.fit_transform(X_train)
rescaledX_test = scaler.transform(X_test)
print(rescaledX_train.shape)
print(rescaledX_test.shape)
```

Essentially, predicting if a credit card application will be approved or not is a classification task. According to UCI, our dataset contains more instances that correspond to "Denied" status than instances corresponding to "Approved" status. Specifically, out of 690 instances, there are 383 (55.5%) applications that got denied and 307 (44.5%) applications that got approved.

This gives us a benchmark: a model that always predicted "Denied" would already be right about 55.5% of the time, so a good machine learning model should be clearly more accurate than that.
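The class counts quoted above translate into a simple majority-class baseline, sketched here as plain arithmetic:

```python
# Class counts as reported for the dataset.
n_denied, n_approved = 383, 307
n_total = n_denied + n_approved

# Accuracy of always predicting the more common class.
majority_baseline = max(n_denied, n_approved) / n_total
print(f"{majority_baseline:.3f}")
```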

Which model should we pick? A question to ask is: *are the features that affect the credit card approval decision process correlated with each other?* Although we can measure correlation, that is outside the scope of this notebook, so we'll rely on our intuition that they indeed are correlated for now. Because of this correlation, we'll take advantage of the fact that generalized linear models perform well in these cases. Let's start our machine learning modeling with a Logistic Regression model (a generalized linear model).

In [209]:

```
# Import LogisticRegression
from sklearn.linear_model import LogisticRegression
# Instantiate a LogisticRegression classifier with default parameter values
logreg = LogisticRegression()
# Fit logreg to the train set
logreg.fit(rescaledX_train,y_train)
```

Out[209]:

But how well does our model perform?

We will now evaluate our model on the test set with respect to classification accuracy. But we will also take a look at the model's confusion matrix. In the case of predicting credit card applications, it is equally important to see if our machine learning model is able to predict as denied the applications that originally got denied. If our model is not performing well in this aspect, then it might end up approving applications that should have been denied. The confusion matrix helps us to view our model's performance from these aspects.

In [211]:

```
# Import confusion_matrix
from sklearn.metrics import confusion_matrix
# Use logreg to predict instances from the test set and store it
y_pred = logreg.predict(rescaledX_test)
# Get the accuracy score of logreg model and print it
print("Accuracy of logistic regression classifier: ", logreg.score(rescaledX_test, y_test))
# Print the confusion matrix of the logreg model
print(confusion_matrix(y_test, y_pred))
```

Our model was pretty good! It was able to yield an accuracy score of almost 84%.

For the confusion matrix, the first element of the first row denotes the true negatives, meaning the number of negative instances (denied applications) predicted by the model correctly. And the last element of the second row denotes the true positives, meaning the number of positive instances (approved applications) predicted by the model correctly. Note that `LabelEncoder` assigns codes in sorted label order, so it is worth confirming which status was encoded as 0 before interpreting the cells.
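The layout of scikit-learn's confusion matrix can be checked on a made-up example. The labels below are hypothetical, with 0 as the negative class and 1 as the positive class; rows are true classes and columns are predicted classes.

```python
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions (0 = negative, 1 = positive).
y_true_demo = [0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 1, 0]

cm = confusion_matrix(y_true_demo, y_pred_demo)

# For a binary problem, ravel() yields the cells in this order.
tn, fp, fn, tp = cm.ravel()
print(cm)
print(tn, fp, fn, tp)
```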

Let's see if we can do better. We can perform a grid search of the model parameters to improve the model's ability to predict credit card approvals.

scikit-learn's implementation of logistic regression consists of different hyperparameters but we will grid search over the following two:

- tol
- max_iter

In [213]:

```
# Import GridSearchCV
from sklearn.model_selection import GridSearchCV
# Define the grid of values for tol and max_iter
tol = [0.01, 0.001, 0.0001]
max_iter = [100, 150, 200]
# Create a dictionary where tol and max_iter are keys and the lists of their values are corresponding values
param_grid = dict(tol = tol, max_iter = max_iter)
```

We have defined the grid of hyperparameter values and converted them into a single dictionary format which `GridSearchCV()` expects as one of its parameters. Now, we will begin the grid search to see which values perform best.

We will instantiate `GridSearchCV()` with our earlier `logreg` model with all the data we have. Instead of passing train and test sets separately, we will supply `X` (scaled version) and `y`. We will also instruct `GridSearchCV()` to perform a cross-validation of five folds.

We'll end the notebook by storing the best-achieved score and the respective best parameters.

While building this credit card predictor, we tackled some of the most widely-known preprocessing steps such as **scaling**, **label encoding**, and **missing value imputation**. We finished with some **machine learning** to predict if a person's application for a credit card would get approved or not given some information about that person.

In [218]:

```
# Instantiate GridSearchCV with the required parameters
grid_model = GridSearchCV(estimator = logreg, param_grid = param_grid, cv = 5)
# Use scaler to rescale X and assign it to rescaledX
rescaledX = scaler.fit_transform(X)
# Fit data to grid_model
grid_model_result = grid_model.fit(rescaledX, y)
# Summarize results
best_score, best_params = grid_model_result.best_score_, grid_model_result.best_params_
print("Best: %f using %s" % (best_score, best_params))
```

This Jupyter notebook contains the Datacamp exercises and some of my personal notes. If you're interested in learning this subject have a look at: https://campus.datacamp.com/courses/supervised-learning-with-scikit-learn

Machine learning is the field that teaches machines and computers to learn from existing data to make predictions on new data: Will a tumor be benign or malignant? Which of your customers will take their business elsewhere? Is a particular email spam? In this course, you'll learn how to use Python to perform supervised learning, an essential component of machine learning. You'll learn how to build predictive models, tune their parameters, and determine how well they will perform with unseen data—all while using real world datasets. You'll be using scikit-learn, one of the most popular and user-friendly machine learning libraries for Python.

Machine learning is the science and art of giving computers the ability to learn to make decisions from data without being explicitly programmed. For example, your computer can learn to predict whether an email is spam or not spam given its content and sender. Another example: your computer can learn to cluster, say, Wikipedia entries, into different categories based on the words they contain. It could then assign any new Wikipedia article to one of the existing clusters. Notice that, in the first example, we are trying to predict a particular class label, that is, spam or not spam. In the second example, there is no such label. When there are labels present, we call it supervised learning. When there are no labels present, we call it unsupervised learning.

**Unsupervised learning.** Unsupervised learning, in essence, is the machine learning task of uncovering hidden patterns and structures from unlabeled data. For example, a business may wish to group its customers into distinct categories based on their purchasing behavior without knowing in advance what these categories may be. This is known as clustering, one branch of unsupervised learning.

**Reinforcement learning.** There is also reinforcement learning, in which machines or software agents interact with an environment. Reinforcement agents are able to automatically figure out how to optimize their behavior given a system of rewards and punishments. Reinforcement learning draws inspiration from behavioral psychology and has applications in many fields, such as economics, genetics, and game playing. In 2016, reinforcement learning was used to train Google DeepMind's AlphaGo, which was the first computer program to beat the world champion in Go.

**Supervised learning.** But let's come back to supervised learning, which will be the focus of this course. In supervised learning, we have several data points or samples, described using predictor variables or features and a target variable. Our data is commonly represented in a table structure such as the one you see here, in which there is a row for each data point and a column for each feature. Here, we see the iris dataset: each row represents measurements of a different flower and each column is a particular kind of measurement, like the width and length of a certain part of the flower. The aim of supervised learning is to build a model that is able to predict the target variable, here the particular species of a flower, given the predictor variables, here the physical measurements. If the target variable consists of categories, like 'click' or 'no click', 'spam' or 'not spam', or different species of flowers, we call the learning task classification. Alternatively, if the target is a continuously varying variable, for example, the price of a house, it is a regression task. In this chapter, we will focus on classification; in the following one, on regression.

**Naming conventions.** A note on naming conventions: out in the wild, you will find that what we call a feature, others may call a predictor variable or independent variable, and what we call the target variable, others may call dependent variable or response variable.

**Supervised learning.** The goal of supervised learning is frequently to either automate a time-consuming or expensive manual task, such as a doctor's diagnosis, or to make predictions about the future, say whether a customer will click on an ad or not. For supervised learning, you need labeled data and there are many ways to get it: you can get historical data, which already has labels that you are interested in; you can perform experiments to get labeled data, such as A/B-testing to see how many clicks you get; or you can crowdsource labeled data, like reCAPTCHA does for text recognition. In any case, the goal is to learn from data for which the right output is known, so that we can make predictions on new data for which we don't know the output.

**Supervised learning in Python.** There are many ways to perform supervised learning in Python. In this course, we will use scikit-learn, or sklearn, one of the most popular and user-friendly machine learning libraries for Python. It also integrates very well with the SciPy stack, including libraries such as NumPy. There are a number of other ML libraries out there, such as TensorFlow and keras, which are well worth checking out once you've got the basics down.

Once you decide to leverage supervised machine learning to solve a new problem, you need to identify whether your problem is better suited to classification or regression. This exercise will help you develop your intuition for distinguishing between the two.

Provided below are 4 example applications of machine learning. Which of them is a supervised classification problem?

Possible Answers

- **Using labeled financial data to predict whether the value of a stock will go up or go down next week.**
- Using labeled housing price data to predict the price of a new house based on various features.
- Using unlabeled data to cluster the students of an online education company into different categories based on their learning styles.
- Using labeled financial data to predict what the value of a stock will be next week.

Exactly! In this example, there are two discrete, qualitative outcomes: the stock market going up, and the stock market going down. This can be represented using a binary variable, and is an application perfectly suited for classification.

**The Iris dataset.** It contains data pertaining to iris flowers in which the features consist of four measurements: petal length, petal width, sepal length, and sepal width. The target variable encodes the species of flower and there are three possibilities: 'versicolor', 'virginica', and 'setosa'.

**The Iris dataset in scikit-learn.** As this is one of the datasets included in scikit-learn, we'll import it from there with `from sklearn import datasets`. In the exercises, you'll get practice at importing files from your local file system for supervised learning. We'll also import pandas, numpy, and pyplot under their standard aliases. In addition, we'll set the plotting style to ggplot using `plt.style.use`: firstly, because it looks great and secondly, in order to help all you R aficionados feel at home. We then load the dataset with `datasets.load_iris()` and assign the data to a variable `iris`. Checking out the type of `iris`, we see that it's a `Bunch`, which is similar to a dictionary in that it contains key-value pairs. Printing the keys, we see that they are the feature names; `DESCR`, which provides a description of the dataset; the target names; the data, which contains the values of the features; and the target, which is the target data.

As you see here, both the feature and target data are provided as NumPy arrays. The `.shape` attribute of the feature array tells us that there are 150 rows and four columns. Remember: samples are in rows, features are in columns. Thus we have 150 samples and the four features: petal length and width and sepal length and width. Moreover, note that the target variable is encoded as 0 for "setosa", 1 for "versicolor" and 2 for "virginica". We can see this by printing `iris.target_names`, in which "setosa" corresponds to index 0, "versicolor" to index 1 and "virginica" to index 2.
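The loading steps described above can be sketched as follows. The plotting imports are left out here, since this snippet only inspects the loaded object.

```python
from sklearn import datasets

# Load the bundled Iris dataset; the result is a dictionary-like Bunch.
iris = datasets.load_iris()

print(type(iris))
print(iris.keys())

# Features and target are NumPy arrays: 150 samples, 4 features.
print(iris.data.shape, iris.target.shape)

# Index 0, 1, 2 in the target corresponds to these species names.
print(iris.target_names)
```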

**Exploratory data analysis (EDA).** In order to perform some initial exploratory data analysis, or EDA for short, we'll assign the feature and target data to `X` and `y`, respectively. We'll then build a DataFrame of the feature data using `pd.DataFrame`, also passing column names. Viewing the head of the DataFrame shows us the first five rows.
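A minimal sketch of these steps, using the feature names shipped with the dataset as column names:

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()

# Feature matrix and target vector.
X = iris.data
y = iris.target

# Build a DataFrame of the features, labeling columns with the feature names.
df = pd.DataFrame(X, columns=iris.feature_names)
print(df.head())
```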

**Visual EDA.** Now, we'll do a bit of visual EDA. We use the pandas function `scatter_matrix` to visualize our dataset. We pass it our DataFrame, along with our target variable as the argument to the parameter `c`, which stands for color, ensuring that the data points in our figure will be colored by their species. We also pass a list to `figsize`, which specifies the size of our figure, as well as a marker size and shape.
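A sketch of such a call is shown below. The specific figure size, marker size, and marker shape are assumptions chosen for illustration, and the non-interactive `Agg` backend is selected only so the snippet also runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# Color points by species via c; size and marker values are illustrative.
axes = pd.plotting.scatter_matrix(
    df, c=iris.target, figsize=[8, 8], s=150, marker="D"
)

# One subplot per pair of features: a 4 x 4 grid of axes.
print(axes.shape)
```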

The result is a matrix of figures, which on the diagonal are histograms of the features corresponding to the row and column. The off-diagonal figures are scatter plots of the column feature versus row feature colored by the target variable. There is a great deal of information in this scatter matrix. See here, for example, that petal width and length are highly correlated, as you may expect, and that flowers are clustered according to species.

In this chapter, you'll be working with a dataset obtained from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/Congressional+Voting+Records) consisting of votes made by US House of Representatives Congressmen. Your goal will be to predict their party affiliation ('Democrat' or 'Republican') based on how they voted on certain key issues.

Here, it's worth noting that we have preprocessed this dataset to deal with missing values. This is so that your focus can be directed towards understanding how to train and evaluate supervised learning models. Once you have mastered these fundamentals, you will be introduced to preprocessing techniques in Chapter 4 and have the chance to apply them there yourself, including on this very same dataset!

Before thinking about what supervised learning models you can apply to this, however, you need to perform exploratory data analysis (EDA) in order to understand the structure of the data. For a refresher on the importance of EDA, check out the first two chapters of Statistical Thinking in Python (Part 1). Get started with your EDA now by exploring this voting records dataset numerically. It has been pre-loaded for you into a DataFrame called `df`. Use pandas' `.head()`, `.info()`, and `.describe()` methods in the IPython Shell to explore the DataFrame, and select the statement below that is not true.

Possible Answers

- DataFrame has a total of 435 rows and 17 columns.
- Except for 'party', all of the columns are of type int64.
- The first two rows of the DataFrame consist of votes made by Republicans and the next three rows consist of votes made by Democrats.
- **There are 17 predictor variables, or features, in this DataFrame.**
- The target variable in this DataFrame is 'party'.

The number of columns in the DataFrame is not equal to the number of features. One of the columns - 'party' is the target variable.

In [17]:

```
import pandas as pd
df = pd.read_csv('./datasets/house-votes-84.data', header=None)
df.columns = ['party', 'infants', 'water', 'budget', 'physician', 'salvador',
'religious', 'satellite', 'aid', 'missile', 'immigration', 'synfuels',
'education', 'superfund', 'crime', 'duty_free_exports', 'eaa_rsa']
df.replace({'?':'n'}, inplace=True)
df.replace({'n':0, 'y': 1}, inplace=True)
df.head()
```

Out[17]:

The Numerical EDA you did in the previous exercise gave you some very important information, such as the names and data types of the columns, and the dimensions of the DataFrame. Following this with some visual EDA will give you an even better understanding of the data. In the video, Hugo used the scatter_matrix() function on the Iris data for this purpose. However, you may have noticed in the previous exercise that all the features in this dataset are binary; that is, they are either 0 or 1. So a different type of plot would be more useful here, such as Seaborn's countplot.

Given on the right is a countplot of the 'education' bill, generated by calling, in order, `plt.figure()`, `sns.countplot(x='education', hue='party', data=df, palette='RdBu')`, `plt.xticks([0,1], ['No', 'Yes'])` and `plt.show()`.

In sns.countplot(), we specify the x-axis data to be 'education', and hue to be 'party'. Recall that 'party' is also our target variable. So the resulting plot shows the difference in voting behavior between the two parties for the 'education' bill, with each party colored differently. We manually specified the color to be 'RdBu', as the Republican party has been traditionally associated with red, and the Democratic party with blue.

It seems like Democrats voted resoundingly against this bill, compared to Republicans. This is the kind of information that our machine learning model will seek to learn when we try to predict party affiliation solely based on voting behavior. An expert in U.S. politics may be able to predict this without machine learning, but probably not instantaneously, and certainly not if we are dealing with hundreds of samples!

In the IPython Shell, explore the voting behavior further by generating countplots for the 'satellite' and 'missile' bills, and answer the following question: of these two bills, which did Democrats vote resoundingly in favor of, compared to Republicans? Be sure to begin your plotting statements for each figure with `plt.figure()` so that a new figure will be set up. Otherwise, your plots will be overlaid onto the same figure.

Possible Answers

- 'satellite'.
- 'missile'.
- **Both 'satellite' and 'missile'.**
- Neither 'satellite' nor 'missile'.

In [18]:

```
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(5, 5))
sns.countplot(x='education', hue='party', data=df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])
plt.show()
```

In [19]:

```
plt.figure(figsize=(5, 5))
sns.countplot(x='satellite', hue='party', data=df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])
plt.show()
```

In [20]:

```
plt.figure(figsize=(5, 5))
sns.countplot(x='missile', hue='party', data=df, palette='RdBu')
plt.xticks([0,1], ['No', 'Yes'])
plt.show()
```

Democrats voted in favor of both 'satellite' and 'missile'.

**The classification challenge.** We have a set of labeled data and we want to build a classifier that takes unlabeled data as input and outputs a label. So how do we construct this classifier? We first need to choose a type of classifier, and it needs to learn from the already labeled data. For this reason, we call the already labeled data the training data. So let's build our first classifier!

**k-Nearest Neighbors.** We'll choose a simple algorithm called K-nearest neighbors. The basic idea of K-nearest neighbors, or KNN, is to predict the label of any data point by looking at the K, for example, 3, closest labeled data points and getting them to vote on what label the unlabeled point should have.

In this image, there's an example of KNN in two dimensions: how do you classify the data point in the middle? Well, if k equals 3, you would classify it as red and, if k equals 5, as green.
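The voting idea can be sketched from scratch in a few lines. This is an illustrative implementation, not scikit-learn's; the helper `knn_predict` and the 2D points are made up, arranged so that the query point at the origin is classified as red for k = 3 and green for k = 5, mirroring the example above.

```python
from collections import Counter
import numpy as np

def knn_predict(X_train, y_train, x_new, k):
    """Predict a label for x_new by majority vote among its k nearest points."""
    # Euclidean distance from x_new to every training point.
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(distances)[:k]
    # Count the labels of those neighbors and return the most common one.
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Made-up labeled points around the origin.
X_pts = np.array([[1.0, 0.0], [-1.1, 0.0], [0.0, 1.2],
                  [0.0, -2.0], [2.1, 0.0], [5.0, 5.0]])
y_pts = np.array(["red", "green", "red", "green", "green", "red"])
query = np.array([0.0, 0.0])

print(knn_predict(X_pts, y_pts, query, k=3))
print(knn_predict(X_pts, y_pts, query, k=5))
```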

**k-NN: Intuition.** To get a bit of intuition for KNN, let's check out a scatter plot of two dimensions of the iris dataset, petal length and petal width. The following holds for higher dimensions; however, we'll show the 2D case for illustrative purposes.

What the KNN algorithm essentially does is create a set of decision boundaries, and we visualize the 2D case here. Any new data point here will be predicted 'setosa', any new data point here will be predicted 'virginica', and any new data point here will be predicted 'versicolor'.

**Scikit-learn fit and predict.** All machine learning models in scikit-learn are implemented as Python classes. These classes serve two purposes: they implement the algorithms for learning a model and predicting, while also storing all the information that is learned from the data. Training a model on the data is also called fitting the model to the data. In scikit-learn, we use the `fit` method to do this. Similarly, the `predict` method is what we use to predict the label of an unlabeled data point.

Using scikit-learn to fit a classifier Now we're going to fit our very first classifier using scikit-learn! To do so, we first need to import it. To this end, we import KNeighborsClassifier from sklearn dot neighbors. We then instantiate our KNeighborsClassifier, set the number of neighbors equal to 6, and assign it to the variable knn. Then we can fit this classifier to our training set, the labeled data. To do so, we apply the method fit to the classifier and pass it two arguments: the features as a NumPy array and the labels, or target, as a NumPy array. The scikit-learn API requires firstly that you have the data as a NumPy array or pandas DataFrame. It also requires that the features take on continuous values, such as the price of a house, as opposed to categories, such as 'male' or 'female'. It also requires that there are no missing values in the data. All datasets that we'll work with now satisfy these final two properties. Later in the course, you'll see how to deal with categorical features and missing data. In particular, the scikit-learn API requires that the features are in an array where each column is a feature and each row a different observation or data point. Looking at the shape of the iris data, we see that there are 150 observations of four features. Similarly, the target needs to be a single column with the same number of observations as the feature data. We see in this case there are indeed also 150 labels. Also check out what is returned when we fit the classifier: it returns the classifier itself and modifies it to fit it to the data. Now that we have fit our classifier, let's use it to predict on some unlabeled data!
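
The steps described above might look roughly like the following sketch. It loads the iris data from scikit-learn's built-in datasets; the variable names `X_iris`, `y_iris`, `knn_iris` and `fitted` are chosen here for illustration and are not taken from the video.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

# Load the iris data: features in columns, observations in rows
iris = datasets.load_iris()
X_iris, y_iris = iris.data, iris.target
print(X_iris.shape)  # 150 observations of four features
print(y_iris.shape)  # one label per observation

# Instantiate the classifier with 6 neighbors and fit it to the labeled data
knn_iris = KNeighborsClassifier(n_neighbors=6)
fitted = knn_iris.fit(X_iris, y_iris)

# fit() returns the classifier itself, now modified to fit the data
print(fitted is knn_iris)
```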

Predicting on unlabeled data Here we have a set of observations, X new. We use the predict method on the classifier and pass it the data. Once again, the API requires that we pass the data as a NumPy array with features in columns and observations in rows; checking the shape of X new, we see that it has three rows and four columns, that is, three observations and four features. Then we would expect calling knn dot predict of X new to return an array of three predictions, one for each observation or row in X new. And indeed it does! It predicts one, which corresponds to 'versicolor' for the first two observations and 0, which corresponds to 'setosa' for the third.
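
As a rough sketch of this step, the snippet below refits the classifier and predicts on three observations. The array `X_new_demo` is invented for this example (it is not the `X new` from the video), so the predicted labels will differ from those described above; the point is only the shape of the input and output.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
knn_iris = KNeighborsClassifier(n_neighbors=6).fit(iris.data, iris.target)

# Hypothetical new observations: three rows, four features each.
# The first row uses setosa-like measurements.
X_new_demo = np.array([[5.1, 3.5, 1.4, 0.2],
                       [6.0, 2.9, 4.5, 1.5],
                       [6.7, 3.1, 5.6, 2.4]])

pred = knn_iris.predict(X_new_demo)
print(X_new_demo.shape)         # three observations, four features
print(pred.shape)               # one prediction per row
print(iris.target_names[pred])  # map the integer labels back to species names
```

Note that `predict` returns a one-dimensional array whose length equals the number of rows passed in.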

Having explored the Congressional voting records dataset, it is time now to build your first classifier. In this exercise, you will fit a k-Nearest Neighbors classifier to the voting dataset, which has once again been pre-loaded for you into a DataFrame df.

In the video, Hugo discussed the importance of ensuring your data adheres to the format required by the scikit-learn API. The features need to be in an array where each column is a feature and each row a different observation or data point - in this case, a Congressman's voting record. The target needs to be a single column with the same number of observations as the feature data. We have done this for you in this exercise. Notice we named the feature array X and response variable y: This is in accordance with the common scikit-learn practice.

Your job is to create an instance of a k-NN classifier with 6 neighbors (by specifying the n_neighbors parameter) and then fit it to the data. The data has been pre-loaded into a DataFrame called df.

- Import KNeighborsClassifier from sklearn.neighbors.
- Create arrays X and y for the features and the target variable. Here this has been done for you. Note the use of .drop() to drop the target variable 'party' from the feature array X as well as the use of the .values attribute to ensure X and y are NumPy arrays. Without using .values, X and y are a DataFrame and Series respectively; the scikit-learn API will accept them in this form also as long as they are of the right shape.
- Instantiate a KNeighborsClassifier called knn with 6 neighbors by specifying the n_neighbors parameter.
- Fit the classifier to the data using the .fit() method.

In [21]:

```
# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier
# Create arrays for the features and the response variable
y = df['party'].values
X = df.drop('party', axis=1).values
# Create a k-NN classifier with 6 neighbors
knn = KNeighborsClassifier(n_neighbors=6)
# Fit the classifier to the data
knn.fit(X, y)
```

Out[21]:

Now that your k-NN classifier with 6 neighbors has been fit to the data, it can be used to predict the labels of new data points.

Having fit a k-NN classifier, you can now use it to predict the label of a new data point. However, there is no unlabeled data available since all of it was used to fit the model! You can still use the .predict() method on the X that was used to fit the model, but it is not a good indicator of the model's ability to generalize to new, unseen data.

In the next video, Hugo will discuss a solution to this problem. For now, a random unlabeled data point has been generated and is available to you as X_new. You will use your classifier to predict the label for this new data point, as well as on the training data X that the model has already seen. Using .predict() on X_new will generate 1 prediction, while using it on X will generate 435 predictions: 1 for each sample.

The DataFrame has been pre-loaded as df. This time, you will create the feature array X and target variable array y yourself.

- Create arrays for the features and the target variable from df. As a reminder, the target variable is 'party'.
- Instantiate a KNeighborsClassifier with 6 neighbors.
- Fit the classifier to the data.
- Predict the labels of the training data, X.
- Predict the label of the new data point X_new.

In [22]:

```
X_new = pd.DataFrame([0.696469, 0.286139, 0.226851, 0.551315, 0.719469, 0.423106, 0.980764,
0.68483, 0.480932, 0.392118, 0.343178, 0.72905, 0.438572, 0.059678,
0.398044, 0.737995]).transpose()
```

In [23]:

```
# Import KNeighborsClassifier from sklearn.neighbors
from sklearn.neighbors import KNeighborsClassifier
# Create arrays for the features and the response variable
y = df['party'].values
X = df.drop('party', axis=1).values
# Create a k-NN classifier with 6 neighbors: knn
knn = KNeighborsClassifier(n_neighbors=6)
# Fit the classifier to the data
knn.fit(X, y)
# Predict the labels for the training data X
y_pred = knn.predict(X)
# Predict and print the label for the new data point X_new
new_prediction = knn.predict(X_new)
print("Prediction: {}".format(new_prediction))
```

Did your model predict 'democrat' or 'republican'? How sure can you be of its predictions? In other words, how can you measure its performance? This is what you will learn in the next video.

Measuring model performance Now that we know how to fit a classifier and use it to predict the labels of previously unseen data, we need to figure out how to measure its performance. That is, we need a metric.

Measuring model performance In classification problems, accuracy is a commonly-used metric. The accuracy of a classifier is defined as the number of correct predictions divided by the total number of data points. This raises the question though: which data do we use to compute accuracy? What we are really interested in is how well our model will perform on new data, that is, samples that the algorithm has never seen before.
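
To make the definition concrete, here is a small sketch that computes accuracy by hand and compares it with scikit-learn's `accuracy_score`. The label arrays are made up for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Invented true labels and predictions: four of the five predictions are correct
y_true_demo = np.array([0, 1, 1, 0, 1])
y_pred_demo = np.array([0, 1, 0, 0, 1])

# Accuracy = number of correct predictions / total number of data points
manual_accuracy = np.sum(y_true_demo == y_pred_demo) / len(y_true_demo)
sklearn_accuracy = accuracy_score(y_true_demo, y_pred_demo)

print(manual_accuracy)   # 0.8
print(sklearn_accuracy)  # 0.8
```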

Measuring model performance Well, you could compute the accuracy on the data you used to fit the classifier. However, as this data was used to train it, the classifier's performance will not be indicative of how well it can generalize to unseen data. For this reason, it is common practice to split your data into two sets, a training set and a test set. You train or fit the classifier on the training set. Then you make predictions on the labeled test set and compare these predictions with the known labels. You then compute the accuracy of your predictions.

Train/test split To do this, we first import train test split from sklearn dot model selection. We then use the train test split function to randomly split our data. The first argument will be the feature data, the second the targets or labels. The test size keyword argument specifies what proportion of the original data is used for the test set. Lastly, the random state kwarg sets a seed for the random number generator that splits the data into train and test. Setting the seed with the same argument later will allow you to reproduce the exact split and your downstream results. train test split returns four arrays: the training data, the test data, the training labels, and the test labels. We unpack these into four variables: X train, X test, y train, and y test, respectively. By default, train test split splits the data into 75% training data and 25% test data, which is a good rule of thumb. We specify the size of the test set using the keyword argument test size, which we do here to set it to 30%. It is also best practice to perform your split so that the split reflects the labels on your data. That is, you want the labels to be distributed in train and test sets as they are in the original dataset. To achieve this, we use the keyword argument stratify equals y, where y is the list or array containing the labels. We then instantiate our K-nearest neighbors classifier, fit it to the training data using the fit method, make our predictions on the test data and store the results as y pred. Printing them shows that the predictions take on three values, as expected. To check out the accuracy of our model, we use the score method of the model and pass it X test and y test. See here that the accuracy of our K-nearest neighbors model is approximately 95%, which is pretty good for an out-of-the-box model!
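
A minimal sketch of this workflow on the iris data follows. The seed (`random_state=21`) and the number of neighbors (8) are assumptions made for this example, as are the variable names; the exact accuracy depends on those choices, so no particular value is claimed here.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# Hold out 30% of the data for testing, keeping class proportions the same
# in both sets via stratify
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=21, stratify=iris.target)

# Fit on the training set only, then predict on the held-out test set
knn_split = KNeighborsClassifier(n_neighbors=8)
knn_split.fit(X_tr, y_tr)
y_pred_demo = knn_split.predict(X_te)

# score() returns the accuracy on the data it is given
test_accuracy_demo = knn_split.score(X_te, y_te)

print(X_tr.shape, X_te.shape)  # 105 training rows, 45 test rows
print(np.bincount(y_te))       # stratification keeps the three classes balanced
print(test_accuracy_demo)
```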

Model complexity Recall that we recently discussed the concept of a decision boundary. Here, we visualize a decision boundary for several increasing values of K in a KNN model. Note that, as K increases, the decision boundary gets smoother and less curvy. Therefore, we consider it to be a less complex model than those with a lower K. Generally, complex models run the risk of being sensitive to noise in the specific data that you have, rather than reflecting general trends in the data. This is known as overfitting.

1 Source: Andreas Müller & Sarah Guido, Introduction to Machine Learning with Python

Model complexity and over/underfitting If you increase K even more and make the model even simpler, then the model will perform less well on both test and training sets, as indicated in this schematic figure, known as a model complexity curve.

Model complexity and over/underfitting This is called underfitting.

Model complexity and over/underfitting We can see that there is a sweet spot in the middle that gives us the best performance on the test set.

Up until now, you have been performing binary classification, since the target variable had two possible outcomes. Hugo, however, got to perform multi-class classification in the videos, where the target variable could take on three possible outcomes. Why does he get to have all the fun?! In the following exercises, you'll be working with the MNIST digits recognition dataset, which has 10 classes, the digits 0 through 9! A reduced version of the MNIST (http://yann.lecun.com/exdb/mnist/) dataset is one of scikit-learn's included datasets, and that is the one we will use in this exercise.

Each sample in this scikit-learn dataset is an 8x8 image representing a handwritten digit. Each pixel is represented by an integer in the range 0 to 16, indicating varying levels of black. Recall that scikit-learn's built-in datasets are of type Bunch, which are dictionary-like objects. Helpfully for the MNIST dataset, scikit-learn provides an 'images' key in addition to the 'data' and 'target' keys that you have seen with the Iris data. Because it is a 2D array of the images corresponding to each sample, this 'images' key is useful for visualizing the images, as you'll see in this exercise (for more on plotting 2D arrays, see Chapter 2 of DataCamp's course on Data Visualization with Python). On the other hand, the 'data' key contains the feature array - that is, the images as a flattened array of 64 pixels.

Notice that you can access the keys of these Bunch objects in two different ways: By using the . notation, as in digits.images, or the [] notation, as in digits['images'].
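
As a quick sketch of the two access styles, and of how the 'images' and 'data' keys relate, consider the following. The name `digits_demo` is used only for this illustration.

```python
import numpy as np
from sklearn import datasets

digits_demo = datasets.load_digits()

# Attribute access and key access return the same underlying array
same_object = digits_demo.images is digits_demo['images']
print(same_object)

# 'images' holds 8x8 arrays; 'data' holds the same pixels flattened to 64 values
print(digits_demo.images.shape)
print(digits_demo.data.shape)
flattened_matches = np.array_equal(digits_demo.images[0].ravel(), digits_demo.data[0])
print(flattened_matches)
```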

For more on the MNIST data, check out this exercise in Part 1 of DataCamp's Importing Data in Python course. There, the full version of the MNIST dataset is used, in which the images are 28x28. It is a famous dataset in machine learning and computer vision, and frequently used as a benchmark to evaluate the performance of a new model.

- Import datasets from sklearn and matplotlib.pyplot as plt.
- Load the digits dataset using the .load_digits() method on datasets.
- Print the keys and DESCR of digits.
- Print the shape of images and data keys using the . notation.
- Display the 1011th image using plt.imshow(). This has been done for you, so hit 'Submit Answer' to see which handwritten digit this happens to be!

In [24]:

```
# Import necessary modules
from sklearn import datasets
import matplotlib.pyplot as plt
# Load the digits dataset: digits
digits = datasets.load_digits()
# Print the keys and DESCR of the dataset
print(digits.keys())
print(digits['DESCR'])
# Print the shape of the images and data keys
print(digits.images.shape)
print(digits.data.shape)
# Display digit 1010
plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()
```

It looks like the image in question corresponds to the digit '5'. Now, can you build a classifier that can make this prediction not only for this image, but for all the other ones in the dataset? You'll do so in the next exercise!

In [25]:

```
# Import necessary modules
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
# Create feature and target arrays
X = digits.data
y = digits.target
# Split into training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42, stratify=y)
# Create a k-NN classifier with 7 neighbors: knn
knn = KNeighborsClassifier(n_neighbors = 7)
# Fit the classifier to the training data
knn.fit(X_train, y_train)
# Print the accuracy
print(knn.score(X_test, y_test))
```

Incredibly, this out of the box k-NN classifier with 7 neighbors has learned from the training data and predicted the labels of the images in the test set with 98% accuracy, and it did so in less than a second! This is one illustration of how incredibly useful machine learning techniques can be.

Remember the model complexity curve that Hugo showed in the video? You will now construct such a curve for the digits dataset! In this exercise, you will compute and plot the training and testing accuracy scores for a variety of different neighbor values. By observing how the accuracy scores differ for the training and testing sets with different values of k, you will develop your intuition for overfitting and underfitting.

The training and testing sets are available to you in the workspace as X_train, X_test, y_train, y_test. In addition, KNeighborsClassifier has been imported from sklearn.neighbors.

- Inside the for loop:
  - Set up a k-NN classifier with the number of neighbors equal to k.
  - Fit the classifier with k neighbors to the training data.
  - Compute accuracy scores on the training set and test set separately using the .score() method and assign the results to the train_accuracy and test_accuracy arrays respectively.

In [26]:

```
import numpy as np
# Setup arrays to store train and test accuracies
neighbors = np.arange(1, 9)
train_accuracy = np.empty(len(neighbors))
test_accuracy = np.empty(len(neighbors))
# Loop over different values of k
for i, k in enumerate(neighbors):
    # Setup a k-NN Classifier with k neighbors: knn
    knn = KNeighborsClassifier(n_neighbors=k)
    # Fit the classifier to the training data
    knn.fit(X_train, y_train)
    # Compute accuracy on the training set
    train_accuracy[i] = knn.score(X_train, y_train)
    # Compute accuracy on the testing set
    test_accuracy[i] = knn.score(X_test, y_test)
# Generate plot
plt.title('k-NN: Varying Number of Neighbors')
plt.plot(neighbors, test_accuracy, label = 'Testing Accuracy')
plt.plot(neighbors, train_accuracy, label = 'Training Accuracy')
plt.legend()
plt.xlabel('Number of Neighbors')
plt.ylabel('Accuracy')
plt.show()
```

It looks like the test accuracy is highest when using 3 and 5 neighbors. Using 8 neighbors or more seems to result in a simple model that underfits the data. Now that you've grasped the fundamentals of classification, you will learn about regression in the next chapter!

Introduction to regression Congrats on making it through that introduction to supervised learning and classification. Now, we're going to check out the other type of supervised learning problem: regression. In regression tasks, the target value is a continuously varying variable, such as a country's GDP or the price of a house.

Boston housing data Our first regression task will be using the Boston housing dataset! Let's check out the data. First, we load it from a comma-separated values file, also known as a csv file, using pandas' read csv function. See the DataCamp course on importing data for more information on file formats and loading your data. Note that you can also load this data from scikit-learn's built-in datasets. We then view the head of the data frame using the head method. The documentation tells us the feature 'CRIM' is per capita crime rate, 'NX' is nitric oxides concentration, and 'RM' average number of rooms per dwelling, for example. The target variable, 'MEDV', is the median value of owner occupied homes in thousands of dollars.

Creating feature and target arrays Now, given data as such, recall that scikit-learn wants 'features' and 'target' values in distinct arrays, X and y. Thus, we split our DataFrame: in the first line here, we drop the target; in the second, we keep only the target. Using the values attributes returns the NumPy arrays that we will use.

Predicting house value from a single feature As a first task, let's try to predict the price from a single feature: the average number of rooms in a block. To do this, we slice out the number of rooms column of the DataFrame X, which is the fifth column into the variable X rooms. Checking the type of X rooms and y, we see that both are NumPy arrays. To turn them into NumPy arrays of the desired shape, we apply the reshape method to keep the first dimension, but add another dimension of size one to X.

Plotting house value vs. number of rooms Now, let's plot house value as a function of number of rooms using matplotlib's plt dot scatter. We'll also label our axes using x label and y label.

Plotting house value vs. number of rooms We can immediately see that, as one might expect, more rooms lead to higher prices.

Fitting a regression model It's time to fit a regression model to our data. We're going to use a model called linear regression, which we'll explain in the next video. But first, I'm going to show you how to fit it and to plot its predictions. We import numpy as np, linear model from sklearn, and instantiate LinearRegression as regr. We then fit the regression to the data using regr dot fit and passing in the data, the number of rooms, and the target variable, the house price, as we did with the classification problems. After this, we want to check out the regressor's predictions over the range of the data. We can achieve that by using np linspace between the maximum and minimum number of rooms and make a prediction for this data.

Fitting a regression model Plotting this line with the scatter plot results in the figure you see here.
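
The single-feature workflow described in this video (reshape, fit, predict over a range) can be sketched as follows. Because the Boston housing loader is no longer shipped with recent scikit-learn releases, this sketch uses synthetic data as a stand-in: the `rooms` and `price` arrays, their coefficients, and all variable names are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for "number of rooms" and "house price" (not real data)
rng = np.random.default_rng(0)
rooms = rng.uniform(4, 9, size=100)
price = 9.0 * rooms - 34.0 + rng.normal(0, 2.0, size=100)

# scikit-learn expects a 2D feature array: keep the first dimension,
# add a second dimension of size one
X_rooms = rooms.reshape(-1, 1)
y_price = price.reshape(-1, 1)

# Fit the linear regression
regr_demo = LinearRegression()
regr_demo.fit(X_rooms, y_price)

# Predict over an evenly spaced range between the smallest and largest room count
prediction_space = np.linspace(X_rooms.min(), X_rooms.max()).reshape(-1, 1)
line = regr_demo.predict(prediction_space)

print(X_rooms.shape, y_price.shape)
print(line.shape)  # one prediction per point in the prediction space
```

Plotting `prediction_space` against `line` on top of a scatter plot of the data would give a figure like the one described above.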

Andy introduced regression to you using the Boston housing dataset. But regression models can be used in a variety of contexts to solve a variety of different problems.

Given below are four example applications of machine learning. Your job is to pick the one that is best framed as a regression problem.

Possible Answers

- An e-commerce company using labeled customer data to predict whether or not a customer will purchase a particular item.
- A healthcare company using data about cancer tumors (such as their geometric measurements) to predict whether a new tumor is benign or malignant.
- A restaurant using review data to ascribe positive or negative sentiment to a given review.
- **A bike share company using time and weather data to predict the number of bikes being rented at any given hour.**

The target variable here - the number of bike rentals at any given hour - is quantitative, so this is best framed as a regression problem.

In this chapter, you will work with Gapminder data that we have consolidated into one CSV file available in the workspace as 'gapminder.csv'. Specifically, your goal will be to use this data to predict the life expectancy in a given country based on features such as the country's GDP, fertility rate, and population. As in Chapter 1, the dataset has been preprocessed.

Since the target variable here is quantitative, this is a regression problem. To begin, you will fit a linear regression with just one feature: 'fertility', which is the average number of children a woman in a given country gives birth to. In later exercises, you will use all the features to build regression models.

Before that, however, you need to import the data and get it into the form needed by scikit-learn. This involves creating feature and target variable arrays. Furthermore, since you are going to use only one feature to begin with, you need to do some reshaping using NumPy's .reshape() method. Don't worry too much about this reshaping right now, but it is something you will have to do occasionally when working with scikit-learn so it is useful to practice.

- Import numpy and pandas as their standard aliases.
- Read the file 'gapminder.csv' into a DataFrame df using the read_csv() function.
- Create array X for the 'fertility' feature and array y for the 'life' target variable.
- Reshape the arrays by using the .reshape() method and passing in -1 and 1.

In [27]:

```
# Import numpy and pandas
import numpy as np
import pandas as pd
# Read the CSV file into a DataFrame: df
df = pd.read_csv('./datasets/gapminder.csv')
# Create arrays for features and target variable
y = df['life'].values
X = df['fertility'].values
# Print the dimensions of X and y before reshaping
print("Dimensions of y before reshaping: {}".format(y.shape))
print("Dimensions of X before reshaping: {}".format(X.shape))
# Reshape X and y
y = y.reshape(-1,1)
X = X.reshape(-1,1)
# Print the dimensions of X and y after reshaping
print("Dimensions of y after reshaping: {}".format(y.shape))
print("Dimensions of X after reshaping: {}".format(X.shape))
```

In [28]:

```
import seaborn as sns
sns.heatmap(df.corr(numeric_only=True), square=True, cmap='RdYlGn')  # numeric_only skips non-numeric columns such as 'Region'
```

Out[28]:

Notice the differences in shape before and after applying the .reshape() method. Getting the feature and target variable arrays into the right format for scikit-learn is an important precursor to model building.

As always, it is important to explore your data before building models. On the right, we have constructed a heatmap showing the correlation between the different features of the Gapminder dataset, which has been pre-loaded into a DataFrame as df and is available for exploration in the IPython Shell. Cells that are in green show positive correlation, while cells that are in red show negative correlation. Take a moment to explore this: Which features are positively correlated with life, and which ones are negatively correlated? Does this match your intuition?

Then, in the IPython Shell, explore the DataFrame using pandas methods such as .info(), .describe(), .head().

In case you are curious, the heatmap was generated using Seaborn's heatmap function and the following line of code, where df.corr() computes the pairwise correlation between columns:

sns.heatmap(df.corr(), square=True, cmap='RdYlGn')

Once you have a feel for the data, consider the statements below and select the one that is not true. After this, Hugo will explain the mechanics of linear regression in the next video and you will be on your way building regression models!

Possible Answers

- The DataFrame has 139 samples (or rows) and 9 columns.
- life and fertility are negatively correlated.
- The mean of life is 69.602878.
- **fertility is of type int64.**
- GDP and life are positively correlated.

As seen by using df.info(), fertility, along with all the other columns, is of type float64, not int64.

OLS = Ordinary Least Squares: minimizes the sum of squared residuals
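
To illustrate what "minimizes the sum of squared residuals" means, the sketch below fits a line to a few made-up points with NumPy's least-squares solver and checks that nudging the fitted slope or intercept only increases the sum of squares. The data and the helper `sse` are invented for this example.

```python
import numpy as np

# Small invented dataset that lies roughly on a line
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Design matrix with a column of ones for the intercept
A = np.column_stack([x_demo, np.ones_like(x_demo)])

# Least-squares solution for slope and intercept
(slope, intercept), *_ = np.linalg.lstsq(A, y_demo, rcond=None)

def sse(a, b):
    """Sum of squared residuals for the line y = a * x + b."""
    return float(np.sum((y_demo - (a * x_demo + b)) ** 2))

best_sse = sse(slope, intercept)
print(slope, intercept, best_sse)

# Any other line gives a larger (or equal) sum of squared residuals
print(sse(slope + 0.1, intercept) >= best_sse)
print(sse(slope, intercept - 0.1) >= best_sse)
```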

Now, you will fit a linear regression and predict life expectancy using just one feature. You saw Andy do this earlier using the 'RM' feature of the Boston housing dataset. In this exercise, you will use the 'fertility' feature of the Gapminder dataset. Since the goal is to predict life expectancy, the target variable here is 'life'. The array for the target variable has been pre-loaded as y and the array for 'fertility' has been pre-loaded as X_fertility.

A scatter plot with 'fertility' on the x-axis and 'life' on the y-axis has been generated. As you can see, there is a strongly negative correlation, so a linear regression should be able to capture this trend. Your job is to fit a linear regression and then predict the life expectancy, overlaying these predicted values on the plot to generate a regression line. You will also compute and print the score using scikit-learn's .score() method.

- Import LinearRegression from sklearn.linear_model.
- Create a LinearRegression regressor called reg.
- Set up the prediction space to range from the minimum to the maximum of X_fertility. This has been done for you.
- Fit the regressor to the data (X_fertility and y) and compute its predictions using the .predict() method and the prediction_space array.
- Compute and print the score using the .score() method.
- Overlay the plot with your linear regression line. This has been done for you, so hit 'Submit Answer' to see the result!

In [32]:

```
from sklearn.linear_model import LinearRegression
import seaborn as sns  # needed for the scatter plot below
X_fertility = df.fertility.to_numpy().reshape(-1, 1)
y = df.life.to_numpy().reshape(-1, 1)
# Create the regressor: reg
reg = LinearRegression()
# Create the prediction space
prediction_space = np.linspace(df.fertility.min(), df.fertility.max()).reshape(-1,1)
# Fit the model to the data
reg.fit(X_fertility, y)
# Compute predictions over the prediction space: y_pred
y_pred = reg.predict(prediction_space)
# Print R^2
score = reg.score(X_fertility, y)
print(f'Score: {score}')
# Plot regression line
sns.scatterplot(data=df, x='fertility', y='life')
plt.xlabel('Fertility')
plt.ylabel('Life Expectancy')
plt.plot(prediction_space, y_pred, color='black', linewidth=3)
plt.show()
```

Notice how the line captures the underlying trend in the data. And the performance is quite decent for this basic regression model with only one feature.

As you learned in Chapter 1, train and test sets are vital to ensure that your supervised learning model is able to generalize well to new data. This was true for classification models, and is equally true for linear regression models.

In this exercise, you will split the Gapminder dataset into training and testing sets, and then fit and predict a linear regression over all features. In addition to computing the R2 score, you will also compute the Root Mean Squared Error (RMSE), which is another commonly used metric to evaluate regression models. The feature array X and target variable array y have been pre-loaded for you from the DataFrame df.

- Import LinearRegression from sklearn.linear_model, mean_squared_error from sklearn.metrics, and train_test_split from sklearn.model_selection.
- Using X and y, create training and test sets such that 30% is used for testing and 70% for training. Use a random state of 42.
- Create a linear regression regressor called reg_all, fit it to the training set, and evaluate it on the test set.
- Compute and print the R2 score using the .score() method on the test set.
- Compute and print the RMSE. To do this, first compute the Mean Squared Error using the mean_squared_error() function with the arguments y_test and y_pred, and then take its square root using np.sqrt().

In [33]:

```
# Import necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
X = df.drop(['life', 'Region'], axis=1).to_numpy()
y = df.life.to_numpy()
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)
# Create the regressor: reg_all
reg_all = LinearRegression()
# Fit the regressor to the training data
reg_all.fit(X_train, y_train)
# Predict on the test data: y_pred
y_pred = reg_all.predict(X_test)
# Compute and print R^2 and RMSE
print(f"R^2: {reg_all.score(X_test, y_test):0.3f}")
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print(f"Root Mean Squared Error: {rmse:0.3f}")
```

Using all features has improved the model score. This makes sense, as the model has more information to learn from. However, there is one potential pitfall to this process. Can you spot it? You'll learn about this as well as how to better validate your models in the next video!

- You're now also becoming more acquainted with train test split, and computing model performance metrics on the test set.
- Can you spot a potential pitfall of this process?
- If you're computing R2 on your test set, the R2 returned is dependent on the way the data is split.

- The data points in the test set may have some peculiarities that mean the R2 computed on it is not representative of the model's ability to generalize to unseen data.

To combat this dependence on what is essentially an arbitrary split, we use a technique called cross-validation.

- Begin by splitting the dataset into five groups, or folds.

- Hold out the first fold as a test set, fit the model on the remaining 4 folds, predict on the test set, and compute the metric of interest.

- Next, hold out the second fold as the test set, fit on the remaining data, predict on the test set, and compute the metric of interest.

- Then, similarly, with the third, fourth and fifth fold.

- As a result, there are five values of R2 from which statistics of interest can be computed, such as mean, median, and 95% confidence interval.

- As the dataset is split into 5 folds, this process is called 5-fold cross validation.
- 10 folds would be 10-fold cross validation.

- Generally, if k folds are used, it is called k-fold cross validation or k-fold CV.
- The trade-off is, more folds are computationally more expensive, because there is more fitting and predicting.
- This method avoids the problem of the metric of choice being dependent on the train test split.
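
The fold-by-fold procedure above can be written out by hand, as in the sketch below. It uses synthetic regression data from `make_regression` (an assumption made here so the example is self-contained) and compares the manual loop with scikit-learn's `cross_val_score` helper, which for a regressor and an integer `cv` uses the same unshuffled folds.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for a real dataset
X_demo, y_demo = make_regression(n_samples=100, n_features=3,
                                 noise=10.0, random_state=0)

# Split into five folds; each fold serves as the test set exactly once
kf = KFold(n_splits=5)
manual_scores = []
for train_idx, test_idx in kf.split(X_demo):
    model = LinearRegression()
    # Fit on the four remaining folds
    model.fit(X_demo[train_idx], y_demo[train_idx])
    # Compute R^2 on the held-out fold
    manual_scores.append(model.score(X_demo[test_idx], y_demo[test_idx]))
manual_scores = np.array(manual_scores)

# The helper function performs the same loop in one call
auto_scores = cross_val_score(LinearRegression(), X_demo, y_demo, cv=5)

print(manual_scores)
print(np.mean(manual_scores), np.median(manual_scores))
```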

- sklearn.model_selection.cross_val_score
- This returns an array of cross-validation scores, which are assigned to cv_results
- The length of the array is the number of folds specified by the cv parameter.
- The reported score is R2 , the default score for linear regression
- We can also compute the mean

In [34]:

```
# Import the necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# Create a linear regression object: reg
reg = LinearRegression()
# Compute 5-fold cross-validation scores: cv_scores
cv_scores = cross_val_score(reg, X, y, cv=5)
# Print the 5-fold cross-validation scores
print(cv_scores)
print("Average 5-Fold CV Score: {}".format(np.mean(cv_scores)))
```

Now that you have cross-validated your model, you can more confidently evaluate its predictions.

Cross validation is essential but do not forget that the more folds you use, the more computationally expensive cross-validation becomes. In this exercise, you will explore this for yourself. Your job is to perform 3-fold cross-validation and then 10-fold cross-validation on the Gapminder dataset.

In the IPython Shell, you can use %timeit to see how long 3-fold CV takes compared to 10-fold CV by executing the following with cv=3 and then cv=10:

%timeit cross_val_score(reg, X, y, cv = *__*)

pandas and numpy are available in the workspace as pd and np. The DataFrame has been loaded as df and the feature/target variable arrays X and y have been created.

- Import LinearRegression from sklearn.linear_model and cross_val_score from sklearn.model_selection.
- Create a linear regression regressor called reg.
- Perform 3-fold CV and then 10-fold CV. Compare the resulting mean scores.

In [35]:

```
# Import necessary modules
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
# Create a linear regression object: reg
reg = LinearRegression()
# Perform 3-fold CV
cvscores_3 = cross_val_score(reg, X, y, cv=3)
print(np.mean(cvscores_3))
# Perform 10-fold CV
cvscores_10 = cross_val_score(reg, X, y, cv=10)
print(np.mean(cvscores_10))
```

Did you use %timeit in the IPython Shell to see how much longer it takes 10-fold cross-validation to run compared to 3-fold cross-validation?

In [36]:

```
cv3 = %timeit -n10 -r3 -q -o cross_val_score(reg, X, y, cv=3)
cv10 = %timeit -n10 -r3 -q -o cross_val_score(reg, X, y, cv=10)
print(f'cv=3 time: {cv3}\ncv=10 time: {cv10}')
```

- Why regularize?
- Recall: Linear regression minimizes a loss function

- It chooses a coefficient for each feature variable

- Large coefficients can lead to overfitting

- Penalizing large coefficients : Regularization

- Ridge regression
- $$ \text{Loss function} = \text{OLS loss function} + \alpha \sum_{i=1}^{n} a_i^2 $$

- Alpha : Parameter we need to choose (a hyperparameter, also written $\lambda$)

- Picking alpha is similar to picking k in k-NN

- Alpha controls model complexity

- Alpha = 0: get back OLS (Can lead to overfitting)

- Very high alpha: Can lead to underfitting

- Lasso regression
- $$ \text{Loss function} = \text{OLS loss function} + \alpha \sum_{i=1}^{n} |a_i| $$

- Can be used to select important features of a dataset

- Shrinks the coefficients of less important features to exactly 0

The first type of regularized regression that we'll look at is called ridge regression, in which our loss function is the standard OLS loss function plus the sum of the squared values of the coefficients, multiplied by some constant α: $$ \text{Loss function} = \text{OLS loss function} + \alpha \sum_{i=1}^{n} a_i^2 $$ Thus, when minimizing the loss function to fit our data, models are penalized for coefficients with a large magnitude, whether large positive or large negative. Note that α is a parameter we need to choose in order to fit and predict. Essentially, we can select the α for which our model performs best. Picking α for ridge regression is similar to picking k in k-NN. This is called hyperparameter tuning, and we'll see much more of this in section 3. This α, which you may also see called λ in the wild, can be thought of as a parameter that controls model complexity. Notice that when α = 0 we get back OLS, which can lead to overfitting: large coefficients are not penalized, and the overfitting problem is not accounted for. A very high α means large coefficients are heavily penalized, which can lead to a model that is too simple and ends up underfitting the data. Performing ridge regression with scikit-learn mirrors the other models we have seen.

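sklearn.linear_model.Ridge: Set α with the alpha parameter. Setting the normalize parameter to True ensures all the variables are on the same scale, which will be covered later in more depth. Note: the normalize parameter was deprecated in scikit-learn 1.0 and removed in 1.2, so the cells that pass normalize=True need an older release; on newer versions, scale the features in a pipeline (for example with StandardScaler) instead.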

In [37]:

```
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.model_selection import train_test_split
boston = pd.read_csv('./datasets/boston.csv')
# split the data
X_train, X_test, y_train, y_test = train_test_split(boston.drop('MEDV', axis=1), boston.MEDV, test_size=0.3, random_state=42)
# instantiate the model
ridge = Ridge(alpha=0.1, normalize=True)
# fit the model
ridge.fit(X_train, y_train)
# predict on the test data
ridge_pred = ridge.predict(X_test)
# get the score
rs = ridge.score(X_test, y_test)
print(f'Ridge Score: {round(rs, 4)}')
```
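To make the effect of α concrete, here is a small sketch on synthetic data (an assumption; it does not use the Boston file). It scales the features with `StandardScaler` inside a pipeline, which is a commonly used substitute for `normalize=True` on recent scikit-learn releases but is not numerically identical to it. The names `X_demo`, `y_demo`, and `coef_norms` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem (assumption: stands in for a real dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

alphas = [0.01, 1.0, 100.0, 10000.0]
coef_norms = []
for alpha in alphas:
    # Scale the features, then fit ridge with the current penalty strength
    pipe = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    pipe.fit(X_demo, y_demo)
    # Size of the fitted coefficient vector (Euclidean norm)
    coef_norms.append(np.linalg.norm(pipe.named_steps['ridge'].coef_))

for alpha, norm in zip(alphas, coef_norms):
    print(f'alpha={alpha:>8}: coefficient norm = {norm:.3f}')
```

The coefficient norm shrinks as α grows, which is the penalty at work: a very small α behaves almost like OLS, while a very large α pushes every coefficient toward zero.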

There is another type of regularized regression called lasso regression, in which our loss function is the standard OLS loss function plus the sum of the absolute values of the coefficients, multiplied by some constant α: $$ \text{Loss function} = \text{OLS loss function} + \alpha \sum_{i=1}^{n} |a_i| $$

sklearn.linear_model.Lasso: Lasso regression in scikit-learn mirrors ridge regression.

Lasso regression for feature selection

One of the useful aspects of lasso regression is that it can be used to select important features of a dataset. This is because it tends to shrink the coefficients of less important features to exactly zero. The features whose coefficients are not shrunk to zero are 'selected' by the lasso algorithm. Plotting the coefficients as a function of feature name yields the graph below, and you can see directly that the most important predictor for our target variable, housing price, is the number of rooms, 'RM'. This is not surprising, and is a great sanity check. This type of feature selection is very important for machine learning in an industry or business setting, because it allows you, as the data scientist, to communicate important results to non-technical colleagues. The power of reporting important features from a linear model cannot be overestimated. It is also valuable in research science for identifying which factors are important predictors of various physical phenomena.

In [38]:

```
# split the data
X_train, X_test, y_train, y_test = train_test_split(boston.drop('MEDV', axis=1), boston.MEDV, test_size=0.3, random_state=42)
# instantiate the regressor
lasso = Lasso(alpha=0.1, normalize=True)
# fit the model
lasso.fit(X_train, y_train)
# predict on the test data
lasso_pred = lasso.predict(X_test)
# get the score
ls = lasso.score(X_test, y_test)
print(f'Lasso Score: {round(ls, 4)}')
```

In [39]:

```
# store the feature names
names = boston.drop('MEDV', axis=1).columns
# instantiate the regressor
lasso = Lasso(alpha=0.1)
# extract and store the coef attribute
lasso_coef = lasso.fit(boston.drop('MEDV', axis=1), boston.MEDV).coef_
plt.plot(range(len(names)), lasso_coef)
plt.xticks(range(len(names)), names, rotation=60)
plt.ylabel('Coefficients')
plt.grid()
plt.show()
```
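The "exactly zero" behaviour is easy to check on data where the answer is known. The following is a minimal sketch under stated assumptions: the data are synthetic (`make_regression` with 10 features, of which only 3 are informative), the features are standardized first, and `alpha=5.0` is an arbitrary illustrative choice. The names `X_scaled`, `lasso_demo`, and `n_zero` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

# 10 features, only 3 of which actually drive the target (synthetic, for illustration)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=10, n_informative=3, noise=5.0, random_state=0
)

# Put all features on the same scale before applying the L1 penalty
X_scaled = StandardScaler().fit_transform(X_demo)

# Fit lasso and count how many coefficients were shrunk to exactly zero
lasso_demo = Lasso(alpha=5.0).fit(X_scaled, y_demo)
n_zero = int(np.sum(lasso_demo.coef_ == 0))

print(np.round(lasso_demo.coef_, 2))
print(f'{n_zero} of {lasso_demo.coef_.size} coefficients are exactly zero')
```

The features with non-zero coefficients are the ones lasso has "selected"; on real data the split is rarely this clean, so the result is worth reading alongside domain knowledge.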

In the video, you saw how Lasso selected out the 'RM' feature as being the most important for predicting Boston house prices, while shrinking the coefficients of certain other features to 0. Its ability to perform feature selection in this way becomes even more useful when you are dealing with data involving thousands of features.

In this exercise, you will fit a lasso regression to the Gapminder data you have been working with and plot the coefficients. Just as with the Boston data, you will find that the coefficients of some features are shrunk to 0, with only the most important ones remaining.

The feature and target variable arrays have been pre-loaded as X and y.

- Import Lasso from sklearn.linear_model.
- Instantiate a Lasso regressor with an alpha of 0.4 and specify normalize=True.
- Fit the regressor to the data and compute the coefficients using the coef_ attribute.
- Plot the coefficients on the y-axis and column names on the x-axis. This has been done for you, so hit 'Submit Answer' to view the plot!

In [40]:

```
# Import numpy and pandas
import numpy as np
import pandas as pd
# Read the CSV file into a DataFrame: df
df = pd.read_csv('./datasets/gm_2008_region.csv')
# Create arrays for features and target variable
X = df.drop(['life','Region'], axis=1)
y = df['life'].values.reshape(-1,1)
# Import Lasso
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Instantiate a lasso regressor: lasso
#lasso = Lasso(alpha=0.4, normalize=True)
pipe = Pipeline(steps=[
('scaler',StandardScaler()),
('lasso',Lasso(alpha=0.4, normalize=True))
])
# Fit the regressor to the data
#lasso.fit(X, y)
pipe.fit(X, y) # apply scaling on training data
# Compute and print the coefficients
lasso_coef = pipe.named_steps['lasso'].coef_
print(lasso_coef)
# Store the feature names for the x-axis labels
df_columns = df.drop(['life', 'Region'], axis=1).columns
# Plot the coefficients
plt.plot(range(len(df_columns)), lasso_coef)
plt.xticks(range(len(df_columns)), df_columns.values, rotation=60)
plt.margins(0.02)
plt.show()
```

According to the lasso algorithm, it seems like 'child_mortality' is the most important feature when predicting life expectancy.

Lasso is great for feature selection, but when building regression models, Ridge regression should be your first choice.

Recall that lasso performs regularization by adding to the loss function a penalty term of the absolute value of each coefficient multiplied by some alpha. This is also known as $L1$ regularization because the regularization term is the $L1$ norm of the coefficients. This is not the only way to regularize, however.

If instead you took the sum of the squared values of the coefficients multiplied by some alpha - like in Ridge regression - you would be computing the $L2$ norm. In this exercise, you will practice fitting ridge regression models over a range of different alphas, and plot cross-validated scores for each, using this function that we have defined for you, which plots the score as well as standard error for each alpha:

```
def display_plot(cv_scores, cv_scores_std):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.plot(alpha_space, cv_scores)
    std_error = cv_scores_std / np.sqrt(10)
    ax.fill_between(alpha_space, cv_scores + std_error, cv_scores - std_error, alpha=0.2)
    ax.set_ylabel('CV Score +/- Std Error')
    ax.set_xlabel('Alpha')
    ax.axhline(np.max(cv_scores), linestyle='--', color='.5')
    ax.set_xlim([alpha_space[0], alpha_space[-1]])
    ax.set_xscale('log')
    plt.show()
```

Don't worry about the specifics of how the above function works. The motivation behind this exercise is for you to see how the score varies with different alphas, and to understand the importance of selecting the right value for alpha. You'll learn how to tune alpha in the next chapter.

- Instantiate a Ridge regressor and specify normalize=True.
- Inside the for loop:
- Specify the alpha value for the regressor to use.
- Perform 10-fold cross-validation on the regressor with the specified alpha. The data is available in the arrays X and y.
- Append the average and the standard deviation of the computed cross-validated scores. NumPy has been pre-imported for you as np.
- Use the display_plot() function to visualize the scores and standard deviations.

In [41]:

```
def display_plot(cv_scores, cv_scores_std, alpha_space):
    fig = plt.figure(figsize=(10, 5))
    ax = fig.add_subplot(1,1,1)
    ax.plot(alpha_space, cv_scores, label='CV Scores')
    std_error = cv_scores_std / np.sqrt(10)
    ax.fill_between(alpha_space, cv_scores + std_error, cv_scores - std_error, color='red', alpha=0.2, label='CV Score ± std error')
    ax.set_ylabel('CV Score +/- Std Error')
    ax.set_xlabel('Alpha')
    ax.axhline(np.max(cv_scores), linestyle='--', color='.5', label='Max CV Score')
    ax.set_xlim([alpha_space[0], alpha_space[-1]])
    ax.set_xscale('log')
    plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
    plt.show()
```

In [42]:

```
# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
# Setup the array of alphas and lists to store scores
alpha_space = np.logspace(-4, 0, 50)
ridge_scores = []
ridge_scores_std = []
# Create a ridge regressor: ridge
ridge = Ridge(normalize=True)
# Compute scores over range of alphas
for alpha in alpha_space:
    # Specify the alpha value to use: ridge.alpha
    ridge.alpha = alpha
    # Perform 10-fold CV: ridge_cv_scores
    ridge_cv_scores = cross_val_score(ridge, X, y, cv=10)
    # Append the mean of ridge_cv_scores to ridge_scores
    ridge_scores.append(np.mean(ridge_cv_scores))
    # Append the std of ridge_cv_scores to ridge_scores_std
    ridge_scores_std.append(np.std(ridge_cv_scores))
# Display the plot
display_plot(ridge_scores, ridge_scores_std, alpha_space)
```

Notice how the cross-validation scores change with different alphas. Which alpha should you pick? How can you fine-tune your model? You'll learn all about this in the next chapter!

Class imbalance: Imbalanced classes occur when there is a disproportionate ratio of observations in each class.

Having trained your model, your next task is to evaluate its performance. In this chapter, you will learn about some of the other metrics available in scikit-learn that will allow you to assess your model's performance in a more nuanced manner. Next, learn to optimize your classification and regression models using hyperparameter tuning.

- Classification metrics
- Measuring model performance with accuracy:
- Fraction of correctly classified samples
- Not always a useful metric

- Class imbalance example: Emails
- Spam classification
- 99% of emails are real; 1% of emails are spam

- Could build a classifier that predicts ALL emails as real
- 99% accurate overall!
- But horrible at actually classifying spam
- Fails at its original purpose

Diagnosing classification predictions

- Confusion matrix (image from https://medium.com/@cmukesh8688/evaluation-machine-learning-by-confusion-matrix-a4196051cf8d )
Accuracy: $$ \dfrac{tp + tn}{tp + tn + fp + fn} $$

Precision (Positive Predictive Value): $$ \dfrac{tp}{tp + fp}$$

Recall (Sensitivity, hit rate, True Positive Rate): $$ \dfrac{tp}{tp + fn}$$

F1 score: Harmonic mean of precision and recall $$ 2 \cdot \dfrac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} $$

- High precision : Not many real emails predicted as spam

- High recall : Predicted most spam emails correctly
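The formulas above can be checked by hand on a tiny example. The labels below are made up for illustration (hypothetical `y_true` and `y_hat` arrays with 1 as the positive class); the sketch computes each metric from the confusion-matrix counts and compares it with the corresponding `sklearn.metrics` function.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical true labels and predictions (1 = positive class, e.g. spam)
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f'tn={tn}, fp={fp}, fn={fn}, tp={tp}')
print(f'accuracy={accuracy:.3f}, precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}')

# The hand-computed values should agree with scikit-learn's implementations
print(np.isclose(precision, precision_score(y_true, y_hat)))
print(np.isclose(recall, recall_score(y_true, y_hat)))
print(np.isclose(f1, f1_score(y_true, y_hat)))
```

Here 3 of the 5 positive predictions are correct (precision 0.6) and 3 of the 4 actual positives are found (recall 0.75), so the two metrics answer different questions about the same predictions.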

In Chapter 1, you evaluated the performance of your k-NN classifier based on its accuracy. However, as Andy discussed, accuracy is not always an informative metric. In this exercise, you will dive more deeply into evaluating the performance of binary classifiers by computing a confusion matrix and generating a classification report.

You may have noticed in the video that the classification report consisted of three rows, and an additional support column. The support gives the number of samples of the true response that lie in that class - so in the video example, the support was the number of Republicans or Democrats in the test set on which the classification report was computed. The precision, recall, and f1-score columns, then, gave the respective metrics for that particular class.

Here, you'll work with the PIMA Indians dataset obtained from the UCI Machine Learning Repository. The goal is to predict whether or not a given female patient will contract diabetes based on features such as BMI, age, and number of pregnancies. Therefore, it is a binary classification problem. A target value of 0 indicates that the patient does not have diabetes, while a value of 1 indicates that the patient does have diabetes. As in Chapters 1 and 2, the dataset has been preprocessed to deal with missing values.

The dataset has been loaded into a DataFrame df and the feature and target variable arrays X and y have been created for you. In addition, sklearn.model_selection.train_test_split and sklearn.neighbors.KNeighborsClassifier have already been imported.

Your job is to train a k-NN classifier to the data and evaluate its performance by generating a confusion matrix and classification report.

- Import classification_report and confusion_matrix from sklearn.metrics.
- Create training and testing sets with 40% of the data used for testing. Use a random state of 42.
- Instantiate a k-NN classifier with 6 neighbors, fit it to the training data, and predict the labels of the test set.
- Compute and print the confusion matrix and classification report using the confusion_matrix() and classification_report() functions.

In [43]:

```
import pandas as pd
df = pd.read_csv('./datasets/diabetes.csv')
df.tail(5)
```

Out[43]:

In [44]:

```
X = df.iloc[:, :-1]
y = df.iloc[:, -1]
```

In [45]:

```
# Import necessary modules
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
# Create training and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
# Instantiate a k-NN classifier: knn
knn = KNeighborsClassifier(n_neighbors=6)
# Fit the classifier to the training data
knn.fit(X_train, y_train)
# Predict the labels of the test data: y_pred
y_pred = knn.predict(X_test)
# Generate the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

- Logistic regression for binary classification
- Logistic regression outputs probabilities
- If the probability is greater than 0.5:
- The data is labeled '1'

- If the probability is less than 0.5:
- The data is labeled '0'

- Probability thresholds
- By default, logistic regression threshold = 0.5
- Not specific to logistic regression
- k-NN classifiers also have thresholds

- ROC curves (Receiver Operating Characteristic curve)
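As a rough illustration of probability thresholds, the sketch below fits a logistic regression on synthetic data (an assumption; the exercises use the diabetes data) and turns the predicted probabilities into labels at two thresholds. The names `proba`, `default_pred`, and `strict_pred`, and the threshold 0.8, are illustrative choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (assumption: for illustration only)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# Probability of the positive class (label 1) for every sample
proba = clf.predict_proba(X_demo)[:, 1]

# Label as 1 when the probability exceeds the chosen threshold
default_pred = (proba > 0.5).astype(int)
strict_pred = (proba > 0.8).astype(int)

print('Positives at threshold 0.5:', default_pred.sum())
print('Positives at threshold 0.8:', strict_pred.sum())
print('Threshold 0.5 matches .predict():', np.array_equal(default_pred, clf.predict(X_demo)))
```

Raising the threshold can only reduce the number of predicted positives. An ROC curve summarizes this trade-off by sweeping the threshold over all possible values.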

Time to build your first logistic regression model! As Hugo showed in the video, scikit-learn makes it very easy to try different models, since the Train-Test-Split/Instantiate/Fit/Predict paradigm applies to all classifiers and regressors - which are known in scikit-learn as 'estimators'. You'll see this now for yourself as you train a logistic regression model on exactly the same data as in the previous exercise. Will it outperform k-NN? There's only one way to find out!

In [47]:

```
from sklearn.linear_model import LogisticRegression
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
# Create the classifier: logreg
logreg = LogisticRegression(max_iter=1000)
# Fit the classifier to the training data
logreg.fit(X_train, y_train)
# Predict the labels of the test set: y_pred
y_pred = logreg.predict(X_test)
# Compute and print the confusion matrix and classification report
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```

Great job in the previous exercise - you now have a new addition to your toolbox of classifiers!

Classification reports and confusion matrices are great methods to quantitatively evaluate model performance, while ROC curves provide a way to visually evaluate models. As Hugo demonstrated in the video, most classifiers in scikit-learn have a `.predict_proba()` method which returns the probability of a given sample being in a particular class. Having built a logistic regression model, you'll now evaluate its performance by plotting an ROC curve. In doing so, you'll make use of the `.predict_proba()` method and become familiar with its functionality.

In [48]:

```
from sklearn.metrics import roc_curve
# Compute predicted probabilities: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
# Generate ROC curve values: fpr, tpr, thresholds
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
```

Out[48]:

When looking at your ROC curve, you may have noticed that the y-axis (True positive rate) is also known as recall. Indeed, in addition to the ROC curve, there are other ways to visually evaluate model performance. One such way is the precision-recall curve, which is generated by plotting the precision and recall for different thresholds. As a reminder, precision and recall are defined as: $$ \text{Precision} = \dfrac{TP}{TP + FP} \\ \text{Recall} = \dfrac{TP}{TP + FN}$$ Study the precision-recall curve. Note that here, the class is positive (1) if the individual has diabetes.

In [50]:

```
from sklearn.metrics import precision_recall_curve
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)
plt.plot(recall, precision)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision / Recall plot')
```

Out[50]:

- Larger area under the ROC curve = better model

Say you have a binary classifier that in fact is just randomly making guesses. It would be correct approximately 50% of the time, and the resulting ROC curve would be a diagonal line in which the True Positive Rate and False Positive Rate are always equal. The Area under this ROC curve would be 0.5. This is one way in which the AUC, which Hugo discussed in the video, is an informative metric to evaluate a model. If the AUC is greater than 0.5, the model is better than random guessing. Always a good sign!
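The 0.5 baseline can be checked with a quick simulation. This is a sketch under stated assumptions: the labels and scores are random numbers from a seeded NumPy generator rather than real model output, and the names `random_scores`, `auc_random`, and `auc_from_curve` are hypothetical.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

rng = np.random.default_rng(0)

# Random binary labels and scores that carry no information about them
y_true = rng.integers(0, 2, size=5000)
random_scores = rng.random(5000)

# AUC computed directly from labels and scores
auc_random = roc_auc_score(y_true, random_scores)

# The same quantity as the area under the (fpr, tpr) curve
fpr, tpr, _ = roc_curve(y_true, random_scores)
auc_from_curve = auc(fpr, tpr)

print(f'AUC of random scores: {auc_random:.3f}')
print('Same as area under the ROC curve:', np.isclose(auc_random, auc_from_curve))
```

The value lands close to 0.5 rather than exactly on it, because it is estimated from a finite sample.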

In this exercise, you'll calculate AUC scores using the `roc_auc_score()` function from `sklearn.metrics` as well as by performing cross-validation on the diabetes dataset.

In [51]:

```
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score
# Compute predicted probabilites: y_pred_prob
y_pred_prob = logreg.predict_proba(X_test)[:, 1]
# Compute and print AUC score
print("AUC: {}".format(roc_auc_score(y_test, y_pred_prob)))
# Compute cross-validated AUC scores: cv_auc
cv_auc = cross_val_score(logreg, X, y, cv=5, scoring='roc_auc')
# Print list of AUC scores
print("AUC scores computed using 5-fold cross-validation: {}".format(cv_auc))
```

- Linear regression: Choosing parameters
- Ridge/Lasso regression: Choosing alpha
- k-Nearest Neighbors: Choosing n_neighbors
- Hyperparameters: Parameters like alpha and k
- Hyperparameters cannot be learned by fitting the model
- Choosing the correct hyperparameter
- Try a bunch of different hyperparameter values
- Fit all of them separately
- See how well each performs
- Choose the best performing one
- It is essential to use cross-validation

- Grid search cross-validation
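The steps listed above (try several values, fit each, compare cross-validated scores, keep the best) can be sketched as a plain loop. The example assumes synthetic data and a k-NN classifier with a hypothetical candidate list for `n_neighbors`, and then runs `GridSearchCV` over the same candidates for comparison.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data (assumption: stands in for a real dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=5, random_state=0)

# Candidate hyperparameter values to try
candidates = [1, 3, 5, 7, 9]

# Manual search: mean 5-fold CV accuracy for each candidate
mean_scores = {}
for k in candidates:
    knn = KNeighborsClassifier(n_neighbors=k)
    mean_scores[k] = cross_val_score(knn, X_demo, y_demo, cv=5).mean()
best_k = max(mean_scores, key=mean_scores.get)
print('Manual search best k:', best_k, 'score:', round(mean_scores[best_k], 4))

# GridSearchCV automates the same loop over the same grid
grid = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': candidates}, cv=5)
grid.fit(X_demo, y_demo)
print('GridSearchCV best params:', grid.best_params_, 'score:', round(grid.best_score_, 4))
```

With more than one hyperparameter the manual version needs nested loops, which is the bookkeeping `GridSearchCV` takes care of.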

Like the alpha parameter of lasso and ridge regularization that you saw earlier, logistic regression also has a regularization parameter: $C$. $C$ controls the inverse of the regularization strength, and this is what you will tune in this exercise. A large $C$ can lead to an overfit model, while a small $C$ can lead to an underfit model.

The hyperparameter space for $C$ has been set up for you. Your job is to use GridSearchCV and logistic regression to find the optimal $C$ in this hyperparameter space.

You may be wondering why you aren't asked to split the data into training and test sets. Good observation! Here, we want you to focus on the process of setting up the hyperparameter grid and performing grid-search cross-validation. In practice, you will indeed want to hold out a portion of your data for evaluation purposes, and you will learn all about this in the next video!

In [52]:

```
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# Setup the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C':c_space}
# Instantiate a logistic regression classifier: logreg
logreg = LogisticRegression(max_iter=1000)
# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
# Fit it to the data
logreg_cv.fit(X, y)
# Print the tuned parameters and score
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_))
print("Best score is {}".format(logreg_cv.best_score_))
```

GridSearchCV can be computationally expensive, especially if you are searching over a large hyperparameter space and dealing with multiple hyperparameters. A solution to this is to use `RandomizedSearchCV`, in which not all hyperparameter values are tried out. Instead, a fixed number of hyperparameter settings is sampled from specified probability distributions. You'll practice using `RandomizedSearchCV` in this exercise and see how this works.

Here, you'll also be introduced to a new model: the Decision Tree. Don't worry about the specifics of how this model works. Just like k-NN, linear regression, and logistic regression, decision trees in scikit-learn have `.fit()` and `.predict()` methods that you can use in exactly the same way as before. Decision trees have many parameters that can be tuned, such as `max_features`, `max_depth`, and `min_samples_leaf`. This makes it an ideal use case for `RandomizedSearchCV`.

In [53]:

```
from scipy.stats import randint
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RandomizedSearchCV
# Setup the parameters and distributions to sample from: param_dist
param_dist = {
"max_depth": [3, None],
"max_features": randint(1, 9),
"min_samples_leaf": randint(1, 9),
"criterion": ["gini", "entropy"],
}
# Instantiate a Decision Tree classifier: tree
tree = DecisionTreeClassifier()
# Instantiate the RandomizedSearchCV object: tree_cv
tree_cv = RandomizedSearchCV(tree, param_dist, cv=5)
# Fit it to the data
tree_cv.fit(X, y)
# Print the tuned parameters and score
print("Tuned Decision Tree Parameters: {}".format(tree_cv.best_params_))
print("Best score is {}".format(tree_cv.best_score_))
```

- How well can the model perform on never before seen data?
- Using ALL data for cross-validation is not ideal
- Split data into training and hold-out set at the beginning
- Perform grid search cross-validation on training set
- Choose best hyperparameters and evaluate on hold-out set

You will now practice evaluating a model with tuned hyperparameters on a hold-out set. The feature array and target variable array from the diabetes dataset have been pre-loaded as `X` and `y`.

In addition to $C$, logistic regression has a `'penalty'` hyperparameter which specifies whether to use `'l1'` or `'l2'` regularization. Your job in this exercise is to create a hold-out set and tune the `'C'` and `'penalty'` hyperparameters of a logistic regression classifier using `GridSearchCV` on the training set.

In [54]:

```
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
# Create the hyperparameter grid
c_space = np.logspace(-5, 8, 15)
param_grid = {'C': c_space, 'penalty': ['l1', 'l2']}
# Instantiate the logistic regression classifier: logreg
logreg = LogisticRegression(max_iter=1000, solver='liblinear')
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
# Instantiate the GridSearchCV object: logreg_cv
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
# Fit it to the training data
logreg_cv.fit(X_train, y_train)
# Print the optimal parameters and best score
print("Tuned Logistic Regression Parameter: {}".format(logreg_cv.best_params_))
print("Tuned Logistic Regression Accuracy: {}".format(logreg_cv.best_score_))
```

Remember lasso and ridge regression from the previous chapter? Lasso used the $L1$ penalty to regularize, while ridge used the $L2$ penalty. There is another type of regularized regression known as the elastic net. In elastic net regularization, the penalty term is a linear combination of the $L1$ and $L2$ penalties: $$ a * L1 + b * L2 $$

In scikit-learn, this term is represented by the `'l1_ratio'` parameter: an `'l1_ratio'` of 1 corresponds to an $L1$ penalty, and anything lower is a combination of $L1$ and $L2$.
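For reference, here is how the weights $a$ and $b$ map onto scikit-learn's parameters, as a hedged restatement of the objective given in the `ElasticNet` documentation. The symbols $w$ (coefficient vector), $n$ (number of samples), and $\rho$ (the `l1_ratio`) are notation introduced here and are not used elsewhere in these notes.

$$ \min_w \; \frac{1}{2n} \lVert y - Xw \rVert_2^2 + \alpha \rho \lVert w \rVert_1 + \frac{\alpha (1 - \rho)}{2} \lVert w \rVert_2^2 $$

Comparing with $a * L1 + b * L2$ gives $a = \alpha \rho$ and $b = \alpha (1 - \rho) / 2$, with the $L2$ term taken as the squared norm. Setting $\rho = 1$ removes the $L2$ term and leaves a pure lasso penalty, while $\rho = 0$ leaves only the ridge-style term.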

In this exercise, you will use `GridSearchCV` to tune the `'l1_ratio'` of an elastic net model trained on the Gapminder data. As in the previous exercise, use a hold-out set to evaluate your model's performance.

In [56]:

```
df = pd.read_csv('./datasets/gm_2008_region.csv')
df.drop(labels=['Region'], axis='columns', inplace=True)
df.head()
```

Out[56]:

In [57]:

```
X = df.drop('life', axis='columns').values
y = df['life'].values
```

In [58]:

```
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
# Create the hyperparameter grid
l1_space = np.linspace(0, 1, 30)
param_grid = {'l1_ratio': l1_space}
# Instantiate the ElasticNet regressor: elastic_net
elastic_net = ElasticNet(max_iter=100000, tol=0.001)
# Setup the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(elastic_net, param_grid, cv=5)
# Fit it to the training data
gm_cv.fit(X_train, y_train)
# Predict on the test set and compute metrics
y_pred = gm_cv.predict(X_test)
r2 = gm_cv.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)
print("Tuned ElasticNet l1 ratio: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
print("Tuned ElasticNet MSE: {}".format(mse))
```

In [61]:

```
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```

- Dealing with categorical features
- Scikit-learn will not accept categorical features by default
- Need to encode categorical features numerically
- Convert to 'dummy variables'
- 0 : Observation was NOT in that category
- 1 : Observation was in that category
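A tiny example shows what dummy variables look like. The DataFrame below is made up for illustration (the `toy` name and its values are hypothetical, though the region labels mirror the Gapminder ones), and `dtype=int` is passed so the dummies print as 0 and 1 regardless of pandas version.

```python
import pandas as pd

# Hypothetical three-row dataset with one categorical column
toy = pd.DataFrame({
    'life': [75.3, 58.1, 80.2],
    'Region': ['America', 'Sub-Saharan Africa', 'Europe & Central Asia'],
})

# One 0/1 column per category
dummies = pd.get_dummies(toy, dtype=int)
print(dummies)

# drop_first=True removes the first category, which is implied by the others
dummies_dropped = pd.get_dummies(toy, drop_first=True, dtype=int)
print(dummies_dropped.columns.tolist())
```

Dropping one column avoids duplicated information: if a row is 0 in every remaining region column, it must belong to the dropped region.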

The Gapminder dataset that you worked with in previous chapters also contained a categorical `'Region'` feature, which we dropped in previous exercises since you did not have the tools to deal with it. Now however, you do, so we have added it back in!

Your job in this exercise is to explore this feature. Boxplots are particularly useful for visualizing categorical features such as this.

In [65]:

```
df = pd.read_csv('./datasets/gm_2008_region.csv')
df.head()
# Create a boxplot of life expectancy per region
df.boxplot('life', 'Region', rot=60)
plt.savefig('./datasets/gm-boxplot.png', dpi=100)
```

Scikit-learn does not accept non-numerical features. You saw in the previous exercise that the `'Region'` feature contains very useful information that can predict life expectancy. For example, Sub-Saharan Africa has a lower life expectancy compared to Europe and Central Asia. Therefore, if you are trying to predict life expectancy, it would be preferable to retain the `'Region'` feature. To do this, you need to binarize it by creating dummy variables, which is what you will do in this exercise.

In [66]:

```
df.columns
# Create dummy variables: df_region
df_region = pd.get_dummies(df)
# Print the columns of df_region
print(df_region.columns)
# Create dummy variables with drop_first=True: df_region
df_region = pd.get_dummies(df, drop_first=True)
# Print the new columns of df_region
print(df_region.columns)
```

In [67]:

```
X = df_region.drop('life', axis='columns')
y = df_region['life']
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
# Instantiate a ridge regressor: ridge
ridge = Ridge(alpha=0.5, normalize=True)
# Perform 5-fold cross-validation: ridge_cv
ridge_cv = cross_val_score(ridge, X, y, cv=5)
# Print the cross-validated scores
print(ridge_cv)
```

- Dropping missing data
- Dropping rows can remove a large share of the data, so we need a more robust method.

- Imputing missing data
- Making an educated guess about the missing values
- Example : Using the mean of the non-missing entries
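Mean imputation is simple enough to verify by hand. The sketch below uses a made-up 3x2 array (the names `X_missing`, `imp_mean`, and `X_filled` are hypothetical) and `SimpleImputer` from `sklearn.impute` with `strategy='mean'`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical feature matrix with one missing value per column
X_missing = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

# Replace each NaN with the mean of the non-missing entries in its column
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
X_filled = imp_mean.fit_transform(X_missing)

print(X_filled)
# Column means learned during fit: (1 + 3) / 2 = 2 and (10 + 20) / 2 = 15
print(imp_mean.statistics_)
```

Unlike dropping rows, imputation keeps all three observations, at the cost of replacing the unknown values with an educated guess.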

You will see that there are certain data points labeled with a `'?'`. These denote missing values. As you saw in the video, different datasets encode missing values in different ways. Sometimes it may be a `'9999'`, other times a `0`. Real-world data can be very messy! If you're lucky, the missing values will already be encoded as `NaN`. We use `NaN` because it is an efficient and simplified way of internally representing missing data, and it lets us take advantage of pandas methods such as `.dropna()` and `.fillna()`, as well as scikit-learn's Imputation transformer `Imputer()`.

In this exercise, your job is to convert the `'?'`s to NaNs, and then drop the rows that contain them from the DataFrame.

In [70]:

```
df = pd.read_csv('./datasets/house-votes-84.data', header=None)
df.columns = ['party', 'infants', 'water', 'budget', 'physician', 'salvador',
'religious', 'satellite', 'aid', 'missile', 'immigration', 'synfuels',
'education', 'superfund', 'crime', 'duty_free_exports', 'eaa_rsa']
df.replace({'n':0, 'y':1}, inplace=True)
df.head()
# Convert '?' to NaN
df[df == '?'] = np.nan
# Print the number of NaNs
print(df.isnull().sum())
# Print shape of original DataFrame
print('Shape of Original DataFrame: {}'.format(df.shape))
# Drop missing values and print shape of new DataFrame
df = df.dropna()
# Print shape of new DataFrame
print("Shape of DataFrame After Dropping All Rows with Missing Values: {}".format(df.shape))
```

As you've come to appreciate, there are many steps to building a model, from creating training and test sets, to fitting a classifier or regressor, to tuning its parameters, to evaluating its performance on new data. Imputation can be seen as the first step of this machine learning process, the entirety of which can be viewed within the context of a pipeline. Scikit-learn provides a pipeline constructor that allows you to piece together these steps into one process and thereby simplify your workflow.

You'll now practice setting up a pipeline with two steps: the imputation step, followed by the instantiation of a classifier. You've seen three classifiers in this course so far: k-NN, logistic regression, and the decision tree. You will now be introduced to a fourth one - the Support Vector Machine, or SVM. For now, do not worry about how it works under the hood. It works exactly as you would expect of the scikit-learn estimators that you have worked with previously, in that it has the same `.fit()` and `.predict()` methods as before.

Note: `sklearn.preprocessing.Imputer` was replaced by `sklearn.impute.SimpleImputer` in scikit-learn 0.20.

In [71]:

```
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC
# Setup the Imputation transformer: imp
imp = SimpleImputer(missing_values=np.nan, strategy='most_frequent')
# Instantiate the SVC classifier: clf
clf = SVC()
# Setup the pipeline with the required steps: steps
steps = [('imputation', imp),
('SVM', clf)]
```

Having set up the steps of the pipeline in the previous exercise, you will now use it on the voting dataset to classify a Congressman's party affiliation. What makes pipelines so incredibly useful is the simple interface that they provide. You can use the `.fit()` and `.predict()` methods on pipelines just as you did with your classifiers and regressors!

In [72]:

```
X = df.drop('party', axis='columns')
y = df['party']
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
# Setup the pipeline steps: steps
steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
('SVM', SVC())]
# Create the pipeline: pipeline
pipeline = Pipeline(steps)
# Create training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# Fit the pipeline to the train set
pipeline.fit(X_train, y_train)
# Predict the labels of the test set
y_pred = pipeline.predict(X_test)
# Compute metrics
print(classification_report(y_test, y_pred))
```

- Why scale your data?
  - Many models use some form of distance to inform them
  - Features on larger scales can unduly influence the model
  - Example: k-NN uses distance explicitly when making predictions
  - Want features to be on a similar scale
  - Normalizing (or scaling and centering)

- Ways to normalize your data
  - Standardization: subtract the mean and divide by the standard deviation
    - All features are centered around zero and have variance one
  - Can also subtract the minimum and divide by the range
    - Minimum zero and maximum one
  - Can also normalize so the data ranges from -1 to +1
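The first two approaches can be sketched by hand and compared against scikit-learn's `StandardScaler` and `MinMaxScaler`. The matrix `X_demo` is a made-up example with two features on very different scales; standardization here divides by the standard deviation, which is what yields unit variance.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Hypothetical feature matrix: two features on very different scales
X_demo = np.array([[0.99, 100.0],
                   [1.00, 200.0],
                   [1.01, 400.0]])

# Standardization by hand: subtract the column mean,
# then divide by the column standard deviation
X_std = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Min-max scaling by hand: subtract the column minimum,
# then divide by the column range
col_range = X_demo.max(axis=0) - X_demo.min(axis=0)
X_minmax = (X_demo - X_demo.min(axis=0)) / col_range

# The scikit-learn transformers should agree with the manual versions
print(np.allclose(X_std, StandardScaler().fit_transform(X_demo)))
print(np.allclose(X_minmax, MinMaxScaler().fit_transform(X_demo)))
```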

You will now explore scaling for yourself on a new dataset - White Wine Quality! Hugo used the Red Wine Quality dataset in the video. We have used the `'quality'` feature of the wine to create a binary target variable: If `'quality'` is less than 5, the target variable is 1, and otherwise, it is 0.

Notice how some features seem to have different units of measurement. `'density'`, for instance, takes values between 0.98 and 1.04, while `'total sulfur dioxide'` ranges from 9 to 440. As a result, it may be worth scaling the features here. Your job in this exercise is to scale the features and compute the mean and standard deviation of the unscaled features compared to the scaled features.

In [83]:

```
# Fit the pipeline to the training set: knn_scaled
knn_scaled = pipeline.fit(X_train, y_train)
# Instantiate and fit a k-NN classifier to the unscaled data
knn_unscaled = KNeighborsClassifier().fit(X_train, y_train)
# Compute and print metrics
print('Accuracy with Scaling: {}'.format(knn_scaled.score(X_test, y_test)))
print('Accuracy without Scaling: {}'.format(knn_unscaled.score(X_test, y_test)))
df = pd.read_csv('./datasets/white_wine.csv', sep=';')
df['quality'] = df['quality'] < 5
X = df.drop('quality', axis='columns').values
y = df['quality'].values
df.head()
```

Out[83]:

In [ ]:

```
# Import scale
from sklearn.preprocessing import scale
import numpy as np
# Scale the features: X_scaled
X_scaled = scale(X)
# Print the mean and standard deviation of the unscaled features
print("Mean of Unscaled Features: {}".format(np.mean(X)))
print("Standard Deviation of Unscaled Features: {}".format(np.std(X)))
# Print the mean and standard deviation of the scaled features
print("Mean of Scaled Features: {}".format(np.mean(X_scaled)))
print("Standard Deviation of Scaled Features: {}".format(np.std(X_scaled)))
```

With regard to whether or not scaling is effective, the proof is in the pudding! See for yourself whether or not scaling the features of the White Wine Quality dataset has any impact on its performance. You will use a k-NN classifier as part of a pipeline that includes scaling, and for the purposes of comparison, a k-NN classifier trained on the unscaled data has been provided.

In [133]:

```
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
# Setup the pipeline steps: steps
steps = [('scaler', StandardScaler()),
('knn', KNeighborsClassifier())]
# Create the pipeline: pipeline
pipeline = Pipeline(steps)
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
```

It is time now to piece together everything you have learned so far into a pipeline for classification! Your job in this exercise is to build a pipeline that includes scaling and hyperparameter tuning to classify wine quality.

You'll return to using the SVM classifier you were briefly introduced to earlier in this chapter. The hyperparameters you will tune are `C` and `gamma`. `C` controls the regularization strength. It is analogous to the `C` you tuned for logistic regression in Chapter 3, while `gamma` controls the kernel coefficient: Do not worry about this now as it is beyond the scope of this course.
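Inside a pipeline, a step's hyperparameters are addressed as `<step name>__<parameter>` (two underscores), which is the form a grid search expects. A minimal sketch, where `demo_pipeline` is a name introduced only for illustration:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

demo_pipeline = Pipeline([('scaler', StandardScaler()),
                          ('SVM', SVC())])

# Parameter names that belong to the step called 'SVM'
svm_params = [name for name in demo_pipeline.get_params()
              if name.startswith('SVM__')]
print('SVM__C' in svm_params, 'SVM__gamma' in svm_params)

# Set the step's hyperparameters through the pipeline
demo_pipeline.set_params(SVM__C=10, SVM__gamma=0.01)
print(demo_pipeline.named_steps['SVM'].C)
```

The prefix must match the step name given in the `steps` list, so renaming the step means renaming the grid keys too.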

In [134]:

```
from sklearn.metrics import classification_report, accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Setup the pipeline
steps = [('scaler', StandardScaler()),
('SVM', SVC())]
pipeline = Pipeline(steps)
# Specify the hyperparameter space
parameters = {'SVM__C':[1, 10, 100],
'SVM__gamma':[0.1, 0.01]}
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=21)
# Instantiate the GridSearchCV object: cv
cv = GridSearchCV(pipeline, param_grid=parameters, cv=3)
# Fit to the training set
cv.fit(X_train, y_train)
# Predict the labels of the test set: y_pred
y_pred = cv.predict(X_test)
# Compute and print metrics
print("Accuracy: {}".format(cv.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
print("Tuned Model Parameters: {}".format(cv.best_params_))
```

For this final exercise, you will return to the Gapminder dataset. Guess what? Even this dataset has missing values that we dealt with for you in earlier chapters! Now, you have all the tools to take care of them yourself!

Your job is to build a pipeline that imputes the missing data, scales the features, and fits an ElasticNet to the Gapminder data. You will then tune the `l1_ratio` of your ElasticNet using `GridSearchCV`.

In [135]:

```
df = pd.read_csv('./datasets/gm_2008_region.csv')
df.head()
df.drop(['Region'], axis='columns', inplace=True)
X = df.drop('life', axis='columns').values
y = df['life'].values
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
# Setup the pipeline steps: steps
steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
('scaler', StandardScaler()),
('elasticnet', ElasticNet(tol=0.6))]
# Create the pipeline: pipeline
pipeline = Pipeline(steps)
# Specify the hyperparameter space
parameters = {'elasticnet__l1_ratio': np.linspace(0, 1, 30)}
# Create train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=42)
# Create the GridSearchCV object: gm_cv
gm_cv = GridSearchCV(pipeline, param_grid=parameters, cv=3)
# Fit to the training set
gm_cv.fit(X_train, y_train)
# Compute and print the metrics
r2 = gm_cv.score(X_test, y_test)
print("Tuned ElasticNet l1 ratio: {}".format(gm_cv.best_params_))
print("Tuned ElasticNet R squared: {}".format(r2))
```

These are personal notes on the DataCamp course 'Statistical Thinking in Python (Part 2)'. If you're interested, consider doing a course at https://www.datacamp.com

After completing Statistical Thinking in Python (Part 1), you have the probabilistic mindset and foundational hacker stats skills to dive into data sets and extract useful information from them. In this course, you will do just that, expanding and honing your hacker stats toolbox to perform the two key tasks in statistical inference: parameter estimation and hypothesis testing. You will work with real data sets as you learn, culminating with analysis of measurements of the beaks of Darwin's famous finches. You will emerge from this course with new knowledge and lots of practice under your belt, ready to attack your own inference problems out in the world.

When doing statistical inference, we speak the language of probability. A probability distribution that describes your data has parameters. So, a major goal of statistical inference is to estimate the values of these parameters, which allows us to concisely and unambiguously describe our data and draw conclusions from it. In this chapter, you will learn how to find the optimal parameters, those that best describe your data.

The number of games played between each no-hitter in the modern era (1901-2015) of Major League Baseball is stored in the array nohitter_times.

If you assume that no-hitters are described as a Poisson process, then the time between no-hitters is Exponentially distributed. As you have seen, the Exponential distribution has a single parameter, which we will call τ, the typical interval time. The value of the parameter τ that makes the exponential distribution best match the data is the mean interval time (where time is in units of number of games) between no-hitters.
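One way to check numerically that the mean is the best-matching τ is to evaluate the Exponential log-likelihood on a grid of candidate values. The sketch below uses synthetic data; `sample`, `tau_grid`, and `exp_log_likelihood` are names introduced here for illustration, and "best match" is interpreted as maximum likelihood, which is an assumption about what the course means.

```python
import numpy as np

# Hypothetical inter-event times drawn from an Exponential distribution
rng = np.random.default_rng(42)
sample = rng.exponential(scale=750.0, size=500)

def exp_log_likelihood(tau, data):
    """Log-likelihood of Exponential data with mean parameter tau."""
    return -len(data) * np.log(tau) - np.sum(data) / tau

# Evaluate the log-likelihood on a grid of candidate tau values
tau_grid = np.linspace(0.5 * sample.mean(), 2.0 * sample.mean(), 301)
log_lik = np.array([exp_log_likelihood(t, sample) for t in tau_grid])

# The grid point with the highest log-likelihood should sit at the sample mean
best_tau = tau_grid[np.argmax(log_lik)]
print(best_tau, sample.mean())
```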

Compute the value of this parameter from the data. Then, use np.random.exponential() to "repeat" the history of Major League Baseball by drawing inter-no-hitter times from an exponential distribution with the τ you found and plot the histogram as an approximation to the PDF.

NumPy, pandas, matplotlib.pyplot, and seaborn have been imported for you as np, pd, plt, and sns, respectively.

- Seed the random number generator with 42.
- Compute the mean time (in units of number of games) between no-hitters.
- Draw 100,000 samples from an Exponential distribution with the parameter you computed from the mean of the inter-no-hitter times.
- Plot the theoretical PDF using plt.hist(). Remember to use keyword arguments bins=50, normed=True (density=True in newer matplotlib releases), and histtype='step'. Be sure to label your axes.
- Show your plot.

In [1]:

```
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
nohitter_times = np.array([843,1613,1101,215,684,814,278,324,161,219,545,715,966,624
,29,450,107,20,91,1325,124,1468,104,1309,429,62,1878,1104
,123,251,93,188,983,166,96,702,23,524,26,299,59,39
,12,2,308,1114,813,887,645,2088,42,2090,11,886,1665,1084
,2900,2432,750,4021,1070,1765,1322,26,548,1525,77,2181,2752,127
,2147,211,41,1575,151,479,697,557,2267,542,392,73,603,233
,255,528,397,1529,1023,1194,462,583,37,943,996,480,1497,717
,224,219,1531,498,44,288,267,600,52,269,1086,386,176,2199
,216,54,675,1243,463,650,171,327,110,774,509,8,197,136
,12,1124,64,380,811,232,192,731,715,226,605,539,1491,323
,240,179,702,156,82,1397,354,778,603,1001,385,986,203,149
,576,445,180,1403,252,675,1351,2983,1568,45,899,3260,1025,31
,100,2055,4043,79,238,3931,2351,595,110,215,0,563,206,660
,242,577,179,157,192,192,1848,792,1693,55,388,225,1134,1172
,1555,31,1582,1044,378,1687,2915,280,765,2819,511,1521,745,2491
,580,2072,6450,578,745,1075,1103,1549,1520,138,1202,296,277,351
,391,950,459,62,1056,1128,139,420,87,71,814,603,1349,162
,1027,783,326,101,876,381,905,156,419,239,119,129,467])
```

In [2]:

```
# Seed random number generator
np.random.seed(42)
# Compute mean no-hitter time: tau
tau = np.mean(nohitter_times)
# Draw out of an exponential distribution with parameter tau: inter_nohitter_time
inter_nohitter_time = np.random.exponential(tau, 100000)
# Plot the PDF and label axes
# Note: density=True replaces the older normed=True prescribed in the course,
# since newer matplotlib releases only accept the 'density' parameter
_ = plt.hist(inter_nohitter_time,
             bins=50, density=True, histtype='step')
_ = plt.xlabel('Games between no-hitters')
_ = plt.ylabel('PDF')
# Show the plot
plt.show()
```

We see the typical shape of the Exponential distribution, going from a maximum at 0 and decaying to the right.

You have modeled no-hitters using an Exponential distribution. Create an ECDF of the real data. Overlay the theoretical CDF with the ECDF from the data. This helps you to verify that the Exponential distribution describes the observed data.

It may be helpful to remind yourself of the function you created in the previous course to compute the ECDF, as well as the code you wrote to plot it.

- Compute an ECDF from the actual time between no-hitters (nohitter_times). Use the ecdf() function you wrote in the prequel course.
- Create a CDF from the theoretical samples you took in the last exercise (inter_nohitter_time).
- Plot x_theor and y_theor as a line using plt.plot(). Then overlay the ECDF of the real data x and y as points. To do this, you have to specify the keyword arguments marker = '.' and linestyle = 'none' in addition to x and y inside plt.plot().
- Set a 2% margin on the plot.
- Show the plot.

In [3]:

```
def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    # Number of data points: n
    n = len(data)
    # x-data for the ECDF: x
    x = np.sort(data)
    # y-data for the ECDF: y
    y = np.arange(1, n+1) / n
    return x, y
```

In [4]:

```
# Create an ECDF from real data: x, y
x, y = ecdf(nohitter_times)
# Create a CDF from theoretical samples: x_theor, y_theor
x_theor, y_theor = ecdf(inter_nohitter_time)
# Overlay the plots
plt.plot(x_theor, y_theor)
plt.plot(x, y, marker='.', linestyle='none')
# Margins and axis labels
plt.margins(0.02)
plt.xlabel('Games between no-hitters')
plt.ylabel('CDF')
# Show the plot
plt.show()
```
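The cell above approximates the theoretical CDF by sampling. As an alternative sketch, the Exponential CDF is also available in closed form, 1 - exp(-x/τ), so the sampled curve can be checked against it. The value `tau_demo` is a made-up stand-in for the mean computed from the data.

```python
import numpy as np

def ecdf(data):
    """Compute ECDF for a one-dimensional array of measurements."""
    n = len(data)
    return np.sort(data), np.arange(1, n + 1) / n

# Hypothetical mean interval, standing in for the tau computed from the data
tau_demo = 760.0

rng = np.random.default_rng(42)
samples = rng.exponential(tau_demo, size=100_000)

# Sample-based CDF versus the analytic CDF at the same x values
x_s, y_s = ecdf(samples)
cdf_exact = 1 - np.exp(-x_s / tau_demo)

# Largest vertical gap between the two curves
max_gap = np.max(np.abs(y_s - cdf_exact))
print(max_gap)
```

With this many samples the gap should be small, which is why the sampled curve is a reasonable substitute for the exact one in the plot.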

It looks like no-hitters in the modern era of Major League Baseball are Exponentially distributed. Based on the story of the Exponential distribution, this suggests that they are a random process; when a no-hitter will happen is independent of when the last no-hitter was.

Now sample out of an exponential distribution with τ being twice as large as the optimal τ. Do it again for τ half as large. Make CDFs of these samples and overlay them with your data. You can see that they do not reproduce the data as well. Thus, the τ you computed from the mean inter-no-hitter times is optimal in that it best reproduces the data.

Note: In this and all subsequent exercises, the random number generator is pre-seeded for you to save you some typing.

- Take 10000 samples out of an Exponential distribution with parameter $\tau_{1/2}$ = tau/2.
- Take 10000 samples out of an Exponential distribution with parameter $\tau_2$ = 2*tau.
- Generate CDFs from these two sets of samples using your ecdf() function.
- Add these two CDFs as lines to your plot. This has been done for you, so hit 'Submit Answer' to view the plot!

In [5]:

```
# Plot the theoretical CDFs
plt.plot(x_theor, y_theor, label='Theoretical CDF')
plt.plot(x, y, marker='.', linestyle='none', label='Games between nohitters')
plt.margins(0.02)
plt.xlabel('Games between no-hitters')
plt.ylabel('CDF')
# Take samples with half tau: samples_half
samples_half = np.random.exponential(tau/2, size=10000)
# Take samples with double tau: samples_double
samples_double = np.random.exponential(2*tau, size=10000)
# Generate CDFs from these samples
x_half, y_half = ecdf(samples_half)
x_double, y_double = ecdf(samples_double)
# Plot these CDFs as lines
_ = plt.plot(x_half, y_half, color='green', label='Half tau')
_ = plt.plot(x_double, y_double, color='red', label='Double tau')
_ = plt.legend(loc='lower right', shadow=True, fontsize='x-large')
# Show the plot
plt.show()
```

Great work! Notice how the value of tau given by the mean matches the data best. In this way, tau is an optimal parameter.

In the next few exercises, we will look at the correlation between female literacy and fertility (defined as the average number of children born per woman) throughout the world. For ease of analysis and interpretation, we will work with the illiteracy rate.

It is always a good idea to do some EDA ahead of our analysis. To this end, plot the fertility versus illiteracy and compute the Pearson correlation coefficient. The NumPy array illiteracy has the illiteracy rate among females for most of the world's nations. The array fertility has the corresponding fertility data.

Here, it may be useful to refer back to the function you wrote in the previous course to compute the Pearson correlation coefficient.

- Plot fertility (y-axis) versus illiteracy (x-axis) as a scatter plot.
- Set a 2% margin.
- Compute and print the Pearson correlation coefficient between illiteracy and fertility.

In [6]:

```
def pearson_r(x, y):
    """Compute Pearson correlation coefficient between two arrays."""
    # Compute correlation matrix: corr_mat
    corr_mat = np.corrcoef(x, y)
    # Return entry [0,1]
    return corr_mat[0,1]
```
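For intuition about what `np.corrcoef()` returns, the Pearson coefficient can also be computed directly as the covariance divided by the product of the standard deviations. A minimal sketch on made-up arrays (`x_demo` and `y_demo` are illustrative, not the literacy data):

```python
import numpy as np

# Hypothetical paired measurements
x_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_demo = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Covariance: mean of the products of deviations from the means
cov_xy = np.mean((x_demo - x_demo.mean()) * (y_demo - y_demo.mean()))

# Pearson r: covariance scaled by both standard deviations
r_manual = cov_xy / (np.std(x_demo) * np.std(y_demo))

# Should match the off-diagonal entry of the correlation matrix
r_numpy = np.corrcoef(x_demo, y_demo)[0, 1]
print(np.isclose(r_manual, r_numpy))
```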

In [7]:

```
illiteracy = np.array([9.5,49.2,1.,11.2,9.8,60.,50.2,51.2,0.6,1.,8.5,6.1,9.8,1.
,42.2,77.2,18.7,22.8,8.5,43.9,1.,1.,1.5,10.8,11.9,3.4,0.4,3.1
,6.6,33.7,40.4,2.3,17.2,0.7,36.1,1.,33.2,55.9,30.8,87.4,15.4,54.6
,5.1,1.1,10.2,19.8,0.,40.7,57.2,59.9,3.1,55.7,22.8,10.9,34.7,32.2
,43.,1.3,1.,0.5,78.4,34.2,84.9,29.1,31.3,18.3,81.8,39.,11.2,67.
,4.1,0.2,78.1,1.,7.1,1.,29.,1.1,11.7,73.6,33.9,14.,0.3,1.
,0.8,71.9,40.1,1.,2.1,3.8,16.5,4.1,0.5,44.4,46.3,18.7,6.5,36.8
,18.6,11.1,22.1,71.1,1.,0.,0.9,0.7,45.5,8.4,0.,3.8,8.5,2.
,1.,58.9,0.3,1.,14.,47.,4.1,2.2,7.2,0.3,1.5,50.5,1.3,0.6
,19.1,6.9,9.2,2.2,0.2,12.3,4.9,4.6,0.3,16.5,65.7,63.5,16.8,0.2
,1.8,9.6,15.2,14.4,3.3,10.6,61.3,10.9,32.2,9.3,11.6,20.7,6.5,6.7
,3.5,1.,1.6,20.5,1.5,16.7,2.,0.9])
fertility = np.array([1.769,2.682,2.077,2.132,1.827,3.872,2.288,5.173,1.393,1.262,2.156,3.026
,2.033,1.324,2.816,5.211,2.1,1.781,1.822,5.908,1.881,1.852,1.39,2.281
,2.505,1.224,1.361,1.468,2.404,5.52,4.058,2.223,4.859,1.267,2.342,1.579
,6.254,2.334,3.961,6.505,2.53,2.823,2.498,2.248,2.508,3.04,1.854,4.22
,5.1,4.967,1.325,4.514,3.173,2.308,4.62,4.541,5.637,1.926,1.747,2.294
,5.841,5.455,7.069,2.859,4.018,2.513,5.405,5.737,3.363,4.89,1.385,1.505
,6.081,1.784,1.378,1.45,1.841,1.37,2.612,5.329,5.33,3.371,1.281,1.871
,2.153,5.378,4.45,1.46,1.436,1.612,3.19,2.752,3.35,4.01,4.166,2.642
,2.977,3.415,2.295,3.019,2.683,5.165,1.849,1.836,2.518,2.43,4.528,1.263
,1.885,1.943,1.899,1.442,1.953,4.697,1.582,2.025,1.841,5.011,1.212,1.502
,2.516,1.367,2.089,4.388,1.854,1.748,2.978,2.152,2.362,1.988,1.426,3.29
,3.264,1.436,1.393,2.822,4.969,5.659,3.24,1.693,1.647,2.36,1.792,3.45
,1.516,2.233,2.563,5.283,3.885,0.966,2.373,2.663,1.251,2.052,3.371,2.093
,2.,3.883,3.852,3.718,1.732,3.928])
```

In [8]:

```
# Plot the illiteracy rate versus fertility
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')
# Set the margins and label axes
plt.margins(0.02)
_ = plt.xlabel('percent illiterate')
_ = plt.ylabel('fertility')
# Show the plot
plt.show()
# Show the Pearson correlation coefficient
print(pearson_r(illiteracy, fertility))
```

You can see the correlation between illiteracy and fertility by eye, and by the substantial Pearson correlation coefficient of 0.8. It is difficult to resolve in the scatter plot, but there are many points around near-zero illiteracy and about 1.8 children/woman.

We will assume that fertility is a linear function of the female illiteracy rate. That is, f = a * i + b, where a is the slope and b is the intercept. We can think of the intercept as the minimal fertility rate, probably somewhere between one and two. The slope tells us how the fertility rate varies with illiteracy. We can find the best fit line using np.polyfit().

Plot the data and the best fit line. Print out the slope and intercept. (Think: what are their units?)

- Compute the slope and intercept of the regression line using np.polyfit(). Remember, fertility is on the y-axis and illiteracy on the x-axis.
- Print out the slope and intercept from the linear regression.
- To plot the best fit line, create an array x that consists of 0 and 100 using np.array(). Then, compute the theoretical values of y based on your regression parameters. I.e., y = a * x + b.
- Plot the data and the regression line on the same plot. Be sure to label your axes.
- Hit 'Submit Answer' to display your plot.

In [9]:

```
# Plot the illiteracy rate versus fertility
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')
plt.margins(0.02)
_ = plt.xlabel('percent illiterate')
_ = plt.ylabel('fertility')
# Perform a linear regression using np.polyfit(): a, b
a, b = np.polyfit(illiteracy, fertility ,1)
# Print the results to the screen
print('slope =', a, 'children per woman / percent illiterate')
print('intercept =', b, 'children per woman')
# Make theoretical line to plot
x = np.array([0,100])
y = a * x + b
# Add regression line to your plot
_ = plt.plot(x, y)
# Draw the plot
plt.show()
```
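For a straight-line fit, the least-squares slope and intercept also have a closed form, which gives a quick cross-check of `np.polyfit()`. The sketch below uses made-up arrays; `x_demo`, `y_demo`, `slope_cf`, and `intercept_cf` are names introduced here for illustration.

```python
import numpy as np

# Hypothetical data, loosely shaped like illiteracy (x) and fertility (y)
x_demo = np.array([0.0, 10.0, 20.0, 40.0, 80.0])
y_demo = np.array([1.8, 2.1, 2.9, 3.8, 5.9])

# Closed-form least squares for a line:
# slope = sum of deviation products / sum of squared x deviations
dx = x_demo - x_demo.mean()
dy = y_demo - y_demo.mean()
slope_cf = np.sum(dx * dy) / np.sum(dx ** 2)
intercept_cf = y_demo.mean() - slope_cf * x_demo.mean()

# Degree-1 polynomial fit returns (slope, intercept)
a_fit, b_fit = np.polyfit(x_demo, y_demo, 1)
print(np.allclose([slope_cf, intercept_cf], [a_fit, b_fit]))
```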

The function np.polyfit() that you used to get your regression parameters finds the optimal slope and intercept. It is optimizing the sum of the squares of the residuals, also known as RSS (for residual sum of squares). In this exercise, you will plot the function that is being optimized, the RSS, versus the slope parameter a. To do this, fix the intercept to be what you found in the optimization. Then, plot the RSS vs. the slope. Where is it minimal?

- Specify the values of the slope to compute the RSS. Use np.linspace() to get 200 points in the range between 0 and 0.1. For example, to get 100 points in the range between 0 and 0.5, you could use np.linspace() like so: np.linspace(0, 0.5, 100).
- Initialize an array, rss, to contain the RSS using np.empty_like() and the array you created above. The empty_like() function returns a new array with the same shape and type as a given array (in this case, a_vals).
- Write a for loop to compute the RSS for each value of the slope. Hint: the RSS is given by np.sum((y_data - a * x_data - b) ** 2). The variable b you computed in the last exercise is already in your namespace. Here, fertility is the y_data and illiteracy the x_data.
- Plot the RSS (rss) versus slope (a_vals).
- Hit 'Submit Answer' to see the plot!

In [10]:

```
# Specify slopes to consider: a_vals
a_vals = np.linspace(0, 0.1, 200)
# Initialize sum of square of residuals: rss
rss = np.empty_like(a_vals)
# Compute sum of square of residuals for each value of a_vals
for i, a in enumerate(a_vals):
    rss[i] = np.sum((fertility - a*illiteracy - b)**2)
# Plot the RSS
plt.plot(a_vals, rss, '-')
plt.xlabel('slope (children per woman / percent illiterate)')
plt.ylabel('sum of square of residuals')
plt.show()
```

Notice that the minimum on the plot, that is the value of the slope that gives the minimum sum of the square of the residuals, is the same value you got when performing the regression.
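The same check can be done numerically instead of by eye: find the grid slope with the smallest RSS and compare it with the fitted slope. A sketch on synthetic data, where `x_demo`, `y_demo`, and `a_grid` are illustrative names and the true slope of 0.05 is an arbitrary choice:

```python
import numpy as np

# Hypothetical linear data with noise
rng = np.random.default_rng(0)
x_demo = rng.uniform(0, 100, size=50)
y_demo = 0.05 * x_demo + 1.9 + rng.normal(0, 0.5, size=50)

# Least-squares fit gives the optimal slope and intercept
a_fit, b_fit = np.polyfit(x_demo, y_demo, 1)

# Scan the slope with the intercept held at its fitted value
a_grid = np.linspace(0, 0.1, 200)
rss = np.array([np.sum((y_demo - a * x_demo - b_fit) ** 2) for a in a_grid])

# Grid slope with the smallest RSS
a_best = a_grid[np.argmin(rss)]
print(a_best, a_fit)
```

The two values agree up to the grid resolution, since the grid only contains 200 candidate slopes.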

In 1973, statistician Francis Anscombe published a paper that contained four fictitious x-y data sets, plotted below. He used these data sets to make an important point. The point becomes clear if we blindly go about doing parameter estimation on these data sets. Let's look at the mean x and y values of the four data sets.

They're all the same.

What if we do a linear regression on each of the data sets? They're the same.

Some of the fits are less optimal than others. Let's look at the sum of the squares of the residuals. They're basically all the same.

The data sets were constructed so this would happen. He was making a very important point. There are already some powerful tools for statistical inference. You can compute summary statistics and optimal parameters, including linear regression parameters, and by the end of the course, you'll be able to construct confidence intervals which quantify uncertainty about the parameter estimates. These are crucial skills for any data analyst, no doubt.

This is a powerful reminder to do some graphical exploratory data analysis before you start computing and making judgments about your data.

For example, plot[0, 0] (x_0, y_0) might be well modeled with a line, and the regression parameters will be meaningful. The same is true of plot[1, 0] (x_2, y_2), but the outlier throws off the slope and intercept. After doing EDA, you should look into what is causing that outlier. Plot[1, 1] (x_3, y_3) might also have a linear relationship between x and y, but from the plot, you can conclude that you should try to acquire more data for intermediate x values to make sure that it does. Plot[0, 1] (x_1, y_1) is definitely not linear, and you need to choose another model.

Explore your data first! Always do EDA as you jump into a data set.

In [11]:

```
from matplotlib import collections as mc
anscombe = 'https://assets.datacamp.com/production/repositories/470/datasets/fe820c6cbe9bcf4060eeb9e31dd86aa04264153a/anscombe.csv'
anscombe = pd.read_csv(anscombe, header=[0, 1])
anscombe.columns = [f'{x[1]}_{x[0]}' for x in anscombe.columns] # reduce multi-level column name to single level
def anscombe_plot():
    for i in range(1, 5):
        plt.subplot(2, 2, i)
        g = sns.scatterplot(x=f'x_{i-1}', y=f'y_{i-1}', data=anscombe)
        # x & y mean
        x_mean = anscombe.loc[:, f'x_{i-1}'].mean()
        y_mean = anscombe.loc[:, f'y_{i-1}'].mean()
        plt.vlines(anscombe.loc[:, f'x_{i-1}'].mean(), 0, 14, color='g')
        plt.text(9.2, 4, f'x-mean: {x_mean}')
        plt.hlines(anscombe.loc[:, f'y_{i-1}'].mean(), 2, 20, color='g')
        plt.text(13, 6.9, f'y-mean: {y_mean:0.1f}')
        # regression line
        slope, intercept = np.polyfit(anscombe[f'x_{i-1}'], anscombe[f'y_{i-1}'], 1)
        y1 = slope*2 + intercept
        y2 = slope*20 + intercept
        lc = mc.LineCollection([[(2, y1), (20, y2)]], color='green')
        g.add_collection(lc)
        plt.text(15, 12, 'reg-line')
        # sum of the squares of the residuals
        A = np.vstack((anscombe.loc[:, f'x_{i-1}'], np.ones(11))).T
        resid = np.linalg.lstsq(A, anscombe.loc[:, f'y_{i-1}'], rcond=-1)[1][0]
        plt.text(12.3, 1, f'sum sq. resid: {resid:0.1f}')
        plt.ylim(0, 14)
        plt.xlim(2, 20)
        plt.xticks(list(range(4, 22, 2)))
        plt.ylabel('y')
        plt.xlabel('x')
    return plt.show()
anscombe_plot()
```

For practice, perform a linear regression on the data set from Anscombe's quartet that is most reasonably interpreted with linear regression.

- Compute the parameters for the slope and intercept using np.polyfit(). The Anscombe data are stored in the arrays x and y.
- Print the slope a and intercept b.
- Generate theoretical x and y data from the linear regression. Your x array, which you can create with np.array(), should consist of 3 and 15. To generate the y data, multiply the slope by x_theor and add the intercept.
- Plot the Anscombe data as a scatter plot and then plot the theoretical line. Remember to include the marker='.' and linestyle='none' keyword arguments in addition to x and y when plotting the Anscombe data as a scatter plot. You do not need these arguments when plotting the theoretical line.
- Hit 'Submit Answer' to see the plot!

In [12]:

```
x = np.array([10.,8., 13., 9., 11., 14., 6., 4., 12., 7., 5.])
y = np.array([ 8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
# Perform linear regression: a, b
a, b = np.polyfit(x,y, 1)
# Print the slope and intercept
print(a, b)
# Generate theoretical x and y data: x_theor, y_theor
x_theor = np.array([3, 15])
y_theor = a * x_theor + b
# Plot the Anscombe data and theoretical line
_ = plt.plot(x, y, marker='.', linestyle='none')
_ = plt.plot(x_theor, y_theor)
# Label the axes
plt.xlabel('x')
plt.ylabel('y')
# Show the plot
plt.show()
```

Now, to verify that all four of the Anscombe data sets have the same slope and intercept from a linear regression, you will compute the slope and intercept for each set. The data are stored in lists; anscombe_x = [x1, x2, x3, x4] and anscombe_y = [y1, y2, y3, y4], where, for example, x2 and y2 are the x and y values for the second Anscombe data set.

- Write a for loop to do the following for each Anscombe data set.
- Compute the slope and intercept.
- Print the slope and intercept.

In [13]:

```
anscombe_x = [[10., 8., 13., 9., 11., 14., 6., 4., 12., 7., 5.], [10., 8., 13., 9., 11., 14., 6., 4., 12., 7., 5.],[10., 8., 13., 9., 11., 14., 6., 4., 12., 7., 5.],[ 8., 8., 8., 8., 8., 8., 8., 19., 8., 8., 8.]]
anscombe_y = [[ 8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84,4.82, 5.68],[9.14, 8.14, 8.74, 8.77, 9.26, 8.1 , 6.13, 3.1 , 9.13, 7.26, 4.74],[ 7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15,6.42, 5.73],[ 6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5 , 5.56,7.91, 6.89]]
# Iterate through x,y pairs
for x, y in zip(anscombe_x, anscombe_y):
    # Compute the slope and intercept: a, b
    a, b = np.polyfit(x, y, 1)
    # Print the result
    print('slope:', a, 'intercept:', b)
```

Getting tripped up over terminology is a common cause of frustration in students. Unfortunately, you often will read and hear other data scientists using different terminology for bootstrap samples and replicates. This is even more reason why we need everything to be clear and consistent for this course. So, before going forward discussing bootstrapping, let's get our terminology down. If we have a data set with n repeated measurements, a bootstrap sample is an array of length n that was drawn from the original data with replacement. What is a bootstrap replicate?

Possible Answers

- Just another name for a bootstrap sample.
- **A single value of a statistic computed from a bootstrap sample.**
- An actual repeat of the measurements
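The distinction between a bootstrap sample and a bootstrap replicate can be sketched in a few lines. The array `data` is a made-up set of measurements, and the mean is used as the statistic purely as an example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical measurements
data = np.array([2.3, 4.1, 3.8, 5.0, 2.9, 4.4])

# Bootstrap sample: same length as the data, drawn with replacement
bs_sample = rng.choice(data, size=len(data))

# Bootstrap replicate: one value of a statistic computed from that sample
bs_replicate = np.mean(bs_sample)

print(bs_sample)
print(bs_replicate)
```

Repeating the two steps many times yields many replicates, one per bootstrap sample.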

To help you gain intuition about how bootstrapping works, imagine you have a data set that has only three points, [-1, 0, 1]. How many unique bootstrap samples can be drawn (e.g., [-1, 0, 1] and [1, 0, -1] are unique), and what is the maximum mean you can get from a bootstrap sample? It might be useful to jot down the samples on a piece of paper.

(These are too few data to get meaningful results from bootstrap procedures, but this example is useful for intuition.)

Possible Answers

- There are 3 unique samples, and the maximum mean is 0.
- There are 10 unique samples, and the maximum mean is 0.
- There are 10 unique samples, and the maximum mean is 1.
- There are 27 unique samples, and the maximum mean is 0.
- **There are 27 unique samples, and the maximum mean is 1.**

There are 27 total bootstrap samples, and one of them, [1, 1, 1], has a mean of 1. Conversely, 7 of them have a mean of zero.
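
If you would rather not jot the samples down by hand, the counts can be checked by brute force. This is a small sketch that treats differently ordered draws as distinct samples, as the question does; the variable names are chosen here for illustration.

```python
import itertools
import numpy as np

data = [-1, 0, 1]

# Every ordered way to draw len(data) values with replacement
samples = list(itertools.product(data, repeat=len(data)))
means = np.array([np.mean(s) for s in samples])

n_samples = len(samples)               # number of ordered bootstrap samples
max_mean = means.max()                 # largest mean over all samples
n_zero_mean = int(np.sum(means == 0))  # samples whose mean is exactly zero

print(n_samples, max_mean, n_zero_mean)
```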

In this exercise, you will generate bootstrap samples from the set of annual rainfall data measured at the Sheffield Weather Station in the UK from 1883 to 2015. The data are stored in the NumPy array rainfall in units of millimeters (mm). By graphically displaying the bootstrap samples with an ECDF, you can get a feel for how bootstrap sampling allows probabilistic descriptions of data.

- Write a for loop to acquire 50 bootstrap samples of the rainfall data and plot their ECDF.
  - Use np.random.choice() to generate a bootstrap sample from the NumPy array rainfall. Be sure that the size of the resampled array is len(rainfall).
  - Use the function ecdf() that you wrote in the prequel to this course to generate the x and y values for the ECDF of the bootstrap sample bs_sample.
  - Plot the ECDF values. Specify color='gray' (to make gray dots) and alpha=0.1 (to make them semi-transparent, since we are overlaying so many) in addition to the marker='.' and linestyle='none' keyword arguments.

- Use ecdf() to generate x and y values for the ECDF of the original rainfall data available in the array rainfall.
- Plot the ECDF values of the original data.
- Hit 'Submit Answer' to visualize the samples!
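
The ecdf() function comes from the prequel course and is not reproduced in this section. In case it is not already in your namespace, here is a minimal sketch of what it is assumed to do: return the sorted data as x and the cumulative fractions 1/n, 2/n, ..., 1 as y. The exact implementation in the prequel may differ.

```python
import numpy as np

def ecdf(data):
    """Compute x, y values for the empirical CDF of 1D data."""
    n = len(data)
    # x-values: the data sorted in ascending order
    x = np.sort(data)
    # y-values: fraction of data points less than or equal to each x
    y = np.arange(1, n + 1) / n
    return x, y

# Tiny usage example with made-up numbers
x, y = ecdf([3.0, 1.0, 2.0, 4.0])
print(x, y)
```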

In [14]:

```
rainfall = np.array([875.5,648.2,788.1,940.3,491.1,743.5,730.1,686.5,878.8,865.6
,654.9,831.5,798.1,681.8,743.8,689.1,752.1,837.2,710.6,749.2
,967.1,701.2,619.,747.6,803.4,645.6,804.1,787.4,646.8,997.1
,774.,734.5,835.,840.7,659.6,828.3,909.7,856.9,578.3,904.2
,883.9,740.1,773.9,741.4,866.8,871.1,712.5,919.2,927.9,809.4
,633.8,626.8,871.3,774.3,898.8,789.6,936.3,765.4,882.1,681.1
,661.3,847.9,683.9,985.7,771.1,736.6,713.2,774.5,937.7,694.5
,598.2,983.8,700.2,901.3,733.5,964.4,609.3,1035.2,718.,688.6
,736.8,643.3,1038.5,969.,802.7,876.6,944.7,786.6,770.4,808.6
,761.3,774.2,559.3,674.2,883.6,823.9,960.4,877.8,940.6,831.8
,906.2,866.5,674.1,998.1,789.3,915.,737.1,763.,666.7,824.5
,913.8,905.1,667.8,747.4,784.7,925.4,880.2,1086.9,764.4,1050.1
,595.2,855.2,726.9,785.2,948.8,970.6,896.,618.4,572.4,1146.4
,728.2,864.2,793.])
```

In [15]:

```
for i in range(50):
    # Generate bootstrap sample: bs_sample
    bs_sample = np.random.choice(rainfall, size=len(rainfall))
    # Compute and plot ECDF from bootstrap sample
    x, y = ecdf(bs_sample)
    _ = plt.plot(x, y, marker='.', linestyle='none',
                 color='gray', alpha=0.1)
# Compute and plot ECDF from original data
x, y = ecdf(rainfall)
_ = plt.plot(x, y, marker='.')
# Make margins and label axes
plt.margins(0.02)
_ = plt.xlabel('yearly rainfall (mm)')
_ = plt.ylabel('ECDF')
# Show the plot
plt.show()
```

Notice how the bootstrap samples give an idea of how the distribution of rainfalls is spread.

The function bootstrap_replicate_1d() from the video is available in your namespace. Now you'll write another function, draw_bs_reps(data, func, size=1), which generates many bootstrap replicates from the data set. This function will come in handy for you again and again as you compute confidence intervals and later when you do hypothesis tests.

For your reference, the bootstrap_replicate_1d() function is provided below:

In [16]:

```
def bootstrap_replicate_1d(data, func):
    """Generate bootstrap replicate of 1D data."""
    # Resample the data with replacement, keeping the original length
    bs_sample = np.random.choice(data, len(data))
    return func(bs_sample)
```

- Define a function with call signature draw_bs_reps(data, func, size=1).
- Using np.empty(), initialize an array called bs_replicates of size size to hold all of the bootstrap replicates.
- Write a for loop that ranges over size and computes a replicate using bootstrap_replicate_1d(). Refer to the exercise description above to see the function signature of bootstrap_replicate_1d(). Store the replicate in the appropriate index of bs_replicates.
- Return the array of replicates bs_replicates. This has already been done for you.

In [17]:

```
def draw_bs_reps(data, func, size=1):
    """Draw bootstrap replicates."""
    # Initialize array of replicates: bs_replicates
    bs_replicates = np.empty(size)
    # Generate replicates
    for i in range(size):
        bs_replicates[i] = bootstrap_replicate_1d(data, func)
    return bs_replicates
```

In this exercise, you will compute a bootstrap estimate of the probability density function of the mean annual rainfall at the Sheffield Weather Station. Remember, we are estimating the mean annual rainfall we would get if the Sheffield Weather Station could repeat all of the measurements from 1883 to 2015 over and over again. This is a probabilistic estimate of the mean. You will plot the PDF as a histogram, and you will see that it is Normal.

In fact, it can be shown theoretically that under not-too-restrictive conditions, the value of the mean will always be Normally distributed. (This does not hold in general, just for the mean and a few other statistics.) The standard deviation of this distribution, called the standard error of the mean, or SEM, is given by the standard deviation of the data divided by the square root of the number of data points. I.e., for a data set, sem = np.std(data) / np.sqrt(len(data)). Using hacker statistics, you get this same result without the need to derive it, but you will verify this result from your bootstrap replicates.

The dataset has been pre-loaded for you into an array called rainfall.

- Draw 10000 bootstrap replicates of the mean annual rainfall using your draw_bs_reps() function and the rainfall array. Hint: Pass in np.mean for func to compute the mean.
  - As a reminder, draw_bs_reps() accepts 3 arguments: data, func, and size.

- Compute and print the standard error of the mean of rainfall.
  - The formula to compute this is np.std(data) / np.sqrt(len(data)).

- Compute and print the standard deviation of your bootstrap replicates bs_replicates.
- Make a histogram of the replicates using the density=True keyword argument (normed=True in older versions of Matplotlib) and 50 bins.
- Hit 'Submit Answer' to see the plot!

In [18]:

```
# Take 10,000 bootstrap replicates of the mean: bs_replicates
bs_replicates = draw_bs_reps(rainfall, np.mean, 10000)
# Compute and print SEM
sem = np.std(rainfall) / np.sqrt(len(rainfall))
print(sem)
# Compute and print standard deviation of bootstrap replicates
bs_std = np.std(bs_replicates)
print(bs_std)
# Make a histogram of the results
_ = plt.hist(bs_replicates, bins=50, density=True)
_ = plt.xlabel('mean annual rainfall (mm)')
_ = plt.ylabel('PDF')
# Show the plot
plt.show()
```

Notice that the SEM we got from the known expression and the standard deviation of the bootstrap replicates are nearly the same, and that the distribution of the bootstrap replicates of the mean is approximately Normal.

A confidence interval gives upper and lower bounds on the range of parameter values you might expect to get if we repeat our measurements. For named distributions, you can compute them analytically or look them up, but one of the many beautiful properties of the bootstrap method is that you can take percentiles of your bootstrap replicates to get your confidence interval. Conveniently, you can use the np.percentile() function.

Use the bootstrap replicates you just generated to compute the 95% confidence interval. That is, give the 2.5th and 97.5th percentile of your bootstrap replicates stored as bs_replicates. What is the 95% confidence interval?

Possible Answers

- (765, 776) mm/year
- **(780, 821) mm/year**
- (761, 817) mm/year
- (761, 841) mm/year

In [19]:

```
np.percentile(bs_replicates, [2.5, 97.5])
```

Out[19]:

See, it's simple to get confidence intervals using bootstrap!

We saw in a previous exercise that the mean is Normally distributed. This does not necessarily hold for other statistics, but no worry: as hackers, we can always take bootstrap replicates! In this exercise, you'll generate bootstrap replicates for the variance of the annual rainfall at the Sheffield Weather Station and plot the histogram of the replicates.

Here, you will make use of the draw_bs_reps() function you defined a few exercises ago. It is provided below for your reference:

In [20]:

```
def draw_bs_reps(data, func, size=1):
    """Draw bootstrap replicates."""
    # Initialize array of replicates
    bs_replicates = np.empty(size)
    # Generate replicates
    for i in range(size):
        bs_replicates[i] = bootstrap_replicate_1d(data, func)
    return bs_replicates
```

- Draw 10000 bootstrap replicates of the variance in annual rainfall, stored in the rainfall dataset, using your draw_bs_reps() function. Hint: Pass in np.var for computing the variance.
- Divide your variance replicates (bs_replicates) by 100 to put the variance in units of square centimeters for convenience.
- Make a histogram of bs_replicates using the density=True keyword argument (normed=True in older versions of Matplotlib) and 50 bins.

In [21]:

```
# Generate 10,000 bootstrap replicates of the variance: bs_replicates
bs_replicates = draw_bs_reps(rainfall, np.var, 10000)
# Put the variance in units of square centimeters
bs_replicates = bs_replicates/100
# Make a histogram of the results
_ = plt.hist(bs_replicates, bins=50, density=True)
_ = plt.xlabel('variance of annual rainfall (sq. cm)')
_ = plt.ylabel('PDF')
# Show the plot
plt.show()
```

This is not normally distributed, as it has a longer tail to the right. Note that you can also compute a confidence interval on the variance, or any other statistic, using np.percentile() with your bootstrap replicates.
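
As a hedged illustration of that last point, the sketch below computes a 95% percentile confidence interval for the variance. It uses synthetic right-skewed data rather than the Sheffield record, and the helper name `draw_bs_reps_sketch`, the seed, and the number of replicates are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic, right-skewed stand-in data (not the Sheffield rainfall record)
data = rng.exponential(scale=100.0, size=200)

def draw_bs_reps_sketch(data, func, size, rng):
    """Draw `size` bootstrap replicates of func applied to resampled data."""
    reps = np.empty(size)
    for i in range(size):
        # Resample with replacement, then compute the statistic
        reps[i] = func(rng.choice(data, size=len(data)))
    return reps

# Bootstrap replicates of the variance and their 95% confidence interval
var_reps = draw_bs_reps_sketch(data, np.var, 2000, rng)
ci_low, ci_high = np.percentile(var_reps, [2.5, 97.5])
print('95% CI for the variance:', ci_low, ci_high)
```

Swapping np.var for np.median, or any other function that maps an array to a single number, gives the corresponding interval for that statistic.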

Consider again the inter-no-hitter intervals for the modern era of baseball. Generate 10,000 bootstrap replicates of the optimal parameter τ. Plot a histogram of your replicates and report a 95% confidence interval.

- Generate 10000 bootstrap replicates of τ from the nohitter_times data using your draw_bs_reps() function. Recall that the optimal τ is calculated as the mean of the data.
- Compute the 95% confidence interval using np.percentile() and passing in two arguments: The array bs_replicates, and the list of percentiles - in this case 2.5 and 97.5.
- Print the confidence interval.
- Plot a histogram of your bootstrap replicates. This has been done for you, so hit 'Submit Answer' to see the plot!

In [22]:

```
# Draw bootstrap replicates of the mean no-hitter time (equal to tau): bs_replicates
bs_replicates = draw_bs_reps(nohitter_times, np.mean, 10000)
# Compute the 95% confidence interval: conf_int
conf_int = np.percentile(bs_replicates, [2.5, 97.5])
# Print the confidence interval
print('95% confidence interval =', conf_int, 'games')
# Plot the histogram of the replicates
_ = plt.hist(bs_replicates, bins=50, density=True)
_ = plt.xlabel(r'$\tau$ (games)')
_ = plt.ylabel('PDF')
# Show the plot
plt.show()
```

This gives you an estimate of what the typical time between no-hitters is. It could be anywhere between 660 and 870 games.

As discussed in the video, pairs bootstrap involves resampling pairs of data. Each collection of pairs is fit with a line, in this case using np.polyfit(). We do this again and again, getting bootstrap replicates of the parameter values. To have a useful tool for doing pairs bootstrap, you will write a function to perform pairs bootstrap on a set of x,y data.

- Define a function with call signature draw_bs_pairs_linreg(x, y, size=1) to perform pairs bootstrap estimates on linear regression parameters.
- Use np.arange() to set up an array of indices going from 0 to len(x). These are the values you will resample and then use to pick values out of the x and y arrays.
- Use np.empty() to initialize the slope and intercept replicate arrays to be of size size.
- Write a for loop to:
  - Resample the indices inds. Use np.random.choice() to do this.
  - Make new x and y arrays bs_x and bs_y using the resampled indices bs_inds. To do this, slice x and y with bs_inds.
  - Use np.polyfit() on the new x and y arrays and store the computed slope and intercept.

- Return the pair bootstrap replicates of the slope and intercept.

In [23]:

```
def draw_bs_pairs_linreg(x, y, size=1):
    """Perform pairs bootstrap for linear regression."""
    # Set up array of indices to sample from: inds
    inds = np.arange(0, len(x))
    # Initialize replicates: bs_slope_reps, bs_intercept_reps
    bs_slope_reps = np.empty(size)
    bs_intercept_reps = np.empty(size)
    # Generate replicates
    for i in range(size):
        bs_inds = np.random.choice(inds, size=len(inds))
        bs_x, bs_y = x[bs_inds], y[bs_inds]
        bs_slope_reps[i], bs_intercept_reps[i] = np.polyfit(bs_x, bs_y, 1)
    return bs_slope_reps, bs_intercept_reps
```

Using the function you just wrote, perform pairs bootstrap to plot a histogram describing the estimate of the slope from the illiteracy/fertility data. Also report the 95% confidence interval of the slope. The data is available to you in the NumPy arrays illiteracy and fertility.

As a reminder, draw_bs_pairs_linreg() has a function signature of draw_bs_pairs_linreg(x, y, size=1), and it returns two values: bs_slope_reps and bs_intercept_reps.

- Use your draw_bs_pairs_linreg() function to take 1000 bootstrap replicates of the slope and intercept. The x-axis data is illiteracy and y-axis data is fertility.
- Compute and print the 95% bootstrap confidence interval for the slope.
- Plot and show a histogram of the slope replicates. Be sure to label your axes. This has been done for you, so click 'Submit Answer' to see your histogram!

In [24]:

```
# Generate replicates of slope and intercept using pairs bootstrap
bs_slope_reps, bs_intercept_reps = draw_bs_pairs_linreg(illiteracy, fertility, 1000)
# Compute and print 95% CI for slope
print(np.percentile(bs_slope_reps, [2.5, 97.5]))
# Plot the histogram
_ = plt.hist(bs_slope_reps, bins=50, density=True)
_ = plt.xlabel('slope')
_ = plt.ylabel('PDF')
plt.show()
```

A nice way to visualize the variability we might expect in a linear regression is to plot the line you would get from each bootstrap replicate of the slope and intercept. Do this for the first 100 of your bootstrap replicates of the slope and intercept (stored as bs_slope_reps and bs_intercept_reps).

- Generate an array of x-values consisting of 0 and 100 for the plot of the regression lines. Use the np.array() function for this.
- Write a for loop in which you plot a regression line with a slope and intercept given by the pairs bootstrap replicates. Do this for 100 lines.
  - When plotting the regression lines in each iteration of the for loop, recall the regression equation y = a*x + b. Here, a is bs_slope_reps[i] and b is bs_intercept_reps[i].
  - Specify the keyword arguments linewidth=0.5, alpha=0.2, and color='red' in your call to plt.plot().

- Make a scatter plot with illiteracy on the x-axis and fertility on the y-axis. Remember to specify the marker='.' and linestyle='none' keyword arguments.
- Label the axes, set a 2% margin, and show the plot. This has been done for you, so hit 'Submit Answer' to visualize the bootstrap regressions!

In [25]:

```
# Generate array of x-values for bootstrap lines: x
x = np.array([0, 100])
# Plot the bootstrap lines
for i in range(100):
    _ = plt.plot(x,
                 bs_slope_reps[i]*x + bs_intercept_reps[i],
                 linewidth=0.5, alpha=0.2, color='red')
# Plot the data
_ = plt.plot(illiteracy, fertility, marker='.', linestyle='none')
# Label axes, set the margins, and show the plot
_ = plt.xlabel('illiteracy')
_ = plt.ylabel('fertility')
plt.margins(0.02)
plt.show()
```

Great work! You now have some serious chops for parameter estimation. Let's move on to hypothesis testing!

You now know how to define and estimate parameters given a model. But the question remains: how reasonable is it to observe your data if a model is true? This question is addressed by hypothesis tests. They are the icing on the inference cake. After completing this chapter, you will be able to carefully construct and test hypotheses using hacker statistics.

When we studied linear regression, we assumed a linear model for how the data are generated and then estimated the parameters that are defined by that model. But, how do we assess how reasonable it is that our observed data are actually described by the model? This is the realm of hypothesis testing. Let's start by thinking about a simpler scenario.

Ohio and Pennsylvania are similar states. They're neighbors and they both have liberal urban counties and also lots of rural conservative counties. I hypothesize that county-level voting in these two states has identical probability distributions. We have voting data to help test this hypothesis. Stated more concretely, we're going to assess how reasonable the observed data are assuming the hypothesis is true.

The hypothesis we are testing is typically called the null hypothesis. We might start by just plotting the two ECDFs of county-level votes. Based on the ECDFs, it's pretty tough to make a judgment. Pennsylvania seems to be slightly more toward Obama in the middle part of the ECDFs, but not by much. We can't really draw a conclusion here. We could also compare some summary statistics, but the answer is still not clear: the means and medians of the two states are really close, and the standard deviations are almost identical. Eyeballing the data is not enough to make a determination.

To resolve this issue, we can simulate what the data would look like if the county-level voting trends in the two states were identically distributed. We do this by putting the Democratic share of the vote for all of Pennsylvania's 67 counties and Ohio's 88 counties together, ignoring which state they belong to. Next, we randomly scramble the ordering of the counties. We then relabel the first 67 as Pennsylvania and the remaining 88 as Ohio. In effect, we have just redone the election as if there were no difference between Pennsylvania and Ohio.

This technique, of scrambling the order of an array, is called permutation. It is at the heart of simulating a null hypothesis where we assume two quantities are identically distributed.
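
The pool, scramble, and relabel procedure described above can be sketched in a few lines. The vote shares here are random stand-ins generated for illustration (not real election data), and the variable names and seed are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins for county-level vote shares (percent); not real data
pa_share = rng.uniform(20, 80, size=67)   # 67 Pennsylvania counties
oh_share = rng.uniform(20, 80, size=88)   # 88 Ohio counties

# Pool the two states, ignoring the labels, and scramble the order
pooled = np.concatenate((pa_share, oh_share))
permuted = rng.permutation(pooled)

# Relabel: the first 67 become "Pennsylvania", the remaining 88 become "Ohio"
perm_pa = permuted[:len(pa_share)]
perm_oh = permuted[len(pa_share):]

print(len(perm_pa), len(perm_oh))
```

Every value from the pooled array appears exactly once across the two relabeled arrays; only the state labels have been shuffled.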

In the video, you learned that permutation sampling is a great way to simulate the hypothesis that two variables have identical probability distributions. This is often a hypothesis you want to test, so in this exercise, you will write a function to generate a permutation sample from two data sets.

Remember, a permutation sample of two arrays having respectively n1 and n2 entries is constructed by concatenating the arrays together, scrambling the contents of the concatenated array, and then taking the first n1 entries as the permutation sample of the first array and the last n2 entries as the permutation sample of the second array.

- Concatenate the two input arrays into one using np.concatenate(). Be sure to pass in data1 and data2 as one argument (data1, data2).
- Use np.random.permutation() to permute the concatenated array.
- Store the first len(data1) entries of permuted_data as perm_sample_1 and the last len(data2) entries of permuted_data as perm_sample_2. In practice, this can be achieved by using :len(data1) and len(data1): to slice permuted_data.
- Return perm_sample_1 and perm_sample_2.

In [26]:

```
def permutation_sample(data1, data2):
    """Generate a permutation sample from two data sets."""
    # Concatenate the data sets: data
    data = np.concatenate((data1, data2))
    # Permute the concatenated array: permuted_data
    permuted_data = np.random.permutation(data)
    # Split the permuted array into two: perm_sample_1, perm_sample_2
    perm_sample_1 = permuted_data[:len(data1)]
    perm_sample_2 = permuted_data[len(data1):]
    return perm_sample_1, perm_sample_2
```

To help see how permutation sampling works, in this exercise you will generate permutation samples and look at them graphically.

We will use the Sheffield Weather Station data again, this time considering the monthly rainfall in June (a dry month) and November (a wet month). We expect these might be differently distributed, so we will take permutation samples to see how their ECDFs would look if they were identically distributed.

The data are stored in the NumPy arrays rain_june and rain_november.

As a reminder, permutation_sample() has a function signature of permutation_sample(data_1, data_2) with a return value of permuted_data[:len(data_1)], permuted_data[len(data_1):], where permuted_data = np.random.permutation(np.concatenate((data_1, data_2))).

- Write a for loop to generate 50 permutation samples, compute their ECDFs, and plot them.
  - Generate a permutation sample pair from rain_june and rain_november using your permutation_sample() function.
  - Generate the x and y values for the ECDF of each of the two permutation samples using your ecdf() function.
  - Plot the ECDF of the first permutation sample (x_1 and y_1) as dots. Do the same for the second permutation sample (x_2 and y_2).

- Generate x and y values for ECDFs for the rain_june and rain_november data and plot the ECDFs using respectively the keyword arguments color='red' and color='blue'.
- Label your axes, set a 2% margin, and show your plot. This has been done for you, so just hit 'Submit Answer' to view the plot!

In [27]:

```
rain_june = np.array([66.2,39.7,76.4,26.5,11.2,61.8,6.1,48.4,89.2,104.,34.,60.6
,57.1,79.1,90.9,32.3,63.8,78.2,27.5,43.4,30.1,17.3,77.5,44.9
,92.2,39.6,79.4,66.1,53.5,98.5,20.8,55.5,39.6,56.,65.1,14.8
,13.2,88.1,8.4,32.1,19.6,40.4,2.2,77.5,105.4,77.2,38.,27.1
,111.8,17.2,26.7,23.3,77.2,87.2,27.7,50.6,60.3,15.1,6.,29.4
,39.3,56.3,80.4,85.3,68.4,72.5,13.3,28.4,14.7,37.4,49.5,57.2
,85.9,82.1,31.8,126.6,30.7,41.4,33.9,13.5,99.1,70.2,91.8,61.3
,13.7,54.9,62.5,24.2,69.4,83.1,44.,48.5,11.9,16.6,66.4,90.
,34.9,132.8,33.4,225.,7.6,40.9,76.5,48.,140.,55.9,54.1,46.4
,68.6,52.2,108.3,14.6,11.3,29.8,130.9,152.4,61.,46.6,43.9,30.9
,111.1,68.5,42.2,9.8,285.6,56.7,168.2,41.2,47.8,166.6,37.8,45.4
,43.2])
rain_november = np.array([83.6,30.9,62.2,37.,41.,160.2,18.2,122.4,71.3,44.2,49.1,37.6
,114.5,28.8,82.5,71.9,50.7,67.7,112.,63.6,42.8,57.2,99.1,86.4
,84.4,38.1,17.7,102.2,101.3,58.,82.,101.4,81.4,100.1,54.6,39.6
,57.5,29.2,48.8,37.3,115.4,55.6,62.,95.,84.2,118.1,153.2,83.4
,104.7,59.,46.4,50.,147.6,76.8,59.9,101.8,136.6,173.,92.5,37.
,59.8,142.1,9.9,158.2,72.6,28.,112.9,119.3,199.2,50.7,44.,170.7
,67.2,21.4,61.3,15.6,106.,116.2,42.3,38.5,132.5,40.8,147.5,93.9
,71.4,87.3,163.7,141.4,62.6,84.9,28.8,121.1,28.6,32.4,112.,50.
,96.9,81.8,70.4,117.5,41.2,124.9,78.2,93.,53.5,50.5,42.6,47.9
,73.1,129.1,56.9,103.3,60.5,134.3,93.1,49.5,48.2,167.9,27.,111.1
,55.4,36.2,57.4,66.8,58.3,60.,161.6,112.7,37.4,110.6,56.6,95.8
,126.8])
```

In [28]:

```
for i in range(50):
    # Generate permutation samples
    perm_sample_1, perm_sample_2 = permutation_sample(rain_june, rain_november)
    # Compute ECDFs
    x_1, y_1 = ecdf(perm_sample_1)
    x_2, y_2 = ecdf(perm_sample_2)
    # Plot ECDFs of permutation sample
    _ = plt.plot(x_1, y_1, marker='.', linestyle='none',
                 color='red', alpha=0.02)
    _ = plt.plot(x_2, y_2, marker='.', linestyle='none',
                 color='blue', alpha=0.02)
# Create and plot ECDFs from original data
x_1, y_1 = ecdf(rain_june)
x_2, y_2 = ecdf(rain_november)
_ = plt.plot(x_1, y_1, marker='.', linestyle='none', color='red')
_ = plt.plot(x_2, y_2, marker='.', linestyle='none', color='blue')
# Label axes, set margin, and show plot
plt.margins(0.02)
_ = plt.xlabel('monthly rainfall (mm)')
_ = plt.ylabel('ECDF')
plt.show()
```

Notice that the ECDFs of the permutation samples overlap and give a purple haze. None of the ECDFs from the permutation samples overlap with the observed data, suggesting that the hypothesis is not commensurate with the data: June and November rainfall do not appear to be identically distributed.

The most important thing to consider when performing a hypothesis test is: What are you asking?

The p-value is generally a measure of: the probability of observing a test statistic equally or more extreme than the one you observed, given that the null hypothesis is true.

As we are considering probability in hacker statistics, we cannot assign a probability to a hypothesis, only the probability of observing data under a hypothesis.
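
As a minimal sketch of the p-value definition above, the example below simulates the null hypothesis by permutation and reports the fraction of simulated test statistics at least as large as the observed one (a one-sided test). The two samples are synthetic, and the names `diff_of_means_sketch`, the seed, and the number of replicates are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic samples for illustration only (not data from the course)
data_1 = rng.normal(loc=0.7, scale=0.2, size=20)
data_2 = rng.normal(loc=0.4, scale=0.2, size=20)

def diff_of_means_sketch(a, b):
    """Test statistic: mean of a minus mean of b."""
    return np.mean(a) - np.mean(b)

# Test statistic actually observed
observed = diff_of_means_sketch(data_1, data_2)

# Simulate the null hypothesis by permuting the pooled data
pooled = np.concatenate((data_1, data_2))
perm_replicates = np.empty(5000)
for i in range(perm_replicates.size):
    permuted = rng.permutation(pooled)
    perm_replicates[i] = diff_of_means_sketch(permuted[:len(data_1)],
                                              permuted[len(data_1):])

# p-value: fraction of replicates at least as extreme as the observed value
p_value = np.sum(perm_replicates >= observed) / len(perm_replicates)
print('p-value =', p_value)
```

Note that the result is the probability of the data (summarized by the test statistic) under the null hypothesis, not the probability that the null hypothesis is true.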

As discussed in the video, a permutation replicate is a single value of a statistic computed from a permutation sample. Just as the draw_bs_reps() function you wrote in chapter 2 is useful for generating bootstrap replicates, it is useful to have a similar function, draw_perm_reps(), to generate permutation replicates. You will write this function in this exercise.

The function has call signature draw_perm_reps(data_1, data_2, func, size=1). Importantly, func must be a function that takes two arrays as arguments. In most circumstances, func will be a function you write yourself.

- Define a function with this signature: draw_perm_reps(data_1, data_2, func, size=1).
- Initialize an array to hold the permutation replicates using np.empty().
- Write a for loop to:
  - Compute a permutation sample using your permutation_sample() function.
  - Pass the samples into func() to compute the replicate and store the result in your array of replicates.

- Return the array of replicates.

In [29]:

```
def draw_perm_reps(data_1, data_2, func, size=1):
    """Generate multiple permutation replicates."""
    # Initialize array of replicates: perm_replicates
    perm_replicates = np.empty(size)
    for i in range(size):
        # Generate permutation sample
        perm_sample_1, perm_sample_2 = permutation_sample(data_1, data_2)
        # Compute the test statistic
        perm_replicates[i] = func(perm_sample_1, perm_sample_2)
    return perm_replicates
```

Kleinteich and Gorb (Sci. Rep., 4, 5225, 2014) performed an interesting experiment with South American horned frogs. They held a plate connected to a force transducer, along with a bait fly, in front of the frogs. They then measured the impact force and adhesive force of the frog's tongue when it struck the target.

Frog A is an adult and Frog B is a juvenile. The researchers measured the impact force of 20 strikes for each frog. In the next exercise, we will test the hypothesis that the two frogs have the same distribution of impact forces. But, remember, it is important to do EDA first! Let's make a bee swarm plot for the data. They are stored in a Pandas data frame, df, where column ID is the identity of the frog and column impact_force is the impact force in Newtons (N).

- Use sns.swarmplot() to make a bee swarm plot of the data by specifying the x, y, and data keyword arguments.
- Label your axes.
- Show the plot.

In [30]:

```
frogs = pd.read_csv('./datasets/frog_tongue.txt', header=[14])
frogs.date = pd.to_datetime(frogs.date, format='%Y_%m_%d')
frogs.head()
```

Out[30]:

Kleinteich and Gorb, Sci. Rep., 4, 5225, 2014. It was featured in the New York Times. http://www.nytimes.com/2014/08/25/science/a-frog-thats-a-living-breathing-pac-man.html

The authors included the data in their supplemental information.

Importantly, the ID refers to the identities of the frogs they tested.

- I: adult, 63 mm snout-vent-length (SVL) and 63.1 g body weight, Ceratophrys cranwelli crossed with Ceratophrys cornuta
- II: adult, 70 mm SVL and 72.7 g body weight, Ceratophrys cranwelli crossed with Ceratophrys cornuta
- III: juvenile, 28 mm SVL and 12.7 g body weight, Ceratophrys cranwelli
- IV: juvenile, 31 mm SVL and 12.7 g body weight, Ceratophrys cranwelli

In [31]:

```
# Make bee swarm plot (the column holds impact force in millinewtons)
_ = sns.swarmplot(x='ID', y='impact force (mN)', data=frogs)
# Label axes
_ = plt.xlabel('frog')
_ = plt.ylabel('impact force (mN)')
# Show the plot
plt.show()
```

In [32]:

```
print(frogs.shape)
```

Eyeballing it, it does not look like they come from the same distribution. Frog I, an adult, has three or four very hard strikes, while Frog II, the other adult, has a couple of weak ones, as do the weaker juvenile frogs III and IV. However, it is possible that with only 80 samples it might be too difficult to tell if they have different distributions, so we should proceed with the hypothesis test.

The average strike force of Frog II (an adult, stored in force_a) was 0.71 Newtons (N), and that of Frog IV (a juvenile, stored in force_b) was 0.42 N, for a difference of 0.29 N. It is possible the frogs strike with the same force and this observed difference was by chance. You will compute the probability of getting at least a 0.29 N difference in mean strike force under the hypothesis that the distributions of strike forces for the two frogs are identical. We use a permutation test with a test statistic of the difference of means to test this hypothesis.

For your convenience, the data has been stored in the arrays force_a and force_b.

- Define a function with call signature diff_of_means(data_1, data_2) that returns the differences in means between two data sets, mean of data_1 minus mean of data_2.
- Use this function to compute the empirical difference of means that was observed in the frogs.
- Draw 10000 permutation replicates of the difference of means.
- Compute the p-value.
- Print the p-value.

In [33]:

```
force_a = frogs['impact force (mN)'][frogs.ID == 'II'].div(1000).to_list()
force_b = frogs['impact force (mN)'][frogs.ID == 'IV'].div(1000).to_list()
```

In [34]:

```
print('Mean force of frog a: ' + str(np.mean(force_a)))
print('Mean force of frog b: ' + str(np.mean(force_b)))
```

In [35]:

```
def diff_of_means(data_1, data_2):
    """Difference in means of two arrays."""
    # The difference of means of data_1, data_2: diff
    diff = np.mean(data_1) - np.mean(data_2)
    return diff
# Compute difference of mean impact force from experiment: empirical_diff_means
empirical_diff_means = diff_of_means(force_a, force_b)
# Draw 10,000 permutation replicates: perm_replicates
perm_replicates = d
```