# Exploratory Data Analysis in Python

# 1: Exploring the NSFG data

To get the number of rows and columns in a DataFrame, you can read its `shape`

attribute.

To get the column names, you can read the `columns`

attribute. The result is an Index, which is a Pandas data structure that is similar to a list. Let’s begin exploring the NSFG data! It has been pre-loaded for you into a DataFrame called `nsfg`

.

# Display the number of rows and columns nsfg.shape # Display the names of the columns nsfg.columns # Select column birthwgt_oz1: ounces ounces = nsfg['birthwgt_oz1'] # Print the first 5 elements of ounces print(ounces.head())

In [1]: nsfg.shape Out[1]: (9358, 10) In [2]: nsfg.columns Out[2]: Index(['caseid', 'outcome', 'birthwgt_lb1', 'birthwgt_oz1', 'prglngth', 'nbrnaliv', 'agecon', 'agepreg', 'hpagelb', 'wgt2013_2015'], dtype='object') 0 4.0 1 12.0 2 4.0 3 NaN 4 13.0 Name: birthwgt_oz1, dtype: float64

Remember these attributes and methods; they are useful when you are exploring a new dataset. It’s now time to check for errors and prepare the data for analysis. Keep going!

# Validate a variable

In the NSFG dataset, the variable `'outcome'`

encodes the outcome of each pregnancy as shown below:

value | label |
---|---|

1 | Live birth |

2 | Induced abortion |

3 | Stillbirth |

4 | Miscarriage |

5 | Ectopic pregnancy |

6 | Current pregnancy |

The `nsfg`

DataFrame has been pre-loaded for you. Explore it in the IPython Shell and use the methods Allen showed you in the video to answer the following question: How many pregnancies in this dataset ended with a live birth?

In [4]: nsfg[nsfg['outcome']== 1].shape Out[4]: (6489, 10)

By comparing your results with the codebook, you confirm you are interpreting the data correctly.

# Clean a variable

In the NSFG dataset, the variable `'nbrnaliv'`

records the number of babies born alive at the end of a pregnancy.

If you use `.value_counts()`

to view the responses, you’ll see that the value `8`

appears once, and if you consult the codebook, you’ll see that this value indicates that the respondent refused to answer the question.

Your job in this exercise is to replace this value with `np.nan`

. Recall from the video how Allen replaced the values `98`

and `99`

in the `ounces`

column using the `.replace()`

method:

```
ounces.replace([98, 99], np.nan, inplace=True)
```

##### Instructions

- In the
`'nbrnaliv'`

column, replace the value`8`

, in place, with the special value`NaN`

. - Confirm that the value
`8`

no longer appears in this column by printing the values and their frequencies.

# Replace the value 8 with NaN nsfg['nbrnaliv'].replace([8], np.nan, inplace=True) # Print the values and their frequencies print(nsfg['nbrnaliv'].value_counts())

<script.py> output: 1.0 6379 2.0 100 3.0 5 Name: nbrnaliv, dtype: int64

If you are careful about this kind of cleaning and validation, it will save time (in the long run) and avoid potentially serious errors.

# Compute a variable

For each pregnancy in the NSFG dataset, the variable `'agecon'`

encodes the respondent’s age at conception, and `'agepreg'`

the respondent’s age at the end of the pregnancy.

Both variables are recorded as integers with two implicit decimal places, so the value `2575`

means that the respondent’s age was `25.75`

.

##### Instructions 1/3

- 1/3 Select
`'agecon'`

and`'agepreg'`

, divide them by`100`

, and assign them to the local variables`agecon`

and`agepreg`

.

# Select the columns and divide by 100 agecon = nsfg['agecon']/100 agepreg = nsfg['agepreg']/100

- 2/3 Compute the difference, which is an estimate of the duration of the pregnancy. Keep in mind that for each pregnancy,
`agepreg`

will be larger than`agecon`

.

# Select the columns and divide by 100 agecon = nsfg['agecon'] / 100 agepreg = nsfg['agepreg'] / 100 # Compute the difference preg_length = agepreg - agecon

- 3/3 Use
`.describe()`

to compute the mean duration and other summary statistics.

# Select the columns and divide by 100 agecon = nsfg['agecon'] / 100 agepreg = nsfg['agepreg'] / 100 # Compute the difference preg_length = agepreg - agecon # Compute summary statistics print(preg_length.describe())

<script.py> output: count 9109.000000 mean 0.552069 std 0.271479 min 0.000000 25% 0.250000 50% 0.670000 75% 0.750000 max 0.920000 dtype: float64

A variable that’s computed from other variables is sometimes called a ‘recode’. It’s now time to get back to the motivating question for this chapter: what is the average birth weight for babies in the U.S.? See you in the next video!

Pyplot doesn’t work with NaN’s.

# Make a histogram

Histograms are one of the most useful tools in exploratory data analysis. They quickly give you an overview of the distribution of a variable, that is, what values the variable can have, and how many times each value appears.

As we saw in a previous exercise, the NSFG dataset includes a variable `'agecon'`

that records age at conception for each pregnancy. Here, you’re going to plot a histogram of this variable. You’ll use the `bins`

parameter that you saw in the video, and also a new parameter – `histtype`

– which you can read more about here in the `matplotlib`

documentation. Learning how to read documentation is an essential skill. If you want to learn more about `matplotlib`

, you can check out DataCamp’s Introduction to Matplotlib course. https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.hist.html

##### Instructions 1/2

- 1Plot a histogram of
`agecon`

with`20`

bins.

# Plot the histogram plt.hist(agecon, bins=20) # Label the axes plt.xlabel('Age at conception') plt.ylabel('Number of pregnancies') # Show the figure plt.show()

- 2/2 Adapt your code to make an unfilled histogram by setting the parameter
`histtype`

to be`'step'`

.

# Plot the histogram plt.hist(agecon, bins=20, histtype='step') # Label the axes plt.xlabel('Age at conception') plt.ylabel('Number of pregnancies') # Show the figure plt.show()

`matplotlib`

functions provide a lot of options; be sure to read the documentation so you know what they can do.

# Compute birth weight

Now let’s pull together the steps in this chapter to compute the average birth weight for full-term babies.

I’ve provided a function, `resample_rows_weighted`

, that takes the NSFG data and resamples it using the sampling weights in `wgt2013_2015`

. The result is a sample that is representative of the U.S. population.

Then I extract `birthwgt_lb1`

and `birthwgt_oz1`

, replace special codes with `NaN`

, and compute total birth weight in pounds, `birth_weight`

.

```
# Resample the data
nsfg = resample_rows_weighted(nsfg, 'wgt2013_2015')
# Clean the weight variables
pounds = nsfg['birthwgt_lb1'].replace([98, 99], np.nan)
ounces = nsfg['birthwgt_oz1'].replace([98, 99], np.nan)
# Compute total birth weight
birth_weight = pounds + ounces/16
```

##### Instructions

- Make a Boolean Series called
`full_term`

that is true for babies with`'prglngth'`

greater than or equal to 37 weeks. - Use
`full_term`

and`birth_weight`

to select birth weight in pounds for full-term babies. Store the result in`full_term_weight`

. - Compute the mean weight of full-term babies.

# Create a Boolean Series for full-term babies full_term = nsfg['prglngth'] >= 37 # Select the weights of full-term babies full_term_weight = birth_weight[full_term] # Compute the mean weight of full-term babies print(full_term_weight.mean())

7.392597951914515

# Filter

In the previous exercise, you computed the mean birth weight for full-term babies; you filtered out preterm babies because their distribution of weight is different.

The distribution of weight is also different for multiple births, like twins and triplets. In this exercise, you’ll filter them out, too, and see what effect it has on the mean.

##### Instructions

- Use the variable
`'nbrnaliv'`

to make a Boolean Series that is`True`

for single births (where`'nbrnaliv'`

equals`1`

) and`False`

otherwise. - Use Boolean Series and logical operators to select single, full-term babies and compute their mean birth weight.
- For comparison, select multiple, full-term babies and compute their mean birth weight.

# Filter full-term babies full_term = nsfg['prglngth'] >= 37 # Filter single births single = nsfg['nbrnaliv']== 1 # Compute birth weight for single full-term babies # the tilda ~ sign makes the inverse of a boolean value single_full_term_weight = birth_weight[single & full_term] print('Single full-term mean:', single_full_term_weight.mean()) # Compute birth weight for multiple full-term babies mult_full_term_weight = birth_weight[~single & full_term] print('Multiple full-term mean:', mult_full_term_weight.mean())

Single full-term mean: 7.40297320308299 Multiple full-term mean: 5.784722222222222

Now that we have clean data, we’re ready to explore. Coming up in Chapter 2, we’ll look at distributions of variables in the General Social Survey and explore the relationship between education and income.

## 2: Probability Mass Function

In probability and statistics, a **probability mass function** is a function that gives the probability that a discrete random variableis exactly equal to some value.^{[1]} Sometimes it is also known as the discrete density function. The probability mass function is often the primary means of defining a discrete probability distribution, and such functions exist for either scalar or multivariate random variables whose domain is discrete. (from: https://en.wikipedia.org/wiki/Probability_mass_function)

# Make a PMF

The GSS dataset has been pre-loaded for you into a DataFrame called `gss`

. You can explore it in the IPython Shell to get familiar with it.

In this exercise, you’ll focus on one variable in this dataset, `'year'`

, which represents the year each respondent was interviewed.

The `Pmf`

class you saw in the video has already been created for you. You can access it outside of DataCamp via the `empiricaldist`

library. (see: https://pypi.org/project/empiricaldist/ , `use pip install empiricaldist`

, check examples on https://nbviewer.org/github/AllenDowney/empiricaldist/blob/master/empiricaldist/dist_demo.ipynb).

The library `empiricaldist`

also provides `Cdf`

, which represents a cumulative distribution function.

##### Instructions 1/2

- 1/2 Make a PMF for
`year`

with`normalize=False`

and display the result.

# Compute the PMF for year pmf_year = Pmf(gss['year'], normalize=False) # Print the result print(pmf_year)

<script.py> output: 1972 1613 1973 1504 1974 1484 1975 1490 1976 1499 1977 1530 1978 1532 1980 1468 1982 1860 1983 1599 1984 1473 1985 1534 1986 1470 1987 1819 1988 1481 1989 1537 1990 1372 1991 1517 1993 1606 1994 2992 1996 2904 1998 2832 2000 2817 2002 2765 2004 2812 2006 4510 2008 2023 2010 2044 2012 1974 2014 2538 2016 2867 Name: Pmf, dtype: int64

The PMF makes it easy to extract insights like this. Time now to visualize the PMF for the `'age'`

variable of this GSS dataset!

# Plot a PMF

Now let’s plot a PMF for the age of the respondents in the GSS dataset. The variable `'age'`

contains respondents’ age in years.

##### Instructions 1/3

- 1/3 Select the
`'age'`

column from the`gss`

DataFrame and store the result in`age`

.

# Select the age column age = gss['age']

- 2/3 Make a normalized PMF of
`age`

. Store the result in`pmf_age`

.

# Select the age column age = gss['age'] # Make a PMF of age pmf_age = Pmf(age, normalize=True)

- 3/3 Plot
`pmf_age`

as a bar chart.

# Select the age column age = gss['age'] # Make a PMF of age pmf_age = Pmf(age) # Plot the PMF pmf_age.bar(label='age') # Label the axes plt.xlabel('Age') plt.ylabel('PMF') plt.show()

You could also use `pmf_age.plot()`

to plot the Pmf as a line plot.

# Make a CDF (Cumulative Distribution Function)

In probability theory and statistics, the **cumulative distribution function** (**CDF**) of a real-valued random variable , or just **distribution function** of , evaluated at , is the probability that will take a value less than or equal to .^{[1]}

Every probability distribution supported on the real numbers, discrete or “mixed” as well as continuous, is uniquely identified by an **upwards continuous**^{[2]} **monotonic increasing** cumulative distribution function satisfying and .

In the case of a scalar continuous distribution, it gives the area under the probability density function from minus infinity to . Cumulative distribution functions are also used to specify the distribution of multivariate random variables.

(From: https://en.wikipedia.org/wiki/Cumulative_distribution_function)

In this exercise, you’ll make a CDF and use it to determine the fraction of respondents in the GSS dataset who are OLDER than 30.

The GSS dataset has been preloaded for you into a DataFrame called `gss`

.

As with the `Pmf`

class from the previous lesson, the `Cdf`

class you just saw in the video has been created for you, and you can access it outside of DataCamp via the `empiricaldist`

library.

##### Instructions 1/4

- 1/4 Select the
`'age'`

column. Store the result in`age`

.

# Select the age column age = gss['age']

- 2/4 Compute the CDF of
`age`

. Store the result in`cdf_age`

.

# Select the age column age = gss['age'] # Compute the CDF of age cdf_age = Cdf(age)

- 3/4 Calculate the CDF of
`30`

.

# Select the age column age = gss['age'] # Compute the CDF of age cdf_age = Cdf(age) # Calculate the CDF of 30 print(cdf_age(30))

<script.py> output: 0.2539137136526388

#### Question

What fraction of the respondents in the GSS dataset are OLDER than 30?

##### Possible Answers

**Approximately 75%**- Approximately 65%
- Approximately 45%
- Approximately 25%

# Compute IQR

Recall from the video that the interquartile range (IQR) is the difference between the 75th and 25th percentiles. It is a measure of variability that is robust in the presence of errors or extreme values.

In this exercise, you’ll compute the interquartile range of income in the GSS dataset. Income is stored in the `'realinc'`

column, and the CDF of income has already been computed and stored in `cdf_income`

.

##### Instructions 1/4

- 1/4 Calculate the 75th percentile of income and store it in
`percentile_75th`

.

# Calculate the 75th percentile percentile_75th = cdf_income.inverse(0.75)

- 2/4 Calculate the 25th percentile of income and store it in
`percentile_25th`

.

# Calculate the 75th percentile percentile_75th = cdf_income.inverse(0.75) # Calculate the 25th percentile percentile_25th = cdf_income.inverse(0.25)

- 3/4 Calculate the interquartile range of income. Store the result in
`iqr`

.

# Calculate the 75th percentile percentile_75th = cdf_income.inverse(0.75) # Calculate the 25th percentile percentile_25th = cdf_income.inverse(0.25) # Calculate the interquartile range iqr = percentile_75th - percentile_25th # Print the interquartile range print(iqr)

<script.py> output: 29676.0

- 4/4

#### Question

What is the interquartile range (IQR) of income in the GSS datset?

##### Possible Answers

**Approximately 29676**- Approximately 26015
- Approximately 34702
- Approximately 30655

# Plot a CDF

The distribution of income in almost every country is long-tailed; that is, there are a small number of people with very high incomes.

In the GSS dataset, the variable `'realinc'`

represents total household income, converted to 1986 dollars. We can get a sense of the shape of this distribution by plotting the CDF.

##### Instructions

- Select
`'realinc'`

from the`gss`

dataset. - Make a Cdf object called
`cdf_income`

. - Create a plot of
`cdf_income`

using`.plot()`

.

# Select realinc income = gss['realinc'] # Make the CDF cdf_income = Cdf(income) # Plot it cdf_income.plot() # Label the axes plt.xlabel('Income (1986 USD)') plt.ylabel('CDF') plt.show()

It’s recommended to use CDF’s for exploratory analysis because a CDF gives a clearer view of the distribution, without too much noise. Besides that CDF’s are good for comparing distributions. Especially if there are more than two CDF’s.

# Distribution of education

Let’s begin comparing incomes for different levels of education in the GSS dataset, which has been pre-loaded for you into a DataFrame called `gss`

. The variable `educ`

represents the respondent’s years of education.

What fraction of respondents report that they have 12 years of education or fewer?

##### Instructions

##### Possible Answers

- Approximately 22%
- Approximately 31%
- Approximately 47%
- Approximately 53%

In [1]: cdf = Cdf(gss['educ']) In [2]: print(cdf(12)) 0.5322611710323575

If you evaluate the CDF at 12, you get the fraction of respondents with 12 or fewer years of eduction.

# Extract education levels

Let’s create Boolean Series to identify respondents with different levels of education.

In the U.S, 12 years of education usually means the respondent has completed high school (secondary education). A respondent with 14 years of education has probably completed an associate degree (two years of college); someone with 16 years has probably completed a bachelor’s degree (four years of college).

##### Instructions

- Complete the line that identifies respondents with associate degrees, that is, people with 14 or more years of education but less than 16.
- Complete the line that identifies respondents with 12 or fewer years of education.
- Confirm that the mean of
`high`

is the fraction we computed in the previous exercise, about 53%.

# Select educ educ = gss['educ'] # Bachelor's degree bach = (educ >= 16) # Associate degree assc = ((educ >= 14) & (educ < 16)) # High school (12 or fewer years of education) high = (educ <= 12) print(high.mean())

<script.py> output: 0.5308807991547402

Remember, you can use logical operators to make Boolean Series and select rows from a DataFrame or Series.

# Plot income CDFs

Let’s now see what the distribution of income looks like for people with different education levels. You can do this by plotting the CDFs. Recall how Allen plotted the income CDFs of respondents interviewed before and after 1995:

```
Cdf(income[pre95]).plot(label='Before 1995')
Cdf(income[~pre95]).plot(label='After 1995')
```

You can assume that Boolean Series have been defined, as in the previous exercise, to identify respondents with different education levels: `high`

, `assc`

, and `bach`

.

##### Instructions

- Fill in the missing lines of code to plot the CDFs.

income = gss['realinc'] # Plot the CDFs Cdf(income[high]).plot(label='High school') Cdf(income[assc]).plot(label='Associate') Cdf(income[bach]).plot(label='Bachelor') # Label the axes plt.xlabel('Income (1986 USD)') plt.ylabel('CDF') plt.legend() plt.show()

It might not be surprising that people with more education have higher incomes, but looking at these distributions, we can see where the differences are.

# Distribution of income

In many datasets, the distribution of income is approximately lognormal, which means that the logarithms of the incomes fit a normal distribution. We’ll see whether that’s true for the GSS data. As a first step, you’ll compute the mean and standard deviation of the log of incomes using NumPy’s `np.log10()`

function.

Then, you’ll use the computed mean and standard deviation to make a `norm`

object using the `scipy.stats.norm()`

function.

##### Instructions

- Extract
`'realinc'`

from`gss`

and compute its logarithm using`np.log10()`

. - Compute the mean and standard deviation of the result.
- Make a
`norm`

object by passing the computed mean and standard deviation to`norm()`

.

# Extract realinc and compute its log income = gss['realinc'] log_income = np.log10(income) # Compute mean and standard deviation mean = log_income.mean() std = log_income.std() print(mean, std) # Make a norm object from scipy.stats import norm dist = norm(mean, std)

<script.py> output: 4.371148677934171 0.4290082383271385

Now we can plot the model and the observed distribution and see where they differ.

# Comparing CDFs

To see whether the distribution of income is well modeled by a lognormal distribution, we’ll compare the CDF of the logarithm of the data to a normal distribution with the same mean and standard deviation. These variables from the previous exercise are available for use:

```
# Extract realinc and compute its log
log_income = np.log10(gss['realinc'])
# Compute mean and standard deviation
mean, std = log_income.mean(), log_income.std()
# Make a norm object
from scipy.stats import norm
dist = norm(mean, std)
```

`dist`

is a `scipy.stats.norm`

object with the same mean and standard deviation as the data. It provides `.cdf()`

, which evaluates the normal cumulative distribution function.

Be careful with capitalization: `Cdf()`

, with an uppercase `C`

, creates `Cdf`

objects. `dist.cdf()`

, with a lowercase `c`

, evaluates the normal cumulative distribution function.

##### Instructions

- Evaluate the normal cumulative distribution function using
`dist.cdf`

. - Use the
`Cdf()`

function to compute the CDF of`log_income`

. - Plot the result.

# Evaluate the model CDF xs = np.linspace(2, 5.5) ys = dist.cdf(xs) # Plot the model CDF plt.clf() plt.plot(xs, ys, color='gray') # Create and plot the Cdf of log_income Cdf(log_income).plot() # Label the axes plt.xlabel('log10 of realinc') plt.ylabel('CDF') plt.show()

The lognormal model is a pretty good fit for the data, but clearly not a perfect match. That’s what real data is like; sometimes it doesn’t fit the model.

# Comparing PDFs

In the previous exercise, we used CDFs to see if the distribution of income is lognormal. We can make the same comparison using a PDF and KDE. That’s what you’ll do in this exercise!

Probability Density Function (PDF):

Boxplot and probability density function of a normal distribution*N*(0, *σ*^{2}).Geometric visualisation of the mode, median and mean of an arbitrary probability density function.^{[1]}

In probability theory, a **probability density function** (**PDF**), or **density** of a continuous random variable, is a function whose value at any given sample (or point) in the sample space (the set of possible values taken by the random variable) can be interpreted as providing a *relative likelihood* that the value of the random variable would be close to that sample.^{[2]}^{[3]} In other words, while the *absolute likelihood* for a continuous random variable to take on any particular value is 0 (since there is an infinite set of possible values to begin with), the value of the PDF at two different samples can be used to infer, in any particular draw of the random variable, how much more likely it is that the random variable would be close to one sample compared to the other sample.

See: https://en.wikipedia.org/wiki/Probability_density_function

As before, the `norm`

object `dist`

is available in your workspace:

```
from scipy.stats import norm
dist = norm(mean, std)
```

Just as all `norm`

objects have a `.cdf()`

method, they also have a `.pdf()`

method.

To create a KDE plot, you can use Seaborn’s `kdeplot()`

function. To learn more about this function and Seaborn, you can check out DataCamp’s Data Visualization with Seaborn course. Here, Seaborn has been imported for you as `sns`

.

##### Instructions

- Evaluate the normal PDF using
`dist`

, which is a`norm`

object with the same mean and standard deviation as the data. - Make a KDE plot of the logarithms of the incomes, using
`log_income`

, which is a Series object.

# Evaluate the normal PDF xs = np.linspace(2, 5.5) ys = dist.pdf(xs) # Plot the model PDF plt.clf() plt.plot(xs, ys, color='gray') # Plot the data KDE sns.kdeplot(log_income) # Label the axes plt.xlabel('log10 of realinc') plt.ylabel('PDF') plt.show()

We’ve seen several ways to vizualize and compare distributions: PMFs, CDFs, and KDE plots.

- Probability Mass Functions (PMFs): https://en.wikipedia.org/wiki/Probability_mass_function
- Cumulative Distribution Functions (CDFs): https://en.wikipedia.org/wiki/Cumulative_distribution_function
- Interquartile Range (IQR): https://en.wikipedia.org/wiki/Interquartile_range
- Probability Density Function (PDFs): https://en.wikipedia.org/wiki/Probability_density_function

# 3: Exploring Relationships

# PMF of age

Do people tend to gain weight as they get older? We can answer this question by visualizing the relationship between weight and age. But before we make a scatter plot, it is a good idea to visualize distributions one variable at a time. Here, you’ll visualize age using a bar chart first. Recall that all PMF objects have a `.bar()`

method to make a bar chart.

The BRFSS dataset includes a variable, `'AGE'`

(note the capitalization!), which represents each respondent’s age. To protect respondents’ privacy, ages are rounded off into 5-year bins. `'AGE'`

contains the midpoint of the bins.

##### Instructions

- Extract the variable
`'AGE'`

from the DataFrame`brfss`

and assign it to`age`

. - Get the PMF of
`age`

and plot it as a bar chart.

# Extract age age = brfss['AGE'] # Plot the PMF pmf_age = Pmf(age) pmf_age.plot() # Label the axes plt.xlabel('Age in years') plt.ylabel('PMF') plt.show()

Or

# Extract age age = brfss['AGE'] # Plot the PMF pmf_age = Pmf(age) pmf_age.bar() # Label the axes plt.xlabel('Age in years') plt.ylabel('PMF') plt.show()

Notice that the last age range is bigger than the others. That’s the kind of thing you see when you plot distributions.

# Scatter plot

Now let’s make a scatterplot of `weight`

versus `age`

. To make the code run faster, I’ve selected only the first 1000 rows from the `brfss`

DataFrame.

`weight`

and `age`

have already been extracted for you. Your job is to use `plt.plot()`

to make a scatter plot.

##### Instructions

- Make a scatter plot of
`weight`

and`age`

with format string`'o'`

and`alpha=0.1`

.

# Select the first 1000 respondents brfss = brfss[:1000] # Extract age and weight age = brfss['AGE'] weight = brfss['WTKG3'] # Make a scatter plot plt.plot(age, weight, 'o', alpha=0.1) plt.xlabel('Age in years') plt.ylabel('Weight in kg') plt.show()

By adjusting alpha we can avoid saturating the plot. Next we’ll jitter the data to break up the columns.

# Jittering

In the previous exercise, the ages fall in columns because they’ve been rounded into 5-year bins. If we jitter them, the scatter plot will show the relationship more clearly. Recall how Allen jittered `height`

and `weight`

in the video:

```
height_jitter = height + np.random.normal(0, 2, size=len(brfss))
weight_jitter = weight + np.random.normal(0, 2, size=len(brfss))
```

##### Instructions

- Add random noise to
`age`

with mean`0`

and standard deviation`2.5`

. - Make a scatter plot between
`weight`

and`age`

with marker size`5`

and`alpha=0.2`

. Be sure to also specify`'o'`

.

# Select the first 1000 respondents brfss = brfss[:1000] # Add jittering to age age = brfss['AGE'] + np.random.normal(0, 2.5, size=len(brfss)) # Extract weight weight = brfss['WTKG3'] # Make a scatter plot plt.plot(age, weight, 'o', markersize=5, alpha=0.2) plt.xlabel('Age in years') plt.ylabel('Weight in kg') plt.show()p

By smoothing out the ages and avoiding saturation, we get the best view of the data. But in this case the nature of the relationship is still hard to see. In the next lesson, we’ll see some other ways to visualize it.

# Height and weight

Previously we looked at a scatter plot of height and weight, and saw that taller people tend to be heavier. Now let’s take a closer look using a box plot. The `brfss`

DataFrame contains a variable `'_HTMG10'`

that represents height in centimeters, binned into 10 cm groups.

Recall how Allen created the box plot of `'AGE'`

and `'WTKG3'`

in the video, with the y-axis on a logarithmic scale:

```
sns.boxplot(x='AGE', y='WTKG3', data=data, whis=10)
plt.yscale('log')
```

##### Instructions

- Fill in the parameters of
`.boxplot()`

to plot the distribution of weight (`'WTKG3'`

) in each height (`'_HTMG10'`

) group. Specify`whis=10`

, just as was done in the video. - Add a line to plot the y-axis on a logarithmic scale.

# Drop rows with missing data data = brfss.dropna(subset=['_HTMG10', 'WTKG3']) # Make a box plot sns.boxplot(x='_HTMG10', y='WTKG3', data=data, whis=10) # Plot the y-axis on a log scale plt.yscale('log') # Remove unneeded lines and label axes sns.despine(left=True, bottom=True) plt.xlabel('Height in cm') plt.ylabel('Weight in kg') plt.show()

These box plots provide a good view of the relationship between the variables. They also show the spread of the values in each column.

# Distribution of income

In the next two exercises we’ll look at relationships between income and other variables. In the BRFSS, income is represented as a categorical variable; that is, respondents are assigned to one of 8 income categories. The variable name is `'INCOME2'`

. Before we connect income with anything else, let’s look at the distribution by computing the PMF. Recall that all `Pmf`

objects have a `.bar()`

method.

##### Instructions

- Extract
`'INCOME2'`

from the`brfss`

DataFrame and assign it to`income`

. - Plot the PMF of
`income`

as a bar chart.

# Extract income income = brfss['INCOME2'] # Plot the PMF pmf_income = Pmf(income).bar() # Label the axes plt.xlabel('Income level') plt.ylabel('PMF') plt.show()

Almost half of the respondents are in the top income category, so this dataset doesn’t distinguish between the highest incomes and the median. But maybe it can tell us something about people with incomes below the median.

# Income and height

Let’s now use a violin plot to visualize the relationship between income and height.

##### Instructions

- Create a violin plot to plot the distribution of height (
`'HTM4'`

) in each income (`'INCOME2'`

) group. Specify`inner=None`

to simplify the plot.

# Drop rows with missing data data = brfss.dropna(subset=['INCOME2', 'HTM4']) # Make a violin plot sns.violinplot(x='INCOME2', y='HTM4', data=data, inner=None) # Remove unneeded lines and label axes sns.despine(left=True, bottom=True) plt.xlabel('Income level') plt.ylabel('Height in cm') plt.show()

It looks like there is a weak positive relationsip between income and height, at least for incomes below the median. In the next lesson we’ll see some ways to quantify the strength of this relationship.

# Computing correlations

The purpose of the BRFSS is to explore health risk factors, so it includes questions about diet. The variable `'_VEGESU1'`

represents the number of servings of vegetables respondents reported eating per day.

Let’s see how this variable relates to age and income.

##### Instructions

- From the
`brfss`

DataFrame, select the columns`'AGE'`

,`'INCOME2'`

, and`'_VEGESU1'`

. - Compute the correlation matrix for these variables.

# Select columns columns = ['AGE', 'INCOME2', '_VEGESU1'] subset = brfss[columns] # Compute the correlation matrix print(subset.corr())

<script.py> output: AGE INCOME2 _VEGESU1 AGE 1.000000 -0.015158 -0.009834 INCOME2 -0.015158 1.000000 0.119670 _VEGESU1 -0.009834 0.119670 1.000000

In the next exercise you’ll think about how to interpret these results.

# Interpreting correlations

In the previous exercise, the correlation between income and vegetable consumption is about `0.12`

. The correlation between age and vegetable consumption is about `-0.01`

.

Which of the following are correct interpretations of these results:

*A*: People with higher incomes eat more vegetables.*B*: The relationship between income and vegetable consumption is linear.*C*: Older people eat more vegetables.*D*: There could be a strong nonlinear relationship between age and vegetable consumption.

##### Answer the question

#### Possible Answers

- A and C only.
- B and D only.
- B and C only.
**A and D only.**

The correlation between income and vegetable consumption is small, but it suggests that there is a relationship. But a correlation close to 0 does mean there is no relationship.

# Income and vegetables

As we saw in a previous exercise, the variable `'_VEGESU1'`

represents the number of vegetable servings respondents reported eating per day.

Let’s estimate the slope of the relationship between vegetable consumption and income.

##### Instructions

- Extract the columns
`'INCOME2'`

and`'_VEGESU1'`

from`subset`

into`xs`

and`ys`

respectively. - Compute the simple linear regression of these variables.

from scipy.stats import linregress # Extract the variables subset = brfss.dropna(subset=['INCOME2', '_VEGESU1']) xs = subset['INCOME2'] ys = subset['_VEGESU1'] # Compute the linear regression res = linregress(xs, ys) print(res)

<script.py> output: LinregressResult( slope=0.06988048092105019, intercept=1.5287786243363106, rvalue=0.11967005884864107, pvalue=1.378503916247615e-238, stderr=0.002110976356332332 )

The estimated slope tells you the increase in vegetable servings from one income group to the next.

# Fit a line

Continuing from the previous exercise:

- Assume that
`xs`

and`ys`

contain income codes and daily vegetable consumption, respectively, and `res`

contains the results of a simple linear regression of`ys`

onto`xs`

.

Now, you’re going to compute the line of best fit. NumPy has been imported for you as `np`

.

##### Instructions

- Set
`fx`

to the minimum and maximum of`xs`

, stored in a NumPy array. - Set
`fy`

to the points on the fitted line that correspond to the`fx`

.

# Plot the scatter plot plt.clf() x_jitter = xs + np.random.normal(0, 0.15, len(xs)) plt.plot(x_jitter, ys, 'o', alpha=0.2) # Plot the line of best fit fx = np.array([xs.min(), xs.max()]) fy = res.intercept + res.slope * fx plt.plot(fx, fy, '-', alpha=0.7) plt.xlabel('Income code') plt.ylabel('Vegetable servings per day') plt.ylim([0, 6]) plt.show()

We’ve seen several ways to visualize relationships between variables and quantify their strength. In the next chapter we use regression to explore relationships among more than two variables.

# Regression and causation

In the BRFSS dataset, there is a strong relationship between vegetable consumption and income. The income of people who eat 8 servings of vegetables per day is double the income of people who eat none, on average.

Which of the following conclusions can we draw from this data?

A. Eating a good diet leads to better health and higher income.

B. People with higher income can afford a better diet.

C. People with high income are more likely to be vegetarians.

##### Answer the question

#### Possible Answers

- A only.
- B only.
- B and C.
**None of them.**

This data is consistent with all of these conclusions, but it does not provide conclusive evidence for any of them.

# Using StatsModels

Let’s run the same regression using SciPy and StatsModels, and confirm we get the same results.

##### Instructions

- Compute the regression of
`'_VEGESU1'`

as a function of`'INCOME2'`

using SciPy’s`linregress()`

. - Compute the regression of
`'_VEGESU1'`

as a function of`'INCOME2'`

using StatsModels’`smf.ols()`

.

from scipy.stats import linregress import statsmodels.formula.api as smf # Run regression with linregress subset = brfss.dropna(subset=['INCOME2', '_VEGESU1']) xs = subset['INCOME2'] ys = subset['_VEGESU1'] res = linregress(xs, ys) print(res) # Run regression with StatsModels results = smf.ols('_VEGESU1 ~ INCOME2', data = brfss).fit() print(results.params)

<script.py> output: LinregressResult(slope=0.06988048092105019, intercept=1.5287786243363106, rvalue=0.11967005884864107, pvalue=1.378503916247615e-238, stderr=0.002110976356332332) Intercept 1.528779 INCOME2 0.069880 dtype: float64

When you start working with a new library, checks like this help ensure that you are doing it right.

# Plot income and education

To get a closer look at the relationship between income and education, let’s use the variable `'educ'`

to group the data, then plot mean income in each group.

Here, the GSS dataset has been pre-loaded into a DataFrame called `gss`

.

##### Instructions

- Group
`gss`

by`'educ'`

. Store the result in`grouped`

. - From
`grouped`

, extract`'realinc'`

and compute the mean. - Plot
`mean_income_by_educ`

as a scatter plot. Specify`'o'`

and`alpha=0.5`

.

# Group by educ grouped = gss.groupby('educ') # Compute mean income in each group mean_income_by_educ = grouped['realinc'].mean() # Plot mean income as a scatter plot plt.plot(mean_income_by_educ, 'o', alpha=0.5) # Label the axes plt.xlabel('Education (years)') plt.ylabel('Income (1986 $)') plt.show()

It looks like the relationship between income and education is non-linear.

# Non-linear model of education

The graph in the previous exercise suggests that the relationship between income and education is non-linear. So let’s try fitting a non-linear model.

##### Instructions

- Add a column named
`'educ2'`

to the`gss`

DataFrame; it should contain the values from`'educ'`

squared. - Run a regression model that uses
`'educ'`

,`'educ2'`

,`'age'`

, and`'age2'`

to predict`'realinc'`

.

import statsmodels.formula.api as smf # Add a new column with educ squared gss['educ2'] = gss['educ']**2 # Run a regression model with educ, educ2, age, and age2 results = smf.ols('realinc ~ educ + educ2 + age + age2', data=gss).fit() # Print the estimated parameters print(results.params)

<script.py> output: Intercept -23241.884034 educ -528.309369 educ2 159.966740 age 1696.717149 age2 -17.196984 dtype: float64

The slope associated with `educ2`

is positive, so the model curves upward.

# Making predictions

At this point, we have a model that predicts income using age, education, and sex.

Let’s see what it predicts for different levels of education, holding `age`

constant.

##### Instructions

- Using
`np.linspace()`

, add a variable named`'educ'`

to`df`

with a range of values from`0`

to`20`

. - Add a variable named
`'age'`

with the constant value`30`

. - Use
`df`

to generate predicted income as a function of education.

# Run a regression model with educ, educ2, age, and age2 results = smf.ols('realinc ~ educ + educ2 + age + age2', data=gss).fit() # Make the DataFrame df = pd.DataFrame() df['educ'] = np.linspace(0,20) df['age'] = 30 df['educ2'] = df['educ']**2 df['age2'] = df['age']**2 # Generate and plot the predictions pred = results.predict(df) print(pred.head())

<script.py> output: 0 12182.344976 1 11993.358518 2 11857.672098 3 11775.285717 4 11746.199374 dtype: float64

Now let’s see what the results look like.

# Visualizing predictions

Now let’s visualize the results from the previous exercise!

##### Instructions

- Plot
`mean_income_by_educ`

using circles (`'o'`

). Specify an`alpha`

of`0.5`

. - Plot the prediction results with a line, with
`df['educ']`

on the x-axis and`pred`

on the y-axis.

# Plot mean income in each age group plt.clf() grouped = gss.groupby('educ') mean_income_by_educ = grouped['realinc'].mean() plt.plot(mean_income_by_educ, 'o', alpha=0.5) # Plot the predictions pred = results.predict(df) plt.plot(df['educ'], pred, label='Age 30') # Label axes plt.xlabel('Education (years)') plt.ylabel('Income (1986 $)') plt.legend() plt.show()

Looks like this model captures the relationship pretty well. Nice job.

Logistic Regression is a powerful tool to analyse relations between binary variable and the features that predict it.

# Predicting a binary variable

Let’s use logistic regression to predict a binary variable. Specifically, we’ll use age, sex, and education level to predict support for legalizing cannabis (marijuana) in the U.S.

In the GSS dataset, the variable `grass`

records the answer to the question “Do you think the use of marijuana should be made legal or not?”

##### Instructions 1/4

- 1/4 Fill in the parameters of
`smf.logit()`

to predict`grass`

using the variables`age`

,`age2`

,`educ`

, and`educ2`

, along with`sex`

as a categorical variable.

# Recode grass gss['grass'].replace(2, 0, inplace=True) # Run logistic regression results = smf.logit('grass ~ age + age2 + educ + educ2 + C(sex)', data=gss).fit() results.params

<script.py> output: Optimization terminated successfully. Current function value: 0.588510 Iterations 6

- 2/4 Add a column called
`educ`

and set it to 12 years; then compute a second column,`educ2`

, which is the square of`educ`

.

# Set the education level to 12 df['educ'] = 12 df['educ2'] = df['educ']**2

<script.py> output: Optimization terminated successfully. Current function value: 0.588510 Iterations 6

3/4 Generate separate predictions for men and women.

# Recode grass gss['grass'].replace(2, 0, inplace=True) # Run logistic regression results = smf.logit('grass ~ age + age2 + educ + educ2 + C(sex)', data=gss).fit() results.params # Make a DataFrame with a range of ages df = pd.DataFrame() df['age'] = np.linspace(18, 89) df['age2'] = df['age']**2 # Set the education level to 12 df['educ'] = 12 df['educ2'] = df['educ']**2 # Generate predictions for men and women df['sex'] = 1 pred1 = results.predict(df) df['sex'] = 2 pred2 = results.predict(df)

<script.py> output: Optimization terminated successfully. Current function value: 0.588510 Iterations 6

4/4 Fill in the missing code to compute the mean of `'grass'`

for each age group, and then the arguments of `plt.plot()`

to plot `pred2`

versus `df['age']`

with the label `'Female'`

.

# Recode grass gss['grass'].replace(2, 0, inplace=True) # Run logistic regression results = smf.logit('grass ~ age + age2 + educ + educ2 + C(sex)', data=gss).fit() results.params # Make a DataFrame with a range of ages df = pd.DataFrame() df['age'] = np.linspace(18, 89) df['age2'] = df['age']**2 # Set the education level to 12 df['educ'] = 12 df['educ2'] = df['educ']**2 # Generate predictions for men and women df['sex'] = 1 pred1 = results.predict(df) df['sex'] = 2 pred2 = results.predict(df) plt.clf() grouped = gss.groupby('age') favor_by_age = grouped['grass'].mean() plt.plot(favor_by_age, 'o', alpha=0.5) plt.plot(df['age'], pred1, label='Male') plt.plot(df['age'], pred2, label='Female') plt.xlabel('Age') plt.ylabel('Probability of favoring legalization') plt.legend() plt.show()

Congratulations on completing this course. I hope you enjoyed it and learned a lot. Should you wish to use the `Pmf`

and `Cdf`

classes from this course in your own work, you can download the `empiricaldist`

library here.