Introduction to Importing Data in Python

In this chapter, you’ll learn how to import data into Python from all types of flat files, which are a simple and prevalent form of data storage. You’ve previously learned how to use NumPy and pandas—you will learn how to use these packages to import flat files and customize your imports.

Exploring your working directory

In order to import data into Python, you should first have an idea of what files are in your working directory.

IPython, which is running on DataCamp’s servers, has a bunch of cool commands, including its magic commands. For example, starting a line with ! gives you complete system shell access. This means that the IPython magic command ! ls will display the contents of your current directory. Your task is to use the IPython magic command ! ls to check out the contents of your current directory and answer the following question: which of the following files is in your working directory?

In [1]:
! ls
moby_dick.txt

Importing entire text files

In this exercise, you’ll be working with the file moby_dick.txt. It is a text file that contains the opening sentences of Moby Dick, one of the great American novels! Here you’ll get experience opening a text file, printing its contents to the shell and, finally, closing it.

Instructions
  • Open the file moby_dick.txt as read-only and store it in the variable file. Make sure to pass the filename enclosed in quotation marks ''.
  • Print the contents of the file to the shell using the print() function. As Hugo showed in the video, you’ll need to apply the method read() to the object file.
  • Check whether the file is closed by executing print(file.closed).
  • Close the file using the close() method.
  • Check again that the file is closed as you did above.
# Open a file: file
file = open('moby_dick.txt', 'r')

# Print it
print(file.read())

# Check whether file is closed
print(file.closed)

# Close file
file.close()

# Check whether file is closed
print(file.closed)
<script.py> output:
    CHAPTER 1. Loomings.
    
    Call me Ishmael. Some years ago--never mind how long precisely--having
    little or no money in my purse, and nothing particular to interest me on
    shore, I thought I would sail about a little and see the watery part of
    the world. It is a way I have of driving off the spleen and regulating
    the circulation. Whenever I find myself growing grim about the mouth;
    whenever it is a damp, drizzly November in my soul; whenever I find
    myself involuntarily pausing before coffin warehouses, and bringing up
    the rear of every funeral I meet; and especially whenever my hypos get
    such an upper hand of me, that it requires a strong moral principle to
    prevent me from deliberately stepping into the street, and methodically
    knocking people's hats off--then, I account it high time to get to sea
    as soon as I can. This is my substitute for pistol and ball. With a
    philosophical flourish Cato throws himself upon his sword; I quietly
    take to the ship. There is nothing surprising in this. If they but knew
    it, almost all men in their degree, some time or other, cherish very
    nearly the same feelings towards the ocean with me.
    False
    True

Importing text files line by line

For large files, you may not want to print all of their content to the shell: you may wish to print only the first few lines. Enter the readline() method, which allows you to do this. When a file called file is open, you can print out the first line by executing file.readline(). If you execute the same command again, the second line will print, and so on.

In the introductory video, Hugo also introduced the concept of a context manager. He showed that you can bind a variable file by using a context manager construct:

with open('huck_finn.txt') as file:

While still within this construct, the variable file will be bound to the file object returned by open('huck_finn.txt'); thus, to print the first line of the file to the shell, all the code you need to execute is:

with open('huck_finn.txt') as file:
    print(file.readline())

You’ll now use these tools to print the first few lines of moby_dick.txt!

Instructions
  • Open moby_dick.txt using the with context manager and the variable file.
  • Print the first three lines of the file to the shell by using readline() three times within the context manager.
# Read & print the first 3 lines
with open('moby_dick.txt') as file:
    print(file.readline())
    print(file.readline())
    print(file.readline())
CHAPTER 1. Loomings.



Call me Ishmael. Some years ago--never mind how long precisely--having

Using NumPy to import flat files

In this exercise, you’re now going to load the MNIST digit recognition dataset using the numpy function loadtxt() and see just how easy it can be:

  • The first argument will be the filename.
  • The second will be the delimiter which, in this case, is a comma.

You can find more information about the MNIST dataset here on the webpage of Yann LeCun, who is currently Director of AI Research at Facebook and Founding Director of the NYU Center for Data Science, among many other things.

Instructions
  • Fill in the arguments of np.loadtxt() by passing file and a comma ',' for the delimiter.
  • Fill in the argument of print() to print the type of the object digits. Use the function type().
  • Execute the rest of the code to visualize one of the rows of the data.
import numpy as np

# Assign filename to variable: file
file = 'digits.csv'

# Load file as array: digits
digits = np.loadtxt(file, delimiter=',')

# Print datatype of digits
print(type(digits))

# Select and reshape a row
im = digits[21, 1:]
im_sq = np.reshape(im, (28, 28))

# Plot reshaped data (matplotlib.pyplot already loaded as plt)
plt.imshow(im_sq, cmap='Greys', interpolation='nearest')
plt.show()
<class 'numpy.ndarray'>

Customizing your NumPy import

What if there are rows, such as a header, that you don’t want to import? What if your file has a delimiter other than a comma? What if you only wish to import particular columns?

There are a number of arguments that np.loadtxt() takes that you’ll find useful: 

  • delimiter changes the delimiter that loadtxt() is expecting.
    • You can use ',' for comma-delimited.
    • You can use '\t' for tab-delimited. 
  • skiprows allows you to specify how many rows (not indices) you wish to skip.
  • usecols takes a list of the indices of the columns you wish to keep.

The file that you’ll be importing, digits_header.txt, has a header and is tab-delimited.

Instructions
  • Complete the arguments of np.loadtxt(): the file you’re importing is tab-delimited, you want to skip the first row and you only want to import the first and third columns.
  • Complete the argument of the print() call in order to print the entire array that you just imported.
# Import numpy
import numpy as np

# Assign the filename: file
file = 'digits_header.txt'

# Load the data: data
data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0, 2])

# Print data
print(data)
<script.py> output:
    [[1. 0.]
     [0. 0.]
     [1. 0.]
     [4. 0.]
     [0. 0.]
     [0. 0.]
     [7. 0.]
     [3. 0.]
     [5. 0.]
     [3. 0.]
     [8. 0.]
     [9. 0.]
     [1. 0.]
     [3. 0.]
     [3. 0.]
     [1. 0.]
     [2. 0.]
     [0. 0.]
     [7. 0.]
     [5. 0.]
     [8. 0.]
     [6. 0.]
     [2. 0.]
     [0. 0.]
     [2. 0.]
     [3. 0.]
     [6. 0.]
     [9. 0.]
     [9. 0.]
     [7. 0.]
     [8. 0.]
     [9. 0.]
     [4. 0.]
     [9. 0.]
     [2. 0.]
     [1. 0.]
     [3. 0.]
     [1. 0.]
     [1. 0.]
     [4. 0.]
     [9. 0.]
     [1. 0.]
     [4. 0.]
     [4. 0.]
     [2. 0.]
     [6. 0.]
     [3. 0.]
     [7. 0.]
     [7. 0.]
     [4. 0.]
     [7. 0.]
     [5. 0.]
     [1. 0.]
     [9. 0.]
     [0. 0.]
     [2. 0.]
     [2. 0.]
     [3. 0.]
     [9. 0.]
     [1. 0.]
     [1. 0.]
     [1. 0.]
     [5. 0.]
     [0. 0.]
     [6. 0.]
     [3. 0.]
     [4. 0.]
     [8. 0.]
     [1. 0.]
     [0. 0.]
     [3. 0.]
     [9. 0.]
     [6. 0.]
     [2. 0.]
     [6. 0.]
     [4. 0.]
     [7. 0.]
     [1. 0.]
     [4. 0.]
     [1. 0.]
     [5. 0.]
     [4. 0.]
     [8. 0.]
     [9. 0.]
     [2. 0.]
     [9. 0.]
     [9. 0.]
     [8. 0.]
     [9. 0.]
     [6. 0.]
     [3. 0.]
     [6. 0.]
     [4. 0.]
     [6. 0.]
     [2. 0.]
     [9. 0.]
     [1. 0.]
     [2. 0.]
     [0. 0.]
     [5. 0.]]

Importing different datatypes

The file seaslug.txt

  • has a text header, consisting of strings
  • is tab-delimited.

These data consist of the percentage of sea slug larvae that had metamorphosed in a given time period. Read more here.

Due to the header, if you tried to import it as-is using np.loadtxt(), Python would throw you a ValueError and tell you that it could not convert string to float. There are two ways to deal with this: firstly, you can set the data type argument dtype equal to str (for string).

Alternatively, you can skip the first row as we have seen before, using the skiprows argument.

Instructions
  • Complete the first call to np.loadtxt() by passing file as the first argument.
  • Execute print(data[0]) to print the first element of data.
  • Complete the second call to np.loadtxt(). The file you’re importing is tab-delimited, the datatype is float, and you want to skip the first row.
  • Print the 10th element of data_float by completing the print() command. Be guided by the previous print() call.
  • Execute the rest of the code to visualize the data.
# Assign filename: file
file = 'seaslug.txt'

# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)

# Print the first element of data
print(data[0])

# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)

# Print the 10th element of data_float
print(data_float[9])

# Plot a scatterplot of the data
plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('time (min.)')
plt.ylabel('percentage of larvae')
plt.show()
<script.py> output:
    ['Time' 'Percent']
    ['0' '0.5']

Working with mixed datatypes (1)

Much of the time you will need to import datasets which have different datatypes in different columns; one column may contain strings and another floats, for example. The function np.loadtxt() will choke on this. There is another function, np.genfromtxt(), which can handle such structures. If you pass dtype=None to it, it will figure out what types each column should be.

Import 'titanic.csv' using the function np.genfromtxt() as follows:

data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)

Here, the first argument is the filename, the second specifies the delimiter ',' and the third argument names tells us there is a header. Because the data are of different types, data is an object called a structured array. Because numpy arrays have to contain elements that are all the same type, the structured array solves this by being a 1D array, where each element of the array is a row of the flat file imported. You can test this by checking out the array’s shape in the shell by executing np.shape(data).

Accessing rows and columns of structured arrays is super-intuitive: to get the ith row, merely execute data[i] and to get the column with name 'Fare', execute data['Fare'].

After importing the Titanic data as a structured array (as per the instructions above), print the entire column with the name Survived to the shell. What are the last 4 values of this column?

In [1]:
data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)
In [2]:
print(data['Fare'])
In [3]:
print(data['Survived'])
[0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1 0 1 0 1 1 1 0 1 0 0 1 0 0 1 1 0 0 0 1
 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 1 0 1 1 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 0 0
 1 0 0 0 1 1 0 1 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0
 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1
 0 1 1 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 1 0 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 0
 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 1
 1 0 1 0 0 0 0 0 1 1 1 0 1 1 0 1 1 0 0 0 1 0 0 0 1 0 0 1 0 1 1 1 1 0 0 0 0
 0 0 1 1 1 1 0 1 0 1 1 1 0 1 1 1 0 0 0 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 1 0 0
 0 1 0 0 1 1 0 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1 1 1
 1 0 0 0 0 1 1 0 0 0 1 1 0 1 0 0 0 1 0 1 1 1 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0
 1 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 1 1 1 0 0 1 0 1 0 0 1 0 0 1
 1 1 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 1 0
 0 0 1 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 1 0 1 1 0 1 1 0 0 1 0
 1 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1 0 1 0 1 0 1 1 0 0 1 0 0 1 1 0 1 1 0 0 1 1
 0 1 0 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 1 1 0 1 1 1 0 0 0 1 0 1 0 0 0 1
 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 0 1 0 0 1 0 0 1 1 0 0 0 0 1 0
 0 1 0 1 0 0 1 0 0 0 0 0 1 0 1 1 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 1 0 0 0 1 0
 0 0 0 1 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 0 1 1 0
 0 0 0 1 1 1 1 1 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0
 1 0 1 0 1 0 0 1 0 0 1 1 0 0 1 1 0 0 0 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0
 1 0 1 1 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0
 0 0 0 0 0 0 1 1 0 1 0 0 0 1 1 1 1 1 0 0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0
 0 0 1 0 1 1 1 1 0 0 0 1 0 0 1 1 0 0 1 0 1 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 1
 0 1 0]

1, 0, 1, 0

Working with mixed datatypes (2)

You have just used np.genfromtxt() to import data containing mixed datatypes. There is also another function np.recfromcsv() that behaves similarly to np.genfromtxt(), except that its default dtype is None. In this exercise, you’ll practice using this to achieve the same result.

Instructions
  • Import titanic.csv using the function np.recfromcsv() and assign it to the variable d. You’ll only need to pass file to it because it has the defaults delimiter=',' and names=True in addition to dtype=None!
  • Run the remaining code to print the first three entries of the resulting array d.
# Assign the filename: file
file = 'titanic.csv'

# Import file using np.recfromcsv: d
d = np.recfromcsv(file)

# Print out first three entries of d
print(d[:3])
[(1, 0, 3, b'male', 22., 1, 0, b'A/5 21171',  7.25  , b'', b'S')
 (2, 1, 1, b'female', 38., 1, 0, b'PC 17599', 71.2833, b'C85', b'C')
 (3, 1, 3, b'female', 26., 0, 0, b'STON/O2. 3101282',  7.925 , b'', b'S')]

Using pandas to import flat files as DataFrames (1)

In the last exercise, you were able to import flat files containing columns with different datatypes as numpy arrays. However, the DataFrame object in pandas is a more appropriate structure in which to store such data and, thankfully, we can easily import files of mixed data types as DataFrames using the pandas functions read_csv() and read_table().

Instructions
  • Import the pandas package using the alias pd.
  • Read titanic.csv into a DataFrame called df. The file name is already stored in the file object.
  • In a print() call, view the head of the DataFrame.
# Import pandas as pd
import pandas as pd

# Assign the filename: file
file = 'titanic.csv'

# Read the file into a DataFrame: df
df = pd.read_csv(file)

# View the head of the DataFrame
print(df.head())
  PassengerId  Survived  Pclass     Sex   Age  ...  Parch            Ticket     Fare  Cabin Embarked
0            1         0       3    male  22.0  ...      0         A/5 21171   7.2500    NaN        S
1            2         1       1  female  38.0  ...      0          PC 17599  71.2833    C85        C
2            3         1       3  female  26.0  ...      0  STON/O2. 3101282   7.9250    NaN        S
3            4         1       1  female  35.0  ...      0            113803  53.1000   C123        S
4            5         0       3    male  35.0  ...      0            373450   8.0500    NaN        S

[5 rows x 11 columns]

Using pandas to import flat files as DataFrames (2)

In the last exercise, you were able to import flat files into a pandas DataFrame. As a bonus, it is then straightforward to retrieve the corresponding numpy array: classically via the attribute values, or, in newer pandas versions, via the method to_numpy(). You’ll now have a chance to do this using the MNIST dataset, which is available as digits.csv.

Instructions

  • Import the first 5 rows of the file into a DataFrame using the function pd.read_csv() and assign the result to data. You’ll need to use the arguments nrows and header (there is no header in this file).
  • Build a numpy array from the resulting DataFrame in data and assign to data_array.
  • Execute print(type(data_array)) to print the datatype of data_array.
# Assign the filename: file
file = 'digits.csv'

# Read the first 5 rows of the file into a DataFrame: data
data = pd.read_csv(file, nrows=5, header=None)

# Build a numpy array from the DataFrame: data_array
data_array = data.to_numpy()

# Print the datatype of data_array to the shell
print(type(data_array))
<script.py> output:
    <class 'numpy.ndarray'>

Customizing your pandas import

The pandas package is also great at dealing with many of the issues you will encounter when importing data as a data scientist, such as comments occurring in flat files, empty lines and missing values. Note that missing values are also commonly referred to as NA or NaN. To wrap up this chapter, you’re now going to import a slightly corrupted copy of the Titanic dataset titanic_corrupt.txt, which

  • contains comments after the character '#'
  • is tab-delimited.
Instructions
  • Complete the sep (the pandas version of delim), comment and na_values arguments of pd.read_csv(): comment takes characters that comments occur after in the file, which in this case is '#'; na_values takes a list of strings to recognize as NA/NaN, in this case the string 'Nothing'.
  • Execute the rest of the code to print the head of the resulting DataFrame and plot the histogram of the 'Age' of passengers aboard the Titanic.
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt

# Assign filename: file
file = 'titanic_corrupt.txt'

# Import file: data
data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing')

# Print the head of the DataFrame
print(data.head())

# Plot 'Age' variable in a histogram
pd.DataFrame.hist(data[['Age']])
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()
   PassengerId  Survived  Pclass     Sex   Age  ...  Parch            Ticket    Fare  Cabin Embarked
0            1         0       3    male  22.0  ...      0         A/5 21171   7.250    NaN       S 
1            2         1       1  female  38.0  ...      0          PC 17599     NaN    NaN      NaN
2            3         1       3  female  26.0  ...      0  STON/O2. 3101282   7.925    NaN        S
3            4         1       1  female  35.0  ...      0            113803  53.100   C123        S
4            5         0       3    male  35.0  ...      0            373450   8.050    NaN        S

[5 rows x 11 columns]

Not so flat any more

In Chapter 1, you learned how to use the IPython magic command ! ls to explore your current working directory. You can also do this natively in Python using the library os, which consists of miscellaneous operating system interfaces.

The first line of the following code imports the library os, the second line stores the name of the current directory in a string called wd and the third outputs the contents of the directory in a list to the shell.

import os
wd = os.getcwd()
os.listdir(wd)

Run this code in the IPython shell and answer the following questions. Ignore the files that begin with a dot (.).

Check out the contents of your current directory and answer the following questions: (1) which file is in your directory and NOT an example of a flat file; (2) why is it not a flat file?

In [1]:
import os
wd = os.getcwd()
os.listdir(wd)

Out[1]:
['titanic.txt', 'battledeath.xlsx']


battledeath.xlsx is not a flat file because it is a spreadsheet consisting of many sheets, not a single table.

Loading a pickled file

There are a number of datatypes that cannot be saved easily to flat files, such as lists and dictionaries. If you want your files to be human readable, you may want to save them as text files in a clever manner. JSONs, which you will see in a later chapter, are appropriate for Python dictionaries.

However, if you merely want to be able to import them into Python, you can serialize them. All this means is converting the object into a sequence of bytes, or a bytestream.
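
As an aside (a minimal sketch, not from the course, showing the other half of the round trip): a file like data.pkl could have been created by serializing a dictionary with pickle.dump(). The dictionary below is the one this exercise loads; the writing step itself is an assumption.

import pickle

# Serialize a dictionary to a binary file: 'wb' opens the file for writing bytes
d = {'June': '69.4', 'Aug': '85', 'Airline': '8', 'Mar': '84.4'}
with open('data.pkl', 'wb') as f:
    pickle.dump(d, f)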

In this exercise, you’ll import the pickle package, open a previously pickled data structure from a file and load it.

Instructions
  • Import the pickle package.
  • Complete the second argument of open() so that it is read only for a binary file. This argument will be a string of two letters, one signifying ‘read only’, the other ‘binary’.
  • Pass the correct argument to pickle.load(); it should use the variable file that is bound to the open file.
  • Print the data, d.
  • Print the datatype of d; take your mind back to your previous use of the function type().
# Import pickle package
import pickle

# Open pickle file and load data: d
with open('data.pkl', 'rb') as file:
    d = pickle.load(file)

# Print d
print(d)

# Print datatype of d
print(type(d))

<script.py> output:
    {'June': '69.4', 'Aug': '85', 'Airline': '8', 'Mar': '84.4'}
    <class 'dict'>

Listing sheets in Excel files

Whether you like it or not, any working data scientist will need to deal with Excel spreadsheets at some point in time. You won’t always want to do so in Excel, however!

Here, you’ll learn how to use pandas to import Excel spreadsheets and how to list the names of the sheets in any loaded .xlsx file.

Recall from the video that, given an Excel file imported into a variable spreadsheet, you can retrieve a list of the sheet names using the attribute spreadsheet.sheet_names.

Specifically, you’ll be loading and checking out the spreadsheet 'battledeath.xlsx', modified from the Peace Research Institute Oslo’s (PRIO) dataset. This data contains age-adjusted mortality rates due to war in various countries over several years.

Instructions
  • Assign the spreadsheet filename (provided above) to the variable file.
  • Pass the correct argument to pd.ExcelFile() to load the file using pandas, assigning the result to the variable xls.
  • Print the sheet names of the Excel spreadsheet by passing the necessary argument to the print() function.
# Import pandas
import pandas as pd

# Assign spreadsheet filename: file
file = 'battledeath.xlsx'

# Load spreadsheet: xls
xls = pd.ExcelFile(file)

# Print sheet names
print(xls.sheet_names)
['2002', '2004']

Importing sheets from Excel files

In the previous exercises, you saw that the Excel file contains two sheets, '2002' and '2004'. The next step is to import these.

In this exercise, you’ll learn how to import any given sheet of your loaded .xlsx file as a DataFrame. You’ll be able to do so by specifying either the sheet’s name or its index.

The spreadsheet 'battledeath.xlsx' is already loaded as xls.

Instructions
  • Load the sheet '2004' into the DataFrame df1 using its name as a string.
  • Print the head of df1 to the shell.
  • Load the sheet 2002 into the DataFrame df2 using its index (0).
  • Print the head of df2 to the shell.
# Load a sheet into a DataFrame by name: df1
df1 = xls.parse('2004')

# Print the head of the DataFrame df1
print(df1.head())

# Load a sheet into a DataFrame by index: df2
df2 = xls.parse(0)

# Print the head of the DataFrame df2
print(df2.head())
  War(country)      2004
0  Afghanistan  9.451028
1      Albania  0.130354
2      Algeria  3.407277
3      Andorra  0.000000
4       Angola  2.597931
  War, age-adjusted mortality due to       2002
0                        Afghanistan  36.083990
1                            Albania   0.128908
2                            Algeria  18.314120
3                            Andorra   0.000000
4                             Angola  18.964560

Customizing your spreadsheet import

Here, you’ll parse your spreadsheets and use additional arguments to skip rows, rename columns and select only particular columns.

The spreadsheet 'battledeath.xlsx' is already loaded as xls.

As before, you’ll use the method parse(). This time, however, you’ll add the additional arguments skiprows, names and usecols. These skip rows, name the columns and designate which columns to parse, respectively. All these arguments can be assigned to lists containing the specific row numbers, strings and column numbers, as appropriate.

Instructions
  • Parse the first sheet by index. In doing so, skip the first row of data and name the columns 'Country' and 'AAM due to War (2002)' using the argument names. The values passed to skiprows and names all need to be of type list.
  • Parse the second sheet by index. In doing so, parse only the first column with the usecols parameter, skip the first row and rename the column 'Country'. The argument passed to usecols also needs to be of type list.
# Parse the first sheet and rename the columns: df1
df1 = xls.parse(0, skiprows=[0], names=['Country', 'AAM due to War (2002)'])

# Print the head of the DataFrame df1
print(df1.head())

# Parse the first column of the second sheet and rename the column: df2
df2 = xls.parse(1, usecols=[0], skiprows=[0], names=['Country'])

# Print the head of the DataFrame df2
print(df2.head())

               Country  AAM due to War (2002)
0              Albania               0.128908
1              Algeria              18.314120
2              Andorra               0.000000
3               Angola              18.964560
4  Antigua and Barbuda               0.000000

               Country
0              Albania
1              Algeria
2              Andorra
3               Angola
4  Antigua and Barbuda

Importing SAS files

In this exercise, you’ll figure out how to import a SAS file as a DataFrame using SAS7BDAT and pandas. The file 'sales.sas7bdat' is already in your working directory and both pandas and matplotlib.pyplot have already been imported as follows:

import pandas as pd
import matplotlib.pyplot as plt

The data are adapted from the website of the undergraduate text book Principles of Econometrics by Hill, Griffiths and Lim.

Instructions
  • Import the module SAS7BDAT from the library sas7bdat.
  • In the context of the file 'sales.sas7bdat', load its contents to a DataFrame df_sas, using the method to_data_frame() on the object file.
  • Print the head of the DataFrame df_sas.
  • Execute your entire script to produce a histogram plot!
# Import sas7bdat package
from sas7bdat import SAS7BDAT

# Save file to a DataFrame: df_sas
with SAS7BDAT('sales.sas7bdat') as file:
    df_sas = file.to_data_frame()

# Print head of DataFrame
print(df_sas.head())

# Plot histogram of DataFrame features (pandas and pyplot already imported)
pd.DataFrame.hist(df_sas[['P']])
plt.ylabel('count')
plt.show()
<script.py> output:
         YEAR     P           S
    0  1950.0  12.9  181.899994
    1  1951.0  11.9  245.000000
    2  1952.0  10.7  250.199997
    3  1953.0  11.3  265.899994
    4  1954.0  11.2  248.500000

Importing Stata files

Here, you’ll gain expertise in importing Stata files as DataFrames using the pd.read_stata() function from pandas. The last exercise’s file, 'disarea.dta', is still in your working directory.

Instructions
  • Use pd.read_stata() to load the file 'disarea.dta' into the DataFrame df.
  • Print the head of the DataFrame df.
  • Visualize your results by plotting a histogram of the column disa10. We’ve already provided this code for you, so just run it!
# Import pandas
import pandas as pd

# Load Stata file into a pandas DataFrame: df
df = pd.read_stata('disarea.dta')

# Print the head of the DataFrame df
print(df.head())

# Plot histogram of one column of the DataFrame
pd.DataFrame.hist(df[['disa10']])
plt.xlabel('Extent of disease')
plt.ylabel('Number of countries')
plt.show()
<script.py> output:
      wbcode               country  disa1  disa2  disa3  ...  disa21  disa22  disa23  disa24  disa25
    0    AFG           Afghanistan   0.00   0.00   0.76  ...     0.0    0.00    0.02    0.00    0.00
    1    AGO                Angola   0.32   0.02   0.56  ...     0.0    0.99    0.98    0.61    0.00
    2    ALB               Albania   0.00   0.00   0.02  ...     0.0    0.00    0.00    0.00    0.16
    3    ARE  United Arab Emirates   0.00   0.00   0.00  ...     0.0    0.00    0.00    0.00    0.00
    4    ARG             Argentina   0.00   0.24   0.24  ...     0.0    0.00    0.01    0.00    0.11
    
    [5 rows x 27 columns]

Using h5py to import HDF5 files

The file 'LIGO_data.hdf5' is already in your working directory. In this exercise, you’ll import it using the h5py library. You’ll also print out its datatype to confirm you have imported it correctly. You’ll then study the structure of the file in order to see precisely what HDF groups it contains.

You can find the LIGO data plus loads of documentation and tutorials here. There is also a great tutorial on Signal Processing with the data here.

Instructions
  • Import the package h5py.
  • Assign the name of the file to the variable file.
  • Load the file as read only into the variable data.
  • Print the datatype of data.
  • Print the names of the groups in the HDF5 file 'LIGO_data.hdf5'.
# Import packages
import numpy as np
import h5py

# Assign filename: file
file = 'LIGO_data.hdf5'

# Load file: data
data = h5py.File(file, 'r')

# Print the datatype of the loaded file
print(type(data))

# Print the keys of the file
for key in data.keys():
    print(key)
<class 'h5py._hl.files.File'>
meta
quality
strain

Extracting data from your HDF5 file

In this exercise, you’ll extract some of the LIGO experiment’s actual data from the HDF5 file and you’ll visualize it.

To do so, you’ll need to first explore the HDF5 group 'strain'.

Instructions
  • Assign the HDF5 group data['strain'] to group.
  • In the for loop, print out the keys of the HDF5 group in group.
  • Assign to the variable strain the values of the time series data data['strain']['Strain'] using the attribute .value.
  • Set num_samples equal to 10000, the number of time points we wish to sample.
  • Execute the rest of the code to produce a plot of the time series data in LIGO_data.hdf5.
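
The exercise code did not survive in this copy of the page, so what follows is a hedged reconstruction based on the instructions above. The group and dataset names ('strain', 'Strain') and num_samples come from the instructions; the time vector and axis labels are assumptions. Note that the .value attribute mentioned above was removed in h5py 3.x; indexing with [()] is the modern equivalent.

# Import plotting package (numpy is already imported as np, data already loaded)
import matplotlib.pyplot as plt

# Get the HDF5 group: group
group = data['strain']

# Check out keys of group
for key in group.keys():
    print(key)

# Set variable equal to time series data: strain
strain = data['strain']['Strain'][()]  # older h5py: data['strain']['Strain'].value

# Set number of time points to sample: num_samples
num_samples = 10000

# Set time vector (assumption: the sampled points span one second)
time = np.arange(0, 1, 1/num_samples)

# Plot data
plt.plot(time, strain[:num_samples])
plt.xlabel('GPS Time (s)')
plt.ylabel('strain')
plt.show()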

Loading .mat files

In this exercise, you’ll figure out how to load a MATLAB file using scipy.io.loadmat() and you’ll discover what Python datatype it yields.

The file 'albeck_gene_expression.mat' is in your working directory. This file contains gene expression data from the Albeck Lab at UC Davis. You can find the data and some great documentation here.

Instructions
  • Import the package scipy.io.
  • Load the file 'albeck_gene_expression.mat' into the variable mat; do so using the function scipy.io.loadmat().
  • Use the function type() to print the datatype of mat to the IPython shell.
# Import package
import scipy.io

# Load MATLAB file: mat
mat = scipy.io.loadmat('albeck_gene_expression.mat')

# Print the datatype type of mat
print(type(mat))
<class 'dict'>

The structure of .mat in Python

Here, you’ll discover what is in the MATLAB dictionary that you loaded in the previous exercise.

The file 'albeck_gene_expression.mat' is already loaded into the variable mat. The following libraries have already been imported as follows:

import scipy.io
import matplotlib.pyplot as plt
import numpy as np

Once again, this file contains gene expression data from the Albeck Lab at UC Davis. You can find the data and some great documentation here.

Instructions
  • Use the method .keys() on the dictionary mat to print the keys. Most of these keys (in fact the ones that do NOT begin and end with ‘__’) are variables from the corresponding MATLAB environment.
  • Print the type of the value corresponding to the key 'CYratioCyt' in mat. Recall that mat['CYratioCyt'] accesses the value.
  • Print the shape of the value corresponding to the key 'CYratioCyt' using the numpy function shape().
  • Execute the entire script to see some oscillatory gene expression data!
# Print the keys of the MATLAB dictionary
print(mat.keys())

# Print the type of the value corresponding to the key 'CYratioCyt'
print(type(mat['CYratioCyt']))

# Print the shape of the value corresponding to the key 'CYratioCyt'
print(np.shape(mat['CYratioCyt']))

# Subset the array and plot it
data = mat['CYratioCyt'][25, 5:]
fig = plt.figure()
plt.plot(data)
plt.xlabel('time (min.)')
plt.ylabel('normalized fluorescence (measure of expression)')
plt.show()
dict_keys(['__header__', '__version__', '__globals__', 'rfpCyt', 'rfpNuc', 'cfpNuc', 'cfpCyt', 'yfpNuc', 'yfpCyt', 'CYratioCyt'])
<class 'numpy.ndarray'>
()

Creating a database engine

Here, you’re going to fire up your very first SQL engine. You’ll create an engine to connect to the SQLite database 'Chinook.sqlite', which is in your working directory. Remember that to create an engine to connect to 'Northwind.sqlite', Hugo executed the command

engine = create_engine('sqlite:///Northwind.sqlite')

Here, 'sqlite:///Northwind.sqlite' is called the connection string to the SQLite database Northwind.sqlite. A little bit of background on the Chinook database: the Chinook database contains information about a semi-fictional digital media store in which media data is real and customer, employee and sales data has been manually created.

Why the name Chinook, you ask? According to their website,

The name of this sample database was based on the Northwind database. Chinooks are winds in the interior West of North America, where the Canadian Prairies and Great Plains meet various mountain ranges. Chinooks are most prevalent over southern Alberta in Canada. Chinook is a good name choice for a database that intends to be an alternative to Northwind.

Instructions
  • Import the function create_engine from the module sqlalchemy.
  • Create an engine to connect to the SQLite database 'Chinook.sqlite' and assign it to engine.
# Import necessary module
from sqlalchemy import create_engine

# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

What are the tables in the database?

In this exercise, you’ll once again create an engine to connect to 'Chinook.sqlite'. Before you can get any data out of the database, however, you’ll need to know what tables it contains!

To this end, you’ll save the table names to a list using the method table_names() on the engine and then you will print the list.
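
A version caveat (an addition, not from the course): table_names() was deprecated in SQLAlchemy 1.4 and removed in 2.0. On newer versions, a minimal sketch of the equivalent uses the inspection API:

from sqlalchemy import create_engine, inspect

# Create the engine and list the tables via an Inspector
engine = create_engine('sqlite:///Chinook.sqlite')
print(inspect(engine).get_table_names())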

Instructions
  • Import the function create_engine from the module sqlalchemy.
  • Create an engine to connect to the SQLite database 'Chinook.sqlite' and assign it to engine.
  • Using the method table_names() on the engine engine, assign the table names of 'Chinook.sqlite' to the variable table_names.
  • Print the object table_names to the shell.
# Import necessary module
from sqlalchemy import create_engine

# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Save the table names to a list: table_names
table_names = engine.table_names()

# Print the table names to the shell
print(table_names)
['Album', 'Artist', 'Customer', 'Employee', 'Genre', 'Invoice', 'InvoiceLine', 'MediaType', 'Playlist', 'PlaylistTrack', 'Track']

The Hello World of SQL Queries!

Now, it’s time for liftoff! In this exercise, you’ll perform the Hello World of SQL queries, SELECT, in order to retrieve all columns of the table Album in the Chinook database. Recall that the query SELECT * selects all columns.
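
One more version caveat (an addition, not from the course): on SQLAlchemy 2.0, raw SQL strings must be wrapped in text() before being passed to execute(); the plain-string calls in this chapter assume SQLAlchemy 1.x. A minimal sketch:

from sqlalchemy import create_engine, text

engine = create_engine('sqlite:///Chinook.sqlite')

# Wrap the raw SQL string in text() for SQLAlchemy 2.0 compatibility
with engine.connect() as con:
    rs = con.execute(text("SELECT * FROM Album"))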

Instructions
  • Open the engine connection as con using the method connect() on the engine.
  • Execute the query that selects ALL columns from the Album table. Store the results in rs.
  • Store all of your query results in the DataFrame df by applying the fetchall() method to the results rs.
  • Close the connection!
# Import packages
from sqlalchemy import create_engine
import pandas as pd

# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Open engine connection: con
con = engine.connect()

# Perform query: rs
rs = con.execute('SELECT * FROM Album')

# Save results of the query to DataFrame: df
df = pd.DataFrame(rs.fetchall())

# Close connection
con.close()

# Print head of DataFrame df
print(df.head())
<script.py> output:
       0                                      1  2
    0  1  For Those About To Rock We Salute You  1
    1  2                      Balls to the Wall  2
    2  3                      Restless and Wild  2
    3  4                      Let There Be Rock  1
    4  5                               Big Ones  3

Customizing the Hello World of SQL Queries

Congratulations on executing your first SQL query! Now you’re going to figure out how to customize your query in order to:

  • Select specified columns from a table;
  • Select a specified number of rows;
  • Import column names from the database table.

Recall that Hugo performed a very similar query customization in the video:

engine = create_engine('sqlite:///Northwind.sqlite')

with engine.connect() as con:
    rs = con.execute("SELECT OrderID, OrderDate, ShipName FROM Orders")
    df = pd.DataFrame(rs.fetchmany(size=5))
    df.columns = rs.keys()

Packages have already been imported as follows:

from sqlalchemy import create_engine
import pandas as pd

The engine has also already been created:

engine = create_engine('sqlite:///Chinook.sqlite')

The engine connection is already open with the statement

with engine.connect() as con:

All the code you need to complete is within this context.

Instructions
  • Execute the SQL query that selects the columns LastName and Title from the Employee table. Store the results in the variable rs.
  • Apply the method fetchmany() to rs in order to retrieve 3 of the records. Store them in the DataFrame df.
  • Using the rs object, set the DataFrame’s column names to the corresponding names of the table columns.
# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT LastName, Title FROM Employee")
    df = pd.DataFrame(rs.fetchmany(3))
    df.columns = rs.keys()

# Print the length of the DataFrame df
print(len(df))

# Print the head of the DataFrame df
print(df.head())
3
  LastName                Title
0    Adams      General Manager
1  Edwards        Sales Manager
2  Peacock  Sales Support Agent

Filtering your database records using SQL’s WHERE

You can now execute a basic SQL query to select records from any table in your database and you can also perform simple query customizations to select particular columns and numbers of rows.

There are a couple more standard SQL query chops that will aid you in your journey to becoming an SQL ninja.

Let’s say, for example that you wanted to get all records from the Customer table of the Chinook database for which the Country is 'Canada'. You can do this very easily in SQL using a SELECT statement followed by a WHERE clause as follows:

SELECT * FROM Customer WHERE Country = 'Canada'

In fact, you can filter any SELECT statement by any condition using a WHERE clause. This is called filtering your records.

In this interactive exercise, you’ll select all records of the Employee table for which 'EmployeeId' is greater than or equal to 6.

Packages are already imported as follows:

import pandas as pd
from sqlalchemy import create_engine

Query away!

Instructions
  • Complete the argument of create_engine() so that the engine for the SQLite database 'Chinook.sqlite' is created.
  • Execute the query that selects all records from the Employee table where 'EmployeeId' is greater than or equal to 6. Use the >= operator and assign the results to rs.
  • Apply the method fetchall() to rs in order to fetch all records in rs. Store them in the DataFrame df.
  • Using the rs object, set the DataFrame’s column names to the corresponding names of the table columns.
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Employee WHERE EmployeeId >= 6")
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()

# Print the head of the DataFrame df
print(df.head())
   EmployeeId  LastName FirstName       Title  ReportsTo  ... Country PostalCode              Phone                Fax                    Email
0           6  Mitchell   Michael  IT Manager          1  ...  Canada    T3B 0C5  +1 (403) 246-9887  +1 (403) 246-9899  michael@chinookcorp.com
1           7      King    Robert    IT Staff          6  ...  Canada    T1K 5N8  +1 (403) 456-9986  +1 (403) 456-8485   robert@chinookcorp.com
2           8  Callahan     Laura    IT Staff          6  ...  Canada    T1H 1Y8  +1 (403) 467-3351  +1 (403) 467-8772    laura@chinookcorp.com

[3 rows x 15 columns]

Ordering your SQL records with ORDER BY

You can also order your SQL query results. For example, if you wanted to get all records from the Customer table of the Chinook database and order them in increasing order by the column SupportRepId, you could do so with the following query:

"SELECT * FROM Customer ORDER BY SupportRepId"

In fact, you can order any SELECT statement by any column.

In this interactive exercise, you’ll select all records of the Employee table and order them in increasing order by the column BirthDate.

Packages are already imported as follows:

import pandas as pd
from sqlalchemy import create_engine

Get querying!

Instructions
  • Using the function create_engine(), create an engine for the SQLite database Chinook.sqlite and assign it to the variable engine.
  • In the context manager, execute the query that selects all records from the Employee table and orders them in increasing order by the column BirthDate. Assign the result to rs.
  • In a call to pd.DataFrame(), apply the method fetchall() to rs in order to fetch all records in rs. Store them in the DataFrame df.
  • Set the DataFrame’s column names to the corresponding names of the table columns.
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Open engine in context manager
with engine.connect() as con:
    rs = con.execute('SELECT * FROM Employee ORDER BY BirthDate ASC')
    df = pd.DataFrame(rs.fetchall())

    # Set the DataFrame's column names
    df.columns = rs.keys()

# Print head of DataFrame
print(df.head())
   EmployeeId  LastName FirstName                Title  ReportsTo  ... Country PostalCode              Phone                Fax                     Email
0           4      Park  Margaret  Sales Support Agent        2.0  ...  Canada    T2P 5G3  +1 (403) 263-4423  +1 (403) 263-4289  margaret@chinookcorp.com
1           2   Edwards     Nancy        Sales Manager        1.0  ...  Canada    T2P 2T3  +1 (403) 262-3443  +1 (403) 262-3322     nancy@chinookcorp.com
2           1     Adams    Andrew      General Manager        NaN  ...  Canada    T5K 2N1  +1 (780) 428-9482  +1 (780) 428-3457    andrew@chinookcorp.com
3           5   Johnson     Steve  Sales Support Agent        2.0  ...  Canada    T3B 1Y7   1 (780) 836-9987   1 (780) 836-9543     steve@chinookcorp.com
4           8  Callahan     Laura             IT Staff        6.0  ...  Canada    T1H 1Y8  +1 (403) 467-3351  +1 (403) 467-8772     laura@chinookcorp.com

[5 rows x 15 columns]

Pandas and The Hello World of SQL Queries!

Here, you’ll take advantage of the power of pandas to write the results of your SQL query to a DataFrame in one swift line of Python code!

You’ll first import pandas and create the SQLite 'Chinook.sqlite' engine. Then you’ll query the database to select all records from the Album table.

Recall that to select all records from the Orders table in the Northwind database, Hugo executed the following command:

df = pd.read_sql_query("SELECT * FROM Orders", engine)
Instructions
  • Import the pandas package using the alias pd.
  • Using the function create_engine(), create an engine for the SQLite database Chinook.sqlite and assign it to the variable engine.
  • Use the pandas function read_sql_query() to assign to the variable df the DataFrame of results from the following query: select all records from the table Album.
  • The remainder of the code is included to confirm that the DataFrame created by this method is equal to that created by the previous method that you learned.
# Execute query and store records in DataFrame: df
df = pd.read_sql_query('SELECT * FROM Album', engine)

# Print head of DataFrame
print(df.head())

# Open engine in context manager and store query result in df1
with engine.connect() as con:
    rs = con.execute("SELECT * FROM Album")
    df1 = pd.DataFrame(rs.fetchall())
    df1.columns = rs.keys()

# Confirm that both methods yield the same result
print(df.equals(df1))
<script.py> output:
       AlbumId                                  Title  ArtistId
    0        1  For Those About To Rock We Salute You         1
    1        2                      Balls to the Wall         2
    2        3                      Restless and Wild         2
    3        4                      Let There Be Rock         1
    4        5                               Big Ones         3
    True

Pandas for more complex querying

Here, you’ll become more familiar with the pandas function read_sql_query() by using it to execute a more complex query: a SELECT statement followed by both a WHERE clause AND an ORDER BY clause.

You’ll build a DataFrame that contains the rows of the Employee table for which the EmployeeId is greater than or equal to 6 and you’ll order these entries by BirthDate.

Instructions
  • Using the function create_engine(), create an engine for the SQLite database Chinook.sqlite and assign it to the variable engine.
  • Use the pandas function read_sql_query() to assign to the variable df the DataFrame of results from the following query: select all records from the Employee table where the EmployeeId is greater than or equal to 6 and ordered by BirthDate (make sure to use WHERE and ORDER BY in this precise order).
# Import packages
from sqlalchemy import create_engine
import pandas as pd

# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')

# Execute query and store records in DataFrame: df
df = pd.read_sql_query('SELECT * FROM Employee WHERE EmployeeId>=6 ORDER BY BirthDate', engine)

# Print head of DataFrame
print(df.head())
   EmployeeId  LastName FirstName       Title  ReportsTo  ... Country PostalCode              Phone                Fax                    Email
0           8  Callahan     Laura    IT Staff          6  ...  Canada    T1H 1Y8  +1 (403) 467-3351  +1 (403) 467-8772    laura@chinookcorp.com
1           7      King    Robert    IT Staff          6  ...  Canada    T1K 5N8  +1 (403) 456-9986  +1 (403) 456-8485   robert@chinookcorp.com
2           6  Mitchell   Michael  IT Manager          1  ...  Canada    T3B 0C5  +1 (403) 246-9887  +1 (403) 246-9899  michael@chinookcorp.com

[3 rows x 15 columns]

The power of SQL lies in relationships between tables: INNER JOIN

Here, you’ll perform your first INNER JOIN! You’ll be working with your favourite SQLite database, Chinook.sqlite. For each record in the Album table, you’ll extract the Title along with the Name of the Artist. The latter will come from the Artist table and so you will need to INNER JOIN these two tables on the ArtistID column of both.

Recall that to INNER JOIN the Orders and Customers tables from the Northwind database, Hugo executed the following SQL query:

"SELECT OrderID, CompanyName FROM Orders INNER JOIN Customers on Orders.CustomerID = Customers.CustomerID"

The following code has already been executed to import the necessary packages and to create the engine:

import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('sqlite:///Chinook.sqlite')
Instructions
  • Assign to rs the results from the following query: select all the records, extracting the Title of the record and Name of the artist of each record from the Album table and the Artist table, respectively. To do so, INNER JOIN these two tables on the ArtistID column of both.
  • In a call to pd.DataFrame(), apply the method fetchall() to rs in order to fetch all records in rs. Store them in the DataFrame df.
  • Set the DataFrame’s column names to the corresponding names of the table columns.
# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
    rs = con.execute('SELECT Title, Name FROM Album INNER JOIN Artist on Album.ArtistID=Artist.ArtistID')
    df = pd.DataFrame(rs.fetchall())
    df.columns = rs.keys()

# Print head of DataFrame df
print(df.head())
                                   Title       Name
0  For Those About To Rock We Salute You      AC/DC
1                      Balls to the Wall     Accept
2                      Restless and Wild     Accept
3                      Let There Be Rock      AC/DC
4                               Big Ones  Aerosmith

Filtering your INNER JOIN

Congrats on performing your first INNER JOIN! You’re now going to finish this chapter with one final exercise in which you perform an INNER JOIN and filter the result using a WHERE clause.

Recall that to INNER JOIN the Orders and Customers tables from the Northwind database, Hugo executed the following SQL query:

"SELECT OrderID, CompanyName FROM Orders INNER JOIN Customers on Orders.CustomerID = Customers.CustomerID"

The following code has already been executed to import the necessary packages and to create the engine:

import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('sqlite:///Chinook.sqlite')
Instructions
  • Use the pandas function read_sql_query() to assign to the variable df the DataFrame of results from the following query: select all records from PlaylistTrack INNER JOIN Track on PlaylistTrack.TrackId = Track.TrackId that satisfy the condition Milliseconds < 250000.
# Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM PlaylistTrack INNER JOIN Track on PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000", engine)

# Print head of DataFrame
print(df.head())
   PlaylistId  TrackId  TrackId              Name  AlbumId  ...  GenreId  Composer Milliseconds    Bytes  UnitPrice
0           1     3390     3390  One and the Same      271  ...       23      None       217732  3559040       0.99
1           1     3392     3392     Until We Fall      271  ...       23      None       230758  3766605       0.99
2           1     3393     3393     Original Fire      271  ...       23      None       218916  3577821       0.99
3           1     3394     3394       Broken City      271  ...       23      None       228366  3728955       0.99
4           1     3395     3395          Somedays      271  ...       23      None       213831  3497176       0.99

[5 rows x 11 columns]
