Introduction to Importing Data in Python
In this chapter, you’ll learn how to import data into Python from all types of flat files, which are a simple and prevalent form of data storage. You’ve previously learned how to use NumPy and pandas—you will learn how to use these packages to import flat files and customize your imports.
Exploring your working directory
In order to import data into Python, you should first have an idea of what files are in your working directory.
IPython, which is running on DataCamp’s servers, has a bunch of cool commands, including its magic commands. For example, starting a line with ! gives you complete system shell access. This means that the IPython magic command ! ls will display the contents of your current directory. Your task is to use the IPython magic command ! lsto check out the contents of your current directory and answer the following question: which of the following files is in your working directory?
In [1]: ! ls moby_dick.txt
Importing entire text files
In this exercise, you’ll be working with the file moby_dick.txt. It is a text file that contains the opening sentences of Moby Dick, one of the great American novels! Here you’ll get experience opening a text file, printing its contents to the shell and, finally, closing it.
Instructions
- Open the file
moby_dick.txtas read-only and store it in the variablefile. Make sure to pass the filename enclosed in quotation marks''. - Print the contents of the file to the shell using the
print()function. As Hugo showed in the video, you’ll need to apply the methodread()to the objectfile. - Check whether the file is closed by executing
print(file.closed). - Close the file using the
close()method. - Check again that the file is closed as you did above.
# Open a file: file
file = open('moby_dick.txt', 'r')
# Print it
print(file.read())
# Check whether file is closed
print(file.closed)
# Close file
file.close()
# Check whether file is closed
print(file.closed)
<script.py> output:
CHAPTER 1. Loomings.
Call me Ishmael. Some years ago--never mind how long precisely--having
little or no money in my purse, and nothing particular to interest me on
shore, I thought I would sail about a little and see the watery part of
the world. It is a way I have of driving off the spleen and regulating
the circulation. Whenever I find myself growing grim about the mouth;
whenever it is a damp, drizzly November in my soul; whenever I find
myself involuntarily pausing before coffin warehouses, and bringing up
the rear of every funeral I meet; and especially whenever my hypos get
such an upper hand of me, that it requires a strong moral principle to
prevent me from deliberately stepping into the street, and methodically
knocking people's hats off--then, I account it high time to get to sea
as soon as I can. This is my substitute for pistol and ball. With a
philosophical flourish Cato throws himself upon his sword; I quietly
take to the ship. There is nothing surprising in this. If they but knew
it, almost all men in their degree, some time or other, cherish very
nearly the same feelings towards the ocean with me.
False
TrueImporting text files line by line
For large files, we may not want to print all of their content to the shell: you may wish to print only the first few lines. Enter the readline() method, which allows you to do this. When a file called file is open, you can print out the first line by executing file.readline(). If you execute the same command again, the second line will print, and so on.
In the introductory video, Hugo also introduced the concept of a context manager. He showed that you can bind a variable file by using a context manager construct:
with open('huck_finn.txt') as file:
While still within this construct, the variable file will be bound to open('huck_finn.txt'); thus, to print the file to the shell, all the code you need to execute is:
with open('huck_finn.txt') as file:
print(file.readline())
You’ll now use these tools to print the first few lines of moby_dick.txt!
Instructions
- Open
moby_dick.txtusing thewithcontext manager and the variablefile. - Print the first three lines of the file to the shell by using
readline()three times within the context manager.
# Read & print the first 3 lines
with open('moby_dick.txt') as file:
print(file.readline())
print(file.readline())
print(file.readline())# Read & print the first 3 lines
with open('moby_dick.txt') as file:
print(file.readline())
print(file.readline())
print(file.readline())
CHAPTER 1. Loomings.
Call me Ishmael. Some years ago--never mind how long precisely--havingUsing NumPy to import flat files
In this exercise, you’re now going to load the MNIST digit recognition dataset using the numpy function loadtxt() and see just how easy it can be:
- The first argument will be the filename.
- The second will be the delimiter which, in this case, is a comma.
You can find more information about the MNIST dataset here on the webpage of Yann LeCun, who is currently Director of AI Research at Facebook and Founding Director of the NYU Center for Data Science, among many other things.
Instructions
- Fill in the arguments of
np.loadtxt()by passingfileand a comma','for the delimiter. - Fill in the argument of
print()to print the type of the objectdigits. Use the functiontype(). - Execute the rest of the code to visualize one of the rows of the data.
import numpy as np # Assign filename to variable: file file = 'digits.csv' # Load file as array: digits digits = np.loadtxt(file, delimiter=',') # Print datatype of digits print(type(digits)) # Select and reshape a row im = digits[21, 1:] im_sq = np.reshape(im, (28, 28)) # Plot reshaped data (matplotlib.pyplot already loaded as plt) plt.imshow(im_sq, cmap='Greys', interpolation='nearest') plt.show()
# Import package import numpy as np # Assign filename to variable: file file = 'digits.csv' # Load file as array: digits digits = np.loadtxt(file, delimiter=',') # Print datatype of digits print(type(digits)) # Select and reshape a row im = digits[21, 1:] im_sq = np.reshape(im, (28, 28)) # Plot reshaped data (matplotlib.pyplot already loaded as plt) plt.imshow(im_sq, cmap='Greys', interpolation='nearest') plt.show() <class 'numpy.ndarray'>

Customizing your NumPy import
What if there are rows, such as a header, that you don’t want to import? What if your file has a delimiter other than a comma? What if you only wish to import particular columns?
There are a number of arguments that np.loadtxt() takes that you’ll find useful:
delimiterchanges the delimiter thatloadtxt()is expecting.- You can use
','for comma-delimited. - You can use
'\t'for tab-delimited.
- You can use
skiprowsallows you to specify how many rows (not indices) you wish to skipusecolstakes a list of the indices of the columns you wish to keep.
The file that you’ll be importing, digits_header.txt, has a header and is tab-delimited.
Instructions
- Complete the arguments of
np.loadtxt(): the file you’re importing is tab-delimited, you want to skip the first row and you only want to import the first and third columns. - Complete the argument of the
print()call in order to print the entire array that you just imported.
# Import numpy import numpy as np # Assign the filename: file file = 'digits_header.txt' # Load the data: data data = np.loadtxt(file, delimiter='\t', skiprows=1, usecols=[0, 2]) # Print data print(data)
<script.py> output:
[[1. 0.]
[0. 0.]
[1. 0.]
[4. 0.]
[0. 0.]
[0. 0.]
[7. 0.]
[3. 0.]
[5. 0.]
[3. 0.]
[8. 0.]
[9. 0.]
[1. 0.]
[3. 0.]
[3. 0.]
[1. 0.]
[2. 0.]
[0. 0.]
[7. 0.]
[5. 0.]
[8. 0.]
[6. 0.]
[2. 0.]
[0. 0.]
[2. 0.]
[3. 0.]
[6. 0.]
[9. 0.]
[9. 0.]
[7. 0.]
[8. 0.]
[9. 0.]
[4. 0.]
[9. 0.]
[2. 0.]
[1. 0.]
[3. 0.]
[1. 0.]
[1. 0.]
[4. 0.]
[9. 0.]
[1. 0.]
[4. 0.]
[4. 0.]
[2. 0.]
[6. 0.]
[3. 0.]
[7. 0.]
[7. 0.]
[4. 0.]
[7. 0.]
[5. 0.]
[1. 0.]
[9. 0.]
[0. 0.]
[2. 0.]
[2. 0.]
[3. 0.]
[9. 0.]
[1. 0.]
[1. 0.]
[1. 0.]
[5. 0.]
[0. 0.]
[6. 0.]
[3. 0.]
[4. 0.]
[8. 0.]
[1. 0.]
[0. 0.]
[3. 0.]
[9. 0.]
[6. 0.]
[2. 0.]
[6. 0.]
[4. 0.]
[7. 0.]
[1. 0.]
[4. 0.]
[1. 0.]
[5. 0.]
[4. 0.]
[8. 0.]
[9. 0.]
[2. 0.]
[9. 0.]
[9. 0.]
[8. 0.]
[9. 0.]
[6. 0.]
[3. 0.]
[6. 0.]
[4. 0.]
[6. 0.]
[2. 0.]
[9. 0.]
[1. 0.]
[2. 0.]
[0. 0.]
[5. 0.]]Importing different datatypes
The file seaslug.txt
- has a text header, consisting of strings
- is tab-delimited.
These data consists of percentage of sea slug larvae that had metamorphosed in a given time period. Read more here.
Due to the header, if you tried to import it as-is using np.loadtxt(), Python would throw you a ValueError and tell you that it could not convert string to float. There are two ways to deal with this: firstly, you can set the data type argument dtype equal to str(for string).
Alternatively, you can skip the first row as we have seen before, using the skiprows argument.
Instructions
- Complete the first call to
np.loadtxt()by passingfileas the first argument. - Execute
print(data[0])to print the first element ofdata. - Complete the second call to
np.loadtxt(). The file you’re importing is tab-delimited, the datatype isfloat, and you want to skip the first row. - Print the 10th element of
data_floatby completing theprint()command. Be guided by the previousprint()call. - Execute the rest of the code to visualize the data.
# Assign filename: file
file = 'seaslug.txt'
# Import file: data
data = np.loadtxt(file, delimiter='\t', dtype=str)
# Print the first element of data
print(data[0])
# Import data as floats and skip the first row: data_float
data_float = np.loadtxt(file, delimiter='\t', dtype=float, skiprows=1)
# Print the 10th element of data_float
print(data_float[9])
# Plot a scatterplot of the data
plt.scatter(data_float[:, 0], data_float[:, 1])
plt.xlabel('time (min.)')
plt.ylabel('percentage of larvae')
plt.show()
<script.py> output:
['Time' 'Percent']
['0' '0.5']Working with mixed datatypes (1)
Much of the time you will need to import datasets which have different datatypes in different columns; one column may contain strings and another floats, for example. The function np.loadtxt() will freak at this. There is another function, np.genfromtxt(), which can handle such structures. If we pass dtype=None to it, it will figure out what types each column should be.
Import 'titanic.csv' using the function np.genfromtxt() as follows:
data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)
Here, the first argument is the filename, the second specifies the delimiter , and the third argument names tells us there is a header. Because the data are of different types, datais an object called a structured array. Because numpy arrays have to contain elements that are all the same type, the structured array solves this by being a 1D array, where each element of the array is a row of the flat file imported. You can test this by checking out the array’s shape in the shell by executing np.shape(data).
Accessing rows and columns of structured arrays is super-intuitive: to get the ith row, merely execute data[i] and to get the column with name 'Fare', execute data['Fare'].
After importing the Titanic data as a structured array (as per the instructions above), print the entire column with the name Survived to the shell. What are the last 4 values of this column?
In [1]:
data = np.genfromtxt('titanic.csv', delimiter=',', names=True, dtype=None)
In [2]:
print(data['Fare'])In [3]: print(data['Survived']) [0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1 0 1 0 1 1 1 0 1 0 0 1 0 0 1 1 0 0 0 1 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 1 0 1 1 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 0 0 1 0 0 0 1 1 0 1 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 1 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 1 0 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 0 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 1 1 0 1 0 0 0 0 0 1 1 1 0 1 1 0 1 1 0 0 0 1 0 0 0 1 0 0 1 0 1 1 1 1 0 0 0 0 0 0 1 1 1 1 0 1 0 1 1 1 0 1 1 1 0 0 0 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 1 0 0 0 1 0 0 1 1 0 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1 1 1 1 0 0 0 0 1 1 0 0 0 1 1 0 1 0 0 0 1 0 1 1 1 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 1 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 1 0 0 0 1 1 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 1 0 1 1 0 1 1 0 0 1 0 1 0 1 0 0 1 0 0 1 0 0 0 1 0 0 1 0 1 0 1 0 1 1 0 0 1 0 0 1 1 0 1 1 0 0 1 1 0 1 0 1 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 1 1 0 1 1 1 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 1 1 0 0 0 1 0 0 1 1 1 0 0 1 0 0 1 0 0 1 0 0 1 1 0 0 0 0 1 0 0 1 0 1 0 0 1 0 0 0 0 0 1 0 1 1 1 0 1 0 1 0 1 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 1 0 0 0 1 0 1 0 1 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 1 0 0 1 1 0 0 0 0 1 1 1 1 1 0 1 0 0 0 1 1 0 0 1 0 0 0 1 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 0 1 0 0 1 1 0 0 1 1 0 0 0 1 0 0 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 0 1 1 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 1 0 0 0 1 1 1 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 0 1 0 0 0 1 1 1 1 1 0 0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 1 1 1 1 0 0 0 1 0 0 1 1 0 0 1 0 1 0 0 1 1 0 0 0 1 1 0 0 0 0 0 0 1 0 1 0]
1, 0, 1, 0
Working with mixed datatypes (2)
You have just used np.genfromtxt() to import data containing mixed datatypes. There is also another function np.recfromcsv() that behaves similarly to np.genfromtxt(), except that its default dtype is None. In this exercise, you’ll practice using this to achieve the same result.
Instructions
- Import
titanic.csvusing the functionnp.recfromcsv()and assign it to the variable,d. You’ll only need to passfileto it because it has the defaultsdelimiter=','andnames=Truein addition todtype=None! - Run the remaining code to print the first three entries of the resulting array
d.
# Assign the filename: file file = 'titanic.csv' # Import file using np.recfromcsv: d # Print out first three entries of d print(d[:3])
# Assign the filename: file
file = 'titanic.csv'
# Import file using np.recfromcsv: d
d = np.recfromcsv('titanic.csv', delimiter=',', names=True, dtype=None)
# Print out first three entries of d
print(d[:3])# Import file using np.recfromcsv: d
d = np.recfromcsv('titanic.csv', delimiter=',', names=True, dtype=None)
# Print out first three entries of d
print(d[:3])
[(1, 0, 3, b'male', 22., 1, 0, b'A/5 21171', 7.25 , b'', b'S')
(2, 1, 1, b'female', 38., 1, 0, b'PC 17599', 71.2833, b'C85', b'C')
(3, 1, 3, b'female', 26., 0, 0, b'STON/O2. 3101282', 7.925 , b'', b'S')]Using pandas to import flat files as DataFrames (1)
In the last exercise, you were able to import flat files containing columns with different datatypes as numpy arrays. However, the DataFrame object in pandas is a more appropriate structure in which to store such data and, thankfully, we can easily import files of mixed data types as DataFrames using the pandas functions read_csv()and read_table().
Instructions
- Import the
pandaspackage using the aliaspd. - Read
titanic.csvinto a DataFrame calleddf. The file name is already stored in thefileobject. - In a
print()call, view the head of the DataFrame.
# Import pandas as pd
import pandas as pd
# Assign the filename: file
file = 'titanic.csv'
# Read the file into a DataFrame: df
df = pd.read_csv('titanic.csv')
# View the head of the DataFrame
print(df.head())PassengerId Survived Pclass Sex Age ... Parch Ticket Fare Cabin Embarked 0 1 0 3 male 22.0 ... 0 A/5 21171 7.2500 NaN S 1 2 1 1 female 38.0 ... 0 PC 17599 71.2833 C85 C 2 3 1 3 female 26.0 ... 0 STON/O2. 3101282 7.9250 NaN S 3 4 1 1 female 35.0 ... 0 113803 53.1000 C123 S 4 5 0 3 male 35.0 ... 0 373450 8.0500 NaN S [5 rows x 11 columns]
Using pandas to import flat files as DataFrames (2)
In the last exercise, you were able to import flat files into a pandasDataFrame. As a bonus, it is then straightforward to retrieve the corresponding numpy array using the attribute values. You’ll now have a chance to do this using the MNIST dataset, which is available as digits.csv.
Instructions
100 XP
- Import the first 5 rows of the file into a DataFrame using the function
pd.read_csv()and assign the result todata. You’ll need to use the argumentsnrowsandheader(there is no header in this file). - Build a
numpyarray from the resulting DataFrame indataand assign todata_array. - Execute
print(type(data_array))to print the datatype ofdata_array.
# Assign the filename: file file = 'digits.csv' # Read the first 5 rows of the file into a DataFrame: data data = pd.read_csv(file, nrows=5, header=None) # Build a numpy array from the DataFrame: data_array data_array = data.to_numpy() # Print the datatype of data_array to the shell print(type(data_array))
<script.py> output:
<class 'numpy.ndarray'>Customizing your pandas import
The pandas package is also great at dealing with many of the issues you will encounter when importing data as a data scientist, such as comments occurring in flat files, empty lines and missing values. Note that missing values are also commonly referred to as NA or NaN. To wrap up this chapter, you’re now going to import a slightly corrupted copy of the Titanic dataset titanic_corrupt.txt, which
- contains comments after the character
'#' - is tab-delimited.
Instructions
- Complete the
sep(thepandasversion ofdelim),commentandna_valuesarguments ofpd.read_csv().commenttakes characters that comments occur after in the file, which in this case is'#'.na_valuestakes a list of strings to recognize asNA/NaN, in this case the string'Nothing'. - Execute the rest of the code to print the head of the resulting DataFrame and plot the histogram of the
'Age'of passengers aboard the Titanic.
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Assign filename: file
file = 'titanic_corrupt.txt'
# Import file: data
data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing')
# Print the head of the DataFrame
print(data.head())
# Plot 'Age' variable in a histogram
pd.DataFrame.hist(data[['Age']])
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Assign filename: file
file = 'titanic_corrupt.txt'
# Import file: data
data = pd.read_csv(file, sep=',', comment='#', na_values='Nothing')
# Print the head of the DataFrame
print(data.head())
# Plot 'Age' variable in a histogram
pd.DataFrame.hist(data[['Age']])
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()
PassengerId\tSurvived\tPclass\tSex\tAge\tSibSp\tParch\tTicket\tFare\tCabin\tEmbarked
0 1\t0\t3\tmale\t22.0\t1\t0\tA/5 21171\t7.25\tNo...
1 2\t1\t1\tfemale\t38.0\t1\t0\tPC 17599
2 3\t1\t3\tfemale\t26.0\t0\t0\tSTON/O2. 3101282\...
3 4\t1\t1\tfemale\t35.0\t1\t0\t113803\t53.1\tC12...
4 5\t0\t3\tmale\t35.0\t0\t0\t373450\t8.05\t\tS
Traceback (most recent call last):
File "<stdin>", line 14, in <module>
pd.DataFrame.hist(data[['Age']])
File "<stdin>", line 2934, in __getitem__
raise_missing=True)
File "<stdin>", line 1354, in _convert_to_indexer
return self._get_listlike_indexer(obj, axis, **kwargs)[1]
File "<stdin>", line 1161, in _get_listlike_indexer
raise_missing=raise_missing)
File "<stdin>", line 1246, in _validate_read_indexer
key=key, axis=self.obj._get_axis_name(axis)))
KeyError: "None of [Index(['Age'], dtype='object')] are in the [columns]"
# Import matplotlib.pyplot as plt
import matplotlib.pyplot as plt
# Assign filename: file
file = 'titanic_corrupt.txt'
# Import file: data
data = pd.read_csv(file, sep='\t', comment='#', na_values='Nothing')
# Print the head of the DataFrame
print(data.head())
# Plot 'Age' variable in a histogram
pd.DataFrame.hist(data[['Age']])
plt.xlabel('Age (years)')
plt.ylabel('count')
plt.show()
PassengerId Survived Pclass Sex Age ... Parch Ticket Fare Cabin Embarked
0 1 0 3 male 22.0 ... 0 A/5 21171 7.250 NaN S
1 2 1 1 female 38.0 ... 0 PC 17599 NaN NaN NaN
2 3 1 3 female 26.0 ... 0 STON/O2. 3101282 7.925 NaN S
3 4 1 1 female 35.0 ... 0 113803 53.100 C123 S
4 5 0 3 male 35.0 ... 0 373450 8.050 NaN S
[5 rows x 11 columns]
Not so flat any more
In Chapter 1, you learned how to use the IPython magic command ! ls to explore your current working directory. You can also do this natively in Python using the library os, which consists of miscellaneous operating system interfaces.
The first line of the following code imports the library os, the second line stores the name of the current directory in a string called wd and the third outputs the contents of the directory in a list to the shell.
import os
wd = os.getcwd()
os.listdir(wd)
Run this code in the IPython shell and answer the following questions. Ignore the files that begin with ..
Check out the contents of your current directory and answer the following questions: (1) which file is in your directory and NOT an example of a flat file; (2) why is it not a flat file?
In [1]: import os wd = os.getcwd() os.listdir(wd) Out[1]: ['titanic.txt', 'battledeath.xlsx']
is not a flat because it is a spreadsheet consisting of many sheets, not a single table.
battledeath.xlsx
Loading a pickled file
There are a number of datatypes that cannot be saved easily to flat files, such as lists and dictionaries. If you want your files to be human readable, you may want to save them as text files in a clever manner. JSONs, which you will see in a later chapter, are appropriate for Python dictionaries.
However, if you merely want to be able to import them into Python, you can serialize them. All this means is converting the object into a sequence of bytes, or a bytestream.
In this exercise, you’ll import the pickle package, open a previously pickled data structure from a file and load it.
Instructions
- Import the
picklepackage. - Complete the second argument of
open()so that it is read only for a binary file. This argument will be a string of two letters, one signifying ‘read only’, the other ‘binary’. - Pass the correct argument to
pickle.load(); it should use the variable that is bound toopen. - Print the data,
d. - Print the datatype of
d; take your mind back to your previous use of the functiontype().
# Import pickle package
import pickle
# Open pickle file and load data: d
with open('data.pkl', 'rb') as file:
d = pickle.load(file)
# Print d
print(d)
# Print datatype of d
print(type(d))# Import pickle package
import pickle
# Open pickle file and load data: d
with open('data.pkl', 'rb') as file:
d = pickle.load(file)
# Print d
print(d)
# Print datatype of d
print(type(d))
<script.py> output:
{'June': '69.4', 'Aug': '85', 'Airline': '8', 'Mar': '84.4'}
<class 'dict'>Listing sheets in Excel files
Whether you like it or not, any working data scientist will need to deal with Excel spreadsheets at some point in time. You won’t always want to do so in Excel, however!
Here, you’ll learn how to use pandas to import Excel spreadsheets and how to list the names of the sheets in any loaded .xlsx file.
Recall from the video that, given an Excel file imported into a variable spreadsheet, you can retrieve a list of the sheet names using the attribute spreadsheet.sheet_names.
Specifically, you’ll be loading and checking out the spreadsheet 'battledeath.xlsx', modified from the Peace Research Institute Oslo’s (PRIO) dataset. This data contains age-adjusted mortality rates due to war in various countries over several years.
Instructions
- Assign the spreadsheet filename (provided above) to the variable
file. - Pass the correct argument to
pd.ExcelFile()to load the file using pandas, assigning the result to the variablexls. - Print the sheetnames of the Excel spreadsheet by passing the necessary argument to the
print()function.
# Import pandas import pandas as pd # Assign spreadsheet filename: file file = 'battledeath.xlsx' # Load spreadsheet: xls xls = pd.ExcelFile(file) # Print sheet names print(xls.sheet_names)
# Import pandas import pandas as pd # Assign spreadsheet filename: file file = 'battledeath.xlsx' # Load spreadsheet: xls xls = pd.ExcelFile(file) # Print sheet names print(xls.sheet_names) ['2002', '2004']
Importing sheets from Excel files
In the previous exercises, you saw that the Excel file contains two sheets, '2002' and '2004'. The next step is to import these.
In this exercise, you’ll learn how to import any given sheet of your loaded .xlsx file as a DataFrame. You’ll be able to do so by specifying either the sheet’s name or its index.
The spreadsheet 'battledeath.xlsx' is already loaded as xls.
Instructions
- Load the sheet
'2004'into the DataFramedf1using its name as a string. - Print the head of
df1to the shell. - Load the sheet
2002into the DataFramedf2using its index (0). - Print the head of
df2to the shell.
# Load a sheet into a DataFrame by name: df1
df1 = xls.parse('2004')
# Print the head of the DataFrame df1
print(df1.head())
# Load a sheet into a DataFrame by index: df2
df2 = xls.parse('2002')
# Print the head of the DataFrame df2
print(df2.head())# Import pandas
import pandas as pd
# Assign spreadsheet filename: file
file = 'battledeath.xlsx'
# Load spreadsheet: xls
xls = pd.ExcelFile(file)
# Print sheet names
print(xls.sheet_names)
# Load a sheet into a DataFrame by name: df1
df1 = xls.parse('2004')
# Print the head of the DataFrame df1
print(df1.head())
# Load a sheet into a DataFrame by index: df2
df2 = xls.parse(0)
# Print the head of the DataFrame df2
print(df2.head())
War(country) 2004
0 Afghanistan 9.451028
1 Albania 0.130354
2 Algeria 3.407277
3 Andorra 0.000000
4 Angola 2.597931
War, age-adjusted mortality due to 2002
0 Afghanistan 36.083990
1 Albania 0.128908
2 Algeria 18.314120
3 Andorra 0.000000
4 Angola 18.964560Customizing your spreadsheet import
Here, you’ll parse your spreadsheets and use additional arguments to skip rows, rename columns and select only particular columns.
The spreadsheet 'battledeath.xlsx' is already loaded as xls.
As before, you’ll use the method parse(). This time, however, you’ll add the additional arguments skiprows, names and usecols. These skip rows, name the columns and designate which columns to parse, respectively. All these arguments can be assigned to lists containing the specific row numbers, strings and column numbers, as appropriate.
Instructions
- Parse the first sheet by index. In doing so, skip the first row of data and name the columns
'Country'and'AAM due to War (2002)'using the argumentnames. The values passed toskiprowsandnamesall need to be of typelist. - Parse the second sheet by index. In doing so, parse only the first column with the
usecolsparameter, skip the first row and rename the column'Country'. The argument passed tousecolsalso needs to be of typelist.
# Parse the first sheet and rename the columns: df1 df1 = xls.parse(0, skiprows=[0], names=['Country', 'AAM due to War (2002)']) # Print the head of the DataFrame df1 print(df1.head()) # Parse the first column of the second sheet and rename the column: df2 df2 = xls.parse(1, usecols=[0], skiprows=[0], names=['Country']) # Print the head of the DataFrame df2 print(df2.head())
# Parse the first sheet and rename the columns: df1
df1 = xls.parse(0, skiprows=[0], names=['Country', 'AAM due to War (2002)'])
# Print the head of the DataFrame df1
print(df1.head())
# Parse the first column of the second sheet and rename the column: df2
df2 = xls.parse(1, usecols=[0], skiprows=[0], names=['Country'])
# Print the head of the DataFrame df2
print(df2.head())
Country AAM due to War (2002)
0 Albania 0.128908
1 Algeria 18.314120
2 Andorra 0.000000
3 Angola 18.964560
4 Antigua and Barbuda 0.000000
Country
0 Albania
1 Algeria
2 Andorra
3 Angola
4 Antigua and BarbudaImporting SAS files
In this exercise, you’ll figure out how to import a SAS file as a DataFrame using SAS7BDAT and pandas. The file 'sales.sas7bdat' is already in your working directory and both pandas and matplotlib.pyplot have already been imported as follows:
import pandas as pd
import matplotlib.pyplot as plt
The data are adapted from the website of the undergraduate text book Principles of Econometrics by Hill, Griffiths and Lim.
Instructions
- Import the module
SAS7BDATfrom the librarysas7bdat. - In the context of the file
'sales.sas7bdat', load its contents to a DataFramedf_sas, using the methodto_data_frame()on the objectfile. - Print the head of the DataFrame
df_sas. - Execute your entire script to produce a histogram plot!
# Import sas7bdat package
from sas7bdat import SAS7BDAT
# Save file to a DataFrame: df_sas
with SAS7BDAT('sales.sas7bdat') as file:
df_sas = file.to_data_frame()
# Print head of DataFrame
print(df_sas.head())
# Plot histogram of DataFrame features (pandas and pyplot already imported)
pd.DataFrame.hist(df_sas[['P']])
plt.ylabel('count')
plt.show()<script.py> output:
YEAR P S
0 1950.0 12.9 181.899994
1 1951.0 11.9 245.000000
2 1952.0 10.7 250.199997
3 1953.0 11.3 265.899994
4 1954.0 11.2 248.500000
Importing Stata files
Here, you’ll gain expertise in importing Stata files as DataFrames using the pd.read_stata() function from pandas. The last exercise’s file, 'disarea.dta', is still in your working directory.
Instructions
- Use
pd.read_stata()to load the file'disarea.dta'into the DataFramedf. - Print the head of the DataFrame
df. - Visualize your results by plotting a histogram of the column
disa10. We’ve already provided this code for you, so just run it!
# Import pandas
import pandas as pd
# Load Stata file into a pandas DataFrame: df
df = pd.read_stata('disarea.dta')
# Print the head of the DataFrame df
print(df.head())
# Plot histogram of one column of the DataFrame
pd.DataFrame.hist(df[['disa10']])
plt.xlabel('Extent of disease')
plt.ylabel('Number of countries')
plt.show()<script.py> output:
wbcode country disa1 disa2 disa3 ... disa21 disa22 disa23 disa24 disa25
0 AFG Afghanistan 0.00 0.00 0.76 ... 0.0 0.00 0.02 0.00 0.00
1 AGO Angola 0.32 0.02 0.56 ... 0.0 0.99 0.98 0.61 0.00
2 ALB Albania 0.00 0.00 0.02 ... 0.0 0.00 0.00 0.00 0.16
3 ARE United Arab Emirates 0.00 0.00 0.00 ... 0.0 0.00 0.00 0.00 0.00
4 ARG Argentina 0.00 0.24 0.24 ... 0.0 0.00 0.01 0.00 0.11
[5 rows x 27 columns]
Using h5py to import HDF5 files
The file 'LIGO_data.hdf5' is already in your working directory. In this exercise, you’ll import it using the h5py library. You’ll also print out its datatype to confirm you have imported it correctly. You’ll then study the structure of the file in order to see precisely what HDF groups it contains.
You can find the LIGO data plus loads of documentation and tutorials here. There is also a great tutorial on Signal Processing with the data here.
Instructions
- Import the package
h5py. - Assign the name of the file to the variable
file. - Load the file as read only into the variable
data. - Print the datatype of
data. - Print the names of the groups in the HDF5 file
'LIGO_data.hdf5'.
# Import packages
import numpy as np
import h5py
# Assign filename: file
file = 'LIGO_data.hdf5'
# Load file: data
data = h5py.File(file, 'r')
# Print the datatype of the loaded file
print(type(data))
# Print the keys of the file
for key in data.keys():
print(key)
<class 'h5py._hl.files.File'> meta quality strain
Extracting data from your HDF5 file
In this exercise, you’ll extract some of the LIGO experiment’s actual data from the HDF5 file and you’ll visualize it.
To do so, you’ll need to first explore the HDF5 group 'strain'.
Instructions
- Assign the HDF5 group
data['strain']togroup. - In the
forloop, print out the keys of the HDF5 group ingroup. - Assign to the variable
strainthe values of the time series datadata['strain']['Strain']using the attribute.value. - Set
num_samplesequal to10000, the number of time points we wish to sample. - Execute the rest of the code to produce a plot of the time series data in
LIGO_data.hdf5.
<?php

Loading .mat files
In this exercise, you’ll figure out how to load a MATLAB file using scipy.io.loadmat() and you’ll discover what Python datatype it yields.
The file 'albeck_gene_expression.mat' is in your working directory. This file contains gene expression data from the Albeck Lab at UC Davis. You can find the data and some great documentation here.
Instructions
- Import the package
scipy.io. - Load the file
'albeck_gene_expression.mat'into the variablemat; do so using the functionscipy.io.loadmat(). - Use the function
type()to print the datatype ofmatto the IPython shell.
# Import package
import scipy.io
# Load MATLAB file: mat
mat = scipy.io.loadmat('albeck_gene_expression.mat')
# Print the datatype type of mat
print(type(mat))
<class 'dict'>
The structure of .mat in Python
Here, you’ll discover what is in the MATLAB dictionary that you loaded in the previous exercise.
The file 'albeck_gene_expression.mat' is already loaded into the variable mat. The following libraries have already been imported as follows:
import scipy.io
import matplotlib.pyplot as plt
import numpy as np
Once again, this file contains gene expression data from the Albeck Lab at UCDavis. You can find the data and some great documentation here.
Instructions
- Use the method
.keys()on the dictionarymatto print the keys. Most of these keys (in fact the ones that do NOT begin and end with ‘__’) are variables from the corresponding MATLAB environment. - Print the type of the value corresponding to the key
'CYratioCyt'inmat. Recall thatmat['CYratioCyt']accesses the value. - Print the shape of the value corresponding to the key
'CYratioCyt'using thenumpyfunctionshape(). - Execute the entire script to see some oscillatory gene expression data!
# Print the keys of the MATLAB dictionary
print(mat.keys())
# Print the type of the value corresponding to the key 'CYratioCyt'
print(type(mat['CYratioCyt']))
# Print the shape of the value corresponding to the key 'CYratioCyt'
print(np.shape(mat['CYratioCyt']))
# Subset the array and plot it
data = mat['CYratioCyt'][25, 5:]
fig = plt.figure()
plt.plot(data)
plt.xlabel('time (min.)')
plt.ylabel('normalized fluorescence (measure of expression)')
plt.show()dict_keys(['__header__', '__version__', '__globals__', 'rfpCyt', 'rfpNuc', 'cfpNuc', 'cfpCyt', 'yfpNuc', 'yfpCyt', 'CYratioCyt']) <class 'numpy.ndarray'> ()

Creating a database engine
Here, you’re going to fire up your very first SQL engine. You’ll create an engine to connect to the SQLite database 'Chinook.sqlite', which is in your working directory. Remember that to create an engine to connect to 'Northwind.sqlite', Hugo executed the command
engine = create_engine('sqlite:///Northwind.sqlite')
Here, 'sqlite:///Northwind.sqlite' is called the connection stringto the SQLite database Northwind.sqlite. A little bit of background on the Chinook database: the Chinook database contains information about a semi-fictional digital media store in which media data is real and customer, employee and sales data has been manually created.
Why the name Chinook, you ask? According to their website,
The name of this sample database was based on the Northwind database. Chinooks are winds in the interior West of North America, where the Canadian Prairies and Great Plains meet various mountain ranges. Chinooks are most prevalent over southern Alberta in Canada. Chinook is a good name choice for a database that intends to be an alternative to Northwind.
Instructions
- Import the function
create_enginefrom the modulesqlalchemy. - Create an engine to connect to the SQLite database
'Chinook.sqlite'and assign it toengine.
# Import necessary module
from sqlalchemy import create_engine
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')What are the tables in the database?
In this exercise, you’ll once again create an engine to connect to 'Chinook.sqlite'. Before you can get any data out of the database, however, you’ll need to know what tables it contains!
To this end, you’ll save the table names to a list using the method table_names() on the engine and then you will print the list.
Instructions
- Import the function
create_enginefrom the modulesqlalchemy. - Create an engine to connect to the SQLite database
'Chinook.sqlite'and assign it toengine. - Using the method
table_names()on the engineengine, assign the table names of'Chinook.sqlite'to the variabletable_names. - Print the object
table_namesto the shell.
# Import necessary module
from sqlalchemy import create_engine
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Save the table names to a list: table_names
table_names = engine.table_names()
# Print the table names to the shell
print(table_names)# Import necessary module
from sqlalchemy import create_engine
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Save the table names to a list: table_names
table_names = engine.table_names()
# Print the table names to the shell
print(table_names)
['Album', 'Artist', 'Customer', 'Employee', 'Genre', 'Invoice', 'InvoiceLine', 'MediaType', 'Playlist', 'PlaylistTrack', 'Track']The Hello World of SQL Queries!
Now, it’s time for liftoff! In this exercise, you’ll perform the Hello World of SQL queries, SELECT, in order to retrieve all columns of the table Album in the Chinook database. Recall that the query SELECT *selects all columns.
Instructions
- Open the engine connection as
conusing the methodconnect()on the engine. - Execute the query that selects ALL columns from the
Albumtable. Store the results inrs. - Store all of your query results in the DataFrame
dfby applying thefetchall()method to the resultsrs. - Close the connection!
# Import packages
from sqlalchemy import create_engine
import pandas as pd
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Open engine connection: con
con = engine.connect()
# Perform query: rs
rs = con.execute('SELECT * FROM Album')
# Save results of the query to DataFrame: df
df = pd.DataFrame(rs.fetchall())
# Close connection
con.close()
# Print head of DataFrame df
print(df.head())<script.py> output:
0 1 2
0 1 For Those About To Rock We Salute You 1
1 2 Balls to the Wall 2
2 3 Restless and Wild 2
3 4 Let There Be Rock 1
4 5 Big Ones 3Customizing the Hello World of SQL Queries
Congratulations on executing your first SQL query! Now you’re going to figure out how to customize your query in order to:
- Select specified columns from a table;
- Select a specified number of rows;
- Import column names from the database table.
Recall that Hugo performed a very similar query customization in the video:
engine = create_engine('sqlite:///Northwind.sqlite')
with engine.connect() as con:
rs = con.execute("SELECT OrderID, OrderDate, ShipName FROM Orders")
df = pd.DataFrame(rs.fetchmany(size=5))
df.columns = rs.keys()
Packages have already been imported as follows:
from sqlalchemy import create_engine
import pandas as pd
The engine has also already been created:
engine = create_engine('sqlite:///Chinook.sqlite')
The engine connection is already open with the statement
with engine.connect() as con:
All the code you need to complete is within this context.
Instructions
- Execute the SQL query that selects the columns
LastNameandTitlefrom theEmployeetable. Store the results in the variablers. - Apply the method
fetchmany()torsin order to retrieve 3 of the records. Store them in the DataFramedf. - Using the
rsobject, set the DataFrame’s column names to the corresponding names of the table columns.
# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
rs = con.execute("SELECT LastName, Title FROM Employee")
df = pd.DataFrame(rs.fetchmany(3))
df.columns = rs.keys()
# Print the length of the DataFrame df
print(len(df))
# Print the head of the DataFrame df
print(df.head())3 LastName Title 0 Adams General Manager 1 Edwards Sales Manager 2 Peacock Sales Support Agent
Filtering your database records using SQL’s WHERE
You can now execute a basic SQL query to select records from any table in your database and you can also perform simple query customizations to select particular columns and numbers of rows.
There are a couple more standard SQL query chops that will aid you in your journey to becoming an SQL ninja.
Let’s say, for example that you wanted to get all records from the Customer table of the Chinook database for which the Country is 'Canada'. You can do this very easily in SQL using a SELECTstatement followed by a WHERE clause as follows:
SELECT * FROM Customer WHERE Country = 'Canada'
In fact, you can filter any SELECT statement by any condition using a WHERE clause. This is called filtering your records.
In this interactive exercise, you’ll select all records of the Employeetable for which 'EmployeeId' is greater than or equal to 6.
Packages are already imported as follows:
import pandas as pd
from sqlalchemy import create_engine
Query away!
Instructions
- Complete the argument of
create_engine()so that the engine for the SQLite database'Chinook.sqlite'is created. - Execute the query that selects all records from the
Employeetable where'EmployeeId'is greater than or equal to6. Use the>=operator and assign the results tors. - Apply the method
fetchall()torsin order to fetch all records inrs. Store them in the DataFramedf. - Using the
rsobject, set the DataFrame’s column names to the corresponding names of the table columns.
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
rs = con.execute("SELECT * FROM Employee WHERE 'EmployeeId'>=6")
df = pd.DataFrame(rs.fetchall())
df.columns = rs.keys()
# Print the head of the DataFrame df
print(df.head())EmployeeId LastName FirstName Title ReportsTo ... Country PostalCode Phone Fax Email 0 6 Mitchell Michael IT Manager 1 ... Canada T3B 0C5 +1 (403) 246-9887 +1 (403) 246-9899 michael@chinookcorp.com 1 7 King Robert IT Staff 6 ... Canada T1K 5N8 +1 (403) 456-9986 +1 (403) 456-8485 robert@chinookcorp.com 2 8 Callahan Laura IT Staff 6 ... Canada T1H 1Y8 +1 (403) 467-3351 +1 (403) 467-8772 laura@chinookcorp.com [3 rows x 15 columns]
Ordering your SQL records with ORDER BY
You can also order your SQL query results. For example, if you wanted to get all records from the Customer table of the Chinook database and order them in increasing order by the column SupportRepId, you could do so with the following query:
"SELECT * FROM Customer ORDER BY SupportRepId"
In fact, you can order any SELECT statement by any column.
In this interactive exercise, you’ll select all records of the Employeetable and order them in increasing order by the column BirthDate.
Packages are already imported as follows:
import pandas as pd
from sqlalchemy import create_engine
Get querying!
Instructions
- Using the function
create_engine(), create an engine for the SQLite databaseChinook.sqliteand assign it to the variableengine. - In the context manager, execute the query that selects all records from the
Employeetable and orders them in increasing order by the columnBirthDate. Assign the result tors. - In a call to
pd.DataFrame(), apply the methodfetchall()torsin order to fetch all records inrs. Store them in the DataFramedf. - Set the DataFrame’s column names to the corresponding names of the table columns.
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Open engine in context manager
with engine.connect() as con:
rs = con.execute('SELECT * FROM Employee ORDER BY BirthDate ASC')
df = pd.DataFrame(rs.fetchall())
# Set the DataFrame's column names
df.columns = rs.keys()
# Print head of DataFrame
print(df.head())EmployeeId LastName FirstName Title ReportsTo ... Country PostalCode Phone Fax Email 0 4 Park Margaret Sales Support Agent 2.0 ... Canada T2P 5G3 +1 (403) 263-4423 +1 (403) 263-4289 margaret@chinookcorp.com 1 2 Edwards Nancy Sales Manager 1.0 ... Canada T2P 2T3 +1 (403) 262-3443 +1 (403) 262-3322 nancy@chinookcorp.com 2 1 Adams Andrew General Manager NaN ... Canada T5K 2N1 +1 (780) 428-9482 +1 (780) 428-3457 andrew@chinookcorp.com 3 5 Johnson Steve Sales Support Agent 2.0 ... Canada T3B 1Y7 1 (780) 836-9987 1 (780) 836-9543 steve@chinookcorp.com 4 8 Callahan Laura IT Staff 6.0 ... Canada T1H 1Y8 +1 (403) 467-3351 +1 (403) 467-8772 laura@chinookcorp.com [5 rows x 15 columns]
Pandas and The Hello World of SQL Queries!
Here, you’ll take advantage of the power of pandas to write the results of your SQL query to a DataFrame in one swift line of Python code!
You’ll first import pandas and create the SQLite 'Chinook.sqlite'engine. Then you’ll query the database to select all records from the Album table.
Recall that to select all records from the Orders table in the Northwind database, Hugo executed the following command:
df = pd.read_sql_query("SELECT * FROM Orders", engine)
Instructions
- Import the
pandaspackage using the aliaspd. - Using the function
create_engine(), create an engine for the SQLite databaseChinook.sqliteand assign it to the variableengine. - Use the
pandasfunctionread_sql_query()to assign to the variabledfthe DataFrame of results from the following query: select allrecords from the tableAlbum. - The remainder of the code is included to confirm that the DataFrame created by this method is equal to that created by the previous method that you learned.
# Execute query and store records in DataFrame: df
df = pd.read_sql_query('SELECT * FROM Album', engine)
# Print head of DataFrame
print(df.head())
# Open engine in context manager and store query result in df1
with engine.connect() as con:
rs = con.execute("SELECT * FROM Album")
df1 = pd.DataFrame(rs.fetchall())
df1.columns = rs.keys()
# Confirm that both methods yield the same result
print(df.equals(df1))<script.py> output:
AlbumId Title ArtistId
0 1 For Those About To Rock We Salute You 1
1 2 Balls to the Wall 2
2 3 Restless and Wild 2
3 4 Let There Be Rock 1
4 5 Big Ones 3
TruePandas for more complex querying
Here, you’ll become more familiar with the pandas function read_sql_query() by using it to execute a more complex query: a SELECT statement followed by both a WHERE clause AND an ORDER BY clause.
You’ll build a DataFrame that contains the rows of the Employee table for which the EmployeeId is greater than or equal to 6 and you’ll order these entries by BirthDate.
Instructions
- Using the function
create_engine(), create an engine for the SQLite databaseChinook.sqliteand assign it to the variableengine. - Use the
pandasfunctionread_sql_query()to assign to the variabledfthe DataFrame of results from the following query: select allrecords from theEmployeetable where theEmployeeIdis greater than or equal to6and ordered byBirthDate(make sure to useWHEREandORDER BYin this precise order).
# Import packages
from sqlalchemy import create_engine
import pandas as pd
# Create engine: engine
engine = create_engine('sqlite:///Chinook.sqlite')
# Execute query and store records in DataFrame: df
df = pd.read_sql_query('SELECT * FROM Employee WHERE EmployeeId>=6 ORDER BY BirthDate', engine)
# Print head of DataFrame
print(df.head())EmployeeId LastName FirstName Title ReportsTo ... Country PostalCode Phone Fax Email 0 8 Callahan Laura IT Staff 6 ... Canada T1H 1Y8 +1 (403) 467-3351 +1 (403) 467-8772 laura@chinookcorp.com 1 7 King Robert IT Staff 6 ... Canada T1K 5N8 +1 (403) 456-9986 +1 (403) 456-8485 robert@chinookcorp.com 2 6 Mitchell Michael IT Manager 1 ... Canada T3B 0C5 +1 (403) 246-9887 +1 (403) 246-9899 michael@chinookcorp.com [3 rows x 15 columns]
The power of SQL lies in relationships between tables: INNER JOIN
Here, you’ll perform your first INNER JOIN! You’ll be working with your favourite SQLite database, Chinook.sqlite. For each record in the Album table, you’ll extract the Title along with the Name of the Artist. The latter will come from the Artist table and so you will need to INNER JOIN these two tables on the ArtistID column of both.
Recall that to INNER JOIN the Orders and Customers tables from the Northwind database, Hugo executed the following SQL query:
"SELECT OrderID, CompanyName FROM Orders INNER JOIN Customers on Orders.CustomerID = Customers.CustomerID"
The following code has already been executed to import the necessary packages and to create the engine:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('sqlite:///Chinook.sqlite')
Instructions
- Assign to
rsthe results from the following query: select all the records, extracting theTitleof the record andNameof the artist of each record from theAlbumtable and theArtisttable, respectively. To do so,INNER JOINthese two tables on theArtistIDcolumn of both. - In a call to
pd.DataFrame(), apply the methodfetchall()torsin order to fetch all records inrs. Store them in the DataFramedf. - Set the DataFrame’s column names to the corresponding names of the table columns.
# Open engine in context manager
# Perform query and save results to DataFrame: df
with engine.connect() as con:
rs = con.execute('SELECT Title, Name FROM Album INNER JOIN Artist on Album.ArtistID=Artist.ArtistID')
df = pd.DataFrame(rs.fetchall())
df.columns = rs.keys()
# Print head of DataFrame df
print(df.head())Title Name 0 For Those About To Rock We Salute You AC/DC 1 Balls to the Wall Accept 2 Restless and Wild Accept 3 Let There Be Rock AC/DC 4 Big Ones Aerosmith
Filtering your INNER JOIN
Congrats on performing your first INNER JOIN! You’re now going to finish this chapter with one final exercise in which you perform an INNER JOIN and filter the result using a WHERE clause.
Recall that to INNER JOIN the Orders and Customers tables from the Northwind database, Hugo executed the following SQL query:
"SELECT OrderID, CompanyName FROM Orders INNER JOIN Customers on Orders.CustomerID = Customers.CustomerID"
The following code has already been executed to import the necessary packages and to create the engine:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('sqlite:///Chinook.sqlite')
Instructions
- Use the
pandasfunctionread_sql_query()to assign to the variabledfthe DataFrame of results from the following query: select allrecords fromPlaylistTrack INNER JOIN Track on PlaylistTrack.TrackId = Track.TrackIdthat satisfy the conditionMilliseconds < 250000.
# Execute query and store records in DataFrame: df
df = pd.read_sql_query("SELECT * FROM PlaylistTrack INNER JOIN track on PlaylistTrack.TrackId = Track.TrackId WHERE Milliseconds < 250000", engine)
# Print head of DataFrame
print(df.head())PlaylistId TrackId TrackId Name AlbumId ... GenreId Composer Milliseconds Bytes UnitPrice 0 1 3390 3390 One and the Same 271 ... 23 None 217732 3559040 0.99 1 1 3392 3392 Until We Fall 271 ... 23 None 230758 3766605 0.99 2 1 3393 3393 Original Fire 271 ... 23 None 218916 3577821 0.99 3 1 3394 3394 Broken City 271 ... 23 None 228366 3728955 0.99 4 1 3395 3395 Somedays 271 ... 23 None 213831 3497176 0.99 [5 rows x 11 columns]