Intermediate Importing Data in Python
Importing flat files from the web: your turn!
You are about to import your first file from the web! The flat file you will import will be 'winequality-red.csv'
from the University of California, Irvine's Machine Learning Repository. The flat file contains tabular data of the physicochemical properties of red wine, such as pH, alcohol content, and citric acid content, along with the wine quality rating.
The URL of the file is
'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
After you import it, you’ll check your working directory to confirm that it is there and then you’ll load it into a pandas
DataFrame.
Instructions

- Import the function urlretrieve from the subpackage urllib.request.
- Assign the URL of the file to the variable url.
- Use the function urlretrieve() to save the file locally as 'winequality-red.csv'.
- Execute the remaining code to load 'winequality-red.csv' in a pandas DataFrame and to print its head to the shell.
```python
# Import package
from urllib.request import urlretrieve

# Import pandas
import pandas as pd

# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Save file locally
urlretrieve(url, 'winequality-red.csv')

# Read file into a DataFrame and print its head
df = pd.read_csv('winequality-red.csv', sep=';')
print(df.head())
```
```
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076
1            7.8              0.88         0.00             2.6      0.098
2            7.8              0.76         0.04             2.3      0.092
3           11.2              0.28         0.56             1.9      0.075
4            7.4              0.70         0.00             1.9      0.076

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56
1                 25.0                  67.0   0.9968  3.20       0.68
2                 15.0                  54.0   0.9970  3.26       0.65
3                 17.0                  60.0   0.9980  3.16       0.58
4                 11.0                  34.0   0.9978  3.51       0.56

   alcohol  quality
0      9.4        5
1      9.8        5
2      9.8        5
3      9.8        6
4      9.4        5
```
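The description above mentions checking your working directory to confirm the downloaded file is there, but the exercise code doesn't show that step. Here is a minimal sketch of one way to do it, assuming the urlretrieve() call above has already run:

```python
import os

# Print the current working directory and confirm the file landed there
print(os.getcwd())
print('winequality-red.csv' in os.listdir('.'))  # True if the download succeeded
```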
Opening and reading flat files from the web
You have just imported a file from the web, saved it locally and loaded it into a DataFrame. If you just wanted to load a file from the web into a DataFrame without first saving it locally, you can do that easily using pandas. In particular, you can use the function pd.read_csv() with the URL as the first argument and the separator sep as the second argument.
The URL of the file, once again, is
'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
Instructions

- Assign the URL of the file to the variable url.
- Read the file into a DataFrame df using pd.read_csv(), recalling that the separator in the file is ';'.
- Print the head of the DataFrame df.
- Execute the rest of the code to plot a histogram of the first feature in the DataFrame df.
```python
# Import packages
import matplotlib.pyplot as plt
import pandas as pd

# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Read file into a DataFrame: df
df = pd.read_csv(url, sep=';')

# Print the head of the DataFrame
print(df.head())

# Plot first column of df (df.ix was removed from pandas; use iloc instead)
pd.DataFrame.hist(df.iloc[:, 0:1])
plt.xlabel('fixed acidity (g(tartaric acid)/dm$^3$)')
plt.ylabel('count')
plt.show()
```
```
<script.py> output:
       fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
    0            7.4              0.70         0.00             1.9      0.076
    1            7.8              0.88         0.00             2.6      0.098
    2            7.8              0.76         0.04             2.3      0.092
    3           11.2              0.28         0.56             1.9      0.075
    4            7.4              0.70         0.00             1.9      0.076

       free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
    0                 11.0                  34.0   0.9978  3.51       0.56
    1                 25.0                  67.0   0.9968  3.20       0.68
    2                 15.0                  54.0   0.9970  3.26       0.65
    3                 17.0                  60.0   0.9980  3.16       0.58
    4                 11.0                  34.0   0.9978  3.51       0.56

       alcohol  quality
    0      9.4        5
    1      9.8        5
    2      9.8        5
    3      9.8        6
    4      9.4        5
```
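Note that df.ix has been removed from recent pandas releases, which is why the code above uses df.iloc. As an alternative sketch (not part of the original exercise), the same histogram can be drawn through the Series plotting API, assuming the df and plt objects from the code above:

```python
# Histogram of the first column via the Series plotting API
df.iloc[:, 0].plot(kind='hist')
plt.xlabel('fixed acidity (g(tartaric acid)/dm$^3$)')
plt.ylabel('count')
plt.show()
```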
Importing non-flat files from the web
Congrats! You’ve just loaded a flat file from the web into a DataFrame without first saving it locally, using the pandas function pd.read_csv(). This function is super cool because it has close relatives that allow you to load all types of files, not only flat ones. In this interactive exercise, you’ll use pd.read_excel() to import an Excel spreadsheet.
The URL of the spreadsheet is
'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'
Your job is to use pd.read_excel()
to read in all of its sheets, print the sheet names and then print the head of the first sheet using its name, not its index.
Note that the output of pd.read_excel() is a Python dictionary with sheet names as keys and the corresponding DataFrames as values.
Instructions

- Assign the URL of the file to the variable url.
- Read the file in url into a dictionary xls using pd.read_excel(), recalling that, in order to import all sheets, you need to pass None to the argument sheet_name.
- Print the names of the sheets in the Excel spreadsheet; these will be the keys of the dictionary xls.
- Print the head of the first sheet using the sheet name, not the index of the sheet! The sheet name is '1700'.
```python
# Import package
import pandas as pd

# Assign url of file: url
url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'

# Read in all sheets of Excel file: xls
xls = pd.read_excel(url, sheet_name=None)

# Print the sheetnames to the shell
print(xls.keys())

# Print the head of the first sheet (using its name, NOT its index)
print(xls['1700'].head())
```
```
odict_keys(['1700', '1900'])
                     country       1700
    0            Afghanistan  34.565000
    1  Akrotiri and Dhekelia  34.616667
    2                Albania  41.312000
    3                Algeria  36.720000
    4         American Samoa -14.307000
```
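Because sheet_name=None makes pd.read_excel() return a dictionary of DataFrames, you can also loop over every sheet instead of picking one by name. A minimal sketch, assuming the xls object from the code above:

```python
# Iterate over sheet names and their DataFrames
for sheet_name, sheet_df in xls.items():
    print(sheet_name, sheet_df.shape)
```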
Performing HTTP requests in Python using urllib
Now that you know the basics behind HTTP GET requests, it’s time to perform some of your own. In this interactive exercise, you will ping our very own DataCamp servers to perform a GET request to extract information from the first coding exercise of this course, "https://campus.datacamp.com/courses/1606/4135?ex=2".
In the next exercise, you’ll extract the HTML itself. Right now, however, you are going to package and send the request and then catch the response.
Instructions

- Import the functions urlopen and Request from the subpackage urllib.request.
- Package the request to the url "https://campus.datacamp.com/courses/1606/4135?ex=2" using the function Request() and assign it to request.
- Send the request and catch the response in the variable response with the function urlopen().
- Run the rest of the code to see the datatype of response and to close the connection!
```python
# Import packages
from urllib.request import urlopen, Request

# Specify the url
url = "https://campus.datacamp.com/courses/1606/4135?ex=2"

# This packages the request: request
request = Request(url)

# Sends the request and catches the response: response
response = urlopen(request)

# Print the datatype of response
print(type(response))

# Be polite and close the response!
response.close()
```
<class 'http.client.HTTPResponse'>
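If you'd rather not remember to call response.close() yourself, urlopen() also works as a context manager that closes the connection for you. A minimal sketch of the same request written that way (an alternative to the exercise code, not part of it):

```python
from urllib.request import urlopen, Request

url = "https://campus.datacamp.com/courses/1606/4135?ex=2"

# The with-block sends the request and closes the response automatically
with urlopen(Request(url)) as response:
    print(type(response))
```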
Printing HTTP request results in Python using urllib
You have just packaged and sent a GET request to "https://campus.datacamp.com/courses/1606/4135?ex=2" and then caught the response. You saw that such a response is an http.client.HTTPResponse object. The question remains: what can you do with this response?
Well, as it came from an HTML page, you could read it to extract the HTML and, in fact, such an http.client.HTTPResponse object has an associated read() method. In this exercise, you’ll build on your previous great work to extract the response and print the HTML.
Instructions

- Send the request and catch the response in the variable response with the function urlopen(), as in the previous exercise.
- Extract the response using the read() method and store the result in the variable html.
- Print the string html.
- Hit submit to perform all of the above and to close the response: be tidy!
```python
# Import packages
from urllib.request import urlopen, Request

# Specify the url
url = "https://campus.datacamp.com/courses/1606/4135?ex=2"

# This packages the request
request = Request(url)

# Sends the request and catches the response: response
response = urlopen(request)

# Extract the response: html
html = response.read()

# Print the html
print(html)

# Be polite and close the response!
response.close()
```
```
b'<!doctype html><html lang="en"><head><link rel="apple-touch-icon-precomposed" sizes="57x57" href="/apple-touch-icon-57x57.png"> ... <title data-react-helmet="true">Importing flat files from the web: your turn! | Python</title> ...'
```

(output truncated: print(html) dumps the full raw HTML of the DataCamp exercise page as a bytes object, including page metadata and the preloaded course state)
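The leading b'...' in the output shows that read() returns bytes, not a str. If you want to work with the HTML as text, you can decode it; a minimal sketch, assuming the page is UTF-8 encoded:

```python
from urllib.request import urlopen, Request

url = "https://campus.datacamp.com/courses/1606/4135?ex=2"

with urlopen(Request(url)) as response:
    html_bytes = response.read()             # bytes
    html_text = html_bytes.decode('utf-8')   # str (assumes UTF-8 encoding)

print(type(html_bytes), type(html_text))
```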
Performing HTTP requests in Python using requests
Now that you’ve got your head and hands around making HTTP requests using the urllib package, you’re going to figure out how to do the same using the higher-level requests library. You’ll once again be pinging DataCamp servers for their "http://www.datacamp.com/teach/documentation" page.
Note that unlike in the previous exercises using urllib, you don’t have to close the connection when using requests!
Instructions

- Import the package requests.
- Assign the URL of interest to the variable url.
- Package the request to the URL, send the request and catch the response with a single function requests.get(), assigning the response to the variable r.
- Use the text attribute of the object r to return the HTML of the webpage as a string; store the result in a variable text.
- Hit submit to print the HTML of the webpage.
```python
# Import package
import requests

# Specify the url: url
url = "http://www.datacamp.com/teach/documentation"

# Packages the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response: text
text = r.text

# Print the html
print(text)
```
```
<script.py> output:
    <!DOCTYPE HTML>
    <html lang="en-US">
    <head>
      ...
      <title>Just a moment...</title>
      ...
    </head>
    <body>
      ...
      <h1><span data-translate="checking_browser">Checking your browser before accessing</span> datacamp.com.</h1>
      <p data-translate="process_is_automatic">This process is automatic. Your browser will redirect to your requested content shortly.</p>
      ...
      DDoS protection by <a rel="noopener noreferrer" href="https://www.cloudflare.com/5xx-error-landing/" target="_blank">Cloudflare</a>
      ...
    </body>
    </html>
```

(output truncated: the request was answered by a Cloudflare browser-check page rather than the documentation page itself)
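Since the page returned above is a Cloudflare browser check ("Just a moment...", "Checking your browser before accessing datacamp.com") rather than the documentation, it can be worth inspecting a few response attributes before trusting r.text. A minimal sketch, assuming the r object from the code above:

```python
# Inspect the response before relying on its body
print(r.status_code)                    # HTTP status code, e.g. 200 or 403
print(r.headers.get('Content-Type'))    # response content type
print(r.url)                            # final URL after any redirects
```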
Parsing HTML with BeautifulSoup
In this interactive exercise, you’ll learn how to use the BeautifulSoup package to parse, prettify and extract information from HTML. You’ll scrape the data from the webpage of Guido van Rossum, Python’s very own Benevolent Dictator for Life. In the following exercises, you’ll prettify the HTML and then extract the text and the hyperlinks.
The URL of interest is url = 'https://www.python.org/~guido/'.
Instructions

- Import the function BeautifulSoup from the package bs4.
- Assign the URL of interest to the variable url.
- Package the request to the URL, send the request and catch the response with a single function requests.get(), assigning the response to the variable r.
- Use the text attribute of the object r to return the HTML of the webpage as a string; store the result in a variable html_doc.
- Create a BeautifulSoup object soup from the resulting HTML using the function BeautifulSoup().
- Use the method prettify() on soup and assign the result to pretty_soup.
- Hit submit to print the prettified HTML to your shell!
```python
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extracts the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc)

# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()

# Print the response
print(pretty_soup)
```
# Import packages import requests from bs4 import BeautifulSoup # Specify url: url url = 'https://www.python.org/~guido/' # Package the request, send the request and catch the response: r r = requests.get(url) # Extracts the response as html: html_doc html_doc = r.text # Create a BeautifulSoup object from the HTML: soup soup = html_doc.BeautifulSoup() # Prettify the BeautifulSoup object: pretty_soup pretty_soup = soup.prettify() # Print the response print(pretty_soup) Traceback (most recent call last): File "<stdin>", line 15, in <module> soup = html_doc.BeautifulSoup() AttributeError: 'str' object has no attribute 'BeautifulSoup' # Import packages import requests from bs4 import BeautifulSoup # Specify url: url url = 'https://www.python.org/~guido/' # Package the request, send the request and catch the response: r r = requests.get(url) # Extracts the response as html: html_doc html_doc = r.text # Create a BeautifulSoup object from the HTML: soup soup = BeautifulSoup(html_doc) # Prettify the BeautifulSoup object: pretty_soup pretty_soup = soup.prettify() # Print the response print(pretty_soup) <html> <head> <title> Guido's Personal Home Page </title> </head> <body bgcolor="#FFFFFF" text="#000000"> <!-- Built from main --> <h1> <a href="pics.html"> <img border="0" src="images/IMG_2192.jpg"/> </a> Guido van Rossum - Personal Home Page <a href="pics.html"> <img border="0" height="216" src="images/guido-headshot-2019.jpg" width="270"/> </a> </h1> <p> <a href="http://www.washingtonpost.com/wp-srv/business/longterm/microsoft/stories/1998/raymond120398.htm"> <i> "Gawky and proud of it." </i> </a> <h3> <a href="images/df20000406.jpg"> Who I Am </a> </h3> <p> Read my <a href="http://neopythonic.blogspot.com/2016/04/kings-day-speech.html"> "King's Day Speech" </a> for some inspiration. <p> I am the author of the <a href="http://www.python.org"> Python </a> programming language. See also my <a href="Resume.html"> resume </a> and my <a href="Publications.html"> publications list </a> , a <a href="bio.html"> brief bio </a> , assorted <a href="http://legacy.python.org/doc/essays/"> writings </a> , <a href="http://legacy.python.org/doc/essays/ppt/"> presentations </a> and <a href="interviews.html"> interviews </a> (all about Python), some <a href="pics.html"> pictures of me </a> , <a href="http://neopythonic.blogspot.com"> my new blog </a> , and my <a href="http://www.artima.com/weblogs/index.jsp?blogger=12088"> old blog </a> on Artima.com. I am <a href="https://twitter.com/gvanrossum"> @gvanrossum </a> on Twitter. <p> I am retired, working on personal projects (and maybe a book). I have worked for Dropbox, Google, Elemental Security, Zope Corporation, BeOpen.com, CNRI, CWI, and SARA. (See my <a href="Resume.html"> resume </a> .) I created Python while at CWI. <h3> How to Reach Me </h3> <p> You can send email for me to guido (at) python.org. I read everything sent there, but I receive too much email to respond to everything. <h3> My Name </h3> <p> My name often poses difficulties for Americans. <p> <b> Pronunciation: </b> in Dutch, the "G" in Guido is a hard G, pronounced roughly like the "ch" in Scottish "loch". (Listen to the <a href="guido.au"> sound clip </a> .) However, if you're American, you may also pronounce it as the Italian "Guido". I'm not too worried about the associations with mob assassins that some people have. :-) <p> <b> Spelling: </b> my last name is two words, and I'd like to keep it that way, the spelling on some of my credit cards notwithstanding. 
Dutch spelling rules dictate that when used in combination with my first name, "van" is not capitalized: "Guido van Rossum". But when my last name is used alone to refer to me, it is capitalized, for example: "As usual, Van Rossum was right." <p> <b> Alphabetization: </b> in America, I show up in the alphabet under "V". But in Europe, I show up under "R". And some of my friends put me under "G" in their address book... <h3> More Hyperlinks </h3> <ul> <li> Here's a collection of <a href="http://legacy.python.org/doc/essays/"> essays </a> relating to Python that I've written, including the foreword I wrote for Mark Lutz' book "Programming Python". <p> <li> I own the official <a href="images/license.jpg"> <img align="center" border="0" height="75" src="images/license_thumb.jpg" width="100"> Python license. </img> </a> <p> </p> </li> </p> </li> </ul> <h3> The Audio File Formats FAQ </h3> <p> I was the original creator and maintainer of the Audio File Formats FAQ. It is now maintained by Chris Bagwell at <a href="http://www.cnpbagwell.com/audio-faq"> http://www.cnpbagwell.com/audio-faq </a> . And here is a link to <a href="http://sox.sourceforge.net/"> SOX </a> , to which I contributed some early code. </p> </p> </p> </p> </p> </p> </p> </p> </p> </p> </body> </html> <hr> <a href="images/internetdog.gif"> "On the Internet, nobody knows you're a dog." </a> <hr> </hr> </hr>
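An aside that isn’t part of the exercise: recent versions of bs4 emit a warning when no parser is named, so it can be worth passing one explicitly. A minimal sketch:

# Naming the parser silences the "no parser was explicitly specified" warning
soup = BeautifulSoup(html_doc, 'html.parser')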
How to write a request header
From https://365datascience.com/tutorials/python-tutorials/request-headers-web-scraping/
A Chrome User-Agent string might look like: "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36".
We can assign this string to a dictionary named headers (note that the key is "User-Agent", with a hyphen):
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
The usual get request looks like:
r = requests.get("https://www.youtube.com")
Now we’re going to make it include the header:
r = requests.get("https://www.youtube.com", headers=headers)
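Putting the pieces together (a minimal sketch; the URL and the User-Agent string are only examples):

import requests

# Any realistic browser User-Agent string will do here
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}

r = requests.get("https://www.youtube.com", headers=headers)
print(r.status_code)                      # 200 if the request succeeded
print(r.request.headers["User-Agent"])    # the header that was actually sent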
Turning a webpage into data using BeautifulSoup: getting the text
As promised, in the following exercises, you’ll learn the basics of extracting information from HTML soup. In this exercise, you’ll figure out how to extract the text from the BDFL’s webpage, along with printing the webpage’s title.
Instructions
- In the sample code, the HTML response object html_doc has already been created: your first task is to Soupify it using the function BeautifulSoup() and to assign the resulting soup to the variable soup.
- Extract the title from the HTML soup soup using the attribute title and assign the result to guido_title.
- Print the title of Guido’s webpage to the shell using the print() function.
- Extract the text from the HTML soup soup using the method get_text() and assign to guido_text.
- Hit submit to print the text from Guido’s webpage to the shell.
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc)

# Get the title of Guido's webpage: guido_title
guido_title = soup.title

# Print the title of Guido's webpage to the shell
print(guido_title)

# Get Guido's text: guido_text
guido_text = soup.get_text()

# Print Guido's text to the shell
print(guido_text)
<script.py> output: <title>Guido's Personal Home Page</title> Guido's Personal Home Page Guido van Rossum - Personal Home Page "Gawky and proud of it." Who I Am Read my "King's Day Speech" for some inspiration. I am the author of the Python programming language. See also my resume and my publications list, a brief bio, assorted writings, presentations and interviews (all about Python), some pictures of me, my new blog, and my old blog on Artima.com. I am @gvanrossum on Twitter. I am retired, working on personal projects (and maybe a book). I have worked for Dropbox, Google, Elemental Security, Zope Corporation, BeOpen.com, CNRI, CWI, and SARA. (See my resume.) I created Python while at CWI. How to Reach Me You can send email for me to guido (at) python.org. I read everything sent there, but I receive too much email to respond to everything. My Name My name often poses difficulties for Americans. Pronunciation: in Dutch, the "G" in Guido is a hard G, pronounced roughly like the "ch" in Scottish "loch". (Listen to the sound clip.) However, if you're American, you may also pronounce it as the Italian "Guido". I'm not too worried about the associations with mob assassins that some people have. :-) Spelling: my last name is two words, and I'd like to keep it that way, the spelling on some of my credit cards notwithstanding. Dutch spelling rules dictate that when used in combination with my first name, "van" is not capitalized: "Guido van Rossum". But when my last name is used alone to refer to me, it is capitalized, for example: "As usual, Van Rossum was right." Alphabetization: in America, I show up in the alphabet under "V". But in Europe, I show up under "R". And some of my friends put me under "G" in their address book... More Hyperlinks Here's a collection of essays relating to Python that I've written, including the foreword I wrote for Mark Lutz' book "Programming Python". I own the official Python license. The Audio File Formats FAQ I was the original creator and maintainer of the Audio File Formats FAQ. It is now maintained by Chris Bagwell at http://www.cnpbagwell.com/audio-faq. And here is a link to SOX, to which I contributed some early code. "On the Internet, nobody knows you're a dog."
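A small follow-up, not asked for in the exercise: soup.title returns the whole <title> tag; its .string attribute gives just the text.

print(soup.title)          # <title>Guido's Personal Home Page</title>
print(soup.title.string)   # Guido's Personal Home Page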
Turning a webpage into data using BeautifulSoup: getting the hyperlinks
In this exercise, you’ll figure out how to extract the URLs of the hyperlinks from the BDFL’s webpage. In the process, you’ll become close friends with the soup method find_all().
Instructions
- Use the method find_all() to find all hyperlinks in soup, remembering that hyperlinks are defined by the HTML tag <a> but passed to find_all() without angle brackets; store the result in the variable a_tags.
- The variable a_tags is a results set: your job now is to enumerate over it, using a for loop, and to print the actual URLs of the hyperlinks; to do this, for every element link in a_tags, you want to print() link.get('href').
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extracts the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc)

# Print the title of Guido's webpage
print(soup.title)

# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')
# Import packages import requests from bs4 import BeautifulSoup # Specify url url = 'https://www.python.org/~guido/' # Package the request, send the request and catch the response: r r = requests.get(url) # Extracts the response as html: html_doc html_doc = r.text # create a BeautifulSoup object from the HTML: soup soup = BeautifulSoup(html_doc) # Print the title of Guido's webpage print(soup.title) # Find all 'a' tags (which define hyperlinks): a_tags a_tags = soup.find_all('a') # Print the URLs to the shell for link in soup.find_all('a'): print(link.get('href')) <title>Guido's Personal Home Page</title> pics.html pics.html http://www.washingtonpost.com/wp-srv/business/longterm/microsoft/stories/1998/raymond120398.htm images/df20000406.jpg http://neopythonic.blogspot.com/2016/04/kings-day-speech.html http://www.python.org Resume.html Publications.html bio.html http://legacy.python.org/doc/essays/ http://legacy.python.org/doc/essays/ppt/ interviews.html pics.html http://neopythonic.blogspot.com http://www.artima.com/weblogs/index.jsp?blogger=12088 Tweets by gvanrossum Resume.html guido.au http://legacy.python.org/doc/essays/ images/license.jpg http://www.cnpbagwell.com/audio-faq http://sox.sourceforge.net/ images/internetdog.gif
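Many of the printed hrefs are relative (pics.html, Resume.html, and so on). If you wanted absolute URLs instead, urllib.parse.urljoin() can resolve them against the page URL; a sketch, reusing soup and url from the code above:

from urllib.parse import urljoin

for link in soup.find_all('a'):
    href = link.get('href')
    if href:                          # skip any <a> tag without an href attribute
        print(urljoin(url, href))     # e.g. 'pics.html' -> 'https://www.python.org/~guido/pics.html'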
Loading and exploring a JSON
Now that you know what a JSON is, you’ll load one into your Python environment and explore it yourself. Here, you’ll load the JSON 'a_movie.json' into the variable json_data, which will be a dictionary. You’ll then explore the JSON contents by printing the key-value pairs of json_data to the shell.
Instructions
- Load the JSON 'a_movie.json' into the variable json_data within the context provided by the with statement. To do so, use the function json.load() within the context manager.
- Use a for loop to print all key-value pairs in the dictionary json_data. Recall that you can access a value in a dictionary using the syntax: dictionary[key].
# Import the json package
import json

# Load JSON: json_data
with open("a_movie.json") as json_file:
    json_data = json.load(json_file)

# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])
# Load JSON: json_data with open("a_movie.json") as json_file: json_data = json.load(json_file) # Print each key-value pair in json_data for k in json_data.keys(): print(k + ': ', json_data[k]) Title: The Social Network Year: 2010 Rated: PG-13 Released: 01 Oct 2010 Runtime: 120 min Genre: Biography, Drama Director: David Fincher Writer: Aaron Sorkin, Ben Mezrich Actors: Jesse Eisenberg, Andrew Garfield, Justin Timberlake Plot: As Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, he is sued by the twins who claimed he stole their idea, and by the co-founder who was later squeezed out of the business. Language: English, French Country: United States Awards: Won 3 Oscars. 172 wins & 186 nominations total Poster: https://m.media-amazon.com/images/M/MV5BOGUyZDUxZjEtMmIzMC00MzlmLTg4MGItZWJmMzBhZjE0Mjc1XkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_SX300.jpg Ratings: [{'Source': 'Internet Movie Database', 'Value': '7.7/10'}, {'Source': 'Rotten Tomatoes', 'Value': '96%'}, {'Source': 'Metacritic', 'Value': '95/100'}] Metascore: 95 imdbRating: 7.7 imdbVotes: 653,830 imdbID: tt1285016 Type: movie DVD: 11 Jan 2011 BoxOffice: $96,962,694 Production: Scott Rudin Productions, Trigger Street Productions, Michael De Luca Website: N/A Response: True
An alternative way of writing the above, iterating over items() instead of keys():
# Import the json package
import json

# Load JSON: json_data
with open("a_movie.json") as json_file:
    json_data = json.load(json_file)

# Print each key-value pair in json_data
for k, v in json_data.items():
    print(k + ': ', v)
# Load JSON: json_data with open("a_movie.json") as json_file: json_data = json.load(json_file) # Print each key-value pair in json_data for k, v in json_data.items(): print(k + ': ', v) Title: The Social Network Year: 2010 Rated: PG-13 Released: 01 Oct 2010 Runtime: 120 min Genre: Biography, Drama Director: David Fincher Writer: Aaron Sorkin, Ben Mezrich Actors: Jesse Eisenberg, Andrew Garfield, Justin Timberlake Plot: As Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, he is sued by the twins who claimed he stole their idea, and by the co-founder who was later squeezed out of the business. Language: English, French Country: United States Awards: Won 3 Oscars. 172 wins & 186 nominations total Poster: https://m.media-amazon.com/images/M/MV5BOGUyZDUxZjEtMmIzMC00MzlmLTg4MGItZWJmMzBhZjE0Mjc1XkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_SX300.jpg Ratings: [{'Source': 'Internet Movie Database', 'Value': '7.7/10'}, {'Source': 'Rotten Tomatoes', 'Value': '96%'}, {'Source': 'Metacritic', 'Value': '95/100'}] Metascore: 95 imdbRating: 7.7 imdbVotes: 653,830 imdbID: tt1285016 Type: movie DVD: 11 Jan 2011 BoxOffice: $96,962,694 Production: Scott Rudin Productions, Trigger Street Productions, Michael De Luca Website: N/A Response: True
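For nested values such as Ratings, pretty-printing the whole dictionary with json.dumps() can make the structure easier to read (an aside, not part of the exercise):

import json

# indent=2 pretty-prints nested lists and dictionaries
print(json.dumps(json_data, indent=2))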
Pop quiz: Exploring your JSON
Load the JSON 'a_movie.json'
into a variable, which will be a dictionary. Do so by copying, pasting and executing the following code in the IPython Shell:
import json
with open("a_movie.json") as json_file:
json_data = json.load(json_file)
Print the values corresponding to the keys 'Title'
and 'Year'
and answer the following question about the movie that the JSON describes:
What is the title and year of the movie?
In [6]: for k in json_data.keys():
   ...:     if k == 'Title':
   ...:         print(k + ' : ' + json_data[k])
   ...:     if k == 'Year':
   ...:         print(k + ' : ' + json_data[k])
   ...:
Title : The Social Network
Year : 2010
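Since json_data is just a dictionary, you can also skip the loop and index the two keys directly:

print(json_data['Title'])   # The Social Network
print(json_data['Year'])    # 2010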
API requests
Now it’s your turn to pull some movie data down from the Open Movie Database (OMDB) using their API. The movie you’ll query the API about is The Social Network. Recall that, in the video, to query the API about the movie Hackers, Hugo’s query string was 'http://www.omdbapi.com/?t=hackers' and had a single argument t=hackers.
Note: recently, OMDB has changed their API: you now also have to specify an API key. This means you’ll have to add another argument to the URL: apikey=72bc447a.
Instructions
- Import the requests package.
- Assign to the variable url the URL of interest in order to query 'http://www.omdbapi.com' for the data corresponding to the movie The Social Network. The query string should have two arguments: apikey=72bc447a and t=the+social+network. You can combine them as follows: apikey=72bc447a&t=the+social+network.
- Print the text of the response object r by using its text attribute and passing the result to the print() function.
# Import requests package
import requests

# Assign URL to variable: url
url = 'http://www.omdbapi.com/?apikey=72bc447a&t=the+social+network'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Print the text of the response
print(r.text)
{"Title":"The Social Network","Year":"2010","Rated":"PG-13","Released":"01 Oct 2010","Runtime":"120 min","Genre":"Biography, Drama","Director":"David Fincher","Writer":"Aaron Sorkin, Ben Mezrich","Actors":"Jesse Eisenberg, Andrew Garfield, Justin Timberlake","Plot":"As Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, he is sued by the twins who claimed he stole their idea, and by the co-founder who was later squeezed out of the business.","Language":"English, French","Country":"United States","Awards":"Won 3 Oscars. 172 wins & 186 nominations total","Poster":"https://m.media-amazon.com/images/M/MV5BOGUyZDUxZjEtMmIzMC00MzlmLTg4MGItZWJmMzBhZjE0Mjc1XkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_SX300.jpg","Ratings":[{"Source":"Internet Movie Database","Value":"7.7/10"},{"Source":"Rotten Tomatoes","Value":"96%"},{"Source":"Metacritic","Value":"95/100"}],"Metascore":"95","imdbRating":"7.7","imdbVotes":"653,830","imdbID":"tt1285016","Type":"movie","DVD":"11 Jan 2011","BoxOffice":"$96,962,694","Production":"Scott Rudin Productions, Trigger Street Productions, Michael De Luca","Website":"N/A","Response":"True"}
JSON–from the web to Python
Wow, congrats! You’ve just queried your first API programmatically in Python and printed the text of the response to the shell. However, as you know, your response is actually a JSON, so you can do one step better and decode the JSON. You can then print the key-value pairs of the resulting dictionary. That’s what you’re going to do now!
Instructions
- Pass the variable url to the requests.get() function in order to send the relevant request and catch the response, assigning the resultant response message to the variable r.
- Apply the json() method to the response object r and store the resulting dictionary in the variable json_data.
- Hit Submit Answer to print the key-value pairs of the dictionary json_data to the shell.
# Import package
import requests

# Assign URL to variable: url
url = 'http://www.omdbapi.com/?apikey=72bc447a&t=the+social+network'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Decode the JSON data into a dictionary: json_data
json_data = r.json()

# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])
Title: The Social Network Year: 2010 Rated: PG-13 Released: 01 Oct 2010 Runtime: 120 min Genre: Biography, Drama Director: David Fincher Writer: Aaron Sorkin, Ben Mezrich Actors: Jesse Eisenberg, Andrew Garfield, Justin Timberlake Plot: As Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, he is sued by the twins who claimed he stole their idea, and by the co-founder who was later squeezed out of the business. Language: English, French Country: United States Awards: Won 3 Oscars. 172 wins & 186 nominations total Poster: https://m.media-amazon.com/images/M/MV5BOGUyZDUxZjEtMmIzMC00MzlmLTg4MGItZWJmMzBhZjE0Mjc1XkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_SX300.jpg Ratings: [{'Source': 'Internet Movie Database', 'Value': '7.7/10'}, {'Source': 'Rotten Tomatoes', 'Value': '96%'}, {'Source': 'Metacritic', 'Value': '95/100'}] Metascore: 95 imdbRating: 7.7 imdbVotes: 653,830 imdbID: tt1285016 Type: movie DVD: 11 Jan 2011 BoxOffice: $96,962,694 Production: Scott Rudin Productions, Trigger Street Productions, Michael De Luca Website: N/A Response: True
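One defensive habit worth adding when querying an API like this (an extension, not part of the exercise): check the HTTP status, and check OMDb’s own Response flag, before trusting the payload. A sketch, reusing url from the code above:

r = requests.get(url)
r.raise_for_status()                      # raises requests.HTTPError for 4xx/5xx responses

json_data = r.json()
if json_data.get('Response') == 'False':  # OMDb reports lookup failures in the body, not the status code
    print('OMDb error:', json_data.get('Error'))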
Checking out the Wikipedia API
You’re doing so well and having so much fun that we’re going to throw one more API at you: the Wikipedia API (documented here). You’ll figure out how to find and extract information from the Wikipedia page for Pizza. What gets a bit wild here is that your query will return nested JSONs, that is, JSONs within JSONs, but Python can handle that because it will translate them into dictionaries within dictionaries.
The URL that requests the relevant query from the Wikipedia API is
https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza
Instructions
- Assign the relevant URL to the variable url.
- Apply the json() method to the response object r and store the resulting dictionary in the variable json_data.
- The variable pizza_extract holds the HTML of an extract from Wikipedia’s Pizza page as a string; use the function print() to print this string to the shell.
# Import package
import requests

# Assign URL to variable: url
url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Decode the JSON data into a dictionary: json_data
json_data = r.json()

# Print the Wikipedia page extract
pizza_extract = json_data['query']['pages']['24768']['extract']
print(pizza_extract)
<link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1033289096"> <p class="mw-empty-elt"> </p> <p><b>Pizza</b> (<small>Italian: </small><span title="Representation in the International Phonetic Alphabet (IPA)">[ˈpittsa]</span>, <small>Neapolitan: </small><span title="Representation in the International Phonetic Alphabet (IPA)">[ˈpittsə]</span>) is an Italian dish consisting of a usually round, flattened base of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients (such as anchovies, mushrooms, onions, olives, pineapple, meat, etc.), which is then baked at a high temperature, traditionally in a wood-fired oven. A small pizza is sometimes called a pizzetta. A person who makes pizza is known as a <b>pizzaiolo</b>. </p><p>In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced, and is eaten with the use of a knife and fork. In casual settings, however, it is cut into wedges to be eaten while held in the hand. </p><p>The term <i>pizza</i> was first recorded in the 10th century in a Latin manuscript from the Southern Italian town of Gaeta in Lazio, on the border with Campania. Modern pizza was invented in Naples, and the dish and its variants have since become popular in many countries. It has become one of the most popular foods in the world and a common fast food item in Europe, North America and Australasia; available at pizzerias (restaurants specializing in pizza), restaurants offering Mediterranean cuisine, and via pizza delivery. Various food companies also sell ready-baked frozen pizzas in grocery stores, to be reheated in an ordinary home oven. </p><p>The <i>Associazione Verace Pizza Napoletana</i> (lit. True Neapolitan Pizza Association) is a non-profit organization founded in 1984 with headquarters in Naples that aims to promote traditional Neapolitan pizza. In 2009, upon Italy's request, Neapolitan pizza was registered with the European Union as a Traditional Speciality Guaranteed dish, and in 2017 the art of its making was included on UNESCO's list of intangible cultural heritage.</p>
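The page id '24768' is specific to the Pizza article. If you don’t want to hard-code it, you can iterate over whatever page ids the API returns (a sketch, assuming json_data from the code above):

pages = json_data['query']['pages']
for page_id, page in pages.items():   # usually a single entry, keyed by the numeric page id
    print(page_id, page['title'])
    print(page['extract'][:200])      # first 200 characters of the HTML extract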
API Authentication
The package tweepy
is great at handling all the Twitter API OAuth Authentication details for you. All you need to do is pass it your authentication credentials. In this interactive exercise, we have created some mock authentication credentials (if you wanted to replicate this at home, you would need to create a Twitter App as Hugo detailed in the video). Your task is to pass these credentials to tweepy’s OAuth handler.
Instructions
- Import the package tweepy.
- Pass the parameters consumer_key and consumer_secret to the function tweepy.OAuthHandler().
- Complete the passing of OAuth credentials to the OAuth handler auth by applying to it the method set_access_token(), along with arguments access_token and access_token_secret.
import tweepy

# Store OAuth authentication credentials in relevant variables
access_token = "test_token1"
access_token_secret = "test_token1_secret"
consumer_key = "test_token2"
consumer_secret = "test_token2_secret"

# Pass OAuth details to tweepy's OAuth handler
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
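With real credentials (the mock ones above will fail), the same handler can then be passed to tweepy.API() for ordinary, non-streaming calls; a minimal sketch against the tweepy 3.x interface used in this course:

api = tweepy.API(auth)
user = api.verify_credentials()   # returns your own User object when the credentials are valid
print(user.screen_name)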
Streaming tweets
Now that you have set up your authentication credentials, it is time to stream some tweets! We have already defined the tweet stream listener class, MyStreamListener, just as Hugo did in the introductory video. You can find the code for the tweet stream listener class here.
Your task is to create the Stream object and to filter tweets according to particular keywords.
Instructions
- Create your Stream object with authentication by passing tweepy.Stream() the authentication handler auth and the Stream listener l.
- To filter Twitter streams, pass to the track argument in stream.filter() a list containing the desired keywords 'clinton', 'trump', 'sanders', and 'cruz'.
l = MyStreamListener()

# Create your Stream object with authentication
stream = tweepy.Stream(auth, l)

# Filter Twitter Streams to capture data by the keywords:
stream.filter(track=['clinton', 'trump', 'sanders', 'cruz'])
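The listener class itself isn’t reproduced in these notes. Roughly, it writes each incoming tweet’s JSON to tweets.txt and stops after a fixed number of tweets; the sketch below is a reconstruction against the old tweepy 3.x StreamListener API, not the course’s exact code.

import json
import tweepy

class MyStreamListener(tweepy.StreamListener):
    """Write incoming tweets to tweets.txt and stop after 100 of them."""

    def __init__(self, api=None):
        super().__init__(api)
        self.num_tweets = 0
        self.file = open("tweets.txt", "w")

    def on_status(self, status):
        tweet = status._json                      # the raw tweet as a dictionary
        self.file.write(json.dumps(tweet) + '\n')
        self.num_tweets += 1
        if self.num_tweets < 100:
            return True                           # keep streaming
        self.file.close()
        return False                              # returning False disconnects the stream

    def on_error(self, status_code):
        print(status_code)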
Load and explore your Twitter data
Now that you’ve got your Twitter data sitting locally in a text file, it’s time to explore it! This is what you’ll do in the next few interactive exercises. In this exercise, you’ll read the Twitter data into a list: tweets_data.
Be aware that this is real data from Twitter and as such there is always a risk that it may contain profanity or other offensive content (in this exercise, and any following exercises that also use real Twitter data).
Instructions
- Assign the filename 'tweets.txt' to the variable tweets_data_path.
- Initialize tweets_data as an empty list to store the tweets in.
- Within the for loop initiated by for line in tweets_file:, load each tweet into a variable, tweet, using json.loads(), then append tweet to tweets_data using the append() method.
- Hit submit and check out the keys of the first tweet dictionary printed to the shell.
import json

# String of path to file: tweets_data_path
tweets_data_path = 'tweets.txt'

# Initialize empty list to store tweets: tweets_data
tweets_data = []

# Open connection to file
tweets_file = open(tweets_data_path, "r")

# Read in tweets and store in list: tweets_data
for line in tweets_file:
    tweet = json.loads(line)
    tweets_data.append(tweet)

# Close connection to file
tweets_file.close()

# Print the keys of the first tweet dictionary
print(tweets_data[0].keys())
<script.py> output: dict_keys(['in_reply_to_user_id', 'created_at', 'filter_level', 'truncated', 'possibly_sensitive', 'timestamp_ms', 'user', 'text', 'extended_entities', 'in_reply_to_status_id', 'entities', 'favorited', 'retweeted', 'is_quote_status', 'id', 'favorite_count', 'retweeted_status', 'in_reply_to_status_id_str', 'in_reply_to_user_id_str', 'id_str', 'in_reply_to_screen_name', 'coordinates', 'lang', 'place', 'contributors', 'geo', 'retweet_count', 'source'])
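Opening and closing the file by hand works, but a with block (a small idiomatic variation, not the course’s solution) closes the file automatically, even if an exception occurs:

import json

tweets_data = []
with open('tweets.txt', 'r') as tweets_file:
    for line in tweets_file:
        tweets_data.append(json.loads(line))

print(tweets_data[0].keys())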
Twitter data to DataFrame
Now you have the Twitter data in a list of dictionaries, tweets_data, where each dictionary corresponds to a single tweet. Next, you’re going to extract the text and language of each tweet. The text in a tweet, t1, is stored as the value t1['text']; similarly, the language is stored in t1['lang']. Your task is to build a DataFrame in which each row is a tweet and the columns are 'text' and 'lang'.
Instructions
- Use pd.DataFrame() to construct a DataFrame of tweet texts and languages; to do so, the first argument should be tweets_data, a list of dictionaries. The second argument to pd.DataFrame() is a list of the keys you wish to have as columns. Assign the result of the pd.DataFrame() call to df.
- Print the head of the DataFrame.
# Import package
import pandas as pd

# Build DataFrame of tweet texts and languages
df = pd.DataFrame(tweets_data, columns=['text', 'lang'])

# Print head of DataFrame
print(df.head())
<script.py> output: text lang 0 b"RT @bpolitics: .@krollbondrating's Christoph... en 1 b'RT @HeidiAlpine: @dmartosko Cruz video found... en 2 b'Njihuni me Zonj\\xebn Trump !!! | Ekskluzive... et 3 b"Your an idiot she shouldn't have tried to gr... en 4 b'RT @AlanLohner: The anti-American D.C. elite... en
A little bit of Twitter text analysis
Now that you have your DataFrame of tweets set up, you’re going to do a bit of text analysis to count how many tweets contain the words 'clinton', 'trump', 'sanders' and 'cruz'. In the pre-exercise code, we have defined the following function word_in_text(), which will tell you whether the first argument (a word) occurs within the second argument (a tweet).
import re

def word_in_text(word, text):
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)
    if match:
        return True
    return False
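For example, with a couple of made-up tweet strings:

print(word_in_text('trump', "RT @bpolitics: Trump leads in the latest poll"))    # True
print(word_in_text('clinton', "RT @bpolitics: Trump leads in the latest poll"))  # False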
You’re going to iterate over the rows of the DataFrame and calculate how many tweets contain each of our keywords! A counter for each candidate (clinton, trump, sanders, cruz) has been initialized to 0.
Instructions
- Within the for loop for index, row in df.iterrows():, the code currently increases the value of clinton by 1 each time a tweet (text row) mentioning ‘Clinton’ is encountered; complete the code so that the same happens for trump, sanders and cruz.
[clinton, trump, sanders, cruz] = [0, 0, 0, 0]

# Iterate through df, counting the number of tweets in which
# each candidate is mentioned
for index, row in df.iterrows():
    clinton += word_in_text('clinton', row['text'])
    trump += word_in_text('trump', row['text'])
    sanders += word_in_text('sanders', row['text'])
    cruz += word_in_text('cruz', row['text'])
Plotting your Twitter data
Now that you have the number of tweets that each candidate was mentioned in, you can plot a bar chart of this data. You’ll use the statistical data visualization library seaborn, which you may not have seen before, but we’ll guide you through. You’ll first import seaborn as sns. You’ll then construct a barplot of the data using sns.barplot, passing it two arguments:
- a list of labels and
- a list containing the variables you wish to plot (clinton, trump and so on).
Hopefully, you’ll see that Trump was unreasonably represented! We have already run the previous exercise solutions in your environment.
Instructions
- Import both matplotlib.pyplot and seaborn using the aliases plt and sns, respectively.
- Complete the arguments of sns.barplot:
  - The first argument should be the list of labels to appear on the x-axis (created in the previous step).
  - The second argument should be a list of the variables you wish to plot, as produced in the previous exercise (i.e. a list containing clinton, trump, etc).
# Import packages
import matplotlib.pyplot as plt
import seaborn as sns

# Set seaborn style
sns.set(color_codes=True)

# Create a list of labels: cd
cd = ['clinton', 'trump', 'sanders', 'cruz']

# Plot the bar chart
ax = sns.barplot(x=cd, y=[clinton, trump, sanders, cruz])
ax.set(ylabel="count")
plt.show()