Intermediate Importing Data in Python

Importing flat files from the web: your turn!

You are about to import your first file from the web! The flat file you will import is 'winequality-red.csv' from the University of California, Irvine’s Machine Learning repository. It contains tabular data on the physicochemical properties of red wine, such as pH, alcohol content and citric acid content, along with a wine quality rating.

The URL of the file is

'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

After you import it, you’ll check your working directory to confirm that it is there and then you’ll load it into a pandas DataFrame.

Instructions
  • Import the function urlretrieve from the subpackage urllib.request.
  • Assign the URL of the file to the variable url.
  • Use the function urlretrieve() to save the file locally as 'winequality-red.csv'.
  • Execute the remaining code to load 'winequality-red.csv' in a pandas DataFrame and to print its head to the shell.
# Import package
from urllib.request import urlretrieve

# Import pandas
import pandas as pd

# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Save file locally
urlretrieve(url, 'winequality-red.csv')

# Read file into a DataFrame and print its head
df = pd.read_csv('winequality-red.csv', sep=';')
print(df.head())
   fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
0            7.4              0.70         0.00             1.9      0.076   
1            7.8              0.88         0.00             2.6      0.098   
2            7.8              0.76         0.04             2.3      0.092   
3           11.2              0.28         0.56             1.9      0.075   
4            7.4              0.70         0.00             1.9      0.076   

   free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
0                 11.0                  34.0   0.9978  3.51       0.56   
1                 25.0                  67.0   0.9968  3.20       0.68   
2                 15.0                  54.0   0.9970  3.26       0.65   
3                 17.0                  60.0   0.9980  3.16       0.58   
4                 11.0                  34.0   0.9978  3.51       0.56   

   alcohol  quality  
0      9.4        5  
1      9.8        5  
2      9.8        5  
3      9.8        6  
4      9.4        5  
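
The exercise text mentions checking the working directory to confirm that the file arrived, but the script above never does so. Below is a minimal sketch of one way to check. The helper name fetch_and_confirm is hypothetical (not part of the course), and a small local file reached through a file:// URL stands in for the real download so the example runs offline.

```python
import os
import tempfile
from pathlib import Path
from urllib.request import urlretrieve


def fetch_and_confirm(url, filename):
    """Hypothetical helper: download url to filename, then confirm it exists."""
    # urlretrieve returns the local path and the response headers
    saved_path, _headers = urlretrieve(url, filename)
    if not os.path.exists(saved_path):
        raise FileNotFoundError(saved_path)
    return saved_path, os.path.getsize(saved_path)


# Assumption: a tiny local file served through file:// replaces the real URL
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.csv"
    source.write_bytes(b"a;b\n1;2\n")
    target = os.path.join(tmp, "copy.csv")

    saved_path, size = fetch_and_confirm(source.as_uri(), target)
    listed = sorted(os.listdir(tmp))  # what the directory now contains
    print(saved_path, size)
    print(listed)
```

For the course file, the same call would presumably be fetch_and_confirm(url, 'winequality-red.csv'), followed by os.listdir() on the working directory.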

Opening and reading flat files from the web

You have just imported a file from the web, saved it locally and loaded it into a DataFrame. If you just want to load a file from the web into a DataFrame without first saving it locally, you can do that easily using pandas. In particular, you can use the function pd.read_csv() with the URL as the first argument and the separator passed as sep.

The URL of the file, once again, is

'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'
Instructions
  • Assign the URL of the file to the variable url.
  • Read file into a DataFrame df using pd.read_csv(), recalling that the separator in the file is ';'.
  • Print the head of the DataFrame df.
  • Execute the rest of the code to plot a histogram of the first feature in the DataFrame df.
# Import packages
import matplotlib.pyplot as plt
import pandas as pd

# Assign url of file: url
url = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Read file into a DataFrame: df
df = pd.read_csv(url, sep=';')

# Print the head of the DataFrame
print(df.head())

# Plot first column of df
# (.ix was removed from pandas; .iloc selects by position)
pd.DataFrame.hist(df.iloc[:, 0:1])
plt.xlabel('fixed acidity (g(tartaric acid)/dm$^3$)')
plt.ylabel('count')
plt.show()
<script.py> output:
       fixed acidity  volatile acidity  citric acid  residual sugar  chlorides  \
    0            7.4              0.70         0.00             1.9      0.076   
    1            7.8              0.88         0.00             2.6      0.098   
    2            7.8              0.76         0.04             2.3      0.092   
    3           11.2              0.28         0.56             1.9      0.075   
    4            7.4              0.70         0.00             1.9      0.076   
    
       free sulfur dioxide  total sulfur dioxide  density    pH  sulphates  \
    0                 11.0                  34.0   0.9978  3.51       0.56   
    1                 25.0                  67.0   0.9968  3.20       0.68   
    2                 15.0                  54.0   0.9970  3.26       0.65   
    3                 17.0                  60.0   0.9980  3.16       0.58   
    4                 11.0                  34.0   0.9978  3.51       0.56   
    
       alcohol  quality  
    0      9.4        5  
    1      9.8        5  
    2      9.8        5  
    3      9.8        6  
    4      9.4        5  
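
To see why the separator matters, here is a small sketch that parses the same semicolon-delimited text with and without sep=';'. The in-memory sample is an assumption: it reuses three columns and two rows from the output above instead of fetching the real file, so it runs offline.

```python
import io

import pandas as pd

# Assumption: a three-column excerpt of the wine data, held in memory
sample = "fixed acidity;pH;quality\n7.4;3.51;5\n7.8;3.20;5\n"

# Default separator is ',', so each line ends up in a single column
wrong = pd.read_csv(io.StringIO(sample))

# Passing sep=';' splits the fields as intended
right = pd.read_csv(io.StringIO(sample), sep=';')

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```

Without sep=';' pandas does not raise an error; it silently returns a one-column DataFrame, so checking the shape or the column names is a cheap sanity test.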

Importing non-flat files from the web

Congrats! You’ve just loaded a flat file from the web into a DataFrame without first saving it locally using the pandas function pd.read_csv(). This function is super cool because it has close relatives that allow you to load all types of files, not only flat ones. In this interactive exercise, you’ll use pd.read_excel() to import an Excel spreadsheet.

The URL of the spreadsheet is

'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'

Your job is to use pd.read_excel() to read in all of its sheets, print the sheet names and then print the head of the first sheet using its name, not its index.

Note that when you pass sheet_name=None, the output of pd.read_excel() is a Python dictionary with sheet names as keys and the corresponding DataFrames as values.

Instructions
  • Assign the URL of the file to the variable url.
  • Read the file in url into a dictionary xls using pd.read_excel(), recalling that, in order to import all sheets, you need to pass None to the argument sheet_name.
  • Print the names of the sheets in the Excel spreadsheet; these will be the keys of the dictionary xls.
  • Print the head of the first sheet using the sheet name, not the index of the sheet! The sheet name is '1700'.
# Import package
import pandas as pd

# Assign url of file: url
url = 'http://s3.amazonaws.com/assets.datacamp.com/course/importing_data_into_r/latitude.xls'

# Read in all sheets of Excel file: xls
xls = pd.read_excel(url, sheet_name=None)

# Print the sheetnames to the shell
print(xls.keys())

# Print the head of the first sheet (using its name, NOT its index)
print(xls['1700'].head())
odict_keys(['1700', '1900'])
                 country       1700
0            Afghanistan  34.565000
1  Akrotiri and Dhekelia  34.616667
2                Albania  41.312000
3                Algeria  36.720000
4         American Samoa -14.307000
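
Because the result is an ordinary dictionary of DataFrames, you can loop over it like any other dictionary. The sketch below uses a hand-built stand-in for the dictionary so that it runs without the spreadsheet or an Excel engine. The '1700' rows are copied from the output above; the empty '1900' frame is only a placeholder, since its contents are not shown here.

```python
import pandas as pd

# Assumption: hand-built stand-in for pd.read_excel(url, sheet_name=None)
sheets = {
    "1700": pd.DataFrame(
        {"country": ["Afghanistan", "Albania"], "1700": [34.565, 41.312]}
    ),
    # Placeholder only: the real '1900' sheet is not reproduced here
    "1900": pd.DataFrame({"country": [], "1900": []}),
}

# Keys are sheet names, values are DataFrames
for name, frame in sheets.items():
    print(name, frame.shape)

# Look a sheet up by name rather than by position
first_sheet = sheets["1700"]
print(first_sheet.head())
```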

Performing HTTP requests in Python using urllib

Now that you know the basics behind HTTP GET requests, it’s time to perform some of your own. In this interactive exercise, you will ping our very own DataCamp servers to perform a GET request to extract information from the first coding exercise of this course, "https://campus.datacamp.com/courses/1606/4135?ex=2".

In the next exercise, you’ll extract the HTML itself. Right now, however, you are going to package and send the request and then catch the response.

Instructions
  • Import the functions urlopen and Request from the subpackage urllib.request.
  • Package the request to the url "https://campus.datacamp.com/courses/1606/4135?ex=2" using the function Request() and assign it to request.
  • Send the request and catch the response in the variable response with the function urlopen().
  • Run the rest of the code to see the datatype of response and to close the connection!
# Import packages
from urllib.request import urlopen, Request

# Specify the url
url = "https://campus.datacamp.com/courses/1606/4135?ex=2"

# This packages the request: request
request = Request(url)

# Sends the request and catches the response: response
response = urlopen(request)

# Print the datatype of response
print(type(response))

# Be polite and close the response!
response.close()
<class 'http.client.HTTPResponse'>
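
As an alternative to calling response.close() by hand, the object returned by urlopen() can be used in a with statement, which closes it when the block ends. The sketch below is a minimal illustration. As an assumption made so that it runs offline, it opens a data: URL instead of the DataCamp page, so the response type will differ from http.client.HTTPResponse.

```python
from urllib.request import Request, urlopen

# Assumption: a data: URL stands in for a real web page
url = "data:text/plain;charset=utf-8,hello"

# Package the request, as in the exercise
request = Request(url)

# The with block closes the response on exit, even if an error occurs
with urlopen(request) as response:
    content_type = response.headers.get_content_type()
    body = response.read()

print(content_type)
print(body)
```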

Printing HTTP request results in Python using urllib

You have just packaged and sent a GET request to "https://campus.datacamp.com/courses/1606/4135?ex=2" and then caught the response. You saw that such a response is a http.client.HTTPResponse object. The question remains: what can you do with this response?

Well, as it came from an HTML page, you can read it to extract the HTML: an http.client.HTTPResponse object has a read() method, which returns the page contents as bytes. In this exercise, you’ll build on your previous great work to extract the response and print the HTML.
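
Since read() returns bytes, non-ASCII characters appear as escapes such as \xe2\x80\x93 when printed. A short sketch of decoding the bytes into a string follows. It assumes a data: URL as an offline stand-in for the real page and falls back to UTF-8 when the response declares no charset.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen

# Assumption: a one-line page containing an en dash, served via a data: URL
page = "<title>JSON\u2013from the web to Python</title>"
url = "data:text/html;charset=utf-8," + quote(page)

with urlopen(Request(url)) as response:
    # Charset declared by the response, with UTF-8 as a fallback
    charset = response.headers.get_content_charset() or "utf-8"
    raw = response.read()

# Decode the bytes into a str before searching or parsing the text
text = raw.decode(charset)

print(type(raw))
print(raw)
print(text)
```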

Instructions
  • Send the request and catch the response in the variable response with the function urlopen(), as in the previous exercise.
  • Extract the response using the read() method and store the result in the variable html.
  • Print html. Because read() returns bytes, the printed output starts with b'.
  • Hit submit to perform all of the above and to close the response: be tidy!
# Import packages
from urllib.request import urlopen, Request

# Specify the url
url = "https://campus.datacamp.com/courses/1606/4135?ex=2"

# This packages the request
request = Request(url)

# Sends the request and catches the response: response
response = urlopen(request)

# Extract the response: html
html = response.read()

# Print the html
print(html)

# Be polite and close the response!
response.close()
b'<!doctype html><html lang="en"><head><link rel="apple-touch-icon-precomposed" sizes="57x57" href="/apple-touch-icon-57x57.png"><link rel="apple-touch-icon-precomposed" sizes="114x114" href="/apple-touch-icon-114x114.png"><link rel="apple-touch-icon-precomposed" sizes="72x72" href="/apple-touch-icon-72x72.png"><link rel="apple-touch-icon-precomposed" sizes="144x144" href="/apple-touch-icon-144x144.png"><link rel="apple-touch-icon-precomposed" sizes="60x60" href="/apple-touch-icon-60x60.png"><link rel="apple-touch-icon-precomposed" sizes="120x120" href="/apple-touch-icon-120x120.png"><link rel="apple-touch-icon-precomposed" sizes="76x76" href="/apple-touch-icon-76x76.png"><link rel="apple-touch-icon-precomposed" sizes="152x152" href="/apple-touch-icon-152x152.png"><link rel="icon" type="image/png" href="/favicon.ico"><link rel="icon" type="image/png" href="/favicon-196x196.png" sizes="196x196"><link rel="icon" type="image/png" href="/favicon-96x96.png" sizes="96x96"><link rel="icon" type="image/png" href="/favicon-32x32.png" sizes="32x32"><link rel="icon" type="image/png" href="/favicon-16x16.png" sizes="16x16"><link rel="icon" type="image/png" href="/favicon-128.png" sizes="128x128"><meta name="application-name" content="DataCamp"><meta name="msapplication-TileColor" content="#FFFFFF"><meta name="msapplication-TileImage" content="/mstile-144x144.png"><meta name="msapplication-square70x70logo" content="/mstile-70x70.png"><meta name="msapplication-square150x150logo" content="/mstile-150x150.png"><meta name="msapplication-wide310x150logo" content="/mstile-310x150.png"><meta name="msapplication-square310x310logo" content="/mstile-310x310.png"><link href="/static/css/17.279a2d9e.chunk.css" rel="stylesheet"><link href="/static/css/main.88385ec5.chunk.css" rel="stylesheet"><title data-react-helmet="true">Importing flat files from the web: your turn! 
| Python</title><link data-react-helmet="true" rel="canonical" href="https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=2"><meta data-react-helmet="true" charset="utf-8"><meta data-react-helmet="true" http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"><meta data-react-helmet="true" name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1"><meta data-react-helmet="true" name="fragment" content="!"><meta data-react-helmet="true" name="keywords" content="R, Python, Data analysis, interactive, learning"><meta data-react-helmet="true" name="description" content="Here is an example of Importing flat files from the web: your turn!: You are about to import your first file from the web! The flat file you will import will be 'winequality-red."><meta data-react-helmet="true" name="twitter:card" content="summary"><meta data-react-helmet="true" name="twitter:site" content="@DataCamp"><meta data-react-helmet="true" name="twitter:title" content="Importing flat files from the web: your turn! | Python"><meta data-react-helmet="true" name="twitter:description" content="Here is an example of Importing flat files from the web: your turn!: You are about to import your first file from the web! The flat file you will import will be 'winequality-red."><meta data-react-helmet="true" name="twitter:creator" content="@DataCamp"><meta data-react-helmet="true" name="twitter:image:src" content="/public/assets/images/var/twitter_share.png"><meta data-react-helmet="true" name="twitter:domain" content="www.datacamp.com"><meta data-react-helmet="true" property="og:title" content="Importing flat files from the web: your turn! 
| Python"><meta data-react-helmet="true" property="og:image" content="/public/assets/images/var/linkedin_share.png"><meta data-react-helmet="true" name="google-signin-clientid" content="892114885437-01a7plbsu1b2vobuhvnckmmanhb58h3a.apps.googleusercontent.com"><meta data-react-helmet="true" name="google-signin-scope" content="email profile"><meta data-react-helmet="true" name="google-signin-cookiepolicy" content="single_host_origin"><script data-react-helmet="true" async="true" src="https://compliance.datacamp.com/base.js"></script><script data-react-helmet="true">\n      var dataLayerContent = {\n        gtm_version: 2,\n      };\n      if (typeof window[\'dataLayer\'] === \'undefined\') {\n        window[\'dataLayer\'] = [dataLayerContent];\n      } else {\n        window[\'dataLayer\'].push(dataLayerContent);\n      }\n    </script><script async src=\'/cdn-cgi/bm/cv/669835187/api.js\'></script></head><body><script>window.PRELOADED_STATE = "["~#iR",["^ ","n","StateRecord","v",["^ ","backendSession",["~#iOM",["status",["^2",["code","none","text",""]],"isInitSession",false,"message",null]],"boot",["^0",["^ ","n","BootStateRecord","v",["^ ","bootState","PRE_BOOTED","error",null]]],"chapter",["^2",["current",["^2",["badge_uncompleted_url","https://assets.datacamp.com/production/default/badges/missing_unc.png","number",1,"number_of_videos",3,"slug","importing-data-from-the-internet-1","last_updated_on","09/04/2021","title_meta",null,"nb_exercises",12,"free_preview",true,"slides_link","https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/slides/chapter1.pdf","title","Importing data from the Internet","xp",1050,"id",4135,"exercises",["~#iL",[["^2",["type","VideoExercise","title","Importing flat files from the web","aggregate_xp",50,"number",1,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=1"]],["^2",["type","NormalExercise","title","Importing flat files from the web: your 
turn!","aggregate_xp",100,"number",2,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=2"]],["^2",["type","NormalExercise","title","Opening and reading flat files from the web","aggregate_xp",100,"number",3,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=3"]],["^2",["type","NormalExercise","title","Importing non-flat files from the web","aggregate_xp",100,"number",4,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=4"]],["^2",["type","VideoExercise","title","HTTP requests to import files from the web","aggregate_xp",50,"number",5,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=5"]],["^2",["type","NormalExercise","title","Performing HTTP requests in Python using urllib","aggregate_xp",100,"number",6,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=6"]],["^2",["type","NormalExercise","title","Printing HTTP request results in Python using urllib","aggregate_xp",100,"number",7,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=7"]],["^2",["type","NormalExercise","title","Performing HTTP requests in Python using requests","aggregate_xp",100,"number",8,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=8"]],["^2",["type","VideoExercise","title","Scraping the web in Python","aggregate_xp",50,"number",9,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=9"]],["^2",["type","NormalExercise","title","Parsing HTML with 
BeautifulSoup","aggregate_xp",100,"number",10,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=10"]],["^2",["type","NormalExercise","title","Turning a webpage into data using BeautifulSoup: getting the text","aggregate_xp",100,"number",11,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=11"]],["^2",["type","NormalExercise","title","Turning a webpage into data using BeautifulSoup: getting the hyperlinks","aggregate_xp",100,"number",12,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=12"]]]],"description","The web is a rich source of data from which you can extract various types of insights and findings. In this chapter, you will learn how to get data from the web, whether it is stored in files or in HTML. You'll also learn the basics of scraping and parsing web data.","badge_completed_url","https://assets.datacamp.com/production/default/badges/missing.png"]]]],"contentAuthorization",["^ "],"course",["^2",["difficulty_level",1,"reduced_outline",null,"marketing_video","","active_image","course-1606-master:506759a234ec905a9377923e00ae7511-20210409131713880","author_field",null,"chapters",["^7",[["^2",["badge_uncompleted_url","https://assets.datacamp.com/production/default/badges/missing_unc.png","number",1,"number_of_videos",3,"slug","importing-data-from-the-internet-1","last_updated_on","09/04/2021","title_meta",null,"nb_exercises",12,"free_preview",true,"slides_link","https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/slides/chapter1.pdf","title","Importing data from the Internet","xp",1050,"id",4135,"exercises",["^7",[["^2",["type","VideoExercise","title","Importing flat files from the 
web","aggregate_xp",50,"number",1,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=1"]],["^2",["type","NormalExercise","title","Importing flat files from the web: your turn!","aggregate_xp",100,"number",2,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=2"]],["^2",["type","NormalExercise","title","Opening and reading flat files from the web","aggregate_xp",100,"number",3,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=3"]],["^2",["type","NormalExercise","title","Importing non-flat files from the web","aggregate_xp",100,"number",4,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=4"]],["^2",["type","VideoExercise","title","HTTP requests to import files from the web","aggregate_xp",50,"number",5,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=5"]],["^2",["type","NormalExercise","title","Performing HTTP requests in Python using urllib","aggregate_xp",100,"number",6,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=6"]],["^2",["type","NormalExercise","title","Printing HTTP request results in Python using urllib","aggregate_xp",100,"number",7,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=7"]],["^2",["type","NormalExercise","title","Performing HTTP requests in Python using requests","aggregate_xp",100,"number",8,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=8"]],["^2",["type","VideoExercise","title","Scraping the web in 
Python","aggregate_xp",50,"number",9,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=9"]],["^2",["type","NormalExercise","title","Parsing HTML with BeautifulSoup","aggregate_xp",100,"number",10,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=10"]],["^2",["type","NormalExercise","title","Turning a webpage into data using BeautifulSoup: getting the text","aggregate_xp",100,"number",11,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=11"]],["^2",["type","NormalExercise","title","Turning a webpage into data using BeautifulSoup: getting the hyperlinks","aggregate_xp",100,"number",12,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/importing-data-from-the-internet-1?ex=12"]]]],"description","The web is a rich source of data from which you can extract various types of insights and findings. In this chapter, you will learn how to get data from the web, whether it is stored in files or in HTML. 
You'll also learn the basics of scraping and parsing web data.","badge_completed_url","https://assets.datacamp.com/production/default/badges/missing.png"]],["^2",["badge_uncompleted_url","https://assets.datacamp.com/production/default/badges/missing_unc.png","number",2,"number_of_videos",2,"slug","interacting-with-apis-to-import-data-from-the-web-2","last_updated_on","09/04/2021","title_meta",null,"nb_exercises",9,"free_preview",null,"slides_link","https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/slides/chapter2.pdf","title","Interacting with APIs to import data from the web","xp",650,"id",4136,"exercises",["^7",[["^2",["type","VideoExercise","title","Introduction to APIs and JSONs","aggregate_xp",50,"number",1,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/interacting-with-apis-to-import-data-from-the-web-2?ex=1"]],["^2",["type","PureMultipleChoiceExercise","title","Pop quiz: What exactly is a JSON?","aggregate_xp",50,"number",2,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/interacting-with-apis-to-import-data-from-the-web-2?ex=2"]],["^2",["type","NormalExercise","title","Loading and exploring a JSON","aggregate_xp",100,"number",3,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/interacting-with-apis-to-import-data-from-the-web-2?ex=3"]],["^2",["type","MultipleChoiceExercise","title","Pop quiz: Exploring your JSON","aggregate_xp",50,"number",4,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/interacting-with-apis-to-import-data-from-the-web-2?ex=4"]],["^2",["type","VideoExercise","title","APIs and interacting with the world wide web","aggregate_xp",50,"number",5,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/interacting-with-apis-to-import-data-from-the-web-2?ex=5"]],["^2",["type","PureMultipleChoiceExercise","title","Pop quiz: What's an 
API?","aggregate_xp",50,"number",6,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/interacting-with-apis-to-import-data-from-the-web-2?ex=6"]],["^2",["type","NormalExercise","title","API requests","aggregate_xp",100,"number",7,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/interacting-with-apis-to-import-data-from-the-web-2?ex=7"]],["^2",["type","NormalExercise","title","JSON\xe2\x80\x93from the web to Python","aggregate_xp",100,"number",8,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/interacting-with-apis-to-import-data-from-the-web-2?ex=8"]],["^2",["type","NormalExercise","title","Checking out the Wikipedia API","aggregate_xp",100,"number",9,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/interacting-with-apis-to-import-data-from-the-web-2?ex=9"]]]],"description","In this chapter, you will gain a deeper understanding of how to import data from the web. 
You will learn the basics of extracting data from APIs, gain insight on the importance of APIs, and practice extracting data by diving into the OMDB and Library of Congress APIs.","badge_completed_url","https://assets.datacamp.com/production/default/badges/missing.png"]],["^2",["badge_uncompleted_url","https://assets.datacamp.com/production/default/badges/missing_unc.png","number",3,"number_of_videos",2,"slug","diving-deep-into-the-twitter-api","last_updated_on","09/04/2021","title_meta",null,"nb_exercises",8,"free_preview",null,"slides_link","https://s3.amazonaws.com/assets.datacamp.com/production/course_1606/slides/chapter3.pdf","title","Diving  deep into the Twitter API","xp",700,"id",4140,"exercises",["^7",[["^2",["type","VideoExercise","title","The Twitter API and Authentication","aggregate_xp",50,"number",1,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/diving-deep-into-the-twitter-api?ex=1"]],["^2",["type","NormalExercise","title","API Authentication","aggregate_xp",100,"number",2,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/diving-deep-into-the-twitter-api?ex=2"]],["^2",["type","NormalExercise","title","Streaming tweets","aggregate_xp",100,"number",3,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/diving-deep-into-the-twitter-api?ex=3"]],["^2",["type","NormalExercise","title","Load and explore your Twitter data","aggregate_xp",100,"number",4,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/diving-deep-into-the-twitter-api?ex=4"]],["^2",["type","NormalExercise","title","Twitter data to DataFrame","aggregate_xp",100,"number",5,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/diving-deep-into-the-twitter-api?ex=5"]],["^2",["type","NormalExercise","title","A little bit of Twitter text 
analysis","aggregate_xp",100,"number",6,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/diving-deep-into-the-twitter-api?ex=6"]],["^2",["type","NormalExercise","title","Plotting your Twitter data","aggregate_xp",100,"number",7,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/diving-deep-into-the-twitter-api?ex=7"]],["^2",["type","VideoExercise","title","Final Thoughts","aggregate_xp",50,"number",8,"url","https://campus.datacamp.com/courses/intermediate-importing-data-in-python/diving-deep-into-the-twitter-api?ex=8"]]]],"description","In this chapter, you will consolidate your knowledge of interacting with APIs in a deep dive into the Twitter streaming API. You'll learn how to stream real-time Twitter data, and how to analyze and visualize it.","badge_completed_url","https://assets.datacamp.com/production/default/badges/missing.png"]]]],"time_needed",null,"author_image","https://assets.datacamp.com/production/course_1606/author_images/author_image_course_1606_20200310-1-lgdj4c?1583853939","tracks",["^7",[["^2",["path","/tracks/data-analyst-with-python","title_with_subtitle","Data Analyst  with Python"]],["^2",["path","/tracks/data-scientist-with-python","title_with_subtitle","Data Scientist  with Python"]],["^2",["path","/tracks/importing-cleaning-data-with-python","title_with_subtitle","Importing & Cleaning Data  with Python"]]]],"runtime_config",null,"lti_only",false,"image_url","https://assets.datacamp.com/production/course_1606/shields/thumb/shield_image_course_1606_20200310-1-17hkmhz?1583853940","topic_id",8,"slug","intermediate-importing-data-in-python","last_updated_on","27/09/2021","paid",true,"collaborators",["^7",[["^2",["avatar_url","https://assets.datacamp.com/users/avatars/000/382/294/square/francis-photo.jpg?1471980001","full_name","Francisco 

Performing HTTP requests in Python using requests

Now that you’ve got your head and hands around making HTTP requests using the urllib package, you’re going to figure out how to do the same using the higher-level requests library. You’ll once again be pinging DataCamp servers for their "http://www.datacamp.com/teach/documentation" page.

Note that unlike in the previous exercises using urllib, you don’t have to close the connection when using requests!

Instructions
  • Import the package requests.
  • Assign the URL of interest to the variable url.
  • Package the request to the URL, send the request and catch the response with a single function requests.get(), assigning the response to the variable r.
  • Use the text attribute of the object r to return the HTML of the webpage as a string; store the result in a variable text.
  • Hit submit to print the HTML of the webpage.
# Import package
import requests

# Specify the url: url
url = "http://www.datacamp.com/teach/documentation"

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response: text
text = r.text

# Print the html
print(text)
<script.py> output:
    <!DOCTYPE HTML>
    <html lang="en-US">
    <head>
      <meta charset="UTF-8" />
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
      <meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />
      <meta name="robots" content="noindex, nofollow" />
      <meta name="viewport" content="width=device-width,initial-scale=1" />
      <title>Just a moment...</title>
      <style type="text/css">
        html, body {width: 100%; height: 100%; margin: 0; padding: 0;}
        body {background-color: #ffffff; color: #000000; font-family:-apple-system, system-ui, BlinkMacSystemFont, "Segoe UI", Roboto, Oxygen, Ubuntu, "Helvetica Neue",Arial, sans-serif; font-size: 16px; line-height: 1.7em;-webkit-font-smoothing: antialiased;}
        h1 { text-align: center; font-weight:700; margin: 16px 0; font-size: 32px; color:#000000; line-height: 1.25;}
        p {font-size: 20px; font-weight: 400; margin: 8px 0;}
        p, .attribution, {text-align: center;}
        #spinner {margin: 0 auto 30px auto; display: block;}
        .attribution {margin-top: 32px;}
        @keyframes fader     { 0% {opacity: 0.2;} 50% {opacity: 1.0;} 100% {opacity: 0.2;} }
        @-webkit-keyframes fader { 0% {opacity: 0.2;} 50% {opacity: 1.0;} 100% {opacity: 0.2;} }
        #cf-bubbles > .bubbles { animation: fader 1.6s infinite;}
        #cf-bubbles > .bubbles:nth-child(2) { animation-delay: .2s;}
        #cf-bubbles > .bubbles:nth-child(3) { animation-delay: .4s;}
        .bubbles { background-color: #f58220; width:20px; height: 20px; margin:2px; border-radius:100%; display:inline-block; }
        a { color: #2c7cb0; text-decoration: none; -moz-transition: color 0.15s ease; -o-transition: color 0.15s ease; -webkit-transition: color 0.15s ease; transition: color 0.15s ease; }
        a:hover{color: #f4a15d}
        .attribution{font-size: 16px; line-height: 1.5;}
        .ray_id{display: block; margin-top: 8px;}
        #cf-wrapper #challenge-form { padding-top:25px; padding-bottom:25px; }
        #cf-hcaptcha-container { text-align:center;}
        #cf-hcaptcha-container iframe { display: inline-block;}
      </style>
    
          <meta http-equiv="refresh" content="3">
      <script type="text/javascript">
        //<![CDATA[
        (function(){
          
          window._cf_chl_opt={
            cvId: "2",
            cType: "non-interactive",
            cNounce: "34415",
            cRay: "696301f41ee5180f",
            cHash: "10e0e81796684cb",
            cFPWv: "b",
            cTTimeMs: "1000",
            cRq: {
              ru: "aHR0cDovL3d3dy5kYXRhY2FtcC5jb20vdGVhY2gvZG9jdW1lbnRhdGlvbg==",
              ra: "cHl0aG9uLXJlcXVlc3RzLzIuMTIuMQ==",
              rm: "R0VU",
              d: "ZOgEvcDeIixerAxGVar7L1j4O4CsnK4uNXaUQ+x2CHwq8alqqhk9YnierOCZC/eLfB9vkIsMxuLAc/HBO2QeEDYvvXPYW236hCQaIN9AeztHb7mWv5B+eAjOBwPKBrJLaf2t28v85h7gFNy1mXkZpEGGIHs52p6BlpIxjOb3FUSEZqj9yi+fO2d6tlMT3+mPH158CuGFSZkTyUzICqSOkM17POPmpDHCDzw0pObbcZxNA4LNgNyCONEnSZ64ZMmxv07R/3bGmcBaHxvw0RvGaGojwr1SuNYWO6c8y2VwNmixT4B7PEBeK58W7YfpKXhd42FWbRSnMxqt2M1RWxQCQGBQ77XYu96RxGsXTwBKBWrFXRZlOySatu0+hHhPQWjUrLawncs4TAyVMlNTJo5ynQob5hGWh7DLVvMCdj+oUgYev/V0u3bY5vcWGPzBMimNN1qs2cvqN/Vm8A8m5tjqBwWnhl6KwwJQyxPnriGMWO5VCI+yG4pCR0hrM5N/KzeY8dR/FCLiHJC4UgySqBip1NNDxE+G6zikGnoM5zCG1vE=",
              t: "MTYzMjg5NTM5OS4wNjIwMDA=",
              m: "DY3U5u3GNMGcIh28j9Ix8veJszY6uCdaD6RbAkx+xeA=",
              i1: "WVKE5Eam1cwqayHFagWTOg==",
              i2: "LBZy73WNehCXmqxKC5fGxw==",
              zh: "hzfiqo9hugT9sHeHQ1zy81NCL/S0295H0+GuRnkSV9o=",
              uh: "i/CCY+JJoYjtn06LPki7UEDiltWWpUGH2i5oc7l6Ktk=",
              hh: "rAZnIHiyrNuZ60h9aAZNML8izDilqmOSNuCtac1WqPs=",
            }
          }
          window._cf_chl_enter = function(){window._cf_chl_opt.p=1};
          
        })();
        //]]>
      </script>
      
    
    </head>
    <body>
      <table width="100%" height="100%" cellpadding="20">
        <tr>
          <td align="center" valign="middle">
              <div class="cf-browser-verification cf-im-under-attack">
      <noscript>
        <h1 data-translate="turn_on_js" style="color:#bd2426;">Please turn JavaScript on and reload the page.</h1>
      </noscript>
      <div id="cf-content" style="display:none">
        
        <div id="cf-bubbles">
          <div class="bubbles"></div>
          <div class="bubbles"></div>
          <div class="bubbles"></div>
        </div>
        <h1><span data-translate="checking_browser">Checking your browser before accessing</span> datacamp.com.</h1>
        
        <div id="no-cookie-warning" class="cookie-warning" data-translate="turn_on_cookies" style="display:none">
          <p data-translate="turn_on_cookies" style="color:#bd2426;">Please enable Cookies and reload the page.</p>
        </div>
        <p data-translate="process_is_automatic">This process is automatic. Your browser will redirect to your requested content shortly.</p>
        <p data-translate="allow_5_secs" id="cf-spinner-allow-5-secs" >Please allow up to 5 seconds…</p>
        <p data-translate="redirecting" id="cf-spinner-redirecting" style="display:none">Redirecting…</p>
      </div>
       
      <form class="challenge-form" id="challenge-form" action="/teach/documentation?__cf_chl_jschl_tk__=pmd_Y0DCyG5HijBcO5QKNcw1NKaMX4rtgczvN.P0MQax5R4-1632895399-0-gqNtZGzNAiWjcnBszQJ9" method="POST" enctype="application/x-www-form-urlencoded">
        <input type="hidden" name="md" value="rcE3GJnfe41l6t.msA6gYk77TOtfYhDNo_pcVl5RvSw-1632895399-0-AbUkp3mA7_99XGeouka-ysDdds0OIXdr53Vx12QRECIzlK9g7Gr1mkrFOaIE7WtBR3D3r4Cw7GPmbxKFAv9MK5Lp8AFmBHpdRemCMWPDaVuYNZrwhSuaT2USpplD5S-zh8m94FXpbCcyEmbV0d02WspXsgzVE9bKhk5Nixm7Bgbj4NijgGr8TnRt_zl4Sdfjq8bQdaaZgUqtWS_FtcFXcuEyS7sk9niJN83Qi1-cqsgVRd7rd4nScaiUtVQFc0LsNOmSlur54HjtiTzd1MsY5YCrAgLxec4DPSd8eawSpR53EBIVfvOzGwffrVbWz3hA9FjjfQwx7tsjXyHNgLANHTDMT5VEqSsdk8rCXEQAGTomAzRXaqAcLv135LLZujjf-gSWI8Tu47qBmiUJwwO5iKeb_Ly1XpCTwvlagSRAgMxN4Ar4zCzLyYan0qWd6x6A9khhEjDSfPosE38a4XiH9MsDcHsuNm1OEf_udPuNv67G" />
        <input type="hidden" name="r" value="__6qKQ955HRlhKKdw9CwBt_VH0k94Wa2GK0576tHtHc-1632895399-0-Ac1Z18QR1M8vH4OHaYqGBvXAY4rOBFf2vo0HL6aFi4XxyKoGcoWAD/FRaXC/WHmTuP+yLpRb4Rm0wiuT13TZwvNZvn2aeaiXPMfXVTR+igpHoSHi2VzVQ4T70vIH9UqfztJdvKxyapit1U/Q2uQosWity6dmq7TLTBUpc6534469czUfH8PzkmeaC8SqGhafFK6rWJQjSmDy1a3Arx88jGZt+zkiL6shfnlc/uzJvFIersiJ+j6Xz9cN6QlKpkklaZjGkr59ky20ZXHNQ7xAw3iTsA2IYKo4zaIkI6MA3S1Gjc7mB1uhXicVhvB0MYP8XWZiUSn/r0N0v655yP2DiLTVf9JdmUcx2zRrDYHOvuQAJwBcaBqa4p8dfFixgumpZ3S3M9YVrsSd/PAfxYNlnfWlotqQAN4ZKa1oHPq7Xe3Xf4I9g8yVoUSjYqiFBsHaypo257M9Dtn9BufJ6MQ0vZar0+ztsJiFypzVNQHsIaLGQXsfzFJ7Ur8syCfo70/99QOmwxQd0N1l4SzEvwnE9YR4XpfH2/PRkZQr9mO1akpMf/Chjx8GqT6VS04iNM/xww=="/>
        <input type="hidden" value="133674064405b50cfdd3c527d466bf65" id="jschl-vc" name="jschl_vc"/>
        <!-- <input type="hidden" value="" id="jschl-vc" name="jschl_vc"/> -->
        <input type="hidden" name="pass" value="1632895400.062-8rOUa/oCpC"/>
        <input type="hidden" id="jschl-answer" name="jschl_answer"/>
      </form>
         
        <script type="text/javascript">
          //<![CDATA[
          (function(){
              var a = document.getElementById('cf-content');
              a.style.display = 'block';
              var isIE = /(MSIE|Trident\/|Edge\/)/i.test(window.navigator.userAgent);
              var trkjs = isIE ? new Image() : document.createElement('img');
              trkjs.setAttribute("src", "/cdn-cgi/images/trace/jschal/js/transparent.gif?ray=696301f41ee5180f");
              trkjs.id = "trk_jschal_js";
              trkjs.setAttribute("alt", "");
              document.body.appendChild(trkjs);
              var cpo=document.createElement('script');
              cpo.type='text/javascript';
              cpo.src="/cdn-cgi/challenge-platform/h/b/orchestrate/jsch/v1?ray=696301f41ee5180f";
              document.getElementsByTagName('head')[0].appendChild(cpo);
            }());
          //]]>
        </script>
      
    
      
      <div id="trk_jschal_nojs" style="background-image:url('/cdn-cgi/images/trace/jschal/nojs/transparent.gif?ray=696301f41ee5180f')"> </div>
    </div>
    
              
              <div class="attribution">
                DDoS protection by <a rel="noopener noreferrer" href="https://www.cloudflare.com/5xx-error-landing/" target="_blank">Cloudflare</a>
                <br />
                <span class="ray_id">Ray ID: <code>696301f41ee5180f</code></span>
              </div>
          </td>
         
        </tr>
      </table>
    </body>
    </html>
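
Note that the HTML printed above is a Cloudflare "Just a moment..." challenge page, not the documentation page itself. As a minimal sketch, assuming the two marker strings below (copied from that output) are a good enough signal, a small helper can flag this case before the text is used. The name looks_like_challenge is hypothetical and is not part of requests or of any Cloudflare API.

```python
# Markers copied from the challenge page printed above. Treating them as a
# signal is a heuristic assumption, not an official Cloudflare interface.
CHALLENGE_MARKERS = (
    "<title>Just a moment...</title>",
    "cf-browser-verification",
)


def looks_like_challenge(html):
    """Return True if the HTML contains any of the challenge markers."""
    return any(marker in html for marker in CHALLENGE_MARKERS)


# Two tiny stand-in pages so the sketch runs without network access.
challenge = "<html><head><title>Just a moment...</title></head></html>"
normal = "<html><head><title>Documentation</title></head></html>"

print(looks_like_challenge(challenge))
print(looks_like_challenge(normal))
```

In the exercise, the check would be applied to r.text right after requests.get(url). Printing r.status_code alongside it is another quick way to see whether the request returned what you expected.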

Parsing HTML with BeautifulSoup

In this interactive exercise, you’ll learn how to use the BeautifulSoup package to parse, prettify and extract information from HTML. You’ll scrape the data from the webpage of Guido van Rossum, Python’s very own Benevolent Dictator for Life. In the following exercises, you’ll prettify the HTML and then extract the text and the hyperlinks.

The URL of interest is url = 'https://www.python.org/~guido/'.

Instructions
  • Import the function BeautifulSoup from the package bs4.
  • Assign the URL of interest to the variable url.
  • Package the request to the URL, send the request and catch the response with a single function requests.get(), assigning the response to the variable r.
  • Use the text attribute of the object r to return the HTML of the webpage as a string; store the result in a variable html_doc.
  • Create a BeautifulSoup object soup from the resulting HTML using the function BeautifulSoup().
  • Use the method prettify() on soup and assign the result to pretty_soup.
  • Hit submit to print the prettified HTML to your shell!
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extracts the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = html_doc.BeautifulSoup()

# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()

# Print the response
print(pretty_soup)
Traceback (most recent call last):
  File "<stdin>", line 15, in <module>
    soup = html_doc.BeautifulSoup()
AttributeError: 'str' object has no attribute 'BeautifulSoup'
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extracts the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc)

# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()

# Print the response
print(pretty_soup)

<html>
 <head>
  <title>
   Guido's Personal Home Page
  </title>
 </head>
 <body bgcolor="#FFFFFF" text="#000000">
  <!-- Built from main -->
  <h1>
   <a href="pics.html">
    <img border="0" src="images/IMG_2192.jpg"/>
   </a>
   Guido van Rossum - Personal Home Page
   <a href="pics.html">
    <img border="0" height="216" src="images/guido-headshot-2019.jpg" width="270"/>
   </a>
  </h1>
  <p>
   <a href="http://www.washingtonpost.com/wp-srv/business/longterm/microsoft/stories/1998/raymond120398.htm">
    <i>
     "Gawky and proud of it."
    </i>
   </a>
   <h3>
    <a href="images/df20000406.jpg">
     Who I Am
    </a>
   </h3>
   <p>
    Read
my
    <a href="http://neopythonic.blogspot.com/2016/04/kings-day-speech.html">
     "King's
Day Speech"
    </a>
    for some inspiration.
    <p>
     I am the author of the
     <a href="http://www.python.org">
      Python
     </a>
     programming language.  See also my
     <a href="Resume.html">
      resume
     </a>
     and my
     <a href="Publications.html">
      publications list
     </a>
     , a
     <a href="bio.html">
      brief bio
     </a>
     , assorted
     <a href="http://legacy.python.org/doc/essays/">
      writings
     </a>
     ,
     <a href="http://legacy.python.org/doc/essays/ppt/">
      presentations
     </a>
     and
     <a href="interviews.html">
      interviews
     </a>
     (all about Python), some
     <a href="pics.html">
      pictures of me
     </a>
     ,
     <a href="http://neopythonic.blogspot.com">
      my new blog
     </a>
     , and
my
     <a href="http://www.artima.com/weblogs/index.jsp?blogger=12088">
      old
blog
     </a>
     on Artima.com.  I am
     <a href="https://twitter.com/gvanrossum">
      @gvanrossum
     </a>
     on Twitter.
     <p>
      I am retired, working on personal projects (and maybe a book).
I have worked for Dropbox, Google, Elemental Security, Zope
Corporation, BeOpen.com, CNRI, CWI, and SARA.  (See
my
      <a href="Resume.html">
       resume
      </a>
      .)  I created Python while at CWI.
      <h3>
       How to Reach Me
      </h3>
      <p>
       You can send email for me to guido (at) python.org.
I read everything sent there, but I receive too much email to respond
to everything.
       <h3>
        My Name
       </h3>
       <p>
        My name often poses difficulties for Americans.
        <p>
         <b>
          Pronunciation:
         </b>
         in Dutch, the "G" in Guido is a hard G,
pronounced roughly like the "ch" in Scottish "loch".  (Listen to the
         <a href="guido.au">
          sound clip
         </a>
         .)  However, if you're
American, you may also pronounce it as the Italian "Guido".  I'm not
too worried about the associations with mob assassins that some people
have. :-)
         <p>
          <b>
           Spelling:
          </b>
          my last name is two words, and I'd like to keep it
that way, the spelling on some of my credit cards notwithstanding.
Dutch spelling rules dictate that when used in combination with my
first name, "van" is not capitalized: "Guido van Rossum".  But when my
last name is used alone to refer to me, it is capitalized, for
example: "As usual, Van Rossum was right."
          <p>
           <b>
            Alphabetization:
           </b>
           in America, I show up in the alphabet under
"V".  But in Europe, I show up under "R".  And some of my friends put
me under "G" in their address book...
           <h3>
            More Hyperlinks
           </h3>
           <ul>
            <li>
             Here's a collection of
             <a href="http://legacy.python.org/doc/essays/">
              essays
             </a>
             relating to Python
that I've written, including the foreword I wrote for Mark Lutz' book
"Programming Python".
             <p>
              <li>
               I own the official
               <a href="images/license.jpg">
                <img align="center" border="0" height="75" src="images/license_thumb.jpg" width="100">
                 Python license.
                </img>
               </a>
               <p>
               </p>
              </li>
             </p>
            </li>
           </ul>
           <h3>
            The Audio File Formats FAQ
           </h3>
           <p>
            I was the original creator and maintainer of the Audio File Formats
FAQ.  It is now maintained by Chris Bagwell
at
            <a href="http://www.cnpbagwell.com/audio-faq">
             http://www.cnpbagwell.com/audio-faq
            </a>
            .  And here is a link to
            <a href="http://sox.sourceforge.net/">
             SOX
            </a>
            , to which I contributed
some early code.
           </p>
          </p>
         </p>
        </p>
       </p>
      </p>
     </p>
    </p>
   </p>
  </p>
 </body>
</html>
<hr>
 <a href="images/internetdog.gif">
  "On the Internet, nobody knows you're
a dog."
 </a>
 <hr>
 </hr>
</hr>
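
The same parse-and-prettify steps can be tried offline on a small HTML string. The sketch below is not part of the exercise: the HTML snippet is made up for illustration, and the parser name "html.parser" (Python's built-in parser) is passed explicitly, which is an assumption about which parser you want. Recent versions of bs4 warn when no parser is named, and different parsers can produce slightly different trees for messy HTML.

```python
from bs4 import BeautifulSoup

# A made-up, single-line HTML document standing in for r.text.
html_doc = (
    "<html><head><title>Demo page</title></head>"
    "<body><p>Hello, <a href='pics.html'>pictures</a>!</p></body></html>"
)

# Name the parser explicitly instead of letting bs4 pick one.
soup = BeautifulSoup(html_doc, "html.parser")

# prettify() returns a string with one tag or text node per line, indented by depth.
pretty_soup = soup.prettify()

print(pretty_soup)
```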

How to write a request header

From https://365datascience.com/tutorials/python-tutorials/request-headers-web-scraping/

A Chrome User-Agent string might look like: “Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36”

We can assign this string to the "User-Agent" key of a dictionary called headers:

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}

The usual get request looks like:

r = requests.get("https://www.youtube.com")

Now we’re going to make it include the header:

r = requests.get("https://www.youtube.com", headers=headers)
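
To confirm that a header is attached the way you expect, without sending anything over the network, one option is to build the request and inspect it. This is a minimal sketch using requests.Request and its prepare() method; the constant name USER_AGENT and the variable prepared are chosen here for illustration.

```python
import requests

# Browser-style User-Agent string, copied from the text above.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"
)

# The HTTP header name is "User-Agent", with a hyphen.
headers = {"User-Agent": USER_AGENT}

# Build the request but do not send it, so the headers can be inspected.
prepared = requests.Request(
    "GET", "https://www.youtube.com", headers=headers
).prepare()

# Header lookups on a prepared request are case-insensitive.
print(prepared.headers["User-Agent"])
```

Once the header looks right, passing the same dictionary to requests.get(url, headers=headers) sends it with the real request.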

Turning a webpage into data using BeautifulSoup: getting the text

As promised, in the following exercises, you’ll learn the basics of extracting information from HTML soup. In this exercise, you’ll figure out how to extract the text from the BDFL’s webpage, along with printing the webpage’s title.

Instructions
  • In the sample code, the HTML response object html_doc has already been created: your first task is to Soupify it using the function BeautifulSoup() and to assign the resulting soup to the variable soup.
  • Extract the title from the HTML soup soup using the attribute title and assign the result to guido_title.
  • Print the title of Guido’s webpage to the shell using the print() function.
  • Extract the text from the HTML soup soup using the method get_text() and assign to guido_text.
  • Hit submit to print the text from Guido’s webpage to the shell.
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc)

# Get the title of Guido's webpage: guido_title
guido_title = soup.title

# Print the title of Guido's webpage to the shell
print(guido_title)

# Get Guido's text: guido_text
guido_text = soup.get_text()

# Print Guido's text to the shell
print(guido_text)
<script.py> output:
    <title>Guido's Personal Home Page</title>
    
    
    Guido's Personal Home Page
    
    
    
    
    
    Guido van Rossum - Personal Home Page
    
    
    "Gawky and proud of it."
    Who I Am
    Read
    my "King's
    Day Speech" for some inspiration.
    
    I am the author of the Python
    programming language.  See also my resume
    and my publications list, a brief bio, assorted writings, presentations and interviews (all about Python), some
    pictures of me,
    my new blog, and
    my old
    blog on Artima.com.  I am
    @gvanrossum on Twitter.
    
    I am retired, working on personal projects (and maybe a book).
    I have worked for Dropbox, Google, Elemental Security, Zope
    Corporation, BeOpen.com, CNRI, CWI, and SARA.  (See
    my resume.)  I created Python while at CWI.
    
    How to Reach Me
    You can send email for me to guido (at) python.org.
    I read everything sent there, but I receive too much email to respond
    to everything.
    
    My Name
    My name often poses difficulties for Americans.
    
    Pronunciation: in Dutch, the "G" in Guido is a hard G,
    pronounced roughly like the "ch" in Scottish "loch".  (Listen to the
    sound clip.)  However, if you're
    American, you may also pronounce it as the Italian "Guido".  I'm not
    too worried about the associations with mob assassins that some people
    have. :-)
    
    Spelling: my last name is two words, and I'd like to keep it
    that way, the spelling on some of my credit cards notwithstanding.
    Dutch spelling rules dictate that when used in combination with my
    first name, "van" is not capitalized: "Guido van Rossum".  But when my
    last name is used alone to refer to me, it is capitalized, for
    example: "As usual, Van Rossum was right."
    
    Alphabetization: in America, I show up in the alphabet under
    "V".  But in Europe, I show up under "R".  And some of my friends put
    me under "G" in their address book...
    
    
    More Hyperlinks
    
    Here's a collection of essays relating to Python
    that I've written, including the foreword I wrote for Mark Lutz' book
    "Programming Python".
    I own the official 
    Python license.
    
    The Audio File Formats FAQ
    I was the original creator and maintainer of the Audio File Formats
    FAQ.  It is now maintained by Chris Bagwell
    at http://www.cnpbagwell.com/audio-faq.  And here is a link to
    SOX, to which I contributed
    some early code.
    
    
    
    "On the Internet, nobody knows you're
    a dog."

Turning a webpage into data using BeautifulSoup: getting the hyperlinks

In this exercise, you’ll figure out how to extract the URLs of the hyperlinks from the BDFL’s webpage. In the process, you’ll become close friends with the soup method find_all().

Instructions
  • Use the method find_all() to find all hyperlinks in soup, remembering that hyperlinks are defined by the HTML tag <a> but passed to find_all() without angle brackets; store the result in the variable a_tags.
  • The variable a_tags is a results set: your job now is to enumerate over it, using a for loop and to print the actual URLs of the hyperlinks; to do this, for every element link in a_tags, you want to print() link.get('href').
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extracts the response as html: html_doc
html_doc = r.text

# create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc)

# Print the title of Guido's webpage
print(soup.title)

# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')

# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))
    
    
<title>Guido's Personal Home Page</title>
pics.html
pics.html
http://www.washingtonpost.com/wp-srv/business/longterm/microsoft/stories/1998/raymond120398.htm
images/df20000406.jpg
http://neopythonic.blogspot.com/2016/04/kings-day-speech.html
http://www.python.org
Resume.html
Publications.html
bio.html
http://legacy.python.org/doc/essays/
http://legacy.python.org/doc/essays/ppt/
interviews.html
pics.html
http://neopythonic.blogspot.com
http://www.artima.com/weblogs/index.jsp?blogger=12088

Resume.html
guido.au
http://legacy.python.org/doc/essays/
images/license.jpg
http://www.cnpbagwell.com/audio-faq
http://sox.sourceforge.net/
images/internetdog.gif
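
Many of the hrefs printed above are relative (for example pics.html), so they only make sense together with the address of the page they came from. As a rough sketch, the standard library function urljoin() can resolve them against the page URL. The list hrefs below is a small hand-copied sample of the output, and the None entry is an assumption standing in for an <a> tag that has no href attribute.

```python
from urllib.parse import urljoin

# URL of the page the links were scraped from
base_url = 'https://www.python.org/~guido/'

# A few href values copied from the output above; None stands in
# for a tag without an href attribute (an assumption for this sketch)
hrefs = ['pics.html', 'Resume.html', 'http://sox.sourceforge.net/', None]

# Resolve each href against the page URL, skipping missing hrefs;
# hrefs that are already absolute are returned unchanged
absolute_urls = [urljoin(base_url, href) for href in hrefs if href]

for absolute_url in absolute_urls:
    print(absolute_url)
```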

Loading and exploring a JSON

Now that you know what a JSON is, you’ll load one into your Python environment and explore it yourself. Here, you’ll load the JSON 'a_movie.json' into the variable json_data, which will be a dictionary. You’ll then explore the JSON contents by printing the key-value pairs of json_data to the shell.

Instructions
  • Load the JSON 'a_movie.json' into the variable json_data within the context provided by the with statement. To do so, use the function json.load() within the context manager.
  • Use a for loop to print all key-value pairs in the dictionary json_data. Recall that you can access a value in a dictionary using the syntax: dictionary[key].
# Import package
import json

# Load JSON: json_data
with open("a_movie.json") as json_file:
    json_data = json.load(json_file)

# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])

Title:  The Social Network
Year:  2010
Rated:  PG-13
Released:  01 Oct 2010
Runtime:  120 min
Genre:  Biography, Drama
Director:  David Fincher
Writer:  Aaron Sorkin, Ben Mezrich
Actors:  Jesse Eisenberg, Andrew Garfield, Justin Timberlake
Plot:  As Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, he is sued by the twins who claimed he stole their idea, and by the co-founder who was later squeezed out of the business.
Language:  English, French
Country:  United States
Awards:  Won 3 Oscars. 172 wins & 186 nominations total
Poster:  https://m.media-amazon.com/images/M/MV5BOGUyZDUxZjEtMmIzMC00MzlmLTg4MGItZWJmMzBhZjE0Mjc1XkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_SX300.jpg
Ratings:  [{'Source': 'Internet Movie Database', 'Value': '7.7/10'}, {'Source': 'Rotten Tomatoes', 'Value': '96%'}, {'Source': 'Metacritic', 'Value': '95/100'}]
Metascore:  95
imdbRating:  7.7
imdbVotes:  653,830
imdbID:  tt1285016
Type:  movie
DVD:  11 Jan 2011
BoxOffice:  $96,962,694
Production:  Scott Rudin Productions, Trigger Street Productions, Michael De Luca
Website:  N/A
Response:  True

Alternative way of programming the above:

# Load JSON: json_data
with open("a_movie.json") as json_file:
    json_data = json.load(json_file)

# Print each key-value pair in json_data
for k, v in json_data.items():
    print(k + ': ', v)
    
Title:  The Social Network
Year:  2010
Rated:  PG-13
Released:  01 Oct 2010
Runtime:  120 min
Genre:  Biography, Drama
Director:  David Fincher
Writer:  Aaron Sorkin, Ben Mezrich
Actors:  Jesse Eisenberg, Andrew Garfield, Justin Timberlake
Plot:  As Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, he is sued by the twins who claimed he stole their idea, and by the co-founder who was later squeezed out of the business.
Language:  English, French
Country:  United States
Awards:  Won 3 Oscars. 172 wins & 186 nominations total
Poster:  https://m.media-amazon.com/images/M/MV5BOGUyZDUxZjEtMmIzMC00MzlmLTg4MGItZWJmMzBhZjE0Mjc1XkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_SX300.jpg
Ratings:  [{'Source': 'Internet Movie Database', 'Value': '7.7/10'}, {'Source': 'Rotten Tomatoes', 'Value': '96%'}, {'Source': 'Metacritic', 'Value': '95/100'}]
Metascore:  95
imdbRating:  7.7
imdbVotes:  653,830
imdbID:  tt1285016
Type:  movie
DVD:  11 Jan 2011
BoxOffice:  $96,962,694
Production:  Scott Rudin Productions, Trigger Street Productions, Michael De Luca
Website:  N/A
Response:  True
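
Notice that the value stored under 'Ratings' is not a string but a list of dictionaries. The sketch below shows one way to dig into it. It uses json.loads() on a shortened, hand-typed stand-in for the contents of 'a_movie.json' (only three of the keys, with values taken from the output above), so it runs without the file; the names movie_json and ratings are made up for the example.

```python
import json

# A shortened stand-in for the contents of 'a_movie.json'
movie_json = '''
{
  "Title": "The Social Network",
  "Year": "2010",
  "Ratings": [
    {"Source": "Internet Movie Database", "Value": "7.7/10"},
    {"Source": "Rotten Tomatoes", "Value": "96%"},
    {"Source": "Metacritic", "Value": "95/100"}
  ]
}
'''

# json.loads() parses a string, whereas json.load() reads from a file object
json_data = json.loads(movie_json)

# 'Ratings' decodes to a list of dictionaries, so you can loop over it
ratings = {rating['Source']: rating['Value'] for rating in json_data['Ratings']}

for source, value in ratings.items():
    print(source + ': ', value)
```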

Pop quiz: Exploring your JSON

Load the JSON 'a_movie.json' into a variable, which will be a dictionary. Do so by copying, pasting and executing the following code in the IPython Shell:

import json
with open("a_movie.json") as json_file:
    json_data = json.load(json_file)

Print the values corresponding to the keys 'Title' and 'Year' and answer the following question about the movie that the JSON describes:

What are the title and year of the movie?

# Look up the two values directly by key
print('Title : ' + json_data['Title'])
print('Year : ' + json_data['Year'])
        
Title : The Social Network
Year : 2010

API requests

Now it’s your turn to pull some movie data down from the Open Movie Database (OMDB) using their API. The movie you’ll query the API about is The Social Network. Recall that, in the video, to query the API about the movie Hackers, Hugo’s query string was 'http://www.omdbapi.com/?t=hackers' and had a single argument t=hackers.

Note: recently, OMDB has changed their API: you now also have to specify an API key. This means you’ll have to add another argument to the URL: apikey=72bc447a.

Instructions
  • Import the requests package.
  • Assign to the variable url the URL of interest in order to query 'http://www.omdbapi.com' for the data corresponding to the movie The Social Network. The query string should have two arguments: apikey=72bc447a and t=the+social+network. You can combine them as follows: apikey=72bc447a&t=the+social+network.
  • Print the text of the response object r by using its text attribute and passing the result to the print() function.
# Import requests package
import requests

# Assign URL to variable: url
url = 'http://www.omdbapi.com/?apikey=72bc447a&t=the+social+network'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Print the text of the response
print(r.text)
{"Title":"The Social Network","Year":"2010","Rated":"PG-13","Released":"01 Oct 2010","Runtime":"120 min","Genre":"Biography, Drama","Director":"David Fincher","Writer":"Aaron Sorkin, Ben Mezrich","Actors":"Jesse Eisenberg, Andrew Garfield, Justin Timberlake","Plot":"As Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, he is sued by the twins who claimed he stole their idea, and by the co-founder who was later squeezed out of the business.","Language":"English, French","Country":"United States","Awards":"Won 3 Oscars. 172 wins & 186 nominations total","Poster":"https://m.media-amazon.com/images/M/MV5BOGUyZDUxZjEtMmIzMC00MzlmLTg4MGItZWJmMzBhZjE0Mjc1XkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_SX300.jpg","Ratings":[{"Source":"Internet Movie Database","Value":"7.7/10"},{"Source":"Rotten Tomatoes","Value":"96%"},{"Source":"Metacritic","Value":"95/100"}],"Metascore":"95","imdbRating":"7.7","imdbVotes":"653,830","imdbID":"tt1285016","Type":"movie","DVD":"11 Jan 2011","BoxOffice":"$96,962,694","Production":"Scott Rudin Productions, Trigger Street Productions, Michael De Luca","Website":"N/A","Response":"True"}
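
Typing the query string by hand works, but it is easy to forget a + or an &. As a small sketch, the standard library function urlencode() can build the same query string from a dictionary. The names base_url, params and query_string are made up for the example.

```python
from urllib.parse import urlencode

base_url = 'http://www.omdbapi.com/'

# The two query arguments from the exercise
params = {'apikey': '72bc447a', 't': 'the social network'}

# urlencode() joins the pairs with '&' and encodes spaces as '+'
query_string = urlencode(params)

url = base_url + '?' + query_string
print(url)
```

Apart from the slash before the question mark, the result is the URL used in the exercise. requests.get() also accepts a params argument that performs this encoding for you.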

JSON–from the web to Python

Wow, congrats! You’ve just queried your first API programmatically in Python and printed the text of the response to the shell. However, as you know, your response is actually a JSON, so you can do one step better and decode the JSON. You can then print the key-value pairs of the resulting dictionary. That’s what you’re going to do now!

Instructions
  • Pass the variable url to the requests.get() function in order to send the relevant request and catch the response, assigning the resultant response message to the variable r.
  • Apply the json() method to the response object r and store the resulting dictionary in the variable json_data.
  • Hit Submit Answer to print the key-value pairs of the dictionary json_data to the shell.
# Import package
import requests

# Assign URL to variable: url
url = 'http://www.omdbapi.com/?apikey=72bc447a&t=social+network'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Decode the JSON data into a dictionary: json_data
json_data = r.json()

# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])
Title:  The Social Network
Year:  2010
Rated:  PG-13
Released:  01 Oct 2010
Runtime:  120 min
Genre:  Biography, Drama
Director:  David Fincher
Writer:  Aaron Sorkin, Ben Mezrich
Actors:  Jesse Eisenberg, Andrew Garfield, Justin Timberlake
Plot:  As Harvard student Mark Zuckerberg creates the social networking site that would become known as Facebook, he is sued by the twins who claimed he stole their idea, and by the co-founder who was later squeezed out of the business.
Language:  English, French
Country:  United States
Awards:  Won 3 Oscars. 172 wins & 186 nominations total
Poster:  https://m.media-amazon.com/images/M/MV5BOGUyZDUxZjEtMmIzMC00MzlmLTg4MGItZWJmMzBhZjE0Mjc1XkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_SX300.jpg
Ratings:  [{'Source': 'Internet Movie Database', 'Value': '7.7/10'}, {'Source': 'Rotten Tomatoes', 'Value': '96%'}, {'Source': 'Metacritic', 'Value': '95/100'}]
Metascore:  95
imdbRating:  7.7
imdbVotes:  653,830
imdbID:  tt1285016
Type:  movie
DVD:  11 Jan 2011
BoxOffice:  $96,962,694
Production:  Scott Rudin Productions, Trigger Street Productions, Michael De Luca
Website:  N/A
Response:  True

Checking out the Wikipedia API

You’re doing so well and having so much fun that we’re going to throw one more API at you: the Wikipedia API (documented here). You’ll figure out how to find and extract information from the Wikipedia page for Pizza. What gets a bit wild here is that your query will return nested JSONs, that is, JSONs within JSONs, but Python can handle that because it will translate them into dictionaries within dictionaries.

The URL that requests the relevant query from the Wikipedia API is

https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza
Instructions
  • Assign the relevant URL to the variable url.
  • Apply the json() method to the response object r and store the resulting dictionary in the variable json_data.
  • The variable pizza_extract holds the HTML of an extract from Wikipedia’s Pizza page as a string; use the function print() to print this string to the shell.
# Import package
import requests

# Assign URL to variable: url
url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza'


# Package the request, send the request and catch the response: r
r = requests.get(url)

# Decode the JSON data into a dictionary: json_data
json_data = r.json()

# Print the Wikipedia page extract
pizza_extract = json_data['query']['pages']['24768']['extract']
print(pizza_extract)
<link rel="mw-deduplicated-inline-style" href="mw-data:TemplateStyles:r1033289096">
<p class="mw-empty-elt">
</p>
<p><b>Pizza</b> (<small>Italian: </small><span title="Representation in the International Phonetic Alphabet (IPA)">[ˈpittsa]</span>, <small>Neapolitan: </small><span title="Representation in the International Phonetic Alphabet (IPA)">[ˈpittsə]</span>) is an Italian dish consisting of a usually round, flattened base of leavened wheat-based dough topped with tomatoes, cheese, and often various other ingredients (such as anchovies, mushrooms, onions, olives, pineapple, meat, etc.), which is then baked at a high temperature, traditionally in a wood-fired oven. A small pizza is sometimes called a pizzetta. A person who makes pizza is known as a <b>pizzaiolo</b>.
</p><p>In Italy, pizza served in formal settings, such as at a restaurant, is presented unsliced, and is eaten with the use of a knife and fork. In casual settings, however, it is cut into wedges to be eaten while held in the hand.
</p><p>The term <i>pizza</i> was first recorded in the 10th century in a Latin manuscript from the Southern Italian town of Gaeta in Lazio, on the border with Campania. Modern pizza was invented in Naples, and the dish and its variants have since become popular in many countries. It has become one of the most popular foods in the world and a common fast food item in Europe, North America and Australasia; available at pizzerias (restaurants specializing in pizza),  restaurants offering Mediterranean cuisine, and via pizza delivery. Various food companies also sell ready-baked frozen pizzas in grocery stores, to be reheated in an ordinary home oven.
</p><p>The <i>Associazione Verace Pizza Napoletana</i> (lit. True Neapolitan Pizza Association) is a non-profit organization founded in 1984 with headquarters in Naples that aims to promote traditional Neapolitan pizza. In 2009, upon Italy's request, Neapolitan pizza was registered with the European Union as a Traditional Speciality Guaranteed dish, and in 2017 the art of its making was included on UNESCO's list of intangible cultural heritage.</p>
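
The solution above hard-codes the page ID '24768', which you would not normally know in advance. As a rough sketch, you can loop over the 'pages' dictionary instead. The dictionary below is a trimmed, hand-written stand-in for the decoded response: the nesting follows the keys used above, the 'title' key is an assumption, and the extract text is only an abbreviated placeholder.

```python
# A trimmed stand-in for the decoded response (placeholder extract text)
json_data = {
    'query': {
        'pages': {
            '24768': {
                'title': 'Pizza',
                'extract': '<p><b>Pizza</b> is an Italian dish ...</p>'
            }
        }
    }
}

# 'pages' maps page IDs (as strings) to one dictionary per page
pages = json_data['query']['pages']

for page_id, page in pages.items():
    print(page_id, page['title'])

# With a single title in the query there is a single page, so take the first
pizza_extract = next(iter(pages.values()))['extract']
print(pizza_extract)
```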

API Authentication

The package tweepy is great at handling all the Twitter API OAuth Authentication details for you. All you need to do is pass it your authentication credentials. In this interactive exercise, we have created some mock authentication credentials (if you wanted to replicate this at home, you would need to create a Twitter App as Hugo detailed in the video). Your task is to pass these credentials to tweepy’s OAuth handler.

Instructions
  • Import the package tweepy.
  • Pass the parameters consumer_key and consumer_secret to the function tweepy.OAuthHandler().
  • Complete the passing of OAuth credentials to the OAuth handler auth by applying to it the method set_access_token(), along with arguments access_token and access_token_secret.
import tweepy

# Store OAuth authentication credentials in relevant variables
access_token = "test_token1"
access_token_secret = "test_token1_secret"
consumer_key = "test_token2"
consumer_secret = "test_token2_secret"

# Pass OAuth details to tweepy's OAuth handler
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

Streaming tweets

Now that you have set up your authentication credentials, it is time to stream some tweets! We have already defined the tweet stream listener class, MyStreamListener, just as Hugo did in the introductory video. You can find the code for the tweet stream listener class here.

Your task is to create the Stream object and to filter tweets according to particular keywords.

Instructions
  • Create your Stream object with authentication by passing tweepy.Stream() the authentication handler auth and the Stream listener l.
  • To filter Twitter streams, pass to the track argument in stream.filter() a list containing the desired keywords 'clinton', 'trump', 'sanders', and 'cruz'.
l = MyStreamListener()

# Create your Stream object with authentication
stream = tweepy.Stream(auth, l)

# Filter Twitter Streams to capture data by the keywords:
stream.filter(track = ['clinton', 'trump', 'sanders', 'cruz'])

Load and explore your Twitter data

Now that you’ve got your Twitter data sitting locally in a text file, it’s time to explore it! This is what you’ll do in the next few interactive exercises. In this exercise, you’ll read the Twitter data into a list: tweets_data.

Be aware that this is real data from Twitter and as such there is always a risk that it may contain profanity or other offensive content (in this exercise, and any following exercises that also use real Twitter data).

Instructions
  • Assign the filename 'tweets.txt' to the variable tweets_data_path.
  • Initialize tweets_data as an empty list to store the tweets in.
  • Within the for loop initiated by for line in tweets_file:, load each tweet into a variable, tweet, using json.loads(), then append tweet to tweets_data using the append() method.
  • Hit submit and check out the keys of the first tweet dictionary printed to the shell.
import json

# String of path to file: tweets_data_path
tweets_data_path = 'tweets.txt'

# Initialize empty list to store tweets: tweets_data
tweets_data = []

# Open connection to file
tweets_file = open(tweets_data_path, "r")

# Read in tweets and store in list: tweets_data
for line in tweets_file:
    tweet = json.loads(line)
    tweets_data.append(tweet)

# Close connection to file
tweets_file.close()

# Print the keys of the first tweet dict
print(tweets_data[0].keys())
<script.py> output:
    dict_keys(['in_reply_to_user_id', 'created_at', 'filter_level', 'truncated', 'possibly_sensitive', 'timestamp_ms', 'user', 'text', 'extended_entities', 'in_reply_to_status_id', 'entities', 'favorited', 'retweeted', 'is_quote_status', 'id', 'favorite_count', 'retweeted_status', 'in_reply_to_status_id_str', 'in_reply_to_user_id_str', 'id_str', 'in_reply_to_screen_name', 'coordinates', 'lang', 'place', 'contributors', 'geo', 'retweet_count', 'source'])
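
The same loop can be written with a with statement, so that the file is closed for you. The sketch below also skips blank lines, on the assumption that a file written by a stream may contain some. To keep it runnable without 'tweets.txt', it reads from an in-memory io.StringIO object holding two made-up tweets; fake_file and its contents are purely hypothetical.

```python
import io
import json

# Two made-up tweets, one JSON document per line, with a blank line between
fake_file = io.StringIO(
    '{"text": "first tweet", "lang": "en"}\n'
    '\n'
    '{"text": "second tweet", "lang": "et"}\n'
)

tweets_data = []

# With the real file this would be: with open('tweets.txt') as tweets_file:
with fake_file as tweets_file:
    for line in tweets_file:
        line = line.strip()
        if not line:
            continue  # json.loads('') would raise an error
        tweets_data.append(json.loads(line))

# Print the keys of the first tweet dict
print(tweets_data[0].keys())
```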

Twitter data to DataFrame

Now you have the Twitter data in a list of dictionaries, tweets_data, where each dictionary corresponds to a single tweet. Next, you’re going to extract the text and language of each tweet. The text in a tweet, t1, is stored as the value t1['text']; similarly, the language is stored in t1['lang']. Your task is to build a DataFrame in which each row is a tweet and the columns are 'text' and 'lang'.

Instructions
  • Use pd.DataFrame() to construct a DataFrame of tweet texts and languages; to do so, the first argument should be tweets_data, a list of dictionaries. The second argument to pd.DataFrame() is a list of the keys you wish to have as columns. Assign the result of the pd.DataFrame() call to df.
  • Print the head of the DataFrame.
# Import package
import pandas as pd

# Build DataFrame of tweet texts and languages
df = pd.DataFrame(tweets_data, columns=['text', 'lang'])

# Print head of DataFrame
print(df.head())
<script.py> output:
                                                    text lang
    0  b"RT @bpolitics: .@krollbondrating's Christoph...   en
    1  b'RT @HeidiAlpine: @dmartosko Cruz video found...   en
    2  b'Njihuni me Zonj\\xebn Trump !!! | Ekskluzive...   et
    3  b"Your an idiot she shouldn't have tried to gr...   en
    4  b'RT @AlanLohner: The anti-American D.C. elite...   en

A little bit of Twitter text analysis

Now that you have your DataFrame of tweets set up, you’re going to do a bit of text analysis to count how many tweets contain the words 'clinton', 'trump', 'sanders' and 'cruz'. In the pre-exercise code, we have defined the following function word_in_text(), which will tell you whether the first argument (a word) occurs within the second argument (a tweet).

import re

def word_in_text(word, text):
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)

    if match:
        return True
    return False

You’re going to iterate over the rows of the DataFrame and calculate how many tweets contain each of our keywords! The list of objects for each candidate has been initialized to 0.

Instructions
  • Within the for loop for index, row in df.iterrows():, the code currently increases the value of clinton by 1 each time a tweet (text row) mentioning ‘Clinton’ is encountered; complete the code so that the same happens for trump, sanders and cruz.
[clinton, trump, sanders, cruz] = [0, 0, 0, 0]

# Iterate through df, counting the number of tweets in which
# each candidate is mentioned
for index, row in df.iterrows():
    clinton += word_in_text('clinton', row['text'])
    trump += word_in_text('trump', row['text'])
    sanders += word_in_text('sanders', row['text'])
    cruz += word_in_text('cruz', row['text'])
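
The counting works because word_in_text() returns True or False, and Python treats True as 1 when adding. One caveat is that re.search(word, text) matches anywhere in the text, so 'trump' would also be found inside 'trumpet'. If that matters for your analysis, a stricter variant could look like the sketch below. The function whole_word_in_text() is hypothetical and not part of the exercise; it adds word boundaries (\b) and escapes the word so it is matched literally.

```python
import re

def whole_word_in_text(word, text):
    """Return True if word occurs in text as a whole word, ignoring case."""
    # re.escape() guards against special regex characters in the word;
    # \b only matches at the edge of a word
    pattern = r'\b' + re.escape(word) + r'\b'
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

print(whole_word_in_text('trump', 'Trump rally tonight'))
print(whole_word_in_text('trump', 'He plays the trumpet'))
```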

Plotting your Twitter data

Now that you have the number of tweets that each candidate was mentioned in, you can plot a bar chart of this data. You’ll use the statistical data visualization library seaborn, which you may not have seen before, but we’ll guide you through. You’ll first import seaborn as sns. You’ll then construct a barplot of the data using sns.barplot, passing it two arguments: 

  1. a list of labels and
  2. a list containing the variables you wish to plot (clinton, trump and so on).

Hopefully, you’ll see that Trump was unreasonably represented! We have already run the previous exercise solutions in your environment.

Instructions
  • Import both matplotlib.pyplot and seaborn using the aliases plt and sns, respectively.
  • Complete the arguments of sns.barplot
    • The first argument should be the list of labels to appear on the x-axis (created in the previous step).
    • The second argument should be a list of the variables you wish to plot, as produced in the previous exercise (i.e. a list containing clinton, trump, etc.).
# Import packages
import matplotlib.pyplot as plt
import seaborn as sns


# Set seaborn style
sns.set(color_codes=True)

# Create a list of labels: cd
cd = ['clinton', 'trump', 'sanders', 'cruz']

# Plot the bar chart
ax = sns.barplot(x=cd, y=[clinton, trump, sanders, cruz])
ax.set(ylabel="count")
plt.show()
