Merging dataframes with Pandas

Posted On 25 July 2021

Performing Anti-Joins

Merge employees and top_cust with a left join, setting indicatorargument to True. Save the result to empl_cust.

Select the srid column of empl_cust and the rows where _merge is 'left_only'. Save the result to srid_list.

Subset the employees table and select those rows where the srid is in the variable srid_list and print the results.

# Merge employees and top_cust
empl_cust = employees.merge(top_cust, on='srid', 
                                 how='left', indicator=True)

# Select the srid column where _merge is left_only
srid_list = empl_cust.loc[empl_cust['_merge'] == 'left_only', 'srid']

# Get employees not working with top customers
print(employees[employees['srid'].isin(srid_list)])

In [1]:
employees.head()
Out[1]:

   srid    lname     fname                title  hire_date                     email
0     1    Adams    Andrew      General Manager 2002-08-14    andrew@chinookcorp.com
1     2  Edwards     Nancy        Sales Manager 2002-05-01     nancy@chinookcorp.com
2     3  Peacock      Jane  Sales Support Agent 2002-04-01      jane@chinookcorp.com
3     4     Park  Margaret  Sales Support Agent 2003-05-03  margaret@chinookcorp.com
4     5  Johnson     Steve  Sales Support Agent 2003-10-17     steve@chinookcorp.com
In [2]:
top_cust.head()
Out[2]:

   cid  srid      fname        lname               phone                 fax                     email
0    1     3       Luís    Gonçalves  +55 (12) 3923-5555  +55 (12) 3923-5566      luisg@embraer.com.br
1    2     5     Leonie       Köhler    +49 0711 2842222                 NaN     leonekohler@surfeu.de
2    3     3   François     Tremblay   +1 (514) 721-4711                 NaN       ftremblay@gmail.com
3    4     4      Bjørn       Hansen     +47 22 44 22 22                 NaN     bjorn.hansen@yahoo.no
4    5     4  František  Wichterlová    +420 2 4172 5555    +420 2 4172 5555  frantisekw@jetbrains.com
# Merge employees and top_cust
empl_cust = employees.merge(top_cust, on='srid', 
                            how='left', indicator=True)
In [3]:
empl_cust.head()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    empl_cust.head()
NameError: name 'empl_cust' is not defined
# Merge employees and top_cust
empl_cust = employees.merge(top_cust, on='srid', 
                            how='left', indicator=True)

# Select the srid column where _merge is left_only
srid_list = empl_cust.loc[empl_cust['_merge'] == 'left_only', 'srid']
In [4]:
empl_cust.head()
Out[4]:

   srid  lname_x fname_x                title  hire_date  ...    lname_y               phone                 fax                        email_y     _merge
0     1    Adams  Andrew      General Manager 2002-08-14  ...        NaN                 NaN                 NaN                            NaN  left_only
1     2  Edwards   Nancy        Sales Manager 2002-05-01  ...        NaN                 NaN                 NaN                            NaN  left_only
2     3  Peacock    Jane  Sales Support Agent 2002-04-01  ...  Gonçalves  +55 (12) 3923-5555  +55 (12) 3923-5566           luisg@embraer.com.br       both
3     3  Peacock    Jane  Sales Support Agent 2002-04-01  ...   Tremblay   +1 (514) 721-4711                 NaN            ftremblay@gmail.com       both
4     3  Peacock    Jane  Sales Support Agent 2002-04-01  ...    Almeida  +55 (21) 2271-7000  +55 (21) 2271-7070  roberto.almeida@riotur.gov.br       both

[5 rows x 13 columns]
# Merge employees and top_cust
empl_cust = employees.merge(top_cust, on='srid', 
                                 how='left', indicator=True)

# Select the srid column where _merge is left_only
srid_list = empl_cust.loc[empl_cust['_merge'] == 'left_only', 'srid']

# Get employees not working with top customers
print(employees[employees['srid'].isin(srid_list)])
   srid     lname    fname            title  hire_date                    email
0     1     Adams   Andrew  General Manager 2002-08-14   andrew@chinookcorp.com
1     2   Edwards    Nancy    Sales Manager 2002-05-01    nancy@chinookcorp.com
5     6  Mitchell  Michael       IT Manager 2003-10-17  michael@chinookcorp.com
6     7      King   Robert         IT Staff 2004-01-02   robert@chinookcorp.com
7     8  Callahan    Laura         IT Staff 2004-03-04    laura@chinookcorp.com

<script.py> output:
       srid     lname    fname            title  hire_date                    email
    0     1     Adams   Andrew  General Manager 2002-08-14   andrew@chinookcorp.com
    1     2   Edwards    Nancy    Sales Manager 2002-05-01    nancy@chinookcorp.com
    5     6  Mitchell  Michael       IT Manager 2003-10-17  michael@chinookcorp.com
    6     7      King   Robert         IT Staff 2004-01-02   robert@chinookcorp.com
    7     8  Callahan    Laura         IT Staff 2004-03-04    laura@chinookcorp.com

Success! You performed an anti-join by first merging the tables with a left join, selecting the ID of those employees who did not support a top customer, and then subsetting the original employee’s table. From that, we can see that there are five employees not supporting top customers. Anti-joins are a powerful tool to filter a main table (i.e. employees) by another (i.e. customers).

Performing a semi-join

Some of the tracks that have generated the most significant amount of revenue are from TV-shows or are other non-musical audio. You have been given a table of invoices that include top revenue-generating items. Additionally, you have a table of non-musical tracks from the streaming service. In this exercise, you’ll use a semi-join to find the top revenue-generating non-musical tracks..

The tables non_mus_tcks, top_invoices, and genres have been loaded for you.

Instructions

Merge non_mus_tcks and top_invoices on tid using an inner join. Save the result as tracks_invoices.
Use .isin() to subset the rows of non_mus_tck where tid is in the tidcolumn of tracks_invoices. Save the result as top_tracks.
Group top_tracks by gid and count the tid rows. Save the result to cnt_by_gid.
Merge cnt_by_gid with the genres table on gid and print the result.

# Merge the non_mus_tck and top_invoices tables on tid
tracks_invoices = non_mus_tcks.merge(top_invoices, on='tid', how='inner')

# Use .isin() to subset non_mus_tcks to rows with tid in tracks_invoices
top_tracks = non_mus_tcks[non_mus_tcks['tid'].isin(tracks_invoices['tid'])]

# Group the top_tracks by gid and count the tid rows
cnt_by_gid = top_tracks.groupby(['gid'], as_index=False).agg({'tid':'count'})

# Merge the genres table to cnt_by_gid on gid and print
print(cnt_by_gid.merge(genres, on='gid'))

In [1]:
tracks_invoices.head()
Out[1]:

    tid       name  aid  mtid  gid  u_price  ilid  iid  uprice  quantity
0  2850    The Fix  228     3   21     1.99   473   88    1.99         1
1  2850    The Fix  228     3   21     1.99  2192  404    1.99         1
2  2868  Walkabout  230     3   19     1.99   476   88    1.99         1
3  2868  Walkabout  230     3   19     1.99  2194  404    1.99         1
4  3177   Hot Girl  249     3   19     1.99  1668  306    1.99         1
# Merge the non_mus_tck and top_invoices tables on tid
tracks_invoices = non_mus_tcks.merge(top_invoices, on='tid', how='inner')

# Use .isin() to subset non_mus_tcks to rows with tid in tracks_invoices
top_tracks = non_mus_tcks[non_mus_tcks['tid'].isin(tracks_invoices['tid'])]

# Group the top_tracks by gid and count the tid rows
cnt_by_gid = top_tracks.groupby(['gid'], as_index=False).agg({'tid':'count'})

# Merge the genres table to cnt_by_gid on gid and print
print(cnt_by_gid.merge(genres, on='gid'))
   gid  tid      name
0   19    4  TV Shows
1   21    2     Drama
2   22    1    Comedy

Nice job! In this exercise, you replicated a semi-join to filter the table of tracks by the table of invoice items to find the top revenue non-musical tracks. With some additional data manipulation, you discovered that ‘TV-shows’ is the non-musical genre that has the most top revenue-generating tracks. Now that you’ve done both semi- and anti-joins, it’s time to move to the next topic.

Concatenation basics

You have been given a few tables of data with musical track info for different albums from the metal band, Metallica. The track info comes from their Ride The Lightning, Master Of Puppets, and St. Anger albums. Try various features of the .concat() method by concatenating the tables vertically together in different ways.

The tables tracks_master, tracks_ride, and tracks_st have loaded for you.

Concatenate tracks_master, tracks_ride, and tracks_st, in that order, setting sort to True.
Concatenate tracks_master, tracks_ride, and tracks_st, where the index goes from 0 to n-1.
Concatenate tracks_master, tracks_ride, and tracks_st, showing only columns that are in all tables.

# Concatenate the tracks, show only columns names that are in all tables
tracks_from_albums = pd.concat([tracks_master, tracks_ride, tracks_st],
                               join='inner',
                               sort=True)
print(tracks_from_albums)

# Concatenate the tracks
tracks_from_albums = pd.concat([tracks_master, tracks_ride, tracks_st],
                               sort=True)
print(tracks_from_albums)
   aid             composer  gid  mtid                     name   tid  u_price
0  152  J.Hetfield/L.Ulrich    3     1                  Battery  1853     0.99
1  152            K.Hammett    3     1        Master Of Puppets  1854     0.99
4  152  J.Hetfield/L.Ulrich    3     1        Disposable Heroes  1857     0.99
0  154                  NaN    3     1     Fight Fire With Fire  1874     0.99
1  154                  NaN    3     1       Ride The Lightning  1875     0.99
2  154                  NaN    3     1  For Whom The Bell Tolls  1876     0.99
3  154                  NaN    3     1            Fade To Black  1877     0.99
4  154                  NaN    3     1        Trapped Under Ice  1878     0.99
0  155                  NaN    3     1                  Frantic  1882     0.99
1  155                  NaN    3     1                St. Anger  1883     0.99
2  155                  NaN    3     1     Some Kind Of Monster  1884     0.99
3  155                  NaN    3     1             Dirty Window  1885     0.99
4  155                  NaN    3     1            Invisible Kid  1886     0.99

<script.py> output:
       aid             composer  gid  mtid                     name   tid  u_price
    0  152  J.Hetfield/L.Ulrich    3     1                  Battery  1853     0.99
    1  152            K.Hammett    3     1        Master Of Puppets  1854     0.99
    4  152  J.Hetfield/L.Ulrich    3     1        Disposable Heroes  1857     0.99
    0  154                  NaN    3     1     Fight Fire With Fire  1874     0.99
    1  154                  NaN    3     1       Ride The Lightning  1875     0.99
    2  154                  NaN    3     1  For Whom The Bell Tolls  1876     0.99
    3  154                  NaN    3     1            Fade To Black  1877     0.99
    4  154                  NaN    3     1        Trapped Under Ice  1878     0.99
    0  155                  NaN    3     1                  Frantic  1882     0.99
    1  155                  NaN    3     1                St. Anger  1883     0.99
    2  155                  NaN    3     1     Some Kind Of Monster  1884     0.99
    3  155                  NaN    3     1             Dirty Window  1885     0.99
    4  155                  NaN    3     1            Invisible Kid  1886     0.99
# Concatenate the tracks so the index goes from 0 to n-1
tracks_from_albums = pd.concat([tracks_master, tracks_ride, tracks_st],
                               ignore_index=True,
                               sort=True)
print(tracks_from_albums)
    aid             composer  gid  mtid                     name   tid  u_price
0   152  J.Hetfield/L.Ulrich    3     1                  Battery  1853     0.99
1   152            K.Hammett    3     1        Master Of Puppets  1854     0.99
2   152  J.Hetfield/L.Ulrich    3     1        Disposable Heroes  1857     0.99
3   154                  NaN    3     1     Fight Fire With Fire  1874     0.99
4   154                  NaN    3     1       Ride The Lightning  1875     0.99
5   154                  NaN    3     1  For Whom The Bell Tolls  1876     0.99
6   154                  NaN    3     1            Fade To Black  1877     0.99
7   154                  NaN    3     1        Trapped Under Ice  1878     0.99
8   155                  NaN    3     1                  Frantic  1882     0.99
9   155                  NaN    3     1                St. Anger  1883     0.99
10  155                  NaN    3     1     Some Kind Of Monster  1884     0.99
11  155                  NaN    3     1             Dirty Window  1885     0.99
12  155                  NaN    3     1            Invisible Kid  1886     0.99

<script.py> output:
        aid             composer  gid  mtid                     name   tid  u_price
    0   152  J.Hetfield/L.Ulrich    3     1                  Battery  1853     0.99
    1   152            K.Hammett    3     1        Master Of Puppets  1854     0.99
    2   152  J.Hetfield/L.Ulrich    3     1        Disposable Heroes  1857     0.99
    3   154                  NaN    3     1     Fight Fire With Fire  1874     0.99
    4   154                  NaN    3     1       Ride The Lightning  1875     0.99
    5   154                  NaN    3     1  For Whom The Bell Tolls  1876     0.99
    6   154                  NaN    3     1            Fade To Black  1877     0.99
    7   154                  NaN    3     1        Trapped Under Ice  1878     0.99
    8   155                  NaN    3     1                  Frantic  1882     0.99
    9   155                  NaN    3     1                St. Anger  1883     0.99
    10  155                  NaN    3     1     Some Kind Of Monster  1884     0.99
    11  155                  NaN    3     1             Dirty Window  1885     0.99
    12  155                  NaN    3     1            Invisible Kid  1886     0.99
# Concatenate the tracks, show only columns names that are in all tables
tracks_from_albums = pd.concat([tracks_master, tracks_ride, tracks_st],
                               join='inner',
                               sort=True)
print(tracks_from_albums)
   aid  gid  mtid                     name   tid  u_price
0  152    3     1                  Battery  1853     0.99
1  152    3     1        Master Of Puppets  1854     0.99
4  152    3     1        Disposable Heroes  1857     0.99
0  154    3     1     Fight Fire With Fire  1874     0.99
1  154    3     1       Ride The Lightning  1875     0.99
2  154    3     1  For Whom The Bell Tolls  1876     0.99
3  154    3     1            Fade To Black  1877     0.99
4  154    3     1        Trapped Under Ice  1878     0.99
0  155    3     1                  Frantic  1882     0.99
1  155    3     1                St. Anger  1883     0.99
2  155    3     1     Some Kind Of Monster  1884     0.99
3  155    3     1             Dirty Window  1885     0.99
4  155    3     1            Invisible Kid  1886     0.99

<script.py> output:
       aid  gid  mtid                     name   tid  u_price
    0  152    3     1                  Battery  1853     0.99
    1  152    3     1        Master Of Puppets  1854     0.99
    4  152    3     1        Disposable Heroes  1857     0.99
    0  154    3     1     Fight Fire With Fire  1874     0.99
    1  154    3     1       Ride The Lightning  1875     0.99
    2  154    3     1  For Whom The Bell Tolls  1876     0.99
    3  154    3     1            Fade To Black  1877     0.99
    4  154    3     1        Trapped Under Ice  1878     0.99
    0  155    3     1                  Frantic  1882     0.99
    1  155    3     1                St. Anger  1883     0.99
    2  155    3     1     Some Kind Of Monster  1884     0.99
    3  155    3     1             Dirty Window  1885     0.99
    4  155    3     1            Invisible Kid  1886     0.99

Great job! You’ve concatenated your first set of tables, adjusted the index, and altered the columns shown in the output. The .concat() method is a very flexible tool that is useful for combining data into a new dataset.

Concatenating with keys

The leadership of the music streaming company has come to you and asked you for assistance in analyzing sales for a recent business quarter. They would like to know which month in the quarter saw the highest average invoice total. You have been given three tables with invoice data named inv_jul, inv_aug, and inv_sep. Concatenate these tables into one to create a graph of the average monthly invoice total.

Instructions

Concatenate the three tables together vertically in order with the oldest month first, adding '7Jul', '8Aug', and '9Sep' as keys for their respective months, and save to variable avg_inv_by_month.
Use the .agg() method to find the average of the total column from the grouped invoices.
Create a bar chart of avg_inv_by_month.

# Concatenate the tables and add keys
inv_jul_thr_sep = pd.concat([inv_jul, inv_aug, inv_sep], 
                            keys=['7Jul', '8Aug', '9Sep'])

# Group the invoices by the index keys and find avg of the total column
avg_inv_by_month = inv_jul_thr_sep.groupby(level=0).agg({'total':'mean'})

# Bar plot of avg_inv_by_month
avg_inv_by_month.plot()
plt.show()

# Concatenate the tables and add keys
inv_jul_thr_sep = pd.concat([inv_jul, inv_aug, inv_sep], 
                            keys=['7Jul', '8Aug', '9Sep'])

# Group the invoices by the index keys and find avg of the total column
avg_inv_by_month = inv_jul_thr_sep.groupby(level=0).agg({'total':'mean'})

# Bar plot of avg_inv_by_month
avg_inv_by_month.plot(kind='bar')
plt.show()

Way to come through! There are many ways to write code for this task. However, concatenating the tables with a key provides a hierarchical index that can be used for grouping. Once grouped, you can average the groups and create plots. You were able to find out that September had the highest average invoice total.

Using the append method

The .concat() method is excellent when you need a lot of control over how concatenation is performed. However, if you do not need as much control, then the .append() method is another option. You’ll try this method out by appending the track lists together from different Metallica albums. From there, you will merge it with the invoice_items table to determine which track sold the most.

The tables tracks_master, tracks_ride, tracks_st, and invoice_items have loaded for you.

Instructions

Use the .append() method to combine (in this order)tracks_ride, tracks_master, and tracks_st together vertically, and save to metallica_tracks.
Merge metallica_tracks and invoice_items on tid with an inner join, and save to tracks_invoices.
For each tid and name in tracks_invoices, sum the quantity sold column, and save as tracks_sold.
Sort tracks_sold in descending order by the quantity column, and print the table.

# Use the .append() method to combine the tracks tables
metallica_tracks = tracks_ride.append([tracks_master, tracks_st], sort=False)

# Merge metallica_tracks and invoice_items
tracks_invoices = metallica_tracks.merge(invoice_items, on='tid', how='inner')

# For each tid and name sum the quantity sold
tracks_sold = tracks_invoices.groupby(['tid','name']).agg({'quantity':'sum'})

# Sort in decending order by quantity and print the results
print(tracks_sold.sort_values(by=['quantity'], ascending=False))

<script.py> output:
                                  quantity
    tid  name                             
    1853 Battery                         2
    1876 For Whom The Bell Tolls         2
    1854 Master Of Puppets               1
    1857 Disposable Heroes               1
    1875 Ride The Lightning              1
    1877 Fade To Black                   1
    1882 Frantic                         1
    1884 Some Kind Of Monster            1
    1886 Invisible Kid                   1

Great work! Even though .append() is less flexible, it’s also simpler than .concat(). It looks like Battery, and For Whom The Bell Tolls were the most sold tracks.

Concatenate and merge to find common songs

The senior leadership of the streaming service is requesting your help again. You are given the historical files for a popular playlist in the classical music genre in 2018 and 2019. Additionally, you are given a similar set of files for the most popular pop music genre playlist on the streaming service in 2018 and 2019. Your goal is to concatenate the respective files to make a large classical playlist table and overall popular music table. Then filter the classical music table using a semi-join to return only the most popular classical music tracks.

The tables classic_18, classic_19, and pop_18, pop_19 have been loaded for you. Additionally, pandas has been loaded as pd.

Instructions 2/2

Concatenate the classic_18 and classic_19 tables vertically where the index goes from 0 to n-1, and save to classic_18_19.
Concatenate the pop_18 and pop_19 tables vertically where the index goes from 0 to n-1, and save to pop_18_19.
With classic_18_19 on the left, merge it with pop_18_19 on tid using an inner join.
Use .isin() to filter classic_18_19 where tid is in classic_pop.

# Concatenate the classic tables vertically
classic_18_19 = pd.concat([classic_18, classic_19], ignore_index=True)

# Concatenate the pop tables vertically
pop_18_19 = pd.concat([pop_18, pop_19], ignore_index=True)

# Merge classic_18_19 with pop_18_19
classic_pop = classic_18_19.merge(pop_18_19, on='tid', how='inner')

# Using .isin(), filter classic_18_19 rows where tid is in classic_pop
popular_classic = classic_18_19[classic_18_19['tid'].isin(classic_pop['tid'])]

# Print popular chart
print(popular_classic)

# Concatenate the classic tables vertically
classic_18_19 = pd.concat([classic_18, classic_19], ignore_index=True)

# Concatenate the pop tables vertically
pop_18_19 = pd.concat([pop_18, pop_19], ignore_index=True)

<script.py> output:
    Empty DataFrame
    Columns: [pid, tid]
    Index: []

<script.py> output:
        pid   tid
    3    12  3479
    10   12  3439
    21   12  3445
    23   12  3449
    48   12  3437
    50   12  3435

Excellent work! In this exercise, you demonstrated many of the concepts discussed in this chapter, including concatenation, and semi-joins. You now have experience combining data vertically and using semi- and anti-joins. Time to move on to the next chapter!

Correlation between GDP and S&P500

In this exercise, you want to analyze stock returns from the S&P 500. You believe there may be a relationship between the returns of the S&P 500 and the GDP of the US. Merge the different datasets together to compute the correlation.

Two tables have been provided for you, named sp500, and gdp. As always, pandas has been imported for you as pd.

Instructions

Use merge_ordered() to merge gdp and sp500 using a left join on yearand date. Save the results as gdp_sp500.
Print gdp_sp500 and look at the returns for the year 2018.
Use merge_ordered(), again similar to before, to merge gdp and sp500 use the function’s ability to interpolate missing data to forward fill the missing value for returns, assigning this table to the variable gdp_sp500.

# Use merge_ordered() to merge gdp and sp500, interpolate missing value
gdp_sp500 = pd.merge_ordered(gdp, sp500, left_on='year', right_on='date', 
                             how='left',  fill_method='ffill')

# Subset the gdp and returns columns
gdp_returns = gdp_sp500[['gdp', 'returns']]

# Print gdp_returns correlation
print (gdp_returns.corr())

<script.py> output:
                gdp  returns
    0  1.499210e+13    12.78
    1  1.554260e+13     0.00
    2  1.619700e+13    13.41
    3  1.619700e+13    13.41
    4  1.678480e+13    29.60
    5  1.752170e+13    11.39
    6  1.821930e+13    -0.73
    7  1.870720e+13     9.54
    8  1.948540e+13    19.42
    9  2.049410e+13    19.42

<script.py> output:
                  gdp   returns
    gdp      1.000000  0.212173
    returns  0.212173  1.000000

Awesome work! You can see the different aspects of merge_ordered() and how you might use it on data that can be ordered. By using this function, you were able to fill in the missing data from 2019. Finally, the correlation of 0.21 between the GDP and S&P500 is low to moderate at best. You may want to find another predictor if you plan to play in the stock market.

Phillips curve using merge_ordered()

There is an economic theory developed by A. W. Phillips which states that inflation and unemployment have an inverse relationship. The theory claims that with economic growth comes inflation, which in turn should lead to more jobs and less unemployment.

You will take two tables of data from the U.S. Bureau of Labor Statistics, containing unemployment and inflation data over different periods, and create a Phillips curve. The tables have different frequencies. One table has a data entry every six months, while the other has a data entry every month. You will need to use the entries where you have data within both tables.

The tables unemployment and inflation have been loaded for you.

Instructions

Use merge_ordered() to merge the inflation and unemployment tables on date with an inner join, and save the results as inflation_unemploy.
Print the inflation_unemploy variable.
Using inflation_unemploy, create a scatter plot with unemployment_rate on the horizontal axis and cpi (inflation) on the vertical axis.

<?php# Use merge_ordered() to merge inflation, unemployment with inner join
inflation_unemploy = pd.merge_ordered(inflation, unemployment, on='date', how='inner')

# Print inflation_unemploy 
print(inflation_unemploy)

# Plot a scatter plot of unemployment_rate vs cpi of inflation_unemploy
inflation_unemploy.plot(x='unemployment_rate', y='cpi' , kind='scatter')
plt.show()

# Use merge_ordered() to merge inflation, unemployment with inner join
inflation_unemploy = pd.merge_ordered(inflation, unemployment, on='date', how='inner')

# Print inflation_unemploy 
print(inflation_unemploy)

# Plot a scatter plot of unemployment_rate vs cpi of inflation_unemploy
inflation_unemploy.plot(x='unemployment_rate', y='cpi' , kind='scatter')
plt.show()
         date      cpi     seriesid                  data_type  unemployment_rate
0  2014-01-01  235.288  CUSR0000SA0  SEASONALLY ADJUSTED INDEX                6.7
1  2014-06-01  237.231  CUSR0000SA0  SEASONALLY ADJUSTED INDEX                6.1
2  2015-01-01  234.718  CUSR0000SA0  SEASONALLY ADJUSTED INDEX                5.6
3  2015-06-01  237.684  CUSR0000SA0  SEASONALLY ADJUSTED INDEX                5.3
4  2016-01-01  237.833  CUSR0000SA0  SEASONALLY ADJUSTED INDEX                5.0
5  2016-06-01  240.167  CUSR0000SA0  SEASONALLY ADJUSTED INDEX                4.9
6  2017-01-01  243.780  CUSR0000SA0  SEASONALLY ADJUSTED INDEX                4.7
7  2017-06-01  244.182  CUSR0000SA0  SEASONALLY ADJUSTED INDEX                4.3
8  2018-01-01  248.884  CUSR0000SA0  SEASONALLY ADJUSTED INDEX                4.1
9  2018-06-01  251.134  CUSR0000SA0  SEASONALLY ADJUSTED INDEX                4.0

Great work! You created a Phillips curve. There are critics of the curve, but what is more important in this example is that you were able to use entries where you had entries in both tables by using an inner join. You might ask why not use the default outer join and use forward fill to fill to estimate the missing variables. You might choose differently. In this case, instead of showing an estimated unemployment rate (which is a continually changing measure) for five periods, that data was dropped from the plot.

merge_ordered() caution, multiple columns

When using merge_ordered() to merge on multiple columns, the order is important when you combine it with the forward fill feature. The function sorts the merge on columns in the order provided. In this exercise, we will merge GDP and population data from the World Bank for the Australia and Sweden, reversing the order of the merge on columns. The frequency of the series are different, the GDP values are quarterly, and the population is yearly. Use the forward fill feature to fill in the missing data. Depending on the order provided, the fill forward will use unintended data to fill in the missing values.

The tables gdp and pop have been loaded.

Instructions

Use merge_ordered() on gdp and pop, merging on columns date and country with the fill feature, save to ctry_date.
Perform the same merge of gdp and pop, but join on country and date(reverse of step 1) with the fill feature, saving this as date_ctry.

# Merge gdp and pop on country and date with fill
date_ctry = pd.merge_ordered(gdp, pop, on=['country', 'date', ], fill_method='ffill')

# Print date_ctry
print(date_ctry)

# Merge gdp and pop on date and country with fill and notice rows 2 and 3
ctry_date = pd.merge_ordered(gdp, pop, on=['date', 'country'], 
                             fill_method='ffill')

# Print ctry_date
print(ctry_date)
         date    country           gdp  series_code_x       pop series_code_y
0  1990-01-01  Australia  158051.13240  NYGDPMKTPSAKD  17065100   SP.POP.TOTL
1  1990-01-01     Sweden   79837.84599  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
2  1990-04-01  Australia  158263.58160  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
3  1990-04-01     Sweden   80582.28597  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
4  1990-07-01  Australia  157329.27900  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
5  1990-07-01     Sweden   79974.36017  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
6  1990-09-01  Australia  158240.67810  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
7  1990-09-01     Sweden   80106.49738  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
8  1991-01-01  Australia  156195.95350  NYGDPMKTPSAKD  17284000   SP.POP.TOTL
9  1991-01-01     Sweden   79524.24192  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
10 1991-04-01  Australia  155989.03270  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
11 1991-04-01     Sweden   79073.05901  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
12 1991-07-01  Australia  156635.85760  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
13 1991-07-01     Sweden   79084.77036  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
14 1991-09-01  Australia  156744.05660  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
15 1991-09-01     Sweden   79740.60625  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
16 1992-01-01  Australia  157916.08110  NYGDPMKTPSAKD  17495000   SP.POP.TOTL
17 1992-01-01     Sweden   79390.92175  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
18 1992-04-01  Australia  159047.82710  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
19 1992-04-01     Sweden   79060.28298  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
20 1992-07-01  Australia  160658.17600  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
21 1992-07-01     Sweden   78904.60477  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
22 1992-09-01  Australia  163960.22070  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
23 1992-09-01     Sweden   76996.83684  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
24 1993-01-01  Australia  165097.49510  NYGDPMKTPSAKD  17667000   SP.POP.TOTL
25 1993-01-01     Sweden   75783.58777  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
26 1993-04-01  Australia  166027.05900  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
27 1993-04-01     Sweden   76708.54823  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
28 1993-07-01  Australia  166203.17860  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
29 1993-07-01     Sweden   77662.01816  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
30 1993-09-01  Australia  169279.34790  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
31 1993-09-01     Sweden   77703.30364  NYGDPMKTPSAKD   8718561   SP.POP.TOTL

<script.py> output:
             date    country           gdp  series_code_x       pop series_code_y
    0  1990-01-01  Australia  158051.13240  NYGDPMKTPSAKD  17065100   SP.POP.TOTL
    1  1990-01-01     Sweden   79837.84599  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
    2  1990-04-01  Australia  158263.58160  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
    3  1990-04-01     Sweden   80582.28597  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
    4  1990-07-01  Australia  157329.27900  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
    5  1990-07-01     Sweden   79974.36017  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
    6  1990-09-01  Australia  158240.67810  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
    7  1990-09-01     Sweden   80106.49738  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
    8  1991-01-01  Australia  156195.95350  NYGDPMKTPSAKD  17284000   SP.POP.TOTL
    9  1991-01-01     Sweden   79524.24192  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
    10 1991-04-01  Australia  155989.03270  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
    11 1991-04-01     Sweden   79073.05901  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
    12 1991-07-01  Australia  156635.85760  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
    13 1991-07-01     Sweden   79084.77036  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
    14 1991-09-01  Australia  156744.05660  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
    15 1991-09-01     Sweden   79740.60625  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
    16 1992-01-01  Australia  157916.08110  NYGDPMKTPSAKD  17495000   SP.POP.TOTL
    17 1992-01-01     Sweden   79390.92175  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
    18 1992-04-01  Australia  159047.82710  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
    19 1992-04-01     Sweden   79060.28298  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
    20 1992-07-01  Australia  160658.17600  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
    21 1992-07-01     Sweden   78904.60477  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
    22 1992-09-01  Australia  163960.22070  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
    23 1992-09-01     Sweden   76996.83684  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
    24 1993-01-01  Australia  165097.49510  NYGDPMKTPSAKD  17667000   SP.POP.TOTL
    25 1993-01-01     Sweden   75783.58777  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
    26 1993-04-01  Australia  166027.05900  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
    27 1993-04-01     Sweden   76708.54823  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
    28 1993-07-01  Australia  166203.17860  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
    29 1993-07-01     Sweden   77662.01816  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
    30 1993-09-01  Australia  169279.34790  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
    31 1993-09-01     Sweden   77703.30364  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
# Merge gdp and pop on country and date with fill
date_ctry = pd.merge_ordered(gdp, pop, on=['country', 'date', ], fill_method='ffill')

# Print date_ctry
print(date_ctry)
         date    country           gdp  series_code_x       pop series_code_y
0  1990-01-01  Australia  158051.13240  NYGDPMKTPSAKD  17065100   SP.POP.TOTL
1  1990-04-01  Australia  158263.58160  NYGDPMKTPSAKD  17065100   SP.POP.TOTL
2  1990-07-01  Australia  157329.27900  NYGDPMKTPSAKD  17065100   SP.POP.TOTL
3  1990-09-01  Australia  158240.67810  NYGDPMKTPSAKD  17065100   SP.POP.TOTL
4  1991-01-01  Australia  156195.95350  NYGDPMKTPSAKD  17284000   SP.POP.TOTL
5  1991-04-01  Australia  155989.03270  NYGDPMKTPSAKD  17284000   SP.POP.TOTL
6  1991-07-01  Australia  156635.85760  NYGDPMKTPSAKD  17284000   SP.POP.TOTL
7  1991-09-01  Australia  156744.05660  NYGDPMKTPSAKD  17284000   SP.POP.TOTL
8  1992-01-01  Australia  157916.08110  NYGDPMKTPSAKD  17495000   SP.POP.TOTL
9  1992-04-01  Australia  159047.82710  NYGDPMKTPSAKD  17495000   SP.POP.TOTL
10 1992-07-01  Australia  160658.17600  NYGDPMKTPSAKD  17495000   SP.POP.TOTL
11 1992-09-01  Australia  163960.22070  NYGDPMKTPSAKD  17495000   SP.POP.TOTL
12 1993-01-01  Australia  165097.49510  NYGDPMKTPSAKD  17667000   SP.POP.TOTL
13 1993-04-01  Australia  166027.05900  NYGDPMKTPSAKD  17667000   SP.POP.TOTL
14 1993-07-01  Australia  166203.17860  NYGDPMKTPSAKD  17667000   SP.POP.TOTL
15 1993-09-01  Australia  169279.34790  NYGDPMKTPSAKD  17667000   SP.POP.TOTL
16 1990-01-01     Sweden   79837.84599  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
17 1990-04-01     Sweden   80582.28597  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
18 1990-07-01     Sweden   79974.36017  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
19 1990-09-01     Sweden   80106.49738  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
20 1991-01-01     Sweden   79524.24192  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
21 1991-04-01     Sweden   79073.05901  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
22 1991-07-01     Sweden   79084.77036  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
23 1991-09-01     Sweden   79740.60625  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
24 1992-01-01     Sweden   79390.92175  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
25 1992-04-01     Sweden   79060.28298  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
26 1992-07-01     Sweden   78904.60477  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
27 1992-09-01     Sweden   76996.83684  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
28 1993-01-01     Sweden   75783.58777  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
29 1993-04-01     Sweden   76708.54823  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
30 1993-07-01     Sweden   77662.01816  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
31 1993-09-01     Sweden   77703.30364  NYGDPMKTPSAKD   8718561   SP.POP.TOTL

<script.py> output:
             date    country           gdp  series_code_x       pop series_code_y
    0  1990-01-01  Australia  158051.13240  NYGDPMKTPSAKD  17065100   SP.POP.TOTL
    1  1990-04-01  Australia  158263.58160  NYGDPMKTPSAKD  17065100   SP.POP.TOTL
    2  1990-07-01  Australia  157329.27900  NYGDPMKTPSAKD  17065100   SP.POP.TOTL
    3  1990-09-01  Australia  158240.67810  NYGDPMKTPSAKD  17065100   SP.POP.TOTL
    4  1991-01-01  Australia  156195.95350  NYGDPMKTPSAKD  17284000   SP.POP.TOTL
    5  1991-04-01  Australia  155989.03270  NYGDPMKTPSAKD  17284000   SP.POP.TOTL
    6  1991-07-01  Australia  156635.85760  NYGDPMKTPSAKD  17284000   SP.POP.TOTL
    7  1991-09-01  Australia  156744.05660  NYGDPMKTPSAKD  17284000   SP.POP.TOTL
    8  1992-01-01  Australia  157916.08110  NYGDPMKTPSAKD  17495000   SP.POP.TOTL
    9  1992-04-01  Australia  159047.82710  NYGDPMKTPSAKD  17495000   SP.POP.TOTL
    10 1992-07-01  Australia  160658.17600  NYGDPMKTPSAKD  17495000   SP.POP.TOTL
    11 1992-09-01  Australia  163960.22070  NYGDPMKTPSAKD  17495000   SP.POP.TOTL
    12 1993-01-01  Australia  165097.49510  NYGDPMKTPSAKD  17667000   SP.POP.TOTL
    13 1993-04-01  Australia  166027.05900  NYGDPMKTPSAKD  17667000   SP.POP.TOTL
    14 1993-07-01  Australia  166203.17860  NYGDPMKTPSAKD  17667000   SP.POP.TOTL
    15 1993-09-01  Australia  169279.34790  NYGDPMKTPSAKD  17667000   SP.POP.TOTL
    16 1990-01-01     Sweden   79837.84599  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
    17 1990-04-01     Sweden   80582.28597  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
    18 1990-07-01     Sweden   79974.36017  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
    19 1990-09-01     Sweden   80106.49738  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
    20 1991-01-01     Sweden   79524.24192  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
    21 1991-04-01     Sweden   79073.05901  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
    22 1991-07-01     Sweden   79084.77036  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
    23 1991-09-01     Sweden   79740.60625  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
    24 1992-01-01     Sweden   79390.92175  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
    25 1992-04-01     Sweden   79060.28298  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
    26 1992-07-01     Sweden   78904.60477  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
    27 1992-09-01     Sweden   76996.83684  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
    28 1993-01-01     Sweden   75783.58777  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
    29 1993-04-01     Sweden   76708.54823  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
    30 1993-07-01     Sweden   77662.01816  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
    31 1993-09-01     Sweden   77703.30364  NYGDPMKTPSAKD   8718561   SP.POP.TOTL

Nice! When you merge on date first, the table is sorted by date then country. When forward fill is applied, Sweden’s population value in January is used to fill in the missing values for both Australia and the Sweden for the remainder of the year. This is not what you want. The fill forward is using unintended data to fill in the missing values. However, when you merge on country first, the table is sorted by country then date, so the forward fill is applied appropriately in this situation.

Using merge_asof() to study stocks

You have a feed of stock market prices that you record. You attempt to track the price every five minutes. Still, due to some network latency, the prices you record are roughly every 5 minutes. You pull your price logs for three banks, JP Morgan(JPM), Wells Fargo (WFC), and Bank Of America (BAC). You want to know how the price change of the two other banks compare to JP Morgan. Therefore, you will need to merge these three logs into one table. Afterward, you will use the pandas .diff() method to compute the price change over time. Finally, plot the price changes so you can review your analysis.

The three log files have been loaded for you as tables named jpm, wells, and bac.

Instructions

Use merge_asof() to merge jpm (left table) and wells together on the date_time column, where the rows with the nearest times are matched, and with suffixes=('', '_wells'). Save to jpm_wells.
Use merge_asof() to merge jpm_wells (left table) and bac together on the date_time column, where the rows with the closest times are matched, and with suffixes=('_jpm', '_bac'). Save to jpm_wells_bac.
Using price_diffs, create a line plot of the close price of JPM, WFC, and BAC only.

# Use merge_asof() to merge jpm and wells
jpm_wells = pd.merge_asof(jpm, wells, on='date_time', suffixes=('', '_wells'), direction='nearest')


# Use merge_asof() to merge jpm_wells and bac
jpm_wells_bac = pd.merge_asof(jpm_wells, bac, on=['date_time'], suffixes=('_jpm', '_bac'), direction='nearest')


# Compute price diff
price_diffs = jpm_wells_bac.diff()

# Plot the price diff of the close of jpm, wells and bac only
price_diffs.plot(y=['close_jpm', 'close_wells', 'close_bac'])
plt.show()

Fabulous! You can see that during this period, the price change for these bank stocks was roughly the same, although the price change for JP Morgan was more variable. The critical point here is that the merge_asof()function is very useful in performing the fuzzy matching between the timestamps of all the tables.

Using merge_asof() to create dataset

The merge_asof() function can be used to create datasets where you have a table of start and stop dates, and you want to use them to create a flag in another table. You have been given gdp, which is a table of quarterly GDP values of the US during the 1980s. Additionally, the table recession has been given to you. It holds the starting date of every US recession since 1980, and the date when the recession was declared to be over. Use merge_asof() to merge the tables and create a status flag if a quarter was during a recession. Finally, to check your work, plot the data in a bar chart.

The tables gdp and recession have been loaded for you.

Instructions

Using merge_asof(), merge gdp and recession on date, with gdp as the left table. Save to the variable gdp_recession.
Create a list using a list comprehension and a conditional expression, named is_recession, where for each row if the gdp_recession['econ_status'] value is equal to ‘recession’ then enter 'r' else 'g'.
Using gdp_recession, plot a bar chart of gdp versus date, setting the colorargument equal to is_recession.

# Merge gdp and recession on date using merge_asof()
gdp_recession = pd.merge_asof(gdp, recession, on='date')

# Create a list based on the row value of gdp_recession['econ_status']
is_recession = ['r' if s=='recession' else 'g' for s in gdp_recession['econ_status']]

# Plot a bar chart of gdp_recession
gdp_recession.plot(kind='bar', y='gdp', x='date', color=is_recession, rot=90)
plt.show()

Terrific work! You can see from the chart that there were a number of quarters early in the 1980s where a recession was an issue. merge_asof() allowed you to quickly add a flag to the gdp dataset by matching between two different dates, in one line of code! If you were to perform the same task using subsetting, it would have taken a lot more code.

merge_asof()	both	.merge_ordered
It can be used to do fuzzy matching of dates between tables	this function can set the suffix for overlapping column names	If it cannot march the rows of the tables exactly, it can use forward fill to interpolate the missing data.
Has an argument that can be set to `'forward'` to select the first row in the right table whose key column is greater than or equal to the left’s	this function can be used when working with ordered or time-series data	It allows for a right joint during the merge.
After matching two tables, if there are missing values at the top of the table from the right table, this function can fill them in.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_asof.html		https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.merge_ordered.html

Subsetting rows with .query()

In this exercise, you will revisit GDP and population data for Australia and Sweden from the World Bank and expand on it using the .query() method. You’ll merge the two tables and compute the GDP per capita. Afterwards, you’ll use the .query() method to sub-select the rows and create a plot. Recall that you will need to merge on multiple columns in the proper order.

The tables gdp and pop have been loaded for you.

Use merge_ordered() on gdp and pop on columns country and datewith the fill feature, save to gdp_pop and print.
Add a column named gdp_per_capita to gdp_pop that divides gdp by pop.
Pivot gdp_pop so values='gdp_per_capita', index='date', and columns='country', save as gdp_pivot.
Use .query() to select rows from gdp_pivot where date is greater than equal to 1991-01-01". Save as recent_gdp_pop.

# Merge gdp and pop on date and country with fill
gdp_pop = pd.merge_ordered(gdp, pop, on=['country','date'], fill_method='ffill')

# Add a column named gdp_per_capita to gdp_pop that divides the gdp by pop
gdp_pop['gdp_per_capita'] = gdp_pop['gdp'] / gdp_pop['pop']

# Pivot data so gdp_per_capita, where index is date and columns is country
gdp_pivot = gdp_pop.pivot_table('gdp_per_capita', 'date', 'country')

# Select dates equal to or greater than 1991-01-01
recent_gdp_pop = gdp_pivot.query('date>="1991-01-01"')

# Plot recent_gdp_pop
recent_gdp_pop.plot(rot=90)
plt.show()

# Merge gdp and pop on date and country with fill
gdp_pop = pd.merge_ordered(gdp, pop, on=['country', 'date'], fill_method='ffill')

print(gdp_pop)
         date    country           gdp  series_code_x       pop series_code_y
0  1990-01-01  Australia  158051.13240  NYGDPMKTPSAKD  17065100   SP.POP.TOTL
1  1990-04-01  Australia  158263.58160  NYGDPMKTPSAKD  17065100   SP.POP.TOTL
2  1990-07-01  Australia  157329.27900  NYGDPMKTPSAKD  17065100   SP.POP.TOTL
3  1990-09-01  Australia  158240.67810  NYGDPMKTPSAKD  17065100   SP.POP.TOTL
4  1991-01-01  Australia  156195.95350  NYGDPMKTPSAKD  17284000   SP.POP.TOTL
5  1991-04-01  Australia  155989.03270  NYGDPMKTPSAKD  17284000   SP.POP.TOTL
6  1991-07-01  Australia  156635.85760  NYGDPMKTPSAKD  17284000   SP.POP.TOTL
7  1991-09-01  Australia  156744.05660  NYGDPMKTPSAKD  17284000   SP.POP.TOTL
8  1992-01-01  Australia  157916.08110  NYGDPMKTPSAKD  17495000   SP.POP.TOTL
9  1992-04-01  Australia  159047.82710  NYGDPMKTPSAKD  17495000   SP.POP.TOTL
10 1992-07-01  Australia  160658.17600  NYGDPMKTPSAKD  17495000   SP.POP.TOTL
11 1992-09-01  Australia  163960.22070  NYGDPMKTPSAKD  17495000   SP.POP.TOTL
12 1993-01-01  Australia  165097.49510  NYGDPMKTPSAKD  17667000   SP.POP.TOTL
13 1993-04-01  Australia  166027.05900  NYGDPMKTPSAKD  17667000   SP.POP.TOTL
14 1993-07-01  Australia  166203.17860  NYGDPMKTPSAKD  17667000   SP.POP.TOTL
15 1993-09-01  Australia  169279.34790  NYGDPMKTPSAKD  17667000   SP.POP.TOTL
16 1990-01-01     Sweden   79837.84599  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
17 1990-04-01     Sweden   80582.28597  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
18 1990-07-01     Sweden   79974.36017  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
19 1990-09-01     Sweden   80106.49738  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
20 1991-01-01     Sweden   79524.24192  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
21 1991-04-01     Sweden   79073.05901  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
22 1991-07-01     Sweden   79084.77036  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
23 1991-09-01     Sweden   79740.60625  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
24 1992-01-01     Sweden   79390.92175  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
25 1992-04-01     Sweden   79060.28298  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
26 1992-07-01     Sweden   78904.60477  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
27 1992-09-01     Sweden   76996.83684  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
28 1993-01-01     Sweden   75783.58777  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
29 1993-04-01     Sweden   76708.54823  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
30 1993-07-01     Sweden   77662.01816  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
31 1993-09-01     Sweden   77703.30364  NYGDPMKTPSAKD   8718561   SP.POP.TOTL

<script.py> output:
             date    country           gdp  series_code_x       pop series_code_y
    0  1990-01-01  Australia  158051.13240  NYGDPMKTPSAKD  17065100   SP.POP.TOTL
    1  1990-04-01  Australia  158263.58160  NYGDPMKTPSAKD  17065100   SP.POP.TOTL
    2  1990-07-01  Australia  157329.27900  NYGDPMKTPSAKD  17065100   SP.POP.TOTL
    3  1990-09-01  Australia  158240.67810  NYGDPMKTPSAKD  17065100   SP.POP.TOTL
    4  1991-01-01  Australia  156195.95350  NYGDPMKTPSAKD  17284000   SP.POP.TOTL
    5  1991-04-01  Australia  155989.03270  NYGDPMKTPSAKD  17284000   SP.POP.TOTL
    6  1991-07-01  Australia  156635.85760  NYGDPMKTPSAKD  17284000   SP.POP.TOTL
    7  1991-09-01  Australia  156744.05660  NYGDPMKTPSAKD  17284000   SP.POP.TOTL
    8  1992-01-01  Australia  157916.08110  NYGDPMKTPSAKD  17495000   SP.POP.TOTL
    9  1992-04-01  Australia  159047.82710  NYGDPMKTPSAKD  17495000   SP.POP.TOTL
    10 1992-07-01  Australia  160658.17600  NYGDPMKTPSAKD  17495000   SP.POP.TOTL
    11 1992-09-01  Australia  163960.22070  NYGDPMKTPSAKD  17495000   SP.POP.TOTL
    12 1993-01-01  Australia  165097.49510  NYGDPMKTPSAKD  17667000   SP.POP.TOTL
    13 1993-04-01  Australia  166027.05900  NYGDPMKTPSAKD  17667000   SP.POP.TOTL
    14 1993-07-01  Australia  166203.17860  NYGDPMKTPSAKD  17667000   SP.POP.TOTL
    15 1993-09-01  Australia  169279.34790  NYGDPMKTPSAKD  17667000   SP.POP.TOTL
    16 1990-01-01     Sweden   79837.84599  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
    17 1990-04-01     Sweden   80582.28597  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
    18 1990-07-01     Sweden   79974.36017  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
    19 1990-09-01     Sweden   80106.49738  NYGDPMKTPSAKD   8558835   SP.POP.TOTL
    20 1991-01-01     Sweden   79524.24192  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
    21 1991-04-01     Sweden   79073.05901  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
    22 1991-07-01     Sweden   79084.77036  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
    23 1991-09-01     Sweden   79740.60625  NYGDPMKTPSAKD   8617375   SP.POP.TOTL
    24 1992-01-01     Sweden   79390.92175  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
    25 1992-04-01     Sweden   79060.28298  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
    26 1992-07-01     Sweden   78904.60477  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
    27 1992-09-01     Sweden   76996.83684  NYGDPMKTPSAKD   8668067   SP.POP.TOTL
    28 1993-01-01     Sweden   75783.58777  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
    29 1993-04-01     Sweden   76708.54823  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
    30 1993-07-01     Sweden   77662.01816  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
    31 1993-09-01     Sweden   77703.30364  NYGDPMKTPSAKD   8718561   SP.POP.TOTL
# Merge gdp and pop on date and country with fill
gdp_pop = pd.merge_ordered(gdp, pop, on=['country','date'], fill_method='ffill')

# Add a column named gdp_per_capita to gdp_pop that divides the gdp by pop
gdp_pop['gdp_per_capita'] = gdp_pop['gdp'] / gdp_pop['pop']
# Merge gdp and pop on date and country with fill
gdp_pop = pd.merge_ordered(gdp, pop, on=['country','date'], fill_method='ffill')

# Add a column named gdp_per_capita to gdp_pop that divides the gdp by pop
gdp_pop['gdp_per_capita'] = gdp_pop['gdp'] / gdp_pop['pop']

# Pivot data so gdp_per_capita, where index is date and columns is country
gdp_pivot = gdp_pop.pivot_table('gdp_per_capita', 'date', 'country')

# Select dates equal to or greater than 1991-01-01
recent_gdp_pop = gdp_pivot.query('date>="1991-01-01"')

# Plot recent_gdp_pop
recent_gdp_pop.plot(rot=90)
plt.show()

Amazing! You can see from the plot that the per capita GDP of Australia passed Sweden in 1992. By using the .query() method, you were able to select the appropriate rows easily. The .query() method is easy to read and straightforward.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.query.html?highlight=query#pandas.DataFrame.query

Using .melt() to reshape government data

The US Bureau of Labor Statistics (BLS) often provides data series in an easy-to-read format – it has a separate column for each month, and each year is a different row. Unfortunately, this wide format makes it difficult to plot this information over time. In this exercise, you will reshape a table of US unemployment rate data from the BLS into a form you can plot using .melt(). You will need to add a date column to the table and sort by it to plot the data correctly.

The unemployment rate data has been loaded for you in a table called ur_wide. You are encouraged to view the table in the console before beginning the exercise.

Instructions

Use .melt() to unpivot all of the columns of ur_wide except yearand ensure that the columns with the months and values are named month and unempl_rate, respectively. Save the result as ur_tall.
Add a column to ur_tall named date which combines the year and month columns as year–month format into a larger string, and converts it to a date data type.
Sort ur_tall by date and save as ur_sorted.
Using ur_sorted, plot unempl_rate on the y-axis and date on the x-axis.

# unpivot everything besides the year column
ur_tall = ur_wide.melt(id_vars=['year'], var_name='month', value_name='unempl_rate')


# Create a date column using the month and year columns of ur_tall
ur_tall['date'] = pd.to_datetime(ur_tall['year'] + '-' + ur_tall['month'])

# Sort ur_tall by date in ascending order
ur_sorted = ur_tall.sort_values(by='date')

# Plot the unempl_rate by date
ur_sorted.plot(x='date', y='unempl_rate')
plt.show()

Nice going! The plot shows a steady decrease in the unemployment rate with an increase near the end. This increase is likely the effect of the COVID-19 pandemic and its impact on shutting down most of the US economy. In general, data is often provided (especially by governments) in a format that is easily read by people but not by machines. The .melt() method is a handy tool for reshaping data into a useful form.

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.melt.html?highlight=melt#pandas.DataFrame.melt

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html?highlight=sort_values#pandas.DataFrame.sort_values

Using .melt() for stocks vs bond performance

It is widespread knowledge that the price of bonds is inversely related to the price of stocks. In this last exercise, you’ll review many of the topics in this chapter to confirm this. You have been given a table of percent change of the US 10-year treasury bond price. It is in a wide format where there is a separate column for each year. You will need to use the .melt() method to reshape this table.

Additionally, you will use the .query() method to filter out unneeded data. You will merge this table with a table of the percent change of the Dow Jones Industrial stock index price. Finally, you will plot data.

The tables ten_yr and dji have been loaded for you.

Instructions

Use .melt() on ten_yr to unpivot everything except the metriccolumn, setting var_name='date' and value_name='close'. Save the result to bond_perc.
Using the .query() method, select only those rows were metric equals ‘close’, and save to bond_perc_close.
Use merge_ordered() to merge dji (left table) and bond_perc_closeon date with an inner join, and set suffixes equal to ('_dow', '_bond'). Save the result to dow_bond.
Using dow_bond, plot only the Dow and bond values.

# Use melt on ten_yr, unpivot everything besides the metric column
bond_perc = ten_yr.melt(id_vars=['metric'], var_name='date', value_name='close')

# Use query on bond_perc to select only the rows where metric=close
bond_perc_close = bond_perc.query('metric=="close"')

# Merge (ordered) dji and bond_perc_close on date with an inner join
dow_bond = pd.merge_ordered(dji, bond_perc_close, on='date', how='inner', suffixes=['_dow', '_bond'])


# Plot only the close_dow and close_bond columns
dow_bond.plot(y=['close_dow', 'close_bond'], x='date', rot=90)
plt.show()

Super job! You used many of the techniques we have reviewed in this chapter to produce the plot. The plot confirms that the bond and stock prices are inversely correlated. Often as the price of stocks increases, the price for bonds decreases.

About The Author

hrozema

Add a Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.

Search

Performing Anti-Joins

Performing a semi-join

Instructions

Concatenation basics

Concatenating with keys

Concatenating with keys

Instructions

Using the append method

Instructions

Concatenate and merge to find common songs

Instructions 2/2

Correlation between GDP and S&P500

Instructions

Phillips curve using merge_ordered()

Instructions

merge_ordered() caution, multiple columns

Instructions

Using merge_asof() to study stocks

Instructions

Using merge_asof() to create dataset

Instructions

Subsetting rows with .query()

Using .melt() to reshape government data

Instructions

Using .melt() for stocks vs bond performance

Instructions

Related Posts

Raspberry Pi controlled ceramic kiln

Local Audio Transcription with MLX Whisper and AI agent API on Apple Silicon

Statistical thinking in Python (part 1)

About The Author

hrozema

Add a Comment