How to create a DataFrame in Pandas from a list and add columns
Imagine we have a list and we want to use it as a Pandas DataFrame in Python. How do we do that?
    # Our list is a list of strings
    fruit_list = ['Apple', 'Banana', 'Cherry', 'Dragon Fruit', 'Elderberry']
    print(fruit_list)
And just for fun, I got my inspiration for these delicious fruits from this website: https://www.whateatly.com/category/fruits/
We have now created and printed our list of fruit names. The next question is how to turn this list into a DataFrame.
    # First import the Pandas library
    import pandas as pd

    # Our list is a list of strings
    fruit_list = ['Apple', 'Banana', 'Cherry', 'Dragon Fruit', 'Elderberry']
    print(fruit_list)

    # Turn fruit_list into a DataFrame
    df = pd.DataFrame(fruit_list)
    df  # shown automatically in a notebook; use print(df) in a script
What have we done? On line 2 we imported the pandas library as pd. On line 5 we created our list of strings. On line 6 we printed this list. On line 9 we called the pd.DataFrame constructor with fruit_list as its argument and stored the result as our DataFrame df. On line 10 we displayed the df DataFrame.
Let’s have a look at the result in the console:
    ['Apple', 'Banana', 'Cherry', 'Dragon Fruit', 'Elderberry']
                  0
    0         Apple
    1        Banana
    2        Cherry
    3  Dragon Fruit
    4    Elderberry
As you can see, on line 1 our assembly of fruits is printed as a list, with the brackets etc. From line 3 on we can see a table-like structure of our fruits, plus an index column with row numbers starting from 0. Pandas adds this index automatically, and it helps to get an overview of the DataFrame.
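The single column in the output above is labelled 0 because no column name was given. As a minimal sketch, the columns argument of pd.DataFrame can name it right away; the name Fruit and the variable df_named are chosen here only for illustration:

```python
import pandas as pd

fruit_list = ['Apple', 'Banana', 'Cherry', 'Dragon Fruit', 'Elderberry']

# Name the single column explicitly instead of relying on the default label 0
# (the column name 'Fruit' is an assumption made for this example)
df_named = pd.DataFrame(fruit_list, columns=['Fruit'])
print(df_named)
```

The index column stays the same; only the header changes from 0 to Fruit.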
So, now we know how to turn a list into a DataFrame. But there are of course more ways to turn lists into DataFrames:
Two column DataFrame from two lists
Let’s say that we already have our fruit_list and we want to turn it into a shopping list, so next to each item we want to add a quantity. I call this list number_items. It’s a bit funny to ask in a shop for 25 cherries and 50 elderberries, but that’s how I do it for this exercise.
    # First import the Pandas library
    import pandas as pd

    # Our list is a list of strings called fruit_list
    fruit_list = ['Apple', 'Banana', 'Cherry', 'Dragon Fruit', 'Elderberry']

    # Create a list of numbers called number_items
    number_items = [3, 4, 25, 2, 50]

    # Zip fruit_list and number_items together, store the result in the
    # DataFrame df and give appropriate column names
    df = pd.DataFrame(list(zip(fruit_list, number_items)), columns=['Fruit', 'Number'])
    df
Which results in:
              Fruit  Number
    0         Apple       3
    1        Banana       4
    2        Cherry      25
    3  Dragon Fruit       2
    4    Elderberry      50
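Zipping the lists is one way to do this. Another common option, sketched here as an alternative rather than as the approach used in this article, is to pass a dictionary in which each key is a column name and each value is one of the lists. The variable name df_from_dict is made up for this example:

```python
import pandas as pd

fruit_list = ['Apple', 'Banana', 'Cherry', 'Dragon Fruit', 'Elderberry']
number_items = [3, 4, 25, 2, 50]

# Each dictionary key becomes a column name, each list becomes that column
df_from_dict = pd.DataFrame({'Fruit': fruit_list, 'Number': number_items})
print(df_from_dict)
```

With a dictionary the column names sit right next to their data, which can be easier to read when there are many lists.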
Adding a column to a DataFrame
Now let’s spice up our DataFrame df a bit by adding a column with ratings.
    # Add a column 'Rating' with values [8, 9, 7, 9, 5]
    df['Rating'] = [8, 9, 7, 9, 5]
    df
And the result of our actions shows the added Rating column.
              Fruit  Number  Rating
    0         Apple       3       8
    1        Banana       4       9
    2        Cherry      25       7
    3  Dragon Fruit       2       9
    4    Elderberry      50       5
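One thing to note is the difference between df and print(df). In a notebook, a bare df on the last line of a cell is rendered as a formatted table, while print(df) always prints plain text. In a plain Python script a bare df shows nothing at all, so use print(df) there.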
Note that adding a column with values can also be done by using the .assign() method:
    # Adding a column val to our DataFrame by using the .assign() method
    df = df.assign(val=[324, 35, 645, 867, 78])
    df
With the result in the console:
              Fruit  Number  Rating  val
    0         Apple       3       8  324
    1        Banana       4       9   35
    2        Cherry      25       7  645
    3  Dragon Fruit       2       9  867
    4    Elderberry      50       5   78
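One detail that is easy to miss: .assign() returns a new DataFrame and leaves the original untouched, which is why the result is stored back into df above. A small sketch with made-up names (df_demo, df_with_val) to show this:

```python
import pandas as pd

# A small stand-in DataFrame for illustration
df_demo = pd.DataFrame({'Fruit': ['Apple', 'Banana'], 'Number': [3, 4]})

# .assign() returns a new DataFrame; df_demo itself is left unchanged
df_with_val = df_demo.assign(val=[324, 35])

print(list(df_demo.columns))
print(list(df_with_val.columns))
```

If the result of .assign() is not stored somewhere, the new column is lost.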
Adding a column with 1’s
In some cases it’s necessary to add columns with a fixed value. So in this example we add a column val2 that contains a 1 on every row.
    # Add a column val2 with value 1 on each row.
    df['val2'] = 1
    df
Which shows the new column val2 with the value 1 in each row.
              Fruit  Number  Rating  val  val2
    0         Apple       3       8  324     1
    1        Banana       4       9   35     1
    2        Cherry      25       7  645     1
    3  Dragon Fruit       2       9  867     1
    4    Elderberry      50       5   78     1
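A single value is repeated on every row, but a list is not: it needs exactly one entry per row. The sketch below, using a made-up df_demo and an illustrative column name too_short, shows what happens when the lengths do not match:

```python
import pandas as pd

# A small stand-in DataFrame with three rows
df_demo = pd.DataFrame({'Fruit': ['Apple', 'Banana', 'Cherry']})

# A single value is repeated on every row
df_demo['val2'] = 1

# A list must have one entry per row; two values for three rows fails
try:
    df_demo['too_short'] = [1, 2]
    error_raised = False
except ValueError as err:
    error_raised = True
    print(f'ValueError: {err}')
```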
Adding a column with values based on a condition
The next step is adding a column Review to this DataFrame with value 1 if the Rating is 8 or higher and value 0 if the Rating is lower than 8. For the condition we use the where function from NumPy, so we first import NumPy as np.
    import numpy as np

    # Adding a column Review with value 1 if Rating>=8 and 0 if Rating<8.
    df['Review'] = np.where(df['Rating'] >= 8, 1, 0)
    df
np.where evaluates the condition for the whole column at once, so we don’t have to loop over the rows ourselves. The result is:
              Fruit  Number  Rating  val  val2  Review
    0         Apple       3       8  324     1       1
    1        Banana       4       9   35     1       1
    2        Cherry      25       7  645     1       0
    3  Dragon Fruit       2       9  867     1       1
    4    Elderberry      50       5   78     1       0
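For a plain 1/0 flag there is also a way that needs only pandas. This is an alternative sketch, not the method used above: the comparison itself gives True/False values, and astype(int) turns those into 1 and 0. The name df_demo is made up for the example:

```python
import pandas as pd

# A small stand-in DataFrame with the same ratings as in the article
df_demo = pd.DataFrame({'Rating': [8, 9, 7, 9, 5]})

# The comparison yields a boolean Series; astype(int) maps True/False to 1/0
df_demo['Review'] = (df_demo['Rating'] >= 8).astype(int)
print(df_demo)
```

np.where remains the more flexible choice when the two outcomes are something other than 1 and 0, such as two pieces of text.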
Adding a column with several values
In the previous example we have seen how to make a new column with just two values, 1 and 0. However, what do we do if we want more than two values, for instance ‘delicious’, ‘plain’ and ‘not good’? For this we define a function which classifies each row based on the column Rating:
- values below 6: not good;
- values from 6 up to, but not including, 8: plain;
- and values 8 and above: delicious.
    # Define a function which helps classifying the values in Rating
    def f(row):
        if row['Rating'] < 6:
            classification = 'not good'
        elif row['Rating'] < 8:
            classification = 'plain'
        else:
            classification = 'delicious'
        return classification

    # Create a new column Review2 using the function above
    df['Review2'] = df.apply(f, axis=1)
    print(df)
The result of this is shown below:
              Fruit  Number  Rating  val  val2  Review    Review2
    0         Apple       3       8  324     1       1  delicious
    1        Banana       4       9   35     1       1  delicious
    2        Cherry      25       7  645     1       0      plain
    3  Dragon Fruit       2       9  867     1       1  delicious
    4    Elderberry      50       5   78     1       0   not good
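The same three-way classification can also be written without a row-by-row function. The sketch below swaps in NumPy’s select function as an alternative to .apply(); the names df_demo, conditions and labels are made up for this example:

```python
import numpy as np
import pandas as pd

# A small stand-in DataFrame with the same ratings as in the article
df_demo = pd.DataFrame({'Rating': [8, 9, 7, 9, 5]})

# Conditions are checked in order; the first one that is True decides the label
conditions = [df_demo['Rating'] < 6, df_demo['Rating'] < 8]
labels = ['not good', 'plain']

# Rows matching neither condition (Rating 8 and above) get the default label
df_demo['Review2'] = np.select(conditions, labels, default='delicious')
print(df_demo)
```

Because the conditions are tested in order, a rating of 5 gets ‘not good’ even though it is also below 8, just like the if/elif chain in the function.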
Use compared values in columns to create a new column
Imagine we did two tests on the fruit and we only want to buy the best fruit. The second test gave different results. Where the second test scored at least as high as the first (the code uses >=, so a tie also counts), we want to mark the fruit ‘yes’ in a column Further_check, and in all other cases ‘no’. How do we do this?
    # Add new columns Test1 and Test2
    df = df.assign(Test1=[7, 9, 5, 6, 6])
    df = df.assign(Test2=[7, 10, 7, 8, 5])

    # Create the new column 'Further_check' with value yes if the value of Test2
    # is at least as high as Test1. Lower values are marked with no.
    df['Further_check'] = np.where(df['Test2'] >= df['Test1'], 'yes', 'no')
    print(df)
Our DataFrame now looks as follows:
              Fruit  Number  Rating  val  val2  Review    Review2  Test1  Test2  \
    0         Apple       3       8  324     1       1  delicious      7      7
    1        Banana       4       9   35     1       1  delicious      9     10
    2        Cherry      25       7  645     1       0      plain      5      7
    3  Dragon Fruit       2       9  867     1       1  delicious      6      8
    4    Elderberry      50       5   78     1       0   not good      6      5

      Further_check
    0           yes
    1           yes
    2           yes
    3           yes
    4            no
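The backslash at the end of the header row above means that the table was too wide for the console, so pandas continued with the remaining column underneath. As a minimal sketch, the .to_string() method gives the full table as text with one line per row; df_wide is a made-up stand-in for our df:

```python
import pandas as pd

# A small wide DataFrame standing in for the article's df
df_wide = pd.DataFrame({
    'Fruit': ['Apple', 'Banana'],
    'Review2': ['delicious', 'delicious'],
    'Test1': [7, 9],
    'Test2': [7, 10],
    'Further_check': ['yes', 'yes'],
})

# to_string() renders every column on one line per row, without wrapping
full_text = df_wide.to_string()
print(full_text)
```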
The last step in this article is to subset our DataFrame to the rows where Further_check equals ‘yes’:
    # Use the .loc indexer to select the Fruit values of the rows where
    # Further_check is 'yes', and store them in df_check (a pandas Series)
    df_check = df.loc[df['Further_check'] == 'yes', 'Fruit']

    # Print df_check
    print(df_check)
Resulting in our list of fruits that we definitely need to check further:
    0           Apple
    1          Banana
    2          Cherry
    3    Dragon Fruit
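Since we started this article with a plain Python list, it is nice to be able to go back to one. Selecting a single column with .loc gives a pandas Series, and its .tolist() method turns the values into a list. A minimal sketch with made-up names (df_demo, selected, fruits_to_check):

```python
import pandas as pd

# A small stand-in DataFrame with the same Further_check values as above
df_demo = pd.DataFrame({
    'Fruit': ['Apple', 'Banana', 'Cherry', 'Dragon Fruit', 'Elderberry'],
    'Further_check': ['yes', 'yes', 'yes', 'yes', 'no'],
})

# Selecting a single column with .loc returns a Series, not a DataFrame
selected = df_demo.loc[df_demo['Further_check'] == 'yes', 'Fruit']

# tolist() converts the Series values into a plain Python list
fruits_to_check = selected.tolist()
print(fruits_to_check)
```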
So much fruit makes me want to take a bite. You can find the Python script here as a .zip file. I hope this helped you a bit further. If you have a question, feel free to reach out at info [ at ] hylkerozema.nl
Enjoy fruits!