Pandas tutorial
Pandas is the de facto standard Python library for working with tabular data. It allows you to read tabular data, perform transformations on it, and calculate statistical measures, and it also serves as a foundation for libraries that implement more complex analysis methods. To use the library, we first need to import it, that is, make it available for use:
import pandas as pd
Exploring the data

To open a file, use the read_csv() function, whose argument is the path to the data file of interest. To find the correct path to a file in Google Colab, you can right-click on it and select "Copy path". As an example, we suggest using the file tips.csv, which you can upload to your Google Colab:
df = pd.read_csv('/content/tips.csv')
After loading the data, we have an object of the DataFrame type. This type is not part of the Python standard library; it is provided by pandas.

DataFrame is a data type used in the pandas library for working with tabular data. Similar data types are available in other libraries and in other languages.

The closest analogy to a DataFrame would be a data table: a DataFrame consists of columns and rows, where columns contain variables and rows contain values corresponding to each variable. For starters, let's explore our data a bit and learn a few useful commands.

To get the first rows of a DataFrame, the .head() method is used. The number of rows to display is passed as an argument; if no argument is given, 5 rows are shown by default. Here we request the first ten:
df.head(10)
To find out the column names, their number, data types in them, and the number of observations:
df.info()
To find out basic descriptive statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for columns that contain numeric data (float or int):
df.describe()
To get information about the number of rows and columns:
df.shape
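Since .shape returns a tuple of (rows, columns), it can be unpacked directly into two variables. A minimal sketch, using a small made-up stand-in DataFrame (the column names mirror tips.csv, but the values are invented):

```python
import pandas as pd

# A tiny stand-in for tips.csv with made-up values
df = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01],
    'tip': [1.01, 1.66, 3.50],
    'time': ['Dinner', 'Dinner', 'Lunch'],
})

rows, cols = df.shape  # .shape is a (rows, columns) tuple
print(rows)  # 3
print(cols)  # 3
```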
When we work with a large DataFrame, for variables that are described by a nominal scale, it may be important for us to understand what values it consists of. For this we can use the unique() method. For example, you can find out what value an observation in the time column can take:
df['time'].unique()
In addition to this, it may be important for us to know the number of observations corresponding to each value of the time variable. That is, how many checks were issued during dinner and how many during lunch. To do this, use the following command:
df['time'].value_counts()
Now we know that 176 checks were issued at dinner and 68 at lunch. The value_counts() method can be refined with a parameter so that we see not only the count for each value but also the proportions. To do this, let's add an additional argument to our code:
df['time'].value_counts(normalize=True)
Filtering DataFrame

Let's start with filtering the DataFrame. Let's try to see only the column containing information about the total bill.
df[['total_bill']]
You can use another notation: df.total_bill. Here we access the column as an attribute. The result contains the same data, but attribute access only works if the column name has no spaces and does not clash with an existing DataFrame method. Going forward, we will use bracket notation.
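We can verify that the two notations return the same Series with the .equals() method. A quick sketch on a made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'total_bill': [16.99, 10.34], 'tip': [1.01, 1.66]})

# Bracket access and attribute access give identical Series
print(df['total_bill'].equals(df.total_bill))  # True
```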

Let's say we can select several columns, for example total_bill and tip. In this case, our code should look like this:
df[['total_bill', 'tip']]
Pay attention to the two opening and two closing square brackets: the outer pair is the indexing operator used to select columns, while the inner pair creates a list containing the desired column names.

If we want to access certain rows of our DataFrame, we can use row indices:
df[0:5]
Note that numbering in a DataFrame starts from 0, not 1, just like in lists, so the instruction df[1:5] would give us a result in which the first row is missing.
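This slicing behavior can be checked directly: the slice starts at the first index and excludes the end index. A sketch on a hypothetical 5-row DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'total_bill': [16.99, 10.34, 21.01, 23.68, 24.59]})

print(len(df[0:5]))  # 5 rows: indices 0 through 4
print(len(df[1:5]))  # 4 rows: the first row (index 0) is skipped
```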

We know that the column with time contains only two values: Lunch, Dinner. If we want to find out the average bill for the dinner, we first need to filter the observations related to the dinner. Note that when filtering by a categorical variable, the value must be enclosed in quotes.
df[df['time'] == 'Dinner']
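The filtered DataFrame can be chained straight into a statistic such as the mean, which answers the question posed above about the average dinner bill. A sketch with made-up values:

```python
import pandas as pd

df = pd.DataFrame({
    'total_bill': [16.99, 10.34, 21.01, 23.68],
    'time': ['Dinner', 'Lunch', 'Dinner', 'Dinner'],
})

# Filter to dinner checks, then take the mean of their bills
dinner_mean = df[df['time'] == 'Dinner']['total_bill'].mean()
print(dinner_mean)  # ≈ 20.56
```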
We can assign this result to a separate DataFrame and save it as a CSV file:
df_dinner = df[df['time'] == 'Dinner']
df_dinner.to_csv('dinner_tips.csv')
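Note that by default to_csv() also writes the DataFrame's row index as an extra first column; passing index=False keeps the file limited to the data columns. A sketch (writing to a temporary file rather than Colab's /content, and using made-up data):

```python
import os
import tempfile

import pandas as pd

df_dinner = pd.DataFrame({
    'total_bill': [16.99, 21.01],
    'time': ['Dinner', 'Dinner'],
})

path = os.path.join(tempfile.gettempdir(), 'dinner_tips.csv')
df_dinner.to_csv(path, index=False)  # omit the row index column

# Reading the file back gives exactly the columns we wrote
print(pd.read_csv(path).columns.tolist())  # ['total_bill', 'time']
```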
This method also works for numeric variables, but in that case quotes should not be used. The == operator can be replaced with <, >, and other comparison operators if we want to select values that are greater than or less than a certain threshold. Let's choose checks where the tip was more than 5:
df[df['tip'] > 5]
Finally, the last thing we need to learn is how to use multiple conditions at once to select part of a DataFrame. Suppose we want to examine the checks of smokers who came for dinner. That is, we need to set two conditions: smoker must equal Yes, and time must equal Dinner. We can combine multiple conditions:
df[(df['smoker'] == 'Yes') & (df['time']=='Dinner')]
Parentheses around each individual condition are mandatory. Additionally, pandas uses different symbols for the logical operators than standard Python:

Standard Python | pandas
and             | &
or              | |
not             | ~
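As a sketch of the | (or) and ~ (not) operators, here is a made-up DataFrame where we select checks that were either large or left by non-smokers, and then checks left by non-smokers only:

```python
import pandas as pd

df = pd.DataFrame({
    'total_bill': [16.99, 30.34, 21.01],
    'smoker': ['No', 'Yes', 'Yes'],
})

# or: a row is kept if either condition holds
print(len(df[(df['total_bill'] > 25) | (df['smoker'] == 'No')]))  # 2

# not: negate a condition with ~
print(len(df[~(df['smoker'] == 'Yes')]))  # 1
```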
Descriptive Statistics

Now let's see what we can say about the data using the simplest descriptive statistics methods: mean, median, and standard deviation. Most of the statistics we discussed are implemented in the pandas package.

Let's calculate the mean and median values of the total_bill column:
print(df['total_bill'].mean())
print(df['total_bill'].median())
In addition to the mean and median, there are other methods. For example, we can find out the maximum and minimum values:
print(df['total_bill'].min())
print(df['total_bill'].max())
Or find out the sum of all values in a column, for example the sum of all checks:
print(df['total_bill'].sum())
Or calculate the standard deviation:
print(df['total_bill'].std())