In this section we cover the basic yet most common scenarios of filtering data frame rows by column values.
We wish to filter by a logical column to return the rows where the logical column is True
.
In this example, we wish to filter the rows of the data frame df where the logical column ‘col_3'
has a value of True
.
df_2 = df.loc[df['col_3']]
Here is how this works:
True
and / or False
values) can be passed directly to loc[]
which would return the rows where the passed column (in this example: col_3
) takes a value of True
.loc[]
is the preferred approach in Pandas.We wish to obtain (filter) the rows of a data frame where the value of a particular column meets a condition that involves comparison to a scaler value.
In this example, we wish to obtain the rows of the data frame df
where the numerical column ‘col_1’
takes a value greater than 0.
df_2 = df.loc[df['col_1'] > 0]
Here is how this works:
df['col_1'] > 0
yields a logical Series
where all values of col_1
that are larger than 0 are True
and all values that are smaller than 0 are False
.Series
of evaluating df['col_1'] > 0
is passed to the first argument of loc[]
to subset (extract) the corresponding rows.>
. In general we can use any comparison operation on numerical columns including smaller than <
, equal ==
, greater than or equal >=
, or smaller than or equal <=
. See Numerical Operations for more details.We wish to return rows where a comparison of the DataFrame’s column evaluates to True
.
In this example, we wish to return rows where the value of the column col_1
is larger than the value of another column col_2
.
df_2 = df.loc[df['col_1'] > df['col_2']]
Here is how this works:
col_1
is compared to the value of col_2
. If the value of col_1
for a row is larger than the value of col_2
, the expression evaluates to True
otherwise to False
.df['col_1'] > df['col_2']
is a Series
of boolean True
or False
values.loc[]
which returns the subset of rows of a data frame (in this example: df
) corresponding to where the boolean Series is True
. In this example, that would be rows where the value of col_1
is larger than the value of col_2
.We wish to obtain rows where a condition is not satisfied.
In this example, we wish to obtain rows where the value of a string column ‘col_2'
is neither ‘a’
nor ‘c’
.
No Missing Values
df_2 = df.loc[~ df['col_2'].isin(['a','c'])]
Here is how this works:
isin()
operator to check whether each value of ‘col_2'
is either ‘a’
or ‘c’
. The output is a Series
of logical values that is True
if the value of the corresponding element of ‘col_2’
is a member of the list and False
otherwise. We cover checking for membership in a list in detail in List Operations.!
to invert the logical values resulting from isin()
so that we have a value of True
for rows where the value of ‘col_2'
is neither ‘a’
nor ‘c’
loc[]
which returns the subset of rows of a data frame df
corresponding to where the boolean Series is True
.Missing Values
df_2 = (df
.dropna(subset='col_2')
.loc[~ df['col_2'].isin(['a','c'])])
Here is how this works:
np.nan
) returns False
. We would not know whether a False
is due to the value not meeting a condition or because it is missing.True
. When we wish to take the complement, it becomes an issue though.dropna()
before comparing. We cover working with missing values in Missing Values.If we are filtering rows as part of a chain of data manipulation operations and need to refer to a row that was created earlier in the same chain, we need to refer to it via a lambda
function.
df_2 = (df
.assign(col_2 = 2 * df['col_1'])
.loc[lambda x: x['col_2'] > 0])
Here is how this works:
df.assign(col_2 = 2 * df['col_1']).loc[df['col_2'] > 0]
lambda
function.The rows we get as output from a row filtering operation, like in the scenarios above, will have the same indices as they had in the original data frame resulting in a “patchy” index. While this is appropriate in some situations, often times we wish that the data frame that results from filtering has a fresh index.
df_2 = (df
.loc[df['col_1'] > 0]
.reset_index(drop=True))
Here is how this works:
reset_index()
after loc[]
to give the new data frame df_2
a fresh index that starts at 0 and increments by 1 for each row.True
, the original index is kept in the data frame as a new column.