from siuba import _, filter, group_by
from siuba.data import mtcars
mtcars >> filter(_.cyl == 4, _.gear == 5)| mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 26 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.7 | 0 | 1 | 5 | 2 | 
| 27 | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.9 | 1 | 1 | 5 | 2 | 
The filter() function keeps rows of data that meet all specified conditions.
Here is a simple filter().
from siuba import _, filter, group_by
from siuba.data import mtcars
mtcars >> filter(_.cyl == 4, _.gear == 5)| mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 26 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.7 | 0 | 1 | 5 | 2 | 
| 27 | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.9 | 1 | 1 | 5 | 2 | 
It keeps rows where the cyl column is equal to 4, AND the gear column is equal to five.
In order to keep a row when one of several conditions is met, use the bar (|) operator.
| mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 | 
| 7 | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.19 | 20.00 | 1 | 0 | 4 | 2 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 30 | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.57 | 14.60 | 0 | 1 | 5 | 8 | 
| 31 | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.78 | 18.60 | 1 | 1 | 4 | 2 | 
14 rows × 11 columns
The code above keeps rows where cyl is equal to 4 OR gear is equal to 5.
Be sure to explicitly put parentheses around both sides. Otherwise, python will group the operation like _.cyl == (4 | _.gear) == 5.
Filter drops rows where conditions return False or not available (NA) values. This section will cover how to tell what is considered NA, and how to drop rows of data with NA values.
Use pandas.isna() to determine whether a value is considered to be NA.
Notice in the code above that the last value is None. We can confirm pandas sees this as an NA with the code below.
Since None is considered an NA, its row gets removed in the filter below.
If you want to remove only by NA values from your data, use the pandas .notna() method.
This effectively says, “keep any values of x that are not NA”.
In general, use an OR condition with _.some_col.isna() to keep rows with missing values.
This strategy should work almost all the time.
In the example below, we keep rows where the horsepower (hp) is above the median horsepower for each cylinder group.
(grouped data frame)
| mpg | cyl | disp | hp | drat | wt | qsec | vs | am | gear | carb | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.32 | 18.61 | 1 | 1 | 4 | 1 | 
| 6 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.57 | 15.84 | 0 | 0 | 3 | 4 | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
| 30 | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.57 | 14.60 | 0 | 1 | 5 | 8 | 
| 31 | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.78 | 18.60 | 1 | 1 | 4 | 2 | 
15 rows × 11 columns
This means that the filter performs the following operations:
_.cylhp within each group.hp is greater than the group’s median hp.