from siuba import _, group_by, mutate, select
from siuba.data import mtcars
small_cars = mtcars[["mpg", "cyl", "hp"]]Mutate to transform
The mutate() function creates a new column of data, or overwrite an existing one.
We’ll use a subset of the mtcars dataset for examples.
Basics
small_cars >> mutate(mpg_per_cyl = _.mpg / _.cyl)| mpg | cyl | hp | mpg_per_cyl | |
|---|---|---|---|---|
| 0 | 21.0 | 6 | 110 | 3.500 |
| 1 | 21.0 | 6 | 110 | 3.500 |
| ... | ... | ... | ... | ... |
| 30 | 15.0 | 8 | 335 | 1.875 |
| 31 | 21.4 | 4 | 109 | 5.350 |
32 rows × 4 columns
Replacing columns
When a created column is given the same name as an existing column, it replaces that column in the data.
small_cars >> mutate(mpg = _.mpg - _.mpg.mean(), new_column = 1)| mpg | cyl | hp | new_column | |
|---|---|---|---|---|
| 0 | 0.909375 | 6 | 110 | 1 |
| 1 | 0.909375 | 6 | 110 | 1 |
| ... | ... | ... | ... | ... |
| 30 | -5.090625 | 8 | 335 | 1 |
| 31 | 1.309375 | 4 | 109 | 1 |
32 rows × 4 columns
Note that replacement columns are put in the same position as the original columns. For example, in the result above, the mpg column is still in the first position on the left.
Using previous arguments
Arguments can refer to columns that were created in earlier arguments.
small_cars >> mutate(cyl2 = _.cyl * 2, cyl4 = _.cyl2 * 2)| mpg | cyl | hp | cyl2 | cyl4 | |
|---|---|---|---|---|---|
| 0 | 21.0 | 6 | 110 | 12 | 24 |
| 1 | 21.0 | 6 | 110 | 12 | 24 |
| ... | ... | ... | ... | ... | ... |
| 30 | 15.0 | 8 | 335 | 16 | 32 |
| 31 | 21.4 | 4 | 109 | 8 | 16 |
32 rows × 5 columns
In the code above, cyl4 uses the earlier argument cyl2.
Grouped mutates
(small_cars
>> group_by(_.cyl)
>> mutate(
hp_mean = _.hp.mean(),
demeaned_hp = _.hp - _.hp_mean
)
)(grouped data frame)
| mpg | cyl | hp | hp_mean | demeaned_hp | |
|---|---|---|---|---|---|
| 0 | 21.0 | 6 | 110 | 122.285714 | -12.285714 |
| 1 | 21.0 | 6 | 110 | 122.285714 | -12.285714 |
| ... | ... | ... | ... | ... | ... |
| 30 | 15.0 | 8 | 335 | 209.214286 | 125.785714 |
| 31 | 21.4 | 4 | 109 | 82.636364 | 26.363636 |
32 rows × 5 columns
(small_cars
>> group_by(_.cyl)
>> mutate(
hp_per_cyl = _.hp / _.cyl,
diff = _.hp_per_cyl - _.hp_per_cyl.shift(1)
)
)(grouped data frame)
| mpg | cyl | hp | hp_per_cyl | diff | |
|---|---|---|---|---|---|
| 0 | 21.0 | 6 | 110 | 18.333333 | NaN |
| 1 | 21.0 | 6 | 110 | 18.333333 | 0.000 |
| ... | ... | ... | ... | ... | ... |
| 30 | 15.0 | 8 | 335 | 41.875000 | 8.875 |
| 31 | 21.4 | 4 | 109 | 27.250000 | -1.000 |
32 rows × 5 columns