import pandas as pd
from siuba import _, spread, gather
costs = pd.DataFrame({
'id': [1,2],
'price_x': [.1, .2],
'price_y': [.4, .5],
'price_z': [.7, .8]
})
costs| id | price_x | price_y | price_z | |
|---|---|---|---|---|
| 0 | 1 | 0.1 | 0.4 | 0.7 |
| 1 | 2 | 0.2 | 0.5 | 0.8 |
import pandas as pd
from siuba import _, spread, gather
costs = pd.DataFrame({
'id': [1,2],
'price_x': [.1, .2],
'price_y': [.4, .5],
'price_z': [.7, .8]
})
costs| id | price_x | price_y | price_z | |
|---|---|---|---|---|
| 0 | 1 | 0.1 | 0.4 | 0.7 |
| 1 | 2 | 0.2 | 0.5 | 0.8 |
# selecting each variable manually
costs >> gather('measure', 'value', _.price_x, _.price_y, _.price_z)
# selecting variables using a slice
costs >> gather('measure', 'value', _["price_x":"price_z"])
# selecting by excluding id
costs >> gather('measure', 'value', -_.id)| id | measure | value | |
|---|---|---|---|
| 0 | 1 | price_x | 0.1 |
| 1 | 2 | price_x | 0.2 |
| ... | ... | ... | ... |
| 4 | 1 | price_z | 0.7 |
| 5 | 2 | price_z | 0.8 |
6 rows Γ 3 columns
These functions are thoroughly tested, but currently experimental. The sections below describe their basic use. See their reference page for more examples.
pivot_wider()If there would be multiple entries per cell in the spread wide data, then the spread() function raises an error.
This is shown below, where there are duplicate entries where id=1 and measure="a".
df = pd.DataFrame({
"id": [1, 1, 2],
"measure": ["a", "a", "b"],
"value": [8, 9, 10]
})
df >> spread("measure", "value")ValueError: Index contains duplicate entries, cannot reshape
Use the pivot_wider() function to deal with this situation.
| id | a | b | |
|---|---|---|---|
| 0 | 1 | (8, 9) | NaN |
| 1 | 2 | NaN | (10,) |
Notice that the top-left entry is a list of two values, (8, 9). The values_fn argument is able to reduce those values down to one.
For example, by taking the mean.
| id | a | b | |
|---|---|---|---|
| 0 | 1 | 8.5 | NaN |
| 1 | 2 | NaN | 10.0 |
Note that values_fn= may be a lambda function.
If youβd rather use pandasβs pivot_table() method, siuba can pipe to it using this syntax:
Where you would replace ... with your arguments. See flexible piping for more details.
pivot_longer()Use pivot_longer() to stack columns of data, turning them into rows.
Notice that this data has two columns (x and y) that you might want to stack on top of each other.
| id | variable | number | |
|---|---|---|---|
| 0 | 1 | measure | a |
| 0 | 1 | value | 8 |
| ... | ... | ... | ... |
| 2 | 2 | measure | b |
| 2 | 2 | value | 10 |
6 rows Γ 3 columns
Note that in the code above, we do the following:
~_.id selects every columns except id for stacking.variable column.value column.