from setup import ___
from siuba import *
from plotnine import *
from music_top200 import music_top200, track_features
Summarize
Click here to open the slides full screen.
Exercise 1:
The code below calculates the average duration.
- Uncomment the summarize verb.
- Change it to calculate median duration.
- Make sure to change the resulting column name to indicate its a median.
(music_top200#>> summarize(avg_duration = _.duration.mean())
)
country | position | track_name | artist | streams | duration | continent | |
---|---|---|---|---|---|---|---|
0 | Argentina | 1 | Tusa | KAROL G | 1858666 | 200.960 | Americas |
1 | Argentina | 2 | Tattoo | Rauw Alejandro | 1344382 | 202.887 | Americas |
2 | Argentina | 3 | Hola - Remix | Dalex | 1330011 | 249.520 | Americas |
... | ... | ... | ... | ... | ... | ... | ... |
12397 | South Africa | 198 | Black And White | Niall Horan | 11771 | 193.090 | Africa |
12398 | South Africa | 199 | When I See U | Fantasia | 11752 | 217.347 | Africa |
12399 | South Africa | 200 | Psycho! | MASN | 11743 | 197.217 | Africa |
12400 rows × 7 columns
(music_top200>> summarize(median_duration = _.duration.median(), ttl_streams = _.streams.sum())
)
median_duration | ttl_streams | |
---|---|---|
0 | 201.084 | 3018225255 |
1 rows × 2 columns
What is the median duration?
205.990
Incorrect65.631
Incorrect201.084
That’s right!851.871
IncorrectAdd a second argument to summarize, which calculates the sum of streams. How large is it?
30,182,252
Incorrect301,822,525
That’s right!451,822,525
Incorrect45,182,252
IncorrectExercise 2:
Use verbs you learned in chapter 1 to do the following:
- find the track with the lowest duration
- subset the data to keep only the row for that track
(Note, you may need to run code multiple times)
(music_top200
)
country | position | track_name | artist | streams | duration | continent | |
---|---|---|---|---|---|---|---|
0 | Argentina | 1 | Tusa | KAROL G | 1858666 | 200.960 | Americas |
1 | Argentina | 2 | Tattoo | Rauw Alejandro | 1344382 | 202.887 | Americas |
2 | Argentina | 3 | Hola - Remix | Dalex | 1330011 | 249.520 | Americas |
... | ... | ... | ... | ... | ... | ... | ... |
12397 | South Africa | 198 | Black And White | Niall Horan | 11771 | 193.090 | Africa |
12398 | South Africa | 199 | When I See U | Fantasia | 11752 | 217.347 | Africa |
12399 | South Africa | 200 | Psycho! | MASN | 11743 | 197.217 | Africa |
12400 rows × 7 columns
Now, use summarize to calculate the min duration, and the max duration directly.
(music_top200
)
country | position | track_name | artist | streams | duration | continent | |
---|---|---|---|---|---|---|---|
0 | Argentina | 1 | Tusa | KAROL G | 1858666 | 200.960 | Americas |
1 | Argentina | 2 | Tattoo | Rauw Alejandro | 1344382 | 202.887 | Americas |
2 | Argentina | 3 | Hola - Remix | Dalex | 1330011 | 249.520 | Americas |
... | ... | ... | ... | ... | ... | ... | ... |
12397 | South Africa | 198 | Black And White | Niall Horan | 11771 | 193.090 | Africa |
12398 | South Africa | 199 | When I See U | Fantasia | 11752 | 217.347 | Africa |
12399 | South Africa | 200 | Psycho! | MASN | 11743 | 197.217 | Africa |
12400 rows × 7 columns
Why would you use summarize like this, rather than the arrange and filter approach?
The min and max values might be in different observations (rows).
That’s right. Filter keeps specific rows, but summarize can collect values from across rows.It keeps more of the original data.
Try again. Filter keeps all the variables (columns), while summarize will remove most of them.The arrange and filter approach can do anything summarize can.
Try again.Exercise 3:
The examples below show what happens verbs like filter and mutate use methods like .mean()
.
# create column for average streams
(music_top200>> mutate(avg_streams = _.streams.mean())
)
country | position | track_name | artist | streams | duration | continent | avg_streams | |
---|---|---|---|---|---|---|---|---|
0 | Argentina | 1 | Tusa | KAROL G | 1858666 | 200.960 | Americas | 243405.2625 |
1 | Argentina | 2 | Tattoo | Rauw Alejandro | 1344382 | 202.887 | Americas | 243405.2625 |
2 | Argentina | 3 | Hola - Remix | Dalex | 1330011 | 249.520 | Americas | 243405.2625 |
... | ... | ... | ... | ... | ... | ... | ... | ... |
12397 | South Africa | 198 | Black And White | Niall Horan | 11771 | 193.090 | Africa | 243405.2625 |
12398 | South Africa | 199 | When I See U | Fantasia | 11752 | 217.347 | Africa | 243405.2625 |
12399 | South Africa | 200 | Psycho! | MASN | 11743 | 197.217 | Africa | 243405.2625 |
12400 rows × 8 columns
# keep observations where streams are above average
(music_top200>> filter(_.streams < _.streams.mean())
)
country | position | track_name | artist | streams | duration | continent | |
---|---|---|---|---|---|---|---|
108 | Argentina | 109 | Me Gusta | Ciro y los Persas | 243159 | 289.093 | Americas |
109 | Argentina | 110 | Tal Vez | Paulo Londra | 242870 | 264.483 | Americas |
110 | Argentina | 111 | Physical | Dua Lipa | 239225 | 193.829 | Americas |
... | ... | ... | ... | ... | ... | ... | ... |
12397 | South Africa | 198 | Black And White | Niall Horan | 11771 | 193.090 | Africa |
12398 | South Africa | 199 | When I See U | Fantasia | 11752 | 217.347 | Africa |
12399 | South Africa | 200 | Psycho! | MASN | 11743 | 197.217 | Africa |
9341 rows × 7 columns
Based on the examples above, can you use only the filter verb to get the most streamed song in all the data?
# modify the filter verb below
(music_top200>> filter()
)
country | position | track_name | artist | streams | duration | continent | |
---|---|---|---|---|---|---|---|
0 | Argentina | 1 | Tusa | KAROL G | 1858666 | 200.960 | Americas |
1 | Argentina | 2 | Tattoo | Rauw Alejandro | 1344382 | 202.887 | Americas |
2 | Argentina | 3 | Hola - Remix | Dalex | 1330011 | 249.520 | Americas |
... | ... | ... | ... | ... | ... | ... | ... |
12397 | South Africa | 198 | Black And White | Niall Horan | 11771 | 193.090 | Africa |
12398 | South Africa | 199 | When I See U | Fantasia | 11752 | 217.347 | Africa |
12399 | South Africa | 200 | Psycho! | MASN | 11743 | 197.217 | Africa |
12400 rows × 7 columns
(music_top200>> filter(_.streams == _.streams.max())
)
country | position | track_name | artist | streams | duration | continent | |
---|---|---|---|---|---|---|---|
7800 | United States | 1 | The Box | Roddy Ricch | 12987027 | 196.653 | Americas |
1 rows × 7 columns