Summarize

Click here to open the slides full screen.

from setup import ___

from siuba import *
from plotnine import *

from music_top200 import music_top200, track_features

Exercise 1:

The code below calculates the average duration.

  • Uncomment the summarize verb.
  • Change it to calculate median duration.
  • Make sure to change the resulting column name to indicate its a median.
(music_top200
  #>> summarize(avg_duration = _.duration.mean())
)
country position track_name artist streams duration continent
0 Argentina 1 Tusa KAROL G 1858666 200.960 Americas
1 Argentina 2 Tattoo Rauw Alejandro 1344382 202.887 Americas
2 Argentina 3 Hola - Remix Dalex 1330011 249.520 Americas
... ... ... ... ... ... ... ...
12397 South Africa 198 Black And White Niall Horan 11771 193.090 Africa
12398 South Africa 199 When I See U Fantasia 11752 217.347 Africa
12399 South Africa 200 Psycho! MASN 11743 197.217 Africa

12400 rows × 7 columns

(music_top200
  >> summarize(median_duration = _.duration.median(), ttl_streams = _.streams.sum())
)
median_duration ttl_streams
0 201.084 3018225255

1 rows × 2 columns

What is the median duration?

205.990 Incorrect
65.631 Incorrect
201.084 That’s right!
851.871 Incorrect

Add a second argument to summarize, which calculates the sum of streams. How large is it?

30,182,252 Incorrect
301,822,525 That’s right!
451,822,525 Incorrect
45,182,252 Incorrect

Exercise 2:

Use verbs you learned in chapter 1 to do the following:

  • find the track with the lowest duration
  • subset the data to keep only the row for that track

(Note, you may need to run code multiple times)

(music_top200

)
country position track_name artist streams duration continent
0 Argentina 1 Tusa KAROL G 1858666 200.960 Americas
1 Argentina 2 Tattoo Rauw Alejandro 1344382 202.887 Americas
2 Argentina 3 Hola - Remix Dalex 1330011 249.520 Americas
... ... ... ... ... ... ... ...
12397 South Africa 198 Black And White Niall Horan 11771 193.090 Africa
12398 South Africa 199 When I See U Fantasia 11752 217.347 Africa
12399 South Africa 200 Psycho! MASN 11743 197.217 Africa

12400 rows × 7 columns

Now, use summarize to calculate the min duration, and the max duration directly.

(music_top200

)
country position track_name artist streams duration continent
0 Argentina 1 Tusa KAROL G 1858666 200.960 Americas
1 Argentina 2 Tattoo Rauw Alejandro 1344382 202.887 Americas
2 Argentina 3 Hola - Remix Dalex 1330011 249.520 Americas
... ... ... ... ... ... ... ...
12397 South Africa 198 Black And White Niall Horan 11771 193.090 Africa
12398 South Africa 199 When I See U Fantasia 11752 217.347 Africa
12399 South Africa 200 Psycho! MASN 11743 197.217 Africa

12400 rows × 7 columns

Why would you use summarize like this, rather than the arrange and filter approach?

The min and max values might be in different observations (rows). That’s right. Filter keeps specific rows, but summarize can collect values from across rows.
It keeps more of the original data. Try again. Filter keeps all the variables (columns), while summarize will remove most of them.
The arrange and filter approach can do anything summarize can. Try again.

Exercise 3:

The examples below show what happens verbs like filter and mutate use methods like .mean().

# create column for average streams
(music_top200
  >> mutate(avg_streams = _.streams.mean())
)
country position track_name artist streams duration continent avg_streams
0 Argentina 1 Tusa KAROL G 1858666 200.960 Americas 243405.2625
1 Argentina 2 Tattoo Rauw Alejandro 1344382 202.887 Americas 243405.2625
2 Argentina 3 Hola - Remix Dalex 1330011 249.520 Americas 243405.2625
... ... ... ... ... ... ... ... ...
12397 South Africa 198 Black And White Niall Horan 11771 193.090 Africa 243405.2625
12398 South Africa 199 When I See U Fantasia 11752 217.347 Africa 243405.2625
12399 South Africa 200 Psycho! MASN 11743 197.217 Africa 243405.2625

12400 rows × 8 columns

# keep observations where streams are above average
(music_top200
  >> filter(_.streams < _.streams.mean())
)
country position track_name artist streams duration continent
108 Argentina 109 Me Gusta Ciro y los Persas 243159 289.093 Americas
109 Argentina 110 Tal Vez Paulo Londra 242870 264.483 Americas
110 Argentina 111 Physical Dua Lipa 239225 193.829 Americas
... ... ... ... ... ... ... ...
12397 South Africa 198 Black And White Niall Horan 11771 193.090 Africa
12398 South Africa 199 When I See U Fantasia 11752 217.347 Africa
12399 South Africa 200 Psycho! MASN 11743 197.217 Africa

9341 rows × 7 columns

Based on the examples above, can you use only the filter verb to get the most streamed song in all the data?

# modify the filter verb below

(music_top200
  >> filter()
)
country position track_name artist streams duration continent
0 Argentina 1 Tusa KAROL G 1858666 200.960 Americas
1 Argentina 2 Tattoo Rauw Alejandro 1344382 202.887 Americas
2 Argentina 3 Hola - Remix Dalex 1330011 249.520 Americas
... ... ... ... ... ... ... ...
12397 South Africa 198 Black And White Niall Horan 11771 193.090 Africa
12398 South Africa 199 When I See U Fantasia 11752 217.347 Africa
12399 South Africa 200 Psycho! MASN 11743 197.217 Africa

12400 rows × 7 columns

(music_top200
  >> filter(_.streams == _.streams.max())
)
country position track_name artist streams duration continent
7800 United States 1 The Box Roddy Ricch 12987027 196.653 Americas

1 rows × 7 columns

What is the most streamed song? The Box by Roddy Ricch