Summarize

Click here to open the slides full screen.

from setup import ___

from siuba import *
from plotnine import *

from music_top200 import music_top200, track_features

Exercise 1:

The code below calculates the average duration.

Uncomment the summarize verb.
Change it to calculate median duration.
Make sure to change the resulting column name to indicate its a median.

(music_top200
  #>> summarize(avg_duration = _.duration.mean())
)

	country	position	track_name	artist	streams	duration	continent
0	Argentina	1	Tusa	KAROL G	1858666	200.960	Americas
1	Argentina	2	Tattoo	Rauw Alejandro	1344382	202.887	Americas
2	Argentina	3	Hola - Remix	Dalex	1330011	249.520	Americas
...	...	...	...	...	...	...	...
12397	South Africa	198	Black And White	Niall Horan	11771	193.090	Africa
12398	South Africa	199	When I See U	Fantasia	11752	217.347	Africa
12399	South Africa	200	Psycho!	MASN	11743	197.217	Africa

12400 rows × 7 columns

Show solution

(music_top200
  >> summarize(median_duration = _.duration.median(), ttl_streams = _.streams.sum())
)

	median_duration	ttl_streams
0	201.084	3018225255

1 rows × 2 columns

What is the median duration?

205.990

Incorrect

65.631

Incorrect

201.084

That’s right!

851.871

Incorrect

Add a second argument to summarize, which calculates the sum of streams. How large is it?

30,182,252

Incorrect

301,822,525

That’s right!

451,822,525

Incorrect

45,182,252

Incorrect

Exercise 2:

Use verbs you learned in chapter 1 to do the following:

find the track with the lowest duration
subset the data to keep only the row for that track

(Note, you may need to run code multiple times)

(music_top200

)

	country	position	track_name	artist	streams	duration	continent
0	Argentina	1	Tusa	KAROL G	1858666	200.960	Americas
1	Argentina	2	Tattoo	Rauw Alejandro	1344382	202.887	Americas
2	Argentina	3	Hola - Remix	Dalex	1330011	249.520	Americas
...	...	...	...	...	...	...	...
12397	South Africa	198	Black And White	Niall Horan	11771	193.090	Africa
12398	South Africa	199	When I See U	Fantasia	11752	217.347	Africa
12399	South Africa	200	Psycho!	MASN	11743	197.217	Africa

12400 rows × 7 columns

Now, use summarize to calculate the min duration, and the max duration directly.

(music_top200

)

	country	position	track_name	artist	streams	duration	continent
0	Argentina	1	Tusa	KAROL G	1858666	200.960	Americas
1	Argentina	2	Tattoo	Rauw Alejandro	1344382	202.887	Americas
2	Argentina	3	Hola - Remix	Dalex	1330011	249.520	Americas
...	...	...	...	...	...	...	...
12397	South Africa	198	Black And White	Niall Horan	11771	193.090	Africa
12398	South Africa	199	When I See U	Fantasia	11752	217.347	Africa
12399	South Africa	200	Psycho!	MASN	11743	197.217	Africa

12400 rows × 7 columns

Why would you use summarize like this, rather than the arrange and filter approach?

The min and max values might be in different observations (rows).

That’s right. Filter keeps specific rows, but summarize can collect values from across rows.

It keeps more of the original data.

Try again. Filter keeps all the variables (columns), while summarize will remove most of them.

The arrange and filter approach can do anything summarize can.

Try again.

Exercise 3:

The examples below show what happens verbs like filter and mutate use methods like .mean().

# create column for average streams
(music_top200
  >> mutate(avg_streams = _.streams.mean())
)

	country	position	track_name	artist	streams	duration	continent	avg_streams
0	Argentina	1	Tusa	KAROL G	1858666	200.960	Americas	243405.2625
1	Argentina	2	Tattoo	Rauw Alejandro	1344382	202.887	Americas	243405.2625
2	Argentina	3	Hola - Remix	Dalex	1330011	249.520	Americas	243405.2625
...	...	...	...	...	...	...	...	...
12397	South Africa	198	Black And White	Niall Horan	11771	193.090	Africa	243405.2625
12398	South Africa	199	When I See U	Fantasia	11752	217.347	Africa	243405.2625
12399	South Africa	200	Psycho!	MASN	11743	197.217	Africa	243405.2625

12400 rows × 8 columns

# keep observations where streams are above average
(music_top200
  >> filter(_.streams < _.streams.mean())
)

	country	position	track_name	artist	streams	duration	continent
108	Argentina	109	Me Gusta	Ciro y los Persas	243159	289.093	Americas
109	Argentina	110	Tal Vez	Paulo Londra	242870	264.483	Americas
110	Argentina	111	Physical	Dua Lipa	239225	193.829	Americas
...	...	...	...	...	...	...	...
12397	South Africa	198	Black And White	Niall Horan	11771	193.090	Africa
12398	South Africa	199	When I See U	Fantasia	11752	217.347	Africa
12399	South Africa	200	Psycho!	MASN	11743	197.217	Africa

9341 rows × 7 columns

Based on the examples above, can you use only the filter verb to get the most streamed song in all the data?

# modify the filter verb below

(music_top200
  >> filter()
)

	country	position	track_name	artist	streams	duration	continent
0	Argentina	1	Tusa	KAROL G	1858666	200.960	Americas
1	Argentina	2	Tattoo	Rauw Alejandro	1344382	202.887	Americas
2	Argentina	3	Hola - Remix	Dalex	1330011	249.520	Americas
...	...	...	...	...	...	...	...
12397	South Africa	198	Black And White	Niall Horan	11771	193.090	Africa
12398	South Africa	199	When I See U	Fantasia	11752	217.347	Africa
12399	South Africa	200	Psycho!	MASN	11743	197.217	Africa

12400 rows × 7 columns

Show solution

(music_top200
  >> filter(_.streams == _.streams.max())
)

	country	position	track_name	artist	streams	duration	continent
7800	United States	1	The Box	Roddy Ricch	12987027	196.653	Americas

1 rows × 7 columns

What is the most streamed song?

The Box by Roddy Ricch