Introduction to Siuba

Resources

Data Analysis

Meet the data: Spotify top 200

music_top200
       country       position  track_name       artist          streams  duration  continent
0      Argentina     1         Tusa             KAROL G         1858666  200.960   Americas
1      Argentina     2         Tattoo           Rauw Alejandro  1344382  202.887   Americas
2      Argentina     3         Hola - Remix     Dalex           1330011  249.520   Americas
...    ...           ...       ...              ...             ...      ...       ...
12397  South Africa  198       Black And White  Niall Horan     11771    193.090   Africa
12398  South Africa  199       When I See U     Fantasia        11752    217.347   Africa
12399  South Africa  200       Psycho!          MASN            11743    197.217   Africa

12400 rows × 7 columns

Meet the data: Spotify song features

How code is structured

(track_features
  >> filter(_.artist == "The Weeknd")
  >> ggplot(aes("energy", "valence"))
   + geom_point()
)
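
The parentheses let one expression span several lines, and each >> hands the result on its left to the verb on its right. As a rough illustration of that mechanism only, here is a plain-Python sketch in which a hypothetical Verb class implements >> through __rrshift__. The names Verb, keep_if and sort_by are made up for this example; they are not siuba's API or its implementation. The three rows reuse artist and energy values that appear in this deck's track_features output.

```python
class Verb:
    """Hypothetical stand-in for a piped verb (not siuba's implementation)."""

    def __init__(self, func):
        self.func = func

    def __rrshift__(self, data):
        # Python calls this for `data >> verb` when `data` itself
        # does not know how to handle `>>` (a list does not).
        return self.func(data)


def keep_if(predicate):
    # Hypothetical verb: keep the rows for which predicate(row) is true.
    return Verb(lambda rows: [row for row in rows if predicate(row)])


def sort_by(key):
    # Hypothetical verb: sort the rows by key(row), ascending.
    return Verb(lambda rows: sorted(rows, key=key))


rows = [
    {"artist": "The Weeknd", "energy": 0.796},
    {"artist": "Billie Eilish", "energy": 0.394},
    {"artist": "The Weeknd", "energy": 0.593},
]

# Same shape as the slide code: start the block, then pipe step by step.
result = (rows
  >> keep_if(lambda row: row["artist"] == "The Weeknd")
  >> sort_by(lambda row: row["energy"])
)

print(result)
```

Each step receives the output of the step before it, which is why removing the last line of a block shows the intermediate result.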

How code is structured

(track_features
  >> filter(_.artist == "The Weeknd")

 
)
       artist      album                      track_name                            energy  valence  danceability  speechiness  acousticness  popularity  duration
568    The Weeknd  My Dear Melancholy,        Call Out My Name                      0.593   0.175    0.461         0.0356       0.17000       82          228.373
2753   The Weeknd  Blinding Lights            Blinding Lights                       0.796   0.345    0.513         0.0629       0.00147       75          201.573
3004   The Weeknd  In Your Eyes (Remix)       In Your Eyes (with Doja Cat) - Remix  0.731   0.727    0.679         0.0319       0.00518       81          237.912
...    ...         ...                        ...                                   ...     ...      ...           ...          ...           ...         ...
23966  The Weeknd  Beauty Behind The Madness  The Hills                             0.564   0.137    0.585         0.0515       0.06710       83          242.253
24688  The Weeknd  Starboy                    Starboy                               0.587   0.486    0.679         0.2760       0.14100       84          230.453
24982  The Weeknd  After Hours                In Your Eyes                          0.719   0.717    0.667         0.0346       0.00285       91          237.520

23 rows × 10 columns

How code is structured

(track_features
  >> filter(_.artist == "The Weeknd")
  >> ggplot(aes("energy", "valence"))
 
)

Let’s practice!

The filter verb

Filter for top songs

(music_top200
  >> filter(_.position == 1)
)

Filter step 1: start the block

(music_top200

)

Filter step 2: pipe operator and verb name

(music_top200
  >> filter()
)

Filter step 3: write the operation

(music_top200
  >> filter(_.position == 1)
)

Filter for top songs

(music_top200
  >> filter(_.position == 1)
)
       country       position  track_name       artist       streams  duration  continent
0      Argentina     1         Tusa             KAROL G      1858666  200.960   Americas
200    Austria       1         Blinding Lights  The Weeknd   229576   201.573   Europe
400    Australia     1         Blinding Lights  The Weeknd   1757343  201.573   Oceania
...    ...           ...       ...              ...          ...      ...       ...
11800  Uruguay       1         Tusa             KAROL G      120175   200.960   Americas
12000  Viet Nam      1         Sweet Night      V            189261   214.259   Asia
12200  South Africa  1         The Box          Roddy Ricch  94422    196.653   Africa

62 rows × 7 columns

Filter for country

(music_top200
  >> filter(_.country == "United States")
)
      country        position  track_name              artist        streams   duration  continent
7800  United States  1         The Box                 Roddy Ricch   12987027  196.653   Americas
7801  United States  2         Myron                   Lil Uzi Vert  9163134   224.955   Americas
7802  United States  3         Blueberry Faygo         Lil Mosey     8043475   162.547   Americas
...   ...            ...       ...                     ...           ...       ...       ...
7997  United States  198       Lights Up               Harry Styles  1606234   172.227   Americas
7998  United States  199       Without Me              Halsey        1606153   201.661   Americas
7999  United States  200       Enemies (feat. DaBaby)  Post Malone   1597824   196.760   Americas

200 rows × 7 columns

Filter with two variables

(music_top200
  >> filter(_.position == 1, _.country == "United States")
)
      country        position  track_name  artist       streams   duration  continent
7800  United States  1         The Box     Roddy Ricch  12987027  196.653   Americas

1 rows × 7 columns
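
The single row above reflects that conditions passed to filter as separate arguments must all hold for a row to be kept. A minimal plain-Python sketch of that logic (not siuba itself), using three rows copied from the outputs above:

```python
# Three rows taken from the music_top200 outputs shown earlier.
rows = [
    {"country": "Argentina", "position": 1, "track_name": "Tusa"},
    {"country": "United States", "position": 1, "track_name": "The Box"},
    {"country": "United States", "position": 2, "track_name": "Myron"},
]

# Mirrors filter(_.position == 1, _.country == "United States"):
# a row survives only if every condition is true (logical AND).
top_us = [
    row for row in rows
    if row["position"] == 1 and row["country"] == "United States"
]

print(top_us)
```

Tusa is dropped because of its country and Myron because of its position, leaving only The Box.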

Let’s practice!

The arrange verb

Sorting with arrange

(music_top200
  >> arrange(_.duration)
)
       country    position  track_name                        artist             streams  duration  continent
10868  Slovakia   69        Klop Klop                         Karlo              17222    65.631    Europe
4586   Greece     187       FENDI                             iLLEOo             16786    76.099    Europe
9937   Poland     138       Mistrz ping-ponga                 PRO8L3M            145143   83.360    Europe
...    ...        ...       ...                               ...                ...      ...       ...
535    Australia  136       Innerbloom                        RÜFÜS DU SOL       260092   578.041   Oceania
1302   Brazil     103       Poesia Acústica #8: Amor e Samba  Pineapple StormTv  839192   614.615   Americas
11557  Turkey     158       Susamam                           Şanışer            194804   851.871   Asia

12400 rows × 7 columns

arrange descending

(music_top200
  >> arrange(-_.duration)
)
       country    position  track_name                        artist             streams  duration  continent
11557  Turkey     158       Susamam                           Şanışer            194804   851.871   Asia
1302   Brazil     103       Poesia Acústica #8: Amor e Samba  Pineapple StormTv  839192   614.615   Americas
535    Australia  136       Innerbloom                        RÜFÜS DU SOL       260092   578.041   Oceania
...    ...        ...       ...                               ...                ...      ...       ...
9937   Poland     138       Mistrz ping-ponga                 PRO8L3M            145143   83.360    Europe
4586   Greece     187       FENDI                             iLLEOo             16786    76.099    Europe
10868  Slovakia   69        Klop Klop                         Karlo              17222    65.631    Europe

12400 rows × 7 columns
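
Putting a minus sign in front of the column reverses the order because sorting the negated values from smallest to largest is the same as sorting the original values from largest to smallest. A small plain-Python sketch of the idea (not siuba), using three track durations from the output above:

```python
# (track_name, duration) pairs copied from the arrange output above.
tracks = [
    ("Innerbloom", 578.041),
    ("Klop Klop", 65.631),
    ("Susamam", 851.871),
]

# Like arrange(_.duration): smallest duration first.
ascending = sorted(tracks, key=lambda track: track[1])

# Like arrange(-_.duration): sort on the negated value,
# so the largest duration comes first.
descending = sorted(tracks, key=lambda track: -track[1])

print([name for name, _ in ascending])
print([name for name, _ in descending])
```

The negation trick only applies to numeric columns, since text cannot be negated.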

arrange and filter

(music_top200
  >> filter(_.country == "United States")
  >> arrange(-_.duration)
)
      country        position  track_name                                             artist        streams  duration  continent
7841  United States  42        After Hours                                            The Weeknd    3672033  361.027   Americas
7915  United States  116       Life Is Good (feat. Drake, DaBaby & Lil Baby) - Remix  Future        2181930  315.346   Americas
7923  United States  124       SICKO MODE                                             Travis Scott  2085268  312.820   Americas
...   ...            ...       ...                                                    ...           ...      ...       ...
7832  United States  33        Strawberry Peels (feat. Young Thug & Gunna)            Lil Uzi Vert  4007781  115.350   Americas
7853  United States  54        CITY OF ANGELS                                         24kGoldn      3443366  112.493   Americas
7971  United States  172       Skechers                                               DripReport    1731265  106.000   Americas

200 rows × 7 columns

Let’s practice!

The mutate verb

Using mutate to change a variable

(music_top200
  >> mutate(streams = _.streams / 1000)
)

Using mutate to change a variable (result)

(music_top200
  >> mutate(streams = _.streams / 1000)
)
       country       position  track_name       artist          streams   duration  continent
0      Argentina     1         Tusa             KAROL G         1858.666  200.960   Americas
1      Argentina     2         Tattoo           Rauw Alejandro  1344.382  202.887   Americas
2      Argentina     3         Hola - Remix     Dalex           1330.011  249.520   Americas
...    ...           ...       ...              ...             ...       ...       ...
12397  South Africa  198       Black And White  Niall Horan     11.771    193.090   Africa
12398  South Africa  199       When I See U     Fantasia        11.752    217.347   Africa
12399  South Africa  200       Psycho!          MASN            11.743    197.217   Africa

12400 rows × 7 columns

Using mutate to add a new variable

(music_top200
  >> mutate(ttl_stream_time = _.streams * _.duration)
)
       country       position  track_name       artist          streams  duration  continent  ttl_stream_time
0      Argentina     1         Tusa             KAROL G         1858666  200.960   Americas   3.735175e+08
1      Argentina     2         Tattoo           Rauw Alejandro  1344382  202.887   Americas   2.727576e+08
2      Argentina     3         Hola - Remix     Dalex           1330011  249.520   Americas   3.318643e+08
...    ...           ...       ...              ...             ...      ...       ...        ...
12397  South Africa  198       Black And White  Niall Horan     11771    193.090   Africa     2.272862e+06
12398  South Africa  199       When I See U     Fantasia        11752    217.347   Africa     2.554262e+06
12399  South Africa  200       Psycho!          MASN            11743    197.217   Africa     2.315919e+06

12400 rows × 8 columns
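
To see where a value in the new column comes from, the first row can be checked by hand. The sketch below repeats the multiplication for Tusa in Argentina with plain Python. It assumes, from the size of the values, that duration is measured in seconds; the deck does not state the unit.

```python
# Values from row 0 of music_top200 (Tusa, Argentina).
streams = 1858666
duration = 200.960  # assumed to be seconds; the unit is not stated in the deck

# Same arithmetic as mutate(ttl_stream_time = _.streams * _.duration),
# applied to a single row.
ttl_stream_time = streams * duration

# Format the way the output table displays it (scientific notation).
print(f"{ttl_stream_time:.6e}")
```

The printed value matches the ttl_stream_time shown for row 0 in the table above.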

Answering a question

(music_top200
  >> mutate(ttl_stream_time = _.streams * _.duration)
  >> filter(_.country == "Costa Rica")
  >> arrange(-_.ttl_stream_time)
)

Answering a question

(music_top200
  >> mutate(ttl_stream_time = _.streams * _.duration)
  >> filter(_.country == "Costa Rica")
 
)
      country     position  track_name                            artist         streams  duration  continent  ttl_stream_time
2200  Costa Rica  1         Safaera                               Bad Bunny      338078   295.177   Americas   9.979285e+07
2201  Costa Rica  2         Si Veo a Tu Mamá                      Bad Bunny      244932   170.972   Americas   4.187651e+07
2202  Costa Rica  3         Ignorantes                            Bad Bunny      233113   210.607   Americas   4.909523e+07
...   ...         ...       ...                                   ...            ...      ...       ...        ...
2397  Costa Rica  198       Ride It                               Regard         21731    157.606   Americas   3.424936e+06
2398  Costa Rica  199       Sola                                  Manuel Turizo  21704    195.044   Americas   4.233235e+06
2399  Costa Rica  200       Nena Maldición (feat. Lenny Tavárez)  Paulo Londra   21684    228.875   Americas   4.962926e+06

200 rows × 8 columns

Answering a question

(music_top200
  >> mutate(ttl_stream_time = _.streams * _.duration)
  >> filter(_.country == "Costa Rica") 
  >> arrange(-_.ttl_stream_time) 
)
      country     position  track_name       artist     streams  duration  continent  ttl_stream_time
2200  Costa Rica  1         Safaera          Bad Bunny  338078   295.177   Americas   9.979285e+07
2202  Costa Rica  3         Ignorantes       Bad Bunny  233113   210.607   Americas   4.909523e+07
2222  Costa Rica  23        René             Residente  101872   457.592   Americas   4.661581e+07
...   ...         ...       ...              ...        ...      ...       ...        ...
2377  Costa Rica  178       Dónde Estás      KHEA       23177    153.560   Americas   3.559060e+06
2394  Costa Rica  195       Blueberry Faygo  Lil Mosey  21771    162.547   Americas   3.538811e+06
2397  Costa Rica  198       Ride It          Regard     21731    157.606   Americas   3.424936e+06

200 rows × 8 columns

Let’s practice!

Visualization with plotnine

Visualizing with plotnine

Importing plotnine


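# siuba: the _ placeholder and the verbs; its filter shadows Python's built-in filter
from siuba import *
# plotnine: ggplot, aes, the geom_* layers, labs, facet_wrap
from plotnine import *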

Variables

billie = (
  track_features
  >> filter(_.artist == "Billie Eilish")
)

Variables

(
  track_features
  >> filter(_.artist == "Billie Eilish")
)

Variables (result)

billie
       artist         album           track_name      energy  valence  danceability  speechiness  acousticness  popularity  duration
1273   Billie Eilish  dont smile ...  my boy          0.3940  0.3240   0.692         0.2070       0.472         44          170.852
2899   Billie Eilish  WHEN WE ALL...  listen befo...  0.0561  0.0820   0.319         0.0450       0.935         79          242.652
2950   Billie Eilish  lovely (wit...  lovely (wit...  0.2960  0.1200   0.351         0.0333       0.934         89          200.186
...    ...            ...             ...             ...     ...      ...           ...          ...           ...         ...
24857  Billie Eilish  WHEN WE ALL...  ilomilo         0.4230  0.5720   0.855         0.0585       0.724         79          156.371
24997  Billie Eilish  WHEN I WAS ...  WHEN I WAS ...  0.3320  0.0628   0.696         0.0425       0.853         71          270.520
25147  Billie Eilish  come out an...  come out an...  0.3210  0.1770   0.640         0.0931       0.693         74          210.376

27 rows × 10 columns

Visualizing with plotnine

(billie
 >> ggplot(aes("energy", "valence"))
  + geom_point()
  + labs(title = "Billie Eilish hit track features")
)

Let’s practice!

Using plotnine geoms

(billie
 >> ggplot(aes("energy", "valence"))
  + geom_point()
)

Using geom_label

(billie
 >> ggplot(aes("energy", "valence", label = "track_name"))
  + geom_label()
)

Using geom_text

(billie
 >> ggplot(aes("energy", "valence", label = "track_name"))
  + geom_text()
)

Combining geoms

(billie
 >> ggplot(aes("energy", "valence", label = "track_name"))
  + geom_text(nudge_y = .1)
  + geom_point()
)

More on geom options

Let’s practice!

Using plotnine Aesthetics

Scatterplots

billie = filter(track_features, _.artist == "Billie Eilish")

(billie
  >> ggplot(aes("energy", "valence"))
   + geom_point()
)

Additional variables

billie
       artist         album                                               track_name                                          energy  valence  danceability  speechiness  acousticness  popularity  duration
1273   Billie Eilish  dont smile at me (Expanded Edition)                 my boy                                              0.3940  0.3240   0.692         0.2070       0.472         44          170.852
2899   Billie Eilish  WHEN WE ALL FALL ASLEEP, WHERE DO WE GO?            listen before i go                                  0.0561  0.0820   0.319         0.0450       0.935         79          242.652
2950   Billie Eilish  lovely (with Khalid)                                lovely (with Khalid)                                0.2960  0.1200   0.351         0.0333       0.934         89          200.186
...    ...            ...                                                 ...                                                 ...     ...      ...           ...          ...           ...         ...
24857  Billie Eilish  WHEN WE ALL FALL ASLEEP, WHERE DO WE GO?            ilomilo                                             0.4230  0.5720   0.855         0.0585       0.724         79          156.371
24997  Billie Eilish  WHEN I WAS OLDER (Music Inspired By The Film ROMA)  WHEN I WAS OLDER - Music Inspired By The Film ROMA  0.3320  0.0628   0.696         0.0425       0.853         71          270.520
25147  Billie Eilish  come out and play                                   come out and play                                   0.3210  0.1770   0.640         0.0931       0.693         74          210.376

27 rows × 10 columns

The color aesthetic

(billie
  >> ggplot(aes("energy", "valence", color = "acousticness"))
   + geom_point()
)

The size aesthetic

(billie
  >> ggplot(aes("energy", "valence", color = "acousticness", size = "popularity"))
   + geom_point()
)

Aesthetics with multiple geoms

(billie
  >> ggplot(aes("energy", "valence", 
                color = "acousticness", size = "popularity",
                label = "track_name"))
   + geom_point()
   + geom_text(nudge_y = .1)
)

Let’s practice!

  • filter artist, get characteristics
  • try different combinations of characteristics: which ones seem most related?
  • diagnose error (unquoted ggplot)
  • diagnose error (verb without _)
  • plot with aesthetics
  • plot text instead
  • labs?

Faceting

asia_top200 = (
  music_top200
  >> filter(_.continent == "Asia")
)
asia_top200
       country    position  track_name                        artist            streams  duration  continent
4600   Hong Kong  1         WANNABE                           ITZY              112648   191.242   Asia
4601   Hong Kong  2         Intentions (feat. Quavo)          Justin Bieber     104467   212.867   Asia
4602   Hong Kong  3         Señorita                          Shawn Mendes      84196    190.960   Asia
...    ...        ...       ...                               ...               ...      ...       ...
12197  Viet Nam   198       Đưa Nhau Đi Trốn (Chill Version)  Đen               20750    241.959   Asia
12198  Viet Nam   199       Hôm Nay Tôi Buồn                  Phùng Khánh Linh  20580    275.000   Asia
12199  Viet Nam   200       Kick It                           NCT 127           20495    233.013   Asia

2600 rows × 7 columns

Faceting

(asia_top200
  >> ggplot(aes("position", "streams", color = "country"))
   + geom_point()
)

Faceting

(asia_top200
  >> ggplot(aes("position", "streams", color = "country"))
   + geom_point()
   + facet_wrap('~country')
)

Let’s practice!

Summarizing data

Data analysis

Extracting data

(music_top200
  >> filter(_.country == "Japan", _.position == 1)
)
      country  position  track_name  artist                 streams  duration  continent
6400  Japan    1         I LOVE...   Official HIGE DANdism  1591844  282.027   Asia

1 rows × 7 columns

The summarize verb

(music_top200
  >> summarize(avg_duration = _.duration.mean())
)
avg_duration
0 205.990073

1 rows × 1 columns

Summarizing one country

(music_top200
  >> filter(_.country == "Japan")
  >> summarize(avg_duration = _.duration.mean())
)
avg_duration
0 250.53499

1 rows × 1 columns

Summarizing into multiple columns

(music_top200
  >> filter(_.country == "Japan")
  >> summarize(
      avg_duration = _.duration.mean(),
      ttl_streams = _.streams.sum()
  )
)
avg_duration ttl_streams
0 250.53499 48942067

1 rows × 2 columns

Methods for summarizing

E.g. _.some_column.mean()

  • .mean()
  • .sum()
  • .median()
  • .min()
  • .max()
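
Each of these methods collapses a whole column into a single number, which is why summarize returns one row. As a stand-in that runs without siuba, the sketch below applies the standard-library equivalents to the three Argentina durations shown at the top of music_top200. The statistics module is used here only for illustration; in a pipeline the methods are called on the column, as in _.duration.mean().

```python
import statistics

# Durations of the first three rows of music_top200 (Argentina).
durations = [200.960, 202.887, 249.520]

summary = {
    "mean": statistics.mean(durations),      # like .mean()
    "sum": sum(durations),                   # like .sum()
    "median": statistics.median(durations),  # like .median()
    "min": min(durations),                   # like .min()
    "max": max(durations),                   # like .max()
}

for name, value in summary.items():
    print(name, round(value, 3))
```

With three values the median is simply the middle one, so it can differ noticeably from the mean when one track is much longer than the others.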

Let’s practice!

The group_by verb

The summarize verb

(music_top200
  >> filter(_.country == "Japan")
  >> summarize(avg_duration = _.duration.mean())
)
avg_duration
0 250.53499

1 rows × 1 columns

Summarizing by country

(music_top200
  >> group_by(_.country)
  >> summarize(avg_duration = _.duration.mean())
)
country avg_duration
0 Argentina 212.847855
1 Australia 204.795300
2 Austria 184.894870
... ... ...
59 United States 190.827500
60 Uruguay 210.796985
61 Viet Nam 217.222830

62 rows × 2 columns
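
Grouping before summarizing means the summary is computed once per group instead of once for the whole table. A plain-Python sketch of that split, apply, combine pattern (not siuba), using four rows taken from music_top200. Because only two rows per country are used here, the averages differ from the full-table values above.

```python
from collections import defaultdict

# (country, duration) pairs copied from music_top200.
rows = [
    ("Argentina", 200.960),
    ("South Africa", 193.090),
    ("Argentina", 202.887),
    ("South Africa", 217.347),
]

# Split: collect the durations belonging to each country.
groups = defaultdict(list)
for country, duration in rows:
    groups[country].append(duration)

# Apply and combine: one average per country, one output row per group.
avg_duration = {
    country: sum(values) / len(values)
    for country, values in sorted(groups.items())
}

print(avg_duration)
```

The result has one entry per distinct country, just as the grouped summarize returns one row per country.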

Summarizing by continent and position

(music_top200
  >> group_by(_.continent, _.position)
  >> summarize(
      min_streams = _.streams.min(),
      max_streams = _.streams.max()
  )
)
continent position min_streams max_streams
0 Africa 1 94422 94422
1 Africa 2 74689 74689
2 Africa 3 67552 67552
... ... ... ... ...
997 Oceania 198 44570 225951
998 Oceania 199 44364 225492
999 Oceania 200 44291 225179

1000 rows × 4 columns

Summarizing by continent and position

(music_top200

  >> summarize(
      min_streams = _.streams.min(),
      max_streams = _.streams.max()
  )
)
min_streams max_streams
0 1470 12987027

1 rows × 2 columns

Summarizing by continent and position

(music_top200
  >> filter(_.continent == "Oceania", _.position == 1)
  >> summarize(
      min_streams = _.streams.min(),
      max_streams = _.streams.max()
  )
)
min_streams max_streams
0 321272 1757343

1 rows × 2 columns

Let’s practice!

Visualizing summarized data

When visualizing raw data doesn’t work

(music_top200
  >> ggplot(aes("position", "streams", color = "country"))
   + geom_point()
)

Calculating min and max streams

by_position = (
  music_top200
  >> group_by(_.position)
  >> summarize(max_streams = _.streams.max(),
               min_streams = _.streams.min())
)
by_position
position max_streams min_streams
0 1 12987027 13604
1 2 9163134 10801
2 3 8043475 9510
... ... ... ...
197 198 1606234 1472
198 199 1606153 1470
199 200 1597824 1470

200 rows × 3 columns

Plotting

(by_position
  >> ggplot(aes("position", "max_streams"))
   + geom_point()
   + labs(title = "Top 200 hits - max streams overall")
)

Starting y-axis at 0

(by_position
  >> ggplot(aes("position", "max_streams"))
   + geom_point()
   + expand_limits(y = 0)
   + labs(title = "Top 200 hits - max streams overall")
)

Calculating min and max streams

by_continent_position = (
  music_top200
  >> group_by(_.continent, _.position)
  >> summarize(max_streams = _.streams.max(),
               min_streams = _.streams.min())
)
by_continent_position
continent position max_streams min_streams
0 Africa 1 94422 94422
1 Africa 2 74689 74689
2 Africa 3 67552 67552
... ... ... ... ...
997 Oceania 198 225951 44570
998 Oceania 199 225492 44364
999 Oceania 200 225179 44291

1000 rows × 4 columns

Visualize

(by_continent_position
  >> ggplot(aes("position", "max_streams", color = "continent"))
   + geom_point()
   + expand_limits(y = 0)
   + labs(title = "Top 200 hits - max streams overall")
)

Let’s practice!