Summarizing data

Data analysis

Extracting data

(music_top200
  >> filter(_.country == "Japan", _.position == 1)
)
      country  position  track_name                 artist  streams  duration  continent
6400    Japan         1   I LOVE...  Official HIGE DANdism  1591844   282.027       Asia

1 rows × 7 columns

The summarize verb

(music_top200
  >> summarize(avg_duration = _.duration.mean())
)
avg_duration
0 205.990073

1 rows × 1 columns

Summarizing one country

(music_top200
  >> filter(_.country == "Japan")
  >> summarize(avg_duration = _.duration.mean())
)
avg_duration
0 250.53499

1 rows × 1 columns

Summarizing into multiple columns

(music_top200
  >> filter(_.country == "Japan")
  >> summarize(
      avg_duration = _.duration.mean(),
      ttl_streams = _.streams.sum()
  )
)
   avg_duration  ttl_streams
0     250.53499     48942067

1 rows × 2 columns
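
For comparison, the same filter-then-summarize step can be sketched in plain pandas. This is only an illustration: the `toy` table and its values are made up and stand in for music_top200, and building the one-row result by hand is one possible way to mirror what summarize returns, not how siuba works internally.

```python
import pandas as pd

# Hypothetical toy data; NOT the real music_top200 table.
toy = pd.DataFrame({
    "country": ["Japan", "Japan", "Brazil"],
    "duration": [240.0, 260.0, 180.0],
    "streams": [100, 200, 50],
})

# Rough equivalent of filter(_.country == "Japan"): keep matching rows only.
japan = toy[toy["country"] == "Japan"]

# Rough equivalent of summarize(...): collapse to one row,
# with one column per summary expression.
summary = pd.DataFrame({
    "avg_duration": [japan["duration"].mean()],
    "ttl_streams": [japan["streams"].sum()],
})

print(summary)
```

Each named argument to summarize becomes one column of the result, which is why the output above has 1 row and 2 columns.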

Methods for summarizing

E.g. _.some_column.mean()

  • .mean()
  • .sum()
  • .median()
  • .min()
  • .max()
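
To see what each of these methods computes, here is a minimal sketch on a single pandas Series. It assumes that, for pandas data, an expression like `_.some_column.mean()` ends up calling the Series method of the same name. The `duration` values are hypothetical.

```python
import pandas as pd

# Hypothetical song durations in seconds (illustrative values only).
duration = pd.Series([200.0, 250.0, 300.0, 180.0])

# Each method reduces the whole column to a single value.
summaries = {
    "mean": duration.mean(),      # arithmetic average
    "sum": duration.sum(),        # total
    "median": duration.median(),  # middle value of the sorted data
    "min": duration.min(),        # smallest value
    "max": duration.max(),        # largest value
}

for name, value in summaries.items():
    print(name, value)
```

Because every method returns one value per column, any of them can be used inside summarize to produce a one-row result.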

Let’s practice!