Overview

Siuba is a tool for concise, flexible data-analysis over multiple data sources. It currently supports pandas DataFrames and SQL tables.

Installing

pip install --pre siuba

Basic use

The code below uses the example DataFrame mtcars, to get the average horsepower (hp) per cylinder.

from siuba import _, group_by, summarize
from siuba.data import mtcars

(mtcars
  >> group_by(_.cyl)
  >> summarize(avg_hp = _.hp.mean())
  )
cyl avg_hp
0 4 82.636364
1 6 122.285714
2 8 209.214286

There are three key concepts in this example:

concept example meaning
verb group_by(...) a function that operates on a table, like a DataFrame or SQL table
lazy expression _.hp.mean() an expression created with siuba._, that represents actions you want to perform
pipe mtcars >> group_by(...) a syntax that allows you to chain verbs with the >> operator

Lazy expressions (_)

A siu expression is a way of specifying what action you want to perform. This allows siuba verbs to decide how to execute the action, depending on whether your data is a local DataFrame or remote table.

from siuba import _

_.cyl == 4
β–ˆβ”€==
β”œβ”€β–ˆβ”€.
β”‚ β”œβ”€_
β”‚ └─'cyl'
└─4

Notice how the output represents each step in our lazy expression, with these pieces:

  • black box β–ˆ - a method like checking equality (==) or getting an attribute (.).
  • underscore (_) - a placeholder for a table of data.

We can use these expressions like lambda functions. For example, to keep specific rows of a pandas DataFrame.

# old approach: repeat name
mtcars[mtcars.cyl == 4]

# old approach: lambda
mtcars[lambda _: _.cyl == 4]

# siu approach
mtcars[_.cyl == 4]
mpg cyl disp hp drat wt qsec vs am gear carb
2 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
7 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
... ... ... ... ... ... ... ... ... ... ... ...
27 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
31 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

11 rows Γ— 11 columns

Note that like the lambda function, siuba avoids typing the same (potentially_very_long) name twice, while also being a bit shorter.

Table verbs

Verbs are functions that operate on a table of data. They can be combined using a pipe with the >> operator.

from siuba import _, mutate, filter, group_by, summarize
from siuba.data import mtcars

Mutate

The previous example can be re-written in siuba as the following.

(mtcars
  >> group_by(_.cyl)
  >> mutate(demeaned = _.hp - _.hp.mean())
  )

(grouped data frame)

mpg cyl disp hp drat wt qsec vs am gear carb demeaned
0 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 -12.285714
1 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 -12.285714
... ... ... ... ... ... ... ... ... ... ... ... ...
30 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 125.785714
31 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 26.363636

32 rows Γ— 12 columns

Note that there is a key difference: mutate returned a pandas DataFrame with the new column (demeaned) at the end. This is a core feature of siuba verbs–tables in and tables out.

Filter

Below are examples of keeping certain rows with filter, and calculating a single number per group with summarize.

g_cyl = group_by(mtcars, _.cyl)

# keep lowest hp per group
g_cyl >> filter(_.hp == _.hp.min())

(grouped data frame)

mpg cyl disp hp drat wt qsec vs am gear carb
5 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
18 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
21 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
22 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2

Summarize

g_cyl >> summarize(avg_hp = _.hp.mean())
cyl avg_hp
0 4 82.636364
1 6 122.285714
2 8 209.214286

Column operations

The verbs above received a few different calculations as arguments:

  • _.hp.mean()
  • _.hp.min()

You can use any methods from the underlying pandas objects as methods.

# outside
mtcars.shape[0] + 1

# inside mutate
mtcars >> mutate(res = _.shape[0] + 1)
mpg cyl disp hp drat wt qsec vs am gear carb res
0 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 33
1 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 33
... ... ... ... ... ... ... ... ... ... ... ... ...
30 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 33
31 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 33

32 rows Γ— 12 columns

This includes the str and dt attribute accessor methods:

import pandas as pd

df = pd.DataFrame({"x": ["apple", "banana"]})

# outside
df.x.str.contains("a")

# inside mutate
df >> mutate(res = _.x.str.contains("a"))
x res
0 apple True
1 banana True

Using with plotnine

Fortnuately, plotnine supports siuba’s style of piping, so is easy to plug in to!

from siuba import mutate, _
from plotnine import ggplot, aes, geom_point

(mtcars
  >> mutate(hp_per_cyl = _.hp / _.cyl)
  >> ggplot(aes("cyl", "hp_per_cyl"))
   + geom_point()
)

<ggplot: (8775643137660)>

Next steps

TODO