This function lets you select specific columns of your data to keep.
There are three different building blocks that can used in a selection:
- including columns using a name, slice, or position).
 
- excluding columns by putting 
- beforehand. 
- matching based on a pattern, like all columns that start with 
"ttl_". 
import pandas as pd
pd.set_option("display.max_rows", 5)
from siuba import _, select
from siuba.data import penguins
penguins
 
  
    
       | 
      species | 
      island | 
      bill_length_mm | 
      bill_depth_mm | 
      flipper_length_mm | 
      body_mass_g | 
      sex | 
      year | 
    
  
  
    
      | 0 | 
      Adelie | 
      Torgersen | 
      39.1 | 
      18.7 | 
      181.0 | 
      3750.0 | 
      male | 
      2007 | 
    
    
      | 1 | 
      Adelie | 
      Torgersen | 
      39.5 | 
      17.4 | 
      186.0 | 
      3800.0 | 
      female | 
      2007 | 
    
    
      | ... | 
      ... | 
      ... | 
      ... | 
      ... | 
      ... | 
      ... | 
      ... | 
      ... | 
    
    
      | 342 | 
      Chinstrap | 
      Dream | 
      50.8 | 
      19.0 | 
      210.0 | 
      4100.0 | 
      male | 
      2009 | 
    
    
      | 343 | 
      Chinstrap | 
      Dream | 
      50.2 | 
      18.7 | 
      198.0 | 
      3775.0 | 
      female | 
      2009 | 
    
  
344 rows × 8 columns
 
 
 
Select by name or position
The simplest way to select a column to keep is to refer to it by name or position.
select(penguins, _.species, _.island, 6, -1)
 
  
    
       | 
      species | 
      island | 
      sex | 
      year | 
    
  
  
    
      | 0 | 
      Adelie | 
      Torgersen | 
      male | 
      2007 | 
    
    
      | 1 | 
      Adelie | 
      Torgersen | 
      female | 
      2007 | 
    
    
      | ... | 
      ... | 
      ... | 
      ... | 
      ... | 
    
    
      | 342 | 
      Chinstrap | 
      Dream | 
      male | 
      2009 | 
    
    
      | 343 | 
      Chinstrap | 
      Dream | 
      female | 
      2009 | 
    
  
344 rows × 4 columns
 
 
 
The code above does the following:
- selects by name the 
species and island columns. 
- selects by position the index 
6 and -1 columns (the last item). 
Selecting by position should produce the same results as indexing a list of names.
penguins.columns[6]       # "sex"
penguins.columns[-1]      # "year"
 
Excluding columns
You can remove a column from the data by putting a tilde operator (~) in front of it.
penguins >> select(~_.body_mass_g, ~_.sex, ~_.year)
 
  
    
       | 
      species | 
      island | 
      bill_length_mm | 
      bill_depth_mm | 
      flipper_length_mm | 
    
  
  
    
      | 0 | 
      Adelie | 
      Torgersen | 
      39.1 | 
      18.7 | 
      181.0 | 
    
    
      | 1 | 
      Adelie | 
      Torgersen | 
      39.5 | 
      17.4 | 
      186.0 | 
    
    
      | ... | 
      ... | 
      ... | 
      ... | 
      ... | 
      ... | 
    
    
      | 342 | 
      Chinstrap | 
      Dream | 
      50.8 | 
      19.0 | 
      210.0 | 
    
    
      | 343 | 
      Chinstrap | 
      Dream | 
      50.2 | 
      18.7 | 
      198.0 | 
    
  
344 rows × 5 columns
 
 
 
The code above keeps all columns except body_mass_g, sex, and year.
Note that the ~ operator flips the value of True and False in pandas, and is called the “invert operator”.
~pd.Series([True, False])
 
0    False
1     True
dtype: bool
 
 
Renaming columns
You can rename a specified column by using the equality operator (==). This operation takes the following form.
penguins >> select(_.species_name == _.species)
 
  
    
       | 
      species_name | 
    
  
  
    
      | 0 | 
      Adelie | 
    
    
      | 1 | 
      Adelie | 
    
    
      | ... | 
      ... | 
    
    
      | 342 | 
      Chinstrap | 
    
    
      | 343 | 
      Chinstrap | 
    
  
344 rows × 1 columns
 
 
 
Note that expressing the new column name on the left is similar to how creating a python dictionary works. For example…
select(_.a == _.x, _.b == _.y) 
dict(a = "x", b = "y") 
both create new entries named “a” and “b”.
Select by slice
When the columns are adjacent to each other, you can select them using _["start_col":"end_col"].
penguins >> select(_.species, _["bill_length_mm":"body_mass_g"])
 
  
    
       | 
      species | 
      bill_length_mm | 
      bill_depth_mm | 
      flipper_length_mm | 
      body_mass_g | 
    
  
  
    
      | 0 | 
      Adelie | 
      39.1 | 
      18.7 | 
      181.0 | 
      3750.0 | 
    
    
      | 1 | 
      Adelie | 
      39.5 | 
      17.4 | 
      186.0 | 
      3800.0 | 
    
    
      | ... | 
      ... | 
      ... | 
      ... | 
      ... | 
      ... | 
    
    
      | 342 | 
      Chinstrap | 
      50.8 | 
      19.0 | 
      210.0 | 
      4100.0 | 
    
    
      | 343 | 
      Chinstrap | 
      50.2 | 
      18.7 | 
      198.0 | 
      3775.0 | 
    
  
344 rows × 5 columns
 
 
 
You can use three methods to specify a column in a slice:
_.some_col 
"some_col" 
- a position number
 
Exclusion
You can exclude slice selections using the ~ operator.
penguins >> select(~_["bill_length_mm":"body_mass_g"])
 
  
    
       | 
      species | 
      island | 
      sex | 
      year | 
    
  
  
    
      | 0 | 
      Adelie | 
      Torgersen | 
      male | 
      2007 | 
    
    
      | 1 | 
      Adelie | 
      Torgersen | 
      female | 
      2007 | 
    
    
      | ... | 
      ... | 
      ... | 
      ... | 
      ... | 
    
    
      | 342 | 
      Chinstrap | 
      Dream | 
      male | 
      2009 | 
    
    
      | 343 | 
      Chinstrap | 
      Dream | 
      female | 
      2009 | 
    
  
344 rows × 4 columns
 
 
 
Position number
Note that when position number is used to slice columns, the end position is not included in the selection.
# these are equivalent
penguins >> select(0, 1)
penguins >> select(_[0:2])
 
  
    
       | 
      species | 
      island | 
    
  
  
    
      | 0 | 
      Adelie | 
      Torgersen | 
    
    
      | 1 | 
      Adelie | 
      Torgersen | 
    
    
      | ... | 
      ... | 
      ... | 
    
    
      | 342 | 
      Chinstrap | 
      Dream | 
    
    
      | 343 | 
      Chinstrap | 
      Dream | 
    
  
344 rows × 2 columns
 
 
 
Select by pattern (e.g. endswith)
penguins >> select(_.species, _.endswith("mm"))
 
  
    
       | 
      species | 
      bill_length_mm | 
      bill_depth_mm | 
      flipper_length_mm | 
    
  
  
    
      | 0 | 
      Adelie | 
      39.1 | 
      18.7 | 
      181.0 | 
    
    
      | 1 | 
      Adelie | 
      39.5 | 
      17.4 | 
      186.0 | 
    
    
      | ... | 
      ... | 
      ... | 
      ... | 
      ... | 
    
    
      | 342 | 
      Chinstrap | 
      50.8 | 
      19.0 | 
      210.0 | 
    
    
      | 343 | 
      Chinstrap | 
      50.2 | 
      18.7 | 
      198.0 | 
    
  
344 rows × 4 columns
 
 
 
penguins >> select(_.contains("length"))
 
  
    
       | 
      bill_length_mm | 
      flipper_length_mm | 
    
  
  
    
      | 0 | 
      39.1 | 
      181.0 | 
    
    
      | 1 | 
      39.5 | 
      186.0 | 
    
    
      | ... | 
      ... | 
      ... | 
    
    
      | 342 | 
      50.8 | 
      210.0 | 
    
    
      | 343 | 
      50.2 | 
      198.0 | 
    
  
344 rows × 2 columns
 
 
 
Pandas comparison
import pandas as pd
from siuba.data import mtcars
from siuba import select, _
 
 
Click between tabs to compare code across siuba and pandas.
# keep cyl column
mtcars >> select(_.cyl)
# keep all *except* cyl column
mtcars >> select(-_.cyl)
# complex select, plus rename cyl to cylinder
mtcars >> select(_.cylinder == _.cyl, _.startswith("m"))
 
 
 
# keep cyl column
mtcars[["cyl"]]
# keep all *except* cyl column
mtcars.drop(["cyl"], axis=1)
# complex select, plus rename cyl to cylinder
cols = mtcars.columns
mtcars.loc[:, (cols == "cyl") | cols.str.startswith("m")] \
      .rename({"cyl": "cylinder"})