import pandas as pd
from siuba.data import penguins
from siuba import _, mutate, summarize, group_by, filter
fruits = pd.Series([
    "apple",
    "apricot",
    "avocado",
    "banana",
    "bell pepper"
])
df_fruits = pd.DataFrame({"name": fruits})
String operations (str) 📝
This page is largely complete, but is actively being refined / improved.
Overview
String operations allow you to perform actions like:
- Match: detect when a string matches a pattern.
- Transform: e.g. convert something from mIxED to lower case, or replace part of it.
- Extract: grab specific parts of a string value (e.g. a matching pattern).
This page will cover different methods for performing these actions, but will ultimately focus on str.contains(), str.replace(), and str.extract() for common match, transform, and extract tasks.
Using string methods
siuba uses pandas methods, so you can use any of the string methods pandas makes available, like .str.upper().
fruits.str.upper()
0 APPLE
1 APRICOT
2 AVOCADO
3 BANANA
4 BELL PEPPER
dtype: object
Note that most string methods use .str.<method_name>() syntax. These are called “string accessor methods”, since they are accessed from a special place (.str).
Using in verbs
Use string methods as you would any other methods inside verbs.
mutate(df_fruits, loud = _.name.str.upper())
| | name | loud |
|---|---|---|
| 0 | apple | APPLE |
| 1 | apricot | APRICOT |
| 2 | avocado | AVOCADO |
| 3 | banana | BANANA |
| 4 | bell pepper | BELL PEPPER |
Matching patterns
Fixed text
There are three common approaches for simple string matches:
- An exact match with ==.
- A match from an anchor point, using str.startswith() or str.endswith().
- A match from any point, using str.contains().
# exact match
fruits == "banana"
# starts with "ap"
fruits.str.startswith("ap")
# ends with "cado"
fruits.str.endswith("cado")
# has an "e" anywhere
fruits.str.contains("e", regex=False)
0 True
1 False
2 False
3 False
4 True
dtype: bool
All of these operations return a boolean Series, so they can be used to filter rows.
filter(df_fruits, _.name.str.startswith("ap"))
| | name |
|---|---|
| 0 | apple |
| 1 | apricot |
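For comparison, the same row filter can be sketched in plain pandas with boolean indexing. This is a minimal sketch that assumes only pandas is installed and rebuilds the fruits data from the top of the page; the names starts_ap and matches are illustrative and not part of the original page.

```python
import pandas as pd

fruits = pd.Series(["apple", "apricot", "avocado", "banana", "bell pepper"])
df_fruits = pd.DataFrame({"name": fruits})

# A boolean Series: True where the name starts with "ap".
starts_ap = df_fruits["name"].str.startswith("ap")

# Boolean indexing keeps only the rows where the mask is True.
matches = df_fruits[starts_ap]
print(matches)
```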
Note that for str.contains() we set the regex=False argument. This is because—unlike operations like str.startswith()—pandas by default assumes you are passing something called a regular expression to str.contains().
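The effect of regex=False is easiest to see with a pattern that contains a special character. The sketch below uses a small hypothetical Series (codes, not from the original page) where "." means something different in each mode.

```python
import pandas as pd

# Hypothetical data: one entry has a literal ".", the other does not.
codes = pd.Series(["a.b", "axb"])

# Default (regex=True): "." is a wildcard, so both entries match.
as_regex = codes.str.contains("a.b")

# regex=False: "." is a literal period, so only "a.b" matches.
as_literal = codes.str.contains("a.b", regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```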
str.contains() patterns
Use str.contains(...) to perform matches with regular expressions—a special string syntax for specifying patterns to match.
For example, you can use "^" or "$" to match the start or end of a string, respectively.
# check if starts with "ap" ----
penguins.species.str.contains("^ap")
0 False
1 False
...
342 False
343 False
Name: species, Length: 344, dtype: bool
# check if ends with "a" ----
penguins.species.str.contains("a$")
0 False
1 False
...
342 False
343 False
Name: species, Length: 344, dtype: bool
Note that "$" and "^" are called anchor points.
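Since the first and last rows of the penguins output above are all False, it may help to see anchors produce matches. The sketch below applies the same idea to the fruits Series from the top of the page, rebuilt here so the block runs on its own; the variable names are illustrative.

```python
import pandas as pd

fruits = pd.Series(["apple", "apricot", "avocado", "banana", "bell pepper"])

# "^ap" only matches when "ap" sits at the very start of the string.
starts_with_ap = fruits.str.contains("^ap")

# "o$" only matches when "o" is the very last character.
ends_with_o = fruits.str.contains("o$")

print(starts_with_ap.tolist())
print(ends_with_o.tolist())
```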
Transforming strings
String transformations take a string and return a new, changed version. For example, you can convert all the letters to lower, upper, or title case.
Simple transformations
fruits.str.lower()
fruits.str.upper()
0 APPLE
1 APRICOT
2 AVOCADO
3 BANANA
4 BELL PEPPER
dtype: object
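Title case is mentioned above but not shown. A minimal sketch, using a small hypothetical Series (words) so the case changes are visible:

```python
import pandas as pd

# Hypothetical mixed-case input, not from the original page.
words = pd.Series(["mIxED", "bell pepper"])

lowered = words.str.lower()   # every letter lower case
titled = words.str.title()    # first letter of each word upper case

print(lowered.tolist())
print(titled.tolist())
```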
str.replace() patterns
Use .str.replace(..., regex=True) with regular expressions to replace patterns in strings.
For example, the code below uses "p.", where . is called a wildcard, which matches any character.
fruits.str.replace("p.", "XX", regex=True)
0 aXXle
1 aXXicot
2 avocado
3 banana
4 bell XXXXer
dtype: object
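To see what the regex argument changes in str.replace(), the sketch below runs the same "p." pattern both ways on a small hypothetical Series (labels, not from the original page). With regex=False the period is taken literally.

```python
import pandas as pd

# Hypothetical data: only the first entry contains a literal "p.".
labels = pd.Series(["p. 12", "pp 12"])

# regex=True: "p." means "p followed by any character".
as_regex = labels.str.replace("p.", "XX", regex=True)

# regex=False: "p." means the two literal characters "p" and ".".
as_literal = labels.str.replace("p.", "XX", regex=False)

print(as_regex.tolist())
print(as_literal.tolist())
```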
Extracting parts
.str[] to slice
It is currently not possible to apply a sequence of slices to .str. You can only apply the same slice to every string in the Series.
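A minimal sketch of applying one slice to every string, using the fruits Series from the top of the page (rebuilt here so the block is self-contained). The variable names are illustrative.

```python
import pandas as pd

fruits = pd.Series(["apple", "apricot", "avocado", "banana", "bell pepper"])

# The same slice is applied to every entry.
first_three = fruits.str[:3]

# A single index works the same way; -1 takes the last character.
last_char = fruits.str[-1]

print(first_three.tolist())
print(last_char.tolist())
```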
.str.extract() patterns
Use str.extract() with a regular expression to pull out a matching piece of text.
For example, the regular expression "a(.*)" contains the following pieces:
- a matches the literal letter “a”.
- .* has a . which matches anything, and * which modifies it to apply 0 or more times.
- The parentheses mark the part of the match that str.extract() returns.
fruits.str.extract("a(.*)")
| | 0 |
|---|---|
| 0 | pple |
| 1 | pricot |
| 2 | vocado |
| 3 | nana |
| 4 | NaN |
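By default str.extract() returns a DataFrame with one column per group. The sketch below shows two variations that pandas supports: expand=False to get a Series back for a single group, and a named group to label the column. The group name rest is an illustrative choice, not from the original page.

```python
import pandas as pd

fruits = pd.Series(["apple", "apricot", "avocado", "banana", "bell pepper"])

# expand=False returns a Series when the pattern has a single group.
after_a = fruits.str.extract("a(.*)", expand=False)

# A named group, (?P<rest>...), becomes the column name in the result.
named = fruits.str.extract("a(?P<rest>.*)")

print(after_a)
print(named.columns.tolist())
```

Rows with no match, like "bell pepper" here, come back as missing values.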
Split and flatten
.str.split() into list-entries
Use .str.split() to split each entry on a character, producing a list per row of split strings.
fruits.str.split("pp")
0 [a, le]
1 [apricot]
2 [avocado]
3 [banana]
4 [bell pe, er]
dtype: object
Seeing each entry be a list may be surprising, and is fairly rare in pandas.
.str.join() is the inverse of split
penguins.species.str.split("e").str.join("e")
0 Adelie
1 Adelie
...
342 Chinstrap
343 Chinstrap
Name: species, Length: 344, dtype: object
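Joining with a different separator than the one used to split is a simple way to swap delimiters. A minimal sketch on the fruits Series (rebuilt here); the variable names are illustrative.

```python
import pandas as pd

fruits = pd.Series(["apple", "apricot", "avocado", "banana", "bell pepper"])

# Split on spaces, then join the pieces back with underscores.
snake = fruits.str.split(" ").str.join("_")

# Joining with the original separator recovers the input.
round_trip = fruits.str.split("pp").str.join("pp")

print(snake.tolist())
```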
.explode() to unnest entries
Use .explode() to take a column with list-entries (like those returned by .str.split()) and unnest each entry, so there is 1 row per element in each list.
splits = fruits.str.split("pp")
splits
0 [a, le]
1 [apricot]
2 [avocado]
3 [banana]
4 [bell pe, er]
dtype: object
Notice that the result above has 5 list-entries (rows). The first and last rows are the splits ["a", "le"] and ["bell pe", "er"], so there are 7 elements total.
The .explode() method makes each of the 7 elements its own row.
splits.explode()
0 a
0 le
...
4 bell pe
4 er
Length: 7, dtype: object
Be careful to note that it’s .explode() and not .str.explode(), since it can be used on lists of other things as well!
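The same method exists on DataFrames, where it takes the name of the column to unnest and repeats the other columns alongside it. A minimal sketch in plain pandas, assuming a hypothetical piece column added for illustration:

```python
import pandas as pd

fruits = pd.Series(["apple", "apricot", "avocado", "banana", "bell pepper"])
df = pd.DataFrame({"name": fruits})

# Hypothetical column holding the list produced by splitting on "pp".
df["piece"] = df["name"].str.split("pp")

# One row per list element; "name" is repeated for each piece.
long_df = df.explode("piece")
print(long_df)
```

The original index values are kept, so rows that came from the same list share an index label.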
.str.findall() for advanced splitting
Use .str.findall() to get a list of every match of a pattern within each string.
For example, the code below uses "pp?", where ? means the preceding character (“p”) is optional for matching:
fruits.str.findall("pp?")
0 [pp]
1 [p]
2 []
3 []
4 [p, pp]
dtype: object
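Because each entry is a list, a common follow-up is counting the matches per row. The sketch below shows two ways that should agree for this pattern: taking the length of each list, and using str.count(), which also accepts a regular expression.

```python
import pandas as pd

fruits = pd.Series(["apple", "apricot", "avocado", "banana", "bell pepper"])

# Length of each list of matches.
n_from_lists = fruits.str.findall("pp?").str.len()

# Count matches of the pattern directly.
n_from_count = fruits.str.count("pp?")

print(n_from_lists.tolist())
```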
More regular expressions
Anchor points
- ^ matches the beginning of a string.
- $ matches the end of a string.
Repetition qualifiers
- * matches 0 or more
- + matches 1 or more
- ? matches 0 or 1
Grouping
- ()
- {}
- []
Alternatives
|
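The sketch below tries several of the pieces listed above on the fruits Series. The page does not spell out what {} and [] do, so the readings used here (a repeat count and a set of allowed characters) are standard regular expression semantics rather than something taken from the original text.

```python
import pandas as pd

fruits = pd.Series(["apple", "apricot", "avocado", "banana", "bell pepper"])

# Alternatives: starts with "ap" OR ends with "na".
either = fruits.str.contains("^ap|na$")

# Repetition: "p+" matches each run of one or more "p".
p_runs = fruits.str.findall("p+")

# Braces: "p{2}" matches exactly two "p" in a row.
double_p = fruits.str.contains("p{2}")

# Brackets: "[ao]$" matches a final character that is "a" or "o".
ends_a_or_o = fruits.str.contains("[ao]$")

print(either.tolist())
print(p_runs.tolist())
```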