distinct
distinct(__data, *args, _keep_all=False, **kwargs)
Keep only distinct (unique) rows from a table.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
__data | | The input data. | required |
*args | | Columns to use when determining which rows are unique. | () |
_keep_all | | Whether to keep all columns of the original data, not just *args. | False |
**kwargs | | If specified, arguments are passed to the verb mutate(), and the resulting columns are then used in distinct(). | {} |
Examples:
>>> from siuba import _, distinct, select
>>> from siuba.data import penguins
>>> penguins >> distinct(_.species, _.island)
     species     island
0     Adelie  Torgersen
1     Adelie     Biscoe
2     Adelie      Dream
3     Gentoo     Biscoe
4  Chinstrap      Dream
Use _keep_all=True to keep all columns in each distinct row. This lets you peek at the values of the first row found for each distinct combination.
>>> small_penguins = penguins >> select(_[:4])
>>> small_penguins >> distinct(_.species, _keep_all = True)
     species     island  bill_length_mm  bill_depth_mm
0     Adelie  Torgersen            39.1           18.7
1     Gentoo     Biscoe            46.1           13.2
2  Chinstrap      Dream            46.5           17.9
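To make the effect of _keep_all concrete, the following is a minimal pure-Python sketch of the same idea: keep the first row seen for each distinct key, and return either the key columns alone or the whole row. The function name `distinct_rows` and the toy rows are hypothetical and not part of siuba; this illustrates the semantics only, not how siuba implements them.

```python
def distinct_rows(rows, keys, keep_all=False):
    """Keep the first row seen for each distinct combination of `keys`."""
    seen = set()
    result = []
    for row in rows:
        key = tuple(row[k] for k in keys)
        if key in seen:
            continue  # a later duplicate of an already-seen key: skip it
        seen.add(key)
        # keep_all=True returns the whole first row; otherwise only the key columns
        result.append(dict(row) if keep_all else {k: row[k] for k in keys})
    return result


# Toy rows, made up for illustration (hypothetical data).
rows = [
    {"species": "A", "island": "X", "mass": 1},
    {"species": "A", "island": "Y", "mass": 2},
    {"species": "B", "island": "X", "mass": 3},
    {"species": "A", "island": "X", "mass": 4},
]

keys_only = distinct_rows(rows, ["species"])
full_rows = distinct_rows(rows, ["species"], keep_all=True)
print(keys_only)
print(full_rows)
```

With `keep_all=True`, the extra columns come from whichever row appeared first for each key, so later rows with the same key do not influence the result.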
Source code in siuba/dply/verbs.py
@singledispatch2(DataFrame)
def distinct(__data, *args, _keep_all = False, **kwargs):
    """Keep only distinct (unique) rows from a table.

    Parameters
    ----------
    __data:
        The input data.
    *args:
        Columns to use when determining which rows are unique.
    _keep_all:
        Whether to keep all columns of the original data, not just *args.
    **kwargs:
        If specified, arguments passed to the verb mutate(), and then being used
        in distinct().

    See Also
    --------
    count : keep distinct rows, and count their number of observations.

    Examples
    --------
    >>> from siuba import _, distinct, select
    >>> from siuba.data import penguins

    >>> penguins >> distinct(_.species, _.island)
         species     island
    0     Adelie  Torgersen
    1     Adelie     Biscoe
    2     Adelie      Dream
    3     Gentoo     Biscoe
    4  Chinstrap      Dream

    Use _keep_all=True to keep all columns in each distinct row. This lets you
    peek at the values of the first unique row.

    >>> small_penguins = penguins >> select(_[:4])
    >>> small_penguins >> distinct(_.species, _keep_all = True)
         species     island  bill_length_mm  bill_depth_mm
    0     Adelie  Torgersen            39.1           18.7
    1     Gentoo     Biscoe            46.1           13.2
    2  Chinstrap      Dream            46.5           17.9

    """
    if not (args or kwargs):
        return __data.drop_duplicates().reset_index(drop=True)

    new_names, df_res = _mutate_cols(__data, args, kwargs)
    tmp_data = df_res.drop_duplicates(new_names).reset_index(drop=True)

    if not _keep_all:
        return tmp_data[new_names]

    return tmp_data
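
The function body reduces to a handful of pandas calls. The sketch below replays those steps on a toy DataFrame, assuming pandas is installed. Here `assign` is only a stand-in for siuba's internal `_mutate_cols` helper, whose exact behavior is not shown on this page, and the data and the `initial` column are made up for illustration.

```python
import pandas as pd

# Toy data, made up for illustration (hypothetical values).
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Adelie"],
    "island": ["Torgersen", "Biscoe", "Biscoe", "Torgersen"],
    "mass": [1, 2, 3, 4],
})

# No columns given: deduplicate on every column, then renumber the index.
all_cols = df.drop_duplicates().reset_index(drop=True)

# Columns given: deduplicate on just those columns (the first match is kept).
new_names = ["species", "island"]
tmp_data = df.drop_duplicates(subset=new_names).reset_index(drop=True)

# _keep_all=False: return only the columns used for deduplication.
keys_only = tmp_data[new_names]
# _keep_all=True: return the deduplicated frame with every column.
keep_all = tmp_data

# A keyword argument behaves like a mutate step: create the column first,
# then deduplicate on it (`assign` is a stand-in, not siuba's own helper).
with_initial = df.assign(initial=df["species"].str[0])
by_initial = (
    with_initial.drop_duplicates(subset=["initial"])
    .reset_index(drop=True)[["initial"]]
)

print(keys_only)
print(by_initial)
```

Calling `reset_index(drop=True)` after `drop_duplicates` is what gives the results a fresh 0, 1, 2, ... index, as in the example output above, instead of the original row labels of the surviving rows.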