distinct

distinct(__data, *args, _keep_all=False, **kwargs)

Keep only distinct (unique) rows from a table.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| __data | | The input data. | required |
| *args | | Columns to use when determining which rows are unique. | () |
| _keep_all | | Whether to keep all columns of the original data, not just *args. | False |
| **kwargs | | If specified, passed to the verb mutate() to create columns, which are then used in distinct(). | {} |

Examples:

>>> from siuba import _, distinct, select
>>> from siuba.data import penguins
>>> penguins >> distinct(_.species, _.island)
     species     island
0     Adelie  Torgersen
1     Adelie     Biscoe
2     Adelie      Dream
3     Gentoo     Biscoe
4  Chinstrap      Dream

Use _keep_all=True to keep all columns in each distinct row. This lets you peek at the values of the first row for each unique value.

>>> small_penguins = penguins >> select(_[:4])
>>> small_penguins >> distinct(_.species, _keep_all = True)
     species     island  bill_length_mm  bill_depth_mm
0     Adelie  Torgersen            39.1           18.7
1     Gentoo     Biscoe            46.1           13.2
2  Chinstrap      Dream            46.5           17.9
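
The **kwargs parameter has no example above. As a rough sketch of the behavior the parameter description implies (compute the column as mutate() would, then deduplicate on it), here is the same idea in plain pandas. The toy data and the column name long_bill are hypothetical, and this is an illustration of the described behavior rather than siuba's own code path.

```python
import pandas as pd

# Hypothetical toy data; not the real penguins table.
df = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Gentoo"],
    "bill_length_mm": [39.1, 46.1, 41.0, 38.5],
})

# Step 1 (the mutate() part): compute the new column from an expression.
with_col = df.assign(long_bill=df["bill_length_mm"] > 40)

# Step 2 (the distinct part): keep the first row for each unique value
# of the new column, then return only that column (_keep_all=False).
result = (
    with_col
    .drop_duplicates(["long_bill"])
    .reset_index(drop=True)[["long_bill"]]
)

print(result)
```

In siuba syntax, the corresponding call would presumably look like distinct(long_bill = _.bill_length_mm > 40), with the keyword name becoming the output column.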

Source code in siuba/dply/verbs.py
@singledispatch2(DataFrame)
def distinct(__data, *args, _keep_all = False, **kwargs):
    """Keep only distinct (unique) rows from a table.

    Parameters
    ----------
    __data:
        The input data.
    *args:
        Columns to use when determining which rows are unique.
    _keep_all:
        Whether to keep all columns of the original data, not just *args.
    **kwargs:
        If specified, passed to the verb mutate() to create columns, which are
        then used in distinct().

    See Also
    --------
    count : keep distinct rows, and count their number of observations.

    Examples
    --------
    >>> from siuba import _, distinct, select
    >>> from siuba.data import penguins

    >>> penguins >> distinct(_.species, _.island)
         species     island
    0     Adelie  Torgersen
    1     Adelie     Biscoe
    2     Adelie      Dream
    3     Gentoo     Biscoe
    4  Chinstrap      Dream

    Use _keep_all=True to keep all columns in each distinct row. This lets you
    peek at the values of the first row for each unique value.

    >>> small_penguins = penguins >> select(_[:4])
    >>> small_penguins >> distinct(_.species, _keep_all = True)
         species     island  bill_length_mm  bill_depth_mm
    0     Adelie  Torgersen            39.1           18.7
    1     Gentoo     Biscoe            46.1           13.2
    2  Chinstrap      Dream            46.5           17.9
    """

    if not (args or kwargs):
        return __data.drop_duplicates().reset_index(drop=True)

    new_names, df_res = _mutate_cols(__data, args, kwargs)
    tmp_data = df_res.drop_duplicates(new_names).reset_index(drop=True)

    if not _keep_all:
        return tmp_data[new_names]

    return tmp_data
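
The function body above relies on pandas drop_duplicates, which keeps the first row for each unique combination of the given columns. The following minimal sketch reproduces the two return paths (_keep_all False and True) in plain pandas, assuming a small hand-made frame in place of penguins. The variable names are illustrative only.

```python
import pandas as pd

# Hypothetical toy data standing in for the penguins table.
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Adelie"],
    "island": ["Torgersen", "Biscoe", "Biscoe", "Torgersen"],
    "bill_length_mm": [39.1, 37.8, 46.1, 40.3],
})

cols = ["species", "island"]

# Keep the first row for each unique (species, island) pair and
# renumber the index from 0, as the source does with reset_index(drop=True).
deduped = df.drop_duplicates(cols).reset_index(drop=True)

# _keep_all=False: only the columns used to decide uniqueness are returned.
distinct_cols = deduped[cols]

# _keep_all=True: all columns of the first matching row are returned.
distinct_all = deduped

print(distinct_cols)
print(distinct_all)
```

Note that the last input row repeats (Adelie, Torgersen), so its bill_length_mm of 40.3 is dropped in favor of the first occurrence, 39.1.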