Siuba Reference

distinct

`distinct(__data, *args, _keep_all=False, **kwargs)`

Keep only distinct (unique) rows from a table.

Parameters:

Name	Type	Description	Default
`__data`		The input data.	required
`*args`		Columns to use when determining which rows are unique.	`()`
`_keep_all`		Whether to keep all columns of the original data, not just *args.	`False`
`**kwargs`		If specified, arguments are passed to the verb mutate(), and the resulting columns are then used in distinct().	`{}`
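
The `**kwargs` behavior can be easier to follow as plain pandas. The following is a minimal sketch, assuming a toy data frame and a hypothetical column name `is_big` (neither comes from the siuba docs). It mirrors the steps visible in the source listing on this page (compute the new column, drop duplicates on it, keep only that column) and is not siuba's own implementation.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame({"x": [1, 2, 3, 4], "g": ["a", "a", "b", "b"]})

# Roughly what distinct(df, is_big=_.x > 2) is expected to do:
# 1) compute the new column (the mutate step),
with_col = df.assign(is_big=df["x"] > 2)

# 2) drop duplicate values of that column and reset the index,
deduped = with_col.drop_duplicates(["is_big"]).reset_index(drop=True)

# 3) keep only the named column, since _keep_all defaults to False.
result = deduped[["is_big"]]
print(result)
```

With `_keep_all=True`, step 3 would be skipped and `deduped` would be returned with all of its columns.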

Examples:

>>> from siuba import _, distinct, select
>>> from siuba.data import penguins

>>> penguins >> distinct(_.species, _.island)
     species     island
0     Adelie  Torgersen
1     Adelie     Biscoe
2     Adelie      Dream
3     Gentoo     Biscoe
4  Chinstrap      Dream

Use _keep_all=True to keep all columns in each distinct row. This lets you peek at the values of the first row for each unique combination.

>>> small_penguins = penguins >> select(_[:4])
>>> small_penguins >> distinct(_.species, _keep_all = True)
     species     island  bill_length_mm  bill_depth_mm
0     Adelie  Torgersen            39.1           18.7
1     Gentoo     Biscoe            46.1           13.2
2  Chinstrap      Dream            46.5           17.9
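
Which row counts as "first" depends on the row order of the input. The sketch below uses plain pandas on a small made-up table (an assumption for illustration, not the full penguins data) to show that `drop_duplicates`, which the source listing on this page relies on, keeps the first occurrence by default, so sorting beforehand changes which row survives.

```python
import pandas as pd

# Small illustrative table; values are assumptions, not the full dataset.
df = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Gentoo", "Gentoo"],
    "bill_length_mm": [39.1, 39.5, 46.1, 50.0],
})

# Default keep="first": the earliest row for each species is retained.
first_rows = df.drop_duplicates(["species"]).reset_index(drop=True)

# Sorting first changes which row is "first": here, the longest bill per species.
by_longest = (
    df.sort_values("bill_length_mm", ascending=False)
      .drop_duplicates(["species"])
      .reset_index(drop=True)
)

print(first_rows)
print(by_longest)
```

In siuba terms, this suggests piping through arrange() before distinct(..., _keep_all=True) when a specific representative row is wanted.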

Source code in siuba/dply/verbs.py

@singledispatch2(DataFrame)
def distinct(__data, *args, _keep_all = False, **kwargs):
    """Keep only distinct (unique) rows from a table.

    Parameters
    ----------
    __data:
        The input data.
    *args:
        Columns to use when determining which rows are unique.
    _keep_all:
        Whether to keep all columns of the original data, not just *args.
    **kwargs:
        If specified, arguments are passed to the verb mutate(), and the
        resulting columns are then used in distinct().

    See Also
    --------
    count : keep distinct rows, and count their number of observations.

    Examples
    --------
    >>> from siuba import _, distinct, select
    >>> from siuba.data import penguins

    >>> penguins >> distinct(_.species, _.island)
         species     island
    0     Adelie  Torgersen
    1     Adelie     Biscoe
    2     Adelie      Dream
    3     Gentoo     Biscoe
    4  Chinstrap      Dream

    Use _keep_all=True to keep all columns in each distinct row. This lets you
    peek at the values of the first row for each unique combination.

    >>> small_penguins = penguins >> select(_[:4])
    >>> small_penguins >> distinct(_.species, _keep_all = True)
         species     island  bill_length_mm  bill_depth_mm
    0     Adelie  Torgersen            39.1           18.7
    1     Gentoo     Biscoe            46.1           13.2
    2  Chinstrap      Dream            46.5           17.9
    """

    if not (args or kwargs):
        return __data.drop_duplicates().reset_index(drop=True)

    new_names, df_res = _mutate_cols(__data, args, kwargs)
    tmp_data = df_res.drop_duplicates(new_names).reset_index(drop=True)

    if not _keep_all:
        return tmp_data[new_names]

    return tmp_data
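
The first branch of the function handles the case where no columns are given: whole rows are compared and the index is renumbered. A minimal pandas-only sketch of that branch, using a made-up data frame (an assumption for illustration):

```python
import pandas as pd

# Hypothetical data with one fully duplicated row (rows 0 and 1).
df = pd.DataFrame({"a": [1, 1, 2, 1], "b": ["x", "x", "y", "z"]})

# Same calls as the no-argument branch: compare entire rows, then
# renumber the index so it runs 0..n-1 without gaps.
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)
```

Note that the row `(1, "z")` is kept: rows are only dropped when every column matches an earlier row.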