mutate joins (inner, left, full)
join(left, right, on=None, how=None, *args, *, by=None, **kwargs)
Join two tables together, by matching on specified columns.
The functions inner_join, left_join, right_join, and full_join are provided as wrappers around join, and are used in the examples.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
left |
The left-hand table. |
required | |
right |
The right-hand table. |
required | |
on |
How to match them. Note that the keyword "by" can also be used for this parameter, in order to support compatibility with dplyr. |
None |
|
how |
The type of join to perform (inner, full, left, right). |
None |
|
*args |
Additional postition arguments. Currently not supported. |
() |
|
**kwargs |
Additional keyword arguments. Currently not supported. |
{} |
Examples:
>>> from siuba import _, inner_join, left_join, full_join, right_join
>>> from siuba.data import band_members, band_instruments, band_instruments2
>>> band_members
name band
0 Mick Stones
1 John Beatles
2 Paul Beatles
>>> band_instruments
name plays
0 John guitar
1 Paul bass
2 Keith guitar
Notice that above, only John and Paul have entries for band instruments. This means that they will be the only two rows in the inner_join result:
>>> band_members >> inner_join(_, band_instruments)
name band plays
0 John Beatles guitar
1 Paul Beatles bass
A left join ensures all original rows of the left hand data are included.
>>> band_members >> left_join(_, band_instruments)
name band plays
0 Mick Stones NaN
1 John Beatles guitar
2 Paul Beatles bass
A full join is similar, but ensures all rows of both data are included.
>>> band_members >> full_join(_, band_instruments)
name band plays
0 Mick Stones NaN
1 John Beatles guitar
2 Paul Beatles bass
3 Keith NaN guitar
You can explicilty specify columns to join on using the "by" argument:
>>> band_members >> inner_join(_, band_instruments, by = "name")
n...
Use a dictionary for the by argument, to match up columns with different names:
>>> band_members >> full_join(_, band_instruments2, {"name": "artist"})
n...
Joins create a new row for each pair of matches. For example, the value 1 is in two rows on the left, and 2 rows on the right so 4 rows will be created.
>>> df1 = pd.DataFrame({"x": [1, 1, 3]})
>>> df2 = pd.DataFrame({"x": [1, 1, 2], "y": ["first", "second", "third"]})
>>> df1 >> left_join(_, df2)
x y
0 1 first
1 1 second
2 1 first
3 1 second
4 3 NaN
Missing values count as matches to eachother by default:
>>> df3 = pd.DataFrame({"x": [1, None], "y": 2})
>>> df4 = pd.DataFrame({"x": [1, None], "z": 3})
>>> left_join(df3, df4)
x y z
0 1.0 2 3
1 NaN 2 3
Returns:
Type | Description |
---|---|
pd.DataFrame |
Source code in siuba/dply/verbs.py
@singledispatch2((pd.DataFrame, DataFrameGroupBy))
@_bounce_groupby
def join(left, right, on = None, how = None, *args, by = None, **kwargs):
"""Join two tables together, by matching on specified columns.
The functions inner_join, left_join, right_join, and full_join are provided
as wrappers around join, and are used in the examples.
Parameters
----------
left :
The left-hand table.
right :
The right-hand table.
on :
How to match them. Note that the keyword "by" can also be used for this
parameter, in order to support compatibility with dplyr.
how :
The type of join to perform (inner, full, left, right).
*args:
Additional postition arguments. Currently not supported.
**kwargs:
Additional keyword arguments. Currently not supported.
Returns
-------
pd.DataFrame
Examples
--------
>>> from siuba import _, inner_join, left_join, full_join, right_join
>>> from siuba.data import band_members, band_instruments, band_instruments2
>>> band_members
name band
0 Mick Stones
1 John Beatles
2 Paul Beatles
>>> band_instruments
name plays
0 John guitar
1 Paul bass
2 Keith guitar
Notice that above, only John and Paul have entries for band instruments.
This means that they will be the only two rows in the inner_join result:
>>> band_members >> inner_join(_, band_instruments)
name band plays
0 John Beatles guitar
1 Paul Beatles bass
A left join ensures all original rows of the left hand data are included.
>>> band_members >> left_join(_, band_instruments)
name band plays
0 Mick Stones NaN
1 John Beatles guitar
2 Paul Beatles bass
A full join is similar, but ensures all rows of both data are included.
>>> band_members >> full_join(_, band_instruments)
name band plays
0 Mick Stones NaN
1 John Beatles guitar
2 Paul Beatles bass
3 Keith NaN guitar
You can explicilty specify columns to join on using the "by" argument:
>>> band_members >> inner_join(_, band_instruments, by = "name")
n...
Use a dictionary for the by argument, to match up columns with different names:
>>> band_members >> full_join(_, band_instruments2, {"name": "artist"})
n...
Joins create a new row for each pair of matches. For example, the value 1
is in two rows on the left, and 2 rows on the right so 4 rows will be created.
>>> df1 = pd.DataFrame({"x": [1, 1, 3]})
>>> df2 = pd.DataFrame({"x": [1, 1, 2], "y": ["first", "second", "third"]})
>>> df1 >> left_join(_, df2)
x y
0 1 first
1 1 second
2 1 first
3 1 second
4 3 NaN
Missing values count as matches to eachother by default:
>>> df3 = pd.DataFrame({"x": [1, None], "y": 2})
>>> df4 = pd.DataFrame({"x": [1, None], "z": 3})
>>> left_join(df3, df4)
x y z
0 1.0 2 3
1 NaN 2 3
"""
if isinstance(right, DataFrameGroupBy):
right = right.obj
if not isinstance(right, DataFrame):
raise Exception("right hand table must be a DataFrame")
if how is None:
raise Exception("Must specify how argument")
if len(args) or len(kwargs):
raise NotImplementedError("extra arguments to pandas join not currently supported")
if on is None and by is not None:
on = by
# pandas uses outer, but dplyr uses term full
if how == "full":
how = "outer"
if isinstance(on, Mapping):
left_on, right_on = zip(*on.items())
return left.merge(right, how = how, left_on = left_on, right_on = right_on)
return left.merge(right, how = how, on = on)