Joining in relational data models
Katharina Brunner
2024-09-24
Source:vignettes/tech-dm-join.Rmd
tech-dm-join.Rmd
The {dm} package offers functions to work with relational data models in R. A common task for multiple, separated tables that have a shared attribute is merging the data.
This document introduces you to the joining functions of {dm} and shows how to apply them using data from the {nycflights13} package.
Relational data models consist of multiple tables that are linked with foreign keys. They are the building blocks for joining tables. Read more about relational data models in the vignette “Introduction to Relational Data Models”.
First, we load the packages that we need:
Data: nycflights13
To explore filtering with {dm}, we’ll use the {nycflights13} data
with its tables flights
, planes
,
airlines
and airports
.
This dataset contains information about the 336,776 flights that
departed from New York City in 2013, with 3,322 different planes and
1,458 airports involved. The data comes from the US Bureau of
Transportation Statistics, and is documented in
?nycflights13
.
First, we have to create a dm
object from the
{nycflights13} data. This is implemented with
dm_nycflights13()
.
A data model object contains the data as well as metadata.
If you would like to create a dm
from other tables,
please look at ?dm
and the function
new_dm()
.
dm <- dm_nycflights13()
Joining a dm
object
{dm} allows you to join two tables of a dm
object based
on a shared column. You can use all join functions that you know from
the {dplyr} package. Currently {dplyr} supports four types of mutating
joins, two types of filtering joins, and a nesting join. See
?dplyr::join
for details.
How it works
A join is the combination of two tables based on shared information. In technical terms, we merge the tables that need to be directly connected by a foreign key relation.
The existing links can be inspected in two ways:
- Visually, by drawing the data model with
dm_draw()
The directed arrows show explicitly the relation between different columns.
- Printed to the console by calling
dm_get_all_fks()
dm %>%
dm_get_all_fks()
#> # A tibble: 4 × 5
#> child_table child_fk_cols parent_table parent_key_cols on_delete
#> <chr> <keys> <chr> <keys> <chr>
#> 1 flights carrier airlines carrier no_action
#> 2 flights origin airports faa no_action
#> 3 flights tailnum planes tailnum no_action
#> 4 flights origin, time_hour weather origin, time_hour no_action
Joining Examples
Let’s look at some examples:
Add a column with airline names from the
airlines
table to the flights
table.
dm_joined <-
dm %>%
dm_flatten_to_tbl(flights, airlines, .join = left_join)
dm_joined
#> # A tibble: 1,761 × 20
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 10 3 2359 4 426
#> 2 2013 1 10 16 2359 17 447
#> 3 2013 1 10 450 500 -10 634
#> 4 2013 1 10 520 525 -5 813
#> 5 2013 1 10 530 530 0 824
#> 6 2013 1 10 531 540 -9 832
#> 7 2013 1 10 535 540 -5 1015
#> 8 2013 1 10 546 600 -14 645
#> 9 2013 1 10 549 600 -11 652
#> 10 2013 1 10 550 600 -10 649
#> # ℹ 1,751 more rows
#> # ℹ 13 more variables: sched_arr_time <int>, arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> # time_hour <dttm>, name <chr>
As you can see below, the dm_joined
data frame has one
more column than the flights
table. The difference is the
name
column from the airlines
table.
dm$flights %>%
names()
#> [1] "year" "month" "day" "dep_time"
#> [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
#> [9] "arr_delay" "carrier" "flight" "tailnum"
#> [13] "origin" "dest" "air_time" "distance"
#> [17] "hour" "minute" "time_hour"
dm$airlines %>%
names()
#> [1] "carrier" "name"
dm_joined %>%
names()
#> [1] "year" "month" "day" "dep_time"
#> [5] "sched_dep_time" "dep_delay" "arr_time" "sched_arr_time"
#> [9] "arr_delay" "carrier" "flight" "tailnum"
#> [13] "origin" "dest" "air_time" "distance"
#> [17] "hour" "minute" "time_hour" "name"
The result is not a dm
object anymore, but a (tibble)
data frame:
Another example:
Get all flights that can’t be matched with airlines names.
We expect the flights
data from {nycflights13} package
to be clean and well organized, so no flights should remain. You can
check this with an anti_join
:
dm %>%
dm_flatten_to_tbl(flights, airlines, .join = anti_join)
#> # A tibble: 0 × 19
#> # ℹ 19 variables: year <int>, month <int>, day <int>, dep_time <int>,
#> # sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
#> # sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
#> # tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
An example with filtering on a dm
and then merging:
Get all May 2013 flights from Delta Air Lines which didn’t
depart from John F. Kennedy International Airport in - and join all the
airports data into the flights
table.
dm_nycflights13(subset = FALSE) %>%
dm_filter(
airlines = (name == "Delta Air Lines Inc."),
airports = (name != "John F Kennedy Intl"),
flights = (month == 5)
) %>%
dm_flatten_to_tbl(flights, airports, .join = left_join)
#> # A tibble: 2,340 × 26
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 5 1 554 600 -6 731
#> 2 2013 5 1 555 600 -5 819
#> 3 2013 5 1 603 610 -7 754
#> 4 2013 5 1 622 630 -8 848
#> 5 2013 5 1 654 700 -6 931
#> 6 2013 5 1 655 700 -5 944
#> 7 2013 5 1 656 705 -9 1005
#> 8 2013 5 1 658 700 -2 925
#> 9 2013 5 1 743 745 -2 1014
#> 10 2013 5 1 755 800 -5 929
#> # ℹ 2,330 more rows
#> # ℹ 19 more variables: sched_arr_time <int>, arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> # time_hour <dttm>, name <chr>, lat <dbl>, lon <dbl>, alt <dbl>,
#> # tz <dbl>, dst <chr>, tzone <chr>
See vignette("tech-dm-filter")
for more details on
filtering.
A last example:
Merge all tables into one big table.
Sometimes you need everything in one place. In this case, you can use
the dm_flatten_to_tbl()
function. It joins all the tables
in your dm
object together into one wide table. All you
have to do is to specify the starting table. The following joins are
determined by the foreign key links.
dm_nycflights13() %>%
dm_select_tbl(-weather) %>%
dm_flatten_to_tbl(.start = flights)
#> Renaming ambiguous columns: %>%
#> dm_rename(flights, year.flights = year) %>%
#> dm_rename(airlines, name.airlines = name) %>%
#> dm_rename(airports, name.airports = name) %>%
#> dm_rename(planes, year.planes = year)
#> # A tibble: 1,761 × 35
#> year.flights month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 10 3 2359 4 426
#> 2 2013 1 10 16 2359 17 447
#> 3 2013 1 10 450 500 -10 634
#> 4 2013 1 10 520 525 -5 813
#> 5 2013 1 10 530 530 0 824
#> 6 2013 1 10 531 540 -9 832
#> 7 2013 1 10 535 540 -5 1015
#> 8 2013 1 10 546 600 -14 645
#> 9 2013 1 10 549 600 -11 652
#> 10 2013 1 10 550 600 -10 649
#> # ℹ 1,751 more rows
#> # ℹ 28 more variables: sched_arr_time <int>, arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> # time_hour <dttm>, name.airlines <chr>, name.airports <chr>, lat <dbl>,
#> # lon <dbl>, alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>,
#> # year.planes <int>, type <chr>, manufacturer <chr>, model <chr>, …
To be more precise, dm_flatten_to_tbl()
will join all
tables from one level of hierarchy (i.e., direct neighbors to table
.start
). If you want to cover tables from all levels of
hierarchy, use the argument recursive = TRUE
for
dm_flatten_to_tbl()
instead.
Also, be aware that all column names need to be unique. The
dm_flatten_to_tbl()
takes care of this by automatically
renaming the relevant columns and informs the user if any names were
changed,
e.g. dm_rename(airlines, airlines.name = name)
.
If you want to merge all tables, but get a nested table in return,
use dm_wrap_tbl()
with pull_tbl()
instead:
dm_nycflights13() %>%
dm_wrap_tbl(root = flights) %>%
pull_tbl(flights)
#> # A tibble: 1,761 × 23
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 10 3 2359 4 426
#> 2 2013 1 10 16 2359 17 447
#> 3 2013 1 10 450 500 -10 634
#> 4 2013 1 10 520 525 -5 813
#> 5 2013 1 10 530 530 0 824
#> 6 2013 1 10 531 540 -9 832
#> 7 2013 1 10 535 540 -5 1015
#> 8 2013 1 10 546 600 -14 645
#> 9 2013 1 10 549 600 -11 652
#> 10 2013 1 10 550 600 -10 649
#> # ℹ 1,751 more rows
#> # ℹ 16 more variables: sched_arr_time <int>, arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> # time_hour <dttm>, airlines <packed[,1]>, airports <packed[,7]>,
#> # planes <packed[,8]>, weather <packed[,13]>