Skip to contents

The {dm} package offers functions to work with relational data models in R. A common task for multiple, separated tables that have a shared attribute is merging the data.

This document introduces you to the joining functions of {dm} and shows how to apply them using data from the {nycflights13} package.

Relational data models consist of multiple tables that are linked with foreign keys. They are the building blocks for joining tables. Read more about relational data models in the vignette “Introduction to Relational Data Models”.

First, we load the packages that we need:

Data: nycflights13

To explore filtering with {dm}, we’ll use the {nycflights13} data with its tables flights, planes, airlines and airports.

This dataset contains information about the 336,776 flights that departed from New York City in 2013, with 3,322 different planes and 1,458 airports involved. The data comes from the US Bureau of Transportation Statistics, and is documented in ?nycflights13.

First, we have to create a dm object from the {nycflights13} data. This is implemented with dm_nycflights13().

A data model object contains the data as well as metadata.

If you would like to create a dm from other tables, please look at ?dm and the function new_dm().

Joining a dm object

{dm} allows you to join two tables of a dm object based on a shared column. You can use all join functions that you know from the {dplyr} package. Currently {dplyr} supports four types of mutating joins, two types of filtering joins, and a nesting join. See ?dplyr::join for details.

How it works

A join is the combination of two tables based on shared information. In technical terms, we merge the tables that need to be directly connected by a foreign key relation.

The existing links can be inspected in two ways:

  1. Visually, by drawing the data model with dm_draw()
%0 airlines airlinescarrierairports airportsfaaflights flightscarriertailnumoriginorigin, time_hourflights:carrier->airlines:carrier flights:origin->airports:faa planes planestailnumflights:tailnum->planes:tailnum weather weatherorigin, time_hourflights:origin, time_hour->weather:origin, time_hour

The directed arrows show explicitly the relation between different columns.

  1. Printed to the console by calling dm_get_all_fks()
#> # A tibble: 4 × 5
#>   child_table child_fk_cols     parent_table parent_key_cols   on_delete
#>   <chr>       <keys>            <chr>        <keys>            <chr>    
#> 1 flights     carrier           airlines     carrier           no_action
#> 2 flights     origin            airports     faa               no_action
#> 3 flights     tailnum           planes       tailnum           no_action
#> 4 flights     origin, time_hour weather      origin, time_hour no_action

Joining Examples

Let’s look at some examples:

Add a column with airline names from the airlines table to the flights table.

dm_joined <-
  dm %>%
  dm_flatten_to_tbl(flights, airlines, .join = left_join)
dm_joined
#> # A tibble: 1,761 × 20
#>     year month   day dep_time sched_dep_t…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
#>    <int> <int> <int>    <int>         <int>   <dbl>   <int>   <int>   <dbl>
#>  1  2013     1    10        3          2359       4     426     437     -11
#>  2  2013     1    10       16          2359      17     447     444       3
#>  3  2013     1    10      450           500     -10     634     648     -14
#>  4  2013     1    10      520           525      -5     813     820      -7
#>  5  2013     1    10      530           530       0     824     829      -5
#>  6  2013     1    10      531           540      -9     832     850     -18
#>  7  2013     1    10      535           540      -5    1015    1017      -2
#>  8  2013     1    10      546           600     -14     645     709     -24
#>  9  2013     1    10      549           600     -11     652     724     -32
#> 10  2013     1    10      550           600     -10     649     703     -14
#> # … with 1,751 more rows, 11 more variables: carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
#> #   name <chr>, and abbreviated variable names ¹​sched_dep_time,
#> #   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

As you can see below, the dm_joined data frame has one more column than the flights table. The difference is the name column from the airlines table.

dm$flights %>%
  names()
#>  [1] "year"           "month"          "day"            "dep_time"      
#>  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
#>  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
#> [13] "origin"         "dest"           "air_time"       "distance"      
#> [17] "hour"           "minute"         "time_hour"

dm$airlines %>%
  names()
#> [1] "carrier" "name"

dm_joined %>%
  names()
#>  [1] "year"           "month"          "day"            "dep_time"      
#>  [5] "sched_dep_time" "dep_delay"      "arr_time"       "sched_arr_time"
#>  [9] "arr_delay"      "carrier"        "flight"         "tailnum"       
#> [13] "origin"         "dest"           "air_time"       "distance"      
#> [17] "hour"           "minute"         "time_hour"      "name"

The result is not a dm object anymore, but a (tibble) data frame:

dm_joined %>%
  class()
#> [1] "tbl_df"     "tbl"        "data.frame"

Another example:

Get all flights that can’t be matched with airlines names.

We expect the flights data from {nycflights13} package to be clean and well organized, so no flights should remain. You can check this with an anti_join:

dm %>%
  dm_flatten_to_tbl(flights, airlines, .join = anti_join)
#> # A tibble: 0 × 19
#> # … with 19 variables: year <int>, month <int>, day <int>, dep_time <int>,
#> #   sched_dep_time <int>, dep_delay <dbl>, arr_time <int>,
#> #   sched_arr_time <int>, arr_delay <dbl>, carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

An example with filtering on a dm and then merging:

Get all May 2013 flights from Delta Air Lines which didn’t depart from John F. Kennedy International Airport in - and join all the airports data into the flights table.

dm_nycflights13(subset = FALSE) %>%
  dm_filter(
    airlines = (name == "Delta Air Lines Inc."),
    airports = (name != "John F Kennedy Intl"),
    flights = (month == 5)
  ) %>% 
  dm_flatten_to_tbl(flights, airports, .join = left_join)
#> # A tibble: 2,340 × 26
#>     year month   day dep_time sched_dep_t…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
#>    <int> <int> <int>    <int>         <int>   <dbl>   <int>   <int>   <dbl>
#>  1  2013     5     1      554           600      -6     731     756     -25
#>  2  2013     5     1      555           600      -5     819     827      -8
#>  3  2013     5     1      603           610      -7     754     809     -15
#>  4  2013     5     1      622           630      -8     848     844       4
#>  5  2013     5     1      654           700      -6     931     950     -19
#>  6  2013     5     1      655           700      -5     944    1007     -23
#>  7  2013     5     1      656           705      -9    1005    1011      -6
#>  8  2013     5     1      658           700      -2     925     928      -3
#>  9  2013     5     1      743           745      -2    1014    1004      10
#> 10  2013     5     1      755           800      -5     929    1001     -32
#> # … with 2,330 more rows, 17 more variables: carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
#> #   name <chr>, lat <dbl>, lon <dbl>, alt <dbl>, tz <dbl>, dst <chr>,
#> #   tzone <chr>, and abbreviated variable names ¹​sched_dep_time,
#> #   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

See vignette("tech-dm-filter") for more details on filtering.

A last example:

Merge all tables into one big table.

Sometimes you need everything in one place. In this case, you can use the dm_flatten_to_tbl() function. It joins all the tables in your dm object together into one wide table. All you have to do is to specify the starting table. The following joins are determined by the foreign key links.

dm_nycflights13() %>%
  dm_select_tbl(-weather) %>%
  dm_flatten_to_tbl(.start = flights)
#> Renaming ambiguous columns: %>%
#>   dm_rename(flights, year.flights = year) %>%
#>   dm_rename(airlines, name.airlines = name) %>%
#>   dm_rename(airports, name.airports = name) %>%
#>   dm_rename(planes, year.planes = year)
#> # A tibble: 1,761 × 35
#>    year.flights month   day dep_t…¹ sched…² dep_d…³ arr_t…⁴ sched…⁵ arr_d…⁶
#>           <int> <int> <int>   <int>   <int>   <dbl>   <int>   <int>   <dbl>
#>  1         2013     1    10       3    2359       4     426     437     -11
#>  2         2013     1    10      16    2359      17     447     444       3
#>  3         2013     1    10     450     500     -10     634     648     -14
#>  4         2013     1    10     520     525      -5     813     820      -7
#>  5         2013     1    10     530     530       0     824     829      -5
#>  6         2013     1    10     531     540      -9     832     850     -18
#>  7         2013     1    10     535     540      -5    1015    1017      -2
#>  8         2013     1    10     546     600     -14     645     709     -24
#>  9         2013     1    10     549     600     -11     652     724     -32
#> 10         2013     1    10     550     600     -10     649     703     -14
#> # … with 1,751 more rows, 26 more variables: carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
#> #   name.airlines <chr>, name.airports <chr>, lat <dbl>, lon <dbl>,
#> #   alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>, year.planes <int>,
#> #   type <chr>, manufacturer <chr>, model <chr>, engines <int>,
#> #   seats <int>, speed <int>, engine <chr>, and abbreviated variable …

To be more precise, dm_flatten_to_tbl() will join all tables from one level of hierarchy (i.e., direct neighbors to table .start). If you want to cover tables from all levels of hierarchy, use the argument recursive = TRUE for dm_flatten_to_tbl() instead.

Also, be aware that all column names need to be unique. The dm_flatten_to_tbl() takes care of this by automatically renaming the relevant columns and informs the user if any names were changed, e.g. dm_rename(airlines, airlines.name = name).

If you want to merge all tables, but get a nested table in return, use dm_wrap_tbl() with pull_tbl() instead:

dm_nycflights13() %>%
  dm_wrap_tbl(root = flights) %>%
  pull_tbl(flights)
#> # A tibble: 1,761 × 23
#>     year month   day dep_time sched_dep_t…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
#>    <int> <int> <int>    <int>         <int>   <dbl>   <int>   <int>   <dbl>
#>  1  2013     1    10        3          2359       4     426     437     -11
#>  2  2013     1    10       16          2359      17     447     444       3
#>  3  2013     1    10      450           500     -10     634     648     -14
#>  4  2013     1    10      520           525      -5     813     820      -7
#>  5  2013     1    10      530           530       0     824     829      -5
#>  6  2013     1    10      531           540      -9     832     850     -18
#>  7  2013     1    10      535           540      -5    1015    1017      -2
#>  8  2013     1    10      546           600     -14     645     709     -24
#>  9  2013     1    10      549           600     -11     652     724     -32
#> 10  2013     1    10      550           600     -10     649     703     -14
#> # … with 1,751 more rows, 14 more variables: carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
#> #   airlines <packed[,1]>, airports <packed[,7]>, planes <packed[,8]>,
#> #   weather <packed[,13]>, and abbreviated variable names ¹​sched_dep_time,
#> #   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay