Manipulating individual tables

This vignette deals with situations where you want to transform tables of your dm object and then update an existing table or add a new table to the dm object. There are two approaches:

extract the tables relevant to the calculation, perform the necessary transformations, and (if needed) recombine the resulting table into a dm,
do all this within the dm object by zooming to a table and manipulating it.

Both approaches aim at maintaining the key relations whenever possible. We will explore the first approach here. For the second approach, see vignette("tech-dm-zoom").

Enabling {dplyr}-workflow within a `dm`

The dm_get_tables() and pull_tbl() functions have a new experimental argument keyed, which defaults to FALSE. If set to TRUE, a list of objects of class dm_keyed_tbl is returned instead. Because dm_keyed_tbl inherits from tbl or tbl_lazy, many {dplyr} and {tidyr} verbs will work unchanged. These objects will also attempt to track primary and foreign keys, so that they are available for joins and when recombining these tables later into a dm object.

When you are finished with transforming your data, you can use dm() or new_dm() to recombine the tables into a dm object. The resulting tables in the dm will have all the primary and foreign keys available that could be tracked from the original table. Reconstructing the dm object is not strictly necessary if you’re primarily interested in deriving one or multiple separate tables for analysis.

If this workflow proves as useful as it seems, subsetting tables via $, [[ will default to keyed = TRUE in a forthcoming major release of {dm}.

Examples

So much for the theory, but how does it look and feel? To explore this, we once more make use of our trusted {nycflights13} data.

Use case 1: Add a new column to an existing table

Imagine you want to have a column in flights, specifying if a flight left before noon or after. Just like with {dplyr}, we can tackle this with mutate(). Let us do this step by step:

library(dm)
library(dplyr)

flights_dm <- dm_nycflights13(cycle = TRUE)
flights_dm

#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 4
#> Foreign keys: 5

flights_keyed <-
  flights_dm %>%
  dm_get_tables(keyed = TRUE)

# The print output for a `dm_keyed_tbl` looks very much like that from a normal
# `tibble`, with additional details about keys.
flights_keyed$flights

#> # A tibble: 1,761 × 19
#> # Keys:     — | 0 | 5
#>     year month   day dep_time sched_dep_time dep_delay arr_time
#>    <int> <int> <int>    <int>          <int>     <dbl>    <int>
#>  1  2013     1    10        3           2359         4      426
#>  2  2013     1    10       16           2359        17      447
#>  3  2013     1    10      450            500       -10      634
#>  4  2013     1    10      520            525        -5      813
#>  5  2013     1    10      530            530         0      824
#>  6  2013     1    10      531            540        -9      832
#>  7  2013     1    10      535            540        -5     1015
#>  8  2013     1    10      546            600       -14      645
#>  9  2013     1    10      549            600       -11      652
#> 10  2013     1    10      550            600       -10      649
#> # ℹ 1,751 more rows
#> # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>

flights_tbl_mutate <-
  flights_keyed$flights %>%
  mutate(am_pm_dep = if_else(dep_time < 1200, "am", "pm"), .after = dep_time)

flights_tbl_mutate

#> # A tibble: 1,761 × 20
#> # Keys:     — | 0 | 5
#>     year month   day dep_time am_pm_dep sched_dep_time dep_delay arr_time
#>    <int> <int> <int>    <int> <chr>              <int>     <dbl>    <int>
#>  1  2013     1    10        3 am                  2359         4      426
#>  2  2013     1    10       16 am                  2359        17      447
#>  3  2013     1    10      450 am                   500       -10      634
#>  4  2013     1    10      520 am                   525        -5      813
#>  5  2013     1    10      530 am                   530         0      824
#>  6  2013     1    10      531 am                   540        -9      832
#>  7  2013     1    10      535 am                   540        -5     1015
#>  8  2013     1    10      546 am                   600       -14      645
#>  9  2013     1    10      549 am                   600       -11      652
#> 10  2013     1    10      550 am                   600       -10      649
#> # ℹ 1,751 more rows
#> # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>

To update the original dm with a new flights table we use dm(). The bang-bang-bang (!!!) is a technical necessity that will become superfluous in a forthcoming release.

updated_flights_dm <- dm(
  flights = flights_tbl_mutate,
  !!!flights_keyed[c("airlines", "airports", "planes", "weather")]
)

# The only difference in the `dm` print output is the increased number of
# columns
updated_flights_dm

#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `flights`, `airlines`, `airports`, `planes`, `weather`
#> Columns: 54
#> Primary keys: 4
#> Foreign keys: 5

# The schematic view of the data model remains unchanged
dm_draw(updated_flights_dm)

Use case 2: Creation of a surrogate key

The same course of action could, for example, be employed to create a surrogate key for a table, a synthetic simple key that replaces a compound key. We can do this for the weather table.

library(tidyr)

flights_keyed$weather

#> # A tibble: 144 × 15
#> # Keys:     `origin`, `time_hour` | 1 | 0
#>    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
#>    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
#>  1 EWR     2013     1    10     0  41    32    70.1      230       8.06
#>  2 EWR     2013     1    10     1  39.0  30.0  69.9      210       9.21
#>  3 EWR     2013     1    10     2  39.0  28.9  66.8      230       6.90
#>  4 EWR     2013     1    10     3  39.9  27.0  59.5      270       5.75
#>  5 EWR     2013     1    10     4  41    26.1  55.0      320       6.90
#>  6 EWR     2013     1    10     5  41    26.1  55.0      300      12.7 
#>  7 EWR     2013     1    10     6  39.9  25.0  54.8      280       6.90
#>  8 EWR     2013     1    10     7  41    25.0  52.6      330       6.90
#>  9 EWR     2013     1    10     8  43.0  25.0  48.7      330       8.06
#> 10 EWR     2013     1    10     9  45.0  23    41.6      320      17.3 
#> # ℹ 134 more rows
#> # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
#> #   visib <dbl>, time_hour <dttm>


# Maybe there is some hidden candidate for a primary key that we overlooked?
enum_pk_candidates(flights_keyed$weather)

#> # A tibble: 15 × 3
#>    columns    candidate why                                                
#>    <keys>     <lgl>     <chr>                                              
#>  1 origin     FALSE     has duplicate values: EWR (48), JFK (48), LGA (48) 
#>  2 year       FALSE     has duplicate values: 2013 (144)                   
#>  3 month      FALSE     has duplicate values: 1 (72), 2 (72)               
#>  4 day        FALSE     has duplicate values: 10 (144)                     
#>  5 hour       FALSE     has duplicate values: 0 (6), 1 (6), 2 (6), 3 (6), …
#>  6 temp       FALSE     has duplicate values: 44.06 (12), 41.00 (8), 44.96…
#>  7 dewp       FALSE     has duplicate values: 21.92 (16), 24.98 (16), 6.98…
#>  8 humid      FALSE     has duplicate values: 53.71 (4), 56.56 (4), 32.53 …
#>  9 wind_dir   FALSE     has duplicate values: 320 (25), 330 (17), 310 (15)…
#> 10 wind_speed FALSE     has duplicate values: 6.90468 (20), 8.05546 (19), …
#> 11 wind_gust  FALSE     has 123 missing values, and duplicate values: 23.0…
#> 12 precip     FALSE     has duplicate values: 0 (144)                      
#> 13 pressure   FALSE     has duplicate values: 1028.9 (6), 1029.0 (5), 1032…
#> 14 visib      FALSE     has duplicate values: 10 (144)                     
#> 15 time_hour  FALSE     has duplicate values: 2013-01-10 00:00:00 (3), 201…

# Seems we have to construct a column with unique values
# This can be done by combining column `origin` with `time_hour`, if the latter
# is converted to a single time zone first; all within the `dm`:
weather_tbl_mutate <-
  flights_keyed$weather %>%
  # first convert all times to the same time zone:
  mutate(time_hour_fmt = format(time_hour, tz = "UTC")) %>%
  # paste together as character the airport code and the time
  unite("origin_slot_id", origin, time_hour_fmt) %>%
  select(origin_slot_id, everything())

# check if we the result is as expected:
weather_tbl_mutate %>%
  enum_pk_candidates() %>%
  filter(candidate)

#> # A tibble: 1 × 3
#>   columns        candidate why  
#>   <keys>         <lgl>     <chr>
#> 1 origin_slot_id TRUE      ""

# We apply the same transformation to create
# the foreign key in the flights table:
flights_tbl_mutate <-
  flights_keyed$flights %>%
  mutate(time_hour_fmt = format(time_hour, tz = "UTC")) %>%
  unite("origin_slot_id", origin, time_hour_fmt) %>%
  select(origin_slot_id, everything())

surrogate_flights_dm <-
  dm(
    weather = weather_tbl_mutate,
    flights = flights_tbl_mutate,
    !!!flights_keyed[c("airlines", "airports", "planes")]
  ) %>%
  dm_add_pk(weather, origin_slot_id) %>%
  dm_add_fk(flights, origin_slot_id, weather)

surrogate_flights_dm %>%
  dm_draw()

Use case 3: Disentangle `dm`

If you look at the dm created by dm_nycflights13(cycle = TRUE), you see that two columns of flights relate to the same table, airports. One column stands for the departure airport and the other for the arrival airport. This generates a cycle which leads to failures with many operations that only work on cycle-free data models, such as dm_flatten_to_tbl(), dm_filter() or dm_wrap_tbl(). In such cases, it can be beneficial to “disentangle” the dm by duplicating the referred table. One way to do this in the {dm}-framework is as follows:

disentangled_flights_dm <-
  dm(
    destination = flights_keyed$airports,
    origin = flights_keyed$airports,
    !!!flights_keyed[c("flights", "airlines", "planes", "weather")]
  ) %>%
  # Key relations are also duplicated, so the wrong ones need to be removed
  dm_rm_fk(flights, dest, origin) %>%
  dm_rm_fk(flights, origin, destination)

disentangled_flights_dm %>%
  dm_draw()

Use case 4: Add summary table to `dm`

Here is an example for adding a summary of a table as a new table to a dm. Foreign-key relations are taken care of automatically. This example shows an alternative approach of deconstruction reconstruction using pull_tbl().

flights_derived <-
  flights_dm %>%
  pull_tbl(flights, keyed = TRUE) %>%
  dplyr::count(origin, carrier)

derived_flights_dm <- dm(flights_derived, !!!flights_keyed)

derived_flights_dm %>%
  dm_draw()

Use case 5: Joining tables

If you would like to join some or all of the columns of one table to another, you can make use of one of the ..._join() methods for a dm_keyed_tbl. In many cases, using keyed tables derived from a dm object allows omitting the by argument without triggering a message, because they are safely inferred from the foreign keys stored in the dm_keyed_tbl objects. For the syntax, please see the example below.

planes_for_join <-
  flights_keyed$planes %>%
  select(tailnum, plane_type = type)

joined_flights_tbl <-
  flights_keyed$flights %>%
  # let's first reduce the number of columns of flights
  select(-dep_delay:-arr_delay, -air_time:-minute, -starts_with("sched_")) %>%
  # in the {dm}-method for the joins you can specify which columns you want to
  # add to the subsetted table
  left_join(planes_for_join)

joined_flights_dm <- dm(
  flights_plane_type = joined_flights_tbl,
  !!!flights_keyed[c("airlines", "airports", "weather")]
)

# this is how the table looks now
joined_flights_dm$flights_plane_type

#> # A tibble: 1,761 × 11
#>     year month   day dep_time carrier flight tailnum origin dest 
#>    <int> <int> <int>    <int> <chr>    <int> <chr>   <chr>  <chr>
#>  1  2013     1    10        3 B6         727 N571JB  JFK    BQN  
#>  2  2013     1    10       16 B6         739 N564JB  JFK    PSE  
#>  3  2013     1    10      450 US        1117 N171US  EWR    CLT  
#>  4  2013     1    10      520 UA        1018 N35204  EWR    IAH  
#>  5  2013     1    10      530 UA         404 N815UA  LGA    IAH  
#>  6  2013     1    10      531 AA        1141 N5EAAA  JFK    MIA  
#>  7  2013     1    10      535 B6         725 N784JB  JFK    BQN  
#>  8  2013     1    10      546 B6         380 N337JB  EWR    BOS  
#>  9  2013     1    10      549 EV        6055 N19554  LGA    IAD  
#> 10  2013     1    10      550 US        2114 N740UW  LGA    BOS  
#> # ℹ 1,751 more rows
#> # ℹ 2 more variables: time_hour <dttm>, plane_type <chr>

# also here, the FK-relations are transferred to the new table
joined_flights_dm %>%
  dm_draw()

Use case 6: Retrieve all tables

Retrieving all tables from a dm object requires a lot of boilerplate code. The dm_deconstruct() function helps creating that boilerplate. For a dm object, it prints the code necessary to create local variables for all tables.

dm <- dm_nycflights13()
dm_deconstruct(dm)
#> airlines <- pull_tbl(dm, "airlines", keyed = TRUE)
#> airports <- pull_tbl(dm, "airports", keyed = TRUE)
#> flights <- pull_tbl(dm, "flights", keyed = TRUE)
#> planes <- pull_tbl(dm, "planes", keyed = TRUE)
#> weather <- pull_tbl(dm, "weather", keyed = TRUE)

This code can be copy-pasted into your script or function.

Enabling {dplyr}-workflow within a dm