Skip to contents

This vignette deals with situations where you want to transform tables of your dm and then update an existing table or add a new table to the dm. There are two straightforward approaches:

  1. extract the tables relevant to the calculation, perform the necessary transformations, and (if needed) recombine the resulting table into a dm,
  2. do all this within the dm object by zooming to a table and manipulating it while maintaining the key relations whenever possible.

Both approaches aim at maintaining the key relations whenever possible. We will explore the second approach here. For the first approach, see vignette("tech-dm-zoom").

Enabling {dplyr}-workflow within a dm

“Zooming” to a table of a dm means:

  • all information stored in the original dm is kept, including the originally zoomed table
  • an object of class dm_zoomed is produced, presenting a view of the table for transformations
  • you do not need to specify the table when calling select(), mutate() and other table manipulation functions

{dm} provides methods for many of the {dplyr}-verbs for a dm_zoomed which behave the way you are used to, affecting only the zoomed table and leaving the rest of the dm untouched. When you are finished with transforming the table, there are three options to proceed:

  1. use dm_update_zoomed() if you want to replace the originally zoomed table with the new table
  2. use dm_insert_zoomed() if you are creating a new table for your dm
  3. use dm_discard_zoomed() if you do not need the result and want to discard it

When employing one of the first two options, the resulting table in the dm will have all the primary and foreign keys available that could be tracked from the originally zoomed table.

Examples

So much for the theory, but how does it look and feel? To explore this, we once more make use of our trusted {nycflights13} data.

Use case 1: Add a new column to an existing table

Imagine you want to have a column in flights, specifying if a flight left before noon or after. Just like with {dplyr}, we can tackle this with mutate(). Let us do this step by step:

library(dm)
library(dplyr)
flights_dm <- dm_nycflights13()
flights_dm
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 4
#> Foreign keys: 4
flights_zoomed <-
  flights_dm %>%
  dm_zoom_to(flights)
# The print output for a `dm_zoomed` looks very much like that from a normal `tibble`.
flights_zoomed
#> # Zoomed table: flights
#> # A tibble:     1,761 × 19
#>     year month   day dep_time sched_dep_t…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵
#>    <int> <int> <int>    <int>         <int>   <dbl>   <int>   <int>   <dbl>
#>  1  2013     1    10        3          2359       4     426     437     -11
#>  2  2013     1    10       16          2359      17     447     444       3
#>  3  2013     1    10      450           500     -10     634     648     -14
#>  4  2013     1    10      520           525      -5     813     820      -7
#>  5  2013     1    10      530           530       0     824     829      -5
#>  6  2013     1    10      531           540      -9     832     850     -18
#>  7  2013     1    10      535           540      -5    1015    1017      -2
#>  8  2013     1    10      546           600     -14     645     709     -24
#>  9  2013     1    10      549           600     -11     652     724     -32
#> 10  2013     1    10      550           600     -10     649     703     -14
#> # … with 1,751 more rows, 10 more variables: carrier <chr>, flight <int>,
#> #   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>, and
#> #   abbreviated variable names ¹​sched_dep_time, ²​dep_delay, ³​arr_time,
#> #   ⁴​sched_arr_time, ⁵​arr_delay

flights_zoomed_mutate <-
  flights_zoomed %>%
  mutate(am_pm_dep = if_else(dep_time < 1200, "am", "pm")) %>%
  # in order to see our changes in the output we use `select()` for reordering the columns
  select(year:dep_time, am_pm_dep, everything())

flights_zoomed_mutate
#> # Zoomed table: flights
#> # A tibble:     1,761 × 20
#>     year month   day dep_time am_pm_dep sched_dep…¹ dep_d…² arr_t…³ sched…⁴
#>    <int> <int> <int>    <int> <chr>           <int>   <dbl>   <int>   <int>
#>  1  2013     1    10        3 am               2359       4     426     437
#>  2  2013     1    10       16 am               2359      17     447     444
#>  3  2013     1    10      450 am                500     -10     634     648
#>  4  2013     1    10      520 am                525      -5     813     820
#>  5  2013     1    10      530 am                530       0     824     829
#>  6  2013     1    10      531 am                540      -9     832     850
#>  7  2013     1    10      535 am                540      -5    1015    1017
#>  8  2013     1    10      546 am                600     -14     645     709
#>  9  2013     1    10      549 am                600     -11     652     724
#> 10  2013     1    10      550 am                600     -10     649     703
#> # … with 1,751 more rows, 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> #   time_hour <dttm>, and abbreviated variable names ¹​sched_dep_time,
#> #   ²​dep_delay, ³​arr_time, ⁴​sched_arr_time

# To update the original `dm` with a new `flights` table we use `dm_update_zoomed()`:
updated_flights_dm <-
  flights_zoomed_mutate %>%
  dm_update_zoomed()
# The only difference in the `dm` print output is the increased number of columns
updated_flights_dm
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 54
#> Primary keys: 4
#> Foreign keys: 4
# The schematic view of the data model remains unchanged
dm_draw(updated_flights_dm)
%0 airlines airlinescarrierairports airportsfaaflights flightscarriertailnumoriginorigin, time_hourflights:carrier->airlines:carrier flights:origin->airports:faa planes planestailnumflights:tailnum->planes:tailnum weather weatherorigin, time_hourflights:origin, time_hour->weather:origin, time_hour

Use case 2: Creation of a surrogate key

The same course of action could, for example, be employed to create a surrogate key for a table, a synthetic simple key that replaces a compound key. We can do this for the weather table.

library(tidyr)

weather_zoomed <-
  flights_dm %>%
  dm_zoom_to(weather)
weather_zoomed
#> # Zoomed table: weather
#> # A tibble:     144 × 15
#>    origin  year month   day  hour  temp  dewp humid wind_…¹ wind_…² wind_…³
#>    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>   <dbl>   <dbl>   <dbl>
#>  1 EWR     2013     1    10     0  41    32    70.1     230    8.06    NA  
#>  2 EWR     2013     1    10     1  39.0  30.0  69.9     210    9.21    NA  
#>  3 EWR     2013     1    10     2  39.0  28.9  66.8     230    6.90    NA  
#>  4 EWR     2013     1    10     3  39.9  27.0  59.5     270    5.75    NA  
#>  5 EWR     2013     1    10     4  41    26.1  55.0     320    6.90    NA  
#>  6 EWR     2013     1    10     5  41    26.1  55.0     300   12.7     20.7
#>  7 EWR     2013     1    10     6  39.9  25.0  54.8     280    6.90    17.3
#>  8 EWR     2013     1    10     7  41    25.0  52.6     330    6.90    NA  
#>  9 EWR     2013     1    10     8  43.0  25.0  48.7     330    8.06    NA  
#> 10 EWR     2013     1    10     9  45.0  23    41.6     320   17.3     26.5
#> # … with 134 more rows, 4 more variables: precip <dbl>, pressure <dbl>,
#> #   visib <dbl>, time_hour <dttm>, and abbreviated variable names
#> #   ¹​wind_dir, ²​wind_speed, ³​wind_gust
# Maybe there is some hidden candidate for a primary key that we overlooked
enum_pk_candidates(weather_zoomed)
#> # A tibble: 15 × 3
#>    columns    candidate why                                                
#>    <keys>     <lgl>     <chr>                                              
#>  1 origin     FALSE     has duplicate values: EWR (48), JFK (48), LGA (48) 
#>  2 year       FALSE     has duplicate values: 2013 (144)                   
#>  3 month      FALSE     has duplicate values: 1 (72), 2 (72)               
#>  4 day        FALSE     has duplicate values: 10 (144)                     
#>  5 hour       FALSE     has duplicate values: 0 (6), 1 (6), 2 (6), 3 (6), …
#>  6 temp       FALSE     has duplicate values: 44.06 (12), 41.00 (8), 44.96…
#>  7 dewp       FALSE     has duplicate values: 21.92 (16), 24.98 (16), 6.98…
#>  8 humid      FALSE     has duplicate values: 53.71 (4), 56.56 (4), 32.53 …
#>  9 wind_dir   FALSE     has duplicate values: 320 (25), 330 (17), 310 (15)…
#> 10 wind_speed FALSE     has duplicate values: 6.90468 (20), 8.05546 (19), …
#> 11 wind_gust  FALSE     has 123 missing values, and duplicate values: 23.0…
#> 12 precip     FALSE     has duplicate values: 0 (144)                      
#> 13 pressure   FALSE     has duplicate values: 1028.9 (6), 1029.0 (5), 1032…
#> 14 visib      FALSE     has duplicate values: 10 (144)                     
#> 15 time_hour  FALSE     has duplicate values: 2013-01-10 00:00:00 (3), 201…
# Seems we have to construct a column with unique values
# This can be done by combining column `origin` with `time_hour`, if the latter
# is converted to a single time zone first; all within the `dm`:
weather_zoomed_mutate <-
  weather_zoomed %>%
  # first convert all times to the same time zone:
  mutate(time_hour_fmt = format(time_hour, tz = "UTC")) %>%
  # paste together as character the airport code and the time
  unite("origin_slot_id", origin, time_hour_fmt) %>%
  select(origin_slot_id, everything())
# check if we the result is as expected:
enum_pk_candidates(weather_zoomed_mutate) %>% filter(candidate)
#> # A tibble: 1 × 3
#>   columns        candidate why  
#>   <keys>         <lgl>     <chr>
#> 1 origin_slot_id TRUE      ""
flights_upd_weather_dm <-
  weather_zoomed_mutate %>%
  dm_update_zoomed() %>%
  dm_add_pk(weather, origin_slot_id)
flights_upd_weather_dm
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 4
#> Foreign keys: 3
# creating the coveted FK relation between `flights` and `weather`
extended_flights_dm <-
  flights_upd_weather_dm %>%
  dm_zoom_to(flights) %>%
  mutate(time_hour_fmt = format(time_hour, tz = "UTC")) %>%
  # need to keep `origin` as FK to airports, so `remove = FALSE`
  unite("origin_slot_id", origin, time_hour_fmt, remove = FALSE) %>%
  dm_update_zoomed() %>%
  dm_add_fk(flights, origin_slot_id, weather)
extended_flights_dm %>% dm_draw()
%0 airlines airlinescarrierairports airportsfaaflights flightscarriertailnumorigin_slot_idoriginflights:carrier->airlines:carrier flights:origin->airports:faa planes planestailnumflights:tailnum->planes:tailnum weather weatherorigin_slot_idflights:origin_slot_id->weather:origin_slot_id

Use case 3: Disentangle dm

If you look at the dm created by dm_nycflights13(cycle = TRUE), you see that two columns of flights relate to one and the same table, airports. One column stands for the departure airport and the other for the arrival airport.

dm_draw(dm_nycflights13(cycle = TRUE))
%0 airlines airlinescarrierairports airportsfaaflights flightscarriertailnumorigindestorigin, time_hourflights:carrier->airlines:carrier flights:origin->airports:faa flights:dest->airports:faa planes planestailnumflights:tailnum->planes:tailnum weather weatherorigin, time_hourflights:origin, time_hour->weather:origin, time_hour

This generates a cycle which leads to failures with many operations that only work on cycle-free data models, such as dm_flatten_to_tbl(), dm_filter() or dm_wrap_to_tbl(). In such cases, it can be beneficial to “disentangle” the dm by duplicating the referred table. One way to do this in the {dm}-framework is as follows:

disentangled_flights_dm <-
  dm_nycflights13(cycle = TRUE) %>%
  # zooming and immediately inserting essentially creates a copy of the original table
  dm_zoom_to(airports) %>%
  # reinserting the `airports` table under the name `destination`
  dm_insert_zoomed("destination") %>%
  # renaming the originally zoomed table
  dm_rename_tbl(origin = airports) %>%
  # Key relations are also duplicated, so the wrong ones need to be removed
  dm_rm_fk(flights, dest, origin) %>%
  dm_rm_fk(flights, origin, destination)
dm_draw(disentangled_flights_dm)
%0 airlines airlinescarrierdestination destinationfaaflights flightscarriertailnumorigindestorigin, time_hourflights:carrier->airlines:carrier flights:dest->destination:faa origin originfaaflights:origin->origin:faa planes planestailnumflights:tailnum->planes:tailnum weather weatherorigin, time_hourflights:origin, time_hour->weather:origin, time_hour

In a future update, we will provide a more convenient way to “disentangle” dm objects, so that the individual steps will be done automatically.

Use case 4: Add summary table to dm

Here is an example for adding a summary of a table as a new table to a dm (FK-relations are taken care of automatically):

dm_with_summary <-
  flights_dm %>%
  dm_zoom_to(flights) %>%
  count(origin, carrier) %>%
  dm_insert_zoomed("dep_carrier_count")
dm_draw(dm_with_summary)
%0 airlines airlinescarrierairports airportsfaadep_carrier_count dep_carrier_countorigincarrierdep_carrier_count:carrier->airlines:carrier dep_carrier_count:origin->airports:faa flights flightscarriertailnumoriginorigin, time_hourflights:carrier->airlines:carrier flights:origin->airports:faa planes planestailnumflights:tailnum->planes:tailnum weather weatherorigin, time_hourflights:origin, time_hour->weather:origin, time_hour

Use case 5: Joining tables

If you would like to join some or all of the columns of one table to another, you can make use of one of the join-methods for a dm_zoomed. In addition to the usual arguments for the {dplyr}-joins, by supplying the select argument you can specify which columns of the RHS-table you want to be included in the join. For the syntax please see the example below. The LHS-table of a join is always the zoomed table.

joined_flights_dm <-
  flights_dm %>%
  dm_zoom_to(flights) %>%
  # let's first reduce the number of columns of flights
  select(-dep_delay:-arr_delay, -air_time:-time_hour) %>%
  # in the {dm}-method for the joins you can specify which columns you want to add to the zoomed table
  left_join(planes, select = c(tailnum, plane_type = type)) %>%
  dm_insert_zoomed("flights_plane_type")
# this is how the table looks now
joined_flights_dm$flights_plane_type
#> # A tibble: 1,761 × 11
#>     year month   day dep_time sched_d…¹ carrier flight tailnum origin dest 
#>    <int> <int> <int>    <int>     <int> <chr>    <int> <chr>   <chr>  <chr>
#>  1  2013     1    10        3      2359 B6         727 N571JB  JFK    BQN  
#>  2  2013     1    10       16      2359 B6         739 N564JB  JFK    PSE  
#>  3  2013     1    10      450       500 US        1117 N171US  EWR    CLT  
#>  4  2013     1    10      520       525 UA        1018 N35204  EWR    IAH  
#>  5  2013     1    10      530       530 UA         404 N815UA  LGA    IAH  
#>  6  2013     1    10      531       540 AA        1141 N5EAAA  JFK    MIA  
#>  7  2013     1    10      535       540 B6         725 N784JB  JFK    BQN  
#>  8  2013     1    10      546       600 B6         380 N337JB  EWR    BOS  
#>  9  2013     1    10      549       600 EV        6055 N19554  LGA    IAD  
#> 10  2013     1    10      550       600 US        2114 N740UW  LGA    BOS  
#> # … with 1,751 more rows, 1 more variable: plane_type <chr>, and
#> #   abbreviated variable name ¹​sched_dep_time
# also here, the FK-relations are transferred to the new table
dm_draw(joined_flights_dm)
%0 airlines airlinescarrierairports airportsfaaflights flightscarriertailnumoriginorigin, time_hourflights:carrier->airlines:carrier flights:origin->airports:faa planes planestailnumflights:tailnum->planes:tailnum weather weatherorigin, time_hourflights:origin, time_hour->weather:origin, time_hour flights_plane_type flights_plane_typecarriertailnumoriginflights_plane_type:carrier->airlines:carrier flights_plane_type:origin->airports:faa flights_plane_type:tailnum->planes:tailnum

Tip: Accessing the zoomed table

At each point, you can retrieve the zoomed table by calling pull_tbl() on a dm_zoomed. To use our last example once more:

flights_dm %>%
  dm_zoom_to(flights) %>%
  select(-dep_delay:-arr_delay, -air_time:-time_hour) %>%
  left_join(planes, select = c(tailnum, plane_type = type)) %>%
  pull_tbl()
#> # A tibble: 1,761 × 11
#>     year month   day dep_time sched_d…¹ carrier flight tailnum origin dest 
#>    <int> <int> <int>    <int>     <int> <chr>    <int> <chr>   <chr>  <chr>
#>  1  2013     1    10        3      2359 B6         727 N571JB  JFK    BQN  
#>  2  2013     1    10       16      2359 B6         739 N564JB  JFK    PSE  
#>  3  2013     1    10      450       500 US        1117 N171US  EWR    CLT  
#>  4  2013     1    10      520       525 UA        1018 N35204  EWR    IAH  
#>  5  2013     1    10      530       530 UA         404 N815UA  LGA    IAH  
#>  6  2013     1    10      531       540 AA        1141 N5EAAA  JFK    MIA  
#>  7  2013     1    10      535       540 B6         725 N784JB  JFK    BQN  
#>  8  2013     1    10      546       600 B6         380 N337JB  EWR    BOS  
#>  9  2013     1    10      549       600 EV        6055 N19554  LGA    IAD  
#> 10  2013     1    10      550       600 US        2114 N740UW  LGA    BOS  
#> # … with 1,751 more rows, 1 more variable: plane_type <chr>, and
#> #   abbreviated variable name ¹​sched_dep_time

Possible pitfalls and caveats

  1. Currently, not all {dplyr}-verbs have their own method for a dm_zoomed object, so be aware that in some cases it will still be necessary to resort to extracting one or more tables from a dm and reinserting a transformed version back into the dm object. The supported functions are: group_by(), ungroup(), summarise(), mutate(), transmute(), filter(), select(), relocate(), rename(), distinct(), arrange(), slice(), left_join(), inner_join(), full_join(), right_join(), semi_join(), and anti_join().

  2. The same is true for {tidyr}-functions. Methods are provided for: unite() and separate().

  3. There might be situations when you would like the key relations to remain intact, but they are dropped nevertheless. This is because a rigid logic is implemented, that does drop a key when its associated column is acted upon with e.g. a mutate() call. In these cases, the key relations will need to be re-established after finishing with the manipulations.

  4. For each implemented {dplyr}-verb, there is a logic for tracking key relations between the tables. Up to {dm} version 0.2.4 we tried to track the columns in a very detailed manner. This has become increasingly difficult, especially with dplyr::across(). As of {dm} 0.2.5, we give more responsibility to the {dm} user: Now those columns are tracked whose names remain in the resulting table. Affected by these changes are the methods for: mutate(), transmute(), distinct(). When using one of these functions, be aware that if you want to replace a key column with a column with a different content but of the same name, this column will automatically become a key column.