This vignette deals with situations where you want to transform
tables of your dm
object and then update an existing table
or add a new table to the dm
object. There are two
approaches:
- extract the tables relevant to the calculation, perform the
necessary transformations, and (if needed) recombine the resulting table
into a
dm
, - do all this within the
dm
object by zooming to a table and manipulating it.
Both approaches aim at maintaining the key relations whenever
possible. We will explore the first approach here. For the second
approach, see vignette("tech-dm-zoom")
.
Enabling {dplyr}-workflow within a dm
The dm_get_tables()
and pull_tbl()
functions have a new experimental argument keyed
, which
defaults to FALSE
. If set to TRUE
, a list of
objects of class dm_keyed_tbl
is returned instead. Because
dm_keyed_tbl
inherits from tbl
or
tbl_lazy
, many {dplyr} and {tidyr} verbs will work
unchanged. These objects will also attempt to track primary and foreign
keys, so that they are available for joins and when recombining these
tables later into a dm
object.
When you are finished with transforming your data, you can use
dm()
or new_dm()
to recombine the tables into
a dm
object. The resulting tables in the dm
will have all the primary and foreign keys available that could be
tracked from the original table. Reconstructing the dm
object is not strictly necessary if you’re primarily interested in
deriving one or multiple separate tables for analysis.
If this workflow proves as useful as it seems, subsetting tables via
$
, [[
will default to
keyed = TRUE
in a forthcoming major release of {dm}.
Examples
So much for the theory, but how does it look and feel? To explore this, we once more make use of our trusted {nycflights13} data.
Use case 1: Add a new column to an existing table
Imagine you want to have a column in flights
, specifying
if a flight left before noon or after. Just like with {dplyr}, we can
tackle this with mutate()
. Let us do this step by step:
library(dm)
library(dplyr)
flights_dm <- dm_nycflights13(cycle = TRUE)
flights_dm
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 4
#> Foreign keys: 5
flights_keyed <-
flights_dm %>%
dm_get_tables(keyed = TRUE)
# The print output for a `dm_keyed_tbl` looks very much like that from a normal
# `tibble`, with additional details about keys.
flights_keyed$flights
#> # A tibble: 1,761 × 19
#> # Keys: — | 0 | 5
#> year month day dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 10 3 2359 4 426
#> 2 2013 1 10 16 2359 17 447
#> 3 2013 1 10 450 500 -10 634
#> 4 2013 1 10 520 525 -5 813
#> 5 2013 1 10 530 530 0 824
#> 6 2013 1 10 531 540 -9 832
#> 7 2013 1 10 535 540 -5 1015
#> 8 2013 1 10 546 600 -14 645
#> 9 2013 1 10 549 600 -11 652
#> 10 2013 1 10 550 600 -10 649
#> # ℹ 1,751 more rows
#> # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> # time_hour <dttm>
flights_tbl_mutate <-
flights_keyed$flights %>%
mutate(am_pm_dep = if_else(dep_time < 1200, "am", "pm"), .after = dep_time)
flights_tbl_mutate
#> # A tibble: 1,761 × 20
#> # Keys: — | 0 | 5
#> year month day dep_time am_pm_dep sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <chr> <int> <dbl> <int>
#> 1 2013 1 10 3 am 2359 4 426
#> 2 2013 1 10 16 am 2359 17 447
#> 3 2013 1 10 450 am 500 -10 634
#> 4 2013 1 10 520 am 525 -5 813
#> 5 2013 1 10 530 am 530 0 824
#> 6 2013 1 10 531 am 540 -9 832
#> 7 2013 1 10 535 am 540 -5 1015
#> 8 2013 1 10 546 am 600 -14 645
#> 9 2013 1 10 549 am 600 -11 652
#> 10 2013 1 10 550 am 600 -10 649
#> # ℹ 1,751 more rows
#> # ℹ 12 more variables: sched_arr_time <int>, arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#> # time_hour <dttm>
To update the original dm
with a new
flights
table we use dm()
. The bang-bang-bang
(!!!
) is a technical necessity that will become superfluous
in a forthcoming release.
updated_flights_dm <- dm(
flights = flights_tbl_mutate,
!!!flights_keyed[c("airlines", "airports", "planes", "weather")]
)
# The only difference in the `dm` print output is the increased number of
# columns
updated_flights_dm
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `flights`, `airlines`, `airports`, `planes`, `weather`
#> Columns: 54
#> Primary keys: 4
#> Foreign keys: 5
# The schematic view of the data model remains unchanged
dm_draw(updated_flights_dm)
Use case 2: Creation of a surrogate key
The same course of action could, for example, be employed to create a
surrogate key for a table, a synthetic simple key that replaces a
compound key. We can do this for the weather
table.
#> # A tibble: 144 × 15
#> # Keys: `origin`, `time_hour` | 1 | 0
#> origin year month day hour temp dewp humid wind_dir wind_speed
#> <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 EWR 2013 1 10 0 41 32 70.1 230 8.06
#> 2 EWR 2013 1 10 1 39.0 30.0 69.9 210 9.21
#> 3 EWR 2013 1 10 2 39.0 28.9 66.8 230 6.90
#> 4 EWR 2013 1 10 3 39.9 27.0 59.5 270 5.75
#> 5 EWR 2013 1 10 4 41 26.1 55.0 320 6.90
#> 6 EWR 2013 1 10 5 41 26.1 55.0 300 12.7
#> 7 EWR 2013 1 10 6 39.9 25.0 54.8 280 6.90
#> 8 EWR 2013 1 10 7 41 25.0 52.6 330 6.90
#> 9 EWR 2013 1 10 8 43.0 25.0 48.7 330 8.06
#> 10 EWR 2013 1 10 9 45.0 23 41.6 320 17.3
#> # ℹ 134 more rows
#> # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
#> # visib <dbl>, time_hour <dttm>
# Maybe there is some hidden candidate for a primary key that we overlooked?
enum_pk_candidates(flights_keyed$weather)
#> # A tibble: 15 × 3
#> columns candidate why
#> <keys> <lgl> <chr>
#> 1 origin FALSE has duplicate values: EWR (48), JFK (48), LGA (48)
#> 2 year FALSE has duplicate values: 2013 (144)
#> 3 month FALSE has duplicate values: 1 (72), 2 (72)
#> 4 day FALSE has duplicate values: 10 (144)
#> 5 hour FALSE has duplicate values: 0 (6), 1 (6), 2 (6), 3 (6), …
#> 6 temp FALSE has duplicate values: 44.06 (12), 41.00 (8), 44.96…
#> 7 dewp FALSE has duplicate values: 21.92 (16), 24.98 (16), 6.98…
#> 8 humid FALSE has duplicate values: 53.71 (4), 56.56 (4), 32.53 …
#> 9 wind_dir FALSE has duplicate values: 320 (25), 330 (17), 310 (15)…
#> 10 wind_speed FALSE has duplicate values: 6.90468 (20), 8.05546 (19), …
#> 11 wind_gust FALSE has 123 missing values, and duplicate values: 23.0…
#> 12 precip FALSE has duplicate values: 0 (144)
#> 13 pressure FALSE has duplicate values: 1028.9 (6), 1029.0 (5), 1032…
#> 14 visib FALSE has duplicate values: 10 (144)
#> 15 time_hour FALSE has duplicate values: 2013-01-10 00:00:00 (3), 201…
# Seems we have to construct a column with unique values
# This can be done by combining column `origin` with `time_hour`, if the latter
# is converted to a single time zone first; all within the `dm`:
weather_tbl_mutate <-
flights_keyed$weather %>%
# first convert all times to the same time zone:
mutate(time_hour_fmt = format(time_hour, tz = "UTC")) %>%
# paste together as character the airport code and the time
unite("origin_slot_id", origin, time_hour_fmt) %>%
select(origin_slot_id, everything())
# check if we the result is as expected:
weather_tbl_mutate %>%
enum_pk_candidates() %>%
filter(candidate)
#> # A tibble: 1 × 3
#> columns candidate why
#> <keys> <lgl> <chr>
#> 1 origin_slot_id TRUE ""
# We apply the same transformation to create
# the foreign key in the flights table:
flights_tbl_mutate <-
flights_keyed$flights %>%
mutate(time_hour_fmt = format(time_hour, tz = "UTC")) %>%
unite("origin_slot_id", origin, time_hour_fmt) %>%
select(origin_slot_id, everything())
surrogate_flights_dm <-
dm(
weather = weather_tbl_mutate,
flights = flights_tbl_mutate,
!!!flights_keyed[c("airlines", "airports", "planes")]
) %>%
dm_add_pk(weather, origin_slot_id) %>%
dm_add_fk(flights, origin_slot_id, weather)
surrogate_flights_dm %>%
dm_draw()
Use case 3: Disentangle dm
If you look at the dm
created by
dm_nycflights13(cycle = TRUE)
, you see that two columns of
flights
relate to the same table, airports
.
One column stands for the departure airport and the other for the
arrival airport. This generates a cycle which leads to failures with
many operations that only work on cycle-free data models, such as
dm_flatten_to_tbl()
, dm_filter()
or
dm_wrap_tbl()
. In such cases, it can be beneficial to
“disentangle” the dm
by duplicating the referred table. One
way to do this in the {dm}-framework is as follows:
disentangled_flights_dm <-
dm(
destination = flights_keyed$airports,
origin = flights_keyed$airports,
!!!flights_keyed[c("flights", "airlines", "planes", "weather")]
) %>%
# Key relations are also duplicated, so the wrong ones need to be removed
dm_rm_fk(flights, dest, origin) %>%
dm_rm_fk(flights, origin, destination)
disentangled_flights_dm %>%
dm_draw()
Use case 4: Add summary table to dm
Here is an example for adding a summary of a table as a new table to
a dm
. Foreign-key relations are taken care of
automatically. This example shows an alternative approach of
deconstruction reconstruction using pull_tbl()
.
Use case 5: Joining tables
If you would like to join some or all of the columns of one table to
another, you can make use of one of the ..._join()
methods
for a dm_keyed_tbl
. In many cases, using keyed tables
derived from a dm
object allows omitting the
by
argument without triggering a message, because they are
safely inferred from the foreign keys stored in the
dm_keyed_tbl
objects. For the syntax, please see the
example below.
planes_for_join <-
flights_keyed$planes %>%
select(tailnum, plane_type = type)
joined_flights_tbl <-
flights_keyed$flights %>%
# let's first reduce the number of columns of flights
select(-dep_delay:-arr_delay, -air_time:-minute, -starts_with("sched_")) %>%
# in the {dm}-method for the joins you can specify which columns you want to
# add to the subsetted table
left_join(planes_for_join)
joined_flights_dm <- dm(
flights_plane_type = joined_flights_tbl,
!!!flights_keyed[c("airlines", "airports", "weather")]
)
# this is how the table looks now
joined_flights_dm$flights_plane_type
#> # A tibble: 1,761 × 11
#> year month day dep_time carrier flight tailnum origin dest
#> <int> <int> <int> <int> <chr> <int> <chr> <chr> <chr>
#> 1 2013 1 10 3 B6 727 N571JB JFK BQN
#> 2 2013 1 10 16 B6 739 N564JB JFK PSE
#> 3 2013 1 10 450 US 1117 N171US EWR CLT
#> 4 2013 1 10 520 UA 1018 N35204 EWR IAH
#> 5 2013 1 10 530 UA 404 N815UA LGA IAH
#> 6 2013 1 10 531 AA 1141 N5EAAA JFK MIA
#> 7 2013 1 10 535 B6 725 N784JB JFK BQN
#> 8 2013 1 10 546 B6 380 N337JB EWR BOS
#> 9 2013 1 10 549 EV 6055 N19554 LGA IAD
#> 10 2013 1 10 550 US 2114 N740UW LGA BOS
#> # ℹ 1,751 more rows
#> # ℹ 2 more variables: time_hour <dttm>, plane_type <chr>
Use case 6: Retrieve all tables
Retrieving all tables from a dm
object requires a lot of
boilerplate code. The dm_deconstruct()
function helps
creating that boilerplate. For a dm
object, it prints the
code necessary to create local variables for all tables.
dm <- dm_nycflights13()
dm_deconstruct(dm)
#> airlines <- pull_tbl(dm, "airlines", keyed = TRUE)
#> airports <- pull_tbl(dm, "airports", keyed = TRUE)
#> flights <- pull_tbl(dm, "flights", keyed = TRUE)
#> planes <- pull_tbl(dm, "planes", keyed = TRUE)
#> weather <- pull_tbl(dm, "weather", keyed = TRUE)
This code can be copy-pasted into your script or function.