dm allows you to create your own relational data models from local data frames. Once your data model is complete, you can deploy it to a range of database management systems (DBMS) using {dm}.
Creating a dm object from data frames
The example data set that we will be using is available through the
nycflights13
package. The five tables that we are working with contain information
about all flights that departed from the airports of New York to other
destinations in the United States in 2013:
-
flights
represents the trips taken by planes -
airlines
includes- the names of transport organizations (
name
) - their abbreviated codes (
carrier
)
- the names of transport organizations (
-
airports
indicates the ports of departure (origin
) and of destination (dest
) -
weather
contains meteorological information at each hour -
planes
describes characteristics of the aircraft
Once we’ve loaded {nycflights13}, the aforementioned tables are all in our work environment, ready to be accessed.
library(nycflights13)
airports
#> # A tibble: 1,458 × 8
#> faa name lat lon alt tz dst tzone
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A Amer…
#> 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A Amer…
#> 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A Amer…
#> 4 06N Randall Airport 41.4 -74.4 523 -5 A Amer…
#> 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A Amer…
#> 6 0A9 Elizabethton Municipal Airpo… 36.4 -82.2 1593 -5 A Amer…
#> 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A Amer…
#> 8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A Amer…
#> 9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U Amer…
#> 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A Amer…
#> # ℹ 1,448 more rows
Your own data will probably not be available as an R package.
Whatever format it is in, you will need to be able to load it as data
frames into your R session. If the data is too large, consider using dm
to connect to the database instead. See
vignette("howto-dm-db")
for details on using dm with
databases.
Adding Tables
Our first step will be to tell dm
which tables we want
to work with and how they are connected. For that we can use
dm()
, passing in the table names as arguments.
library(dm)
flights_dm_no_keys <- dm(airlines, airports, flights, planes, weather)
flights_dm_no_keys
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 0
#> Foreign keys: 0
The as_dm()
function is an alternative that works if you
already have a list of tables.
A dm is a list
dm
objects behave like lists with a user- and
console-friendly print format. In fact, using a dm as a nicer list for
organizing your data frames in your environment is an easy first step
towards using dm and its data modeling functionality.
Subsetting syntax for a dm
object (either by subscript
or by name) is similar to syntax for lists, and so you don’t need to
learn any additional syntax to work with dm
objects.
names(flights_dm_no_keys)
#> [1] "airlines" "airports" "flights" "planes" "weather"
flights_dm_no_keys$airports
#> # A tibble: 1,458 × 8
#> faa name lat lon alt tz dst tzone
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A Amer…
#> 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A Amer…
#> 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A Amer…
#> 4 06N Randall Airport 41.4 -74.4 523 -5 A Amer…
#> 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A Amer…
#> 6 0A9 Elizabethton Municipal Airpo… 36.4 -82.2 1593 -5 A Amer…
#> 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A Amer…
#> 8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A Amer…
#> 9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U Amer…
#> 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A Amer…
#> # ℹ 1,448 more rows
flights_dm_no_keys[c("airports", "flights")]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airports`, `flights`
#> Columns: 27
#> Primary keys: 0
#> Foreign keys: 0
Defining Keys
Even though we now have a dm
object that contains all
our data, we have not specified how our five tables are connected. To do
this, we need to define primary keys and foreign keys on the tables.
Primary keys and foreign keys are how relational database tables are
linked with each other. A primary key is a column or column tuple that
has a unique value for each row within a table. A foreign key is a
column or column tuple containing the primary key for a row in another
table. Foreign keys act as cross references between tables. They specify
the relationships that gives us the relational database. For
more information on keys and a crash course on databases, see
vignette("howto-dm-theory")
.
Primary Keys
dm
offers dm_enum_pk_candidates()
to
identify viable primary keys for a table in the dm
object,
and dm_add_pk()
to add them.
dm_enum_pk_candidates(
dm = flights_dm_no_keys,
table = planes
)
#> # A tibble: 9 × 3
#> columns candidate why
#> <keys> <lgl> <chr>
#> 1 tailnum TRUE ""
#> 2 year FALSE "has duplicate values: 2001 (284), 2000 (244), 20…
#> 3 type FALSE "has duplicate values: Fixed wing multi engine (3…
#> 4 manufacturer FALSE "has duplicate values: BOEING (1630), AIRBUS INDU…
#> 5 model FALSE "has duplicate values: 737-7H4 (361), A320-232 (2…
#> 6 engines FALSE "has duplicate values: 2 (3288), 1 (27), 4 (4), 3…
#> 7 seats FALSE "has duplicate values: 149 (452), 140 (411), 55 (…
#> 8 speed FALSE "has 3299 missing values, and duplicate values: 4…
#> 9 engine FALSE "has duplicate values: Turbo-fan (2750), Turbo-je…
Now, we can add the identified primary keys:
flights_dm_only_pks <-
flights_dm_no_keys %>%
dm_add_pk(table = airlines, columns = carrier) %>%
dm_add_pk(airports, faa) %>%
dm_add_pk(planes, tailnum) %>%
dm_add_pk(weather, c(origin, time_hour))
flights_dm_only_pks
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 4
#> Foreign keys: 0
Note that {dm} functions work with both named and positional argument specification, and compound keys can be specified using a vector argument.
Foreign Keys
To define how our tables are related, we use dm_add_fk()
to add foreign keys. Naturally, this function will deal with two tables:
a table looking for a reference, and a table that is
providing the reference. Accordingly, while calling
dm_add_fk()
, the table
argument specifies the
table that needs a foreign key to link it to a second table, and the
ref_table
argument specifies the table to be linked to,
which needs a primary key already defined for it.
dm_enum_fk_candidates(
dm = flights_dm_only_pks,
table = flights,
ref_table = airlines
)
#> # A tibble: 19 × 3
#> columns candidate why
#> <keys> <lgl> <chr>
#> 1 carrier TRUE ""
#> 2 year FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 3 month FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 4 day FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 5 dep_time FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 6 sched_dep_time FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 7 dep_delay FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 8 arr_time FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 9 sched_arr_time FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 10 arr_delay FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 11 flight FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 12 tailnum FALSE "values of `flights$tailnum` not in `airlines$…
#> 13 origin FALSE "values of `flights$origin` not in `airlines$c…
#> 14 dest FALSE "values of `flights$dest` not in `airlines$car…
#> 15 air_time FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 16 distance FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 17 hour FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 18 minute FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 19 time_hour FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
Having chosen a column from the successful candidates provided by
dm_enum_fk_candidates()
, we use the
dm_add_fk()
function to establish the foreign key linking
the tables. In the second call to dm_add_fk()
we complete
the process for the flights
and airlines
tables that we started above. The carrier
column in the
airlines
table will be added as the foreign key in
flights
.
flights_dm_all_keys <-
flights_dm_only_pks %>%
dm_add_fk(table = flights, columns = tailnum, ref_table = planes) %>%
dm_add_fk(flights, carrier, airlines) %>%
dm_add_fk(flights, origin, airports) %>%
dm_add_fk(flights, c(origin, time_hour), weather)
flights_dm_all_keys
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines`, `airports`, `flights`, `planes`, `weather`
#> Columns: 53
#> Primary keys: 4
#> Foreign keys: 4
Having created the required primary and foreign keys to link all the tables together, we now have a relational data model we can work with.
Visualization
Visualizing a data model is a quick and easy way to verify that we
have successfully created the model we were aiming for. We can use
dm_draw()
at any stage of the process to generate a visual
representation of the tables and the links between them:
Integrity Checks
As well as checking our data model visually, dm can examine the constraints we have created by the addition of keys and verify that they are sensible.
flights_dm_no_keys %>%
dm_examine_constraints()
flights_dm_only_pks %>%
dm_examine_constraints()
flights_dm_all_keys %>%
dm_examine_constraints()
#> • Table `flights`: foreign key `tailnum` into table `planes`: values of `flights$tailnum` not in `planes$tailnum`: N725MQ (575), N722MQ (513), N723MQ (507), N713MQ (483), N735MQ (396), …
#> • Table `flights`: foreign key `origin`, `time_hour` into table `weather`: values of `flights$origin`, `flights$time_hour` not in `weather$origin`, `weather$time_hour`: EWR, 2013-10-23 06:00:00 (34), EWR, 2013-08-19 17:00:00 (26), EWR, 2013-12-31 06:00:00 (26), EWR, 2013-12-31 07:00:00 (26), JFK, 2013-08-19 17:00:00 (26), …
The results are presented in a human-readable form, and available as a tibble for programmatic inspection.
Programming
Helper functions are available to access details on keys and check results.
A data frame of primary keys is retrieved with
dm_get_all_pks()
:
flights_dm_only_pks %>%
dm_get_all_pks()
#> # A tibble: 4 × 3
#> table pk_col autoincrement
#> <chr> <keys> <lgl>
#> 1 airlines carrier FALSE
#> 2 airports faa FALSE
#> 3 planes tailnum FALSE
#> 4 weather origin, time_hour FALSE
Similarly, a data frame of foreign keys is retrieved with
dm_get_all_fks()
:
flights_dm_all_keys %>%
dm_get_all_fks()
#> # A tibble: 4 × 5
#> child_table child_fk_cols parent_table parent_key_cols on_delete
#> <chr> <keys> <chr> <keys> <chr>
#> 1 flights carrier airlines carrier no_action
#> 2 flights origin airports faa no_action
#> 3 flights tailnum planes tailnum no_action
#> 4 flights origin, time_hour weather origin, time_hour no_action
We can use tibble::as_tibble()
on the result of
dm_examine_constraints()
to programmatically inspect which
constraints are not satisfied:
flights_dm_all_keys %>%
dm_examine_constraints() %>%
tibble::as_tibble()
#> # A tibble: 8 × 6
#> table kind columns ref_table is_key problem
#> <chr> <chr> <keys> <chr> <lgl> <chr>
#> 1 flights FK tailnum planes FALSE "values of `flights$ta…
#> 2 flights FK origin, time_hour weather FALSE "values of `flights$or…
#> 3 airlines PK carrier NA TRUE ""
#> 4 airports PK faa NA TRUE ""
#> 5 planes PK tailnum NA TRUE ""
#> 6 weather PK origin, time_hour NA TRUE ""
#> 7 flights FK carrier airlines TRUE ""
#> 8 flights FK origin airports TRUE ""
Conclusion
In this tutorial, we have demonstrated how simple it is to create relational data models from local data frames using {dm}, including setting primary and foreign keys and visualizing the resulting relational model.
Further reading
vignette("howto-dm-db")
– This article covers accessing
and working with RDBMSs within your R session, including manipulating
data, filling in missing relationships between tables, getting data out
of the RDBMS and into your model, and deploying your data model to an
RDBMS.
vignette("howto-dm-theory")
– Do you know all about data
frames but very little about relational data models? This quick
introduction will walk you through the key similarities and differences,
and show you how to move from individual data frames to a relational
data model.