Introduction to relational data models
Katharina Brunner
2024-09-24
Source:vignettes/howto-dm-theory.Rmd
howto-dm-theory.Rmd
Computer scientists are familiar with multiple, linked tables. But, because many R users tend to have backgrounds in other disciplines, we present six important terms in relational data modeling to help you to jump-start working with {dm}. These terms are:
- Data Frames and Tables
- Data Model
- Primary Keys
- Foreign Keys
- Referential Integrity
- Normalization
- Relational Databases
1. Data Frames and Tables
A data frame is a fundamental data structure in R. Columns represent variables, rows represent observations. In more technical terms, a data frame is a list of variables of identical length and unique row names. If you imagine it visually, the result is a typical table structure. That is why working with data from spreadsheets is so convenient and the users of the popular {dplyr} package for data wrangling mainly rely on data frames.
The downside is that data frames and flat file systems like spreadsheets can result in bloated tables because they hold many repetitive values. In the worst case, a data frame can contain multiple columns with only a single value different in each row.
This calls for better data organization by utilizing the resemblance between data frames and database tables, which also consist of columns and rows. The elements are just named differently:
Data Frame | Table |
---|---|
Column | Attribute (or Field) |
Row | Tuple (or Record) |
Additionally, number of rows and columns for a data frame are, respectively, analogous to the cardinality and degree of the table.
Relational databases, unlike data frames, do not keep all data in one large table but instead split it into multiple smaller tables. That separation into sub-tables has several advantages:
- all information is stored only once, avoiding redundancy and conserving memory
- all information needs to be updated only once and in one place, improving consistency and avoiding errors that may result from updating (or forgetting to update) the same value in multiple locations
- all information is organized by topic and segmented into smaller tables that are easier to handle
It is for these reasons that separation of data helps with data quality, and they explain the popularity of relational databases in production-level data management.
The downside of this approach is that it is harder to merge together information from different data sources and to identify which entities refer to the same object, a common task when modeling or plotting data.
Thus, to take full advantage of the relational database approach, an associated data model is needed to overcome the challenges that arise when working with multiple tables.
Let’s illustrate this challenge with the data from the nycflights13
dataset that contains detailed information about the 336,776 flights
that departed from New York City in 2013. The information is stored in
five tables.
Details like the full name of an airport are not available
immediately; these can only be obtained by joining or merging the
constituent tables, which can result in long and inflated pipe chains
full of left_join()
, anti_join()
and other
forms of data merging.
In classical {dplyr} notation, you will need four
left_join()
calls to gradually merge the
flights
table to the airlines
,
planes
, airports
, and weather
tables to create one wide data frame.
library(dm)
library(nycflights13)
nycflights13_df <-
flights %>%
left_join(airlines, by = "carrier") %>%
left_join(planes, by = "tailnum") %>%
left_join(airports, by = c("origin" = "faa")) %>%
left_join(weather, by = c("origin", "time_hour"))
nycflights13_df
#> # A tibble: 336,776 × 48
#> year.x month.x day.x dep_time sched_dep_time dep_delay arr_time
#> <int> <int> <int> <int> <int> <dbl> <int>
#> 1 2013 1 1 517 515 2 830
#> 2 2013 1 1 533 529 4 850
#> 3 2013 1 1 542 540 2 923
#> 4 2013 1 1 544 545 -1 1004
#> 5 2013 1 1 554 600 -6 812
#> 6 2013 1 1 554 558 -4 740
#> 7 2013 1 1 555 600 -5 913
#> 8 2013 1 1 557 600 -3 709
#> 9 2013 1 1 557 600 -3 838
#> 10 2013 1 1 558 600 -2 753
#> # ℹ 336,766 more rows
#> # ℹ 41 more variables: sched_arr_time <int>, arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour.x <dbl>, minute <dbl>,
#> # time_hour <dttm>, name.x <chr>, year.y <int>, type <chr>,
#> # manufacturer <chr>, model <chr>, engines <int>, seats <int>,
#> # speed <int>, engine <chr>, name.y <chr>, lat <dbl>, lon <dbl>, …
{dm} offers a more elegant and shorter way to combine tables while augmenting {dplyr}/{dbplyr} workflows.
It is possible to have the best of both worlds: manage your data with {dm} as linked tables, and, when necessary, flatten multiple tables into a single data frame for analysis with {dplyr}.
The next step is to create a data model based on multiple tables:
2. Data Model
A data model shows the structure between multiple tables that are linked together.
The nycflights13
relations can be transferred into the
following graphical representation:
dm <- dm_nycflights13(cycle = TRUE)
dm %>%
dm_draw()
The flights
table is linked to four other tables:
airlines
, planes
, weather
, and
airports
. By using directed arrows, the visualization shows
explicitly the connection between different columns (they are called
attributes in the relational data sphere).
For example: The column carrier
in flights
can be joined with the column carrier
from the
airlines
table.
The links between the tables are established through primary keys and foreign keys.
As an aside, we can also now see how avoiding redundant data by building data models with multiple tables can save memory compared to storing data in single a data frame:
object.size(dm)
#> 476256 bytes
object.size(nycflights13_df)
#> 108020824 bytes
Further Reading: The {dm} methods for visualizing data models.
3. Primary Keys
In a relational data model, each table should have one or several columns that uniquely identify a row. These columns define the primary key (abbreviated with “pk”). If the key consists of a single column, it is called a simple key. A key consisting of more than one column is called a compound key.
Example: In the airlines
table of
nycflights13
the column carrier
is the primary
key, a simple key. The weather
table has the combination of
origin
and time_hour
as primary key, a
compound key.
You can get all primary keys in a dm
by calling
dm_get_all_pks()
:
dm %>%
dm_get_all_pks()
#> # A tibble: 4 × 3
#> table pk_col autoincrement
#> <chr> <keys> <lgl>
#> 1 airlines carrier FALSE
#> 2 airports faa FALSE
#> 3 planes tailnum FALSE
#> 4 weather origin, time_hour FALSE
dm_enum_pk_candidates()
checks suitability of each
column to serve as a simple primary key:
dm %>%
dm_enum_pk_candidates(airports)
#> # A tibble: 8 × 3
#> columns candidate why
#> <keys> <lgl> <chr>
#> 1 faa TRUE ""
#> 2 name TRUE ""
#> 3 lat TRUE ""
#> 4 lon TRUE ""
#> 5 alt FALSE "has duplicate values: 30 (4), 13 (3), 9 (2), 19 (2), …
#> 6 tz FALSE "has duplicate values: -5 (48), -6 (21), -8 (12), -7 (…
#> 7 dst FALSE "has duplicate values: A (84), N (2)"
#> 8 tzone FALSE "has duplicate values: America/New_York (48), America/…
Further Reading: The {dm} package offers several functions for dealing with primary keys.
4. Foreign Keys
The counterpart of a primary key in one table is the foreign key in another table. In order to join two tables, the primary key of the first table needs to be referenced from the second table. This column or these columns are called the foreign key (abbreviated with “fk”).
For example, if you want to link the airlines
table to
the flights
table, the primary key in airlines
needs to match the foreign key in flights
. This condition
is satisfied because the column carrier
is present as a
primary key in the airlines
table as well as a foreign key
in the flights
table. In the case of compound keys, the
origin
and time_hour
columns (which form the
primary key of the weather
table) are also present in the
flights
table.
You can find foreign key candidates for simple keys with the function
dm_enum_fk_candidates()
; they are marked with
TRUE
in the candidate
column.
dm %>%
dm_enum_fk_candidates(flights, airlines)
#> # A tibble: 19 × 3
#> columns candidate why
#> <keys> <lgl> <chr>
#> 1 carrier TRUE ""
#> 2 year FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 3 month FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 4 day FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 5 dep_time FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 6 sched_dep_time FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 7 dep_delay FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 8 arr_time FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 9 sched_arr_time FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 10 arr_delay FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 11 flight FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 12 tailnum FALSE "values of `flights$tailnum` not in `airlines$…
#> 13 origin FALSE "values of `flights$origin` not in `airlines$c…
#> 14 dest FALSE "values of `flights$dest` not in `airlines$car…
#> 15 air_time FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 16 distance FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 17 hour FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 18 minute FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
#> 19 time_hour FALSE "\u001b[1m\u001b[22mCan't join `x$value1` with…
Additionally, you can also extract a summary of all foreign key
relations present in a dm
object using
dm_get_all_fks()
:
dm %>%
dm_get_all_fks()
#> # A tibble: 5 × 5
#> child_table child_fk_cols parent_table parent_key_cols on_delete
#> <chr> <keys> <chr> <keys> <chr>
#> 1 flights carrier airlines carrier no_action
#> 2 flights origin airports faa no_action
#> 3 flights dest airports faa no_action
#> 4 flights tailnum planes tailnum no_action
#> 5 flights origin, time_hour weather origin, time_hour no_action
Further Reading: All {dm} functions for working with foreign keys.
5. Referential Integrity
A data set has referential integrity if all relations between tables are valid. That is, every foreign key holds a primary key that is present in the parent table. If a foreign key contains a reference where the corresponding row in the parent table is not available, that row is an orphan row and the database no longer has referential integrity.
{dm} allows checking referential integrity with the
dm_examine_constraints()
function. The following conditions
are checked:
- All primary key values must be unique and not missing (i.e.,
NA
s are not allowed). - Each foreign key value must have a corresponding primary key value.
In the example data model, for a substantial share of the flights, detailed information for the corresponding airplane is not available:
#> • Table `flights`: foreign key `dest` into table `airports`: values of `flights$dest` not in `airports$faa`: SJU (30), BQN (6), STT (4), PSE (2)
#> • Table `flights`: foreign key `tailnum` into table `planes`: values of `flights$tailnum` not in `planes$tailnum`: N725MQ (6), N537MQ (5), N722MQ (5), N730MQ (5), N736MQ (5), …
Establishing referential integrity is important for providing clean
data for analysis or downstream users. See
vignette("howto-dm-rows")
for more information on adding,
deleting, or updating individual rows, and
vignette("tech-dm-zoom")
for operations on the data in a
data model.
6. Normalization
Normalization is a technical term that describes the central design principle of a relational data model: splitting data into multiple tables.
A normalized data schema consists of several relations (tables) that are linked with attributes (columns). The relations can be joined together by means of primary and foreign keys. The main goal of normalization is to keep data organization as clean and simple as possible by avoiding redundant data entries.
For example, if you want to change the name of one airport in the
nycflights13
dataset, you will only need to update a single
data value. This principle is sometimes called the single point of
truth.
# Update in one single location...
airlines[airlines$carrier == "UA", "name"] <- "United broke my guitar"
airlines %>%
filter(carrier == "UA")
#> # A tibble: 1 × 2
#> carrier name
#> <chr> <chr>
#> 1 UA United broke my guitar
#> # A tibble: 336,776 × 2
#> flight name
#> <int> <chr>
#> 1 1545 United broke my guitar
#> 2 1714 United broke my guitar
#> 3 1141 American Airlines Inc.
#> 4 725 JetBlue Airways
#> 5 461 Delta Air Lines Inc.
#> 6 1696 United broke my guitar
#> 7 507 JetBlue Airways
#> 8 5708 ExpressJet Airlines Inc.
#> 9 79 JetBlue Airways
#> 10 301 American Airlines Inc.
#> # ℹ 336,766 more rows
Another way to demonstrate normalization is splitting a table into two parts.
Let’s look at the planes
table, which consists of 3322
individual tail numbers and corresponding information for the specific
airplane, like the year it was manufactured or the average cruising
speed.
The function decompose_table()
extracts two new tables
and creates a new key model_id
, that links both tables.
This results in a parent_table
and a
child_table
that differ massively in the number of
rows:
planes %>%
decompose_table(model_id, model, manufacturer, type, engines, seats, speed)
#> $child_table
#> # A tibble: 3,322 × 4
#> tailnum year engine model_id
#> <chr> <int> <chr> <int>
#> 1 N10156 2004 Turbo-fan 120
#> 2 N102UW 1998 Turbo-fan 93
#> 3 N103US 1999 Turbo-fan 93
#> 4 N104UW 1999 Turbo-fan 93
#> 5 N10575 2002 Turbo-fan 119
#> 6 N105UW 1999 Turbo-fan 93
#> 7 N107US 1999 Turbo-fan 93
#> 8 N108UW 1999 Turbo-fan 93
#> 9 N109UW 1999 Turbo-fan 93
#> 10 N110UW 1999 Turbo-fan 93
#> # ℹ 3,312 more rows
#>
#> $parent_table
#> # A tibble: 147 × 7
#> model_id model manufacturer type engines seats speed
#> <int> <chr> <chr> <chr> <int> <int> <int>
#> 1 120 EMB-145XR EMBRAER Fixed wing mu… 2 55 NA
#> 2 93 A320-214 AIRBUS INDUSTRIE Fixed wing mu… 2 182 NA
#> 3 119 EMB-145LR EMBRAER Fixed wing mu… 2 55 NA
#> 4 39 737-824 BOEING Fixed wing mu… 2 149 NA
#> 5 68 767-332 BOEING Fixed wing mu… 2 330 NA
#> 6 52 757-224 BOEING Fixed wing mu… 2 178 NA
#> 7 94 A320-214 AIRBUS Fixed wing mu… 2 182 NA
#> 8 112 CL-600-2D24 BOMBARDIER INC Fixed wing mu… 2 95 NA
#> 9 30 737-724 BOEING Fixed wing mu… 2 149 NA
#> 10 27 737-524 BOEING Fixed wing mu… 2 149 NA
#> # ℹ 137 more rows
While child_table
contains 3322 unique
tailnum
rows and therefore consists of 3322 rows, just like
the original planes
table, the parent_table
shrunk to just 147 rows, enough to store all relevant combinations and
avoid storing redundant information.
Further Reading: See the Simple English Wikipedia article on database normalization for more details.
7. Relational Databases
{dm} is built upon relational data models but it is not a database itself. Databases are systems for data management and many of them are constructed as relational databases (e.g., SQLite, MySQL, MSSQL, Postgres, etc.). As you can guess from the names of the databases, SQL, short for Structured Querying Language, plays an important role: it was invented for the purpose of querying relational databases.
In production, the data is stored in a relational database and {dm} is used to work with the data.
Therefore, {dm} can copy data from and to databases, and works transparently with both in-memory data and with relational database systems.
For example, let’s create a local SQLite database and copy the
dm
object to it:
con_sqlite <- DBI::dbConnect(RSQLite::SQLite())
con_sqlite
#> <SQLiteConnection>
#> Path:
#> Extensions: TRUE
DBI::dbListTables(con_sqlite)
#> character(0)
copy_dm_to(con_sqlite, dm)
DBI::dbListTables(con_sqlite)
#> [1] "airlines_1_20200828_071303_12345" "airports_1_20200828_071303_12345"
#> [3] "flights_1_20200828_071303_12345" "planes_1_20200828_071303_12345"
#> [5] "weather_1_20200828_071303_12345"
In the opposite direction, dm
can also be populated with
data from a database. Unfortunately, keys currently can be learned only
for Microsoft SQL Server and Postgres, but not for SQLite. Therefore,
the dm
contains the tables but not the keys:
dm_from_con(con_sqlite)
#> Keys could not be queried.
#> ── Table source ───────────────────────────────────────────────────────────
#> src: sqlite 3.46.0 []
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `airlines_1_20200828_071303_12345`, `airports_1_20200828_071303_12345`, `flights_1_20200828_071303_12345`, `planes_1_20200828_071303_12345`, `weather_1_20200828_071303_12345`
#> Columns: 53
#> Primary keys: 0
#> Foreign keys: 0
Remember to terminate the database connection:
DBI::dbDisconnect(con_sqlite)
Conclusion
In this article, we have learned about some of the most fundamental concepts and data structures associated with the relational database management system (RDBMS).
Further reading
vignette("howto-dm-db")
– This article covers accessing
and working with RDBMSs within your R session, including manipulating
data, filling in missing relationships between tables, getting data out
of the RDBMS and into your model, and deploying your data model to an
RDBMS.
vignette("howto-dm-df")
– Is your data in local data
frames? This article covers creating a data model from your local data
frames, including building the relationships in your data model,
verifying your model, and leveraging the power of dplyr to operate on
your data model.