Skip to contents

{dm} was designed to make connecting to and working with a relational database management system (RDBMS) as straightforward as possible. To this end, a dm object can be created from any database that has a {DBI} backend available (see list).

When a dm object is created via a DBI connection to an RDBMS, it can import all the tables in the database, the active schema, or a limited set. For some RDBMS, such as Postgres, SQL Server and MariaDB, primary and foreign keys are also imported and do not have to be manually added afterwards.

To demonstrate, we will connect to a relational dataset repository with a database server that is publicly accessible without registration. It hosts a financial dataset that contains loan data along with relevant account information and transactions. We chose this dataset because the relationships between loan, account, and transactions tables are a good representation of databases that record real-world business transactions.

Below, we open a connection to the publicly accessible database server using their documented connection parameters. Connection details vary from database to database. Before connecting to your own RDBMS, you may want to read vignette("DBI", package = "DBI") for further information.

library(RMariaDB)

my_db <- dbConnect(
  MariaDB(),
  username = "guest",
  password = "relational",
  dbname = "Financial_ijs",
  host = "relational.fit.cvut.cz"
)
library(RMariaDB)
my_db <- dm:::financial_db_con()

Creating a dm object takes a single call to dm_from_con() with the DBI connection object as its argument.

library(dm)

my_dm <- dm_from_con(my_db)
#> Keys queried successfully, use `learn_keys = TRUE` to mute this message.
my_dm
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  mysql  [guest@relational.fit.cvut.cz:NA/Financial_ijs]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `accounts`, `cards`, `clients`, `disps`, `districts`, … (9 total)
#> Columns: 57
#> Primary keys: 9
#> Foreign keys: 8

The components of the my_dm object are lazy tables powered by {dbplyr}. {dbplyr} translates the {dplyr} grammar of data manipulation into queries the database server understands. Lazy tables defer downloading of table data until results are required for printing or local processing.

Building a dm from a subset of tables

A dm can also be constructed from individual tables or views. This is useful for when you want to work with a subset of a database’s tables, perhaps from different schemas.

Below, we use the $ notation to extract two tables from the financial database. Then we create our dm by passing the tables in as arguments. Note that the tables arguments have to all be from the same source, in this case my_db.

dbListTables(my_db)
#> [1] "accounts"  "cards"     "clients"   "disps"     "districts" "loans"    
#> [7] "orders"    "tkeys"     "trans"

library(dbplyr)
loans <- tbl(my_db, "loans")
accounts <- tbl(my_db, "accounts")

my_manual_dm <- dm(loans, accounts)
my_manual_dm
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  mysql  [guest@relational.fit.cvut.cz:NA/Financial_ijs]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `loans`, `accounts`
#> Columns: 11
#> Primary keys: 0
#> Foreign keys: 0

Define Primary and Foreign Keys

Primary keys and foreign keys are how relational database tables are linked with each other. A primary key is a column or column tuple that has a unique value for each row within a table. A foreign key is a column or column tuple containing the primary key for a row in another table. Foreign keys act as cross references between tables. They specify the relationships that gives us the relational database. For more information on keys and a crash course on databases, see vignette("howto-dm-theory").

In many cases, dm_from_con() already returns a dm with all keys set. If not, dm allows us to define primary and foreign keys ourselves. For this, we use learn_keys = FALSE to obtain a dm object with only the tables.

library(dm)

fin_dm <- dm_from_con(my_db, learn_keys = FALSE)
fin_dm
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  mysql  [guest@relational.fit.cvut.cz:NA/Financial_ijs]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `accounts`, `cards`, `clients`, `disps`, `districts`, … (9 total)
#> Columns: 57
#> Primary keys: 0
#> Foreign keys: 0

The model diagram provided by our test database loosely illustrates the intended relationships between tables. In the diagram, we can see that the loans table should be linked to the accounts table. Below, we create those links in 3 steps:

  1. Add a primary key id to the accounts table
  2. Add a primary key id to the loans table
  3. Add a foreign key account_id to the loans table referencing the accounts table

Then we assign colors to the tables and draw the structure of the dm.

Note that when the foreign key is created, the primary key in the referenced table does not need to be specified, but the primary key must already be defined. And, as mentioned above, primary and foreign key constraints on the database are currently only imported for Postgres, SQL Server databases and MariaDB, and only when dm_from_con() is used. This process of key definition needs to be done manually for other databases.

my_dm_keys <-
  my_manual_dm %>%
  dm_add_pk(accounts, id) %>%
  dm_add_pk(loans, id) %>%
  dm_add_fk(loans, account_id, accounts) %>%
  dm_set_colors(green = loans, orange = accounts)

my_dm_keys %>%
  dm_draw()
%0 accounts accountsidloans loansidaccount_idloans:account_id->accounts:id

Once you have instantiated a dm object, you can continue to add tables to it. For tables from the original source for the dm, use dm()

trans <- tbl(my_db, "trans")

my_dm_keys %>%
  dm(trans)
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  mysql  [guest@relational.fit.cvut.cz:NA/Financial_ijs]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `loans`, `accounts`, `trans`
#> Columns: 21
#> Primary keys: 2
#> Foreign keys: 1

For tables from other sources or from the local environment, dplyr::copy_to() is used. copy_to() is discussed later in this article.

Transient nature of operations

Like other R objects, a dm is immutable and all operations performed on it are transient unless stored in a new variable.

my_dm_keys
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  mysql  [guest@relational.fit.cvut.cz:NA/Financial_ijs]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `loans`, `accounts`
#> Columns: 11
#> Primary keys: 2
#> Foreign keys: 1

my_dm_trans <-
  my_dm_keys %>%
  dm(trans)

my_dm_trans
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  mysql  [guest@relational.fit.cvut.cz:NA/Financial_ijs]
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `loans`, `accounts`, `trans`
#> Columns: 21
#> Primary keys: 2
#> Foreign keys: 1

And, like {dbplyr}, results are never written to a database unless explicitly requested.

my_dm_keys %>%
  dm_flatten_to_tbl(loans)
#> Renaming ambiguous columns: %>%
#>   dm_rename(loans, date.loans = date) %>%
#>   dm_rename(accounts, date.accounts = date)
#> # Source:   SQL [?? x 10]
#> # Database: mysql  [guest@relational.fit.cvut.cz:NA/Financial_ijs]
#>       id account…¹ date.loans amount durat…² payme…³ status distr…⁴ frequ…⁵
#>    <int>     <int> <date>      <dbl>   <int>   <dbl> <chr>    <int> <chr>  
#>  1  4959         2 1994-01-05  80952      24    3373 A            1 POPLAT…
#>  2  4961        19 1996-04-29  30276      12    2523 B           21 POPLAT…
#>  3  4962        25 1997-12-08  30276      12    2523 A           68 POPLAT…
#>  4  4967        37 1998-10-14 318480      60    5308 D           20 POPLAT…
#>  5  4968        38 1998-04-19 110736      48    2307 C           19 POPLAT…
#>  6  4973        67 1996-05-02 165960      24    6915 A           16 POPLAT…
#>  7  4986        97 1997-08-10 102876      12    8573 A           74 POPLAT…
#>  8  4988       103 1997-12-06 265320      36    7370 D           44 POPLAT…
#>  9  4989       105 1998-12-05 352704      48    7348 C           21 POPLAT…
#> 10  4990       110 1997-09-08 162576      36    4516 C           36 POPLAT…
#> # … with more rows, 1 more variable: date.accounts <date>, and abbreviated
#> #   variable names ¹​account_id, ²​duration, ³​payments, ⁴​district_id,
#> #   ⁵​frequency

my_dm_keys %>%
  dm_flatten_to_tbl(loans) %>%
  sql_render()
#> Renaming ambiguous columns: %>%
#>   dm_rename(loans, date.loans = date) %>%
#>   dm_rename(accounts, date.accounts = date)
#> <SQL> SELECT
#>   `LHS`.`id` AS `id`,
#>   `account_id`,
#>   `date.loans`,
#>   `amount`,
#>   `duration`,
#>   `payments`,
#>   `status`,
#>   `district_id`,
#>   `frequency`,
#>   `date.accounts`
#> FROM (
#>   SELECT
#>     `id`,
#>     `account_id`,
#>     `date` AS `date.loans`,
#>     `amount`,
#>     `duration`,
#>     `payments`,
#>     `status`
#>   FROM `loans`
#> ) `LHS`
#> LEFT JOIN (
#>   SELECT `id`, `district_id`, `frequency`, `date` AS `date.accounts`
#>   FROM `accounts`
#> ) `RHS`
#>   ON (`LHS`.`account_id` = `RHS`.`id`)

Performing operations on tables by “zooming”

As the dm is a collection of tables, if we wish to perform operations on an individual table, we set it as the context for those operations using dm_zoom_to(). See vignette("tech-dm-zoom") for more detail on zooming.

dm operations are transient unless persistence is explicitly requested. To make our chain of manipulations on the selected table permanent, we assign the result of dm_insert_zoomed() to a new object, my_dm_total. This is a new dm object, derived from my_dm_keys, with a new lazy table total_loans linked to the accounts table.

my_dm_total <-
  my_dm_keys %>%
  dm_zoom_to(loans) %>%
  group_by(account_id) %>%
  summarize(total_amount = sum(amount, na.rm = TRUE)) %>%
  ungroup() %>%
  dm_insert_zoomed("total_loans")

Context is set to the table “loans” using dm_zoom_to(loans). You can learn more about zooming in the tutorial vignette("tech-dm-zoom"). We then use {dplyr} functions on the zoomed table to generate a new summary table.

summarize() returns a temporary table with one row for each group created by the preceding group_by() function. The columns in the temporary table are constrained to the columns passed as arguments to the group_by() function and the column(s) created by the summarize() function.

dm_insert_zoomed("total_loans") adds the temporary table created by summarize() to the data model under a new name, total_loans. Because the grouping variable account_id is a primary key, the new derived table is automatically linked to the accounts table.

my_dm_total %>%
  dm_set_colors(violet = total_loans) %>%
  dm_draw()
%0 accounts accountsidloans loansidaccount_idloans:account_id->accounts:id total_loans total_loansaccount_idtotal_loans:account_id->accounts:id

The resulting table total_loans can be accessed like any other table in the dm object.

my_dm_total$total_loans
#> # Source:   SQL [?? x 2]
#> # Database: mysql  [guest@relational.fit.cvut.cz:NA/Financial_ijs]
#>    account_id total_amount
#>         <int>        <dbl>
#>  1          2        80952
#>  2         19        30276
#>  3         25        30276
#>  4         37       318480
#>  5         38       110736
#>  6         67       165960
#>  7         97       102876
#>  8        103       265320
#>  9        105       352704
#> 10        110       162576
#> # … with more rows

It is a lazy table powered by the {dbplyr} package: the results are not materialized; instead, an SQL query is built and executed each time the data is requested.

my_dm_total$total_loans %>%
  sql_render()
#> <SQL> SELECT `account_id`, SUM(`amount`) AS `total_amount`
#> FROM `loans`
#> GROUP BY `account_id`

Use compute() on a zoomed table to materialize it to a temporary table and avoid recomputing. See vignette("howto-dm-copy") for more details.

Downloading data

When it becomes necessary to move data locally for analysis or reporting, the {dm} method collect() is used. Operations on dm objects for databases are limited to report only the first ten results. collect() forces the evaluation of all SQL queries and the generation of the complete set of results. The resulting tables are transferred from the RDBMS and stored as local tibbles.

my_dm_local <-
  my_dm_total %>%
  collect()

my_dm_local$total_loans
#> # A tibble: 682 × 2
#>    account_id total_amount
#>         <int>        <dbl>
#>  1          2        80952
#>  2         19        30276
#>  3         25        30276
#>  4         37       318480
#>  5         38       110736
#>  6         67       165960
#>  7         97       102876
#>  8        103       265320
#>  9        105       352704
#> 10        110       162576
#> # … with 672 more rows

Use this method with caution. If you are not sure of the size of the dataset you will be downloading, you can call dm_nrow() on your dm for the row count of your data model’s tables.

my_dm_total %>%
  dm_nrow()
#>       loans    accounts total_loans 
#>         682        4500         682

Persisting results

It is just as simple to move a local relational model into an RDBMS as is using collect() to download it. The method used is copy_dm_to() and it takes as arguments a database connection and a dm object. In the example below, a local SQLite database is used to demonstrate it, but {dm} is designed to work with any RDBMS supported by {DBI}.

destination_db <- DBI::dbConnect(RSQLite::SQLite())

deployed_dm <- copy_dm_to(destination_db, my_dm_local)

deployed_dm
#> ── Table source ───────────────────────────────────────────────────────────
#> src:  sqlite 3.39.4 []
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `loans`, `accounts`, `total_loans`
#> Columns: 13
#> Primary keys: 2
#> Foreign keys: 2
my_dm_local
#> ── Metadata ───────────────────────────────────────────────────────────────
#> Tables: `loans`, `accounts`, `total_loans`
#> Columns: 13
#> Primary keys: 2
#> Foreign keys: 2

In the output, you can observe that the src for deployed_dm is the SQLite database, while for my_dm_local the source is the local R environment.

Persisting tables are covered in more detail in vignette("howto-dm-copy").

When done, do not forget to disconnect:

DBI::dbDisconnect(destination_db)
DBI::dbDisconnect(my_db)

Conclusion

In this tutorial, we have demonstrated how simple it is to load a database into a dm object and begin working with it. Currently, loading a dm from most RDBMS requires you to manually set key relations, but {dm} provides methods to make this straightforward. It is planned that future versions of dm will support automatic key creation for more RDBMS.

The next step is to read vignette("howto-dm-copy"), where copying your tables to and from an RDBMS is covered. vignette("howto-dm-rows") discusses manipulation of individual rows in a database.

Further reading

vignette("howto-dm-df") – Is your data in local data frames? This article covers creating a data model from your local data frames, including building the relationships in your data model, verifying your model, and leveraging the power of dplyr to operate on your data model.

vignette("howto-dm-theory") – Do you know all about data frames but very little about relational data models? This quick introduction will walk you through the key similarities and differences, and show you how to move from individual data frames to a relational data model.