Easy database access with a custom R6 package

Conveniently accessing databases via ODBC from R with a uniform API, secure automatic authentication and worry free connection management.

r
R6
databases
sql
Author

red

Published

December 18, 2022

Spotlight

# open database connection
db_con <-
  OdbcConnector$new(db_service_name = "postgres_test")

# define some parameters
years <- c(2007, 2009)
table_to_query <- "penguins"

# construct, send and collect query
db_con$get_query("SELECT year, species, island,
                         body_mass_g
                    FROM {`table_to_query`}
                   WHERE year IN ({years*})")
standard_query: 0.07 sec elapsed
# A tibble: 230 × 4
    year species island    body_mass_g
   <int> <chr>   <chr>           <int>
 1  2007 Adelie  Torgersen        3750
 2  2007 Adelie  Torgersen        3800
 3  2007 Adelie  Torgersen        3250
 4  2007 Adelie  Torgersen          NA
 5  2007 Adelie  Torgersen        3450
 6  2007 Adelie  Torgersen        3650
 7  2007 Adelie  Torgersen        3625
 8  2007 Adelie  Torgersen        4675
 9  2007 Adelie  Torgersen        3475
10  2007 Adelie  Torgersen        4250
# ℹ 220 more rows
# get column names from a table
cols <-
  db_con$get_query("SELECT *
                      FROM information_schema.columns
                     WHERE table_schema = 'public'
                       AND table_name = 'penguins'",
    quietly = TRUE
  ) |>
  pull(column_name)

# filter column names
measure_cols <- cols[str_detect(cols, ".*mm$")]

# query just the columns that fit certain criteria
db_con$get_query("SELECT year, island, species,
                         {`measure_cols`*}
                    FROM penguins",
  quietly = TRUE
)
# A tibble: 344 × 6
   year island    species bill_length_mm bill_depth_mm flipper_length_mm
  <int> <chr>     <chr>            <dbl>         <dbl>             <int>
1  2007 Torgersen Adelie            39.1          18.7               181
2  2007 Torgersen Adelie            39.5          17.4               186
3  2007 Torgersen Adelie            40.3          18                 195
4  2007 Torgersen Adelie            NA            NA                  NA
5  2007 Torgersen Adelie            36.7          19.3               193
# ℹ 339 more rows
# write data to database
db_con$write_table(table_name = "diamonds",
                   data = ggplot2::diamonds |>
                     slice_sample(prop = 0.2),
                   overwrite = TRUE)
✔ Data written to postgres
# execute statements 
db_con$execute_statement("GRANT SELECT ON \"DIAMONDS\" TO public;")
ℹ Statement executed @ postgres

For the docker-database setup see Section 5.1

Why

Database connections in R can be a hassle since you need to pass around connection objects. Furthermore, there is no straigthforward way to securely store and pass credentials. What I present here is my solution to this problem.

I’ll demonstrate a template on how to do something like this based on example databases (PostgreSQL and mariaDB) running in docker containers. I will not cover R6, packages or unit-testing, just try to show whats new, usefull or “innovative”.

Features

  • Connection to all Databases which offer an ODBC driver and are supporter in by the wrapped DBI package.
  • Full support for glue::glue_sql sql parsing and query construction within database methods.
  • Automatic and secure authentication using the OS credential store (tested on Ubuntu and Windows) with keyring
  • Custom methods to access often used datasets/reference data like ICD and ATC catalogues or whatever your (business) needs may be.
  • Uniform API with a persistent connection object that carries all the methods implemented for the database.

How

Framework - R6

R6 is a fully fledged implementation of an object oriented programming (OOP) framework with inheritance, public and private methods as well as statefulness if required.

Why full OOP? In order to construct, execute and fetch SQL queries from R the functions like DBI::dbGetQuery or glue::glue_sql need a database connection object. This connection object houses all the metadata of the database connection: The database (dialect), authentication and settings. Only from it things like glue_sql can deduce how SQL queries need to be constructed, for example which characters are escaped and how.

Aside: glue_sql

In short, glue::glue_sql() allows you to write queries intervoven with r code and special reference syntax to pass r objects, table and column names programmatically. Most importantly, it takes care of syntax differences between SQL dialects:

# open db connections to PostgreSQL and MariaDB
pg_con <- OdbcConnector$new(db_service_name = "postgres_test")
maria_con <- OdbcConnector$new(db_service_name = "maria_test")

# query elements
random_cols <- c("UPPERCASE", "lowercase", "weirdNumber5")
years_ <- 2010:2013

# prepare the query for both databases
purrr::map(
  c(pg_con$conn, maria_con$conn),
  ~ glue::glue_sql("SELECT {`random_cols`*}
                              FROM {`table_to_query`}
                             WHERE year IN ({years_*})",
    .con = .
  )
) |>
  set_names("postgres", "maria")
$postgres
<SQL> SELECT "UPPERCASE", "lowercase", "weirdNumber5"
 FROM "penguins"
WHERE year IN (2010, 2011, 2012, 2013)

$maria
<SQL> SELECT `UPPERCASE`, `lowercase`, `weirdNumber5`
 FROM `penguins`
WHERE year IN (2010, 2011, 2012, 2013)

Therefore the connector object needs to be persistent since establishing database connections every time a query is executed is possible but inelegant. Also passing the connector object every time a query is sent, a table written or a statement executed is wearisome and one needs to manually disconnect from the database.

The base class: DbConnector

# roxygen documentation omitted for brevity, see entire code below
DbConnector <- R6::R6Class(
  classname = "DbConnector",
  inherit = NULL,

  # private items
  private = list(
    ..conn = NULL
  ),

  # public items
  public = list(

    #' @param driver Database driver like `odbc::odbc()`
    #' @param ... Arguments passed to `DBI::dbConnect`
1    initialize = function(driver, ...) {
      # establish db connection
      private$..conn <-
        DBI::dbConnect(driver, ...)
    },
                                                                  
2    #' @description Closes database connections when the object is  garbage collected. Invoke garbage collection manually with `gc()`
    finalize = function() {                                
      cli::cli_alert_success("Closing db connection           {private$..conn@info$servername}")    
      DBI::dbDisconnect(private$..conn)
    },
    # [...]
1
The base class just wraps the DBI::dbConnect() function to establish a database connection. The class connection method is so generic to deal with any database connection using DBI. The connector object from DBI::dbConnect() is stored in a private field of the DbConnector instance. From there any class method can use it which saves passing the connection manually.
2
The first real important part is the finalize method which disconnects the database connection when the object is garbage collected and consequently when the R session is closed.

get_query method

This is the workhorse of the database class: It assembles and parses queries, sends it to the database and collects the results.

# [...]
    # get_query
    # roxygen documentation omitted for brevity, see entire code below
    get_query = function(query, case = "snake",
                         timer_name = "standard_query",
                         convert_dates = TRUE,
                         quietly = FALSE,
                         raw_query = FALSE,
                         env = environment()) {
      # input tests
      assert_that(purrr::is_logical(convert_dates),
        msg = "convert_dates must be TRUE/FALSE"
      )

      # construct query if no pre-constructed  query was provided
1      if (!raw_query) {
        query <- glue::glue_sql(query,
          .con = private$..conn,
          .envir = env
        )
      }

      # start timing
2      tictoc::tic(msg = timer_name)
      res <- DBI::dbGetQuery(
        conn = private$..conn,
        statement = query
      ) %>%
        dplyr::as_tibble() %>%
        janitor::clean_names(case = case)

      tictoc::toc(log = TRUE, quiet = quietly)

      if (convert_dates == TRUE) {
        res <- res %>% mutate(across(where(lubridate::is.POSIXct), as.Date))
      }

      return(res)
    },
# [...]
1
It wraps glue::glue_sql() to construct the query string which can be entered as an argument in glue_sql notation, passes the connector object which is encapsulated in the DbConnector instance and specifies the environment in which glue_sql() searches for the R objects in the query.
2
Collects the pre-constructed query and converts column names to a standard specified in case. The time it takes to fetch the query is logged using tictoc::tic() and tictoc::toc() which is printed to console while quietly = FALSE.

The method demonstrates how useful the OOP Approach in R6 is: The database connection private$..conn needs to be passed firstly, to glue::glue_sql()so that it can parse the statement for the appropriate SQL dialect. Secondly, to DBI::dbGetQuery() to provide the database to query. Doing that manually would be as inefficient as it would be bothersome.

Entire DbConnector class code

DbConnector has two more core methods which are self explanatory and can be seen in the entire class code.

Show entire DbConnector code
#' DbConnector
#'
#' @description
#' Superclass for connectors to SQL databases.
#' @importFrom assertthat assert_that
#' @export
DbConnector <- R6::R6Class(
  classname = "DbConnector",
  inherit = NULL,

  # private items
  private = list(
    ..conn = NULL
  ),

  # public items
  public = list(

    #' @param driver Database driver like `odbc::odbc()`
    #' @param ... Arguments passed to `DBI::dbConnect`
    initialize = function(driver, ...) {
      # establish db connection

      private$..conn <-
        DBI::dbConnect(driver, ...)
    },

    #' @description Closes database connections when the object is garbage collected. Invoke garbage collection manually with `gc()`
    finalize = function() {
      cli::cli_alert_success("Closing db connection {private$..conn@info$servername}")
      DBI::dbDisconnect(private$..conn)
    },

    # methods for class

    # get_query
    #' @description Wrapper around `DBI::dbgetQuery` and formats the output
    #' as a tibble, makes use of `janitor::clean_names` and converts the
    #' SQL `POSIXct` to `Date` format.
    #' @param query String containing SQL code and/or `glue::glue_sql` syntax.
    #' @param convert_dates Convert dates to `as.Date` or just leave them be (mostly results in dttm)
    #' @param case `janitor::clean_names` option to which format variable names should be converted. Defaults to "snake"
    #' @param timer_name Name of the `tictoc::tic()` timer in which query time is stored
    #' @param quietly Supress [toc][tictoc::toc] query time posts. Defaults to `FALSE`.
    #' @param raw_query Whether the query passed is already a parsed `SQL` query. Defaults to `FALSE`. Use for passing preconstructed queries.
    #' @param env Environment the glue_sql String is parsed in. Defaults to `environment()`, i.e. the calling environment. Here for experimental reasons.
    #' @return Returns a `tibble` with the query data
    get_query = function(query, case = "snake",
                         timer_name = "standard_query",
                         convert_dates = TRUE,
                         quietly = FALSE,
                         raw_query = FALSE,
                         env = environment()) {
      # input tests
      assert_that(purrr::is_logical(convert_dates),
        msg = "convert_dates must be TRUE/FALSE"
      )

      # raw query
      if (!raw_query) {
        query <- glue::glue_sql(query,
          .con = private$..conn,
          .envir = env
        )
      }

      # start timing
      tictoc::tic(msg = timer_name)
      res <- DBI::dbGetQuery(
        conn = private$..conn,
        statement = query
      ) %>%
        dplyr::as_tibble() %>%
        janitor::clean_names(case = case)

      tictoc::toc(log = TRUE, quiet = quietly)

      if (convert_dates == TRUE) {
        res <- res %>% mutate(across(where(lubridate::is.POSIXct), as.Date))
      }

      return(res)
    },

    # write_table
    #' @description write tables to database. Wrapper around [DBI::dbWriteTable()].
    #' @param table_name Table Name as character. Use a name that is valid for the databse. Otherwhise it will be sanitised by `janitor::make_clean_names`.
    #' @param data Data to be written either as `data.table` or `tibble`.
    #' @param overwrite Allow overwriting the destination table.
    #' @param field_types Additional field types used to override derived types.
1    write_table = function(table_name,
                           data,
                           overwrite = FALSE,
                           field_types = NULL) {
      assert_that(is.data.frame(data) || tibble::is_tibble(data))
      assert_that(is.character(table_name))

      # transform table name to valid name
      table_name <- janitor::make_clean_names(table_name,
        case = "all_caps"
      )

      # change column names to upper case

      data <- janitor::clean_names(
        dat = data,
        case = "all_caps"
      )

      # write to database

      state <-
        DBI::dbWriteTable(
          conn = private$..conn,
          name = table_name,
          value = data,
          overwrite = overwrite,
          field.types = field_types
        )

      if (state) {
        cli::cli_alert_success("Data written to {private$..conn@info$dbname}")
      }
    },

    # db_execute
    #' @description execute statements against the database. Wrapper around [DBI::dbExecute()].
    #' @param statement SQL statement to execute as a string. Is internally passed to [glue::glue_sql()].
    #' @param env Pass the appropriate environment where `glue_sql` looks for objects. Only needed wihhin functions or script execution.
    #' @examples
    #' test$execute_statement("GRANT SELECT ON IRIS TO HSPRANDOM;")
    #' test$execute_statement("COMMIT;")
2    execute_statement = function(statement,
                                 env = environment()) {
      state <-
        DBI::dbExecute(
          conn = private$..conn,
          statement = glue::glue_sql(statement,
            .envir = env,
            .con = private$..conn
          )
        )

      if (is.numeric(state)) {
        cli::cli_alert_info("Statement executed @ {private$..conn@info$dbname}")
      }
    }
  ),


  # active bindings
3  active = list(
    conn = function() {
      #' @field conn `DBI` database connection the connector object is using.
      private$..conn
    }
  )
)
1
write_table Writes data to the database. Important feature is that column names are made syntactically correct with janitor::clean_names() in all_caps. Otherwise one ends up with those annoying SQL tables where you have to quote column names. The same is done to the table name.
2
execute_statement wraps DBI::dbExecute() by passing the statement through glue::glue_sql() and informing whether it was successful.
3
In R6 an active binding provides an interface to the hidden private fields. The conn field allows manual access to the database connection if the user needs it.

The OdbcConnector class

Prelude: Authentication with keyring

The keyring package allows to store and access secrets in the system credential store. This has two big advantages: Firstly, credentials can be stored and retrieved at leisure after a one-time setup. Secondly, the system credential store is unlocked when you log into your windows/linux/os x Session. Especially the latter gives the OdbcConnector class its power of convenience: After a one time setup you and your colleagues need no extra step to safely authenticate at your companies databases1.

One-time keyring setup

Below we store out username and passwords (you’ll get an interactive prompt to enter it) in the systems credential store. The crucial part for automatic log-on with OdbcConnector is the name of the service. It has to match the name of the ODBC connection to work (as easy as possible).

keyring::key_set(service = "maria_test", 
                 username = "mysql")

keyring::key_set(service = "postgres_test", 
                 username = "postgres")

keyring::key_list() %>% 
  filter(str_detect(service), "test")
        service username
1 postgres_test postgres
2    maria_test    mysql

Inheritance

Firstly, the OdbcConnector class inherits all attributes from its Superclass DbConnector. This means all methods and attributes we do not re-designate are inherited.

But re-designate we do, namely the public items which store the name of the database service db_service_name, user_id, and dsn. These fields will hold information on our database connection.

# roxygen documentation excluded for brevity
OdbcConnector <- R6::R6Class(
  classname = "OdbcConnector",
  inherit = DbConnector,

  # public items
  public = list(
    #' @field db_service_name Name of the datbase service. Must be the name set in `keyring::key_set(service)`,
    db_service_name = NULL,
    #' @field user_id Username / user id for database access.
    user_id = NULL,
    #' @field dsn Data Source Name (DSN) of the database in odbc.ini.
    dsn = NULL,

Authentication in class

The key element of the class is its new initialize function: It authenticates the user with keyring.

The only necessary argument is db_service_name which, when matched by keyring’s key_get() service name, is used to retrieve username and password for that database. The if condition switches between keyring and non keyring authentifcation. You can provide all the arguments manually if the need arises.

   initialize = function(user_id, db_service_name = NULL, dsn, user_pw, encoding = "") {
      assert_that(is_character(db_service_name))

      # switch between keyring and non-keyring authentification
     
      if (all(
        is_missing(user_id), is_missing(dsn),
        is_missing(user_pw)
      ) & !is_missing(db_service_name)) {
        
        # if only the db_service_name is present 
        # these arguments are filled from the credential store.
        
        self$db_service_name <- db_service_name
        self$user_id <- keyring::key_list(self$db_service_name)$username
1        self$dsn <- self$db_service_name
      } else {
        # otherwhise  user id and dsn need to be present
        
        assert_that(is_character(user_id))
        assert_that(is_character(dsn))

        self$user_id <- user_id
        self$db_service_name <- db_service_name
        self$dsn <- dsn
      }
1
Here dsn == db_service_name because I named my keyring entries the same as the ODBC connection names for the databases. In case of corporate databases, which dont change (often), you could simply write a dedicated class for that database and hard code the dsn.

Establishing the connection

The if condition checks whether a user_pw has been provided. If not, it tries to read the password from the credential store with keyring. This can be useful when you provide all database details manually but still want to store/retrieve the password safely. For example, in scripts that are not run interactively.

      # establish db connection

      if (is_missing(user_pw)) {
        message("No pw specified, assuming keyring authentification.")

        private$..conn <-
          DBI::dbConnect(odbc::odbc(),
            dsn = self$dsn,
            UID = self$user_id,
            PWD = keyring::key_get(
              service = self$db_service_name,
              username = self$user_id
            ),
            encoding = encoding
          )
      } else if (!missing(user_pw)) {
        assert_that(is_character(user_pw))

        private$..conn <-
          DBI::dbConnect(odbc::odbc(),
            dsn = self$dsn,
            UID = self$user_id,
            PWD = user_pw,
            encoding = encoding
          )
      }

Entire OdbcConnector code

Show entire OdbcConnector code
#' OdbcConnector
#'
#' @description
#' Class for connectors to SQL databases based on [odbc::odbc()] drivers.
#' Be *aware*, for this to work the databases need to be registered within your machine's data sources reference (e.g. odbc.ini).
#' @importFrom assertthat assert_that
#' @import rlang
#' @export
OdbcConnector <- R6::R6Class(
  classname = "OdbcConnector",
  inherit = DbConnector,

  # public items
  public = list(
    #' @field db_service_name Name of the datbase service. Must be the name set in `keyring::key_set(service)`,
    db_service_name = NULL,
    #' @field user_id Username / user id for database access.
    user_id = NULL,
    #' @field dsn Data Source Name (DSN) of the database in odbc.ini.
    dsn = NULL,

    #' @param db_service_name Name of the datbase service. Must be the name set in `keyring::key_set(service)`
    #' @param user_id Username / user id for database access. Only needed for *without* keyring authentification.
    #' @param user_pw *Optional* User password *if* keyring authentification is not used. Defaults to `NULL` which triggers `keyring::key_get(service = db_service_name, username = user_id)` to fetch the password from the standard (operating system) credential store.
    #' @param dsn *Optional* Data Source Name (DSN) of the database in odbc.ini. Only needed for *without* keyring authentification.
    #' @param encoding *Optional* Encoding the database is in. Useful if you have texts with umlauts (set `encoding = "latin1")`and other nasties.
    #' @examples
    #' # non-keyring authentification
    #' OdbcConnector$new(db_service_name = "postgres_test",
    #' user_id = "postgres",
    #' dsn = "postgres_test",
    #' user_pw = rstudioapi::askForPassword())
    initialize = function(user_id, db_service_name = NULL, dsn, user_pw, encoding = "") {
      assert_that(is_character(db_service_name))

      # switch between keyring and non-keyring authentification
      if (all(
        is_missing(user_id), is_missing(dsn),
        is_missing(user_pw)
      ) & !is_missing(db_service_name)) {
        self$db_service_name <- db_service_name
        self$user_id <- keyring::key_list(self$db_service_name)$username
        self$dsn <- self$db_service_name
      } else {
        assert_that(is_character(user_id))
        assert_that(is_character(dsn))

        self$user_id <- user_id
        self$db_service_name <- db_service_name
        self$dsn <- dsn
      }

      # establish db connection

      if (is_missing(user_pw)) {
        message("No pw specified, assuming keyring authentification.")

        private$..conn <-
          DBI::dbConnect(odbc::odbc(),
            dsn = self$dsn,
            UID = self$user_id,
            PWD = keyring::key_get(
              service = self$db_service_name,
              username = self$user_id
            ),
            encoding = encoding
          )
      } else if (!missing(user_pw)) {
        assert_that(is_character(user_pw))

        private$..conn <-
          DBI::dbConnect(odbc::odbc(),
            dsn = self$dsn,
            UID = self$user_id,
            PWD = user_pw,
            encoding = encoding
          )
      }
    }
  )
)

Final considerations

This works very well in practice and makes accessing databases a breeze. The code presented is solely for demonstration purposes and some work would need to be done to harden it for production. For example, imports/dependency management needs a good tidy up and there need to be (more) input tests. But the idea and combination of parts is the message here.

Appendix

docker database setup

Footnotes

  1. It is my personal conviction that when it comes to safety convenience is key. Otherwise the probability of people to do the “easy” thing, lets say store passwords in plaintext, increases. And thats a big no-no.↩︎