Conveniently accessing databases via ODBC from R with a uniform API, secure automatic authentication and worry free connection management.
r
R6
databases
sql
Author
red
Published
December 18, 2022
Spotlight
# open database connectiondb_con <- OdbcConnector$new(db_service_name ="postgres_test")# define some parametersyears <-c(2007, 2009)table_to_query <-"penguins"# construct, send and collect querydb_con$get_query("SELECT year, species, island, body_mass_g FROM {`table_to_query`} WHERE year IN ({years*})")
# get column names from a tablecols <- db_con$get_query("SELECT * FROM information_schema.columns WHERE table_schema = 'public' AND table_name = 'penguins'",quietly =TRUE ) |>pull(column_name)# filter column namesmeasure_cols <- cols[str_detect(cols, ".*mm$")]# query just the columns that fit certain criteriadb_con$get_query("SELECT year, island, species, {`measure_cols`*} FROM penguins",quietly =TRUE)
# A tibble: 344 × 6
year island species bill_length_mm bill_depth_mm flipper_length_mm
<int> <chr> <chr> <dbl> <dbl> <int>
1 2007 Torgersen Adelie 39.1 18.7 181
2 2007 Torgersen Adelie 39.5 17.4 186
3 2007 Torgersen Adelie 40.3 18 195
4 2007 Torgersen Adelie NA NA NA
5 2007 Torgersen Adelie 36.7 19.3 193
# ℹ 339 more rows
# write data to databasedb_con$write_table(table_name ="diamonds",data = ggplot2::diamonds |>slice_sample(prop =0.2),overwrite =TRUE)
✔ Data written to postgres
# execute statements db_con$execute_statement("GRANT SELECT ON \"DIAMONDS\" TO public;")
Database connections in R can be a hassle since you need to pass around connection objects. Furthermore, there is no straigthforward way to securely store and pass credentials. What I present here is my solution to this problem.
I’ll demonstrate a template on how to do something like this based on example databases (PostgreSQL and mariaDB) running in docker containers. I will not cover R6, packages or unit-testing, just try to show whats new, usefull or “innovative”.
Features
Connection to all Databases which offer an ODBC driver and are supporter in by the wrapped DBI package.
Full support for glue::glue_sql sql parsing and query construction within database methods.
Automatic and secure authentication using the OS credential store (tested on Ubuntu and Windows) with keyring
Custom methods to access often used datasets/reference data like ICD and ATC catalogues or whatever your (business) needs may be.
Uniform API with a persistent connection object that carries all the methods implemented for the database.
How
Framework - R6
R6 is a fully fledged implementation of an object oriented programming (OOP) framework with inheritance, public and private methods as well as statefulness if required.
Why full OOP? In order to construct, execute and fetch SQL queries from R the functions like DBI::dbGetQuery or glue::glue_sql need a database connection object. This connection object houses all the metadata of the database connection: The database (dialect), authentication and settings. Only from it things like glue_sql can deduce how SQL queries need to be constructed, for example which characters are escaped and how.
Aside: glue_sql
In short, glue::glue_sql() allows you to write queries intervoven with r code and special reference syntax to pass r objects, table and column names programmatically. Most importantly, it takes care of syntax differences between SQL dialects:
# open db connections to PostgreSQL and MariaDBpg_con <- OdbcConnector$new(db_service_name ="postgres_test")maria_con <- OdbcConnector$new(db_service_name ="maria_test")# query elementsrandom_cols <-c("UPPERCASE", "lowercase", "weirdNumber5")years_ <-2010:2013# prepare the query for both databasespurrr::map(c(pg_con$conn, maria_con$conn),~ glue::glue_sql("SELECT {`random_cols`*} FROM {`table_to_query`} WHERE year IN ({years_*})",.con = . )) |>set_names("postgres", "maria")
$postgres
<SQL> SELECT "UPPERCASE", "lowercase", "weirdNumber5"
FROM "penguins"
WHERE year IN (2010, 2011, 2012, 2013)
$maria
<SQL> SELECT `UPPERCASE`, `lowercase`, `weirdNumber5`
FROM `penguins`
WHERE year IN (2010, 2011, 2012, 2013)
Therefore the connector object needs to be persistent since establishing database connections every time a query is executed is possible but inelegant. Also passing the connector object every time a query is sent, a table written or a statement executed is wearisome and one needs to manually disconnect from the database.
The base class: DbConnector
# roxygen documentation omitted for brevity, see entire code belowDbConnector <- R6::R6Class(classname ="DbConnector",inherit =NULL,# private itemsprivate =list(..conn =NULL ),# public itemspublic =list(#' @param driver Database driver like `odbc::odbc()`#' @param ... Arguments passed to `DBI::dbConnect`1initialize =function(driver, ...) {# establish db connection private$..conn <- DBI::dbConnect(driver, ...) },2#' @description Closes database connections when the object is garbage collected. Invoke garbage collection manually with `gc()`finalize =function() { cli::cli_alert_success("Closing db connection {private$..conn@info$servername}") DBI::dbDisconnect(private$..conn) },# [...]
1
The base class just wraps the DBI::dbConnect() function to establish a database connection. The class connection method is so generic to deal with any database connection using DBI. The connector object from DBI::dbConnect() is stored in a private field of the DbConnector instance. From there any class method can use it which saves passing the connection manually.
2
The first real important part is the finalize method which disconnects the database connection when the object is garbage collected and consequently when the R session is closed.
get_query method
This is the workhorse of the database class: It assembles and parses queries, sends it to the database and collects the results.
# [...]# get_query# roxygen documentation omitted for brevity, see entire code below get_query =function(query, case ="snake",timer_name ="standard_query",convert_dates =TRUE,quietly =FALSE,raw_query =FALSE,env =environment()) {# input testsassert_that(purrr::is_logical(convert_dates),msg ="convert_dates must be TRUE/FALSE" )# construct query if no pre-constructed query was provided1if (!raw_query) { query <- glue::glue_sql(query,.con = private$..conn,.envir = env ) }# start timing2 tictoc::tic(msg = timer_name) res <- DBI::dbGetQuery(conn = private$..conn,statement = query ) %>% dplyr::as_tibble() %>% janitor::clean_names(case = case) tictoc::toc(log =TRUE, quiet = quietly)if (convert_dates ==TRUE) { res <- res %>%mutate(across(where(lubridate::is.POSIXct), as.Date)) }return(res) },# [...]
1
It wraps glue::glue_sql() to construct the query string which can be entered as an argument in glue_sql notation, passes the connector object which is encapsulated in the DbConnector instance and specifies the environment in which glue_sql() searches for the R objects in the query.
2
Collects the pre-constructed query and converts column names to a standard specified in case. The time it takes to fetch the query is logged using tictoc::tic() and tictoc::toc() which is printed to console while quietly = FALSE.
The method demonstrates how useful the OOP Approach in R6 is: The database connection private$..conn needs to be passed firstly, to glue::glue_sql()so that it can parse the statement for the appropriate SQL dialect. Secondly, to DBI::dbGetQuery() to provide the database to query. Doing that manually would be as inefficient as it would be bothersome.
Entire DbConnector class code
DbConnector has two more core methods which are self explanatory and can be seen in the entire class code.
Show entire DbConnector code
#' DbConnector#'#' @description#' Superclass for connectors to SQL databases.#' @importFrom assertthat assert_that#' @exportDbConnector <- R6::R6Class(classname ="DbConnector",inherit =NULL,# private itemsprivate =list(..conn =NULL ),# public itemspublic =list(#' @param driver Database driver like `odbc::odbc()`#' @param ... Arguments passed to `DBI::dbConnect`initialize =function(driver, ...) {# establish db connection private$..conn <- DBI::dbConnect(driver, ...) },#' @description Closes database connections when the object is garbage collected. Invoke garbage collection manually with `gc()`finalize =function() { cli::cli_alert_success("Closing db connection {private$..conn@info$servername}") DBI::dbDisconnect(private$..conn) },# methods for class# get_query#' @description Wrapper around `DBI::dbgetQuery` and formats the output#' as a tibble, makes use of `janitor::clean_names` and converts the#' SQL `POSIXct` to `Date` format.#' @param query String containing SQL code and/or `glue::glue_sql` syntax.#' @param convert_dates Convert dates to `as.Date` or just leave them be (mostly results in dttm)#' @param case `janitor::clean_names` option to which format variable names should be converted. Defaults to "snake"#' @param timer_name Name of the `tictoc::tic()` timer in which query time is stored#' @param quietly Supress [toc][tictoc::toc] query time posts. Defaults to `FALSE`.#' @param raw_query Whether the query passed is already a parsed `SQL` query. Defaults to `FALSE`. Use for passing preconstructed queries.#' @param env Environment the glue_sql String is parsed in. Defaults to `environment()`, i.e. the calling environment. Here for experimental reasons.#' @return Returns a `tibble` with the query dataget_query =function(query, case ="snake",timer_name ="standard_query",convert_dates =TRUE,quietly =FALSE,raw_query =FALSE,env =environment()) {# input testsassert_that(purrr::is_logical(convert_dates),msg ="convert_dates must be TRUE/FALSE" )# raw queryif (!raw_query) { query <- glue::glue_sql(query,.con = private$..conn,.envir = env ) }# start timing tictoc::tic(msg = timer_name) res <- DBI::dbGetQuery(conn = private$..conn,statement = query ) %>% dplyr::as_tibble() %>% janitor::clean_names(case = case) tictoc::toc(log =TRUE, quiet = quietly)if (convert_dates ==TRUE) { res <- res %>%mutate(across(where(lubridate::is.POSIXct), as.Date)) }return(res) },# write_table#' @description write tables to database. Wrapper around [DBI::dbWriteTable()].#' @param table_name Table Name as character. Use a name that is valid for the databse. Otherwhise it will be sanitised by `janitor::make_clean_names`.#' @param data Data to be written either as `data.table` or `tibble`.#' @param overwrite Allow overwriting the destination table.#' @param field_types Additional field types used to override derived types.1write_table =function(table_name, data,overwrite =FALSE,field_types =NULL) {assert_that(is.data.frame(data) || tibble::is_tibble(data))assert_that(is.character(table_name))# transform table name to valid name table_name <- janitor::make_clean_names(table_name,case ="all_caps" )# change column names to upper case data <- janitor::clean_names(dat = data,case ="all_caps" )# write to database state <- DBI::dbWriteTable(conn = private$..conn,name = table_name,value = data,overwrite = overwrite,field.types = field_types )if (state) { cli::cli_alert_success("Data written to {private$..conn@info$dbname}") } },# db_execute#' @description execute statements against the database. Wrapper around [DBI::dbExecute()].#' @param statement SQL statement to execute as a string. Is internally passed to [glue::glue_sql()].#' @param env Pass the appropriate environment where `glue_sql` looks for objects. Only needed wihhin functions or script execution.#' @examples#' test$execute_statement("GRANT SELECT ON IRIS TO HSPRANDOM;")#' test$execute_statement("COMMIT;")2execute_statement =function(statement,env =environment()) { state <- DBI::dbExecute(conn = private$..conn,statement = glue::glue_sql(statement,.envir = env,.con = private$..conn ) )if (is.numeric(state)) { cli::cli_alert_info("Statement executed @ {private$..conn@info$dbname}") } } ),# active bindings3active =list(conn =function() {#' @field conn `DBI` database connection the connector object is using. private$..conn } ))
1
write_table Writes data to the database. Important feature is that column names are made syntactically correct with janitor::clean_names() in all_caps. Otherwise one ends up with those annoying SQL tables where you have to quote column names. The same is done to the table name.
2
execute_statement wraps DBI::dbExecute() by passing the statement through glue::glue_sql() and informing whether it was successful.
3
In R6 an active binding provides an interface to the hidden private fields. The conn field allows manual access to the database connection if the user needs it.
The OdbcConnector class
Prelude: Authentication with keyring
The keyring package allows to store and access secrets in the system credential store. This has two big advantages: Firstly, credentials can be stored and retrieved at leisure after a one-time setup. Secondly, the system credential store is unlocked when you log into your windows/linux/os x Session. Especially the latter gives the OdbcConnector class its power of convenience: After a one time setup you and your colleagues need no extra step to safely authenticate at your companies databases1.
One-time keyring setup
Below we store out username and passwords (you’ll get an interactive prompt to enter it) in the systems credential store. The crucial part for automatic log-on with OdbcConnector is the name of the service. It has to match the name of the ODBC connection to work (as easy as possible).
service username
1 postgres_test postgres
2 maria_test mysql
Inheritance
Firstly, the OdbcConnector class inherits all attributes from its Superclass DbConnector. This means all methods and attributes we do not re-designate are inherited.
But re-designate we do, namely the public items which store the name of the database service db_service_name, user_id, and dsn. These fields will hold information on our database connection.
# roxygen documentation excluded for brevityOdbcConnector <- R6::R6Class(classname ="OdbcConnector",inherit = DbConnector,# public itemspublic =list(#' @field db_service_name Name of the datbase service. Must be the name set in `keyring::key_set(service)`,db_service_name =NULL,#' @field user_id Username / user id for database access.user_id =NULL,#' @field dsn Data Source Name (DSN) of the database in odbc.ini.dsn =NULL,
Authentication in class
The key element of the class is its new initialize function: It authenticates the user with keyring.
The only necessary argument is db_service_name which, when matched by keyring’s key_get() service name, is used to retrieve username and password for that database. The if condition switches between keyring and non keyring authentifcation. You can provide all the arguments manually if the need arises.
initialize =function(user_id, db_service_name =NULL, dsn, user_pw, encoding ="") {assert_that(is_character(db_service_name))# switch between keyring and non-keyring authentificationif (all(is_missing(user_id), is_missing(dsn),is_missing(user_pw) ) &!is_missing(db_service_name)) {# if only the db_service_name is present # these arguments are filled from the credential store. self$db_service_name <- db_service_name self$user_id <- keyring::key_list(self$db_service_name)$username1 self$dsn <- self$db_service_name } else {# otherwhise user id and dsn need to be presentassert_that(is_character(user_id))assert_that(is_character(dsn)) self$user_id <- user_id self$db_service_name <- db_service_name self$dsn <- dsn }
1
Here dsn == db_service_name because I named my keyring entries the same as the ODBC connection names for the databases. In case of corporate databases, which dont change (often), you could simply write a dedicated class for that database and hard code the dsn.
Establishing the connection
The if condition checks whether a user_pw has been provided. If not, it tries to read the password from the credential store with keyring. This can be useful when you provide all database details manually but still want to store/retrieve the password safely. For example, in scripts that are not run interactively.
#' OdbcConnector#'#' @description#' Class for connectors to SQL databases based on [odbc::odbc()] drivers.#' Be *aware*, for this to work the databases need to be registered within your machine's data sources reference (e.g. odbc.ini).#' @importFrom assertthat assert_that#' @import rlang#' @exportOdbcConnector <- R6::R6Class(classname ="OdbcConnector",inherit = DbConnector,# public itemspublic =list(#' @field db_service_name Name of the datbase service. Must be the name set in `keyring::key_set(service)`,db_service_name =NULL,#' @field user_id Username / user id for database access.user_id =NULL,#' @field dsn Data Source Name (DSN) of the database in odbc.ini.dsn =NULL,#' @param db_service_name Name of the datbase service. Must be the name set in `keyring::key_set(service)`#' @param user_id Username / user id for database access. Only needed for *without* keyring authentification.#' @param user_pw *Optional* User password *if* keyring authentification is not used. Defaults to `NULL` which triggers `keyring::key_get(service = db_service_name, username = user_id)` to fetch the password from the standard (operating system) credential store.#' @param dsn *Optional* Data Source Name (DSN) of the database in odbc.ini. Only needed for *without* keyring authentification.#' @param encoding *Optional* Encoding the database is in. Useful if you have texts with umlauts (set `encoding = "latin1")`and other nasties.#' @examples#' # non-keyring authentification#' OdbcConnector$new(db_service_name = "postgres_test",#' user_id = "postgres",#' dsn = "postgres_test",#' user_pw = rstudioapi::askForPassword())initialize =function(user_id, db_service_name =NULL, dsn, user_pw, encoding ="") {assert_that(is_character(db_service_name))# switch between keyring and non-keyring authentificationif (all(is_missing(user_id), is_missing(dsn),is_missing(user_pw) ) &!is_missing(db_service_name)) { self$db_service_name <- db_service_name self$user_id <- keyring::key_list(self$db_service_name)$username self$dsn <- self$db_service_name } else {assert_that(is_character(user_id))assert_that(is_character(dsn)) self$user_id <- user_id self$db_service_name <- db_service_name self$dsn <- dsn }# establish db connectionif (is_missing(user_pw)) {message("No pw specified, assuming keyring authentification.") private$..conn <- DBI::dbConnect(odbc::odbc(),dsn = self$dsn,UID = self$user_id,PWD = keyring::key_get(service = self$db_service_name,username = self$user_id ),encoding = encoding ) } elseif (!missing(user_pw)) {assert_that(is_character(user_pw)) private$..conn <- DBI::dbConnect(odbc::odbc(),dsn = self$dsn,UID = self$user_id,PWD = user_pw,encoding = encoding ) } } ))
Final considerations
This works very well in practice and makes accessing databases a breeze. The code presented is solely for demonstration purposes and some work would need to be done to harden it for production. For example, imports/dependency management needs a good tidy up and there need to be (more) input tests. But the idea and combination of parts is the message here.
It is my personal conviction that when it comes to safety convenience is key. Otherwise the probability of people to do the “easy” thing, lets say store passwords in plaintext, increases. And thats a big no-no.↩︎