Getting started

sqlm is an R package that makes it easy to fit linear models to data stored in SQL databases. It leverages the power of dbplyr to translate R code into SQL queries, allowing users to work with large datasets without needing to load them into memory.

To get started with sqlm, you can follow these steps:

Install the package: If you haven’t already, install the sqlm package from Codeberg.

# development version
devtools::install_git("https://codeberg.org/usrbinr/sqlm.git")

Connect to your database: Use DBI and dbplyr to connect to your SQL database.

library(DBI)
library(dbplyr)
library(duckdb)
library(sqlm)

# we will use contoso database for this example

#create a duckdb database with contoso data

db_small <- contoso::create_contoso_duckdb(size="100k")

# assigns the sales table in the duckdb database to a variable
sales_small_db <- db_small$sales

# assigns the regular in-memory tibble to a variable
sales_small_tbl <- contoso::sales

This produces a list of databases that we can reference with typical ‘$’ syntax. In this example we leverage the ‘sales’ tables.

The only difference between these tables is that one is in a duckdb database whereas the other is an in-memory tibble (and one is 100k rows vs ~10k rows).

Fit a linear model: Use the lm_sql() function to fit a linear model to your data. You can specify the formula and the data source as you would with the standard lm() function.

Note most standard lm() formula syntax is supported including interactions with * and : operators, 0 intercept models and putting all columns with ..

lm_obj <- sales_small_db |>
  sqlm::lm_sql(net_price~quantity+unit_cost*currency_code,data = _)

print(lm_obj)


Big Data Linear Regression (S7)
-------------------------------
Call:
sqlm::lm_sql(formula = net_price ~ quantity + unit_cost * currency_code, 
    data = sales_small_db)

Coefficients:
                (Intercept)                    quantity 
               -24.38846690                 -0.67317405 
                  unit_cost           currency_code_CAD 
                 2.50924326                  2.25062985 
          currency_code_EUR           currency_code_GBP 
                -6.57743263                 -6.59090333 
          currency_code_USD unit_cost:currency_code_CAD 
                -9.14607360                 -0.06201195 
unit_cost:currency_code_EUR unit_cost:currency_code_GBP 
                 0.01998218                  0.04250501 
unit_cost:currency_code_USD 
                 0.05225576

Not only does this print a familiar output, we can leverage standard broom functions such as tidy() and glance() to easily summarize the model output.

lm_obj |>
  broom::tidy()

# A tibble: 11 × 5
   term                        estimate std.error statistic    p.value
   <chr>                          <dbl>     <dbl>     <dbl>      <dbl>
 1 (Intercept)                 -24.4       5.39      -4.52  0.00000614
 2 quantity                     -0.673     0.450     -1.50  0.135     
 3 unit_cost                     2.51      0.0266    94.5   0         
 4 currency_code_CAD             2.25      6.56       0.343 0.732     
 5 currency_code_EUR            -6.58      5.87      -1.12  0.262     
 6 currency_code_GBP            -6.59      6.78      -0.973 0.331     
 7 currency_code_USD            -9.15      5.53      -1.65  0.0982    
 8 unit_cost:currency_code_CAD  -0.0620    0.0332    -1.87  0.0621    
 9 unit_cost:currency_code_EUR   0.0200    0.0299     0.667 0.505     
10 unit_cost:currency_code_GBP   0.0425    0.0332     1.28  0.201     
11 unit_cost:currency_code_USD   0.0523    0.0280     1.87  0.0622

lm_obj |>
  broom::glance()

# A tibble: 1 × 11
  r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
1     0.953         0.952  87.4    15616.       0    10 -45893. 91810. 91894.
# ℹ 2 more variables: nobs <dbl>, df.residual <dbl>

How does this compare to standard lm()?

For convenience we are defining the formula first and then using it in both the standard lm function and the sqlm function however as you saw earlier sqlm::lm_sql() works just like lm() with formula body in the function call.

formula_input <- formula(net_price~quantity*unit_cost)

lm_tidy_tbl <- lm(formula_input,data=contoso::sales) |>
  broom::tidy()

sqlm_tidy_tbl <-sqlm::lm_sql(formula_input,data = sales_small_db) |>
  broom::tidy()

Term	Estimate (lm)	Estimate (sqlm)	Std. Error (lm)	Std. Error (sqlm)
Comparison of lm() and lm_sql() results
(Intercept)	-30.372882622	-30.372882622	2.267749367	2.267749367
quantity	-0.874828942	-0.874828942	0.596018745	0.596018745
unit_cost	2.534482193	2.534482193	0.011540823	0.011540823
quantity:unit_cost	0.001330544	0.001330544	0.002977435	0.002977435

Figure 1

As we can see there are no differences in the results between the two model fits.

factor columns naming differences

Note there are minor differences in how factor variables are named in output as sqlm() will append “_” when using the column name and its levels eg. currency_code_USD in lm_sql() vs. currency_codeUSD in lm().

What else can sqlm do?

sqlm also supports a many-models workflow via dplyr::group_by(). Instead of nesting data and fitting separate models in R, lm_sql() computes all grouped models in a single SQL GROUP BY pass:

grouped_db <- sales_small_db |>
  dplyr::group_by(currency_code) |>
  lm_sql(net_price~quantity+unit_cost,data = _)

print(grouped_db)

# A tibble: 5 × 2
  currency_code model     
  <chr>         <list>    
1 EUR           <sqlm::__>
2 AUD           <sqlm::__>
3 GBP           <sqlm::__>
4 CAD           <sqlm::__>
5 USD           <sqlm::__>

Each row contains a model object for a given group. Use rowwise() + mutate() with broom::tidy() or broom::glance() to extract results:

grouped_db |>
  dplyr::rowwise() |>
  dplyr::mutate(
    glance_model = list(broom::glance(model))
  ) |>
    tidyr::unnest(glance_model)

# A tibble: 5 × 13
  currency_code model      r.squared adj.r.squared sigma statistic p.value    df
  <chr>         <list>         <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>
1 EUR           <sqlm::__>     0.952         0.952  83.9    18087.       0     2
2 AUD           <sqlm::__>     0.958         0.958  80.3     5288.       0     2
3 GBP           <sqlm::__>     0.955         0.955  92.8     7235.       0     2
4 CAD           <sqlm::__>     0.949         0.949  86.1     7723.       0     2
5 USD           <sqlm::__>     0.952         0.952  89.0    39718.       0     2
# ℹ 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, nobs <dbl>,
#   df.residual <dbl>

For a deeper dive into the many-models pattern — including multiple grouping variables, comparisons with in-memory approaches, and benchmarks — see the Many Models vignette.