Getting started

sqlm is an R package that makes it easy to fit linear models to data stored in SQL databases. It leverages the power of dbplyr to translate R code into SQL queries, allowing users to work with large datasets without needing to load them into memory.

To get started with sqlm, you can follow these steps:

Install the package: If you haven’t already, install the sqlm package from Codeberg.

# development version
devtools::install_git("https://codeberg.org/usrbinr/sqlm.git")

Connect to your database: Use DBI and dbplyr to connect to your SQL database.

library(DBI)
library(dbplyr)
library(duckdb)
library(sqlm)

# we will use contoso database for this example

#create a duckdb database with contoso data

db_small <- contoso::create_contoso_duckdb(size="100k")

# assigns the sales table in the duckdb database to a variable
sales_small_db <- db_small$sales

# assigns the regular in-memory tibble to a variable
sales_small_tbl <- contoso::sales

This produces a list of databases that we can reference with typical ‘$’ syntax. In this example we leverage the ‘sales’ tables.

The only difference between these tables is that one is in a duckdb database whereas the other is an in-memory tibble (and one is 100k rows vs ~10k rows).

Fit a linear model: Use the lm_sql() function to fit a linear model to your data. You can specify the formula and the data source as you would with the standard lm() function.

Note most standard lm() formula syntax is supported including interactions with * and : operators, 0 intercept models and putting all columns with ..

lm_obj <- sales_small_db |>
  sqlm::lm_sql(net_price~quantity+unit_cost*currency_code,data = _)

print(lm_obj)


SQL-Backed Linear Regression
----------------------------
Call:
sqlm::lm_sql(formula = net_price ~ quantity + unit_cost * currency_code, 
    data = sales_small_db)

Coefficients:
                (Intercept)                    quantity 
               -24.38846690                 -0.67317405 
                  unit_cost           currency_code_CAD 
                 2.50924326                  2.25062985 
          currency_code_EUR           currency_code_GBP 
                -6.57743263                 -6.59090333 
          currency_code_USD unit_cost:currency_code_CAD 
                -9.14607360                 -0.06201195 
unit_cost:currency_code_EUR unit_cost:currency_code_GBP 
                 0.01998218                  0.04250501 
unit_cost:currency_code_USD 
                 0.05225576

Not only does this print a familiar output, we can leverage standard broom functions such as tidy() and glance() to easily summarize the model output.

lm_obj |>
  broom::tidy()

# A tibble: 11 × 5
   term                        estimate std.error statistic    p.value
   <chr>                          <dbl>     <dbl>     <dbl>      <dbl>
 1 (Intercept)                 -24.4       5.39      -4.52  0.00000614
 2 quantity                     -0.673     0.450     -1.50  0.135     
 3 unit_cost                     2.51      0.0266    94.5   0         
 4 currency_code_CAD             2.25      6.56       0.343 0.732     
 5 currency_code_EUR            -6.58      5.87      -1.12  0.262     
 6 currency_code_GBP            -6.59      6.78      -0.973 0.331     
 7 currency_code_USD            -9.15      5.53      -1.65  0.0982    
 8 unit_cost:currency_code_CAD  -0.0620    0.0332    -1.87  0.0621    
 9 unit_cost:currency_code_EUR   0.0200    0.0299     0.667 0.505     
10 unit_cost:currency_code_GBP   0.0425    0.0332     1.28  0.201     
11 unit_cost:currency_code_USD   0.0523    0.0280     1.87  0.0622

lm_obj |>
  broom::glance()

# A tibble: 1 × 11
  r.squared adj.r.squared sigma statistic p.value    df  logLik    AIC    BIC
      <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>   <dbl>  <dbl>  <dbl>
1     0.953         0.952  87.4    15616.       0    10 -45893. 91810. 91894.
# ℹ 2 more variables: nobs <dbl>, df.residual <dbl>

Using summary()

For a familiar view that matches base R’s summary.lm() output, use summary():

lm_obj |>
  summary()


Call:
sqlm::lm_sql(formula = net_price ~ quantity + unit_cost * currency_code, 
    data = sales_small_db)

Coefficients:
                              Estimate Std. Error    t value   Pr(>|t|)    
(Intercept)                 -2.439e+01  5.390e+00 -4.525e+00  6.142e-06 ***
quantity                    -6.732e-01  4.501e-01 -1.496e+00  1.348e-01    
unit_cost                    2.509e+00  2.656e-02  9.449e+01  0.000e+00 ***
currency_code_CAD            2.251e+00  6.559e+00  3.431e-01  7.315e-01    
currency_code_EUR           -6.577e+00  5.865e+00 -1.121e+00  2.622e-01    
currency_code_GBP           -6.591e+00  6.775e+00 -9.728e-01  3.307e-01    
currency_code_USD           -9.146e+00  5.530e+00 -1.654e+00  9.818e-02 .  
unit_cost:currency_code_CAD -6.201e-02  3.323e-02 -1.866e+00  6.206e-02 .  
unit_cost:currency_code_EUR  1.998e-02  2.995e-02  6.673e-01  5.046e-01    
unit_cost:currency_code_GBP  4.251e-02  3.323e-02  1.279e+00  2.009e-01    
unit_cost:currency_code_USD  5.226e-02  2.801e-02  1.865e+00  6.218e-02 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 87.36 on 7783 degrees of freedom
Multiple R-squared: 0.9525, Adjusted R-squared: 0.9525
F-statistic: 1.562e+04 on 10 and 7783 DF,  p-value: < 2.2e-16

This displays the coefficient table with significance stars, residual standard error, R-squared statistics, and the F-statistic — just like you’d see with a standard lm() model.

Note

Unlike base R’s summary.lm(), the sqlm summary does not include residual quantiles since the raw data remains in the database.

Diagnosing potential issues

Before fitting a model, you can use diagnose_lm_sql() to check for common problems:

sqlm::diagnose_lm_sql(net_price ~ quantity + unit_cost, data = sales_small_db)

This function checks for:

Missing columns in the database table
Constant columns (zero variance)
Near-perfect collinearity between numeric predictors
Single-level categorical variables
High-cardinality categoricals (>50 levels)
Insufficient observations for the number of parameters
High NA prevalence in model variables

Running diagnostics before fitting can save time by catching issues early, especially with large datasets where a failed model fit is costly.

Controlling warning verbosity

When fitting many models (e.g., grouped regression), warnings can become overwhelming. Use the warnings parameter to control verbosity:

# Full warnings with diagnostic tips (default)
lm_sql(y ~ x, data = tbl, warnings = "full")

# One-line abbreviated warnings
lm_sql(y ~ x, data = tbl, warnings = "minimal")

# Suppress all warnings (useful for batch processing)
lm_sql(y ~ x, data = tbl, warnings = "none")

Error handling

sqlm provides informative error messages when something goes wrong. Errors are categorized by type and include:

What went wrong - Clear description of the problem
Common causes - Why this might have happened
How to fix - Actionable steps to resolve the issue

For example, if you accidentally pass a local data frame:

# This will produce a helpful error message
tryCatch(
  lm_sql(mpg ~ wt, data = mtcars),
  error = function(e) cat(e$message)
)

lm_sql() requires a remote database table (tbl_sql), not a local data frame.

How to fix:
  1. Connect to database: con <- DBI::dbConnect(...)
  2. Create remote table: tbl <- dplyr::tbl(con, 'table_name')
  3. Pass to lm_sql: lm_sql(y ~ x, data = tbl)
Call: lm_sql(formula = mpg ~ wt, data = mtcars)

Common error types include:

Error	Cause
`not_tbl_sql`	Local data frame instead of database table
`column_not_found`	Column doesn’t exist in database
`singular_matrix`	Perfect multicollinearity between predictors
`insufficient_observations`	More parameters than observations
`zero_observations`	All rows removed by NA filtering
`non_numeric_response`	Response variable is not numeric

For SQL-related errors (connection issues, query failures), the original database error message is included to help with debugging.

How does this compare to standard lm()?

For convenience we are defining the formula first and then using it in both the standard lm function and the sqlm function however as you saw earlier sqlm::lm_sql() works just like lm() with formula body in the function call.

formula_input <- formula(net_price~quantity*unit_cost)

lm_tidy_tbl <- lm(formula_input,data=contoso::sales) |>
  broom::tidy()

sqlm_tidy_tbl <-sqlm::lm_sql(formula_input,data = sales_small_db) |>
  broom::tidy()

Term	Estimate (lm)	Estimate (sqlm)	Std. Error (lm)	Std. Error (sqlm)
Comparison of lm() and lm_sql() results
(Intercept)	-30.372882622	-30.372882622	2.267749367	2.267749367
quantity	-0.874828942	-0.874828942	0.596018745	0.596018745
unit_cost	2.534482193	2.534482193	0.011540823	0.011540823
quantity:unit_cost	0.001330544	0.001330544	0.002977435	0.002977435

Figure 1

As we can see there are no differences in the results between the two model fits.

factor columns naming differences

Note there are minor differences in how factor variables are named in output as sqlm() will append “_” when using the column name and its levels eg. currency_code_USD in lm_sql() vs. currency_codeUSD in lm().

What else can sqlm do?

sqlm also supports a many-models workflow via dplyr::group_by(). Instead of nesting data and fitting separate models in R, lm_sql() computes all grouped models in a single SQL GROUP BY pass:

grouped_db <- sales_small_db |>
  dplyr::group_by(currency_code) |>
  lm_sql(net_price~quantity+unit_cost,data = _)

print(grouped_db)

# A tibble: 5 × 2
  currency_code model     
  <chr>         <list>    
1 EUR           <sqlm::__>
2 CAD           <sqlm::__>
3 AUD           <sqlm::__>
4 USD           <sqlm::__>
5 GBP           <sqlm::__>

Each row contains a model object for a given group. Use rowwise() + mutate() with broom::tidy() or broom::glance() to extract results:

grouped_db |>
  dplyr::rowwise() |>
  dplyr::mutate(
    glance_model = list(broom::glance(model))
  ) |>
    tidyr::unnest(glance_model)

# A tibble: 5 × 13
  currency_code model      r.squared adj.r.squared sigma statistic p.value    df
  <chr>         <list>         <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>
1 EUR           <sqlm::__>     0.952         0.952  83.9    18087.       0     2
2 CAD           <sqlm::__>     0.949         0.949  86.1     7723.       0     2
3 AUD           <sqlm::__>     0.958         0.958  80.3     5288.       0     2
4 USD           <sqlm::__>     0.952         0.952  89.0    39718.       0     2
5 GBP           <sqlm::__>     0.955         0.955  92.8     7235.       0     2
# ℹ 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, nobs <dbl>,
#   df.residual <dbl>

For a deeper dive into the many-models pattern — including multiple grouping variables, comparisons with in-memory approaches, and benchmarks — see the Many Models vignette.