# development version
devtools::install_git("https://codeberg.org/usrbinr/sqlm.git")Getting started
sqlm is an R package that makes it easy to fit linear models to data stored in SQL databases. It leverages the power of dbplyr to translate R code into SQL queries, allowing users to work with large datasets without needing to load them into memory.
To get started with sqlm, you can follow these steps:
Install the package: If you haven’t already, install the sqlm package from Codeberg.
Connect to your database: Use DBI and dbplyr to connect to your SQL database.
library(DBI)
library(dbplyr)
library(duckdb)
library(sqlm)
# we will use contoso database for this example
#create a duckdb database with contoso data
db_small <- contoso::create_contoso_duckdb(size="100k")
# assigns the sales table in the duckdb database to a variable
sales_small_db <- db_small$sales
# assigns the regular in-memory tibble to a variable
sales_small_tbl <- contoso::salesThis produces a list of databases that we can reference with typical ‘$’ syntax. In this example we leverage the ‘sales’ tables.
The only difference between these tables is that one is in a duckdb database whereas the other is an in-memory tibble (and one is 100k rows vs ~10k rows).
Fit a linear model: Use the lm_sql() function to fit a linear model to your data. You can specify the formula and the data source as you would with the standard lm() function.
Note most standard lm() formula syntax is supported including interactions with * and : operators, 0 intercept models and putting all columns with ..
lm_obj <- sales_small_db |>
sqlm::lm_sql(net_price~quantity+unit_cost*currency_code,data = _)
print(lm_obj)
SQL-Backed Linear Regression
----------------------------
Call:
sqlm::lm_sql(formula = net_price ~ quantity + unit_cost * currency_code,
data = sales_small_db)
Coefficients:
(Intercept) quantity
-24.38846690 -0.67317405
unit_cost currency_code_CAD
2.50924326 2.25062985
currency_code_EUR currency_code_GBP
-6.57743263 -6.59090333
currency_code_USD unit_cost:currency_code_CAD
-9.14607360 -0.06201195
unit_cost:currency_code_EUR unit_cost:currency_code_GBP
0.01998218 0.04250501
unit_cost:currency_code_USD
0.05225576
Not only does this print a familiar output, we can leverage standard broom functions such as tidy() and glance() to easily summarize the model output.
lm_obj |>
broom::tidy()# A tibble: 11 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -24.4 5.39 -4.52 0.00000614
2 quantity -0.673 0.450 -1.50 0.135
3 unit_cost 2.51 0.0266 94.5 0
4 currency_code_CAD 2.25 6.56 0.343 0.732
5 currency_code_EUR -6.58 5.87 -1.12 0.262
6 currency_code_GBP -6.59 6.78 -0.973 0.331
7 currency_code_USD -9.15 5.53 -1.65 0.0982
8 unit_cost:currency_code_CAD -0.0620 0.0332 -1.87 0.0621
9 unit_cost:currency_code_EUR 0.0200 0.0299 0.667 0.505
10 unit_cost:currency_code_GBP 0.0425 0.0332 1.28 0.201
11 unit_cost:currency_code_USD 0.0523 0.0280 1.87 0.0622
lm_obj |>
broom::glance()# A tibble: 1 × 11
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.953 0.952 87.4 15616. 0 10 -45893. 91810. 91894.
# ℹ 2 more variables: nobs <dbl>, df.residual <dbl>
Using summary()
For a familiar view that matches base R’s summary.lm() output, use summary():
lm_obj |>
summary()
Call:
sqlm::lm_sql(formula = net_price ~ quantity + unit_cost * currency_code,
data = sales_small_db)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.439e+01 5.390e+00 -4.525e+00 6.142e-06 ***
quantity -6.732e-01 4.501e-01 -1.496e+00 1.348e-01
unit_cost 2.509e+00 2.656e-02 9.449e+01 0.000e+00 ***
currency_code_CAD 2.251e+00 6.559e+00 3.431e-01 7.315e-01
currency_code_EUR -6.577e+00 5.865e+00 -1.121e+00 2.622e-01
currency_code_GBP -6.591e+00 6.775e+00 -9.728e-01 3.307e-01
currency_code_USD -9.146e+00 5.530e+00 -1.654e+00 9.818e-02 .
unit_cost:currency_code_CAD -6.201e-02 3.323e-02 -1.866e+00 6.206e-02 .
unit_cost:currency_code_EUR 1.998e-02 2.995e-02 6.673e-01 5.046e-01
unit_cost:currency_code_GBP 4.251e-02 3.323e-02 1.279e+00 2.009e-01
unit_cost:currency_code_USD 5.226e-02 2.801e-02 1.865e+00 6.218e-02 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 87.36 on 7783 degrees of freedom
Multiple R-squared: 0.9525, Adjusted R-squared: 0.9525
F-statistic: 1.562e+04 on 10 and 7783 DF, p-value: < 2.2e-16
This displays the coefficient table with significance stars, residual standard error, R-squared statistics, and the F-statistic — just like you’d see with a standard lm() model.
Unlike base R’s summary.lm(), the sqlm summary does not include residual quantiles since the raw data remains in the database.
Diagnosing potential issues
Before fitting a model, you can use diagnose_lm_sql() to check for common problems:
sqlm::diagnose_lm_sql(net_price ~ quantity + unit_cost, data = sales_small_db)This function checks for:
- Missing columns in the database table
- Constant columns (zero variance)
- Near-perfect collinearity between numeric predictors
- Single-level categorical variables
- High-cardinality categoricals (>50 levels)
- Insufficient observations for the number of parameters
- High NA prevalence in model variables
Running diagnostics before fitting can save time by catching issues early, especially with large datasets where a failed model fit is costly.
Controlling warning verbosity
When fitting many models (e.g., grouped regression), warnings can become overwhelming. Use the warnings parameter to control verbosity:
# Full warnings with diagnostic tips (default)
lm_sql(y ~ x, data = tbl, warnings = "full")
# One-line abbreviated warnings
lm_sql(y ~ x, data = tbl, warnings = "minimal")
# Suppress all warnings (useful for batch processing)
lm_sql(y ~ x, data = tbl, warnings = "none")Error handling
sqlm provides informative error messages when something goes wrong. Errors are categorized by type and include:
- What went wrong - Clear description of the problem
- Common causes - Why this might have happened
- How to fix - Actionable steps to resolve the issue
For example, if you accidentally pass a local data frame:
# This will produce a helpful error message
tryCatch(
lm_sql(mpg ~ wt, data = mtcars),
error = function(e) cat(e$message)
)lm_sql() requires a remote database table (tbl_sql), not a local data frame.
How to fix:
1. Connect to database: con <- DBI::dbConnect(...)
2. Create remote table: tbl <- dplyr::tbl(con, 'table_name')
3. Pass to lm_sql: lm_sql(y ~ x, data = tbl)
Call: lm_sql(formula = mpg ~ wt, data = mtcars)
Common error types include:
| Error | Cause |
|---|---|
not_tbl_sql |
Local data frame instead of database table |
column_not_found |
Column doesn’t exist in database |
singular_matrix |
Perfect multicollinearity between predictors |
insufficient_observations |
More parameters than observations |
zero_observations |
All rows removed by NA filtering |
non_numeric_response |
Response variable is not numeric |
For SQL-related errors (connection issues, query failures), the original database error message is included to help with debugging.
How does this compare to standard lm()?
For convenience we are defining the formula first and then using it in both the standard lm function and the sqlm function however as you saw earlier sqlm::lm_sql() works just like lm() with formula body in the function call.
formula_input <- formula(net_price~quantity*unit_cost)
lm_tidy_tbl <- lm(formula_input,data=contoso::sales) |>
broom::tidy()
sqlm_tidy_tbl <-sqlm::lm_sql(formula_input,data = sales_small_db) |>
broom::tidy()| Comparison of lm() and lm_sql() results | ||||
| Term | Estimate (lm) | Estimate (sqlm) | Std. Error (lm) | Std. Error (sqlm) |
|---|---|---|---|---|
| (Intercept) | -30.372882622 | -30.372882622 | 2.267749367 | 2.267749367 |
| quantity | -0.874828942 | -0.874828942 | 0.596018745 | 0.596018745 |
| unit_cost | 2.534482193 | 2.534482193 | 0.011540823 | 0.011540823 |
| quantity:unit_cost | 0.001330544 | 0.001330544 | 0.002977435 | 0.002977435 |
As we can see there are no differences in the results between the two model fits.
Note there are minor differences in how factor variables are named in output as sqlm() will append “_” when using the column name and its levels eg. currency_code_USD in lm_sql() vs. currency_codeUSD in lm().
What else can sqlm do?
sqlm also supports a many-models workflow via dplyr::group_by(). Instead of nesting data and fitting separate models in R, lm_sql() computes all grouped models in a single SQL GROUP BY pass:
grouped_db <- sales_small_db |>
dplyr::group_by(currency_code) |>
lm_sql(net_price~quantity+unit_cost,data = _)
print(grouped_db)# A tibble: 5 × 2
currency_code model
<chr> <list>
1 EUR <sqlm::__>
2 CAD <sqlm::__>
3 AUD <sqlm::__>
4 USD <sqlm::__>
5 GBP <sqlm::__>
Each row contains a model object for a given group. Use rowwise() + mutate() with broom::tidy() or broom::glance() to extract results:
grouped_db |>
dplyr::rowwise() |>
dplyr::mutate(
glance_model = list(broom::glance(model))
) |>
tidyr::unnest(glance_model)# A tibble: 5 × 13
currency_code model r.squared adj.r.squared sigma statistic p.value df
<chr> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 EUR <sqlm::__> 0.952 0.952 83.9 18087. 0 2
2 CAD <sqlm::__> 0.949 0.949 86.1 7723. 0 2
3 AUD <sqlm::__> 0.958 0.958 80.3 5288. 0 2
4 USD <sqlm::__> 0.952 0.952 89.0 39718. 0 2
5 GBP <sqlm::__> 0.955 0.955 92.8 7235. 0 2
# ℹ 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, nobs <dbl>,
# df.residual <dbl>
For a deeper dive into the many-models pattern — including multiple grouping variables, comparisons with in-memory approaches, and benchmarks — see the Many Models vignette.