sqlm
sqlm fits linear regression models on datasets residing in SQL databases (DuckDB, Postgres, Snowflake, BigQuery, etc.) without pulling data into R memory. It computes sufficient statistics inside the database engine and solves the normal equations in R.
Features
- Leverage the standard lm() formula syntax for your model, e.g. lm_sql(y ~ m * x + b, data = data).
- Full integration of your modeling artifact with broom::tidy(), broom::glance(), orbital::predict() and orbital::augment() (see the orbital article for more information).
- Zero data movement: The \(X^TX\) and \(X^Ty\) matrices are computed entirely within the SQL engine via a single aggregation query.
- Automatic dummy encoding: Detects categorical (character/factor) columns, scouts distinct levels from the database, and generates CASE WHEN SQL for each dummy.
- Interaction terms: Supports * and : interactions, including interactions between numeric and categorical variables. Categorical interactions are expanded to the full cross of dummy columns.
- Dot expansion: y ~ . expands to all non-response columns (excluding group columns for grouped data).
- Transform support: I(), log(), and sqrt() in formulas are translated to their SQL equivalents (POWER, LN, SQRT).
- Date/datetime predictors: Date and POSIXct columns are automatically converted to numeric (days or seconds since the epoch) in SQL, matching base R's lm() behavior.
- Grouped regression: Pass a dplyr::group_by() table and get a tibble back with one model per group, computed from a single GROUP BY query.
- No-intercept models: y ~ 0 + x is handled correctly (uncorrected TSS, proper degrees of freedom).
Installation
devtools::install_git("https://codeberg.org/usrbinr/sqlm.git")
Package Inspiration
I'm not a statistician, but I do come from a whole family of statisticians. My dad, my brother, and my cousin are not only health statisticians but also heavy R users, so I mainly grew up hearing about R in a very technical / scientific context.
Over the years I've grown more confident in my statistical skills, but I primarily use R for business intelligence and data engineering tasks, so I'll admit it wasn't obvious to me at first what I could use linear modeling for in my day-to-day work.
This changed with Julia Silge's post Using linear models to understand crop yields, which really opened my eyes to the power of linear models for understanding relationships in data beyond prediction tasks.
So if you like this package, send Julia a thank you note.
I routinely teach R at work, and I always assume the thing that will impress people most is dbplyr, or how tidyselect verbs help us navigate our 1,000+ column databases. But without a doubt, the moment the music "stops" is when I show folks that you can fit a linear model with one or two lines of code using lm().
While this may seem obvious in hindsight, much of the literature is heavily oriented toward more “scientific” applications and relies on domain-specific language. This often creates a disconnect for practitioners like me, who focus on applying statistics in business settings—where results must be explained to managers who tend to think in terms of binary outcomes rather than probabilistic likelihoods.
The other challenge is that most business data lives in large SQL databases: massive data warehouses or lakehouses that are impractical to pull into R memory.
So I knew I wanted a package that keeps all of the convenience and syntactic sugar of R's lm() function but works directly on SQL database tables without pulling data into memory.
What else is out there?
There are a few other packages that I’m aware of that provide linear modeling on larger than memory datasets:
- modeldb by Edgar Ruiz and Max Kuhn
- biglm by Thomas Lumley
- dbreg by Grant McDermott, which is honestly super impressive from both a user-experience and performance point of view
All packages are great and I recommend you check them out.
Quick start
library(sqlm)
library(dplyr)
library(dbplyr)
library(duckdb)
# 1. Connect and create a lazy table reference
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbWriteTable(con, "mtcars", mtcars)
cars_db <- dplyr::tbl(con, "mtcars")
# 2. Fit — data stays in the DB, only summary stats return to R
model <- lm_sql(mpg ~ wt + hp + cyl, data = cars_db)
print(model)
Big Data Linear Regression (S7)
-------------------------------
Call:
lm_sql(formula = mpg ~ wt + hp + cyl, data = cars_db)
Coefficients:
(Intercept) wt hp cyl
38.7517874 -3.1669731 -0.0180381 -0.9416168
library(broom)
tidy(model, conf.int = TRUE)
# A tibble: 4 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 38.8 1.79 21.7 0 35.1 42.4
2 wt -3.17 0.741 -4.28 0.000199 -4.68 -1.65
3 hp -0.0180 0.0119 -1.52 0.140 -0.0424 0.00629
4 cyl -0.942 0.551 -1.71 0.0985 -2.07 0.187
glance(model)
# A tibble: 1 × 11
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.843 0.826 2.51 50.2 2.18e-11 3 -72.7 155. 163.
# ℹ 2 more variables: nobs <dbl>, df.residual <dbl>
Grouped regression
cars_db |>
dplyr::group_by(cyl) |>
lm_sql(mpg ~ wt + hp, data = _)
# A tibble: 3 × 2
cyl model
<dbl> <list>
1 4 <sqlm::__>
2 6 <sqlm::__>
3 8 <sqlm::__>
- This returns a nested tibble with a distinct linear model for each level of the cyl variable: the level name is in one column and the model results are in another.
- From here you can use the dplyr::rowwise() + dplyr::mutate() pattern to do further analysis with your model artifact (see the many models vignette for examples, and the short sketch below).
- I also recommend Tim Tiefenbach's many models blog for further techniques to take advantage of this approach.
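Here is a minimal sketch of that rowwise() + mutate() pattern, reusing the grouped fit from above; pulling r.squared out of broom::glance() is just one illustrative choice of downstream analysis.

```r
# Minimal sketch of the rowwise() + mutate() pattern on the grouped fit.
library(dplyr)

grouped_fits <- cars_db |>
  group_by(cyl) |>
  lm_sql(mpg ~ wt + hp, data = _)

grouped_fits |>
  rowwise() |>
  mutate(r_squared = broom::glance(model[[1]])$r.squared) |>
  ungroup()
```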
Technical approach
1. OLS via normal equations
Base R’s lm() pulls all rows into memory and solves via QR decomposition. sqlm instead solves the normal equations directly:
\[\hat{\beta} = (X^TX)^{-1} X^T y\]
The key insight is that \(X^TX\) and \(X^Ty\) are sums of products, which are standard SQL aggregations. For \(p\) predictors the sufficient statistics are:
| Statistic | SQL expression | Size |
|---|---|---|
| \(n\) | COUNT(*) | scalar |
| \(S_{x_j}\) | SUM(x_j) | \(p\) values |
| \(S_{x_j x_k}\) | SUM(x_j * x_k) | \(p(p+1)/2\) values |
| \(S_{x_j y}\) | SUM(x_j * y) | \(p\) values |
| \(S_{yy}\) | SUM(y * y) | scalar |
All of these are computed in a single SQL query and returned as one row. The entire dataset is never materialized in R.
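To make the idea concrete, here is a sketch (not the package's internal query builder) of the sufficient statistics for a one-predictor model, mpg ~ wt, computed with plain dbplyr on the Quick start connection:

```r
# Every sufficient statistic is an ordinary SQL aggregate over the lazy table;
# only this single summary row is ever collected into R.
library(dplyr)

suff <- cars_db |>
  filter(!is.na(mpg), !is.na(wt)) |>
  summarise(
    n         = n(),
    s_wt      = sum(1.0 * wt, na.rm = TRUE),
    s_mpg     = sum(1.0 * mpg, na.rm = TRUE),
    s_wt_wt   = sum(1.0 * wt * wt, na.rm = TRUE),
    s_wt_mpg  = sum(1.0 * wt * mpg, na.rm = TRUE),
    s_mpg_mpg = sum(1.0 * mpg * mpg, na.rm = TRUE)
  ) |>
  collect()
```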
2. The generated SQL
For a model y ~ x1 + x2, sqlm generates:
SELECT
COUNT(*) AS N,
SUM(1.0 * "x1") AS S_x1,
SUM(1.0 * "x2") AS S_x2,
SUM(1.0 * "y") AS S_y,
SUM(1.0 * ("x1") * ("x1")) AS S_x1_x1,
SUM(1.0 * ("x1") * ("x2")) AS S_x1_x2,
SUM(1.0 * ("x1") * ("y")) AS S_x1_y,
SUM(1.0 * ("x2") * ("x2")) AS S_x2_x2,
SUM(1.0 * ("x2") * ("y")) AS S_x2_y,
SUM(1.0 * ("y") * ("y")) AS S_y_y
FROM (
SELECT ... FROM table WHERE x1 IS NOT NULL AND x2 IS NOT NULL AND y IS NOT NULL
) AS subquery

The 1.0 * cast prevents integer overflow on databases that use 32-bit integer arithmetic for SUM.
For grouped models, the same query adds GROUP BY columns and returns one row per group.
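Expressed through dbplyr (again, only an illustration of the shape of the query, not the package's own SQL builder), the grouped variant of the sketch above is just a group_by() before the summarise():

```r
# Same aggregates as before, now with a GROUP BY so that one row of
# sufficient statistics comes back per cyl group.
cars_db |>
  dplyr::group_by(cyl) |>
  dplyr::summarise(
    n        = dplyr::n(),
    s_wt     = sum(1.0 * wt, na.rm = TRUE),
    s_mpg    = sum(1.0 * mpg, na.rm = TRUE),
    s_wt_wt  = sum(1.0 * wt * wt, na.rm = TRUE),
    s_wt_mpg = sum(1.0 * wt * mpg, na.rm = TRUE)
  ) |>
  dplyr::show_query()
```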
3. Categorical variable handling
When a predictor is character or factor, sqlm:
- Scouts levels — runs SELECT DISTINCT col FROM ... ORDER BY col to discover all levels.
- Drops the reference level — the first (alphabetically) level is omitted (same convention as base R's contr.treatment).
- Generates SQL dummies — each remaining level becomes a CASE WHEN col = 'level' THEN 1.0 ELSE 0.0 END expression that participates in the sum-of-products query (see the sketch after the next paragraph).
For interactions involving categoricals (e.g., x * Species), the interaction is expanded to the Cartesian product of the numeric term and all dummy columns.
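As an illustration of the encoding step (this is not the package's internal code), the following sketch scouts levels for a character column and builds the corresponding CASE WHEN expressions, using an iris table written to the Quick start connection:

```r
# Scout the distinct levels, drop the reference level, and build one
# CASE WHEN dummy expression per remaining level.
DBI::dbWriteTable(con, "iris", iris)

lvls <- as.character(
  DBI::dbGetQuery(con, "SELECT DISTINCT Species FROM iris ORDER BY Species")$Species
)

dummies <- sprintf(
  "CASE WHEN Species = '%s' THEN 1.0 ELSE 0.0 END",
  lvls[-1]  # the first (alphabetical) level is the reference and gets no dummy
)
dummies
```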
4. Date and datetime handling
When a predictor column is a Date, POSIXct, or POSIXlt, sqlm converts it to a numeric value in SQL before including it in the sum-of-products query — mirroring what base R’s lm() does via model.matrix():
- Date columns become days since the Unix epoch: CAST(("col" - DATE '1970-01-01') AS DOUBLE PRECISION), equivalent to R's as.numeric(date).
- POSIXct / POSIXlt columns become seconds since the epoch: EXTRACT(EPOCH FROM "col").
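A quick R-side check of the equivalence (purely illustrative):

```r
# Dates count days since 1970-01-01; datetimes count seconds since the epoch.
as.numeric(as.Date("1970-01-11"))                          # 10 (days)
as.numeric(as.POSIXct("1970-01-01 00:01:00", tz = "UTC"))  # 60 (seconds)
```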
5. Matrix assembly and solution
Back in R, the returned sums are assembled into the \(p \times p\) matrix \(X^TX\) and the \(p\)-vector \(X^Ty\). The system is solved via Cholesky decomposition (chol2inv(chol(XtX))), which is both efficient and numerically stable for the positive-definite cross-product matrices produced by full-rank designs. For rank-deficient cases (e.g., perfectly collinear predictors), the solver falls back to MASS::ginv() (Moore-Penrose pseudoinverse).
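A condensed sketch of that solve step, assuming XtX and Xty have already been assembled from the returned sums (the helper name here is hypothetical, not part of the package's API):

```r
# Solve (X'X) beta = X'y via Cholesky, falling back to the Moore-Penrose
# pseudoinverse when X'X is not positive definite (rank-deficient design).
solve_normal_equations <- function(XtX, Xty) {
  XtX_inv <- tryCatch(
    chol2inv(chol(XtX)),
    error = function(e) MASS::ginv(XtX)
  )
  drop(XtX_inv %*% Xty)
}
```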
From the solution \(\hat\beta\):
| Quantity | Formula |
|---|---|
| RSS | \(S_{yy} - \hat\beta^T X^Ty\) |
| TSS | \(S_{yy} - S_y^2/n\) (intercept models) or \(S_{yy}\) (no-intercept) |
| \(R^2\) | \(1 - \text{RSS}/\text{TSS}\) |
| \(\hat\sigma^2\) | \(\text{RSS} / (n - p)\) |
| Standard errors | \(\sqrt{\text{diag}((X^TX)^{-1}) \cdot \hat\sigma^2}\) |
| \(t\)-statistics | \(\hat\beta_j / \text{SE}_j\) |
| \(F\)-statistic | \(\frac{(\text{TSS} - \text{RSS})/d_1}{\text{RSS}/d_2}\) |
Log-likelihood, AIC, and BIC are computed from the Gaussian likelihood assuming normal residuals.
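For reference, the standard Gaussian-likelihood formulas look like this, following the same conventions stats::logLik() uses for lm objects; the helper name and arguments are illustrative, with p counting all estimated coefficients (including the intercept, if present):

```r
# Information criteria from the residual sum of squares under the
# assumption of i.i.d. normal residuals.
gaussian_ic <- function(rss, n, p) {
  log_lik <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)
  k <- p + 1  # coefficients plus the residual variance
  c(logLik = log_lik,
    AIC    = -2 * log_lik + 2 * k,
    BIC    = -2 * log_lik + log(n) * k)
}
```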
6. NA handling
Before computing sums, all rows with NULL in any model variable are excluded via a WHERE ... IS NOT NULL clause (equivalent to na.action = na.omit). This is done through dbplyr’s dplyr::filter(if_all(..., ~ !is.na(.))).
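In dbplyr terms, that filter looks roughly like this (the column vector is hard-coded here for illustration):

```r
# Exclude any row with a NULL in a model variable before aggregating.
model_vars <- c("mpg", "wt", "hp", "cyl")

cars_db |>
  dplyr::filter(dplyr::if_all(dplyr::all_of(model_vars), ~ !is.na(.x))) |>
  dplyr::show_query()  # translates to a WHERE clause testing each column for NULL
```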
7. Transform and expression support
Formula terms wrapped in I(), along with log() and sqrt() calls, have their expressions translated to SQL:
- x^2 becomes POWER(x, 2)
- log(x) becomes LN(x)
- sqrt(x) becomes SQRT(x)
These translated expressions are substituted into the sum-of-products query like any other predictor.
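A usage sketch on the Quick start table; the arrows show the translations described above (exact quoting may vary by backend):

```r
# Formula transforms are rewritten into SQL expressions before the
# sum-of-products query is built.
model_tr <- lm_sql(mpg ~ I(hp^2) + log(wt) + sqrt(disp), data = cars_db)
# I(hp^2)    -> POWER("hp", 2)
# log(wt)    -> LN("wt")
# sqrt(disp) -> SQRT("disp")
```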
8. Interaction screening (internal)
The cost of the sufficient-statistics query grows as \(O(p^2)\) with the number of predictors, and interaction terms can inflate \(p\) rapidly (e.g., x1 * x2 * cat with 10 category levels adds dozens of cross-product terms). The internal helper screen_interactions() addresses this by pre-screening candidate interactions with cheap SQL queries before paying the full expansion cost. This function is not exported; advanced users can access it via sqlm:::screen_interactions().
For each interaction term in the formula, a lightweight GROUP BY query estimates the interaction strength:
- Categorical × Categorical: Computes cell means of the response via GROUP BY cat1, cat2 and tests whether the between-cell variance exceeds a main-effects-only model using a crude F-ratio.
- Categorical × Numeric: Computes per-category slopes via GROUP BY cat and tests for slope heterogeneity — whether the numeric predictor's relationship with the response differs meaningfully across categories.
- Numeric × Numeric: Computes CORR(x1 * x2, y) in SQL and checks whether the partial correlation of the product term with the response exceeds a minimum threshold, with a t-test for significance (see the sketch below).
Interactions that fail the screening threshold are dropped; main effects are always retained.
This is particularly useful when exploring many candidate interactions on large tables — each screening query touches the data once with a simple aggregation, rather than building the full \(p \times p\) cross-product matrix.
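To illustrate the numeric × numeric case (this is not the internal screen_interactions() code, just the flavour of the query it runs):

```r
# A single aggregate estimates how strongly the product term tracks the
# response, without ever materializing the full cross-product matrix.
DBI::dbGetQuery(
  con,
  "SELECT CORR(hp * wt, mpg) AS r_product, COUNT(*) AS n FROM mtcars"
)
```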
Limitations
- Predictor count: The \(p \times p\) covariance matrix is materialized in R, so very large \(p\) (thousands of predictors) may hit memory limits. The SQL query also grows as \(O(p^2)\) terms.
- Numerical precision: The normal equations approach is less numerically stable than QR decomposition for highly collinear predictors. Cholesky decomposition is used for full-rank designs with MASS::ginv() as a fallback, but results may diverge from lm() for near-singular designs.
- OLS only: No support for median regression, quantile regression, or regularized (ridge/lasso) models.
- Database support: Requires standard SQL aggregation support (SUM, COUNT, CASE WHEN). Tested on DuckDB, PostgreSQL, and Snowflake.