library(DBI)
library(dbplyr)
library(duckdb)
library(sqlm)
con <- DBI::dbConnect(duckdb::duckdb())
db_mtcars <- copy_to(con, mtcars, "mtcars", overwrite = TRUE)Many Models
A common analytical task is fitting the same model specification across multiple subgroups — by region, product category, time period, or any other grouping variable.
In R, this is known as the “many models” approach, popularized by Hadley Wickham’s R for Data Science and expanded on in Tim Tiefenbach’s excellent blog post on dplyr many-models.
The traditional workflow looks like:
- Nest the data by group (
nest_by()orgroup_by() |> nest()) (key insight is this returns a grouped row-wise tibble) - Fit a model to each nested subset (
mutate(mod = list(lm(...)))) - Extract results with
broom::tidy()orbroom::glance() - Unnest back into a flat table
This works well in-memory, but breaks down when the data lives in a database and is too large to pull into R. That’s where sqlm comes in.
Grouped regression with lm_sql()
With sqlm, the many-models workflow collapses to a single GROUP BY aggregation. Instead of nesting data and fitting separate models, lm_sql() detects grouped data and computes all sufficient statistics in one SQL pass:
grouped_models <- db_mtcars |>
dplyr::group_by(am) |>
lm_sql(mpg ~ wt + hp, data = _)
print(grouped_models)# A tibble: 2 × 2
am model
<dbl> <list>
1 0 <sqlm::__>
2 1 <sqlm::__>
The result is a tibble with one row per group. Each row contains the grouping variable(s) and a nested lm_sql_result model object in the model column.
Extracting results with broom
Because lm_sql_result objects have tidy() and glance() methods, the standard rowwise() + mutate() pattern works directly:
Model-level summaries with glance()
grouped_models |>
dplyr::rowwise() |>
dplyr::mutate(glance_model = list(broom::glance(model))) |>
tidyr::unnest(glance_model)# A tibble: 2 × 13
am model r.squared adj.r.squared sigma statistic p.value df logLik
<dbl> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 <sqlm::__> 0.768 0.739 1.96 26.4 8.51e-6 2 -38.1
2 1 <sqlm::__> 0.837 0.804 2.73 25.7 1.15e-4 2 -29.8
# ℹ 4 more variables: AIC <dbl>, BIC <dbl>, nobs <dbl>, df.residual <dbl>
Coefficient-level summaries with tidy()
grouped_models |>
dplyr::rowwise() |>
dplyr::mutate(tidy_model = list(broom::tidy(model))) |>
tidyr::unnest(tidy_model)# A tibble: 6 × 7
am model term estimate std.error statistic p.value
<dbl> <list> <chr> <dbl> <dbl> <dbl> <dbl>
1 0 <sqlm::__> (Intercept) 30.7 2.29 13.4 4.16e-10
2 0 <sqlm::__> wt -1.86 0.810 -2.29 3.59e- 2
3 0 <sqlm::__> hp -0.0409 0.0117 -3.50 2.94e- 3
4 1 <sqlm::__> (Intercept) 44.4 3.90 11.4 4.73e- 7
5 1 <sqlm::__> wt -7.62 2.20 -3.47 6.06e- 3
6 1 <sqlm::__> hp -0.0132 0.0161 -0.814 4.34e- 1
This gives you a flat table of all coefficients across all groups — ready for plotting or comparison.
Multiple grouping variables
You can group by more than one variable. Each unique combination of grouping variables produces its own model:
db_mtcars |>
dplyr::group_by(am, vs) |>
lm_sql(mpg ~ wt + hp, data = _) |>
dplyr::rowwise() |>
dplyr::mutate(glance_model = list(broom::glance(model))) |>
tidyr::unnest(glance_model) |>
dplyr::select(am, vs, r.squared, adj.r.squared, nobs)# A tibble: 4 × 5
am vs r.squared adj.r.squared nobs
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0 0 0.507 0.398 12
2 1 0 0.933 0.889 6
3 1 1 0.630 0.445 7
4 0 1 0.745 0.618 7
The key insight: if your goal is separate models per group rather than a single model with group indicators, group_by() |> lm_sql() is both simpler to write and faster to execute.