Many Models

A common analytical task is fitting the same model specification across multiple subgroups — by region, product category, time period, or any other grouping variable.

In R, this is known as the “many models” approach, popularized by Hadley Wickham’s R for Data Science and expanded on in Tim Tiefenbach’s excellent blog post on dplyr many-models.

The traditional workflow looks like:

  1. Nest the data by group (nest_by() or group_by() |> nest()) (key insight is this returns a grouped row-wise tibble)
  2. Fit a model to each nested subset (mutate(mod = list(lm(...))))
  3. Extract results with broom::tidy() or broom::glance()
  4. Unnest back into a flat table

This works well in-memory, but breaks down when the data lives in a database and is too large to pull into R. That’s where sqlm comes in.

Grouped regression with lm_sql()

With sqlm, the many-models workflow collapses to a single GROUP BY aggregation. Instead of nesting data and fitting separate models, lm_sql() detects grouped data and computes all sufficient statistics in one SQL pass:

library(DBI)
library(dbplyr)
library(duckdb)
library(sqlm)

con <- DBI::dbConnect(duckdb::duckdb())
db_mtcars <- copy_to(con, mtcars, "mtcars", overwrite = TRUE)
grouped_models <- db_mtcars |>
  dplyr::group_by(am) |>
  lm_sql(mpg ~ wt + hp, data = _)

print(grouped_models)
# A tibble: 2 × 2
     am model     
  <dbl> <list>    
1     0 <sqlm::__>
2     1 <sqlm::__>

The result is a tibble with one row per group. Each row contains the grouping variable(s) and a nested lm_sql_result model object in the model column.

Extracting results with broom

Because lm_sql_result objects have tidy() and glance() methods, the standard rowwise() + mutate() pattern works directly:

Model-level summaries with glance()

grouped_models |>
  dplyr::rowwise() |>
  dplyr::mutate(glance_model = list(broom::glance(model))) |>
  tidyr::unnest(glance_model)
# A tibble: 2 × 13
     am model      r.squared adj.r.squared sigma statistic  p.value    df logLik
  <dbl> <list>         <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl>
1     0 <sqlm::__>     0.768         0.739  1.96      26.4  8.51e-6     2  -38.1
2     1 <sqlm::__>     0.837         0.804  2.73      25.7  1.15e-4     2  -29.8
# ℹ 4 more variables: AIC <dbl>, BIC <dbl>, nobs <dbl>, df.residual <dbl>

Coefficient-level summaries with tidy()

grouped_models |>
  dplyr::rowwise() |>
  dplyr::mutate(tidy_model = list(broom::tidy(model))) |>
  tidyr::unnest(tidy_model)
# A tibble: 6 × 7
     am model      term        estimate std.error statistic  p.value
  <dbl> <list>     <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1     0 <sqlm::__> (Intercept)  30.7       2.29      13.4   4.16e-10
2     0 <sqlm::__> wt           -1.86      0.810     -2.29  3.59e- 2
3     0 <sqlm::__> hp           -0.0409    0.0117    -3.50  2.94e- 3
4     1 <sqlm::__> (Intercept)  44.4       3.90      11.4   4.73e- 7
5     1 <sqlm::__> wt           -7.62      2.20      -3.47  6.06e- 3
6     1 <sqlm::__> hp           -0.0132    0.0161    -0.814 4.34e- 1

This gives you a flat table of all coefficients across all groups — ready for plotting or comparison.

Multiple grouping variables

You can group by more than one variable. Each unique combination of grouping variables produces its own model:

db_mtcars |>
  dplyr::group_by(am, vs) |>
  lm_sql(mpg ~ wt + hp, data = _) |>
  dplyr::rowwise() |>
  dplyr::mutate(glance_model = list(broom::glance(model))) |>
  tidyr::unnest(glance_model) |>
  dplyr::select(am, vs, r.squared, adj.r.squared, nobs)
# A tibble: 4 × 5
     am    vs r.squared adj.r.squared  nobs
  <dbl> <dbl>     <dbl>         <dbl> <dbl>
1     0     0     0.507         0.398    12
2     1     0     0.933         0.889     6
3     1     1     0.630         0.445     7
4     0     1     0.745         0.618     7

The key insight: if your goal is separate models per group rather than a single model with group indicators, group_by() |> lm_sql() is both simpler to write and faster to execute.