Benchmarking

Because lm_sql() pushes all computation into the database as a single aggregation query, it can handle datasets that would never fit in memory. Here we benchmark it against the contoso 100M dataset (whose sales table expands to ~237 million rows), timing each call with system.time().
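To make the "single aggregation query" idea concrete, here is a minimal sketch of the approach on a small in-memory duckdb table: for a formula like mpg ~ wt + hp, OLS only needs the sums, squares, and cross-products that appear in the normal equations, and the database can compute all of them in one aggregation. The table, column names, and the manual solve() step are illustrative only; lm_sql() assembles the equivalent statistics for arbitrary formulas, factors, and interactions.

library(DBI)
library(duckdb)
library(dplyr)   # dbplyr must also be installed for the SQL translation

con <- dbConnect(duckdb())
duckdb_register(con, "mtcars_db", mtcars)

# One aggregation query: every entry of X'X and X'y as a SUM()/COUNT()
stats <- tbl(con, "mtcars_db") |>
  summarise(
    n   = n(),
    sx1 = sum(wt, na.rm = TRUE),
    sx2 = sum(hp, na.rm = TRUE),
    sy  = sum(mpg, na.rm = TRUE),
    s11 = sum(wt * wt, na.rm = TRUE),
    s12 = sum(wt * hp, na.rm = TRUE),
    s22 = sum(hp * hp, na.rm = TRUE),
    s1y = sum(wt * mpg, na.rm = TRUE),
    s2y = sum(hp * mpg, na.rm = TRUE)
  ) |>
  collect()

# Solve the tiny p x p system in R: beta = (X'X)^-1 X'y
# (same coefficients as coef(lm(mpg ~ wt + hp, data = mtcars)))
XtX <- with(stats, matrix(c(n,   sx1, sx2,
                            sx1, s11, s12,
                            sx2, s12, s22), nrow = 3, byrow = TRUE))
Xty <- with(stats, c(sy, s1y, s2y))
solve(XtX, Xty)

dbDisconnect(con, shutdown = TRUE)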

db_large <- contoso::create_contoso_duckdb(size = "100m")

sales_large_db <- db_large$sales

dplyr::count(sales_large_db) |> dplyr::pull(n)
#> 237,245,485

# Simple model: 2 numeric predictors
system.time(lm_sql(net_price ~ quantity + unit_cost, data = sales_large_db))
#>  elapsed: 1.44s

# Categorical predictor
system.time(lm_sql(net_price ~ quantity + unit_cost + currency_code, data = sales_large_db))
#>  elapsed: 8.20s

# Interaction: numeric * factor
system.time(lm_sql(net_price ~ quantity + unit_cost * currency_code, data = sales_large_db))
#>  elapsed: 23.77s

# Grouped: 5 separate models in a single pass
system.time({
  sales_large_db |>
    dplyr::group_by(currency_code) |>
    lm_sql(net_price ~ quantity + unit_cost, data = _)
})
#>  elapsed: 2.36s

Benchmark results on 237M rows (DuckDB, single machine):

Model               Predictors                               Time
Simple              quantity + unit_cost                     1.44s
Categorical         quantity + unit_cost + currency_code     8.20s
Interaction         quantity + unit_cost * currency_code     23.77s
Grouped (5 models)  quantity + unit_cost by currency_code    2.36s

Grouping is nearly free: fitting 5 separate models takes barely longer than fitting one, because GROUP BY computes all 5 models in a single SQL aggregation pass. All models complete in under 24 seconds on 237 million rows without loading any data into R.
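The single-pass claim can be sketched in the same hedged style as the earlier example: adding group_by() turns the aggregation into one GROUP BY query, so every group's sufficient statistics arrive in a single scan, and only the tiny per-group solves happen in R. As before, the mtcars table, the column names, and the closed-form slope/intercept are illustrative, not lm_sql() internals.

library(DBI)
library(duckdb)
library(dplyr)

con <- dbConnect(duckdb())
duckdb_register(con, "mtcars_db", mtcars)

# One GROUP BY aggregation: per-group sums for a simple mpg ~ wt fit
grouped_stats <- tbl(con, "mtcars_db") |>
  group_by(cyl) |>
  summarise(
    n   = n(),
    sx  = sum(wt, na.rm = TRUE),
    sy  = sum(mpg, na.rm = TRUE),
    sxx = sum(wt * wt, na.rm = TRUE),
    sxy = sum(wt * mpg, na.rm = TRUE),
    .groups = "drop"
  ) |>
  collect()

# Closed-form simple-regression coefficients, one row (model) per group
grouped_stats |>
  mutate(
    slope     = (n * sxy - sx * sy) / (n * sxx - sx^2),
    intercept = (sy - slope * sx) / n
  )

dbDisconnect(con, shutdown = TRUE)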