``` r
db_large <- contoso::create_contoso_duckdb(size = "100m")
sales_large_db <- db_large$sales
dplyr::count(sales_large_db) |> dplyr::pull(n)
#> 237,245,485
```

## Benchmarking
Because lm_sql() pushes all computation into the database as a single aggregation query, it can handle datasets that would never fit in memory. Here we benchmark against the contoso 100M dataset (which expands to ~237 million rows) using system.time().
``` r
# Simple model: 2 numeric predictors
system.time(lm_sql(net_price ~ quantity + unit_cost, data = sales_large_db))
#> elapsed: 1.44s
# Categorical predictor
system.time(lm_sql(net_price ~ quantity + unit_cost + currency_code, data = sales_large_db))
#> elapsed: 8.20s
# Interaction: numeric * factor
system.time(lm_sql(net_price ~ quantity + unit_cost * currency_code, data = sales_large_db))
#> elapsed: 23.77s
# Grouped: 5 separate models in a single pass
system.time({
sales_large_db |>
dplyr::group_by(currency_code) |>
lm_sql(net_price ~ quantity + unit_cost, data = _)
})
#> elapsed: 2.36s
```

| Model | Predictors | Time |
|---|---|---|
| Simple | quantity + unit_cost | 1.44s |
| Categorical | quantity + unit_cost + currency_code | 8.20s |
| Interaction | quantity + unit_cost * currency_code | 23.77s |
| Grouped (5 models) | quantity + unit_cost by currency_code | 2.36s |
The grouped run is the most efficient per model: because GROUP BY computes all 5 regressions in a single SQL aggregation pass, fitting 5 models takes barely longer than fitting 1. All models complete in under 24 seconds on 237 million rows without loading any data into R.
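To make the single-pass claim concrete, here is a minimal sketch of the sufficient-statistics approach that a tool like lm_sql() can use; it is an illustration under assumptions, not lm_sql()'s actual implementation. For net_price ~ quantity + unit_cost, OLS only needs the entries of X'X and X'y, all of which are sums that dbplyr can compute in one SQL aggregation over sales_large_db (the lazy table created above); the 3x3 normal equations are then solved in R.

``` r
library(dplyr)

# One summarise() call = one SQL aggregation pass over ~237M rows.
# It returns the nine sums that make up X'X and X'y for
# net_price ~ quantity + unit_cost (with an intercept).
stats <- sales_large_db |>
  summarise(
    n_obs = n(),
    s1  = sum(quantity,  na.rm = TRUE),
    s2  = sum(unit_cost, na.rm = TRUE),
    s11 = sum(quantity * quantity,   na.rm = TRUE),
    s12 = sum(quantity * unit_cost,  na.rm = TRUE),
    s22 = sum(unit_cost * unit_cost, na.rm = TRUE),
    sy  = sum(net_price, na.rm = TRUE),
    s1y = sum(quantity * net_price,  na.rm = TRUE),
    s2y = sum(unit_cost * net_price, na.rm = TRUE)
  ) |>
  collect()  # a single one-row result comes back to R

# Assemble and solve the normal equations X'X b = X'y in R.
XtX <- with(stats, matrix(c(n_obs, s1,  s2,
                            s1,    s11, s12,
                            s2,    s12, s22),
                          nrow = 3, byrow = TRUE))
Xty <- with(stats, c(sy, s1y, s2y))

coefs <- solve(XtX, Xty)
names(coefs) <- c("(Intercept)", "quantity", "unit_cost")
coefs
```

Under the same assumptions, the grouped benchmark is this query with dplyr::group_by(currency_code) in front: the database still scans the table once and returns one row of sums per currency, which is why fitting five models costs about as much as fitting one.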