# development version
devtools::install_git("https://codeberg.org/usrbinr/sqlm.git")Getting started
sqlm is an R package that makes it easy to fit linear models to data stored in SQL databases. It leverages the power of dbplyr to translate R code into SQL queries, allowing users to work with large datasets without needing to load them into memory.
To get started with sqlm, you can follow these steps:
Install the package: If you haven’t already, install the sqlm package from Codeberg.
Connect to your database: Use DBI and dbplyr to connect to your SQL database.
library(DBI)
library(dbplyr)
library(duckdb)
library(sqlm)
# we will use contoso database for this example
#create a duckdb database with contoso data
db_small <- contoso::create_contoso_duckdb(size="100k")
# assigns the sales table in the duckdb database to a variable
sales_small_db <- db_small$sales
# assigns the regular in-memory tibble to a variable
sales_small_tbl <- contoso::salesThis produces a list of databases that we can reference with typical ‘$’ syntax. In this example we leverage the ‘sales’ tables.
The only difference between these tables is that one is in a duckdb database whereas the other is an in-memory tibble (and one is 100k rows vs ~10k rows).
Fit a linear model: Use the lm_sql() function to fit a linear model to your data. You can specify the formula and the data source as you would with the standard lm() function.
Note most standard lm() formula syntax is supported including interactions with * and : operators, 0 intercept models and putting all columns with ..
lm_obj <- sales_small_db |>
sqlm::lm_sql(net_price~quantity+unit_cost*currency_code,data = _)
print(lm_obj)
Big Data Linear Regression (S7)
-------------------------------
Call:
sqlm::lm_sql(formula = net_price ~ quantity + unit_cost * currency_code,
data = sales_small_db)
Coefficients:
(Intercept) quantity
-24.38846690 -0.67317405
unit_cost currency_code_CAD
2.50924326 2.25062985
currency_code_EUR currency_code_GBP
-6.57743263 -6.59090333
currency_code_USD unit_cost:currency_code_CAD
-9.14607360 -0.06201195
unit_cost:currency_code_EUR unit_cost:currency_code_GBP
0.01998218 0.04250501
unit_cost:currency_code_USD
0.05225576
Not only does this print a familiar output, we can leverage standard broom functions such as tidy() and glance() to easily summarize the model output.
lm_obj |>
broom::tidy()# A tibble: 11 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -24.4 5.39 -4.52 0.00000614
2 quantity -0.673 0.450 -1.50 0.135
3 unit_cost 2.51 0.0266 94.5 0
4 currency_code_CAD 2.25 6.56 0.343 0.732
5 currency_code_EUR -6.58 5.87 -1.12 0.262
6 currency_code_GBP -6.59 6.78 -0.973 0.331
7 currency_code_USD -9.15 5.53 -1.65 0.0982
8 unit_cost:currency_code_CAD -0.0620 0.0332 -1.87 0.0621
9 unit_cost:currency_code_EUR 0.0200 0.0299 0.667 0.505
10 unit_cost:currency_code_GBP 0.0425 0.0332 1.28 0.201
11 unit_cost:currency_code_USD 0.0523 0.0280 1.87 0.0622
lm_obj |>
broom::glance()# A tibble: 1 × 11
r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0.953 0.952 87.4 15616. 0 10 -45893. 91810. 91894.
# ℹ 2 more variables: nobs <dbl>, df.residual <dbl>
How does this compare to standard lm()?
For convenience we are defining the formula first and then using it in both the standard lm function and the sqlm function however as you saw earlier sqlm::lm_sql() works just like lm() with formula body in the function call.
formula_input <- formula(net_price~quantity*unit_cost)
lm_tidy_tbl <- lm(formula_input,data=contoso::sales) |>
broom::tidy()
sqlm_tidy_tbl <-sqlm::lm_sql(formula_input,data = sales_small_db) |>
broom::tidy()| Comparison of lm() and lm_sql() results | ||||
| Term | Estimate (lm) | Estimate (sqlm) | Std. Error (lm) | Std. Error (sqlm) |
|---|---|---|---|---|
| (Intercept) | -30.372882622 | -30.372882622 | 2.267749367 | 2.267749367 |
| quantity | -0.874828942 | -0.874828942 | 0.596018745 | 0.596018745 |
| unit_cost | 2.534482193 | 2.534482193 | 0.011540823 | 0.011540823 |
| quantity:unit_cost | 0.001330544 | 0.001330544 | 0.002977435 | 0.002977435 |
As we can see there are no differences in the results between the two model fits.
Note there are minor differences in how factor variables are named in output as sqlm() will append “_” when using the column name and its levels eg. currency_code_USD in lm_sql() vs. currency_codeUSD in lm().
What else can sqlm do?
sqlm also supports a many-models workflow via dplyr::group_by(). Instead of nesting data and fitting separate models in R, lm_sql() computes all grouped models in a single SQL GROUP BY pass:
grouped_db <- sales_small_db |>
dplyr::group_by(currency_code) |>
lm_sql(net_price~quantity+unit_cost,data = _)
print(grouped_db)# A tibble: 5 × 2
currency_code model
<chr> <list>
1 EUR <sqlm::__>
2 AUD <sqlm::__>
3 GBP <sqlm::__>
4 CAD <sqlm::__>
5 USD <sqlm::__>
Each row contains a model object for a given group. Use rowwise() + mutate() with broom::tidy() or broom::glance() to extract results:
grouped_db |>
dplyr::rowwise() |>
dplyr::mutate(
glance_model = list(broom::glance(model))
) |>
tidyr::unnest(glance_model)# A tibble: 5 × 13
currency_code model r.squared adj.r.squared sigma statistic p.value df
<chr> <list> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 EUR <sqlm::__> 0.952 0.952 83.9 18087. 0 2
2 AUD <sqlm::__> 0.958 0.958 80.3 5288. 0 2
3 GBP <sqlm::__> 0.955 0.955 92.8 7235. 0 2
4 CAD <sqlm::__> 0.949 0.949 86.1 7723. 0 2
5 USD <sqlm::__> 0.952 0.952 89.0 39718. 0 2
# ℹ 5 more variables: logLik <dbl>, AIC <dbl>, BIC <dbl>, nobs <dbl>,
# df.residual <dbl>
For a deeper dive into the many-models pattern — including multiple grouping variables, comparisons with in-memory approaches, and benchmarks — see the Many Models vignette.