Function reference
All functions and classes exported by sqlm.
Functions
| Function | Description |
|---|---|
diagnose_lm_sql() |
Diagnose potential issues before fitting lm_sql |
glance.lm_sql_result() |
Glance at an lm_sql_result |
lm_sql() |
SQL-Backed Linear Regression |
lm_sql_result() |
Result object for SQL-backed Linear Model |
orbital.lm_sql_result() |
Convert an lm_sql_result to an orbital object |
print.lm_sql_result() |
Print an lm_sql_result |
print.summary_lm_sql_result() |
Print a summary_lm_sql_result |
screen_interactions() |
Screen Interaction Effects via SQL Aggregation |
summary.lm_sql_result() |
Summary for lm_sql_result |
tidy.lm_sql_result() |
Tidy an lm_sql_result |
Function Details
diagnose_lm_sql()
Diagnose potential issues before fitting lm_sql
Usage
diagnose_lm_sql(formula, data, verbose = TRUE)Arguments
| Argument | Description |
|---|---|
formula |
A formula object. |
data |
A tbl_sql object. |
verbose |
Logical. Print diagnostic messages? Default TRUE. |
Value
Description
Checks a formula and database table for common issues that would cause model fitting to fail or produce unreliable results.
Details
Checks performed:
Column existence Numeric variation (constant columns) Near-perfect collinearity between numeric predictors Categorical level counts (single-level and high-cardinality) Sample size vs parameter count Missing value prevalence
Examples
library(dplyr)
library(DBI)
con <- dbConnect(duckdb::duckdb())
db_mtcars <- copy_to(con, mtcars, "mtcars")
# Check for issues before fitting
diagnose_lm_sql(mpg ~ ., data = db_mtcars)
# Suppress output but capture issues
issues <- diagnose_lm_sql(mpg ~ wt + hp, data = db_mtcars, verbose = FALSE)
dbDisconnect(con)glance.lm_sql_result()
Glance at an lm_sql_result
Usage
glancelm_sql_result(x, ...)Arguments
| Argument | Description |
|---|---|
x |
An lm_sql_result object. |
... |
Not used. |
Value
A single-row tibble with columns r.squared, adj.r.squared, sigma, statistic, p.value, df, logLik, AIC, BIC, nobs, and df.residual.
Description
Extract a single-row tibble of model-level summary statistics from a fitted SQL linear model.
Details
Returns R-squared, adjusted R-squared, residual standard error, F-statistic and its p-value, model degrees of freedom, log-likelihood, AIC, BIC, number of observations, and residual degrees of freedom.
lm_sql()
SQL-Backed Linear Regression
Usage
lm_sql(formula, data, tol = 1e-07, warnings = c("full", "minimal", "none"))Arguments
| Argument | Description |
|---|---|
formula |
A formula object (e.g., price ~ x + cut). |
data |
A tbl_sql object (from dbplyr). |
tol |
Tolerance for detecting linear dependency. |
warnings |
Control warning verbosity: "full" (default) shows detailed warnings with diagnostic tips, "minimal" shows one-line warnings, "none" suppresses warnings entirely. |
Value
An S7 object of class lm_sql_result, or a tibble with a model list-column if the data is grouped.
Description
Fits a linear regression model using SQL aggregation on a remote database table. The data never leaves the database — only sufficient statistics (sums and cross-products) are returned to R.
Details
The function computes the X^TX and X^Ty matrices entirely inside the database engine via a single SQL aggregation query, then solves the normal equations in R using Cholesky decomposition (falling back to Moore-Penrose pseudoinverse for rank-deficient designs).
Supported formula features:
Numeric and categorical (character/factor) predictors with
automatic dummy encoding via `CASE WHEN`.
Interaction terms (`*` and `:`) including numeric × categorical
and categorical × categorical cross-products.
Dot expansion (`y ~ .`) to all non-response columns.
Transforms: `I()`, `log()`, and `sqrt()` translated to SQL
equivalents (`POWER`, `LN`, `SQRT`).
Date and datetime predictors automatically cast to numeric in SQL.
No-intercept models (`y ~ 0 + x`).
For grouped data (via [dplyr::group_by()]), a single GROUP BY query is executed and one model per group is returned in a tibble with a model list-column.
NA handling uses listwise deletion: rows with NULL in any model variable are excluded via a WHERE ... IS NOT NULL clause.
Examples
library(dplyr)
library(DBI)
# Connect to database
con <- dbConnect(duckdb::duckdb())
db_mtcars <- copy_to(con, mtcars, "mtcars")
# Basic regression
fit <- lm_sql(mpg ~ wt + hp, data = db_mtcars)
print(fit)
summary(fit)
# Grouped regression (one model per group)
db_mtcars |>
group_by(cyl) |>
lm_sql(mpg ~ wt, data = _)
# Control warning verbosity
lm_sql(mpg ~ wt, data = db_mtcars, warnings = "minimal")
lm_sql(mpg ~ wt, data = db_mtcars, warnings = "none")
# Use with broom
broom::tidy(fit)
broom::glance(fit)
dbDisconnect(con)lm_sql_result()
Result object for SQL-backed Linear Model
Usage
lm_sql_result(
coefficients = integer(0),
std_error = integer(0),
sigma = numeric(0),
r_squared = numeric(0),
adj_r_squared = numeric(0),
f_statistic = integer(0),
f_p_value = integer(0),
logLik = numeric(0),
AIC = numeric(0),
BIC = numeric(0),
nobs = numeric(0),
df_residual = numeric(0),
df_model = numeric(0),
statistic = integer(0),
p_value = integer(0),
call = NULL,
term_expressions = NULL
)Description
An S7 class that stores the complete results of a SQL-backed linear regression fitted by [lm_sql()].
Details
This class is not called directly by users. It is created internally by [lm_sql()] and returned as the model object. It stores fitted coefficients, standard errors, t-statistics, p-values, and model-level summaries (R-squared, F-statistic, AIC, BIC, etc.). The term_expressions property holds named R expressions for each predictor, which are used by the [orbital.lm_sql_result()] method to generate in-database prediction expressions.
orbital.lm_sql_result()
Convert an lm_sql_result to an orbital object
Usage
orbital.lm_sql_result(x, ..., prefix = ".pred")Arguments
| Argument | Description |
|---|---|
x |
An lm_sql_result object. |
... |
Not used. |
prefix |
Column name for predictions. Defaults to ".pred". |
Value
An orbital_class object.
Description
Creates an orbital object from a fitted SQL linear model, enabling in-database predictions without pulling data into R.
Details
Builds a single prediction expression by combining the fitted coefficients with the R expressions stored in term_expressions. For categorical predictors, the expression includes ifelse() calls that dbplyr translates to SQL CASE WHEN. The resulting orbital_class object can be used with [orbital::predict()] to get predictions or [orbital::augment()] to append a .pred column to a database table.
print.lm_sql_result()
Print an lm_sql_result
Usage
printlm_sql_result(x, ...)Arguments
| Argument | Description |
|---|---|
x |
An lm_sql_result object. |
... |
Not used. |
Value
Invisibly returns x.
Description
Display a concise summary of a fitted SQL linear model.
Details
Prints the original function call and the named coefficient vector.
print.summary_lm_sql_result()
Print a summary_lm_sql_result
Usage
printsummary_lm_sql_result(
x,
digits = max(3L, getOption("digits") - 3L),
signif.stars = getOption("show.signif.stars", TRUE),
...
)Arguments
| Argument | Description |
|---|---|
x |
A summary_lm_sql_result object. |
digits |
Minimum number of significant digits for printing. |
signif.stars |
Logical; if TRUE, significance stars are printed. |
... |
Additional arguments (not used). |
Value
Invisibly returns x.
Description
Prints a formatted summary of an SQL-backed linear model, mimicking the output style of [print.summary.lm()].
screen_interactions()
Screen Interaction Effects via SQL Aggregation
Usage
screen_interactions(formula, data, threshold = 0.05, min_cor = 0.05)Arguments
| Argument | Description |
|---|---|
formula |
A formula with interaction terms (e.g., y ~ x1 * x2 * cat1). |
data |
A tbl_sql object (from dbplyr). |
threshold |
Significance threshold for retaining an interaction. Default 0.05. Interactions with a p-value above this are dropped. |
min_cor |
Minimum absolute partial correlation for numeric x numeric interactions. Default 0.05. |
Value
A formula with weak interactions removed. Main effects are always retained. The original formula is returned unchanged if all interactions pass screening. An attribute “screened” is attached listing dropped terms.
Description
Pre-screens candidate interaction terms using lightweight SQL GROUP BY queries to identify which interactions show meaningful signal, before paying the full O(p^2) cost of expanding them all into the sufficient-statistics query.
Details
For each candidate interaction in the formula, a cheap SQL query estimates the interaction strength:
Categorical x Categorical: Computes cell means of the
response via GROUP BY cat1, cat2 and tests whether variance of
cell means (beyond main effects) exceeds a threshold, using a crude
F-ratio.
Categorical x Numeric: Computes per-category slopes via
GROUP BY cat and tests whether slopes differ significantly
across categories (slope heterogeneity test).
Numeric x Numeric: Computes the partial correlation of
x1 * x2 with the response y and checks if it exceeds
a minimum threshold.
Returns a filtered formula retaining only interactions that pass the screening threshold. Main effects are always kept.
summary.lm_sql_result()
Summary for lm_sql_result
Usage
summarylm_sql_result(object, ...)Arguments
| Argument | Description |
|---|---|
object |
An lm_sql_result object. |
... |
Not used. |
Value
An object of class summary_lm_sql_result with components:
callThe original function call.
coefficientsA matrix with columns for Estimate, Std. Error,
t value, and Pr(>|t|).
sigmaResidual standard error.
r.squaredR-squared value.
adj.r.squaredAdjusted R-squared value.
fstatisticA named vector with the F-statistic, numerator df,
and denominator df.
dfDegrees of freedom: model df, residual df, and total coefficients.
nobsNumber of observations.
Description
Produces a summary of a fitted SQL-backed linear model, similar to [summary.lm()] for base R linear models.
Details
Returns an object of class summary_lm_sql_result containing the coefficient table (estimates, standard errors, t-statistics, and p-values), R-squared statistics, F-statistic, and residual standard error.
Unlike [summary.lm()], this summary does not include residual quantiles since the raw data remains in the database.
tidy.lm_sql_result()
Tidy an lm_sql_result
Usage
tidylm_sql_result(x, conf.int = FALSE, conf.level = 0.95, ...)Arguments
| Argument | Description |
|---|---|
x |
An lm_sql_result object. |
conf.int |
Logical. If TRUE, include confidence interval columns conf.low and conf.high. Defaults to FALSE. |
conf.level |
Confidence level for the interval. Defaults to 0.95. |
... |
Not used. |
Value
A tibble with columns term, estimate, std.error, statistic, and p.value. If conf.int = TRUE, also conf.low and conf.high.
Description
Extract a tidy tibble of per-term coefficient statistics from a fitted SQL linear model.
Details
Returns one row per model term with the estimate, standard error, t-statistic, and p-value. When conf.int = TRUE, confidence intervals are computed using the t-distribution with df_residual degrees of freedom.