Function reference

All functions and classes exported by sqlm.

Functions

Function	Description
`diagnose_lm_sql()`	Diagnose potential issues before fitting lm_sql
`glance.lm_sql_result()`	Glance at an lm_sql_result
`lm_sql()`	SQL-Backed Linear Regression
`lm_sql_result()`	Result object for SQL-backed Linear Model
`orbital.lm_sql_result()`	Convert an lm_sql_result to an orbital object
`print.lm_sql_result()`	Print an lm_sql_result
`print.summary_lm_sql_result()`	Print a summary_lm_sql_result
`screen_interactions()`	Screen Interaction Effects via SQL Aggregation
`summary.lm_sql_result()`	Summary for lm_sql_result
`tidy.lm_sql_result()`	Tidy an lm_sql_result

Function Details

`diagnose_lm_sql()`

Diagnose potential issues before fitting lm_sql

Usage

diagnose_lm_sql(formula, data, verbose = TRUE)

Arguments

Argument	Description
`formula`	A formula object.
`data`	A `tbl_sql` object.
`verbose`	Logical. Print diagnostic messages? Default TRUE.

Value

Description

Checks a formula and database table for common issues that would cause model fitting to fail or produce unreliable results.

Details

Checks performed:

Column existence Numeric variation (constant columns) Near-perfect collinearity between numeric predictors Categorical level counts (single-level and high-cardinality) Sample size vs parameter count Missing value prevalence

Examples

library(dplyr)
library(DBI)

con <- dbConnect(duckdb::duckdb())
db_mtcars <- copy_to(con, mtcars, "mtcars")

# Check for issues before fitting
diagnose_lm_sql(mpg ~ ., data = db_mtcars)

# Suppress output but capture issues
issues <- diagnose_lm_sql(mpg ~ wt + hp, data = db_mtcars, verbose = FALSE)

dbDisconnect(con)

`glance.lm_sql_result()`

Glance at an lm_sql_result

Usage

glancelm_sql_result(x, ...)

Arguments

Argument	Description
`x`	An `lm_sql_result` object.
`...`	Not used.

Value

A single-row tibble with columns r.squared, adj.r.squared, sigma, statistic, p.value, df, logLik, AIC, BIC, nobs, and df.residual.

Description

Extract a single-row tibble of model-level summary statistics from a fitted SQL linear model.

Details

Returns R-squared, adjusted R-squared, residual standard error, F-statistic and its p-value, model degrees of freedom, log-likelihood, AIC, BIC, number of observations, and residual degrees of freedom.

`lm_sql()`

SQL-Backed Linear Regression

Usage

lm_sql(formula, data, tol = 1e-07, warnings = c("full", "minimal", "none"))

Arguments

Argument	Description
`formula`	A formula object (e.g., price ~ x + cut).
`data`	A tbl_sql object (from dbplyr).
`tol`	Tolerance for detecting linear dependency.
`warnings`	Control warning verbosity: `"full"` (default) shows detailed warnings with diagnostic tips, `"minimal"` shows one-line warnings, `"none"` suppresses warnings entirely.

Value

An S7 object of class lm_sql_result, or a tibble with a model list-column if the data is grouped.

Description

Fits a linear regression model using SQL aggregation on a remote database table. The data never leaves the database — only sufficient statistics (sums and cross-products) are returned to R.

Details

The function computes the X^TX and X^Ty matrices entirely inside the database engine via a single SQL aggregation query, then solves the normal equations in R using Cholesky decomposition (falling back to Moore-Penrose pseudoinverse for rank-deficient designs).

Supported formula features:

 Numeric and categorical (character/factor) predictors with
  automatic dummy encoding via `CASE WHEN`.
 Interaction terms (`*` and `:`) including numeric × categorical
  and categorical × categorical cross-products.
 Dot expansion (`y ~ .`) to all non-response columns.
 Transforms: `I()`, `log()`, and `sqrt()` translated to SQL
  equivalents (`POWER`, `LN`, `SQRT`).
 Date and datetime predictors automatically cast to numeric in SQL.
 No-intercept models (`y ~ 0 + x`).

For grouped data (via [dplyr::group_by()]), a single GROUP BY query is executed and one model per group is returned in a tibble with a model list-column.

NA handling uses listwise deletion: rows with NULL in any model variable are excluded via a WHERE ... IS NOT NULL clause.

Examples

library(dplyr)
library(DBI)

# Connect to database
con <- dbConnect(duckdb::duckdb())
db_mtcars <- copy_to(con, mtcars, "mtcars")

# Basic regression
fit <- lm_sql(mpg ~ wt + hp, data = db_mtcars)
print(fit)
summary(fit)

# Grouped regression (one model per group)
db_mtcars |>
  group_by(cyl) |>
  lm_sql(mpg ~ wt, data = _)

# Control warning verbosity
lm_sql(mpg ~ wt, data = db_mtcars, warnings = "minimal")
lm_sql(mpg ~ wt, data = db_mtcars, warnings = "none")

# Use with broom
broom::tidy(fit)
broom::glance(fit)

dbDisconnect(con)

`lm_sql_result()`

Result object for SQL-backed Linear Model

Usage

lm_sql_result(
  coefficients = integer(0),
  std_error = integer(0),
  sigma = numeric(0),
  r_squared = numeric(0),
  adj_r_squared = numeric(0),
  f_statistic = integer(0),
  f_p_value = integer(0),
  logLik = numeric(0),
  AIC = numeric(0),
  BIC = numeric(0),
  nobs = numeric(0),
  df_residual = numeric(0),
  df_model = numeric(0),
  statistic = integer(0),
  p_value = integer(0),
  call = NULL,
  term_expressions = NULL
)

Description

An S7 class that stores the complete results of a SQL-backed linear regression fitted by [lm_sql()].

Details

This class is not called directly by users. It is created internally by [lm_sql()] and returned as the model object. It stores fitted coefficients, standard errors, t-statistics, p-values, and model-level summaries (R-squared, F-statistic, AIC, BIC, etc.). The term_expressions property holds named R expressions for each predictor, which are used by the [orbital.lm_sql_result()] method to generate in-database prediction expressions.

`orbital.lm_sql_result()`

Convert an lm_sql_result to an orbital object

Usage

orbital.lm_sql_result(x, ..., prefix = ".pred")

Arguments

Argument	Description
`x`	An `lm_sql_result` object.
`...`	Not used.
`prefix`	Column name for predictions. Defaults to `".pred"`.

Value

An orbital_class object.

Description

Creates an orbital object from a fitted SQL linear model, enabling in-database predictions without pulling data into R.

Details

Builds a single prediction expression by combining the fitted coefficients with the R expressions stored in term_expressions. For categorical predictors, the expression includes ifelse() calls that dbplyr translates to SQL CASE WHEN. The resulting orbital_class object can be used with [orbital::predict()] to get predictions or [orbital::augment()] to append a .pred column to a database table.

`print.lm_sql_result()`

Print an lm_sql_result

Usage

printlm_sql_result(x, ...)

Arguments

Argument	Description
`x`	An `lm_sql_result` object.
`...`	Not used.

Value

Invisibly returns x.

Description

Display a concise summary of a fitted SQL linear model.

Details

Prints the original function call and the named coefficient vector.

`print.summary_lm_sql_result()`

Print a summary_lm_sql_result

Usage

printsummary_lm_sql_result(
  x,
  digits = max(3L, getOption("digits") - 3L),
  signif.stars = getOption("show.signif.stars", TRUE),
  ...
)

Arguments

Argument	Description
`x`	A `summary_lm_sql_result` object.
`digits`	Minimum number of significant digits for printing.
`signif.stars`	Logical; if TRUE, significance stars are printed.
`...`	Additional arguments (not used).

Value

Invisibly returns x.

Description

Prints a formatted summary of an SQL-backed linear model, mimicking the output style of [print.summary.lm()].

`screen_interactions()`

Screen Interaction Effects via SQL Aggregation

Usage

screen_interactions(formula, data, threshold = 0.05, min_cor = 0.05)

Arguments

Argument	Description
`formula`	A formula with interaction terms (e.g., y ~ x1 * x2 * cat1).
`data`	A tbl_sql object (from dbplyr).
`threshold`	Significance threshold for retaining an interaction. Default 0.05. Interactions with a p-value above this are dropped.
`min_cor`	Minimum absolute partial correlation for numeric x numeric interactions. Default 0.05.

Value

A formula with weak interactions removed. Main effects are always retained. The original formula is returned unchanged if all interactions pass screening. An attribute “screened” is attached listing dropped terms.

Description

Pre-screens candidate interaction terms using lightweight SQL GROUP BY queries to identify which interactions show meaningful signal, before paying the full O(p^2) cost of expanding them all into the sufficient-statistics query.

Details

For each candidate interaction in the formula, a cheap SQL query estimates the interaction strength:

 Categorical x Categorical: Computes cell means of the
  response via GROUP BY cat1, cat2 and tests whether variance of
  cell means (beyond main effects) exceeds a threshold, using a crude
  F-ratio.
 Categorical x Numeric: Computes per-category slopes via
  GROUP BY cat and tests whether slopes differ significantly
  across categories (slope heterogeneity test).
 Numeric x Numeric: Computes the partial correlation of
  x1 * x2 with the response y and checks if it exceeds
  a minimum threshold.

Returns a filtered formula retaining only interactions that pass the screening threshold. Main effects are always kept.

`summary.lm_sql_result()`

Summary for lm_sql_result

Usage

summarylm_sql_result(object, ...)

Arguments

Argument	Description
`object`	An `lm_sql_result` object.
`...`	Not used.

Value

An object of class summary_lm_sql_result with components:

callThe original function call.
coefficientsA matrix with columns for Estimate, Std. Error,
  t value, and Pr(>|t|).
sigmaResidual standard error.
r.squaredR-squared value.
adj.r.squaredAdjusted R-squared value.
fstatisticA named vector with the F-statistic, numerator df,
  and denominator df.
dfDegrees of freedom: model df, residual df, and total coefficients.
nobsNumber of observations.

Description

Produces a summary of a fitted SQL-backed linear model, similar to [summary.lm()] for base R linear models.

Details

Returns an object of class summary_lm_sql_result containing the coefficient table (estimates, standard errors, t-statistics, and p-values), R-squared statistics, F-statistic, and residual standard error.

Unlike [summary.lm()], this summary does not include residual quantiles since the raw data remains in the database.

`tidy.lm_sql_result()`

Tidy an lm_sql_result

Usage

tidylm_sql_result(x, conf.int = FALSE, conf.level = 0.95, ...)

Arguments

Argument	Description
`x`	An `lm_sql_result` object.
`conf.int`	Logical. If `TRUE`, include confidence interval columns `conf.low` and `conf.high`. Defaults to `FALSE`.
`conf.level`	Confidence level for the interval. Defaults to `0.95`.
`...`	Not used.

Value

A tibble with columns term, estimate, std.error, statistic, and p.value. If conf.int = TRUE, also conf.low and conf.high.

Description

Extract a tidy tibble of per-term coefficient statistics from a fitted SQL linear model.

Details

Returns one row per model term with the estimate, standard error, t-statistic, and p-value. When conf.int = TRUE, confidence intervals are computed using the t-distribution with df_residual degrees of freedom.