Getting Started with tidychain

Installation

From Codeberg (development)

remotes::install_git("https://codeberg.org/usrbinr/tidychain")

From GitHub (mirror)

pak::pak("usrbinr/tidychain")

Codeberg is the primary repository. GitHub is a read-only mirror.

Your First Analysis

The main entry point is parse_xlsx(), which returns an S7 xlsx_workbook object containing everything tidychain extracts from the file:

library(tidychain)

xlsx <- parse_xlsx("nist_financial_model.xlsx")
xlsx
── Excel Workbook Summary ──────────────────────────────────────────────────────
File: 'nist_financial_model.xlsx' (217.5 KB)
Created: "2000-12-08T14:09:03Z" by "(undefined Excel user)"
Modified: "2023-03-24T15:43:31Z" by "Parcell, John H. (Ctr)"

Sheets: 13 (1 hidden)
Cells: 19505 | Formulas: 4824

── Forensic Flags ──

! 1 hidden sheet(s)
! Hidden rows/columns detected
! 3 external link(s)
! 2847 late addition(s) in calc chain

Use `summary(xlsx@meta)`, `summary(xlsx@chain)`, etc. for details

The summary provides an immediate forensic overview: file metadata, structure counts, and automatic detection of suspicious patterns.

The xlsx_workbook Object

Every component is accessible via @ slots:

1xlsx@meta@creator
xlsx@meta@modified
2xlsx@sheets@sheets
3xlsx@chain@chain
4xlsx@cells@cells
5xlsx@formulas@formulas
1
File metadata: author, creation/modification dates
2
Sheet information: names, visibility state
3
Calculation chain: formula creation order (key for forensics)
4
All cell data: values, formulas, types
5
Parsed formulas with their cell references

Component Summaries

Each slot has its own summary() method:

summary(xlsx@meta)
── Metadata Summary ──
Creator: "(undefined Excel user)"
Last modified by: "Parcell, John H. (Ctr)"
Created: "2000-12-08T14:09:03Z"
Modified: "2023-03-24T15:43:31Z"
Application: "Microsoft Excel Online" "16.0300"
summary(xlsx@chain)
── Calc Chain Summary ──
Formula cells tracked: 6142
Sheets covered: 7
! Late additions: 2847 (46.4%)
Warning46% Late Additions?

This doesn’t necessarily mean fraud. This file was created in 2000 and edited for 23 years. High anomaly counts are normal for working documents with long editing histories.

Available Slots

Core Slots

Slot Description Access Example
@meta Author, dates, app info xlsx@meta@creator
@sheets Sheet names, hidden status xlsx@sheets@sheets
@chain Formula creation order xlsx@chain@chain
@cells All cell data xlsx@cells@cells
@formulas Formulas with references xlsx@formulas@formulas
@strings Shared strings xlsx@strings@strings
@styles Hidden rows/columns xlsx@styles@hidden_rows
@comments Cell comments xlsx@comments@comments
@links External links xlsx@links@links
@connections Database connections xlsx@connections@connections

Extended Slots

Slot Description
@tables Named table definitions
@pivots Pivot table definitions
@charts Chart configurations
@queries Query table definitions
@slicers Slicer definitions
@custom Custom XML parts

Extracting Formulas

Use extract_formulas() to get all formulas with their cell references:

formulas <- extract_formulas("simple_model.xlsx")
formulas
# A tibble: 3 × 6
  sheet cell      cell_local formula references has_references
  <chr> <chr>     <chr>      <chr>   <list>     <lgl>
1 Model Model!B7  B7         B2*B3   <chr [2]>  TRUE
2 Model Model!B8  B8         B7*B4   <chr [2]>  TRUE
3 Model Model!B11 B11        B7+B8   <chr [2]>  TRUE

The references column contains a list of all cells that each formula depends on—useful for building dependency graphs.

Parsing Individual Formulas

Extract cell references from any formula string:

parse_formula_refs("=SUM(A1:A10)+B5*C3")
[1] "A1" "A10" "B5" "C3"
parse_formula_refs("=VLOOKUP(A1,Sheet2!B:C,2,FALSE)")
[1] "A1"

Expanding Ranges

Convert range notation to individual cells:

expand_range("A1:C3")
[1] "A1" "B1" "C1" "A2" "B2" "C2" "A3" "B3" "C3"

Getting All Cells

Retrieve every cell with its type:

cells <- get_all_cells("simple_model.xlsx")
cells
# A tibble: 15 × 8
   sheet cell      cell_local   row col   value formula cell_type
   <chr> <chr>     <chr>      <int> <chr> <chr> <chr>   <chr>
 1 Model Model!A1  A1             1 A     0     <NA>    value
 2 Model Model!A2  A2             2 A     1     <NA>    value
 3 Model Model!B2  B2             2 B     100   <NA>    value
 4 Model Model!B7  B7             7 B     <NA>  B2*B3   formula
 ...
table(cells$cell_type)
formula   value
      3      12

Security Functions

Check for Encryption

Encrypted Excel files cannot be parsed:

is_encrypted_xlsx("protected_file.xlsx")
[1] TRUE
ImportantEncrypted Files

If is_encrypted_xlsx() returns TRUE, the file uses AES-256 encryption and cannot be opened without the password. This is different from sheet protection, which is merely cosmetic.

Sheet Protection

Sheet protection is easily removed (it’s just XML flags):

# Check which sheets are protected
check_sheet_protection(xlsx)
# A tibble: 3 × 2
  sheet_name    protected
  <chr>         <lgl>
1 Sheet1        TRUE
2 Sheet2        FALSE
3 HiddenSheet   TRUE
# Remove protection (creates a new file)
remove_sheet_protection("protected.xlsx", output = "unprotected.xlsx")
CautionSheet Protection is Weak

Sheet protection only prevents casual editing. It does not encrypt data and can be removed by anyone with basic tools. Never rely on sheet protection for security.

Next Steps