Getting Started with tidychain
Installation
From Codeberg (development)
remotes::install_git("https://codeberg.org/usrbinr/tidychain")From GitHub (mirror)
pak::pak("usrbinr/tidychain")Codeberg is the primary repository. GitHub is a read-only mirror.
Your First Analysis
The main entry point is parse_xlsx(), which returns an S7 xlsx_workbook object containing everything tidychain extracts from the file:
library(tidychain)
xlsx <- parse_xlsx("nist_financial_model.xlsx")
xlsx── Excel Workbook Summary ──────────────────────────────────────────────────────
File: 'nist_financial_model.xlsx' (217.5 KB)
Created: "2000-12-08T14:09:03Z" by "(undefined Excel user)"
Modified: "2023-03-24T15:43:31Z" by "Parcell, John H. (Ctr)"
Sheets: 13 (1 hidden)
Cells: 19505 | Formulas: 4824
── Forensic Flags ──
! 1 hidden sheet(s)
! Hidden rows/columns detected
! 3 external link(s)
! 2847 late addition(s) in calc chain
Use `summary(xlsx@meta)`, `summary(xlsx@chain)`, etc. for details
The summary provides an immediate forensic overview: file metadata, structure counts, and automatic detection of suspicious patterns.
The xlsx_workbook Object
Every component is accessible via @ slots:
- 1
- File metadata: author, creation/modification dates
- 2
- Sheet information: names, visibility state
- 3
- Calculation chain: formula creation order (key for forensics)
- 4
- All cell data: values, formulas, types
- 5
- Parsed formulas with their cell references
Component Summaries
Each slot has its own summary() method:
summary(xlsx@meta)── Metadata Summary ──
Creator: "(undefined Excel user)"
Last modified by: "Parcell, John H. (Ctr)"
Created: "2000-12-08T14:09:03Z"
Modified: "2023-03-24T15:43:31Z"
Application: "Microsoft Excel Online" "16.0300"
summary(xlsx@chain)── Calc Chain Summary ──
Formula cells tracked: 6142
Sheets covered: 7
! Late additions: 2847 (46.4%)
This doesn’t necessarily mean fraud. This file was created in 2000 and edited for 23 years. High anomaly counts are normal for working documents with long editing histories.
Available Slots
Core Slots
| Slot | Description | Access Example |
|---|---|---|
@meta |
Author, dates, app info | xlsx@meta@creator |
@sheets |
Sheet names, hidden status | xlsx@sheets@sheets |
@chain |
Formula creation order | xlsx@chain@chain |
@cells |
All cell data | xlsx@cells@cells |
@formulas |
Formulas with references | xlsx@formulas@formulas |
@strings |
Shared strings | xlsx@strings@strings |
@styles |
Hidden rows/columns | xlsx@styles@hidden_rows |
@comments |
Cell comments | xlsx@comments@comments |
@links |
External links | xlsx@links@links |
@connections |
Database connections | xlsx@connections@connections |
Extended Slots
| Slot | Description |
|---|---|
@tables |
Named table definitions |
@pivots |
Pivot table definitions |
@charts |
Chart configurations |
@queries |
Query table definitions |
@slicers |
Slicer definitions |
@custom |
Custom XML parts |
Extracting Formulas
Use extract_formulas() to get all formulas with their cell references:
formulas <- extract_formulas("simple_model.xlsx")
formulas# A tibble: 3 × 6
sheet cell cell_local formula references has_references
<chr> <chr> <chr> <chr> <list> <lgl>
1 Model Model!B7 B7 B2*B3 <chr [2]> TRUE
2 Model Model!B8 B8 B7*B4 <chr [2]> TRUE
3 Model Model!B11 B11 B7+B8 <chr [2]> TRUE
The references column contains a list of all cells that each formula depends on—useful for building dependency graphs.
Parsing Individual Formulas
Extract cell references from any formula string:
parse_formula_refs("=SUM(A1:A10)+B5*C3")[1] "A1" "A10" "B5" "C3"
parse_formula_refs("=VLOOKUP(A1,Sheet2!B:C,2,FALSE)")[1] "A1"
Expanding Ranges
Convert range notation to individual cells:
expand_range("A1:C3")[1] "A1" "B1" "C1" "A2" "B2" "C2" "A3" "B3" "C3"
Getting All Cells
Retrieve every cell with its type:
cells <- get_all_cells("simple_model.xlsx")
cells# A tibble: 15 × 8
sheet cell cell_local row col value formula cell_type
<chr> <chr> <chr> <int> <chr> <chr> <chr> <chr>
1 Model Model!A1 A1 1 A 0 <NA> value
2 Model Model!A2 A2 2 A 1 <NA> value
3 Model Model!B2 B2 2 B 100 <NA> value
4 Model Model!B7 B7 7 B <NA> B2*B3 formula
...
table(cells$cell_type)formula value
3 12
Security Functions
Check for Encryption
Encrypted Excel files cannot be parsed:
is_encrypted_xlsx("protected_file.xlsx")[1] TRUE
If is_encrypted_xlsx() returns TRUE, the file uses AES-256 encryption and cannot be opened without the password. This is different from sheet protection, which is merely cosmetic.
Sheet Protection
Sheet protection is easily removed (it’s just XML flags):
# Check which sheets are protected
check_sheet_protection(xlsx)# A tibble: 3 × 2
sheet_name protected
<chr> <lgl>
1 Sheet1 TRUE
2 Sheet2 FALSE
3 HiddenSheet TRUE
# Remove protection (creates a new file)
remove_sheet_protection("protected.xlsx", output = "unprotected.xlsx")Sheet protection only prevents casual editing. It does not encrypt data and can be removed by anyone with basic tools. Never rely on sheet protection for security.
Next Steps
- Forensic Analysis — Learn to detect spreadsheet manipulation
- Understanding Anomalies — How the detection algorithm works
- Excel Formula Translation — Convert Excel formulas to R