tidychain
tidychain
Excel Forensics and Model Decomposition via calcChain Analysis
The Problem
You receive an Excel file. How do you know if the data was manipulated? Were rows inserted? Were results changed after the fact?
tidychain answers these questions by parsing Excel’s internal metadata—specifically calcChain.xml, which records the order formulas were created. This order persists even when rows are moved, revealing the spreadsheet’s hidden history.
tidychain doesn’t read Excel data into R like readxl. Instead, it performs forensic analysis of the Excel file’s internal structure to detect manipulation and understand how the spreadsheet was built.
Installation
From Codeberg (development)
remotes::install_git("https://codeberg.org/usrbinr/tidychain")From GitHub (mirror)
pak::pak("usrbinr/tidychain")Codeberg is the primary development repository. Issues, pull requests, and contributions are welcome there. GitHub is a read-only mirror for convenience. Issues and PRs on GitHub will not be monitored or responded to.
Quick Example
library(tidychain)
xlsx <- parse_xlsx("spreadsheet.xlsx")
xlsx── Excel Workbook Summary ──────────────────────────────────────────────────────
File: 'nist_financial_model.xlsx' (217.5 KB)
Created: "2000-12-08" by "(undefined Excel user)"
Modified: "2023-03-24" by "Parcell, John H. (Ctr)"
Sheets: 13 (1 hidden)
Cells: 19505 | Formulas: 4824
── Forensic Flags ──
! 1 hidden sheet(s)
! Hidden rows/columns detected
! 3 external link(s)
! 2847 late addition(s) in calc chain
The summary immediately highlights potential red flags: hidden sheets, hidden rows/columns, external links, and late additions—formulas created after cells that appear below them.
Forensic Workflow
- 1
- Print summary to see forensic flags immediately
- 2
- Find cells where creation order doesn’t match row position
- 3
- Get statistics: severity, clusters, breakdown by sheet
- 4
- Scatter plot: creation order vs current row number
- 5
- Heatmap: visualize when each cell was created
- 6
- Grid view: highlight anomalous cells in red/orange
What tidychain Detects
| Flag | What It Reveals |
|---|---|
| Hidden sheets | Sheets marked hidden or veryHidden |
| Hidden rows/columns | Concealed data within visible sheets |
| Orphaned strings | Deleted text still stored in the file |
| External links | References to other workbooks |
| Data connections | Database or web query connections |
| Late additions | Formulas where creation order doesn’t match position |
The Key Insight: calcChain.xml
When you create a formula in Excel, it’s recorded in calcChain.xml in creation order. This order never changes—even if you move rows around later.
If row 50 was inserted between rows 10 and 11 after the original data entry, the formulas in row 50 will have creation orders much higher than rows 10 or 11. This discrepancy is invisible in the spreadsheet but detectable in the calcChain.
Learn More
- Get Started — Basic usage and the xlsx_workbook object
- Forensic Analysis — Complete workflow for detecting manipulation
- Understanding Anomalies — How the detection algorithm works
- Excel Formula Translation — Convert Excel formulas to R code
- Function Reference — All exported functions
Inspiration
This package was inspired by Data Colada #109, which demonstrated how calcChain.xml analysis revealed data fabrication in academic research. tidychain makes this forensic technique accessible in R.
- Only tracks formula cells (not plain values)
- No absolute timestamps—only relative creation order
- Can be circumvented by recreating the file from scratch
- Some programmatically-generated Excel files lack
calcChain.xml