tidychain

License

MIT + file LICENSE

Citation

Citing this package

Authors

  • Alejandro Hagan (author, maintainer)

tidychain

Excel Forensics and Model Decomposition via calcChain Analysis

The Problem

You receive an Excel file. How do you know if the data was manipulated? Were rows inserted? Were results changed after the fact?

tidychain answers these questions by parsing Excel’s internal metadata—specifically calcChain.xml, which records the order formulas were created. This order persists even when rows are moved, revealing the spreadsheet’s hidden history.

NoteThis is NOT a data import package

tidychain doesn’t read Excel data into R like readxl. Instead, it performs forensic analysis of the Excel file’s internal structure to detect manipulation and understand how the spreadsheet was built.

Installation

From Codeberg (development)

remotes::install_git("https://codeberg.org/usrbinr/tidychain")

From GitHub (mirror)

pak::pak("usrbinr/tidychain")
ImportantRepository Status

Codeberg is the primary development repository. Issues, pull requests, and contributions are welcome there. GitHub is a read-only mirror for convenience. Issues and PRs on GitHub will not be monitored or responded to.

Quick Example

library(tidychain)

xlsx <- parse_xlsx("spreadsheet.xlsx")
xlsx
── Excel Workbook Summary ──────────────────────────────────────────────────────
File: 'nist_financial_model.xlsx' (217.5 KB)
Created: "2000-12-08" by "(undefined Excel user)"
Modified: "2023-03-24" by "Parcell, John H. (Ctr)"

Sheets: 13 (1 hidden)
Cells: 19505 | Formulas: 4824

── Forensic Flags ──

! 1 hidden sheet(s)
! Hidden rows/columns detected
! 3 external link(s)
! 2847 late addition(s) in calc chain

The summary immediately highlights potential red flags: hidden sheets, hidden rows/columns, external links, and late additions—formulas created after cells that appear below them.

Forensic Workflow

xlsx <- parse_xlsx("suspect_file.xlsx")
1xlsx

2anomalies <- detect_anomalies(xlsx)
3summarize_anomalies(anomalies, xlsx)

4plot_calc_order(xlsx)
5plot_sheet_heatmap(xlsx, sheet = "Data")
6plot_anomalies(xlsx, anomalies)
1
Print summary to see forensic flags immediately
2
Find cells where creation order doesn’t match row position
3
Get statistics: severity, clusters, breakdown by sheet
4
Scatter plot: creation order vs current row number
5
Heatmap: visualize when each cell was created
6
Grid view: highlight anomalous cells in red/orange

What tidychain Detects

Flag What It Reveals
Hidden sheets Sheets marked hidden or veryHidden
Hidden rows/columns Concealed data within visible sheets
Orphaned strings Deleted text still stored in the file
External links References to other workbooks
Data connections Database or web query connections
Late additions Formulas where creation order doesn’t match position

The Key Insight: calcChain.xml

When you create a formula in Excel, it’s recorded in calcChain.xml in creation order. This order never changes—even if you move rows around later.

TipWhy This Matters

If row 50 was inserted between rows 10 and 11 after the original data entry, the formulas in row 50 will have creation orders much higher than rows 10 or 11. This discrepancy is invisible in the spreadsheet but detectable in the calcChain.

Learn More

Inspiration

This package was inspired by Data Colada #109, which demonstrated how calcChain.xml analysis revealed data fabrication in academic research. tidychain makes this forensic technique accessible in R.

CautionLimitations
  • Only tracks formula cells (not plain values)
  • No absolute timestamps—only relative creation order
  • Can be circumvented by recreating the file from scratch
  • Some programmatically-generated Excel files lack calcChain.xml