Excel Forensic Analysis with tidychain

The Forensic Opportunity

Every Excel file contains a hidden audit trail. When you create a formula, Excel records it in calcChain.xml in creation order—and this order never changes, even when you move rows.

This technique was demonstrated by Data Colada #109, which revealed data fabrication in academic research by analyzing calcChain anomalies.

NoteHow calcChain.xml Works
<calcChain>
  <c r="A1" i="1"/>  <!-- Created 1st -->
  <c r="A2"/>        <!-- Created 2nd -->
  <c r="A3"/>        <!-- Created 3rd -->
</calcChain>

If you move row 3 to row 1, the calcChain stays the same. The formula’s creation position is preserved, revealing the manipulation.

Complete Forensic Workflow

Step 1: Parse the Workbook

library(tidychain)
library(dplyr)

xlsx <- parse_xlsx("nist_financial_model.xlsx")
xlsx
── Excel Workbook Summary ──────────────────────────────────────────────────────
File: 'nist_financial_model.xlsx' (217.5 KB)
Created: "2000-12-08T14:09:03Z" by "(undefined Excel user)"
Modified: "2023-03-24T15:43:31Z" by "Parcell, John H. (Ctr)"

Sheets: 13 (1 hidden)
Cells: 19505 | Formulas: 4824

── Forensic Flags ──

! 1 hidden sheet(s)
! Hidden rows/columns detected
! 3 external link(s)
! 2847 late addition(s) in calc chain

This 23-year-old NIST financial model shows several forensic flags: a very hidden sheet (Module1), hidden rows/columns, 3 external links, and 2,847 late additions (46% of formulas created “out of order”).

Step 2: Examine the Calculation Chain

summary(xlsx@chain)
── Calc Chain Summary ──
Formula cells tracked: 6142
Sheets covered: 7
! Late additions: 2847 (46.4%)
head(xlsx@chain@chain, 10)
# A tibble: 10 × 7
   order cell_ref   row col   col_num sheet_id sheet_name
   <int> <chr>    <int> <chr>   <int> <chr>    <chr>
 1     1 F154       154 F           6 12       Depreciation Schedule
 2     2 G154       154 G           7 12       Depreciation Schedule
 3     3 H154       154 H           8 12       Depreciation Schedule
 4     4 I154       154 I           9 12       Depreciation Schedule
 5     5 J154       154 J          10 12       Depreciation Schedule
 ...
TipKey Insight

The first formulas created (order 1-10) are in row 154. This means row 154 was the original starting point, and rows 1-153 were added above it over time.

Step 3: Detect Anomalies

anomalies <- detect_anomalies(xlsx)
head(anomalies)
# A tibble: 6 × 9
  sheet_name            cell_ref row col   col_num calc_order expected_order order_diff anomaly_type
  <chr>                 <chr>  <int> <chr>   <int>      <int>          <int>      <int> <chr>
1 Depreciation Schedule F154     154 F           6          1           2641      -2640 moved_down
2 Depreciation Schedule G154     154 G           7          2           2642      -2640 moved_down
3 Depreciation Schedule H154     154 H           8          3           2643      -2640 moved_down
...
Column Meaning
calc_order When the formula was created (1 = first)
expected_order Where it should be if rows were sequential
order_diff The discrepancy
anomaly_type moved_up (created later) or moved_down (created earlier)

Step 4: Summarize Anomalies

summary_result <- summarize_anomalies(anomalies, xlsx)
print(summary_result)
── Anomaly Summary ──
ℹ Severity: high
Total anomalous cells: 134760

── By Sheet ──
# A tibble: 7 × 2
  sheet_name                     n
  <chr>                      <int>
1 Depreciation Schedule      71691
2 Assumptions Processing     18438
3 CashFlow                   16086
4 Income Statement           13293
5 Balance Sheet              11529
6 Control Panel               3717
7 Overview                       6

── By Type ──
# A tibble: 2 × 2
  anomaly_type     n
  <chr>        <int>
1 moved_up     68770
2 moved_down   65990

── Row Clusters ──
Groups of consecutive rows that may have been moved together:
# A tibble: 70 × 4
   sheet_name            start_row end_row n_cells
   <chr>                     <int>   <int>   <int>
 1 Depreciation Schedule         4       4     437
 2 Depreciation Schedule        11      16    2760
 ...

Visualizing Anomalies

Plot 1: Creation Order vs Row Position

plot_calc_order(xlsx, sheet = "Income Statement")

Creation Order vs Row Position
NoteReading This Plot
  • X-axis: Current row number
  • Y-axis: Creation order (position in calcChain.xml)
  • Dotted diagonal: Expected pattern if rows were created top-to-bottom
  • Blue points: Normal cells
  • Red points: Anomalies

Interpretation:

  • Points above the line: created later than row position suggests (inserted)
  • Points below the line: created earlier than row position suggests (moved down)
  • Scattered pattern: extensive editing over time

Plot 2: Sheet Heatmap

plot_sheet_heatmap(xlsx, sheet = "Income Statement")

Creation Order Heatmap
NoteReading This Heatmap
  • Each tile = one formula cell
  • Purple (dark): Created earlier
  • Yellow (light): Created later
  • Position shows current location in spreadsheet

What to look for:

Pattern Interpretation
Horizontal color bands Rows created together
Color discontinuities Rows inserted or moved
Yellow cells in purple region Late insertions (suspicious)
Gradual top-to-bottom gradient Normal sequential creation

Plot 3: Anomaly Locations

plot_anomalies(xlsx, anomalies, sheet = "Income Statement")

Anomaly Locations

Red = moved_up (created after cells below it). Orange = moved_down (created before cells above it). Clusters indicate groups moved together.

Plot 4: All Sheets Overview

plot_calc_order(xlsx)  # No sheet filter = all sheets

All Sheets Overview

Compare editing patterns across sheets. Heavy editing appears as scattered points; minimal editing shows a clean diagonal.

Building a Timeline

timeline <- build_timeline(xlsx)
head(timeline)
# A tibble: 6 × 10
  creation_order cell_ref sheet_name     row col formula  order_vs_row is_late_addition
           <int> <chr>    <chr>        <int> <chr> <chr>         <int> <lgl>
1              1 F154     Depreciation   154 F     'Assum…       -2640 FALSE
2              2 G154     Depreciation   154 G     'Assum…       -2640 FALSE
...

Filter to late additions:

late_additions <- timeline |> filter(is_late_addition)
nrow(late_additions)
[1] 2847

Print a formatted summary:

print_timeline_summary(timeline)
── Formula Creation Timeline ──
File created: 2000-12-08T14:09:03Z
File modified: 2023-03-24T15:43:31Z

Total formulas: 6142
Sheets: 7
Late additions: 2847 (46.4%)

── Late Additions ──
These formulas were created after cells currently below them:
  #28: C11 (row 11) - SUM(C12:C26)-SUM(C27:C40)
  #29: D11 (row 11) - IF(C11="","",IF(C11<>SUM(G11:Q11),"ERROR","OK"))
  ...

Reconstructing Original Layout

Attempt to infer what the spreadsheet looked like before rows were moved:

original <- reconstruct_original(xlsx, sheet = "Income Statement")
original |> filter(was_moved) |> head(10)
# A tibble: 10 × 8
   sheet_name       original_row current_row cell_ref col formula   was_moved movement
   <chr>                   <int>       <int> <chr>    <chr> <chr>     <lgl>     <chr>
 1 Income Statement            4           7 E7       E     'Assum…   TRUE      inserted
 2 Income Statement            5          21 E21      E     'Assum…   TRUE      inserted
 3 Income Statement            5          17 F17      F     'Assum…   TRUE      inserted
...
print_original_comparison(original)
── Original vs Current Layout ──
ℹ Showing inferred original positions based on creation order

Cells analyzed: 475
Cells moved: 362 (76.2%)

── Moved Cells ──
  E7: row 4 -> row 7 (inserted)
  E21: row 5 -> row 21 (inserted)
  ...

Interpreting Results

ImportantHigh Anomaly Count ≠ Fraud

This NIST file has 134,760 anomalies. Is it fraud? No.

The file was:

  • Created in 2000
  • Modified over 23 years
  • Reorganized multiple times
  • Maintained by different people

High anomaly counts are normal for working documents.

What IS Suspicious

Pattern Suspicion Level Why
Isolated anomalies in orderly data High Data entry should be sequential
Late additions in specific result rows High May indicate inserted results
Data rows created after summary formulas Medium Summaries usually come last
Entire sections out of order Low Normal reorganization
High count in old files Low Expected from editing

When calcChain Analysis Fails

CautionLimitations
  1. Only tracks formulas — Plain values are invisible
  2. No timestamps — Only relative order, not when changes occurred
  3. Can be circumvented — Copy/paste values or recreate file
  4. LibreOffice — May not write calcChain.xml
  5. Programmatic files — R/Python-generated files often lack calcChain

Quick Reference

Function Purpose
parse_xlsx() Parse workbook into xlsx_workbook object
detect_anomalies() Find cells where creation order ≠ row position
summarize_anomalies() Summary stats by sheet, type, clusters
build_timeline() Formula creation timeline
print_timeline_summary() Formatted timeline output
reconstruct_original() Infer original layout
print_original_comparison() Compare original vs current
plot_calc_order() Scatter plot of creation vs row
plot_sheet_heatmap() Heatmap of creation order
plot_anomalies() Grid showing anomaly locations

Next Steps