Understanding Anomaly Detection

What is an Anomaly?

An anomaly is a cell where the creation order (position in calcChain.xml) doesn’t match the row order (current position in the spreadsheet).

Excel’s calcChain.xml records formulas in creation order. This order persists even when rows are moved. By comparing creation order to current row position, we detect manipulation.

The Detection Algorithm

For each column separately, we calculate:

order_diff = (calcChain rank) - (row rank)

order_diff	Anomaly Type	Meaning
> 0	`moved_up`	Cell created later than row suggests
< 0	`moved_down`	Cell created earlier than row suggests
= 0	(normal)	Cell is in expected position

Analysis is done per-column because formulas in different columns may have been created at different times, even within the same row. This avoids false positives from horizontal fill operations.

Worked Examples

Example 1: Normal Spreadsheet

Formulas created top-to-bottom, no manipulation:

Cell  Row  CalcChain  Row Rank  Chain Rank  Diff  Status
────────────────────────────────────────────────────────
A1    1    1          1         1           0     OK
A2    2    2          2         2           0     OK
A3    3    3          3         3           0     OK

All diffs are 0 → No anomalies.

Example 2: Row Moved Up

Original: Formulas in A1, A2, A3 created in rows 1, 2, 3. Then: Row 3’s formula moved to row 1.

Cell    Row  CalcChain  Row Rank  Chain Rank  Diff  Anomaly
───────────────────────────────────────────────────────────────
A3→A1   1    3          1         3           +2    moved_up
A1→A2   2    1          2         1           -1    moved_down
A2→A3   3    2          3         2           -1    moved_down

The moved cell (now A1, was A3) has positive diff → moved_up. The pushed cells have negative diff → moved_down. When you move one row up, the rows below are pushed down, creating a cascade of anomalies.

Example 3: Row Inserted

Original: A1, A2, A3 in rows 1, 2, 3. Then: New formula inserted at row 2 (old A2→A3, old A3→A4).

Cell  Row  CalcChain  Row Rank  Chain Rank  Diff  Anomaly
─────────────────────────────────────────────────────────────
A1    1    1          1         1           0     OK
NEW   2    4          2         4           +2    moved_up
A2    3    2          3         2           -1    moved_down
A3    4    3          4         3           -1    moved_down

The inserted row has a high calcChain order (4) but low row number (2) → moved_up.

Using detect_anomalies()

xlsx <- parse_xlsx("nist_financial_model.xlsx")
anomalies <- detect_anomalies(xlsx)
head(anomalies)

# A tibble: 6 × 9
  sheet_name            cell_ref   row col   col_num calc_order expected_order order_diff anomaly_type
  <chr>                 <chr>    <int> <chr>   <int>      <int>          <int>      <int> <chr>
1 Depreciation Schedule F154       154 F           6          1           2641      -2640 moved_down
2 Depreciation Schedule G154       154 G           7          2           2642      -2640 moved_down
3 Depreciation Schedule H154       154 H           8          3           2643      -2640 moved_down
4 Depreciation Schedule I154       154 I           9          4           2644      -2640 moved_down
5 Depreciation Schedule J154       154 J          10          5           2645      -2640 moved_down
6 Depreciation Schedule K154       154 K          11          6           2646      -2640 moved_down

What This Means

These cells in row 154 were created first (calc_order 1-6) but are currently at the bottom of the sheet. This means ~2,640 rows were added above them over time.

Using the Threshold Parameter

The threshold parameter filters out minor discrepancies:

# Default: all anomalies
anomalies_all <- detect_anomalies(xlsx)
nrow(anomalies_all)

[1] 134760

# Only flag cells 5+ positions out of order
anomalies_strict <- detect_anomalies(xlsx, threshold = 5)
nrow(anomalies_strict)

[1] 134508

When to Use Threshold

Use a higher threshold when:

Analyzing files with many minor edits
Looking for significant structural changes
Reducing noise in visualization

Keep threshold at 0 (default) when:

Analyzing files suspected of fraud
Every anomaly matters
Building detailed timelines

Summarizing Results

summary_result <- summarize_anomalies(anomalies, xlsx)
print(summary_result)

── Anomaly Summary ──
ℹ Severity: high
Total anomalous cells: 134760

── By Sheet ──
# A tibble: 7 × 2
  sheet_name                     n
  <chr>                      <int>
1 Depreciation Schedule      71691
2 Assumptions Processing     18438
3 CashFlow                   16086
4 Income Statement           13293
5 Balance Sheet              11529
6 Control Panel               3717
7 Overview                       6

── By Type ──
# A tibble: 2 × 2
  anomaly_type     n
  <chr>        <int>
1 moved_up     68770
2 moved_down   65990

── Row Clusters ──
Groups of consecutive rows that may have been moved together:
# A tibble: 70 × 4
   sheet_name            start_row end_row n_cells
   <chr>                     <int>   <int>   <int>
 1 Depreciation Schedule         4       4     437
 2 Depreciation Schedule        11      16    2760
 ...

Severity Levels

Severity	Criteria
`none`	No anomalies
`low`	Few anomalies, likely normal editing
`medium`	>5% of cells or >50 anomalies
`high`	>20% of cells anomalous

Row Clusters

The row_clusters field identifies consecutive rows that may have been moved together. Large clusters suggest structural reorganization; isolated single rows are more suspicious.

Visual Interpretation Guide

Scatter Plot Patterns

plot_calc_order(xlsx, sheet = "Income Statement")

Pattern	Interpretation
Points on diagonal	Normal: creation matches row order
Points above diagonal	Created later than row suggests
Points below diagonal	Created earlier than row suggests
Vertical spread	Column has mixed creation times
Horizontal bands	Rows added in batches
Scattered cloud	Heavy editing over time

Heatmap Patterns

plot_sheet_heatmap(xlsx, sheet = "Income Statement")

Pattern	Interpretation
Uniform color	Cells created together
Top→bottom gradient	Normal sequential creation
Color bands	Rows added in batches
Yellow in purple area	Late insertion (suspicious)
Checkerboard	Heavy cell-by-cell editing

Limitations

What calcChain Analysis Cannot Detect

Plain values — Only formula cells are tracked
Absolute timestamps — Only relative order is known
Circumvention — Copy/paste values or recreate file
Cross-column patterns — Analysis is per-column
Intent — Cannot distinguish error from fraud

When to Be Suspicious

High Suspicion

Isolated late additions in data entry sections
Result cells created before input cells
Specific rows inserted into otherwise sequential data
Small clusters of anomalies in otherwise clean data

Low Suspicion

High anomaly count in old files
Entire sections out of order (reorganization)
Header/footer rows created last
Consistent patterns across similar sheets

Next Steps

Forensic Analysis — Complete forensic workflow
Getting Started — Basic package usage
Excel Formula Translation — Convert Excel formulas to R