Gold-Labeled Evaluation Methodology

1. Purpose

This study quantifies how often automated constraint violations correspond to real accounting issues. Human reviewers will inspect a stratified sample of filings and assign labels that distinguish genuine problems from tagging or framework artifacts. The resulting gold dataset drives precision, recall, and inter-rater reliability metrics at multiple materiality thresholds.
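Inter-rater reliability can be quantified on any subset of filings that two reviewers label independently. One common choice of statistic is Cohen's kappa; the snippet below is a minimal sketch under that assumption (the study's actual reliability statistic is not specified here, and the function name `cohens_kappa` is illustrative):

```python
from collections import Counter

def cohens_kappa(a: list[str], b: list[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance.

    p_o is the observed agreement rate; p_e is the agreement expected
    if each rater assigned labels independently according to their own
    marginal label frequencies.
    """
    assert len(a) == len(b) and a, "need two equal-length, non-empty label lists"
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in set(a) | set(b)) / (n * n)
    return (p_o - p_e) / (1 - p_e) if p_e != 1 else 1.0
```

Kappa of 1.0 indicates perfect agreement; 0.0 indicates agreement no better than chance.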

2. Sampling Strategy

3. Labeling Protocol

Each reviewer completes the following checklist for every sampled issuer-period:

  1. Retrieve the filing packet (10-Q/10-K) from SEC EDGAR.
  2. Recompute key equity bridge components using source statements.
  3. Compare with automated residuals.
  4. Assign outcome labels:
    • TP: confirmed accounting inconsistency or omission.
    • FP_TAG: XBRL tagging mistake (element selection, sign, axis).
    • FP_MAPPING: framework mapping gap (constraint missing relevant tag).
    • TN: no issue; the automated system correctly raised no flag.
    • FN: real issue that the automated system failed to flag.
  5. Estimate materiality percent relative to total assets (or revenue for income-statement-only issues).
  6. Document supporting evidence (page references, calculations, screenshots if applicable).

Label Fields

  Column                      Description
  filing_id                   Accession or constructed identifier (matches source CSVs).
  cik                         Company CIK.
  period_end                  Fiscal period end (ISO date).
  severity                    Automated severity level from the feasibility analysis.
  predicted_flag              Boolean: the framework flagged a violation.
  predicted_materiality_pct   Automated materiality estimate (absolute residual ÷ total assets).
  labeler                     Initials or identifier of the reviewer.
  issue_label                 TP, FP_TAG, FP_MAPPING, TN, or FN.
  actual_issue                Boolean derived from issue_label (TP/FN = True, others False).
  materiality_pct             Reviewer-estimated percent of total assets (decimal).
  notes                       Free-text justification.
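Because actual_issue is fully determined by issue_label, it can be derived and validated mechanically when the CSV is loaded. A standard-library sketch (the helper names are illustrative; the CSV path is whichever gold file is being processed):

```python
import csv

VALID_LABELS = {"TP", "FP_TAG", "FP_MAPPING", "TN", "FN"}

def derive_actual_issue(issue_label: str) -> bool:
    """actual_issue is True exactly for confirmed real issues (TP, FN)."""
    if issue_label not in VALID_LABELS:
        raise ValueError(f"unknown issue_label: {issue_label!r}")
    return issue_label in {"TP", "FN"}

def load_gold_rows(path: str) -> list[dict]:
    """Read a gold-label CSV and overwrite actual_issue from issue_label."""
    with open(path, newline="") as fh:
        rows = list(csv.DictReader(fh))
    for row in rows:
        row["actual_issue"] = derive_actual_issue(row["issue_label"])
    return rows
```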

4. Quality Control

5. Metrics Computation

Metrics are computed with scripts/compute_precision_recall.py.
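The core of that computation reduces to counting outcome labels: both FP_TAG and FP_MAPPING count as false positives, precision is TP / (TP + FP), and recall is TP / (TP + FN). A stand-in sketch of that logic (not the actual interface of compute_precision_recall.py, which is not reproduced here):

```python
from collections import Counter

def precision_recall(labels: list[str]) -> dict[str, float]:
    """Compute precision and recall from gold outcome labels.

    FP_TAG and FP_MAPPING are pooled as false positives; thresholded
    variants would first filter labels by materiality_pct.
    """
    c = Counter(labels)
    tp, fn = c["TP"], c["FN"]
    fp = c["FP_TAG"] + c["FP_MAPPING"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"precision": precision, "recall": recall}
```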

6. Deliverables

  1. results/eval/gold_labeled_300.csv (or larger) populated with the schema above.
  2. Metrics JSON (precision_recall_report.json) capturing overall and threshold-specific statistics.
  3. Confusion matrix visualization saved to results/eval/confusion_matrix.png.
  4. This methodology document, alongside supplementary instructions as needed.

7. Timeline & Staffing

8. Next Steps

  1. Run sampling utility (future script) to populate the stub CSV with selected filings.
  2. Assign batches to reviewers with due dates.
  3. Update gold_labeled_300.csv iteratively; rerun metrics script after each tranche.
  4. Store final metrics & plots under results/eval/ and cite in forthcoming whitepaper.
