Gold-Labeled Evaluation Methodology
1. Purpose
This study quantifies how often automated constraint violations correspond to real accounting issues. Human reviewers will inspect a stratified sample of filings and assign labels that distinguish genuine problems from tagging or framework artifacts. The resulting gold dataset drives precision, recall, and inter-rater reliability metrics at multiple materiality thresholds.
2. Sampling Strategy
- Universe: `results/disaggregates/filings_real.csv` enriched with outputs from Tasks 02–04 (`gap_scores.csv`, `q_values.csv`).
- Strata: automated severity levels (`pass`, `low`, `medium`, `high`, `critical`).
- Sample Size: target 60 filings per stratum (≈300 total). Oversample `high`/`critical` if counts permit.
- Selection: random within each stratum using a fixed RNG seed (`numpy.random.default_rng(2025)`) to guarantee reproducibility. Record each sampled `filing_id` in the gold-label CSV.
- Reserve Set: keep a 10% holdout for secondary labeling (inter-rater study).
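The stratified draw can be sketched as follows. This is a minimal illustration, assuming a merged DataFrame with `severity` and `filing_id` columns (the enrichment/merge step is elided); `sample_strata` and `holdout` are hypothetical names, not part of the existing codebase:

```python
import numpy as np
import pandas as pd

def sample_strata(df: pd.DataFrame, per_stratum: int = 60, seed: int = 2025) -> pd.DataFrame:
    """Draw up to `per_stratum` filings from each severity level, reproducibly."""
    rng = np.random.default_rng(seed)
    picks = []
    for _, group in df.groupby("severity"):
        n = min(per_stratum, len(group))  # small strata contribute everything they have
        idx = rng.choice(group.index.to_numpy(), size=n, replace=False)
        picks.append(df.loc[idx])
    sampled = pd.concat(picks)
    # Mark a 10% holdout for the secondary-labeling (inter-rater) study.
    holdout_idx = rng.choice(sampled.index.to_numpy(),
                             size=max(1, len(sampled) // 10), replace=False)
    sampled["holdout"] = sampled.index.isin(holdout_idx)
    return sampled
```

Because the generator is seeded once at the top, rerunning the function on the same input yields the same `filing_id` selection, which is what makes the sample auditable.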
3. Labeling Protocol
Each reviewer completes the following checklist for every sampled issuer-period:
- Retrieve the filing packet (10-Q/10-K) from SEC EDGAR.
- Recompute key equity bridge components using source statements.
- Compare with automated residuals.
- Assign one outcome label:
  - `TP`: confirmed accounting inconsistency or omission.
  - `FP_TAG`: XBRL tagging mistake (element selection, sign, axis).
  - `FP_MAPPING`: framework mapping gap (a constraint missing a relevant tag).
  - `TN`: no issue; the automated system passed correctly.
  - `FN`: real issue the automated system failed to flag.
- Estimate materiality percent relative to total assets (or revenue for income-statement-only issues).
- Document supporting evidence (page references, calculations, screenshots if applicable).
Label Fields

| Column | Description |
|---|---|
| `filing_id` | Accession or constructed identifier (matches source CSVs). |
| `cik` | Company CIK. |
| `period_end` | Fiscal period end (ISO date). |
| `severity` | Automated severity level from feasibility analysis. |
| `predicted_flag` | Boolean: framework flagged a violation. |
| `predicted_materiality_pct` | Automated materiality estimate (absolute residual ÷ total assets). |
| `labeler` | Initials or identifier of reviewer. |
| `issue_label` | `TP`, `FP_TAG`, `FP_MAPPING`, `TN`, or `FN`. |
| `actual_issue` | Boolean derived from `issue_label` (`TP`/`FN` = True, others False). |
| `materiality_pct` | Reviewer-estimated percent of total assets (decimal). |
| `notes` | Free-text justification. |
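The `actual_issue` derivation described in the schema can be expressed as a small helper. A sketch, assuming labels arrive as a pandas Series; `derive_actual_issue` is a hypothetical name introduced here for illustration:

```python
import pandas as pd

# Per the schema: TP and FN mean a genuine accounting issue exists.
REAL_ISSUE_LABELS = {"TP", "FN"}
VALID_LABELS = {"TP", "FP_TAG", "FP_MAPPING", "TN", "FN"}

def derive_actual_issue(labels: pd.Series) -> pd.Series:
    """Map issue_label -> actual_issue boolean; reject unknown labels early."""
    unknown = set(labels) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected issue_label values: {sorted(unknown)}")
    return labels.isin(REAL_ISSUE_LABELS)
```

Validating the label vocabulary before deriving the boolean catches typos in hand-entered CSVs before they silently distort the confusion matrix.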
4. Quality Control
- Two-pass review: 10% of filings receive independent second labels. Disagreements escalate for adjudication.
- Cohen’s κ: computed on the overlap subset. A κ ≥ 0.7 is desired; investigate disagreements below that threshold.
- Audit trail: store workpapers (spreadsheets, screenshots) in `docs/validation/workpapers/YYYY-MM-DD/`.
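Cohen's κ on the overlap subset needs no external dependencies. A minimal sketch, assuming two equal-length lists of labels for the same filings (one per rater):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent marginal label distributions.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters constant and identical
    return (p_o - p_e) / (1 - p_e)
```

Values at or above the 0.7 threshold indicate substantial agreement; disagreements on the overlap subset below that level warrant adjudication and possibly a schema clarification.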
5. Metrics Computation
Metrics are computed with `scripts/compute_precision_recall.py`.
- Precision / Recall / F1: evaluated at materiality cutoffs of 0.1%, 0.5%, and 1.0% using `materiality_pct` from human reviewers.
- Confusion Matrix: derived from the boolean `predicted_flag` × `actual_issue` pairing.
- Calibration Curve: optional future work once probability scores are available.
- Cohen's κ: optional; supply the `labeler_a` and `labeler_b` inputs if dual labels exist.
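The threshold-specific scoring can be sketched in plain Python. This is an illustration of the logic, not the contents of `scripts/compute_precision_recall.py`; it assumes each row carries `predicted_flag`, `actual_issue`, and `materiality_pct` (as a decimal fraction), per the label schema:

```python
def precision_recall_f1(rows, cutoff_pct):
    """Score predicted_flag against human labels, counting only confirmed
    issues at or above the materiality cutoff as real positives."""
    tp = fp = fn = 0
    for r in rows:
        positive = r["actual_issue"] and r["materiality_pct"] >= cutoff_pct
        if r["predicted_flag"] and positive:
            tp += 1
        elif r["predicted_flag"]:
            fp += 1  # flagged, but no material confirmed issue
        elif positive:
            fn += 1  # material confirmed issue the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# The three cutoffs from this methodology, as decimal fractions.
CUTOFFS = [0.001, 0.005, 0.010]
```

Note that raising the cutoff reclassifies sub-threshold confirmed issues as non-positives, so precision and recall both shift as the cutoff moves.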
6. Deliverables
- `results/eval/gold_labeled_300.csv` (or larger) populated with the schema above.
- Metrics JSON (`precision_recall_report.json`) capturing overall and threshold-specific statistics.
- Confusion matrix visualization saved to `results/eval/confusion_matrix.png`.
- This methodology document, alongside supplementary instructions as needed.
7. Timeline & Staffing
- Labeling effort: ~30 minutes per filing ⇒ 150 labor hours for 300 filings.
- Staffing options: Nirvan (CPA) and one accounting PhD student contractor.
- Review cadence: weekly check-ins to monitor κ and recalibrate schema if confusion persists.
8. Next Steps
- Run the sampling utility (future script) to populate the stub CSV with selected filings.
- Assign batches to reviewers with due dates.
- Update `gold_labeled_300.csv` iteratively; rerun the metrics script after each tranche.
- Store final metrics and plots under `results/eval/` and cite them in the forthcoming whitepaper.