Gold-Labeled Evaluation Methodology
1. Purpose
This study quantifies how often automated constraint violations correspond to real accounting issues. Human reviewers will inspect a stratified sample of filings and assign labels that distinguish genuine problems from tagging or framework artifacts. The resulting gold dataset drives precision, recall, and inter-rater reliability metrics at multiple materiality thresholds.
2. Sampling Strategy
- Universe: `results/disaggregates/filings_real.csv` enriched with outputs from Tasks 02–04 (`gap_scores.csv`, `q_values.csv`).
- Strata: automated severity levels (`pass`, `low`, `medium`, `high`, `critical`).
- Sample Size: target 60 filings per stratum (≈300 total). Oversample `high`/`critical` if counts permit.
- Selection: random within each stratum using a fixed RNG seed (`numpy.random.default_rng(2025)`) to guarantee reproducibility. Record each sampled `filing_id` in the gold-label CSV.
- Reserve Set: keep a 10% holdout for secondary labeling (inter-rater study).
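The stratified draw can be sketched as follows. This is a minimal illustration, assuming a merged DataFrame with `severity` and `filing_id` columns (the enrichment/merge step is elided); `sample_strata` and `holdout` are hypothetical names, not part of the existing codebase:

```python
import numpy as np
import pandas as pd

def sample_strata(df: pd.DataFrame, per_stratum: int = 60, seed: int = 2025) -> pd.DataFrame:
    """Draw up to `per_stratum` filings from each severity level, reproducibly."""
    rng = np.random.default_rng(seed)
    picks = []
    for _, group in df.groupby("severity"):
        n = min(per_stratum, len(group))  # small strata contribute everything they have
        idx = rng.choice(group.index.to_numpy(), size=n, replace=False)
        picks.append(df.loc[idx])
    sampled = pd.concat(picks)
    # Mark a 10% holdout for the secondary-labeling (inter-rater) study.
    holdout_idx = rng.choice(sampled.index.to_numpy(),
                             size=max(1, len(sampled) // 10), replace=False)
    sampled["holdout"] = sampled.index.isin(holdout_idx)
    return sampled
```

Because the generator is seeded once at the top, rerunning the function on the same input yields the same `filing_id` selection, which is what makes the sample auditable.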
3. Labeling Protocol
Each reviewer completes the following checklist for every sampled issuer-period:
- Retrieve the filing packet (10-Q/10-K) from SEC EDGAR.
- Recompute key equity bridge components using source statements.
- Compare with automated residuals.
- Assign one outcome label:
  - `TP`: confirmed accounting inconsistency or omission.
  - `FP_TAG`: XBRL tagging mistake (element selection, sign, axis).
  - `FP_MAPPING`: framework mapping gap (a constraint missing a relevant tag).
  - `TN`: no issue; the automated system passed correctly.
  - `FN`: real issue the automated system failed to flag.
- Estimate materiality percent relative to total assets (or revenue for income-statement-only issues).
- Document supporting evidence (page references, calculations, screenshots if applicable).
Label Fields

| Column | Description |
|---|---|
| `filing_id` | Accession or constructed identifier (matches source CSVs). |
| `cik` | Company CIK. |
| `period_end` | Fiscal period end (ISO date). |
| `severity` | Automated severity level from feasibility analysis. |
| `predicted_flag` | Boolean: framework flagged a violation. |
| `predicted_materiality_pct` | Automated materiality estimate (absolute residual ÷ total assets). |
| `labeler` | Initials or identifier of reviewer. |
| `issue_label` | `TP`, `FP_TAG`, `FP_MAPPING`, `TN`, or `FN`. |
| `actual_issue` | Boolean derived from `issue_label` (`TP`/`FN` = True, others False). |
| `materiality_pct` | Reviewer-estimated percent of total assets (decimal). |
| `notes` | Free-text justification. |
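The `actual_issue` derivation described in the schema can be expressed as a small helper. A sketch, assuming labels arrive as a pandas Series; `derive_actual_issue` is a hypothetical name introduced here for illustration:

```python
import pandas as pd

# Per the schema: TP and FN mean a genuine accounting issue exists.
REAL_ISSUE_LABELS = {"TP", "FN"}
VALID_LABELS = {"TP", "FP_TAG", "FP_MAPPING", "TN", "FN"}

def derive_actual_issue(labels: pd.Series) -> pd.Series:
    """Map issue_label -> actual_issue boolean; reject unknown labels early."""
    unknown = set(labels) - VALID_LABELS
    if unknown:
        raise ValueError(f"unexpected issue_label values: {sorted(unknown)}")
    return labels.isin(REAL_ISSUE_LABELS)
```

Validating the label vocabulary before deriving the boolean catches typos in hand-entered CSVs before they silently distort the confusion matrix.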
4. Quality Control
- Two-pass review: 10% of filings receive independent second labels. Disagreements escalate for adjudication.
- Cohen’s κ: computed on the overlap subset. A κ ≥ 0.7 is desired; investigate disagreements below that threshold.
- Audit trail: store workpapers (spreadsheets, screenshots) in `docs/validation/workpapers/YYYY-MM-DD/`.
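Cohen's κ on the overlap subset needs no external dependencies. A minimal sketch, assuming two equal-length lists of labels for the same filings (one per rater):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent marginal label distributions.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters constant and identical
    return (p_o - p_e) / (1 - p_e)
```

Values at or above the 0.7 threshold indicate substantial agreement; disagreements on the overlap subset below that level warrant adjudication and possibly a schema clarification.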
5. Metrics Computation
Metrics are computed with `scripts/compute_precision_recall.py`.
- Precision / Recall / F1: evaluated at materiality cutoffs of 0.1%, 0.5%, and 1.0% using `materiality_pct` from human reviewers.
- Confusion Matrix: derived from the boolean `predicted_flag` × `actual_issue` pairing.
- Calibration Curve: optional future work once probability scores are available.
- Cohen's κ: optional; supply the `labeler_a` and `labeler_b` inputs if dual labels exist.
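The threshold-specific scoring can be sketched in plain Python. This is an illustration of the logic, not the contents of `scripts/compute_precision_recall.py`; it assumes each row carries `predicted_flag`, `actual_issue`, and `materiality_pct` (as a decimal fraction), per the label schema:

```python
def precision_recall_f1(rows, cutoff_pct):
    """Score predicted_flag against human labels, counting only confirmed
    issues at or above the materiality cutoff as real positives."""
    tp = fp = fn = 0
    for r in rows:
        positive = r["actual_issue"] and r["materiality_pct"] >= cutoff_pct
        if r["predicted_flag"] and positive:
            tp += 1
        elif r["predicted_flag"]:
            fp += 1  # flagged, but no material confirmed issue
        elif positive:
            fn += 1  # material confirmed issue the system missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# The three cutoffs from this methodology, as decimal fractions.
CUTOFFS = [0.001, 0.005, 0.010]
```

Note that raising the cutoff reclassifies sub-threshold confirmed issues as non-positives, so precision and recall both shift as the cutoff moves.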
6. Deliverables
- `results/eval/gold_labeled_300.csv` (or larger) populated with the schema above.
- Metrics JSON (`precision_recall_report.json`) capturing overall and threshold-specific statistics.
- Confusion matrix visualization saved to `results/eval/confusion_matrix.png`.
- This methodology document, alongside supplementary instructions as needed.
7. Timeline & Staffing
- Labeling effort: ~30 minutes per filing ⇒ 150 labor hours for 300 filings.
- Staffing options: Nirvan (CPA) and one accounting PhD student contractor.
- Review cadence: weekly check-ins to monitor κ and recalibrate schema if confusion persists.
8. Next Steps
- Run the sampling utility (future script) to populate the stub CSV with selected filings.
- Assign batches to reviewers with due dates.
- Update `gold_labeled_300.csv` iteratively; rerun the metrics script after each tranche.
- Store final metrics and plots under `results/eval/` and cite them in the forthcoming whitepaper.