Hierarchical Calibration & Multiple Testing Control

Motivation

Constraint validation across thousands of filings generates dozens of concurrent hypothesis tests. Treating them independently inflates the false positive rate and ignores the dependence induced by repeated observations within firms and reporting periods. Hierarchical calibration provides a principled way to quantify variance components, adjust for multiplicity, and report violation rates that auditors can trust.

Model

For each constraint residual \( \varepsilon_{itj} \) (firm \( i \), period \( t \), constraint \( j \)), we estimate the linear mixed-effects model

\[ \varepsilon_{itj} = \alpha_i + \gamma_t + u_{itj}, \]

with random intercepts \( \alpha_i \sim \mathcal{N}(0, \sigma^2_{\alpha}) \) and Gaussian residuals \( u_{itj} \sim \mathcal{N}(0, \sigma^2_u) \). Period effects \( \gamma_t \) are treated as fixed. The intra-cluster correlation (ICC) is

\[ \rho = \frac{\sigma^2_{\alpha}}{\sigma^2_{\alpha} + \sigma^2_u}, \]

which inflates sampling variance via the design effect

\[ \mathrm{DEFF} = 1 + (\bar{n} - 1)\,\rho, \]

where \( \bar{n} \) is the average number of filings per firm. We report the effective sample size \( N_{\mathrm{eff}} = N_{\mathrm{obs}} / \mathrm{DEFF} \).
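The variance-inflation arithmetic above is small enough to sketch directly. The variance components and cluster size below are hypothetical placeholders, not estimates from the paper's data:

```python
def design_effect(sigma2_alpha, sigma2_u, n_bar):
    """Return (ICC, DEFF) for given variance components and mean cluster size."""
    icc = sigma2_alpha / (sigma2_alpha + sigma2_u)   # rho
    deff = 1.0 + (n_bar - 1.0) * icc                 # design effect
    return icc, deff

# Hypothetical values: between-firm variance 0.3, residual variance 0.7,
# and an average of 12 filings per firm.
icc, deff = design_effect(sigma2_alpha=0.3, sigma2_u=0.7, n_bar=12.0)
n_eff = 1128 / deff  # N_obs = 1128 is likewise illustrative
```

Even a modest ICC shrinks the effective sample size sharply once clusters are large, which is why the report cites \( N_{\mathrm{eff}} \) rather than raw counts.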

Multiple Testing

Observed violation counts for each constraint are compared with the null hypothesis of a 5% violation rate using two-sided binomial tests. To maintain statistical power while limiting expected false discoveries, we apply the Benjamini–Hochberg procedure. Given ordered \( p \)-values \( p_{(k)} \),

\[ k^{*} = \max \left\{ k : p_{(k)} \le \frac{k}{m}\,\alpha \right\}, \]

with \( m = 49 \) constraints and \( \alpha = 0.05 \). Constraints \( 1, \dots, k^{*} \) are flagged as statistically significant. Bonferroni-adjusted values remain available for conservative downstream checks.
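The step-up rule is short enough to implement without a statistics library. A minimal sketch of the Benjamini–Hochberg procedure as defined above (the example \( p \)-values are invented, not the paper's):

```python
def benjamini_hochberg(pvals, alpha=0.05):
    """Return a boolean list: True where BH rejects at level alpha."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    # k* = largest rank k with p_(k) <= (k/m) * alpha
    k_star = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank / m * alpha:
            k_star = rank
    flags = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= k_star:
            flags[i] = True
    return flags

# Illustrative p-values for four constraints.
flags = benjamini_hochberg([0.01, 0.5, 0.02, 0.03], alpha=0.05)
```

Note the step-up character of the rule: a \( p \)-value above its own threshold can still be rejected if a larger-ranked one passes.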

Wild Cluster Bootstrap

With 94 firms, classical cluster-robust asymptotics are serviceable but noisy. The wild cluster bootstrap (Cameron, Gelbach & Miller, 2008) draws a Rademacher weight for each firm, flips the sign of that firm's residuals accordingly, and recomputes the statistic \( T \) on the reweighted data. The resulting bootstrap distribution yields standard errors, percentile confidence intervals, and two-sided \( p \)-values that remain valid with a moderate number of clusters.
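To make the resampling scheme concrete, here is a minimal sketch for the simplest case, testing \( H_0: \) mean residual \( = 0 \); a regression setting would reweight restricted residuals instead, but the per-cluster Rademacher draw is the same. The data below are invented:

```python
import random

def wild_cluster_boot_pvalue(clusters, reps=999, seed=0):
    """Two-sided wild cluster bootstrap p-value for H0: mean = 0.

    `clusters` is a list of lists of residuals, one inner list per firm,
    so the sign flip is applied firm-by-firm, preserving within-firm dependence.
    """
    rng = random.Random(seed)
    n = sum(len(c) for c in clusters)
    t_obs = sum(sum(c) for c in clusters) / n
    hits = 0
    for _ in range(reps):
        t_star = 0.0
        for c in clusters:
            w = 1.0 if rng.random() < 0.5 else -1.0  # Rademacher weight per firm
            t_star += w * sum(c)
        if abs(t_star / n) >= abs(t_obs):
            hits += 1
    return (hits + 1) / (reps + 1)  # add-one adjustment avoids p = 0

# Four hypothetical firms, two filings each, residuals all positive.
p = wild_cluster_boot_pvalue([[1.0, 1.1], [0.9, 1.2], [1.0, 0.8], [1.1, 1.0]])
```

With only four clusters, the bootstrap statistic matches or exceeds the observed one exactly when all four weights share a sign, so the p-value hovers near \(2/2^4 = 0.125\), illustrating why very small cluster counts cap the attainable significance.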

Calibration

We assess overall calibration using a chi-square goodness-of-fit test, \[ \chi^2 = \sum_{j=1}^{m} \frac{(O_j - E_j)^2}{E_j}, \] where \( O_j \) are the observed violation counts and \( E_j = 0.05\,N \) are the expected counts under the null. A reduced chi-square \( \chi^2 / (m-1) \) near 1 indicates that empirical violation rates align with theoretical expectations. A calibration curve of predicted versus observed rates highlights residual bias in constraint assessments.
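The statistic is a one-liner once observed counts are in hand. A minimal sketch under the null above, with invented counts (three constraints, \( N = 100 \) filings, expected 5 violations each):

```python
def chi_square_calibration(observed, n_filings, p0=0.05):
    """Chi-square GOF statistic against a uniform p0 violation rate.

    Returns (chi2, reduced_chi2) where reduced_chi2 = chi2 / (m - 1).
    """
    expected = p0 * n_filings                       # E_j = p0 * N, same for all j
    chi2 = sum((o - expected) ** 2 / expected for o in observed)
    reduced = chi2 / (len(observed) - 1)
    return chi2, reduced

# Perfectly calibrated counts give chi2 = 0; miscalibrated ones inflate it.
chi2_good, _ = chi_square_calibration([5, 5, 5], n_filings=100)
chi2_bad, reduced_bad = chi_square_calibration([10, 5, 0], n_filings=100)
```

A reduced chi-square well above 1 signals that some constraints violate far more (or less) often than the 5% target, which the calibration curve then localizes.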

Implementation Summary

This calibration stack addresses reviewer concerns about multiple comparisons and within-firm dependence, supporting deployment at Big Four audit firms with transparent statistical guarantees.
