The strongest defence of this research programme is not a list of confirmations. It is an honest account of what could be wrong, and why we think those problems are either not present or not serious enough to disqualify the findings.
This page exists because we were asked the question directly: Is what you are doing scientifically defensible? The answer is yes, with qualifications that matter. Here are the qualifications.
The multiple comparisons problem
We have tested over 275 hypotheses. At a conventional significance threshold of p < 0.05, approximately 14 of those tests would return false positives by chance alone, even if no real signal exists. We have 127 confirmed signals. We cannot identify which of those - if any - are the false positives.
There is a further complication the simple math obscures: many of our hypotheses share overlapping data streams or underlying drivers - solar activity, ENSO, volcanic aerosols, and long-wave economic cycles appear across multiple signal families. This means the 275 tests are not fully independent of each other, which cuts both ways. It may reduce the effective number of false positives below 14 - but it also means our surrogate testing must be calibrated against correlated test families, not just individual series. We do not apply formal false discovery rate controls (Benjamini-Hochberg or equivalent). For the weaker-effect confirmed signals especially, a non-negligible fraction could still be artifacts of this shared structure.
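For readers unfamiliar with what such a control involves, the sketch below implements the standard Benjamini-Hochberg step-up procedure against placeholder p-values. It illustrates the known gap; it is not a description of anything currently in the pipeline, and the numbers fed to it here are random, not our test results.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Boolean mask of discoveries under the Benjamini-Hochberg step-up procedure.

    Controls the false discovery rate at `alpha` for independent or positively
    dependent tests; the correlated test families described above would call for
    the more conservative Benjamini-Yekutieli variant.
    """
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    ranked = np.sort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    passing = np.nonzero(ranked <= thresholds)[0]
    if len(passing) == 0:
        return np.zeros(m, dtype=bool)
    cutoff = ranked[passing.max()]          # largest p-value still under its rank threshold
    return p <= cutoff

# Placeholder p-values only - not the Observatory's actual test results.
rng = np.random.default_rng(0)
print(int(benjamini_hochberg(rng.uniform(0, 1, 275)).sum()), "of 275 would survive")
```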
What we do about it:
We do not pretend this problem away. We address it through four partial mitigations:
Effect sizes, not just p-values. A signal with r = 0.04 and p = 0.04 is not the same as a signal with r = 0.48 and p = 0.04. We report both and treat small-effect findings with explicit caution.
Surrogate significance testing. For every signal, we generate randomised surrogate series with the same autocorrelation structure and retest. This corrects for the inflation of apparent significance that autocorrelated data produces. A signal that survives the surrogate test is not just beating chance; it is beating its own noise structure. (A sketch of one standard surrogate construction follows this list.)
The CONSILIENCE upgrade requires genuine replication. A signal is upgraded from CONFIRMED to CONSILIENCE only when two or more genuinely independent research traditions - different instruments, different time periods, different methodologies, no shared citations - converge on the same mechanism. Chance associations do not replicate across independent research programmes.
We publish the kill list. Of 275 hypotheses tested, 79 were killed. A 28% kill rate is evidence that the system is not confirmation-biased. We do not keep failed tests in a drawer.
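The surrogate test in the second mitigation above is the workhorse. One standard way to build surrogates that preserve autocorrelation structure is Fourier phase randomisation; the sketch below shows that construction and an empirical p-value computed against it. It illustrates the idea under those assumptions - the Observatory's production surrogate generator may differ in detail.

```python
import numpy as np

def phase_randomised_surrogate(x, rng):
    """Surrogate series with the same power spectrum - and hence the same
    autocorrelation structure - as x, built by randomising Fourier phases."""
    n = len(x)
    spectrum = np.fft.rfft(x - x.mean())
    phases = rng.uniform(0.0, 2.0 * np.pi, len(spectrum))
    phases[0] = 0.0                      # keep the zero-frequency component real
    if n % 2 == 0:
        phases[-1] = 0.0                 # keep the Nyquist component real for even-length series
    surrogate = np.fft.irfft(np.abs(spectrum) * np.exp(1j * phases), n)
    return surrogate + x.mean()

def surrogate_p_value(x, y, n_surrogates=1000, seed=0):
    """Fraction of surrogates whose |correlation| with y matches or beats the observed one."""
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(x, y)[0, 1])
    hits = sum(
        abs(np.corrcoef(phase_randomised_surrogate(x, rng), y)[0, 1]) >= observed
        for _ in range(n_surrogates)
    )
    return (hits + 1) / (n_surrogates + 1)
```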
What we do not claim: That these mitigations eliminate the multiple comparisons problem. They reduce it. The honest position is that confirmed signals should be weighted by effect size, consilience tier, and number of independent validation streams - not treated as equivalent to each other. We have not applied FDR corrections across the full battery. Adding them is a known gap.
The pre-registration gap
None of the Observatory’s hypotheses were pre-registered before testing. Pre-registration - committing to a hypothesis, sample, and analysis plan before seeing the data - is the gold standard for ruling out p-hacking.
Why this matters: Without pre-registration, a determined analyst can cycle through variable definitions, lag windows, and sample periods until significance emerges. This is unconscious as often as it is deliberate. Our process - human agenda-setting followed by AI execution - adds a further layer where exploratory search can occur before formal testing begins, even when no individual step looks like p-hacking. The era split chosen, the lag window specified, the confounds included: all of these are decisions made before the formal test runs.
Our partial mitigation: The surrogate test and era-splitting (testing in two non-overlapping periods) serve a similar purpose to pre-registration: they test whether the pattern holds out-of-sample. A signal that only appears in the discovery period and dissolves in the validation period is killed. This is not identical to pre-registration, but it captures the same core concern.
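As an illustration of the era-splitting idea, the minimal sketch below tests a relationship separately in two non-overlapping periods. The variable names and the split year are hypothetical placeholders, not our actual configuration.

```python
import numpy as np
from scipy import stats

def era_split_test(x, y, years, split_year):
    """Test the x-y relationship separately in two non-overlapping eras.

    The signal must hold in both the discovery era and the held-out validation
    era; `split_year` and the series passed in here are placeholders."""
    x, y, years = (np.asarray(a, dtype=float) for a in (x, y, years))
    discovery = years < split_year
    r_d, p_d = stats.pearsonr(x[discovery], y[discovery])
    r_v, p_v = stats.pearsonr(x[~discovery], y[~discovery])
    return {"discovery": (r_d, p_d), "validation": (r_v, p_v)}
```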
What we do not claim: Pre-registration equivalence. We claim out-of-sample stability. These are not the same thing. A true pre-registration would require committing to the exact era split, lag window, and variable definitions before seeing any results. We have not done this.
The AI execution risk
Every paper and bulletin on this site is produced by AI. AI systems can generate plausible-sounding statistics that are not in the underlying source data. This is not a hypothetical - it happened once during our development process. A generated paper cited a specific correlation coefficient that did not appear in any validation file.
What we do about it:
The generation pipeline includes a mandatory data fidelity constraint - the AI is explicitly instructed that every numerical claim must trace to a value in the Observatory’s validation files, and that qualitative description is required when numbers are not sourced. A second-pass automated validator checks generated output against the raw validation files before the draft reaches human review. The validation files are produced by deterministic statistical code, not by the same AI that writes the papers. This separation - deterministic statistics feeding into AI prose, with a validator checking the handoff - is the critical structural safeguard. (A minimal sketch of what such a validator check can look like follows below.)
Human review is required before any piece publishes. The data fidelity check is the most operationally critical part of that review: every specific number in a published piece should be verifiable against the Observatory’s validation record.
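To make the data fidelity check concrete, the sketch below shows one minimal way such a validator could work: pull decimal figures out of a generated draft and flag any that do not appear in a validation file. The file format, matching rules, and function name are illustrative assumptions, not the pipeline's actual implementation.

```python
import json
import re

def unverified_numbers(draft_text, validation_path):
    """Flag decimal figures in a generated draft that do not appear in the validation file.

    Illustrative only: assumes a flat JSON mapping of labels to numbers, which is a
    placeholder for whatever format the real validation files use."""
    with open(validation_path) as f:
        allowed = {round(float(v), 6) for v in json.load(f).values()}
    claims = re.findall(r"-?\d+\.\d+", draft_text)      # decimal numbers quoted in the prose
    return [c for c in claims if round(float(c), 6) not in allowed]

# A draft citing r = 0.52 would be flagged if 0.52 is absent from the validation file.
```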
What we do not claim: That this eliminates the risk entirely. We claim that fabricated figures are caught at the editorial stage. The process for catching them is documented and enforced. If the loop were ever closed - AI generating both the validation results and the papers summarising them - the output would become sophisticated hallucination. We have not closed that loop.
What we cannot fully demonstrate from this page alone: how many person-hours of qualified review each published piece receives, and whether the reviewer has domain expertise in the relevant statistics and subject matter. We acknowledge that “human review is required” is a process claim, not a quality guarantee. Whether that review is adequate depends on execution rigour that a reader cannot verify from the published output alone.
The independence problem in consilience
The consilience standard requires that converging evidence come from genuinely independent research traditions. Our definition of independence - different instruments, different time periods, different methodologies, no shared citations - is defensible but not foolproof.
The deeper problem is common-cause confounding at a level below the proposed mechanism. Multiple historical records - cherry blossoms, Nile flood levels, Broadbalk yields, cave speleothems - can all reflect the same underlying climate driver (solar output, volcanic aerosol loading, ENSO) without the specific causal chain we propose being correct. If all four proxies respond to “climate variability” through separate local pathways, their convergence is evidence of a shared climate signal, not necessarily evidence of the specific mechanism (cosmic ray ionisation, say) that we identify as the common cause.
Our mechanism review (dimension 8) is designed to catch this: if the proposed causal chain cannot be traced to a physically verified step, the signal is downgraded. But this review is still internally conducted, and a genuinely adversarial external reviewer might identify common-cause structures we have not tested. We consider this the most structurally difficult challenge to the consilience approach as we practise it.
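One concrete form such an adversarial check could take is a partial correlation: regress each proxy on a candidate shared driver and ask whether the proxies still agree once the driver is removed. The sketch below is not part of our published framework, and the variable names are hypothetical; it shows the kind of test an external reviewer might run.

```python
import numpy as np
from scipy import stats

def partial_correlation(proxy_a, proxy_b, driver):
    """Correlation between two proxies after linearly removing a candidate shared driver.

    If the partial correlation collapses toward zero, the proxies' agreement is
    explained by the common driver rather than the proposed mechanism."""
    def residuals(series, control):
        series = np.asarray(series, dtype=float)
        control = np.asarray(control, dtype=float)
        design = np.column_stack([np.ones(len(control)), control])
        coef, *_ = np.linalg.lstsq(design, series, rcond=None)
        return series - design @ coef
    return stats.pearsonr(residuals(proxy_a, driver), residuals(proxy_b, driver))
```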
What we do not show: sensitivity analysis and base rates
This page would be more useful if it included explicit base-rate estimates and Bayesian-style updates. For a domain with low prior plausibility, even a CONSILIENCE verdict moves the posterior probability less than it appears to. We have not published signal-level prior estimates or posterior confidence intervals.
What we can say: our 8% Whewell Gate pass rate (5 consiliences from 59 candidate groups) is consistent with the prior that genuine cross-domain convergence is rare. If we assumed a 10% base rate of real effects among our 275 hypotheses, and applied our validation process as a likelihood ratio, we would still expect a meaningful false positive rate among the weaker confirmed signals. We have not done this calculation formally. Publishing it would be more honest than the current implicit assumption that all confirmed signals have equal evidential weight. It is on our list.
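For orientation, the back-of-envelope version of that calculation, with every input an assumption rather than an estimate we have made, looks like this:

```python
def posterior_true(prior, sensitivity, false_positive_rate):
    """Posterior probability that a confirmed signal is real, treating the whole
    validation battery as one diagnostic test. All three inputs are assumptions."""
    true_pos = prior * sensitivity
    false_pos = (1 - prior) * false_positive_rate
    return true_pos / (true_pos + false_pos)

# Assumed numbers: 10% base rate, 80% battery sensitivity, 5% battery false positive rate.
print(round(posterior_true(0.10, 0.80, 0.05), 2))   # 0.64 - roughly a third of confirmations could be false
```

Under those assumed inputs, a confirmation is real with only about 64% probability, which is the sense in which a low prior blunts even a strong validation battery.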
The absence of peer review
Nothing published here has been externally peer-reviewed. The Observatory’s validation framework is internally designed and internally applied.
What this means: Our findings have not been subjected to the scrutiny of domain specialists outside this research programme. A signal confirmed by our 8-step framework has not been validated by an independent team applying a different methodology.
The partial offset: The consilience standard itself is a form of external validation - not of our analysis, but of the underlying phenomena. When the Broadbalk wheat experiment, the Kyoto cherry blossom record, and the Nile nilometer independently show the same periodicity, the agreement was not produced by our analysis team. It was produced by agronomists, phenologists, and hydrologists across centuries, none of whom were testing our hypothesis. Our contribution is the synthesis, not the underlying data.
This does not substitute for peer review of the synthesis. We acknowledge the gap.
What would change our findings
For each major confirmed signal, we document what finding would kill it. The general categories:
Surrogate replication failure - if the signal dissolves when tested against surrogate series with matching autocorrelation structure, it is a noise artifact.
Era-splitting failure - if the relationship holds in the discovery period but not in a held-out validation period, it is period-specific.
Mechanism collapse - if the physical, biological, or chemical mechanism connecting cause to effect is shown to be implausible or absent, the signal is killed regardless of the statistics.
Confound identification - if a third variable is identified that produces the same apparent relationship without the proposed mechanism, the signal is downgraded or killed.
Twenty-eight percent of what we tested has met one of these criteria. The 127 confirmed signals have not - yet.
Why we think the programme is worth doing
The objections above are real. None of them are decisive against the body of confirmed findings, for the following reason: the alternative - ignoring large-scale cross-domain pattern synthesis because it is methodologically imperfect - produces its own errors of omission. Climate science, epidemiology, and economic history have all generated major insights from exactly the kind of multi-source consilience analysis described here. The methodology is not novel. The scale and automation are.
The 8% Whewell Gate pass rate - 5 genuine consiliences from 59 candidate groups - is the expected rate for a rigorous standard. It is not a failure of the programme. It is the programme working.
