When Results Are Reliable and When They're Not

A checklist for evaluating result quality — know the red flags that signal unreliable findings.


The Reliability Checklist

Not every Correlation Sweep or Source Predictability result deserves your attention. Before acting on any finding, run it through this reliability checklist. A result is likely reliable when it meets all four criteria:

Sufficient data points: 200+ ideal, 50+ minimum. More data narrows confidence intervals and stabilizes correlations.
Survived FDR correction: reliableBest = true. The best lag is unlikely to be a false positive from testing many hypotheses.
Reasonable effect size: |r| > 0.15. The correlation is large enough to matter practically, not just statistically.
Consistent across Pearson and Spearman: Both metrics agree in direction and approximate magnitude, which suggests a roughly linear relationship that is not driven by a few outliers.
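
The four criteria can be sketched as a small checker. This is a minimal illustration, not the platform's implementation: only reliableBest comes from the lesson, while the SweepResult container, its other field names, and the max_gap tolerance used for "approximate magnitude" are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class SweepResult:
    # Hypothetical container; only the reliableBest flag is named in the lesson.
    n: int               # number of data points behind the correlation
    reliable_best: bool  # mirrors reliableBest (best lag survived FDR correction)
    pearson_r: float
    spearman_r: float


def reliability_checks(res: SweepResult, max_gap: float = 0.15) -> dict:
    """Return one pass/fail entry per checklist criterion."""
    same_sign = res.pearson_r * res.spearman_r > 0
    # max_gap is an assumed tolerance for "approximate magnitude".
    close = abs(res.pearson_r - res.spearman_r) <= max_gap
    return {
        "sufficient_data": res.n >= 50,
        "survived_fdr": res.reliable_best,
        "effect_size": abs(res.pearson_r) > 0.15,
        "pearson_spearman_agree": same_sign and close,
    }


def is_reliable(res: SweepResult) -> bool:
    """A result passes only when all four criteria hold."""
    return all(reliability_checks(res).values())


example = SweepResult(n=240, reliable_best=True, pearson_r=0.22, spearman_r=0.19)
print(reliability_checks(example))
print("reliable:", is_reliable(example))
```

Returning the individual checks, rather than a single boolean, makes it easy to see which criterion a result failed.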

Red Flags for Unreliable Results

Watch for these six warning signs. Any one of them should make you question the finding:

1. Too Few Data Points (n < 50)

With fewer than 50 data points, confidence intervals become extremely wide. For a correlation of r = 0.35 with n = 40, the 95% confidence interval runs from roughly 0.04 to 0.60, so the true value could be anything from negligible to strong. You simply lack the statistical power to make reliable claims. Fix: Extend the time range to capture more data.
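
The width of that interval can be reproduced with the standard Fisher z-transform approximation. This is a sketch using only the Python standard library; the function name fisher_ci is made up for the example, and the platform may compute its intervals differently.

```python
import math


def fisher_ci(r: float, n: int, z_crit: float = 1.96):
    """Approximate 95% confidence interval for a Pearson r (Fisher z-transform)."""
    if n <= 3:
        raise ValueError("the approximation needs n > 3")
    z = math.atanh(r)                # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)      # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


for n in (40, 200):
    low, high = fisher_ci(0.35, n)
    print(f"r = 0.35, n = {n}: [{low:.2f}, {high:.2f}]")
```

With n = 40 the interval spans roughly 0.04 to 0.60; with n = 200 it tightens to roughly 0.22 to 0.47, which is why the checklist treats 200+ points as ideal.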

2. Failed FDR Correction (reliableBest = false)

If the best lag did not survive Benjamini-Hochberg FDR correction, the correlation may be a false positive. Testing 49 lags at p < 0.05 means you expect roughly 2.5 spurious significant results (49 × 0.05 ≈ 2.45) by chance alone, even when no real relationship exists. A result that fails FDR may be one of those artifacts. Fix: Gather more data or accept that the evidence is insufficient.
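
For intuition, here is a minimal sketch of the textbook Benjamini-Hochberg step-up procedure. The p-values are invented for illustration, and the platform's own implementation behind reliableBest is not shown in this lesson, so treat this as an approximation of the idea rather than its exact logic.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a True/False flag per hypothesis: True means it survives FDR control."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    survives = [False] * m
    for idx in order[:cutoff_rank]:
        survives[idx] = True
    return survives


# Invented p-values for 49 lags: three look "significant" at p < 0.05.
p_values = [0.5] * 49
p_values[10] = 0.0005
p_values[20] = 0.03
p_values[30] = 0.04

survives = benjamini_hochberg(p_values)
print("raw p < 0.05:", sum(p < 0.05 for p in p_values))
print("survive FDR: ", sum(survives))
```

In this made-up example, three lags pass the raw p < 0.05 threshold but only the strongest one survives the correction.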

3. Tiny Effect Size (|r| < 0.1)

A statistically significant result with |r| = 0.05 explains only 0.25% of price variance (r² = 0.0025). Even if the p-value is impressively small, the practical value is negligible. You would not be able to distinguish this signal from noise in real-time trading. Fix: None. The relationship is too weak to be useful.
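
The variance figure is simply r squared, expressed as a percentage. A short sketch (the helper name is made up for the example):

```python
def variance_explained_pct(r: float) -> float:
    """Percentage of variance explained by a linear relationship with correlation r."""
    return r * r * 100.0


for r in (0.05, 0.10, 0.15, 0.35):
    print(f"|r| = {r:.2f} -> {variance_explained_pct(r):.2f}% of variance")
```

Because the relationship is quadratic, halving |r| cuts the explained variance to a quarter.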

4. Edge Lag (Best at ±24h or ±168h)

When the best correlation falls at the exact boundary of the tested lag range, the true optimum may lie beyond the range you tested. For example, if you tested ±24h and the best lag is exactly +24h, the actual peak might be at +36h or +48h. Fix: Re-run with a wider lag range, such as Extended (±72h) or Weekly (±168h).
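
An edge-lag check is easy to automate. The sketch below assumes lags are expressed in whole hours and uses the three range sizes mentioned in this lesson (±24h, ±72h, ±168h); the function names are hypothetical.

```python
# Lag range sizes (in hours) mentioned in this lesson.
LAG_RANGES_HOURS = (24, 72, 168)


def is_edge_lag(best_lag_hours: int, max_lag_hours: int) -> bool:
    """True when the best lag sits on the boundary of the tested range."""
    return abs(best_lag_hours) >= max_lag_hours


def next_wider_range(max_lag_hours: int):
    """Return the next larger range to re-run with, or None if already at the widest."""
    for candidate in LAG_RANGES_HOURS:
        if candidate > max_lag_hours:
            return candidate
    return None


best_lag, tested_range = 24, 24
if is_edge_lag(best_lag, tested_range):
    print("edge lag; try a range of ±%sh" % next_wider_range(tested_range))
```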

5. Pearson/Spearman Divergence

When Pearson r and Spearman r disagree substantially (e.g., Pearson r = 0.5, Spearman r = 0.15), it suggests a non-linear relationship or the influence of outliers. A few extreme data points may be inflating the Pearson correlation. In this case, Spearman is more trustworthy because it works on ranks and is less sensitive to extreme values. Fix: Investigate the data for outlier events; consider using quality filters to remove noisy data points.
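
The effect of a single outlier can be seen in a small standard-library sketch. The data are invented: ten points with almost no relationship, plus one extreme point. The helper functions are written out for the example and are not the platform's code.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Invented data: essentially unrelated, then one extreme event appended.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [6, 3, 9, 1, 8, 4, 10, 2, 7, 5]
x_out, y_out = x + [40], y + [40]

print(f"without outlier: pearson={pearson(x, y):.2f} spearman={spearman(x, y):.2f}")
print(f"with outlier:    pearson={pearson(x_out, y_out):.2f} spearman={spearman(x_out, y_out):.2f}")
```

Adding the single extreme point pushes Pearson from about 0.04 to about 0.93, while Spearman only moves to about 0.28, because the outlier is just one more rank.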

6. Short Time Range (< 30 Days)

A short time window may capture an anomaly (e.g., a single major event) rather than a persistent relationship. A correlation that only exists during one news cycle is not a reliable trading signal. Fix: Extend to 90+ days and check if the signal persists.

The P-Hacking Danger

Testing 49 lags at p < 0.05 means you expect ~2.5 false positives by chance alone, even if no real relationship exists between sentiment and price. This is why FDR correction is critical.
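
The arithmetic behind that figure, along with the chance of seeing at least one false positive, can be checked in a few lines. The second number assumes the 49 tests are independent, which is a simplification: neighbouring lags are usually correlated, so treat it as a rough guide.

```python
n_lags = 49
alpha = 0.05

# Expected number of lags that cross p < alpha when no real relationship exists.
expected_false_positives = n_lags * alpha

# Probability that at least one lag crosses the threshold (assumes independent tests).
p_at_least_one = 1 - (1 - alpha) ** n_lags

print(f"expected false positives: {expected_false_positives:.2f}")
print(f"P(at least one):          {p_at_least_one:.1%}")
```

Under these assumptions, an uncorrected sweep of pure noise shows at least one "significant" lag about 92% of the time.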

If none of the 49 lags survive FDR correction, the data does not support any sentiment-price relationship for this asset and time range. This is a valid and informative result — it tells you to look elsewhere.

Best Practices for Reliable Analysis

Use 90+ day ranges: More data is almost always better for statistical reliability.
Enable the permutation test: For extra confidence, the permutation test shuffles the data to estimate how often a correlation at least as strong as the observed one would occur by random chance.
Check persistence: Run the same analysis across different time windows (e.g., Q1 vs Q2). If the signal appears consistently, it is more likely genuine.
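
To make the permutation idea concrete, here is a minimal standard-library sketch. It shuffles one series freely, which ignores the autocorrelation typical of time series, and the platform's permutation test may work differently; the function names, the number of permutations, and the sample data are all assumptions made for the example.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Estimate how often shuffled data gives |r| at least as large as observed."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(pearson(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real pairing between x and y
        if abs(pearson(x, shuffled)) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_permutations + 1)


# Invented data: a strong linear relationship versus an essentially unrelated pair.
x = list(range(1, 11))
strong_y = [2 * v + 1 for v in x]
weak_y = [6, 3, 9, 1, 8, 4, 10, 2, 7, 5]

print("strong relationship p:", permutation_p_value(x, strong_y))
print("weak relationship p:  ", permutation_p_value(x, weak_y))
```

A small p-value means shuffled data rarely matches the observed correlation; a large one means the observed value is easy to get by chance.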

Absence of Evidence Is Not Evidence of Absence

A null result for BTC does not mean sentiment analysis is useless. Different assets respond differently to news sentiment. An asset with fewer analysts and less efficient pricing (e.g., a mid-cap altcoin) may show stronger sentiment-price relationships than BTC, which is the most analyzed and efficiently priced crypto asset. Test each asset independently.

Why This Matters

Knowing when to trust a result is just as important as knowing how to read one. These red flags will save you from overconfidence in weak evidence and underconfidence in strong evidence. With result interpretation mastered, move on to Module 11: Experiments & Strategies to learn how to save, document, and share your research.
