Hypothesis Testing and Null Hypotheses

When evaluating new algorithms or processing experimental data, we must decide if observed differences (e.g., an increase in accuracy) are statistically meaningful or merely the result of random chance. Hypothesis testing is a formal statistical framework that allows us to make these decisions with controlled probability. By formulating a null hypothesis of 'no effect' and attempting to reject it based on empirical evidence, data scientists can scientifically validate model performance improvements and experimental findings.

We define statistical significance boundaries before gathering data to prevent cognitive bias. Furthermore, we must balance false alarms against missed opportunities. We will explore test statistics, significance levels, power analysis, parametric vs. non-parametric tests, and validation metrics for comparing neural network architectures.


The Core Framework of Testing

Hypothesis testing determines whether observed patterns in sample data reflect real effects in the population or only random sampling noise.

The process is structured: we set up two opposing hypotheses, compute a test statistic, and check whether it falls in a region that would be highly improbable if no effect existed.

Null ($H_0$) and Alternative ($H_1$) Hypotheses

The testing framework begins by defining two mutually exclusive hypotheses. The Null Hypothesis ($H_0$) is the conservative baseline statement that there is no effect, no difference, or no association in the population. The Alternative Hypothesis ($H_1$) is the claim we hope to support: that a real effect exists. Under the classical frequentist approach, we assume $H_0$ is true and evaluate whether our sample data contains sufficient evidence to reject it. For example, when comparing a new neural network architecture to an old one, the null hypothesis is that the two models have the same expected accuracy ($H_0: \mu_1 = \mu_2$), while the alternative is that their expected accuracies differ ($H_1: \mu_1 \neq \mu_2$).

Test Statistics and Rejection Regions

A test statistic is a numerical value calculated from the sample (e.g., a $z$, $t$, $\chi^2$, or $F$ statistic) that measures how far the sample deviates from what is expected under $H_0$. The rejection region (or critical region) is the range of test statistic values that are highly improbable under $H_0$. If our calculated test statistic falls in this region, we reject the null hypothesis in favor of the alternative hypothesis; otherwise, we fail to reject the null hypothesis. We also specify whether the test is one-tailed (directional alternative hypothesis, e.g., $H_1: \mu_1 > \mu_2$) or two-tailed (non-directional alternative hypothesis, e.g., $H_1: \mu_1 \neq \mu_2$), because this choice determines where the rejection boundaries lie.
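As a rough illustration of rejection regions, the sketch below computes critical values for a $z$-test using only Python's standard library. The function names `z_critical_values` and `reject_null` are hypothetical helpers introduced here, and the example assumes a standard normal test statistic with an upper-tailed alternative in the one-tailed case.

```python
from statistics import NormalDist

def z_critical_values(alpha=0.05):
    """Return (two-tailed, one-tailed) critical values for a z-test."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    two_tailed = std_normal.inv_cdf(1 - alpha / 2)  # alpha split across both tails
    one_tailed = std_normal.inv_cdf(1 - alpha)      # all of alpha in the upper tail
    return two_tailed, one_tailed

def reject_null(z, alpha=0.05, two_tailed=True):
    """Decide whether z falls in the rejection region (upper tail if one-tailed)."""
    crit_two, crit_one = z_critical_values(alpha)
    if two_tailed:
        return abs(z) > crit_two
    return z > crit_one

two_tailed_crit, one_tailed_crit = z_critical_values(0.05)
print(round(two_tailed_crit, 3), round(one_tailed_crit, 3))
```

Note that a statistic such as $z = 1.8$ lies inside the one-tailed rejection region but outside the two-tailed one, which is why the direction of the test should be fixed before looking at the data.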

Type I and Type II Errors

Because hypothesis testing relies on random samples, there is always a probability of making an incorrect decision. We must balance two distinct types of errors.

For a fixed sample size, reducing the probability of one error type increases the probability of the other. Managing this trade-off is crucial for designing experiments with sufficient statistical power.

Error Definitions and Significance Level ($\alpha$)

A Type I Error (False Positive) occurs when we reject a true null hypothesis, concluding an effect exists when it does not (e.g., claiming a new model is better when it is actually identical). The maximum probability of committing a Type I error is controlled by the researcher and is called the significance level ($\alpha$), typically set to $0.05$ (or $5\%$). A Type II Error (False Negative) occurs when we fail to reject a false null hypothesis, missing a real effect (e.g., claiming a new model is no better when it is actually superior). The probability of a Type II error is denoted by $\beta$.
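One way to see what $\alpha$ controls is to simulate many experiments in which the null hypothesis is true and count how often it is rejected anyway. The sketch below is a minimal simulation under assumed conditions: samples are drawn from a standard normal population with known $\sigma = 1$, a two-tailed $z$-test is applied, and the function name, sample size, trial count, and seed are all illustrative choices.

```python
import random
from statistics import NormalDist, mean

def type_i_error_rate(alpha=0.05, n=30, trials=2000, seed=0):
    """Estimate how often a two-tailed z-test rejects a TRUE null (mu = 0)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(trials):
        # The null hypothesis holds: the population mean really is 0.
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        z = mean(sample) * n ** 0.5  # z = (x_bar - 0) / (sigma / sqrt(n)), sigma = 1
        if abs(z) > z_crit:
            rejections += 1  # a false positive
    return rejections / trials

rate = type_i_error_rate()
print(f"Estimated Type I error rate: {rate:.3f}")
```

The estimated rate should land close to the chosen $\alpha$, up to simulation noise.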

Statistical Power ($1 - \beta$)

Statistical Power is the probability of correctly rejecting the null hypothesis when the alternative is true ($1 - \beta$). It represents the test's ability to detect a real effect. Power is influenced by: 1) Sample size $n$ (larger samples increase power), 2) Effect size (larger effects are easier to detect), and 3) Significance level $\alpha$ (setting a smaller $\alpha$ reduces Type I error but increases Type II error, reducing power). In ML experiments, we conduct a power analysis prior to testing to determine the sample size required to achieve a target power, conventionally $80\%$. If the sample size is too small, the power will be low, and we risk discarding a genuinely superior model because the experiment could not reliably detect the improvement.
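A power analysis can be sketched with the common normal-approximation formula for a two-sided, two-sample comparison of means, $n \approx 2\left(\frac{z_{1-\alpha/2} + z_{1-\beta}}{d}\right)^2$ per group, where $d$ is the standardized effect size (difference in means divided by the common standard deviation). This formula and the function name `sample_size_per_group` are not from the text above; they are one standard approximation, and exact $t$-based calculations give slightly larger values.

```python
import math
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample test of means.

    effect_size is the standardized difference d = |mu_1 - mu_2| / sigma.
    Uses the normal approximation, so it slightly underestimates the
    sample size an exact t-based power analysis would give.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_power = z.inv_cdf(power)          # quantile matching the target power
    n = 2 * ((z_alpha + z_power) / effect_size) ** 2
    return math.ceil(n)

n_medium = sample_size_per_group(0.5)  # a "medium" standardized effect
n_small = sample_size_per_group(0.2)   # a "small" standardized effect
print(n_medium, n_small)
```

Because the required sample size scales with $1/d^2$, halving the effect size roughly quadruples the number of observations needed.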

Parametric vs. Non-Parametric Tests

Statistical tests are categorized based on the assumptions they make about the probability distributions of the underlying population.

Choosing the correct test type ensures that the computed probabilities are accurate and that the decisions derived from them are valid.

Parametric Tests: t-Test, Welch's t-Test, and ANOVA

Parametric tests assume that the data follow a specific probability distribution, usually a normal distribution. Student's t-test compares the means of two groups under the assumption that both groups have the same variance. If the variances are not equal, we use Welch's t-test, which drops that assumption. Its test statistic is:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

Welch's t-test calculates the degrees of freedom using the Welch-Satterthwaite approximation:

$$\text{df} \approx \frac{\left(\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}\right)^2}{\frac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \frac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}$$
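The two formulas above can be sketched directly in Python. The function name `welch_t_test` and the sample data are illustrative; the code returns only the statistic and the approximate degrees of freedom, leaving the p-value lookup to a $t$-distribution table or a statistics library.

```python
from statistics import mean, variance

def welch_t_test(x1, x2):
    """Return Welch's t statistic and the approximate degrees of freedom."""
    n1, n2 = len(x1), len(x2)
    v1, v2 = variance(x1), variance(x2)  # sample variances s_1^2 and s_2^2
    se1, se2 = v1 / n1, v2 / n2          # the per-group terms s_i^2 / n_i
    t = (mean(x1) - mean(x2)) / (se1 + se2) ** 0.5
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    return t, df

# Hypothetical scores for two groups with visibly different spreads.
group_a = [1, 2, 3, 4, 5]
group_b = [2, 4, 6, 8, 10]
t_stat, dof = welch_t_test(group_a, group_b)
print(f"t = {t_stat:.4f}, df = {dof:.4f}")
```

The degrees of freedom are generally not an integer, and here they fall well below the pooled value $n_1 + n_2 - 2 = 8$ because the two variances differ.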

When comparing the means of three or more groups, we use Analysis of Variance (ANOVA), which partitions total variance into variance between groups and variance within groups. Under the null hypothesis, the ratio of between-group variance to within-group variance follows an $F$-distribution.
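The variance partition behind a one-way ANOVA can be sketched as follows. The function name `one_way_anova_f` and the toy groups are assumptions made for illustration; the code computes only the $F$ statistic and its two degrees of freedom.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    all_values = [x for g in groups for x in g]
    grand_mean = mean(all_values)
    k, n_total = len(groups), len(all_values)

    # Between-group sum of squares: group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    df_between, df_within = k - 1, n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Hypothetical measurements for three groups.
f_stat, df_b, df_w = one_way_anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large $F$ means the group means are spread out much more than the noise within each group would explain.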

Non-Parametric Tests: Mann-Whitney U, Wilcoxon, and Chi-Squared

Non-parametric tests do not assume a specific population distribution, which makes them robust to departures from normality. The Mann-Whitney U test (for independent samples) and the Wilcoxon Signed-Rank test (for paired samples) convert raw values into ranks and compare the resulting rank distributions. For the Mann-Whitney U, the test statistic is computed by ranking all observations across both groups and calculating:

$$U_1 = R_1 - \frac{n_1(n_1 + 1)}{2}$$

where $R_1$ is the sum of ranks for group 1. When analyzing categorical count data, we use the Chi-Squared Test of Independence to check for associations between variables. The test statistic is:

$$\chi^2 = \sum \frac{(O_i - E_i)^2}{E_i}$$

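where $O_i$ is the observed count in each cell of a contingency table, and $E_i$ is the expected count under the null hypothesis of independence. Rank-based tests are well suited to the highly skewed data common in web logs and user behavior, while the Chi-Squared approximation is usually considered reliable only when the expected counts are not too small (a common rule of thumb is at least 5 per cell).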
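Both statistics in this subsection can be sketched in a few lines of plain Python. The helper names `average_ranks`, `mann_whitney_u1`, and `chi_squared_independence` are hypothetical. Tied values receive the average of the ranks they span, which is a common convention that the text above does not specify, and expected counts are computed as row total times column total divided by the grand total.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u1(x, y):
    """U_1 = R_1 - n_1 (n_1 + 1) / 2, with R_1 the rank sum of group x."""
    ranks = average_ranks(list(x) + list(y))
    n1 = len(x)
    r1 = sum(ranks[:n1])
    return r1 - n1 * (n1 + 1) / 2

def chi_squared_independence(table):
    """Return (chi2, degrees of freedom) for a contingency table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, dof

u1 = mann_whitney_u1([1, 2, 3], [4, 5, 6])  # group 1 entirely below group 2
chi2, dof = chi_squared_independence([[10, 20], [20, 10]])
print(u1, round(chi2, 4), dof)
```

A value of $U_1 = 0$ means every observation in group 1 ranks below every observation in group 2, the most extreme separation possible.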

Model Comparison and Validation in AI

In machine learning, we use hypothesis testing to scientifically validate that a new model architecture performs better than a baseline model.

We must use tests that account for the dependent and overlapping nature of cross-validation splits to avoid inflating the false positive rate.

Comparing Classifiers: McNemar's Test

To compare two classification models evaluated on the same test dataset, a simple comparison of accuracies is insufficient because the predictions are paired and dependent. McNemar's Test is a non-parametric test that focuses on the discordant predictions: cases where Model 1 is correct and Model 2 is incorrect ($b$), and cases where Model 1 is incorrect and Model 2 is correct ($c$). The test statistic is:

$$\chi^2 = \frac{(b - c)^2}{b + c}$$

Under the null hypothesis that both models have equal accuracy, this statistic approximately follows a Chi-squared distribution with 1 degree of freedom. This test helps check whether a newly trained neural network's improvement could plausibly be a random fluke. For example, if Model 1 and Model 2 disagree on 100 cases, where Model 1 is correct on 70 and Model 2 is correct on 30, $\chi^2 = \frac{(70 - 30)^2}{70 + 30} = 16$. The critical value at $\alpha = 0.05$ is $3.84$. Since $16 > 3.84$, we reject the null hypothesis and conclude that Model 1 is significantly more accurate.
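The worked example can be reproduced with a short sketch. The function names `discordant_counts` and `mcnemar_test` are illustrative, and the p-value relies on the fact that a Chi-squared variable with 1 degree of freedom is the square of a standard normal variable, so its upper tail can be written with the complementary error function.

```python
import math

def discordant_counts(y_true, pred_1, pred_2):
    """Count b (only Model 1 correct) and c (only Model 2 correct)."""
    b = sum(1 for t, p1, p2 in zip(y_true, pred_1, pred_2) if p1 == t and p2 != t)
    c = sum(1 for t, p1, p2 in zip(y_true, pred_1, pred_2) if p1 != t and p2 == t)
    return b, c

def mcnemar_test(b, c):
    """Return (chi2, p_value) for McNemar's test without continuity correction."""
    chi2 = (b - c) ** 2 / (b + c)
    # For 1 degree of freedom: P(X > chi2) = P(|Z| > sqrt(chi2)) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

chi2, p_value = mcnemar_test(70, 30)  # the counts from the example above
print(f"chi2 = {chi2:.1f}, p = {p_value:.2e}")
```

When the number of discordant pairs $b + c$ is small, a continuity-corrected or exact binomial version of the test is often preferred over this approximation.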

Corrected Resampled t-Test

When comparing models using $K$-fold cross-validation, the performance scores across folds are not independent because the training sets overlap (each sample is used in $K-1$ training sets). This violates the independence assumption of the standard paired t-test, leading to an inflated Type I error rate. To resolve this, we use the corrected resampled t-test, which adjusts the variance estimator to account for the training overlap, ensuring a valid comparison of cross-validated models:

$$t = \frac{\bar{d}}{\sqrt{\left( \frac{1}{K} + \frac{n_{\text{test}}}{n_{\text{train}}} \right) s^2}}$$

where $\bar{d}$ is the average difference in accuracy across folds, $s^2$ is the sample variance of the differences, $n_{\text{test}}$ is the test set size, and $n_{\text{train}}$ is the training set size. The statistic is compared against a $t$-distribution with $K - 1$ degrees of freedom. Adding the term $\frac{n_{\text{test}}}{n_{\text{train}}}$ to the usual $\frac{1}{K}$ enlarges the standard error, which guards the model comparison pipeline against claiming false statistical significance.
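To make the size of the correction concrete, the sketch below computes the corrected statistic next to the uncorrected paired $t$ statistic for the same fold differences. The function names and the per-fold differences are made up for illustration, and the example assumes 5-fold cross-validation on 100 samples, so that $n_{\text{test}} = 20$ and $n_{\text{train}} = 80$.

```python
from statistics import mean, variance

def corrected_resampled_t(diffs, n_train, n_test):
    """Corrected resampled t statistic for per-fold score differences."""
    k = len(diffs)
    s2 = variance(diffs)  # sample variance of the differences
    return mean(diffs) / ((1 / k + n_test / n_train) * s2) ** 0.5

def naive_paired_t(diffs):
    """Standard paired t statistic, which ignores training-set overlap."""
    k = len(diffs)
    return mean(diffs) / (variance(diffs) / k) ** 0.5

# Hypothetical accuracy differences (new model minus baseline) over 5 folds.
fold_diffs = [0.02, 0.03, 0.01, 0.04, 0.02]
t_corrected = corrected_resampled_t(fold_diffs, n_train=80, n_test=20)
t_naive = naive_paired_t(fold_diffs)
print(f"corrected t = {t_corrected:.3f}, naive t = {t_naive:.3f}")
```

The corrected statistic is always smaller in magnitude than the naive one, so a difference that looks significant under the standard paired test may no longer clear the critical value after the correction.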