This lesson is being piloted (Beta version)

Multiple testing, summary, and final exercises

Overview

Teaching: 10 min
Exercises: 45 min
Questions
  • How do we interpret p-values when mutiple comparisons are carried out?

Objectives
  • Recognise the impact of multiple testing on the meaning of p-values

  • Adjust p-values to compensate for the effects of multiple testing

Multiple testing

In ANOVA, if H0 is rejected, we carry out a post hoc test to identify which group(s) are significantly different from the other ones. If there are three groups (A, B and C), that means we are performing 3 comparisons:

  1. A vs B
  2. A vs C
  3. B vs C

Each time we perform a t-test, we chose a significance threshold under which we reject the null hypothesis. A significance level of 0.05 for example means a 5% chance of making a type I error - incorrectly rejecting H0. When performing multiple tests, that probability of error applies for each individual comparison, so the more tests performed, the higher likelihood of a type I error.

In our example three-way comparison with signficance threshold of 0.05:

So rather than a 5% chance of a type I error, we instead have a much higher 14.3% chance, which is clearly not acceptable.

Challenge 1

Memory test! A type I error is an incorrect rejection of the null hypothesis. Can you remember what a type II error is?

Solution to Challenge 1

A type II error is a failure to reject the null hypothesis when it is false

Correction for multiple testing

To compensate for the effects of multiple testing on expected error rate, we need to adjust the p-value according to the number of hypothesis tests performed.

Examples of correction approaches include:

In R, the p.adjust function can be used to perform multiple-testing correction on a vector of p-values.

# Create a vector of multiple p-values to be tested
# Note, most of these *individually* meet the significance threshold of <0.05
pval <- c(0.02, 0.01, 0.05, 0.04, 0.1, 0.6, 0.5, 0.03, 0.005, 0.0005)
# False Discovery Rate correction - only three tests remain signficant
# And round to 3 decimal places for neater display
round(p.adjust(pval, method = 'fdr'),3)
##  [1] 0.050 0.033 0.071 0.067 0.125 0.600 0.556 0.060 0.025 0.005
# Bonferroni correction - now only one remains significant
# For some reason, bonferroni defaults to 3 decimal places anyway
p.adjust(pval, method = 'bonferroni')
##  [1] 0.200 0.100 0.500 0.400 1.000 1.000 1.000 0.300 0.050 0.005

Summary of relevant statistical tests

This table can be used as a reference to select the appropriate test to use from a given experimental design

Analysis required (continuous data) Normally distributed data Non-normally distributed data
Compare mean or median of one sample group against a known value One sample t-test Wilcoxon Rank Sum test
Compare means or medians of two sample groups (unpaired data) Unpaired t-test Mann-Whitney test
Compare means or medians of two sample groups (paired data) Paired t-test Wilcoxon Matched Pairs test
Compare means or medians of ≥ three sample groups (unpaired data) ANOVA Kruskal-Wallis ANOVA
Compare means or medians of ≥ three sample groups (paired data) Repeated measures ANOVA Friedman test
Analysis required (categorical data) Test used
Compare two sample proportions of categorical data (unpaired data, ≤5 in one group) Fisher exact test
Compare two sample proportions of categorical data (unpaired data, >5 in all groups) Chi-square test
Compare two sample proportions of categorical data (paired data) McNemar test

Summary of R functions

Final practice exercises

Challenge 2

Do patients with multiple gallstones experience more recurrences than those with single stones?

Solution to Challenge 2

The variables of interest are multiple stones (categorical, 2 groups) and recurrence (categorical, 2 groups). The appropriate test is therefore either Chi-square or Fisher’s Exact, depending on the expected count for each category

# Create the frequency table
table(gallstones$Rec,gallstones$Mult)

# Complete Cross-table
library(gmodels)
CrossTable(gallstones$Rec,gallstones$Mult,format="SPSS",prop.chisq=F,expected=T)

# Visualisation

library(ggplot2)
ggplot(gallstones, aes(Rec, fill=Mult)) +
  geom_bar(position="fill") +
  scale_y_continuous(labels=scales::percent) +
  theme(axis.text=element_text(size=18),
        legend.text=element_text(size=18),
        legend.title=element_text(size=18),
        axis.title=element_text(size=18),
        plot.title = element_text(size=22, face="bold"))+
  ggtitle("Recurrence vs. Mult")

# Chi-square test performed because more than 5 expected counts in each cell
chisq.test(gallstones$Rec,gallstones$Mult)

Conclusion

With a p-value of 0.1295, there is no evidence for a significant association between gallstone recurrence and the presence of multiple stones

Challenge 3

Is the patient’s age associated with gallstone recurrence?

Solution to Challenge 3

The variables of interest are gallstone recurrence (categorical, 2 groups) and age (continuous). The appropriate test is therefore either the t-test or Mann-Whitney test, depending on whether the data is normally distributed or not

# Test normality in both groups
shapiro.test(gallstones$Age[which(gallstones$Rec=="NoRecurrence")])
shapiro.test(gallstones$Age[which(gallstones$Rec=="Recurrence")])

# Other option
by(gallstones$Age,gallstones$Rec,shapiro.test)

# We have to perform a Mann-Whitney test
wilcox.test(gallstones$Age~gallstones$Rec)

# Boxplots to visualise
plot(gallstones$Age~gallstones$Rec,col=c("red","blue"),ylab="Age",
     xlab="Recurrence",cex.lab=1.3)

# Statistical descriptions
by(gallstones$Age,gallstones$Rec,median)
by(gallstones$Age,gallstones$Rec,IQR)

by(gallstones$Age,gallstones$Rec,mean)
by(gallstones$Age,gallstones$Rec,sd)

Conclusion

With a p-value of 0.7707, there is no evidence for a significant association between gallstone recurrence and the age of patients

Challenge 4

Does alcohol consumption influence the time to gallstone dissolution?

Solution to Challenge 4

The variables of interest are alcohol consumption (categorical, 3 groups) and dissolution time (continuous). The appropriate test is therefore either ANOVA or Kruskal-Wallis, depending on whether the data is normally distributed or not

# Test normality in each groups
by(gallstones$Dis,gallstones$Alcohol.Consumption,shapiro.test)

# If distribution normal in each group
result<-aov(gallstones$Dis~gallstones$Alcohol.Consumption)
summary(result)

# If distribution not normal
kruskal.test(gallstones$Dis~gallstones$Alcohol.Consumption)

# Visualisation
plot(gallstones$Dis~gallstones$Alcohol.Consumption,col=2:4)

Conclusion

With a p-value of 0.2389, there is no evidence for a significant association between alcohol consumption and dissolution time

Challenge 5

Is there any effect of treatment and/or gender on the time to dissolution?

Solution to Challenge 5

The variables of interest are gender (categorical, 2 groups), treatment (categorical, two groups) and dissolution time (continuous). The appropriate test is therefore a two-way ANOVA

# Visualisation
par(mfrow=c(1,2))
plot(Dis~Treatment+Gender, data=gallstones)

# Interaction plot to visualise
interaction.plot(gallstones$Treatment, gallstones$Gender, gallstones$Dis,
                 col=2:3,lwd=3,cex.axis=1.5,cex.lab=1.5)

# anova 2 way with aov function
result<-aov(Dis~Treatment+Gender+Treatment*Gender,data=gallstones)
summary(result)

# anova 2 way with lm function
result<-lm(Dis~Treatment+Gender+Treatment*Gender,data=gallstones)
anova(result)

# Checking the assumptions of anova

result<-lm(Dis~Treatment+Gender+Treatment*Gender,data=gallstones)
plot(result)

Conclusion

With a p-value of <0.0016, there is evidence for a significant effect of treatment on dissolution time. There is no evidence for a difference between genders for dissolution time, nor evidence for an interaction between treatment and gender (both males and females have a similar response).

Finishing up

It is always a good idea to conclude any R analysis with the sessionInfo() command to save with the output as a permanent record of software versions being used. This helps resolve issues where updated versions of tools may give different analysis outcomes.

sessionInfo()
## R version 4.0.5 (2021-03-31)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base
##
## other attached packages:
## [1] FSA_0.8.32              ggplot2_3.3.3           gmodels_2.18.1
## [4] knitr_1.31              requirements_0.0.0.9000 remotes_2.2.0
##
## loaded via a namespace (and not attached):
##  [1] Rcpp_1.0.6        plyr_1.8.6        compiler_4.0.5    pillar_1.5.1
##  [5] tools_4.0.5       digest_0.6.27     evaluate_0.14     lifecycle_1.0.0
##  [9] tibble_3.1.0      gtable_0.3.0      pkgconfig_2.0.3   rlang_0.4.10
## [13] DBI_1.1.1         yaml_2.2.1        xfun_0.22         withr_2.4.1
## [17] stringr_1.4.0     dplyr_1.0.5       generics_0.1.0    vctrs_0.3.7
## [21] gtools_3.8.2      grid_4.0.5        tidyselect_1.1.0  glue_1.4.2
## [25] R6_2.5.0          fansi_0.4.2       rmarkdown_2.7     gdata_2.18.0
## [29] purrr_0.3.4       magrittr_2.0.1    scales_1.1.1      htmltools_0.5.1.1
## [33] ellipsis_0.3.1    MASS_7.3-53.1     assertthat_0.2.1  colorspace_2.0-0
## [37] utf8_1.2.1        stringi_1.5.3     munsell_0.5.0     crayon_1.4.1

Key Points

  • Understand the impact of multiple testing on p-values

  • Use p.adjust to correct multiple-testing p-values