The use of Null Hypothesis Significance Testing (NHST) as a statistical tool in psychological research has become a center of debate because of perceived errors associated with misuse of the method and erroneous conclusions in some published scholarly journals. Several journal publishers, particularly those specializing in social psychology, have suggested banning NHST, and Basic and Applied Social Psychology (BASP) acted on this call. Proponents of NHST, on the other hand, argue that the problem lies not with the technique but with the researchers employing it. This discussion therefore articulates the controversy involving the use of NHST. The objective is to explore the controversies around NHST and illuminate both sides of the issue; to that end, several works of literature are examined to recover evidence about this contested research method and to articulate the findings.
In 2015, BASP announced that its editors would no longer accept papers that employ NHST, singling out papers that report P values (Woolston, 2015). The problem BASP identified is that the method is notoriously exploited by low-quality research. This assertion created a heated debate among scholars because, according to the BASP editors, the p < .05 mark is too easy to achieve even in low-quality research with weak data (Woolston, 2015). P values are widely used in scientific research, particularly in sociology and psychology, to test a null hypothesis and establish correlations between variables. Conventionally, a P value below .05 is interpreted as valid evidence that the null hypothesis is false (LeMire, 2010). However, scholarly journal publishers perceive a problem because the .05 threshold is sometimes slippery: a significant result tends to vanish when the research is replicated. Even the American Psychological Association (APA) has asserted that the validity of NHST is questionable. The APA has not enforced a total ban on P values, but rather settled on requiring an effect size estimate whenever a P value is reported (Gliner et al., 2002).
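A short simulation makes the "slippery threshold" concern concrete. The parameter values below (a true effect of half a standard deviation, twenty participants per group) are assumed purely for illustration and are not drawn from any of the cited studies; the point is that a single study can cross p < .05 by luck while most replications of the same design do not.

```python
# Simulate repeated replications of a small two-group study with a real
# but modest effect, and count how often each replication reaches p < .05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, d, reps = 20, 0.5, 1000   # illustrative sample size, effect size, replications

significant = 0
for _ in range(reps):
    control = rng.normal(0.0, 1.0, n)   # scores under the null-group mean
    treatment = rng.normal(d, 1.0, n)   # scores shifted by d standard deviations
    _, p = stats.ttest_ind(treatment, control)
    if p < 0.05:
        significant += 1

replication_rate = significant / reps
print(f"Share of replications reaching p < .05: {replication_rate:.2f}")
```

Even though the effect is genuinely present, only a minority of replications clear the .05 bar at this sample size, which is how an initially "significant" finding can fail to reappear on replication.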
Furthermore, the APA’s position on the NHST controversy holds that the statistical method cannot be eliminated entirely from the hypothesis-testing options, because rigorous research still requires NHST in suitable contexts (Balluerka et al., 2005). Psychology researchers have supported this argument, asserting that although P values are not indicators of good research and at times lead to false positive results, the use of inferential statistics remains paramount in doing science (Woolston, 2015). Amid the debate, the problematic nature of NHST can be traced to Cohen’s (1994) description of the P value as a permanent illusion. The illusion stems from confusion about what rejecting the null hypothesis actually establishes. Cohen’s example runs: if a person is a Martian, then he is not a member of Congress; this person is a member of Congress; therefore, he is not a Martian (Cohen, 1994, p. 998). Stated deterministically, the syllogism is valid; but NHST reasons with probabilities, and when the premises are merely probable the analogous argument no longer carries the same logical force, which is the pretense behind the perceived controversy with inferential statistics.
If this is the case, then the controversy surrounding NHST is not about whether the null hypothesis states its reasoning probabilistically, but rather about how the researcher articulates, justifies, and interprets the data, for instance by placing confidence limits on effect sizes, as Cohen (1995) implied in his follow-up article. This leads to another controversy in NHST: the perceived error in statistical significance testing. Opponents of NHST argue that P value-based research tends to report the statistical significance of one variable while failing to recognize the statistical significance of others (Cumming, 2012). This is also the reasoning BASP upheld in banning NHST from its published articles. A P value may sound significant if it falls below the .05 criterion, yet it disregards the significance of control variables whose probabilities could also be meaningful (Trafimow & Marks, 2015). Moreover, statistical methods and guidelines in psychology research strongly suggest that a hypothesis test should not reduce to a simple accept-reject decision based on P values; a better approach is to provide an effect size estimate or a confidence interval (Wilkinson, 1999).
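The reporting practice recommended above can be sketched briefly. The following computes Cohen's d and a 95% confidence interval for a mean difference alongside the raw group comparison; the two small groups of scores are invented for illustration only.

```python
# Report an effect size (Cohen's d) and a 95% CI for the mean difference,
# rather than a bare accept/reject verdict. Scores are invented examples.
import math
from scipy import stats

group_a = [5, 6, 7, 8, 9]
group_b = [1, 2, 3, 4, 5]
na, nb = len(group_a), len(group_b)

mean_a = sum(group_a) / na
mean_b = sum(group_b) / nb
var_a = sum((x - mean_a) ** 2 for x in group_a) / (na - 1)
var_b = sum((x - mean_b) ** 2 for x in group_b) / (nb - 1)

# Pooled standard deviation across both groups.
sd_pool = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))

diff = mean_a - mean_b
cohens_d = diff / sd_pool                   # standardized magnitude of the effect
se = sd_pool * math.sqrt(1 / na + 1 / nb)   # standard error of the difference
t_crit = stats.t.ppf(0.975, na + nb - 2)    # two-sided 95% critical value
ci = (diff - t_crit * se, diff + t_crit * se)

print(f"d = {cohens_d:.2f}, 95% CI for the difference: ({ci[0]:.2f}, {ci[1]:.2f})")
```

Unlike a lone P value, the interval conveys both the direction and the plausible range of the effect, which is precisely the information Wilkinson (1999) argues an accept-reject decision throws away.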
The problems arising from criticism of NHST can be attributed to the fact that the null hypothesis is nearly always expected to be false; providing results based on the P value without testing the probability of other variables therefore yields biased results (Nickerson, 2000). Another criticism concerns the implausibility of the null hypothesis reflected by the .05 criterion, which presents a problem of hypothesis comparison: such comparison should be conditional on the full data set, not on P values alone (Levine et al., 2008). For opponents of NHST, the controversy over the validity of statistical significance also stems from its confusion with scientific significance, which holds that the size of an effect is more relevant than merely determining whether an effect exists (Bastian, 2013). This assertion encompasses a common misuse of NHST in which a high P value is regarded as proof of the null hypothesis (Fidler et al., 2006).
According to one critic of NHST, as many as 74% of studies with P values near 0.05 may actually be wrong (Cook, 2008), which raises the more troubling side of the controversy: the implications of wrong inferential findings for practical application. Cohen (2009), by contrast, contends that outright misuse of NHST is relatively rare, and that an effect size estimate should be used to justify the results. A consequence of omitting the effect size estimate is the generalization of inferential results despite limited sample size. The rationale is that a small sample cannot represent the larger population; hence the practical application of results drawn from a limited sample carries considerable risk. For example, if a significance test at p = 0.05 in a small sample suggested that jumping off a cliff resolves depression, one would have to ask whether a subsequent increase in cliff jumping-related deaths among people diagnosed with depression followed from taking that result at face value.
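Figures like the 74% estimate follow from base-rate arithmetic rather than from any defect in a single test. The sketch below computes the share of "significant" findings that are false positives; the base rate of true hypotheses and the statistical power are assumed values chosen for illustration, not estimates from the cited sources.

```python
# Fraction of p < .05 findings that are false positives, given an assumed
# base rate of true effects and an assumed power. The inputs are
# illustrative; the point is the structure of the arithmetic.
alpha = 0.05        # false positive rate when the null is true
power = 0.35        # assumed chance of detecting a real effect (small samples)
base_rate = 0.10    # assumed fraction of tested hypotheses that are true

true_positives = base_rate * power
false_positives = (1 - base_rate) * alpha
false_discovery_rate = false_positives / (true_positives + false_positives)

print(f"Fraction of significant results that are wrong: {false_discovery_rate:.2f}")
```

Under these assumptions more than half of all significant results are false, even though every individual test was run correctly at the .05 level; the trouble lies in how many tested hypotheses were false to begin with.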
The provided example is just one of the risks associated with the practical application of research findings derived from P values of 0.05. Despite these risks, proponents of NHST believe the method should not be thrown completely into oblivion, because its concept can be realigned using interval estimation (Lecoutre et al., 2005). Although P values cannot express the magnitude of an effect, they still provide a straightforward route to an interval estimate when combined with descriptive statistics (Lecoutre et al., 2005). Furthermore, proponents suggest that the methodology is outdated but not necessarily useless; reasonable modifications to the method, including to the interpretation of its outcomes, would complement the need for more accurate methodology in modern science (Robinson & Wainer, 2001).
Considering the attacks on and defenses of NHST amidst the controversies, Hume's problem of inductive inference can best frame the polarized partisan arguments surrounding NHST (Krueger, 2001). Hume's observation suggests that induction from sample observations cannot yield certain knowledge about a population's characteristics, regardless of sample size. Knowledge must therefore include reliable predictions rather than rest on the assumption that the problem of induction has been solved (Krueger, 2001). Within the controversy, NHST has been defended on the ground that it establishes that "there is not nothing," an inference in the spirit of modus tollens, a probabilistic proof by contradiction (Harlow, 2010). Most importantly, NHST provides a logical framework that serves as an accepted statistical convention in fields such as cognitive science, and much of the progress in research is believed to have been achieved by testing hypotheses within it (Schneider, 2010).
It is not always clear from the proponents of NHST whether the method is as reliable as, or more reliable than, the alternatives, but many would agree that options should be considered in case NHST outcomes prove useless in scientific research. In fact, attacks on NHST began as early as the 1930s, when the method was still establishing its roots in psychology; even so, NHST remains a popular statistical method in psychology research today. In addition, proponents of NHST seek alternatives that reinforce the relevance of the statistical method in scientific research. For example, Bayesian alternatives were developed to address the perceived inadequacies of the NHST approach, with mathematical derivations designed to establish the consistency of outcomes when a research design is replicated, one of the flaws most often pointed out in significance testing (Nathoo & Masson, 2015).
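One widely used bridge between NHST-style designs and Bayesian model comparison is the BIC approximation to the Bayes factor, the line of work Nathoo and Masson (2015) build on. The sketch below applies it to a two-group comparison with invented scores, pitting a null model (one common mean) against an alternative (separate group means); it is a simplified illustration of the idea, not their specific repeated-measures derivation.

```python
# BIC approximation to the Bayes factor for a two-group design:
# compare a null model (one common mean) against an alternative model
# (separate group means). Scores are invented for illustration.
import math

group_a = [5, 6, 7, 8, 9]
group_b = [1, 2, 3, 4, 5]
scores = group_a + group_b
n = len(scores)

grand_mean = sum(scores) / n
mean_a = sum(group_a) / len(group_a)
mean_b = sum(group_b) / len(group_b)

# Residual sum of squares under each model.
sse_null = sum((x - grand_mean) ** 2 for x in scores)        # one mean
sse_alt = (sum((x - mean_a) ** 2 for x in group_a)
           + sum((x - mean_b) ** 2 for x in group_b))        # two means

# BIC = n * ln(SSE / n) + k * ln(n), with k free mean parameters.
bic_null = n * math.log(sse_null / n) + 1 * math.log(n)
bic_alt = n * math.log(sse_alt / n) + 2 * math.log(n)

# BF01 < 1 means the data favor the alternative (an effect) over the null.
bf01 = math.exp((bic_alt - bic_null) / 2)
print(f"BF01 = {bf01:.3f} (odds favoring the effect: {1 / bf01:.0f} to 1)")
```

Unlike a P value, the Bayes factor expresses graded evidence for either model and behaves coherently as the same design is replicated and the data accumulate, which is the property the Bayesian alternatives were introduced to supply.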
The competing arguments over the validity of NHST outcomes may leave the field of psychology research wondering whether a compromise is possible: interpreting statistical results in a more succinct and realistic manner instead of simply invoking p < 0.05 to demonstrate probability. Reporting P values cannot by itself represent significance in general terms, especially when the same result vanishes upon replication. Proponents of NHST appear to continue advocating for the prominence of the statistical method and encourage researchers to explore hypotheses that can be readily tested within an NHST design. Opponents, in contrast, insist on the triviality of significance-testing results, on the grounds that the model relies on probability, in other terms chance, whereas scientific research should produce results that are concrete, replicable, valid, and important in both general terms and practical applications.
Balluerka, N., Gómez, J., & Hidalgo, D. (2005). The Controversy over Null Hypothesis Significance Testing Revisited. Methodology: European Journal Of Research Methods For The Behavioral And Social Sciences, 1(2), 55-70. Retrieved from http://psycnet.apa.org/index.cfm?fa=buy.optionToBuy&id=2005-10195-001
Bastian, H. (2013). Statistical significance and its part in science downfalls. Scientific American Blog Network. Retrieved 26 May 2016, from http://blogs.scientificamerican.com/absolutely-maybe/statistical-significance-and-its-part-in-science-downfalls/
Cohen, B. (2009). When the Use of p Values Actually Makes Some Sense (pp. 1-8). Toronto: American Psychological Association. Retrieved from http://www.psych.nyu.edu/cohen/Paper_APA2009.pdf
Cohen, J. (1994). The Earth Is Round (p < .05). American Psychologist, 49(12), 997-1003.
Cohen, J. (1995). The Earth Is Round (p < .05): Rejoinder. American Psychologist, 50(12), 1103.
Cook, J. (2008). Most published research results are false. Singular Value Consulting. Retrieved from http://www.johndcook.com/blog/2008/02/07/most-published-research-results-are-false/
Cumming, G. (2012). The new statistics: What we need for evidence-based practice. Statistical Cognition Laboratory.
Fidler, F., Burgman, M., Cumming, G., Buttrose, R., & Thomason, N. (2006). Impact of Criticism of Null-Hypothesis Significance Testing on Statistical Reporting Practices in Conservation Biology. Conservation Biology, 20(5), 1539-1544. http://dx.doi.org/10.1111/j.1523-1739.2006.00525.x
Gliner, J., Leech, N., & Morgan, G. (2002). Problems With Null Hypothesis Significance Testing (NHST): What Do the Textbooks Say? The Journal Of Experimental Education, 71(1), 83–92. Retrieved from https://www.andrews.edu/~rbailey/Chapter%20two/7217331.pdf
Harlow, L. (2010). On Scientific Research: The Role of Statistical Modeling and Hypothesis Testing. Journal Of Modern Applied Statistical Methods, 9(2), 348-358. Retrieved from http://digitalcommons.wayne.edu/cgi/viewcontent.cgi?article=1389&context=jmasm
Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56(1), 16-26. http://dx.doi.org/10.1037//0003-066x.56.1.16
Lecoutre, B., Poitevineau, J., & Lecoutre, M. (2005). A reason why not to ban Null Hypothesis Significance Tests. Revue MODULAD, 33(1), 249-253. Retrieved from http://www.modulad.fr/archives/numero-33/notule-lecoutre-33/lecoutre-33-notule-uk.pdf
LeMire, S. (2010). An Argument Framework for the Application of Null Hypothesis Statistical Testing in Support of Research. Journal Of Statistics Education, 18(2), 1-23. Retrieved from http://www.amstat.org/publications/jse/v18n2/lemire.pdf
Levine, T., Weber, R., Hullett, C., Park, H., & Lindsey, L. (2008). A Critical Assessment of Null Hypothesis Significance Testing in Quantitative Communication Research. Human Communication Research, 34(2), 171-187. http://dx.doi.org/10.1111/j.1468-2958.2008.00317.x
Nathoo, F. & Masson, M. (2015). Bayesian alternatives to null-hypothesis significance testing for repeated-measures designs. Journal Of Mathematical Psychology. http://dx.doi.org/10.1016/j.jmp.2015.03.003
Nickerson, R. (2000). Null Hypothesis Significance Testing: A Review of an Old and Continuing Controversy. Psychological Methods, 5(2), 241-301. Retrieved from http://psych.colorado.edu/~willcutt/pdfs/nickerson_2000.pdf
Robinson, D. & Wainer, H. (2001). On the Past and Future of Null Hypothesis Significance Testing. Princeton, NJ: Educational Testing Service. Retrieved from http://www.ets.org/Media/Research/pdf/RR-01-24-Wainer.pdf
Schneider, J. (2010). Null hypothesis significance tests: A mix-up of two different theories – the basis for widespread confusion and numerous misinterpretations. Danish Centre For Studies In Research And Research Policy. Retrieved from http://arxiv.org/pdf/1402.1089.pdf
Trafimow, D. & Marks, M. (2015). Editorial. Basic And Applied Social Psychology, 37(1), 1-2. http://dx.doi.org/10.1080/01973533.2015.1012991
Wilkinson, L. (1999). Statistical Methods in Psychology Journals: Guidelines and Explanations. American Psychologist, 54(8), 594–604.
Woolston, C. (2015). Psychology journal bans P values. Nature, 519(7541), 9. http://dx.doi.org/10.1038/519009f