An important problem exists in the interpretation of modern medical research data: Biological understanding and previous research play little formal role in the interpretation of quantitative results. This phenomenon is manifest in the discussion sections of research articles and ultimately can affect the reliability of conclusions. The standard statistical approach has created this situation by promoting the illusion that conclusions can be produced with certain “error rates,” without consideration of information from outside the experiment. This statistical approach, the key components of which are P values and hypothesis tests, is widely perceived as a mathematically coherent approach to inference. There is little appreciation in the medical community that the methodology is an amalgam of incompatible elements, whose utility for scientific inference has been the subject of intense debate among statisticians for almost 70 years. This article introduces some of the key elements of that debate and traces the appeal and adverse impact of this methodology to the P value fallacy, the mistaken idea that a single number can capture both the long-run outcomes of an experiment and the evidential meaning of a single result. This argument is made as a prelude to the suggestion that another measure of evidence should be used—the Bayes factor, which properly separates issues of long-run behavior from evidential strength and allows the integration of background knowledge with statistical findings.
Author, Article, and Disclosure Information
From Johns Hopkins University School of Medicine, Baltimore, Maryland.
Corresponding Author: Steven Goodman, MD, PhD, Johns Hopkins University, 550 North Broadway, Suite 409, Baltimore, MD 21205; e-mail, [email protected]
