Abstract

An important problem exists in the interpretation of modern medical research data: Biological understanding and previous research play little formal role in the interpretation of quantitative results. This phenomenon is manifest in the discussion sections of research articles and ultimately can affect the reliability of conclusions. The standard statistical approach has created this situation by promoting the illusion that conclusions can be produced with certain “error rates,” without consideration of information from outside the experiment. This statistical approach, the key components of which are P values and hypothesis tests, is widely perceived as a mathematically coherent approach to inference. There is little appreciation in the medical community that the methodology is an amalgam of incompatible elements, whose utility for scientific inference has been the subject of intense debate among statisticians for almost 70 years. This article introduces some of the key elements of that debate and traces the appeal and adverse impact of this methodology to the P value fallacy, the mistaken idea that a single number can capture both the long-run outcomes of an experiment and the evidential meaning of a single result. This argument is made as a prelude to the suggestion that another measure of evidence should be used—the Bayes factor, which properly separates issues of long-run behavior from evidential strength and allows the integration of background knowledge with statistical findings.
