Data dredging (also known as data snooping or p-hacking)
Data dredging is the misuse of data analysis to find patterns in data that can be presented as statistically significant, thus dramatically increasing and understating the risk of false positives. It is done by performing many statistical tests on the data and reporting only those that come back with significant results.
The process of data dredging involves testing multiple hypotheses using a single data set via an exhaustive search, perhaps for combinations of variables that might show a correlation, or for groups of cases or observations that show differences in their means or in their breakdown by some other variable.
Conventional tests of statistical significance are based on the probability that a particular result would arise if chance alone were at work, and they necessarily accept some risk of a particular type of mistaken conclusion (a mistaken rejection of the null hypothesis). This level of risk is called the significance level. When large numbers of tests are performed, some produce false results of this kind; hence 5% of randomly chosen hypotheses may be (erroneously) reported as statistically significant at the 5% significance level, 1% may be (erroneously) reported as statistically significant at the 1% significance level, and so on, by chance alone. When enough hypotheses are tested, it is virtually certain that some will be reported as statistically significant even though this is misleading, since almost every data set with any degree of randomness is likely to contain (for example) some spurious correlations. If they are not cautious, researchers using data-mining techniques can be easily misled by such results. The term p-hacking (in reference to p-values) was coined in a 2014 paper by the three researchers behind the blog Data Colada, which has focused on uncovering such problems in social-science research.
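To make the arithmetic concrete, the short simulation below (a hypothetical sketch, not drawn from any of the studies cited here) runs many two-sample t-tests on pure noise and counts how many cross the 5% threshold by chance alone; roughly one in twenty do.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_tests, n_per_group, alpha = 1000, 30, 0.05

false_positives = 0
for _ in range(n_tests):
    # Both groups are drawn from the same distribution: the null hypothesis is true.
    a = rng.normal(size=n_per_group)
    b = rng.normal(size=n_per_group)
    _, p = stats.ttest_ind(a, b)
    if p < alpha:
        false_positives += 1

# Roughly 5% of the tests come out "significant" even though no effect exists.
print(f"{false_positives} of {n_tests} null tests significant at alpha={alpha}")
```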
Data dredging is an example of disregarding the multiple comparisons problem. One form is comparing subgroups without alerting the reader to the total number of subgroup comparisons examined.
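A standard remedy is to adjust the significance threshold for the number of comparisons actually performed. The sketch below uses made-up p-values from twelve hypothetical subgroup comparisons and applies a simple Bonferroni correction.

```python
import numpy as np

# Hypothetical p-values from 12 unplanned subgroup comparisons.
p_values = np.array([0.004, 0.03, 0.04, 0.08, 0.11, 0.19,
                     0.26, 0.33, 0.47, 0.58, 0.71, 0.90])
alpha = 0.05

# Bonferroni: each comparison is judged against alpha divided by the
# number of comparisons actually performed, not just those reported.
adjusted_alpha = alpha / len(p_values)
significant = p_values < adjusted_alpha

print(f"Unadjusted: {np.sum(p_values < alpha)} 'significant' subgroups")
print(f"Bonferroni-adjusted threshold {adjusted_alpha:.4f}: "
      f"{significant.sum()} significant subgroups")
```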
Data-dredging bias
A distortion that arises from presenting the results of unplanned statistical tests as if they were a fully prespecified course of analysis.
Background
Data-dredging bias is a general category which includes a number of misuses of statistical inference (e.g. fishing, p-hacking), each of which essentially involves probing the data in unplanned ways and finding and reporting an “attractive” result without accurately conveying the course of analysis. For example, in nearly any analysis of data there are several “researcher degrees of freedom”, i.e., choices that must be made in the process of analysis. Ideally, these choices are guided by the principles of best practice and prespecified in a publicly available protocol. In contrast, p-hacking occurs when an initial analysis produces results that are close to being statistically significant and, in the absence of a study protocol, researchers then make analytic choices (e.g. how to handle outliers, whether to combine groups, which covariates to include or exclude) that will produce a statistically significant p-value.[1]
While many different choices might be defensible, a canonical case of p-hacking would involve trying out multiple different options and reporting the result that yields the lowest p-value (particularly when the alternative choices do not yield a significant result). Such an analysis can often generate statistically significant results in the absence of a true effect (i.e. ‘false positives’) and is thus unreliable.
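A minimal sketch of this mechanism, assuming the only researcher degree of freedom is a set of hypothetical outlier cutoffs: each simulated data set contains no true effect, yet reporting whichever cutoff gives the smallest p-value pushes the false-positive rate above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sim, n, alpha = 2000, 40, 0.05

# Hypothetical "researcher degrees of freedom": several outlier-exclusion rules.
cutoffs = [np.inf, 3.0, 2.5, 2.0]

hacked_hits = 0
for _ in range(n_sim):
    a = rng.normal(size=n)
    b = rng.normal(size=n)  # no true difference between the groups
    p_values = []
    for c in cutoffs:
        # Re-run the test under each outlier cutoff...
        a_c, b_c = a[np.abs(a) < c], b[np.abs(b) < c]
        p_values.append(stats.ttest_ind(a_c, b_c).pvalue)
    # ...and keep only the most favourable result.
    if min(p_values) < alpha:
        hacked_hits += 1

# The realised false-positive rate exceeds the nominal 5%.
print(f"False-positive rate with a cherry-picked cutoff: {hacked_hits / n_sim:.3f}")
```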
Though some forms of data-dredging are lamentably common, it is important to note that often such problems arise from a lack of awareness rather than malfeasance.[2] As Gelman and Loken (2014) note, “it can seem entirely appropriate to look at the data and construct reasonable rules for data exclusion, coding, and analysis that can lead to statistical significance” (p. 461).[3] In such cases an unconscious tendency to interpret the results in a biased fashion can be guarded against by prespecifying the course of analysis.
Apart from p-hacking, other forms of data dredging include: assessing models with multiple combinations of variables and selectively reporting the “best” model (i.e., “fishing”);[4] making decisions about whether to collect new data on the basis of interim results; making post-hoc decisions about which statistical analyses to conduct; and generating a hypothesis to explain results which have already been obtained but presenting it as if it were a hypothesis one had prior to collecting the data (i.e., HARKing, “hypothesizing after the results are known”).[5] In general, these procedures are acceptable when transparently reported; however, when authors neglect to accurately report how the results were in fact generated, the procedures are rightfully classified as data-dredging.
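One of these forms, deciding whether to collect new data on the basis of interim results (often called optional stopping), is easy to demonstrate with a simulation. The sketch below uses hypothetical batch sizes and stops collecting as soon as a test of pure noise reaches significance.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_sim, alpha = 2000, 0.05
initial_n, step, max_n = 20, 10, 60

hits = 0
for _ in range(n_sim):
    a = list(rng.normal(size=initial_n))
    b = list(rng.normal(size=initial_n))  # the null hypothesis is true
    while True:
        p = stats.ttest_ind(a, b).pvalue
        if p < alpha or len(a) >= max_n:
            break
        # "Not significant yet" -- collect another batch and test again.
        a.extend(rng.normal(size=step))
        b.extend(rng.normal(size=step))
    if p < alpha:
        hits += 1

# Peeking at interim results and stopping on significance inflates
# the false-positive rate above the nominal 5%.
print(f"False-positive rate with optional stopping: {hits / n_sim:.3f}")
```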
Example
Despite numerous published trials and meta-analyses which appear to support the use of progestogens to mitigate pregnancy loss, when trials are limited to those that were preregistered the evidence does not support their use.[6] Of 93 randomized controlled trials, 22 were classified as unlikely to be p-hacked. Of these, only one produced a statistically significant result, and a meta-analysis of these trials found evidence that progesterone was not effective (RR = 1.00, 95% CI 0.94-1.07). In contrast, most previous meta-analyses support the use of progestogens. It is unlikely that this difference is the result of publication bias since, given the ratio of non-significant to significant results in the preregistered trials, we should expect an enormous number of unpublished studies that found no statistically significant effect. Prior et al. therefore suggest that the difference is likely due to the inclusion of studies susceptible to data-dredging in the previous meta-analyses.[6]
The work of Brian Wansink and the Cornell Food and Brand Lab presents a more extreme example, which has nevertheless become a paradigm case of data-dredging. In a November 2016 blog post Wansink described how, at his encouragement, a visiting scholar reanalyzed the data from a “failed” study and generated four articles supporting claims such as that food tastes worse when buffet prices are low and that men eat more when dining with women in order to impress them. Investigations revealed that Wansink and his team strategized about how to generate statistical analyses that would produce flashy results. Crucially, the process the team engaged in to generate the results was not revealed in the published papers. Instead, the results were presented as if the hypotheses had been developed before the data were gathered. Further investigation uncovered almost a decade’s worth of emails in which Wansink and his team strategized ways to dredge through data to find results they felt would be easy to publish, including correspondence with the visiting scholar in which Wansink requests that she “squeeze some blood out of this rock”.[7]
Impact
Though there is no definitive account of the frequency of data-dredging or the severity of the bias it induces, evidence for its existence derives from the unusually large number of published studies with p-values just below .05.[8, 9, 10, 11] While some authors have concluded that p-hacking is unlikely to have had a significant effect on meta-analytic estimates,[12] this conclusion assumes that most meta-analyses include a number of studies with large sample sizes. Authors using p-curve analyses have found that the distribution of p-values is consistent with most research investigating real effects; however, the data are also consistent with some forms of data-dredging.[13, 14] What is clear is that the bias induced by data-dredging will be most severe in cases where the effect size is small, the dependent measures are imprecise, research designs are flexible, and studies are conducted on small populations.[15]
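The intuition behind these p-curve analyses can be illustrated with a small simulation (hypothetical sample and effect sizes, not the procedure from the cited papers): among studies that reach significance, p-values are roughly uniform when no effect exists but pile up near zero when a real effect is present.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_studies, n = 5000, 30

def significant_p_values(effect):
    """Return the p-values below .05 from simulated two-group studies."""
    ps = []
    for _ in range(n_studies):
        a = rng.normal(loc=effect, size=n)
        b = rng.normal(size=n)
        p = stats.ttest_ind(a, b).pvalue
        if p < 0.05:
            ps.append(p)
    return np.array(ps)

for label, effect in [("no true effect", 0.0), ("true effect d=0.5", 0.5)]:
    ps = significant_p_values(effect)
    # Under the null the significant p-values are roughly uniform on (0, .05);
    # with a real effect they pile up near zero (a right-skewed p-curve).
    counts = np.histogram(ps, bins=[0, .01, .02, .03, .04, .05])[0]
    print(label, "- counts in .01-wide bins:", counts)
```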
Preventive steps
John, Loewenstein, & Prelec (2012) found that researchers were generally unaware that data-dredging would induce bias.[2] Researchers frequently endorsed flawed practices such as deciding whether to gather more data after inspecting interim results or whether to exclude outliers after assessing the impact of doing so. Accordingly, there is hope that better statistical education would be beneficial. Authors should prespecify rules for stopping data collection and for how the data will be analyzed (including how to handle outliers, any expected transformations of the variables, what covariates will be controlled for, etc.), and studies should be sufficiently powered to detect meaningful effects. Published studies should list all variables collected in the study, report all planned analyses (specifying primary and secondary outcomes), and include robustness analyses for methodological choices.[1] Optimally, these choices should be registered prior to beginning the study.[15] When methodological decisions are informed by the data collected, the results should be clearly identified as exploratory analyses and replicated.
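As a concrete illustration of the power requirement mentioned above, the sketch below estimates power for a two-sample t-test by simulation at the planning stage, using a hypothetical standardized effect of 0.5 and a few candidate sample sizes; roughly 64 participants per group are needed for about 80% power.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def simulated_power(n_per_group, effect_size, n_sim=2000, alpha=0.05):
    """Estimate power for a two-sample t-test by simulation."""
    hits = 0
    for _ in range(n_sim):
        a = rng.normal(loc=effect_size, size=n_per_group)
        b = rng.normal(size=n_per_group)
        if stats.ttest_ind(a, b).pvalue < alpha:
            hits += 1
    return hits / n_sim

# Hypothetical planning step: check which sample size gives roughly 80% power
# to detect a standardized effect of 0.5, before any data are collected.
for n in (30, 50, 64, 80):
    print(f"n per group = {n}: estimated power = {simulated_power(n, 0.5):.2f}")
```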
The p-curve analysis has been suggested as a formal procedure for correcting for p-hacking;[16, 17] however, its validity has not yet been established empirically. Particularly in non-randomized studies, confounding can render the p-curve analysis unreliable.[14]
Ultimately, as Banks et al. (2016) have argued, data-dredging is a problem of bad barrels rather than bad apples, because research systems incentivize producing nice-looking results.[19] Thus, one effective means of intervention would be to change the standards for conceiving, conducting, and publishing scientific research.[20] For example, accepting articles on the basis of their design rather than their results would significantly alleviate the pressure to dredge data for “attractive” results.[21] Such alterations would represent sweeping changes for most disciplines.