Jessica Utts Home Page

In response to the paper Replication and Meta-Analysis in Parapsychology by Jessica Utts

[These papers were published in "Statistical Science," 1991, Vol. 6, No. 4.]

Comment

Ree Dawson

This paper offers readers interested in statistical science multiple views of the controversial history of parapsychology and how statistics has contributed to its development. It first provides an account of how both design and inferential aspects of statistics have been pivotal issues in evaluating the outcomes of experiments that study psi abilities. It then emphasizes how the idea of science as replication has been key in this field in which results have not been conclusive or consistent and thus meta-analysis has been at the heart of the literature in parapsychology. The author not only reviews past debate on how to interpret repeated psi studies, but also provides very detailed information on the Honorton-Hyman argument, a nice illustration of the challenges of resolving such debate. This debate is also a good example of how statistical criticism can be part of the scientific process and lead to better experiments and, in general, better science.

The remainder of the paper addresses technical issues of meta-analysis, drawing upon recent research in parapsychology for an in-depth application. Through a series of examples, the author presents a convincing argument that power issues cannot be overlooked in successive replications and that comparison of effect sizes provides a richer alternative to the dichotomous measure inherent in the use of p-values. This is particularly relevant when the potential effect size is small and resources are limited, as seems to be the case for psi studies.

The concluding section briefly mentions Bayesian techniques. As noted by the author, Bayes (or empirical Bayes) methodology seems to make sense for research in parapsychology. This discussion examines possible Bayesian approaches to meta-analysis in this field.

BAYES MODELS FOR PARAPSYCHOLOGY

The notion of repeatability maps well into the Bayesian set-up in which experiments, viewed as a random sample from some superpopulation of experiments, are assumed to be exchangeable. When subjects can also be viewed as an approximately random sample from some population, it is appropriate to pool them across experiments. Otherwise, analyses that partially pool information according to experimental heterogeneity need to be considered. Empirical and hierarchical Bayes methods offer a flexible modeling framework for such analyses, relying on empirical or subjective sources to determine the degree of pooling. These richer methods can be particularly useful for meta-analyses of experiments in parapsychology conducted under potentially diverse conditions.

For the recent ganzfeld series, assuming them to be independent and binomially distributed as discussed in Section 5, the data can be summed (pooled) across series to estimate a common hit rate. Honorton et al. (1990) assessed the homogeneity of effects across the 11 series using a chi-square test that compares individual effect sizes to the weighted mean effect. The chi-square statistic $\chi^2_{10} = 16.25$, not statistically significant ($p = 0.093$), largely reflects the contribution of the last "special" series (which contributes 9.2 units to the $\chi^2_{10}$ value) and, to a lesser extent, the novice series with a negative effect (which contributes 2.5 units). The outlier series can be dropped from the analysis to provide a more conservative estimate of the presence of psi effects for these data (this result is reported in Section 5). For the remaining 10 series, the chi-square value $\chi^2_{9} = 7.01$ strongly favors homogeneity, although more than one-third of its value is due to the novice series (number 4 in Table 1). This pattern points to the potential usefulness of a richer model to accommodate series that may be distinct from others. For the earlier ganzfeld data analyzed by Honorton (1985b), the appeal of a Bayes or other model that recognizes the heterogeneity across studies is clear cut: $\chi^2_{23} = 56.6$, $p = 0.0001$, where only those studies with common chance hit rate have been included (see Table 2).
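As a quick check on the quoted significance levels, the upper-tail probability of a chi-square statistic can be computed directly. The sketch below is illustrative only: the helper name `chi2_sf` is hypothetical, and the degrees of freedom (10, 9 and 23, one less than the number of series or studies) are an assumption based on the usual form of a homogeneity test, not a detail taken from Honorton et al.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with integer df."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: closed-form finite sum.
        total = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * total
    # Odd df: complementary error function plus a finite sum.
    total = sum(half ** (k - 0.5) / math.gamma(k + 0.5)
                for k in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total

# Statistics quoted in the text; the df values are assumed (see lead-in).
cases = [("11 recent series", 16.25, 10),
         ("10 series, outlier dropped", 7.01, 9),
         ("24 earlier studies", 56.6, 23)]

for label, stat, df in cases:
    print(f"{label}: chi-square = {stat}, df = {df}, p = {chi2_sf(stat, df):.4f}")
```

Under these assumed degrees of freedom, the first and third tail probabilities agree with the quoted values of 0.093 and 0.0001 after rounding.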

TABLE 1
Recent ganzfeld series

No.  Series type   N trials  Hit rate   $Y_i$  $\sigma_i$
  1  Pilot               22      0.36   -0.58        0.44
  2  Pilot                9      0.33   -0.71        0.71
  3  Pilot               36      0.28   -0.94        0.37
  4  Novice              50      0.24   -1.15        0.33
  5  Novice              50      0.36   -0.58        0.30
  6  Novice              50      0.30   -0.85        0.31
  7  Novice              50      0.36   -0.58        0.30
  8  Novice               6      0.67    0.71        0.87
  9  Experienced          7      0.43   -0.28        0.76
 10  Experienced         50      0.30   -0.85        0.31
 11  Experienced         25      0.64    0.58        0.42

     Overall            355      0.34


TABLE 2
Earlier ganzfeld studies

No.       N trials  Hit rate   $Y_i$  $\sigma_i$
  1             32      0.44   -0.24        0.36
  2              7      0.86    1.82        1.09
  3             30      0.43   -0.28        0.37
  4             30      0.23   -1.21        0.43
  5             20      0.10   -2.20        0.75
  6             10      0.90    2.20        1.05
  7             10      0.40   -0.41        0.65
  8             28      0.29   -0.90        0.42
  9             10      0.40   -0.41        0.65
 10             20      0.35   -0.62        0.47
 11             26      0.31   -0.80        0.42
 12             20      0.45   -0.20        0.45
 13             20      0.45   -0.20        0.45
 14             30      0.53    0.12        0.37
 15             36      0.33   -0.71        0.35
 16             32      0.28   -0.94        0.39
 17             40      0.28   -0.94        0.35
 18             26      0.46   -0.16        0.39
 19             20      0.60    0.41        0.46
 20            100      0.41   -0.36        0.20
 21             40      0.33   -0.71        0.34
 22             27      0.41   -0.36        0.39
 23             60      0.45   -0.20        0.26
 24             48      0.21   -1.33        0.35

Overall        722      0.38


Historic reliance on vote-counting approaches to determine the presence of psi effects makes it natural to consider Bayes models that focus on the ensemble of experimental effects from parapsychological studies, rather than individual estimates. Recent work in parapsychology that compares effect sizes across studies, rather than estimating separate study effects, reinforces the need to examine this type of model. Louis (1984) develops Bayes and empirical Bayes methods for problems in which the ensemble of parameter values is the primary goal, for example, multiple comparisons. For the simple compound normal model, $Y_i \sim N(\theta_i, 1)$, $\theta_i \sim N(\mu, \tau^2)$, the standard Bayes estimates (posterior means)

$$\theta_i^* = \mu + D(Y_i - \mu), \qquad D = \frac{\tau^2}{1 + \tau^2},$$

where the $\theta_i$ represent experimental effects of interest, are modified approximately to

$$\hat{\theta}_i = \mu + \sqrt{D}\,(Y_i - \mu)$$

when an ensemble loss function is assumed. The new estimates adjust the shrinkage factor $D$ so that their sample mean and variance match the posterior expectation and variance of the $\theta$'s. Similar results are obtained when the model is generalized to the case of unequal variances, $Y_i \sim N(\theta_i, \sigma_i^2)$.
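The moment-matching idea can be sketched numerically. The code below is a minimal illustration, not Louis's published algorithm: it assumes the unit-variance compound normal model with known $\mu$ and $\tau^2$, takes the shrinkage factor to be $D = \tau^2/(1+\tau^2)$, and rescales the posterior means so that their sample variance equals the posterior expected sample variance of the effects. The function names and the data values are hypothetical.

```python
import math
import statistics

def posterior_means(y, mu, tau2):
    """Posterior means under Y_i ~ N(theta_i, 1), theta_i ~ N(mu, tau2)."""
    d = tau2 / (1.0 + tau2)              # shrinkage factor D
    return [mu + d * (yi - mu) for yi in y], d

def ensemble_estimates(y, mu, tau2):
    """Stretch the posterior means so that their sample variance matches
    the posterior expected sample variance of the theta_i."""
    means, d = posterior_means(y, mu, tau2)
    center = statistics.mean(means)
    s2 = statistics.variance(means)      # sample variance of the posterior means
    # Each theta_i has posterior variance D and the theta_i are independent
    # a posteriori, so E[sample variance of theta | Y] = s2 + D.
    stretch = math.sqrt(1.0 + d / s2)
    return [center + stretch * (m - center) for m in means]

# Hypothetical observed effects and hyperparameters, for illustration only.
y = [-1.2, -0.4, 0.1, 0.6, 1.9]
mu, tau2 = 0.2, 1.0

means, d = posterior_means(y, mu, tau2)
ens = ensemble_estimates(y, mu, tau2)
print("D =", d)
print("posterior means   :", [round(m, 3) for m in means])
print("ensemble estimates:", [round(e, 3) for e in ens])
```

When the sample variance of the $Y_i$ is close to its marginal expectation $1 + \tau^2$, the stretch factor is close to $1/\sqrt{D}$, so the net shrinkage applied to each $Y_i$ is roughly $\sqrt{D}$ rather than $D$.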

For the above model, the fraction of $\hat{\theta}_i$ above (or below) a cut point $C$ is a consistent estimate of the fraction of $\theta_i > C$ (or $\theta_i < C$). Thus, the use of ensemble, rather than component-wise, loss can help detect when individual effects are above a specified threshold by chance. For the meta-analysis of ganzfeld experiments, the observed binomial proportions transformed on the logit (or arcsin) scale can be modeled in this framework. Letting $d_i$ and $m_i$ denote the number of direct hits and misses, respectively, for the $i$th experiment, and $p_i$ the corresponding population proportion of direct hits, the $Y_i$ are the observed logits

$$Y_i = \log(d_i / m_i),$$

and $\sigma_i^2$, estimated by maximum likelihood as $1/d_i + 1/m_i$, is the variance of $Y_i$ conditional on $\theta_i = \mathrm{logit}(p_i)$. The threshold $\mathrm{logit}(0.25) = -1.10$ can be used to identify the number of experiments for which the proportion of direct hits exceeds that expected by chance.
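The logit transformation and its standard error can be illustrated with the Table 1 entries. This is a sketch under one stated assumption: the logits are computed from the rounded hit rates printed in the table rather than from raw hit counts, using the identity $1/d + 1/m = 1/\{N p (1-p)\}$ when $d = Np$. Variable names are hypothetical.

```python
import math

# (series type, N trials, hit rate) as listed in Table 1.
table1 = [
    ("Pilot", 22, 0.36), ("Pilot", 9, 0.33), ("Pilot", 36, 0.28),
    ("Novice", 50, 0.24), ("Novice", 50, 0.36), ("Novice", 50, 0.30),
    ("Novice", 50, 0.36), ("Novice", 6, 0.67),
    ("Experienced", 7, 0.43), ("Experienced", 50, 0.30),
    ("Experienced", 25, 0.64),
]

def logit(p):
    return math.log(p / (1.0 - p))

threshold = logit(0.25)          # chance hit rate of one in four

rows = []
for kind, n, rate in table1:
    y = logit(rate)                                      # log(d / m) with d = n * rate
    sigma = math.sqrt(1.0 / (n * rate * (1.0 - rate)))   # sqrt(1/d + 1/m)
    rows.append((kind, y, sigma))

above = sum(1 for _, y, _ in rows if y > threshold)

print(f"threshold logit(0.25) = {threshold:.2f}")
for number, (kind, y, sigma) in enumerate(rows, start=1):
    print(f"{number:2d} {kind:<12s} Y = {y:6.2f}  sigma = {sigma:4.2f}")
print(f"{above} of {len(rows)} observed logits exceed the threshold")
```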

Table 1 shows $Y_i$ and $\sigma_i$ for the 11 ganzfeld series. All but one of the series are well above the threshold; $Y_4$ falls marginally below $-1.10$. Any shrinkage toward a common hit rate will lead to an estimate, $\theta_4^*$ or $\hat{\theta}_4$, above the threshold. The use of ensemble loss (with its consistency property) provides more convincing support that all $\theta_i > -1.10$, although posterior estimates of uncertainty are needed to fully calibrate this. For the earlier ganzfeld data in Table 2, ensemble loss can similarly be used to determine the number of studies with $\theta_i < -1.10$ and specifically whether the negative effects of studies 4 and 24 ($Y_4 = -1.21$ and $Y_{24} = -1.33$) occurred as a result of chance fluctuation.

Features of the ganzfeld data in Section 5, such as the outlier series, suggest that further elaboration of the basic Bayesian set-up may be necessary for some meta-analyses in parapsychology. Hierarchical models provide a natural framework to specify these elaborations and explore how results change with the prior specification. This type of sensitivity analysis can expose whether conclusions are closely tied to prior beliefs, as observed by Jeffreys for RNG data (see Section 7). Quantifying the influence of model components deemed to be more subjective or less certain is important to broad acceptance of results as evidence of psi performance (or lack thereof).

Consider the initial model commonly used for Bayesian analysis of discrete data:

$$Y_i \mid p_i, n_i \sim B(p_i, n_i),$$

$$\theta_i \sim N(\mu, \tau^2), \qquad \theta_i = \mathrm{logit}(p_i),$$

with noninformative priors assumed for $\mu$ and $\tau^2$ (e.g., $\log \tau$ locally uniform). The distinctiveness of the last "special" series (pilot versus formal, novice versus experienced) raises the question of whether the experimental effects follow a normal distribution. Weighted normal plots (Ryan and Dempster, 1984) can be used to graphically diagnose the adequacy of second-stage normality (see Dempster, Selwyn and Weeks, 1983, for examples with binary response and normal superpopulation).
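To make the two-stage structure concrete, the following sketch simulates one data set from a logistic-normal model of this kind: each series draws its own logit from a normal superpopulation, and the hits are then binomial given that logit. The hyperparameter values are hypothetical choices for illustration, not estimates from the ganzfeld data, and the function name is an assumption.

```python
import math
import random

def simulate_series(n_trials, mu, tau, rng):
    """Draw one data set from the two-stage model:
    theta_i ~ N(mu, tau^2), p_i = expit(theta_i), hits_i ~ Binomial(n_i, p_i)."""
    draws = []
    for n in n_trials:
        theta = rng.gauss(mu, tau)                 # second stage: series-level logit
        p = 1.0 / (1.0 + math.exp(-theta))         # back-transform to a hit rate
        hits = sum(1 for _ in range(n) if rng.random() < p)   # first stage: binomial
        draws.append((n, hits, p))
    return draws

rng = random.Random(1991)
n_trials = [22, 9, 36, 50, 50, 50, 50, 6, 7, 50, 25]   # series sizes from Table 1
mu = math.log(0.34 / 0.66)     # hypothetical: logit of the overall hit rate
tau = 0.5                      # hypothetical second-stage standard deviation

sim = simulate_series(n_trials, mu, tau, rng)
for n, hits, p in sim:
    print(f"n = {n:3d}  true rate = {p:.2f}  observed rate = {hits / n:.2f}")
```

Repeating the simulation for several values of `tau` gives a feel for how much spread in observed hit rates is attributable to binomial noise alone and how much to genuine heterogeneity between series.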

Alternatively, if nonnormality is suspected, the model can be revised to include some sort of heavy-tailed prior to accommodate possibly outlying series or studies. West (1985) incorporates additional scale parameters $\gamma_i$, one for each component of the model (experiment), that flexibly adapt to atypical $\theta_i$ and discount their influence on posterior estimates, thus avoiding under- or over-shrinkage due to such $\theta_i$. For example, the second stage can specify the prior as a scale mixture of normals:

$$\theta_i \sim N(\mu, \tau^2 \gamma_i^{-1}),$$

$$k\gamma_i \sim \chi^2_k,$$

$$\tau^{-2} \sim \ldots$$

This approach for the prior is similar to others for maximum likelihood estimation that modify the sampling error distribution to yield estimates that are "robust" against outlying observations.
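The effect of a scale mixture on the second stage can be seen by simulation. The sketch below assumes the common choice in which each effect has its own scale parameter $\gamma_i$ with $k\gamma_i \sim \chi^2_k$, so that the effects are marginally Student-t with $k$ degrees of freedom; this mixing distribution, the symbol $\gamma_i$, and all numerical settings are assumptions made for illustration.

```python
import random

def draw_effects(n, mu, tau, k, rng):
    """Draw n second-stage effects from the scale mixture
    theta_i ~ N(mu, tau^2 / gamma_i) with k * gamma_i ~ chi-square(k)."""
    out = []
    for _ in range(n):
        chi2 = rng.gammavariate(k / 2.0, 2.0)   # chi-square(k) is Gamma(shape k/2, scale 2)
        gamma_i = chi2 / k
        out.append(rng.gauss(mu, tau / gamma_i ** 0.5))
    return out

def tail_fraction(values, mu, cutoff):
    """Fraction of values farther than cutoff from mu."""
    return sum(1 for v in values if abs(v - mu) > cutoff) / len(values)

rng = random.Random(0)
mu, tau, k, n = 0.0, 1.0, 4, 20000            # hypothetical settings

mixture = draw_effects(n, mu, tau, k, rng)
normal = [rng.gauss(mu, tau) for _ in range(n)]

heavy = tail_fraction(mixture, mu, 3 * tau)
light = tail_fraction(normal, mu, 3 * tau)
print(f"beyond 3 tau: scale mixture {heavy:.4f}, plain normal {light:.4f}")
```

A small $\gamma_i$ inflates the prior variance for that component, which is what allows an outlying series to sit far from the common mean without dragging the mean toward it.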

Like its maximum likelihood counterparts, in addition to the robust effect estimates $\theta_i^*$, the Bayes model provides (posterior) scale estimates $\gamma_i^*$. These can be interpreted as the weight given to the data for each $\theta_i$ in the analysis and are useful for diagnosing which model components (series or studies) are unusual and how they influence the shrinkage. When more complex groupings among the $\theta_i$ are suspected, for example, a bimodal distribution of studies from different sites or experimenters, other mixture specifications can be used to further relax the shrinkage toward a common value.

For the 11 ganzfeld series, the last "outlier" series, quite distinct from the others (hit rate = 0.64), is moderately precise ($N = 25$). Omitting it from the analysis causes the overall hit rate to drop from 0.344 to 0.321. The scale mixture model is a compromise between these two values (on the logit scale), discounting the influence of series 11 on the estimated posterior common hit rate used for shrinkage. The scale factor $\gamma_{11}^*$, an indication of how separate $\theta_{11}$ is from the other parameters, also causes $\theta_{11}^*$ to be shrunk less toward the common hit rate than other, more homogeneous $\theta_i$, giving more weight to individual information for that series (see West, 1985). The heterogeneity of the earlier ganzfeld data is more pronounced, and studies are taken from a variety of sources over time. For these data, the $\gamma_i^*$ can be used to explore atypical studies (e.g., study 6, with hit rate = 0.90, contributes more than 25% to the $\chi^2_{23}$ value for homogeneity) and groupings among effects, as well as protect the analysis from misspecification of second-stage normality.
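The drop in the pooled hit rate can be reproduced from Table 1. The sketch below assumes the hit counts can be recovered by rounding $N \times$ (hit rate) for each series, since the raw counts are not listed in the table; variable names are hypothetical.

```python
# (N trials, hit rate) for the 11 series in Table 1.
table1 = [(22, 0.36), (9, 0.33), (36, 0.28), (50, 0.24), (50, 0.36),
          (50, 0.30), (50, 0.36), (6, 0.67), (7, 0.43), (50, 0.30),
          (25, 0.64)]

# Assumption: hit counts reconstructed from the rounded hit rates.
trials = [n for n, _ in table1]
hits = [round(n * rate) for n, rate in table1]

rate_all = sum(hits) / sum(trials)
rate_without = sum(hits[:-1]) / sum(trials[:-1])   # drop series 11

print(f"all 11 series    : {sum(hits)}/{sum(trials)} = {rate_all:.3f}")
print(f"without series 11: {sum(hits[:-1])}/{sum(trials[:-1])} = {rate_without:.3f}")
```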

Variation among ganzfeld series or studies and the degree to which pooling or shrinking is appropriate can be investigated further by considering a range of priors for $\tau^2$. If the marginal likelihood of $\tau^2$ dominates the prior specification, then results should not vary as the prior for $\tau^2$ is varied. Otherwise, it is important to identify the degree to which subjective information about interexperimental variability influences the conclusions. This sensitivity analysis is a Bayesian enrichment of the simpler test of homogeneity directed toward determining whether or not complete pooling is appropriate.

To assess how well heterogeneity among historical control groups is determined by the data, Dempster, Selwyn and Weeks (1983) propose three priors for $\tau^2$ in the logistic-normal model. The prior distributions range from strongly favoring individual estimates, $p(\tau^2)\,d\tau^2 \propto \tau^{-1}$, to the uniform reference prior $p(\tau^2)\,d\tau^2 \propto \tau^{-2}$, flat on the log scale, to strongly favoring complete pooling, $p(\tau^2)\,d\tau^2 \propto \tau^{-3}$ (the latter forcing complete pooling for the compound normal model; see Morris, 1983). For their two examples, the results (estimates of linear treatment effects) are largely insensitive to variation in the prior distribution, but the number of studies in each example was large (70 and 19 studies available for pooling). For the 11 ganzfeld series, $\tau^2$ may be less well determined by the data. The posterior estimate of $\tau^2$ and its sensitivity to $p(\tau^2)\,d\tau^2$ will also depend on whether individual scale parameters are incorporated into the model. Discounting the influence of the last series will both shift the marginal likelihood toward smaller values of $\tau^2$ and concentrate it more in that region.
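A rough version of this sensitivity analysis can be sketched on a grid. The code below is not the logistic-normal analysis of Dempster, Selwyn and Weeks; it substitutes a normal approximation on the logit scale, $Y_i \sim N(\mu, \sigma_i^2 + \tau^2)$, with $\mu$ integrated out under a flat prior, and it reads the three priors as densities for $\tau^2$ proportional to $\tau^{-1}$, $\tau^{-2}$ and $\tau^{-3}$. The grid limits are an arbitrary assumption, and because the steeper priors are improper near zero, the numbers depend on the lower limit chosen.

```python
import math

# Observed logits and standard errors for the 11 series (Table 1).
y = [-0.58, -0.71, -0.94, -1.15, -0.58, -0.85, -0.58, 0.71, -0.28, -0.85, 0.58]
sigma = [0.44, 0.71, 0.37, 0.33, 0.30, 0.31, 0.30, 0.87, 0.76, 0.31, 0.42]

def log_marginal_likelihood(tau2):
    """log L(tau2) for Y_i ~ N(mu, sigma_i^2 + tau2), with mu integrated
    out under a flat prior (normal approximation, an assumption)."""
    w = [1.0 / (s * s + tau2) for s in sigma]
    sw = sum(w)
    mu_hat = sum(wi * yi for wi, yi in zip(w, y)) / sw
    quad = sum(wi * (yi - mu_hat) ** 2 for wi, yi in zip(w, y))
    return 0.5 * sum(math.log(wi) for wi in w) - 0.5 * math.log(sw) - 0.5 * quad

def posterior_mean_tau2(power, grid):
    """Posterior mean of tau2 on an evenly spaced grid, with prior density
    for tau2 proportional to tau^(-power)."""
    logp = [log_marginal_likelihood(t) - 0.5 * power * math.log(t) for t in grid]
    top = max(logp)                                   # stabilise the exponentials
    weights = [math.exp(lp - top) for lp in logp]
    return sum(wt * t for wt, t in zip(weights, grid)) / sum(weights)

# tau2 from 0.001 to 2.0; both limits are assumptions of this sketch.
grid = [0.001 * i for i in range(1, 2001)]

results = {power: posterior_mean_tau2(power, grid) for power in (1, 2, 3)}
for power, value in results.items():
    print(f"prior proportional to tau^-{power}: posterior mean of tau2 = {value:.3f}")
```

If the three posterior summaries were nearly equal, the data would be dominating the prior; a wide spread indicates that conclusions about pooling rest partly on the prior for $\tau^2$.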

The issue of objective assessment of experiment results is one that extends well beyond the field of parapsychology, and this paper provides insight into issues surrounding the analysis and interpretation of small effects from related studies. Bayes methods can contribute to such meta-analysis in two ways. They permit experimental and subjective evidence to be formally combined to determine the presence or absence of effects that are not clear cut or controversial (e.g., psi abilities). They can also help uncover sources and degree of uncertainty in the scientific conclusions.

Ree Dawson is Senior Statistician, New England Biomedical Research Foundation,
and Statistical Consultant, RFE/RL Research Institute.
Dr. Dawson's mailing address is 177 Morrison Avenue, Somerville, Massachusetts 02144.


COPYRIGHT NOTICE

The contents of this document are copyright ©1991 by the Institute of Mathematical Statistics. All rights reserved.
