Introduction

Many surveys are based on stratified multi-stage probability samples. When analyzing probability-sample data, one usually needs to take the complex nature of the sample design into account because sampled elements from the same strata or cluster can be correlated. Moreover, the item values of a sampled element may be related to its probability of selection. Probability-sample theory provides a convenient way of analyzing probability-sample data in a rigorous manner. (The more commonly used but misleading term “design-based theory” is not employed here because models are often used, at least implicitly, in motivating the probability-sampling design.)

One common failing of probability-sampling theory is its reliance on an ad hoc method for determining the degrees of freedom (dof) of its variance estimators. Ingram et al. (2018, p. 5) describe this method in their recommendation for analyses of National Center of Health Statistics record-level survey data: “the recommended number of degrees of freedom . . . is the number of PSUs minus the number of sampling strata.” This is the default dof measure employed in the probability-sampling software language SUDAAN (RTI International, 2012) and in the probability-sampling procedures in SAS (SAS Institute Inc, 2015).

Here we investigate using an alternative method of dof calculation derived by assuming a contrafactual model for the sample data. We focus on estimating means, although generalizations to other statistics will be briefly discussed in the final section. We will look at full-population and domain-mean estimates computed from a stratified cluster samples drawn from a particular population and evaluate how well the conventional and the model-based alternative dof measures fare at constructing two-sided confidence intervals (CIs) for the estimands.

We first provide some technical background. We then describe our population and conduct the evaluation. Finally, we offer some caveats, conclusions, and speculations.

Background

Suppose we are estimating the mean of a finite population of item values using data from a complex probability sample. For simplicity, we will assume that there is no nonresponse and that the sample has been drawn from a complete population using a stratified multistage sample. In particular, the population U consisting of M elements has been divided in mutually exclusive and exhaustive primary sampling units (PSUs), and the PSUs themselves have been grouped into H mutually exclusive and exhaustive strata. A probability sample of nh ≥ 2 PSUs is selected in each stratum h (h = 1, …, H), and a probability sample of elements is selected within each sampled PSU. By “probability sample,” we mean every PSU in stratum h has a known positive probability of selection (or expected number of selections) into the sample of PSUs, and every element in each sampled PSU i has a known positive probability of selection into the sample of elements from that PSU. The selection of elements from PSUs can have any number of stages.

Let S1h denote the sample of the nh PSUs selected from stratum h, S1 the union of n=ΣHnh sampled PSUs, and S the sample of elements drawn from within those n PSUs. The subscript 1 in S1h and S1 refers to the first stage of sampling in what could be many stages of sample selection.

Let kU denote an element in the population and πk its probability of selection into S: the product of the probability of selection of the PSU (call it i) containing k into S1 and the probability of k’s selection into the element sample drawn from within PSU i. That element sample is denoted by Si.

The sampling (or design) weight of element k is denoted dk = 1/πk. A nearly unbiased estimator for a population mean Ty=ΣkUyk/M under probability-sampling theory is

ty=kSdkykkSdk

An estimator like ty is nearly unbiased when its relative bias tends to zero as the number of sampled PSUs in S1 grows arbitrarily large under mild conditions we assume to hold. See, for example, Isaki & Fuller (1982) for a discussion of those conditions.

We will also assume the mild conditions under which the contribution of the bias of ty to its mean squared error tends to zero as the number of sampled PSUs in S1 grows arbitrarily large. The reader should understand that, in what follows, the mild conditions we assume to hold remain in effect.

One conventionally estimates the variance/mean squared error of ty under probability-sampling theory by treating the random selection of PSUs within each stratum as if it were done with replacement, even though that is rarely the case in practice. We follow that conventional treatment here because it greatly simplifies variance estimation.

The conventional probability-sampling variance estimator for ty is

v=Hh=1nhnh1[iS1he2i+1nh(iS1hei+)2],

where

ei+=kSidk(ykty)kSdk.

Observe that ei+ is nearly equal to ui+=kSidk(ykty)/M, and the ui+ are independent random variables under the stratified with-replacement selection of PSUs. “Nearly equal” means that the ratio of the two, ei+/ui+, converges to 1 as the number of PSUs in S1 grows arbitrary large.

If the ui+ were normally distributed and each had the same variance, then v would nearly have a (central) chi-squared distribution with n – H degrees of freedom. (It would be exactly such a distribution if the ei+ were replaced by the ui+.) As a result, the statistic z = (tyTy )/v1/2, which is used to make inferences about of ty as an estimate for of Ty, would nearly have a Student t distribution with n – H degrees of freedom.

This is where the conventional practice of determining the degrees of freedom for a statistic derived from a stratified multistage probability sample comes from. Many have noted that even under the ideal conditions assumed here—a stratified multistage probability sample drawn from a complete population, where there is no nonresponse, at least two PSUs chosen with replacement from each stratum, and a probability sample of elements selected within each sampled PSUs—the conventional practice is rarely justified. See, for example, Valliant & Rust (2010). That is because the PSU-level “residuals” captured in the ui+ seldom have the same variance and may not be normally distributed.

The probability-sampling variance estimation in Equation (2) captures any loss in efficiency of the estimator ty due to the individual yk within each PSU being correlated (whether there are additional stages of sampling) as well as any gains in efficiency due to stratification. The variance estimate also captures any loss efficiency due to the dk being unequal. Were all the yk in U normal, independent, and identically distributed, Ty could be efficiently estimated by a simple mean of the m yk-values in S, the variance of which would be efficiently estimated by

vI=1m(m1)[kSy2k1m(kSyk)2],

which has a chi-squared distribution with m − 1 degrees of freedom. Heuristically, in addition to the v usually being larger than vI, the smaller n – H nominal degrees of freedom in v (compared with m−1) is the price one pays for using ty and v for making inferences about Ty. That price may not be steep enough, as we shall see.

The degrees of freedom of a chi-squared variance estimator is 2 divided by its relative variance. Accordingly, it seems reasonable to employ a Satterthwaite approximation (1946) to measure the effective degrees of freedom (edof) for a non–chi-squared variance estimator, like v; namely, edof(v) = 2/relvar(v). Kott (1994) proposed computing such edof for v by treating each yk as it is were normal, independent, and identically distributed. Assuming this contrafactual model makes computing the relative variance of v straightforward, and truly provides a measure of the price one pays for using v in place of vI when drawing inferences about a population mean.

Here we will look at estimates of domain means as well as population means. A domain mean has the form Ty,g=ΣkU gkyk/ΣkU gk, where gk = 1 when element k is in the domain and 0 otherwise. Its nearly-unbiased estimator under mild conditions is ty,g=ΣkSdkgkyk/ΣkSdkgk, and its conventional probability-sampling variance estimator has the form of Equation (2) with the ei+ in Equation (3) replaced by

ei+=kSidkgk(ykty,g)kSdkgk.

A population mean and its estimator can be expressed as a domain mean and its estimator by letting gk = 1 for every kU.

Setting wk = dkgk, the edof measure for the probability-sampling variance estimator of a domain mean derived under the contrafactual assumption that the yk in the domain are normal, independent, and identically distributed, is

edof(v)=(kSw2k)2Hh=1iS1h(kSiw2k)2+Hh=11(nh1)2{[iS1h(kSiw2k)]2iS1h(kSiw2k)2},

(Kott, 1994).

A little algebra reveals that if kSiw2k for each PSU were the same, the right-hand side of Equation (4) would collapse to

edof(v)=n1+1nHh=1nhnh1,

which is close to n − H. Exact equally obtains when all nh = n/H. Observe that the contrafactual edof in Equation (4) tends to decrease as the variability of the kSiw2k across PSUs, whether due to the variability of the wk2 or the sizes of the PSUs, increases.

The Population and Estimands of Interest

We now will compare how well the contrafactual edof in Equation (4) and the nominal degrees do at measuring “the true degrees of freedom” of a probability sampling variance estimator given a stratified cluster sample drawn for the MU281 population (Ohlsson, n.d., based on data from Appendix B of Särndal et al., 1992). Determining the true degrees of freedom is elusive. Instead, we define the empirical degrees of freedom for a two-sided q percent CI as the degrees of freedom of a chi-squared distribution with the same q percent CI as that produced by the sample data.

The MU281 population is a collection of Swedish municipalities with three outliers removed. The population is divided into 50 clusters from 8 regional strata. After combing strata 7 and 8, the clusters and strata are further divided (based on 1975 population counts) into 108 PSUs within 20 design strata. A stratified cluster sample of 4 PSUs is sampled with replacement from each design stratum via simple random sampling. All the municipalities in each sampled PSU are the elements of the sample. By sampling with replacement, the variance estimator in Equation (1) should produce nearly unbiased estimates of the true mean squared error.

This sampling process is repeated 10,000 times twice. The first time to estimate the full population means of five variables. The second time to estimate domain mean for the same five variables. The five variables (with their per municipality estimands in parentheses) are

  • RMT85, 1985 revenue from municipal taxes in millions of kronor (189.15)

  • ME84, 1984 number of municipal employees (1,381.26)

  • P85, 1985 population in thousands (25.0285)

  • REV84, 1984 real estate values in millions of kronor (2694.83)

  • RMT/P85, a binary variable equal to 100 when the ratio RTM85/P85 is greater than 8, 0 otherwise (18.5053, which is the estimate the percent of municipalities with RMT/P85 ≥ 8).

All exhibit some degree of intra-stratum and intra-cluster correlation. These can be measured indirectly by treating the population as an equally weighted stratified with-replacement cluster sample and comparing alternative standard-error estimates for the means assuming: no strata or clustering, clustering but no strata, strata but no clustering, and both strata and clustering. This was done, but the results are not shown.

The domain used in the analysis is the set of those municipalities where at least half of the municipal council were Social Democrats in 1982. Roughly 37.01 percent of municipalities in the MU281 population are in this domain. The domain estimands are 196.5, 1449, 25.4808, 2751.33, and 25, respectively. (Analyses of other domain means gave comparable results and are not shown here.)

Results

For each of our simulated samples, the sampling weights in Equation (1) are dk = N[k]/4, where k now denotes a single selection of an element, and N[k] is the number of PSUs (before sampling) in the stratum containing the PSU and thus the element. The sampling weights vary because the number of PSUs in each stratum before sampling vary.

The set S is now the set of sampled element selections, and S1h and S1 are defined analogously. Under with-replacement sampling, a PSU can be selected more than once; each selection is a separate member of S1, leading to potentially multiple memberships for its component elements in S.

We denote the results computed from each r = 1, . . . , 10,000 simulated sample with the subscript (r), and define the following empirical summary statistics (recall a full-population mean can be expressed as a domain mean):

The empirical relative bias of ty,g is ERB(ty,g) = 10,000r=1(ty,g(r)Ty,g)Ty,g10,000×100%.

The empirical variance of ty,g is EV(ty,g)=10,000r=1t2y,g(r)110,000(10,000r=1ty,g(r))2Ty,g9,999.

The empirical relative bias of v is ERB(v)=10,000r=1(v(r)EV(ty,g))Ty,g10,000×100%.

The empirical variance of v is EV(v)=10,000r=1v(r)2110,000(10,000r=1v(r))2Ty,g9,999.

The empirical relative variance of v is ERV(v)=EV(v)(10,000r=1v(r)10,000)2.

The empirical Satterthwaite DOF of v is ESD(v) = 2/ERV(v).

In the previous expressions, v is the probability-sampling variance of ty,g.

The nominal dof of v is 60 (80 PSUs minus 20 strata), whereas the predicted edof are computed using Equation (4), which varies from sample to sample, but it is same for all population variables depending on whether a full-population of domain mean is being estimated. Table 1 displays the summary statistics for the edof(v). All summaries in this paper were computed using the PROC MEANS procedure within the SAS software (SAS Institute Inc, 2015). Estimated probability-sampling means and variances were computed using the DESCRIPT procedure in the SUDAAN language (RTI International, 2012) before being summarized using SAS.

Table 1.Summary statistics for the (contrafactual) effective degrees of freedom measure (edof(v))
MEAN MEDIAN MIN P5 Q1 Q3 P95 MAX
Population 31.21 31.20 25.31 28.46 30.09 32.39 33.96 37.56
Domain 20.58 20.53 14.29 17.44 19.21 21.85 23.94 24.92

Tables 2 and 3 display many of the empirical results of the simulation including those of the two-sided 95th and 90th percentiles of the test statistic z=|ty(r)Ty|/v1/2, denoted by z95 and z90. Each of these percentiles has its own empirical dof (i.e., the degrees of freedom that would produce that value of z were the ui+ normal, independent, and identically distributed). These are denoted ED95(v) and ED90(v) respectively (they were computed using the TINV function in SAS).

For comparison purposes, the z95 and z90 values for the nominal 60 degrees of freedom (rounding to two decimal places) are 2.00 and 1.67. For 31.2 degrees of freedom (the mean of the full-population predicted edof in Table 1), they are 2.04 and 1.70. For 20.6 the mean of the domain prediction edof), they are 2.08 and 1.72. Under normality (infinite degrees of freedom), z95 is 1.96, and z90 is 1.645.

Table 2.Some empirical results for the population variables (edof(v) ≈ 31.2)
Variable ERB(ty) ERB(v) ESD(v) z95 ED95(v) z90 ED90(v)
RMT85 0.05 0.96 44.60 2.03 36.1 1.69 37.5
ME84 -0.07 -0.47 23.61 2.10 18.0 1.73 18.3
P85 -0.17 -2.21 33.35 2.12 15.8 1.72 22.5
REV84 0.09 1.23 19.87 2.06 23.9 1.71 23.5
PRT/P85 -0.06 0.78 13.97 2.19 11.5 1.77 13.2

Notes: EDX = the empirical degrees of freedom of the value zX at the Xth percentile; ERB = the empirical relative bias; ESD = the empirical Satterthwaite (approximate) degrees of freedom; zX = the X percentile of the absolute difference between the estimator and estimand divided by the estimated standard error of the estimator.

Table 3.Some empirical results for the domain variables (edof(v) ≈ 20.6)
Variable ERB(ty,g) ERB(v) ESD(v) z95 ED95(v) z90 ED90(v)
RMT85 0.07 -0.10 26.84 2.10 18.0 1.74 17.5
ME84 0.12 0.19 26.32 2.09 20.1 1.73 18.0
P85 0.07 -2.01 27.46 2.09 19.9 1.73 18.2
REV84 -0.05 -0.91 32.39 2.10 18.0 1.72 22.0
PRT/P85 -0.04 -1.06 8.14 2.42 6.3 1.89 7.1

Notes: EDX = the empirical degrees of freedom of the value zX at the Xth percentile; ERB = the empirical relative bias; ESD = the empirical Satterthwaite (approximate) degrees of freedom; zX = the X percentile of the absolute difference between the estimator and estimand divided by the estimated standard error of the estimator.

Tables 2 and 3 also show that, unlike their predictions, the empirical degrees of freedom, whether measured by ESD, ED95, or ED90 depend on the variable. That is likely because each variable has a different amount of intra-stratum and intra-PSU correlation. The predictions in Table 1, while stable, are not that good. Still, they are clearly better than the nominal measure (60) or infinity. Surprisingly, although developed to predict the Satterthwaite degrees of freedom, which are only supposed to be approximations, the edof appear to be closer to the empirical confidence-interval degrees of freedom (ED95 and ED90). Also, surprising, is that although the edof from the domain means is smaller than that of the population mean, empirically the dof measures for domain means are usually smaller—but are sometimes higher—than their full-population analogs, depending on the variable and measure.

There is no obvious explanation why the empirical measures of the degrees of freedom for RMT/P85 were so much lower than for the other four variables investigated. If anything, the estimated population and domain means for this variable were less influenced by stratification and clustering than the other four (not shown). How stratification, clustering, and weight variability affect empirical dof is an area for future study.

Discussion

That the predicted edof measure produced by Equation (4) is not a complete success should not surprise anyone. It is based on the contrafactual assumption that v can be treated as a chi-squared distribution and that the variable whose mean is being estimated has no intra-stratum nor intra-cluster correction. None of those assumptions are correct for the variables in the MU281 data set being analyzed. In addition, the Satterthwaite formula, edof(v) = 2/relvar(v), is only an approximation when v is not chi-squared. Finally, an estimated mean can itself have a skewness and kurtosis that affect building appropriate CIs around them. These characteristics are extremely difficult to measure with a probability sample. Valliant & Rust (2010) attempt incorporating stratum-level kurtosis terms into a degrees of freedom measure with little success. Nearly unbiased kurtosis estimation requires four sampled PSUs in every stratum. Kott (2020) discusses the potential impact of skewness on CIs, an impact greater on an interval’s symmetry than its size, which is why only two-sided intervals were evaluated here. A nearly unbiased skewness measure requires at least three sampled PSUs in every stratum.

Despite Equation (4) failing to predict the empirical edof, however they are measured, employing the equation when making statistical inferences is a realistic alternative to subtracting the number of sampled PSUs from the number of strata, which is the conventional practice. The equation provides a useful measure of the potential loss in efficiency from using a probability-sampling variance estimator. Although it is based on simple model assumptions, those contrafactual assumptions are more reasonable that then implicit assumption justifying the conventional measure of the edof.

Some speculations about extensions beyond those situations covered in the body of the text are in order. It may often be reasonable to use the contrafactual edof measure in Equation (4) when the weights in a weighted estimator like ty,g contain adjustments for nonresponse and coverage and when the design strata or design PSUs in Equation (2) are replaced by variance strata or PSUs (i.e., simplifications used in practice when, for example, there are less than two design PSUs in a design stratum). This assertion assumes that the difference ty,gTy,g can be put in the form ΣkSwkek. For example, with linear calibration weighting (Kott, 2015), ek=ykxk(jSdjxjxTj)1jSdjyj, where xk is a vector of calibration variables, and wk=dk[1+(jUxjΣjSdjxj)(jSdjxjxTj)1xk]. Using Equation (4) is most reasonable under the simple linear model for yk,yk=xTkβ+εk, where the εk (which are nearly equal to the ek ) are independent and identically distributed normal random variables.

The error in a single regression coefficient can also be put in the form ΣkSwkek, where the wk are more complicated expressions than the sampling weights. The wk for the jth coefficient of a probability-sampling linear regression estimator b=(jSdjxjxTj)1ΣjSdjxjyj is the jth row of (ΣjSdjxjxTj)1dkxk, and ek=ykxkb, which is nearly equal to εk in the linear model, yk=xTkβ+εk.

Similarly, the wk for the jth coefficient of a of a probability-sampling logistic-regression estimator or to any b that solves an estimating equation like ΣkSdkxk[ykf(xTkb)]=0, where f(.) is scalar and monotonic, is the jth row of (ΣjSdjf(xTjb)xjxTj)1dkxk, and ek=yk f(xTkb). This suggests that using Equation (4) as a reasonable, if imperfect, measure of the edof for the variance estimator of a regression coefficient, linear or otherwise. Because this measure will usually vary across the coefficients of a regression estimator, using a Bonferroni-adjusted t test (1974), rather than a probability-sampling F test (for examples, see RTI International, 2012, p. 217) may be advisable for (imperfectly) testing a joint hypothesis involving multiple coefficients.

Some readers may be wondering about the edof measure for the variance estimator of the estimated difference between two domain means. Expressing the estimated difference between two domain means as a linear regression coefficient is simple and has been discussed previously in this report.