Stratified Sampling

Stratified sampling with equal/unequal probabilities.

Strata(x, stratanames = NULL, size,
       method = c("srswor", "srswr", "poisson", "systematic"),
       pik, description = FALSE)

Arguments

x: a data frame or a matrix; its number of rows is n, the population size.
stratanames: vector of stratification variables.
size: vector of stratum sample sizes (in the order in which the strata are given in the input data set).
method: method to select units; implemented are: a) simple random sampling without replacement ("srswor"), b) simple random sampling with replacement ("srswr"), c) Poisson sampling ("poisson"), d) systematic sampling ("systematic") (default is "srswor").
pik: vector of inclusion probabilities, or auxiliary information used to compute them; this argument is only used for unequal probability sampling (Poisson and systematic). If auxiliary information is provided, the function uses the inclusionprobabilities function to compute these probabilities. If the method is "srswr" and the sample size is larger than the population size, this vector is normalized to one.
description: if TRUE, a message is printed giving the number of selected units and the number of units in the population. The default is FALSE.
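
The pik argument accepts either ready-made inclusion probabilities or an auxiliary size variable. As a rough illustration of how a size variable can be turned into inclusion probabilities proportional to size, here is a minimal base-R sketch. The helper name pps_probs is hypothetical and this is not the package's inclusionprobabilities code; it only assumes a strictly positive auxiliary vector and a sample size no larger than the number of units.

```r
# Hypothetical helper (not part of the package): inclusion probabilities
# proportional to a strictly positive auxiliary variable `a`, for a
# fixed sample size `n`.
pps_probs <- function(a, n) {
  stopifnot(all(a > 0), n <= length(a))
  pik <- n * a / sum(a)
  # A unit whose probability would exceed 1 is selected with certainty;
  # the remaining sample size is redistributed over the other units.
  while (any(pik > 1)) {
    certain <- pik >= 1
    pik[certain] <- 1
    pik[!certain] <- (n - sum(certain)) * a[!certain] / sum(a[!certain])
  }
  pik
}

a <- c(10, 20, 30, 40)
pik <- pps_probs(a, n = 2)
pik       # each value is 2 * a / sum(a)
sum(pik)  # the probabilities sum to the sample size
```

With a fixed-size design the probabilities sum to the requested sample size, which is a quick sanity check on any vector passed as pik.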

Value

The function produces an object, which contains the following information:

id: the identifier of the selected units.
stratum: the unit stratum.
prob: the final unit inclusion probability.

Author

Andri Signorell <andri@signorell.net>
rewritten based on the ideas of Yves Tille <yves.tille@unine.ch> and Alina Matei <alina.matei@unine.ch>

Examples

# Example from An and Watts (New SAS procedures for Analysis of Sample Survey Data)
# generates artificial data (a 235 x 3 data frame with columns state, region, income).
# the variable "state" has 2 categories ('nc' and 'sc').
# the variable "region" has 3 categories (1, 2 and 3).
# the sampling frame is stratified by region within state.
# the income variable is randomly generated

m <- rbind(matrix(rep("nc",165), 165, 1, byrow=TRUE),
           matrix(rep("sc", 70), 70, 1, byrow=TRUE))
m <- cbind.data.frame(m, c(rep(1, 100), rep(2,50), rep(3,15),
                      rep(1, 30), rep(2, 40)), 1000 * runif(235))
names(m) <- c("state", "region", "income")

# computes the population stratum sizes
table(m$region, m$state)
#>    
#>      nc  sc
#>   1 100  30
#>   2  50  40
#>   3  15   0

# there are 5 cells with non-zero values
# one draws 5 samples (1 sample in each stratum)
# the sample stratum sizes are 10,5,10,4,6, respectively
# the method is 'srswor' (equal probability, without replacement)

s <- Strata(m, c("region", "state"), size=c(10, 5, 10, 4, 6), method="srswor")

# extracts the observed data
data.frame(income=m[s$id, "income"], s)
#>          income state region  income.1 stratum size  id
#> 1.1   431.79468    nc      1 431.79468       1   10   1
#> 1.80  256.99471    nc      1 256.99471       1   10  80
#> 1.73   27.68708    nc      1  27.68708       1   10  73
#> 1.32  785.78417    nc      1 785.78417       1   10  32
#> 1.15  982.44097    nc      1 982.44097       1   10  15
#> 1.71  427.83172    nc      1 427.83172       1   10  71
#> 1.66  507.13136    nc      1 507.13136       1   10  66
#> 1.81  249.62570    nc      1 249.62570       1   10  81
#> 1.13  451.99868    nc      1 451.99868       1   10  13
#> 1.57  178.19258    nc      1 178.19258       1   10  57
#> 2.130 460.58674    nc      2 460.58674       2    5 130
#> 2.135 469.94421    nc      2 469.94421       2    5 135
#> 2.117 991.83105    nc      2 991.83105       2    5 117
#> 2.106 773.62990    nc      2 773.62990       2    5 106
#> 2.109 192.33168    nc      2 192.33168       2    5 109
#> 3.160 512.05001    nc      3 512.05001       3   10 160
#> 3.163 708.51129    nc      3 708.51129       3   10 163
#> 3.161 771.14369    nc      3 771.14369       3   10 161
#> 3.154 853.65094    nc      3 853.65094       3   10 154
#> 3.157 411.88021    nc      3 411.88021       3   10 157
#> 3.152 390.29183    nc      3 390.29183       3   10 152
#> 3.158 636.70828    nc      3 636.70828       3   10 158
#> 3.151 663.38892    nc      3 663.38892       3   10 151
#> 3.164 999.05023    nc      3 999.05023       3   10 164
#> 3.156 533.18085    nc      3 533.18085       3   10 156
#> 4.180 464.97943    sc      1 464.97943       4    4 180
#> 4.175 520.83494    sc      1 520.83494       4    4 175
#> 4.174 569.91573    sc      1 569.91573       4    4 174
#> 4.186  35.69785    sc      1  35.69785       4    4 186
#> 5.204 151.80241    sc      2 151.80241       5    6 204
#> 5.198 424.94331    sc      2 424.94331       5    6 198
#> 5.216 516.31473    sc      2 516.31473       5    6 216
#> 5.222 833.44388    sc      2 833.44388       5    6 222
#> 5.201  16.52919    sc      2  16.52919       5    6 201
#> 5.228 307.86605    sc      2 307.86605       5    6 228

# see the result using a contingency table
table(s$region, s$state)
#>    
#>     nc sc
#>   1 10  4
#>   2  5  6
#>   3 10  0


# The same data as in Example 1
# the method is 'systematic' (unequal probability, without replacement)
# the selection probabilities are computed using the variable 'income'
s <- Strata(m,c("region", "state"), size=c(10, 5, 10, 4, 6),
            method="systematic", pik=m$income)

# extracts the observed data
data.frame(income=m[s$id, "income"], s)
#>          income state region  income.1 stratum size  id
#> 1.13  451.99868    nc      1 451.99868       1   10  13
#> 1.27   20.54442    nc      1  20.54442       1   10  27
#> 1.7   552.14334    nc      1 552.14334       1   10   7
#> 1.48  934.30059    nc      1 934.30059       1   10  48
#> 1.62  177.50997    nc      1 177.50997       1   10  62
#> 1.50  887.68151    nc      1 887.68151       1   10  50
#> 1.66  507.13136    nc      1 507.13136       1   10  66
#> 1.80  256.99471    nc      1 256.99471       1   10  80
#> 1.97  270.36405    nc      1 270.36405       1   10  97
#> 1.100 888.16006    nc      1 888.16006       1   10 100
#> 2.134 894.79894    nc      2 894.79894       2    5 134
#> 2.145 486.84010    nc      2 486.84010       2    5 145
#> 2.142 601.15380    nc      2 601.15380       2    5 142
#> 2.117 991.83105    nc      2 991.83105       2    5 117
#> 2.107  36.20353    nc      2  36.20353       2    5 107
#> 3.154 853.65094    nc      3 853.65094       3   10 154
#> 3.165 135.38640    nc      3 135.38640       3   10 165
#> 3.163 708.51129    nc      3 708.51129       3   10 163
#> 3.158 636.70828    nc      3 636.70828       3   10 158
#> 3.159 914.26539    nc      3 914.26539       3   10 159
#> 3.151 663.38892    nc      3 663.38892       3   10 151
#> 3.162 383.89029    nc      3 383.89029       3   10 162
#> 3.161 771.14369    nc      3 771.14369       3   10 161
#> 3.153 876.68424    nc      3 876.68424       3   10 153
#> 3.152 390.29183    nc      3 390.29183       3   10 152
#> 4.191 285.42066    sc      1 285.42066       4    4 191
#> 4.167  58.01234    sc      1  58.01234       4    4 167
#> 4.168 141.61709    sc      1 141.61709       4    4 168
#> 4.181  27.77380    sc      1  27.77380       4    4 181
#> 5.209 338.79154    sc      2 338.79154       5    6 209
#> 5.223 947.00049    sc      2 947.00049       5    6 223
#> 5.221 852.56993    sc      2 852.56993       5    6 221
#> 5.213 632.28781    sc      2 632.28781       5    6 213
#> 5.230 556.82973    sc      2 556.82973       5    6 230
#> 5.217 106.51764    sc      2 106.51764       5    6 217

# see the result using a contingency table
table(s$region, s$state)
#>    
#>     nc sc
#>   1 10  4
#>   2  5  6
#>   3 10  0
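
For readers who want to see what the "srswor" method amounts to, the following base-R sketch draws a simple random sample without replacement inside each stratum and attaches the inclusion probability n_h / N_h. It is an illustration under stated assumptions, not the implementation of Strata: the names frame, key, rows_by_stratum, n_h, picked and smp are made up here, and the strata are assumed to be ordered by first appearance in the data, matching the order of the size vector in the first example.

```r
set.seed(1)  # only to make the sketch reproducible

# Hypothetical frame mirroring the example data: 235 units, 5 non-empty strata.
frame <- data.frame(
  state  = rep(c("nc", "sc"), times = c(165, 70)),
  region = c(rep(1, 100), rep(2, 50), rep(3, 15), rep(1, 30), rep(2, 40))
)
frame$income <- 1000 * runif(nrow(frame))

# Row indices per stratum, in order of first appearance in the data.
key <- paste(frame$region, frame$state)
rows_by_stratum <- split(seq_len(nrow(frame)), factor(key, levels = unique(key)))
n_h <- c(10, 5, 10, 4, 6)  # stratum sample sizes, same order as the strata

# Draw n_h rows without replacement inside each stratum.
# sample() is applied to positions so a one-row stratum is handled correctly.
picked <- Map(function(rows, n) rows[sample(length(rows), n)],
              rows_by_stratum, n_h)

smp <- frame[unlist(picked, use.names = FALSE), ]
# Under srswor the inclusion probability in stratum h is n_h / N_h.
smp$prob <- rep(n_h / lengths(rows_by_stratum), times = n_h)

# Sample counts per cell
table(smp$region, smp$state)
```

Because every unit in stratum h carries probability n_h / N_h, the inverse probabilities of the sampled units add up to the population size, which is a convenient check on the bookkeeping.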
