Saturation point analysis

Saturation point analysis of multinomial responses from a survey using standard errors of the sample proportions for the responses.

Usage

satpt(
  y,
  by,
  exclude = c(NA, NaN),
  alpha = 0.05,
  threshold = 0.025,
  dimnames = NULL,
  ...
)

Arguments

y: Multinomial responses collected and being examined for saturation. See the Details section for valid R data objects and how they are handled.
by: Values indicating when the multinomial responses (y) were collected. See the Details section for when to specify this argument.
exclude: Vector of values that should be excluded in y and by. Generally, this should be used to denote missing values. Default is NA and NaN.
alpha: Significance level for the test for independence between y and by. Default is 0.05.
threshold: Saturation threshold applied to the maximum standard error of the sample proportions. Default is 0.025; the threshold must be less than or equal to 0.25.
dimnames: Character vector of names for y and by used when displaying the contingency table, sample proportions, and standard error matrices. When dimnames is an unnamed vector, the first entry should be the name of the y variable and the second entry should be the name of the by variable. If dimnames is a named vector, the order of the values does NOT matter, as long as the elements are named "y" and "by". Default is NULL.
...: Additional arguments passed to stats::chisq.test() or stats::fisher.test() for more control over the test for independence. The x and y arguments of stats::chisq.test() and stats::fisher.test() should not be specified, because the y and by arguments above are used to build the contingency matrix on which the test for independence is performed.

Value

An object with S3 class "satpt" containing 12 elements. The returned elements of a "satpt" object are based on the response item that had the largest standard error; the which_saturation element indicates which response item that was. This behavior matters most when determining saturation for select all apply questions.

threshold: Saturation threshold applied to the standard errors of the sample proportions.
saturation: A logical value indicating whether the response item with the largest standard error has achieved saturation given the defined threshold. TRUE indicates that saturation has been achieved; FALSE indicates that saturation has not been achieved and more data is needed.
which_saturation: A character value indicating which collection of responses within y determined whether saturation was achieved. Generally, this is only of importance when examining select all apply questions. For multiple-choice questions, the returned value should just be the object name.
counts: A matrix object containing the observed cell counts of the contingency table created by y and, if provided, by.
phat: A matrix object containing the row-wise sample proportions for the observed contingency table (counts).
se: A matrix object containing the standard errors for the calculated sample proportions (phat).
pooled_se: A logical value indicating whether pooled standard errors were calculated due to the presence of response bias.
alpha: Significance level for the test for independence.
test: A htest object produced by stats::chisq.test() or stats::fisher.test() containing the results from the test for independence.
n: Total number of observations with a response provided.
total: A data.frame object with 4 variables describing the overall collected sample. The categories variable provides the unique categories listed in y, while counts, phat, and se provide the overall cell counts, sample proportions, and standard errors for the categories, respectively. How the standard errors for the overall sample proportions are calculated depends on whether response bias is present, as described in the Details section.
hindex: A vector of heterogeneity index values for the sample proportions, calculated as a mean absolute deviation.

Details

The by argument should be specified when the responses collected in y are a by-product of the data collection mechanism defined by the by variable. When there is no a priori data collection mechanism defined, the by argument should not be defined. Generally, when the data is collected randomly or collected during the first data collection period, there is "no a priori data collection mechanism."

The parameters y and by may be a vector, factor, matrix, data.frame, data.table, tibble, or list. When a parameter is a list object, each element of the list should be of equal length; otherwise, missing values will be appended to the end of the shorter elements so that they match the longest element. If y or by are factors, the underlying order of the factor levels is ignored when the contingency table is created. The char_matrix() function is used within satpt() to coerce the values of y and by to character so that value types are consistent.
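The padding and coercion described above can be pictured with a few lines of base R. This is only a sketch of the idea, not the actual implementation of char_matrix(); the list y_list and its element names q1 and q2 are made up for illustration.

```r
# Hypothetical list input whose elements have unequal lengths
y_list <- list(q1 = factor(c("A", "B", "C")), q2 = c("A", "B"))

# Pad each element with NA up to the length of the longest element,
# coercing everything (including factors) to character along the way
max_len <- max(lengths(y_list))
padded <- lapply(y_list, function(el) {
  c(as.character(el), rep(NA_character_, max_len - length(el)))
})

# Bind the padded elements into a character matrix, one column per element
y_mat <- do.call(cbind, padded)
y_mat
```

Because the factor is converted with as.character(), its level ordering plays no role in the result, which is in line with the behavior described for factors.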

Generally, y should only have more than 1 column when select all apply questions are being examined. See the select-all-apply vignette for more information.

Functionality of satpt() depends on the limited use of stats::ftable(). More specifically, y, by, and exclude are directly used with stats::ftable() to create the contingency table of the collected data. The contingency table created by stats::ftable() is converted to a matrix for easier use with other functions. When the dimnames argument is used, it manipulates the attributes of the created ftable, which impacts the dimension names of the resulting matrix.
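As a rough illustration of this tabulation step, the sketch below builds a contingency table with stats::ftable() and converts it to a matrix. The vectors y and by are made-up data, and the calls mirror only the general approach, not necessarily the exact internal code of satpt().

```r
# Hypothetical responses (y) and collection periods (by);
# the NA marks a missing response
y  <- c("A", "B", "A", "C", NA, "B", "A", "C")
by <- c("1", "1", "1", "1", "2", "2", "2", "2")

# Cross-tabulate by (rows) against y (columns), dropping the values
# listed in exclude
ft <- stats::ftable(by, y, exclude = c(NA, NaN))

# Convert the flat table to an ordinary matrix of counts
counts <- as.matrix(ft)
counts
```

The observation with a missing response is left out of the table, so the counts sum to 7 rather than 8.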

Specifying the by argument automatically triggers a test for independence and the calculation of the heterogeneity index of the sample proportions. The test for independence is conducted to determine whether response bias is present, that is, whether the responses in y depend on the data collection periods (by). Fisher's Exact Test, or its \(I \times J\) variant, is used when more than 20% of the expected cell counts are less than 5. Otherwise, Pearson's \(\chi^2\) Test for Independence is used.
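The choice between the two tests can be sketched as follows. The counts are those implied by the simulated example in the Examples section (period sizes of 250 and 100); the variable names and the way the rule is coded are assumptions made for illustration, not the package's internal code.

```r
# Contingency table: rows are collection periods, columns are responses
counts <- matrix(
  c(96, 103, 51,
    15,   5, 80),
  nrow = 2, byrow = TRUE,
  dimnames = list(period = c("1", "2"), response = c("A", "B", "C"))
)

# Expected cell counts under independence:
# (row total * column total) / grand total
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Share of cells whose expected count is below 5
small_share <- mean(expected < 5)

# Apply the 20% rule to pick the test
test <- if (small_share > 0.20) {
  stats::fisher.test(counts)
} else {
  stats::chisq.test(counts)
}

# A p-value below alpha is taken as evidence of response bias
alpha <- 0.05
test$p.value < alpha
```

Here none of the expected counts fall below 5, so Pearson's test is chosen, which agrees with the test reported in the example output.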

When response bias is present, the pooled standard errors of the overall sample proportions for each response item are reported; the pooled standard errors account for the response bias. The heterogeneity index is defined as the mean absolute difference between the sample proportions for each response item within each data collection period (by) and the overall sample proportions for each response item. This index reflects the average deviation of the data collection period proportions from the overall sample proportions. Smaller values indicate that the sample proportions of the data collection periods are more similar. This measure is of importance when response bias is present. When by is not specified, the test for independence is not conducted and the heterogeneity index is not calculated. satpt assumes response bias is only possible when by is specified, so there is no need to check for response bias when by is not specified.
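The heterogeneity index can be worked through by hand using the counts implied by the simulated example in the Examples section. The sketch below follows the definition given above. The pooled standard error formula in the second half is an assumption: it is a stratified combination of the per-period variances that happens to reproduce the overall standard errors shown in the example output, and the package's own calculation may differ.

```r
# Counts implied by the simulated example (period sizes 250 and 100)
counts <- matrix(
  c(96, 103, 51,
    15,   5, 80),
  nrow = 2, byrow = TRUE,
  dimnames = list(period = c("1", "2"), response = c("A", "B", "C"))
)

n_k     <- rowSums(counts)                # sample size per period
phat    <- counts / n_k                   # row-wise sample proportions
overall <- colSums(counts) / sum(counts)  # overall sample proportions

# Heterogeneity index: mean absolute difference between each period's
# proportion and the overall proportion, per response item
hindex <- colMeans(abs(sweep(phat, 2, overall)))
round(hindex, 3)  # A = 0.117, B = 0.181, C = 0.298

# ASSUMED pooled standard error: per-period variances weighted by the
# squared period sizes, then scaled by the total sample size
se_k      <- sqrt(phat * (1 - phat) / n_k)
pooled_se <- sqrt(colSums((n_k * se_k)^2)) / sum(n_k)
round(pooled_se, 4)  # A = 0.0242, B = 0.0231, C = 0.0215
```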

Determination of saturation is based on the response item and/or collection of responses that has the largest standard error. If this largest standard error achieves saturation, then all other categories or responses will achieve saturation as well. For select all apply questions, the collection of responses that has the largest standard error (i.e., a sample proportion closest to 0.5) is used to determine saturation of all responses.
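A minimal sketch of this decision rule, for the case where by is not specified, is given below. The counts are hypothetical, chosen to be roughly consistent with the proportions in the first example of the Examples section, and the comparison is assumed to be "less than or equal to" the threshold; the package may compare differently.

```r
# Hypothetical overall counts for a five-category question
counts <- c("Not at all" = 162, Often = 48, Once = 24,
            Rarely = 236, Sometimes = 170)

n    <- sum(counts)
phat <- counts / n                    # sample proportions
se   <- sqrt(phat * (1 - phat) / n)   # standard error of each proportion

threshold <- 0.025

# The category whose proportion is closest to 0.5 has the largest SE
worst <- names(which.max(se))

# Saturation is judged on the largest standard error only
# (ASSUMPTION: less than or equal to the threshold counts as saturated)
saturation <- max(se) <= threshold

worst
saturation
```

Because the standard error sqrt(p(1 - p)/n) peaks at p = 0.5, checking the single largest value is enough: every other category has a smaller standard error and therefore also falls under the threshold.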

Note

The returned satpt object will contain NULL values for test and hindex when by is not specified. This is done because satpt assumes by is only specified when the data is collected in intervals.

Examples

data(diagnoses)

# Assuming response bias is not a possibility
satpt::satpt(y = diagnoses$q2)
#> Analysis based on: q2 
#> Saturation achieved?  Yes 
#> 
#> Overall Sample Proportions and Standard Errors
#> ==============================================
#>             y: q2
#> Statistics   Not at all  Often   Once Rarely Sometimes
#>   Proportion     0.2531 0.0750 0.0375 0.3688    0.2656
#>   SE             0.0172 0.0104 0.0075 0.0191    0.0175

# Examining saturation given data collected at different times and
# response bias is possible. For this example, response bias is not present,
# so the standard errors will be the same.
satpt::satpt(y = diagnoses$q2, by = diagnoses$wave)
#> Analysis based on: q2 
#> Saturation achieved?  Yes 
#> 
#> Overall Sample Proportions and Standard Errors
#> ==============================================
#>             y: q2
#> Statistics   Not at all  Often   Once Rarely Sometimes
#>   Proportion     0.2531 0.0750 0.0375 0.3688    0.2656
#>   SE             0.0172 0.0104 0.0075 0.0191    0.0175

# Creating an example, where response bias is present.

## Simulating data
prob <- matrix(
  data = c(0.4, 0.4, 0.2, 0.1, 0.1, 0.8),
  nrow = 2, ncol = 3, byrow = TRUE
)
catg <- LETTERS[1:3]
set.seed(123)
dat <- satpt::simulate(
  n = 1, size = c(250, 100), prob = prob, categories = catg
)

## Determining saturation with response bias
res <- satpt::satpt(y = dat$responses1, by = dat$period)
summary(res)
#> 
#> Saturation point analysis of sample proportions
#> ===============================================
#> 
#> Analysis based on: responses1 
#> Saturation achieved? Yes
#> Saturation threshold of 0.025
#> Responses collected from a sample size of 350
#> 
#> Data interval and overall sample proportions
#> ============================================
#>           y: responses1
#> by: period      A      B      C
#>    1       0.3840 0.4120 0.2040
#>    2       0.1500 0.0500 0.8000
#>    Overall 0.3171 0.3086 0.3743
#> 
#> 
#> Data interval and overall standard errors
#> =========================================
#>           y: responses1
#> by: period      A      B      C
#>    1       0.0308 0.0311 0.0255
#>    2       0.0357 0.0218 0.0400
#>    Overall 0.0242 0.0231 0.0215
#> 
#> Pooled standard errors?  Yes 
#> 
#> 	Pearson's Chi-squared test
#> 
#> data:  y: responses1 given by: period
#> X-squared = 110.46, df = 2, p-value < 2.2e-16
#> 
#> Response bias present? Yes
#> Significance level: 0.05
#> 
#> 
#> Heterogeneity index
#> ====================
#>  Categories Index
#>           A 0.117
#>           B 0.181
#>           C 0.298