Saturation point analysis of multinomial responses from a survey using standard errors of the sample proportions for the responses.
Usage
satpt(
  y,
  by,
  exclude = c(NA, NaN),
  alpha = 0.05,
  threshold = 0.025,
  dimnames = NULL,
  ...
)
Arguments
- y
Multinomial responses collected and being examined for saturation. See the Details section for valid R data objects and how they are handled.
- by
Values indicating when the multinomial responses (y) were collected. See the Details section for when to specify this argument.
- exclude
Vector of values that should be excluded from y and by. Generally, this should be used to denote missing values. Default is NA and NaN.
- alpha
Significance level for the test for independence between y and by. Default is 0.05.
- threshold
Saturation threshold applied to the maximum standard error of the sample proportions. Default is 0.025; the threshold must be less than or equal to 0.25.
- dimnames
Character vector of names for y and by when displaying the contingency table, sample proportions, and standard error matrices. When dimnames is an unnamed vector, the first entry should be the name of the y variable and the second entry should be the name of the by variable. If dimnames is a named vector, the order of the values does not matter, as long as the elements are named "y" and "by". Default is NULL.
- ...
Additional arguments passed to stats::chisq.test() or stats::fisher.test() for more control over the test for independence. The x and y arguments of stats::chisq.test() or stats::fisher.test() should not be specified, because the by and y arguments above are used to create the contingency matrix on which the test for independence is performed.
Value
An object with S3 class "satpt" containing 12 elements. The
returned elements in a "satpt" object are based on the response item that
had the largest standard error. The which_saturation element
indicates which response item had the largest standard error. This behavior
of satpt is most important when determining saturation for select all
apply questions.
- threshold
Saturation threshold applied to the standard errors of the sample proportions.
- saturation
A logical value indicating whether the response item with the largest standard error has achieved saturation given the defined threshold. A value of TRUE indicates that saturation has been achieved, while a value of FALSE indicates that saturation was not achieved and more data is needed to achieve saturation.
- which_saturation
A character value indicating which collection of responses within y determined saturation achievement. Generally, this is only of importance when examining select all apply questions. For a multiple choice type of question, the returned value should just be the object name.
- counts
A matrix object containing the observed cell counts of the contingency table created by y and, if provided, by.
- phat
A matrix object containing the row-wise sample proportions for the observed contingency table (counts).
- se
A matrix object containing the standard errors for the calculated sample proportions (phat).
- pooled_se
A logical value indicating whether pooled standard errors were calculated due to the presence of response bias.
- alpha
Significance level for the test for independence.
- test
An htest object produced by stats::chisq.test() or stats::fisher.test() containing the results from the test for independence.
- n
Total number of observations with a response provided.
- total
A data.frame object with 4 variables describing the overall collected sample. The categories variable provides the unique categories listed in y, while counts, phat, and se provide the overall cell counts, sample proportions, and standard errors for the categories, respectively. The standard errors reported for the overall sample proportions are calculated based on the presence of response bias, which is described in the Details section.
- hindex
A vector of heterogeneity index values for the sample proportions, calculated as a mean absolute deviation.
Details
The by argument should be specified when the responses collected
in y are a by-product of the data collection mechanism defined by the by
variable. When there is no a priori data collection mechanism defined, the
by argument should not be defined. Generally, when the data is collected
randomly or collected during the first data collection period, there is "no
a priori data collection mechanism."
The parameters y and by may be a vector, factor, matrix,
data.frame, data.table, tibble, or list. When a parameter is a
list object, each element of the list should be of equal length;
otherwise, missing values are appended to the end of the shorter elements
so that they match the longest element. If y or by are factors, the
underlying order of the factor levels is ignored when the contingency table
is created. The char_matrix() function is used within satpt() to
coerce the values of y and by to character so that the value types are
consistent.
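As a rough illustration of the padding and coercion described above, the following base R sketch pads the elements of a list to a common length with NA and coerces everything to character. The helper name pad_to_character_matrix() is hypothetical; it is not the package's char_matrix() and only mimics the documented behavior.

```r
# Hypothetical helper (not part of satpt): pad list elements with NA to the
# length of the longest element and return a character matrix.
pad_to_character_matrix <- function(x) {
  max_len <- max(lengths(x))
  padded <- lapply(x, function(el) {
    el <- as.character(el)   # factors become their labels, order is dropped
    length(el) <- max_len    # extending a vector fills the tail with NA
    el
  })
  do.call(cbind, padded)
}

# Illustrative input: two elements of unequal length, one of them a factor
responses <- list(
  q1 = c("A", "B", "C"),
  q2 = factor(c("yes", "no"))
)
m <- pad_to_character_matrix(responses)
m
```

The shorter element q2 ends with a missing value, which the default exclude = c(NA, NaN) would then drop from the contingency table.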
Generally, y should only have more than 1 column when select all apply
questions are being examined. See the select-all-apply vignette for more
information.
Functionality of satpt() depends on the limited use of stats::ftable().
More specifically, y, by, and exclude are directly used with
stats::ftable() to create the contingency table of the collected data. The
contingency table created by stats::ftable() is converted to a matrix for
easier use with other functions. When the dimnames argument is used, it
manipulates the attributes of the created ftable, which impacts the
dimension names of the resulting matrix.
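The contingency table step can be sketched with base R alone. This is only an approximation of what the paragraph above describes: the vectors are made up, and placing by in the rows and y in the columns is an assumption based on the layout printed in the Examples section.

```r
# Illustrative data: one missing response and one missing collection period
by <- c(1, 1, 2, 2, 2, NA)
y  <- c("A", "B", "A", NA, "B", "A")

# Cross-tabulate, dropping the values listed in `exclude`
ft <- stats::ftable(by = by, y = y, exclude = c(NA, NaN))

# Renaming the variables by editing the ftable attributes
# (the names "period" and "response" are hypothetical)
names(attr(ft, "row.vars")) <- "period"
names(attr(ft, "col.vars")) <- "response"

# Convert to a plain matrix for easier downstream use
counts <- as.matrix(ft)
counts
```

Observations with a missing value in either variable are not counted, so only four of the six observations appear in the matrix.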
Specifying the by argument automatically calls for a test for
independence and for the heterogeneity index of the sample proportions to be
calculated. The test for independence is conducted to determine whether
response bias is present, that is, whether the responses in y depend on the
data collection periods (by). Fisher's Exact Test, or its \(I \times J\)
variant, is used when more than 20% of the expected cell counts are less
than 5. Otherwise, Pearson's \(\chi^2\) Test for Independence is used.
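The selection rule for the test can be sketched in base R as follows. The function name choose_independence_test() is hypothetical and the code is not taken from the package; it only applies the 20% rule stated above. The larger table uses the counts implied by the proportions and sample sizes printed in the simulated example of the Examples section, and the small table is made up.

```r
# Hypothetical helper: pick the test from the share of small expected counts
choose_independence_test <- function(counts) {
  # Expected cell counts under independence
  expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)
  if (mean(expected < 5) > 0.20) {
    stats::fisher.test(counts)
  } else {
    stats::chisq.test(counts)
  }
}

# Counts implied by the simulated example (periods in rows, categories in columns)
large <- rbind("1" = c(A = 96, B = 103, C = 51),
               "2" = c(A = 15, B = 5,   C = 80))
# Made-up sparse table where every expected count is below 5
small <- rbind("1" = c(A = 3, B = 1, C = 2),
               "2" = c(A = 1, B = 3, C = 2))

large_test <- choose_independence_test(large)
small_test <- choose_independence_test(small)
large_test$method
small_test$method
```

A p-value below alpha from either test is what the documentation refers to as response bias being present.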
When response bias is present, the pooled standard errors of the overall
sample proportions for each response item are reported; the pooled
standard errors account for the response bias. The heterogeneity index is
defined as the mean absolute difference between the sample proportions for
each response item within each data collection period (by) and the
overall sample proportions for each response item. This index reflects
the average deviation of the data collection period proportions from the
overall sample proportions. Smaller values indicate that the sample
proportions of the data collection periods are closer to the overall sample
proportions. This measure is of importance when response bias is present.
When by is not specified, the test for independence is not conducted and the
heterogeneity index is not calculated. satpt assumes response bias is only
possible when by is specified. Thus, there is no need to check for response
bias when by is not specified.
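The heterogeneity index can be reproduced by hand. The sketch below is base R, not the package's internal code; it uses the counts implied by the simulated example in the Examples section and takes an unweighted mean over the collection periods, which matches the index values printed there.

```r
# Counts implied by the simulated example (periods in rows)
counts <- rbind("1" = c(A = 96, B = 103, C = 51),
                "2" = c(A = 15, B = 5,   C = 80))

# Row-wise sample proportions for each collection period
phat <- prop.table(counts, margin = 1)

# Overall sample proportions across all periods
overall <- colSums(counts) / sum(counts)

# Mean absolute difference between period and overall proportions
hindex <- colMeans(abs(sweep(phat, 2, overall)))
round(hindex, 3)
```

Category C has the largest index because its proportion differs most between the two periods (0.204 versus 0.800).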
Determination of saturation is based on the response item or collection of responses that has the largest standard error. If this largest standard error is at or below the threshold, then all other categories or responses have also achieved saturation. For select all apply questions, the collection of responses with the largest standard error (i.e., with a sample proportion closest to 0.5) is used to determine saturation of all responses.
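The saturation check for a single set of responses can be sketched as below. This assumes the usual binomial standard error, sqrt(p(1 - p) / n), which is consistent with the unpooled values printed in the Examples section, and it does not cover the pooled case. The counts are illustrative; they were chosen to be consistent with the proportions printed for q2 and are not taken from the diagnoses data.

```r
# Illustrative counts (an assumption, not the actual diagnoses data)
counts <- c("Not at all" = 162, Often = 48, Once = 24,
            Rarely = 236, Sometimes = 170)
threshold <- 0.025

n    <- sum(counts)
phat <- counts / n

# Standard error of each sample proportion
se <- sqrt(phat * (1 - phat) / n)

# The category closest to 0.5 has the largest standard error
worst <- names(which.max(se))

# Saturation holds when the largest standard error is within the threshold
saturated <- max(se) <= threshold
round(se, 4)
```

Because the standard error shrinks with the square root of n, halving the largest standard error requires roughly four times as many responses.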
Note
The returned satpt object will contain NULL values for test and
hindex when by is not specified. This is done because satpt assumes
by is only specified when the data is collected in intervals.
Examples
data(diagnoses)
# Assuming response bias is not a possibility
satpt::satpt(y = diagnoses$q2)
#> Analysis based on: q2
#> Saturation achieved? Yes
#>
#> Overall Sample Proportions and Standard Errors
#> ==============================================
#> y: q2
#> Statistics Not at all Often Once Rarely Sometimes
#> Proportion 0.2531 0.0750 0.0375 0.3688 0.2656
#> SE 0.0172 0.0104 0.0075 0.0191 0.0175
# Examining saturation given data collected at different times and
# response bias is possible. For this example, response bias is not present,
# so the standard errors will be the same.
satpt::satpt(y = diagnoses$q2, by = diagnoses$wave)
#> Analysis based on: q2
#> Saturation achieved? Yes
#>
#> Overall Sample Proportions and Standard Errors
#> ==============================================
#> y: q2
#> Statistics Not at all Often Once Rarely Sometimes
#> Proportion 0.2531 0.0750 0.0375 0.3688 0.2656
#> SE 0.0172 0.0104 0.0075 0.0191 0.0175
# Creating an example, where response bias is present.
## Simulating data
prob <- matrix(
  data = c(0.4, 0.4, 0.2, 0.1, 0.1, 0.8),
  nrow = 2, ncol = 3, byrow = TRUE
)
catg <- LETTERS[1:3]
set.seed(123)
dat <- satpt::simulate(
  n = 1, size = c(250, 100), prob = prob, categories = catg
)
## Determining saturation with response bias
res <- satpt::satpt(y = dat$responses1, by = dat$period)
summary(res)
#>
#> Saturation point analysis of sample proportions
#> ===============================================
#>
#> Analysis based on: responses1
#> Saturation achieved? Yes
#> Saturation threshold of 0.025
#> Responses collected from a sample size of 350
#>
#> Data interval and overall sample proportions
#> ============================================
#> y: responses1
#> by: period A B C
#> 1 0.3840 0.4120 0.2040
#> 2 0.1500 0.0500 0.8000
#> Overall 0.3171 0.3086 0.3743
#>
#>
#> Data interval and overall standard errors
#> =========================================
#> y: responses1
#> by: period A B C
#> 1 0.0308 0.0311 0.0255
#> 2 0.0357 0.0218 0.0400
#> Overall 0.0242 0.0231 0.0215
#>
#> Pooled standard errors? Yes
#>
#> Pearson's Chi-squared test
#>
#> data: y: responses1 given by: period
#> X-squared = 110.46, df = 2, p-value < 2.2e-16
#>
#> Response bias present? Yes
#> Significance level: 0.05
#>
#>
#> Heterogeneity index
#> ====================
#> Categories Index
#> A 0.117
#> B 0.181
#> C 0.298