Not every survey question will contain responses that are mutually
exclusive. Specifically, it is quite common for surveys to include
questions that state, “Select all that apply.” An example of
this type of question can be seen in the first question of the
EIN survey titled Difficult
Diagnoses. So, the focus of this document is to explore how we
can use satpt to determine whether saturation has been achieved
for these types of questions.
We’ll start off by calling library() on
satpt to obtain all the functionality of the package
library(satpt)
#>
#> Attaching package: 'satpt'
#> The following object is masked from 'package:stats':
#>
#>     simulate
and load in the example Difficult Diagnoses data.
data(diagnoses)
When we examine the first six and last six responses to
question one, we see that the selections in this “select all that apply”
question are separated by the vertical pipe character
("|").
head(diagnoses$q1)
#> [1] "Broadrange|Wholegenome|MNGS" "Broadrange|Wholegenome|MNGS"
#> [3] "Broadrange|MNGS" "Broadrange"
#> [5] "MNGS" "Broadrange|Other"
tail(diagnoses$q1)
#> [1] "Broadrange" "Broadrange"
#> [3] "Broadrange" "Broadrange|MNGS"
#> [5] "Broadrange|Wholegenome|MNGS" "None"
From the original survey, Difficult Diagnoses, we know that the reported values relate to the survey responses in this manner.
- "Broadrange": Broad range 16S rRNA gene sequencing
- "Wholegenome": Whole genome sequencing
- "MNGS": Metagenomic next-generation sequencing (mNGS)
- "None": None of the above
- "Other": Other, specify
So, we need to split the string of responses based on the vertical pipe delimiter for each response.
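Before splitting the real data, a small sketch on a hypothetical vector (not the diagnoses data) shows why the split is done with fixed = TRUE. In a regular expression the pipe means alternation, so without fixed = TRUE the string is not cut at the delimiter as intended. The object names toy, toy_fixed, and toy_regex are made up for this illustration.

```r
# Hypothetical toy responses, including a missing response
toy <- c("Broadrange|MNGS", "None", NA)

# fixed = TRUE treats "|" as a literal character, giving one piece per selection
toy_fixed <- strsplit(x = toy, split = "|", fixed = TRUE)
toy_fixed[[1]]

# Without fixed = TRUE the pattern is read as a regular expression,
# so the result is not the intended two pieces
toy_regex <- strsplit(x = toy, split = "|")
identical(toy_fixed[[1]], toy_regex[[1]])

# A missing response stays missing after the split
is.na(toy_fixed[[3]])
```

Note that a missing response comes back as a single NA, which is what allows the indicator code later on to carry missingness through.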
q1_split <- strsplit(x = diagnoses$q1, split = "|", fixed = TRUE)
With each response to the survey now represented as a character vector
in q1_split, we are going to create five
indicator functions that denote whether each of the options was selected
for a given response to the survey. We are transforming to indicator
functions because the methodology proposed in the paper that accompanies
this package demonstrates that “select all that apply” questions can be
viewed as a collection of Poisson distributions. When each one of these
distributions is conditioned on the sum of the Poisson distributions, the
result is a binomial distribution. So, the code below demonstrates one
way to create these indicator functions for each of the possible
response fields.
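As a brief aside before the indicators are built, the Poisson-to-binomial step can be checked numerically. The sketch below illustrates the general textbook result for two independent Poisson counts; it is not taken from the accompanying paper, and the rates lambda_field and lambda_rest are hypothetical values chosen only for illustration.

```r
# Hypothetical rates: one response field versus all remaining fields combined
lambda_field <- 3
lambda_rest <- 5
n <- 10   # value of the total count we condition on
k <- 0:n  # possible counts for the field of interest

# P(X = k | X + Y = n), built directly from the Poisson mass functions
conditional <- dpois(k, lambda_field) * dpois(n - k, lambda_rest) /
  dpois(n, lambda_field + lambda_rest)

# Binomial mass function with success probability lambda_field / total rate
binomial <- dbinom(
  k,
  size = n,
  prob = lambda_field / (lambda_field + lambda_rest)
)

# The two sets of probabilities agree up to floating point error
all.equal(conditional, binomial)
```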
# Identifying response field selected for each observation
q1 <- lapply(
  X = q1_split,
  FUN = function(x) {
    if (!all(is.na(x))) {
      broadrange <- ifelse(test = "Broadrange" %in% x, yes = 1, no = 0)
      wholegenome <- ifelse(test = "Wholegenome" %in% x, yes = 1, no = 0)
      mngs <- ifelse(test = "MNGS" %in% x, yes = 1, no = 0)
      none <- ifelse(test = "None" %in% x, yes = 1, no = 0)
      other <- ifelse(test = "Other" %in% x, yes = 1, no = 0)
    } else {
      # Missing responses stay missing in every indicator
      broadrange <- NA
      wholegenome <- NA
      mngs <- NA
      none <- NA
      other <- NA
    }
    out <- data.frame(
      broadrange = broadrange,
      wholegenome = wholegenome,
      mngs = mngs,
      none = none,
      other = other
    )
    return(out)
  }
)
# Combine the one-row data frames of indicator values into a single data object
q1 <- do.call(what = "rbind", args = q1)
# Turning indicator functions into factors
for (j in seq_len(ncol(q1))) {
  q1[, j] <- factor(x = q1[, j], levels = c(0, 1), labels = c("No", "Yes"))
}
The “select all that apply” responses are now transformed into a
collection of indicator functions, where the value "Yes"
denotes that the response field was selected and "No" denotes
that the response field was not selected. These responses
were collected in different data collection periods due to non-response
in previous collection periods. Thus, we must use the wave
variable in the diagnoses data, so that we may conduct
saturation point analysis on each of the response fields while
controlling for potential response bias.
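The indicator construction described above can also be written more compactly. The sketch below is one possible alternative, run on a small hypothetical vector rather than the diagnoses data; the names toy, toy_split, fields, and toy_ind are made up for this illustration and are not part of satpt.

```r
# Hypothetical toy responses, not the diagnoses data
toy <- c("Broadrange|MNGS", "None", NA, "Other|Broadrange")
toy_split <- strsplit(x = toy, split = "|", fixed = TRUE)
fields <- c("Broadrange", "Wholegenome", "MNGS", "None", "Other")

# One logical column per response field; a missing response gives NA
toy_ind <- vapply(
  X = fields,
  FUN = function(f) {
    vapply(
      X = toy_split,
      FUN = function(x) if (all(is.na(x))) NA else f %in% x,
      FUN.VALUE = logical(1)
    )
  },
  FUN.VALUE = logical(length(toy_split))
)

# Convert to a data frame of "No"/"Yes" factors with lower-case column names
toy_ind <- as.data.frame(toy_ind)
names(toy_ind) <- tolower(fields)
toy_ind[] <- lapply(
  X = toy_ind,
  FUN = function(col) {
    factor(x = col, levels = c(FALSE, TRUE), labels = c("No", "Yes"))
  }
)
str(toy_ind)
```

Looping over the response fields rather than over the respondents keeps the list of fields in one place, which may be convenient when a question has many options.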
str(q1)
#> 'data.frame': 643 obs. of 5 variables:
#> $ broadrange : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 2 2 2 ...
#> $ wholegenome: Factor w/ 2 levels "No","Yes": 2 2 1 1 1 1 1 1 1 1 ...
#> $ mngs : Factor w/ 2 levels "No","Yes": 2 2 2 1 2 1 1 1 2 2 ...
#> $ none : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
#> $ other : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
Using the satpt::satpt()
function on each of the indicator functions of the response fields, we
will perform the saturation point analysis.
res <- satpt::satpt(
y = q1,
by = diagnoses$wave,
dimnames = c(
"by" = "Data collection period",
"y" = "Responses to survey"
)
)
Finally, in the results below, we see that each response field from the “select all that apply” question achieved saturation with a saturation threshold of 0.025. Thus, this question has achieved saturation.
summary(res)
#>
#> Saturation point analysis of sample proportions
#> ===============================================
#>
#> Analysis based on: mngs
#> Saturation achieved? Yes
#> Saturation threshold of 0.025
#> Responses collected from a sample size of 640
#>
#> Data interval and overall sample proportions
#> ============================================
#> y: Responses to survey
#> by: Data collection period       No     Yes
#> 1                            0.5586  0.4414
#> 2                            0.5890  0.4110
#> 3                            0.6000  0.4000
#> Overall                      0.5766  0.4234
#>
#>
#> Data interval and overall standard errors
#> =========================================
#> y: Responses to survey
#> by: Data collection period       No     Yes
#> 1                            0.0276  0.0276
#> 2                            0.0407  0.0407
#> 3                            0.0376  0.0376
#> Overall                      0.0195  0.0195
#>
#> Pooled standard errors? No
#>
#> Pearson's Chi-squared test
#>
#> data: y: Responses to survey given by: Data collection period
#> X-squared = 0.90182, df = 2, p-value = 0.637
#>
#> Response bias present? No
#> Significance level: 0.05
#>
#>
#> Heterogeneity index
#> ====================
#> Categories Index
#> No 0.0179
#> Yes 0.0179
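The proportions, standard errors, and chi-squared test in the summary above can be approximated with base R. The sketch below is a rough cross-check, not a description of how satpt computes its results: the counts are back-calculated from the rounded proportions and standard errors printed above, and the use of the usual binomial standard error formula is an assumption that happens to match the printed values.

```r
# Counts back-calculated from the rounded output above (an assumption,
# not read from the diagnoses data directly)
counts <- matrix(
  c(181, 143,
     86,  60,
    102,  68),
  ncol = 2,
  byrow = TRUE,
  dimnames = list(wave = c("1", "2", "3"), mngs = c("No", "Yes"))
)

# Sample proportions and binomial standard errors within each wave
n_wave <- rowSums(counts)
p_wave <- prop.table(counts, margin = 1)
se_wave <- sqrt(p_wave * (1 - p_wave) / n_wave)

# Overall sample proportions and standard errors
p_overall <- colSums(counts) / sum(counts)
se_overall <- sqrt(p_overall * (1 - p_overall) / sum(counts))

round(p_wave, 4)
round(se_wave, 4)
round(p_overall, 4)
round(se_overall, 4)

# Pearson's chi-squared test of response field by data collection period;
# the statistic should land close to the reported X-squared = 0.90182
chi <- chisq.test(counts)
chi
```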