
Not every survey question has mutually exclusive responses. In particular, it is quite common for surveys to include questions that state, “Select all that apply.” An example of this type of question is the first question of the EIN survey titled Difficult Diagnoses. This document therefore explores how we can use satpt to determine whether saturation has been achieved for these types of questions.

We’ll start by calling library() on satpt to make the package’s functionality available

library(satpt)
#> 
#> Attaching package: 'satpt'
#> The following object is masked from 'package:stats':
#> 
#>     simulate

and load in the example Difficult Diagnoses data.

data(diagnoses)

When we examine the first six and last six responses to question one, we see that the selections in this “select all that apply” question are separated by the vertical bar character ("|").

head(diagnoses$q1)
#> [1] "Broadrange|Wholegenome|MNGS" "Broadrange|Wholegenome|MNGS"
#> [3] "Broadrange|MNGS"             "Broadrange"                 
#> [5] "MNGS"                        "Broadrange|Other"
tail(diagnoses$q1)
#> [1] "Broadrange"                  "Broadrange"                 
#> [3] "Broadrange"                  "Broadrange|MNGS"            
#> [5] "Broadrange|Wholegenome|MNGS" "None"

From the original survey, Difficult Diagnoses, we know that the reported values correspond to the survey responses as follows.

  • "Broadrange": Broad range 16S rRNA gene sequencing
  • "Wholegenome": Whole genome sequencing
  • "MNGS": Metagenomic next-generation sequencing (mNGS)
  • "None": None of the above
  • "Other": Other, specify

So, for each response, we need to split the string of selections on the vertical bar delimiter.

q1_split <- strsplit(x = diagnoses$q1, split = "|", fixed = TRUE)
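
The argument fixed = TRUE matters here because "|" is the alternation metacharacter in regular expressions, so without it strsplit() would not treat the vertical bar as a literal delimiter. The sketch below uses a small made-up vector (toy_q1, which is not part of the package) to show the literal split, along with a quick tabulation of the labels that can help spot unexpected or misspelled options.

```r
# Toy responses mimicking the format of diagnoses$q1 (hypothetical data)
toy_q1 <- c("Broadrange|MNGS", "None", NA, "Broadrange|Other")

# fixed = TRUE treats "|" as a literal character rather than a regex
toy_split <- strsplit(x = toy_q1, split = "|", fixed = TRUE)

# A missing response stays missing after the split
toy_split[[3]]

# Count how often each label appears across all responses (NAs dropped)
label_counts <- table(unlist(toy_split))
label_counts
```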

With each survey response now represented as a character vector in q1_split, we are going to create five indicator variables, one per option, that denote whether that option was selected in a given response. We transform to indicators because the methodology proposed in the paper that accompanies this package shows that “select all that apply” questions can be viewed as a collection of Poisson distributions. When each of these distributions is conditioned on the sum of the Poisson distributions, the result is a binomial distribution. The code below demonstrates one way to create these indicator variables for each of the possible response fields.
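
As a brief aside, the conditioning step can be illustrated for a single response field. The notation here (counts X_yes and X_no with rates λ_yes and λ_no) is introduced only for illustration and is not taken from the accompanying paper; the derivation is the standard result for two independent Poisson counts.

```latex
X_{\text{yes}} \sim \mathrm{Poisson}(\lambda_{\text{yes}}), \qquad
X_{\text{no}} \sim \mathrm{Poisson}(\lambda_{\text{no}}), \qquad
X_{\text{yes}} \perp X_{\text{no}}

P\left(X_{\text{yes}} = x \mid X_{\text{yes}} + X_{\text{no}} = n\right)
  = \frac{\dfrac{e^{-\lambda_{\text{yes}}} \lambda_{\text{yes}}^{x}}{x!}
          \cdot
          \dfrac{e^{-\lambda_{\text{no}}} \lambda_{\text{no}}^{\,n-x}}{(n-x)!}}
         {\dfrac{e^{-(\lambda_{\text{yes}} + \lambda_{\text{no}})}
                 (\lambda_{\text{yes}} + \lambda_{\text{no}})^{n}}{n!}}
  = \binom{n}{x} p^{x} (1 - p)^{n - x},
\qquad
p = \frac{\lambda_{\text{yes}}}{\lambda_{\text{yes}} + \lambda_{\text{no}}}
```

Under these assumptions, conditional on the number of responses n, the count of "Yes" selections for a field is binomial, which is why each field is recoded as its own Yes/No indicator.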

# Identifying response field selected for each observation
q1 <- lapply(
  X = q1_split,
  FUN = function(x) {
    if (!all(is.na(x))) {
      broadrange <- ifelse(test = "Broadrange" %in% x, yes = 1, no = 0)
      wholegenome <- ifelse(test = "Wholegenome" %in% x, yes = 1, no = 0)
      mngs <- ifelse(test = "MNGS" %in% x, yes = 1, no = 0)
      none <- ifelse(test = "None" %in% x, yes = 1, no = 0)
      other <- ifelse(test = "Other" %in% x, yes = 1, no = 0)
    } else {
      broadrange <- NA
      wholegenome <- NA
      mngs <- NA
      none <- NA
      other <- NA
    }
    out <- data.frame(
      broadrange = broadrange,
      wholegenome = wholegenome,
      mngs = mngs,
      none = none,
      other = other
    )
    return(out)
  }
)

# Combine the per-response data frames into a single data frame
q1 <- do.call(what = "rbind", args = q1)

# Convert each indicator column to a No/Yes factor
for (j in seq_len(ncol(q1))) {
  q1[, j] <- factor(x = q1[, j], levels = c(0, 1), labels = c("No", "Yes"))
}
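
For comparison, the same indicators can be built more compactly by looping over the option labels rather than over the responses. This is only a sketch of an alternative, not the approach used by the package authors; the names toy_split, fields, and toy_ind are hypothetical and the data are made up.

```r
# Toy split responses (hypothetical); the NA element is a missing response
toy_split <- list(c("Broadrange", "MNGS"), "None", NA_character_)

# Option labels as they appear in the data
fields <- c("Broadrange", "Wholegenome", "MNGS", "None", "Other")

# For each option, flag whether it appears in each response, keeping
# entirely missing responses as NA, then convert to a No/Yes factor
toy_ind <- lapply(
  X = fields,
  FUN = function(opt) {
    flag <- vapply(
      X = toy_split,
      FUN = function(x) if (all(is.na(x))) NA else opt %in% x,
      FUN.VALUE = logical(1)
    )
    factor(x = flag, levels = c(FALSE, TRUE), labels = c("No", "Yes"))
  }
)

# Name the columns after the options and assemble the data frame
names(toy_ind) <- tolower(fields)
toy_ind <- as.data.frame(toy_ind)
toy_ind
```

Looping over options keeps the list of labels in one place, so adding a sixth option would only require extending fields.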

The “select all that apply” responses are now transformed into a collection of indicator variables, where "Yes" denotes that the response field was selected and "No" denotes that it was not. These responses were gathered over several data collection periods because of non-response in earlier periods. Thus, we must use the wave variable in the diagnoses data so that we can conduct saturation point analysis on each response field while controlling for potential response bias.

str(q1)
#> 'data.frame':    643 obs. of  5 variables:
#>  $ broadrange : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 2 2 2 ...
#>  $ wholegenome: Factor w/ 2 levels "No","Yes": 2 2 1 1 1 1 1 1 1 1 ...
#>  $ mngs       : Factor w/ 2 levels "No","Yes": 2 2 2 1 2 1 1 1 2 2 ...
#>  $ none       : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...
#>  $ other      : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...

Using the satpt::satpt() function on the indicator variables for each response field, we perform the saturation point analysis.

res <- satpt::satpt(
  y = q1,
  by = diagnoses$wave,
  dimnames = c(
    "by" = "Data collection period",
    "y" = "Responses to survey"
  )
)

Finally, in the results below, we see that each response field from the “select all that apply” question achieved saturation with a saturation threshold of 0.025. Thus, this question has achieved saturation.

summary(res)
#> 
#> Saturation point analysis of sample proportions
#> ===============================================
#> 
#> Analysis based on: mngs 
#> Saturation achieved? Yes
#> Saturation threshold of 0.025
#> Responses collected from a sample size of 640
#> 
#> Data interval and overall sample proportions
#> ============================================
#>                           y: Responses to survey
#> by: Data collection period     No    Yes
#>                    1       0.5586 0.4414
#>                    2       0.5890 0.4110
#>                    3       0.6000 0.4000
#>                    Overall 0.5766 0.4234
#> 
#> 
#> Data interval and overall standard errors
#> =========================================
#>                           y: Responses to survey
#> by: Data collection period     No    Yes
#>                    1       0.0276 0.0276
#>                    2       0.0407 0.0407
#>                    3       0.0376 0.0376
#>                    Overall 0.0195 0.0195
#> 
#> Pooled standard errors?  No 
#> 
#>  Pearson's Chi-squared test
#> 
#> data:  y: Responses to survey given by: Data collection period
#> X-squared = 0.90182, df = 2, p-value = 0.637
#> 
#> Response bias present? No
#> Significance level: 0.05
#> 
#> 
#> Heterogeneity index
#> ====================
#>  Categories  Index
#>          No 0.0179
#>         Yes 0.0179
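
To make the summary tables easier to interpret, the sketch below recomputes the same kinds of quantities in base R on a small made-up data set. It assumes the standard errors take the usual binomial form sqrt(p(1 - p) / n), which appears consistent with the values reported above but has not been checked against the package source, and it does not attempt to reproduce the heterogeneity index. The objects toy_wave and toy_y are hypothetical.

```r
# Hypothetical data: collection wave and a No/Yes indicator per respondent
toy_wave <- rep(x = c(1, 2, 3), times = c(40, 30, 30))
toy_y <- factor(
  x = rep(
    x = c("Yes", "No", "Yes", "No", "Yes", "No"),
    times = c(18, 22, 12, 18, 12, 18)
  ),
  levels = c("No", "Yes")
)

# Cross-tabulate counts: waves in rows, responses in columns
counts <- table(wave = toy_wave, y = toy_y)

# Row-wise sample proportions and the sample size of each wave
props <- prop.table(x = counts, margin = 1)
n_wave <- rowSums(counts)

# Binomial standard error of each proportion (assumed form)
se <- sqrt(props * (1 - props) / n_wave)

# Pearson chi-squared test of association between wave and response
test <- chisq.test(x = counts, correct = FALSE)

props
se
test
```

A large p-value from the test suggests the response proportions do not differ much across collection periods, which is the sense in which the summary above reports no response bias.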