SAD Analysis

Introduction

Accurate visual assessment of disease severity is critical for monitoring crop health in the field in a reliable and convenient way. Such surveys not only measure severity in situ but are also essential for validating severity-prediction methods. This study compares three assessment conditions for coffee leaf rust severity (unaided, aided by a previously proposed "old" standard area diagram (SAD), and aided by a "new" SAD currently under development) to evaluate how the SADs affect rater bias, precision and inter-rater reliability when estimating leaf-level severity (%). We aim to (1) describe systematic error and proportional bias, (2) quantify agreement between the estimates and the computer-calculated percentage, and (3) compare inter-rater reliability across sessions.

Methods

Six raters with limited or no prior field experience visually assessed the same set of 40 leaves in three consecutive sessions: (1) unaided, (2) aided by the "old" SAD, and (3) aided by the "new" SAD. In every session each rater assigned a percentage of affected leaf area to each sample. Actual severity was taken as the diseased-area percentage computed by image analysis.

Intra-rater evaluation consisted of exploratory analysis using correlation, error and density plots to reveal each rater's estimation trends, bias and accuracy. Bland-Altman plots were used to visually assess agreement, quantifying the bias (mean difference) and the limits within which most differences lie. Lin's concordance correlation coefficient was employed to quantify agreement, bias, precision and scale/location shifts, while inter-rater reliability was assessed with the intraclass correlation coefficient (ICC).

Results and Conclusion

Library Import

library(gsheet)    # read Google Sheets
library(dplyr)     # data wrangling
library(tidyverse) # pivot_longer/pivot_wider, purrr::map, etc.
library(ggplot2)   # plotting
library(MASS)      # boxcox()
library(r4pde)     # plant disease epidemiology helpers
library(agricolae) # agricultural statistics
library(epiR)      # epi.ccc() for Lin's CCC
library(GGally)    # ggpairs() correlation matrix
library(psych)     # ICC()
library(glmmTMB)   # beta GLMM
library(emmeans)   # estimated marginal means
library(multcomp)  # cld() compact letter display
library(DHARMa)    # simulateResiduals() diagnostics

Data Input

# Unaided session sheet: leaf id, actual severity, then one column per rater
UnaidedDf = gsheet2tbl("https://docs.google.com/spreadsheets/d/1eJksk_X5suVChmpEMBQLiU6tyqHF8R5eLhY_MRQTodY/edit?gid=0#gid=0")

ActualValues = UnaidedDf %>% 
  dplyr::select(leaf, actual)

UnaidedDf = UnaidedDf %>% 
  pivot_longer(3:8, names_to = "rater", values_to = "value") %>% 
  mutate(session = "unaided") %>% 
  dplyr::select(-2)   # drop actual; it is joined back to all sessions later

AidedOldDf = gsheet2tbl("https://docs.google.com/spreadsheets/d/1eJksk_X5suVChmpEMBQLiU6tyqHF8R5eLhY_MRQTodY/edit?gid=1615494808#gid=1615494808")
AidedOldDf = AidedOldDf %>% 
  pivot_longer(3:8, names_to="rater", values_to="value") %>% 
  mutate(session="aided_Old") %>% 
  dplyr::select(-2)

AidedNewDf = gsheet2tbl("https://docs.google.com/spreadsheets/d/1eJksk_X5suVChmpEMBQLiU6tyqHF8R5eLhY_MRQTodY/edit?gid=1806089061#gid=1806089061")
AidedNewDf = AidedNewDf %>% 
  pivot_longer(3:8, names_to="rater", values_to="value") %>% 
  mutate(session="aided_New") %>% 
  dplyr::select(-2)

Dataframe = rbind(UnaidedDf, AidedOldDf, AidedNewDf)
Dataframe = left_join(Dataframe, ActualValues, by = "leaf")
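
As a quick sanity check (a minimal sketch, assuming 40 leaves and 6 raters per sheet), the combined long-format table should hold 40 × 6 = 240 rows per session:

Dataframe %>% count(session)   # expect 240 rows for each of the three sessions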

Exploratory Analysis

Estimated severity correlations and linear relation

Analysing the correlation plots, it is possible to observe the linear relation between the estimates and the actual severity, as well as how each rater tends to estimate: a fitted line running parallel to the 1:1 line indicates high precision but low accuracy, i.e. a systematic error, while a slope deviating from the 1:1 line indicates proportional bias and reduced accuracy.

Corrplots <- Dataframe %>%
  split(.$session) %>%
  map(~ ggplot(.x, aes(actual, value)) +
        geom_abline(slope = 1, intercept = 0) +
        geom_point() +
        geom_smooth(method = "lm") +
        facet_wrap(~rater) +
        ggtitle(unique(.x$session))
  )
Corrplots[[1]]

Corrplots[[2]]

Corrplots[[3]]

Correlation matrix

Comparing the raters' pairwise correlations (upper panels) may reveal which raters produce the closest estimates.

The density plots on the diagonal display the distribution of each rater's severity estimates, highlighting where each rater tends to concentrate their values.

Finally, the scatter plots in the lower panels compare the estimated severities between raters, visually displaying how well they align with each other.

RatersDataframe <- Dataframe %>% 
  pivot_wider( names_from = rater, values_from = value) %>% 
  dplyr::select(session, all_of(unique(Dataframe$rater))) 

Plots <- RatersDataframe %>% 
  split(.$session) %>% 
  map(~ ggpairs(.x %>% dplyr::select(-session), progress = FALSE )+ 
        ggtitle(unique(.x$session)))

Plots[[1]]

Plots[[2]]

Plots[[3]]

Visual assessment of variability per sample

By plotting all raters' estimated severities per sample we can identify which samples caused the most confusion, i.e. greater variability in the estimates, and how this changes with the SAD used. For example, sample 1 showed considerable divergence among raters in every session, while sample 33 had high variability in the unaided assessment but reduced variability in the aided assessments, with the old SAD leading raters to lower estimates than the new SAD.

VarPlots <- Dataframe %>%
  split(.$session) %>%
  map(~ ggplot(.x, aes(leaf, value, color = rater,
             group = leaf))+
  geom_line(color = "black")+
  geom_point(size = 2)+
  labs(y = "Severity estimate (%)",
       x = "Leaf Id",
       color = "Rater")+
  ggtitle(unique(.x$session))
  )
VarPlots[[1]]

VarPlots[[2]]

VarPlots[[3]]

Error

Error plots allow comparison between the sessions, revealing tendencies to underestimate or overestimate depending on the actual severity.

In general the unaided assessment led to overestimation, especially for low-severity samples, as evidenced by the points for samples with actual severity below 10% clustering mostly above the zero-error line.

With the old SAD, the low-severity samples were scattered but centred around zero error; as severity increased, so did the tendency to underestimate, up to an actual severity of roughly 40%.

The new SAD showed more overestimation for the low-severity samples, while also leading to underestimation for severities closer to 40%.

ErrorPlots <- Dataframe %>%
  ggplot(aes(actual, value - actual))+   # signed error per sample
  geom_point(size = 3, alpha = 0.7)+
  facet_wrap(~session)+
  geom_hline(yintercept = 0)+
  labs(x = "Actual severity (%)",
       y = "Error (estimated - actual)")
ErrorPlots
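
A simple numeric summary of the mean signed error per session (a sketch; positive values indicate overestimation) can back up these visual impressions:

Dataframe %>%
  group_by(session) %>%
  summarise(mean_error = mean(value - actual, na.rm = TRUE))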

Bland-Altman plot

The blue loess line indicates the tendency to overestimate (above the zero line) or underestimate (below it). The red dashed line shows the systematic error (mean difference), and the green dotted lines the limits of agreement. Finally, the scatter of the points indicates the variability.

The unaided session displayed the largest systematic error, with constant overestimation of severity. It also had the most scattered points, showing greater variability, which indicates poor agreement and considerable bias.

The session aided by the old SAD had a small negative systematic bias, with a tendency to slightly overestimate the low-severity samples, underestimate the intermediate-severity samples, and return to overestimation at high severity. Despite the minor systematic bias, a high proportional bias was displayed, as evidenced by the large errors for high-severity samples.

The new SAD had a slight systematic bias towards overestimation, concentrated on the low- to intermediate-severity samples; above 30% severity the trend switches to underestimation. Given its minor systematic bias and smaller proportional bias, the new SAD appears advantageous.

Dataframe <- Dataframe %>%
  mutate(
    MeanVal = (value + actual)/2,  # Bland-Altman x-axis: mean of the two measures
    Error   = value - actual       # Bland-Altman y-axis: difference
  )

ba_stats <- Dataframe %>%
  group_by(session) %>%
  summarise(
    mean_diff = mean(Error, na.rm = TRUE),
    sd_diff   = sd(Error, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  mutate(
    upper = mean_diff + 1.96 * sd_diff,
    lower = mean_diff - 1.96 * sd_diff
  )

ggplot(Dataframe, aes(x = MeanVal, y = Error)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE, color = "blue") +
  geom_hline(data = ba_stats, aes(yintercept = mean_diff),
             linetype = "dashed", color = "red") +
  geom_hline(data = ba_stats, aes(yintercept = upper),
             linetype = "dotted", color = "darkgreen") +
  geom_hline(data = ba_stats, aes(yintercept = lower),
             linetype = "dotted", color = "darkgreen") +
  facet_wrap(~ session) +
  labs(
    x = "Mean (value & actual)",
    y = "Difference (value - actual)",
    title = "Bland–Altman plots with limits of agreement"
  )
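
Printing ba_stats gives the exact bias and limits of agreement behind each panel:

ba_stats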

Normality and Homoscedasticity check

The assumptions must be checked before parametric inference can be used. The Shapiro-Wilk tests indicate that the estimates do not follow a normal distribution, and the pairwise F tests indicate unequal variances (significantly so for the old-SAD session versus the unaided one). We may attempt to transform the values.

shapiro.test(Dataframe$value[Dataframe$session == "unaided"])

    Shapiro-Wilk normality test

data:  Dataframe$value[Dataframe$session == "unaided"]
W = 0.87216, p-value = 2.684e-13
shapiro.test(Dataframe$value[Dataframe$session == "aided_Old"])

    Shapiro-Wilk normality test

data:  Dataframe$value[Dataframe$session == "aided_Old"]
W = 0.7936, p-value < 2.2e-16
shapiro.test(Dataframe$value[Dataframe$session == "aided_New"])

    Shapiro-Wilk normality test

data:  Dataframe$value[Dataframe$session == "aided_New"]
W = 0.8656, p-value = 1.135e-13
var.test(Dataframe$value[Dataframe$session == "aided_Old"], Dataframe$value[Dataframe$session == "unaided"])

    F test to compare two variances

data:  Dataframe$value[Dataframe$session == "aided_Old"] and Dataframe$value[Dataframe$session == "unaided"]
F = 0.61003, num df = 239, denom df = 239, p-value = 0.0001463
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.4731168 0.7865624
sample estimates:
ratio of variances 
         0.6100294 
var.test(Dataframe$value[Dataframe$session == "aided_New"], Dataframe$value[Dataframe$session == "unaided"])

    F test to compare two variances

data:  Dataframe$value[Dataframe$session == "aided_New"] and Dataframe$value[Dataframe$session == "unaided"]
F = 0.79706, num df = 239, denom df = 239, p-value = 0.08019
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.6181735 1.0277209
sample estimates:
ratio of variances 
         0.7970632 
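
The pairwise F tests above compare each aided session against the unaided one. As a sketch, an omnibus check of variance homogeneity across all three sessions at once could use Bartlett's test:

bartlett.test(value ~ session, data = Dataframe)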

Box Cox Transformation

The Box-Cox method returns the power (lambda) most likely to make the data conform to normality and homogeneity of variances. Using this value we can attempt to transform the data to achieve normality and homoscedasticity.

AnovaModel <- aov(value ~ session, data = Dataframe)
bc <- boxcox(AnovaModel, lambda = seq(-2, 2, 0.1))

lambda.best <- bc$x[which.max(bc$y)]
Dataframe$value_bc <- (Dataframe$value^lambda.best - 1) / lambda.best
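# Note (assumption): this power transform presumes lambda.best != 0 and strictly
# positive values; if lambda.best were near 0 the Box-Cox limit is log(value),
# and any zero severities would first need a small positive offset.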
shapiro.test(Dataframe$value_bc[Dataframe$session == "unaided"])

    Shapiro-Wilk normality test

data:  Dataframe$value_bc[Dataframe$session == "unaided"]
W = 0.98139, p-value = 0.003066
shapiro.test(Dataframe$value_bc[Dataframe$session == "aided_Old"])

    Shapiro-Wilk normality test

data:  Dataframe$value_bc[Dataframe$session == "aided_Old"]
W = 0.97429, p-value = 0.0002403
shapiro.test(Dataframe$value_bc[Dataframe$session == "aided_New"])

    Shapiro-Wilk normality test

data:  Dataframe$value_bc[Dataframe$session == "aided_New"]
W = 0.96454, p-value = 1.128e-05
var.test(Dataframe$value_bc[Dataframe$session == "aided_Old"], Dataframe$value_bc[Dataframe$session == "unaided"])

    F test to compare two variances

data:  Dataframe$value_bc[Dataframe$session == "aided_Old"] and Dataframe$value_bc[Dataframe$session == "unaided"]
F = 1.3141, num df = 239, denom df = 239, p-value = 0.03521
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 1.019168 1.694379
sample estimates:
ratio of variances 
          1.314099 
var.test(Dataframe$value_bc[Dataframe$session == "aided_New"], Dataframe$value_bc[Dataframe$session == "unaided"])

    F test to compare two variances

data:  Dataframe$value_bc[Dataframe$session == "aided_New"] and Dataframe$value_bc[Dataframe$session == "unaided"]
F = 1.3474, num df = 239, denom df = 239, p-value = 0.02156
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 1.044965 1.737266
sample estimates:
ratio of variances 
          1.347361 

GLMM

Having failed to make the distribution conform to normality, a viable alternative is a generalized linear mixed model. Since the data are bounded between 0 and 100%, we rescaled severity to a proportion and fitted the model with a beta distribution, with session as a fixed effect and rater as a random effect.

Mod <-  glmmTMB(value/100 ~ session + (1| rater), 
                 data = Dataframe, 
                 family = beta_family())
plot(simulateResiduals(Mod))
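
One caveat worth noting (an assumption about the data, not something reported by the fit above): beta_family() requires proportions strictly inside (0, 1). If any estimate were exactly 0% or 100%, a common workaround before refitting is the Smithson and Verkuijlen squeeze, sketched below with a hypothetical prop_adj column.

# Hypothetical adjustment, only needed if value/100 touches 0 or 1
n <- nrow(Dataframe)
Dataframe$prop_adj <- (Dataframe$value / 100 * (n - 1) + 0.5) / n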

car::Anova(Mod)
Analysis of Deviance Table (Type II Wald chisquare tests)

Response: value/100
         Chisq Df Pr(>Chisq)    
session 45.078  2  1.627e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
em <- emmeans(Mod, ~ session, type = "response")
cld(em)
 session   response      SE  df asymp.LCL asymp.UCL .group
 aided_Old    0.116 0.00745 Inf     0.102     0.131  1    
 aided_New    0.143 0.00859 Inf     0.127     0.161   2   
 unaided      0.185 0.01000 Inf     0.166     0.205    3  

Confidence level used: 0.95 
Intervals are back-transformed from the logit scale 
P value adjustment: tukey method for comparing a family of 3 estimates 
Tests are performed on the log odds ratio scale 
significance level used: alpha = 0.05 
NOTE: If two or more means share the same grouping symbol,
      then we cannot show them to be different.
      But we also did not show them to be the same. 

The analysis of deviance indicated that session had a highly significant effect on the raters' estimates, given its low Pr(>Chisq) value, and the pairwise comparisons indicated that all three sessions differed significantly from each other.

Visualizing the absolute differences shows that the unaided assessment performed worse than the aided assessments, since it produced the largest absolute errors. The two aided sessions had similar values, lower than the unaided session, so both SADs helped the raters by tempering their estimates.

Dataframe$abs_dif <- abs(Dataframe$value - Dataframe$actual)
ggplot(Dataframe, aes(x = session, y = abs_dif, fill = session)) +
  geom_boxplot(alpha = 0.7, outlier.shape = 16, outlier.size = 2) +
  labs(
    title = "Absolute error by session",
    x = "Session",
    y = "Absolute error (%)"
  )
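
The corresponding numeric summary (a sketch) makes the ranking explicit:

Dataframe %>%
  group_by(session) %>%
  summarise(mean_abs_error = mean(abs_dif),
            median_abs_error = median(abs_dif))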

Lin’s Concordance Correlation Coefficient

Analyzing the results of Lin's concordance correlation coefficient, the assessment with the new SAD performed best: it had the highest agreement value, indicating the most accurate estimation given the precision and bias; the highest bias coefficient, indicating the smallest systematic deviation; and the highest precision value, indicating the best consistency.

cccByGroup <- function(data, GroupName) {
  CCCData <- data %>%
    group_by(.data[[GroupName]]) %>%
    summarise(Agreement = as.numeric(epi.ccc(value, actual)$rho.c[1]),  # Lin's CCC estimate
              `Bias coefficient` = epi.ccc(value, actual)$C.b,          # accuracy component (Cb)
              Precision = Agreement / `Bias coefficient`,               # equals Pearson's r
              scale_shift = epi.ccc(value, actual)$s.shift,
              location_shift = epi.ccc(value, actual)$l.shift,
              .groups = "drop")
  
  return(CCCData)
}
results <- cccByGroup(Dataframe, "session")
results
# A tibble: 3 × 6
  session   Agreement `Bias coefficient` Precision scale_shift location_shift
  <chr>         <dbl>              <dbl>     <dbl>       <dbl>          <dbl>
1 aided_New     0.912              0.993     0.919       1.02          -0.120
2 aided_Old     0.886              0.977     0.907       1.17           0.151
3 unaided       0.826              0.938     0.880       0.915         -0.351
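
As a consistency check of the decomposition rho_c = r × C_b (so that Precision in the table is simply Pearson's r), one session can be recomputed directly; a sketch for the unaided session:

UnaidedOnly <- Dataframe %>% filter(session == "unaided")
r  <- cor(UnaidedOnly$value, UnaidedOnly$actual)           # precision (Pearson's r)
Cb <- epi.ccc(UnaidedOnly$value, UnaidedOnly$actual)$C.b   # bias correction factor
r * Cb   # should reproduce the Agreement value in the unaided row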

The ICC analysis also indicated the new-SAD session as the most reliable, with the highest ICC under every formulation. For all sessions, averaging the estimates of all raters improved reliability, which is the expected behavior.

ICCs <- RatersDataframe %>%
  split(.$session) %>%
  map(~ ICC(dplyr::select(.x, -session)))
ICCs
$aided_New
Call: ICC(x = dplyr::select(.x, -session))

Intraclass correlation coefficients 
                         type  ICC  F df1 df2       p lower bound upper bound
Single_raters_absolute   ICC1 0.89 48  39 200 1.9e-81        0.83        0.93
Single_random_raters     ICC2 0.89 60  39 195 1.3e-88        0.82        0.93
Single_fixed_raters      ICC3 0.91 60  39 195 1.3e-88        0.86        0.94
Average_raters_absolute ICC1k 0.98 48  39 200 1.9e-81        0.97        0.99
Average_random_raters   ICC2k 0.98 60  39 195 1.3e-88        0.96        0.99
Average_fixed_raters    ICC3k 0.98 60  39 195 1.3e-88        0.97        0.99

 Number of subjects = 40     Number of Judges =  6
See the help file for a discussion of the other 4 McGraw and Wong estimates,
$aided_Old
Call: ICC(x = dplyr::select(.x, -session))

Intraclass correlation coefficients 
                         type  ICC  F df1 df2       p lower bound upper bound
Single_raters_absolute   ICC1 0.84 33  39 200 4.3e-68        0.77        0.90
Single_random_raters     ICC2 0.84 40  39 195 2.9e-73        0.76        0.91
Single_fixed_raters      ICC3 0.87 40  39 195 2.9e-73        0.80        0.92
Average_raters_absolute ICC1k 0.97 33  39 200 4.3e-68        0.95        0.98
Average_random_raters   ICC2k 0.97 40  39 195 2.9e-73        0.95        0.98
Average_fixed_raters    ICC3k 0.97 40  39 195 2.9e-73        0.96        0.99

 Number of subjects = 40     Number of Judges =  6
See the help file for a discussion of the other 4 McGraw and Wong estimates,
$unaided
Call: ICC(x = dplyr::select(.x, -session))

Intraclass correlation coefficients 
                         type  ICC  F df1 df2       p lower bound upper bound
Single_raters_absolute   ICC1 0.79 24  39 200 4.1e-56        0.70        0.87
Single_random_raters     ICC2 0.79 30  39 195 3.8e-63        0.69        0.87
Single_fixed_raters      ICC3 0.83 30  39 195 3.8e-63        0.75        0.89
Average_raters_absolute ICC1k 0.96 24  39 200 4.1e-56        0.93        0.98
Average_random_raters   ICC2k 0.96 30  39 195 3.8e-63        0.93        0.98
Average_fixed_raters    ICC3k 0.97 30  39 195 3.8e-63        0.95        0.98

 Number of subjects = 40     Number of Judges =  6
See the help file for a discussion of the other 4 McGraw and Wong estimates,
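
Since ICCs is a named list of psych::ICC objects, the single-rater ICC2 estimates can be pulled into one vector for a side-by-side comparison (a sketch assuming the row order printed above):

map_dbl(ICCs, ~ .x$results$ICC[2])   # row 2 = Single_random_raters (ICC2)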