Models for Rater Agreement and Reliability

What’s Unique about Rater Agreement?

From one point of view, nothing. Rater agreement studies tend to produce r X r tables, where r is the number of rating categories. As such, the models that test marginal homogeneity and symmetry are quite relevant, asking does one rater score similarly (or higher) than another. However, rater agreement data are different in that there are usually a surplus of responses on the diagonal, indicating that the raters agree. Often it makes sense to explicitly model this excess, and the models in this vignette do that.

Another feature of these models is that they are nearly all regular log-linear models. As such they are less interesting, in that they can be fit with standard log-linear estimation proceudres, including glm() with Poisson family and “log” as the link function. Where things can be a bit challenging is that one has to explicitly form the design matrix. For some designs this is a bit of a puzzle, and there are some utility routines to help in the specification.

Much of the modeling in this section comes from chapter 2 of von Eye, A. & Mun, E. Y. (2005), Analyzing rater agreement: Manifest variable methods. Mahwah, NJ: Lawrence Erlbaum.

The Data

This vignette will use a different source of data from Shoukri (2005), p. 80. It is agreement study of two clinicians evaluating whether dogs were dehydrated. The lowest score indicates normal, and the highest score indicates dehydrated (above 10%).

print(dogs)
#>      [,1] [,2] [,3] [,4]
#> [1,]  119   10    2    0
#> [2,]    7   28    1    0
#> [3,]    2   13   14    2
#> [4,]    0    0    1    1

The basic main effects model

The baseline model for rater modeling is the main effect model, which specifies an intercept and a row (rater 1) effect and a column (rater 2 effect), but no interaction terms. This is the independence model that underlies Cohen’s kappa for nominal items: p_a - p_c kappa = ———– 1 - p_c

dogs_kappa <- kappa(dogs)
print(dogs_kappa)
#> $kappa
#> [1] 0.6351767
#> 
#> $se
#> [1] 0.05126776

p_c is taken as p_i+ times p_+j, which is the model prediction for all cells in the main effects model. There is also a weighted version of kappa available, weighted_kappa(n, w) where w is the matrix of weights. The unweighted kappa for the dogs data is 0.6351767, 0.0512678, indicating good agreement betweeen the two clinicians.

There are two models that specifically include kappa as part of the model. The first is by Agresti, 1989, and the second is by Schuster, 2001. The Agresti model is described as “symmetry plus quasi-independence.” It assumes marginal homogeneity, as well as assuming that quasi-symmetry holds.

The Agresti kappa model is available as:

result <- Agresti_kappa_agreement(dogs)

The result is a list with several elements. The first is kappa, 0.6235352, which is quite similar to the observed value. Estimates of the marginal parameters (the moodel assumes that these fit to both raters) are also returned, 0.6316093, 0.2278695, 0.1316074, 0.0146445. The Pearson chi-squared statistic of 92.3636704 on result$df degrees of freedom indicates that the model does not fit well to this data.

The Schuster model also assumes marginal homogeneity. It incorporates an underlying symmetry matrix as well. The model is much harder to fit acceptably, as reflected in the amount of time and number of iterations that the model runs. It does not yield a plausible result for the dogs data (not shown: the log(likelihood) decreased, the estimated kappa was negative and the residuals were remarkably large for the highest category), so we apply it to the vision_data.

result <- Schuster_symmetric_rater_agreement_model(vision_data)
#> 100      -10365.131345958      5669.64113345708      114810.134161738     0.000219059130695239
#> 200      -10365.0027942057      5669.38402995252      114808.82366675     0.000231464328770828
#> 300      -10364.8743346158      5669.12711077268      114807.513696302     0.000243860940594987
#> 400      -10364.7459671253      5668.87037579168      114806.204250146     0.000256248971591903
#> 500      -10364.6176916713      5668.61382488376      114804.895328033     0.000268628427181174
#> 600      -10364.4895081911      5668.35745792331      114803.586929717     0.000280999312775702
#> 700      -10364.3614166219      5668.10127478488      114802.27905495     0.000293361633782037
#> 800      -10364.233416901      5667.84527534316      114800.971703486     0.00030571539560073
#> 900      -10364.1055089659      5667.58945947302      114799.664875078     0.000318060603625277
#> 1000      -10363.9776927541      5667.33382704943      114798.35856948     0.000330397263244574
#> 1100      -10363.8499682032      5667.07837794755      114797.052786446     1.23196687967666e-07
#> 1200      -10363.7223352507      5666.82311204265      114795.747525731     1.23109851597136e-07
#> 1300      -10363.5947938345      5666.56802921021      114794.442787091     1.23023071325962e-07
#> 1400      -10363.4673438923      5666.31312932578      114793.138570279     1.22936348212369e-07
#> 1500      -10363.339985362      5666.05841226513      114791.834875051     1.22849681910455e-07
#> 1600      -10363.2127181815      5665.80387790414      114790.531701165     1.22763072074297e-07
#> 1700      -10363.0855422888      5665.54952611882      114789.229048374     1.22676518357946e-07
#> 1800      -10362.9584576221      5665.29535678538      114787.926916437     1.22590021468601e-07
#> 1900      -10362.8314641195      5665.04136978014      114786.625305109     1.22503581411367e-07
#> 2000      -10362.7045617192      5664.78756497956      114785.324214149     1.22417197489217e-07
#> 2100      -10362.5777503596      5664.53394226028      114784.023643312     1.22330869356149e-07
#> 2200      -10362.451029979      5664.28050149906      114782.723592358     1.22244598070435e-07
#> 2300      -10362.3244005158      5664.02724257282      114781.424061045     1.22158383286079e-07
#> 2400      -10362.1978619087      5663.7741653586      114780.125049129     1.22072223954901e-07
#> 2500      -10362.0714140962      5663.52126973364      114778.826556372     1.21986120433009e-07
#> 2600      -10361.945057017      5663.26855557526      114777.52858253     1.2190007342761e-07
#> 2700      -10361.8187906099      5663.01602276097      114776.231127364     1.21814082241564e-07
#> 2800      -10361.6926148136      5662.7636711684      114774.934190634     1.21728146879887e-07
#> 2900      -10361.5665295671      5662.51150067536      114773.637772098     1.21642267347591e-07
#> 3000      -10361.4405348093      5662.25951115977      114772.341871519     1.21556443298574e-07
#> 3100      -10361.3146304793      5662.00770249969      114771.046488655     1.21470674737823e-07
#> 3200      -10361.1888165161      5661.75607457337      114769.751623269     1.21384962372551e-07
#> 3300      -10361.063092859      5661.50462725915      114768.45727512     1.21299305505521e-07
#> 3400      -10360.9374594472      5661.25336043552      114767.163443972     1.21213703790579e-07
#> 3500      -10360.81191622      5661.00227398116      114765.870129585     1.21128157583805e-07
#> 3600      -10360.6864631168      5660.75136777484      114764.577331722     1.21042666539027e-07
#> 3700      -10360.5611000772      5660.50064169551      114763.285050146     1.20957231012324e-07
#> 3800      -10360.4358270405      5660.25009562225      114761.993284618     1.2087185030636e-07
#> 3900      -10360.3106439466      5659.99972943429      114760.702034902     1.20786525128346e-07
#> 4000      -10360.1855507349      5659.74954301096      114759.411300762     1.2070125513207e-07
#> 4100      -10360.0605473453      5659.49953623179      114758.121081961     1.20616039620142e-07
#> 4200      -10359.9356337176      5659.24970897641      114756.831378264     1.20530879299768e-07
#> 4300      -10359.8108097917      5659.00006112462      114755.542189433     1.20445774175857e-07
#> 4400      -10359.6860755076      5658.75059255634      114754.253515235     1.2036072355098e-07
#> 4500      -10359.5614308052      5658.50130315166      114752.965355434     1.20275727781178e-07
#> 4600      -10359.4368756248      5658.25219279077      114751.677709794     1.20190786520158e-07
#> 4700      -10359.3124099064      5658.00326135403      114750.390578083     1.20105900123961e-07
#> 4800      -10359.1880335904      5657.75450872194      114749.103960065     1.20021068597455e-07
#> 4900      -10359.063746617      5657.50593477511      114747.817855507     1.1993629089194e-07
#> 5000      -10358.9395489266      5657.25753939434      114746.532264175     1.19851568065815e-07
#> 5100      -10358.8154404597      5657.00932246053      114745.247185836     1.19766899421538e-07
#> 5200      -10358.6914211568      5656.76128385471      114743.962620257     1.19682285315132e-07
#> 5300      -10358.5674909585      5656.5134234581      114742.678567205     1.19597725751429e-07
#> 5400      -10358.4436498054      5656.26574115201      114741.395026448     1.19513219681625e-07
#> 5500      -10358.3198976384      5656.01823681793      114740.111997754     1.19428767812928e-07
#> 5600      -10358.1962343981      5655.77091033746      114738.829480891     1.19344370501356e-07
#> 5700      -10358.0726600256      5655.52376159232      114737.547475628     1.19260026698053e-07
#> 5800      -10357.9491744616      5655.27679046442      114736.265981733     1.19175737461463e-07
#> 5900      -10357.8257776473      5655.02999683577      114734.984998975     1.19091501742693e-07
#> 6000      -10357.7024695237      5654.78338058853      114733.704527125     1.19007319546498e-07
#> 6100      -10357.5792500319      5654.53694160501      114732.424565952     1.18923191931344e-07
#> 6200      -10357.4561191132      5654.29067976762      114731.145115225     1.18839117497042e-07
#> 6300      -10357.3330767089      5654.04459495893      114729.866174716     1.18755096599565e-07
#> 6400      -10357.2101227603      5653.79868706168      114728.587744195     1.18671129243649e-07
#> 6500      -10357.0872572088      5653.5529559587      114727.309823432     1.18587215785277e-07
#> 6600      -10356.9644799959      5653.30740153294      114726.032412199     1.18503355526666e-07
#> 6700      -10356.8417910632      5653.06202366755      114724.755510268     1.18419548823787e-07
#> 6800      -10356.7191903523      5652.81682224577      114723.47911741     1.18335795681353e-07
#> 6900      -10356.5966778049      5652.57179715098      114722.203233397     1.18252095401528e-07
#> 7000      -10356.4742533628      5652.32694826673      114720.927858003     1.18168448340266e-07
#> 7100      -10356.3519169677      5652.08227547665      114719.652990999     1.18084854150974e-07
#> 7200      -10356.2296685617      5651.83777866455      114718.378632158     1.18001313892171e-07
#> 7300      -10356.1075080866      5651.59345771435      114717.104781254     1.17917825812115e-07
#> 7400      -10355.9854354845      5651.34931251012      114715.831438061     1.1783439132061e-07
#> 7500      -10355.8634506974      5651.10534293605      114714.558602351     1.1775100971974e-07
#> 7600      -10355.7415536677      5650.86154887648      114713.2862739     1.17667680662851e-07
#> 7700      -10355.6197443373      5650.61793021587      114712.014452482     1.17584404505878e-07
#> 7800      -10355.4980226488      5650.37448683883      114710.743137871     1.17501180902149e-07
#> 7900      -10355.3763885445      5650.13121863007      114709.472329843     1.17418010207597e-07
#> 8000      -10355.2548419667      5649.88812547448      114708.202028173     1.17334892075532e-07
#> 8100      -10355.1333828579      5649.64520725705      114706.932232636     1.17251826510565e-07
#> 8200      -10355.0120111609      5649.40246386291      114705.662943009     1.17168813517298e-07
#> 8300      -10354.8907268181      5649.15989517733      114704.394159067     1.17085852749002e-07
#> 8400      -10354.7695297723      5648.91750108571      114703.125880587     1.17002944912928e-07
#> 8500      -10354.6484199662      5648.67528147357      114701.858107346     1.16920088608325e-07
#> 8600      -10354.5273973427      5648.43323622659      114700.590839121     1.16837285245109e-07
#> 8700      -10354.4064618447      5648.19136523054      114699.324075689     1.16754533422487e-07
#> 8800      -10354.2856134151      5647.94966837136      114698.057816827     1.16671834550397e-07
#> 8900      -10354.164851997      5647.70814553511      114696.792062314     1.16589186876651e-07
#> 9000      -10354.0441775334      5647.46679660797      114695.526811928     1.16506591811197e-07
#> 9100      -10353.9235899675      5647.22562147626      114694.262065448     1.16424048304507e-07
#> 9200      -10353.8030892426      5646.98462002644      114692.997822651     1.16341557063831e-07
#> 9300      -10353.682675302      5646.74379214508      114691.734083317     1.16259117390967e-07
#> 9400      -10353.5623480889      5646.50313771889      114690.470847225     1.16176729993171e-07
#> 9500      -10353.4421075468      5646.26265663472      114689.208114156     1.16094393820835e-07
#> 9600      -10353.3219536192      5646.02234877954      114687.945883888     1.16012109581209e-07
#> 9700      -10353.2018862496      5645.78221404045      114686.684156202     1.15929876576027e-07
#> 9800      -10353.0819053818      5645.54225230466      114685.422930878     1.15847695512544e-07
#> 9900      -10352.9620109592      5645.30246345957      114684.162207698     1.15765566043863e-07
#> 10000      -10352.8422029257      5645.06284739264      114682.901986442     1.15683487471663e-07

The iterations are slow (about 1 sec per 100 iterations). It runs for the full 10,000 iterations specified by default, in spite of using the Newton-Raphson method (it is likely that the performance of this function will be improved in future versions of the package). At every 100 cycles the function outputs the iteration number, log(likelihood), G^2 and X^2 and the convergence value, whihc is the relative improvement in log(likelihood) value. For the vision data the observed kappa is 0.5953888, 0.0070393. The model’s final estimate is close to this, 0.5953852. It is a bit hard to evaluate the model fit, given that the iterations did not converge, but the G^2 values that were printed out indicate very poor fit to the data (the X^2 values were even more extreme). Running to convergence of 1.0e-9 requires 244,503 iterations. The final kappa estimate of 0.5953344 is very close to the observed kappa of 0.5953888, 0.0070393. The final model yields a G^2 of 5409.781 on 9 degrees of freedom.

We note that the Agresti model:

result <- Agresti_kappa_agreement(vision_data)

runs substantially more quickly and fits this data set substantially better, with G^2 of 409.7778011 on 11 degrees of freedom, although it does not fit in an absolute sense.

To show that the kappa-based models can fit, we look at the budget_actual data. Fitting this model:

s_result <- Schuster_symmetric_rater_agreement_model(budget_actual)
#> 100      -149.012309151138      2.96905386545657      3.07236917550835     0.00790303078865798
#> 200      -149.010817704918      2.96607097301722      3.06975109518432     0.00791311886943114
#> 300      -149.009343620158      2.96312280349772      3.06716522785273     0.00792308971643707
#> 400      -149.007886692519      2.96020894821833      3.06461118553905     0.00793294470492557
#> 500      -149.00644672009      2.95732900335991      3.0620885849234     0.00794268519389665
#> 600      -149.005023503362      2.95448256990533      3.05959704728408     0.00795231252629077
#> 700      -149.0036168452      2.95166925358144      3.05713619844183     0.00796182802918456
#> 800      -149.002226550811      2.94888866480201      3.05470566870467     0.00797123301397594
#> 900      -149.000852427715      2.94614041861118      3.0523050928136     0.00798052877657195
#> 1000      -148.999494285723      2.94342413462768      3.04993410988885     0.00798971659757486

yields acceptable fit (G^2 = 2.9433971 on 5 degrees of freedom), as does fitting the Agresti kappa model:

a_result <- Agresti_kappa_agreement(budget_actual)

which gives a G^2 of 6.6831842 on 5 degrees of freedom. Again, the two kappa estimates, 0.2851063 and 0.2788086 are not that different than the observed kappa of 0.2845944, 0.0619068.

It is worth noting that symmetry (which is assumed by both kappa models) fits the budget_actual data:

stuart_result <- Stuart_marginal_homogeneity(budget_actual)

yields a X^2 of 1.0889064 on 2 degrees of freedom, indicating that symmetry is plausible for this data.

In the final analysis, neither of these specialized kappa models does a good job of describing the vision data. Both models assume marginal homogeneity, and we know from the vignette “Checking Whether Margins are (Stochastically) Ordered” that marginal homogeneity can be rejected for the vision data. When marginal homogeneity was satisfied in the budget_actual data, both models yielded acceptable fit. It appears that for these models to fit, marginal homogeneity must be satisfied. Conversely, to fit a wide variety of rater data sets acceptably, it appears that a (kappa-based) model must allow for heterogeneous margins.

Regular Log-linear Models

For the fitting of general log-linear models, the function fit_log_Linear(data, design_matrix) is provided. It uses the glm() function, with family set to poisson(link=“log”), This generates results that match various published results of specific log-linear models in the literature.

To aid in fitting these models, several design matrices are available with names beginning “log_linear”. There are methods:

create a basic design matrix for the independent rows and columns model (main_effects_design)
create a matrix with a single diagonal parameter delta (equal_weight_agreement_design)
add all diagonals terms for a model with a separate kappa per category
create a linear-by-linear term
append a new column (such as the linear-by-linear) to a design matrix
remove a column from a design matrix
convert a matrix of observed counts into a vector that aligns with the design matrices

These features are intended to aid in fitting additional models not (yet) implemented in this package.

Main Effects Model

result <- von_Eye_main_effect(dogs)

This line creates the design matrix using model.matrix and specifying contrast coding. The result is a list. The first element is beta 1.0413553, 1.7287346, 0.4370563, 0.2875245, 1.6588203, 0.7386156, -0.3028382, the vector of regression parameters. Standard errors of beta are given in 0.2376666, 0.1982025, 0.2221628, 0.2271485, 0.1729861, 0.1892666, 0.2319388. The expected counts are returned as 83.84, 23.04, 19.84, 1.2800001, 33.405, 9.18, 7.905, 0.51, 11.79, 3.24, 2.79, 0.18, 1.965, 0.54, 0.465, 0.03 is a vector containing expected counts. The X^2 is returned 199.3987398, as well as G^2 160.7504124 and the degrees of freedom 9. Finally, the covariance matrix for the regression parameters is returned as the cov member of the list. The model obviously does not fit these data. Examining the residuals 35.16, -16.04, -17.84, -1.2800001, -23.405, 18.82, 5.095, -0.51, -9.79, -2.24, 11.21, 0.82, -1.965, -0.54, 1.535, 0.97 reveals very large, positive residuals where the diagonal is being under-predicted, and negative entries elsewhere, where the model over-predicts.

The Weight by Response Category model

This model adds to the main effect model by placing specific elements for each of the diagonal terms, delta_ii. The weights applied to each diagonal cell are equal, and only a single parameter is fit for the table.

result2 <- von_Eye_equal_weighted_diagonal(dogs)

While it still does not fit, this model is a noticeable improvement over the previous one. The reduction in G^2 is 137.9708965 on 1 degree of freedom.

From a substantive point of view, it may be that certain agreements are more important than others. For example, in a medical setting it may be that the extreme two categories with scores of 1 and 4 are more important than disagreement between the middle two categories. So, one might assign weights of 1 to the 1,1 cell and the 4,4 cell, and weights of 3 to the middle 2,2 and 3,3 cells. To fit this model, the first step is to create the vector of weights, and then modify the design matrix to apply the weights to the cells.

w <- c(3, 1, 1, 3)
x <- log_linear_main_effect_design(dogs)
x_prime <- log_linear_add_all_diagonals(dogs, x)
x_prime_prime <- von_Eye_weight_by_response_category_design(dogs, x_prime, w)
result3 <- log_linear_fit(dogs, x_prime_prime)

Because the weights are specified a priori, the degrees of freedom for this model are the same as for the equal weight agreement design. The G^2 is improved, decreasing by -10.6707092. The overall fit, is marginal, with a G^2 of 12.1088067 on 9 degrees of freedom, with a p-value of 0.0333269.

Unequal Weights on the Diagonal

A less constrained model allows the parameters for the diagonal to be estimated.

result4 <- von_Eye_diagonal(dogs)

This fits better than the weight_by response model, with a G^2 of 12.1088067 on 5 degrees of freedom. However, it still demonstrates a significant lack of fit.

Agresti’s Model for Ordinal Agreeement

Agresti (1988) suggested a modification to the last model that includes a linear-by-linear component. This can be generated using log_linear_create_linear_by_linear(dogs). This creates a linear-by-linear predictor to add to the model, using the integer scores. The function also takes an optional parameter “centered”, which indicates whether the linear terms should be centered by having the mean subtracted from each entry before they are multiplied together. It returns a vector containing the new predictor. This can be added to an existing design matrix using log_linear_append_column(). This takes the existing design matrix and the vector x_new to be added, and returns a new design matrix. There is an optional third parameter that specifies what column x_new should reside in the new design matrix. This defaults to placing the column on the far right.

linear <- log_linear_create_linear_by_linear(dogs, centered=TRUE)
print(linear)
#>  [1]  2.25  0.75 -0.75 -2.25  0.75  0.25 -0.25 -0.75 -0.75 -0.25  0.25  0.75
#> [13] -2.25 -0.75  0.75  2.25
x <- log_linear_main_effect_design(dogs)
x_new <- log_linear_append_column(x, linear)
print(x_new)
#>       [,1] [,2] [,3] [,4] [,5] [,6] [,7]  [,8]
#>  [1,]    1    1    0    0    1    0    0  2.25
#>  [2,]    1    1    0    0    0    1    0  0.75
#>  [3,]    1    1    0    0    0    0    1 -0.75
#>  [4,]    1    1    0    0   -1   -1   -1 -2.25
#>  [5,]    1    0    1    0    1    0    0  0.75
#>  [6,]    1    0    1    0    0    1    0  0.25
#>  [7,]    1    0    1    0    0    0    1 -0.25
#>  [8,]    1    0    1    0   -1   -1   -1 -0.75
#>  [9,]    1    0    0    1    1    0    0 -0.75
#> [10,]    1    0    0    1    0    1    0 -0.25
#> [11,]    1    0    0    1    0    0    1  0.25
#> [12,]    1    0    0    1   -1   -1   -1  0.75
#> [13,]    1   -1   -1   -1    1    0    0 -2.25
#> [14,]    1   -1   -1   -1    0    1    0 -0.75
#> [15,]    1   -1   -1   -1    0    0    1  0.75
#> [16,]    1   -1   -1   -1   -1   -1   -1  2.25
result5 <- log_linear_fit(dogs, x_new)

This model can also be specified directly:

result5b <- von_Eye_linear_by_linear(dogs)

Just including the linear-by-linear term does not fit, although the decrease in G^2 is large, -5.0141809. Adding the diagonal terms:

result6 <- von_Eye_diagonal_linear_by_linear(dogs)

provides a large improvement in fit, 15.4903945 on 4. This model’s fit, 2.2749406 on 4 degrees of freedom is non-significant, indicating an acceptable fit.

One question is whether the separate diagonal terms are needed, or whether they can be constrained to be equal. Fitting this model is straightforward with the tools in this vignette.

x <- log_linear_equal_weight_agreement_design(dogs)
linear <- log_linear_create_linear_by_linear(dogs, centered=TRUE)
x_final <- log_linear_append_column(x, linear)
result8 <- log_linear_fit(dogs, x_final)

The final result has a G^2 of 5.6594019 on 7 indicating acceptable fit. This model is more parsimonious than the previous one, while still demonstrating acceptable fit.

References

Agresti, A. (1988). A model for agreement between ratings on an ordinal scale. Biometrics, 44(2), 539-548. Agresti, A. (1989). An agreement model with kappa as parameter. Statistics and Probability Letters, 7, 271-273. Schuster, C. (2001). Kappa as a parameter of a symmetry model for rater agreement. Journal of Educational and Behavioral Statistics, 26(3), 331-342. Shoukri, M. M. (2005). Measures of interobserver agreement. New York: Chapman & Hall. von Eye, A. & Mun, K. Y. (2005). Analyzing rater agreement: Manifest variable methods. Mahwah, NJ: Lawrence Erlbaum.