Introduction Structural damage progression is a major outcome in rheumatoid arthritis (RA). Its evaluation and follow-up in trials should involve radiographic scoring by 1 or 2 readers (reference assessment), which is challenging in large longitudinal cohorts with multiple assessments.
Objectives To compare the reproducibility of multireader and reference assessment to improve the feasibility of detecting radiographic progression in a large cohort of patients with early arthritis (ESPOIR).
Methods We used 3 sessions to train 12 rheumatologists in radiographic scoring by the van der Heijde-modified Sharp score (SHS). Multireader scoring was based on 10 trained-reader assessments, each reader scoring a random sample of 1/5 of all available radiographs (for double scoring for each X-ray set) for patients included in the ESPOIR cohort with complete radiographic data at M0 and M60. Reference scoring was performed by 2 experienced readers. Scoring was performed blindly to clinical data, with radiographs in chronological order. We compared multireader and reference assessments by intraclass correlation coefficients (ICCs) for SHS and significant radiographic progression (SRP).
Results The intrareader and inter-reader reproducibility for trained assessors increased during the training sessions (ICC 0.79 to 0.94 and 0.76 to 0.92, respectively). For the 524 patients included, agreement between multireader and reference assessments was almost perfect for both SHS progression between M0 and M60 and SRP (ICC 0.88 (95% CI 0.82 to 0.93) and 0.99 (95% CI 0.99 to 0.99), respectively).
Conclusions Multireader assessment of radiographic structural damage progression is comparable to reference assessment and could be used to improve the feasibility of radiographic scoring in large longitudinal cohorts with numerous X-ray evaluations.
What is already known about this subject?
Structural damage progression is a major outcome in rheumatoid arthritis.
Its evaluation in trials is time-consuming and challenging in large longitudinal cohorts with multiple assessments.
What does this study add?
After training, multireader assessment of radiographic structural damage progression is comparable to reference assessment.
Multireader assessment can improve the feasibility of radiographic scoring in large longitudinal cohorts with numerous X-ray evaluations.
Rheumatoid arthritis (RA) is a long-lasting autoimmune disorder marked by synovial membrane inflammation that can cause joint destruction after a few years,1–3 thereby impairing quality of life and causing disability.4 Structural damage progression in RA is one major outcome; therefore, the evaluation and follow-up of structural damage progression are internationally recommended.5
Plain radiographs of the hands and feet are considered the gold standard to assess structural damage progression.5,6 Erosions and joint space narrowing (JSN) are the two typical radiographic lesions found in RA. The most frequently used contemporary scoring system is the Sharp score modified by van der Heijde (SHS),3,7 one of two reference methods used in most RA clinical trials and longitudinal observational studies. The SHS method evaluates, in each hand, 16 areas for erosions and 15 areas for JSN, and, in each foot, 6 areas for erosions and 6 areas for JSN. The erosion score can range from 0 to 5 per hand joint and from 0 to 10 per foot joint. JSN and joint subluxation or luxation are combined in a single score, from 0 to 4. The maximal scores for erosion and JSN are 160 and 120, respectively, for the hands and 120 and 48, respectively, for the feet. The maximal total SHS is 448.
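The maxima quoted above follow directly from the number of areas scored and the range of each score. The short Python sketch below reproduces that arithmetic; the names `SHS_LAYOUT` and `shs_maxima` are illustrative only, and the 0 to 10 erosion range per foot joint is taken as the value implied by the stated foot erosion maximum (120 = 6 areas × 2 feet × 10).

```python
# Areas scored per side and maximum score per area, as described in the text.
# Assumption: the 0-10 erosion range per foot joint is inferred from the
# stated foot erosion maximum (120 = 6 areas x 2 feet x 10).
SHS_LAYOUT = {
    ("hands", "erosion"): {"areas_per_side": 16, "max_per_area": 5},
    ("hands", "jsn"): {"areas_per_side": 15, "max_per_area": 4},
    ("feet", "erosion"): {"areas_per_side": 6, "max_per_area": 10},
    ("feet", "jsn"): {"areas_per_side": 6, "max_per_area": 4},
}


def shs_maxima(layout=SHS_LAYOUT, sides=2):
    """Return the maximum sub-score per (site, lesion) pair plus the total."""
    maxima = {
        key: spec["areas_per_side"] * spec["max_per_area"] * sides
        for key, spec in layout.items()
    }
    # The right-hand side is evaluated before "total" is added to the dict.
    maxima["total"] = sum(maxima.values())
    return maxima


if __name__ == "__main__":
    for key, value in shs_maxima().items():
        print(key, value)
```

The four sub-scores sum to the maximal total SHS of 448 given in the text.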
Reproducibility and sensitivity to change are important characteristics in scoring methods. Studies that evaluated the reproducibility and sensitivity to change of the Sharp, Larsen and SHS methods8–10 found that the SHS method had the best sensitivity to change and very good reproducibility improved by reader training.11
To improve reproducibility and sensitivity to change in trials and observational studies, a methodological consensus has been developed for radiographic scoring and assessment of RA-related joint damage progression. According to this consensus, progression of radiological joint damage is usually based on the simultaneous assessment of a series of X-rays for each patient by one or two readers, who are blinded to clinical data, with known order of radiographs.12 This approach is difficult to apply in large observational cohorts with many patients and multiple assessment times because of the substantial workload of scoring several hundred hand and foot X-ray sets. For example, in the large longitudinal cohort of early arthritis (ESPOIR), 813 patients were followed for 5 years, producing 4065 X-ray sets. Using a reference assessment and considering that at least 20 min13 is needed to interpret one X-ray set, one reader would have to score for 1355 hours (8 hours/day for 170 days). Multireader assessment might make the detection of radiographic progression more feasible in such cohorts by dividing the workload of radiographic scoring among readers. More readers would facilitate the assessment of structural damage progression but could also carry a risk of increased reading error and reduced reproducibility.
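The workload estimate above can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: the function name `scoring_workload` and its parameters are assumptions, and the shared-workload scenario (10 readers, each set read twice, over all 4065 sets) is a hypothetical comparison rather than the design used in this study.

```python
import math


def scoring_workload(n_sets, minutes_per_set=20, hours_per_day=8,
                     n_readers=1, readings_per_set=1):
    """Return (hours, working days) of scoring needed per reader."""
    total_minutes = n_sets * readings_per_set * minutes_per_set
    hours_per_reader = total_minutes / 60 / n_readers
    # Round up: a partly used day still counts as a working day.
    days_per_reader = math.ceil(hours_per_reader / hours_per_day)
    return hours_per_reader, days_per_reader


# 813 patients with 5 radiographic time points each give 4065 X-ray sets.
n_sets = 813 * 5

# Single reader, as in the example given in the text.
single = scoring_workload(n_sets)

# Hypothetical scenario: 10 readers, every set scored by two of them.
shared = scoring_workload(n_sets, n_readers=10, readings_per_set=2)

if __name__ == "__main__":
    print("single reader:", single)
    print("10 readers, double reading:", shared)
```

Even with every set read twice, spreading the reading across 10 readers reduces the individual burden to roughly one-fifth of the single-reader figure.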
The objective of this study was to compare the reproducibility of a multireader and usual reference assessment to possibly improve the feasibility of detecting radiographic progression in a large cohort of patients with early arthritis (ESPOIR).
Materials and methods
The French Society for Rheumatology initiated a large, national, multicentre, longitudinal, prospective registry known as the ESPOIR cohort of early arthritis.14 The protocol of the study was approved in July 2002 by the Ethics Committee of Montpellier University (no. 020307). All patients gave their signed informed consent to be included in the study.
All radiographic data used for reader training and multireader assessment were from the ESPOIR cohort. Briefly, patients were recruited if they had a clinical diagnosis of definite or probable RA or undifferentiated arthritis with potential to progress to RA. The inclusion criteria were age 18–70 years, swelling in at least two joints for ≥6 weeks and <6 months, no history of disease-modifying antirheumatic drug therapy, and no history of glucocorticoid therapy. Patients were excluded if they had other clearly defined inflammatory rheumatic or connective tissue disease or early arthritis with no potential to progress to RA. Included patients underwent clinical and biological evaluation every 6 months for 2 years, then once a year for at least 10 years. Radiographs of hands and feet were taken each year from baseline (M0) to 5 years (M60), except M48. All patients of the ESPOIR cohort with complete radiographic data at M0 and M60 were included in the current study.
An information letter was sent to each supervisor of the investigation centres involved in the ESPOIR cohort and to each departmental head of rheumatology of university hospitals to inform them about the project and to propose including a co-worker in the study. The organisation committee selected readers by evaluating motivation letters and curricula vitae. Twelve hospital rheumatologists were selected to be trained in radiographic scoring and in assessing RA-related joint damage progression by the SHS.
Each of the 12 candidates followed a structured training programme. It included a 2-day session of theoretical and practical workshops on standardised scoring methods, the software used for scoring, and the principal difficulties and ‘traps’ in scoring. To standardise the readings, all readers received the same computer with a large screen (iMac). During the first day, readers were trained in scoring, with immediate correction by the trainers (X-ray sets A and B). These scorings were not used to evaluate reliability because the scoring was not performed individually. At the end of the second day, candidates were given 30 X-ray sets corresponding to 30 patients with RA of different ages, severity and disease progression at two times, M0 and M12 (sets C and D). Each candidate had to score sets C and D by the SHS method. At least 48 hours after the first scoring, candidates scored the same sets again so that intra-rater and inter-rater reliability could be assessed for each radiographic set by calculating intraclass correlation coefficients (ICCs). Training was considered complete when intra-rater and inter-rater reliability were sufficient (ie, ICC≥0.8). With ICC<0.8, new exercises were organised. Candidates scored two other radiographic sets (sets E/F and G/H, of 30 and 25 patients, respectively, at two times), separated by training meetings to discuss significant discrepancies and difficulties in scoring.
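To make the reliability criterion concrete, the Python sketch below computes an ICC from a table of scores (one row per X-ray set, one column per reader or per reading session) and checks it against the 0.8 threshold. It uses the classical two-way random-effects, absolute-agreement, single-measure ICC, often written ICC(2,1), computed from ANOVA mean squares. This is a swapped-in textbook estimator chosen for brevity, not the mixed-model procedure used in this study, and the function names and example scores are hypothetical.

```python
def icc_2_1(scores):
    """Two-way random-effects, absolute-agreement, single-measure ICC.

    scores: list of rows, one row per X-ray set, one value per reader
    (or per reading session when assessing intra-rater reliability).
    """
    n = len(scores)        # number of X-ray sets
    k = len(scores[0])     # number of readers (or sessions)
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)                 # between-set mean square
    ms_cols = ss_cols / (k - 1)                 # between-reader mean square
    ms_error = ss_error / ((n - 1) * (k - 1))   # residual mean square

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def training_complete(intra_icc, inter_icc, threshold=0.8):
    """Training goal from the text: both ICCs must reach the threshold."""
    return intra_icc >= threshold and inter_icc >= threshold


if __name__ == "__main__":
    # Hypothetical SHS values for five X-ray sets scored by three readers.
    example = [[0, 1, 0], [4, 5, 4], [10, 12, 11], [2, 2, 3], [25, 23, 24]]
    print(round(icc_2_1(example), 3))
```

For intra-rater reliability the columns would hold the first and second scoring of the same sets by one reader; for inter-rater reliability they would hold the scores of different readers.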
Structural damage assessment of the cohort by multireader and usual reference scoring
To compare the agreement and reproducibility of multireader and reference assessment, plain radiographs were scored (by the SHS) using the same equipment as during the training for all patients of the ESPOIR cohort with complete radiographic data at M0 and M60, according to two different methods (reference or multireader assessment).
The reference assessment was used as the gold standard and followed recommendations. Blinded to clinical and biological data and with radiographs in chronological order, two experienced trained readers (MM and FB) scored all radiographs from baseline (M0) and M60 by the SHS. The patient score was calculated as the mean of the scores given by the two experienced readers.
The multireader assessment involving 10 trained readers (AF, MA, MC, LB, JDA, EC, DD, VM, AP, NP) was compared with the reference assessment. For this assessment, all patients included in the study were randomly divided into equal subgroups and their X-ray sets were randomly allocated to the 10 readers. Each X-ray set corresponding to one patient (ie, two radiographs of hands and feet at times M0 and M60) was scored according to the SHS by two different readers of the multireader group with blinding to clinical and biological data and with radiographs in chronological order.
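One way to obtain such an allocation is sketched below in Python. It is a hypothetical scheme and not necessarily the randomisation procedure used in the study: patients are shuffled, and each X-ray set is then assigned to two different readers so that the reading load stays balanced. The function name, the seed and the use of integer patient identifiers are assumptions made for illustration; the reader initials are those listed above.

```python
import random
from collections import Counter

READERS = ["AF", "MA", "MC", "LB", "JDA", "EC", "DD", "VM", "AP", "NP"]


def allocate_double_reading(patient_ids, readers, seed=0):
    """Randomly assign each patient's X-ray set to two different readers."""
    rng = random.Random(seed)   # fixed seed so the allocation is reproducible
    shuffled = list(patient_ids)
    rng.shuffle(shuffled)
    n = len(readers)
    allocation = {}
    for i, pid in enumerate(shuffled):
        first = i % n
        # Offset between 1 and n-1, so the second reader always differs
        # from the first; it changes every n patients to vary the pairs.
        offset = 1 + (i // n) % (n - 1)
        second = (first + offset) % n
        allocation[pid] = (readers[first], readers[second])
    return allocation


if __name__ == "__main__":
    allocation = allocate_double_reading(range(1, 525), READERS)
    load = Counter(r for pair in allocation.values() for r in pair)
    print(dict(load))
```

With 524 patients and 10 readers, each reader scores about one-fifth of the X-ray sets, which matches the double scoring of every set described in the text.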
Statistical analysis involved use of R Statistical software (V.3.2.0; R Foundation for Statistical Computing, Vienna, Austria). SHSs are presented as median (first quartile (Q1); third quartile (Q3)). ICCs for intrareader and inter-reader reliability were calculated with a generalised linear mixed model used to measure variances. A bootstrap procedure with 500 replications was used to estimate 95% CIs. To evaluate training performance, the ICCs for intrareader and inter-reader reliability for each training session were calculated, as was an overall ICC taking into account all X-ray sets for patients and all training sessions.
Different approaches were used to analyse multireader and reference readings. Agreement was evaluated for SHSs (SHS for each time point and ΔSHS corresponding to SHS change between M0 and M60) and for structural damage progression. Two definitions of structural damage progression corresponding to two different thresholds were used: ΔSHS-5, with SHS change between M0 and M60 >5, and significant radiographic progression (SRP),17 with SHS change between M0 and M60 greater than the smallest detectable change (SDC). The SDC is defined as 1.96 × SD_(change score)/(√2 × √k), where k represents the number of readings.12
Agreement and homogeneity between the multireader and reference assessments were evaluated by ICCs. No agreement was characterised as ICC<0, slight agreement as 0–0.20, fair as 0.21–0.40, moderate as 0.41–0.60, substantial as 0.61–0.80 and almost perfect as 0.81–1. Agreement between multireader and reference assessments for the SHS and ΔSHS and for progression (ΔSHS-5 and SRP) was evaluated in two ways: by comparing the mean scores of the two readings of the multireader assessment with the mean scores of the two readers of the reference assessment, and by assessing the homogeneity of scoring between readers from the multireader and reference groups.
To assess the correlation of SHS between multireader and reference assessments, we calculated the Pearson correlation coefficient. Bland and Altman plots were used to visualise the agreement between multireader and reference assessments. Finally, homogeneity between readers within the multireader group was evaluated by calculating ICCs for SHS and ΔSHS and agreement for the score progression (ΔSHS-5 and SRP).
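The progression definitions and the agreement scale described above can be written down compactly. The Python sketch below is a minimal illustration: the function names are hypothetical, the SD of the change score is taken as an input rather than estimated, and the handling of ICC values that fall between two printed bands (for example, between 0.20 and 0.21) is an assumption.

```python
import math


def smallest_detectable_change(sd_change_score, k):
    """SDC = 1.96 x SD(change score) / (sqrt(2) x sqrt(k)), k = readings."""
    return 1.96 * sd_change_score / (math.sqrt(2) * math.sqrt(k))


def progression_flags(shs_m0, shs_m60, sdc):
    """Apply both progression definitions to one patient's scores."""
    delta = shs_m60 - shs_m0
    return {
        "delta_shs": delta,
        "delta_shs_5": delta > 5,   # change greater than 5 SHS units
        "srp": delta > sdc,         # change greater than the SDC
    }


def agreement_label(icc):
    """Map an ICC to the agreement categories used in the text.

    Assumption: values between two printed bands are assigned to the
    higher band (e.g. 0.205 is labelled "fair").
    """
    if icc < 0:
        return "no agreement"
    if icc <= 0.20:
        return "slight"
    if icc <= 0.40:
        return "fair"
    if icc <= 0.60:
        return "moderate"
    if icc <= 0.80:
        return "substantial"
    return "almost perfect"


if __name__ == "__main__":
    # Hypothetical patient: SHS 1 at M0 and 9 at M60, with an SDC of 11.
    print(progression_flags(1, 9, sdc=11))
    print(agreement_label(0.69), "/", agreement_label(0.99))
```

A patient can therefore meet the ΔSHS-5 definition without meeting SRP whenever the SDC exceeds 5, which is why the two thresholds are reported separately.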
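The quantities read from a Bland and Altman analysis can be sketched in a few lines of Python. The example below is illustrative only and is not the R code used in this study: it computes the Pearson correlation, the mean difference (systematic bias), the 95% limits of agreement and the slope of the differences regressed on the pairwise means (proportional bias). The function names, the direction of the difference (multireader minus reference) and the example scores are assumptions.

```python
import math
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation coefficient between two lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bland_altman(multireader, reference):
    """Bias, 95% limits of agreement and proportional-bias slope."""
    diffs = [m - r for m, r in zip(multireader, reference)]
    means = [(m + r) / 2 for m, r in zip(multireader, reference)]
    bias = mean(diffs)          # systematic bias
    sd = stdev(diffs)           # sample SD of the differences
    mbar = mean(means)
    # Least-squares slope of the differences regressed on the means.
    slope = sum((a - mbar) * (d - bias) for a, d in zip(means, diffs)) / sum(
        (a - mbar) ** 2 for a in means
    )
    return {
        "bias": bias,
        "limits": (bias - 1.96 * sd, bias + 1.96 * sd),
        "slope": slope,
    }


if __name__ == "__main__":
    # Hypothetical mean SHS per patient from each assessment method.
    multi = [0.5, 2.0, 3.5, 10.0, 22.0]
    ref = [1.0, 2.0, 3.0, 11.5, 26.0]
    print(round(pearson_r(multi, ref), 3))
    print(bland_altman(multi, ref))
```

A bias close to zero indicates no systematic difference between methods, whereas a non-zero slope indicates that the difference between methods depends on the level of the score.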
Patient characteristics in the ESPOIR cohort were previously published.15 In total, 524 patients had radiographic data at M0 and M60 and were included in this study.
After three training sessions, the intra-rater and inter-rater reliability increased considerably (from ICC 0.79 (95% CI 0.68 to 0.85) and 0.76 (0.65 to 0.84) to 0.94 (0.88 to 0.97) and 0.92 (0.81 to 0.96), respectively; table 1) and the objective of training (ICC≥0.8) was achieved. The overall reproducibility (including all times for all training sessions: sets C/D, E/F and G/H) was excellent for both intra-rater evaluation (ICC 0.92 (0.87 to 0.93)) and inter-rater evaluation (ICC 0.90 (0.84 to 0.93)).
Multireader and reference scoring or assessment
The reference group, composed of two trained readers, scored 1048 sets of radiographs (524 patients, M0 and M60). The radiographs for these 524 patients were divided and randomly allocated to 2 of the 10 trained readers of the multireader group, who scored sets of radiographs for M0 and M60. In total, 385 patients had full data for M0 and M60, which allowed for evaluating SHS for these two times.
Structural damage progression between M0 and M60
Among the 385 patients with full data at M0 and M60, for the reference group, the median SHS was 1 (Q1;Q3 0;3) at M0 and 3 (Q1;Q3 0.5;10.5) at M60, and the median SHS change between M0 and M60 was 1.5 (Q1;Q3 0;7). Structural damage progression (ΔSHS-5) was observed for three patients in the multireader group and one patient in the reference group. In our cohort, the SDC between M0 and M60 was 11. Each method showed one patient with SRP. No patient with structural progression was identified by both methods, using ΔSHS-5 or SRP (table 2).
Agreement between multireader and reference assessments
For the SHS, we found good correlation between multireader and reference assessments (r=0.87, p<0.001; figure 1). The overall agreement between multireader and reference assessments was good (ICC 0.69 (95% CI 0.62 to 0.75)). Results were similar regardless of the joint region (ICC for hands and feet, 0.69 (0.62 to 0.75) and 0.65 (0.52 to 0.75), respectively) or the lesion assessed (ICC for JSN and erosion, 0.71 (0.65 to 0.77) and 0.65 (0.57 to 0.73), respectively; table 3).
The Bland and Altman plot showed the absence of systematic bias between the two scoring methods (mean difference=−0.0062, p=0.977; figure 2). We found a proportional negative bias showing that the agreement differed by the level of score (ie, agreement was less for patients with high SHS (slope=−0.2036, p<0.001)).
More interestingly, the agreement between multireader and reference assessments for SHS change between M0 and M60 was excellent (ΔSHS-5 ICC 0.99 (95% CI 0.95 to 0.99); SRP ICC 0.99 (0.99 to 0.99); table 2). These results were consistent regardless of the location (hands or feet) or the elementary lesion assessed.
Homogeneity of scores between readers (multireader and reference groups)
Similar results were found when evaluating the homogeneity of the scores between all readers from the multireader and reference group: SHS for total score, erosion score and JSN score (ICC 0.67 (95% CI 0.63 to 0.72), 0.63 (0.59 to 0.67) and 0.69 (0.62 to 0.75), respectively), and for structural damage progression, ΔSHS, ΔSHS-5 and SRP (ICC 0.86 (0.79 to 0.89), 0.89 (0.84 to 0.98) and 0.99 (0.98 to 0.99), respectively; table 4).
Homogeneity between readers within the multireader group
The agreement was substantial for SHS (ICC 0.67 (95% CI 0.62 to 0.73)). Agreement was high for structural damage progression between readers within the multireader group (ICC 0.87 (0.79 to 0.92), 0.95 (0.84 to 0.99) and 0.96 (0.84 to 0.99) for ΔSHS, ΔSHS-5 and SRP, respectively).
Here, we aimed to improve the feasibility of X-ray assessment for detecting structural damage progression in a large longitudinal RA cohort by comparing multireader assessment with the usual reference assessment. Multireader evaluation showed good reproducibility as compared with the reference method. The overall agreement between multireader and reference assessment was good. More interestingly, the agreement between these two methods was excellent for change in SHS between M0 and M60. These results suggest that structural damage progression can be evaluated with similar results whatever the reader method used. Multireader assessment has the advantage of greater feasibility for a large cohort (because each reader has to score a reduced number of sets) and allows for detecting structural damage progression with results similar to those of the usual reference method.
Our study allowed us to evaluate the training duration needed to obtain good reliability. After three training sessions, readers reached satisfactory reliability. A significant increase in reproducibility resulted in excellent ICCs for intrareader reliability (0.79 to 0.94) and inter-rater reliability (0.76 to 0.92) in our training group. Moreover, our results highlighted the rapidity of the training (only 2 days) to achieve almost perfect agreement for intra-rater and inter-rater reliability.11
Several studies have evaluated inter-rater reliability in radiographic evaluation in RA. This reliability depends on reader experience, number of readers, joint training of the readers, use of progression score or absolute score, and time of reproducibility evaluation during the follow-up of the patient.9 The results of different studies evaluating inter-rater reliability in RA scoring are shown in the online supplementary table S1. These results highlight that reproducibility is never poor (<0.6) but can range from correct16,17 and good18,19 to excellent.2,7,9,20–26 Of note, the reference statistic used to evaluate the reproducibility is the ICC. Only a few studies evaluated inter-rater reliability in radiographic evaluation in RA with >2 readers.16,20,22 In those studies, the reproducibility was heterogeneous (from 0.58 to 0.97). In our study, the inter-rater reliability in radiographic evaluation was comparable to that from the Sharp et al20 study.
To the best of our knowledge, this is the first study evaluating the feasibility of multireader assessment as an alternative to the time-consuming reference assessment in a large cohort of patients with RA. A limitation of the study is that the detection of patients with structural progression by both methods could not be reliably compared because of the small number of patients with structural progression. Thus, our study should be replicated and validated in another population containing a higher number of patients with structural damage progression. Nevertheless, the overall agreement on the change in SHS was almost perfect.
In conclusion, our study highlighted the efficacy and rapidity of training a group of readers for radiographic scoring using the SHS in a large cohort of patients with RA. This method could be proposed as an alternative to monoreader evaluation to improve the feasibility of radiographic scoring in cohorts including a large number of patients and multiple time points. Further validation of multireader assessment of radiographic structural damage progression in RA is needed.
The authors thank the steering committee: A Cantagrel, Toulouse; B Combe, Montpellier; M Dougados, Paris-Cochin; BF, Paris-Pitié; F Guillemin, Nancy; X Le Loet, Rouen; I Logeart, Paris; AS, Brest, J Sibilia, Strasbourg, P Ravaud, Paris-Bichat. Sixteen regional investigation centres: F Berenbaum, Paris- Saint Antoine; MC Boissier, Paris-Bobigny; A Cantagrel, Toulouse; B Combe, Montpellier; M Dougados, Paris-Cochin; P Fardelonne, P Boumier, Amiens; BF, P Bourgeois, Paris-La Pitié; RM Flipo, Lille; Ph Goupille, Tours; F Liote, Paris-Lariboisière; X Le Loet, O Vittecoq, Rouen; X Mariette, Paris Bicetre; O Meyer, Paris Bichat; AS, Brest; Th Schaeverbeke, Bordeaux; J Sibilia, Strasbourg. Coordination centre: JP Daures, Montpellier; N Rincheval, Montpellier; B Combe, Montpellier. X-ray centre: AS, Brest; VD—PENSEC, Brest; C Lukas, Montpellier. Biobank: J Benessiano, Paris-Bichat.
- Received July 27, 2016.
- Revision received November 29, 2016.
- Accepted November 30, 2016.
FG, BG and RF performed a similar amount of work and are considered co-first authors.
Funding This study was supported by Roche Pharma (France).
Competing interests None declared.
Patient consent Obtained.
Ethics approval The study was authorised by the Ethics Committee of Paris Ile de France (CPP Ile de France II, no. 2008-07-04).
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement No additional data are available.