Abstract
Background The Spondyloarthritis Research Consortium of Canada (SPARCC) developers have created web-based calibration modules for the SPARCC MRI sacroiliac joint (SIJ) scoring methods. We aimed to test the impact of applying these e-modules on the feasibility and reliability of these methods.
Methods The SPARCC-SIJRETIC e-modules contain cases with baseline and follow-up scans and an online scoring interface. Visual real-time feedback regarding concordance/discordance of scoring with expert readers is provided by a colour-coding scheme. Reliability is assessed in real time by intraclass correlation coefficient (ICC), cases being scored until ICC targets are attained. Participating readers (n=17) from the EuroSpA Imaging project were randomised to one of two reader calibration strategies that each comprised three stages. Baseline and follow-up scans from 25 cases were scored after each stage was completed. Reliability was compared with a SPARCC developer, and the System Usability Scale (SUS) assessed feasibility.
Results The reliability of readers for scoring bone marrow oedema was high after the first stage of calibration, and only minor improvement was noted following the use of the inflammation module. Greater enhancement of reader reliability was evident after the use of the structural module and was most consistently evident for the scoring of erosion (ICC status/change: stage 1 (0.42/0.20) to stage 3 (0.50/0.38)) and backfill (ICC status/change: stage 1 (0.51/0.19) to stage 3 (0.69/0.41)). The feasibility of both e-modules was evident by high SUS scores.
Conclusion The SPARCC-SIJRETIC e-modules are feasible, effective knowledge transfer tools, and their use is recommended before using the SPARCC methods for clinical research and trials.
- Spondylitis, Ankylosing
- Magnetic Resonance Imaging
- Outcome and Process Assessment, Health Care
Data availability statement
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request. The e-tools are available free of charge for academic and not-for-profit entities. The SPARCC MRI sacroiliac joint modules are accessible at: www.carearthritis.com/service/mri-scoring-modules/
This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.
WHAT IS ALREADY KNOWN ON THIS TOPIC
Objective assessment of inflammatory and structural lesions on MRI of the sacroiliac joint (SIJ) in axial spondyloarthritis clinical trials and research can be done effectively using the Spondyloarthritis Research Consortium of Canada (SPARCC) MRI SIJ scoring methods, which are instruments that are now included in the Assessments in SpondyloArthritis International Society core set.
WHAT THIS STUDY ADDS
The SPARCC developers created two interactive web-based knowledge transfer (KT) e-modules that reflect the scoring rules set by the developers and permit training and ongoing calibration of successive generations of readers. These e-modules were validated per Outcome Measures in Rheumatology (OMERACT) recommendations for enhancing the scoring proficiency of untrained and even trained readers in the use of the SPARCC methods.
HOW THIS STUDY MIGHT AFFECT RESEARCH, PRACTICE OR POLICY
These SPARCC e-modules provide a template for the development and validation of KT tools for imaging-based scoring instruments that are considered essential in the OMERACT framework for the routine calibration of readers prior to the use of these methods in clinical research and clinical trials.
Introduction
The advent of MRI for the evaluation of axial spondyloarthritis (axSpA) marks a milestone not only for enhanced diagnostic accuracy but also for disease classification.1 MRI inflammation has also been used as an endpoint in randomised placebo-controlled trials (RCTs) of biological disease-modifying antirheumatic drugs (DMARDs) in axSpA and, more recently, in RCTs of targeted synthetic DMARDs.2–20 Scoring methodologies, such as the Berlin and Spondyloarthritis Research Consortium of Canada (SPARCC) methods, are based on semiquantitative assessment of MRI inflammation in the sacroiliac joint (SIJ) and spine.21
Evaluation of the feasibility, reliability and discriminatory properties of these instruments according to the Outcome Measures in Rheumatology (OMERACT) filter has demonstrated their high degree of reliability and substantial capacity to discriminate between active therapy and placebo within the typical 12–16-week timeframe of placebo-controlled RCTs.4–7 22 Moreover, an extensive analysis of the metric properties of these instruments conducted as part of a recent update of the Assessments in SpondyloArthritis International Society (ASAS) core outcome set led to the recommendation that the use of the SPARCC SIJ and spine instruments be mandatory in at least one pivotal RCT of a DMARD.23 SPARCC investigators have also developed an instrument to assess structural lesions in the SIJ and shown that it can detect significant differences in the extent of structural damage between active therapy and placebo within the 12–16-week timeframe of a placebo-controlled trial.10 24–27 ASAS has endorsed this instrument as an objective tool for assessing structural lesions in RCTs of axSpA.23
A limitation of imaging-based scoring instruments that affects their widespread application in a manner that ensures reliable and accurate data is the lack of feasible knowledge transfer tools (KT tools). Developers have often provided published atlases with examples of images and appropriate scoring of lesions in addition to the original descriptions of these instruments. However, such publications provide only a small sample of the potential variation in imaging abnormalities, and such KT tools are not based on Digital Imaging and Communications in Medicine (DICOM) images, which would be preferable for optimal visualisation of consecutive images. Consequently, training in using such instruments has continued to entail the traditional in-person review at workstations and displays followed by iterative training exercises to ensure sufficient reliability with developer scores and data entry on Excel spreadsheets. These standard practices are time-consuming, require the availability of expert readers on site, are prone to data entry errors and do not provide legacy tools that accurately reflect the rules set by the developers and permit training and ongoing calibration of successive generations of readers even in remote settings.
The developers of the SPARCC MRI scoring methods have created two calibration modules for assessing inflammatory and structural MRI lesions in the SIJ based on consensus scores from these instrument developers and real-time iterative feedback built into an online scoring schematic that is integrated directly with the MRI image. The modules permit remote web-based training and calibration of readers with case-based imaging content in DICOM format aimed at precision in the understanding of the scoring methodology, illustration of diverse examples of inflammation and structural change on MRI scans of the SIJ, and attainment of prespecified performance targets for reader reliability. In this report, we describe the results of validation exercises aimed at testing the impact of applying these modules in the calibration process on feasibility and interobserver reliability of the SPARCC SIJ methods in multiple readers with expertise ranging from none to extensive in the prior use of these methods.
Methods
Development of SPARCC MRI sacroiliac joint RETIC modules
The scoring of MRI lesions in the SIJ using the SPARCC methods is based on the subdivision of individual semicoronal MRI slices through the SIJ into quadrants (bone marrow oedema (BME), erosion and fat lesion) and halves (backfill and ankylosis). The two calibration systems for inflammatory and structural MRI lesions, respectively, each comprise (1) a PowerPoint module, which describes each scoring method in detail and provides numerous examples of images that the developers have scored, and (2) a web-based interactive Real Time Iterative Calibration (RETIC) module for scoring of lesions seen on MRI scans of cases with axSpA (available at www.carearthritis.com). For the latter, the presence or absence of lesions in each SIJ quadrant (BME, erosion and fat metaplasia) or half (backfill and ankylosis) is recorded dichotomously by direct online data entry using a mouse click on a web-based interface that includes a schematic of these joints adjacent to the DICOM image (figure 1, www.carearthritis.com/service/mri-scoring-modules). The interface includes individual schematic figures for each lesion, with the SIJ divided into either quadrants or halves.
The SPARCC-SIJRETIC-INF module is comprised of 50 DICOM cases, each with scans from baseline and 12 weeks after the start of tumour necrosis factor inhibitor (TNFi) therapy. The SPARCC-SIJRETIC-STR module is also comprised of 50 DICOM cases, but each case includes scans from baseline and 2 years after the start of TNFi therapy. Pairs of scans from baseline and follow-up have been scored by the SPARCC developers blinded to time point by entering 0 (denoting lesion is absent) or 1 (denoting lesion is present) in fields on the SIJ quadrants or halves of the SIJ schematic. All the cases have been scored on consecutive semicoronal slices through the SIJ and discrepancies resolved by consensus at the level of each individual SIJ quadrant or half. When readers use these modules to gain familiarity with these SPARCC methods, continuous visual real-time feedback is provided regarding concordance/discordance of scoring per SIJ quadrant or half with developer scores according to a colour-coding scheme. For instance, a blue colour at the SIJ quadrant/half indicates concordance, while a red colour indicates discordance (figure 1). Reliability is additionally assessed in real time by the module software using the intraclass correlation coefficient (ICC), the first ICC data being provided after 10 cases. Additional ICC data are provided after successive batches of 10 cases have been scored. Accreditation for SPARCC MRI SIJ inflammation score is achieved with status and change score ICC of ≥0.8 and ≥0.7, respectively, and is based on the scoring of at least 20 cases. To be accredited as a SPARCC MRI SIJ structural score reader, the ICC attained must meet the following thresholds: fat and ankylosis status (baseline scan) score ICC ≥0.7, erosion and backfill status (baseline scan) score ICC ≥0.5 and change from baseline to follow-up score (all domains) ICC ≥0.5.
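The feedback and accreditation rules described above can be illustrated with a minimal Python sketch. It is not the module's actual software: the function names and data layout are assumptions made for illustration, and only the colour scheme, the ICC thresholds and the 20-case minimum for inflammation are taken from the text.

```python
# Illustrative sketch only; names and data layout are hypothetical,
# not taken from the SPARCC RETIC software.

def concordance_colours(reader, developer):
    """Compare dichotomous (0/1) scores per SIJ quadrant or half.

    Returns 'blue' where the reader agrees with the developer consensus
    score and 'red' where the two differ.
    """
    if len(reader) != len(developer):
        raise ValueError("reader and developer must score the same regions")
    return ["blue" if r == d else "red" for r, d in zip(reader, developer)]


# Thresholds as stated in the text.
INFLAMMATION_STATUS_TARGET = 0.8
INFLAMMATION_CHANGE_TARGET = 0.7
INFLAMMATION_MIN_CASES = 20
STRUCTURAL_STATUS_TARGETS = {
    "fat": 0.7, "ankylosis": 0.7, "erosion": 0.5, "backfill": 0.5,
}
STRUCTURAL_CHANGE_TARGET = 0.5  # applies to all structural domains


def inflammation_accredited(status_icc, change_icc, n_cases):
    """True when both BME targets are met on at least 20 scored cases."""
    return (n_cases >= INFLAMMATION_MIN_CASES
            and status_icc >= INFLAMMATION_STATUS_TARGET
            and change_icc >= INFLAMMATION_CHANGE_TARGET)


def structural_accredited(status_icc, change_icc):
    """True when every structural domain meets its status and change target.

    Both arguments are dicts keyed by lesion type.
    """
    return all(
        status_icc[lesion] >= target
        and change_icc[lesion] >= STRUCTURAL_CHANGE_TARGET
        for lesion, target in STRUCTURAL_STATUS_TARGETS.items()
    )
```

For example, `concordance_colours([1, 0, 1, 0], [1, 1, 1, 0])` flags only the second quadrant as discordant.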
ICC targets required for structural lesions are lower than for inflammation (BME) because the amount of change between patients after the use of TNFi is much larger for BME than for structural lesions. The ICC is a relative measure of reliability that calculates the proportion of the total variance that is due to the variance between cases. Consequently, the small degree of variation in the amount of structural change between cases biases ICC score towards lower values even when interobserver reliability may be high.
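A small numerical illustration may help here. The sketch below treats the ICC in its simplest form, as the between-case variance divided by the sum of the between-case and error variances, and uses hypothetical variance values that are not study data. It shows that identical reader disagreement yields a lower ICC when cases differ less from one another.

```python
def icc_from_variances(between_case_var, error_var):
    """Proportion of total variance attributable to differences between cases."""
    return between_case_var / (between_case_var + error_var)


# Hypothetical values: reader disagreement (error variance) is the same in
# both scenarios; only the spread of true scores between cases differs.
error_var = 4.0
icc_large_spread = icc_from_variances(36.0, error_var)  # e.g. change in BME
icc_small_spread = icc_from_variances(4.0, error_var)   # e.g. structural change

print(icc_large_spread)  # 0.9
print(icc_small_spread)  # 0.5
```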
Study design and reading exercises
Readers comprised 11 rheumatologists, 5 radiologists and 1 research associate, all participating in the EuroSpA Research Collaboration Network. Their reading experience, based on a questionnaire, was as follows: six readers (rheumatologist n=2 and radiologist n=4) had no prior experience in reading scans with either of the SPARCC methods and minimal knowledge of the methodology; six readers (rheumatologist n=6 and radiologist n=0) were considered to have intermediate expertise based on awareness of the methodology and 1–2 scoring exercises; and five readers (rheumatologist n=3, radiologist n=1 and research associate n=1) were considered experienced with these methods, having participated in six or more reading exercises. Readers were randomised into two groups (A and B) matched on the level of experience and educational background.
We aimed to test the performance of the SPARCC-SIJRETIC-INF and SPARCC-SIJRETIC-STR modules in enhancing the scoring proficiency of EuroSpA readers in comparison with SPARCC developer gold standard scores by randomising readers into one of two calibration strategies, stratified by the level of experience and educational background. The exercise consisted of 3 calibration activities and the scoring of 3 different image sets of 25 cases after each step of calibration in both strategies and separately for each scoring method so that 75 cases in total were scored for SPARCC inflammation and 75 different cases for SPARCC structural (figure 2). None of these 75 cases are replicated in the RETIC scoring modules, each of which contains 50 entirely separate cases.
Each case had baseline and follow-up scans, and readers were blinded to the chronology of the scans. In both strategies, all readers first reviewed the original manuscript describing the methodology of the SPARCC MRI SIJ inflammation method, then scored 25 cases using this method and then reviewed the original manuscript describing the methodology of the SPARCC MRI SIJ structural scoring method followed by the scoring of 25 different cases using this method. Subsequent calibration activities were as follows:
In strategy A (readers in group A), step 2 consisted of readers reviewing PowerPoint instructions for the SPARCC inflammation method as well as the use of the web-based SPARCC-SIJRETIC-INF module and then the scoring of 25 cases using this method. This was followed by a review of the PowerPoint instructions for the SPARCC structural method as well as the use of the SPARCC-SIJRETIC-STR module and then the scoring of 25 different cases using this method. In the third and final step, readers rereviewed the PowerPoint instructions for SPARCC inflammation and then scored 25 cases using this method, followed by a rereview of the PowerPoint instructions for SPARCC structural and then the scoring of 25 different cases using this method.
In strategy B (readers in group B), step 2 consisted of readers only reviewing the PowerPoint instructions for the SPARCC inflammation method, then the scoring of 25 cases using this method, followed by a review of PowerPoint instructions for the SPARCC structural method and the scoring of 25 different cases using this method. In the third and final step, readers rereviewed PowerPoint instructions for SPARCC inflammation but then also used the SPARCC-SIJRETIC-INF module before scoring the final 25 cases with this method. This was followed by a rereview of PowerPoint instructions for SPARCC structural method as well as the use of the SPARCC-SIJRETIC-STR module before scoring 25 cases with this method.
When scoring inflammation, both T1-weighted and Short Tau Inversion Recovery (STIR) images were available, while when scoring structural changes, only T1-weighted images were available. All the test cases had previously been scored by the developers. Selection of these cases for each of the three calibration steps was aimed at a comparable level of disease severity for each set of 25 cases as determined by developer mean SPARCC scores for inflammatory and structural lesions. This was desirable so that differences in reliability from one reading exercise to the next could be reasonably ascribed to the calibration activity rather than differences in the degree of difficulty in scoring the MRI scans.
Assessment of feasibility
The feasibility of using the RETIC calibration modules as well as the SPARCC methods was assessed by recording the time expended on the reading of each case, which was done automatically by the reading software, and by completing the System Usability Scale (SUS)28 (www.usability.gov). SUS is a simple, 10-item attitude Likert scale giving a global view of subjective assessments of usability. It yields a single score on a scale of 0–100, with higher scores indicating higher perceived usability.29 This scale has been widely used in evaluating a range of systems and has led to normative data so that raw SUS scores can be converted into percentile ranks.30 The 50th percentile score is 68 and is generally regarded as the cut-off for an instrument likely to be widely applied. EuroSpA readers were asked to rate each SPARCC-SIJRETIC-INF and SPARCC-SIJRETIC-STR module using SUS after completion of each module and also to rate each SPARCC scoring method after having completed the entire reading exercise.
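For readers unfamiliar with how a SUS score is derived, the following Python sketch applies the standard SUS scoring rule: odd-numbered items contribute the response minus 1, even-numbered items contribute 5 minus the response, and the sum is multiplied by 2.5. The function names are illustrative, and the sketch is not the software used in the study; the cut-off of 68 is the one cited in the text.

```python
SUS_CUTOFF = 68  # 50th percentile score cited in the text


def sus_score(responses):
    """Compute a 0-100 SUS score from ten Likert responses (each 1 to 5).

    Items are taken in questionnaire order: odd-numbered items are
    positively worded, even-numbered items negatively worded.
    """
    if len(responses) != 10:
        raise ValueError("SUS requires exactly 10 item responses")
    if any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("each response must be an integer from 1 to 5")
    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:
            total += response - 1   # positively worded item
        else:
            total += 5 - response   # negatively worded item
    return total * 2.5


def meets_cutoff(score):
    """True when the score reaches the cut-off of 68."""
    return score >= SUS_CUTOFF
```

A reader answering 3 ("neutral") to every item would score 50, which falls below the cut-off.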
Statistics
Frequencies of each SIJ lesion were assessed descriptively. The reliability for the number of SIJ quadrants or halves with SIJ lesions was assessed by ICC(2,1) (two-way random effects, absolute agreement, single rater/measurement; MedCalc V.12.6) for each of the three reading exercises. We assessed interobserver reliability in a pairwise manner by comparing each reader’s scores with those of a SPARCC developer musculoskeletal radiologist (RL). Mean (SD) ICC scores were calculated, and the results are presented according to the calibration strategy (group A or B) and also according to the prior level of reader expertise with these methods (none to extensive).
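As an illustration of the reliability statistic, the sketch below computes a two-way random effects, absolute agreement, single-measurement ICC from the mean squares of a cases-by-raters table, following the Shrout and Fleiss formulation. It is a plain-Python approximation for exposition, not the MedCalc implementation used in the study, and the function name is an assumption.

```python
def icc_2_1(scores):
    """Two-way random effects, absolute agreement, single-rater ICC.

    `scores` is a list of rows, one per case, each holding one score per
    rater (e.g. [reader_score, developer_score] for a pairwise comparison).
    """
    n = len(scores)          # number of cases
    k = len(scores[0])       # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need at least 2 cases and 2 raters, no missing scores")

    grand_mean = sum(sum(row) for row in scores) / (n * k)
    case_means = [sum(row) / k for row in scores]
    rater_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_cases = k * sum((m - grand_mean) ** 2 for m in case_means)
    ss_raters = n * sum((m - grand_mean) ** 2 for m in rater_means)
    ss_total = sum((x - grand_mean) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_cases - ss_raters

    ms_cases = ss_cases / (n - 1)
    ms_raters = ss_raters / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_cases - ms_error) / (
        ms_cases + (k - 1) * ms_error + k * (ms_raters - ms_error) / n
    )
```

Because this form measures absolute agreement, a reader who ranks cases identically to the developer but scores every case 2 points higher is penalised: `icc_2_1([[1, 3], [5, 7], [9, 11]])` is approximately 0.89 rather than 1.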
Results
Study populations and calibration activities
Baseline demographics were typical of patients diagnosed with axSpA and meeting modified New York classification criteria for each of the 3 sets of 25 cases whose baseline and follow-up SIJ MRI scans were evaluated using the SPARCC methods. The majority were human leucocyte antigen B27-positive males starting a TNFi with mean symptom duration greater than 10 years (Online supplemental table 1). For the SPARCC-SIJRETIC-INF module, all readers achieved prespecified target ICCs for BME (≥0.70 for change score and ≥0.80 for status score). When using the RETIC module, the average number of cases that had to be scored for BME in order to reach the prespecified target using the SPARCC SIJ inflammation score was 31 (range 20–50). For the SPARCC-SIJRETIC-STR module, all readers achieved prespecified target ICCs for ankylosis (≥0.50 for change score and ≥0.70 for status score) and backfill (≥0.50 for change and status score). One reader did not achieve the prespecified target ICC for erosion (≥0.50 for change and status score; reader ICC change score for erosion=0.47), and one reader did not achieve the prespecified target ICC for fat lesion (≥0.50 for change score and ≥0.70 for status score; reader ICC status score for fat lesion=0.56). The average number of cases that had to be scored to reach prespecified targets for structural lesions using the SPARCC SIJ structural score was 45 (range 20–90) (fat lesion, 20 (range: 20–20) (excludes the reader who did not achieve the ICC target); erosion, 42 (range: 20–90); backfill, 22 (range: 20–40); and ankylosis, 21 (range: 20–30)).
MRI characteristics of the study populations
SPARCC developer scores for inflammatory and structural MRI lesions were comparable between the 3 sets of 25 cases with paired baseline and follow-up MRI scans. There was a much greater change between baseline and follow-up scans in BME than for structural lesions (table 1). There were no significant differences in status or baseline to follow-up change scores for BME between the three sets of cases and between these three sets of cases and the cases in the SPARCC-SIJRETIC-INF module. For structural lesions, there were also no significant differences in status or change scores between the three sets of cases, but comparisons of cases in the SPARCC-SIJRETIC-STR module indicated significantly lower scores for erosion at baseline from cases in stages 2 and 3 and significantly lower scores for backfill at baseline from cases in stage 2 (data not shown). Among structural lesions, scores for erosion decreased, while scores for fat lesions, backfill and ankylosis increased from baseline to follow-up. The degree of change for structural lesions was highest for erosion and lowest for backfill, especially for backfill scores in stage 2 scans, where the mean change for one developer reader was a decrease in score of 0.1, while the mean change for the second developer reader was an increase in score of 0.1. Interobserver reliability for baseline and change scores between SPARCC developers was similar between the 3 sets of 25 cases, being much higher for BME than for structural lesions, commensurate with the much lower degree of change for structural than BME lesions (table 1). This was particularly evident in the reliability of the assessment of change in backfill, which was much worse for the 25 cases assessed at stage 2 than for the 25 cases assessed at stages 1 and 3.
MRI readings by EuroSpA readers
Reliability/SPARCC MRI SIJ inflammation scores
The reliability of EuroSpA readers with the SPARCC developer radiologist for scoring the extent of BME on baseline MRI scans and detecting change in degree of BME from baseline to follow-up was high (≥0.80) even after stage 1 of the reading exercise, irrespective of the prior experience of the readers (table 2). Moreover, reliability was almost comparable with the reliability noted between the two SPARCC developers scoring the same cases (table 1). There was no consistent effect of applying the SPARCC-SIJRETIC-INF module in strategy A (between reading cases at stages 1 and 2). Although an effect of the module was apparent to a minor degree for strategy B (between reading cases at stages 2 and 3), especially for status scores and in the least experienced readers (table 2, figure 3), there were no consistent differences between the strategies in reliability attained by the completion of calibration and after reading stage 3 cases (table 2).
Reliability/SPARCC MRI SIJ Structural Scores
The reliability of EuroSpA readers with the SPARCC developer radiologist for scoring the extent of erosions on baseline MRI scans and change in degree of erosion from baseline to follow-up was lower than for BME, commensurate with the lesser degree of change in this structural outcome and the morphological complexity of these lesions. By the completion of the entire exercise, experienced EuroSpA readers were approaching similar reliability to that noted between the two SPARCC developers for baseline erosion scores but much less so for detecting a change in erosion (tables 1 and 2, online supplemental table 2). A significant increase in EuroSpA reader reliability was noted after using the SPARCC-SIJRETIC-STR module for detecting the extent of erosion at baseline and also for detecting a change in erosion, irrespective of reader expertise or strategy for calibration, which was most evident for the least experienced readers (figure 4A). By the completion of all calibration activities and after assessment of stage 3 cases, the reliability for both baseline and change scores had improved compared with stage 1 and was comparable among readers irrespective of strategy or prior experience of the readers (table 2, figure 5).
Similar observations were noted for the assessment of backfill and fat lesions by EuroSpA readers. However, more consistent enhancement of reader reliability after use of the SPARCC-SIJRETIC-STR module was found for strategy B (figure 4B,C). By the completion of calibration activities, the reliability for the assessment of backfill and fat had improved and was comparable among readers irrespective of strategy or prior experience of the readers (table 2, figure 5). The only exception was a decrease in the reliability for change in fat lesion scores among readers of intermediate expertise who were randomised to strategy B.
For the reliable detection of ankylosis, enhanced reliability after the use of the SPARCC-SIJRETIC-STR module was only noted for readers randomised to strategy B, while deterioration in reader reliability after the use of the SPARCC-SIJRETIC-STR module was noted for strategy A (figure 4D). However, the reliability for both baseline and change scores in ankylosis had improved by the completion of all calibration activities after the assessment of stage 3 cases compared with stage 1 cases and was comparable among readers irrespective of strategy or prior experience of the readers (table 2, figure 5). It should be noted that reliability between SPARCC developers for ankylosis and backfill was worse for stage 2 cases when compared with either stage 1 or stage 3 cases (table 1).
It is noteworthy that some individual pairs of readers achieved reliability comparable with the SPARCC developers irrespective of prior experience with the SPARCC structural method (example provided in online supplemental figure 1).
Feasibility
SPARCC-SIJRETIC-INF module
The mean time expended by SPARCC developers for the paired evaluation of baseline and follow-up scans of each individual case for BME was 5–6 min at each of stages 1–3 (online supplemental table 3), while the mean time per EuroSpA reader decreased from 8 min for stage 1 cases to 5.4 min for stage 3 cases. For EuroSpA readers randomised to strategy A, the mean time decreased from 7.9 min at stage 1 to 6.4 min at stage 2, following the use of the SPARCC SIJ inflammation RETIC calibration module. For EuroSpA readers randomised to strategy B, the mean time was 8.1 min at stage 1 and then decreased from 8.2 min at stage 2 to 5.7 min at stage 3, following the use of the SPARCC-SIJRETIC-INF module. By the completion of the exercise, the mean time expended by EuroSpA readers was comparable with SPARCC developers. The mean (SD) (range) SUS score for the SPARCC-SIJRETIC-INF module was 76.0 (14.4) (42.5–95), and for the SPARCC SIJ inflammation method, the mean score was 76.8 (14.4) (45–100). The scores for each reader are provided in online supplemental table 4.
SPARCC-SIJRETIC-STR module
The mean time expended by SPARCC developers for the paired evaluation of baseline and follow-up scans of each individual case for structural lesions was 9.2 min for stage 1, 13.1 min for stage 2 and 11.8 min for stage 3 (online supplemental table 3). The mean time per EuroSpA reader was 9.9 min for stage 1, 9.2 min for stage 2 and 7.6 min for stage 3 cases. For EuroSpA readers randomised to strategy A, the mean time increased from 10 min at stage 1 to 10.4 min at stage 2, following the use of the SPARCC-SIJRETIC-STR module. For EuroSpA readers randomised to strategy B, the mean time was 9.8 min at stage 1 and then decreased from 8.2 min at stage 2 to 7.5 min at stage 3, following the use of the SPARCC-SIJRETIC-STR module. The mean (SD) (range) SUS score for the SPARCC-SIJRETIC-STR module was 71.0 (15.9) (27.5–95), and for the SPARCC SIJ structural method, the mean score was 74.0 (16.9) (30–100).
SUS scores for both SPARCC-SIJRETIC-INF and SPARCC-SIJRETIC-STR modules were ≥68 for the majority of readers (76.5% and 70.6% for the inflammation and structural modules, respectively). However, this was more frequently observed for intermediate and experienced readers (figure 6).
Discussion
We have developed novel web-based calibration modules for the SPARCC MRI SIJ inflammation and structural scoring methods based on DICOM images, real-time iterative feedback and prespecified targets for attaining scoring proficiency, which have been validated in this multireader exercise that included readers with varying levels of expertise with the SPARCC scoring methods.
The SPARCC SIJ inflammation scoring method was readily understood and adopted, including by inexperienced readers, as demonstrated by the high values attained for interobserver reliability with the SPARCC developer radiologist, comparable with the reliability between the two SPARCC method developers. Furthermore, incremental gains in reader reliability after the use of the SPARCC-SIJRETIC-INF module were relatively minor. Conversely, a much greater enhancement of reader reliability was evident for the SPARCC SIJ structural damage scoring method after the SPARCC-SIJRETIC-STR module, and this was greatest for inexperienced readers and most consistently evident for the scoring of erosion and backfill. The outcomes were less clear for the scoring of fat lesions and ankylosis. However, the reliability for structural lesion scores had improved by the completion of all calibration activities and was comparable among readers irrespective of strategy or prior experience of the readers. Moreover, some individual pairs of readers achieved reliability comparable with the SPARCC developers, irrespective of prior experience with the SPARCC structural method, documenting that further reader proficiency can be achieved with further training.
Both SPARCC-SIJRETIC-INF and SPARCC-SIJRETIC-STR modules and the scoring methods were considered feasible as judged by the reading times to score each case, which were comparable with SPARCC developer times, and the high SUS scores from the majority of readers, which were above the cut-off for an instrument likely to be widely applied based on extensive experience with this instrument.30
Recent consensus-based deliberations conducted by imaging and methodology experts of the OMERACT consortium have resulted in the drafting of a framework of recommendations aimed at reducing the sources of variability for imaging-based instruments.31 Moreover, it was considered essential that these be implemented in operational guidelines for the application of an imaging instrument because reader reliability, especially for detecting change, influences responsiveness and the ability of an instrument to discriminate between therapeutic interventions. The recommendations stipulated the importance of a clear description of the scoring framework, the availability of reference standards such as an atlas of images and a systematic process for training using validated KT tools. These OMERACT recommendations also stipulate that instruments should be feasible, but a framework for assessing the feasibility of imaging instruments has yet to be created. We have adhered to these recommendations in developing PowerPoint presentations for inflammatory and structural lesions in the SIJ that outline details of the scoring methodology and provide numerous examples. However, further training and calibration should include scans in DICOM format and from timepoints during which change in lesions might be expected when exposed to currently available therapies. This led to the additional development of the SPARCC-SIJRETIC-INF and SPARCC-SIJRETIC-STR modules. Such KT tools should be validated in terms of their feasibility and effectiveness in enhancing reader reliability.
Our data demonstrate that substantial training is necessary to score structural lesions with acceptable proficiency and that this can be enhanced with the KT tools that we have developed. This is unsurprising given the complex morphology of both erosions and backfill, the latter being defined on T1-weighted scans according to the presence of both complete loss of the dark appearance of the subchondral cortex at its expected location and an irregular band of dark signal reflecting sclerosis at the border of the original erosion.32 Nevertheless, substantial enhancement of reliability was achieved for erosions and backfill after using the SPARCC-SIJRETIC-STR module, irrespective of prior reader expertise with the SPARCC methods and with either strategy of calibration. The impact of the SPARCC-SIJRETIC-STR module was more consistently observed for strategy B, particularly for fat lesions and ankylosis. Strategy A required readers to use the SPARCC-SIJRETIC-STR module after scoring only one set of scans from 25 cases after a review of the manuscript describing the method. In comparison, strategy B required readers to score one set of scans from 25 cases after a review of the manuscript describing the method and a second set of scans from 25 cases after a review of the PowerPoint presentation before they used the SPARCC-SIJRETIC-STR module and then scored the final set of scans from 25 cases. Consequently, strategy B entailed an additional training step before the SPARCC-SIJRETIC-STR module was used, which could account for the more consistent impact on reliability with this strategy. An alternative explanation may be provided by a review of the descriptive scores and reliability for SPARCC developers for stage 2 cases, which demonstrated a very small degree of change in backfill and substantially lower reliability for this lesion and, to a lesser extent, for ankylosis.
While every attempt was made to ensure comparability in disease severity for the three different sets of scans, it appears likely that scans assessed at stage 2 were more complex. This could account for the less consistent impact of the SPARCC-SIJRETIC-STR module when applied after readings of stage 1 cases compared with readings after stage 2 cases. Our finding that the reliability for structural lesion scores had improved from stage 1 to stage 3 after completion of all calibration activities and was comparable among readers irrespective of strategy or prior experience of the readers attests to the value of using a combination of PowerPoint and RETIC modules as KT tools.
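For readers unfamiliar with how an ICC summarises agreement between readers, the following is a minimal sketch of one common form, the two-way random-effects, absolute-agreement, single-rater ICC of Shrout and Fleiss, often written ICC(2,1). The choice of this form is an assumption made for illustration; the ICC model used by the e-modules and in the study analysis is not specified in this section. The function name `icc_2_1` and the scores in the usage note are hypothetical.

```python
def icc_2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `scores` is a list of rows, one row per case, each row holding one
    score per reader (same reader order in every row).
    Illustrative only: the ICC form used in the study is not stated here.
    """
    n = len(scores)        # number of cases
    k = len(scores[0])     # number of readers
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need >= 2 cases and >= 2 readers, no missing scores")

    grand = sum(sum(row) for row in scores) / (n * k)
    case_means = [sum(row) / k for row in scores]
    reader_means = [sum(row[j] for row in scores) / n for j in range(k)]

    # Two-way ANOVA sums of squares without replication.
    ss_cases = k * sum((m - grand) ** 2 for m in case_means)
    ss_readers = n * sum((m - grand) ** 2 for m in reader_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_cases - ss_readers

    ms_cases = ss_cases / (n - 1)
    ms_readers = ss_readers / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    denominator = ms_cases + (k - 1) * ms_error + k * (ms_readers - ms_error) / n
    if denominator == 0:
        raise ValueError("ICC is undefined when all scores are identical")
    return (ms_cases - ms_error) / denominator
```

As a usage sketch, `icc_2_1([[4, 5], [10, 9], [0, 1], [7, 7]])` compares two hypothetical readers across four hypothetical cases. Because this form measures absolute agreement, a reader who systematically scores higher than another lowers the ICC even when the two readers rank cases identically.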
An assessment of feasibility by a well-validated instrument, the SUS, supports the view that the SPARCC-SIJRETIC-INF and SPARCC-SIJRETIC-STR modules have utility in enhancing learning and calibration, even for experienced readers. It is unsurprising that the lowest SUS scores were observed for readers with no prior experience with the use of the SPARCC methods and that higher scores were observed with the SPARCC-SIJRETIC-INF module, as the assessment of BME is more straightforward than the assessment of structural lesions. SUS scores were also comparably high when readers were asked to rate the feasibility of the SPARCC methods, indicating that the calibration modules reflected the ease of use of the SPARCC methods.
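For context on how a SUS score is derived, the sketch below applies the standard SUS scoring rule: ten items rated 1 to 5, odd-numbered items contributing (rating − 1), even-numbered items contributing (5 − rating), and the sum multiplied by 2.5 to give a score from 0 to 100. The function name `sus_score` and the example ratings are hypothetical, and nothing here reproduces the study's data or analysis.

```python
def sus_score(ratings):
    """Return the 0-100 System Usability Scale score for one respondent.

    `ratings` holds the ten item responses in questionnaire order,
    each an integer from 1 (strongly disagree) to 5 (strongly agree).
    """
    if len(ratings) != 10:
        raise ValueError("SUS requires exactly ten item ratings")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("each rating must be an integer from 1 to 5")

    total = 0
    for item_number, rating in enumerate(ratings, start=1):
        if item_number % 2 == 1:
            total += rating - 1   # odd items are positively worded
        else:
            total += 5 - rating   # even items are negatively worded
    return total * 2.5


# Hypothetical respondent, not taken from the study.
example = sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 1])
```

Because the result is a rescaled sum rather than a percentage, SUS scores are usually interpreted against normative data rather than read as "percent usable".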
Study limitations include the small sample of scans at each stage of assessment, which likely led to differences between stages in the severity of structural damage; this may confound interpretation of the impact of the calibration modules. Moreover, the evaluated cases had r-axSpA, often with concomitant lesions, as compared with early disease where lesions may have been more subtle. Reliability may vary with the extent and severity of the lesion, and we therefore cannot extrapolate our findings to nr-axSpA. Structural lesions were scored by viewing only the T1-weighted scans rather than the STIR and T1-weighted scans together, as is generally the case in routine practice. Simultaneous evaluation of different sequences enhances the interpretation of structural and inflammatory lesions, and so it could be argued that the reliability data are overly conservative. However, the SPARCC scoring methods are primarily intended for use in clinical research of axSpA, especially clinical trials, where the simultaneous availability of STIR scans could unblind the reader to time sequence since substantial change in BME may be evident by the 12–16-week primary endpoint of placebo-controlled trials of axSpA. We also did not assess the long-term impact of the calibration modules on scoring proficiency, and it needs to be clarified how frequently readers should review the modules to maintain their scoring proficiency. It should be acknowledged that although the calibration modules enhanced scoring proficiency, there was still a substantial gap relative to the reliability attained by SPARCC developers, particularly for structural lesions, although some individual reader pairs did achieve reliability very comparable with SPARCC developers.
This gap may be addressed by the future incorporation of additional MRI sequences that accentuate the signal contrast at the interface of the cartilage and bone and thereby enhance detection of erosion, such as three-dimensional gradient echo sequences with volumetric interpolated breath-hold examination.33
In conclusion, novel web-based calibration modules have been developed for the SPARCC MRI SIJ inflammation and structural scoring methods (SPARCC-SIJRETIC-INF and SPARCC-SIJRETIC-STR) based on DICOM images, real-time iterative feedback and prespecified targets for attaining scoring proficiency. The modules, in combination with detailed PowerPoint instructions on pathologies and scoring methodology, enhanced scoring proficiency for the SPARCC MRI SIJ inflammation and structural methods in scoring exercises comprising 17 readers with varying expertise in these methods and 75 cases, each with pretreatment and post-treatment scans. The greatest enhancement of reader reliability was evident after using the SPARCC-SIJRETIC-STR module, especially for inexperienced readers, and was consistently evident for scoring erosion and backfill, even in experienced readers. The feasibility of both modules was evident from the approximation of reading time per case to that of SPARCC developers after completion of calibration and from SUS scores above the 50th percentile of normative data for the majority of readers. We therefore propose these modules for the routine calibration of readers prior to the use of these methods for clinical research and trials including MRI evaluation of the SIJ in patients with axSpA.
Ethics statements
Patient consent for publication
Ethics approval
Not applicable.
Footnotes
Twitter @walter maks, @ramicheroli
Correction notice This article has been corrected since it was first published online. The title has a misspelling. Spondyloarthritis was incorrectly spelt spondyloarth.
Contributors Substantial contributions to study conception and design: WM, AEFH, MØ, JP, RGWL. Substantial contributions to analysis and interpretation of the data: WM, AEFH, MØ, SW, JP, RGWL. Drafting the article or revising it critically for important intellectual content: WM, AEFH, MØ, RM, SJP, AC, NV, MJN, KB, SW, MdH, AJM, KP, MG, ZS, MW, KG, BM, IE, JP, RGWL. Final approval of the version of the article to be published: WM, AEFH, MØ, RM, SJP, AC, NV, MJN, KB, SW, MdH, AJM, KP, MG, ZS, MW, KG, BM, IE, JP, RGWL. WM is responsible for the overall content as guarantor. The guarantor accepts full responsibility for the finished work and/or the conduct of the study, had access to the data, and controlled the decision to publish.
Funding The authors have not declared a specific grant for this research from any funding agency in the public, commercial or not-for-profit sectors.
Competing interests WM has received honoraria/consulting fees from AbbVie, BMS, Boehringer Ingelheim, Celgene, Eli Lilly, Galapagos, Janssen, Novartis, Pfizer and UCB Pharma; research grants from AbbVie, Pfizer and UCB Pharma; and educational grants from AbbVie, Janssen, Novartis and Pfizer. WM is the Chief Medical Officer for CARE ARTHRITIS. MØ has received research grants from AbbVie, BMS, Merck, Novartis and UCB and speaker and/or consultancy fees from AbbVie, BMS, Boehringer Ingelheim, Celgene, Eli Lilly, Galapagos, Gilead, Hospira, Janssen, MEDAC, Merck, Novartis, Novo, Orion, Pfizer, Regeneron, Roche, Sandoz, Sanofi and UCB. RM received honoraria for lectures or presentations from AbbVie, Eli Lilly, Janssen, Gilead and Pfizer. BM received travel expenditures, honoraria for lectures or presentations from AbbVie, Janssen, Novartis and Pfizer. MJN has received honoraria for travel expenditures, lectures or presentations from AbbVie, Eli Lilly, Janssen, Novartis, Pfizer and UCB. MdH received honoraria for presentations from UCB. RM received honoraria for presentations from UCB.
Provenance and peer review Not commissioned; externally peer reviewed.
Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.