Objectives To test the reliability of new ultrasound (US) definitions and quantification of synovial hypertrophy (SH) and power Doppler (PD) signal, separately and in combination, in a range of joints in patients with rheumatoid arthritis (RA) using the European League Against Rheumatism–Outcome Measures in Rheumatology (EULAR-OMERACT) combined score for PD and SH.
Methods A stepwise approach was used: (1) scoring static images of metacarpophalangeal (MCP) joints in a web-based exercise and subsequently when scanning patients; (2) scoring static images of wrist, proximal interphalangeal joints, knee and metatarsophalangeal joints in a web-based exercise and subsequently when scanning patients using different acquisitions (standardised vs usual practice). For reliability, kappa coefficients (κ) were used.
Results Scoring MCP joints in static images showed substantial intraobserver variability but good to excellent interobserver reliability. In patients, intraobserver reliability was the same for the two acquisition methods. Interobserver reliability for SH (κ=0.87), PD (κ=0.79) and the EULAR-OMERACT combined score (κ=0.86) was better when using a ‘standardised’ scan. For the other joints, the intraobserver reliability was excellent in static images for all scores (κ=0.8–0.97) and the interobserver reliability marginally lower. When using standardised scanning in patients, the intraobserver reliability was good (κ=0.64 for SH and the EULAR-OMERACT combined score, 0.66 for PD) and the interobserver reliability was also good, especially for PD (κ range=0.41–0.92).
Conclusion The EULAR-OMERACT score demonstrated moderate-good reliability in MCP joints using a standardised scan and is equally applicable in non-MCP joints. This scoring system should underpin improved reliability and consequently the responsiveness of US in RA clinical trials.
This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/
What is already known about this subject?
No consensus existed until now on a single ultrasound (US) scoring system for rheumatoid arthritis (RA) clinical trials.
What does this study add?
A consensus-based US scoring system has been validated in multiple joints and has been shown to be highly reliable.
How might this impact on clinical practice?
This highly reliable consensus-based scoring system should improve responsiveness and increase the uptake of US in RA clinical trials.
Growing data suggest that ultrasound (US) is a valuable tool for assessing and classifying joint involvement and measuring disease activity based on the detection and scoring of synovitis in patients with rheumatoid arthritis (RA).1 The benefit of US in the evaluation and monitoring of patients with RA is mainly based on its greater sensitivity in detecting synovitis compared with clinical examination.2–4 Colour Doppler (CD) and power Doppler (PD) modes are able to detect pathological synovial blood flow, which reflects the inflammatory activity in the joint5–7 and has predictive value in relation to radiographic progression of structural damage8–10 and in relation to disease flare.11–13 In addition, US-detected synovitis aids more accurate early diagnosis of RA to enable earlier treatment.14 15
As RA clinical trials need objective and feasible methods for assessing inflammation response, and with clinical practice focusing on tight control of disease activity, it has become imperative to improve the reliability of US in quantifying synovitis. Many scoring systems have been proposed; however, a recent literature review highlighted the lack of an expert-derived consensus.16
The Outcome Measures in Rheumatology (OMERACT) US Working Group, in collaboration with a US working party of the European League Against Rheumatism (EULAR), conducted a series of US studies in order to understand possible reasons for low agreement in detecting synovitis and to develop and validate an expert-derived consensus for scoring synovitis. The validation process, outlined in supplementary figure 1 in the supplementary online material, was carried out in a multistep approach (four steps) from 2005 to 2014. The first two steps are described in a companion paper17 in which exercises on static images and in a clinical setting revealed that the inconsistencies and the hampered reliability in scoring synovitis among rheumatologists from different European countries were related to several sources of variability, such as the perception and weighting of the different US components (ie, synovial hypertrophy (SH), Doppler activity and also effusion) used for describing and grading the inflammatory process17 as well as differences in the US acquisition technique. Based on these discrepancies, the elementary components were redefined by Delphi consensus. It was agreed: (1) not to include effusion as an inflammatory component, as it was considered to be an inconsistent finding, frequently detected in healthy subjects or in inactive RA joints;18 19 (2) to redefine synovitis based on SH and Doppler only and (3) to score them semiquantitatively (0–3) both separately and in combination using the novel EULAR-OMERACT combined score.17 These steps were performed using metacarpophalangeal (MCP) joints as a model.
Having established these basic steps,17 the group moved to the second part of the validation process which is presented in this paper. The objectives were: (1) to evaluate the reliability of the EULAR-OMERACT combined score for grading synovitis in MCP joints, as well as the definition and quantification of SH and PD individually; (2) to test the reliability of a standardised consensus-based acquisition method compared with a ‘usual practice’ scanning method and (3) to evaluate the reliability of the new definitions for SH, PD and the EULAR-OMERACT combined score in non-MCP joints.
Twelve US-experienced rheumatologists, who participated in the first part of the standardisation process,17 participated in the following steps: (1) testing the validity of the new proposed definitions for scoring SH and PD separately and in combination (the EULAR-OMERACT PDUS score) on static images of MCP joints; (2) applying the same definitions and scoring systems in a real-time patient-based reliability exercise, by comparing a consensus-based scanning acquisition method previously obtained17 to a ‘usual practice’ scanning method and (3) testing the reliability of the new definitions and of the EULAR-OMERACT combined score in non-MCP joints (wrist, proximal interphalangeal (PIP), knee and metatarsophalangeal (MTP)) in both reading static images and scanning patients.
In all the reliability exercises, the participants used both a semiquantitative (SQ) (0–3) and a binary score (yes/no). The definitions and the scoring systems used are presented in table 1.
All patients participating in the reliability exercises fulfilled the American College of Rheumatology classification criteria for RA20 and were attending the rheumatology department of Ambroise Paré hospital in Boulogne-Billancourt (France).
Patients were selected based on the absence of joint deformities and the willingness to take part. The studies were conducted in accordance with the Declaration of Helsinki and each participant gave written informed consent.
Step 1. Web-based exercise
A set of high-quality US images of synovitis of MCP joints was selected from an anonymised register of patients with RA by two independent ultrasonographers (MADA and EN) in order to ensure inclusion of a broad range of synovitis severity. A random selection of images was shown twice in order to assess the intra-reader reliability.
Step 2. Patients-based exercise: scanning patients according to a different scanning approach
The experts performed bilateral US scanning of the second–fifth MCP joints in eight different patients. The dorsal aspect of the joints was examined twice in two rounds over 2 days. In the first round, using a ‘standardised acquisition method’, the US examinations were performed using a longitudinal dorsal scan on the middle of the joint, first in grey scale (GS) and then in PD, to detect joint morphological abnormalities and synovial flow, respectively. In the second round, a ‘usual practice (free) acquisition approach’ of the dorsal side of the MCP joint was used. In the standardised scan, the maximal grading was to be assessed in the midline. In the ‘usual practice’ scan method, the examiner recorded the maximal grading from any area of the joint.
Step 3. Testing the new definitions and the reliability of the EULAR-OMERACT combined score and of SH and PD individually in non-MCP joints
A set of high-quality US images of synovitis of wrist, PIP, knees and MTP joints from patients with RA was evaluated using images from the same register and applying the same approach as described in step 1.
After the exercise on static images, the experts performed bilateral US scanning of the wrist, PIP 2–5, knee and MTP 1–5 joints in six different patients twice in two rounds over 2 days (first day wrist and PIP joints, second day knee and MTP joints), using predefined joint positions as follows:
Wrist joints (ie, radiocarpal and midcarpal joints were evaluated as a single site): palms facing down and wrist positioned flat on the examining table, as neutral as possible but relaxed; shoulder and elbow relaxed; elbow rested on the table. Scanning at the level of the radio-lunate joint.
PIP joints: palms facing down and wrist positioned flat on the examining table, as neutral as possible but relaxed, scanning on the dorsal midline aspect.
Knee joints (ie, suprapatellar and parapatellar recesses were scored as a single site): knee 30° flexed and scanning on suprapatellar midline for the suprapatellar recess; knee extended and scanning the parapatellar areas using the retinacula as a landmark for the parapatellar medial and lateral recesses. Doppler signal was recorded only in the medial and lateral parapatellar recesses.
MTP joints: foot placed resting (with knee 30° flexed) over its plantar aspect. Scanning recorded on the dorsal midline aspect.
For all examinations, identical ESAOTE Technos MPX (Genoa, Italy) US machines with an 8–14 MHz linear array transducer were used with identical PD settings (frequency of 10.1 MHz, pulse repetition frequency of 750 Hz and Doppler gain of 50–53 dB). Each patient was assigned to one machine and the sonographers then rotated from one machine to the next in a predefined sequence with 10 min allocated for scanning and recording the findings on a standard score sheet. Participants were blinded to the patients’ clinical details (ie, presence or not of active disease).17
The intraobserver and interobserver reliability of scoring static and dynamic images were assessed using weighted kappa coefficients (κ), with weights based on absolute differences in order to take into account the magnitude of discrepancy between categories. Intraobserver coefficients were evaluated on pairs of measures performed by the same sonographer at each site, while interobserver coefficients were exclusively based on the first measure of those pairs. Interobserver reliability was studied by calculating the mean κ for all pairs (ie, Light’s κ).21 Kappa values were evaluated according to Landis and Koch.22 Percentage of observed agreement (ie, percentage of observations that obtained the same score) and prevalence of the observed lesions were also calculated. Statistical analysis was performed using the R software (http://www.r-project.org/).
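The reliability statistics described above can be illustrated with a short sketch. It is written in Python rather than the R software used in the study, and it is not the authors' analysis code. The function names, the four-category (0–3) default and the example scores are assumptions made for illustration; the weighting uses absolute differences between grades (linear weights), Light's κ is taken as the mean of all pairwise weighted κ values, and the labels follow the Landis and Koch bands.

```python
from itertools import combinations


def weighted_kappa(a, b, n_cat=4):
    """Linear-weighted Cohen's kappa for two raters scoring the same items.

    Scores are integers 0..n_cat-1; the disagreement weight is |i - j|.
    """
    if len(a) != len(b) or not a:
        raise ValueError("both raters must score the same non-empty item set")
    n = len(a)
    # Observed mean disagreement, weighted by absolute grade difference.
    observed = sum(abs(x - y) for x, y in zip(a, b)) / n
    # Expected mean disagreement under independence of the two raters.
    pa = [a.count(k) / n for k in range(n_cat)]
    pb = [b.count(k) / n for k in range(n_cat)]
    expected = sum(pa[i] * pb[j] * abs(i - j)
                   for i in range(n_cat) for j in range(n_cat))
    if expected == 0:
        return float("nan")  # both raters used one identical category only
    return 1.0 - observed / expected


def lights_kappa(ratings, n_cat=4):
    """Mean weighted kappa over all rater pairs (Light's kappa)."""
    pairs = list(combinations(ratings, 2))
    return sum(weighted_kappa(a, b, n_cat) for a, b in pairs) / len(pairs)


def observed_agreement(a, b):
    """Proportion of items that received exactly the same score."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def landis_koch(kappa):
    """Verbal label for a kappa value according to Landis and Koch."""
    if kappa < 0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (1.00, "almost perfect")]:
        if kappa <= upper:
            return label
    raise ValueError("kappa cannot exceed 1")


# Hypothetical semiquantitative (0-3) scores from three readers on four joints.
reader_1 = [0, 1, 2, 3]
reader_2 = [0, 1, 2, 3]
reader_3 = [1, 1, 2, 3]
print(weighted_kappa(reader_1, reader_3))
print(lights_kappa([reader_1, reader_2, reader_3]))
```

With 12 sonographers, as in the exercises reported here, the pairwise mean would run over 66 rater pairs. A binary (yes/no) score can be handled by the same functions with `n_cat=2`, in which case the weighted and unweighted κ coincide.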
Step 1. Testing the definition and reliability of the EULAR-OMERACT combined score on static images
Thirty-six images of MCP joints were scored. Table 2 shows the observed agreement, prevalence and κ values results. The agreement was good for the novel definitions of synovitis components (SH and PD) both separately and in combination (EULAR-OMERACT combined score) with the best obtained for PD alone.
Surprisingly, the intraobserver reliability showed a great variability between the 12 sonographers for all parameters. Similar results were seen for the binary score. The interobserver reliability was good to excellent for the SQ score of SH, PD and the EULAR-OMERACT combined score. For the binary score, the reliability was good to excellent for PD and the EULAR-OMERACT combined score, but only moderate for SH (table 2). When comparing the interobserver reliability for the SQ score with the binary score for PD and the EULAR-OMERACT combined score, reliability showed almost identical κ values; the highest κ values were seen for the PD score (SQ PD score: κ=0.98 and binary PD score: κ=0.97). For SH, the binary score was considerably lower (κ=0.57) than the SQ score (κ=0.78).
Step 2. Testing the definition and reliability of the EULAR-OMERACT combined score in patients
No major differences were recorded in the intraobserver reliability when scanning in a patient-based exercise for the ‘standardised scan’ and ‘usual practice scan’ (slightly better for the standardised scan) for all synovitis components (SH and PD) and the EULAR-OMERACT combined score, and for both binary and SQ grading (table 3).
However, interobserver reliability for both SQ and binary scores for all components was better when using the standardised scan approach (table 4). The κ values were good for SQ PD (κ=0.79) but excellent for SH and the EULAR-OMERACT combined score (κ=0.87 and 0.86, respectively). The SQ score performed slightly better than the binary score for PD (SQ score: κ=0.79; binary κ=0.76) and the EULAR-OMERACT combined score (SQ score: κ=0.86; binary κ=0.85).
Only the PD grading for both the binary score and the SQ score had better interobserver reliability in static images than when scanning patients with a standardised scan (tables 2 and 4).
Step 3. Testing the definition and reliability of the EULAR-OMERACT combined score in other joints
In the web-based exercise on static images, 100 images of wrist, PIP, knee and MTP joints were included, representing a broad range of different degrees of synovitis. Table 5 shows the observed agreement, prevalence and reliability of the different degrees of SH, PD and the EULAR-OMERACT combined score. When scoring static images, the intraobserver reliability was good to excellent for SH and PD (better for SQ grading than binary) and excellent for the EULAR-OMERACT combined score (κ=0.84). The interobserver reliability was good for all components (better for binary score than SQ grading) and best for PD (binary=0.88 and SQ=0.86). Table 6 shows the inter-reader reliability for the synovitis components and the EULAR-OMERACT combined score according to the different joints. The inter-reader reliability for the EULAR-OMERACT combined score in the wrist was κ=0.61, for the PIPs κ=0.75, for knees κ=0.55 and for the MTPs κ=0.58.
When evaluating the EULAR-OMERACT combined score in patients, the intraobserver reliability was good with almost identical values for the binary and SQ scores for all single components and in combination, ranging from 0.64 to 0.66 (table 5). The interobserver reliability was moderate to good for all components (0.43–0.61) and best for the SQ PD score (0.61) (table 5).
Supplementary figure 2 (online file) shows image examples on the EULAR-OMERACT combined score applied to PIP, MTP, knee and wrist joints.
Following the results of this multistep project, the group agreed on the following procedures for scoring synovitis by US: (1) The presence of a hypoechoic SH is mandatory for defining the presence of an US-detected synovitis and for grading Doppler activity. (2) Grading synovitis, at joint level, should be performed by using the SQ EULAR-OMERACT score (based on the combined presence of both GS SH and Doppler (table 1)).
(3) If different areas of severity are present in the same joint, the final severity grade is given by the area with the maximum of severity. (4) The acquisition and grading of synovitis by US should be performed by using a dorsal approach. (5) A standardised scan, with the position of the probe in the midline, should be recommended in the case of multicentre clinical trials using US, although it might underestimate the real inflammatory activity of the joint.
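Procedures (1) and (3) above can be expressed as a small sketch. Table 1, which defines the combined score, is not reproduced in this text, so the mapping used here is an assumption: the combined grade is taken as 0 when SH is absent (SH being mandatory for synovitis and for grading Doppler) and otherwise as the higher of the SH and PD grades. The function names and example grades are hypothetical.

```python
def combined_grade(sh, pd):
    """Combined grade for one scanned area from SH and PD grades (each 0-3).

    ASSUMPTION: table 1 is not available here. This sketch takes the combined
    grade as 0 when SH is absent, else as the higher of the two grades.
    """
    for grade in (sh, pd):
        if grade not in (0, 1, 2, 3):
            raise ValueError("grades must be integers from 0 to 3")
    if sh == 0:
        # Procedure (1): without SH there is no synovitis and PD is not graded.
        return 0
    return max(sh, pd)


def joint_grade(areas):
    """Procedure (3): the joint takes the grade of its most severe area.

    `areas` is a list of (sh, pd) tuples, one per scanned area of the joint.
    """
    return max(combined_grade(sh, pd) for sh, pd in areas)


# Hypothetical joint with two scanned areas.
print(joint_grade([(1, 0), (2, 1)]))
```

Under a standardised scan only the midline area would be passed in, whereas a ‘usual practice’ scan would pass every area examined, which is one way to see why the standardised scan can return a lower grade for the same joint.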
Over the last 10 years, the EULAR-OMERACT US group has worked on standardising the US detection, acquisition and grading of synovitis in patients with RA using a stepwise approach. In the first step, the group developed: (1) new definitions of the elementary components and a novel scoring system based on the grade of severity of SH and PD both separately and in combination: the EULAR-OMERACT combined score; and (2) a standardised image acquisition technique.17 In the second part of this multistep validation process, the reliability of these new definitions and of the scoring system for SH and PD separately, and EULAR-OMERACT combined was tested in static images, then in patients on MCP and non-MCP joints. The participation of the same multinational team in every step of the validation process added value to the consistency of the results.
The new definitions for grading SH and PD independently and combined (EULAR-OMERACT combined score) considerably improved the reliability when scoring both static images and patients. In these studies, PD was chosen as the optimal Doppler modality for the particular US machines used for depicting inflammation, but PD may be substituted by CD in the presented scoring system when working with machines where CD is more sensitive than PD.23 The interobserver reliability for the EULAR-OMERACT combined score in static images was good to excellent. In patients, the intraobserver reliability of the EULAR-OMERACT combined score showed some variability, probably due to the initial difficulty of applying the new definition in ‘real life’ scanning. However, the interobserver reliability was good to excellent.
The number of patients involved in the two patient-based reliability exercises can be seen as a limitation as they are in the lower range of the sample sizes usually used in imaging studies.24 However, as several joints were scanned in each patient (two times 8 joints in the first exercise and 22 joints per patient in the second exercise), which in these exercises are seen as independent contributors, and as several examiners participated, the results can be seen as robust enough for supporting the reliability of the scoring.
By using a stepwise process involving discussion and agreement, we were able to evaluate the real impact of the scanning technique. A standardised approach with the probe in a longitudinal plane on the dorsal aspect and in the midline of the joint was found to improve the reliability as compared with a ‘usual practice’ scanning approach. This provides further evidence supporting the concept that guidelines for image acquisition are needed and that the dorsal aspect of the joint with the probe in the midline is recommended to improve reliability when assessing small joints25 in multicentre trials though a free scan may detect more accurately the real amount of inflammation in a joint (as stated in the procedures for scanning (3)).
In the first steps of the validation process, the MCP joint was used as a model for evaluating the scoring systems. However, as the final goal of this process was to produce a generalisable instrument to be used in global assessment of disease activity, other joint areas were subsequently incorporated.26–28 The grading of SH and PD independently and the combined EULAR-OMERACT score were therefore evaluated in other commonly affected joints such as wrist, PIP, knee and MTP joints. Both intra-reader and inter-reader reliabilities were good in static images but lower in patients compared with the reliability in MCP joints. Although a standardised scanning approach was applied in these joints, further training in these joint regions may improve the reliability.
Regarding the US assessment of synovitis, it is important to emphasise that although a binary scoring system may be more reliable than an SQ grading, it is optimal only in monitoring patients for whom synovitis completely disappears during treatment. A binary score does not have the sensitivity to detect partial improvement and relies on the ability of the treatment to leave no residual inflammation, making it unsuitable for monitoring treatment effects in longstanding RA, where a complete disappearance of the SH can be hampered by the copresence of osteoarthritis or when complete remission has not been obtained. Considering that a number of studies have reported the presence of minimal abnormalities in both GS and Doppler in healthy controls,18 19 29 the development of standardised recommendations for scoring synovitis is a major step forward in the development of the concept of a minimal detectable US synovitis and defining the threshold of normality. The use of a consensus process, based on the analysis of disagreement and exploration of factors affecting reliability, represents a major advantage of this programme of work. The inclusion of clear definitions of each synovial component and the application of a standardised scanning approach have ensured a highly robust process.
The current study did not address possible intermachine variability. This may be a problem in clinical practice, best solved by using the best available equipment to give the patient an optimal evaluation.23 In multicentre trials, the machines may differ in quality, but near-equivalent quality is usually a prerequisite. In addition, each patient is always examined with the same equipment and the same settings, minimising the variability.
In our study, the reliability of the EULAR-OMERACT combined score was comparable to that of the elementary synovitis components. This has important implications in multicentre studies as both components can be equally reliable in monitoring RA depending on the joint size (ie, Doppler mode is less sensitive for deep anatomic areas) and the Doppler sensitivity of the US machine.23 Furthermore, the participation of several sonographers from different countries confirms the applicability of the proposed scoring system to multicentre clinical trials and in daily practice.
In conclusion, using an expert-derived consensus process, the EULAR-OMERACT group have developed a standardised EULAR-OMERACT combined scoring system taking both PD and SH components into account in the evaluation of synovitis of multiple RA joints, which is highly reliable when applied in scanning patients. The reliability was further improved when a standardised scanning procedure was used. The application of the proposed EULAR-OMERACT combined score and the new definition of synovitis based on the presence of SH and PD, as well as a standardised scanning approach for synovitis in RA, will ensure a greater degree of homogeneity and comparability in future US studies and facilitate the development of a Global EULAR-OMERACT Synovitis Scoring system at patient level for monitoring RA activity in clinical trials and routine care. The group is currently working on establishing an optimal reduced joint set for scoring synovitis in patients with RA using the EULAR-OMERACT combined score.
Contributors MADA designed the study. All authors contributed to the acquisition of data and have read and revised the manuscript. PA and MADA performed all statistical analysis and interpretation of data. LT and MADA drafted the manuscript.
Funding PGC is supported in part by the National Institute for Health Research (NIHR) Leeds Musculoskeletal Biomedical Research Unit.
Disclaimer The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR nor the Department of Health.
Competing interests None declared.
Patient consent Detail has been removed from this case description/these case descriptions to ensure anonymity. The editors and reviewers have seen the detailed information available and are satisfied that the information backs up the case the authors are making.
Provenance and peer review Not commissioned; externally peer reviewed.
Data sharing statement There are no unpublished data available.