Abstract
Objectives. To evaluate the reproducibility of clinical synovitis assessments in rheumatoid arthritis and the effect of variability on the Disease Activity Score-28 (DAS28).
Methods. Seven healthcare professionals from different cities examined the same patients with active non-early rheumatoid arthritis (RA; duration > 4 yrs), for whom a treatment change was being considered. There was no training session and the examination was to be performed as quickly as possible. The healthcare professionals assessed the 28 joints of the DAS28 in 7 patients (196 joints), then reexamined the same 28 joints in 4 of these 7 patients (112 joints), who had been rendered unrecognizable. Then 7 sonographers examined each of the 7 patients twice, using B-mode and power Doppler ultrasound (PD). The reference standards were presence of synovitis according to at least 50% of clinical examiners and 50% of sonographers. Agreement was assessed by Cohen’s kappa statistic.
Results. Intraobserver reliability ranged from 0.31 (least experienced research technician) to 0.77 (most experienced physician). Interobserver reliability ranged from 0.18 to 0.62. The largest difference between the lowest and the highest swollen joint counts in the same patient was 15, and the greatest variation in the DAS28 score was 0.92. Agreement between clinical and sonographic reference standards was 0.46, 0.37, and 0.36 for B-mode, PD, and both, respectively.
Conclusion. Clinical inter- and intraobserver reliability is highly dependent on the examiner. Consequences on the DAS28 score can be substantial. Agreement with sonography is poor when both B-mode and PD are used but seems better, although low, when B-mode is used alone.
Synovitis, or inflammatory hypertrophy of the synovial membrane, is a hallmark of rheumatoid arthritis (RA) that can cause joint swelling. The number of joints with synovitis is related to disease activity. Studies have shown that counts on a limited number of joints are valid for assessing disease activity1,2. In everyday practice, counts are usually done on the following 28 joints: metacarpophalangeal joints, proximal interphalangeal joints, wrists, elbows, shoulders, and knees.
The Disease Activity Score-28 (DAS28)3 is a composite index of RA activity that is useful for making treatment decisions. It is calculated from 4 variables: the tender joint count (TJC) and swollen joint count (SJC) on the 28 joints above, the erythrocyte sedimentation rate (ESR; mm/h), and general health (GH) assessed by the patient on a 100-mm visual analog scale (VAS). The formula is as follows: DAS28 = 0.56 × √TJC + 0.28 × √SJC + 0.70 × ln(ESR) + 0.014 × GH. Scores < 2.6 indicate complete remission, scores 2.6–3.2 low disease activity, scores 3.2–5.1 moderate disease activity, and scores > 5.1 high disease activity.
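For illustration, the DAS28 calculation and its activity categories can be sketched in Python. This is a minimal sketch, not the software used in the study: the function names das28 and das28_category are hypothetical, the ESR term uses the natural logarithm as in the standard DAS28 definition, and assigning the shared boundary values 3.2 and 5.1 to the lower category is an assumption, because the ranges quoted above share endpoints.

```python
import math


def das28(tjc: int, sjc: int, esr: float, gh: float) -> float:
    """Return the DAS28 from TJC and SJC (0-28), ESR (mm/h), and GH (0-100 mm VAS)."""
    if not (0 <= tjc <= 28 and 0 <= sjc <= 28):
        raise ValueError("joint counts must lie between 0 and 28")
    if esr <= 0:
        raise ValueError("ESR must be positive to take its logarithm")
    return (0.56 * math.sqrt(tjc)
            + 0.28 * math.sqrt(sjc)
            + 0.70 * math.log(esr)   # natural logarithm of the ESR
            + 0.014 * gh)


def das28_category(score: float) -> str:
    """Map a DAS28 score to a disease activity category.

    Assumption: the boundary values 3.2 and 5.1 fall in the lower category.
    """
    if score < 2.6:
        return "remission"
    if score <= 3.2:
        return "low"
    if score <= 5.1:
        return "moderate"
    return "high"


# Hypothetical values chosen only to make the arithmetic easy to follow:
# 0.56*2 + 0.28*3 + 0.70*ln(1) + 0.014*50 = 1.12 + 0.84 + 0 + 0.70
score = das28(tjc=4, sjc=9, esr=1.0, gh=50.0)
print(round(score, 2), das28_category(score))  # 2.66 low
```

Because the SJC enters through a square root with a weight of 0.28, a change of a few swollen joints shifts the score by a fraction of a point rather than by whole points.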
Joint swelling is the only presumably objective clinical variable in the DAS28 index. However, the difference between a normal joint and a swollen joint is not clearly defined; thus, the SJC may be affected by intra- and interobserver variability. These variations in the SJC may influence the DAS28 value and therefore the assessment of disease activity.
Sonography is rapidly becoming a major method for assessing inflammatory joint disease and can detect joint swelling. It is a widely available and inexpensive investigation that is increasingly used as a complement to the clinical examination4,5. However, poor agreement has been reported between clinical and sonographic SJC6.
We conducted the SEA-Repro (Sonographic Evaluation of Arthritis) study, whose objectives were to evaluate the intra- and interobserver reproducibility of clinically determined SJC in 7 patients with RA, to evaluate the influence of intra- and interobserver variability on the DAS28 score, and to evaluate agreement between clinical and sonographic SJC.
MATERIALS AND METHODS
Patients
Seven patients with active RA recruited at the Rheumatology Department of Brest Teaching Hospital were included (Table 1). They met the 1987 revised American Rheumatism Association (American College of Rheumatology)7 criteria for RA. There were 5 women and 2 men, with a mean age of 57.1 years (SD 6.8) and a mean disease duration of 22.1 years (SD 13.6). Three patients had rheumatoid nodules. All patients were receiving corticosteroids (mean dosage 8.1 mg/day, SD 3.6) and disease-modifying antirheumatic drugs (methotrexate in 6 patients). Two patients were receiving tumor necrosis factor (TNF) antagonist therapy. Because of high disease activity, a change in treatment (introduction of a TNF antagonist or switch to another TNF antagonist) was being considered for all 7 patients. Mean TJC was 9 (SD 6.9), mean ESR was 23.8 mm/h (SD 11.2), and mean pain intensity on a 100-mm VAS was 56.8 (SD 21.9).
Conduct of the study
The patients and examiners attended a meeting for the assessments. All patients were examined by 7 clinical healthcare professionals, and 4 of the 7 patients were examined twice. Then 7 experienced sonographers examined the 28 joints in each of the 7 patients on 2 separate occasions. The assessment technique was standardized during a consensus meeting held just before the assessment session. During this consensus meeting, sonographers received training on the joint assessment and information on the study methodology.
The 7 clinical examiners were recruited from different cities in France and Belgium. There were 5 physicians, 1 clinical research technician, and 1 occupational therapist. They were defined as senior if they had at least 5 years of experience and otherwise as junior. Only 5 minutes were allowed for the joint examination in each patient. No instructions were given before the examinations, and there was no training session8. The 28 joints used for the DAS28 were assessed for swelling (total of 196 joints), and the findings were scored using a semiquantitative scale (0, no synovitis; 1, synovitis unlikely; 2, synovitis probable; and 3, synovitis present). Joints with scores of 2 or 3 were counted as swollen. Four patients then donned masks and gowns to make them unrecognizable and were reexamined by the 7 clinical healthcare professionals (112 joints).
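The scoring rule described above (28 joints, scale 0 to 3, scores of 2 or 3 counted as swollen) can be sketched as follows. This is an illustrative sketch only: the function name swollen_joint_count and the example scores are hypothetical and are not study data.

```python
SWOLLEN_THRESHOLD = 2  # 2 = synovitis probable, 3 = synovitis present


def swollen_joint_count(scores):
    """Count swollen joints from 28 semiquantitative scores (0-3)."""
    if len(scores) != 28:
        raise ValueError("expected one score for each of the 28 joints")
    if any(s not in (0, 1, 2, 3) for s in scores):
        raise ValueError("scores must be 0, 1, 2, or 3")
    return sum(1 for s in scores if s >= SWOLLEN_THRESHOLD)


# Hypothetical examination: 20 joints scored 0, 3 scored 1, 3 scored 2, 2 scored 3.
example_scores = [0] * 20 + [1] * 3 + [2] * 3 + [3] * 2
print(swollen_joint_count(example_scores))  # 5
```

Dichotomizing at 2 means that a joint graded 1 by one examiner and 2 by another changes the SJC by one, which is one route by which doubtful synovitis can produce interobserver differences.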
Then, 7 sonographers from different cities examined the 28 joints in each of the 7 patients and repeated the investigation in the patients wearing masks and gowns. They used an Esaote Technos MPX apparatus with a 12.5-MHz transducer (Esaote Biomedica, Genoa, Italy). The joints were assessed using the OMERACT preliminary definition of synovitis and a grading system based on Szkudlarek’s semiquantitative method9–11 for both B-mode and power Doppler ultrasound (Table 2). As reported12, synovitis was defined as a grade ≥ 1 by both B-mode ultrasound (at least synovial thickening bulging over the line linking the tops of the periarticular bones but without extension along the bone diaphysis) and power Doppler (up to 3 discrete spots or 1 confluent spot plus up to 2 discrete spots).
Thus, the clinical examiners performed their assessments in conditions that replicated routine practice (no training session, no instructions, and only 5 minutes for the joint examination of each patient). Sonography, in contrast, was performed under optimal conditions (training session and instructions) to evaluate the concordance between clinical examination and sonography for detecting clinically relevant synovitis.
Reference standards
We used the clinical reference standard, namely, synovitis found by at least 50% (4/7) of the clinical healthcare professionals. The sonographic reference standard was synovitis found by at least 50% (4/7) of the sonographers.
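A majority-based reference standard of this kind can be sketched as below. The function name reference_standard, its min_fraction parameter, and the example ratings are hypothetical; the sketch assumes that each examiner's findings have already been reduced to 1 (synovitis) or 0 (no synovitis) per joint.

```python
def reference_standard(ratings, min_fraction=0.5):
    """Return, per joint, whether at least min_fraction of examiners found synovitis.

    ratings: one list per examiner, each holding a 0/1 finding per joint.
    """
    if not ratings:
        raise ValueError("at least one examiner is required")
    n_examiners = len(ratings)
    n_joints = len(ratings[0])
    if any(len(r) != n_joints for r in ratings):
        raise ValueError("every examiner must rate the same joints")
    # zip(*ratings) regroups the findings joint by joint.
    return [sum(joint) >= min_fraction * n_examiners for joint in zip(*ratings)]


# Hypothetical findings of 7 examiners on 3 joints (positives per joint: 4, 3, 7).
example = [
    [1, 1, 1],
    [1, 1, 1],
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 1],
]
print(reference_standard(example))  # [True, False, True]
```

With 7 examiners, the 50% threshold is met only from 4 positive findings upward, which matches the 4/7 rule stated above.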
Statistical analysis
Statistical analysis was performed using SPSS 15.0 for Windows (SPSS Inc., Chicago, IL, USA). Reproducibility was assessed using Cohen’s kappa coefficient, interpreted as follows: excellent, ≥ 0.80; good, 0.60–0.79; fair, 0.40–0.59; and poor, < 0.40. Intraobserver reproducibility was assessed by comparing the results of the first and second examinations of the same 4 patients. Interobserver reproducibility was assessed by comparing the results of the first examination to the 2 clinical reference standards.
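For readers unfamiliar with the statistic, Cohen's kappa compares the observed agreement between two sets of ratings with the agreement expected by chance from their marginal frequencies. The sketch below is a plain Python illustration, not the SPSS procedure used in the study. The function names cohen_kappa and interpret_kappa and the example ratings are hypothetical, and treating values of 0.80 and above as excellent is an assumption about the intended band.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length sequences of categorical ratings."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: proportion of joints rated identically.
    p_observed = sum(1 for x, y in zip(a, b) if x == y) / n
    # Chance agreement: product of the marginal proportions, summed over categories.
    categories = set(a) | set(b)
    p_chance = sum((list(a).count(c) / n) * (list(b).count(c) / n)
                   for c in categories)
    if p_chance == 1.0:
        return float("nan")  # kappa is undefined when both raters use one category
    return (p_observed - p_chance) / (1.0 - p_chance)


def interpret_kappa(kappa):
    """Apply the bands used in this section (0.80 and above assumed excellent)."""
    if kappa >= 0.80:
        return "excellent"
    if kappa >= 0.60:
        return "good"
    if kappa >= 0.40:
        return "fair"
    return "poor"


# Hypothetical first and second examinations of 4 joints (1 = swollen).
first = [1, 1, 0, 0]
second = [1, 0, 0, 0]
k = cohen_kappa(first, second)
print(k, interpret_kappa(k))  # 0.5 fair
```

In this example the raters agree on 3 of 4 joints (0.75), chance agreement is 0.5, and kappa is therefore (0.75 - 0.5) / (1 - 0.5) = 0.5.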
To evaluate the effects of intra- and interobserver variability on the DAS28 values, we computed the DAS28 in each patient using the lowest and highest SJC for that patient.
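Because the TJC, ESR, and GH of a given patient are the same whichever SJC is used, the DAS28 difference between the lowest and highest SJC depends only on the SJC term. A minimal sketch, assuming the weight of 0.28 on the square root of the SJC from the DAS28 formula; the function name das28_spread and the example counts are hypothetical and are not study data.

```python
import math


def das28_spread(sjc_low: int, sjc_high: int) -> float:
    """DAS28 difference attributable to the SJC alone.

    The TJC, ESR, and GH terms are identical in both scores and cancel out.
    """
    if not (0 <= sjc_low <= sjc_high <= 28):
        raise ValueError("expected 0 <= sjc_low <= sjc_high <= 28")
    return 0.28 * (math.sqrt(sjc_high) - math.sqrt(sjc_low))


# Two hypothetical patients with the same 9-joint gap between examiners.
print(round(das28_spread(0, 9), 2))    # 0.84
print(round(das28_spread(16, 25), 2))  # 0.28
```

The square root makes a given SJC gap weigh more when the counts are low, so the patient with the largest SJC difference is not necessarily the patient with the largest DAS28 variation.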
Finally, to evaluate agreement between the clinical and sonographic examinations, we compared the joints that had synovitis according to at least 50% of the clinical healthcare professionals to those that had synovitis according to at least 50% of the sonographers.
RESULTS
Intraobserver reproducibility
Cohen’s kappa values indicated that intraobserver reproducibility ranged from poor to good (0.31 to 0.77; Table 3). Reproducibility was best for the rheumatologist with the most experience and worst for the occupational therapist. None of the examiners had kappa values in the excellent range. Values indicated fair reproducibility for 3 of the 7 examiners.
For the sonographers, intraobserver reproducibility was between 0.37 and 0.75 in B-mode ultrasound and between 0.25 and 0.77 in power Doppler mode.
Interobserver reproducibility
Considerable interobserver variability was found (Table 3). With the 50% reference standard, kappa values ranged across examiners from 0.40 to 0.62.
For the sonographers, interobserver reproducibility was between 0.43 and 0.63 in B-mode and between 0.27 and 0.56 in power Doppler mode.
Agreement between clinical and sonographic SJC
Agreement was poor overall (Table 4). The kappa coefficient was 0.36 for the comparison of the 50% sonographic reference standard (B-mode and power Doppler) to the 50% clinical reference standard. However, agreement was better when the 50% sonographic reference standard was compared to the 50% clinical reference standard using B-mode only (kappa = 0.46). Agreement was better at the interphalangeal and metacarpophalangeal joints than at the wrist, elbow, and knee (data not shown; as described12).
Agreement was poor between the 50% sonographic reference standard and the 50% clinical reference standard when power Doppler was used alone (kappa = 0.37): the proportion of patients who were grade 1 or above by both B-mode and power Doppler was similar to the proportion who were grade 1 or above by power Doppler only.
Effect on DAS28 value
Using B-mode plus power Doppler sonography (with either grade 1 or grade 2), all patients had a lower 28-joint count by sonography than by clinical examination. Using B-mode only and grade 1, all patients had higher 28-joint counts by sonography than by clinical examination. Using B-mode only and grade 2, the differences with the clinical examination were smaller, although they remained substantial. The mean difference between the highest and lowest SJC in a given patient was 12, and the maximum difference was 15 (2 vs 17; Table 5). The mean DAS28 variation due to the SJC differences was 0.59, and the maximum variation was 0.92 (Table 6).
DISCUSSION
A measurement tool must be valid, reproducible, and sensitive to change13. The SJC is considered a valid indicator of RA activity, based on a small number of published studies. The SJC has both criterion validity (e.g., correlation with bone erosions) and construct validity (e.g., correlation with acute-phase proteins)14–17. The SJC is included in the core set of disease activity measures developed by the European League Against Rheumatism18.
Studies of the SJC have shown good or excellent intraobserver reproducibility19,20, usually with lower interobserver reproducibility18,21. However, we found that both intra- and interobserver reproducibility ranged from poor to good. This discrepancy may be ascribable to the large number of examiners in our study and to the considerable differences in their levels of experience and background training.
In France, in routine practice, rheumatologists evaluate the SJC. Nurses may determine the SJC with similar intraobserver reproducibility. Generally, for both routine practice and research studies, sonographers undergo specific training, whereas clinicians do not. In a preliminary evaluation of the duration of the clinical and sonographic examinations, we found that the clinicians never spent more than 5 minutes determining the SJC, whereas the sonographers sometimes wanted more than 15 minutes. For this study, we therefore chose to evaluate the reproducibility of SJC evaluations and we deliberately invited clinicians who came from various cities and who differed in their experience. Training the clinicians, or not training the sonographers, might have modified our results. Thus, our study did not compare clinical and sonographic examination in routine practice. Instead, we compared reliability across clinicians in routine practice, reliability across sonographers after standardization of the scanning technique and synovitis grading system, and reliability between clinicians in routine practice and sonographers after standardization of the sonographic synovitis evaluation. We did not evaluate routine practice of sonographers, because no clear ultrasound definition of synovitis has been published (for example, in routine practice, some sonographers may define synovitis as B-mode grade 2 without power Doppler abnormalities, whereas others may use different definitions).
Another limitation of the study may be the small number of patients. However, studying a larger number would be challenging, as each patient underwent 28 examinations on the same day (14 sonograms and 14 clinical examinations) and each clinician and sonographer performed 14 evaluations.
We did not assess sensitivity to change, because the study was conducted on a single day. SJC varied widely across clinical examiners in the same patients. One possible explanation is that the examiners may have used various examination methods, as there was no training session before the study. Further, our results emphasize the effects of experience and skill on the assessment of synovitis. The occupational therapist was more used to examining the hands than the other joints. The rheumatologist who had the most years of experience was also best at grading doubtful synovitis.
The SJC variations led to variations in DAS28 score values. The greatest difference was 0.92 points. These DAS28 variations may affect treatment decisions. Nevertheless, the SJC contributes only half as much to the DAS28 value as the TJC (coefficient 0.28 vs 0.56). As a result, the DAS28 differences across examiners would have affected the treatment decision (to start a TNF antagonist or switch TNF antagonists) in only 1 patient.
Most studies comparing clinical and sonographic joint assessments have shown fair agreement22–25, with greater sensitivity of sonography for detecting synovitis26. Our study confirms these results. Nevertheless, agreement was good when the SJC determined by a good clinical examiner was compared to the SJC determined by a good sonographer using B-mode only. B-mode imaging assesses only the degree of synovial membrane hypertrophy and the presence of joint effusions, whereas power Doppler investigates the blood supply, which reflects the degree of inflammation. Clinical examination of the joints assesses synovial membrane hypertrophy and joint effusion, similar to B-mode sonography.
We recently evaluated the clinimetric properties of various sonography scoring systems, and found that sonography was at least as good as the clinical scores (intraobserver reliability range 0.61–0.97 vs 0.53–0.82; construct validity range 0.76–0.89 vs 0.76–0.88; correlation with C-reactive protein range 0.28–0.34 vs 0.28–0.35; and sensitivity to change range 0.60–1.21 vs 0.96–1.36 for sonography vs clinical scoring systems, respectively)27. These results suggest that sonographic evaluation of synovitis may be at least as relevant as the clinical examination. Further studies are now required to develop optimal scoring systems for monitoring RA patients based on either clinical or sonographic evaluation of the SJC.
Our results were obtained in patients with long-lasting RA, as we excluded patients with early RA. Our feeling was that, in early RA, power Doppler sonography probably detects more joints with synovitis than do clinicians. Similar studies in early RA are needed.
We did not evaluate the clinimetric properties of the VAS score, TJC, or ESR. The reproducibility of these 3 variables may also modify the DAS28 score.
Although the SJC seems less subjective than the TJC or VAS score, it varies across examiners. Our results indicated that intraobserver reproducibility was variable but could be good if the clinical examiner was highly skilled. Interobserver reproducibility was fair. Intra- and interobserver variability may influence DAS28 score values and therefore affect treatment decisions. Finally, although agreement between clinical examination and sonography was poor, close agreement can occur between a good clinical examiner and a good sonographer using B-mode ultrasound only, without power Doppler. Thus, power Doppler sonography clearly provides additional information for the clinician, and its significance for the evaluation of disease activity and treatment decisions requires further evaluation.
Footnotes
- Supported by Abbott France, Paris, France.
- Accepted for publication December 29, 2009.