Original research
Test–retest reliability of outcome measures: data from three trials in radiographic and non-radiographic axial spondyloarthritis
Anne Boel (1), Victoria Navarro-Compán (2) and Désirée van der Heijde (1)
  1. Rheumatology Department, Leiden University Medical Center, Leiden, The Netherlands
  2. Rheumatology Department, La Paz University Hospital, Madrid, Spain
Correspondence to Ms Anne Boel; a.h.e.m.boel{at}lumc.nl

Abstract

Objectives The aim of this study was to assess test–retest reliability of candidate instruments for the mandatory domains of the Assessment of Spondyloarthritis international Society (ASAS)-Outcome Measures in Rheumatology core set for axial spondyloarthritis (axSpA).

Methods Screening and baseline data from COAST-V, COAST-X and RAPID-axSpA were used to evaluate test–retest reliability of each candidate instrument for the mandatory domains (disease activity, pain, morning stiffness, fatigue, physical function, overall functioning and health). A maximum time interval of 28 days between both visits was used for inclusion in this study. Test–retest reliability was assessed by intraclass correlation coefficient (ICC). Bland and Altman plots provided mean difference and 95% limits of agreement, which were used to calculate the smallest detectable change (SDC). Data were analysed for radiographic and non-radiographic axSpA separately.

Results Good reliability was found for Ankylosing Spondylitis Disease Activity Score (ICC 0.79, SDC 0.6), C reactive protein (ICC 0.72–0.79, SDC 12.3–17.0), Bath Ankylosing Spondylitis Functional Index (ICC 0.87, SDC 1.1) and 36-item Short-Form Health Survey (ICC Physical Component Summary 0.81, SDC 4.7, Mental Component Summary 0.80, SDC 7.3). Moderate reliability was found for Bath Ankylosing Spondylitis Disease Activity Index (ICC 0.72, SDC 1.1), patient global assessment (ICC 0.58, SDC 1.5), total back pain (ICC 0.64, SDC 1.3), back pain at night (ICC 0.67, SDC 1.3), morning stiffness (ICC 0.52–0.63, SDC 1.5–2.2), fatigue (ICC 0.65, SDC 1.3) and ASAS-Health Index (ICC 0.74, SDC 2.5). Reliability and SDC for the radiographic and non-radiographic axSpA subgroups were similar.

Conclusion Overall reliability was good, and comparable levels of reliability were found for patients with radiographic and non-radiographic axSpA, even though most instruments were developed for radiographic axSpA. Composite measures showed higher reliability than single-item measures in assessing disease activity in patients with axSpA.

  • spondylitis, ankylosing
  • epidemiology
  • patient reported outcome measures

Data availability statement

Data may be obtained from a third party and are not publicly available. Data for this study were kindly provided by Eli Lilly and Company and UCB Pharma; we refer any interested parties to these companies.

This is an open access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 4.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made indicated, and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/4.0/.

Key messages

What is already known about this subject?

  • Most instruments used to assess effectiveness of treatment in axial spondyloarthritis were developed for and validated in patients with radiographic axial spondyloarthritis.

What does this study add?

  • Overall reliability of the investigated instruments was good for all patients with axial spondyloarthritis (ie, radiographic and non-radiographic).

  • Smallest detectable change of the investigated instruments was comparable between patients with radiographic and non-radiographic axial spondyloarthritis.

How might this impact on clinical practice or further developments?

  • Though most instruments were developed for radiographic axial spondyloarthritis, they are also reliable for non-radiographic axial spondyloarthritis.

Introduction

Uniformity in reporting primary outcomes of clinical trials allows for a direct comparison between studies investigating different therapies in the same patient population. Herein, there is an essential role for core outcome sets (COS), which contain the mandatory outcomes (domains) that should be assessed and reported as a minimum in all trials.1 2 Over time, new instruments to assess these domains may be developed and more data may become available regarding measurement properties of already existing instruments, underlining the need to periodically review COS. Currently, the Assessment of Spondyloarthritis international Society (ASAS) is working on an update of the original ASAS/Outcome Measures in Rheumatology (OMERACT) core set for ankylosing spondylitis (AS), of which the domains have been selected and endorsed.3 4 An important aspect that led to this decision was that AS belongs to a broader disease spectrum, axial spondyloarthritis (axSpA), which includes two forms (that can also be regarded as two stages) of the same disease: radiographic axSpA (r-axSpA, traditionally known as AS, that is, axSpA with definite sacroiliitis according to the modified New York (mNY) criteria5) and non-radiographic axSpA (nr-axSpA, that is, axSpA without definite sacroiliitis on radiographs6). Even though both nr-axSpA and r-axSpA are now considered part of the same disease spectrum, most instruments used to assess effectiveness of treatment were developed for and tested only in patients with r-axSpA.

The updated COS should be applicable to all patients with axSpA. Therefore, all instruments should have good psychometric properties for patients in both disease subgroups (ie, r-axSpA and nr-axSpA) to be included as mandatory instruments.1 2 The psychometric properties include truth (domain match, face and content validity), feasibility, construct validity and discrimination (test–retest reliability, responsiveness, clinical trial discrimination and thresholds of meaning).7 In this manuscript, we evaluate only one aspect in detail, namely test–retest reliability. Reliability is an important psychometric property, as it informs users whether the same result will be obtained if assessed twice in a situation where there is no change. Hence, the aim of this study was to assess test–retest reliability of the candidate instruments for the selected mandatory domains of the core outcome set that should be assessed in all trials evaluating a new treatment in patients with r-axSpA and nr-axSpA.4

Methods

Study population

For this study, we used screening and baseline data from three large samples in axSpA: data from COAST-V and COAST-X (initiated by Eli Lilly and Company and registered with ClinicalTrials.gov as NCT02696785 and NCT02757352 respectively) and RAPID-axSpA (initiated by UCB Pharma and registered with ClinicalTrials.gov as NCT01087762). These randomised controlled trials (RCTs) are described in detail elsewhere.8–10 In brief, all RCTs included patients aged ≥18 years who fulfilled ASAS criteria for axSpA11 and had an inadequate response to nonsteroidal anti-inflammatory drugs (NSAIDs) or a history of intolerance to NSAIDs. COAST-V included patients with r-axSpA8 (ie, with sacroiliitis according to the mNY criteria5) while COAST-X included patients with nr-axSpA9; and RAPID-axSpA comprised patients with either r-axSpA or nr-axSpA.10 As these patients were entering an RCT, they needed to have active disease at screening and baseline, defined as a Bath Ankylosing Spondylitis Disease Activity Index (BASDAI)12 score of ≥4 and total back pain in the past week ≥4 (on a 0–10 Numeric Rating Scale (NRS)).

Outcomes

The ASAS-OMERACT core domain set for axSpA4 describes the domains that should be measured in axSpA trials investigating symptom-modifying and disease-modifying therapies. Seven domains are mandatory in all axSpA trials: disease activity, pain, morning stiffness, fatigue, physical function, overall functioning and health, and adverse events. Information from all the instruments (n=13) employed to assess these domains, with the exception of adverse events, at both screening and baseline in COAST-V, COAST-X and RAPID-axSpA was used to evaluate test–retest reliability of each instrument.

Four instruments that could be used to assess the domain disease activity were available: the Ankylosing Spondylitis Disease Activity Score (ASDAS), specifically ASDAS-C reactive protein (CRP),13 the BASDAI using NRS answer modalities,12 the patient global assessment (PtGA) using an NRS14 and CRP, measured in mg/L. Two of the instruments used to assess pain were available: 0–10 NRS for total back pain in the past week and 0–10 NRS for pain at night in the past week.14 Questions 5 (How would you describe the overall level of morning stiffness you have had from the time you wake up?) and 6 (How long does your morning stiffness last from the time you wake up?) of the BASDAI and a composite score of questions 5 and 6 ((Q5 + Q6)/2) were the instruments available to evaluate morning stiffness. The one instrument available to estimate fatigue was question 1 of the BASDAI. To evaluate physical function, one instrument was present: the Bath Ankylosing Spondylitis Functional Index (BASFI).15 Two of the instruments that could survey overall functioning and health were available: the ASAS-Health Index (ASAS-HI)16 and Medical Outcomes Study 36-item Short-Form Health Survey (SF-36).17 All these instruments are commonly used in trials assessing treatment effect in axSpA and have shown content, face and construct validity.18
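As a small illustration of the composite morning stiffness score described above, the following sketch averages BASDAI questions 5 and 6. The function name and the check that both items lie on a 0–10 scale are assumptions made for this example, not part of the trial protocols.

```python
def morning_stiffness_composite(q5: float, q6: float) -> float:
    """Composite morning stiffness score, (Q5 + Q6) / 2.

    q5: severity of morning stiffness (BASDAI question 5)
    q6: duration of morning stiffness (BASDAI question 6)
    Both items are assumed here to be scored on a 0-10 scale.
    """
    for name, value in (("q5", q5), ("q6", q6)):
        if not 0 <= value <= 10:
            raise ValueError(f"{name} must lie between 0 and 10, got {value}")
    return (q5 + q6) / 2


# Hypothetical patient: severity 6, duration 4
print(morning_stiffness_composite(6, 4))
```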

Spinal mobility was considered an important but optional domain in the axSpA ASAS/OMERACT domain core set.4 Nonetheless, it was included in this study as it is often assessed in clinical trials and daily practice. One composite instrument and two additional single measures that can be used to evaluate spinal mobility were evaluated: the Bath Ankylosing Spondylitis Metrology Index (BASMI) linear19 (including modified Schober, lateral spinal flexion, tragus-to-wall distance, cervical rotation, intermalleolar distance) and chest expansion and occiput-to-wall distance.14

Statistical analyses

Test–retest reliability was assessed by intraclass correlation coefficient (ICC) (two-way random effect model with absolute agreement20 21). An ICC >0.9 was an indication of excellent reliability, >0.75 to 0.9 of good reliability, 0.5 to 0.75 of moderate reliability and ICC <0.5 of poor reliability.21 Bland and Altman plots were created for each instrument to assess mean difference and 95% limits of agreement and to evaluate homoscedasticity. Measurement error, expressed in the units of the scale, was assessed by analysing the smallest detectable change (SDC) based on the 95% limits of agreement using the formula: SDC = 1.96 × SD of the differences between the two assessments/(√2 × √2).22 The SDC corresponds to the minimum change beyond measurement error that can be detected in an individual patient over time with 95% likelihood. Calculation of the limits of agreement (and the SDC) assumed that the differences were homoscedastic.
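The ICC model and the interpretation thresholds above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes the single-measure form of the two-way random effects, absolute agreement ICC (often written ICC(2,1)), since the text does not state whether single or average measures were used. The function names and the example scores are hypothetical.

```python
from statistics import mean


def icc_absolute_agreement(test, retest):
    """Single-measure ICC, two-way random effects, absolute agreement."""
    rows = list(zip(test, retest))          # one row per patient
    n, k = len(rows), 2                     # n patients, k = 2 assessments
    grand = mean(list(test) + list(retest))
    row_means = [mean(r) for r in rows]
    col_means = [mean(test), mean(retest)]

    # Two-way ANOVA sums of squares
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                 # between-patient mean square
    msc = ss_cols / (k - 1)                 # between-assessment mean square
    mse = ss_error / ((n - 1) * (k - 1))    # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def classify_icc(icc):
    """Thresholds as stated in the text above."""
    if icc > 0.9:
        return "excellent"
    if icc > 0.75:
        return "good"
    if icc >= 0.5:
        return "moderate"
    return "poor"


# Hypothetical scores: retest is shifted by exactly 1 point for everyone.
screening = [1, 2, 3, 4, 5]
baseline = [2, 3, 4, 5, 6]
icc = icc_absolute_agreement(screening, baseline)
print(round(icc, 3), classify_icc(icc))
```

In this made-up example the two assessments are perfectly correlated, yet the ICC is below 1 because the absolute agreement definition penalises the systematic shift between visits.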

In this study, we operated under an a priori assumption underlying the test–retest experiments, namely that in truth the scores for all instruments do not change over the limited period of time between assessments (ie, there is no systematic error). This assumption of no change is supported by the Bland and Altman plots, which demonstrated that the mean difference between test and retest was always (very close to) zero, indicating that the no systematic error assumption holds.
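The Bland and Altman quantities and the SDC formula from the statistical analyses can be sketched as below. This is an illustration under stated assumptions: the SD in the formula is read as the sample SD of the paired differences (baseline minus screening), and the function names and example scores are hypothetical rather than taken from the trials.

```python
from math import sqrt
from statistics import mean, stdev


def bland_altman(test, retest):
    """Return mean difference, 95% limits of agreement and SD of differences."""
    diffs = [b - a for a, b in zip(test, retest)]
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)                  # sample SD of paired differences
    limits = (mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff)
    return mean_diff, limits, sd_diff


def smallest_detectable_change(sd_diff):
    """SDC = 1.96 x SD of the differences / (sqrt(2) x sqrt(2))."""
    return 1.96 * sd_diff / (sqrt(2) * sqrt(2))


# Hypothetical 0-10 NRS scores at screening and baseline
screening = [4, 5, 6, 7, 8]
baseline = [5, 4, 7, 6, 8]

mean_diff, limits, sd_diff = bland_altman(screening, baseline)
print("mean difference:", mean_diff)        # near zero: no systematic error
print("95% limits of agreement:", limits)
print("SDC:", smallest_detectable_change(sd_diff))
```

A mean difference close to zero is what the no-change assumption predicts; a clearly non-zero value would point to a systematic shift between screening and baseline.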

As there was a large variation in the number of days between screening and baseline assessments in both datasets, it was decided to use a maximum time interval of 28 days between both visits as a cut-off for inclusion in this study.
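The 28-day inclusion rule can be expressed as a simple filter on visit dates. The sketch below is illustrative only: the record layout (a dictionary with `screening_date` and `baseline_date`) and the patient identifiers are assumptions, not the structure of the trial datasets.

```python
from datetime import date

MAX_INTERVAL_DAYS = 28  # cut-off stated in the text


def within_interval(screening_date, baseline_date, max_days=MAX_INTERVAL_DAYS):
    """True if baseline falls 0 to max_days days after screening."""
    interval = (baseline_date - screening_date).days
    return 0 <= interval <= max_days


# Hypothetical visit records
visits = [
    {"id": "P01", "screening_date": date(2020, 1, 1), "baseline_date": date(2020, 1, 29)},
    {"id": "P02", "screening_date": date(2020, 1, 1), "baseline_date": date(2020, 1, 30)},
    {"id": "P03", "screening_date": date(2020, 1, 1), "baseline_date": date(2020, 1, 10)},
]

included = [v["id"] for v in visits
            if within_interval(v["screening_date"], v["baseline_date"])]
print(included)
```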

Unfortunately, in the RAPID-axSpA dataset the PtGA was only assessed at baseline, and the baseline values were used to calculate ASDAS both at screening and baseline. As the ASDAS is calculated from the PtGA, questions 2, 3 and 6 from the BASDAI and CRP,13 the results of this dataset should be interpreted with caution, as variability in patient global was not considered and as a result the reliability of the ASDAS may be artificially improved. However, the values in the COAST trials were very similar.

Results were bundled per domain and presented for all axSpA patients, followed by information per disease subgroup (ie, r-axSpA and nr-axSpA). Data from both COAST datasets were combined to assess test–retest reliability of the instruments in axSpA patients.

Results

A total of 341 r-axSpA patients in the COAST-V dataset, 302 nr-axSpA patients in the COAST-X dataset and 326 patients (177 r-axSpA and 149 nr-axSpA) in the RAPID-axSpA dataset had data available at screening and baseline. From these, 104 r-axSpA patients from COAST-V, 104 nr-axSpA patients from COAST-X and 221 patients from RAPID-axSpA (119 r-axSpA and 102 nr-axSpA) who had both measurements for at least one of the assessed instruments within a time frame of 28 days were included in this analysis.

Of the included r-axSpA patients from COAST-V, 81% were male, median (IQR) age was 39 (34–47) and mean (SD) symptom duration was 15.1 (9.9) years. The selection of nr-axSpA patients from COAST-X included 55% male patients, with a median age of 38 (27–49) and mean symptom duration of 9.9 (8.8) years. In RAPID-axSpA 62% of the included patients were male (74% in r-axSpA, 49% in nr-axSpA), the median age range was 31–35 years (46–50 in r-axSpA, 31–35 in nr-axSpA) and mean symptom duration was 6.0 (6.9) years (7.4 (7.6) in r-axSpA, 4.3 (5.6) in nr-axSpA).

The mean symptom duration in the patient selection included in this study was somewhat shorter than the mean symptom duration of the entire study populations (COAST-V 16.1 (10.9); COAST-X 10.7 (9.7); RAPID-axSpA 6.7 (7.4)). Median age and the percentage of female patients were similar to the original study populations.8–10

The number of days between assessments ranged between 8 and 28 days in COAST-V, between 9 and 28 days in COAST-X and between 2 and 28 days in RAPID-axSpA; the mean (SD) number of days between assessments was 22 (5) in COAST-V, 21 (5) in COAST-X and 18 (7) days in RAPID-axSpA. The proportion of missing data varied somewhat between measurements and datasets, but was always very small (<5%). Participants with missing data for an instrument at either screening or baseline were excluded from analysis for that specific instrument. The number of observations available per instrument is provided in table 1. Information available from the literature regarding reliability of the instruments included in the current study is also presented in table 1.23–36
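The per-instrument handling of missing data described above amounts to keeping only patients with a value at both visits for the instrument being analysed. A minimal sketch, assuming a hypothetical record layout in which each patient has a `screening` and a `baseline` dictionary keyed by instrument name:

```python
def complete_pairs(records, instrument):
    """Return (screening, baseline) pairs for patients with both values present."""
    pairs = []
    for rec in records:
        s = rec["screening"].get(instrument)
        b = rec["baseline"].get(instrument)
        if s is not None and b is not None:
            pairs.append((s, b))
    return pairs


# Hypothetical patients: P02 lacks a screening BASFI, P03 lacks a baseline BASDAI
records = [
    {"id": "P01", "screening": {"BASDAI": 6.1, "BASFI": 5.0},
     "baseline": {"BASDAI": 5.8, "BASFI": 5.2}},
    {"id": "P02", "screening": {"BASDAI": 7.0, "BASFI": None},
     "baseline": {"BASDAI": 6.5, "BASFI": 4.9}},
    {"id": "P03", "screening": {"BASDAI": 4.4, "BASFI": 3.1},
     "baseline": {"BASFI": 3.0}},
]

print(len(complete_pairs(records, "BASDAI")))  # patients usable for BASDAI
print(len(complete_pairs(records, "BASFI")))   # patients usable for BASFI
```

Because exclusion is applied per instrument, the number of analysable patients can differ from one instrument to the next.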

Table 1

Test–retest data of assessed instruments in COAST (combined data COAST-V & COAST-X) and RAPID-axSpA, 28-day interval

Table 2

Test–retest data of spinal mobility instruments measured in RAPID-axSpA, 28-day interval

Detailed results from all trials and subgroups are provided in tables 1 and 2. In the text, reliability per domain is described only for the total axSpA group in the COAST datasets, as these included most instruments. Reliability for individual subgroups or trials is discussed only where it varied considerably.

Regarding the four instruments assessing disease activity: good reliability was found for ASDAS (ICC 0.79, SDC 0.6) and CRP in COAST (ICC 0.79, SDC 12.3), whereas reliability for CRP in the RAPID-axSpA dataset was slightly lower (ICC 0.72, SDC 17.0) (table 1). Reliability was moderate for BASDAI (ICC 0.72, SDC 1.1); and for the PtGA reliability was moderate (ICC 0.58, SDC 1.5) too, except for the r-axSpA group, for which reliability was poor (ICC 0.48, SDC 1.6). The two instruments used to evaluate pain showed moderate reliability (NRS total back pain (ICC 0.64, SDC 1.3); NRS back pain at night (ICC 0.67, SDC 1.3)). Moderate reliability was found for the instruments used to assess morning stiffness (ICC 0.52–0.63, SDC 1.5–2.2) as well. The instrument used to determine fatigue showed moderate reliability (ICC 0.65, SDC 1.3). The data showed good reliability (ICC 0.87, SDC 1.1) for the BASFI, used to measure physical function. For the two instruments used to survey overall functioning and health, good reliability was found for the Physical Component Summary (ICC 0.81, SDC 4.7) and Mental Component Summary (ICC 0.80, SDC 7.3) subscales of the SF-36, and the ASAS-HI had moderate reliability (ICC 0.74, SDC 2.5), except for the nr-axSpA subgroup in which reliability was good (ICC 0.77, SDC 2.5). In the domain spinal mobility, reliability was excellent (ICC 0.93, SDC 0.6) for BASMI in RAPID-axSpA. Tragus-to-wall and occiput-to-wall distance showed excellent reliability, except for the nr-axSpA subpopulation, for which the reliability was good. For all other mobility measures reliability was good (table 2).37–43

Bland and Altman plots showed a reasonably homoscedastic variation for all measurement instruments, with the exception of CRP where the variation was more pronounced in the lower end of the range (online supplemental figures 1–27).

Discussion

The results from this study showed that the test–retest reliability of the investigated instruments was moderate to excellent and similar in the axSpA group and each of the disease subgroups r-axSpA and nr-axSpA. Furthermore, for those instruments where data were available from the COAST and RAPID-axSpA studies, levels of reliability were comparable between datasets as well. Finally, we found ICCs were higher for multi-item instruments compared with single-item instruments in the same domain. This is reasonable as the impact of variance caused by measurement error in the individual items of a multi-item instrument is reduced when they are combined into a single score, resulting in a more precise score for a multi-item instrument compared with its single-item counterparts.44 45

For all instruments assessed in this study, ICCs were somewhat lower than those previously reported in the literature, with the exception of the spinal mobility measures. This is not unexpected as all patients included in this study had high disease activity, which resulted in less variability in scores between patients for the investigated instruments (eg, BASDAI and total back pain had a possible range of 4–10 instead of 0–10). It has been shown that reduced variability in scores decreases ICCs when the number of observations and the measurement error are unchanged.21 46 This might explain why for almost all measurement instruments the reliability found in this study was somewhat lower than that reported previously. Other characteristics, such as the proportion of female patients, age and symptom duration of the patients included in this study, were comparable to the populations included in previous studies investigating reliability.23 25 27 29 30 32–35 The decreased variability in scores has an opposite effect on the SDCs, as the mean difference between two assessments (and its SD) is expected to be smaller when the scoring range is reduced; this applies to scores between patients as well as between two measurements within the same patient. An SDC represents the minimum change that can be observed reliably given the measurement error. This can be compared with a minimal clinically important improvement (MCII, defined in relation to an external standard for an individual patient) and minimal clinically important difference (MCID, defined by an external standard between (groups of) patients). We compared the observed SDCs with the published SDCs, MCIIs and MCIDs in the literature. The SDCs for ASDAS found in this study were indeed lower than the MCII defined in the literature,29 while SDCs for BASDAI, PtGA and BASFI found in these datasets were similar to the previously reported MCIIs.27 33 Based on the data analysed in this study, we can conclude that ASDAS has the best reliability and smallest SDC of the instruments used to assess disease activity.
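The effect of a restricted score range on the ICC can be illustrated with a small simulation. Everything in it is an assumption chosen for illustration, not trial data: true scores are drawn uniformly from either 0–10 or 4–10, each assessment adds independent normal measurement error with SD 1, and simulated scores are not clipped to the scale. The ICC helper repeats the single-measure, two-way random effects, absolute agreement form.

```python
import random
from statistics import mean


def icc_absolute_agreement(test, retest):
    """Single-measure ICC, two-way random effects, absolute agreement."""
    rows = list(zip(test, retest))
    n, k = len(rows), 2
    grand = mean(list(test) + list(retest))
    ss_rows = k * sum((mean(r) - grand) ** 2 for r in rows)
    ss_cols = n * sum((m - grand) ** 2 for m in (mean(test), mean(retest)))
    ss_total = sum((x - grand) ** 2 for r in rows for x in r)
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))
    msr = ss_rows / (n - 1)
    msc = ss_cols / (k - 1)
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def simulate_icc(low, high, n=2000, error_sd=1.0, seed=1):
    """ICC for simulated test-retest data with true scores in [low, high]."""
    rng = random.Random(seed)               # fixed seed for reproducibility
    true = [rng.uniform(low, high) for _ in range(n)]
    test = [t + rng.gauss(0, error_sd) for t in true]
    retest = [t + rng.gauss(0, error_sd) for t in true]
    return icc_absolute_agreement(test, retest)


full_range = simulate_icc(0, 10)            # whole 0-10 scale
restricted = simulate_icc(4, 10)            # only scores of 4 and above
print(round(full_range, 2), round(restricted, 2))
```

With identical measurement error in both scenarios, the restricted sample yields the lower ICC, because the between-patient variance shrinks while the error variance stays the same.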

For total back pain and pain at night in the past week, SDCs were smaller than the MCID defined in the literature,34 and ICCs were comparable for both instruments. The data for the fatigue and stiffness questions of the BASDAI were inconclusive. In the COAST-X and COAST-V datasets SDCs were similar to the reported MCIDs.34 47–49 Conversely, measurement error in the RAPID-axSpA dataset was somewhat larger, complicating detection of the MCID. Comparing the ICCs and SDCs of the various instruments used to assess morning stiffness in the COAST datasets, duration of morning stiffness seems slightly less reliable compared with severity of morning stiffness and the composite score. Finally, the SDC for the ASAS-HI was slightly smaller than previously reported,25 which could be the result of the aforementioned limited range in disease activity in the current study populations. Compared with the SF-36, the SDC of the ASAS-HI was higher (12% vs 5%–7% of the total score range) and the ICC slightly lower, indicating the SF-36 might have better reliability. However, the ASAS-HI is a disease-specific instrument, whereas the SF-36 is a general instrument, thus other measurement properties are vital for a final conclusion. Before a definite decision can be made regarding which instrument is best to assess each domain, the other measurement properties will have to be collected too.

This study used data from three recent trials in axSpA, which ensured all instruments currently used in clinical trials were represented. All patients included in these datasets had active disease and were candidates to receive a disease-modifying therapy, which matches the target group of the ASAS-OMERACT core outcome set.4 As the core outcome set will be used in clinical trials assessing the effect of treatment in axSpA and RCTs in principle require patients with active disease, the data from this study provide valuable information on the reliability of measurement instruments in this patient group. Furthermore, an equal number of patients with r-axSpA and nr-axSpA were included, thereby representing all patients with axSpA disease. Nonetheless, there were limitations to this study, the most important one being the relatively long time interval used to ensure the sample sizes would be large enough, which might explain some of the differences found between the literature and the results in this study. Based on the data from this study and information available in the literature, ASDAS, BASDAI, PtGA and CRP are reliable measures to assess disease activity in all patients with axSpA; both total back pain and pain at night in the past week could be considered reliable in assessing pain; questions 5 and 6 of the BASDAI can be used to reliably assess morning stiffness; BASDAI question 1 can reliably evaluate fatigue; BASFI was found reliable to investigate physical functioning; ASAS-HI and SF-36 were found reliable to survey overall functioning and health; and BASMI and its components as well as chest expansion can be used to reliably assess spinal mobility. Further research will have to focus on collecting information on the other psychometric properties before a definite decision can be made regarding the best instrument for each domain.

Conclusion

The results from this study showed overall reliability was good and levels of reliability were comparable for patients with r-axSpA and nr-axSpA, indicating ASDAS, BASDAI, PtGA, CRP, NRS total back pain, NRS back pain at night, BASFI, ASAS-HI, SF-36 and BASMI are reliable measures for all patients with axSpA, even though most instruments were developed for r-axSpA. Composite measures showed higher reliability than single-item measures in assessing disease activity and spinal mobility in patients with axSpA and may therefore be preferred over single-item instruments for this aspect of the OMERACT filter.

Ethics statements

Patient consent for publication

Ethics approval

Independent Ethics Committees or Institutional Review Boards at participating sites approved the COAST-V, COAST-X and RAPID-axSpA studies; for more details we kindly refer to the original publications. All three trials were performed in accordance with the Good Clinical Practice guidelines and the Declaration of Helsinki, and included patients provided written informed consent prior to inclusion in the respective trials.

Acknowledgments

This publication is based on research using data from UCB Pharma that has been made available through Vivli. Vivli has not contributed to or approved, and is not in any way responsible for, the contents of this publication. Eli Lilly and Company (Indianapolis, IN, USA) provided the data from COAST-V and COAST-X used in this study and supported this study.

References

Supplementary materials

  • Supplementary Data

    This web only file has been produced by the BMJ Publishing Group from an electronic file supplied by the author(s) and has not been edited for content.

Footnotes

  • Contributors All authors were involved in the planning, conduct and reporting of the work presented in this manuscript. AB accepts full responsibility for the finished work, had access to the data, and controlled the decision to publish. Data for this study were kindly provided by Eli Lilly and Company and UCB Pharma, without whom this study would not have been possible.

  • Funding The Assessment of Spondyloarthritis international Society (ASAS) funded Anne Boel and Victoria Navarro-Compán for the project to update the core outcome set. COAST-V and COAST-X were funded by Eli Lilly and Company and the RAPID-axSpA study was funded by UCB Pharma.

  • Competing interests DvdH has received consulting fees from AbbVie, Amgen, Astellas, AstraZeneca, BMS, Boehringer Ingelheim, Celgene, Daiichi, Eli Lilly and Company, Galapagos, Gilead, GlaxoSmithKline, Janssen, Merck, Novartis, Pfizer, Regeneron, Roche, Sanofi, Takeda, and UCB Pharma and is director of Imaging Rheumatology BV. JC-CW has served as a consultant for Eli Lilly and Company, Pfizer, Celgene, Chugai, UCB Pharma, and TSH Taiwan; has received research grants from Bristol-Myers Squibb, Eli Lilly and Company, Janssen, Pfizer, Sanofi-Aventis, and Novartis; and has served on a speakers bureau for Abbott, Bristol-Myers Squibb, Chugai, Eisai, Janssen, and Pfizer. VN-C has received honoraria/research support from: Abbvie, BMS, Janssen, Eli Lilly, MSD, Novartis, Pfizer, Roche and UCB. AB has no competing interest to report.

  • Provenance and peer review Not commissioned; externally peer reviewed.

  • Supplemental material This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.
